diff --git a/src/java/com/twitter/search/README.md b/src/java/com/twitter/search/README.md deleted file mode 100644 index f92a9bdf3..000000000 --- a/src/java/com/twitter/search/README.md +++ /dev/null @@ -1,50 +0,0 @@ -# Tweet Search System (Earlybird) -> **TL;DR** Tweet Search System (Earlybird) finds tweets from people you follow, ranks them, and serves the tweets to Home. - -## What is Tweet Search System (Earlybird)? -[Earlybird](http://notes.stephenholiday.com/Earlybird.pdf) is a **real-time search system** based on [Apache Lucene](https://lucene.apache.org/) that supports a high volume of queries and content updates. The major use cases are Relevance Search (specifically, Text search) and Timeline In-network Tweet retrieval (or UserID based search). It is designed to enable the efficient indexing and querying of billions of tweets, and to provide low-latency search results, even under heavy query loads. - -## How it relates to the Home Timeline Recommendation Algorithm - -![in-network](img/in-network.png) - -At Twitter, we use Tweet Search System (Earlybird) to do Home Timeline In-network Tweet retrieval: given a list of followed users, find their recently posted tweets. Earlybird (Search Index) is the major candidate source for in-network tweets across the Following tab and the For You tab. - - -## High-level architecture -We split our entire tweet search index into three clusters: a **realtime** cluster indexing all public tweets posted in about the last 7 days, a **protected** cluster indexing all protected tweets for the same timeframe, and an **archive** cluster indexing all tweets ever posted, up to about two days ago. - -Earlybird addresses the challenges of scaling real-time search by splitting each cluster across multiple **partitions**, each responsible for a portion of the index. The architecture uses a distributed *inverted index* that is sharded and replicated. This design allows for efficient index updates and query processing. - -The system also employs an incremental indexing approach, enabling it to process and index new tweets in real time as they arrive. With a single-writer, multiple-reader structure, Earlybird can handle a large number of real-time updates and queries concurrently while maintaining low query latency. The system achieves high query throughput and low query latency while maintaining a high degree of index freshness. - - -### Indexing -* Ingesters read tweets and user modifications from Kafka topics, extract fields and features from them, and write the extracted data to intermediate Kafka topics for Earlybirds to consume, index and serve. -* Feature Update Service feeds feature updates such as up-to-date engagement (like, retweet, reply) counts to Earlybird. -![indexing](img/indexing.png) - -### Serving
-Earlybird roots fan out requests to different Earlybird clusters or partitions. Upon receiving responses from the clusters or partitions, the roots merge the responses before finally returning the merged response to the client. -![serving](img/serving.png) - -## Use cases - -1. Tweet Search - * Top search - * Latest search - -![top](img/top-search.png) - -2. 
Candidate generation - * Timeline (For You Tab, Following Tab) - * Notifications - -![home](img/foryou.png) - -## References -* "Earlybird: Real-Time Search at Twitter" (http://notes.stephenholiday.com/Earlybird.pdf) -* "Reducing search indexing latency to one second" (https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second) -* "Omnisearch index formats" (https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats) - - diff --git a/src/java/com/twitter/search/common/README.md b/src/java/com/twitter/search/common/README.md deleted file mode 100644 index c7f2e38bb..000000000 --- a/src/java/com/twitter/search/common/README.md +++ /dev/null @@ -1 +0,0 @@ -Contains code that is common to multiple earlybird services (ingesters, roots and earlybird). \ No newline at end of file diff --git a/src/java/com/twitter/search/common/converter/earlybird/BUILD b/src/java/com/twitter/search/common/converter/earlybird/BUILD deleted file mode 100644 index a5d4ea4ae..000000000 --- a/src/java/com/twitter/search/common/converter/earlybird/BUILD +++ /dev/null @@ -1,57 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/joda-time", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/httpcomponents:httpcore", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "decider/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:token-util", - "src/java/com/twitter/common_internal/text:text-penguin7", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/relevance:text", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util:longintconverter", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/lang", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/text/regex", - "src/java/com/twitter/search/common/util/thrift:thrift-utils", - "src/java/com/twitter/search/common/util/url", - "src/java/com/twitter/search/ingester/model", - 
"src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/common/debug:debug-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - ], -) diff --git a/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java b/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java deleted file mode 100644 index afde8a84e..000000000 --- a/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java +++ /dev/null @@ -1,647 +0,0 @@ -package com.twitter.search.common.converter.earlybird; - -import java.io.IOException; -import java.util.Date; -import java.util.List; -import java.util.Optional; -import javax.annotation.concurrent.NotThreadSafe; - -import com.google.common.base.Preconditions; - -import org.apache.commons.collections.CollectionUtils; -import org.joda.time.DateTime; -import org.joda.time.DateTimeZone; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.converter.earlybird.EncodedFeatureBuilder.TweetFeatureWithEncodeFeatures; -import com.twitter.search.common.indexing.thriftjava.Place; -import com.twitter.search.common.indexing.thriftjava.PotentialLocation; -import com.twitter.search.common.indexing.thriftjava.ProfileGeoEnrichment; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.relevance.entities.GeoObject; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.entities.TwitterQuotedMessage; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.tweetypie.thriftjava.ComposerSource; - -/** - * Converts a TwitterMessage into a ThriftVersionedEvents. This is only responsible for data that - * is available immediately when a Tweet is created. Some data, like URL data, isn't available - * immediately, and so it is processed later, in the DelayedIndexingConverter and sent as an - * update. In order to achieve this we create the document in 2 passes: - * - * 1. BasicIndexingConverter builds thriftVersionedEvents with the fields that do not require - * external services. - * - * 2. 
DelayedIndexingConverter builds all the document fields depending on external services, once - * those services have processed the relevant Tweet and we have retrieved that data. - */ -@NotThreadSafe -public class BasicIndexingConverter { - private static final Logger LOG = LoggerFactory.getLogger(BasicIndexingConverter.class); - - private static final SearchCounter NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS = - SearchCounter.export("num_nullcast_feature_flag_set_tweets"); - private static final SearchCounter NUM_NULLCAST_TWEETS = - SearchCounter.export("num_nullcast_tweets"); - private static final SearchCounter NUM_NON_NULLCAST_TWEETS = - SearchCounter.export("num_non_nullcast_tweets"); - private static final SearchCounter ADJUSTED_BAD_CREATED_AT_COUNTER = - SearchCounter.export("adjusted_incorrect_created_at_timestamp"); - private static final SearchCounter INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS = - SearchCounter.export("inconsistent_tweet_id_and_created_at_ms"); - private static final SearchCounter NUM_SELF_THREAD_TWEETS = - SearchCounter.export("num_self_thread_tweets"); - private static final SearchCounter NUM_EXCLUSIVE_TWEETS = - SearchCounter.export("num_exclusive_tweets"); - - // If a tweet carries a timestamp smaller than this timestamp, we consider the timestamp invalid, - // because twitter does not even exist back then before: Sun, 01 Jan 2006 00:00:00 GMT - private static final long VALID_CREATION_TIME_THRESHOLD_MILLIS = - new DateTime(2006, 1, 1, 0, 0, 0, DateTimeZone.UTC).getMillis(); - - private final EncodedFeatureBuilder featureBuilder; - private final Schema schema; - private final EarlybirdCluster cluster; - - public BasicIndexingConverter(Schema schema, EarlybirdCluster cluster) { - this.featureBuilder = new EncodedFeatureBuilder(); - this.schema = schema; - this.cluster = cluster; - } - - /** - * This function converts TwitterMessage to ThriftVersionedEvents, which is a generic data - * structure that can be consumed by Earlybird directly. 
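 * <p>Illustrative usage sketch only; the {@code schema}, {@code cluster}, {@code message} and
 * {@code penguinVersions} values are assumed to be supplied by the caller:
 * <pre>{@code
 * BasicIndexingConverter converter = new BasicIndexingConverter(schema, cluster);
 * ThriftVersionedEvents events = converter.convertMessageToThrift(message, true, penguinVersions);
 * }</pre>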
- */ - public ThriftVersionedEvents convertMessageToThrift( - TwitterMessage message, - boolean strict, - List penguinVersions) throws IOException { - Preconditions.checkNotNull(message); - Preconditions.checkNotNull(penguinVersions); - - ThriftVersionedEvents versionedEvents = new ThriftVersionedEvents() - .setId(message.getId()); - - ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); - - for (PenguinVersion penguinVersion : penguinVersions) { - ThriftDocument document = - buildDocumentForPenguinVersion(schemaSnapshot, message, strict, penguinVersion); - - ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() - .setDocument(document) - .setEventType(ThriftIndexingEventType.INSERT) - .setSortId(message.getId()); - message.getFromUserTwitterId().map(thriftIndexingEvent::setUid); - versionedEvents.putToVersionedEvents(penguinVersion.getByteValue(), thriftIndexingEvent); - } - - return versionedEvents; - } - - private ThriftDocument buildDocumentForPenguinVersion( - ImmutableSchemaInterface schemaSnapshot, - TwitterMessage message, - boolean strict, - PenguinVersion penguinVersion) throws IOException { - TweetFeatureWithEncodeFeatures tweetFeature = - featureBuilder.createTweetFeaturesFromTwitterMessage( - message, penguinVersion, schemaSnapshot); - - EarlybirdThriftDocumentBuilder builder = - buildBasicFields(message, schemaSnapshot, cluster, tweetFeature); - - buildUserFields(builder, message, tweetFeature.versionedFeatures, penguinVersion); - buildGeoFields(builder, message, tweetFeature.versionedFeatures); - buildRetweetAndReplyFields(builder, message, strict); - buildQuotesFields(builder, message); - buildVersionedFeatureFields(builder, tweetFeature.versionedFeatures); - buildAnnotationFields(builder, message); - buildNormalizedMinEngagementFields(builder, tweetFeature.encodedFeatures, cluster); - buildDirectedAtFields(builder, message); - - builder.withSpaceIdFields(message.getSpaceIds()); - - return builder.build(); - } - - /** - * Build the basic fields for a tweet. - */ - public static EarlybirdThriftDocumentBuilder buildBasicFields( - TwitterMessage message, - ImmutableSchemaInterface schemaSnapshot, - EarlybirdCluster cluster, - TweetFeatureWithEncodeFeatures tweetFeature) { - EarlybirdEncodedFeatures extendedEncodedFeatures = tweetFeature.extendedEncodedFeatures; - if (extendedEncodedFeatures == null && EarlybirdCluster.isTwitterMemoryFormatCluster(cluster)) { - extendedEncodedFeatures = EarlybirdEncodedFeatures.newEncodedTweetFeatures( - schemaSnapshot, EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD); - } - EarlybirdThriftDocumentBuilder builder = new EarlybirdThriftDocumentBuilder( - tweetFeature.encodedFeatures, - extendedEncodedFeatures, - new EarlybirdFieldConstants(), - schemaSnapshot); - - builder.withID(message.getId()); - - final Date createdAt = message.getDate(); - long createdAtMs = createdAt == null ? 0L : createdAt.getTime(); - - createdAtMs = fixCreatedAtTimeStampIfNecessary(message.getId(), createdAtMs); - - if (createdAtMs > 0L) { - builder.withCreatedAt((int) (createdAtMs / 1000)); - } - - builder.withTweetSignature(tweetFeature.versionedFeatures.getTweetSignature()); - - if (message.getConversationId() > 0) { - long conversationId = message.getConversationId(); - builder.withLongField( - EarlybirdFieldConstant.CONVERSATION_ID_CSF.getFieldName(), conversationId); - // We only index conversation ID when it is different from the tweet ID. 
- if (message.getId() != conversationId) { - builder.withLongField( - EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName(), conversationId); - } - } - - if (message.getComposerSource().isPresent()) { - ComposerSource composerSource = message.getComposerSource().get(); - builder.withIntField( - EarlybirdFieldConstant.COMPOSER_SOURCE.getFieldName(), composerSource.getValue()); - if (composerSource == ComposerSource.CAMERA) { - builder.withCameraComposerSourceFlag(); - } - } - - EarlybirdEncodedFeatures encodedFeatures = tweetFeature.encodedFeatures; - if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG)) { - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.VERIFIED_FILTER_TERM); - } - if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG)) { - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.BLUE_VERIFIED_FILTER_TERM); - } - - if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG)) { - builder.withOffensiveFlag(); - } - - if (message.getNullcast()) { - NUM_NULLCAST_TWEETS.increment(); - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.NULLCAST_FILTER_TERM); - } else { - NUM_NON_NULLCAST_TWEETS.increment(); - } - if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG)) { - NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS.increment(); - } - if (message.isSelfThread()) { - builder.addFilterInternalFieldTerm( - EarlybirdFieldConstant.SELF_THREAD_FILTER_TERM); - NUM_SELF_THREAD_TWEETS.increment(); - } - - if (message.isExclusive()) { - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.EXCLUSIVE_FILTER_TERM); - builder.withLongField( - EarlybirdFieldConstant.EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF.getFieldName(), - message.getExclusiveConversationAuthorId()); - NUM_EXCLUSIVE_TWEETS.increment(); - } - - builder.withLanguageCodes(message.getLanguage(), message.getBCP47LanguageTag()); - - return builder; - } - - /** - * Build the user fields. - */ - public static void buildUserFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - VersionedTweetFeatures versionedTweetFeatures, - PenguinVersion penguinVersion) { - // 1. Set all the from user fields. - if (message.getFromUserTwitterId().isPresent()) { - builder.withLongField(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), - message.getFromUserTwitterId().get()) - // CSF - .withLongField(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(), - message.getFromUserTwitterId().get()); - } else { - LOG.warn("fromUserTwitterId is not set in TwitterMessage! 
Status id: " + message.getId()); - } - - if (message.getFromUserScreenName().isPresent()) { - String fromUser = message.getFromUserScreenName().get(); - String normalizedFromUser = - NormalizerHelper.normalizeWithUnknownLocale(fromUser, penguinVersion); - - builder - .withWhiteSpaceTokenizedScreenNameField( - EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), - normalizedFromUser) - .withStringField(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), - normalizedFromUser); - - if (message.getTokenizedFromUserScreenName().isPresent()) { - builder.withCamelCaseTokenizedScreenNameField( - EarlybirdFieldConstant.CAMELCASE_USER_HANDLE_FIELD.getFieldName(), - fromUser, - normalizedFromUser, - message.getTokenizedFromUserScreenName().get()); - } - } - - Optional toUserScreenName = message.getToUserLowercasedScreenName(); - if (toUserScreenName.isPresent() && !toUserScreenName.get().isEmpty()) { - builder.withStringField( - EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(), - NormalizerHelper.normalizeWithUnknownLocale(toUserScreenName.get(), penguinVersion)); - } - - if (versionedTweetFeatures.isSetUserDisplayNameTokenStreamText()) { - builder.withTokenStreamField(EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName(), - versionedTweetFeatures.getUserDisplayNameTokenStreamText(), - versionedTweetFeatures.getUserDisplayNameTokenStream()); - } - } - - /** - * Build the geo fields. - */ - public static void buildGeoFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - VersionedTweetFeatures versionedTweetFeatures) { - double lat = GeoUtil.ILLEGAL_LATLON; - double lon = GeoUtil.ILLEGAL_LATLON; - if (message.getGeoLocation() != null) { - GeoObject location = message.getGeoLocation(); - builder.withGeoField(EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), - location.getLatitude(), location.getLongitude(), location.getAccuracy()); - - if (location.getSource() != null) { - builder.withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstants.formatGeoType(location.getSource())); - } - - if (GeoUtil.validateGeoCoordinates(location.getLatitude(), location.getLongitude())) { - lat = location.getLatitude(); - lon = location.getLongitude(); - } - } - - // See SEARCH-14317 for investigation on how much space geo filed is used in archive cluster. - // In lucene archives, this CSF is needed regardless of whether geoLocation is set. 
- builder.withLatLonCSF(lat, lon); - - if (versionedTweetFeatures.isSetTokenizedPlace()) { - Place place = versionedTweetFeatures.getTokenizedPlace(); - Preconditions.checkArgument(place.isSetId(), "Place ID not set for tweet " - + message.getId()); - Preconditions.checkArgument(place.isSetFullName(), - "Place full name not set for tweet " + message.getId()); - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName()); - builder - .withStringField(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName(), place.getId()) - .withStringField(EarlybirdFieldConstant.PLACE_FULL_NAME_FIELD.getFieldName(), - place.getFullName()); - if (place.isSetCountryCode()) { - builder.withStringField(EarlybirdFieldConstant.PLACE_COUNTRY_CODE_FIELD.getFieldName(), - place.getCountryCode()); - } - } - - if (versionedTweetFeatures.isSetTokenizedProfileGeoEnrichment()) { - ProfileGeoEnrichment profileGeoEnrichment = - versionedTweetFeatures.getTokenizedProfileGeoEnrichment(); - Preconditions.checkArgument( - profileGeoEnrichment.isSetPotentialLocations(), - "ProfileGeoEnrichment.potentialLocations not set for tweet " - + message.getId()); - List potentialLocations = profileGeoEnrichment.getPotentialLocations(); - Preconditions.checkArgument( - !potentialLocations.isEmpty(), - "Found tweet with an empty ProfileGeoEnrichment.potentialLocations: " - + message.getId()); - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PROFILE_GEO_FILTER_TERM); - for (PotentialLocation potentialLocation : potentialLocations) { - if (potentialLocation.isSetCountryCode()) { - builder.withStringField( - EarlybirdFieldConstant.PROFILE_GEO_COUNTRY_CODE_FIELD.getFieldName(), - potentialLocation.getCountryCode()); - } - if (potentialLocation.isSetRegion()) { - builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_REGION_FIELD.getFieldName(), - potentialLocation.getRegion()); - } - if (potentialLocation.isSetLocality()) { - builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_LOCALITY_FIELD.getFieldName(), - potentialLocation.getLocality()); - } - } - } - - builder.withPlacesField(message.getPlaces()); - } - - /** - * Build the retweet and reply fields. - */ - public static void buildRetweetAndReplyFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - boolean strict) { - long retweetUserIdVal = -1; - long sharedStatusIdVal = -1; - if (message.getRetweetMessage() != null) { - if (message.getRetweetMessage().getSharedId() != null) { - sharedStatusIdVal = message.getRetweetMessage().getSharedId(); - } - if (message.getRetweetMessage().hasSharedUserTwitterId()) { - retweetUserIdVal = message.getRetweetMessage().getSharedUserTwitterId(); - } - } - - long inReplyToStatusIdVal = -1; - long inReplyToUserIdVal = -1; - if (message.isReply()) { - if (message.getInReplyToStatusId().isPresent()) { - inReplyToStatusIdVal = message.getInReplyToStatusId().get(); - } - if (message.getToUserTwitterId().isPresent()) { - inReplyToUserIdVal = message.getToUserTwitterId().get(); - } - } - - buildRetweetAndReplyFields( - retweetUserIdVal, - sharedStatusIdVal, - inReplyToStatusIdVal, - inReplyToUserIdVal, - strict, - builder); - } - - /** - * Build the quotes fields. 
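 * <p>Quote fields are indexed only when both the quoted status ID and the quoted user ID are positive.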
- */ - public static void buildQuotesFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message) { - if (message.getQuotedMessage() != null) { - TwitterQuotedMessage quoted = message.getQuotedMessage(); - if (quoted != null && quoted.getQuotedStatusId() > 0 && quoted.getQuotedUserId() > 0) { - builder.withQuote(quoted.getQuotedStatusId(), quoted.getQuotedUserId()); - } - } - } - - /** - * Build directed at field. - */ - public static void buildDirectedAtFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message) { - if (message.getDirectedAtUserId().isPresent() && message.getDirectedAtUserId().get() > 0) { - builder.withDirectedAtUser(message.getDirectedAtUserId().get()); - builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.DIRECTED_AT_FILTER_TERM); - } - } - - /** - * Build the versioned features for a tweet. - */ - public static void buildVersionedFeatureFields( - EarlybirdThriftDocumentBuilder builder, - VersionedTweetFeatures versionedTweetFeatures) { - builder - .withHashtagsField(versionedTweetFeatures.getHashtags()) - .withMentionsField(versionedTweetFeatures.getMentions()) - .withStocksFields(versionedTweetFeatures.getStocks()) - .withResolvedLinksText(versionedTweetFeatures.getNormalizedResolvedUrlText()) - .withTokenStreamField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - versionedTweetFeatures.getTweetTokenStreamText(), - versionedTweetFeatures.isSetTweetTokenStream() - ? versionedTweetFeatures.getTweetTokenStream() : null) - .withStringField(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName(), - versionedTweetFeatures.getSource()) - .withStringField(EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName(), - versionedTweetFeatures.getNormalizedSource()); - - // Internal fields for smileys and question marks - if (versionedTweetFeatures.hasPositiveSmiley) { - builder.withStringField( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.HAS_POSITIVE_SMILEY); - } - if (versionedTweetFeatures.hasNegativeSmiley) { - builder.withStringField( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.HAS_NEGATIVE_SMILEY); - } - if (versionedTweetFeatures.hasQuestionMark) { - builder.withStringField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - EarlybirdThriftDocumentBuilder.QUESTION_MARK); - } - } - - /** - * Build the escherbird annotations for a tweet. - */ - public static void buildAnnotationFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message) { - List escherbirdAnnotations = - message.getEscherbirdAnnotations(); - if (CollectionUtils.isEmpty(escherbirdAnnotations)) { - return; - } - - builder.addFacetSkipList(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName()); - - for (TwitterMessage.EscherbirdAnnotation annotation : escherbirdAnnotations) { - String groupDomainEntity = String.format("%d.%d.%d", - annotation.groupId, annotation.domainId, annotation.entityId); - String domainEntity = String.format("%d.%d", annotation.domainId, annotation.entityId); - String entity = String.format("%d", annotation.entityId); - - builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), - groupDomainEntity); - builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), - domainEntity); - builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), - entity); - } - } - - /** - * Build the correct ThriftIndexingEvent's fields based on retweet and reply status. 
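 * <p>For example (IDs are hypothetical), a retweet of a reply maps to the index fields as:
 * <pre>{@code
 * // retweetUserIdVal=12, sharedStatusIdVal=34, inReplyToStatusIdVal=56, inReplyToUserIdVal=78
 * //   => RETWEET_SOURCE_USER_ID_FIELD=12, SHARED_STATUS_ID_CSF=34 (retweet beats reply),
 * //      IS_RETWEET_FLAG set, IN_REPLY_TO_USER_ID_FIELD=78, IS_REPLY_FLAG not set
 * }</pre>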
- */ - public static void buildRetweetAndReplyFields( - long retweetUserIdVal, - long sharedStatusIdVal, - long inReplyToStatusIdVal, - long inReplyToUserIdVal, - boolean strict, - EarlybirdThriftDocumentBuilder builder) { - Optional retweetUserId = Optional.of(retweetUserIdVal).filter(x -> x > 0); - Optional sharedStatusId = Optional.of(sharedStatusIdVal).filter(x -> x > 0); - Optional inReplyToUserId = Optional.of(inReplyToUserIdVal).filter(x -> x > 0); - Optional inReplyToStatusId = Optional.of(inReplyToStatusIdVal).filter(x -> x > 0); - - // We have six combinations here. A Tweet can be - // 1) a reply to another tweet (then it has both in-reply-to-user-id and - // in-reply-to-status-id set), - // 2) directed-at a user (then it only has in-reply-to-user-id set), - // 3) not a reply at all. - // Additionally, it may or may not be a Retweet (if it is, then it has retweet-user-id and - // retweet-status-id set). - // - // We want to set some fields unconditionally, and some fields (reference-author-id and - // shared-status-id) depending on the reply/retweet combination. - // - // 1. Normal tweet (not a reply, not a retweet). None of the fields should be set. - // - // 2. Reply to a tweet (both in-reply-to-user-id and in-reply-to-status-id set). - // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id - // SHARED_STATUS_ID_CSF should be set to in-reply-to-status-id - // IS_REPLY_FLAG should be set - // - // 3. Directed-at a user (only in-reply-to-user-id is set). - // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id - // IS_REPLY_FLAG should be set - // - // 4. Retweet of a normal tweet (retweet-user-id and retweet-status-id are set). - // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id - // SHARED_STATUS_ID_CSF should be set to retweet-status-id - // IS_RETWEET_FLAG should be set - // - // 5. Retweet of a reply (both in-reply-to-user-id and in-reply-to-status-id set, - // retweet-user-id and retweet-status-id are set). - // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id - // SHARED_STATUS_ID_CSF should be set to retweet-status-id (retweet beats reply!) - // IS_RETWEET_FLAG should be set - // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id - // IS_REPLY_FLAG should NOT be set - // - // 6. Retweet of a directed-at tweet (only in-reply-to-user-id is set, - // retweet-user-id and retweet-status-id are set). - // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id - // SHARED_STATUS_ID_CSF should be set to retweet-status-id - // IS_RETWEET_FLAG should be set - // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id - // IS_REPLY_FLAG should NOT be set - // - // In other words: - // SHARED_STATUS_ID_CSF logic: if this is a retweet SHARED_STATUS_ID_CSF should be set to - // retweet-status-id, otherwise if it's a reply to a tweet, it should be set to - // in-reply-to-status-id. - - Preconditions.checkState(retweetUserId.isPresent() == sharedStatusId.isPresent()); - - if (retweetUserId.isPresent()) { - builder.withNativeRetweet(retweetUserId.get(), sharedStatusId.get()); - - if (inReplyToUserId.isPresent()) { - // Set IN_REPLY_TO_USER_ID_FIELD even if this is a retweet of a reply. - builder.withInReplyToUserID(inReplyToUserId.get()); - } - } else { - // If this is a retweet of a reply, we don't want to mark it as a reply, or override fields - // set by the retweet logic. - // If we are in this branch, this is not a retweet. 
Potentially, we set the reply flag, - // and override shared-status-id and reference-author-id. - - if (inReplyToStatusId.isPresent()) { - if (strict) { - // Enforcing that if this is a reply to a tweet, then it also has a replied-to user. - Preconditions.checkState(inReplyToUserId.isPresent()); - } - builder.withReplyFlag(); - builder.withLongField( - EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(), - inReplyToStatusId.get()); - builder.withLongField( - EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName(), - inReplyToStatusId.get()); - } - if (inReplyToUserId.isPresent()) { - builder.withReplyFlag(); - builder.withInReplyToUserID(inReplyToUserId.get()); - } - } - } - - /** - * Build the engagement fields. - */ - public static void buildNormalizedMinEngagementFields( - EarlybirdThriftDocumentBuilder builder, - EarlybirdEncodedFeatures encodedFeatures, - EarlybirdCluster cluster) throws IOException { - if (EarlybirdCluster.isArchive(cluster)) { - int favoriteCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.FAVORITE_COUNT); - int retweetCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT); - int replyCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.REPLY_COUNT); - builder - .withNormalizedMinEngagementField( - EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - favoriteCount); - builder - .withNormalizedMinEngagementField( - EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - retweetCount); - builder - .withNormalizedMinEngagementField( - EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - replyCount); - } - } - - /** - * As seen in SEARCH-5617, we sometimes have incorrect createdAt. This method tries to fix them - * by extracting creation time from snowflake when possible. - */ - public static long fixCreatedAtTimeStampIfNecessary(long id, long createdAtMs) { - if (createdAtMs < VALID_CREATION_TIME_THRESHOLD_MILLIS - && id > SnowflakeIdParser.SNOWFLAKE_ID_LOWER_BOUND) { - // This tweet has a snowflake ID, and we can extract timestamp from the ID. 
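// (A snowflake ID embeds a millisecond creation timestamp in its high-order bits, offset from
// Twitter's snowflake epoch, so the true creation time can be recovered from the ID alone.)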
- ADJUSTED_BAD_CREATED_AT_COUNTER.increment(); - return SnowflakeIdParser.getTimestampFromTweetId(id); - } else if (!SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(id, createdAtMs)) { - LOG.error( - "Found inconsistent tweet ID and created at timestamp: [statusID={}], [createdAtMs={}]", - id, createdAtMs); - INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS.increment(); - } - - return createdAtMs; - } -} diff --git a/src/java/com/twitter/search/common/converter/earlybird/CombinedIndexingConverter.java b/src/java/com/twitter/search/common/converter/earlybird/CombinedIndexingConverter.java deleted file mode 100644 index 1ed40bcd4..000000000 --- a/src/java/com/twitter/search/common/converter/earlybird/CombinedIndexingConverter.java +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.search.common.converter.earlybird; - -import java.io.IOException; -import java.util.List; - -import javax.annotation.concurrent.NotThreadSafe; - -import com.google.common.base.Preconditions; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; - -/** - * CombinedIndexingConverter builds objects from TwitterMessage to ThriftVersionedEvent. - * - * It is used in tests and in offline jobs, so all data is available on the TwitterMessage. This - * means that we don't need to split up the ThriftVersionedEvents into basic events and update - * events, like we do in the realtime pipeline using the BasicIndexingConverter and the - * DelayedIndexingConverter. - */ -@NotThreadSafe -public class CombinedIndexingConverter { - private final EncodedFeatureBuilder featureBuilder; - private final Schema schema; - private final EarlybirdCluster cluster; - - public CombinedIndexingConverter(Schema schema, EarlybirdCluster cluster) { - this.featureBuilder = new EncodedFeatureBuilder(); - this.schema = schema; - this.cluster = cluster; - } - - /** - * Converts a TwitterMessage to a Thrift representation. 
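 * <p>Illustrative sketch only; {@code schema}, {@code cluster}, {@code message} and
 * {@code penguinVersions} are assumed to come from the test or offline job using this class:
 * <pre>{@code
 * CombinedIndexingConverter converter = new CombinedIndexingConverter(schema, cluster);
 * ThriftVersionedEvents events = converter.convertMessageToThrift(message, false, penguinVersions);
 * }</pre>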
- */ - public ThriftVersionedEvents convertMessageToThrift( - TwitterMessage message, - boolean strict, - List penguinVersions) throws IOException { - Preconditions.checkNotNull(message); - Preconditions.checkNotNull(penguinVersions); - - ThriftVersionedEvents versionedEvents = new ThriftVersionedEvents() - .setId(message.getId()); - - ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); - - for (PenguinVersion penguinVersion : penguinVersions) { - ThriftDocument document = - buildDocumentForPenguinVersion(schemaSnapshot, message, strict, penguinVersion); - - ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() - .setDocument(document) - .setEventType(ThriftIndexingEventType.INSERT) - .setSortId(message.getId()); - message.getFromUserTwitterId().map(thriftIndexingEvent::setUid); - versionedEvents.putToVersionedEvents(penguinVersion.getByteValue(), thriftIndexingEvent); - } - - return versionedEvents; - } - - private ThriftDocument buildDocumentForPenguinVersion( - ImmutableSchemaInterface schemaSnapshot, - TwitterMessage message, - boolean strict, - PenguinVersion penguinVersion) throws IOException { - EncodedFeatureBuilder.TweetFeatureWithEncodeFeatures tweetFeature = - featureBuilder.createTweetFeaturesFromTwitterMessage( - message, penguinVersion, schemaSnapshot); - - EarlybirdThriftDocumentBuilder builder = - BasicIndexingConverter.buildBasicFields(message, schemaSnapshot, cluster, tweetFeature); - - BasicIndexingConverter - .buildUserFields(builder, message, tweetFeature.versionedFeatures, penguinVersion); - BasicIndexingConverter.buildGeoFields(builder, message, tweetFeature.versionedFeatures); - DelayedIndexingConverter.buildURLFields(builder, message, tweetFeature.encodedFeatures); - BasicIndexingConverter.buildRetweetAndReplyFields(builder, message, strict); - BasicIndexingConverter.buildQuotesFields(builder, message); - BasicIndexingConverter.buildVersionedFeatureFields(builder, tweetFeature.versionedFeatures); - DelayedIndexingConverter.buildCardFields(builder, message, penguinVersion); - BasicIndexingConverter.buildAnnotationFields(builder, message); - BasicIndexingConverter.buildNormalizedMinEngagementFields( - builder, tweetFeature.encodedFeatures, cluster); - DelayedIndexingConverter.buildNamedEntityFields(builder, message); - BasicIndexingConverter.buildDirectedAtFields(builder, message); - - return builder.build(); - } -} diff --git a/src/java/com/twitter/search/common/converter/earlybird/DelayedIndexingConverter.java b/src/java/com/twitter/search/common/converter/earlybird/DelayedIndexingConverter.java deleted file mode 100644 index 0ed3ac134..000000000 --- a/src/java/com/twitter/search/common/converter/earlybird/DelayedIndexingConverter.java +++ /dev/null @@ -1,594 +0,0 @@ -package com.twitter.search.common.converter.earlybird; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import javax.annotation.Nullable; - -import com.google.common.base.Joiner; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.commons.lang.StringUtils; -import org.apache.http.annotation.NotThreadSafe; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.token.TokenizedCharSequenceStream; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.common_internal.text.version.PenguinVersion; -import 
com.twitter.cuad.ner.plain.thriftjava.NamedEntity; -import com.twitter.decider.Decider; -import com.twitter.search.common.constants.SearchCardType; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.indexing.thriftjava.SearchCard2; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.indexing.thriftjava.TwitterPhotoUrl; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.entities.TwitterMessageUser; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftFieldData; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.common.util.text.LanguageIdentifierHelper; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.text.TokenizerHelper; -import com.twitter.search.common.util.text.TokenizerResult; -import com.twitter.search.common.util.text.TweetTokenStreamSerializer; -import com.twitter.service.spiderduck.gen.MediaTypes; -import com.twitter.search.common.metrics.SearchCounter; - -/** - * Create and populate ThriftVersionedEvents from the URL data, card data, and named entities - * contained in a TwitterMessage. This data is delayed because these services take a few seconds - * to process tweets, and we want to send the basic data available in the BasicIndexingConverter as - * soon as possible, so we send the additional data a few seconds later, as an update. - * - * Prefer to add data and processing to the BasicIndexingConverter when possible. Only add data here - * if your data source _requires_ data from an external service AND the external service takes at - * least a few seconds to process new tweets. 
- */ -@NotThreadSafe -public class DelayedIndexingConverter { - private static final SearchCounter NUM_TWEETS_WITH_CARD_URL = - SearchCounter.export("tweets_with_card_url"); - private static final SearchCounter NUM_TWEETS_WITH_NUMERIC_CARD_URI = - SearchCounter.export("tweets_with_numeric_card_uri"); - private static final SearchCounter NUM_TWEETS_WITH_INVALID_CARD_URI = - SearchCounter.export("tweets_with_invalid_card_uri"); - private static final SearchCounter TOTAL_URLS = - SearchCounter.export("total_urls_on_tweets"); - private static final SearchCounter MEDIA_URLS_ON_TWEETS = - SearchCounter.export("media_urls_on_tweets"); - private static final SearchCounter NON_MEDIA_URLS_ON_TWEETS = - SearchCounter.export("non_media_urls_on_tweets"); - public static final String INDEX_URL_DESCRIPTION_AND_TITLE_DECIDER = - "index_url_description_and_title"; - - private static class ThriftDocumentWithEncodedTweetFeatures { - private final ThriftDocument document; - private final EarlybirdEncodedFeatures encodedFeatures; - - public ThriftDocumentWithEncodedTweetFeatures(ThriftDocument document, - EarlybirdEncodedFeatures encodedFeatures) { - this.document = document; - this.encodedFeatures = encodedFeatures; - } - - public ThriftDocument getDocument() { - return document; - } - - public EarlybirdEncodedFeatures getEncodedFeatures() { - return encodedFeatures; - } - } - - // The list of all the encoded_tweet_features flags that might be updated by this converter. - // No extended_encoded_tweet_features are updated (otherwise they should be in this list too). - private static final List UPDATED_FLAGS = - Lists.newArrayList( - EarlybirdFieldConstants.EarlybirdFieldConstant.IS_OFFENSIVE_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_LINK_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.IS_SENSITIVE_CONTENT, - EarlybirdFieldConstants.EarlybirdFieldConstant.TEXT_SCORE, - EarlybirdFieldConstants.EarlybirdFieldConstant.TWEET_SIGNATURE, - EarlybirdFieldConstants.EarlybirdFieldConstant.LINK_LANGUAGE, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_NEWS_URL_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_EXPANDO_CARD_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_MULTIPLE_MEDIA_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_CARD_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VISIBLE_LINK_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_CONSUMER_VIDEO_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PRO_VIDEO_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VINE_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PERISCOPE_FLAG, - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_NATIVE_IMAGE_FLAG - ); - - private static final Logger LOG = LoggerFactory.getLogger(DelayedIndexingConverter.class); - private static final String AMPLIFY_CARD_NAME = "amplify"; - private static final String PLAYER_CARD_NAME = "player"; - - private final EncodedFeatureBuilder featureBuilder = new EncodedFeatureBuilder(); - - private final Schema schema; - private final Decider decider; - - public DelayedIndexingConverter(Schema schema, Decider decider) { - this.schema = schema; - this.decider = decider; - } - - /** - * Converts the given message to two ThriftVersionedEvents instances: the first one is a feature - * update event for all link and card related flags, and the second one is the 
append event that - * might contain updates to all link and card related fields. - * - * We need to split the updates to fields and flags into two separate events because: - * - When a tweet is created, earlybirds get the "main" event, which does not have resolved URLs. - * - Then the earlybirds might get a feature update from the signal ingesters, marking the tweet - * as spam. - * - Then the ingesters resolve the URLs and send an update event. At this point, the ingesters - * need to send updates for link-related flags too (HAS_LINK_FLAG, etc.). And there are a few - * ways to do this: - * 1. Encode these flags into encoded_tweet_features and extended_encoded_tweet_features and - * add these fields to the update event. The problem is that earlybirds will then override - * the encoded_tweet_features ane extended_encoded_tweet_features fields in the index for - * this tweet, which will override the feature update the earlybirds got earlier, which - * means that a spammy tweet might no longer be marked as spam in the index. - * 2. Send updates only for the flags that might've been updated by this converter. Since - * ThriftIndexingEvent already has a map of field -> value, it seems like the natural place - * to add these updates to. However, earlybirds can correctly process flag updates only if - * they come in a feature update event (PARTIAL_UPDATE). So we need to send the field - * updates in an OUT_OF_ORDER_UPDATE event, and the flag updates in a PARTIAL_UPDATE event. - * - * We need to send the feature update event before the append event to avoid issues like the one - * in SEARCH-30919 where tweets were returned from the card name field index before the HAS_CARD - * feature was updated to true. - * - * @param message The TwitterMessage to convert. - * @param penguinVersions The Penguin versions for which ThriftIndexingEvents should be created. - * @return An out of order update event for all link- and card-related fields and a feature update - * event for all link- and card-related flags. 
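 * <p>Illustrative sketch only; {@code converter}, {@code message} and {@code penguinVersions}
 * are assumed to be supplied by the caller:
 * <pre>{@code
 * List<ThriftVersionedEvents> events =
 *     converter.convertMessageToOutOfOrderAppendAndFeatureUpdate(message, penguinVersions);
 * ThriftVersionedEvents featureUpdate = events.get(0);     // PARTIAL_UPDATE: link/card flags
 * ThriftVersionedEvents outOfOrderAppend = events.get(1);  // OUT_OF_ORDER_APPEND: link/card fields
 * // The feature update must be applied before the append (see the SEARCH-30919 note above).
 * }</pre>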
- */ - public List convertMessageToOutOfOrderAppendAndFeatureUpdate( - TwitterMessage message, List penguinVersions) { - Preconditions.checkNotNull(message); - Preconditions.checkNotNull(penguinVersions); - - ThriftVersionedEvents featureUpdateVersionedEvents = new ThriftVersionedEvents(); - ThriftVersionedEvents outOfOrderAppendVersionedEvents = new ThriftVersionedEvents(); - ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); - - for (PenguinVersion penguinVersion : penguinVersions) { - ThriftDocumentWithEncodedTweetFeatures documentWithEncodedFeatures = - buildDocumentForPenguinVersion(schemaSnapshot, message, penguinVersion); - - ThriftIndexingEvent featureUpdateThriftIndexingEvent = new ThriftIndexingEvent(); - featureUpdateThriftIndexingEvent.setEventType(ThriftIndexingEventType.PARTIAL_UPDATE); - featureUpdateThriftIndexingEvent.setUid(message.getId()); - featureUpdateThriftIndexingEvent.setDocument( - buildFeatureUpdateDocument(documentWithEncodedFeatures.getEncodedFeatures())); - featureUpdateVersionedEvents.putToVersionedEvents( - penguinVersion.getByteValue(), featureUpdateThriftIndexingEvent); - - ThriftIndexingEvent outOfOrderAppendThriftIndexingEvent = new ThriftIndexingEvent(); - outOfOrderAppendThriftIndexingEvent.setDocument(documentWithEncodedFeatures.getDocument()); - outOfOrderAppendThriftIndexingEvent.setEventType(ThriftIndexingEventType.OUT_OF_ORDER_APPEND); - message.getFromUserTwitterId().ifPresent(outOfOrderAppendThriftIndexingEvent::setUid); - outOfOrderAppendThriftIndexingEvent.setSortId(message.getId()); - outOfOrderAppendVersionedEvents.putToVersionedEvents( - penguinVersion.getByteValue(), outOfOrderAppendThriftIndexingEvent); - } - - featureUpdateVersionedEvents.setId(message.getId()); - outOfOrderAppendVersionedEvents.setId(message.getId()); - - return Lists.newArrayList(featureUpdateVersionedEvents, outOfOrderAppendVersionedEvents); - } - - private ThriftDocument buildFeatureUpdateDocument(EarlybirdEncodedFeatures encodedFeatures) { - ThriftDocument document = new ThriftDocument(); - for (EarlybirdFieldConstants.EarlybirdFieldConstant flag : UPDATED_FLAGS) { - ThriftField field = new ThriftField(); - field.setFieldConfigId(flag.getFieldId()); - field.setFieldData(new ThriftFieldData().setIntValue(encodedFeatures.getFeatureValue(flag))); - document.addToFields(field); - } - return document; - } - - private ThriftDocumentWithEncodedTweetFeatures buildDocumentForPenguinVersion( - ImmutableSchemaInterface schemaSnapshot, - TwitterMessage message, - PenguinVersion penguinVersion) { - - EarlybirdEncodedFeatures encodedFeatures = featureBuilder.createTweetFeaturesFromTwitterMessage( - message, penguinVersion, schemaSnapshot).encodedFeatures; - - EarlybirdThriftDocumentBuilder builder = new EarlybirdThriftDocumentBuilder( - encodedFeatures, - null, - new EarlybirdFieldConstants(), - schemaSnapshot); - - builder.setAddLatLonCSF(false); - builder.withID(message.getId()); - buildFieldsFromUrlInfo(builder, message, penguinVersion, encodedFeatures); - buildCardFields(builder, message, penguinVersion); - buildNamedEntityFields(builder, message); - builder.withTweetSignature(message.getTweetSignature(penguinVersion)); - - buildSpaceAdminAndTitleFields(builder, message, penguinVersion); - - builder.setAddEncodedTweetFeatures(false); - - return new ThriftDocumentWithEncodedTweetFeatures(builder.build(), encodedFeatures); - } - - public static void buildNamedEntityFields( - EarlybirdThriftDocumentBuilder builder, TwitterMessage message) { - for (NamedEntity 
namedEntity : message.getNamedEntities()) { - builder.withNamedEntity(namedEntity); - } - } - - private void buildFieldsFromUrlInfo( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - PenguinVersion penguinVersion, - EarlybirdEncodedFeatures encodedFeatures) { - // We need to update the RESOLVED_LINKS_TEXT_FIELD, since we might have new resolved URLs. - // Use the same logic as in EncodedFeatureBuilder.java. - TweetTextFeatures textFeatures = message.getTweetTextFeatures(penguinVersion); - String resolvedUrlsText = Joiner.on(" ").skipNulls().join(textFeatures.getResolvedUrlTokens()); - builder.withResolvedLinksText(resolvedUrlsText); - - buildURLFields(builder, message, encodedFeatures); - buildAnalyzedURLFields(builder, message, penguinVersion); - } - - private void buildAnalyzedURLFields( - EarlybirdThriftDocumentBuilder builder, TwitterMessage message, PenguinVersion penguinVersion - ) { - TOTAL_URLS.add(message.getExpandedUrls().size()); - if (DeciderUtil.isAvailableForRandomRecipient( - decider, - INDEX_URL_DESCRIPTION_AND_TITLE_DECIDER)) { - for (ThriftExpandedUrl expandedUrl : message.getExpandedUrls()) { - /* - Consumer Media URLs are added to the expanded URLs in - TweetEventParserHelper.addMediaEntitiesToMessage. These Twitter.com media URLs contain - the tweet text as the description and the title is " on Twitter". This is - redundant information at best and misleading at worst. We will ignore these URLs to avoid - polluting the url_description and url_title field as well as saving space. - */ - if (!expandedUrl.isSetConsumerMedia() || !expandedUrl.isConsumerMedia()) { - NON_MEDIA_URLS_ON_TWEETS.increment(); - if (expandedUrl.isSetDescription()) { - buildTweetTokenizerTokenizedField(builder, - EarlybirdFieldConstants.EarlybirdFieldConstant.URL_DESCRIPTION_FIELD.getFieldName(), - expandedUrl.getDescription(), - penguinVersion); - } - if (expandedUrl.isSetTitle()) { - buildTweetTokenizerTokenizedField(builder, - EarlybirdFieldConstants.EarlybirdFieldConstant.URL_TITLE_FIELD.getFieldName(), - expandedUrl.getTitle(), - penguinVersion); - } - } else { - MEDIA_URLS_ON_TWEETS.increment(); - } - } - } - } - - /** - * Build the URL based fields from a tweet. 
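 * <p>Flags resolved from URL data (for example HAS_CONSUMER_VIDEO_FLAG, HAS_PRO_VIDEO_FLAG,
 * HAS_VINE_FLAG and HAS_PERISCOPE_FLAG) are also mirrored into internal filter terms here, so the
 * corresponding filter operators keep matching once the update is applied.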
- */ - public static void buildURLFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - EarlybirdEncodedFeatures encodedFeatures - ) { - Map expandedUrlMap = message.getExpandedUrlMap(); - - for (ThriftExpandedUrl expandedUrl : expandedUrlMap.values()) { - if (expandedUrl.getMediaType() == MediaTypes.NATIVE_IMAGE) { - EncodedFeatureBuilder.addPhotoUrl(message, expandedUrl.getCanonicalLastHopUrl()); - } - } - - // now add all twitter photos links that came with the tweet's payload - Map photos = message.getPhotoUrls(); - List photoURLs = new ArrayList<>(); - if (photos != null) { - for (Map.Entry entry : photos.entrySet()) { - TwitterPhotoUrl photo = new TwitterPhotoUrl(entry.getKey()); - String mediaUrl = entry.getValue(); - if (mediaUrl != null) { - photo.setMediaUrl(mediaUrl); - } - photoURLs.add(photo); - } - } - - try { - builder - .withURLs(Lists.newArrayList(expandedUrlMap.values())) - .withTwimgURLs(photoURLs); - } catch (IOException ioe) { - LOG.error("URL field creation threw an IOException", ioe); - } - - - if (encodedFeatures.isFlagSet( - EarlybirdFieldConstants.EarlybirdFieldConstant.IS_OFFENSIVE_FLAG)) { - builder.withOffensiveFlag(); - } - if (encodedFeatures.isFlagSet( - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_CONSUMER_VIDEO_FLAG)) { - builder.addFilterInternalFieldTerm( - EarlybirdFieldConstants.EarlybirdFieldConstant.CONSUMER_VIDEO_FILTER_TERM); - } - if (encodedFeatures.isFlagSet( - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PRO_VIDEO_FLAG)) { - builder.addFilterInternalFieldTerm( - EarlybirdFieldConstants.EarlybirdFieldConstant.PRO_VIDEO_FILTER_TERM); - } - if (encodedFeatures.isFlagSet(EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VINE_FLAG)) { - builder.addFilterInternalFieldTerm( - EarlybirdFieldConstants.EarlybirdFieldConstant.VINE_FILTER_TERM); - } - if (encodedFeatures.isFlagSet( - EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PERISCOPE_FLAG)) { - builder.addFilterInternalFieldTerm( - EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_FILTER_TERM); - } - } - - /** - * Build the card information inside ThriftIndexingEvent's fields. 
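 * <p>For example, a card URL of the form {@code "card://1234567890"} has its numeric suffix
 * indexed into the CARD_URI_CSF field; a non-numeric suffix is counted as invalid and skipped
 * (see the parsing below).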
- */ - static void buildCardFields(EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - PenguinVersion penguinVersion) { - if (message.hasCard()) { - SearchCard2 card = buildSearchCardFromTwitterMessage( - message, - TweetTokenStreamSerializer.getTweetTokenStreamSerializer(), - penguinVersion); - buildCardFeatures(message.getId(), builder, card); - } - } - - private static SearchCard2 buildSearchCardFromTwitterMessage( - TwitterMessage message, - TokenStreamSerializer streamSerializer, - PenguinVersion penguinVersion) { - SearchCard2 card = new SearchCard2(); - card.setCardName(message.getCardName()); - if (message.getCardDomain() != null) { - card.setCardDomain(message.getCardDomain()); - } - if (message.getCardLang() != null) { - card.setCardLang(message.getCardLang()); - } - if (message.getCardUrl() != null) { - card.setCardUrl(message.getCardUrl()); - } - - if (message.getCardTitle() != null && !message.getCardTitle().isEmpty()) { - String normalizedTitle = NormalizerHelper.normalize( - message.getCardTitle(), message.getLocale(), penguinVersion); - TokenizerResult result = TokenizerHelper.tokenizeTweet( - normalizedTitle, message.getLocale(), penguinVersion); - TokenizedCharSequenceStream tokenSeqStream = new TokenizedCharSequenceStream(); - tokenSeqStream.reset(result.tokenSequence); - try { - card.setCardTitleTokenStream(streamSerializer.serialize(tokenSeqStream)); - card.setCardTitleTokenStreamText(result.tokenSequence.toString()); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! Could not serialize card title: " - + result.tokenSequence); - card.unsetCardTitleTokenStream(); - card.unsetCardTitleTokenStreamText(); - } - } - if (message.getCardDescription() != null && !message.getCardDescription().isEmpty()) { - String normalizedDesc = NormalizerHelper.normalize( - message.getCardDescription(), message.getLocale(), penguinVersion); - TokenizerResult result = TokenizerHelper.tokenizeTweet( - normalizedDesc, message.getLocale(), penguinVersion); - TokenizedCharSequenceStream tokenSeqStream = new TokenizedCharSequenceStream(); - tokenSeqStream.reset(result.tokenSequence); - try { - card.setCardDescriptionTokenStream(streamSerializer.serialize(tokenSeqStream)); - card.setCardDescriptionTokenStreamText(result.tokenSequence.toString()); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! Could not serialize card description: " - + result.tokenSequence); - card.unsetCardDescriptionTokenStream(); - card.unsetCardDescriptionTokenStreamText(); - } - } - - return card; - } - - /** - * Builds card features. - */ - private static void buildCardFeatures( - long tweetId, EarlybirdThriftDocumentBuilder builder, SearchCard2 card) { - if (card == null) { - return; - } - builder - .withTokenStreamField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_TITLE_FIELD.getFieldName(), - card.getCardTitleTokenStreamText(), - card.isSetCardTitleTokenStream() ? card.getCardTitleTokenStream() : null) - .withTokenStreamField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_DESCRIPTION_FIELD.getFieldName(), - card.getCardDescriptionTokenStreamText(), - card.isSetCardDescriptionTokenStream() ? 
card.getCardDescriptionTokenStream() : null) - .withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_NAME_FIELD.getFieldName(), - card.getCardName()) - .withIntField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_TYPE_CSF_FIELD.getFieldName(), - SearchCardType.cardTypeFromStringName(card.getCardName()).getByteValue()); - - if (card.getCardLang() != null) { - builder.withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_LANG.getFieldName(), - card.getCardLang()).withIntField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_LANG_CSF.getFieldName(), - ThriftLanguageUtil.getThriftLanguageOf(card.getCardLang()).getValue()); - } - if (card.getCardDomain() != null) { - builder.withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_DOMAIN_FIELD.getFieldName(), - card.getCardDomain()); - } - if (card.getCardUrl() != null) { - NUM_TWEETS_WITH_CARD_URL.increment(); - if (card.getCardUrl().startsWith("card://")) { - String suffix = card.getCardUrl().replace("card://", ""); - if (StringUtils.isNumeric(suffix)) { - NUM_TWEETS_WITH_NUMERIC_CARD_URI.increment(); - builder.withLongField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_URI_CSF.getFieldName(), - Long.parseLong(suffix)); - LOG.debug(String.format( - "Good card URL for tweet %s: %s", - tweetId, - card.getCardUrl())); - } else { - NUM_TWEETS_WITH_INVALID_CARD_URI.increment(); - LOG.debug(String.format( - "Card URL starts with \"card://\" but followed by non-numeric for tweet %s: %s", - tweetId, - card.getCardUrl())); - } - } - } - if (isCardVideo(card)) { - // Add into "internal" field so that this tweet is returned by filter:videos. - builder.addFacetSkipList( - EarlybirdFieldConstants.EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName()); - } - } - - /** - * Determines if a card is a video. - */ - private static boolean isCardVideo(@Nullable SearchCard2 card) { - if (card == null) { - return false; - } - return AMPLIFY_CARD_NAME.equalsIgnoreCase(card.getCardName()) - || PLAYER_CARD_NAME.equalsIgnoreCase(card.getCardName()); - } - - private void buildSpaceAdminAndTitleFields( - EarlybirdThriftDocumentBuilder builder, - TwitterMessage message, - PenguinVersion penguinVersion) { - - buildSpaceAdminFields(builder, message.getSpaceAdmins(), penguinVersion); - - // build the space title field. - buildTweetTokenizerTokenizedField( - builder, - EarlybirdFieldConstants.EarlybirdFieldConstant.SPACE_TITLE_FIELD.getFieldName(), - message.getSpaceTitle(), - penguinVersion); - } - - private void buildSpaceAdminFields( - EarlybirdThriftDocumentBuilder builder, - Set spaceAdmins, - PenguinVersion penguinVersion) { - - for (TwitterMessageUser spaceAdmin : spaceAdmins) { - if (spaceAdmin.getScreenName().isPresent()) { - // build screen name (aka handle) fields. 
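// --- Illustrative aside (not part of the original file) -------------------------------------
// buildCardFeatures() above turns "card://<id>" URLs into the numeric CARD_URI_CSF value.
// Below is a minimal, standalone sketch of that parsing step using a made-up card URL; the
// screen-name handling referenced in the comment above continues right after this aside.
public final class CardUriParsingSketch {
  public static void main(String[] args) {
    String cardUrl = "card://1234567890"; // hypothetical card URL attached to a tweet
    if (cardUrl.startsWith("card://")) {
      String suffix = cardUrl.substring("card://".length());
      // The original uses StringUtils.isNumeric(suffix); this is the plain-JDK equivalent.
      boolean numeric = !suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit);
      if (numeric) {
        long cardUri = Long.parseLong(suffix); // value stored in the CARD_URI_CSF field
        System.out.println("card URI = " + cardUri);
      } else {
        System.out.println("non-numeric card:// suffix is counted as invalid");
      }
    }
  }
}
// ---------------------------------------------------------------------------------------------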
- String screenName = spaceAdmin.getScreenName().get(); - String normalizedScreenName = - NormalizerHelper.normalizeWithUnknownLocale(screenName, penguinVersion); - - builder.withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.SPACE_ADMIN_FIELD.getFieldName(), - normalizedScreenName); - builder.withWhiteSpaceTokenizedScreenNameField( - EarlybirdFieldConstants - .EarlybirdFieldConstant.TOKENIZED_SPACE_ADMIN_FIELD.getFieldName(), - normalizedScreenName); - - if (spaceAdmin.getTokenizedScreenName().isPresent()) { - builder.withCamelCaseTokenizedScreenNameField( - EarlybirdFieldConstants - .EarlybirdFieldConstant.CAMELCASE_TOKENIZED_SPACE_ADMIN_FIELD.getFieldName(), - screenName, - normalizedScreenName, - spaceAdmin.getTokenizedScreenName().get()); - } - } - - if (spaceAdmin.getDisplayName().isPresent()) { - buildTweetTokenizerTokenizedField( - builder, - EarlybirdFieldConstants - .EarlybirdFieldConstant.TOKENIZED_SPACE_ADMIN_DISPLAY_NAME_FIELD.getFieldName(), - spaceAdmin.getDisplayName().get(), - penguinVersion); - } - } - } - - private void buildTweetTokenizerTokenizedField( - EarlybirdThriftDocumentBuilder builder, - String fieldName, - String text, - PenguinVersion penguinVersion) { - - if (StringUtils.isNotEmpty(text)) { - Locale locale = LanguageIdentifierHelper - .identifyLanguage(text); - String normalizedText = NormalizerHelper.normalize( - text, locale, penguinVersion); - TokenizerResult result = TokenizerHelper - .tokenizeTweet(normalizedText, locale, penguinVersion); - TokenizedCharSequenceStream tokenSeqStream = new TokenizedCharSequenceStream(); - tokenSeqStream.reset(result.tokenSequence); - TokenStreamSerializer streamSerializer = - TweetTokenStreamSerializer.getTweetTokenStreamSerializer(); - try { - builder.withTokenStreamField( - fieldName, - result.tokenSequence.toString(), - streamSerializer.serialize(tokenSeqStream)); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! 
Could not serialize: " + text); - } - } - } -} diff --git a/src/java/com/twitter/search/common/converter/earlybird/EncodedFeatureBuilder.java b/src/java/com/twitter/search/common/converter/earlybird/EncodedFeatureBuilder.java deleted file mode 100644 index c5d6b1c76..000000000 --- a/src/java/com/twitter/search/common/converter/earlybird/EncodedFeatureBuilder.java +++ /dev/null @@ -1,531 +0,0 @@ -package com.twitter.search.common.converter.earlybird; - -import java.io.IOException; -import java.util.HashSet; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Optional; -import java.util.Set; -import java.util.regex.Matcher; -import java.util.regex.Pattern; -import java.util.stream.Collectors; - -import com.google.common.base.Joiner; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.commons.lang.StringUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.token.TokenizedCharSequence; -import com.twitter.common.text.token.TokenizedCharSequenceStream; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.Place; -import com.twitter.search.common.indexing.thriftjava.PotentialLocation; -import com.twitter.search.common.indexing.thriftjava.ProfileGeoEnrichment; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.relevance.entities.PotentialLocationObject; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.FeatureSink; -import com.twitter.search.common.relevance.features.MutableFeatureNormalizers; -import com.twitter.search.common.relevance.features.RelevanceSignalConstants; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.relevance.features.TweetTextQuality; -import com.twitter.search.common.relevance.features.TweetUserFeatures; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.common.util.text.LanguageIdentifierHelper; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.text.SourceNormalizer; -import com.twitter.search.common.util.text.TokenizerHelper; -import com.twitter.search.common.util.text.TokenizerResult; -import com.twitter.search.common.util.text.TweetTokenStreamSerializer; -import com.twitter.search.common.util.url.LinkVisibilityUtils; -import com.twitter.search.common.util.url.NativeVideoClassificationUtils; -import com.twitter.search.ingester.model.VisibleTokenRatioUtil; - -/** - * EncodedFeatureBuilder helps to build encoded features for TwitterMessage. 
- * - * This is stateful so should only be used one tweet at a time - */ -public class EncodedFeatureBuilder { - private static final Logger LOG = LoggerFactory.getLogger(EncodedFeatureBuilder.class); - - private static final SearchCounter NUM_TWEETS_WITH_INVALID_TWEET_ID_IN_PHOTO_URL = - SearchCounter.export("tweets_with_invalid_tweet_id_in_photo_url"); - - // TwitterTokenStream for converting TokenizedCharSequence into a stream for serialization - // This is stateful so should only be used one tweet at a time - private final TokenizedCharSequenceStream tokenSeqStream = new TokenizedCharSequenceStream(); - - // SUPPRESS CHECKSTYLE:OFF LineLength - private static final Pattern TWITTER_PHOTO_PERMA_LINK_PATTERN = - Pattern.compile("(?i:^(?:(?:https?\\:\\/\\/)?(?:www\\.)?)?twitter\\.com\\/(?:\\?[^#]+)?(?:#!?\\/?)?\\w{1,20}\\/status\\/(\\d+)\\/photo\\/\\d*$)"); - - private static final Pattern TWITTER_PHOTO_COPY_PASTE_LINK_PATTERN = - Pattern.compile("(?i:^(?:(?:https?\\:\\/\\/)?(?:www\\.)?)?twitter\\.com\\/(?:#!?\\/)?\\w{1,20}\\/status\\/(\\d+)\\/photo\\/\\d*$)"); - // SUPPRESS CHECKSTYLE:ON LineLength - - private static final VisibleTokenRatioUtil VISIBLE_TOKEN_RATIO = new VisibleTokenRatioUtil(); - - private static final Map SERIALIZE_FAILURE_COUNTERS_MAP = - Maps.newEnumMap(PenguinVersion.class); - static { - for (PenguinVersion penguinVersion : PenguinVersion.values()) { - SERIALIZE_FAILURE_COUNTERS_MAP.put( - penguinVersion, - SearchCounter.export( - "tokenstream_serialization_failure_" + penguinVersion.name().toLowerCase())); - } - } - - public static class TweetFeatureWithEncodeFeatures { - public final VersionedTweetFeatures versionedFeatures; - public final EarlybirdEncodedFeatures encodedFeatures; - public final EarlybirdEncodedFeatures extendedEncodedFeatures; - - public TweetFeatureWithEncodeFeatures( - VersionedTweetFeatures versionedFeatures, - EarlybirdEncodedFeatures encodedFeatures, - EarlybirdEncodedFeatures extendedEncodedFeatures) { - this.versionedFeatures = versionedFeatures; - this.encodedFeatures = encodedFeatures; - this.extendedEncodedFeatures = extendedEncodedFeatures; - } - } - - /** - * Create tweet text features and the encoded features. - * - * @param message the tweet message - * @param penguinVersion the based penguin version to create the features - * @param schemaSnapshot the schema associated with the features - * @return the text features and the encoded features - */ - public TweetFeatureWithEncodeFeatures createTweetFeaturesFromTwitterMessage( - TwitterMessage message, - PenguinVersion penguinVersion, - ImmutableSchemaInterface schemaSnapshot) { - VersionedTweetFeatures versionedTweetFeatures = new VersionedTweetFeatures(); - - // Write extendedPackedFeatures. - EarlybirdEncodedFeatures extendedEncodedFeatures = - createExtendedEncodedFeaturesFromTwitterMessage(message, penguinVersion, schemaSnapshot); - if (extendedEncodedFeatures != null) { - extendedEncodedFeatures - .writeExtendedFeaturesToVersionedTweetFeatures(versionedTweetFeatures); - } - - setSourceAndNormalizedSource( - message.getStrippedSource(), versionedTweetFeatures, penguinVersion); - - TweetTextFeatures textFeatures = message.getTweetTextFeatures(penguinVersion); - - /////////////////////////////// - // Add hashtags and mentions - textFeatures.getHashtags().forEach(versionedTweetFeatures::addToHashtags); - textFeatures.getMentions().forEach(versionedTweetFeatures::addToMentions); - - /////////////////////////////// - // Extract some extra information from the message text. 
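// --- Illustrative aside (not part of the original file) -------------------------------------
// The TWITTER_PHOTO_*_PATTERN regexes declared at the top of this class are used by
// addPhotoUrl() further down to recognize native photo permalinks and recover the status ID.
// A standalone sketch against a made-up URL (pattern copied verbatim from above):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PhotoPermalinkSketch {
  private static final Pattern TWITTER_PHOTO_PERMA_LINK_PATTERN =
      Pattern.compile("(?i:^(?:(?:https?\\:\\/\\/)?(?:www\\.)?)?twitter\\.com\\/(?:\\?[^#]+)?(?:#!?\\/?)?\\w{1,20}\\/status\\/(\\d+)\\/photo\\/\\d*$)");

  public static void main(String[] args) {
    String url = "https://twitter.com/someuser/status/1234567890/photo/1"; // hypothetical permalink
    Matcher matcher = TWITTER_PHOTO_PERMA_LINK_PATTERN.matcher(url);
    if (matcher.matches() && matcher.groupCount() == 1) {
      long photoStatusId = Long.parseLong(matcher.group(1)); // 1234567890
      System.out.println("native photo belongs to status " + photoStatusId);
    }
  }
}
// ---------------------------------------------------------------------------------------------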
- // Index stock symbols with $ prepended - textFeatures.getStocks().stream() - .filter(stock -> stock != null) - .forEach(stock -> versionedTweetFeatures.addToStocks(stock.toLowerCase())); - - // Question marks - versionedTweetFeatures.setHasQuestionMark(textFeatures.hasQuestionMark()); - // Smileys - versionedTweetFeatures.setHasPositiveSmiley(textFeatures.hasPositiveSmiley()); - versionedTweetFeatures.setHasNegativeSmiley(textFeatures.hasNegativeSmiley()); - - TokenStreamSerializer streamSerializer = - TweetTokenStreamSerializer.getTweetTokenStreamSerializer(); - TokenizedCharSequence tokenSeq = textFeatures.getTokenSequence(); - tokenSeqStream.reset(tokenSeq); - int tokenPercent = VISIBLE_TOKEN_RATIO.extractAndNormalizeTokenPercentage(tokenSeqStream); - tokenSeqStream.reset(tokenSeq); - - // Write packedFeatures. - EarlybirdEncodedFeatures encodedFeatures = createEncodedFeaturesFromTwitterMessage( - message, penguinVersion, schemaSnapshot, tokenPercent); - encodedFeatures.writeFeaturesToVersionedTweetFeatures(versionedTweetFeatures); - - try { - versionedTweetFeatures.setTweetTokenStream(streamSerializer.serialize(tokenSeqStream)); - versionedTweetFeatures.setTweetTokenStreamText(tokenSeq.toString()); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! Could not serialize: " - + tokenSeq.toString()); - SERIALIZE_FAILURE_COUNTERS_MAP.get(penguinVersion).increment(); - versionedTweetFeatures.unsetTweetTokenStream(); - versionedTweetFeatures.unsetTweetTokenStreamText(); - } - - // User name features - if (message.getFromUserDisplayName().isPresent()) { - Locale locale = LanguageIdentifierHelper - .identifyLanguage(message.getFromUserDisplayName().get()); - String normalizedDisplayName = NormalizerHelper.normalize( - message.getFromUserDisplayName().get(), locale, penguinVersion); - TokenizerResult result = TokenizerHelper - .tokenizeTweet(normalizedDisplayName, locale, penguinVersion); - tokenSeqStream.reset(result.tokenSequence); - try { - versionedTweetFeatures.setUserDisplayNameTokenStream( - streamSerializer.serialize(tokenSeqStream)); - versionedTweetFeatures.setUserDisplayNameTokenStreamText(result.tokenSequence.toString()); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! Could not serialize: " - + message.getFromUserDisplayName().get()); - SERIALIZE_FAILURE_COUNTERS_MAP.get(penguinVersion).increment(); - versionedTweetFeatures.unsetUserDisplayNameTokenStream(); - versionedTweetFeatures.unsetUserDisplayNameTokenStreamText(); - } - } - - String resolvedUrlsText = Joiner.on(" ").skipNulls().join(textFeatures.getResolvedUrlTokens()); - versionedTweetFeatures.setNormalizedResolvedUrlText(resolvedUrlsText); - - addPlace(message, versionedTweetFeatures, penguinVersion); - addProfileGeoEnrichment(message, versionedTweetFeatures, penguinVersion); - - versionedTweetFeatures.setTweetSignature(message.getTweetSignature(penguinVersion)); - - return new TweetFeatureWithEncodeFeatures( - versionedTweetFeatures, encodedFeatures, extendedEncodedFeatures); - } - - - protected static void setSourceAndNormalizedSource( - String strippedSource, - VersionedTweetFeatures versionedTweetFeatures, - PenguinVersion penguinVersion) { - - if (strippedSource != null && !strippedSource.isEmpty()) { - // normalize source for searchable field - replaces whitespace with underscores (???). - versionedTweetFeatures.setNormalizedSource( - SourceNormalizer.normalize(strippedSource, penguinVersion)); - - // source facet has simpler normalization. 
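// --- Illustrative aside (not part of the original file) -------------------------------------
// createTweetFeaturesFromTwitterMessage() above joins the tokenized resolved-URL tokens into a
// single space-separated string for the normalizedResolvedUrlText field. A tiny sketch of what
// that Guava Joiner call produces, with made-up tokens; the source normalization for the method
// above continues right after this aside.
import java.util.Arrays;
import com.google.common.base.Joiner;

public final class ResolvedUrlTextSketch {
  public static void main(String[] args) {
    String resolvedUrlsText = Joiner.on(" ").skipNulls()
        .join(Arrays.asList("example.com/article", null, "example.org/video"));
    System.out.println(resolvedUrlsText); // "example.com/article example.org/video"
  }
}
// ---------------------------------------------------------------------------------------------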
- Locale locale = LanguageIdentifierHelper.identifyLanguage(strippedSource); - versionedTweetFeatures.setSource(NormalizerHelper.normalizeKeepCase( - strippedSource, locale, penguinVersion)); - } - } - - /** - * Adds the given photo url to the thrift status if it is a twitter photo permalink. - * Returns true, if this was indeed a twitter photo, false otherwise. - */ - public static boolean addPhotoUrl(TwitterMessage message, String photoPermalink) { - Matcher matcher = TWITTER_PHOTO_COPY_PASTE_LINK_PATTERN.matcher(photoPermalink); - if (!matcher.matches() || matcher.groupCount() < 1) { - matcher = TWITTER_PHOTO_PERMA_LINK_PATTERN.matcher(photoPermalink); - } - - if (matcher.matches() && matcher.groupCount() == 1) { - // this is a native photo url which we need to store in a separate field - String idStr = matcher.group(1); - if (idStr != null) { - // idStr should be a valid tweet ID (and therefore, should fit into a Long), but we have - // tweets for which idStr is a long sequence of digits that does not fit into a Long. - try { - long photoStatusId = Long.parseLong(idStr); - message.addPhotoUrl(photoStatusId, null); - } catch (NumberFormatException e) { - LOG.warn("Found a tweet with a photo URL with an invalid tweet ID: " + message); - NUM_TWEETS_WITH_INVALID_TWEET_ID_IN_PHOTO_URL.increment(); - } - } - return true; - } - return false; - } - - private void addPlace(TwitterMessage message, - VersionedTweetFeatures versionedTweetFeatures, - PenguinVersion penguinVersion) { - String placeId = message.getPlaceId(); - if (placeId == null) { - return; - } - - // Tweet.Place.id and Tweet.Place.full_name are both required fields. - String placeFullName = message.getPlaceFullName(); - Preconditions.checkNotNull(placeFullName, "Tweet.Place without full_name."); - - Locale placeFullNameLocale = LanguageIdentifierHelper.identifyLanguage(placeFullName); - String normalizedPlaceFullName = - NormalizerHelper.normalize(placeFullName, placeFullNameLocale, penguinVersion); - String tokenizedPlaceFullName = StringUtils.join( - TokenizerHelper.tokenizeQuery(normalizedPlaceFullName, placeFullNameLocale, penguinVersion), - " "); - - Place place = new Place(placeId, tokenizedPlaceFullName); - String placeCountryCode = message.getPlaceCountryCode(); - if (placeCountryCode != null) { - Locale placeCountryCodeLocale = LanguageIdentifierHelper.identifyLanguage(placeCountryCode); - place.setCountryCode( - NormalizerHelper.normalize(placeCountryCode, placeCountryCodeLocale, penguinVersion)); - } - - versionedTweetFeatures.setTokenizedPlace(place); - } - - private void addProfileGeoEnrichment(TwitterMessage message, - VersionedTweetFeatures versionedTweetFeatures, - PenguinVersion penguinVersion) { - List potentialLocations = message.getPotentialLocations(); - if (potentialLocations.isEmpty()) { - return; - } - - List thriftPotentialLocations = Lists.newArrayList(); - for (PotentialLocationObject potentialLocation : potentialLocations) { - thriftPotentialLocations.add(potentialLocation.toThriftPotentialLocation(penguinVersion)); - } - versionedTweetFeatures.setTokenizedProfileGeoEnrichment( - new ProfileGeoEnrichment(thriftPotentialLocations)); - } - - /** Returns the encoded features. 
*/ - public static EarlybirdEncodedFeatures createEncodedFeaturesFromTwitterMessage( - TwitterMessage message, - PenguinVersion penguinVersion, - ImmutableSchemaInterface schema, - int normalizedTokenPercentBucket) { - FeatureSink sink = new FeatureSink(schema); - - // Static features - sink.setBooleanValue(EarlybirdFieldConstant.IS_RETWEET_FLAG, message.isRetweet()) - .setBooleanValue(EarlybirdFieldConstant.IS_REPLY_FLAG, message.isReply()) - .setBooleanValue( - EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG, message.isUserVerified()) - .setBooleanValue( - EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG, message.isUserBlueVerified()) - .setBooleanValue(EarlybirdFieldConstant.IS_SENSITIVE_CONTENT, message.isSensitiveContent()); - - TweetTextFeatures textFeatures = message.getTweetTextFeatures(penguinVersion); - if (textFeatures != null) { - final FeatureConfiguration featureConfigNumHashtags = schema.getFeatureConfigurationByName( - EarlybirdFieldConstant.NUM_HASHTAGS.getFieldName()); - final FeatureConfiguration featureConfigNumMentions = schema.getFeatureConfigurationByName( - EarlybirdFieldConstant.NUM_MENTIONS.getFieldName()); - - sink.setNumericValue( - EarlybirdFieldConstant.NUM_HASHTAGS, - Math.min(textFeatures.getHashtagsSize(), featureConfigNumHashtags.getMaxValue())) - .setNumericValue( - EarlybirdFieldConstant.NUM_MENTIONS, - Math.min(textFeatures.getMentionsSize(), featureConfigNumMentions.getMaxValue())) - .setBooleanValue( - EarlybirdFieldConstant.HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG, - TwitterMessage.hasMultipleHashtagsOrTrends(textFeatures)) - .setBooleanValue( - EarlybirdFieldConstant.HAS_TREND_FLAG, - textFeatures.getTrendingTermsSize() > 0); - } - - TweetTextQuality textQuality = message.getTweetTextQuality(penguinVersion); - if (textQuality != null) { - sink.setNumericValue(EarlybirdFieldConstant.TEXT_SCORE, textQuality.getTextScore()); - sink.setBooleanValue( - EarlybirdFieldConstant.IS_OFFENSIVE_FLAG, - textQuality.hasBoolQuality(TweetTextQuality.BooleanQualityType.OFFENSIVE) - || textQuality.hasBoolQuality(TweetTextQuality.BooleanQualityType.OFFENSIVE_USER) - // Note: if json message "possibly_sensitive" flag is set, we consider the tweet - // sensitive and is currently filtered out in safe search mode via a hacky setup: - // earlybird does not create _filter_sensitive_content field, only - // _is_offensive field is created, and used in filter:safe operator - || textQuality.hasBoolQuality(TweetTextQuality.BooleanQualityType.SENSITIVE)); - if (textQuality.hasBoolQuality(TweetTextQuality.BooleanQualityType.SENSITIVE)) { - sink.setBooleanValue(EarlybirdFieldConstant.IS_SENSITIVE_CONTENT, true); - } - } else { - // we don't have text score, for whatever reason, set to sentinel value so we won't be - // skipped by scoring function - sink.setNumericValue(EarlybirdFieldConstant.TEXT_SCORE, - RelevanceSignalConstants.UNSET_TEXT_SCORE_SENTINEL); - } - - if (message.isSetLocale()) { - sink.setNumericValue(EarlybirdFieldConstant.LANGUAGE, - ThriftLanguageUtil.getThriftLanguageOf(message.getLocale()).getValue()); - } - - // User features - TweetUserFeatures userFeatures = message.getTweetUserFeatures(penguinVersion); - if (userFeatures != null) { - sink.setBooleanValue(EarlybirdFieldConstant.IS_USER_SPAM_FLAG, userFeatures.isSpam()) - .setBooleanValue(EarlybirdFieldConstant.IS_USER_NSFW_FLAG, userFeatures.isNsfw()) - .setBooleanValue(EarlybirdFieldConstant.IS_USER_BOT_FLAG, userFeatures.isBot()); - } - if (message.getUserReputation() != 
TwitterMessage.DOUBLE_FIELD_NOT_PRESENT) { - sink.setNumericValue(EarlybirdFieldConstant.USER_REPUTATION, - (byte) message.getUserReputation()); - } else { - sink.setNumericValue(EarlybirdFieldConstant.USER_REPUTATION, - RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL); - } - - sink.setBooleanValue(EarlybirdFieldConstant.IS_NULLCAST_FLAG, message.getNullcast()); - - // Realtime Ingestion does not write engagement features. Updater does that. - if (message.getNumFavorites() > 0) { - sink.setNumericValue(EarlybirdFieldConstant.FAVORITE_COUNT, - MutableFeatureNormalizers.BYTE_NORMALIZER.normalize(message.getNumFavorites())); - } - if (message.getNumRetweets() > 0) { - sink.setNumericValue(EarlybirdFieldConstant.RETWEET_COUNT, - MutableFeatureNormalizers.BYTE_NORMALIZER.normalize(message.getNumRetweets())); - } - if (message.getNumReplies() > 0) { - sink.setNumericValue(EarlybirdFieldConstant.REPLY_COUNT, - MutableFeatureNormalizers.BYTE_NORMALIZER.normalize(message.getNumReplies())); - } - - sink.setNumericValue(EarlybirdFieldConstant.VISIBLE_TOKEN_RATIO, normalizedTokenPercentBucket); - - EarlybirdEncodedFeatures encodedFeatures = - (EarlybirdEncodedFeatures) sink.getFeaturesForBaseField( - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD.getFieldName()); - updateLinkEncodedFeatures(encodedFeatures, message); - return encodedFeatures; - } - - /** - * Returns the extended encoded features. - */ - public static EarlybirdEncodedFeatures createExtendedEncodedFeaturesFromTwitterMessage( - TwitterMessage message, - PenguinVersion penguinVersion, - ImmutableSchemaInterface schema) { - FeatureSink sink = new FeatureSink(schema); - - TweetTextFeatures textFeatures = message.getTweetTextFeatures(penguinVersion); - - if (textFeatures != null) { - setExtendedEncodedFeatureIntValue(sink, schema, - EarlybirdFieldConstant.NUM_HASHTAGS_V2, textFeatures.getHashtagsSize()); - setExtendedEncodedFeatureIntValue(sink, schema, - EarlybirdFieldConstant.NUM_MENTIONS_V2, textFeatures.getMentionsSize()); - setExtendedEncodedFeatureIntValue(sink, schema, - EarlybirdFieldConstant.NUM_STOCKS, textFeatures.getStocksSize()); - } - - Optional referenceAuthorId = message.getReferenceAuthorId(); - if (referenceAuthorId.isPresent()) { - setEncodedReferenceAuthorId(sink, referenceAuthorId.get()); - } - - return (EarlybirdEncodedFeatures) sink.getFeaturesForBaseField( - EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD.getFieldName()); - } - - /** - * Updates all URL-related features, based on the values stored in the given message. - * - * @param encodedFeatures The features to be updated. - * @param message The message. - */ - public static void updateLinkEncodedFeatures( - EarlybirdEncodedFeatures encodedFeatures, TwitterMessage message) { - if (message.getLinkLocale() != null) { - encodedFeatures.setFeatureValue( - EarlybirdFieldConstant.LINK_LANGUAGE, - ThriftLanguageUtil.getThriftLanguageOf(message.getLinkLocale()).getValue()); - } - - if (message.hasCard()) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_CARD_FLAG); - } - - // Set HAS_IMAGE HAS_NEWS HAS_VIDEO etc. flags for expanded urls. 
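// --- Illustrative aside (not part of the original file) -------------------------------------
// The favorite/retweet/reply counts above are squeezed into single bytes via
// MutableFeatureNormalizers.BYTE_NORMALIZER (defined elsewhere). The LogByteNormalizer deleted
// later in this change shows the general idea: log-bucket a count so one byte covers a wide
// range. Sketch with a hypothetical count; the URL-flag logic described in the comment above
// continues right after this aside.
public final class CountNormalizationSketch {
  public static void main(String[] args) {
    long favorites = 1000;                    // hypothetical favorite count
    byte norm = logNormalize(favorites, 2.0); // 1 + floor(log2(1000)) = 10
    System.out.println("normalized byte = " + norm);
    // Only the bucket is recoverable, not the exact count: with base 2,
    // byte value 10 covers counts in roughly [512, 1024).
  }

  // Same formula as LogByteNormalizer.normalize() further down in this change.
  static byte logNormalize(double val, double base) {
    if (val < 0) {
      throw new IllegalArgumentException("cannot log-normalize negative value " + val);
    }
    if (val == 0) {
      return 0;
    }
    long logVal = 1 + (long) Math.floor(Math.log(val) / Math.log(base));
    return logVal > Byte.MAX_VALUE ? Byte.MAX_VALUE : (byte) logVal;
  }
}
// ---------------------------------------------------------------------------------------------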
- if (message.getExpandedUrlMapSize() > 0) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_LINK_FLAG); - - for (ThriftExpandedUrl url : message.getExpandedUrlMap().values()) { - if (url.isSetMediaType()) { - switch (url.getMediaType()) { - case NATIVE_IMAGE: - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG); - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_NATIVE_IMAGE_FLAG); - break; - case IMAGE: - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG); - break; - case VIDEO: - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG); - break; - case NEWS: - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_NEWS_URL_FLAG); - break; - case UNKNOWN: - break; - default: - throw new IllegalStateException("Unexpected enum value: " + url.getMediaType()); - } - } - } - } - - Set canonicalLastHopUrlsStrings = message.getCanonicalLastHopUrls(); - Set expandedUrlsStrings = message.getExpandedUrls() - .stream() - .map(ThriftExpandedUrl::getExpandedUrl) - .collect(Collectors.toSet()); - Set expandedAndLastHopUrlsStrings = new HashSet<>(); - expandedAndLastHopUrlsStrings.addAll(expandedUrlsStrings); - expandedAndLastHopUrlsStrings.addAll(canonicalLastHopUrlsStrings); - // Check both expanded and last hop url for consumer videos as consumer video urls are - // sometimes redirected to the url of the tweets containing the videos (SEARCH-42612). - if (NativeVideoClassificationUtils.hasConsumerVideo(expandedAndLastHopUrlsStrings)) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_CONSUMER_VIDEO_FLAG); - } - if (NativeVideoClassificationUtils.hasProVideo(canonicalLastHopUrlsStrings)) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_PRO_VIDEO_FLAG); - } - if (NativeVideoClassificationUtils.hasVine(canonicalLastHopUrlsStrings)) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_VINE_FLAG); - } - if (NativeVideoClassificationUtils.hasPeriscope(canonicalLastHopUrlsStrings)) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_PERISCOPE_FLAG); - } - if (LinkVisibilityUtils.hasVisibleLink(message.getExpandedUrls())) { - encodedFeatures.setFlag(EarlybirdFieldConstant.HAS_VISIBLE_LINK_FLAG); - } - } - - private static void setExtendedEncodedFeatureIntValue( - FeatureSink sink, - ImmutableSchemaInterface schema, - EarlybirdFieldConstant field, - int value) { - boolean fieldInSchema = schema.hasField(field.getFieldName()); - if (fieldInSchema) { - FeatureConfiguration featureConfig = - schema.getFeatureConfigurationByName(field.getFieldName()); - sink.setNumericValue(field, Math.min(value, featureConfig.getMaxValue())); - } - } - - private static void setEncodedReferenceAuthorId(FeatureSink sink, long referenceAuthorId) { - LongIntConverter.IntegerRepresentation ints = - LongIntConverter.convertOneLongToTwoInt(referenceAuthorId); - sink.setNumericValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT, ints.leastSignificantInt); - sink.setNumericValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT, ints.mostSignificantInt); - } -} diff --git a/src/java/com/twitter/search/common/encoding/docvalues/BUILD b/src/java/com/twitter/search/common/encoding/docvalues/BUILD deleted file mode 100644 index bc4756173..000000000 --- a/src/java/com/twitter/search/common/encoding/docvalues/BUILD +++ /dev/null @@ -1,20 +0,0 @@ -# Java library for docvalues and common stride field encoding utilities. 
-java_library( - sources = ["*.java"], - platform = "java8", - provides = artifact( - org = "com.twitter.search.common", - name = "encoding-docvalues", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/java/com/twitter/search/common/schema/base", - "src/thrift/com/twitter/search/common:schema-java", - ], -) diff --git a/src/java/com/twitter/search/common/encoding/docvalues/CSFTypeUtil.java b/src/java/com/twitter/search/common/encoding/docvalues/CSFTypeUtil.java deleted file mode 100644 index 1d6d2c0bb..000000000 --- a/src/java/com/twitter/search/common/encoding/docvalues/CSFTypeUtil.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.common.encoding.docvalues; - -public final class CSFTypeUtil { - private CSFTypeUtil() { - } - - /** - * Convert a long into a byte array, stored into dest. - */ - public static void convertToBytes(byte[] dest, int valueIndex, int value) { - int offset = valueIndex * Integer.BYTES; - dest[offset] = (byte) (value >>> 24); - dest[offset + 1] = (byte) (value >>> 16); - dest[offset + 2] = (byte) (value >>> 8); - dest[offset + 3] = (byte) value; - } - - /** - * Convert bytes into a long value. Inverse function of convertToBytes. - */ - public static int convertFromBytes(byte[] data, int startOffset, int valueIndex) { - // This should rarely happen, eg. when we get a corrupt ThriftIndexingEvent, we insert a new - // Document which is blank. Such a document results in a length 0 BytesRef. - if (data.length == 0) { - return 0; - } - - int offset = startOffset + valueIndex * Integer.BYTES; - return ((data[offset] & 0xFF) << 24) - | ((data[offset + 1] & 0xFF) << 16) - | ((data[offset + 2] & 0xFF) << 8) - | (data[offset + 3] & 0xFF); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/BUILD b/src/java/com/twitter/search/common/encoding/features/BUILD deleted file mode 100644 index 93b13c03f..000000000 --- a/src/java/com/twitter/search/common/encoding/features/BUILD +++ /dev/null @@ -1,17 +0,0 @@ -# Java library for feature encoding and decoding utilities. -java_library( - sources = ["*.java"], - platform = "java8", - provides = artifact( - org = "com.twitter.search.common", - name = "encoding-features", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/java/com/twitter/search/common/schema/base", - "src/thrift/com/twitter/search/common:indexing-java", - ], -) diff --git a/src/java/com/twitter/search/common/encoding/features/BinByteNormalizer.java b/src/java/com/twitter/search/common/encoding/features/BinByteNormalizer.java deleted file mode 100644 index 36abc323e..000000000 --- a/src/java/com/twitter/search/common/encoding/features/BinByteNormalizer.java +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import java.util.Map; -import java.util.SortedSet; -import java.util.TreeMap; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -/** - * Normalizes values to predefined bins. - * If the value to normalize is lower than the lowest bin defined, normalizes to Byte.MIN_VALUE. 
- */ -public class BinByteNormalizer extends ByteNormalizer { - - private final TreeMap bins = Maps.newTreeMap(); - private final TreeMap reverseBins = Maps.newTreeMap(); - - /** - * Constructs a normalizer using predefined bins. - * @param bins A mapping between the upper bound of a value and the bin it should normalize to. - * For example providing a map with 2 entries, {5=>1, 10=>2} will normalize as follows: - * values under 5: Byte.MIN_VALUE - * values between 5 and 10: 1 - * values over 10: 2 - */ - public BinByteNormalizer(final Map bins) { - Preconditions.checkNotNull(bins); - Preconditions.checkArgument(!bins.isEmpty(), "No bins provided"); - Preconditions.checkArgument(hasIncreasingValues(bins)); - this.bins.putAll(bins); - for (Map.Entry entry : bins.entrySet()) { - reverseBins.put(entry.getValue(), entry.getKey()); - } - } - - /** - * check that if key1 > key2 then val1 > val2 in the {@code map}. - */ - private static boolean hasIncreasingValues(final Map map) { - SortedSet orderedKeys = Sets.newTreeSet(map.keySet()); - byte prev = Byte.MIN_VALUE; - for (Double key : orderedKeys) { // save the unboxing - byte cur = map.get(key); - if (cur <= prev) { - return false; - } - prev = cur; - } - return true; - } - - @Override - public byte normalize(double val) { - Map.Entry lowerBound = bins.floorEntry(val); - return lowerBound == null - ? Byte.MIN_VALUE - : lowerBound.getValue(); - } - - @Override - public double unnormLowerBound(byte norm) { - return reverseBins.get(reverseBins.floorKey(norm)); - } - - @Override - public double unnormUpperBound(byte norm) { - return norm == reverseBins.lastKey() - ? Double.POSITIVE_INFINITY - : reverseBins.get(reverseBins.floorKey((byte) (1 + norm))); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/ByteNormalizer.java b/src/java/com/twitter/search/common/encoding/features/ByteNormalizer.java deleted file mode 100644 index 6a6845a12..000000000 --- a/src/java/com/twitter/search/common/encoding/features/ByteNormalizer.java +++ /dev/null @@ -1,38 +0,0 @@ -package com.twitter.search.common.encoding.features; - -/** - * Interface for compressing unbounded float values to a signed byte. It includes both - * normalization of values and encoding of values in a byte. - */ -public abstract class ByteNormalizer { - public static byte intToUnsignedByte(int i) { - return (byte) i; - } - - public static int unsignedByteToInt(byte b) { - return (int) b & 0xFF; - } - - /** - * Returns the byte-compressed value of {@code val}. - */ - public abstract byte normalize(double val); - - /** - * Returns a lower bound to the unnormalized range of {@code norm}. - */ - public abstract double unnormLowerBound(byte norm); - - /** - * Returns an upper bound to the unnormalized range of {@code norm}. 
- */ - public abstract double unnormUpperBound(byte norm); - - /** - * Returns true if the normalized value of {@code val} is different than the normalized value of - * {@code val - 1} - */ - public boolean changedNorm(double val) { - return normalize(val) != normalize(val - 1); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/ClampByteNormalizer.java b/src/java/com/twitter/search/common/encoding/features/ClampByteNormalizer.java deleted file mode 100644 index ec1d3faa9..000000000 --- a/src/java/com/twitter/search/common/encoding/features/ClampByteNormalizer.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import com.google.common.base.Preconditions; - -/** - * A byte normalizer that restricts the values to the given range before normalizing them. - */ -public class ClampByteNormalizer extends ByteNormalizer { - private final int minUnnormalizedValue; - private final int maxUnnormalizedValue; - - /** - * Creates a new ClampByteNormalizer instance. - * - * @param minValue The smallest allowed unnormalized value. - * @param maxValue The largest allowed unnormalized value. - */ - public ClampByteNormalizer(int minUnnormalizedValue, int maxUnnormalizedValue) { - Preconditions.checkState(minUnnormalizedValue <= maxUnnormalizedValue); - Preconditions.checkState(minUnnormalizedValue >= 0); - Preconditions.checkState(maxUnnormalizedValue <= 255); - this.minUnnormalizedValue = minUnnormalizedValue; - this.maxUnnormalizedValue = maxUnnormalizedValue; - } - - @Override - public byte normalize(double val) { - int adjustedValue = (int) val; - if (adjustedValue < minUnnormalizedValue) { - adjustedValue = minUnnormalizedValue; - } - if (adjustedValue > maxUnnormalizedValue) { - adjustedValue = maxUnnormalizedValue; - } - return ByteNormalizer.intToUnsignedByte(adjustedValue); - } - - @Override - public double unnormLowerBound(byte norm) { - return ByteNormalizer.unsignedByteToInt(norm); - } - - @Override - public double unnormUpperBound(byte norm) { - return ByteNormalizer.unsignedByteToInt(norm) + 1; - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/EncodedFeatures.java b/src/java/com/twitter/search/common/encoding/features/EncodedFeatures.java deleted file mode 100644 index f6d9b16bb..000000000 --- a/src/java/com/twitter/search/common/encoding/features/EncodedFeatures.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.common.encoding.features; - -/** - * Encodes multiple values (bytes or bits) into an integer. - */ -public class EncodedFeatures { - private int value; - - public final void setSerializedValue(int val) { - this.value = val; - } - - public final int getSerializedValue() { - return value; - } - - // setByte is agnostic to signed / unsigned bytes. - protected final EncodedFeatures setByte(byte count, int bitshift, long inverseMask) { - value = (int) ((value & inverseMask) | ((count & 0xffL) << bitshift)); - return this; - } - - /** - * Sets the value but only if greater. setByteIfGreater assumes unsigned bytes. 
- */ - public final EncodedFeatures setByteIfGreater(byte newCount, int bitshift, long inversemask) { - if ((getByte(bitshift) & 0xff) < (newCount & 0xff)) { - setByte(newCount, bitshift, inversemask); - } - return this; - } - - protected final int getByte(int bitshift) { - return (int) (((value & 0xffffffffL) >>> bitshift) & 0xffL); - } - - protected final int getByteMasked(int bitshift, long mask) { - return (int) (((value & mask) >>> bitshift) & 0xffL); - } - - protected final EncodedFeatures setBit(int bit, boolean flag) { - if (flag) { - value |= bit; - } else { - value &= ~bit; - } - return this; - } - - protected final boolean getBit(int bit) { - return (value & bit) != 0; - } - - @Override - public String toString() { - return String.format("%x", value); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/IntNormalizer.java b/src/java/com/twitter/search/common/encoding/features/IntNormalizer.java deleted file mode 100644 index 0a2477e46..000000000 --- a/src/java/com/twitter/search/common/encoding/features/IntNormalizer.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.common.encoding.features; - -/** - * Interface for processing different feature values into an int. It provides a one-way translation - * of encoding using com.twitter.search.common.encoding.features.ByteNormalizer and supports all the - * old normalizers. The difference is that we directly return the normalized int value - * (instead of converting from byte). - */ -public interface IntNormalizer { - /** - * Returns the normalized value of {@code val}. - * The value may be byte-compressed or as-is depending on the normalizer type - */ - int normalize(double val); -} diff --git a/src/java/com/twitter/search/common/encoding/features/IntegerEncodedFeatures.java b/src/java/com/twitter/search/common/encoding/features/IntegerEncodedFeatures.java deleted file mode 100644 index a86e079c3..000000000 --- a/src/java/com/twitter/search/common/encoding/features/IntegerEncodedFeatures.java +++ /dev/null @@ -1,159 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import java.util.List; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import com.twitter.search.common.indexing.thriftjava.PackedFeatures; -import com.twitter.search.common.schema.base.FeatureConfiguration; - -/** - * Class used to read/write integers encoded according to - * {@link com.twitter.search.common.schema.base.FeatureConfiguration} - * - * Implementations must override {@link #getInt(int pos)} and {@link #setInt(int pos, int value)}. - */ -public abstract class IntegerEncodedFeatures { - /** - * Returns the value at the given position. - */ - public abstract int getInt(int pos); - - /** - * Sets the given value at the given position. - */ - public abstract void setInt(int pos, int value); - - /** - * Get the maximum number of integers to hold features. - * @return the number of integers to represent all features. - */ - public abstract int getNumInts(); - - /** - * Test to see if the given feature is true or non-zero. Useful for one bit features. 
- * @param feature feature to examine - * @return true if feature is non-zero - */ - public boolean isFlagSet(FeatureConfiguration feature) { - return (getInt(feature.getValueIndex()) & feature.getBitMask()) != 0; - } - - public IntegerEncodedFeatures setFlag(FeatureConfiguration feature) { - setInt(feature.getValueIndex(), getInt(feature.getValueIndex()) | feature.getBitMask()); - return this; - } - - public IntegerEncodedFeatures clearFlag(FeatureConfiguration feature) { - setInt(feature.getValueIndex(), getInt(feature.getValueIndex()) & feature.getInverseBitMask()); - return this; - } - - /** - * Sets a boolean flag. - */ - public IntegerEncodedFeatures setFlagValue(FeatureConfiguration feature, boolean value) { - if (value) { - setFlag(feature); - } else { - clearFlag(feature); - } - return this; - } - - /** - * Get feature value - * @param feature feature to get - * @return the value of the feature - */ - public int getFeatureValue(FeatureConfiguration feature) { - return (getInt(feature.getValueIndex()) & feature.getBitMask()) - >>> feature.getBitStartPosition(); - } - - /** - * Set feature value - * @param feature feature to modify - * @param value value to set. - */ - public IntegerEncodedFeatures setFeatureValue(FeatureConfiguration feature, int value) { - Preconditions.checkState( - value <= feature.getMaxValue(), - "Feature value, %s, is greater than the max value allowed for this feature. " - + "Feature: %s, Max value: %s", - value, feature.getName(), feature.getMaxValue()); - - // Clear the value of the given feature in its int. - int temp = getInt(feature.getValueIndex()) & feature.getInverseBitMask(); - - // Set the new feature value. Applying the bit mask here ensures that other features in the - // same int are not modified by mistake. - temp |= (value << feature.getBitStartPosition()) & feature.getBitMask(); - - setInt(feature.getValueIndex(), temp); - return this; - } - - /** - * Sets feature value if greater than current value - * @param feature feature to modify - * @param value new value - */ - public IntegerEncodedFeatures setFeatureValueIfGreater(FeatureConfiguration feature, int value) { - if (value > getFeatureValue(feature)) { - setFeatureValue(feature, value); - } - return this; - } - - /** - * Increment a feature if its not at its maximum value. - * @return whether the feature is incremented. - */ - public boolean incrementIfNotMaximum(FeatureConfiguration feature) { - int newValue = getFeatureValue(feature) + 1; - if (newValue <= feature.getMaxValue()) { - setFeatureValue(feature, newValue); - return true; - } else { - return false; - } - } - - /** - * Copy these encoded features to a new PackedFeatures thrift struct. - */ - public PackedFeatures copyToPackedFeatures() { - return copyToPackedFeatures(new PackedFeatures()); - } - - /** - * Copy these encoded features to a PackedFeatures thrift struct. - */ - public PackedFeatures copyToPackedFeatures(PackedFeatures packedFeatures) { - Preconditions.checkNotNull(packedFeatures); - final List integers = Lists.newArrayListWithCapacity(getNumInts()); - for (int i = 0; i < getNumInts(); i++) { - integers.add(getInt(i)); - } - packedFeatures.setDeprecated_featureConfigurationVersion(0); - packedFeatures.setFeatures(integers); - return packedFeatures; - } - - /** - * Copy features from a packed features struct. 
- */ - public void readFromPackedFeatures(PackedFeatures packedFeatures) { - Preconditions.checkNotNull(packedFeatures); - List ints = packedFeatures.getFeatures(); - for (int i = 0; i < getNumInts(); i++) { - if (i < ints.size()) { - setInt(i, ints.get(i)); - } else { - setInt(i, 0); - } - } - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/LogByteNormalizer.java b/src/java/com/twitter/search/common/encoding/features/LogByteNormalizer.java deleted file mode 100644 index 0124d0be3..000000000 --- a/src/java/com/twitter/search/common/encoding/features/LogByteNormalizer.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import com.google.common.base.Preconditions; - -/** - * Normalizes values as follows: - * Positive numbers normalize to (1 + round(log_baseN(value))). - * Negative numbers throw. - * 0 will normalize to 0. - * The log base is 2 by default. - */ -public class LogByteNormalizer extends ByteNormalizer { - - private static final double DEFAULT_BASE = 2; - private final double base; - private final double logBase; - - public LogByteNormalizer(double base) { - Preconditions.checkArgument(base > 0); - this.base = base; - logBase = Math.log(base); - } - - public LogByteNormalizer() { - this(DEFAULT_BASE); - } - - @Override - public byte normalize(double val) { - if (val < 0) { - throw new IllegalArgumentException("Can't log-normalize negative value " + val); - } else if (val == 0) { - return 0; - } else { - long logVal = 1 + (long) Math.floor(Math.log(val) / logBase); - return logVal > Byte.MAX_VALUE ? Byte.MAX_VALUE : (byte) logVal; - } - } - - @Override - public double unnormLowerBound(byte norm) { - return norm < 0 - ? Double.NEGATIVE_INFINITY - : Math.floor(Math.pow(base, norm - 1)); - } - - @Override - public double unnormUpperBound(byte norm) { - return norm == Byte.MAX_VALUE - ? Double.POSITIVE_INFINITY - : Math.floor(Math.pow(base, norm)); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/PredictionScoreNormalizer.java b/src/java/com/twitter/search/common/encoding/features/PredictionScoreNormalizer.java deleted file mode 100644 index e02519f08..000000000 --- a/src/java/com/twitter/search/common/encoding/features/PredictionScoreNormalizer.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import com.google.common.base.Preconditions; - -/** - * A normalizer that normalizes the prediction score from a machine learning classifier, which - * ranges within [0.0, 1.0], to an integer value by multiplying by (10 ^ precision), and returns - * the rounded value. The lower the precision, the less amount of bits it takes to encode the score. - * @see #precision - * - * This normalizer also could denormalize the normalized value from integer back to double using the - * same precision. - */ -public class PredictionScoreNormalizer { - - private final int precision; - private final double normalizingBase; - - public PredictionScoreNormalizer(int precision) { - this.precision = precision; - this.normalizingBase = Math.pow(10, this.precision); - } - - /** - * Returns the normalized int value for prediction score {@code score} by multiplying - * by {@code normalizingBase}, and round the result. 
- * @throws IllegalArgumentException when parameter {@code score} is not within [0.0, 1.0] - */ - public int normalize(double score) { - Preconditions.checkArgument(isScoreWithinRange(score)); - return (int) Math.round(score * this.normalizingBase); - } - - /** - * Converts the normalized int value back to a double score by dividing by {@code normalizingBase} - * @throws IllegalStateException when the denormalized value is not within [0.0, 1.0] - */ - public double denormalize(int normalizedScore) { - double denormalizedValue = normalizedScore / this.normalizingBase; - if (!isScoreWithinRange(denormalizedValue)) { - throw new IllegalStateException( - String.format("The denormalized value %s is not within [0.0, 1.0]", denormalizedValue) - ); - } - return denormalizedValue; - } - - private static boolean isScoreWithinRange(double score) { - return 0.0 <= score && score <= 1.0; - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatNormalizer.java b/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatNormalizer.java deleted file mode 100644 index 32acc5048..000000000 --- a/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatNormalizer.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.common.encoding.features; - -/** - * Normalizes using the logic described in {@link SingleBytePositiveFloatUtil}. - */ -public class SingleBytePositiveFloatNormalizer extends ByteNormalizer { - - @Override - public byte normalize(double val) { - return SingleBytePositiveFloatUtil.toSingleBytePositiveFloat((float) val); - } - - @Override - public double unnormLowerBound(byte norm) { - return SingleBytePositiveFloatUtil.toJavaFloat(norm); - } - - /** - * Get the upper bound of the raw value for a normalized byte. - * @deprecated This is wrongly implemented, always use unnormLowerBound(), - * or use SmartIntegerNormalizer. - */ - @Override @Deprecated - public double unnormUpperBound(byte norm) { - return 1 + SingleBytePositiveFloatUtil.toJavaFloat(norm); - } - - /** - * Return the the post-log2 unnormalized value. This is only used for some legacy Earlybird - * features and scoring functions. - */ - public double unnormAndLog2(byte norm) { - return SingleBytePositiveFloatUtil.toLog2Double(norm); - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatUtil.java b/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatUtil.java deleted file mode 100644 index 2894241e8..000000000 --- a/src/java/com/twitter/search/common/encoding/features/SingleBytePositiveFloatUtil.java +++ /dev/null @@ -1,164 +0,0 @@ -package com.twitter.search.common.encoding.features; - -/** - * Util used to: - * - Encode a positive Java float into a single byte float - * - Decode a single byte into a positive Java float - * - * Configuration: - * - Exponent: higher 4 bits, base 10. - * - Mantissa: lower 4 bit, representing 1.0 to 9.0 - * - Exponent bias is 1. - * - * Formula: - * Max(Mantissa, 9) * 10 ^ (Exponent - 1) - * - * Smallest float: 0.0 (0000 0000) - * Smallest positive float: 1.0 * 10^-1 (0000 0001) - * Largest float: 9.0 * 10^13 (1110 1111) - * Infinity: (1111 0000) - * NaN: (1111 1000) - */ -public final class SingleBytePositiveFloatUtil { - private SingleBytePositiveFloatUtil() { } - - // 4 bits mantissa. 
Range [1.0, 10.0) is divided into 16 steps - public static final byte MAX_BYTE_VALUE = (byte) 0xEF; - public static final byte INFINITY = (byte) 0xF0; - public static final byte NOT_A_NUMBER = (byte) 0xF8; - private static final float STEP_SIZE = 1.0f; - private static final int EXPONENT_BIAS = 1; - private static final byte MIN_EXPONENT = -EXPONENT_BIAS; - private static final int MAX_EXPONENT = 14 - EXPONENT_BIAS; - private static final byte MANTISSA_MASK = 0x0F; - - /** - * Converts the given float into a single byte floating point number. - * This is used in the updater and OK to be a bit slow. - */ - public static byte toSingleBytePositiveFloat(float f) { - if (f < 0) { - throw new UnsupportedOperationException( - "Cannot encode negative floats into SingleBytePostiveFloat."); - } - - if (Float.compare(f, Float.POSITIVE_INFINITY) == 0) { - return INFINITY; - } - - if (Float.compare(f, Float.NaN) == 0) { - return NOT_A_NUMBER; - } - - int mantissa = 0; - int exponent = (int) Math.floor(Math.log10(f)); - // Overflow (Number too large), just return the largest possible value - if (exponent > MAX_EXPONENT) { - return MAX_BYTE_VALUE; - } - - // Underflow (Number too small), just return 0 - if (exponent < MIN_EXPONENT) { - return 0; - } - - int frac = Math.round(f / (float) Math.pow(10.0f, exponent) / STEP_SIZE); - mantissa = fractionToMantissaTable[frac]; - - return (byte) (((exponent + EXPONENT_BIAS) << 4) | mantissa); - } - - /** - * Called in Earlybird per hit and needs to be fast. - */ - public static float toJavaFloat(byte b) { - return BYTE_TO_FLOAT_CONVERSION_TABLE[b & 0xff]; - } - - // Table used for converting mantissa into a significant - private static float[] mantissaToFractionTable = { - // Decimal Matisa value - STEP_SIZE * 0, // 0000 - STEP_SIZE * 1, // 0001 - STEP_SIZE * 1, // 0010 - STEP_SIZE * 2, // 0011 - STEP_SIZE * 2, // 0100 - STEP_SIZE * 3, // 0101 - STEP_SIZE * 3, // 0110 - STEP_SIZE * 4, // 0111 - STEP_SIZE * 4, // 1000 - STEP_SIZE * 5, // 1001 - STEP_SIZE * 5, // 1010 - STEP_SIZE * 6, // 1011 - STEP_SIZE * 6, // 1100 - STEP_SIZE * 7, // 1101 - STEP_SIZE * 8, // 1110 - STEP_SIZE * 9 // 1111 - }; - - // Table used for converting fraction into mantissa. - // Reverse operation of the above - private static int[] fractionToMantissaTable = { - 0, // 0 - 1, // 1 - 3, // 2 - 5, // 3 - 7, // 4 - 9, // 5 - 11, // 6 - 13, // 7 - 14, // 8 - 15, // 9 - 15, // 10 (Edge case: because we round the fraction, we can get 10 here.) - }; - - public static final byte LARGEST_FRACTION_UNDER_ONE = (byte) (toSingleBytePositiveFloat(1f) - 1); - - /** - * Converts the given byte to java float. 
- */ - private static float toJavaFloatSlow(byte b) { - if (b == INFINITY) { - return Float.POSITIVE_INFINITY; - } - - if ((b & 0xff) > (INFINITY & 0xff)) { - return Float.NaN; - } - - int exponent = ((b & 0xff) >>> 4) - EXPONENT_BIAS; - int mantissa = b & MANTISSA_MASK; - return mantissaToFractionTable[mantissa] * (float) Math.pow(10.0f, exponent); - } - - // Cached results from byte to float conversion - private static final float[] BYTE_TO_FLOAT_CONVERSION_TABLE = new float[256]; - private static final double[] BYTE_TO_LOG2_CONVERSION_TABLE = new double[256]; - private static final byte[] OLD_TO_NEW_BYTE_CONVERSION_TABLE = new byte[256]; - - static { - LogByteNormalizer normalizer = new LogByteNormalizer(); - for (int i = 0; i < 256; i++) { - byte b = (byte) i; - BYTE_TO_FLOAT_CONVERSION_TABLE[i] = toJavaFloatSlow(b); - BYTE_TO_LOG2_CONVERSION_TABLE[i] = - 0xff & normalizer.normalize(BYTE_TO_FLOAT_CONVERSION_TABLE[i]); - if (b == 0) { - OLD_TO_NEW_BYTE_CONVERSION_TABLE[i] = 0; - } else if (b > 0) { - OLD_TO_NEW_BYTE_CONVERSION_TABLE[i] = - toSingleBytePositiveFloat((float) normalizer.unnormLowerBound(b)); - } else { - // should not get here. - OLD_TO_NEW_BYTE_CONVERSION_TABLE[i] = MAX_BYTE_VALUE; - } - } - } - - /** - * Convert a normalized byte to the log2() version of its original value - */ - static double toLog2Double(byte b) { - return BYTE_TO_LOG2_CONVERSION_TABLE[b & 0xff]; - } -} diff --git a/src/java/com/twitter/search/common/encoding/features/SmartIntegerNormalizer.java b/src/java/com/twitter/search/common/encoding/features/SmartIntegerNormalizer.java deleted file mode 100644 index f2655e294..000000000 --- a/src/java/com/twitter/search/common/encoding/features/SmartIntegerNormalizer.java +++ /dev/null @@ -1,150 +0,0 @@ -package com.twitter.search.common.encoding.features; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -/** - * A smart integer normalizer that converts an integer of a known range to a small integer up to - * 8 bits long. This normalizer generates a boundary value array in the constructor as the buckets - * for different values. - *

- * The normalized value has a nice properties: - * 1) it maintains the order of original value: if a > b, then normalize(a) > normalize(b). - * 2) the value 0 is always normalized to byte 0. - * 3) the normalized values are (almost) evenly distributed on the log scale - * 4) no waste in code space, all possible values representable by normalized bits are used, - * each corresponding to a different value. - */ -public class SmartIntegerNormalizer extends ByteNormalizer { - // The max value we want to support in this normalizer. If the input is larger than this value, - // it's normalized as if it's the maxValue. - private final int maxValue; - // Number of bits used for normalized value, the largest normalized value - // would be (1 << numBits) - 1. - private final int numBits; - // The inclusive lower bounds of all buckets. A normalized value k corresponds to original values - // in the inclusive-exclusive range - // [ boundaryValues[k], boundaryValues[k+1] ) - private final int[] boundaryValues; - // The length of the boundaryValues array, or the number of buckets. - private final int length; - - /** - * Construct a normalizer. - * - * @param maxValue max value it supports, must be larger than minValue. Anything larger than this - * would be treated as maxValue. - * @param numBits number of bits you want to use for this normalization, between 1 and 8. - * higher resolution for the lower numbers. - */ - public SmartIntegerNormalizer(int maxValue, int numBits) { - Preconditions.checkArgument(maxValue > 0); - Preconditions.checkArgument(numBits > 0 && numBits <= 8); - - this.maxValue = maxValue; - this.numBits = numBits; - - this.length = 1 << numBits; - this.boundaryValues = new int[length]; - - - int index; - for (index = length - 1; index >= 0; --index) { - // values are evenly distributed on the log scale - int boundary = (int) Math.pow(maxValue, (double) index / length); - // we have more byte slots left than we have possible boundary values (buckets), - // just give consecutive boundary values to all remaining slots, starting from 0. - if (boundary <= index) { - break; - } - boundaryValues[index] = boundary; - } - if (index >= 0) { - for (int i = 1; i <= index; ++i) { - boundaryValues[i] = i; - } - } - boundaryValues[0] = 0; // the first one is always 0. - } - - @Override - public byte normalize(double val) { - int intVal = (int) (val > maxValue ? maxValue : val); - return intToUnsignedByte(binarySearch(intVal, boundaryValues)); - } - - /** - * Return the lower bound of the bucket represent by norm. This simply returns the boundary - * value indexed by current norm. - */ - @Override - public double unnormLowerBound(byte norm) { - return boundaryValues[unsignedByteToInt(norm)]; - } - - /** - * Return the upper bound of the bucket represent by norm. This returns the next boundary value - * minus 1. If norm represents the last bucket, it returns the maxValue. - */ - @Override - public double unnormUpperBound(byte norm) { - // if it's already the last possible normalized value, just return the corresponding last - // boundary value. - int intNorm = unsignedByteToInt(norm); - if (intNorm == length - 1) { - return maxValue; - } - return boundaryValues[intNorm + 1] - 1; - } - - /** - * Do a binary search on array and find the index of the item that's no bigger than value. 
- */ - private static int binarySearch(int value, int[] array) { - // corner cases - if (value <= array[0]) { - return 0; - } else if (value >= array[array.length - 1]) { - return array.length - 1; - } - int left = 0; - int right = array.length - 1; - int pivot = (left + right) >> 1; - do { - int midVal = array[pivot]; - if (value == midVal) { - break; - } else if (value > midVal) { - left = pivot; - } else { - right = pivot; - } - pivot = (left + right) >> 1; - } while (pivot != left); - return pivot; - } - - @Override - public String toString() { - StringBuilder sb = new StringBuilder(String.format( - "Smart Integer Normalizer (numBits = %d, max = %d)\n", - this.numBits, this.maxValue)); - for (int i = 0; i < this.length; i++) { - sb.append(String.format( - "[%2d] boundary = %6d, range [ %6d, %6d ), norm: %4d | %4d | %4d %s\n", - i, boundaryValues[i], - (int) unnormLowerBound(intToUnsignedByte(i)), - (int) unnormUpperBound(intToUnsignedByte(i)), - unsignedByteToInt(normalize(boundaryValues[i] - 1)), - unsignedByteToInt(normalize(boundaryValues[i])), - unsignedByteToInt(normalize(boundaryValues[i] + 1)), - i == boundaryValues[i] ? "*" : "")); - } - return sb.toString(); - } - - @VisibleForTesting - int[] getBoundaryValues() { - return boundaryValues; - } -} diff --git a/src/java/com/twitter/search/common/query/BUILD b/src/java/com/twitter/search/common/query/BUILD deleted file mode 100644 index 5c4cd6330..000000000 --- a/src/java/com/twitter/search/common/query/BUILD +++ /dev/null @@ -1,25 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/java/com/twitter/search/queryparser/query/search:search-query-nodes", - ], -) diff --git a/src/java/com/twitter/search/common/query/BoostUtils.java b/src/java/com/twitter/search/common/query/BoostUtils.java deleted file mode 100644 index 10ae55942..000000000 --- a/src/java/com/twitter/search/common/query/BoostUtils.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.common.query; - -import org.apache.lucene.search.BoostQuery; -import org.apache.lucene.search.Query; - -/** - * A class of utilities related to query boosts. - */ -public final class BoostUtils { - private BoostUtils() { - } - - /** - * Wraps the given query into a BoostQuery, if {@code boost} is not equal to 1.0f. - * - * @param query The query. - * @param boost The boost. - * @return If {@code boost} is equal to 1.0f, then {@code query} is returned; otherwise, - * {@code query} is wrapped into a {@code BoostQuery} instance with the given boost. 
- */ - public static Query maybeWrapInBoostQuery(Query query, float boost) { - if (boost == 1.0f) { - return query; - } - return new BoostQuery(query, boost); - } -} diff --git a/src/java/com/twitter/search/common/query/CollectAnnotationsVisitor.java b/src/java/com/twitter/search/common/query/CollectAnnotationsVisitor.java deleted file mode 100644 index 457ace646..000000000 --- a/src/java/com/twitter/search/common/query/CollectAnnotationsVisitor.java +++ /dev/null @@ -1,92 +0,0 @@ -package com.twitter.search.common.query; - - -import java.util.Map; -import java.util.Set; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import com.twitter.search.queryparser.query.BooleanQuery; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Operator; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.QueryVisitor; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.annotation.Annotation; - -/** - * Collect the nodes with a specified annotation type in the given query. - */ -public class CollectAnnotationsVisitor extends QueryVisitor { - - protected final Annotation.Type type; - - protected final Map nodeToTypeMap = Maps.newIdentityHashMap(); - - public CollectAnnotationsVisitor(Annotation.Type type) { - this.type = Preconditions.checkNotNull(type); - } - - @Override - public Boolean visit(Disjunction disjunction) throws QueryParserException { - return visitBooleanQuery(disjunction); - } - - @Override - public Boolean visit(Conjunction conjunction) throws QueryParserException { - return visitBooleanQuery(conjunction); - } - - @Override - public Boolean visit(Phrase phrase) throws QueryParserException { - return visitQuery(phrase); - } - - @Override - public Boolean visit(Term term) throws QueryParserException { - return visitQuery(term); - } - - @Override - public Boolean visit(Operator operator) throws QueryParserException { - return visitQuery(operator); - } - - @Override - public Boolean visit(SpecialTerm special) throws QueryParserException { - return visitQuery(special); - } - - protected boolean visitQuery(Query query) throws QueryParserException { - if (query.hasAnnotationType(type)) { - collectNode(query); - return true; - } - return false; - } - - protected void collectNode(Query query) { - nodeToTypeMap.put(query, true); - } - - protected boolean visitBooleanQuery(BooleanQuery query) throws QueryParserException { - boolean found = false; - if (query.hasAnnotationType(type)) { - collectNode(query); - found = true; - } - for (Query child : query.getChildren()) { - found |= child.accept(this); - } - return found; - } - - public Set getNodes() { - return nodeToTypeMap.keySet(); - } -} diff --git a/src/java/com/twitter/search/common/query/CollectQueryTypeVisitor.java b/src/java/com/twitter/search/common/query/CollectQueryTypeVisitor.java deleted file mode 100644 index 0e135991e..000000000 --- a/src/java/com/twitter/search/common/query/CollectQueryTypeVisitor.java +++ /dev/null @@ -1,89 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.Map; -import java.util.Set; - -import com.google.common.collect.Maps; - -import com.twitter.search.queryparser.query.BooleanQuery; 
-import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Operator; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.QueryVisitor; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; - -/** - * Collects the nodes with a specified query type in the given query. - */ -public class CollectQueryTypeVisitor extends QueryVisitor { - - protected final Query.QueryType queryType; - - protected final Map nodeToTypeMap = Maps.newIdentityHashMap(); - - public CollectQueryTypeVisitor(Query.QueryType queryType) { - this.queryType = queryType; - } - - @Override - public Boolean visit(Disjunction disjunction) throws QueryParserException { - return visitBooleanQuery(disjunction); - } - - @Override - public Boolean visit(Conjunction conjunction) throws QueryParserException { - return visitBooleanQuery(conjunction); - } - - @Override - public Boolean visit(Phrase phrase) throws QueryParserException { - return visitQuery(phrase); - } - - @Override - public Boolean visit(Term term) throws QueryParserException { - return visitQuery(term); - } - - @Override - public Boolean visit(Operator operator) throws QueryParserException { - return visitQuery(operator); - } - - @Override - public Boolean visit(SpecialTerm special) throws QueryParserException { - return visitQuery(special); - } - - public Set getCollectedNodes() { - return nodeToTypeMap.keySet(); - } - - protected boolean visitQuery(Query query) throws QueryParserException { - if (query.isTypeOf(queryType)) { - collectNode(query); - return true; - } - return false; - } - - protected void collectNode(Query query) { - nodeToTypeMap.put(query, true); - } - - protected boolean visitBooleanQuery(BooleanQuery query) throws QueryParserException { - boolean found = false; - if (query.isTypeOf(queryType)) { - collectNode(query); - found = true; - } - for (Query child : query.getChildren()) { - found |= child.accept(this); - } - return found; - } -} diff --git a/src/java/com/twitter/search/common/query/CollectVariantVisitor.java b/src/java/com/twitter/search/common/query/CollectVariantVisitor.java deleted file mode 100644 index a66961d7f..000000000 --- a/src/java/com/twitter/search/common/query/CollectVariantVisitor.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.common.query; - -import com.twitter.search.queryparser.query.annotation.Annotation; - - -/** - * A visitor that collects the nodes that have :v annotation - */ -public class CollectVariantVisitor extends CollectAnnotationsVisitor { - public CollectVariantVisitor() { - super(Annotation.Type.VARIANT); - } -} diff --git a/src/java/com/twitter/search/common/query/DefaultFilterWeight.java b/src/java/com/twitter/search/common/query/DefaultFilterWeight.java deleted file mode 100644 index 5fcc14433..000000000 --- a/src/java/com/twitter/search/common/query/DefaultFilterWeight.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.Set; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Explanation; -import 
org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * An abstract Weight implementation that can be used by all "filter" classes (Query instances that - * should not contribute to the overall query score). - */ -public abstract class DefaultFilterWeight extends Weight { - public DefaultFilterWeight(Query query) { - super(query); - } - - @Override - public void extractTerms(Set terms) { - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - Scorer scorer = scorer(context); - if ((scorer != null) && (scorer.iterator().advance(doc) == doc)) { - return Explanation.match(0f, "Match on id " + doc); - } - return Explanation.match(0f, "No match on id " + doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - DocIdSetIterator disi = getDocIdSetIterator(context); - if (disi == null) { - return null; - } - - return new ConstantScoreScorer(this, 0.0f, ScoreMode.COMPLETE_NO_SCORES, disi); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return false; - } - - /** - * Returns the DocIdSetIterator over which the scorers created by this weight need to iterate. - * - * @param context The LeafReaderContext instance used to create the scorer. - */ - protected abstract DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) - throws IOException; -} diff --git a/src/java/com/twitter/search/common/query/DocIdFilter.java b/src/java/com/twitter/search/common/query/DocIdFilter.java deleted file mode 100644 index fed309f86..000000000 --- a/src/java/com/twitter/search/common/query/DocIdFilter.java +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.Set; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * Lucene filter on top of a known docid - * - */ -public class DocIdFilter extends Query { - private final int docid; - - public DocIdFilter(int docid) { - this.docid = docid; - } - - @Override - public Weight createWeight( - IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException { - return new Weight(this) { - @Override - public void extractTerms(Set terms) { - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - Scorer scorer = scorer(context); - if ((scorer != null) && (scorer.iterator().advance(doc) == doc)) { - return Explanation.match(0f, "Match on id " + doc); - } - return Explanation.match(0f, "No match on id " + doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - return new ConstantScoreScorer(this, 0.0f, scoreMode, new SingleDocDocIdSetIterator(docid)); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return true; - } - }; - } - - @Override - public int hashCode() { - return docid; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof DocIdFilter)) { - return false; - } - - return docid == DocIdFilter.class.cast(obj).docid; - } - - @Override - public String toString(String field) { 
- return "DOC_ID_FILTER[docId=" + docid + " + ]"; - } -} diff --git a/src/java/com/twitter/search/common/query/FieldRankHitInfo.java b/src/java/com/twitter/search/common/query/FieldRankHitInfo.java deleted file mode 100644 index f7d509719..000000000 --- a/src/java/com/twitter/search/common/query/FieldRankHitInfo.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.common.query; - -/** - * When a hit (on a part of the query tree) occurs, this class is passed to HitAttributeCollector - * for collection. - * - * This implementation carries the following info: - *

    - *
- *   - The field that matched (the field ID is recorded)
- *   - The query node that matched (the query node rank is recorded)
- *   - The ID of the last doc that matched this query
- * - * Each IdentifiableQuery should be associated with one FieldRankHitInfo, which is passed to a - * HitAttributeCollector when a hit occurs. - */ -public class FieldRankHitInfo { - protected static final int UNSET_DOC_ID = -1; - - private final int fieldId; - private final int rank; - private int docId = UNSET_DOC_ID; - - public FieldRankHitInfo(int fieldId, int rank) { - this.fieldId = fieldId; - this.rank = rank; - } - - public int getFieldId() { - return fieldId; - } - - public int getRank() { - return rank; - } - - public int getDocId() { - return docId; - } - - public void setDocId(int docId) { - this.docId = docId; - } - - public void resetDocId() { - this.docId = UNSET_DOC_ID; - } -} diff --git a/src/java/com/twitter/search/common/query/FieldWeightUtil.java b/src/java/com/twitter/search/common/query/FieldWeightUtil.java deleted file mode 100644 index dcb7d08a8..000000000 --- a/src/java/com/twitter/search/common/query/FieldWeightUtil.java +++ /dev/null @@ -1,205 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.Collections; -import java.util.EnumSet; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.base.Enums; -import com.google.common.base.Function; -import com.google.common.base.Functions; -import com.google.common.base.Predicates; -import com.google.common.collect.FluentIterable; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Iterables; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.annotation.FieldAnnotationUtils; -import com.twitter.search.queryparser.query.annotation.FieldNameWithBoost; - -public final class FieldWeightUtil { - private static final Logger LOG = LoggerFactory.getLogger(FieldWeightUtil.class); - private FieldWeightUtil() { - } - - /** - * Combines default field weight configuration with field annotations and returns a - * field-to-weight map. - * - * @param query The query whose annotations we will look into - * @param defaultFieldWeightMap field-to-FieldWeightDefault map - * @param enabledFieldWeightMap for optimization, this is the field-to-weight map inferred from - * the field-to-FieldWeightDefault map - * @param fieldNameToTyped A function that can turn string field name to typed field - * @param The typed field - */ - public static ImmutableMap combineDefaultWithAnnotation( - Query query, - Map defaultFieldWeightMap, - Map enabledFieldWeightMap, - Function fieldNameToTyped) throws QueryParserException { - return combineDefaultWithAnnotation( - query, - defaultFieldWeightMap, - enabledFieldWeightMap, - fieldNameToTyped, - Collections.emptyMap(), - Functions.forMap(Collections.emptyMap(), "")); - } - - /** - * Combines default field weight configuration with field annotations and returns a - * field-to-weight map. 
Also maps generic mappable fields to field weight boosts and resolves them - * - * @param query The query whose annotations we will look into - * @param defaultFieldWeightMap field-to-FieldWeightDefault map - * @param enabledFieldWeightMap for optimization, this is the field-to-weight map inferred from - * the field-to-FieldWeightDefault map - * @param fieldNameToTyped A function that can turn a string field name to typed field - * @param mappableFieldMap mapping of mappable fields to the corresponding typed fields - * @param typedToFieldName A function that can turn a typed field into a string field name - * @param The typed field - * - * Note: As a result of discussion on SEARCH-24029, we now allow replace and remove annotations - * on a single term. See http://go/fieldweight for info on field weight annotations. - */ - public static ImmutableMap combineDefaultWithAnnotation( - Query query, - Map defaultFieldWeightMap, - Map enabledFieldWeightMap, - Function fieldNameToTyped, - Map mappableFieldMap, - Function typedToFieldName) throws QueryParserException { - List fieldAnnotations = query.getAllAnnotationsOf(Annotation.Type.FIELD); - List mappableFieldAnnotations = - query.getAllAnnotationsOf(Annotation.Type.MAPPABLE_FIELD); - - if (fieldAnnotations.isEmpty() && mappableFieldAnnotations.isEmpty()) { - return ImmutableMap.copyOf(enabledFieldWeightMap); - } - - // Convert mapped fields to field annotations - Iterable fieldAnnotationsForMappedFields = - FluentIterable.from(mappableFieldAnnotations) - .transform(FieldWeightUtil.fieldAnnotationForMappableField(mappableFieldMap, - typedToFieldName)) - .filter(Predicates.notNull()); - - Iterable annotations = - Iterables.concat(fieldAnnotationsForMappedFields, fieldAnnotations); - - // Sanitize the field annotations first, remove the ones we don't know - // for REPLACE and REMOVE. - List sanitizedFields = Lists.newArrayList(); - Set seenModifierTypes = - EnumSet.noneOf(FieldNameWithBoost.FieldModifier.class); - - for (Annotation annotation : annotations) { - FieldNameWithBoost fieldNameWithBoost = (FieldNameWithBoost) annotation.getValue(); - T typedField = fieldNameToTyped.apply(fieldNameWithBoost.getFieldName()); - FieldNameWithBoost.FieldModifier modifier = fieldNameWithBoost.getFieldModifier(); - if (defaultFieldWeightMap.containsKey(typedField)) { - seenModifierTypes.add(modifier); - sanitizedFields.add(fieldNameWithBoost); - } - } - - // Even if there is no mapping for a mapped annotation, if a query is replaced by an unknown - // mapping, it should not map to other fields, so we need to detect a REPLACE annotation - if (seenModifierTypes.isEmpty() - && FieldAnnotationUtils.hasReplaceAnnotation(mappableFieldAnnotations)) { - seenModifierTypes.add(FieldNameWithBoost.FieldModifier.REPLACE); - } - - boolean onlyHasReplace = seenModifierTypes.size() == 1 - && seenModifierTypes.contains(FieldNameWithBoost.FieldModifier.REPLACE); - - // If we only have replace, start with an empty map, otherwise, start with all enabled fields. - Map actualMap = onlyHasReplace - ? Maps.newLinkedHashMap() - : Maps.newLinkedHashMap(enabledFieldWeightMap); - - // Go over all field annotations and apply them. 
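-    // REMOVE drops the field from the map; ADD and REPLACE put the field into the map, using the
-    // annotation's boost when present and the schema's default weight otherwise.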
- for (FieldNameWithBoost fieldAnnotation : sanitizedFields) { - T typedField = fieldNameToTyped.apply(fieldAnnotation.getFieldName()); - FieldNameWithBoost.FieldModifier modifier = fieldAnnotation.getFieldModifier(); - switch (modifier) { - case REMOVE: - actualMap.remove(typedField); - break; - - case ADD: - case REPLACE: - if (fieldAnnotation.getBoost().isPresent()) { - actualMap.put(typedField, fieldAnnotation.getBoost().get()); - } else { - // When annotation does not specify weight, use default weight - actualMap.put( - typedField, - defaultFieldWeightMap.get(typedField).getWeight()); - } - break; - default: - throw new QueryParserException("Unknown field annotation type: " + fieldAnnotation); - } - } - - return ImmutableMap.copyOf(actualMap); - } - - public static ImmutableMap combineDefaultWithAnnotation( - Query query, - Map defaultFieldWeightMap, - Map enabledFieldWeightMap) throws QueryParserException { - - return combineDefaultWithAnnotation( - query, defaultFieldWeightMap, enabledFieldWeightMap, Functions.identity()); - } - - /** - * Create an annotation of the FIELD type from annotations of the MAPPED_FIELD type - * @param mappableFieldMap mapping of mappable fields to the corresponding typed fields - * @param typedToFieldName A function that can turn a typed field into a string field name - * @param The typed field - * @return an Annotation with the same modifier and boost for a FIELD as the incoming MAPPED_FIELD - * annotation - */ - private static Function fieldAnnotationForMappableField( - final Map mappableFieldMap, - final Function typedToFieldName) { - return new Function() { - @Nullable - @Override - public Annotation apply(Annotation mappableAnnotation) { - FieldNameWithBoost fieldNameWithBoost = (FieldNameWithBoost) mappableAnnotation.getValue(); - MappableField mappedField = - Enums.getIfPresent( - MappableField.class, - fieldNameWithBoost.getFieldName().toUpperCase()).orNull(); - T typedFieldName = mappableFieldMap.get(mappedField); - Annotation fieldAnnotation = null; - if (typedFieldName != null) { - String fieldName = typedToFieldName.apply(typedFieldName); - FieldNameWithBoost mappedFieldBoost = - new FieldNameWithBoost( - fieldName, - fieldNameWithBoost.getBoost(), - fieldNameWithBoost.getFieldModifier()); - fieldAnnotation = Annotation.Type.FIELD.newInstance(mappedFieldBoost); - } - return fieldAnnotation; - } - }; - } -} diff --git a/src/java/com/twitter/search/common/query/FilteredQuery.java b/src/java/com/twitter/search/common/query/FilteredQuery.java deleted file mode 100644 index a4740970b..000000000 --- a/src/java/com/twitter/search/common/query/FilteredQuery.java +++ /dev/null @@ -1,225 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * A pairing of a query and a filter. The hits traversal is driven by the query's DocIdSetIterator, - * and the filter is used only to do post-filtering. 
In other words, the filter is never used to
- * find the next doc ID: it's only used to filter out the doc IDs returned by the query's
- * DocIdSetIterator. This is useful when we need to have a conjunction between a query that can
- * quickly iterate through doc IDs (e.g. a posting list) and an expensive filter (e.g. a filter
- * based on the values stored in a CSF).
- *
- * For example, let's say we want to build a query that returns all docs that have at least
- * 100 faves.
- * 1. One option is to go with the [min_faves 100] query. This would be very expensive though,
- *    because this query would have to walk through every doc in the segment and for each one of
- *    them it would have to extract the number of faves from the forward index.
- * 2. Another option is to go with a conjunction between this query and the HAS_ENGAGEMENT filter:
- *    (+[min_faves 100] +[cached_filter has_engagements]). The HAS_ENGAGEMENT filter could
- *    traverse the doc ID space faster (if it's backed by a posting list). But this approach would
- *    still be slow, because as soon as the HAS_ENGAGEMENT filter finds a doc ID, the conjunction
- *    scorer would trigger an advance(docID) call on the min_faves part of the query, which has
- *    the same problem as the first option.
- * 3. Finally, a better option for this particular case would be to drive by the HAS_ENGAGEMENT
- *    filter (because it can quickly jump over all docs that do not have any engagement), and use
- *    the min_faves filter as a post-processing step, on a much smaller set of docs.
- */
-public class FilteredQuery extends Query {
-  /**
-   * A doc ID predicate that determines if the given doc ID should be accepted.
-   */
-  @FunctionalInterface
-  public static interface DocIdFilter {
-    /**
-     * Determines if the given doc ID should be accepted.
-     */
-    boolean accept(int docId) throws IOException;
-  }
-
-  /**
-   * A factory for creating DocIdFilter instances based on a given LeafReaderContext instance.
-   */
-  @FunctionalInterface
-  public static interface DocIdFilterFactory {
-    /**
-     * Returns a DocIdFilter instance for the given LeafReaderContext instance.
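-     *
-     * For illustration only (assuming a hypothetical per-segment faveCounts(context, docId)
-     * helper, which is not part of this class), such a factory can be written as a lambda:
-     *   context -> docId -> faveCounts(context, docId) >= 100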
- */ - DocIdFilter getDocIdFilter(LeafReaderContext context) throws IOException; - } - - private static class FilteredQueryDocIdSetIterator extends DocIdSetIterator { - private final DocIdSetIterator queryScorerIterator; - private final DocIdFilter docIdFilter; - - public FilteredQueryDocIdSetIterator( - DocIdSetIterator queryScorerIterator, DocIdFilter docIdFilter) { - this.queryScorerIterator = Preconditions.checkNotNull(queryScorerIterator); - this.docIdFilter = Preconditions.checkNotNull(docIdFilter); - } - - @Override - public int docID() { - return queryScorerIterator.docID(); - } - - @Override - public int nextDoc() throws IOException { - int docId; - do { - docId = queryScorerIterator.nextDoc(); - } while (docId != NO_MORE_DOCS && !docIdFilter.accept(docId)); - return docId; - } - - @Override - public int advance(int target) throws IOException { - int docId = queryScorerIterator.advance(target); - if (docId == NO_MORE_DOCS || docIdFilter.accept(docId)) { - return docId; - } - return nextDoc(); - } - - @Override - public long cost() { - return queryScorerIterator.cost(); - } - } - - private static class FilteredQueryScorer extends Scorer { - private final Scorer queryScorer; - private final DocIdFilter docIdFilter; - - public FilteredQueryScorer(Weight weight, Scorer queryScorer, DocIdFilter docIdFilter) { - super(weight); - this.queryScorer = Preconditions.checkNotNull(queryScorer); - this.docIdFilter = Preconditions.checkNotNull(docIdFilter); - } - - @Override - public int docID() { - return queryScorer.docID(); - } - - @Override - public float score() throws IOException { - return queryScorer.score(); - } - - @Override - public DocIdSetIterator iterator() { - return new FilteredQueryDocIdSetIterator(queryScorer.iterator(), docIdFilter); - } - - @Override - public float getMaxScore(int upTo) throws IOException { - return queryScorer.getMaxScore(upTo); - } - } - - private static class FilteredQueryWeight extends Weight { - private final Weight queryWeight; - private final DocIdFilterFactory docIdFilterFactory; - - public FilteredQueryWeight( - FilteredQuery query, Weight queryWeight, DocIdFilterFactory docIdFilterFactory) { - super(query); - this.queryWeight = Preconditions.checkNotNull(queryWeight); - this.docIdFilterFactory = Preconditions.checkNotNull(docIdFilterFactory); - } - - @Override - public void extractTerms(Set terms) { - queryWeight.extractTerms(terms); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - return queryWeight.explain(context, doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Scorer queryScorer = queryWeight.scorer(context); - if (queryScorer == null) { - return null; - } - - return new FilteredQueryScorer(this, queryScorer, docIdFilterFactory.getDocIdFilter(context)); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return queryWeight.isCacheable(ctx); - } - } - - private final Query query; - private final DocIdFilterFactory docIdFilterFactory; - - public FilteredQuery(Query query, DocIdFilterFactory docIdFilterFactory) { - this.query = Preconditions.checkNotNull(query); - this.docIdFilterFactory = Preconditions.checkNotNull(docIdFilterFactory); - } - - public Query getQuery() { - return query; - } - - @Override - public Query rewrite(IndexReader reader) throws IOException { - Query rewrittenQuery = query.rewrite(reader); - if (rewrittenQuery != query) { - return new FilteredQuery(rewrittenQuery, docIdFilterFactory); - } - return 
this; - } - - @Override - public int hashCode() { - return query.hashCode() * 13 + docIdFilterFactory.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof FilteredQuery)) { - return false; - } - - FilteredQuery filteredQuery = FilteredQuery.class.cast(obj); - return query.equals(filteredQuery.query) - && docIdFilterFactory.equals(filteredQuery.docIdFilterFactory); - } - - @Override - public String toString(String field) { - StringBuilder sb = new StringBuilder(); - sb.append("FilteredQuery(") - .append(query) - .append(" -> ") - .append(docIdFilterFactory) - .append(")"); - return sb.toString(); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - Weight queryWeight = Preconditions.checkNotNull(query.createWeight(searcher, scoreMode, boost)); - return new FilteredQueryWeight(this, queryWeight, docIdFilterFactory); - } -} diff --git a/src/java/com/twitter/search/common/query/FilteredScorer.java b/src/java/com/twitter/search/common/query/FilteredScorer.java deleted file mode 100644 index 41d9032f6..000000000 --- a/src/java/com/twitter/search/common/query/FilteredScorer.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; - -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.Weight; - -public class FilteredScorer extends Scorer { - protected final Scorer inner; - - public FilteredScorer(Weight weight, Scorer inner) { - super(weight); - this.inner = inner; - } - - @Override - public float score() throws IOException { - return inner.score(); - } - - @Override - public int docID() { - return inner.docID(); - } - - @Override - public DocIdSetIterator iterator() { - return inner.iterator(); - } - - @Override - public float getMaxScore(int upTo) throws IOException { - return inner.getMaxScore(upTo); - } -} diff --git a/src/java/com/twitter/search/common/query/HitAttributeCollector.java b/src/java/com/twitter/search/common/query/HitAttributeCollector.java deleted file mode 100644 index 21844aa71..000000000 --- a/src/java/com/twitter/search/common/query/HitAttributeCollector.java +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.List; -import java.util.Map; -import java.util.function.BiFunction; -import java.util.function.Function; - -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.Query; - -/** - * Not threadsafe, but should be reused across different queries unless the size of the existing - * one is too small for a new huge serialized query. - */ -public class HitAttributeCollector { - private final List hitInfos = Lists.newArrayList(); - private final BiFunction hitInfoSupplier; - - private int docBase = 0; - - public HitAttributeCollector() { - this.hitInfoSupplier = FieldRankHitInfo::new; - } - - /** - * Constructs a new {@code HitAttributionCollector} with the specified {@code FieldRankHitInfo} - * supplier. - * - * @param hitInfoSupplier function to supply a {@code FieldRankHitInfo} instance - */ - public HitAttributeCollector(BiFunction hitInfoSupplier) { - this.hitInfoSupplier = hitInfoSupplier; - } - - /** - * Creates a new IdentifiableQuery for the given query, fieldId and rank, and "registers" - * the fieldId and the rank with this collector. - * - * @param query the query to be wrapped. 
- * @param fieldId the ID of the field to be searched. - * @param rank The rank of this query. - * @return A new IdentifiableQuery instance for the given query, fieldId and rank. - */ - public IdentifiableQuery newIdentifiableQuery(Query query, int fieldId, int rank) { - FieldRankHitInfo fieldRankHitInfo = hitInfoSupplier.apply(fieldId, rank); - hitInfos.add(fieldRankHitInfo); - return new IdentifiableQuery(query, fieldRankHitInfo, this); - } - - public void clearHitAttributions(LeafReaderContext ctx, FieldRankHitInfo hitInfo) { - docBase = ctx.docBase; - hitInfo.resetDocId(); - } - - public void collectScorerAttribution(int docId, FieldRankHitInfo hitInfo) { - hitInfo.setDocId(docId + docBase); - } - - /** - * This method should be called when a global hit occurs. - * This method returns hit attribution summary for the whole query tree. - * This supports getting hit attribution for only the curDoc. - * - * @param docId docId passed in for checking against curDoc. - * @return Returns a map from node rank to a set of matching field IDs. This map does not contain - * entries for ranks that did not hit at all. - */ - public Map> getHitAttribution(int docId) { - return getHitAttribution(docId, (fieldId) -> fieldId); - } - - /** - * This method should be called when a global hit occurs. - * This method returns hit attribution summary for the whole query tree. - * This supports getting hit attribution for only the curDoc. - * - * @param docId docId passed in for checking against curDoc. - * @param fieldIdFunc The mapping of field IDs to objects of type T. - * @return Returns a map from node rank to a set of matching objects (usually field IDs or names). - * This map does not contain entries for ranks that did not hit at all. - */ - public Map> getHitAttribution(int docId, Function fieldIdFunc) { - int key = docId + docBase; - Map> hitMap = Maps.newHashMap(); - - // Manually iterate through all hitInfos elements. It's slightly faster than using an Iterator. - for (FieldRankHitInfo hitInfo : hitInfos) { - if (hitInfo.getDocId() == key) { - int rank = hitInfo.getRank(); - List rankHits = hitMap.computeIfAbsent(rank, k -> Lists.newArrayList()); - T fieldDescription = fieldIdFunc.apply(hitInfo.getFieldId()); - rankHits.add(fieldDescription); - } - } - - return hitMap; - } -} diff --git a/src/java/com/twitter/search/common/query/HitAttributeHelper.java b/src/java/com/twitter/search/common/query/HitAttributeHelper.java deleted file mode 100644 index 572f7b855..000000000 --- a/src/java/com/twitter/search/common/query/HitAttributeHelper.java +++ /dev/null @@ -1,102 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.List; -import java.util.Map; -import java.util.function.Function; - -import com.google.common.collect.Maps; - -import com.twitter.search.queryparser.query.Query; - -import static com.twitter.search.common.query.FieldRankHitInfo.UNSET_DOC_ID; - -/** - * Generic helper class containing the data needed to set up and collect field hit attributions. - */ -public class HitAttributeHelper implements HitAttributeProvider { - private final HitAttributeCollector collector; - private final Function fieldIdsToFieldNames; - - // This is a mapping of type T query nodes to rank id - private final Map nodeToRankMap; - - // This is meant to expand individual Query nodes into multiple ranks, - // for example, expanding a multi_term_disjunction to include a rank for each disjunction value. 
- private final Map> expandedNodeToRankMap; - - // A single-entry cache for hit attribution, so we can reuse the immediate result. Will be used - // only when lastDocId matches - private ThreadLocal>> lastHitAttrHolder = new ThreadLocal<>(); - private ThreadLocal lastDocIdHolder = ThreadLocal.withInitial(() -> UNSET_DOC_ID); - - protected HitAttributeHelper( - HitAttributeCollector collector, - Function fieldIdsToFieldNames, - Map nodeToRankMap, - Map> expandedNodeToRankMap) { - this.collector = collector; - this.fieldIdsToFieldNames = fieldIdsToFieldNames; - this.nodeToRankMap = nodeToRankMap; - this.expandedNodeToRankMap = expandedNodeToRankMap; - } - - /** - * Constructs a new {@code HitAttributeHelper} with the specified {@code HitAttributeCollector} - * instance and fields. - * - * @param collector a collector instance - * @param fieldIdsToFieldNames a list of field names indexed by id - */ - public HitAttributeHelper(HitAttributeCollector collector, String[] fieldIdsToFieldNames) { - this(collector, - (fieldId) -> fieldIdsToFieldNames[fieldId], - Maps.newHashMap(), - Maps.newHashMap()); - } - - public HitAttributeCollector getFieldRankHitAttributeCollector() { - return collector; - } - - /** - * Returns hit attribution information indexed by node rank - * - * @param docId the document id - * @return a mapping from the query's node rank to a list of field names that were hit. - */ - public Map> getHitAttribution(int docId) { - // check cache first so we don't have to recompute the same thing. - if (lastDocIdHolder.get() == docId) { - return lastHitAttrHolder.get(); - } - - lastDocIdHolder.set(docId); - Map> hitAttribution = - collector.getHitAttribution(docId, fieldIdsToFieldNames); - lastHitAttrHolder.set(hitAttribution); - return hitAttribution; - } - - /** - * Adds a new node and its respective rank to the helper's node-to-rank map - * Will throw an exception if attempting to add/update an existing node - * - * @param node the query node - * @param rank the rank associated with the node - */ - public void addNodeRank(Query node, int rank) { - // if there are two of the same terms, just map them to the first rank, they should get the same - // hits back - if (!nodeToRankMap.containsKey(node)) { - nodeToRankMap.put(node, rank); - } - } - - public Map getNodeToRankMap() { - return nodeToRankMap; - } - - public Map> getExpandedNodeToRankMap() { - return expandedNodeToRankMap; - } -} diff --git a/src/java/com/twitter/search/common/query/HitAttributeProvider.java b/src/java/com/twitter/search/common/query/HitAttributeProvider.java deleted file mode 100644 index bcdcea90c..000000000 --- a/src/java/com/twitter/search/common/query/HitAttributeProvider.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.List; -import java.util.Map; - -/** - * The interface for objects that can provide hit attributes for a document. - */ -public interface HitAttributeProvider { - /** Returns the hit attributes for the given document. 
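-   * The returned map is keyed by query node rank, and each value lists the fields that matched
-   * that node in the given document.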
*/ - Map> getHitAttribution(int docId); -} diff --git a/src/java/com/twitter/search/common/query/IDDisjunctionQuery.java b/src/java/com/twitter/search/common/query/IDDisjunctionQuery.java deleted file mode 100644 index e6ac8afe1..000000000 --- a/src/java/com/twitter/search/common/query/IDDisjunctionQuery.java +++ /dev/null @@ -1,378 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Iterator; -import java.util.List; -import java.util.Objects; -import java.util.Set; -import java.util.stream.Collectors; - -import org.apache.lucene.index.FilteredTermsEnum; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.TermState; -import org.apache.lucene.index.TermStates; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.BooleanClause.Occur; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.BulkScorer; -import org.apache.lucene.search.ConstantScoreQuery; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.ConstantScoreWeight; -import org.apache.lucene.search.DocIdSet; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.MultiTermQuery; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.TermQuery; -import org.apache.lucene.search.Weight; -import org.apache.lucene.util.AttributeSource; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.DocIdSetBuilder; - -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * An extension of Lucene's MultiTermQuery which creates a disjunction of - * long ID terms. Lucene tries to rewrite the Query depending on the number - * of clauses to perform as efficiently as possible. - */ -public class IDDisjunctionQuery extends MultiTermQuery { - private final List ids; - private final boolean useOrderPreservingEncoding; - - /** Creates a new IDDisjunctionQuery instance. */ - public IDDisjunctionQuery(List ids, String field, ImmutableSchemaInterface schemaSnapshot) - throws QueryParserException { - super(field); - this.ids = ids; - - setRewriteMethod(new Rewrite()); - - if (!schemaSnapshot.hasField(field)) { - throw new QueryParserException( - "Tried to search a field which does not exist in schema: " + field); - } - - IndexedNumericFieldSettings numericFieldSettings = - schemaSnapshot.getFieldInfo(field).getFieldType().getNumericFieldSettings(); - - if (numericFieldSettings == null) { - throw new QueryParserException("Requested id field is not numerical: " + field); - } - - this.useOrderPreservingEncoding = numericFieldSettings.isUseSortableEncoding(); - } - - /** - * Work around for an issue where LongTerms are not valid utf8, so calling - * toString on any TermQuery containing a LongTerm may cause exceptions. 
- */ - private class Rewrite extends RewriteMethod { - @Override - public Query rewrite(IndexReader reader, MultiTermQuery query) throws IOException { - Query result = new MultiTermQueryConstantScoreWrapper( - (IDDisjunctionQuery) query, useOrderPreservingEncoding); - return result; - } - } - - @Override - protected TermsEnum getTermsEnum(final Terms terms, AttributeSource atts) throws IOException { - final Iterator it = this.ids.iterator(); - final TermsEnum termsEnum = terms.iterator(); - - return new FilteredTermsEnum(termsEnum) { - private final BytesRef term = useOrderPreservingEncoding - ? SortableLongTermAttributeImpl.newBytesRef() - : LongTermAttributeImpl.newBytesRef(); - - @Override protected AcceptStatus accept(BytesRef term) throws IOException { - return AcceptStatus.YES; - } - - @Override public BytesRef next() throws IOException { - while (it.hasNext()) { - Long longTerm = it.next(); - if (useOrderPreservingEncoding) { - SortableLongTermAttributeImpl.copyLongToBytesRef(term, longTerm); - } else { - LongTermAttributeImpl.copyLongToBytesRef(term, longTerm); - } - if (termsEnum.seekExact(term)) { - return term; - } - } - - return null; - } - }; - } - - @Override - public String toString(String field) { - StringBuilder builder = new StringBuilder(); - builder.append("IDDisjunction[").append(this.field).append(":"); - for (Long id : this.ids) { - builder.append(id); - builder.append(","); - } - builder.setLength(builder.length() - 1); - builder.append("]"); - return builder.toString(); - } - - private static class TermQueryWithToString extends TermQuery { - private final boolean useOrderPreservingEncoding; - - public TermQueryWithToString(Term t, TermStates states, boolean useOrderPreservingEncoding) { - super(t, states); - this.useOrderPreservingEncoding = useOrderPreservingEncoding; - } - - @Override - public String toString(String field) { - StringBuilder buffer = new StringBuilder(); - if (!getTerm().field().equals(field)) { - buffer.append(getTerm().field()); - buffer.append(":"); - } - long longTerm; - BytesRef termBytes = getTerm().bytes(); - if (useOrderPreservingEncoding) { - longTerm = SortableLongTermAttributeImpl.copyBytesRefToLong(termBytes); - } else { - longTerm = LongTermAttributeImpl.copyBytesRefToLong(termBytes); - } - buffer.append(longTerm); - return buffer.toString(); - } - } - - /** - * This class provides the functionality behind {@link MultiTermQuery#CONSTANT_SCORE_REWRITE}. - * It tries to rewrite per-segment as a boolean query that returns a constant score and otherwise - * fills a DocIdSet with matches and builds a Scorer on top of this DocIdSet. - */ - static final class MultiTermQueryConstantScoreWrapper extends Query { - // disable the rewrite option which will scan all posting lists sequentially and perform - // the intersection using a temporary DocIdSet. In earlybird this mode is slower than a "normal" - // disjunctive BooleanQuery, due to early termination and the fact that everything is in memory. 
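-  // Up to this many collected terms (further capped by BooleanQuery.getMaxClauseCount()) are
-  // rewritten into a constant-score BooleanQuery of term queries; if more terms match, a DocIdSet
-  // of the matching docs is built instead.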
- private static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 3000; - - private static class TermAndState { - private final BytesRef term; - private final TermState state; - private final int docFreq; - private final long totalTermFreq; - - TermAndState(BytesRef term, TermState state, int docFreq, long totalTermFreq) { - this.term = term; - this.state = state; - this.docFreq = docFreq; - this.totalTermFreq = totalTermFreq; - } - } - - private static class WeightOrDocIdSet { - private final Weight weight; - private final DocIdSet docIdSet; - - WeightOrDocIdSet(Weight weight) { - this.weight = Objects.requireNonNull(weight); - this.docIdSet = null; - } - - WeightOrDocIdSet(DocIdSet docIdSet) { - this.docIdSet = docIdSet; - this.weight = null; - } - } - - protected final IDDisjunctionQuery query; - private final boolean useOrderPreservingEncoding; - - /** - * Wrap a {@link MultiTermQuery} as a Filter. - */ - protected MultiTermQueryConstantScoreWrapper( - IDDisjunctionQuery query, - boolean useOrderPreservingEncoding) { - this.query = query; - this.useOrderPreservingEncoding = useOrderPreservingEncoding; - } - - @Override - public String toString(String field) { - // query.toString should be ok for the filter, too, if the query boost is 1.0f - return query.toString(field); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof MultiTermQueryConstantScoreWrapper)) { - return false; - } - - return query.equals(MultiTermQueryConstantScoreWrapper.class.cast(obj).query); - } - - @Override - public int hashCode() { - return query == null ? 0 : query.hashCode(); - } - - /** Returns the field name for this query */ - public String getField() { - return query.getField(); - } - - private List getIDs() { - return query.ids; - } - - @Override - public Weight createWeight( - final IndexSearcher searcher, - final ScoreMode scoreMode, - final float boost) throws IOException { - return new ConstantScoreWeight(this, boost) { - /** Try to collect terms from the given terms enum and return true iff all - * terms could be collected. If {@code false} is returned, the enum is - * left positioned on the next term. */ - private boolean collectTerms(LeafReaderContext context, - TermsEnum termsEnum, - List terms) throws IOException { - final int threshold = Math.min(BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, - BooleanQuery.getMaxClauseCount()); - for (int i = 0; i < threshold; ++i) { - final BytesRef term = termsEnum.next(); - if (term == null) { - return true; - } - TermState state = termsEnum.termState(); - terms.add(new TermAndState(BytesRef.deepCopyOf(term), - state, - termsEnum.docFreq(), - termsEnum.totalTermFreq())); - } - return termsEnum.next() == null; - } - - /** - * On the given leaf context, try to either rewrite to a disjunction if - * there are few terms, or build a DocIdSet containing matching docs. 
- */ - private WeightOrDocIdSet rewrite(LeafReaderContext context) - throws IOException { - final Terms terms = context.reader().terms(query.getField()); - if (terms == null) { - // field does not exist - return new WeightOrDocIdSet((DocIdSet) null); - } - - final TermsEnum termsEnum = query.getTermsEnum(terms); - assert termsEnum != null; - - PostingsEnum docs = null; - - final List collectedTerms = new ArrayList<>(); - if (collectTerms(context, termsEnum, collectedTerms)) { - // build a boolean query - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - for (TermAndState t : collectedTerms) { - final TermStates termStates = new TermStates(searcher.getTopReaderContext()); - termStates.register(t.state, context.ord, t.docFreq, t.totalTermFreq); - final Term term = new Term(query.getField(), t.term); - bqBuilder.add( - new TermQueryWithToString(term, termStates, useOrderPreservingEncoding), - Occur.SHOULD); - } - Query q = BoostUtils.maybeWrapInBoostQuery( - new ConstantScoreQuery(bqBuilder.build()), score()); - return new WeightOrDocIdSet( - searcher.rewrite(q).createWeight(searcher, scoreMode, boost)); - } - - // Too many terms: go back to the terms we already collected and start building - // the DocIdSet - DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc()); - if (!collectedTerms.isEmpty()) { - TermsEnum termsEnum2 = terms.iterator(); - for (TermAndState t : collectedTerms) { - termsEnum2.seekExact(t.term, t.state); - docs = termsEnum2.postings(docs, PostingsEnum.NONE); - builder.add(docs); - } - } - - // Then keep filling the DocIdSet with remaining terms - do { - docs = termsEnum.postings(docs, PostingsEnum.NONE); - builder.add(docs); - } while (termsEnum.next() != null); - - return new WeightOrDocIdSet(builder.build()); - } - - private Scorer scorer(DocIdSet set) throws IOException { - if (set == null) { - return null; - } - final DocIdSetIterator disi = set.iterator(); - if (disi == null) { - return null; - } - return new ConstantScoreScorer(this, score(), ScoreMode.COMPLETE_NO_SCORES, disi); - } - - @Override - public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { - final WeightOrDocIdSet weightOrDocIdSet = rewrite(context); - if (weightOrDocIdSet.weight != null) { - return weightOrDocIdSet.weight.bulkScorer(context); - } else { - final Scorer scorer = scorer(weightOrDocIdSet.docIdSet); - if (scorer == null) { - return null; - } - return new DefaultBulkScorer(scorer); - } - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - final WeightOrDocIdSet weightOrDocIdSet = rewrite(context); - if (weightOrDocIdSet.weight != null) { - return weightOrDocIdSet.weight.scorer(context); - } else { - return scorer(weightOrDocIdSet.docIdSet); - } - } - - @Override - public void extractTerms(Set terms) { - terms.addAll(getIDs() - .stream() - .map(id -> new Term(getField(), LongTermAttributeImpl.copyIntoNewBytesRef(id))) - .collect(Collectors.toSet())); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return false; - } - }; - } - } -} diff --git a/src/java/com/twitter/search/common/query/IdentifiableQuery.java b/src/java/com/twitter/search/common/query/IdentifiableQuery.java deleted file mode 100644 index dbecf88aa..000000000 --- a/src/java/com/twitter/search/common/query/IdentifiableQuery.java +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; -import 
com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * Query implementation adds attribute collection support for an underlying query. - */ -public class IdentifiableQuery extends Query { - protected final Query inner; - private final FieldRankHitInfo queryId; - private final HitAttributeCollector attrCollector; - - public IdentifiableQuery(Query inner, FieldRankHitInfo queryId, - HitAttributeCollector attrCollector) { - this.inner = Preconditions.checkNotNull(inner); - this.queryId = queryId; - this.attrCollector = Preconditions.checkNotNull(attrCollector); - } - - @Override - public Weight createWeight( - IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException { - Weight innerWeight = inner.createWeight(searcher, scoreMode, boost); - return new IdentifiableQueryWeight(this, innerWeight, queryId, attrCollector); - } - - @Override - public Query rewrite(IndexReader reader) throws IOException { - Query rewritten = inner.rewrite(reader); - if (rewritten != inner) { - return new IdentifiableQuery(rewritten, queryId, attrCollector); - } - return this; - } - - @Override - public int hashCode() { - return inner.hashCode() * 13 + (queryId == null ? 0 : queryId.hashCode()); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof IdentifiableQuery)) { - return false; - } - - IdentifiableQuery identifiableQuery = IdentifiableQuery.class.cast(obj); - return inner.equals(identifiableQuery.inner) - && (queryId == null - ? identifiableQuery.queryId == null - : queryId.equals(identifiableQuery.queryId)); - } - - @Override - public String toString(String field) { - return inner.toString(field); - } - - @VisibleForTesting - public Query getQueryForTest() { - return inner; - } - - @VisibleForTesting - public FieldRankHitInfo getQueryIdForTest() { - return queryId; - } -} diff --git a/src/java/com/twitter/search/common/query/IdentifiableQueryScorer.java b/src/java/com/twitter/search/common/query/IdentifiableQueryScorer.java deleted file mode 100644 index 98c8340eb..000000000 --- a/src/java/com/twitter/search/common/query/IdentifiableQueryScorer.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.Weight; - -/** - * Scorer implementation that adds attribute collection support for an underlying query. - * Meant to be used in conjunction with {@link IdentifiableQuery}. 
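- *
- * Every doc ID returned by the wrapped scorer's nextDoc() and advance() calls is reported to the
- * HitAttributeCollector before being returned to the caller.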
- */ -public class IdentifiableQueryScorer extends FilteredScorer { - private final FieldRankHitInfo queryId; - private final HitAttributeCollector attrCollector; - - public IdentifiableQueryScorer(Weight weight, Scorer inner, FieldRankHitInfo queryId, - HitAttributeCollector attrCollector) { - super(weight, inner); - this.queryId = queryId; - this.attrCollector = Preconditions.checkNotNull(attrCollector); - } - - @Override - public DocIdSetIterator iterator() { - final DocIdSetIterator superDISI = super.iterator(); - - return new DocIdSetIterator() { - @Override - public int docID() { - return superDISI.docID(); - } - - @Override - public int nextDoc() throws IOException { - int docid = superDISI.nextDoc(); - if (docid != NO_MORE_DOCS) { - attrCollector.collectScorerAttribution(docid, queryId); - } - return docid; - } - - @Override - public int advance(int target) throws IOException { - int docid = superDISI.advance(target); - if (docid != NO_MORE_DOCS) { - attrCollector.collectScorerAttribution(docid, queryId); - } - return docid; - } - - @Override - public long cost() { - return superDISI.cost(); - } - }; - } -} diff --git a/src/java/com/twitter/search/common/query/IdentifiableQueryWeight.java b/src/java/com/twitter/search/common/query/IdentifiableQueryWeight.java deleted file mode 100644 index 5daba7517..000000000 --- a/src/java/com/twitter/search/common/query/IdentifiableQueryWeight.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.Weight; - -/** - * Weight implementation that adds attribute collection support for an underlying query. - * Meant to be used in conjunction with {@link IdentifiableQuery}. - */ -public class IdentifiableQueryWeight extends Weight { - private final Weight inner; - private final FieldRankHitInfo queryId; - private final HitAttributeCollector attrCollector; - - /** Creates a new IdentifiableQueryWeight instance. 
*/ - public IdentifiableQueryWeight(IdentifiableQuery query, Weight inner, FieldRankHitInfo queryId, - HitAttributeCollector attrCollector) { - super(query); - this.inner = inner; - this.queryId = queryId; - this.attrCollector = Preconditions.checkNotNull(attrCollector); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) - throws IOException { - return inner.explain(context, doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - attrCollector.clearHitAttributions(context, queryId); - Scorer innerScorer = inner.scorer(context); - if (innerScorer != null) { - return new IdentifiableQueryScorer(this, innerScorer, queryId, attrCollector); - } else { - return null; - } - } - - @Override - public void extractTerms(Set terms) { - inner.extractTerms(terms); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return inner.isCacheable(ctx); - } -} diff --git a/src/java/com/twitter/search/common/query/MappableField.java b/src/java/com/twitter/search/common/query/MappableField.java deleted file mode 100644 index 53905472c..000000000 --- a/src/java/com/twitter/search/common/query/MappableField.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.common.query; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -/** - * The indices may map the fields declared here to fields internally without exposing their schemas - * to other services. This can be used, for example, to set boosts for URL-like fields in Earlybird - * without direct knowledge of the internal Earlybird field name - */ -public enum MappableField { - REFERRAL, - URL; - - static { - ImmutableMap.Builder builder = ImmutableMap.builder(); - for (MappableField mappableField : MappableField.values()) { - builder.put(mappableField, mappableField.toString().toLowerCase()); - } - MAPPABLE_FIELD_TO_NAME_MAP = Maps.immutableEnumMap(builder.build()); - } - - private static final ImmutableMap MAPPABLE_FIELD_TO_NAME_MAP; - - /** Returns the name of the given MappableField. */ - public static String mappableFieldName(MappableField mappableField) { - return MAPPABLE_FIELD_TO_NAME_MAP.get(mappableField); - } - - /** Returns the name of this MappableField. */ - public String getName() { - return MAPPABLE_FIELD_TO_NAME_MAP.get(this); - } -} diff --git a/src/java/com/twitter/search/common/query/MultiTermDisjunctionQuery.java b/src/java/com/twitter/search/common/query/MultiTermDisjunctionQuery.java deleted file mode 100644 index 1f54b0671..000000000 --- a/src/java/com/twitter/search/common/query/MultiTermDisjunctionQuery.java +++ /dev/null @@ -1,61 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; -import java.util.Iterator; -import java.util.Set; - -import org.apache.lucene.index.FilteredTermsEnum; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.MultiTermQuery; -import org.apache.lucene.util.AttributeSource; -import org.apache.lucene.util.BytesRef; - - -public class MultiTermDisjunctionQuery extends MultiTermQuery { - - private final Set values; - - /** Creates a new MultiTermDisjunctionQuery instance. 
*/ - public MultiTermDisjunctionQuery(String field, Set values) { - super(field); - this.values = values; - } - - @Override - protected TermsEnum getTermsEnum(Terms terms, AttributeSource atts) - throws IOException { - final TermsEnum termsEnum = terms.iterator(); - final Iterator it = values.iterator(); - - return new FilteredTermsEnum(termsEnum) { - @Override protected AcceptStatus accept(BytesRef term) throws IOException { - return AcceptStatus.YES; - } - - @Override public BytesRef next() throws IOException { - while (it.hasNext()) { - BytesRef termRef = it.next(); - if (termsEnum.seekExact(termRef)) { - return termRef; - } - } - - return null; - } - }; - } - - @Override - public String toString(String field) { - StringBuilder builder = new StringBuilder(); - builder.append("MultiTermDisjunctionQuery["); - for (BytesRef termVal : this.values) { - builder.append(termVal); - builder.append(","); - } - builder.setLength(builder.length() - 1); - builder.append("]"); - return builder.toString(); - } -} diff --git a/src/java/com/twitter/search/common/query/QueryCommonFieldHitsVisitor.java b/src/java/com/twitter/search/common/query/QueryCommonFieldHitsVisitor.java deleted file mode 100644 index e9db5beac..000000000 --- a/src/java/com/twitter/search/common/query/QueryCommonFieldHitsVisitor.java +++ /dev/null @@ -1,160 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.Collections; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.logging.Level; -import java.util.logging.Logger; - -import com.google.common.collect.Sets; - -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.search.Link; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchQueryVisitor; - -/** - * Visitor to track the fields hits of each node - * Returns the common fields among conjunctions and the union of the fields amongst disjunctions - */ -public final class QueryCommonFieldHitsVisitor extends SearchQueryVisitor> { - - private static final Logger LOG = Logger.getLogger(QueryCommonFieldHitsVisitor.class.getName()); - - private Map nodeToRankMap; - private Map> hitFieldsByRank; - - /** - * Find query term hit intersections based on hitmap given by HitAttributeHelper - * - * @param hitAttributeHelper the HitAttributeHelper - * @param docID documentID - * @param query the query searched - * @return a set of hit fields in String representation - */ - public static Set findIntersection( - HitAttributeHelper hitAttributeHelper, - int docID, - Query query) { - return findIntersection(hitAttributeHelper.getNodeToRankMap(), - hitAttributeHelper.getHitAttribution(docID), - query); - } - - /** - * Find query term hit intersections based on hitmap given by HitAttributeHelper - * - * @param nodeToRankMap the map of query node to its integer rank value - * @param hitFieldsByRank map of rank to list of hit fields in String representation - * @param query the query searched - * @return a set of hit fields in String representation - */ - public static Set findIntersection( - Map nodeToRankMap, - Map> hitFieldsByRank, - Query query) 
{ - QueryCommonFieldHitsVisitor visitor = - new QueryCommonFieldHitsVisitor(nodeToRankMap, hitFieldsByRank); - try { - Set returnSet = query.accept(visitor); - return returnSet; - } catch (QueryParserException e) { - LOG.log(Level.SEVERE, "Could not find intersection for query [" + query + "]: ", e); - return Collections.emptySet(); - } - } - - private QueryCommonFieldHitsVisitor(Map nodeToRankMap, - Map> hitFieldsByRank) { - this.nodeToRankMap = nodeToRankMap; - this.hitFieldsByRank = hitFieldsByRank; - } - - @Override - public Set visit(Disjunction disjunction) throws QueryParserException { - Set fieldHitIntersections = Sets.newHashSet(); - for (Query child : disjunction.getChildren()) { - fieldHitIntersections.addAll(child.accept(this)); - } - return fieldHitIntersections; - } - - @Override - public Set visit(Conjunction conjunction) throws QueryParserException { - List children = conjunction.getChildren(); - if (!children.isEmpty()) { - boolean initializedIntersections = false; - Set fieldHitIntersections = Sets.newHashSet(); - for (Query child : children) { - Set hits = child.accept(this); - if (hits.isEmpty()) { - // if it is empty, it means this query node is not of term type - // and we do not include these in the field intersection - // eg. cache filters, proximity groups - continue; - } - if (!initializedIntersections) { - fieldHitIntersections.addAll(hits); - initializedIntersections = true; - } else { - fieldHitIntersections.retainAll(hits); - } - } - return fieldHitIntersections; - } - return Collections.emptySet(); - } - - @Override - public Set visit(Term term) throws QueryParserException { - Set fieldHitIntersections = Sets.newHashSet(); - Integer rank = nodeToRankMap.get(term); - if (rank != null) { - List fields = hitFieldsByRank.get(rank); - // for disjunction cases where a term may not have any hits - if (fields != null) { - fieldHitIntersections.addAll(fields); - } - } - return fieldHitIntersections; - } - - @Override - public Set visit(SpecialTerm specialTerm) throws QueryParserException { - // This is way of splitting @mentions ensures consistency with way the lucene query is built in - // expertsearch - if (specialTerm.getType() == SpecialTerm.Type.MENTION && specialTerm.getValue().contains("_")) { - Phrase phrase = new Phrase(specialTerm.getValue().split("_")); - return phrase.accept(this); - } - return specialTerm.toTermOrPhrase().accept(this); - } - - @Override - public Set visit(SearchOperator operator) throws QueryParserException { - return Collections.emptySet(); - } - - @Override - public Set visit(Link link) throws QueryParserException { - return link.toPhrase().accept(this); - } - - @Override - public Set visit(Phrase phrase) throws QueryParserException { - // All terms in the phrase should return the same hits fields, just check the first one - List terms = phrase.getTerms(); - if (!terms.isEmpty()) { - Term term = new Term(phrase.getTerms().get(0)); - return term.accept(this); - } - return Collections.emptySet(); - } -} diff --git a/src/java/com/twitter/search/common/query/QueryHitAttributeHelper.java b/src/java/com/twitter/search/common/query/QueryHitAttributeHelper.java deleted file mode 100644 index 1a4ad07ad..000000000 --- a/src/java/com/twitter/search/common/query/QueryHitAttributeHelper.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.Collections; -import java.util.IdentityHashMap; -import java.util.List; -import java.util.Map; -import java.util.function.Function; - -import 
com.twitter.search.common.schema.base.Schema; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.visitors.MultiTermDisjunctionRankVisitor; -import com.twitter.search.queryparser.visitors.NodeRankAnnotator; -import com.twitter.search.queryparser.visitors.QueryTreeIndex; - -/** - * A helper class to collect field and query node hit attributions. - */ -public class QueryHitAttributeHelper extends HitAttributeHelper { - private final Query annotatedQuery; - - protected QueryHitAttributeHelper(HitAttributeCollector collector, - Function fieldIdsToFieldNames, - IdentityHashMap nodeToRankMap, - Query annotatedQuery, - Map> expandedRanksMap) { - super(collector, fieldIdsToFieldNames, nodeToRankMap, expandedRanksMap); - this.annotatedQuery = annotatedQuery; - } - - /** - * Constructor specific for com.twitter.search.queryParser.query.Query - * - * This helper visits a parsed query to construct a node-to-rank mapping, - * and uses a schema to determine all of the possible fields to be tracked. - * A collector is then created. - * - * @param query the query for which we will collect hit attribution. - * @param schema the indexing schema. - */ - public static QueryHitAttributeHelper from(Query query, final Schema schema) - throws QueryParserException { - IdentityHashMap nodeToRankMap; - Query annotatedQuery; - - // First see if the query already has node rank annotations on it. If so, we'll just use those - // to identify query nodes. - // We enforce that all provided ranks are in the range of [0, N-1] so not to blow up the size - // of the collection array. - QueryRankVisitor rankVisitor = new QueryRankVisitor(); - if (query.accept(rankVisitor)) { - nodeToRankMap = rankVisitor.getNodeToRankMap(); - annotatedQuery = query; - } else { - // Otherwise, we will assign all nodes in-order ranks, and use those to track per-node hit - // attribution - QueryTreeIndex queryTreeIndex = QueryTreeIndex.buildFor(query); - NodeRankAnnotator annotator = new NodeRankAnnotator(queryTreeIndex.getNodeToIndexMap()); - annotatedQuery = query.accept(annotator); - nodeToRankMap = annotator.getUpdatedNodeToRankMap(); - } - - // Extract ranks for multi_term_disjunction operators - MultiTermDisjunctionRankVisitor multiTermDisjunctionRankVisitor = - new MultiTermDisjunctionRankVisitor(Collections.max(nodeToRankMap.values())); - annotatedQuery.accept(multiTermDisjunctionRankVisitor); - Map> expandedRanksMap = - multiTermDisjunctionRankVisitor.getMultiTermDisjunctionRankExpansionsMap(); - - return new QueryHitAttributeHelper( - new HitAttributeCollector(), - (fieldId) -> schema.getFieldName(fieldId), - nodeToRankMap, - annotatedQuery, - expandedRanksMap); - } - - public Query getAnnotatedQuery() { - return annotatedQuery; - } -} diff --git a/src/java/com/twitter/search/common/query/QueryRankVisitor.java b/src/java/com/twitter/search/common/query/QueryRankVisitor.java deleted file mode 100644 index e6f657f6a..000000000 --- a/src/java/com/twitter/search/common/query/QueryRankVisitor.java +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.IdentityHashMap; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import com.twitter.search.queryparser.query.BooleanQuery; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.annotation.Annotation; 
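// Editor's note -- a minimal sketch (not part of the original sources) of how the hit-attribution
// classes in this package are wired together. `parsedQuery`, `schema` and `docId` are assumed
// inputs, and the Lucene-side step of wrapping each translated leaf query in an IdentifiableQuery
// with the helper's HitAttributeCollector is elided.
static Set<String> fieldsHitByAllConjunctionTerms(
    com.twitter.search.queryparser.query.Query parsedQuery,
    Schema schema,
    int docId) throws QueryParserException {
  // Assigns node ranks (or reuses existing :r annotations) and creates a HitAttributeCollector.
  QueryHitAttributeHelper helper = QueryHitAttributeHelper.from(parsedQuery, schema);
  // The annotated query is what gets translated into Lucene queries; each leaf is wrapped in an
  // IdentifiableQuery so that matches are recorded against the node's rank.
  com.twitter.search.queryparser.query.Query annotated = helper.getAnnotatedQuery();
  // Once a document has been scored, intersect the per-node hit fields across conjunctions.
  return QueryCommonFieldHitsVisitor.findIntersection(helper, docId, annotated);
}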
-import com.twitter.search.queryparser.visitors.DetectAnnotationVisitor; - -/** - * A visitor that collects node ranks from :r annotation in the query - */ -public class QueryRankVisitor extends DetectAnnotationVisitor { - private final IdentityHashMap nodeToRankMap = Maps.newIdentityHashMap(); - - public QueryRankVisitor() { - super(Annotation.Type.NODE_RANK); - } - - @Override - protected boolean visitBooleanQuery(BooleanQuery query) throws QueryParserException { - if (query.hasAnnotationType(Annotation.Type.NODE_RANK)) { - collectNodeRank(query.getAnnotationOf(Annotation.Type.NODE_RANK).get(), query); - } - - boolean found = false; - for (Query child : query.getChildren()) { - found |= child.accept(this); - } - return found; - } - - @Override - protected boolean visitQuery(Query query) throws QueryParserException { - if (query.hasAnnotationType(Annotation.Type.NODE_RANK)) { - collectNodeRank(query.getAnnotationOf(Annotation.Type.NODE_RANK).get(), query); - return true; - } - - return false; - } - - private void collectNodeRank(Annotation anno, Query query) { - Preconditions.checkArgument(anno.getType() == Annotation.Type.NODE_RANK); - int rank = (Integer) anno.getValue(); - nodeToRankMap.put(query, rank); - } - - public IdentityHashMap getNodeToRankMap() { - return nodeToRankMap; - } -} diff --git a/src/java/com/twitter/search/common/query/SingleDocDocIdSetIterator.java b/src/java/com/twitter/search/common/query/SingleDocDocIdSetIterator.java deleted file mode 100644 index f68438b22..000000000 --- a/src/java/com/twitter/search/common/query/SingleDocDocIdSetIterator.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.common.query; - -import java.io.IOException; - -import org.apache.lucene.search.DocIdSetIterator; - -public class SingleDocDocIdSetIterator extends DocIdSetIterator { - - // the only docid in the list - private final int doc; - - private int docid = -1; - - public SingleDocDocIdSetIterator(int doc) { - this.doc = doc; - } - - @Override - public int docID() { - return docid; - } - - @Override - public int nextDoc() throws IOException { - if (docid == -1) { - docid = doc; - } else { - docid = NO_MORE_DOCS; - } - return docid; - } - - @Override - public int advance(int target) throws IOException { - if (docid == NO_MORE_DOCS) { - return docid; - } else if (doc < target) { - docid = NO_MORE_DOCS; - return docid; - } else { - docid = doc; - } - return docid; - } - - @Override - public long cost() { - return 1; - } - -} diff --git a/src/java/com/twitter/search/common/query/StaticHitAttributeProvider.java b/src/java/com/twitter/search/common/query/StaticHitAttributeProvider.java deleted file mode 100644 index 4ea8e53ba..000000000 --- a/src/java/com/twitter/search/common/query/StaticHitAttributeProvider.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.common.query; - -import java.util.Collections; -import java.util.List; -import java.util.Map; - -/** - * A hit attribute provider based on the static data - */ -public class StaticHitAttributeProvider implements HitAttributeProvider { - private int currentDocId; - private Map> currentHitAttr; - - public StaticHitAttributeProvider() { - } - - /** - * Set a fake last doc id and hit attribution, this is only used to generate explanation. 
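// Editor's note -- a small illustration (not part of the original sources) of the iteration
// contract of SingleDocDocIdSetIterator above: it emits exactly one doc id and is then exhausted.
DocIdSetIterator single = new SingleDocDocIdSetIterator(42);
int first = single.nextDoc();    // 42
int second = single.nextDoc();   // DocIdSetIterator.NO_MORE_DOCS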
- */ - public void setCurrentHitAttr(int docId, Map> hitAttr) { - this.currentDocId = docId; - this.currentHitAttr = hitAttr; - } - - @Override - public Map> getHitAttribution(int docId) { - if (docId == currentDocId) { - return currentHitAttr; - } - return Collections.EMPTY_MAP; - } -} diff --git a/src/java/com/twitter/search/common/relevance/BUILD b/src/java/com/twitter/search/common/relevance/BUILD deleted file mode 100644 index 118eea883..000000000 --- a/src/java/com/twitter/search/common/relevance/BUILD +++ /dev/null @@ -1,257 +0,0 @@ -java_library( - name = "utils", - sources = ["utils/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/tweetypie", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/common:schema-java", - ], -) - -java_library( - name = "ranking", - sources = ["ranking/**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":utils", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/relevance/features", - "src/thrift/com/twitter/search:earlybird-java", - ], -) - -TRENDS_DATA_SERVICE_SOURCES = [ - "TrendsThriftDataServiceManager.java", - "NGramCache.java", -] - -java_library( - name = "trends-data-service", - sources = TRENDS_DATA_SERVICE_SOURCES, - platform = "java8", - provides = artifact( - org = "com.twitter.search.common.relevance", - name = "trends-data-service", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/client", - "finagle/finagle-core/src/main", - "finagle/finagle-thrift/src/main/java", - "finagle/finagle-thriftmux/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/penguin/search/filter", - "src/java/com/twitter/search/common/metrics", - "src/thrift/com/twitter/trends/plus:trends-plus-java", - "src/thrift/com/twitter/trends/service/gen:trends_service-java", - "src/thrift/com/twitter/trends/trending_content:trending-content-service-java", - "trends/trends_metadata/thrift/src/main/thrift/com/twitter/trends/trends_metadata:thrift-java", - "twitter-server-internal/src/main/scala", - 
"util/util-core:scala", - "util/util-stats/src/main/scala", - ], -) - -java_library( - name = "feature-update-reader", - sources = ["readers/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-server", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-twitter-science-provider", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/thrift/com/twitter/search/common:schema-java", - ], -) - -target( - dependencies = [ - ":feature-update-reader", - ":trends-data-service", - "src/java/com/twitter/search/common/relevance/features", - ], -) - -java_library( - name = "config", - sources = ["config/**/*.java"], - platform = "java8", - provides = artifact( - org = "com.twitter.search.common.relevance", - name = "config", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "src/java/com/twitter/search/common/config", - "src/resources/com/twitter/search/common/relevance/config", - ], -) - -java_library( - name = "classifiers", - sources = ["classifiers/**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":config", - ":entities_and_filters", - ":trends-data-service", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/transformer", - "src/java/com/twitter/common_internal/text:text-penguin7", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/common/util/text", - "twitter-text/lib/java/src/main/java/com/twitter/twittertext", - ], -) - -java_library( - name = "text", - sources = ["text/**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":entities_and_filters", - "3rdparty/jvm/com/google/guava", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:char-seq-util", - "src/java/com/twitter/common_internal/text:text-penguin7", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/text/regex", - "src/thrift/com/twitter/search/common:indexing-java", - ], -) - -java_library( - name = "scorers", - sources = ["scorers/**/*.java"], - platform = "java8", - 
tags = ["bazel-compatible"], - dependencies = [ - ":classifiers", - ":config", - ":entities_and_filters", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - ], -) - -java_library( - name = "entities_and_filters", - sources = [ - "entities/**/*.java", - "filters/**/*.java", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/org/apache/commons:commons-lang3", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/slf4j:slf4j-api", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "decider/src/main/scala", - "src/java/com/twitter/common/text/extractor", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/text/pipeline", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/transformer", - "src/java/com/twitter/common_internal/text:text-penguin7", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/text", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - "util/util-core:scala", - ], -) - -java_library( - name = "scores", - sources = ["scores/**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - ], -) diff --git a/src/java/com/twitter/search/common/relevance/NGramCache.java b/src/java/com/twitter/search/common/relevance/NGramCache.java deleted file mode 100644 index 41a3478bd..000000000 --- a/src/java/com/twitter/search/common/relevance/NGramCache.java +++ /dev/null @@ -1,152 +0,0 @@ -package com.twitter.search.common.relevance; - -import java.util.Collections; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.cache.CacheBuilder; -import com.google.common.collect.ImmutableList; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.penguin.search.filter.StringMatchFilter; -import com.twitter.util.Duration; - -/** - * the Cache for Trends - */ -public class NGramCache { - private static final int DEFAULT_MAX_CACHE_SIZE = 5000; - private static final long 
DEFAULT_CACHE_ITEM_TTL_SEC = 24 * 3600; // 1 day - - private final PenguinVersion penguinVersion; - - // Keys are trends. Values are empty strings. - private final Map trendsCache; - - private volatile StringMatchFilter trendsMatcher = null; - - /** - * Extract Trends from a list of normalized tokens - */ - public List extractTrendsFromNormalized(List tokens) { - if (trendsMatcher == null) { - return Collections.emptyList(); - } - - ImmutableList.Builder trends = ImmutableList.builder(); - for (String trend : trendsMatcher.extractNormalized(tokens)) { - if (trendsCache.containsKey(trend)) { - trends.add(trend); - } - } - - return trends.build(); - } - - /** - * Extract Trends from a list of tokens - */ - public List extractTrendsFrom(List tokens, Locale language) { - if (trendsMatcher == null) { - return Collections.emptyList(); - } - return trendsMatcher.extract(language, tokens); - } - - /** - * Extract Trends from a given CharSequence - */ - public List extractTrendsFrom(CharSequence text, Locale language) { - if (trendsMatcher == null) { - return Collections.emptyList(); - } - - ImmutableList.Builder trends = ImmutableList.builder(); - for (String trend : trendsMatcher.extract(language, text)) { - if (trendsCache.containsKey(trend)) { - trends.add(trend); - } - } - - return trends.build(); - } - - public long numTrendingTerms() { - return trendsCache.size(); - } - - public Set getTrends() { - return trendsCache.keySet(); - } - - public void clear() { - trendsCache.clear(); - trendsMatcher = null; - } - - /** Adds all trends to this NGramCache. */ - public void addAll(Iterable trends) { - for (String trend : trends) { - trendsCache.put(trend, ""); - } - - trendsMatcher = new StringMatchFilter(trendsCache.keySet(), penguinVersion); - } - - public static Builder builder() { - return new Builder(); - } - - public static class Builder { - private int maxCacheSize = DEFAULT_MAX_CACHE_SIZE; - private long cacheItemTTLSecs = DEFAULT_CACHE_ITEM_TTL_SEC; // 1 day - private PenguinVersion penguinVersion = PenguinVersion.PENGUIN_4; - - public Builder maxCacheSize(int cacheSize) { - this.maxCacheSize = cacheSize; - return this; - } - - public Builder cacheItemTTL(long cacheItemTTL) { - this.cacheItemTTLSecs = cacheItemTTL; - return this; - } - - public Builder penguinVersion(PenguinVersion newPenguinVersion) { - this.penguinVersion = Preconditions.checkNotNull(newPenguinVersion); - return this; - } - - /** Builds an NGramCache instance. */ - public NGramCache build() { - return new NGramCache( - maxCacheSize, - Duration.apply(cacheItemTTLSecs, TimeUnit.SECONDS), - penguinVersion); - } - } - - // Should be used only in tests that want to mock out this class. 
- @VisibleForTesting - public NGramCache() { - this(DEFAULT_MAX_CACHE_SIZE, - Duration.apply(DEFAULT_CACHE_ITEM_TTL_SEC, TimeUnit.SECONDS), - PenguinVersion.PENGUIN_4); - } - - private NGramCache(int maxCacheSize, Duration cacheItemTTL, PenguinVersion penguinVersion) { - // we only have 1 refresher thread that writes to the cache - this.trendsCache = CacheBuilder.newBuilder() - .concurrencyLevel(1) - .expireAfterWrite(cacheItemTTL.inSeconds(), TimeUnit.SECONDS) - .maximumSize(maxCacheSize) - .build() - .asMap(); - this.penguinVersion = penguinVersion; - } -} diff --git a/src/java/com/twitter/search/common/relevance/TrendsThriftDataServiceManager.java b/src/java/com/twitter/search/common/relevance/TrendsThriftDataServiceManager.java deleted file mode 100644 index 62bbd9890..000000000 --- a/src/java/com/twitter/search/common/relevance/TrendsThriftDataServiceManager.java +++ /dev/null @@ -1,353 +0,0 @@ -package com.twitter.search.common.relevance; - -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.Executors; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; -import java.util.stream.Collectors; - -import scala.runtime.BoxedUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Sets; -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.builder.ClientConfig; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.mtls.client.MtlsClientBuilder; -import com.twitter.finagle.stats.DefaultStatsReceiver; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.trends.plus.Module; -import com.twitter.trends.plus.TrendsPlusRequest; -import com.twitter.trends.plus.TrendsPlusResponse; -import com.twitter.trends.service.gen.Location; -import com.twitter.trends.trending_content.thriftjava.TrendingContentService; -import com.twitter.trends.trends_metadata.thriftjava.TrendsMetadataService; -import com.twitter.util.Duration; -import com.twitter.util.Future; -import com.twitter.util.Try; - -/** - * Manages trends data retrieved from trends thrift API and perform automatic refresh. - */ -public final class TrendsThriftDataServiceManager { - private static final Logger LOG = - LoggerFactory.getLogger(TrendsThriftDataServiceManager.class.getName()); - - private static final int DEFAULT_TIME_TO_KILL_SEC = 60; - - @VisibleForTesting - protected static final Map DEFAULT_TRENDS_PARAMS_MAP = ImmutableMap.of( - "MAX_ITEMS_TO_RETURN", "10"); // we only take top 10 for each woeid. 
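// Editor's note -- a minimal usage sketch (not part of the original sources) for NGramCache above
// and the TrendsThriftDataServiceManager defined below. Sizes, retry counts and durations are
// illustrative; `serviceIdentifier` and `tweetText` are assumed to be available from the caller.
NGramCache trendsCache = NGramCache.builder()
    .maxCacheSize(5000)
    .cacheItemTTL(24 * 3600)                        // seconds
    .penguinVersion(PenguinVersion.PENGUIN_4)
    .build();

TrendsThriftDataServiceManager trendsManager = TrendsThriftDataServiceManager.newInstance(
    serviceIdentifier,                              // ServiceIdentifier of the calling service
    3,                                              // numRetries
    Duration.apply(5, TimeUnit.SECONDS),            // requestTimeout
    Duration.apply(60, TimeUnit.SECONDS),           // initTrendsCacheDelay
    Duration.apply(600, TimeUnit.SECONDS),          // reloadInterval
    ImmutableList.of(trendsCache));
trendsManager.startAutoRefresh();

// During tweet processing, trends can then be looked up directly from the cache:
List<String> trendsInText = trendsCache.extractTrendsFrom(tweetText, Locale.ENGLISH);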
- - @VisibleForTesting - protected static final int MAX_TRENDS_PER_WOEID = 10; - - private final Duration requestTimeout; - private final Duration refreshDelayDuration; - private final Duration reloadIntervalDuration; - private final int numRetries; - - // a list of trends cache we want to update - private final List trendsCacheList; - - private final SearchCounter getAvailableSuccessCounter = - RelevanceStats.exportLong("trends_extractor_get_available_success"); - private final SearchCounter getAvailableFailureCounter = - RelevanceStats.exportLong("trends_extractor_get_available_failure"); - private final SearchCounter getTrendsSuccessCounter = - RelevanceStats.exportLong("trends_extractor_success_fetch"); - private final SearchCounter getTrendsFailureCounter = - RelevanceStats.exportLong("trends_extractor_failed_fetch"); - private final SearchCounter updateFailureCounter = - RelevanceStats.exportLong("trends_extractor_failed_update"); - - private final ServiceIdentifier serviceIdentifier; - private ScheduledExecutorService scheduler; - - - @VisibleForTesting - protected Service contentService; - protected TrendingContentService.ServiceToClient contentClient; - protected Service metadataService; - protected TrendsMetadataService.ServiceToClient metadataClient; - - @VisibleForTesting - protected TrendsUpdater trendsUpdater; - - /** - * Returns an instance of TrendsThriftDataServiceManager. - * @param serviceIdentifier The service that wants to call - * into Trend's services. - * @param numRetries The number of retries in the event of - * request failures. - * @param requestTimeout The amount of time we wait before we consider a - * a request as failed. - * @param initTrendsCacheDelay How long to wait before the initial - * filling of the Trends cache in milliseconds. - * @param reloadInterval How often to refresh the cache with updated trends. - * @param trendsCacheList The cache of trends. - * @return An instance of TrendsThriftDataServiceManager configured - * with respect to the params provided. - */ - public static TrendsThriftDataServiceManager newInstance( - ServiceIdentifier serviceIdentifier, - int numRetries, - Duration requestTimeout, - Duration initTrendsCacheDelay, - Duration reloadInterval, - List trendsCacheList) { - return new TrendsThriftDataServiceManager( - serviceIdentifier, - numRetries, - requestTimeout, - initTrendsCacheDelay, - reloadInterval, - trendsCacheList); - } - - /** - * Resume auto refresh. Always called in constructor. Can be invoked after a - * stopAuthRefresh call to resume auto refreshing. Invoking it after shutDown is undefined. - */ - public synchronized void startAutoRefresh() { - if (scheduler == null) { - scheduler = Executors.newSingleThreadScheduledExecutor( - new ThreadFactoryBuilder().setDaemon(true).setNameFormat( - "trends-data-refresher[%d]").build()); - scheduler.scheduleAtFixedRate( - trendsUpdater, - refreshDelayDuration.inSeconds(), - reloadIntervalDuration.inSeconds(), - TimeUnit.SECONDS); - } - } - - /** - * Stop auto refresh. Wait for the current execution thread to finish. - * This is a blocking call. 
- */ - public synchronized void stopAutoRefresh() { - if (scheduler != null) { - scheduler.shutdown(); // Disable new tasks from being submitted - try { - // Wait a while for existing tasks to terminate - if (!scheduler.awaitTermination(DEFAULT_TIME_TO_KILL_SEC, TimeUnit.SECONDS)) { - scheduler.shutdownNow(); // Cancel currently executing tasks - // Wait a while for tasks to respond to being cancelled - if (!scheduler.awaitTermination(DEFAULT_TIME_TO_KILL_SEC, TimeUnit.SECONDS)) { - LOG.info("Executor thread pool did not terminate."); - } - } - } catch (InterruptedException ie) { - // (Re-)Cancel if current thread also interrupted - scheduler.shutdownNow(); - // Preserve interrupt status - Thread.currentThread().interrupt(); - } - scheduler = null; - } - } - - /** Shuts down the manager. */ - public void shutDown() { - stopAutoRefresh(); - // clear the cache - for (NGramCache cache : trendsCacheList) { - cache.clear(); - } - - if (contentService != null) { - contentService.close(); - } - - if (metadataService != null) { - metadataService.close(); - } - } - - private TrendsThriftDataServiceManager( - ServiceIdentifier serviceIdentifier, - int numRetries, - Duration requestTimeoutMS, - Duration refreshDelayDuration, - Duration reloadIntervalDuration, - List trendsCacheList) { - this.numRetries = numRetries; - this.requestTimeout = requestTimeoutMS; - this.refreshDelayDuration = refreshDelayDuration; - this.reloadIntervalDuration = reloadIntervalDuration; - this.serviceIdentifier = serviceIdentifier; - this.trendsCacheList = Preconditions.checkNotNull(trendsCacheList); - trendsUpdater = new TrendsUpdater(); - metadataService = buildMetadataService(); - metadataClient = buildMetadataClient(metadataService); - contentService = buildContentService(); - contentClient = buildContentClient(contentService); - } - - @VisibleForTesting - protected Service buildContentService() { - ClientBuilder< - ThriftClientRequest, - byte[], ClientConfig.Yes, - ClientConfig.Yes, - ClientConfig.Yes - > - builder = ClientBuilder.get() - .stack(ThriftMux.client()) - .name("trends_thrift_data_service_manager_content") - .dest("") - .retries(numRetries) - .reportTo(DefaultStatsReceiver.get()) - .tcpConnectTimeout(requestTimeout) - .requestTimeout(requestTimeout); - ClientBuilder mtlsBuilder = - new MtlsClientBuilder.MtlsClientBuilderSyntax<>(builder).mutualTls(serviceIdentifier); - - return ClientBuilder.safeBuild(mtlsBuilder); - } - - @VisibleForTesting - protected TrendingContentService.ServiceToClient buildContentClient( - Service service) { - return new TrendingContentService.ServiceToClient(service); - } - - @VisibleForTesting - protected Service buildMetadataService() { - ClientBuilder< - ThriftClientRequest, - byte[], - ClientConfig.Yes, - ClientConfig.Yes, - ClientConfig.Yes - > - builder = ClientBuilder.get() - .stack(ThriftMux.client()) - .name("trends_thrift_data_service_manager_metadata") - .dest("") - .retries(numRetries) - .reportTo(DefaultStatsReceiver.get()) - .tcpConnectTimeout(requestTimeout) - .requestTimeout(requestTimeout); - ClientBuilder mtlsBuilder = - new MtlsClientBuilder.MtlsClientBuilderSyntax<>(builder).mutualTls(serviceIdentifier); - - return ClientBuilder.safeBuild(mtlsBuilder); - } - - @VisibleForTesting - protected TrendsMetadataService.ServiceToClient buildMetadataClient( - Service service) { - return new TrendsMetadataService.ServiceToClient(service); - } - - /** - * Updater that fetches available woeids and corresponding trending terms. 
- */ - @VisibleForTesting - protected class TrendsUpdater implements Runnable { - @Override - public void run() { - populateCacheFromTrendsService(); - } - - private Future populateCacheFromTrendsService() { - long startTime = System.currentTimeMillis(); - AtomicLong numTrendsReceived = new AtomicLong(0); - return metadataClient.getAvailable().flatMap(locations -> { - if (locations == null) { - getAvailableFailureCounter.increment(); - LOG.warn("Failed to get woeids from trends."); - return Future.value(BoxedUnit.UNIT); - } - getAvailableSuccessCounter.increment(); - return populateCacheFromTrendLocations(locations, numTrendsReceived); - }).onFailure(throwable -> { - LOG.info("Update failed", throwable); - updateFailureCounter.increment(); - return BoxedUnit.UNIT; - }).ensure(() -> { - logRefreshStatus(startTime, numTrendsReceived); - return BoxedUnit.UNIT; - }); - } - - private Future populateCacheFromTrendLocations( - List locations, - AtomicLong numTrendsReceived) { - List> trendsPlusFutures = locations.stream() - .map(location -> makeTrendsPlusRequest(location)) - .collect(Collectors.toList()); - - Future>> trendsPlusFuture = - Future.collectToTry(trendsPlusFutures); - return trendsPlusFuture.map(tryResponses -> { - populateCacheFromResponses(tryResponses, numTrendsReceived); - return BoxedUnit.UNIT; - }); - } - - private Future makeTrendsPlusRequest(Location location) { - TrendsPlusRequest request = new TrendsPlusRequest() - .setWoeid(location.getWoeid()) - .setMaxTrends(MAX_TRENDS_PER_WOEID); - long startTime = System.currentTimeMillis(); - return contentClient.getTrendsPlus(request) - .onSuccess(response -> { - getTrendsSuccessCounter.increment(); - return BoxedUnit.UNIT; - }).onFailure(throwable -> { - getTrendsFailureCounter.increment(); - return BoxedUnit.UNIT; - }); - } - - private void populateCacheFromResponses( - List> tryResponses, - AtomicLong numTrendsReceived) { - Set trendStrings = Sets.newHashSet(); - - for (Try tryResponse : tryResponses) { - if (tryResponse.isThrow()) { - LOG.warn("Failed to fetch trends:" + tryResponse.toString()); - continue; - } - - TrendsPlusResponse trendsPlusResponse = tryResponse.get(); - numTrendsReceived.addAndGet(trendsPlusResponse.modules.size()); - for (Module module : trendsPlusResponse.modules) { - trendStrings.add(module.getTrend().name); - } - } - - for (NGramCache cache : trendsCacheList) { - cache.addAll(trendStrings); - } - } - } - - private void logRefreshStatus(long startTime, AtomicLong numTrendsReceived) { - LOG.info(String.format("Refresh done in [%dms] :\nfetchSuccess[%d] fetchFailure[%d] " - + "updateFailure[%d] num trends received [%d]", - System.currentTimeMillis() - startTime, - getTrendsSuccessCounter.get(), - getTrendsFailureCounter.get(), - updateFailureCounter.get(), - numTrendsReceived.get())); - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetClassifier.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetClassifier.java deleted file mode 100644 index 16210eec8..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetClassifier.java +++ /dev/null @@ -1,118 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.relevance.entities.TwitterMessage; - -/** - * Interface to perform feature classification for a single - * @TwitterMessage object, or a group of them. - * - * Classification includes two steps: feature extraction, and - * quality evaluation. 
During feature extraction, any interesting - * feature that is deemed useful for subsequent quality analysis - * is extracted from the @TwitterMessage object. Quality evaluation - * is then done by a group of @TweetEvaluator objects associated - * with the classifier, by using the various features extracted in the - * previous step. - * - * Feature extraction and quality evaluation results are stored in - * @TweetFeatures field of the @TwitterMessage object, which is defined - * in src/main/thrift/classifier.thrift. - */ -public abstract class TweetClassifier { - /** - * A list of TweetQualityEvaluators which are invoked after - * feature extraction is done. If null, no quality evaluation - * is done. - */ - protected Iterable qualityEvaluators = null; - - /** - * Passed in TwitterMessage is examined and any extractable - * features are saved in TweetFeatures field of TwitterMessage. - * Then TweetQualityEvaluators are applied to compute various - * quality values. - * - * @param tweet TwitterMessage to perform classification on. - */ - public void classifyTweet(final TwitterMessage tweet) { - Preconditions.checkNotNull(tweet); - - // extract features - extractFeatures(tweet); - - // compute quality - evaluate(tweet); - } - - /** - * Classify a group of TwitterMessages and store features in their corresponding - * TweetFeatures fields. - * - * This default implementation just iterates through the map and classifies each - * individual tweet. Batching for better performance, if applicable, can be implemented by - * concrete subclasses. - * - * @param tweets TwitterMessages to perform classification on. - */ - public void classifyTweets(final Iterable tweets) { - extractFeatures(tweets); - evaluate(tweets); - } - - /** - * Use the specified list of TweetQualityEvaluators for this classifier. - * - * @param evaluators list of TweetQualityEvaluators to be used with this classifier. - */ - protected void setQualityEvaluators(final Iterable qualityEvaluators) { - Preconditions.checkNotNull(qualityEvaluators); - this.qualityEvaluators = qualityEvaluators; - } - - - /** - * Extract interesting features from a single TwitterMessage for classification. - * - * @param tweet TwitterMessage to extract interesting features for - */ - protected abstract void extractFeatures(final TwitterMessage tweet); - - /** - * Extract interesting features from a list of TwitterMessages for classification. - * @param tweets list of TwitterMessages to extract interesting features for - */ - protected void extractFeatures(final Iterable tweets) { - for (TwitterMessage tweet: tweets) { - extractFeatures(tweet); - } - } - - /** - * Given a TwitterMessage which already has its features extracted, - * perform quality evaluation. - * - * @param tweet TwitterMessage to perform quality evaluation for - */ - protected void evaluate(final TwitterMessage tweet) { - if (qualityEvaluators == null) { - return; - } - for (TweetEvaluator evaluator : qualityEvaluators) { - evaluator.evaluate(tweet); - } - } - - /** - * Given a list of TwitterMessages which already have their features extracted, - * perform quality evaluation. 
- * - * @param tweets list of TwitterMessages to perform quality evaluation for - */ - protected void evaluate(final Iterable tweets) { - for (TwitterMessage tweet: tweets) { - evaluate(tweet); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetEvaluator.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetEvaluator.java deleted file mode 100644 index e582e97d9..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetEvaluator.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.relevance.entities.TwitterMessage; - -/** - * Interface to perform quality evaluation for a single @TwitterMessage - * object or a group of them. - * - */ -public abstract class TweetEvaluator { - /** - * Passed in TwitterMessage is examined and any extractable - * features are stored in TweetFeatures field of TwitterMessage. - * - * @param tweet TwitterMessage to perform classification on. - */ - public abstract void evaluate(final TwitterMessage tweet); - - /** - * Classify a group of TwitterMessages and store the features in their corresponding - * TweetFeatures fields. - * - * This default implementation just iterates through the map and classifies each - * individual tweet. Batching for better performance, if applicable, can be implemented by - * concrete subclasses. - * - * @param tweets TwitterMessages to perform classification on. - */ - public void evaluate(final Iterable tweets) { - Preconditions.checkNotNull(tweets); - for (TwitterMessage tweet: tweets) { - evaluate(tweet); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetOffensiveEvaluator.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetOffensiveEvaluator.java deleted file mode 100644 index 2de2bc3b5..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetOffensiveEvaluator.java +++ /dev/null @@ -1,260 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.util.ArrayList; -import java.util.List; -import java.util.concurrent.Executors; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.atomic.AtomicReference; - -import com.google.common.base.Joiner; -import com.google.common.base.Preconditions; -import com.google.common.io.ByteSource; -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import org.apache.commons.io.IOUtils; -import org.apache.commons.lang.StringUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.common.text.token.TokenizedCharSequence; -import com.twitter.common.text.token.attribute.TokenType; -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.pipeline.TwitterNgramGenerator; -import com.twitter.common_internal.text.topic.BlacklistedTopics; -import com.twitter.common_internal.text.topic.BlacklistedTopics.FilterMode; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import 
com.twitter.search.common.relevance.features.TweetTextQuality; -import com.twitter.search.common.util.io.periodic.PeriodicFileLoader; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.text.TokenizerHelper; - -/** - * Determines if tweet text or username contains potentially offensive language. - */ -public class TweetOffensiveEvaluator extends TweetEvaluator { - private static final Logger LOG = LoggerFactory.getLogger(TweetOffensiveEvaluator.class); - - private static final int MAX_OFFENSIVE_TERMS = 2; - - private final File filterDirectory; - private static final File DEFAULT_FILTER_DIR = new File(""); - private static final String ADULT_TOKEN_FILE_NAME = "adult_tokens.txt"; - private static final String OFFENSIVE_TOPIC_FILE_NAME = "offensive_topics.txt"; - private static final String OFFENSIVE_SUBSTRING_FILE_NAME = "offensive_substrings.txt"; - - private static final ThreadLocal NGRAM_GENERATOR_HOLDER = - new ThreadLocal() { - @Override - protected TwitterNgramGenerator initialValue() { - // It'll generate ngrams from TokenizedCharSequence, which contains tokenization results, - // so it doesn't matter which Penguin version to use here. - return new TwitterNgramGenerator.Builder(PenguinVersion.PENGUIN_6) - .setSize(1, MAX_OFFENSIVE_TERMS) - .build(); - } - }; - - private final AtomicReference offensiveTopics = - new AtomicReference<>(); - private final AtomicReference offensiveUsersTopics = - new AtomicReference<>(); - - private final AtomicReference adultTokenFileContents = new AtomicReference<>(); - private final AtomicReference offensiveTokenFileContents = new AtomicReference<>(); - private final AtomicReference offensiveSubstringFileContents = new - AtomicReference<>(); - - private final SearchCounter sensitiveTextCounter = - RelevanceStats.exportLong("num_sensitive_text"); - - public TweetOffensiveEvaluator() { - this(DEFAULT_FILTER_DIR); - } - - public TweetOffensiveEvaluator( - File filterDirectory - ) { - this.filterDirectory = filterDirectory; - adultTokenFileContents.set(BlacklistedTopics.getResource( - BlacklistedTopics.DATA_PREFIX + ADULT_TOKEN_FILE_NAME)); - offensiveTokenFileContents.set(BlacklistedTopics.getResource( - BlacklistedTopics.DATA_PREFIX + OFFENSIVE_TOPIC_FILE_NAME)); - offensiveSubstringFileContents.set(BlacklistedTopics.getResource( - BlacklistedTopics.DATA_PREFIX + OFFENSIVE_SUBSTRING_FILE_NAME)); - - try { - rebuildBlacklistedTopics(); - } catch (IOException e) { - throw new RuntimeException(e); - } - - ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor( - new ThreadFactoryBuilder() - .setNameFormat("offensive-evaluator-blacklist-reloader") - .setDaemon(true) - .build()); - initPeriodicFileLoader(adultTokenFileContents, ADULT_TOKEN_FILE_NAME, executor); - initPeriodicFileLoader(offensiveTokenFileContents, OFFENSIVE_TOPIC_FILE_NAME, executor); - initPeriodicFileLoader(offensiveSubstringFileContents, OFFENSIVE_SUBSTRING_FILE_NAME, executor); - } - - private void initPeriodicFileLoader( - AtomicReference byteSource, - String fileName, - ScheduledExecutorService executor) { - File file = new File(filterDirectory, fileName); - try { - PeriodicFileLoader loader = new PeriodicFileLoader( - "offensive-evaluator-" + fileName, - file.getPath(), - executor, - Clock.SYSTEM_CLOCK) { - @Override - protected void accept(InputStream stream) throws IOException { - byteSource.set(ByteSource.wrap(IOUtils.toByteArray(stream))); - rebuildBlacklistedTopics(); - } - }; - loader.init(); - } catch 
(Exception e) { - // Not the end of the world if we couldn't load the file, we already loaded the resource. - LOG.error("Could not load offensive topic filter " + fileName + " from ConfigBus", e); - } - } - - private void rebuildBlacklistedTopics() throws IOException { - offensiveTopics.set(new BlacklistedTopics.Builder(false) - .loadFilterFromSource(adultTokenFileContents.get(), FilterMode.EXACT) - .loadFilterFromSource(offensiveSubstringFileContents.get(), FilterMode.SUBSTRING) - .build()); - - offensiveUsersTopics.set(new BlacklistedTopics.Builder(false) - .loadFilterFromSource(offensiveTokenFileContents.get(), FilterMode.EXACT) - .loadFilterFromSource(offensiveSubstringFileContents.get(), FilterMode.SUBSTRING) - .build()); - } - - @Override - public void evaluate(final TwitterMessage tweet) { - BlacklistedTopics offensiveFilter = this.offensiveTopics.get(); - BlacklistedTopics offensiveUsersFilter = this.offensiveUsersTopics.get(); - - if (offensiveFilter == null || offensiveUsersFilter == null) { - return; - } - - if (tweet.isSensitiveContent()) { - sensitiveTextCounter.increment(); - } - - // Check for user name. - Preconditions.checkState(tweet.getFromUserScreenName().isPresent(), - "Missing from-user screen name"); - - for (PenguinVersion penguinVersion : tweet.getSupportedPenguinVersions()) { - TweetTextQuality textQuality = tweet.getTweetTextQuality(penguinVersion); - - if (tweet.isSensitiveContent()) { - textQuality.addBoolQuality(TweetTextQuality.BooleanQualityType.SENSITIVE); - } - - // Check if username has an offensive term - if (isUserNameOffensive( - tweet.getFromUserScreenName().get(), offensiveUsersFilter, penguinVersion)) { - SearchRateCounter offensiveUserCounter = RelevanceStats.exportRate( - "num_offensive_user_" + penguinVersion.name().toLowerCase()); - offensiveUserCounter.increment(); - textQuality.addBoolQuality(TweetTextQuality.BooleanQualityType.OFFENSIVE_USER); - } - - // Check if tweet has an offensive term - if (isTweetOffensive(tweet, offensiveFilter, penguinVersion)) { - SearchRateCounter offensiveTextCounter = RelevanceStats.exportRate( - "num_offensive_text_" + penguinVersion.name().toLowerCase()); - offensiveTextCounter.increment(); - textQuality.addBoolQuality(TweetTextQuality.BooleanQualityType.OFFENSIVE); - } - } - } - - private boolean isUserNameOffensive(String userName, - BlacklistedTopics offensiveUsersFilter, - PenguinVersion penguinVersion) { - String normalizedUserName = NormalizerHelper.normalizeKeepCase( - userName, LocaleUtil.UNKNOWN, penguinVersion); - List termsToCheck = new ArrayList(TokenizerHelper.getSubtokens(normalizedUserName)); - termsToCheck.add(normalizedUserName.toLowerCase()); - - for (String userNameToken : termsToCheck) { - if (!StringUtils.isBlank(userNameToken) && offensiveUsersFilter.filter(userNameToken)) { - return true; - } - } - return false; - } - - private boolean isTweetOffensive(final TwitterMessage tweet, - BlacklistedTopics offensiveFilter, - PenguinVersion penguinVersion) { - TweetTextFeatures textFeatures = tweet.getTweetTextFeatures(penguinVersion); - - boolean tweetHasOffensiveTerm = false; - - // Check for tweet text. 
- List ngrams = - NGRAM_GENERATOR_HOLDER.get().generateNgramsAsTokenizedCharSequence( - textFeatures.getTokenSequence(), tweet.getLocale()); - for (TokenizedCharSequence ngram : ngrams) { - // skip URL ngram - if (!ngram.getTokensOf(TokenType.URL).isEmpty()) { - continue; - } - String ngramStr = ngram.toString(); - if (!StringUtils.isBlank(ngramStr) && offensiveFilter.filter(ngramStr)) { - tweetHasOffensiveTerm = true; - break; - } - } - - // Due to some strangeness in Penguin, we don't get ngrams for tokens around "\n-" or "-\n" - // in the original string, this made us miss some offensive words this way. Here we do another - // pass of check using just the tokens generated by the tokenizer. (See SEARCHQUAL-8907) - if (!tweetHasOffensiveTerm) { - for (String ngramStr : textFeatures.getTokens()) { - // skip URLs - if (ngramStr.startsWith("http://") || ngramStr.startsWith("https://")) { - continue; - } - if (!StringUtils.isBlank(ngramStr) && offensiveFilter.filter(ngramStr)) { - tweetHasOffensiveTerm = true; - break; - } - } - } - - if (!tweetHasOffensiveTerm) { - // check for resolved URLs - String resolvedUrlsText = - Joiner.on(" ").skipNulls().join(textFeatures.getResolvedUrlTokens()); - List ngramStrs = NGRAM_GENERATOR_HOLDER.get().generateNgramsAsString( - resolvedUrlsText, LocaleUtil.UNKNOWN); - for (String ngram : ngramStrs) { - if (!StringUtils.isBlank(ngram) && offensiveFilter.filter(ngram)) { - tweetHasOffensiveTerm = true; - break; - } - } - } - - return tweetHasOffensiveTerm; - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetQualityFeatureExtractor.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetQualityFeatureExtractor.java deleted file mode 100644 index 5aefd9cb8..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetQualityFeatureExtractor.java +++ /dev/null @@ -1,105 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import com.twitter.common.text.transformer.RegexTransformer; -import com.twitter.common.text.transformer.RtRemovalTransformer; -import com.twitter.common.text.transformer.Transformer; -import com.twitter.common.text.transformer.TransformerChain; -import com.twitter.common_internal.text.duplicate.RandomSubstringExtractor; -import com.twitter.common_internal.text.duplicate.SignatureGenerator; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.TweetIntegerShingleSignature; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.twittertext.Regex; - -/** - * Given a tweet text, extract useful text features. 
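 * Currently this covers letter and caps counts over the stripped text, plus a
 * near-duplicate signature computed on the text with @replies and old-style RTs removed
 * and URLs replaced by their resolved forms.
 *
 * A minimal usage sketch (the tweet instance and the Penguin version used here are
 * illustrative, not prescribed by this class):
 * <pre>{@code
 *   TweetQualityFeatureExtractor extractor = new TweetQualityFeatureExtractor();
 *   extractor.extractTweetTextFeatures(tweet);
 *   TweetTextFeatures features = tweet.getTweetTextFeatures(PenguinVersion.PENGUIN_6);
 * }</pre>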
- */ -public class TweetQualityFeatureExtractor { - private static final Transformer STATUS_TEXT_CLEANER = - TransformerChain.of( - // remove @reply as defined in twitter-text - new RegexTransformer.Builder() - .setRegexPattern(Regex.VALID_REPLY) - .setReplaceString("") - .setTriggeringChar('@') - .build(), - // remove the old style retweet, eg RT: @mention or via @mention - new RtRemovalTransformer() - ); - - // for signature generation - private static final int MIN_NUM_FEATURES = 2; - private final SignatureGenerator signatureGenerator = new SignatureGenerator( - new RandomSubstringExtractor( - TweetIntegerShingleSignature.NUM_SHINGLES, // number of signatures - MIN_NUM_FEATURES, // each signature is generated by taking this number of features/tokens - // from text - false, // do not consider full tweet text as a feature - false)); // do not do early termination - - /** - * Given TwitterMessage, extract all interesting tweet text features and store in - * the returned TweetTextFeatures object. - * - * @param tweet TwitterMessage to extract features from - * @throws IOException - */ - public void extractTweetTextFeatures(final TwitterMessage tweet) { - Preconditions.checkNotNull(tweet); - - for (PenguinVersion penguinVersion : tweet.getSupportedPenguinVersions()) { - // Get basic features. - TweetTextFeatures textFeatures = tweet.getTweetTextFeatures(penguinVersion); - - extractCharLength(textFeatures); - - // Signature that hashes on text with resolved urls, aggressively remove RT tags, which - // accounts for more than 50% of neardups, also remove @mentions. - // we use resolved urls for signature since they are what matters. - CharSequence strippedText = tweet.getTextReplacedWithResolvedURLs(); - strippedText = strippedText == null ? "" : strippedText; - strippedText = STATUS_TEXT_CLEANER.transform(strippedText); - - // Generate the signature. - // will lower case, use penguin - String normalizedSignatureText = - NormalizerHelper.normalize(strippedText, tweet.getLocale(), penguinVersion); - if (normalizedSignatureText != null && !normalizedSignatureText.isEmpty()) { - Set rawSignature = - signatureGenerator.generateSignatureByteArray(normalizedSignatureText); - textFeatures.setSignature((new TweetIntegerShingleSignature(rawSignature)).serialize()); - } - } - } - - /** - * Compute number of letters in stripped tweet text, also records unsupported char counts. - * - * @param textFeatures TweetTextFeatures object to store letter length, unsupported chars, etc. 
- */ - private static void extractCharLength(final TweetTextFeatures textFeatures) { - Preconditions.checkNotNull(textFeatures); - int length = 0; - int caps = 0; - String strippedText = textFeatures.getNormalizedStrippedText(); - if (strippedText != null && !strippedText.isEmpty()) { - for (char c : strippedText.toCharArray()) { - if (Character.isLetter(c)) { - length++; - if (Character.isUpperCase(c)) { - caps++; - } - } - } - } - textFeatures.setLength(length); - textFeatures.setCaps(caps); - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetTextClassifier.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetTextClassifier.java deleted file mode 100644 index d45d18e11..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetTextClassifier.java +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import java.util.List; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.relevance.config.TweetProcessingConfig; -import com.twitter.search.common.relevance.entities.TwitterMessage; - -/** - * Classifier that focuses on tweet text features and their corresponding - * quality. - */ -public class TweetTextClassifier extends TweetClassifier { - private TweetQualityFeatureExtractor featureExtractor = new TweetQualityFeatureExtractor(); - private TweetTrendsExtractor trendsExtractor = null; - - /** - * Constructor. Requires a list of TweetQualityEvaluator objects. - * @param evaluators list of TweetQualityEvaluator objects responsible for quality evaluation. - * @param serviceIdentifier The identifier of the calling service. - * @param supportedPenguinVersions A list of supported penguin versions. - */ - public TweetTextClassifier( - final Iterable evaluators, - ServiceIdentifier serviceIdentifier, - List supportedPenguinVersions) { - Preconditions.checkNotNull(evaluators); - setQualityEvaluators(evaluators); - TweetProcessingConfig.init(); - - if (TweetProcessingConfig.getBool("extract_trends", false)) { - trendsExtractor = new TweetTrendsExtractor(serviceIdentifier, supportedPenguinVersions); - } - } - - /** - * Extract text features for the specified TwitterMessage. - * - * @param tweet TwitterMessage to extract features from. - */ - @Override - protected void extractFeatures(TwitterMessage tweet) { - extractFeatures(Lists.newArrayList(tweet)); - } - - /** - * Extract text features for the specified list of TwitterMessages. - * - * @param tweets list of TwitterMessages to extract interesting features for - */ - @Override - protected void extractFeatures(Iterable tweets) { - Preconditions.checkNotNull(tweets); - for (TwitterMessage tweet : tweets) { - featureExtractor.extractTweetTextFeatures(tweet); - } - - // Optionally try to annotate trends for all the tweets. 
- if (TweetProcessingConfig.getBool("extract_trends", false) && trendsExtractor != null) { - trendsExtractor.extractTrends(tweets); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetTextEvaluator.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetTextEvaluator.java deleted file mode 100644 index db70c6c29..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetTextEvaluator.java +++ /dev/null @@ -1,54 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import java.util.List; -import java.util.Map; -import java.util.function.Function; -import java.util.stream.Collectors; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.relevance.features.TweetTextQuality; - -/** - * Calculates entropy of tweet text based on tokens. - */ -public class TweetTextEvaluator extends TweetEvaluator { - - @Override - public void evaluate(final TwitterMessage tweet) { - for (PenguinVersion penguinVersion : tweet.getSupportedPenguinVersions()) { - TweetTextFeatures textFeatures = tweet.getTweetTextFeatures(penguinVersion); - TweetTextQuality textQuality = tweet.getTweetTextQuality(penguinVersion); - - double readability = 0; - int numKeptWords = textFeatures.getStrippedTokensSize(); - for (String token : textFeatures.getStrippedTokens()) { - readability += token.length(); - } - if (numKeptWords > 0) { - readability = readability * Math.log(numKeptWords) / numKeptWords; - } - textQuality.setReadability(readability); - textQuality.setEntropy(entropy(textFeatures.getStrippedTokens())); - textQuality.setShout(textFeatures.getCaps() / Math.max(textFeatures.getLength(), 1.0d)); - } - } - - private static double entropy(List tokens) { - Map tokenCounts = - tokens.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting())); - int numItems = tokens.size(); - - double entropy = 0; - for (long count : tokenCounts.values()) { - double prob = (double) count / numItems; - entropy -= prob * log2(prob); - } - return entropy; - } - - private static double log2(double n) { - return Math.log(n) / Math.log(2); - } -} diff --git a/src/java/com/twitter/search/common/relevance/classifiers/TweetTrendsExtractor.java b/src/java/com/twitter/search/common/relevance/classifiers/TweetTrendsExtractor.java deleted file mode 100644 index a600c1697..000000000 --- a/src/java/com/twitter/search/common/relevance/classifiers/TweetTrendsExtractor.java +++ /dev/null @@ -1,165 +0,0 @@ -package com.twitter.search.common.relevance.classifiers; - -import java.util.List; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.relevance.NGramCache; -import com.twitter.search.common.relevance.TrendsThriftDataServiceManager; -import com.twitter.search.common.relevance.config.TweetProcessingConfig; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import 
com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.util.Duration; - -/** - * Determines if tweets contains trending terms. - * Sets corresponding bits and fields to TweetTextFeatures. - */ -public class TweetTrendsExtractor { - - // The amount of time before filling the trends cache for the first time. - private static final long INIT_TRENDS_CACHE_DELAY = 0; - - private static final Logger LOG = LoggerFactory.getLogger(TweetTrendsExtractor.class.getName()); - - private static final int LOGGING_INTERVAL = 100000; - - // Singleton trends data service. This is the default service used unless a different - // instance is injected in the constructor. - private static volatile TrendsThriftDataServiceManager trendsDataServiceSingleton; - - // trends cache used for extracting trends from tweets - private static volatile ImmutableMap trendsCaches; - - private static synchronized void initTrendsDataServiceInstance( - ServiceIdentifier serviceIdentifier, - List supportedPenguinVersions) { - if (trendsDataServiceSingleton == null) { - TweetProcessingConfig.init(); - if (trendsCaches == null) { - ImmutableMap.Builder trendsCachesBuilder = - ImmutableMap.builder(); - for (PenguinVersion penguinVersion : supportedPenguinVersions) { - NGramCache cache = NGramCache.builder() - .maxCacheSize( - TweetProcessingConfig.getInt("trends_extractor_num_trends_to_cache", 5000)) - .penguinVersion(penguinVersion) - .build(); - trendsCachesBuilder.put(penguinVersion, cache); - } - trendsCaches = trendsCachesBuilder.build(); - } - long rawTimeout = TweetProcessingConfig.getLong("trends_extractor_timeout_msec", 200); - long rawInterval = - TweetProcessingConfig.getLong("trends_extractor_reload_interval_sec", 600L); - trendsDataServiceSingleton = - TrendsThriftDataServiceManager.newInstance( - serviceIdentifier, - TweetProcessingConfig.getInt("trends_extractor_retry", 2), - Duration.apply(rawTimeout, TimeUnit.MILLISECONDS), - Duration.apply(INIT_TRENDS_CACHE_DELAY, TimeUnit.SECONDS), - Duration.apply(rawInterval, TimeUnit.SECONDS), - trendsCaches.values().asList() - ); - trendsDataServiceSingleton.startAutoRefresh(); - LOG.info("Started trend extractor."); - } - } - - public TweetTrendsExtractor( - ServiceIdentifier serviceIdentifier, - List supportedPenguinVersions) { - initTrendsDataServiceInstance(serviceIdentifier, supportedPenguinVersions); - } - - /** - * Extract trending terms from the specified tweet. - * @param tweet the specified tweet - */ - public void extractTrends(TwitterMessage tweet) { - extractTrends(ImmutableList.of(tweet)); - } - - /** - * Extract trending terms from the specified list of tweets. 
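 * For each Penguin version supported by a tweet, trending terms are looked up in the
 * corresponding in-memory NGramCache and appended to that tweet's TweetTextFeatures.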
- * @param tweets a list of tweets - */ - public void extractTrends(Iterable tweets) { - Preconditions.checkNotNull(tweets); - - for (TwitterMessage tweet : tweets) { - for (PenguinVersion penguinVersion : tweet.getSupportedPenguinVersions()) { - NGramCache trendsCache = trendsCaches.get(penguinVersion); - if (trendsCache == null) { - LOG.info("Trends cache for Penguin version " + penguinVersion + " is null."); - continue; - } else if (trendsCache.numTrendingTerms() == 0) { - LOG.info("Trends cache for Penguin version " + penguinVersion + " is empty."); - continue; - } - - List trendsInTweet = trendsCache.extractTrendsFrom( - tweet.getTokenizedCharSequence(penguinVersion), tweet.getLocale()); - - TweetTextFeatures textFeatures = tweet.getTweetTextFeatures(penguinVersion); - if (textFeatures == null || textFeatures.getTokens() == null) { - continue; - } - - textFeatures.getTrendingTerms().addAll(trendsInTweet); - - updateTrendsStats( - tweet, - textFeatures, - penguinVersion, - RelevanceStats.exportLong( - "trends_extractor_has_trends_" + penguinVersion.name().toLowerCase()), - RelevanceStats.exportLong( - "trends_extractor_no_trends_" + penguinVersion.name().toLowerCase()), - RelevanceStats.exportLong( - "trends_extractor_too_many_trends_" + penguinVersion.name().toLowerCase())); - } - } - } - - private void updateTrendsStats(TwitterMessage tweet, - TweetTextFeatures textFeatures, - PenguinVersion penguinVersion, - SearchCounter hasTrendsCounterToUpdate, - SearchCounter noTrendsCounterToUpdate, - SearchCounter tooManyTrendsCounterToUpdate) { - int numTrendingTerms = textFeatures.getTrendingTerms().size(); - if (numTrendingTerms == 0) { - noTrendsCounterToUpdate.increment(); - } else { - if (numTrendingTerms > 1) { - tooManyTrendsCounterToUpdate.increment(); - } - hasTrendsCounterToUpdate.increment(); - } - - long counter = noTrendsCounterToUpdate.get(); - if (counter % LOGGING_INTERVAL == 0) { - long hasTrends = hasTrendsCounterToUpdate.get(); - long noTrends = noTrendsCounterToUpdate.get(); - long tooManyTrends = tooManyTrendsCounterToUpdate.get(); - double ratio = 100.0d * hasTrends / (hasTrends + noTrends + 1); - double tooManyTrendsRatio = 100.0d * tooManyTrends / (hasTrends + 1); - LOG.info(String.format( - "Has trends %d, no trends %d, ratio %.2f, too many trends %.2f," - + " sample tweet id [%d] matching terms [%s] penguin version [%s]", - hasTrends, noTrends, ratio, tooManyTrendsRatio, tweet.getId(), - textFeatures.getTrendingTerms(), penguinVersion)); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/config/TweetProcessingConfig.java b/src/java/com/twitter/search/common/relevance/config/TweetProcessingConfig.java deleted file mode 100644 index e09472c3a..000000000 --- a/src/java/com/twitter/search/common/relevance/config/TweetProcessingConfig.java +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.common.relevance.config; - -import java.io.InputStream; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.config.ConfigFile; - -/** - * Config file for relevance computation. - */ -public final class TweetProcessingConfig { - private static final Logger LOG = LoggerFactory.getLogger(TweetProcessingConfig.class); - private static final String SCORER_CONFIG_DIR = "common/relevance/config"; - public static final String DEFAULT_CONFIG_FILE = "relevance.yml"; - private static ConfigFile relevanceConfig = null; - - private TweetProcessingConfig() { - } - - /** Initializes this instance from the given config file. 
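Falls back to DEFAULT_CONFIG_FILE (relevance.yml) when configFile is null; repeated calls are no-ops once the config has been loaded.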
*/ - public static void init(String configFile) { - if (relevanceConfig == null) { - synchronized (TweetProcessingConfig.class) { - if (relevanceConfig == null) { - String file = configFile == null ? DEFAULT_CONFIG_FILE : configFile; - relevanceConfig = new ConfigFile(SCORER_CONFIG_DIR, file); - } - } - } - } - - /** Initializes this instance from the given input stream. */ - public static void init(InputStream inputStream, String configType) { - if (relevanceConfig == null) { - synchronized (TweetProcessingConfig.class) { - if (relevanceConfig == null) { - relevanceConfig = new ConfigFile(inputStream, configType); - } - } - } - } - - /** Initializes this instance. */ - public static void init() { - init(null); - } - - /** - * Returns the value of the given property as a double value. - * - * @param property The property. - * @param defaultValue The default value to return if the property is not present in the config. - */ - public static double getDouble(String property, double defaultValue) { - return relevanceConfig.getDouble(property, defaultValue); - } - - /** - * Returns the value of the given property as a string value. - * - * @param property The property. - * @param defaultValue The default value to return if the property is not present in the config. - */ - public static String getString(String property, String defaultValue) { - return relevanceConfig.getString(property, defaultValue); - } - - /** - * Returns the value of the given property as an integer value. - * - * @param property The property. - * @param defaultValue The default value to return if the property is not present in the config. - */ - public static int getInt(String property, int defaultValue) { - return relevanceConfig.getInt(property, defaultValue); - } - - /** - * Returns the value of the given property as a long value. - * - * @param property The property. - * @param defaultValue The default value to return if the property is not present in the config. - */ - public static long getLong(String property, long defaultValue) { - return relevanceConfig.getLong(property, defaultValue); - } - - /** - * Returns the value of the given property as a boolean value. - * - * @param property The property. - * @param defaultValue The default value to return if the property is not present in the config. - */ - public static boolean getBool(String property, boolean defaultValue) { - return relevanceConfig.getBool(property, defaultValue); - } - - /** - * Returns the value of the given property as a string. - * - * @param property The property. - * @throws ConfigurationException If the given property is not found in the config. 
- */ - public static String getString(String property) { - try { - return relevanceConfig.getString(property); - } catch (ConfigurationException e) { - LOG.error("Fatal error: could not get config string " + property, e); - throw new RuntimeException(e); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/GeoObject.java b/src/java/com/twitter/search/common/relevance/entities/GeoObject.java deleted file mode 100644 index ef49c98a6..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/GeoObject.java +++ /dev/null @@ -1,201 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.util.List; -import java.util.Optional; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.indexing.thriftjava.ThriftGeoTags; -import com.twitter.tweetypie.thriftjava.GeoCoordinates; -import com.twitter.tweetypie.thriftjava.Place; - -import geo.google.datamodel.GeoAddressAccuracy; - -/** - * A GeoObject, extending a GeoCoordinate to include radius and accuracy - */ -public class GeoObject { - - public static final int INT_FIELD_NOT_PRESENT = -1; - public static final double DOUBLE_FIELD_NOT_PRESENT = -1.0; - - private double latitude = DOUBLE_FIELD_NOT_PRESENT; - private double longitude = DOUBLE_FIELD_NOT_PRESENT; - private double radius = DOUBLE_FIELD_NOT_PRESENT; - - private final ThriftGeoLocationSource source; - - // Valid range is 0-9. With 0 being unknown and 9 being most accurate. - // If this GeoObject is valid, this should be set to INT_FIELD_NOT_PRESENT - private int accuracy = 0; - - /** Creates a new GeoObject instance. */ - public GeoObject(double lat, double lon, ThriftGeoLocationSource source) { - this(lat, lon, 0, source); - } - - /** Creates a new GeoObject instance. */ - public GeoObject(double lat, double lon, int acc, ThriftGeoLocationSource source) { - latitude = lat; - longitude = lon; - accuracy = acc; - this.source = source; - } - - /** Creates a new GeoObject instance. */ - public GeoObject(ThriftGeoLocationSource source) { - this.source = source; - } - - /** - * Tries to create a {@code GeoObject} instance from a given TweetyPie {@code Place} struct based - * on its bounding box coordinates. - * - * @param place - * @return {@code Optional} instance with {@code GeoObject} if bounding box coordinates are - * available, or an empty {@code Optional}. - */ - public static Optional fromPlace(Place place) { - // Can't use place.centroid: from the sample of data, centroid seems to always be null - // (as of May 17 2016). - if (place.isSetBounding_box() && place.getBounding_boxSize() > 0) { - int pointsCount = place.getBounding_boxSize(); - - if (pointsCount == 1) { - GeoCoordinates point = place.getBounding_box().get(0); - return Optional.of(createForIngester(point.getLatitude(), point.getLongitude())); - } else { - double sumLatitude = 0.0; - double sumLongitude = 0.0; - - List box = place.getBounding_box(); - - // Drop the last point if it's the same as the first point. - // The same logic is present in several other classes dealing with places. - // See e.g. 
birdherd/src/main/scala/com/twitter/birdherd/tweetypie/TweetyPiePlace.scala - if (box.get(pointsCount - 1).equals(box.get(0))) { - pointsCount--; - } - - for (int i = 0; i < pointsCount; i++) { - GeoCoordinates coords = box.get(i); - sumLatitude += coords.getLatitude(); - sumLongitude += coords.getLongitude(); - } - - double averageLatitude = sumLatitude / pointsCount; - double averageLongitude = sumLongitude / pointsCount; - return Optional.of(GeoObject.createForIngester(averageLatitude, averageLongitude)); - } - } - return Optional.empty(); - } - - public void setRadius(double radius) { - this.radius = radius; - } - - public Double getRadius() { - return radius; - } - - public void setLatitude(double latitude) { - this.latitude = latitude; - } - - public Double getLatitude() { - return latitude; - } - - public void setLongitude(double longitude) { - this.longitude = longitude; - } - - public Double getLongitude() { - return longitude; - } - - public int getAccuracy() { - return accuracy; - } - - public void setAccuracy(int accuracy) { - this.accuracy = accuracy; - } - - public ThriftGeoLocationSource getSource() { - return source; - } - - /** Convers this GeoObject instance to a ThriftGeoTags instance. */ - public ThriftGeoTags toThriftGeoTags(long twitterMessageId) { - ThriftGeoTags geoTags = new ThriftGeoTags(); - geoTags.setStatusId(twitterMessageId); - geoTags.setLatitude(getLatitude()); - geoTags.setLongitude(getLongitude()); - geoTags.setAccuracy(accuracy); - geoTags.setGeoLocationSource(source); - return geoTags; - } - - private static final double COORDS_EQUALITY_THRESHOLD = 1e-7; - - /** - * Performs an approximate comparison between the two GeoObject instances. - * - * @deprecated This code is not performant and should not be used in - * production code. Use only for tests. See SEARCH-5148. - */ - @Deprecated - @VisibleForTesting - public static boolean approxEquals(GeoObject a, GeoObject b) { - if (a == null && b == null) { - return true; - } - if ((a == null && b != null) || (a != null && b == null)) { - return false; - } - - if (a.accuracy != b.accuracy) { - return false; - } - if (Math.abs(a.latitude - b.latitude) > COORDS_EQUALITY_THRESHOLD) { - return false; - } - if (Math.abs(a.longitude - b.longitude) > COORDS_EQUALITY_THRESHOLD) { - return false; - } - if (Double.compare(a.radius, b.radius) != 0) { - return false; - } - if (a.source != b.source) { - return false; - } - - return true; - } - - @Override - public String toString() { - return "GeoObject{" - + "latitude=" + latitude - + ", longitude=" + longitude - + ", radius=" + radius - + ", source=" + source - + ", accuracy=" + accuracy - + '}'; - } - - /** - * Convenience factory method for ingester purposes. 
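 * The returned GeoObject uses the most accurate level (POINT_LEVEL) and the GEOTAG
 * location source.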
- */ - public static GeoObject createForIngester(double latitude, double longitude) { - return new GeoObject( - latitude, - longitude, - // store with highest level of accuracy: POINT_LEVEL - GeoAddressAccuracy.POINT_LEVEL.getCode(), - ThriftGeoLocationSource.GEOTAG); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/PotentialLocationObject.java b/src/java/com/twitter/search/common/relevance/entities/PotentialLocationObject.java deleted file mode 100644 index 5547e7d5d..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/PotentialLocationObject.java +++ /dev/null @@ -1,122 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.util.Locale; - -import com.google.common.base.Preconditions; - -import org.apache.commons.lang.StringUtils; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.PotentialLocation; -import com.twitter.search.common.util.text.LanguageIdentifierHelper; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.text.TokenizerHelper; - -/** - * An immutable tuple to wrap a country code, region and locality. Based on the PotentialLocation - * struct in status.thrift. - */ -public class PotentialLocationObject { - private final String countryCode; - private final String region; - private final String locality; - - /** - * Creates a new PotentialLocationObject instance. - * - * @param countryCode The country code. - * @param region The region. - * @param locality The locality. - */ - public PotentialLocationObject(String countryCode, String region, String locality) { - this.countryCode = countryCode; - this.region = region; - this.locality = locality; - } - - public String getCountryCode() { - return countryCode; - } - - public String getRegion() { - return region; - } - - public String getLocality() { - return locality; - } - - /** - * Converts this PotentialLocationObject instance to a PotentialLocation thrift struct. - * - * @param penguinVersion The penguin version to use for normalization and tokenization. - */ - public PotentialLocation toThriftPotentialLocation(PenguinVersion penguinVersion) { - Preconditions.checkNotNull(penguinVersion); - - String normalizedCountryCode = null; - if (countryCode != null) { - Locale countryCodeLocale = LanguageIdentifierHelper.identifyLanguage(countryCode); - normalizedCountryCode = - NormalizerHelper.normalize(countryCode, countryCodeLocale, penguinVersion); - } - - String tokenizedRegion = null; - if (region != null) { - Locale regionLocale = LanguageIdentifierHelper.identifyLanguage(region); - String normalizedRegion = NormalizerHelper.normalize(region, regionLocale, penguinVersion); - tokenizedRegion = StringUtils.join( - TokenizerHelper.tokenizeQuery(normalizedRegion, regionLocale, penguinVersion), " "); - } - - String tokenizedLocality = null; - if (locality != null) { - Locale localityLocale = LanguageIdentifierHelper.identifyLanguage(locality); - String normalizedLocality = - NormalizerHelper.normalize(locality, localityLocale, penguinVersion); - tokenizedLocality = - StringUtils.join(TokenizerHelper.tokenizeQuery( - normalizedLocality, localityLocale, penguinVersion), " "); - } - - return new PotentialLocation() - .setCountryCode(normalizedCountryCode) - .setRegion(tokenizedRegion) - .setLocality(tokenizedLocality); - } - - @Override - public int hashCode() { - return ((countryCode == null) ? 0 : countryCode.hashCode()) - + 13 * ((region == null) ? 
0 : region.hashCode()) - + 19 * ((locality == null) ? 0 : locality.hashCode()); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof PotentialLocationObject)) { - return false; - } - - PotentialLocationObject entry = (PotentialLocationObject) obj; - return (countryCode == null - ? entry.countryCode == null - : countryCode.equals(entry.countryCode)) - && (region == null - ? entry.region == null - : region.equals(entry.region)) - && (locality == null - ? entry.locality == null - : locality.equals(entry.locality)); - } - - @Override - public String toString() { - return new StringBuilder("PotentialLocationObject {") - .append("countryCode=").append(countryCode) - .append(", region=").append(region) - .append(", locality=").append(locality) - .append("}") - .toString(); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/TwitterMessage.java b/src/java/com/twitter/search/common/relevance/entities/TwitterMessage.java deleted file mode 100644 index 524c558b2..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/TwitterMessage.java +++ /dev/null @@ -1,1267 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.text.DateFormat; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collection; -import java.util.Collections; -import java.util.Date; -import java.util.HashSet; -import java.util.LinkedHashMap; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Optional; -import java.util.Set; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ComparisonChain; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.lang3.builder.EqualsBuilder; -import org.apache.commons.lang3.builder.HashCodeBuilder; -import org.apache.commons.lang3.builder.ToStringBuilder; -import org.apache.lucene.analysis.TokenStream; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.common.text.pipeline.TwitterLanguageIdentifier; -import com.twitter.common.text.token.TokenizedCharSequence; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntity; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.relevance.features.TweetFeatures; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.relevance.features.TweetTextQuality; -import com.twitter.search.common.relevance.features.TweetUserFeatures; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.service.spiderduck.gen.MediaTypes; -import com.twitter.tweetypie.thriftjava.ComposerSource; -import com.twitter.util.TwitterDateFormat; - -/** - * A representation of tweets used as an intermediate object during ingestion. As we proceed - * in ingestion, we fill this object with data. We then convert it to ThriftVersionedEvents (which - * itself represents a single tweet too, in different penguin versions potentially). 
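 *
 * Text-derived data (features, text quality, tokenization, normalized hashtags) is kept
 * per Penguin version, so most feature accessors take a PenguinVersion argument.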
- */ -public class TwitterMessage { - private static final Logger LOG = LoggerFactory.getLogger(TwitterMessage.class); - - public static class EscherbirdAnnotation implements Comparable { - public final long groupId; - public final long domainId; - public final long entityId; - - public EscherbirdAnnotation(long groupId, long domainId, long entityId) { - this.groupId = groupId; - this.domainId = domainId; - this.entityId = entityId; - } - - @Override - public boolean equals(Object o2) { - if (o2 instanceof EscherbirdAnnotation) { - EscherbirdAnnotation a2 = (EscherbirdAnnotation) o2; - return groupId == a2.groupId && domainId == a2.domainId && entityId == a2.entityId; - } - return false; - } - - @Override - public int hashCode() { - return new HashCodeBuilder() - .append(groupId) - .append(domainId) - .append(entityId) - .toHashCode(); - } - - @Override - public int compareTo(EscherbirdAnnotation o) { - return ComparisonChain.start() - .compare(this.groupId, o.groupId) - .compare(this.domainId, o.domainId) - .compare(this.entityId, o.entityId) - .result(); - } - } - - private final List escherbirdAnnotations = Lists.newArrayList(); - - // tweet features for multiple penguin versions - private static class VersionedTweetFeatures { - // TweetFeatures populated by relevance classifiers, structure defined - // in src/main/thrift/classifier.thrift. - private TweetFeatures tweetFeatures = new TweetFeatures(); - private TokenizedCharSequence tokenizedCharSequence = null; - private Set normalizedHashtags = Sets.newHashSet(); - - public TweetFeatures getTweetFeatures() { - return this.tweetFeatures; - } - - public void setTweetFeatures(final TweetFeatures tweetFeatures) { - this.tweetFeatures = tweetFeatures; - } - - public TweetTextQuality getTweetTextQuality() { - return this.tweetFeatures.getTweetTextQuality(); - } - - public TweetTextFeatures getTweetTextFeatures() { - return this.tweetFeatures.getTweetTextFeatures(); - } - - public TweetUserFeatures getTweetUserFeatures() { - return this.tweetFeatures.getTweetUserFeatures(); - } - - public TokenizedCharSequence getTokenizedCharSequence() { - return this.tokenizedCharSequence; - } - - public void setTokenizedCharSequence(TokenizedCharSequence sequence) { - this.tokenizedCharSequence = sequence; - } - - public Set getNormalizedHashtags() { - return this.normalizedHashtags; - } - - public void addNormalizedHashtags(String normalizedHashtag) { - this.normalizedHashtags.add(normalizedHashtag); - } - } - - public static final int INT_FIELD_NOT_PRESENT = -1; - public static final long LONG_FIELD_NOT_PRESENT = -1; - public static final double DOUBLE_FIELD_NOT_PRESENT = -1; - public static final int MAX_USER_REPUTATION = 100; - - private final long tweetId; - - private String text; - - private Date date; - @Nonnull - private Optional optionalFromUser = Optional.empty(); - @Nonnull - private Optional optionalToUser = Optional.empty(); - private Locale locale = null; - private Locale linkLocale = null; - - // Original source text. - private String origSource; - // Source with HTML tags removed and truncated. - private String strippedSource; - - // Original location text. - private String origLocation; - - // Location truncated for mysql field-width reasons (see TwitterMessageUtil.java). 
- private String truncatedNormalizedLocation; - - // User's country - private String fromUserLocCountry; - - private Integer followersCount = INT_FIELD_NOT_PRESENT; - private boolean deleted = false; - - // Fields extracted from entities (in the JSON object) - private List mentions = new ArrayList<>(); - private Set hashtags = Sets.newHashSet(); - // Lat/lon and region accuracy tuples extracted from tweet text, or null. - private GeoObject geoLocation = null; - private boolean uncodeableLocation = false; - // This is set if the tweet is geotagged. (i.e. "geo" or "coordinate" section is present - // in the json) - // This field has only a getter but no setter --- it is filled in when the json is parsed. - private GeoObject geoTaggedLocation = null; - - private double userReputation = DOUBLE_FIELD_NOT_PRESENT; - private boolean geocodeRequired = false; - private boolean sensitiveContent = false; - private boolean userProtected; - private boolean userVerified; - private boolean userBlueVerified; - private TwitterRetweetMessage retweetMessage; - private TwitterQuotedMessage quotedMessage; - private List places; - // maps from original url (the t.co url) to ThriftExpandedUrl, which contains the - // expanded url and the spiderduck response (canoicalLastHopUrl and mediatype) - private final Map expandedUrls; - // maps the photo status id to the media url - private Map photoUrls; - private Optional inReplyToStatusId = Optional.empty(); - private Optional directedAtUserId = Optional.empty(); - - private long conversationId = -1; - - // True if tweet is nullcasted. - private boolean nullcast = false; - - // True if tweet is a self-threaded tweet - private boolean selfThread = false; - - // If the tweet is a part of an exclusive conversation, the author who started - // that conversation. - private Optional exclusiveConversationAuthorId = Optional.empty(); - - // tweet features map for multiple versions of penguin - private Map versionedTweetFeaturesMap; - - // Engagments count: favorites, retweets and replies - private int numFavorites = 0; - private int numRetweets = 0; - private int numReplies = 0; - - // Card information - private String cardName; - private String cardDomain; - private String cardTitle; - private String cardDescription; - private String cardLang; - private String cardUrl; - - private String placeId; - private String placeFullName; - private String placeCountryCode; - - private Set namedEntities = Sets.newHashSet(); - - // Spaces data - private Set spaceIds = Sets.newHashSet(); - private Set spaceAdmins = Sets.newHashSet(); - private String spaceTitle; - - private Optional composerSource = Optional.empty(); - - private final List potentialLocations = Lists.newArrayList(); - - // one or two penguin versions supported by this system - private final List supportedPenguinVersions; - - public TwitterMessage(Long tweetId, List supportedPenguinVersions) { - this.tweetId = tweetId; - this.places = new ArrayList<>(); - this.expandedUrls = new LinkedHashMap<>(); - // make sure we support at least one, but no more than two versions of penguin - this.supportedPenguinVersions = supportedPenguinVersions; - this.versionedTweetFeaturesMap = getVersionedTweetFeaturesMap(); - Preconditions.checkArgument(this.supportedPenguinVersions.size() <= 2 - && this.supportedPenguinVersions.size() > 0); - } - - /** - * Replace to-user with in-reply-to user if needed. 
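 * The in-reply-to user takes over when no to-user was derived from a mention, when the
 * mention-derived to-user is a stub without an id, or when that id differs from the
 * in-reply-to user id.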
- */ - public void replaceToUserWithInReplyToUserIfNeeded( - String inReplyToScreenName, long inReplyToUserId) { - if (shouldUseReplyUserAsToUser(optionalToUser, inReplyToUserId)) { - TwitterMessageUser replyUser = - TwitterMessageUser.createWithNamesAndId(inReplyToScreenName, "", inReplyToUserId); - optionalToUser = Optional.of(replyUser); - } - } - - // To-user could have been inferred from the mention at the position 0. - // But if there is an explicit in-reply-to user, we might need to use it as to-user instead. - private static boolean shouldUseReplyUserAsToUser( - Optional currentToUser, - long inReplyToUserId) { - if (!currentToUser.isPresent()) { - // There is no mention in the tweet that qualifies as to-user. - return true; - } - - // We already have a mention in the tweet that qualifies as to-user. - TwitterMessageUser toUser = currentToUser.get(); - if (!toUser.getId().isPresent()) { - // The to-user from the mention is a stub. - return true; - } - - long toUserId = toUser.getId().get(); - if (toUserId != inReplyToUserId) { - // The to-user from the mention is different that the in-reply-to user, - // use in-reply-to user instead. - return true; - } - - return false; - } - - public double getUserReputation() { - return userReputation; - } - - /** - * Sets the user reputation. - */ - public TwitterMessage setUserReputation(double newUserReputation) { - if (newUserReputation > MAX_USER_REPUTATION) { - LOG.warn("Out of bounds user reputation {} for status id {}", newUserReputation, tweetId); - this.userReputation = (float) MAX_USER_REPUTATION; - } else { - this.userReputation = newUserReputation; - } - return this; - } - - public String getText() { - return text; - } - - public Optional getOptionalToUser() { - return optionalToUser; - } - - public void setOptionalToUser(Optional optionalToUser) { - this.optionalToUser = optionalToUser; - } - - public void setText(String text) { - this.text = text; - } - - public Date getDate() { - return date; - } - - public void setDate(Date date) { - this.date = date; - } - - public void setFromUser(@Nonnull TwitterMessageUser fromUser) { - Preconditions.checkNotNull(fromUser, "Don't set a null fromUser"); - optionalFromUser = Optional.of(fromUser); - } - - public Optional getFromUserScreenName() { - return optionalFromUser.isPresent() - ? optionalFromUser.get().getScreenName() - : Optional.empty(); - } - - /** - * Sets the fromUserScreenName. - */ - public void setFromUserScreenName(@Nonnull String fromUserScreenName) { - TwitterMessageUser newFromUser = optionalFromUser.isPresent() - ? optionalFromUser.get().copyWithScreenName(fromUserScreenName) - : TwitterMessageUser.createWithScreenName(fromUserScreenName); - - optionalFromUser = Optional.of(newFromUser); - } - - public Optional getTokenizedFromUserScreenName() { - return optionalFromUser.flatMap(TwitterMessageUser::getTokenizedScreenName); - } - - public Optional getFromUserDisplayName() { - return optionalFromUser.flatMap(TwitterMessageUser::getDisplayName); - } - - /** - * Sets the fromUserDisplayName. - */ - public void setFromUserDisplayName(@Nonnull String fromUserDisplayName) { - TwitterMessageUser newFromUser = optionalFromUser.isPresent() - ? optionalFromUser.get().copyWithDisplayName(fromUserDisplayName) - : TwitterMessageUser.createWithDisplayName(fromUserDisplayName); - - optionalFromUser = Optional.of(newFromUser); - } - - public Optional getFromUserTwitterId() { - return optionalFromUser.flatMap(TwitterMessageUser::getId); - } - - /** - * Sets the fromUserId. 
- */ - public void setFromUserId(long fromUserId) { - TwitterMessageUser newFromUser = optionalFromUser.isPresent() - ? optionalFromUser.get().copyWithId(fromUserId) - : TwitterMessageUser.createWithId(fromUserId); - - optionalFromUser = Optional.of(newFromUser); - } - - public long getConversationId() { - return conversationId; - } - - public void setConversationId(long conversationId) { - this.conversationId = conversationId; - } - - public boolean isUserProtected() { - return this.userProtected; - } - - public void setUserProtected(boolean userProtected) { - this.userProtected = userProtected; - } - - public boolean isUserVerified() { - return this.userVerified; - } - - public void setUserVerified(boolean userVerified) { - this.userVerified = userVerified; - } - - public boolean isUserBlueVerified() { - return this.userBlueVerified; - } - - public void setUserBlueVerified(boolean userBlueVerified) { - this.userBlueVerified = userBlueVerified; - } - - public void setIsSensitiveContent(boolean isSensitiveContent) { - this.sensitiveContent = isSensitiveContent; - } - - public boolean isSensitiveContent() { - return this.sensitiveContent; - } - - public Optional getToUserObject() { - return optionalToUser; - } - - public void setToUserObject(@Nonnull TwitterMessageUser user) { - Preconditions.checkNotNull(user, "Don't set a null to-user"); - optionalToUser = Optional.of(user); - } - - public Optional getToUserTwitterId() { - return optionalToUser.flatMap(TwitterMessageUser::getId); - } - - /** - * Sets toUserId. - */ - public void setToUserTwitterId(long toUserId) { - TwitterMessageUser newToUser = optionalToUser.isPresent() - ? optionalToUser.get().copyWithId(toUserId) - : TwitterMessageUser.createWithId(toUserId); - - optionalToUser = Optional.of(newToUser); - } - - public Optional getToUserLowercasedScreenName() { - return optionalToUser.flatMap(TwitterMessageUser::getScreenName).map(String::toLowerCase); - } - - public Optional getToUserScreenName() { - return optionalToUser.flatMap(TwitterMessageUser::getScreenName); - } - - /** - * Sets toUserScreenName. - */ - public void setToUserScreenName(@Nonnull String screenName) { - Preconditions.checkNotNull(screenName, "Don't set a null to-user screenname"); - - TwitterMessageUser newToUser = optionalToUser.isPresent() - ? optionalToUser.get().copyWithScreenName(screenName) - : TwitterMessageUser.createWithScreenName(screenName); - - optionalToUser = Optional.of(newToUser); - } - - // to use from TweetEventParseHelper - public void setDirectedAtUserId(Optional directedAtUserId) { - this.directedAtUserId = directedAtUserId; - } - - @VisibleForTesting - public Optional getDirectedAtUserId() { - return directedAtUserId; - } - - /** - * Returns the referenceAuthorId. - */ - public Optional getReferenceAuthorId() { - // The semantics of reference-author-id: - // - if the tweet is a retweet, it should be the user id of the author of the original tweet - // - else, if the tweet is directed at a user, it should be the id of the user it's directed at. - // - else, if the tweet is a reply in a root self-thread, directed-at is not set, so it's - // the id of the user who started the self-thread. - // - // For definitive info on replies and directed-at, take a look at go/replies. To view these - // for a certain tweet, use http://go/t. - // - // Note that if directed-at is set, reply is always set. - // If reply is set, directed-at is not necessarily set. 
- if (isRetweet() && retweetMessage.hasSharedUserTwitterId()) { - long retweetedUserId = retweetMessage.getSharedUserTwitterId(); - return Optional.of(retweetedUserId); - } else if (directedAtUserId.isPresent()) { - // Why not replace directedAtUserId with reply and make this function depend - // on the "reply" field of TweetCoreData? - // Well, verified by counters, it seems for ~1% of tweets, which contain both directed-at - // and reply, directed-at-user is different than the reply-to-user id. This happens in the - // following case: - // - // author / reply-to / directed-at - // T1 A - - - // T2 B A A - // T3 B B A - // - // T2 is a reply to T1, T3 is a reply to T2. - // - // It's up to us to decide who this tweet is "referencing", but with the current code, - // we choose that T3 is referencing user A. - return directedAtUserId; - } else { - // This is the case of a root self-thread reply. directed-at is not set. - Optional fromUserId = this.getFromUserTwitterId(); - Optional toUserId = this.getToUserTwitterId(); - - if (fromUserId.isPresent() && fromUserId.equals(toUserId)) { - return fromUserId; - } - } - return Optional.empty(); - } - - public void setNumFavorites(int numFavorites) { - this.numFavorites = numFavorites; - } - - public void setNumRetweets(int numRetweets) { - this.numRetweets = numRetweets; - } - - public void setNumReplies(int numRepliess) { - this.numReplies = numRepliess; - } - - public void addEscherbirdAnnotation(EscherbirdAnnotation annotation) { - escherbirdAnnotations.add(annotation); - } - - public List getEscherbirdAnnotations() { - return escherbirdAnnotations; - } - - public List getPotentialLocations() { - return potentialLocations; - } - - public void setPotentialLocations(Collection potentialLocations) { - this.potentialLocations.clear(); - this.potentialLocations.addAll(potentialLocations); - } - - @Override - public String toString() { - return ToStringBuilder.reflectionToString(this); - } - - // Tweet language related getters and setters. - - /** - * Returns the locale. - *

- * Note the getLocale() will never return null, this is for the convenience of text related - * processing in the ingester. If you want the real locale, you need to check isSetLocale() - * first to see if we really have any information about the locale of this tweet. - */ - public Locale getLocale() { - if (locale == null) { - return TwitterLanguageIdentifier.UNKNOWN; - } else { - return locale; - } - } - - public void setLocale(Locale locale) { - this.locale = locale; - } - - /** - * Determines if the locate is set. - */ - public boolean isSetLocale() { - return locale != null; - } - - /** - * Returns the language of the locale. E.g. zh - */ - public String getLanguage() { - if (isSetLocale()) { - return getLocale().getLanguage(); - } else { - return null; - } - } - - /** - * Returns the IETF BCP 47 Language Tag of the locale. E.g. zh-CN - */ - public String getBCP47LanguageTag() { - if (isSetLocale()) { - return getLocale().toLanguageTag(); - } else { - return null; - } - } - - public void setLanguage(String language) { - if (language != null) { - locale = LocaleUtil.getLocaleOf(language); - } - } - - // Tweet link language related getters and setters. - public Locale getLinkLocale() { - return linkLocale; - } - - public void setLinkLocale(Locale linkLocale) { - this.linkLocale = linkLocale; - } - - /** - * Returns the language of the link locale. - */ - public String getLinkLanguage() { - if (this.linkLocale == null) { - return null; - } else { - return this.linkLocale.getLanguage(); - } - } - - public String getOrigSource() { - return origSource; - } - - public void setOrigSource(String origSource) { - this.origSource = origSource; - } - - public String getStrippedSource() { - return strippedSource; - } - - public void setStrippedSource(String strippedSource) { - this.strippedSource = strippedSource; - } - - public String getOrigLocation() { - return origLocation; - } - - public String getLocation() { - return truncatedNormalizedLocation; - } - - public void setOrigLocation(String origLocation) { - this.origLocation = origLocation; - } - - public void setTruncatedNormalizedLocation(String truncatedNormalizedLocation) { - this.truncatedNormalizedLocation = truncatedNormalizedLocation; - } - - public boolean hasFromUserLocCountry() { - return fromUserLocCountry != null; - } - - public String getFromUserLocCountry() { - return fromUserLocCountry; - } - - public void setFromUserLocCountry(String fromUserLocCountry) { - this.fromUserLocCountry = fromUserLocCountry; - } - - public String getTruncatedNormalizedLocation() { - return truncatedNormalizedLocation; - } - - public Integer getFollowersCount() { - return followersCount; - } - - public void setFollowersCount(Integer followersCount) { - this.followersCount = followersCount; - } - - public boolean hasFollowersCount() { - return followersCount != INT_FIELD_NOT_PRESENT; - } - - public boolean isDeleted() { - return deleted; - } - - public void setDeleted(boolean deleted) { - this.deleted = deleted; - } - - public boolean hasCard() { - return !StringUtils.isBlank(getCardName()); - } - - @Override - public int hashCode() { - return ((Long) getId()).hashCode(); - } - - /** - * Parses the given date using the TwitterDateFormat. 
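 * The expected pattern is "EEE MMM d HH:mm:ss Z yyyy", e.g. "Wed Oct 10 20:19:24 +0000 2018";
 * null is returned if the input cannot be parsed.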
- */ - public static Date parseDate(String date) { - DateFormat parser = TwitterDateFormat.apply("EEE MMM d HH:mm:ss Z yyyy"); - try { - return parser.parse(date); - } catch (Exception e) { - return null; - } - } - - public boolean hasGeoLocation() { - return geoLocation != null; - } - - public void setGeoLocation(GeoObject location) { - this.geoLocation = location; - } - - public GeoObject getGeoLocation() { - return geoLocation; - } - - public String getPlaceId() { - return placeId; - } - - public void setPlaceId(String placeId) { - this.placeId = placeId; - } - - public String getPlaceFullName() { - return placeFullName; - } - - public void setPlaceFullName(String placeFullName) { - this.placeFullName = placeFullName; - } - - public String getPlaceCountryCode() { - return placeCountryCode; - } - - public void setPlaceCountryCode(String placeCountryCode) { - this.placeCountryCode = placeCountryCode; - } - - public void setGeoTaggedLocation(GeoObject geoTaggedLocation) { - this.geoTaggedLocation = geoTaggedLocation; - } - - public GeoObject getGeoTaggedLocation() { - return geoTaggedLocation; - } - - public void setLatLon(double latitude, double longitude) { - geoLocation = new GeoObject(latitude, longitude, null); - } - - public Double getLatitude() { - return hasGeoLocation() ? geoLocation.getLatitude() : null; - } - - public Double getLongitude() { - return hasGeoLocation() ? geoLocation.getLongitude() : null; - } - - public boolean isUncodeableLocation() { - return uncodeableLocation; - } - - public void setUncodeableLocation() { - uncodeableLocation = true; - } - - public void setGeocodeRequired() { - this.geocodeRequired = true; - } - - public boolean isGeocodeRequired() { - return geocodeRequired; - } - - public Map getPhotoUrls() { - return photoUrls; - } - - /** - * Associates the given mediaUrl with the given photoStatusId. - */ - public void addPhotoUrl(long photoStatusId, String mediaUrl) { - if (photoUrls == null) { - photoUrls = new LinkedHashMap<>(); - } - photoUrls.putIfAbsent(photoStatusId, mediaUrl); - } - - public Map getExpandedUrlMap() { - return expandedUrls; - } - - public int getExpandedUrlMapSize() { - return expandedUrls.size(); - } - - /** - * Associates the given originalUrl with the given expanderUrl. - */ - public void addExpandedUrl(String originalUrl, ThriftExpandedUrl expandedUrl) { - this.expandedUrls.put(originalUrl, expandedUrl); - } - - /** - * Replaces urls with resolved ones. 
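 * For each expanded URL entry, the canonical last hop URL is preferred; if it is not set,
 * the expanded URL is used instead, and if neither is available the original t.co link is
 * left untouched in the text.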
- */ - public String getTextReplacedWithResolvedURLs() { - String retText = text; - for (Map.Entry entry : expandedUrls.entrySet()) { - ThriftExpandedUrl urlInfo = entry.getValue(); - String resolvedUrl; - String canonicalLastHopUrl = urlInfo.getCanonicalLastHopUrl(); - String expandedUrl = urlInfo.getExpandedUrl(); - if (canonicalLastHopUrl != null) { - resolvedUrl = canonicalLastHopUrl; - LOG.debug("{} has canonical last hop url set", urlInfo); - } else if (expandedUrl != null) { - LOG.debug("{} has no canonical last hop url set, using expanded url instead", urlInfo); - resolvedUrl = expandedUrl; - } else { - LOG.debug("{} has no canonical last hop url or expanded url set, skipping", urlInfo); - continue; - } - retText = retText.replace(entry.getKey(), resolvedUrl); - } - return retText; - } - - public long getId() { - return tweetId; - } - - public boolean isRetweet() { - return retweetMessage != null; - } - - public boolean hasQuote() { - return quotedMessage != null; - } - - public boolean isReply() { - return getToUserScreenName().isPresent() - || getToUserTwitterId().isPresent() - || getInReplyToStatusId().isPresent(); - } - - public boolean isReplyToTweet() { - return getInReplyToStatusId().isPresent(); - } - - public TwitterRetweetMessage getRetweetMessage() { - return retweetMessage; - } - - public void setRetweetMessage(TwitterRetweetMessage retweetMessage) { - this.retweetMessage = retweetMessage; - } - - public TwitterQuotedMessage getQuotedMessage() { - return quotedMessage; - } - - public void setQuotedMessage(TwitterQuotedMessage quotedMessage) { - this.quotedMessage = quotedMessage; - } - - public List getPlaces() { - return places; - } - - public void addPlace(String place) { - // Places are used for earlybird serialization - places.add(place); - } - - public Optional getInReplyToStatusId() { - return inReplyToStatusId; - } - - public void setInReplyToStatusId(long inReplyToStatusId) { - Preconditions.checkArgument(inReplyToStatusId > 0, "In-reply-to status ID should be positive"); - this.inReplyToStatusId = Optional.of(inReplyToStatusId); - } - - public boolean getNullcast() { - return nullcast; - } - - public void setNullcast(boolean nullcast) { - this.nullcast = nullcast; - } - - public List getSupportedPenguinVersions() { - return supportedPenguinVersions; - } - - private VersionedTweetFeatures getVersionedTweetFeatures(PenguinVersion penguinVersion) { - VersionedTweetFeatures versionedTweetFeatures = versionedTweetFeaturesMap.get(penguinVersion); - return Preconditions.checkNotNull(versionedTweetFeatures); - } - - public TweetFeatures getTweetFeatures(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getTweetFeatures(); - } - - @VisibleForTesting - // only used in Tests - public void setTweetFeatures(PenguinVersion penguinVersion, TweetFeatures tweetFeatures) { - versionedTweetFeaturesMap.get(penguinVersion).setTweetFeatures(tweetFeatures); - } - - public int getTweetSignature(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getTweetTextFeatures().getSignature(); - } - - public TweetTextQuality getTweetTextQuality(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getTweetTextQuality(); - } - - public TweetTextFeatures getTweetTextFeatures(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getTweetTextFeatures(); - } - - public TweetUserFeatures getTweetUserFeatures(PenguinVersion penguinVersion) { - return 
getVersionedTweetFeatures(penguinVersion).getTweetUserFeatures(); - } - - public TokenizedCharSequence getTokenizedCharSequence(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getTokenizedCharSequence(); - } - - public void setTokenizedCharSequence(PenguinVersion penguinVersion, - TokenizedCharSequence sequence) { - getVersionedTweetFeatures(penguinVersion).setTokenizedCharSequence(sequence); - } - - // True if the features contain multiple hash tags or multiple trends. - // This is intended as an anti-trend-spam measure. - public static boolean hasMultipleHashtagsOrTrends(TweetTextFeatures textFeatures) { - // Allow at most 1 trend and 2 hashtags. - return textFeatures.getTrendingTermsSize() > 1 || textFeatures.getHashtagsSize() > 2; - } - - /** - * Returns the expanded URLs. - */ - public Collection getExpandedUrls() { - return expandedUrls.values(); - } - - /** - * Returns the canonical last hop URLs. - */ - public Set getCanonicalLastHopUrls() { - Set result = new HashSet<>(expandedUrls.size()); - for (ThriftExpandedUrl url : expandedUrls.values()) { - result.add(url.getCanonicalLastHopUrl()); - } - return result; - } - - public String getCardName() { - return cardName; - } - - public void setCardName(String cardName) { - this.cardName = cardName; - } - - public String getCardDomain() { - return cardDomain; - } - - public void setCardDomain(String cardDomain) { - this.cardDomain = cardDomain; - } - - public String getCardTitle() { - return cardTitle; - } - - public void setCardTitle(String cardTitle) { - this.cardTitle = cardTitle; - } - - public String getCardDescription() { - return cardDescription; - } - - public void setCardDescription(String cardDescription) { - this.cardDescription = cardDescription; - } - - public String getCardLang() { - return cardLang; - } - - public void setCardLang(String cardLang) { - this.cardLang = cardLang; - } - - public String getCardUrl() { - return cardUrl; - } - - public void setCardUrl(String cardUrl) { - this.cardUrl = cardUrl; - } - - public List getMentions() { - return this.mentions; - } - - public void setMentions(List mentions) { - this.mentions = mentions; - } - - public List getLowercasedMentions() { - return Lists.transform(getMentions(), user -> { - // This condition is also checked in addUserToMentions(). 
- Preconditions.checkState(user.getScreenName().isPresent(), "Invalid mention"); - return user.getScreenName().get().toLowerCase(); - }); - } - - public Set getHashtags() { - return this.hashtags; - } - - public Set getNormalizedHashtags(PenguinVersion penguinVersion) { - return getVersionedTweetFeatures(penguinVersion).getNormalizedHashtags(); - } - - public void addNormalizedHashtag(String normalizedHashtag, PenguinVersion penguinVersion) { - getVersionedTweetFeatures(penguinVersion).addNormalizedHashtags(normalizedHashtag); - } - - public Optional getComposerSource() { - return composerSource; - } - - public void setComposerSource(ComposerSource composerSource) { - Preconditions.checkNotNull(composerSource, "composerSource should not be null"); - this.composerSource = Optional.of(composerSource); - } - - public boolean isSelfThread() { - return selfThread; - } - - public void setSelfThread(boolean selfThread) { - this.selfThread = selfThread; - } - - public boolean isExclusive() { - return exclusiveConversationAuthorId.isPresent(); - } - - public long getExclusiveConversationAuthorId() { - return exclusiveConversationAuthorId.get(); - } - - public void setExclusiveConversationAuthorId(long exclusiveConversationAuthorId) { - this.exclusiveConversationAuthorId = Optional.of(exclusiveConversationAuthorId); - } - - /** - * Adds an expanded media url based on the given parameters. - */ - public void addExpandedMediaUrl(String originalUrl, - String expandedUrl, - @Nullable MediaTypes mediaType) { - if (!StringUtils.isBlank(originalUrl) && !StringUtils.isBlank(expandedUrl)) { - ThriftExpandedUrl thriftExpandedUrl = new ThriftExpandedUrl(); - if (mediaType != null) { - thriftExpandedUrl.setMediaType(mediaType); - } - thriftExpandedUrl.setOriginalUrl(originalUrl); - thriftExpandedUrl.setExpandedUrl(expandedUrl); // This will be tokenized and indexed - // Note that the mediaURL is not indexed. We could also index it, but it is not indexed - // to reduce RAM usage. - thriftExpandedUrl.setCanonicalLastHopUrl(expandedUrl); // This will be tokenized and indexed - addExpandedUrl(originalUrl, thriftExpandedUrl); - thriftExpandedUrl.setConsumerMedia(true); - } - } - - /** - * Adds an expanded non-media url based on the given parameters. - */ - public void addExpandedNonMediaUrl(String originalUrl, String expandedUrl) { - if (!StringUtils.isBlank(originalUrl)) { - ThriftExpandedUrl thriftExpandedUrl = new ThriftExpandedUrl(originalUrl); - if (!StringUtils.isBlank(expandedUrl)) { - thriftExpandedUrl.setExpandedUrl(expandedUrl); - } - addExpandedUrl(originalUrl, thriftExpandedUrl); - thriftExpandedUrl.setConsumerMedia(false); - } - } - - /** - * Only used in tests. - * - * Simulates resolving compressed URLs, which is usually done by ResolveCompressedUrlsStage. - */ - @VisibleForTesting - public void replaceUrlsWithResolvedUrls(Map resolvedUrls) { - for (Map.Entry urlEntry : expandedUrls.entrySet()) { - String tcoUrl = urlEntry.getKey(); - if (resolvedUrls.containsKey(tcoUrl)) { - ThriftExpandedUrl expandedUrl = urlEntry.getValue(); - expandedUrl.setCanonicalLastHopUrl(resolvedUrls.get(tcoUrl)); - } - } - } - - /** - * Adds a mention for a user with the given screen name. - */ - public void addMention(String screenName) { - TwitterMessageUser user = TwitterMessageUser.createWithScreenName(screenName); - addUserToMentions(user); - } - - /** - * Adds the given user to mentions. 
- */ - public void addUserToMentions(TwitterMessageUser user) { - Preconditions.checkArgument(user.getScreenName().isPresent(), "Don't add invalid mentions"); - this.mentions.add(user); - } - - /** - * Adds the given hashtag. - */ - public void addHashtag(String hashtag) { - this.hashtags.add(hashtag); - for (PenguinVersion penguinVersion : supportedPenguinVersions) { - addNormalizedHashtag(NormalizerHelper.normalize(hashtag, getLocale(), penguinVersion), - penguinVersion); - } - } - - private Map getVersionedTweetFeaturesMap() { - Map versionedMap = - Maps.newEnumMap(PenguinVersion.class); - for (PenguinVersion penguinVersion : getSupportedPenguinVersions()) { - versionedMap.put(penguinVersion, new VersionedTweetFeatures()); - } - - return versionedMap; - } - - public int getNumFavorites() { - return numFavorites; - } - - public int getNumRetweets() { - return numRetweets; - } - - public int getNumReplies() { - return numReplies; - } - - public Set getNamedEntities() { - return namedEntities; - } - - public void addNamedEntity(NamedEntity namedEntity) { - namedEntities.add(namedEntity); - } - - public Set getSpaceIds() { - return spaceIds; - } - - public void setSpaceIds(Set spaceIds) { - this.spaceIds = Sets.newHashSet(spaceIds); - } - - public Set getSpaceAdmins() { - return spaceAdmins; - } - - public void addSpaceAdmin(TwitterMessageUser admin) { - spaceAdmins.add(admin); - } - - public String getSpaceTitle() { - return spaceTitle; - } - - public void setSpaceTitle(String spaceTitle) { - this.spaceTitle = spaceTitle; - } - - private static boolean equals(List l1, List l2) { - EscherbirdAnnotation[] arr1 = l1.toArray(new EscherbirdAnnotation[l1.size()]); - Arrays.sort(arr1); - EscherbirdAnnotation[] arr2 = l1.toArray(new EscherbirdAnnotation[l2.size()]); - Arrays.sort(arr2); - return Arrays.equals(arr1, arr2); - } - - /** - * Compares the given messages using reflection and determines if they're approximately equal. - */ - public static boolean reflectionApproxEquals( - TwitterMessage a, - TwitterMessage b, - List additionalExcludeFields) { - List excludeFields = Lists.newArrayList( - "versionedTweetFeaturesMap", - "geoLocation", - "geoTaggedLocation", - "escherbirdAnnotations" - ); - excludeFields.addAll(additionalExcludeFields); - - return EqualsBuilder.reflectionEquals(a, b, excludeFields) - && GeoObject.approxEquals(a.getGeoLocation(), b.getGeoLocation()) - && GeoObject.approxEquals(a.getGeoTaggedLocation(), b.getGeoTaggedLocation()) - && equals(a.getEscherbirdAnnotations(), b.getEscherbirdAnnotations()); - } - - public static boolean reflectionApproxEquals(TwitterMessage a, TwitterMessage b) { - return reflectionApproxEquals(a, b, Collections.emptyList()); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUser.java b/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUser.java deleted file mode 100644 index 6ecd5efd7..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUser.java +++ /dev/null @@ -1,231 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.util.Optional; -import javax.annotation.Nonnull; - -import com.google.common.base.Preconditions; - -import org.apache.commons.lang3.builder.EqualsBuilder; -import org.apache.commons.lang3.builder.HashCodeBuilder; -import org.apache.lucene.analysis.TokenStream; - -import com.twitter.search.common.util.text.TokenizerHelper; - -// Represents from-user, to-user, mentions and audioSpace admins in TwitterMessage. 
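As a quick illustration of how the class defined just below is meant to be used (a hedged sketch; the handles, display names, and IDs are invented), the static factory methods wrap the Builder so callers construct users without passing raw nulls:

```java
import java.util.Optional;

import com.twitter.search.common.relevance.entities.TwitterMessageUser;

// Hypothetical caller; all values here are invented for illustration.
class TwitterMessageUserExample {
  public static void main(String[] args) {
    // From-user with screen name, display name, and Twitter ID all known.
    TwitterMessageUser fromUser =
        TwitterMessageUser.createWithNamesAndId("jack", "Jack", 12L);

    // An @-mention where only the handle is known.
    TwitterMessageUser mention =
        TwitterMessageUser.createWithScreenName("twittersupport");

    // A reply target where some fields may be absent.
    TwitterMessageUser toUser =
        TwitterMessageUser.createWithOptionalNamesAndId(
            Optional.of("someuser"), Optional.empty(), Optional.of(42L));

    System.out.println(fromUser + " " + mention + " " + toUser);
  }
}
```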
-public final class TwitterMessageUser { - - @Nonnull private final Optional screenName; // a.k.a. user handle or username - @Nonnull private final Optional displayName; - - @Nonnull private Optional tokenizedScreenName; - - @Nonnull private final Optional id; // twitter ID - - public static final class Builder { - @Nonnull private Optional screenName = Optional.empty(); - @Nonnull private Optional displayName = Optional.empty(); - @Nonnull private Optional tokenizedScreenName = Optional.empty(); - @Nonnull private Optional id = Optional.empty(); - - public Builder() { - } - - /** - * Initialized Builder based on an existing TwitterMessageUser - */ - public Builder(TwitterMessageUser user) { - this.screenName = user.screenName; - this.displayName = user.displayName; - this.tokenizedScreenName = user.tokenizedScreenName; - this.id = user.id; - } - - /** - * Initialized Builder screen name (handle/the name following the "@") and do tokenization - * for it. - */ - public Builder withScreenName(Optional newScreenName) { - this.screenName = newScreenName; - if (newScreenName.isPresent()) { - this.tokenizedScreenName = Optional.of( - TokenizerHelper.getNormalizedCamelcaseTokenStream(newScreenName.get())); - } - return this; - } - - /** - * Initialized Builder display name - */ - public Builder withDisplayName(Optional newDisplayName) { - this.displayName = newDisplayName; - return this; - } - - public Builder withId(Optional newId) { - this.id = newId; - return this; - } - - public TwitterMessageUser build() { - return new TwitterMessageUser( - screenName, displayName, tokenizedScreenName, id); - } - } - - /** Creates a TwitterMessageUser instance with the given screen name. */ - public static TwitterMessageUser createWithScreenName(@Nonnull String screenName) { - Preconditions.checkNotNull(screenName, "Don't set a null screen name"); - return new Builder() - .withScreenName(Optional.of(screenName)) - .build(); - } - - /** Creates a TwitterMessageUser instance with the given display name. */ - public static TwitterMessageUser createWithDisplayName(@Nonnull String displayName) { - Preconditions.checkNotNull(displayName, "Don't set a null display name"); - return new Builder() - .withDisplayName(Optional.of(displayName)) - .build(); - } - - /** Creates a TwitterMessageUser instance with the given ID. */ - public static TwitterMessageUser createWithId(long id) { - Preconditions.checkArgument(id >= 0, "Don't sent a negative user ID"); - return new Builder() - .withId(Optional.of(id)) - .build(); - } - - /** Creates a TwitterMessageUser instance with the given parameters. */ - public static TwitterMessageUser createWithNamesAndId( - @Nonnull String screenName, - @Nonnull String displayName, - long id) { - Preconditions.checkNotNull(screenName, "Use another method instead of passing null name"); - Preconditions.checkNotNull(displayName, "Use another method instead of passing null name"); - Preconditions.checkArgument(id >= 0, "Use another method instead of passing negative ID"); - return new Builder() - .withScreenName(Optional.of(screenName)) - .withDisplayName(Optional.of(displayName)) - .withId(Optional.of(id)) - .build(); - } - - /** Creates a TwitterMessageUser instance with the given parameters. 
*/ - public static TwitterMessageUser createWithNames( - @Nonnull String screenName, - @Nonnull String displayName) { - Preconditions.checkNotNull(screenName, "Use another method instead of passing null name"); - Preconditions.checkNotNull(displayName, "Use another method instead of passing null name"); - return new Builder() - .withScreenName(Optional.of(screenName)) - .withDisplayName(Optional.of(displayName)) - .build(); - } - - /** Creates a TwitterMessageUser instance with the given parameters. */ - public static TwitterMessageUser createWithOptionalNamesAndId( - @Nonnull Optional optScreenName, - @Nonnull Optional optDisplayName, - @Nonnull Optional optId) { - Preconditions.checkNotNull(optScreenName, "Pass Optional.absent() instead of null"); - Preconditions.checkNotNull(optDisplayName, "Pass Optional.absent() instead of null"); - Preconditions.checkNotNull(optId, "Pass Optional.absent() instead of null"); - return new Builder() - .withScreenName(optScreenName) - .withDisplayName(optDisplayName) - .withId(optId) - .build(); - } - - private TwitterMessageUser( - @Nonnull Optional screenName, - @Nonnull Optional displayName, - @Nonnull Optional tokenizedScreenName, - @Nonnull Optional id) { - this.screenName = screenName; - this.displayName = displayName; - this.tokenizedScreenName = tokenizedScreenName; - this.id = id; - } - - /** Creates a copy of this TwitterMessageUser instance, with the given screen name. */ - public TwitterMessageUser copyWithScreenName(@Nonnull String newScreenName) { - Preconditions.checkNotNull(newScreenName, "Don't set a null screen name"); - return new Builder(this) - .withScreenName(Optional.of(newScreenName)) - .build(); - } - - /** Creates a copy of this TwitterMessageUser instance, with the given display name. */ - public TwitterMessageUser copyWithDisplayName(@Nonnull String newDisplayName) { - Preconditions.checkNotNull(newDisplayName, "Don't set a null display name"); - return new Builder(this) - .withDisplayName(Optional.of(newDisplayName)) - .build(); - } - - /** Creates a copy of this TwitterMessageUser instance, with the given ID. */ - public TwitterMessageUser copyWithId(long newId) { - Preconditions.checkArgument(newId >= 0, "Don't set a negative user ID"); - return new Builder(this) - .withId(Optional.of(newId)) - .build(); - } - - public Optional getScreenName() { - return screenName; - } - - public Optional getDisplayName() { - return displayName; - } - - public Optional getTokenizedScreenName() { - return tokenizedScreenName; - } - - public Optional getId() { - return id; - } - - @Override - public String toString() { - return "[" + screenName + ", " + displayName + ", " + id + "]"; - } - - /** - * Compares this TwitterMessageUser instance to the given object. - * - * @deprecated deprecated. - */ - @Deprecated - @Override - public boolean equals(Object o) { - if (o == null) { - return false; - } - if (o == this) { - return true; - } - if (o.getClass() != getClass()) { - return false; - } - TwitterMessageUser other = (TwitterMessageUser) o; - return new EqualsBuilder() - .append(screenName, other.screenName) - .append(displayName, other.displayName) - .isEquals(); - } - - /** - * Returns a hash code for this TwitterMessageUser instance. - * - * @deprecated deprecated. 
- */ - @Deprecated - @Override - public int hashCode() { - return HashCodeBuilder.reflectionHashCode(this); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUtil.java b/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUtil.java deleted file mode 100644 index 7437de7fd..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/TwitterMessageUtil.java +++ /dev/null @@ -1,444 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.text.Normalizer; -import java.util.Map; -import java.util.NavigableMap; -import java.util.Set; -import java.util.TreeMap; -import java.util.concurrent.ConcurrentMap; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.commons.lang.StringUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.transformer.HTMLTagRemovalTransformer; -import com.twitter.common_internal.text.extractor.EmojiExtractor; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; - -public final class TwitterMessageUtil { - private static final Logger LOG = LoggerFactory.getLogger(TwitterMessageUtil.class); - - private TwitterMessageUtil() { - } - - @VisibleForTesting - static final ConcurrentMap COUNTERS_MAP = Maps.newConcurrentMap(); - // We truncate the location string because we used to use a MySQL table to store the geocoding - // information. In the MySQL table, the location string was fix width of 30 characters. - // We have migrated to Manhattan and the location string is no longer limited to 30 character. - // However, in order to correctly lookup location geocode from Manhattan, we still need to - // truncate the location just like we did before. - private static final int MAX_LOCATION_LEN = 30; - - // Note: we strip tags to index source, as typically source contains tags. - // Sometimes we get a source where stripping fails, as the URL in the tag was - // excessively long. We drop these sources, as there is little reason to index them. 
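A minimal sketch of the policy described in the comment above (not the original implementation; a plain regex stands in for HTMLTagRemovalTransformer): strip tags first, then drop, rather than truncate, any source whose stripped form still exceeds the 64-character cap defined just below. The real stripSource() further down also strips supplementary characters and increments the Field.SOURCE truncation counter when it drops a value.

```java
// Rough stand-in for stripSource(); the regex replaces HTMLTagRemovalTransformer
// purely for illustration.
final class SourceStrippingSketch {
  static String stripSourceSketch(String source) {
    if (source == null) {
      return null;
    }
    String stripped = source.replaceAll("<[^>]*>", "");
    // e.g. "<a href=\"http://twitter.com/download/iphone\">Twitter for iPhone</a>"
    //      becomes "Twitter for iPhone" and is kept.
    return stripped.length() > 64 ? null : stripped;  // over-long sources are dropped, not cut
  }
}
```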
- private static final int MAX_SOURCE_LEN = 64; - - private static HTMLTagRemovalTransformer tagRemovalTransformer = new HTMLTagRemovalTransformer(); - - private static final String STAT_PREFIX = "twitter_message_"; - - public enum Field { - FROM_USER_DISPLAY_NAME, - NORMALIZED_LOCATION, - ORIG_LOCATION, - ORIG_SOURCE, - SHARED_USER_DISPLAY_NAME, - SOURCE, - TEXT, - TO_USER_SCREEN_NAME; - - public String getNameForStats() { - return name().toLowerCase(); - } - } - - @VisibleForTesting - static class Counters { - private final SearchRateCounter truncatedCounter; - private final SearchRateCounter tweetsWithStrippedSupplementaryCharsCounter; - private final SearchRateCounter strippedSupplementaryCharsCounter; - private final SearchRateCounter nonStrippedEmojiCharsCounter; - private final SearchRateCounter emojisAtTruncateBoundaryCounter; - - Counters(Field field) { - String fieldNameForStats = field.getNameForStats(); - truncatedCounter = SearchRateCounter.export( - STAT_PREFIX + "truncated_" + fieldNameForStats); - tweetsWithStrippedSupplementaryCharsCounter = SearchRateCounter.export( - STAT_PREFIX + "tweets_with_stripped_supplementary_chars_" + fieldNameForStats); - strippedSupplementaryCharsCounter = SearchRateCounter.export( - STAT_PREFIX + "stripped_supplementary_chars_" + fieldNameForStats); - nonStrippedEmojiCharsCounter = SearchRateCounter.export( - STAT_PREFIX + "non_stripped_emoji_chars_" + fieldNameForStats); - emojisAtTruncateBoundaryCounter = SearchRateCounter.export( - STAT_PREFIX + "emojis_at_truncate_boundary_" + fieldNameForStats); - } - - SearchRateCounter getTruncatedCounter() { - return truncatedCounter; - } - - SearchRateCounter getTweetsWithStrippedSupplementaryCharsCounter() { - return tweetsWithStrippedSupplementaryCharsCounter; - } - - SearchRateCounter getStrippedSupplementaryCharsCounter() { - return strippedSupplementaryCharsCounter; - } - - SearchRateCounter getNonStrippedEmojiCharsCounter() { - return nonStrippedEmojiCharsCounter; - } - - SearchRateCounter getEmojisAtTruncateBoundaryCounter() { - return emojisAtTruncateBoundaryCounter; - } - } - - static { - for (Field field : Field.values()) { - COUNTERS_MAP.put(field, new Counters(field)); - } - } - - // Note: the monorail enforces a limit of 15 characters for screen names, - // but some users with up to 20 character names were grandfathered-in. To allow - // those users to be searchable, support up to 20 chars. - private static final int MAX_SCREEN_NAME_LEN = 20; - - // Note: we expect the current limit to be 10K. Also, all supplementary unicode characters (with - // the exception of emojis, maybe) will be removed and not counted as total length. Added alert - // for text truncation rate as well. 
SEARCH-9512 - private static final int MAX_TWEET_TEXT_LEN = 10000; - - @VisibleForTesting - static final SearchRateCounter FILTERED_NO_STATUS_ID = - SearchRateCounter.export(STAT_PREFIX + "filtered_no_status_id"); - @VisibleForTesting - static final SearchRateCounter FILTERED_NO_FROM_USER = - SearchRateCounter.export(STAT_PREFIX + "filtered_no_from_user"); - @VisibleForTesting - static final SearchRateCounter FILTERED_LONG_SCREEN_NAME = - SearchRateCounter.export(STAT_PREFIX + "filtered_long_screen_name"); - @VisibleForTesting - static final SearchRateCounter FILTERED_NO_TEXT = - SearchRateCounter.export(STAT_PREFIX + "filtered_no_text"); - @VisibleForTesting - static final SearchRateCounter FILTERED_NO_DATE = - SearchRateCounter.export(STAT_PREFIX + "filtered_no_date"); - @VisibleForTesting - static final SearchRateCounter NULLCAST_TWEET = - SearchRateCounter.export(STAT_PREFIX + "filter_nullcast_tweet"); - @VisibleForTesting - static final SearchRateCounter NULLCAST_TWEET_ACCEPTED = - SearchRateCounter.export(STAT_PREFIX + "nullcast_tweet_accepted"); - @VisibleForTesting - static final SearchRateCounter INCONSISTENT_TWEET_ID_AND_CREATED_AT = - SearchRateCounter.export(STAT_PREFIX + "inconsistent_tweet_id_and_created_at_ms"); - - /** Strips the given source from the message with the given ID. */ - private static String stripSource(String source, Long messageId) { - if (source == null) { - return null; - } - // Always strip emojis from sources: they don't really make sense in this field. - String strippedSource = stripSupplementaryChars( - tagRemovalTransformer.transform(source).toString(), Field.SOURCE, true); - if (strippedSource.length() > MAX_SOURCE_LEN) { - LOG.warn("Message " - + messageId - + " contains stripped source that exceeds MAX_SOURCE_LEN. Removing: " - + strippedSource); - COUNTERS_MAP.get(Field.SOURCE).getTruncatedCounter().increment(); - return null; - } - return strippedSource; - } - - /** - * Strips and truncates the location of the message with the given ID. - * - */ - private static String stripAndTruncateLocation(String location) { - // Always strip emojis from locations: they don't really make sense in this field. - String strippedLocation = stripSupplementaryChars(location, Field.NORMALIZED_LOCATION, true); - return truncateString(strippedLocation, MAX_LOCATION_LEN, Field.NORMALIZED_LOCATION, true); - } - - /** - * Sets the origSource and strippedSource fields on a TwitterMessage - * - */ - public static void setSourceOnMessage(TwitterMessage message, String modifiedDeviceSource) { - // Always strip emojis from sources: they don't really make sense in this field. - message.setOrigSource(stripSupplementaryChars(modifiedDeviceSource, Field.ORIG_SOURCE, true)); - message.setStrippedSource(stripSource(modifiedDeviceSource, message.getId())); - } - - /** - * Sets the origLocation to the stripped location, and sets - * the truncatedNormalizedLocation to the truncated and normalized location. - */ - public static void setAndTruncateLocationOnMessage( - TwitterMessage message, - String newOrigLocation) { - // Always strip emojis from locations: they don't really make sense in this field. - message.setOrigLocation(stripSupplementaryChars(newOrigLocation, Field.ORIG_LOCATION, true)); - - // Locations in the new locations table require additional normalization. It can also change - // the length of the string, so we must do this before truncation. 
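To make the normalization just below concrete (an illustrative example; the location string is invented): NFKC folds compatibility characters such as full-width Latin letters before lower-casing and trimming, so the geocode lookup key stays stable regardless of how the profile location was typed.

```java
import java.text.Normalizer;

// Illustrative only: the same normalize -> toLowerCase -> trim pipeline used just below.
class LocationNormalizationExample {
  public static void main(String[] args) {
    String origLocation = "Ｔｏｋｙｏ, Japan ";
    String normalized =
        Normalizer.normalize(origLocation, Normalizer.Form.NFKC).toLowerCase().trim();
    System.out.println(normalized);  // prints "tokyo, japan"
    // stripAndTruncateLocation() would then cap the result at MAX_LOCATION_LEN (30 chars).
  }
}
```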
- if (newOrigLocation != null) { - String normalized = - Normalizer.normalize(newOrigLocation, Normalizer.Form.NFKC).toLowerCase().trim(); - message.setTruncatedNormalizedLocation(stripAndTruncateLocation(normalized)); - } else { - message.setTruncatedNormalizedLocation(null); - } - } - - /** - * Validates the given TwitterMessage. - * - * @param message The message to validate. - * @param stripEmojisForFields The set of fields for which emojis should be stripped. - * @param acceptNullcastMessage Determines if this message should be accepted, if it's a nullcast - * message. - * @return {@code true} if the given message is valid; {@code false} otherwise. - */ - public static boolean validateTwitterMessage( - TwitterMessage message, - Set stripEmojisForFields, - boolean acceptNullcastMessage) { - if (message.getNullcast()) { - NULLCAST_TWEET.increment(); - if (!acceptNullcastMessage) { - LOG.info("Dropping nullcasted message " + message.getId()); - return false; - } - NULLCAST_TWEET_ACCEPTED.increment(); - } - - if (!message.getFromUserScreenName().isPresent() - || StringUtils.isBlank(message.getFromUserScreenName().get())) { - LOG.error("Message " + message.getId() + " contains no from user. Skipping."); - FILTERED_NO_FROM_USER.increment(); - return false; - } - String fromUserScreenName = message.getFromUserScreenName().get(); - - if (fromUserScreenName.length() > MAX_SCREEN_NAME_LEN) { - LOG.warn("Message " + message.getId() + " has a user screen name longer than " - + MAX_SCREEN_NAME_LEN + " characters: " + message.getFromUserScreenName() - + ". Skipping."); - FILTERED_LONG_SCREEN_NAME.increment(); - return false; - } - - // Remove supplementary characters and truncate these text fields. - if (message.getFromUserDisplayName().isPresent()) { - message.setFromUserDisplayName(stripSupplementaryChars( - message.getFromUserDisplayName().get(), - Field.FROM_USER_DISPLAY_NAME, - stripEmojisForFields.contains(Field.FROM_USER_DISPLAY_NAME))); - } - if (message.getToUserScreenName().isPresent()) { - String strippedToUserScreenName = stripSupplementaryChars( - message.getToUserLowercasedScreenName().get(), - Field.TO_USER_SCREEN_NAME, - stripEmojisForFields.contains(Field.TO_USER_SCREEN_NAME)); - message.setToUserScreenName( - truncateString( - strippedToUserScreenName, - MAX_SCREEN_NAME_LEN, - Field.TO_USER_SCREEN_NAME, - stripEmojisForFields.contains(Field.TO_USER_SCREEN_NAME))); - } - - String strippedText = stripSupplementaryChars( - message.getText(), - Field.TEXT, - stripEmojisForFields.contains(Field.TEXT)); - message.setText(truncateString( - strippedText, - MAX_TWEET_TEXT_LEN, - Field.TEXT, - stripEmojisForFields.contains(Field.TEXT))); - - if (StringUtils.isBlank(message.getText())) { - FILTERED_NO_TEXT.increment(); - return false; - } - - if (message.getDate() == null) { - LOG.error("Message " + message.getId() + " contains no date. Skipping."); - FILTERED_NO_DATE.increment(); - return false; - } - - if (message.isRetweet()) { - return validateRetweetMessage(message.getRetweetMessage(), stripEmojisForFields); - } - - // Track if both the snowflake ID and created at timestamp are consistent. 
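For context on the consistency check below, here is a rough sketch of the idea (not the original SnowflakeIdParser logic): snowflake tweet IDs carry a millisecond timestamp in their upper bits, so the ID can be decoded and compared with the message's creation date. The epoch constant is the commonly cited snowflake epoch, and the 10-second tolerance is an arbitrary illustrative choice.

```java
// Rough approximation of SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(); illustrative only.
final class SnowflakeConsistencySketch {
  // Milliseconds of the commonly cited snowflake epoch (2010-11-04).
  static final long TWITTER_EPOCH_MS = 1288834974657L;

  static boolean roughlyConsistent(long tweetId, java.util.Date createdAt) {
    long timestampFromId = (tweetId >> 22) + TWITTER_EPOCH_MS;  // top bits hold the timestamp
    return Math.abs(timestampFromId - createdAt.getTime()) <= 10_000L;  // arbitrary 10s slack
  }
}
```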
- if (!SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(message.getId(), message.getDate())) { - LOG.error("Found inconsistent tweet ID and created at timestamp: [messageID=" - + message.getId() + "], [messageDate=" + message.getDate() + "]."); - INCONSISTENT_TWEET_ID_AND_CREATED_AT.increment(); - } - - return true; - } - - private static boolean validateRetweetMessage( - TwitterRetweetMessage message, Set stripEmojisForFields) { - if (message.getSharedId() == null || message.getRetweetId() == null) { - LOG.error("Retweet Message contains a null twitter id. Skipping."); - FILTERED_NO_STATUS_ID.increment(); - return false; - } - - if (message.getSharedDate() == null) { - LOG.error("Retweet Message " + message.getRetweetId() + " contains no date. Skipping."); - return false; - } - - // Remove supplementary characters from these text fields. - message.setSharedUserDisplayName(stripSupplementaryChars( - message.getSharedUserDisplayName(), - Field.SHARED_USER_DISPLAY_NAME, - stripEmojisForFields.contains(Field.SHARED_USER_DISPLAY_NAME))); - - return true; - } - - /** - * Strips non indexable chars from the text. - * - * Returns the resulting string, which may be the same object as the text argument when - * no stripping or truncation is necessary. - * - * Non-indexed characters are "supplementary unicode" that are not emojis. Note that - * supplementary unicode are still characters that seem worth indexing, as many characters - * in CJK languages are supplementary. However this would make the size of our index - * explode (~186k supplementary characters exist), so it's not feasible. - * - * @param text The text to strip - * @param field The field this text is from - * @param stripSupplementaryEmojis Whether or not to strip supplementary emojis. Note that this - * parameter name isn't 100% accurate. This parameter is meant to replicate behavior prior to - * adding support for *not* stripping supplementary emojis. The prior behavior would turn an - * emoji such as a keycap "1\uFE0F\u20E3" (http://www.iemoji.com/view/emoji/295/symbols/keycap-1) - * into just '1'. So the keycap emoji is not completely stripped, only the portion after the '1'. - * - */ - @VisibleForTesting - public static String stripSupplementaryChars( - String text, - Field field, - boolean stripSupplementaryEmojis) { - if (text == null || text.isEmpty()) { - return text; - } - - // Initialize an empty map so that if we choose not to strip emojis, - // then no emojipositions will be found and we don't need a null - // check before checking if an emoji is at a certain spot. - NavigableMap emojiPositions = new TreeMap<>(); - - if (!stripSupplementaryEmojis) { - emojiPositions = EmojiExtractor.getEmojiPositions(text); - } - - StringBuilder strippedTextBuilder = new StringBuilder(); - int sequenceStart = 0; - int i = 0; - while (i < text.length()) { - if (Character.isSupplementaryCodePoint(text.codePointAt(i))) { - // Check if this supplementary character is an emoji - if (!emojiPositions.containsKey(i)) { - // It's not an emoji, or we want to strip emojis, so strip it - - // text[i] and text[i + 1] are part of a supplementary code point. 
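A worked trace of this stripping branch (the input string is invented for illustration):

```java
// text = "a" + "\uD835\uDCB3" + "b"   // U+1D4B3 is supplementary but not an emoji
// i = 0: 'a' is not supplementary                 -> i = 1
// i = 1: supplementary and not in emojiPositions  -> append text[0..1) = "a",
//        sequenceStart = 3, i = 3, stripped-supplementary-chars counter incremented
// i = 3: 'b' is not supplementary                 -> i = 4, loop ends
// tail append text[3..] = "b"                     => strippedText = "ab"
```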
- strippedTextBuilder.append(text.substring(sequenceStart, i)); - sequenceStart = i + 2; // skip 2 chars - i = sequenceStart; - COUNTERS_MAP.get(field).getStrippedSupplementaryCharsCounter().increment(); - } else { - // It's an emoji, keep it - i += emojiPositions.get(i); - COUNTERS_MAP.get(field).getNonStrippedEmojiCharsCounter().increment(); - } - } else { - ++i; - } - } - if (sequenceStart < text.length()) { - strippedTextBuilder.append(text.substring(sequenceStart)); - } - - String strippedText = strippedTextBuilder.toString(); - if (strippedText.length() < text.length()) { - COUNTERS_MAP.get(field).getTweetsWithStrippedSupplementaryCharsCounter().increment(); - } - return strippedText; - } - - /** - * Truncates the given string to the given length. - * - * Note that we are truncating based on the # of UTF-16 characters a given emoji takes up. - * So if a single emoji takes up 4 UTF-16 characters, that counts as 4 for the truncation, - * not just 1. - * - * @param text The text to truncate - * @param maxLength The maximum length of the string after truncation - * @param field The field from which this string cames - * @param splitEmojisAtMaxLength If true, don't worry about emojis and just truncate at maxLength, - * potentially splitting them. If false, truncate before the emoji if truncating at maxLength - * would cause the emoji to be split. - */ - @VisibleForTesting - static String truncateString( - String text, - int maxLength, - Field field, - boolean splitEmojisAtMaxLength) { - Preconditions.checkArgument(maxLength > 0); - - if ((text == null) || (text.length() <= maxLength)) { - return text; - } - - int truncatePoint = maxLength; - NavigableMap emojiPositions; - // If we want to consider emojis we should not strip on an emoji boundary. - if (!splitEmojisAtMaxLength) { - emojiPositions = EmojiExtractor.getEmojiPositions(text); - - // Get the last emoji before maxlength. - Map.Entry lastEmojiBeforeMaxLengthEntry = - emojiPositions.lowerEntry(maxLength); - - if (lastEmojiBeforeMaxLengthEntry != null) { - int lowerEmojiEnd = lastEmojiBeforeMaxLengthEntry.getKey() - + lastEmojiBeforeMaxLengthEntry.getValue(); - - // If the last emoji would be truncated, truncate before the last emoji. 
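A worked example of the boundary case handled below (string and lengths invented for illustration): truncating "hi😀there" at maxLength = 3 would split the emoji's surrogate pair, so the cut moves back to where the emoji starts.

```java
// text = "hi😀there", maxLength = 3
// emojiPositions = {2 -> 2}                        // emoji start index -> length in chars
// lastEmojiBeforeMaxLengthEntry = (2, 2); lowerEmojiEnd = 2 + 2 = 4 > truncatePoint (3)
// => truncatePoint = 2, result is "hi", and the emojis-at-truncate-boundary counter increments
```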
- if (lowerEmojiEnd > truncatePoint) { - truncatePoint = lastEmojiBeforeMaxLengthEntry.getKey(); - COUNTERS_MAP.get(field).getEmojisAtTruncateBoundaryCounter().increment(); - } - } - } - - COUNTERS_MAP.get(field).getTruncatedCounter().increment(); - return text.substring(0, truncatePoint); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/TwitterQuotedMessage.java b/src/java/com/twitter/search/common/relevance/entities/TwitterQuotedMessage.java deleted file mode 100644 index 4e9f9b88f..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/TwitterQuotedMessage.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import org.apache.commons.lang3.builder.EqualsBuilder; -import org.apache.commons.lang3.builder.HashCodeBuilder; -import org.apache.commons.lang3.builder.ToStringBuilder; - -/** - * The object for quoted message - */ -public class TwitterQuotedMessage { - private final long quotedStatusId; - private final long quotedUserId; - - public TwitterQuotedMessage(long quotedStatusId, long quotedUserId) { - this.quotedStatusId = quotedStatusId; - this.quotedUserId = quotedUserId; - } - - public long getQuotedStatusId() { - return quotedStatusId; - } - - public long getQuotedUserId() { - return quotedUserId; - } - - @Override - public boolean equals(Object o) { - return EqualsBuilder.reflectionEquals(this, o); - } - - @Override - public int hashCode() { - return HashCodeBuilder.reflectionHashCode(this); - } - - @Override - public String toString() { - return ToStringBuilder.reflectionToString(this); - } -} diff --git a/src/java/com/twitter/search/common/relevance/entities/TwitterRetweetMessage.java b/src/java/com/twitter/search/common/relevance/entities/TwitterRetweetMessage.java deleted file mode 100644 index e2aac7bc2..000000000 --- a/src/java/com/twitter/search/common/relevance/entities/TwitterRetweetMessage.java +++ /dev/null @@ -1,80 +0,0 @@ -package com.twitter.search.common.relevance.entities; - -import java.util.Date; - -import org.apache.commons.lang3.builder.EqualsBuilder; -import org.apache.commons.lang3.builder.HashCodeBuilder; -import org.apache.commons.lang3.builder.ToStringBuilder; - -public class TwitterRetweetMessage { - // based on original tweet - private Long sharedId; - - // TwitterMessageUtil checks them - private String sharedUserDisplayName; - private Long sharedUserTwitterId = TwitterMessage.LONG_FIELD_NOT_PRESENT; - - private Date sharedDate = null; - - // based on retweet - private Long retweetId; - - public Long getRetweetId() { - return retweetId; - } - - public void setRetweetId(Long retweetId) { - this.retweetId = retweetId; - } - - public Long getSharedId() { - return sharedId; - } - - public void setSharedId(Long sharedId) { - this.sharedId = sharedId; - } - - public String getSharedUserDisplayName() { - return sharedUserDisplayName; - } - - public void setSharedUserDisplayName(String sharedUserDisplayName) { - this.sharedUserDisplayName = sharedUserDisplayName; - } - - public Long getSharedUserTwitterId() { - return sharedUserTwitterId; - } - - public boolean hasSharedUserTwitterId() { - return sharedUserTwitterId != TwitterMessage.LONG_FIELD_NOT_PRESENT; - } - - public void setSharedUserTwitterId(Long sharedUserTwitterId) { - this.sharedUserTwitterId = sharedUserTwitterId; - } - - public Date getSharedDate() { - return sharedDate; - } - - public void setSharedDate(Date sharedDate) { - this.sharedDate = sharedDate; - } - - @Override - public boolean equals(Object o) { - return 
EqualsBuilder.reflectionEquals(this, o); - } - - @Override - public int hashCode() { - return HashCodeBuilder.reflectionHashCode(this); - } - - @Override - public String toString() { - return ToStringBuilder.reflectionToString(this); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/AgeDecay.java b/src/java/com/twitter/search/common/relevance/features/AgeDecay.java deleted file mode 100644 index 910eaae40..000000000 --- a/src/java/com/twitter/search/common/relevance/features/AgeDecay.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; - -/** - * Utility to compute an age decay multiplier based on a sigmoid function. - */ -public class AgeDecay { - public static final double SLOPE_COEFF = 4.0; - public static final double LN_HALF = Math.log(0.5); - public final double halflife; - public final double maxBoost; - public final double base; - public final double slope; - - /** Creates a new AgeDecay instance. */ - public AgeDecay(double base, double maxBoost, double halflife, double slope) { - this.maxBoost = maxBoost; - this.base = base; - this.halflife = halflife; - this.slope = slope; - } - - /** Creates a new AgeDecay instance. */ - public AgeDecay(double base, double halflife, double slope) { - this(base, 1.0, halflife, slope); - } - - /** - * Compute the age decay, using the provided halflife. - * - * @param tweetAge The tweet age. - * @param unit The unit of the tweetAge parameter. - */ - public double getAgeDecayMultiplier(long tweetAge, TimeUnit unit) { - return getAgeDecayMultiplier(TimeUnit.SECONDS.convert(tweetAge, unit)); - } - - /** - * Compute the age decay, assuming the halflife in the constructor is in minutes. - * @param ageInSeconds the age in seconds - */ - public double getAgeDecayMultiplier(long ageInSeconds) { - long minutesSinceTweet = TimeUnit.MINUTES.convert(ageInSeconds, TimeUnit.SECONDS); - return compute(minutesSinceTweet); - } - - /** - * Compute age decay given an age, the age has to be in the same unit as halflife, which you - * construct the object with. - */ - public double compute(double age) { - return compute(base, maxBoost, halflife, slope, age); - } - - /** - * Compute the age decay given all parameters. Use this if you don't need to reuse an AgeDecay - * object. - */ - public static double compute( - double base, double maxBoost, double halflife, double slope, double age) { - return base + ((maxBoost - base) / (1 + Math.exp(slope * (age - halflife)))); - } - - public static double compute( - double base, double maxBoost, double halflife, double age) { - Preconditions.checkArgument(halflife != 0); - return compute(base, maxBoost, halflife, SLOPE_COEFF / halflife, age); - } - - /** - * Another nicer exponential decay function. 
Returns a value in (0, 1] - */ - public static double computeExponential(double halflife, double exp, double age) { - return Math.exp(LN_HALF * Math.pow(age, exp) / Math.pow(halflife, exp)); - } - - /** - * Exponential decay with remapping of the value from (0,1] to (min,max] - */ - public static double computeExponential(double halflife, double exp, double age, - double minBoost, double maxBoost) { - double decay = computeExponential(halflife, exp, age); // in (0, 1] - return (maxBoost - minBoost) * decay + minBoost; - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/BUILD b/src/java/com/twitter/search/common/relevance/features/BUILD deleted file mode 100644 index f93592fd9..000000000 --- a/src/java/com/twitter/search/common/relevance/features/BUILD +++ /dev/null @@ -1,24 +0,0 @@ -# Java library for tweet features and utilities. -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/lang", - "src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:features-java", - "src/thrift/com/twitter/search/common:schema-java", - ], -) diff --git a/src/java/com/twitter/search/common/relevance/features/EarlybirdDocumentFeatures.java b/src/java/com/twitter/search/common/relevance/features/EarlybirdDocumentFeatures.java deleted file mode 100644 index 79afe8d2f..000000000 --- a/src/java/com/twitter/search/common/relevance/features/EarlybirdDocumentFeatures.java +++ /dev/null @@ -1,232 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.io.IOException; -import java.util.Map; -import java.util.function.Function; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureNormalizationType; - -public class EarlybirdDocumentFeatures { - private static final Map FEATURE_CONFIG_IS_NULL_MAP = Maps.newHashMap(); - private static final Map FEATURE_OUTPUT_TYPE_IS_NULL_MAP = - Maps.newHashMap(); - private static final Map NO_SCHEMA_FIELD_FOR_FEATURE_MAP = - Maps.newHashMap(); - private static final String FEATURE_CONFIG_IS_NULL_COUNTER_PATTERN = - "null_feature_config_for_feature_id_%d"; - private static final String FEATURE_OUTPUT_TYPE_IS_NULL_COUNTER_PATTERN = - 
"null_output_type_for_feature_id_%d"; - private static final String NO_SCHEMA_FIELD_FOR_FEATURE_COUNTER_PATTERN = - "no_schema_field_for_feature_id_%d"; - private static final SearchCounter UNKNOWN_FEATURE_OUTPUT_TYPE_COUNTER = - SearchCounter.export("unknown_feature_output_type"); - - private final Map numericDocValues = Maps.newHashMap(); - private final LeafReader leafReader; - private int docId = -1; - - /** - * Creates a new EarlybirdDocumentFeatures instance that will return feature values based on the - * NumericDocValues stored in the given LeafReader for the given document. - */ - public EarlybirdDocumentFeatures(LeafReader leafReader) { - this.leafReader = Preconditions.checkNotNull(leafReader); - } - - /** - * Advances this instance to the given doc ID. The new doc ID must be greater than or equal to the - * current doc ID stored in this instance. - */ - public void advance(int target) { - Preconditions.checkArgument( - target >= 0, - "Target (%s) cannot be negative.", - target); - Preconditions.checkArgument( - target >= docId, - "Target (%s) smaller than current doc ID (%s).", - target, - docId); - Preconditions.checkArgument( - target < leafReader.maxDoc(), - "Target (%s) cannot be greater than or equal to the max doc ID (%s).", - target, - leafReader.maxDoc()); - docId = target; - } - - /** - * Returns the feature value for the given field. - */ - public long getFeatureValue(EarlybirdFieldConstant field) throws IOException { - // The index might not have a NumericDocValues instance for this feature. - // This might happen if we dynamically update the feature schema, for example. - // - // Cache the NumericDocValues instances for all accessed features, even if they're null. - String fieldName = field.getFieldName(); - NumericDocValues docValues; - if (numericDocValues.containsKey(fieldName)) { - docValues = numericDocValues.get(fieldName); - } else { - docValues = leafReader.getNumericDocValues(fieldName); - numericDocValues.put(fieldName, docValues); - } - return docValues != null && docValues.advanceExact(docId) ? docValues.longValue() : 0L; - } - - /** - * Determines if the given flag is set. - */ - public boolean isFlagSet(EarlybirdFieldConstant field) throws IOException { - return getFeatureValue(field) != 0; - } - - /** - * Returns the unnormalized value for the given field. - */ - public double getUnnormalizedFeatureValue(EarlybirdFieldConstant field) throws IOException { - long featureValue = getFeatureValue(field); - ThriftFeatureNormalizationType normalizationType = field.getFeatureNormalizationType(); - if (normalizationType == null) { - normalizationType = ThriftFeatureNormalizationType.NONE; - } - switch (normalizationType) { - case NONE: - return featureValue; - case LEGACY_BYTE_NORMALIZER: - return MutableFeatureNormalizers.BYTE_NORMALIZER.unnormLowerBound((byte) featureValue); - case LEGACY_BYTE_NORMALIZER_WITH_LOG2: - return MutableFeatureNormalizers.BYTE_NORMALIZER.unnormAndLog2((byte) featureValue); - case SMART_INTEGER_NORMALIZER: - return MutableFeatureNormalizers.SMART_INTEGER_NORMALIZER.unnormUpperBound( - (byte) featureValue); - case PREDICTION_SCORE_NORMALIZER: - return IntNormalizers.PREDICTION_SCORE_NORMALIZER.denormalize((int) featureValue); - default: - throw new IllegalArgumentException( - "Unsupported normalization type " + normalizationType + " for feature " - + field.getFieldName()); - } - } - - /** - * Creates a ThriftSearchResultFeatures instance populated with values for all available features - * that have a non-zero value set. 
- */ - public ThriftSearchResultFeatures getSearchResultFeatures(ImmutableSchemaInterface schema) - throws IOException { - return getSearchResultFeatures(schema, (featureId) -> true); - } - - /** - * Creates a ThriftSearchResultFeatures instance populated with values for all available features - * that have a non-zero value set. - * - * @param schema The schema. - * @param shouldCollectFeatureId A predicate that determines which features should be collected. - */ - public ThriftSearchResultFeatures getSearchResultFeatures( - ImmutableSchemaInterface schema, - Function shouldCollectFeatureId) throws IOException { - Map boolValues = Maps.newHashMap(); - Map doubleValues = Maps.newHashMap(); - Map intValues = Maps.newHashMap(); - Map longValues = Maps.newHashMap(); - - Map idToFeatureConfigMap = schema.getFeatureIdToFeatureConfig(); - for (int featureId : schema.getSearchFeatureSchema().getEntries().keySet()) { - if (!shouldCollectFeatureId.apply(featureId)) { - continue; - } - - FeatureConfiguration featureConfig = idToFeatureConfigMap.get(featureId); - if (featureConfig == null) { - FEATURE_CONFIG_IS_NULL_MAP.computeIfAbsent( - featureId, - (fId) -> SearchCounter.export( - String.format(FEATURE_CONFIG_IS_NULL_COUNTER_PATTERN, fId))).increment(); - continue; - } - - ThriftCSFType outputType = featureConfig.getOutputType(); - if (outputType == null) { - FEATURE_OUTPUT_TYPE_IS_NULL_MAP.computeIfAbsent( - featureId, - (fId) -> SearchCounter.export( - String.format(FEATURE_OUTPUT_TYPE_IS_NULL_COUNTER_PATTERN, fId))).increment(); - continue; - } - - if (!EarlybirdFieldConstants.hasFieldConstant(featureId)) { - // Should only happen for features that were dynamically added to the schema. - NO_SCHEMA_FIELD_FOR_FEATURE_MAP.computeIfAbsent( - featureId, - (fId) -> SearchCounter.export( - String.format(NO_SCHEMA_FIELD_FOR_FEATURE_COUNTER_PATTERN, fId))).increment(); - continue; - } - - EarlybirdFieldConstant field = EarlybirdFieldConstants.getFieldConstant(featureId); - switch (outputType) { - case BOOLEAN: - if (isFlagSet(field)) { - boolValues.put(featureId, true); - } - break; - case BYTE: - // It's unclear why we don't add this feature to a separate byteValues map... - byte byteFeatureValue = (byte) getFeatureValue(field); - if (byteFeatureValue != 0) { - intValues.put(featureId, (int) byteFeatureValue); - } - break; - case INT: - int intFeatureValue = (int) getFeatureValue(field); - if (intFeatureValue != 0) { - intValues.put(featureId, intFeatureValue); - } - break; - case LONG: - long longFeatureValue = getFeatureValue(field); - if (longFeatureValue != 0) { - longValues.put(featureId, longFeatureValue); - } - break; - case FLOAT: - // It's unclear why we don't add this feature to a separate floatValues map... 
- float floatFeatureValue = (float) getFeatureValue(field); - if (floatFeatureValue != 0) { - doubleValues.put(featureId, (double) floatFeatureValue); - } - break; - case DOUBLE: - double doubleFeatureValue = getUnnormalizedFeatureValue(field); - if (doubleFeatureValue != 0) { - doubleValues.put(featureId, doubleFeatureValue); - } - break; - default: - UNKNOWN_FEATURE_OUTPUT_TYPE_COUNTER.increment(); - } - } - - return new ThriftSearchResultFeatures() - .setBoolValues(boolValues) - .setIntValues(intValues) - .setLongValues(longValues) - .setDoubleValues(doubleValues); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/FeatureSink.java b/src/java/com/twitter/search/common/relevance/features/FeatureSink.java deleted file mode 100644 index 63be4bdad..000000000 --- a/src/java/com/twitter/search/common/relevance/features/FeatureSink.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Map; - -import com.google.common.collect.Maps; - -import com.twitter.search.common.encoding.features.IntegerEncodedFeatures; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; - -/** - * FeatureSink is used to write features based on feature configuration or feature name. After - * all feature is written, the class can return the base field integer array values. - * - * This class is not thread-safe. - */ -public class FeatureSink { - private ImmutableSchemaInterface schema; - private final Map encodedFeatureMap; - - /** Creates a new FeatureSink instance. */ - public FeatureSink(ImmutableSchemaInterface schema) { - this.schema = schema; - this.encodedFeatureMap = Maps.newHashMap(); - } - - private IntegerEncodedFeatures getFeatures(String baseFieldName) { - IntegerEncodedFeatures features = encodedFeatureMap.get(baseFieldName); - if (features == null) { - features = EarlybirdEncodedFeatures.newEncodedTweetFeatures(schema, baseFieldName); - encodedFeatureMap.put(baseFieldName, features); - } - return features; - } - - /** Sets the given numeric value for the field. */ - public FeatureSink setNumericValue(EarlybirdFieldConstant field, int value) { - return setNumericValue(field.getFieldName(), value); - } - - /** Sets the given numeric value for the feature with the given name. */ - public FeatureSink setNumericValue(String featureName, int value) { - final FeatureConfiguration featureConfig = schema.getFeatureConfigurationByName(featureName); - if (featureConfig != null) { - getFeatures(featureConfig.getBaseField()).setFeatureValue(featureConfig, value); - } - return this; - } - - /** Sets the given boolean value for the given field. */ - public FeatureSink setBooleanValue(EarlybirdFieldConstant field, boolean value) { - return setBooleanValue(field.getFieldName(), value); - } - - /** Sets the given boolean value for the feature with the given name. */ - public FeatureSink setBooleanValue(String featureName, boolean value) { - final FeatureConfiguration featureConfig = schema.getFeatureConfigurationByName(featureName); - if (featureConfig != null) { - getFeatures(featureConfig.getBaseField()).setFlagValue(featureConfig, value); - } - return this; - } - - /** Returns the features for the given base field. 
*/ - public IntegerEncodedFeatures getFeaturesForBaseField(EarlybirdFieldConstant baseField) { - return getFeaturesForBaseField(baseField.getFieldName()); - } - - /** Returns the features for the given base field. */ - public IntegerEncodedFeatures getFeaturesForBaseField(String baseFieldName) { - return encodedFeatureMap.get(baseFieldName); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/IntNormalizers.java b/src/java/com/twitter/search/common/relevance/features/IntNormalizers.java deleted file mode 100644 index 5dc3d5ddd..000000000 --- a/src/java/com/twitter/search/common/relevance/features/IntNormalizers.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.encoding.features.ByteNormalizer; -import com.twitter.search.common.encoding.features.IntNormalizer; -import com.twitter.search.common.encoding.features.PredictionScoreNormalizer; - -/** - * Int value normalizers used to push feature values into earlybird db. For the - * 8-bit feature types, this class wraps the - * com.twitter.search.common.relevance.features.MutableFeatureNormalizers - */ -public final class IntNormalizers { - private IntNormalizers() { - } - - public static final IntNormalizer LEGACY_NORMALIZER = - val -> ByteNormalizer.unsignedByteToInt( - MutableFeatureNormalizers.BYTE_NORMALIZER.normalize(val)); - - public static final IntNormalizer SMART_INTEGER_NORMALIZER = - val -> ByteNormalizer.unsignedByteToInt( - MutableFeatureNormalizers.SMART_INTEGER_NORMALIZER.normalize(val)); - - // The PARUS_SCORE feature is deprecated and is never set in our indexes. However, we still need - // this normalizer for now, because some models do not work properly with "missing" features, so - // for now we still need to set the PARUS_SCORE feature to 0. - public static final IntNormalizer PARUS_SCORE_NORMALIZER = val -> 0; - - public static final IntNormalizer BOOLEAN_NORMALIZER = - val -> val == 0 ? 0 : 1; - - public static final IntNormalizer TIMESTAMP_SEC_TO_HR_NORMALIZER = - val -> (int) TimeUnit.SECONDS.toHours((long) val); - - public static final PredictionScoreNormalizer PREDICTION_SCORE_NORMALIZER = - new PredictionScoreNormalizer(3); -} diff --git a/src/java/com/twitter/search/common/relevance/features/MutableFeatureNormalizers.java b/src/java/com/twitter/search/common/relevance/features/MutableFeatureNormalizers.java deleted file mode 100644 index b44414ea3..000000000 --- a/src/java/com/twitter/search/common/relevance/features/MutableFeatureNormalizers.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import com.twitter.search.common.encoding.features.ByteNormalizer; -import com.twitter.search.common.encoding.features.SingleBytePositiveFloatNormalizer; -import com.twitter.search.common.encoding.features.SmartIntegerNormalizer; - -/** - * Byte value normalizers used to push feature values into earlybird db. - */ -public abstract class MutableFeatureNormalizers { - // The max value we support in SMART_INTEGER_NORMALIZER below, this should be enough for all kinds - // of engagements we see on Twitter, anything larger than this would be represented as the same - // value (255, if using a byte). - private static final int MAX_COUNTER_VALUE_SUPPORTED = 50000000; - - // Avoid using this normalizer for procesing any new data, always use SmartIntegerNormalizer - // below. 
- public static final SingleBytePositiveFloatNormalizer BYTE_NORMALIZER = - new SingleBytePositiveFloatNormalizer(); - - public static final ByteNormalizer SMART_INTEGER_NORMALIZER = - new SmartIntegerNormalizer(MAX_COUNTER_VALUE_SUPPORTED, 8); -} diff --git a/src/java/com/twitter/search/common/relevance/features/QueryFeatureType.java b/src/java/com/twitter/search/common/relevance/features/QueryFeatureType.java deleted file mode 100644 index d46c183fa..000000000 --- a/src/java/com/twitter/search/common/relevance/features/QueryFeatureType.java +++ /dev/null @@ -1,9 +0,0 @@ -package com.twitter.search.common.relevance.features; - -/** - * An enum to hold different types of query-specific features (these are not indexed in Earlybird) - */ -public enum QueryFeatureType { - SOCIAL_ENGAGEMENTS, - CLICKS -} diff --git a/src/java/com/twitter/search/common/relevance/features/RelevanceSignalConstants.java b/src/java/com/twitter/search/common/relevance/features/RelevanceSignalConstants.java deleted file mode 100644 index abae2e9a8..000000000 --- a/src/java/com/twitter/search/common/relevance/features/RelevanceSignalConstants.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.common.relevance.features; - -/** - * Defines relevance related constants that are used at both ingestion time and - * earlybird scoring time. - */ -public final class RelevanceSignalConstants { - // user reputation - public static final byte UNSET_REPUTATION_SENTINEL = Byte.MIN_VALUE; - public static final byte MAX_REPUTATION = 100; - public static final byte MIN_REPUTATION = 0; - // below overall CDF of ~10%, default value for new users, - // given as a goodwill value in case it is unset - public static final byte GOODWILL_REPUTATION = 17; - - // text score - public static final byte UNSET_TEXT_SCORE_SENTINEL = Byte.MIN_VALUE; - // roughly at overall CDF of ~10%, given as a goodwill value in case it is unset - public static final byte GOODWILL_TEXT_SCORE = 19; - - private RelevanceSignalConstants() { - } - - // check whether the specified user rep value is valid - public static boolean isValidUserReputation(int userRep) { - return userRep != UNSET_REPUTATION_SENTINEL - && userRep >= MIN_REPUTATION - && userRep < MAX_REPUTATION; - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/ScoringUtils.java b/src/java/com/twitter/search/common/relevance/features/ScoringUtils.java deleted file mode 100644 index 7fc7a502f..000000000 --- a/src/java/com/twitter/search/common/relevance/features/ScoringUtils.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import com.google.common.base.Preconditions; - -/** - * Scoring utilities - */ -public final class ScoringUtils { - private ScoringUtils() { } - - /** - * normalize a positive value of arbitrary range to [0.0, 1.0], with a slop - * @param value the value to normalize. - * @param halfval a reference value that will be normalized to 0.5 - * @param exp an exponential parameter (must be positive) to control the converging speed, - * the smaller the value the faster it reaches the halfval but slower it reaches the maximum. 
- * @return a normalized value - */ - public static float normalize(float value, double halfval, double exp) { - Preconditions.checkArgument(exp > 0.0 && exp <= 1.0); - return (float) (Math.pow(value, exp) / (Math.pow(value, exp) + Math.pow(halfval, exp))); - } - -} diff --git a/src/java/com/twitter/search/common/relevance/features/TermVector.java b/src/java/com/twitter/search/common/relevance/features/TermVector.java deleted file mode 100644 index 75e7982e2..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TermVector.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Map; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import com.twitter.common.base.Function; - -/** - * Class to keep String-Double of term vectors - * It can calculate magnitude, dot product, and cosine similarity - */ -public class TermVector { - private static final double MIN_MAGNITUDE = 0.00001; - private final double magnitude; - private final ImmutableMap termWeights; - - /** Creates a new TermVector instance. */ - public TermVector(Map termWeights) { - this.termWeights = ImmutableMap.copyOf(termWeights); - double sum = 0.0; - for (Map.Entry entry : termWeights.entrySet()) { - double value = entry.getValue(); - sum += value * value; - } - magnitude = Math.sqrt(sum); - } - - public ImmutableMap getTermWeights() { - return termWeights; - } - - public double getMagnitude() { - return magnitude; - } - - /** - * Normalize term vector into unit magnitude - * @return the unit normalized TermVector with magnitude equals 1 - * return null if magnitude is very low - */ - public TermVector getUnitNormalized() { - if (magnitude < MIN_MAGNITUDE) { - return null; - } - return new TermVector( - Maps.transformValues(termWeights, (Function) weight -> weight / magnitude)); - } - - /** - * Calculate the dot product with another term vector - * @param other the other term vector - * @return the dot product of the two vectors - */ - public double getDotProduct(TermVector other) { - double sum = 0.0; - for (Map.Entry entry : termWeights.entrySet()) { - Double value2 = other.termWeights.get(entry.getKey()); - if (value2 != null) { - sum += entry.getValue() * value2; - } - } - return sum; - } - - /** - * Calculate the cosine similarity of with another term vector - * @param other the other term vector - * @return the cosine similarity. - * if either has very small magnitude, it returns 0 (dotProduct close to 0) - */ - public double getCosineSimilarity(TermVector other) { - if (magnitude < MIN_MAGNITUDE || other.magnitude < MIN_MAGNITUDE) { - return 0; - } - return getDotProduct(other) / (magnitude * other.magnitude); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetEngagementFeatures.java b/src/java/com/twitter/search/common/relevance/features/TweetEngagementFeatures.java deleted file mode 100644 index 22b610e4c..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetEngagementFeatures.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import com.twitter.search.common.encoding.features.EncodedFeatures; - -/** - * Holds engagement features for a particular tweet and encodes them as a single int. - * The features are: retweet count, favorite count, itweet score, reply count. 
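The packing described above can be pictured with a short illustrative fragment (shift and mask values are taken from the constants declared in the class that follows; the count itself is made up):

    // Each count occupies one byte of a single 32-bit int: retweets in bits 0-7,
    // itweet score in bits 8-15, favorites in bits 16-23, replies in bits 24-31.
    int favCount = 57;
    int encoded = 0;
    encoded = (encoded & 0xFF00FFFF) | ((favCount & 0xFF) << 16);  // write the favorite-count byte
    int readBack = (encoded >>> 16) & 0xFF;                        // reads back 57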
- */ -public class TweetEngagementFeatures extends EncodedFeatures { - private static final int RETWEET_COUNT_BIT_SHIFT = 0; - private static final long RETWEET_COUNT_INVERSE_BIT_MASK = 0xffffff00L; - - private static final int ITWEET_SCORE_BIT_SHIFT = 8; - private static final long ITWEET_SCORE_INVERSE_BIT_MASK = 0xffff00ffL; - - private static final int FAV_COUNT_BIT_SHIFT = 16; - private static final long FAV_COUNT_INVERSE_BIT_MASK = 0xff00ffffL; - - private static final int REPLY_COUNT_BIT_SHIFT = 24; - private static final long REPLY_COUNT_INVERSE_BIT_MASK = 0x00ffffffL; - - public TweetEngagementFeatures setRetweetCount(byte count) { - setByteIfGreater(count, RETWEET_COUNT_BIT_SHIFT, RETWEET_COUNT_INVERSE_BIT_MASK); - return this; - } - - public int getRetweetCount() { - return getByte(RETWEET_COUNT_BIT_SHIFT); - } - - public TweetEngagementFeatures setITweetScore(byte score) { - setByteIfGreater(score, ITWEET_SCORE_BIT_SHIFT, ITWEET_SCORE_INVERSE_BIT_MASK); - return this; - } - - public int getITweetScore() { - return getByte(ITWEET_SCORE_BIT_SHIFT); - } - - public TweetEngagementFeatures setFavCount(byte count) { - setByteIfGreater(count, FAV_COUNT_BIT_SHIFT, FAV_COUNT_INVERSE_BIT_MASK); - return this; - } - - public int getFavCount() { - return getByte(FAV_COUNT_BIT_SHIFT); - } - - public TweetEngagementFeatures setReplyCount(byte count) { - setByteIfGreater(count, REPLY_COUNT_BIT_SHIFT, REPLY_COUNT_INVERSE_BIT_MASK); - return this; - } - - public int getReplyCount() { - return getByte(REPLY_COUNT_BIT_SHIFT); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetFeatureType.java b/src/java/com/twitter/search/common/relevance/features/TweetFeatureType.java deleted file mode 100644 index 024a14ea4..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetFeatureType.java +++ /dev/null @@ -1,291 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Map; -import java.util.Set; -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.ImmutableSet; - -import com.twitter.search.common.encoding.features.IntNormalizer; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; - -import static com.twitter.search.common.relevance.features.IntNormalizers.BOOLEAN_NORMALIZER; -import static com.twitter.search.common.relevance.features.IntNormalizers.LEGACY_NORMALIZER; -import static com.twitter.search.common.relevance.features.IntNormalizers.PARUS_SCORE_NORMALIZER; -import static com.twitter.search.common.relevance.features.IntNormalizers.SMART_INTEGER_NORMALIZER; -import static com.twitter.search.common.relevance.features.IntNormalizers.TIMESTAMP_SEC_TO_HR_NORMALIZER; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; - -/** - * An enum to represent all dynamic/realtime feature types we can update in the Signal Ingester. - * It provides information for their normalization and their corresponding earlybird feature fields - * and provides utils both producer (Signal Ingester) and consumer (Earlybird) side. 
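For orientation, a hedged sketch of how a producer such as the Signal Ingester might use this enum; the methods referenced are defined further down in the enum body, and the concrete values are invented:

    TweetFeatureType type = TweetFeatureType.FAVORITE_V2;
    int normalized = type.normalize(42);                      // SMART_INTEGER_NORMALIZER applied
    EarlybirdFieldConstant field = type.getEarlybirdField();  // FAVORITE_COUNT_V2
    // Only emit an update when the encoded (normalized) value actually changes.
    boolean emit = type.getIncrementChecker().eligibleForEmit(41, 42);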
- * - */ -public enum TweetFeatureType { - RETWEET (true, 0, LEGACY_NORMALIZER, - EarlybirdFieldConstant.RETWEET_COUNT), - REPLY (true, 1, LEGACY_NORMALIZER, - EarlybirdFieldConstant.REPLY_COUNT), - FAVORITE (true, 4, LEGACY_NORMALIZER, - EarlybirdFieldConstant.FAVORITE_COUNT), - PARUS_SCORE (false, 3, PARUS_SCORE_NORMALIZER, - EarlybirdFieldConstant.PARUS_SCORE), - EMBEDS_IMP_COUNT (true, 10, LEGACY_NORMALIZER, - EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT), - EMBEDS_URL_COUNT (true, 11, LEGACY_NORMALIZER, - EarlybirdFieldConstant.EMBEDS_URL_COUNT), - VIDEO_VIEW (false, 12, LEGACY_NORMALIZER, - EarlybirdFieldConstant.VIDEO_VIEW_COUNT), - // v2 engagement counters, they will eventually replace v1 counters above - RETWEET_V2 (true, 13, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.RETWEET_COUNT_V2), - REPLY_V2 (true, 14, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.REPLY_COUNT_V2), - FAVORITE_V2 (true, 15, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.FAVORITE_COUNT_V2), - EMBEDS_IMP_COUNT_V2 (true, 16, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT_V2), - EMBEDS_URL_COUNT_V2 (true, 17, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.EMBEDS_URL_COUNT_V2), - VIDEO_VIEW_V2 (false, 18, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.VIDEO_VIEW_COUNT_V2), - // other new items - QUOTE (true, 19, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.QUOTE_COUNT), - // weighted engagement counters - WEIGHTED_RETWEET (true, 20, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.WEIGHTED_RETWEET_COUNT), - WEIGHTED_REPLY (true, 21, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.WEIGHTED_REPLY_COUNT), - WEIGHTED_FAVORITE (true, 22, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.WEIGHTED_FAVORITE_COUNT), - WEIGHTED_QUOTE (true, 23, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.WEIGHTED_QUOTE_COUNT), - - // tweet-level safety labels - LABEL_ABUSIVE (false, 24, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_ABUSIVE_FLAG), - LABEL_ABUSIVE_HI_RCL (false, 25, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_ABUSIVE_HI_RCL_FLAG), - LABEL_DUP_CONTENT (false, 26, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_DUP_CONTENT_FLAG), - LABEL_NSFW_HI_PRC (false, 27, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_NSFW_HI_PRC_FLAG), - LABEL_NSFW_HI_RCL (false, 28, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_NSFW_HI_RCL_FLAG), - LABEL_SPAM (false, 29, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_SPAM_FLAG), - LABEL_SPAM_HI_RCL (false, 30, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.LABEL_SPAM_HI_RCL_FLAG), - - PERISCOPE_EXISTS (false, 32, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.PERISCOPE_EXISTS), - PERISCOPE_HAS_BEEN_FEATURED (false, 33, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.PERISCOPE_HAS_BEEN_FEATURED), - PERISCOPE_IS_CURRENTLY_FEATURED (false, 34, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.PERISCOPE_IS_CURRENTLY_FEATURED), - PERISCOPE_IS_FROM_QUALITY_SOURCE(false, 35, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.PERISCOPE_IS_FROM_QUALITY_SOURCE), - PERISCOPE_IS_LIVE (false, 36, BOOLEAN_NORMALIZER, - EarlybirdFieldConstant.PERISCOPE_IS_LIVE), - - // decayed engagement counters - DECAYED_RETWEET (true, 37, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.DECAYED_RETWEET_COUNT), - DECAYED_REPLY (true, 38, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.DECAYED_REPLY_COUNT), - DECAYED_FAVORITE (true, 39, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.DECAYED_FAVORITE_COUNT), - DECAYED_QUOTE (true, 40, 
SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.DECAYED_QUOTE_COUNT), - - // timestamp of last engagement types - LAST_RETWEET_SINCE_CREATION_HR (false, 41, TIMESTAMP_SEC_TO_HR_NORMALIZER, - EarlybirdFieldConstant.LAST_RETWEET_SINCE_CREATION_HRS), - LAST_REPLY_SINCE_CREATION_HR (false, 42, TIMESTAMP_SEC_TO_HR_NORMALIZER, - EarlybirdFieldConstant.LAST_REPLY_SINCE_CREATION_HRS), - LAST_FAVORITE_SINCE_CREATION_HR (false, 43, TIMESTAMP_SEC_TO_HR_NORMALIZER, - EarlybirdFieldConstant.LAST_FAVORITE_SINCE_CREATION_HRS), - LAST_QUOTE_SINCE_CREATION_HR (false, 44, TIMESTAMP_SEC_TO_HR_NORMALIZER, - EarlybirdFieldConstant.LAST_QUOTE_SINCE_CREATION_HRS), - - // fake engagement counters - FAKE_RETWEET (true, 45, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.FAKE_RETWEET_COUNT), - FAKE_REPLY (true, 46, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.FAKE_REPLY_COUNT), - FAKE_FAVORITE (true, 47, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.FAKE_FAVORITE_COUNT), - FAKE_QUOTE (true, 48, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.FAKE_QUOTE_COUNT), - - // blink engagement counters - BLINK_RETWEET (true, 49, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.BLINK_RETWEET_COUNT), - BLINK_REPLY (true, 50, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.BLINK_REPLY_COUNT), - BLINK_FAVORITE (true, 51, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.BLINK_FAVORITE_COUNT), - BLINK_QUOTE (true, 52, SMART_INTEGER_NORMALIZER, - EarlybirdFieldConstant.BLINK_QUOTE_COUNT), - - /* semicolon in a single line to avoid polluting git blame */; - - private static final Map V2_COUNTER_MAP = - ImmutableMap.builder() - .put(RETWEET, RETWEET_V2) - .put(REPLY, REPLY_V2) - .put(FAVORITE, FAVORITE_V2) - .put(EMBEDS_IMP_COUNT, EMBEDS_IMP_COUNT_V2) - .put(EMBEDS_URL_COUNT, EMBEDS_URL_COUNT_V2) - .put(VIDEO_VIEW, VIDEO_VIEW_V2) - .build(); - - private static final Map WEIGHTED_COUNTER_MAP = - ImmutableMap.builder() - .put(RETWEET, WEIGHTED_RETWEET) - .put(REPLY, WEIGHTED_REPLY) - .put(FAVORITE, WEIGHTED_FAVORITE) - .put(QUOTE, WEIGHTED_QUOTE) - .build(); - - private static final Map DECAYED_COUNTER_MAP = - ImmutableMap.builder() - .put(RETWEET, DECAYED_RETWEET) - .put(REPLY, DECAYED_REPLY) - .put(FAVORITE, DECAYED_FAVORITE) - .put(QUOTE, DECAYED_QUOTE) - .build(); - - private static final Map DECAYED_COUNTER_TO_ELAPSED_TIME = - ImmutableMap.builder() - .put(DECAYED_RETWEET, LAST_RETWEET_SINCE_CREATION_HR) - .put(DECAYED_REPLY, LAST_REPLY_SINCE_CREATION_HR) - .put(DECAYED_FAVORITE, LAST_FAVORITE_SINCE_CREATION_HR) - .put(DECAYED_QUOTE, LAST_QUOTE_SINCE_CREATION_HR) - .build(); - - private static final Set DECAYED_FEATURES = - ImmutableSet.of(DECAYED_RETWEET, DECAYED_REPLY, DECAYED_FAVORITE, DECAYED_QUOTE); - - private static final Set FAKE_ENGAGEMENT_FEATURES = - ImmutableSet.of(FAKE_RETWEET, FAKE_REPLY, FAKE_FAVORITE, FAKE_QUOTE); - - private static final Set BLINK_ENGAGEMENT_FEATURES = - ImmutableSet.of(BLINK_RETWEET, BLINK_REPLY, BLINK_FAVORITE, BLINK_QUOTE); - - @Nullable - public TweetFeatureType getV2Type() { - return V2_COUNTER_MAP.get(this); - } - - @Nullable - public static TweetFeatureType getWeightedType(TweetFeatureType type) { - return WEIGHTED_COUNTER_MAP.get(type); - } - - @Nullable - public static TweetFeatureType getDecayedType(TweetFeatureType type) { - return DECAYED_COUNTER_MAP.get(type); - } - - // Whether this feature is incremental or direct value. 
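Before the field declarations resume, a small illustrative use of the lookup methods defined above (the return values noted in the comments follow from the counter maps; the fragment itself is not part of the deleted file):

    TweetFeatureType v2 = TweetFeatureType.RETWEET.getV2Type();                             // RETWEET_V2
    TweetFeatureType weighted = TweetFeatureType.getWeightedType(TweetFeatureType.RETWEET); // WEIGHTED_RETWEET
    TweetFeatureType decayed = TweetFeatureType.getDecayedType(TweetFeatureType.RETWEET);   // DECAYED_RETWEET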
- private final boolean incremental; - - // This normalizer is used to (1) normalize the output value in DLIndexEventOutputBolt, - // (2) check value change. - private final IntNormalizer normalizer; - - // value for composing cache key. It has to be unique and in increasing order. - private final int typeInt; - - private final EarlybirdFieldConstants.EarlybirdFieldConstant earlybirdField; - - private final IncrementChecker incrementChecker; - - /** - * Constructing an enum for a type. The earlybirdField can be null if it's not prepared, they - * can be here as placeholders but they can't be outputted. - * The normalizer is null for the timestamp features that do not require normalization - */ - TweetFeatureType(boolean incremental, - int typeInt, - IntNormalizer normalizer, - @Nullable EarlybirdFieldConstant earlybirdField) { - this.incremental = incremental; - this.typeInt = typeInt; - this.normalizer = normalizer; - this.earlybirdField = earlybirdField; - this.incrementChecker = new IncrementChecker(this); - } - - public boolean isIncremental() { - return incremental; - } - - public IntNormalizer getNormalizer() { - return normalizer; - } - - public int getTypeInt() { - return typeInt; - } - - public int normalize(double value) { - return normalizer.normalize(value); - } - - public IncrementChecker getIncrementChecker() { - return incrementChecker; - } - - public EarlybirdFieldConstant getEarlybirdField() { - return Preconditions.checkNotNull(earlybirdField); - } - - public boolean hasEarlybirdField() { - return earlybirdField != null; - } - - public boolean isDecayed() { - return DECAYED_FEATURES.contains(this); - } - - @Nullable - public TweetFeatureType getElapsedTimeFeatureType() { - return DECAYED_COUNTER_TO_ELAPSED_TIME.get(this); - } - - public boolean isFakeEngagement() { - return FAKE_ENGAGEMENT_FEATURES.contains(this); - } - - public boolean isBlinkEngagement() { - return BLINK_ENGAGEMENT_FEATURES.contains(this); - } - - /** - * Check if an increment is eligible for emitting - */ - public static class IncrementChecker { - private final IntNormalizer normalizer; - - public IncrementChecker(IntNormalizer normalizer) { - this.normalizer = normalizer; - } - - IncrementChecker(TweetFeatureType type) { - this(type.getNormalizer()); - } - - /** - * Check if a value change is eligible for output - */ - public boolean eligibleForEmit(int oldValue, int newValue) { - return normalizer.normalize(oldValue) != normalizer.normalize(newValue); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetFeatures.java b/src/java/com/twitter/search/common/relevance/features/TweetFeatures.java deleted file mode 100644 index b3eb4600a..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetFeatures.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.common.relevance.features; - -public class TweetFeatures { - private final TweetTextQuality tweetTextQuality = new TweetTextQuality(); - private final TweetTextFeatures tweetTextFeatures = new TweetTextFeatures(); - private final TweetUserFeatures tweetUserFeatures = new TweetUserFeatures(); - - public TweetTextFeatures getTweetTextFeatures() { - return tweetTextFeatures; - } - - public TweetTextQuality getTweetTextQuality() { - return tweetTextQuality; - } - - public TweetUserFeatures getTweetUserFeatures() { - return tweetUserFeatures; - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetIntegerShingleSignature.java 
b/src/java/com/twitter/search/common/relevance/features/TweetIntegerShingleSignature.java deleted file mode 100644 index 9caf94e88..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetIntegerShingleSignature.java +++ /dev/null @@ -1,201 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.nio.ByteBuffer; -import java.util.Arrays; - -import com.google.common.base.Preconditions; - -/** - * A TweetIntegerShingleSignature object consists of 4 bytes, each representing the signature of - * a status text sample. The signature bytes are sorted in ascending order and compacted to an - * integer in big endian for serialization. - * - * Fuzzy matching of two TweetIntegerShingleSignature objects is met when the number of matching - * bytes between the two is equal to or above 3. - */ -public class TweetIntegerShingleSignature { - public static final int NUM_SHINGLES = Integer.SIZE / Byte.SIZE; - public static final int DEFAULT_NO_SIGNATURE = 0; - public static final TweetIntegerShingleSignature NO_SIGNATURE_HANDLE = - deserialize(DEFAULT_NO_SIGNATURE); - public static final int DEFAULT_MIN_SHINGLES_MATCH = 3; - private final int minShinglesMatch; - - private final byte[] shingles; - private final int signature; // redundant information, for easier comparison. - - /** - * Construct from a byte array. - */ - public TweetIntegerShingleSignature(byte[] shingles, int minShinglesMatch) { - Preconditions.checkArgument(shingles.length == NUM_SHINGLES); - this.shingles = shingles; - // sort to byte's natural ascending order - Arrays.sort(this.shingles); - this.minShinglesMatch = minShinglesMatch; - this.signature = serializeInternal(shingles); - } - - /** - * Construct from a byte array. - */ - public TweetIntegerShingleSignature(byte[] shingles) { - this(shingles, DEFAULT_MIN_SHINGLES_MATCH); - } - - /** - * Construct from a serialized integer signature. - */ - public TweetIntegerShingleSignature(int signature, int minShinglesMatch) { - this.shingles = deserializeInternal(signature); - // sort to byte's natural ascending order - Arrays.sort(this.shingles); - this.minShinglesMatch = minShinglesMatch; - // now store the sorted shingles into signature field, may be different from what passed in. - this.signature = serializeInternal(shingles); - } - - /** - * Construct from a serialized integer signature. - */ - public TweetIntegerShingleSignature(int signature) { - this(signature, DEFAULT_MIN_SHINGLES_MATCH); - } - - /** - * Used by ingester to generate signature. - * Raw signatures are in byte arrays per sample, and can be more or less - * than what is asked for. - * - * @param rawSignature - */ - public TweetIntegerShingleSignature(Iterable rawSignature) { - byte[] condensedSignature = new byte[NUM_SHINGLES]; - int i = 0; - for (byte[] signatureItem : rawSignature) { - condensedSignature[i++] = signatureItem[0]; - if (i == NUM_SHINGLES) { - break; - } - } - this.shingles = condensedSignature; - Arrays.sort(this.shingles); - this.minShinglesMatch = DEFAULT_MIN_SHINGLES_MATCH; - this.signature = serializeInternal(shingles); - } - - /** - * When used in a hashtable for dup detection, take the first byte of each signature for fast - * pass for majority case of no fuzzy matching. For top queries, this optimization losses about - * only 4% of all fuzzy matches. - * - * @return most significant byte of this signature as its hashcode. 
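A hedged, self-contained illustration of the fuzzy-matching rule described above (the byte values are arbitrary):

    // Two signatures count as near-duplicates when at least DEFAULT_MIN_SHINGLES_MATCH (3)
    // of their 4 sorted, non-zero shingle bytes agree.
    int sigA = new TweetIntegerShingleSignature(new byte[] {10, 20, 30, 40}).serialize();
    int sigB = new TweetIntegerShingleSignature(new byte[] {10, 20, 30, 99}).serialize();
    boolean nearDuplicate = TweetIntegerShingleSignature.isFuzzyMatch(sigA, sigB);  // true: 3 bytes match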
- */ - @Override - public int hashCode() { - return shingles[0] & 0xFF; - } - - /** - * Perform fuzzy matching between two TweetIntegerShingleSignature objects. - * - * @param other TweetIntegerShingleSignature object to perform fuzzy match against - * @return true if at least minMatch number of bytes match - */ - @Override - public boolean equals(Object other) { - if (this == other) { - return true; - } - if (other == null) { - return false; - } - if (getClass() != other.getClass()) { - return false; - } - - final TweetIntegerShingleSignature otherSignatureInteger = (TweetIntegerShingleSignature) other; - - int otherSignature = otherSignatureInteger.serialize(); - if (signature == otherSignature) { - // Both serialized signature is the same - return true; - } else if (signature != DEFAULT_NO_SIGNATURE && otherSignature != DEFAULT_NO_SIGNATURE) { - // Neither is NO_SIGNATURE, need to compare shingles. - byte[] otherShingles = otherSignatureInteger.getShingles(); - int numberMatchesNeeded = minShinglesMatch; - // expect bytes are in ascending sorted order - int i = 0; - int j = 0; - while (((numberMatchesNeeded <= (NUM_SHINGLES - i)) // early termination for i - || (numberMatchesNeeded <= (NUM_SHINGLES - j))) // early termination j - && (i < NUM_SHINGLES) && (j < NUM_SHINGLES)) { - if (shingles[i] == otherShingles[j]) { - if (shingles[i] != 0) { // we only consider two shingles equal if they are non zero - numberMatchesNeeded--; - if (numberMatchesNeeded == 0) { - return true; - } - } - i++; - j++; - } else if (shingles[i] < otherShingles[j]) { - i++; - } else if (shingles[i] > otherShingles[j]) { - j++; - } - } - } - // One is NO_SIGNATURE and one is not. - return false; - } - - /** - * Returns the sorted array of signature bytes. - */ - public byte[] getShingles() { - return shingles; - } - - /** - * Serialize 4 sorted signature bytes into an integer in big endian order. - * - * @return compacted int signature - */ - private static int serializeInternal(byte[] shingles) { - ByteBuffer byteBuffer = ByteBuffer.allocate(NUM_SHINGLES); - byteBuffer.put(shingles, 0, NUM_SHINGLES); - return byteBuffer.getInt(0); - } - - /** - * Deserialize an integer into a 4-byte array. - * @param signature The signature integer. - * @return A byte array with 4 elements. 
- */ - private static byte[] deserializeInternal(int signature) { - return ByteBuffer.allocate(NUM_SHINGLES).putInt(signature).array(); - } - - public int serialize() { - return signature; - } - - public static boolean isFuzzyMatch(int signature1, int signature2) { - return TweetIntegerShingleSignature.deserialize(signature1).equals( - TweetIntegerShingleSignature.deserialize(signature2)); - } - - public static TweetIntegerShingleSignature deserialize(int signature) { - return new TweetIntegerShingleSignature(signature); - } - - public static TweetIntegerShingleSignature deserialize(int signature, int minMatchSingles) { - return new TweetIntegerShingleSignature(signature, minMatchSingles); - } - - @Override - public String toString() { - return String.format("%d %d %d %d", shingles[0], shingles[1], shingles[2], shingles[3]); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetSignatureUtil.java b/src/java/com/twitter/search/common/relevance/features/TweetSignatureUtil.java deleted file mode 100644 index 76bb215db..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetSignatureUtil.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.common.relevance.features; - -public final class TweetSignatureUtil { - private TweetSignatureUtil() { - } - - /** Converts the signature in args[0] to a TweetIntegerShingleSignature. */ - public static void main(String[] args) throws Exception { - if (args.length < 1) { - throw new RuntimeException("Please provide signature value."); - } - int signature = Integer.parseInt(args[0]); - System.out.println(TweetIntegerShingleSignature.deserialize(signature).toString()); - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetTextFeatures.java b/src/java/com/twitter/search/common/relevance/features/TweetTextFeatures.java deleted file mode 100644 index e545edd3f..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetTextFeatures.java +++ /dev/null @@ -1,225 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Collection; -import java.util.List; -import java.util.Set; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Sets; - -import com.twitter.common.text.token.TokenizedCharSequence; - -public class TweetTextFeatures { - // Basic Features, always extracted. - // normalized, lower cased tweet text, w/o resolved urls - private String normalizedText; - - // tokens from normalizedText, w/o resolved urls, lower cased. - private List tokens; - - // tokens from resolved urls, lower cased. - private List resolvedUrlsTokens; - - // tokens in the form of a TokenizedCharSeq, NOT LOWER CASED - private TokenizedCharSequence tokenSequence; - - // strippedTokens above joined with space - private String normalizedStrippedText; - - // normalized, original case tokens, without @mention, #hashtag or urls. - private List strippedTokens; - - // all hash tags, without "#", lower cased - private Set hashtags = Sets.newHashSet(); - - // all mentions, without "@", lower cased - private Set mentions = Sets.newHashSet(); - - // whether this tweet has a question mark that's not in url. - private boolean hasQuestionMark = false; - - private boolean hasPositiveSmiley = false; - private boolean hasNegativeSmiley = false; - - // normalized, original case smileys - private List smileys; - - // lower cased, normalized stock names, without "$" - private List stocks; - - // Extra features for text quality evaluation only. 
- private int signature = TweetIntegerShingleSignature.DEFAULT_NO_SIGNATURE; - private Set trendingTerms = Sets.newHashSet(); - private int length; - private int caps; - - public String getNormalizedText() { - return normalizedText; - } - - public void setNormalizedText(String normalizedText) { - this.normalizedText = normalizedText; - } - - public List getTokens() { - return tokens; - } - - public int getTokensSize() { - return tokens == null ? 0 : tokens.size(); - } - - public void setTokens(List tokens) { - this.tokens = tokens; - } - - public List getResolvedUrlTokens() { - return resolvedUrlsTokens; - } - - public int getResolvedUrlTokensSize() { - return resolvedUrlsTokens == null ? 0 : resolvedUrlsTokens.size(); - } - - public void setResolvedUrlTokens(List tokensResolvedUrls) { - this.resolvedUrlsTokens = tokensResolvedUrls; - } - - public TokenizedCharSequence getTokenSequence() { - return tokenSequence; - } - - public void setTokenSequence(TokenizedCharSequence tokenSequence) { - this.tokenSequence = tokenSequence; - } - - public String getNormalizedStrippedText() { - return normalizedStrippedText; - } - - public void setNormalizedStrippedText(String normalizedStrippedText) { - this.normalizedStrippedText = normalizedStrippedText; - } - - public List getStrippedTokens() { - return strippedTokens; - } - - public int getStrippedTokensSize() { - return strippedTokens == null ? 0 : strippedTokens.size(); - } - - public void setStrippedTokens(List strippedTokens) { - this.strippedTokens = strippedTokens; - } - - public Set getHashtags() { - return hashtags; - } - - public int getHashtagsSize() { - return hashtags.size(); - } - - public void setHashtags(Collection hashtags) { - this.hashtags = Sets.newHashSet(hashtags); - } - - public Set getMentions() { - return mentions; - } - - public int getMentionsSize() { - return mentions.size(); - } - - public void setMentions(Collection mentions) { - this.mentions = Sets.newHashSet(mentions); - } - - public boolean hasQuestionMark() { - return hasQuestionMark; - } - - public void setHasQuestionMark(boolean hasQuestionMark) { - this.hasQuestionMark = hasQuestionMark; - } - - public boolean hasPositiveSmiley() { - return hasPositiveSmiley; - } - - public void setHasPositiveSmiley(boolean hasPositiveSmiley) { - this.hasPositiveSmiley = hasPositiveSmiley; - } - - public boolean hasNegativeSmiley() { - return hasNegativeSmiley; - } - - public void setHasNegativeSmiley(boolean hasNegativeSmiley) { - this.hasNegativeSmiley = hasNegativeSmiley; - } - - public List getSmileys() { - return smileys; - } - - public int getSmileysSize() { - return smileys == null ? 0 : smileys.size(); - } - - public void setSmileys(List smileys) { - this.smileys = smileys; - } - - public List getStocks() { - return stocks; - } - - public int getStocksSize() { - return stocks == null ? 0 : stocks.size(); - } - - public void setStocks(List stocks) { - this.stocks = stocks; - } - - public int getSignature() { - return signature; - } - - public void setSignature(int signature) { - this.signature = signature; - } - - /** Returns the trending terms. 
*/ - public Set getTrendingTerms() { - return trendingTerms; - } - - public int getTrendingTermsSize() { - return trendingTerms.size(); - } - - @VisibleForTesting - public void setTrendingTerms(Set trendingTerms) { - this.trendingTerms = trendingTerms; - } - - public int getLength() { - return length; - } - - public void setLength(int length) { - this.length = length; - } - - public int getCaps() { - return caps; - } - - public void setCaps(int caps) { - this.caps = caps; - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetTextQuality.java b/src/java/com/twitter/search/common/relevance/features/TweetTextQuality.java deleted file mode 100644 index 63aa30eeb..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetTextQuality.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Set; - -import com.google.common.collect.Sets; - -public class TweetTextQuality { - - public static enum BooleanQualityType { - OFFENSIVE, // tweet text is offensive - OFFENSIVE_USER, // user name is offensive - HASHTAG_NAME_MATCH, // hashtag matches username - SENSITIVE, // tweet is marked as sensitive when it comes in - } - - public static final double ENTROPY_NOT_SET = Double.MIN_VALUE; - - public static final byte UNSET_TEXT_SCORE = -128; - - private double readability; - private double shout; - private double entropy = ENTROPY_NOT_SET; - private final Set boolQualities = Sets.newHashSet(); - private byte textScore = UNSET_TEXT_SCORE; - - public double getReadability() { - return readability; - } - - public void setReadability(double readability) { - this.readability = readability; - } - - public double getShout() { - return shout; - } - - public void setShout(double shout) { - this.shout = shout; - } - - public double getEntropy() { - return entropy; - } - - public void setEntropy(double entropy) { - this.entropy = entropy; - } - - public void addBoolQuality(BooleanQualityType type) { - boolQualities.add(type); - } - - public boolean hasBoolQuality(BooleanQualityType type) { - return boolQualities.contains(type); - } - - public Set getBoolQualities() { - return boolQualities; - } - - public byte getTextScore() { - return textScore; - } - - public void setTextScore(byte textScore) { - this.textScore = textScore; - } -} diff --git a/src/java/com/twitter/search/common/relevance/features/TweetUserFeatures.java b/src/java/com/twitter/search/common/relevance/features/TweetUserFeatures.java deleted file mode 100644 index 89c9b5196..000000000 --- a/src/java/com/twitter/search/common/relevance/features/TweetUserFeatures.java +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.common.relevance.features; - -import java.util.Map; - -public class TweetUserFeatures { - private String lang; - private double langConfidence; - private int followers; - private int following; - private int reputation; - private int tweets; - private int retweets; - private int retweeted; - private Map knownForTopics; - private boolean isSpam; - private boolean isNsfw; - private boolean isBot; - - public String getLang() { - return lang; - } - - public void setLang(String lang) { - this.lang = lang; - } - - public double getLangConfidence() { - return langConfidence; - } - - public void setLangConfidence(double langConfidence) { - this.langConfidence = langConfidence; - } - - public int getFollowers() { - return followers; - } - - public void setFollowers(int followers) { - this.followers = followers; - } - - public int getFollowing() { - 
return following; - } - - public void setFollowing(int following) { - this.following = following; - } - - public int getReputation() { - return reputation; - } - - public void setReputation(int reputation) { - this.reputation = reputation; - } - - public int getTweets() { - return tweets; - } - - public void setTweets(int tweets) { - this.tweets = tweets; - } - - public int getRetweets() { - return retweets; - } - - public void setRetweets(int retweets) { - this.retweets = retweets; - } - - public int getRetweeted() { - return retweeted; - } - - public void setRetweeted(int retweeted) { - this.retweeted = retweeted; - } - - public Map getKnownForTopics() { - return knownForTopics; - } - - public void setKnownForTopics(Map knownForTopics) { - this.knownForTopics = knownForTopics; - } - - public boolean isSpam() { - return isSpam; - } - - public void setSpam(boolean spam) { - isSpam = spam; - } - - public boolean isNsfw() { - return isNsfw; - } - - public void setNsfw(boolean nsfw) { - isNsfw = nsfw; - } - - public boolean isBot() { - return isBot; - } - - public void setBot(boolean bot) { - isBot = bot; - } -} diff --git a/src/java/com/twitter/search/common/relevance/scorers/TweetScorer.java b/src/java/com/twitter/search/common/relevance/scorers/TweetScorer.java deleted file mode 100644 index bd8f55bad..000000000 --- a/src/java/com/twitter/search/common/relevance/scorers/TweetScorer.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.common.relevance.scorers; - -import com.twitter.search.common.relevance.classifiers.TweetClassifier; -import com.twitter.search.common.relevance.entities.TwitterMessage; - -/** - * Interface to compute feature scores for a single @TwitterMessage - * object, or a group of them, after they have been processed by - * feature classifiers. - * - * Intentionally kept Scorers separate from Classifiers, since they - * may be run at different stages and in different batching manners. - * Convenience methods are provided to run classification and scoring - * in one call. - */ -public abstract class TweetScorer { - /** - * Compute and store feature score in TwitterMessage based on its - * TweetFeatures. - * - * @param tweet tweet message to compute and store score to. - */ - public abstract void scoreTweet(final TwitterMessage tweet); - - /** - * Score a group of TwitterMessages based on their corresponding TweetFeatures - * and store feature scores in TwitterMessages. - * - * This default implementation just iterates through the map and scores each - * individual tweet. Batching for better performance, if applicable, can be implemented by - * concrete subclasses. - * - * @param tweets TwitterMessages to score. - */ - public void scoreTweets(Iterable tweets) { - for (TwitterMessage tweet: tweets) { - scoreTweet(tweet); - } - } - - /** - * Convenience method. - * Classify tweet using the specified list of classifiers, then compute score. - * - * @param classifier list of classifiers to use for classification. - * @param tweet tweet to classify and score - */ - public void classifyAndScoreTweet(TweetClassifier classifier, TwitterMessage tweet) { - classifier.classifyTweet(tweet); - scoreTweet(tweet); - } - - /** - * Convenience method. - * Classify tweets using the specified list of classifiers, then compute score. - * - * @param classifier classifier to use for classification. 
- * @param tweets tweets to classify and score - */ - public void classifyAndScoreTweets(TweetClassifier classifier, Iterable tweets) { - for (TwitterMessage tweet: tweets) { - classifyAndScoreTweet(classifier, tweet); - } - } -} diff --git a/src/java/com/twitter/search/common/relevance/scorers/TweetTextScorer.java b/src/java/com/twitter/search/common/relevance/scorers/TweetTextScorer.java deleted file mode 100644 index e682e5614..000000000 --- a/src/java/com/twitter/search/common/relevance/scorers/TweetTextScorer.java +++ /dev/null @@ -1,242 +0,0 @@ -package com.twitter.search.common.relevance.scorers; - -import java.util.Map; -import java.util.concurrent.ConcurrentMap; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.config.TweetProcessingConfig; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.TweetFeatures; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.relevance.features.TweetTextQuality; - -/** - * Compute a text score for TwitterMessage based on its offensiveness, - * shoutness, length, readability and hashtag properties extracted from - * tweet text. - *
- * Formula: - * text_score = offensive_text_damping * offensive_username_damping * - * Sigma(feature_score_weight * feature_score) - *
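A worked numeric sketch of this formula, plugging in the default weights and dampings declared in the class below; the per-feature scores are invented inputs in [0, 1] for a tweet whose text (but not screen name) was flagged offensive:

    double score = 0.2        // offensive_term_damping (text flagged offensive)
        * 1.0                 // screen name not offensive, so no extra damping
        * (0.5  * 0.8         // length
         + 0.1  * 0.6         // readability
         + 0.1  * 0.9         // shout
         + 0.25 * 0.7         // entropy
         + 0.05 * 1.0);       // tweet has a link
    byte textScore = (byte) (score * 100);  // == 15, scaled to [0, 100] as in scoreTweet()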
- * scored features are: length, readability, shout, entropy, links - */ -public class TweetTextScorer extends TweetScorer { - private static final Logger LOG = LoggerFactory.getLogger(TweetTextScorer.class); - - private static final double DEFAULT_OFFENSIVE_TERM_DAMPING = 0.2d; - private static final double DEFAULT_OFFENSIVE_NAME_DAMPING = 0.2d; - - // Sigma of all weights = 1.0d - private static final double DEFAULT_LENGTH_WEIGHT = 0.5d; - private static final double DEFAULT_READABILITY_WEIGHT = 0.1d; - private static final double DEFAULT_SHOUT_WEIGHT = 0.1d; - private static final double DEFAULT_ENTROPY_WEIGHT = 0.25d; - private static final double DEFAULT_LINK_WEIGHT = 0.05d; - - private static final double DEFAULT_NO_DAMPING = 1.0d; - - // Sigmoid alpha values for normalization - private static final double DEFAULT_READABILITY_ALPHA = 0.05d; - private static final double DEFAULT_ENTROPY_ALPHA = 0.5d; - private static final double DEFAULT_LENGTH_ALPHA = 0.03d; - - private static final ConcurrentMap RATE_COUNTERS = - Maps.newConcurrentMap(); - private static final ConcurrentMap> - SCORE_HISTOGRAMS = Maps.newConcurrentMap(); - - private double offensiveTermDamping = DEFAULT_OFFENSIVE_TERM_DAMPING; - private double offensiveNameDamping = DEFAULT_OFFENSIVE_NAME_DAMPING; - - private double lengthWeight = DEFAULT_LENGTH_WEIGHT; - private double readabilityWeight = DEFAULT_READABILITY_WEIGHT; - private double shoutWeight = DEFAULT_SHOUT_WEIGHT; - private double entropyWeight = DEFAULT_ENTROPY_WEIGHT; - private double linkWeight = DEFAULT_LINK_WEIGHT; - - private double readabilityAlpha = DEFAULT_READABILITY_ALPHA; - private double entropyAlpha = DEFAULT_ENTROPY_ALPHA; - private double lengthAlpha = DEFAULT_LENGTH_ALPHA; - - /** Configure from a config file, validate the configuration. */ - public TweetTextScorer(String configFile) { - TweetProcessingConfig.init(configFile); - - // get dampings - checkWeightRange(offensiveTermDamping = TweetProcessingConfig - .getDouble("offensive_term_damping", DEFAULT_OFFENSIVE_TERM_DAMPING)); - checkWeightRange(offensiveNameDamping = TweetProcessingConfig - .getDouble("offensive_name_damping", DEFAULT_OFFENSIVE_NAME_DAMPING)); - - // get weights - checkWeightRange(lengthWeight = TweetProcessingConfig - .getDouble("length_weight", DEFAULT_LENGTH_WEIGHT)); - checkWeightRange(readabilityWeight = TweetProcessingConfig - .getDouble("readability_weight", DEFAULT_READABILITY_WEIGHT)); - checkWeightRange(shoutWeight = TweetProcessingConfig - .getDouble("shout_weight", DEFAULT_SHOUT_WEIGHT)); - checkWeightRange(entropyWeight = TweetProcessingConfig - .getDouble("entropy_weight", DEFAULT_ENTROPY_WEIGHT)); - checkWeightRange(linkWeight = TweetProcessingConfig - .getDouble("link_weight", DEFAULT_LINK_WEIGHT)); - - // check sigma of weights - Preconditions.checkArgument( - lengthWeight + readabilityWeight + shoutWeight + entropyWeight + linkWeight == 1.0d); - - readabilityAlpha = TweetProcessingConfig - .getDouble("readability_alpha", DEFAULT_READABILITY_ALPHA); - entropyAlpha = TweetProcessingConfig.getDouble("entropy_alpha", DEFAULT_ENTROPY_ALPHA); - lengthAlpha = TweetProcessingConfig.getDouble("length_alpha", DEFAULT_LENGTH_ALPHA); - } - - /** Creates a new TweetTextScorer instance. */ - public TweetTextScorer() { - } - - /** Scores the given tweet. 
*/ - public void scoreTweet(final TwitterMessage tweet) { - Preconditions.checkNotNull(tweet); - - for (PenguinVersion penguinVersion : tweet.getSupportedPenguinVersions()) { - TweetFeatures features = Preconditions.checkNotNull(tweet.getTweetFeatures(penguinVersion)); - TweetTextFeatures textFeatures = Preconditions.checkNotNull(features.getTweetTextFeatures()); - TweetTextQuality textQuality = Preconditions.checkNotNull(features.getTweetTextQuality()); - boolean isOffensiveText = textQuality.hasBoolQuality( - TweetTextQuality.BooleanQualityType.OFFENSIVE); - boolean isOffensiveScreenName = textQuality.hasBoolQuality( - TweetTextQuality.BooleanQualityType.OFFENSIVE_USER); - double shoutScore = DEFAULT_NO_DAMPING - textQuality.getShout(); - double lengthScore = normalize(textFeatures.getLength(), lengthAlpha); - double readabilityScore = normalize(textQuality.getReadability(), readabilityAlpha); - double entropyScore = normalize(textQuality.getEntropy(), entropyAlpha); - - double score = (isOffensiveText ? offensiveTermDamping : DEFAULT_NO_DAMPING) - * (isOffensiveScreenName ? offensiveNameDamping : DEFAULT_NO_DAMPING) - * (lengthWeight * lengthScore - + readabilityWeight * readabilityScore - + shoutWeight * shoutScore - + entropyWeight * entropyScore - + linkWeight * (tweet.getExpandedUrlMapSize() > 0 ? 1 : 0)); - - // scale to [0, 100] byte - textQuality.setTextScore((byte) (score * 100)); - - updateStats( - isOffensiveText, - isOffensiveScreenName, - textFeatures, - score, - getRateCounterStat("num_offensive_text_", penguinVersion), - getRateCounterStat("num_offensive_user_", penguinVersion), - getRateCounterStat("num_no_trends_", penguinVersion), - getRateCounterStat("num_has_trends_", penguinVersion), - getRateCounterStat("num_too_many_trends_", penguinVersion), - getRateCounterStat("num_scored_tweets_", penguinVersion), - getScoreHistogram(penguinVersion)); - - if (LOG.isDebugEnabled()) { - LOG.debug(String.format( - "Tweet length [%.2f] weighted length [%.2f], readability [%.2f] " - + "weighted readability [%.2f], shout [%.2f] weighted shout [%.2f], " - + "entropy [%.2f], weighted entropy [%.2f], " - + "score [%.2f], text [%s], penguin version [%s]", - lengthScore, - lengthWeight * lengthScore, - readabilityScore, - readabilityWeight * readabilityScore, - shoutScore, - shoutWeight * shoutScore, - entropyScore, - entropyWeight * entropyScore, - score, - tweet.getText(), - penguinVersion)); - } - } - } - - private void updateStats(boolean isOffensiveText, - boolean isOffensiveScreenName, - TweetTextFeatures textFeatures, - double score, - SearchRateCounter offensiveTextCounter, - SearchRateCounter offensiveUserNameCounter, - SearchRateCounter noTrendsCounter, - SearchRateCounter hasTrendsCounter, - SearchRateCounter tooManyTrendsHashtagsCounter, - SearchRateCounter scoredTweets, - Map scoreHistogram) { - // set stats - if (isOffensiveText) { - offensiveTextCounter.increment(); - } - if (isOffensiveScreenName) { - offensiveUserNameCounter.increment(); - } - if (textFeatures.getTrendingTermsSize() == 0) { - noTrendsCounter.increment(); - } else { - hasTrendsCounter.increment(); - } - if (TwitterMessage.hasMultipleHashtagsOrTrends(textFeatures)) { - tooManyTrendsHashtagsCounter.increment(); - } - scoredTweets.increment(); - - int bucket = (int) Math.floor(score * 10) * 10; - scoreHistogram.get(bucket).increment(); - } - - // normalize the passed in value to smoothed [0, 1.0d] range - private static double normalize(double value, double alpha) { - return 2 * (1.0d / (1.0d + 
Math.exp(-(alpha * value))) - 0.5); - } - - // Make sure weight values are within the range of [0.0, 1.0] - private void checkWeightRange(double value) { - Preconditions.checkArgument(value >= 0.0d && value <= 1.0d); - } - - private Map getScoreHistogram(PenguinVersion penguinVersion) { - Map scoreHistogram = SCORE_HISTOGRAMS.get(penguinVersion); - if (scoreHistogram == null) { - scoreHistogram = Maps.newHashMap(); - String statsName = "num_text_score_%d_%s"; - - for (int i = 0; i <= 100; i += 10) { - scoreHistogram.put(i, RelevanceStats.exportRate( - String.format(statsName, i, penguinVersion.name().toLowerCase()))); - } - - scoreHistogram = SCORE_HISTOGRAMS.putIfAbsent(penguinVersion, scoreHistogram); - if (scoreHistogram == null) { - scoreHistogram = SCORE_HISTOGRAMS.get(penguinVersion); - } - } - - return scoreHistogram; - } - - private SearchRateCounter getRateCounterStat(String statPrefix, PenguinVersion penguinVersion) { - String statName = statPrefix + penguinVersion.name().toLowerCase(); - SearchRateCounter rateCounter = RATE_COUNTERS.get(statName); - if (rateCounter == null) { - // Only one RateCounter instance is created for each stat name. So we don't need to worry - // that another thread might've created this instance in the meantime: we can just create/get - // it, and store it in the map. - rateCounter = RelevanceStats.exportRate(statName); - RATE_COUNTERS.put(statName, rateCounter); - } - return rateCounter; - } -} diff --git a/src/java/com/twitter/search/common/relevance/text/LocationUtils.java b/src/java/com/twitter/search/common/relevance/text/LocationUtils.java deleted file mode 100644 index 5fb43543e..000000000 --- a/src/java/com/twitter/search/common/relevance/text/LocationUtils.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.common.relevance.text; - -import java.util.regex.Matcher; - -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.util.text.regex.Regex; - -public final class LocationUtils { - private LocationUtils() { - } - - /** - * Extract lat/lon information from a twitter message. - * @param message The twitter message. - * @return A two-element double array for the lat/lon information. - */ - public static double[] extractLatLon(TwitterMessage message) { - // first look in text for L:, then fall back to profile - Matcher loc = Regex.LAT_LON_LOC_PATTERN.matcher(message.getText()); - if (loc.find() || message.getOrigLocation() != null - && (loc = Regex.LAT_LON_PATTERN.matcher(message.getOrigLocation())).find()) { - final double lat = Double.parseDouble(loc.group(2)); - final double lon = Double.parseDouble(loc.group(3)); - - if (Math.abs(lat) > 90.0) { - throw new NumberFormatException("Latitude cannot exceed +-90 degrees: " + lat); - } - if (Math.abs(lon) > 180.0) { - throw new NumberFormatException("Longitude cannot exceed +-180 degrees: " + lon); - } - - // Reject these common "bogus" regions. 
- if ((lat == 0 && lon == 0) || lat == -1 || lon == -1) { - return null; - } - - return new double[]{lat, lon}; - } - return null; - } -} diff --git a/src/java/com/twitter/search/common/relevance/text/TweetParser.java b/src/java/com/twitter/search/common/relevance/text/TweetParser.java deleted file mode 100644 index df518ba5f..000000000 --- a/src/java/com/twitter/search/common/relevance/text/TweetParser.java +++ /dev/null @@ -1,190 +0,0 @@ -package com.twitter.search.common.relevance.text; - -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; -import java.util.Locale; -import java.util.Set; - -import com.google.common.base.Joiner; -import com.google.common.collect.Sets; - -import com.twitter.common.text.util.CharSequenceUtils; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.TweetTextFeatures; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.text.Smileys; -import com.twitter.search.common.util.text.TokenizerHelper; -import com.twitter.search.common.util.text.TokenizerResult; - -/** - * A parser to extract very basic information from a tweet. - */ -public class TweetParser { - private static final boolean DO_NOT_REMOVE_WWW = false; - - /** Parses the given TwitterMessage. */ - public void parseTweet(TwitterMessage message) { - parseTweet(message, false, true); - } - - /** Parses the given TwitterMessage. */ - public void parseTweet(TwitterMessage message, - boolean useEntitiesFromTweetText, - boolean parseUrls) { - for (PenguinVersion penguinVersion : message.getSupportedPenguinVersions()) { - parseTweet(message, useEntitiesFromTweetText, parseUrls, penguinVersion); - } - } - - /** Parses the given TwitterMessage. */ - public void parseTweet(TwitterMessage message, - boolean useEntitiesFromTweetText, - boolean parseUrls, - PenguinVersion penguinVersion) { - TweetTextFeatures textFeatures = message.getTweetTextFeatures(penguinVersion); - String rawText = message.getText(); - Locale locale = message.getLocale(); - - // don't lower case first. - String normalizedText = NormalizerHelper.normalizeKeepCase(rawText, locale, penguinVersion); - String lowercasedNormalizedText = - CharSequenceUtils.toLowerCase(normalizedText, locale).toString(); - - textFeatures.setNormalizedText(lowercasedNormalizedText); - - TokenizerResult result = TokenizerHelper.tokenizeTweet(normalizedText, locale, penguinVersion); - List tokens = new ArrayList<>(result.tokens); - textFeatures.setTokens(tokens); - textFeatures.setTokenSequence(result.tokenSequence); - - if (parseUrls) { - parseUrls(message, textFeatures); - } - - textFeatures.setStrippedTokens(result.strippedDownTokens); - textFeatures.setNormalizedStrippedText(Joiner.on(" ").skipNulls() - .join(result.strippedDownTokens)); - - // Sanity checks, make sure there is no null token list. 
- if (textFeatures.getTokens() == null) { - textFeatures.setTokens(Collections.emptyList()); - } - if (textFeatures.getResolvedUrlTokens() == null) { - textFeatures.setResolvedUrlTokens(Collections.emptyList()); - } - if (textFeatures.getStrippedTokens() == null) { - textFeatures.setStrippedTokens(Collections.emptyList()); - } - - setHashtagsAndMentions(message, textFeatures, penguinVersion); - textFeatures.setStocks(sanitizeTokenizerResults(result.stocks, '$')); - textFeatures.setHasQuestionMark(findQuestionMark(textFeatures)); - - // Set smiley polarities. - textFeatures.setSmileys(result.smileys); - for (String smiley : textFeatures.getSmileys()) { - if (Smileys.isValidSmiley(smiley)) { - boolean polarity = Smileys.getPolarity(smiley); - if (polarity) { - textFeatures.setHasPositiveSmiley(true); - } else { - textFeatures.setHasNegativeSmiley(true); - } - } - } - message.setTokenizedCharSequence(penguinVersion, result.rawSequence); - - if (useEntitiesFromTweetText) { - takeEntities(message, textFeatures, result, penguinVersion); - } - } - - /** Parse the URLs in the given TwitterMessage. */ - public void parseUrls(TwitterMessage message) { - for (PenguinVersion penguinVersion : message.getSupportedPenguinVersions()) { - parseUrls(message, message.getTweetTextFeatures(penguinVersion)); - } - } - - /** Parse the URLs in the given TwitterMessage. */ - public void parseUrls(TwitterMessage message, TweetTextFeatures textFeatures) { - if (message.getExpandedUrlMap() != null) { - Set urlsToTokenize = Sets.newLinkedHashSet(); - for (ThriftExpandedUrl url : message.getExpandedUrlMap().values()) { - if (url.isSetExpandedUrl()) { - urlsToTokenize.add(url.getExpandedUrl()); - } - if (url.isSetCanonicalLastHopUrl()) { - urlsToTokenize.add(url.getCanonicalLastHopUrl()); - } - } - TokenizerResult resolvedUrlResult = - TokenizerHelper.tokenizeUrls(urlsToTokenize, message.getLocale(), DO_NOT_REMOVE_WWW); - List urlTokens = new ArrayList<>(resolvedUrlResult.tokens); - textFeatures.setResolvedUrlTokens(urlTokens); - } - } - - private void takeEntities(TwitterMessage message, - TweetTextFeatures textFeatures, - TokenizerResult result, - PenguinVersion penguinVersion) { - if (message.getHashtags().isEmpty()) { - // add hashtags to TwitterMessage if it doens't already have them, from - // JSON entities, this happens when we do offline indexing - for (String hashtag : sanitizeTokenizerResults(result.hashtags, '#')) { - message.addHashtag(hashtag); - } - } - - if (message.getMentions().isEmpty()) { - // add mentions to TwitterMessage if it doens't already have them, from - // JSON entities, this happens when we do offline indexing - for (String mention : sanitizeTokenizerResults(result.mentions, '@')) { - message.addMention(mention); - } - } - - setHashtagsAndMentions(message, textFeatures, penguinVersion); - } - - private void setHashtagsAndMentions(TwitterMessage message, - TweetTextFeatures textFeatures, - PenguinVersion penguinVersion) { - textFeatures.setHashtags(message.getNormalizedHashtags(penguinVersion)); - textFeatures.setMentions(message.getLowercasedMentions()); - } - - // The strings in the mentions, hashtags and stocks lists in TokenizerResult should already have - // the leading characters ('@', '#' and '$') stripped. So in most cases, this sanitization is not - // needed. However, sometimes Penguin tokenizes hashtags, cashtags and mentions incorrectly - // (for example, when using the Korean tokenizer for tokens like ~@mention or ?#hashtag -- see - // SEARCHQUAL-11924 for more details). 
So we're doing this extra sanitization here to try to work - // around these tokenization issues. - private List sanitizeTokenizerResults(List tokens, char tokenSymbol) { - List sanitizedTokens = new ArrayList(); - for (String token : tokens) { - int indexOfTokenSymbol = token.indexOf(tokenSymbol); - if (indexOfTokenSymbol < 0) { - sanitizedTokens.add(token); - } else { - String sanitizedToken = token.substring(indexOfTokenSymbol + 1); - if (!sanitizedToken.isEmpty()) { - sanitizedTokens.add(sanitizedToken); - } - } - } - return sanitizedTokens; - } - - /** Determines if the normalized text of the given features contain a question mark. */ - public static boolean findQuestionMark(TweetTextFeatures textFeatures) { - // t.co links don't contain ?'s, so it's not necessary to subtract ?'s occurring in Urls - // the tweet text always contains t.co, even if the display url is different - // all links on twitter are now wrapped into t.co - return textFeatures.getNormalizedText().contains("?"); - } -} diff --git a/src/java/com/twitter/search/common/relevance/text/VisibleTokenRatioNormalizer.java b/src/java/com/twitter/search/common/relevance/text/VisibleTokenRatioNormalizer.java deleted file mode 100644 index d7017448f..000000000 --- a/src/java/com/twitter/search/common/relevance/text/VisibleTokenRatioNormalizer.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.common.relevance.text; - -public class VisibleTokenRatioNormalizer { - - private static final int NORMALIZE_TO_BITS = 4; - private final int normalizeToSize; - - /** - * constructor - */ - public VisibleTokenRatioNormalizer(int normalizeToBits) { - int size = 2 << (normalizeToBits - 1); - // Let's say normalizeSize is set to 16.... - // If you multiply 1.0 * 16, it is 16 - // If you multiply 0.0 * 16, it is 0 - // That would be occupying 17 ints, not 16, so we subtract 1 here... 
- this.normalizeToSize = size - 1; - } - - /** - * method - */ - public int normalize(double percent) { - if (percent > 1 || percent < 0) { - throw new IllegalArgumentException("percent should be less than 1 and greater than 0"); - } - int bucket = (int) (percent * normalizeToSize); - return normalizeToSize - bucket; - } - - public double denormalize(int reverseBucket) { - int bucket = normalizeToSize - reverseBucket; - return bucket / (double) normalizeToSize; - } - - public static VisibleTokenRatioNormalizer createInstance() { - return new VisibleTokenRatioNormalizer(NORMALIZE_TO_BITS); - } -} diff --git a/src/java/com/twitter/search/common/schema/AnalyzerFactory.java b/src/java/com/twitter/search/common/schema/AnalyzerFactory.java deleted file mode 100644 index 36da161f4..000000000 --- a/src/java/com/twitter/search/common/schema/AnalyzerFactory.java +++ /dev/null @@ -1,142 +0,0 @@ -package com.twitter.search.common.schema; - -import java.io.Reader; -import java.text.ParseException; -import java.util.Map; - -import com.google.common.base.Splitter; -import com.google.common.collect.Lists; -import com.google.common.collect.Sets; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.CharArraySet; -import org.apache.lucene.analysis.CharFilter; -import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.Tokenizer; -import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter; -import org.apache.lucene.analysis.core.WhitespaceAnalyzer; -import org.apache.lucene.analysis.fa.PersianCharFilter; -import org.apache.lucene.analysis.standard.StandardAnalyzer; -import org.apache.lucene.util.Version; - -import com.twitter.search.common.schema.thriftjava.ThriftAnalyzer; -import com.twitter.search.common.schema.thriftjava.ThriftClassInstantiater; -import com.twitter.search.common.schema.thriftjava.ThriftCustomAnalyzer; - -public class AnalyzerFactory { - private static final Logger LOG = LoggerFactory.getLogger(AnalyzerFactory.class); - - private static final String MATCH_VERSION_ARG_NAME = "matchVersion"; - private static final String STANDARD_ANALYZER = "StandardAnalyzer"; - private static final String WHITESPACE_ANALYZER = "WhitespaceAnalyzer"; - private static final String SEARCH_WHITESPACE_ANALYZER = "SearchWhitespaceAnalyzer"; - private static final String HTML_STRIP_CHAR_FILTER = "HTMLStripCharFilter"; - private static final String PERSIAN_CHAR_FILTER = "PersianCharFilter"; - - /** - * Return a Lucene Analyzer based on the given ThriftAnalyzer. - */ - public Analyzer getAnalyzer(ThriftAnalyzer analyzer) { - if (analyzer.isSetAnalyzer()) { - return resolveAnalyzerClass(analyzer.getAnalyzer()); - } else if (analyzer.isSetCustomAnalyzer()) { - return buildCustomAnalyzer(analyzer.getCustomAnalyzer()); - } - return new SearchWhitespaceAnalyzer(); - } - - private Analyzer resolveAnalyzerClass(ThriftClassInstantiater classDef) { - Map params = classDef.getParams(); - Version matchVersion = Version.LUCENE_8_5_2; - - String matchVersionName = getArg(params, MATCH_VERSION_ARG_NAME); - if (matchVersionName != null) { - try { - matchVersion = Version.parse(matchVersionName); - } catch (ParseException e) { - // ignore and use default version - LOG.warn("Unable to parse match version: " + matchVersionName - + ". 
Will use default version of 8.5.2."); - } - } - - if (classDef.getClassName().equals(STANDARD_ANALYZER)) { - String stopwords = getArg(params, "stopwords"); - if (stopwords != null) { - - CharArraySet stopwordSet = new CharArraySet( - Lists.newLinkedList(Splitter.on(",").split(stopwords)), - false); - return new StandardAnalyzer(stopwordSet); - } else { - return new StandardAnalyzer(); - } - } else if (classDef.getClassName().equals(WHITESPACE_ANALYZER)) { - return new WhitespaceAnalyzer(); - } else if (classDef.getClassName().equals(SEARCH_WHITESPACE_ANALYZER)) { - return new SearchWhitespaceAnalyzer(); - } - - return null; - } - - private Analyzer buildCustomAnalyzer(final ThriftCustomAnalyzer customAnalyzer) { - return new Analyzer() { - @Override - protected TokenStreamComponents createComponents(String fieldName) { - final Tokenizer tokenizer = resolveTokenizerClass(customAnalyzer.getTokenizer()); - - TokenStream filter = tokenizer; - - if (customAnalyzer.isSetFilters()) { - for (ThriftClassInstantiater filterClass : customAnalyzer.getFilters()) { - filter = resolveTokenFilterClass(filterClass, filter); - } - } - - return new TokenStreamComponents(tokenizer, filter); - } - }; - } - - private Tokenizer resolveTokenizerClass(ThriftClassInstantiater classDef) { - return null; - } - - private TokenStream resolveTokenFilterClass(ThriftClassInstantiater classDef, TokenStream input) { - return null; - } - - private CharFilter resolveCharFilterClass(ThriftClassInstantiater classDef, Reader input) { - if (classDef.getClassName().equals(HTML_STRIP_CHAR_FILTER)) { - String escapedTags = getArg(classDef.getParams(), "excapedTags"); - if (escapedTags != null) { - return new HTMLStripCharFilter(input, Sets.newHashSet(Splitter.on(",").split(escapedTags))); - } else { - return new HTMLStripCharFilter(input); - } - } else if (classDef.getClassName().equals(PERSIAN_CHAR_FILTER)) { - return new PersianCharFilter(input); - } - - - throw new ClassNotSupportedException("CharFilter", classDef); - } - - private String getArg(Map args, String arg) { - if (args == null) { - return null; - } - - return args.get(arg); - } - - public final class ClassNotSupportedException extends RuntimeException { - private ClassNotSupportedException(String type, ThriftClassInstantiater classDef) { - super(type + " class with name " + classDef.getClassName() + " currently not supported."); - } - } -} diff --git a/src/java/com/twitter/search/common/schema/BUILD b/src/java/com/twitter/search/common/schema/BUILD deleted file mode 100644 index 1eaa7b968..000000000 --- a/src/java/com/twitter/search/common/schema/BUILD +++ /dev/null @@ -1,34 +0,0 @@ -# Library for schema builder and related analysis utilities. 
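A minimal usage sketch of the AnalyzerFactory above, resolving "StandardAnalyzer" with a comma-separated stopword list. This is illustrative only: it assumes the usual Thrift-generated no-arg constructors and chainable setters on ThriftAnalyzer and ThriftClassInstantiater, which are not shown in this diff.

    import com.google.common.collect.ImmutableMap;
    import org.apache.lucene.analysis.Analyzer;
    import com.twitter.search.common.schema.AnalyzerFactory;
    import com.twitter.search.common.schema.thriftjava.ThriftAnalyzer;
    import com.twitter.search.common.schema.thriftjava.ThriftClassInstantiater;

    class AnalyzerFactoryExample {
      static Analyzer standardAnalyzerWithStopwords() {
        // resolveAnalyzerClass() above maps the class name to a Lucene StandardAnalyzer and
        // splits the "stopwords" parameter on commas; an unknown class name yields null.
        ThriftClassInstantiater classDef = new ThriftClassInstantiater()
            .setClassName("StandardAnalyzer")
            .setParams(ImmutableMap.of("stopwords", "the,a,an"));
        return new AnalyzerFactory().getAnalyzer(new ThriftAnalyzer().setAnalyzer(classDef));
      }
    }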
-java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:token-util", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/io", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/thrift:thrift-utils", - "src/thrift/com/twitter/search/common:features-java", - "src/thrift/com/twitter/search/common:schema-java", - ], -) diff --git a/src/java/com/twitter/search/common/schema/DynamicSchema.java b/src/java/com/twitter/search/common/schema/DynamicSchema.java deleted file mode 100644 index ee1063728..000000000 --- a/src/java/com/twitter/search/common/schema/DynamicSchema.java +++ /dev/null @@ -1,214 +0,0 @@ -package com.twitter.search.common.schema; - -import java.util.Collection; -import java.util.Map; -import java.util.concurrent.atomic.AtomicReference; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; -import com.google.common.collect.ImmutableCollection; -import com.google.common.collect.ImmutableMap; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.FieldInfos; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.thriftjava.ThriftAnalyzer; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftFieldConfiguration; - -/** - * A schema implementation that allow minor version increments at run time. - */ -public class DynamicSchema implements Schema { - private static final Logger LOG = LoggerFactory.getLogger(DynamicSchema.class); - - private final AtomicReference schema; - - public DynamicSchema(ImmutableSchema schema) { - this.schema = new AtomicReference<>(schema); - } - - public ImmutableSchemaInterface getSchemaSnapshot() { - return schema.get(); - } - - /** - * Update the schema reference inside this DynamicSchema. 
- */ - public synchronized void updateSchema(ImmutableSchema newSchema) throws SchemaUpdateException { - ImmutableSchema oldSchema = schema.get(); - if (newSchema.getMajorVersionNumber() != oldSchema.getMajorVersionNumber()) { - throw new SchemaUpdateException("Dynamic major version update is not supported."); - } else { - if (newSchema.getMinorVersionNumber() <= oldSchema.getMinorVersionNumber()) { - throw new SchemaUpdateException("Dynamic backward minor version update is not supported."); - } else { - LOG.info("DynamicSchema accepted update. Old version is {}.{}; new version is {}.{}", - oldSchema.getMajorVersionNumber(), - oldSchema.getMinorVersionNumber(), - newSchema.getMajorVersionNumber(), - newSchema.getMinorVersionNumber()); - schema.set(newSchema); - } - } - } - - public static class SchemaUpdateException extends Exception { - public SchemaUpdateException(String message) { - super(message); - } - } - - // The below are all methods in the Schema interface delegated to the underlying ImmutableSchema. - // The below is generated by IntelliJ, and reviewers can stop reviewing this file here. - // If you are adding logic into this class, please do so above this line. - @Override - public FieldInfos getLuceneFieldInfos( - Predicate acceptedFields) { - return schema.get().getLuceneFieldInfos(acceptedFields); - } - - @Override - public FacetsConfig getFacetsConfig() { - return schema.get().getFacetsConfig(); - } - - @Override - public Analyzer getDefaultAnalyzer( - ThriftAnalyzer override) { - return schema.get().getDefaultAnalyzer(override); - } - - @Override - public ImmutableCollection getFieldInfos() { - return schema.get().getFieldInfos(); - } - - @Override - public boolean hasField(int fieldConfigId) { - return schema.get().hasField(fieldConfigId); - } - - @Override - public boolean hasField(String fieldName) { - return schema.get().hasField(fieldName); - } - - @Override - @Nullable - public FieldInfo getFieldInfo(int fieldConfigId) { - return schema.get().getFieldInfo(fieldConfigId); - } - - @Override - @Nullable - public FieldInfo getFieldInfo(String fieldName) { - return schema.get().getFieldInfo(fieldName); - } - - @Override - public String getFieldName(int fieldConfigId) { - return schema.get().getFieldName(fieldConfigId); - } - - @Override - public FieldInfo getFieldInfo(int fieldConfigId, - ThriftFieldConfiguration override) { - return schema.get().getFieldInfo(fieldConfigId, override); - } - - @Override - public int getNumFacetFields() { - return schema.get().getNumFacetFields(); - } - - @Override - public FieldInfo getFacetFieldByFacetName( - String facetName) { - return schema.get().getFacetFieldByFacetName(facetName); - } - - @Override - public FieldInfo getFacetFieldByFieldName( - String fieldName) { - return schema.get().getFacetFieldByFieldName(fieldName); - } - - @Override - public Collection getFacetFields() { - return schema.get().getFacetFields(); - } - - @Override - public Collection getCsfFacetFields() { - return schema.get().getCsfFacetFields(); - } - - @Override - public String getVersionDescription() { - return schema.get().getVersionDescription(); - } - - @Override - public int getMajorVersionNumber() { - return schema.get().getMajorVersionNumber(); - } - - @Override - public int getMinorVersionNumber() { - return schema.get().getMinorVersionNumber(); - } - - @Override - public boolean isVersionOfficial() { - return schema.get().isVersionOfficial(); - } - - @Override - public Map getFieldWeightMap() { - return schema.get().getFieldWeightMap(); - } - - 
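The update rule enforced by updateSchema() above accepts only a forward minor-version bump within the same major version; equal or backward minor versions and any major-version change are rejected. A small illustrative sketch (the helper name and version numbers are hypothetical):

    // Swapping in a new ImmutableSchema at run time.
    static void tryUpdate(DynamicSchema dynamicSchema, ImmutableSchema candidate) {
      try {
        dynamicSchema.updateSchema(candidate);           // e.g. 21.3 -> 21.4: accepted and logged
      } catch (DynamicSchema.SchemaUpdateException e) {
        // e.g. 21.3 -> 22.0 (major bump) or 21.3 -> 21.2 (backward minor): rejected
      }
    }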
@Override - public FeatureConfiguration getFeatureConfigurationByName( - String featureName) { - return schema.get().getFeatureConfigurationByName(featureName); - } - - @Override - public FeatureConfiguration getFeatureConfigurationById(int featureFieldId) { - return Preconditions.checkNotNull(schema.get().getFeatureConfigurationById(featureFieldId)); - } - - @Override - @Nullable - public ThriftCSFType getCSFFieldType( - String fieldName) { - return schema.get().getCSFFieldType(fieldName); - } - - @Override - public ThriftSearchFeatureSchema getSearchFeatureSchema() { - return schema.get().getSearchFeatureSchema(); - } - - @Override - public ImmutableMap getFeatureIdToFeatureConfig() { - return schema.get().getFeatureIdToFeatureConfig(); - } - - @Override - public ImmutableMap getFeatureNameToFeatureConfig() { - return schema.get().getFeatureNameToFeatureConfig(); - } -} diff --git a/src/java/com/twitter/search/common/schema/ImmutableSchema.java b/src/java/com/twitter/search/common/schema/ImmutableSchema.java deleted file mode 100644 index 6285812f0..000000000 --- a/src/java/com/twitter/search/common/schema/ImmutableSchema.java +++ /dev/null @@ -1,904 +0,0 @@ -package com.twitter.search.common.schema; - -import java.io.IOException; -import java.io.ObjectOutputStream; -import java.util.Collection; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.SortedMap; -import java.util.TreeMap; -import java.util.concurrent.atomic.AtomicLong; -import javax.annotation.Nullable; -import javax.annotation.concurrent.Immutable; -import javax.annotation.concurrent.ThreadSafe; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; -import com.google.common.collect.ImmutableCollection; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.ImmutableSortedMap; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.DocValuesType; -import org.apache.lucene.index.FieldInfos; -import org.apache.lucene.index.IndexOptions; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.features.ExternalTweetFeature; -import com.twitter.search.common.features.SearchResultFeature; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaEntry; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaSpecifier; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureType; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftAnalyzer; -import com.twitter.search.common.schema.thriftjava.ThriftCSFFieldSettings; 
-import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftCSFViewSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFacetFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFieldConfiguration; -import com.twitter.search.common.schema.thriftjava.ThriftFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftIndexedFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftSchema; -import com.twitter.search.common.schema.thriftjava.ThriftSearchFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftTokenStreamSerializer; - -/** - * A schema instance that does not change at run time. - */ -@Immutable @ThreadSafe -public class ImmutableSchema implements ImmutableSchemaInterface { - private static final Logger LOG = LoggerFactory.getLogger(ImmutableSchema.class); - private static final ImmutableSet CAN_FACET_ON_CSF_TYPES = - ImmutableSet.builder() - .add(ThriftCSFType.BYTE) - .add(ThriftCSFType.INT) - .add(ThriftCSFType.LONG) - .build(); - - private static final SearchCounter FEATURES_EXISTED_IN_OLD_SCHEMA = - SearchCounter.export("features_existed_in_old_schema"); - - // Currently our index uses 4 bits to store the facet field id. - public static final int MAX_FACET_FIELD_ID = 15; - - public static final String HF_TERM_PAIRS_FIELD = "hf_term_pairs"; - public static final String HF_PHRASE_PAIRS_FIELD = "hf_phrase_pairs"; - - private final ImmutableMap fieldSettingsMapById; - private final ImmutableMap fieldSettingsMapByName; - private final ImmutableMap featureConfigMapByName; - private final ImmutableMap featureConfigMapById; - - @Nullable - private final ThriftAnalyzer defaultAnalyzer; - private final AnalyzerFactory analyzerFactory; - - private final ImmutableMap fieldWeightMap; - private final Map facetNameToFieldMap = Maps.newHashMap(); - private final int numFacetFields; - private final ImmutableSet csfFacetFields; - - // This is the search result feature schema - it has the definition for all the column stride - // view fields. - private final ThriftSearchFeatureSchema searchFeatureSchema; - - private final int majorVersionNumber; - private final int minorVersionNumber; - private final String versionDesc; - private final boolean isVersionOfficial; - - /** - * Construct a Schema instance with the given ThriftSchema and AnalyzerFactory. - */ - public ImmutableSchema(ThriftSchema thriftSchema, - AnalyzerFactory analyzerFactory, - String featureSchemaVersionPrefix) throws SchemaValidationException { - Pair versionPair = parseVersionString(thriftSchema.getVersion()); - this.majorVersionNumber = thriftSchema.getMajorVersionNumber(); - this.minorVersionNumber = thriftSchema.getMinorVersionNumber(); - this.versionDesc = versionPair.getSecond(); - this.isVersionOfficial = thriftSchema.isVersionIsOfficial(); - - this.analyzerFactory = analyzerFactory; - - Map tmpMap = Maps.newLinkedHashMap(); - Set tmpSet = Sets.newHashSet(); - - if (thriftSchema.isSetDefaultAnalyzer()) { - this.defaultAnalyzer = thriftSchema.getDefaultAnalyzer().deepCopy(); - } else { - this.defaultAnalyzer = null; - } - - Map configs = thriftSchema.getFieldConfigs(); - - // Collect all the CSF Views, so that we can later verify that they are appropriately - // configured once we've processed all the other field settings. 
- Map csfViewFields = Maps.newHashMap(); - boolean requiresHfPairFields = false; - boolean hasHfTermPairField = false; - boolean hasHfPhrasePairField = false; - int numFacets = 0; - for (Map.Entry entry : configs.entrySet()) { - int fieldId = entry.getKey(); - - if (tmpMap.containsKey(fieldId)) { - throw new SchemaValidationException("Duplicate field id " + fieldId); - } - - ThriftFieldConfiguration config = entry.getValue(); - FieldInfo fieldInfo = parseThriftFieldSettings(fieldId, config, csfViewFields); - validate(fieldInfo); - if (fieldInfo.getFieldType().isFacetField()) { - if (numFacets > MAX_FACET_FIELD_ID) { - throw new SchemaValidationException( - "Maximum supported facet field ID is: " + MAX_FACET_FIELD_ID); - } - numFacets++; - facetNameToFieldMap.put(fieldInfo.getFieldType().getFacetName(), fieldInfo); - - if (fieldInfo.getFieldType().isUseCSFForFacetCounting()) { - tmpSet.add(fieldInfo); - } - } - - tmpMap.put(fieldId, fieldInfo); - - if (fieldInfo.getFieldType().isIndexHFTermPairs()) { - requiresHfPairFields = true; - } - if (fieldInfo.getName().equals(HF_TERM_PAIRS_FIELD)) { - hasHfTermPairField = true; - } - if (fieldInfo.getName().equals(HF_PHRASE_PAIRS_FIELD)) { - hasHfPhrasePairField = true; - } - } - - this.numFacetFields = numFacets; - this.csfFacetFields = ImmutableSet.copyOf(tmpSet); - - // If any field requires high frequency term/phrase pair fields, make sure they exist - if (requiresHfPairFields) { - if (!hasHfTermPairField || !hasHfPhrasePairField) { - throw new SchemaValidationException( - "High frequency term/phrase pair fields do not exist in the schema."); - } - } - - this.fieldSettingsMapById = ImmutableMap.copyOf(tmpMap); - - Pair, ImmutableMap> - featureConfigMapPair = buildFeatureMaps(csfViewFields); - this.featureConfigMapByName = featureConfigMapPair.getFirst(); - this.featureConfigMapById = featureConfigMapPair.getSecond(); - - for (ThriftFieldConfiguration csfViewField : csfViewFields.values()) { - SchemaBuilder.verifyCSFViewSettings(configs, csfViewField); - } - - ImmutableMap.Builder builder = ImmutableMap.builder(); - - for (FieldInfo info : fieldSettingsMapById.values()) { - info.getFieldType().freeze(); - builder.put(info.getName(), info); - } - this.fieldSettingsMapByName = builder.build(); - - ImmutableMap.Builder fieldWeightMapBuilder = ImmutableMap.builder(); - - for (FieldInfo fi : getFieldInfos()) { - // CSF fields are not searchable. All other fields are. - if (fi.getFieldType().isIndexedField()) { - fieldWeightMapBuilder.put( - fi.getName(), - new FieldWeightDefault( - fi.getFieldType().isTextSearchableByDefault(), - fi.getFieldType().getTextSearchableFieldWeight())); - } - } - - this.fieldWeightMap = fieldWeightMapBuilder.build(); - // Create features with extra Earlybird derived fields, extra fields won't change the version - // but they do change the checksum. - this.searchFeatureSchema = createSearchResultFeatureSchema( - featureSchemaVersionPrefix, fieldSettingsMapByName, featureConfigMapByName); - } - - /** - * Add a set of features to a schema if they don't exist yet, and update the schema checksum. - * if there's conflict, RuntimeException will be thrown. - * Old map won't be touched, a new map will be returned will old and new data combined. 
- */ - public static Map appendToFeatureSchema( - Map oldEntryMap, - Set features) throws SchemaValidationException { - if (oldEntryMap == null) { - throw new SchemaValidationException( - "Cannot append features to schema, the entryMap is null"); - } - // make a copy of the existing map - ImmutableMap.Builder builder = - ImmutableSortedMap.naturalOrder() - .putAll(oldEntryMap); - - for (SearchResultFeature feature : features) { - if (oldEntryMap.containsKey(feature.getId())) { - FEATURES_EXISTED_IN_OLD_SCHEMA.increment(); - } else { - builder.put(feature.getId(), new ThriftSearchFeatureSchemaEntry() - .setFeatureName(feature.getName()) - .setFeatureType(feature.getType())); - } - } - return builder.build(); - } - - /** - * Append external features to create a new schema. - * @param oldSchema The old schema to build on top of - * @param features a list of features to be appended to the schema - * @param versionSuffix the version suffix, if not-null, it will be attached to the end of - * original schema's version. - * @return A new schema object with the appended fields - * @throws SchemaValidationException thrown when the checksum cannot be computed - */ - public static ThriftSearchFeatureSchema appendToCreateNewFeatureSchema( - ThriftSearchFeatureSchema oldSchema, - Set features, - @Nullable String versionSuffix) throws SchemaValidationException { - - ThriftSearchFeatureSchema newSchema = new ThriftSearchFeatureSchema(); - // copy over all the entries plus the new ones - newSchema.setEntries(appendToFeatureSchema(oldSchema.getEntries(), features)); - - ThriftSearchFeatureSchemaSpecifier spec = new ThriftSearchFeatureSchemaSpecifier(); - // the version is directly inherited or with a suffix - Preconditions.checkArgument(versionSuffix == null || !versionSuffix.isEmpty()); - spec.setVersion(versionSuffix == null - ? 
oldSchema.getSchemaSpecifier().getVersion() - : oldSchema.getSchemaSpecifier().getVersion() + versionSuffix); - spec.setChecksum(getChecksum(newSchema.getEntries())); - newSchema.setSchemaSpecifier(spec); - return newSchema; - } - - @Override - public FieldInfos getLuceneFieldInfos(Predicate acceptedFields) { - List acceptedFieldInfos = Lists.newArrayList(); - for (FieldInfo fi : getFieldInfos()) { - if (acceptedFields == null || acceptedFields.apply(fi.getName())) { - acceptedFieldInfos.add(convert(fi.getName(), fi.getFieldId(), fi.getFieldType())); - } - } - return new FieldInfos(acceptedFieldInfos.toArray( - new org.apache.lucene.index.FieldInfo[acceptedFieldInfos.size()])); - } - - private FieldInfo parseThriftFieldSettings(int fieldId, ThriftFieldConfiguration fieldConfig, - Map csfViewFields) - throws SchemaValidationException { - FieldInfo fieldInfo - = new FieldInfo(fieldId, fieldConfig.getFieldName(), new EarlybirdFieldType()); - ThriftFieldSettings fieldSettings = fieldConfig.getSettings(); - - - boolean settingFound = false; - - if (fieldSettings.isSetIndexedFieldSettings()) { - if (fieldSettings.isSetCsfFieldSettings() || fieldSettings.isSetCsfViewSettings()) { - throw new SchemaValidationException("ThriftFieldSettings: Only one of " - + "'indexedFieldSettings', 'csfFieldSettings', 'csfViewSettings' can be set."); - } - - applyIndexedFieldSettings(fieldInfo, fieldSettings.getIndexedFieldSettings()); - settingFound = true; - } - - if (fieldSettings.isSetCsfFieldSettings()) { - if (fieldSettings.isSetIndexedFieldSettings() || fieldSettings.isSetCsfViewSettings()) { - throw new SchemaValidationException("ThriftFieldSettings: Only one of " - + "'indexedFieldSettings', 'csfFieldSettings', 'csfViewSettings' can be set."); - } - - applyCsfFieldSettings(fieldInfo, fieldSettings.getCsfFieldSettings()); - settingFound = true; - } - - if (fieldSettings.isSetFacetFieldSettings()) { - if (!fieldSettings.isSetIndexedFieldSettings() && !(fieldSettings.isSetCsfFieldSettings() - && fieldSettings.getFacetFieldSettings().isUseCSFForFacetCounting() - && CAN_FACET_ON_CSF_TYPES.contains(fieldSettings.getCsfFieldSettings().getCsfType()))) { - throw new SchemaValidationException("ThriftFieldSettings: 'facetFieldSettings' can only be " - + "used in combination with 'indexedFieldSettings' or with 'csfFieldSettings' " - + "where 'isUseCSFForFacetCounting' was set to true and ThriftCSFType is a type that " - + "can be faceted on."); - } - - applyFacetFieldSettings(fieldInfo, fieldSettings.getFacetFieldSettings()); - settingFound = true; - } - - if (fieldSettings.isSetCsfViewSettings()) { - if (fieldSettings.isSetIndexedFieldSettings() || fieldSettings.isSetCsfFieldSettings()) { - throw new SchemaValidationException("ThriftFieldSettings: Only one of " - + "'indexedFieldSettings', 'csfFieldSettings', 'csfViewSettings' can be set."); - } - - // add this field now, but apply settings later to make sure the base field was added properly - // before - csfViewFields.put(fieldId, fieldConfig); - settingFound = true; - } - - if (!settingFound) { - throw new SchemaValidationException("ThriftFieldSettings: One of 'indexedFieldSettings', " - + "'csfFieldSettings' or 'facetFieldSettings' must be set."); - } - - // search field settings are optional - if (fieldSettings.isSetSearchFieldSettings()) { - if (!fieldSettings.isSetIndexedFieldSettings()) { - throw new SchemaValidationException( - "ThriftFieldSettings: 'searchFieldSettings' can only be " - + "used in combination with 'indexedFieldSettings'"); - } - - 
applySearchFieldSettings(fieldInfo, fieldSettings.getSearchFieldSettings()); - } - - return fieldInfo; - } - - private void applyCsfFieldSettings(FieldInfo fieldInfo, ThriftCSFFieldSettings settings) - throws SchemaValidationException { - // csfType is required - no need to check if it's set - fieldInfo.getFieldType().setDocValuesType(DocValuesType.NUMERIC); - fieldInfo.getFieldType().setCsfType(settings.getCsfType()); - - if (settings.isVariableLength()) { - fieldInfo.getFieldType().setDocValuesType(DocValuesType.BINARY); - fieldInfo.getFieldType().setCsfVariableLength(); - } else { - if (settings.isSetFixedLengthSettings()) { - fieldInfo.getFieldType().setCsfFixedLengthSettings( - settings.getFixedLengthSettings().getNumValuesPerDoc(), - settings.getFixedLengthSettings().isUpdateable()); - if (settings.getFixedLengthSettings().getNumValuesPerDoc() > 1) { - fieldInfo.getFieldType().setDocValuesType(DocValuesType.BINARY); - } - } else { - throw new SchemaValidationException( - "ThriftCSFFieldSettings: Either variableLength should be set to 'true', " - + "or fixedLengthSettings should be set."); - } - } - - fieldInfo.getFieldType().setCsfLoadIntoRam(settings.isLoadIntoRAM()); - if (settings.isSetDefaultValue()) { - fieldInfo.getFieldType().setCsfDefaultValue(settings.getDefaultValue()); - } - } - - private void applyCsfViewFieldSettings(FieldInfo fieldInfo, FieldInfo baseField, - ThriftCSFViewSettings settings) - throws SchemaValidationException { - // csfType is required - no need to check if it's set - fieldInfo.getFieldType().setDocValuesType(DocValuesType.NUMERIC); - fieldInfo.getFieldType().setCsfType(settings.getCsfType()); - - fieldInfo.getFieldType().setCsfFixedLengthSettings(1 /* numValuesPerDoc*/, - false /* updateable*/); - - fieldInfo.getFieldType().setCsfViewSettings(fieldInfo.getName(), settings, baseField); - } - - private void applyFacetFieldSettings(FieldInfo fieldInfo, ThriftFacetFieldSettings settings) { - if (settings.isSetFacetName()) { - fieldInfo.getFieldType().setFacetName(settings.getFacetName()); - } else { - // fall back to field name if no facet name is explicitly provided - fieldInfo.getFieldType().setFacetName(fieldInfo.getName()); - } - fieldInfo.getFieldType().setStoreFacetSkiplist(settings.isStoreSkiplist()); - fieldInfo.getFieldType().setStoreFacetOffensiveCounters(settings.isStoreOffensiveCounters()); - fieldInfo.getFieldType().setUseCSFForFacetCounting(settings.isUseCSFForFacetCounting()); - } - - private void applyIndexedFieldSettings(FieldInfo fieldInfo, ThriftIndexedFieldSettings settings) - throws SchemaValidationException { - fieldInfo.getFieldType().setIndexedField(true); - fieldInfo.getFieldType().setStored(settings.isStored()); - fieldInfo.getFieldType().setTokenized(settings.isTokenized()); - fieldInfo.getFieldType().setStoreTermVectors(settings.isStoreTermVectors()); - fieldInfo.getFieldType().setStoreTermVectorOffsets(settings.isStoreTermVectorOffsets()); - fieldInfo.getFieldType().setStoreTermVectorPositions(settings.isStoreTermVectorPositions()); - fieldInfo.getFieldType().setStoreTermVectorPayloads(settings.isStoreTermVectorPayloads()); - fieldInfo.getFieldType().setOmitNorms(settings.isOmitNorms()); - fieldInfo.getFieldType().setIndexHFTermPairs(settings.isIndexHighFreqTermPairs()); - fieldInfo.getFieldType().setUseTweetSpecificNormalization( - settings.deprecated_performTweetSpecificNormalizations); - - if (settings.isSetIndexOptions()) { - switch (settings.getIndexOptions()) { - case DOCS_ONLY : - 
fieldInfo.getFieldType().setIndexOptions(IndexOptions.DOCS); - break; - case DOCS_AND_FREQS : - fieldInfo.getFieldType().setIndexOptions(IndexOptions.DOCS_AND_FREQS); - break; - case DOCS_AND_FREQS_AND_POSITIONS : - fieldInfo.getFieldType().setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); - break; - case DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS : - fieldInfo.getFieldType().setIndexOptions( - IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); - break; - default: - throw new SchemaValidationException("Unknown value for IndexOptions: " - + settings.getIndexOptions()); - } - } else if (settings.isIndexed()) { - // default for backward-compatibility - fieldInfo.getFieldType().setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); - } - - fieldInfo.getFieldType().setStorePerPositionPayloads(settings.isStorePerPositionPayloads()); - fieldInfo.getFieldType().setDefaultPayloadLength( - settings.getDefaultPerPositionPayloadLength()); - fieldInfo.getFieldType().setBecomesImmutable(!settings.isSupportOutOfOrderAppends()); - fieldInfo.getFieldType().setSupportOrderedTerms(settings.isSupportOrderedTerms()); - fieldInfo.getFieldType().setSupportTermTextLookup(settings.isSupportTermTextLookup()); - - if (settings.isSetNumericFieldSettings()) { - fieldInfo.getFieldType().setNumericFieldSettings( - new IndexedNumericFieldSettings(settings.getNumericFieldSettings())); - } - - if (settings.isSetTokenStreamSerializer()) { - fieldInfo.getFieldType().setTokenStreamSerializerBuilder( - buildTokenStreamSerializerProvider(settings.getTokenStreamSerializer())); - } - } - - private void applySearchFieldSettings(FieldInfo fieldInfo, ThriftSearchFieldSettings settings) - throws SchemaValidationException { - fieldInfo.getFieldType().setTextSearchableFieldWeight( - (float) settings.getTextSearchableFieldWeight()); - fieldInfo.getFieldType().setTextSearchableByDefault(settings.isTextDefaultSearchable()); - } - - private void validate(FieldInfo fieldInfo) throws SchemaValidationException { - } - - private TokenStreamSerializer.Builder buildTokenStreamSerializerProvider( - final ThriftTokenStreamSerializer settings) { - TokenStreamSerializer.Builder builder = TokenStreamSerializer.builder(); - for (String serializerName : settings.getAttributeSerializerClassNames()) { - try { - builder.add((TokenStreamSerializer.AttributeSerializer) Class.forName(serializerName) - .newInstance()); - } catch (InstantiationException e) { - throw new RuntimeException( - "Unable to instantiate AttributeSerializer for name " + serializerName); - } catch (IllegalAccessException e) { - throw new RuntimeException( - "Unable to instantiate AttributeSerializer for name " + serializerName); - } catch (ClassNotFoundException e) { - throw new RuntimeException( - "Unable to instantiate AttributeSerializer for name " + serializerName); - } - } - return builder; - } - - @Override - public FacetsConfig getFacetsConfig() { - FacetsConfig facetsConfig = new FacetsConfig(); - - for (String facetName : facetNameToFieldMap.keySet()) { - // set multiValued = true as default, since we're using SortedSetDocValues facet, in which, - // there is no difference between multiValued true or false for the real facet, but only the - // checking of the values. 
- facetsConfig.setMultiValued(facetName, true); - } - - return facetsConfig; - } - - @Override - public Analyzer getDefaultAnalyzer(ThriftAnalyzer override) { - if (override != null) { - return analyzerFactory.getAnalyzer(override); - } - - if (defaultAnalyzer != null) { - return analyzerFactory.getAnalyzer(defaultAnalyzer); - } - - return new SearchWhitespaceAnalyzer(); - } - - @Override - public ImmutableCollection getFieldInfos() { - return fieldSettingsMapById.values(); - } - - /** - * This is the preferred method to check whether a field configuration is in schema. - * One can also use getFieldInfo and do null checks, but should be careful about excessive - * warning logging resulting from looking up fields not in schema. - */ - @Override - public boolean hasField(int fieldConfigId) { - return fieldSettingsMapById.containsKey(fieldConfigId); - } - - /** - * This is the preferred method to check whether a field configuration is in schema. - * One can also use getFieldInfo and do null checks, but should be careful about excessive - * warning logging resulting from looking up fields not in schema. - */ - @Override - public boolean hasField(String fieldName) { - return fieldSettingsMapByName.containsKey(fieldName); - } - - /** - * Get FieldInfo for the given field id. - * If the goal is to check whether a field is in the schema, use {@link #hasField(int)} instead. - * This method logs a warning whenever it returns null. - */ - @Override - @Nullable - public FieldInfo getFieldInfo(int fieldConfigId) { - return getFieldInfo(fieldConfigId, null); - } - - private org.apache.lucene.index.FieldInfo convert(String fieldName, - int index, - EarlybirdFieldType type) { - return new org.apache.lucene.index.FieldInfo( - fieldName, // String name - index, // int number - type.storeTermVectors(), // boolean storeTermVector - type.omitNorms(), // boolean omitNorms - type.isStorePerPositionPayloads(), // boolean storePayloads - type.indexOptions(), // IndexOptions indexOptions - type.docValuesType(), // DocValuesType docValues - -1, // long dvGen - Maps.newHashMap(), // Map attributes - 0, // int pointDataDimensionCount - 0, // int pointIndexDimensionCount - 0, // int pointNumBytes - false); // boolean softDeletesField - } - - /** - * Get FieldInfo for the given field name, or null if the field does not exist. - */ - @Override - @Nullable - public FieldInfo getFieldInfo(String fieldName) { - return fieldSettingsMapByName.get(fieldName); - } - - @Override - public String getFieldName(int fieldConfigId) { - FieldInfo fieldInfo = fieldSettingsMapById.get(fieldConfigId); - return fieldInfo != null ? fieldInfo.getName() : null; - } - - @Override - public FieldInfo getFieldInfo(int fieldConfigId, ThriftFieldConfiguration override) { - FieldInfo fieldInfo = fieldSettingsMapById.get(fieldConfigId); - if (fieldInfo == null) { - // This method is used to check the availability of fields by IDs, - // so no warning is logged here (would be too verbose otherwise). 
- return null; - } - - if (override != null) { - try { - return merge(fieldConfigId, fieldInfo, override); - } catch (SchemaValidationException e) { - throw new RuntimeException(e); - } - } - - return fieldInfo; - } - - @Override - public int getNumFacetFields() { - return numFacetFields; - } - - @Override - public FieldInfo getFacetFieldByFacetName(String facetName) { - return facetNameToFieldMap.get(facetName); - } - - @Override - public FieldInfo getFacetFieldByFieldName(String fieldName) { - FieldInfo fieldInfo = getFieldInfo(fieldName); - return fieldInfo != null && fieldInfo.getFieldType().isFacetField() ? fieldInfo : null; - } - - @Override - public Collection getFacetFields() { - return facetNameToFieldMap.values(); - } - - @Override - public Collection getCsfFacetFields() { - return csfFacetFields; - } - - @Override - public String getVersionDescription() { - return versionDesc; - } - - @Override - public int getMajorVersionNumber() { - return majorVersionNumber; - } - - @Override - public int getMinorVersionNumber() { - return minorVersionNumber; - } - - @Override - public boolean isVersionOfficial() { - return isVersionOfficial; - } - - /** - * Parses a version string like "16: renamed field x into y" into a version number and - * a string description. - * @return a Pair of the version number and the description - */ - private static Pair parseVersionString(String version) - throws SchemaValidationException { - Preconditions.checkNotNull(version, "Schema must have a version number and description."); - int colonIndex = version.indexOf(':'); - if (colonIndex == -1) { - throw new SchemaValidationException("Malformed version string: " + version); - } - try { - int versionNumber = Integer.parseInt(version.substring(0, colonIndex)); - String versionDesc = version.substring(colonIndex + 1); - return Pair.of(versionNumber, versionDesc); - } catch (Exception e) { - throw new SchemaValidationException("Malformed version string: " + version, e); - } - } - - @Override - public Map getFieldWeightMap() { - return fieldWeightMap; - } - - /** - * Build the feature maps so that we can use feature name to get the feature configuration. - * @return: an immutable map keyed on fieldName. 
- */ - private Pair, - ImmutableMap> buildFeatureMaps( - final Map csvViewFields) - throws SchemaValidationException { - - final ImmutableMap.Builder featureConfigMapByNameBuilder = - ImmutableMap.builder(); - final ImmutableMap.Builder featureConfigMapByIdBuilder = - ImmutableMap.builder(); - - for (final Map.Entry entry : csvViewFields.entrySet()) { - ThriftFieldSettings fieldSettings = entry.getValue().getSettings(); - FieldInfo fieldInfo = getFieldInfo(entry.getKey()); - FieldInfo baseFieldInfo = - getFieldInfo(fieldSettings.getCsfViewSettings().getBaseFieldConfigId()); - if (baseFieldInfo == null) { - throw new SchemaValidationException("Base field (id=" - + fieldSettings.getCsfViewSettings().getBaseFieldConfigId() + ") not found."); - } - applyCsfViewFieldSettings(fieldInfo, baseFieldInfo, fieldSettings.getCsfViewSettings()); - - FeatureConfiguration featureConfig = fieldInfo.getFieldType() - .getCsfViewFeatureConfiguration(); - if (featureConfig != null) { - featureConfigMapByNameBuilder.put(fieldInfo.getName(), featureConfig); - featureConfigMapByIdBuilder.put(fieldInfo.getFieldId(), featureConfig); - } - } - - return Pair.of(featureConfigMapByNameBuilder.build(), featureConfigMapByIdBuilder.build()); - } - - @Override - public FeatureConfiguration getFeatureConfigurationByName(String featureName) { - return featureConfigMapByName.get(featureName); - } - - @Override - public FeatureConfiguration getFeatureConfigurationById(int featureFieldId) { - return Preconditions.checkNotNull(featureConfigMapById.get(featureFieldId), - "Field ID: " + featureFieldId); - } - - @Override - @Nullable - public ThriftCSFType getCSFFieldType(String fieldName) { - FieldInfo fieldInfo = getFieldInfo(fieldName); - if (fieldInfo == null) { - return null; - } - - EarlybirdFieldType fieldType = fieldInfo.getFieldType(); - if (fieldType.docValuesType() != org.apache.lucene.index.DocValuesType.NUMERIC) { - return null; - } - - return fieldType.getCsfType(); - } - - @Override - public ImmutableSchemaInterface getSchemaSnapshot() { - return this; - } - - private FieldInfo merge(int fieldConfigId, - FieldInfo fieldInfo, - ThriftFieldConfiguration overrideConfig) - throws SchemaValidationException { - - throw new UnsupportedOperationException("Field override config not supported"); - } - - @Override - public ThriftSearchFeatureSchema getSearchFeatureSchema() { - return searchFeatureSchema; - } - - @Override - public ImmutableMap getFeatureIdToFeatureConfig() { - return featureConfigMapById; - } - - @Override - public ImmutableMap getFeatureNameToFeatureConfig() { - return featureConfigMapByName; - } - - private ThriftSearchFeatureSchema createSearchResultFeatureSchema( - String featureSchemaVersionPrefix, - Map allFieldSettings, - Map featureConfigurations) throws SchemaValidationException { - final ImmutableMap.Builder builder = - new ImmutableMap.Builder<>(); - - for (Map.Entry field : allFieldSettings.entrySet()) { - FeatureConfiguration featureConfig = featureConfigurations.get(field.getKey()); - if (featureConfig == null) { - // This is either a not csf related field or a csf field. - continue; - } - - // This is a csfView field. 
- if (featureConfig.getOutputType() == null) { - LOG.info("Skip unused fieldschemas: {} for search feature schema.", field.getKey()); - continue; - } - - ThriftSearchFeatureType featureType = getResultFeatureType(featureConfig.getOutputType()); - if (featureType != null) { - builder.put( - field.getValue().getFieldId(), - new ThriftSearchFeatureSchemaEntry(field.getKey(), featureType)); - } else { - LOG.error("Invalid CSFType encountered for csf field: {}", field.getKey()); - } - } - Map indexOnlySchemaEntries = builder.build(); - - // Add earlybird derived features, they are defined in ExternalTweetFeatures and used in the - // scoring function. They are no different from those auto-generated index-based features - // viewed from outside Earlybird. - Map entriesWithEBFeatures = - appendToFeatureSchema( - indexOnlySchemaEntries, ExternalTweetFeature.EARLYBIRD_DERIVED_FEATURES); - - // Add other features needed for tweet ranking from EarlybirdRankingDerivedFeature. - Map allSchemaEntries = appendToFeatureSchema( - entriesWithEBFeatures, ExternalTweetFeature.EARLYBIRD_RANKING_DERIVED_FEATURES); - - long schemaEntriesChecksum = getChecksum(allSchemaEntries); - SearchLongGauge.export("feature_schema_checksum", new AtomicLong(schemaEntriesChecksum)); - - String schemaVersion = String.format( - "%s.%d.%d", featureSchemaVersionPrefix, majorVersionNumber, minorVersionNumber); - ThriftSearchFeatureSchemaSpecifier schemaSpecifier = - new ThriftSearchFeatureSchemaSpecifier(schemaVersion, schemaEntriesChecksum); - - ThriftSearchFeatureSchema schema = new ThriftSearchFeatureSchema(); - schema.setSchemaSpecifier(schemaSpecifier); - schema.setEntries(allSchemaEntries); - - return schema; - } - - // Serializes schemaEntries to a byte array, and computes a CRC32 checksum of the array. - // The serialization needs to be stable: if schemaEntries1.equals(schemaEntries2), we want - // this method to produce the same checksum for schemaEntrie1 and schemaEntrie2, even if - // the checksums are computed in different JVMs, etc. - private static long getChecksum(Map schemaEntries) - throws SchemaValidationException { - SortedMap sortedSchemaEntries = - new TreeMap(schemaEntries); - - CRC32OutputStream crc32OutputStream = new CRC32OutputStream(); - ObjectOutputStream objectOutputStream = null; - try { - objectOutputStream = new ObjectOutputStream(crc32OutputStream); - for (Integer fieldId : sortedSchemaEntries.keySet()) { - objectOutputStream.writeObject(fieldId); - ThriftSearchFeatureSchemaEntry schemaEntry = sortedSchemaEntries.get(fieldId); - objectOutputStream.writeObject(schemaEntry.getFeatureName()); - objectOutputStream.writeObject(schemaEntry.getFeatureType()); - } - objectOutputStream.flush(); - return crc32OutputStream.getValue(); - } catch (IOException e) { - throw new SchemaValidationException("Could not serialize feature schema entries.", e); - } finally { - Preconditions.checkNotNull(objectOutputStream); - try { - objectOutputStream.close(); - } catch (IOException e) { - throw new SchemaValidationException("Could not close ObjectOutputStream.", e); - } - } - } - - /** - * Get the search feature type based on the csf type. 
- * @param csfType the column stride field type for the data - * @return the corresponding search feature type - */ - @VisibleForTesting - public static ThriftSearchFeatureType getResultFeatureType(ThriftCSFType csfType) { - switch (csfType) { - case INT: - case BYTE: - return ThriftSearchFeatureType.INT32_VALUE; - case BOOLEAN: - return ThriftSearchFeatureType.BOOLEAN_VALUE; - case FLOAT: - case DOUBLE: - return ThriftSearchFeatureType.DOUBLE_VALUE; - case LONG: - return ThriftSearchFeatureType.LONG_VALUE; - default: - return null; - } - } -} diff --git a/src/java/com/twitter/search/common/schema/NumericField.java b/src/java/com/twitter/search/common/schema/NumericField.java deleted file mode 100644 index c6c528d55..000000000 --- a/src/java/com/twitter/search/common/schema/NumericField.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.common.schema; - -import org.apache.lucene.document.Field; -import org.apache.lucene.document.FieldType; -import org.apache.lucene.index.IndexOptions; - -/** - * A Lucene numeric field, similar to the LegacyIntField, LegacyLongField, etc. Lucene classes that - * were removed in Lucene 7.0.0. - */ -public final class NumericField extends Field { - private static final FieldType NUMERIC_FIELD_TYPE = new FieldType(); - static { - NUMERIC_FIELD_TYPE.setTokenized(true); - NUMERIC_FIELD_TYPE.setOmitNorms(true); - NUMERIC_FIELD_TYPE.setIndexOptions(IndexOptions.DOCS); - NUMERIC_FIELD_TYPE.freeze(); - } - - /** - * Creates a new integer field with the given name and value. - */ - public static NumericField newIntField(String fieldName, int value) { - NumericField field = new NumericField(fieldName); - field.fieldsData = Integer.valueOf(value); - return field; - } - - /** - * Creates a new long field with the given name and value. - */ - public static NumericField newLongField(String fieldName, long value) { - NumericField field = new NumericField(fieldName); - field.fieldsData = Long.valueOf(value); - return field; - } - - // We could replace the static methods with constructors, but I think that would make it much - // easier to accidentally use NumericField(String, int) instead of NumericField(String, long), - // for example, leading to hard to debug errors. 
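A brief illustrative use of the two factory methods above (field names are made up): the explicit method names make the intended numeric type obvious at the call site, which is exactly the int/long ambiguity the comment above warns about.

    NumericField likeCount = NumericField.newIntField("favorite_count", 42);        // boxed as Integer
    NumericField statusId  = NumericField.newLongField("tweet_id", 1234567890123L); // boxed as Long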
- private NumericField(String fieldName) { - super(fieldName, NUMERIC_FIELD_TYPE); - } -} diff --git a/src/java/com/twitter/search/common/schema/SchemaBuilder.java b/src/java/com/twitter/search/common/schema/SchemaBuilder.java deleted file mode 100644 index e12da2a65..000000000 --- a/src/java/com/twitter/search/common/schema/SchemaBuilder.java +++ /dev/null @@ -1,693 +0,0 @@ -package com.twitter.search.common.schema; - -import java.util.Map; -import java.util.Set; -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Sets; - -import com.twitter.common.text.util.CharSequenceTermAttributeSerializer; -import com.twitter.common.text.util.PositionIncrementAttributeSerializer; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.common.text.util.TokenTypeAttributeSerializer; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.thriftjava.ThriftCSFFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftCSFViewSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFacetFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureNormalizationType; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureUpdateConstraint; -import com.twitter.search.common.schema.thriftjava.ThriftFieldConfiguration; -import com.twitter.search.common.schema.thriftjava.ThriftFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFixedLengthCSFSettings; -import com.twitter.search.common.schema.thriftjava.ThriftIndexOptions; -import com.twitter.search.common.schema.thriftjava.ThriftIndexedFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftIndexedNumericFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftNumericType; -import com.twitter.search.common.schema.thriftjava.ThriftSchema; -import com.twitter.search.common.schema.thriftjava.ThriftSearchFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftTokenStreamSerializer; -import com.twitter.search.common.util.analysis.CharTermAttributeSerializer; -import com.twitter.search.common.util.analysis.IntTermAttributeSerializer; -import com.twitter.search.common.util.analysis.LongTermAttributeSerializer; -import com.twitter.search.common.util.analysis.PayloadAttributeSerializer; - -public class SchemaBuilder { - - public static final String CSF_VIEW_NAME_SEPARATOR = "."; - protected final ThriftSchema schema = new ThriftSchema(); - protected final FieldNameToIdMapping idMapping; - protected final int tokenStreamSerializerVersion; - - // As of now, we do not allow two fields to share the same field name. - // This set is used to perform this check. - private final Set fieldNameSet = Sets.newHashSet(); - - /** - * Construct a schema builder with the given FieldNameToIdMapper. - * A SchemaBuilder is used to build a ThriftSchema incrementally. 
- */ - public SchemaBuilder(FieldNameToIdMapping idMapping, - TokenStreamSerializer.Version tokenStreamSerializerVersion) { - this.idMapping = idMapping; - Preconditions.checkArgument( - tokenStreamSerializerVersion == TokenStreamSerializer.Version.VERSION_2); - this.tokenStreamSerializerVersion = tokenStreamSerializerVersion.ordinal(); - } - - /** - * Build ThriftSchema using settings accumulated so far. - */ - public final ThriftSchema build() { - return schema; - } - - /** - * Uses fieldName also as facetName. - */ - public final SchemaBuilder withFacetConfigs(String fieldName, - boolean storeSkipList, - boolean storeOffensiveCounters, - boolean useCSFForFacetCounting) { - return withFacetConfigs( - fieldName, - fieldName, - storeSkipList, - storeOffensiveCounters, - useCSFForFacetCounting); - } - - /** - * Add facet field configuration. - */ - public final SchemaBuilder withFacetConfigs(String fieldName, - String facetName, - boolean storeSkipList, - boolean storeOffensiveCounters, - boolean useCSFForFacetCounting) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFacetFieldSettings facetSettings = new ThriftFacetFieldSettings(); - // As of now, all our facet names are the same as field names - facetSettings.setFacetName(facetName); - facetSettings.setStoreSkiplist(storeSkipList); - facetSettings.setStoreOffensiveCounters(storeOffensiveCounters); - facetSettings.setUseCSFForFacetCounting(useCSFForFacetCounting); - - int fieldId = idMapping.getFieldID(fieldName); - ThriftFieldConfiguration fieldConfiguration = schema.getFieldConfigs().get(fieldId); - Preconditions.checkNotNull(fieldConfiguration, - "In Earlybird, a facet field must be indexed. " - + "No ThriftIndexedFieldSettings found for field " + fieldName); - fieldConfiguration.getSettings().setFacetFieldSettings(facetSettings); - return this; - } - - /** - * Configure the given field ID to be used for partitioning. - */ - public final SchemaBuilder withPartitionFieldId(int partitionFieldId) { - schema.setPartitionFieldId(partitionFieldId); - return this; - } - - /** - * Add a column stride field into schema. - */ - public final SchemaBuilder withColumnStrideField(String fieldName, - ThriftCSFType type, - int numValuesPerDoc, - boolean updatable, - boolean loadIntoRam) { - return withColumnStrideField(fieldName, type, numValuesPerDoc, updatable, loadIntoRam, null); - } - - /** - * Add a column stride field into schema that is variable length. - */ - public final SchemaBuilder withBinaryColumnStrideField(String fieldName, - boolean loadIntoRam) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftCSFFieldSettings csfFieldSettings = new ThriftCSFFieldSettings(); - csfFieldSettings.setCsfType(ThriftCSFType.BYTE) - .setVariableLength(true) - .setLoadIntoRAM(loadIntoRam); - - ThriftFieldSettings fieldSettings = - new ThriftFieldSettings().setCsfFieldSettings(csfFieldSettings); - ThriftFieldConfiguration fieldConf = - new ThriftFieldConfiguration(fieldName).setSettings(fieldSettings); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), fieldConf); - return this; - } - - /** - * Add a column stride field into schema which has a default value. 
- */ - public final SchemaBuilder withColumnStrideField(String fieldName, - ThriftCSFType type, - int numValuesPerDoc, - boolean updatable, - boolean loadIntoRam, - Long defaultValue) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftCSFFieldSettings csfFieldSettings = new ThriftCSFFieldSettings(); - csfFieldSettings.setCsfType(type) - .setVariableLength(false) - .setFixedLengthSettings( - new ThriftFixedLengthCSFSettings() - .setNumValuesPerDoc(numValuesPerDoc) - .setUpdateable(updatable)) - .setLoadIntoRAM(loadIntoRam); - - if (defaultValue != null) { - csfFieldSettings.setDefaultValue(defaultValue); - } - - ThriftFieldSettings fieldSettings = - new ThriftFieldSettings().setCsfFieldSettings(csfFieldSettings); - ThriftFieldConfiguration fieldConf = - new ThriftFieldConfiguration(fieldName).setSettings(fieldSettings); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), fieldConf); - return this; - } - - /** - * Add a CSF view into schema. A view is a portion of another CSF. - */ - public final SchemaBuilder withColumnStrideFieldView( - String fieldName, - ThriftCSFType csfType, - ThriftCSFType outputCSFType, - String baseFieldName, - int valueIndex, - int bitStartPosition, - int bitLength, - ThriftFeatureNormalizationType featureNormalizationType, - @Nullable Set constraints) { - if (!shouldIncludeField(fieldName)) { - return this; - } - - int baseFieldConfigID = idMapping.getFieldID(baseFieldName); - - ThriftCSFViewSettings csfViewSettings = new ThriftCSFViewSettings() - .setBaseFieldConfigId(baseFieldConfigID) - .setCsfType(csfType) - .setValueIndex(valueIndex) - .setBitStartPosition(bitStartPosition) - .setBitLength(bitLength); - if (outputCSFType != null) { - csfViewSettings.setOutputCSFType(outputCSFType); - } - if (featureNormalizationType != ThriftFeatureNormalizationType.NONE) { - csfViewSettings.setNormalizationType(featureNormalizationType); - } - if (constraints != null) { - csfViewSettings.setFeatureUpdateConstraints(constraints); - } - ThriftFieldSettings fieldSettings = new ThriftFieldSettings() - .setCsfViewSettings(csfViewSettings); - ThriftFieldConfiguration fieldConf = new ThriftFieldConfiguration(fieldName) - .setSettings(fieldSettings); - - Map fieldConfigs = schema.getFieldConfigs(); - verifyCSFViewSettings(fieldConfigs, fieldConf); - - putIntoFieldConfigs(idMapping.getFieldID(fieldName), fieldConf); - return this; - } - - /** - * Sanity checks for CSF view settings. 
- */ - public static void verifyCSFViewSettings(Map fieldConfigs, - ThriftFieldConfiguration fieldConf) { - Preconditions.checkNotNull(fieldConf.getSettings()); - Preconditions.checkNotNull(fieldConf.getSettings().getCsfViewSettings()); - ThriftCSFViewSettings csfViewSettings = fieldConf.getSettings().getCsfViewSettings(); - - if (fieldConfigs != null) { - ThriftFieldConfiguration baseFieldConfig = fieldConfigs.get( - csfViewSettings.getBaseFieldConfigId()); - if (baseFieldConfig != null) { - String baseFieldName = baseFieldConfig.getFieldName(); - String expectedViewNamePrefix = baseFieldName + CSF_VIEW_NAME_SEPARATOR; - if (fieldConf.getFieldName().startsWith(expectedViewNamePrefix)) { - ThriftFieldSettings baseFieldSettings = baseFieldConfig.getSettings(); - ThriftCSFFieldSettings baseFieldCSFSettings = baseFieldSettings.getCsfFieldSettings(); - - if (baseFieldCSFSettings != null) { - if (!baseFieldCSFSettings.isVariableLength() - && baseFieldCSFSettings.getFixedLengthSettings() != null) { - - ThriftCSFType baseCSFType = baseFieldCSFSettings.getCsfType(); - switch (baseCSFType) { - case BYTE: - checkCSFViewPositions(baseFieldCSFSettings, 8, csfViewSettings); - break; - case INT: - checkCSFViewPositions(baseFieldCSFSettings, 32, csfViewSettings); - break; - default: - throw new IllegalStateException("Base field: " + baseFieldName - + " is of a non-supported CSFType: " + baseCSFType); - } - } else { - throw new IllegalStateException("Base field: " + baseFieldName - + " must be a fixed-length CSF field"); - } - } else { - throw new IllegalStateException("Base field: " + baseFieldName + " is not a CSF field"); - } - } else { - throw new IllegalStateException("View field name for baseFieldConfigID: " - + csfViewSettings.getBaseFieldConfigId() + " must start with: '" - + expectedViewNamePrefix + "'"); - } - } else { - throw new IllegalStateException("Can't add a view, no field defined for base fieldID: " - + csfViewSettings.getBaseFieldConfigId()); - } - } else { - throw new IllegalStateException("Can't add a view, no field configs defined."); - } - } - - private static void checkCSFViewPositions(ThriftCSFFieldSettings baseFieldCSFSettings, - int bitsPerValue, - ThriftCSFViewSettings csfViewSettings) { - ThriftFixedLengthCSFSettings fixedLengthCSFSettings = - baseFieldCSFSettings.getFixedLengthSettings(); - Preconditions.checkNotNull(fixedLengthCSFSettings); - - int numValues = fixedLengthCSFSettings.getNumValuesPerDoc(); - Preconditions.checkState(csfViewSettings.getValueIndex() >= 0, - "value index must be positive: " + csfViewSettings.getValueIndex()); - Preconditions.checkState(csfViewSettings.getValueIndex() < numValues, "value index " - + csfViewSettings.getValueIndex() + " must be less than numValues: " + numValues); - - Preconditions.checkState(csfViewSettings.getBitStartPosition() >= 0, - "bitStartPosition must be positive: " + csfViewSettings.getBitStartPosition()); - Preconditions.checkState(csfViewSettings.getBitStartPosition() < bitsPerValue, - "bitStartPosition " + csfViewSettings.getBitStartPosition() - + " must be less than bitsPerValue " + bitsPerValue); - - Preconditions.checkState(csfViewSettings.getBitLength() >= 1, - "bitLength must be positive: " + csfViewSettings.getBitLength()); - - Preconditions.checkState( - csfViewSettings.getBitStartPosition() + csfViewSettings.getBitLength() <= bitsPerValue, - String.format("bitStartPosition (%d) + bitLength (%d) must be less than bitsPerValue (%d)", - csfViewSettings.getBitStartPosition(), csfViewSettings.getBitLength(), 
bitsPerValue)); - } - - // No position; no freq; not pretokenized; not tokenized. - /** - * Norm is disabled as default. Like Lucene string field, or int/long fields. - */ - public final SchemaBuilder withIndexedNotTokenizedField(String fieldName) { - return withIndexedNotTokenizedField(fieldName, false); - } - - /** - * Add an indexed but not tokenized field. This is similar to Lucene's StringField. - */ - public final SchemaBuilder withIndexedNotTokenizedField(String fieldName, - boolean supportOutOfOrderAppends) { - return withIndexedNotTokenizedField(fieldName, supportOutOfOrderAppends, true); - } - - private final SchemaBuilder withIndexedNotTokenizedField(String fieldName, - boolean supportOutOfOrderAppends, - boolean omitNorms) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = getNoPositionNoFreqSettings(supportOutOfOrderAppends); - settings.getIndexedFieldSettings().setOmitNorms(omitNorms); - ThriftFieldConfiguration config = new ThriftFieldConfiguration(fieldName) - .setSettings(settings); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), config); - return this; - } - - - /** Makes the given field searchable by default, with the given weight. */ - public final SchemaBuilder withSearchFieldByDefault( - String fieldName, float textSearchableFieldWeight) { - if (!shouldIncludeField(fieldName)) { - return this; - } - - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - settings.setSearchFieldSettings( - new ThriftSearchFieldSettings() - .setTextSearchableFieldWeight(textSearchableFieldWeight) - .setTextDefaultSearchable(true)); - - return this; - } - - /** - * Similar to Lucene's TextField. The string is analyzed using the default/override analyzer. - * @param fieldName - * @param addHfPairIfHfFieldsArePresent Add hfPair fields if they exists in the schema. - * For certain text fields, adding hfPair fields are usually preferred, but they may - * not exist in the schema, in which case the hfPair fields will not be added. - */ - public final SchemaBuilder withTextField(String fieldName, - boolean addHfPairIfHfFieldsArePresent) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldConfiguration config = new ThriftFieldConfiguration(fieldName).setSettings( - getDefaultSettings(ThriftIndexOptions.DOCS_AND_FREQS_AND_POSITIONS)); - - if (addHfPairIfHfFieldsArePresent) { - // Add hfPair fields only if they exist in the schema for the cluster - boolean hfPair = shouldIncludeField(ImmutableSchema.HF_TERM_PAIRS_FIELD) - && shouldIncludeField(ImmutableSchema.HF_PHRASE_PAIRS_FIELD); - config.getSettings().getIndexedFieldSettings().setIndexHighFreqTermPairs(hfPair); - } - - config.getSettings().getIndexedFieldSettings().setTokenized(true); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), config); - return this; - } - - /** - * Marked the given field as having per position payload. - */ - public final SchemaBuilder withPerPositionPayload(String fieldName, int defaultPayloadLength) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - - settings.getIndexedFieldSettings().setStorePerPositionPayloads(true); - settings.getIndexedFieldSettings().setDefaultPerPositionPayloadLength(defaultPayloadLength); - return this; - } - - /** - * Add field into schema that is pre-tokenized and does not have position. - * E.g. 
hashtags / stocks / card_domain - */ - public final SchemaBuilder withPretokenizedNoPositionField(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldConfiguration config = new ThriftFieldConfiguration(fieldName) - .setSettings(getPretokenizedNoPositionFieldSetting()); - // Add hfPair fields only if they exist in the schema for the cluster - boolean hfPair = shouldIncludeField(ImmutableSchema.HF_TERM_PAIRS_FIELD) - && shouldIncludeField(ImmutableSchema.HF_PHRASE_PAIRS_FIELD); - config.getSettings().getIndexedFieldSettings().setIndexHighFreqTermPairs(hfPair); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), config); - return this; - } - - /** - * Mark the field to have ordered term dictionary. - * In Lucene, term dictionary is sorted. In Earlybird, term dictionary order is not - * guaranteed unless this is turned on. - */ - public final SchemaBuilder withOrderedTerms(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - - settings.getIndexedFieldSettings().setSupportOrderedTerms(true); - return this; - } - - /** - * Support lookup of term text by term id in the term dictionary. - */ - public final SchemaBuilder withTermTextLookup(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - - settings.getIndexedFieldSettings().setSupportTermTextLookup(true); - return this; - } - - /** - * Add a text field that is pre-tokenized, so not analyzed again in the index (e.g. Earlybird). - * - * Note that the token streams MUST be created using the attributes defined in - * {@link com.twitter.search.common.util.text.TweetTokenStreamSerializer}. - */ - public final SchemaBuilder withPretokenizedTextField( - String fieldName, - boolean addHfPairIfHfFieldsArePresent) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldConfiguration config = new ThriftFieldConfiguration(fieldName) - .setSettings(getDefaultPretokenizedSettings( - ThriftIndexOptions.DOCS_AND_FREQS_AND_POSITIONS)); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), config); - // Add hfPair fields only if they exist in the schema for the cluster - if (addHfPairIfHfFieldsArePresent) { - // Add hfPair fields only if they exist in the schema for the cluster - boolean hfPair = shouldIncludeField(ImmutableSchema.HF_TERM_PAIRS_FIELD) - && shouldIncludeField(ImmutableSchema.HF_PHRASE_PAIRS_FIELD); - config.getSettings().getIndexedFieldSettings().setIndexHighFreqTermPairs(hfPair); - } - return this; - } - - /** - * Add a feature configuration - */ - public final SchemaBuilder withFeatureConfiguration(String baseFieldName, String viewName, - FeatureConfiguration featureConfiguration) { - return withColumnStrideFieldView( - viewName, - // Defaulting all encoded tweet features to int since the underlying encoded tweet features - // are ints. - ThriftCSFType.INT, - featureConfiguration.getOutputType(), - baseFieldName, - featureConfiguration.getValueIndex(), - featureConfiguration.getBitStartPosition(), - featureConfiguration.getBitLength(), - featureConfiguration.getFeatureNormalizationType(), - featureConfiguration.getUpdateConstraints() - ); - } - - /** - * Add a long field in schema. This field uses LongTermAttribute. 
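For orientation, the per-field methods above are designed to be chained while a cluster schema is assembled; methods such as withSearchFieldByDefault, withOrderedTerms and withTermTextLookup look the field up in the already-registered configs, so they must follow the call that registers it. A sketch with hypothetical field names, again assuming a concrete SchemaBuilder subclass:

```java
this.withTextField("text", /* addHfPairIfHfFieldsArePresent */ true)  // analyzed, like Lucene's TextField
    .withSearchFieldByDefault("text", /* textSearchableFieldWeight */ 1.0f)
    .withPretokenizedNoPositionField("hashtags")    // pre-tokenized, docs and freqs only
    .withIndexedNotTokenizedField("card_domain")    // not analyzed, like Lucene's StringField
    .withOrderedTerms("card_domain")                // guarantee a sorted term dictionary
    .withTermTextLookup("card_domain");             // allow term id -> term text lookup
```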
- */ - private SchemaBuilder addLongTermField(String fieldName, boolean useSortableEncoding) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings longTermSettings = getEarlybirdNumericFieldSettings(); - ThriftTokenStreamSerializer tokenStreamSerializer = - new ThriftTokenStreamSerializer(tokenStreamSerializerVersion); - tokenStreamSerializer.setAttributeSerializerClassNames( - ImmutableList.of(LongTermAttributeSerializer.class.getName())); - longTermSettings.getIndexedFieldSettings().setTokenStreamSerializer(tokenStreamSerializer); - - ThriftIndexedNumericFieldSettings numericFieldSettings = - new ThriftIndexedNumericFieldSettings(true); - numericFieldSettings.setNumericType(ThriftNumericType.LONG); - numericFieldSettings.setUseSortableEncoding(useSortableEncoding); - longTermSettings.getIndexedFieldSettings().setNumericFieldSettings(numericFieldSettings); - - putIntoFieldConfigs(idMapping.getFieldID(fieldName), - new ThriftFieldConfiguration(fieldName).setSettings(longTermSettings)); - return this; - } - - public final SchemaBuilder withSortableLongTermField(String fieldName) { - return addLongTermField(fieldName, true); - } - - public final SchemaBuilder withLongTermField(String fieldName) { - return addLongTermField(fieldName, false); - } - - /** - * Add an int field in schema. This field uses IntTermAttribute. - */ - public final SchemaBuilder withIntTermField(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings intTermSettings = getEarlybirdNumericFieldSettings(); - ThriftTokenStreamSerializer attributeSerializer = - new ThriftTokenStreamSerializer(tokenStreamSerializerVersion); - attributeSerializer.setAttributeSerializerClassNames( - ImmutableList.of(IntTermAttributeSerializer.class.getName())); - intTermSettings.getIndexedFieldSettings().setTokenStreamSerializer(attributeSerializer); - - ThriftIndexedNumericFieldSettings numericFieldSettings = - new ThriftIndexedNumericFieldSettings(true); - numericFieldSettings.setNumericType(ThriftNumericType.INT); - intTermSettings.getIndexedFieldSettings().setNumericFieldSettings(numericFieldSettings); - - putIntoFieldConfigs(idMapping.getFieldID(fieldName), - new ThriftFieldConfiguration(fieldName).setSettings(intTermSettings)); - return this; - } - - /** - * Timeline and ExpertSearch uses - * {@link com.twitter.search.common.util.analysis.PayloadWeightedTokenizer} to store weighted - * values. - * - * E.g. for the PRODUCED_LANGUAGES and CONSUMED_LANGUAGES fields, they contain not a single, - * value, but instead a list of values with a weight associated with each value. - * - * This method adds an indexed field that uses - * {@link com.twitter.search.common.util.analysis.PayloadWeightedTokenizer}. - */ - public final SchemaBuilder withCharTermPayloadWeightedField(String fieldName) { - ThriftFieldConfiguration config = new ThriftFieldConfiguration(fieldName) - .setSettings(getPayloadWeightedSettings(ThriftIndexOptions.DOCS_AND_FREQS_AND_POSITIONS)); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), config); - return this; - } - - /** - * Set the version and description of this schema. 
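The numeric-field methods above index values as single-token streams rather than Lucene numeric fields: ints through IntTermAttribute, longs through LongTermAttribute or, with withSortableLongTermField, an encoding intended to keep term order consistent with numeric order. A sketch with hypothetical field names:

```java
this.withLongTermField("from_user_id")              // LongTermAttribute encoding
    .withSortableLongTermField("created_at_msec")   // sortable long encoding
    .withIntTermField("language_id")                // IntTermAttribute encoding
    .withCharTermPayloadWeightedField("produced_languages");  // PayloadWeightedTokenizer values
```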
- */ - public final SchemaBuilder withSchemaVersion( - int majorVersionNumber, - int minorVersionNumber, - String versionDesc, - boolean isOfficial) { - schema.setMajorVersionNumber(majorVersionNumber); - schema.setMinorVersionNumber(minorVersionNumber); - - schema.setVersion(majorVersionNumber + ":" + versionDesc); - schema.setVersionIsOfficial(isOfficial); - - return this; - } - - public final SchemaBuilder withSchemaVersion( - int majorVersionNumber, - String versionDesc, - boolean isOfficial) { - return withSchemaVersion(majorVersionNumber, 0, versionDesc, isOfficial); - } - - protected void putIntoFieldConfigs(int id, ThriftFieldConfiguration config) { - if (schema.getFieldConfigs() != null && schema.getFieldConfigs().containsKey(id)) { - throw new IllegalStateException("Already have a ThriftFieldConfiguration for field id " + id); - } - - if (fieldNameSet.contains(config.getFieldName())) { - throw new IllegalStateException("Already have a ThriftFieldConfiguration for field " - + config.getFieldName()); - } - fieldNameSet.add(config.getFieldName()); - schema.putToFieldConfigs(id, config); - } - - // Default field settings. Most field settings are similar to this. - protected ThriftFieldSettings getDefaultSettings(ThriftIndexOptions indexOption) { - return getDefaultSettings(indexOption, false); - } - - protected ThriftFieldSettings getDefaultSettings(ThriftIndexOptions indexOption, - boolean supportOutOfOrderAppends) { - ThriftFieldSettings fieldSettings = new ThriftFieldSettings(); - ThriftIndexedFieldSettings indexedFieldSettings = new ThriftIndexedFieldSettings(); - indexedFieldSettings - .setIndexed(true) - .setStored(false) - .setTokenized(false) - .setStoreTermVectors(false) - .setStoreTermVectorOffsets(false) - .setStoreTermVectorPayloads(false) - .setStoreTermVectorPositions(false) - .setSupportOutOfOrderAppends(supportOutOfOrderAppends) - .setIndexOptions(indexOption) - .setOmitNorms(true); // All Earlybird fields omit norms. - fieldSettings.setIndexedFieldSettings(indexedFieldSettings); - return fieldSettings; - } - - /** - * Default field settings for fields that are pretokenized - * - * The fields that use these settings will need to be tokenized using a serializer with the - * attributes defined in {@link com.twitter.search.common.util.text.TweetTokenStreamSerializer}. 
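The versioning methods above slot into the same chain; the numbers and description below are hypothetical. As the code shows, the human-readable version string combines only the major version and the description, while the minor version is stored separately:

```java
this.withSchemaVersion(
    /* majorVersionNumber */ 31,
    /* minorVersionNumber */ 2,
    /* versionDesc */ "add card_domain facet",
    /* isOfficial */ true);
// Result: majorVersionNumber = 31, minorVersionNumber = 2,
// version = "31:add card_domain facet", versionIsOfficial = true.
```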
- */ - protected final ThriftFieldSettings getDefaultPretokenizedSettings( - ThriftIndexOptions indexOption) { - ThriftFieldSettings fieldSettings = getDefaultSettings(indexOption); - fieldSettings.getIndexedFieldSettings().setTokenized(true); - ThriftTokenStreamSerializer attributeSerializer = - new ThriftTokenStreamSerializer(tokenStreamSerializerVersion); - attributeSerializer.setAttributeSerializerClassNames( - ImmutableList.of( - CharSequenceTermAttributeSerializer.class.getName(), - PositionIncrementAttributeSerializer.class.getName(), - TokenTypeAttributeSerializer.class.getName())); - - fieldSettings.getIndexedFieldSettings().setTokenStreamSerializer(attributeSerializer); - return fieldSettings; - } - - protected final ThriftFieldSettings getPretokenizedNoPositionFieldSetting() { - return getDefaultPretokenizedSettings(ThriftIndexOptions.DOCS_AND_FREQS); - } - - protected final ThriftFieldSettings getNoPositionNoFreqSettings() { - return getNoPositionNoFreqSettings(false); - } - - protected final ThriftFieldSettings getNoPositionNoFreqSettings( - boolean supportOutOfOrderAppends) { - return getDefaultSettings(ThriftIndexOptions.DOCS_ONLY, supportOutOfOrderAppends); - } - - protected final ThriftFieldSettings getEarlybirdNumericFieldSettings() { - // Supposedly numeric fields are not tokenized. - // However, Earlybird uses SingleTokenTokenStream to handle int/long fields. - // So we need to set indexed to true for these fields. - ThriftFieldSettings settings = getNoPositionNoFreqSettings(); - settings.getIndexedFieldSettings().setTokenized(true); - return settings; - } - - private ThriftFieldSettings getPayloadWeightedSettings(ThriftIndexOptions indexOption) { - ThriftFieldSettings fieldSettings = getDefaultSettings(indexOption); - fieldSettings.getIndexedFieldSettings().setTokenized(true); - ThriftTokenStreamSerializer attributeSerializer = - new ThriftTokenStreamSerializer(tokenStreamSerializerVersion); - attributeSerializer.setAttributeSerializerClassNames( - ImmutableList.of(CharTermAttributeSerializer.class.getName(), - PositionIncrementAttributeSerializer.class.getName(), - PayloadAttributeSerializer.class.getName())); - fieldSettings.getIndexedFieldSettings().setTokenStreamSerializer(attributeSerializer); - return fieldSettings; - } - - protected boolean shouldIncludeField(String fieldName) { - return true; - } -} diff --git a/src/java/com/twitter/search/common/schema/SchemaDocumentFactory.java b/src/java/com/twitter/search/common/schema/SchemaDocumentFactory.java deleted file mode 100644 index 1efb745cc..000000000 --- a/src/java/com/twitter/search/common/schema/SchemaDocumentFactory.java +++ /dev/null @@ -1,433 +0,0 @@ -package com.twitter.search.common.schema; - -import java.io.IOException; -import java.io.StringReader; -import java.util.Collections; -import java.util.List; -import java.util.Set; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Sets; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; -import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute; -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField; -import org.apache.lucene.util.BytesRef; -import org.slf4j.Logger; -import 
org.slf4j.LoggerFactory; - -import com.twitter.common.text.token.TwitterTokenStream; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftFieldData; -import com.twitter.search.common.schema.thriftjava.ThriftGeoCoordinate; -import com.twitter.search.common.util.analysis.IntTermAttribute; -import com.twitter.search.common.util.analysis.LongTermAttribute; -import com.twitter.search.common.util.analysis.SortableLongTermAttribute; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.common.util.text.HighFrequencyTermPairs; -import com.twitter.search.common.util.text.OmitNormTextField; -import com.twitter.search.common.util.text.SingleTokenStream; - -/** - * A document factory that converts {@link ThriftDocument} into Lucene {@link Document}s - * using the provided {@link com.twitter.search.common.schema.base.Schema}. - */ -public class SchemaDocumentFactory { - private static final Logger LOG = LoggerFactory.getLogger(SchemaDocumentFactory.class); - - private final Schema schema; - private final ImmutableList tokenStreamRewriters; - - /** - * Creates a SchemaDocumentFactory with a schema and the tokenStreamRewriters. - * - * @param tokenStreamRewriters a list of token stream rewriters, which will be applied in order. - */ - public SchemaDocumentFactory( - Schema schema, - List tokenStreamRewriters) { - this.schema = schema; - this.tokenStreamRewriters = ImmutableList.copyOf(tokenStreamRewriters); - } - - /** - * Creates a SchemaDocumentFactory with no tokenStreamRewriters. - */ - public SchemaDocumentFactory(Schema schema) { - this(schema, Collections.EMPTY_LIST); - } - - public final Document newDocument(ThriftDocument document) throws IOException { - return innerNewDocument(document); - } - - /** - * Create a Lucene document from the ThriftDocument. - */ - @VisibleForTesting - public Document innerNewDocument(ThriftDocument document) throws IOException { - Document luceneDocument = new Document(); - Set hfTerms = Sets.newHashSet(); - Set hfPhrases = Sets.newHashSet(); - - Analyzer defaultAnalyzer = schema.getDefaultAnalyzer(document.getDefaultAnalyzerOverride()); - - for (ThriftField field : document.getFields()) { - boolean successful = false; - try { - addLuceneFields(field, defaultAnalyzer, luceneDocument, hfTerms, hfPhrases); - successful = true; - } finally { - if (!successful) { - LOG.warn("Unexpected exception while trying to add field. Field ID: " - + field.getFieldConfigId() + " Field Name: " - + schema.getFieldName(field.getFieldConfigId())); - } - } - } - - for (String token : hfTerms) { - for (String token2 : hfTerms) { - if (token.compareTo(token2) < 0) { - luceneDocument.add(new Field(ImmutableSchema.HF_TERM_PAIRS_FIELD, - HighFrequencyTermPairs.createPair(token, token2), - OmitNormTextField.TYPE_NOT_STORED)); - } - } - } - - for (String phrase : hfPhrases) { - // Tokens in the phrase set are not terms and have already been processed with - // HighFrequencyTermPairs.createPhrasePair. 
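For context, the public surface of this factory is small: construct it with a Schema, optionally with a list of TokenStreamRewriters (the interface is declared at the bottom of this class), and call newDocument for each incoming ThriftDocument. A sketch in which the schema and thriftDocument variables are assumed to exist elsewhere:

```java
// A rewriter may adjust a field's token stream before it is indexed; this one is a
// deliberate no-op, included only to show where the hook sits.
SchemaDocumentFactory.TokenStreamRewriter noOpRewriter = (fieldInfo, stream) -> stream;

SchemaDocumentFactory factory =
    new SchemaDocumentFactory(schema, ImmutableList.of(noOpRewriter));
Document luceneDocument = factory.newDocument(thriftDocument);  // may throw IOException
```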
- luceneDocument.add(new Field(ImmutableSchema.HF_PHRASE_PAIRS_FIELD, phrase, - OmitNormTextField.TYPE_NOT_STORED)); - } - - return schema.getFacetsConfig().build(luceneDocument); - } - - private void addLuceneFields(ThriftField field, Analyzer analyzer, Document doc, - Set hfTerms, Set hfPhrases) throws IOException { - Schema.FieldInfo fieldInfo = - schema.getFieldInfo(field.getFieldConfigId(), field.getFieldConfigOverride()); - - if (fieldInfo == null) { - // field not defined in schema - skip it - return; - } - - ThriftFieldData fieldData = field.getFieldData(); - if (fieldInfo.getFieldType().getCsfType() != null) { - addCSFField(doc, fieldInfo, fieldData); - return; - } - - // Checking which data type is set is not sufficient here. We also need to check schema to - // see what the type the field is configured to be. See SEARCH-5173 for more details. - // The problem is that Pig, while converting Tuples to Thrift, sets all primitive type - // fields to 0. (i.e. the isSet calls will return true). - IndexedNumericFieldSettings numericSettings = - fieldInfo.getFieldType().getNumericFieldSettings(); - if (fieldData.isSetTokenStreamValue()) { - addTokenField(doc, hfTerms, hfPhrases, fieldInfo, fieldData); - } else if (fieldData.isSetStringValue()) { - addStringField(analyzer, doc, hfTerms, hfPhrases, fieldInfo, fieldData); - } else if (fieldData.isSetBytesValue()) { - addBytesField(doc, fieldInfo, fieldData); - } else if (fieldData.isSetGeoCoordinate()) { - addGeoField(doc, fieldInfo, fieldData); - } else if (numericSettings != null) { - // handle numeric fields. - switch (numericSettings.getNumericType()) { - case INT: - Preconditions.checkState(fieldData.isSetIntValue(), - "Int field does not have int value set. Field name: %s", fieldInfo.getName()); - addIntField(doc, fieldInfo, fieldData); - break; - case LONG: - Preconditions.checkState(fieldData.isSetLongValue(), - "Long field does not have long value set. Field name: %s", fieldInfo.getName()); - addLongField(doc, fieldInfo, fieldData); - break; - case FLOAT: - Preconditions.checkState(fieldData.isSetFloatValue(), - "Float field does not have float value set. Field name: %s ", fieldInfo.getName()); - addFloatField(); - break; - case DOUBLE: - Preconditions.checkState(fieldData.isSetDoubleValue(), - "Double field does not have double value set. Field name: %s", fieldInfo.getName()); - addDoubleFIeld(); - break; - default: - throw new UnsupportedOperationException("Earlybird does not know how to handle field " - + field.getFieldConfigId() + " " + field); - } - } else { - throw new UnsupportedOperationException("Earlybird does not know how to handle field " - + field.getFieldConfigId() + " " + field); - } - } - - private void addCSFField(Document doc, Schema.FieldInfo fieldInfo, ThriftFieldData fieldData) { - if (fieldInfo.getFieldType().getCsfFixedLengthNumValuesPerDoc() > 1) { - - // As an optimization, TBinaryProtocol stores a byte array field as a part of a larger byte - // array field. Must call fieldData.getBytesValue(). fieldData.bytesValue.array() will - // return extraneous data. 
See: SEARCH-3996 - doc.add(new Field(fieldInfo.getName(), fieldData.getBytesValue(), fieldInfo.getFieldType())); - } else { - doc.add(new CSFField(fieldInfo.getName(), fieldInfo.getFieldType(), fieldData)); - } - } - - private void addTokenField( - Document doc, - Set hfTerms, - Set hfPhrases, - Schema.FieldInfo fieldInfo, - ThriftFieldData fieldData) throws IOException { - TwitterTokenStream twitterTokenStream - = fieldInfo.getFieldType().getTokenStreamSerializer().deserialize( - fieldData.getTokenStreamValue(), fieldData.getStringValue()); - - try { - for (TokenStreamRewriter rewriter : tokenStreamRewriters) { - twitterTokenStream = rewriter.rewrite(fieldInfo, twitterTokenStream); - } - - expandStream(doc, fieldInfo, twitterTokenStream, hfTerms, hfPhrases); - doc.add(new Field(fieldInfo.getName(), twitterTokenStream, fieldInfo.getFieldType())); - } finally { - twitterTokenStream.close(); - } - } - - private void addStringField(Analyzer analyzer, Document doc, Set hfTerms, - Set hfPhrases, Schema.FieldInfo fieldInfo, - ThriftFieldData fieldData) { - doc.add(new Field(fieldInfo.getName(), fieldData.getStringValue(), fieldInfo.getFieldType())); - if (fieldInfo.getFieldType().tokenized()) { - try { - TokenStream tokenStream = analyzer.tokenStream(fieldInfo.getName(), - new StringReader(fieldData.getStringValue())); - try { - expandStream( - doc, - fieldInfo, - tokenStream, - hfTerms, - hfPhrases); - } finally { - tokenStream.close(); - } - } catch (IOException e) { - LOG.error("IOException expanding token stream", e); - } - } else { - addFacetField(doc, fieldInfo, fieldData.getStringValue()); - } - } - - private void addBytesField(Document doc, Schema.FieldInfo fieldInfo, ThriftFieldData fieldData) { - doc.add(new Field(fieldInfo.getName(), fieldData.getBytesValue(), fieldInfo.getFieldType())); - } - - private void addIntField(Document doc, Schema.FieldInfo fieldInfo, - ThriftFieldData fieldData) { - int value = fieldData.getIntValue(); - addFacetField(doc, fieldInfo, String.valueOf(value)); - - if (fieldInfo.getFieldType().getNumericFieldSettings() == null) { - // No NumericFieldSettings. Even though the data is numeric, this field is not - // really a numerical field. Just add as a string. - doc.add(new Field(fieldInfo.getName(), String.valueOf(value), fieldInfo.getFieldType())); - } else if (fieldInfo.getFieldType().getNumericFieldSettings().isUseTwitterFormat()) { - addIntTermAttributeField(value, fieldInfo, doc); - } else { - // Use lucene style numerical fields - doc.add(NumericField.newIntField(fieldInfo.getName(), value)); - } - } - - private void addIntTermAttributeField(int value, - Schema.FieldInfo fieldInfo, - Document doc) { - SingleTokenStream singleToken = new SingleTokenStream(); - IntTermAttribute termAtt = singleToken.addAttribute(IntTermAttribute.class); - termAtt.setTerm(value); - doc.add(new Field(fieldInfo.getName(), singleToken, fieldInfo.getFieldType())); - } - - private void addLongField(Document doc, Schema.FieldInfo fieldInfo, - ThriftFieldData fieldData) { - long value = fieldData.getLongValue(); - addFacetField(doc, fieldInfo, String.valueOf(value)); - - if (fieldInfo.getFieldType().getNumericFieldSettings() == null) { - // No NumericFieldSettings. Even though the data is numeric, this field is not - // really a numerical field. Just add as a string. 
- doc.add(new Field(fieldInfo.getName(), String.valueOf(value), fieldInfo.getFieldType())); - } else if (fieldInfo.getFieldType().getNumericFieldSettings().isUseTwitterFormat()) { - // Twitter style numerical field: use LongTermAttribute - addLongTermAttributeField(value, fieldInfo, doc); - } else { - // Use lucene style numerical fields - doc.add(NumericField.newLongField(fieldInfo.getName(), value)); - } - } - - private void addLongTermAttributeField(long value, - Schema.FieldInfo fieldInfo, - Document doc) { - SingleTokenStream singleToken = new SingleTokenStream(); - boolean useSortableEncoding = - fieldInfo.getFieldType().getNumericFieldSettings().isUseSortableEncoding(); - - if (useSortableEncoding) { - SortableLongTermAttribute termAtt = singleToken.addAttribute(SortableLongTermAttribute.class); - termAtt.setTerm(value); - } else { - LongTermAttribute termAtt = singleToken.addAttribute(LongTermAttribute.class); - termAtt.setTerm(value); - } - doc.add(new Field(fieldInfo.getName(), singleToken, fieldInfo.getFieldType())); - } - - private void addFloatField() { - throw new UnsupportedOperationException("Earlybird does not support float values yet."); - } - - private void addDoubleFIeld() { - throw new UnsupportedOperationException("Earlybird does not support double values yet."); - } - - private void addGeoField(Document doc, Schema.FieldInfo fieldInfo, ThriftFieldData fieldData) { - ThriftGeoCoordinate coord = fieldData.getGeoCoordinate(); - if (GeoUtil.validateGeoCoordinates(coord.getLat(), coord.getLon())) { - GeoUtil.fillGeoFields(doc, fieldInfo.getName(), - coord.getLat(), coord.getLon(), coord.getAccuracy()); - } - } - - private void addFacetField(Document doc, Schema.FieldInfo fieldInfo, String value) { - Preconditions.checkArgument(doc != null); - Preconditions.checkArgument(fieldInfo != null); - Preconditions.checkArgument(value != null); - - if (fieldInfo.getFieldType().getFacetName() != null) { - doc.add(new SortedSetDocValuesFacetField(fieldInfo.getFieldType().getFacetName(), value)); - } - } - - private String getTerm(TermToBytesRefAttribute attr) { - if (attr instanceof CharTermAttribute) { - return ((CharTermAttribute) attr).toString(); - } else if (attr instanceof IntTermAttribute) { - return String.valueOf(((IntTermAttribute) attr).getTerm()); - } else if (attr instanceof LongTermAttribute) { - return String.valueOf(((LongTermAttribute) attr).getTerm()); - } else { - return attr.getBytesRef().utf8ToString(); - } - } - - /** - * Expand the TwitterTokenStream and populate high-frequency terms, phrases and/or facet category paths. - */ - private void expandStream( - Document doc, - Schema.FieldInfo fieldInfo, - TokenStream stream, - Set hfTerms, - Set hfPhrases) throws IOException { - // Checkstyle does not allow assignment to parameters. 
- Set facetHfTerms = hfTerms; - Set facetHfPhrases = hfPhrases; - - if (!(HighFrequencyTermPairs.INDEX_HF_TERM_PAIRS - && fieldInfo.getFieldType().isIndexHFTermPairs())) { - // high-frequency terms and phrases are not needed - if (fieldInfo.getFieldType().getFacetName() == null) { - // Facets are not needed either, simply return, would do nothing otherwise - return; - } - facetHfTerms = null; - facetHfPhrases = null; - } - - final TermToBytesRefAttribute attr = stream.getAttribute(TermToBytesRefAttribute.class); - stream.reset(); - - String lastHFTerm = null; - while (stream.incrementToken()) { - String term = getTerm(attr); - if (fieldInfo.getFieldType().getFacetName() != null) { - addFacetField(doc, fieldInfo, term); - } - if (HighFrequencyTermPairs.HF_TERM_SET.contains(term)) { - if (facetHfTerms != null) { - facetHfTerms.add(term); - } - if (lastHFTerm != null) { - if (facetHfPhrases != null) { - facetHfPhrases.add(HighFrequencyTermPairs.createPhrasePair(lastHFTerm, term)); - } - } - lastHFTerm = term; - } else { - lastHFTerm = null; - } - } - } - - public static final class CSFField extends Field { - /** - * Create a CSFField with the given fieldType, containing the given field data. - */ - public CSFField(String name, EarlybirdFieldType fieldType, ThriftFieldData data) { - super(name, fieldType); - - if (fieldType.isCsfVariableLength()) { - fieldsData = new BytesRef(data.getBytesValue()); - } else { - switch (fieldType.getCsfType()) { - case BYTE: - fieldsData = Long.valueOf(data.getByteValue()); - break; - case INT: - fieldsData = Long.valueOf(data.getIntValue()); - break; - case LONG: - fieldsData = Long.valueOf(data.getLongValue()); - break; - case FLOAT: - fieldsData = Long.valueOf(Float.floatToRawIntBits((float) data.getFloatValue())); - break; - case DOUBLE: - fieldsData = Long.valueOf(Double.doubleToRawLongBits(data.getDoubleValue())); - break; - default: - throw new IllegalArgumentException("Unknown csf type: " + fieldType.getCsfType()); - } - } - } - } - - public interface TokenStreamRewriter { - /** - * Rewrite the token stream. - */ - TwitterTokenStream rewrite(Schema.FieldInfo fieldInfo, TwitterTokenStream stream); - } -} diff --git a/src/java/com/twitter/search/common/schema/SchemaUtil.java b/src/java/com/twitter/search/common/schema/SchemaUtil.java deleted file mode 100644 index cba903a2b..000000000 --- a/src/java/com/twitter/search/common/schema/SchemaUtil.java +++ /dev/null @@ -1,102 +0,0 @@ -package com.twitter.search.common.schema; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.DocValuesType; -import org.apache.lucene.index.IndexOptions; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftNumericType; -import com.twitter.search.common.util.analysis.IntTermAttributeImpl; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; - -public final class SchemaUtil { - private SchemaUtil() { - } - - /** - * Get the a fixed CSF field's number of values per doc. 
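A small worked example of the value packing that the CSFField class above performs: fixed-length CSF values all travel through the long-valued fieldsData slot, with floating-point values stored as their raw IEEE 754 bits. The decode shown here is a sketch of the inverse conversion, not Earlybird's actual read path:

```java
// FLOAT values are stored via floatToRawIntBits, DOUBLE values via doubleToRawLongBits.
float original = 0.75f;
long storedBits = Float.floatToRawIntBits(original);        // what CSFField stores
float recovered = Float.intBitsToFloat((int) storedBits);   // 0.75f again

double originalD = 1.5;
long storedDoubleBits = Double.doubleToRawLongBits(originalD);
double recoveredD = Double.longBitsToDouble(storedDoubleBits);
```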
- * @param schema the Schema for the index - * @param fieldId the field id the CSF field - the field must be of binary integer type and - * in fixed size - * @return the number of values per doc - */ - public static int getCSFFieldFixedLength(ImmutableSchemaInterface schema, int fieldId) { - final Schema.FieldInfo fieldInfo = Preconditions.checkNotNull(schema.getFieldInfo(fieldId)); - return getCSFFieldFixedLength(fieldInfo); - } - - /** - * Get the a fixed CSF field's number of values per doc. - * @param schema the Schema for the index - * @param fieldName the field name of the CSF field - the field must be of binary integer type - * and in fixed size - * @return the number of values per doc - */ - public static int getCSFFieldFixedLength(ImmutableSchemaInterface schema, String fieldName) { - final Schema.FieldInfo fieldInfo = Preconditions.checkNotNull(schema.getFieldInfo(fieldName)); - return getCSFFieldFixedLength(fieldInfo); - } - - /** - * Get the a fixed CSF field's number of values per doc. - * @param fieldInfo the field of the CSF field - the field must be of binary integer type - * and in fixed size - * @return the number of values per doc - */ - public static int getCSFFieldFixedLength(Schema.FieldInfo fieldInfo) { - final EarlybirdFieldType fieldType = fieldInfo.getFieldType(); - Preconditions.checkState(fieldType.docValuesType() == DocValuesType.BINARY - && fieldType.getCsfType() == ThriftCSFType.INT); - return fieldType.getCsfFixedLengthNumValuesPerDoc(); - } - - /** Converts the given value to a BytesRef instance, according to the type of the given field. */ - public static BytesRef toBytesRef(Schema.FieldInfo fieldInfo, String value) { - EarlybirdFieldType fieldType = fieldInfo.getFieldType(); - Preconditions.checkArgument(fieldType.indexOptions() != IndexOptions.NONE); - IndexedNumericFieldSettings numericSetting = fieldType.getNumericFieldSettings(); - if (numericSetting != null) { - if (!numericSetting.isUseTwitterFormat()) { - throw new UnsupportedOperationException( - "Numeric field not using Twitter format: cannot drill down."); - } - - ThriftNumericType numericType = numericSetting.getNumericType(); - switch (numericType) { - case INT: - try { - return IntTermAttributeImpl.copyIntoNewBytesRef(Integer.parseInt(value)); - } catch (NumberFormatException e) { - throw new UnsupportedOperationException( - String.format("Cannot parse value for int field %s: %s", - fieldInfo.getName(), value), - e); - } - case LONG: - try { - return numericSetting.isUseSortableEncoding() - ? 
SortableLongTermAttributeImpl.copyIntoNewBytesRef(Long.parseLong(value)) - : LongTermAttributeImpl.copyIntoNewBytesRef(Long.parseLong(value)); - } catch (NumberFormatException e) { - throw new UnsupportedOperationException( - String.format("Cannot parse value for long field %s: %s", - fieldInfo.getName(), value), - e); - } - default: - throw new UnsupportedOperationException( - String.format("Unsupported numeric type for field %s: %s", - fieldInfo.getName(), numericType)); - } - } - - return new BytesRef(value); - } -} diff --git a/src/java/com/twitter/search/common/schema/SearchWhitespaceAnalyzer.java b/src/java/com/twitter/search/common/schema/SearchWhitespaceAnalyzer.java deleted file mode 100644 index fd94f0e78..000000000 --- a/src/java/com/twitter/search/common/schema/SearchWhitespaceAnalyzer.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.common.schema; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.core.WhitespaceTokenizer; - -/** - * The majority of the code is copied from Lucene 3.1 analysis.core.WhitespaceAnalyzer. The only - * new code is the getPositionIncrementGap() - */ -public final class SearchWhitespaceAnalyzer extends Analyzer { - @Override - protected TokenStreamComponents createComponents(final String fieldName) { - return new TokenStreamComponents(new WhitespaceTokenizer()); - } - - /** - * Make sure that phrase queries do not match across 2 instances of the text field. - * - * See the Javadoc for Analyzer.getPositionIncrementGap() for a good explanation of how this - * method works. - */ - @Override - public int getPositionIncrementGap(String fieldName) { - // Hard-code "text" here, because we can't depend on EarlybirdFieldConstants. - return "text".equals(fieldName) ? 1 : super.getPositionIncrementGap(fieldName); - } -} diff --git a/src/java/com/twitter/search/common/schema/ThriftDocumentBuilder.java b/src/java/com/twitter/search/common/schema/ThriftDocumentBuilder.java deleted file mode 100644 index 7bec85040..000000000 --- a/src/java/com/twitter/search/common/schema/ThriftDocumentBuilder.java +++ /dev/null @@ -1,228 +0,0 @@ -package com.twitter.search.common.schema; - -import java.io.IOException; -import java.util.List; -import java.util.logging.Level; -import java.util.logging.Logger; - -import javax.annotation.Nullable; - -import com.twitter.common.text.util.PositionIncrementAttributeSerializer; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftFieldData; -import com.twitter.search.common.schema.thriftjava.ThriftGeoCoordinate; -import com.twitter.search.common.util.analysis.CharTermAttributeSerializer; -import com.twitter.search.common.util.analysis.LongTermAttributeSerializer; -import com.twitter.search.common.util.analysis.LongTermsTokenStream; -import com.twitter.search.common.util.analysis.PayloadAttributeSerializer; -import com.twitter.search.common.util.analysis.PayloadWeightedTokenizer; -import com.twitter.search.common.util.spatial.GeoUtil; - -/** - * Builder class for building ThriftDocuments. 
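Looking back at SchemaUtil.toBytesRef above: it is the helper that turns a user-supplied string value into the exact term bytes the index stores for a field, which is what drill-down style lookups need. A sketch in which the schema variable (an ImmutableSchemaInterface) and the field name are assumptions:

```java
Schema.FieldInfo fieldInfo = schema.getFieldInfo("from_user_id");  // a Twitter-format LONG field
BytesRef termBytes = SchemaUtil.toBytesRef(fieldInfo, "12345");
// For this field the result is LongTermAttributeImpl.copyIntoNewBytesRef(12345L),
// or the sortable variant when useSortableEncoding is set; fields without numeric
// settings fall through to new BytesRef("12345").
```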
- */ -public class ThriftDocumentBuilder { - private static final Logger LOG = Logger.getLogger(ThriftDocumentBuilder.class.getName()); - - protected final ThriftDocument doc = new ThriftDocument(); - protected final FieldNameToIdMapping idMapping; - - private static final ThreadLocal PAYLOAD_WEIGHTED_SERIALIZER_PER_THREAD = - new ThreadLocal() { - @Override - protected TokenStreamSerializer initialValue() { - return TokenStreamSerializer.builder() - .add(new CharTermAttributeSerializer()) - .add(new PositionIncrementAttributeSerializer()) - .add(new PayloadAttributeSerializer()) - .build(); - } - }; - - private static final ThreadLocal LONG_TERM_SERIALIZER_PER_THREAD = - new ThreadLocal() { - @Override - protected TokenStreamSerializer initialValue() { - return TokenStreamSerializer.builder() - .add(new LongTermAttributeSerializer()) - .build(); - } - }; - - public ThriftDocumentBuilder(FieldNameToIdMapping idMapping) { - this.idMapping = idMapping; - } - - protected void prepareToBuild() { - // left empty, subclass can override this. - } - - public ThriftDocument build() { - prepareToBuild(); - return doc; - } - - /** - * Add a long field. This is indexed as a - * {@link com.twitter.search.common.util.analysis.LongTermAttribute} - */ - public final ThriftDocumentBuilder withLongField(String fieldName, long value) { - ThriftFieldData fieldData = new ThriftFieldData().setLongValue(value); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add an int field. This is indexed as a - * {@link com.twitter.search.common.util.analysis.IntTermAttribute} - */ - public final ThriftDocumentBuilder withIntField(String fieldName, int value) { - ThriftFieldData fieldData = new ThriftFieldData().setIntValue(value); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add a field whose value is a single byte. - */ - public final ThriftDocumentBuilder withByteField(String fieldName, byte value) { - ThriftFieldData fieldData = new ThriftFieldData().setByteValue(value); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add a field whose value is a byte array. - */ - public final ThriftDocumentBuilder withBytesField(String fieldName, byte[] value) { - ThriftFieldData fieldData = new ThriftFieldData().setBytesValue(value); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add a field whose value is a float. - */ - public final ThriftDocumentBuilder withFloatField(String fieldName, float value) { - ThriftFieldData fieldData = new ThriftFieldData().setFloatValue(value); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Added a field whose value is a Lucene TokenStream. 
- * The Lucene TokenStream is serialized using Twitter's - * {@link com.twitter.common.text.util.TokenStreamSerializer} - */ - public final ThriftDocumentBuilder withTokenStreamField(String fieldName, - @Nullable String tokenStreamText, - byte[] tokenStream) { - if (tokenStream == null) { - return this; - } - ThriftFieldData fieldData = new ThriftFieldData() - .setStringValue(tokenStreamText).setTokenStreamValue(tokenStream); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add a field whose value is a String. - * @param fieldName Name of the field where the string will be added. - * @param text This string is indexed as is (not analyzed). - */ - public final ThriftDocumentBuilder withStringField(String fieldName, String text) { - if (text == null || text.isEmpty()) { - return this; - } - - ThriftFieldData fieldData = new ThriftFieldData().setStringValue(text); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Add a field whose value is a geo coordinate. - * Earlybird will process the coordinates into geo hashes before indexing. - */ - public final ThriftDocumentBuilder withGeoField(String fieldName, - double lat, double lon, int acc) { - if (!GeoUtil.validateGeoCoordinates(lat, lon)) { - // If the geo coordinates are invalid, don't add any field. - return this; - } - ThriftGeoCoordinate coord = new ThriftGeoCoordinate(); - coord.setLat(lat); - coord.setLon(lon); - coord.setAccuracy(acc); - - ThriftFieldData fieldData = new ThriftFieldData().setGeoCoordinate(coord); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } - - /** - * Added a list of tokens that are weighted. The weights are stored inside payload. - * See {@link com.twitter.search.common.util.analysis.PayloadWeightedTokenizer} for more details. - */ - public final ThriftDocumentBuilder withPayloadWeightTokenStreamField(String fieldName, - String tokens) { - byte[] serialized; - try { - PayloadWeightedTokenizer tokenizer = new PayloadWeightedTokenizer(tokens); - serialized = PAYLOAD_WEIGHTED_SERIALIZER_PER_THREAD.get().serialize(tokenizer); - tokenizer.close(); - } catch (IOException e) { - LOG.log(Level.WARNING, - "Failed to add PayloadWeightedTokenizer field. Bad token weight list: " + tokens, e); - return this; - } catch (NumberFormatException e) { - LOG.log(Level.WARNING, - "Failed to add PayloadWeightedTokenizer field. Cannot parse token weight: " + tokens, e); - return this; - } - withTokenStreamField(fieldName, tokens, serialized); - return this; - } - - /** - * Add a field whose value is a list of longs. - * Each long is encoded into a LongTermAttribute. - * The field will contain a LongTermTokenStream. 
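Putting the builder methods above together with the SchemaDocumentFactory from earlier, a hypothetical end-to-end sketch; the idMapping, schema, field names and values are all assumptions, and real ingesters presumably drive this through schema-specific subclasses:

```java
ThriftDocument thriftDoc = new ThriftDocumentBuilder(idMapping)
    .withLongField("tweet_id", 1234567890L)
    .withStringField("card_domain", "example.com")
    .withGeoField("geo_location", 37.7749, -122.4194, /* accuracy */ 10)
    .build();

Document luceneDoc = new SchemaDocumentFactory(schema).newDocument(thriftDoc);
```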
- */ - public final ThriftDocumentBuilder withLongIDsField(String fieldName, - List longList) throws IOException { - - if (longList == null || longList.isEmpty()) { - return this; - } - LongTermsTokenStream stream = new LongTermsTokenStream(longList); - stream.reset(); - byte[] serializedStream = LONG_TERM_SERIALIZER_PER_THREAD.get().serialize(stream); - - ThriftFieldData fieldData = new ThriftFieldData().setTokenStreamValue(serializedStream); - ThriftField field = new ThriftField() - .setFieldConfigId(idMapping.getFieldID(fieldName)).setFieldData(fieldData); - doc.addToFields(field); - return this; - } -} diff --git a/src/java/com/twitter/search/common/schema/base/BUILD b/src/java/com/twitter/search/common/schema/base/BUILD deleted file mode 100644 index 8501bb387..000000000 --- a/src/java/com/twitter/search/common/schema/base/BUILD +++ /dev/null @@ -1,25 +0,0 @@ -# Library for Schema.java and other utilities with minimal dependencies. -java_library( - name = "base", - sources = ["*.java"], - platform = "java8", - provides = artifact( - org = "com.twitter.search.common", - name = "schema-base", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/text/util:token-util", - "src/thrift/com/twitter/search/common:features-java", - "src/thrift/com/twitter/search/common:schema-java", - ], -) diff --git a/src/java/com/twitter/search/common/schema/base/EarlybirdFieldType.java b/src/java/com/twitter/search/common/schema/base/EarlybirdFieldType.java deleted file mode 100644 index f1e0a501e..000000000 --- a/src/java/com/twitter/search/common/schema/base/EarlybirdFieldType.java +++ /dev/null @@ -1,374 +0,0 @@ -package com.twitter.search.common.schema.base; - -import javax.annotation.Nullable; - -import org.apache.commons.lang.StringUtils; -import org.apache.lucene.document.FieldType; -import org.apache.lucene.index.DocValuesType; -import org.apache.lucene.index.IndexOptions; - -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftCSFViewSettings; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureUpdateConstraint; - -/** - * An extension of Lucene's {@link FieldType} that contains additional Earlybird-specific settings. - * Lucene IndexingChains can downcast the FieldType object to access these additional settings. 
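The class javadoc above mentions that Lucene indexing chains downcast the FieldType to reach these settings; roughly, consuming code that only sees Lucene interfaces can recover them like this (the field variable and the branch taken are hypothetical):

```java
IndexableFieldType type = field.fieldType();  // field is some org.apache.lucene.index.IndexableField
if (type instanceof EarlybirdFieldType) {
  EarlybirdFieldType ebType = (EarlybirdFieldType) type;
  if (ebType.getCsfType() != null) {
    // Treat as a column stride field; ebType.isCsfLoadIntoRam(),
    // ebType.getCsfFixedLengthNumValuesPerDoc(), etc. describe how the
    // per-document values should be stored.
  }
}
```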
- */ -public class EarlybirdFieldType extends FieldType { - - public static final EarlybirdFieldType LONG_CSF_FIELD_TYPE = new EarlybirdFieldType(); - public static final EarlybirdFieldType INT_CSF_FIELD_TYPE = new EarlybirdFieldType(); - public static final EarlybirdFieldType BYTE_CSF_FIELD_TYPE = new EarlybirdFieldType(); - - static { - LONG_CSF_FIELD_TYPE.setCsfType(ThriftCSFType.LONG); - LONG_CSF_FIELD_TYPE.setDocValuesType(DocValuesType.NUMERIC); - LONG_CSF_FIELD_TYPE.setCsfLoadIntoRam(true); - LONG_CSF_FIELD_TYPE.freeze(); - - INT_CSF_FIELD_TYPE.setCsfType(ThriftCSFType.INT); - INT_CSF_FIELD_TYPE.setDocValuesType(DocValuesType.NUMERIC); - INT_CSF_FIELD_TYPE.setCsfLoadIntoRam(true); - INT_CSF_FIELD_TYPE.freeze(); - - BYTE_CSF_FIELD_TYPE.setCsfType(ThriftCSFType.BYTE); - BYTE_CSF_FIELD_TYPE.setDocValuesType(DocValuesType.NUMERIC); - BYTE_CSF_FIELD_TYPE.setCsfLoadIntoRam(true); - BYTE_CSF_FIELD_TYPE.freeze(); - } - - - private boolean storePerPositionPayloads; - private int defaultPayloadLength; - // This is true for fields that become immutable after optimization - private boolean becomesImmutable = true; - private boolean supportOrderedTerms; - private boolean supportTermTextLookup; - private boolean indexHFTermPairs; - - /** - * This flag turns on tweet specific normalizations. - * This turns on the following two token processors: - * {@link com.twitter.search.common.util.text.splitter.HashtagMentionPunctuationSplitter} - * {@link com.twitter.search.common.util.text.filter.NormalizedTokenFilter} - * - * HashtagMentionPunctuationSplitter would break a mention or hashtag like @ab_cd or #ab_cd into - * tokens {ab, cd}. - * NormalizedTokenFilter strips out the # @ $ from the tokens. - * - * - * @deprecated we should remove this flag. It is confusing to have Earlybird apply additional - * tokenization on top of what ingester produced. 
- */ - @Deprecated - private boolean useTweetSpecificNormalization; - - @Nullable - private TokenStreamSerializer.Builder tokenStreamSerializerProvider = null; - - // csf type settings - private ThriftCSFType csfType; - private boolean csfVariableLength; - private int csfFixedLengthNumValuesPerDoc; - private boolean csfFixedLengthUpdateable; - private boolean csfLoadIntoRam; - private boolean csfDefaultValueSet; - private long csfDefaultValue; - // True if this is a CSF field which is a view on top of a different CSF field - private boolean csfViewField; - // If this field is a csf view, this is the ID of the CSF field backing the view - private int csfViewBaseFieldId; - private FeatureConfiguration csfViewFeatureConfiguration; - - // facet field settings - private String facetName; - private boolean storeFacetSkiplist; - private boolean storeFacetOffensiveCounters; - private boolean useCSFForFacetCounting; - - // Determines if this field is indexed - private boolean indexedField = false; - - // search field settings - // whether a field should be searched by default - private boolean textSearchableByDefault = false; - private float textSearchableFieldWeight = 1.0f; - - // For indexed numerical fields - private IndexedNumericFieldSettings numericFieldSettings = null; - - public boolean isStorePerPositionPayloads() { - return storePerPositionPayloads; - } - - public void setStorePerPositionPayloads(boolean storePerPositionPayloads) { - checkIfFrozen(); - this.storePerPositionPayloads = storePerPositionPayloads; - } - - public int getDefaultPayloadLength() { - return defaultPayloadLength; - } - - public void setDefaultPayloadLength(int defaultPayloadLength) { - checkIfFrozen(); - this.defaultPayloadLength = defaultPayloadLength; - } - - public boolean becomesImmutable() { - return becomesImmutable; - } - - public void setBecomesImmutable(boolean becomesImmutable) { - checkIfFrozen(); - this.becomesImmutable = becomesImmutable; - } - - public boolean isSupportOrderedTerms() { - return supportOrderedTerms; - } - - public void setSupportOrderedTerms(boolean supportOrderedTerms) { - checkIfFrozen(); - this.supportOrderedTerms = supportOrderedTerms; - } - - public boolean isSupportTermTextLookup() { - return supportTermTextLookup; - } - - public void setSupportTermTextLookup(boolean supportTermTextLookup) { - this.supportTermTextLookup = supportTermTextLookup; - } - - @Nullable - public TokenStreamSerializer getTokenStreamSerializer() { - return tokenStreamSerializerProvider == null ? null : tokenStreamSerializerProvider.safeBuild(); - } - - public void setTokenStreamSerializerBuilder(TokenStreamSerializer.Builder provider) { - checkIfFrozen(); - this.tokenStreamSerializerProvider = provider; - } - - public ThriftCSFType getCsfType() { - return csfType; - } - - public void setCsfType(ThriftCSFType csfType) { - checkIfFrozen(); - this.csfType = csfType; - } - - public boolean isCsfVariableLength() { - return csfVariableLength; - } - - public int getCsfFixedLengthNumValuesPerDoc() { - return csfFixedLengthNumValuesPerDoc; - } - - public void setCsfVariableLength() { - checkIfFrozen(); - this.csfVariableLength = true; - } - - /** - * Make the field a fixed length CSF, with the given length. 
- */ - public void setCsfFixedLengthSettings(int csfFixedLengthNumValuesPerDocument, - boolean isCsfFixedLengthUpdateable) { - checkIfFrozen(); - this.csfVariableLength = false; - this.csfFixedLengthNumValuesPerDoc = csfFixedLengthNumValuesPerDocument; - this.csfFixedLengthUpdateable = isCsfFixedLengthUpdateable; - } - - public boolean isCsfFixedLengthUpdateable() { - return csfFixedLengthUpdateable; - } - - public boolean isCsfLoadIntoRam() { - return csfLoadIntoRam; - } - - public void setCsfLoadIntoRam(boolean csfLoadIntoRam) { - checkIfFrozen(); - this.csfLoadIntoRam = csfLoadIntoRam; - } - - public void setCsfDefaultValue(long defaultValue) { - checkIfFrozen(); - this.csfDefaultValue = defaultValue; - this.csfDefaultValueSet = true; - } - - public long getCsfDefaultValue() { - return csfDefaultValue; - } - - public boolean isCsfDefaultValueSet() { - return csfDefaultValueSet; - } - - public String getFacetName() { - return facetName; - } - - public void setFacetName(String facetName) { - checkIfFrozen(); - this.facetName = facetName; - } - - public boolean isStoreFacetSkiplist() { - return storeFacetSkiplist; - } - - public void setStoreFacetSkiplist(boolean storeFacetSkiplist) { - checkIfFrozen(); - this.storeFacetSkiplist = storeFacetSkiplist; - } - - public boolean isStoreFacetOffensiveCounters() { - return storeFacetOffensiveCounters; - } - - public void setStoreFacetOffensiveCounters(boolean storeFacetOffensiveCounters) { - checkIfFrozen(); - this.storeFacetOffensiveCounters = storeFacetOffensiveCounters; - } - - public boolean isUseCSFForFacetCounting() { - return useCSFForFacetCounting; - } - - public void setUseCSFForFacetCounting(boolean useCSFForFacetCounting) { - checkIfFrozen(); - this.useCSFForFacetCounting = useCSFForFacetCounting; - } - - public boolean isFacetField() { - return facetName != null && !StringUtils.isEmpty(facetName); - } - - public boolean isIndexHFTermPairs() { - return indexHFTermPairs; - } - - public void setIndexHFTermPairs(boolean indexHFTermPairs) { - checkIfFrozen(); - this.indexHFTermPairs = indexHFTermPairs; - } - - public boolean acceptPretokenizedField() { - return tokenStreamSerializerProvider != null; - } - - /** - * set this field to use additional twitter specific tokenization. - * @deprecated should avoid doing additional tokenizations on top of what ingester produced. - */ - @Deprecated - public boolean useTweetSpecificNormalization() { - return useTweetSpecificNormalization; - } - - /** - * test whether this field uses additional twitter specific tokenization. - * @deprecated should avoid doing additional tokenizations on top of what ingester produced. 
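A sketch of how a mutable EarlybirdFieldType is configured as a fixed-length, updateable CSF and then frozen, mirroring the static LONG/INT/BYTE_CSF_FIELD_TYPE initializers earlier in this class; the four-values-per-document layout is hypothetical:

```java
EarlybirdFieldType packedCountsType = new EarlybirdFieldType();
packedCountsType.setCsfType(ThriftCSFType.INT);
packedCountsType.setDocValuesType(DocValuesType.NUMERIC);
packedCountsType.setCsfFixedLengthSettings(/* numValuesPerDoc */ 4, /* updateable */ true);
packedCountsType.setCsfLoadIntoRam(true);
packedCountsType.freeze();  // the setters call checkIfFrozen(), so later mutation fails
```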
- */ - @Deprecated - public void setUseTweetSpecificNormalization(boolean useTweetSpecificNormalization) { - checkIfFrozen(); - this.useTweetSpecificNormalization = useTweetSpecificNormalization; - } - - public boolean isIndexedField() { - return indexedField; - } - - public void setIndexedField(boolean indexedField) { - this.indexedField = indexedField; - } - - public boolean isTextSearchableByDefault() { - return textSearchableByDefault; - } - - public void setTextSearchableByDefault(boolean textSearchableByDefault) { - checkIfFrozen(); - this.textSearchableByDefault = textSearchableByDefault; - } - - public float getTextSearchableFieldWeight() { - return textSearchableFieldWeight; - } - - public void setTextSearchableFieldWeight(float textSearchableFieldWeight) { - checkIfFrozen(); - this.textSearchableFieldWeight = textSearchableFieldWeight; - } - - /** - * Convenience method to find out if this field stores positions. {@link #indexOptions()} can also - * be used to determine the index options for this field. - */ - public final boolean hasPositions() { - return indexOptions() == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS - || indexOptions() == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS; - } - - public boolean isCsfViewField() { - return csfViewField; - } - - public int getCsfViewBaseFieldId() { - return csfViewBaseFieldId; - } - - public FeatureConfiguration getCsfViewFeatureConfiguration() { - return csfViewFeatureConfiguration; - } - - /** - * Set the CSF view settings. A CSF view is a portion of an another CSF. - */ - public void setCsfViewSettings(String fieldName, - ThriftCSFViewSettings csfViewSettings, - Schema.FieldInfo baseField) { - checkIfFrozen(); - this.csfViewField = true; - this.csfViewBaseFieldId = csfViewSettings.getBaseFieldConfigId(); - FeatureConfiguration.Builder builder = FeatureConfiguration.builder() - .withName(fieldName) - .withType(csfViewSettings.csfType) - .withBitRange(csfViewSettings.getValueIndex(), - csfViewSettings.getBitStartPosition(), - csfViewSettings.getBitLength()) - .withBaseField(baseField.getName()); - if (csfViewSettings.isSetOutputCSFType()) { - builder.withOutputType(csfViewSettings.getOutputCSFType()); - } - if (csfViewSettings.isSetNormalizationType()) { - builder.withFeatureNormalizationType(csfViewSettings.getNormalizationType()); - } - if (csfViewSettings.isSetFeatureUpdateConstraints()) { - for (ThriftFeatureUpdateConstraint c : csfViewSettings.getFeatureUpdateConstraints()) { - builder.withFeatureUpdateConstraint(c); - } - } - - this.csfViewFeatureConfiguration = builder.build(); - } - - public IndexedNumericFieldSettings getNumericFieldSettings() { - return numericFieldSettings; - } - - public void setNumericFieldSettings(IndexedNumericFieldSettings numericFieldSettings) { - checkIfFrozen(); - this.numericFieldSettings = numericFieldSettings; - } -} diff --git a/src/java/com/twitter/search/common/schema/base/FeatureConfiguration.java b/src/java/com/twitter/search/common/schema/base/FeatureConfiguration.java deleted file mode 100644 index 74cddf020..000000000 --- a/src/java/com/twitter/search/common/schema/base/FeatureConfiguration.java +++ /dev/null @@ -1,316 +0,0 @@ -package com.twitter.search.common.schema.base; - -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Sets; - -import com.twitter.common.base.MorePreconditions; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import 
com.twitter.search.common.schema.thriftjava.ThriftFeatureNormalizationType; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureUpdateConstraint; - -// FeatureConfiguration is defined for all the column stride view fields. -public final class FeatureConfiguration { - private final String name; - private final int intIndex; - // Start position in the given int (0-31) - private final int bitStartPos; - // Length in bits of the feature - private final int bitLength; - // precomputed for reuse - private final int bitMask; - private final int inverseBitMask; - private final int maxValue; - - private final ThriftCSFType type; - - // This is the client seen feature type: if this is null, this field is unused. - @Nullable - private final ThriftCSFType outputType; - - private final String baseField; - - private final Set featureUpdateConstraints; - - private final ThriftFeatureNormalizationType featureNormalizationType; - - /** - * Creates a new FeatureConfiguration with a base field. - * - * @param intIndex which integer is the feature in (0 based). - * @param bitStartPos at which bit does the feature start (0-31). - * @param bitLength length in bits of the feature - * @param baseField the CSF this feature is stored within. - */ - private FeatureConfiguration( - String name, - ThriftCSFType type, - ThriftCSFType outputType, - int intIndex, - int bitStartPos, - int bitLength, - String baseField, - Set featureUpdateConstraints, - ThriftFeatureNormalizationType featureNormalizationType) { - Preconditions.checkState(bitStartPos + bitLength <= Integer.SIZE, - "Feature must not cross int boundary."); - this.name = MorePreconditions.checkNotBlank(name); - this.type = Preconditions.checkNotNull(type); - this.outputType = outputType; - this.intIndex = intIndex; - this.bitStartPos = bitStartPos; - this.bitLength = bitLength; - // Technically, int-sized features can use all 32 bits to store a positive value greater than - // Integer.MAX_VALUE. But in practice, we will convert the values of those features to Java ints - // on the read side, so the max value for those features will still be Integer.MAX_VALUE. - this.maxValue = (1 << Math.min(bitLength, Integer.SIZE - 1)) - 1; - this.bitMask = (int) (((1L << bitLength) - 1) << bitStartPos); - this.inverseBitMask = ~bitMask; - this.baseField = baseField; - this.featureUpdateConstraints = featureUpdateConstraints; - this.featureNormalizationType = Preconditions.checkNotNull(featureNormalizationType); - } - - public String getName() { - return name; - } - - public int getMaxValue() { - return maxValue; - } - - @Override - public String toString() { - return new StringBuilder().append(name) - .append(" (").append(intIndex).append(", ") - .append(bitStartPos).append(", ") - .append(bitLength).append(") ").toString(); - } - - public int getValueIndex() { - return intIndex; - } - - public int getBitStartPosition() { - return bitStartPos; - } - - public int getBitLength() { - return bitLength; - } - - public int getBitMask() { - return bitMask; - } - - public int getInverseBitMask() { - return inverseBitMask; - } - - public String getBaseField() { - return baseField; - } - - public ThriftCSFType getType() { - return type; - } - - @Nullable - public ThriftCSFType getOutputType() { - return outputType; - } - - public ThriftFeatureNormalizationType getFeatureNormalizationType() { - return featureNormalizationType; - } - - /** - * Returns the update constraint for the feature. 
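The constructor above packs each feature into a bit range of one of the base CSF's ints and precomputes a mask, inverse mask, and maximum value for it. A small standalone sketch of that arithmetic, for an illustrative feature occupying bits 4..11 of its int (bitStartPos = 4, bitLength = 8); the read/write pattern shown in the comments is how such masks are typically used, while the actual packing lives in IntegerEncodedFeatures:

```java
public class FeatureBitsExample {
  public static void main(String[] args) {
    int bitStartPos = 4;
    int bitLength = 8;

    // Same formulas as the FeatureConfiguration constructor above.
    int maxValue = (1 << Math.min(bitLength, Integer.SIZE - 1)) - 1; // 255
    int bitMask = (int) (((1L << bitLength) - 1) << bitStartPos);    // 0x00000FF0
    int inverseBitMask = ~bitMask;                                   // 0xFFFFF00F

    // Writing a value into the packed int: clear the feature's bits, then OR in the new value.
    int packed = 0xABCD0003;
    int newValue = 42;
    packed = (packed & inverseBitMask) | ((newValue << bitStartPos) & bitMask);

    // Reading it back: mask and shift.
    int readBack = (packed & bitMask) >>> bitStartPos; // 42
    System.out.printf("maxValue=%d mask=0x%08X value=%d%n", maxValue, bitMask, readBack);
  }
}
```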
- */ - public Set getUpdateConstraints() { - if (featureUpdateConstraints == null) { - return null; - } - Set constraintSet = Sets.newHashSet(); - for (FeatureConstraint constraint : featureUpdateConstraints) { - constraintSet.add(constraint.getType()); - } - return constraintSet; - } - - /** - * Returns true if the given update satisfies all feature update constraints. - */ - public boolean validateFeatureUpdate(final Number oldValue, final Number newValue) { - if (featureUpdateConstraints != null) { - for (FeatureConstraint contraint : featureUpdateConstraints) { - if (!contraint.apply(oldValue, newValue)) { - return false; - } - } - } - - return true; - } - - @Override - public int hashCode() { - return (name == null ? 0 : name.hashCode()) - + intIndex * 7 - + bitStartPos * 13 - + bitLength * 23 - + bitMask * 31 - + inverseBitMask * 43 - + (int) maxValue * 53 - + (type == null ? 0 : type.hashCode()) * 61 - + (outputType == null ? 0 : outputType.hashCode()) * 71 - + (baseField == null ? 0 : baseField.hashCode()) * 83 - + (featureUpdateConstraints == null ? 0 : featureUpdateConstraints.hashCode()) * 87 - + (featureNormalizationType == null ? 0 : featureNormalizationType.hashCode()) * 97; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof FeatureConfiguration)) { - return false; - } - - FeatureConfiguration featureConfiguration = FeatureConfiguration.class.cast(obj); - return (name == featureConfiguration.name) - && (bitStartPos == featureConfiguration.bitStartPos) - && (bitLength == featureConfiguration.bitLength) - && (bitMask == featureConfiguration.bitMask) - && (inverseBitMask == featureConfiguration.inverseBitMask) - && (maxValue == featureConfiguration.maxValue) - && (type == featureConfiguration.type) - && (outputType == featureConfiguration.outputType) - && (baseField == featureConfiguration.baseField) - && (featureUpdateConstraints == null - ? featureConfiguration.featureUpdateConstraints == null - : featureUpdateConstraints.equals(featureConfiguration.featureUpdateConstraints)) - && (featureNormalizationType == null - ? 
featureConfiguration.featureNormalizationType == null - : featureNormalizationType.equals(featureConfiguration.featureNormalizationType)); - } - - private interface FeatureConstraint { - boolean apply(Number oldValue, Number newValue); - ThriftFeatureUpdateConstraint getType(); - } - - public static Builder builder() { - return new Builder(); - } - - public static final class Builder { - private String name; - private ThriftCSFType type; - private ThriftCSFType outputType; - private int intIndex; - // Start position in the given int (0-31) - private int bitStartPos; - // Length in bits of the feature - private int bitLength; - - private String baseField; - - private Set featureUpdateConstraints; - - private ThriftFeatureNormalizationType featureNormalizationType = - ThriftFeatureNormalizationType.NONE; - - public FeatureConfiguration build() { - return new FeatureConfiguration(name, type, outputType, intIndex, bitStartPos, bitLength, - baseField, featureUpdateConstraints, featureNormalizationType); - } - - public Builder withName(String n) { - this.name = n; - return this; - } - - public Builder withType(ThriftCSFType featureType) { - this.type = featureType; - return this; - } - - public Builder withOutputType(ThriftCSFType featureFeatureType) { - this.outputType = featureFeatureType; - return this; - } - - public Builder withFeatureNormalizationType( - ThriftFeatureNormalizationType normalizationType) { - this.featureNormalizationType = Preconditions.checkNotNull(normalizationType); - return this; - } - - /** - * Sets the bit range at the given intIndex, startPos and length. - */ - public Builder withBitRange(int index, int startPos, int length) { - this.intIndex = index; - this.bitStartPos = startPos; - this.bitLength = length; - return this; - } - - public Builder withBaseField(String baseFieldName) { - this.baseField = baseFieldName; - return this; - } - - /** - * Adds a feature update constraint. 
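The builder above is how feature configurations are assembled: a name, a CSF type, a bit range within the base field, and optionally update constraints. A sketch of building a configuration for an illustrative counter that lives in int 3, bits 0..7 of the encoded_tweet_features base field and may only increase; the ThriftCSFType value is passed in as a parameter because its enum constants are defined in thrift rather than in this file, and the layout numbers are illustrative:

```java
import com.twitter.search.common.schema.base.FeatureConfiguration;
import com.twitter.search.common.schema.thriftjava.ThriftCSFType;
import com.twitter.search.common.schema.thriftjava.ThriftFeatureUpdateConstraint;

public final class FeatureConfigExample {
  static FeatureConfiguration retweetCountConfig(ThriftCSFType csfType) {
    return FeatureConfiguration.builder()
        .withName("RETWEET_COUNT")
        .withType(csfType)
        .withBitRange(3 /* intIndex */, 0 /* bitStartPos */, 8 /* bitLength */)
        .withBaseField("encoded_tweet_features")
        .withFeatureUpdateConstraint(ThriftFeatureUpdateConstraint.INC_ONLY)
        .build();
  }

  static void demo(ThriftCSFType csfType) {
    FeatureConfiguration config = retweetCountConfig(csfType);
    // INC_ONLY: an update from 5 to 7 is allowed, 7 to 5 is rejected.
    boolean allowed = config.validateFeatureUpdate(5, 7);   // true
    boolean rejected = config.validateFeatureUpdate(7, 5);  // false
    System.out.println(allowed + " " + rejected + " maxValue=" + config.getMaxValue());
  }
}
```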
- */ - public Builder withFeatureUpdateConstraint(final ThriftFeatureUpdateConstraint constraint) { - if (featureUpdateConstraints == null) { - featureUpdateConstraints = Sets.newHashSet(); - } - - switch (constraint) { - case IMMUTABLE: - featureUpdateConstraints.add(new FeatureConstraint() { - @Override public boolean apply(Number oldValue, Number newValue) { - return false; - } - @Override public ThriftFeatureUpdateConstraint getType() { - return ThriftFeatureUpdateConstraint.IMMUTABLE; - } - }); - break; - case INC_ONLY: - featureUpdateConstraints.add(new FeatureConstraint() { - @Override public boolean apply(Number oldValue, Number newValue) { - return newValue.intValue() > oldValue.intValue(); - } - @Override public ThriftFeatureUpdateConstraint getType() { - return ThriftFeatureUpdateConstraint.INC_ONLY; - } - }); - break; - case POSITIVE: - featureUpdateConstraints.add(new FeatureConstraint() { - @Override public boolean apply(Number oldValue, Number newValue) { - return newValue.intValue() >= 0; - } - @Override public ThriftFeatureUpdateConstraint getType() { - return ThriftFeatureUpdateConstraint.POSITIVE; - } - }); - break; - default: - } - - return this; - } - - private Builder() { - - } - } -} - diff --git a/src/java/com/twitter/search/common/schema/base/FieldNameToIdMapping.java b/src/java/com/twitter/search/common/schema/base/FieldNameToIdMapping.java deleted file mode 100644 index 4a4db3bab..000000000 --- a/src/java/com/twitter/search/common/schema/base/FieldNameToIdMapping.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.common.schema.base; - -import java.util.Map; - -import com.google.common.collect.ImmutableMap; - -/** - * Maps from fieldName to fieldIDs. - */ -public abstract class FieldNameToIdMapping { - /** - * Returns field ID for the given fieldName. - * Can throw unchecked exceptions is the fieldName is not known to Earlybird. - */ - public abstract int getFieldID(String fieldName); - - /** - * Wrap the given map into a fieldNameToIdMapping instance. - */ - public static FieldNameToIdMapping newFieldNameToIdMapping(Map map) { - final ImmutableMap immutableMap = ImmutableMap.copyOf(map); - return new FieldNameToIdMapping() { - @Override public int getFieldID(String fieldName) { - return immutableMap.get(fieldName); - } - }; - } -} diff --git a/src/java/com/twitter/search/common/schema/base/FieldWeightDefault.java b/src/java/com/twitter/search/common/schema/base/FieldWeightDefault.java deleted file mode 100644 index ec3842a5e..000000000 --- a/src/java/com/twitter/search/common/schema/base/FieldWeightDefault.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.common.schema.base; - -import java.util.LinkedHashMap; -import java.util.Map; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import static com.google.common.base.Preconditions.checkNotNull; - -/** - * Records whether a field's enabled for search by default and its default weight. Note that these - * two are decoupled -- a field can have a default weight but not enabled for search by default. - * In a query it can be enabled by an annotation that does not specify a weight (e.g., ":f:foo"), - * which would then use the default weight. - * - * Instances are mutable. 
- */ -public class FieldWeightDefault { - private final boolean enabled; - private final float weight; - - public FieldWeightDefault(boolean enabled, float weight) { - this.enabled = enabled; - this.weight = weight; - } - - public static FieldWeightDefault fromSignedWeight(float signedValue) { - return new FieldWeightDefault(signedValue >= 0, Math.abs(signedValue)); - } - - /** - * Returns an immutable map from field name to default field weights for only enabled fields. - * Fields that are not enabled for search by default will not be included. - */ - public static ImmutableMap getOnlyEnabled( - Map map) { - - ImmutableMap.Builder builder = ImmutableMap.builder(); - for (Map.Entry entry : map.entrySet()) { - if (entry.getValue().isEnabled()) { - builder.put(entry.getKey(), entry.getValue().getWeight()); - } - } - return builder.build(); - } - - public boolean isEnabled() { - return enabled; - } - - public float getWeight() { - return weight; - } - - /** - * Overlays the base field-weight map with the given one. Since it is an overlay, a - * field that does not exist in the base map will never be added. Also, negative value means - * the field is not enabled for search by default, but if it is, the absolute value would serve as - * the default. - */ - public static ImmutableMap overrideFieldWeightMap( - Map base, - Map fieldWeightMapOverride) { - - checkNotNull(base); - if (fieldWeightMapOverride == null) { - return ImmutableMap.copyOf(base); - } - - LinkedHashMap map = Maps.newLinkedHashMap(base); - for (Map.Entry entry : fieldWeightMapOverride.entrySet()) { - if (base.containsKey(entry.getKey()) - && entry.getValue() >= -Float.MAX_VALUE - && entry.getValue() <= Float.MAX_VALUE) { - - map.put( - entry.getKey(), - FieldWeightDefault.fromSignedWeight(entry.getValue().floatValue())); - } - } - - return ImmutableMap.copyOf(map); - } - - /** - * Creates a field-to-FieldWeightDefault map from the given field-to-weight map, where negative - * weight means the the field is not enabled for search by default, but if it is (e.g., - * by annotation), the absolute value of the weight shall be used. - */ - public static ImmutableMap fromSignedWeightMap( - Map signedWeightMap) { - - ImmutableMap.Builder builder = ImmutableMap.builder(); - for (Map.Entry entry : signedWeightMap.entrySet()) { - // If double to float conversion failed, we will get a float infinity. - // See http://stackoverflow.com/a/10075093/716468 - float floatValue = entry.getValue().floatValue(); - if (floatValue != Float.NEGATIVE_INFINITY - && floatValue != Float.POSITIVE_INFINITY) { - - builder.put( - entry.getKey(), - FieldWeightDefault.fromSignedWeight(floatValue)); - } - } - - return builder.build(); - } -} diff --git a/src/java/com/twitter/search/common/schema/base/ImmutableSchemaInterface.java b/src/java/com/twitter/search/common/schema/base/ImmutableSchemaInterface.java deleted file mode 100644 index ea04b16e0..000000000 --- a/src/java/com/twitter/search/common/schema/base/ImmutableSchemaInterface.java +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.search.common.schema.base; - -import javax.annotation.concurrent.Immutable; -import javax.annotation.concurrent.ThreadSafe; - -/** - * This interface carries the same signature as Schema with the only difference that this schema - * is immutable. This should be used by short sessions and the class would guarantee the schema - * would not change for the session. A typical usage is like a search query session. 
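The signed-weight convention described in the class comment (a negative weight means "carries this default weight, but is not searched unless explicitly enabled, e.g. by a query annotation") is easiest to see in a small round trip. A sketch, with illustrative field names and weights and with the generic types reconstructed from the method bodies above:

```java
import java.util.Map;

import com.google.common.collect.ImmutableMap;

import com.twitter.search.common.schema.base.FieldWeightDefault;

public final class FieldWeightExample {
  public static void main(String[] args) {
    Map<String, Double> signedWeights = ImmutableMap.of(
        "text", 1.0,          // enabled by default, weight 1.0
        "card_title", -0.2);  // disabled by default, weight 0.2 if enabled via annotation

    Map<String, FieldWeightDefault> defaults =
        FieldWeightDefault.fromSignedWeightMap(signedWeights);

    // Only fields enabled by default survive here, mapped to their weights.
    Map<String, Float> enabledOnly = FieldWeightDefault.getOnlyEnabled(defaults);

    System.out.println(defaults.get("card_title").isEnabled()); // false
    System.out.println(enabledOnly);                            // {text=1.0}
  }
}
```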
- */ -@Immutable -@ThreadSafe -public interface ImmutableSchemaInterface extends Schema { -} diff --git a/src/java/com/twitter/search/common/schema/base/IndexedNumericFieldSettings.java b/src/java/com/twitter/search/common/schema/base/IndexedNumericFieldSettings.java deleted file mode 100644 index d436fc189..000000000 --- a/src/java/com/twitter/search/common/schema/base/IndexedNumericFieldSettings.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.common.schema.base; - -import com.twitter.search.common.schema.thriftjava.ThriftIndexedNumericFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftNumericType; - -public class IndexedNumericFieldSettings { - private final ThriftNumericType numericType; - private final int numericPrecisionStep; - private final boolean useTwitterFormat; - private final boolean useSortableEncoding; - - /** - * Create a IndexedNumericFieldSettings from a ThriftIndexedNumericFieldSettings - */ - public IndexedNumericFieldSettings(ThriftIndexedNumericFieldSettings numericFieldSettings) { - this.numericType = numericFieldSettings.getNumericType(); - this.numericPrecisionStep = numericFieldSettings.getNumericPrecisionStep(); - this.useTwitterFormat = numericFieldSettings.isUseTwitterFormat(); - this.useSortableEncoding = numericFieldSettings.isUseSortableEncoding(); - } - - public ThriftNumericType getNumericType() { - return numericType; - } - - public int getNumericPrecisionStep() { - return numericPrecisionStep; - } - - public boolean isUseTwitterFormat() { - return useTwitterFormat; - } - - public boolean isUseSortableEncoding() { - return useSortableEncoding; - } -} diff --git a/src/java/com/twitter/search/common/schema/base/Schema.java b/src/java/com/twitter/search/common/schema/base/Schema.java deleted file mode 100644 index 51f90bd29..000000000 --- a/src/java/com/twitter/search/common/schema/base/Schema.java +++ /dev/null @@ -1,231 +0,0 @@ -package com.twitter.search.common.schema.base; - -import java.util.Collection; -import java.util.Map; - -import javax.annotation.Nullable; - -import com.google.common.base.Predicate; -import com.google.common.collect.ImmutableCollection; -import com.google.common.collect.ImmutableMap; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.FieldInfos; - -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.schema.thriftjava.ThriftAnalyzer; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftFieldConfiguration; - -/** - * Search Schema. - */ -public interface Schema { - /** - * Certain Schema implementations can evolve at run time. This call returns a snapshot of - * of the schema which is guaranteed to not change. - */ - ImmutableSchemaInterface getSchemaSnapshot(); - - /** - * Returns a string describing the current schema version. - */ - String getVersionDescription(); - - /** - * Returns whether the schema version is official. Only official segments are uploaded to HDFS. - */ - boolean isVersionOfficial(); - - /** - * Returns the schema's major version. - */ - int getMajorVersionNumber(); - - /** - * Returns the schema's minor version. - */ - int getMinorVersionNumber(); - - /** - * Returns the default analyzer. This analyzer is used when none is specified on the field info. 
- */ - Analyzer getDefaultAnalyzer(ThriftAnalyzer override); - - /** - * Returns whether the given field is configured in the schema. - */ - boolean hasField(int fieldConfigId); - - /** - * Returns whether the given field is configured in the schema. - */ - boolean hasField(String fieldName); - - /** - * Get the field name corresponding to the given field id. - */ - String getFieldName(int fieldConfigId); - - /** - * Return the FieldInfo of all fields. - */ - ImmutableCollection getFieldInfos(); - - /** - * Get the field info for the given field id. If an override is given, attempt to merge the - * base field info with the override config. - */ - FieldInfo getFieldInfo(int fieldConfigId, ThriftFieldConfiguration override); - - - /** - * Get the field info for the given field id. No override. - */ - @Nullable - FieldInfo getFieldInfo(int fieldConfigId); - - /** - * Get the field info for the given field name. No override. - */ - @Nullable - FieldInfo getFieldInfo(String fieldName); - - /** - * Builds a lucene FieldInfos instance, usually used for indexing. - */ - FieldInfos getLuceneFieldInfos(Predicate acceptedFields); - - /** - * Returns the number of facet fields in this schema. - */ - int getNumFacetFields(); - - /** - * Return facet configurations. - */ - FacetsConfig getFacetsConfig(); - - /** - * Get the facet field's field info by facet name. - */ - FieldInfo getFacetFieldByFacetName(String facetName); - - /** - * Get the facet field's field info by field name. - */ - FieldInfo getFacetFieldByFieldName(String fieldName); - - /** - * Get the field infos for all facet fields. - */ - Collection getFacetFields(); - - /** - * Get the field infos for all facet fields backed by column stride fields. - */ - Collection getCsfFacetFields(); - - /** - * Get the field weight map for text searchable fields. - */ - Map getFieldWeightMap(); - - /** - * Get scoring feature configuration by feature name. - */ - FeatureConfiguration getFeatureConfigurationByName(String featureName); - - /** - * Get scoring feature configuration by feature field id. The feature configuration is - * guaranteed to be not null, or a NullPointerException will be thrown out. - */ - FeatureConfiguration getFeatureConfigurationById(int featureFieldId); - - /** - * Returns the ThriftCSFType for a CSF field. - * Note: for non-CSF field, null will be returned. - */ - @Nullable - ThriftCSFType getCSFFieldType(String fieldName); - - /** - * Get the search result feature schema for all possible features in all search results. - * - * The returned value is not really immutable (because it's a pre-generated thrift struct). - * We want to return it directly because we want to pre-build it once and return with the thrift - * search results as is. - */ - ThriftSearchFeatureSchema getSearchFeatureSchema(); - - /** - * Get the mapping from feature id to feature configuration. - */ - ImmutableMap getFeatureIdToFeatureConfig(); - - /** - * Get the mapping from feature name to feature configuration. - */ - ImmutableMap getFeatureNameToFeatureConfig(); - - /** - * Field configuration for a single field. 
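The Schema interface above is meant to be used through a snapshot: take one ImmutableSchemaInterface per request or session, then resolve fields and feature configurations against it so the schema cannot change mid-query. A sketch of that read pattern, with illustrative field and feature names:

```java
import javax.annotation.Nullable;

import com.twitter.search.common.schema.base.FeatureConfiguration;
import com.twitter.search.common.schema.base.ImmutableSchemaInterface;
import com.twitter.search.common.schema.base.Schema;

public final class SchemaLookupExample {
  @Nullable
  static FeatureConfiguration lookup(Schema schema) {
    // Freeze the view of the schema for the duration of this session.
    ImmutableSchemaInterface snapshot = schema.getSchemaSnapshot();

    if (!snapshot.hasField("text")) {
      return null;
    }
    Schema.FieldInfo textField = snapshot.getFieldInfo("text");
    System.out.println(textField.getDescription());

    // Feature configurations can be looked up by name (or by field id).
    return snapshot.getFeatureConfigurationByName("RETWEET_COUNT");
  }
}
```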
- */ - final class FieldInfo { - private final int fieldId; - private final String name; - private final EarlybirdFieldType luceneFieldType; - - public FieldInfo(int fieldId, String name, EarlybirdFieldType luceneFieldType) { - this.fieldId = fieldId; - this.name = name; - this.luceneFieldType = luceneFieldType; - } - - public int getFieldId() { - return fieldId; - } - - public String getName() { - return name; - } - - public EarlybirdFieldType getFieldType() { - return luceneFieldType; - } - - public String getDescription() { - return String.format( - "(FieldInfo [fieldId: %d, name: %s, luceneFieldType: %s])", - fieldId, name, luceneFieldType.getFacetName() - ); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof FieldInfo)) { - return false; - } - return fieldId == ((FieldInfo) obj).fieldId; - } - - @Override - public int hashCode() { - return fieldId; - } - } - - /** - * Exception thrown when errors or inconsistences are detected in a search schema. - */ - final class SchemaValidationException extends Exception { - public SchemaValidationException(String msg) { - super(msg); - } - - public SchemaValidationException(String msg, Exception e) { - super(msg, e); - } - } -} diff --git a/src/java/com/twitter/search/common/schema/base/ThriftDocumentUtil.java b/src/java/com/twitter/search/common/schema/base/ThriftDocumentUtil.java deleted file mode 100644 index 03f0e343e..000000000 --- a/src/java/com/twitter/search/common/schema/base/ThriftDocumentUtil.java +++ /dev/null @@ -1,146 +0,0 @@ -package com.twitter.search.common.schema.base; - -import java.util.ArrayList; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; - -/** - * Utility APIs for ThriftDocument. - */ -public final class ThriftDocumentUtil { - private ThriftDocumentUtil() { - } - - /** - * Get ThriftField out of a ThriftDocument. - */ - public static ThriftField getField(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - int id = idMap.getFieldID(fieldName); - for (ThriftField field : thriftDoc.getFields()) { - int fieldId = field.getFieldConfigId(); - if (fieldId == id) { - return field; - } - } - - return null; - } - - /** - * Get all fields out of a ThriftDocument that match the given field name. - */ - public static List getFields( - ThriftDocument thriftDoc, String fieldName, FieldNameToIdMapping idMap) { - - int id = idMap.getFieldID(fieldName); - List result = new ArrayList<>(); - - for (ThriftField field : thriftDoc.getFields()) { - int fieldId = field.getFieldConfigId(); - if (fieldId == id) { - result.add(field); - } - } - - return result; - } - - - /** - * Retrieve the long value from a thrift field - */ - public static long getLongValue(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - ThriftField f = getField(thriftDoc, fieldName, idMap); - return f == null ? 0L : f.getFieldData().getLongValue(); - } - - /** - * Retrieve the byte value from a thrift field - */ - public static byte getByteValue(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - ThriftField f = getField(thriftDoc, fieldName, idMap); - return f == null ? 
(byte) 0 : f.getFieldData().getByteValue(); - } - - /** - * Retrieve the bytes value from a thrift field - */ - public static byte[] getBytesValue(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - ThriftField f = getField(thriftDoc, fieldName, idMap); - return f == null ? null : f.getFieldData().getBytesValue(); - } - - /** - * Retrieve the int value from a thrift field - */ - public static int getIntValue(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - ThriftField f = getField(thriftDoc, fieldName, idMap); - return f == null ? 0 : f.getFieldData().getIntValue(); - } - - /** - * Retrieve the string value from a thrift field - */ - public static String getStringValue(ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - ThriftField f = getField(thriftDoc, fieldName, idMap); - return f == null ? null : f.getFieldData().getStringValue(); - } - - /** - * Retrieve the string values from all thrift fields with the given fieldName. - */ - public static List getStringValues( - ThriftDocument thriftDoc, - String fieldName, - FieldNameToIdMapping idMap) { - List fields = getFields(thriftDoc, fieldName, idMap); - List fieldStrings = new ArrayList<>(); - - for (ThriftField field : fields) { - fieldStrings.add(field.getFieldData().getStringValue()); - } - return fieldStrings; - } - - /** - * Returns whether the specified document has duplicate fields. - */ - public static boolean hasDuplicateFields(ThriftDocument thriftDoc) { - Set seen = new HashSet<>(); - for (ThriftField field : thriftDoc.getFields()) { - if (!seen.add(field.getFieldConfigId())) { - return true; - } - } - return false; - } - - /** - * Get ThriftField out of a ThriftDocument. - */ - public static ThriftField getField(ThriftDocument thriftDoc, int fieldId) { - for (ThriftField field : thriftDoc.getFields()) { - if (field.getFieldConfigId() == fieldId) { - return field; - } - } - - return null; - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/BUILD b/src/java/com/twitter/search/common/schema/earlybird/BUILD deleted file mode 100644 index e7f8ea032..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/BUILD +++ /dev/null @@ -1,93 +0,0 @@ -# Library for earlybird-specific schema. 
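The ThriftDocumentUtil helpers above all follow the same pattern: resolve a field name to its id through a FieldNameToIdMapping, then scan the document's fields for that id. A sketch of reading values out of a ThriftDocument this way; the ids used here ("id" = 0, "text" = 2, "hashtags" = 14) match the EarlybirdFieldConstant ids defined further down in this change, and the document itself is assumed to come from the ingester:

```java
import java.util.List;

import com.google.common.collect.ImmutableMap;

import com.twitter.search.common.schema.base.FieldNameToIdMapping;
import com.twitter.search.common.schema.base.ThriftDocumentUtil;
import com.twitter.search.common.schema.thriftjava.ThriftDocument;

public final class ThriftDocumentReadExample {
  static void dump(ThriftDocument doc) {
    FieldNameToIdMapping idMap = FieldNameToIdMapping.newFieldNameToIdMapping(
        ImmutableMap.of("id", 0, "text", 2, "hashtags", 14));

    long tweetId = ThriftDocumentUtil.getLongValue(doc, "id", idMap);
    String text = ThriftDocumentUtil.getStringValue(doc, "text", idMap);
    List<String> hashtags = ThriftDocumentUtil.getStringValues(doc, "hashtags", idMap);

    // Duplicate field ids usually indicate an ingester bug.
    boolean hasDupes = ThriftDocumentUtil.hasDuplicateFields(doc);
    System.out.println(tweetId + " " + text + " " + hashtags + " dupes=" + hasDupes);
  }
}
```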
-java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/joda-time", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:token-util", - "src/java/com/twitter/common_internal/text", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util:longintconverter", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/text/regex", - "src/java/com/twitter/search/common/util/url", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - ], - exports = [ - "src/thrift/com/twitter/search/common:indexing-java", - ], -) - -java_library( - name = "for-timelines", - sources = [ - "EarlybirdCluster.java", - "EarlybirdFieldConstants.java", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/joda-time", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:token-util", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util:longintconverter", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/text/regex", - 
"src/java/com/twitter/search/common/util/url", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - ], - exports = [ - "src/thrift/com/twitter/search/common:indexing-java", - ], -) diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdCluster.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdCluster.java deleted file mode 100644 index d956b341d..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdCluster.java +++ /dev/null @@ -1,90 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import java.util.Set; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableSet; - -/** - * A list of existing Earlybird clusters. - */ -public enum EarlybirdCluster { - /** - * Realtime earlybird cluster. Has 100% of tweet for about 7 days. - */ - REALTIME, - /** - * Protected earlybird cluster. Has only tweets from protected accounts. - */ - PROTECTED, - /** - * Full archive cluster. Has all tweets until about 2 days ago. - */ - FULL_ARCHIVE, - /** - * SuperRoot cluster. Talks to the other clusters instead of talking directly to earlybirds. - */ - SUPERROOT, - - /** - * A dedicated cluster for Candidate Generation use cases based on Earlybird in Home/PushService - */ - REALTIME_CG; - - public String getNameForStats() { - return name().toLowerCase(); - } - - public static boolean isArchive(EarlybirdCluster cluster) { - return isClusterInSet(cluster, ARCHIVE_CLUSTERS); - } - - public static boolean isTwitterMemoryFormatCluster(EarlybirdCluster cluster) { - return isClusterInSet(cluster, TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS); - } - - public static boolean hasEarlybirds(EarlybirdCluster cluster) { - return cluster != SUPERROOT; - } - - private static boolean isClusterInSet(EarlybirdCluster cluster, Set set) { - return set.contains(cluster); - } - - protected static final ImmutableSet ARCHIVE_CLUSTERS = - ImmutableSet.of(FULL_ARCHIVE); - - @VisibleForTesting - public static final ImmutableSet - TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS = - ImmutableSet.of( - REALTIME, - PROTECTED); - - @VisibleForTesting - public static final ImmutableSet TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS = - ImmutableSet.of( - REALTIME, - PROTECTED, - REALTIME_CG); - - /** - * Constant for field used in general purpose clusters, - * Note that GENERAL_PURPOSE_CLUSTERS does not include REALTIME_CG. 
If you wish to include REALTIME_CG, - * please use ALL_CLUSTERS - */ - protected static final ImmutableSet GENERAL_PURPOSE_CLUSTERS = - ImmutableSet.of( - REALTIME, - PROTECTED, - FULL_ARCHIVE, - SUPERROOT); - - protected static final ImmutableSet ALL_CLUSTERS = - ImmutableSet.of( - REALTIME, - PROTECTED, - FULL_ARCHIVE, - SUPERROOT, - REALTIME_CG); -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeatures.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeatures.java deleted file mode 100644 index e3ea16c23..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeatures.java +++ /dev/null @@ -1,148 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.encoding.features.IntegerEncodedFeatures; -import com.twitter.search.common.indexing.thriftjava.PackedFeatures; -import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures; -import com.twitter.search.common.schema.SchemaUtil; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; - -/** - * A class for encoding earlybird features in integers - */ -public abstract class EarlybirdEncodedFeatures extends IntegerEncodedFeatures { - private final ImmutableSchemaInterface schema; - private final EarlybirdFieldConstant baseField; - - public EarlybirdEncodedFeatures(ImmutableSchemaInterface schema, - EarlybirdFieldConstant baseField) { - this.schema = schema; - this.baseField = baseField; - } - - /** - * Write this object into packedFeatures of the given VersionedTweetFeatures. - */ - public void writeFeaturesToVersionedTweetFeatures( - VersionedTweetFeatures versionedTweetFeatures) { - if (!versionedTweetFeatures.isSetPackedFeatures()) { - versionedTweetFeatures.setPackedFeatures(new PackedFeatures()); - } - copyToPackedFeatures(versionedTweetFeatures.getPackedFeatures()); - } - - /** - * Write this object into extendedPackedFeatures of the given VersionedTweetFeatures. 
- */ - public void writeExtendedFeaturesToVersionedTweetFeatures( - VersionedTweetFeatures versionedTweetFeatures) { - if (!versionedTweetFeatures.isSetExtendedPackedFeatures()) { - versionedTweetFeatures.setExtendedPackedFeatures(new PackedFeatures()); - } - copyToPackedFeatures(versionedTweetFeatures.getExtendedPackedFeatures()); - } - - @Override - public String toString() { - StringBuilder ret = new StringBuilder(); - ret.append("Tweet features: \n"); - for (FeatureConfiguration feature - : EarlybirdSchemaCreateTool.FEATURE_CONFIGURATION_MAP.values()) { - ret.append(feature.getName()).append(": ").append(getFeatureValue(feature)).append("\n"); - } - return ret.toString(); - } - - public boolean isFlagSet(EarlybirdFieldConstant field) { - return isFlagSet(schema.getFeatureConfigurationById(field.getFieldId())); - } - - public int getFeatureValue(EarlybirdFieldConstant field) { - return getFeatureValue(schema.getFeatureConfigurationById(field.getFieldId())); - } - - public EarlybirdEncodedFeatures setFlag(EarlybirdFieldConstant field) { - setFlag(schema.getFeatureConfigurationById(field.getFieldId())); - return this; - } - - public EarlybirdEncodedFeatures clearFlag(EarlybirdFieldConstant field) { - clearFlag(schema.getFeatureConfigurationById(field.getFieldId())); - return this; - } - - public EarlybirdEncodedFeatures setFlagValue(EarlybirdFieldConstant field, - boolean value) { - setFlagValue(schema.getFeatureConfigurationById(field.getFieldId()), value); - return this; - } - - public EarlybirdEncodedFeatures setFeatureValue(EarlybirdFieldConstant field, - int value) { - setFeatureValue(schema.getFeatureConfigurationById(field.getFieldId()), value); - return this; - } - - public EarlybirdEncodedFeatures setFeatureValueIfGreater(EarlybirdFieldConstant field, - int value) { - setFeatureValueIfGreater(schema.getFeatureConfigurationById(field.getFieldId()), value); - return this; - } - - public boolean incrementIfNotMaximum(EarlybirdFieldConstant field) { - return incrementIfNotMaximum(schema.getFeatureConfigurationById(field.getFieldId())); - } - - private static final class ArrayEncodedTweetFeatures extends EarlybirdEncodedFeatures { - private final int[] encodedInts; - - private ArrayEncodedTweetFeatures(ImmutableSchemaInterface schema, - EarlybirdFieldConstant baseField) { - super(schema, baseField); - - final int numIntegers = SchemaUtil.getCSFFieldFixedLength(schema, baseField.getFieldId()); - Preconditions.checkState(numIntegers > 0); - this.encodedInts = new int[numIntegers]; - } - - @Override - public int getNumInts() { - return encodedInts.length; - } - - @Override - public int getInt(int pos) { - return encodedInts[pos]; - } - - @Override - public void setInt(int pos, int value) { - encodedInts[pos] = value; - } - } - - /** - * Create a new {@link EarlybirdEncodedFeatures} object based on schema and base field. - * @param schema the schema for all fields - * @param baseField base field's constant value - */ - public static EarlybirdEncodedFeatures newEncodedTweetFeatures( - ImmutableSchemaInterface schema, EarlybirdFieldConstant baseField) { - return new ArrayEncodedTweetFeatures(schema, baseField); - } - - /** - * Create a new {@link EarlybirdEncodedFeatures} object based on schema and base field name. 
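Putting the pieces of EarlybirdEncodedFeatures together, the write path allocates the int array backing the packed-features CSF, sets flags and counters by field constant, and copies the result into a VersionedTweetFeatures for indexing. A sketch of that flow; the schema instance is assumed to be the Earlybird schema that defines these features, and the specific flags and values are illustrative:

```java
import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures;
import com.twitter.search.common.schema.base.ImmutableSchemaInterface;
import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures;
import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant;

public final class EncodedFeaturesWriteExample {
  static VersionedTweetFeatures encode(ImmutableSchemaInterface schema) {
    EarlybirdEncodedFeatures features = EarlybirdEncodedFeatures.newEncodedTweetFeatures(
        schema, EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD);

    features.setFlag(EarlybirdFieldConstant.IS_REPLY_FLAG)
            .setFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT, 3);
    // Saturating increment: returns false once the feature's maximum value is reached.
    features.incrementIfNotMaximum(EarlybirdFieldConstant.RETWEET_COUNT);

    VersionedTweetFeatures versioned = new VersionedTweetFeatures();
    features.writeFeaturesToVersionedTweetFeatures(versioned);
    return versioned;
  }
}
```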
- * @param schema the schema for all fields - * @param baseFieldName base field's name - */ - public static EarlybirdEncodedFeatures newEncodedTweetFeatures( - ImmutableSchemaInterface schema, String baseFieldName) { - EarlybirdFieldConstant baseField = EarlybirdFieldConstants.getFieldConstant(baseFieldName); - Preconditions.checkNotNull(baseField); - return newEncodedTweetFeatures(schema, baseField); - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeaturesUtil.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeaturesUtil.java deleted file mode 100644 index d8330faca..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdEncodedFeaturesUtil.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import com.twitter.search.common.encoding.docvalues.CSFTypeUtil; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; - -public final class EarlybirdEncodedFeaturesUtil { - private EarlybirdEncodedFeaturesUtil() { - } - - /** - * Returns a byte array that can be stored in a ThriftDocument as bytesField. - */ - public static byte[] toBytesForThriftDocument(EarlybirdEncodedFeatures features) { - int numInts = features.getNumInts(); - byte[] serializedFeatures = new byte[numInts * Integer.BYTES]; - for (int i = 0; i < numInts; i++) { - CSFTypeUtil.convertToBytes(serializedFeatures, i, features.getInt(i)); - } - return serializedFeatures; - } - - /** - * Converts data in a given byte array (starting at the provided offset) into - * EarlybirdEncodedFeatures. - */ - public static EarlybirdEncodedFeatures fromBytes( - ImmutableSchemaInterface schema, EarlybirdFieldConstants.EarlybirdFieldConstant baseField, - byte[] data, int offset) { - EarlybirdEncodedFeatures features = EarlybirdEncodedFeatures.newEncodedTweetFeatures( - schema, baseField); - for (int idx = 0; idx < features.getNumInts(); ++idx) { - features.setInt(idx, CSFTypeUtil.convertFromBytes(data, offset, idx)); - } - return features; - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdFieldConstants.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdFieldConstants.java deleted file mode 100644 index 6ec044933..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdFieldConstants.java +++ /dev/null @@ -1,1132 +0,0 @@ - -package com.twitter.search.common.schema.earlybird; - -import java.util.Collection; -import java.util.EnumSet; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Sets; - -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.schema.ImmutableSchema; -import com.twitter.search.common.schema.SchemaBuilder; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureNormalizationType; - -/** - * Field names, field IDs etc. 
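EarlybirdEncodedFeaturesUtil above is the byte-level bridge: the packed ints are serialized so they can be carried in a ThriftDocument bytes field, then decoded back into EarlybirdEncodedFeatures on the read side. A sketch of that round trip, assuming a schema that defines the extended_encoded_tweet_features field and its features; the feature and value are illustrative:

```java
import com.twitter.search.common.schema.base.ImmutableSchemaInterface;
import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures;
import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeaturesUtil;
import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant;

public final class EncodedFeaturesRoundTripExample {
  static void roundTrip(ImmutableSchemaInterface schema) {
    EarlybirdFieldConstant baseField =
        EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD;

    EarlybirdEncodedFeatures original =
        EarlybirdEncodedFeatures.newEncodedTweetFeatures(schema, baseField);
    original.setFeatureValue(EarlybirdFieldConstant.QUOTE_COUNT, 7);

    byte[] bytes = EarlybirdEncodedFeaturesUtil.toBytesForThriftDocument(original);
    EarlybirdEncodedFeatures decoded =
        EarlybirdEncodedFeaturesUtil.fromBytes(schema, baseField, bytes, 0);

    // Should print 7: the value survives the int[] -> byte[] -> int[] conversion.
    System.out.println(decoded.getFeatureValue(EarlybirdFieldConstant.QUOTE_COUNT));
  }
}
```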
- */ -public class EarlybirdFieldConstants extends FieldNameToIdMapping { - @VisibleForTesting - public static final String ENCODED_TWEET_FEATURES_FIELD_NAME = "encoded_tweet_features"; - - @VisibleForTesting - public static final String EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME = - "extended_encoded_tweet_features"; - - private enum FlagFeatureFieldType { - NON_FLAG_FEATURE_FIELD, - FLAG_FEATURE_FIELD - } - - private enum UnusedFeatureFieldType { - USED_FEATURE_FIELD, - UNUSED_FEATURE_FIELD - } - - /** - * CSF_NAME_TO_MIN_ENGAGEMENT_FIELD_MAP and MIN_ENGAGEMENT_FIELD_TO_CSF_NAME_MAP are used in - * EarlybirdLuceneQueryVisitor to map the CSFs REPLY_COUNT, RETWEET_COUNT, and FAVORITE_COUNT to - * their respective min engagement fields, and vice versa. - */ - public static final ImmutableMap - CSF_NAME_TO_MIN_ENGAGEMENT_FIELD_MAP = ImmutableMap.builder() - .put(EarlybirdFieldConstant.REPLY_COUNT.getFieldName(), - EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD) - .put(EarlybirdFieldConstant.RETWEET_COUNT.getFieldName(), - EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD) - .put(EarlybirdFieldConstant.FAVORITE_COUNT.getFieldName(), - EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD) - .build(); - - public static final ImmutableMap - MIN_ENGAGEMENT_FIELD_TO_CSF_NAME_MAP = ImmutableMap.builder() - .put(EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - EarlybirdFieldConstant.REPLY_COUNT) - .put(EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - EarlybirdFieldConstant.RETWEET_COUNT) - .put(EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD - .getFieldName(), - EarlybirdFieldConstant.FAVORITE_COUNT) - .build(); - - /** - * A list of Earlybird field names and field IDs, and the clusters that need them. - */ - public enum EarlybirdFieldConstant { - // These enums are grouped by category and sorted alphabetically. - // Next indexed field ID is 76 - // Next CSF field ID is 115 - // Next encoded_features CSF field ID is 185 - // Next extended_encoded_features CSF field ID is 284 - - // Text searchable fields - // Provides slow ID Mapping from tweet ID to doc ID through TermsEnum.seekExact(). - ID_FIELD("id", 0, EarlybirdCluster.ALL_CLUSTERS), - RESOLVED_LINKS_TEXT_FIELD("resolved_links_text", 1), - TEXT_FIELD("text", 2), - TOKENIZED_FROM_USER_FIELD("tokenized_from_user", 3), - - // Other indexed fields - CARD_TITLE_FIELD("card_title", 4), - CARD_DESCRIPTION_FIELD("card_description", 5), - // We require the createdAt field to be set so we can properly filter tweets based on time. - CREATED_AT_FIELD("created_at", 6, EarlybirdCluster.ALL_CLUSTERS), - // 7 was formerly EVENT_IDS_FIELD("event_ids", 7, EarlybirdCluster.REALTIME) - ENTITY_ID_FIELD("entity_id", 40), - // The screen name of the user that created the tweet. Should be set to the normalized value in - // the com.twitter.gizmoduck.thriftjava.Profile.screen_name field. - FROM_USER_FIELD("from_user", 8), - // The numeric ID of the user that created the tweet. 
- FROM_USER_ID_FIELD("from_user_id", 9, EarlybirdCluster.ALL_CLUSTERS), - CARD_DOMAIN_FIELD("card_domain", 11), - CARD_NAME_FIELD("card_name", 12), - GEO_HASH_FIELD("geo_hash", 13), - HASHTAGS_FIELD("hashtags", 14), - HF_PHRASE_PAIRS_FIELD(ImmutableSchema.HF_PHRASE_PAIRS_FIELD, 15), - HF_TERM_PAIRS_FIELD(ImmutableSchema.HF_TERM_PAIRS_FIELD, 16), - IMAGE_LINKS_FIELD("image_links", 17), - IN_REPLY_TO_TWEET_ID_FIELD("in_reply_to_tweet_id", 59), - IN_REPLY_TO_USER_ID_FIELD("in_reply_to_user_id", 38), - // The internal field is used for many purposes: - // 1. to store facet skiplists - // 2. to power the filter operator, by storing posting list for terms like __filter_twimg - // 3. to store posting lists for positive and negative smileys - // 4. to store geo location types. - // etc. - INTERNAL_FIELD("internal", 18, EarlybirdCluster.ALL_CLUSTERS), - ISO_LANGUAGE_FIELD("iso_lang", 19), - LINK_CATEGORY_FIELD("link_category", 36), - LINKS_FIELD("links", 21), - MENTIONS_FIELD("mentions", 22), - // Field 23 used to be NAMED_ENTITIES_FIELD - NEWS_LINKS_FIELD("news_links", 24), - NORMALIZED_SOURCE_FIELD("norm_source", 25), - PLACE_FIELD("place", 26), - // Field 37 used to be PUBLICLY_INFERRED_USER_LOCATION_PLACE_ID_FIELD - // The ID of the source tweet. Set for retweets only. - RETWEET_SOURCE_TWEET_ID_FIELD("retweet_source_tweet_id", 60, - EarlybirdCluster.ALL_CLUSTERS), - // The ID of the source tweet's author. Set for retweets only. - RETWEET_SOURCE_USER_ID_FIELD("retweet_source_user_id", 39), - SOURCE_FIELD("source", 29), - STOCKS_FIELD("stocks", 30), - // The screen name of the user that a tweet was directed at. - TO_USER_FIELD("to_user", 32), - // Field 33 used to be TOPIC_IDS_FIELD and is now unused. It can be reused later. - TWIMG_LINKS_FIELD("twimg_links", 34), - VIDEO_LINKS_FIELD("video_links", 35), - CAMELCASE_USER_HANDLE_FIELD("camelcase_tokenized_from_user", 41), - // This field should be set to the the tokenized and normalized value in the - // com.twitter.gizmoduck.thriftjava.Profile.name field. - TOKENIZED_USER_NAME_FIELD("tokenized_from_user_display_name", 42), - CONVERSATION_ID_FIELD("conversation_id", 43), - PLACE_ID_FIELD("place_id", 44), - PLACE_FULL_NAME_FIELD("place_full_name", 45), - PLACE_COUNTRY_CODE_FIELD("place_country_code", 46), - PROFILE_GEO_COUNTRY_CODE_FIELD("profile_geo_country_code", 47), - PROFILE_GEO_REGION_FIELD("profile_geo_region", 48), - PROFILE_GEO_LOCALITY_FIELD("profile_geo_locality", 49), - LIKED_BY_USER_ID_FIELD("liked_by_user_id", 50, EarlybirdCluster.REALTIME), - NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD( - "normalized_reply_count_greater_than_or_equal_to", 51, EarlybirdCluster.FULL_ARCHIVE), - NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD( - "normalized_retweet_count_greater_than_or_equal_to", 52, EarlybirdCluster.FULL_ARCHIVE), - NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD( - "normalized_favorite_count_greater_than_or_equal_to", 53, EarlybirdCluster.FULL_ARCHIVE), - COMPOSER_SOURCE("composer_source", 54), - QUOTED_TWEET_ID_FIELD("quoted_tweet_id", 55), - QUOTED_USER_ID_FIELD("quoted_user_id", 56), - RETWEETED_BY_USER_ID("retweeted_by_user_id", 57, EarlybirdCluster.REALTIME), - REPLIED_TO_BY_USER_ID("replied_to_by_user_id", 58, EarlybirdCluster.REALTIME), - CARD_LANG("card_lang", 61), - // SEARCH-27823: Field ID 62 used to be named_entity, which was the combination of all - // named_entity* fields below. We need to leave 62 unused for backwards compatibility. 
- NAMED_ENTITY_FROM_URL_FIELD("named_entity_from_url", 63), - NAMED_ENTITY_FROM_TEXT_FIELD("named_entity_from_text", 64), - NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD("named_entity_with_type_from_url", 65), - NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD("named_entity_with_type_from_text", 66), - DIRECTED_AT_USER_ID_FIELD("directed_at_user_id", 67), - SPACE_ID_FIELD("space_id", 68, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - SPACE_TITLE_FIELD("space_title", 69, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - - // Detailed description of the space admin fields can be found at go/earlybirdfields. - SPACE_ADMIN_FIELD("space_admin", 70, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - TOKENIZED_SPACE_ADMIN_FIELD("tokenized_space_admin", 71, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - CAMELCASE_TOKENIZED_SPACE_ADMIN_FIELD("camelcase_tokenized_space_admin", 72, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - TOKENIZED_SPACE_ADMIN_DISPLAY_NAME_FIELD("tokenized_space_admin_display_name", 73, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_GENERAL_PURPOSE_CLUSTERS), - URL_DESCRIPTION_FIELD("url_description", 74), - URL_TITLE_FIELD("url_title", 75), - - // CSF - CARD_TYPE_CSF_FIELD("card_type_csf", 100), - ENCODED_TWEET_FEATURES_FIELD(ENCODED_TWEET_FEATURES_FIELD_NAME, 102, - EarlybirdCluster.ALL_CLUSTERS), - // Provides the doc ID -> original tweet ID mapping for retweets. - SHARED_STATUS_ID_CSF("shared_status_id_csf", 106, EarlybirdCluster.ALL_CLUSTERS), - // Provides the doc ID -> tweet author's user ID mapping. - FROM_USER_ID_CSF("from_user_id_csf", 103, EarlybirdCluster.ALL_CLUSTERS), - CREATED_AT_CSF_FIELD("created_at_csf", 101, EarlybirdCluster.ARCHIVE_CLUSTERS), - // Provides the doc ID -> tweet ID mapping. 
- ID_CSF_FIELD("id_csf", 104, EarlybirdCluster.ARCHIVE_CLUSTERS), - LAT_LON_CSF_FIELD("latlon_csf", 105), - CONVERSATION_ID_CSF("conversation_id_csf", 107, EarlybirdCluster.ALL_CLUSTERS), - QUOTED_TWEET_ID_CSF("quoted_tweet_id_csf", 108), - QUOTED_USER_ID_CSF("quoted_user_id_csf", 109), - CARD_LANG_CSF("card_lang_csf", 110), - DIRECTED_AT_USER_ID_CSF("directed_at_user_id_csf", 111), - REFERENCE_AUTHOR_ID_CSF("reference_author_id_csf", 112), - EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF("exclusive_conversation_author_id_csf", 113), - CARD_URI_CSF("card_uri_csf", 114), - - // CSF Views on top of ENCODED_TWEET_FEATURES_FIELD - IS_RETWEET_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_RETWEET_FLAG", 150, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_OFFENSIVE_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_OFFENSIVE_FLAG", 151, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_LINK_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_LINK_FLAG", 152, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_TREND_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_TREND_FLAG", 153, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_REPLY_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_REPLY_FLAG", 154, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_SENSITIVE_CONTENT(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_SENSITIVE_CONTENT", 155, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, - "HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG", 156, FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.ALL_CLUSTERS), - FROM_VERIFIED_ACCOUNT_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "FROM_VERIFIED_ACCOUNT_FLAG", - 157, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - TEXT_SCORE(ENCODED_TWEET_FEATURES_FIELD_NAME, "TEXT_SCORE", 158, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - LANGUAGE(ENCODED_TWEET_FEATURES_FIELD_NAME, "LANGUAGE", 159, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - LINK_LANGUAGE(ENCODED_TWEET_FEATURES_FIELD_NAME, "LINK_LANGUAGE", 160, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_IMAGE_URL_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_IMAGE_URL_FLAG", 161, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_VIDEO_URL_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_VIDEO_URL_FLAG", 162, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_NEWS_URL_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_NEWS_URL_FLAG", 163, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_EXPANDO_CARD_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_EXPANDO_CARD_FLAG", 164, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_MULTIPLE_MEDIA_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_MULTIPLE_MEDIA_FLAG", 165, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - PROFILE_IS_EGG_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "PROFILE_IS_EGG_FLAG", 166, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - NUM_MENTIONS(ENCODED_TWEET_FEATURES_FIELD_NAME, "NUM_MENTIONS", 167, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - NUM_HASHTAGS(ENCODED_TWEET_FEATURES_FIELD_NAME, "NUM_HASHTAGS", 168, - 
FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_CARD_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_CARD_FLAG", 169, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_VISIBLE_LINK_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_VISIBLE_LINK_FLAG", 170, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - USER_REPUTATION(ENCODED_TWEET_FEATURES_FIELD_NAME, "USER_REPUTATION", 171, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_USER_SPAM_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_USER_SPAM_FLAG", 172, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_USER_NSFW_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_USER_NSFW_FLAG", 173, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_USER_BOT_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_USER_BOT_FLAG", 174, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - IS_USER_NEW_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_USER_NEW_FLAG", 175, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - PREV_USER_TWEET_ENGAGEMENT(ENCODED_TWEET_FEATURES_FIELD_NAME, "PREV_USER_TWEET_ENGAGEMENT", - 176, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - COMPOSER_SOURCE_IS_CAMERA_FLAG( - ENCODED_TWEET_FEATURES_FIELD_NAME, - "COMPOSER_SOURCE_IS_CAMERA_FLAG", - 177, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.ALL_CLUSTERS), - RETWEET_COUNT( - ENCODED_TWEET_FEATURES_FIELD_NAME, - "RETWEET_COUNT", - 178, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER_WITH_LOG2), - FAVORITE_COUNT( - ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAVORITE_COUNT", - 179, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER_WITH_LOG2), - REPLY_COUNT( - ENCODED_TWEET_FEATURES_FIELD_NAME, - "REPLY_COUNT", - 180, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER_WITH_LOG2), - PARUS_SCORE(ENCODED_TWEET_FEATURES_FIELD_NAME, "PARUS_SCORE", 181, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - /** - * This is the rough percentage of the nth token at 140 divided by num tokens - * and is basically n / num tokens where n is the token starting before 140 characters - */ - VISIBLE_TOKEN_RATIO(ENCODED_TWEET_FEATURES_FIELD_NAME, "VISIBLE_TOKEN_RATIO", 182, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_QUOTE_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_QUOTE_FLAG", 183, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - FROM_BLUE_VERIFIED_ACCOUNT_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, - "FROM_BLUE_VERIFIED_ACCOUNT_FLAG", - 184, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - TWEET_SIGNATURE(ENCODED_TWEET_FEATURES_FIELD_NAME, "TWEET_SIGNATURE", 188, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - // MEDIA TYPES - HAS_CONSUMER_VIDEO_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_CONSUMER_VIDEO_FLAG", 189, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_PRO_VIDEO_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_PRO_VIDEO_FLAG", 190, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - 
HAS_VINE_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_VINE_FLAG", 191, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_PERISCOPE_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_PERISCOPE_FLAG", 192, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - HAS_NATIVE_IMAGE_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "HAS_NATIVE_IMAGE_FLAG", 193, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - // NOTE: if possible, please reserve field ID 194 to 196 for future media types (SEARCH-9131) - - IS_NULLCAST_FLAG(ENCODED_TWEET_FEATURES_FIELD_NAME, "IS_NULLCAST_FLAG", 197, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, EarlybirdCluster.ALL_CLUSTERS), - - // EXTENDED ENCODED TWEET FEATURES that's not available on archive clusters - EXTENDED_ENCODED_TWEET_FEATURES_FIELD(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, 200, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - EMBEDS_IMPRESSION_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EMBEDS_IMPRESSION_COUNT", - 221, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER), - EMBEDS_URL_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EMBEDS_URL_COUNT", - 222, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER), - VIDEO_VIEW_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "VIDEO_VIEW_COUNT", - 223, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.LEGACY_BYTE_NORMALIZER), - - // empty bits in integer 0 (starting bit 24, 8 bits) - EXTENDED_FEATURE_UNUSED_BITS_0_24_8(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_0_24_8", 244, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // SEARCH-8564 - Reference Tweet Author ID - REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT", 202, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT", 203, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // SEARCHQUAL-8130: engagement counters v2 - // Integer 3 - RETWEET_COUNT_V2(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "RETWEET_COUNT_V2", 225, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - FAVORITE_COUNT_V2(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAVORITE_COUNT_V2", 226, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - REPLY_COUNT_V2(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "REPLY_COUNT_V2", 227, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - EMBEDS_IMPRESSION_COUNT_V2( - 
EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EMBEDS_IMPRESSION_COUNT_V2", - 228, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Integer 4 - EMBEDS_URL_COUNT_V2( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EMBEDS_URL_COUNT_V2", - 229, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - VIDEO_VIEW_COUNT_V2( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "VIDEO_VIEW_COUNT_V2", - 230, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - QUOTE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "QUOTE_COUNT", - 231, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Tweet Safety Labels - LABEL_ABUSIVE_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_ABUSIVE_FLAG", 232, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_ABUSIVE_HI_RCL_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_ABUSIVE_HI_RCL_FLAG", 233, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_DUP_CONTENT_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_DUP_CONTENT_FLAG", 234, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_NSFW_HI_PRC_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_NSFW_HI_PRC_FLAG", 235, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_NSFW_HI_RCL_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_NSFW_HI_RCL_FLAG", 236, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_SPAM_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_SPAM_FLAG", 237, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - LABEL_SPAM_HI_RCL_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LABEL_SPAM_HI_RCL_FLAG", 238, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // please save this bit for other safety labels - EXTENDED_TEST_FEATURE_UNUSED_BITS_4_31_1(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_4_31_1", 239, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // Integer 5 - WEIGHTED_RETWEET_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "WEIGHTED_RETWEET_COUNT", - 240, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - WEIGHTED_REPLY_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "WEIGHTED_REPLY_COUNT", - 241, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - WEIGHTED_FAVORITE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - 
"WEIGHTED_FAVORITE_COUNT", - 242, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - WEIGHTED_QUOTE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "WEIGHTED_QUOTE_COUNT", - 243, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Integer 6 - // Periscope features - PERISCOPE_EXISTS(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PERISCOPE_EXISTS", 245, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - PERISCOPE_HAS_BEEN_FEATURED(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PERISCOPE_HAS_BEEN_FEATURED", 246, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - PERISCOPE_IS_CURRENTLY_FEATURED(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PERISCOPE_IS_CURRENTLY_FEATURED", 247, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - PERISCOPE_IS_FROM_QUALITY_SOURCE(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PERISCOPE_IS_FROM_QUALITY_SOURCE", 248, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - PERISCOPE_IS_LIVE(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PERISCOPE_IS_LIVE", 249, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - IS_TRENDING_NOW_FLAG(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "IS_TRENDING_NOW_FLAG", 292, - FlagFeatureFieldType.FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // remaining bits for integer 6 (starting bit 6, 26 remaining bits) - EXTENDED_TEST_FEATURE_UNUSED_BITS_7_6_26(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_7_6_26", 250, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // Decaying engagement counters - // Integer 7 - DECAYED_RETWEET_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "DECAYED_RETWEET_COUNT", - 251, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - DECAYED_REPLY_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "DECAYED_REPLY_COUNT", - 252, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - DECAYED_FAVORITE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "DECAYED_FAVORITE_COUNT", - 253, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - DECAYED_QUOTE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "DECAYED_QUOTE_COUNT", - 254, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Fake engagement counters. The fake here is in the sense of spam, not in the sense of testing. - // Refer to [JIRA SEARCHQUAL-10736 Remove Fake Engagements in Search] for more details. 
- // Integer 8 - FAKE_RETWEET_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAKE_RETWEET_COUNT", 269, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - FAKE_REPLY_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAKE_REPLY_COUNT", 270, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - FAKE_FAVORITE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAKE_FAVORITE_COUNT", 271, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - FAKE_QUOTE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "FAKE_QUOTE_COUNT", 272, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Last engagement timestamps. These features use the Tweet's creation time as base and - // are incremented every 1 hour - // Integer 9 - LAST_RETWEET_SINCE_CREATION_HRS( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LAST_RETWEET_SINCE_CREATION_HRS", - 273, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE), - LAST_REPLY_SINCE_CREATION_HRS( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LAST_REPLY_SINCE_CREATION_HRS", - 274, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE), - LAST_FAVORITE_SINCE_CREATION_HRS( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LAST_FAVORITE_SINCE_CREATION_HRS", - 275, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE), - LAST_QUOTE_SINCE_CREATION_HRS( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "LAST_QUOTE_SINCE_CREATION_HRS", - 276, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE), - - // 4 bits hashtag count, mention count and stock count (SEARCH-24336) - // Integer 10 - NUM_HASHTAGS_V2( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "NUM_HASHTAGS_V2", - 277, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE - ), - NUM_MENTIONS_V2( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "NUM_MENTIONS_V2", - 278, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE - ), - NUM_STOCKS( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "NUM_STOCKS", - 279, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.NONE - ), - - // Integer 11 - // Blink engagement counters - BLINK_RETWEET_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "BLINK_RETWEET_COUNT", - 280, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - BLINK_REPLY_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "BLINK_REPLY_COUNT", - 
281, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - BLINK_FAVORITE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "BLINK_FAVORITE_COUNT", - 282, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - BLINK_QUOTE_COUNT( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "BLINK_QUOTE_COUNT", - 283, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.SMART_INTEGER_NORMALIZER), - - // Integer 10 (remaining) - // Production Toxicity and PBlock score from HML (go/toxicity, go/pblock) - TOXICITY_SCORE( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "TOXICITY_SCORE", 284, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - PBLOCK_SCORE( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "PBLOCK_SCORE", 285, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - - // Integer 12 - // Experimental health model scores from HML - EXPERIMENTAL_HEALTH_MODEL_SCORE_1( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EXPERIMENTAL_HEALTH_MODEL_SCORE_1", 286, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - EXPERIMENTAL_HEALTH_MODEL_SCORE_2( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EXPERIMENTAL_HEALTH_MODEL_SCORE_2", 287, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - EXPERIMENTAL_HEALTH_MODEL_SCORE_3( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EXPERIMENTAL_HEALTH_MODEL_SCORE_3", 288, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - // remaining bits for index 12 (unused_bits_12) - EXTENDED_TEST_FEATURE_UNUSED_BITS_12_30_2(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_12_30_2", 289, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - // Integer 13 - // Experimental health model scores from HML (cont.) 
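The remaining experimental health model score and the pSpammyTweet / pReportedTweet scores follow below; like the scores above, they are stored with ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER and occupy 10-bit slots in the schema defined later in EarlybirdSchemaCreateTool. A minimal sketch of such a quantization, assuming it is a simple linear mapping of a [0, 1] score (the real normalizer implementation is not part of these sources):

```java
// Sketch of squeezing a [0, 1] prediction score into a 10-bit feature slot,
// matching the bit width the schema reserves for TOXICITY_SCORE, PBLOCK_SCORE
// and the experimental health model scores. Linear quantization is an
// assumption; PREDICTION_SCORE_NORMALIZER's real behavior is defined elsewhere.
final class PredictionScoreCodec {
  private static final int BITS = 10;
  private static final int MAX = (1 << BITS) - 1;  // 1023

  static int encode(double score) {
    double clamped = Math.max(0.0, Math.min(1.0, score));
    return (int) Math.round(clamped * MAX);
  }

  static double decode(int stored) {
    return stored / (double) MAX;
  }

  public static void main(String[] args) {
    int stored = encode(0.87);
    System.out.printf("stored=%d decoded=%.4f%n", stored, decode(stored));  // stored=890 decoded=0.8700
  }
}
```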
- EXPERIMENTAL_HEALTH_MODEL_SCORE_4( - EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "EXPERIMENTAL_HEALTH_MODEL_SCORE_4", 290, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - // Production pSpammyTweet score from HML (go/pspammytweet) - P_SPAMMY_TWEET_SCORE(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "P_SPAMMY_TWEET_SCORE", 291, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - // Production pReportedTweet score from HML (go/preportedtweet) - P_REPORTED_TWEET_SCORE(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "P_REPORTED_TWEET_SCORE", 293, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - // remaining bits for index 13 (unused_bits_13) - EXTENDED_TEST_FEATURE_UNUSED_BITS_13_30_2(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_13_30_2", 294, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS - ), - - // Integer 14 - // Health model scores from HML (cont.) - // Prod Spammy Tweet Content model score from Platform Manipulation (go/spammy-tweet-content) - SPAMMY_TWEET_CONTENT_SCORE(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "SPAMMY_TWEET_CONTENT_SCORE", 295, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS, - ThriftFeatureNormalizationType.PREDICTION_SCORE_NORMALIZER - ), - // remaining bits for index 14 (unused_bits_14) - EXTENDED_TEST_FEATURE_UNUSED_BITS_14_10_22(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS_14_10_22", 296, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS - ), - - // Note that the integer block index i in the names UNUSED_BITS{i}" below is 1-based, but the - // index j in UNUSED_BITS_{j}_x_y above is 0-based. 
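As the note above explains, the partially used integer blocks are named UNUSED_BITS_{intIndex}_{startBit}_{numBits} with a 0-based block index, while the fully unused trailing blocks that follow (UNUSED_BITS16 through UNUSED_BITS20) use a 1-based index. A tiny sketch of the two naming schemes (these helpers are illustrative only):

```java
// Formats placeholder names for unused bit ranges, following the two
// conventions described in the note above. Purely illustrative.
final class UnusedBitsNames {
  /** 0-based block index, e.g. partialBlockName(14, 10, 22) -> "UNUSED_BITS_14_10_22". */
  static String partialBlockName(int zeroBasedIntIndex, int startBit, int numBits) {
    return String.format("UNUSED_BITS_%d_%d_%d", zeroBasedIntIndex, startBit, numBits);
  }

  /** 1-based block index, e.g. wholeBlockName(16) -> "UNUSED_BITS16". */
  static String wholeBlockName(int oneBasedIntIndex) {
    return "UNUSED_BITS" + oneBasedIntIndex;
  }

  public static void main(String[] args) {
    System.out.println(partialBlockName(14, 10, 22));
    System.out.println(wholeBlockName(16));
  }
}
```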
- EXTENDED_TEST_FEATURE_UNUSED_BITS_16(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS16", 216, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - EXTENDED_TEST_FEATURE_UNUSED_BITS_17(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS17", 217, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - EXTENDED_TEST_FEATURE_UNUSED_BITS_18(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS18", 218, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - EXTENDED_TEST_FEATURE_UNUSED_BITS_19(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS19", 219, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS), - - EXTENDED_TEST_FEATURE_UNUSED_BITS_20(EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - "UNUSED_BITS20", 220, - FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - UnusedFeatureFieldType.UNUSED_FEATURE_FIELD, - EarlybirdCluster.TWITTER_IN_MEMORY_INDEX_FORMAT_ALL_CLUSTERS); - - // Filter field terms. These end up as terms in the "internal" field (id=18). So for example - // you can have a doc with field(internal) = "__filter_nullcast", "__filter_vine" and that will - // be a nullcast tweet with a vine link in it. - public static final String NULLCAST_FILTER_TERM = "nullcast"; - public static final String VERIFIED_FILTER_TERM = "verified"; - public static final String BLUE_VERIFIED_FILTER_TERM = "blue_verified"; - public static final String NATIVE_RETWEETS_FILTER_TERM = "nativeretweets"; - public static final String QUOTE_FILTER_TERM = "quote"; - public static final String REPLIES_FILTER_TERM = "replies"; - public static final String CONSUMER_VIDEO_FILTER_TERM = "consumer_video"; - public static final String PRO_VIDEO_FILTER_TERM = "pro_video"; - public static final String VINE_FILTER_TERM = "vine"; - public static final String PERISCOPE_FILTER_TERM = "periscope"; - public static final String PROFILE_GEO_FILTER_TERM = "profile_geo"; - public static final String SELF_THREAD_FILTER_TERM = "self_threads"; - public static final String DIRECTED_AT_FILTER_TERM = "directed_at"; - public static final String EXCLUSIVE_FILTER_TERM = "exclusive"; - - // Reserved terms for the internal field. 
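The reserved __has_* / __is_* terms listed next live in the same internal field (field id 18) as the filter terms above; as the comment above notes, each filter is indexed as "__filter_" + term, so a [filter X] query reduces to a term query on that field. A rough sketch using plain Lucene (the Earlybird query layer is not shown in these sources):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Illustrates how the filter terms above can be matched: each one becomes a
// "__filter_" + term posting in the internal field. The field name "internal"
// and the prefix come from the comment above; the query construction is plain
// Lucene, not Earlybird's own query layer.
final class FilterTermQueries {
  private static final String INTERNAL_FIELD = "internal";

  static Query filter(String filterTerm) {
    return new TermQuery(new Term(INTERNAL_FIELD, "__filter_" + filterTerm));
  }

  public static void main(String[] args) {
    // Tweets with a Vine link that are not nullcast.
    Query q = new BooleanQuery.Builder()
        .add(filter("vine"), BooleanClause.Occur.MUST)
        .add(filter("nullcast"), BooleanClause.Occur.MUST_NOT)
        .build();
    System.out.println(q);  // +internal:__filter_vine -internal:__filter_nullcast
  }
}
```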
- public static final String HAS_POSITIVE_SMILEY = "__has_positive_smiley"; - public static final String HAS_NEGATIVE_SMILEY = "__has_negative_smiley"; - public static final String IS_OFFENSIVE = "__is_offensive"; - - // Facet fields - public static final String MENTIONS_FACET = "mentions"; - public static final String HASHTAGS_FACET = "hashtags"; - public static final String STOCKS_FACET = "stocks"; - public static final String VIDEOS_FACET = "videos"; - public static final String IMAGES_FACET = "images"; - public static final String NEWS_FACET = "news"; - public static final String LANGUAGES_FACET = "languages"; - public static final String SOURCES_FACET = "sources"; - public static final String TWIMG_FACET = "twimg"; - public static final String FROM_USER_ID_FACET = "user_id"; - public static final String RETWEETS_FACET = "retweets"; - public static final String LINKS_FACET = "links"; - public static final String SPACES_FACET = "spaces"; - - /** - * Used by the query parser to check that the operator of a [filter X] query is valid. - * Also used by blender, though it probably shouldn't be. - */ - public static final ImmutableSet FACETS = ImmutableSet.builder() - .add(MENTIONS_FACET) - .add(HASHTAGS_FACET) - .add(STOCKS_FACET) - .add(VIDEOS_FACET) - .add(IMAGES_FACET) - .add(NEWS_FACET) - .add(LINKS_FACET) - .add(LANGUAGES_FACET) - .add(SOURCES_FACET) - .add(TWIMG_FACET) - .add(SPACES_FACET) - .build(); - - /** - * Used by blender to convert facet names to field names. We should find a way to get the - * information we need in blender without needing this map. - */ - public static final ImmutableMap FACET_TO_FIELD_MAP = - ImmutableMap.builder() - .put(MENTIONS_FACET, MENTIONS_FIELD.getFieldName()) - .put(HASHTAGS_FACET, HASHTAGS_FIELD.getFieldName()) - .put(STOCKS_FACET, STOCKS_FIELD.getFieldName()) - .put(VIDEOS_FACET, VIDEO_LINKS_FIELD.getFieldName()) - .put(IMAGES_FACET, IMAGE_LINKS_FIELD.getFieldName()) - .put(NEWS_FACET, NEWS_LINKS_FIELD.getFieldName()) - .put(LANGUAGES_FACET, ISO_LANGUAGE_FIELD.getFieldName()) - .put(SOURCES_FACET, SOURCE_FIELD.getFieldName()) - .put(TWIMG_FACET, TWIMG_LINKS_FIELD.getFieldName()) - .put(LINKS_FACET, LINKS_FIELD.getFieldName()) - .put(SPACES_FACET, SPACE_ID_FIELD.getFieldName()) - .build(); - - public static String getFacetSkipFieldName(String fieldName) { - return "__has_" + fieldName; - } - - private final String fieldName; - private final int fieldId; - private final EnumSet clusters; - private final FlagFeatureFieldType flagFeatureField; - - private final UnusedFeatureFieldType unusedField; - - // Only set for feature fields. - @Nullable - private final FeatureConfiguration featureConfiguration; - - // Only set for feature fields. - private final ThriftFeatureNormalizationType featureNormalizationType; - - // To simplify field configurations and reduce duplicate code, we give clusters a default value - EarlybirdFieldConstant(String fieldName, int fieldId) { - this(fieldName, fieldId, EarlybirdCluster.GENERAL_PURPOSE_CLUSTERS, null); - } - - EarlybirdFieldConstant(String fieldName, int fieldId, Set clusters) { - this(fieldName, fieldId, clusters, null); - } - - EarlybirdFieldConstant(String fieldName, int fieldId, EarlybirdCluster cluster) { - this(fieldName, fieldId, ImmutableSet.of(cluster), null); - } - - /** - * Base field name is needed here in order to construct the full - * name of the feature. Our convention is that a feature should be named - * as: baseFieldName.featureName. For example: encoded_tweet_features.retweet_count. 
- */ - EarlybirdFieldConstant( - String baseName, - String fieldName, - int fieldId, - FlagFeatureFieldType flagFeatureField, - Set clusters) { - this((baseName + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR + fieldName).toLowerCase(), - fieldId, clusters, flagFeatureField, null); - } - - EarlybirdFieldConstant( - String baseName, - String fieldName, - int fieldId, - FlagFeatureFieldType flagFeatureField, - UnusedFeatureFieldType unusedField, - Set clusters) { - this((baseName + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR + fieldName).toLowerCase(), - fieldId, clusters, flagFeatureField, unusedField, null); - } - - EarlybirdFieldConstant( - String baseName, - String fieldName, - int fieldId, - FlagFeatureFieldType flagFeatureField, - Set clusters, - ThriftFeatureNormalizationType featureNormalizationType) { - this((baseName + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR + fieldName).toLowerCase(), - fieldId, clusters, flagFeatureField, UnusedFeatureFieldType.USED_FEATURE_FIELD, - featureNormalizationType, null); - } - - /** - * Constructor. - */ - EarlybirdFieldConstant(String fieldName, int fieldId, Set clusters, - @Nullable FeatureConfiguration featureConfiguration) { - this(fieldName, fieldId, clusters, FlagFeatureFieldType.NON_FLAG_FEATURE_FIELD, - featureConfiguration); - } - - /** - * Constructor. - */ - EarlybirdFieldConstant(String fieldName, - int fieldId, - Set clusters, - FlagFeatureFieldType flagFeatureField, - @Nullable FeatureConfiguration featureConfiguration) { - this(fieldName, fieldId, clusters, flagFeatureField, - UnusedFeatureFieldType.USED_FEATURE_FIELD, featureConfiguration); - } - - /** - * Constructor. - */ - EarlybirdFieldConstant(String fieldName, - int fieldId, - Set clusters, - FlagFeatureFieldType flagFeatureField, - UnusedFeatureFieldType unusedField, - @Nullable FeatureConfiguration featureConfiguration) { - this(fieldName, fieldId, clusters, flagFeatureField, unusedField, null, featureConfiguration); - } - - /** - * Constructor. - */ - EarlybirdFieldConstant(String fieldName, - int fieldId, - Set clusters, - FlagFeatureFieldType flagFeatureField, - UnusedFeatureFieldType unusedField, - @Nullable ThriftFeatureNormalizationType featureNormalizationType, - @Nullable FeatureConfiguration featureConfiguration) { - this.fieldId = fieldId; - this.fieldName = fieldName; - this.clusters = EnumSet.copyOf(clusters); - this.flagFeatureField = flagFeatureField; - this.unusedField = unusedField; - this.featureNormalizationType = featureNormalizationType; - this.featureConfiguration = featureConfiguration; - } - - // Override toString to make replacing StatusConstant Easier. 
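Every feature constant built by the constructors above ends up with a full name of the form baseFieldName.featureName, lower-cased; the toString() override just below returns exactly that name. A minimal sketch, assuming the separator is "." as the javadoc example encoded_tweet_features.retweet_count suggests (the actual value comes from SchemaBuilder.CSF_VIEW_NAME_SEPARATOR, which is defined elsewhere):

```java
// Sketch of the feature-naming convention used by the feature constructors:
// full name = baseFieldName + separator + featureName, lower-cased. The "."
// separator is an assumption inferred from the javadoc example above.
final class FeatureFieldNames {
  private static final String CSF_VIEW_NAME_SEPARATOR = ".";  // assumed value

  static String fullFeatureName(String baseFieldName, String featureName) {
    return (baseFieldName + CSF_VIEW_NAME_SEPARATOR + featureName).toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(fullFeatureName("encoded_tweet_features", "RETWEET_COUNT"));
    // -> encoded_tweet_features.retweet_count
  }
}
```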
- @Override - public String toString() { - return fieldName; - } - - public boolean isValidFieldInCluster(EarlybirdCluster cluster) { - return clusters.contains(cluster); - } - - public String getFieldName() { - return fieldName; - } - - public int getFieldId() { - return fieldId; - } - - public FlagFeatureFieldType getFlagFeatureField() { - return flagFeatureField; - } - - public boolean isFlagFeatureField() { - return flagFeatureField == FlagFeatureFieldType.FLAG_FEATURE_FIELD; - } - - public boolean isUnusedField() { - return unusedField == UnusedFeatureFieldType.UNUSED_FEATURE_FIELD; - } - - @Nullable - public FeatureConfiguration getFeatureConfiguration() { - return featureConfiguration; - } - - @Nullable - public ThriftFeatureNormalizationType getFeatureNormalizationType() { - return featureNormalizationType; - } - } - - private static final Map NAME_TO_ID_MAP; - private static final Map ID_TO_FIELD_MAP; - static { - ImmutableMap.Builder nameToIdMapBuilder = - ImmutableMap.builder(); - ImmutableMap.Builder idToFieldMapBuilder = - ImmutableMap.builder(); - Set fieldNameDupDetector = Sets.newHashSet(); - Set fieldIdDupDetector = Sets.newHashSet(); - for (EarlybirdFieldConstant fc : EarlybirdFieldConstant.values()) { - if (fieldNameDupDetector.contains(fc.getFieldName())) { - throw new IllegalStateException("detected fields sharing field name: " + fc.getFieldName()); - } - if (fieldIdDupDetector.contains(fc.getFieldId())) { - throw new IllegalStateException("detected fields sharing field id: " + fc.getFieldId()); - } - - fieldNameDupDetector.add(fc.getFieldName()); - fieldIdDupDetector.add(fc.getFieldId()); - nameToIdMapBuilder.put(fc.getFieldName(), fc); - idToFieldMapBuilder.put(fc.getFieldId(), fc); - } - NAME_TO_ID_MAP = nameToIdMapBuilder.build(); - ID_TO_FIELD_MAP = idToFieldMapBuilder.build(); - } - - // This define the list of boolean features, but the name does not have "flag" inside. This - // definition is only for double checking purpose to prevent code change mistakes. The setting - // of the flag feature is based on FlagFeatureFieldType.FLAG_FEATURE_FIELD. - public static final Set EXTRA_FLAG_FIELDS = - Sets.newHashSet(EarlybirdFieldConstants.EarlybirdFieldConstant.IS_SENSITIVE_CONTENT); - public static final String FLAG_STRING = "flag"; - - private static final List FLAG_FEATURE_FIELDS; - static { - ImmutableList.Builder flagFieldBuilder = ImmutableList.builder(); - for (EarlybirdFieldConstant fc : EarlybirdFieldConstant.values()) { - if (fc.getFlagFeatureField() == FlagFeatureFieldType.FLAG_FEATURE_FIELD - && !fc.isUnusedField()) { - flagFieldBuilder.add(fc); - } - } - FLAG_FEATURE_FIELDS = flagFieldBuilder.build(); - } - - /** - * Get all the flag features meaning that they are boolean features with only 1 bit in the packed - * feature encoding. - */ - public static Collection getFlagFeatureFields() { - return FLAG_FEATURE_FIELDS; - } - - /** - * Get the EarlybirdFieldConstant for the specified field. - */ - public static EarlybirdFieldConstant getFieldConstant(String fieldName) { - EarlybirdFieldConstant field = NAME_TO_ID_MAP.get(fieldName); - if (field == null) { - throw new IllegalArgumentException("Unknown field: " + fieldName); - } - return field; - } - - /** - * Get the EarlybirdFieldConstant for the specified field. 
- */ - public static EarlybirdFieldConstant getFieldConstant(int fieldId) { - EarlybirdFieldConstant field = ID_TO_FIELD_MAP.get(fieldId); - if (field == null) { - throw new IllegalArgumentException("Unknown field: " + fieldId); - } - return field; - } - - /** - * Determines if there's a field with the given ID. - */ - public static boolean hasFieldConstant(int fieldId) { - return ID_TO_FIELD_MAP.keySet().contains(fieldId); - } - - @Override - public final int getFieldID(String fieldName) { - return getFieldConstant(fieldName).getFieldId(); - } - - public static final String formatGeoType(ThriftGeoLocationSource source) { - return "__geo_location_type_" + source.name().toLowerCase(); - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaBuilder.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaBuilder.java deleted file mode 100644 index 095e00fe5..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaBuilder.java +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; - -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.schema.SchemaBuilder; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.thriftjava.ThriftFieldConfiguration; -import com.twitter.search.common.schema.thriftjava.ThriftFieldSettings; -import com.twitter.search.common.schema.thriftjava.ThriftTokenStreamSerializer; -import com.twitter.search.common.util.analysis.CharTermAttributeSerializer; -import com.twitter.search.common.util.analysis.TermPayloadAttributeSerializer; - -/** - * Build class used to build a ThriftSchema - */ -public class EarlybirdSchemaBuilder extends SchemaBuilder { - private final EarlybirdCluster cluster; - - public EarlybirdSchemaBuilder(FieldNameToIdMapping idMapping, - EarlybirdCluster cluster, - TokenStreamSerializer.Version tokenStreamSerializerVersion) { - super(idMapping, tokenStreamSerializerVersion); - this.cluster = cluster; - } - - /** - * Configure the specified field to be Out-of-order. - * In the realtime cluster, this causes Earlybird to used the skip list posting format. - */ - public final EarlybirdSchemaBuilder withOutOfOrderEnabledForField(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - Preconditions.checkState(settings.isSetIndexedFieldSettings(), - "Out of order field must be indexed"); - settings.getIndexedFieldSettings().setSupportOutOfOrderAppends(true); - return this; - } - - /** - * This turns on tweet specific normalizations. This turns on the following two token processors: - * {@link com.twitter.search.common.util.text.splitter.HashtagMentionPunctuationSplitter} - * {@link com.twitter.search.common.util.text.filter.NormalizedTokenFilter} - *

- * HashtagMentionPunctuationSplitter would break a mention or hashtag like @ab_cd or #ab_cd into - * tokens {ab, cd}. - * NormalizedTokenFilter strips out the # @ $ from the tokens. - */ - public final EarlybirdSchemaBuilder withTweetSpecificNormalization(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings settings = - schema.getFieldConfigs().get(idMapping.getFieldID(fieldName)).getSettings(); - Preconditions.checkState(settings.isSetIndexedFieldSettings(), - "Tweet text field must be indexed."); - settings.getIndexedFieldSettings().setDeprecated_performTweetSpecificNormalizations(true); - return this; - } - - /** - * Add a twitter photo facet field. - */ - public final EarlybirdSchemaBuilder withPhotoUrlFacetField(String fieldName) { - if (!shouldIncludeField(fieldName)) { - return this; - } - ThriftFieldSettings photoFieldSettings = getNoPositionNoFreqSettings(); - ThriftTokenStreamSerializer tokenStreamSerializer = - new ThriftTokenStreamSerializer(tokenStreamSerializerVersion); - tokenStreamSerializer.setAttributeSerializerClassNames( - ImmutableList.of( - CharTermAttributeSerializer.class.getName(), - TermPayloadAttributeSerializer.class.getName())); - photoFieldSettings - .getIndexedFieldSettings() - .setTokenStreamSerializer(tokenStreamSerializer) - .setTokenized(true); - putIntoFieldConfigs(idMapping.getFieldID(fieldName), - new ThriftFieldConfiguration(fieldName).setSettings(photoFieldSettings)); - return this; - } - - /** - * Returns whether the given field should be included or dropped. - */ - @Override - protected boolean shouldIncludeField(String fieldName) { - return EarlybirdFieldConstants.getFieldConstant(fieldName).isValidFieldInCluster(cluster); - } -} - diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaCreateTool.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaCreateTool.java deleted file mode 100644 index f2376cf6b..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdSchemaCreateTool.java +++ /dev/null @@ -1,702 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.AnalyzerFactory; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.schema.ImmutableSchema; -import com.twitter.search.common.schema.SchemaBuilder; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.schema.thriftjava.ThriftFeatureUpdateConstraint; -import com.twitter.search.common.schema.thriftjava.ThriftSchema; - -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.BLINK_FAVORITE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.BLINK_QUOTE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.BLINK_REPLY_COUNT; -import static 
com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.BLINK_RETWEET_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.COMPOSER_SOURCE_IS_CAMERA_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.DECAYED_FAVORITE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.DECAYED_QUOTE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.DECAYED_REPLY_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.DECAYED_RETWEET_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EMBEDS_URL_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EMBEDS_URL_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_1; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_3; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_4; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_FEATURE_UNUSED_BITS_0_24_8; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_12_30_2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_13_30_2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_14_10_22; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_16; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_17; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_18; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_19; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_20; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_4_31_1; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.EXTENDED_TEST_FEATURE_UNUSED_BITS_7_6_26; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAKE_FAVORITE_COUNT; 
-import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAKE_QUOTE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAKE_REPLY_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAKE_RETWEET_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAVORITE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FAVORITE_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_CARD_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_CONSUMER_VIDEO_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_EXPANDO_CARD_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_LINK_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_MULTIPLE_MEDIA_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_NATIVE_IMAGE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_NEWS_URL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PERISCOPE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_PRO_VIDEO_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_QUOTE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_TREND_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VINE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.HAS_VISIBLE_LINK_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_NULLCAST_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_OFFENSIVE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_REPLY_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_RETWEET_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_SENSITIVE_CONTENT; -import static 
com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_TRENDING_NOW_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_USER_BOT_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_USER_NEW_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_USER_NSFW_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.IS_USER_SPAM_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_ABUSIVE_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_ABUSIVE_HI_RCL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_DUP_CONTENT_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_NSFW_HI_PRC_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_NSFW_HI_RCL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_SPAM_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LABEL_SPAM_HI_RCL_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LANGUAGE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LAST_FAVORITE_SINCE_CREATION_HRS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LAST_QUOTE_SINCE_CREATION_HRS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LAST_REPLY_SINCE_CREATION_HRS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LAST_RETWEET_SINCE_CREATION_HRS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.LINK_LANGUAGE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NUM_HASHTAGS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NUM_HASHTAGS_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NUM_MENTIONS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NUM_MENTIONS_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.NUM_STOCKS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PARUS_SCORE; -import static 
com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PBLOCK_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_EXISTS; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_HAS_BEEN_FEATURED; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_IS_CURRENTLY_FEATURED; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_IS_FROM_QUALITY_SOURCE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PERISCOPE_IS_LIVE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PREV_USER_TWEET_ENGAGEMENT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.PROFILE_IS_EGG_FLAG; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.P_REPORTED_TWEET_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.P_SPAMMY_TWEET_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.QUOTE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.REPLY_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.REPLY_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.RETWEET_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.RETWEET_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.SPAMMY_TWEET_CONTENT_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.TEXT_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.TOXICITY_SCORE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.TWEET_SIGNATURE; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.USER_REPUTATION; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.VIDEO_VIEW_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.VIDEO_VIEW_COUNT_V2; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.VISIBLE_TOKEN_RATIO; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.WEIGHTED_FAVORITE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.WEIGHTED_QUOTE_COUNT; -import static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.WEIGHTED_REPLY_COUNT; -import 
static com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant.WEIGHTED_RETWEET_COUNT; - -/** - * Field configurations for Earlybird. - */ -public final class EarlybirdSchemaCreateTool { - // How many times a schema is built - private static final SearchCounter SCHEMA_BUILD_COUNT = - SearchCounter.export("schema_build_count"); - - // Number of integers for the column of ENCODED_TWEET_FEATURES_FIELD. - @VisibleForTesting - public static final int NUMBER_OF_INTEGERS_FOR_FEATURES = 5; - - // Number of integers for the column of EXTENDED_ENCODED_TWEET_FEATURES_FIELD. - // extra 80 bytes - // In realtime cluster, assuming 19 segments total, and 8388608 docs per segment - // this would amount to about 12.75GB of memory needed - // - @VisibleForTesting - public static final int NUMBER_OF_INTEGERS_FOR_EXTENDED_FEATURES = 20; - - @VisibleForTesting - public static final Map FEATURE_CONFIGURATION_MAP - = Maps.newLinkedHashMap(); - - public static final String BASE_FIELD_NAME = - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD.getFieldName(); - - private static String getBaseFieldName(String fullName) { - int index = fullName.indexOf(SchemaBuilder.CSF_VIEW_NAME_SEPARATOR); - Preconditions.checkArgument(index > 0); - return fullName.substring(0, index); - } - - private static String getBaseFieldName(EarlybirdFieldConstant fieldConstant) { - return getBaseFieldName(fieldConstant.getFieldName()); - } - - private static String getFeatureNameInField(EarlybirdFieldConstant fieldConstant) { - int index = fieldConstant.getFieldName().indexOf(SchemaBuilder.CSF_VIEW_NAME_SEPARATOR); - Preconditions.checkArgument(index > 0); - return fieldConstant.getFieldName().substring(index + 1); - } - - // defining all features - static { - // Add individual tweet encoded features as views on top of - // EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD - - // int intIndex, int bitStartPos, int bitLength - newEarlybirdFeatureConfiguration(IS_RETWEET_FLAG, ThriftCSFType.BOOLEAN, 0, 0, 1); - newEarlybirdFeatureConfiguration(IS_OFFENSIVE_FLAG, ThriftCSFType.BOOLEAN, 0, 1, 1); - newEarlybirdFeatureConfiguration(HAS_LINK_FLAG, ThriftCSFType.BOOLEAN, 0, 2, 1); - newEarlybirdFeatureConfiguration(HAS_TREND_FLAG, ThriftCSFType.BOOLEAN, 0, 3, 1); - newEarlybirdFeatureConfiguration(IS_REPLY_FLAG, ThriftCSFType.BOOLEAN, 0, 4, 1); - newEarlybirdFeatureConfiguration(IS_SENSITIVE_CONTENT, ThriftCSFType.BOOLEAN, 0, 5, 1); - newEarlybirdFeatureConfiguration(HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG, - ThriftCSFType.BOOLEAN, 0, 6, 1); - newEarlybirdFeatureConfiguration(FROM_VERIFIED_ACCOUNT_FLAG, ThriftCSFType.BOOLEAN, 0, 7, 1); - newEarlybirdFeatureConfiguration(TEXT_SCORE, ThriftCSFType.INT, 0, 8, 8); - newEarlybirdFeatureConfiguration(LANGUAGE, ThriftCSFType.INT, 0, 16, 8); - newEarlybirdFeatureConfiguration(LINK_LANGUAGE, ThriftCSFType.INT, 0, 24, 8); - - newEarlybirdFeatureConfiguration(HAS_IMAGE_URL_FLAG, ThriftCSFType.BOOLEAN, 1, 0, 1); - newEarlybirdFeatureConfiguration(HAS_VIDEO_URL_FLAG, ThriftCSFType.BOOLEAN, 1, 1, 1); - newEarlybirdFeatureConfiguration(HAS_NEWS_URL_FLAG, ThriftCSFType.BOOLEAN, 1, 2, 1); - newEarlybirdFeatureConfiguration(HAS_EXPANDO_CARD_FLAG, ThriftCSFType.BOOLEAN, 1, 3, 1); - newEarlybirdFeatureConfiguration(HAS_MULTIPLE_MEDIA_FLAG, ThriftCSFType.BOOLEAN, 1, 4, 1); - newEarlybirdFeatureConfiguration(PROFILE_IS_EGG_FLAG, ThriftCSFType.BOOLEAN, 1, 5, 1); - newEarlybirdFeatureConfiguration(NUM_MENTIONS, ThriftCSFType.INT, 1, 6, 2); // 0, 1, 2, 3+ - 
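The three trailing integer arguments of each call are (integer index within the packed column, start bit, bit length), as the comment above the first block notes; the remaining definitions continue below. A standalone sketch that makes the packing arithmetic explicit and would flag overlapping or out-of-range definitions (the checker itself is hypothetical, not part of the schema tool):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for (intIndex, bitStartPos, bitLength) triples: each
// feature view must fit inside its 32-bit integer, and two views of the same
// integer must not overlap.
final class BitLayoutChecker {
  private final List<long[]> views = new ArrayList<>();  // entries: {intIndex, usedMask}

  void register(String name, int intIndex, int bitStartPos, int bitLength) {
    if (bitStartPos < 0 || bitLength <= 0 || bitStartPos + bitLength > 32) {
      throw new IllegalArgumentException(name + " does not fit in a 32-bit integer");
    }
    long mask = ((bitLength == 32 ? -1L : (1L << bitLength) - 1) << bitStartPos) & 0xFFFFFFFFL;
    for (long[] view : views) {
      if (view[0] == intIndex && (view[1] & mask) != 0) {
        throw new IllegalStateException(name + " overlaps an existing feature in int " + intIndex);
      }
    }
    views.add(new long[] {intIndex, mask});
  }

  public static void main(String[] args) {
    BitLayoutChecker checker = new BitLayoutChecker();
    checker.register("NUM_MENTIONS", 1, 6, 2);
    checker.register("NUM_HASHTAGS", 1, 8, 2);
    checker.register("HAS_CARD_FLAG", 1, 10, 1);
    System.out.println("no overlaps");
  }
}
```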
newEarlybirdFeatureConfiguration(NUM_HASHTAGS, ThriftCSFType.INT, 1, 8, 2); // 0, 1, 2, 3+ - newEarlybirdFeatureConfiguration(HAS_CARD_FLAG, ThriftCSFType.BOOLEAN, 1, 10, 1); - newEarlybirdFeatureConfiguration(HAS_VISIBLE_LINK_FLAG, ThriftCSFType.BOOLEAN, 1, 11, 1); - newEarlybirdFeatureConfiguration(USER_REPUTATION, ThriftCSFType.INT, 1, 12, 8); - newEarlybirdFeatureConfiguration(IS_USER_SPAM_FLAG, ThriftCSFType.BOOLEAN, 1, 20, 1); - newEarlybirdFeatureConfiguration(IS_USER_NSFW_FLAG, ThriftCSFType.BOOLEAN, 1, 21, 1); - newEarlybirdFeatureConfiguration(IS_USER_BOT_FLAG, ThriftCSFType.BOOLEAN, 1, 22, 1); - newEarlybirdFeatureConfiguration(IS_USER_NEW_FLAG, ThriftCSFType.BOOLEAN, 1, 23, 1); - newEarlybirdFeatureConfiguration(PREV_USER_TWEET_ENGAGEMENT, ThriftCSFType.INT, 1, 24, 6); - newEarlybirdFeatureConfiguration(COMPOSER_SOURCE_IS_CAMERA_FLAG, - ThriftCSFType.BOOLEAN, 1, 30, 1); - newEarlybirdFeatureConfiguration(IS_NULLCAST_FLAG, ThriftCSFType.BOOLEAN, 1, 31, 1); - - newEarlybirdFeatureConfiguration(RETWEET_COUNT, ThriftCSFType.DOUBLE, 2, 0, 8, - ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(FAVORITE_COUNT, ThriftCSFType.DOUBLE, 2, 8, 8, - ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(REPLY_COUNT, ThriftCSFType.DOUBLE, 2, 16, 8, - ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(PARUS_SCORE, ThriftCSFType.DOUBLE, 2, 24, 8); - - newEarlybirdFeatureConfiguration(HAS_CONSUMER_VIDEO_FLAG, ThriftCSFType.BOOLEAN, 3, 0, 1); - newEarlybirdFeatureConfiguration(HAS_PRO_VIDEO_FLAG, ThriftCSFType.BOOLEAN, 3, 1, 1); - newEarlybirdFeatureConfiguration(HAS_VINE_FLAG, ThriftCSFType.BOOLEAN, 3, 2, 1); - newEarlybirdFeatureConfiguration(HAS_PERISCOPE_FLAG, ThriftCSFType.BOOLEAN, 3, 3, 1); - newEarlybirdFeatureConfiguration(HAS_NATIVE_IMAGE_FLAG, ThriftCSFType.BOOLEAN, 3, 4, 1); - // NOTE: There are 3 bits left in the first byte of INT 3, if possible, please reserve them - // for future media types (SEARCH-9131) - // newEarlybirdFeatureConfiguration(FUTURE_MEDIA_BITS, ThriftCSFType.INT, 3, 5, 3); - - newEarlybirdFeatureConfiguration(VISIBLE_TOKEN_RATIO, ThriftCSFType.INT, 3, 8, 4); - newEarlybirdFeatureConfiguration(HAS_QUOTE_FLAG, ThriftCSFType.BOOLEAN, 3, 12, 1); - newEarlybirdFeatureConfiguration(FROM_BLUE_VERIFIED_ACCOUNT_FLAG, - ThriftCSFType.BOOLEAN, 3, 13, 1); - // Unused bits from bit 14 to bit 31 (18 bits) - // newEarlybirdFeatureConfiguration(UNUSED_BITS, ThriftCSFType.INT, 3, 14, 18); - - newEarlybirdFeatureConfiguration(TWEET_SIGNATURE, ThriftCSFType.INT, 4, 0, 32); - - newEarlybirdFeatureConfiguration(EMBEDS_IMPRESSION_COUNT, - ThriftCSFType.DOUBLE, 0, 0, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(EMBEDS_URL_COUNT, - ThriftCSFType.DOUBLE, 0, 8, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(VIDEO_VIEW_COUNT, - ThriftCSFType.DOUBLE, 0, 16, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - - // Unused bits from bit 24 to bit 31 (8 bits). 
- // This used to be a feature that was decommissioned (SEARCHQUAL-10321) - newEarlybirdFeatureConfiguration(EXTENDED_FEATURE_UNUSED_BITS_0_24_8, - ThriftCSFType.INT, 0, 24, 8); - - newEarlybirdFeatureConfiguration(REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT, - ThriftCSFType.INT, 1, 0, 32, ThriftFeatureUpdateConstraint.IMMUTABLE); - newEarlybirdFeatureConfiguration(REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT, - ThriftCSFType.INT, 2, 0, 32, ThriftFeatureUpdateConstraint.IMMUTABLE); - - newEarlybirdFeatureConfiguration(RETWEET_COUNT_V2, - ThriftCSFType.DOUBLE, 3, 0, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(FAVORITE_COUNT_V2, - ThriftCSFType.DOUBLE, 3, 8, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(REPLY_COUNT_V2, - ThriftCSFType.DOUBLE, 3, 16, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(EMBEDS_IMPRESSION_COUNT_V2, - ThriftCSFType.DOUBLE, 3, 24, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - - newEarlybirdFeatureConfiguration(EMBEDS_URL_COUNT_V2, - ThriftCSFType.DOUBLE, 4, 0, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(VIDEO_VIEW_COUNT_V2, - ThriftCSFType.DOUBLE, 4, 8, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(QUOTE_COUNT, - ThriftCSFType.DOUBLE, 4, 16, 8); - - newEarlybirdFeatureConfiguration(LABEL_ABUSIVE_FLAG, ThriftCSFType.BOOLEAN, 4, 24, 1); - newEarlybirdFeatureConfiguration(LABEL_ABUSIVE_HI_RCL_FLAG, ThriftCSFType.BOOLEAN, 4, 25, 1); - newEarlybirdFeatureConfiguration(LABEL_DUP_CONTENT_FLAG, ThriftCSFType.BOOLEAN, 4, 26, 1); - newEarlybirdFeatureConfiguration(LABEL_NSFW_HI_PRC_FLAG, ThriftCSFType.BOOLEAN, 4, 27, 1); - newEarlybirdFeatureConfiguration(LABEL_NSFW_HI_RCL_FLAG, ThriftCSFType.BOOLEAN, 4, 28, 1); - newEarlybirdFeatureConfiguration(LABEL_SPAM_FLAG, ThriftCSFType.BOOLEAN, 4, 29, 1); - newEarlybirdFeatureConfiguration(LABEL_SPAM_HI_RCL_FLAG, ThriftCSFType.BOOLEAN, 4, 30, 1); - - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_4_31_1, - ThriftCSFType.INT, 4, 31, 1); - - newEarlybirdFeatureConfiguration(WEIGHTED_RETWEET_COUNT, - ThriftCSFType.DOUBLE, 5, 0, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(WEIGHTED_REPLY_COUNT, - ThriftCSFType.DOUBLE, 5, 8, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(WEIGHTED_FAVORITE_COUNT, - ThriftCSFType.DOUBLE, 5, 16, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(WEIGHTED_QUOTE_COUNT, - ThriftCSFType.DOUBLE, 5, 24, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - - newEarlybirdFeatureConfiguration(PERISCOPE_EXISTS, - ThriftCSFType.BOOLEAN, 6, 0, 1); - newEarlybirdFeatureConfiguration(PERISCOPE_HAS_BEEN_FEATURED, - ThriftCSFType.BOOLEAN, 6, 1, 1); - newEarlybirdFeatureConfiguration(PERISCOPE_IS_CURRENTLY_FEATURED, - ThriftCSFType.BOOLEAN, 6, 2, 1); - newEarlybirdFeatureConfiguration(PERISCOPE_IS_FROM_QUALITY_SOURCE, - ThriftCSFType.BOOLEAN, 6, 3, 1); - newEarlybirdFeatureConfiguration(PERISCOPE_IS_LIVE, - ThriftCSFType.BOOLEAN, 6, 4, 1); - - newEarlybirdFeatureConfiguration(IS_TRENDING_NOW_FLAG, - ThriftCSFType.BOOLEAN, 6, 5, 1); - - // remaining bits for integer 6 - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_7_6_26, - ThriftCSFType.INT, 6, 6, 26); - - // The decaying counters can become smaller - newEarlybirdFeatureConfiguration(DECAYED_RETWEET_COUNT, - ThriftCSFType.DOUBLE, 7, 0, 8, ThriftFeatureUpdateConstraint.POSITIVE); - 
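The constraint argument on these registrations governs how a feature may be modified by live updates. Read from the constant names and the comments in this file (an interpretation, not code from the repository): INC_ONLY counters such as RETWEET_COUNT may only grow, IMMUTABLE values such as the reference-author id never change once written, and POSITIVE values, used for the decayed, fake, and blink engagement counters that continue just below, may rise or fall as long as they stay non-negative. A minimal sketch of a check with those semantics:

```java
// Interpretation of the update-constraint semantics implied by the constant names and the
// comment above; the real enforcement lives in the feature-update path, not in this file.
final class FeatureUpdateConstraintSketch {
  static boolean isUpdateAllowed(ThriftFeatureUpdateConstraint constraint,
                                 double currentValue, double proposedValue) {
    switch (constraint) {
      case INC_ONLY:  return proposedValue >= currentValue; // counters may only grow
      case IMMUTABLE: return proposedValue == currentValue; // written once, never changed
      case POSITIVE:  return proposedValue >= 0;            // may decay, but never below zero
      default:        return true;                          // other constraints not modeled here
    }
  }
}
```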
newEarlybirdFeatureConfiguration(DECAYED_REPLY_COUNT, - ThriftCSFType.DOUBLE, 7, 8, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(DECAYED_FAVORITE_COUNT, - ThriftCSFType.DOUBLE, 7, 16, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(DECAYED_QUOTE_COUNT, - ThriftCSFType.DOUBLE, 7, 24, 8, ThriftFeatureUpdateConstraint.POSITIVE); - - // The fake engagement counters. - newEarlybirdFeatureConfiguration(FAKE_RETWEET_COUNT, - ThriftCSFType.DOUBLE, 8, 0, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(FAKE_REPLY_COUNT, - ThriftCSFType.DOUBLE, 8, 8, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(FAKE_FAVORITE_COUNT, - ThriftCSFType.DOUBLE, 8, 16, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(FAKE_QUOTE_COUNT, - ThriftCSFType.DOUBLE, 8, 24, 8, ThriftFeatureUpdateConstraint.POSITIVE); - - newEarlybirdFeatureConfiguration(LAST_RETWEET_SINCE_CREATION_HRS, - ThriftCSFType.INT, 9, 0, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(LAST_REPLY_SINCE_CREATION_HRS, - ThriftCSFType.INT, 9, 8, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(LAST_FAVORITE_SINCE_CREATION_HRS, - ThriftCSFType.INT, 9, 16, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - newEarlybirdFeatureConfiguration(LAST_QUOTE_SINCE_CREATION_HRS, - ThriftCSFType.INT, 9, 24, 8, ThriftFeatureUpdateConstraint.INC_ONLY); - - newEarlybirdFeatureConfiguration(NUM_HASHTAGS_V2, - ThriftCSFType.INT, 10, 0, 4); - newEarlybirdFeatureConfiguration(NUM_MENTIONS_V2, - ThriftCSFType.INT, 10, 4, 4); - newEarlybirdFeatureConfiguration(NUM_STOCKS, - ThriftCSFType.INT, 10, 8, 4); - - // Remaining bits for integer 10 - // Production Toxicity and PBlock score from HML (go/toxicity, go/pblock) - newEarlybirdFeatureConfiguration(TOXICITY_SCORE, - ThriftCSFType.DOUBLE, 10, 12, 10); - newEarlybirdFeatureConfiguration(PBLOCK_SCORE, - ThriftCSFType.DOUBLE, 10, 22, 10); - - // The blink engagement counters - newEarlybirdFeatureConfiguration(BLINK_RETWEET_COUNT, - ThriftCSFType.DOUBLE, 11, 0, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(BLINK_REPLY_COUNT, - ThriftCSFType.DOUBLE, 11, 8, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(BLINK_FAVORITE_COUNT, - ThriftCSFType.DOUBLE, 11, 16, 8, ThriftFeatureUpdateConstraint.POSITIVE); - newEarlybirdFeatureConfiguration(BLINK_QUOTE_COUNT, - ThriftCSFType.DOUBLE, 11, 24, 8, ThriftFeatureUpdateConstraint.POSITIVE); - - // Experimental health model scores from HML - newEarlybirdFeatureConfiguration(EXPERIMENTAL_HEALTH_MODEL_SCORE_1, - ThriftCSFType.DOUBLE, 12, 0, 10); - newEarlybirdFeatureConfiguration(EXPERIMENTAL_HEALTH_MODEL_SCORE_2, - ThriftCSFType.DOUBLE, 12, 10, 10); - newEarlybirdFeatureConfiguration(EXPERIMENTAL_HEALTH_MODEL_SCORE_3, - ThriftCSFType.DOUBLE, 12, 20, 10); - // remaining bits for integer 12 - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_12_30_2, - ThriftCSFType.INT, 12, 30, 2); - - // Experimental health model scores from HML (cont.) 
- newEarlybirdFeatureConfiguration(EXPERIMENTAL_HEALTH_MODEL_SCORE_4, - ThriftCSFType.DOUBLE, 13, 0, 10); - // Production pSpammyTweet score from HML (go/pspammytweet) - newEarlybirdFeatureConfiguration(P_SPAMMY_TWEET_SCORE, - ThriftCSFType.DOUBLE, 13, 10, 10); - // Production pReportedTweet score from HML (go/preportedtweet) - newEarlybirdFeatureConfiguration(P_REPORTED_TWEET_SCORE, - ThriftCSFType.DOUBLE, 13, 20, 10); - // remaining bits for integer 13 - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_13_30_2, - ThriftCSFType.INT, 13, 30, 2); - - // Experimental health model scores from HML (cont.) - // Prod Spammy Tweet Content model score from Platform Manipulation (go/spammy-tweet-content) - newEarlybirdFeatureConfiguration(SPAMMY_TWEET_CONTENT_SCORE, - ThriftCSFType.DOUBLE, 14, 0, 10); - // remaining bits for integer 14 - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_14_10_22, - ThriftCSFType.INT, 14, 10, 22); - - // Note that the integer index below is 0-based, but the index j in UNUSED_BITS_{j} below - // is 1-based. - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_16, - ThriftCSFType.INT, 15, 0, 32); - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_17, - ThriftCSFType.INT, 16, 0, 32); - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_18, - ThriftCSFType.INT, 17, 0, 32); - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_19, - ThriftCSFType.INT, 18, 0, 32); - newEarlybirdFeatureConfiguration(EXTENDED_TEST_FEATURE_UNUSED_BITS_20, - ThriftCSFType.INT, 19, 0, 32); - } - - private EarlybirdSchemaCreateTool() { } - - /** - * Get schema for the Earlybird. - */ - public static DynamicSchema buildSchema(EarlybirdCluster cluster) - throws Schema.SchemaValidationException { - SCHEMA_BUILD_COUNT.increment(); - return new DynamicSchema(new ImmutableSchema(buildThriftSchema(cluster), - new AnalyzerFactory(), - cluster.getNameForStats())); - } - - /** - * Get schema for the Earlybird, can throw runtime exception. This is mostly for static schema - * usage, which does not care about schema updates. - */ - @VisibleForTesting - public static DynamicSchema buildSchemaWithRuntimeException(EarlybirdCluster cluster) { - try { - return buildSchema(cluster); - } catch (Schema.SchemaValidationException e) { - throw new RuntimeException(e); - } - } - - private static FeatureConfiguration newEarlybirdFeatureConfiguration( - EarlybirdFieldConstant fieldConstant, - ThriftCSFType type, - int intIndex, int bitStartPos, int bitLength, - ThriftFeatureUpdateConstraint... 
constraints) { - - if (!fieldConstant.isFlagFeatureField() && type == ThriftCSFType.BOOLEAN) { - throw new IllegalArgumentException( - "Non-flag feature field configured with boolean Thrift type: " + fieldConstant); - } - if (fieldConstant.isFlagFeatureField() && type != ThriftCSFType.BOOLEAN) { - throw new IllegalArgumentException( - "Flag feature field configured with non-boolean Thrift type: " + fieldConstant); - } - - String baseFieldName = getBaseFieldName(fieldConstant); - String name = getFeatureNameInField(fieldConstant); - FeatureConfiguration.Builder builder = FeatureConfiguration.builder() - .withName(name) - .withType(type) - .withBitRange(intIndex, bitStartPos, bitLength); - // remove the following line once we configure features purely by the schema - builder.withBaseField(baseFieldName); - - if (!fieldConstant.isUnusedField()) { - builder.withOutputType(type); - } - if (fieldConstant.getFeatureNormalizationType() != null) { - builder.withFeatureNormalizationType(fieldConstant.getFeatureNormalizationType()); - } - - for (ThriftFeatureUpdateConstraint constraint : constraints) { - builder.withFeatureUpdateConstraint(constraint); - } - FeatureConfiguration featureConfiguration = builder.build(); - FEATURE_CONFIGURATION_MAP.put(fieldConstant.getFieldName(), featureConfiguration); - return featureConfiguration; - } - - /** - * Build ThriftSchema for the Earlybird. Note that the schema returned can be used - * all Earlybird clusters. However, some clusters may not use all the field configurations. - */ - @VisibleForTesting - public static ThriftSchema buildThriftSchema(EarlybirdCluster cluster) { - EarlybirdSchemaBuilder builder = new EarlybirdSchemaBuilder( - new EarlybirdFieldConstants(), cluster, TokenStreamSerializer.Version.VERSION_2); - - builder.withSchemaVersion( - FlushVersion.CURRENT_FLUSH_VERSION.getVersionNumber(), - FlushVersion.CURRENT_FLUSH_VERSION.getMinorVersion(), - FlushVersion.CURRENT_FLUSH_VERSION.getDescription(), - FlushVersion.CURRENT_FLUSH_VERSION.isOfficial()); - - // ID field, used for partitioning - builder.withPartitionFieldId(0) - .withSortableLongTermField(EarlybirdFieldConstant.ID_FIELD.getFieldName()) - // Text Fields that are searched by default - .withTextField(EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName(), true) - .withSearchFieldByDefault( - EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName(), 0.1f) - .withPretokenizedTextField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), true) - .withSearchFieldByDefault(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), 1.0f); - builder.withTweetSpecificNormalization(EarlybirdFieldConstant.TEXT_FIELD.getFieldName()) - .withTextField(EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), true) - .withSearchFieldByDefault( - EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), 0.2f) - - // Text fields not searched by default - .withTextField(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), false) - .withTextField(EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(), false) - - // cards are not searched by default, and have weight 0. 
- .withPretokenizedTextField(EarlybirdFieldConstant.CARD_TITLE_FIELD.getFieldName(), false) - .withPretokenizedTextField( - EarlybirdFieldConstant.CARD_DESCRIPTION_FIELD.getFieldName(), false) - .withTextField(EarlybirdFieldConstant.CARD_LANG.getFieldName(), false) - - // Out-of-order append fields - .withLongTermField(EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.RETWEETED_BY_USER_ID.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID.getFieldName()) - - // No Position fields, sorted alphabetically - .withPretokenizedNoPositionField(EarlybirdFieldConstant.CARD_DOMAIN_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.CARD_NAME_FIELD.getFieldName()) - .withIntTermField(EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.RETWEET_SOURCE_TWEET_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.RETWEET_SOURCE_USER_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName()) - .withTextField(EarlybirdFieldConstant.PLACE_FULL_NAME_FIELD.getFieldName(), false) - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.PLACE_COUNTRY_CODE_FIELD.getFieldName()) - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.PROFILE_GEO_COUNTRY_CODE_FIELD.getFieldName()) - .withTextField(EarlybirdFieldConstant.PROFILE_GEO_REGION_FIELD.getFieldName(), false) - .withTextField(EarlybirdFieldConstant.PROFILE_GEO_LOCALITY_FIELD.getFieldName(), false) - .withTermTextLookup(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName()) - .withTermTextLookup(EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName()) - .withPretokenizedNoPositionField(EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName()) - .withIndexedNotTokenizedField(ImmutableSchema.HF_PHRASE_PAIRS_FIELD) - .withIndexedNotTokenizedField(ImmutableSchema.HF_TERM_PAIRS_FIELD) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.IMAGE_LINKS_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.ISO_LANGUAGE_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.LINKS_FIELD.getFieldName()) - .withIntTermField(EarlybirdFieldConstant.LINK_CATEGORY_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.NEWS_LINKS_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.PLACE_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName()) - .withPretokenizedNoPositionField(EarlybirdFieldConstant.STOCKS_FIELD.getFieldName()) - .withIndexedNotTokenizedField(EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName()) - 
.withIntTermField(NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName()) - .withIntTermField(NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName()) - .withIntTermField(NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName()) - - .withIntTermField(EarlybirdFieldConstant.COMPOSER_SOURCE.getFieldName()) - - .withLongTermField(EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName()) - .withLongTermField(EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName()) - - // Named entity fields - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.NAMED_ENTITY_FROM_URL_FIELD.getFieldName(), true) - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.NAMED_ENTITY_FROM_TEXT_FIELD.getFieldName(), true) - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD.getFieldName(), true) - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD.getFieldName(), true) - - // camelCase-tokenized user handles and tokenized user names, not searchable by default - .withPretokenizedTextField( - EarlybirdFieldConstant.CAMELCASE_USER_HANDLE_FIELD.getFieldName(), false) - .withPretokenizedTextField( - EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName(), false) - - .withIndexedNotTokenizedField( - EarlybirdFieldConstant.SPACE_ID_FIELD.getFieldName()) - .withTextField(EarlybirdFieldConstant.SPACE_ADMIN_FIELD.getFieldName(), false) - .withPretokenizedTextField(EarlybirdFieldConstant.SPACE_TITLE_FIELD.getFieldName(), false) - .withTextField(EarlybirdFieldConstant.TOKENIZED_SPACE_ADMIN_FIELD.getFieldName(), true) - .withPretokenizedTextField( - EarlybirdFieldConstant.CAMELCASE_TOKENIZED_SPACE_ADMIN_FIELD.getFieldName(), false) - .withPretokenizedTextField( - EarlybirdFieldConstant.TOKENIZED_SPACE_ADMIN_DISPLAY_NAME_FIELD.getFieldName(), false) - .withPretokenizedTextField( - EarlybirdFieldConstant.URL_DESCRIPTION_FIELD.getFieldName(), false) - .withPretokenizedTextField( - EarlybirdFieldConstant.URL_TITLE_FIELD.getFieldName(), false); - - builder - .withPhotoUrlFacetField(EarlybirdFieldConstant.TWIMG_LINKS_FIELD.getFieldName()) - .withOutOfOrderEnabledForField( - EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName()) - .withOutOfOrderEnabledForField( - EarlybirdFieldConstant.RETWEETED_BY_USER_ID.getFieldName()) - .withOutOfOrderEnabledForField( - EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID.getFieldName()); - - // ColumnStrideFields. 
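// Column stride fields (CSFs) are, roughly, Earlybird's per-document value columns: rather
// than an inverted posting list, a CSF stores a numeric value (or a fixed-width block of
// values) for every document, so features can be looked up directly by document id at
// scoring time, similar in spirit to Lucene doc values. The default computed below keeps
// CSFs in RAM for the non-archive clusters, while the full archive only pins the handful of
// fields explicitly marked as loaded into RAM.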
- boolean loadCSFIntoRAMDefault = cluster != EarlybirdCluster.FULL_ARCHIVE; - - builder - .withColumnStrideField(EarlybirdFieldConstants.ENCODED_TWEET_FEATURES_FIELD_NAME, - ThriftCSFType.INT, NUMBER_OF_INTEGERS_FOR_FEATURES, - true, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, /* the full archive loads this field into RAM */ true) - .withColumnStrideField(EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.CARD_TYPE_CSF_FIELD.getFieldName(), - ThriftCSFType.BYTE, 1, false, loadCSFIntoRAMDefault) - // CSF Used by archive mappers - .withColumnStrideField(EarlybirdFieldConstant.CREATED_AT_CSF_FIELD.getFieldName(), - ThriftCSFType.INT, 1, false, /* the full archive loads this field into RAM */ true) - .withColumnStrideField(EarlybirdFieldConstant.ID_CSF_FIELD.getFieldName(), - ThriftCSFType.LONG, 1, false, /* the full archive loads this field into RAM */ true) - .withColumnStrideField(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.CONVERSATION_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.QUOTED_TWEET_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.QUOTED_USER_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.CARD_LANG_CSF.getFieldName(), - ThriftCSFType.INT, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.CARD_URI_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.DIRECTED_AT_USER_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField(EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - .withColumnStrideField( - EarlybirdFieldConstant.EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF.getFieldName(), - ThriftCSFType.LONG, 1, false, loadCSFIntoRAMDefault) - - /* Semicolon on separate line to preserve git blame. 
*/; - - builder.withColumnStrideField( - EarlybirdFieldConstants.EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME, - ThriftCSFType.INT, NUMBER_OF_INTEGERS_FOR_EXTENDED_FEATURES, - true, loadCSFIntoRAMDefault); - - for (Map.Entry entry : FEATURE_CONFIGURATION_MAP.entrySet()) { - String fullName = entry.getKey(); - String baseName = getBaseFieldName(fullName); - EarlybirdFieldConstant fieldConstant = EarlybirdFieldConstants.getFieldConstant(fullName); - if (fieldConstant.isValidFieldInCluster(cluster)) { - builder.withFeatureConfiguration(baseName, fullName, entry.getValue()); - } - } - // Add facet settings for facet fields - // boolean args are respectively whether to use skiplist, whether offensive, whether to use CSF - builder - .withFacetConfigs(EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName(), - EarlybirdFieldConstant.MENTIONS_FACET, true, false, false) - .withFacetConfigs(EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName(), - EarlybirdFieldConstant.HASHTAGS_FACET, true, false, false) - .withFacetConfigs(EarlybirdFieldConstant.STOCKS_FIELD.getFieldName(), - EarlybirdFieldConstant.STOCKS_FACET, true, false, false) - .withFacetConfigs(EarlybirdFieldConstant.IMAGE_LINKS_FIELD.getFieldName(), - EarlybirdFieldConstant.IMAGES_FACET, true, true, false) - .withFacetConfigs(EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName(), - EarlybirdFieldConstant.VIDEOS_FACET, true, true, false) - .withFacetConfigs(EarlybirdFieldConstant.NEWS_LINKS_FIELD.getFieldName(), - EarlybirdFieldConstant.NEWS_FACET, true, false, false) - .withFacetConfigs(EarlybirdFieldConstant.ISO_LANGUAGE_FIELD.getFieldName(), - EarlybirdFieldConstant.LANGUAGES_FACET, false, false, false) - .withFacetConfigs(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName(), - EarlybirdFieldConstant.SOURCES_FACET, false, false, false) - .withFacetConfigs(EarlybirdFieldConstant.TWIMG_LINKS_FIELD.getFieldName(), - EarlybirdFieldConstant.TWIMG_FACET, true, true, false) - .withFacetConfigs(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(), - EarlybirdFieldConstant.FROM_USER_ID_FACET, false, false, true /* facet on CSF */) - .withFacetConfigs(EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(), - EarlybirdFieldConstant.RETWEETS_FACET, false, false, true /* facet on CSF */) - .withFacetConfigs(EarlybirdFieldConstant.LINKS_FIELD.getFieldName(), - EarlybirdFieldConstant.LINKS_FACET, true, false, false) - .withFacetConfigs( - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD.getFieldName(), - true, false, false) - .withFacetConfigs( - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD.getFieldName(), - true, false, false) - .withFacetConfigs( - EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), - true, false, false) - .withFacetConfigs(EarlybirdFieldConstant.SPACE_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.SPACES_FACET, true, false, false); - return builder.build(); - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentBuilder.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentBuilder.java deleted file mode 100644 index 06666adc0..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentBuilder.java +++ /dev/null @@ -1,897 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import java.io.IOException; -import java.util.HashSet; -import java.util.List; -import java.util.Set; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import 
com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; -import org.apache.lucene.analysis.TokenStream; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntity; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntityContext; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntityInputSourceType; -import com.twitter.cuad.ner.thriftjava.WholeEntityType; -import com.twitter.search.common.constants.SearchCardType; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.indexing.thriftjava.TwitterPhotoUrl; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.ThriftDocumentBuilder; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.util.analysis.CharTermAttributeSerializer; -import com.twitter.search.common.util.analysis.IntTermAttributeSerializer; -import com.twitter.search.common.util.analysis.TermPayloadAttributeSerializer; -import com.twitter.search.common.util.analysis.TwitterPhotoTokenStream; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.common.util.text.TokenizerHelper; -import com.twitter.search.common.util.text.TweetTokenStreamSerializer; -import com.twitter.search.common.util.text.regex.Regex; -import com.twitter.search.common.util.url.LinkVisibilityUtils; -import com.twitter.search.common.util.url.URLUtils; - -import geo.google.datamodel.GeoAddressAccuracy; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; - -/** - * Builder class for building a {@link ThriftDocument}. - */ -public final class EarlybirdThriftDocumentBuilder extends ThriftDocumentBuilder { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdThriftDocumentBuilder.class); - - private static final SearchCounter SERIALIZE_FAILURE_COUNT_NONPENGUIN_DEPENDENT = - SearchCounter.export("tokenstream_serialization_failure_non_penguin_dependent"); - - private static final String HASHTAG_SYMBOL = "#"; - private static final String CASHTAG_SYMBOL = "$"; - private static final String MENTION_SYMBOL = "@"; - - private static final SearchCounter BCP47_LANGUAGE_TAG_COUNTER = - SearchCounter.export("bcp47_language_tag"); - - /** - * Used to check if a card is video card. - * - * @see #withSearchCard - */ - private static final String AMPLIFY_CARD_NAME = "amplify"; - private static final String PLAYER_CARD_NAME = "player"; - - // Extra term indexed for native retweets, to ensure that the "-rt" query excludes them. 
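// withNativeRetweet() further down adds RETWEET_TERM to the text field of every native
// retweet, which is what makes the "-rt" exclusion work; QUESTION_MARK is added to the
// text field the same way by withQuestionMark().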
- public static final String RETWEET_TERM = "rt"; - public static final String QUESTION_MARK = "?"; - - private static final Set NAMED_ENTITY_URL_SOURCE_TYPES = - ImmutableSet.of( - NamedEntityInputSourceType.URL_TITLE, NamedEntityInputSourceType.URL_DESCRIPTION); - - private final TokenStreamSerializer intTermAttributeSerializer = - new TokenStreamSerializer(ImmutableList.of( - new IntTermAttributeSerializer())); - private final TokenStreamSerializer photoUrlSerializer = - new TokenStreamSerializer(ImmutableList - .of( - new CharTermAttributeSerializer(), new TermPayloadAttributeSerializer())); - private final Schema schema; - - private boolean isSetLatLonCSF = false; - private boolean addLatLonCSF = true; - private boolean addEncodedTweetFeatures = true; - - @Nonnull - private final EarlybirdEncodedFeatures encodedTweetFeatures; - @Nullable - private final EarlybirdEncodedFeatures extendedEncodedTweetFeatures; - - /** - * Default constructor - */ - public EarlybirdThriftDocumentBuilder( - @Nonnull EarlybirdEncodedFeatures encodedTweetFeatures, - @Nullable EarlybirdEncodedFeatures extendedEncodedTweetFeatures, - FieldNameToIdMapping idMapping, - Schema schema) { - super(idMapping); - this.schema = schema; - this.encodedTweetFeatures = Preconditions.checkNotNull(encodedTweetFeatures); - - this.extendedEncodedTweetFeatures = extendedEncodedTweetFeatures; - } - - /** - * Get internal {@link EarlybirdEncodedFeatures} - */ - public EarlybirdEncodedFeatures getEncodedTweetFeatures() { - return encodedTweetFeatures; - } - - /** - * Add skip list entry for the given field. - * This adds a term __has_fieldName in the INTERNAL field. - */ - public EarlybirdThriftDocumentBuilder addFacetSkipList(String fieldName) { - withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.getFacetSkipFieldName(fieldName)); - return this; - } - - /** - * Add a filter term in the INTERNAL field. - */ - public EarlybirdThriftDocumentBuilder addFilterInternalFieldTerm(String filterName) { - withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdThriftDocumentUtil.formatFilter(filterName)); - return this; - } - - /** - * Add id field and id csf field. - */ - public EarlybirdThriftDocumentBuilder withID(long id) { - withLongField(EarlybirdFieldConstant.ID_FIELD.getFieldName(), id); - withLongField(EarlybirdFieldConstant.ID_CSF_FIELD.getFieldName(), id); - return this; - } - - /** - * Add created at field and created at csf field. - */ - public EarlybirdThriftDocumentBuilder withCreatedAt(int createdAt) { - withIntField(EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName(), createdAt); - withIntField(EarlybirdFieldConstant.CREATED_AT_CSF_FIELD.getFieldName(), createdAt); - return this; - } - - /** - * Add tweet text field. - */ - public EarlybirdThriftDocumentBuilder withTweetText( - String text, byte[] textTokenStream) throws IOException { - withTokenStreamField(EarlybirdFieldConstants.EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - text, textTokenStream); - return this; - } - - public EarlybirdThriftDocumentBuilder withTweetText(String text) throws IOException { - withTweetText(text, null); - return this; - } - - /** - * Add a list of cashTags. Like $TWTR. 
- */ - public EarlybirdThriftDocumentBuilder withStocksFields(List cashTags) { - if (isNotEmpty(cashTags)) { - addFacetSkipList(EarlybirdFieldConstant.STOCKS_FIELD.getFieldName()); - for (String cashTag : cashTags) { - withStringField( - EarlybirdFieldConstant.STOCKS_FIELD.getFieldName(), CASHTAG_SYMBOL + cashTag); - } - } - return this; - } - - /** - * Add a list of hashtags. - */ - public EarlybirdThriftDocumentBuilder withHashtagsField(List hashtags) { - if (isNotEmpty(hashtags)) { - int numHashtags = Math.min( - hashtags.size(), - schema.getFeatureConfigurationById( - EarlybirdFieldConstant.NUM_HASHTAGS.getFieldId()).getMaxValue()); - encodedTweetFeatures.setFeatureValue(EarlybirdFieldConstant.NUM_HASHTAGS, numHashtags); - addFacetSkipList(EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName()); - for (String hashtag : hashtags) { - withStringField( - EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName(), HASHTAG_SYMBOL + hashtag); - } - } - return this; - } - - /** - * Added a list of mentions. - */ - public EarlybirdThriftDocumentBuilder withMentionsField(List mentions) { - if (isNotEmpty(mentions)) { - int numMentions = Math.min( - mentions.size(), - schema.getFeatureConfigurationById( - EarlybirdFieldConstant.NUM_HASHTAGS.getFieldId()).getMaxValue()); - encodedTweetFeatures.setFeatureValue(EarlybirdFieldConstant.NUM_MENTIONS, numMentions); - addFacetSkipList(EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName()); - for (String mention : mentions) { - withStringField( - EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName(), MENTION_SYMBOL + mention); - } - } - return this; - } - - /** - * Add a list of Twitter Photo URLs (twimg URLs). These are different from regular URLs, because - * we use the TwitterPhotoTokenStream to index them, and we also include the status ID as payload. - */ - public EarlybirdThriftDocumentBuilder withTwimgURLs( - List urls) throws IOException { - if (isNotEmpty(urls)) { - for (TwitterPhotoUrl photoUrl : urls) { - TokenStream ts = new TwitterPhotoTokenStream(photoUrl.getPhotoStatusId(), - photoUrl.getMediaUrl()); - byte[] serializedTs = photoUrlSerializer.serialize(ts); - withTokenStreamField(EarlybirdFieldConstant.TWIMG_LINKS_FIELD.getFieldName(), - Long.toString(photoUrl.getPhotoStatusId()), serializedTs); - addFacetSkipList(EarlybirdFieldConstant.TWIMG_LINKS_FIELD.getFieldName()); - } - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG); - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_NATIVE_IMAGE_FLAG); - } - return this; - } - - /** - * Add a list of URLs. This also add facet skip list terms for news / images / videos if needed. 
- */ - public EarlybirdThriftDocumentBuilder withURLs(List urls) { - if (isNotEmpty(urls)) { - Set dedupedLinks = Sets.newHashSet(); - - for (ThriftExpandedUrl expandedUrl : urls) { - if (expandedUrl.isSetOriginalUrl()) { - String normalizedOriginalUrl = URLUtils.normalizePath(expandedUrl.getOriginalUrl()); - dedupedLinks.add(normalizedOriginalUrl); - } - if (expandedUrl.isSetExpandedUrl()) { - dedupedLinks.add(URLUtils.normalizePath(expandedUrl.getExpandedUrl())); - } - - if (expandedUrl.isSetCanonicalLastHopUrl()) { - String url = URLUtils.normalizePath(expandedUrl.getCanonicalLastHopUrl()); - dedupedLinks.add(url); - - String facetUrl = URLUtils.normalizeFacetURL(url); - - if (expandedUrl.isSetMediaType()) { - switch (expandedUrl.getMediaType()) { - case NEWS: - withStringField(EarlybirdFieldConstant.NEWS_LINKS_FIELD.getFieldName(), url); - addFacetSkipList(EarlybirdFieldConstant.NEWS_LINKS_FIELD.getFieldName()); - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_NEWS_URL_FLAG); - break; - case VIDEO: - withStringField(EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName(), facetUrl); - addFacetSkipList(EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName()); - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG); - break; - case IMAGE: - withStringField(EarlybirdFieldConstant.IMAGE_LINKS_FIELD.getFieldName(), facetUrl); - addFacetSkipList(EarlybirdFieldConstant.IMAGE_LINKS_FIELD.getFieldName()); - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG); - break; - case NATIVE_IMAGE: - // Nothing done here. Native images are handled separately. - // They are in PhotoUrls instead of expandedUrls. - break; - case UNKNOWN: - break; - default: - throw new RuntimeException("Unknown Media Type: " + expandedUrl.getMediaType()); - } - } - - if (expandedUrl.isSetLinkCategory()) { - withIntField(EarlybirdFieldConstant.LINK_CATEGORY_FIELD.getFieldName(), - expandedUrl.getLinkCategory().getValue()); - } - } - } - - if (!dedupedLinks.isEmpty()) { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_LINK_FLAG); - - addFacetSkipList(EarlybirdFieldConstant.LINKS_FIELD.getFieldName()); - - for (String linkUrl : dedupedLinks) { - withStringField(EarlybirdFieldConstant.LINKS_FIELD.getFieldName(), linkUrl); - } - } - - encodedTweetFeatures.setFlagValue( - EarlybirdFieldConstant.HAS_VISIBLE_LINK_FLAG, - LinkVisibilityUtils.hasVisibleLink(urls)); - } - - return this; - } - - /** - * Add a list of places. The place are U64 encoded place IDs. - */ - public EarlybirdThriftDocumentBuilder withPlacesField(List places) { - if (isNotEmpty(places)) { - for (String place : places) { - withStringField(EarlybirdFieldConstant.PLACE_FIELD.getFieldName(), place); - } - } - return this; - } - - /** - * Add tweet text signature field. - */ - public EarlybirdThriftDocumentBuilder withTweetSignature(int signature) { - encodedTweetFeatures.setFeatureValue(EarlybirdFieldConstant.TWEET_SIGNATURE, signature); - return this; - } - - /** - * Add geo hash field and internal filter field. 
- */ - public EarlybirdThriftDocumentBuilder withGeoHash(double lat, double lon, int accuracy) { - if (GeoUtil.validateGeoCoordinates(lat, lon)) { - withGeoField( - EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), - lat, lon, accuracy); - withLatLonCSF(lat, lon); - } - return this; - } - - public EarlybirdThriftDocumentBuilder withGeoHash(double lat, double lon) { - withGeoHash(lat, lon, GeoAddressAccuracy.UNKNOWN_LOCATION.getCode()); - return this; - } - - /** - * Add geo location source to the internal field with ThriftGeoLocationSource object. - */ - public EarlybirdThriftDocumentBuilder withGeoLocationSource( - ThriftGeoLocationSource geoLocationSource) { - if (geoLocationSource != null) { - withGeoLocationSource(EarlybirdFieldConstants.formatGeoType(geoLocationSource)); - } - return this; - } - - /** - * Add geo location source to the internal field. - */ - public EarlybirdThriftDocumentBuilder withGeoLocationSource(String geoLocationSource) { - withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), geoLocationSource); - return this; - } - - /** - * Add encoded lat and lon to LatLonCSF field. - */ - public EarlybirdThriftDocumentBuilder withLatLonCSF(double lat, double lon) { - isSetLatLonCSF = true; - long encodedLatLon = GeoUtil.encodeLatLonIntoInt64((float) lat, (float) lon); - withLongField(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName(), encodedLatLon); - return this; - } - - /** - * Add from verified account flag to internal field. - */ - public EarlybirdThriftDocumentBuilder withFromVerifiedAccountFlag() { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG); - addFilterInternalFieldTerm(EarlybirdFieldConstant.VERIFIED_FILTER_TERM); - return this; - } - - /** - * Add from blue-verified account flag to internal field. - */ - public EarlybirdThriftDocumentBuilder withFromBlueVerifiedAccountFlag() { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG); - addFilterInternalFieldTerm(EarlybirdFieldConstant.BLUE_VERIFIED_FILTER_TERM); - return this; - } - - /** - * Add offensive flag to internal field. - */ - public EarlybirdThriftDocumentBuilder withOffensiveFlag() { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG); - withStringField( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.IS_OFFENSIVE); - return this; - } - - /** - * Add user reputation value to encoded feature. - */ - public EarlybirdThriftDocumentBuilder withUserReputation(byte score) { - encodedTweetFeatures.setFeatureValue(EarlybirdFieldConstant.USER_REPUTATION, score); - return this; - } - - /** - * This method creates the fields related to document language. - * For most languages, their isoLanguageCode and bcp47LanguageTag are the same. - * For some languages with variants, these two fields are different. - * E.g. for simplified Chinese, their isoLanguageCode is zh, but their bcp47LanguageTag is zh-cn. - *

- * This method adds fields for both the isoLanguageCode and bcp47LanguageTag. - */ - public EarlybirdThriftDocumentBuilder withLanguageCodes( - String isoLanguageCode, String bcp47LanguageTag) { - if (isoLanguageCode != null) { - withISOLanguage(isoLanguageCode); - } - if (bcp47LanguageTag != null && !bcp47LanguageTag.equals(isoLanguageCode)) { - BCP47_LANGUAGE_TAG_COUNTER.increment(); - withISOLanguage(bcp47LanguageTag); - } - return this; - } - - /** - * Adds a String field into the ISO_LANGUAGE_FIELD. - */ - public EarlybirdThriftDocumentBuilder withISOLanguage(String languageString) { - withStringField( - EarlybirdFieldConstant.ISO_LANGUAGE_FIELD.getFieldName(), languageString.toLowerCase()); - return this; - } - - /** - * Add from user ID fields. - */ - public EarlybirdThriftDocumentBuilder withFromUserID(long fromUserId) { - withLongField(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), fromUserId); - withLongField(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(), fromUserId); - return this; - } - - /** - * Add from user information fields. - */ - public EarlybirdThriftDocumentBuilder withFromUser( - long fromUserId, String fromUser) { - withFromUser(fromUserId, fromUser, null); - return this; - } - - /** - * Add from user information fields. - */ - public EarlybirdThriftDocumentBuilder withFromUser(String fromUser) { - withFromUser(fromUser, null); - return this; - } - - /** - * Add from user information fields. - */ - public EarlybirdThriftDocumentBuilder withFromUser( - String fromUser, String tokenizedFromUser) { - withStringField(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), fromUser); - withStringField(EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), - isNotBlank(tokenizedFromUser) ? tokenizedFromUser : fromUser); - return this; - } - - /** - * Add from user information fields. - */ - public EarlybirdThriftDocumentBuilder withFromUser( - long fromUserId, String fromUser, String tokenizedFromUser) { - withFromUserID(fromUserId); - withFromUser(fromUser, tokenizedFromUser); - return this; - } - - /** - * Add to user field. - */ - public EarlybirdThriftDocumentBuilder withToUser( - String toUser) { - withStringField(EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(), toUser); - return this; - } - - /** - * Add escherbird annotation fields. - */ - public EarlybirdThriftDocumentBuilder withAnnotationEntities(List entities) { - if (isNotEmpty(entities)) { - for (String entity : entities) { - withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), entity); - } - } - return this; - } - - /** - * Add replies to internal field and set is reply flag. - */ - public EarlybirdThriftDocumentBuilder withReplyFlag() { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.IS_REPLY_FLAG); - addFilterInternalFieldTerm(EarlybirdFieldConstant.REPLIES_FILTER_TERM); - return this; - } - - public EarlybirdThriftDocumentBuilder withCameraComposerSourceFlag() { - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.COMPOSER_SOURCE_IS_CAMERA_FLAG); - return this; - } - - /** - * Add in reply to user id. - *

- * Notice {@link #withReplyFlag} is not automatically called since retweet a tweet that is - * a reply to some other tweet is not considered a reply. - * The caller should call {@link #withReplyFlag} separately if this tweet is really a reply tweet. - */ - public EarlybirdThriftDocumentBuilder withInReplyToUserID(long inReplyToUserID) { - withLongField(EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName(), inReplyToUserID); - return this; - } - - /** - * Add reference tweet author id. - */ - public EarlybirdThriftDocumentBuilder withReferenceAuthorID(long referenceAuthorID) { - withLongField(EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF.getFieldName(), referenceAuthorID); - return this; - } - - /** - * Add all native retweet related fields/label - */ - @VisibleForTesting - public EarlybirdThriftDocumentBuilder withNativeRetweet(final long retweetUserID, - final long sharedStatusID) { - withLongField(EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(), sharedStatusID); - - withLongField(EarlybirdFieldConstant.RETWEET_SOURCE_TWEET_ID_FIELD.getFieldName(), - sharedStatusID); - withLongField(EarlybirdFieldConstant.RETWEET_SOURCE_USER_ID_FIELD.getFieldName(), - retweetUserID); - withLongField(EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF.getFieldName(), retweetUserID); - - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.IS_RETWEET_FLAG); - - // Add native retweet label to the internal field. - addFilterInternalFieldTerm(EarlybirdFieldConstant.NATIVE_RETWEETS_FILTER_TERM); - withStringField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), RETWEET_TERM); - return this; - } - - /** - * Add quoted tweet id and user id. - */ - @VisibleForTesting - public EarlybirdThriftDocumentBuilder withQuote( - final long quotedStatusId, final long quotedUserId) { - withLongField(EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName(), quotedStatusId); - withLongField(EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName(), quotedUserId); - - withLongField(EarlybirdFieldConstant.QUOTED_TWEET_ID_CSF.getFieldName(), quotedStatusId); - withLongField(EarlybirdFieldConstant.QUOTED_USER_ID_CSF.getFieldName(), quotedUserId); - - encodedTweetFeatures.setFlag(EarlybirdFieldConstant.HAS_QUOTE_FLAG); - - // Add quote label to the internal field. - addFilterInternalFieldTerm(EarlybirdFieldConstant.QUOTE_FILTER_TERM); - return this; - } - - /** - * Add resolved links text field. - */ - public EarlybirdThriftDocumentBuilder withResolvedLinksText(String linksText) { - withStringField(EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName(), linksText); - return this; - } - - /** - * Add source field. - */ - public EarlybirdThriftDocumentBuilder withSource(String source) { - withStringField(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName(), source); - return this; - } - - /** - * Add normalized source field. - */ - public EarlybirdThriftDocumentBuilder withNormalizedSource(String normalizedSource) { - withStringField( - EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName(), normalizedSource); - return this; - } - - /** - * Add positive smiley to internal field. - */ - public EarlybirdThriftDocumentBuilder withPositiveSmiley() { - withStringField( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.HAS_POSITIVE_SMILEY); - return this; - } - - /** - * Add negative smiley to internal field. 
- */ - public EarlybirdThriftDocumentBuilder withNegativeSmiley() { - withStringField( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.HAS_NEGATIVE_SMILEY); - return this; - } - - /** - * Add question mark label to a text field. - */ - public EarlybirdThriftDocumentBuilder withQuestionMark() { - withStringField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), QUESTION_MARK); - return this; - } - - /** - * Add card related fields. - */ - public EarlybirdThriftDocumentBuilder withSearchCard( - String name, - String domain, - String title, byte[] serializedTitleStream, - String description, byte[] serializedDescriptionStream, - String lang) { - if (isNotBlank(title)) { - withTokenStreamField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_TITLE_FIELD.getFieldName(), - title, serializedTitleStream); - } - - if (isNotBlank(description)) { - withTokenStreamField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_DESCRIPTION_FIELD.getFieldName(), - description, serializedDescriptionStream); - } - - if (isNotBlank(lang)) { - withStringField(EarlybirdFieldConstant.CARD_LANG.getFieldName(), lang); - } - - if (isNotBlank(domain)) { - withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_DOMAIN_FIELD.getFieldName(), domain); - } - - if (isNotBlank(name)) { - withStringField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_NAME_FIELD.getFieldName(), name); - withIntField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CARD_TYPE_CSF_FIELD.getFieldName(), - SearchCardType.cardTypeFromStringName(name).getByteValue()); - } - - if (AMPLIFY_CARD_NAME.equalsIgnoreCase(name) - || PLAYER_CARD_NAME.equalsIgnoreCase(name)) { - // Add into "internal" field so that this tweet is returned by filter:videos. - addFacetSkipList( - EarlybirdFieldConstants.EarlybirdFieldConstant.VIDEO_LINKS_FIELD.getFieldName()); - } - - return this; - } - - public EarlybirdThriftDocumentBuilder withNormalizedMinEngagementField( - String fieldName, int normalizedNumEngagements) throws IOException { - EarlybirdThriftDocumentUtil.addNormalizedMinEngagementField(doc, fieldName, - normalizedNumEngagements); - return this; - } - - /** - * Add named entity with given canonical name and type to document. - */ - public EarlybirdThriftDocumentBuilder withNamedEntity(NamedEntity namedEntity) { - if (namedEntity.getContexts() == null) { - // In this unlikely case, we don't have any context for named entity type or source, - // so we can't properly index it in any of our fields. We'll just skip it in this case. 
- return this; - } - - // Keep track of the fields we've applied in the builder already, to ensure we only index - // each term (field/value pair) once - Set> fieldsApplied = new HashSet<>(); - for (NamedEntityContext context : namedEntity.getContexts()) { - if (context.isSetInput_source() - && NAMED_ENTITY_URL_SOURCE_TYPES.contains(context.getInput_source().getSource_type())) { - // If the source is one of the URL* types, add the named entity to the "from_url" fields, - // ensuring we add it only once - addNamedEntityFields( - fieldsApplied, - EarlybirdFieldConstant.NAMED_ENTITY_FROM_URL_FIELD, - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD, - namedEntity.getCanonical_name(), - context); - } else { - addNamedEntityFields( - fieldsApplied, - EarlybirdFieldConstant.NAMED_ENTITY_FROM_TEXT_FIELD, - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD, - namedEntity.getCanonical_name(), - context); - } - } - - return this; - } - - /** - * Add space id fields. - */ - public EarlybirdThriftDocumentBuilder withSpaceIdFields(Set spaceIds) { - if (!spaceIds.isEmpty()) { - addFacetSkipList(EarlybirdFieldConstant.SPACE_ID_FIELD.getFieldName()); - for (String spaceId : spaceIds) { - withStringField(EarlybirdFieldConstant.SPACE_ID_FIELD.getFieldName(), spaceId); - } - } - return this; - } - - /** - * Add directed at user. - */ - @VisibleForTesting - public EarlybirdThriftDocumentBuilder withDirectedAtUser(final long directedAtUserId) { - withLongField(EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName(), - directedAtUserId); - - withLongField(EarlybirdFieldConstant.DIRECTED_AT_USER_ID_CSF.getFieldName(), directedAtUserId); - - return this; - } - - /** - * Add a white space tokenized screen name field. - * - * Example: - * screenName - "super_hero" - * tokenized version - "super hero" - */ - public EarlybirdThriftDocumentBuilder withWhiteSpaceTokenizedScreenNameField( - String fieldName, - String normalizedScreenName) { - String whiteSpaceTokenizableScreenName = StringUtils.join( - normalizedScreenName.split(Regex.HASHTAG_USERNAME_PUNCTUATION_REGEX), " "); - withStringField(fieldName, whiteSpaceTokenizableScreenName); - return this; - } - - /** - * Add a camel case tokenized screen name field. - */ - public EarlybirdThriftDocumentBuilder withCamelCaseTokenizedScreenNameField( - String fieldName, - String screenName, - String normalizedScreenName, - TokenStream screenNameTokenStream) { - - // this normalized text is consistent to how the tokenized stream is created from - // TokenizerHelper.getNormalizedCamelcaseTokenStream - ie. just lowercasing. - String camelCaseTokenizedScreenNameText = - TokenizerHelper.getNormalizedCamelcaseTokenStreamText(screenName); - try { - // Reset the token stream in case it has been read before. - screenNameTokenStream.reset(); - byte[] camelCaseTokenizedScreenName = - TweetTokenStreamSerializer.getTweetTokenStreamSerializer() - .serialize(screenNameTokenStream); - - withTokenStreamField( - fieldName, - camelCaseTokenizedScreenNameText.isEmpty() - ? normalizedScreenName : camelCaseTokenizedScreenNameText, - camelCaseTokenizedScreenName); - } catch (IOException e) { - LOG.error("TwitterTokenStream serialization error! 
Could not serialize: " + screenName); - SERIALIZE_FAILURE_COUNT_NONPENGUIN_DEPENDENT.increment(); - } - return this; - } - - private void addNamedEntityFields( - Set> fieldsApplied, - EarlybirdFieldConstant nameOnlyField, - EarlybirdFieldConstant nameWithTypeField, - String name, - NamedEntityContext context) { - withOneTimeStringField(fieldsApplied, nameOnlyField, name, false); - if (context.isSetEntity_type()) { - withOneTimeStringField(fieldsApplied, nameWithTypeField, - formatNamedEntityString(name, context.getEntity_type()), true); - } - } - - private void withOneTimeStringField( - Set> fieldsApplied, EarlybirdFieldConstant field, - String value, boolean addToFacets) { - Pair fieldValuePair = new Pair<>(field, value); - if (!fieldsApplied.contains(fieldValuePair)) { - if (addToFacets) { - addFacetSkipList(field.getFieldName()); - } - withStringField(field.getFieldName(), value); - fieldsApplied.add(fieldValuePair); - } - } - - private String formatNamedEntityString(String name, WholeEntityType type) { - return String.format("%s:%s", name, type).toLowerCase(); - } - - /** - * Set whether set LAT_LON_CSF_FIELD or not before build - * if LAT_LON_CSF_FIELD is not set deliberately. - * - * @see #prepareToBuild() - */ - public EarlybirdThriftDocumentBuilder setAddLatLonCSF(boolean isSet) { - addLatLonCSF = isSet; - return this; - } - - /** - * Set if add encoded tweet feature field in the end. - * - * @see #prepareToBuild() - */ - public EarlybirdThriftDocumentBuilder setAddEncodedTweetFeatures(boolean isSet) { - addEncodedTweetFeatures = isSet; - return this; - } - - @Override - protected void prepareToBuild() { - if (!isSetLatLonCSF && addLatLonCSF) { - // In lucene archives, this CSF is needed regardless of whether geoLocation is set. - withLatLonCSF(GeoUtil.ILLEGAL_LATLON, GeoUtil.ILLEGAL_LATLON); - } - - if (addEncodedTweetFeatures) { - // Add encoded_tweet_features before building the document. - withBytesField( - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD.getFieldName(), - EarlybirdEncodedFeaturesUtil.toBytesForThriftDocument(encodedTweetFeatures)); - } - - if (extendedEncodedTweetFeatures != null) { - // Add extended_encoded_tweet_features before building the document. 
- withBytesField( - EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD.getFieldName(), - EarlybirdEncodedFeaturesUtil.toBytesForThriftDocument(extendedEncodedTweetFeatures)); - } - } - - private static boolean isNotBlank(String value) { - return value != null && !value.isEmpty(); - } - - private static boolean isNotEmpty(List value) { - return value != null && !value.isEmpty(); - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentUtil.java b/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentUtil.java deleted file mode 100644 index b8a13722b..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/EarlybirdThriftDocumentUtil.java +++ /dev/null @@ -1,377 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import java.io.IOException; -import java.util.Iterator; -import java.util.List; - -import com.google.common.collect.ImmutableList; - -import com.twitter.common.text.util.TokenStreamSerializer; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftFieldData; -import com.twitter.search.common.util.analysis.IntTermAttributeSerializer; -import com.twitter.search.common.util.analysis.TwitterNormalizedMinEngagementTokenStream; - -/** - * Utility APIs for ThriftDocument used in Earlybird. - */ -public final class EarlybirdThriftDocumentUtil { - private static final EarlybirdFieldConstants ID_MAPPING = new EarlybirdFieldConstants(); - - private static final String FILTER_FORMAT_STRING = "__filter_%s"; - - /** - * Used to check whether a thrift document has filter nullcast internal field set. - * @see #isNullcastFilterSet(ThriftDocument) - */ - private static final String NULLCAST_FILTER_TERM = - formatFilter(EarlybirdFieldConstant.NULLCAST_FILTER_TERM); - - private static final String SELF_THREAD_FILTER_TERM = - formatFilter(EarlybirdFieldConstant.SELF_THREAD_FILTER_TERM); - - private static final String DIRECTED_AT_FILTER_TERM = - formatFilter(EarlybirdFieldConstant.DIRECTED_AT_FILTER_TERM); - - private EarlybirdThriftDocumentUtil() { - // Cannot instantiate. - } - - /** - * Formats a regular, simple filter term. The 'filter' argument should correspond to a constant - * from the Operator class, matching the operand (filter:links -> "links"). - */ - public static final String formatFilter(String filter) { - return String.format(FILTER_FORMAT_STRING, filter); - } - - /** - * Get status id. - */ - public static long getID(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.ID_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get Card name. - */ - public static String getCardName(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.CARD_NAME_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get Card language. - */ - public static String getCardLang(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.CARD_LANG.getFieldName(), ID_MAPPING); - } - - /** - * Get Card language CSF. - * - * card language CSF is represented internally as an integer ID for a ThriftLanguage. 
- */ - public static int getCardLangCSF(ThriftDocument document) { - return ThriftDocumentUtil.getIntValue( - document, EarlybirdFieldConstant.CARD_LANG_CSF.getFieldName(), ID_MAPPING); - } - - /** - * Get quoted tweet id. - */ - public static long getQuotedTweetID(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get quoted tweet user id. - */ - public static long getQuotedUserID(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get directed at user id. - */ - public static long getDirectedAtUserId(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get directed at user id CSF. - */ - public static long getDirectedAtUserIdCSF(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.DIRECTED_AT_USER_ID_CSF.getFieldName(), ID_MAPPING); - } - - /** - * Get reference author id CSF. - */ - public static long getReferenceAuthorIdCSF(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF.getFieldName(), ID_MAPPING); - } - - /** - * Get links. - */ - public static List getLinks(ThriftDocument document) { - return getStringValues(document, EarlybirdFieldConstant.LINKS_FIELD); - } - - /** - * Get created at timestamp in sec. - */ - public static int getCreatedAtSec(ThriftDocument document) { - return ThriftDocumentUtil.getIntValue( - document, EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get created at timestamp in ms. - */ - public static long getCreatedAtMs(ThriftDocument document) { - long createdAtSec = (long) getCreatedAtSec(document); - return createdAtSec * 1000L; - } - - /** - * Get from user id. - */ - public static long getFromUserID(ThriftDocument document) { - return ThriftDocumentUtil.getLongValue( - document, EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get from user. - */ - public static String getFromUser(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get tokenized from user display name. - */ - public static String getFromUserDisplayName(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get tokenized from user. - */ - public static String getTokenizedFromUser(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get resolved links text. - */ - public static String getResolvedLinksText(ThriftDocument document) { - return ThriftDocumentUtil.getStringValue( - document, EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * Get iso language code. - */ - public static List getISOLanguage(ThriftDocument document) { - return ThriftDocumentUtil.getStringValues( - document, EarlybirdFieldConstant.ISO_LANGUAGE_FIELD.getFieldName(), ID_MAPPING); - } - - /** - * First remove the old timestamp if they exist. 
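As a rough sketch of how the simple getters above are typically used together: the class name below is hypothetical and the document is assumed to have been produced by the ingesters.

package com.twitter.search.common.schema.earlybird;

import com.twitter.search.common.schema.thriftjava.ThriftDocument;

// Sketch only: prints a few basic attributes of an already-built ThriftDocument.
final class ThriftDocumentReadExample {
  private ThriftDocumentReadExample() { }

  static void printBasics(ThriftDocument doc) {
    long tweetId = EarlybirdThriftDocumentUtil.getID(doc);
    long authorId = EarlybirdThriftDocumentUtil.getFromUserID(doc);
    // getCreatedAtMs() is just the second-granularity created_at value multiplied by 1000.
    long createdAtMs = EarlybirdThriftDocumentUtil.getCreatedAtMs(doc);
    System.out.println(tweetId + " by " + authorId + " at " + createdAtMs);
  }
}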
- * Then add the created at and created at csf fields to the given thrift document. - */ - public static void replaceCreatedAtAndCreatedAtCSF(ThriftDocument document, int value) { - removeField(document, EarlybirdFieldConstant.CREATED_AT_FIELD); - removeField(document, EarlybirdFieldConstant.CREATED_AT_CSF_FIELD); - - addIntField(document, EarlybirdFieldConstant.CREATED_AT_FIELD, value); - addIntField(document, EarlybirdFieldConstant.CREATED_AT_CSF_FIELD, value); - } - - /** - * Add the given int value as the given field into the given document. - */ - public static ThriftDocument addIntField( - ThriftDocument document, EarlybirdFieldConstant fieldConstant, int value) { - ThriftFieldData fieldData = new ThriftFieldData().setIntValue(value); - ThriftField field = - new ThriftField().setFieldConfigId(fieldConstant.getFieldId()).setFieldData(fieldData); - document.addToFields(field); - return document; - } - - private static EarlybirdFieldConstant getFeatureField(EarlybirdFieldConstant field) { - if (field.getFieldName().startsWith( - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD.getFieldName())) { - return EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD; - } else if (field.getFieldName().startsWith( - EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD.getFieldName())) { - return EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD; - } else { - throw new IllegalArgumentException("Not a feature field: " + field); - } - } - - /** - * Get the feature value of a field. - */ - public static int getFeatureValue( - ImmutableSchemaInterface schema, - ThriftDocument document, - EarlybirdFieldConstant field) { - - EarlybirdFieldConstant featureField = getFeatureField(field); - - byte[] encodedFeaturesBytes = - ThriftDocumentUtil.getBytesValue(document, featureField.getFieldName(), ID_MAPPING); - - if (encodedFeaturesBytes == null) { - // Treat the feature value as 0 if there is no encoded feature field. - return 0; - } else { - EarlybirdEncodedFeatures encodedFeatures = EarlybirdEncodedFeaturesUtil.fromBytes( - schema, featureField, encodedFeaturesBytes, 0); - return encodedFeatures.getFeatureValue(field); - } - } - - /** - * Check whether the feature flag is set. - */ - public static boolean isFeatureBitSet( - ImmutableSchemaInterface schema, - ThriftDocument document, - EarlybirdFieldConstant field) { - - EarlybirdFieldConstant featureField = getFeatureField(field); - - byte[] encodedFeaturesBytes = - ThriftDocumentUtil.getBytesValue(document, featureField.getFieldName(), ID_MAPPING); - - if (encodedFeaturesBytes == null) { - // Treat the bit as not set if there is no encoded feature field. - return false; - } else { - EarlybirdEncodedFeatures encodedFeatures = EarlybirdEncodedFeaturesUtil.fromBytes( - schema, featureField, encodedFeaturesBytes, 0); - return encodedFeatures.isFlagSet(field); - } - } - - /** - * Check whether nullcast flag is set in the encoded features field. - */ - public static boolean isNullcastBitSet(ImmutableSchemaInterface schema, ThriftDocument document) { - return isFeatureBitSet(schema, document, EarlybirdFieldConstant.IS_NULLCAST_FLAG); - } - - /** - * Remove all fields with the given field constant in a document. 
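The feature helpers above all follow the same pattern: pick the encoded-features field that backs the constant, decode its bytes, then read a single value or flag. A minimal sketch of a caller follows; the class name is hypothetical, a schema and document are assumed to be available, and the commented line stands in for whatever concrete feature constant a real caller would read.

package com.twitter.search.common.schema.earlybird;

import com.twitter.search.common.schema.base.ImmutableSchemaInterface;
import com.twitter.search.common.schema.thriftjava.ThriftDocument;

// Sketch: reads flags that live inside the encoded tweet features bytes.
final class EncodedFeatureReadExample {
  private EncodedFeatureReadExample() { }

  static void inspect(ImmutableSchemaInterface schema, ThriftDocument doc) {
    // Convenience wrapper around isFeatureBitSet(schema, doc, IS_NULLCAST_FLAG).
    boolean nullcast = EarlybirdThriftDocumentUtil.isNullcastBitSet(schema, doc);

    // Any feature backed by ENCODED_TWEET_FEATURES_FIELD or EXTENDED_ENCODED_TWEET_FEATURES_FIELD
    // can be read the same way, e.g. (hypothetical constant name):
    // int count = EarlybirdThriftDocumentUtil.getFeatureValue(schema, doc, EarlybirdFieldConstant.SOME_COUNT_FEATURE);

    System.out.println("nullcast=" + nullcast);
  }
}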
- */ - public static void removeField(ThriftDocument document, EarlybirdFieldConstant fieldConstant) { - List fields = document.getFields(); - if (fields != null) { - Iterator fieldsIterator = fields.iterator(); - while (fieldsIterator.hasNext()) { - if (fieldsIterator.next().getFieldConfigId() == fieldConstant.getFieldId()) { - fieldsIterator.remove(); - } - } - } - } - - /** - * Remove a string field with given fieldConstant and value. - */ - public static void removeStringField( - ThriftDocument document, EarlybirdFieldConstant fieldConstant, String value) { - List fields = document.getFields(); - if (fields != null) { - for (ThriftField field : fields) { - if (field.getFieldConfigId() == fieldConstant.getFieldId() - && field.getFieldData().getStringValue().equals(value)) { - fields.remove(field); - return; - } - } - } - } - - /** - * Adds a new TokenStream field for each engagement counter if normalizedNumEngagements >= 1. - */ - public static void addNormalizedMinEngagementField( - ThriftDocument doc, - String fieldName, - int normalizedNumEngagements) throws IOException { - if (normalizedNumEngagements < 1) { - return; - } - TokenStreamSerializer serializer = - new TokenStreamSerializer(ImmutableList.of(new IntTermAttributeSerializer())); - TwitterNormalizedMinEngagementTokenStream stream = new - TwitterNormalizedMinEngagementTokenStream(normalizedNumEngagements); - byte[] serializedStream = serializer.serialize(stream); - ThriftFieldData fieldData = new ThriftFieldData().setTokenStreamValue(serializedStream); - ThriftField field = new ThriftField().setFieldConfigId(ID_MAPPING.getFieldID(fieldName)) - .setFieldData(fieldData); - doc.addToFields(field); - } - - public static List getStringValues( - ThriftDocument document, EarlybirdFieldConstant field) { - return ThriftDocumentUtil.getStringValues(document, field.getFieldName(), ID_MAPPING); - } - - public static boolean isNullcastFilterSet(ThriftDocument document) { - return isFilterSet(document, NULLCAST_FILTER_TERM); - } - - public static boolean isSelfThreadFilterSet(ThriftDocument document) { - return isFilterSet(document, SELF_THREAD_FILTER_TERM); - } - - public static String getSelfThreadFilterTerm() { - return SELF_THREAD_FILTER_TERM; - } - - public static String getDirectedAtFilterTerm() { - return DIRECTED_AT_FILTER_TERM; - } - - public static boolean isDirectedAtFilterSet(ThriftDocument document) { - return isFilterSet(document, DIRECTED_AT_FILTER_TERM); - } - - /** - * Check whether given filter is set in the internal field. 
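For reference, the filter helpers above rely on a simple naming convention: formatFilter("links") produces "__filter_links", and the is*FilterSet methods just look for that exact term among the INTERNAL_FIELD values. A small sketch, with a hypothetical wrapper class:

package com.twitter.search.common.schema.earlybird;

import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant;
import com.twitter.search.common.schema.thriftjava.ThriftDocument;

// Sketch: the "__filter_%s" convention behind the is*FilterSet helpers.
final class FilterTermExample {
  private FilterTermExample() { }

  static boolean hasNullcast(ThriftDocument doc) {
    // The internal term that marks a nullcast tweet.
    String nullcastTerm =
        EarlybirdThriftDocumentUtil.formatFilter(EarlybirdFieldConstant.NULLCAST_FILTER_TERM);
    System.out.println("looking for internal term: " + nullcastTerm);
    // Equivalent to scanning the INTERNAL_FIELD values for nullcastTerm.
    return EarlybirdThriftDocumentUtil.isNullcastFilterSet(doc);
  }
}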
- */ - private static boolean isFilterSet(ThriftDocument document, String filter) { - List terms = ThriftDocumentUtil.getStringValues( - document, EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), ID_MAPPING); - for (String term : terms) { - if (filter.equals(term)) { - return true; - } - } - return false; - } -} diff --git a/src/java/com/twitter/search/common/schema/earlybird/FlushVersion.java b/src/java/com/twitter/search/common/schema/earlybird/FlushVersion.java deleted file mode 100644 index dd935c90e..000000000 --- a/src/java/com/twitter/search/common/schema/earlybird/FlushVersion.java +++ /dev/null @@ -1,336 +0,0 @@ -package com.twitter.search.common.schema.earlybird; - -import javax.annotation.Nullable; - -import com.twitter.search.common.config.Config; - -public enum FlushVersion { - /* ======================================================= - * Versions - * ======================================================= */ - VERSION_0("Initial version of partition flushing."), - VERSION_1("Added timestamps and corresponding mapper to SegmentData."), - VERSION_2("Add column stride fields."), - VERSION_3("Change facet field configuration."), - VERSION_4("Add per term offensive counters to parallel posting arrays."), - VERSION_5("Add native photo facet."), - VERSION_6("Add UserFeature column stride field"), - VERSION_7("Index segment optimizations; new facet data structures."), - VERSION_8("Store statuses in memory in Earlybird."), - VERSION_9("Index from_user_ids into a searchable field."), - VERSION_10("Change from_user_id dictionary from fst to mphf"), - VERSION_11("Write image and video facet in separate lucene field."), - VERSION_12("Add retweeted status ID to the sparse CSF."), - VERSION_13("Add isOffensive field for profanity filter."), - VERSION_14("Fix features column stride field corruption."), - VERSION_15("Upgrade Lucene version, which has a different FST serialization format."), - VERSION_16("Remove maxDoc in favor of lastDocID"), - VERSION_17("Added partition and timeslice identifiers to SegmentData."), - VERSION_18("Per-term payloads"), - VERSION_19("Multiple per-doc payload fields"), - VERSION_20("Unify and fix hash codes"), - VERSION_21("Super awesome new flexible realtime posting list format."), - VERSION_22("Added new geo implementation."), - VERSION_23("Upgrade to Lucene 4.0.0 Final"), - VERSION_24("Added tweet topic ids."), - VERSION_25("Turn on skip list for mention facet."), - VERSION_26("Added new EncodedTweetFeaturesColumnStrideField."), - VERSION_27("Topic ids facet field."), - VERSION_28("From-user discover stories skiplist field."), - VERSION_29("Move tokenized screen name to the new username field"), - VERSION_30("Enable HF term pairs index."), - VERSION_31("Remove reverse doc ids."), - VERSION_32("Switch shared status id CSF to non-sparse long CSF index."), - VERSION_33("New skip lists for optimized high df posting lists."), - VERSION_34("Store tweet signature in EarlybirdEncodedFeatures."), - VERSION_35("Don't store shared status id csf in archive indexes."), - VERSION_36("Don't store norms."), - VERSION_37("64 bit user ids."), - VERSION_38("Index links in archive."), - VERSION_39("Fix pic.twitter.com image link handling not setting the internal field correctly."), - VERSION_40("Fix all archive tweets being marked as replies."), - VERSION_41("Avoid flushing event_ids field; event clusters are applied as updates."), - VERSION_42("No position fields refactoring; made a few fields to not use position."), - VERSION_43("Index private geo coordinates"), - 
VERSION_44("Materialize last doc id in HighDFCompressedPostinglists", true), - VERSION_45("Removing from_user_id facets support", true), - VERSION_46("Guard against badly out of order tweets in the search archive.", true), - VERSION_47("Added card title and description fields.", true), - VERSION_48("Added card type CSF.", true), - VERSION_49("Lucene 4.4 upgrade", true), - VERSION_50("Put mem-archive back on non-lucene optimized indexes", true), - VERSION_51("Force index rebuild to fix blank text field. See SEARCH-2505.", true), - VERSION_52("Refactoring of docValues/CSF.", true), - VERSION_53("Remove SegmentData.Configuration", true), - VERSION_54("Fix bad indices caused by SEARCH-2723.", true), - VERSION_55("Fixed non-deterministic facetIds across restarts. SEARCH-2815.", true), - VERSION_56("Flush FacetIDMap.", true), - VERSION_57("Remove LatLonMapper and use standard DocValues instead.", true), - VERSION_58("Longterm Attribute Optimization.", true), - VERSION_59("Renamed archive segment names. Current segment is no longer mutable.", true), - // Flush version 60 and 59 have the same format. - // Flush version is increased to trigger a rebuild, because we noticed incomplete segments. - // More details can be found on SEARCH-3664 - VERSION_60("Flush version change to trigger segment rebuild.", true), - VERSION_61("Adding back from_user_id", true), - VERSION_62("Add retweet facet.", true), - VERSION_63("Switch to new index API in com.twitter.search.core.earlybird.", true), - VERSION_64("Sort merge archive day and part-* data. SEARCH-4692.", true), - VERSION_65("Fix ID_FIELD and CREATED_AT_FIELD sort order. SEARCH-4004 SEARCH-912 ", true), - VERSION_66("Rebuild data for 1/5/2015. Data on HDFS fixed as part of SEARCH-5347.", true), - VERSION_67("Upgrade to Lucene 4.10.3.", true), - VERSION_68("Switching to Penguin v4", true), - VERSION_69("Fix 16% archive segments: SEARCH-6073", true), - VERSION_70("Switching to Penguin v4 for full archive cluster. SEARCH-5302", true), - VERSION_71("Switching to Penguin v4 for ssd archive cluster.", true), - VERSION_72("Added Escherbird annotations for full archive.", true), - VERSION_73("Lucene 5.2.1 upgrade.", true, 0), - VERSION_74("Hanndle geo scurbbed data and archive geo index accuracy", true, 0), - VERSION_75("Delete from_user_id_stories from indices", true, 0), - VERSION_76("Allow multiple index extensions.", true, 0), - VERSION_77("Removed EarlybirdCodec", true, 0), - // minor version 2: added embedded tweet features - // minor version 3: change embedded tweet features to INC_ONLY - VERSION_78("Added 80 bytes of extended features", true, 3), - // minor version 1: SEARCH-8564 - Reference Tweet Author ID, using - // EXTENDED_TEST_FEATURE_UNUSED_BITS_2 and EXTENDED_TEST_FEATURE_UNUSED_BITS_3 - VERSION_79("Renamed UNUSED_BIT to HAS_VISIBLE_LINK", true, 1), - // minor version 2: SEARCH-8564 / http://go/rb/770373 - // Made REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT and - // REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT immutable field - VERSION_80("Facet for links: SEARCH-8331", true, 2), - // minor version 1: added video view count - VERSION_81("Adding LowDF posting list with packed ints", true, 1), - VERSION_82("Enabling HighDF posting list with packed ints", true, 0), - // minor version 1: SEARCH-9379 - Added bitset for nullcast tweets - // minor version 2: SEARCH-8765 - Added visible token ratio - VERSION_83("Add bits in encoded features for media type flags. SEARCH-9131", true, 2), - VERSION_84("Enable archive rebuild for __has_links field. 
SEARCH-9635", true, 0), - // minor version 1: SEARCHQUAL-8130, add engagement v2 - VERSION_85("New archive build gen for missing geo data. SEARCH-9894", true, 1), - VERSION_86("Added new fields to the index", true, 0), - // During this rebuild both the statuses and the engagement counts were regenerated. - // minor version 1: added quote_count - VERSION_87("Periodic archive full rebuild. SEARCH-9423", true, 1), - // minor version 1: make new tokenized user name/handle fields textSearchable - // (see go/rb/847134/) - // minor version 2: added has_quote - VERSION_88("Fixing missing day in the full archive index. SEARCH-11233", true, 2), - VERSION_89("Index and store conversation ids.", true, 0), - VERSION_90("Fixing inconsistent days in the full archive index. SEARCH-11744", true, 0), - VERSION_91("Making in_reply_to_user_id field use MPH. SEARCH-10836", true, 0), - VERSION_92("Allow searches by any field. SEARCH-11251", true, 0), - // During this rebuild we regenerated engagement counts and merged the annotations in the - // aggregate job. - VERSION_93("Periodic archive full rebuild. SEARCH-11076", true, 0), - // minor version 1: add ThriftCSFViewSettings.outputCSFType - VERSION_94("Indexing a bunch of geo fields. SEARCH-10283", true, 1), - VERSION_95("Removing topic ID fields. SEARCH-8616", true, 0), - // minor version 1: add ThriftCSFViewSettings.normalizationType - VERSION_96("Enabling conversation ID for all clusters. SEARCH-11989", true, 1), - // minor version 1: set several feature configuration to be correct double type - // minor version 2: set some more feature configuration to be correct double type - // minor version 3: add safety labels SEARCHQUAL-9561 - // minor version 4: add weighted engagement counts SEARCHQUAL-9574 - // minor version 5: add Dopamine non personalized score SEARCHQUAL-9743 - VERSION_97("Changing CSF type to BOOLEAN for some has_* flags.", true, 5), - VERSION_98("Periodic archive full rebuild. PCM-56871.", true, 1), - VERSION_99("Removing named_entities field. SEARCH-13708", true, 0), - // minor version 1: add periscope features (SEARCHQUAL-10008) - // minor version 2: add raw_earlybird_score to TweetExternalFeatures (SEARCHQUAL-10347) - VERSION_100("Upgrade Penguin Version from V4 to V6. SEARCH-12991", true, 2), - // minor version 1: adjust for normalizer type for some engagement counters (SEARCHQUAL-9537) - // minor version 2: add decaying engagement counts and last engaged timestamps (SEARCHQUAL-10532) - VERSION_101("Add emoji to the index. SEARCH-12991", true, 2), - VERSION_102("Periodic full archive rebuild. PCM-67851", true, 0), - VERSION_103("Add liked_by_user_id field. SEARCH-15341", true, 0), - // minor version 1: remove last engaged timestamp with 3-hour increment (SEARCHQUAL-10903) - // minor version 2: add fake engagement counts (SEARCHQUAL-10795) - // minor version 3: add last engaged timestamp with 1-hour increment (SEARCHQUAL-10942) - VERSION_104("Reverting to the 20170109_pc100_par30 build gen. SEARCH-15731", true, 3), - VERSION_105("Add 3 new fields to archive index for engagement features. SEARCH-16102", true, 0), - // This is the last rebuild based on /tables/statuses. Starting 9/14 this build-gen is powered - // by TweetSource. During this rebuild both statuses and engagement counts were rebuilt. - VERSION_106("Periodic archive full rebuild. 
PCM-74652", true, 0), - VERSION_107("Removing card fields from full archive index.", true, 0), - VERSION_108("Removing the tms_id field from all schemas.", true, 0), - VERSION_109("Removing LAT_LON_FIELD from all schemas.", true, 0), - VERSION_110("Adding the card fields back to the full archive index.", true, 1), - // minor version 1: Add composer source csf field (SEARCH-22494) - VERSION_111("Adding composer_source to index. SEARCH-20377.", true, 1), - VERSION_112("Partial rebuild to fix SEARCH-22529.", true, 0), - VERSION_113("Full archive build gen 20180312_pc100_par30.", true, 0), - VERSION_114("Fix for SEARCH-23761.", true, 0), - VERSION_115("Add fields for quoted tweets. SEARCH-23919", true, 0), - // minor version 1: Add 4 bit hashtag count, mention count and stock count (SEARCH-24336) - VERSION_116("Bump flush version for scrubbing pipeline. SEARCH-24225", true, 1), - VERSION_117("Add retweeted_by_user_id and replied_to_by_user_id fields. SEARCH-24463", true, 0), - // minor version 1: Removed dopamine_non_personalized_score (SEARCHQUAL-10321) - VERSION_118("Adding the reply and retweet source tweet IDs: SEARCH-23702, SEARCH-24502", true, 1), - // minor version 1: add blink engagement counts (SEARCHQUAL-15176) - VERSION_119("Remove public inferred location: SEARCH-24235", true, 1), - VERSION_120("Flush extensions before fields when flushing segments.", true, 0), - VERSION_121("Flush the startingDocIdForSearch field. SEARCH-25464.", true, 0), - VERSION_122("Do not flush the startingDocIdForSearch field.", true, 0), - VERSION_123("Renaming the largestDocID flushed property to firstAddedDocID.", true, 0), - VERSION_124("Use the skip list posting list for all fields.", true, 0), - VERSION_125("Use hashmap for tweet ID lookup.", true, 0), - VERSION_126("Use the skip list posting list for all fields.", true, 0), - VERSION_127("Flushing the min and max doc IDs in each segment.", true, 0), - VERSION_128("Add card_lang to index. SEARCH-26539", true, 0), - VERSION_129("Move the tweet ID mapper to the segment data.", true, 0), - VERSION_130("Move the time mapper to the segment data.", true, 0), - VERSION_131("Change the facets classes to work with any doc IDs.", true, 0), - VERSION_132("Make the CSF classes work with any doc IDs.", true, 0), - VERSION_133("Removing smallestDocID property.", true, 0), - VERSION_134("Optimize DeletedDocs before flushing.", true, 0), - VERSION_135("Add payloads to skiplists.", true, 0), - VERSION_136("Add name to int pools.", true, 0), - VERSION_137("Add unsorted stream offset.", true, 0), - VERSION_138("Switch to the OutOfOrderRealtimeTweetIDMapper.", true, 0), - VERSION_139("Remove realtime posting lists.", true, 0), - VERSION_140("Add named_entity field. SEARCH-27547", true, 0), - VERSION_141("Flush the out of order updates count.", true, 0), - VERSION_142("Add named_entity facet support. SEARCH-28054", true, 0), - VERSION_143("Index updates before optimizing segment.", true, 0), - VERSION_144("Refactor TermsArray.", true, 0), - VERSION_145("Remove SmallestDocID.", true, 0), - VERSION_146("Add entity_id facet support. 
SEARCH-28071", true, 0), - VERSION_147("Enable updating facets", true, 0), - VERSION_148("Rename the counter for feature updates to partial updates", true, 0), - VERSION_149("Stop flushing offsets for sorted updates DL streams.", true, 0), - VERSION_150("Update the name of the property for the updates DL stream offset.", true, 0), - VERSION_151("Upgrade Lucene version to 5.5.5.", true, 0), - VERSION_152("Upgrade Lucene version to 6.0.0.", true, 0), - VERSION_153("Upgrade Lucene version to 6.6.6.", true, 0), - VERSION_154("Store the timeslice ID on EarlybirdIndexSegmentData.", true, 0), - VERSION_155("Do not flush index extensions.", true, 0), - VERSION_156("Deprecate ThriftIndexedFieldSettings.defaultFieldBoost.", true, 0), - VERSION_157("Load CREATED_AT_CSF_FIELD into RAM in archive.", true, 0), - VERSION_158("Added directed at user ID field and CSF.", true, 0), - VERSION_159("Changing deleted docs serialization format.", true, 0), - VERSION_160("Add fields for health model scores. SEARCH-31907, HML-2099", true, 0), - VERSION_161("Switch to the 'search' Kafka cluster.", true, 0), - VERSION_162("Update Lucene version to 7.0.0.", true, 0), - VERSION_163("Update Lucene version to 7.7.2.", true, 0), - // minor version 1: add IS_TRENDING_NOW_FLAG - VERSION_164("Collect per-term stats in the realtime segments.", true, 1), - VERSION_165("Update Lucene version to 8.5.2.", true, 0), - VERSION_166("Serialize maxPosition field for InvertedRealtimeIndex", true, 0), - VERSION_167("Add field for pSpammyTweetScore. HML-2557", true, 0), - VERSION_168("Add field for pReportedTweetScore. HML-2644", true, 0), - VERSION_169("Add field for spammyTweetContentScore. PFM-70", true, 0), - VERSION_170("Add reference author id CSF. SEARCH-34715", true, 0), - VERSION_171("Add space_id field. SEARCH-36156", true, 0), - VERSION_172("Add facet support for space_id. SEARCH-36388", true, 0), - VERSION_173("Add space admin and title fields. SEARCH-36986", true, 0), - VERSION_174("Switching to Penguin v7 for realtime-exp0 cluster. SEARCH-36068", true, 0), - VERSION_175("Adding exclusive conversation author id CSF", true, 0), - VERSION_176("Adding card URI CSF", true, 0), - // minor version 1: add FROM_BLUE_VERIFIED_ACCOUNT_FLAG - // minor version 2: Adding new cluster REALTIME_CG. SEARCH-45692 - VERSION_177("Adding URL Description and Title fields. SEARCH-41641", true, 2), - - /** - * This semi colon is on a separate line to avoid polluting git blame history. - * Put a comma after the new enum field you're adding. - */; - - // The current version. - public static final FlushVersion CURRENT_FLUSH_VERSION = - FlushVersion.values()[FlushVersion.values().length - 1]; - - public static final String DELIMITER = "_v_"; - - /* ======================================================= - * Helper methods - * ======================================================= */ - private final String description; - private final boolean isOfficial; - private final int minorVersion; - - /** - * A flush version is not official unless explicitly stated to be official. - * An unofficial flush version is never uploaded to HDFS. 
- */ - private FlushVersion(String description) { - this(description, false, 0); - } - - private FlushVersion(String description, boolean isOfficial) { - this(description, isOfficial, 0); - } - - private FlushVersion(String description, boolean isOfficial, int minorVersion) { - this.description = description; - this.isOfficial = isOfficial; - this.minorVersion = minorVersion; - } - - /** - * Returns file extension with version number. - */ - public String getVersionFileExtension() { - if (this == VERSION_0) { - return ""; - } else { - return DELIMITER + ordinal(); - } - } - - /** - * Returns file extension given flush version number. - * If the flush version is unknown (e.g. higher than current flush version or lower than 0), null - * is returned. - */ - @Nullable - public static String getVersionFileExtension(int flushVersion) { - if (flushVersion > CURRENT_FLUSH_VERSION.ordinal() || flushVersion < 0) { - return null; - } else { - return FlushVersion.values()[flushVersion].getVersionFileExtension(); - } - } - - /** - * Returns a string describing the current schema version. - * @deprecated Please use {@link com.twitter.search.common.schema.base.Schema#getVersionDescription()} - */ - @Deprecated - public String getDescription() { - return description; - } - - /** - * Returns the schema's major version. - * @deprecated Please use {@link com.twitter.search.common.schema.base.Schema#getMajorVersionNumber()}. - */ - @Deprecated - public int getVersionNumber() { - return this.ordinal(); - } - - public boolean onOrAfter(FlushVersion other) { - return compareTo(other) >= 0; - } - - /** - * Returns whether the schema version is official. Only official segments are uploaded to HDFS. - * @deprecated Please use {@link com.twitter.search.common.schema.base.Schema#isVersionOfficial()}. - */ - @Deprecated - public boolean isOfficial() { - // We want the loading/flushing tests to pass locally even if the version is not meant - // to be an official version. - return isOfficial || Config.environmentIsTest(); - } - - /** - * As of now, this is hardcoded to 0. We will start using this soon. - * @deprecated Please consult schema for minor version. This should only be used to build schema. - */ - @Deprecated - public int getMinorVersion() { - return minorVersion; - } -} diff --git a/src/java/com/twitter/search/common/search/AndNotDocIdSetIterator.java b/src/java/com/twitter/search/common/search/AndNotDocIdSetIterator.java deleted file mode 100644 index 5fc221ba7..000000000 --- a/src/java/com/twitter/search/common/search/AndNotDocIdSetIterator.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; - -import org.apache.lucene.search.DocIdSetIterator; - -public class AndNotDocIdSetIterator extends DocIdSetIterator { - private int nextDelDoc; - private final DocIdSetIterator baseIter; - private final DocIdSetIterator notIter; - private int currID; - - /** Creates a new AndNotDocIdSetIterator instance. 
*/ - public AndNotDocIdSetIterator(DocIdSetIterator baseIter, DocIdSetIterator notIter) - throws IOException { - nextDelDoc = notIter.nextDoc(); - this.baseIter = baseIter; - this.notIter = notIter; - currID = -1; - } - - @Override - public int advance(int target) throws IOException { - currID = baseIter.advance(target); - if (currID == DocIdSetIterator.NO_MORE_DOCS) { - return currID; - } - - if (nextDelDoc != DocIdSetIterator.NO_MORE_DOCS) { - if (currID < nextDelDoc) { - return currID; - } else if (currID == nextDelDoc) { - return nextDoc(); - } else { - nextDelDoc = notIter.advance(currID); - if (currID == nextDelDoc) { - return nextDoc(); - } - } - } - return currID; - } - - @Override - public int docID() { - return currID; - } - - @Override - public int nextDoc() throws IOException { - currID = baseIter.nextDoc(); - if (nextDelDoc != DocIdSetIterator.NO_MORE_DOCS) { - while (currID != DocIdSetIterator.NO_MORE_DOCS) { - if (currID < nextDelDoc) { - return currID; - } else { - if (currID == nextDelDoc) { - currID = baseIter.nextDoc(); - } - nextDelDoc = notIter.advance(currID); - } - } - } - return currID; - } - - @Override - public long cost() { - return baseIter.cost(); - } -} diff --git a/src/java/com/twitter/search/common/search/BUILD b/src/java/com/twitter/search/common/search/BUILD deleted file mode 100644 index ac5fe14b7..000000000 --- a/src/java/com/twitter/search/common/search/BUILD +++ /dev/null @@ -1,33 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/log4j", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "3rdparty/jvm/org/apache/lucene:lucene-spatial-extras", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/query", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/queryparser", - "src/thrift/com/twitter/search/common:facets-java", - "src/thrift/com/twitter/search/common:query-java", - ], -) diff --git a/src/java/com/twitter/search/common/search/DelegatingEarlyTerminationCollector.java b/src/java/com/twitter/search/common/search/DelegatingEarlyTerminationCollector.java deleted file mode 100644 index 977f4a0a5..000000000 --- a/src/java/com/twitter/search/common/search/DelegatingEarlyTerminationCollector.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; -import java.util.List; - -import javax.annotation.Nullable; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.Collector; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.Scorable; -import org.apache.lucene.search.ScoreMode; - -import 
com.twitter.common.util.Clock; -import com.twitter.search.common.query.thriftjava.CollectorParams; - -/** - * A {@link com.twitter.search.common.search.TwitterEarlyTerminationCollector} - * that delegates actual hit collection to a sub collector. - */ -public final class DelegatingEarlyTerminationCollector - extends TwitterEarlyTerminationCollector { - private final Collector subCollector; - private LeafCollector subLeafCollector; - - /** Creates a new DelegatingEarlyTerminationCollector instance. */ - public DelegatingEarlyTerminationCollector(Collector subCollector, - CollectorParams collectorParams, - TerminationTracker terminationTracker, - @Nullable QueryCostProvider queryCostProvider, - int numDocsBetweenTimeoutChecks, - Clock clock) { - super( - collectorParams, - terminationTracker, - queryCostProvider, - numDocsBetweenTimeoutChecks, - clock); - this.subCollector = subCollector; - } - - @Override - public void setScorer(Scorable scorer) throws IOException { - super.setScorer(scorer); - subLeafCollector.setScorer(scorer); - } - - @Override - protected void doCollect() throws IOException { - subLeafCollector.collect(curDocId); - } - - @Override - protected void doFinishSegment(int lastSearchedDocID) throws IOException { - if (subCollector instanceof TwitterCollector) { - ((TwitterCollector) subCollector).finishSegment(lastSearchedDocID); - } - } - - @Override - public void setNextReader(LeafReaderContext context) throws IOException { - super.setNextReader(context); - subLeafCollector = subCollector.getLeafCollector(context); - } - - @Override - public ScoreMode scoreMode() { - return subCollector.scoreMode(); - } - - @Override - public List getDebugInfo() { - return null; - } -} diff --git a/src/java/com/twitter/search/common/search/DocIdTracker.java b/src/java/com/twitter/search/common/search/DocIdTracker.java deleted file mode 100644 index 97546315e..000000000 --- a/src/java/com/twitter/search/common/search/DocIdTracker.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.common.search; - -/** - * Provide an accessor for a doc ID. This is useful for classes that iterate through doc IDs - * and maintain a "last seen" doc ID. - */ -public interface DocIdTracker { - /** - * Retrieve current doc ID - */ - int getCurrentDocId(); -} diff --git a/src/java/com/twitter/search/common/search/EarlyTerminationState.java b/src/java/com/twitter/search/common/search/EarlyTerminationState.java deleted file mode 100644 index 31a1731e6..000000000 --- a/src/java/com/twitter/search/common/search/EarlyTerminationState.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.common.search; - -import javax.annotation.Nonnull; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.metrics.SearchCounter; - -/** - * This is not an enum to allow different clusters to define their own EarlyTerminationStates. 
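As a hedged sketch of what a cluster-specific state could look like (the class name and reason string below are made up), note that constructing a state also exports a counter whose name is the reason prefixed with "early_termination_" and suffixed with "_count":

package com.twitter.search.common.search;

// Sketch: a cluster-defined termination state; incrementCount() bumps
// "early_termination_terminated_max_segments_searched_count".
final class MyClusterEarlyTerminationStates {
  static final EarlyTerminationState TERMINATED_MAX_SEGMENTS_SEARCHED =
      new EarlyTerminationState("terminated_max_segments_searched", true);

  private MyClusterEarlyTerminationStates() { }
}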
- */ -public final class EarlyTerminationState { - private static final String STATS_PREFIX = "early_termination_"; - - public static final EarlyTerminationState COLLECTING = - new EarlyTerminationState("no_early_termination", false); - public static final EarlyTerminationState TERMINATED_TIME_OUT_EXCEEDED = - new EarlyTerminationState("terminated_timeout_exceeded", true); - public static final EarlyTerminationState TERMINATED_MAX_QUERY_COST_EXCEEDED = - new EarlyTerminationState("terminated_max_query_cost_exceeded", true); - public static final EarlyTerminationState TERMINATED_MAX_HITS_EXCEEDED = - new EarlyTerminationState("terminated_max_hits_exceeded", true); - public static final EarlyTerminationState TERMINATED_NUM_RESULTS_EXCEEDED = - new EarlyTerminationState("terminated_num_results_exceeded", true); - - - // This string can be returned as a part of a search response, to tell the searcher - // why the search got early terminated. - private final String terminationReason; - private final boolean terminated; - private final SearchCounter count; - - public EarlyTerminationState(@Nonnull String terminationReason, boolean terminated) { - this.terminationReason = Preconditions.checkNotNull(terminationReason); - this.terminated = terminated; - count = SearchCounter.export(STATS_PREFIX + terminationReason + "_count"); - - } - - public boolean isTerminated() { - return terminated; - } - - public String getTerminationReason() { - return terminationReason; - } - - public void incrementCount() { - count.increment(); - } -} diff --git a/src/java/com/twitter/search/common/search/GeoQuadTreeQueryBuilderUtil.java b/src/java/com/twitter/search/common/search/GeoQuadTreeQueryBuilderUtil.java deleted file mode 100644 index 43475e9b7..000000000 --- a/src/java/com/twitter/search/common/search/GeoQuadTreeQueryBuilderUtil.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.common.search; - -import java.util.LinkedHashSet; -import java.util.Set; - -import org.apache.lucene.search.Query; -import org.apache.lucene.spatial.prefix.tree.Cell; -import org.apache.lucene.spatial.prefix.tree.CellIterator; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.util.spatial.GeohashChunkImpl; -import com.twitter.search.queryparser.util.GeoCode; - -import geo.google.datamodel.GeoAddressAccuracy; - -public final class GeoQuadTreeQueryBuilderUtil { - private GeoQuadTreeQueryBuilderUtil() { - } - - /** - * Build a geo quad tree query based around the geo code based on the geo field. - * @param geocode the geo location for the quad tree query - * @param field the field where the geohash tokens are indexed - * @return the corresponding for the geo quad tree query - */ - public static Query buildGeoQuadTreeQuery(GeoCode geocode, String field) { - Set geoHashSet = new LinkedHashSet<>(); - - // if accuracy is specified. Add a term query based on accuracy. - if (geocode.accuracy != GeoAddressAccuracy.UNKNOWN_LOCATION.getCode()) { - BytesRef termRef = new BytesRef(GeohashChunkImpl.buildGeoStringWithAccuracy(geocode.latitude, - geocode.longitude, - geocode.accuracy)); - geoHashSet.add(termRef); - } - - // If distance is specified. Add term queries based on distance - if (geocode.distanceKm != GeoCode.DOUBLE_DISTANCE_NOT_SET) { - // Build query based on distance - int treeLevel = -1; - // First find block containing query point with diagonal greater than 2 * radius. 
- Cell centerNode = GeohashChunkImpl.getGeoNodeByRadius(geocode.latitude, geocode.longitude, - geocode.distanceKm); - // Add center node querying term - if (centerNode != null) { - geoHashSet.add(centerNode.getTokenBytesNoLeaf(new BytesRef())); - treeLevel = centerNode.getLevel(); - } - - // This improves edge case recall, by adding cells also intersecting the query area. - CellIterator nodes = GeohashChunkImpl.getNodesIntersectingCircle(geocode.latitude, - geocode.longitude, - geocode.distanceKm, - treeLevel); - // If there are other nodes intersecting query circle, also add them in. - if (nodes != null) { - while (nodes.hasNext()) { - geoHashSet.add(nodes.next().getTokenBytesNoLeaf(new BytesRef())); - } - } - } - - return new com.twitter.search.common.query.MultiTermDisjunctionQuery(field, geoHashSet); - } -} diff --git a/src/java/com/twitter/search/common/search/IntArrayDocIdSetIterator.java b/src/java/com/twitter/search/common/search/IntArrayDocIdSetIterator.java deleted file mode 100644 index ea370ce9d..000000000 --- a/src/java/com/twitter/search/common/search/IntArrayDocIdSetIterator.java +++ /dev/null @@ -1,76 +0,0 @@ -package com.twitter.search.common.search; - -import java.util.Arrays; - -import org.apache.lucene.search.DocIdSetIterator; - -/** - * DocIdSetIterator implementation from a sorted list of non-negative integers. If the given list of - * doc IDs is not sorted or contains negative doc IDs, the results are undefined. - */ -public class IntArrayDocIdSetIterator extends DocIdSetIterator { - private final int[] docIds; - private int docId; - private int cursor; - - public IntArrayDocIdSetIterator(int[] ids) { - docIds = ids; - reset(); - } - - /** Used for testing. */ - public void reset() { - docId = -1; - cursor = -1; - } - - @Override - public int docID() { - return docId; - } - - @Override - public int nextDoc() { - return advance(docId); - } - - @Override - public int advance(int target) { - if (docId == NO_MORE_DOCS) { - return docId; - } - - if (target < docId) { - return docId; - } - - if (cursor == docIds.length - 1) { - docId = NO_MORE_DOCS; - return docId; - } - - if (target == docId) { - docId = docIds[++cursor]; - return docId; - } - - int toIndex = Math.min(cursor + (target - docId) + 1, docIds.length); - int targetIndex = Arrays.binarySearch(docIds, cursor + 1, toIndex, target); - if (targetIndex < 0) { - targetIndex = -targetIndex - 1; - } - - if (targetIndex == docIds.length) { - docId = NO_MORE_DOCS; - } else { - cursor = targetIndex; - docId = docIds[cursor]; - } - return docId; - } - - @Override - public long cost() { - return docIds == null ? 0 : docIds.length; - } -} diff --git a/src/java/com/twitter/search/common/search/PairDocIdSetIterator.java b/src/java/com/twitter/search/common/search/PairDocIdSetIterator.java deleted file mode 100644 index 0ed125923..000000000 --- a/src/java/com/twitter/search/common/search/PairDocIdSetIterator.java +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.DocIdSetIterator; -/** - * Disjunction over 2 DocIdSetIterators. This should be faster than a disjunction over N since there - * would be no need to adjust the heap. - */ -public class PairDocIdSetIterator extends DocIdSetIterator { - - private final DocIdSetIterator d1; - private final DocIdSetIterator d2; - - private int doc = -1; - - /** Creates a new PairDocIdSetIterator instance. 
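A small self-contained sketch of how these doc-ID iterator utilities compose: the doc IDs are made up, and PairDocIdSetIterator (shown next) plays the complementary disjunction role over two iterators.

package com.twitter.search.common.search;

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;

// Sketch with made-up doc IDs: "live" iterates docs present in base but not in deleted.
final class AndNotIteratorExample {
  private AndNotIteratorExample() { }

  public static void main(String[] args) throws IOException {
    DocIdSetIterator base = new IntArrayDocIdSetIterator(new int[] {1, 3, 5, 8});
    DocIdSetIterator deleted = new IntArrayDocIdSetIterator(new int[] {3, 8});
    DocIdSetIterator live = new AndNotDocIdSetIterator(base, deleted);
    for (int doc = live.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = live.nextDoc()) {
      System.out.println(doc);  // prints 1, then 5
    }
  }
}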
*/ - public PairDocIdSetIterator(DocIdSetIterator d1, DocIdSetIterator d2) throws IOException { - Preconditions.checkNotNull(d1); - Preconditions.checkNotNull(d2); - this.d1 = d1; - this.d2 = d2; - // position the iterators - this.d1.nextDoc(); - this.d2.nextDoc(); - } - - @Override - public int docID() { - return doc; - } - - @Override - public int nextDoc() throws IOException { - int doc1 = d1.docID(); - int doc2 = d2.docID(); - DocIdSetIterator iter = null; - if (doc1 < doc2) { - doc = doc1; - //d1.nextDoc(); - iter = d1; - } else if (doc1 > doc2) { - doc = doc2; - //d2.nextDoc(); - iter = d2; - } else { - doc = doc1; - //d1.nextDoc(); - //d2.nextDoc(); - } - - if (doc != NO_MORE_DOCS) { - if (iter != null) { - iter.nextDoc(); - } else { - d1.nextDoc(); - d2.nextDoc(); - } - } - return doc; - } - - @Override - public int advance(int target) throws IOException { - if (d1.docID() < target) { - d1.advance(target); - } - if (d2.docID() < target) { - d2.advance(target); - } - return (doc != NO_MORE_DOCS) ? nextDoc() : doc; - } - - @Override - public long cost() { - // very coarse estimate - return d1.cost() + d2.cost(); - } - -} diff --git a/src/java/com/twitter/search/common/search/QueryCostProvider.java b/src/java/com/twitter/search/common/search/QueryCostProvider.java deleted file mode 100644 index 7e5d72433..000000000 --- a/src/java/com/twitter/search/common/search/QueryCostProvider.java +++ /dev/null @@ -1,9 +0,0 @@ -package com.twitter.search.common.search; - -/** - * Any class that can track and return query cost. - */ -public interface QueryCostProvider { - /** Returns the total cost. */ - double getTotalCost(); -} diff --git a/src/java/com/twitter/search/common/search/TerminationTracker.java b/src/java/com/twitter/search/common/search/TerminationTracker.java deleted file mode 100644 index 916415078..000000000 --- a/src/java/com/twitter/search/common/search/TerminationTracker.java +++ /dev/null @@ -1,202 +0,0 @@ -package com.twitter.search.common.search; - -import java.util.HashSet; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; - -/** - * Used for tracking termination criteria for earlybird queries. - * - * Currently this tracks the query time out and query cost, if they are set on the - * {@link com.twitter.search.common.query.thriftjava.CollectorTerminationParams}. - */ -public class TerminationTracker { - /** Query start time provided by client. */ - private final long clientStartTimeMillis; - - /** Timeout end times, calculated from {@link #clientStartTimeMillis}. */ - private final long timeoutEndTimeMillis; - - /** Query start time recorded at earlybird server. */ - private final long localStartTimeMillis; - - /** Tracking query cost */ - private final double maxQueryCost; - - // Sometimes, we want to early terminate before timeoutEndTimeMillis, to reserve time for - // work that needs to be done after early termination (E.g. merging results). - private final int postTerminationOverheadMillis; - - // We don't check for early termination often enough. Some times requests timeout in between - // early termination checks. This buffer time is also substracted from deadline. - // To illustrate how this is used, let's use a simple example: - // If we spent 750ms searching 5 segments, a rough estimate is that we need 150ms to search - // one segment. If the timeout is set to 800ms, we should not starting searching the next segment. 
- // In this case, on can set preTerminationSafeBufferTimeMillis to 150ms, so that when early - // termination check computes the deadline, this buffer is also subtracted. See SEARCH-29723. - private int preTerminationSafeBufferTimeMillis = 0; - - private EarlyTerminationState earlyTerminationState = EarlyTerminationState.COLLECTING; - - // This flag determines whether the last searched doc ID trackers should be consulted when a - // timeout occurs. - private final boolean useLastSearchedDocIdOnTimeout; - - private final Set lastSearchedDocIdTrackers = new HashSet<>(); - - /** - * Creates a new termination tracker that will not specify a timeout or max query cost. - * Can be used for queries that explicitly do not want to use a timeout. Meant to be used for - * tests, and background queries running for the query cache. - */ - public TerminationTracker(Clock clock) { - this.clientStartTimeMillis = clock.nowMillis(); - this.localStartTimeMillis = clientStartTimeMillis; - this.timeoutEndTimeMillis = Long.MAX_VALUE; - this.maxQueryCost = Double.MAX_VALUE; - this.postTerminationOverheadMillis = 0; - this.useLastSearchedDocIdOnTimeout = false; - } - - /** - * Convenient method overloading for - * {@link #TerminationTracker(CollectorTerminationParams, long, Clock, int)}. - */ - public TerminationTracker( - CollectorTerminationParams terminationParams, Clock clock, - int postTerminationOverheadMillis) { - this(terminationParams, clock.nowMillis(), clock, postTerminationOverheadMillis); - } - - /** - * Convenient method overloading for - * {@link #TerminationTracker(CollectorTerminationParams, long, Clock, int)}. - */ - public TerminationTracker( - CollectorTerminationParams terminationParams, int postTerminationOverheadMillis) { - this( - terminationParams, - System.currentTimeMillis(), - Clock.SYSTEM_CLOCK, - postTerminationOverheadMillis); - } - - /** - * Creates a new TerminationTracker instance. - * - * @param terminationParams CollectorParams.CollectorTerminationParams carrying parameters - * about early termination. - * @param clientStartTimeMillis The query start time (in millis) specified by client. This is used - * to calculate timeout end time, like {@link #timeoutEndTimeMillis}. - * @param clock used to sample {@link #localStartTimeMillis}. - * @param postTerminationOverheadMillis How much time should be reserved. E.g. if request time - * out is 800ms, and this is set to 200ms, early termination - * will kick in at 600ms mark. - */ - public TerminationTracker( - CollectorTerminationParams terminationParams, - long clientStartTimeMillis, - Clock clock, - int postTerminationOverheadMillis) { - Preconditions.checkNotNull(terminationParams); - Preconditions.checkArgument(postTerminationOverheadMillis >= 0); - - this.clientStartTimeMillis = clientStartTimeMillis; - this.localStartTimeMillis = clock.nowMillis(); - - if (terminationParams.isSetTimeoutMs() - && terminationParams.getTimeoutMs() > 0) { - Preconditions.checkState(terminationParams.getTimeoutMs() >= postTerminationOverheadMillis); - this.timeoutEndTimeMillis = this.clientStartTimeMillis + terminationParams.getTimeoutMs(); - } else { - // Effectively no timeout. 
- this.timeoutEndTimeMillis = Long.MAX_VALUE; - } - - // Tracking query cost - if (terminationParams.isSetMaxQueryCost() - && terminationParams.getMaxQueryCost() > 0) { - maxQueryCost = terminationParams.getMaxQueryCost(); - } else { - maxQueryCost = Double.MAX_VALUE; - } - - this.useLastSearchedDocIdOnTimeout = terminationParams.isEnforceQueryTimeout(); - this.postTerminationOverheadMillis = postTerminationOverheadMillis; - } - - /** - * Returns the reserve time to perform post termination work. Return the deadline timestamp - * with postTerminationWorkEstimate subtracted. - */ - public long getTimeoutEndTimeWithReservation() { - // Return huge value if time out is disabled. - if (timeoutEndTimeMillis == Long.MAX_VALUE) { - return timeoutEndTimeMillis; - } else { - return timeoutEndTimeMillis - - postTerminationOverheadMillis - - preTerminationSafeBufferTimeMillis; - } - } - - public void setPreTerminationSafeBufferTimeMillis(int preTerminationSafeBufferTimeMillis) { - Preconditions.checkArgument(preTerminationSafeBufferTimeMillis >= 0); - - this.preTerminationSafeBufferTimeMillis = preTerminationSafeBufferTimeMillis; - } - - public long getLocalStartTimeMillis() { - return localStartTimeMillis; - } - - public long getClientStartTimeMillis() { - return clientStartTimeMillis; - } - - public double getMaxQueryCost() { - return maxQueryCost; - } - - public boolean isEarlyTerminated() { - return earlyTerminationState.isTerminated(); - } - - public EarlyTerminationState getEarlyTerminationState() { - return earlyTerminationState; - } - - public void setEarlyTerminationState(EarlyTerminationState earlyTerminationState) { - this.earlyTerminationState = earlyTerminationState; - } - - /** - * Return the minimum searched doc ID amongst all registered trackers, or -1 if there aren't any - * trackers. Doc IDs are stored in ascending order, and trackers update their doc IDs as they - * search, so the minimum doc ID reflects the most recent fully searched doc ID. - */ - int getLastSearchedDocId() { - return lastSearchedDocIdTrackers.stream() - .mapToInt(DocIdTracker::getCurrentDocId).min().orElse(-1); - } - - void resetDocIdTrackers() { - lastSearchedDocIdTrackers.clear(); - } - - /** - * Add a DocIdTracker, to keep track of the last fully-searched doc ID when early termination - * occurs. - */ - public void addDocIdTracker(DocIdTracker docIdTracker) { - lastSearchedDocIdTrackers.add(docIdTracker); - } - - public boolean useLastSearchedDocIdOnTimeout() { - return useLastSearchedDocIdOnTimeout; - } -} diff --git a/src/java/com/twitter/search/common/search/TwitterCollector.java b/src/java/com/twitter/search/common/search/TwitterCollector.java deleted file mode 100644 index 0661db8fc..000000000 --- a/src/java/com/twitter/search/common/search/TwitterCollector.java +++ /dev/null @@ -1,31 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; - -import org.apache.lucene.search.Collector; - -/** - * Lucene Collectors throw CollectionTerminatedException to perform early termination. - * We don't believe that throwing Exceptions to control execution flow is ideal, so we are adding - * this class to be a base of all Twitter Collectors. - * - * {@link com.twitter.search.common.search.TwitterIndexSearcher} uses the {@link #isTerminated()} - * method to perform early termination, instead of relying on CollectionTerminatedException. - */ -public abstract class TwitterCollector implements Collector { - - /** - * Subclasses should return true if they want to perform early termination. 
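To make the deadline arithmetic of TerminationTracker concrete, here is a hedged sketch that mirrors the 800ms example from the comments above; the helper class and the numbers are illustrative only.

package com.twitter.search.common.search;

import com.twitter.common.util.Clock;
import com.twitter.search.common.query.thriftjava.CollectorTerminationParams;

// Sketch: an 800ms request timeout with 200ms reserved for post-termination work.
final class TerminationTrackerExample {
  private TerminationTrackerExample() { }

  static TerminationTracker newTracker() {
    CollectorTerminationParams params = new CollectorTerminationParams().setTimeoutMs(800);
    // Reservation deadline is clientStartTime + 800 - 200 at this point.
    TerminationTracker tracker = new TerminationTracker(params, Clock.SYSTEM_CLOCK, 200);
    // If one segment took roughly 150ms, the collector can set a safe buffer, moving the
    // effective deadline to clientStartTime + 800 - 200 - 150.
    tracker.setPreTerminationSafeBufferTimeMillis(150);
    return tracker;
  }
}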
- * This method is called every hit and should not be expensive. - */ - public abstract boolean isTerminated() throws IOException; - - /** - * Lucene API only has a method that's called before searching a segment setNextReader(). - * This hook is called after finishing searching a segment. - * @param lastSearchedDocID is the last docid searched before termination, - * or NO_MORE_DOCS if there was no early termination. This doc need not be a hit, - * and should not be collected here. - */ - public abstract void finishSegment(int lastSearchedDocID) throws IOException; -} diff --git a/src/java/com/twitter/search/common/search/TwitterEarlyTerminationCollector.java b/src/java/com/twitter/search/common/search/TwitterEarlyTerminationCollector.java deleted file mode 100644 index bc7711e7d..000000000 --- a/src/java/com/twitter/search/common/search/TwitterEarlyTerminationCollector.java +++ /dev/null @@ -1,328 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; -import java.util.List; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.Scorable; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; - -/** - * A TwitterCollector containing the most common early termination logic based on - * timeout, cost, and max hits. This class does not do any actual hit collection---this class - * is abstract and cannot be instantiated. - * - * If a Collector and all its subclasses need early termination, it should extend this class. - * - * However, if one just wants to add EarlyTermination to any single collector, he can just - * use {@link DelegatingEarlyTerminationCollector} - * as a wrapper. - */ -public abstract class TwitterEarlyTerminationCollector - extends TwitterCollector implements LeafCollector { - private static final Logger LOG = LoggerFactory.getLogger(TwitterEarlyTerminationCollector.class); - private static final SearchCounter NEGATIVE_TIME_PER_SEGMENT = - SearchCounter.export("TwitterEarlyTerminationCollector_negative_time_per_segment"); - private static final SearchRateCounter QUERY_TIMEOUT_ENFORCED = - SearchRateCounter.export("TwitterEarlyTerminationCollector_query_timeout_enforced"); - - protected int curDocId = -1; - - protected Scorable scorer = null; - private LeafReader curReader = null; - private final long maxHitsToProcess; - private long numHitsProcessed = 0; - private int lastEarlyTerminationCheckDocId = -1; - private final Clock clock; - - @Nullable - private final QueryCostProvider queryCostProvider; - - private final TerminationTracker terminationTracker; - - // This determines how often the expensive early termination check is performed. - // If set to be negative, expensive early termination check only performed at segment boundaries. - // If set to a positive number X, this check is performed every X docs processed. - private int numDocsBetweenTimeoutChecks; - - // Number of segments searched so far. 
- // This is used to predicatively early terminate. - // Expensive early termination checks may not happen often enough. Sometimes the request - // times out in between the termination checks. - // After finishing searching a segment, we estimate how much time is needed to search one - // segment on average. If searching the next segment would cause a timeout, we early terminate. - private int numSearchedSegments = 0; - - /** - * Creates a new TwitterEarlyTerminationCollector instance. - * - * @param collectorParams the parameters needed to guide early termination. - * @param terminationTracker If null is passed in, a new TerminationTrack is created. Otherwise, - * the one passed in is used. - * @param numDocsBetweenTimeoutChecks TerminationTracker based check are performed upon a hit - * every numDocsBetweenTimeoutChecks docs. If a non-positive number is passed - * in, TerminationTracker based checks are disabled. - * If collectorParams specifies a value as well, that value is used. - */ - public TwitterEarlyTerminationCollector( - CollectorParams collectorParams, - TerminationTracker terminationTracker, - @Nullable QueryCostProvider queryCostProvider, - int numDocsBetweenTimeoutChecks, - Clock clock) { - CollectorTerminationParams terminationParams = collectorParams.getTerminationParams(); - - if (terminationParams == null) { - terminationParams = new CollectorTerminationParams() - .setMaxHitsToProcess(Integer.MAX_VALUE) - .setMaxQueryCost(Double.MAX_VALUE) - .setTimeoutMs(Integer.MAX_VALUE); - } - - if (!terminationParams.isSetMaxHitsToProcess() || terminationParams.getMaxHitsToProcess() < 0) { - maxHitsToProcess = Integer.MAX_VALUE; - } else { - maxHitsToProcess = terminationParams.getMaxHitsToProcess(); - } - - if (terminationParams.isSetNumDocsBetweenTimeoutChecks()) { - this.numDocsBetweenTimeoutChecks = terminationParams.getNumDocsBetweenTimeoutChecks(); - } else { - this.numDocsBetweenTimeoutChecks = numDocsBetweenTimeoutChecks; - } - - this.terminationTracker = Preconditions.checkNotNull(terminationTracker); - this.queryCostProvider = queryCostProvider; - this.clock = clock; - } - - public final LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { - this.setNextReader(context); - return this; - } - - /** - * Sub-classes may override this to add more collection logic. - */ - protected abstract void doCollect() throws IOException; - - /** - * Sub-classes may override this to add more segment completion logic. - * @param lastSearchedDocID is the last docid searched before termination, - * or NO_MORE_DOCS if there was no early termination. This doc may not be a hit! - */ - protected abstract void doFinishSegment(int lastSearchedDocID) throws IOException; - - /** - * sub classes can override this to perform more early termination checks. - */ - public EarlyTerminationState innerShouldCollectMore() throws IOException { - return EarlyTerminationState.COLLECTING; - } - - /** - * After early termination, this method can be used to retrieve early termination reason. 
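For example, a caller might inspect the reason once collection has finished; this is only a sketch, the collector wiring is omitted and the helper method is hypothetical.

// Sketch: logging why a search stopped early.
static void logTerminationReason(TwitterEarlyTerminationCollector collector) {
  EarlyTerminationState state = collector.getEarlyTerminationState();
  if (state.isTerminated()) {
    // e.g. "terminated_timeout_exceeded" or "terminated_max_hits_exceeded"
    System.out.println("early terminated: " + state.getTerminationReason());
  }
}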
- */ - @Nonnull - public final EarlyTerminationState getEarlyTerminationState() { - return terminationTracker.getEarlyTerminationState(); - } - - protected final EarlyTerminationState setEarlyTerminationState( - EarlyTerminationState newEarlyTerminationState) { - terminationTracker.setEarlyTerminationState(newEarlyTerminationState); - return newEarlyTerminationState; - } - - @Override - public final boolean isTerminated() throws IOException { - EarlyTerminationState earlyTerminationState = getEarlyTerminationState(); - - if (earlyTerminationState.isTerminated()) { - return true; - } - - if (getNumHitsProcessed() >= getMaxHitsToProcess()) { - collectedEnoughResults(); - if (shouldTerminate()) { - return setEarlyTerminationState(EarlyTerminationState.TERMINATED_MAX_HITS_EXCEEDED) - .isTerminated(); - } else { - return false; - } - } - - return innerShouldCollectMore().isTerminated(); - } - - /** - * Note: subclasses overriding this method are expected to call "super.setNextReader" - * in their setNextReader(). - * @deprecated Remove this methods in favor of {@link #getLeafCollector(LeafReaderContext)} - */ - @Deprecated - public void setNextReader(LeafReaderContext context) throws IOException { - if (!terminationTracker.useLastSearchedDocIdOnTimeout()) { - expensiveEarlyTerminationCheck(); - } - - // Reset curDocId for next segment - curDocId = -1; - lastEarlyTerminationCheckDocId = -1; - curReader = context.reader(); - } - - /** - * Sub-classes overriding this method are expected to call super.setScorer() - */ - @Override - public void setScorer(Scorable scorer) throws IOException { - this.scorer = scorer; - } - - @Override - public final void collect(int doc) throws IOException { - curDocId = doc; - doCollect(); - numHitsProcessed++; - if (numDocsBetweenTimeoutChecks > 0 - && (curDocId - lastEarlyTerminationCheckDocId) >= numDocsBetweenTimeoutChecks) { - lastEarlyTerminationCheckDocId = curDocId; - - if (!terminationTracker.useLastSearchedDocIdOnTimeout()) { - expensiveEarlyTerminationCheck(); - } - } - } - - /** - * Accounting for a segment searched. - * @param lastSearchedDocID is the last docid searched before termination, - * or NO_MORE_DOCS if there was no early termination. This doc may not be a hit! - */ - protected final void trackCompleteSegment(int lastSearchedDocID) throws IOException { - doFinishSegment(lastSearchedDocID); - } - - @Override - public final void finishSegment(int lastSearchedDocID) throws IOException { - // finished searching a segment. Computer average time needed to search a segment. - Preconditions.checkState(curReader != null, "Did subclass call super.setNextReader()?"); - numSearchedSegments++; - - long totalTime = clock.nowMillis() - terminationTracker.getLocalStartTimeMillis(); - - if (totalTime >= Integer.MAX_VALUE) { - String msg = String.format( - "%s: A query runs for %d that is longer than Integer.MAX_VALUE ms. lastSearchedDocID: %d", - getClass().getSimpleName(), totalTime, lastSearchedDocID - ); - LOG.error(msg); - throw new IllegalStateException(msg); - } - - int timePerSegment = ((int) totalTime) / numSearchedSegments; - - if (timePerSegment < 0) { - NEGATIVE_TIME_PER_SEGMENT.increment(); - timePerSegment = 0; - } - - // If we're enforcing timeout via the last searched doc ID, we don't need to add this buffer, - // since we'll detect the timeout right away. 
- if (!terminationTracker.useLastSearchedDocIdOnTimeout()) { - terminationTracker.setPreTerminationSafeBufferTimeMillis(timePerSegment); - } - - // Check whether we timed out and are checking for timeout at the leaves. If so, we should use - // the captured lastSearchedDocId from the tracker instead, which is the most up-to-date amongst - // the query nodes. - if (terminationTracker.useLastSearchedDocIdOnTimeout() - && EarlyTerminationState.TERMINATED_TIME_OUT_EXCEEDED.equals( - terminationTracker.getEarlyTerminationState())) { - QUERY_TIMEOUT_ENFORCED.increment(); - trackCompleteSegment(terminationTracker.getLastSearchedDocId()); - } else { - trackCompleteSegment(lastSearchedDocID); - } - - // We finished a segment, so clear out the DocIdTrackers. The next segment will register its - // own trackers, and we don't need to keep the trackers from the current segment. - terminationTracker.resetDocIdTrackers(); - - curDocId = -1; - curReader = null; - scorer = null; - } - - /** - * More expensive Early Termination checks, which are not called every hit. - * This sets EarlyTerminationState if it decides that early termination should kick in. - * See: SEARCH-29723. - */ - private void expensiveEarlyTerminationCheck() { - if (queryCostProvider != null) { - double totalQueryCost = queryCostProvider.getTotalCost(); - double maxQueryCost = terminationTracker.getMaxQueryCost(); - if (totalQueryCost >= maxQueryCost) { - setEarlyTerminationState(EarlyTerminationState.TERMINATED_MAX_QUERY_COST_EXCEEDED); - } - } - - final long nowMillis = clock.nowMillis(); - if (nowMillis >= terminationTracker.getTimeoutEndTimeWithReservation()) { - setEarlyTerminationState(EarlyTerminationState.TERMINATED_TIME_OUT_EXCEEDED); - } - } - - public long getMaxHitsToProcess() { - return maxHitsToProcess; - } - - public final void setNumHitsProcessed(long numHitsProcessed) { - this.numHitsProcessed = numHitsProcessed; - } - - protected final long getNumHitsProcessed() { - return numHitsProcessed; - } - - protected final int getNumSearchedSegments() { - return numSearchedSegments; - } - - protected final Clock getClock() { - return clock; - } - - @VisibleForTesting - protected final TerminationTracker getTerminationTracker() { - return this.terminationTracker; - } - - protected void collectedEnoughResults() throws IOException { - } - - protected boolean shouldTerminate() { - return true; - } - - /** - * Debug info collected during execution. 
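To make the extension points concrete, here is a hedged sketch of a minimal subclass that only records hit doc IDs and leaves all max-hits/cost/timeout logic to the base class. It implements just the hooks visible in this file (`doCollect`, `doFinishSegment`, `getDebugInfo`); any further abstract methods declared by `TwitterCollector` or Lucene's `Collector` interface (for example `scoreMode()`) are not shown, the `List<String>` return type and `Clock.SYSTEM_CLOCK` are assumptions, and the class as a whole is illustrative rather than an actual Earlybird collector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.twitter.common.util.Clock;
import com.twitter.search.common.query.thriftjava.CollectorParams;

public class DocIdRecordingCollector extends TwitterEarlyTerminationCollector {
  private final List<Integer> docIds = new ArrayList<>();

  public DocIdRecordingCollector(CollectorParams collectorParams, TerminationTracker tracker) {
    // No query cost provider; a negative numDocsBetweenTimeoutChecks means the expensive
    // timeout check only runs at segment boundaries.
    super(collectorParams, tracker, null, -1, Clock.SYSTEM_CLOCK);
  }

  @Override
  protected void doCollect() {
    docIds.add(curDocId);  // curDocId is set by collect(int) right before this hook runs
  }

  @Override
  protected void doFinishSegment(int lastSearchedDocID) {
    // Nothing per-segment to flush in this sketch.
  }

  @Override
  public List<String> getDebugInfo() {
    return Collections.singletonList("collected " + docIds.size() + " hits");
  }
}
```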
- */ - public abstract List getDebugInfo(); -} diff --git a/src/java/com/twitter/search/common/search/TwitterIndexSearcher.java b/src/java/com/twitter/search/common/search/TwitterIndexSearcher.java deleted file mode 100644 index 97f10160a..000000000 --- a/src/java/com/twitter/search/common/search/TwitterIndexSearcher.java +++ /dev/null @@ -1,189 +0,0 @@ -package com.twitter.search.common.search; - -import java.io.IOException; -import java.util.List; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.MultiDocValues; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.Terms; -import org.apache.lucene.search.CollectionStatistics; -import org.apache.lucene.search.Collector; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.TermStatistics; -import org.apache.lucene.search.Weight; - -/** - * An IndexSearch that works with TwitterEarlyTerminationCollector. - * If a stock Lucene collector is passed into search(), this IndexSearch.search() behaves the - * same as Lucene's stock IndexSearcher. However, if a TwitterEarlyTerminationCollector is passed - * in, this IndexSearcher performs early termination without relying on - * CollectionTerminatedException. - */ -public class TwitterIndexSearcher extends IndexSearcher { - public TwitterIndexSearcher(IndexReader r) { - super(r); - } - - /** - * search() main loop. - * This behaves exactly like IndexSearcher.search() if a stock Lucene collector passed in. - * However, if a TwitterCollector is passed in, this class performs Twitter style early - * termination without relying on - * {@link org.apache.lucene.search.CollectionTerminatedException}. - */ - @Override - protected void search(List leaves, Weight weight, Collector coll) - throws IOException { - - // If an TwitterCollector is passed in, we can do a few extra things in here, such - // as early termination. Otherwise we can just fall back to IndexSearcher.search(). - if (coll instanceof TwitterCollector) { - TwitterCollector collector = (TwitterCollector) coll; - - for (LeafReaderContext ctx : leaves) { // search each subreader - if (collector.isTerminated()) { - return; - } - - // Notify the collector that we're starting this segment, and check for early - // termination criteria again. setNextReader() performs 'expensive' early - // termination checks in some implementations such as TwitterEarlyTerminationCollector. - LeafCollector leafCollector = collector.getLeafCollector(ctx); - if (collector.isTerminated()) { - return; - } - - // Initialize the scorer - it should not be null. Note that constructing the scorer - // may actually do real work, such as advancing to the first hit. - Scorer scorer = weight.scorer(ctx); - - if (scorer == null) { - collector.finishSegment(DocIdSetIterator.NO_MORE_DOCS); - continue; - } - - leafCollector.setScorer(scorer); - - // Start searching. - DocIdSetIterator docIdSetIterator = scorer.iterator(); - int docID = docIdSetIterator.nextDoc(); - if (docID != DocIdSetIterator.NO_MORE_DOCS) { - // Collect results. Note: check isTerminated() before calling nextDoc(). 
- do { - leafCollector.collect(docID); - } while (!collector.isTerminated() - && (docID = docIdSetIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS); - } - - // Always finish the segment, providing the last docID advanced to. - collector.finishSegment(docID); - } - } else { - // The collector given is not a TwitterCollector, just use stock lucene search(). - super.search(leaves, weight, coll); - } - } - - /** Returns {@link NumericDocValues} for this field, or - * null if no {@link NumericDocValues} were indexed for - * this field. The returned instance should only be - * used by a single thread. */ - public NumericDocValues getNumericDocValues(String field) throws IOException { - return MultiDocValues.getNumericValues(getIndexReader(), field); - } - - @Override - public CollectionStatistics collectionStatistics(String field) throws IOException { - return collectionStatistics(field, getIndexReader()); - } - - @Override - public TermStatistics termStatistics(Term term, int docFreq, long totalTermFreq) { - return termStats(term, docFreq, totalTermFreq); - } - - /** - * Lucene relies on the fact that maxDocID is typically equal to the number of documents in the - * index, which is false when we have sparse doc IDs or when we start from 8 million docs and - * decrement, so in this class we pass in numDocs instead of the maximum assigned document ID. - * Note that the comment on {@link CollectionStatistics#maxDoc()} says that it returns the number - * of documents in the segment, not the maximum ID, and that it is only used this way. This is - * necessary for all lucene scoring methods, e.g. - * {@link org.apache.lucene.search.similarities.TFIDFSimilarity#idfExplain}. This method body is - * largely copied from {@link IndexSearcher#collectionStatistics(String)}. - */ - public static CollectionStatistics collectionStatistics(String field, IndexReader indexReader) - throws IOException { - Preconditions.checkNotNull(field); - - int docsWithField = 0; - long sumTotalTermFreq = 0; - long sumDocFreq = 0; - for (LeafReaderContext leaf : indexReader.leaves()) { - Terms terms = leaf.reader().terms(field); - if (terms == null) { - continue; - } - - docsWithField += terms.getDocCount(); - sumTotalTermFreq += terms.getSumTotalTermFreq(); - sumDocFreq += terms.getSumDocFreq(); - } - - if (docsWithField == 0) { - // The CollectionStatistics API in Lucene is designed poorly. On one hand, starting with - // Lucene 8.0.0, searchers are expected to always produce valid CollectionStatistics instances - // and all int fields in these instances are expected to be strictly greater than 0. On the - // other hand, Lucene itself produces null CollectionStatistics instances in a few places. - // Also, there's no good placeholder value to indicate that a field is empty, which is a very - // reasonable thing to happen (for example, the first few tweets in a new segment might not - // have any links, so then the resolved_links_text would be empty). So to get around this - // issue, we do here what Lucene does: we return a CollectionStatistics instance with all - // fields set to 1. - return new CollectionStatistics(field, 1, 1, 1, 1); - } - - // The writer could have added more docs to the index since this searcher started processing - // this request, or could be in the middle of adding a doc, which could mean that only some of - // the docsWithField, sumTotalTermFreq and sumDocFreq stats have been updated. 
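Given the loop above, driving an early-terminating search only requires handing a `TwitterCollector` to this searcher. A hedged usage sketch follows; the `Directory`/`Query` wiring and the method name are illustrative, and only `TwitterIndexSearcher` and `TwitterCollector` come from this package.

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

// Any TwitterCollector passed here goes through the overridden search(leaves, weight, collector)
// above, which checks isTerminated() between documents and segments instead of relying on
// CollectionTerminatedException.
static void searchWithEarlyTermination(Directory dir, Query query, TwitterCollector collector)
    throws IOException {
  try (IndexReader reader = DirectoryReader.open(dir)) {
    TwitterIndexSearcher searcher = new TwitterIndexSearcher(reader);
    searcher.search(query, collector);
  }
}
```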
I don't think - // this is a big deal, as these stats are only used for computing a hit's score, and minor - // inaccuracies should have very little effect on a hit's final score. But CollectionStatistic's - // constructor has some strict asserts for the relationship between these stats. So we need to - // make sure we cap the values of these stats appropriately. - // - // Adjust numDocs based on docsWithField (instead of doing the opposite), because: - // 1. If new documents were added to this segment after the reader was created, it seems - // reasonable to take the more recent information into account. - // 2. The termStats() method below will return the most recent docFreq (not the value that - // docFreq was set to when this reader was created). If this value is higher than numDocs, - // then Lucene might end up producing negative scores, which must never happen. - int numDocs = Math.max(indexReader.numDocs(), docsWithField); - sumDocFreq = Math.max(sumDocFreq, docsWithField); - sumTotalTermFreq = Math.max(sumTotalTermFreq, sumDocFreq); - return new CollectionStatistics(field, numDocs, docsWithField, sumTotalTermFreq, sumDocFreq); - } - - /** - * This method body is largely copied from {@link IndexSearcher#termStatistics(Term, int, long)}. - * The only difference is that we make sure all parameters we pass to the TermStatistics instance - * we create are set to at least 1 (because Lucene 8.0.0 expects them to be). - */ - public static TermStatistics termStats(Term term, int docFreq, long totalTermFreq) { - // Lucene expects the doc frequency and total term frequency to be at least 1. This assumption - // doesn't always make sense (the segment can be empty -- see comment above), but to make Lucene - // happy, make sure to always set these parameters to at least 1. - int adjustedDocFreq = Math.max(docFreq, 1); - return new TermStatistics( - term.bytes(), - adjustedDocFreq, - Math.max(totalTermFreq, adjustedDocFreq)); - } -} diff --git a/src/java/com/twitter/search/common/search/termination/BUILD b/src/java/com/twitter/search/common/search/termination/BUILD deleted file mode 100644 index 913bb480e..000000000 --- a/src/java/com/twitter/search/common/search/termination/BUILD +++ /dev/null @@ -1,20 +0,0 @@ -java_library( - name = "termination", - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/query", - "src/java/com/twitter/search/common/search", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/common/search/termination/QueryTimeout.java b/src/java/com/twitter/search/common/search/termination/QueryTimeout.java deleted file mode 100644 index 52ffa2b54..000000000 --- a/src/java/com/twitter/search/common/search/termination/QueryTimeout.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.common.search.termination; - -import com.twitter.search.common.search.DocIdTracker; - -/** - * QueryTimeout provides a method for early termination of queries. 
- */ -public interface QueryTimeout { - /** - * Returns true if query processing should terminate, otherwise false. - */ - boolean shouldExit(); - - /** - * Register a DocIdTracker for the scope of the query, to determine the last fully-searched - * doc ID after early termination. - */ - void registerDocIdTracker(DocIdTracker docIdTracker); - - /** - * Return client ID of query. - */ - String getClientId(); -} diff --git a/src/java/com/twitter/search/common/search/termination/QueryTimeoutFactory.java b/src/java/com/twitter/search/common/search/termination/QueryTimeoutFactory.java deleted file mode 100644 index 8ac2e0ec7..000000000 --- a/src/java/com/twitter/search/common/search/termination/QueryTimeoutFactory.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.common.search.termination; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -public class QueryTimeoutFactory { - /** - * Creates a QueryTimeout instance for a given EarlybirdRequest and TerminationTracker, if the - * required conditions for leaf-level timeout checking are met. Returns null otherwise. - * - * The conditions are: - * 1) CollectorTerminationParams.isEnforceQueryTimeout() - * 2) CollectorTerminationParams.isSetTimeoutMs() - */ - public QueryTimeout createQueryTimeout( - EarlybirdRequest request, - TerminationTracker tracker, - Clock clock) { - if (tracker != null - && request != null - && request.isSetSearchQuery() - && request.getSearchQuery().isSetCollectorParams() - && request.getSearchQuery().getCollectorParams().isSetTerminationParams() - && request.getSearchQuery().getCollectorParams().getTerminationParams() - .isEnforceQueryTimeout() - && request.getSearchQuery().getCollectorParams().getTerminationParams() - .isSetTimeoutMs()) { - return new QueryTimeoutImpl(request.getClientId(), tracker, clock); - } else { - return null; - } - } -} diff --git a/src/java/com/twitter/search/common/search/termination/QueryTimeoutImpl.java b/src/java/com/twitter/search/common/search/termination/QueryTimeoutImpl.java deleted file mode 100644 index 252b57db1..000000000 --- a/src/java/com/twitter/search/common/search/termination/QueryTimeoutImpl.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.common.search.termination; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.search.DocIdTracker; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.common.search.TerminationTracker; - -/** - * QueryTimeoutImpl provides a method for early termination of queries based on time. - */ -public class QueryTimeoutImpl implements QueryTimeout { - private final String clientId; - private final TerminationTracker tracker; - private final Clock clock; - - private final SearchRateCounter shouldTerminateCounter; - - public QueryTimeoutImpl(String clientId, TerminationTracker tracker, Clock clock) { - this.clientId = Preconditions.checkNotNull(clientId); - this.tracker = Preconditions.checkNotNull(tracker); - this.clock = Preconditions.checkNotNull(clock); - shouldTerminateCounter = - SearchRateCounter.export("query_timeout_should_terminate_" + clientId); - } - - /** - * Returns true when the clock's time has met or exceeded the tracker's timeout end. 
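The factory above only hands back a non-null `QueryTimeout` when the request explicitly opts into leaf-level timeout enforcement. Here is a hedged sketch of the request-side setup corresponding to those two conditions; `setEnforceQueryTimeout`/`setTimeoutMs` are assumed to be the thrift-generated counterparts of the `isEnforceQueryTimeout()`/`isSetTimeoutMs()` checks, and the 200 ms budget is illustrative.

```java
import com.twitter.common.util.Clock;
import com.twitter.search.common.query.thriftjava.CollectorTerminationParams;
import com.twitter.search.common.search.TerminationTracker;
import com.twitter.search.earlybird.thrift.EarlybirdRequest;

static QueryTimeout enableLeafLevelTimeout(
    EarlybirdRequest request, TerminationTracker tracker, Clock clock) {
  CollectorTerminationParams terminationParams =
      request.getSearchQuery().getCollectorParams().getTerminationParams();
  terminationParams.setEnforceQueryTimeout(true);  // condition 1 in createQueryTimeout()
  terminationParams.setTimeoutMs(200);             // condition 2: a concrete timeout budget
  // Non-null only when both conditions hold; null means no leaf-level enforcement.
  return new QueryTimeoutFactory().createQueryTimeout(request, tracker, clock);
}
```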
- */ - public boolean shouldExit() { - if (clock.nowMillis() >= tracker.getTimeoutEndTimeWithReservation()) { - tracker.setEarlyTerminationState(EarlyTerminationState.TERMINATED_TIME_OUT_EXCEEDED); - shouldTerminateCounter.increment(); - return true; - } - return false; - } - - @Override - public void registerDocIdTracker(DocIdTracker docIdTracker) { - tracker.addDocIdTracker(docIdTracker); - } - - @Override - public String getClientId() { - return clientId; - } - - @Override - public int hashCode() { - return clientId.hashCode() * 13 + tracker.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof QueryTimeoutImpl)) { - return false; - } - - QueryTimeoutImpl queryTimeout = QueryTimeoutImpl.class.cast(obj); - return clientId.equals(queryTimeout.clientId) && tracker.equals(queryTimeout.tracker); - } -} diff --git a/src/java/com/twitter/search/common/search/termination/TerminationQuery.java b/src/java/com/twitter/search/common/search/termination/TerminationQuery.java deleted file mode 100644 index a91ae074a..000000000 --- a/src/java/com/twitter/search/common/search/termination/TerminationQuery.java +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.search.common.search.termination; - -import java.io.IOException; -import java.util.Arrays; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * Query implementation that can timeout and return non-exhaustive results. - */ -public class TerminationQuery extends Query { - private final Query inner; - private final QueryTimeout timeout; - - public TerminationQuery(Query inner, QueryTimeout timeout) { - this.inner = Preconditions.checkNotNull(inner); - this.timeout = Preconditions.checkNotNull(timeout); - } - - @Override - public Weight createWeight( - IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException { - Weight innerWeight = inner.createWeight(searcher, scoreMode, boost); - return new TerminationQueryWeight(this, innerWeight, timeout); - } - - @Override - public Query rewrite(IndexReader reader) throws IOException { - Query rewritten = inner.rewrite(reader); - if (rewritten != inner) { - return new TerminationQuery(rewritten, timeout); - } - return this; - } - - public QueryTimeout getTimeout() { - return timeout; - } - - @Override - public int hashCode() { - return Arrays.hashCode(new Object[] {inner, timeout}); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof TerminationQuery)) { - return false; - } - - TerminationQuery terminationQuery = TerminationQuery.class.cast(obj); - return Arrays.equals(new Object[] {inner, timeout}, - new Object[] {terminationQuery.inner, terminationQuery.timeout}); - } - - @Override - public String toString(String field) { - return inner.toString(field); - } -} diff --git a/src/java/com/twitter/search/common/search/termination/TerminationQueryScorer.java b/src/java/com/twitter/search/common/search/termination/TerminationQueryScorer.java deleted file mode 100644 index d6d8af04d..000000000 --- a/src/java/com/twitter/search/common/search/termination/TerminationQueryScorer.java +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.search.common.search.termination; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.DocIdSetIterator; -import 
org.apache.lucene.search.Scorer; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.query.FilteredScorer; -import com.twitter.search.common.search.DocIdTracker; - -/** - * Scorer implementation that adds termination support for an underlying query. - * Meant to be used in conjunction with {@link TerminationQuery}. - */ -public class TerminationQueryScorer extends FilteredScorer implements DocIdTracker { - private final QueryTimeout timeout; - private int lastSearchedDocId = -1; - - TerminationQueryScorer(Weight weight, Scorer inner, QueryTimeout timeout) { - super(weight, inner); - this.timeout = Preconditions.checkNotNull(timeout); - this.timeout.registerDocIdTracker(this); - SearchRateCounter.export( - timeout.getClientId() + "_num_termination_query_scorers_created").increment(); - } - - @Override - public DocIdSetIterator iterator() { - final DocIdSetIterator superDISI = super.iterator(); - return new DocIdSetIterator() { - // lastSearchedDocId is the ID of the last document that was traversed in the posting list. - // docId is the current doc ID in this iterator. In most cases, lastSearchedDocId and docId - // will be equal. They will be different only if the query needed to be terminated based on - // the timeout. In that case, docId will be set to NO_MORE_DOCS, but lastSearchedDocId will - // still be set to the last document that was actually traversed. - private int docId = -1; - - @Override - public int docID() { - return docId; - } - - @Override - public int nextDoc() throws IOException { - if (docId == NO_MORE_DOCS) { - return NO_MORE_DOCS; - } - - if (timeout.shouldExit()) { - docId = NO_MORE_DOCS; - } else { - docId = superDISI.nextDoc(); - lastSearchedDocId = docId; - } - return docId; - } - - @Override - public int advance(int target) throws IOException { - if (docId == NO_MORE_DOCS) { - return NO_MORE_DOCS; - } - - if (target == NO_MORE_DOCS) { - docId = NO_MORE_DOCS; - lastSearchedDocId = docId; - } else if (timeout.shouldExit()) { - docId = NO_MORE_DOCS; - } else { - docId = superDISI.advance(target); - lastSearchedDocId = docId; - } - return docId; - } - - @Override - public long cost() { - return superDISI.cost(); - } - }; - } - - @Override - public int getCurrentDocId() { - return lastSearchedDocId; - } -} diff --git a/src/java/com/twitter/search/common/search/termination/TerminationQueryWeight.java b/src/java/com/twitter/search/common/search/termination/TerminationQueryWeight.java deleted file mode 100644 index 41aee0e7b..000000000 --- a/src/java/com/twitter/search/common/search/termination/TerminationQueryWeight.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.common.search.termination; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.Weight; - -/** - * Weight implementation that adds termination support for an underlying query. - * Meant to be used in conjunction with {@link TerminationQuery}. 
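`TerminationQuery`, its `Weight`, and the scorer above form a thin decorator chain around the original query. A hedged sketch of the wrapping step; only `TerminationQuery` and `QueryTimeout` come from these files, and the null check mirrors `QueryTimeoutFactory` returning null when leaf-level enforcement is disabled.

```java
import org.apache.lucene.search.Query;

// Wrap the user query so its scorer stops producing documents once the per-request
// timeout fires; otherwise run the query unwrapped.
static Query maybeWrapWithTimeout(Query userQuery, QueryTimeout queryTimeout) {
  return queryTimeout == null ? userQuery : new TerminationQuery(userQuery, queryTimeout);
}
```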
- */ -public class TerminationQueryWeight extends Weight { - private final Weight inner; - private final QueryTimeout timeout; - - TerminationQueryWeight(TerminationQuery query, Weight inner, QueryTimeout timeout) { - super(query); - this.inner = inner; - this.timeout = Preconditions.checkNotNull(timeout); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) - throws IOException { - return inner.explain(context, doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Scorer innerScorer = inner.scorer(context); - if (innerScorer != null) { - return new TerminationQueryScorer(this, innerScorer, timeout); - } - - return null; - } - - @Override - public void extractTerms(Set terms) { - inner.extractTerms(terms); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return inner.isCacheable(ctx); - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/BUILD b/src/java/com/twitter/search/common/util/earlybird/BUILD deleted file mode 100644 index ac7f561d9..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/BUILD +++ /dev/null @@ -1,32 +0,0 @@ -java_library( - sources = ["**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/relevance:ranking", - "src/java/com/twitter/search/common/relevance:text", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/runtime", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/adaptive:adaptive-results-java", - "src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:query-java", - "src/thrift/com/twitter/search/common:ranking-java", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseMergeUtil.java b/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseMergeUtil.java deleted file mode 100644 index c41003e7d..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseMergeUtil.java +++ /dev/null @@ -1,269 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.concurrent.ExecutionException; - -import com.google.common.base.Preconditions; -import com.google.common.cache.LoadingCache; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import 
com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; - -/** - * Utility methods to merge EarlybirdResponses. - */ -public final class EarlybirdResponseMergeUtil { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdResponseMergeUtil.class); - - private static final String INVALID_RESPONSE_STATS_PREFIX = "invalid_response_stats_"; - - // Stats for invalid earlybird response - private static final ImmutableMap ERROR_EXCEPTIONS; - - public static final SearchCounter NULL_RESPONSE_COUNTER = - SearchCounter.export(INVALID_RESPONSE_STATS_PREFIX + "null_response"); - public static final SearchCounter SEARCH_RESULTS_NOT_SET_COUNTER = - SearchCounter.export(INVALID_RESPONSE_STATS_PREFIX + "search_results_not_set"); - public static final SearchCounter SEARCH_RESULTS_WITH_RESULTS_NOT_SET_COUNTER = - SearchCounter.export(INVALID_RESPONSE_STATS_PREFIX + "search_results_with_results_not_set"); - public static final SearchCounter MAX_SEARCHED_STATUS_ID_NOT_SET_COUNTER = - SearchCounter.export(INVALID_RESPONSE_STATS_PREFIX + "max_searched_status_id_not_set"); - public static final SearchCounter MIN_SEARCHED_STATUS_ID_NOT_SET_COUNTER = - SearchCounter.export(INVALID_RESPONSE_STATS_PREFIX + "min_searched_status_id_not_set"); - - static { - ImmutableMap.Builder builder = ImmutableMap.builder(); - - for (EarlybirdResponseCode responseCode : EarlybirdResponseCode.values()) { - if (responseCode != EarlybirdResponseCode.SUCCESS) { - builder.put(responseCode, SearchCounter.export( - INVALID_RESPONSE_STATS_PREFIX + responseCode.name().toLowerCase())); - } - } - - ERROR_EXCEPTIONS = builder.build(); - } - - private EarlybirdResponseMergeUtil() { - } - - /** - * Tags the results in the given EarlybirdResponse with the given ThriftTweetSource and adds them - * to the given list of results. - * - * @param results The list of results to which the new results will be added. - * @param earlybirdResponse The EarlybirdResponse whose results will be added to {@code results}. - * @param tweetSource The ThriftTweetSource that will be used to mark all results in - * {@code earlybirdResponse}. - * @return {@code false} if {@code earlybirdResponse} is {@code null} or doesn't have any results; - * {@code true}, otherwise. - */ - public static boolean addResultsToList(List results, - EarlybirdResponse earlybirdResponse, - ThriftTweetSource tweetSource) { - return EarlybirdResponseUtil.hasResults(earlybirdResponse) - && addResultsToList(results, - earlybirdResponse.getSearchResults().getResults(), - tweetSource); - } - - /** - * Tags the results in the given list with the given ThriftTweetSource and adds them to the given - * list of results. - * - * @param results The list of results to which the new results will be added. - * @param resultsToAdd The list of results to add. - * @param tweetSource The ThriftTweetSource that will be used to mark all results in - * {@code resultsToAdd}. - * @return {@code false} if {@code results} is {@code null} or if {@code resultsToAdd} is - * {@code null} or doesn't have any results; {@code true}, otherwise. 
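Here is a hedged sketch of how a root might fold results from two clusters into one list with `addResultsToList`, tagging each result with its source so later deduplication can attribute duplicates to a cluster pair. `REALTIME_CLUSTER` appears elsewhere in this directory; `FULL_ARCHIVE_CLUSTER` and the variable names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

static List<ThriftSearchResult> mergeClusterResults(
    EarlybirdResponse realtimeResponse, EarlybirdResponse archiveResponse) {
  List<ThriftSearchResult> merged = new ArrayList<>();
  EarlybirdResponseMergeUtil.addResultsToList(
      merged, realtimeResponse, ThriftTweetSource.REALTIME_CLUSTER);
  EarlybirdResponseMergeUtil.addResultsToList(
      merged, archiveResponse, ThriftTweetSource.FULL_ARCHIVE_CLUSTER);  // enum value assumed
  return merged;
}
```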
- */ - public static boolean addResultsToList(List results, - List resultsToAdd, - ThriftTweetSource tweetSource) { - Preconditions.checkNotNull(results); - if ((resultsToAdd == null) || resultsToAdd.isEmpty()) { - return false; - } - - markWithTweetSource(resultsToAdd, tweetSource); - - results.addAll(resultsToAdd); - return true; - } - - /** - * Distinct the input ThriftSearchResult by its status id. If there are duplicates, the first - * instance of the duplicates is returned in the distinct result. If the distinct result is the - * same as the input result, the initial input result is returned; otherwise, the distinct result - * is returned. - * - * @param results the input result - * @param dupsStats stats counter track duplicates source - * @return the input result if there is no duplicate; otherwise, return the distinct result - */ - public static List distinctByStatusId( - List results, - LoadingCache, SearchCounter> dupsStats) { - Map seenStatusIdToSourceMap = new HashMap<>(); - List distinctResults = Lists.newArrayListWithCapacity(results.size()); - for (ThriftSearchResult result : results) { - if (seenStatusIdToSourceMap.containsKey(result.getId())) { - ThriftTweetSource source1 = seenStatusIdToSourceMap.get(result.getId()); - ThriftTweetSource source2 = result.getTweetSource(); - if (source1 != null && source2 != null) { - try { - dupsStats.get(Pair.of(source1, source2)).increment(); - } catch (ExecutionException e) { - LOG.warn("Could not increment stat for duplicate results from clusters " + source1 - + " and " + source2, e); - } - } - } else { - distinctResults.add(result); - seenStatusIdToSourceMap.put(result.getId(), result.getTweetSource()); - } - } - return results.size() == distinctResults.size() ? results : distinctResults; - } - - /** - * Tags the given results with the given ThriftTweetSource. - * - * @param results The results to be tagged. - * @param tweetSource The ThriftTweetSource to be used to tag the given results. - */ - public static void markWithTweetSource(List results, - ThriftTweetSource tweetSource) { - if (results != null) { - for (ThriftSearchResult result : results) { - result.setTweetSource(tweetSource); - } - } - } - - /** - * Check if an Earlybird response is valid - */ - public static boolean isValidResponse(final EarlybirdResponse response) { - if (response == null) { - NULL_RESPONSE_COUNTER.increment(); - return false; - } - - if (!EarlybirdResponseUtil.isSuccessfulResponse(response)) { - return false; - } - - if (!response.isSetSearchResults()) { - SEARCH_RESULTS_NOT_SET_COUNTER.increment(); - return true; - } - - if (!response.getSearchResults().isSetResults()) { - SEARCH_RESULTS_WITH_RESULTS_NOT_SET_COUNTER.increment(); - } - - // In earlybird, when earlybird terminated, e.g., time out, complex queries - we don't set the - // min/max searched status id. - boolean isEarlyTerminated = response.isSetEarlyTerminationInfo() - && response.getEarlyTerminationInfo().isEarlyTerminated(); - - if (!isEarlyTerminated && !response.getSearchResults().isSetMinSearchedStatusID()) { - MIN_SEARCHED_STATUS_ID_NOT_SET_COUNTER.increment(); - } - - if (!isEarlyTerminated && !response.getSearchResults().isSetMaxSearchedStatusID()) { - MAX_SEARCHED_STATUS_ID_NOT_SET_COUNTER.increment(); - } - - return true; - } - - /** - * For invalid successful Earlybird Response, return a failed response with debug msg. 
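`isValidResponse()` and `transformInvalidResponse()` can be combined into a per-response guard before merging; a hedged sketch follows, with the debug prefix string and method name purely illustrative.

```java
static EarlybirdResponse guardResponse(EarlybirdResponse response) {
  // Invalid responses (null or unsuccessful) are converted into explicit failures that
  // carry the downstream response code in their debug string.
  return EarlybirdResponseMergeUtil.isValidResponse(response)
      ? response
      : EarlybirdResponseMergeUtil.transformInvalidResponse(response, "realtime merge");
}
```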
- */ - public static EarlybirdResponse transformInvalidResponse(final EarlybirdResponse response, - final String debugMsg) { - if (response == null) { - return failedEarlybirdResponse(EarlybirdResponseCode.PERSISTENT_ERROR, - debugMsg + ", msg: null response from downstream"); - } - Preconditions.checkState(response.getResponseCode() != EarlybirdResponseCode.SUCCESS); - - EarlybirdResponseCode newResponseCode; - EarlybirdResponseCode responseCode = response.getResponseCode(); - switch (responseCode) { - case TIER_SKIPPED: - ERROR_EXCEPTIONS.get(responseCode).increment(); - return response; - case REQUEST_BLOCKED_ERROR: - case CLIENT_ERROR: - case SERVER_TIMEOUT_ERROR: - case QUOTA_EXCEEDED_ERROR: - case CLIENT_CANCEL_ERROR: - case TOO_MANY_PARTITIONS_FAILED_ERROR: - ERROR_EXCEPTIONS.get(responseCode).increment(); - newResponseCode = responseCode; - break; - default: - ERROR_EXCEPTIONS.get(responseCode).increment(); - newResponseCode = EarlybirdResponseCode.PERSISTENT_ERROR; - } - - String newDebugMsg = debugMsg + ", downstream response code: " + responseCode - + (response.isSetDebugString() ? ", downstream msg: " + response.getDebugString() : ""); - - - return failedEarlybirdResponse(newResponseCode, newDebugMsg); - } - - /** - * Create a new EarlybirdResponse with debug msg - */ - public static EarlybirdResponse failedEarlybirdResponse(final EarlybirdResponseCode responseCode, - final String debugMsg) { - EarlybirdResponse failedResponse = new EarlybirdResponse(); - failedResponse.setResponseCode(responseCode); - failedResponse.setDebugString(debugMsg); - return failedResponse; - } - - /** - * Returns the number of results to keep as part of merge-collection. Recency mode should ignore - * relevance options. In particular, the flag returnAllResults inside relevance options. 
- */ - public static int computeNumResultsToKeep(EarlybirdRequest request) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - - if (searchQuery.getRankingMode() != ThriftSearchRankingMode.RECENCY - && searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isReturnAllResults()) { - return Integer.MAX_VALUE; - } - - if (request.isSetNumResultsToReturnAtRoot()) { - return request.getNumResultsToReturnAtRoot(); - } - - if (searchQuery.isSetCollectorParams()) { - return searchQuery.getCollectorParams().getNumResultsToReturn(); - } - - return searchQuery.getNumResults(); - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseUtil.java b/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseUtil.java deleted file mode 100644 index 51c81edfa..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/EarlybirdResponseUtil.java +++ /dev/null @@ -1,204 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Set; -import java.util.stream.Collectors; - -import com.google.common.base.Preconditions; - -import com.twitter.search.adaptive.adaptive_results.thriftjava.TweetSource; -import com.twitter.search.common.logging.ObjectKey; -import com.twitter.search.common.runtime.DebugManager; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; - -/** Utility methods that work on EarlybirdResponses. */ -public final class EarlybirdResponseUtil { - private EarlybirdResponseUtil() { - } - - /** - * Returns the results in the given EarlybirdResponse. - * - * @param response The EarlybirdResponse. - * @return The results in the given EarlybirdResponse, or {@code null} if the response is - * {@code null} or the results are not set. - */ - public static ThriftSearchResults getResults(EarlybirdResponse response) { - if ((response == null) || !response.isSetSearchResults()) { - return null; - } - - return response.getSearchResults(); - } - - /** - * Determines if the given EarlybirdResponse has results. - * - * @param response The EarlybirdResponse. - * @return {@code true} if the given EarlybirdResponse has results; {@code false} otherwise. - */ - public static boolean hasResults(EarlybirdResponse response) { - ThriftSearchResults results = getResults(response); - return (results != null) && results.isSetResults() && !results.getResults().isEmpty(); - } - - /** - * Returns the number of results in the given EarlybirdResponse. - * - * @param response The EarlybirdResponse. - * @return The number of results in the given EarlybirdResponse. - */ - public static int getNumResults(EarlybirdResponse response) { - return hasResults(response) ? response.getSearchResults().getResultsSize() : 0; - } - - /** - * Determines the response is early-terminated. - * - * @param response The EarlybirdResponse. - * @return {@code true} if the response is early-terminated; {@code false} otherwise. 
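A hedged sketch of the typical read path over a single response using the helpers in this class (including `extractResultsFromEarlybirdResponse`, defined a little further down); the wrapper method names are illustrative.

```java
import java.util.Collections;
import java.util.List;

// An early-terminated response may not have searched the whole requested range, so
// merging/pagination code should not assume exhaustive coverage when this returns false.
static boolean coversFullRange(EarlybirdResponse response) {
  return !EarlybirdResponseUtil.isEarlyTerminated(response);
}

static List<ThriftSearchResult> readResults(EarlybirdResponse response) {
  return EarlybirdResponseUtil.hasResults(response)
      ? EarlybirdResponseUtil.extractResultsFromEarlybirdResponse(response)
      : Collections.emptyList();
}
```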
- */ - public static boolean isEarlyTerminated(EarlybirdResponse response) { - Preconditions.checkNotNull(response); - return response.isSetEarlyTerminationInfo() - && response.getEarlyTerminationInfo().isEarlyTerminated(); - } - - /** - * Returns if the response should be considered failed for purposes of stats and logging. - */ - public static boolean responseConsideredFailed(EarlybirdResponseCode code) { - return code != EarlybirdResponseCode.SUCCESS - && code != EarlybirdResponseCode.REQUEST_BLOCKED_ERROR - && code != EarlybirdResponseCode.TIER_SKIPPED; - } - - /** - * Extract results from Earlybird response. - */ - public static List extractResultsFromEarlybirdResponse( - EarlybirdResponse response) { - return hasResults(response) - ? response.getSearchResults().getResults() : Collections.emptyList(); - } - - /** - * Log the Earlybird response as a candidate source. - */ - public static EarlybirdResponse debugLogAsCandidateSource( - EarlybirdResponse response, TweetSource tweetSource) { - List results = extractResultsFromEarlybirdResponse(response); - debugLogAsCandidateSourceHelper(results, tweetSource); - return response; - } - - /** - * Log a list of ThriftSearchResult as a candidate source. - */ - public static List debugLogAsCandidateSource( - List results, TweetSource tweetSource) { - debugLogAsCandidateSourceHelper(results, tweetSource); - return results; - } - - private static void debugLogAsCandidateSourceHelper( - List results, TweetSource tweetSource) { - // debug message for Earlybird relevance candidate source - List strIds = results - .stream() - .map(ThriftSearchResult::getId) - .map(Object::toString) - .collect(Collectors.toList()); - ObjectKey debugMsgKey = ObjectKey.createTweetCandidateSourceKey( - tweetSource.name()); - DebugManager.perObjectBasic( - debugMsgKey, - String.format("[%s][%s] results: %s", debugMsgKey.getType(), debugMsgKey.getId(), strIds)); - } - - /** - * Extract the real time response from an existing response - */ - public static EarlybirdResponse extractRealtimeResponse(EarlybirdResponse response) { - EarlybirdResponse realtimeResponse = response.deepCopy(); - if (EarlybirdResponseUtil.hasResults(response)) { - List realtimeResults = realtimeResponse.getSearchResults().getResults(); - realtimeResults.clear(); - for (ThriftSearchResult result : response.getSearchResults().getResults()) { - if (result.getTweetSource() == ThriftTweetSource.REALTIME_CLUSTER) { - realtimeResults.add(result); - } - } - } - - return realtimeResponse; - } - - /** - * Returns an EarlybirdResponse that should be returned by roots when a tier was skipped. - * - * @param minId The minSearchedStatusID to be set on the response. - * @param maxId The maxSearchedStatusID to be set on the response. - * @param debugMsg The debug message to be set on the response. - * @return A response that should be returned by roots when a tier was skipped. - */ - public static EarlybirdResponse tierSkippedRootResponse(long minId, long maxId, String debugMsg) { - return new EarlybirdResponse(EarlybirdResponseCode.SUCCESS, 0) - .setSearchResults(new ThriftSearchResults() - .setResults(new ArrayList<>()) - .setMinSearchedStatusID(minId) - .setMaxSearchedStatusID(maxId)) - .setDebugString(debugMsg); - } - - /** - * Determines if the given response is a success response. - * - * A response is considered successful if it's not null and has either a SUCCESS, TIER_SKIPPED or - * REQUEST_BLOCKED_ERROR response code. - * - * @param response The response to check. 
- * @return Whether the given response is successful or not. - */ - public static boolean isSuccessfulResponse(EarlybirdResponse response) { - return response != null - && (response.getResponseCode() == EarlybirdResponseCode.SUCCESS - || response.getResponseCode() == EarlybirdResponseCode.TIER_SKIPPED - || response.getResponseCode() == EarlybirdResponseCode.REQUEST_BLOCKED_ERROR); - } - - /** - * Finds all unexpected nullcast statuses within the given result. A nullcast status is - * unexpected iff: - * 1. the tweet is a nullcast tweet. - * 2. the tweet is NOT explicitly requested with {@link ThriftSearchQuery#searchStatusIds} - */ - public static Set findUnexpectedNullcastStatusIds( - ThriftSearchResults thriftSearchResults, EarlybirdRequest request) { - Set statusIds = new HashSet<>(); - for (ThriftSearchResult result : thriftSearchResults.getResults()) { - if (resultIsNullcast(result) && !isSearchStatusId(request, result.getId())) { - statusIds.add(result.getId()); - } - } - return statusIds; - } - - private static boolean isSearchStatusId(EarlybirdRequest request, long id) { - return request.getSearchQuery().isSetSearchStatusIds() - && request.getSearchQuery().getSearchStatusIds().contains(id); - } - - private static boolean resultIsNullcast(ThriftSearchResult result) { - return result.isSetMetadata() && result.getMetadata().isIsNullcast(); - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/FacetsResultsUtils.java b/src/java/com/twitter/search/common/util/earlybird/FacetsResultsUtils.java deleted file mode 100644 index 43d5732e4..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/FacetsResultsUtils.java +++ /dev/null @@ -1,495 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.logging.DebugMessageBuilder; -import com.twitter.search.common.ranking.thriftjava.ThriftFacetFinalSortOrder; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldRequest; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftFacetRankingMode; -import com.twitter.search.earlybird.thrift.ThriftFacetRequest; -import com.twitter.search.earlybird.thrift.ThriftFacetResults; -import com.twitter.search.earlybird.thrift.ThriftTermResults; - -/** - * A utility class to provide some functions for facets results processing. 
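A hedged sketch of the per-partition facet merge flow that `prepareFieldInfoMap()` and `fillFacetFieldInfo()` (defined below) support. The generic types are spelled out here even though they appear erased in this dump, and the `partitionFacetResults`/`userIdWhitelist` parameters are assumed bindings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

static Map<String, FacetsResultsUtils.FacetFieldInfo> mergeFacetPartitions(
    ThriftFacetRequest facetRequest,
    List<ThriftFacetResults> partitionFacetResults,
    Set<Long> userIdWhitelist) {
  Map<String, FacetsResultsUtils.FacetFieldInfo> fieldInfoMap = new HashMap<>();
  // true signals that a follow-up term-stats call is needed before facets can be filtered.
  boolean needTermStatsFiltering =
      FacetsResultsUtils.prepareFieldInfoMap(facetRequest, fieldInfoMap);
  for (ThriftFacetResults partitionResults : partitionFacetResults) {
    FacetsResultsUtils.fillFacetFieldInfo(partitionResults, fieldInfoMap, userIdWhitelist);
  }
  return fieldInfoMap;
}
```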
- */ -public final class FacetsResultsUtils { - - private static final Logger LOG = LoggerFactory.getLogger(FacetsResultsUtils.class); - - private FacetsResultsUtils() { - } - - public static class FacetFieldInfo { - public ThriftFacetFieldRequest fieldRequest; - public int totalCounts; - public Map topFacets; - public List> languageHistogramEntries = Lists.newLinkedList(); - } - - // Only return top languages in the language histogram which sum up to at least this much - // ratio, here we get first 80 percentiles. - public static final double MIN_PERCENTAGE_SUM_REQUIRED = 0.8; - // if a language ratio is over this number, we already return. - public static final double MIN_PERCENTAGE = 0.01; - - /** - * Prepare facet fields with empty entries and check if we need termStats for filtering. - * Returns true if termStats filtering is needed (thus the termStats servie call). - * @param facetRequest The related facet request. - * @param facetFieldInfoMap The facet field info map to fill, a map from facet type to the facet - * fiels results info. - * @return {@code true} if termstats request is needed afterwards. - */ - public static boolean prepareFieldInfoMap( - ThriftFacetRequest facetRequest, - final Map facetFieldInfoMap) { - boolean termStatsFilteringMode = false; - - for (ThriftFacetFieldRequest fieldRequest : facetRequest.getFacetFields()) { - FacetsResultsUtils.FacetFieldInfo info = new FacetsResultsUtils.FacetFieldInfo(); - info.fieldRequest = fieldRequest; - facetFieldInfoMap.put(fieldRequest.getFieldName(), info); - if (fieldRequest.getRankingMode() == ThriftFacetRankingMode.FILTER_WITH_TERM_STATISTICS) { - termStatsFilteringMode = true; - } - } - - return termStatsFilteringMode; - } - - /** - * Extract information from one ThriftFacetResults into facetFieldInfoMap and userIDWhitelist. - * @param facetResults Related facets results. - * @param facetFieldInfoMap The facets field info map to fill, a map from facet type to the facet - * fiels results info. - * @param userIDWhitelist The user whitelist to fill. - */ - public static void fillFacetFieldInfo( - final ThriftFacetResults facetResults, - final Map facetFieldInfoMap, - final Set userIDWhitelist) { - - for (String facetField : facetResults.getFacetFields().keySet()) { - FacetsResultsUtils.FacetFieldInfo info = facetFieldInfoMap.get(facetField); - if (info.topFacets == null) { - info.topFacets = new HashMap<>(); - } - - ThriftFacetFieldResults results = facetResults.getFacetFields().get(facetField); - if (results.isSetLanguageHistogram()) { - info.languageHistogramEntries.addAll(results.getLanguageHistogram().entrySet()); - } - for (ThriftFacetCount newCount : results.getTopFacets()) { - ThriftFacetCount resultCount = info.topFacets.get(newCount.facetLabel); - if (resultCount == null) { - info.topFacets.put(newCount.facetLabel, new ThriftFacetCount(newCount)); - } else { - resultCount.setFacetCount(resultCount.facetCount + newCount.facetCount); - resultCount.setSimpleCount(resultCount.simpleCount + newCount.simpleCount); - resultCount.setWeightedCount(resultCount.weightedCount + newCount.weightedCount); - resultCount.setPenaltyCount(resultCount.penaltyCount + newCount.penaltyCount); - // this could pass the old metadata object back or a new merged one. - resultCount.setMetadata( - mergeFacetMetadata(resultCount.getMetadata(), newCount.getMetadata(), - userIDWhitelist)); - } - } - info.totalCounts += results.totalCount; - } - } - - /** - * Merge a metadata into an existing one. - * @param baseMetadata the metadata to merge into. 
- * @param metadataUpdate the new metadata to merge. - * @param userIDWhitelist user id whitelist to filter user id with. - * @return The updated metadata. - */ - public static ThriftFacetCountMetadata mergeFacetMetadata( - final ThriftFacetCountMetadata baseMetadata, - final ThriftFacetCountMetadata metadataUpdate, - final Set userIDWhitelist) { - ThriftFacetCountMetadata mergedMetadata = baseMetadata; - if (metadataUpdate != null) { - String mergedExplanation = null; - if (mergedMetadata != null) { - if (mergedMetadata.maxTweepCred < metadataUpdate.maxTweepCred) { - mergedMetadata.setMaxTweepCred(metadataUpdate.maxTweepCred); - } - - if (mergedMetadata.isSetExplanation()) { - mergedExplanation = mergedMetadata.getExplanation(); - if (metadataUpdate.isSetExplanation()) { - mergedExplanation += "\n" + metadataUpdate.getExplanation(); - } - } else if (metadataUpdate.isSetExplanation()) { - mergedExplanation = metadataUpdate.getExplanation(); - } - - if (mergedMetadata.getStatusId() == -1) { - if (LOG.isDebugEnabled()) { - LOG.debug("status id in facet count metadata is -1: " + mergedMetadata); - } - mergedMetadata = metadataUpdate; - } else if (metadataUpdate.getStatusId() != -1 - && metadataUpdate.getStatusId() < mergedMetadata.getStatusId()) { - // keep the oldest tweet, ie. the lowest status ID - mergedMetadata = metadataUpdate; - } else if (metadataUpdate.getStatusId() == mergedMetadata.getStatusId()) { - if (mergedMetadata.getTwitterUserId() == -1) { - // in this case we didn't find the user in a previous partition yet - // only update the user if the status id matches - mergedMetadata.setTwitterUserId(metadataUpdate.getTwitterUserId()); - mergedMetadata.setDontFilterUser(metadataUpdate.isDontFilterUser()); - } - if (!mergedMetadata.isSetStatusLanguage()) { - mergedMetadata.setStatusLanguage(metadataUpdate.getStatusLanguage()); - } - } - if (!mergedMetadata.isSetNativePhotoUrl() && metadataUpdate.isSetNativePhotoUrl()) { - mergedMetadata.setNativePhotoUrl(metadataUpdate.getNativePhotoUrl()); - } - } else { - mergedMetadata = metadataUpdate; - } - - // this will not set an explanation if neither oldMetadata nor metadataUpdate - // had an explanation - if (mergedExplanation != null) { - mergedMetadata.setExplanation(mergedExplanation); - } - - if (userIDWhitelist != null) { - // result must not be null now because of the if above - if (mergedMetadata.getTwitterUserId() != -1 && !mergedMetadata.isDontFilterUser()) { - mergedMetadata.setDontFilterUser( - userIDWhitelist.contains(mergedMetadata.getTwitterUserId())); - } - } - } - - return mergedMetadata; - } - - /** - * Appends all twimg results to the image results. Optionally resorts the image results if - * a comparator is passed in. - * Also computes the sums of totalCount, totalScore, totalPenalty. 
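A hedged usage sketch combining the merge helper described here with `getFacetCountComparator()` from later in this class; the `facetRequest`/`facetResults` parameters are assumed bindings, and the typed comparator is an assumption since the generic appears erased in this dump.

```java
import java.util.Comparator;

static void mergeAndSortImageFacets(
    ThriftFacetRequest facetRequest, ThriftFacetResults facetResults) {
  // The sort order comes from the request's facetRankingOptions, defaulting to SCORE.
  Comparator<ThriftFacetCount> comparator =
      FacetsResultsUtils.getFacetCountComparator(facetRequest);
  FacetsResultsUtils.mergeTwimgResults(facetResults, comparator);
}
```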
- */ - public static void mergeTwimgResults(ThriftFacetResults facetResults, - Comparator optionalSortComparator) { - if (facetResults == null || !facetResults.isSetFacetFields()) { - return; - } - - ThriftFacetFieldResults imageResults = - facetResults.getFacetFields().get(EarlybirdFieldConstant.IMAGES_FACET); - ThriftFacetFieldResults twimgResults = - facetResults.getFacetFields().remove(EarlybirdFieldConstant.TWIMG_FACET); - if (imageResults == null) { - if (twimgResults != null) { - facetResults.getFacetFields().put(EarlybirdFieldConstant.IMAGES_FACET, twimgResults); - } - return; - } - - if (twimgResults != null) { - imageResults.setTotalCount(imageResults.getTotalCount() + twimgResults.getTotalCount()); - imageResults.setTotalPenalty(imageResults.getTotalPenalty() + twimgResults.getTotalPenalty()); - imageResults.setTotalScore(imageResults.getTotalScore() + twimgResults.getTotalScore()); - for (ThriftFacetCount count : twimgResults.getTopFacets()) { - imageResults.addToTopFacets(count); - } - if (optionalSortComparator != null) { - Collections.sort(imageResults.topFacets, optionalSortComparator); - } - } - } - - /** - * Dedup twimg facets. - * - * Twimg facet uses the status ID as the facet label, instead of the twimg URL, a.k.a. - * native photo URL. It is possible to have the same twimg URL appearing in two different - * facet label (RT style retweet? copy & paste the twimg URL?). Therefore, to dedup twimg - * facet correctly, we need to look at ThriftFacetCount.metadata.nativePhotoUrl - * - * @param dedupSet A set holding the native URLs from the twimg facetFieldResults. By having - * the caller passing in the set, it allows the caller to dedup the facet - * across different ThriftFacetFieldResults. - * @param facetFieldResults The twimg facet field results to be debupped - * @param debugMessageBuilder - */ - public static void dedupTwimgFacet(Set dedupSet, - ThriftFacetFieldResults facetFieldResults, - DebugMessageBuilder debugMessageBuilder) { - if (facetFieldResults == null || facetFieldResults.getTopFacets() == null) { - return; - } - - Iterator iterator = facetFieldResults.getTopFacetsIterator(); - - while (iterator.hasNext()) { - ThriftFacetCount count = iterator.next(); - if (count.isSetMetadata() && count.getMetadata().isSetNativePhotoUrl()) { - String nativeUrl = count.getMetadata().getNativePhotoUrl(); - - if (dedupSet.contains(nativeUrl)) { - iterator.remove(); - debugMessageBuilder.detailed("dedupTwimgFacet removed %s", nativeUrl); - } else { - dedupSet.add(nativeUrl); - } - } - } - - - } - - private static final class LanguageCount { - private final ThriftLanguage lang; - private final double count; - private LanguageCount(ThriftLanguage lang, double count) { - this.lang = lang; - this.count = count; - } - } - - /** - * Calculate the top languages and store them in the results. - */ - public static void fillTopLanguages(FacetsResultsUtils.FacetFieldInfo info, - final ThriftFacetFieldResults results) { - double sumForLanguage = 0.0; - double[] sums = new double[ThriftLanguage.values().length]; - for (Map.Entry entry : info.languageHistogramEntries) { - sumForLanguage += entry.getValue(); - if (entry.getKey() == null) { - // EB might be setting null key for unknown language. 
SEARCH-1294 - continue; - } - sums[entry.getKey().getValue()] += entry.getValue(); - } - if (sumForLanguage == 0.0) { - return; - } - List langCounts = new ArrayList<>(ThriftLanguage.values().length); - for (int i = 0; i < sums.length; i++) { - if (sums[i] > 0.0) { - // ThriftLanguage.findByValue() might return null, which should fall back to UNKNOWN. - ThriftLanguage lang = ThriftLanguage.findByValue(i); - lang = lang == null ? ThriftLanguage.UNKNOWN : lang; - langCounts.add(new LanguageCount(lang, sums[i])); - } - } - Collections.sort(langCounts, (left, right) -> Double.compare(right.count, left.count)); - double percentageSum = 0.0; - Map languageHistogramMap = - new HashMap<>(langCounts.size()); - int numAdded = 0; - for (LanguageCount langCount : langCounts) { - if (langCount.count == 0.0) { - break; - } - double percentage = langCount.count / sumForLanguage; - if (percentageSum > MIN_PERCENTAGE_SUM_REQUIRED - && percentage < MIN_PERCENTAGE && numAdded >= 3) { - break; - } - languageHistogramMap.put(langCount.lang, percentage); - percentageSum += percentage; - numAdded++; - } - results.setLanguageHistogram(languageHistogramMap); - } - - /** - * Replace "p.twimg.com/" part of the native photo (twimg) URL with "pbs.twimg.com/media/". - * We need to do this because of blobstore and it's suppose to be a temporary measure. This - * code should be removed once we verified that all native photo URL being sent to Search - * are prefixed with "pbs.twimg.com/media/" and no native photo URL in our index contains - * "p.twimg.com/" - * - * Please see SEARCH-783 and EVENTS-539 for more details. - * - * @param response response containing the facet results - */ - public static void fixNativePhotoUrl(EarlybirdResponse response) { - if (response == null - || !response.isSetFacetResults() - || !response.getFacetResults().isSetFacetFields()) { - return; - } - - for (Map.Entry facetMapEntry - : response.getFacetResults().getFacetFields().entrySet()) { - final String facetResultField = facetMapEntry.getKey(); - - if (EarlybirdFieldConstant.TWIMG_FACET.equals(facetResultField) - || EarlybirdFieldConstant.IMAGES_FACET.equals(facetResultField)) { - ThriftFacetFieldResults facetFieldResults = facetMapEntry.getValue(); - for (ThriftFacetCount facetCount : facetFieldResults.getTopFacets()) { - replacePhotoUrl(facetCount.getMetadata()); - } - } - } - } - - /** - * Replace "p.twimg.com/" part of the native photo (twimg) URL with "pbs.twimg.com/media/". - * We need to do this because of blobstore and it's suppose to be a temporary measure. This - * code should be removed once we verified that all native photo URL being sent to Search - * are prefixed with "pbs.twimg.com/media/" and no native photo URL in our index contains - * "p.twimg.com/" - * - * Please see SEARCH-783 and EVENTS-539 for more details. 
- * - * @param termResultsCollection collection of ThriftTermResults containing the native photo URL - */ - public static void fixNativePhotoUrl(Collection termResultsCollection) { - if (termResultsCollection == null) { - return; - } - - for (ThriftTermResults termResults : termResultsCollection) { - if (!termResults.isSetMetadata()) { - continue; - } - replacePhotoUrl(termResults.getMetadata()); - } - } - - /** - * Helper function for fixNativePhotoUrl() - */ - private static void replacePhotoUrl(ThriftFacetCountMetadata metadata) { - if (metadata != null - && metadata.isSetNativePhotoUrl()) { - String nativePhotoUrl = metadata.getNativePhotoUrl(); - nativePhotoUrl = nativePhotoUrl.replace("://p.twimg.com/", "://pbs.twimg.com/media/"); - metadata.setNativePhotoUrl(nativePhotoUrl); - } - } - - /** - * Deepcopy of an EarlybirdResponse without explanation - */ - public static EarlybirdResponse deepCopyWithoutExplanation(EarlybirdResponse facetsResponse) { - if (facetsResponse == null) { - return null; - } else if (!facetsResponse.isSetFacetResults() - || facetsResponse.getFacetResults().getFacetFieldsSize() == 0) { - return facetsResponse.deepCopy(); - } - EarlybirdResponse copy = facetsResponse.deepCopy(); - for (Map.Entry entry - : copy.getFacetResults().getFacetFields().entrySet()) { - if (entry.getValue().getTopFacetsSize() > 0) { - for (ThriftFacetCount fc : entry.getValue().getTopFacets()) { - fc.getMetadata().unsetExplanation(); - } - } - } - return copy; - } - - /** - * Returns a comparator used to compare facet counts by calling - * getFacetCountComparator(ThriftFacetFinalSortOrder). The sort order is determined by - * the facetRankingOptions on the facet request. - */ - public static Comparator getFacetCountComparator( - ThriftFacetRequest facetRequest) { - - ThriftFacetFinalSortOrder sortOrder = ThriftFacetFinalSortOrder.SCORE; - - if (facetRequest.isSetFacetRankingOptions() - && facetRequest.getFacetRankingOptions().isSetFinalSortOrder()) { - sortOrder = facetRequest.getFacetRankingOptions().getFinalSortOrder(); - } - - return getFacetCountComparator(sortOrder); - } - - /** - * Returns a comparator using the specified order. 
- */ - public static Comparator getFacetCountComparator( - ThriftFacetFinalSortOrder sortOrder) { - - switch (sortOrder) { - case SIMPLE_COUNT: return SIMPLE_COUNT_COMPARATOR; - case SCORE: return SCORE_COMPARATOR; - case CREATED_AT: return CREATED_AT_COMPARATOR; - case WEIGHTED_COUNT: return WEIGHTED_COUNT_COMPARATOR; - default: return SCORE_COMPARATOR; - } - } - - private static final Comparator SIMPLE_COUNT_COMPARATOR = - (count1, count2) -> { - if (count1.simpleCount > count2.simpleCount) { - return 1; - } else if (count1.simpleCount < count2.simpleCount) { - return -1; - } - - return count1.facetLabel.compareTo(count2.facetLabel); - }; - - private static final Comparator WEIGHTED_COUNT_COMPARATOR = - (count1, count2) -> { - if (count1.weightedCount > count2.weightedCount) { - return 1; - } else if (count1.weightedCount < count2.weightedCount) { - return -1; - } - - return SIMPLE_COUNT_COMPARATOR.compare(count1, count2); - }; - - private static final Comparator SCORE_COMPARATOR = - (count1, count2) -> { - if (count1.score > count2.score) { - return 1; - } else if (count1.score < count2.score) { - return -1; - } - return SIMPLE_COUNT_COMPARATOR.compare(count1, count2); - }; - - private static final Comparator CREATED_AT_COMPARATOR = - (count1, count2) -> { - if (count1.isSetMetadata() && count1.getMetadata().isSetCreated_at() - && count2.isSetMetadata() && count2.getMetadata().isSetCreated_at()) { - // more recent items have higher created_at values - if (count1.getMetadata().getCreated_at() > count2.getMetadata().getCreated_at()) { - return 1; - } else if (count1.getMetadata().getCreated_at() < count2.getMetadata().getCreated_at()) { - return -1; - } - } - - return SCORE_COMPARATOR.compare(count1, count2); - }; -} diff --git a/src/java/com/twitter/search/common/util/earlybird/ResponseMergerUtils.java b/src/java/com/twitter/search/common/util/earlybird/ResponseMergerUtils.java deleted file mode 100644 index a0931a1da..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/ResponseMergerUtils.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.List; -import java.util.Set; - -import com.google.common.collect.Lists; -import com.google.common.collect.Sets; - -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public final class ResponseMergerUtils { - - // Utility class, disallow instantiation. - private ResponseMergerUtils() { - } - - /** - * Merges early termination infos from several earlybird responses. 
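For context, a minimal sketch of how a root might apply this merge when combining per-partition responses; the example class, the response objects, and the reason string are illustrative, not taken from this code:

```java
import java.util.List;

import com.google.common.collect.Lists;

import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo;
import com.twitter.search.common.util.earlybird.ResponseMergerUtils;
import com.twitter.search.earlybird.thrift.EarlybirdResponse;

public final class EarlyTerminationMergeExample {
  public static void main(String[] args) {
    // Two hypothetical per-partition responses; in practice these come back from Earlybird fanout.
    EarlyTerminationInfo terminatedEarly = new EarlyTerminationInfo(true);
    terminatedEarly.setEarlyTerminationReason("max_hits_exceeded"); // made-up reason string

    EarlybirdResponse partitionA = new EarlybirdResponse();
    partitionA.setEarlyTerminationInfo(terminatedEarly);
    EarlybirdResponse partitionB = new EarlybirdResponse(); // this one ran to completion

    List<EarlybirdResponse> responses = Lists.newArrayList(partitionA, partitionB);
    EarlyTerminationInfo merged = ResponseMergerUtils.mergeEarlyTerminationInfo(responses);

    // merged is marked early-terminated, and mergedEarlyTerminationReasons holds the
    // union of the reasons reported by the individual partitions.
    System.out.println(merged);
  }
}
```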
- * - * @param responses earlybird responses to merge the early termination infos from - * @return merged early termination info - */ - public static EarlyTerminationInfo mergeEarlyTerminationInfo(List responses) { - EarlyTerminationInfo etInfo = new EarlyTerminationInfo(false); - Set etReasonSet = Sets.newHashSet(); - // Fill in EarlyTerminationStatus - for (EarlybirdResponse ebResp : responses) { - if (ebResp.isSetEarlyTerminationInfo() - && ebResp.getEarlyTerminationInfo().isEarlyTerminated()) { - etInfo.setEarlyTerminated(true); - if (ebResp.getEarlyTerminationInfo().isSetEarlyTerminationReason()) { - etReasonSet.add(ebResp.getEarlyTerminationInfo().getEarlyTerminationReason()); - } - if (ebResp.getEarlyTerminationInfo().isSetMergedEarlyTerminationReasons()) { - etReasonSet.addAll(ebResp.getEarlyTerminationInfo().getMergedEarlyTerminationReasons()); - } - } - } - if (etInfo.isEarlyTerminated()) { - etInfo.setMergedEarlyTerminationReasons(Lists.newArrayList(etReasonSet)); - } - return etInfo; - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/ResultsUtil.java b/src/java/com/twitter/search/common/util/earlybird/ResultsUtil.java deleted file mode 100644 index e314ca553..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/ResultsUtil.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.Map; - -import com.google.common.base.Function; -import com.google.common.collect.Iterables; -import com.google.common.collect.Maps; - -/** - * Utility class used to help merging results. - */ -public final class ResultsUtil { - private ResultsUtil() { } - - /** - * Aggregate a list of responses in the following way. - * 1. For each response, mapGetter can turn the response into a map. - * 2. Dump all entries from the above map into a "total" map, which accumulates entries from - * all the responses. - */ - public static Map aggregateCountMap( - Iterable responses, - Function> mapGetter) { - Map total = Maps.newHashMap(); - for (Map map : Iterables.transform(responses, mapGetter)) { - if (map != null) { - for (Map.Entry entry : map.entrySet()) { - T key = entry.getKey(); - total.put(key, total.containsKey(key) - ? 
total.get(key) + entry.getValue() : entry.getValue()); - } - } - } - return total; - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/TermStatisticsUtil.java b/src/java/com/twitter/search/common/util/earlybird/TermStatisticsUtil.java deleted file mode 100644 index e599d5cf3..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/TermStatisticsUtil.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.concurrent.TimeUnit; - -import com.twitter.search.earlybird.thrift.ThriftHistogramSettings; - -/** - * A utility class to provide some functions for TermStatistics request processing - */ -public final class TermStatisticsUtil { - - private static final org.slf4j.Logger LOG = - org.slf4j.LoggerFactory.getLogger(TermStatisticsUtil.class); - - private TermStatisticsUtil() { - } - - /** - * Determine the binsize base on settings in ThriftHistogramSettings.granularity - */ - public static int determineBinSize(ThriftHistogramSettings histogramSettings) { - final int DEFAULT_BINSIZE = (int) TimeUnit.HOURS.toSeconds(1); - int binSize; - switch (histogramSettings.getGranularity()) { - case DAYS: - binSize = (int) TimeUnit.DAYS.toSeconds(1); - break; - case HOURS: - binSize = (int) TimeUnit.HOURS.toSeconds(1); - break; - case MINUTES: - binSize = (int) TimeUnit.MINUTES.toSeconds(1); - break; - case CUSTOM: - binSize = histogramSettings.isSetBinSizeInSeconds() - ? histogramSettings.getBinSizeInSeconds() - : DEFAULT_BINSIZE; - break; - default: - binSize = DEFAULT_BINSIZE; - LOG.warn("Unknown ThriftHistogramGranularityType {} using default binsize: {}", - histogramSettings.getGranularity(), DEFAULT_BINSIZE); - } - - return binSize; - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchQueryUtil.java b/src/java/com/twitter/search/common/util/earlybird/ThriftSearchQueryUtil.java deleted file mode 100644 index 441349715..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchQueryUtil.java +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; - -/** - * Utility class from constructing ThriftSearchQuery. - */ -public final class ThriftSearchQueryUtil { - private ThriftSearchQueryUtil() { } - - /** - * Convenience methods for constructing a ThriftSearchQuery. - */ - public static ThriftSearchQuery newSearchQuery(String serializedQuery, int numResults) { - ThriftSearchQuery searchQuery = new ThriftSearchQuery(); - searchQuery.setSerializedQuery(serializedQuery); - searchQuery.setCollectorParams(new CollectorParams().setNumResultsToReturn(numResults)); - return searchQuery; - } - - /** Determines if the given request was initiated by a logged in user. 
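A self-contained sketch of how these two helpers might be exercised together; the wrapper class, the serialized query string, the result count, and the searcher id are all made-up values:

```java
import com.twitter.search.common.util.earlybird.ThriftSearchQueryUtil;
import com.twitter.search.earlybird.thrift.EarlybirdRequest;
import com.twitter.search.earlybird.thrift.ThriftSearchQuery;

public final class SearchQueryExample {
  public static void main(String[] args) {
    // Build a query asking for 20 results for an illustrative serialized query string.
    ThriftSearchQuery query = ThriftSearchQueryUtil.newSearchQuery("(cat dog)", 20);
    query.setSearcherId(12345L); // a positive searcher id marks a logged-in user

    EarlybirdRequest request = new EarlybirdRequest();
    request.setSearchQuery(query);
    System.out.println(ThriftSearchQueryUtil.requestInitiatedByLoggedInUser(request)); // true
  }
}
```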
*/ - public static boolean requestInitiatedByLoggedInUser(EarlybirdRequest request) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - return (searchQuery != null) && searchQuery.isSetSearcherId() - && (searchQuery.getSearcherId() > 0); - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultUtil.java b/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultUtil.java deleted file mode 100644 index 3b9661c6e..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultUtil.java +++ /dev/null @@ -1,209 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import java.util.List; -import java.util.Map; -import javax.annotation.Nullable; - -import com.google.common.base.Function; -import com.google.common.base.Predicate; -import com.google.common.base.Predicates; -import com.google.common.collect.Iterables; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.relevance.ranking.ActionChain; -import com.twitter.search.common.relevance.ranking.filters.ExactDuplicateFilter; -import com.twitter.search.common.relevance.text.VisibleTokenRatioNormalizer; -import com.twitter.search.common.runtime.ActionChainDebugManager; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftFacetResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; - -/** - * ThriftSearchResultUtil contains some simple static methods for constructing - * ThriftSearchResult objects. 
- */ -public final class ThriftSearchResultUtil { - private ThriftSearchResultUtil() { } - - private static final VisibleTokenRatioNormalizer NORMALIZER = - VisibleTokenRatioNormalizer.createInstance(); - - public static final Function> LANG_MAP_GETTER = - searchResults -> searchResults.getLanguageHistogram(); - public static final Function> HIT_COUNTS_MAP_GETTER = - searchResults -> searchResults.getHitCounts(); - - // Some useful Predicates - public static final Predicate IS_OFFENSIVE_TWEET = - result -> { - if (result != null && result.isSetMetadata()) { - ThriftSearchResultMetadata metadata = result.getMetadata(); - return metadata.isIsOffensive(); - } else { - return false; - } - }; - - public static final Predicate IS_TOP_TWEET = - result -> result != null - && result.isSetMetadata() - && result.getMetadata().isSetResultType() - && result.getMetadata().getResultType() == ThriftSearchResultType.POPULAR; - - public static final Predicate FROM_FULL_ARCHIVE = - result -> result != null - && result.isSetTweetSource() - && result.getTweetSource() == ThriftTweetSource.FULL_ARCHIVE_CLUSTER; - - public static final Predicate IS_FULL_ARCHIVE_TOP_TWEET = - Predicates.and(FROM_FULL_ARCHIVE, IS_TOP_TWEET); - - public static final Predicate IS_NSFW_BY_ANY_MEANS_TWEET = - result -> { - if (result != null && result.isSetMetadata()) { - ThriftSearchResultMetadata metadata = result.getMetadata(); - return metadata.isIsUserNSFW() - || metadata.isIsOffensive() - || metadata.getExtraMetadata().isIsSensitiveContent(); - } else { - return false; - } - }; - - /** - * Returns the number of underlying ThriftSearchResult results. - */ - public static int numResults(ThriftSearchResults results) { - if (results == null || !results.isSetResults()) { - return 0; - } else { - return results.getResultsSize(); - } - } - - /** - * Returns the list of tweet IDs in ThriftSearchResults. - * Returns null if there's no results. - */ - @Nullable - public static List getTweetIds(ThriftSearchResults results) { - if (numResults(results) > 0) { - return getTweetIds(results.getResults()); - } else { - return null; - } - } - - /** - * Returns the list of tweet IDs in a list of ThriftSearchResult. - * Returns null if there's no results. - */ - public static List getTweetIds(@Nullable List results) { - if (results != null && results.size() > 0) { - return Lists.newArrayList(Iterables.transform( - results, - searchResult -> searchResult.getId() - )); - } - return null; - } - - /** - * Given ThriftSearchResults, build a map from tweet ID to the tweets metadata. - */ - public static Map getTweetMetadataMap( - Schema schema, ThriftSearchResults results) { - Map resultMap = Maps.newHashMap(); - if (results == null || results.getResultsSize() == 0) { - return resultMap; - } - for (ThriftSearchResult searchResult : results.getResults()) { - resultMap.put(searchResult.getId(), searchResult.getMetadata()); - } - return resultMap; - } - - /** - * Return the total number of facet results in ThriftFacetResults, by summing up the number - * of facet results in each field. - */ - public static int numFacetResults(ThriftFacetResults results) { - if (results == null || !results.isSetFacetFields()) { - return 0; - } else { - int numResults = 0; - for (ThriftFacetFieldResults field : results.getFacetFields().values()) { - if (field.isSetTopFacets()) { - numResults += field.topFacets.size(); - } - } - return numResults; - } - } - - /** - * Updates the search statistics on base, by adding the corresponding stats from delta. 
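A minimal sketch of the statistics merge, using made-up tweet ids; note that incrementCounts only folds the counters and searched-id bounds together, not the result lists themselves:

```java
import com.google.common.collect.Lists;

import com.twitter.search.common.util.earlybird.ThriftSearchResultUtil;
import com.twitter.search.earlybird.thrift.ThriftSearchResult;
import com.twitter.search.earlybird.thrift.ThriftSearchResults;

public final class SearchResultsMergeExample {
  public static void main(String[] args) {
    // Hypothetical partial results from two shards; the ids are made up.
    ThriftSearchResult newer = new ThriftSearchResult();
    newer.setId(100L);
    ThriftSearchResults base = new ThriftSearchResults();
    base.setResults(Lists.newArrayList(newer));
    base.setMinSearchedStatusID(100L);
    base.setMaxSearchedStatusID(100L);

    ThriftSearchResults delta = new ThriftSearchResults();
    delta.setMinSearchedStatusID(90L);
    delta.setMaxSearchedStatusID(90L);

    // Fold the delta's search statistics into base: min/max searched ids, hit counts, score, etc.
    ThriftSearchResultUtil.incrementCounts(base, delta);

    System.out.println(ThriftSearchResultUtil.numResults(base));  // 1 (result lists are not merged here)
    System.out.println(ThriftSearchResultUtil.getTweetIds(base)); // [100]
  }
}
```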
- */ - public static void incrementCounts(ThriftSearchResults base, - ThriftSearchResults delta) { - if (delta.isSetNumHitsProcessed()) { - base.setNumHitsProcessed(base.getNumHitsProcessed() + delta.getNumHitsProcessed()); - } - if (delta.isSetNumPartitionsEarlyTerminated() && delta.getNumPartitionsEarlyTerminated() > 0) { - // This currently used for merging results on a single earlybird, so we don't sum up all the - // counts, just set it to 1 if we see one that was early terminated. - base.setNumPartitionsEarlyTerminated(1); - } - if (delta.isSetMaxSearchedStatusID()) { - long deltaMax = delta.getMaxSearchedStatusID(); - if (!base.isSetMaxSearchedStatusID() || deltaMax > base.getMaxSearchedStatusID()) { - base.setMaxSearchedStatusID(deltaMax); - } - } - if (delta.isSetMinSearchedStatusID()) { - long deltaMin = delta.getMinSearchedStatusID(); - if (!base.isSetMinSearchedStatusID() || deltaMin < base.getMinSearchedStatusID()) { - base.setMinSearchedStatusID(deltaMin); - } - } - if (delta.isSetScore()) { - if (base.isSetScore()) { - base.setScore(base.getScore() + delta.getScore()); - } else { - base.setScore(delta.getScore()); - } - } - } - - /** - * Removes the duplicates from the given list of results. - * - * @param results The list of ThriftSearchResults. - * @return The given list with duplicates removed. - */ - public static List removeDuplicates(List results) { - ActionChain filterChain = - ActionChainDebugManager - .createActionChainBuilder("RemoveDuplicatesFilters") - .appendActions(new ExactDuplicateFilter()) - .build(); - return filterChain.apply(results); - } - - /** - * Returns ranking score from Earlybird shard-based ranking models if any, and 0 otherwise. - */ - public static double getTweetScore(@Nullable ThriftSearchResult result) { - if (result == null || !result.isSetMetadata() || !result.getMetadata().isSetScore()) { - return 0.0; - } - return result.getMetadata().getScore(); - } -} diff --git a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultsRelevanceStatsUtil.java b/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultsRelevanceStatsUtil.java deleted file mode 100644 index 182dd1274..000000000 --- a/src/java/com/twitter/search/common/util/earlybird/ThriftSearchResultsRelevanceStatsUtil.java +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.search.common.util.earlybird; - -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -public final class ThriftSearchResultsRelevanceStatsUtil { - private ThriftSearchResultsRelevanceStatsUtil() { } - - /** - * Adding ThriftSearchResultsRelevanceStats from one set of results onto a base set. - * Assumes all values are set on both of the inputs. - * - * @param base the stats to add to. - * @param delta the stats to be added. 
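A tiny sketch with made-up counters; as the javadoc above notes, the utility assumes both structs are fully populated, although unset thrift integers simply contribute zero in this toy case:

```java
import com.twitter.search.common.util.earlybird.ThriftSearchResultsRelevanceStatsUtil;
import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats;

public final class RelevanceStatsMergeExample {
  public static void main(String[] args) {
    ThriftSearchResultsRelevanceStats base = new ThriftSearchResultsRelevanceStats();
    base.setNumScored(10);
    base.setNumSkipped(2);

    ThriftSearchResultsRelevanceStats delta = new ThriftSearchResultsRelevanceStats();
    delta.setNumScored(5);
    delta.setNumSkipped(1);

    // Accumulate the per-shard relevance counters into the base struct.
    ThriftSearchResultsRelevanceStatsUtil.addRelevanceStats(base, delta);
    System.out.println(base.getNumScored()); // 15
  }
}
```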
- */ - public static void addRelevanceStats(ThriftSearchResultsRelevanceStats base, - ThriftSearchResultsRelevanceStats delta) { - base.setNumScored(base.getNumScored() + delta.getNumScored()); - base.setNumSkipped(base.getNumSkipped() + delta.getNumSkipped()); - base.setNumSkippedForAntiGaming( - base.getNumSkippedForAntiGaming() + delta.getNumSkippedForAntiGaming()); - base.setNumSkippedForLowReputation( - base.getNumSkippedForLowReputation() + delta.getNumSkippedForLowReputation()); - base.setNumSkippedForLowTextScore( - base.getNumSkippedForLowTextScore() + delta.getNumSkippedForLowTextScore()); - base.setNumSkippedForSocialFilter( - base.getNumSkippedForSocialFilter() + delta.getNumSkippedForSocialFilter()); - base.setNumSkippedForLowFinalScore( - base.getNumSkippedForLowFinalScore() + delta.getNumSkippedForLowFinalScore()); - if (delta.getOldestScoredTweetAgeInSeconds() > base.getOldestScoredTweetAgeInSeconds()) { - base.setOldestScoredTweetAgeInSeconds(delta.getOldestScoredTweetAgeInSeconds()); - } - - base.setNumFromDirectFollows(base.getNumFromDirectFollows() + delta.getNumFromDirectFollows()); - base.setNumFromTrustedCircle(base.getNumFromTrustedCircle() + delta.getNumFromTrustedCircle()); - base.setNumReplies(base.getNumReplies() + delta.getNumReplies()); - base.setNumRepliesTrusted(base.getNumRepliesTrusted() + delta.getNumRepliesTrusted()); - base.setNumRepliesOutOfNetwork( - base.getNumRepliesOutOfNetwork() + delta.getNumRepliesOutOfNetwork()); - base.setNumSelfTweets(base.getNumSelfTweets() + delta.getNumSelfTweets()); - base.setNumWithMedia(base.getNumWithMedia() + delta.getNumWithMedia()); - base.setNumWithNews(base.getNumWithNews() + delta.getNumWithNews()); - base.setNumSpamUser(base.getNumSpamUser() + delta.getNumSpamUser()); - base.setNumOffensive(base.getNumOffensive() + delta.getNumOffensive()); - base.setNumBot(base.getNumBot() + delta.getNumBot()); - } -} diff --git a/src/java/com/twitter/search/common/util/lang/BUILD b/src/java/com/twitter/search/common/util/lang/BUILD deleted file mode 100644 index e88e63360..000000000 --- a/src/java/com/twitter/search/common/util/lang/BUILD +++ /dev/null @@ -1,18 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - provides = artifact( - org = "com.twitter.search.common.util", - name = "lang", - repo = artifactory, - ), - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/text/language:locale-util", - "src/thrift/com/twitter/search/common:constants-java", - ], -) diff --git a/src/java/com/twitter/search/common/util/lang/ThriftLanguageUtil.java b/src/java/com/twitter/search/common/util/lang/ThriftLanguageUtil.java deleted file mode 100644 index 2ede4c2f0..000000000 --- a/src/java/com/twitter/search/common/util/lang/ThriftLanguageUtil.java +++ /dev/null @@ -1,141 +0,0 @@ -package com.twitter.search.common.util.lang; - -import java.lang.reflect.Field; -import java.util.Locale; -import java.util.Map; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; - -/** - * This class can be used to convert ThriftLanguage to Locale 
object and vise versa. - */ -public final class ThriftLanguageUtil { - private static final Logger LOG = LoggerFactory.getLogger(ThriftLanguageUtil.class.getName()); - - // stores ThriftLanguage.id -> Locale mapping - private static final Locale[] LOCALES; - - // stores Locale -> ThriftLanguage mapping - private static final Map THRIFT_LANGUAGES; - - static { - LOCALES = new Locale[ThriftLanguage.values().length]; - Map thriftLanguageMap = Maps.newHashMap(); - - // get all languages defined in ThriftLanguage - Field[] fields = ThriftLanguage.class.getDeclaredFields(); - for (Field field : fields) { - if (!field.isEnumConstant()) { - continue; - } - - try { - ThriftLanguage thriftLang = (ThriftLanguage) field.get(null); - String thriftLanguageName = field.getName(); - - // get corresponding Locale declared in LocaleUtil - try { - Field localeUtilField = LocaleUtil.class.getDeclaredField(thriftLanguageName); - Locale localeLang = (Locale) localeUtilField.get(null); - - LOCALES[thriftLang.getValue()] = localeLang; - thriftLanguageMap.put(localeLang, thriftLang); - } catch (NoSuchFieldException e) { - LOG.warn("{} is defined in ThriftLanguage, but not in LocaleUtil.", thriftLanguageName); - } - } catch (IllegalAccessException e) { - // shouldn't happen. - LOG.warn("Could not get a declared field.", e); - } - } - - // Let's make sure that all Locales defined in LocaleUtil are also defined in ThriftLanguage - for (Locale lang : LocaleUtil.getDefinedLanguages()) { - if (!thriftLanguageMap.containsKey(lang)) { - LOG.warn("{} is defined in LocaleUtil but not in ThriftLanguage.", lang.getLanguage()); - } - } - - THRIFT_LANGUAGES = ImmutableMap.copyOf(thriftLanguageMap); - } - - private ThriftLanguageUtil() { - } - - /** - * Returns a Locale object which corresponds to a given ThriftLanguage object. - * @param language ThriftLanguage object - * @return a corresponding Locale object - */ - public static Locale getLocaleOf(ThriftLanguage language) { - // Note that ThriftLanguage.findByValue() can return null (thrift generated code). - // So ThriftLanguageUtil.getLocaleOf needs to handle null correctly. - if (language == null) { - return LocaleUtil.UNKNOWN; - } - - Preconditions.checkArgument(language.getValue() < LOCALES.length); - return LOCALES[language.getValue()]; - } - - /** - * Returns a ThriftLanguage object which corresponds to a given Locale object. - * - * @param language Locale object - * @return a corresponding ThriftLanguage object, or UNKNOWN if there's no corresponding one. - */ - public static ThriftLanguage getThriftLanguageOf(Locale language) { - Preconditions.checkNotNull(language); - ThriftLanguage thriftLang = THRIFT_LANGUAGES.get(language); - return thriftLang == null ? ThriftLanguage.UNKNOWN : thriftLang; - } - - /** - * Returns a ThriftLanguage object which corresponds to a given language code. - * - * @param languageCode BCP-47 language code - * @return a corresponding ThriftLanguage object, or UNKNOWN if there's no corresponding one. - */ - public static ThriftLanguage getThriftLanguageOf(String languageCode) { - Preconditions.checkNotNull(languageCode); - ThriftLanguage thriftLang = THRIFT_LANGUAGES.get(LocaleUtil.getLocaleOf(languageCode)); - return thriftLang == null ? ThriftLanguage.UNKNOWN : thriftLang; - } - - /** - * Returns a ThriftLanguage object which corresponds to a given int value. 
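A short sketch of round-tripping between ThriftLanguage, Locale, and BCP-47 codes with these helpers; the language code is illustrative and assumes the language is defined in both LocaleUtil and ThriftLanguage:

```java
import java.util.Locale;

import com.twitter.search.common.constants.thriftjava.ThriftLanguage;
import com.twitter.search.common.util.lang.ThriftLanguageUtil;

public final class LanguageConversionExample {
  public static void main(String[] args) {
    // Code -> enum -> Locale -> code.
    ThriftLanguage lang = ThriftLanguageUtil.getThriftLanguageOf("ja");
    Locale locale = ThriftLanguageUtil.getLocaleOf(lang);
    String code = ThriftLanguageUtil.getLanguageCodeOf(lang);

    // Out-of-range enum values fall back to UNKNOWN instead of surfacing null.
    ThriftLanguage fallback = ThriftLanguageUtil.safeFindByValue(-42);

    System.out.println(locale + " " + code + " " + fallback);
  }
}
```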
- * If value is not valid, returns ThriftLanguage.UNKNOWN - * @param value value of language - * @return a corresponding ThriftLanguage object - */ - public static ThriftLanguage safeFindByValue(int value) { - ThriftLanguage thriftLang = ThriftLanguage.findByValue(value); - return thriftLang == null ? ThriftLanguage.UNKNOWN : thriftLang; - } - - /** - * Returns the language code which corresponds to a given ThriftLanguage. - * - * Note that multiple ThriftLanguage entries can return the same language code. - * - * @param thriftLang ThriftLanguage object - * @return Corresponding language or null if thriftLang is null. - */ - @Nullable - public static String getLanguageCodeOf(@Nullable ThriftLanguage thriftLang) { - if (thriftLang == null) { - return null; - } - return ThriftLanguageUtil.getLocaleOf(thriftLang).getLanguage(); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/BUILD b/src/java/com/twitter/search/common/util/ml/BUILD deleted file mode 100644 index b6c67753a..000000000 --- a/src/java/com/twitter/search/common/util/ml/BUILD +++ /dev/null @@ -1,16 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/util/io", - ], -) diff --git a/src/java/com/twitter/search/common/util/ml/EnumBasedLinearModel.java b/src/java/com/twitter/search/common/util/ml/EnumBasedLinearModel.java deleted file mode 100644 index 50b2fc46a..000000000 --- a/src/java/com/twitter/search/common/util/ml/EnumBasedLinearModel.java +++ /dev/null @@ -1,141 +0,0 @@ -package com.twitter.search.common.util.ml; - -import java.io.IOException; -import java.util.EnumMap; -import java.util.EnumSet; -import java.util.Map; -import java.util.Set; - -import com.google.common.base.Preconditions; -import com.google.common.base.Predicates; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.util.io.TextFileLoadingUtils; - -/** - * Represents a linear model for scoring and classification. - * - * The list of features is defined by an Enum class. The model weights and instances are - * represented as maps that must contain an entry for all the values of the enum. - * - */ -public class EnumBasedLinearModel> implements MapBasedLinearModel { - - private final EnumSet features; - private final EnumMap weights; - - /** - * Creates a model from a map of weights. - * - * @param enumType Enum used for the keys - * @param weights Feature weights. 
- */ - public EnumBasedLinearModel(Class enumType, Map weights) { - features = EnumSet.allOf(enumType); - EnumMap enumWeights = - new EnumMap<>(Maps.filterValues(weights, Predicates.notNull())); - Preconditions.checkArgument(features.equals(enumWeights.keySet()), - "The model does not include weights for all the available features"); - - this.weights = enumWeights; - } - - public ImmutableMap getWeights() { - return Maps.immutableEnumMap(weights); - } - - @Override - public float score(Map instance) { - float total = 0; - for (Map.Entry weightEntry : weights.entrySet()) { - Float feature = instance.get(weightEntry.getKey()); - if (feature != null) { - total += weightEntry.getValue() * feature; - } - } - return total; - } - - /** - * Determines whether an instance is positive. - */ - @Override - public boolean classify(float threshold, Map instance) { - return score(instance) > threshold; - } - - @Override - public boolean classify(Map instance) { - return classify(0, instance); - } - - @Override - public String toString() { - return String.format("EnumBasedLinearModel[%s]", weights); - } - - /** - * Creates a model where all the features have the same weight. - * This method is useful for generating the feature vectors for training a new model. - */ - public static > EnumBasedLinearModel createWithEqualWeight(Class enumType, - Float weight) { - EnumSet features = EnumSet.allOf(enumType); - EnumMap weights = Maps.newEnumMap(enumType); - for (T feature : features) { - weights.put(feature, weight); - } - return new EnumBasedLinearModel<>(enumType, weights); - } - - /** - * Loads the model from a TSV file with the following format: - * - * feature_name \t weight - */ - public static > EnumBasedLinearModel createFromFile( - Class enumType, AbstractFile path) throws IOException { - return new EnumBasedLinearModel<>(enumType, loadWeights(enumType, path, true)); - } - - /** - * Loads the model from a TSV file, using a default weight of 0 for missing features. - * - * File format: - * - * feature_name \t weight - */ - public static > EnumBasedLinearModel createFromFileSafe( - Class enumType, AbstractFile path) throws IOException { - return new EnumBasedLinearModel<>(enumType, loadWeights(enumType, path, false)); - } - - /** - * Creates a map of (feature_name, weight) from a TSV file. - * - * If strictMode is true, it will throw an exception if the file doesn't contain all the - * features declared in the enum. Otherwise, it will use zero as default value. 
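A self-contained sketch using a made-up feature enum rather than one of the real feature enums in the codebase; it also illustrates the constructor's requirement that every enum value has a weight:

```java
import java.util.Map;

import com.google.common.collect.ImmutableMap;

import com.twitter.search.common.util.ml.EnumBasedLinearModel;

public final class EnumModelExample {
  // Hypothetical feature set for illustration only.
  enum DemoFeature { TEXT_SCORE, HAS_MEDIA }

  public static void main(String[] args) {
    // Weights must cover all enum values, otherwise the constructor throws.
    EnumBasedLinearModel<DemoFeature> model = new EnumBasedLinearModel<>(
        DemoFeature.class,
        ImmutableMap.of(DemoFeature.TEXT_SCORE, 0.8f, DemoFeature.HAS_MEDIA, 0.2f));

    Map<DemoFeature, Float> instance =
        ImmutableMap.of(DemoFeature.TEXT_SCORE, 0.5f, DemoFeature.HAS_MEDIA, 1.0f);

    float score = model.score(instance);                // 0.8 * 0.5 + 0.2 * 1.0 ~= 0.6
    boolean positive = model.classify(0.5f, instance);  // true, since ~0.6 > 0.5
    System.out.println(score + " " + positive);
  }
}
```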
- * - */ - private static > EnumMap loadWeights( - Class enumType, AbstractFile fileHandle, boolean strictMode) throws IOException { - Map weightsFromFile = - TextFileLoadingUtils.loadMapFromFile(fileHandle, input -> Float.parseFloat(input)); - EnumMap weights = Maps.newEnumMap(enumType); - Set expectedFeatures = EnumSet.allOf(enumType); - if (!strictMode) { - for (T feature : expectedFeatures) { - weights.put(feature, 0f); - } - } - for (String featureName : weightsFromFile.keySet()) { - Float weight = weightsFromFile.get(featureName); - weights.put(Enum.valueOf(enumType, featureName.toUpperCase()), weight); - } - Preconditions.checkArgument(expectedFeatures.equals(weights.keySet()), - "Model does not contain weights for all the features"); - return weights; - } -} diff --git a/src/java/com/twitter/search/common/util/ml/FeatureUtils.java b/src/java/com/twitter/search/common/util/ml/FeatureUtils.java deleted file mode 100644 index fef79620d..000000000 --- a/src/java/com/twitter/search/common/util/ml/FeatureUtils.java +++ /dev/null @@ -1,120 +0,0 @@ -package com.twitter.search.common.util.ml; - -import java.util.List; -import java.util.Map; -import java.util.Optional; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Sets; - -/** - * Utilities for feature transformation and extraction. - */ -public final class FeatureUtils { - - private FeatureUtils() { - } - - /** - * Computes the difference between 2 values and returns the ratio of the difference over the - * minimum of both, according to these cases: - * - * 1. if (a > b) return a / b - * 2. if (a < b) return - b / a - * 3. if (a == b == 0) return 0 - * - * The upper/lower limit is (-) maxRatio. For cases 1 and 2, if the denominator is 0, - * it returns maxRatio. - * - * This method is used to define a feature that tells how much larger or smaller is the - * first value with respect to the second one.. - */ - public static float diffRatio(float a, float b, float maxRatio) { - float diff = a - b; - if (diff == 0) { - return 0; - } - float denominator = Math.min(a, b); - float ratio = denominator != 0 ? Math.abs(diff / denominator) : maxRatio; - return Math.copySign(Math.min(ratio, maxRatio), diff); - } - - /** - * Computes the cosine similarity between two maps that represent sparse vectors. - */ - public static double cosineSimilarity( - Map vector1, Map vector2) { - if (vector1 == null || vector1.isEmpty() || vector2 == null || vector2.isEmpty()) { - return 0; - } - double squaredSum1 = 0; - double squaredSum2 = 0; - double squaredCrossSum = 0; - - for (K key : Sets.union(vector1.keySet(), vector2.keySet())) { - double value1 = 0; - double value2 = 0; - - V optValue1 = vector1.get(key); - if (optValue1 != null) { - value1 = optValue1.doubleValue(); - } - V optValue2 = vector2.get(key); - if (optValue2 != null) { - value2 = optValue2.doubleValue(); - } - - squaredSum1 += value1 * value1; - squaredSum2 += value2 * value2; - squaredCrossSum += value1 * value2; - } - - if (squaredSum1 == 0 || squaredSum2 == 0) { - return 0; - } else { - return squaredCrossSum / Math.sqrt(squaredSum1 * squaredSum2); - } - } - - /** - * Computes the cosine similarity between two (dense) vectors. 
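A small sketch of the helpers above with made-up feature values:

```java
import java.util.Optional;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;

import com.twitter.search.common.util.ml.FeatureUtils;

public final class FeatureUtilsExample {
  public static void main(String[] args) {
    // Signed "how much larger" feature: 30 vs 10 -> +2.0, 10 vs 30 -> -2.0, capped at +/-5.
    float ratio = FeatureUtils.diffRatio(30f, 10f, 5f);

    // Cosine similarity of two dense vectors (identical here, so 1.0).
    double cosine = FeatureUtils.cosineSimilarity(
        ImmutableList.of(1.0, 2.0), ImmutableList.of(1.0, 2.0));

    // Key with the highest value.
    Optional<String> best = FeatureUtils.findMaxKey(ImmutableMap.of("recency", 0.1, "text", 0.7));

    System.out.println(ratio + " " + cosine + " " + best.orElse("none"));
  }
}
```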
- */ - public static double cosineSimilarity( - List vector1, List vector2) { - if (vector1 == null || vector1.isEmpty() || vector2 == null || vector2.isEmpty()) { - return 0; - } - - Preconditions.checkArgument(vector1.size() == vector2.size()); - double squaredSum1 = 0; - double squaredSum2 = 0; - double squaredCrossSum = 0; - for (int i = 0; i < vector1.size(); i++) { - double value1 = vector1.get(i).doubleValue(); - double value2 = vector2.get(i).doubleValue(); - squaredSum1 += value1 * value1; - squaredSum2 += value2 * value2; - squaredCrossSum += value1 * value2; - } - - if (squaredSum1 == 0 || squaredSum2 == 0) { - return 0; - } else { - return squaredCrossSum / Math.sqrt(squaredSum1 * squaredSum2); - } - } - - /** - * Finds the key of the map with the highest value (compared in natural order) - */ - @SuppressWarnings("unchecked") - public static Optional findMaxKey(Map map) { - if (map == null || map.isEmpty()) { - return Optional.empty(); - } - - Optional> maxEntry = map.entrySet().stream().max(Map.Entry.comparingByValue()); - return maxEntry.map(Map.Entry::getKey); - } - -} diff --git a/src/java/com/twitter/search/common/util/ml/MapBasedLinearModel.java b/src/java/com/twitter/search/common/util/ml/MapBasedLinearModel.java deleted file mode 100644 index 0f8899271..000000000 --- a/src/java/com/twitter/search/common/util/ml/MapBasedLinearModel.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.common.util.ml; - -import java.util.Map; - -/** - * An interface for linear models that are backed by some sort of map - */ -public interface MapBasedLinearModel { - /** - * Evaluate using this model given a feature vector. - * @param instance The feature vector in format of a hashmap. - * @return - */ - boolean classify(Map instance); - - /** - * Evaluate using this model given a classification threshold and a feature vector. - * @param threshold Score threshold used for classification. - * @param instance The feature vector in format of a hashmap. - * @return - */ - boolean classify(float threshold, Map instance); - - /** - * Computes the score of an instance as a linear combination of the features and the model - * weights. 0 is used as default value for features or weights that are not present. - * - * @param instance The feature vector in format of a hashmap. - * @return The instance score according to the model. - */ - float score(Map instance); -} diff --git a/src/java/com/twitter/search/common/util/ml/StringMapBasedLinearModel.java b/src/java/com/twitter/search/common/util/ml/StringMapBasedLinearModel.java deleted file mode 100644 index cc0686ef4..000000000 --- a/src/java/com/twitter/search/common/util/ml/StringMapBasedLinearModel.java +++ /dev/null @@ -1,125 +0,0 @@ -package com.twitter.search.common.util.ml; - -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.base.Function; -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.util.io.TextFileLoadingUtils; - -import it.unimi.dsi.fastutil.objects.Object2FloatMap; -import it.unimi.dsi.fastutil.objects.Object2FloatOpenHashMap; - -/** - * Represents a linear model for scoring and classification. - * - * Features are represented as arbitrary strings, making this a fairly flexible implementation - * (at the cost of some performance, since all operations require hash lookups). 
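A minimal sketch of scoring with string-keyed weights; the feature names and weights are made up, and features absent from the model contribute nothing to the score:

```java
import com.google.common.collect.ImmutableMap;

import com.twitter.search.common.util.ml.StringMapBasedLinearModel;

public final class StringModelExample {
  public static void main(String[] args) {
    StringMapBasedLinearModel model = new StringMapBasedLinearModel(
        ImmutableMap.of("text_score", 2.0f, "is_reply", -1.0f));

    // 2.0 * 0.5 + (-1.0) * 1.0 + 0 * 3.0 = 0.0 ("unused_feature" has no weight in the model).
    float score = model.score(
        ImmutableMap.of("text_score", 0.5f, "is_reply", 1.0f, "unused_feature", 3.0f));

    // classify() uses a default threshold of 0, so a strictly positive score is required.
    boolean positive = model.classify(ImmutableMap.of("text_score", 1.0f)); // true: 2.0 > 0
    System.out.println(score + " " + positive);
  }
}
```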
Instances - * and weights are both encoded sparsely (as maps) so this implementation is well suited to - * models with large feature sets where most features are inactive at a given time. Weights - * for unknown features are assumed to be 0. - * - */ -public class StringMapBasedLinearModel implements MapBasedLinearModel { - private static final Logger LOG = LoggerFactory.getLogger(StringMapBasedLinearModel.class); - - protected final Object2FloatMap model = new Object2FloatOpenHashMap<>(); - - /** - * Creates a model from a map of weights. - * - * @param weights Feature weights. - */ - public StringMapBasedLinearModel(Map weights) { - model.putAll(weights); - model.defaultReturnValue(0.0f); - } - - /** - * Get the weight of a feature - * @param featureName - * @return - */ - public float getWeight(String featureName) { - return model.getFloat(featureName); - } - - /** - * Get the full weight map - */ - @VisibleForTesting - protected Map getWeights() { - return model; - } - - /** - * Evaluate using this model given a feature vector. - * @param values The feature vector in format of a hashmap. - * @return - */ - @Override - public float score(Map values) { - float score = 0.0f; - for (Map.Entry value : values.entrySet()) { - String featureName = value.getKey(); - float weight = getWeight(featureName); - if (weight != 0.0f) { - score += weight * value.getValue(); - if (LOG.isDebugEnabled()) { - LOG.debug(String.format("%s = %.3f * %.3f = %.3f, ", - featureName, weight, value.getValue(), - weight * value.getValue())); - } - } - } - if (LOG.isDebugEnabled()) { - LOG.debug(String.format("Score = %.3f", score)); - } - return score; - } - - /** - * Determines whether an instance is positive. - */ - @Override - public boolean classify(Map values) { - return classify(0.0f, values); - } - - @Override - public boolean classify(float threshold, Map values) { - return score(values) > threshold; - } - - public int size() { - return model.size(); - } - - @Override - public String toString() { - StringBuilder sb = new StringBuilder(); - sb.append("StringMapBasedLinearModel["); - for (Map.Entry entry : model.entrySet()) { - sb.append(String.format("(%s = %.3f), ", entry.getKey(), entry.getValue())); - } - sb.append("]"); - return sb.toString(); - } - - /** - * Loads the model from a TSV file with the following format: - * - * feature_name \t weight - */ - public static StringMapBasedLinearModel loadFromFile(AbstractFile fileHandle) { - Map weights = - TextFileLoadingUtils.loadMapFromFile( - fileHandle, - (Function) item -> Float.parseFloat(item)); - return new StringMapBasedLinearModel(weights); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/models_manager/BUILD b/src/java/com/twitter/search/common/util/ml/models_manager/BUILD deleted file mode 100644 index ba62e194c..000000000 --- a/src/java/com/twitter/search/common/util/ml/models_manager/BUILD +++ /dev/null @@ -1,14 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - strict_deps = True, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/slf4j:slf4j-api", - "3rdparty/jvm/org/yaml:snakeyaml", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/metrics", - ], -) diff --git a/src/java/com/twitter/search/common/util/ml/models_manager/BaseModelsManager.java b/src/java/com/twitter/search/common/util/ml/models_manager/BaseModelsManager.java deleted file mode 100644 index 
5c94c7988..000000000 --- a/src/java/com/twitter/search/common/util/ml/models_manager/BaseModelsManager.java +++ /dev/null @@ -1,293 +0,0 @@ -package com.twitter.search.common.util.ml.models_manager; - -import java.io.BufferedReader; -import java.io.IOException; -import java.io.UncheckedIOException; -import java.util.Collections; -import java.util.Date; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.Optional; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.Executors; -import java.util.concurrent.TimeUnit; -import java.util.function.Function; -import java.util.function.Supplier; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Strings; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Sets; -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.yaml.snakeyaml.Yaml; - -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.file.FileUtils; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; - -/** - * Loads models from HDFS and provides an interface for reloading them periodically. - * - * There are 2 possible ways of detecting the active models: - * - * - DirectorySupplier: Uses all the subdirectories of a base path - * - ConfigSupplier: Gets the list from from a configuration file - * - * Models can be updated or added. Depending on the selected method, existing models can be removed - * if they are no longer active. - */ -public abstract class BaseModelsManager implements Runnable { - private static final Logger LOG = LoggerFactory.getLogger(BaseModelsManager.class); - - protected final Map lastModifiedMsByModel = new ConcurrentHashMap<>(); - protected final Map loadedModels = new ConcurrentHashMap<>(); - protected final Supplier> activeModelsSupplier; - - protected Map prevLoadedModels = new ConcurrentHashMap<>(); - - // This flag determines whether models are unloaded immediately when they're removed from - // activeModelsSupplier. If false, old models stay in memory until the process is restarted. - // This may be useful to safely change model configuration without restarting. 
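A heavily simplified sketch of a concrete subclass: the "model" here is just the directory name so the example stays self-contained, and the HDFS path and thread name are invented. A real subclass would deserialize actual model files inside readModelFromDirectory:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import com.twitter.search.common.file.AbstractFile;
import com.twitter.search.common.file.FileUtils;
import com.twitter.search.common.util.ml.models_manager.BaseModelsManager;

/** Toy manager whose "model" is just the directory name. */
public final class DirectoryNameModelsManager extends BaseModelsManager<String> {
  public DirectoryNameModelsManager(Supplier<Map<String, AbstractFile>> activeModels) {
    super(activeModels, true, "directory_name_demo"); // unload models that disappear from the base dir
  }

  @Override
  public String readModelFromDirectory(AbstractFile modelBaseDir) {
    return modelBaseDir.getName(); // stand-in for real model deserialization
  }

  public static void main(String[] args) {
    AbstractFile baseDir = FileUtils.getHdfsFileHandle("/illustrative/models/base/dir");
    DirectoryNameModelsManager manager =
        new DirectoryNameModelsManager(new BaseModelsManager.DirectorySupplier(baseDir));

    // Reload every 10 minutes on a daemon thread.
    manager.scheduleAtFixedRate(10, TimeUnit.MINUTES, "model-loader-demo");

    Optional<String> modelA = manager.getModel("model_a"); // empty until the first refresh completes
    System.out.println(modelA);
  }
}
```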
- protected final boolean shouldUnloadInactiveModels; - - protected final SearchLongGauge numModels; - protected final SearchCounter numErrors; - protected final SearchLongGauge lastLoadedMs; - - protected Supplier shouldServeModels; - protected Supplier shouldLoadModels; - - public BaseModelsManager( - Supplier> activeModelsSupplier, - boolean shouldUnloadInactiveModels, - String statsPrefix - ) { - this( - activeModelsSupplier, - shouldUnloadInactiveModels, - statsPrefix, - () -> true, - () -> true - ); - } - - public BaseModelsManager( - Supplier> activeModelsSupplier, - boolean shouldUnloadInactiveModels, - String statsPrefix, - Supplier shouldServeModels, - Supplier shouldLoadModels - ) { - this.activeModelsSupplier = activeModelsSupplier; - this.shouldUnloadInactiveModels = shouldUnloadInactiveModels; - - this.shouldServeModels = shouldServeModels; - this.shouldLoadModels = shouldLoadModels; - - numModels = SearchLongGauge.export( - String.format("model_loader_%s_num_models", statsPrefix)); - numErrors = SearchCounter.export( - String.format("model_loader_%s_num_errors", statsPrefix)); - lastLoadedMs = SearchLongGauge.export( - String.format("model_loader_%s_last_loaded_timestamp_ms", statsPrefix)); - } - - /** - * Retrieves a particular model. - */ - public Optional getModel(String name) { - if (shouldServeModels.get()) { - return Optional.ofNullable(loadedModels.get(name)); - } else { - return Optional.empty(); - } - } - - /** - * Reads a model instance from the directory file instance. - * - * @param modelBaseDir AbstractFile instance representing the directory. - * @return Model instance parsed from the directory. - */ - public abstract T readModelFromDirectory(AbstractFile modelBaseDir) throws Exception; - - /** - * Cleans up any resources used by the model instance. - * This method is called after removing the model from the in-memory map. - * Sub-classes can provide custom overridden implementation as required. - * - * @param unloadedModel Model instance that would be unloaded from the manager. - */ - protected void cleanUpUnloadedModel(T unloadedModel) { } - - @Override - public void run() { - // Get available models, either from the config file or by listing the base directory - final Map modelPathsFromConfig; - if (!shouldLoadModels.get()) { - LOG.info("Loading models is currently disabled."); - return; - } - - modelPathsFromConfig = activeModelsSupplier.get(); - for (Map.Entry nameAndPath : modelPathsFromConfig.entrySet()) { - String modelName = nameAndPath.getKey(); - try { - AbstractFile modelDirectory = nameAndPath.getValue(); - if (!modelDirectory.exists() && loadedModels.containsKey(modelName)) { - LOG.warn("Loaded model '{}' no longer exists at HDFS path {}, keeping loaded version; " - + "replace directory in HDFS to update model.", modelName, modelDirectory); - continue; - } - - long previousModifiedTimestamp = lastModifiedMsByModel.getOrDefault(modelName, 0L); - long lastModifiedMs = modelDirectory.getLastModified(); - if (previousModifiedTimestamp == lastModifiedMs) { - continue; - } - - LOG.info("Starting to load model. name={} path={}", modelName, modelDirectory.getPath()); - T model = Preconditions.checkNotNull(readModelFromDirectory(modelDirectory)); - LOG.info("Model initialized: {}. 
Last modified: {} ({})", - modelName, lastModifiedMs, new Date(lastModifiedMs)); - T previousModel = loadedModels.put(modelName, model); - lastModifiedMsByModel.put(modelName, lastModifiedMs); - - if (previousModel != null) { - cleanUpUnloadedModel(previousModel); - } - } catch (Exception e) { - numErrors.increment(); - LOG.error("Error initializing model: {}", modelName, e); - } - } - - // Remove any currently loaded models not present in the latest list - if (shouldUnloadInactiveModels) { - Set inactiveModels = - Sets.difference(loadedModels.keySet(), modelPathsFromConfig.keySet()).immutableCopy(); - - for (String modelName : inactiveModels) { - T modelToUnload = loadedModels.get(modelName); - loadedModels.remove(modelName); - - if (modelToUnload != null) { - // We could have an inactive model key without a model (value) if the - // initial readModelFromDirectory failed for the model entry. - // Checking for null to avoid exception. - cleanUpUnloadedModel(modelToUnload); - } - LOG.info("Unloaded model that is no longer active: {}", modelName); - } - } - - if (!prevLoadedModels.keySet().equals(loadedModels.keySet())) { - LOG.info("Finished loading models: {}", loadedModels.keySet()); - } - prevLoadedModels = loadedModels; - numModels.set(loadedModels.size()); - lastLoadedMs.set(System.currentTimeMillis()); - } - - /** - * Schedules the loader to run periodically. - * @param period Period between executions - * @param timeUnit The time unit the period parameter. - */ - public final void scheduleAtFixedRate( - long period, TimeUnit timeUnit, String builderThreadName) { - Executors.newSingleThreadScheduledExecutor( - new ThreadFactoryBuilder() - .setDaemon(true) - .setNameFormat(builderThreadName) - .build()) - .scheduleAtFixedRate(this, 0, period, timeUnit); - } - - /** - * Gets the active list of models from the subdirectories in a base directory. - * - * Each model is identified by the name of the subdirectory. - */ - @VisibleForTesting - public static class DirectorySupplier implements Supplier> { - private static final Logger LOG = LoggerFactory.getLogger(DirectorySupplier.class); - private final AbstractFile baseDir; - - public DirectorySupplier(AbstractFile baseDir) { - this.baseDir = baseDir; - } - - @Override - public Map get() { - try { - LOG.info("Loading models from the directories in: {}", baseDir.getPath()); - List modelDirs = - ImmutableList.copyOf(baseDir.listFiles(AbstractFile.IS_DIRECTORY)); - LOG.info("Found {} model directories: {}", modelDirs.size(), modelDirs); - return modelDirs.stream() - .collect(Collectors.toMap( - AbstractFile::getName, - Function.identity() - )); - } catch (IOException e) { - throw new UncheckedIOException(e); - } - } - } - - /** - * Gets the active list of models by reading a YAML config file. - * - * The keys are the model names, the values are dictionaries with a single entry for the path - * of the model in HDFS (without the HDFS name node prefix). 
For example: - * - * model_a: - * path: /path/to/model_a - * model_b: - * path: /path/to/model_b - * - */ - @VisibleForTesting - public static class ConfigSupplier implements Supplier> { - - private final AbstractFile configFile; - - public ConfigSupplier(AbstractFile configFile) { - this.configFile = configFile; - } - - @SuppressWarnings("unchecked") - @Override - public Map get() { - try (BufferedReader configReader = configFile.getCharSource().openBufferedStream()) { - Yaml yamlParser = new Yaml(); - //noinspection unchecked - Map> config = - (Map>) yamlParser.load(configReader); - - if (config == null || config.isEmpty()) { - return Collections.emptyMap(); - } - - Map modelPaths = new HashMap<>(); - for (Map.Entry> nameAndConfig : config.entrySet()) { - String path = Strings.emptyToNull(nameAndConfig.getValue().get("path")); - Preconditions.checkNotNull(path, "Missing path for model: %s", nameAndConfig.getKey()); - modelPaths.put(nameAndConfig.getKey(), FileUtils.getHdfsFileHandle(path)); - } - return modelPaths; - } catch (IOException e) { - throw new UncheckedIOException(e); - } - } - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/BUILD b/src/java/com/twitter/search/common/util/ml/prediction_engine/BUILD deleted file mode 100644 index 45513f511..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/BUILD +++ /dev/null @@ -1,68 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common_internal/hadoop", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/transform", - "src/java/com/twitter/ml/common/base", - "src/java/com/twitter/ml/prediction/core", - "src/java/com/twitter/ml/tool/prediction:ModelInterpreter", - "src/java/com/twitter/ml/vw/constant", - "src/java/com/twitter/mlv2/trees/predictor", - "src/java/com/twitter/mlv2/trees/scorer", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/ml/models_manager", - "src/java/com/twitter/search/modeling/common", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/search/common:features-java", - ], -) - -java_library( - name = "for-timelines", - sources = [ - "BaseLegacyScoreAccumulator.java", - "BaseModelBuilder.java", - "BaseScoreAccumulator.java", - "CompositeFeatureContext.java", - "DiscretizedFeature.java", - "DiscretizedFeatureRange.java", - "LegacyModelBuilder.java", - "LightweightLinearModel.java", - "ModelBuilder.java", - "SchemaBasedModelBuilder.java", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common_internal/hadoop", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/transform:DiscretizerTransform", - "src/java/com/twitter/ml/common/base", - "src/java/com/twitter/ml/tool/prediction:ModelInterpreter", - 
"src/java/com/twitter/ml/vw/constant", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/ml/models_manager", - "src/java/com/twitter/search/modeling/common", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/search/common:features-java", - ], -) diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseLegacyScoreAccumulator.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseLegacyScoreAccumulator.java deleted file mode 100644 index 02c92b0d6..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseLegacyScoreAccumulator.java +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import com.google.common.base.Preconditions; - -import com.twitter.ml.api.Feature; - -/** - * Score accumulator for legacy (non-schema-based) features. It provides methods to add features - * using Feature objects. - * - * @deprecated This class is retired and we suggest to switch to schema-based features. - */ -@Deprecated -public abstract class BaseLegacyScoreAccumulator extends BaseScoreAccumulator { - - public BaseLegacyScoreAccumulator(LightweightLinearModel model) { - super(model); - Preconditions.checkState(!model.isSchemaBased(), - "Cannot create LegacyScoreAccumulator with a schema-based model: %s", model.getName()); - } - - /** - * Add to the score the weight of a binary feature (if it's present). - * - * @deprecated This function is retired and we suggest to switch to addSchemaBooleanFeatures in - * SchemaBasedScoreAccumulator. - */ - @Deprecated - protected BaseLegacyScoreAccumulator addBinaryFeature(Feature feature, - boolean value) { - if (value) { - Double weight = model.binaryFeatures.get(feature); - if (weight != null) { - score += weight; - } - } - return this; - } - - /** - * Add to the score the weight of a continuous feature. - *

- * If the model uses real valued features, it multiplies its weight by the provided value. - * Otherwise, it tries to find the discretized feature and adds its weight to the score. - * - * @deprecated This function is retired and we suggest to switch to addSchemaContinuousFeatures in - * SchemaBasedScoreAccumulator. - */ - @Deprecated - protected BaseLegacyScoreAccumulator addContinuousFeature(Feature feature, - double value) { - Double weightFromContinuous = model.continuousFeatures.get(feature); - if (weightFromContinuous != null) { - score += weightFromContinuous * value; - } else { - DiscretizedFeature discretizedFeature = model.discretizedFeatures.get(feature); - if (discretizedFeature != null) { - // Use only the weight of the discretized feature (there's no need to multiply it) - score += discretizedFeature.getWeight(value); - } - } - return this; - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseModelBuilder.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseModelBuilder.java deleted file mode 100644 index 2d4d539ee..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseModelBuilder.java +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Collection; -import java.util.Comparator; -import java.util.List; - -import com.google.common.collect.Lists; - -import com.twitter.ml.api.FeatureParser; -import com.twitter.ml.api.transform.DiscretizerTransform; -import com.twitter.ml.tool.prediction.ModelInterpreter; - -/** - * The base model builder for LightweightLinearModels. - */ -public abstract class BaseModelBuilder implements ModelBuilder { - // Ignore features that have an absolute weight lower than this value - protected static final double MIN_WEIGHT = 1e-9; - private static final String BIAS_FIELD_NAME = ModelInterpreter.BIAS_FIELD_NAME; - static final String DISCRETIZER_NAME_SUFFIX = - "." + DiscretizerTransform.DEFAULT_FEATURE_NAME_SUFFIX; - - protected final String modelName; - protected double bias; - - public BaseModelBuilder(String modelName) { - this.modelName = modelName; - this.bias = 0.0; - } - - /** - * Collects all the ranges of a discretized feature and sorts them. - */ - static DiscretizedFeature buildFeature(Collection ranges) { - List sortedRanges = Lists.newArrayList(ranges); - sortedRanges.sort(Comparator.comparingDouble(a -> a.minValue)); - - double[] splits = new double[ranges.size()]; - double[] weights = new double[ranges.size()]; - - for (int i = 0; i < sortedRanges.size(); i++) { - splits[i] = sortedRanges.get(i).minValue; - weights[i] = sortedRanges.get(i).weight; - } - return new DiscretizedFeature(splits, weights); - } - - /** - * Parses a line from the interpreted model text file. See the javadoc of the constructor for - * more details about how to create the text file. - *

- * The file uses TSV format with 3 columns: - *

- * 1. Model name (generated by ML API, but ignored by this class). - * 2. Feature definition: the name of the feature, or a definition from the MDL discretizer. - * 3. Weight: the weight of the feature on the LOGIT scale. - *

- * When it parses each line, it stores the weights for all the features defined in the context, - * as well as the bias, but it ignores any other feature (e.g. label, prediction or - * meta.record_weight) and features with a small absolute weight (see MIN_WEIGHT). - *

- * Example lines: - *

- * model_name bias 0.019735312089324074 - * model_name demo.binary_feature 0.06524706073105327 - * model_name demo.continuous_feature 0.0 - * model_name demo.continuous_feature.dz/dz_model=mdl/dz_range=-inf_3.58e-01 0.07155931927263737 - * model_name demo.continuous_feature.dz/dz_model=mdl/dz_range=3.58e-01_inf -0.08979256264865387 - * - * @see ModelInterpreter - * @see DiscretizerTransform - */ - @Override - public ModelBuilder parseLine(String line) { - String[] columns = line.split("\t"); - if (columns.length != 3) { - return this; - } - - // columns[0] has the model name, which we don't need - String featureName = columns[1]; - double weight = Double.parseDouble(columns[2]); - - if (BIAS_FIELD_NAME.equals(featureName)) { - bias = weight; - return this; - } - - FeatureParser parser = FeatureParser.parse(featureName); - String baseName = parser.getBaseName(); - - if (Math.abs(weight) < MIN_WEIGHT && !baseName.endsWith(DISCRETIZER_NAME_SUFFIX)) { - // skip, unless it represents a range of a discretized feature. - // discretized features with all zeros should also be removed, but will handle that later - return this; - } - - addFeature(baseName, weight, parser); - return this; - } - - /** - * Adds feature to the model - */ - protected abstract void addFeature(String baseName, double weight, FeatureParser parser); - - @Override - public abstract LightweightLinearModel build(); -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseScoreAccumulator.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseScoreAccumulator.java deleted file mode 100644 index 1be1c4872..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/BaseScoreAccumulator.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -/** - * The base class for a lightweight scorer based on a model and some feature data. - * - * @param The type of feature data to be scored with - */ -public abstract class BaseScoreAccumulator { - protected final LightweightLinearModel model; - protected double score; - - public BaseScoreAccumulator(LightweightLinearModel model) { - this.model = model; - this.score = model.bias; - } - - /** - * Compute score with a model and feature data - */ - public final double scoreWith(D featureData, boolean useLogitScore) { - updateScoreWithFeatures(featureData); - return useLogitScore ? getLogitScore() : getSigmoidScore(); - } - - public final void reset() { - this.score = model.bias; - } - - /** - * Update the accumulator score with features, after this function the score should already - * be computed. - */ - protected abstract void updateScoreWithFeatures(D data); - - /** - * Get the already accumulated score - */ - protected final double getLogitScore() { - return score; - } - - /** - * Returns the score as a value mapped between 0 and 1. 
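As a quick numeric illustration of that mapping (the logit value below is made up, not taken from a real model), the accumulated logit is passed through a standard sigmoid:

```java
// Hypothetical accumulated logit; illustrative only.
double logit = 2.0;
double probability = 1 / (1 + Math.exp(-logit));  // ~0.8808, i.e. mapped into (0, 1)
```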
- */ - protected final double getSigmoidScore() { - return 1 / (1 + Math.exp(-score)); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/CompositeFeatureContext.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/CompositeFeatureContext.java deleted file mode 100644 index 5da921b13..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/CompositeFeatureContext.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.function.Supplier; -import javax.annotation.Nullable; - -import com.twitter.ml.api.FeatureContext; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; - -/** - * An object to store feature context information to build models with. - */ -public class CompositeFeatureContext { - // legacy static feature context - private final FeatureContext legacyContext; - // a supplier for the context (well the schema itself) of the schema-based features - private final Supplier schemaSupplier; - - public CompositeFeatureContext( - FeatureContext legacyContext, - @Nullable Supplier schemaSupplier) { - this.legacyContext = legacyContext; - this.schemaSupplier = schemaSupplier; - } - - FeatureContext getLegacyContext() { - return legacyContext; - } - - ThriftSearchFeatureSchema getFeatureSchema() { - if (schemaSupplier == null) { - throw new UnsupportedOperationException("Feature schema was not initialized"); - } - return schemaSupplier.get(); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/DecisionForestModelsManager.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/DecisionForestModelsManager.java deleted file mode 100644 index 7b9d84ebf..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/DecisionForestModelsManager.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.io.IOException; -import java.util.Collections; -import java.util.Map; -import java.util.function.Supplier; - -import com.google.common.base.Preconditions; - -import com.twitter.ml.api.FeatureContext; -import com.twitter.mlv2.trees.predictor.CartTree; -import com.twitter.mlv2.trees.scorer.DecisionForestScorer; -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.util.ml.models_manager.BaseModelsManager; - -/** - * Loads Decision Forest based models and keep them in memory. Can also be scheduled to reload - * models periodically. - * - * Note: Each instance is tied to a single {@link FeatureContext} instance. So, to load models - * for different tasks, you should use different instances of the this class. 
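A minimal wiring sketch for this manager, under stated assumptions: the config path, feature context, and stats prefix are placeholders, and the periodic-reload step assumes the manager is scheduled as a Runnable (as the overridden run() in the no-op factory below suggests):

```java
// Hypothetical setup; the path and stats prefix are illustrative only.
AbstractFile configFile = FileUtils.getHdfsFileHandle("/path/to/decision_forest_models.yml");
FeatureContext featureContext = new FeatureContext();  // the single context this instance is tied to
DecisionForestModelsManager manager =
    DecisionForestModelsManager.createUsingConfigFile(configFile, featureContext, "forest_models_");

// Assumed scheduling pattern: reload the active models periodically.
Executors.newSingleThreadScheduledExecutor()
    .scheduleAtFixedRate(manager, 0, 10, TimeUnit.MINUTES);
```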
- */ -public class DecisionForestModelsManager extends BaseModelsManager> { - private static final String MODEL_FILE_NAME = "model.json"; - - private final FeatureContext featureContext; - - DecisionForestModelsManager( - Supplier> activeModelsSupplier, - FeatureContext featureContext, - boolean shouldUnloadInactiveModels, - String statsPrefix - ) { - super(activeModelsSupplier, shouldUnloadInactiveModels, statsPrefix); - this.featureContext = featureContext; - } - - @Override - public DecisionForestScorer readModelFromDirectory(AbstractFile modelBaseDir) - throws IOException { - String modelFilePath = modelBaseDir.getChild(MODEL_FILE_NAME).getPath(); - return DecisionForestScorer.createCartTreeScorer(modelFilePath, featureContext); - } - - /** - * Creates an instance that loads the models specified in a configuration file. - * - * Note that if the configuration file changes and it doesn't include a model that was present - * before, the model will be removed (i.e. it unloads models that are not active anymore). - */ - public static DecisionForestModelsManager createUsingConfigFile( - AbstractFile configFile, FeatureContext featureContext, String statsPrefix) { - Preconditions.checkArgument( - configFile.canRead(), "Config file is not readable: %s", configFile.getPath()); - return new DecisionForestModelsManager( - new ConfigSupplier(configFile), featureContext, true, statsPrefix); - } - - /** - * Creates a no-op instance. It can be used for tests or when the models are disabled. - */ - public static DecisionForestModelsManager createNoOp(String statsPrefix) { - return new DecisionForestModelsManager( - Collections::emptyMap, new FeatureContext(), false, statsPrefix) { - @Override - public void run() { } - }; - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeature.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeature.java deleted file mode 100644 index 562535c48..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeature.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Arrays; - -import com.google.common.base.Preconditions; - -/** - * Represents a continuous feature that has been discretized into a set of disjoint ranges. - * - * Each range [a, b) is represented by the lower split point (a) and its associated weight. - */ -class DiscretizedFeature { - - protected final double[] splitPoints; - protected final double[] weights; - - /** - * Creates an instance from a list of split points and their corresponding weights. - * - * @param splitPoints Lower values of the ranges. The first entry must be Double.NEGATIVE_INFINITY - * They must be sorted (in ascending order). - * @param weights Weights for the splits. 
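To make the range/weight lookup concrete, here is a small worked example (the split points and weights are made up, and the constructor is package-level, so this assumes same-package access):

```java
// Three ranges: (-inf, 0.5), [0.5, 2.0), [2.0, +inf), each with its own weight.
double[] splits  = { Double.NEGATIVE_INFINITY, 0.5, 2.0 };
double[] weights = { -0.1, 0.3, 0.7 };
DiscretizedFeature feature = new DiscretizedFeature(splits, weights);

feature.getWeight(0.1);  // -0.1, falls in (-inf, 0.5)
feature.getWeight(1.0);  //  0.3, falls in [0.5, 2.0)
feature.getWeight(5.0);  //  0.7, falls in [2.0, +inf)
```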
- */ - protected DiscretizedFeature(double[] splitPoints, double[] weights) { - Preconditions.checkArgument(splitPoints.length == weights.length); - Preconditions.checkArgument(splitPoints.length > 1); - Preconditions.checkArgument(splitPoints[0] == Double.NEGATIVE_INFINITY, - "First split point must be Double.NEGATIVE_INFINITY"); - this.splitPoints = splitPoints; - this.weights = weights; - } - - public double getWeight(double value) { - // binarySearch returns (- insertionPoint - 1) - int index = Math.abs(Arrays.binarySearch(splitPoints, value) + 1) - 1; - return weights[index]; - } - - public boolean allValuesBelowThreshold(double minWeight) { - for (double weight : weights) { - if (Math.abs(weight) > minWeight) { - return false; - } - } - return true; - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeatureRange.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeatureRange.java deleted file mode 100644 index 725009ab0..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/DiscretizedFeatureRange.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import com.google.common.base.Preconditions; - -/** - * The discretized value range for a continous feature. After discretization a continuous feature - * may become multiple discretized binary features, each occupying a range. This class stores this - * range and a weight for it. - */ -public class DiscretizedFeatureRange { - protected final double minValue; - protected final double maxValue; - protected final double weight; - - DiscretizedFeatureRange(double weight, String range) { - String[] limits = range.split("_"); - Preconditions.checkArgument(limits.length == 2); - - this.minValue = parseRangeValue(limits[0]); - this.maxValue = parseRangeValue(limits[1]); - this.weight = weight; - } - - private static double parseRangeValue(String value) { - if ("inf".equals(value)) { - return Double.POSITIVE_INFINITY; - } else if ("-inf".equals(value)) { - return Double.NEGATIVE_INFINITY; - } else { - return Double.parseDouble(value); - } - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/LegacyModelBuilder.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/LegacyModelBuilder.java deleted file mode 100644 index 4cb87f556..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/LegacyModelBuilder.java +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Map; - -import com.google.common.collect.HashMultimap; -import com.google.common.collect.Maps; -import com.google.common.collect.Multimap; - -import com.twitter.ml.api.Feature; -import com.twitter.ml.api.FeatureContext; -import com.twitter.ml.api.FeatureParser; -import com.twitter.ml.api.transform.DiscretizerTransform; - -/** - * The builder for a model based on the legacy (non-schema-based) features. - * See also SchemaBasedModelBuilder. 
- */ -public final class LegacyModelBuilder extends BaseModelBuilder { - - private final Map featuresByName; - // for legacy features - private final Map, Double> binaryFeatures; - private final Map, Double> continuousFeatures; - private final Multimap, DiscretizedFeatureRange> discretizedFeatureRanges; - - LegacyModelBuilder(String modelName, FeatureContext context) { - super(modelName); - featuresByName = getFeaturesByName(context); - binaryFeatures = Maps.newHashMap(); - continuousFeatures = Maps.newHashMap(); - discretizedFeatureRanges = HashMultimap.create(); - } - - private static Map getFeaturesByName(FeatureContext featureContext) { - Map featuresByName = Maps.newHashMap(); - for (Feature feature : featureContext.getAllFeatures()) { - featuresByName.put(feature.getFeatureName(), feature); - } - return featuresByName; - } - - @Override - protected void addFeature(String baseName, double weight, FeatureParser parser) { - Feature feature = featuresByName.get(baseName); - if (feature != null) { - switch (feature.getFeatureType()) { - case BINARY: - binaryFeatures.put(feature, weight); - break; - case CONTINUOUS: - continuousFeatures.put(feature, weight); - break; - default: - throw new IllegalArgumentException( - String.format("Unsupported feature type: %s", feature)); - } - } else if (baseName.endsWith(DISCRETIZER_NAME_SUFFIX) - && parser.getExtension().containsKey(DiscretizerTransform.DEFAULT_RANGE_EXT)) { - - String featureName = - baseName.substring(0, baseName.length() - DISCRETIZER_NAME_SUFFIX.length()); - - feature = featuresByName.get(featureName); - if (feature == null) { - return; - } - - String rangeSpec = parser.getExtension().get(DiscretizerTransform.DEFAULT_RANGE_EXT); - discretizedFeatureRanges.put(feature, new DiscretizedFeatureRange(weight, rangeSpec)); - } - } - - @Override - public LightweightLinearModel build() { - Map, DiscretizedFeature> discretizedFeatures = Maps.newHashMap(); - for (Feature feature : discretizedFeatureRanges.keySet()) { - DiscretizedFeature discretizedFeature = - BaseModelBuilder.buildFeature(discretizedFeatureRanges.get(feature)); - if (!discretizedFeature.allValuesBelowThreshold(MIN_WEIGHT)) { - discretizedFeatures.put(feature, discretizedFeature); - } - } - return LightweightLinearModel.createForLegacy( - modelName, bias, binaryFeatures, continuousFeatures, discretizedFeatures); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/LightweightLinearModel.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/LightweightLinearModel.java deleted file mode 100644 index 57324120b..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/LightweightLinearModel.java +++ /dev/null @@ -1,187 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.io.BufferedReader; -import java.io.FileReader; -import java.io.IOException; -import java.util.Map; -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import com.twitter.ml.api.Feature; -import com.twitter.search.common.file.AbstractFile; - -/** - * Provides an interface to the weights associated to the features of a linear model trained - * with Prediction Engine. - * - * This class is used along with ScoreAccumulator to efficiently score instances. It supports only - * a limited set of features: - * - * - Only linear models are supported. - * - Only binary and continuous features (i.e. it doesn't support discrete/categorical features). 
- * - It supports the MDL discretizer (but not the one based on trees). - * - It doesn't support feature crossings. - * - * Instances of this class should be created using only the load methods (loadFromHdfs and - * loadFromLocalFile). - * - * IMPORTANT: - * - * Use this class, and ScoreAccumulator, ONLY when runtime is a major concern. Otherwise, consider - * using Prediction Engine as a library. Ideally, we should access directly the structures that - * Prediction Engine creates when it loads a model, instead of parsing a text file with the - * feature weights. - * - * The discretized feature bins created by MDL may be too fine to be displayed properly in the - * parsed text file and there may be bins with the same min value. A binary search finding the - * bin for a same feature value therefore may end up with different bins/scores in different runs, - * producing unstable scores. See SEARCHQUAL-15957 for more detail. - * - * @see com.twitter.ml.tool.prediction.ModelInterpreter - */ -public class LightweightLinearModel { - protected final double bias; - protected final boolean schemaBased; - protected final String name; - - // for legacy metadata based model - protected final Map, Double> binaryFeatures; - protected final Map, Double> continuousFeatures; - protected final Map, DiscretizedFeature> discretizedFeatures; - - // for schema-based model - protected final Map binaryFeaturesById; - protected final Map continuousFeaturesById; - protected final Map discretizedFeaturesById; - - private static final String SCHEMA_BASED_SUFFIX = ".schema_based"; - - LightweightLinearModel( - String modelName, - double bias, - boolean schemaBased, - @Nullable Map, Double> binaryFeatures, - @Nullable Map, Double> continuousFeatures, - @Nullable Map, DiscretizedFeature> discretizedFeatures, - @Nullable Map binaryFeaturesById, - @Nullable Map continuousFeaturesById, - @Nullable Map discretizedFeaturesById) { - - this.name = modelName; - this.bias = bias; - this.schemaBased = schemaBased; - - // legacy feature maps - this.binaryFeatures = - schemaBased ? null : Preconditions.checkNotNull(binaryFeatures); - this.continuousFeatures = - schemaBased ? null : Preconditions.checkNotNull(continuousFeatures); - this.discretizedFeatures = - schemaBased ? null : Preconditions.checkNotNull(discretizedFeatures); - - // schema based feature maps - this.binaryFeaturesById = - schemaBased ? Preconditions.checkNotNull(binaryFeaturesById) : null; - this.continuousFeaturesById = - schemaBased ? Preconditions.checkNotNull(continuousFeaturesById) : null; - this.discretizedFeaturesById = - schemaBased ? 
Preconditions.checkNotNull(discretizedFeaturesById) : null; - } - - public String getName() { - return name; - } - - /** - * Create model for legacy features - */ - protected static LightweightLinearModel createForLegacy( - String modelName, - double bias, - Map, Double> binaryFeatures, - Map, Double> continuousFeatures, - Map, DiscretizedFeature> discretizedFeatures) { - return new LightweightLinearModel(modelName, bias, false, - binaryFeatures, continuousFeatures, discretizedFeatures, - null, null, null); - } - - /** - * Create model for schema-based features - */ - protected static LightweightLinearModel createForSchemaBased( - String modelName, - double bias, - Map binaryFeaturesById, - Map continuousFeaturesById, - Map discretizedFeaturesById) { - return new LightweightLinearModel(modelName, bias, true, - null, null, null, - binaryFeaturesById, continuousFeaturesById, discretizedFeaturesById); - } - - public boolean isSchemaBased() { - return schemaBased; - } - - /** - * Loads a model from a text file. - * - * See the javadoc of the constructor for more details on how to create the file from a trained - * Prediction Engine model. - * - * If schemaBased is true, the featureContext is ignored. - */ - public static LightweightLinearModel load( - String modelName, - BufferedReader reader, - boolean schemaBased, - CompositeFeatureContext featureContext) throws IOException { - - ModelBuilder builder = schemaBased - ? new SchemaBasedModelBuilder(modelName, featureContext.getFeatureSchema()) - : new LegacyModelBuilder(modelName, featureContext.getLegacyContext()); - String line; - while ((line = reader.readLine()) != null) { - builder.parseLine(line); - } - return builder.build(); - } - - /** - * Loads a model from a local text file. - * - * See the javadoc of the constructor for more details on how to create the file from a trained - * Prediction Engine model. - */ - public static LightweightLinearModel loadFromLocalFile( - String modelName, - CompositeFeatureContext featureContext, - String fileName) throws IOException { - try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) { - boolean schemaBased = modelName.endsWith(SCHEMA_BASED_SUFFIX); - return load(modelName, reader, schemaBased, featureContext); - } - } - - /** - * Loads a model from a file in the local filesystem or in HDFS. - * - * See the javadoc of the constructor for more details on how to create the file from a trained - * Prediction Engine model. - */ - public static LightweightLinearModel load( - String modelName, CompositeFeatureContext featureContext, AbstractFile modelFile) - throws IOException { - try (BufferedReader reader = modelFile.getCharSource().openBufferedStream()) { - boolean schemaBased = modelName.endsWith(SCHEMA_BASED_SUFFIX); - return load(modelName, reader, schemaBased, featureContext); - } - } - - public String toString() { - return String.format("LightweightLinearModel. {bias=%s binary=%s continuous=%s discrete=%s}", - this.bias, this.binaryFeatures, this.continuousFeatures, this.discretizedFeatures); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelBuilder.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelBuilder.java deleted file mode 100644 index f0c6612a5..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelBuilder.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -/** - * A builder interface to build a LightweightLinearModel. 
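A minimal sketch of how this builder contract is driven in practice (mirroring LightweightLinearModel.load above); the model name, reader, schema, and feature struct are placeholders, and the scoring step assumes a schema-based model:

```java
// Hypothetical driver; inputs are illustrative only.
ModelBuilder builder = new SchemaBasedModelBuilder("example_model.schema_based", featureSchema);
String line;
while ((line = reader.readLine()) != null) {
  builder.parseLine(line);                       // each TSV line updates the builder state
}
LightweightLinearModel model = builder.build();

// Scoring side: accumulate feature weights, then map the logit through a sigmoid.
SchemaBasedScoreAccumulator accumulator = new SchemaBasedScoreAccumulator(model);
double probability = accumulator.scoreWith(searchResultFeatures, /* useLogitScore= */ false);
```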
- */ -public interface ModelBuilder { - /** - * parses a line of the model file and updates the build state - */ - ModelBuilder parseLine(String line); - - /** - * builds the model - */ - LightweightLinearModel build(); -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelLoader.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelLoader.java deleted file mode 100644 index 7809161b0..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/ModelLoader.java +++ /dev/null @@ -1,178 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.io.IOException; -import java.util.List; -import java.util.Map; - -import com.google.common.base.Optional; -import com.google.common.base.Supplier; -import com.google.common.base.Suppliers; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.file.FileUtils; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; - -/** - * Loads LightweightLinearModel objects from a directory and provides an interface for reloading - * them periodically. - * - * All the models must support the same features (defined by a FeatureContext) and they are - * identified by the name of the subdirectory. This is the required directory structure: - * - * /path/to/base-directory - * one-model/model.tsv - * another-model/model.tsv - * experimental-model/model.tsv - * - * Each subdirectory must contain a file named 'model.tsv' in the format required by - * LightweightLinearModel. - */ -public class ModelLoader implements Runnable { - - private static final Logger LOG = LoggerFactory.getLogger(ModelLoader.class); - private static final String MODEL_FILE_NAME = "model.tsv"; - - private final CompositeFeatureContext featureContext; - private final Supplier directorySupplier; - - private final Map models; - private final Map lastModifiedMsByModel; - - private final SearchLongGauge lastModelLoadedAtMs; - private final SearchLongGauge numModels; - private final SearchCounter numLoads; - private final SearchCounter numErrors; - - /** - * Creates a new instance for a feature context and a base directory. - * - * It exports 4 counters: - * - * ${counterPrefix}_last_loaded: - * Timestamp (in ms) when the last model was loaded. - * ${counterPrefix}_num_models: - * Number of models currently loaded. - * ${counterPrefix}_num_loads: - * Number of succesful model loads. - * ${counterPrefix}_num_errors: - * Number of errors occurred while loading the models. - */ - protected ModelLoader( - CompositeFeatureContext featureContext, - Supplier directorySupplier, - String counterPrefix, - SearchStatsReceiver statsReceiver) { - this.featureContext = featureContext; - - // This function returns the base directory every time we call 'run'. We use a function instead - // of using directly an AbstractFile instance, in case that we can't obtain an instance at - // initialization time (e.g. if there's an issue with HDFS). 
- this.directorySupplier = directorySupplier; - this.models = Maps.newConcurrentMap(); - this.lastModifiedMsByModel = Maps.newConcurrentMap(); - - this.lastModelLoadedAtMs = statsReceiver.getLongGauge(counterPrefix + "last_loaded"); - this.numModels = statsReceiver.getLongGauge(counterPrefix + "num_models"); - this.numLoads = statsReceiver.getCounter(counterPrefix + "num_loads"); - this.numErrors = statsReceiver.getCounter(counterPrefix + "num_errors"); - } - - public Optional getModel(String name) { - return Optional.fromNullable(models.get(name)); - } - - /** - * Loads the models from the base directory. - * - * It doesn't load a model if its file has not been modified since the last time it was loaded. - * - * This method doesn't delete previously loaded models if their directories are not available. - */ - @Override - public void run() { - try { - AbstractFile baseDirectory = directorySupplier.get(); - List modelDirectories = - Lists.newArrayList(baseDirectory.listFiles(IS_MODEL_DIR)); - for (AbstractFile directory : modelDirectories) { - try { - // Note that the modelName is the directory name, if it ends with ".schema_based", the - // model will be loaded as a schema-based model. - String modelName = directory.getName(); - AbstractFile modelFile = directory.getChild(MODEL_FILE_NAME); - long currentLastModified = modelFile.getLastModified(); - Long lastModified = lastModifiedMsByModel.get(modelName); - if (lastModified == null || lastModified < currentLastModified) { - LightweightLinearModel model = - LightweightLinearModel.load(modelName, featureContext, modelFile); - if (!models.containsKey(modelName)) { - LOG.info("Loading model {}.", modelName); - } - models.put(modelName, model); - lastModifiedMsByModel.put(modelName, currentLastModified); - lastModelLoadedAtMs.set(System.currentTimeMillis()); - numLoads.increment(); - LOG.debug("Model: {}", model); - } else { - LOG.debug("Directory for model {} has not changed.", modelName); - } - } catch (Exception e) { - LOG.error("Error loading model from directory: " + directory.getPath(), e); - this.numErrors.increment(); - } - } - if (numModels.get() != models.size()) { - LOG.info("Finished loading models. Model names: {}", models.keySet()); - } - this.numModels.set(models.size()); - } catch (IOException e) { - LOG.error("Error loading models", e); - this.numErrors.increment(); - } - } - - /** - * Creates an instance that loads models from a directory (local or from HDFS). - */ - public static ModelLoader forDirectory( - final AbstractFile directory, - CompositeFeatureContext featureContext, - String counterPrefix, - SearchStatsReceiver statsReceiver) { - Supplier directorySupplier = Suppliers.ofInstance(directory); - return new ModelLoader(featureContext, directorySupplier, counterPrefix, statsReceiver); - } - - /** - * Creates an instance that loads models from HDFS. 
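A usage sketch under stated assumptions (the name node, base directory, counter prefix, and model name are placeholders; the periodic refresh relies on ModelLoader implementing Runnable, as declared above):

```java
// Hypothetical wiring; values are illustrative only.
ModelLoader loader = ModelLoader.forHdfsDirectory(
    "hdfs-name-node", "/path/to/base-directory", featureContext, "linear_models_", statsReceiver);

// Refresh on a schedule; run() skips directories whose model.tsv has not changed.
Executors.newSingleThreadScheduledExecutor()
    .scheduleAtFixedRate(loader, 0, 5, TimeUnit.MINUTES);

Optional<LightweightLinearModel> model = loader.getModel("one-model");
```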
- */ - public static ModelLoader forHdfsDirectory( - final String nameNode, - final String directory, - CompositeFeatureContext featureContext, - String counterPrefix, - SearchStatsReceiver statsReceiver) { - Supplier directorySupplier = - () -> FileUtils.getHdfsFileHandle(directory, nameNode); - return new ModelLoader(featureContext, directorySupplier, counterPrefix, statsReceiver); - } - - private static final AbstractFile.Filter IS_MODEL_DIR = file -> { - try { - if (file.isDirectory()) { - AbstractFile modelFile = file.getChild(MODEL_FILE_NAME); - return (modelFile != null) && modelFile.canRead(); - } - } catch (IOException e) { - LOG.error("Error reading file: " + file, e); - } - return false; - }; -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/PredictionEngineModelsManager.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/PredictionEngineModelsManager.java deleted file mode 100644 index b2a96fd42..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/PredictionEngineModelsManager.java +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Collections; -import java.util.Map; -import java.util.function.Supplier; - -import com.google.common.base.Preconditions; - -import com.twitter.ml.prediction.core.PredictionEngine; -import com.twitter.ml.prediction.core.PredictionEngineFactory; -import com.twitter.ml.prediction.core.PredictionEngineLoadingException; -import com.twitter.ml.vw.constant.SnapshotConstants; -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.util.ml.models_manager.BaseModelsManager; - -/** - * Loads PredictionEngine models from a model provider (config or fixed directory) - * and keeps them in memory. Can also reload models periodically by querying the - * same model provider source. - */ -public class PredictionEngineModelsManager extends BaseModelsManager { - - PredictionEngineModelsManager( - Supplier> activeModelsSupplier, - boolean shouldUnloadInactiveModels, - String statsPrefix) { - super(activeModelsSupplier, shouldUnloadInactiveModels, statsPrefix); - } - - @Override - public PredictionEngine readModelFromDirectory(AbstractFile modelBaseDir) - throws PredictionEngineLoadingException { - // We need to add the 'hdfs://' prefix, otherwise PredictionEngine will treat it as a - // path in the local filesystem. - PredictionEngine predictionEngine = new PredictionEngineFactory() - .createFromSnapshot( - "hdfs://" + modelBaseDir.getPath(), SnapshotConstants.FIXED_PATH); - - predictionEngine.initialize(); - - return predictionEngine; - } - - /** - * Creates an instance that loads the models specified in a configuration file. - * - * Note that if the configuration file changes and it doesn't include a model that was present - * before, the model will be removed (i.e. it unloads models that are not active anymore). - */ - public static PredictionEngineModelsManager createUsingConfigFile( - AbstractFile configFile, String statsPrefix) { - Preconditions.checkArgument( - configFile.canRead(), "Config file is not readable: %s", configFile.getPath()); - return new PredictionEngineModelsManager(new ConfigSupplier(configFile), true, statsPrefix); - } - - /** - * Creates a no-op instance. It can be used for tests or when the models are disabled. 
- */ - public static PredictionEngineModelsManager createNoOp(String statsPrefix) { - return new PredictionEngineModelsManager(Collections::emptyMap, false, statsPrefix) { - @Override - public void run() { } - }; - } - -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedModelBuilder.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedModelBuilder.java deleted file mode 100644 index 3b8483f1c..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedModelBuilder.java +++ /dev/null @@ -1,105 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Map; -import java.util.stream.Collectors; - -import com.google.common.collect.HashMultimap; -import com.google.common.collect.Maps; -import com.google.common.collect.Multimap; - -import com.twitter.ml.api.FeatureParser; -import com.twitter.ml.api.transform.DiscretizerTransform; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaEntry; - -/** - * Builds a model with schema-based features, here all features are tracked by Id. - * This class is very similar to LegacyModelBuilder, which will eventually be deprecated. - */ -public class SchemaBasedModelBuilder extends BaseModelBuilder { - private final Map featuresByName; - private final Map binaryFeatures; - private final Map continuousFeatures; - private final Multimap discretizedFeatureRanges; - - /** - * a class to hold feature information - */ - static class FeatureData { - private final ThriftSearchFeatureSchemaEntry entry; - private final int id; - - public FeatureData(ThriftSearchFeatureSchemaEntry entry, int id) { - this.entry = entry; - this.id = id; - } - } - - SchemaBasedModelBuilder(String modelName, ThriftSearchFeatureSchema featureSchema) { - super(modelName); - featuresByName = getFeatureDataMap(featureSchema); - binaryFeatures = Maps.newHashMap(); - continuousFeatures = Maps.newHashMap(); - discretizedFeatureRanges = HashMultimap.create(); - } - - /** - * Creates a map from feature name to thrift entries - */ - private static Map getFeatureDataMap( - ThriftSearchFeatureSchema schema) { - return schema.getEntries().entrySet().stream() - .collect(Collectors.toMap( - e -> e.getValue().getFeatureName(), - e -> new FeatureData(e.getValue(), e.getKey()) - )); - } - - @Override - protected void addFeature(String baseName, double weight, FeatureParser parser) { - FeatureData feature = featuresByName.get(baseName); - if (feature != null) { - switch (feature.entry.getFeatureType()) { - case BOOLEAN_VALUE: - binaryFeatures.put(feature.id, weight); - break; - case INT32_VALUE: - case LONG_VALUE: - case DOUBLE_VALUE: - continuousFeatures.put(feature.id, weight); - break; - default: - // other values are not supported yet - throw new IllegalArgumentException( - String.format("Unsupported feature type: %s", feature)); - } - } else if (baseName.endsWith(DISCRETIZER_NAME_SUFFIX) - && parser.getExtension().containsKey(DiscretizerTransform.DEFAULT_RANGE_EXT)) { - - String featureName = - baseName.substring(0, baseName.length() - DISCRETIZER_NAME_SUFFIX.length()); - - feature = featuresByName.get(featureName); - if (feature == null) { - return; - } - - String rangeSpec = parser.getExtension().get(DiscretizerTransform.DEFAULT_RANGE_EXT); - discretizedFeatureRanges.put(feature.id, new DiscretizedFeatureRange(weight, rangeSpec)); - } - } - - @Override - public LightweightLinearModel 
build() { - Map discretizedFeatures = Maps.newHashMap(); - for (Integer feature : discretizedFeatureRanges.keySet()) { - DiscretizedFeature discretizedFeature = - BaseModelBuilder.buildFeature(discretizedFeatureRanges.get(feature)); - if (!discretizedFeature.allValuesBelowThreshold(MIN_WEIGHT)) { - discretizedFeatures.put(feature, discretizedFeature); - } - } - return LightweightLinearModel.createForSchemaBased( - modelName, bias, binaryFeatures, continuousFeatures, discretizedFeatures); - } -} diff --git a/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedScoreAccumulator.java b/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedScoreAccumulator.java deleted file mode 100644 index 68742211f..000000000 --- a/src/java/com/twitter/search/common/util/ml/prediction_engine/SchemaBasedScoreAccumulator.java +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.search.common.util.ml.prediction_engine; - -import java.util.Map; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.modeling.common.TweetFeaturesUtils; - -/** - * Score accumulator for schema-based features. - */ -public class SchemaBasedScoreAccumulator extends BaseScoreAccumulator { - - public SchemaBasedScoreAccumulator(LightweightLinearModel model) { - super(model); - Preconditions.checkState(model.isSchemaBased(), - "Cannot create SchemaBasedScoreAccumulator with a non-schema-based model: %s", - model.getName()); - } - - @Override - protected final void updateScoreWithFeatures(ThriftSearchResultFeatures featureData) { - // go through all features available and apply all those available in the model - addSchemaBooleanFeatures(featureData.getBoolValues()); - addSchemaContinuousFeatures(featureData.getIntValues()); - addSchemaContinuousFeatures(featureData.getLongValues()); - addSchemaContinuousFeatures(featureData.getDoubleValues()); - } - - private void addSchemaBooleanFeatures(Map booleanMap) { - if (booleanMap == null || booleanMap.isEmpty()) { - return; - } - for (Map.Entry entry : booleanMap.entrySet()) { - if (entry.getValue()) { - score += model.binaryFeaturesById.getOrDefault(entry.getKey(), 0.0); - } - } - } - - private void addSchemaContinuousFeatures(Map valueMap) { - if (valueMap == null || valueMap.isEmpty()) { - return; - } - for (Map.Entry entry : valueMap.entrySet()) { - Integer id = entry.getKey(); - if (TweetFeaturesUtils.isFeatureDiscrete(id)) { - continue; // we don't process any discrete features now - } - Double weight = model.continuousFeaturesById.get(id); - if (weight != null) { - // found non-discretized entry - score += weight * entry.getValue().doubleValue(); - } else { - DiscretizedFeature discretizedFeature = model.discretizedFeaturesById.get(id); - if (discretizedFeature != null) { - // Use only the weight of the discretized feature (there's no need to multiply it) - score += discretizedFeature.getWeight(entry.getValue().doubleValue()); - } - } - } - } -} diff --git a/src/java/com/twitter/search/common/util/ml/tensorflow_engine/BUILD b/src/java/com/twitter/search/common/util/ml/tensorflow_engine/BUILD deleted file mode 100644 index 56923850e..000000000 --- a/src/java/com/twitter/search/common/util/ml/tensorflow_engine/BUILD +++ /dev/null @@ -1,21 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/slf4j:slf4j-api", - 
"3rdparty/jvm/org/tensorflow", - "finatra/inject/inject-slf4j/src/main/scala/com/twitter/inject", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util/ml/models_manager", - "src/thrift/com/twitter/search/common:features-java", - "tensorflow/tfcompute-java/src/main/java/com/twitter/tfcompute_java", - "twml/runtime/src/main/scala/com/twitter/twml/runtime/lib", - "twml/runtime/src/main/scala/com/twitter/twml/runtime/models", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/common/util/ml/tensorflow_engine/TensorflowModelsManager.java b/src/java/com/twitter/search/common/util/ml/tensorflow_engine/TensorflowModelsManager.java deleted file mode 100644 index 3028a3395..000000000 --- a/src/java/com/twitter/search/common/util/ml/tensorflow_engine/TensorflowModelsManager.java +++ /dev/null @@ -1,189 +0,0 @@ -package com.twitter.search.common.util.ml.tensorflow_engine; - -import java.io.IOException; -import java.util.Collections; -import java.util.HashMap; -import java.util.Map; -import java.util.function.Supplier; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.tensorflow.SavedModelBundle; -import org.tensorflow.Session; - -import com.twitter.ml.api.FeatureUtil; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaEntry; -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.util.ml.models_manager.BaseModelsManager; -import com.twitter.tfcompute_java.TFModelRunner; -import com.twitter.tfcompute_java.TFSessionInit; -import com.twitter.twml.runtime.lib.TwmlLoader; -import com.twitter.twml.runtime.models.ModelLocator; -import com.twitter.twml.runtime.models.ModelLocator$; -import com.twitter.util.Await; - -/** - * TensorflowModelsManager manages the lifecyle of TF models. - */ -public class TensorflowModelsManager extends BaseModelsManager { - - private static final Logger LOG = LoggerFactory.getLogger(TensorflowModelsManager.class); - - private static final String[] TF_TAGS = new String[] {"serve"}; - - private volatile Map featureSchemaIdToMlApiId = new HashMap(); - - static { - TwmlLoader.load(); - } - - public static final TensorflowModelsManager NO_OP_MANAGER = - createNoOp("no_op_manager"); - - public TensorflowModelsManager( - Supplier> activeModelsSupplier, - boolean shouldUnloadInactiveModels, - String statsPrefix - ) { - this( - activeModelsSupplier, - shouldUnloadInactiveModels, - statsPrefix, - () -> true, - () -> true, - null - ); - } - - public TensorflowModelsManager( - Supplier> activeModelsSupplier, - boolean shouldUnloadInactiveModels, - String statsPrefix, - Supplier serveModels, - Supplier loadModels, - DynamicSchema dynamicSchema - ) { - super( - activeModelsSupplier, - shouldUnloadInactiveModels, - statsPrefix, - serveModels, - loadModels - ); - if (dynamicSchema != null) { - updateFeatureSchemaIdToMlIdMap(dynamicSchema.getSearchFeatureSchema()); - } - } - - /** - * The ML API feature ids for tensorflow scoring are hashes of their feature names. This hashing - * could be expensive to do for every search request. Instead, allow the map from schema feature - * id to ML API id to be updated whenever the schema is reloaded. 
- */ - public void updateFeatureSchemaIdToMlIdMap(ThriftSearchFeatureSchema schema) { - HashMap newFeatureSchemaIdToMlApiId = new HashMap(); - Map featureEntries = schema.getEntries(); - for (Map.Entry entry : featureEntries.entrySet()) { - long mlApiFeatureId = FeatureUtil.featureIdForName(entry.getValue().getFeatureName()); - newFeatureSchemaIdToMlApiId.put(entry.getKey(), mlApiFeatureId); - } - - featureSchemaIdToMlApiId = newFeatureSchemaIdToMlApiId; - } - - public Map getFeatureSchemaIdToMlApiId() { - return featureSchemaIdToMlApiId; - } - - /** - * If the manager is not enabled, it won't fetch TF models. - */ - public boolean isEnabled() { - return true; - } - - /** - * Load an individual model and make it available for inference. - */ - public TFModelRunner readModelFromDirectory( - AbstractFile modelDir) throws IOException { - - ModelLocator modelLocator = - ModelLocator$.MODULE$.apply( - modelDir.toString(), - modelDir.toURI() - ); - - try { - Await.result(modelLocator.ensureLocalPresent(true)); - } catch (Exception e) { - LOG.error("Couldn't find model " + modelDir.toString(), e); - throw new IOException("Couldn't find model " + modelDir.toString()); - } - - Session session = SavedModelBundle.load(modelLocator.localPath(), TF_TAGS).session(); - - return new TFModelRunner(session); - } - - - /** - * Initialize Tensorflow intra and inter op thread pools. - * See `ConfigProto.[intra|inter]_op_parallelism_threads` documentation for more information: - * https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto - * Initialization should happen only once. - * Default values for Tensorflow are: - * intraOpParallelismThreads = 0 which means that TF will pick an appropriate default. - * interOpParallelismThreads = 0 which means that TF will pick an appropriate default. - * operation_timeout_in_ms = 0 which means that no timeout will be applied. - */ - public static void initTensorflowThreadPools( - int intraOpParallelismThreads, - int interOpParallelismThreads) { - new TFSessionInit(intraOpParallelismThreads, interOpParallelismThreads, 0); - } - - /** - * Creates a no-op instance. It can be used for tests or when the models are disabled. - */ - public static TensorflowModelsManager createNoOp(String statsPrefix) { - return new TensorflowModelsManager(Collections::emptyMap, false, statsPrefix) { - @Override - public void run() { } - - @Override - public boolean isEnabled() { - return false; - } - - @Override - public void updateFeatureSchemaIdToMlIdMap(ThriftSearchFeatureSchema schema) { } - }; - } - - /** - * Creates an instance that loads the models based on a ConfigSupplier. 
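A wiring sketch, assuming the thread pools are sized once at startup before any model is loaded (the pool sizes, config file, and stats prefix are placeholders):

```java
// Hypothetical startup sequence; values are illustrative only.
TensorflowModelsManager.initTensorflowThreadPools(/* intraOp */ 2, /* interOp */ 2);

TensorflowModelsManager manager = TensorflowModelsManager.createUsingConfigFile(
    configFile,      // AbstractFile pointing at the model config
    true,            // unload models that disappear from the config
    "tf_models_",
    () -> true,      // serveModels
    () -> true,      // loadModels
    dynamicSchema);  // DynamicSchema used to seed the schema-id -> ML-API-id map (may be null)
```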
- */ - public static TensorflowModelsManager createUsingConfigFile( - AbstractFile configFile, - boolean shouldUnloadInactiveModels, - String statsPrefix, - Supplier serveModels, - Supplier loadModels, - DynamicSchema dynamicSchema) { - Preconditions.checkArgument( - configFile.canRead(), "Config file is not readable: %s", configFile.getPath()); - return new TensorflowModelsManager( - new ConfigSupplier(configFile), - shouldUnloadInactiveModels, - statsPrefix, - serveModels, - loadModels, - dynamicSchema - ); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/BUILD b/src/java/com/twitter/search/core/earlybird/BUILD deleted file mode 100644 index a8432dfe0..000000000 --- a/src/java/com/twitter/search/core/earlybird/BUILD +++ /dev/null @@ -1,38 +0,0 @@ -java_library( - sources = ["**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/facets", - "src/java/com/twitter/search/common/hashtable", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/search", - "src/java/com/twitter/search/common/util:log_format_util", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/hash", - "src/java/com/twitter/search/common/util/io:flushable", - "src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:facets-java", - "src/thrift/com/twitter/search/common:schema-java", - ], -) diff --git a/src/java/com/twitter/search/core/earlybird/README.md b/src/java/com/twitter/search/core/earlybird/README.md deleted file mode 100644 index 337d327fb..000000000 --- a/src/java/com/twitter/search/core/earlybird/README.md +++ /dev/null @@ -1,21 +0,0 @@ -# Search Index (Earlybird) core classes - -> **TL;DR** Earlybird (Search Index) find tweets from people you follow, rank them, and serve tweets to Home. - -## What is Earlybird (Search Index) - -[Earlybird](http://notes.stephenholiday.com/Earlybird.pdf) is a **real-time search system** based on [Apache Lucene](https://lucene.apache.org/) to support the high volume of queries and content updates. The major use cases are Relevance Search (specifically, Text search) and Timeline In-network Tweet retrieval (or UserID based search). It is designed to enable the efficient indexing and querying of billions of tweets, and to provide low-latency search results, even with heavy query loads. 
- -## Directory Structure -The project consists of several packages and files, which can be summarized as follows: - - -* `facets/`: This subdirectory contains classes responsible for facet counting and processing. Some key classes include EarlybirdFacets, EarlybirdFacetsFactory, FacetAccumulator, and FacetCountAggregator. The classes handle facet counting, facet iterators, facet label providers, and facet response rewriting. -* `index/`: This directory contains the indexing and search infra files, with several subdirectories for specific components. - * `column/`: This subdirectory contains classes related to column-stride field indexes, including ColumnStrideByteIndex, ColumnStrideIntIndex, ColumnStrideLongIndex, and various optimized versions of these indexes. These classes deal with managing and updating doc values. - * `extensions/`: This subdirectory contains classes for index extensions, including EarlybirdIndexExtensionsData, EarlybirdIndexExtensionsFactory, and EarlybirdRealtimeIndexExtensionsData. - * `inverted/`: This subdirectory focuses on the inverted index and its components, such as InMemoryFields, IndexOptimizer, InvertedIndex, and InvertedRealtimeIndex. It also contains classes for managing and processing posting lists and term dictionaries, like EarlybirdPostingsEnum, FSTTermDictionary, and MPHTermDictionary. - * `util/`: This subdirectory contains utility classes for managing search iterators and filters, such as AllDocsIterator, RangeDISI, RangeFilterDISI, and SearchSortUtils. The system appears to be designed to handle search indexing and facet counting efficiently. Key components include an inverted index, various types of posting lists, and term dictionaries. Facet counting and processing is handled by specialized classes within the facets subdirectory. The overall structure indicates a well-organized and modular search indexing system that can be maintained and extended as needed. - -## Related Services -* The Earlybirds main classes. See `src/java/com/twitter/search/earlybird/` diff --git a/src/java/com/twitter/search/core/earlybird/facets/AbstractFacetCountingArray.java b/src/java/com/twitter/search/core/earlybird/facets/AbstractFacetCountingArray.java deleted file mode 100644 index c587c5470..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/AbstractFacetCountingArray.java +++ /dev/null @@ -1,231 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -/** - * AbstractFacetCountingArray implements a lookup from a doc ID to an unordered list of facets. - * A facet is a pair of (term ID, field ID), which could represent, - * for example ("http://twitter.com", "links"). - * - * Internally, we have two data structures: A map from doc ID to an int and a pool of ints. We refer - * to the values contained in these structures as packed values. A packed value can either be a - * pointer to a location in the pool, an encoded facet or a sentinel value. Pointers always have - * their high bit set to 1. 
- * - * If a document has just one facet, we will store the encoded facet in the map, and nothing in the - * pool. Otherwise, the map will contain a pointer into the int pool. - * - * The int pool is encoded in a block-allocated linked list. - * See {@link AbstractFacetCountingArray#collectForDocId} for details on how to traverse the list. - */ -public abstract class AbstractFacetCountingArray implements Flushable { - private static final Logger LOG = LoggerFactory.getLogger(AbstractFacetCountingArray.class); - - private static final FacetCountIterator EMPTY_ITERATOR = new FacetCountIterator() { - @Override - public void collect(int docID) { - // noop - } - }; - - public static final AbstractFacetCountingArray EMPTY_ARRAY = new AbstractFacetCountingArray() { - @Override - public final FacetCountIterator getIterator(EarlybirdIndexSegmentAtomicReader reader, - FacetCountState countState, - FacetCountIteratorFactory iteratorFactory) { - return EMPTY_ITERATOR; - } - - @Override - public final int getFacet(int docID) { - return UNASSIGNED; - } - - @Override - public final void setFacet(int docID, int facetID) { - } - - @Override - public final AbstractFacetCountingArray rewriteAndMapIDs( - Map termIDMapper, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) { - return this; - } - - @Override - public Handler getFlushHandler() { - return null; - } - }; - - protected class ArrayFacetCountIterator extends FacetCountIterator { - @Override - public void collect(int docID) { - collectForDocId(docID, this); - } - } - - private static final int NUM_BITS_TERM_ID = 27; - private static final int TERM_ID_MASK = (1 << NUM_BITS_TERM_ID) - 1; - - private static final int NUM_BITS_FIELD_ID = 4; - private static final int FIELD_ID_MASK = (1 << NUM_BITS_FIELD_ID) - 1; - - private static final int HIGHEST_ORDER_BIT = Integer.MIN_VALUE; // 1L << 31 - private static final int HIGHEST_ORDER_BIT_INVERSE_MASK = HIGHEST_ORDER_BIT - 1; - - protected static final int UNASSIGNED = Integer.MAX_VALUE; - - protected static final int decodeTermID(int facetID) { - if (facetID != UNASSIGNED) { - int termID = facetID & TERM_ID_MASK; - return termID; - } - - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - protected static final int decodeFieldID(int facetID) { - return (facetID >>> NUM_BITS_TERM_ID) & FIELD_ID_MASK; - } - - protected static final int encodeFacetID(int fieldID, int termID) { - return ((fieldID & FIELD_ID_MASK) << NUM_BITS_TERM_ID) | (termID & TERM_ID_MASK); - } - - protected static final int decodePointer(int value) { - return value & HIGHEST_ORDER_BIT_INVERSE_MASK; - } - - protected static final int encodePointer(int value) { - return value | HIGHEST_ORDER_BIT; - } - - protected static final boolean isPointer(int value) { - return (value & HIGHEST_ORDER_BIT) != 0; - } - - private final IntBlockPool facetsPool; - - protected AbstractFacetCountingArray() { - facetsPool = new IntBlockPool("facets"); - } - - protected AbstractFacetCountingArray(IntBlockPool facetsPool) { - this.facetsPool = facetsPool; - } - - /** - * Returns an iterator to iterate all docs/facets stored in this FacetCountingArray. 
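Before the iterator plumbing, a worked illustration of the packed-value layout described in the class comment above (the ids are made up, and the encode/decode helpers are package-level, so this assumes same-package access):

```java
// Low 27 bits: term id; next 4 bits: field id; the high bit marks a pointer into the int pool.
int termId = 12345;
int fieldId = 3;
int packed = encodeFacetID(fieldId, termId);

decodeTermID(packed);   // 12345
decodeFieldID(packed);  // 3
isPointer(packed);      // false: high bit is clear, so this is an encoded facet

int pointer = encodePointer(42);
isPointer(pointer);     // true
decodePointer(pointer); // 42, an offset into facetsPool
```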
- */ - public FacetCountIterator getIterator( - EarlybirdIndexSegmentAtomicReader reader, - FacetCountState countState, - FacetCountIteratorFactory iteratorFactory) { - Preconditions.checkNotNull(countState); - Preconditions.checkNotNull(reader); - - List iterators = new ArrayList<>(); - for (Schema.FieldInfo fieldInfo : countState.getSchema().getCsfFacetFields()) { - if (countState.isCountField(fieldInfo)) { - // Rather than rely on the normal facet counting array, we read from a column stride - // field using a custom implementation of FacetCountIterator. - // This optimization is due to two factors: - // 1) for the from_user_id_csf facet, every document has a from user id, - // but many documents contain no other facets. - // 2) we require from_user_id and shared_status_id to be in a column stride field - // for other uses. - try { - iterators.add(iteratorFactory.getFacetCountIterator(reader, fieldInfo)); - } catch (IOException e) { - String facetName = fieldInfo.getFieldType().getFacetName(); - LOG.error("Failed to construct iterator for " + facetName + " facet", e); - } - } - } - if (iterators.size() == 0) { - return new ArrayFacetCountIterator(); - } - if (iterators.size() < countState.getNumFieldsToCount()) { - iterators.add(new ArrayFacetCountIterator()); - } - return new CompositeFacetCountIterator(iterators); - } - - /** - * Collects facets of the document with the provided docID. - * See {@link FacetCountingArrayWriter#addFacet} for details on the format of the int pool. - */ - public void collectForDocId(int docID, FacetTermCollector collector) { - int firstValue = getFacet(docID); - if (firstValue == UNASSIGNED) { - return; // no facet - } - if (!isPointer(firstValue)) { - // highest order bit not set, only one facet for this document. - collector.collect(docID, decodeTermID(firstValue), decodeFieldID(firstValue)); - return; - } - - // multiple facets, traverse the linked list to find all of the facets for this document. - int pointer = decodePointer(firstValue); - while (true) { - int packedValue = facetsPool.get(pointer); - // UNASSIGNED is a sentinel value indicating that we have reached the end of the linked list. - if (packedValue == UNASSIGNED) { - return; - } - - if (isPointer(packedValue)) { - // If the packedValue is a pointer, we need to skip over some ints to reach the facets for - // this document. - pointer = decodePointer(packedValue); - } else { - // If the packedValue is not a pointer, it is an encoded facet, and we can simply decrement - // the pointer to collect the next value. - collector.collect(docID, decodeTermID(packedValue), decodeFieldID(packedValue)); - pointer--; - } - } - } - - /** - * This method can return one of three values for each given doc ID: - * - UNASSIGNED, if the document has no facets - * - If the highest-order bit is not set, then the (negated) returned value is the single facet - * for this document. - * - If the highest-order bit is set, then the document has multiple facets, and the returned - * values is a pointer into facetsPool. - */ - protected abstract int getFacet(int docID); - - protected abstract void setFacet(int docID, int facetID); - - /** - * Called during segment optimization to map term ids that have changed as a - * result of the optimization. 
- */ - public abstract AbstractFacetCountingArray rewriteAndMapIDs( - Map termIDMapper, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException; - - IntBlockPool getFacetsPool() { - return facetsPool; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/CSFFacetCountIterator.java b/src/java/com/twitter/search/core/earlybird/facets/CSFFacetCountIterator.java deleted file mode 100644 index efe8cb7be..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/CSFFacetCountIterator.java +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * An iterator that looks up the termID from the appropriate CSF - */ -public class CSFFacetCountIterator extends FacetCountIterator { - private final int fieldID; - private final NumericDocValues numericDocValues; - - /** - * Creates a new iterator for the given facet csf field. - */ - public CSFFacetCountIterator( - EarlybirdIndexSegmentAtomicReader reader, - Schema.FieldInfo facetFieldInfo) throws IOException { - FacetIDMap.FacetField facetField = reader.getFacetIDMap().getFacetField(facetFieldInfo); - Preconditions.checkNotNull(facetField); - this.fieldID = facetField.getFacetId(); - numericDocValues = reader.getNumericDocValues(facetFieldInfo.getName()); - Preconditions.checkNotNull(numericDocValues); - } - - @Override - public void collect(int internalDocID) throws IOException { - if (numericDocValues.advanceExact(internalDocID)) { - long termID = numericDocValues.longValue(); - if (shouldCollect(internalDocID, termID)) { - collect(internalDocID, termID, fieldID); - } - } - } - - /** - * Subclasses should override if they need to restrict the docs or termIDs - * that they collect on. For example, these may need to override if - * 1) Not all docs set this field, so we should not collect on - * the default value of 0 - * 2) The same CSF field means different things (in particular, shared_status_id means - * retweet OR reply parent id) so we need to do some other check to determine if we should - * collect - * - * @return whether we should collect on this doc/termID - */ - protected boolean shouldCollect(int internalDocID, long termID) throws IOException { - return true; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/CompositeFacetCountIterator.java b/src/java/com/twitter/search/core/earlybird/facets/CompositeFacetCountIterator.java deleted file mode 100644 index 4aa6e8748..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/CompositeFacetCountIterator.java +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.Collection; -import java.util.List; - -import com.twitter.common.collections.Pair; - -/** - * Calls multiple FacetCountIterators. Currently this is used for calling the - * default FacetCountingArray iterator and the CSF and retweet iterators - */ -public class CompositeFacetCountIterator extends FacetCountIterator { - private final Collection iterators; - - /** - * Creates a new composite iterator on the provided collection of iterators. 
- */ - public CompositeFacetCountIterator(Collection iterators) { - this.iterators = iterators; - for (FacetCountIterator iterator : iterators) { - iterator.setIncrementData(this.incrementData); - } - } - - @Override - public void collect(int docID) throws IOException { - for (FacetCountIterator iterator : iterators) { - iterator.collect(docID); - } - } - - @Override - protected void addProof(int docID, long termID, int fieldID) { - for (FacetCountIterator iterator : iterators) { - iterator.addProof(docID, termID, fieldID); - } - } - - @Override - public void setProofs(List> proof) { - for (FacetCountIterator iterator : iterators) { - iterator.setProofs(proof); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/DummyFacetAccumulator.java b/src/java/com/twitter/search/core/earlybird/facets/DummyFacetAccumulator.java deleted file mode 100644 index 2395b790e..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/DummyFacetAccumulator.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -/** - * This accumulator does not accumulate the facet counts when {@link #add(long, int, int, int)} - * is called. - */ -public class DummyFacetAccumulator extends FacetAccumulator { - - @Override - public int add(long termID, int scoreIncrement, int penaltyCount, int tweepCred) { - return 0; - } - - @Override - public R getAllFacets() { - return null; - } - - @Override - public R getTopFacets(int n) { - return null; - } - - @Override - public void reset(FacetLabelProvider facetLabelProvider) { - } - -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetDocValueSet.java b/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetDocValueSet.java deleted file mode 100644 index ae7be787a..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetDocValueSet.java +++ /dev/null @@ -1,153 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.util.Map; -import java.util.Map.Entry; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.ReaderUtil; -import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.BytesRefBuilder; - -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -public class EarlybirdFacetDocValueSet extends SortedSetDocValues { - private final AbstractFacetCountingArray countingArray; - private final InvertedIndex[] labelProviders; - private final String[] fieldNames; - private final int[] starts; - private final BytesRefBuilder ordCache; - private int totalTerms; - private int docID = -1; - private int currentFacet = FacetCountingArray.UNASSIGNED; - private int pointer = -1; - private boolean hasMoreOrds = false; - - public static final String FIELD_NAME = FacetsConfig.DEFAULT_INDEX_FIELD_NAME; - - /** - * Creates a new EarlybirdFacetDocValueSet from the provided FacetCountingArray. 
- */ - public EarlybirdFacetDocValueSet(AbstractFacetCountingArray countingArray, - Map labelProviderMap, - FacetIDMap facetIdMap) { - this.countingArray = countingArray; - labelProviders = new InvertedIndex[facetIdMap.getNumberOfFacetFields()]; - fieldNames = new String[facetIdMap.getNumberOfFacetFields()]; - for (Entry entry : labelProviderMap.entrySet()) { - FacetLabelProvider labelProvider = entry.getValue(); - if (labelProvider instanceof InvertedIndex) { - FacetIDMap.FacetField facetField = facetIdMap.getFacetFieldByFacetName(entry.getKey()); - if (facetField != null) { - labelProviders[facetField.getFacetId()] = (InvertedIndex) labelProvider; - fieldNames[facetField.getFacetId()] = entry.getKey(); - } - } - } - - starts = new int[labelProviders.length + 1]; // build starts array - ordCache = new BytesRefBuilder(); - totalTerms = 0; - - for (int i = 0; i < labelProviders.length; ++i) { - if (labelProviders[i] != null) { - starts[i] = totalTerms; - int termCount = labelProviders[i].getNumTerms(); - totalTerms += termCount; - } - } - - // added to so that mapping from ord to index works via ReaderUtil.subIndex - starts[labelProviders.length] = totalTerms; - } - - private long encodeOrd(int fieldId, int termId) { - assert starts[fieldId] + termId < starts[fieldId + 1]; - return starts[fieldId] + termId; - } - - @Override - public long nextOrd() { - if (!hasMoreOrds || currentFacet == FacetCountingArray.UNASSIGNED) { - return SortedSetDocValues.NO_MORE_ORDS; - } - - // only 1 facet val - if (!FacetCountingArray.isPointer(currentFacet)) { - int termId = FacetCountingArray.decodeTermID(currentFacet); - int fieldId = FacetCountingArray.decodeFieldID(currentFacet); - hasMoreOrds = false; - return encodeOrd(fieldId, termId); - } - - // multiple facets, follow the pointer to find all facets in the facetsPool. 
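-    // pointer is reset to -1 in advance(), so on the first nextOrd() call for the current doc
-    // we initialize it from the packed value that was read from the counting array.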
- if (pointer == -1) { - pointer = FacetCountingArray.decodePointer(currentFacet); - } - int facetID = countingArray.getFacetsPool().get(pointer); - int termId = FacetCountingArray.decodeTermID(facetID); - int fieldId = FacetCountingArray.decodeFieldID(facetID); - - hasMoreOrds = FacetCountingArray.isPointer(facetID); - pointer++; - return encodeOrd(fieldId, termId); - } - - @Override - public BytesRef lookupOrd(long ord) { - int idx = ReaderUtil.subIndex((int) ord, this.starts); - if (labelProviders[idx] != null) { - int termID = (int) ord - starts[idx]; - BytesRef term = new BytesRef(); - labelProviders[idx].getTerm(termID, term); - String name = fieldNames[idx]; - String val = FacetsConfig.pathToString(new String[] {name, term.utf8ToString()}); - ordCache.copyChars(val); - } else { - ordCache.copyChars(""); - } - return ordCache.get(); - } - - @Override - public long lookupTerm(BytesRef key) { - throw new UnsupportedOperationException(); - } - - @Override - public long getValueCount() { - return totalTerms; - } - - @Override - public int docID() { - return docID; - } - - @Override - public int nextDoc() { - return ++docID; - } - - @Override - public int advance(int target) { - Preconditions.checkState(target >= docID); - docID = target; - currentFacet = countingArray.getFacet(docID); - pointer = -1; - hasMoreOrds = true; - return docID; - } - - @Override - public boolean advanceExact(int target) { - return advance(target) != FacetCountingArray.UNASSIGNED; - } - - @Override - public long cost() { - return totalTerms; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacets.java b/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacets.java deleted file mode 100644 index 8872d2049..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacets.java +++ /dev/null @@ -1,102 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.List; -import java.util.Map; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.lucene.facet.FacetResult; -import org.apache.lucene.facet.Facets; -import org.apache.lucene.facet.FacetsCollector; -import org.apache.lucene.facet.FacetsCollector.MatchingDocs; -import org.apache.lucene.util.BitDocIdSet; -import org.apache.lucene.util.BitSet; - -import com.twitter.search.common.facets.FacetSearchParam; -import com.twitter.search.common.facets.thriftjava.FacetFieldRequest; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * Lucene accumulator implementation that counts on our facet counting array data structure. - * - */ -public class EarlybirdFacets extends Facets { - - private final AbstractFacetCountingArray countingArray; - private final FacetCountAggregator aggregator; - private final EarlybirdIndexSegmentAtomicReader reader; - private final MatchingDocs matchingDocs; - private final Map resultMapping; - - /** - * Constructs an EarlybirdFacets accumulator. 
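- *
- * The facetsCollector must contain exactly one MatchingDocs entry (i.e. one segment). Counts
- * are computed eagerly here via count() and stored in resultMapping, which getTopChildren
- * reads from.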
- */ - public EarlybirdFacets( - List facetSearchParams, - FacetsCollector facetsCollector, - EarlybirdIndexSegmentAtomicReader reader) throws IOException { - - Preconditions.checkArgument(facetSearchParams != null && !facetSearchParams.isEmpty()); - Preconditions.checkArgument( - facetsCollector != null - && facetsCollector.getMatchingDocs() != null - && facetsCollector.getMatchingDocs().size() == 1); - Preconditions.checkNotNull(reader); - - this.countingArray = reader.getSegmentData().getFacetCountingArray(); - this.reader = reader; - this.aggregator = new FacetCountAggregator(facetSearchParams, - reader.getSegmentData().getSchema(), - reader.getFacetIDMap(), - reader.getSegmentData().getPerFieldMap()); - this.matchingDocs = facetsCollector.getMatchingDocs().get(0); - - this.resultMapping = count(); - } - - private Map count() throws IOException { - Preconditions.checkState(matchingDocs.bits instanceof BitDocIdSet, - "Assuming BitDocIdSet"); - final BitSet bits = ((BitDocIdSet) matchingDocs.bits).bits(); - final int length = bits.length(); - int doc = reader.getSmallestDocID(); - if (doc != -1) { - while (doc < length && (doc = bits.nextSetBit(doc)) != -1) { - countingArray.collectForDocId(doc, aggregator); - doc++; - } - } - return aggregator.getTop(); - } - - @Override - public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException { - FacetFieldRequest facetFieldRequest = new FacetFieldRequest(dim, topN); - if (path.length > 0) { - facetFieldRequest.setPath(Lists.newArrayList(path)); - } - - FacetResult result = resultMapping.get(facetFieldRequest); - - Preconditions.checkNotNull( - result, - "Illegal facet field request: %s, supported requests are: %s", - facetFieldRequest, - resultMapping.keySet()); - - return result; - } - - @Override - public Number getSpecificValue(String dim, String... 
path) { - throw new UnsupportedOperationException("Not supported"); - } - - @Override - public List getAllDims(int topN) throws IOException { - throw new UnsupportedOperationException("Not supported"); - } - -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetsFactory.java b/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetsFactory.java deleted file mode 100644 index a790cd6f7..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/EarlybirdFacetsFactory.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.List; - -import org.apache.lucene.facet.Facets; -import org.apache.lucene.facet.FacetsCollector; - -import com.twitter.search.common.facets.CountFacetSearchParam; -import com.twitter.search.common.facets.FacetSearchParam; -import com.twitter.search.common.facets.FacetsFactory; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * Factory for EarlybirdFacets - */ -public class EarlybirdFacetsFactory implements FacetsFactory { - private final EarlybirdIndexSegmentAtomicReader reader; - - public EarlybirdFacetsFactory(EarlybirdIndexSegmentAtomicReader reader) { - this.reader = reader; - } - - @Override - public Facets create( - List facetSearchParams, - FacetsCollector facetsCollector) throws IOException { - - return new EarlybirdFacets(facetSearchParams, facetsCollector, reader); - } - - @Override - public boolean accept(FacetSearchParam facetSearchParam) { - if (!(facetSearchParam instanceof CountFacetSearchParam) - || (facetSearchParam.getFacetFieldRequest().getPath() != null - && !facetSearchParam.getFacetFieldRequest().getPath().isEmpty())) { - return false; - } - - String field = facetSearchParam.getFacetFieldRequest().getField(); - Schema.FieldInfo facetInfo = reader.getSegmentData().getSchema() - .getFacetFieldByFacetName(field); - - return facetInfo != null - && reader.getSegmentData().getPerFieldMap().containsKey(facetInfo.getName()); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetAccumulator.java b/src/java/com/twitter/search/core/earlybird/facets/FacetAccumulator.java deleted file mode 100644 index e38d3f7c0..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetAccumulator.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - - -/** - * Counts facet occurrences and provides the top items - * at the end. Actual subclass can implement this functionality differently: e.g. by using - * a heap (priority queue) or a hashmap with pruning step. - * The type R represents the facet results, which can e.g. be a thrift class. - */ -public abstract class FacetAccumulator { - /** Called to notify the accumulator that the given termID has occurred in a document - * Returns the current count of the given termID. - */ - public abstract int add(long termID, int scoreIncrement, int penaltyIncrement, int tweepCred); - - /** After hit collection is done this can be called to - * retrieve the items that occurred most often */ - public abstract R getTopFacets(int n); - - /** After hit collection is done this can be called to retrieve all the items accumulated - * (which may not be all that occurred) */ - public abstract R getAllFacets(); - - /** Called to reset a facet accumulator for re-use. 
This is an optimization - * which takes advantage of the fact that these accumulators may allocate - * large hash-tables, and we use one per-segment, which may be as many as 10-20 **/ - public abstract void reset(FacetLabelProvider facetLabelProvider); - - /** Language histogram accumulation and retrieval. They both have no-op default implementations. - */ - public void recordLanguage(int languageId) { } - - public LanguageHistogram getLanguageHistogram() { - return LanguageHistogram.EMPTY_HISTOGRAM; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountAggregator.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountAggregator.java deleted file mode 100644 index 36bf3598e..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountAggregator.java +++ /dev/null @@ -1,93 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.util.List; -import java.util.Map; -import java.util.Map.Entry; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.lucene.facet.FacetResult; - -import com.twitter.search.common.facets.CountFacetSearchParam; -import com.twitter.search.common.facets.FacetSearchParam; -import com.twitter.search.common.facets.thriftjava.FacetFieldRequest; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -/** - * Global facet aggregator across all fields. - * - */ -public class FacetCountAggregator implements FacetTermCollector { - - // keys for the following aggregators are fieldIds - private final Map aggregators; - private final Map facetSearchParamMap; - - /** - * Creates a new facet aggregator. - */ - public FacetCountAggregator( - List facetSearchParams, - Schema schema, - FacetIDMap facetIDMap, - Map labelProviderMap) { - - aggregators = Maps.newHashMap(); - facetSearchParamMap = Maps.newHashMap(); - - // Check params: - for (FacetSearchParam facetSearchParam : facetSearchParams) { - if (!(facetSearchParam instanceof CountFacetSearchParam)) { - throw new IllegalArgumentException( - "this collector only supports CountFacetSearchParam; got " + facetSearchParam); - } - if (facetSearchParam.getFacetFieldRequest().getPath() != null - && !facetSearchParam.getFacetFieldRequest().getPath().isEmpty()) { - throw new IllegalArgumentException( - "this collector dosen't support hierarchical facets: " - + facetSearchParam.getFacetFieldRequest().getPath()); - } - - String field = facetSearchParam.getFacetFieldRequest().getField(); - Schema.FieldInfo facetField = - schema == null ? null : schema.getFacetFieldByFacetName(field); - - if (facetField == null || !labelProviderMap.containsKey(facetField.getName())) { - throw new IllegalStateException("facet field: " + field + " is not defined"); - } - - int fieldId = facetIDMap.getFacetField(facetField).getFacetId(); - Preconditions.checkState(!aggregators.containsKey(fieldId)); - Preconditions.checkState(!facetSearchParamMap.containsKey(fieldId)); - aggregators.put(fieldId, new PerfieldFacetCountAggregator(field, - labelProviderMap.get(facetField.getName()))); - facetSearchParamMap.put(fieldId, facetSearchParam); - } - } - - /** - * Returns the top facets. 
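- *
- * One FacetResult is computed per requested facet field by its per-field aggregator, and the
- * results are keyed by the original FacetFieldRequest.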
- */ - public Map getTop() { - Map map = Maps.newHashMap(); - for (Entry entry : aggregators.entrySet()) { - FacetSearchParam facetSearchParam = facetSearchParamMap.get(entry.getKey()); - map.put(facetSearchParam.getFacetFieldRequest(), entry.getValue().getTop(facetSearchParam)); - } - return map; - } - - @Override - public boolean collect(int docID, long termID, int fieldID) { - PerfieldFacetCountAggregator perfieldAggregator = aggregators.get(fieldID); - if (perfieldAggregator != null) { - perfieldAggregator.collect((int) termID); - return true; - } else { - return false; - } - } - -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountIterator.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountIterator.java deleted file mode 100644 index b70b5c560..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountIterator.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.List; - -import com.twitter.common.collections.Pair; - -/** - * The collect() method is called for every document for which facets shall be counted. - * This iterator then calls the FacetAccumulators for all facets that belong to the - * current document. - */ -public abstract class FacetCountIterator implements FacetTermCollector { - - public static class IncrementData { - public FacetAccumulator[] accumulators; - public int weightedCountIncrement; - public int penaltyIncrement; - public int tweepCred; - public int languageId; - } - - public IncrementData incrementData = new IncrementData(); - - private List> proofs = null; - - void setIncrementData(IncrementData incrementData) { - this.incrementData = incrementData; - } - - public void setProofs(List> proofs) { - this.proofs = proofs; - } - - // interface method that collects a specific term in a specific field for this document. - @Override - public boolean collect(int docID, long termID, int fieldID) { - FacetAccumulator accumulator = incrementData.accumulators[fieldID]; - accumulator.add(termID, incrementData.weightedCountIncrement, incrementData.penaltyIncrement, - incrementData.tweepCred); - accumulator.recordLanguage(incrementData.languageId); - - if (proofs != null) { - addProof(docID, termID, fieldID); - } - return true; - } - - protected void addProof(int docID, long termID, int fieldID) { - proofs.add(new Pair<>(fieldID, termID)); - } - - /** - * Collected facets for the given document. - */ - public abstract void collect(int docID) throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountIteratorFactory.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountIteratorFactory.java deleted file mode 100644 index 91df9649b..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountIteratorFactory.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * A factory for {@link FacetCountIterator}s. - */ -public abstract class FacetCountIteratorFactory { - /** - * For a field that is being faceted on and for which we should use a CSF for facet counting, - * return the iterator we should use for counting. 
- * - * @param reader The reader to use when getting CSF values - * @param fieldInfo The Schema.FieldInfo corresponding to the facet we're counting - * @return An iterator for this field - */ - public abstract FacetCountIterator getFacetCountIterator( - EarlybirdIndexSegmentAtomicReader reader, - Schema.FieldInfo fieldInfo) throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountState.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountState.java deleted file mode 100644 index 920868312..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountState.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.util.HashMap; -import java.util.HashSet; -import java.util.Iterator; -import java.util.Map; -import java.util.Set; - -import com.google.common.collect.Sets; - -import com.twitter.search.common.schema.base.Schema; - -/** - * Maintains internal state during one facet count request. - */ -public final class FacetCountState { - private final Set fieldsToCount = new HashSet<>(); - private final Map> facetfieldResults = - new HashMap<>(); - private final int minNumFacetResults; - private final Schema schema; - - public FacetCountState(Schema schema, int minNumFacetResults) { - this.schema = schema; - this.minNumFacetResults = minNumFacetResults; - } - - /** - * Adds a facet to be counted in this request. - */ - public void addFacet(String facetName, int numResultsRequested) { - facetfieldResults.put(facetName, new FacetFieldResults(facetName, - Math.max(numResultsRequested, minNumFacetResults))); - Schema.FieldInfo field = schema.getFacetFieldByFacetName(facetName); - fieldsToCount.add(field); - } - - public Schema getSchema() { - return schema; - } - - public int getNumFieldsToCount() { - return fieldsToCount.size(); - } - - /** - * Returns whether or not there is a field to be counted for which no skip list is stored - */ - public boolean hasFieldToCountWithoutSkipList() { - for (Schema.FieldInfo facetField: fieldsToCount) { - if (!facetField.getFieldType().isStoreFacetSkiplist()) { - return true; - } - } - return false; - } - - public Set getFacetFieldsToCountWithSkipLists() { - return Sets.filter( - fieldsToCount, - facetField -> facetField.getFieldType().isStoreFacetSkiplist()); - } - - public boolean isCountField(Schema.FieldInfo field) { - return fieldsToCount.contains(field); - } - - public Iterator> getFacetFieldResultsIterator() { - return facetfieldResults.values().iterator(); - } - - public static final class FacetFieldResults { - public final String facetName; - public final int numResultsRequested; - public R results; - public int numResultsFound; - public boolean finished = false; - - private FacetFieldResults(String facetName, int numResultsRequested) { - this.facetName = facetName; - this.numResultsRequested = numResultsRequested; - } - - public boolean isFinished() { - return finished || results != null && numResultsFound >= numResultsRequested; - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArray.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArray.java deleted file mode 100644 index cd6098d22..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArray.java +++ /dev/null @@ -1,156 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.Map; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; 
-import org.slf4j.LoggerFactory; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -public class FacetCountingArray extends AbstractFacetCountingArray { - private static final Logger LOG = LoggerFactory.getLogger(FacetCountingArray.class); - - private final Int2IntOpenHashMap facetsMap; - - /** - * Creates a new, empty FacetCountingArray with the given size. - */ - public FacetCountingArray(int maxSegmentSize) { - super(); - facetsMap = new Int2IntOpenHashMap(maxSegmentSize); - facetsMap.defaultReturnValue(UNASSIGNED); - } - - private FacetCountingArray(Int2IntOpenHashMap facetsMap, IntBlockPool facetsPool) { - super(facetsPool); - this.facetsMap = facetsMap; - } - - @Override - protected int getFacet(int docID) { - return facetsMap.get(docID); - } - - @Override - protected void setFacet(int docID, int facetID) { - facetsMap.put(docID, facetID); - } - - @Override - public AbstractFacetCountingArray rewriteAndMapIDs( - Map termIDMapper, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - Preconditions.checkNotNull(originalTweetIdMapper); - Preconditions.checkNotNull(optimizedTweetIdMapper); - - // We need to rewrite the facet array, because the term ids have to be mapped to the - // key space of the minimum perfect hash function that replaces the hash table. - // We also need to remap tweet IDs to the optimized doc IDs. - int maxDocID = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - AbstractFacetCountingArray newArray = new OptimizedFacetCountingArray(maxDocID + 1); - final FacetCountingArrayWriter writer = new FacetCountingArrayWriter(newArray); - FacetCountIterator iterator = new ArrayFacetCountIterator() { - @Override - public boolean collect(int docID, long termID, int fieldID) { - int[] termIDMap = termIDMapper.get(fieldID); - int mappedTermID; - // If there isn't a map for this term, we are using the original term IDs and can continue - // with that term ID. If there is a term ID map, then we need to use the new term ID, - // because the new index will use an MPH term dictionary with new term IDs. - if (termIDMap == null) { - mappedTermID = (int) termID; - } else if (termID < termIDMap.length) { - mappedTermID = termIDMap[(int) termID]; - } else { - // During segment optimization we might index a new term after the termIDMap is created - // in IndexOptimizer.optimizeInvertedIndexes(). We can safely ignore these terms, as - // they will be re-indexed later. - return false; - } - - try { - long tweetId = originalTweetIdMapper.getTweetID(docID); - int newDocId = optimizedTweetIdMapper.getDocID(tweetId); - Preconditions.checkState(newDocId != DocIDToTweetIDMapper.ID_NOT_FOUND, - "Did not find a mapping in the new tweet ID mapper for doc ID " - + newDocId + ", tweet ID " + tweetId); - - writer.addFacet(newDocId, fieldID, mappedTermID); - } catch (IOException e) { - LOG.error("Caught an unexpected IOException while optimizing facet.", e); - } - - return true; - } - }; - - // We want to iterate the facets in increasing tweet ID order. 
This might not correspond to - // decreasing doc ID order in the original mapper (see OutOfOrderRealtimeTweetIDMapper). - // However, the optimized mapper should be sorted both by tweet IDs and by doc IDs (in reverse - // order). So we need to iterate here over the doc IDs in the optimized mapper, convert them - // to doc IDs in the original mapper, and pass those doc IDs to collect(). - int docId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - long tweetId = optimizedTweetIdMapper.getTweetID(docId); - int originalDocId = originalTweetIdMapper.getDocID(tweetId); - iterator.collect(originalDocId); - docId = optimizedTweetIdMapper.getPreviousDocID(docId); - } - return newArray; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String FACETS_POOL_PROP_NAME = "facetsPool"; - private final int maxSegmentSize; - - public FlushHandler(int maxSegmentSize) { - this.maxSegmentSize = maxSegmentSize; - } - - public FlushHandler(FacetCountingArray objectToFlush) { - super(objectToFlush); - maxSegmentSize = -1; - } - - @Override - public void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - FacetCountingArray array = getObjectToFlush(); - out.writeInt(array.facetsMap.size()); - for (Int2IntOpenHashMap.Entry entry : array.facetsMap.int2IntEntrySet()) { - out.writeInt(entry.getIntKey()); - out.writeInt(entry.getIntValue()); - } - array.getFacetsPool().getFlushHandler().flush( - flushInfo.newSubProperties(FACETS_POOL_PROP_NAME), out); - } - - @Override - public FacetCountingArray doLoad(FlushInfo flushInfo, DataDeserializer in) throws IOException { - int size = in.readInt(); - Int2IntOpenHashMap facetsMap = new Int2IntOpenHashMap(maxSegmentSize); - facetsMap.defaultReturnValue(UNASSIGNED); - for (int i = 0; i < size; i++) { - facetsMap.put(in.readInt(), in.readInt()); - } - IntBlockPool facetsPool = new IntBlockPool.FlushHandler().load( - flushInfo.getSubProperties(FACETS_POOL_PROP_NAME), in); - return new FacetCountingArray(facetsMap, facetsPool); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArrayWriter.java b/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArrayWriter.java deleted file mode 100644 index f02d52bfb..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetCountingArrayWriter.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -public class FacetCountingArrayWriter { - private final AbstractFacetCountingArray facetCountingArray; - private int previousDocID = -1; - - public FacetCountingArrayWriter(AbstractFacetCountingArray array) { - facetCountingArray = array; - } - - /** - * Adds a facet for the given doc, field and term tuple. - * - * The layout of the packedValues in the term pool is: - * - * index |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 | - * value |U |1a|1b|1c|U |2b|2c|P3|1d|1f| - * - * Where U is UNASSIGNED, P+X is a pointer to index X (e.g. P3 means pointer to index 3), - * or a doc ID and facet (e.g. doc ID 1 and facet a would be 1a). 
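- *
- * For example, the layout above is what results from the following sequence of calls (in the
- * same shorthand): addFacet for (doc 1, facet a), (1, b), (1, c), (2, b), (2, c), (1, d),
- * (1, f). Afterwards the counting array entry for doc 1 is P9 and the entry for doc 2 is P6;
- * traversal starts from the most recently added facet and walks backwards through the pool,
- * following pointers across non-contiguous blocks
- * (see AbstractFacetCountingArray#collectForDocId).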
- */ - public void addFacet(int docID, int fieldID, int termID) { - IntBlockPool facetsPool = facetCountingArray.getFacetsPool(); - int packedValue = facetCountingArray.getFacet(docID); - - if (packedValue == AbstractFacetCountingArray.UNASSIGNED) { - // first facet for this doc. - // keep it in the array and don't add it to the map. - facetCountingArray.setFacet(docID, AbstractFacetCountingArray.encodeFacetID(fieldID, termID)); - return; - } - - if (!FacetCountingArray.isPointer(packedValue)) { - // If the packedValue is not a pointer, we know that we have exactly one facet in the index - // for this document, so copy the existing facet into the pool. - facetsPool.add(AbstractFacetCountingArray.UNASSIGNED); - facetsPool.add(packedValue); - } else if (previousDocID != docID) { - // We have seen this document ID in a different document. Store the pointer to the first facet - // for this doc ID in the pool so that we can traverse the linked list. - facetsPool.add(packedValue); - } - - previousDocID = docID; - - // Add the new facet to the end of the FacetCountingArray. - facetsPool.add(AbstractFacetCountingArray.encodeFacetID(fieldID, termID)); - - // Set the facetValue for this document to the pointer to the facet we just added to the array. - int poolPointer = AbstractFacetCountingArray.encodePointer(facetsPool.length() - 1); - facetCountingArray.setFacet(docID, poolPointer); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetIDMap.java b/src/java/com/twitter/search/core/earlybird/facets/FacetIDMap.java deleted file mode 100644 index 4254abd89..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetIDMap.java +++ /dev/null @@ -1,161 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.Arrays; -import java.util.Collection; -import java.util.Map; - -import com.google.common.collect.Maps; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -/** - * Currently a facet is configured by: - * - Index field name: The Lucene field name which stores the indexed terms of this facet - * - Facet name: The name of the facet that the search API specifies to request facet counts. - * - Facet id: An internal id which is used to store the facet forward mapping in the facet counting - * data structures. - * - * This is a multi-map with two different mappings: - * Facet name -> Facet id - * Facet id -> FieldInfo - */ -public final class FacetIDMap implements Flushable { - private final FacetField[] facetIDToFieldMap; - private final Map facetNameToIDMap; - - private FacetIDMap(FacetField[] facetIDToFieldMap) { - this.facetIDToFieldMap = facetIDToFieldMap; - - facetNameToIDMap = Maps.newHashMapWithExpectedSize(facetIDToFieldMap.length); - for (int i = 0; i < facetIDToFieldMap.length; i++) { - facetNameToIDMap.put(facetIDToFieldMap[i].getFacetName(), i); - } - } - - public FacetField getFacetField(Schema.FieldInfo fieldInfo) { - return fieldInfo != null && fieldInfo.getFieldType().isFacetField() - ? getFacetFieldByFacetName(fieldInfo.getFieldType().getFacetName()) : null; - } - - public FacetField getFacetFieldByFacetName(String facetName) { - Integer facetID = facetNameToIDMap.get(facetName); - return facetID != null ? 
facetIDToFieldMap[facetID] : null; - } - - public FacetField getFacetFieldByFacetID(int facetID) { - return facetIDToFieldMap[facetID]; - } - - public Collection getFacetFields() { - return Arrays.asList(facetIDToFieldMap); - } - - public int getNumberOfFacetFields() { - return facetIDToFieldMap.length; - } - - /** - * Builds a new FacetIDMap from the given schema. - */ - public static FacetIDMap build(Schema schema) { - FacetField[] facetIDToFieldMap = new FacetField[schema.getNumFacetFields()]; - - int facetId = 0; - - for (Schema.FieldInfo fieldInfo : schema.getFieldInfos()) { - if (fieldInfo.getFieldType().isFacetField()) { - facetIDToFieldMap[facetId] = new FacetField(facetId, fieldInfo); - facetId++; - } - } - - return new FacetIDMap(facetIDToFieldMap); - } - - public static final class FacetField { - private final int facetId; - private final Schema.FieldInfo fieldInfo; - - private FacetField(int facetId, Schema.FieldInfo fieldInfo) { - this.facetId = facetId; - this.fieldInfo = fieldInfo; - } - - public int getFacetId() { - return facetId; - } - - public Schema.FieldInfo getFieldInfo() { - return fieldInfo; - } - - public String getFacetName() { - return fieldInfo.getFieldType().getFacetName(); - } - - public String getDescription() { - return String.format( - "(FacetField [facetId: %d, fieldInfo: %s])", - getFacetId(), fieldInfo.getDescription()); - } - } - - @SuppressWarnings("unchecked") - @Override - public FacetIDMap.FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NUM_FACET_FIELDS_PROP_NAME = "numFacetFields"; - - private final Schema schema; - - public FlushHandler(Schema schema) { - this.schema = schema; - } - - public FlushHandler(FacetIDMap objectToFlush) { - super(objectToFlush); - // schema only needed here for loading, not for flushing - this.schema = null; - } - - @Override - public void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - FacetIDMap toFlush = getObjectToFlush(); - int[] idMap = new int[toFlush.facetIDToFieldMap.length]; - for (int i = 0; i < toFlush.facetIDToFieldMap.length; i++) { - idMap[i] = toFlush.facetIDToFieldMap[i].getFieldInfo().getFieldId(); - } - out.writeIntArray(idMap); - - flushInfo.addIntProperty(NUM_FACET_FIELDS_PROP_NAME, idMap.length); - } - - - @Override - public FacetIDMap doLoad(FlushInfo flushInfo, DataDeserializer in) throws IOException { - int[] idMap = in.readIntArray(); - if (idMap.length != schema.getNumFacetFields()) { - throw new IOException("Wrong number of facet fields. 
Expected by schema: " - + schema.getNumFacetFields() - + ", but found in serialized segment: " + idMap.length); - } - - FacetField[] facetIDToFieldMap = new FacetField[schema.getNumFacetFields()]; - - for (int i = 0; i < idMap.length; i++) { - int fieldConfigId = idMap[i]; - facetIDToFieldMap[i] = new FacetField(i, schema.getFieldInfo(fieldConfigId)); - } - - return new FacetIDMap(facetIDToFieldMap); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetLabelProvider.java b/src/java/com/twitter/search/core/earlybird/facets/FacetLabelProvider.java deleted file mode 100644 index 8f653e0eb..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetLabelProvider.java +++ /dev/null @@ -1,206 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.hashtable.HashTable; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.analysis.IntTermAttributeImpl; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -/** - * Given a termID this accessor can be used to retrieve the term bytesref and text - * that corresponds to the termID. - */ -public interface FacetLabelProvider { - /** - * Returns a {@link FacetLabelAccessor} for this provider. - */ - FacetLabelAccessor getLabelAccessor(); - - abstract class FacetLabelAccessor { - private int currentTermID = -1; - - protected final BytesRef termRef = new BytesRef(); - protected boolean hasTermPayload = false; - protected final BytesRef termPayload = new BytesRef(); - protected int offensiveCount = 0; - - protected final boolean maybeSeek(long termID) { - if (termID == currentTermID) { - return true; - } - - if (seek(termID)) { - currentTermID = (int) termID; - return true; - } else { - currentTermID = -1; - return false; - } - } - - // Seek to term id provided. Returns true if term found. Should update termRef, - // hasTermPayload, and termPayload as appropriate. - protected abstract boolean seek(long termID); - - public final BytesRef getTermRef(long termID) { - return maybeSeek(termID) ? termRef : null; - } - - public String getTermText(long termID) { - return maybeSeek(termID) ? termRef.utf8ToString() : null; - } - - public final BytesRef getTermPayload(long termID) { - return maybeSeek(termID) && hasTermPayload ? termPayload : null; - } - - public final int getOffensiveCount(long termID) { - return maybeSeek(termID) ? offensiveCount : 0; - } - } - - /** - * Assumes the term is stored as an IntTermAttribute, and uses this to convert - * the term bytesref to an integer string facet label. - */ - class IntTermFacetLabelProvider implements FacetLabelProvider { - private final InvertedIndex invertedIndex; - - public IntTermFacetLabelProvider(InvertedIndex invertedIndex) { - this.invertedIndex = invertedIndex; - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - if (termID != HashTable.EMPTY_SLOT) { - invertedIndex.getTerm((int) termID, termRef); - return true; - } - return false; - } - - @Override - public String getTermText(long termID) { - return maybeSeek(termID) - ? 
Integer.toString(IntTermAttributeImpl.copyBytesRefToInt(termRef)) - : null; - } - }; - } - } - - /** - * Assumes the term is stored as an LongTermAttribute, and uses this to convert - * the term bytesref to an long string facet label. - */ - class LongTermFacetLabelProvider implements FacetLabelProvider { - private final InvertedIndex invertedIndex; - - public LongTermFacetLabelProvider(InvertedIndex invertedIndex) { - this.invertedIndex = invertedIndex; - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - if (termID != HashTable.EMPTY_SLOT) { - invertedIndex.getTerm((int) termID, termRef); - return true; - } - return false; - } - - @Override - public String getTermText(long termID) { - return maybeSeek(termID) - ? Long.toString(LongTermAttributeImpl.copyBytesRefToLong(termRef)) - : null; - } - }; - } - } - - class SortedLongTermFacetLabelProvider implements FacetLabelProvider { - private final InvertedIndex invertedIndex; - - public SortedLongTermFacetLabelProvider(InvertedIndex invertedIndex) { - this.invertedIndex = invertedIndex; - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - if (termID != HashTable.EMPTY_SLOT) { - invertedIndex.getTerm((int) termID, termRef); - return true; - } - return false; - } - - @Override - public String getTermText(long termID) { - return maybeSeek(termID) - ? Long.toString(SortableLongTermAttributeImpl.copyBytesRefToLong(termRef)) - : null; - } - }; - } - } - - class IdentityFacetLabelProvider implements FacetLabelProvider { - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - return true; - } - - @Override - public String getTermText(long termID) { - return Long.toString(termID); - } - }; - } - } - - /** - * The methods on this provider should NOT be called under normal circumstances! - * - * When a facet misses inverted index and does not use CSF, this InaccessibleFacetLabelProvider - * will be used as a dummy provider. Then, unexptectedFacetLabelAccess counter will be - * incremented when this provider is used later. 
- * - * Also see: - * {@link FacetUtil} - */ - class InaccessibleFacetLabelProvider implements FacetLabelProvider { - private final SearchCounter unexptectedFacetLabelAccess; - - public InaccessibleFacetLabelProvider(String fieldName) { - this.unexptectedFacetLabelAccess = - SearchCounter.export("unexpected_facet_label_access_for_field_" + fieldName); - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - unexptectedFacetLabelAccess.increment(); - return false; - } - }; - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetResponseRewriter.java b/src/java/com/twitter/search/core/earlybird/facets/FacetResponseRewriter.java deleted file mode 100644 index 349805d2d..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetResponseRewriter.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import com.twitter.search.common.facets.thriftjava.FacetResponse; - -/** - * Rewrite facet responses - */ -public interface FacetResponseRewriter { - /** - * Do the response rewrite - * - * @param facetResponse the response before the rewriting - * @return the rewrited response - */ - FacetResponse rewrite(FacetResponse facetResponse); -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetTermCollector.java b/src/java/com/twitter/search/core/earlybird/facets/FacetTermCollector.java deleted file mode 100644 index 668b079d3..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetTermCollector.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -/** - * An interface for collecting all facets in an document. - */ -public interface FacetTermCollector { - /** - * Collect one facet term. - * @param docID The docID for which the facets are being collected. - * @param termID The termID for this facet item. - * @param fieldID The fieldID for this facet item. - * @return True if anything has actually been collected, false if this has been skipped. - * Currently, this return value is not used. - */ - boolean collect(int docID, long termID, int fieldID); -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/FacetUtil.java b/src/java/com/twitter/search/core/earlybird/facets/FacetUtil.java deleted file mode 100644 index 7105e7728..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/FacetUtil.java +++ /dev/null @@ -1,106 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.util.HashMap; -import java.util.Map; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.thriftjava.ThriftNumericType; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -/** - * A utility class for selecting iterators and label providers - * for facets. - * - */ -public abstract class FacetUtil { - private static final Logger LOG = LoggerFactory.getLogger(FacetUtil.class); - - private FacetUtil() { - // unused - } - - /** - * A utility method for choosing the right facet label provider based on the EarlybirdFieldType. - * Takes in a InvertedIndex since some facet label providers are or depend on the inverted - * index. - * Should never return null. 
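- *
- * The selection below is: if the field has no inverted index and does not use a CSF for facet
- * counting, an InaccessibleFacetLabelProvider is returned; if it uses a CSF, an
- * IdentityFacetLabelProvider; for Twitter-format numeric fields, an int, long or sortable-long
- * term label provider depending on the numeric type; otherwise the inverted index itself is
- * used as the label provider.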
- * - * @param fieldType A FieldType for the facet - * @param invertedField The inverted index associated with the facet. May be null. - * @return A non-null FacetLabelProvider - */ - public static FacetLabelProvider chooseFacetLabelProvider( - EarlybirdFieldType fieldType, - InvertedIndex invertedField) { - Preconditions.checkNotNull(fieldType); - - // In the case neither inverted index existing nor using CSF, - // return FacetLabelProvider.InaccessibleFacetLabelProvider to throw exception - // more meaningfully and explicitly. - if (invertedField == null && !fieldType.isUseCSFForFacetCounting()) { - return new FacetLabelProvider.InaccessibleFacetLabelProvider(fieldType.getFacetName()); - } - - if (fieldType.isUseCSFForFacetCounting()) { - return new FacetLabelProvider.IdentityFacetLabelProvider(); - } - IndexedNumericFieldSettings numericSettings = fieldType.getNumericFieldSettings(); - if (numericSettings != null && numericSettings.isUseTwitterFormat()) { - if (numericSettings.getNumericType() == ThriftNumericType.INT) { - return new FacetLabelProvider.IntTermFacetLabelProvider(invertedField); - } else if (numericSettings.getNumericType() == ThriftNumericType.LONG) { - return numericSettings.isUseSortableEncoding() - ? new FacetLabelProvider.SortedLongTermFacetLabelProvider(invertedField) - : new FacetLabelProvider.LongTermFacetLabelProvider(invertedField); - } else { - Preconditions.checkState(false, - "Should never be reached, indicates incomplete handling of different kinds of facets"); - return null; - } - } else { - return invertedField; - } - } - - /** - * Get segment-specific facet label providers based on the schema - * and on the fieldToInvertedIndexMapping for the segment. - * These will be used by facet accumulators to get the text of the termIDs - * - * @param schema the schema, for info on fields and facets - * @param fieldToInvertedIndexMapping map of fields to their inverted indices - * @return facet label provider map - */ - public static Map getFacetLabelProviders( - Schema schema, - Map fieldToInvertedIndexMapping) { - - HashMap facetLabelProviderBuilder - = new HashMap<>(); - - for (Schema.FieldInfo fieldInfo : schema.getFacetFields()) { - EarlybirdFieldType fieldType = fieldInfo.getFieldType(); - Preconditions.checkNotNull(fieldType); - String fieldName = fieldInfo.getName(); - String facetName = fieldType.getFacetName(); - InvertedIndex invertedIndex = fieldToInvertedIndexMapping.get(fieldName); - if (invertedIndex == null && !fieldType.isUseCSFForFacetCounting()) { - LOG.warn("No docs in segment had field " + fieldName - + " indexed for facet " + facetName - + " so InaccessibleFacetLabelProvider will be provided." 
- ); - } - facetLabelProviderBuilder.put(facetName, Preconditions.checkNotNull( - chooseFacetLabelProvider(fieldType, invertedIndex))); - } - - return facetLabelProviderBuilder; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/LanguageHistogram.java b/src/java/com/twitter/search/core/earlybird/facets/LanguageHistogram.java deleted file mode 100644 index 213519e4e..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/LanguageHistogram.java +++ /dev/null @@ -1,104 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.util.Arrays; -import java.util.Map; - -import com.google.common.collect.ImmutableMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; - -/** - * A util class to build a language histogram - */ -public class LanguageHistogram { - private static final Logger LOG = LoggerFactory.getLogger(LanguageHistogram.class); - - public static final LanguageHistogram EMPTY_HISTOGRAM = new LanguageHistogram() { - // Let's make this immutable for safety. - @Override public void clear() { - throw new UnsupportedOperationException(); - } - - @Override public void increment(int languageID) { - throw new UnsupportedOperationException(); - } - - @Override public void add(int languageID, int value) { - throw new UnsupportedOperationException(); - } - - @Override public void addAll(LanguageHistogram histogram) { - throw new UnsupportedOperationException(); - } - }; - - private final int[] languageHistogram = new int[ThriftLanguage.values().length]; - - public int[] getLanguageHistogram() { - return languageHistogram; - } - - /** - * Returns this histogram represented as a language->count map. - */ - public Map getLanguageHistogramAsMap() { - ImmutableMap.Builder builder = ImmutableMap.builder(); - for (int i = 0; i < languageHistogram.length; i++) { - // ThriftLanguage.findByValue() might return null, which should fall back to UNKNOWN. - ThriftLanguage lang = ThriftLanguage.findByValue(i); - lang = lang == null ? ThriftLanguage.UNKNOWN : lang; - builder.put(lang, languageHistogram[i]); - } - return builder.build(); - } - - public void clear() { - Arrays.fill(languageHistogram, 0); - } - - public void increment(int languageId) { - if (isValidLanguageId(languageId)) { - languageHistogram[languageId]++; - } - } - - public void increment(ThriftLanguage language) { - increment(language.getValue()); - } - - public void add(int languageId, int value) { - if (isValidLanguageId(languageId)) { - languageHistogram[languageId] += value; - } - } - - public void add(ThriftLanguage language, int value) { - add(language.getValue(), value); - } - - /** - * Adds all entries from the provided histogram to this histogram. - */ - public void addAll(LanguageHistogram histogram) { - if (histogram == EMPTY_HISTOGRAM) { - return; - } - for (int i = 0; i < languageHistogram.length; i++) { - languageHistogram[i] += histogram.languageHistogram[i]; - } - } - - // Check for out of bound languages. If a language is out of bounds, we don't want it - // to cause the entire search to fail. 
- private boolean isValidLanguageId(int languageId) { - if (languageId < languageHistogram.length) { - return true; - } else { - LOG.error("Language id " + languageId + " out of range"); - return false; - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/OptimizedFacetCountingArray.java b/src/java/com/twitter/search/core/earlybird/facets/OptimizedFacetCountingArray.java deleted file mode 100644 index 622ccc69f..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/OptimizedFacetCountingArray.java +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.Arrays; -import java.util.Map; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -public class OptimizedFacetCountingArray extends AbstractFacetCountingArray { - private final int[] facetsMap; - - /** - * Creates a new, empty FacetCountingArray with the given size. - */ - public OptimizedFacetCountingArray(int maxDocIdInclusive) { - super(); - facetsMap = new int[maxDocIdInclusive]; - Arrays.fill(facetsMap, UNASSIGNED); - } - - private OptimizedFacetCountingArray(int[] facetsMap, IntBlockPool facetsPool) { - super(facetsPool); - this.facetsMap = facetsMap; - } - - @Override - protected int getFacet(int docID) { - return facetsMap[docID]; - } - - @Override - protected void setFacet(int docID, int facetID) { - facetsMap[docID] = facetID; - } - - @Override - public AbstractFacetCountingArray rewriteAndMapIDs( - Map termIDMapper, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) { - throw new UnsupportedOperationException( - "OptimizedFacetCountingArray instances should never be rewritten."); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String FACETS_POOL_PROP_NAME = "facetsPool"; - - public FlushHandler() { - } - - public FlushHandler(OptimizedFacetCountingArray objectToFlush) { - super(objectToFlush); - } - - @Override - public void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedFacetCountingArray objectToFlush = getObjectToFlush(); - out.writeIntArray(objectToFlush.facetsMap); - objectToFlush.getFacetsPool().getFlushHandler().flush( - flushInfo.newSubProperties(FACETS_POOL_PROP_NAME), out); - } - - @Override - public OptimizedFacetCountingArray doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int[] facetsMap = in.readIntArray(); - IntBlockPool facetsPool = new IntBlockPool.FlushHandler().load( - flushInfo.getSubProperties(FACETS_POOL_PROP_NAME), in); - return new OptimizedFacetCountingArray(facetsMap, facetsPool); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/PerfieldFacetCountAggregator.java b/src/java/com/twitter/search/core/earlybird/facets/PerfieldFacetCountAggregator.java deleted file mode 100644 index 7da65a031..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/PerfieldFacetCountAggregator.java +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import 
com.google.common.base.Preconditions; - -import org.apache.lucene.facet.FacetResult; -import org.apache.lucene.facet.LabelAndValue; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.PriorityQueue; - -import com.twitter.search.common.facets.FacetSearchParam; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider.FacetLabelAccessor; - -import it.unimi.dsi.fastutil.ints.Int2IntMap.Entry; -import it.unimi.dsi.fastutil.ints.Int2IntMap.FastEntrySet; -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -public class PerfieldFacetCountAggregator { - - private final Int2IntOpenHashMap countMap; - private final FacetLabelAccessor facetLabelAccessor; - private final String name; - - /** - * Creates a new per-field facet aggregator. - */ - public PerfieldFacetCountAggregator(String name, FacetLabelProvider facetLabelProvider) { - this.name = name; - this.countMap = new Int2IntOpenHashMap(); - this.countMap.defaultReturnValue(0); - this.facetLabelAccessor = facetLabelProvider.getLabelAccessor(); - } - - public void collect(int termId) { - countMap.put(termId, countMap.get(termId) + 1); - } - - /** - * Returns the top facets. - */ - public FacetResult getTop(FacetSearchParam facetSearchParam) { - Preconditions.checkArgument( - facetSearchParam != null - && facetSearchParam.getFacetFieldRequest().getField().equals(name) - && (facetSearchParam.getFacetFieldRequest().getPath() == null - || facetSearchParam.getFacetFieldRequest().getPath().isEmpty())); - - PriorityQueue pq = new PriorityQueue( - facetSearchParam.getFacetFieldRequest().getNumResults()) { - - private BytesRef buffer = new BytesRef(); - - @Override - protected boolean lessThan(Entry a, Entry b) { - // first by count desc - int r = Integer.compare(a.getIntValue(), b.getIntValue()); - if (r != 0) { - return r < 0; - } - - // and then by label asc - BytesRef label1 = facetLabelAccessor.getTermRef(a.getIntKey()); - buffer.bytes = label1.bytes; - buffer.offset = label1.offset; - buffer.length = label1.length; - - return buffer.compareTo(facetLabelAccessor.getTermRef(b.getIntKey())) > 0; - } - - }; - - final FastEntrySet entrySet = countMap.int2IntEntrySet(); - - int numValid = 0; - for (Entry entry : entrySet) { - long val = entry.getIntValue(); - if (val > 0) { - numValid++; - pq.insertWithOverflow(entry); - } - } - - int numVals = pq.size(); - LabelAndValue[] labelValues = new LabelAndValue[numVals]; - - // Priority queue pops out "least" element first (that is the root). - // Least in our definition regardless of how we define what that is should be the last element. 
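    // Illustrative (not in the original): with numResults = 3 and counts {a: 5, b: 2, c: 7},
    // the min-queue pops b(2), then a(5), then c(7); filling labelValues from the end
    // therefore yields [c, a, b], i.e. descending by count, ties broken by ascending label.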
- for (int i = labelValues.length - 1; i >= 0; i--) { - Entry entry = pq.pop(); - labelValues[i] = new LabelAndValue( - facetLabelAccessor.getTermText(entry.getIntKey()), - entry.getValue()); - } - - return new FacetResult(name, null, 0, labelValues, numValid); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesFacetsFactory.java b/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesFacetsFactory.java deleted file mode 100644 index 272e3749e..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesFacetsFactory.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import java.io.IOException; -import java.util.List; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.facet.Facets; -import org.apache.lucene.facet.FacetsCollector; -import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts; -import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState; - -import com.twitter.search.common.facets.CountFacetSearchParam; -import com.twitter.search.common.facets.FacetSearchParam; -import com.twitter.search.common.facets.FacetsFactory; - -/** - * Factory for SortedSetDocValuesFacetCounts - */ -public class SortedSetDocValuesFacetsFactory implements FacetsFactory { - private final SortedSetDocValuesReaderState state; - - public SortedSetDocValuesFacetsFactory(SortedSetDocValuesReaderState state) { - this.state = state; - } - - @Override - public Facets create( - List facetSearchParams, - FacetsCollector facetsCollector) throws IOException { - - Preconditions.checkNotNull(facetsCollector); - - return new SortedSetDocValuesFacetCounts(state, facetsCollector); - } - - @Override - public boolean accept(FacetSearchParam facetSearchParam) { - return facetSearchParam instanceof CountFacetSearchParam - && (facetSearchParam.getFacetFieldRequest().getPath() == null - || facetSearchParam.getFacetFieldRequest().getPath().isEmpty()) - && SortedSetDocValuesReaderStateHelper.isDimSupported( - state, facetSearchParam.getFacetFieldRequest().getField()); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesReaderStateHelper.java b/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesReaderStateHelper.java deleted file mode 100644 index de9c58548..000000000 --- a/src/java/com/twitter/search/core/earlybird/facets/SortedSetDocValuesReaderStateHelper.java +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.search.core.earlybird.facets; - -import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState; - -/** - * We have to check if the facet field (dim called by lucene) is supported or - * not by the SortedSetDocValuesReaderState. The method we have to call is - * private to the lucene package, so we have this helper to do the call for us. 
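 * For example, SortedSetDocValuesFacetsFactory#accept uses isDimSupported() to reject
 * facet requests for fields that the reader state has no ordinal range for.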
- */ -public abstract class SortedSetDocValuesReaderStateHelper { - public static boolean isDimSupported(SortedSetDocValuesReaderState state, String dim) { - return state.getOrdRange(dim) != null; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/DocIDToTweetIDMapper.java b/src/java/com/twitter/search/core/earlybird/index/DocIDToTweetIDMapper.java deleted file mode 100644 index 187213dab..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/DocIDToTweetIDMapper.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; - -/** - * An interface for mapping the doc IDs in our indexes to the corresponding tweet IDs. - */ -public interface DocIDToTweetIDMapper { - /** A constant indicating that a doc ID was not found in the mapper. */ - int ID_NOT_FOUND = -1; - - /** - * Returns the tweet ID corresponding to the given doc ID. - * - * @param docID The doc ID stored in our indexes. - * @return The tweet ID corresponding to the given doc ID. - */ - long getTweetID(int docID); - - /** - * Returns the internal doc ID corresponding to the given tweet ID. Returns ID_NOT_FOUND if the - * given tweet ID cannot be found in the index. - * - * @param tweetID The tweet ID. - * @return The doc ID corresponding to the given tweet ID. - */ - int getDocID(long tweetID) throws IOException; - - /** - * Returns the smallest valid doc ID in this mapper that's strictly higher than the given doc ID. - * If no such doc ID exists, ID_NOT_FOUND is returned. - * - * @param docID The current doc ID. - * @return The smallest valid doc ID in this mapper that's strictly higher than the given doc ID, - * or a negative number, if no such doc ID exists. - */ - int getNextDocID(int docID); - - /** - * Returns the largest valid doc ID in this mapper that's strictly smaller than the given doc ID. - * If no such doc ID exists, ID_NOT_FOUND is returned. - * - * @param docID The current doc ID. - * @return The largest valid doc ID in this mapper that's strictly smaller than the given doc ID, - * or a negative number, if no such doc ID exists. - */ - int getPreviousDocID(int docID); - - /** - * Returns the total number of documents stored in this mapper. - * - * @return The total number of documents stored in this mapper. - */ - int getNumDocs(); - - /** - * Adds a mapping for the given tweet ID. Returns the doc ID assigned to this tweet ID. - * This method does not check if the tweet ID is already present in the mapper. It always assigns - * a new doc ID to the given tweet. - * - * @param tweetID The tweet ID to be added to the mapper. - * @return The doc ID assigned to the given tweet ID, or ID_NOT_FOUND if a doc ID could not be - * assigned to this tweet. - */ - int addMapping(long tweetID); - - /** - * Converts the current DocIDToTweetIDMapper to a DocIDToTweetIDMapper instance with the same - * tweet IDs. The tweet IDs in the original and optimized instances can be mapped to different - * doc IDs. However, we expect doc IDs to be assigned such that tweets created later have smaller - * have smaller doc IDs. - * - * This method should be called when an earlybird segment is being optimized, right before - * flushing it to disk. - * - * @return An optimized DocIDToTweetIDMapper with the same tweet IDs. 
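 * For example (illustrative), if tweet IDs 101, 102 and 103 were added in that order, the
 * optimized mapper may remap them so that 103, the most recently created tweet, receives the
 * smallest doc ID and 101 the largest.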
- */ - DocIDToTweetIDMapper optimize() throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentAtomicReader.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentAtomicReader.java deleted file mode 100644 index 5d960f049..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentAtomicReader.java +++ /dev/null @@ -1,139 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; -import java.util.Map; -import java.util.Set; - -import com.google.common.collect.Sets; - -import org.apache.lucene.index.FieldInfos; -import org.apache.lucene.index.Fields; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.inverted.DeletedDocs; - -/** - * Base class for atomic Earlybird segment readers. - */ -public abstract class EarlybirdIndexSegmentAtomicReader extends LeafReader { - public static final int TERM_NOT_FOUND = -1; - - private final DeletedDocs.View deletesView; - private final EarlybirdIndexSegmentData segmentData; - protected final EarlybirdIndexSegmentData.SyncData syncData; - - private FieldInfos fieldInfos; - - /** - * Creates a new atomic reader for this Earlybird segment. - */ - public EarlybirdIndexSegmentAtomicReader(EarlybirdIndexSegmentData segmentData) { - super(); - this.segmentData = segmentData; - this.syncData = segmentData.getSyncData(); - this.deletesView = segmentData.getDeletedDocs().getView(); - // fieldInfos will be initialized lazily if required - this.fieldInfos = null; - } - - public int getSmallestDocID() { - return syncData.getSmallestDocID(); - } - - public final FacetIDMap getFacetIDMap() { - return segmentData.getFacetIDMap(); - } - - public final Map getFacetLabelProviders() { - return segmentData.getFacetLabelProviders(); - } - - public AbstractFacetCountingArray getFacetCountingArray() { - return segmentData.getFacetCountingArray(); - } - - public final FacetLabelProvider getFacetLabelProviders(Schema.FieldInfo field) { - String facetName = field.getFieldType().getFacetName(); - return facetName != null && segmentData.getFacetLabelProviders() != null - ? segmentData.getFacetLabelProviders().get(facetName) : null; - } - - @Override - public FieldInfos getFieldInfos() { - if (fieldInfos == null) { - // TwitterInMemoryIndexReader is constructed per query, and this call is only needed for - // optimize. We wouldn't want to create a new FieldInfos per search, so we deffer it. 
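      // The field set below is what this segment actually contains: every field with an
      // inverted index plus every field managed by the DocValuesManager; the schema's
      // Lucene FieldInfos are then filtered down to that set.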
- Schema schema = segmentData.getSchema(); - final Set fieldSet = Sets.newHashSet(segmentData.getPerFieldMap().keySet()); - fieldSet.addAll(segmentData.getDocValuesManager().getDocValueNames()); - fieldInfos = schema.getLuceneFieldInfos(input -> input != null && fieldSet.contains(input)); - } - return fieldInfos; - } - - /** - * Returns the ID that was assigned to the given term in - * {@link com.twitter.search.core.earlybird.index.inverted.InvertedRealtimeIndex} - */ - public abstract int getTermID(Term t) throws IOException; - - /** - * Returns the oldest posting for the given term - * NOTE: This method may return a deleted doc id. - */ - public abstract int getOldestDocID(Term t) throws IOException; - - @Override - public abstract NumericDocValues getNumericDocValues(String field) throws IOException; - - /** - * Determines if this reader has any documents to traverse. Note that it is possible for the tweet - * ID mapper to have documents, but for this reader to not see them yet. In this case, this method - * will return false. - */ - public boolean hasDocs() { - return segmentData.numDocs() > 0; - } - - /** - * Returns the newest posting for the given term - */ - public final int getNewestDocID(Term term) throws IOException { - PostingsEnum td = postings(term); - if (td == null) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - if (td.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { - return td.docID(); - } else { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - } - - public final DeletedDocs.View getDeletesView() { - return deletesView; - } - - @Override - public final Fields getTermVectors(int docID) { - // Earlybird does not use term vectors. - return null; - } - - public EarlybirdIndexSegmentData getSegmentData() { - return segmentData; - } - - public Schema getSchema() { - return segmentData.getSchema(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentData.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentData.java deleted file mode 100644 index fefb1b4d1..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentData.java +++ /dev/null @@ -1,474 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashMap; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.DirectoryReader; -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.store.Directory; - -import com.twitter.common.collections.Pair; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.facets.FacetCountingArrayWriter; -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import 
com.twitter.search.core.earlybird.index.column.ColumnStrideByteIndex; -import com.twitter.search.core.earlybird.index.column.DocValuesManager; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.core.earlybird.index.inverted.DeletedDocs; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; -import com.twitter.search.core.earlybird.index.inverted.InvertedRealtimeIndex; -import com.twitter.search.core.earlybird.index.inverted.OptimizedMemoryIndex; -import com.twitter.search.core.earlybird.index.inverted.TermPointerEncoding; - -/** - * Base class that references data structures belonging to an Earlybird segment. - */ -public abstract class EarlybirdIndexSegmentData implements Flushable { - /** - * This class has a map which contains a snapshot of max published pointers, to distinguish the - * documents in the skip lists that are fully indexed, and safe to return to searchers and those - * that are in progress and should not be returned to searchers. See - * "Earlybird Indexing Latency Design Document" - * for rationale and design. - * - * It also has the smallestDocID, which determines the smallest assigned doc ID in the tweet ID - * mapper that is safe to traverse. - * - * The pointer map and smallestDocID need to be updated atomically. See SEARCH-27650. - */ - public static class SyncData { - private final Map indexPointers; - private final int smallestDocID; - - public SyncData(Map indexPointers, int smallestDocID) { - this.indexPointers = indexPointers; - this.smallestDocID = smallestDocID; - } - - public Map getIndexPointers() { - return indexPointers; - } - - public int getSmallestDocID() { - return smallestDocID; - } - } - - private volatile SyncData syncData; - - private final int maxSegmentSize; - private final long timeSliceID; - - private final ConcurrentHashMap queryCacheMap = - new ConcurrentHashMap<>(); - private final AbstractFacetCountingArray facetCountingArray; - private final boolean isOptimized; - private final ConcurrentHashMap perFieldMap; - private final ConcurrentHashMap normsMap; - - private final Map facetLabelProviders; - private final FacetIDMap facetIDMap; - - private final Schema schema; - private final DocValuesManager docValuesManager; - - private final DeletedDocs deletedDocs; - - private final DocIDToTweetIDMapper docIdToTweetIdMapper; - private final TimeMapper timeMapper; - - static LeafReader getLeafReaderFromOptimizedDirectory(Directory directory) throws IOException { - List leaves = DirectoryReader.open(directory).getContext().leaves(); - int leavesSize = leaves.size(); - Preconditions.checkState(1 == leavesSize, - "Expected one leaf reader in directory %s, but found %s", directory, leavesSize); - return leaves.get(0).reader(); - } - - /** - * Creates a new SegmentData instance using the provided data. 
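 * The initial SyncData snapshot is built here from the supplied smallestDocID and the
 * per-field max published pointers; later updates are done via updateSmallestDocID(),
 * which swaps in a fresh snapshot atomically (see SEARCH-27650).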
- */ - public EarlybirdIndexSegmentData( - int maxSegmentSize, - long timeSliceID, - Schema schema, - boolean isOptimized, - int smallestDocID, - ConcurrentHashMap perFieldMap, - ConcurrentHashMap normsMap, - AbstractFacetCountingArray facetCountingArray, - DocValuesManager docValuesManager, - Map facetLabelProviders, - FacetIDMap facetIDMap, - DeletedDocs deletedDocs, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper) { - this.maxSegmentSize = maxSegmentSize; - this.timeSliceID = timeSliceID; - this.schema = schema; - this.isOptimized = isOptimized; - this.facetCountingArray = facetCountingArray; - this.perFieldMap = perFieldMap; - this.syncData = new SyncData(buildIndexPointers(), smallestDocID); - this.normsMap = normsMap; - this.docValuesManager = docValuesManager; - this.facetLabelProviders = facetLabelProviders; - this.facetIDMap = facetIDMap; - this.deletedDocs = deletedDocs; - this.docIdToTweetIdMapper = docIdToTweetIdMapper; - this.timeMapper = timeMapper; - - Preconditions.checkNotNull(schema); - } - - public final Schema getSchema() { - return schema; - } - - /** - * Returns all {@link EarlybirdIndexExtensionsData} instances contained in this segment. - * Since index extensions are optional, the returned map might be null or empty. - */ - public abstract S getIndexExtensionsData(); - - public DocIDToTweetIDMapper getDocIDToTweetIDMapper() { - return docIdToTweetIdMapper; - } - - public TimeMapper getTimeMapper() { - return timeMapper; - } - - public final DocValuesManager getDocValuesManager() { - return docValuesManager; - } - - public Map getFacetLabelProviders() { - return facetLabelProviders; - } - - public FacetIDMap getFacetIDMap() { - return facetIDMap; - } - - /** - * Returns the QueryCacheResult for the given filter for this segment. - */ - public QueryCacheResultForSegment getQueryCacheResult(String queryCacheFilterName) { - return queryCacheMap.get(queryCacheFilterName); - } - - public long getQueryCachesCardinality() { - return queryCacheMap.values().stream().mapToLong(q -> q.getCardinality()).sum(); - } - - /** - * Get cache cardinality for each query cache. - * @return - */ - public List> getPerQueryCacheCardinality() { - ArrayList> result = new ArrayList<>(); - - queryCacheMap.forEach((cacheName, queryCacheResult) -> { - result.add(Pair.of(cacheName, queryCacheResult.getCardinality())); - }); - return result; - } - - /** - * Updates the QueryCacheResult stored for the given filter for this segment - */ - public QueryCacheResultForSegment updateQueryCacheResult( - String queryCacheFilterName, QueryCacheResultForSegment queryCacheResultForSegment) { - return queryCacheMap.put(queryCacheFilterName, queryCacheResultForSegment); - } - - /** - * Subclasses are allowed to return null here to disable writing to a FacetCountingArray. - */ - public FacetCountingArrayWriter createFacetCountingArrayWriter() { - return getFacetCountingArray() != null - ? 
new FacetCountingArrayWriter(getFacetCountingArray()) : null; - } - - public int getMaxSegmentSize() { - return maxSegmentSize; - } - - public long getTimeSliceID() { - return timeSliceID; - } - - public void updateSmallestDocID(int smallestDocID) { - // Atomic swap - syncData = new SyncData(Collections.unmodifiableMap(buildIndexPointers()), smallestDocID); - } - - private Map buildIndexPointers() { - Map newIndexPointers = new HashMap<>(); - for (InvertedIndex index : perFieldMap.values()) { - if (index.hasMaxPublishedPointer()) { - newIndexPointers.put(index, index.getMaxPublishedPointer()); - } - } - - return newIndexPointers; - } - - public SyncData getSyncData() { - return syncData; - } - - public AbstractFacetCountingArray getFacetCountingArray() { - return facetCountingArray; - } - - public void addField(String fieldName, InvertedIndex field) { - perFieldMap.put(fieldName, field); - } - - public Map getPerFieldMap() { - return Collections.unmodifiableMap(perFieldMap); - } - - public InvertedIndex getFieldIndex(String fieldName) { - return perFieldMap.get(fieldName); - } - - public Map getNormsMap() { - return Collections.unmodifiableMap(normsMap); - } - - public DeletedDocs getDeletedDocs() { - return deletedDocs; - } - - /** - * Returns the norms index for the given field name. - */ - public ColumnStrideByteIndex getNormIndex(String fieldName) { - return normsMap == null ? null : normsMap.get(fieldName); - } - - /** - * Returns the norms index for the given field name, add if not exist. - */ - public ColumnStrideByteIndex createNormIndex(String fieldName) { - if (normsMap == null) { - return null; - } - ColumnStrideByteIndex csf = normsMap.get(fieldName); - if (csf == null) { - csf = new ColumnStrideByteIndex(fieldName, maxSegmentSize); - normsMap.put(fieldName, csf); - } - return csf; - } - - /** - * Flushes this segment to disk. - */ - public void flushSegment(FlushInfo flushInfo, DataSerializer out) throws IOException { - getFlushHandler().flush(flushInfo, out); - } - - public final boolean isOptimized() { - return this.isOptimized; - } - - /** - * Returns a new atomic reader for this segment. - */ - public EarlybirdIndexSegmentAtomicReader createAtomicReader() throws IOException { - EarlybirdIndexSegmentAtomicReader reader = doCreateAtomicReader(); - EarlybirdIndexExtensionsData indexExtension = getIndexExtensionsData(); - if (indexExtension != null) { - indexExtension.setupExtensions(reader); - } - return reader; - } - - /** - * Creates a new atomic reader for this segment. - */ - protected abstract EarlybirdIndexSegmentAtomicReader doCreateAtomicReader() throws IOException; - - /** - * Creates a new segment writer for this segment. - */ - public abstract EarlybirdIndexSegmentWriter createEarlybirdIndexSegmentWriter( - IndexWriterConfig indexWriterConfig) throws IOException; - - public abstract static class AbstractSegmentDataFlushHandler - - extends Flushable.Handler { - protected static final String MAX_SEGMENT_SIZE_PROP_NAME = "maxSegmentSize"; - protected static final String TIME_SLICE_ID_PROP_NAME = "time_slice_id"; - protected static final String SMALLEST_DOCID_PROP_NAME = "smallestDocID"; - protected static final String DOC_ID_MAPPER_SUBPROPS_NAME = "doc_id_mapper"; - protected static final String TIME_MAPPER_SUBPROPS_NAME = "time_mapper"; - public static final String IS_OPTIMIZED_PROP_NAME = "isOptimized"; - - // Abstract methods child classes should implement: - // 1. 
How to additional data structures - protected abstract void flushAdditionalDataStructures( - FlushInfo flushInfo, DataSerializer out, EarlybirdIndexSegmentData toFlush) - throws IOException; - - // 2. Load additional data structures and construct SegmentData. - // Common data structures should be passed into this method to avoid code duplication. - // Subclasses should load additional data structures and construct a SegmentData. - protected abstract EarlybirdIndexSegmentData constructSegmentData( - FlushInfo flushInfo, - ConcurrentHashMap perFieldMap, - int maxSegmentSize, - S indexExtension, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - DataDeserializer in) throws IOException; - - protected abstract S newIndexExtension(); - - protected final Schema schema; - protected final EarlybirdIndexExtensionsFactory indexExtensionsFactory; - private final Flushable.Handler docIdMapperFlushHandler; - private final Flushable.Handler timeMapperFlushHandler; - - public AbstractSegmentDataFlushHandler( - Schema schema, - EarlybirdIndexExtensionsFactory indexExtensionsFactory, - Flushable.Handler docIdMapperFlushHandler, - Flushable.Handler timeMapperFlushHandler) { - super(); - this.schema = schema; - this.indexExtensionsFactory = indexExtensionsFactory; - this.docIdMapperFlushHandler = docIdMapperFlushHandler; - this.timeMapperFlushHandler = timeMapperFlushHandler; - } - - public AbstractSegmentDataFlushHandler(EarlybirdIndexSegmentData objectToFlush) { - super(objectToFlush); - this.schema = objectToFlush.schema; - this.indexExtensionsFactory = null; // factory only needed for loading SegmentData from disk - this.docIdMapperFlushHandler = null; // docIdMapperFlushHandler needed only for loading data - this.timeMapperFlushHandler = null; // timeMapperFlushHandler needed only for loading data - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - EarlybirdIndexSegmentData segmentData = getObjectToFlush(); - - Preconditions.checkState(segmentData.docIdToTweetIdMapper instanceof Flushable); - ((Flushable) segmentData.docIdToTweetIdMapper).getFlushHandler().flush( - flushInfo.newSubProperties(DOC_ID_MAPPER_SUBPROPS_NAME), out); - - if (segmentData.timeMapper != null) { - segmentData.timeMapper.getFlushHandler() - .flush(flushInfo.newSubProperties(TIME_MAPPER_SUBPROPS_NAME), out); - } - - flushInfo.addBooleanProperty(IS_OPTIMIZED_PROP_NAME, segmentData.isOptimized()); - flushInfo.addIntProperty(MAX_SEGMENT_SIZE_PROP_NAME, segmentData.getMaxSegmentSize()); - flushInfo.addLongProperty(TIME_SLICE_ID_PROP_NAME, segmentData.getTimeSliceID()); - flushInfo.addIntProperty(SMALLEST_DOCID_PROP_NAME, - segmentData.getSyncData().getSmallestDocID()); - - flushIndexes(flushInfo, out, segmentData); - - // Flush cluster specific data structures: - // FacetCountingArray, TweetIDMapper, LatLonMapper, and TimeMapper - flushAdditionalDataStructures(flushInfo, out, segmentData); - } - - private void flushIndexes( - FlushInfo flushInfo, - DataSerializer out, - EarlybirdIndexSegmentData segmentData) throws IOException { - Map perFieldMap = segmentData.getPerFieldMap(); - FlushInfo fieldProps = flushInfo.newSubProperties("fields"); - long sizeBeforeFlush = out.length(); - for (Map.Entry entry : perFieldMap.entrySet()) { - String fieldName = entry.getKey(); - entry.getValue().getFlushHandler().flush(fieldProps.newSubProperties(fieldName), out); - } - fieldProps.setSizeInBytes(out.length() - sizeBeforeFlush); - } - - @Override - protected 
EarlybirdIndexSegmentData doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - DocIDToTweetIDMapper docIdToTweetIdMapper = docIdMapperFlushHandler.load( - flushInfo.getSubProperties(DOC_ID_MAPPER_SUBPROPS_NAME), in); - - FlushInfo timeMapperFlushInfo = flushInfo.getSubProperties(TIME_MAPPER_SUBPROPS_NAME); - TimeMapper timeMapper = - timeMapperFlushInfo != null ? timeMapperFlushHandler.load(timeMapperFlushInfo, in) : null; - - final int maxSegmentSize = flushInfo.getIntProperty(MAX_SEGMENT_SIZE_PROP_NAME); - ConcurrentHashMap perFieldMap = loadIndexes(flushInfo, in); - return constructSegmentData( - flushInfo, - perFieldMap, - maxSegmentSize, - newIndexExtension(), - docIdToTweetIdMapper, - timeMapper, - in); - } - - // Move this method into EarlybirdRealtimeIndexSegmentData (careful, - // we may need to increment FlushVersion because EarlybirdLuceneIndexSegmentData - // currently has the 'fields' subproperty in its FlushInfo as well) - private ConcurrentHashMap loadIndexes( - FlushInfo flushInfo, DataDeserializer in) throws IOException { - ConcurrentHashMap perFieldMap = new ConcurrentHashMap<>(); - - FlushInfo fieldProps = flushInfo.getSubProperties("fields"); - Iterator fieldIterator = fieldProps.getKeyIterator(); - while (fieldIterator.hasNext()) { - String fieldName = fieldIterator.next(); - EarlybirdFieldType fieldType = schema.getFieldInfo(fieldName).getFieldType(); - FlushInfo subProp = fieldProps.getSubProperties(fieldName); - boolean isOptimized = subProp.getBooleanProperty( - OptimizedMemoryIndex.FlushHandler.IS_OPTIMIZED_PROP_NAME); - final InvertedIndex invertedIndex; - if (isOptimized) { - if (!fieldType.becomesImmutable()) { - throw new IOException("Tried to load an optimized field that is not immutable: " - + fieldName); - } - invertedIndex = (new OptimizedMemoryIndex.FlushHandler(fieldType)).load(subProp, in); - } else { - invertedIndex = (new InvertedRealtimeIndex.FlushHandler( - fieldType, TermPointerEncoding.DEFAULT_ENCODING)) - .load(subProp, in); - } - perFieldMap.put(fieldName, invertedIndex); - } - return perFieldMap; - } - } - - public int numDocs() { - return docIdToTweetIdMapper.getNumDocs(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentWriter.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentWriter.java deleted file mode 100644 index 697a95bb2..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexSegmentWriter.java +++ /dev/null @@ -1,130 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.Closeable; -import java.io.IOException; - -import org.apache.lucene.document.Document; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.Collector; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorable; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.store.Directory; - -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.column.DocValuesUpdate; - -/** - * IndexSegmentWriter combines some common functionality between the Lucene and Realtime index - * segment writers. 
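 * A typical mutation runs a query against the segment and applies a callback to every
 * matching doc ID, e.g. (field name and term purely illustrative):
 *   writer.deleteDocuments(new TermQuery(new Term("delete_id", "123456")));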
- */ -public abstract class EarlybirdIndexSegmentWriter implements Closeable { - - public EarlybirdIndexSegmentWriter() { - } - - /** - * Gets the segment data this segment write is associated with. - * @return - */ - public abstract EarlybirdIndexSegmentData getSegmentData(); - - /** - * Appends terms from the document to the document matching the query. Does not replace a field or - * document, actually adds to the the field in the segment. - */ - public final void appendOutOfOrder(Query query, Document doc) throws IOException { - runQuery(query, docID -> appendOutOfOrder(doc, docID)); - } - - protected abstract void appendOutOfOrder(Document doc, int docId) throws IOException; - - /** - * Deletes a document in this segment that matches this query. - */ - public void deleteDocuments(Query query) throws IOException { - runQuery(query, docID -> getSegmentData().getDeletedDocs().deleteDoc(docID)); - } - - /** - * Updates the docvalues of a document in this segment that matches this query. - */ - public void updateDocValues(Query query, String field, DocValuesUpdate update) - throws IOException { - runQuery(query, docID -> { - ColumnStrideFieldIndex docValues = - getSegmentData().getDocValuesManager().getColumnStrideFieldIndex(field); - if (docValues == null) { - return; - } - - update.update(docValues, docID); - }); - } - - private void runQuery(final Query query, final OnHit onHit) throws IOException { - try (IndexReader reader = getSegmentData().createAtomicReader()) { - new IndexSearcher(reader).search(query, new Collector() { - @Override - public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { - return new LeafCollector() { - @Override - public void setScorer(Scorable scorer) { - } - - @Override - public void collect(int docID) throws IOException { - onHit.hit(docID); - } - }; - } - - @Override - public ScoreMode scoreMode() { - return ScoreMode.COMPLETE_NO_SCORES; - } - }); - } - } - - private interface OnHit { - void hit(int docID) throws IOException; - } - - /** - * Adds a new document to this segment. In production, this method should be called only by - * Expertsearch. - */ - public abstract void addDocument(Document doc) throws IOException; - - /** - * Adds a new tweet to this segment. This method should be called only by Earlybird. - */ - public abstract void addTweet(Document doc, long tweetId, boolean docIsOffensive) - throws IOException; - - /** - * Returns the total number of documents in the segment. - */ - public abstract int numDocs() throws IOException; - - /** - * Returns the number of documents in this segment without taking deleted docs into account. - * E.g. if 10 documents were added to this segments, and 5 were deleted, - * this method still returns 10. - */ - public abstract int numDocsNoDelete() throws IOException; - - /** - * Forces the underlying index to be merged down to a single segment. - */ - public abstract void forceMerge() throws IOException; - - /** - * Appends the provides Lucene indexes to this segment. - */ - public abstract void addIndexes(Directory... 
dirs) throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexableField.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexableField.java deleted file mode 100644 index 4bc5558f4..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdIndexableField.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import org.apache.lucene.document.Field; -import org.apache.lucene.index.DocValuesType; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; - -public class EarlybirdIndexableField extends Field { - - /** - * Creates a new indexable field with the given name, value and {@link EarlybirdFieldType}. - */ - public EarlybirdIndexableField(String name, Object value, EarlybirdFieldType fieldType) { - super(name, fieldType); - if (fieldType.docValuesType() == DocValuesType.NUMERIC) { - if (value instanceof Number) { - super.fieldsData = ((Number) value).longValue(); - } else { - throw new IllegalArgumentException("value not a number: " + value.getClass()); - } - } - } - -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentAtomicReader.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentAtomicReader.java deleted file mode 100644 index 63c811449..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentAtomicReader.java +++ /dev/null @@ -1,336 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.FieldInfos; -import org.apache.lucene.index.FilterLeafReader; -import org.apache.lucene.index.LeafMetaData; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.PointValues; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.SortedDocValues; -import org.apache.lucene.index.SortedNumericDocValues; -import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.index.StoredFieldVisitor; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.store.Directory; -import org.apache.lucene.util.Bits; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.encoding.docvalues.CSFTypeUtil; -import com.twitter.search.common.encoding.features.IntegerEncodedFeatures; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.Schema.FieldInfo; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldDocValues; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; - -public final class EarlybirdLuceneIndexSegmentAtomicReader - extends EarlybirdIndexSegmentAtomicReader { - private abstract static class DocIdSetIteratorWrapper extends NumericDocValues { - private final DocIdSetIterator delegate; - - public DocIdSetIteratorWrapper(DocIdSetIterator delegate) { - this.delegate = Preconditions.checkNotNull(delegate); - } - - @Override - public int docID() { - return delegate.docID(); - } - - @Override - public int nextDoc() throws IOException { - return delegate.nextDoc(); - } - - @Override - public int 
advance(int target) throws IOException { - return delegate.advance(target); - } - - @Override - public long cost() { - return delegate.cost(); - } - } - - private static class BytesRefBasedIntegerEncodedFeatures extends IntegerEncodedFeatures { - private final BytesRef bytesRef; - private final int numInts; - - public BytesRefBasedIntegerEncodedFeatures(BytesRef bytesRef, int numInts) { - this.bytesRef = bytesRef; - this.numInts = numInts; - } - - @Override - public int getInt(int pos) { - return CSFTypeUtil.convertFromBytes(bytesRef.bytes, bytesRef.offset, pos); - } - - @Override - public void setInt(int pos, int value) { - throw new UnsupportedOperationException(); - } - - @Override - public int getNumInts() { - return numInts; - } - } - - private static final int OLDEST_DOC_SKIP_INTERVAL = 256; - - private final LeafReader delegate; - - /** - * Do not add public constructors to this class. EarlybirdLuceneIndexSegmentAtomicReader instances - * should be created only by calling EarlybirdLuceneIndexSegmentData.createAtomicReader(), to make - * sure everything is set up properly (such as CSF readers). - */ - EarlybirdLuceneIndexSegmentAtomicReader( - EarlybirdIndexSegmentData segmentData, Directory directory) throws IOException { - super(segmentData); - this.delegate = getDelegateReader(directory); - } - - private LeafReader getDelegateReader(Directory directory) throws IOException { - LeafReader directoryReader = - EarlybirdIndexSegmentData.getLeafReaderFromOptimizedDirectory(directory); - return new FilterLeafReader(directoryReader) { - @Override - public NumericDocValues getNumericDocValues(String field) throws IOException { - EarlybirdFieldType type = getSchema().getFieldInfo(field).getFieldType(); - if ((type == null) || !type.isCsfViewField()) { - return in.getNumericDocValues(field); - } - - // Compute as many things as possible once, outside the NumericDocValues.get() call. - String baseFieldName = getSchema().getFieldInfo(type.getCsfViewBaseFieldId()).getName(); - FieldInfo baseFieldInfo = - Preconditions.checkNotNull(getSchema().getFieldInfo(baseFieldName)); - EarlybirdFieldType baseFieldType = baseFieldInfo.getFieldType(); - Preconditions.checkState(!baseFieldType.isCsfVariableLength()); - int numInts = baseFieldType.getCsfFixedLengthNumValuesPerDoc(); - FeatureConfiguration featureConfiguration = - Preconditions.checkNotNull(type.getCsfViewFeatureConfiguration()); - Preconditions.checkArgument(featureConfiguration.getValueIndex() < numInts); - - if (numInts == 1) { - // All encoded tweet features are encoded in a single integer. 
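          // Illustrative: a feature packed into bits [4, 10) of that integer is recovered
          // below as (packedValue & bitMask) >> bitStartPosition, e.g. a stored value of
          // 0b1010110000 with mask 0b1111110000 and start position 4 decodes to 0b101011 (43).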
- NumericDocValues numericDocValues = in.getNumericDocValues(baseFieldName); - return new DocIdSetIteratorWrapper(numericDocValues) { - @Override - public long longValue() throws IOException { - return (numericDocValues.longValue() & featureConfiguration.getBitMask()) - >> featureConfiguration.getBitStartPosition(); - } - - @Override - public boolean advanceExact(int target) throws IOException { - return numericDocValues.advanceExact(target); - } - }; - } - - BinaryDocValues binaryDocValues = - Preconditions.checkNotNull(in.getBinaryDocValues(baseFieldName)); - return new DocIdSetIteratorWrapper(binaryDocValues) { - @Override - public long longValue() throws IOException { - BytesRef data = binaryDocValues.binaryValue(); - IntegerEncodedFeatures encodedFeatures = - new BytesRefBasedIntegerEncodedFeatures(data, numInts); - return encodedFeatures.getFeatureValue(featureConfiguration); - } - - @Override - public boolean advanceExact(int target) throws IOException { - return binaryDocValues.advanceExact(target); - } - }; - } - - @Override - public CacheHelper getCoreCacheHelper() { - return in.getCoreCacheHelper(); - } - - @Override - public CacheHelper getReaderCacheHelper() { - return in.getReaderCacheHelper(); - } - }; - } - - private TermsEnum getTermsEnumAtTerm(Term term) throws IOException { - Terms terms = terms(term.field()); - if (terms == null) { - return null; - } - - TermsEnum termsEnum = terms.iterator(); - return termsEnum.seekExact(term.bytes()) ? termsEnum : null; - } - - @Override - public int getOldestDocID(Term term) throws IOException { - TermsEnum termsEnum = getTermsEnumAtTerm(term); - if (termsEnum == null) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - PostingsEnum td = termsEnum.postings(null); - int oldestDocID = td.nextDoc(); - if (oldestDocID == DocIdSetIterator.NO_MORE_DOCS) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - final int docFreq = termsEnum.docFreq(); - if (docFreq > OLDEST_DOC_SKIP_INTERVAL * 16) { - final int skipSize = docFreq / OLDEST_DOC_SKIP_INTERVAL; - do { - oldestDocID = td.docID(); - } while (td.advance(oldestDocID + skipSize) != DocIdSetIterator.NO_MORE_DOCS); - - td = delegate.postings(term); - td.advance(oldestDocID); - } - - do { - oldestDocID = td.docID(); - } while (td.nextDoc() != DocIdSetIterator.NO_MORE_DOCS); - - return oldestDocID; - } - - @Override - public int getTermID(Term term) throws IOException { - TermsEnum termsEnum = getTermsEnumAtTerm(term); - return termsEnum != null - ? 
(int) termsEnum.ord() - : EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - @Override - public Terms terms(String field) throws IOException { - return delegate.terms(field); - } - - @Override - public FieldInfos getFieldInfos() { - return delegate.getFieldInfos(); - } - - @Override - public Bits getLiveDocs() { - return getDeletesView().getLiveDocs(); - } - - @Override - public int numDocs() { - return delegate.numDocs(); - } - - @Override - public int maxDoc() { - return delegate.maxDoc(); - } - - @Override - public void document(int docID, StoredFieldVisitor visitor) throws IOException { - delegate.document(docID, visitor); - } - - @Override - public boolean hasDeletions() { - return getDeletesView().hasDeletions(); - } - - @Override - protected void doClose() throws IOException { - delegate.close(); - } - - @Override - public NumericDocValues getNumericDocValues(String field) throws IOException { - FieldInfo fieldInfo = getSegmentData().getSchema().getFieldInfo(field); - if (fieldInfo == null) { - return null; - } - - // If this field is a CSF view field or if it's not loaded in memory, get the NumericDocValues - // from the delegate. - EarlybirdFieldType fieldType = fieldInfo.getFieldType(); - if (fieldType.isCsfViewField() || !fieldInfo.getFieldType().isCsfLoadIntoRam()) { - NumericDocValues delegateVals = delegate.getNumericDocValues(field); - if (delegateVals != null) { - return delegateVals; - } - } - - // The field is either loaded in memory, or the delegate doesn't have NumericDocValues for it. - // Return the NumericDocValues for this field stored in the DocValuesManager. - ColumnStrideFieldIndex csf = - getSegmentData().getDocValuesManager().getColumnStrideFieldIndex(field); - return csf != null ? new ColumnStrideFieldDocValues(csf, this) : null; - } - - @Override - public BinaryDocValues getBinaryDocValues(String field) throws IOException { - return delegate.getBinaryDocValues(field); - } - - @Override - public SortedDocValues getSortedDocValues(String field) throws IOException { - return delegate.getSortedDocValues(field); - } - - @Override - public SortedSetDocValues getSortedSetDocValues(String field) throws IOException { - return delegate.getSortedSetDocValues(field); - } - - @Override - public NumericDocValues getNormValues(String field) throws IOException { - return delegate.getNormValues(field); - } - - @Override - public SortedNumericDocValues getSortedNumericDocValues(String field) throws IOException { - return delegate.getSortedNumericDocValues(field); - } - - @Override - public void checkIntegrity() throws IOException { - delegate.checkIntegrity(); - } - - @Override - public PointValues getPointValues(String field) throws IOException { - return delegate.getPointValues(field); - } - - @Override - public LeafMetaData getMetaData() { - return delegate.getMetaData(); - } - - @Override - public CacheHelper getCoreCacheHelper() { - return delegate.getCoreCacheHelper(); - } - - @Override - public CacheHelper getReaderCacheHelper() { - return delegate.getReaderCacheHelper(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentData.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentData.java deleted file mode 100644 index 82c858e69..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentData.java +++ /dev/null @@ -1,197 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; -import java.util.concurrent.ConcurrentHashMap; - 
-import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.store.Directory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.facets.FacetCountingArrayWriter; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.column.DocValuesManager; -import com.twitter.search.core.earlybird.index.column.OptimizedDocValuesManager; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.core.earlybird.index.inverted.DeletedDocs; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -/** - * Implements {@link EarlybirdIndexSegmentData} for Lucene-based on-disk Earlybird segments. - */ -public final class EarlybirdLuceneIndexSegmentData extends EarlybirdIndexSegmentData { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdLuceneIndexSegmentData.class); - - private final Directory directory; - private final EarlybirdIndexExtensionsData indexExtension; - - /** - * Creates a new Lucene-based SegmentData instance from a lucene directory. - */ - public EarlybirdLuceneIndexSegmentData( - Directory directory, - int maxSegmentSize, - long timeSliceID, - Schema schema, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - EarlybirdIndexExtensionsFactory indexExtensionsFactory) { - this( - directory, - maxSegmentSize, - timeSliceID, - schema, - false, // isOptimized - 0, // smallestDocId - new ConcurrentHashMap<>(), - AbstractFacetCountingArray.EMPTY_ARRAY, - new OptimizedDocValuesManager(schema, maxSegmentSize), - docIdToTweetIdMapper, - timeMapper, - indexExtensionsFactory == null - ? 
null : indexExtensionsFactory.newLuceneIndexExtensionsData()); - } - - public EarlybirdLuceneIndexSegmentData( - Directory directory, - int maxSegmentSize, - long timeSliceID, - Schema schema, - boolean isOptimized, - int smallestDocID, - ConcurrentHashMap perFieldMap, - AbstractFacetCountingArray facetCountingArray, - DocValuesManager docValuesManager, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - EarlybirdIndexExtensionsData indexExtension) { - super(maxSegmentSize, - timeSliceID, - schema, - isOptimized, - smallestDocID, - perFieldMap, - new ConcurrentHashMap<>(), - facetCountingArray, - docValuesManager, - null, // facetLabelProviders - null, // facetIDMap - DeletedDocs.NO_DELETES, - docIdToTweetIdMapper, - timeMapper); - this.directory = directory; - this.indexExtension = indexExtension; - } - - public Directory getLuceneDirectory() { - return directory; - } - - @Override - public EarlybirdIndexExtensionsData getIndexExtensionsData() { - return indexExtension; - } - - @Override - public FacetCountingArrayWriter createFacetCountingArrayWriter() { - return null; - } - - @Override - protected EarlybirdIndexSegmentAtomicReader doCreateAtomicReader() throws IOException { - // EarlybirdSegment creates one single EarlybirdIndexSegmentAtomicReader instance per segment - // and caches it, and the cached instance is recreated only when the segment's data changes. - // This is why this is a good place to reload all CSFs that should be loaded in RAM. Also, it's - // easier and less error-prone to do it here, than trying to track down all places that mutate - // the segment data and do it there. - LeafReader reader = getLeafReaderFromOptimizedDirectory(directory); - for (Schema.FieldInfo fieldInfo : getSchema().getFieldInfos()) { - // Load CSF into RAM based on configurations in the schema. 
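        // Only fields flagged with isCsfLoadIntoRam() are copied into an in-memory
        // ColumnStrideFieldIndex here; all other doc values keep being served straight from
        // the Lucene directory (see getNumericDocValues in the atomic reader).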
- if (fieldInfo.getFieldType().getCsfType() != null - && fieldInfo.getFieldType().isCsfLoadIntoRam()) { - if (reader.getNumericDocValues(fieldInfo.getName()) != null) { - ColumnStrideFieldIndex index = getDocValuesManager().addColumnStrideField( - fieldInfo.getName(), fieldInfo.getFieldType()); - index.load(reader, fieldInfo.getName()); - } else { - LOG.warn("Field {} does not have NumericDocValues.", fieldInfo.getName()); - } - } - } - - return new EarlybirdLuceneIndexSegmentAtomicReader(this, directory); - } - - @Override - public EarlybirdIndexSegmentWriter createEarlybirdIndexSegmentWriter( - IndexWriterConfig indexWriterConfig) throws IOException { - return new EarlybirdLuceneIndexSegmentWriter(this, indexWriterConfig); - } - - @Override - public EarlybirdIndexSegmentData.AbstractSegmentDataFlushHandler getFlushHandler() { - return new OnDiskSegmentDataFlushHandler(this); - } - - public static class OnDiskSegmentDataFlushHandler - extends AbstractSegmentDataFlushHandler { - private final Directory directory; - - public OnDiskSegmentDataFlushHandler(EarlybirdLuceneIndexSegmentData objectToFlush) { - super(objectToFlush); - this.directory = objectToFlush.directory; - } - - public OnDiskSegmentDataFlushHandler( - Schema schema, - Directory directory, - EarlybirdIndexExtensionsFactory indexExtensionsFactory, - Flushable.Handler docIdMapperFlushHandler, - Flushable.Handler timeMapperFlushHandler) { - super(schema, indexExtensionsFactory, docIdMapperFlushHandler, timeMapperFlushHandler); - this.directory = directory; - } - - @Override - protected EarlybirdIndexExtensionsData newIndexExtension() { - return indexExtensionsFactory.newLuceneIndexExtensionsData(); - } - - @Override - protected void flushAdditionalDataStructures( - FlushInfo flushInfo, DataSerializer out, EarlybirdIndexSegmentData toFlush) { - } - - @Override - protected EarlybirdIndexSegmentData constructSegmentData( - FlushInfo flushInfo, - ConcurrentHashMap perFieldMap, - int maxSegmentSize, - EarlybirdIndexExtensionsData indexExtension, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - DataDeserializer in) { - return new EarlybirdLuceneIndexSegmentData( - directory, - maxSegmentSize, - flushInfo.getLongProperty(TIME_SLICE_ID_PROP_NAME), - schema, - flushInfo.getBooleanProperty(IS_OPTIMIZED_PROP_NAME), - flushInfo.getIntProperty(SMALLEST_DOCID_PROP_NAME), - perFieldMap, - AbstractFacetCountingArray.EMPTY_ARRAY, - new OptimizedDocValuesManager(schema, maxSegmentSize), - docIdToTweetIdMapper, - timeMapper, - indexExtension); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentWriter.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentWriter.java deleted file mode 100644 index 6f73c3c32..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdLuceneIndexSegmentWriter.java +++ /dev/null @@ -1,170 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.File; -import java.io.IOException; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.slf4j.Marker; -import org.slf4j.MarkerFactory; - -import org.apache.lucene.document.Document; -import org.apache.lucene.index.IndexWriter; -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.search.Query; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.FSDirectory; -import 
org.apache.lucene.store.LockObtainFailedException; - -/** - * EarlybirdIndexWriter implementation that's a wrapper around Lucene's {@link IndexWriter} - * and writes Lucene segments into a {@link Directory}. - */ -public class EarlybirdLuceneIndexSegmentWriter extends EarlybirdIndexSegmentWriter { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdLuceneIndexSegmentWriter.class); - private static final Marker FATAL = MarkerFactory.getMarker("FATAL"); - - private final EarlybirdLuceneIndexSegmentData segmentData; - private final IndexWriter indexWriter; - - @Override - public EarlybirdIndexSegmentData getSegmentData() { - return segmentData; - } - - /** - * Construct a lucene IndexWriter-based Earlybird segment writer. - * This will open a Lucene IndexWriter on segmentData.getLuceneDirectory(). - * This constructor will throw LockObtainFailedException if it cannot obtain the "write.lock" - * inside the directory segmentData.getLuceneDirectory(). - * - * Don't add public constructors to this class. EarlybirdLuceneIndexSegmentWriter instances should - * be created only by calling EarlybirdLuceneIndexSegmentData.createEarlybirdIndexSegmentWriter(), - * to make sure everything is set up properly (such as CSF readers). - */ - EarlybirdLuceneIndexSegmentWriter( - EarlybirdLuceneIndexSegmentData segmentData, - IndexWriterConfig indexWriterConfig) throws IOException { - Preconditions.checkNotNull(segmentData); - this.segmentData = segmentData; - try { - this.indexWriter = new IndexWriter(segmentData.getLuceneDirectory(), indexWriterConfig); - } catch (LockObtainFailedException e) { - logDebuggingInfoUponFailureToObtainLuceneWriteLock(segmentData, e); - // Rethrow the exception, and this Earlybird will trigger critical alerts - throw e; - } - } - - private void logDebuggingInfoUponFailureToObtainLuceneWriteLock( - EarlybirdLuceneIndexSegmentData luceneIndexSegmentData, - LockObtainFailedException e) throws IOException { - // Every day, we create a new Lucene dir---we do not append into existing Lucene dirs. - // Supposedly, we should never fail to obtain the write lock from a fresh and empty - // Lucene directory. - // Adding debugging information for SEARCH-4454, where a timeslice roll failed because - // Earlybird failed to get the write lock for a new timeslice. - Directory dir = luceneIndexSegmentData.getLuceneDirectory(); - LOG.error( - FATAL, - "Unable to obtain write.lock for Lucene directory. The Lucene directory is: " + dir, - e); - - if (dir instanceof FSDirectory) { // this check should always be true in our current setup. - FSDirectory fsDir = (FSDirectory) dir; - // Log if the underlying directory on disk does not exist. 
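-      // The checks below are purely diagnostic: they log whether the directory exists and
-      // is writable, whether a stale write.lock file is already present, and which files
-      // the directory and its parent segment directory contain, so lock-acquisition
-      // failures can be investigated from the logs alone.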
- File underlyingDir = fsDir.getDirectory().toFile(); - if (underlyingDir.exists()) { - LOG.info("Lucene directory contains the following files: " - + Lists.newArrayList(fsDir.listAll())); - } else { - LOG.error( - FATAL, - "Directory " + underlyingDir + " does not exist on disk.", - e); - } - - if (!underlyingDir.canWrite()) { - LOG.error( - FATAL, - "Cannot write into directory " + underlyingDir, - e); - } - - File writeLockFile = new File(underlyingDir, "write.lock"); - if (writeLockFile.exists()) { - LOG.error( - FATAL, - "Write lock file " + writeLockFile + " already exists.", - e); - } - - if (!writeLockFile.canWrite()) { - LOG.error( - FATAL, - "No write access to lock file: " + writeLockFile - + " Usable space: " + underlyingDir.getUsableSpace(), - e); - } - - // List all files in the segment directory - File segmentDir = underlyingDir.getParentFile(); - LOG.warn("Segment directory contains the following files: " - + Lists.newArrayList(segmentDir.list())); - } else { - LOG.warn("Unable to log debugging info upon failing to acquire Lucene write lock." - + "The class of the directory is: " + dir.getClass().getName()); - } - } - - @Override - public void addDocument(Document doc) throws IOException { - indexWriter.addDocument(doc); - } - - @Override - public void addTweet(Document doc, long tweetId, boolean docIdOffensive) throws IOException { - indexWriter.addDocument(doc); - } - - @Override - protected void appendOutOfOrder(Document doc, int docId) throws IOException { - throw new UnsupportedOperationException("This Lucene-based IndexWriter does not support " - + "updates and out-of-order appends."); - } - - @Override - public int numDocs() { - return indexWriter.getDocStats().maxDoc; - } - - @Override - public int numDocsNoDelete() throws IOException { - return numDocs(); - } - - @Override - public void deleteDocuments(Query query) throws IOException { - super.deleteDocuments(query); - indexWriter.deleteDocuments(query); - } - - @Override - public void addIndexes(Directory... 
dirs) throws IOException { - indexWriter.addIndexes(dirs); - } - - @Override - public void forceMerge() throws IOException { - indexWriter.forceMerge(1); - } - - @Override - public void close() throws IOException { - indexWriter.close(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentAtomicReader.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentAtomicReader.java deleted file mode 100644 index 78c1f6d45..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentAtomicReader.java +++ /dev/null @@ -1,175 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; - -import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.Fields; -import org.apache.lucene.index.LeafMetaData; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.PointValues; -import org.apache.lucene.index.SortedDocValues; -import org.apache.lucene.index.SortedNumericDocValues; -import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.index.StoredFieldVisitor; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.Terms; -import org.apache.lucene.search.Sort; -import org.apache.lucene.util.Bits; -import org.apache.lucene.util.Version; - -import com.twitter.search.core.earlybird.facets.EarlybirdFacetDocValueSet; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldDocValues; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.inverted.InMemoryFields; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -public final class EarlybirdRealtimeIndexSegmentAtomicReader - extends EarlybirdIndexSegmentAtomicReader { - private final Fields fields; - private final int maxDocId; - private final int numDocs; - - /** - * Creates a new real-time reader for the given segment. Do not add public constructors to this - * class. EarlybirdRealtimeIndexSegmentAtomicReader instances should be created only by calling - * EarlybirdRealtimeIndexSegmentData.createAtomicReader(), to make sure everything is set up - * properly (such as CSF readers). - */ - EarlybirdRealtimeIndexSegmentAtomicReader(EarlybirdRealtimeIndexSegmentData segmentData) { - super(segmentData); - - this.fields = new InMemoryFields(segmentData.getPerFieldMap(), syncData.getIndexPointers()); - - // We cache the highest doc ID and the number of docs, because the reader must return the same - // values for its entire lifetime, and the segment will get more tweets over time. - // These values could be slightly out of sync with 'fields', because we don't update these - // values atomically with the fields. 
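-    // getPreviousDocID(Integer.MAX_VALUE) returns the largest doc ID assigned so far, which
-    // serves as this reader's fixed snapshot of maxDoc.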
- this.maxDocId = segmentData.getDocIDToTweetIDMapper().getPreviousDocID(Integer.MAX_VALUE); - this.numDocs = segmentData.getDocIDToTweetIDMapper().getNumDocs(); - } - - @Override - public int maxDoc() { - return maxDocId + 1; - } - - @Override - public int numDocs() { - return numDocs; - } - - @Override - protected void doClose() { - // nothing to do - } - - @Override - public void document(int docID, StoredFieldVisitor visitor) { - // not supported - } - - @Override - public int getOldestDocID(Term t) throws IOException { - InvertedIndex perField = getSegmentData().getPerFieldMap().get(t.field()); - if (perField == null) { - return TERM_NOT_FOUND; - } - return perField.getLargestDocIDForTerm(t.bytes()); - } - - @Override - public int getTermID(Term t) throws IOException { - InvertedIndex perField = getSegmentData().getPerFieldMap().get(t.field()); - if (perField == null) { - return TERM_NOT_FOUND; - } - return perField.lookupTerm(t.bytes()); - } - - @Override - public Bits getLiveDocs() { - // liveDocs contains inverted (decreasing) docIDs. - return getDeletesView().getLiveDocs(); - } - - @Override - public boolean hasDeletions() { - return getDeletesView().hasDeletions(); - } - - @Override - public Terms terms(String field) throws IOException { - return fields.terms(field); - } - - @Override - public NumericDocValues getNumericDocValues(String field) throws IOException { - ColumnStrideFieldIndex csf = - getSegmentData().getDocValuesManager().getColumnStrideFieldIndex(field); - return csf != null ? new ColumnStrideFieldDocValues(csf, this) : null; - } - - @Override - public boolean hasDocs() { - // smallestDocID is the smallest document ID that was available when this reader was created. - // So we need to check its value in order to decide if this reader can see any documents, - // because in the meantime other documents might've been added to the tweet ID mapper. - return getSmallestDocID() != Integer.MAX_VALUE; - } - - @Override - public BinaryDocValues getBinaryDocValues(String field) { - return null; - } - - @Override - public SortedDocValues getSortedDocValues(String field) { - return null; - } - - @Override - public SortedSetDocValues getSortedSetDocValues(String field) { - // special handling for facet field - if (EarlybirdFacetDocValueSet.FIELD_NAME.equals(field)) { - return ((EarlybirdRealtimeIndexSegmentData) getSegmentData()).getFacetDocValueSet(); - } - - return null; - } - - @Override - public NumericDocValues getNormValues(String field) throws IOException { - ColumnStrideFieldIndex csf = getSegmentData().getNormIndex(field); - return csf != null ? 
new ColumnStrideFieldDocValues(csf, this) : null; - } - - @Override - public SortedNumericDocValues getSortedNumericDocValues(String field) { - return null; - } - - @Override - public void checkIntegrity() { - // nothing to do - } - - @Override - public PointValues getPointValues(String field) { - return null; - } - - @Override - public LeafMetaData getMetaData() { - return new LeafMetaData(Version.LATEST.major, Version.LATEST, Sort.RELEVANCE); - } - - @Override - public CacheHelper getCoreCacheHelper() { - return null; - } - - @Override - public CacheHelper getReaderCacheHelper() { - return null; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentData.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentData.java deleted file mode 100644 index 58ea6f3bf..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentData.java +++ /dev/null @@ -1,251 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.collect.Maps; - -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.search.IndexSearcher; - -import com.twitter.search.common.schema.SearchWhitespaceAnalyzer; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.facets.EarlybirdFacetDocValueSet; -import com.twitter.search.core.earlybird.facets.FacetCountingArray; -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetUtil; -import com.twitter.search.core.earlybird.facets.OptimizedFacetCountingArray; -import com.twitter.search.core.earlybird.index.column.DocValuesManager; -import com.twitter.search.core.earlybird.index.column.OptimizedDocValuesManager; -import com.twitter.search.core.earlybird.index.column.UnoptimizedDocValuesManager; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdRealtimeIndexExtensionsData; -import com.twitter.search.core.earlybird.index.inverted.DeletedDocs; -import com.twitter.search.core.earlybird.index.inverted.IndexOptimizer; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; - -/** - * Implements {@link EarlybirdIndexSegmentData} for real-time in-memory Earlybird segments. - */ -public class EarlybirdRealtimeIndexSegmentData extends EarlybirdIndexSegmentData { - private final EarlybirdRealtimeIndexExtensionsData indexExtension; - - private EarlybirdFacetDocValueSet facetDocValueSet; - - /** - * Creates a new empty real-time SegmentData instance. 
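-   * The new segment starts out unoptimized: it gets an empty per-field map, a fresh
-   * FacetCountingArray, an UnoptimizedDocValuesManager, and a smallestDocID of
-   * Integer.MAX_VALUE, i.e. no documents have been published yet.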
- */ - public EarlybirdRealtimeIndexSegmentData( - int maxSegmentSize, - long timeSliceID, - Schema schema, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - EarlybirdIndexExtensionsFactory indexExtensionsFactory) { - this( - maxSegmentSize, - timeSliceID, - schema, - false, // isOptimized - Integer.MAX_VALUE, - new ConcurrentHashMap<>(), - new FacetCountingArray(maxSegmentSize), - new UnoptimizedDocValuesManager(schema, maxSegmentSize), - Maps.newHashMapWithExpectedSize(schema.getNumFacetFields()), - FacetIDMap.build(schema), - new DeletedDocs.Default(maxSegmentSize), - docIdToTweetIdMapper, - timeMapper, - indexExtensionsFactory == null - ? null - : indexExtensionsFactory.newRealtimeIndexExtensionsData()); - } - - /** - * Creates a new real-time SegmentData instance using the passed in data structures. Usually this - * constructor is used by the FlushHandler after a segment was loaded from disk, but also the - * {@link IndexOptimizer} uses it to create an - * optimized segment. - */ - public EarlybirdRealtimeIndexSegmentData( - int maxSegmentSize, - long timeSliceID, - Schema schema, - boolean isOptimized, - int smallestDocID, - ConcurrentHashMap perFieldMap, - AbstractFacetCountingArray facetCountingArray, - DocValuesManager docValuesManager, - Map facetLabelProviders, - FacetIDMap facetIDMap, - DeletedDocs deletedDocs, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - EarlybirdRealtimeIndexExtensionsData indexExtension) { - super(maxSegmentSize, - timeSliceID, - schema, - isOptimized, - smallestDocID, - perFieldMap, - new ConcurrentHashMap<>(), - facetCountingArray, - docValuesManager, - facetLabelProviders, - facetIDMap, - deletedDocs, - docIdToTweetIdMapper, - timeMapper); - this.indexExtension = indexExtension; - this.facetDocValueSet = null; - } - - @Override - public EarlybirdRealtimeIndexExtensionsData getIndexExtensionsData() { - return indexExtension; - } - - /** - * For realtime segments, this wraps a facet datastructure into a SortedSetDocValues to - * comply to Lucene facet api. - */ - public EarlybirdFacetDocValueSet getFacetDocValueSet() { - if (facetDocValueSet == null) { - AbstractFacetCountingArray facetCountingArray = getFacetCountingArray(); - if (facetCountingArray != null) { - facetDocValueSet = new EarlybirdFacetDocValueSet( - facetCountingArray, getFacetLabelProviders(), getFacetIDMap()); - } - } - return facetDocValueSet; - } - - @Override - protected EarlybirdIndexSegmentAtomicReader doCreateAtomicReader() { - return new EarlybirdRealtimeIndexSegmentAtomicReader(this); - } - - /** - * Convenience method for creating an EarlybirdIndexSegmentWriter for this segment with a default - * IndexSegmentWriter config. - */ - public EarlybirdIndexSegmentWriter createEarlybirdIndexSegmentWriter() { - return createEarlybirdIndexSegmentWriter( - new IndexWriterConfig(new SearchWhitespaceAnalyzer()).setSimilarity( - IndexSearcher.getDefaultSimilarity())); - } - - @Override - public EarlybirdIndexSegmentWriter createEarlybirdIndexSegmentWriter( - IndexWriterConfig indexWriterConfig) { - // Prepare the in-memory segment with all enabled CSF fields. 
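-    // Every schema field that declares a csfType gets its column stride field registered
-    // with the DocValuesManager up front, before the writer starts adding documents.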
- DocValuesManager docValuesManager = getDocValuesManager(); - for (Schema.FieldInfo fieldInfo : getSchema().getFieldInfos()) { - if (fieldInfo.getFieldType().getCsfType() != null) { - docValuesManager.addColumnStrideField(fieldInfo.getName(), fieldInfo.getFieldType()); - } - } - - return new EarlybirdRealtimeIndexSegmentWriter( - this, - indexWriterConfig.getAnalyzer(), - indexWriterConfig.getSimilarity()); - } - - @Override - public EarlybirdIndexSegmentData.AbstractSegmentDataFlushHandler getFlushHandler() { - return new InMemorySegmentDataFlushHandler(this); - } - - public static class InMemorySegmentDataFlushHandler - extends AbstractSegmentDataFlushHandler { - public InMemorySegmentDataFlushHandler(EarlybirdIndexSegmentData objectToFlush) { - super(objectToFlush); - } - - public InMemorySegmentDataFlushHandler( - Schema schema, - EarlybirdIndexExtensionsFactory factory, - Flushable.Handler docIdMapperFlushHandler, - Flushable.Handler timeMapperFlushHandler) { - super(schema, factory, docIdMapperFlushHandler, timeMapperFlushHandler); - } - - @Override - protected EarlybirdRealtimeIndexExtensionsData newIndexExtension() { - return indexExtensionsFactory.newRealtimeIndexExtensionsData(); - } - - @Override - protected void flushAdditionalDataStructures( - FlushInfo flushInfo, - DataSerializer out, - EarlybirdIndexSegmentData segmentData) throws IOException { - segmentData.getFacetCountingArray().getFlushHandler() - .flush(flushInfo.newSubProperties("facet_counting_array"), out); - - // flush all column stride fields - segmentData.getDocValuesManager().getFlushHandler() - .flush(flushInfo.newSubProperties("doc_values"), out); - - segmentData.getFacetIDMap().getFlushHandler() - .flush(flushInfo.newSubProperties("facet_id_map"), out); - - segmentData.getDeletedDocs().getFlushHandler() - .flush(flushInfo.newSubProperties("deleted_docs"), out); - } - - @Override - protected EarlybirdIndexSegmentData constructSegmentData( - FlushInfo flushInfo, - ConcurrentHashMap perFieldMap, - int maxSegmentSize, - EarlybirdRealtimeIndexExtensionsData indexExtension, - DocIDToTweetIDMapper docIdToTweetIdMapper, - TimeMapper timeMapper, - DataDeserializer in) throws IOException { - boolean isOptimized = flushInfo.getBooleanProperty(IS_OPTIMIZED_PROP_NAME); - - Flushable.Handler facetLoader = isOptimized - ? new OptimizedFacetCountingArray.FlushHandler() - : new FacetCountingArray.FlushHandler(maxSegmentSize); - AbstractFacetCountingArray facetCountingArray = - facetLoader.load(flushInfo.getSubProperties("facet_counting_array"), in); - - Flushable.Handler docValuesLoader = isOptimized - ? 
new OptimizedDocValuesManager.OptimizedFlushHandler(schema) - : new UnoptimizedDocValuesManager.UnoptimizedFlushHandler(schema); - DocValuesManager docValuesManager = - docValuesLoader.load(flushInfo.getSubProperties("doc_values"), in); - - FacetIDMap facetIDMap = new FacetIDMap.FlushHandler(schema) - .load(flushInfo.getSubProperties("facet_id_map"), in); - - DeletedDocs.Default deletedDocs = new DeletedDocs.Default.FlushHandler(maxSegmentSize) - .load(flushInfo.getSubProperties("deleted_docs"), in); - - return new EarlybirdRealtimeIndexSegmentData( - maxSegmentSize, - flushInfo.getLongProperty(TIME_SLICE_ID_PROP_NAME), - schema, - isOptimized, - flushInfo.getIntProperty(SMALLEST_DOCID_PROP_NAME), - perFieldMap, - facetCountingArray, - docValuesManager, - FacetUtil.getFacetLabelProviders(schema, perFieldMap), - facetIDMap, - deletedDocs, - docIdToTweetIdMapper, - timeMapper, - indexExtension); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentWriter.java b/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentWriter.java deleted file mode 100644 index 049d33ce8..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/EarlybirdRealtimeIndexSegmentWriter.java +++ /dev/null @@ -1,789 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.HashMap; -import java.util.HashSet; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; -import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; -import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute; -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.DocValuesType; -import org.apache.lucene.index.FieldInvertState; -import org.apache.lucene.index.IndexOptions; -import org.apache.lucene.index.IndexableField; -import org.apache.lucene.index.IndexableFieldType; -import org.apache.lucene.search.similarities.Similarity; -import org.apache.lucene.store.Directory; -import org.apache.lucene.util.AttributeSource; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.BytesRefHash; -import org.apache.lucene.util.Version; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.core.earlybird.facets.FacetCountingArrayWriter; -import com.twitter.search.core.earlybird.facets.FacetIDMap.FacetField; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetUtil; -import com.twitter.search.core.earlybird.index.column.ColumnStrideByteIndex; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdRealtimeIndexExtensionsData; -import com.twitter.search.core.earlybird.index.inverted.EarlybirdCSFDocValuesProcessor; 
-import com.twitter.search.core.earlybird.index.inverted.InvertedRealtimeIndex; -import com.twitter.search.core.earlybird.index.inverted.InvertedRealtimeIndexWriter; -import com.twitter.search.core.earlybird.index.inverted.TermPointerEncoding; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; - -/** - * EarlybirdIndexWriter implementation that writes realtime in-memory segments. - * Note that it is used by both Earlybirds and ExpertSearch. - */ -public final class EarlybirdRealtimeIndexSegmentWriter extends EarlybirdIndexSegmentWriter { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdRealtimeIndexSegmentWriter.class); - /** - * Maximum tweet length is 10k, setting maximum token position to 25k in case of weird unicode. - */ - private static final int MAX_POSITION = 25000; - - private static final String OUT_OF_ORDER_APPEND_UNSUPPORTED_STATS_PATTERN = - "out_of_order_append_unsupported_for_field_%s"; - private static final ConcurrentHashMap - UNSUPPORTED_OUT_OF_ORDER_APPEND_MAP = new ConcurrentHashMap<>(); - private static final SearchRateCounter NUM_TWEETS_DROPPED = - SearchRateCounter.export("EarlybirdRealtimeIndexSegmentWriter_num_tweets_dropped"); - - private long nextFieldGen; - - private HashMap fields = new HashMap<>(); - private List fieldsInDocument = new ArrayList<>(); - - private final EarlybirdCSFDocValuesProcessor docValuesProcessor; - - private Map termHashSync = new HashMap<>(); - private Set appendedFields = new HashSet<>(); - - private final Analyzer analyzer; - private final Similarity similarity; - - private final EarlybirdRealtimeIndexSegmentData segmentData; - - private final Field allDocsField; - - @Nullable - private final FacetCountingArrayWriter facetCountingArrayWriter; - - /** - * Creates a new writer for a real-time in-memory Earlybird segment. - * - * Do not add public constructors to this class. EarlybirdRealtimeIndexSegmentWriter instances - * should be created only by calling - * EarlybirdRealtimeIndexSegmentData.createEarlybirdIndexSegmentWriter(), to make sure everything - * is set up properly (such as CSF readers). - */ - EarlybirdRealtimeIndexSegmentWriter( - EarlybirdRealtimeIndexSegmentData segmentData, - Analyzer analyzer, - Similarity similarity) { - Preconditions.checkNotNull(segmentData); - this.segmentData = segmentData; - this.facetCountingArrayWriter = segmentData.createFacetCountingArrayWriter(); - this.docValuesProcessor = new EarlybirdCSFDocValuesProcessor(segmentData.getDocValuesManager()); - this.analyzer = analyzer; - this.similarity = similarity; - this.allDocsField = buildAllDocsField(segmentData); - } - - @Override - public EarlybirdRealtimeIndexSegmentData getSegmentData() { - return segmentData; - } - - @Override - public int numDocsNoDelete() { - return segmentData.getDocIDToTweetIDMapper().getNumDocs(); - } - - @Override - public void addDocument(Document doc) throws IOException { - // This method should be called only from Expertsearch, not tweets Earlybirds. - DocIDToTweetIDMapper docIdToTweetIdMapper = segmentData.getDocIDToTweetIDMapper(); - Preconditions.checkState(docIdToTweetIdMapper instanceof SequentialDocIDMapper); - - // Make sure we have space for a new doc in this segment. 
- Preconditions.checkState(docIdToTweetIdMapper.getNumDocs() < segmentData.getMaxSegmentSize(), - "Cannot add a new document to the segment, because it's full."); - - addDocument(doc, docIdToTweetIdMapper.addMapping(-1L), false); - } - - @Override - public void addTweet(Document doc, long tweetId, boolean docIsOffensive) throws IOException { - DocIDToTweetIDMapper docIdToTweetIdMapper = segmentData.getDocIDToTweetIDMapper(); - Preconditions.checkState(!(docIdToTweetIdMapper instanceof SequentialDocIDMapper)); - - // Make sure we have space for a new doc in this segment. - Preconditions.checkState(docIdToTweetIdMapper.getNumDocs() < segmentData.getMaxSegmentSize(), - "Cannot add a new document to the segment, because it's full."); - - Preconditions.checkNotNull(doc.getField( - EarlybirdFieldConstants.EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName())); - - addAllDocsField(doc); - - int docId = docIdToTweetIdMapper.addMapping(tweetId); - // Make sure we successfully assigned a doc ID to the new document/tweet before proceeding. - // If the docId is DocIDToTweetIDMapper.ID_NOT_FOUND then either: - // 1. the tweet is older than the OutOfOrderRealtimeTweetIDMapper.segmentBoundaryTimestamp and - // is too old for this segment - // 2. the OutOfOrderRealtimeTweetIDMapper does not have any available doc ids left - if (docId == DocIDToTweetIDMapper.ID_NOT_FOUND) { - LOG.info("Could not assign doc id for tweet. Dropping tweet id " + tweetId - + " for segment with timeslice: " + segmentData.getTimeSliceID()); - NUM_TWEETS_DROPPED.increment(); - return; - } - - addDocument(doc, docId, docIsOffensive); - } - - private void addDocument(Document doc, - int docId, - boolean docIsOffensive) throws IOException { - fieldsInDocument.clear(); - - long fieldGen = nextFieldGen++; - - // NOTE: we need two passes here, in case there are - // multi-valued fields, because we must process all - // instances of a given field at once, since the - // analyzer is free to reuse TokenStream across fields - // (i.e., we cannot have more than one TokenStream - // running "at once"): - - try { - for (IndexableField field : doc) { - if (!skipField(field.name())) { - processField(docId, field, fieldGen, docIsOffensive); - } - } - } finally { - // Finish each indexed field name seen in the document: - for (PerField field : fieldsInDocument) { - field.finish(docId); - } - - // When indexing a dummy document for out-of-order updates into a loaded segment, that - // document gets docID set as maxSegment size. So we have to make sure that we never - // sync backwards in document order. 
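-      // Realtime doc IDs are assigned in decreasing order (the smallest doc ID is the most
-      // recently published document), so taking the min here only ever moves the published
-      // watermark forward, never backwards in document order.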
- int smallestDocID = Math.min(docId, segmentData.getSyncData().getSmallestDocID()); - segmentData.updateSmallestDocID(smallestDocID); - } - } - - @Override - protected void appendOutOfOrder(Document doc, int internalDocID) throws IOException { - Preconditions.checkNotNull(doc); - fieldsInDocument.clear(); - - long fieldGen = nextFieldGen++; - - try { - for (IndexableField indexableField : doc) { - if (!skipField(indexableField.name())) { - Schema.FieldInfo fi = segmentData.getSchema().getFieldInfo(indexableField.name()); - if (fi == null) { - LOG.error("FieldInfo for " + indexableField.name() + " is null!"); - continue; - } - if (segmentData.isOptimized() && fi.getFieldType().becomesImmutable()) { - UNSUPPORTED_OUT_OF_ORDER_APPEND_MAP.computeIfAbsent( - indexableField.name(), - f -> SearchRateCounter.export( - String.format(OUT_OF_ORDER_APPEND_UNSUPPORTED_STATS_PATTERN, f)) - ).increment(); - continue; - } - processField(internalDocID, indexableField, fieldGen, false); - appendedFields.add(indexableField.name()); - } - } - } finally { - // Finish each indexed field name seen in the document: - for (PerField field : fieldsInDocument) { - field.finish(internalDocID); - } - // force sync - segmentData.updateSmallestDocID(segmentData.getSyncData().getSmallestDocID()); - } - } - - @Override - public void addIndexes(Directory... dirs) { - throw new UnsupportedOperationException("In realtime mode addIndexes() is currently " - + "not supported."); - } - - @Override - public void forceMerge() { - // we always have a single segment in realtime-mode - } - - @Override - public void close() { - // nothing to close - } - - private void processField( - int docId, - IndexableField field, - long fieldGen, - boolean currentDocIsOffensive) throws IOException { - String fieldName = field.name(); - IndexableFieldType fieldType = field.fieldType(); - - // Invert indexed fields: - if (fieldType.indexOptions() != IndexOptions.NONE) { - PerField perField = getOrAddField(fieldName, fieldType); - - // Whether this is the first time we have seen this field in this document. - boolean first = perField.fieldGen != fieldGen; - perField.invert(field, docId, first, currentDocIsOffensive); - - if (first) { - fieldsInDocument.add(perField); - perField.fieldGen = fieldGen; - } - } else { - Schema.FieldInfo facetFieldInfo = - segmentData.getSchema().getFacetFieldByFieldName(fieldName); - FacetField facetField = facetFieldInfo != null - ? segmentData.getFacetIDMap().getFacetField(facetFieldInfo) : null; - EarlybirdFieldType facetFieldType = facetFieldInfo != null - ? 
facetFieldInfo.getFieldType() : null; - Preconditions.checkState( - facetFieldInfo == null || (facetField != null && facetFieldType != null)); - if (facetField != null && facetFieldType.isUseCSFForFacetCounting()) { - segmentData.getFacetLabelProviders().put( - facetField.getFacetName(), - Preconditions.checkNotNull( - FacetUtil.chooseFacetLabelProvider(facetFieldType, null))); - } - } - - if (fieldType.docValuesType() != DocValuesType.NONE) { - StoredFieldsConsumerBuilder consumerBuilder = new StoredFieldsConsumerBuilder( - fieldName, (EarlybirdFieldType) fieldType); - EarlybirdRealtimeIndexExtensionsData indexExtension = segmentData.getIndexExtensionsData(); - if (indexExtension != null) { - indexExtension.createStoredFieldsConsumer(consumerBuilder); - } - if (consumerBuilder.isUseDefaultConsumer()) { - consumerBuilder.addConsumer(docValuesProcessor); - } - - StoredFieldsConsumer storedFieldsConsumer = consumerBuilder.build(); - if (storedFieldsConsumer != null) { - storedFieldsConsumer.addField(docId, field); - } - } - } - - /** Returns a previously created {@link PerField}, absorbing the type information from - * {@link org.apache.lucene.document.FieldType}, and creates a new {@link PerField} if this field - * name wasn't seen yet. */ - private PerField getOrAddField(String name, IndexableFieldType fieldType) { - // Note that this could be a computeIfAbsent, but that allocates a closure in the hot path and - // slows down indexing. - PerField perField = fields.get(name); - if (perField == null) { - boolean omitNorms = fieldType.omitNorms() || fieldType.indexOptions() == IndexOptions.NONE; - perField = new PerField(this, name, fieldType.indexOptions(), omitNorms); - fields.put(name, perField); - } - return perField; - } - - /** NOTE: not static: accesses at least docState, termsHash. */ - private static final class PerField implements Comparable { - - private final EarlybirdRealtimeIndexSegmentWriter indexSegmentWriter; - - private final String fieldName; - private final IndexOptions indexOptions; - private final boolean omitNorms; - - private InvertedRealtimeIndex invertedField; - private InvertedDocConsumer indexWriter; - - /** We use this to know when a PerField is seen for the - * first time in the current document. */ - private long fieldGen = -1; - - // reused - private TokenStream tokenStream; - - private int currentPosition; - private int currentOffset; - private int currentLength; - private int currentOverlap; - private int lastStartOffset; - private int lastPosition; - - public PerField( - EarlybirdRealtimeIndexSegmentWriter indexSegmentWriter, - String fieldName, - IndexOptions indexOptions, - boolean omitNorms) { - this.indexSegmentWriter = indexSegmentWriter; - this.fieldName = fieldName; - this.indexOptions = indexOptions; - this.omitNorms = omitNorms; - - initInvertState(); - } - - void initInvertState() { - // it's okay if this is null - in that case TwitterTermHashPerField - // will not add it to the facet array - final Schema.FieldInfo facetFieldInfo - = indexSegmentWriter.segmentData.getSchema().getFacetFieldByFieldName(fieldName); - final FacetField facetField = facetFieldInfo != null - ? indexSegmentWriter.segmentData.getFacetIDMap().getFacetField(facetFieldInfo) : null; - final EarlybirdFieldType facetFieldType - = facetFieldInfo != null ? 
facetFieldInfo.getFieldType() : null; - Preconditions.checkState( - facetFieldInfo == null || (facetField != null && facetFieldType != null)); - - if (facetField != null && facetFieldType.isUseCSFForFacetCounting()) { - indexSegmentWriter.segmentData.getFacetLabelProviders().put( - facetField.getFacetName(), - Preconditions.checkNotNull( - FacetUtil.chooseFacetLabelProvider(facetFieldType, null))); - return; - } - - Schema.FieldInfo fi = indexSegmentWriter.segmentData.getSchema().getFieldInfo(fieldName); - final EarlybirdFieldType fieldType = fi.getFieldType(); - - InvertedDocConsumerBuilder consumerBuilder = new InvertedDocConsumerBuilder( - indexSegmentWriter.segmentData, fieldName, fieldType); - EarlybirdRealtimeIndexExtensionsData indexExtension = - indexSegmentWriter.segmentData.getIndexExtensionsData(); - if (indexExtension != null) { - indexExtension.createInvertedDocConsumer(consumerBuilder); - } - - if (consumerBuilder.isUseDefaultConsumer()) { - if (indexSegmentWriter.segmentData.getPerFieldMap().containsKey(fieldName)) { - invertedField = (InvertedRealtimeIndex) indexSegmentWriter - .segmentData.getPerFieldMap().get(fieldName); - } else { - invertedField = new InvertedRealtimeIndex( - fieldType, - TermPointerEncoding.DEFAULT_ENCODING, - fieldName); - } - - InvertedRealtimeIndexWriter fieldWriter = new InvertedRealtimeIndexWriter( - invertedField, facetField, indexSegmentWriter.facetCountingArrayWriter); - - if (facetField != null) { - Map providerMap = - indexSegmentWriter.segmentData.getFacetLabelProviders(); - if (!providerMap.containsKey(facetField.getFacetName())) { - providerMap.put( - facetField.getFacetName(), - Preconditions.checkNotNull( - FacetUtil.chooseFacetLabelProvider(facetFieldType, invertedField))); - } - } - - indexSegmentWriter.segmentData.addField(fieldName, invertedField); - - if (indexSegmentWriter.appendedFields.contains(fieldName)) { - indexSegmentWriter.termHashSync.put(fieldName, fieldWriter); - } - - consumerBuilder.addConsumer(fieldWriter); - } - - indexWriter = consumerBuilder.build(); - } - - @Override - public int compareTo(PerField other) { - return this.fieldName.compareTo(other.fieldName); - } - - @Override - public boolean equals(Object other) { - if (!(other instanceof PerField)) { - return false; - } - - return this.fieldName.equals(((PerField) other).fieldName); - } - - @Override - public int hashCode() { - return fieldName.hashCode(); - } - - public void finish(int docId) { - if (indexWriter != null) { - indexWriter.finish(); - } - - if (!omitNorms) { - FieldInvertState state = new FieldInvertState( - Version.LATEST.major, - fieldName, - indexOptions, - currentPosition, - currentLength, - currentOverlap, - currentOffset, - 0, // maxTermFrequency - 0); // uniqueTermCount - ColumnStrideByteIndex normsIndex = - indexSegmentWriter.segmentData.createNormIndex(fieldName); - if (normsIndex != null) { - normsIndex.setValue(docId, (byte) indexSegmentWriter.similarity.computeNorm(state)); - } - } - } - - /** Inverts one field for one document; first is true - * if this is the first time we are seeing this field - * name in this document. 
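-     * Inversion walks the field's TokenStream, validates that positions and offsets only
-     * move forward, and hands each (docId, position) pair to the registered
-     * InvertedDocConsumer, updating per-field statistics such as sumTermDocFreq as it goes.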
*/ - public void invert(IndexableField field, - int docId, - boolean first, - boolean currentDocIsOffensive) throws IOException { - if (indexWriter == null) { - return; - } - if (first) { - currentPosition = -1; - currentOffset = 0; - lastPosition = 0; - lastStartOffset = 0; - - if (invertedField != null) { - invertedField.incrementNumDocs(); - } - } - - IndexableFieldType fieldType = field.fieldType(); - final boolean analyzed = fieldType.tokenized() && indexSegmentWriter.analyzer != null; - boolean succeededInProcessingField = false; - try { - tokenStream = field.tokenStream(indexSegmentWriter.analyzer, tokenStream); - tokenStream.reset(); - - PositionIncrementAttribute posIncrAttribute = - tokenStream.addAttribute(PositionIncrementAttribute.class); - OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class); - TermToBytesRefAttribute termAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class); - - Set seenTerms = new HashSet<>(); - indexWriter.start(tokenStream, currentDocIsOffensive); - while (tokenStream.incrementToken()) { - // If we hit an exception in stream.next below - // (which is fairly common, e.g. if analyzer - // chokes on a given document), then it's - // non-aborting and (above) this one document - // will be marked as deleted, but still - // consume a docID - - int posIncr = posIncrAttribute.getPositionIncrement(); - currentPosition += posIncr; - if (currentPosition < lastPosition) { - if (posIncr == 0) { - throw new IllegalArgumentException( - "first position increment must be > 0 (got 0) for field '" + field.name() + "'"); - } else if (posIncr < 0) { - throw new IllegalArgumentException( - "position increments (and gaps) must be >= 0 (got " + posIncr + ") for field '" - + field.name() + "'"); - } else { - throw new IllegalArgumentException( - "position overflowed Integer.MAX_VALUE (got posIncr=" + posIncr + " lastPosition=" - + lastPosition + " position=" + currentPosition + ") for field '" + field.name() - + "'"); - } - } else if (currentPosition > MAX_POSITION) { - throw new IllegalArgumentException( - "position " + currentPosition + " is too large for field '" + field.name() - + "': max allowed position is " + MAX_POSITION); - } - lastPosition = currentPosition; - if (posIncr == 0) { - currentOverlap++; - } - - int startOffset = currentOffset + offsetAttribute.startOffset(); - int endOffset = currentOffset + offsetAttribute.endOffset(); - if (startOffset < lastStartOffset || endOffset < startOffset) { - throw new IllegalArgumentException( - "startOffset must be non-negative, and endOffset must be >= startOffset, and " - + "offsets must not go backwards startOffset=" + startOffset + ",endOffset=" - + endOffset + ",lastStartOffset=" + lastStartOffset + " for field '" + field.name() - + "'"); - } - lastStartOffset = startOffset; - indexWriter.add(docId, currentPosition); - currentLength++; - - BytesRef term = termAtt.getBytesRef(); - if (seenTerms.add(term) && (invertedField != null)) { - invertedField.incrementSumTermDocFreq(); - } - } - - tokenStream.end(); - - currentPosition += posIncrAttribute.getPositionIncrement(); - currentOffset += offsetAttribute.endOffset(); - succeededInProcessingField = true; - } catch (BytesRefHash.MaxBytesLengthExceededException e) { - byte[] prefix = new byte[30]; - BytesRef bigTerm = tokenStream.getAttribute(TermToBytesRefAttribute.class).getBytesRef(); - System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30); - String msg = "Document contains at least one immense term in field=\"" + fieldName - + 
"\" (whose UTF8 encoding is longer than the max length), all of " - + "which were skipped." + "Please correct the analyzer to not produce such terms. " - + "The prefix of the first immense term is: '" + Arrays.toString(prefix) - + "...', original message: " + e.getMessage(); - LOG.warn(msg); - // Document will be deleted above: - throw new IllegalArgumentException(msg, e); - } finally { - if (!succeededInProcessingField) { - LOG.warn("An exception was thrown while processing field " + fieldName); - } - if (tokenStream != null) { - try { - tokenStream.close(); - } catch (IOException e) { - if (succeededInProcessingField) { - // only throw this exception if no other exception already occurred above - throw e; - } else { - LOG.warn("Exception while trying to close TokenStream.", e); - } - } - } - } - - if (analyzed) { - currentPosition += indexSegmentWriter.analyzer.getPositionIncrementGap(fieldName); - currentOffset += indexSegmentWriter.analyzer.getOffsetGap(fieldName); - } - } - } - - @Override - public int numDocs() { - return segmentData.getDocIDToTweetIDMapper().getNumDocs(); - } - - public interface InvertedDocConsumer { - /** - * Called for each document before inversion starts. - */ - void start(AttributeSource attributeSource, boolean currentDocIsOffensive); - - /** - * Called for each token in the current document. - * @param docID Document id. - * @param position Position in the token stream for this document. - */ - void add(int docID, int position) throws IOException; - - /** - * Called after the last token was added and before the next document is processed. - */ - void finish(); - } - - public interface StoredFieldsConsumer { - /** - * Adds a new stored fields. - */ - void addField(int docID, IndexableField field) throws IOException; - } - - /** - * This Builder allows registering listeners for a particular field of an indexable document. - * For each field name any number of listeners can be added. - * - * Using {@link #useDefaultConsumer} it can be specified whether this index writer will use - * the default consumer in addition to any additionally registered consumers. - */ - public abstract static class ConsumerBuilder { - private boolean useDefaultConsumer; - private final List consumers; - private final EarlybirdFieldType fieldType; - private final String fieldName; - - private ConsumerBuilder(String fieldName, EarlybirdFieldType fieldType) { - useDefaultConsumer = true; - consumers = Lists.newArrayList(); - this.fieldName = fieldName; - this.fieldType = fieldType; - } - - public String getFieldName() { - return fieldName; - } - - public EarlybirdFieldType getFieldType() { - return fieldType; - } - - /** - * If set to true, {@link EarlybirdRealtimeIndexSegmentWriter} will use the default consumer - * (e.g. build a default inverted index for an inverted field) in addition to any consumers - * added via {@link #addConsumer(Object)}. - */ - public void setUseDefaultConsumer(boolean useDefaultConsumer) { - this.useDefaultConsumer = useDefaultConsumer; - } - - public boolean isUseDefaultConsumer() { - return useDefaultConsumer; - } - - /** - * Allows registering any number of additional consumers for the field associated with this - * builder. 
- */ - public void addConsumer(T consumer) { - consumers.add(consumer); - } - - T build() { - if (consumers.isEmpty()) { - return null; - } else if (consumers.size() == 1) { - return consumers.get(0); - } else { - return build(consumers); - } - } - - abstract T build(List consumerList); - } - - public static final class StoredFieldsConsumerBuilder - extends ConsumerBuilder { - private StoredFieldsConsumerBuilder(String fieldName, EarlybirdFieldType fieldType) { - super(fieldName, fieldType); - } - - @Override - StoredFieldsConsumer build(final List consumers) { - return (docID, field) -> { - for (StoredFieldsConsumer consumer : consumers) { - consumer.addField(docID, field); - } - }; - } - } - - public static final class InvertedDocConsumerBuilder - extends ConsumerBuilder { - private final EarlybirdIndexSegmentData segmentData; - - private InvertedDocConsumerBuilder( - EarlybirdIndexSegmentData segmentData, String fieldName, EarlybirdFieldType fieldType) { - super(fieldName, fieldType); - this.segmentData = segmentData; - } - - @Override - InvertedDocConsumer build(final List consumers) { - return new InvertedDocConsumer() { - @Override - public void start(AttributeSource attributeSource, boolean currentDocIsOffensive) { - for (InvertedDocConsumer consumer : consumers) { - consumer.start(attributeSource, currentDocIsOffensive); - } - } - - @Override - public void finish() { - for (InvertedDocConsumer consumer : consumers) { - consumer.finish(); - } - } - - @Override - public void add(int docID, int position) throws IOException { - for (InvertedDocConsumer consumer : consumers) { - consumer.add(docID, position); - } - } - }; - } - - public EarlybirdIndexSegmentData getSegmentData() { - return segmentData; - } - } - - /** - * Returns true, if a field should not be indexed. - * @deprecated This writer should be able to process all fields in the future. - */ - @Deprecated - private static boolean skipField(String fieldName) { - // ignore lucene facet fields for realtime index, we are handling it differently for now. - return fieldName.startsWith(FacetsConfig.DEFAULT_INDEX_FIELD_NAME); - } - - private static Field buildAllDocsField(EarlybirdRealtimeIndexSegmentData segmentData) { - String fieldName = EarlybirdFieldConstants.EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(); - if (segmentData.getSchema().hasField(fieldName)) { - Schema.FieldInfo fi = Preconditions.checkNotNull( - segmentData.getSchema().getFieldInfo(fieldName)); - return new Field(fi.getName(), AllDocsIterator.ALL_DOCS_TERM, fi.getFieldType()); - } - - return null; - } - - /** - * Every document must have this field and term, so that we can safely iterate through documents - * using {@link AllDocsIterator}. This is to prevent the problem of adding a tweet to the doc ID - * mapper, and returning it for a match-all query when the rest of the document hasn't been - * published. This could lead to queries returning incorrect results for queries that are only - * negations. 
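-   * The catch-all field is added to the Document itself, so it gets inverted along with
-   * the document's other fields, rather than being implied by the doc ID mapping alone.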
- * */ - private void addAllDocsField(Document doc) { - if (allDocsField != null) { - doc.add(allDocsField); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/QueryCacheResultForSegment.java b/src/java/com/twitter/search/core/earlybird/index/QueryCacheResultForSegment.java deleted file mode 100644 index 21d2e0d29..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/QueryCacheResultForSegment.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import org.apache.lucene.search.DocIdSet; - -/** - * Class to hold the actual cache which provides a doc id iterator to walk through the cache/result. - * - * An instance holds the results for a single query of the different ones defined in querycache.yml. - */ -public class QueryCacheResultForSegment { - private final DocIdSet docIdSet; - private final int smallestDocID; - private final long cardinality; - - /** - * Stores query cache results. - * - * @param docIdSet Documents in the cache. - * @param cardinality Size of the cache. - * @param smallestDocID The most recently posted document contained in the cache. - */ - public QueryCacheResultForSegment(DocIdSet docIdSet, long cardinality, int smallestDocID) { - this.docIdSet = docIdSet; - this.smallestDocID = smallestDocID; - this.cardinality = cardinality; - } - - public DocIdSet getDocIdSet() { - return docIdSet; - } - - public int getSmallestDocID() { - return smallestDocID; - } - - public long getCardinality() { - return cardinality; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/SequentialDocIDMapper.java b/src/java/com/twitter/search/core/earlybird/index/SequentialDocIDMapper.java deleted file mode 100644 index 3d029dcb2..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/SequentialDocIDMapper.java +++ /dev/null @@ -1,87 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -/** - * A doc ID mapper that assigns doc IDs sequentially in decreasing order, starting with the given - * max ID. Used by Expertsearch, which doesn't index tweets. - */ -public class SequentialDocIDMapper implements DocIDToTweetIDMapper { - private final int maxSegmentSize; - private int lastAssignedDocId; - - public SequentialDocIDMapper(int maxSegmentSize) { - this.maxSegmentSize = maxSegmentSize; - lastAssignedDocId = maxSegmentSize; - } - - @Override - public long getTweetID(int docID) { - // Should be used only at segment optimization time and in tests. - if ((docID < lastAssignedDocId) || (docID >= maxSegmentSize)) { - return ID_NOT_FOUND; - } - - return docID; - } - - @Override - public int getDocID(long tweetID) { - // Should be used only at segment optimization time and in tests. - if ((tweetID < lastAssignedDocId) || (tweetID >= maxSegmentSize)) { - return ID_NOT_FOUND; - } - - return (int) tweetID; - } - - @Override - public int getNumDocs() { - return maxSegmentSize - lastAssignedDocId; - } - - @Override - public int getNextDocID(int docID) { - int nextDocID = docID + 1; - - // nextDocID is larger than any doc ID that can be assigned by this mapper. - if (nextDocID >= maxSegmentSize) { - return ID_NOT_FOUND; - } - - // nextDocID is smaller than any doc ID assigned by this mapper so far. - if (nextDocID < lastAssignedDocId) { - return lastAssignedDocId; - } - - // nextDocID is in the range of doc IDs assigned by this mapper. 
- return nextDocID; - } - - @Override - public int getPreviousDocID(int docID) { - int previousDocID = docID - 1; - - // previousDocID is larger than any doc ID that can be assigned by this mapper. - if (previousDocID >= maxSegmentSize) { - return maxSegmentSize - 1; - } - - // previousDocID is smaller than any doc ID assigned by this mapper so far. - if (previousDocID < lastAssignedDocId) { - return ID_NOT_FOUND; - } - - // previousDocID is in the range of doc IDs assigned by this mapper. - return previousDocID; - } - - @Override - public int addMapping(final long tweetID) { - return --lastAssignedDocId; - } - - @Override - public DocIDToTweetIDMapper optimize() { - // Segments that use this DocIDToTweetIDMapper should never be optimized. - throw new UnsupportedOperationException("SequentialDocIDMapper cannot be optimized."); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/TimeMapper.java b/src/java/com/twitter/search/core/earlybird/index/TimeMapper.java deleted file mode 100644 index e2f609168..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/TimeMapper.java +++ /dev/null @@ -1,80 +0,0 @@ -package com.twitter.search.core.earlybird.index; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.Flushable; - -/** - * Maps timestamps to the doc IDs assigned to the documents that are indexed (tweets, users, etc.). - */ -public interface TimeMapper extends Flushable { - // Unless specified, all time fields are seconds-since-epoch. - int ILLEGAL_TIME = Integer.MIN_VALUE; - - /** - * Returns the time of the newest tweet in the index. - * - * @return The time of the newest tweet in the index. - */ - int getLastTime(); - - /** - * Returns the time of the oldest tweet in the index. - * - * @return The time of the oldest tweet in the index. - */ - int getFirstTime(); - - /** - * Returns the timestamp of the document mapped to the given doc ID, or ILLEGAL_TIME if this - * mapper doesn't know about this doc ID. - * - * @param docID The document's internal ID. - * @return The timestamp of the document mapped to the given doc ID. - */ - int getTime(int docID); - - /** - * Returns the doc ID of the first indexed document with a timestamp equal to or greater than the - * given timestamp. - * - * If timeSeconds is larger than the max timestamp in this mapper, smallestDocID is returned. - * If timeSeconds is smaller than the min timestamp in the mapper, the largest docID is returned. - * - * Note that when tweets are indexed out of order, this method might return the doc ID of a tweet - * with a timestamp greater than timeSeconds, even if there's a tweet with a timestamp of - * timeSeconds. So the callers of this method can use the returned doc ID as a starting point for - * iteration purposes, but should have a check that the traversed doc IDs have a timestamp in the - * desired range. See SinceUntilFilter.getDocIdSet() for an example. - * - * Example: - * DocIds: 6, 5, 4, 3, 2, 1, 0 - * Times: 1, 5, 3, 4, 4, 3, 6 - * With that data: - * findFirstDocId(1, 0) should return 6. - * findFirstDocId(3, 0) should return 5. - * findFirstDocId(4, 0) should return 5. - * findFirstDocId(5, 0) should return 5. - * findFirstDocId(6, 0) should return 0. - * - * @param timeSeconds The boundary timestamp, in seconds. - * @param smallestDocID The doc ID to return if the given time boundary is larger than the max - * timestamp in this mapper. 
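-   * @return The doc ID of the first indexed document whose timestamp is greater than or
-   *         equal to timeSeconds, subject to the out-of-order caveat described above.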
- */ - int findFirstDocId(int timeSeconds, int smallestDocID) throws IOException; - - /** - * Optimizes this time mapper. - * - * At segment optimization time, the doc IDs assigned to the documents in that segment might - * change (they might be mapped to a more compact space for performance reasons, for example). - * When that happens, we need to remap accordingly the doc IDs stored in the time mapper for that - * segment too. It would also be a good time to optimize the data stored in the time mapper. - * - * @param originalDocIdMapper The doc ID mapper used by this segment before it was optimized. - * @param optimizedDocIdMapper The doc ID mapper used by this segment after it was optimized. - * @return An optimized TimeMapper with the same tweet IDs. - */ - TimeMapper optimize(DocIDToTweetIDMapper originalDocIdMapper, - DocIDToTweetIDMapper optimizedDocIdMapper) throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/AbstractColumnStrideMultiIntIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/AbstractColumnStrideMultiIntIndex.java deleted file mode 100644 index d7dc63910..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/AbstractColumnStrideMultiIntIndex.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.encoding.docvalues.CSFTypeUtil; -import com.twitter.search.common.util.io.flushable.Flushable; - -public abstract class AbstractColumnStrideMultiIntIndex - extends ColumnStrideFieldIndex implements Flushable { - private static final int NUM_BYTES_PER_INT = java.lang.Integer.SIZE / java.lang.Byte.SIZE; - - private final int numIntsPerField; - - protected AbstractColumnStrideMultiIntIndex(String name, int numIntsPerField) { - super(name); - this.numIntsPerField = numIntsPerField; - } - - public int getNumIntsPerField() { - return numIntsPerField; - } - - @Override - public long get(int docID) { - throw new UnsupportedOperationException(); - } - - /** - * Returns the value stored at the given index for the given doc ID. - */ - public abstract int get(int docID, int valueIndex); - - /** - * Sets the value stored at the given index for the given doc ID. 
- */ - public abstract void setValue(int docID, int valueIndex, int val); - - @Override - public void load(LeafReader atomicReader, String field) throws IOException { - BinaryDocValues docValues = atomicReader.getBinaryDocValues(field); - int numBytesPerDoc = numIntsPerField * NUM_BYTES_PER_INT; - - for (int docID = 0; docID < atomicReader.maxDoc(); docID++) { - Preconditions.checkState(docValues.advanceExact(docID)); - BytesRef scratch = docValues.binaryValue(); - Preconditions.checkState( - scratch.length == numBytesPerDoc, - "Unexpected doc value length for field " + field - + ": Should be " + numBytesPerDoc + ", but was " + scratch.length); - - scratch.length = NUM_BYTES_PER_INT; - for (int i = 0; i < numIntsPerField; i++) { - setValue(docID, i, asInt(scratch)); - scratch.offset += NUM_BYTES_PER_INT; - } - } - } - - public void updateDocValues(BytesRef ref, int docID) { - for (int i = 0; i < numIntsPerField; i++) { - setValue(docID, i, CSFTypeUtil.convertFromBytes(ref.bytes, ref.offset, i)); - } - } - - private static int asInt(BytesRef b) { - return asInt(b, b.offset); - } - - private static int asInt(BytesRef b, int pos) { - int p = pos; - return (b.bytes[p++] << 24) | (b.bytes[p++] << 16) | (b.bytes[p++] << 8) | (b.bytes[p] & 0xFF); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideByteIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideByteIndex.java deleted file mode 100644 index 8dd783504..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideByteIndex.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2ByteOpenHashMap; - -public class ColumnStrideByteIndex extends ColumnStrideFieldIndex implements Flushable { - private final Int2ByteOpenHashMap values; - private final int maxSize; - - public ColumnStrideByteIndex(String name, int maxSize) { - super(name); - values = new Int2ByteOpenHashMap(maxSize); // default unset value is 0 - this.maxSize = maxSize; - } - - private ColumnStrideByteIndex(String name, Int2ByteOpenHashMap values, int maxSize) { - super(name); - this.values = values; - this.maxSize = maxSize; - } - - @Override - public void setValue(int docID, long value) { - values.put(docID, (byte) value); - } - - @Override - public long get(int docID) { - return values.get(docID); - } - - @Override - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedColumnStrideByteIndex(this, originalTweetIdMapper, optimizedTweetIdMapper); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - private static final String MAX_SIZE_PROP = "maxSize"; - - public FlushHandler() { - super(); - } - - public FlushHandler(ColumnStrideByteIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) 
throws IOException { - ColumnStrideByteIndex index = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, index.getName()); - flushInfo.addIntProperty(MAX_SIZE_PROP, index.maxSize); - - out.writeInt(index.values.size()); - for (Int2ByteOpenHashMap.Entry entry : index.values.int2ByteEntrySet()) { - out.writeInt(entry.getIntKey()); - out.writeByte(entry.getByteValue()); - } - } - - @Override - protected ColumnStrideByteIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int size = in.readInt(); - int maxSize = flushInfo.getIntProperty(MAX_SIZE_PROP); - Int2ByteOpenHashMap map = new Int2ByteOpenHashMap(maxSize); - for (int i = 0; i < size; i++) { - map.put(in.readInt(), in.readByte()); - } - return new ColumnStrideByteIndex(flushInfo.getStringProperty(NAME_PROP_NAME), map, maxSize); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldDocValues.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldDocValues.java deleted file mode 100644 index 9b6a1d6d6..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldDocValues.java +++ /dev/null @@ -1,76 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; - -/** - * A NumericDocValues implementation that uses an AllDocsIterator to iterate through all docs, and - * gets its values from a ColumnStrideFieldIndex instance. - */ -public class ColumnStrideFieldDocValues extends NumericDocValues { - private final ColumnStrideFieldIndex csf; - private final AllDocsIterator iterator; - - public ColumnStrideFieldDocValues(ColumnStrideFieldIndex csf, LeafReader reader) - throws IOException { - this.csf = Preconditions.checkNotNull(csf); - this.iterator = new AllDocsIterator(Preconditions.checkNotNull(reader)); - } - - @Override - public long longValue() { - return csf.get(docID()); - } - - @Override - public int docID() { - return iterator.docID(); - } - - @Override - public int nextDoc() throws IOException { - return iterator.nextDoc(); - } - - @Override - public int advance(int target) throws IOException { - return iterator.advance(target); - } - - @Override - public boolean advanceExact(int target) throws IOException { - // The javadocs for advance() and advanceExact() are inconsistent. advance() allows the target - // to be smaller than the current doc ID, and requires the iterator to advance the current doc - // ID past the target, and past the current doc ID. So essentially, advance(target) returns - // max(target, currentDocId + 1). At the same time, advanceExact() is undefined if the target is - // smaller than the current do ID (or if it's an invalid doc ID), and always returns the target. - // So essentially, advanceExact(target) should always set the current doc ID to the given target - // and if target == currentDocId, then currentDocId should not be advanced. This is why we have - // these extra checks here instead of moving them to advance(). 
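    // For illustration from the caller's side (a simplified consumer sketch; "reader" is a
    // LeafReader for this segment and the field name is purely hypothetical):
    //
    //   NumericDocValues dv = reader.getNumericDocValues("favorite_count");
    //   for (int doc = 0; doc < reader.maxDoc(); doc++) {
    //     if (dv != null && dv.advanceExact(doc)) {  // targets visited in increasing order
    //       long value = dv.longValue();
    //     }
    //   }
    //
    // Because a ColumnStrideFieldIndex has a (possibly zero) value for every doc ID, the
    // advanceExact() implementation below returns true for any valid, non-decreasing target.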
- Preconditions.checkState( - target >= docID(), - "ColumnStrideFieldDocValues.advance() for field %s called with target %s, " - + "but the current doc ID is %s.", - csf.getName(), - target, - docID()); - if (target == docID()) { - return true; - } - - // We don't need to check if we have a value for 'target', because a ColumnStrideFieldIndex - // instance has a value for every doc ID (though that value might be 0). - return advance(target) == target; - } - - @Override - public long cost() { - return iterator.cost(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldIndex.java deleted file mode 100644 index 0f5261f0a..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideFieldIndex.java +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -/** - * Get an underlying data for a field by calling - * EarlybirdIndexSegmentAtomicReader#getNumericDocValues(String). - */ -public abstract class ColumnStrideFieldIndex { - private final String name; - - public ColumnStrideFieldIndex(String name) { - this.name = name; - } - - public String getName() { - return name; - } - - /** - * Returns the CSF value for the given doc ID. - */ - public abstract long get(int docID); - - /** - * Updates the CSF value for the given doc ID to the given value. - */ - public void setValue(int docID, long value) { - throw new UnsupportedOperationException(); - } - - /** - * Loads the CSF from an AtomicReader. - */ - public void load(LeafReader atomicReader, String field) throws IOException { - NumericDocValues docValues = atomicReader.getNumericDocValues(field); - if (docValues != null) { - for (int i = 0; i < atomicReader.maxDoc(); i++) { - if (docValues.advanceExact(i)) { - setValue(i, docValues.longValue()); - } - } - } - } - - /** - * Optimizes the representation of this column stride field, and remaps its doc IDs, if necessary. - * - * @param originalTweetIdMapper The original tweet ID mapper. - * @param optimizedTweetIdMapper The optimized tweet ID mapper. - * @return An optimized column stride field equivalent to this CSF, - * with possibly remapped doc IDs. 
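   * <p>A simplified sketch of the remapping performed by the optimized implementations later
   * in this diff (this base class's own implementation below just returns {@code this}); the
   * tweet ID is the stable key between the two doc ID spaces:
   * <pre>{@code
   *   int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE);
   *   long[] remapped = new long[maxDocId + 1];
   *   int docId = optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE);
   *   while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) {
   *     int originalDocId =
   *         originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId));
   *     remapped[docId] = this.get(originalDocId);
   *     docId = optimizedTweetIdMapper.getNextDocID(docId);
   *   }
   * }</pre>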
- */ - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return this; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntIndex.java deleted file mode 100644 index 7bb0d1b02..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntIndex.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -public class ColumnStrideIntIndex extends ColumnStrideFieldIndex implements Flushable { - private final Int2IntOpenHashMap values; - private final int maxSize; - - public ColumnStrideIntIndex(String name, int maxSize) { - super(name); - values = new Int2IntOpenHashMap(maxSize); // default unset value is 0 - this.maxSize = maxSize; - } - - public ColumnStrideIntIndex(String name, Int2IntOpenHashMap values, int maxSize) { - super(name); - this.values = values; - this.maxSize = maxSize; - } - - @Override - public void setValue(int docID, long value) { - values.put(docID, (int) value); - } - - @Override - public long get(int docID) { - return values.get(docID); - } - - @Override - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedColumnStrideIntIndex(this, originalTweetIdMapper, optimizedTweetIdMapper); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - private static final String MAX_SIZE_PROP = "maxSize"; - - public FlushHandler() { - super(); - } - - public FlushHandler(ColumnStrideIntIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - ColumnStrideIntIndex index = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, index.getName()); - flushInfo.addIntProperty(MAX_SIZE_PROP, index.maxSize); - - out.writeInt(index.values.size()); - for (Int2IntOpenHashMap.Entry entry : index.values.int2IntEntrySet()) { - out.writeInt(entry.getIntKey()); - out.writeInt(entry.getIntValue()); - } - } - - @Override - protected ColumnStrideIntIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int size = in.readInt(); - int maxSize = flushInfo.getIntProperty(MAX_SIZE_PROP); - Int2IntOpenHashMap map = new Int2IntOpenHashMap(maxSize); - for (int i = 0; i < size; i++) { - map.put(in.readInt(), in.readInt()); - } - return new ColumnStrideIntIndex(flushInfo.getStringProperty(NAME_PROP_NAME), map, maxSize); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntViewIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntViewIndex.java deleted file mode 100644 index 9084bc0ed..000000000 --- 
a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideIntViewIndex.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import com.twitter.search.common.encoding.features.IntegerEncodedFeatures; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -/** - * An Int CSF view on top of an {@link AbstractColumnStrideMultiIntIndex}. - * - * Used for decoding encoded packed features and exposing them as - * {@link org.apache.lucene.index.NumericDocValues}. - */ -public class ColumnStrideIntViewIndex extends ColumnStrideFieldIndex { - private static class IntViewIntegerEncodedFeatures extends IntegerEncodedFeatures { - private final AbstractColumnStrideMultiIntIndex baseIndex; - private final int docID; - - public IntViewIntegerEncodedFeatures(AbstractColumnStrideMultiIntIndex baseIndex, int docID) { - this.baseIndex = baseIndex; - this.docID = docID; - } - - @Override - public int getInt(int pos) { - return baseIndex.get(docID, pos); - } - - @Override - public void setInt(int pos, int value) { - baseIndex.setValue(docID, pos, value); - } - - @Override - public int getNumInts() { - return baseIndex.getNumIntsPerField(); - } - } - - private final AbstractColumnStrideMultiIntIndex baseIndex; - private final FeatureConfiguration featureConfiguration; - - /** - * Creates a new ColumnStrideIntViewIndex on top of an existing AbstractColumnStrideMultiIntIndex. - */ - public ColumnStrideIntViewIndex(Schema.FieldInfo info, - AbstractColumnStrideMultiIntIndex baseIndex) { - super(info.getName()); - this.baseIndex = baseIndex; - this.featureConfiguration = info.getFieldType().getCsfViewFeatureConfiguration(); - } - - @Override - public long get(int docID) { - IntegerEncodedFeatures encodedFeatures = new IntViewIntegerEncodedFeatures(baseIndex, docID); - return encodedFeatures.getFeatureValue(featureConfiguration); - } - - @Override - public void setValue(int docID, long value) { - IntegerEncodedFeatures encodedFeatures = new IntViewIntegerEncodedFeatures(baseIndex, docID); - encodedFeatures.setFeatureValue(featureConfiguration, (int) value); - } - - @Override - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, DocIDToTweetIDMapper optimizedTweetIdMapper) { - throw new UnsupportedOperationException( - "ColumnStrideIntViewIndex instances do not support optimization"); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideLongIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideLongIndex.java deleted file mode 100644 index 37321b8b8..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideLongIndex.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2LongOpenHashMap; - -public class ColumnStrideLongIndex extends ColumnStrideFieldIndex implements Flushable { - private final Int2LongOpenHashMap values; - private final int maxSize; - - public 
ColumnStrideLongIndex(String name, int maxSize) { - super(name); - values = new Int2LongOpenHashMap(maxSize); // default unset value is 0 - this.maxSize = maxSize; - } - - private ColumnStrideLongIndex(String name, Int2LongOpenHashMap values, int maxSize) { - super(name); - this.values = values; - this.maxSize = maxSize; - } - - @Override - public void setValue(int docID, long value) { - values.put(docID, value); - } - - @Override - public long get(int docID) { - return values.get(docID); - } - - @Override - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedColumnStrideLongIndex(this, originalTweetIdMapper, optimizedTweetIdMapper); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - private static final String MAX_SIZE_PROP = "maxSize"; - - public FlushHandler() { - super(); - } - - public FlushHandler(ColumnStrideLongIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - ColumnStrideLongIndex index = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, index.getName()); - flushInfo.addIntProperty(MAX_SIZE_PROP, index.maxSize); - - out.writeInt(index.values.size()); - for (Int2LongOpenHashMap.Entry entry : index.values.int2LongEntrySet()) { - out.writeInt(entry.getIntKey()); - out.writeLong(entry.getLongValue()); - } - } - - @Override - protected ColumnStrideLongIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int size = in.readInt(); - int maxSize = flushInfo.getIntProperty(MAX_SIZE_PROP); - Int2LongOpenHashMap map = new Int2LongOpenHashMap(maxSize); - for (int i = 0; i < size; i++) { - map.put(in.readInt(), in.readLong()); - } - return new ColumnStrideLongIndex(flushInfo.getStringProperty(NAME_PROP_NAME), map, maxSize); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideMultiIntIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideMultiIntIndex.java deleted file mode 100644 index ccbf99a29..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ColumnStrideMultiIntIndex.java +++ /dev/null @@ -1,102 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -public class ColumnStrideMultiIntIndex extends AbstractColumnStrideMultiIntIndex { - private final Int2IntOpenHashMap[] values; - private final int maxSize; - - public ColumnStrideMultiIntIndex(String name, int maxSize, int numIntsPerField) { - super(name, numIntsPerField); - values = new Int2IntOpenHashMap[numIntsPerField]; - for (int i = 0; i < numIntsPerField; i++) { - values[i] = new Int2IntOpenHashMap(maxSize); // default unset value is 0 - } - this.maxSize = maxSize; - } - - public ColumnStrideMultiIntIndex(String name, Int2IntOpenHashMap[] values, int maxSize) { - super(name, 
values.length); - this.values = values; - this.maxSize = maxSize; - } - - @Override - public void setValue(int docID, int valueIndex, int value) { - values[valueIndex].put(docID, value); - } - - @Override - public int get(int docID, int valueIndex) { - return values[valueIndex].get(docID); - } - - @Override - public ColumnStrideFieldIndex optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedColumnStrideMultiIntIndex( - this, originalTweetIdMapper, optimizedTweetIdMapper); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - private static final String MAX_SIZE_PROP = "maxSize"; - - public FlushHandler() { - super(); - } - - public FlushHandler(ColumnStrideMultiIntIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - ColumnStrideMultiIntIndex index = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, index.getName()); - flushInfo.addIntProperty(MAX_SIZE_PROP, index.maxSize); - - out.writeInt(index.values.length); - for (int i = 0; i < index.values.length; i++) { - Int2IntOpenHashMap map = index.values[i]; - out.writeInt(map.size()); - for (Int2IntOpenHashMap.Entry entry : map.int2IntEntrySet()) { - out.writeInt(entry.getIntKey()); - out.writeInt(entry.getIntValue()); - } - } - } - - @Override - protected ColumnStrideMultiIntIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int numIntsPerField = in.readInt(); - int maxSize = flushInfo.getIntProperty(MAX_SIZE_PROP); - Int2IntOpenHashMap[] values = new Int2IntOpenHashMap[numIntsPerField]; - for (int i = 0; i < numIntsPerField; i++) { - int size = in.readInt(); - Int2IntOpenHashMap map = new Int2IntOpenHashMap(maxSize); - for (int j = 0; j < size; j++) { - map.put(in.readInt(), in.readInt()); - } - values[i] = map; - } - return new ColumnStrideMultiIntIndex( - flushInfo.getStringProperty(NAME_PROP_NAME), values, maxSize); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/ConstantColumnStrideFieldIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/ConstantColumnStrideFieldIndex.java deleted file mode 100644 index 3da67aec3..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/ConstantColumnStrideFieldIndex.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -/** - * A ColumnStrideFieldIndex implementation that always returns the same value. 
- */ -public class ConstantColumnStrideFieldIndex extends ColumnStrideFieldIndex { - private final long defaultValue; - - public ConstantColumnStrideFieldIndex(String name, long defaultValue) { - super(name); - this.defaultValue = defaultValue; - } - - @Override - public long get(int docID) { - return defaultValue; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/DocValuesManager.java b/src/java/com/twitter/search/core/earlybird/index/column/DocValuesManager.java deleted file mode 100644 index 2e1e61b4b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/DocValuesManager.java +++ /dev/null @@ -1,248 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; -import java.util.Iterator; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public abstract class DocValuesManager implements Flushable { - protected final Schema schema; - protected final int segmentSize; - protected final ConcurrentHashMap columnStrideFields; - - public DocValuesManager(Schema schema, int segmentSize) { - this(schema, segmentSize, new ConcurrentHashMap<>()); - } - - protected DocValuesManager(Schema schema, - int segmentSize, - ConcurrentHashMap columnStrideFields) { - this.schema = Preconditions.checkNotNull(schema); - this.segmentSize = segmentSize; - this.columnStrideFields = columnStrideFields; - } - - protected abstract ColumnStrideFieldIndex newByteCSF(String field); - protected abstract ColumnStrideFieldIndex newIntCSF(String field); - protected abstract ColumnStrideFieldIndex newLongCSF(String field); - protected abstract ColumnStrideFieldIndex newMultiIntCSF(String field, int numIntsPerField); - - /** - * Optimize this doc values manager, and return a doc values manager a more compact and fast - * encoding for doc values (but that we can't add new doc IDs to). - */ - public abstract DocValuesManager optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException; - - public Set getDocValueNames() { - return columnStrideFields.keySet(); - } - - /** - * Creates a new {@link ColumnStrideFieldIndex} for the given field and returns it. - */ - public ColumnStrideFieldIndex addColumnStrideField(String field, EarlybirdFieldType fieldType) { - // For CSF view fields, we will perform the same check on the base field when we try to create - // a ColumnStrideFieldIndex for them in newIntViewCSF(). 
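    // As background: a CSF "view" stores no data of its own; it decodes a bit range out of one
    // of the ints of its base multi-int field. A rough sketch of that decode (the shift/mask
    // names are assumptions; the real logic lives in IntegerEncodedFeatures and
    // FeatureConfiguration):
    //
    //   int packedInt = baseIndex.get(docID, featureIntIndex);
    //   int featureValue = (packedInt >>> featureBitShift) & featureBitMask;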
- if (!fieldType.isCsfViewField()) { - Preconditions.checkState( - fieldType.isCsfLoadIntoRam(), "Field %s is not loaded in RAM", field); - } - - if (columnStrideFields.containsKey(field)) { - return columnStrideFields.get(field); - } - - final ColumnStrideFieldIndex index; - switch (fieldType.getCsfType()) { - case BYTE: - index = newByteCSF(field); - break; - case INT: - if (fieldType.getCsfFixedLengthNumValuesPerDoc() > 1) { - index = newMultiIntCSF(field, fieldType.getCsfFixedLengthNumValuesPerDoc()); - } else if (fieldType.isCsfViewField()) { - index = newIntViewCSF(field); - } else { - index = newIntCSF(field); - } - break; - case LONG: - index = newLongCSF(field); - break; - default: - throw new RuntimeException("Invalid CsfType."); - } - - columnStrideFields.put(field, index); - return index; - } - - protected ColumnStrideFieldIndex newIntViewCSF(String field) { - Schema.FieldInfo info = Preconditions.checkNotNull(schema.getFieldInfo(field)); - Schema.FieldInfo baseFieldInfo = Preconditions.checkNotNull( - schema.getFieldInfo(info.getFieldType().getCsfViewBaseFieldId())); - - Preconditions.checkState( - baseFieldInfo.getFieldType().isCsfLoadIntoRam(), - "Field %s has a base field (%s) that is not loaded in RAM", - field, baseFieldInfo.getName()); - - // We might not have a CSF for the base field yet. - ColumnStrideFieldIndex baseFieldIndex = - addColumnStrideField(baseFieldInfo.getName(), baseFieldInfo.getFieldType()); - Preconditions.checkNotNull(baseFieldIndex); - Preconditions.checkState(baseFieldIndex instanceof AbstractColumnStrideMultiIntIndex); - return new ColumnStrideIntViewIndex(info, (AbstractColumnStrideMultiIntIndex) baseFieldIndex); - } - - /** - * Returns the ColumnStrideFieldIndex instance for the given field. - */ - public ColumnStrideFieldIndex getColumnStrideFieldIndex(String field) { - ColumnStrideFieldIndex docValues = columnStrideFields.get(field); - if (docValues == null) { - Schema.FieldInfo info = schema.getFieldInfo(field); - if (info != null && info.getFieldType().isCsfDefaultValueSet()) { - return new ConstantColumnStrideFieldIndex(field, info.getFieldType().getCsfDefaultValue()); - } - } - - return docValues; - } - - private static final String CSF_INDEX_CLASS_NAME_PROP_NAME = "csfIndexClassName"; - private static final String CSF_PROP_NAME = "column_stride_fields"; - protected static final String MAX_SEGMENT_SIZE_PROP_NAME = "maxSegmentSize"; - - private static Map> getIntViewFields(Schema schema) { - Map> intViewFields = Maps.newHashMap(); - for (Schema.FieldInfo fieldInfo : schema.getFieldInfos()) { - if (fieldInfo.getFieldType().isCsfViewField()) { - Schema.FieldInfo baseFieldInfo = Preconditions.checkNotNull( - schema.getFieldInfo(fieldInfo.getFieldType().getCsfViewBaseFieldId())); - String baseFieldName = baseFieldInfo.getName(); - Set intViewFieldsForBaseField = - intViewFields.computeIfAbsent(baseFieldName, k -> Sets.newHashSet()); - intViewFieldsForBaseField.add(fieldInfo); - } - } - return intViewFields; - } - - public abstract static class FlushHandler extends Handler { - private final Schema schema; - - public FlushHandler(Schema schema) { - this.schema = schema; - } - - public FlushHandler(DocValuesManager docValuesManager) { - super(docValuesManager); - this.schema = docValuesManager.schema; - } - - @Override - public void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - long startTime = getClock().nowMillis(); - - DocValuesManager docValuesManager = getObjectToFlush(); - 
flushInfo.addIntProperty(MAX_SEGMENT_SIZE_PROP_NAME, docValuesManager.segmentSize); - long sizeBeforeFlush = out.length(); - FlushInfo csfProps = flushInfo.newSubProperties(CSF_PROP_NAME); - for (ColumnStrideFieldIndex csf : docValuesManager.columnStrideFields.values()) { - if (!(csf instanceof ColumnStrideIntViewIndex)) { - Preconditions.checkState( - csf instanceof Flushable, - "Cannot flush column stride field {} of type {}", - csf.getName(), csf.getClass().getCanonicalName()); - FlushInfo info = csfProps.newSubProperties(csf.getName()); - info.addStringProperty(CSF_INDEX_CLASS_NAME_PROP_NAME, csf.getClass().getCanonicalName()); - ((Flushable) csf).getFlushHandler().flush(info, out); - } - } - csfProps.setSizeInBytes(out.length() - sizeBeforeFlush); - getFlushTimerStats().timerIncrement(getClock().nowMillis() - startTime); - } - - @Override - public DocValuesManager doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - long startTime = getClock().nowMillis(); - Map> intViewFields = getIntViewFields(schema); - - FlushInfo csfProps = flushInfo.getSubProperties(CSF_PROP_NAME); - ConcurrentHashMap columnStrideFields = - new ConcurrentHashMap<>(); - - Iterator csfPropIter = csfProps.getKeyIterator(); - while (csfPropIter.hasNext()) { - String fieldName = csfPropIter.next(); - try { - FlushInfo info = csfProps.getSubProperties(fieldName); - String className = info.getStringProperty(CSF_INDEX_CLASS_NAME_PROP_NAME); - Class fieldIndexType = - (Class) Class.forName(className); - Preconditions.checkNotNull( - fieldIndexType, - "Invalid field configuration: field " + fieldName + " not found in config."); - - for (Class c : fieldIndexType.getDeclaredClasses()) { - if (Handler.class.isAssignableFrom(c)) { - @SuppressWarnings("rawtypes") - Handler handler = (Handler) c.newInstance(); - ColumnStrideFieldIndex index = (ColumnStrideFieldIndex) handler.load( - csfProps.getSubProperties(fieldName), in); - columnStrideFields.put(fieldName, index); - - // If this is a base field, create ColumnStrideIntViewIndex instances for all the - // view fields based on it. - if (index instanceof AbstractColumnStrideMultiIntIndex) { - AbstractColumnStrideMultiIntIndex multiIntIndex = - (AbstractColumnStrideMultiIntIndex) index; - - // We should have AbstractColumnStrideMultiIntIndex instances only for base fields - // and all our base fields have views defined on top of them. 
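          // Note that doFlush() above skips ColumnStrideIntViewIndex instances entirely, since
          // views hold no data of their own; they are therefore rebuilt here from the freshly
          // loaded base index rather than deserialized, e.g. (names purely illustrative):
          //
          //   columnStrideFields.put("reply_count",
          //       new ColumnStrideIntViewIndex(replyCountFieldInfo, multiIntIndex));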
- for (Schema.FieldInfo intViewFieldInfo : intViewFields.get(fieldName)) { - columnStrideFields.put( - intViewFieldInfo.getName(), - new ColumnStrideIntViewIndex(intViewFieldInfo, multiIntIndex)); - } - } - - break; - } - } - } catch (ClassNotFoundException | IllegalAccessException | InstantiationException e) { - throw new IOException( - "Invalid field configuration for column stride field: " + fieldName, e); - } - } - getLoadTimerStats().timerIncrement(getClock().nowMillis() - startTime); - - return createDocValuesManager( - schema, - flushInfo.getIntProperty(MAX_SEGMENT_SIZE_PROP_NAME), - columnStrideFields); - } - - protected abstract DocValuesManager createDocValuesManager( - Schema docValuesSchema, - int maxSegmentSize, - ConcurrentHashMap columnStrideFields); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/DocValuesUpdate.java b/src/java/com/twitter/search/core/earlybird/index/column/DocValuesUpdate.java deleted file mode 100644 index 1f49cc3c7..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/DocValuesUpdate.java +++ /dev/null @@ -1,8 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -public interface DocValuesUpdate { - /** - * Performs an doc values update on the given document. - */ - void update(ColumnStrideFieldIndex docValues, int docID); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideByteIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideByteIndex.java deleted file mode 100644 index 93a6faea3..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideByteIndex.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class OptimizedColumnStrideByteIndex extends ColumnStrideFieldIndex implements Flushable { - private final byte[] values; - - public OptimizedColumnStrideByteIndex(String name, int maxSize) { - super(name); - values = new byte[maxSize]; - } - - public OptimizedColumnStrideByteIndex( - ColumnStrideByteIndex columnStrideByteIndex, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(columnStrideByteIndex.getName()); - int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - values = new byte[maxDocId + 1]; - - int docId = optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE); - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - int originalDocId = originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId)); - setValue(docId, columnStrideByteIndex.get(originalDocId)); - docId = optimizedTweetIdMapper.getNextDocID(docId); - } - } - - private OptimizedColumnStrideByteIndex(String name, byte[] values) { - super(name); - this.values = values; - } - - @Override - public void setValue(int docID, long value) { - this.values[docID] = (byte) value; - } - - @Override - public long get(int docID) { - return values[docID]; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends 
Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedColumnStrideByteIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedColumnStrideByteIndex columnStrideByteIndex = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, columnStrideByteIndex.getName()); - out.writeByteArray(columnStrideByteIndex.values); - } - - @Override - protected OptimizedColumnStrideByteIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - byte[] values = in.readByteArray(); - return new OptimizedColumnStrideByteIndex( - flushInfo.getStringProperty(NAME_PROP_NAME), values); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideIntIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideIntIndex.java deleted file mode 100644 index 725b54746..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideIntIndex.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class OptimizedColumnStrideIntIndex extends ColumnStrideFieldIndex implements Flushable { - private final int[] values; - - public OptimizedColumnStrideIntIndex(String name, int maxSize) { - super(name); - values = new int[maxSize]; - } - - public OptimizedColumnStrideIntIndex( - ColumnStrideIntIndex columnStrideIntIndex, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(columnStrideIntIndex.getName()); - int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - values = new int[maxDocId + 1]; - - int docId = optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE); - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - int originalDocId = originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId)); - setValue(docId, columnStrideIntIndex.get(originalDocId)); - docId = optimizedTweetIdMapper.getNextDocID(docId); - } - } - - private OptimizedColumnStrideIntIndex(String name, int[] values) { - super(name); - this.values = values; - } - - @Override - public void setValue(int docID, long value) { - this.values[docID] = (int) value; - } - - @Override - public long get(int docID) { - return values[docID]; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedColumnStrideIntIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedColumnStrideIntIndex columnStrideIntIndex = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, columnStrideIntIndex.getName()); - out.writeIntArray(columnStrideIntIndex.values); - } - - @Override - protected 
OptimizedColumnStrideIntIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int[] values = in.readIntArray(); - return new OptimizedColumnStrideIntIndex( - flushInfo.getStringProperty(NAME_PROP_NAME), values); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideLongIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideLongIndex.java deleted file mode 100644 index df74a7e4e..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideLongIndex.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class OptimizedColumnStrideLongIndex extends ColumnStrideFieldIndex implements Flushable { - private final long[] values; - - public OptimizedColumnStrideLongIndex(String name, int maxSize) { - super(name); - values = new long[maxSize]; - } - - public OptimizedColumnStrideLongIndex( - ColumnStrideLongIndex columnStrideLongIndex, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(columnStrideLongIndex.getName()); - int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - values = new long[maxDocId + 1]; - - int docId = optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE); - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - int originalDocId = originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId)); - setValue(docId, columnStrideLongIndex.get(originalDocId)); - docId = optimizedTweetIdMapper.getNextDocID(docId); - } - } - - private OptimizedColumnStrideLongIndex(String name, long[] values) { - super(name); - this.values = values; - } - - @Override - public void setValue(int docID, long value) { - this.values[docID] = value; - } - - @Override - public long get(int docID) { - return values[docID]; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String NAME_PROP_NAME = "fieldName"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedColumnStrideLongIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedColumnStrideLongIndex columnStrideLongIndex = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, columnStrideLongIndex.getName()); - out.writeLongArray(columnStrideLongIndex.values); - } - - @Override - protected OptimizedColumnStrideLongIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - long[] values = in.readLongArray(); - return new OptimizedColumnStrideLongIndex( - flushInfo.getStringProperty(NAME_PROP_NAME), values); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideMultiIntIndex.java b/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideMultiIntIndex.java deleted file mode 100644 index 82f233ad8..000000000 --- 
a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedColumnStrideMultiIntIndex.java +++ /dev/null @@ -1,90 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class OptimizedColumnStrideMultiIntIndex - extends AbstractColumnStrideMultiIntIndex implements Flushable { - private final int[] values; - - public OptimizedColumnStrideMultiIntIndex(String name, int maxSize, int numIntsPerField) { - super(name, numIntsPerField); - values = new int[Math.multiplyExact(maxSize, numIntsPerField)]; - } - - public OptimizedColumnStrideMultiIntIndex( - ColumnStrideMultiIntIndex columnStrideMultiIntIndex, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(columnStrideMultiIntIndex.getName(), columnStrideMultiIntIndex.getNumIntsPerField()); - int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - values = new int[columnStrideMultiIntIndex.getNumIntsPerField() * (maxDocId + 1)]; - - int docId = optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE); - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - int originalDocId = originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId)); - for (int i = 0; i < columnStrideMultiIntIndex.getNumIntsPerField(); ++i) { - setValue(docId, i, columnStrideMultiIntIndex.get(originalDocId, i)); - } - docId = optimizedTweetIdMapper.getNextDocID(docId); - } - } - - private OptimizedColumnStrideMultiIntIndex(String name, int numIntsPerField, int[] values) { - super(name, numIntsPerField); - this.values = values; - } - - @Override - public void setValue(int docID, int valueIndex, int value) { - values[docID * getNumIntsPerField() + valueIndex] = value; - } - - @Override - public int get(int docID, int valueIndex) { - return values[docID * getNumIntsPerField() + valueIndex]; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler - extends Flushable.Handler { - private static final String INTS_PER_FIELD_PROP_NAME = "intsPerField"; - private static final String NAME_PROP_NAME = "fieldName"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedColumnStrideMultiIntIndex objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedColumnStrideMultiIntIndex columnStrideMultiIntIndex = getObjectToFlush(); - flushInfo.addStringProperty(NAME_PROP_NAME, columnStrideMultiIntIndex.getName()); - flushInfo.addIntProperty(INTS_PER_FIELD_PROP_NAME, - columnStrideMultiIntIndex.getNumIntsPerField()); - out.writeIntArray(columnStrideMultiIntIndex.values); - } - - @Override - protected OptimizedColumnStrideMultiIntIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int[] values = in.readIntArray(); - return new OptimizedColumnStrideMultiIntIndex( - flushInfo.getStringProperty(NAME_PROP_NAME), - flushInfo.getIntProperty(INTS_PER_FIELD_PROP_NAME), - values); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedDocValuesManager.java 
b/src/java/com/twitter/search/core/earlybird/index/column/OptimizedDocValuesManager.java deleted file mode 100644 index 2053d7ce5..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/OptimizedDocValuesManager.java +++ /dev/null @@ -1,97 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.collect.Sets; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class OptimizedDocValuesManager extends DocValuesManager { - public OptimizedDocValuesManager(Schema schema, int segmentSize) { - super(schema, segmentSize); - } - - public OptimizedDocValuesManager(DocValuesManager docValuesManager, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(docValuesManager.schema, docValuesManager.segmentSize); - Set intViewIndexes = Sets.newHashSet(); - for (String fieldName : docValuesManager.columnStrideFields.keySet()) { - ColumnStrideFieldIndex originalColumnStrideField = - docValuesManager.columnStrideFields.get(fieldName); - if (originalColumnStrideField instanceof ColumnStrideIntViewIndex) { - intViewIndexes.add((ColumnStrideIntViewIndex) originalColumnStrideField); - } else { - ColumnStrideFieldIndex optimizedColumnStrideField = - originalColumnStrideField.optimize(originalTweetIdMapper, optimizedTweetIdMapper); - columnStrideFields.put(fieldName, optimizedColumnStrideField); - } - } - - // We have to process the ColumnStrideIntViewIndex instances after we process all other CSFs, - // because we need to make sure we've optimized the CSFs for the base fields. 
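    // ColumnStrideIntViewIndex.optimize() deliberately throws UnsupportedOperationException,
    // so views are never remapped directly. Instead, newIntViewCSF() below re-resolves each
    // view against its base field, which addColumnStrideField() now finds in columnStrideFields
    // in its optimized, array-backed form, e.g. (view field name purely illustrative):
    //
    //   columnStrideFields.put("retweet_count", newIntViewCSF("retweet_count"));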
- for (ColumnStrideIntViewIndex intViewIndex : intViewIndexes) { - String fieldName = intViewIndex.getName(); - columnStrideFields.put(fieldName, newIntViewCSF(fieldName)); - } - } - - private OptimizedDocValuesManager( - Schema schema, - int segmentSize, - ConcurrentHashMap columnStrideFields) { - super(schema, segmentSize, columnStrideFields); - } - - @Override - protected ColumnStrideFieldIndex newByteCSF(String field) { - return new OptimizedColumnStrideByteIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newIntCSF(String field) { - return new OptimizedColumnStrideIntIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newLongCSF(String field) { - return new OptimizedColumnStrideLongIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newMultiIntCSF(String field, int numIntsPerField) { - return new OptimizedColumnStrideMultiIntIndex(field, segmentSize, numIntsPerField); - } - - @Override - public DocValuesManager optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return this; - } - - @Override - public FlushHandler getFlushHandler() { - return new OptimizedFlushHandler(this); - } - - public static class OptimizedFlushHandler extends FlushHandler { - public OptimizedFlushHandler(Schema schema) { - super(schema); - } - - private OptimizedFlushHandler(DocValuesManager docValuesManager) { - super(docValuesManager); - } - - @Override - protected DocValuesManager createDocValuesManager( - Schema schema, - int maxSegmentSize, - ConcurrentHashMap columnStrideFields) { - return new OptimizedDocValuesManager(schema, maxSegmentSize, columnStrideFields); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/column/UnoptimizedDocValuesManager.java b/src/java/com/twitter/search/core/earlybird/index/column/UnoptimizedDocValuesManager.java deleted file mode 100644 index 840fe7cfe..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/column/UnoptimizedDocValuesManager.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.core.earlybird.index.column; - -import java.io.IOException; -import java.util.concurrent.ConcurrentHashMap; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class UnoptimizedDocValuesManager extends DocValuesManager { - public UnoptimizedDocValuesManager(Schema schema, int segmentSize) { - super(schema, segmentSize); - } - - private UnoptimizedDocValuesManager( - Schema schema, - int segmentSize, - ConcurrentHashMap columnStrideFields) { - super(schema, segmentSize, columnStrideFields); - } - - @Override - protected ColumnStrideFieldIndex newByteCSF(String field) { - return new ColumnStrideByteIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newIntCSF(String field) { - return new ColumnStrideIntIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newLongCSF(String field) { - return new ColumnStrideLongIndex(field, segmentSize); - } - - @Override - protected ColumnStrideFieldIndex newMultiIntCSF(String field, int numIntsPerField) { - return new ColumnStrideMultiIntIndex(field, segmentSize, numIntsPerField); - } - - @Override - public DocValuesManager optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedDocValuesManager(this, originalTweetIdMapper, 
optimizedTweetIdMapper); - } - - @Override - public FlushHandler getFlushHandler() { - return new UnoptimizedFlushHandler(this); - } - - public static class UnoptimizedFlushHandler extends FlushHandler { - public UnoptimizedFlushHandler(Schema schema) { - super(schema); - } - - private UnoptimizedFlushHandler(DocValuesManager docValuesManager) { - super(docValuesManager); - } - - @Override - protected DocValuesManager createDocValuesManager( - Schema schema, - int maxSegmentSize, - ConcurrentHashMap columnStrideFields) { - return new UnoptimizedDocValuesManager(schema, maxSegmentSize, columnStrideFields); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsData.java b/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsData.java deleted file mode 100644 index b63defe36..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsData.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.core.earlybird.index.extensions; - -import java.io.IOException; - -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * Base index extensions class. - */ -public interface EarlybirdIndexExtensionsData { - /** - * Sets up the extensions for the given reader. - */ - void setupExtensions(EarlybirdIndexSegmentAtomicReader atomicReader) throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsFactory.java b/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsFactory.java deleted file mode 100644 index 6b9d30687..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdIndexExtensionsFactory.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.core.earlybird.index.extensions; - -/** - * Base class to implement factories that create realtime and Lucene index extensions. - * - * The factory needs to be able to create instances for new segments, as well as load - * index extensions of existing segments from disk. - */ -public abstract class EarlybirdIndexExtensionsFactory { - /** - * Returns the {@link EarlybirdRealtimeIndexExtensionsData} instance to be used for a new segment. - */ - public abstract EarlybirdRealtimeIndexExtensionsData newRealtimeIndexExtensionsData(); - - /** - * Returns the {@link EarlybirdIndexExtensionsData} instance to be used for a new Lucene segment. - */ - public abstract EarlybirdIndexExtensionsData newLuceneIndexExtensionsData(); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdRealtimeIndexExtensionsData.java b/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdRealtimeIndexExtensionsData.java deleted file mode 100644 index 284475566..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/extensions/EarlybirdRealtimeIndexExtensionsData.java +++ /dev/null @@ -1,20 +0,0 @@ -package com.twitter.search.core.earlybird.index.extensions; - -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter; - -/** - * An index extensions implementation for real-time Earlybird indexes. - */ -public interface EarlybirdRealtimeIndexExtensionsData extends EarlybirdIndexExtensionsData { - /** - * Optionally, an implementing class can provide a custom consumer for inverted fields (i.e. streams of tokens). 
- */ - void createInvertedDocConsumer( - EarlybirdRealtimeIndexSegmentWriter.InvertedDocConsumerBuilder builder); - - /** - * Optionally, an implementing class can provide a custom consumer for stored fields (e.g. doc values fields). - */ - void createStoredFieldsConsumer( - EarlybirdRealtimeIndexSegmentWriter.StoredFieldsConsumerBuilder builder); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/BaseByteBlockPool.java b/src/java/com/twitter/search/core/earlybird/index/inverted/BaseByteBlockPool.java deleted file mode 100644 index 332ce3872..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/BaseByteBlockPool.java +++ /dev/null @@ -1,373 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Arrays; - -import org.apache.lucene.store.DataInput; -import org.apache.lucene.store.DataOutput; -import org.apache.lucene.util.ArrayUtil; -import org.apache.lucene.util.ByteBlockPool; -import org.apache.lucene.util.BytesRef; - -import static org.apache.lucene.util.RamUsageEstimator.NUM_BYTES_OBJECT_REF; - -/** - * Base class for BlockPools backed by byte[] arrays. - */ -public abstract class BaseByteBlockPool { - /** - * The extra object with final array is necessary to guarantee visibility to - * other threads without synchronization/using volatile. - * - * From 'Java Concurrency in practice' by Brian Goetz, p. 349: - * - * "Initialization safety guarantees that for properly constructed objects, all - * threads will see the correct values of final fields that were set by the con- - * structor, regardless of how the object is published. Further, any variables - * that can be reached through a final field of a properly constructed object - * (such as the elements of a final array or the contents of a HashMap refer- - * enced by a final field) are also guaranteed to be visible to other threads." - */ - public static final class Pool { - public final byte[][] buffers; - - public Pool(byte[][] buffers) { - this.buffers = buffers; - } - - public byte[][] getBlocks() { - return buffers; - } - } - - public Pool pool = new Pool(new byte[10][]); - // The index of the current buffer in pool.buffers. - public int bufferUpto = -1; - // The number of bytes that have been written in the current buffer. - public int byteUpto = ByteBlockPool.BYTE_BLOCK_SIZE; - // The current buffer, i.e. a reference to pool.buffers[bufferUpto] - public byte[] buffer; - // The total number of bytes that have been used up to now, excluding the current buffer. - public int byteOffset = -ByteBlockPool.BYTE_BLOCK_SIZE; - // The one and only WriteStream for this pool. - private WriteStream writeStream = new WriteStream(); - - protected BaseByteBlockPool() { } - - /** - * Used for loading flushed pool. - */ - protected BaseByteBlockPool(Pool pool, int bufferUpto, int byteUpTo, int byteOffset) { - this.pool = pool; - this.bufferUpto = bufferUpto; - this.byteUpto = byteUpTo; - this.byteOffset = byteOffset; - if (bufferUpto >= 0) { - this.buffer = pool.buffers[bufferUpto]; - } - } - - /** - * Resets the index of the pool to 0 in the first buffer and resets the byte arrays of - * all previously allocated buffers to 0s. 
- */ - public void reset() { - if (bufferUpto != -1) { - // We allocated at least one buffer - - for (int i = 0; i < bufferUpto; i++) { - // Fully zero fill buffers that we fully used - Arrays.fill(pool.buffers[i], (byte) 0); - } - - // Partial zero fill the final buffer - Arrays.fill(pool.buffers[bufferUpto], 0, byteUpto, (byte) 0); - - bufferUpto = 0; - byteUpto = 0; - byteOffset = 0; - buffer = pool.buffers[0]; - } - } - - /** - * Switches to the next buffer and positions the index at its beginning. - */ - public void nextBuffer() { - if (1 + bufferUpto == pool.buffers.length) { - byte[][] newBuffers = new byte[ArrayUtil.oversize(pool.buffers.length + 1, - NUM_BYTES_OBJECT_REF)][]; - System.arraycopy(pool.buffers, 0, newBuffers, 0, pool.buffers.length); - pool = new Pool(newBuffers); - } - buffer = pool.buffers[1 + bufferUpto] = new byte[ByteBlockPool.BYTE_BLOCK_SIZE]; - bufferUpto++; - - byteUpto = 0; - byteOffset += ByteBlockPool.BYTE_BLOCK_SIZE; - } - - /** - * Returns the start offset of the next data that will be added to the pool, UNLESS the data is - * added using addBytes and avoidSplitting = true - */ - public int getOffset() { - return byteOffset + byteUpto; - } - - /** - * Returns the start offset of b in the pool - * @param b byte to put - */ - public int addByte(byte b) { - int initOffset = byteOffset + byteUpto; - int remainingBytesInBuffer = ByteBlockPool.BYTE_BLOCK_SIZE - byteUpto; - // If the buffer is full, move on to the next one. - if (remainingBytesInBuffer <= 0) { - nextBuffer(); - } - buffer[byteUpto] = b; - byteUpto++; - return initOffset; - } - - /** - * Returns the start offset of the bytes in the pool. - * If avoidSplitting is false, this is guaranteed to return the same value that would be - * returned by getOffset() - * @param bytes source array - * @param length number of bytes to put - * @param avoidSplitting if possible (the length is less than ByteBlockPool.BYTE_BLOCK_SIZE), - * the bytes will not be split across buffer boundaries. This is useful for small data - * that will be read a lot (small amount of space wasted in return for avoiding copying - * memory when calling getBytes). - */ - public int addBytes(byte[] bytes, int offset, int length, boolean avoidSplitting) { - // The first time this is called, there may not be an existing buffer yet. - if (buffer == null) { - nextBuffer(); - } - - int remainingBytesInBuffer = ByteBlockPool.BYTE_BLOCK_SIZE - byteUpto; - - if (avoidSplitting && length < ByteBlockPool.BYTE_BLOCK_SIZE) { - if (remainingBytesInBuffer < length) { - nextBuffer(); - } - int initOffset = byteOffset + byteUpto; - System.arraycopy(bytes, offset, buffer, byteUpto, length); - byteUpto += length; - return initOffset; - } else { - int initOffset = byteOffset + byteUpto; - if (remainingBytesInBuffer < length) { - // Must split the bytes across buffers. - int remainingLength = length; - while (remainingLength > ByteBlockPool.BYTE_BLOCK_SIZE - byteUpto) { - int lengthToCopy = ByteBlockPool.BYTE_BLOCK_SIZE - byteUpto; - System.arraycopy(bytes, length - remainingLength + offset, - buffer, byteUpto, lengthToCopy); - remainingLength -= lengthToCopy; - nextBuffer(); - } - System.arraycopy(bytes, length - remainingLength + offset, - buffer, byteUpto, remainingLength); - byteUpto += remainingLength; - } else { - // Just add all bytes to the current buffer. - System.arraycopy(bytes, offset, buffer, byteUpto, length); - byteUpto += length; - } - return initOffset; - } - } - - /** - * Default addBytes. Does not avoid splitting. 
- * @see #addBytes(byte[], int, boolean) - */ - public int addBytes(byte[] bytes, int length) { - return addBytes(bytes, 0, length, false); - } - - /** - * Default addBytes. Does not avoid splitting. - * @see #addBytes(byte[], int, boolean) - */ - public int addBytes(byte[] bytes, int offset, int length) { - return addBytes(bytes, offset, length, false); - } - - /** - * Reads one byte from the pool. - * @param offset location to read byte from - */ - public byte getByte(int offset) { - int bufferIndex = offset >>> ByteBlockPool.BYTE_BLOCK_SHIFT; - int bufferOffset = offset & ByteBlockPool.BYTE_BLOCK_MASK; - return pool.buffers[bufferIndex][bufferOffset]; - } - - /** - * Returns false if offset is invalid or there aren't these many bytes - * available in the pool. - * @param offset location to start reading bytes from - * @param length number of bytes to read - * @param output the object to write the output to. MUST be non null. - */ - public boolean getBytesToBytesRef(int offset, int length, BytesRef output) { - if (offset < 0 || offset + length > byteUpto + byteOffset) { - return false; - } - int currentBuffer = offset >>> ByteBlockPool.BYTE_BLOCK_SHIFT; - int currentOffset = offset & ByteBlockPool.BYTE_BLOCK_MASK; - // If the requested bytes are split across pools, we have to make a new array of bytes - // to copy them into and return a ref to that. - if (currentOffset + length <= ByteBlockPool.BYTE_BLOCK_SIZE) { - output.bytes = pool.buffers[currentBuffer]; - output.offset = currentOffset; - output.length = length; - } else { - byte[] bytes = new byte[length]; - int remainingLength = length; - while (remainingLength > ByteBlockPool.BYTE_BLOCK_SIZE - currentOffset) { - int lengthToCopy = ByteBlockPool.BYTE_BLOCK_SIZE - currentOffset; - System.arraycopy(pool.buffers[currentBuffer], currentOffset, bytes, - length - remainingLength, lengthToCopy); - remainingLength -= lengthToCopy; - currentBuffer++; - currentOffset = 0; - } - System.arraycopy(pool.buffers[currentBuffer], currentOffset, bytes, length - remainingLength, - remainingLength); - output.bytes = bytes; - output.length = bytes.length; - output.offset = 0; - } - return true; - - } - - /** - * Returns the read bytes, or null if offset is invalid or there aren't these many bytes - * available in the pool. - * @param offset location to start reading bytes from - * @param length number of bytes to read - */ - public BytesRef getBytes(int offset, int length) { - BytesRef result = new BytesRef(); - if (getBytesToBytesRef(offset, length, result)) { - return result; - } else { - return null; - } - } - - /** - * get a new readStream at a given offset for this pool. - * - * Notice that individual ReadStreams are not threadsafe, but you can get as many ReadStreams as - * you want. - */ - public ReadStream getReadStream(int offset) { - return new ReadStream(offset); - } - - /** - * get the (one and only) WriteStream for this pool. - * - * Notice that there is exactly one WriteStream per pool, and it is not threadsafe. - */ - public WriteStream getWriteStream() { - return writeStream; - } - - /** - * A DataOutput-like interface for writing "contiguous" data to a ByteBlockPool. - * - * This is not threadsafe. 
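Before the WriteStream/ReadStream helpers that follow, here is a rough illustration of the pool addressing and the avoidSplitting behaviour described above. This example is editorial and not part of the original file; it assumes the concrete ByteBlockPool subclass defined later in this change set, and the example class name is illustrative only.

package com.twitter.search.core.earlybird.index.inverted;

import org.apache.lucene.util.BytesRef;

// Editorial sketch: round-trips a small record and mirrors the getByte()/getBytes() addressing.
public final class BaseByteBlockPoolExample {
  public static void main(String[] args) {
    BaseByteBlockPool pool = new ByteBlockPool();   // concrete subclass from this package
    byte[] record = new byte[] {1, 2, 3, 4, 5};

    // avoidSplitting = true keeps all 5 bytes inside one buffer, so getBytes()
    // can hand back a BytesRef that aliases the pool's buffer instead of copying.
    int offset = pool.addBytes(record, 0, record.length, true);
    BytesRef ref = pool.getBytes(offset, record.length);

    // Manual addressing, mirroring getByte(): the high bits of the offset select the
    // buffer, the low bits select the position inside that buffer.
    int bufferIndex = offset >>> org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_SHIFT;
    int bufferOffset = offset & org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_MASK;
    assert pool.pool.buffers[bufferIndex][bufferOffset] == ref.bytes[ref.offset];
  }
}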
- */ - public final class WriteStream extends DataOutput { - private WriteStream() { } - - /** - * Returns the start offset of the next data that will be added to the pool, UNLESS the data is - * added using addBytes and avoidSplitting = true - */ - public int getOffset() { - return BaseByteBlockPool.this.getOffset(); - } - - /** - * Write bytes to the pool. - * @param bytes source array - * @param offset offset in bytes of the data to write - * @param length number of bytes to put - * @param avoidSplitting same as {link ByteBlockPool.addBytes} - * @return the start offset of the bytes in the pool - */ - public int writeBytes(byte[] bytes, int offset, int length, boolean avoidSplitting) { - return addBytes(bytes, offset, length, avoidSplitting); - } - - @Override - public void writeBytes(byte[] b, int offset, int length) throws IOException { - addBytes(b, offset, length); - } - - @Override - public void writeByte(byte b) { - addByte(b); - } - } - - /** - * A DataInput-like interface for reading "contiguous" data from a ByteBlockPool. - * - * This is not threadsafe. - * - * This does not fully implement the DataInput interface - its DataInput.readBytes method throws - * UnsupportedOperationException because this class provides a facility for no-copy reading. - */ - public final class ReadStream extends DataInput { - private int offset; - - private ReadStream(int offset) { - this.offset = offset; - } - - public BytesRef readBytes(int n) { - return readBytes(n, false); - } - - /** - * read n bytes that were written with a given value of avoidSplitting - * @param n number of bytes to read. - * @param avoidSplitting this should be the same that was used at writeBytes time. - * @return a reference to the bytes read or null. - */ - public BytesRef readBytes(int n, boolean avoidSplitting) { - int currentBuffer = offset >>> ByteBlockPool.BYTE_BLOCK_SHIFT; - int currentOffset = offset & ByteBlockPool.BYTE_BLOCK_MASK; - if (avoidSplitting && n < ByteBlockPool.BYTE_BLOCK_SIZE - && currentOffset + n > ByteBlockPool.BYTE_BLOCK_SIZE) { - ++currentBuffer; - currentOffset = 0; - offset = currentBuffer << ByteBlockPool.BYTE_BLOCK_SHIFT; - } - BytesRef result = getBytes(offset, n); - this.offset += n; - return result; - } - - @Override - public byte readByte() { - return getByte(offset++); - } - - @Override - public void readBytes(byte[] b, int off, int len) throws IOException { - throw new UnsupportedOperationException("Use the no-copies version of ReadBytes instead."); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/ByteBlockPool.java b/src/java/com/twitter/search/core/earlybird/index/inverted/ByteBlockPool.java deleted file mode 100644 index 1401b4003..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/ByteBlockPool.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -public class ByteBlockPool extends BaseByteBlockPool implements Flushable { - - public ByteBlockPool() { - } - - /** - * Used for loading flushed pool. 
- */ - private ByteBlockPool(Pool pool, int bufferUpto, int byteUpTo, int byteOffset) { - super(pool, bufferUpto, byteUpTo, byteOffset); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String BUFFER_UP_TO_PROP_NAME = "bufferUpto"; - private static final String BYTE_UP_TO_PROP_NAME = "byteUpto"; - private static final String BYTE_OFFSET_PROP_NAME = "byteOffset"; - - public FlushHandler(ByteBlockPool objectToFlush) { - super(objectToFlush); - } - - public FlushHandler() { - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - ByteBlockPool objectToFlush = getObjectToFlush(); - out.writeByteArray2D(objectToFlush.pool.buffers, objectToFlush.bufferUpto + 1); - flushInfo.addIntProperty(BUFFER_UP_TO_PROP_NAME, objectToFlush.bufferUpto); - flushInfo.addIntProperty(BYTE_UP_TO_PROP_NAME, objectToFlush.byteUpto); - flushInfo.addIntProperty(BYTE_OFFSET_PROP_NAME, objectToFlush.byteOffset); - } - - @Override - protected ByteBlockPool doLoad(FlushInfo flushInfo, - DataDeserializer in) throws IOException { - return new ByteBlockPool( - new BaseByteBlockPool.Pool(in.readByteArray2D()), - flushInfo.getIntProperty(BUFFER_UP_TO_PROP_NAME), - flushInfo.getIntProperty(BYTE_UP_TO_PROP_NAME), - flushInfo.getIntProperty(BYTE_OFFSET_PROP_NAME)); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/ByteTermUtils.java b/src/java/com/twitter/search/core/earlybird/index/inverted/ByteTermUtils.java deleted file mode 100644 index c246caa68..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/ByteTermUtils.java +++ /dev/null @@ -1,126 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import org.apache.lucene.util.ByteBlockPool; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.StringHelper; - -/** - * Utility class for BytePools which have each term's length encoded before the contents in the - * ByteBlockPool - * Another solution is to have a class that encapsulates both textStarts and the byteBlockPool and - * knows how the byteBlockPool is used to store the strings - **/ -public abstract class ByteTermUtils { - /** - * Fill in a BytesRef from term's length & bytes encoded in byte block - */ - public static int setBytesRef(final BaseByteBlockPool byteBlockPool, - BytesRef term, - final int textStart) { - final byte[] block = term.bytes = - byteBlockPool.pool.buffers[textStart >>> ByteBlockPool.BYTE_BLOCK_SHIFT]; - final int start = textStart & ByteBlockPool.BYTE_BLOCK_MASK; - int pos = start; - - byte b = block[pos++]; - term.length = b & 0x7F; - for (int shift = 7; (b & 0x80) != 0; shift += 7) { - b = block[pos++]; - term.length |= (b & 0x7F) << shift; - } - term.offset = pos; - - assert term.length >= 0; - return textStart + (pos - start) + term.length; - } - - /** - * Test whether the text for current RawPostingList p equals - * current tokenText in utf8. 
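setBytesRef() above decodes a term's length from a variable-length int ("VInt") written immediately before the term bytes, and copyToTermPool()/writeVInt() further down write that framing. The following is a hedged, self-contained sketch of that encoding; it is editorial, not part of the original file, and the helper names are illustrative.

// Editorial sketch of the [VInt length][term bytes] framing used in the term pool.
final class VIntFramingExample {
  // Mirrors writeVInt(): low 7 bits per byte, continuation bit set on all but the last byte.
  static int writeVInt(byte[] block, int upto, int value) {
    while ((value & ~0x7F) != 0) {
      block[upto++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    block[upto++] = (byte) value;
    return upto;                      // new write position
  }

  // Mirrors the decode loop in setBytesRef()/postingEquals(); a real caller would also
  // need the updated read position, which is omitted here for brevity.
  static int readVInt(byte[] block, int pos) {
    byte b = block[pos++];
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = block[pos++];
      value |= (b & 0x7F) << shift;
    }
    return value;
  }
}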
- */ - public static boolean postingEquals(final BaseByteBlockPool termPool, - final int textStart, final BytesRef other) { - final byte[] block = termPool.pool.getBlocks()[textStart >>> ByteBlockPool.BYTE_BLOCK_SHIFT]; - assert block != null; - - int pos = textStart & ByteBlockPool.BYTE_BLOCK_MASK; - - byte b = block[pos++]; - int len = b & 0x7F; - for (int shift = 7; (b & 0x80) != 0; shift += 7) { - b = block[pos++]; - len |= (b & 0x7F) << shift; - } - - if (len == other.length) { - final byte[] utf8Bytes = other.bytes; - for (int tokenPos = other.offset; - tokenPos < other.length + other.offset; pos++, tokenPos++) { - if (utf8Bytes[tokenPos] != block[pos]) { - return false; - } - } - return true; - } else { - return false; - } - } - - /** - * Returns the hashCode of the term stored at the given position in the block pool. - */ - public static int hashCode( - final BaseByteBlockPool termPool, final int textStart) { - final byte[] block = termPool.pool.getBlocks()[textStart >>> ByteBlockPool.BYTE_BLOCK_SHIFT]; - final int start = textStart & ByteBlockPool.BYTE_BLOCK_MASK; - - int pos = start; - - byte b = block[pos++]; - int len = b & 0x7F; - for (int shift = 7; (b & 0x80) != 0; shift += 7) { - b = block[pos++]; - len |= (b & 0x7F) << shift; - } - - // Hash code returned here must be consistent with the one used in TermHashTable.lookupItem, so - // use the fixed hash seed. See TermHashTable.lookupItem for explanation of fixed hash seed. - return StringHelper.murmurhash3_x86_32(block, pos, len, InvertedRealtimeIndex.FIXED_HASH_SEED); - } - - /** - * Copies the utf8 encoded byte ref to the termPool. - * @param termPool - * @param utf8 - * @return The text's start position in the termPool - */ - public static int copyToTermPool(BaseByteBlockPool termPool, BytesRef bytes) { - // Maybe grow the termPool before we write. Assume we need 5 bytes in - // the worst case to store the VInt. 
- if (bytes.length + 5 + termPool.byteUpto > ByteBlockPool.BYTE_BLOCK_SIZE) { - // Not enough room in current block - termPool.nextBuffer(); - } - - final int textStart = termPool.byteUpto + termPool.byteOffset; - - writeVInt(termPool, bytes.length); - System.arraycopy(bytes.bytes, bytes.offset, termPool.buffer, termPool.byteUpto, bytes.length); - termPool.byteUpto += bytes.length; - - return textStart; - } - - private static void writeVInt(final BaseByteBlockPool termPool, final int v) { - int value = v; - final byte[] block = termPool.buffer; - int blockUpto = termPool.byteUpto; - - while ((value & ~0x7F) != 0) { - block[blockUpto++] = (byte) ((value & 0x7f) | 0x80); - value >>>= 7; - } - block[blockUpto++] = (byte) value; - termPool.byteUpto = blockUpto; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/DeletedDocs.java b/src/java/com/twitter/search/core/earlybird/index/inverted/DeletedDocs.java deleted file mode 100644 index 264d105fa..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/DeletedDocs.java +++ /dev/null @@ -1,245 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.util.Bits; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -public abstract class DeletedDocs implements Flushable { - private static final Logger LOG = LoggerFactory.getLogger(DeletedDocs.class); - - /** - * Deletes the given document. - */ - public abstract boolean deleteDoc(int docID); - - /** - * Returns a point-in-time view of the deleted docs. Calling {@link #deleteDoc(int)} afterwards - * will not alter this View. - */ - public abstract View getView(); - - /** - * Number of deletions. - */ - public abstract int numDeletions(); - - /** - * Returns a DeletedDocs instance that has the same deleted tweet IDs, but mapped to the doc IDs - * in the optimizedTweetIdMapper. - * - * @param originalTweetIdMapper The original DocIDToTweetIDMapper instance that was used to add - * doc IDs to this DeletedDocs instance. - * @param optimizedTweetIdMapper The new DocIDToTweetIDMapper instance. - * @return An DeletedDocs instance that has the same tweets deleted, but mapped to the doc IDs in - * optimizedTweetIdMapper. - */ - public abstract DeletedDocs optimize( - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException; - - public abstract class View { - /** - * Returns true, if the given document was deleted. - */ - public abstract boolean isDeleted(int docID); - - /** - * Returns true, if there are any deleted documents in this View. - */ - public abstract boolean hasDeletions(); - - /** - * Returns {@link Bits} where all deleted documents have their bit set to 0, and - * all non-deleted documents have their bits set to 1. - */ - public abstract Bits getLiveDocs(); - } - - public static class Default extends DeletedDocs { - private static final int KEY_NOT_FOUND = -1; - - private final int size; - private final Int2IntOpenHashMap deletes; - - // Each delete is marked with a unique, consecutively-increasing sequence ID. 
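The sequence-ID bookkeeping noted in the comment above is what gives each View returned by getView() its point-in-time semantics (see the getView()/isDeleted() logic that follows). A hedged usage sketch, editorial and not part of the original file; the doc IDs and capacity are made up for illustration:

// Editorial sketch: deletes applied after getView() are invisible to that view.
DeletedDocs deletedDocs = new DeletedDocs.Default(1024);

deletedDocs.deleteDoc(7);                            // recorded with sequence ID 0
DeletedDocs.View snapshot = deletedDocs.getView();   // only sees sequence IDs < 1
deletedDocs.deleteDoc(11);                           // sequence ID 1, after the snapshot

// snapshot.isDeleted(7)               -> true  (deleted before the view was taken)
// snapshot.isDeleted(11)              -> false (deleted after the view was taken)
// deletedDocs.getView().isDeleted(11) -> true  (a fresh view sees it)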
- private int sequenceID = 0; - - public Default(int size) { - this.size = size; - deletes = new Int2IntOpenHashMap(size); - deletes.defaultReturnValue(KEY_NOT_FOUND); - } - - /** - * Returns false, if this call was a noop, i.e. if the document was already deleted. - */ - @Override - public boolean deleteDoc(int docID) { - if (deletes.putIfAbsent(docID, sequenceID) == KEY_NOT_FOUND) { - sequenceID++; - return true; - } - return false; - } - - private boolean isDeleted(int internalID, int readerSequenceID) { - int deletedSequenceId = deletes.get(internalID); - return (deletedSequenceId >= 0) && (deletedSequenceId < readerSequenceID); - } - - private boolean hasDeletions(int readerSequenceID) { - return readerSequenceID > 0; - } - - @Override - public int numDeletions() { - return sequenceID; - } - - @Override - public View getView() { - return new View() { - private final int readerSequenceID = sequenceID; - - // liveDocs bitset contains inverted (decreasing) docids. - public final Bits liveDocs = !hasDeletions() ? null : new Bits() { - @Override - public final boolean get(int docID) { - return !isDeleted(docID); - } - - @Override - public final int length() { - return size; - } - }; - - @Override - public Bits getLiveDocs() { - return liveDocs; - } - - - // Operates on internal (increasing) docids. - @Override - public final boolean isDeleted(int internalID) { - return DeletedDocs.Default.this.isDeleted(internalID, readerSequenceID); - } - - @Override - public final boolean hasDeletions() { - return DeletedDocs.Default.this.hasDeletions(readerSequenceID); - } - }; - } - - @Override - public DeletedDocs optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - DeletedDocs optimizedDeletedDocs = new Default(size); - for (int deletedDocID : deletes.keySet()) { - long tweetID = originalTweetIdMapper.getTweetID(deletedDocID); - int optimizedDeletedDocID = optimizedTweetIdMapper.getDocID(tweetID); - optimizedDeletedDocs.deleteDoc(optimizedDeletedDocID); - } - return optimizedDeletedDocs; - } - - @SuppressWarnings("unchecked") - @Override - public Default.FlushHandler getFlushHandler() { - return new Default.FlushHandler(this, size); - } - - public static final class FlushHandler extends Flushable.Handler { - private final int size; - - public FlushHandler(Default objectToFlush, int size) { - super(objectToFlush); - this.size = size; - } - - public FlushHandler(int size) { - this.size = size; - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - long startTime = getClock().nowMillis(); - - Int2IntOpenHashMap deletes = getObjectToFlush().deletes; - out.writeIntArray(deletes.keySet().toIntArray()); - - getFlushTimerStats().timerIncrement(getClock().nowMillis() - startTime); - } - - @Override - protected Default doLoad(FlushInfo flushInfo, DataDeserializer in) throws IOException { - Default deletedDocs = new Default(size); - long startTime = getClock().nowMillis(); - - int[] deletedDocIDs = in.readIntArray(); - for (int docID : deletedDocIDs) { - deletedDocs.deleteDoc(docID); - } - - getLoadTimerStats().timerIncrement(getClock().nowMillis() - startTime); - return deletedDocs; - } - } - } - - public static final DeletedDocs NO_DELETES = new DeletedDocs() { - @Override - public Handler getFlushHandler() { - return null; - } - - @Override - public boolean deleteDoc(int docID) { - return false; - } - - @Override - public DeletedDocs optimize(DocIDToTweetIDMapper originalTweetIdMapper, - 
DocIDToTweetIDMapper optimizedTweetIdMapper) { - return this; - } - - @Override - public int numDeletions() { - return 0; - } - - @Override - public View getView() { - return new View() { - @Override - public boolean isDeleted(int docID) { - return false; - } - - @Override - public boolean hasDeletions() { - return false; - } - - @Override - public Bits getLiveDocs() { - return null; - } - - }; - } - }; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdCSFDocValuesProcessor.java b/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdCSFDocValuesProcessor.java deleted file mode 100644 index 45fec2f5f..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdCSFDocValuesProcessor.java +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.facet.FacetsConfig; -import org.apache.lucene.index.DocValuesType; -import org.apache.lucene.index.IndexableField; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter; -import com.twitter.search.core.earlybird.index.column.AbstractColumnStrideMultiIntIndex; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.column.DocValuesManager; - -/** - * Handler for docvalues in the indexing chain. - */ -public class EarlybirdCSFDocValuesProcessor - implements EarlybirdRealtimeIndexSegmentWriter.StoredFieldsConsumer { - - private final DocValuesManager docValuesManager; - - public EarlybirdCSFDocValuesProcessor(DocValuesManager docValuesManager) { - this.docValuesManager = docValuesManager; - } - - @Override - public void addField(int docID, IndexableField field) throws IOException { - final DocValuesType dvType = field.fieldType().docValuesType(); - if (dvType != null) { - - // ignore lucene facet fields for realtime index, we are handling it differently - if (field.name().startsWith(FacetsConfig.DEFAULT_INDEX_FIELD_NAME)) { - return; - } - if (!(field.fieldType() instanceof EarlybirdFieldType)) { - throw new RuntimeException( - "fieldType must be an EarlybirdFieldType instance for field " + field.name()); - } - EarlybirdFieldType fieldType = (EarlybirdFieldType) field.fieldType(); - - if (dvType == DocValuesType.NUMERIC) { - if (!(field.numericValue() instanceof Long)) { - throw new IllegalArgumentException( - "illegal type " + field.numericValue().getClass() - + ": DocValues types must be Long"); - } - - ColumnStrideFieldIndex csfIndex = - docValuesManager.addColumnStrideField(field.name(), fieldType); - if (fieldType.getCsfFixedLengthNumValuesPerDoc() > 1) { - throw new UnsupportedOperationException("unsupported multi numeric values"); - } else { - csfIndex.setValue(docID, field.numericValue().longValue()); - } - - } else if (dvType == DocValuesType.BINARY) { - ColumnStrideFieldIndex csfIndex = - docValuesManager.addColumnStrideField(field.name(), fieldType); - if (fieldType.getCsfFixedLengthNumValuesPerDoc() > 1) { - Preconditions.checkArgument( - csfIndex instanceof AbstractColumnStrideMultiIntIndex, - "Unsupported multi-value binary CSF class: " + csfIndex); - ((AbstractColumnStrideMultiIntIndex) csfIndex).updateDocValues( - field.binaryValue(), docID); - } - } else { - throw new UnsupportedOperationException("unsupported DocValues.Type: " + dvType); - } - } - } -} diff --git 
a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdOptimizedPostingsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdOptimizedPostingsEnum.java deleted file mode 100644 index a60562c5b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdOptimizedPostingsEnum.java +++ /dev/null @@ -1,178 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.util.BytesRef; - -/** - * Extend {@link EarlybirdPostingsEnum} to add more functionalities for docs (and positions) - * enumerator of {@link OptimizedPostingLists}. - */ -public abstract class EarlybirdOptimizedPostingsEnum extends EarlybirdPostingsEnum { - /** Current doc and its frequency. */ - private int currentDocID = -1; - private int currentFreq = 0; - - /** - * Next doc and its frequency. - * These values should be set at {@link #loadNextPosting()}. - */ - protected int nextDocID; - protected int nextFreq; - - /** Pointer to the enumerated posting list. */ - protected final int postingListPointer; - - /** Total number of postings in the enumerated posting list. */ - protected final int numPostingsTotal; - - /** Query cost tracker. */ - protected final QueryCostTracker queryCostTracker; - - /** - * Sole constructor. - * - * @param postingListPointer pointer to the posting list for which this enumerator is created - * @param numPostings number of postings in the posting list for which this enumerator is created - */ - public EarlybirdOptimizedPostingsEnum(int postingListPointer, int numPostings) { - this.postingListPointer = postingListPointer; - this.numPostingsTotal = numPostings; - - // Get the thread local query cost tracker. - this.queryCostTracker = QueryCostTracker.getTracker(); - } - - /** - * Set {@link #currentDocID} and {@link #currentFreq} and load next posting. - * This method will de-dup if duplicate doc IDs are stored. - * - * @return {@link #currentDocID} - * @see {@link #nextDoc()} - */ - @Override - protected final int nextDocNoDel() throws IOException { - currentDocID = nextDocID; - - // Return immediately if exhausted. - if (currentDocID == NO_MORE_DOCS) { - return NO_MORE_DOCS; - } - - currentFreq = nextFreq; - loadNextPosting(); - - // In case duplicate doc ID is stored. - while (currentDocID == nextDocID) { - currentFreq += nextFreq; - loadNextPosting(); - } - - startCurrentDoc(); - return currentDocID; - } - - /** - * Called when {@link #nextDocNoDel()} advances to a new docID. - * Subclasses can do extra accounting as needed. - */ - protected void startCurrentDoc() { - // No-op in this class. - } - - /** - * Loads the next posting, setting the nextDocID and nextFreq. - * - * @see #nextDocNoDel() - */ - protected abstract void loadNextPosting(); - - /** - * Subclass should implement {@link #skipTo(int)}. - * - * @see org.apache.lucene.search.DocIdSetIterator#advance(int) - */ - @Override - public final int advance(int target) throws IOException { - // Skipping to NO_MORE_DOCS or beyond largest doc ID. - if (target == NO_MORE_DOCS || target > getLargestDocID()) { - currentDocID = nextDocID = NO_MORE_DOCS; - currentFreq = nextFreq = 0; - return NO_MORE_DOCS; - } - - // Skip as close as possible. - skipTo(target); - - // Calling nextDoc to reach the target, or go beyond it if target does not exist. - int doc; - do { - doc = nextDoc(); - } while (doc < target); - - return doc; - } - - /** - * Used in {@link #advance(int)}. 
- * This method should skip to the given target as close as possible, but NOT reach the target. - * - * @see #advance(int) - */ - protected abstract void skipTo(int target); - - /** - * Return loaded {@link #currentFreq}. - * - * @see org.apache.lucene.index.PostingsEnum#freq() - * @see #nextDocNoDel() - */ - @Override - public final int freq() throws IOException { - return currentFreq; - } - - /** - * Return loaded {@link #currentDocID}. - * - * @see org.apache.lucene.index.PostingsEnum#docID() - * @see #nextDocNoDel() - */ - @Override - public final int docID() { - return currentDocID; - } - - /********************************************* - * Not Supported Information * - * @see org.apache.lucene.index.PostingsEnum * - *********************************************/ - - @Override - public int nextPosition() throws IOException { - return -1; - } - - @Override - public int startOffset() throws IOException { - return -1; - } - - @Override - public int endOffset() throws IOException { - return -1; - } - - @Override - public BytesRef getPayload() throws IOException { - return null; - } - - /********************************* - * Helper methods for subclasses * - *********************************/ - - protected int getCurrentFreq() { - return currentFreq; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdPostingsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdPostingsEnum.java deleted file mode 100644 index 535c8b55d..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/EarlybirdPostingsEnum.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.index.PostingsEnum; - -/** - * Extension of Lucene's PostingsEnum interface that adds additional functionality. - */ -public abstract class EarlybirdPostingsEnum extends PostingsEnum { - @Override - public final int nextDoc() throws IOException { - // SEARCH-7008 - return nextDocNoDel(); - } - - /** - * Advances to the next doc without paying attention to liveDocs. - */ - protected abstract int nextDocNoDel() throws IOException; - - /** - * Returns the largest docID contained in this posting list.
- */ - public abstract int getLargestDocID() throws IOException; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/FSTTermDictionary.java b/src/java/com/twitter/search/core/earlybird/index/inverted/FSTTermDictionary.java deleted file mode 100644 index 638cbaffc..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/FSTTermDictionary.java +++ /dev/null @@ -1,299 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Comparator; - -import org.apache.lucene.index.BaseTermsEnum; -import org.apache.lucene.index.ImpactsEnum; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.SlowImpactsEnum; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.InPlaceMergeSorter; -import org.apache.lucene.util.IntsRefBuilder; -import org.apache.lucene.util.fst.BytesRefFSTEnum; -import org.apache.lucene.util.fst.FST; -import org.apache.lucene.util.fst.PositiveIntOutputs; -import org.apache.lucene.util.fst.Util; -import org.apache.lucene.util.packed.PackedInts; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -public class FSTTermDictionary implements TermDictionary, Flushable { - private final FST fst; - - private final PackedInts.Reader termPointers; - private final ByteBlockPool termPool; - private final TermPointerEncoding termPointerEncoding; - private int numTerms; - - FSTTermDictionary(int numTerms, FST fst, - ByteBlockPool termPool, PackedInts.Reader termPointers, - TermPointerEncoding termPointerEncoding) { - this.numTerms = numTerms; - this.fst = fst; - this.termPool = termPool; - this.termPointers = termPointers; - this.termPointerEncoding = termPointerEncoding; - } - - @Override - public int getNumTerms() { - return numTerms; - } - - @Override - public int lookupTerm(BytesRef term) throws IOException { - if (fst == null) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - final BytesRefFSTEnum fstEnum = new BytesRefFSTEnum<>(fst); - - final BytesRefFSTEnum.InputOutput result = fstEnum.seekExact(term); - if (result != null && result.input.equals(term)) { - // -1 because 0 is not supported by the fst - return result.output.intValue() - 1; - } else { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - } - - static FSTTermDictionary buildFST( - final ByteBlockPool termPool, - int[] termPointers, - int numTerms, - final Comparator comp, - boolean supportTermTextLookup, - final TermPointerEncoding termPointerEncoding) throws IOException { - final IntsRefBuilder scratchIntsRef = new IntsRefBuilder(); - - final int[] compact = new int[numTerms]; - for (int i = 0; i < numTerms; i++) { - compact[i] = i; - } - - // first sort the terms - new InPlaceMergeSorter() { - private BytesRef scratch1 = new BytesRef(); - private BytesRef scratch2 = new BytesRef(); - - @Override - protected void swap(int i, int j) { - final int o = compact[i]; - compact[i] = compact[j]; - compact[j] = o; - } - - @Override - protected int compare(int i, int j) { - final int ord1 = compact[i]; - final int ord2 = compact[j]; - ByteTermUtils.setBytesRef(termPool, scratch1, - termPointerEncoding.getTextStart(termPointers[ord1])); - 
ByteTermUtils.setBytesRef(termPool, scratch2, - termPointerEncoding.getTextStart(termPointers[ord2])); - return comp.compare(scratch1, scratch2); - } - - }.sort(0, compact.length); - - final PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(); - - final org.apache.lucene.util.fst.Builder builder = - new org.apache.lucene.util.fst.Builder<>(FST.INPUT_TYPE.BYTE1, outputs); - - final BytesRef term = new BytesRef(); - for (int termID : compact) { - ByteTermUtils.setBytesRef(termPool, term, - termPointerEncoding.getTextStart(termPointers[termID])); - // +1 because 0 is not supported by the fst - builder.add(Util.toIntsRef(term, scratchIntsRef), (long) termID + 1); - } - - if (supportTermTextLookup) { - PackedInts.Reader packedTermPointers = OptimizedMemoryIndex.getPackedInts(termPointers); - return new FSTTermDictionary( - numTerms, - builder.finish(), - termPool, - packedTermPointers, - termPointerEncoding); - } else { - return new FSTTermDictionary( - numTerms, - builder.finish(), - null, // termPool - null, // termPointers - termPointerEncoding); - } - } - - @Override - public boolean getTerm(int termID, BytesRef text, BytesRef termPayload) { - if (termPool == null) { - throw new UnsupportedOperationException( - "This dictionary does not support term lookup by termID"); - } else { - int termPointer = (int) termPointers.get(termID); - boolean hasTermPayload = termPointerEncoding.hasPayload(termPointer); - int textStart = termPointerEncoding.getTextStart(termPointer); - // setBytesRef sets the passed in BytesRef "text" to the term in the termPool. - // As a side effect it returns the offset of the next entry in the pool after the term, - // which may optionally be used if this term has a payload. - int termPayloadStart = ByteTermUtils.setBytesRef(termPool, text, textStart); - if (termPayload != null && hasTermPayload) { - ByteTermUtils.setBytesRef(termPool, termPayload, termPayloadStart); - } - - return hasTermPayload; - } - } - - @Override - public TermsEnum createTermsEnum(OptimizedMemoryIndex index) { - return new BaseTermsEnum() { - private final BytesRefFSTEnum fstEnum = fst != null ? new BytesRefFSTEnum<>(fst) : null; - private BytesRefFSTEnum.InputOutput current; - - @Override - public SeekStatus seekCeil(BytesRef term) - throws IOException { - if (fstEnum == null) { - return SeekStatus.END; - } - - current = fstEnum.seekCeil(term); - if (current != null && current.input.equals(term)) { - return SeekStatus.FOUND; - } else { - return SeekStatus.END; - } - } - - @Override - public boolean seekExact(BytesRef text) throws IOException { - current = fstEnum.seekExact(text); - return current != null; - } - - // In our case the ord is the termId. - @Override - public void seekExact(long ord) { - current = new BytesRefFSTEnum.InputOutput<>(); - current.input = null; - // +1 because 0 is not supported by the fst - current.output = ord + 1; - - if (termPool != null) { - BytesRef bytesRef = new BytesRef(); - int termId = (int) ord; - assert termId == ord; - FSTTermDictionary.this.getTerm(termId, bytesRef, null); - current.input = bytesRef; - } - } - - @Override - public BytesRef next() throws IOException { - current = fstEnum.next(); - if (current == null) { - return null; - } - return current.input; - } - - @Override - public BytesRef term() { - return current.input; - } - - // In our case the ord is the termId. 
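An editorial aside on the termID convention used around this point (lookupTerm(), buildFST(), and the seekExact()/ord() methods nearby): because an FST output of 0 cannot be stored, term IDs are written as termID + 1 and shifted back on the way out. A hedged round-trip sketch, assuming a dictionary built with supportTermTextLookup = true so the term pool is retained; the term text and method name are illustrative only and not part of the original file.

// Editorial sketch: looking a term up and reading its text back.
static void lookupExample(FSTTermDictionary dictionary) throws IOException {
  int termId = dictionary.lookupTerm(new BytesRef("earlybird"));  // FST output - 1, or TERM_NOT_FOUND
  if (termId != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) {
    BytesRef text = new BytesRef();
    dictionary.getTerm(termId, text, null);   // fills `text` from the retained term pool
    // `text` now holds the bytes of "earlybird"; internally the FST stored termId + 1
    // because an output of 0 is not representable, and lookupTerm() subtracted it back.
  }
}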
- @Override - public long ord() { - // -1 because 0 is not supported by the fst - return current.output - 1; - } - - @Override - public int docFreq() { - return index.getDF((int) ord()); - } - - @Override - public long totalTermFreq() { - return docFreq(); - } - - @Override - public PostingsEnum postings(PostingsEnum reuse, int flags) throws IOException { - int termID = (int) ord(); - int postingsPointer = index.getPostingListPointer(termID); - int numPostings = index.getNumPostings(termID); - return index.getPostingLists().postings(postingsPointer, numPostings, flags); - } - - @Override - public ImpactsEnum impacts(int flags) throws IOException { - return new SlowImpactsEnum(postings(null, flags)); - } - }; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String NUM_TERMS_PROP_NAME = "numTerms"; - private static final String SUPPORT_TERM_TEXT_LOOKUP_PROP_NAME = "supportTermTextLookup"; - private final TermPointerEncoding termPointerEncoding; - - public FlushHandler(TermPointerEncoding termPointerEncoding) { - super(); - this.termPointerEncoding = termPointerEncoding; - } - - public FlushHandler(FSTTermDictionary objectToFlush) { - super(objectToFlush); - this.termPointerEncoding = objectToFlush.termPointerEncoding; - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - FSTTermDictionary objectToFlush = getObjectToFlush(); - flushInfo.addIntProperty(NUM_TERMS_PROP_NAME, objectToFlush.getNumTerms()); - flushInfo.addBooleanProperty(SUPPORT_TERM_TEXT_LOOKUP_PROP_NAME, - objectToFlush.termPool != null); - if (objectToFlush.termPool != null) { - out.writePackedInts(objectToFlush.termPointers); - objectToFlush.termPool.getFlushHandler().flush(flushInfo.newSubProperties("termPool"), out); - } - objectToFlush.fst.save(out.getIndexOutput()); - } - - @Override - protected FSTTermDictionary doLoad(FlushInfo flushInfo, - DataDeserializer in) throws IOException { - int numTerms = flushInfo.getIntProperty(NUM_TERMS_PROP_NAME); - boolean supportTermTextLookup = - flushInfo.getBooleanProperty(SUPPORT_TERM_TEXT_LOOKUP_PROP_NAME); - PackedInts.Reader termPointers = null; - ByteBlockPool termPool = null; - if (supportTermTextLookup) { - termPointers = in.readPackedInts(); - termPool = (new ByteBlockPool.FlushHandler()) - .load(flushInfo.getSubProperties("termPool"), in); - } - final PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(); - return new FSTTermDictionary(numTerms, new FST<>(in.getIndexInput(), outputs), - termPool, termPointers, termPointerEncoding); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsAndPositionsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsAndPositionsEnum.java deleted file mode 100644 index 7b18275d0..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsAndPositionsEnum.java +++ /dev/null @@ -1,156 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -/** - * Docs, frequencies, and positions enumerator for {@link HighDFPackedIntsPostingLists}. - */ -public class HighDFPackedIntsDocsAndPositionsEnum extends HighDFPackedIntsDocsEnum { - /** - * Pre-computed shifts, masks, and start int indices for {@link #positionListsReader}. 
- * These pre-computed values should be read-only and shared across all reader threads. - * - * Notice: - * - start int indices are NEEDED since there IS jumping within a slice in - * {@link #doAdditionalSkip()} and {@link #startCurrentDoc()}. - */ - private static final PackedLongsReaderPreComputedValues PRE_COMPUTED_VALUES = - new PackedLongsReaderPreComputedValues( - HighDFPackedIntsPostingLists.MAX_POSITION_BIT, - HighDFPackedIntsPostingLists.POSITION_SLICE_NUM_BITS_WITHOUT_HEADER, - HighDFPackedIntsPostingLists.POSITION_SLICE_SIZE_WITHOUT_HEADER, - true); - - /** - * Int block pool holding the positions for the read posting list. This is mainly used while - * reading slice headers in {@link #loadNextPositionSlice()}. - */ - private final IntBlockPool positionLists; - - /** Packed ints reader for positions. */ - private final IntBlockPoolPackedLongsReader positionListsReader; - - /** Total number of positions in the current position slice. */ - private int numPositionsInSliceTotal; - - /** - * Number of remaining positions for {@link #currentDocID}; this value is decremented every time - * {@link #nextPosition()} is called. - */ - private int numPositionsRemainingForCurrentDocID; - - /** - * Pointer to the first int, which contains the position slice header, of the next position slice. - * This value is used to track which slice will be loaded when {@link #loadNextPositionSlice()} is - * called. - */ - private int nextPositionSlicePointer; - - /** - * Create a docs and positions enumerator. - */ - public HighDFPackedIntsDocsAndPositionsEnum( - IntBlockPool skipLists, - IntBlockPool deltaFreqLists, - IntBlockPool positionLists, - int postingListPointer, - int numPostings, - boolean omitPositions) { - super(skipLists, deltaFreqLists, postingListPointer, numPostings, omitPositions); - - this.positionLists = positionLists; - this.positionListsReader = new IntBlockPoolPackedLongsReader( - positionLists, - PRE_COMPUTED_VALUES, - queryCostTracker, - QueryCostTracker.CostType.LOAD_OPTIMIZED_POSTING_BLOCK); - - // Load the first position slice. - this.nextPositionSlicePointer = skipListReader.getPositionCurrentSlicePointer(); - loadNextPositionSlice(); - } - - /** - * Prepare for current doc: - * - skipping over unread positions for the current doc. - * - reset remaining positions for current doc to {@link #currentFreq}. - * - * @see #nextDocNoDel() - */ - @Override - protected void startCurrentDoc() { - // Locate next position for current doc by skipping over unread positions from the previous doc. - if (numPositionsRemainingForCurrentDocID != 0) { - int numPositionsRemainingInSlice = - numPositionsInSliceTotal - positionListsReader.getPackedValueIndex(); - while (numPositionsRemainingInSlice <= numPositionsRemainingForCurrentDocID) { - numPositionsRemainingForCurrentDocID -= numPositionsRemainingInSlice; - nextPositionSlicePointer += HighDFPackedIntsPostingLists.SLICE_SIZE; - loadNextPositionSlice(); - numPositionsRemainingInSlice = numPositionsInSliceTotal; - } - - positionListsReader.setPackedValueIndex( - positionListsReader.getPackedValueIndex() + numPositionsRemainingForCurrentDocID); - } - - // Number of remaining positions for current doc is current freq. - numPositionsRemainingForCurrentDocID = getCurrentFreq(); - } - - /** - * Put positions reader to the start of next position slice and reset number of bits per packed - * value for next position slice. 
- */ - private void loadNextPositionSlice() { - final int header = positionLists.get(nextPositionSlicePointer); - final int bitsForPosition = HighDFPackedIntsPostingLists.getNumBitsForPosition(header); - numPositionsInSliceTotal = HighDFPackedIntsPostingLists.getNumPositionsInSlice(header); - - positionListsReader.jumpToInt( - nextPositionSlicePointer + HighDFPackedIntsPostingLists.POSITION_SLICE_HEADER_SIZE, - bitsForPosition); - } - - /** - * Return next position for current doc. - * @see org.apache.lucene.index.PostingsEnum#nextPosition() - */ - @Override - public int nextPosition() throws IOException { - // Return -1 immediately if all positions are used up for current doc. - if (numPositionsRemainingForCurrentDocID == 0) { - return -1; - } - - if (positionListsReader.getPackedValueIndex() < numPositionsInSliceTotal) { - // Read next position in current slice. - final int nextPosition = (int) positionListsReader.readPackedLong(); - numPositionsRemainingForCurrentDocID--; - return nextPosition; - } else { - // All positions in current slice is used up, load next slice. - nextPositionSlicePointer += HighDFPackedIntsPostingLists.SLICE_SIZE; - loadNextPositionSlice(); - return nextPosition(); - } - } - - /** - * Set {@link #positionListsReader} to the correct location and correct number of bits per packed - * value for the delta-freq slice on which this enum is landed after skipping. - * - * @see #skipTo(int) - */ - @Override - protected void doAdditionalSkip() { - nextPositionSlicePointer = skipListReader.getPositionCurrentSlicePointer(); - loadNextPositionSlice(); - - // Locate the exact position in slice. - final int skipListEntryEncodedMetadata = skipListReader.getEncodedMetadataCurrentSlice(); - positionListsReader.setPackedValueIndex( - HighDFPackedIntsPostingLists.getPositionOffsetInSlice(skipListEntryEncodedMetadata)); - numPositionsRemainingForCurrentDocID = 0; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsEnum.java deleted file mode 100644 index e09a2ef2b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsDocsEnum.java +++ /dev/null @@ -1,222 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -/** - * Docs and frequencies enumerator for {@link HighDFPackedIntsPostingLists}. - */ -public class HighDFPackedIntsDocsEnum extends EarlybirdOptimizedPostingsEnum { - /** - * Pre-computed shifts, masks for {@link #deltaFreqListsReader}. - * These pre-computed values should be read-only and shared across all reader threads. - * - * Notice: - * - start int indices are NOT needed since there is not jumping within a slice. - */ - private static final PackedLongsReaderPreComputedValues PRE_COMPUTED_VALUES = - new PackedLongsReaderPreComputedValues( - HighDFPackedIntsPostingLists.MAX_DOC_ID_BIT - + HighDFPackedIntsPostingLists.MAX_FREQ_BIT, - HighDFPackedIntsPostingLists.NUM_BITS_PER_SLICE, - HighDFPackedIntsPostingLists.SLICE_SIZE, - false); - - /** Packed ints reader for delta-freq pairs. */ - private final IntBlockPoolPackedLongsReader deltaFreqListsReader; - - /** Skip list reader. */ - protected final HighDFPackedIntsSkipListReader skipListReader; - - /** Number of remaining docs (delta-freq pairs) in a slice. */ - private int numDocsRemaining; - - /** - * Total number of docs (delta-freq pairs) in a slice. 
- * This value is set every time a slice is loaded in {@link #loadNextDeltaFreqSlice()}. - */ - private int numDocsInSliceTotal; - - /** - * Number of bits used for frequency in a delta-freq slice. - * This value is set every time a slice is loaded in {@link #loadNextDeltaFreqSlice()}. - */ - private int bitsForFreq; - - /** - * Frequency mask used to extract frequency from a delta-freq pair, in a delta-freq slice. - * This value is set every time a slice is loaded in {@link #loadNextDeltaFreqSlice()}. - */ - private int freqMask; - private boolean freqBitsIsZero; - - /** - * Sole constructor. - * - * @param skipLists skip lists int block pool - * @param deltaFreqLists delta-freq lists int block pool - * @param postingListPointer pointer to the posting list for which this enumerator is created - * @param numPostings number of postings in the posting list for which this enumerator is created - * @param omitPositions whether positions are omitted in the posting list of which this enumerator - * is created - */ - public HighDFPackedIntsDocsEnum( - IntBlockPool skipLists, - IntBlockPool deltaFreqLists, - int postingListPointer, - int numPostings, - boolean omitPositions) { - super(postingListPointer, numPostings); - - // Create skip list reader and get first skip entry. - this.skipListReader = new HighDFPackedIntsSkipListReader( - skipLists, postingListPointer, omitPositions); - this.skipListReader.getNextSkipEntry(); - - // Set number of remaining docs in this posting list. - this.numDocsRemaining = skipListReader.getNumDocsTotal(); - - // Create a delta-freq pair packed values reader. - this.deltaFreqListsReader = new IntBlockPoolPackedLongsReader( - deltaFreqLists, - PRE_COMPUTED_VALUES, - queryCostTracker, - QueryCostTracker.CostType.LOAD_OPTIMIZED_POSTING_BLOCK); - - loadNextDeltaFreqSlice(); - loadNextPosting(); - } - - /** - * Load next delta-freq slice, return false if all docs exhausted. - * Notice!! The caller of this method should make sure the current slice is all used up and - * {@link #numDocsRemaining} is updated accordingly. - * - * @return whether a slice is loaded. - * @see #loadNextPosting() - * @see #skipTo(int) - */ - private boolean loadNextDeltaFreqSlice() { - // Load nothing if no docs are remaining. - if (numDocsRemaining == 0) { - return false; - } - - final int encodedMetadata = skipListReader.getEncodedMetadataCurrentSlice(); - final int bitsForDelta = HighDFPackedIntsPostingLists.getNumBitsForDelta(encodedMetadata); - bitsForFreq = HighDFPackedIntsPostingLists.getNumBitsForFreq(encodedMetadata); - numDocsInSliceTotal = HighDFPackedIntsPostingLists.getNumDocsInSlice(encodedMetadata); - - freqMask = (1 << bitsForFreq) - 1; - freqBitsIsZero = bitsForFreq == 0; - - // Locate and reset the reader for this slice. - final int bitsPerPackedValue = bitsForDelta + bitsForFreq; - deltaFreqListsReader.jumpToInt( - skipListReader.getDeltaFreqCurrentSlicePointer(), bitsPerPackedValue); - return true; - } - - /** - * Load next delta-freq pair from the current slice and set the computed - * {@link #nextDocID} and {@link #nextFreq}. - */ - @Override - protected final void loadNextPosting() { - assert numDocsRemaining >= (numDocsInSliceTotal - deltaFreqListsReader.getPackedValueIndex()) - : "numDocsRemaining should be equal to or greater than number of docs remaining in slice"; - - if (deltaFreqListsReader.getPackedValueIndex() < numDocsInSliceTotal) { - // Current slice is not exhausted. 
- final long nextDeltaFreqPair = deltaFreqListsReader.readPackedLong(); - - /** - * Optimization: No need to do shifts and masks if number of bits for frequency is 0. - * Also, the stored frequency is the actual frequency - 1. - * @see - * HighDFPackedIntsPostingLists#copyPostingList(org.apache.lucene.index.PostingsEnum, int) - */ - if (freqBitsIsZero) { - nextFreq = 1; - nextDocID += (int) nextDeltaFreqPair; - } else { - nextFreq = (int) ((nextDeltaFreqPair & freqMask) + 1); - nextDocID += (int) (nextDeltaFreqPair >>> bitsForFreq); - } - - numDocsRemaining--; - } else { - // Current slice is exhausted, get next skip entry and load next slice. - skipListReader.getNextSkipEntry(); - if (loadNextDeltaFreqSlice()) { - // Next slice is loaded, load next posting again. - loadNextPosting(); - } else { - // All docs are exhausted, mark this enumerator as exhausted. - assert numDocsRemaining == 0; - nextDocID = NO_MORE_DOCS; - nextFreq = 0; - } - } - } - - /** - * Skip over slices to approach the given target as close as possible. - */ - @Override - protected final void skipTo(int target) { - assert target != NO_MORE_DOCS : "Should be handled in parent class advance method"; - - int numSlicesToSkip = 0; - int numDocsToSkip = 0; - int numDocsRemainingInSlice = numDocsInSliceTotal - deltaFreqListsReader.getPackedValueIndex(); - - // Skipping over slices. - while (skipListReader.peekPreviousDocIDNextSlice() < target) { - skipListReader.getNextSkipEntry(); - nextDocID = skipListReader.getPreviousDocIDCurrentSlice(); - numDocsToSkip += numDocsRemainingInSlice; - int header = skipListReader.getEncodedMetadataCurrentSlice(); - numDocsRemainingInSlice = HighDFPackedIntsPostingLists.getNumDocsInSlice(header); - - numSlicesToSkip++; - } - - // If skipped any slices, load the new slice. - if (numSlicesToSkip > 0) { - numDocsRemaining -= numDocsToSkip; - final boolean hasNextSlice = loadNextDeltaFreqSlice(); - assert hasNextSlice; - assert numDocsRemaining >= numDocsInSliceTotal && numDocsInSliceTotal > 0; - - // Do additional skip for the delta freq slice that was just loaded. - doAdditionalSkip(); - - loadNextPosting(); - } - } - - /** - * Subclass should override this method if want to do additional skip on its data structure. - */ - protected void doAdditionalSkip() { - // No-op in this class. - } - - /** - * Get the largest doc ID from {@link #skipListReader}. - */ - @Override - public int getLargestDocID() throws IOException { - return skipListReader.getLargestDocID(); - } - - /** - * Return {@link #numDocsRemaining} as a proxy of cost. 
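As an editorial aside (not part of the original file), the packing that loadNextPosting() above unpacks can be shown with one worked value. bitsForFreq comes from the slice's skip-list entry, and the stored frequency is the actual frequency minus one; the literal below is made up for illustration.

// Editorial sketch: unpacking one delta/freq pair, mirroring loadNextPosting().
long pair = 0b1011_010L;                     // high bits: delta = 11, low 3 bits: stored freq = 2
int bitsForFreq = 3;                         // from the skip-list entry for this slice
int freqMask = (1 << bitsForFreq) - 1;

int freq  = (int) ((pair & freqMask) + 1);   // 2 + 1 = 3 occurrences in this doc
int delta = (int) (pair >>> bitsForFreq);    // 11: gap added to the previous doc ID
// When bitsForFreq == 0 the reader skips the shifts entirely and uses freq = 1.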
- * - * @see org.apache.lucene.index.PostingsEnum#cost() - */ - @Override - public long cost() { - return numDocsRemaining; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsPostingLists.java b/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsPostingLists.java deleted file mode 100644 index bf92d814f..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsPostingLists.java +++ /dev/null @@ -1,829 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import javax.annotation.Nullable; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -/** - * An optimized posting lists implementation storing doc deltas, doc freqs, and positions as packed - * ints in a 64 ints slice backed by {@link IntBlockPool}. - * - * There are three inner data structures used to store values used by a posting lists instance: - * - * - Skip lists, used for fast {@link PostingsEnum#advance(int)}, are stored in {@link #skipLists} - * int block pool. - * - Doc deltas and freqs are stored in {@link #deltaFreqLists} int block pool. - * - Positions are stored in {@link #positionLists} int block pool. - * - * For detail layout and configuration, please refer to the Javadoc of {@link #skipLists}, - * {@link #deltaFreqLists} and {@link #positionLists}. - * - * This implementation designed for posting lists with a LARGE number of postings. - * - * Acknowledgement: the concepts of slice based packed ints encoding/decoding is borrowed - * from {@code HighDFCompressedPostinglists}, which will be deprecated due - * to not supporting positions that are greater than 255. - */ -public class HighDFPackedIntsPostingLists extends OptimizedPostingLists { - /** - * A counter used to track when positions enum is required and a posting lists instance is set - * to omit positions. - * - * @see #postings(int, int, int) - */ - private static final SearchCounter GETTING_POSITIONS_WITH_OMIT_POSITIONS = - SearchCounter.export( - "high_df_packed_ints_posting_list_getting_positions_with_omit_positions"); - - /** - * Information related to size of a slice. - */ - static final int SLICE_SIZE_BIT = 6; - static final int SLICE_SIZE = 1 << SLICE_SIZE_BIT; // 64 ints per block - static final int NUM_BITS_PER_SLICE = SLICE_SIZE * Integer.SIZE; // 2048 bits per block - - /** - * A skip list has ONE skip list header that contains 5 ints (4 ints if positions are omitted): - * - 1st int: number of skip entries in this skip list. - * - 2nd int: largest doc ID in this posting list. - * - 3rd int: number of docs in this posting list. - * - 4th int: pointer to the start of the delta-freq list of this posting list. - * - 5th int: (OPTIONAL) pointer to the start of the position list of this posting list. - */ - static final int SKIPLIST_HEADER_SIZE = 5; - static final int SKIPLIST_HEADER_SIZE_WITHOUT_POSITIONS = SKIPLIST_HEADER_SIZE - 1; - - /** - * A skip list has MANY skip entries. Each skip entry is for one slice in delta-freq list. 
- * There are 3 ints in every skip entry (2 ints if positions are omitted): - * - 1st int: last doc ID in previous slice (0 for the first slice), this is mainly used during - * skipping because deltas, not absolute doc IDs, are stored in a slice. - * - 2nd int: encoded metadata of the corresponding delta-freq slice. There are 4 piece of - * information from the LOWEST bits to HIGHEST bits of this int: - * 11 bits: number of docs (delta-freq pairs) in this slice. - * 5 bits: number of bits used to encode each freq. - * 5 bits: number of bits used to encode each delta. - * 11 bits: POSITION SLICE OFFSET: an index of number of positions; this is where the - * first position of the first doc (in this delta-freq slice) is in the - * position slice. The position slice is identified by the 3rd int below. - * These two piece information uniquely identified the location of the start - * position of this delta-freq slice. This value is always 0 if position is - * omitted. - * - 3rd int: (OPTIONAL) POSITION SLICE INDEX: an index of of number of slices; this value - * identifies the slice in which the first position of the first doc (in this - * delta-freq slice) exists. The exact location inside the position slice is identified - * by POSITION SLICE OFFSET that is stored in the 2nd int above. - * Notice: this is not the absolute address in the block pool, but instead a relative - * offset (in number of slices) on top of this term's first position slice. - * This value DOES NOT EXIST if position is omitted. - */ - static final int SKIPLIST_ENTRY_SIZE = 3; - static final int SKIPLIST_ENTRY_SIZE_WITHOUT_POSITIONS = SKIPLIST_ENTRY_SIZE - 1; - - /** - * Shifts and masks used to encode/decode metadata from the 2nd int of a skip list entry. - * @see #SKIPLIST_ENTRY_SIZE - * @see #encodeSkipListEntryMetadata(int, int, int, int) - * @see #getNumBitsForDelta(int) - * @see #getNumBitsForFreq(int) - * @see #getNumDocsInSlice(int) - * @see #getPositionOffsetInSlice(int) - */ - static final int SKIPLIST_ENTRY_POSITION_OFFSET_SHIFT = 21; - static final int SKIPLIST_ENTRY_NUM_BITS_DELTA_SHIFT = 16; - static final int SKIPLIST_ENTRY_NUM_BITS_FREQ_SHIFT = 11; - static final int SKIPLIST_ENTRY_POSITION_OFFSET_MASK = (1 << 11) - 1; - static final int SKIPLIST_ENTRY_NUM_BITS_DELTA_MASK = (1 << 5) - 1; - static final int SKIPLIST_ENTRY_NUM_BITS_FREQ_MASK = (1 << 5) - 1; - static final int SKIPLIST_ENTRY_NUM_DOCS_MASK = (1 << 11) - 1; - - /** - * Each position slice has a header that is the 1st int in this position slice. From LOWEST bits - * to HIGHEST bits, there are 2 pieces of information encoded in this single int: - * 11 bits: number of positions in this slice. - * 5 bits: number of bits used to encode each position. - */ - static final int POSITION_SLICE_HEADER_SIZE = 1; - - /** - * Information related to size of a position slice. The actual size is the same as - * {@link #SLICE_SIZE}, but there is 1 int used for position slice header. - */ - static final int POSITION_SLICE_SIZE_WITHOUT_HEADER = SLICE_SIZE - POSITION_SLICE_HEADER_SIZE; - static final int POSITION_SLICE_NUM_BITS_WITHOUT_HEADER = - POSITION_SLICE_SIZE_WITHOUT_HEADER * Integer.SIZE; - - /** - * Shifts and masks used to encode/decode metadata from the position slice header. 
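Before the position-slice header constants below, here is an editorial sketch (not part of the original file) of how the 2nd int of a skip entry is packed and unpacked according to the layout and shifts/masks just defined. The decode helpers mirror what getNumDocsInSlice() and the related accessors are documented to return; their actual bodies are not shown in this hunk, so the helper names here are illustrative.

// Editorial sketch: bit layout of a skip-list entry's metadata int.
static int encodeEntry(int positionOffset, int bitsForDelta, int bitsForFreq, int numDocs) {
  return (positionOffset << 21)    // SKIPLIST_ENTRY_POSITION_OFFSET_SHIFT
       | (bitsForDelta << 16)      // SKIPLIST_ENTRY_NUM_BITS_DELTA_SHIFT
       | (bitsForFreq << 11)       // SKIPLIST_ENTRY_NUM_BITS_FREQ_SHIFT
       | numDocs;                  // lowest 11 bits

}

static int numDocsInSlice(int encoded)        { return encoded & ((1 << 11) - 1); }
static int numBitsForFreq(int encoded)        { return (encoded >>> 11) & ((1 << 5) - 1); }
static int numBitsForDelta(int encoded)       { return (encoded >>> 16) & ((1 << 5) - 1); }
static int positionOffsetInSlice(int encoded) { return (encoded >>> 21) & ((1 << 11) - 1); }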
- * @see #POSITION_SLICE_HEADER_SIZE - * @see #encodePositionEntryHeader(int, int) - * @see #getNumPositionsInSlice(int) - * @see #getNumBitsForPosition(int) - */ - static final int POSITION_SLICE_HEADER_BITS_POSITION_SHIFT = 11; - static final int POSITION_SLICE_HEADER_BITS_POSITION_MASK = (1 << 5) - 1; - static final int POSITION_SLICE_HEADER_NUM_POSITIONS_MASK = (1 << 11) - 1; - - /** - * Stores skip list for each posting list. - * - * A skip list consists of ONE skip list header and MANY skip list entries, and each skip entry - * corresponds to one delta-freq slice. Also, unlike {@link #deltaFreqLists} and - * {@link #positionLists}, values in skip lists int pool are NOT stored in unit of slices. - * - * Example: - * H: skip list header int - * E: skip list entry int - * ': int boundary - * |: header/entry boundary (also a boundary of int) - * - * <----- skip list A -----> <- skip list B -> - * |H'H'H'H'H|E'E|E'E|E'E|E'E|H'H'H'H'H|E'E|E'E| - */ - private final IntBlockPool skipLists; - - /** - * Stores delta-freq list for each posting list. - * - * A delta-freq list consists of MANY 64-int slices, and delta-freq pairs are stored compactly - * with a fixed number of bits within a single slice. Each slice has a corresponding skip list - * entry in {@link #skipLists} storing metadata about this slice. - * - * Example: - * |: slice boundary - * - * <----------------- delta-freq list A -----------------> <--- delta-freq list B ---> - * |64 ints slice|64 ints slice|64 ints slice|64 ints slice|64 ints slice|64 ints slice| - */ - private final IntBlockPool deltaFreqLists; - - /** - * Stores position list for each posting list. - * - * A position list consists of MANY 64 ints slices, and positions are stored compactly with a - * fixed number of bits within a single slice. The first int in each slice is used as a header to - * store the metadata about this position slice. - * - * Example: - * H: position header int - * ': int boundary - * |: slice boundary - * - * <--------------- position list A ---------------> <---------- position list B ----------> - * |H'63 ints|H'63 ints|H'63 ints|H'63 ints|H'63 ints|H'63 ints|H'63 ints|H'63 ints|H'63 ints| - */ - private final IntBlockPool positionLists; - - /** - * Whether positions are omitted in this optimized posting lists. - */ - private final boolean omitPositions; - - /** - * Skip list header and entry size for this posting lists, could be different depends on whether - * position is omitted or not. - * - * @see #SKIPLIST_HEADER_SIZE - * @see #SKIPLIST_HEADER_SIZE_WITHOUT_POSITIONS - * @see #SKIPLIST_ENTRY_SIZE - * @see #SKIPLIST_ENTRY_SIZE_WITHOUT_POSITIONS - */ - private final int skipListHeaderSize; - private final int skiplistEntrySize; - - /** - * Buffer used in {@link #copyPostingList(PostingsEnum, int)} - * to queue up values needed for a slice. - * Loaded posting lists have them set as null. - */ - private final PostingsBufferQueue docFreqQueue; - private final PostingsBufferQueue positionQueue; - - /** - * Packed ints writer used to write into delta-freq int pool and position int pool. - * Loaded posting lists have them set as null. - */ - private final IntBlockPoolPackedLongsWriter deltaFreqListsWriter; - private final IntBlockPoolPackedLongsWriter positionListsWriter; - - /** - * Default constructor. - * - * @param omitPositions whether positions will be omitted in these posting lists. 
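 *
 * Minimal usage sketch (hypothetical variables postingsEnum and numPostings):
 *   HighDFPackedIntsPostingLists lists = new HighDFPackedIntsPostingLists(false);
 *   int pointer = lists.copyPostingList(postingsEnum, numPostings);
 *   EarlybirdPostingsEnum docs = lists.postings(pointer, numPostings, PostingsEnum.FREQS);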
- */ - public HighDFPackedIntsPostingLists(boolean omitPositions) { - this( - new IntBlockPool("high_df_packed_ints_skip_lists"), - new IntBlockPool("high_df_packed_ints_delta_freq_lists"), - new IntBlockPool("high_df_packed_ints_position_lists"), - omitPositions, - new PostingsBufferQueue(NUM_BITS_PER_SLICE), - new PostingsBufferQueue(POSITION_SLICE_NUM_BITS_WITHOUT_HEADER)); - } - - /** - * Constructors used by loader. - * - * @param skipLists loaded int block pool represents skip lists - * @param deltaFreqLists loaded int block pool represents delta-freq lists - * @param positionLists loaded int block pool represents position lists - * @param omitPositions whether positions will be omitted in these posting lists - * @param docFreqQueue buffer used to queue up values used for a doc freq slice, null if loaded - * @param positionQueue buffer used to queue up values used for a position slice, null if loaded - * @see FlushHandler#doLoad(FlushInfo, DataDeserializer) - */ - private HighDFPackedIntsPostingLists( - IntBlockPool skipLists, - IntBlockPool deltaFreqLists, - IntBlockPool positionLists, - boolean omitPositions, - @Nullable PostingsBufferQueue docFreqQueue, - @Nullable PostingsBufferQueue positionQueue) { - this.skipLists = skipLists; - this.deltaFreqLists = deltaFreqLists; - this.positionLists = positionLists; - this.omitPositions = omitPositions; - - this.docFreqQueue = docFreqQueue; - this.positionQueue = positionQueue; - - // docFreqQueue is null if this postingLists is loaded, - // we don't need to create writer at that case. - if (docFreqQueue == null) { - assert positionQueue == null; - this.deltaFreqListsWriter = null; - this.positionListsWriter = null; - } else { - this.deltaFreqListsWriter = new IntBlockPoolPackedLongsWriter(deltaFreqLists); - this.positionListsWriter = new IntBlockPoolPackedLongsWriter(positionLists); - } - - if (omitPositions) { - skipListHeaderSize = SKIPLIST_HEADER_SIZE_WITHOUT_POSITIONS; - skiplistEntrySize = SKIPLIST_ENTRY_SIZE_WITHOUT_POSITIONS; - } else { - skipListHeaderSize = SKIPLIST_HEADER_SIZE; - skiplistEntrySize = SKIPLIST_ENTRY_SIZE; - } - } - - /** - * A simple wrapper around assorted states used when coping positions in a posting enum. - * @see #copyPostingList(PostingsEnum, int) - */ - private static class PositionsState { - /** Max position has been seen for the current position slice. */ - private int maxPosition = 0; - - /** Bits needed to encode/decode positions in the current position slice. */ - private int bitsNeededForPosition = 0; - - /** Total number of position slices created for current posting list. */ - private int numPositionsSlices = 0; - - /** - * Whenever a slice of doc/freq pairs is written, this will point to the first position - * associated with the first doc in the doc/freq slice. - */ - private int currentPositionsSliceIndex = 0; - private int currentPositionsSliceOffset = 0; - - /** - * Whenever a new document is processed, this points to the first position for this doc. - * This is used if this doc ends up being chosen as the first doc in a doc/freq slice. - */ - private int nextPositionsSliceIndex = 0; - private int nextPositionsSliceOffset = 0; - } - - /** - * Copies postings in the given postings enum into this posting lists instance. 
- * - * @param postingsEnum enumerator of the posting list that needs to be copied - * @param numPostings number of postings in the posting list that needs to be copied - * @return pointer to the copied posting list in this posting lists instance - */ - @Override - public int copyPostingList(PostingsEnum postingsEnum, int numPostings) throws IOException { - assert docFreqQueue.isEmpty() : "each new posting list should start with an empty queue"; - assert positionQueue.isEmpty() : "each new posting list should start with an empty queue"; - - final int skipListPointer = skipLists.length(); - final int deltaFreqListPointer = deltaFreqLists.length(); - final int positionListPointer = positionLists.length(); - assert isSliceStart(deltaFreqListPointer) : "each new posting list should start at a new slice"; - assert isSliceStart(positionListPointer) : "each new posting list should start at a new slice"; - - // Make room for skip list HEADER. - for (int i = 0; i < skipListHeaderSize; i++) { - skipLists.add(-1); - } - - int doc; - int prevDoc = 0; - int prevWrittenDoc = 0; - - int maxDelta = 0; - int maxFreq = 0; - - int bitsNeededForDelta = 0; - int bitsNeededForFreq = 0; - - // Keep tracking positions related info for this posting list. - PositionsState positionsState = new PositionsState(); - - int numDocs = 0; - int numDeltaFreqSlices = 0; - while ((doc = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { - numDocs++; - - int delta = doc - prevDoc; - assert delta <= MAX_DOC_ID; - - int newBitsForDelta = bitsNeededForDelta; - if (delta > maxDelta) { - maxDelta = delta; - newBitsForDelta = log(maxDelta, 2); - assert newBitsForDelta <= MAX_DOC_ID_BIT; - } - - /** - * Optimization: store freq - 1 since a freq must be positive. Save bits and improve decoding - * speed. At read side, the read frequency will plus 1. - * @see HighDFPackedIntsDocsEnum#loadNextPosting() - */ - int freq = postingsEnum.freq() - 1; - assert freq >= 0; - - int newBitsForFreq = bitsNeededForFreq; - if (freq > maxFreq) { - maxFreq = freq; - newBitsForFreq = log(maxFreq, 2); - assert newBitsForFreq <= MAX_FREQ_BIT; - } - - // Write positions for this doc if not omit positions. - if (!omitPositions) { - writePositionsForDoc(postingsEnum, positionsState); - } - - if ((newBitsForDelta + newBitsForFreq) * (docFreqQueue.size() + 1) > NUM_BITS_PER_SLICE) { - //The latest doc does not fit into this slice. - assert (bitsNeededForDelta + bitsNeededForFreq) * docFreqQueue.size() - <= NUM_BITS_PER_SLICE; - - prevWrittenDoc = writeDeltaFreqSlice( - bitsNeededForDelta, - bitsNeededForFreq, - positionsState, - prevWrittenDoc); - numDeltaFreqSlices++; - - maxDelta = delta; - maxFreq = freq; - bitsNeededForDelta = log(maxDelta, 2); - bitsNeededForFreq = log(maxFreq, 2); - } else { - bitsNeededForDelta = newBitsForDelta; - bitsNeededForFreq = newBitsForFreq; - } - - docFreqQueue.offer(doc, freq); - - prevDoc = doc; - } - - // Some positions may be left in the buffer queue. - if (!positionQueue.isEmpty()) { - writePositionSlice(positionsState.bitsNeededForPosition); - } - - // Some docs may be left in the buffer queue. - if (!docFreqQueue.isEmpty()) { - writeDeltaFreqSlice( - bitsNeededForDelta, - bitsNeededForFreq, - positionsState, - prevWrittenDoc); - numDeltaFreqSlices++; - } - - // Write skip list header. 
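    // The header ints written below are, in order: the number of skip entries, the largest
    // doc ID, the total doc count, the delta-freq list pointer and (when positions are kept)
    // the position list pointer. For example, a term with 1000 docs spread over 3 slices whose
    // largest doc ID is 4711 yields [3, 4711, 1000, deltaFreqListPointer, positionListPointer].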
- int skipListHeaderPointer = skipListPointer; - final int numSkipListEntries = - (skipLists.length() - (skipListPointer + skipListHeaderSize)) / skiplistEntrySize; - assert numSkipListEntries == numDeltaFreqSlices - : "number of delta freq slices should be the same as number of skip list entries"; - skipLists.set(skipListHeaderPointer++, numSkipListEntries); - skipLists.set(skipListHeaderPointer++, prevDoc); - skipLists.set(skipListHeaderPointer++, numDocs); - skipLists.set(skipListHeaderPointer++, deltaFreqListPointer); - if (!omitPositions) { - skipLists.set(skipListHeaderPointer, positionListPointer); - } - - return skipListPointer; - } - - /** - * Write positions for current doc into {@link #positionLists}. - * - * @param postingsEnum postings enumerator containing the positions need to be written - * @param positionsState some states about {@link #positionLists} and {@link #positionQueue} - * @see #copyPostingList(PostingsEnum, int) - */ - private void writePositionsForDoc( - PostingsEnum postingsEnum, - PositionsState positionsState) throws IOException { - assert !omitPositions : "this method should not be called if positions are omitted"; - - for (int i = 0; i < postingsEnum.freq(); i++) { - int pos = postingsEnum.nextPosition(); - - int newBitsForPosition = positionsState.bitsNeededForPosition; - if (pos > positionsState.maxPosition) { - positionsState.maxPosition = pos; - newBitsForPosition = log(positionsState.maxPosition, 2); - assert newBitsForPosition <= MAX_POSITION_BIT; - } - - if (newBitsForPosition * (positionQueue.size() + 1) - > POSITION_SLICE_NUM_BITS_WITHOUT_HEADER - || positionQueue.isFull()) { - assert positionsState.bitsNeededForPosition * positionQueue.size() - <= POSITION_SLICE_NUM_BITS_WITHOUT_HEADER; - - writePositionSlice(positionsState.bitsNeededForPosition); - positionsState.numPositionsSlices++; - - positionsState.maxPosition = pos; - positionsState.bitsNeededForPosition = log(positionsState.maxPosition, 2); - } else { - positionsState.bitsNeededForPosition = newBitsForPosition; - } - - // Update first position pointer if this position is the first position of a doc - if (i == 0) { - positionsState.nextPositionsSliceIndex = positionsState.numPositionsSlices; - positionsState.nextPositionsSliceOffset = positionQueue.size(); - } - - // Stores a dummy doc -1 since doc is unused in position list. - positionQueue.offer(-1, pos); - } - } - - /** - * Write out all the buffered positions in {@link #positionQueue} into a position slice. - * - * @param bitsNeededForPosition number of bits used for each position in this position slice - */ - private void writePositionSlice(final int bitsNeededForPosition) { - assert !omitPositions; - assert 0 <= bitsNeededForPosition && bitsNeededForPosition <= MAX_POSITION_BIT; - - final int lengthBefore = positionLists.length(); - assert isSliceStart(lengthBefore); - - // First int in this slice stores number of bits needed for position - // and number of positions in this slice.. - positionLists.add(encodePositionEntryHeader(bitsNeededForPosition, positionQueue.size())); - - positionListsWriter.jumpToInt(positionLists.length(), bitsNeededForPosition); - while (!positionQueue.isEmpty()) { - int pos = PostingsBufferQueue.getSecondValue(positionQueue.poll()); - assert log(pos, 2) <= bitsNeededForPosition; - - positionListsWriter.writePackedInt(pos); - } - - // Fill up this slice in case it is only partially filled. 
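    // Zero padding keeps every position slice exactly SLICE_SIZE (64) ints long, which is what
    // allows a skip entry to locate a slice as positionListPointer + sliceIndex * SLICE_SIZE.
    // For example, with 7-bit positions the 63-int payload (2016 bits) holds at most
    // 2016 / 7 = 288 positions before a new slice is started.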
- while (positionLists.length() < lengthBefore + SLICE_SIZE) { - positionLists.add(0); - } - - assert positionLists.length() - lengthBefore == SLICE_SIZE; - } - - /** - * Write out all the buffered docs and frequencies in {@link #docFreqQueue} into a delta-freq - * slice and update the skip list entry of this slice. - * - * @param bitsNeededForDelta number of bits used for each delta in this delta-freq slice - * @param bitsNeededForFreq number of bits used for each freq in this delta-freq slice - * @param positionsState some states about {@link #positionLists} and {@link #positionQueue} - * @param prevWrittenDoc last doc written in previous slice - * @return last doc written in this slice - */ - private int writeDeltaFreqSlice( - final int bitsNeededForDelta, - final int bitsNeededForFreq, - final PositionsState positionsState, - final int prevWrittenDoc) { - assert 0 <= bitsNeededForDelta && bitsNeededForDelta <= MAX_DOC_ID_BIT; - assert 0 <= bitsNeededForFreq && bitsNeededForFreq <= MAX_FREQ_BIT; - - final int lengthBefore = deltaFreqLists.length(); - assert isSliceStart(lengthBefore); - - writeSkipListEntry(prevWrittenDoc, bitsNeededForDelta, bitsNeededForFreq, positionsState); - - // Keep track of previous docID so that we compute the docID deltas. - int prevDoc = prevWrittenDoc; - - // A pair is stored as a packed value. - final int bitsPerPackedValue = bitsNeededForDelta + bitsNeededForFreq; - deltaFreqListsWriter.jumpToInt(deltaFreqLists.length(), bitsPerPackedValue); - while (!docFreqQueue.isEmpty()) { - long value = docFreqQueue.poll(); - int doc = PostingsBufferQueue.getDocID(value); - int delta = doc - prevDoc; - assert log(delta, 2) <= bitsNeededForDelta; - - int freq = PostingsBufferQueue.getSecondValue(value); - assert log(freq, 2) <= bitsNeededForFreq; - - // Cast the delta to long before left shift to avoid overflow. - final long deltaFreqPair = (((long) delta) << bitsNeededForFreq) + freq; - deltaFreqListsWriter.writePackedLong(deltaFreqPair); - prevDoc = doc; - } - - // Fill up this slice in case it is only partially filled. - while (deltaFreqLists.length() < lengthBefore + SLICE_SIZE) { - deltaFreqLists.add(0); - } - - positionsState.currentPositionsSliceIndex = positionsState.nextPositionsSliceIndex; - positionsState.currentPositionsSliceOffset = positionsState.nextPositionsSliceOffset; - - assert deltaFreqLists.length() - lengthBefore == SLICE_SIZE; - return prevDoc; - } - - /** - * Write the skip list entry for a delta-freq slice. 
- * - * @param prevWrittenDoc last doc written in previous slice - * @param bitsNeededForDelta number of bits used for each delta in this delta-freq slice - * @param bitsNeededForFreq number of bits used for each freq in this delta-freq slice - * @param positionsState some states about {@link #positionLists} and {@link #positionQueue} - * @see #writeDeltaFreqSlice(int, int, PositionsState, int) - * @see #SKIPLIST_ENTRY_SIZE - */ - private void writeSkipListEntry( - int prevWrittenDoc, - int bitsNeededForDelta, - int bitsNeededForFreq, - PositionsState positionsState) { - // 1st int: last written doc ID in previous slice - skipLists.add(prevWrittenDoc); - - // 2nd int: encoded metadata - skipLists.add( - encodeSkipListEntryMetadata( - positionsState.currentPositionsSliceOffset, - bitsNeededForDelta, - bitsNeededForFreq, - docFreqQueue.size())); - - // 3rd int: optional, position slice index - if (!omitPositions) { - skipLists.add(positionsState.currentPositionsSliceIndex); - } - } - - /** - * Create and return a docs enumerator or docs-positions enumerator based on input flag. - * - * @see org.apache.lucene.index.PostingsEnum - */ - @Override - public EarlybirdPostingsEnum postings( - int postingListPointer, int numPostings, int flags) throws IOException { - // Positions are omitted but position enumerator are requried. - if (omitPositions && PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS)) { - GETTING_POSITIONS_WITH_OMIT_POSITIONS.increment(); - } - - if (!omitPositions && PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS)) { - return new HighDFPackedIntsDocsAndPositionsEnum( - skipLists, - deltaFreqLists, - positionLists, - postingListPointer, - numPostings, - false); - } else { - return new HighDFPackedIntsDocsEnum( - skipLists, - deltaFreqLists, - postingListPointer, - numPostings, - omitPositions); - } - } - - /****************************************************** - * Skip list entry encoded data encoding and decoding * - ******************************************************/ - - /** - * Encode a skip list entry metadata, which is stored in the 2nd int of the skip list entry. - * - * @see #SKIPLIST_ENTRY_SIZE - */ - private static int encodeSkipListEntryMetadata( - int positionOffsetInSlice, int numBitsForDelta, int numBitsForFreq, int numDocsInSlice) { - assert 0 <= positionOffsetInSlice - && positionOffsetInSlice < POSITION_SLICE_NUM_BITS_WITHOUT_HEADER; - assert 0 <= numBitsForDelta && numBitsForDelta <= MAX_DOC_ID_BIT; - assert 0 <= numBitsForFreq && numBitsForFreq <= MAX_FREQ_BIT; - assert 0 < numDocsInSlice && numDocsInSlice <= NUM_BITS_PER_SLICE; - return (positionOffsetInSlice << SKIPLIST_ENTRY_POSITION_OFFSET_SHIFT) - + (numBitsForDelta << SKIPLIST_ENTRY_NUM_BITS_DELTA_SHIFT) - + (numBitsForFreq << SKIPLIST_ENTRY_NUM_BITS_FREQ_SHIFT) - // stores numDocsInSlice - 1 to avoid over flow since numDocsInSlice ranges in [1, 2048] - // and 11 bits are used to store number docs in slice - + (numDocsInSlice - 1); - } - - /** - * Decode POSITION_SLICE_OFFSET of the delta-freq slice having the given skip entry encoded data. - * - * @see #SKIPLIST_ENTRY_SIZE - */ - static int getPositionOffsetInSlice(int skipListEntryEncodedMetadata) { - return (skipListEntryEncodedMetadata >>> SKIPLIST_ENTRY_POSITION_OFFSET_SHIFT) - & SKIPLIST_ENTRY_POSITION_OFFSET_MASK; - } - - /** - * Decode number of bits used for delta in the slice having the given skip entry encoded data. 
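 *
 * Illustrative example (hypothetical values): a slice of 170 docs with 9-bit deltas, 3-bit
 * freqs and position offset 0 is encoded as (0 << 21) + (9 << 16) + (3 << 11) + (170 - 1)
 * = 596137; this method then returns (596137 >>> 16) & 0x1F = 9, while getNumBitsForFreq
 * and getNumDocsInSlice recover 3 and 170 respectively.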
- * - * @see #SKIPLIST_ENTRY_SIZE - */ - static int getNumBitsForDelta(int skipListEntryEncodedMetadata) { - return (skipListEntryEncodedMetadata >>> SKIPLIST_ENTRY_NUM_BITS_DELTA_SHIFT) - & SKIPLIST_ENTRY_NUM_BITS_DELTA_MASK; - } - - /** - * Decode number of bits used for freqs in the slice having the given skip entry encoded data. - * - * @see #SKIPLIST_ENTRY_SIZE - */ - static int getNumBitsForFreq(int skipListEntryEncodedMetadata) { - return (skipListEntryEncodedMetadata >>> SKIPLIST_ENTRY_NUM_BITS_FREQ_SHIFT) - & SKIPLIST_ENTRY_NUM_BITS_FREQ_MASK; - } - - /** - * Decode number of delta-freq pairs stored in the slice having the given skip entry encoded data. - * - * @see #SKIPLIST_ENTRY_SIZE - */ - static int getNumDocsInSlice(int skipListEntryEncodedMetadata) { - /** - * Add 1 to the decode value since the stored value is subtracted by 1. - * @see #encodeSkipListEntryMetadata(int, int, int, int) - */ - return (skipListEntryEncodedMetadata & SKIPLIST_ENTRY_NUM_DOCS_MASK) + 1; - } - - /***************************************************** - * Position slice entry header encoding and decoding * - *****************************************************/ - - /** - * Encode a position slice entry header. - * - * @param numBitsForPosition number of bits used to encode positions in this slice. - * @param numPositionsInSlice number of positions in this slice. - * @return an int as the encoded header. - * @see #POSITION_SLICE_HEADER_SIZE - */ - private static int encodePositionEntryHeader(int numBitsForPosition, int numPositionsInSlice) { - assert 0 <= numBitsForPosition && numBitsForPosition <= MAX_POSITION_BIT; - assert 0 < numPositionsInSlice && numPositionsInSlice <= POSITION_SLICE_NUM_BITS_WITHOUT_HEADER; - return (numBitsForPosition << POSITION_SLICE_HEADER_BITS_POSITION_SHIFT) + numPositionsInSlice; - } - - /** - * Decode number of bits used for position in the slice having the given header. - * - * @param positionEntryHeader entry header will be decoded. - * @see #POSITION_SLICE_HEADER_SIZE - */ - static int getNumBitsForPosition(int positionEntryHeader) { - return (positionEntryHeader >>> POSITION_SLICE_HEADER_BITS_POSITION_SHIFT) - & POSITION_SLICE_HEADER_BITS_POSITION_MASK; - } - - /** - * Decode number of positions stored in the slice having the given header. - * - * @param positionEntryHeader entry header will be decoded. - * @see #POSITION_SLICE_HEADER_SIZE - */ - static int getNumPositionsInSlice(int positionEntryHeader) { - return positionEntryHeader & POSITION_SLICE_HEADER_NUM_POSITIONS_MASK; - } - - /****************** - * Helper methods * - ******************/ - - /** - * Check if given pointer is pointing to the slice start. - * - * @param pointer the index will be checked. - */ - static boolean isSliceStart(int pointer) { - return pointer % HighDFPackedIntsPostingLists.SLICE_SIZE == 0; - } - - /** - * Ceil of log of x in the given base. - * - * @return x == 0 ? 
0 : Math.ceil(Math.log(x) / Math.log(base)) - */ - private static int log(int x, int base) { - assert base >= 2; - if (x == 0) { - return 0; - } - int ret = 1; - long n = base; // needs to be a long to avoid overflow - while (x >= n) { - n *= base; - ret++; - } - return ret; - } - - /********************** - * For flush and load * - **********************/ - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String OMIT_POSITIONS_PROP_NAME = "omitPositions"; - private static final String SKIP_LISTS_PROP_NAME = "skipLists"; - private static final String DELTA_FREQ_LISTS_PROP_NAME = "deltaFreqLists"; - private static final String POSITION_LISTS_PROP_NAME = "positionLists"; - - public FlushHandler() { - super(); - } - - public FlushHandler(HighDFPackedIntsPostingLists objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - HighDFPackedIntsPostingLists objectToFlush = getObjectToFlush(); - flushInfo.addBooleanProperty(OMIT_POSITIONS_PROP_NAME, objectToFlush.omitPositions); - objectToFlush.skipLists.getFlushHandler() - .flush(flushInfo.newSubProperties(SKIP_LISTS_PROP_NAME), out); - objectToFlush.deltaFreqLists.getFlushHandler() - .flush(flushInfo.newSubProperties(DELTA_FREQ_LISTS_PROP_NAME), out); - objectToFlush.positionLists.getFlushHandler() - .flush(flushInfo.newSubProperties(POSITION_LISTS_PROP_NAME), out); - } - - @Override - protected HighDFPackedIntsPostingLists doLoad( - FlushInfo flushInfo, DataDeserializer in) throws IOException { - IntBlockPool skipLists = (new IntBlockPool.FlushHandler()) - .load(flushInfo.getSubProperties(SKIP_LISTS_PROP_NAME), in); - IntBlockPool deltaFreqLists = (new IntBlockPool.FlushHandler()) - .load(flushInfo.getSubProperties(DELTA_FREQ_LISTS_PROP_NAME), in); - IntBlockPool positionLists = (new IntBlockPool.FlushHandler()) - .load(flushInfo.getSubProperties(POSITION_LISTS_PROP_NAME), in); - return new HighDFPackedIntsPostingLists( - skipLists, - deltaFreqLists, - positionLists, - flushInfo.getBooleanProperty(OMIT_POSITIONS_PROP_NAME), - null, - null); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsSkipListReader.java b/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsSkipListReader.java deleted file mode 100644 index 7f6f04f47..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/HighDFPackedIntsSkipListReader.java +++ /dev/null @@ -1,200 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import org.apache.lucene.search.DocIdSetIterator; - -/** - * A skip list reader of a single term used {@link HighDFPackedIntsDocsEnum}. - * @see HighDFPackedIntsPostingLists - */ -class HighDFPackedIntsSkipListReader { - /** Skip lists int pool. */ - private final IntBlockPool skipLists; - - /** Whether positions are omitted in the posting list having the read skip list. */ - private final boolean omitPositions; - - /** - * Last doc in the previous slice relative to the current delta-freq slice. This value is 0 if - * the current slice is the first delta-freq slice. 
- */ - private int previousDocIDCurrentSlice; - - /** Encoded metadata of the current delta-freq slice.*/ - private int encodedMetadataCurrentSlice; - - /** - * Pointer to the first int (contains the position slice header) of the position slice that has - * the first position of the first doc in the current delta-freq slice. - */ - private int positionCurrentSliceIndex; - - /** Pointer to the first int in the current delta-freq slice. */ - private int deltaFreqCurrentSlicePointer; - - /** Data of next slice. */ - private int previousDocIDNextSlice; - private int encodedMetadataNextSlice; - private int positionNextSliceIndex; - private int deltaFreqNextSlicePointer; - - /** Used to load blocks and read ints from skip lists int pool. */ - private int[] currentSkipListBlock; - private int skipListBlockStart; - private int skipListBlockIndex; - - /** Number of remaining skip entries for the read skip list. */ - private int numSkipListEntriesRemaining; - - /** Largest doc ID in the posting list having the read skip list. */ - private final int largestDocID; - - /** Pointer to the first int in the first slice that stores positions for this term. */ - private final int positionListPointer; - - /** Total number of docs in the posting list having the read skip list. */ - private final int numDocsTotal; - - /** - * Create a skip list reader specified by the given skip list pointer in the given skip lists int - * pool. - * - * @param skipLists int pool where the read skip list exists - * @param skipListPointer pointer to the read skip list - * @param omitPositions whether positions are omitted in the positing list to which the read skip - * list belongs - */ - public HighDFPackedIntsSkipListReader( - final IntBlockPool skipLists, - final int skipListPointer, - final boolean omitPositions) { - this.skipLists = skipLists; - this.omitPositions = omitPositions; - - this.skipListBlockStart = IntBlockPool.getBlockStart(skipListPointer); - this.skipListBlockIndex = IntBlockPool.getOffsetInBlock(skipListPointer); - this.currentSkipListBlock = skipLists.getBlock(skipListBlockStart); - - // Read skip list header. - this.numSkipListEntriesRemaining = readNextValueFromSkipListBlock(); - this.largestDocID = readNextValueFromSkipListBlock(); - this.numDocsTotal = readNextValueFromSkipListBlock(); - int deltaFreqListPointer = readNextValueFromSkipListBlock(); - this.positionListPointer = omitPositions ? -1 : readNextValueFromSkipListBlock(); - - // Set it back by one slice for fetchNextSkipEntry() to advance correctly. - this.deltaFreqNextSlicePointer = deltaFreqListPointer - HighDFPackedIntsPostingLists.SLICE_SIZE; - fetchNextSkipEntry(); - } - - /** - * Load already fetched data in next skip entry into current data variables, and pre-fetch again. - */ - public void getNextSkipEntry() { - previousDocIDCurrentSlice = previousDocIDNextSlice; - encodedMetadataCurrentSlice = encodedMetadataNextSlice; - positionCurrentSliceIndex = positionNextSliceIndex; - deltaFreqCurrentSlicePointer = deltaFreqNextSlicePointer; - fetchNextSkipEntry(); - } - - /** - * Fetch data for next skip entry if skip list is not exhausted; otherwise, set docIDNextSlice - * to NO_MORE_DOCS. 
- */ - private void fetchNextSkipEntry() { - if (numSkipListEntriesRemaining == 0) { - previousDocIDNextSlice = DocIdSetIterator.NO_MORE_DOCS; - return; - } - - previousDocIDNextSlice = readNextValueFromSkipListBlock(); - encodedMetadataNextSlice = readNextValueFromSkipListBlock(); - if (!omitPositions) { - positionNextSliceIndex = readNextValueFromSkipListBlock(); - } - deltaFreqNextSlicePointer += HighDFPackedIntsPostingLists.SLICE_SIZE; - numSkipListEntriesRemaining--; - } - - /************************************** - * Getters of data in skip list entry * - **************************************/ - - /** - * In the context of a current slice, this is the docID of the last document in the previous - * slice (or 0 if the current slice is the first slice). - * - * @see HighDFPackedIntsPostingLists#SKIPLIST_ENTRY_SIZE - */ - public int getPreviousDocIDCurrentSlice() { - return previousDocIDCurrentSlice; - } - - /** - * Get the encoded metadata of the current delta-freq slice. - * - * @see HighDFPackedIntsPostingLists#SKIPLIST_ENTRY_SIZE - */ - public int getEncodedMetadataCurrentSlice() { - return encodedMetadataCurrentSlice; - } - - /** - * Get the pointer to the first int, WHICH CONTAINS THE POSITION SLICE HEADER, of the position - * slice that contains the first position of the first doc in the delta-freq slice that - * is corresponding to the current skip list entry. - * - * @see HighDFPackedIntsPostingLists#SKIPLIST_ENTRY_SIZE - */ - public int getPositionCurrentSlicePointer() { - assert !omitPositions; - return positionListPointer - + positionCurrentSliceIndex * HighDFPackedIntsPostingLists.SLICE_SIZE; - } - - /** - * Get the pointer to the first int in the current delta-freq slice. - */ - public int getDeltaFreqCurrentSlicePointer() { - return deltaFreqCurrentSlicePointer; - } - - /** - * In the context of next slice, get the last doc ID in the previous slice. This is used to skip - * over slices. 
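 *
 * For example (assumed usage), a docs enum advancing towards a target doc ID can keep calling
 * getNextSkipEntry() while peekPreviousDocIDNextSlice() is smaller than the target, stopping
 * at the first slice whose doc ID range may still contain it.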
- * - * @see HighDFPackedIntsDocsEnum#skipTo(int) - */ - public int peekPreviousDocIDNextSlice() { - return previousDocIDNextSlice; - } - - /*************************************** - * Getters of data in skip list header * - ***************************************/ - - public int getLargestDocID() { - return largestDocID; - } - - public int getNumDocsTotal() { - return numDocsTotal; - } - - /*************************************************** - * Methods helping loading int block and read ints * - ***************************************************/ - - private int readNextValueFromSkipListBlock() { - if (skipListBlockIndex == IntBlockPool.BLOCK_SIZE) { - loadSkipListBlock(); - } - return currentSkipListBlock[skipListBlockIndex++]; - } - - private void loadSkipListBlock() { - skipListBlockStart += IntBlockPool.BLOCK_SIZE; - currentSkipListBlock = skipLists.getBlock(skipListBlockStart); - skipListBlockIndex = 0; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/InMemoryFields.java b/src/java/com/twitter/search/core/earlybird/index/inverted/InMemoryFields.java deleted file mode 100644 index dad877614..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/InMemoryFields.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.util.HashMap; -import java.util.Iterator; -import java.util.Map; - -import org.apache.lucene.index.Fields; -import org.apache.lucene.index.Terms; - -public class InMemoryFields extends Fields { - private final Map termsCache = new HashMap<>(); - private final Map perFields; - private final Map pointerIndex; - - /** - * Returns a new {@link Fields} instance for the provided {@link InvertedIndex}es. - */ - public InMemoryFields(Map perFields, - Map pointerIndex) { - this.perFields = perFields; - this.pointerIndex = pointerIndex; - } - - @Override - public Iterator iterator() { - return perFields.keySet().iterator(); - } - - @Override - public Terms terms(String field) { - InvertedIndex invertedIndex = perFields.get(field); - if (invertedIndex == null) { - return null; - } - - return termsCache.computeIfAbsent(invertedIndex, - index -> index.createTerms(pointerIndex.getOrDefault(invertedIndex, -1))); - } - - @Override - public int size() { - return perFields.size(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/IndexOptimizer.java b/src/java/com/twitter/search/core/earlybird/index/inverted/IndexOptimizer.java deleted file mode 100644 index 3fc082a47..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/IndexOptimizer.java +++ /dev/null @@ -1,201 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.HashMap; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetUtil; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import 
com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentData; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.column.DocValuesManager; - -public final class IndexOptimizer { - private static final Logger LOG = LoggerFactory.getLogger(IndexOptimizer.class); - - private IndexOptimizer() { - } - - /** - * Optimizes this in-memory index segment. - */ - public static EarlybirdRealtimeIndexSegmentData optimize( - EarlybirdRealtimeIndexSegmentData source) throws IOException { - LOG.info("Starting index optimizing."); - - ConcurrentHashMap targetMap = new ConcurrentHashMap<>(); - LOG.info(String.format( - "Source PerFieldMap size is %d", source.getPerFieldMap().size())); - - LOG.info("Optimize doc id mapper."); - // Optimize the doc ID mapper first. - DocIDToTweetIDMapper originalTweetIdMapper = source.getDocIDToTweetIDMapper(); - DocIDToTweetIDMapper optimizedTweetIdMapper = originalTweetIdMapper.optimize(); - - TimeMapper optimizedTimeMapper = - source.getTimeMapper() != null - ? source.getTimeMapper().optimize(originalTweetIdMapper, optimizedTweetIdMapper) - : null; - - // Some fields have their terms rewritten to support the minimal perfect hash function we use - // (note that it's a minimal perfect hash function, not a minimal perfect hash _table_). - // The FacetCountingArray stores term IDs. This is a map from the facet field ID to a map from - // original term ID to the new, MPH term IDs. - Map termIDMapper = new HashMap<>(); - - LOG.info("Optimize inverted indexes."); - optimizeInvertedIndexes( - source, targetMap, originalTweetIdMapper, optimizedTweetIdMapper, termIDMapper); - - LOG.info("Rewrite and map ids in facet counting array."); - AbstractFacetCountingArray facetCountingArray = source.getFacetCountingArray().rewriteAndMapIDs( - termIDMapper, originalTweetIdMapper, optimizedTweetIdMapper); - - Map facetLabelProviders = - FacetUtil.getFacetLabelProviders(source.getSchema(), targetMap); - - LOG.info("Optimize doc values manager."); - DocValuesManager optimizedDocValuesManager = - source.getDocValuesManager().optimize(originalTweetIdMapper, optimizedTweetIdMapper); - - LOG.info("Optimize deleted docs."); - DeletedDocs optimizedDeletedDocs = - source.getDeletedDocs().optimize(originalTweetIdMapper, optimizedTweetIdMapper); - - final boolean isOptimized = true; - return new EarlybirdRealtimeIndexSegmentData( - source.getMaxSegmentSize(), - source.getTimeSliceID(), - source.getSchema(), - isOptimized, - optimizedTweetIdMapper.getNextDocID(Integer.MIN_VALUE), - targetMap, - facetCountingArray, - optimizedDocValuesManager, - facetLabelProviders, - source.getFacetIDMap(), - optimizedDeletedDocs, - optimizedTweetIdMapper, - optimizedTimeMapper, - source.getIndexExtensionsData()); - } - - private static void optimizeInvertedIndexes( - EarlybirdRealtimeIndexSegmentData source, - ConcurrentHashMap targetMap, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper, - Map termIDMapper - ) throws IOException { - for (Map.Entry entry : source.getPerFieldMap().entrySet()) { - String fieldName = entry.getKey(); - Preconditions.checkState(entry.getValue() instanceof InvertedRealtimeIndex); - InvertedRealtimeIndex sourceIndex = (InvertedRealtimeIndex) entry.getValue(); - EarlybirdFieldType fieldType = source.getSchema().getFieldInfo(fieldName).getFieldType(); - - InvertedIndex newIndex; - if (fieldType.becomesImmutable() && sourceIndex.getNumTerms() > 0) { - Schema.FieldInfo facetField = 
source.getSchema().getFacetFieldByFieldName(fieldName); - - newIndex = new OptimizedMemoryIndex( - fieldType, - fieldName, - sourceIndex, - termIDMapper, - source.getFacetIDMap().getFacetField(facetField), - originalTweetIdMapper, - optimizedTweetIdMapper); - } else { - newIndex = optimizeMutableIndex( - fieldType, - fieldName, - sourceIndex, - originalTweetIdMapper, - optimizedTweetIdMapper); - } - - targetMap.put(fieldName, newIndex); - } - } - - /** - * Optimize a mutable index. - */ - private static InvertedIndex optimizeMutableIndex( - EarlybirdFieldType fieldType, - String fieldName, - InvertedRealtimeIndex originalIndex, - DocIDToTweetIDMapper originalMapper, - DocIDToTweetIDMapper optimizedMapper - ) throws IOException { - Preconditions.checkState(!fieldType.isStorePerPositionPayloads()); - TermsEnum allTerms = originalIndex.createTermsEnum(originalIndex.getMaxPublishedPointer()); - - int numTerms = originalIndex.getNumTerms(); - - InvertedRealtimeIndex index = new InvertedRealtimeIndex( - fieldType, - TermPointerEncoding.DEFAULT_ENCODING, - fieldName); - index.setNumDocs(originalIndex.getNumDocs()); - - for (int termID = 0; termID < numTerms; termID++) { - allTerms.seekExact(termID); - PostingsEnum postingsEnum = new OptimizingPostingsEnumWrapper( - allTerms.postings(null), originalMapper, optimizedMapper); - - BytesRef termPayload = originalIndex.getLabelAccessor().getTermPayload(termID); - copyPostingList(index, postingsEnum, termID, allTerms.term(), termPayload); - } - return index; - } - - - /** - * Copies the given posting list into these posting lists. - * - * @param postingsEnum enumerator of the posting list that needs to be copied - */ - private static void copyPostingList( - InvertedRealtimeIndex index, - PostingsEnum postingsEnum, - int termID, - BytesRef term, - BytesRef termPayload - ) throws IOException { - int docId; - while ((docId = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { - index.incrementSumTermDocFreq(); - for (int i = 0; i < postingsEnum.freq(); i++) { - index.incrementSumTotalTermFreq(); - int position = postingsEnum.nextPosition(); - int newTermID = InvertedRealtimeIndexWriter.indexTerm( - index, - term, - docId, - position, - termPayload, - null, // We know that fields that remain mutable never have a posting payload. - TermPointerEncoding.DEFAULT_ENCODING); - - // Our term lookups are very slow, so we cache term dictionaries for some fields across many - // segments, so we must keep the term IDs the same while remapping. - Preconditions.checkState(newTermID == termID); - } - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPool.java b/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPool.java deleted file mode 100644 index bf85c8765..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPool.java +++ /dev/null @@ -1,225 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Arrays; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -// Modeled after TwitterCharBlockPool, with a lot of simplification. 
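// Minimal usage sketch (hypothetical pool name and value): the int returned by add() is a
// global index that decomposes into a block and an offset within that block.
//
//   IntBlockPool pool = new IntBlockPool("example_pool");
//   int index = pool.add(42);                            // (blockIndex << 14) + offsetInBlock
//   int[] block = pool.getBlock(IntBlockPool.getBlockStart(index));
//   int offset = IntBlockPool.getOffsetInBlock(index);   // index & (BLOCK_SIZE - 1)
//   assert block[offset] == 42 && pool.get(index) == 42;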
-public class IntBlockPool implements Flushable { - private static final SearchLongGauge INT_BLOCK_POOL_MAX_LENGTH = - SearchLongGauge.export("twitter_int_block_pool_max_size"); - private static final String STAT_PREFIX = "twitter_int_block_pool_size_"; - - private static final int BLOCK_SHIFT = 14; - public static final int BLOCK_SIZE = 1 << BLOCK_SHIFT; - private static final int BLOCK_MASK = BLOCK_SIZE - 1; - - // We can address up to 2^31 elements with an int. We use 1 << 14 bits for the block offset, - // so we can use the remaining 17 bits for the blocks index. Therefore the maximum number of - // addressable blocks is 1 << 17 or maxInt >> 14. - private static final int MAX_NUM_BLOCKS = Integer.MAX_VALUE >> BLOCK_SHIFT; - - // Initial value written into the blocks. - private final int initialValue; - - // Extra object with final array is necessary to guarantee visibility - // to other threads without synchronization / volatiles. See comment - // in TwitterCharBlockPool. - public static final class Pool { - public final int[][] blocks; - Pool(int[][] blocks) { - this.blocks = blocks; - - // Adjust max size if exceeded maximum value. - synchronized (INT_BLOCK_POOL_MAX_LENGTH) { - if (this.blocks != null) { - final long currentSize = (long) (this.blocks.length * BLOCK_SIZE); - if (currentSize > INT_BLOCK_POOL_MAX_LENGTH.get()) { - INT_BLOCK_POOL_MAX_LENGTH.set(currentSize); - } - } - } - } - } - public Pool pool; - - private int currBlockIndex; // Index into blocks array. - private int[] currBlock = null; - private int currBlockOffset; // Index into current block. - private final String poolName; - private final SearchLongGauge sizeGauge; - - public IntBlockPool(String poolName) { - this(0, poolName); - } - - public IntBlockPool(int initialValue, String poolName) { - // Start with room for 16 initial blocks (does not allocate these blocks). - this.pool = new Pool(new int[16][]); - this.initialValue = initialValue; - - // Start at the end of a previous, non-existent blocks. - this.currBlockIndex = -1; - this.currBlock = null; - this.currBlockOffset = BLOCK_SIZE; - this.poolName = poolName; - this.sizeGauge = createGauge(poolName, pool); - } - - // Constructor for FlushHandler. - protected IntBlockPool( - int currBlockIndex, - int currBlockOffset, - int[][]blocks, - String poolName) { - this.initialValue = 0; - this.pool = new Pool(blocks); - this.currBlockIndex = currBlockIndex; - this.currBlockOffset = currBlockOffset; - if (currBlockIndex >= 0) { - this.currBlock = this.pool.blocks[currBlockIndex]; - } - this.poolName = poolName; - this.sizeGauge = createGauge(poolName, pool); - } - - private static SearchLongGauge createGauge(String suffix, Pool pool) { - SearchLongGauge gauge = SearchLongGauge.export(STAT_PREFIX + suffix); - if (pool.blocks != null) { - gauge.set(pool.blocks.length * BLOCK_SIZE); - } - return gauge; - } - - /** - * Adds an int to the current block and returns it's overall index. - */ - public int add(int value) { - if (currBlockOffset == BLOCK_SIZE) { - newBlock(); - } - currBlock[currBlockOffset++] = value; - return (currBlockIndex << BLOCK_SHIFT) + currBlockOffset - 1; - } - - // Returns number of ints in this blocks - public int length() { - return currBlockOffset + currBlockIndex * BLOCK_SIZE; - } - - // Gets an int from the specified index. 
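  // The index is a global position across all blocks: blockIndex = index >>> 14 and
  // offset = index & 0x3FFF, so e.g. index 50000 maps to block 3, offset 848.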
- public final int get(int index) { - return getBlock(index)[getOffsetInBlock(index)]; - } - - public static int getBlockStart(int index) { - return (index >>> BLOCK_SHIFT) * BLOCK_SIZE; - } - - public static int getOffsetInBlock(int index) { - return index & BLOCK_MASK; - } - - public final int[] getBlock(int index) { - final int blockIndex = index >>> BLOCK_SHIFT; - return pool.blocks[blockIndex]; - } - - // Sets an int value at the specified index. - public void set(int index, int value) { - final int blockIndex = index >>> BLOCK_SHIFT; - final int offset = index & BLOCK_MASK; - pool.blocks[blockIndex][offset] = value; - } - - /** - * Evaluates whether two instances of IntBlockPool are equal by value. It is - * slow because it has to check every element in the pool. - */ - @VisibleForTesting - public boolean verySlowEqualsForTests(IntBlockPool that) { - if (length() != that.length()) { - return false; - } - - for (int i = 0; i < length(); i++) { - if (get(i) != that.get(i)) { - return false; - } - } - - return true; - } - - private void newBlock() { - final int newBlockIndex = 1 + currBlockIndex; - if (newBlockIndex >= MAX_NUM_BLOCKS) { - throw new RuntimeException( - "Too many blocks, would overflow int index for blocks " + poolName); - } - if (newBlockIndex == pool.blocks.length) { - // Blocks array is too small to add a new block. Resize. - int[][] newBlocks = new int[pool.blocks.length * 2][]; - System.arraycopy(pool.blocks, 0, newBlocks, 0, pool.blocks.length); - pool = new Pool(newBlocks); - - sizeGauge.set(pool.blocks.length * BLOCK_SIZE); - } - - currBlock = pool.blocks[newBlockIndex] = allocateBlock(); - currBlockOffset = 0; - currBlockIndex = newBlockIndex; - } - - private int[] allocateBlock() { - int[] block = new int[BLOCK_SIZE]; - Arrays.fill(block, initialValue); - return block; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String CURRENT_BLOCK_INDEX_PROP_NAME = "currentBlockIndex"; - private static final String CURRENT_BLOCK_OFFSET_PROP_NAME = "currentBlockOffset"; - private static final String POOL_NAME = "poolName"; - - public FlushHandler() { - super(); - } - - public FlushHandler(IntBlockPool objToFlush) { - super(objToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - IntBlockPool pool = getObjectToFlush(); - flushInfo.addIntProperty(CURRENT_BLOCK_INDEX_PROP_NAME, pool.currBlockIndex); - flushInfo.addIntProperty(CURRENT_BLOCK_OFFSET_PROP_NAME, pool.currBlockOffset); - flushInfo.addStringProperty(POOL_NAME, pool.poolName); - out.writeIntArray2D(pool.pool.blocks, pool.currBlockIndex + 1); - } - - @Override - protected IntBlockPool doLoad(FlushInfo flushInfo, DataDeserializer in) throws IOException { - String poolName = flushInfo.getStringProperty(POOL_NAME); - return new IntBlockPool( - flushInfo.getIntProperty(CURRENT_BLOCK_INDEX_PROP_NAME), - flushInfo.getIntProperty(CURRENT_BLOCK_OFFSET_PROP_NAME), - in.readIntArray2D(), - poolName); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsReader.java b/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsReader.java deleted file mode 100644 index 5edf92f77..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsReader.java +++ /dev/null @@ -1,253 +0,0 @@ 
-package com.twitter.search.core.earlybird.index.inverted; - -import javax.annotation.Nullable; - -/** - * A packed ints reader reading packed values (int/long) written in {@link IntBlockPool}. - * @see IntBlockPoolPackedLongsWriter - * - * A standard usage would be : - * - set reader at an int block pool pointer and number of bits per packed value: - * {@link #jumpToInt(int, int)}} - * - read: {@link #readPackedLong()} - * - * Example usage: - * @see HighDFPackedIntsDocsEnum - * @see HighDFPackedIntsDocsAndPositionsEnum - */ -public final class IntBlockPoolPackedLongsReader { - /** - * Mask used to convert an int to a long. We cannot just cast because it will fill in the higher - * 32 bits with the sign bit, but we need the higher 32 bits to be 0 instead. - */ - private static final long LONG_MASK = 0xFFFFFFFFL; - - /** The int block pool from which packed ints will be read. */ - private final IntBlockPool intBlockPool; - - /** Pre-computed shifts, masks, and start int indices used to decode packed ints. */ - private final PackedLongsReaderPreComputedValues preComputedValues; - - /** - * The underlying {@link #intBlockPool} will be read block by blocks. The current read - * block will be identified by {@link #startPointerForCurrentBlock} and assigned to - * {@link #currentBlock}. {@link #indexInCurrentBlock} will be used access values from the - * {@link #currentBlock}. - */ - private int[] currentBlock; - private int indexInCurrentBlock; - private int startPointerForCurrentBlock = -1; - - /** - * Whether the decoded packed values are spanning more than 1 int. - * @see #readPackedLong() - */ - private boolean packedValueNeedsLong; - - /** - * Masks used to extract packed values. - * @see #readPackedLong() - */ - private long packedValueMask; - - /** PRE-COMPUTED: The index of the first int that has a specific packed values. */ - private int[] packedValueStartIndices; - - /** PRE-COMPUTED: The shifts and masks used to decode packed values. */ - private int[] packedValueLowBitsRightShift; - private int[] packedValueMiddleBitsLeftShift; - private int[] packedValueMiddleBitsMask; - private int[] packedValueHighBitsLeftShift; - private int[] packedValueHighBitsMask; - - /** Index of packed values. */ - private int packedValueIndex; - - /** - * The {@link #indexInCurrentBlock} and {@link #startPointerForCurrentBlock} of the first int - * that holds packed values. This two values together uniquely form a int block pool pointer - * --- {@link #packedValueStartBlockStart} + {@link #packedValueStartBlockIndex} --- that points - * to the first int that has pointer. - * - * @see #jumpToInt(int, int) - */ - private int packedValueStartBlockIndex; - private int packedValueStartBlockStart; - - /** Current int read from {@link #currentBlock}. */ - private int currentInt; - - /** - * If given, query cost will be tracked every time a int block is loaded. - * @see #loadNextBlock() - */ - private final QueryCostTracker queryCostTracker; - private final QueryCostTracker.CostType queryCostType; - - /** - * Default constructor. 
- * - * @param intBlockPool from which packed ints will be read - * @param preComputedValues pre-computed shifts, masks, and start int - * @param queryCostTracker optional, query cost tracker used while loading a new block - * @param queryCostType optional, query cost type will be tracked while loading a new block - */ - public IntBlockPoolPackedLongsReader( - IntBlockPool intBlockPool, - PackedLongsReaderPreComputedValues preComputedValues, - @Nullable QueryCostTracker queryCostTracker, - @Nullable QueryCostTracker.CostType queryCostType) { - this.intBlockPool = intBlockPool; - this.preComputedValues = preComputedValues; - - // For query cost tracking. - this.queryCostTracker = queryCostTracker; - this.queryCostType = queryCostType; - } - - /** - * Constructor with {@link #queryCostTracker} and {@link #queryCostType} set to null. - * - * @param intBlockPool from which packed ints will be read - * @param preComputedValues pre-computed shifts, masks, and start int - */ - public IntBlockPoolPackedLongsReader( - IntBlockPool intBlockPool, - PackedLongsReaderPreComputedValues preComputedValues) { - this(intBlockPool, preComputedValues, null, null); - } - - /** - * 1. Set the reader to starting reading at the given int block pool pointer. Correct block will - * be loaded if the given pointer points to the different block than {@link #currentBlock}. - * 2. Update shifts, masks, and start int indices based on given number of bits per packed value. - * 3. Reset packed value sequence start data. - * - * @param intBlockPoolPointer points to the int from which this reader will start reading - * @param bitsPerPackedValue number of bits per packed value. - */ - public void jumpToInt(int intBlockPoolPointer, int bitsPerPackedValue) { - assert bitsPerPackedValue <= Long.SIZE; - - // Update indexInCurrentBlock and load a different index if needed. - int newBlockStart = IntBlockPool.getBlockStart(intBlockPoolPointer); - indexInCurrentBlock = IntBlockPool.getOffsetInBlock(intBlockPoolPointer); - - if (startPointerForCurrentBlock != newBlockStart) { - startPointerForCurrentBlock = newBlockStart; - loadNextBlock(); - } - - // Re-set shifts, masks, and start int indices for the given number bits per packed value. - packedValueNeedsLong = bitsPerPackedValue > Integer.SIZE; - packedValueMask = - bitsPerPackedValue == Long.SIZE ? 0xFFFFFFFFFFFFFFFFL : (1L << bitsPerPackedValue) - 1; - packedValueStartIndices = preComputedValues.getStartIntIndices(bitsPerPackedValue); - packedValueLowBitsRightShift = preComputedValues.getLowBitsRightShift(bitsPerPackedValue); - packedValueMiddleBitsLeftShift = preComputedValues.getMiddleBitsLeftShift(bitsPerPackedValue); - packedValueMiddleBitsMask = preComputedValues.getMiddleBitsMask(bitsPerPackedValue); - packedValueHighBitsLeftShift = preComputedValues.getHighBitsLeftShift(bitsPerPackedValue); - packedValueHighBitsMask = preComputedValues.getHighBitsMask(bitsPerPackedValue); - - // Update packed values sequence start data. - packedValueIndex = 0; - packedValueStartBlockIndex = indexInCurrentBlock; - packedValueStartBlockStart = startPointerForCurrentBlock; - - // Load an int to prepare for readPackedLong. - loadInt(); - } - - /** - * Read next packed value as a long. - * - * Caller could cast the returned long to an int if needed. - * NOTICE! Be careful of overflow while casting a long to an int. - * - * @return next packed value in a long. 
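 *
 * Minimal usage sketch (hypothetical variables sliceStartPointer, bitsPerValue, numValues),
 * reading back values written by {@link IntBlockPoolPackedLongsWriter} at the same pointer
 * and bit width:
 *   reader.jumpToInt(sliceStartPointer, bitsPerValue);
 *   for (int i = 0; i < numValues; i++) {
 *     long value = reader.readPackedLong();
 *   }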
- */ - public long readPackedLong() { - long packedValue; - - if (packedValueNeedsLong) { - packedValue = - (LONG_MASK & currentInt) - >>> packedValueLowBitsRightShift[packedValueIndex] & packedValueMask; - packedValue |= - (LONG_MASK & loadInt() - & packedValueMiddleBitsMask[packedValueIndex]) - << packedValueMiddleBitsLeftShift[packedValueIndex]; - if (packedValueHighBitsLeftShift[packedValueIndex] != 0) { - packedValue |= - (LONG_MASK & loadInt() - & packedValueHighBitsMask[packedValueIndex]) - << packedValueHighBitsLeftShift[packedValueIndex]; - } - } else { - packedValue = - currentInt >>> packedValueLowBitsRightShift[packedValueIndex] & packedValueMask; - if (packedValueMiddleBitsLeftShift[packedValueIndex] != 0) { - packedValue |= - (loadInt() - & packedValueMiddleBitsMask[packedValueIndex]) - << packedValueMiddleBitsLeftShift[packedValueIndex]; - } - } - - packedValueIndex++; - return packedValue; - } - - /** - * A simple getter of {@link #packedValueIndex}. - */ - public int getPackedValueIndex() { - return packedValueIndex; - } - - /** - * A setter of {@link #packedValueIndex}. This setter will also set the correct - * {@link #indexInCurrentBlock} based on {@link #packedValueStartIndices}. - */ - public void setPackedValueIndex(int packedValueIndex) { - this.packedValueIndex = packedValueIndex; - this.indexInCurrentBlock = - packedValueStartBlockIndex + packedValueStartIndices[packedValueIndex]; - this.startPointerForCurrentBlock = packedValueStartBlockStart; - loadInt(); - } - - /************************** - * Private Helper Methods * - **************************/ - - /** - * Load a new int block, specified by {@link #startPointerForCurrentBlock}, from - * {@link #intBlockPool}. If {@link #queryCostTracker} is given, query cost with type - * {@link #queryCostType} will be tracked as well. - */ - private void loadNextBlock() { - if (queryCostTracker != null) { - assert queryCostType != null; - queryCostTracker.track(queryCostType); - } - - currentBlock = intBlockPool.getBlock(startPointerForCurrentBlock); - } - - /** - * Load an int from {@link #currentBlock}. The loaded int will be returned as well. - * If the {@link #currentBlock} is used up, next block will be automatically loaded. - */ - private int loadInt() { - while (indexInCurrentBlock >= IntBlockPool.BLOCK_SIZE) { - startPointerForCurrentBlock += IntBlockPool.BLOCK_SIZE; - loadNextBlock(); - - indexInCurrentBlock = Math.max(indexInCurrentBlock - IntBlockPool.BLOCK_SIZE, 0); - } - - currentInt = currentBlock[indexInCurrentBlock++]; - return currentInt; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsWriter.java b/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsWriter.java deleted file mode 100644 index 320be6650..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/IntBlockPoolPackedLongsWriter.java +++ /dev/null @@ -1,166 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * A packed ints writer writing packed values (int/long) into {@link IntBlockPool}. - * @see IntBlockPoolPackedLongsReader - * - * A standard useage would be: - * - set writer at an int block pool pointer and number of bits per packed value: - * {@link #jumpToInt(int, int)} - * - write: {@link #writePackedInt(int)} or {@link #writePackedLong(long)} - * - * Example usage: - * @see HighDFPackedIntsPostingLists - */ -public final class IntBlockPoolPackedLongsWriter { - /** - * Mask used to convert an int to a long. 
We cannot just cast because it will fill in the higher - * 32 bits with the sign bit, but we need the higher 32 bits to be 0 instead. - */ - private static final long LONG_MASK = 0xFFFFFFFFL; - - /** The int block pool into which packed ints will be written. */ - private final IntBlockPool intBlockPool; - - /** The value in the current position in the int block pool. */ - private int currentIntValue = 0; - - /** Starting bit index of unused bits in {@link #currentIntValue}. */ - private int currentIntBitIndex = 0; - - /** Pointer of {@link #currentIntValue} in {@link #intBlockPool}. */ - private int currentIntPointer = -1; - - /** - * Number of bits per packed value that will be written with - * {@link #writePackedInt(int)} or {@link #writePackedLong(long)}. - */ - private int numBitsPerPackedValue = -1; - - /** - * Mask used to extract the lower {@link #numBitsPerPackedValue} bits of a given value. - */ - private long packedValueBitsMask = 0; - - /** - * Sole constructor. - * - * @param intBlockPool into which packed ints will be written - */ - public IntBlockPoolPackedLongsWriter(IntBlockPool intBlockPool) { - this.intBlockPool = intBlockPool; - } - - /** - * 1. Set this writer to start writing at the given int block pool pointer. - * 2. Set the number of bits per packed value that will be written. - * 3. Re-set {@link #currentIntValue} and {@link #currentIntBitIndex} to 0. - * - * @param intBlockPoolPointer the position at which this writer should start writing packed values. This - * pointer must be less than or equal to the length of the block pool. - * Subsequent writes will {@link IntBlockPool#add(int)} to the - * end of the int block pool if the given pointer equals the length. - * @param bitsPerPackedValue must be non-negative. - */ - public void jumpToInt(int intBlockPoolPointer, int bitsPerPackedValue) { - assert intBlockPoolPointer <= intBlockPool.length(); - assert bitsPerPackedValue >= 0; - - // Set the writer to start writing at the given int block pool pointer. - this.currentIntPointer = intBlockPoolPointer; - - // Set the number of bits that will be written per packed value. - this.numBitsPerPackedValue = bitsPerPackedValue; - - // Compute the mask used to extract the lower bitsPerPackedValue bits. - this.packedValueBitsMask = - bitsPerPackedValue == Long.SIZE ? -1L : (1L << bitsPerPackedValue) - 1; - - // Reset current int data to 0. - this.currentIntValue = 0; - this.currentIntBitIndex = 0; - } - - /** - * The given int value will be ZERO extended to a long and written using - * {@link #writePackedValueInternal(long)}. - * - * @see #LONG_MASK - */ - public void writePackedInt(final int value) { - assert numBitsPerPackedValue <= Integer.SIZE; - writePackedValueInternal(LONG_MASK & value); - } - - /** - * Write a long value. - * The given long value may not fit in an int. - */ - public void writePackedLong(final long value) { - assert numBitsPerPackedValue <= Long.SIZE; - writePackedValueInternal(value); - } - - /************************* - * Private Helper Method * - *************************/ - - /** - * Write the given number of bits of the given value into this int pool as a packed int. - * - * @param value the value to be written - */ - private void writePackedValueInternal(final long value) { - // Extract the lower 'numBitsPerPackedValue' bits from the given value. 
- long val = value & packedValueBitsMask; - - assert val == value : String.format( - "given value %d needs more bits than specified %d", value, numBitsPerPackedValue); - - int numBitsWrittenCurIter; - int numBitsRemaining = numBitsPerPackedValue; - - // Each iteration of this while loop is writing part of the given value. - while (numBitsRemaining > 0) { - // Write into 'currentIntValue' int. - currentIntValue |= val << currentIntBitIndex; - - // Calculate number of bits have been written in this iteration, - // we either used up all the remaining bits in 'currentIntValue' or - // finished up writing the value, whichever is smaller. - numBitsWrittenCurIter = Math.min(Integer.SIZE - currentIntBitIndex, numBitsRemaining); - - // Number of bits remaining should be decremented. - numBitsRemaining -= numBitsWrittenCurIter; - - // Right shift the value to remove the bits have been written. - val >>>= numBitsWrittenCurIter; - - // Update bit index in current int. - currentIntBitIndex += numBitsWrittenCurIter; - assert currentIntBitIndex <= Integer.SIZE; - - flush(); - - // if 'currentIntValue' int is used up. - if (currentIntBitIndex == Integer.SIZE) { - currentIntPointer++; - - currentIntValue = 0; - currentIntBitIndex = 0; - } - } - } - - /** - * Flush the {@link #currentIntValue} int into the int pool if the any bits of the int are used. - */ - private void flush() { - if (currentIntPointer == intBlockPool.length()) { - intBlockPool.add(currentIntValue); - assert currentIntPointer + 1 == intBlockPool.length(); - } else { - intBlockPool.set(currentIntPointer, currentIntValue); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedIndex.java b/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedIndex.java deleted file mode 100644 index 6e4b79250..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedIndex.java +++ /dev/null @@ -1,144 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * Inverted index for a single field. - * - * Example: The field is "hashtags", this index contains a mapping from all the hashtags - * that we've seen to a list of postings. - */ -public abstract class InvertedIndex implements FacetLabelProvider, Flushable { - protected final EarlybirdFieldType fieldType; - - public InvertedIndex(EarlybirdFieldType fieldType) { - this.fieldType = fieldType; - } - - public EarlybirdFieldType getFieldType() { - return fieldType; - } - - /** - * Get the internal doc id of the oldest doc that includes term. - * @param term the term to look for. - * @return The internal docid, or TERM_NOT_FOUND. - */ - public final int getLargestDocIDForTerm(BytesRef term) throws IOException { - final int termID = lookupTerm(term); - return getLargestDocIDForTerm(termID); - } - - /** - * Get the document frequency for this term. - * @param term the term to look for. - * @return The document frequency of this term in the index. 
- */ - public final int getDF(BytesRef term) throws IOException { - final int termID = lookupTerm(term); - if (termID == EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - return 0; - } - return getDF(termID); - } - - public boolean hasMaxPublishedPointer() { - return false; - } - - public int getMaxPublishedPointer() { - return -1; - } - - /** - * Create the Lucene magic Terms accessor. - * @param maxPublishedPointer used by the skip list to enable atomic document updates. - * @return a new Terms object. - */ - public abstract Terms createTerms(int maxPublishedPointer); - - /** - * Create the Lucene magic TermsEnum accessor. - * @param maxPublishedPointer used by the skip list to enable atomic document updates. - * @return a new TermsEnum object. - */ - public abstract TermsEnum createTermsEnum(int maxPublishedPointer); - - /** - * Returns the number of distinct terms in this inverted index. - * For example, if the indexed documents are: - * "i love chocolate and i love cakes" - * "i love cookies" - * - * then this method will return 6, because there are 6 distinct terms: - * i, love, chocolate, and, cakes, cookies - */ - public abstract int getNumTerms(); - - /** - * Returns the number of distinct documents in this index. - */ - public abstract int getNumDocs(); - - /** - * Returns the total number of postings in this inverted index. - * - * For example, if the indexed documents are: - * "i love chocolate and i love cakes" - * "i love cookies" - * - * then this method will return 10, because there's a total of 10 words in these 2 documents. - */ - public abstract int getSumTotalTermFreq(); - - /** - * Returns the sum of the number of documents for each term in this index. - * - * For example, if the indexed documents are: - * "i love chocolate and i love cakes" - * "i love cookies" - * - * then this method will return 8, because there are: - * 2 documents for term "i" (it doesn't matter that the first document has the term "i" twice) - * 2 documents for term "love" (same reason) - * 1 document for terms "chocolate", "and", "cakes", "cookies" - */ - public abstract int getSumTermDocFreq(); - - /** - * Lookup a term. - * @param term the term to lookup. - * @return the term ID for this term. - */ - public abstract int lookupTerm(BytesRef term) throws IOException; - - /** - * Get the text for a given termID. - * @param termID the term id - * @param text a BytesRef that will be modified to contain the text of this termid. - */ - public abstract void getTerm(int termID, BytesRef text); - - /** - * Get the internal doc id of the oldest doc that includes this term. - * @param termID The termID of the term. - * @return The internal docid, or TERM_NOT_FOUND. - */ - public abstract int getLargestDocIDForTerm(int termID) throws IOException; - - /** - * Get the document frequency for a given termID - * @param termID the term id - * @return the document frequency of this term in this index. 
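A hedged usage sketch of this API (not from the original source; invertedIndex and the term text are hypothetical): callers typically resolve a term to its term ID once, then ask for per-term statistics.

    int termId = invertedIndex.lookupTerm(new BytesRef("earlybird"));
    if (termId != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) {
      int df = invertedIndex.getDF(termId);                            // number of documents containing the term
      int oldestDocId = invertedIndex.getLargestDocIDForTerm(termId);  // internal ID of the oldest doc with the term
    }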
- */ - public abstract int getDF(int termID); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndex.java b/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndex.java deleted file mode 100644 index 381d4c6b5..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndex.java +++ /dev/null @@ -1,558 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Comparator; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.StringHelper; - -import com.twitter.search.common.hashtable.HashTable; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.util.hash.KeysSource; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -public class InvertedRealtimeIndex extends InvertedIndex { - public static final int FIXED_HASH_SEED = 0; - - public final class TermHashTable extends HashTable { - - private final TermPointerEncoding termPointerEncoding; - - public TermHashTable(int size, TermPointerEncoding termPointerEncoding) { - super(size); - this.termPointerEncoding = termPointerEncoding; - } - - public TermHashTable(int[] termsHash, TermPointerEncoding termPointerEncoding) { - super(termsHash); - this.termPointerEncoding = termPointerEncoding; - } - - @Override - public boolean matchItem(BytesRef term, int candidateTermID) { - return ByteTermUtils.postingEquals( - getTermPool(), - termPointerEncoding.getTextStart(termsArray.termPointers[candidateTermID]), term); - } - - @Override - public int hashCodeForItem(int itemID) { - return ByteTermUtils.hashCode( - getTermPool(), termPointerEncoding.getTextStart(termsArray.termPointers[itemID])); - } - - /* - * Use a fixed hash seed to compute the hash code for the given item. This is necessary because - * we want the TermHashTable to be consistent for lookups in indexes that have been flushed and - * loaded across restarts and redeploys. - * - * Note: previously we used item.hashcode(), however that hash function relies on the seed value - * StringHelper.GOOD_FAST_HASH_SEED, which is initialized to System.currentTimeMillis() when the - * JVM process starts up. - */ - public long lookupItem(BytesRef item) { - int itemHashCode = StringHelper.murmurhash3_x86_32(item, FIXED_HASH_SEED); - - return super.lookupItem(item, itemHashCode); - } - } - - - /** - * Skip list comparator used by {@link #termsSkipList}. The key would be the bytesRef of the term, - * and the value would be the termID of a term. - * - * Notice this comparator is keeping states, - * so different threads CANNOT share the same comparator. - */ - public static final class TermsSkipListComparator implements SkipListComparator { - private static final Comparator BYTES_REF_COMPARATOR = Comparator.naturalOrder(); - - private static final int SENTINEL_VALUE = HashTable.EMPTY_SLOT; - - // Initializing two BytesRef to use for later comparisons. - // Notice different threads cannot share the same comparator. 
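    // Illustrative note (not part of the original source): because the two BytesRef fields below
    // are reused as scratch space by compareKeyWithValue and compareValues, each searcher thread
    // walking the ordered-terms skip list is expected to build its own comparator, e.g.
    //   SkipListComparator comparator = new TermsSkipListComparator(index);
    // rather than sharing one instance across threads.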
- private final BytesRef bytesRef1 = new BytesRef(); - private final BytesRef bytesRef2 = new BytesRef(); - - /** - * We have to pass each part of the index in since during load process, the comparator - * needs to be build before the index. - */ - private final InvertedRealtimeIndex invertedIndex; - - public TermsSkipListComparator(InvertedRealtimeIndex invertedIndex) { - this.invertedIndex = invertedIndex; - } - - @Override - public int compareKeyWithValue(BytesRef key, int targetValue, int targetPosition) { - // No key could represent SENTINEL_VALUE and SENTINEL_VALUE is greatest. - if (targetValue == SENTINEL_VALUE) { - return -1; - } else { - getTerm(targetValue, bytesRef1); - return BYTES_REF_COMPARATOR.compare(key, bytesRef1); - } - } - - @Override - public int compareValues(int v1, int v2) { - // SENTINEL_VALUE is greatest. - if (v1 != SENTINEL_VALUE && v2 != SENTINEL_VALUE) { - getTerm(v1, bytesRef1); - getTerm(v2, bytesRef2); - return BYTES_REF_COMPARATOR.compare(bytesRef1, bytesRef2); - } else if (v1 == SENTINEL_VALUE && v2 == SENTINEL_VALUE) { - return 0; - } else if (v1 == SENTINEL_VALUE) { - return 1; - } else { - return -1; - } - } - - @Override - public int getSentinelValue() { - return SENTINEL_VALUE; - } - - /** - * Get the term specified by the termID. - * This method should be the same as {@link InvertedRealtimeIndex#getTerm} - */ - private void getTerm(int termID, BytesRef text) { - invertedIndex.getTerm(termID, text); - } - } - - private static final int HASHMAP_SIZE = 64 * 1024; - - private SkipListContainer termsSkipList; - - private final TermPointerEncoding termPointerEncoding; - private final ByteBlockPool termPool; - private final SkipListPostingList postingList; - - private int numTerms; - private int numDocs; - private int sumTotalTermFreq; - private int sumTermDocFreq; - private int maxPosition; - - private volatile TermHashTable hashTable; - private TermsArray termsArray; - - /** - * Creates a new in-memory real-time inverted index for the given field. - */ - public InvertedRealtimeIndex(EarlybirdFieldType fieldType, - TermPointerEncoding termPointerEncoding, - String fieldName) { - super(fieldType); - this.termPool = new ByteBlockPool(); - - this.termPointerEncoding = termPointerEncoding; - this.hashTable = new TermHashTable(HASHMAP_SIZE, termPointerEncoding); - - this.postingList = new SkipListPostingList( - fieldType.hasPositions() - ? SkipListContainer.HasPositions.YES - : SkipListContainer.HasPositions.NO, - fieldType.isStorePerPositionPayloads() - ? SkipListContainer.HasPayloads.YES - : SkipListContainer.HasPayloads.NO, - fieldName); - - this.termsArray = new TermsArray( - HASHMAP_SIZE, fieldType.isStoreFacetOffensiveCounters()); - - // Create termsSkipList to maintain order if field is support ordered terms. - if (fieldType.isSupportOrderedTerms()) { - // Terms skip list does not support position. 
- this.termsSkipList = new SkipListContainer<>( - new TermsSkipListComparator(this), - SkipListContainer.HasPositions.NO, - SkipListContainer.HasPayloads.NO, - "terms"); - this.termsSkipList.newSkipList(); - } else { - this.termsSkipList = null; - } - } - - void setTermsSkipList(SkipListContainer termsSkipList) { - this.termsSkipList = termsSkipList; - } - - SkipListContainer getTermsSkipList() { - return termsSkipList; - } - - private InvertedRealtimeIndex( - EarlybirdFieldType fieldType, - int numTerms, - int numDocs, - int sumTermDocFreq, - int sumTotalTermFreq, - int maxPosition, - int[] termsHash, - TermsArray termsArray, - ByteBlockPool termPool, - TermPointerEncoding termPointerEncoding, - SkipListPostingList postingList) { - super(fieldType); - this.numTerms = numTerms; - this.numDocs = numDocs; - this.sumTermDocFreq = sumTermDocFreq; - this.sumTotalTermFreq = sumTotalTermFreq; - this.maxPosition = maxPosition; - this.termsArray = termsArray; - this.termPool = termPool; - this.termPointerEncoding = termPointerEncoding; - this.hashTable = new TermHashTable(termsHash, termPointerEncoding); - this.postingList = postingList; - } - - void insertToTermsSkipList(BytesRef termBytesRef, int termID) { - if (termsSkipList != null) { - // Use the comparator passed in while building the skip list since we only have one writer. - termsSkipList.insert(termBytesRef, termID, SkipListContainer.FIRST_LIST_HEAD); - } - } - - @Override - public int getNumTerms() { - return numTerms; - } - - @Override - public int getNumDocs() { - return numDocs; - } - - @Override - public int getSumTotalTermFreq() { - return sumTotalTermFreq; - } - - @Override - public int getSumTermDocFreq() { - return sumTermDocFreq; - } - - @Override - public Terms createTerms(int maxPublishedPointer) { - return new RealtimeIndexTerms(this, maxPublishedPointer); - } - - @Override - public TermsEnum createTermsEnum(int maxPublishedPointer) { - // Use SkipListInMemoryTermsEnum if termsSkipList is not null, which indicates field required - // ordered term. 
- if (termsSkipList == null) { - return new RealtimeIndexTerms.InMemoryTermsEnum(this, maxPublishedPointer); - } else { - return new RealtimeIndexTerms.SkipListInMemoryTermsEnum(this, maxPublishedPointer); - } - } - - int getPostingListPointer(int termID) { - return termsArray.getPostingsPointer(termID); - } - - @Override - public int getLargestDocIDForTerm(int termID) { - if (termID == EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - return TermsArray.INVALID; - } else { - return postingList.getDocIDFromPosting(termsArray.largestPostings[termID]); - } - } - - @Override - public int getDF(int termID) { - if (termID == HashTable.EMPTY_SLOT) { - return 0; - } else { - return this.postingList.getDF(termID, termsArray); - } - } - - @Override - public int getMaxPublishedPointer() { - return this.postingList.getMaxPublishedPointer(); - } - - @Override - public int lookupTerm(BytesRef term) { - return HashTable.decodeItemId(hashTable.lookupItem(term)); - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - final TermsArray termsArrayCopy = this.termsArray; - - return new FacetLabelAccessor() { - @Override protected boolean seek(long termID) { - if (termID == HashTable.EMPTY_SLOT) { - return false; - } - int termPointer = termsArrayCopy.termPointers[(int) termID]; - hasTermPayload = termPointerEncoding.hasPayload(termPointer); - int textStart = termPointerEncoding.getTextStart(termPointer); - int termPayloadStart = ByteTermUtils.setBytesRef(termPool, termRef, textStart); - if (hasTermPayload) { - ByteTermUtils.setBytesRef(termPool, termPayload, termPayloadStart); - } - offensiveCount = termsArrayCopy.offensiveCounters != null - ? termsArrayCopy.offensiveCounters[(int) termID] : 0; - - return true; - } - }; - } - - @Override - public boolean hasMaxPublishedPointer() { - return true; - } - - @Override - public void getTerm(int termID, BytesRef text) { - getTerm(termID, text, termsArray, termPointerEncoding, termPool); - } - - /** - * Extract to helper method so the logic can be shared with - * {@link TermsSkipListComparator#getTerm} - */ - private static void getTerm(int termID, BytesRef text, - TermsArray termsArray, - TermPointerEncoding termPointerEncoding, - ByteBlockPool termPool) { - int textStart = termPointerEncoding.getTextStart(termsArray.termPointers[termID]); - ByteTermUtils.setBytesRef(termPool, text, textStart); - } - - /** - * Called when postings hash is too small (> 50% occupied). - */ - void rehashPostings(int newSize) { - TermHashTable newTable = new TermHashTable(newSize, termPointerEncoding); - hashTable.rehash(newTable); - hashTable = newTable; - } - - /** - * Returns per-term array containing the number of documents indexed with that term that were - * considered to be offensive. - */ - @Nullable - int[] getOffensiveCounters() { - return this.termsArray.offensiveCounters; - } - - /** - * Returns access to all the terms in this index as a {@link KeysSource}. 
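A minimal sketch of how the KeysSource returned by getKeysSource() below might be consumed (illustrative only, not from the original source; index is a hypothetical InvertedRealtimeIndex):

    KeysSource keys = index.getKeysSource();
    for (int i = 0; i < keys.getNumberOfKeys(); i++) {
      BytesRef term = keys.nextKey();  // the returned BytesRef is reused, so copy it if it must outlive the loop
      // ... process term ...
    }
    keys.rewind();                     // rewind before iterating over the keys again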
- */ - public KeysSource getKeysSource() { - final int localNumTerms = this.numTerms; - final TermsArray termsArrayCopy = this.termsArray; - - return new KeysSource() { - private int termID = 0; - private BytesRef text = new BytesRef(); - - @Override - public int getNumberOfKeys() { - return localNumTerms; - } - - /** Must not be called more often than getNumberOfKeys() before rewind() is called */ - @Override - public BytesRef nextKey() { - Preconditions.checkState(termID < localNumTerms); - int textStart = termPointerEncoding.getTextStart(termsArrayCopy.termPointers[termID]); - ByteTermUtils.setBytesRef(termPool, text, textStart); - termID++; - return text; - } - - @Override - public void rewind() { - termID = 0; - } - }; - } - - /** - * Returns byte pool containing term text for all terms in this index. - */ - public ByteBlockPool getTermPool() { - return this.termPool; - } - - /** - * Returns per-term array containing pointers to where the text of each term is stored in the - * byte pool returned by {@link #getTermPool()}. - */ - public int[] getTermPointers() { - return this.termsArray.termPointers; - } - - /** - * Returns the hash table used to look up terms in this index. - */ - InvertedRealtimeIndex.TermHashTable getHashTable() { - return hashTable; - } - - - TermsArray getTermsArray() { - return termsArray; - } - - TermsArray growTermsArray() { - termsArray = termsArray.grow(); - return termsArray; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - TermPointerEncoding getTermPointerEncoding() { - return termPointerEncoding; - } - - SkipListPostingList getPostingList() { - return postingList; - } - - void incrementNumTerms() { - numTerms++; - } - - void incrementSumTotalTermFreq() { - sumTotalTermFreq++; - } - - public void incrementSumTermDocFreq() { - sumTermDocFreq++; - } - - public void incrementNumDocs() { - numDocs++; - } - - void setNumDocs(int numDocs) { - this.numDocs = numDocs; - } - - void adjustMaxPosition(int position) { - if (position > maxPosition) { - maxPosition = position; - } - } - - int getMaxPosition() { - return maxPosition; - } - - public static class FlushHandler extends Flushable.Handler { - private static final String NUM_DOCS_PROP_NAME = "numDocs"; - private static final String SUM_TOTAL_TERM_FREQ_PROP_NAME = "sumTotalTermFreq"; - private static final String SUM_TERM_DOC_FREQ_PROP_NAME = "sumTermDocFreq"; - private static final String NUM_TERMS_PROP_NAME = "numTerms"; - private static final String POSTING_LIST_PROP_NAME = "postingList"; - private static final String TERMS_SKIP_LIST_PROP_NAME = "termsSkipList"; - private static final String MAX_POSITION = "maxPosition"; - - protected final EarlybirdFieldType fieldType; - protected final TermPointerEncoding termPointerEncoding; - - public FlushHandler(EarlybirdFieldType fieldType, - TermPointerEncoding termPointerEncoding) { - this.fieldType = fieldType; - this.termPointerEncoding = termPointerEncoding; - } - - public FlushHandler(InvertedRealtimeIndex objectToFlush) { - super(objectToFlush); - this.fieldType = objectToFlush.fieldType; - this.termPointerEncoding = objectToFlush.getTermPointerEncoding(); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - InvertedRealtimeIndex objectToFlush = getObjectToFlush(); - flushInfo.addIntProperty(NUM_TERMS_PROP_NAME, objectToFlush.getNumTerms()); - flushInfo.addIntProperty(NUM_DOCS_PROP_NAME, objectToFlush.numDocs); - 
flushInfo.addIntProperty(SUM_TERM_DOC_FREQ_PROP_NAME, objectToFlush.sumTermDocFreq); - flushInfo.addIntProperty(SUM_TOTAL_TERM_FREQ_PROP_NAME, objectToFlush.sumTotalTermFreq); - flushInfo.addIntProperty(MAX_POSITION, objectToFlush.maxPosition); - - out.writeIntArray(objectToFlush.hashTable.slots()); - objectToFlush.termsArray.getFlushHandler() - .flush(flushInfo.newSubProperties("termsArray"), out); - objectToFlush.getTermPool().getFlushHandler() - .flush(flushInfo.newSubProperties("termPool"), out); - objectToFlush.getPostingList().getFlushHandler() - .flush(flushInfo.newSubProperties(POSTING_LIST_PROP_NAME), out); - - if (fieldType.isSupportOrderedTerms()) { - Preconditions.checkNotNull(objectToFlush.termsSkipList); - - objectToFlush.termsSkipList.getFlushHandler() - .flush(flushInfo.newSubProperties(TERMS_SKIP_LIST_PROP_NAME), out); - } - } - - @Override - protected InvertedRealtimeIndex doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - int[] termsHash = in.readIntArray(); - TermsArray termsArray = (new TermsArray.FlushHandler()) - .load(flushInfo.getSubProperties("termsArray"), in); - ByteBlockPool termPool = (new ByteBlockPool.FlushHandler()) - .load(flushInfo.getSubProperties("termPool"), in); - SkipListPostingList postingList = (new SkipListPostingList.FlushHandler()) - .load(flushInfo.getSubProperties(POSTING_LIST_PROP_NAME), in); - - InvertedRealtimeIndex index = new InvertedRealtimeIndex( - fieldType, - flushInfo.getIntProperty(NUM_TERMS_PROP_NAME), - flushInfo.getIntProperty(NUM_DOCS_PROP_NAME), - flushInfo.getIntProperty(SUM_TERM_DOC_FREQ_PROP_NAME), - flushInfo.getIntProperty(SUM_TOTAL_TERM_FREQ_PROP_NAME), - flushInfo.getIntProperty(MAX_POSITION), - termsHash, - termsArray, - termPool, - termPointerEncoding, - postingList); - - if (fieldType.isSupportOrderedTerms()) { - SkipListComparator comparator = new TermsSkipListComparator(index); - index.setTermsSkipList((new SkipListContainer.FlushHandler<>(comparator)) - .load(flushInfo.getSubProperties(TERMS_SKIP_LIST_PROP_NAME), in)); - } - - return index; - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndexWriter.java b/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndexWriter.java deleted file mode 100644 index fbea007a2..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/InvertedRealtimeIndexWriter.java +++ /dev/null @@ -1,163 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.analysis.tokenattributes.PayloadAttribute; -import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute; -import org.apache.lucene.util.AttributeSource; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.hashtable.HashTable; -import com.twitter.search.common.util.analysis.TermPayloadAttribute; -import com.twitter.search.core.earlybird.facets.FacetCountingArrayWriter; -import com.twitter.search.core.earlybird.facets.FacetIDMap.FacetField; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter; - -public class InvertedRealtimeIndexWriter - implements EarlybirdRealtimeIndexSegmentWriter.InvertedDocConsumer { - private final InvertedRealtimeIndex invertedIndex; - private final FacetCountingArrayWriter facetArray; - private final FacetField facetField; - - private TermToBytesRefAttribute termAtt; - private TermPayloadAttribute termPayloadAtt; - private PayloadAttribute 
payloadAtt; - private boolean currentDocIsOffensive; - - /** - * Creates a new writer for writing to an inverted in-memory real-time index. - */ - public InvertedRealtimeIndexWriter( - InvertedRealtimeIndex index, - FacetField facetField, - FacetCountingArrayWriter facetArray) { - super(); - this.invertedIndex = index; - this.facetArray = facetArray; - this.facetField = facetField; - } - - @Override - public void start(AttributeSource attributeSource, boolean docIsOffensive) { - termAtt = attributeSource.addAttribute(TermToBytesRefAttribute.class); - termPayloadAtt = attributeSource.addAttribute(TermPayloadAttribute.class); - payloadAtt = attributeSource.addAttribute(PayloadAttribute.class); - currentDocIsOffensive = docIsOffensive; - } - - /** - * Adds a posting to the provided inverted index. - * - * @param termBytesRef is a payload that is stored with the term. It is only stored once for each - * term. - * @param postingPayload is a byte payload that will be stored separately for every posting. - * @return term id of the added posting. - */ - public static int indexTerm(InvertedRealtimeIndex invertedIndex, BytesRef termBytesRef, - int docID, int position, BytesRef termPayload, - BytesRef postingPayload, TermPointerEncoding termPointerEncoding) { - - InvertedRealtimeIndex.TermHashTable hashTable = invertedIndex.getHashTable(); - BaseByteBlockPool termPool = invertedIndex.getTermPool(); - - TermsArray termsArray = invertedIndex.getTermsArray(); - - long hashTableInfoForBytesRef = hashTable.lookupItem(termBytesRef); - int termID = HashTable.decodeItemId(hashTableInfoForBytesRef); - int hashTableSlot = HashTable.decodeHashPosition(hashTableInfoForBytesRef); - - invertedIndex.adjustMaxPosition(position); - - if (termID == HashTable.EMPTY_SLOT) { - // First time we are seeing this token since we last flushed the hash. - // the LSB in textStart denotes whether this term has a term payload - int textStart = ByteTermUtils.copyToTermPool(termPool, termBytesRef); - boolean hasTermPayload = termPayload != null; - int termPointer = termPointerEncoding.encodeTermPointer(textStart, hasTermPayload); - - if (hasTermPayload) { - ByteTermUtils.copyToTermPool(termPool, termPayload); - } - - termID = invertedIndex.getNumTerms(); - invertedIndex.incrementNumTerms(); - if (termID >= termsArray.getSize()) { - termsArray = invertedIndex.growTermsArray(); - } - - termsArray.termPointers[termID] = termPointer; - - Preconditions.checkState(hashTable.slots()[hashTableSlot] == HashTable.EMPTY_SLOT); - hashTable.setSlot(hashTableSlot, termID); - - if (invertedIndex.getNumTerms() * 2 >= hashTable.numSlots()) { - invertedIndex.rehashPostings(2 * hashTable.numSlots()); - } - - // Insert termID into termsSkipList. - invertedIndex.insertToTermsSkipList(termBytesRef, termID); - } - - invertedIndex.incrementSumTotalTermFreq(); - invertedIndex.getPostingList() - .appendPosting(termID, termsArray, docID, position, postingPayload); - - return termID; - } - - /** - * Delete a posting that was inserted out of order. - * - * This function needs work before it is used in production: - * - It should take an isDocOffensive parameter so we can decrement the offensive - * document count for the term. - * - It doesn't allow the same concurrency guarantees that the other posting methods do. 
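For reference, a hedged sketch of driving the static indexTerm helper defined above (docId and position are hypothetical; passing null skips the optional term and posting payloads):

    int termId = InvertedRealtimeIndexWriter.indexTerm(
        invertedIndex,
        new BytesRef("earlybird"),
        docId,
        position,
        null,   // no per-term payload
        null,   // no per-posting payload
        invertedIndex.getTermPointerEncoding());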
- */ - public static void deletePosting( - InvertedRealtimeIndex invertedIndex, BytesRef termBytesRef, int docID) { - - long hashTableInfoForBytesRef = invertedIndex.getHashTable().lookupItem(termBytesRef); - int termID = HashTable.decodeItemId(hashTableInfoForBytesRef); - - if (termID != HashTable.EMPTY_SLOT) { - // Have seen this term before, and the field that supports deletes. - invertedIndex.getPostingList().deletePosting(termID, invertedIndex.getTermsArray(), docID); - } - } - - @Override - public void add(int docID, int position) { - final BytesRef payload; - if (payloadAtt == null) { - payload = null; - } else { - payload = payloadAtt.getPayload(); - } - - BytesRef termPayload = termPayloadAtt.getTermPayload(); - - int termID = indexTerm(invertedIndex, termAtt.getBytesRef(), - docID, position, termPayload, payload, - invertedIndex.getTermPointerEncoding()); - - if (termID == -1) { - return; - } - - TermsArray termsArray = invertedIndex.getTermsArray(); - - if (currentDocIsOffensive && termsArray.offensiveCounters != null) { - termsArray.offensiveCounters[termID]++; - } - - if (facetField != null) { - facetArray.addFacet(docID, facetField.getFacetId(), termID); - } - } - - @Override - public void finish() { - payloadAtt = null; - termPayloadAtt = null; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingLists.java b/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingLists.java deleted file mode 100644 index 8c1963a70..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingLists.java +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.packed.PackedInts; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -/** - * A posting list intended for low-df terms, terms that have a small number of postings. - * - * The postings (docs and positions) are stored in PackedInts, packed based on the largest docId - * and position across all low-df terms in a field. - * - * All docIds are packed together in their own PackedInts, and all positions are stored together - * in their own PackedInts. - * - A docId is stored for every single posting, that is if a doc has a frequency of N, it will be - * stored N times. - * - For fields that omitPositions, positions are not stored at all. - * - * Example: - * Postings in the form (docId, position): - * (1, 0), (1, 1), (2, 1), (2, 3), (2, 5), (4, 0), (5, 0) - * Will be stored as: - * packedDocIds: [1, 1, 2, 2, 2, 4, 5] - * packedPositions: [0, 1, 1, 3, 5, 0, 0] - */ -public class LowDFPackedIntsPostingLists extends OptimizedPostingLists { - private static final SearchCounter GETTING_POSITIONS_WITH_OMIT_POSITIONS = - SearchCounter.export("low_df_packed_ints_posting_list_getting_positions_with_omit_positions"); - - /** - * Internal class for hiding PackedInts Readers and Writers. 
A Mutable instance of PackedInts is - * only required when we're optimizing a new index. - * For the read side, we only need a PackedInts.Reader. - * For loaded indexes, we also only need a PackedInts.Reader. - */ - private static final class PackedIntsWrapper { - // Will be null if we are operating on a loaded in read-only index. - @Nullable - private final PackedInts.Mutable mutablePackedInts; - private final PackedInts.Reader readerPackedInts; - - private PackedIntsWrapper(PackedInts.Mutable mutablePackedInts) { - this.mutablePackedInts = Preconditions.checkNotNull(mutablePackedInts); - this.readerPackedInts = mutablePackedInts; - } - - private PackedIntsWrapper(PackedInts.Reader readerPackedInts) { - this.mutablePackedInts = null; - this.readerPackedInts = readerPackedInts; - } - - public int size() { - return readerPackedInts.size(); - } - - public PackedInts.Reader getReader() { - return readerPackedInts; - } - - public void set(int index, long value) { - this.mutablePackedInts.set(index, value); - } - } - - private final PackedIntsWrapper packedDocIds; - /** - * Will be null for fields that omitPositions. - */ - @Nullable - private final PackedIntsWrapper packedPositions; - private final boolean omitPositions; - private final int totalPostingsAcrossTerms; - private final int maxPosition; - private int currentPackedIntsPosition; - - /** - * Creates a new LowDFPackedIntsPostingLists. - * @param omitPositions whether positions should be omitted or not. - * @param totalPostingsAcrossTerms how many postings across all terms this field has. - * @param maxPosition the largest position used in all the postings for this field. - */ - public LowDFPackedIntsPostingLists( - boolean omitPositions, - int totalPostingsAcrossTerms, - int maxPosition) { - this( - new PackedIntsWrapper(PackedInts.getMutable( - totalPostingsAcrossTerms, - PackedInts.bitsRequired(MAX_DOC_ID), - PackedInts.DEFAULT)), - omitPositions - ? 
null - : new PackedIntsWrapper(PackedInts.getMutable( - totalPostingsAcrossTerms, - PackedInts.bitsRequired(maxPosition), - PackedInts.DEFAULT)), - omitPositions, - totalPostingsAcrossTerms, - maxPosition); - } - - private LowDFPackedIntsPostingLists( - PackedIntsWrapper packedDocIds, - @Nullable - PackedIntsWrapper packedPositions, - boolean omitPositions, - int totalPostingsAcrossTerms, - int maxPosition) { - this.packedDocIds = packedDocIds; - this.packedPositions = packedPositions; - this.omitPositions = omitPositions; - this.totalPostingsAcrossTerms = totalPostingsAcrossTerms; - this.maxPosition = maxPosition; - this.currentPackedIntsPosition = 0; - } - - @Override - public int copyPostingList(PostingsEnum postingsEnum, int numPostings) throws IOException { - int pointer = currentPackedIntsPosition; - - int docId; - - while ((docId = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { - assert docId <= MAX_DOC_ID; - int freq = postingsEnum.freq(); - assert freq <= numPostings; - - for (int i = 0; i < freq; i++) { - packedDocIds.set(currentPackedIntsPosition, docId); - if (packedPositions != null) { - int position = postingsEnum.nextPosition(); - assert position <= maxPosition; - packedPositions.set(currentPackedIntsPosition, position); - } - currentPackedIntsPosition++; - } - } - - return pointer; - } - - @Override - public EarlybirdPostingsEnum postings( - int postingListPointer, - int numPostings, - int flags) throws IOException { - - if (PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) && !omitPositions) { - assert packedPositions != null; - return new LowDFPackedIntsPostingsEnum( - packedDocIds.getReader(), - packedPositions.getReader(), - postingListPointer, - numPostings); - } else { - if (PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) && omitPositions) { - GETTING_POSITIONS_WITH_OMIT_POSITIONS.increment(); - } - - return new LowDFPackedIntsPostingsEnum( - packedDocIds.getReader(), - null, // no positions - postingListPointer, - numPostings); - } - } - - @VisibleForTesting - int getPackedIntsSize() { - return packedDocIds.size(); - } - - @VisibleForTesting - int getMaxPosition() { - return maxPosition; - } - - @VisibleForTesting - boolean isOmitPositions() { - return omitPositions; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - static class FlushHandler extends Flushable.Handler { - private static final String OMIT_POSITIONS_PROP_NAME = "omitPositions"; - private static final String TOTAL_POSTINGS_PROP_NAME = "totalPostingsAcrossTerms"; - private static final String MAX_POSITION_PROP_NAME = "maxPosition"; - - public FlushHandler() { - super(); - } - - public FlushHandler(LowDFPackedIntsPostingLists objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - LowDFPackedIntsPostingLists objectToFlush = getObjectToFlush(); - - flushInfo.addBooleanProperty(OMIT_POSITIONS_PROP_NAME, objectToFlush.omitPositions); - flushInfo.addIntProperty(TOTAL_POSTINGS_PROP_NAME, objectToFlush.totalPostingsAcrossTerms); - flushInfo.addIntProperty(MAX_POSITION_PROP_NAME, objectToFlush.maxPosition); - - out.writePackedInts(objectToFlush.packedDocIds.getReader()); - - if (!objectToFlush.omitPositions) { - assert objectToFlush.packedPositions != null; - out.writePackedInts(objectToFlush.packedPositions.getReader()); - } - } - - @Override - protected LowDFPackedIntsPostingLists doLoad( - FlushInfo 
flushInfo, - DataDeserializer in) throws IOException { - - boolean omitPositions = flushInfo.getBooleanProperty(OMIT_POSITIONS_PROP_NAME); - int totalPostingsAcrossTerms = flushInfo.getIntProperty(TOTAL_POSTINGS_PROP_NAME); - int maxPosition = flushInfo.getIntProperty(MAX_POSITION_PROP_NAME); - - PackedIntsWrapper packedDocIds = new PackedIntsWrapper(in.readPackedInts()); - - PackedIntsWrapper packedPositions = null; - if (!omitPositions) { - packedPositions = new PackedIntsWrapper(in.readPackedInts()); - } - - return new LowDFPackedIntsPostingLists( - packedDocIds, - packedPositions, - omitPositions, - totalPostingsAcrossTerms, - maxPosition); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingsEnum.java deleted file mode 100644 index cb1c54c05..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/LowDFPackedIntsPostingsEnum.java +++ /dev/null @@ -1,112 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import javax.annotation.Nullable; - -import org.apache.lucene.util.packed.PackedInts; - -/** - * A PostingsEnum for iterating over LowDFPackedIntsPostingLists. - * - * Can be used with positions and without positions. - */ -public class LowDFPackedIntsPostingsEnum extends EarlybirdOptimizedPostingsEnum { - private static final int SKIP_INTERVAL = 128; - - private final PackedInts.Reader packedDocIds; - @Nullable - private final PackedInts.Reader packedPositions; - private final int lastPostingPointer; - private final int largestDocID; - private int currentPositionPointer; - - /** Pointer to the next posting that will be loaded. */ - private int nextPostingPointer; - - /** - * Creates a new PostingsEnum for all postings in a given term. - */ - public LowDFPackedIntsPostingsEnum( - PackedInts.Reader packedDocIds, - @Nullable - PackedInts.Reader packedPositions, - int postingListPointer, - int numPostings) { - super(postingListPointer, numPostings); - - this.packedDocIds = packedDocIds; - this.packedPositions = packedPositions; - this.nextPostingPointer = postingListPointer; - - this.lastPostingPointer = postingListPointer + numPostings - 1; - this.largestDocID = (int) packedDocIds.get(lastPostingPointer); - - loadNextPosting(); - - // Treat each term as a single block load. - queryCostTracker.track(QueryCostTracker.CostType.LOAD_OPTIMIZED_POSTING_BLOCK); - } - - @Override - protected void loadNextPosting() { - if (nextPostingPointer <= lastPostingPointer) { - nextDocID = (int) packedDocIds.get(nextPostingPointer); - nextFreq = 1; - } else { - // all postings fully processed - nextDocID = NO_MORE_DOCS; - nextFreq = 0; - } - nextPostingPointer++; - } - - @Override - protected void startCurrentDoc() { - if (packedPositions != null) { - /** - * Remember where we were at the beginning of this doc, so that we can iterate over the - * positions for this doc if needed. - * Adjust by `- 1 - getCurrentFreq()` because we already advanced beyond the last posting in - * the previous loadNextPosting() calls. 
- * @see #nextDocNoDel() - */ - currentPositionPointer = nextPostingPointer - 1 - getCurrentFreq(); - } - } - - @Override - protected void skipTo(int target) { - assert target != NO_MORE_DOCS : "Should be handled in parent class advance method"; - - // now we know there must be a doc in this block that we can return - int skipIndex = nextPostingPointer + SKIP_INTERVAL; - while (skipIndex <= lastPostingPointer && target > packedDocIds.get(skipIndex)) { - nextPostingPointer = skipIndex; - skipIndex += SKIP_INTERVAL; - } - } - - @Override - public int nextPosition() throws IOException { - if (packedPositions == null) { - return -1; - } else if (currentPositionPointer < packedPositions.size()) { - return (int) packedPositions.get(currentPositionPointer++); - } else { - return -1; - } - } - - @Override - public int getLargestDocID() throws IOException { - return largestDocID; - } - - @Override - public long cost() { - // cost would be -1 if this enum is exhausted. - final int cost = lastPostingPointer - nextPostingPointer + 1; - return cost < 0 ? 0 : cost; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/MPHTermDictionary.java b/src/java/com/twitter/search/core/earlybird/index/inverted/MPHTermDictionary.java deleted file mode 100644 index dd76ee10e..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/MPHTermDictionary.java +++ /dev/null @@ -1,190 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.index.BaseTermsEnum; -import org.apache.lucene.index.ImpactsEnum; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.SlowImpactsEnum; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.packed.PackedInts; - -import com.twitter.search.common.util.hash.BDZAlgorithm; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -public class MPHTermDictionary implements TermDictionary, Flushable { - private final BDZAlgorithm termsHashFunction; - private final PackedInts.Reader termPointers; - private final ByteBlockPool termPool; - private final TermPointerEncoding termPointerEncoding; - private final int numTerms; - - MPHTermDictionary(int numTerms, BDZAlgorithm termsHashFunction, - PackedInts.Reader termPointers, ByteBlockPool termPool, - TermPointerEncoding termPointerEncoding) { - this.numTerms = numTerms; - this.termsHashFunction = termsHashFunction; - this.termPointers = termPointers; - this.termPool = termPool; - this.termPointerEncoding = termPointerEncoding; - } - - @Override - public int getNumTerms() { - return numTerms; - } - - @Override - public int lookupTerm(BytesRef term) { - int termID = termsHashFunction.lookup(term); - if (termID >= getNumTerms() || termID < 0) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - - if (ByteTermUtils.postingEquals(termPool, termPointerEncoding - .getTextStart((int) termPointers.get(termID)), term)) { - return termID; - } else { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } - } - - @Override - public boolean getTerm(int termID, BytesRef text, BytesRef termPayload) { - int termPointer = (int) termPointers.get(termID); - boolean 
hasTermPayload = termPointerEncoding.hasPayload(termPointer); - int textStart = termPointerEncoding.getTextStart(termPointer); - // setBytesRef sets the passed in BytesRef "text" to the term in the termPool. - // As a side effect it returns the offset of the next entry in the pool after the term, - // which may optionally be used if this term has a payload. - int termPayloadStart = ByteTermUtils.setBytesRef(termPool, text, textStart); - if (termPayload != null && hasTermPayload) { - ByteTermUtils.setBytesRef(termPool, termPayload, termPayloadStart); - } - - return hasTermPayload; - } - - @Override - public TermsEnum createTermsEnum(OptimizedMemoryIndex index) { - return new MPHTermsEnum(index); - } - - public static class MPHTermsEnum extends BaseTermsEnum { - private int termID; - private final BytesRef bytesRef = new BytesRef(); - private final OptimizedMemoryIndex index; - - MPHTermsEnum(OptimizedMemoryIndex index) { - this.index = index; - } - - @Override - public int docFreq() { - return index.getDF(termID); - } - - @Override - public PostingsEnum postings(PostingsEnum reuse, int flags) throws IOException { - int postingsPointer = index.getPostingListPointer(termID); - int numPostings = index.getNumPostings(termID); - return index.getPostingLists().postings(postingsPointer, numPostings, flags); - } - - @Override - public ImpactsEnum impacts(int flags) throws IOException { - return new SlowImpactsEnum(postings(null, flags)); - } - - @Override - public SeekStatus seekCeil(BytesRef text) throws IOException { - termID = index.lookupTerm(text); - - if (termID == -1) { - return SeekStatus.END; - } else { - return SeekStatus.FOUND; - } - } - - @Override - public BytesRef next() { - return null; - } - - @Override - public long ord() { - return termID; - } - - @Override - public void seekExact(long ord) { - if (ord < index.getNumTerms()) { - termID = (int) ord; - index.getTerm(termID, bytesRef, null); - } - } - - @Override - public BytesRef term() { - return bytesRef; - } - - @Override - public long totalTermFreq() { - return docFreq(); - } - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - static final String NUM_TERMS_PROP_NAME = "numTerms"; - private final TermPointerEncoding termPointerEncoding; - - public FlushHandler(TermPointerEncoding termPointerEncoding) { - super(); - this.termPointerEncoding = termPointerEncoding; - } - - public FlushHandler(MPHTermDictionary objectToFlush) { - super(objectToFlush); - this.termPointerEncoding = objectToFlush.termPointerEncoding; - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - MPHTermDictionary objectToFlush = getObjectToFlush(); - flushInfo.addIntProperty(NUM_TERMS_PROP_NAME, objectToFlush.getNumTerms()); - - out.writePackedInts(objectToFlush.termPointers); - objectToFlush.termPool.getFlushHandler().flush(flushInfo.newSubProperties("termPool"), out); - objectToFlush.termsHashFunction.getFlushHandler() - .flush(flushInfo.newSubProperties("termsHashFunction"), out); - } - - @Override - protected MPHTermDictionary doLoad(FlushInfo flushInfo, - DataDeserializer in) throws IOException { - int numTerms = flushInfo.getIntProperty(NUM_TERMS_PROP_NAME); - PackedInts.Reader termPointers = in.readPackedInts(); - ByteBlockPool termPool = (new ByteBlockPool.FlushHandler()).load( - flushInfo.getSubProperties("termPool"), in); - BDZAlgorithm termsHashFunction = 
(new BDZAlgorithm.FlushHandler()).load( - flushInfo.getSubProperties("termsHashFunction"), in); - - return new MPHTermDictionary(numTerms, termsHashFunction, termPointers, - termPool, termPointerEncoding); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiPostingLists.java b/src/java/com/twitter/search/core/earlybird/index/inverted/MultiPostingLists.java deleted file mode 100644 index 12ff47365..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiPostingLists.java +++ /dev/null @@ -1,135 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.index.PostingsEnum; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -public class MultiPostingLists extends OptimizedPostingLists { - - @VisibleForTesting - public static final int DEFAULT_DF_THRESHOLD = 1000; - - private final OptimizedPostingLists lowDF; - private final OptimizedPostingLists highDF; - - private final int dfThreshold; - - /** - * Given the number of postings in each term (in this field), sum up the number of postings in - * the low df fields. - * @param numPostingsPerTerm number of postings in each term in this field. - * @param dfThreshold the low/high df threshold. - */ - private static int numPostingsInLowDfTerms(int[] numPostingsPerTerm, int dfThreshold) { - int sumOfAllPostings = 0; - for (int numPostingsInATerm : numPostingsPerTerm) { - if (numPostingsInATerm < dfThreshold) { - sumOfAllPostings += numPostingsInATerm; - } - } - return sumOfAllPostings; - } - - /** - * Creates a new posting list delegating to either lowDF or highDF posting list. - * @param omitPositions whether positions should be omitted or not. - * @param numPostingsPerTerm number of postings in each term in this field. - * @param maxPosition the largest position used in all the postings for this field. - */ - public MultiPostingLists( - boolean omitPositions, - int[] numPostingsPerTerm, - int maxPosition) { - this( - new LowDFPackedIntsPostingLists( - omitPositions, - numPostingsInLowDfTerms(numPostingsPerTerm, DEFAULT_DF_THRESHOLD), - maxPosition), - new HighDFPackedIntsPostingLists(omitPositions), - DEFAULT_DF_THRESHOLD); - } - - private MultiPostingLists( - OptimizedPostingLists lowDF, - OptimizedPostingLists highDF, - int dfThreshold) { - this.lowDF = lowDF; - this.highDF = highDF; - this.dfThreshold = dfThreshold; - } - - @Override - public int copyPostingList(PostingsEnum postingsEnum, int numPostings) - throws IOException { - return numPostings < dfThreshold - ? lowDF.copyPostingList(postingsEnum, numPostings) - : highDF.copyPostingList(postingsEnum, numPostings); - } - - @Override - public EarlybirdPostingsEnum postings(int postingsPointer, int numPostings, int flags) - throws IOException { - return numPostings < dfThreshold - ? 
lowDF.postings(postingsPointer, numPostings, flags) - : highDF.postings(postingsPointer, numPostings, flags); - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - @VisibleForTesting - OptimizedPostingLists getLowDfPostingsList() { - return lowDF; - } - - @VisibleForTesting - OptimizedPostingLists getHighDfPostingsList() { - return highDF; - } - - public static class FlushHandler extends Flushable.Handler { - private static final String DF_THRESHOLD_PROP_NAME = "dfThresHold"; - - public FlushHandler() { - super(); - } - - public FlushHandler(MultiPostingLists objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - MultiPostingLists objectToFlush = getObjectToFlush(); - flushInfo.addIntProperty(DF_THRESHOLD_PROP_NAME, objectToFlush.dfThreshold); - objectToFlush.lowDF.getFlushHandler().flush( - flushInfo.newSubProperties("lowDFPostinglists"), out); - objectToFlush.highDF.getFlushHandler().flush( - flushInfo.newSubProperties("highDFPostinglists"), out); - } - - @Override - protected MultiPostingLists doLoad(FlushInfo flushInfo, - DataDeserializer in) throws IOException { - OptimizedPostingLists lowDF = new LowDFPackedIntsPostingLists.FlushHandler() - .load(flushInfo.getSubProperties("lowDFPostinglists"), in); - OptimizedPostingLists highDF = new HighDFPackedIntsPostingLists.FlushHandler() - .load(flushInfo.getSubProperties("highDFPostinglists"), in); - return new MultiPostingLists( - lowDF, - highDF, - flushInfo.getIntProperty(DF_THRESHOLD_PROP_NAME)); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionary.java b/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionary.java deleted file mode 100644 index 8b7dee75d..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionary.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import com.google.common.collect.ImmutableList; - -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * A term dictionary that's backed by multiple underlying segments/indexes. For a given term, will - * be able to return the termId for each of the underlying indexes. - */ -public interface MultiSegmentTermDictionary { - - /** - * Lookup a term in this multi segment term dictionary, and return the term ids for that term on - * all of the managed segments. - * - * @return An array containing a termId for each segment that this term dictionary is backed by. - * The order of segments will match the order returned by {@link #getSegmentIndexes()}. - * - * For each segment, the term id will be returned, or - * {@link EarlybirdIndexSegmentAtomicReader#TERM_NOT_FOUND} if that segment does not have the - * given term. - */ - int[] lookupTermIds(BytesRef term); - - /** - * A convenience method for checking whether a specific index/segment is backed by this term - * dictionary. Returning true here is equivalent to returning: - *
-   * <pre>
-   * getSegmentIndexes().contains(invertedIndex);
-   * </pre>
- */ - default boolean supportSegmentIndex(InvertedIndex invertedIndex) { - return getSegmentIndexes().contains(invertedIndex); - } - - /** - * The list of indexes that this term dictionary is backed by. The order of indexes here will - * be consistent with the order of termIds returned by {@link #lookupTermIds(BytesRef)}. - */ - ImmutableList getSegmentIndexes(); - - /** - * Returns the number of terms in this term dictionary. - * - * If the term "foo" appears in segment A and in segment B, it will be counted once. To get the - * total number of terms across all managed segments, see {@link #getNumTermEntries()}. - */ - int getNumTerms(); - - /** - * Returns the total number of terms in this term dictionary across all managed segments. - * - * If the term "foo" appears in segment A and in segment B, it will have 2 entries in this term - * dictionary. - */ - int getNumTermEntries(); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithFastutil.java b/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithFastutil.java deleted file mode 100644 index 56efa3754..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithFastutil.java +++ /dev/null @@ -1,161 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.util.Arrays; -import java.util.HashMap; -import java.util.List; -import java.util.OptionalInt; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Stopwatch; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Maps; - -import org.apache.lucene.util.BytesRef; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.util.LogFormatUtil; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -import it.unimi.dsi.fastutil.ints.IntArrayList; - -/** - * This implementation took MultiSegmentTermDictionaryWithMap and replaced some of the - * data structures with fastutil equivalents and it also uses a more memory efficient way to - * store the precomputed data. - * - * This implementation has a requirement that each term per field needs to be present at - * most once per document, since we only have space to index 2^24 terms and we have 2^23 - * documents as of now in realtime earlybirds. - * - * See UserIdMultiSegmentQuery class comment for more information on how this is used. - */ -public class MultiSegmentTermDictionaryWithFastutil implements MultiSegmentTermDictionary { - private static final Logger LOG = LoggerFactory.getLogger( - MultiSegmentTermDictionaryWithFastutil.class); - - @VisibleForTesting - public static final SearchTimerStats TERM_DICTIONARY_CREATION_STATS = - SearchTimerStats.export("multi_segment_term_dictionary_with_fastutil_creation", - TimeUnit.MILLISECONDS, false); - - private static final int MAX_TERM_ID_BITS = 24; - private static final int TERM_ID_MASK = (1 << MAX_TERM_ID_BITS) - 1; // First 24 bits. - private static final int MAX_SEGMENT_SIZE = 1 << (MAX_TERM_ID_BITS - 1); - - private final ImmutableList indexes; - - // For each term, a list of (index id, term id) packed into an integer. - // The integer contains: - // byte 0: index (segment id). Since we have ~20 segments, this fits into a byte. - // bytes [1-3]: term id. 
The terms we're building this dictionary for are user ids - // associated with a tweet - from_user_id and in_reply_to_user_id. Since we have - // at most 2**23 tweets in realtime, we'll have at most 2**23 unique terms per - // segments. The term ids post optimization are consecutive numbers, so they will - // fit in 24 bits. We don't use the term dictionary in archive, which has more - // tweets per segment. - // - // To verify the maximum amount of tweets in a segment, see max_segment_size in - // earlybird-config.yml. - private final HashMap termsMap; - private final int numTerms; - private final int numTermEntries; - - int encodeIndexAndTermId(int indexId, int termId) { - // Push the index id to the left and use the other 24 bits for the term id. - return (indexId << MAX_TERM_ID_BITS) | termId; - } - - void decodeIndexAndTermId(int[] arr, int packed) { - arr[packed >> MAX_TERM_ID_BITS] = packed & TERM_ID_MASK; - } - - - /** - * Creates a new multi-segment term dictionary backed by a regular java map. - */ - public MultiSegmentTermDictionaryWithFastutil( - String field, - List indexes) { - - this.indexes = ImmutableList.copyOf(indexes); - - // Pre-size the map with estimate of max number of terms. It should be at least that big. - OptionalInt optionalMax = indexes.stream().mapToInt(OptimizedMemoryIndex::getNumTerms).max(); - int maxNumTerms = optionalMax.orElse(0); - this.termsMap = Maps.newHashMapWithExpectedSize(maxNumTerms); - - LOG.info("About to merge {} indexes for field {}, estimated {} terms", - indexes.size(), field, LogFormatUtil.formatInt(maxNumTerms)); - Stopwatch stopwatch = Stopwatch.createStarted(); - - BytesRef termBytesRef = new BytesRef(); - - for (int indexId = 0; indexId < indexes.size(); indexId++) { - // The inverted index for this field. 
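(Aside: the packing scheme described in the comment above — segment index in the top 8 bits, term id in the low 24 bits of a single int — can be illustrated with a small self-contained sketch. The constants mirror MAX_TERM_ID_BITS and TERM_ID_MASK from this class, but the demo class, its method names, and the main method are hypothetical and simplified relative to the array-based decode used here.)

```java
// Hypothetical illustration of the (segment index, term id) int packing described above.
public final class PackedTermEntryDemo {
  static final int MAX_TERM_ID_BITS = 24;
  static final int TERM_ID_MASK = (1 << MAX_TERM_ID_BITS) - 1;

  // Assumes 0 <= segmentIndex < 256 and 0 <= termId < 2^24, as argued in the comment above.
  static int encode(int segmentIndex, int termId) {
    return (segmentIndex << MAX_TERM_ID_BITS) | termId;
  }

  static int decodeSegment(int packed) {
    return packed >>> MAX_TERM_ID_BITS;  // top 8 bits
  }

  static int decodeTermId(int packed) {
    return packed & TERM_ID_MASK;        // low 24 bits
  }

  public static void main(String[] args) {
    int packed = encode(17, 8_388_000);
    System.out.println(decodeSegment(packed)); // 17
    System.out.println(decodeTermId(packed));  // 8388000
  }
}
```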
- OptimizedMemoryIndex index = indexes.get(indexId); - - int indexNumTerms = index.getNumTerms(); - - if (indexNumTerms > MAX_SEGMENT_SIZE) { - throw new IllegalStateException("too many terms: " + indexNumTerms); - } - - for (int termId = 0; termId < indexNumTerms; termId++) { - index.getTerm(termId, termBytesRef); - - IntArrayList indexTerms = termsMap.get(termBytesRef); - if (indexTerms == null) { - BytesRef term = BytesRef.deepCopyOf(termBytesRef); - - indexTerms = new IntArrayList(); - termsMap.put(term, indexTerms); - } - - indexTerms.add(encodeIndexAndTermId(indexId, termId)); - } - } - - this.numTerms = termsMap.size(); - this.numTermEntries = indexes.stream().mapToInt(OptimizedMemoryIndex::getNumTerms).sum(); - - TERM_DICTIONARY_CREATION_STATS.timerIncrement(stopwatch.elapsed(TimeUnit.MILLISECONDS)); - LOG.info("Done merging {} segments for field {} in {} - " - + "num terms: {}, num term entries: {}.", - indexes.size(), field, stopwatch, - LogFormatUtil.formatInt(this.numTerms), - LogFormatUtil.formatInt(this.numTermEntries)); - } - - @Override - public int[] lookupTermIds(BytesRef term) { - int[] termIds = new int[indexes.size()]; - Arrays.fill(termIds, EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND); - - IntArrayList indexTerms = termsMap.get(term); - if (indexTerms != null) { - for (int i = 0; i < indexTerms.size(); i++) { - decodeIndexAndTermId(termIds, indexTerms.getInt(i)); - } - } - - return termIds; - } - - @Override - public ImmutableList getSegmentIndexes() { - return indexes; - } - - @Override - public int getNumTerms() { - return this.numTerms; - } - - @Override - public int getNumTermEntries() { - return this.numTermEntries; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithMap.java b/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithMap.java deleted file mode 100644 index 74b7103bc..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/MultiSegmentTermDictionaryWithMap.java +++ /dev/null @@ -1,134 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.util.Arrays; -import java.util.HashMap; -import java.util.List; -import java.util.OptionalInt; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.lucene.util.BytesRef; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.util.LogFormatUtil; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * A rather simple implementation of a MultiSegmentTermDictionary that just keeps all terms in a - * java hash map, and all the termIds for a term in a java list. - * - * An alternate implementation could have an MPH for the map, and a IntBlockPool for storing - * the term ids. - * - * See UserIdMultiSegmentQuery class comment for more information on how this is used. 
- */ -public class MultiSegmentTermDictionaryWithMap implements MultiSegmentTermDictionary { - private static final Logger LOG = LoggerFactory.getLogger( - MultiSegmentTermDictionaryWithMap.class); - - @VisibleForTesting - public static final SearchTimerStats TERM_DICTIONARY_CREATION_STATS = - SearchTimerStats.export("multi_segment_term_dictionary_with_map_creation", - TimeUnit.MILLISECONDS, false); - - private final ImmutableList indexes; - private final HashMap> termsMap; - private final int numTerms; - private final int numTermEntries; - - private static class IndexTerm { - private int indexId; - private final int termId; - - public IndexTerm(int indexId, int termId) { - this.indexId = indexId; - this.termId = termId; - } - } - - /** - * Creates a new multi-segment term dictionary backed by a regular java map. - */ - public MultiSegmentTermDictionaryWithMap( - String field, - List indexes) { - - this.indexes = ImmutableList.copyOf(indexes); - - // Pre-size the map with estimate of max number of terms. It should be at least that big. - OptionalInt optionalMax = indexes.stream().mapToInt(OptimizedMemoryIndex::getNumTerms).max(); - int maxNumTerms = optionalMax.orElse(0); - this.termsMap = Maps.newHashMapWithExpectedSize(maxNumTerms); - - LOG.info("About to merge {} indexes for field {}, estimated {} terms", - indexes.size(), field, LogFormatUtil.formatInt(maxNumTerms)); - long start = System.currentTimeMillis(); - - BytesRef termText = new BytesRef(); - long copiedBytes = 0; - for (int indexId = 0; indexId < indexes.size(); indexId++) { - // The inverted index for this field. - OptimizedMemoryIndex index = indexes.get(indexId); - - int indexNumTerms = index.getNumTerms(); - for (int termId = 0; termId < indexNumTerms; termId++) { - index.getTerm(termId, termText); - - // This copies the underlying array to a new array. 
- BytesRef term = BytesRef.deepCopyOf(termText); - copiedBytes += term.length; - - List indexTerms = termsMap.computeIfAbsent(term, k -> Lists.newArrayList()); - - indexTerms.add(new IndexTerm(indexId, termId)); - } - } - - this.numTerms = termsMap.size(); - this.numTermEntries = indexes.stream().mapToInt(OptimizedMemoryIndex::getNumTerms).sum(); - - long elapsed = System.currentTimeMillis() - start; - TERM_DICTIONARY_CREATION_STATS.timerIncrement(elapsed); - LOG.info("Done merging {} indexes for field {} in {}ms - " - + "num terms: {}, num term entries: {}, copied bytes: {}", - indexes.size(), field, elapsed, - LogFormatUtil.formatInt(this.numTerms), LogFormatUtil.formatInt(this.numTermEntries), - LogFormatUtil.formatInt(copiedBytes)); - } - - @Override - public int[] lookupTermIds(BytesRef term) { - int[] termIds = new int[indexes.size()]; - Arrays.fill(termIds, EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND); - - List indexTerms = termsMap.get(term); - if (indexTerms != null) { - for (IndexTerm indexTerm : indexTerms) { - termIds[indexTerm.indexId] = indexTerm.termId; - } - } - - return termIds; - } - - @Override - public ImmutableList getSegmentIndexes() { - return indexes; - } - - @Override - public int getNumTerms() { - return this.numTerms; - } - - @Override - public int getNumTermEntries() { - return this.numTermEntries; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedIndexTerms.java b/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedIndexTerms.java deleted file mode 100644 index 10aa7849d..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedIndexTerms.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; - -public class OptimizedIndexTerms extends Terms { - private final OptimizedMemoryIndex index; - - public OptimizedIndexTerms(OptimizedMemoryIndex index) { - this.index = index; - } - - @Override - public long size() { - return index.getNumTerms(); - } - - @Override - public TermsEnum iterator() { - return index.createTermsEnum(index.getMaxPublishedPointer()); - } - - @Override - public long getSumTotalTermFreq() { - return index.getSumTotalTermFreq(); - } - - @Override - public long getSumDocFreq() { - return index.getSumTermDocFreq(); - } - - @Override - public int getDocCount() { - return index.getNumDocs(); - } - - @Override - public boolean hasFreqs() { - return false; - } - - @Override - public boolean hasOffsets() { - return false; - } - - @Override - public boolean hasPositions() { - return true; - } - - @Override - public boolean hasPayloads() { - return false; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedMemoryIndex.java b/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedMemoryIndex.java deleted file mode 100644 index 3298b80e8..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedMemoryIndex.java +++ /dev/null @@ -1,434 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Comparator; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import 
org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.packed.PackedInts; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.util.hash.BDZAlgorithm; -import com.twitter.search.common.util.hash.BDZAlgorithm.MPHFNotFoundException; -import com.twitter.search.common.util.hash.KeysSource; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.facets.FacetIDMap.FacetField; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -public class OptimizedMemoryIndex extends InvertedIndex implements Flushable { - private static final Logger LOG = LoggerFactory.getLogger(OptimizedMemoryIndex.class); - private static final Comparator BYTES_REF_COMPARATOR = Comparator.naturalOrder(); - - private static final SearchCounter MPH_NOT_FOUND_COUNT = - SearchCounter.export("twitter_optimized_index_mph_not_found_count"); - - private final PackedInts.Reader numPostings; - private final PackedInts.Reader postingListPointers; - private final PackedInts.Reader offensiveCounters; - private final MultiPostingLists postingLists; - - private final TermDictionary dictionary; - - private final int numDocs; - private final int sumTotalTermFreq; - private final int sumTermDocFreq; - - private OptimizedMemoryIndex(EarlybirdFieldType fieldType, - int numDocs, - int sumTermDocFreq, - int sumTotalTermFreq, - PackedInts.Reader numPostings, - PackedInts.Reader postingListPointers, - PackedInts.Reader offensiveCounters, - MultiPostingLists postingLists, - TermDictionary dictionary) { - super(fieldType); - this.numDocs = numDocs; - this.sumTermDocFreq = sumTermDocFreq; - this.sumTotalTermFreq = sumTotalTermFreq; - this.numPostings = numPostings; - this.postingListPointers = postingListPointers; - this.offensiveCounters = offensiveCounters; - this.postingLists = postingLists; - this.dictionary = dictionary; - } - - public OptimizedMemoryIndex( - EarlybirdFieldType fieldType, - String field, - InvertedRealtimeIndex source, - Map termIDMapper, - FacetField facetField, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(fieldType); - - numDocs = source.getNumDocs(); - sumTermDocFreq = source.getSumTermDocFreq(); - sumTotalTermFreq = source.getSumTotalTermFreq(); - - Preconditions.checkNotNull(originalTweetIdMapper, "The segment must have a tweet ID mapper."); - Preconditions.checkNotNull(optimizedTweetIdMapper, - "The optimized tweet ID mapper cannot be null."); - - // We rely on the fact that new terms always have a greater term ID. We ignore all terms that - // are equal to or greater than numTerms, as they may be incompletely applied. If new terms are - // added while optimizing, they will be re-added when we re-apply updates. 
- final KeysSource termsIterator = source.getKeysSource(); - int numTerms = termsIterator.getNumberOfKeys(); - int maxPublishedPointer = source.getMaxPublishedPointer(); - - int[] tempPostingListPointers = new int[numTerms]; - - BDZAlgorithm termsHashFunction = null; - - final boolean supportTermTextLookup = facetField != null || fieldType.isSupportTermTextLookup(); - if (supportTermTextLookup) { - try { - termsHashFunction = new BDZAlgorithm(termsIterator); - } catch (MPHFNotFoundException e) { - // we couldn't find a mphf for this field - // no problem, this can happen for very small fields - // - just use the fst in that case - LOG.warn("Unable to build MPH for field: {}", field); - MPH_NOT_FOUND_COUNT.increment(); - } - } - - // Make sure to only call the expensive computeNumPostings() once. - int[] numPostingsSource = computeNumPostings(source, numTerms, maxPublishedPointer); - - // The BDZ Algorithm returns a function from bytesref to term ID. However, these term IDs are - // different than the original term IDs (it's a hash function, not a hash _table_), so we have - // to remap the term IDs to match the ones generated by BDZ. We track that using the termIDMap. - int[] termIDMap = null; - - if (termsHashFunction != null) { - termsIterator.rewind(); - termIDMap = BDZAlgorithm.createIdMap(termsHashFunction, termsIterator); - if (facetField != null) { - termIDMapper.put(facetField.getFacetId(), termIDMap); - } - - PackedInts.Reader termPointers = getPackedInts(source.getTermPointers(), termIDMap); - this.numPostings = getPackedInts(numPostingsSource, termIDMap); - this.offensiveCounters = source.getOffensiveCounters() == null ? null - : getPackedInts(source.getOffensiveCounters(), termIDMap); - - this.dictionary = new MPHTermDictionary( - numTerms, - termsHashFunction, - termPointers, - source.getTermPool(), - TermPointerEncoding.DEFAULT_ENCODING); - } else { - this.dictionary = FSTTermDictionary.buildFST( - source.getTermPool(), - source.getTermPointers(), - numTerms, - BYTES_REF_COMPARATOR, - supportTermTextLookup, - TermPointerEncoding.DEFAULT_ENCODING); - - this.numPostings = getPackedInts(numPostingsSource); - this.offensiveCounters = source.getOffensiveCounters() == null ? null - : getPackedInts(source.getOffensiveCounters()); - } - - TermsEnum allTerms = source.createTermsEnum(maxPublishedPointer); - - this.postingLists = new MultiPostingLists( - !fieldType.hasPositions(), - numPostingsSource, - source.getMaxPosition()); - - for (int termID = 0; termID < numTerms; termID++) { - allTerms.seekExact(termID); - PostingsEnum postingsEnum = new OptimizingPostingsEnumWrapper( - allTerms.postings(null), originalTweetIdMapper, optimizedTweetIdMapper); - int mappedTermID = termIDMap != null ? 
termIDMap[termID] : termID; - tempPostingListPointers[mappedTermID] = - postingLists.copyPostingList(postingsEnum, numPostingsSource[termID]); - } - - this.postingListPointers = getPackedInts(tempPostingListPointers); - } - - private static int[] map(int[] source, int[] map) { - int[] target = new int[map.length]; - for (int i = 0; i < map.length; i++) { - target[map[i]] = source[i]; - } - return target; - } - - static PackedInts.Reader getPackedInts(int[] values) { - return getPackedInts(values, null); - } - - private static PackedInts.Reader getPackedInts(int[] values, int[] map) { - int[] mappedValues = values; - if (map != null) { - mappedValues = map(mappedValues, map); - } - - // first determine max value - long maxValue = Long.MIN_VALUE; - for (int value : mappedValues) { - if (value > maxValue) { - maxValue = value; - } - } - - PackedInts.Mutable packed = - PackedInts.getMutable(mappedValues.length, PackedInts.bitsRequired(maxValue), - PackedInts.DEFAULT); - for (int i = 0; i < mappedValues.length; i++) { - packed.set(i, mappedValues[i]); - } - - return packed; - } - - /** - * Returns per-term array containing the number of posting in this index for each term. - * This call is extremely slow. - */ - private static int[] computeNumPostings( - InvertedRealtimeIndex source, - int numTerms, - int maxPublishedPointer - ) throws IOException { - int[] numPostings = new int[numTerms]; - TermsEnum allTerms = source.createTermsEnum(maxPublishedPointer); - - for (int termID = 0; termID < numTerms; termID++) { - allTerms.seekExact(termID); - PostingsEnum docsEnum = allTerms.postings(null); - while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { - numPostings[termID] += docsEnum.freq(); - } - } - - return numPostings; - } - - @Override - public int getNumDocs() { - return numDocs; - } - - @Override - public int getSumTotalTermFreq() { - return sumTotalTermFreq; - } - - @Override - public int getSumTermDocFreq() { - return sumTermDocFreq; - } - - public OptimizedPostingLists getPostingLists() { - Preconditions.checkState(hasPostingLists()); - return postingLists; - } - - int getPostingListPointer(int termID) { - Preconditions.checkState(hasPostingLists()); - return (int) postingListPointers.get(termID); - } - - int getNumPostings(int termID) { - Preconditions.checkState(hasPostingLists()); - return (int) numPostings.get(termID); - } - - public boolean getTerm(int termID, BytesRef text, BytesRef termPayload) { - return dictionary.getTerm(termID, text, termPayload); - } - - @Override - public FacetLabelAccessor getLabelAccessor() { - return new FacetLabelAccessor() { - @Override - protected boolean seek(long termID) { - if (termID != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - hasTermPayload = getTerm((int) termID, termRef, termPayload); - offensiveCount = offensiveCounters != null - ? 
(int) offensiveCounters.get((int) termID) : 0; - return true; - } else { - return false; - } - } - }; - } - - @Override - public Terms createTerms(int maxPublishedPointer) { - return new OptimizedIndexTerms(this); - } - - @Override - public TermsEnum createTermsEnum(int maxPublishedPointer) { - return dictionary.createTermsEnum(this); - } - - @Override - public int lookupTerm(BytesRef term) throws IOException { - return dictionary.lookupTerm(term); - } - - @Override - public int getLargestDocIDForTerm(int termID) throws IOException { - Preconditions.checkState(hasPostingLists()); - if (termID == EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - return EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - } else { - return postingLists.getLargestDocID((int) postingListPointers.get(termID), - (int) numPostings.get(termID)); - } - } - - @Override - public int getDF(int termID) { - return (int) numPostings.get(termID); - } - - @Override - public int getNumTerms() { - return dictionary.getNumTerms(); - } - - @Override - public void getTerm(int termID, BytesRef text) { - dictionary.getTerm(termID, text, null); - } - - @VisibleForTesting TermDictionary getTermDictionary() { - return dictionary; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public boolean hasPostingLists() { - return postingListPointers != null - && postingLists != null - && numPostings != null; - } - - @VisibleForTesting - OptimizedPostingLists getOptimizedPostingLists() { - return postingLists; - } - - public static class FlushHandler extends Flushable.Handler { - private static final String NUM_DOCS_PROP_NAME = "numDocs"; - private static final String SUM_TOTAL_TERM_FREQ_PROP_NAME = "sumTotalTermFreq"; - private static final String SUM_TERM_DOC_FREQ_PROP_NAME = "sumTermDocFreq"; - private static final String USE_MIN_PERFECT_HASH_PROP_NAME = "useMinimumPerfectHashFunction"; - private static final String SKIP_POSTING_LIST_PROP_NAME = "skipPostingLists"; - private static final String HAS_OFFENSIVE_COUNTERS_PROP_NAME = "hasOffensiveCounters"; - public static final String IS_OPTIMIZED_PROP_NAME = "isOptimized"; - - private final EarlybirdFieldType fieldType; - - public FlushHandler(EarlybirdFieldType fieldType) { - super(); - this.fieldType = fieldType; - } - - public FlushHandler(OptimizedMemoryIndex objectToFlush) { - super(objectToFlush); - fieldType = objectToFlush.fieldType; - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - long startTime = getClock().nowMillis(); - OptimizedMemoryIndex objectToFlush = getObjectToFlush(); - boolean useHashFunction = objectToFlush.dictionary instanceof MPHTermDictionary; - boolean skipPostingLists = !objectToFlush.hasPostingLists(); - - flushInfo.addIntProperty(NUM_DOCS_PROP_NAME, objectToFlush.numDocs); - flushInfo.addIntProperty(SUM_TERM_DOC_FREQ_PROP_NAME, objectToFlush.sumTermDocFreq); - flushInfo.addIntProperty(SUM_TOTAL_TERM_FREQ_PROP_NAME, objectToFlush.sumTotalTermFreq); - flushInfo.addBooleanProperty(USE_MIN_PERFECT_HASH_PROP_NAME, useHashFunction); - flushInfo.addBooleanProperty(SKIP_POSTING_LIST_PROP_NAME, skipPostingLists); - flushInfo.addBooleanProperty(HAS_OFFENSIVE_COUNTERS_PROP_NAME, - objectToFlush.offensiveCounters != null); - flushInfo.addBooleanProperty(IS_OPTIMIZED_PROP_NAME, true); - - if (!skipPostingLists) { - out.writePackedInts(objectToFlush.postingListPointers); - out.writePackedInts(objectToFlush.numPostings); - } - if (objectToFlush.offensiveCounters != 
null) { - out.writePackedInts(objectToFlush.offensiveCounters); - } - - if (!skipPostingLists) { - objectToFlush.postingLists.getFlushHandler().flush( - flushInfo.newSubProperties("postingLists"), out); - } - objectToFlush.dictionary.getFlushHandler().flush(flushInfo.newSubProperties("dictionary"), - out); - getFlushTimerStats().timerIncrement(getClock().nowMillis() - startTime); - } - - @Override - protected OptimizedMemoryIndex doLoad( - FlushInfo flushInfo, DataDeserializer in) throws IOException { - long startTime = getClock().nowMillis(); - boolean useHashFunction = flushInfo.getBooleanProperty(USE_MIN_PERFECT_HASH_PROP_NAME); - boolean skipPostingLists = flushInfo.getBooleanProperty(SKIP_POSTING_LIST_PROP_NAME); - - PackedInts.Reader postingListPointers = skipPostingLists ? null : in.readPackedInts(); - PackedInts.Reader numPostings = skipPostingLists ? null : in.readPackedInts(); - PackedInts.Reader offensiveCounters = - flushInfo.getBooleanProperty(HAS_OFFENSIVE_COUNTERS_PROP_NAME) - ? in.readPackedInts() : null; - - MultiPostingLists postingLists = skipPostingLists ? null - : (new MultiPostingLists.FlushHandler()) - .load(flushInfo.getSubProperties("postingLists"), in); - - TermDictionary dictionary; - if (useHashFunction) { - dictionary = (new MPHTermDictionary.FlushHandler(TermPointerEncoding.DEFAULT_ENCODING)) - .load(flushInfo.getSubProperties("dictionary"), in); - } else { - dictionary = (new FSTTermDictionary.FlushHandler(TermPointerEncoding.DEFAULT_ENCODING)) - .load(flushInfo.getSubProperties("dictionary"), in); - } - getLoadTimerStats().timerIncrement(getClock().nowMillis() - startTime); - - return new OptimizedMemoryIndex(fieldType, - flushInfo.getIntProperty(NUM_DOCS_PROP_NAME), - flushInfo.getIntProperty(SUM_TERM_DOC_FREQ_PROP_NAME), - flushInfo.getIntProperty(SUM_TOTAL_TERM_FREQ_PROP_NAME), - numPostings, - postingListPointers, - offensiveCounters, - postingLists, - dictionary); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedPostingLists.java b/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedPostingLists.java deleted file mode 100644 index 3bec3505b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizedPostingLists.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.index.PostingsEnum; - -import com.twitter.search.common.util.io.flushable.Flushable; - -public abstract class OptimizedPostingLists implements Flushable { - static final int MAX_DOC_ID_BIT = 24; - static final int MAX_DOC_ID = (1 << MAX_DOC_ID_BIT) - 1; - - static final int MAX_POSITION_BIT = 31; - - static final int MAX_FREQ_BIT = 31; - - /** - * Copies the given posting list into these posting lists. - * - * @param postingsEnum enumerator of the posting list that needs to be copied - * @param numPostings number of postings in the posting list that needs to be copied - * @return position index of the head of the copied posting list in these posting lists instance - */ - public abstract int copyPostingList(PostingsEnum postingsEnum, int numPostings) - throws IOException; - - /** - * Create and return a postings doc enumerator or doc-position enumerator based on input flag. 
- * - * @see org.apache.lucene.index.PostingsEnum - */ - public abstract EarlybirdPostingsEnum postings(int postingListPointer, int numPostings, int flags) - throws IOException; - - /** - * Returns the largest docID contained in the posting list pointed by {@code postingListPointer}. - */ - public final int getLargestDocID(int postingListPointer, int numPostings) throws IOException { - return postings(postingListPointer, numPostings, PostingsEnum.NONE).getLargestDocID(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizingPostingsEnumWrapper.java b/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizingPostingsEnumWrapper.java deleted file mode 100644 index a06637c1b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/OptimizingPostingsEnumWrapper.java +++ /dev/null @@ -1,128 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Collections; -import java.util.List; -import java.util.Map; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -/** - * A PostingsEnum that maps doc IDs in one DocIDToTweetIDMapper instance to doc IDs in another - * DocIDToTweetIDMapper. - * - * Unoptimized segments can use any DocIDToTweetIDMapper they want, which means that there are no - * guarantees on the distribution of the doc IDs in this mapper. However, optimized segments must - * use an OptimizedTweetIDMapper: we want to assign sequential doc IDs and use delta encondings in - * order to save space. So when an Earlybird segment needs to be optimized, we might need to convert - * the doc ID space of the unoptimized tweet ID mapper to the doc ID space of the optimized mapper. - * However, once we do this, the doc IDs stored in the posting lists in that segment will no longer - * be valid, unless we remap them too. So the goal of this class is to provide a way to do that. - * - * When we want to optimize a posting list, we need to traverse it and pack it. This class provides - * a wrapper around the original posting list that does the doc ID remapping at traversal time. 
- */ -public class OptimizingPostingsEnumWrapper extends PostingsEnum { - private final List docIds = Lists.newArrayList(); - private final Map> positions = Maps.newHashMap(); - - private int docIdIndex = -1; - private int positionIndex = -1; - - public OptimizingPostingsEnumWrapper(PostingsEnum source, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper newTweetIdMapper) throws IOException { - int docId; - while ((docId = source.nextDoc()) != NO_MORE_DOCS) { - long tweetId = originalTweetIdMapper.getTweetID(docId); - int newDocId = newTweetIdMapper.getDocID(tweetId); - Preconditions.checkState(newDocId != DocIDToTweetIDMapper.ID_NOT_FOUND, - "Did not find a mapping in the new tweet ID mapper for tweet ID %s, doc ID %s", - tweetId, docId); - - docIds.add(newDocId); - List docPositions = Lists.newArrayListWithCapacity(source.freq()); - positions.put(newDocId, docPositions); - for (int i = 0; i < source.freq(); ++i) { - docPositions.add(source.nextPosition()); - } - } - Collections.sort(docIds); - } - - @Override - public int nextDoc() { - ++docIdIndex; - if (docIdIndex >= docIds.size()) { - return NO_MORE_DOCS; - } - - positionIndex = -1; - return docIds.get(docIdIndex); - } - - @Override - public int freq() { - Preconditions.checkState(docIdIndex >= 0, "freq() called before nextDoc()."); - Preconditions.checkState(docIdIndex < docIds.size(), - "freq() called after nextDoc() returned NO_MORE_DOCS."); - return positions.get(docIds.get(docIdIndex)).size(); - } - - @Override - public int nextPosition() { - Preconditions.checkState(docIdIndex >= 0, "nextPosition() called before nextDoc()."); - Preconditions.checkState(docIdIndex < docIds.size(), - "nextPosition() called after nextDoc() returned NO_MORE_DOCS."); - - ++positionIndex; - Preconditions.checkState(positionIndex < positions.get(docIds.get(docIdIndex)).size(), - "nextPosition() called more than freq() times."); - return positions.get(docIds.get(docIdIndex)).get(positionIndex); - } - - // All other methods are not supported. 
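(Aside: the remapping done in this wrapper's constructor — translate each source doc ID to its tweet ID, look that tweet ID up in the optimized mapper, and emit the resulting doc IDs in ascending order — can be sketched independently of Lucene. The `Mapper` interface below is a hypothetical stand-in for `DocIDToTweetIDMapper`, not the real API.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the doc ID remapping idea, under assumed interfaces.
final class DocIdRemapSketch {
  // Hypothetical stand-in for DocIDToTweetIDMapper.
  interface Mapper {
    long getTweetId(int docId);
    int getDocId(long tweetId); // returns -1 if the tweet is unknown
  }

  // Translates doc IDs from the original mapper's space into the optimized mapper's
  // space and returns them in ascending order, mirroring how the wrapper buffers and
  // sorts postings before they are copied into the optimized posting lists.
  static List<Integer> remap(List<Integer> originalDocIds, Mapper original, Mapper optimized) {
    List<Integer> remapped = new ArrayList<>(originalDocIds.size());
    for (int docId : originalDocIds) {
      long tweetId = original.getTweetId(docId);
      int newDocId = optimized.getDocId(tweetId);
      if (newDocId == -1) {
        throw new IllegalStateException("No mapping for tweet ID " + tweetId);
      }
      remapped.add(newDocId);
    }
    Collections.sort(remapped);
    return remapped;
  }
}
```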
- - @Override - public int advance(int target) { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.advance() is not supported."); - } - - @Override - public long cost() { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.cost() is not supported."); - } - - @Override - public int docID() { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.docID() is not supported."); - } - - @Override - public int endOffset() { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.endOffset() is not supported."); - } - - @Override - public BytesRef getPayload() { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.getPayload() is not supported."); - } - - @Override - public int startOffset() { - throw new UnsupportedOperationException( - "OptimizingPostingsEnumWrapper.startOffset() is not supported."); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/PackedLongsReaderPreComputedValues.java b/src/java/com/twitter/search/core/earlybird/index/inverted/PackedLongsReaderPreComputedValues.java deleted file mode 100644 index 3ea8d3480..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/PackedLongsReaderPreComputedValues.java +++ /dev/null @@ -1,202 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * Pre-computed shifts, mask, and start int indices used by - * {@link IntBlockPoolPackedLongsReader} to decode packed values from - * {@link IntBlockPool}. - * - * The purpose of this class is for decoding efficiency and speed. This class is thread-safe since - * all its usages are read-only. - * - * Packed ints are stored from LOWEST bits for HIGHEST bits in an int. - * - * Here are 3 different situations when a packed value spans 1, 2, and 3 ints: - * - * - A packed value spans 1 int: - * [High Bits ................................. Low Bits] - * int[n] = [possible_other_data|packed_value|possible_other_data] - * - * To decode, 1 shift right and 1 mask are needed: - * * shift - {@link #allLowBitsRightShift} - * * mask - dynamically computed based on bitsPerValue (in decoded slice). - * - * - A packed value spans 2 ints: - * The data is stored as: - * [High Bits .................. Low Bits] - * int[n] = [low_bits_of_packed_value | other_data] - * int[n+1] = [other_data| high_bits_of_packed_value] - * - * To decode, 1 shift right, 1 shift left, and 2 masks are needed: - * * 1 shift right {@link #allLowBitsRightShift} and 1 mask (computed on the fly) to compute - * low_bits_of_packed_value - * * 1 mask {@link #allMiddleBitsMask} and 1 shift left {@link #allMiddleBitsLeftShift} to - * compute high_bits_of_packed_value - * * 1 OR to combine `high_bits_of_packed_value | low_bits_of_packed_value` - * - * - A packed value spans 3 ints: - * The data is stored as: - * [High Bits .................. Low Bits] - * int[n] = [low_bits_of_packed_value | other_data] - * int[n+1] = [ ... middle_bits_of_packed_value ... 
] - * int[n+2] = [other_data| high_bits_of_packed_value] - * - * To decode, 1 shift right, 2 shift left, and 3 masks are needed: - * * 1 shift right {@link #allLowBitsRightShift} and 1 mask (computed on the fly) to compute - * low_bits_of_packed_value - * * 1 shift left {@link #allMiddleBitsLeftShift} and 1 mask {@link #allMiddleBitsMask} to - * compute middle_bits_of_data - * * 1 shift left {@link #allHighBitsLeftShift} and 1 mask {@link #allHighBitsMask} to compute - * high_bits_of_data - * * 1 OR to combine `low_bits_of_data | middle_bits_of_data | high_bits_of_data` - * - * Example usage: - * @see HighDFPackedIntsDocsEnum - * @see HighDFPackedIntsDocsAndPositionsEnum - */ -public final class PackedLongsReaderPreComputedValues { - private final int[][] allLowBitsRightShift; - private final int[][] allMiddleBitsLeftShift; - private final int[][] allMiddleBitsMask; - private final int[][] allHighBitsLeftShift; - private final int[][] allHighBitsMask; - - /** - * 2D int arrays containing pre-computed start int indices; the 2 dimensions are - * int[numBitsPerPackedValue][packedValueIndex]. - * - * For a given number bits per packed value and a given packed value index, this is the first - * int in the subsequent of ints that contains the packed value with the given packed value index. - */ - private final int[][] allStartIntIndices; - - /** - * Sole constructor. - * - * @param maxBitsPerValue max possible number of bits of packed values that will be decoded - * @param maxNumValues max number of values are encoded back to back - * @param maxNumInts max number of ints are used to store packed values - * @param needStartIntIndex for optimization: whether start int indices are needed - */ - PackedLongsReaderPreComputedValues( - int maxBitsPerValue, - int maxNumValues, - int maxNumInts, - boolean needStartIntIndex) { - assert maxBitsPerValue <= Long.SIZE; - - if (needStartIntIndex) { - this.allStartIntIndices = new int[maxBitsPerValue + 1][maxNumValues]; - } else { - this.allStartIntIndices = null; - } - - this.allLowBitsRightShift = new int[maxBitsPerValue + 1][maxNumValues]; - this.allMiddleBitsLeftShift = new int[maxBitsPerValue + 1][maxNumValues]; - this.allMiddleBitsMask = new int[maxBitsPerValue + 1][maxNumValues]; - - // Packed value could use up 2 ints. - if (maxBitsPerValue > Integer.SIZE) { - this.allHighBitsLeftShift = new int[maxBitsPerValue + 1][maxNumValues]; - this.allHighBitsMask = new int[maxBitsPerValue + 1][maxNumValues]; - } else { - this.allHighBitsLeftShift = null; - this.allHighBitsMask = null; - } - - compute(maxBitsPerValue, maxNumValues, maxNumInts); - } - - /** - * Compute masks, shifts and start indices. - */ - private void compute(int maxBitsPerValue, int maxNumValues, int maxNumInts) { - // For each possible bits per packed value. - for (int bitsPerPackedValue = 0; bitsPerPackedValue <= maxBitsPerValue; bitsPerPackedValue++) { - int[] startIntIndices = - allStartIntIndices != null ? allStartIntIndices[bitsPerPackedValue] : null; - int[] lowBitsRightShift = - allLowBitsRightShift[bitsPerPackedValue]; - int[] middleBitsLeftShift = - allMiddleBitsLeftShift[bitsPerPackedValue]; - int[] middleBitsMask = - allMiddleBitsMask[bitsPerPackedValue]; - int[] highBitsLeftShift = - allHighBitsLeftShift != null ? allHighBitsLeftShift[bitsPerPackedValue] : null; - int[] highBitsMask = - allHighBitsMask != null ? allHighBitsMask[bitsPerPackedValue] : null; - - int shift = 0; - int currentIntIndex = 0; - int bitsRead; - int bitsRemaining; - - // For each packed value. 
- for (int packedValueIndex = 0; packedValueIndex < maxNumValues; packedValueIndex++) { - if (startIntIndices != null) { - startIntIndices[packedValueIndex] = currentIntIndex; - } - // Packed value spans to the 1st int. - lowBitsRightShift[packedValueIndex] = shift; - bitsRead = Integer.SIZE - shift; - bitsRemaining = bitsPerPackedValue - bitsRead; - - if (bitsRemaining >= 0) { - // Packed value spans to the 2nd int. - currentIntIndex++; - if (currentIntIndex == maxNumInts) { - break; - } - middleBitsLeftShift[packedValueIndex] = bitsRead; - middleBitsMask[packedValueIndex] = - bitsRemaining >= Integer.SIZE ? 0xFFFFFFFF : (1 << bitsRemaining) - 1; - - // Packed value spans to the 3rd int. - bitsRead += Integer.SIZE; - bitsRemaining -= Integer.SIZE; - if (bitsRemaining >= 0) { - currentIntIndex++; - if (currentIntIndex == maxNumInts) { - break; - } - assert highBitsLeftShift != null; - assert highBitsMask != null; - highBitsLeftShift[packedValueIndex] = bitsRead; - highBitsMask[packedValueIndex] = - bitsRemaining >= Integer.SIZE ? 0xFFFFFFFF : (1 << bitsRemaining) - 1; - } - } - - shift += bitsPerPackedValue; - shift = shift % Integer.SIZE; - } - } - } - - /******************************************************************** - * Getters of Pre-computed Values: returns should NEVER be modified * - ********************************************************************/ - - int[] getStartIntIndices(int numBitsPerValue) { - return allStartIntIndices == null ? null : allStartIntIndices[numBitsPerValue]; - } - - int[] getLowBitsRightShift(int numBitsPerValue) { - return allLowBitsRightShift[numBitsPerValue]; - } - - int[] getMiddleBitsLeftShift(int numBitsPerValue) { - return allMiddleBitsLeftShift[numBitsPerValue]; - } - - int[] getMiddleBitsMask(int numBitsPerValue) { - return allMiddleBitsMask[numBitsPerValue]; - } - - int[] getHighBitsLeftShift(int numBitsPerValue) { - return allHighBitsLeftShift == null ? null : allHighBitsLeftShift[numBitsPerValue]; - } - - int[] getHighBitsMask(int numBitsPerValue) { - return allHighBitsMask == null ? null : allHighBitsMask[numBitsPerValue]; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/PayloadUtil.java b/src/java/com/twitter/search/core/earlybird/index/inverted/PayloadUtil.java deleted file mode 100644 index f7addebbb..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/PayloadUtil.java +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import org.apache.lucene.util.BytesRef; - -/** - * Utilities for encoding and decoding BytesRefs into ints. The encoding is: - * [0..n] n bytes big-endian decoded into integers. - * n: number of bytes. - * - * Example: - * encode([DE, AD, BE, EF, AB]) => [0xDEADBEEF, 0xAB000000, 5] - * - * It's necessary to store the length at the end instead of the start so that we can know how far to - * jump backward from a skiplist entry. We can't store it after the skip list entry because there - * can be a variable number of pointers after the skip list entry. - * - * An example skip list entry, with labels on the following line: - * [0xDEADBEEF, 12, 654, 0x877, 0x78879] - * [ payload, position, docID, level0Pointer, level1Pointer] - */ -public final class PayloadUtil { - private PayloadUtil() { - } - - public static final int[] EMPTY_PAYLOAD = new int[]{0}; - - /** - * Encodes a {@link BytesRef} into an int array (to be inserted into a - * {@link IntBlockPool}. The encoder considers the input to be big-endian encoded ints. 
- */ - public static int[] encodePayload(BytesRef payload) { - if (payload == null) { - return EMPTY_PAYLOAD; - } - - int intsInPayload = intsForBytes(payload.length); - - int[] arr = new int[1 + intsInPayload]; - - for (int i = 0; i < intsInPayload; i++) { - int n = 0; - for (int j = 0; j < 4; j++) { - int index = i * 4 + j; - int b; - if (index < payload.length) { - // mask off the top bits in case b is negative. - b = payload.bytes[index] & 0xFF; - } else { - b = 0; - } - n = n << 8 | b; - } - - arr[i] = n; - } - - arr[intsInPayload] = payload.length; - - return arr; - } - - /** - * Decodes a {@link IntBlockPool} and position into a {@link BytesRef}. The ints are - * converted into big-endian encoded bytes. - */ - public static BytesRef decodePayload( - IntBlockPool b, - int pointer) { - int length = b.get(pointer); - BytesRef bytesRef = new BytesRef(length); - bytesRef.length = length; - - int numInts = intsForBytes(length); - - for (int i = 0; i < numInts; i++) { - int n = b.get(pointer - numInts + i); - for (int j = 0; j < 4; j++) { - int byteIndex = 4 * i + j; - if (byteIndex < length) { - bytesRef.bytes[byteIndex] = (byte) (n >> 8 * (3 - byteIndex % 4)); - } - } - } - - return bytesRef; - } - - private static int intsForBytes(int byteCount) { - return (byteCount + 3) / 4; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/PostingsBufferQueue.java b/src/java/com/twitter/search/core/earlybird/index/inverted/PostingsBufferQueue.java deleted file mode 100644 index 51ffbbe0c..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/PostingsBufferQueue.java +++ /dev/null @@ -1,155 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.util.NoSuchElementException; - -import com.google.common.annotations.VisibleForTesting; - -/** - * A posting buffer used by {@link HighDFPackedIntsPostingLists} while copying over posting list. - */ -final class PostingsBufferQueue { - /** - * Mask used to convert an int to a long. We cannot just cast because doing so will fill in the - * higher 32 bits with the sign bit, but we need the higher 32 bits to be 0 instead. - */ - static final long LONG_MASK = (1L << 32) - 1; - - /** - * A circular FIFO long queue used internally to store posting. - * @see #postingsQueue - */ - @VisibleForTesting - static final class Queue { - private final long[] queue; - private int head = 0; - private int tail = 0; - private int size; - - Queue(int maxSize) { - this.queue = new long[maxSize < 2 ? 2 : maxSize]; - } - - boolean isEmpty() { - return size() == 0; - } - - boolean isFull() { - return size() == queue.length; - } - - void offer(long value) { - if (size() == queue.length) { - throw new IllegalStateException("Queue is full"); - } - queue[tail] = value; - tail = (tail + 1) % queue.length; - size++; - } - - long poll() { - if (isEmpty()) { - throw new NoSuchElementException("Queue is empty."); - } - long value = queue[head]; - head = (head + 1) % queue.length; - size--; - return value; - } - - int size() { - return size; - } - } - - /** - * Internal posting queue. - */ - private final Queue postingsQueue; - - /** - * Constructor with max size. - * - * @param maxSize max size of this buffer. - */ - PostingsBufferQueue(int maxSize) { - this.postingsQueue = new Queue(maxSize); - } - - /** - * Check if the buffer is empty. - * - * @return If this buffer is empty - */ - boolean isEmpty() { - return postingsQueue.isEmpty(); - } - - /** - * Check if the buffer is full. 
- * - * @return If this buffer is full - */ - boolean isFull() { - return postingsQueue.isFull(); - } - - /** - * Get the current size of this buffer. - * - * @return Current size of this buffer - */ - int size() { - return postingsQueue.size(); - } - - /** - * Store a posting with docID and a second value that could be freq, position, or any additional - * info. This method will encode the offered doc ID and second value with - * {@link #encodePosting(int, int)}. - * - * @param docID doc ID of the posting - * @param secondValue an additional value of the posting - */ - void offer(int docID, int secondValue) { - postingsQueue.offer(encodePosting(docID, secondValue)); - } - - /** - * Remove and return the earliest inserted posting, this is a FIFO queue. - * - * @return the earliest inserted posting. - */ - long poll() { - return postingsQueue.poll(); - } - - /** - * Encode a doc ID and a second value, both are ints, into a long. The higher 32 bits store the - * doc ID and lower 32 bits store the second value. - * - * @param docID an int specifying doc ID of the posting - * @param secondValue an int specifying the second value of the posting - * @return an encoded long represent the posting - */ - private static long encodePosting(int docID, int secondValue) { - return ((LONG_MASK & docID) << 32) | (LONG_MASK & secondValue); - } - - /** - * Decode doc ID from the given posting. - * @param posting a given posting encoded with {@link #encodePosting(int, int)} - * @return the doc ID of the given posting. - */ - static int getDocID(long posting) { - return (int) (posting >> 32); - } - - /** - * Decode the second value from the given posting. - * @param posting a given posting encoded with {@link #encodePosting(int, int)} - * @return the second value of the given posting. - */ - static int getSecondValue(long posting) { - return (int) posting; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/QueryCostTracker.java b/src/java/com/twitter/search/core/earlybird/index/inverted/QueryCostTracker.java deleted file mode 100644 index 5918cd93b..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/QueryCostTracker.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import org.apache.lucene.util.CloseableThreadLocal; - -import com.twitter.search.common.search.QueryCostProvider; - -public class QueryCostTracker implements QueryCostProvider { - public static enum CostType { - // For the realtime segment we track how many posting list blocks - // are accessed during the lifetime of one query. 
- LOAD_REALTIME_POSTING_BLOCK(1), - - // Number of optimized posting list blocks - LOAD_OPTIMIZED_POSTING_BLOCK(1); - - private final double cost; - - private CostType(double cost) { - this.cost = cost; - } - } - - private static final CloseableThreadLocal TRACKERS - = new CloseableThreadLocal() { - @Override protected QueryCostTracker initialValue() { - return new QueryCostTracker(); - } - }; - - public static QueryCostTracker getTracker() { - return TRACKERS.get(); - } - - private double totalCost; - - public void track(CostType costType) { - totalCost += costType.cost; - } - - public void reset() { - totalCost = 0; - } - - @Override - public double getTotalCost() { - return totalCost; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/RealtimeIndexTerms.java b/src/java/com/twitter/search/core/earlybird/index/inverted/RealtimeIndexTerms.java deleted file mode 100644 index 7f6c60b97..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/RealtimeIndexTerms.java +++ /dev/null @@ -1,365 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.util.Iterator; -import java.util.TreeSet; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.BaseTermsEnum; -import org.apache.lucene.index.ImpactsEnum; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.SlowImpactsEnum; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.hashtable.HashTable; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.hash.KeysSource; - -public class RealtimeIndexTerms extends Terms { - // Calling InMemoryTermsEnum.next() creates a full copy of the entire term dictionary, and can - // be quite expensive. We don't expect these calls to happen, and they shpould not happen on the - // regular read path. We stat them here just in case to see if there is any unexpected usage. - private static final SearchCounter TERMS_ENUM_NEXT_CALLS = - SearchCounter.export("in_memory_terms_enum_next_calls"); - private static final SearchCounter TERMS_ENUM_CREATE_TERM_SET = - SearchCounter.export("in_memory_terms_enum_next_create_term_set"); - private static final SearchCounter TERMS_ENUM_CREATE_TERM_SET_SIZE = - SearchCounter.export("in_memory_terms_enum_next_create_term_set_size"); - - private final InvertedRealtimeIndex index; - private final int maxPublishedPointer; - - public RealtimeIndexTerms(InvertedRealtimeIndex index, int maxPublishedPointer) { - this.index = index; - this.maxPublishedPointer = maxPublishedPointer; - } - - @Override - public long size() { - return index.getNumTerms(); - } - - @Override - public TermsEnum iterator() { - return index.createTermsEnum(maxPublishedPointer); - } - - /** - * This TermsEnum use a tree set to support {@link TermsEnum#next()} method. However, this is not - * efficient enough to support realtime operation. {@link TermsEnum#seekCeil} is not fully - * supported in this termEnum. 
- */ - public static class InMemoryTermsEnum extends BaseTermsEnum { - private final InvertedRealtimeIndex index; - private final int maxPublishedPointer; - private int termID = -1; - private BytesRef bytesRef = new BytesRef(); - private Iterator termIter; - private TreeSet termSet; - - public InMemoryTermsEnum(InvertedRealtimeIndex index, int maxPublishedPointer) { - this.index = index; - this.maxPublishedPointer = maxPublishedPointer; - termIter = null; - } - - @Override - public int docFreq() { - return index.getDF(termID); - } - - @Override - public PostingsEnum postings(PostingsEnum reuse, int flags) { - int postingsPointer = index.getPostingListPointer(termID); - return index.getPostingList().postings(postingsPointer, docFreq(), maxPublishedPointer); - } - - @Override - public ImpactsEnum impacts(int flags) { - return new SlowImpactsEnum(postings(null, flags)); - } - - @Override - public SeekStatus seekCeil(BytesRef text) { - // Nullify termIter. - termIter = null; - - termID = index.lookupTerm(text); - - if (termID == -1) { - return SeekStatus.END; - } else { - index.getTerm(termID, bytesRef); - return SeekStatus.FOUND; - } - } - - @Override - public BytesRef next() { - TERMS_ENUM_NEXT_CALLS.increment(); - if (termSet == null) { - termSet = new TreeSet<>(); - KeysSource keysource = index.getKeysSource(); - keysource.rewind(); - int numTerms = keysource.getNumberOfKeys(); - for (int i = 0; i < numTerms; ++i) { - BytesRef ref = keysource.nextKey(); - // we need to clone the ref since the keysource is reusing the returned BytesRef - // instance and we are storing it - termSet.add(ref.clone()); - } - TERMS_ENUM_CREATE_TERM_SET.increment(); - TERMS_ENUM_CREATE_TERM_SET_SIZE.add(numTerms); - } - - // Construct termIter from the subset. - if (termIter == null) { - termIter = termSet.tailSet(bytesRef, true).iterator(); - } - - if (termIter.hasNext()) { - bytesRef = termIter.next(); - termID = index.lookupTerm(bytesRef); - } else { - termID = -1; - bytesRef = null; - } - return bytesRef; - } - - @Override - public long ord() { - return termID; - } - - @Override - public void seekExact(long ord) { - // Nullify termIter. - termIter = null; - - if (ord < index.getNumTerms()) { - termID = (int) ord; - index.getTerm(termID, bytesRef); - } - } - - @Override - public BytesRef term() { - return bytesRef; - } - - @Override - public long totalTermFreq() { - return docFreq(); - } - } - - /** - * This TermsEnum use a {@link SkipListContainer} backed termsSkipList provided by - * {@link InvertedRealtimeIndex} to supported ordered terms operations like - * {@link TermsEnum#next()} and {@link TermsEnum#seekCeil}. - */ - public static class SkipListInMemoryTermsEnum extends BaseTermsEnum { - private final InvertedRealtimeIndex index; - - private int termID = -1; - private BytesRef bytesRef = new BytesRef(); - private int nextTermIDPointer; - - /** - * {@link #nextTermIDPointer} is used to record pointer to next termsID to accelerate - * {@link #next}. However, {@link #seekCeil} and {@link #seekExact} may jump to an arbitrary - * term so the {@link #nextTermIDPointer} may not be correct, and this flag is used to check if - * this happens. If this flag is false, {@link #correctNextTermIDPointer} should be called to - * correct the value. 
- */ - private boolean isNextTermIDPointerCorrect; - - private final SkipListContainer termsSkipList; - private final InvertedRealtimeIndex.TermsSkipListComparator termsSkipListComparator; - private final int maxPublishedPointer; - - /** - * Creates a new {@link TermsEnum} for a skip list-based sorted real-time term dictionary. - */ - public SkipListInMemoryTermsEnum(InvertedRealtimeIndex index, int maxPublishedPointer) { - Preconditions.checkNotNull(index.getTermsSkipList()); - - this.index = index; - this.termsSkipList = index.getTermsSkipList(); - - // Each Terms Enum shall have their own comparators to be thread safe. - this.termsSkipListComparator = - new InvertedRealtimeIndex.TermsSkipListComparator(index); - this.nextTermIDPointer = - termsSkipList.getNextPointer(SkipListContainer.FIRST_LIST_HEAD); - this.isNextTermIDPointerCorrect = true; - this.maxPublishedPointer = maxPublishedPointer; - } - - @Override - public int docFreq() { - return index.getDF(termID); - } - - @Override - public PostingsEnum postings(PostingsEnum reuse, int flags) { - int postingsPointer = index.getPostingListPointer(termID); - return index.getPostingList().postings(postingsPointer, docFreq(), maxPublishedPointer); - } - - @Override - public ImpactsEnum impacts(int flags) { - return new SlowImpactsEnum(postings(null, flags)); - } - - @Override - public SeekStatus seekCeil(BytesRef text) { - // Next term pointer is not correct anymore since seek ceil - // will jump to an arbitrary term. - isNextTermIDPointerCorrect = false; - - // Doing precise lookup first. - termID = index.lookupTerm(text); - - // Doing ceil lookup if not found, otherwise we are good. - if (termID == -1) { - return seekCeilWithSkipList(text); - } else { - index.getTerm(termID, bytesRef); - return SeekStatus.FOUND; - } - } - - /** - * Doing ceil terms search with terms skip list. - */ - private SeekStatus seekCeilWithSkipList(BytesRef text) { - int termIDPointer = termsSkipList.searchCeil(text, - SkipListContainer.FIRST_LIST_HEAD, - termsSkipListComparator, - null); - - // End reached but still cannot found a ceil term. - if (termIDPointer == SkipListContainer.FIRST_LIST_HEAD) { - termID = HashTable.EMPTY_SLOT; - return SeekStatus.END; - } - - termID = termsSkipList.getValue(termIDPointer); - - // Set next termID pointer and is correct flag. - nextTermIDPointer = termsSkipList.getNextPointer(termIDPointer); - isNextTermIDPointerCorrect = true; - - // Found a ceil term but not the precise match. - index.getTerm(termID, bytesRef); - return SeekStatus.NOT_FOUND; - } - - /** - * {@link #nextTermIDPointer} is used to record the pointer to next termID. This method is used - * to correct {@link #nextTermIDPointer} to correct value after {@link #seekCeil} or - * {@link #seekExact} dropped current term to arbitrary point. - */ - private void correctNextTermIDPointer() { - final int curTermIDPointer = termsSkipList.search( - bytesRef, - SkipListContainer.FIRST_LIST_HEAD, - termsSkipListComparator, - null); - // Must be able to find the exact term. - assert termID == HashTable.EMPTY_SLOT - || termID == termsSkipList.getValue(curTermIDPointer); - - nextTermIDPointer = termsSkipList.getNextPointer(curTermIDPointer); - isNextTermIDPointerCorrect = true; - } - - @Override - public BytesRef next() { - // Correct nextTermIDPointer first if not correct due to seekExact or seekCeil. - if (!isNextTermIDPointerCorrect) { - correctNextTermIDPointer(); - } - - // Skip list is exhausted. 
- if (nextTermIDPointer == SkipListContainer.FIRST_LIST_HEAD) { - termID = HashTable.EMPTY_SLOT; - return null; - } - - termID = termsSkipList.getValue(nextTermIDPointer); - - index.getTerm(termID, bytesRef); - - // Set next termID Pointer. - nextTermIDPointer = termsSkipList.getNextPointer(nextTermIDPointer); - return bytesRef; - } - - @Override - public long ord() { - return termID; - } - - @Override - public void seekExact(long ord) { - if (ord < index.getNumTerms()) { - termID = (int) ord; - index.getTerm(termID, bytesRef); - - // Next term pointer is not correct anymore since seek exact - // just jump to an arbitrary term. - isNextTermIDPointerCorrect = false; - } - } - - @Override - public BytesRef term() { - return bytesRef; - } - - @Override - public long totalTermFreq() { - return docFreq(); - } - } - - @Override - public long getSumTotalTermFreq() { - return index.getSumTotalTermFreq(); - } - - @Override - public long getSumDocFreq() { - return index.getSumTermDocFreq(); - } - - @Override - public int getDocCount() { - return index.getNumDocs(); - } - - @Override - public boolean hasFreqs() { - return true; - } - - @Override - public boolean hasOffsets() { - return false; - } - - @Override - public boolean hasPositions() { - return true; - } - - @Override - public boolean hasPayloads() { - return true; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListComparator.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListComparator.java deleted file mode 100644 index 3c23de5d1..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListComparator.java +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * Comparator interface for {@link SkipListContainer}, - * see sample implementation {@link SkipListIntegerComparator}. - * - * Notice: less/equal/greater here refer to the order precedence, instead of numerical value. - */ -public interface SkipListComparator { - - /** - * Determine the order between the given key and the key of the given targetValue. - * Notice, usually key of a value could be derived from the value along. - * - * Implementation of this method should consider sentinel value, see {@link #getSentinelValue()}. - * - * Can include position data (primarily for text posting lists). Position should be ignored if - * the skip list was constructed without positions enabled. - * - * @return negative, zero, or positive to indicate if first value is - * less than, equal to, or greater than the second value, respectively. - */ - int compareKeyWithValue(K key, int targetValue, int targetPosition); - - /** - * Determine the order of two given values based on their keys. - * Notice, usually key of a value could be derived from the value along. - * - * Implementation of this method should consider sentinel value, see {@link #getSentinelValue()}. - * - * @return negative, zero, or positive to indicate if first value is - * less than, equal to, or greater than the second value, respectively. - */ - int compareValues(int v1, int v2); - - /** - * Return a sentinel value, sentinel value should be considered by this comparator - * as an ADVISORY GREATEST value, which should NOT be actually inserted into the skip list. - * - * @return the sentinel value. 
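   * For example, {@link SkipListIntegerComparator} uses {@code Integer.MAX_VALUE} as its
   * sentinel, and the doc ID comparator in SkipListPostingList uses
   * {@code DocIdSetIterator.NO_MORE_DOCS}.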
- */ - int getSentinelValue(); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListContainer.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListContainer.java deleted file mode 100644 index da4d1d001..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListContainer.java +++ /dev/null @@ -1,739 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Random; - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -import static com.twitter.search.core.earlybird.index.inverted.PayloadUtil.EMPTY_PAYLOAD; - -/** - * This is a skip list container implementation backed by {@link IntBlockPool}. - * - * Skip list is a data structure similar to linked list, but with a hierarchy of lists - * each skipping over fewer elements, and the bottom hierarchy does NOT skip any elements. - * @see Skip List Wikipedia - * - * This implementation is lock free and thread safe with ONE writer thread and MULTIPLE reader - * threads. - * - * This implementation could contain one or more skip lists, and they are all backed by - * the same {@link IntBlockPool}. - * - * Values are actually stored as integers; however search key is implemented as a generic type. - * Inserts of values that already exist are stored as subsequent elements. This is used to support - * positions and term frequency. - * - * Also reserve the integer after value to store next ordinal pointer information. We avoid storing - * pointers to the next element in the tower by allocating them contiguously. To descend the tower, - * we just increment the pointer. - * - * This skip list can also store positions as integers. It allocates them before it allocates the - * value (the value is a doc ID if we are using positions). This means that we can access the - * position by simply decrementing the value pointer. - * - * To understand how the skip list works, first understand how insert works, then the rest will be - * more comprehendable. - * - * A skip list will be implemented in a circle linked way: - * - the list head node will have the sentinel value, which is the advisory greatest value - * provided by comparator. - * - Real first value will be pointed by the list head node. - * - Real last value will point to the list head. - * - * Constraints: - * - Does NOT support negative value. - * - * Simple Viz: - * - * Empty list with max tower height 5. S = Sentinel value, I = Initial value. - * | s| 0| 0| 0| 0| 0| i| i| i| i| i| i| i| i| i| i| - * - * One possible situation after inserting 4, 6, 5. - * | s| 6| 6| 9| 0| 0| 4|13|13| 6| 0| 0| 0| 5| 9| 9| - */ -public class SkipListContainer implements Flushable { - /** - * The list head of first skip list in the container, this is for convenient usage, - * so application use only one skip list does not need to keep track of the list head. - */ - static final int FIRST_LIST_HEAD = 0; - - /** - * Initial value used when initialize int block pool. Notice -1 is not used here in order to give - * application more freedom because -1 is a special value when doing bit manipulations. 
- */ - static final int INITIAL_VALUE = -2; - - /** - * Maximum tower height of this skip list and chance to grow tower by level. - * - * Notice these two values could affect the memory usage and the performance. - * Ideally they should be calculated based on the potential size of the skip list. - * - * Given n is the number of elements in the skip list, the memory usage is in O(n). - * - * More precisely, - * - * the memory is mainly used for the following data: - * - * header_tower = O(maxTowerHeight + 1) - * value = O(n) - * next_pointers = O(n * (1 - growTowerChance^(maxTowerHeight + 1)) / (1 - growTowerChance)) - * - * thus, the total memory usage is in O(header_tower + value + next_pointers). - * - * Default value for maximum tower height and grow tower chance, these two numbers are chosen - * arbitrarily now. - */ - @VisibleForTesting - public static final int MAX_TOWER_HEIGHT = 10; - private static final float GROW_TOWER_CHANCE = 0.2f; - - public enum HasPositions { - YES, - NO - } - - public enum HasPayloads { - YES, - NO - } - - static final int INVALID_POSITION = -3; - - /** Memory barrier. */ - private volatile int maxPoolPointer; - - /** Actual storage data structure. */ - private final IntBlockPool blockPool; - - /** - * Default comparator used to determine the order between two given values or between one key and - * another value. - * - * Notice this comparator is shared by all threads using this skip list, so it is not thread safe - * if it is maintaining some states. However, {@link #search}, {@link #insert}, and - * {@link #searchCeil} support passed in comparator as a parameter, which should be thread safe if - * managed by the caller properly. - */ - private final SkipListComparator defaultComparator; - - /** Random generator used to decide if to grow tower by one level or not. */ - private final Random random = new Random(); - - /** - * Used by writer thread to record last pointers at each level. Notice it is ok to have it as an - * instance field because we would only have one writer thread. - */ - private final int[] lastPointers; - - /** - * Whether the skip list contains positions. Used for text fields. - */ - private final HasPositions hasPositions; - - private final HasPayloads hasPayloads; - - /** - * Creates a new probabilistic skip list, using the provided comparator to compare keys - * of type K. - * - * @param comparator a comparator used to compare integer values. - */ - public SkipListContainer( - SkipListComparator comparator, - HasPositions hasPositions, - HasPayloads hasPayloads, - String name - ) { - this(comparator, new IntBlockPool(INITIAL_VALUE, name), hasPositions, hasPayloads); - } - - /** - * Base constructor, also used by flush handler. - */ - private SkipListContainer( - SkipListComparator comparator, - IntBlockPool blockPool, - HasPositions hasPositions, - HasPayloads hasPayloads) { - // Sentinel value specified by the comparator cannot equal to INITIAL_VALUE. - Preconditions.checkArgument(comparator.getSentinelValue() != INITIAL_VALUE); - - this.defaultComparator = comparator; - this.lastPointers = new int[MAX_TOWER_HEIGHT]; - this.blockPool = blockPool; - this.hasPositions = hasPositions; - this.hasPayloads = hasPayloads; - } - - /** - * Search for the index of the greatest value which has key less than or equal to the given key. - * - * This is more like a floor search function. See {@link #searchCeil} for ceil search. - * - * @param key target key will be searched. 
- * @param skipListHead index of the header tower of the skip list will be searched. - * @param comparator comparator used for comparison when traversing through the skip list. - * @param searchFinger {@link SkipListSearchFinger} to accelerate search speed, - * notice the search finger must be before the key. - * @return the index of the greatest value which is less than or equal to given value, - * will return skipListHead if given value has no greater or equal values. - */ - public int search( - K key, - int skipListHead, - SkipListComparator comparator, - @Nullable SkipListSearchFinger searchFinger) { - assert comparator != null; - // Start at the header tower. - int currentPointer = skipListHead; - - // Instantiate nextPointer and nextValue outside of the for loop so we can use the value - // directly after for loop. - int nextPointer = getForwardPointer(currentPointer, MAX_TOWER_HEIGHT - 1); - int nextValue = getValue(nextPointer); - - // Top down traversal. - for (int currentLevel = MAX_TOWER_HEIGHT - 1; currentLevel >= 0; currentLevel--) { - nextPointer = getForwardPointer(currentPointer, currentLevel); - nextValue = getValue(nextPointer); - - // Jump to search finger at current level. - if (searchFinger != null) { - final int fingerPointer = searchFinger.getPointer(currentLevel); - assert searchFinger.isInitialPointer(fingerPointer) - || comparator.compareKeyWithValue(key, getValue(fingerPointer), INVALID_POSITION) >= 0; - - if (!searchFinger.isInitialPointer(fingerPointer) - && comparator.compareValues(getValue(fingerPointer), nextValue) >= 0) { - currentPointer = fingerPointer; - nextPointer = getForwardPointer(currentPointer, currentLevel); - nextValue = getValue(nextPointer); - } - } - - // Move forward. - while (comparator.compareKeyWithValue(key, nextValue, INVALID_POSITION) > 0) { - currentPointer = nextPointer; - - nextPointer = getForwardPointer(currentPointer, currentLevel); - nextValue = getValue(nextPointer); - } - - // Advance search finger. - if (searchFinger != null && currentPointer != skipListHead) { - final int currentValue = getValue(currentPointer); - final int fingerPointer = searchFinger.getPointer(currentLevel); - - if (searchFinger.isInitialPointer(fingerPointer) - || comparator.compareValues(currentValue, getValue(fingerPointer)) > 0) { - searchFinger.setPointer(currentLevel, currentPointer); - } - } - } - - // Return next pointer if next value matches searched value; otherwise return currentPointer. - return comparator.compareKeyWithValue(key, nextValue, INVALID_POSITION) == 0 - ? nextPointer : currentPointer; - } - - /** - * Perform search with {@link #defaultComparator}. - * Notice {@link #defaultComparator} is not thread safe if it is keeping some states. - */ - public int search(K key, int skipListHead, @Nullable SkipListSearchFinger searchFinger) { - return search(key, skipListHead, this.defaultComparator, searchFinger); - } - - /** - * Ceil search on given {@param key}. - * - * @param key target key will be searched. - * @param skipListHead index of the header tower of the skip list will be searched. - * @param comparator comparator used for comparison when traversing through the skip list. - * @param searchFinger {@link SkipListSearchFinger} to accelerate search speed. - * @return index of the smallest value with key greater or equal to the given key. - */ - public int searchCeil( - K key, - int skipListHead, - SkipListComparator comparator, - @Nullable SkipListSearchFinger searchFinger) { - assert comparator != null; - - // Perform regular search. 
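    // For example (hypothetical contents): if the list holds the values 4, 6 and 9 and the key
    // is 5, search() returns the pointer to 4 (the floor) and the ceil logic below advances it
    // to 6.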
- final int foundPointer = search(key, skipListHead, comparator, searchFinger); - - // Return foundPointer if it is not the list head and the pointed value has key equal to the - // given key; otherwise, return next pointer. - if (foundPointer != skipListHead - && comparator.compareKeyWithValue(key, getValue(foundPointer), INVALID_POSITION) == 0) { - return foundPointer; - } else { - return getNextPointer(foundPointer); - } - } - - /** - * Perform searchCeil with {@link #defaultComparator}. - * Notice {@link #defaultComparator} is not thread safe if it is keeping some states. - */ - public int searchCeil( - K key, int skipListHead, @Nullable SkipListSearchFinger searchFinger) { - return searchCeil(key, skipListHead, this.defaultComparator, searchFinger); - } - - /** - * Insert a new value into the skip list. - * - * Notice inserting supports duplicate keys and duplicate values. - * - * Duplicate keys with different values or positions will be inserted consecutively. - * Duplciate keys with identical values will be ignored, and the duplicate will not be stored in - * the posting list. - * - * @param key is the key of the given value. - * @param value is the value will be inserted, cannot be {@link #getSentinelValue()}. - * @param skipListHead index of the header tower of the skip list will accept the new value. - * @param comparator comparator used for comparison when traversing through the skip list. - * @return whether this value exists in the posting list. Note that this will return true even - * if it is a new position. - */ - public boolean insert(K key, int value, int position, int[] payload, int skipListHead, - SkipListComparator comparator) { - Preconditions.checkArgument(comparator != null); - Preconditions.checkArgument(value != getSentinelValue()); - - // Start at the header tower. - int currentPointer = skipListHead; - - // Initialize lastPointers. - for (int i = 0; i < MAX_TOWER_HEIGHT; i++) { - this.lastPointers[i] = INITIAL_VALUE; - } - int nextPointer = INITIAL_VALUE; - - // Top down traversal. - for (int currentLevel = MAX_TOWER_HEIGHT - 1; currentLevel >= 0; currentLevel--) { - nextPointer = getForwardPointer(currentPointer, currentLevel); - int nextValue = getValue(nextPointer); - - int nextPosition = getPosition(nextPointer); - while (comparator.compareKeyWithValue(key, nextValue, nextPosition) > 0) { - currentPointer = nextPointer; - - nextPointer = getForwardPointer(currentPointer, currentLevel); - nextValue = getValue(nextPointer); - nextPosition = getPosition(nextPointer); - } - - // Store last pointers. - lastPointers[currentLevel] = currentPointer; - } - - // we use isDuplicateValue to determine if a value already exists in a posting list (even if it - // is a new position). We need to check both current pointer and next pointer in case this is - // the largest position we have seen for this value in this skip list. In that case, nextPointer - // will point to a larger value, but we want to check the smaller one to see if it is the same - // value. For example, if we have [(1, 2), (2, 4)] and we want to insert (1, 3), then - // nextPointer will point to (2, 4), but we want to check the doc ID of (1, 2) to see if it has - // the same document ID. 
- boolean isDuplicateValue = getValue(currentPointer) == value || getValue(nextPointer) == value; - - if (comparator.compareKeyWithValue(key, getValue(nextPointer), getPosition(nextPointer)) != 0) { - if (hasPayloads == HasPayloads.YES) { - Preconditions.checkNotNull(payload); - // If this skip list has payloads, we store the payload immediately before the document ID - // and position (iff the position exists) in the block pool. We store payloads before - // positions because they are variable length, and reading past them would require knowing - // the size of the payload. We don't store payloads after the doc ID because we have a - // variable number of pointers after the doc ID, and we would have no idea where the - // pointers stop and the payload starts. - for (int n : payload) { - this.blockPool.add(n); - } - } - - if (hasPositions == HasPositions.YES) { - // If this skip list has positions, we store the position before the document ID in the - // block pool. - this.blockPool.add(position); - } - - // Insert value. - final int insertedPointer = this.blockPool.add(value); - - // Insert outgoing pointers. - final int height = getRandomTowerHeight(); - for (int currentLevel = 0; currentLevel < height; currentLevel++) { - this.blockPool.add(getForwardPointer(lastPointers[currentLevel], currentLevel)); - } - - this.sync(); - - // Update incoming pointers. - for (int currentLevel = 0; currentLevel < height; currentLevel++) { - setForwardPointer(lastPointers[currentLevel], currentLevel, insertedPointer); - } - - this.sync(); - } - - return isDuplicateValue; - } - - /** - * Delete a given key from skip list - * - * @param key the key of the given value - * @param skipListHead index of the header tower of the skip list will accept the new value - * @param comparator comparator used for comparison when traversing through the skip list - * @return smallest value in the container. Returns {@link #INITIAL_VALUE} if the - * key does not exist. - */ - public int delete(K key, int skipListHead, SkipListComparator comparator) { - boolean foundKey = false; - - for (int currentLevel = MAX_TOWER_HEIGHT - 1; currentLevel >= 0; currentLevel--) { - int currentPointer = skipListHead; - int nextValue = getValue(getForwardPointer(currentPointer, currentLevel)); - - // First we skip over all the nodes that are smaller than our key. - while (comparator.compareKeyWithValue(key, nextValue, INVALID_POSITION) > 0) { - currentPointer = getForwardPointer(currentPointer, currentLevel); - nextValue = getValue(getForwardPointer(currentPointer, currentLevel)); - } - - Preconditions.checkState(currentPointer != INITIAL_VALUE); - - // If we don't find the node at this level that's OK, keep searching on a lower one. - if (comparator.compareKeyWithValue(key, nextValue, INVALID_POSITION) != 0) { - continue; - } - - // We found an element to delete. - foundKey = true; - - // Otherwise, save the current pointer. Right now, current pointer points to the first element - // that has the same value as key. - int savedPointer = currentPointer; - - currentPointer = getForwardPointer(currentPointer, currentLevel); - // Then, walk over every element that is equal to the key. - while (comparator.compareKeyWithValue(key, getValue(currentPointer), INVALID_POSITION) == 0) { - currentPointer = getForwardPointer(currentPointer, currentLevel); - } - - // update the saved pointer to point to the first non-equal element of the skip list. 
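      // For example (hypothetical level contents): given 3 -> 5 -> 5 -> 8 at this level and
      // key 5, savedPointer stays at the node holding 3 and currentPointer ends at the node
      // holding 8, so 3 is re-linked to 8 and both 5s drop out of this level.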
- setForwardPointer(savedPointer, currentLevel, currentPointer); - } - - // Something has changed, need to sync up here. - if (foundKey) { - this.sync(); - // return smallest value, might be used as first postings later - return getSmallestValue(skipListHead); - } - - return INITIAL_VALUE; - } - - /** - * Perform insert with {@link #defaultComparator}. - * Notice {@link #defaultComparator} is not thread safe if it is keeping some states. - */ - public boolean insert(K key, int value, int skipListHead) { - return insert(key, value, INVALID_POSITION, EMPTY_PAYLOAD, skipListHead, - this.defaultComparator); - } - - public boolean insert(K key, int value, int position, int[] payload, int skipListHead) { - return insert(key, value, position, payload, skipListHead, this.defaultComparator); - } - - /** - * Perform delete with {@link #defaultComparator}. - * Notice {@link #defaultComparator} is not thread safe if it is keeping some states. - */ - public int delete(K key, int skipListHead) { - return delete(key, skipListHead, this.defaultComparator); - } - - /** - * Get the pointer of next value pointed by the given pointer. - * - * @param pointer reference to the current value. - * @return pointer of next value. - */ - public int getNextPointer(int pointer) { - return getForwardPointer(pointer, 0); - } - - /** - * Get the value pointed by a pointer, this is a dereference process. - * - * @param pointer is an array index on this.blockPool. - * @return value pointed pointed by the pointer. - */ - public int getValue(int pointer) { - int value = blockPool.get(pointer); - - // Visibility race - if (value == INITIAL_VALUE) { - // Volatile read to cross the memory barrier again. - final boolean isSafe = isPointerSafe(pointer); - assert isSafe; - - // Re-read the pointer again - value = blockPool.get(pointer); - } - - return value; - } - - public int getSmallestValue(int skipListHeader) { - return getValue(getForwardPointer(skipListHeader, 0)); - } - - /** - * Builder of a forward search finger with header tower index. - * - * @return a new {@link SkipListSearchFinger} object. - */ - public SkipListSearchFinger buildSearchFinger() { - return new SkipListSearchFinger(MAX_TOWER_HEIGHT); - } - - /** - * Added another skip list into the int pool. - * - * @return index of the header tower of the newly created skip list. - */ - public int newSkipList() { - // Virtual value of header. - final int sentinelValue = getSentinelValue(); - if (hasPositions == HasPositions.YES) { - this.blockPool.add(INVALID_POSITION); - } - final int skipListHead = this.blockPool.add(sentinelValue); - - // Build header tower, initially point all the pointers to - // itself since no value has been inserted. - for (int i = 0; i < MAX_TOWER_HEIGHT; i++) { - this.blockPool.add(skipListHead); - } - - this.sync(); - - return skipListHead; - } - - /** - * Check if the block pool has been initiated by {@link #newSkipList}. - */ - public boolean isEmpty() { - return this.blockPool.length() == 0; - } - - /** - * Write to the volatile variable to cross memory barrier. maxPoolPointer is the memory barrier - * for new appends. - */ - private void sync() { - this.maxPoolPointer = this.blockPool.length(); - } - - /** - * Read from volatile variable to cross memory barrier. - * - * @param pointer is an block pool index. - * @return boolean indicate if given pointer is within the range of max pool pointer. 
- */ - private boolean isPointerSafe(int pointer) { - return pointer <= this.maxPoolPointer; - } - - /** - * Get the position associated with the doc ID pointed to by pointer. - * @param pointer aka doc ID pointer. - * @return The value of the position for that doc ID. Returns INVALID_POSITION if the skip list - * does not have positions, or if there is no position for that pointer. - */ - public int getPosition(int pointer) { - if (hasPositions == HasPositions.NO) { - return INVALID_POSITION; - } - // if this skip list has positions, the position will always be inserted into the block pool - // immediately before the doc ID. - return getValue(pointer - 1); - } - - /** - * Get the payload pointer from a normal pointer (e.g. one returned from the {@link this#search} - * method). - */ - public int getPayloadPointer(int pointer) { - Preconditions.checkState(hasPayloads == HasPayloads.YES, - "getPayloadPointer() should only be called on a skip list that supports payloads."); - - // if this skip list has payloads, the payload will always be inserted into the block pool - // before the doc ID, and before the position if there is a position. - int positionOffset = hasPositions == HasPositions.YES ? 1 : 0; - - return pointer - 1 - positionOffset; - } - - - int getPoolSize() { - return this.blockPool.length(); - } - - - IntBlockPool getBlockPool() { - return blockPool; - } - - public HasPayloads getHasPayloads() { - return hasPayloads; - } - - /****************** - * Helper Methods * - ******************/ - - /** - * Get the next forward pointer on a given level. - * - * @param pointer is an array index on this.blockPool, might be SENTINEL_VALUE. - * @param level indicates the level of the forward pointer will be acquired. It is zero indexed. - * @return next forward pointer on the given level, might be SENTINEL_VALUE. - */ - private int getForwardPointer(int pointer, int level) { - final int pointerIndex = pointer + level + 1; - - int forwardPointer = blockPool.get(pointerIndex); - - // Visibility race - if (forwardPointer == INITIAL_VALUE) { - // Volatile read to cross the memory barrier again. - final boolean isSafe = isPointerSafe(pointerIndex); - assert isSafe; - - // Re-read the pointer again - forwardPointer = blockPool.get(pointerIndex); - } - - return forwardPointer; - } - - /** - * Set the next forward pointer on a given level. - * - * @param pointer points to the value, of which the pointer value will be updated. - * @param level indicates the level of the forward pointer will be set. It is zero indexed. - * @param target the value fo the target pointer which will be set. - */ - private void setForwardPointer(int pointer, int level, int target) { - // Update header tower if given pointer points to headerTower. - setPointer(pointer + level + 1, target); - } - - /** - * Set the value pointed by pointer - * @param pointer point to the actual position in the pool - * @param target the value we are going to set - */ - private void setPointer(int pointer, int target) { - blockPool.set(pointer, target); - } - - /** - * Getter of the sentinel value used by this skip list. The sentinel value should be provided - * by the comparator. - * - * @return sentinel value used by this skip list. - */ - int getSentinelValue() { - return defaultComparator.getSentinelValue(); - } - - /** - * Return a height h in range [1, maxTowerHeight], each number with chance - * growTowerChance ^ (h - 1). - * - * @return a integer indicating height. 
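   * With the default GROW_TOWER_CHANCE of 0.2, roughly 80% of inserted values get a tower of
   * height 1, and the expected height is about 1 / (1 - 0.2) = 1.25.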
- */ - private int getRandomTowerHeight() { - int height = 1; - while (height < MAX_TOWER_HEIGHT && random.nextFloat() < GROW_TOWER_CHANCE) { - height++; - } - return height; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler<>(this); - } - - public static class FlushHandler extends Flushable.Handler> { - private final SkipListComparator comparator; - private static final String BLOCK_POOL_PROP_NAME = "blockPool"; - private static final String HAS_POSITIONS_PROP_NAME = "hasPositions"; - private static final String HAS_PAYLOADS_PROP_NAME = "hasPayloads"; - - public FlushHandler(SkipListContainer objectToFlush) { - super(objectToFlush); - this.comparator = objectToFlush.defaultComparator; - } - - public FlushHandler(SkipListComparator comparator) { - this.comparator = comparator; - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - long startTime = getClock().nowMillis(); - SkipListContainer objectToFlush = getObjectToFlush(); - flushInfo.addBooleanProperty(HAS_POSITIONS_PROP_NAME, - objectToFlush.hasPositions == HasPositions.YES); - flushInfo.addBooleanProperty(HAS_PAYLOADS_PROP_NAME, - objectToFlush.hasPayloads == HasPayloads.YES); - - objectToFlush.blockPool.getFlushHandler() - .flush(flushInfo.newSubProperties(BLOCK_POOL_PROP_NAME), out); - getFlushTimerStats().timerIncrement(getClock().nowMillis() - startTime); - } - - @Override - protected SkipListContainer doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - long startTime = getClock().nowMillis(); - IntBlockPool blockPool = (new IntBlockPool.FlushHandler()).load( - flushInfo.getSubProperties(BLOCK_POOL_PROP_NAME), in); - getLoadTimerStats().timerIncrement(getClock().nowMillis() - startTime); - - HasPositions hasPositions = flushInfo.getBooleanProperty(HAS_POSITIONS_PROP_NAME) - ? HasPositions.YES : HasPositions.NO; - HasPayloads hasPayloads = flushInfo.getBooleanProperty(HAS_PAYLOADS_PROP_NAME) - ? HasPayloads.YES : HasPayloads.NO; - - return new SkipListContainer<>( - this.comparator, - blockPool, - hasPositions, - hasPayloads); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListIntegerComparator.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListIntegerComparator.java deleted file mode 100644 index 6acc19542..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListIntegerComparator.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * Example implementation of {@link SkipListComparator} with Order-Theoretic Properties. - * - * Notice: - * Re-using key object is highly suggested! - * Normally the generic type should be a mutable object so it can be reused by the reader/writer. 
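 * SkipListPostingList.Key follows this pattern: a single instance is reused and updated through
 * withDocAndPosition before every insert and lookup.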
- */ -public class SkipListIntegerComparator implements SkipListComparator { - - @Override - public int compareKeyWithValue(Integer key, int targetValue, int targetPosition) { - return key - targetValue; - } - - @Override - public int compareValues(int v1, int v2) { - return v1 - v2; - } - - @Override - public int getSentinelValue() { - return Integer.MAX_VALUE; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingList.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingList.java deleted file mode 100644 index 498321beb..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingList.java +++ /dev/null @@ -1,232 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -import static com.twitter.search.core.earlybird.index.inverted.SkipListContainer.HasPayloads; -import static com.twitter.search.core.earlybird.index.inverted.SkipListContainer.HasPositions; -import static com.twitter.search.core.earlybird.index.inverted.SkipListContainer.INVALID_POSITION; -import static com.twitter.search.core.earlybird.index.inverted.TermsArray.INVALID; - -/** - * A skip list implementation of real time posting list. Supports out of order updates. - */ -public class SkipListPostingList implements Flushable { - /** Underlying skip list. */ - private final SkipListContainer skipListContainer; - - /** Key used when inserting into the skip list. */ - private final Key key = new Key(); - - public SkipListPostingList( - HasPositions hasPositions, - HasPayloads hasPayloads, - String field) { - this.skipListContainer = new SkipListContainer<>( - new DocIDComparator(), - hasPositions, - hasPayloads, - field); - } - - /** Used by {@link SkipListPostingList.FlushHandler} */ - private SkipListPostingList(SkipListContainer skipListContainer) { - this.skipListContainer = skipListContainer; - } - - /** - * Appends a posting to the posting list for a term. - */ - public void appendPosting( - int termID, - TermsArray termsArray, - int docID, - int position, - @Nullable BytesRef payload) { - termsArray.getLargestPostings()[termID] = Math.max( - termsArray.getLargestPostings()[termID], - docID); - - // Append to an existing skip list. - // Notice, header tower index is stored at the last postings pointer spot. - int postingsPointer = termsArray.getPostingsPointer(termID); - if (postingsPointer == INVALID) { - // Create a new skip list and add the first posting. - postingsPointer = skipListContainer.newSkipList(); - } - - boolean havePostingForThisDoc = insertPosting(docID, position, payload, postingsPointer); - - // If this is a new document ID, we need to update the document frequency for this term - if (!havePostingForThisDoc) { - termsArray.getDocumentFrequency()[termID]++; - } - - termsArray.updatePostingsPointer(termID, postingsPointer); - } - - /** - * Deletes the given doc ID from the posting list for the term. 
- */ - public void deletePosting(int termID, TermsArray postingsArray, int docID) { - int docFreq = postingsArray.getDocumentFrequency()[termID]; - if (docFreq == 0) { - return; - } - - int postingsPointer = postingsArray.getPostingsPointer(termID); - // skipListContainer is not empty, try to delete docId from it. - int smallestDoc = deletePosting(docID, postingsPointer); - if (smallestDoc == SkipListContainer.INITIAL_VALUE) { - // Key does not exist. - return; - } - - postingsArray.getDocumentFrequency()[termID]--; - } - - /** - * Insert posting into an existing skip list. - * - * @param docID docID of the this posting. - * @param skipListHead header tower index of the skip list - * in which the posting will be inserted. - * @return whether we have already inserted this document ID into this term list. - */ - private boolean insertPosting(int docID, int position, BytesRef termPayload, int skipListHead) { - int[] payload = PayloadUtil.encodePayload(termPayload); - return skipListContainer.insert(key.withDocAndPosition(docID, position), docID, position, - payload, skipListHead); - } - - private int deletePosting(int docID, int skipListHead) { - return skipListContainer.delete(key.withDocAndPosition(docID, INVALID_POSITION), skipListHead); - } - - /** Return a term docs enumerator with position flag on. */ - public PostingsEnum postings( - int postingPointer, - int docFreq, - int maxPublishedPointer) { - return new SkipListPostingsEnum( - postingPointer, docFreq, maxPublishedPointer, skipListContainer); - } - - /** - * Get the number of documents (AKA document frequency or DF) for the given term. - */ - public int getDF(int termID, TermsArray postingsArray) { - int[] documentFrequency = postingsArray.getDocumentFrequency(); - Preconditions.checkArgument(termID < documentFrequency.length); - - return documentFrequency[termID]; - } - - public int getDocIDFromPosting(int posting) { - // Posting is simply the whole doc ID. - return posting; - } - - public int getMaxPublishedPointer() { - return skipListContainer.getPoolSize(); - } - - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String SKIP_LIST_PROP_NAME = "skipList"; - - public FlushHandler(SkipListPostingList objectToFlush) { - super(objectToFlush); - } - - public FlushHandler() { - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - SkipListPostingList objectToFlush = getObjectToFlush(); - - objectToFlush.skipListContainer.getFlushHandler() - .flush(flushInfo.newSubProperties(SKIP_LIST_PROP_NAME), out); - } - - @Override - protected SkipListPostingList doLoad( - FlushInfo flushInfo, DataDeserializer in) throws IOException { - SkipListComparator comparator = new DocIDComparator(); - SkipListContainer.FlushHandler flushHandler = - new SkipListContainer.FlushHandler<>(comparator); - SkipListContainer skipList = - flushHandler.load(flushInfo.getSubProperties(SKIP_LIST_PROP_NAME), in); - return new SkipListPostingList(skipList); - } - } - - /** - * Key used to in {@link SkipListContainer} by {@link SkipListPostingList}. 
- */ - public static class Key { - private int docID; - private int position; - - public int getDocID() { - return docID; - } - - public int getPosition() { - return position; - } - - public Key withDocAndPosition(int withDocID, int withPosition) { - this.docID = withDocID; - this.position = withPosition; - return this; - } - } - - /** - * Comparator for docID and position. - */ - public static class DocIDComparator implements SkipListComparator { - private static final int SENTINEL_VALUE = DocIdSetIterator.NO_MORE_DOCS; - - @Override - public int compareKeyWithValue(Key key, int targetDocID, int targetPosition) { - // No key could represent sentinel value and sentinel value is the largest. - int docCompare = key.getDocID() - targetDocID; - if (docCompare == 0 && targetPosition != INVALID_POSITION) { - return key.getPosition() - targetPosition; - } else { - return docCompare; - } - } - - @Override - public int compareValues(int docID1, int docID2) { - // Sentinel value is the largest. - return docID1 - docID2; - } - - @Override - public int getSentinelValue() { - return SENTINEL_VALUE; - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingsEnum.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingsEnum.java deleted file mode 100644 index 908ae1c87..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListPostingsEnum.java +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentData; - -import static com.twitter.search.core.earlybird.index.inverted.SkipListContainer.INVALID_POSITION; - -/** - * TermDocs enumerator used by {@link SkipListPostingList}. - */ -public class SkipListPostingsEnum extends PostingsEnum { - /** Initialize cur doc ID and frequency. */ - private int curDoc = TermsArray.INVALID; - private int curFreq = 0; - - private final int postingPointer; - - private final int cost; - - /** - * maxPublishedPointer exists to prevent us from returning documents that are partially indexed. - * These pointers are safe to follow, but the documents should not be returned. See - * {@link EarlybirdRealtimeIndexSegmentData#getSyncData()} ()}. - */ - private final int maxPublishedPointer; - - /** Skip list info and search key */ - private final SkipListContainer skiplist; - private final SkipListPostingList.Key key = new SkipListPostingList.Key(); - - /** - * Pointer/posting/docID of next posting in the skip list. - * Notice the next here is relative to last posting with curDoc ID. - */ - private int nextPostingPointer; - private int nextPostingDocID; - - /** - * We save the positionPointer because we must walk the posting list to obtain term frequency - * before we can start iterating through document positions. To do that walk, we increment - * postingsPointer until it points to the first posting for the next doc, so postingsPointer is no - * longer what we want to use as the start of the position list. The position pointer starts out - * pointing to the first posting with that doc ID value. There can be duplicate doc ID values with - * different positions. To find subsequent positions, we simply walk the posting list using this - * pointer. 
- */ - private int positionPointer = -1; - - /** - * The payloadPointer should only be called after calling nextPosition, as it points to a payload - * for each position. It is not updated unless nextPosition is called. - */ - private int payloadPointer = -1; - - /** Search finger used in advance method. */ - private final SkipListSearchFinger advanceSearchFinger; - - /** - * A new {@link PostingsEnum} for a real-time skip list-based posting list. - */ - public SkipListPostingsEnum( - int postingPointer, - int docFreq, - int maxPublishedPointer, - SkipListContainer skiplist) { - this.postingPointer = postingPointer; - this.skiplist = skiplist; - this.advanceSearchFinger = this.skiplist.buildSearchFinger(); - this.maxPublishedPointer = maxPublishedPointer; - this.nextPostingPointer = postingPointer; - - // WARNING: - // docFreq is approximate and may not be the true document frequency of the posting list. - this.cost = docFreq; - - if (postingPointer != -1) { - // Because the posting pointer is not negative 1, we know it's valid. - readNextPosting(); - } - - advanceSearchFinger.reset(); - } - - @Override - public final int nextDoc() { - // Notice if skip list is exhausted nextPostingPointer will point back to postingPointer since - // skip list is circle linked. - if (nextPostingPointer == postingPointer) { - // Skip list is exhausted. - curDoc = NO_MORE_DOCS; - curFreq = 0; - } else { - // Skip list is not exhausted. - curDoc = nextPostingDocID; - curFreq = 1; - positionPointer = nextPostingPointer; - - // Keep reading all the posting with the same doc ID. - // Notice: - // - posting with the same doc ID will be stored consecutively - // since the skip list is sorted. - // - if skip list is exhausted, nextPostingPointer will become postingPointer - // since skip list is circle linked. - readNextPosting(); - while (nextPostingPointer != postingPointer && nextPostingDocID == curDoc) { - curFreq++; - readNextPosting(); - } - } - - // Returned updated curDoc. - return curDoc; - } - - /** - * Moves the enumerator forward by one element, then reads the information at that position. - * */ - private void readNextPosting() { - // Move search finger forward at lowest level. - advanceSearchFinger.setPointer(0, nextPostingPointer); - - // Read next posting pointer. - nextPostingPointer = skiplist.getNextPointer(nextPostingPointer); - - // Read the new posting positioned under nextPostingPointer into the nextPostingDocID. - readNextPostingInfo(); - } - - private boolean isPointerPublished(int pointer) { - return pointer <= maxPublishedPointer; - } - - /** Read next posting and doc id encoded in next posting. */ - private void readNextPostingInfo() { - // We need to skip over every pointer that has not been published to this Enum, otherwise the - // searcher will see unpublished documents. We also end termination if we reach - // nextPostingPointer == postingPointer, because that means we have reached the end of the - // skiplist. - while (!isPointerPublished(nextPostingPointer) && nextPostingPointer != postingPointer) { - // Move search finger forward at lowest level. - advanceSearchFinger.setPointer(0, nextPostingPointer); - - // Read next posting pointer. - nextPostingPointer = skiplist.getNextPointer(nextPostingPointer); - } - - // Notice if skip list is exhausted, nextPostingPointer will be postingPointer - // since skip list is circle linked. 
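    // In that case the pointer has wrapped back to the list head, whose stored value is the
    // comparator's sentinel (NO_MORE_DOCS for DocIDComparator), so NO_MORE_DOCS is reported here.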
- if (nextPostingPointer != postingPointer) { - nextPostingDocID = skiplist.getValue(nextPostingPointer); - } else { - nextPostingDocID = NO_MORE_DOCS; - } - } - - /** - * Jump to the target, then use {@link #nextDoc()} to collect nextDoc info. - * Notice target might be smaller than curDoc or smallestDocID. - */ - @Override - public final int advance(int target) { - if (target == NO_MORE_DOCS) { - // Exhaust the posting list, so that future calls to docID() always return NO_MORE_DOCS. - nextPostingPointer = postingPointer; - } - - if (nextPostingPointer == postingPointer) { - // Call nextDoc to ensure that all values are updated and we don't have to duplicate that - // here. - return nextDoc(); - } - - // Jump to target if target is bigger. - if (target >= curDoc && target >= nextPostingDocID) { - jumpToTarget(target); - } - - // Retrieve next doc. - return nextDoc(); - } - - /** - * Set the next posting pointer (and info) to the first posting - * with doc ID equal to or larger than the target. - * - * Notice this method does not set curDoc or curFreq. - */ - private void jumpToTarget(int target) { - // Do a ceil search. - nextPostingPointer = skiplist.searchCeil( - key.withDocAndPosition(target, INVALID_POSITION), postingPointer, advanceSearchFinger); - - // Read next posting information. - readNextPostingInfo(); - } - - @Override - public int nextPosition() { - // If doc ID is equal to no more docs than we are past the end of the posting list. If doc ID - // is invalid, then we have not called nextDoc yet, and we should not return a real position. - // If the position pointer is past the current doc ID, then we should not return a position - // until nextDoc is called again (we don't want to return positions for a different doc). - if (docID() == NO_MORE_DOCS - || docID() == TermsArray.INVALID - || skiplist.getValue(positionPointer) != docID()) { - return INVALID_POSITION; - } - payloadPointer = positionPointer; - int position = skiplist.getPosition(positionPointer); - do { - positionPointer = skiplist.getNextPointer(positionPointer); - } while (!isPointerPublished(positionPointer) && positionPointer != postingPointer); - return position; - } - - @Override - public BytesRef getPayload() { - if (skiplist.getHasPayloads() == SkipListContainer.HasPayloads.NO) { - return null; - } - - int pointer = skiplist.getPayloadPointer(this.payloadPointer); - Preconditions.checkState(pointer > 0); - return PayloadUtil.decodePayload(skiplist.getBlockPool(), pointer); - } - - @Override - public int startOffset() { - return -1; - } - - @Override - public int endOffset() { - return -1; - } - - @Override - public final int docID() { - return curDoc; - } - - @Override - public final int freq() { - return curFreq; - } - - @Override - public long cost() { - return cost; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListSearchFinger.java b/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListSearchFinger.java deleted file mode 100644 index 2ab52c0f2..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/SkipListSearchFinger.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * A forward search finger used, optionally, by {@link SkipListContainer#search}. - * - * A search finger is pointer to the result returned by last time a search method is performed. - * @see Finger search wikipedia. 
- * - * Using a search finger on a skip list could reduce the search search time from - * log(n) to log(k), where n is length of the skip list and k is the distance between last searched - * key and current searched key. - */ -public class SkipListSearchFinger { - // Pointer used when initialize the search finger. - public static final int INITIAL_POINTER = Integer.MIN_VALUE; - - private final int[] lastPointers; - - /** - * Creates a new search finger. - */ - public SkipListSearchFinger(int maxTowerHeight) { - lastPointers = new int[maxTowerHeight]; - - reset(); - } - - public void reset() { - for (int i = 0; i < lastPointers.length; i++) { - setPointer(i, INITIAL_POINTER); - } - } - - public int getPointer(int level) { - return lastPointers[level]; - } - - public void setPointer(int level, int pointer) { - lastPointers[level] = pointer; - } - - public boolean isInitialPointer(int pointer) { - return pointer == INITIAL_POINTER; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/TermDictionary.java b/src/java/com/twitter/search/core/earlybird/index/inverted/TermDictionary.java deleted file mode 100644 index 6a4360304..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/TermDictionary.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; - -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * A two-way mapping between terms and their interned value (termID). - * - * Implementation of this interface must guarantee that termIDs are dense, starting at 0; - * so they are good to be used as indices in arrays. - */ -public interface TermDictionary extends Flushable { - int TERM_NOT_FOUND = EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND; - - /** - * Returns the number of terms in this dictionary. - */ - int getNumTerms(); - - /** - * Create a TermsEnum object over this TermDictionary for a given index. - * @param index - */ - TermsEnum createTermsEnum(OptimizedMemoryIndex index); - - /** - * Lookup a term in this dictionary. - * @param term the term to lookup. - * @return the term id for this term, or TERM_NOT_FOUND - * @throws IOException - */ - int lookupTerm(BytesRef term) throws IOException; - - /** - * Get the term for given id and possibly its payload. - * @param termID the term that we want to get. - * @param text MUST be non-null. It will be filled with the term. - * @param termPayload if non-null, it will be filled with the payload if the term has any. - * @return Returns true, iff this term has a term payload. - */ - boolean getTerm(int termID, BytesRef text, BytesRef termPayload); -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/TermPointerEncoding.java b/src/java/com/twitter/search/core/earlybird/index/inverted/TermPointerEncoding.java deleted file mode 100644 index 22927f6b0..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/TermPointerEncoding.java +++ /dev/null @@ -1,38 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -/** - * Encodes and decodes term pointers. - */ -public abstract class TermPointerEncoding { - /** - * Returns the start of the text stored in a {@link BaseByteBlockPool} of the given term. 
- */ - public abstract int getTextStart(int termPointer); - - /** - * Returns true, if the given term stores a per-term payload. - */ - public abstract boolean hasPayload(int termPointer); - - /** - * Encodes and returns a pointer for a term stored at the given textStart in a - * {@link BaseByteBlockPool}. - */ - public abstract int encodeTermPointer(int textStart, boolean hasPayload); - - public static final TermPointerEncoding DEFAULT_ENCODING = new TermPointerEncoding() { - @Override public int getTextStart(int termPointer) { - return termPointer >>> 1; - } - - @Override public boolean hasPayload(int termPointer) { - return (termPointer & 1) != 0; - } - - @Override - public int encodeTermPointer(int textStart, boolean hasPayload) { - int code = textStart << 1; - return hasPayload ? (code | 1) : code; - } - }; -} diff --git a/src/java/com/twitter/search/core/earlybird/index/inverted/TermsArray.java b/src/java/com/twitter/search/core/earlybird/index/inverted/TermsArray.java deleted file mode 100644 index a3331044d..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/inverted/TermsArray.java +++ /dev/null @@ -1,189 +0,0 @@ -package com.twitter.search.core.earlybird.index.inverted; - -import java.io.IOException; -import java.util.Arrays; - -import org.apache.lucene.util.ArrayUtil; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; - -/** - * TermsArray provides information on each term in the posting list. - * - * It does not provide any concurrency guarantees. The writer must ensure that all updates are - * visible to readers with an external memory barrier. - */ -public class TermsArray implements Flushable { - private static final int BYTES_PER_POSTING = 5 * Integer.BYTES; - public static final int INVALID = -1; - - private final int size; - - public final int[] termPointers; - private final int[] postingsPointers; - - // Derived data. Not atomic and not reliable. 
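  // These parallel arrays are indexed by termID and written only by the single writer thread;
  // readers rely on the external memory barrier described in the class comment.
  // largestPostings holds the largest doc ID appended so far for each term.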
- public final int[] largestPostings; - public final int[] documentFrequency; - public final int[] offensiveCounters; - - TermsArray(int size, boolean useOffensiveCounters) { - this.size = size; - - termPointers = new int[size]; - postingsPointers = new int[size]; - - largestPostings = new int[size]; - documentFrequency = new int[size]; - - if (useOffensiveCounters) { - offensiveCounters = new int[size]; - } else { - offensiveCounters = null; - } - - Arrays.fill(postingsPointers, INVALID); - Arrays.fill(largestPostings, INVALID); - } - - private TermsArray(TermsArray oldArray, int newSize) { - this(newSize, oldArray.offensiveCounters != null); - copyFrom(oldArray); - } - - private TermsArray( - int size, - int[] termPointers, - int[] postingsPointers, - int[] largestPostings, - int[] documentFrequency, - int[] offensiveCounters) { - this.size = size; - - this.termPointers = termPointers; - this.postingsPointers = postingsPointers; - - this.largestPostings = largestPostings; - this.documentFrequency = documentFrequency; - this.offensiveCounters = offensiveCounters; - } - - TermsArray grow() { - int newSize = ArrayUtil.oversize(size + 1, BYTES_PER_POSTING); - return new TermsArray(this, newSize); - } - - - private void copyFrom(TermsArray from) { - copy(from.termPointers, termPointers); - copy(from.postingsPointers, postingsPointers); - - copy(from.largestPostings, largestPostings); - copy(from.documentFrequency, documentFrequency); - - if (from.offensiveCounters != null) { - copy(from.offensiveCounters, offensiveCounters); - } - } - - private void copy(int[] from, int[] to) { - System.arraycopy(from, 0, to, 0, from.length); - } - - /** - * Returns the size of this array. - */ - public int getSize() { - return size; - } - - /** - * Write side operation for updating the pointer to the last posting for a given term. - */ - public void updatePostingsPointer(int termID, int newPointer) { - postingsPointers[termID] = newPointer; - } - - /** - * The returned pointer is guaranteed to be memory safe to follow to its target. The data - * structure it points to will be consistent and safe to traverse. The posting list may contain - * doc IDs that the current reader should not see, and the reader should skip over these doc IDs - * to ensure that the readers provide an immutable view of the doc IDs in a posting list. - */ - public int getPostingsPointer(int termID) { - return postingsPointers[termID]; - } - - public int[] getDocumentFrequency() { - return documentFrequency; - } - - /** - * Gets the array containing the first posting for each indexed term. 
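   * SkipListPostingList#appendPosting keeps each entry at the largest doc ID appended for the
   * term.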
- */ - public int[] getLargestPostings() { - return largestPostings; - } - - @SuppressWarnings("unchecked") - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String SIZE_PROP_NAME = "size"; - private static final String HAS_OFFENSIVE_COUNTERS_PROP_NAME = "hasOffensiveCounters"; - - public FlushHandler(TermsArray objectToFlush) { - super(objectToFlush); - } - - public FlushHandler() { - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - TermsArray objectToFlush = getObjectToFlush(); - flushInfo.addIntProperty(SIZE_PROP_NAME, objectToFlush.size); - boolean hasOffensiveCounters = objectToFlush.offensiveCounters != null; - flushInfo.addBooleanProperty(HAS_OFFENSIVE_COUNTERS_PROP_NAME, hasOffensiveCounters); - - out.writeIntArray(objectToFlush.termPointers); - out.writeIntArray(objectToFlush.postingsPointers); - - out.writeIntArray(objectToFlush.largestPostings); - out.writeIntArray(objectToFlush.documentFrequency); - - if (hasOffensiveCounters) { - out.writeIntArray(objectToFlush.offensiveCounters); - } - } - - @Override - protected TermsArray doLoad( - FlushInfo flushInfo, DataDeserializer in) throws IOException { - int size = flushInfo.getIntProperty(SIZE_PROP_NAME); - boolean hasOffensiveCounters = flushInfo.getBooleanProperty(HAS_OFFENSIVE_COUNTERS_PROP_NAME); - - int[] termPointers = in.readIntArray(); - int[] postingsPointers = in.readIntArray(); - - int[] largestPostings = in.readIntArray(); - int[] documentFrequency = in.readIntArray(); - - int[] offensiveCounters = hasOffensiveCounters ? in.readIntArray() : null; - - return new TermsArray( - size, - termPointers, - postingsPointers, - largestPostings, - documentFrequency, - offensiveCounters); - } - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/util/AllDocsIterator.java b/src/java/com/twitter/search/core/earlybird/index/util/AllDocsIterator.java deleted file mode 100644 index b5ab9ae26..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/util/AllDocsIterator.java +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.search.core.earlybird.index.util; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentAtomicReader; - -/** - * Used to iterate through all of the documents in an Earlybird segment. This is necessary so that - * we can ensure all of the documents we are reading have been published to the readers. If we used - * the doc ID mapper to iterate through documents, it would return documents that have been only - * partially added to the index, and could return bogus search results (SEARCH-27711). 
- */ -public class AllDocsIterator extends DocIdSetIterator { - public static final String ALL_DOCS_TERM = "__all_docs"; - - private final DocIdSetIterator delegate; - - public AllDocsIterator(LeafReader reader) throws IOException { - delegate = buildDISI(reader); - } - - private static DocIdSetIterator buildDISI(LeafReader reader) throws IOException { - if (!isRealtimeUnoptimizedSegment(reader)) { - return all(reader.maxDoc()); - } - - Terms terms = - reader.terms(EarlybirdFieldConstants.EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName()); - if (terms == null) { - return all(reader.maxDoc()); - } - - TermsEnum termsEnum = terms.iterator(); - boolean hasTerm = termsEnum.seekExact(new BytesRef(ALL_DOCS_TERM)); - if (hasTerm) { - return termsEnum.postings(null); - } - - return empty(); - } - - @Override - public int docID() { - return delegate.docID(); - } - - @Override - public int nextDoc() throws IOException { - return delegate.nextDoc(); - } - - @Override - public int advance(int target) throws IOException { - return delegate.advance(target); - } - - @Override - public long cost() { - return delegate.cost(); - } - - /** - * Returns whether this is a realtime segment in the realtime index that is still unoptimized and - * mutable. - */ - private static boolean isRealtimeUnoptimizedSegment(LeafReader reader) { - if (reader instanceof EarlybirdRealtimeIndexSegmentAtomicReader) { - EarlybirdRealtimeIndexSegmentAtomicReader realtimeReader = - (EarlybirdRealtimeIndexSegmentAtomicReader) reader; - return !realtimeReader.getSegmentData().isOptimized(); - } - - return false; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/util/RangeDISI.java b/src/java/com/twitter/search/core/earlybird/index/util/RangeDISI.java deleted file mode 100644 index accee2156..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/util/RangeDISI.java +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.search.core.earlybird.index.util; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public class RangeDISI extends DocIdSetIterator { - private final int start; - private final int end; - private final AllDocsIterator delegate; - - private int currentDocId = -1; - - public RangeDISI(LeafReader reader, int start, int end) throws IOException { - this.delegate = new AllDocsIterator(reader); - this.start = start; - if (end == DocIDToTweetIDMapper.ID_NOT_FOUND) { - this.end = Integer.MAX_VALUE; - } else { - this.end = end; - } - } - - @Override - public int docID() { - return currentDocId; - } - - @Override - public int nextDoc() throws IOException { - return advance(currentDocId + 1); - } - - @Override - public int advance(int target) throws IOException { - currentDocId = delegate.advance(Math.max(target, start)); - if (currentDocId > end) { - currentDocId = NO_MORE_DOCS; - } - return currentDocId; - } - - @Override - public long cost() { - return delegate.cost(); - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/util/RangeFilterDISI.java b/src/java/com/twitter/search/core/earlybird/index/util/RangeFilterDISI.java deleted file mode 100644 index 934355fc9..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/util/RangeFilterDISI.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.core.earlybird.index.util; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import 
org.apache.lucene.search.DocIdSetIterator; - -/** - * A doc id set iterator that iterates over a filtered set of ids from firstId inclusive to lastId - * inclusive. - */ -public class RangeFilterDISI extends DocIdSetIterator { - private final RangeDISI delegate; - - public RangeFilterDISI(LeafReader reader) throws IOException { - this(reader, 0, reader.maxDoc() - 1); - } - - public RangeFilterDISI(LeafReader reader, int smallestDocID, int largestDocID) - throws IOException { - this.delegate = new RangeDISI(reader, smallestDocID, largestDocID); - } - - @Override - public int docID() { - return delegate.docID(); - } - - @Override - public int nextDoc() throws IOException { - delegate.nextDoc(); - return nextValidDoc(); - } - - @Override - public int advance(int target) throws IOException { - delegate.advance(target); - return nextValidDoc(); - } - - private int nextValidDoc() throws IOException { - int doc = delegate.docID(); - while (doc != NO_MORE_DOCS && !shouldReturnDoc()) { - doc = delegate.nextDoc(); - } - return doc; - } - - @Override - public long cost() { - return delegate.cost(); - } - - // Override this method to add additional filters. Should return true if the current doc is OK. - protected boolean shouldReturnDoc() throws IOException { - return true; - } -} diff --git a/src/java/com/twitter/search/core/earlybird/index/util/SearchSortUtils.java b/src/java/com/twitter/search/core/earlybird/index/util/SearchSortUtils.java deleted file mode 100644 index c17565784..000000000 --- a/src/java/com/twitter/search/core/earlybird/index/util/SearchSortUtils.java +++ /dev/null @@ -1,42 +0,0 @@ -package com.twitter.search.core.earlybird.index.util; - -import com.google.common.base.Preconditions; - -public abstract class SearchSortUtils { - public interface Comparator<T> { - /** - * Compares the item represented by the given index with the provided value. - */ - int compare(int index, T value); - } - - /** - * Performs a binary search using the given comparator, and returns the index of the item that - * was found. If findLow is true, the greatest item that's lower than the provided key - * is returned. Otherwise, the lowest item that's greater than the provided key is returned. - */ - public static <T> int binarySearch(Comparator<T> comparator, final int begin, final int end, - final T key, boolean findLow) { - int low = begin; - int high = end; - Preconditions.checkState(comparator.compare(low, key) <= comparator.compare(high, key)); - while (low <= high) { - int mid = (low + high) >>> 1; - int result = comparator.compare(mid, key); - if (result < 0) { - low = mid + 1; - } else if (result > 0) { - high = mid - 1; - } else { - return mid; - } // key found - } - - assert low > high; - if (findLow) { - return high < begin ? begin : high; - } else { - return low > end ? 
end : low; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/BUILD b/src/java/com/twitter/search/earlybird/BUILD deleted file mode 100644 index 457cfe12a..000000000 --- a/src/java/com/twitter/search/earlybird/BUILD +++ /dev/null @@ -1,222 +0,0 @@ -COMMON_SOURCES = ["common/**/*.java"] - -CONFIG_SOURCES = ["config/**/*.java"] - -TOOLS_SOURCES = ["tools/**/*.java"] - -INDEX_SOURCES = ["index/facets/**/*.java"] - -SEGMENT_BUILDER_SOURCES = ["archive/segmentbuilder/**/*.java"] - -java_library( - name = "earlybird-lib", - sources = ["**/*.java"] + exclude_globs(COMMON_SOURCES + CONFIG_SOURCES + TOOLS_SOURCES + SEGMENT_BUILDER_SOURCES + INDEX_SOURCES), - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/gson", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/twitter/distributedlog:distributedlog-core", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/commons-codec", - "3rdparty/jvm/commons-httpclient", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/javax/servlet:servlet-api", - "3rdparty/jvm/net/java/dev/jets3t", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-server", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-twitter-science-provider", - "3rdparty/jvm/org/apache/commons:commons-lang3", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/httpcomponents:httpclient", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "3rdparty/jvm/org/apache/lucene:lucene-queryparser", - "3rdparty/jvm/org/apache/lucene:lucene-spatial-extras", - "3rdparty/jvm/org/apache/lucene:lucene-test-framework", - "3rdparty/jvm/org/apache/thrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/json", - "3rdparty/jvm/org/slf4j:slf4j-api", - "3rdparty/jvm/org/tensorflow", - "3rdparty/jvm/org/tensorflow:tensorflow-hadoop", - "3rdparty/jvm/org/yaml:snakeyaml", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "decider/src/main/scala", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/client", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/server", - "finagle-internal/slo/src/main/scala/com/twitter/finagle/slo", - "finagle/finagle-base-http", - "finagle/finagle-core/src/main", - "finagle/finagle-http", - "finagle/finagle-serversets/src/main/scala", - "finagle/finagle-stats/src/main/scala", - "finagle/finagle-thrift/src/main/java", - "finagle/finagle-thrift/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "finagle/finagle-zipkin-core/src/main/scala", - "finagle/finagle-zipkin-scribe/src/main/scala", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "periscope/api-proxy-thrift/thrift/src/main/thrift:thrift-java", - "servo/decider", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/antlr/com/twitter/search/queryparser/antlr:queryparser-antlr", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/net:dynamic-host-set", - 
"src/java/com/twitter/common/quantity", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common/text/util:token-util", - "src/java/com/twitter/common/util", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common/zookeeper:client", - "src/java/com/twitter/common/zookeeper:group", - "src/java/com/twitter/common/zookeeper:server-set", - "src/java/com/twitter/common_internal/bloomfilter", - "src/java/com/twitter/common_internal/collections", - "src/java/com/twitter/common_internal/text:text-penguin7", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/common_internal/zookeeper", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/search/common/aurora", - "src/java/com/twitter/search/common/concurrent", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/dark", - "src/java/com/twitter/search/common/database", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/encoding/docvalues", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/features", - "src/java/com/twitter/search/common/file", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/partitioning/zookeeper", - "src/java/com/twitter/search/common/query", - "src/java/com/twitter/search/common/relevance:feature-update-reader", - "src/java/com/twitter/search/common/relevance:scorers", - "src/java/com/twitter/search/common/relevance:text", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/search", - "src/java/com/twitter/search/common/search/termination", - "src/java/com/twitter/search/common/util:closeresourceutil", - "src/java/com/twitter/search/common/util:finagleutil", - "src/java/com/twitter/search/common/util:gcutil", - "src/java/com/twitter/search/common/util:kerberos", - "src/java/com/twitter/search/common/util:log_format_util", - "src/java/com/twitter/search/common/util:longintconverter", - "src/java/com/twitter/search/common/util:platform_stats_exporter", - "src/java/com/twitter/search/common/util:rule_based_converter", - "src/java/com/twitter/search/common/util/analysis", - "src/java/com/twitter/search/common/util/date", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/common/util/hash", - "src/java/com/twitter/search/common/util/io", - "src/java/com/twitter/search/common/util/io:dl-reader-writer", - "src/java/com/twitter/search/common/util/io:flushable", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/common/util/io/kafka", - "src/java/com/twitter/search/common/util/lang", - "src/java/com/twitter/search/common/util/ml/models_manager", - "src/java/com/twitter/search/common/util/ml/prediction_engine", - "src/java/com/twitter/search/common/util/ml/tensorflow_engine", - "src/java/com/twitter/search/common/util/spatial", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/text/regex", - 
"src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/common/util/thrift:thrift-utils", - "src/java/com/twitter/search/common/util/url", - "src/java/com/twitter/search/common/util/zktrylock", - "src/java/com/twitter/search/common/util/zookeeper", - "src/java/com/twitter/search/core/earlybird", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird/common/config", - "src/java/com/twitter/search/earlybird/common/userupdates", - "src/java/com/twitter/search/earlybird/config", - "src/java/com/twitter/search/earlybird/index/facets", - "src/java/com/twitter/search/ingester/pipeline/strato_fetchers", - "src/java/com/twitter/search/modeling/common", - "src/java/com/twitter/search/modeling/tweet_ranking", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/java/com/twitter/search/queryparser/query/search:search-query-nodes", - "src/resources/com/twitter/search/earlybird/com/twitter", - "src/resources/com/twitter/search/earlybird/ml", - "src/thrift/com/twitter/search:common", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/common:features-java", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:query-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/tweetypie:events-java", - "src/thrift/org/apache/aurora/gen:api", - "stitch/stitch-core/src/main/scala/com/twitter/stitch", - "strato/src/main/scala/com/twitter/strato/catalog", - "strato/src/main/scala/com/twitter/strato/client", - "strato/src/main/scala/com/twitter/strato/data", - "strato/src/main/scala/com/twitter/strato/thrift", - "tensorflow/tfcompute-java/src/main/java/com/twitter/tfcompute_java", - "thrift-web-forms/src/main/java/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "twitter-server-internal", - "twitter-server/server/src/main/scala", - "ubs/common/src/main/thrift/com/twitter/ubs:broadcast-thrift-java", - "ubs/common/src/main/thrift/com/twitter/ubs:events-java", - "util-internal/util-eval/src/main/scala", - "util/util-app", - "util/util-core:scala", - "util/util-function", - "util/util-lint", - "util/util-slf4j-api/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "earlybird-binary", - basename = "earlybird", - main = "com.twitter.search.earlybird.EarlybirdMain", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":earlybird-lib", - "loglens/loglens-log4j", - ], -) - -java_library( - name = "tools", - sources = TOOLS_SOURCES, - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - ":earlybird-lib", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/guava:guava-testlib", - "3rdparty/jvm/commons-codec", - "3rdparty/jvm/commons-httpclient", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/junit", - "3rdparty/jvm/net/java/dev/jets3t", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-server", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird/CONFIG.ini b/src/java/com/twitter/search/earlybird/CONFIG.ini deleted file mode 100644 index 6d4d06376..000000000 --- a/src/java/com/twitter/search/earlybird/CONFIG.ini +++ /dev/null @@ -1,7 +0,0 @@ -; See http://go/CONFIG.ini - 
-[jira] -project: SEARCH - -[kite] -project: earlybird diff --git a/src/java/com/twitter/search/earlybird/Earlybird.java b/src/java/com/twitter/search/earlybird/Earlybird.java deleted file mode 100644 index 54dc33f16..000000000 --- a/src/java/com/twitter/search/earlybird/Earlybird.java +++ /dev/null @@ -1,267 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.File; -import java.io.IOException; -import java.net.InetAddress; -import java.net.UnknownHostException; -import java.util.Arrays; -import java.util.Map; -import java.util.function.Predicate; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.finagle.Http; -import com.twitter.finagle.http.HttpMuxer; -import com.twitter.search.common.aurora.AuroraInstanceKey; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.config.LoggerConfiguration; -import com.twitter.search.common.constants.SearchThriftWebFormsAccess; -import com.twitter.search.common.metrics.BuildInfoStats; -import com.twitter.search.common.util.Kerberos; -import com.twitter.search.common.util.PlatformStatsExporter; -import com.twitter.search.earlybird.admin.EarlybirdAdminManager; -import com.twitter.search.earlybird.admin.EarlybirdHealthHandler; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.exception.UncaughtExceptionHandler; -import com.twitter.search.earlybird.factory.EarlybirdServerFactory; -import com.twitter.search.earlybird.factory.EarlybirdWireModule; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.util.EarlybirdDecider; -import com.twitter.server.handler.DeciderHandler$; -import com.twitter.server.AbstractTwitterServer; -import com.twitter.thriftwebforms.DisplaySettingsConfig; -import com.twitter.thriftwebforms.MethodOptionsAccessConfig; -import com.twitter.thriftwebforms.ThriftClientSettingsConfig; -import com.twitter.thriftwebforms.ThriftMethodSettingsConfig; -import com.twitter.thriftwebforms.ThriftServiceSettings; -import com.twitter.thriftwebforms.ThriftWebFormsSettings; -import com.twitter.thriftwebforms.TwitterServerThriftWebForms; -import com.twitter.util.Await; -import com.twitter.util.TimeoutException; - -public class Earlybird extends AbstractTwitterServer { - private static final Logger LOG = LoggerFactory.getLogger(Earlybird.class); - - // Flags defined here need to be processed before setting override values to EarlybirdConfig. 
- - private final Flag configFile = flag().create( - "config_file", - new File("earlybird-search.yml"), - "specify config file", - Flaggable.ofFile() - ); - - private final Flag logDir = flag().create( - "earlybird_log_dir", - "", - "override log dir from config file", - Flaggable.ofString() - ); - - private final Map> flagMap = Arrays.stream(EarlybirdProperty.values()) - .collect(Collectors.toMap( - property -> property.name(), - property -> property.createFlag(flag()))); - - private final UncaughtExceptionHandler uncaughtExceptionHandler = - new UncaughtExceptionHandler(); - - private EarlybirdServer earlybirdServer; - private EarlybirdAdminManager earlybirdAdminManager; - - public Earlybird() { - // Default health handler is added inside Lifecycle trait. To override that we need to set it - // in the constructor since HttpAdminServer is started before Earlybird.preMain() is called. - HttpMuxer.addHandler("/health", new EarlybirdHealthHandler()); - } - - /** - * Needs to be called from preMain and not from onInit() as flags / args parsing happens after - * onInit() is called. - */ - @VisibleForTesting - void configureFromFlagsAndSetupLogging() { - // Makes sure the EarlybirdStats is injected with a variable repository. - EarlybirdConfig.init(configFile.getWithDefault().get().getName()); - - if (logDir.isDefined()) { - EarlybirdConfig.overrideLogDir(logDir.get().get()); - } - new LoggerConfiguration(EarlybirdConfig.getLogPropertiesFile(), - EarlybirdConfig.getLogDir()).configure(); - - String instanceKey = System.getProperty("aurora.instanceKey"); - if (instanceKey != null) { - EarlybirdConfig.setAuroraInstanceKey(AuroraInstanceKey.fromInstanceKey(instanceKey)); - LOG.info("Earlybird is running on Aurora"); - checkRequiredProperties(EarlybirdProperty::isRequiredOnAurora, "Aurora"); - } else { - LOG.info("Earlybird is running on dedicated hardware"); - checkRequiredProperties(EarlybirdProperty::isRequiredOnDedicated, "dedicated hardware"); - } - LOG.info("Config environment: {}", Config.getEnvironment()); - - if (adminPort().isDefined() && adminPort().get().isDefined()) { - int adminPort = adminPort().get().get().getPort(); - LOG.info("Admin port is {}", adminPort); - EarlybirdConfig.setAdminPort(adminPort); - } - - EarlybirdConfig.setOverrideValues( - flagMap.values().stream() - .filter(Flag::isDefined) - .collect(Collectors.toMap(Flag::name, flag -> flag.get().get()))); - } - - private void checkRequiredProperties( - Predicate propertyPredicate, String location) { - Arrays.stream(EarlybirdProperty.values()) - .filter(propertyPredicate) - .map(property -> flagMap.get(property.name())) - .forEach(flag -> - Preconditions.checkState(flag.isDefined(), - "-%s is required on %s", flag.name(), location)); - } - - private void logEarlybirdInfo() { - try { - LOG.info("Hostname: {}", InetAddress.getLocalHost().getHostName()); - } catch (UnknownHostException e) { - LOG.info("Unable to be get local host: {}", e.getMessage()); - } - LOG.info("Earlybird info [Name: {}, Zone: {}, Env: {}]", - EarlybirdProperty.EARLYBIRD_NAME.get(), - EarlybirdProperty.ZONE.get(), - EarlybirdProperty.ENV.get()); - LOG.info("Earlybird scrubgen from Aurora: {}]", - EarlybirdProperty.EARLYBIRD_SCRUB_GEN.get()); - LOG.info("Find final partition config by searching the log for \"Partition config info\""); - } - - private EarlybirdServer makeEarlybirdServer() { - EarlybirdWireModule earlybirdWireModule = new EarlybirdWireModule(); - EarlybirdServerFactory earlybirdFactory = new EarlybirdServerFactory(); - try { - return 
earlybirdFactory.makeEarlybirdServer(earlybirdWireModule); - } catch (IOException e) { - LOG.error("Exception while constructing EarlybirdServer.", e); - throw new RuntimeException(e); - } - } - - private void setupThriftWebForms() { - TwitterServerThriftWebForms.addAdminRoutes(this, TwitterServerThriftWebForms.apply( - ThriftWebFormsSettings.apply( - DisplaySettingsConfig.DEFAULT, - ThriftServiceSettings.apply( - EarlybirdService.ServiceIface.class.getSimpleName(), - EarlybirdConfig.getThriftPort()), - ThriftClientSettingsConfig.makeCompactRequired( - EarlybirdProperty.getServiceIdentifier()), - ThriftMethodSettingsConfig.access( - MethodOptionsAccessConfig.byLdapGroup( - SearchThriftWebFormsAccess.READ_LDAP_GROUP))), - scala.reflect.ClassTag$.MODULE$.apply(EarlybirdService.ServiceIface.class))); - } - - private void setupDeciderWebForms() { - addAdminRoute( - DeciderHandler$.MODULE$.route( - "earlybird", - EarlybirdDecider.getMutableDecisionMaker(), - EarlybirdDecider.getDecider())); - } - - @Override - public Http.Server configureAdminHttpServer(Http.Server server) { - return server.withMonitor(uncaughtExceptionHandler); - } - - @Override - public void preMain() { - configureFromFlagsAndSetupLogging(); - logEarlybirdInfo(); - LOG.info("Starting preMain()"); - - BuildInfoStats.export(); - PlatformStatsExporter.exportPlatformStats(); - - // Use our own exception handler to monitor all unhandled exceptions. - Thread.setDefaultUncaughtExceptionHandler((thread, e) -> { - LOG.error("Invoked default uncaught exception handler."); - uncaughtExceptionHandler.handle(e); - }); - LOG.info("Registered unhandled exception monitor."); - - Kerberos.kinit( - EarlybirdConfig.getString("kerberos_user", ""), - EarlybirdConfig.getString("kerberos_keytab_path", "") - ); - - LOG.info("Creating earlybird server."); - earlybirdServer = makeEarlybirdServer(); - - uncaughtExceptionHandler.setShutdownHook(() -> { - earlybirdServer.shutdown(); - this.close(); - }); - - earlybirdAdminManager = EarlybirdAdminManager.create(earlybirdServer); - earlybirdAdminManager.start(); - LOG.info("Started admin interface."); - - setupThriftWebForms(); - setupDeciderWebForms(); - - LOG.info("Opened thrift serving form."); - - LOG.info("preMain() complete."); - } - - @Override - public void main() throws InterruptedException, TimeoutException, EarlybirdStartupException { - innerMain(); - } - - /** - * Setting up an innerMain() so that tests can mock out the contents of main without interfering - * with reflection being done in App.scala looking for a method named "main". - */ - @VisibleForTesting - void innerMain() throws TimeoutException, InterruptedException, EarlybirdStartupException { - LOG.info("Starting main()."); - - // If this method throws, TwitterServer will catch the exception and call close, so we don't - // catch it here. 
- try { - earlybirdServer.start(); - } catch (Throwable throwable) { - LOG.error("Exception while starting:", throwable); - throw throwable; - } - - Await.ready(adminHttpServer()); - LOG.info("main() complete."); - } - - @Override - public void onExit() { - LOG.info("Starting onExit()"); - earlybirdServer.shutdown(); - try { - earlybirdAdminManager.doShutdown(); - } catch (InterruptedException e) { - LOG.warn("earlybirdAdminManager shutdown was interrupted with " + e); - } - LOG.info("onExit() complete."); - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdCPUQualityFactor.java b/src/java/com/twitter/search/earlybird/EarlybirdCPUQualityFactor.java deleted file mode 100644 index 0fdfbc1d5..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdCPUQualityFactor.java +++ /dev/null @@ -1,181 +0,0 @@ -package com.twitter.search.earlybird; - -import com.google.common.annotations.VisibleForTesting; -import com.sun.management.OperatingSystemMXBean; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchStatsReceiver; - -/** - * Manages the quality factor for an Earlybird based on CPU usage. - */ -public class EarlybirdCPUQualityFactor implements QualityFactor { - public static final String ENABLE_QUALITY_FACTOR_DECIDER = "enable_quality_factor"; - public static final String OVERRIDE_QUALITY_FACTOR_DECIDER = "override_quality_factor"; - - @VisibleForTesting - protected static final double CPU_USAGE_THRESHOLD = 0.8; - @VisibleForTesting - protected static final double MAX_QF_INCREMENT = 0.5; - @VisibleForTesting - protected static final double MAX_QF_DECREMENT = 0.1; - @VisibleForTesting - protected static final double MAX_CPU_USAGE = 1.0; - - private static final Logger QUALITY_FACTOR_LOG = - LoggerFactory.getLogger(EarlybirdCPUQualityFactor.class); - private static final Logger EARLYBIRD_LOG = LoggerFactory.getLogger(Earlybird.class); - - /** - * Tracks the real, underlying CPU QF value, regardless of the decider enabling - * it. - */ - @VisibleForTesting - protected static final String UNDERLYING_CPU_QF_GUAGE = "underlying_cpu_quality_factor"; - - /** - * Reports the QF actually used to degrade Earlybirds. - */ - @VisibleForTesting - protected static final String CPU_QF_GUAGE = "cpu_quality_factor"; - - private static final int SAMPLING_WINDOW_MILLIS = 60 * 1000; // one minute - - - private double qualityFactor = 1; - private double previousQualityFactor = 1; - - private final SearchDecider decider; - private final OperatingSystemMXBean operatingSystemMXBean; - - public EarlybirdCPUQualityFactor( - Decider decider, - OperatingSystemMXBean operatingSystemMXBean, - SearchStatsReceiver searchStatsReceiver) { - this.decider = new SearchDecider(decider); - this.operatingSystemMXBean = operatingSystemMXBean; - - searchStatsReceiver.getCustomGauge(UNDERLYING_CPU_QF_GUAGE, () -> qualityFactor); - searchStatsReceiver.getCustomGauge(CPU_QF_GUAGE, this::get); - } - - /** - * Updates the current quality factor based on CPU usage. 
- */ - @VisibleForTesting - protected void update() { - previousQualityFactor = qualityFactor; - - double cpuUsage = operatingSystemMXBean.getSystemCpuLoad(); - - if (cpuUsage < CPU_USAGE_THRESHOLD) { - double increment = - ((CPU_USAGE_THRESHOLD - cpuUsage) / CPU_USAGE_THRESHOLD) * MAX_QF_INCREMENT; - qualityFactor = Math.min(1, qualityFactor + increment); - } else { - double decrement = - ((cpuUsage - CPU_USAGE_THRESHOLD) / (MAX_CPU_USAGE - CPU_USAGE_THRESHOLD)) - * MAX_QF_DECREMENT; - qualityFactor = Math.max(0, qualityFactor - decrement); - } - - if (!qualityFactorChanged()) { - return; - } - - QUALITY_FACTOR_LOG.info( - String.format("CPU: %.2f Quality Factor: %.2f", cpuUsage, qualityFactor)); - - if (!enabled()) { - return; - } - - if (degradationBegan()) { - EARLYBIRD_LOG.info("Service degradation began."); - } - - if (degradationEnded()) { - EARLYBIRD_LOG.info("Service degradation ended."); - } - } - - @Override - public double get() { - if (!enabled()) { - return 1; - } - - if (isOverridden()) { - return override(); - } - - return qualityFactor; - } - - @Override - public void startUpdates() { - new Thread(() -> { - while (true) { - update(); - try { - Thread.sleep(SAMPLING_WINDOW_MILLIS); - } catch (InterruptedException e) { - QUALITY_FACTOR_LOG.warn( - "Quality factoring thread interrupted during sleep between updates", e); - } - } - }).start(); - } - - /** - * Returns true if quality factoring is enabled by the decider. - * @return - */ - private boolean enabled() { - return decider != null && decider.isAvailable(ENABLE_QUALITY_FACTOR_DECIDER); - } - - /** - * Returns true if a decider has overridden the quality factor. - * @return - */ - private boolean isOverridden() { - return decider != null && decider.getAvailability(OVERRIDE_QUALITY_FACTOR_DECIDER) < 10000.0; - } - - /** - * Returns the override decider value. - * @return - */ - private double override() { - return decider == null ? 1 : decider.getAvailability(OVERRIDE_QUALITY_FACTOR_DECIDER) / 10000.0; - } - - /** - * Returns true if the quality factor has changed since the last update. - * @return - */ - private boolean qualityFactorChanged() { - return Math.abs(qualityFactor - previousQualityFactor) > 0.01; - } - - /** - * Returns true if we've entered a degraded state. - * @return - */ - private boolean degradationBegan() { - return Math.abs(previousQualityFactor - 1.0) < 0.01 && qualityFactor < previousQualityFactor; - } - - /** - * Returns true if we've left the degraded state. 
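The update() method above adjusts the quality factor proportionally to how far CPU usage sits from CPU_USAGE_THRESHOLD. A worked example using the constants shown (threshold 0.8, max increment 0.5, max decrement 0.1, max CPU 1.0); the sample CPU readings are illustrative only:

```java
// Illustrative arithmetic mirroring EarlybirdCPUQualityFactor.update():
// cpuUsage = 0.4 (below threshold): increment = ((0.8 - 0.4) / 0.8) * 0.5 = 0.25 -> QF rises, capped at 1.0
// cpuUsage = 0.9 (above threshold): decrement = ((0.9 - 0.8) / (1.0 - 0.8)) * 0.1 = 0.05 -> QF falls, floored at 0.0
double cpuUsage = 0.9;
double decrement = ((cpuUsage - 0.8) / (1.0 - 0.8)) * 0.1;  // 0.05
double qualityFactor = Math.max(0, 1.0 - decrement);        // 0.95 after one sampling window
```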
- * @return - */ - private boolean degradationEnded() { - return Math.abs(qualityFactor - 1.0) < 0.01 && previousQualityFactor < qualityFactor; - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdDarkProxy.java b/src/java/com/twitter/search/earlybird/EarlybirdDarkProxy.java deleted file mode 100644 index c0d2fea2e..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdDarkProxy.java +++ /dev/null @@ -1,113 +0,0 @@ -package com.twitter.search.earlybird; - -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Lists; - -import org.apache.thrift.protocol.TCompactProtocol; - -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.builder.ClientConfig.Yes; -import com.twitter.finagle.mtls.client.MtlsThriftMuxClient; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.finagle.zipkin.thrift.ZipkinTracer; -import com.twitter.search.common.dark.DarkProxy; -import com.twitter.search.common.dark.ResolverProxy; -import com.twitter.search.common.dark.ServerSetResolver; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.util.thrift.BytesToThriftFilter; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.util.Duration; - -public class EarlybirdDarkProxy { - private static final String WARM_UP_DECIDER_KEY_PREFIX = "warmup_"; - - private static final int DARK_REQUESTS_TOTAL_REQUEST_TIMEOUT_MS = - EarlybirdConfig.getInt("dark_requests_total_request_timeout_ms", 800); - private static final int DARK_REQUESTS_INDIVIDUAL_REQUEST_TIMEOUT_MS = - EarlybirdConfig.getInt("dark_requests_individual_request_timeout_ms", 800); - private static final int DARK_REQUESTS_CONNECT_TIMEOUT_MS = - EarlybirdConfig.getInt("dark_requests_connect_timeout_ms", 500); - private static final int DARK_REQUESTS_NUM_RETRIES = - EarlybirdConfig.getInt("dark_requests_num_retries", 1); - private static final String DARK_REQUESTS_FINAGLE_CLIENT_ID = - EarlybirdConfig.getString("dark_requests_finagle_client_id", "earlybird_warmup"); - - private final DarkProxy darkProxy; - - public EarlybirdDarkProxy(SearchDecider searchDecider, - StatsReceiver statsReceiver, - EarlybirdServerSetManager earlybirdServerSetManager, - EarlybirdWarmUpManager earlybirdWarmUpManager, - String clusterName) { - darkProxy = newDarkProxy(searchDecider, - statsReceiver, - earlybirdServerSetManager, - earlybirdWarmUpManager, - clusterName); - } - - public DarkProxy getDarkProxy() { - return darkProxy; - } - - @VisibleForTesting - protected DarkProxy newDarkProxy( - SearchDecider searchDecider, - StatsReceiver statsReceiver, - EarlybirdServerSetManager earlybirdServerSetManager, - final EarlybirdWarmUpManager earlybirdWarmUpManager, - String clusterName) { - ResolverProxy resolverProxy = new ResolverProxy(); - ServerSetResolver.SelfServerSetResolver selfServerSetResolver = - new ServerSetResolver.SelfServerSetResolver( - earlybirdServerSetManager.getServerSetIdentifier(), resolverProxy); - selfServerSetResolver.init(); - - final String clusterNameForDeciderKey = clusterName.toLowerCase().replaceAll("-", "_"); - final String warmUpServerSetIdentifier = earlybirdWarmUpManager.getServerSetIdentifier(); - DarkProxy newDarkProxy = new DarkProxy( - 
selfServerSetResolver, - newClientBuilder(statsReceiver), - resolverProxy, - searchDecider, - Lists.newArrayList(warmUpServerSetIdentifier), - new BytesToThriftFilter(), - statsReceiver) { - @Override - protected String getServicePathDeciderKey(String servicePath) { - if (warmUpServerSetIdentifier.equals(servicePath)) { - return WARM_UP_DECIDER_KEY_PREFIX + clusterNameForDeciderKey; - } - - return clusterNameForDeciderKey; - } - }; - - newDarkProxy.init(); - return newDarkProxy; - } - - private ClientBuilder newClientBuilder( - StatsReceiver statsReceiver) { - return ClientBuilder.get() - .daemon(true) - .timeout(Duration.apply(DARK_REQUESTS_TOTAL_REQUEST_TIMEOUT_MS, TimeUnit.MILLISECONDS)) - .requestTimeout( - Duration.apply(DARK_REQUESTS_INDIVIDUAL_REQUEST_TIMEOUT_MS, TimeUnit.MILLISECONDS)) - .tcpConnectTimeout(Duration.apply(DARK_REQUESTS_CONNECT_TIMEOUT_MS, TimeUnit.MILLISECONDS)) - .retries(DARK_REQUESTS_NUM_RETRIES) - .reportTo(statsReceiver) - .tracer(ZipkinTracer.mk(statsReceiver)) - .stack(new MtlsThriftMuxClient( - ThriftMux.client()) - .withMutualTls(EarlybirdProperty.getServiceIdentifier()) - .withProtocolFactory(new TCompactProtocol.Factory()) - .withClientId(new ClientId(DARK_REQUESTS_FINAGLE_CLIENT_ID))); - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdFinagleServerManager.java b/src/java/com/twitter/search/earlybird/EarlybirdFinagleServerManager.java deleted file mode 100644 index c84083475..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdFinagleServerManager.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.earlybird; - -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.search.common.dark.DarkProxy; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.util.Duration; - -/** - * Manages a finagle server underneath, which can be recreated. - * - * This class is not thread-safe. It is up to the concrete implementations and their callers to - * correctly synchronize calls to these methods (for example, to make sure that there is no race - * condition if startProductionFinagleServer() and stopProductionFinagleServer() are called - * concurrently from two different threads). - */ -public interface EarlybirdFinagleServerManager { - /** - * Determines if the warm up finagle server is currently running - */ - boolean isWarmUpServerRunning(); - - /** - * Starts up the warm up finagle server on the given port. - */ - void startWarmUpFinagleServer( - EarlybirdService.ServiceIface serviceIface, - String serviceName, - int port); - - /** - * Stops the warm up finagle server, after waiting for at most the given amount of time. - */ - void stopWarmUpFinagleServer(Duration serverCloseWaitTime) throws InterruptedException; - - /** - * Determines if the production finagle server is currently running. - */ - boolean isProductionServerRunning(); - - /** - * Starts up the production finagle server on the given port. - */ - void startProductionFinagleServer( - DarkProxy darkProxy, - EarlybirdService.ServiceIface serviceIface, - String serviceName, - int port); - - /** - * Stops the production finagle server after waiting for at most the given amount of time. 
- */ - void stopProductionFinagleServer(Duration serverCloseWaitTime) throws InterruptedException; -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdFuturePoolManager.java b/src/java/com/twitter/search/earlybird/EarlybirdFuturePoolManager.java deleted file mode 100644 index 180b058f7..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdFuturePoolManager.java +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.earlybird; - -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.RejectedExecutionException; -import java.util.concurrent.ThreadFactory; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -import scala.Function0; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import com.twitter.search.common.concurrent.ThreadPoolExecutorStats; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.util.ExecutorServiceFuturePool; -import com.twitter.util.Future; -import com.twitter.util.FuturePool; - -/** - * A future pool that delegates all calls to an underlying futurePool, which can be recreated. - */ -public class EarlybirdFuturePoolManager implements FuturePool { - private volatile ExecutorServiceFuturePool pool = null; - - private final String threadName; - private final ThreadPoolExecutorStats threadPoolExecutorStats; - - public EarlybirdFuturePoolManager(String threadName) { - this.threadName = threadName; - this.threadPoolExecutorStats = new ThreadPoolExecutorStats(threadName); - } - - final synchronized void createUnderlyingFuturePool(int threadCount) { - Preconditions.checkState(pool == null, "Cannot create a new pool before stopping the old one"); - - ExecutorService executorService = - createExecutorService(threadCount, getMaxQueueSize()); - if (executorService instanceof ThreadPoolExecutor) { - threadPoolExecutorStats.setUnderlyingExecutorForStats((ThreadPoolExecutor) executorService); - } - - pool = new ExecutorServiceFuturePool(executorService); - } - - final synchronized void stopUnderlyingFuturePool(long timeout, TimeUnit timeunit) - throws InterruptedException { - Preconditions.checkNotNull(pool); - pool.executor().shutdown(); - pool.executor().awaitTermination(timeout, timeunit); - pool = null; - } - - boolean isPoolReady() { - return pool != null; - } - - @Override - public final Future apply(Function0 f) { - return Preconditions.checkNotNull(pool).apply(f); - } - - @VisibleForTesting - protected ExecutorService createExecutorService(int threadCount, int maxQueueSize) { - if (maxQueueSize <= 0) { - return Executors.newFixedThreadPool(threadCount, createThreadFactory(threadName)); - } - - SearchRateCounter rejectedTaskCounter = - SearchRateCounter.export(threadName + "_rejected_task_count"); - return new ThreadPoolExecutor( - threadCount, threadCount, 0, TimeUnit.MILLISECONDS, - new ArrayBlockingQueue<>(maxQueueSize), - createThreadFactory(threadName), - (runnable, executor) -> { - rejectedTaskCounter.increment(); - throw new RejectedExecutionException(threadName + " queue is full"); - }); - } - - @VisibleForTesting - protected int getMaxQueueSize() { - return EarlybirdProperty.MAX_QUEUE_SIZE.get(0); - } - - @VisibleForTesting - static ThreadFactory createThreadFactory(String 
threadName) { - return new ThreadFactoryBuilder() - .setNameFormat(threadName + "-%d") - .setDaemon(true) - .build(); - } - - @Override - public int poolSize() { - return Preconditions.checkNotNull(pool).poolSize(); - } - - @Override - public int numActiveTasks() { - return Preconditions.checkNotNull(pool).numActiveTasks(); - } - - @Override - public long numCompletedTasks() { - return Preconditions.checkNotNull(pool).numCompletedTasks(); - } - - -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdIndexConfig.java b/src/java/com/twitter/search/earlybird/EarlybirdIndexConfig.java deleted file mode 100644 index b5928c651..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdIndexConfig.java +++ /dev/null @@ -1,190 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Predicate; -import com.google.common.base.Predicates; - -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.store.Directory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.schema.base.Schema.SchemaValidationException; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdSchemaCreateTool; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.CloseResourceUtil; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.ThriftIndexingEventDocumentFactory; -import com.twitter.search.earlybird.document.ThriftIndexingEventUpdateFactory; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentSyncInfo; -import com.twitter.search.earlybird.partition.UserPartitionUtil; - -/** - * Collection of required indexing entities that differ in the various Earlybird clusters. - */ -public abstract class EarlybirdIndexConfig { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdIndexConfig.class); - - private final EarlybirdCluster cluster; - private final DynamicSchema schema; - private final Decider decider; - private final SearchIndexingMetricSet searchIndexingMetricSet; - protected final CriticalExceptionHandler criticalExceptionHandler; - - /** - * Creates a new index config using an applicable schema built for the provided cluster. 
- */ - protected EarlybirdIndexConfig( - EarlybirdCluster cluster, Decider decider, SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - this(cluster, buildSchema(cluster), decider, searchIndexingMetricSet, - criticalExceptionHandler); - } - - @VisibleForTesting - protected EarlybirdIndexConfig( - EarlybirdCluster cluster, - DynamicSchema schema, - Decider decider, - SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - this.cluster = cluster; - this.schema = schema; - this.decider = decider; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.criticalExceptionHandler = criticalExceptionHandler; - LOG.info("This Earlybird uses index config: " + this.getClass().getSimpleName()); - } - - private static DynamicSchema buildSchema(EarlybirdCluster cluster) { - try { - return EarlybirdSchemaCreateTool.buildSchema(cluster); - } catch (SchemaValidationException e) { - throw new RuntimeException(e); - } - } - - /** - * Creates the appropriate document factory for this earlybird. - */ - public final DocumentFactory createDocumentFactory() { - return new ThriftIndexingEventDocumentFactory( - getSchema(), getCluster(), decider, searchIndexingMetricSet, - criticalExceptionHandler); - } - - /** - * Creates a document factory for ThriftIndexingEvents that are updates to the index. - */ - public final DocumentFactory createUpdateFactory() { - return new ThriftIndexingEventUpdateFactory( - getSchema(), getCluster(), decider, criticalExceptionHandler); - } - - /** - * Return the EarlybirdCluster enum identifying the cluster this config is for. - */ - public final EarlybirdCluster getCluster() { - return cluster; - } - - /** - * Return the default filter for UserUpdatesTable - for the archive cluster keep - * users that belong to the current partition. - */ - public final Predicate getUserTableFilter(PartitionConfig partitionConfig) { - if (EarlybirdCluster.isArchive(getCluster())) { - return UserPartitionUtil.filterUsersByPartitionPredicate(partitionConfig); - } - - return Predicates.alwaysTrue(); - } - - /** - * Creates a new Lucene {@link Directory} to be used for indexing documents. - */ - public abstract Directory newLuceneDirectory(SegmentSyncInfo segmentSyncInfo) throws IOException; - - /** - * Creates a new Lucene IndexWriterConfig that can be used for creating a segment writer for a - * new segment. - */ - public abstract IndexWriterConfig newIndexWriterConfig(); - - /** - * Creates a new SegmentData object to add documents to. - */ - public abstract EarlybirdIndexSegmentData newSegmentData( - int maxSegmentSize, - long timeSliceID, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory); - - /** - * Loads a flushed index for the given segment. - */ - public abstract EarlybirdIndexSegmentData loadSegmentData( - FlushInfo flushInfo, - DataDeserializer dataInputStream, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory) throws IOException; - - /** - * Creates a new segment optimizer for the given segment data. - */ - public abstract EarlybirdIndexSegmentData optimize( - EarlybirdIndexSegmentData earlybirdIndexSegmentData) throws IOException; - - /** - * Whether the index is stored on disk or not. If an index is not on disk, it is presumed to be - * in memory. 
- */ - public abstract boolean isIndexStoredOnDisk(); - - /** - * Whether documents are search in LIFO ordering (RT mode), or default (Lucene) FIFO ordering - */ - public final boolean isUsingLIFODocumentOrdering() { - return !isIndexStoredOnDisk(); - } - - /** - * Whether this index supports out-of-order indexing - */ - public abstract boolean supportOutOfOrderIndexing(); - - /** - * Returns a CloseResourceUtil used for closing resources. - */ - public abstract CloseResourceUtil getResourceCloser(); - - /** - * Returns the schema for this index configuration. - */ - public final DynamicSchema getSchema() { - return schema; - } - - /** - * Returns the decider used by this EarlybirdIndexConfig instance. - */ - public Decider getDecider() { - return decider; - } - - public SearchIndexingMetricSet getSearchIndexingMetricSet() { - return searchIndexingMetricSet; - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdMain.java b/src/java/com/twitter/search/earlybird/EarlybirdMain.java deleted file mode 100644 index 809d1b7c9..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdMain.java +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.search.earlybird; - -public final class EarlybirdMain { - private EarlybirdMain() { - } - - public static void main(String[] args) { - new Earlybird().main(args); - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdProductionFinagleServerManager.java b/src/java/com/twitter/search/earlybird/EarlybirdProductionFinagleServerManager.java deleted file mode 100644 index 3bdfa78a9..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdProductionFinagleServerManager.java +++ /dev/null @@ -1,151 +0,0 @@ -package com.twitter.search.earlybird; - -import java.net.InetSocketAddress; -import java.util.concurrent.atomic.AtomicReference; - -import org.apache.thrift.protocol.TCompactProtocol; -import org.apache.thrift.protocol.TProtocolFactory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.ListeningServer; -import com.twitter.finagle.Service; -import com.twitter.finagle.SslException; -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.mtls.server.MtlsThriftMuxServer; -import com.twitter.finagle.mux.transport.OpportunisticTls; -import com.twitter.finagle.stats.MetricsStatsReceiver; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.finagle.util.ExitGuard; -import com.twitter.finagle.zipkin.thrift.ZipkinTracer; -import com.twitter.search.common.dark.DarkProxy; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.EarlybirdFinagleServerMonitor; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.server.filter.AdmissionControl; -import com.twitter.server.filter.cpuAdmissionControl; -import com.twitter.util.Await; -import com.twitter.util.Duration; -import com.twitter.util.TimeoutException; - -public class EarlybirdProductionFinagleServerManager implements EarlybirdFinagleServerManager { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdProductionFinagleServerManager.class); - - private final AtomicReference warmUpFinagleServer = new AtomicReference<>(); - private final AtomicReference productionFinagleServer = new AtomicReference<>(); - private final EarlybirdFinagleServerMonitor unhandledExceptionMonitor; - - public EarlybirdProductionFinagleServerManager( - 
CriticalExceptionHandler criticalExceptionHandler) { - this.unhandledExceptionMonitor = - new EarlybirdFinagleServerMonitor(criticalExceptionHandler); - } - - @Override - public boolean isWarmUpServerRunning() { - return warmUpFinagleServer.get() != null; - } - - @Override - public void startWarmUpFinagleServer(EarlybirdService.ServiceIface serviceIface, - String serviceName, - int port) { - TProtocolFactory protocolFactory = new TCompactProtocol.Factory(); - startFinagleServer(warmUpFinagleServer, "warmup", - new EarlybirdService.Service(serviceIface, protocolFactory), - protocolFactory, serviceName, port); - } - - @Override - public void stopWarmUpFinagleServer(Duration serverCloseWaitTime) throws InterruptedException { - stopFinagleServer(warmUpFinagleServer, serverCloseWaitTime, "Warm up"); - } - - @Override - public boolean isProductionServerRunning() { - return productionFinagleServer.get() != null; - } - - @Override - public void startProductionFinagleServer(DarkProxy darkProxy, - EarlybirdService.ServiceIface serviceIface, - String serviceName, - int port) { - TProtocolFactory protocolFactory = new TCompactProtocol.Factory(); - startFinagleServer(productionFinagleServer, "production", - darkProxy.toFilter().andThen(new EarlybirdService.Service(serviceIface, protocolFactory)), - protocolFactory, serviceName, port); - } - - @Override - public void stopProductionFinagleServer(Duration serverCloseWaitTime) - throws InterruptedException { - stopFinagleServer(productionFinagleServer, serverCloseWaitTime, "Production"); - } - - private void startFinagleServer(AtomicReference target, String serverDescription, - Service service, TProtocolFactory protocolFactory, String serviceName, - int port) { - target.set(getServer(service, serviceName, port, protocolFactory)); - LOG.info("Started EarlybirdServer " + serverDescription + " finagle server on port " + port); - } - - private ListeningServer getServer( - Service service, String serviceName, int port, - TProtocolFactory protocolFactory) { - MetricsStatsReceiver statsReceiver = new MetricsStatsReceiver(); - ThriftMux.Server server = new MtlsThriftMuxServer(ThriftMux.server()) - .withMutualTls(EarlybirdProperty.getServiceIdentifier()) - .withServiceClass(EarlybirdService.class) - .withOpportunisticTls(OpportunisticTls.Required()) - .withLabel(serviceName) - .withStatsReceiver(statsReceiver) - .withTracer(ZipkinTracer.mk(statsReceiver)) - .withMonitor(unhandledExceptionMonitor) - .withProtocolFactory(protocolFactory); - - if (cpuAdmissionControl.isDefined()) { - LOG.info("cpuAdmissionControl flag is set, replacing AuroraThrottlingAdmissionFilter" - + " with LinuxCpuAdmissionFilter"); - server = server - .configured(AdmissionControl.auroraThrottling().off().mk()) - .configured(AdmissionControl.linuxCpu().useGlobalFlag().mk()); - } - - return server.serve(new InetSocketAddress(port), service); - } - - private void stopFinagleServer(AtomicReference finagleServer, - Duration serverCloseWaitTime, - String serverDescription) throws InterruptedException { - try { - LOG.info("Waiting for " + serverDescription + " finagle server to close. " - + "Current time is " + System.currentTimeMillis()); - Await.result(finagleServer.get().close(), serverCloseWaitTime); - LOG.info("Stopped " + serverDescription + " finagle server. 
Current time is " - + System.currentTimeMillis()); - finagleServer.set(null); - } catch (TimeoutException e) { - LOG.warn(serverDescription + " finagle server did not shutdown cleanly.", e); - } catch (SslException e) { - // Closing the Thrift port seems to throw an SSLException (SSLEngine closed already). - // See SEARCH-29449. Log the exception and reset finagleServer, so that future calls to - // startProductionFinagleServer() succeed. - LOG.warn("Got a SSLException while trying to close the Thrift port.", e); - finagleServer.set(null); - } catch (InterruptedException e) { - // If we catch an InterruptedException here, it means that we're probably shutting down. - // We should propagate this exception, and rely on EarlybirdServer.stopThriftService() - // to do the right thing. - throw e; - } catch (Exception e) { - LOG.error(e.getMessage(), e); - } finally { - // If the finagle server does not close cleanly, this line prints details about - // the ExitGuards. - LOG.info(serverDescription + " server ExitGuard explanation: " + ExitGuard.explainGuards()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdSearcher.java b/src/java/com/twitter/search/earlybird/EarlybirdSearcher.java deleted file mode 100644 index 386c0bcb6..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdSearcher.java +++ /dev/null @@ -1,1918 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.stream.Collectors; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Joiner; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Lists; - -import org.apache.commons.lang.StringUtils; -import org.apache.lucene.index.Term; -import org.apache.lucene.queryparser.classic.ParseException; -import org.apache.lucene.queryparser.classic.QueryParser; -import org.apache.lucene.search.BooleanClause.Occur; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.Query; -import org.apache.thrift.TException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.query.MappableField; -import com.twitter.search.common.query.QueryHitAttributeHelper; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.ranking.thriftjava.ThriftScoringFunctionType; -import com.twitter.search.common.results.thriftjava.FieldHitList; -import com.twitter.search.common.schema.SchemaUtil; -import 
com.twitter.search.common.schema.SearchWhitespaceAnalyzer; -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.search.TwitterEarlyTerminationCollector; -import com.twitter.search.common.search.termination.QueryTimeoutFactory; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.common.util.thrift.ThriftUtils; -import com.twitter.search.core.earlybird.facets.FacetCountState; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.ClientException; -import com.twitter.search.earlybird.exception.TransientException; -import com.twitter.search.earlybird.index.facets.FacetSkipList; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.partition.AudioSpaceTable; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.querycache.QueryCacheConversionRules; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.queryparser.DetectFieldAnnotationVisitor; -import com.twitter.search.earlybird.queryparser.EarlybirdLuceneQueryVisitor; -import com.twitter.search.earlybird.queryparser.HighFrequencyTermPairRewriteVisitor; -import com.twitter.search.earlybird.queryparser.LuceneRelevanceQueryVisitor; -import com.twitter.search.earlybird.queryparser.ProtectedOperatorQueryRewriter; -import com.twitter.search.earlybird.search.AbstractResultsCollector; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.queries.BadUserRepFilter; -import com.twitter.search.earlybird.search.EarlybirdLuceneSearcher; -import com.twitter.search.earlybird.search.EarlybirdMultiSegmentSearcher; -import com.twitter.search.earlybird.search.queries.MatchAllDocsQuery; -import com.twitter.search.earlybird.search.queries.RequiredStatusIDsFilter; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.search.SearchResultsCollector; -import com.twitter.search.earlybird.search.SearchResultsInfo; -import com.twitter.search.earlybird.search.SimpleSearchResults; -import com.twitter.search.earlybird.search.SocialFilter; -import com.twitter.search.earlybird.search.SocialSearchResultsCollector; -import com.twitter.search.earlybird.search.queries.UserFlagsExcludeFilter; -import com.twitter.search.earlybird.search.queries.UserIdMultiSegmentQuery; -import com.twitter.search.earlybird.search.facets.EntityAnnotationCollector; -import com.twitter.search.earlybird.search.facets.ExpandedUrlCollector; -import com.twitter.search.earlybird.search.facets.ExplainFacetResultsCollector; -import com.twitter.search.earlybird.search.facets.FacetRankingModule; -import 
com.twitter.search.earlybird.search.facets.FacetResultsCollector; -import com.twitter.search.earlybird.search.facets.FacetSearchRequestInfo; -import com.twitter.search.earlybird.search.facets.NamedEntityCollector; -import com.twitter.search.earlybird.search.facets.SpaceFacetCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.search.relevance.collectors.AbstractRelevanceCollector; -import com.twitter.search.earlybird.search.relevance.collectors.BatchRelevanceTopCollector; -import com.twitter.search.earlybird.search.relevance.collectors.RelevanceAllCollector; -import com.twitter.search.earlybird.search.relevance.collectors.RelevanceTopCollector; -import com.twitter.search.earlybird.search.relevance.scoring.RelevanceQuery; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunctionProvider; -import com.twitter.search.earlybird.search.relevance.scoring.TensorflowBasedScoringFunction; -import com.twitter.search.earlybird.stats.EarlybirdRPCStats; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.EarlybirdDebugInfo; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldRequest; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftFacetRequest; -import com.twitter.search.earlybird.thrift.ThriftFacetResults; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsRequest; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; -import com.twitter.search.earlybird.util.EarlybirdSearchResultUtil; -import com.twitter.search.queryparser.parser.SerializedQueryParser; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.QueryNodeUtils; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.util.IdTimeRanges; -import com.twitter.search.queryparser.visitors.ConversionVisitor; -import 
com.twitter.search.queryparser.visitors.DetectPositiveOperatorVisitor; -import com.twitter.search.queryparser.visitors.NamedDisjunctionVisitor; -import com.twitter.search.queryparser.visitors.ProximityGroupRewriteVisitor; -import com.twitter.search.queryparser.visitors.StripAnnotationsVisitor; - -import static com.twitter.search.queryparser.query.search.SearchOperator.Type.UNTIL_TIME; - -/** - * This class provides the basic search() method: - * - converts the thrift request object into what lucene expects. - * - gets the segment. - * - handles all errors, and prepares the response in case of error. - * - * We have one instance of this class per search received. - */ -public class EarlybirdSearcher { - public enum QueryMode { - // Please think before adding more query modes: can this be implemented in a general way? - RECENCY(new EarlybirdRPCStats("search_recency")), - FACETS(new EarlybirdRPCStats("search_facets")), - TERM_STATS(new EarlybirdRPCStats("search_termstats")), - RELEVANCE(new EarlybirdRPCStats("search_relevance")), - TOP_TWEETS(new EarlybirdRPCStats("search_toptweets")); - - private final EarlybirdRPCStats requestStats; - - QueryMode(EarlybirdRPCStats requestStats) { - this.requestStats = requestStats; - } - - public EarlybirdRPCStats getRequestStats() { - return requestStats; - } - } - - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdSearcher.class); - private static final String MATCH_ALL_SERIALIZED_QUERY = "(* )"; - /** - * generic field annotations can be mapped to a concrete field in the index using this mapping - * via {@link com.twitter.search.queryparser.query.annotation.Annotation.Type#MAPPABLE_FIELD} - */ - private static final Map MAPPABLE_FIELD_MAP = - ImmutableMap.of( - MappableField.URL, - EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName()); - - private static final String ALLOW_QUERY_SPECIFIC_SIGNAL_DECIDER_KEY - = "allow_query_specific_score_adjustments"; - - @VisibleForTesting - public static final String ALLOW_AUTHOR_SPECIFIC_SIGNAL_DECIDER_KEY - = "allow_author_specific_score_adjustments"; - - private static final String USE_MULTI_TERM_DISJUNCTION_FOR_LIKED_BY_USER_IDS_DECIDER_KEY - = "use_multi_term_disjunction_for_liked_by_user_ids"; - - private static final String ALLOW_CAMELCASE_USERNAME_FIELD_WEIGHT_OVERRIDE_DECIDER_KEY_PREFIX - = "allow_camelcase_username_field_weight_override_in_"; - - private static final String ALLOW_TOKENIZED_DISPLAY_NAME_FIELD_WEIGHT_OVERRIDE_DECIDER_KEY_PREFIX - = "allow_tokenized_display_name_field_weight_override_in_"; - - private static final boolean ALLOW_QUERY_SPECIFIC_SIGNAL_CONFIG - = EarlybirdConfig.getBool("allow_query_specific_score_adjustments", false); - - private static final boolean ALLOW_AUTHOR_SPECIFIC_SIGNAL_CONFIG - = EarlybirdConfig.getBool("allow_author_specific_score_adjustments", false); - - public static final int DEFAULT_NUM_FACET_RESULTS = 100; - - private final ImmutableSchemaInterface schemaSnapshot; - private final EarlybirdCluster cluster; - - private final Clock clock; - private final Decider decider; - - // The actual request thrift. - private final EarlybirdRequest request; - - // searchQuery from inside the request. - private final ThriftSearchQuery searchQuery; - - // CollectorParams from inside the searchQuery; - private final CollectorParams collectorParams; - - // Parsed query (parsed from serialized query string in request). 
- private com.twitter.search.queryparser.query.Query parsedQuery; - private boolean parsedQueryAllowNullcast; - private IdTimeRanges idTimeRanges; - - // Lucene version of the above. This is what we will actually be executing. - private org.apache.lucene.search.Query luceneQuery; - - // Used for queries where we want to collect per-field hit attribution - @Nullable - private QueryHitAttributeHelper hitAttributeHelper; - - // Debugging info can be appended to this buffer. - private final StringBuilder messageBuffer = new StringBuilder(1024); - private final EarlybirdDebugInfo debugInfo = new EarlybirdDebugInfo(); - - // The segment we are searching, or null for the multi-searcher. - private Segment segment = null; - - // True iff we are searching all segments (multi-searcher). - private final boolean searchAllSegments; - - // Tracking termination criteria for this query - private final TerminationTracker terminationTracker; - - private EarlybirdLuceneSearcher searcher = null; - - private final SegmentManager segmentManager; - private final QueryCacheManager queryCacheManager; - private final ScoringModelsManager scoringModelsManager; - private final TensorflowModelsManager tensorflowModelsManager; - - private AntiGamingFilter antiGamingFilter = null; - - private final boolean searchHighFrequencyTermPairs = - EarlybirdConfig.getBool("search_high_frequency_term_pairs", false); - - // How long to allow post-termination when enforcing query timeout - private final int enforceQueryTimeoutBufferMillis = - EarlybirdConfig.getInt("enforce_query_timeout_buffer_millis", 50); - - private EarlybirdRPCStats requestStats; - - private QueryTimeoutFactory queryTimeoutFactory; - - // Exported stats - private final EarlybirdSearcherStats searcherStats; - - @VisibleForTesting - public static final SearchCounter FIELD_WEIGHT_OVERRIDE_MAP_NON_NULL_COUNT = - SearchCounter.export("field_weight_override_map_non_null_count"); - @VisibleForTesting - public static final SearchCounter DROPPED_CAMELCASE_USERNAME_FIELD_WEIGHT_OVERRIDE = - SearchCounter.export("dropped_camelcase_username_field_weight_override"); - @VisibleForTesting - public static final SearchCounter DROPPED_TOKENIZED_DISPLAY_NAME_FIELD_WEIGHT_OVERRIDE = - SearchCounter.export("dropped_tokenized_display_name_field_weight_override"); - - private static final SearchCounter RESPONSE_HAS_NO_THRIFT_SEARCH_RESULTS = - SearchCounter.export("tweets_earlybird_searcher_response_has_no_thrift_search_results"); - private static final SearchCounter CLIENT_HAS_FEATURE_SCHEMA_COUNTER = - SearchCounter.export("tweets_earlybird_searcher_client_has_feature_schema"); - private static final SearchCounter CLIENT_DOESNT_HAVE_FEATURE_SCHEMA_COUNTER = - SearchCounter.export("tweet_earlybird_searcher_client_doesnt_have_feature_schema"); - private static final SearchCounter COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS_NOT_SET_COUNTER = - SearchCounter.export("collector_params_max_hits_to_process_not_set"); - private static final SearchCounter POSITIVE_PROTECTED_OPERATOR_DETECTED_COUNTER = - SearchCounter.export("positive_protected_operator_detected_counter"); - - // Query mode we are executing. - private final QueryMode queryMode; - - // facetRequest from inside the request (or null). - private final ThriftFacetRequest facetRequest; - - // termStatisticsRequest from inside the request (or null). - private final ThriftTermStatisticsRequest termStatisticsRequest; - - // Results fields filled in during searchInternal(). 
- private ThriftSearchResults searchResults = null; - private ThriftFacetResults facetResults = null; - private ThriftTermStatisticsResults termStatisticsResults = null; - private EarlyTerminationInfo earlyTerminationInfo = null; - - // Partition config used to fill in debugging info. - // If null, no debug info is written into results. - @Nullable - private final PartitionConfig partitionConfig; - - private final MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - - private final QualityFactor qualityFactor; - - private Set queriedFields; - private final AudioSpaceTable audioSpaceTable; - - public EarlybirdSearcher( - EarlybirdRequest request, - SegmentManager segmentManager, - AudioSpaceTable audioSpaceTable, - QueryCacheManager queryCacheManager, - ImmutableSchemaInterface schema, - EarlybirdCluster cluster, - @Nullable PartitionConfig partitionConfig, - Decider decider, - EarlybirdSearcherStats searcherStats, - ScoringModelsManager scoringModelsManager, - TensorflowModelsManager tensorflowModelsManager, - Clock clock, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - QueryTimeoutFactory queryTimeoutFactory, - QualityFactor qualityFactor) { - this.queryMode = getQueryMode(request); - this.schemaSnapshot = schema.getSchemaSnapshot(); - // set the request stats as early as possible, so that we can track errors that happen - // early on in query processing. - this.requestStats = queryMode.getRequestStats(); - this.facetRequest = request.isSetFacetRequest() ? request.getFacetRequest() : null; - this.termStatisticsRequest = request.isSetTermStatisticsRequest() - ? request.getTermStatisticsRequest() : null; - this.partitionConfig = partitionConfig; - this.searcherStats = searcherStats; - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.clock = clock; - this.decider = decider; - this.request = request; - this.segmentManager = segmentManager; - this.queryCacheManager = queryCacheManager; - this.cluster = cluster; - this.scoringModelsManager = scoringModelsManager; - this.tensorflowModelsManager = tensorflowModelsManager; - this.audioSpaceTable = audioSpaceTable; - // Note: we're deferring the validation/nullchecks until validateRequest() - // for more contained exception handling - this.searchQuery = request.getSearchQuery(); - this.collectorParams = this.searchQuery == null ? null : this.searchQuery.getCollectorParams(); - // Search all segments if searchSegmentId is unset. - this.searchAllSegments = !request.isSetSearchSegmentId(); - if (this.collectorParams == null - || !this.collectorParams.isSetTerminationParams()) { - this.terminationTracker = new TerminationTracker(clock); - } else if (request.isSetClientRequestTimeMs()) { - this.terminationTracker = new TerminationTracker(collectorParams.getTerminationParams(), - request.getClientRequestTimeMs(), clock, - getPostTerminationOverheadMillis(collectorParams.getTerminationParams())); - } else { - this.terminationTracker = new TerminationTracker( - collectorParams.getTerminationParams(), clock, - getPostTerminationOverheadMillis(collectorParams.getTerminationParams())); - } - this.queryTimeoutFactory = queryTimeoutFactory; - this.qualityFactor = qualityFactor; - } - - private int getPostTerminationOverheadMillis(CollectorTerminationParams terminationParams) { - // If enforcing timeouts, set the post-termination buffer to the smaller of the timeout or the - // configured buffer. 
This ensures that timeout >= buffer, and a request with a smaller timeout - // should just time out immediately (because timeout == buffer). - return (terminationParams.isEnforceQueryTimeout() && terminationParams.getTimeoutMs() > 0) - ? Math.min(enforceQueryTimeoutBufferMillis, terminationParams.getTimeoutMs()) : 0; - } - - // Appends a debug string to the buffer. - private void appendMessage(String message) { - messageBuffer.append(message).append("\n"); - } - - /** - * Processes an Earlybird search request. - * @return the earlybird response for this search request. - */ - public EarlybirdResponse search() { - try { - debugInfo.setHost(DatabaseConfig.getLocalHostname()); - - // Throws transient exception for invalid requests. - validateRequest(); - - // Throws client exception for bad queries, - parseEarlybirdRequest(); - - // Modify the Lucene query if necessary. - luceneQuery = postLuceneQueryProcess(luceneQuery); - - // Might return PARTITION_NOT_FOUND or PARTITION_DISABLED. - EarlybirdResponseCode code = initSearcher(); - if (code != EarlybirdResponseCode.SUCCESS) { - return respondError(code); - } - - return searchInternal(); - - } catch (TransientException e) { - LOG.error(String.format("Transient exception in search() for EarlybirdRequest:\n%s", request), - e); - appendMessage(e.getMessage()); - return respondError(EarlybirdResponseCode.TRANSIENT_ERROR); - } catch (ClientException e) { - LOG.warn(String.format("Client exception in search() %s for EarlybirdRequest:\n %s", - e, request)); - appendMessage(e.getMessage()); - return respondError(EarlybirdResponseCode.CLIENT_ERROR); - } catch (Exception e) { - LOG.warn(String.format("Uncaught exception in search() for EarlybirdRequest:\n%s", request), - e); - appendMessage(e.getMessage()); - return respondError(EarlybirdResponseCode.TRANSIENT_ERROR); - } catch (AssertionError e) { - LOG.warn(String.format("Assertion error in search() for EarlybirdRequest:\n%s", request), e); - appendMessage(e.getMessage()); - return respondError(EarlybirdResponseCode.TRANSIENT_ERROR); - } catch (Error e) { - // SEARCH-33166: If we got here, it means what was thrown was not an Exception, or anything - // we know how to handle. Log the Error for diagnostic purposes and propagate it. - LOG.error("Re-throwing uncaught error", e); - throw e; - } - } - - public EarlybirdRPCStats getRequestStats() { - return requestStats; - } - - /** - * Wraps the given query with the provided filter queries. - * - * @param query the query to wrap with filters. - * @param filters the filters to wrap the query with. - * @return a BooleanQuery wrapped with filters - */ - public static Query wrapFilters(Query query, Query... filters) { - boolean filtersEmpty = filters == null || filters.length == 0; - - if (!filtersEmpty) { - filtersEmpty = true; - for (Query f : filters) { - if (f != null) { - filtersEmpty = false; - break; - } - } - } - - if (filtersEmpty) { - if (query == null) { - return new MatchAllDocsQuery(); - } else { - return query; - } - } - - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - if (query != null) { - bqBuilder.add(query, Occur.MUST); - } - for (Query f : filters) { - if (f != null) { - bqBuilder.add(f, Occur.FILTER); - } - } - return bqBuilder.build(); - } - - // Examine all fields in the request for sanity. - private void validateRequest() throws TransientException, ClientException { - // First try thrift's internal validate. Should always succeed. 
- try { - request.validate(); - } catch (TException e) { - throw new TransientException(e.getMessage(), e); - } - - if (searchQuery == null) { - throw new TransientException("No ThriftSearchQuery specified"); - } - - if (collectorParams == null) { - throw new TransientException("No CollectorParams specified"); - } - - validateTermStatsRequest(); - - if (!searchAllSegments) { - if (request.getSearchSegmentId() <= 0) { - String msg = "Bad time slice ID: " + request.getSearchSegmentId(); - throw new TransientException(msg); - } - - // Initialize the segment. - SegmentInfo segmentInfo = this.segmentManager.getSegmentInfo(request.getSearchSegmentId()); - segment = segmentInfo != null ? segmentInfo.getSegment() : null; - } - - if (collectorParams.getNumResultsToReturn() < 0) { - String msg = "Invalid numResults: " + collectorParams.getNumResultsToReturn(); - throw new TransientException(msg); - } - - if (searchQuery.getNamedDisjunctionMapSize() > 0 && searchQuery.isSetLuceneQuery()) { - throw new ClientException("namedMultiTermDisjunctionMap does not support with luceneQuery"); - } - } - - private void validateTermStatsRequest() throws ClientException { - // Validate the field names and values for all ThriftTermRequests. - if (request.isSetTermStatisticsRequest() - && request.getTermStatisticsRequest().isSetTermRequests()) { - for (ThriftTermRequest termRequest : request.getTermStatisticsRequest().getTermRequests()) { - // If termRequest.fieldName is not set, it defaults to 'text', which is a string field, - // so we don't need to check the term. - if (termRequest.isSetFieldName()) { - String fieldName = termRequest.getFieldName(); - Schema.FieldInfo facetFieldInfo = schemaSnapshot.getFacetFieldByFacetName(fieldName); - if (facetFieldInfo != null) { - // Facet fields are string fields, so we don't need to check the term. - continue; - } - - Schema.FieldInfo fieldInfo = schemaSnapshot.getFieldInfo(fieldName); - if (fieldInfo == null) { - throw new ClientException("Field " + fieldName + " is not present in the schema."); - } - - try { - SchemaUtil.toBytesRef(fieldInfo, termRequest.getTerm()); - } catch (UnsupportedOperationException e) { - throw new ClientException("Term " + termRequest.getTerm() + " is not compatible with " - + "the type of field " + fieldName); - } - } - } - } - } - - private void setQueriesInDebugInfo( - com.twitter.search.queryparser.query.Query parsedQ, - org.apache.lucene.search.Query luceneQ) { - debugInfo.setParsedQuery(parsedQ == null ? null : parsedQ.serialize()); - debugInfo.setLuceneQuery(luceneQ == null ? null : luceneQ.toString()); - } - - /** - * Takes the EarlybirdRequest that came into the service and after various parsing and processing - * steps ultimately produces a Lucene query. - */ - private void parseEarlybirdRequest() throws ClientException { - SerializedQueryParser parser = new SerializedQueryParser(EarlybirdConfig.getPenguinVersion()); - - try { - // if the deprecated iterativeQueries field is set, return an error to the client - // indicating that support for it has been removed. 
- if (searchQuery.isSetDeprecated_iterativeQueries()) { - throw new ClientException("Invalid request: iterativeQueries feature has been removed"); - } - - // we parse the actual query from the user, if any - luceneQuery = null; - parsedQuery = null; // this will be set by parseQueryHelper() - - if (searchQuery.getLikedByUserIDFilter64Size() > 0 - && searchQuery.isSetLuceneQuery()) { - throw new ClientException("likedByUserIDFilter64 does not support with luceneQuery"); - } - - if (!StringUtils.isBlank(request.getSearchQuery().getSerializedQuery())) { - searcherStats.thriftQueryWithSerializedQuery.increment(); - luceneQuery = parseSerializedQuery(searchQuery.getSerializedQuery(), parser, true); - } else if (!StringUtils.isBlank(request.getSearchQuery().getLuceneQuery())) { - searcherStats.thriftQueryWithLuceneQuery.increment(); - luceneQuery = parseLuceneQuery(searchQuery.getLuceneQuery()); - LOG.info("lucene query: {}", searchQuery.getLuceneQuery()); - if (luceneQuery != null) { - LOG.info("Using lucene query directly from the request: " + luceneQuery.toString()); - } - } else { - searcherStats.thriftQueryWithoutTextQuery.increment(); - luceneQuery = parseSerializedQuery( - MATCH_ALL_SERIALIZED_QUERY, - parser, - queryMode != QueryMode.TERM_STATS); - } - } catch (QueryParserException | BooleanQuery.TooManyClauses e) { - LOG.info("Exception parsing query during search", e); - appendMessage(e.getMessage()); - throw new ClientException(e); - } - } - - /** - * Parses a serialized query and creates a Lucene query out of it. - * - * To see how serialized queries look like, go to go/searchsyntax. - */ - private Query parseSerializedQuery( - String serializedQuery, - SerializedQueryParser parser, - boolean shouldAdjustQueryBasedOnRequestParameters) throws QueryParserException { - // Parse the serialized query. - parsedQuery = parser.parse(serializedQuery); - if (parsedQuery == null) { - return null; - } - - // rewrite query if positive 'protected' operator is detected - if (parsedQuery.accept(new DetectPositiveOperatorVisitor(SearchOperatorConstants.PROTECTED))) { - POSITIVE_PROTECTED_OPERATOR_DETECTED_COUNTER.increment(); - ProtectedOperatorQueryRewriter rewriter = new ProtectedOperatorQueryRewriter(); - parsedQuery = rewriter.rewrite( - parsedQuery, - request.followedUserIds, - segmentManager.getUserTable()); - } - - ThriftSearchRelevanceOptions options = searchQuery.getRelevanceOptions(); - if (shouldAdjustQueryBasedOnRequestParameters) { - // If likedByUserIDFilter64 is set, combine it with query - // Note: we deal with likedByUserIDFilter64 here instead of in postLuceneQueryProcess as we - // want annotate query with ranks. - if (searchQuery.isSetLikedByUserIDFilter64() - && searchQuery.getLikedByUserIDFilter64Size() > 0) { - parsedQuery = combineWithLikedByUserIdFilter64( - parsedQuery, searchQuery.getLikedByUserIDFilter64()); - } - - // If namedListMap field is set, replace the named lists in the serialized query. - if (searchQuery.getNamedDisjunctionMapSize() > 0) { - parsedQuery = parsedQuery.accept( - new NamedDisjunctionVisitor(searchQuery.getNamedDisjunctionMap())); - } - - if (searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isCollectFieldHitAttributions()) { - // NOTE: Before we do any modifications to the serialized query tree, annotate the query - // nodes with their node rank in the original query. 
- this.hitAttributeHelper = - QueryHitAttributeHelper.from(parsedQuery, schemaSnapshot); - parsedQuery = hitAttributeHelper.getAnnotatedQuery(); - } - - // Currently antisocial/nullcast tweets are dropped when we build index, but some tweets may - // become antisocial with realtime updates. For consistency, we should always filter out - // antisocial/nullcast tweets if the user is not explicitly including it. - final boolean allowAntisocial = - parsedQuery.accept(new DetectPositiveOperatorVisitor(SearchOperatorConstants.ANTISOCIAL)); - if (!allowAntisocial) { - parsedQuery = QueryNodeUtils.appendAsConjunction( - parsedQuery, - QueryCacheConversionRules.CACHED_EXCLUDE_ANTISOCIAL); - } - parsedQueryAllowNullcast = - parsedQuery.accept(new DetectPositiveOperatorVisitor(SearchOperatorConstants.NULLCAST)); - if (!parsedQueryAllowNullcast) { - parsedQuery = QueryNodeUtils.appendAsConjunction( - parsedQuery, new SearchOperator("filter", SearchOperatorConstants.NULLCAST).negate()); - } - - // Strip all annotations from the filters that will be converted to query cache filters. - // See SEARCH-15552. - parsedQuery = parsedQuery.accept( - new StripAnnotationsVisitor(QueryCacheConversionRules.STRIP_ANNOTATIONS_QUERIES)); - - // Convert certain filters into cached filters, also consolidate them. - parsedQuery = parsedQuery.accept( - new ConversionVisitor(QueryCacheConversionRules.DEFAULT_RULES)); - - // add proximity if needed - if (options != null - && options.isProximityScoring() - && searchQuery.getRankingMode() != ThriftSearchRankingMode.RECENCY) { - parsedQuery = parsedQuery.accept(new ProximityGroupRewriteVisitor()).simplify(); - } - } - - if (request.isSkipVeryRecentTweets()) { - parsedQuery = restrictQueryToFullyIndexedTweets(parsedQuery); - } - - parsedQuery = parsedQuery.simplify(); - debugInfo.setParsedQuery(parsedQuery.serialize()); - - // Extract top-level since-id for pagination optimizations. - idTimeRanges = IdTimeRanges.fromQuery(parsedQuery); - - // Does any final processing specific to EarlybirdSearch class. - parsedQuery = preLuceneQueryProcess(parsedQuery); - - // Convert to a lucene query. - EarlybirdLuceneQueryVisitor luceneVisitor = getLuceneVisitor( - options == null ? null : options.getFieldWeightMapOverride()); - - if (options != null) { - luceneVisitor - .setProximityPhraseWeight((float) options.getProximityPhraseWeight()) - .setProximityPhraseSlop(options.getProximityPhraseSlop()); - } - - // Propagate hit attribute helper to the lucene visitor if it has been setup. 
- luceneVisitor.setFieldHitAttributeHelper(this.hitAttributeHelper); - - org.apache.lucene.search.Query query = parsedQuery.accept(luceneVisitor); - if (query != null) { - debugInfo.setLuceneQuery(query.toString()); - } - - queriedFields = luceneVisitor.getQueriedFields(); - - return query; - } - - private Query parseLuceneQuery(String query) { - QueryParser parser = new QueryParser( - EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - new SearchWhitespaceAnalyzer()); - parser.setSplitOnWhitespace(true); - try { - return parser.parse(query); - } catch (ParseException e) { - LOG.error("Cannot parse raw lucene query: " + query, e); - } catch (NullPointerException e) { - LOG.error("NullPointerException while parsing raw lucene query: " + query - + ", probably your grammar is wrong.\n", e); - } - return null; - } - - private com.twitter.search.queryparser.query.Query combineWithLikedByUserIdFilter64( - com.twitter.search.queryparser.query.Query query, - List ids) throws QueryParserException { - return QueryNodeUtils.appendAsConjunction(query, getLikedByUserIdQuery(ids)); - } - - /** - * initSearcher initializes the segmentSearcher, and returns SUCCESS if OK - * or some other response code if not OK. - */ - private EarlybirdResponseCode initSearcher() throws IOException { - searcher = null; - if (searchAllSegments) { - return initMultiSegmentSearcher(); - } else { - return initSingleSegmentSearcher(); - } - } - - private EarlybirdResponseCode initSingleSegmentSearcher() throws IOException { - if (segment == null) { - String message = "Segment not found for time slice: " + request.getSearchSegmentId(); - LOG.warn(message); - appendMessage(message); - return EarlybirdResponseCode.PARTITION_NOT_FOUND; - } - - EarlybirdResponseCode code = this.segmentManager.checkSegment(segment); - if (code != EarlybirdResponseCode.SUCCESS) { - String message = "Segment " + segment + " either disabled or dropped"; - LOG.warn(message); - appendMessage(message); - return code; - } - - searcher = segmentManager.getSearcher(segment, schemaSnapshot); - if (searcher == null) { - String message = "Could not construct searcher for segment " + segment; - LOG.error(message); - appendMessage(message); - return EarlybirdResponseCode.PERSISTENT_ERROR; - } else { - appendMessage("Searching segment: " + segment); - return EarlybirdResponseCode.SUCCESS; - } - } - - private EarlybirdResponseCode initMultiSegmentSearcher() throws IOException { - EarlybirdMultiSegmentSearcher multiSearcher = - segmentManager.getMultiSearcher(schemaSnapshot); - searcher = multiSearcher; - Preconditions.checkNotNull(searcher); - - // Set a top level since id to skip entire segments when possible. 
- multiSearcher.setIdTimeRanges(idTimeRanges); - return EarlybirdResponseCode.SUCCESS; - } - - private com.twitter.search.queryparser.query.Query - restrictQueryToFullyIndexedTweets(com.twitter.search.queryparser.query.Query query) { - long untilTimeSeconds = - RecentTweetRestriction.recentTweetsUntilTime(decider, (int) (clock.nowMillis() / 1000)); - if (untilTimeSeconds == 0) { - return query; - } - - SearchOperator timeLimit = new SearchOperator(UNTIL_TIME, untilTimeSeconds); - return new Conjunction(query, timeLimit); - } - - private EarlybirdResponse newResponse(EarlybirdResponseCode code, boolean setDebugInfo) { - EarlybirdResponse response = new EarlybirdResponse(); - response.setResponseCode(code); - if (setDebugInfo) { - response.setDebugInfo(debugInfo); - if (messageBuffer.length() > 0) { - response.setDebugString(DatabaseConfig.getLocalHostname() - + ":\n" + messageBuffer.toString()); - } - } - return response; - } - - private EarlybirdResponse respondError(EarlybirdResponseCode code) { - appendMessage("Responding with error code " + code); - // Always respond with an error message, even when request.debug is false - return newResponse(code, true); - } - - @VisibleForTesting - public TerminationTracker getTerminationTracker() { - return terminationTracker; - } - - public void maybeSetCollectorDebugInfo(TwitterEarlyTerminationCollector collector) { - if (request.isSetDebugOptions() && request.getDebugOptions().isIncludeCollectorDebugInfo()) { - debugInfo.setCollectorDebugInfo(collector.getDebugInfo()); - } - } - - public void setTermStatisticsDebugInfo(List termStatisticsDebugInfo) { - debugInfo.setTermStatisticsDebugInfo(termStatisticsDebugInfo); - } - - private EarlybirdResponse searchInternal() throws TransientException, ClientException { - searchResults = new ThriftSearchResults(); - - SearchResultsInfo searchResultsInfo; - try { - switch (queryMode) { - case RECENCY: - searchResultsInfo = processRealtimeQuery(); - break; - case RELEVANCE: - // Relevance search and Model-based search differ only on the scoring function used. - SearchTimer timer = searcherStats.createTimer(); - timer.start(); - searchResultsInfo = processRelevanceQuery(); - timer.stop(); - searcherStats.recordRelevanceStats(timer, request); - break; - case FACETS: - searchResultsInfo = processFacetsQuery(); - break; - case TERM_STATS: - searchResultsInfo = processTermStatsQuery(); - break; - case TOP_TWEETS: - searchResultsInfo = processTopTweetsQuery(); - break; - default: - throw new TransientException("Unknown query mode " + queryMode); - } - - return respondSuccess(searchResults, facetResults, termStatisticsResults, - earlyTerminationInfo, searchResultsInfo); - } catch (IOException e) { - throw new TransientException(e.getMessage(), e); - } - } - - /** - * Helper method to process facets query. - */ - private SearchResultsInfo processFacetsQuery() throws ClientException, IOException { - // figure out which fields we need to count - FacetCountState facetCountState = newFacetCountState(); - - // Additionally wrap our query into a skip list boolean query for faster counting. - if (!facetRequest.isUsingQueryCache()) { - // Only if all fields to be counted use skip lists, then we can add a required clause - // that filters out all results that do not contain those fields - boolean cannotAddRequiredClause = facetCountState.hasFieldToCountWithoutSkipList(); - final Query facetSkipListFilter = - cannotAddRequiredClause ? 
null : FacetSkipList.getSkipListQuery(facetCountState); - final Query antisocialFilter = UserFlagsExcludeFilter.getUserFlagsExcludeFilter( - segmentManager.getUserTable(), true, true, false); - luceneQuery = wrapFilters(luceneQuery, - facetSkipListFilter, - antisocialFilter); - } - - facetResults = new ThriftFacetResults(new HashMap<>()); - - FacetSearchRequestInfo searchRequestInfo = - new FacetSearchRequestInfo(searchQuery, facetRequest.getFacetRankingOptions(), - luceneQuery, facetCountState, terminationTracker); - searchRequestInfo.setIdTimeRanges(idTimeRanges); - if (searchQuery.getMaxHitsPerUser() > 0) { - antiGamingFilter = new AntiGamingFilter( - searchQuery.getMaxHitsPerUser(), - searchQuery.getMaxTweepcredForAntiGaming(), - luceneQuery); - } - - AbstractResultsCollector< - FacetSearchRequestInfo, EarlybirdLuceneSearcher.FacetSearchResults> collector; - if (request.getDebugMode() > 2) { - collector = new ExplainFacetResultsCollector(schemaSnapshot, - searchRequestInfo, antiGamingFilter, searcherStats, clock, request.debugMode); - } else { - collector = new FacetResultsCollector(schemaSnapshot, - searchRequestInfo, antiGamingFilter, searcherStats, clock, request.debugMode); - } - - setQueriesInDebugInfo(parsedQuery, searchRequestInfo.getLuceneQuery()); - searcher.search(searchRequestInfo.getLuceneQuery(), collector); - EarlybirdLuceneSearcher.FacetSearchResults hits = collector.getResults(); - - EarlybirdSearchResultUtil.setResultStatistics(searchResults, hits); - earlyTerminationInfo = EarlybirdSearchResultUtil.prepareEarlyTerminationInfo(hits); - Set userIDWhitelist = - antiGamingFilter != null ? antiGamingFilter.getUserIDWhitelist() : null; - prepareFacetResults(facetResults, hits, facetCountState, userIDWhitelist, - request.getDebugMode()); - facetResults.setUserIDWhitelist(userIDWhitelist); - - maybeSetCollectorDebugInfo(collector); - - if (collector instanceof ExplainFacetResultsCollector) { - ((ExplainFacetResultsCollector) collector).setExplanations(facetResults); - } - - return hits; - } - - /** - * Helper method to process term-stats query. - */ - private SearchResultsInfo processTermStatsQuery() throws IOException { - // first extract the terms that we need to count - TermStatisticsRequestInfo searchRequestInfo = - new TermStatisticsRequestInfo(searchQuery, luceneQuery, termStatisticsRequest, - terminationTracker); - searchRequestInfo.setIdTimeRanges(idTimeRanges); - setQueriesInDebugInfo(parsedQuery, searchRequestInfo.getLuceneQuery()); - TermStatisticsCollector.TermStatisticsSearchResults hits = - searcher.collectTermStatistics(searchRequestInfo, this, request.getDebugMode()); - EarlybirdSearchResultUtil.setResultStatistics(searchResults, hits); - earlyTerminationInfo = EarlybirdSearchResultUtil.prepareEarlyTerminationInfo(hits); - if (hits.results != null) { - termStatisticsResults = new ThriftTermStatisticsResults(); - prepareTermStatisticsResults(termStatisticsResults, hits, request.getDebugMode()); - } - - return hits; - } - - /** - * Helper method to process realtime query. - */ - private SearchResultsInfo processRealtimeQuery() throws IOException, ClientException { - // Disable maxHitsToProcess. 
- if (!collectorParams.isSetTerminationParams()) { - collectorParams.setTerminationParams(new CollectorTerminationParams()); - collectorParams.getTerminationParams().setMaxHitsToProcess(-1); - COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS_NOT_SET_COUNTER.increment(); - } - - SearchRequestInfo searchRequestInfo = new SearchRequestInfo( - searchQuery, luceneQuery, terminationTracker); - searchRequestInfo.setIdTimeRanges(idTimeRanges); - searchRequestInfo.setHitAttributeHelper(hitAttributeHelper); - searchRequestInfo.setTimestamp(getQueryTimestamp(searchQuery)); - - AbstractResultsCollector collector; - if (searchQuery.isSetSocialFilterType()) { - if (!searchRequestInfo.getSearchQuery().isSetDirectFollowFilter() - || !searchRequestInfo.getSearchQuery().isSetTrustedFilter()) { - searcherStats.unsetFiltersForSocialFilterTypeQuery.increment(); - throw new ClientException( - "SocialFilterType specified without a TrustedFilter or DirectFollowFilter"); - } - SocialFilter socialFilter = new SocialFilter( - searchQuery.getSocialFilterType(), - searchRequestInfo.getSearchQuery().getSearcherId(), - searchRequestInfo.getSearchQuery().getTrustedFilter(), - searchRequestInfo.getSearchQuery().getDirectFollowFilter()); - collector = new SocialSearchResultsCollector( - schemaSnapshot, - searchRequestInfo, - socialFilter, - searcherStats, - cluster, - segmentManager.getUserTable(), - request.getDebugMode()); - } else { - collector = new SearchResultsCollector( - schemaSnapshot, - searchRequestInfo, - clock, - searcherStats, - cluster, - segmentManager.getUserTable(), - request.getDebugMode()); - } - - setQueriesInDebugInfo(parsedQuery, luceneQuery); - searcher.search(luceneQuery, collector); - - SimpleSearchResults hits = collector.getResults(); - - EarlybirdSearchResultUtil.setResultStatistics(searchResults, hits); - earlyTerminationInfo = EarlybirdSearchResultUtil.prepareEarlyTerminationInfo(hits); - EarlybirdSearchResultUtil.prepareResultsArray( - searchResults.getResults(), hits, request.debugMode > 0 ? partitionConfig : null); - searchResults.setHitCounts(collector.getHitCountMap()); - - maybeSetCollectorDebugInfo(collector); - - addResultPayloads(); - - return hits; - } - - /** - * Helper method to process relevance query. - */ - private SearchResultsInfo processRelevanceQuery() throws IOException, ClientException { - if (!searchQuery.isSetRelevanceOptions()) { - LOG.warn("Relevance query with no relevance options!"); - searchQuery.setRelevanceOptions(new ThriftSearchRelevanceOptions()); - } - - // Note: today the assumption is that if you specify hasSpecifiedTweets, - // you really do want all tweets scored and returned. - final boolean hasSpecifiedTweets = searchQuery.getSearchStatusIdsSize() > 0; - if (hasSpecifiedTweets) { - collectorParams.setNumResultsToReturn(searchQuery.getSearchStatusIdsSize()); - } - // If we have explicit user ids, we will want to look at all results from those users, and will - // not need to use the AntiGamingFilter. - final boolean hasSpecifiedFromUserIds = searchQuery.getFromUserIDFilter64Size() > 0; - - createRelevanceAntiGamingFilter(hasSpecifiedTweets, hasSpecifiedFromUserIds); - - if (searchQuery.getRelevanceOptions().isSetRankingParams()) { - ThriftRankingParams rankingParams = searchQuery.getRelevanceOptions().getRankingParams(); - - // The score adjustment signals that are passed in the request are disabled for the archive - // cluster or when the features are decidered off. 
If the request provides those fields, - // we unset them since checking the hashmap when scoring can cause a slight bump in - // latency. - // - // Verify that the signal query specific scores for tweets signal is enabled - if (rankingParams.isSetQuerySpecificScoreAdjustments()) { - if (ALLOW_QUERY_SPECIFIC_SIGNAL_CONFIG - && DeciderUtil.isAvailableForRandomRecipient( - decider, ALLOW_QUERY_SPECIFIC_SIGNAL_DECIDER_KEY)) { - searcherStats.querySpecificSignalQueriesUsed.increment(); - searcherStats.querySpecificSignalMapTotalSize.add( - rankingParams.getQuerySpecificScoreAdjustmentsSize()); - } else { - searchQuery.getRelevanceOptions().getRankingParams().unsetQuerySpecificScoreAdjustments(); - searcherStats.querySpecificSignalQueriesErased.increment(); - } - } - - // Verify that the signal author specific scores signal is enabled - if (rankingParams.isSetAuthorSpecificScoreAdjustments()) { - if (ALLOW_AUTHOR_SPECIFIC_SIGNAL_CONFIG - && DeciderUtil.isAvailableForRandomRecipient( - decider, ALLOW_AUTHOR_SPECIFIC_SIGNAL_DECIDER_KEY)) { - searcherStats.authorSpecificSignalQueriesUsed.increment(); - searcherStats.authorSpecificSignalMapTotalSize.add( - rankingParams.getAuthorSpecificScoreAdjustmentsSize()); - } else { - searchQuery.getRelevanceOptions().getRankingParams() - .unsetAuthorSpecificScoreAdjustments(); - searcherStats.authorSpecificSignalQueriesErased.increment(); - } - } - } - - ScoringFunction scoringFunction = - new ScoringFunctionProvider.DefaultScoringFunctionProvider( - request, schemaSnapshot, searchQuery, antiGamingFilter, - segmentManager.getUserTable(), hitAttributeHelper, - parsedQuery, scoringModelsManager, tensorflowModelsManager) - .getScoringFunction(); - scoringFunction.setDebugMode(request.getDebugMode()); - - RelevanceQuery relevanceQuery = new RelevanceQuery(luceneQuery, scoringFunction); - RelevanceSearchRequestInfo searchRequestInfo = - new RelevanceSearchRequestInfo( - searchQuery, relevanceQuery, terminationTracker, qualityFactor); - searchRequestInfo.setIdTimeRanges(idTimeRanges); - searchRequestInfo.setHitAttributeHelper(hitAttributeHelper); - searchRequestInfo.setTimestamp(getQueryTimestamp(searchQuery)); - - if (shouldUseTensorFlowCollector() - && searchQuery.getRelevanceOptions().isUseRelevanceAllCollector()) { - throw new ClientException("Tensorflow scoring does not work with the RelevanceAllCollector"); - } - - final AbstractRelevanceCollector collector; - // First check if the Tensorflow results collector should be used, because the - // TensorflowBasedScoringFunction only works with the BatchRelevanceTopCollector - if (shouldUseTensorFlowCollector()) { - // Collect top numResults. - collector = new BatchRelevanceTopCollector( - schemaSnapshot, - searchRequestInfo, - scoringFunction, - searcherStats, - cluster, - segmentManager.getUserTable(), - clock, - request.getDebugMode()); - } else if (hasSpecifiedTweets - || searchQuery.getRelevanceOptions().isUseRelevanceAllCollector()) { - // Collect all. - collector = new RelevanceAllCollector( - schemaSnapshot, - searchRequestInfo, - scoringFunction, - searcherStats, - cluster, - segmentManager.getUserTable(), - clock, - request.getDebugMode()); - } else { - // Collect top numResults. - collector = new RelevanceTopCollector( - schemaSnapshot, - searchRequestInfo, - scoringFunction, - searcherStats, - cluster, - segmentManager.getUserTable(), - clock, - request.getDebugMode()); - } - - // Make sure that the Tensorflow scoring function and the Tensorflow results collector are - // always used together. 
If this fails it will result in a TRANSIENT_ERROR response. - Preconditions.checkState((collector instanceof BatchRelevanceTopCollector) - == (scoringFunction instanceof TensorflowBasedScoringFunction)); - - setQueriesInDebugInfo(parsedQuery, searchRequestInfo.getLuceneQuery()); - searcher.search(searchRequestInfo.getLuceneQuery(), collector); - - RelevanceSearchResults hits = collector.getResults(); - EarlybirdSearchResultUtil.setResultStatistics(searchResults, hits); - searchResults.setScoringTimeNanos(hits.getScoringTimeNanos()); - - earlyTerminationInfo = EarlybirdSearchResultUtil.prepareEarlyTerminationInfo(hits); - EarlybirdSearchResultUtil.setLanguageHistogram(searchResults, collector.getLanguageHistogram()); - EarlybirdSearchResultUtil.prepareRelevanceResultsArray( - searchResults.getResults(), - hits, - antiGamingFilter != null ? antiGamingFilter.getUserIDWhitelist() : null, - request.getDebugMode() > 0 ? partitionConfig : null); - - searchResults.setHitCounts(collector.getHitCountMap()); - searchResults.setRelevanceStats(hits.getRelevanceStats()); - - maybeSetCollectorDebugInfo(collector); - - if (explanationsEnabled(request.getDebugMode())) { - searcher.explainSearchResults(searchRequestInfo, hits, searchResults); - } - - addResultPayloads(); - - return hits; - } - - public static boolean explanationsEnabled(int debugLevel) { - return debugLevel > 1; - } - - private boolean shouldUseTensorFlowCollector() { - return tensorflowModelsManager.isEnabled() - && searchQuery.getRelevanceOptions().isSetRankingParams() - && searchQuery.getRelevanceOptions().getRankingParams().isSetType() - && searchQuery.getRelevanceOptions().getRankingParams().getType() - == ThriftScoringFunctionType.TENSORFLOW_BASED; - } - /** - * Optionally, if requested and needed, will create a new AntiGamingFilter. Otherwise, no - * AntiGamingFilter will be used for this query. - * @param hasSpecifiedTweets whether the request has searchStatusIds specified. - * @param hasSpecifiedFromUserIds whether the request has fromUserIDFilter64 specified. - */ - private void createRelevanceAntiGamingFilter( - boolean hasSpecifiedTweets, boolean hasSpecifiedFromUserIds) { - - // Anti-gaming filter (turned off for specified tweets mode, or when you're explicitly asking - // for specific users' tweets). - if (searchQuery.getMaxHitsPerUser() > 0 && !hasSpecifiedTweets && !hasSpecifiedFromUserIds) { - searcherStats.relevanceAntiGamingFilterUsed.increment(); - antiGamingFilter = new AntiGamingFilter( - searchQuery.getMaxHitsPerUser(), - searchQuery.getMaxTweepcredForAntiGaming(), - luceneQuery); - } else if (searchQuery.getMaxHitsPerUser() <= 0) { - searcherStats.relevanceAntiGamingFilterNotRequested.increment(); - } else if (hasSpecifiedTweets && hasSpecifiedFromUserIds) { - searcherStats.relevanceAntiGamingFilterSpecifiedTweetsAndFromUserIds.increment(); - } else if (hasSpecifiedTweets) { - searcherStats.relevanceAntiGamingFilterSpecifiedTweets.increment(); - } else if (hasSpecifiedFromUserIds) { - searcherStats.relevanceAntiGamingFilterSpecifiedFromUserIds.increment(); - } - } - - /** - * Check to make sure that there are no nullcast documents in results. If there are nullcasts - * in the results, we should log an error and increment the corresponding counters. 
- */ - @VisibleForTesting - public void logAndIncrementStatsIfNullcastInResults(ThriftSearchResults thriftSearchResults) { - if (!thriftSearchResults.isSetResults()) { - return; - } - - Set unexpectedNullcastStatusIds = - EarlybirdResponseUtil.findUnexpectedNullcastStatusIds(thriftSearchResults, request); - - if (!unexpectedNullcastStatusIds.isEmpty()) { - searcherStats.nullcastUnexpectedQueries.increment(); - searcherStats.nullcastUnexpectedResults.add(unexpectedNullcastStatusIds.size()); - - String base64Request; - try { - base64Request = ThriftUtils.toBase64EncodedString(request); - } catch (TException e) { - base64Request = "Failed to parse base 64 request"; - } - LOG.error( - "Found unexpected nullcast tweets: {} | parsedQuery: {} | request: {} | response: {} | " - + "request base 64: {}", - Joiner.on(",").join(unexpectedNullcastStatusIds), - parsedQuery.serialize(), - request, - thriftSearchResults, - base64Request); - } - } - - private void addResultPayloads() throws IOException { - if (searchQuery.getResultMetadataOptions() != null) { - if (searchQuery.getResultMetadataOptions().isGetTweetUrls()) { - searcher.fillFacetResults(new ExpandedUrlCollector(), searchResults); - } - - if (searchQuery.getResultMetadataOptions().isGetNamedEntities()) { - searcher.fillFacetResults(new NamedEntityCollector(), searchResults); - } - - if (searchQuery.getResultMetadataOptions().isGetEntityAnnotations()) { - searcher.fillFacetResults(new EntityAnnotationCollector(), searchResults); - } - - if (searchQuery.getResultMetadataOptions().isGetSpaces()) { - searcher.fillFacetResults(new SpaceFacetCollector(audioSpaceTable), searchResults); - } - } - } - - /** - * Helper method to process top tweets query. - */ - private SearchResultsInfo processTopTweetsQuery() throws IOException, ClientException { - // set dummy relevance options if it's not available, but this shouldn't happen in prod - if (!searchQuery.isSetRelevanceOptions()) { - searchQuery.setRelevanceOptions(new ThriftSearchRelevanceOptions()); - } - if (!searchQuery.getRelevanceOptions().isSetRankingParams()) { - searchQuery.getRelevanceOptions().setRankingParams( - // this is important, or it's gonna pick DefaultScoringFunction which pretty much - // does nothing. 
- new ThriftRankingParams().setType(ThriftScoringFunctionType.TOPTWEETS)); - } - ScoringFunction scoringFunction = new ScoringFunctionProvider.DefaultScoringFunctionProvider( - request, schemaSnapshot, searchQuery, null, - segmentManager.getUserTable(), hitAttributeHelper, parsedQuery, - scoringModelsManager, tensorflowModelsManager) - .getScoringFunction(); - scoringFunction.setDebugMode(request.getDebugMode()); - - RelevanceQuery relevanceQuery = new RelevanceQuery(luceneQuery, scoringFunction); - RelevanceSearchRequestInfo searchRequestInfo = - new RelevanceSearchRequestInfo( - searchQuery, relevanceQuery, terminationTracker, qualityFactor); - searchRequestInfo.setIdTimeRanges(idTimeRanges); - searchRequestInfo.setTimestamp(getQueryTimestamp(searchQuery)); - - final AbstractRelevanceCollector collector = - new RelevanceTopCollector( - schemaSnapshot, - searchRequestInfo, - scoringFunction, - searcherStats, - cluster, - segmentManager.getUserTable(), - clock, - request.getDebugMode()); - - setQueriesInDebugInfo(parsedQuery, searchRequestInfo.getLuceneQuery()); - searcher.search(searchRequestInfo.getLuceneQuery(), collector); - - RelevanceSearchResults hits = collector.getResults(); - EarlybirdSearchResultUtil.setResultStatistics(searchResults, hits); - searchResults.setScoringTimeNanos(hits.getScoringTimeNanos()); - earlyTerminationInfo = EarlybirdSearchResultUtil.prepareEarlyTerminationInfo(hits); - EarlybirdSearchResultUtil.setLanguageHistogram( - searchResults, - collector.getLanguageHistogram()); - EarlybirdSearchResultUtil.prepareRelevanceResultsArray( - searchResults.getResults(), - hits, - null, - request.getDebugMode() > 0 ? partitionConfig : null); - - searchResults.setHitCounts(collector.getHitCountMap()); - searchResults.setRelevanceStats(hits.getRelevanceStats()); - - maybeSetCollectorDebugInfo(collector); - - if (explanationsEnabled(request.getDebugMode()) - && searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isSetRankingParams()) { - searcher.explainSearchResults(searchRequestInfo, hits, searchResults); - } - - addResultPayloads(); - - return hits; - } - - private FacetCountState newFacetCountState() throws ClientException { - int minNumFacetResults = DEFAULT_NUM_FACET_RESULTS; - if (facetRequest.isSetFacetRankingOptions() - && facetRequest.getFacetRankingOptions().isSetNumCandidatesFromEarlybird()) { - minNumFacetResults = facetRequest.getFacetRankingOptions().getNumCandidatesFromEarlybird(); - } - - // figure out which fields we need to count - FacetCountState facetCountState = new FacetCountState(schemaSnapshot, minNumFacetResults); - - // all categories if none! 
- if (facetRequest.getFacetFields() == null || facetRequest.getFacetFields().isEmpty()) { - for (Schema.FieldInfo facetField : schemaSnapshot.getFacetFields()) { - facetCountState.addFacet( - facetField.getFieldType().getFacetName(), DEFAULT_NUM_FACET_RESULTS); - } - } else { - Iterator it = facetRequest.getFacetFieldsIterator(); - while (it.hasNext()) { - ThriftFacetFieldRequest facetFieldRequest = it.next(); - Schema.FieldInfo facet = schemaSnapshot.getFacetFieldByFacetName( - facetFieldRequest.getFieldName()); - if (facet != null) { - facetCountState.addFacet( - facet.getFieldType().getFacetName(), facetFieldRequest.getNumResults()); - } else { - throw new ClientException("Unknown facet field: " + facetFieldRequest.getFieldName()); - } - } - } - return facetCountState; - } - - private com.twitter.search.queryparser.query.Query preLuceneQueryProcess( - com.twitter.search.queryparser.query.Query twitterQuery) throws QueryParserException { - - com.twitter.search.queryparser.query.Query query = twitterQuery; - if (searchHighFrequencyTermPairs && !includesCardField(searchQuery, query)) { - // Process high frequency term pairs. Works best when query is as flat as possible. - query = HighFrequencyTermPairRewriteVisitor.safeRewrite( - query, - DeciderUtil.isAvailableForRandomRecipient( - decider, "enable_hf_term_pair_negative_disjunction_rewrite")); - } - return query.simplify(); - } - - private Query postLuceneQueryProcess(final Query query) throws ClientException { - if (StringUtils.isBlank(request.getSearchQuery().getSerializedQuery()) - && StringUtils.isBlank(request.getSearchQuery().getLuceneQuery())) { - searcherStats.numRequestsWithBlankQuery.get(queryMode).increment(); - if (searchQuery.getSearchStatusIdsSize() == 0 - && searchQuery.getFromUserIDFilter64Size() == 0 - && searchQuery.getLikedByUserIDFilter64Size() == 0) { - // No query or ids to search. This is only allowed in some modes. - if (queryMode == QueryMode.RECENCY - || queryMode == QueryMode.RELEVANCE - || queryMode == QueryMode.TOP_TWEETS) { - throw new ClientException( - "No query or status ids for " + queryMode.toString().toLowerCase() + " query"); - } - } - } - - // Wrap the query as needed with additional query filters. - List filters = Lists.newArrayList(); - - // Min tweep cred filter. - if (searchQuery.isSetMinTweepCredFilter()) { - searcherStats.addedFilterBadUserRep.increment(); - filters.add(BadUserRepFilter.getBadUserRepFilter(searchQuery.getMinTweepCredFilter())); - } - - if (searchQuery.getFromUserIDFilter64Size() > 0) { - this.queriedFields.add(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName()); - this.searcherStats.addedFilterFromUserIds.increment(); - try { - filters.add(UserIdMultiSegmentQuery.createIdDisjunctionQuery( - "from_user_id_filter", - searchQuery.getFromUserIDFilter64(), - EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), - schemaSnapshot, - multiSegmentTermDictionaryManager, - decider, - cluster, - Lists.newArrayList(), - null, - queryTimeoutFactory.createQueryTimeout(request, terminationTracker, clock))); - } catch (QueryParserException e) { - throw new ClientException(e); - } - } - - // Wrap the lucene query with these filters. - Query wrappedQuery = wrapFilters(query, filters.toArray(new Query[filters.size()])); - - // If searchStatusIds is set, additionally modify the query to search exactly these - // ids, using the luceneQuery only for scoring. 
- if (searchQuery.getSearchStatusIdsSize() > 0) { - this.searcherStats.addedFilterTweetIds.increment(); - - final Query queryForScoring = wrappedQuery; - final Query queryForRetrieval = - RequiredStatusIDsFilter.getRequiredStatusIDsQuery(searchQuery.getSearchStatusIds()); - - return new BooleanQuery.Builder() - .add(queryForRetrieval, Occur.MUST) - .add(queryForScoring, Occur.SHOULD) - .build(); - } - - return wrappedQuery; - } - - private com.twitter.search.queryparser.query.Query getLikedByUserIdQuery( - List ids) throws QueryParserException { - if (DeciderUtil.isAvailableForRandomRecipient( - decider, USE_MULTI_TERM_DISJUNCTION_FOR_LIKED_BY_USER_IDS_DECIDER_KEY)) { - // rewrite LikedByUserIdFilter64 to a multi_term_disjuntion query - return createMultiTermDisjunctionQueryForLikedByUserIds(ids); - } else { - // rewrite LikedByUserIdFilter64 to a disjunction of multiple liked_by_user_ids query - return createDisjunctionQueryForLikedByUserIds(ids); - } - } - - /** - * Returns the Lucene query visitor that should be applied to the original request. - * - * @param fieldWeightMapOverride The per-field weight overrides. - */ - @VisibleForTesting - public EarlybirdLuceneQueryVisitor getLuceneVisitor( - Map fieldWeightMapOverride) { - String clusterName = cluster.getNameForStats(); - // Iff in relevance mode _and_ intepreteSinceId is false, we turn off since_id - // operator by using LuceneRelevanceQueryVisitor. - - if (searchQuery.getRankingMode() == ThriftSearchRankingMode.RELEVANCE - && searchQuery.getRelevanceOptions() != null - && !searchQuery.getRelevanceOptions().isInterpretSinceId()) { - // hack! reset top level since id, which is the same thing LuceneRelevanceVisitor - // is doing. - idTimeRanges = null; - return new LuceneRelevanceQueryVisitor( - schemaSnapshot, - queryCacheManager, - segmentManager.getUserTable(), - segmentManager.getUserScrubGeoMap(), - terminationTracker, - FieldWeightDefault.overrideFieldWeightMap( - schemaSnapshot.getFieldWeightMap(), - dropBadFieldWeightOverrides(fieldWeightMapOverride, decider, clusterName)), - MAPPABLE_FIELD_MAP, - multiSegmentTermDictionaryManager, - decider, - cluster, - queryTimeoutFactory.createQueryTimeout( - request, terminationTracker, clock)); - } else { - return new EarlybirdLuceneQueryVisitor( - schemaSnapshot, - queryCacheManager, - segmentManager.getUserTable(), - segmentManager.getUserScrubGeoMap(), - terminationTracker, - FieldWeightDefault.overrideFieldWeightMap( - schemaSnapshot.getFieldWeightMap(), - dropBadFieldWeightOverrides(fieldWeightMapOverride, decider, clusterName)), - MAPPABLE_FIELD_MAP, - multiSegmentTermDictionaryManager, - decider, - cluster, - queryTimeoutFactory.createQueryTimeout( - request, terminationTracker, clock)); - } - } - - private void prepareFacetResults(ThriftFacetResults thriftFacetResults, - EarlybirdLuceneSearcher.FacetSearchResults hits, - FacetCountState facetCountState, - Set userIDWhitelist, - byte debugMode) throws IOException { - for (FacetRankingModule rankingModule : FacetRankingModule.REGISTERED_RANKING_MODULES) { - rankingModule.prepareResults(hits, facetCountState); - } - - Map allFacetResults = new HashMap<>(); - - Iterator> fieldResultsIterator = - facetCountState.getFacetFieldResultsIterator(); - while (fieldResultsIterator.hasNext()) { - - FacetCountState.FacetFieldResults facetFieldResults = - fieldResultsIterator.next(); - - if (facetFieldResults.results == null) { - // return empty resultset for this facet - List emptyList = new ArrayList<>(); - facetFieldResults.results = new 
ThriftFacetFieldResults(emptyList, 0); - } - thriftFacetResults.putToFacetFields(facetFieldResults.facetName, - facetFieldResults.results); - - Schema.FieldInfo field = schemaSnapshot.getFacetFieldByFacetName( - facetFieldResults.facetName); - - for (ThriftFacetCount result : facetFieldResults.results.topFacets) { - if (result.facetLabel != null) { - allFacetResults.put(new Term(field.getName(), result.facetLabel), result); - } else { - LOG.warn("Null facetLabel, field: {}, result: {}", field.getName(), result); - } - } - } - - searcher.fillFacetResultMetadata(allFacetResults, schemaSnapshot, debugMode); - - if (userIDWhitelist != null) { - for (ThriftFacetCount facetCount : allFacetResults.values()) { - ThriftFacetCountMetadata metadata = facetCount.getMetadata(); - if (metadata != null) { - metadata.setDontFilterUser(userIDWhitelist.contains(metadata.getTwitterUserId())); - } - } - } - } - - private void prepareTermStatisticsResults( - ThriftTermStatisticsResults termStatistics, - TermStatisticsCollector.TermStatisticsSearchResults hits, - byte debugMode) throws IOException { - - termStatistics.setBinIds(hits.binIds); - termStatistics.setHistogramSettings(termStatisticsRequest.getHistogramSettings()); - termStatistics.setTermResults(hits.results); - setTermStatisticsDebugInfo(hits.getTermStatisticsDebugInfo()); - - if (hits.lastCompleteBinId != -1) { - termStatistics.setMinCompleteBinId(hits.lastCompleteBinId); - } else { - SearchRateCounter.export(String.format( - "term_stats_%s_unset_min_complete_bin_id", request.getClientId())).increment(); - } - - if (idTimeRanges != null - && idTimeRanges.getUntilTimeExclusive().isPresent() - && hits.getMinSearchedTime() > idTimeRanges.getUntilTimeExclusive().get()) { - SearchRateCounter.export(String.format( - "term_stats_%s_min_searched_time_after_until_time", request.getClientId())).increment(); - } - - searcher.fillTermStatsMetadata(termStatistics, schemaSnapshot, debugMode); - } - - private EarlybirdResponse respondSuccess( - ThriftSearchResults thriftSearchResults, - ThriftFacetResults thriftFacetResults, - ThriftTermStatisticsResults termStatisticResults, - @Nonnull EarlyTerminationInfo earlyTerminationState, - @Nonnull SearchResultsInfo searchResultsInfo) { - - Preconditions.checkNotNull(earlyTerminationState); - Preconditions.checkNotNull(searchResultsInfo); - - exportEarlyTerminationStats(earlyTerminationState); - - EarlybirdResponse response = - newResponse(EarlybirdResponseCode.SUCCESS, request.getDebugMode() > 0); - response.setEarlyTerminationInfo(earlyTerminationState); - response.setNumSearchedSegments(searchResultsInfo.getNumSearchedSegments()); - - if (thriftSearchResults != null) { - // Nullcast check is only used when parsed query is available: if there is no parsed query, - // we would not add possible exclude nullcast filter. 
- if (parsedQuery != null && !parsedQueryAllowNullcast) { - logAndIncrementStatsIfNullcastInResults(thriftSearchResults); - } - response.setSearchResults(thriftSearchResults); - } else { - RESPONSE_HAS_NO_THRIFT_SEARCH_RESULTS.increment(); - } - if (thriftFacetResults != null) { - response.setFacetResults(thriftFacetResults); - } - if (termStatisticResults != null) { - response.setTermStatisticsResults(termStatisticResults); - } - - appendFeatureSchemaIfNeeded(response); - - appendLikedByUserIdsIfNeeded(response); - - return response; - } - - private void exportEarlyTerminationStats(@Nonnull EarlyTerminationInfo earlyTerminationState) { - if (earlyTerminationState.isSetEarlyTerminationReason()) { - SearchRateCounter.export(String.format("early_termination_%s_%s", - ClientIdUtil.formatClientId(request.getClientId()), - earlyTerminationState.getEarlyTerminationReason())).increment(); - SearchRateCounter.export(String.format("early_termination_%s_%s", - ClientIdUtil.formatClientIdAndRequestType( - request.getClientId(), queryMode.name().toLowerCase()), - earlyTerminationState.getEarlyTerminationReason())).increment(); - } - } - - /** - * Builds a rank -> userId map for liked_by_user_id queries that request hit attribution, and - * appends the resulting map to the response. - */ - private void appendLikedByUserIdsIfNeeded(EarlybirdResponse response) { - // Check if user asked for likedByUserIds list in response - ThriftSearchRelevanceOptions resultRelevanceOptions = - request.getSearchQuery().getRelevanceOptions(); - if ((resultRelevanceOptions == null) - || !resultRelevanceOptions.isCollectFieldHitAttributions()) { - return; - } - - // Make sure we have results in response and hit attribution helper is set up correctly - if (!response.isSetSearchResults() || hitAttributeHelper == null) { - return; - } - - // Get rank to node map - Map nodeToRankMap = - Preconditions.checkNotNull(hitAttributeHelper.getNodeToRankMap()); - - Map> expandedNodeToRankMap = - Preconditions.checkNotNull(hitAttributeHelper.getExpandedNodeToRankMap()); - - // Build a rank to id map - ImmutableMap.Builder builder = ImmutableMap.builder(); - for (com.twitter.search.queryparser.query.Query query : nodeToRankMap.keySet()) { - if (query instanceof SearchOperator) { - SearchOperator op = (SearchOperator) query; - if (expandedNodeToRankMap.containsKey(query)) { - // for multi_term_disjunction case - List ranks = expandedNodeToRankMap.get(op); - Preconditions.checkArgument(op.getNumOperands() == ranks.size() + 1); - for (int i = 0; i < ranks.size(); ++i) { - builder.put(ranks.get(i), Long.valueOf(op.getOperands().get(i + 1))); - } - } else if (op.getOperatorType() == SearchOperator.Type.LIKED_BY_USER_ID) { - // for liked_by_user_id case - Preconditions.checkArgument(op.getAnnotationOf(Annotation.Type.NODE_RANK).isPresent()); - builder.put( - (Integer) op.getAnnotationOf(Annotation.Type.NODE_RANK).get().getValue(), - Long.valueOf(op.getOperands().get(0))); - } - } - } - Map rankToIdMap = builder.build(); - - // Append liked_by_user_id filed into result - for (ThriftSearchResult result : response.getSearchResults().getResults()) { - if (result.isSetMetadata() - && result.getMetadata().isSetFieldHitAttribution() - && result.getMetadata().getFieldHitAttribution().isSetHitMap()) { - - List likedByUserIdList = Lists.newArrayList(); - - Map hitMap = - result.getMetadata().getFieldHitAttribution().getHitMap(); - // iterate hit attributions - for (int rank : hitMap.keySet()) { - if (rankToIdMap.containsKey(rank)) { - 
likedByUserIdList.add(rankToIdMap.get(rank)); - } - } - if (!result.getMetadata().isSetExtraMetadata()) { - result.getMetadata().setExtraMetadata(new ThriftSearchResultExtraMetadata()); - } - result.getMetadata().getExtraMetadata().setLikedByUserIds(likedByUserIdList); - } - } - } - - private void appendFeatureSchemaIfNeeded(EarlybirdResponse response) { - // Do not append the schema if the client didn't request it. - ThriftSearchResultMetadataOptions resultMetadataOptions = - request.getSearchQuery().getResultMetadataOptions(); - if ((resultMetadataOptions == null) || !resultMetadataOptions.isReturnSearchResultFeatures()) { - return; - } - - if (!response.isSetSearchResults()) { - return; - } - - ThriftSearchFeatureSchema featureSchema = schemaSnapshot.getSearchFeatureSchema(); - Preconditions.checkState( - featureSchema.isSetSchemaSpecifier(), - "The feature schema doesn't have a schema specifier set: {}", featureSchema); - - // If the client has this schema, we only need to return the schema version. - // If the client doesn't have this schema, we need to return the schema entries too. - if (resultMetadataOptions.isSetFeatureSchemasAvailableInClient() - && resultMetadataOptions.getFeatureSchemasAvailableInClient().contains( - featureSchema.getSchemaSpecifier())) { - CLIENT_HAS_FEATURE_SCHEMA_COUNTER.increment(); - ThriftSearchFeatureSchema responseFeatureSchema = new ThriftSearchFeatureSchema(); - responseFeatureSchema.setSchemaSpecifier(featureSchema.getSchemaSpecifier()); - response.getSearchResults().setFeatureSchema(responseFeatureSchema); - } else { - CLIENT_DOESNT_HAVE_FEATURE_SCHEMA_COUNTER.increment(); - Preconditions.checkState(featureSchema.isSetEntries(), - "Entries are not set in the feature schema: " + featureSchema); - response.getSearchResults().setFeatureSchema(featureSchema); - } - } - - private static long getQueryTimestamp(ThriftSearchQuery query) { - return query != null && query.isSetTimestampMsecs() ? query.getTimestampMsecs() : 0; - } - - private static boolean includesCardField(ThriftSearchQuery searchQuery, - com.twitter.search.queryparser.query.Query query) - throws QueryParserException { - - if (searchQuery.isSetRelevanceOptions()) { - ThriftSearchRelevanceOptions options = searchQuery.getRelevanceOptions(); - if (options.isSetFieldWeightMapOverride() - && (options.getFieldWeightMapOverride().containsKey( - EarlybirdFieldConstant.CARD_TITLE_FIELD.getFieldName()) - || options.getFieldWeightMapOverride() - .containsKey(EarlybirdFieldConstant.CARD_DESCRIPTION_FIELD.getFieldName()))) { - - return true; - } - } - - return query.accept(new DetectFieldAnnotationVisitor(ImmutableSet.of( - EarlybirdFieldConstant.CARD_TITLE_FIELD.getFieldName(), - EarlybirdFieldConstant.CARD_DESCRIPTION_FIELD.getFieldName()))); - } - - private static QueryMode getQueryMode(EarlybirdRequest request) { - if (request.isSetFacetRequest()) { - return QueryMode.FACETS; - } else if (request.isSetTermStatisticsRequest()) { - return QueryMode.TERM_STATS; - } - - // Recency mode until we determine otherwise. 
- QueryMode queryMode = QueryMode.RECENCY; - ThriftSearchQuery searchQuery = request.getSearchQuery(); - if (searchQuery != null) { - switch (searchQuery.getRankingMode()) { - case RECENCY: - queryMode = QueryMode.RECENCY; - break; - case RELEVANCE: - queryMode = QueryMode.RELEVANCE; - break; - case TOPTWEETS: - queryMode = QueryMode.TOP_TWEETS; - break; - default: - break; - } - } - - if (searchQuery == null - || !searchQuery.isSetSerializedQuery() - || searchQuery.getSerializedQuery().isEmpty()) { - LOG.debug("Search query was empty, query mode was " + queryMode); - } - - return queryMode; - } - - private static ImmutableMap dropBadFieldWeightOverrides( - Map map, Decider decider, String clusterName) { - - if (map == null) { - return null; - } - - FIELD_WEIGHT_OVERRIDE_MAP_NON_NULL_COUNT.increment(); - ImmutableMap.Builder builder = ImmutableMap.builder(); - - for (Map.Entry entry : map.entrySet()) { - if (EarlybirdFieldConstant.CAMELCASE_USER_HANDLE_FIELD.getFieldName().equals(entry.getKey()) - && !isAllowedCamelcaseUsernameFieldWeightOverride(decider, clusterName)) { - DROPPED_CAMELCASE_USERNAME_FIELD_WEIGHT_OVERRIDE.increment(); - } else if (EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName().equals( - entry.getKey()) - && !isAllowedTokenizedScreenNameFieldWeightOverride(decider, clusterName)) { - DROPPED_TOKENIZED_DISPLAY_NAME_FIELD_WEIGHT_OVERRIDE.increment(); - } else { - builder.put(entry.getKey(), entry.getValue()); - } - } - - return builder.build(); - } - - private static boolean isAllowedCamelcaseUsernameFieldWeightOverride( - Decider decider, String clusterName) { - return DeciderUtil.isAvailableForRandomRecipient(decider, - ALLOW_CAMELCASE_USERNAME_FIELD_WEIGHT_OVERRIDE_DECIDER_KEY_PREFIX + clusterName); - } - - private static boolean isAllowedTokenizedScreenNameFieldWeightOverride( - Decider decider, String clusterName) { - return DeciderUtil.isAvailableForRandomRecipient(decider, - ALLOW_TOKENIZED_DISPLAY_NAME_FIELD_WEIGHT_OVERRIDE_DECIDER_KEY_PREFIX + clusterName); - } - - private static com.twitter.search.queryparser.query.Query - createMultiTermDisjunctionQueryForLikedByUserIds(List ids) throws QueryParserException { - List operands = new ArrayList<>(ids.size() + 1); - operands.add(EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName()); - for (long id : ids) { - operands.add(String.valueOf(id)); - } - return new SearchOperator(SearchOperator.Type.MULTI_TERM_DISJUNCTION, operands) - .simplify(); - } - - private static com.twitter.search.queryparser.query.Query createDisjunctionQueryForLikedByUserIds( - List ids) throws QueryParserException { - return new Disjunction( - ids.stream() - .map(id -> new SearchOperator(SearchOperator.Type.LIKED_BY_USER_ID, id)) - .collect(Collectors.toList())) - .simplify(); - } - - public com.twitter.search.queryparser.query.Query getParsedQuery() { - return parsedQuery; - } - - /** - * Get the index fields that were queried after this searcher completed its job. 
- * @return - */ - public Set getQueriedFields() { - return queriedFields; - } - - public Query getLuceneQuery() { - return luceneQuery; - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdServer.java b/src/java/com/twitter/search/earlybird/EarlybirdServer.java deleted file mode 100644 index 44d9dc1d8..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdServer.java +++ /dev/null @@ -1,1087 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.BufferedWriter; -import java.io.Closeable; -import java.io.File; -import java.io.IOException; -import java.nio.file.Files; -import java.util.ArrayList; -import java.util.List; -import java.util.Set; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.ExecutionException; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.RejectedExecutionException; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicReference; -import javax.annotation.Nullable; -import javax.annotation.concurrent.GuardedBy; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Charsets; -import com.google.common.base.Stopwatch; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; -import com.google.common.util.concurrent.AtomicLongMap; - -import org.apache.commons.codec.binary.Base64; -import org.apache.lucene.search.IndexSearcher; -import org.apache.thrift.TBase; -import org.apache.thrift.TException; -import org.apache.thrift.TSerializer; -import org.apache.zookeeper.KeeperException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.common.zookeeper.ServerSet.UpdateException; -import com.twitter.common.zookeeper.ZooKeeperClient; -import com.twitter.decider.Decider; -import com.twitter.finagle.Failure; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.FlushVersion; -import com.twitter.search.common.search.termination.QueryTimeoutFactory; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.common.util.GCUtil; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; -import com.twitter.search.core.earlybird.index.inverted.QueryCostTracker; -import com.twitter.search.earlybird.admin.LastSearchesSummary; -import com.twitter.search.earlybird.admin.QueriedFieldsAndSchemaStats; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.common.EarlybirdRequestLogger; -import 
com.twitter.search.earlybird.common.EarlybirdRequestPostLogger; -import com.twitter.search.earlybird.common.EarlybirdRequestPreLogger; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird.common.RequestResponsePair; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.exception.TransientException; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.partition.AudioSpaceTable; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.EarlybirdStartup; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.PartitionManager; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.partition.SegmentVulture; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.stats.EarlybirdRPCStats; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.EarlybirdServerStats; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.util.OneTaskScheduledExecutorManager; -import com.twitter.search.earlybird.util.TermCountMonitor; -import com.twitter.search.earlybird.util.TweetCountMonitor; -import com.twitter.snowflake.id.SnowflakeId; -import com.twitter.util.Duration; -import com.twitter.util.Function; -import com.twitter.util.Function0; -import com.twitter.util.Future; - -public class EarlybirdServer implements EarlybirdService.ServiceIface, ServerSetMember { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdServer.class); - - private static final String EARLYBIRD_STARTUP = "earlybird startup"; - public static final String SERVICE_NAME = "Earlybird"; - - private static final boolean REGISTER_WITH_ZK_ON_STARTUP = - EarlybirdConfig.getBool("register_with_zk_on_startup", true); - private static final Duration SERVER_CLOSE_WAIT_TIME = Duration.apply(5L, TimeUnit.SECONDS); - - private static final Failure QUEUE_FULL_FAILURE = - Failure.rejected("Rejected due to full executor queue"); - - private final int port = EarlybirdConfig.getThriftPort(); - private final int warmUpPort = EarlybirdConfig.getWarmUpThriftPort(); - private final int numSearcherThreads = EarlybirdConfig.getSearcherThreads(); - - private final SearchStatsReceiver earlybirdServerStatsReceiver; - private final EarlybirdRPCStats searchStats = new EarlybirdRPCStats("search"); - private final EarlybirdSearcherStats tweetsSearcherStats; - - private static final String REQUESTS_RECEIVED_BY_FINAGLE_ID_COUNTER_NAME_PATTERN = - 
"requests_for_finagle_id_%s_all"; - private static final String REQUESTS_RECEIVED_BY_FINAGLE_ID_AND_CLIENT_ID_COUNTER_NAME_PATTERN = - "requests_for_finagle_id_%s_and_client_id_%s"; - private static final String RESPONSES_PER_CLIENT_ID_STAT_TEMPLATE = - "responses_for_client_id_%s_with_response_code_%s"; - - // Loading cache for per finagle-client-id stats. Storing them in a loading cache key-ed by - // finagle client id so we don't export the stat multiple times. - private final LoadingCache requestCountersByFinagleClientId = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchTimerStats load(String finagleClientId) { - return earlybirdServerStatsReceiver.getTimerStats( - String.format( - REQUESTS_RECEIVED_BY_FINAGLE_ID_COUNTER_NAME_PATTERN, - finagleClientId), TimeUnit.MICROSECONDS, false, true, false); - } - }); - - // Counters per client and response code. - private final LoadingCache responseByClientIdAndResponseCode = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchCounter load(String key) { - return earlybirdServerStatsReceiver.getCounter(key); - } - }); - - private final LoadingCache resultsAgeCounter = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchCounter load(String key) { - return earlybirdServerStatsReceiver.getCounter(key); - } - } - ); - - // Loading cache for per finagle client id and client id stats. These are stored separate - // from the other stats because they are key-ed by the pair of finagle client id and client id - // in order to make sure the stats are only exported once. - // In the key-pair the first element is the finagle client id while the second element is the - // client id. - private final LoadingCache, SearchRateCounter> - requestCountersByFinagleIdAndClientId = CacheBuilder.newBuilder().build( - new CacheLoader, SearchRateCounter>() { - @Override - public SearchRateCounter load(Pair clientKey) { - return earlybirdServerStatsReceiver.getRateCounter( - String.format( - REQUESTS_RECEIVED_BY_FINAGLE_ID_AND_CLIENT_ID_COUNTER_NAME_PATTERN, - clientKey.getFirst(), - clientKey.getSecond())); - } - }); - - // Loading cache for per-client-id latency stats. Stored in a loading cache here mainly because - // the tests assert the mock stats receiver that each stat is only exported once. 
- private final LoadingCache clientIdSearchStats = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchTimerStats load(String clientId) { - String formattedClientId = ClientIdUtil.formatClientId(clientId); - return earlybirdServerStatsReceiver.getTimerStats(formattedClientId, - TimeUnit.MICROSECONDS, false, true, true); - } - }); - - private final LoadingCache clientIdScoringPerQueryStats = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchTimerStats load(String clientId) { - String statName = - String.format("scoring_time_per_query_for_client_id_%s", clientId); - return earlybirdServerStatsReceiver.getTimerStats(statName, - TimeUnit.NANOSECONDS, false, true, false); - } - }); - - private final LoadingCache clientIdScoringPerHitStats = - CacheBuilder.newBuilder().build( - new CacheLoader() { - @Override - public SearchTimerStats load(String clientId) { - String statName = - String.format("scoring_time_per_hit_for_client_id_%s", clientId); - return earlybirdServerStatsReceiver.getTimerStats(statName, - TimeUnit.NANOSECONDS, false, true, false); - } - }); - - private final LoadingCache> clientIdScoringNumHitsProcessedStats = - CacheBuilder.newBuilder().build( - new CacheLoader>() { - @Override - public Percentile load(String clientId) { - String statName = - String.format("scoring_num_hits_processed_for_client_id_%s", clientId); - return PercentileUtil.createPercentile(statName); - } - }); - - private final LoadingCache> lastRequestPerClientId = - CacheBuilder.newBuilder().build( - new CacheLoader>() { - @Override - public AtomicReference load(String key) throws Exception { - return new AtomicReference<>(null); - } - }); - - - private final SearchTimerStats overallScoringTimePerQueryStats; - private final SearchTimerStats overallScoringTimePerHitStats; - private final Percentile overallScoringNumHitsProcessedStats; - - private final EarlybirdIndexConfig earlybirdIndexConfig; - private final DynamicPartitionConfig dynamicPartitionConfig; - private final SegmentManager segmentManager; - private final UpdateableEarlybirdStateManager stateManager; - private final AudioSpaceTable audioSpaceTable; - - private final SearchLongGauge startupTimeGauge; - - // Time spent in an internal thread pool queue, between the time we get the search request - // from finagle until it actually starts being executed. - private final SearchTimerStats internalQueueWaitTimeStats; - - // Tracking request that have exceeded their allocated timeout prior to us actually being able - // to start executing the search. - private final SearchCounter requestTimeoutExceededBeforeSearchCounter; - // Current number of running searcher threads. 
- private final SearchLongGauge numSearcherThreadsGauge; - private final QueryTimeoutFactory queryTimeoutFactory; - - private PartitionManager partitionManager; - private QueryCacheManager queryCacheManager; - - private final ScoringModelsManager scoringModelsManager; - - private final TensorflowModelsManager tensorflowModelsManager; - - private final EarlybirdRequestPreLogger requestPreLogger; - private final EarlybirdRequestPostLogger requestLogger; - - private final TweetCountMonitor tweetCountMonitor; - private final TermCountMonitor termCountMonitor; - - private final EarlybirdServerSetManager serverSetManager; - private final EarlybirdWarmUpManager warmUpManager; - private final MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - - private final Object shutdownLock = new Object(); - @GuardedBy("shutdownLock") - private final EarlybirdFuturePoolManager futurePoolManager; - @GuardedBy("shutdownLock") - private final EarlybirdFinagleServerManager finagleServerManager; - - // If a search request comes in with a client-side start time, and we see that based on that - // the timeout has expired, whether we should drop that query immediately. - private final boolean skipTimedOutRequests = - EarlybirdConfig.getBool("skip_timedout_requests", false); - - // client of szookeeper.local.twitter.com. - // This is used to perform distributed locking and layout reading etc. - private final ZooKeeperProxy sZooKeeperClient; - - private final Decider decider; - - private final Clock clock; - - private final List toClose = new ArrayList<>(); - - private final SearchIndexingMetricSet searchIndexingMetricSet; - - private final EarlybirdDarkProxy earlybirdDarkProxy; - - private final ImmutableMap responseCodeCounters; - private final SegmentSyncConfig segmentSyncConfig; - private final EarlybirdStartup earlybirdStartup; - private final QualityFactor qualityFactor; - - private boolean isShutdown = false; - private boolean isShuttingDown = false; - - private final AtomicLongMap queriedFieldsCounts = AtomicLongMap.create(); - - public EarlybirdServer(QueryCacheManager queryCacheManager, - ZooKeeperProxy sZkClient, - Decider decider, - EarlybirdIndexConfig earlybirdIndexConfig, - DynamicPartitionConfig dynamicPartitionConfig, - PartitionManager partitionManager, - SegmentManager segmentManager, - AudioSpaceTable audioSpaceTable, - TermCountMonitor termCountMonitor, - TweetCountMonitor tweetCountMonitor, - UpdateableEarlybirdStateManager earlybirdStateManager, - EarlybirdFuturePoolManager futurePoolManager, - EarlybirdFinagleServerManager finagleServerManager, - EarlybirdServerSetManager serverSetManager, - EarlybirdWarmUpManager warmUpManager, - SearchStatsReceiver earlybirdServerStatsReceiver, - EarlybirdSearcherStats tweetsSearcherStats, - ScoringModelsManager scoringModelsManager, - TensorflowModelsManager tensorflowModelsManager, - Clock clock, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - EarlybirdDarkProxy earlybirdDarkProxy, - SegmentSyncConfig segmentSyncConfig, - QueryTimeoutFactory queryTimeoutFactory, - EarlybirdStartup earlybirdStartup, - QualityFactor qualityFactor, - SearchIndexingMetricSet searchIndexingMetricSet) { - LOG.info("Creating EarlybirdServer"); - this.decider = decider; - this.clock = clock; - this.sZooKeeperClient = sZkClient; - this.earlybirdIndexConfig = earlybirdIndexConfig; - this.dynamicPartitionConfig = dynamicPartitionConfig; - this.segmentManager = segmentManager; - this.queryCacheManager = queryCacheManager; - 
this.termCountMonitor = termCountMonitor; - this.tweetCountMonitor = tweetCountMonitor; - this.stateManager = earlybirdStateManager; - this.partitionManager = partitionManager; - this.futurePoolManager = futurePoolManager; - this.finagleServerManager = finagleServerManager; - this.serverSetManager = serverSetManager; - this.warmUpManager = warmUpManager; - this.earlybirdServerStatsReceiver = earlybirdServerStatsReceiver; - this.tweetsSearcherStats = tweetsSearcherStats; - this.scoringModelsManager = scoringModelsManager; - this.tensorflowModelsManager = tensorflowModelsManager; - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.earlybirdDarkProxy = earlybirdDarkProxy; - this.segmentSyncConfig = segmentSyncConfig; - this.queryTimeoutFactory = queryTimeoutFactory; - this.earlybirdStartup = earlybirdStartup; - this.qualityFactor = qualityFactor; - this.audioSpaceTable = audioSpaceTable; - - EarlybirdStatus.setStartTime(System.currentTimeMillis()); - - // Our initial status code is STARTING. - EarlybirdStatus.setStatus(EarlybirdStatusCode.STARTING); - EarlybirdStatus.THRIFT_SERVICE_STARTED.set(false); - - PartitionConfig partitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - earlybirdServerStatsReceiver.getLongGauge( - "search_cluster_" + partitionConfig.getClusterName()).set(1); - earlybirdServerStatsReceiver.getLongGauge( - "tier_name_" + partitionConfig.getTierName()).set(1); - - earlybirdServerStatsReceiver.getLongGauge("partition").set( - partitionConfig.getIndexingHashPartitionID()); - earlybirdServerStatsReceiver.getLongGauge("replica").set( - partitionConfig.getHostPositionWithinHashPartition()); - earlybirdServerStatsReceiver.getLongGauge("penguin_version").set( - EarlybirdConfig.getPenguinVersionByte()); - - earlybirdServerStatsReceiver.getLongGauge("flush_version").set( - FlushVersion.CURRENT_FLUSH_VERSION.ordinal()); - String buildGen = EarlybirdConfig.getString("offline_segment_build_gen", "unknown"); - earlybirdServerStatsReceiver.getLongGauge("build_gen_" + buildGen).set(1); - - this.startupTimeGauge = earlybirdServerStatsReceiver.getLongGauge("startup_time_millis"); - this.internalQueueWaitTimeStats = earlybirdServerStatsReceiver.getTimerStats( - "internal_queue_wait_time", TimeUnit.MILLISECONDS, false, true, false); - this.requestTimeoutExceededBeforeSearchCounter = earlybirdServerStatsReceiver.getCounter( - "request_timeout_exceeded_before_search"); - this.numSearcherThreadsGauge = - earlybirdServerStatsReceiver.getLongGauge("num_searcher_threads"); - this.overallScoringTimePerQueryStats = earlybirdServerStatsReceiver.getTimerStats( - "overall_scoring_time_per_query", TimeUnit.NANOSECONDS, false, true, false); - - // For most of our scoring functions the scoring_time_per_hit records the actual time to score a - // single hit. However, the tensorflow based scoring function uses batch scoring, so we do not - // know the actual time it takes to score a single hit. We are now including batch scoring time - // in all scoring time stats (SEARCH-26014), which means that the scoring_time_per_hit stat may - // be a bit misleading for tensorflow based queries. For these queries the scoring_time_per_hit - // represents the ratio between total_scoring_time and the number_of_hits, instead of the actual - // time to score a single hit. 
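    // In other words, the exported per-hit value is scoringTimeNanos / numHitsProcessed
    // (see exportScoringTimeStats below), so for batch scorers it is an average over the
    // batch rather than a true per-hit measurement.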
- this.overallScoringTimePerHitStats = earlybirdServerStatsReceiver.getTimerStats( - "overall_scoring_time_per_hit", TimeUnit.NANOSECONDS, false, true, false); - this.overallScoringNumHitsProcessedStats = PercentileUtil.createPercentile( - "overall_scoring_num_hits_processed"); - - ImmutableMap.Builder responseCodeCountersBuilder = - new ImmutableMap.Builder<>(); - for (EarlybirdResponseCode responseCode : EarlybirdResponseCode.values()) { - responseCodeCountersBuilder.put( - responseCode, - earlybirdServerStatsReceiver.getCounter( - "responses_with_response_code_" + responseCode.name().toLowerCase())); - } - responseCodeCounters = responseCodeCountersBuilder.build(); - - disableLuceneQueryCache(); - initManagers(); - - requestPreLogger = EarlybirdRequestPreLogger.buildForShard( - EarlybirdConfig.getInt("latency_warn_threshold", 100), decider); - requestLogger = EarlybirdRequestPostLogger.buildForShard( - EarlybirdConfig.getInt("latency_warn_threshold", 100), decider); - - this.qualityFactor.startUpdates(); - - LOG.info("Created EarlybirdServer"); - } - - public boolean isShutdown() { - return this.isShutdown; - } - - private void initManagers() { - LOG.info("Created EarlybirdIndexConfig: " + earlybirdIndexConfig.getClass().getSimpleName()); - - segmentManager.addUpdateListener(queryCacheManager); - } - - public PartitionManager getPartitionManager() { - return partitionManager; - } - - public QueryCacheManager getQueryCacheManager() { - return queryCacheManager; - } - - public SegmentManager getSegmentManager() { - return segmentManager; - } - - public MultiSegmentTermDictionaryManager getMultiSegmentTermDictionaryManager() { - return this.multiSegmentTermDictionaryManager; - } - - @VisibleForTesting - public int getPort() { - return port; - } - - private void disableLuceneQueryCache() { - // SEARCH-30046: Look into possibly re-enabling the query -> weight cache. - // We can't use this cache until we upgrade to Lucene 6.0.0, because we have queries with a - // boost of 0.0, and they don't play nicely with Lucene's LRUQueryCache.get() method. - // - // Lucene 6.0.0 changes how boosts are handled: "real" boosts should be wrapped into BoostQuery - // instances, and queries with a boost of 0.0 should be rewritten as "filters" - // (BooleanQuery.add(query, BooleanClause.Occur.FILTER)). So when we upgrade to Lucene 6.0.0 we - // will be forced to refactor how we handle our current queries with a boost of 0.0, which might - // allow us to re-enable this cache. - // - // Note that disabling this cache is not a regression: it should give us the behavior that we - // had with Lucene 5.2.1 (and it's unclear if this cache is useful at all). - // - // WARNING: The default 'DefaultQueryCache' maintains a static reference to the weight forever, - // causing a memory leak. Our weights hold references to an entire segment so the memory leak is - // significant. - IndexSearcher.setDefaultQueryCache(null); - } - - /** - * Starts the earlybird server. 
- */ - public void start() throws EarlybirdStartupException { - // Make sure this is at the top of the function before other parts of the system start running - new EarlybirdBlacklistHandler(Clock.SYSTEM_CLOCK, sZooKeeperClient) - .blockThenExitIfBlacklisted(); - - Stopwatch startupWatch = Stopwatch.createStarted(); - EarlybirdStatus.beginEvent(EARLYBIRD_STARTUP, searchIndexingMetricSet.startupInProgress); - - LOG.info("java.library.path is: " + System.getProperty("java.library.path")); - - PartitionConfig partitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - - SegmentVulture.removeUnusedSegments(partitionManager, partitionConfig, - earlybirdIndexConfig.getSchema().getMajorVersionNumber(), segmentSyncConfig); - - // Start the schema manager - schedule(stateManager); - - Closeable closeable = earlybirdStartup.start(); - toClose.add(closeable); - if (EarlybirdStatus.getStatusCode() == EarlybirdStatusCode.STOPPING) { - LOG.info("Server is shutdown. Exiting..."); - return; - } - - startupTimeGauge.set(startupWatch.elapsed(TimeUnit.MILLISECONDS)); - - EarlybirdStatus.endEvent(EARLYBIRD_STARTUP, searchIndexingMetricSet.startupInProgress); - - GCUtil.runGC(); // Attempt to force a full GC before joining the serverset - - try { - startThriftService(null, true); - } catch (InterruptedException e) { - LOG.info("Interrupted while starting thrift server, quitting earlybird"); - throw new EarlybirdStartupException("Interrupted while starting thrift server"); - } - - EarlybirdStatus.THRIFT_SERVICE_STARTED.set(true); - - // only once we're current, kick off daily tweet count monitors only for archive cluster - if (EarlybirdConfig.getInt(TweetCountMonitor.RUN_INTERVAL_MINUTES_CONFIG_NAME, -1) > 0) { - schedule(tweetCountMonitor); - } - - // only once we're current, kick off per-field term count monitors - if (EarlybirdConfig.getInt(TermCountMonitor.RUN_INTERVAL_MINUTES_CONFIG_NAME, -1) > 0) { - schedule(termCountMonitor); - } - - startupTimeGauge.set(startupWatch.elapsed(TimeUnit.MILLISECONDS)); - LOG.info("EarlybirdServer start up time: {}", startupWatch); - } - - /** - * Starts the thrift server if the server is not running. - * If searcherThreads is null, it uses the value specified by EarlybirdConfig. - */ - public void startThriftService(@Nullable Integer searcherThreads, boolean isStartingUp) - throws InterruptedException { - synchronized (shutdownLock) { - if (!finagleServerManager.isWarmUpServerRunning() - && !finagleServerManager.isProductionServerRunning()) { - int threadCount = searcherThreads != null - ? searcherThreads : this.numSearcherThreads; - LOG.info("Starting searcher pool with " + threadCount + " threads"); - futurePoolManager.createUnderlyingFuturePool(threadCount); - numSearcherThreadsGauge.set(threadCount); - - // If the server is not shutting down, go through the warm up stage. If the server is - // instructed to shut down during warm up, warmUpManager.warmUp() should return within a - // second, and should leave the warm up server set. We should still shut down the warm up - // Finagle server. 
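    // Startup sequence: open the warm-up thrift port and run the warm-up pass first, then
    // open the production thrift port and (optionally) join the ZooKeeper server set.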
- if (isStartingUp && (EarlybirdStatus.getStatusCode() != EarlybirdStatusCode.STOPPING)) { - LOG.info("Opening warmup thrift port..."); - finagleServerManager.startWarmUpFinagleServer(this, SERVICE_NAME, warmUpPort); - EarlybirdStatus.WARMUP_THRIFT_PORT_OPEN.set(true); - - try { - warmUpManager.warmUp(); - } catch (UpdateException e) { - LOG.warn("Could not join or leave the warm up server set.", e); - } finally { - finagleServerManager.stopWarmUpFinagleServer(SERVER_CLOSE_WAIT_TIME); - EarlybirdStatus.WARMUP_THRIFT_PORT_OPEN.set(false); - } - } - - // If the server is not shutting down, we can start the production Finagle server and join - // the production server set. - if (EarlybirdStatus.getStatusCode() != EarlybirdStatusCode.STOPPING) { - LOG.info("Opening production thrift port..."); - finagleServerManager.startProductionFinagleServer( - earlybirdDarkProxy.getDarkProxy(), this, SERVICE_NAME, port); - EarlybirdStatus.THRIFT_PORT_OPEN.set(true); - - if (REGISTER_WITH_ZK_ON_STARTUP) { - // After the earlybird starts up, register with ZooKeeper. - try { - joinServerSet("internal start-up"); - - // Join separate server set for ServiceProxy on Archive Earlybirds - if (!EarlybirdConfig.isAurora()) { - joinServerSetForServiceProxy(); - } - } catch (UpdateException e) { - throw new RuntimeException("Unable to join ServerSet during startup.", e); - } - } - } - } - } - } - - /** - * Stops the thrift server if the server is already running. - */ - public void stopThriftService(boolean shouldShutDown) { - synchronized (shutdownLock) { - try { - leaveServerSet(shouldShutDown ? "internal shutdown" : "admin stopThriftService"); - } catch (UpdateException e) { - LOG.warn("Leaving production ServerSet failed.", e); - } - - if (finagleServerManager.isProductionServerRunning()) { - try { - finagleServerManager.stopProductionFinagleServer(SERVER_CLOSE_WAIT_TIME); - futurePoolManager.stopUnderlyingFuturePool( - SERVER_CLOSE_WAIT_TIME.inSeconds(), TimeUnit.SECONDS); - numSearcherThreadsGauge.set(0); - } catch (InterruptedException e) { - LOG.error("Interrupted while stopping thrift service", e); - Thread.currentThread().interrupt(); - } - EarlybirdStatus.THRIFT_PORT_OPEN.set(false); - } - } - } - - /** - * Gets a string with information about the last request we've seen from each client. - */ - public Future getLastSearchesByClient(boolean includeResults) { - LastSearchesSummary summary = new LastSearchesSummary( - lastRequestPerClientId, clientIdSearchStats, includeResults); - return Future.value(summary.getSummary()); - } - - /** - * The following are all the Thrift RPC methods inherited from EarlybirdService.Iface - */ - - // Thrift getName RPC. - @Override - public Future getName() { - return Future.value(SERVICE_NAME); - } - - // Thrift getStatus RPC. - @Override - public Future getStatus() { - EarlybirdStatusResponse response = new EarlybirdStatusResponse(); - response.setCode(EarlybirdStatus.getStatusCode()); - response.setAliveSince(EarlybirdStatus.getStartTime()); - response.setMessage(EarlybirdStatus.getStatusMessage()); - return Future.value(response); - } - - public Future> getSegmentMetadata() { - return Future.value(segmentManager.getSegmentMetadata()); - } - - public Future getQueryCachesData() { - return Future.value(segmentManager.getQueryCachesData()); - } - - /** - * Get a text summary for which fields did we use in a schema. 
- */ - public Future getQueriedFieldsAndSchemaStats() { - ImmutableSchemaInterface schema = this.earlybirdIndexConfig.getSchema().getSchemaSnapshot(); - - QueriedFieldsAndSchemaStats summary = new QueriedFieldsAndSchemaStats(schema, - queriedFieldsCounts); - return Future.value(summary.getSummary()); - } - - /** - * Shuts down the earlybird server. - */ - public void shutdown() { - LOG.info("shutdown(): status set to STOPPING"); - EarlybirdStatus.setStatus(EarlybirdStatusCode.STOPPING); - try { - LOG.info("Stopping Finagle server."); - stopThriftService(true); - EarlybirdStatus.THRIFT_SERVICE_STARTED.set(false); - - if (queryCacheManager != null) { - queryCacheManager.shutdown(); - } else { - LOG.info("No queryCacheManager to shut down"); - } - - earlybirdIndexConfig.getResourceCloser().shutdownExecutor(); - - isShuttingDown = true; - LOG.info("Closing {} closeables.", toClose.size()); - for (Closeable closeable : toClose) { - closeable.close(); - } - } catch (InterruptedException | IOException e) { - EarlybirdStatus.setStatus(EarlybirdStatusCode.UNHEALTHY, e.getMessage()); - LOG.error("Interrupted during shutdown, status set to UNHEALTHY"); - } - LOG.info("Earlybird server stopped!"); - isShutdown = true; - } - - @Override - public Future search(final EarlybirdRequest request) { - final long requestReceivedTimeMillis = System.currentTimeMillis(); - // Record clock diff as early as possible. - EarlybirdRequestUtil.recordClientClockDiff(request); - - if (!futurePoolManager.isPoolReady()) { - return Future.exception(new TransientException("Earlybird not yet able to handle requests.")); - } - - return futurePoolManager.apply(new Function0() { - @Override - public EarlybirdResponse apply() { - return doSearch(request, requestReceivedTimeMillis); - } - }).rescue(Function.func( - // respond with Nack when the queue is full - t -> Future.exception((t instanceof RejectedExecutionException) ? QUEUE_FULL_FAILURE : t))); - } - - private EarlybirdResponse doSearch(EarlybirdRequest request, long requestReceivedTimeMillis) { - final long queueWaitTime = System.currentTimeMillis() - requestReceivedTimeMillis; - internalQueueWaitTimeStats.timerIncrement(queueWaitTime); - - // request restart time, not to be confused with startTime which is server restart time - Timer timer = new Timer(TimeUnit.MICROSECONDS); - - requestPreLogger.logRequest(request); - - String clientId = ClientIdUtil.getClientIdFromRequest(request); - String finagleClientId = FinagleUtil.getFinagleClientName(); - requestCountersByFinagleIdAndClientId.getUnchecked(new Pair<>(finagleClientId, clientId)) - .increment(); - - EarlybirdRequestUtil.checkAndSetCollectorParams(request); - - // If the thrift logger is busy logging, queue the thrift request for logging. 
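    // If the buffer is already full, offer() returns false and the request is simply not
    // buffered for logging; nothing is evicted.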
- if (EarlybirdThriftRequestLoggingUtil.thriftLoggerBusy) { - EarlybirdThriftRequestLoggingUtil.REQUEST_BUFFER.offer(request); - } - - EarlybirdRequestUtil.logAndFixExcessiveValues(request); - - final EarlybirdSearcher searcher = new EarlybirdSearcher( - request, - segmentManager, - audioSpaceTable, - queryCacheManager, - earlybirdIndexConfig.getSchema().getSchemaSnapshot(), - earlybirdIndexConfig.getCluster(), - dynamicPartitionConfig.getCurrentPartitionConfig(), - decider, - tweetsSearcherStats, - scoringModelsManager, - tensorflowModelsManager, - clock, - multiSegmentTermDictionaryManager, - queryTimeoutFactory, - qualityFactor); - - QueryCostTracker queryCostTracker = QueryCostTracker.getTracker(); - EarlybirdResponse response = null; - try { - if (skipTimedOutRequests - && searcher.getTerminationTracker().getTimeoutEndTimeWithReservation() - <= clock.nowMillis()) { - requestTimeoutExceededBeforeSearchCounter.increment(); - response = new EarlybirdResponse(); - response.setResponseCode(EarlybirdResponseCode.SERVER_TIMEOUT_ERROR); - } else { - queryCostTracker.reset(); - response = searcher.search(); - } - } finally { - if (response == null) { - // This can only happen if we failed to catch an exception in the searcher. - LOG.error("Response was null: " + request.toString()); - response = new EarlybirdResponse(); - response.setResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR); - } - - if (response.getSearchResults() == null) { - List emptyResultSet = Lists.newArrayList(); - response.setSearchResults(new ThriftSearchResults(emptyResultSet)); - } - - long reqLatency = timer.stop(); - response.setResponseTime(reqLatency / 1000); - response.setResponseTimeMicros(reqLatency); - response.getSearchResults().setQueryCost(queryCostTracker.getTotalCost()); - - requestLogger.logRequest(request, response, timer); - - int numResults = EarlybirdRequestLogger.numResultsForLog(response); - boolean success = response.getResponseCode() == EarlybirdResponseCode.SUCCESS; - boolean clientError = response.getResponseCode() == EarlybirdResponseCode.CLIENT_ERROR; - boolean earlyTerminated = (response.getSearchResults().isSetNumPartitionsEarlyTerminated() - && response.getSearchResults().getNumPartitionsEarlyTerminated() > 0) - || searcher.getTerminationTracker().isEarlyTerminated(); - // Update termination stats. - searcher.getTerminationTracker().getEarlyTerminationState().incrementCount(); - - searchStats.requestComplete(reqLatency, numResults, success, earlyTerminated, clientError); - if (searcher.getRequestStats() != null) { - searcher.getRequestStats().requestComplete(reqLatency, numResults, success, - earlyTerminated, clientError); - } - - getResponseCodeCounter(response.getResponseCode()).increment(); - // Adding this counter to make it easier to debug cases where we see a spike in - // bad client request errors but don't know where they're coming from. (The - // alternative is to ssh to a machine in the cluster and sample - // /var/log/earlybird/earlybird.failed_requests). - getClientIdResponseCodeCounter(clientId, response.getResponseCode()).increment(); - - // Export request latency as a stat. - clientIdSearchStats.getUnchecked(clientId).timerIncrement(reqLatency); - requestCountersByFinagleClientId.getUnchecked(finagleClientId).timerIncrement(reqLatency); - addEarlybirdServerStats(response, queueWaitTime); - // Export scoring stats for the request. 
- exportScoringTimeStats(response, clientId); - } - - Set queriedFields = searcher.getQueriedFields(); - if (queriedFields != null) { - for (String queriedField : queriedFields) { - queriedFieldsCounts.incrementAndGet(queriedField); - } - } - - // Increment counters for age of the returned results. - if (response.getSearchResults() != null && response.getSearchResults().getResults() != null) { - long currentTime = System.currentTimeMillis(); - for (ThriftSearchResult result : response.getSearchResults().getResults()) { - long tweetId = result.getId(); - if (SnowflakeId.isSnowflakeId(tweetId)) { - long ageMillis = Math.max(0L, - currentTime - SnowflakeId.unixTimeMillisFromId(tweetId)); - int ageDays = Duration.fromMilliseconds(ageMillis).inDays(); - - if (EarlybirdConfig.isRealtimeOrProtected()) { - String key = "result_age_in_days_" + ageDays; - resultsAgeCounter.getUnchecked(key).increment(); - } else { - int ageYears = ageDays / 365; - String key = "result_age_in_years_" + ageYears; - resultsAgeCounter.getUnchecked(key).increment(); - } - } - } - } - - try { - lastRequestPerClientId.get(clientId).set( - new RequestResponsePair(request, searcher.getParsedQuery(), - searcher.getLuceneQuery(), response)); - } catch (ExecutionException ex) { - // Not a big problem, we'll just notice that the admin page doesn't work, and it - // probably won't happen. - } - - - return response; - } - - private void exportScoringTimeStats(EarlybirdResponse response, String clientId) { - if (response.isSetSearchResults() - && response.getSearchResults().isSetScoringTimeNanos() - && response.getSearchResults().isSetNumHitsProcessed()) { - int numHitsProcessed = response.getSearchResults().getNumHitsProcessed(); - long scoringTimeNanos = response.getSearchResults().getScoringTimeNanos(); - - if (numHitsProcessed > 0) { - // Only compute and report scoring time per hit when we have hits. (i.e. we don't just want - // to report 0's for cases where there were no hits, and only want to report legit per-hit - // times. - long scoringTimePerHit = scoringTimeNanos / numHitsProcessed; - - this.clientIdScoringPerHitStats.getUnchecked(clientId).timerIncrement(scoringTimePerHit); - this.overallScoringTimePerHitStats.timerIncrement(scoringTimePerHit); - } - - this.clientIdScoringPerQueryStats.getUnchecked(clientId).timerIncrement(scoringTimeNanos); - this.overallScoringTimePerQueryStats.timerIncrement(scoringTimeNanos); - - // The num hits processed stats here are scoped only to queries that were actually scored. - // This would exclude queries like term stats (that would otherwise have huge num hits - // processed). 
- this.clientIdScoringNumHitsProcessedStats.getUnchecked(clientId).record(numHitsProcessed); - this.overallScoringNumHitsProcessedStats.record(numHitsProcessed); - } - } - - private void addEarlybirdServerStats(EarlybirdResponse response, long queueWaitTime) { - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - EarlybirdServerStats earlybirdServerStats = new EarlybirdServerStats(); - response.setEarlybirdServerStats(earlybirdServerStats); - earlybirdServerStats.setHostname(DatabaseConfig.getLocalHostname()); - earlybirdServerStats.setPartition(curPartitionConfig.getIndexingHashPartitionID()); - earlybirdServerStats.setTierName(curPartitionConfig.getTierName()); - earlybirdServerStats.setCurrentQps(searchStats.getRequestRate()); - earlybirdServerStats.setQueueTimeMillis(queueWaitTime); - earlybirdServerStats.setAverageQueueTimeMillis( - (long) (double) internalQueueWaitTimeStats.read()); - earlybirdServerStats.setAverageLatencyMicros(searchStats.getAverageLatency()); - } - - @Override - public void joinServerSet(String username) throws UpdateException { - serverSetManager.joinServerSet(username); - } - - - @Override - public int getNumberOfServerSetMembers() throws InterruptedException, - ZooKeeperClient.ZooKeeperConnectionException, KeeperException { - return serverSetManager.getNumberOfServerSetMembers(); - } - - @Override - public void leaveServerSet(String username) throws UpdateException { - serverSetManager.leaveServerSet(username); - } - - @Override - public void joinServerSetForServiceProxy() { - serverSetManager.joinServerSetForServiceProxy(); - } - - @VisibleForTesting - protected static class EarlybirdThriftRequestLoggingUtil { - private static final int DEFAULT_MAX_ENTRIES_TO_LOG = 50000; - private static final int DEFAULT_BUFFER_SIZE = 10000; - private static final int DEFAULT_LOGGING_SLEEP_MS = 100; - - @VisibleForTesting - protected static volatile boolean thriftLoggerBusy = false; - private static final ExecutorService LOGGING_EXECUTOR = Executors.newCachedThreadPool(); - - // Synchronized circular buffer used for buffering requests. - // If buffer is full, the oldest requests are replaced. This should not be a problem for - // logging purpose. - @VisibleForTesting - protected static final ArrayBlockingQueue REQUEST_BUFFER = - new ArrayBlockingQueue<>(DEFAULT_BUFFER_SIZE); - - - /** - * Create a separate thread to log thrift request to the given file. If a thread is already - * logging thrift requests, this does nothing and throws an IOException indicating that the - * logging thread is busy. - * - * @param logFile File to log to. - * @param maxEntriesToLog Number of entries to log. - * @param postLoggingHook Code to run after logging finishes. Only used for testing as of now. - */ - @VisibleForTesting - protected static synchronized void startThriftLogging(final File logFile, - final int maxEntriesToLog, - final Runnable postLoggingHook) - throws IOException { - if (thriftLoggerBusy) { - throw new IOException("Already busy logging thrift request. No action taken."); - } - - if (!logFile.canWrite()) { - throw new IOException("Unable to open log file for writing: " + logFile); - } - - final BufferedWriter thriftLogWriter = - Files.newBufferedWriter(logFile.toPath(), Charsets.UTF_8); - - // TSerializer used by the writer thread. 
- final TSerializer serializer = new TSerializer(); - - REQUEST_BUFFER.clear(); - thriftLoggerBusy = true; - LOG.info("Started to log thrift requests into file " + logFile.getAbsolutePath()); - LOGGING_EXECUTOR.submit(() -> { - try { - int count = 0; - while (count < maxEntriesToLog) { - if (REQUEST_BUFFER.isEmpty()) { - Thread.sleep(DEFAULT_LOGGING_SLEEP_MS); - continue; - } - - try { - EarlybirdRequest ebRequest = REQUEST_BUFFER.poll(); - String logLine = serializeThriftObject(ebRequest, serializer); - thriftLogWriter.write(logLine); - count++; - } catch (TException e) { - LOG.warn("Unable to serialize EarlybirdRequest for logging.", e); - } - } - return count; - } finally { - thriftLogWriter.close(); - thriftLoggerBusy = false; - LOG.info("Finished logging thrift requests into file " + logFile.getAbsolutePath()); - REQUEST_BUFFER.clear(); - if (postLoggingHook != null) { - postLoggingHook.run(); - } - } - }); - } - - /** - * Serialize a thrift object to a base 64 encoded string. - */ - private static String serializeThriftObject(TBase tObject, TSerializer serializer) - throws TException { - return new Base64().encodeToString(serializer.serialize(tObject)) + "\n"; - } - } - - /** - * Start to log thrift EarlybirdRequests. - * - * @param logFile Log file to write to. - * @param numRequestsToLog Number of requests to collect. Default value of 50000 used if - * 0 or negative numbers are pass in. - */ - public void startThriftLogging(File logFile, int numRequestsToLog) throws IOException { - int requestToLog = numRequestsToLog <= 0 - ? EarlybirdThriftRequestLoggingUtil.DEFAULT_MAX_ENTRIES_TO_LOG : numRequestsToLog; - EarlybirdThriftRequestLoggingUtil.startThriftLogging(logFile, requestToLog, null); - } - - @VisibleForTesting - @Override - public boolean isInServerSet() { - return serverSetManager.isInServerSet(); - } - - @VisibleForTesting - SearchCounter getResponseCodeCounter(EarlybirdResponseCode responseCode) { - return responseCodeCounters.get(responseCode); - } - - @VisibleForTesting - SearchCounter getClientIdResponseCodeCounter( - String clientId, EarlybirdResponseCode responseCode) { - String key = String.format(RESPONSES_PER_CLIENT_ID_STAT_TEMPLATE, - clientId, responseCode.name().toLowerCase()); - return responseByClientIdAndResponseCode.getUnchecked(key); - } - - public void setNoShutdownWhenNotInLayout(boolean noShutdown) { - stateManager.setNoShutdownWhenNotInLayout(noShutdown); - } - - private void schedule(OneTaskScheduledExecutorManager manager) { - if (!isShuttingDown) { - manager.schedule(); - toClose.add(manager); - } - } - - public DynamicSchema getSchema() { - return earlybirdIndexConfig.getSchema(); - } - - public AudioSpaceTable getAudioSpaceTable() { - return audioSpaceTable; - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdServerSetManager.java b/src/java/com/twitter/search/earlybird/EarlybirdServerSetManager.java deleted file mode 100644 index cd490992c..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdServerSetManager.java +++ /dev/null @@ -1,275 +0,0 @@ -package com.twitter.search.earlybird; - -import java.net.InetAddress; -import java.net.InetSocketAddress; -import java.util.concurrent.atomic.AtomicLong; - -import javax.annotation.concurrent.GuardedBy; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import org.apache.zookeeper.KeeperException; -import org.apache.zookeeper.Watcher; 
-import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.zookeeper.ServerSet; -import com.twitter.common.zookeeper.ZooKeeperClient; -import com.twitter.common_internal.zookeeper.TwitterServerSet; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.config.TierConfig; -import com.twitter.search.earlybird.exception.AlreadyInServerSetUpdateException; -import com.twitter.search.earlybird.exception.NotInServerSetUpdateException; -import com.twitter.search.earlybird.partition.PartitionConfig; - -public class EarlybirdServerSetManager implements ServerSetMember { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdServerSetManager.class); - - // How many times this earlybird joined/left its partition's server set - @VisibleForTesting - protected final SearchCounter leaveServerSetCounter; - @VisibleForTesting - protected final SearchCounter joinServerSetCounter; - private final ZooKeeperProxy discoveryZKClient; - private final SearchLongGauge inServerSetGauge; - private final PartitionConfig partitionConfig; - private final int port; - private final String serverSetNamePrefix; - - @VisibleForTesting - protected final SearchLongGauge connectedToZooKeeper; - - private final Object endpointStatusLock = new Object(); - @GuardedBy("endpointStatusLock") - private ServerSet.EndpointStatus endpointStatus = null; - - private boolean inServerSetForServiceProxy = false; - - public EarlybirdServerSetManager( - SearchStatsReceiver searchStatsReceiver, - ZooKeeperProxy discoveryZKClient, - final PartitionConfig partitionConfig, - int port, - String serverSetNamePrefix) { - this.discoveryZKClient = discoveryZKClient; - this.partitionConfig = partitionConfig; - this.port = port; - this.serverSetNamePrefix = serverSetNamePrefix; - - // Export serverset related stats - Preconditions.checkNotNull(searchStatsReceiver); - this.joinServerSetCounter = searchStatsReceiver.getCounter( - serverSetNamePrefix + "join_server_set_count"); - this.leaveServerSetCounter = searchStatsReceiver.getCounter( - serverSetNamePrefix + "leave_server_set_count"); - - // Create a new stat based on the partition number for hosts-in-partition aggregation. - // The value of the stat is dependent on whether the server is in the serverset so that the - // aggregate stat reflects the number serving traffic instead of the live process count. 
- AtomicLong sharedInServerSetStatus = new AtomicLong(); - this.inServerSetGauge = searchStatsReceiver.getLongGauge( - serverSetNamePrefix + "is_in_server_set", sharedInServerSetStatus); - this.connectedToZooKeeper = searchStatsReceiver.getLongGauge( - serverSetNamePrefix + "connected_to_zookeeper"); - - searchStatsReceiver.getLongGauge( - serverSetNamePrefix + "member_of_partition_" + partitionConfig.getIndexingHashPartitionID(), - sharedInServerSetStatus); - - this.discoveryZKClient.registerExpirationHandler(() -> connectedToZooKeeper.set(0)); - - this.discoveryZKClient.register(event -> { - if (event.getType() == Watcher.Event.EventType.None - && event.getState() == Watcher.Event.KeeperState.SyncConnected) { - connectedToZooKeeper.set(1); - } - }); - } - - /** - * Join ServerSet and update endpointStatus. - * This will allow Earlybird consumers, e.g. Blender, to detect when an - * Earlybird goes online and offline. - * @param username - */ - @Override - public void joinServerSet(String username) throws ServerSet.UpdateException { - joinServerSetCounter.increment(); - - synchronized (endpointStatusLock) { - LOG.info("Joining {} ServerSet (instructed by: {}) ...", serverSetNamePrefix, username); - if (endpointStatus != null) { - LOG.warn("Already in ServerSet. Nothing done."); - throw new AlreadyInServerSetUpdateException("Already in ServerSet. Nothing done."); - } - - try { - TwitterServerSet.Service service = getServerSetService(); - - ServerSet serverSet = discoveryZKClient.createServerSet(service); - endpointStatus = serverSet.join( - new InetSocketAddress(InetAddress.getLocalHost().getHostName(), port), - Maps.newHashMap(), - partitionConfig.getHostPositionWithinHashPartition()); - - inServerSetGauge.set(1); - - String path = service.getPath(); - EarlybirdStatus.recordEarlybirdEvent("Joined " + serverSetNamePrefix + " ServerSet " + path - + " (instructed by: " + username + ")"); - LOG.info("Successfully joined {} ServerSet {} (instructed by: {})", - serverSetNamePrefix, path, username); - } catch (Exception e) { - endpointStatus = null; - String message = "Failed to join " + serverSetNamePrefix + " ServerSet of partition " - + partitionConfig.getIndexingHashPartitionID(); - LOG.error(message, e); - throw new ServerSet.UpdateException(message, e); - } - } - } - - /** - * Takes this Earlybird out of its registered ServerSet. - * - * @throws ServerSet.UpdateException if there was a problem leaving the ServerSet, - * or if this Earlybird is already not in a ServerSet. - * @param username - */ - @Override - public void leaveServerSet(String username) throws ServerSet.UpdateException { - leaveServerSetCounter.increment(); - synchronized (endpointStatusLock) { - LOG.info("Leaving {} ServerSet (instructed by: {}) ...", serverSetNamePrefix, username); - if (endpointStatus == null) { - String message = "Not in a ServerSet. Nothing done."; - LOG.warn(message); - throw new NotInServerSetUpdateException(message); - } - - endpointStatus.leave(); - endpointStatus = null; - inServerSetGauge.set(0); - EarlybirdStatus.recordEarlybirdEvent("Left " + serverSetNamePrefix - + " ServerSet (instructed by: " + username + ")"); - LOG.info("Successfully left {} ServerSet. 
(instructed by: {})", - serverSetNamePrefix, username); - } - } - - @Override - public int getNumberOfServerSetMembers() - throws InterruptedException, ZooKeeperClient.ZooKeeperConnectionException, KeeperException { - String path = getServerSetService().getPath(); - return discoveryZKClient.getNumberOfServerSetMembers(path); - } - - /** - * Determines if this earlybird is in the server set. - */ - @Override - public boolean isInServerSet() { - synchronized (endpointStatusLock) { - return endpointStatus != null; - } - } - - /** - * Returns the server set that this earlybird should join. - */ - public String getServerSetIdentifier() { - TwitterServerSet.Service service = getServerSetService(); - return String.format("/cluster/local/%s/%s/%s", - service.getRole(), - service.getEnv(), - service.getName()); - } - - private TwitterServerSet.Service getServerSetService() { - // If the tier name is 'all' then it treat it as an untiered EB cluster - // and do not add the tier component into the ZK path it registers under. - String tierZKPathComponent = ""; - if (!TierConfig.DEFAULT_TIER_NAME.equalsIgnoreCase(partitionConfig.getTierName())) { - tierZKPathComponent = "/" + partitionConfig.getTierName(); - } - if (EarlybirdConfig.isAurora()) { - // ROLE, EARYLBIRD_NAME, and ENV properties are required on Aurora, thus will be set here - return new TwitterServerSet.Service( - EarlybirdProperty.ROLE.get(), - EarlybirdProperty.ENV.get(), - getServerSetPath(EarlybirdProperty.EARLYBIRD_NAME.get() + tierZKPathComponent)); - } else { - return new TwitterServerSet.Service( - DatabaseConfig.getZooKeeperRole(), - Config.getEnvironment(), - getServerSetPath("earlybird" + tierZKPathComponent)); - } - } - - private String getServerSetPath(String earlybirdName) { - return String.format("%s%s/hash_partition_%d", serverSetNamePrefix, earlybirdName, - partitionConfig.getIndexingHashPartitionID()); - } - - /** - * Join ServerSet for ServiceProxy with a named admin port and with a zookeeper path that Service - * Proxy can translate to a domain name label that is less than 64 characters (due to the size - * limit for domain name labels described here: https://tools.ietf.org/html/rfc1035) - * This will allow us to access Earlybirds that are not on mesos via ServiceProxy. - */ - @Override - public void joinServerSetForServiceProxy() { - // This additional Zookeeper server set is only necessary for Archive Earlybirds which are - // running on bare metal hardware, so ensure that this method is never called for services - // on Aurora. 
- Preconditions.checkArgument(!EarlybirdConfig.isAurora(), - "Attempting to join server set for ServiceProxy on Earlybird running on Aurora"); - - LOG.info("Attempting to join ServerSet for ServiceProxy"); - try { - TwitterServerSet.Service service = getServerSetForServiceProxyOnArchive(); - - ServerSet serverSet = discoveryZKClient.createServerSet(service); - String hostName = InetAddress.getLocalHost().getHostName(); - int adminPort = EarlybirdConfig.getAdminPort(); - serverSet.join( - new InetSocketAddress(hostName, port), - ImmutableMap.of("admin", new InetSocketAddress(hostName, adminPort)), - partitionConfig.getHostPositionWithinHashPartition()); - - String path = service.getPath(); - LOG.info("Successfully joined ServerSet for ServiceProxy {}", path); - inServerSetForServiceProxy = true; - } catch (Exception e) { - String message = "Failed to join ServerSet for ServiceProxy of partition " - + partitionConfig.getIndexingHashPartitionID(); - LOG.warn(message, e); - } - } - - @VisibleForTesting - protected TwitterServerSet.Service getServerSetForServiceProxyOnArchive() { - String serverSetPath = String.format("proxy/%s/p_%d", - partitionConfig.getTierName(), - partitionConfig.getIndexingHashPartitionID()); - return new TwitterServerSet.Service( - DatabaseConfig.getZooKeeperRole(), - Config.getEnvironment(), - serverSetPath); - } - - @VisibleForTesting - protected boolean isInServerSetForServiceProxy() { - return inServerSetForServiceProxy; - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdStatus.java b/src/java/com/twitter/search/earlybird/EarlybirdStatus.java deleted file mode 100644 index 49ee768e7..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdStatus.java +++ /dev/null @@ -1,204 +0,0 @@ -package com.twitter.search.earlybird; - -import java.text.SimpleDateFormat; -import java.util.Date; -import java.util.List; -import java.util.Optional; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicBoolean; - -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.BuildInfo; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.util.Duration; - -/** - * High level status of an Earlybird server. 
SEARCH-28016 - */ -public final class EarlybirdStatus { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdStatus.class); - - private static final String BUILD_SHA = getBuildShaFromVars(); - - protected static long startTime; - protected static EarlybirdStatusCode statusCode; - protected static String statusMessage; - protected static final AtomicBoolean THRIFT_PORT_OPEN = new AtomicBoolean(false); - protected static final AtomicBoolean WARMUP_THRIFT_PORT_OPEN = new AtomicBoolean(false); - protected static final AtomicBoolean THRIFT_SERVICE_STARTED = new AtomicBoolean(false); - - private static final List EARLYBIRD_SERVER_EVENTS = Lists.newArrayList(); - private static class EarlybirdEvent { - private final String eventName; - private final long timestampMillis; - private final long timeSinceServerStartMillis; - private final long durationMillis; - - public EarlybirdEvent(String eventName, long timestampMillis) { - this(eventName, timestampMillis, -1); - } - - public EarlybirdEvent( - String eventName, - long timestampMillis, - long eventDurationMillis) { - this.eventName = eventName; - this.timestampMillis = timestampMillis; - this.timeSinceServerStartMillis = timestampMillis - startTime; - this.durationMillis = eventDurationMillis; - } - - public String getEventLogString() { - String result = String.format( - "%s %s", - new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(timestampMillis)), - eventName); - - if (durationMillis > 0) { - result += String.format( - ", took: %s", Duration.apply(durationMillis, TimeUnit.MILLISECONDS).toString()); - } - - result += String.format( - ", time since server start: %s", - Duration.apply(timeSinceServerStartMillis, TimeUnit.MILLISECONDS).toString() - ); - - return result; - } - } - - private EarlybirdStatus() { - } - - public static synchronized void setStartTime(long time) { - startTime = time; - LOG.info("startTime set to " + time); - } - - public static synchronized void setStatus(EarlybirdStatusCode code) { - setStatus(code, null); - } - - public static synchronized void setStatus(EarlybirdStatusCode code, String message) { - statusCode = code; - statusMessage = message; - LOG.info("status set to " + code + (message != null ? " with message " + message : "")); - } - - public static synchronized long getStartTime() { - return startTime; - } - - public static synchronized boolean isStarting() { - return statusCode == EarlybirdStatusCode.STARTING; - } - - public static synchronized boolean hasStarted() { - return statusCode == EarlybirdStatusCode.CURRENT; - } - - public static boolean isThriftServiceStarted() { - return THRIFT_SERVICE_STARTED.get(); - } - - public static synchronized EarlybirdStatusCode getStatusCode() { - return statusCode; - } - - public static synchronized String getStatusMessage() { - return (statusMessage == null ? "" : statusMessage + ", ") - + "warmup thrift port is " + (WARMUP_THRIFT_PORT_OPEN.get() ? "OPEN" : "CLOSED") - + ", production thrift port is " + (THRIFT_PORT_OPEN.get() ? "OPEN" : "CLOSED"); - } - - public static synchronized void recordEarlybirdEvent(String eventName) { - long timeMillis = System.currentTimeMillis(); - EARLYBIRD_SERVER_EVENTS.add(new EarlybirdEvent(eventName, timeMillis)); - } - - private static String getBeginEventMessage(String eventName) { - return "[Begin Event] " + eventName; - } - - private static String getEndEventMessage(String eventName) { - return "[ End Event ] " + eventName; - } - - /** - * Records the beginning of the given event. 
- * - * @param eventName The event name. - * @param startupMetric The metric that will be used to keep track of the time for this event. - */ - public static synchronized void beginEvent(String eventName, - SearchIndexingMetricSet.StartupMetric startupMetric) { - long timeMillis = System.currentTimeMillis(); - String eventMessage = getBeginEventMessage(eventName); - LOG.info(eventMessage); - EARLYBIRD_SERVER_EVENTS.add(new EarlybirdEvent(eventMessage, timeMillis)); - - startupMetric.begin(); - } - - /** - * Records the end of the given event. - * - * @param eventName The event name. - * @param startupMetric The metric used to keep track of the time for this event. - */ - public static synchronized void endEvent(String eventName, - SearchIndexingMetricSet.StartupMetric startupMetric) { - long timeMillis = System.currentTimeMillis(); - - String beginEventMessage = getBeginEventMessage(eventName); - Optional beginEventOpt = EARLYBIRD_SERVER_EVENTS.stream() - .filter(event -> event.eventName.equals(beginEventMessage)) - .findFirst(); - - String eventMessage = getEndEventMessage(eventName); - LOG.info(eventMessage); - EarlybirdEvent endEvent = new EarlybirdEvent( - eventMessage, - timeMillis, - beginEventOpt.map(e -> timeMillis - e.timestampMillis).orElse(-1L)); - - EARLYBIRD_SERVER_EVENTS.add(endEvent); - - startupMetric.end(endEvent.durationMillis); - } - - public static synchronized void clearAllEvents() { - EARLYBIRD_SERVER_EVENTS.clear(); - } - - public static String getBuildSha() { - return BUILD_SHA; - } - - /** - * Returns the list of all earlybird events that happened since the server started. - */ - public static synchronized Iterable getEarlybirdEvents() { - List eventLog = Lists.newArrayListWithCapacity(EARLYBIRD_SERVER_EVENTS.size()); - for (EarlybirdEvent event : EARLYBIRD_SERVER_EVENTS) { - eventLog.add(event.getEventLogString()); - } - return eventLog; - } - - private static String getBuildShaFromVars() { - BuildInfo buildInfo = new BuildInfo(); - String buildSha = buildInfo.getProperties().getProperty(BuildInfo.Key.GIT_REVISION.value); - if (buildSha != null) { - return buildSha; - } else { - return "UNKNOWN"; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/EarlybirdWarmUpManager.java b/src/java/com/twitter/search/earlybird/EarlybirdWarmUpManager.java deleted file mode 100644 index 446bd9171..000000000 --- a/src/java/com/twitter/search/earlybird/EarlybirdWarmUpManager.java +++ /dev/null @@ -1,100 +0,0 @@ -package com.twitter.search.earlybird; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common.zookeeper.ServerSet; -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; - -public class EarlybirdWarmUpManager { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdWarmUpManager.class); - private static final String WARM_UP_ON_DURATION_DECIDER_KEY_PATTERN = - "%s_warm_up_duration_seconds"; - - private final EarlybirdServerSetManager earlybirdServerSetManager; - private final String clusterName; - private final SearchIndexingMetricSet.StartupMetric startUpInWarmUpMetric; - private final Decider decider; - private final Clock clock; - - public 
EarlybirdWarmUpManager(EarlybirdServerSetManager earlybirdServerSetManager, - PartitionConfig partitionConfig, - SearchIndexingMetricSet searchIndexingMetricSet, - Decider decider, - Clock clock) { - this.earlybirdServerSetManager = earlybirdServerSetManager; - this.clusterName = partitionConfig.getClusterName(); - this.startUpInWarmUpMetric = searchIndexingMetricSet.startupInWarmUp; - this.decider = decider; - this.clock = clock; - } - - public String getServerSetIdentifier() { - return earlybirdServerSetManager.getServerSetIdentifier(); - } - - /** - * Warms up the earlybird. The earlybird joins a special server set that gets production dark - * reads, and leaves this server set after a specified period of time. - */ - public void warmUp() throws InterruptedException, ServerSet.UpdateException { - int warmUpDurationSeconds = DeciderUtil.getAvailability( - decider, - String.format(WARM_UP_ON_DURATION_DECIDER_KEY_PATTERN, clusterName.replaceAll("-", "_"))); - if (warmUpDurationSeconds == 0) { - LOG.info(String.format("Warm up stage duration for cluster %s set to 0. Skipping.", - clusterName)); - return; - } - - earlybirdServerSetManager.joinServerSet("internal warm up"); - - // If doWarmUp() is interrupted, try to leave the server set, and propagate the - // InterruptedException. Otherwise, try to leave the server set, and propagate any exception - // that it might throw. - InterruptedException warmUpInterruptedException = null; - try { - doWarmUp(warmUpDurationSeconds); - } catch (InterruptedException e) { - warmUpInterruptedException = e; - throw e; - } finally { - if (warmUpInterruptedException != null) { - try { - earlybirdServerSetManager.leaveServerSet("internal warm up"); - } catch (Exception e) { - warmUpInterruptedException.addSuppressed(e); - } - } else { - earlybirdServerSetManager.leaveServerSet("internal warm up"); - } - } - } - - @VisibleForTesting - protected void doWarmUp(int warmUpDurationSeconds) throws InterruptedException { - long warmUpStartTimeMillis = clock.nowMillis(); - LOG.info(String.format("Warming up for %d seconds.", warmUpDurationSeconds)); - EarlybirdStatus.beginEvent("warm_up", startUpInWarmUpMetric); - - // Sleep for warmUpDurationSeconds seconds, but check if the server is going down every second. - int count = 0; - try { - while ((count++ < warmUpDurationSeconds) - && (EarlybirdStatus.getStatusCode() != EarlybirdStatusCode.STOPPING)) { - clock.waitFor(1000); - } - } finally { - LOG.info(String.format("Done warming up after %d milliseconds.", - clock.nowMillis() - warmUpStartTimeMillis)); - EarlybirdStatus.endEvent("warm_up", startUpInWarmUpMetric); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/QualityFactor.java b/src/java/com/twitter/search/earlybird/QualityFactor.java deleted file mode 100644 index 601426cdf..000000000 --- a/src/java/com/twitter/search/earlybird/QualityFactor.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird; - -/** - * Interface defining a quality factor. - */ -public interface QualityFactor { - /** - * Returns the current quality factor. - * @return The quality factor; a number between 0.0 and 1.0. - */ - double get(); - - /** - * Starts a thread to update the quality factor periodically. 
- */ - void startUpdates(); -} diff --git a/src/java/com/twitter/search/earlybird/README.md b/src/java/com/twitter/search/earlybird/README.md deleted file mode 100644 index c26adedcf..000000000 --- a/src/java/com/twitter/search/earlybird/README.md +++ /dev/null @@ -1,83 +0,0 @@ -# Search Index (Earlybird) main classes - -> **TL;DR** Earlybird (Search Index) find tweets from people you follow, rank them, and serve them to Home. - -## What is Earlybird (Search Index) - -[Earlybird](http://notes.stephenholiday.com/Earlybird.pdf) is a **real-time search system** based on [Apache Lucene](https://lucene.apache.org/) to support the high volume of queries and content updates. The major use cases are Relevance Search (specifically, Text search) and Timeline In-network Tweet retrieval (or UserID based search). It is designed to enable the efficient indexing and querying of billions of tweets, and to provide low-latency search results, even with heavy query loads. - -## High-level architecture -We split our entire tweet search index into three clusters: a **realtime** cluster indexing all public tweets posted in about the last 7 days, a **protected** cluster indexing all protected tweets for the same timeframe; and an **archive** cluster indexing all tweets ever posted, up to about two days ago. - -Earlybird addresses the challenges of scaling real-time search by splitting each cluster across multiple **partitions**, each responsible for a portion of the index. The architecture uses a distributed *inverted index* that is sharded and replicated. This design allows for efficient index updates and query processing. - -The system also employs an incremental indexing approach, enabling it to process and index new tweets in real-time as they arrive. With single writer, multiple reader structure, Earlybird can handle a large number of real-time updates and queries concurrently while maintaining low query latency. The system can achieve high query throughput and low query latency while maintaining a high degree of index freshness. - -## Main Components - -**Partition Manager**: Responsible for managing the configuration of partitions, as well as the mapping between users and partitions. It also handles index loading and flushing. - -**Real-time Indexer**: Continuously reads from a kafka stream of incoming tweets and updates the index (tweet creation, tweet updates, user updates). It also supports tweet deletion events. - -**Query Engine**: Handles the execution of search queries against the distributed index. It employs various optimization techniques, such as term-based pruning and caching. - -**Document Preprocessor**: Converts raw tweets into a document representation suitable for indexing. It handles tokenization, normalization, and analysis of tweet text and metadata. See our ingestion pipeline `src/java/com/twitter/search/ingester` for more write-path processing. - -**Index Writer**: Writes tweet documents to the index and maintains the index structure, including **posting lists** and **term dictionaries**. - -**Segment Manager**: Manages index segments within a partition. It is responsible for merging, optimizing, and flushing index segments to disk, or flush to HDFS to snapshot live segments. - -**Searcher**: Executes queries against the index, using techniques like caching and parallel query execution to minimize query latency. It also incorporates scoring models and ranking algorithms to provide relevant search results. 
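To make the inverted index and postings list described just below more concrete, here is a minimal, hypothetical Java sketch of a per-field inverted index with a conjunctive term lookup. All names (`ToyInvertedIndex`, `searchAll`, etc.) are invented for illustration and are not real Earlybird classes; the production structures are Lucene-based, heavily compressed, and maintained per segment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy illustration only: a per-field inverted index mapping terms to posting
 * lists of tweet IDs, with a conjunctive ("all terms must match") lookup.
 * Names are invented for this sketch and do not exist in the Earlybird code.
 */
public final class ToyInvertedIndex {
  // field -> term -> posting list of tweet IDs, kept in ascending order.
  private final Map<String, Map<String, List<Long>>> index = new HashMap<>();

  /** Indexes a single term occurrence for the given field and tweet ID. */
  public void add(String field, String term, long tweetId) {
    List<Long> postings = index
        .computeIfAbsent(field, f -> new HashMap<>())
        .computeIfAbsent(term, t -> new ArrayList<>());
    // Keep the posting list sorted and free of duplicates.
    int pos = Collections.binarySearch(postings, tweetId);
    if (pos < 0) {
      postings.add(-pos - 1, tweetId);
    }
  }

  /** Returns tweet IDs whose given field contains all of the query terms. */
  public List<Long> searchAll(String field, List<String> terms) {
    Map<String, List<Long>> fieldIndex = index.getOrDefault(field, Map.of());
    List<Long> result = null;
    for (String term : terms) {
      List<Long> postings = fieldIndex.getOrDefault(term, List.of());
      result = (result == null) ? new ArrayList<>(postings) : intersect(result, postings);
      if (result.isEmpty()) {
        break;  // Early exit: the conjunction can no longer match anything.
      }
    }
    return result == null ? List.of() : result;
  }

  /** Linear-merge intersection of two ascending posting lists. */
  private static List<Long> intersect(List<Long> a, List<Long> b) {
    List<Long> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < a.size() && j < b.size()) {
      int cmp = a.get(i).compareTo(b.get(j));
      if (cmp == 0) {
        out.add(a.get(i));
        i++;
        j++;
      } else if (cmp < 0) {
        i++;
      } else {
        j++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    ToyInvertedIndex idx = new ToyInvertedIndex();
    idx.add("text", "cat", 101L);
    idx.add("text", "dog", 101L);
    idx.add("text", "cat", 205L);
    // Tweets containing both "cat" and "dog" in the text field: [101]
    System.out.println(idx.searchAll("text", List.of("cat", "dog")));
  }
}
```

In the real system the posting lists are packed into specialized in-memory encodings (see the omnisearch index formats post referenced below) rather than boxed Java lists.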
-
-The two most important data structures for Earlybird (or Information Retrieval in general) are:
-
-* **Inverted Index**, which stores a mapping from a Term to a list of Doc IDs. Essentially, we build a hash map: each key in the map is a distinct Term (e.g., `cat`, `dog`) in a tweet, and each value is the list of tweets (a.k.a. Documents) in which the word appears. We keep one inverted index per field (text, UserID, user name, links, etc.)
-* **Postings List**, which optimizes the storage of the list of Doc IDs mentioned above.
-
-See more at: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats
-
-## Advanced features
-
-Earlybird incorporates several advanced features such as facet search, which allows users to refine search results based on specific attributes such as user mentions, hashtags, and URLs. Furthermore, the system supports various ranking models, including machine learning-based scoring models, to provide relevant search results.
-
-## Directory Structure
-The project consists of several packages and files, which can be summarized as follows:
-
-* At the root level, the primary focus is on the Earlybird server implementation and its associated classes. These include classes for search, CPU quality factors, server management, index config, main classes, server startup, etc.
-* `archive/`: Deals with the management and configuration of archived data, specifically for Earlybird Index Configurations. It also contains a `segmentbuilder/` subdirectory, which includes classes for building and updating archive index segments.
-* `common/`: Holds utility classes for logging, handling requests, and Thrift backend functionality. It also has two subdirectories: `config/` for Earlybird configuration and `userupdates/` for user-related data handling.
-* `config/`: Dedicated to managing tier configurations for the archive cluster, which relate to server and search query distribution.
-* `document/`: Handles document creation and processing, including various factories and token stream writers.
-* `exception/`: Contains custom exceptions and exception handling classes related to the system.
-* `factory/`: Provides utilities and factories for configurations, Kafka consumers, and server instances.
-* `index/`: Contains index-related classes, including in-memory time mappers, tweet ID mappers, and facets.
-* `ml/`: Houses the `ScoringModelsManager` for managing machine learning models.
-* `partition/`: Manages partitions and index segments, including index loaders, segment writers, and startup indexers.
-* `querycache/`: Implements caching for queries and query results, including cache configuration and update tasks.
-* `queryparser/`: Provides query parsing functionality, including query rewriters and high-frequency term extraction.
-* `search/`: Contains read-path classes, such as search request processing, result collectors, and facet collectors.
-* `segment/`: Provides classes for managing segment data providers and data reader sets.
-* `stats/`: Contains classes for tracking and reporting statistics related to the system.
-* `tools/`: Houses utility classes for deserializing thrift requests.
-* `util/`: Includes utility classes for various tasks, such as action logging, scheduled tasks, and JSON viewers.
-
-## Related Services
-
-* The Earlybirds sit behind Earlybird Root servers that fan out queries to them.
See `src/java/com/twitter/search/earlybird_root/` -* The Earlybirds are powered by multiple ingestion pipelines. See `src/java/com/twitter/search/ingester/` -* Earlybird segments for the Archives are built offline by segment builders -* Also, Earlybird light ranking is defined in `timelines/data_processing/ad_hoc/earlybird_ranking` - and `src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird`. -* Search common library/packages - -## References - -See more: - -* "Earlybird: Real-Time Search at Twitter" (http://notes.stephenholiday.com/Earlybird.pdf) -* "Reducing search indexing latency to one second" (https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second) -* "Omnisearch index formats" (https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats) - - - - diff --git a/src/java/com/twitter/search/earlybird/RealtimeEarlybirdIndexConfig.java b/src/java/com/twitter/search/earlybird/RealtimeEarlybirdIndexConfig.java deleted file mode 100644 index 5e95903f0..000000000 --- a/src/java/com/twitter/search/earlybird/RealtimeEarlybirdIndexConfig.java +++ /dev/null @@ -1,128 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.RAMDirectory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.schema.SearchWhitespaceAnalyzer; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.CloseResourceUtil; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.core.earlybird.index.inverted.IndexOptimizer; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.OptimizedTimeMapper; -import com.twitter.search.earlybird.index.OptimizedTweetIDMapper; -import com.twitter.search.earlybird.index.OutOfOrderRealtimeTweetIDMapper; -import com.twitter.search.earlybird.index.RealtimeTimeMapper; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentSyncInfo; - -/** - * Index config for the Real-Time in-memory Tweet cluster. 
- */ -public class RealtimeEarlybirdIndexConfig extends EarlybirdIndexConfig { - private final CloseResourceUtil resourceCloser = new CloseResourceUtil(); - - public RealtimeEarlybirdIndexConfig( - EarlybirdCluster cluster, Decider decider, SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - super(cluster, decider, searchIndexingMetricSet, criticalExceptionHandler); - } - - public RealtimeEarlybirdIndexConfig( - EarlybirdCluster cluster, DynamicSchema schema, Decider decider, - SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - super(cluster, schema, decider, searchIndexingMetricSet, criticalExceptionHandler); - } - - @Override - public Directory newLuceneDirectory(SegmentSyncInfo segmentSyncInfo) { - return new RAMDirectory(); - } - - @Override - public IndexWriterConfig newIndexWriterConfig() { - return new IndexWriterConfig(new SearchWhitespaceAnalyzer()) - .setSimilarity(IndexSearcher.getDefaultSimilarity()); - } - - @Override - public EarlybirdIndexSegmentData newSegmentData( - int maxSegmentSize, - long timeSliceID, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory) { - return new EarlybirdRealtimeIndexSegmentData( - maxSegmentSize, - timeSliceID, - getSchema(), - new OutOfOrderRealtimeTweetIDMapper(maxSegmentSize, timeSliceID), - new RealtimeTimeMapper(maxSegmentSize), - extensionsFactory); - } - - @Override - public EarlybirdIndexSegmentData loadSegmentData( - FlushInfo flushInfo, - DataDeserializer dataInputStream, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory) throws IOException { - EarlybirdRealtimeIndexSegmentData.InMemorySegmentDataFlushHandler flushHandler; - boolean isOptimized = flushInfo.getBooleanProperty( - EarlybirdIndexSegmentData.AbstractSegmentDataFlushHandler.IS_OPTIMIZED_PROP_NAME); - if (isOptimized) { - flushHandler = new EarlybirdRealtimeIndexSegmentData.InMemorySegmentDataFlushHandler( - getSchema(), - extensionsFactory, - new OptimizedTweetIDMapper.FlushHandler(), - new OptimizedTimeMapper.FlushHandler()); - } else { - flushHandler = new EarlybirdRealtimeIndexSegmentData.InMemorySegmentDataFlushHandler( - getSchema(), - extensionsFactory, - new OutOfOrderRealtimeTweetIDMapper.FlushHandler(), - new RealtimeTimeMapper.FlushHandler()); - } - - - return flushHandler.load(flushInfo, dataInputStream); - } - - @Override - public EarlybirdIndexSegmentData optimize( - EarlybirdIndexSegmentData earlybirdIndexSegmentData) throws IOException { - Preconditions.checkArgument( - earlybirdIndexSegmentData instanceof EarlybirdRealtimeIndexSegmentData, - "Expected EarlybirdRealtimeIndexSegmentData but got %s", - earlybirdIndexSegmentData.getClass()); - - return IndexOptimizer.optimize((EarlybirdRealtimeIndexSegmentData) earlybirdIndexSegmentData); - } - - @Override - public boolean isIndexStoredOnDisk() { - return false; - } - - @Override - public final CloseResourceUtil getResourceCloser() { - return resourceCloser; - } - - @Override - public boolean supportOutOfOrderIndexing() { - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird/RecentTweetRestriction.java b/src/java/com/twitter/search/earlybird/RecentTweetRestriction.java deleted file mode 100644 index 7c25bfb14..000000000 --- a/src/java/com/twitter/search/earlybird/RecentTweetRestriction.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.earlybird; - -import scala.Option; - -import com.google.common.annotations.VisibleForTesting; - -import 
com.twitter.decider.Decider; - -public final class RecentTweetRestriction { - private static final String RECENT_TWEETS_THRESHOLD = "recent_tweets_threshold"; - private static final String QUERY_CACHE_UNTIL_TIME = "query_cache_until_time"; - - @VisibleForTesting - public static final int DEFAULT_RECENT_TWEET_SECONDS = 15; - - private RecentTweetRestriction() { - } - - /** - * Returns the point in time (in seconds past the unix epoch) before which all tweets will be - * completely indexed. This is required by some clients, because they rely on Earlybird monotonically - * indexing tweets by ID and that tweets are completely indexed when they see them. - * - * @param lastTime The time at which the most recent tweet was indexed, in seconds since the unix - * epoch. - */ - public static int recentTweetsUntilTime(Decider decider, int lastTime) { - return untilTimeSeconds(decider, lastTime, RECENT_TWEETS_THRESHOLD); - } - - /** - * Returns the point in time (in seconds past the unix epoch) before which all tweets will be - * completely indexed. This is required by some clients, because they rely on Earlybird monotonically - * indexing tweets by ID and that tweets are completely indexed when they see them. - * - * @param lastTime The time at which the most recent tweet was indexed, in seconds since the unix - * epoch. - */ - public static int queryCacheUntilTime(Decider decider, int lastTime) { - return untilTimeSeconds(decider, lastTime, QUERY_CACHE_UNTIL_TIME); - } - - private static int untilTimeSeconds(Decider decider, int lastTime, String deciderKey) { - int recentTweetSeconds = getRecentTweetSeconds(decider, deciderKey); - - if (recentTweetSeconds == 0) { - return 0; - } - - return lastTime - recentTweetSeconds; - } - - private static int getRecentTweetSeconds(Decider decider, String deciderKey) { - Option deciderValue = decider.getAvailability(deciderKey); - if (deciderValue.isDefined()) { - return (int) deciderValue.get(); - } - return DEFAULT_RECENT_TWEET_SECONDS; - } -} diff --git a/src/java/com/twitter/search/earlybird/ServerSetMember.java b/src/java/com/twitter/search/earlybird/ServerSetMember.java deleted file mode 100644 index a287f6950..000000000 --- a/src/java/com/twitter/search/earlybird/ServerSetMember.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.earlybird; - -import org.apache.zookeeper.KeeperException; - -import com.twitter.common.zookeeper.ServerSet; -import com.twitter.common.zookeeper.ZooKeeperClient; - -/** - * Represents a server that can add and remove itself from a server set. - */ -public interface ServerSetMember { - /** - * Makes this server join its server set. - * - * @throws ServerSet.UpdateException - * @param requestSource - */ - void joinServerSet(String requestSource) throws ServerSet.UpdateException; - - /** - * Makes this server leave its server set. - * - * @throws ServerSet.UpdateException - * @param requestSource - */ - void leaveServerSet(String requestSource) throws ServerSet.UpdateException; - - /** - * Gets and returns the current number of members in this server's server set. - * - * @return number of members currently in this host's server set. - * @throws InterruptedException - * @throws ZooKeeperClient.ZooKeeperConnectionException - * @throws KeeperException - */ - int getNumberOfServerSetMembers() throws InterruptedException, - ZooKeeperClient.ZooKeeperConnectionException, KeeperException; - - /** - * Checks if this earlybird is in the server set. - * - * @return true if it is, false otherwise. 
- */ - boolean isInServerSet(); - - /** - * Should only be called for Archive Earlybirds. - * - * Join ServerSet for ServiceProxy with a named admin port and with a zookeeper path that Service - * Proxy can translate to a domain name label that is less than 64 characters (due to the size - * limit for domain name labels described here: https://tools.ietf.org/html/rfc1035) - * This will allow us to access Earlybirds that are not on mesos via ServiceProxy. - */ - void joinServerSetForServiceProxy(); -} diff --git a/src/java/com/twitter/search/earlybird/UpdateableEarlybirdStateManager.java b/src/java/com/twitter/search/earlybird/UpdateableEarlybirdStateManager.java deleted file mode 100644 index 24b1cbce3..000000000 --- a/src/java/com/twitter/search/earlybird/UpdateableEarlybirdStateManager.java +++ /dev/null @@ -1,437 +0,0 @@ -package com.twitter.search.earlybird; - -import java.io.File; -import java.io.IOException; -import java.util.Random; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Charsets; - -import org.apache.thrift.TException; -import org.apache.zookeeper.KeeperException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common.zookeeper.ZooKeeperClient; -import com.twitter.search.common.aurora.AuroraSchedulerClient; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.file.LocalFile; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.schema.AnalyzerFactory; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.schema.ImmutableSchema; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.thriftjava.ThriftSchema; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.common.util.thrift.ThriftUtils; -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.PartitionConfigLoader; -import com.twitter.search.earlybird.partition.PartitionConfigLoadingException; -import com.twitter.search.earlybird.util.OneTaskScheduledExecutorManager; -import com.twitter.search.earlybird.util.PeriodicActionParams; -import com.twitter.search.earlybird.util.ShutdownWaitTimeParams; - -/** - * A class that keeps track of Earlybird state that may change while an Earlybird runs, and keeps - * that state up to date. Currently keeps track of the current Earlybird schema and partition - * configuration, and periodically updates them from Zookeeper. It also reloads periodically the - * scoring models from HDFS. 
- */ -public class UpdateableEarlybirdStateManager extends OneTaskScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(UpdateableEarlybirdStateManager.class); - public static final String SCHEMA_SUFFIX = ".schema.v"; - - private static final String THREAD_NAME_PATTERN = "state_update-%d"; - private static final boolean THREAD_IS_DAEMON = true; - private static final long EXECUTOR_SHUTDOWN_WAIT_SEC = 5; - - private static final String DEFAULT_ZK_SCHEMA_LOCATION = - "/twitter/search/production/earlybird/schema"; - private static final String DEFAULT_LOCAL_SCHEMA_LOCATION = - "/home/search/earlybird_schema_canary"; - private static final long DEFAULT_UPDATE_PERIOD_MILLIS = - TimeUnit.MINUTES.toMillis(30); - - private static final String SCHEMA_MAJOR_VERSION_NAME = - "schema_major_version"; - private static final String SCHEMA_MINOR_VERSION_NAME = - "schema_minor_version"; - private static final String LAST_SUCCESSFUL_SCHEMA_RELOAD_TIME_MILLIS_NAME = - "last_successful_schema_reload_timestamp_millis"; - @VisibleForTesting - static final String FAIL_TO_LOAD_SCHEMA_COUNT_NAME = - "fail_to_load_schema_count"; - @VisibleForTesting - static final String HOST_IS_CANARY_SCHEME = "host_is_canary_schema"; - @VisibleForTesting - static final String DID_NOT_FIND_SCHEMA_COUNT_NAME = - "did_not_find_schema_count"; - private static final String LAST_SUCCESSFUL_PARTITION_CONFIG_RELOAD_TIME_MILLIS_NAME = - "last_successful_partition_config_reload_timestamp_millis"; - @VisibleForTesting - static final String FAIL_TO_LOAD_PARTITION_CONFIG_COUNT_NAME = - "fail_to_load_partition_config_count"; - @VisibleForTesting - static final String HOST_IS_IN_LAYOUT_STAT_NAME = "host_is_in_layout"; - private static final String NOT_IN_LAYOUT_SHUT_DOWN_ATTEMPTED_NAME = - "not_in_layout_shut_down_attempted"; - - private static final String SHUT_DOWN_EARLYBIRD_WHEN_NOT_IN_LAYOUT_DECIDER_KEY = - "shut_down_earlybird_when_not_in_layout"; - - private static final String NO_SHUTDOWN_WHEN_NOT_IN_LAYOUT_NAME = - "no_shutdown_when_not_in_layout"; - - private final SearchLongGauge schemaMajorVersion; - private final SearchLongGauge schemaMinorVersion; - private final SearchLongGauge lastSuccessfulSchemaReloadTimeMillis; - private final SearchCounter failToLoadSchemaCount; - private final SearchLongGauge hostIsCanarySchema; - private final SearchCounter didNotFindSchemaCount; - private final SearchLongGauge lastSuccessfulPartitionConfigReloadTimeMillis; - private final SearchCounter failToLoadPartitionConfigCount; - private final SearchLongGauge hostIsInLayout; - private final SearchCounter notInLayoutShutDownAttemptedCount; - private final SearchLongGauge noShutdownWhenNotInLayoutGauge; - - private final EarlybirdIndexConfig indexConfig; - private final DynamicPartitionConfig partitionConfig; - private final String schemaLocationOnLocal; - private final String schemaLocationOnZK; - private final ZooKeeperProxy zkClient; - private final AuroraSchedulerClient schedulerClient; - private final ScoringModelsManager scoringModelsManager; - private final TensorflowModelsManager tensorflowModelsManager; - private final SearchDecider searchDecider; - private final AtomicLong noShutdownWhenNotInLayout; - private EarlybirdServer earlybirdServer; - private Clock clock; - - public UpdateableEarlybirdStateManager( - EarlybirdIndexConfig indexConfig, - DynamicPartitionConfig partitionConfig, - ZooKeeperProxy zooKeeperClient, - @Nullable AuroraSchedulerClient schedulerClient, - ScheduledExecutorServiceFactory 
executorServiceFactory, - ScoringModelsManager scoringModelsManager, - TensorflowModelsManager tensorflowModelsManager, - SearchStatsReceiver searchStatsReceiver, - SearchDecider searchDecider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - this( - indexConfig, - partitionConfig, - DEFAULT_LOCAL_SCHEMA_LOCATION, - DEFAULT_ZK_SCHEMA_LOCATION, - DEFAULT_UPDATE_PERIOD_MILLIS, - zooKeeperClient, - schedulerClient, - executorServiceFactory, - scoringModelsManager, - tensorflowModelsManager, - searchStatsReceiver, - searchDecider, - criticalExceptionHandler, - clock); - } - - protected UpdateableEarlybirdStateManager( - EarlybirdIndexConfig indexConfig, - DynamicPartitionConfig partitionConfig, - String schemaLocationOnLocal, - String schemaLocationOnZK, - long updatePeriodMillis, - ZooKeeperProxy zkClient, - @Nullable AuroraSchedulerClient schedulerClient, - ScheduledExecutorServiceFactory executorServiceFactory, - ScoringModelsManager scoringModelsManager, - TensorflowModelsManager tensorflowModelsManager, - SearchStatsReceiver searchStatsReceiver, - SearchDecider searchDecider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - super( - executorServiceFactory, - THREAD_NAME_PATTERN, - THREAD_IS_DAEMON, - PeriodicActionParams.withFixedDelay( - updatePeriodMillis, - TimeUnit.MILLISECONDS - ), - new ShutdownWaitTimeParams( - EXECUTOR_SHUTDOWN_WAIT_SEC, - TimeUnit.SECONDS - ), - searchStatsReceiver, - criticalExceptionHandler); - this.indexConfig = indexConfig; - this.partitionConfig = partitionConfig; - this.schemaLocationOnLocal = schemaLocationOnLocal; - this.schemaLocationOnZK = schemaLocationOnZK; - this.zkClient = zkClient; - this.schedulerClient = schedulerClient; - this.scoringModelsManager = scoringModelsManager; - this.searchDecider = searchDecider; - this.noShutdownWhenNotInLayout = new AtomicLong(0); - this.tensorflowModelsManager = tensorflowModelsManager; - this.clock = clock; - this.schemaMajorVersion = getSearchStatsReceiver().getLongGauge( - SCHEMA_MAJOR_VERSION_NAME); - this.schemaMinorVersion = getSearchStatsReceiver().getLongGauge( - SCHEMA_MINOR_VERSION_NAME); - this.lastSuccessfulSchemaReloadTimeMillis = getSearchStatsReceiver().getLongGauge( - LAST_SUCCESSFUL_SCHEMA_RELOAD_TIME_MILLIS_NAME); - this.failToLoadSchemaCount = getSearchStatsReceiver().getCounter( - FAIL_TO_LOAD_SCHEMA_COUNT_NAME); - this.hostIsCanarySchema = getSearchStatsReceiver().getLongGauge(HOST_IS_CANARY_SCHEME); - this.didNotFindSchemaCount = getSearchStatsReceiver().getCounter( - DID_NOT_FIND_SCHEMA_COUNT_NAME); - this.lastSuccessfulPartitionConfigReloadTimeMillis = getSearchStatsReceiver().getLongGauge( - LAST_SUCCESSFUL_PARTITION_CONFIG_RELOAD_TIME_MILLIS_NAME); - this.failToLoadPartitionConfigCount = getSearchStatsReceiver().getCounter( - FAIL_TO_LOAD_PARTITION_CONFIG_COUNT_NAME); - this.hostIsInLayout = getSearchStatsReceiver().getLongGauge( - HOST_IS_IN_LAYOUT_STAT_NAME); - this.notInLayoutShutDownAttemptedCount = getSearchStatsReceiver().getCounter( - NOT_IN_LAYOUT_SHUT_DOWN_ATTEMPTED_NAME); - this.noShutdownWhenNotInLayoutGauge = getSearchStatsReceiver().getLongGauge( - NO_SHUTDOWN_WHEN_NOT_IN_LAYOUT_NAME, noShutdownWhenNotInLayout); - - updateSchemaVersionStats(indexConfig.getSchema()); - } - - private void updateSchemaVersionStats(Schema schema) { - schemaMajorVersion.set(schema.getMajorVersionNumber()); - schemaMinorVersion.set(schema.getMinorVersionNumber()); - lastSuccessfulSchemaReloadTimeMillis.set(System.currentTimeMillis()); - 
lastSuccessfulPartitionConfigReloadTimeMillis.set(System.currentTimeMillis()); - hostIsInLayout.set(1); - } - - private void updateSchemaVersionWithThriftSchema(ThriftSchema thriftSchema) - throws Schema.SchemaValidationException, DynamicSchema.SchemaUpdateException { - - ImmutableSchema newSchema = new ImmutableSchema( - thriftSchema, new AnalyzerFactory(), indexConfig.getCluster().getNameForStats()); - indexConfig.getSchema().updateSchema(newSchema); - tensorflowModelsManager.updateFeatureSchemaIdToMlIdMap(newSchema.getSearchFeatureSchema()); - updateSchemaVersionStats(indexConfig.getSchema()); - LOG.info("Schema updated. New Schema is: \n" + ThriftUtils.toTextFormatSafe(thriftSchema)); - } - - protected void updateSchema(ZooKeeperProxy zkClientToUse) { - // There are 3 cases: - // 1. Try to locate local schema file to canary, it might fail either because file not exist or - // ineligible versions. - // 2. Canary local schema failed, lookup schema file from zookeeper. - // 3. Both local and zookeeper updates failed, we do not update schema. Either schema not exists - // in zookeeper, or this would happened after canary schema: we updated current schema but did - // not rollback after finished. - if (updateSchemaFromLocal()) { - LOG.info("Host is used for schema canary"); - hostIsCanarySchema.set(1); - } else if (updateSchemaFromZooKeeper(zkClientToUse)) { - // Host is using schema file from zookeeper - hostIsCanarySchema.set(0); - } else { - // Schema update failed. Please check schema file exists on zookeeper and make sure - // rollback after canary. Current version: {}.{} - return; - } - } - - private boolean updateSchemaFromLocal() { - ThriftSchema thriftSchema = - loadCanaryThriftSchemaFromLocal(getCanarySchemaFileOnLocal()); - if (thriftSchema == null) { - // It is expected to not find a local schema file. The schema file only exists when the host - // is used as canary for schema updates - return false; - } - return updateSchemaFromThriftSchema(thriftSchema); - } - - private boolean updateSchemaFromZooKeeper(ZooKeeperProxy zkClientToUse) { - ThriftSchema thriftSchema = loadThriftSchemaFromZooKeeper(zkClientToUse); - if (thriftSchema == null) { - // It is expected to usually not find a schema file on ZooKeeper; one is only uploaded if the - // schema changes after the package has been compiled. All the relevant error handling and - // logging is expected to be handled by loadThriftSchemaFromZooKeeper(). - failToLoadSchemaCount.increment(); - return false; - } - return updateSchemaFromThriftSchema(thriftSchema); - } - - private boolean updateSchemaFromThriftSchema(ThriftSchema thriftSchema) { - Schema currentSchema = indexConfig.getSchema(); - if (thriftSchema.getMajorVersionNumber() != currentSchema.getMajorVersionNumber()) { - LOG.warn( - "Major version updates are not allowed. 
Current major version {}, try to update to {}", - currentSchema.getMajorVersionNumber(), thriftSchema.getMajorVersionNumber()); - return false; - } - if (thriftSchema.getMinorVersionNumber() > currentSchema.getMinorVersionNumber()) { - try { - updateSchemaVersionWithThriftSchema(thriftSchema); - } catch (Schema.SchemaValidationException | DynamicSchema.SchemaUpdateException e) { - LOG.warn("Exception while updating schema: ", e); - return false; - } - return true; - } else if (thriftSchema.getMinorVersionNumber() == currentSchema.getMinorVersionNumber()) { - LOG.info("Schema version to update is same as current one: {}.{}", - currentSchema.getMajorVersionNumber(), currentSchema.getMinorVersionNumber()); - return true; - } else { - LOG.info("Found schema to update, but not eligible for dynamic update. " - + "Current Version: {}.{}; Schema Version for updates: {}.{}", - currentSchema.getMajorVersionNumber(), - currentSchema.getMinorVersionNumber(), - thriftSchema.getMajorVersionNumber(), - thriftSchema.getMinorVersionNumber()); - return false; - } - } - - void updatePartitionConfig(@Nullable AuroraSchedulerClient schedulerClientToUse) { - try { - if (schedulerClientToUse == null) { - NonPagingAssert.assertFailed("aurora_scheduler_client_is_null"); - throw new PartitionConfigLoadingException("AuroraSchedulerClient can not be null."); - } - - PartitionConfig newPartitionConfig = - PartitionConfigLoader.getPartitionInfoForMesosConfig(schedulerClientToUse); - partitionConfig.setCurrentPartitionConfig(newPartitionConfig); - lastSuccessfulPartitionConfigReloadTimeMillis.set(System.currentTimeMillis()); - hostIsInLayout.set(1); - } catch (PartitionConfigLoadingException e) { - // Do not change hostIsInLayout's value if we could not load the layout. - LOG.warn("Failed to load partition config from ZooKeeper.", e); - failToLoadPartitionConfigCount.increment(); - } - } - - @Nullable - private ThriftSchema loadCanaryThriftSchemaFromLocal(LocalFile schemaFile) { - String schemaString; - if (!schemaFile.getFile().exists()) { - return null; - } - try { - schemaString = schemaFile.getCharSource().read(); - } catch (IOException e) { - LOG.warn("Fail to read from local schema file."); - return null; - } - ThriftSchema thriftSchema = new ThriftSchema(); - try { - ThriftUtils.fromTextFormat(schemaString, thriftSchema); - return thriftSchema; - } catch (TException e) { - LOG.warn("Unable to deserialize ThriftSchema loaded locally from {}.\n{}", - schemaFile.getName(), e); - return null; - } - } - - @Nullable - private ThriftSchema loadThriftSchemaFromZooKeeper(ZooKeeperProxy zkClientToUse) { - String schemaPathOnZk = getFullSchemaPathOnZK(); - byte[] rawBytes; - try { - rawBytes = zkClientToUse.getData(schemaPathOnZk, false, null); - } catch (KeeperException.NoNodeException e) { - didNotFindSchemaCount.increment(); - return null; - } catch (KeeperException e) { - LOG.warn("Exception while loading schema from ZK at {}.\n{}", schemaPathOnZk, e); - return null; - } catch (InterruptedException e) { - Thread.currentThread().interrupt(); - LOG.warn("Interrupted while loading schema from ZK at {}.\n{}", schemaPathOnZk, e); - return null; - } catch (ZooKeeperClient.ZooKeeperConnectionException e) { - LOG.warn("Exception while loading schema from ZK at {}.\n{}", schemaPathOnZk, e); - return null; - } - if (rawBytes == null) { - LOG.warn("Got null schema from ZooKeeper at {}.", schemaPathOnZk); - return null; - } - String schemaString = new String(rawBytes, Charsets.UTF_8); - ThriftSchema thriftSchema = new 
ThriftSchema(); - try { - ThriftUtils.fromTextFormat(schemaString, thriftSchema); - return thriftSchema; - } catch (TException e) { - LOG.warn("Unable to deserialize ThriftSchema loaded from ZK at {}.\n{}", schemaPathOnZk, e); - return null; - } - } - - @VisibleForTesting - protected String getSchemaFileName() { - return indexConfig.getCluster().name().toLowerCase() - + UpdateableEarlybirdStateManager.SCHEMA_SUFFIX - + indexConfig.getSchema().getMajorVersionNumber(); - } - - @VisibleForTesting - protected String getFullSchemaPathOnZK() { - return String.format("%s/%s", schemaLocationOnZK, getSchemaFileName()); - } - - LocalFile getCanarySchemaFileOnLocal() { - String canarySchemaFilePath = - String.format("%s/%s", schemaLocationOnLocal, getSchemaFileName()); - return new LocalFile(new File(canarySchemaFilePath)); - } - - void setNoShutdownWhenNotInLayout(boolean noShutdown) { - noShutdownWhenNotInLayout.set(noShutdown ? 1 : 0); - } - - @Override - protected void runOneIteration() { - updateSchema(zkClient); - updatePartitionConfig(schedulerClient); - - LOG.info("Reloading models."); - scoringModelsManager.reload(); - tensorflowModelsManager.run(); - - Random random = new Random(); - - try { - // We had an issue where HDFS operations were blocking, so reloading these models - // was finishing at the same time on each instance and after that every time an instance - // was reloading models, it was happening at the same time. This caused issues with HDFS - // load. We now place a "guard" waiting time after each reload so that the execution time - // on every instance is different and these calls can't easily sync to the same point in time. - int sleepSeconds = random.nextInt(30 * 60); - LOG.info("Sleeping for {} seconds", sleepSeconds); - clock.waitFor(sleepSeconds * 1000); - } catch (InterruptedException ex) { - LOG.info("Interrupted while sleeping"); - } - } - - public void setEarlybirdServer(EarlybirdServer earlybirdServer) { - this.earlybirdServer = earlybirdServer; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveEarlybirdIndexConfig.java b/src/java/com/twitter/search/earlybird/archive/ArchiveEarlybirdIndexConfig.java deleted file mode 100644 index b7709008b..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveEarlybirdIndexConfig.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy; -import org.apache.lucene.index.LogByteSizeMergePolicy; -import org.apache.lucene.index.SerialMergeScheduler; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.SearchWhitespaceAnalyzer; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.CloseResourceUtil; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.EarlybirdLuceneIndexSegmentData; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; - -/** - * Base config for the top archive tweet clusters. 
- */ -public abstract class ArchiveEarlybirdIndexConfig extends EarlybirdIndexConfig { - - private final CloseResourceUtil resourceCloser = new CloseResourceUtil(); - - public ArchiveEarlybirdIndexConfig( - EarlybirdCluster cluster, Decider decider, SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - super(cluster, decider, searchIndexingMetricSet, criticalExceptionHandler); - } - - @Override - public IndexWriterConfig newIndexWriterConfig() { - return new IndexWriterConfig(new SearchWhitespaceAnalyzer()) - .setIndexDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()) - .setMergeScheduler(new SerialMergeScheduler()) - .setMergePolicy(new LogByteSizeMergePolicy()) - .setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB) - .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH) - .setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); - } - - @Override - public CloseResourceUtil getResourceCloser() { - return resourceCloser; - } - - @Override - public EarlybirdIndexSegmentData optimize( - EarlybirdIndexSegmentData segmentData) throws IOException { - Preconditions.checkArgument( - segmentData instanceof EarlybirdLuceneIndexSegmentData, - "Expected EarlybirdLuceneIndexSegmentData but got %s", - segmentData.getClass()); - EarlybirdLuceneIndexSegmentData data = (EarlybirdLuceneIndexSegmentData) segmentData; - - return new EarlybirdLuceneIndexSegmentData( - data.getLuceneDirectory(), - data.getMaxSegmentSize(), - data.getTimeSliceID(), - data.getSchema(), - true, // isOptimized - data.getSyncData().getSmallestDocID(), - new ConcurrentHashMap<>(data.getPerFieldMap()), - data.getFacetCountingArray(), - data.getDocValuesManager(), - data.getDocIDToTweetIDMapper(), - data.getTimeMapper(), - data.getIndexExtensionsData()); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveHDFSUtils.java b/src/java/com/twitter/search/earlybird/archive/ArchiveHDFSUtils.java deleted file mode 100644 index 9fe7f3da2..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveHDFSUtils.java +++ /dev/null @@ -1,173 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.Calendar; -import java.util.Date; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.earlybird.partition.HdfsUtil; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - - -public final class ArchiveHDFSUtils { - private static final Logger LOG = LoggerFactory.getLogger(ArchiveHDFSUtils.class); - - private static final Pattern SEGMENT_NAME_PATTERN = - Pattern.compile("_start_([0-9]+)_p_([0-9]+)_of_([0-9]+)_([0-9]{14}+)_"); - private static final int MATCHER_GROUP_END_DATE = 4; - - private ArchiveHDFSUtils() { - } - - /** - * Check if a given segment already has its indices built on hdfs. - * @return true if the indices exist on hdfs; otherwise, false. 
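- * <p>Illustrative usage (hypothetical call site; variable names are assumed):
- * <pre>{@code
- * if (ArchiveHDFSUtils.hasSegmentIndicesOnHDFS(syncConfig, segmentInfo)) {
- *   // a flushed index already exists on HDFS, so it can be loaded instead of rebuilt
- * }
- * }</pre>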
- */ - public static boolean hasSegmentIndicesOnHDFS(SegmentSyncConfig sync, SegmentInfo segment) { - LOG.info("checking segment on hdfs: " + segment - + " enabled: " + sync.isSegmentLoadFromHdfsEnabled()); - FileSystem fs = null; - try { - fs = HdfsUtil.getHdfsFileSystem(); - String hdfsBaseDirPrefix = segment.getSyncInfo() - .getHdfsSyncDirPrefix(); - FileStatus[] statuses = fs.globStatus(new Path(hdfsBaseDirPrefix)); - return statuses != null && statuses.length > 0; - } catch (IOException ex) { - LOG.error("Failed checking segment on hdfs: " + segment, ex); - return false; - } finally { - IOUtils.closeQuietly(fs); - } - } - - /** - * Delete the segment index directories on HDFS. If 'deleteCurrentDir' is true, the - * index directory with the end date matching 'segment' will be deleted. If 'deleteOlderDirs' is true, - * the index directories with end dates earlier than the segment end date will be deleted. - * - */ - public static void deleteHdfsSegmentDir(SegmentSyncConfig sync, SegmentInfo segment, - boolean deleteCurrentDir, boolean deleteOlderDirs) { - FileSystem fs = null; - try { - fs = HdfsUtil.getHdfsFileSystem(); - String hdfsFlushDir = segment.getSyncInfo().getHdfsFlushDir(); - String hdfsBaseDirPrefix = segment.getSyncInfo() - .getHdfsSyncDirPrefix(); - String endDateStr = extractEndDate(hdfsBaseDirPrefix); - if (endDateStr != null) { - hdfsBaseDirPrefix = hdfsBaseDirPrefix.replace(endDateStr, "*"); - } - String[] hdfsDirs = {segment.getSyncInfo().getHdfsTempFlushDir(), - hdfsBaseDirPrefix}; - for (String hdfsDir : hdfsDirs) { - FileStatus[] statuses = fs.globStatus(new Path(hdfsDir)); - if (statuses != null && statuses.length > 0) { - for (FileStatus status : statuses) { - if (status.getPath().toString().endsWith(hdfsFlushDir)) { - if (deleteCurrentDir) { - fs.delete(status.getPath(), true); - LOG.info("Deleted segment: " + status.getPath()); - } - } else { - if (deleteOlderDirs) { - fs.delete(status.getPath(), true); - LOG.info("Deleted segment: " + status.getPath()); - } - } - } - } - } - } catch (IOException e) { - LOG.error("Error deleting segment dir: " + segment, e); - } finally { - IOUtils.closeQuietly(fs); - } - } - - /** - * Given a segment, check if there are any indices built on HDFS; if yes, return the end date - * of the index built on HDFS; otherwise, return null. 
- */ - public static Date getSegmentEndDateOnHdfs(SegmentSyncConfig sync, SegmentInfo segment) { - if (sync.isSegmentLoadFromHdfsEnabled()) { - LOG.info("About to check segment on hdfs: " + segment - + " enabled: " + sync.isSegmentLoadFromHdfsEnabled()); - - FileSystem fs = null; - try { - String hdfsBaseDirPrefix = segment.getSyncInfo() - .getHdfsSyncDirPrefix(); - String endDateStr = extractEndDate(hdfsBaseDirPrefix); - if (endDateStr == null) { - return null; - } - hdfsBaseDirPrefix = hdfsBaseDirPrefix.replace(endDateStr, "*"); - - fs = HdfsUtil.getHdfsFileSystem(); - FileStatus[] statuses = fs.globStatus(new Path(hdfsBaseDirPrefix)); - if (statuses != null && statuses.length > 0) { - Path hdfsSyncPath = statuses[statuses.length - 1].getPath(); - String hdfsSyncPathName = hdfsSyncPath.getName(); - endDateStr = extractEndDate(hdfsSyncPathName); - return Segment.getSegmentEndDate(endDateStr); - } - } catch (Exception ex) { - LOG.error("Failed getting segment from hdfs: " + segment, ex); - return null; - } finally { - IOUtils.closeQuietly(fs); - } - } - return null; - } - - private static String extractEndDate(String segmentDirPattern) { - Matcher matcher = SEGMENT_NAME_PATTERN.matcher(segmentDirPattern); - if (!matcher.find()) { - return null; - } - - try { - return matcher.group(MATCHER_GROUP_END_DATE); - } catch (IllegalStateException e) { - LOG.error("Match operation failed: " + segmentDirPattern, e); - return null; - } catch (IndexOutOfBoundsException e) { - LOG.error("No group in the pattern with the given index: " + segmentDirPattern, e); - return null; - } - } - - /** - * Converts the given date to a path, using the given separator. For example, if the date is - * January 5, 2019, and the separator is "/", this method will return "2019/01/05". 
- */ - public static String dateToPath(Date date, String separator) { - StringBuilder builder = new StringBuilder(); - Calendar cal = Calendar.getInstance(); - cal.setTime(date); - builder.append(cal.get(Calendar.YEAR)) - .append(separator) - .append(padding(cal.get(Calendar.MONTH) + 1, 2)) - .append(separator) - .append(padding(cal.get(Calendar.DAY_OF_MONTH), 2)); - return builder.toString(); - } - - private static String padding(int value, int len) { - return String.format("%0" + len + "d", value); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveOnDiskEarlybirdIndexConfig.java b/src/java/com/twitter/search/earlybird/archive/ArchiveOnDiskEarlybirdIndexConfig.java deleted file mode 100644 index ad6f3981c..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveOnDiskEarlybirdIndexConfig.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.File; -import java.io.IOException; - -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.FSDirectory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.EarlybirdLuceneIndexSegmentData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.DocValuesBasedTimeMapper; -import com.twitter.search.earlybird.index.DocValuesBasedTweetIDMapper; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentSyncInfo; - -/** - * Index config for the on-disk Tweet clusters. 
- */ -public class ArchiveOnDiskEarlybirdIndexConfig extends ArchiveEarlybirdIndexConfig { - public ArchiveOnDiskEarlybirdIndexConfig( - Decider decider, SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - super(EarlybirdCluster.FULL_ARCHIVE, decider, searchIndexingMetricSet, - criticalExceptionHandler); - } - - @Override - public boolean isIndexStoredOnDisk() { - return true; - } - - @Override - public Directory newLuceneDirectory(SegmentSyncInfo segmentSyncInfo) throws IOException { - File dirPath = new File(segmentSyncInfo.getLocalLuceneSyncDir()); - return FSDirectory.open(dirPath.toPath()); - } - - @Override - public EarlybirdIndexSegmentData newSegmentData( - int maxSegmentSize, - long timeSliceID, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory) { - return new EarlybirdLuceneIndexSegmentData( - dir, - maxSegmentSize, - timeSliceID, - getSchema(), - new DocValuesBasedTweetIDMapper(), - new DocValuesBasedTimeMapper(), - extensionsFactory); - } - - @Override - public EarlybirdIndexSegmentData loadSegmentData( - FlushInfo flushInfo, - DataDeserializer dataInputStream, - Directory dir, - EarlybirdIndexExtensionsFactory extensionsFactory) throws IOException { - // IO Exception will be thrown if there's an error during load - return (new EarlybirdLuceneIndexSegmentData.OnDiskSegmentDataFlushHandler( - getSchema(), - dir, - extensionsFactory, - new DocValuesBasedTweetIDMapper.FlushHandler(), - new DocValuesBasedTimeMapper.FlushHandler())).load(flushInfo, dataInputStream); - } - - @Override - public boolean supportOutOfOrderIndexing() { - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveSearchPartitionManager.java b/src/java/com/twitter/search/earlybird/archive/ArchiveSearchPartitionManager.java deleted file mode 100644 index 251ceba0b..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveSearchPartitionManager.java +++ /dev/null @@ -1,485 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.Date; -import java.util.List; -import java.util.concurrent.TimeUnit; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.util.GCUtil; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.ServerSetMember; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer.ArchiveTimeSlice; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.util.ScrubGenUtil; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.partition.CompleteSegmentManager; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import 
com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.PartitionManager; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentHdfsFlusher; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentLoader; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentManager.Filter; -import com.twitter.search.earlybird.partition.SegmentManager.Order; -import com.twitter.search.earlybird.partition.SegmentOptimizer; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.partition.SegmentWarmer; -import com.twitter.search.earlybird.partition.SimpleSegmentIndexer; -import com.twitter.search.earlybird.partition.UserScrubGeoEventStreamIndexer; -import com.twitter.search.earlybird.partition.UserUpdatesStreamIndexer; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.segment.SegmentDataProvider; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdAction; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionInterface; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionLockFailed; - -public class ArchiveSearchPartitionManager extends PartitionManager { - private static final Logger LOG = - LoggerFactory.getLogger(ArchiveSearchPartitionManager.class); - - public static final String CONFIG_NAME = "archive"; - - private static final long ONE_DAY_MILLIS = TimeUnit.DAYS.toMillis(1); - - private final ArchiveTimeSlicer timeSlicer; - private final ArchiveSegmentDataProvider segmentDataProvider; - - private final UserUpdatesStreamIndexer userUpdatesStreamIndexer; - private final UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer; - - private final SegmentWarmer segmentWarmer; - private final EarlybirdIndexConfig earlybirdIndexConfig; - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final Clock clock; - private final SegmentSyncConfig segmentSyncConfig; - protected final SearchCounter gcAfterIndexing; - - // Used for coordinating daily updated across different replicas on the same hash partition, - // to run them one at a time, and minimize the impact on query latencies. - private final CoordinatedEarlybirdActionInterface coordinatedDailyUpdate; - - private final SearchIndexingMetricSet indexingMetricSet; - - // This is only used in tests where no coordination is needed. 
- @VisibleForTesting - public ArchiveSearchPartitionManager( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - QueryCacheManager queryCacheManager, - SegmentManager segmentManager, - DynamicPartitionConfig dynamicPartitionConfig, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - SearchStatsReceiver searchStatsReceiver, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig, - ScheduledExecutorServiceFactory executorServiceFactory, - ScheduledExecutorServiceFactory userUpdateIndexerScheduledExecutorFactory, - SearchIndexingMetricSet searchIndexingMetricSet, - SegmentSyncConfig syncConfig, - Clock clock, - CriticalExceptionHandler criticalExceptionHandler) - throws IOException { - this( - zooKeeperTryLockFactory, - queryCacheManager, - segmentManager, - dynamicPartitionConfig, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - searchStatsReceiver, - earlybirdIndexConfig, - null, - executorServiceFactory, - userUpdateIndexerScheduledExecutorFactory, - searchIndexingMetricSet, - syncConfig, - clock, - criticalExceptionHandler); - } - - public ArchiveSearchPartitionManager( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - QueryCacheManager queryCacheManager, - SegmentManager segmentManager, - DynamicPartitionConfig dynamicPartitionConfig, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - SearchStatsReceiver searchStatsReceiver, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig, - ServerSetMember serverSetMember, - ScheduledExecutorServiceFactory executorServiceFactory, - ScheduledExecutorServiceFactory userUpdateIndexerExecutorFactory, - SearchIndexingMetricSet searchIndexingMetricSet, - SegmentSyncConfig syncConfig, - Clock clock, - CriticalExceptionHandler criticalExceptionHandler) throws IOException { - super(queryCacheManager, segmentManager, dynamicPartitionConfig, executorServiceFactory, - searchIndexingMetricSet, searchStatsReceiver, criticalExceptionHandler); - - Preconditions.checkState(syncConfig.getScrubGen().isPresent()); - Date scrubGen = ScrubGenUtil.parseScrubGenToDate(syncConfig.getScrubGen().get()); - - this.zkTryLockFactory = zooKeeperTryLockFactory; - final DailyStatusBatches dailyStatusBatches = new DailyStatusBatches( - zkTryLockFactory, - scrubGen); - this.earlybirdIndexConfig = earlybirdIndexConfig; - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - - this.indexingMetricSet = searchIndexingMetricSet; - - this.timeSlicer = new ArchiveTimeSlicer( - EarlybirdConfig.getMaxSegmentSize(), dailyStatusBatches, - curPartitionConfig.getTierStartDate(), curPartitionConfig.getTierEndDate(), - earlybirdIndexConfig); - this.segmentDataProvider = - new ArchiveSegmentDataProvider( - dynamicPartitionConfig, - timeSlicer, - this.earlybirdIndexConfig); - - this.userUpdatesStreamIndexer = userUpdatesStreamIndexer; - this.userScrubGeoEventStreamIndexer = userScrubGeoEventStreamIndexer; - - this.coordinatedDailyUpdate = new CoordinatedEarlybirdAction( - zkTryLockFactory, - "archive_daily_update", - dynamicPartitionConfig, - serverSetMember, - criticalExceptionHandler, - syncConfig); - - this.segmentWarmer = new SegmentWarmer(criticalExceptionHandler); - this.clock = clock; - this.segmentSyncConfig = syncConfig; - this.gcAfterIndexing = SearchCounter.export("gc_after_indexing"); - } - - @Override - public SegmentDataProvider getSegmentDataProvider() { - return segmentDataProvider; - } - - @Override 
- protected void startUp() throws Exception { - LOG.info("Using CompleteSegmentManager to index complete segments."); - - // deferring handling of multi-segment term dictionary for the archive. - // SEARCH-11952 - CompleteSegmentManager completeSegmentManager = new CompleteSegmentManager( - zkTryLockFactory, - segmentDataProvider, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - segmentManager, - null, - indexingMetricSet, - clock, - MultiSegmentTermDictionaryManager.NOOP_INSTANCE, - segmentSyncConfig, - criticalExceptionHandler); - - completeSegmentManager.indexUserEvents(); - completeSegmentManager.indexCompleteSegments( - () -> segmentManager.getSegmentInfos(Filter.NeedsIndexing, Order.OLD_TO_NEW)); - - // In the archive cluster, the current segment needs to be loaded too. - List allSegments = - Lists.newArrayList(segmentManager.getSegmentInfos(Filter.All, Order.OLD_TO_NEW)); - completeSegmentManager.loadCompleteSegments(allSegments); - - completeSegmentManager.buildMultiSegmentTermDictionary(); - - completeSegmentManager.warmSegments(allSegments); - - LOG.info("Starting to run UserUpdatesKafkaConsumer"); - new Thread(userUpdatesStreamIndexer::run, "userupdates-stream-indexer").start(); - - if (EarlybirdConfig.consumeUserScrubGeoEvents()) { - LOG.info("Starting to run UserScrubGeoEventKafkaConsumer"); - new Thread(userScrubGeoEventStreamIndexer::run, - "userScrubGeoEvent-stream-indexer").start(); - } - } - - private static List truncateSegmentList(List segmentList, - int maxNumSegments) { - // Maybe cut-off the beginning of the sorted list of IDs. - if (maxNumSegments > 0 && maxNumSegments < segmentList.size()) { - return segmentList.subList(segmentList.size() - maxNumSegments, segmentList.size()); - } else { - return segmentList; - } - } - - - @Override - protected void indexingLoop(boolean firstLoop) throws Exception { - if (firstLoop) { - EarlybirdStatus.beginEvent( - INDEX_CURRENT_SEGMENT, getSearchIndexingMetricSet().startupInCurrentSegment); - } - - List timeSlices = timeSlicer.getTimeSlicesInTierRange(); - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - timeSlices = truncateSegmentList(timeSlices, curPartitionConfig.getMaxEnabledLocalSegments()); - - for (final ArchiveTimeSlice timeSlice : timeSlices) { - // If any timeslice build failed, do not try to build timeslice after that to prevent - // possible holes between timeslices. - try { - if (!processArchiveTimeSlice(timeSlice)) { - LOG.warn("Building timeslice {} has failed, stopping future builds.", - timeSlice.getDescription()); - indexingMetricSet.archiveTimeSliceBuildFailedCounter.increment(); - return; - } - } catch (CoordinatedEarlybirdActionLockFailed e) { - // If the timeslice build failed because of lock coordination, we can wait for the next - // iteration to build again. - return; - } - } - - if (firstLoop) { - EarlybirdStatus.endEvent( - INDEX_CURRENT_SEGMENT, getSearchIndexingMetricSet().startupInCurrentSegment); - LOG.info("First indexing loop complete. Setting up query cache..."); - EarlybirdStatus.beginEvent( - SETUP_QUERY_CACHE, getSearchIndexingMetricSet().startupInQueryCacheUpdates); - } - setupQueryCacheIfNeeded(); - - if (EarlybirdStatus.isStarting() && queryCacheManager.allTasksRan()) { - LOG.info("Query cache setup complete. 
Becoming current now..."); - EarlybirdStatus.endEvent( - SETUP_QUERY_CACHE, getSearchIndexingMetricSet().startupInQueryCacheUpdates); - - becomeCurrent(); - EarlybirdStatus.recordEarlybirdEvent("Archive Earlybird is current"); - } - - updateIndexFreshnessStats(timeSlices); - } - - @VisibleForTesting - protected boolean processArchiveTimeSlice(final ArchiveTimeSlice timeSlice) - throws CoordinatedEarlybirdActionLockFailed, IOException { - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - long minStatusID = timeSlice.getMinStatusID(curPartitionConfig.getIndexingHashPartitionID()); - SegmentInfo segmentInfo = segmentManager.getSegmentInfo(minStatusID); - if (segmentInfo == null) { - return indexSegmentFromScratch(timeSlice); - } else if (existingSegmentNeedsUpdating(timeSlice, segmentInfo)) { - return indexNewDayAndAppendExistingSegment(timeSlice, segmentInfo); - } - return true; - } - - - @VisibleForTesting - SegmentInfo newSegmentInfo(ArchiveTimeSlice timeSlice) throws IOException { - return new SegmentInfo(segmentDataProvider.newArchiveSegment(timeSlice), - segmentManager.getEarlybirdSegmentFactory(), segmentSyncConfig); - } - - private boolean indexNewDayAndAppendExistingSegment(final ArchiveTimeSlice timeSlice, - SegmentInfo segmentInfo) - throws CoordinatedEarlybirdActionLockFailed, IOException { - - LOG.info("Updating segment: {}; new endDate will be {} segmentInfo: {}", - segmentInfo.getSegment().getTimeSliceID(), timeSlice.getEndDate(), segmentInfo); - - // Create another new SegmentInfo for indexing - final SegmentInfo newSegmentInfoForIndexing = newSegmentInfo(timeSlice); - // make a final reference of the old segment info to be passed into closure. - final SegmentInfo oldSegmentInfo = segmentInfo; - - // Sanity check: the old and new segment should not share the same lucene directory. - Preconditions.checkState( - !newSegmentInfoForIndexing.getSyncInfo().getLocalLuceneSyncDir().equals( - oldSegmentInfo.getSyncInfo().getLocalLuceneSyncDir())); - - Preconditions.checkState( - !newSegmentInfoForIndexing.getSyncInfo().getLocalSyncDir().equals( - oldSegmentInfo.getSyncInfo().getLocalSyncDir())); - - final ArchiveSegment oldSegment = (ArchiveSegment) segmentInfo.getSegment(); - - return indexSegment(newSegmentInfoForIndexing, oldSegmentInfo, input -> { - // we're updating the segment - only index days after the old end date, but only if - // we're in the on-disk archive, and we're sure that the previous days have already - // been indexed. 
- return !earlybirdIndexConfig.isIndexStoredOnDisk() - // First time around, and the segment has not been indexed and optimized yet, - // we will want to add all the days - || !oldSegmentInfo.isOptimized() - || oldSegmentInfo.getIndexSegment().getIndexStats().getStatusCount() == 0 - || !oldSegment.getDataEndDate().before(timeSlice.getEndDate()) - // Index any new days - || input.after(oldSegment.getDataEndDate()); - }); - } - - private boolean existingSegmentNeedsUpdating(ArchiveTimeSlice timeSlice, - SegmentInfo segmentInfo) { - return ((ArchiveSegment) segmentInfo.getSegment()) - .getDataEndDate().before(timeSlice.getEndDate()) - // First time around, the end date is the same as the timeSlice end date, but - // the segment has not been indexed and optimized yet - || (!segmentInfo.isOptimized() && !segmentInfo.wasIndexed()) - // If indexing failed, this index will not be marked as complete, and we will want - // to reindex - || !segmentInfo.isComplete(); - } - - private boolean indexSegmentFromScratch(ArchiveTimeSlice timeSlice) throws - CoordinatedEarlybirdActionLockFailed, IOException { - - SegmentInfo segmentInfo = newSegmentInfo(timeSlice); - LOG.info("Creating segment: " + segmentInfo.getSegment().getTimeSliceID() - + "; new endDate will be " + timeSlice.getEndDate() + " segmentInfo: " + segmentInfo); - - return indexSegment(segmentInfo, null, ArchiveSegment.MATCH_ALL_DATE_PREDICATE); - } - - private void updateIndexFreshnessStats(List timeSlices) { - if (!timeSlices.isEmpty()) { - ArchiveTimeSlice lastTimeslice = timeSlices.get(timeSlices.size() - 1); - - // Add ~24 hours to start of end date to estimate freshest tweet time. - indexingMetricSet.freshestTweetTimeMillis.set( - lastTimeslice.getEndDate().getTime() + ONE_DAY_MILLIS); - - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - long maxStatusId = lastTimeslice.getMaxStatusID( - curPartitionConfig.getIndexingHashPartitionID()); - if (maxStatusId > indexingMetricSet.highestStatusId.get()) { - indexingMetricSet.highestStatusId.set(maxStatusId); - } - } - } - - @Override - public void shutDownIndexing() { - LOG.info("Shutting down."); - userUpdatesStreamIndexer.close(); - userScrubGeoEventStreamIndexer.close(); - LOG.info("Closed User Event Kafka Consumers. Now Shutting down reader set."); - getSegmentDataProvider().getSegmentDataReaderSet().stopAll(); - } - - /** - * Attempts to index new days of data into the provided segment, indexing only the days that - * match the "dateFilter" predicate. - * @return true iff indexing succeeded, false otherwise. 
- */ - @VisibleForTesting - protected boolean indexSegment(final SegmentInfo segmentInfo, - @Nullable final SegmentInfo segmentToAppend, - final Predicate dateFilter) - throws CoordinatedEarlybirdActionLockFailed, IOException { - // Don't coordinate while we're starting up - if (!EarlybirdStatus.isStarting()) { - return coordinatedDailyUpdate.execute(segmentInfo.getSegmentName(), - isCoordinated -> innerIndexSegment(segmentInfo, segmentToAppend, dateFilter)); - } else { - return innerIndexSegment(segmentInfo, segmentToAppend, dateFilter); - } - } - - private boolean innerIndexSegment(SegmentInfo segmentInfo, - @Nullable SegmentInfo segmentToAppend, - Predicate dateFilter) - throws IOException { - - // First try to load the new day from HDFS / Local disk - if (new SegmentLoader(segmentSyncConfig, criticalExceptionHandler).load(segmentInfo)) { - LOG.info("Successful loaded segment for new day: " + segmentInfo); - segmentManager.putSegmentInfo(segmentInfo); - gcAfterIndexing.increment(); - GCUtil.runGC(); - return true; - } - - LOG.info("Failed to load segment for new day. Will index segment: " + segmentInfo); - RecordReader tweetReader = ((ArchiveSegment) segmentInfo.getSegment()) - .getStatusRecordReader(earlybirdIndexConfig.createDocumentFactory(), dateFilter); - try { - // Read and index the statuses - boolean success = newSimpleSegmentIndexer(tweetReader, segmentToAppend) - .indexSegment(segmentInfo); - if (!success) { - return false; - } - } finally { - tweetReader.stop(); - } - - if (!SegmentOptimizer.optimize(segmentInfo)) { - // We consider the whole indexing event as failed if we fail to optimize. - LOG.error("Failed to optimize segment: " + segmentInfo); - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - return false; - } - - if (!segmentWarmer.warmSegmentIfNecessary(segmentInfo)) { - // We consider the whole indexing event as failed if we failed to warm (because we open - // index readers in the warmer). - LOG.error("Failed to warm segment: " + segmentInfo); - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - return false; - } - - // Flush and upload segment to HDFS. If this fails, we just log a warning and return true. 
- boolean success = new SegmentHdfsFlusher(zkTryLockFactory, segmentSyncConfig) - .flushSegmentToDiskAndHDFS(segmentInfo); - if (!success) { - LOG.warn("Failed to flush segment to HDFS: " + segmentInfo); - } - - segmentManager.putSegmentInfo(segmentInfo); - gcAfterIndexing.increment(); - GCUtil.runGC(); - return true; - } - - @VisibleForTesting - protected SimpleSegmentIndexer newSimpleSegmentIndexer( - RecordReader tweetReader, SegmentInfo segmentToAppend) { - return new SimpleSegmentIndexer(tweetReader, indexingMetricSet, segmentToAppend); - } - - @Override - public boolean isCaughtUpForTests() { - return EarlybirdStatus.getStatusCode() == EarlybirdStatusCode.CURRENT; - } - - public CoordinatedEarlybirdActionInterface getCoordinatedOptimizer() { - return this.coordinatedDailyUpdate; - } - - public ArchiveTimeSlicer getTimeSlicer() { - return timeSlicer; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveSegment.java b/src/java/com/twitter/search/earlybird/archive/ArchiveSegment.java deleted file mode 100644 index 9d18a48e4..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveSegment.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.Date; - -import com.google.common.base.Predicate; -import com.google.common.base.Predicates; - -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.partitioning.base.TimeSlice; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer.ArchiveTimeSlice; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; - -public class ArchiveSegment extends Segment { - private final ArchiveTimeSlice archiveTimeSlice; - - public static final Predicate MATCH_ALL_DATE_PREDICATE = input -> true; - - // Constructor used for indexing an archive segment - public ArchiveSegment(ArchiveTimeSlice archiveTimeSlice, - int hashPartitionID, - int maxSegmentSize) { - super(new TimeSlice(archiveTimeSlice.getMinStatusID(hashPartitionID), - maxSegmentSize, hashPartitionID, - archiveTimeSlice.getNumHashPartitions()), - archiveTimeSlice.getEndDate().getTime()); - this.archiveTimeSlice = archiveTimeSlice; - } - - /** - * Constructor used for loading a flushed segment. Only be used by SegmentBuilder; Earlybird - * does not use this. - */ - ArchiveSegment(long timeSliceId, - int maxSegmentSize, - int partitions, - int hashPartitionID, - Date dataEndDate) { - super(new TimeSlice(timeSliceId, maxSegmentSize, hashPartitionID, partitions), - dataEndDate.getTime()); - // No archive timeslice is needed for loading. - this.archiveTimeSlice = null; - } - - /** - * Returns the tweets reader for this segment. - * - * @param documentFactory The factory that converts ThriftDocuments to Lucene documents. - */ - public RecordReader getStatusRecordReader( - DocumentFactory documentFactory) throws IOException { - return getStatusRecordReader(documentFactory, Predicates.alwaysTrue()); - } - - /** - * Returns the tweets reader for this segment. - * - * @param documentFactory The factory that converts ThriftDocuments to Lucene documents. - * @param filter A predicate that filters tweets based on the date they were created on. 
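- * <p>Illustrative call (mirrors the usage in ArchiveSearchPartitionManager; variable names and
- * generic parameters are assumed):
- * <pre>{@code
- * RecordReader<TweetDocument> reader = archiveSegment.getStatusRecordReader(
- *     earlybirdIndexConfig.createDocumentFactory(), date -> date.after(oldEndDate));
- * }</pre>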
- */ - public RecordReader getStatusRecordReader( - DocumentFactory documentFactory, - Predicate filter) throws IOException { - if (archiveTimeSlice != null) { - return archiveTimeSlice.getStatusReader(this, documentFactory, filter); - } else { - throw new IllegalStateException("ArchiveSegment has no associated ArchiveTimeslice." - + "This ArchiveSegment can only be used for loading flushed segments."); - } - } - - public Date getDataEndDate() { - return archiveTimeSlice == null - ? new Date(getDataEndDateInclusiveMillis()) : archiveTimeSlice.getEndDate(); - } - - public ArchiveTimeSlice getArchiveTimeSlice() { - return archiveTimeSlice; - } - - @Override - public String toString() { - return super.toString() + " " + archiveTimeSlice.getDescription(); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentDataProvider.java b/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentDataProvider.java deleted file mode 100644 index 07b44df5c..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentDataProvider.java +++ /dev/null @@ -1,84 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.List; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer.ArchiveTimeSlice; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.segment.EmptySegmentDataReaderSet; -import com.twitter.search.earlybird.segment.SegmentDataProvider; -import com.twitter.search.earlybird.segment.SegmentDataReaderSet; - -public class ArchiveSegmentDataProvider implements SegmentDataProvider { - private static final org.slf4j.Logger LOG = - org.slf4j.LoggerFactory.getLogger(ArchiveSegmentDataProvider.class); - - private DynamicPartitionConfig dynamicPartitionConfig; - private final ArchiveTimeSlicer timeSlicer; - - private final DocumentFactory documentFactory; - - private final SegmentDataReaderSet readerSet; - - public ArchiveSegmentDataProvider( - DynamicPartitionConfig dynamicPartitionConfig, - ArchiveTimeSlicer timeSlicer, - EarlybirdIndexConfig earlybirdIndexConfig) throws IOException { - this.dynamicPartitionConfig = dynamicPartitionConfig; - this.timeSlicer = timeSlicer; - this.readerSet = createSegmentDataReaderSet(); - this.documentFactory = earlybirdIndexConfig.createDocumentFactory(); - } - - @Override - public List newSegmentList() throws IOException { - List timeSlices = timeSlicer.getTimeSlicesInTierRange(); - if (timeSlices == null || timeSlices.isEmpty()) { - return Lists.newArrayList(); - } - List segments = Lists.newArrayListWithCapacity(timeSlices.size()); - for (ArchiveTimeSlice timeSlice : timeSlices) { - segments.add(newArchiveSegment(timeSlice)); - } - return segments; - } - - /** - * Creates a new Segment instance for the given timeslice. 
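- * <p>Sketch of a call site (hypothetical; the partition ID and max segment size are taken from
- * the current partition config):
- * <pre>{@code
- * ArchiveSegment segment = segmentDataProvider.newArchiveSegment(timeSlice);
- * }</pre>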
- */ - public ArchiveSegment newArchiveSegment(ArchiveTimeSlice archiveTimeSlice) { - return new ArchiveSegment( - archiveTimeSlice, - dynamicPartitionConfig.getCurrentPartitionConfig().getIndexingHashPartitionID(), - EarlybirdConfig.getMaxSegmentSize()); - } - - @Override - public SegmentDataReaderSet getSegmentDataReaderSet() { - return readerSet; - } - - private EmptySegmentDataReaderSet createSegmentDataReaderSet() throws IOException { - return new EmptySegmentDataReaderSet() { - - @Override - public RecordReader newDocumentReader(SegmentInfo segmentInfo) - throws IOException { - Segment segment = segmentInfo.getSegment(); - Preconditions.checkArgument(segment instanceof ArchiveSegment); - return ((ArchiveSegment) segment).getStatusRecordReader(documentFactory); - } - }; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentUpdater.java b/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentUpdater.java deleted file mode 100644 index 7620ace9b..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentUpdater.java +++ /dev/null @@ -1,279 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.Date; - -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; - -import org.apache.commons.lang.time.FastDateFormat; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchStatsReceiverImpl; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentHdfsFlusher; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentLoader; -import com.twitter.search.earlybird.partition.SegmentOptimizer; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.partition.SimpleSegmentIndexer; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; - -/** - * Given a segment, this class checks if the segment has an index built on HDFS: - * if not, use SimpleSegmentIndexer to build an index - * if yes, load the HDFS index, build a new index for the new status data which has dates newer - * than the HDFS index, then append the loaded HDFS index. 
- */ -public class ArchiveSegmentUpdater { - private static final Logger LOG = LoggerFactory.getLogger(ArchiveSegmentUpdater.class); - - private final SegmentSyncConfig sync; - private final EarlybirdIndexConfig earlybirdIndexConfig; - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final SearchStatsReceiver statsReceiver = new SearchStatsReceiverImpl(); - private final SearchIndexingMetricSet searchIndexingMetricSet = - new SearchIndexingMetricSet(statsReceiver); - private final EarlybirdSearcherStats searcherStats = - new EarlybirdSearcherStats(statsReceiver); - private final SearchRateCounter indexNewSegment = - new SearchRateCounter("index_new_segment"); - private final SearchRateCounter updateExistingSegment = - new SearchRateCounter("update_existing_segment"); - private final SearchRateCounter skipExistingSegment = - new SearchRateCounter("skip_existing_segment"); - private Clock clock; - - public ArchiveSegmentUpdater(ZooKeeperTryLockFactory zooKeeperTryLockFactory, - SegmentSyncConfig sync, - EarlybirdIndexConfig earlybirdIndexConfig, - Clock clock) { - this.sync = sync; - this.earlybirdIndexConfig = earlybirdIndexConfig; - this.zkTryLockFactory = zooKeeperTryLockFactory; - this.clock = clock; - } - - private boolean canUpdateSegment(SegmentInfo segmentInfo) { - if (!(segmentInfo.getSegment() instanceof ArchiveSegment)) { - LOG.info("only ArchiveSegment is available for updating now: " - + segmentInfo); - return false; - } - - if (!segmentInfo.isEnabled()) { - LOG.debug("Segment is disabled: " + segmentInfo); - return false; - } - - if (segmentInfo.isComplete() || segmentInfo.isIndexing() - || segmentInfo.getSyncInfo().isLoaded()) { - LOG.debug("Cannot update already indexed segment: " + segmentInfo); - return false; - } - - return true; - } - - /** - * Given a segment, checks if the segment has an index built on HDFS: - * if not, use SimpleSegmentIndexer to build an index - * if yes, load the HDFS index, build a new index for the new status data which has dates newer - * than the HDFS index, then append the loaded HDFS index. - * - * Returns whether the segment was successfully updated. 
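- * <p>Illustrative usage (hypothetical caller, e.g. an offline segment-building job; variable
- * names are assumed):
- * <pre>{@code
- * ArchiveSegmentUpdater updater = new ArchiveSegmentUpdater(
- *     zkTryLockFactory, segmentSyncConfig, earlybirdIndexConfig, clock);
- * boolean updated = updater.updateSegment(segmentInfo);
- * }</pre>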
- */ - public boolean updateSegment(SegmentInfo segmentInfo) { - Preconditions.checkArgument(segmentInfo.getSegment() instanceof ArchiveSegment); - if (!canUpdateSegment(segmentInfo)) { - return false; - } - - if (segmentInfo.isIndexing()) { - LOG.error("Segment is already being indexed: " + segmentInfo); - return false; - } - - final Date hdfsEndDate = ArchiveHDFSUtils.getSegmentEndDateOnHdfs(sync, segmentInfo); - if (hdfsEndDate == null) { - indexNewSegment.increment(); - if (!indexSegment(segmentInfo, ArchiveSegment.MATCH_ALL_DATE_PREDICATE)) { - return false; - } - } else { - final Date curEndDate = ((ArchiveSegment) segmentInfo.getSegment()).getDataEndDate(); - if (!hdfsEndDate.before(curEndDate)) { - skipExistingSegment.increment(); - LOG.info("Segment is up-to-date: " + segmentInfo.getSegment().getTimeSliceID() - + " Found flushed segment on HDFS with end date: " - + FastDateFormat.getInstance("yyyyMMdd").format(hdfsEndDate)); - segmentInfo.setComplete(true); - segmentInfo.getSyncInfo().setFlushed(true); - return true; - } - - updateExistingSegment.increment(); - LOG.info("Updating segment: " + segmentInfo.getSegment().getTimeSliceID() - + "; new endDate will be " + FastDateFormat.getInstance("yyyyMMdd").format(curEndDate)); - - if (!updateSegment(segmentInfo, hdfsEndDate)) { - return false; - } - } - - boolean success = SegmentOptimizer.optimize(segmentInfo); - if (!success) { - // Clean up the segment dir on local disk - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Error optimizing segment: " + segmentInfo); - return false; - } - - // Verify segment before uploading. - success = ArchiveSegmentVerifier.verifySegment(segmentInfo); - if (!success) { - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Segment not uploaded to HDFS because it did not pass verification: " + segmentInfo); - return false; - } - - // upload the index to HDFS - success = new SegmentHdfsFlusher(zkTryLockFactory, sync, false) - .flushSegmentToDiskAndHDFS(segmentInfo); - if (success) { - ArchiveHDFSUtils.deleteHdfsSegmentDir(sync, segmentInfo, false, true); - } else { - // Clean up the segment dir on hdfs - ArchiveHDFSUtils.deleteHdfsSegmentDir(sync, segmentInfo, true, false); - LOG.info("Error uploading segment to HDFS: " + segmentInfo); - } - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - - return success; - } - - /** - * Build index for the given segmentInfo. Only those statuses passing the dateFilter are indexed. 
- */ - private boolean indexSegment(final SegmentInfo segmentInfo, Predicate dateFilter) { - Preconditions.checkArgument(segmentInfo.getSegment() instanceof ArchiveSegment); - - RecordReader documentReader = null; - try { - ArchiveSegment archiveSegment = (ArchiveSegment) segmentInfo.getSegment(); - DocumentFactory documentFactory = - earlybirdIndexConfig.createDocumentFactory(); - documentReader = archiveSegment.getStatusRecordReader(documentFactory, dateFilter); - - // Read and index the statuses - boolean success = new SimpleSegmentIndexer(documentReader, searchIndexingMetricSet) - .indexSegment(segmentInfo); - if (!success) { - // Clean up segment dir on local disk - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Error indexing segment: " + segmentInfo); - } - - return success; - } catch (IOException e) { - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Exception while indexing segment: " + segmentInfo, e); - return false; - } finally { - if (documentReader != null) { - documentReader.stop(); - } - } - } - - /** - * Load the index built on HDFS for the given segmentInfo, index the new data and append the - * HDFS index to the new indexed segment - */ - private boolean updateSegment(final SegmentInfo segmentInfo, final Date hdfsEndDate) { - SegmentInfo hdfsSegmentInfo = loadSegmentFromHdfs(segmentInfo, hdfsEndDate); - if (hdfsSegmentInfo == null) { - return indexSegment(segmentInfo, ArchiveSegment.MATCH_ALL_DATE_PREDICATE); - } - - boolean success = indexSegment(segmentInfo, input -> { - // we're updating the segment - only index days after the old end date, - // and we're sure that the previous days have already been indexed. - return input.after(hdfsEndDate); - }); - if (!success) { - LOG.error("Error indexing new data: " + segmentInfo); - return indexSegment(segmentInfo, ArchiveSegment.MATCH_ALL_DATE_PREDICATE); - } - - // Now, append the index loaded from hdfs - try { - segmentInfo.getIndexSegment().append(hdfsSegmentInfo.getIndexSegment()); - hdfsSegmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Deleted local segment directories with end date " + hdfsEndDate + " : " - + segmentInfo); - } catch (IOException e) { - LOG.warn("Caught IOException while appending segment " + hdfsSegmentInfo.getSegmentName(), e); - hdfsSegmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - return false; - } - - segmentInfo.setComplete(true); - return true; - } - - /** - * Load the index built on HDFS for the given segmentInfo and end date - */ - private SegmentInfo loadSegmentFromHdfs(final SegmentInfo segmentInfo, final Date hdfsEndDate) { - Preconditions.checkArgument(segmentInfo.getSegment() instanceof ArchiveSegment); - - ArchiveSegment segment = new ArchiveSegment( - segmentInfo.getTimeSliceID(), - EarlybirdConfig.getMaxSegmentSize(), - segmentInfo.getNumPartitions(), - segmentInfo.getSegment().getHashPartitionID(), - hdfsEndDate); - EarlybirdSegmentFactory factory = new EarlybirdSegmentFactory( - earlybirdIndexConfig, - searchIndexingMetricSet, - searcherStats, - clock); - - SegmentInfo hdfsSegmentInfo; - - try { - hdfsSegmentInfo = new SegmentInfo(segment, factory, sync); - CriticalExceptionHandler criticalExceptionHandler = - new CriticalExceptionHandler(); - - boolean success = new SegmentLoader(sync, criticalExceptionHandler) - .load(hdfsSegmentInfo); - if (!success) { - // If not successful, segmentLoader has already cleaned up the local dir. 
- LOG.info("Error loading hdfs segment " + hdfsSegmentInfo - + ", building segment from scratch."); - hdfsSegmentInfo = null; - } - } catch (IOException e) { - LOG.error("Exception while loading segment from hdfs: " + segmentInfo, e); - hdfsSegmentInfo = null; - } - - return hdfsSegmentInfo; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentVerifier.java b/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentVerifier.java deleted file mode 100644 index 2eb23265e..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveSegmentVerifier.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.List; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.index.DirectoryReader; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.store.Directory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.partition.SegmentInfo; - -public final class ArchiveSegmentVerifier { - private static final Logger LOG = LoggerFactory.getLogger(ArchiveSegmentVerifier.class); - - private ArchiveSegmentVerifier() { - } - - @VisibleForTesting - static boolean shouldVerifySegment(SegmentInfo segmentInfo) { - if (segmentInfo.isIndexing()) { - LOG.warn("ArchiveSegmentVerifier got segment still indexing."); - return false; - } - - if (!segmentInfo.isComplete()) { - LOG.warn("ArchiveSegmentVerifyer got incomplete segment."); - return false; - } - - if (!segmentInfo.isOptimized()) { - LOG.warn("ArchiveSegmentVerifyer got unoptimized segment."); - return false; - } - - return true; - } - - /** - * Verifies an archive segment has a sane number of leaves. - */ - public static boolean verifySegment(SegmentInfo segmentInfo) { - if (!shouldVerifySegment(segmentInfo)) { - return false; - } - Directory directory = segmentInfo.getIndexSegment().getLuceneDirectory(); - return verifyLuceneIndex(directory); - } - - private static boolean verifyLuceneIndex(Directory directory) { - try { - DirectoryReader indexerReader = DirectoryReader.open(directory); - List leaves = indexerReader.getContext().leaves(); - if (leaves.size() != 1) { - LOG.warn("Lucene index does not have exactly one segment: " + leaves.size() + " != 1. 
" - + "Lucene segments should have been merged during optimization."); - return false; - } - - LeafReader reader = leaves.get(0).reader(); - if (reader.numDocs() <= 0) { - LOG.warn("Lucene index has no document: " + reader); - return false; - } - return true; - } catch (IOException e) { - LOG.warn("Found bad lucene index at: " + directory); - return false; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/ArchiveTimeSlicer.java b/src/java/com/twitter/search/earlybird/archive/ArchiveTimeSlicer.java deleted file mode 100644 index c326c76be..000000000 --- a/src/java/com/twitter/search/earlybird/archive/ArchiveTimeSlicer.java +++ /dev/null @@ -1,322 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Calendar; -import java.util.Collections; -import java.util.Comparator; -import java.util.Date; -import java.util.List; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; -import com.google.common.collect.Lists; - - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.io.MergingSortedRecordReader; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.config.TierConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.ThriftIndexingEventDocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; - - -/** - * Responsible for taking a number of daily status batches and partitioning them into time slices - * which will be used to build segments. - * - * We try to put at most N number of tweets into a time slice. - */ -public class ArchiveTimeSlicer { - private static final Logger LOG = LoggerFactory.getLogger(ArchiveTimeSlicer.class); - - private static final Comparator ASCENDING = - (o1, o2) -> Long.compare(o1.getTweetID(), o2.getTweetID()); - - private static final Comparator DESCENDING = - (o1, o2) -> Long.compare(o2.getTweetID(), o1.getTweetID()); - - // Represents a number of daily batches which will go into a segment. - public static final class ArchiveTimeSlice { - private Date startDate; - private Date endDate; - private int statusCount; - private final DailyStatusBatches directory; - private final ArchiveEarlybirdIndexConfig earlybirdIndexConfig; - - // This list is always ordered from oldest day, to the newest day. - // For the on-disk archive, we reverse the days in getTweetReaders(). - private final List batches = Lists.newArrayList(); - - private ArchiveTimeSlice(DailyStatusBatches directory, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig) { - this.directory = directory; - this.earlybirdIndexConfig = earlybirdIndexConfig; - } - - public Date getEndDate() { - return endDate; - } - - public int getStatusCount() { - return statusCount; - } - - public int getNumHashPartitions() { - return batches.isEmpty() ? 0 : batches.get(0).getNumHashPartitions(); - } - - /** - * Returns a reader for reading tweets from this timeslice. - * - * @param archiveSegment The segment to which the timeslice belongs. - * @param documentFactory The ThriftIndexingEvent to TweetDocument converter. - * @param filter A filter that determines what dates should be read. 
- */ - public RecordReader getStatusReader( - ArchiveSegment archiveSegment, - DocumentFactory documentFactory, - Predicate filter) throws IOException { - // We no longer support ThriftStatus based document factories. - Preconditions.checkState(documentFactory instanceof ThriftIndexingEventDocumentFactory); - - final int hashPartitionID = archiveSegment.getHashPartitionID(); - List> readers = new ArrayList<>(batches.size()); - List orderedForReading = orderBatchesForReading(batches); - LOG.info("Creating new status reader for hashPartition: " - + hashPartitionID + " timeslice: " + getDescription()); - - for (DailyStatusBatch batch : orderedForReading) { - if (filter.apply(batch.getDate())) { - LOG.info("Adding reader for " + batch.getDate() + " " + getDescription()); - PartitionedBatch partitionedBatch = batch.getPartition(hashPartitionID); - // Don't even try to create a reader if the partition is empty. - // There does not seem to be any problem in production now, but HDFS FileSystem's javadoc - // does indicate that listStatus() is allowed to throw a FileNotFoundException if the - // partition does not exist. This check makes the code more robust against future - // HDFS FileSystem implementation changes. - if (partitionedBatch.getStatusCount() > 0) { - RecordReader tweetReaders = partitionedBatch.getTweetReaders( - archiveSegment, - directory.getStatusPathToUseForDay(batch.getDate()), - documentFactory); - readers.add(tweetReaders); - } - } else { - LOG.info("Filtered reader for " + batch.getDate() + " " + getDescription()); - } - } - - LOG.info("Creating reader for timeslice: " + getDescription() - + " with " + readers.size() + " readers"); - - return new MergingSortedRecordReader(getMergingComparator(), readers); - } - - private List orderBatchesForReading(List orderedBatches) { - // For the index formats using stock lucene, we want the most recent days to be indexed first. - // In the twitter in-memory optimized indexes, older tweets will be added first, and - // optimization will reverse the documents to make most recent tweets be first. - return this.earlybirdIndexConfig.isUsingLIFODocumentOrdering() - ? orderedBatches : Lists.reverse(orderedBatches); - } - - private Comparator getMergingComparator() { - // We always want to retrieve larger tweet ids first. - // LIFO means that the smaller ids get inserted first --> ASCENDING order. - // FIFO would mean that we want to first insert the larger ids --> DESCENDING order. - return this.earlybirdIndexConfig.isUsingLIFODocumentOrdering() - ? ASCENDING : DESCENDING; - } - - /** - * Returns the smallest indexed tweet ID in this timeslice for the given partition. - * - * @param hashPartitionID The partition. - */ - public long getMinStatusID(int hashPartitionID) { - if (batches.isEmpty()) { - return 0; - } - - for (int i = 0; i < batches.size(); i++) { - long minStatusID = batches.get(i).getPartition(hashPartitionID).getMinStatusID(); - if (minStatusID != DailyStatusBatch.EMPTY_BATCH_STATUS_ID) { - return minStatusID; - } - } - - return 0; - } - - /** - * Returns the highest indexed tweet ID in this timeslice for the given partition. - * - * @param hashPartitionID The partition. 
- */ - public long getMaxStatusID(int hashPartitionID) { - if (batches.isEmpty()) { - return Long.MAX_VALUE; - } - - for (int i = batches.size() - 1; i >= 0; i--) { - long maxStatusID = batches.get(i).getPartition(hashPartitionID).getMaxStatusID(); - if (maxStatusID != DailyStatusBatch.EMPTY_BATCH_STATUS_ID) { - return maxStatusID; - } - } - - return Long.MAX_VALUE; - } - - /** - * Returns a string with some information for this timeslice. - */ - public String getDescription() { - StringBuilder builder = new StringBuilder(); - builder.append("TimeSlice[start date="); - builder.append(DailyStatusBatches.DATE_FORMAT.format(startDate)); - builder.append(", end date="); - builder.append(DailyStatusBatches.DATE_FORMAT.format(endDate)); - builder.append(", status count="); - builder.append(statusCount); - builder.append(", days count="); - builder.append(batches.size()); - builder.append("]"); - return builder.toString(); - } - } - - private final int maxSegmentSize; - private final DailyStatusBatches dailyStatusBatches; - private final Date tierStartDate; - private final Date tierEndDate; - private final ArchiveEarlybirdIndexConfig earlybirdIndexConfig; - - private List lastCachedTimeslices = null; - - public ArchiveTimeSlicer(int maxSegmentSize, - DailyStatusBatches dailyStatusBatches, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig) { - this(maxSegmentSize, dailyStatusBatches, TierConfig.DEFAULT_TIER_START_DATE, - TierConfig.DEFAULT_TIER_END_DATE, earlybirdIndexConfig); - } - - public ArchiveTimeSlicer(int maxSegmentSize, - DailyStatusBatches dailyStatusBatches, - Date tierStartDate, - Date tierEndDate, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig) { - this.maxSegmentSize = maxSegmentSize; - this.dailyStatusBatches = dailyStatusBatches; - this.tierStartDate = tierStartDate; - this.tierEndDate = tierEndDate; - this.earlybirdIndexConfig = earlybirdIndexConfig; - } - - private boolean cacheIsValid() throws IOException { - return lastCachedTimeslices != null - && !lastCachedTimeslices.isEmpty() - && cacheIsValid(lastCachedTimeslices.get(lastCachedTimeslices.size() - 1).endDate); - } - - private boolean cacheIsValid(Date lastDate) throws IOException { - if (lastCachedTimeslices == null || lastCachedTimeslices.isEmpty()) { - return false; - } - - // Check if we have a daily batch newer than the last batch used for the newest timeslice. - Calendar cal = Calendar.getInstance(); - cal.setTime(lastDate); - cal.add(Calendar.DATE, 1); - Date nextDate = cal.getTime(); - - boolean foundBatch = dailyStatusBatches.hasValidBatchForDay(nextDate); - - LOG.info("Checking cache: Looked for valid batch for day {}. Found: {}", - DailyStatusBatches.DATE_FORMAT.format(nextDate), foundBatch); - - return !foundBatch; - } - - private boolean timesliceIsFull(ArchiveTimeSlice timeSlice, DailyStatusBatch batch) { - return timeSlice.statusCount + batch.getMaxPerPartitionStatusCount() > maxSegmentSize; - } - - private void doTimeSlicing() throws IOException { - dailyStatusBatches.refresh(); - - lastCachedTimeslices = Lists.newArrayList(); - ArchiveTimeSlice currentTimeSlice = null; - - // Iterate over each day and add it to the current timeslice, until it gets full. 
- for (DailyStatusBatch batch : dailyStatusBatches.getStatusBatches()) { - if (!batch.isValid()) { - LOG.warn("Skipping hole: " + batch.getDate()); - continue; - } - - if (currentTimeSlice == null || timesliceIsFull(currentTimeSlice, batch)) { - if (currentTimeSlice != null) { - LOG.info("Filled timeslice: " + currentTimeSlice.getDescription()); - } - currentTimeSlice = new ArchiveTimeSlice(dailyStatusBatches, earlybirdIndexConfig); - currentTimeSlice.startDate = batch.getDate(); - lastCachedTimeslices.add(currentTimeSlice); - } - - currentTimeSlice.endDate = batch.getDate(); - currentTimeSlice.statusCount += batch.getMaxPerPartitionStatusCount(); - currentTimeSlice.batches.add(batch); - } - LOG.info("Last timeslice: {}", currentTimeSlice.getDescription()); - - LOG.info("Done with time slicing. Number of timeslices: {}", - lastCachedTimeslices.size()); - } - - /** - * Returns all timeslices for this earlybird. - */ - public List getTimeSlices() throws IOException { - if (cacheIsValid()) { - return lastCachedTimeslices; - } - - LOG.info("Cache is outdated. Loading new daily batches now..."); - - doTimeSlicing(); - - return lastCachedTimeslices != null ? Collections.unmodifiableList(lastCachedTimeslices) : null; - } - - /** - * Return the timeslices that overlap the tier start/end date ranges if they are specified - */ - public List getTimeSlicesInTierRange() throws IOException { - List timeSlices = getTimeSlices(); - if (tierStartDate == TierConfig.DEFAULT_TIER_START_DATE - && tierEndDate == TierConfig.DEFAULT_TIER_END_DATE) { - return timeSlices; - } - - List filteredTimeSlice = Lists.newArrayList(); - for (ArchiveTimeSlice timeSlice : timeSlices) { - if (timeSlice.startDate.before(tierEndDate) && !timeSlice.endDate.before(tierStartDate)) { - filteredTimeSlice.add(timeSlice); - } - } - - return filteredTimeSlice; - } - - @VisibleForTesting - protected DailyStatusBatches getDailyStatusBatches() { - return dailyStatusBatches; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/DailyStatusBatch.java b/src/java/com/twitter/search/earlybird/archive/DailyStatusBatch.java deleted file mode 100644 index 6dcc852ec..000000000 --- a/src/java/com/twitter/search/earlybird/archive/DailyStatusBatch.java +++ /dev/null @@ -1,166 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.IOException; -import java.util.Date; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Maps; -import com.google.gson.Gson; -import com.google.gson.JsonParseException; - -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -/** - * Represents a day's worth of statuses (tweets) for multiple hash partitions. - * - * Note that what this class contains is not the data, but metadata. - * - * A day of tweets will come from: - * - A scrubgen, if it has happened before the scrubgen date. - * - Our daily jobs pipeline, if it has happened after that. - * - * This class checks the _SUCCESS file exists in the "statuses" subdirectory and extracts the status - * count, min status id and max status id. 
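As a quick orientation before the daily-batch classes that follow, here is a minimal usage sketch of the ArchiveTimeSlicer API shown above. The 8,000,000 cap and the parameter names are illustrative assumptions, not values taken from this code; only the ArchiveTimeSlicer/ArchiveTimeSlice methods themselves come from the class above.

```java
// Sketch only: assumes a DailyStatusBatches and ArchiveEarlybirdIndexConfig were built elsewhere.
static void printTimeSlices(DailyStatusBatches dailyStatusBatches,
                            ArchiveEarlybirdIndexConfig indexConfig) throws IOException {
  // 8M is an illustrative per-partition cap, not the production maxSegmentSize.
  ArchiveTimeSlicer slicer = new ArchiveTimeSlicer(8_000_000, dailyStatusBatches, indexConfig);

  for (ArchiveTimeSlicer.ArchiveTimeSlice slice : slicer.getTimeSlicesInTierRange()) {
    System.out.println(slice.getDescription()
        + " partitions=" + slice.getNumHashPartitions()
        + " minId(p0)=" + slice.getMinStatusID(0)
        + " maxId(p0)=" + slice.getMaxStatusID(0));
  }
}
```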
- */ -public class DailyStatusBatch implements Comparable { - private static final Logger LOG = LoggerFactory.getLogger(DailyStatusBatch.class); - - public static final long EMPTY_BATCH_STATUS_ID = -1; - private static final String PARTITION_FORMAT = "p_%d_of_%d"; - private static final String SUCCESS_FILE_NAME = "_SUCCESS"; - - private final Map hashPartitionToStatuses = Maps.newHashMap(); - - private final Date date; - private final int numHashPartitions; - private final boolean hasSuccessFiles; - - public DailyStatusBatch(Date date, int numHashPartitions, Path statusPath, FileSystem hdfs) { - this.date = date; - this.numHashPartitions = numHashPartitions; - this.hasSuccessFiles = checkForSuccessFile(hdfs, date, statusPath); - } - - public Date getDate() { - return date; - } - - /** - * Check for the presence of the _SUCCESS file for the given day's path on HDFS for the statuses - * field group. - */ - private boolean checkForSuccessFile(FileSystem hdfs, Date inputDate, Path statusPath) { - Path dayPath = new Path(statusPath, ArchiveHDFSUtils.dateToPath(inputDate, "/")); - Path successFilePath = new Path(dayPath, SUCCESS_FILE_NAME); - try { - return hdfs.getFileStatus(successFilePath).isFile(); - } catch (IOException e) { - LOG.error("Could not verify existence of the _SUCCESS file. Assuming it doesn't exist.", e); - } - return false; - } - - /** - * Loads the data for this day for the given partition. - */ - public PartitionedBatch addPartition(FileSystem hdfs, Path dayPath, int hashPartitionID) - throws IOException { - String partitionDir = String.format(PARTITION_FORMAT, hashPartitionID, numHashPartitions); - Path path = new Path(dayPath, partitionDir); - PartitionedBatch batch = - new PartitionedBatch(path, hashPartitionID, numHashPartitions, date); - batch.load(hdfs); - hashPartitionToStatuses.put(hashPartitionID, batch); - return batch; - } - - public PartitionedBatch getPartition(int hashPartitionID) { - return hashPartitionToStatuses.get(hashPartitionID); - } - - /** - * Returns the greatest status count in all partitions belonging to this batch. 
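Note that the time slicer budgets against getMaxPerPartitionStatusCount() (defined just below) rather than a day's total tweet count; this appears to be because each hash partition later builds its own segment from the slice, so only the largest partition has to fit under the segment-size cap. A small sketch of that bookkeeping, with made-up counts:

```java
import java.util.Arrays;

final class SliceBudgetSketch {
  public static void main(String[] args) {
    // Made-up per-partition counts for one day; the day's contribution to the slice
    // budget is its largest partition, mirroring getMaxPerPartitionStatusCount().
    int[] perPartitionCounts = {90_000, 100_000, 95_000};
    int contribution = Arrays.stream(perPartitionCounts).max().orElse(0);  // 100_000

    int maxSegmentSize = 250_000;  // illustrative cap on tweets per partition per slice
    int sliceSoFar = 180_000;      // per-partition bound already accumulated in the current slice
    boolean startNewSlice = sliceSoFar + contribution > maxSegmentSize;    // 280_000 > 250_000 -> true
    System.out.println("start new slice: " + startNewSlice);
  }
}
```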
- */ - public int getMaxPerPartitionStatusCount() { - int maxPerPartitionStatusCount = 0; - for (PartitionedBatch batch : hashPartitionToStatuses.values()) { - maxPerPartitionStatusCount = Math.max(batch.getStatusCount(), maxPerPartitionStatusCount); - } - return maxPerPartitionStatusCount; - } - - public int getNumHashPartitions() { - return numHashPartitions; - } - - @VisibleForTesting - boolean hasSuccessFiles() { - return hasSuccessFiles; - } - - /** - * Returns true if the _status_counts files could be found in each - * hash partition subfolder that belongs to this timeslice - * AND the _SUCCESS file can be found at the root folder for day - */ - public boolean isValid() { - // make sure we have data for all hash partitions - for (int i = 0; i < numHashPartitions; i++) { - PartitionedBatch day = hashPartitionToStatuses.get(i); - if (day == null || !day.hasStatusCount() || day.isDisallowedEmptyPartition()) { - return false; - } - } - return hasSuccessFiles; - } - - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - builder.append("DailyStatusBatch[date=").append(date) - .append(",valid=").append(isValid()) - .append(",hasSuccessFiles=").append(hasSuccessFiles) - .append(",numHashPartitions=").append(numHashPartitions) - .append("]:\n"); - for (int i = 0; i < numHashPartitions; i++) { - builder.append('\t').append(hashPartitionToStatuses.get(i).toString()).append('\n'); - } - return builder.toString(); - } - - @Override - public int compareTo(DailyStatusBatch o) { - return date.compareTo(o.date); - } - - /** - * Serialize DailyStatusBatch to a json string. - */ - public String serializeToJson() { - return serializeToJson(new Gson()); - } - - @VisibleForTesting - String serializeToJson(Gson gson) { - return gson.toJson(this); - } - - /** - * Given a json string, parse its fields and construct a daily status batch. - * @param batchStr the json string representation of a daily status batch. - * @return the daily status batch constructed; if the string is of invalid format, null will be - * returned. 
- */ - static DailyStatusBatch deserializeFromJson(String batchStr) { - try { - return new Gson().fromJson(batchStr, DailyStatusBatch.class); - } catch (JsonParseException e) { - LOG.error("Error parsing json string: " + batchStr, e); - return null; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/DailyStatusBatches.java b/src/java/com/twitter/search/earlybird/archive/DailyStatusBatches.java deleted file mode 100644 index fa45a6ca3..000000000 --- a/src/java/com/twitter/search/earlybird/archive/DailyStatusBatches.java +++ /dev/null @@ -1,702 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.File; -import java.io.FileNotFoundException; -import java.io.FileWriter; -import java.io.IOException; -import java.util.Calendar; -import java.util.Collection; -import java.util.Date; -import java.util.NavigableMap; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicBoolean; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.collect.Maps; - -import org.apache.commons.io.IOUtils; -import org.apache.commons.lang3.time.FastDateFormat; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.util.date.DateUtil; -import com.twitter.search.common.util.io.LineRecordFileReader; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.partition.HdfsUtil; -import com.twitter.search.earlybird.partition.StatusBatchFlushVersion; - -/** - * Provides access to preprocessed statuses (tweets) to be indexed by archive search earlybirds. - * - * These tweets can be coming from a scrub gen or from the output of the daily jobs. 
- */ -public class DailyStatusBatches { - private static final Logger LOG = LoggerFactory.getLogger(DailyStatusBatches.class); - - // Maximum time to spend on obtaining daily status batches by computing or loading from HDFS - private static final Amount MAX_TIME_ALLOWED_DAILY_STATUS_BATCHES_MINUTES = - Amount.of(EarlybirdConfig.getLong("daily_status_batches_max_initial_load_time_minutes"), - Time.MINUTES); - // Time to wait before trying again when obtaining daily status batches fails - private static final Amount DAILY_STATUS_BATCHES_WAITING_TIME_MINUTES = - Amount.of(EarlybirdConfig.getLong("daily_status_batches_waiting_time_minutes"), - Time.MINUTES); - private static final String DAILY_STATUS_BATCHES_SYNC_PATH = - EarlybirdProperty.ZK_APP_ROOT.get() + "/daily_batches_sync"; - private static final String DAILY_BATCHES_ZK_LOCK = "daily_batches_zk_lock"; - private static final Amount DAILY_STATUS_BATCHES_ZK_LOCK_EXPIRATION_MINUTES = - Amount.of(EarlybirdConfig.getLong("daily_status_batches_zk_lock_expiration_minutes"), - Time.MINUTES); - - static final FastDateFormat DATE_FORMAT = FastDateFormat.getInstance("yyyyMMdd"); - - // before this date, there was no twitter - private static final Date FIRST_TWITTER_DAY = DateUtil.toDate(2006, 2, 1); - - private static final String STATUS_BATCHES_PREFIX = "status_batches"; - - private final String rootDir = - EarlybirdConfig.getString("hdfs_offline_segment_sync_dir", "top_archive_statuses"); - - private final String buildGen = - EarlybirdConfig.getString("offline_segment_build_gen", "bg_1"); - - public static final String STATUS_SUBDIR_NAME = "statuses"; - public static final String LAYOUT_SUBDIR_NAME = "layouts"; - public static final String SCRUB_GEN_SUFFIX_PATTERN = "scrubbed/%s"; - - private static final String INTERMEDIATE_COUNTS_SUBDIR_NAME = "counts"; - private static final String SUCCESS_FILE_NAME = "_SUCCESS"; - private static final Pattern HASH_PARTITION_PATTERN = Pattern.compile("p_(\\d+)_of_(\\d+)"); - private static final Date FIRST_TWEET_DAY = DateUtil.toDate(2006, 3, 21); - - private final Path rootPath = new Path(rootDir); - private final Path buildGenPath = new Path(rootPath, buildGen); - private final Path statusPath = new Path(buildGenPath, STATUS_SUBDIR_NAME); - - private final NavigableMap statusBatches = Maps.newTreeMap(); - - private Date firstValidDay = null; - private Date lastValidDay = null; - - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final Date scrubGenDay; - private long numberOfDaysWithValidScrubGenData; - - public DailyStatusBatches( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, Date scrubGenDay) throws IOException { - this.zkTryLockFactory = zooKeeperTryLockFactory; - this.scrubGenDay = scrubGenDay; - - FileSystem hdfs = null; - try { - hdfs = HdfsUtil.getHdfsFileSystem(); - verifyDirectory(hdfs); - } finally { - IOUtils.closeQuietly(hdfs); - } - } - - @VisibleForTesting - public Date getScrubGenDay() { - return scrubGenDay; - } - - public Collection getStatusBatches() { - return statusBatches.values(); - } - - /** - * Reset the states of the directory - */ - private void resetDirectory() { - statusBatches.clear(); - firstValidDay = null; - lastValidDay = null; - } - - /** - * Indicate whether the directory has been initialized - */ - private boolean isInitialized() { - return lastValidDay != null; - } - - /** - * Load the daily status batches from HDFS; return true if one or more batches could be loaded. 
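Putting the constants above together, the HDFS layout the loader walks looks roughly like the sketch below. The yyyy/MM/dd shape of ArchiveHDFSUtils.dateToPath(day, "/") and the concrete dates are assumptions; the directory names are the defaults shown above and the p_i_of_n / _SUCCESS conventions come from DailyStatusBatch.

```java
import org.apache.hadoop.fs.Path;

final class ArchiveLayoutSketch {
  public static void main(String[] args) {
    // top_archive_statuses/                <- hdfs_offline_segment_sync_dir (rootDir default)
    //   bg_1/                              <- offline_segment_build_gen (buildGen default)
    //     statuses/2022/01/15/             <- days on or after the scrub gen date
    //       _SUCCESS                       <- marker checked by DailyStatusBatch
    //       p_0_of_12/ ... p_11_of_12/     <- one subdirectory per hash partition
    //     scrubbed/20211001/statuses/...   <- days before the scrub gen date (SCRUB_GEN_SUFFIX_PATTERN)
    Path dayPath = new Path("top_archive_statuses/bg_1/statuses/2022/01/15");
    Path partitionPath = new Path(dayPath, String.format("p_%d_of_%d", 0, 12));
    System.out.println(partitionPath + " and " + new Path(dayPath, "_SUCCESS"));
  }
}
```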
- **/ - private boolean refreshByLoadingHDFSStatusBatches(final FileSystem fs) throws IOException { - // first find the latest valid end date of statuses - final Date lastValidStatusDay = getLastValidInputDateFromNow(fs); - if (lastValidStatusDay != null) { - if (hasStatusBatchesOnHdfs(fs, lastValidStatusDay)) { - if (loadStatusBatchesFromHdfs(fs, lastValidStatusDay)) { - return true; - } - } - } - - resetDirectory(); - return false; - } - - /** - * Checks the directory for new data and returns true, if one or more new batches could be loaded. - */ - public void refresh() throws IOException { - final FileSystem hdfs = HdfsUtil.getHdfsFileSystem(); - - final Stopwatch stopwatch = Stopwatch.createStarted(); - try { - if (!isInitialized()) { - if (initializeDailyStatusBatches(hdfs, stopwatch)) { - LOG.info("Successfully obtained daily status batches after {}", stopwatch); - } else { - String errMsg = "Failed to load or compute daily status batches after " - + stopwatch.toString(); - LOG.error(errMsg); - throw new IOException(errMsg); - } - } else { - loadNewDailyBatches(hdfs); - } - } finally { - IOUtils.closeQuietly(hdfs); - } - } - - private boolean initializeDailyStatusBatches(final FileSystem hdfs, - final Stopwatch stopwatch) throws IOException { - long timeSpentOnDailyBatches = 0L; - long maxAllowedTimeMs = MAX_TIME_ALLOWED_DAILY_STATUS_BATCHES_MINUTES.as(Time.MILLISECONDS); - long waitingTimeMs = DAILY_STATUS_BATCHES_WAITING_TIME_MINUTES.as(Time.MILLISECONDS); - boolean firstLoop = true; - LOG.info("Starting to load or compute daily status batches for the first time."); - while (timeSpentOnDailyBatches <= maxAllowedTimeMs && !Thread.currentThread().isInterrupted()) { - if (!firstLoop) { - try { - LOG.info("Sleeping " + waitingTimeMs - + " millis before trying to obtain daily batches again"); - Thread.sleep(waitingTimeMs); - } catch (InterruptedException e) { - LOG.warn("Interrupted while waiting to load daily batches", e); - Thread.currentThread().interrupt(); - break; - } - } - - if (isStatusBatchLoadingEnabled() && refreshByLoadingHDFSStatusBatches(hdfs)) { - LOG.info("Successfully loaded daily status batches after {}", stopwatch); - return true; - } - - final AtomicBoolean successRef = new AtomicBoolean(false); - if (computeDailyBatchesWithZKLock(hdfs, successRef, stopwatch)) { - return successRef.get(); - } - - timeSpentOnDailyBatches = stopwatch.elapsed(TimeUnit.MILLISECONDS); - firstLoop = false; - } - - return false; - } - - private boolean computeDailyBatchesWithZKLock(final FileSystem hdfs, - final AtomicBoolean successRef, - final Stopwatch stopwatch) throws IOException { - // Using a global lock to coordinate among earlybirds and segment builders so that only - // one instance would hit the HDFS name node to query the daily status directories - TryLock lock = zkTryLockFactory.createTryLock( - DatabaseConfig.getLocalHostname(), - DAILY_STATUS_BATCHES_SYNC_PATH, - DAILY_BATCHES_ZK_LOCK, - DAILY_STATUS_BATCHES_ZK_LOCK_EXPIRATION_MINUTES); - - return lock.tryWithLock(() -> { - LOG.info("Obtained ZK lock to compute daily status batches after {}", stopwatch); - successRef.set(initialLoadDailyBatchInfos(hdfs)); - if (successRef.get()) { - LOG.info("Successfully computed daily status batches after {}", stopwatch); - if (isStatusBatchFlushingEnabled()) { - LOG.info("Starting to store daily status batches to HDFS"); - if (storeStatusBatchesToHdfs(hdfs, lastValidDay)) { - LOG.info("Successfully stored daily status batches to HDFS"); - } else { - LOG.warn("Failed storing daily status 
batches to HDFS"); - } - } - } else { - LOG.info("Failed loading daily status info"); - } - }); - } - - private void verifyDirectory(FileSystem hdfs) throws IOException { - if (!hdfs.exists(rootPath)) { - throw new IOException("Root dir '" + rootPath + "' does not exist."); - } - - if (!hdfs.exists(buildGenPath)) { - throw new IOException("Build gen dir '" + buildGenPath + "' does not exist."); - } - - if (!hdfs.exists(statusPath)) { - throw new IOException("Status dir '" + statusPath + "' does not exist."); - } - } - - private void loadNewDailyBatches(FileSystem hdfs) throws IOException { - Preconditions.checkNotNull(lastValidDay); - - Calendar day = Calendar.getInstance(); - day.setTime(lastValidDay); - day.add(Calendar.DATE, 1); - - while (loadDay(hdfs, day.getTime()) != null) { - lastValidDay = day.getTime(); - day.add(Calendar.DATE, 1); - } - } - - private boolean initialLoadDailyBatchInfos(FileSystem hdfs) throws IOException { - LOG.info("Starting to build timeslice map from scratch."); - - final Date lastValidStatusDay = getLastValidInputDateFromNow(hdfs); - - if (lastValidStatusDay == null) { - LOG.warn("No data found in " + statusPath + " and scrubbed path"); - return false; - } - int mostRecentYear = DateUtil.getCalendar(lastValidStatusDay).get(Calendar.YEAR); - for (int year = 2006; year <= mostRecentYear; ++year) { - // construct path to avoid hdfs.listStatus() calls - Calendar day = Calendar.getInstance(); - day.set(year, Calendar.JANUARY, 1, 0, 0, 0); - day.set(Calendar.MILLISECOND, 0); - - Calendar yearEnd = Calendar.getInstance(); - yearEnd.set(year, Calendar.DECEMBER, 31, 0, 0, 0); - yearEnd.set(Calendar.MILLISECOND, 0); - - if (lastValidDay != null) { - // We're updating. - if (lastValidDay.after(yearEnd.getTime())) { - // This year was already loaded. - continue; - } - if (lastValidDay.after(day.getTime())) { - // Start one day after last valid date. - day.setTime(lastValidDay); - day.add(Calendar.DATE, 1); - } - } - - for (; !day.after(yearEnd); day.add(Calendar.DATE, 1)) { - loadDay(hdfs, day.getTime()); - } - } - - boolean updated = false; - numberOfDaysWithValidScrubGenData = 0; - - // Iterate batches in sorted order. 
- for (DailyStatusBatch batch : statusBatches.values()) { - if (!batch.isValid()) { - break; - } - if (batch.getDate().before(scrubGenDay)) { - numberOfDaysWithValidScrubGenData++; - } - if (firstValidDay == null) { - firstValidDay = batch.getDate(); - } - if (lastValidDay == null || lastValidDay.before(batch.getDate())) { - lastValidDay = batch.getDate(); - updated = true; - } - } - - LOG.info("Number of statusBatches: {}", statusBatches.size()); - return updated; - } - - private static String filesToString(FileStatus[] files) { - if (files == null) { - return "null"; - } - StringBuilder b = new StringBuilder(); - for (FileStatus s : files) { - b.append(s.getPath().toString()).append(", "); - } - return b.toString(); - } - - @VisibleForTesting - protected DailyStatusBatch loadDay(FileSystem hdfs, Date day) throws IOException { - Path dayPath = new Path(getStatusPathToUseForDay(day), ArchiveHDFSUtils.dateToPath(day, "/")); - LOG.debug("Looking for batch in " + dayPath.toString()); - DailyStatusBatch result = this.statusBatches.get(day); - if (result != null) { - return result; - } - - final FileStatus[] files; - try { - files = hdfs.listStatus(dayPath); - LOG.debug("Files found: " + filesToString(files)); - } catch (FileNotFoundException e) { - LOG.debug("loadDay() called, but directory does not exist for day: " + day - + " in: " + dayPath); - return null; - } - - if (files != null && files.length > 0) { - for (FileStatus file : files) { - Matcher matcher = HASH_PARTITION_PATTERN.matcher(file.getPath().getName()); - if (matcher.matches()) { - int numHashPartitions = Integer.parseInt(matcher.group(2)); - result = new DailyStatusBatch( - day, numHashPartitions, getStatusPathToUseForDay(day), hdfs); - - for (int partitionID = 0; partitionID < numHashPartitions; partitionID++) { - result.addPartition(hdfs, dayPath, partitionID); - } - - if (result.isValid()) { - statusBatches.put(day, result); - return result; - } else { - LOG.info("Invalid batch found for day: " + day + ", batch: " + result); - } - } else { - // skip logging the intermediate count subdirectories or _SUCCESS files. - if (!INTERMEDIATE_COUNTS_SUBDIR_NAME.equals(file.getPath().getName()) - && !SUCCESS_FILE_NAME.equals(file.getPath().getName())) { - LOG.warn("Path does not match hash partition pattern: " + file.getPath()); - } - } - } - } else { - LOG.warn("No data found for day: " + day + " in: " + dayPath - + " files null: " + (files == null)); - } - - return null; - } - - /** - * Determines if this directory has a valid batch for the given day. 
- */ - public boolean hasValidBatchForDay(Date day) throws IOException { - FileSystem hdfs = null; - try { - hdfs = HdfsUtil.getHdfsFileSystem(); - return hasValidBatchForDay(hdfs, day); - } finally { - IOUtils.closeQuietly(hdfs); - } - } - - private boolean hasValidBatchForDay(FileSystem fs, Date day) throws IOException { - DailyStatusBatch batch = loadDay(fs, day); - - return batch != null && batch.isValid(); - } - - @VisibleForTesting - Date getFirstValidDay() { - return firstValidDay; - } - - @VisibleForTesting - Date getLastValidDay() { - return lastValidDay; - } - - private Date getLastValidInputDateFromNow(FileSystem hdfs) throws IOException { - Calendar cal = Calendar.getInstance(); - cal.setTime(new Date()); // current date - return getLastValidInputDate(hdfs, cal); - } - - /** - * Starting from current date, probe backward till we find a valid input Date - */ - @VisibleForTesting - Date getLastValidInputDate(FileSystem hdfs, Calendar cal) throws IOException { - cal.set(Calendar.MILLISECOND, 0); - cal.set(Calendar.HOUR_OF_DAY, 0); - cal.set(Calendar.MINUTE, 0); - cal.set(Calendar.SECOND, 0); - cal.set(Calendar.MILLISECOND, 0); - Date lastValidInputDate = cal.getTime(); - LOG.info("Probing backwards for last valid data date from " + lastValidInputDate); - while (lastValidInputDate.after(FIRST_TWITTER_DAY)) { - if (hasValidBatchForDay(hdfs, lastValidInputDate)) { - LOG.info("Found latest valid data on date " + lastValidInputDate); - LOG.info(" Used path: {}", getStatusPathToUseForDay(lastValidInputDate)); - return lastValidInputDate; - } - cal.add(Calendar.DATE, -1); - lastValidInputDate = cal.getTime(); - } - - return null; - } - - /** - * Check if the daily status batches are already on HDFS - */ - @VisibleForTesting - boolean hasStatusBatchesOnHdfs(FileSystem fs, Date lastDataDay) { - String hdfsFileName = getHdfsStatusBatchSyncFileName(lastDataDay); - try { - return fs.exists(new Path(hdfsFileName)); - } catch (IOException ex) { - LOG.error("Failed checking status batch file on HDFS: " + hdfsFileName, ex); - return false; - } - } - - /** - * Load the daily status batches from HDFS by first copying the file from HDFS to local disk - * and then reading from the local disk. - * - * @param day the latest day of valid statuses. - * @return true if the loading is successful. 
- */ - @VisibleForTesting - boolean loadStatusBatchesFromHdfs(FileSystem fs, Date day) { - // set the directory state to initial state - resetDirectory(); - - String fileHdfsPath = getHdfsStatusBatchSyncFileName(day); - String fileLocalPath = getLocalStatusBatchSyncFileName(day); - - LOG.info("Using " + fileHdfsPath + " as the HDFS batch summary load path."); - LOG.info("Using " + fileLocalPath + " as the local batch summary sync path."); - - LineRecordFileReader lineReader = null; - try { - fs.copyToLocalFile(new Path(fileHdfsPath), new Path(fileLocalPath)); - - lineReader = new LineRecordFileReader(fileLocalPath); - String batchLine; - while ((batchLine = lineReader.readNext()) != null) { - DailyStatusBatch batch = DailyStatusBatch.deserializeFromJson(batchLine); - if (batch == null) { - LOG.error("Invalid daily status batch constructed from line: " + batchLine); - resetDirectory(); - return false; - } - Date date = batch.getDate(); - if (firstValidDay == null || firstValidDay.after(date)) { - firstValidDay = date; - } - if (lastValidDay == null || lastValidDay.before(date)) { - lastValidDay = date; - } - statusBatches.put(date, batch); - } - LOG.info("Loaded {} status batches from HDFS: {}", - statusBatches.size(), fileHdfsPath); - LOG.info("First entry: {}", statusBatches.firstEntry().getValue().toString()); - LOG.info("Last entry: {}", statusBatches.lastEntry().getValue().toString()); - - return true; - } catch (IOException ex) { - LOG.error("Failed loading time slices from HDFS: " + fileHdfsPath, ex); - resetDirectory(); - return false; - } finally { - if (lineReader != null) { - lineReader.stop(); - } - } - } - - /** - * Flush the daily status batches to local disk and then upload to HDFS. - */ - private boolean storeStatusBatchesToHdfs(FileSystem fs, Date day) { - Preconditions.checkNotNull(lastValidDay); - - if (!StatusBatchFlushVersion.CURRENT_FLUSH_VERSION.isOfficial()) { - LOG.info("Status batch flush version is not official, no batches will be flushed to HDFS"); - return true; - } - - String fileLocalPath = getLocalStatusBatchSyncFileName(day); - - // Flush to local disk - File outputFile = null; - FileWriter fileWriter = null; - try { - LOG.info("Flushing daily status batches into: " + fileLocalPath); - outputFile = new File(fileLocalPath); - outputFile.getParentFile().mkdirs(); - if (!outputFile.getParentFile().exists()) { - LOG.error("Cannot create directory: " + outputFile.getParentFile().toString()); - return false; - } - fileWriter = new FileWriter(outputFile, false); - for (Date date : statusBatches.keySet()) { - fileWriter.write(statusBatches.get(date).serializeToJson()); - fileWriter.write("\n"); - } - fileWriter.flush(); - - // Upload the file to HDFS - return uploadStatusBatchesToHdfs(fs, day); - } catch (IOException e) { - String fileHdfsPath = getHdfsStatusBatchSyncFileName(day); - LOG.error("Failed storing status batches to HDFS: " + fileHdfsPath, e); - return false; - } finally { - try { - if (fileWriter != null) { - fileWriter.close(); - } - } catch (IOException e) { - LOG.error("Error to close fileWrite.", e); - } - if (outputFile != null) { - // Delete the local file - outputFile.delete(); - } - } - } - - /** - * Upload the status batches to HDFS. 
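The batch summary that storeStatusBatchesToHdfs() writes above is newline-delimited JSON, one Gson-serialized DailyStatusBatch per line. A minimal sketch of reading such a file back outside of the HDFS plumbing; the local path is illustrative, and the Gson round trip simply mirrors serializeToJson()/deserializeFromJson().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.google.gson.Gson;

final class SummaryFileSketch {
  public static void main(String[] args) throws IOException {
    Gson gson = new Gson();
    int batches = 0;
    // Illustrative local copy of a batch summary file, one JSON object per line.
    try (BufferedReader reader =
             Files.newBufferedReader(Paths.get("/tmp/status_batches_20220115"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        DailyStatusBatch batch = gson.fromJson(line, DailyStatusBatch.class);
        if (batch != null) {
          batches++;
          System.out.println(batch.getDate() + " valid=" + batch.isValid());
        }
      }
    }
    System.out.println("loaded " + batches + " daily batches");
  }
}
```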
- */ - @VisibleForTesting - boolean uploadStatusBatchesToHdfs(FileSystem fs, Date day) { - String localFileName = getLocalStatusBatchSyncFileName(day); - String hdfsFileName = getHdfsStatusBatchSyncFileName(day); - - LOG.info("Using " + hdfsFileName + " as the HDFS batch summary upload path."); - LOG.info("Using " + localFileName + " as the local batch summary sync path."); - - try { - Path hdfsFilePath = new Path(hdfsFileName); - if (fs.exists(hdfsFilePath)) { - LOG.warn("Found status batch file on HDFS: " + hdfsFileName); - return true; - } - - String hdfsTempName = getHdfsStatusBatchTempSyncFileName(day); - Path hdfsTempPath = new Path(hdfsTempName); - if (fs.exists(hdfsTempPath)) { - LOG.info("Found existing temporary status batch file on HDFS, removing: " + hdfsTempName); - if (!fs.delete(hdfsTempPath, false)) { - LOG.error("Failed to delete temporary file: " + hdfsTempName); - return false; - } - } - fs.copyFromLocalFile(new Path(localFileName), hdfsTempPath); - - if (fs.rename(hdfsTempPath, hdfsFilePath)) { - LOG.debug("Renamed " + hdfsTempName + " on HDFS to: " + hdfsFileName); - return true; - } else { - LOG.error("Failed to rename " + hdfsTempName + " on HDFS to: " + hdfsFileName); - return false; - } - } catch (IOException ex) { - LOG.error("Failed uploading status batch file to HDFS: " + hdfsFileName, ex); - return false; - } - } - - private static boolean isStatusBatchFlushingEnabled() { - return EarlybirdProperty.ARCHIVE_DAILY_STATUS_BATCH_FLUSHING_ENABLED.get(false); - } - - private static boolean isStatusBatchLoadingEnabled() { - return EarlybirdConfig.getBool("archive_daily_status_batch_loading_enabled", false); - } - - private static String getVersionFileExtension() { - return StatusBatchFlushVersion.CURRENT_FLUSH_VERSION.getVersionFileExtension(); - } - - String getStatusBatchSyncRootDir() { - return EarlybirdConfig.getString("archive_daily_status_batch_sync_dir", - "daily_status_batches") + "/" + scrubGenSuffix(); - } - - @VisibleForTesting - String getLocalStatusBatchSyncFileName(Date day) { - return getStatusBatchSyncRootDir() + "/" + STATUS_BATCHES_PREFIX + "_" - + DATE_FORMAT.format(day) + getVersionFileExtension(); - } - - String getHdfsStatusBatchSyncRootDir() { - return EarlybirdConfig.getString("hdfs_archive_daily_status_batch_sync_dir", - "daily_status_batches") + "/" + scrubGenSuffix(); - } - - @VisibleForTesting - String getHdfsStatusBatchSyncFileName(Date day) { - return getHdfsStatusBatchSyncRootDir() + "/" + STATUS_BATCHES_PREFIX + "_" - + DATE_FORMAT.format(day) + getVersionFileExtension(); - } - - private String getHdfsStatusBatchTempSyncFileName(Date day) { - return getHdfsStatusBatchSyncRootDir() + "/" + DatabaseConfig.getLocalHostname() + "_" - + STATUS_BATCHES_PREFIX + "_" + DATE_FORMAT.format(day) + getVersionFileExtension(); - } - - private String scrubGenSuffix() { - return String.format(SCRUB_GEN_SUFFIX_PATTERN, DATE_FORMAT.format(scrubGenDay)); - } - - /** - * Returns the path to the directory that stores the statuses for the given day. 
- */ - public Path getStatusPathToUseForDay(Date day) { - if (!day.before(scrubGenDay)) { - return statusPath; - } - - String suffix = scrubGenSuffix(); - Preconditions.checkArgument(!suffix.isEmpty()); - Path scrubPath = new Path(buildGenPath, suffix); - return new Path(scrubPath, STATUS_SUBDIR_NAME); - } - - /** - * Determines if the data for the specified scrub gen was fully built, by checking the number of - * days for which data was built against the expected number of days extracted from the specified - * scrub gen date. - */ - public boolean isScrubGenDataFullyBuilt(FileSystem hdfs) throws IOException { - initialLoadDailyBatchInfos(hdfs); - if (numberOfDaysWithValidScrubGenData == 0) { - LOG.warn("numberOfDaysWithValidScrubGenData is 0"); - } - long expectedDays = getDiffBetweenDays(scrubGenDay); - return expectedDays == numberOfDaysWithValidScrubGenData; - } - - @VisibleForTesting - long getDiffBetweenDays(Date day) { - long diff = day.getTime() - FIRST_TWEET_DAY.getTime(); - return TimeUnit.DAYS.convert(diff, TimeUnit.MILLISECONDS); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/PartitionedBatch.java b/src/java/com/twitter/search/earlybird/archive/PartitionedBatch.java deleted file mode 100644 index b72e8c7f2..000000000 --- a/src/java/com/twitter/search/earlybird/archive/PartitionedBatch.java +++ /dev/null @@ -1,333 +0,0 @@ -package com.twitter.search.earlybird.archive; - -import java.io.FileNotFoundException; -import java.io.IOException; -import java.util.Comparator; -import java.util.Date; -import java.util.List; -import java.util.concurrent.TimeUnit; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Function; -import com.google.common.base.Predicate; -import com.google.common.collect.ComparisonChain; -import com.google.common.collect.Lists; - -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.hadoop.fs.PathFilter; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.config.Config; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentUtil; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.date.DateUtil; -import com.twitter.search.common.util.io.EmptyRecordReader; -import com.twitter.search.common.util.io.LzoThriftBlockFileReader; -import com.twitter.search.common.util.io.MergingSortedRecordReader; -import com.twitter.search.common.util.io.TransformingRecordReader; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.partition.HdfsUtil; - -/** - * A batch of pre-processed tweets for a single hash partition from a particular day. 
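Each partition directory encodes its per-day metadata directly in the name of a _status_count file, matched by the STATUS_COUNT_FILE_PATTERN defined just below. A decoding sketch with a made-up file name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatusCountNameSketch {
  public static void main(String[] args) {
    // Same shape as STATUS_COUNT_FILE_PATTERN; the concrete numbers are made up.
    Pattern pattern = Pattern.compile("_status_count_(\\d+)_minid_(\\d+)_maxid_(\\d+)");
    Matcher m = pattern.matcher(
        "_status_count_2500000_minid_1100000000000000000_maxid_1103000000000000000");
    if (m.matches()) {
      int statusCount = Integer.parseInt(m.group(1));   // tweets in this partition for the day
      long minStatusId = Long.parseLong(m.group(2));    // smallest indexed tweet ID
      long maxStatusId = Long.parseLong(m.group(3));    // largest indexed tweet ID
      System.out.println(statusCount + " tweets, IDs " + minStatusId + ".." + maxStatusId);
    }
  }
}
```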
- */ -public class PartitionedBatch { - private static final Logger LOG = LoggerFactory.getLogger(PartitionedBatch.class); - private static final Date START_DATE_INCLUSIVE = DateUtil.toDate(2006, 03, 21); - private static final String STATUS_COUNT_FILE_PREFIX = "_status_count_"; - private static final Pattern STATUS_COUNT_FILE_PATTERN = - Pattern.compile(STATUS_COUNT_FILE_PREFIX + "(\\d+)_minid_(\\d+)_maxid_(\\d+)"); - private static final int MAXIMUM_OUT_OF_ORDER_TOLERANCE_HOURS = - EarlybirdConfig.getInt("archive_max_out_of_order_tolerance_hours", 12); - private static final int READER_INIT_IOEXCEPTION_RETRIES = 20; - private static final PathFilter LZO_DATA_FILES_FILTER = file -> file.getName().endsWith(".lzo"); - private static final PathFilter TXT_DATA_FILES_FILTER = file -> file.getName().endsWith(".txt"); - - private static final Comparator DESC_THRIFT_INDEXING_EVENT_COMPARATOR = - (o1, o2) -> ComparisonChain.start() - .compare(o2.getSortId(), o1.getSortId()) - .compare(o2.getUid(), o1.getUid()) - .result(); - - // Number archive tweets skipped because they are too out-of-order. - private static final SearchCounter OUT_OF_ORDER_STATUSES_SKIPPED = - SearchCounter.export("out_of_order_archive_statuses_skipped"); - - @VisibleForTesting - protected static final long MAXIMUM_OUT_OF_ORDER_TOLERANCE_MILLIS = - TimeUnit.HOURS.toMillis(MAXIMUM_OUT_OF_ORDER_TOLERANCE_HOURS); - - private final Date date; - private final Path path; - private int statusCount; - private long minStatusID; - private long maxStatusID; - private final int hashPartitionID; - private boolean hasStatusCountFile; - private final int numHashPartitions; - - @VisibleForTesting - public PartitionedBatch( - Path path, - int hashPartitionID, - int numHashPartitions, - Date date) { - this.path = path; - this.hashPartitionID = hashPartitionID; - this.numHashPartitions = numHashPartitions; - this.date = date; - } - - /** - * Loads all the information (tweet count, etc.) for this partition and day from HDFS. - */ - public void load(FileSystem hdfs) throws IOException { - FileStatus[] dailyBatchFiles = null; - try { - // listStatus() javadoc says it throws FileNotFoundException when path does not exist. - // However, the actual implementations return null or an empty array instead. - // We handle all 3 cases: null, empty array, or FileNotFoundException. - dailyBatchFiles = hdfs.listStatus(path); - } catch (FileNotFoundException e) { - // don't do anything here and the day will be handled as empty. - } - - if (dailyBatchFiles != null && dailyBatchFiles.length > 0) { - for (FileStatus file : dailyBatchFiles) { - String fileName = file.getPath().getName(); - if (fileName.equals(STATUS_COUNT_FILE_PREFIX)) { - // zero tweets in this partition - this can happen for early days in 2006 - handleEmptyPartition(); - } else { - Matcher matcher = STATUS_COUNT_FILE_PATTERN.matcher(fileName); - if (matcher.matches()) { - try { - statusCount = Integer.parseInt(matcher.group(1)); - // Only adjustMinStatusId in production. For tests, this makes the tests harder to - // understand. - minStatusID = Config.environmentIsTest() ? Long.parseLong(matcher.group(2)) - : adjustMinStatusId(Long.parseLong(matcher.group(2)), date); - maxStatusID = Long.parseLong(matcher.group(3)); - hasStatusCountFile = true; - } catch (NumberFormatException e) { - // invalid file - ignore - LOG.warn("Could not parse status count file name.", e); - } - } - } - } - } else { - // Partition folder does not exist. 
This case can happen for early days of twitter - // where some partitions are empty. Set us to having a status count file, the validity of - // the parent DailyStatusBatch will still be determined by whether there was a _SUCCESS file - // in the day root. - handleEmptyPartition(); - - if (date.after(getEarliestDenseDay())) { - LOG.error("Unexpected empty directory {} for {}", path, date); - } - } - } - - private void handleEmptyPartition() { - statusCount = 0; - minStatusID = DailyStatusBatch.EMPTY_BATCH_STATUS_ID; - maxStatusID = DailyStatusBatch.EMPTY_BATCH_STATUS_ID; - hasStatusCountFile = true; - } - - /** - * Sometimes tweets are out-of-order (E.g. a tweet from Sep 2012 got into a - * batch in July 2013). See SEARCH-1750 for more details. - * This adjust the minStatusID if it is badly out-of-order. - */ - @VisibleForTesting - protected static long adjustMinStatusId(long minStatusID, Date date) { - long dateTime = date.getTime(); - // If the daily batch is for a day before we started using snow flake IDs. Never adjust. - if (!SnowflakeIdParser.isUsableSnowflakeTimestamp(dateTime)) { - return minStatusID; - } - - long earliestStartTime = dateTime - MAXIMUM_OUT_OF_ORDER_TOLERANCE_MILLIS; - long minStatusTime = SnowflakeIdParser.getTimestampFromTweetId(minStatusID); - if (minStatusTime < earliestStartTime) { - long newMinId = SnowflakeIdParser.generateValidStatusId(earliestStartTime, 0); - LOG.info("Daily batch for " + date + " has badly out of order tweet: " + minStatusID - + ". The minStatusID for the day this batch is adjusted to " + newMinId); - return newMinId; - } else { - return minStatusID; - } - } - - /** - * Returns a reader that reads tweets from the given directory. - * - * @param archiveSegment Determines the timeslice ID of all read tweets. - * @param tweetsPath The path to the directory where the tweets for this day are stored. - * @param documentFactory The ThriftIndexingEvent to TweetDocument converter. - */ - public RecordReader getTweetReaders( - ArchiveSegment archiveSegment, - Path tweetsPath, - DocumentFactory documentFactory) throws IOException { - RecordReader tweetDocumentReader = - new TransformingRecordReader<>( - createTweetReader(tweetsPath), new Function() { - @Override - public TweetDocument apply(ThriftIndexingEvent event) { - return new TweetDocument( - event.getSortId(), - archiveSegment.getTimeSliceID(), - EarlybirdThriftDocumentUtil.getCreatedAtMs(event.getDocument()), - documentFactory.newDocument(event) - ); - } - }); - - tweetDocumentReader.setExhaustStream(true); - return tweetDocumentReader; - } - - private RecordReader createTweetReader(Path tweetsPath) throws IOException { - if (date.before(START_DATE_INCLUSIVE)) { - return new EmptyRecordReader<>(); - } - - List> readers = Lists.newArrayList(); - FileSystem hdfs = HdfsUtil.getHdfsFileSystem(); - try { - Path dayPath = new Path(tweetsPath, ArchiveHDFSUtils.dateToPath(date, "/")); - Path partitionPath = - new Path(dayPath, String.format("p_%d_of_%d", hashPartitionID, numHashPartitions)); - PathFilter pathFilter = - Config.environmentIsTest() ? 
TXT_DATA_FILES_FILTER : LZO_DATA_FILES_FILTER; - FileStatus[] files = hdfs.listStatus(partitionPath, pathFilter); - for (FileStatus fileStatus : files) { - String fileStatusPath = fileStatus.getPath().toString().replaceAll("file:/", "/"); - RecordReader<ThriftIndexingEvent> reader = createRecordReaderWithRetries(fileStatusPath); - readers.add(reader); - } - } finally { - IOUtils.closeQuietly(hdfs); - } - - if (readers.isEmpty()) { - return new EmptyRecordReader<>(); - } - - return new MergingSortedRecordReader<>(DESC_THRIFT_INDEXING_EVENT_COMPARATOR, readers); - } - - private RecordReader<ThriftIndexingEvent> createRecordReaderWithRetries(String filePath) - throws IOException { - Predicate<ThriftIndexingEvent> recordFilter = getRecordFilter(); - int numTries = 0; - while (true) { - try { - ++numTries; - return new LzoThriftBlockFileReader<>(filePath, ThriftIndexingEvent.class, recordFilter); - } catch (IOException e) { - if (numTries < READER_INIT_IOEXCEPTION_RETRIES) { - LOG.warn("Failed to open LzoThriftBlockFileReader for " + filePath + ". Will retry.", e); - } else { - LOG.error("Failed to open LzoThriftBlockFileReader for " + filePath - + " after too many retries.", e); - throw e; - } - } - } - } - - private Predicate<ThriftIndexingEvent> getRecordFilter() { - return Config.environmentIsTest() ? null : input -> { - if (input == null) { - return false; - } - // We only guard against status IDs that are too small, because it is possible - // for a very old tweet to get into today's batch, but not possible for a very - // large ID (a future tweet ID that is not yet published) to get in today's - // batch, unless tweet ID generation messed up. - long statusId = input.getSortId(); - boolean keep = statusId >= minStatusID; - if (!keep) { - LOG.debug("Out of order documentId: {} minStatusID: {} Date: {} Path: {}", - statusId, minStatusID, date, path); - OUT_OF_ORDER_STATUSES_SKIPPED.increment(); - } - return keep; - }; - } - - /** - * Returns the number of statuses in this batch. - */ - public int getStatusCount() { - return statusCount; - } - - /** - * Whether the _status_count file was found in this folder. - */ - public boolean hasStatusCount() { - return hasStatusCountFile; - } - - public long getMinStatusID() { - return minStatusID; - } - - public long getMaxStatusID() { - return maxStatusID; - } - - public Date getDate() { - return date; - } - - public Path getPath() { - return path; - } - - /** - * Checks whether the partition is empty and whether that is disallowed - * (an empty partition can only happen before 2010, and means the directory was - * missing when the scan happened). - * - * @return true if the partition has no documents and that is not allowed. 
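The clamping done by adjustMinStatusId() above is plain Snowflake-ID arithmetic. The standalone sketch below assumes the standard Snowflake layout (millisecond timestamp shifted left 22 bits from epoch 1288834974657), since SnowflakeIdParser itself is not reproduced in this diff; the example IDs and dates are made up.

```java
import java.util.concurrent.TimeUnit;

final class OutOfOrderClampSketch {
  private static final long TWEPOCH_MS = 1288834974657L;                 // assumed Snowflake epoch
  private static final long TOLERANCE_MS = TimeUnit.HOURS.toMillis(12);  // default tolerance above

  static long timestampOf(long tweetId) {
    return (tweetId >> 22) + TWEPOCH_MS;
  }

  static long smallestIdMintedAt(long timestampMs) {
    return (timestampMs - TWEPOCH_MS) << 22;
  }

  public static void main(String[] args) {
    long batchDayMs = 1609459200000L;  // 2021-01-01T00:00:00Z, the day of the daily batch
    // A tweet from ~120 days earlier that leaked into this day's batch:
    long reportedMinId = smallestIdMintedAt(batchDayMs - TimeUnit.DAYS.toMillis(120));

    long earliestAllowedMs = batchDayMs - TOLERANCE_MS;
    long adjustedMinId = timestampOf(reportedMinId) < earliestAllowedMs
        ? smallestIdMintedAt(earliestAllowedMs)   // clamp, as adjustMinStatusId() does
        : reportedMinId;

    System.out.println("reported=" + reportedMinId + " adjusted=" + adjustedMinId);
  }
}
```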
- */ - public boolean isDisallowedEmptyPartition() { - return hasStatusCountFile - && statusCount == 0 - && minStatusID == DailyStatusBatch.EMPTY_BATCH_STATUS_ID - && maxStatusID == DailyStatusBatch.EMPTY_BATCH_STATUS_ID - && date.after(getEarliestDenseDay()); - } - - @Override - public String toString() { - return "PartitionedBatch[hashPartitionId=" + hashPartitionID - + ",numHashPartitions=" + numHashPartitions - + ",date=" + date - + ",path=" + path - + ",hasStatusCountFile=" + hasStatusCountFile - + ",statusCount=" + statusCount + "]"; - } - - private Date getEarliestDenseDay() { - return EarlybirdConfig.getDate("archive_search_earliest_dense_day"); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BUILD.bazel b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BUILD.bazel deleted file mode 100644 index f630ffd06..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BUILD.bazel +++ /dev/null @@ -1,64 +0,0 @@ -java_library( - name = "segment_builder_lib", - sources = ["**/*.java"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-server", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-twitter-science-provider", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "decider/src/main/scala", - "finatra/inject/inject-core/src/main/scala", - "finatra/inject/inject-server/src/main/scala/com/twitter/inject/server", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/quantity", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/database", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/zookeeper", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util:closeresourceutil", - "src/java/com/twitter/search/common/util:gcutil", - "src/java/com/twitter/search/common/util:kerberos", - "src/java/com/twitter/search/common/util/date", - "src/java/com/twitter/search/common/util/io:flushable", - "src/java/com/twitter/search/common/util/zktrylock", - "src/java/com/twitter/search/common/util/zookeeper", - "src/java/com/twitter/search/earlybird:earlybird-lib", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird/common/config", - "src/java/com/twitter/search/earlybird/common/userupdates", - "util/util-core:scala", - ], -) - -# Using hadoop_binary target can automatically exclude hadoop related jars in the built jar -# and load in the right jars based on hadoop config. 
-hadoop_binary( - name = "segment_builder_binary", - basename = "segment_builder", - main = "com.twitter.search.earlybird.archive.segmentbuilder.SegmentBuilderMain", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":segment_builder_lib", - "src/java/com/twitter/search/common/logging:search-log4j", - ], -) diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BuiltAndFinalizedSegment.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BuiltAndFinalizedSegment.java deleted file mode 100644 index a185d41f2..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/BuiltAndFinalizedSegment.java +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - -public class BuiltAndFinalizedSegment extends SegmentBuilderSegment { - public BuiltAndFinalizedSegment( - SegmentInfo segmentInfo, - SegmentConfig segmentConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - int alreadyRetriedCount, - SegmentSyncConfig sync) { - - super(segmentInfo, segmentConfig, earlybirdSegmentFactory, alreadyRetriedCount, sync); - } - - @Override - public SegmentBuilderSegment handle() throws SegmentInfoConstructionException, - SegmentUpdaterException { - - throw new IllegalStateException("Should not handle a BuildAndFinalizedSegment."); - } - - @Override - public boolean isBuilt() { - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/NotYetBuiltSegment.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/NotYetBuiltSegment.java deleted file mode 100644 index 16249a7b1..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/NotYetBuiltSegment.java +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.util.concurrent.atomic.AtomicBoolean; - -import com.google.common.base.Stopwatch; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.util.GCUtil; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.earlybird.archive.ArchiveSegmentUpdater; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - -public class NotYetBuiltSegment extends SegmentBuilderSegment { - private static final Logger LOG = LoggerFactory.getLogger(NotYetBuiltSegment.class); - - public NotYetBuiltSegment( - SegmentInfo segmentInfo, - SegmentConfig segmentConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - int alreadyRetriedCount, - SegmentSyncConfig sync) { - - super(segmentInfo, segmentConfig, earlybirdSegmentFactory, alreadyRetriedCount, sync); - } - - /** - * 1. Grab the ZK lock for this segment. - * 2a. if lock fails, another host is updating; return the SOMEONE_ELSE_IS_BUILDING state. - * 2b. if lock succeeds, check again if the updated segment exists on HDFS. - * 3a. if so, just move on. - * 3b. if not, update the segment. - * In both cases, we need to check if the segment can now be marked as BUILT_AND_FINALIZED. 
- */ - @Override - public SegmentBuilderSegment handle() - throws SegmentUpdaterException, SegmentInfoConstructionException { - LOG.info("Handling a not yet built segment: {}", this.getSegmentName()); - Stopwatch stopwatch = Stopwatch.createStarted(); - TryLock lock = getZooKeeperTryLock(); - - // The tryWithLock can only access variables from parent class that are final. However, we - // would like to pass the process() return value to the parent class. So here we use - // AtomicBoolean reference instead of Boolean. - final AtomicBoolean successRef = new AtomicBoolean(false); - boolean gotLock = lock.tryWithLock(() -> { - ArchiveSegmentUpdater updater = new ArchiveSegmentUpdater( - segmentConfig.getTryLockFactory(), - sync, - segmentConfig.getEarlybirdIndexConfig(), - Clock.SYSTEM_CLOCK); - - boolean success = updater.updateSegment(segmentInfo); - successRef.set(success); - }); - - if (!gotLock) { - LOG.info("cannot acquire zookeeper lock for: " + segmentInfo); - return new SomeoneElseIsBuildingSegment( - segmentInfo, - segmentConfig, - earlybirdSegmentFactory, - alreadyRetriedCount, - sync); - } - - // 1. we want to make sure the heap is clean right after building a segment so that it's ready - // for us to start allocations for a new segment - // — I think we've had cases where we were seeing OOM's while building - // 2. the thing that I think it helps with is compaction (vs just organically running CMS) - // — which would clean up the heap, but may leave it in a fragmented state - // — and running a Full GC is supposed to compact the remaining tenured space. - GCUtil.runGC(); - - if (successRef.get()) { - LOG.info("Indexing segment {} took {}", segmentInfo, stopwatch); - LOG.info("Finished building {}", segmentInfo.getSegment().getSegmentName()); - return new BuiltAndFinalizedSegment( - segmentInfo, segmentConfig, earlybirdSegmentFactory, 0, sync); - } else { - int alreadyTried = alreadyRetriedCount + 1; - String errMsg = "failed updating segment for: " + segmentInfo - + " for " + alreadyTried + " times"; - LOG.error(errMsg); - if (alreadyTried < segmentConfig.getMaxRetriesOnFailure()) { - return new NotYetBuiltSegment( - createNewSegmentInfo(segmentInfo), - segmentConfig, - earlybirdSegmentFactory, - alreadyTried, - sync); - } else { - throw new SegmentUpdaterException(errMsg); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/RateLimitingSegmentHandler.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/RateLimitingSegmentHandler.java deleted file mode 100644 index 9ef883672..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/RateLimitingSegmentHandler.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.util.HashMap; -import java.util.Map; - -import com.twitter.common.util.Clock; - -/** - * A class that prevents handling a given segment more than once every hdfsCheckIntervalMillis - */ -public class RateLimitingSegmentHandler { - private final long hdfsCheckIntervalMillis; - private final Clock clock; - private final Map segmentNameToLastUpdatedTimeMillis = new HashMap<>(); - - RateLimitingSegmentHandler(long hdfsCheckIntervalMillis, Clock clock) { - this.hdfsCheckIntervalMillis = hdfsCheckIntervalMillis; - this.clock = clock; - } - - SegmentBuilderSegment processSegment(SegmentBuilderSegment segment) - throws SegmentUpdaterException, SegmentInfoConstructionException { - - String segmentName = segment.getSegmentName(); - - Long 
lastUpdatedMillis = segmentNameToLastUpdatedTimeMillis.get(segmentName); - if (lastUpdatedMillis == null) { - lastUpdatedMillis = 0L; - } - - long nowMillis = clock.nowMillis(); - if (nowMillis - lastUpdatedMillis < hdfsCheckIntervalMillis) { - return segment; - } - segmentNameToLastUpdatedTimeMillis.put(segmentName, nowMillis); - - return segment.handle(); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilder.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilder.java deleted file mode 100644 index 1f3f47cf9..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilder.java +++ /dev/null @@ -1,540 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Date; -import java.util.HashMap; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.Optional; -import java.util.Random; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.collect.ComparisonChain; -import com.google.common.collect.ImmutableList; -import com.google.common.util.concurrent.Uninterruptibles; -import com.google.inject.Inject; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.inject.annotations.Flag; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchStatsReceiverImpl; -import com.twitter.search.common.partitioning.zookeeper.SearchZkClient; -import com.twitter.search.common.util.Kerberos; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.archive.ArchiveOnDiskEarlybirdIndexConfig; -import com.twitter.search.earlybird.archive.ArchiveSegment; -import com.twitter.search.earlybird.archive.DailyStatusBatches; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.util.ScrubGenUtil; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; - -/** - * This class provides the core logic to build segment indices offline. - * For each server, it coordinate via zookeeper to pick the next segment, build the indices for it - * and upload them to HDFS. A state machine is used to handle the build state transitions. There - * are three states: - * NOT_BUILD_YET: a segment that needs to be built - * SOMEONE_ELSE_IS_BUILDING: another server is building the segment. - * BUILT_AND_FINALIZED: the indices of this segment have already been built. 
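The three states named in that comment are modeled as SegmentBuilderSegment subclasses (NotYetBuiltSegment, SomeoneElseIsBuildingSegment, BuiltAndFinalizedSegment). The sketch below condenses the transition logic of NotYetBuiltSegment.handle() shown earlier; the enum and method are illustrative stand-ins, not the real types.

```java
// Illustrative condensation only; the production code returns new SegmentBuilderSegment
// subclasses instead of enum values and throws SegmentUpdaterException when retries run out.
enum BuildState { NOT_BUILT_YET, SOMEONE_ELSE_IS_BUILDING, BUILT_AND_FINALIZED }

static BuildState next(BuildState current, boolean gotZkLock, boolean buildSucceeded,
                       int alreadyRetried, int maxRetries) {
  if (current != BuildState.NOT_BUILT_YET) {
    return current;                                // only NOT_BUILT_YET segments get handled
  }
  if (!gotZkLock) {
    return BuildState.SOMEONE_ELSE_IS_BUILDING;    // another builder holds the segment's ZK lock
  }
  if (buildSucceeded) {
    return BuildState.BUILT_AND_FINALIZED;         // indices built and uploaded to HDFS
  }
  if (alreadyRetried + 1 < maxRetries) {
    return BuildState.NOT_BUILT_YET;               // retried later with an incremented retry count
  }
  throw new IllegalStateException("failed too many times");  // SegmentUpdaterException in the real code
}
```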
- */ -public class SegmentBuilder { - private static final Logger LOG = LoggerFactory.getLogger(SegmentBuilder.class); - - private final boolean onlyRunOnce; - private final int waitBetweenLoopsMins; - private final int startUpBatchSize; - private final int instance; - private final int waitBetweenSegmentsSecs; - private final int waitBeforeQuitMins; - - // When multiple segment builders start simultaneously, they might make the HDFS name node and - // zookeeper overwhelmed. So, we let some instances sleep sometimes before they start to avoid - // the issues. - private final long startUpSleepMins; - - // If no more segments to built, wait this interval before checking again. - private final long processWaitingInterval = TimeUnit.MINUTES.toMillis(10); - - // The hash partitions that segments will be built. - private final ImmutableList hashPartitions; - - private final SearchStatsReceiver statsReceiver = new SearchStatsReceiverImpl(); - private final SearchIndexingMetricSet searchIndexingMetricSet = - new SearchIndexingMetricSet(statsReceiver); - private final EarlybirdSearcherStats searcherStats = - new EarlybirdSearcherStats(statsReceiver); - - private final ArchiveOnDiskEarlybirdIndexConfig earlybirdIndexConfig; - - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final RateLimitingSegmentHandler segmentHandler; - private final Clock clock; - private final int numSegmentBuilderPartitions; - private final int myPartitionId; - private final SegmentConfig segmentConfig; - private final EarlybirdSegmentFactory segmentFactory; - private final SegmentBuilderCoordinator segmentBuilderCoordinator; - private final SegmentSyncConfig segmentSyncConfig; - private final Random random = new Random(); - - private static final double SLEEP_RANDOMIZATION_RATIO = .2; - - // Stats - // The flush version used to build segments - private static final SearchLongGauge CURRENT_FLUSH_VERSION = - SearchLongGauge.export("current_flush_version"); - - // Accumulated number and time in seconds spent on building segments locally - private static SearchCounter segmentsBuiltLocally = - SearchCounter.export("segments_built_locally"); - private static SearchCounter timeSpentOnSuccessfulBuildSecs = - SearchCounter.export("time_spent_on_successful_build_secs"); - - // The total number of segments to be built - private static final SearchLongGauge SEGMENTS_TO_BUILD = - SearchLongGauge.export("segments_to_build"); - - // How many segments failed locally - private static final SearchCounter FAILED_SEGMENTS = - SearchCounter.export("failed_segments"); - - @Inject - protected SegmentBuilder(@Flag("onlyRunOnce") boolean onlyRunOnceFlag, - @Flag("waitBetweenLoopsMins") int waitBetweenLoopsMinsFlag, - @Flag("startup_batch_size") int startUpBatchSizeFlag, - @Flag("instance") int instanceFlag, - @Flag("segmentZkLockExpirationHours") - int segmentZkLockExpirationHoursFlag, - @Flag("startupSleepMins") long startupSleepMinsFlag, - @Flag("maxRetriesOnFailure") int maxRetriesOnFailureFlag, - @Flag("hash_partitions") List hashPartitionsFlag, - @Flag("numSegmentBuilderPartitions") int numSegmentBuilderPartitionsFlag, - @Flag("waitBetweenSegmentsSecs") int waitBetweenSegmentsSecsFlag, - @Flag("waitBeforeQuitMins") int waitBeforeQuitMinsFlag, - @Flag("scrubGen") String scrubGen, - Decider decider) { - this(onlyRunOnceFlag, - waitBetweenLoopsMinsFlag, - startUpBatchSizeFlag, - instanceFlag, - segmentZkLockExpirationHoursFlag, - startupSleepMinsFlag, - hashPartitionsFlag, - maxRetriesOnFailureFlag, - 
waitBetweenSegmentsSecsFlag, - waitBeforeQuitMinsFlag, - SearchZkClient.getSZooKeeperClient().createZooKeeperTryLockFactory(), - new RateLimitingSegmentHandler(TimeUnit.MINUTES.toMillis(10), Clock.SYSTEM_CLOCK), - Clock.SYSTEM_CLOCK, - numSegmentBuilderPartitionsFlag, - decider, - getSyncConfig(scrubGen)); - } - - @VisibleForTesting - protected SegmentBuilder(boolean onlyRunOnceFlag, - int waitBetweenLoopsMinsFlag, - int startUpBatchSizeFlag, - int instanceFlag, - int segmentZkLockExpirationHoursFlag, - long startupSleepMinsFlag, - List hashPartitions, - int maxRetriesOnFailure, - int waitBetweenSegmentsSecsFlag, - int waitBeforeQuitMinsFlag, - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - RateLimitingSegmentHandler segmentHandler, - Clock clock, - int numSegmentBuilderPartitions, - Decider decider, - SegmentSyncConfig syncConfig) { - LOG.info("Creating SegmentBuilder"); - LOG.info("Penguin version in use: " + EarlybirdConfig.getPenguinVersion()); - - // Set command line flag values - this.onlyRunOnce = onlyRunOnceFlag; - this.waitBetweenLoopsMins = waitBetweenLoopsMinsFlag; - this.startUpBatchSize = startUpBatchSizeFlag; - this.instance = instanceFlag; - this.waitBetweenSegmentsSecs = waitBetweenSegmentsSecsFlag; - this.waitBeforeQuitMins = waitBeforeQuitMinsFlag; - - this.segmentHandler = segmentHandler; - this.zkTryLockFactory = zooKeeperTryLockFactory; - this.segmentSyncConfig = syncConfig; - this.startUpSleepMins = startupSleepMinsFlag; - - if (!hashPartitions.isEmpty()) { - this.hashPartitions = ImmutableList.copyOf(hashPartitions); - } else { - this.hashPartitions = null; - } - - Amount segmentZKLockExpirationTime = Amount.of((long) - segmentZkLockExpirationHoursFlag, Time.HOURS); - - this.earlybirdIndexConfig = - new ArchiveOnDiskEarlybirdIndexConfig(decider, searchIndexingMetricSet, - new CriticalExceptionHandler()); - - this.segmentConfig = new SegmentConfig( - earlybirdIndexConfig, - segmentZKLockExpirationTime, - maxRetriesOnFailure, - zkTryLockFactory); - this.segmentFactory = new EarlybirdSegmentFactory( - earlybirdIndexConfig, - searchIndexingMetricSet, - searcherStats, - clock); - this.segmentBuilderCoordinator = new SegmentBuilderCoordinator( - zkTryLockFactory, syncConfig, clock); - - this.clock = clock; - - this.numSegmentBuilderPartitions = numSegmentBuilderPartitions; - this.myPartitionId = instance % numSegmentBuilderPartitions; - SearchLongGauge.export("segment_builder_partition_id_" + myPartitionId).set(1); - - CURRENT_FLUSH_VERSION.set(earlybirdIndexConfig.getSchema().getMajorVersionNumber()); - } - - void run() { - LOG.info("Config values: {}", EarlybirdConfig.allValuesAsString()); - - // Sleep some time uninterruptibly before get started so that if multiple instances are running, - // the HDFS name node and zookeeper wont be overwhelmed - // Say, we have 100 instances (instance_arg will have value from 0 - 99, our - // STARTUP_BATCH_SIZE_ARG is 20 and startUpSleepMins is 3 mins. Then the first 20 instances - // will not sleep, but start immediately. then instance 20 - 39 will sleep 3 mins and then - // start to run. instance 40 - 59 will sleep 6 mins then start to run. instances 60 - 79 will - // sleep 9 mins and then start to run and so forth. 
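To make the batching behaviour of the formula on the next line concrete, here is a small standalone sketch that reuses the example values from the comment above (batches of 20 instances, 3 minutes per batch); the class name is made up for illustration.

```java
public final class StartupStaggerSketch {
  public static void main(String[] args) {
    int startUpBatchSize = 20;   // instances allowed to start per batch (example value)
    long startUpSleepMins = 3;   // extra delay added for each later batch (example value)
    for (int instance : new int[] {0, 19, 20, 39, 40, 99}) {
      // Integer division groups instances into batches of startUpBatchSize.
      long sleepMins = instance / startUpBatchSize * startUpSleepMins;
      System.out.println("instance " + instance + " sleeps " + sleepMins + " min before starting");
    }
    // Prints 0, 0, 3, 3, 6 and 12 minutes respectively: each batch of 20 instances starts
    // 3 minutes after the previous one, spreading the load on HDFS and ZooKeeper.
  }
}
```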
- long sleepTime = instance / startUpBatchSize * startUpSleepMins; - LOG.info("Instance={}, Start up batch size={}", instance, startUpBatchSize); - LOG.info("Sleep {} minutes to void HDFS name node and ZooKeeper overwhelmed.", sleepTime); - Uninterruptibles.sleepUninterruptibly(sleepTime, TimeUnit.MINUTES); - - // Kinit here. - Kerberos.kinit( - EarlybirdConfig.getString("kerberos_user", ""), - EarlybirdConfig.getString("kerberos_keytab_path", "") - ); - - long waitBetweenLoopsMs = TimeUnit.MINUTES.toMillis(waitBetweenLoopsMins); - if (onlyRunOnce) { - LOG.info("This segment builder will run the full rebuild of all the segments"); - } else { - LOG.info("This segment builder will incrementally check for new data and rebuilt " - + "current segments as needed."); - LOG.info("The waiting interval between two new data checking is: " - + waitBetweenLoopsMs + " ms."); - } - - boolean scrubGenPresent = segmentSyncConfig.getScrubGen().isPresent(); - LOG.info("Scrub gen present: {}", scrubGenPresent); - boolean scrubGenDataFullyBuilt = segmentBuilderCoordinator.isScrubGenDataFullyBuilt(instance); - LOG.info("Scrub gen data fully built: {}", scrubGenDataFullyBuilt); - - if (!scrubGenPresent || scrubGenDataFullyBuilt) { - LOG.info("Starting segment building loop..."); - while (!Thread.currentThread().isInterrupted()) { - try { - indexingLoop(); - if (onlyRunOnce) { - LOG.info("only run once is true, breaking"); - break; - } - clock.waitFor(waitBetweenLoopsMs); - } catch (InterruptedException e) { - LOG.info("Interrupted, quitting segment builder"); - Thread.currentThread().interrupt(); - } catch (SegmentInfoConstructionException e) { - LOG.error("Error creating new segmentInfo, quitting segment builder: ", e); - break; - } catch (SegmentUpdaterException e) { - FAILED_SEGMENTS.increment(); - // Before the segment builder quits, sleep for WAIT_BEFORE_QUIT_MINS minutes so that the - // FAILED_SEGMENTS stat can be exported. - try { - clock.waitFor(TimeUnit.MINUTES.toMillis(waitBeforeQuitMins)); - } catch (InterruptedException ex) { - LOG.info("Interrupted, quitting segment builder"); - Thread.currentThread().interrupt(); - } - LOG.error("SegmentUpdater processing segment error, quitting segment builder: ", e); - break; - } - } - } else { - LOG.info("Cannot build the segments for scrub gen yet."); - } - } - - // Refactoring the run loop to here for unittest - @VisibleForTesting - void indexingLoop() - throws SegmentInfoConstructionException, InterruptedException, SegmentUpdaterException { - // This map contains all the segments to be processed; if a segment is built, it will be removed - // from the map. - Map buildableSegmentInfoMap; - try { - buildableSegmentInfoMap = createSegmentInfoMap(); - printSegmentInfoMap(buildableSegmentInfoMap); - } catch (IOException e) { - LOG.error("Error creating segmentInfoMap: ", e); - return; - } - - while (!buildableSegmentInfoMap.isEmpty()) { - boolean hasBuiltSegment = processSegments(buildableSegmentInfoMap); - - if (!hasBuiltSegment) { - // If we successfully built a segment, no need to sleep since building a segment takes a - // long time - clock.waitFor(processWaitingInterval); - } - } - } - - // Actual shutdown. - protected void doShutdown() { - LOG.info("doShutdown()..."); - try { - earlybirdIndexConfig.getResourceCloser().shutdownExecutor(); - } catch (InterruptedException e) { - LOG.error("Interrupted during shutdown. 
", e); - } - - LOG.info("Segment builder stopped!"); - } - - private List createTimeSlices() throws IOException { - Preconditions.checkState(segmentSyncConfig.getScrubGen().isPresent()); - Date scrubGen = ScrubGenUtil.parseScrubGenToDate(segmentSyncConfig.getScrubGen().get()); - - final DailyStatusBatches dailyStatusBatches = - new DailyStatusBatches(zkTryLockFactory, scrubGen); - final ArchiveTimeSlicer archiveTimeSlicer = new ArchiveTimeSlicer( - EarlybirdConfig.getMaxSegmentSize(), dailyStatusBatches, earlybirdIndexConfig); - - Stopwatch stopwatch = Stopwatch.createStarted(); - List timeSlices = archiveTimeSlicer.getTimeSlices(); - - if (timeSlices == null) { - LOG.error("Failed to load timeslice map after {}", stopwatch); - return Collections.emptyList(); - } - - LOG.info("Took {} to get timeslices", stopwatch); - return timeSlices; - } - - private static class TimeSliceAndHashPartition implements Comparable { - public final ArchiveTimeSlicer.ArchiveTimeSlice timeSlice; - public final Integer hashPartition; - - public TimeSliceAndHashPartition( - ArchiveTimeSlicer.ArchiveTimeSlice timeSlice, - Integer hashPartition) { - this.timeSlice = timeSlice; - this.hashPartition = hashPartition; - } - - @Override - public int compareTo(TimeSliceAndHashPartition o) { - Integer myHashPartition = this.hashPartition; - Integer otherHashPartition = o.hashPartition; - - long myTimeSliceId = this.timeSlice.getMinStatusID(myHashPartition); - long otherTimeSliceId = o.timeSlice.getMinStatusID(otherHashPartition); - - return ComparisonChain.start() - .compare(myHashPartition, otherHashPartition) - .compare(myTimeSliceId, otherTimeSliceId) - .result(); - } - } - - /** - * For all the timeslices, create the corresponding SegmentInfo and store in a map - */ - @VisibleForTesting - Map createSegmentInfoMap() throws IOException { - final List timeSlices = createTimeSlices(); - - List timeSlicePairs = createPairs(timeSlices); - // Export how many segments should be built - SEGMENTS_TO_BUILD.set(timeSlicePairs.size()); - LOG.info("Total number of segments to be built across all segment builders: {}", - timeSlicePairs.size()); - - List mySegments = getSegmentsForMyPartition(timeSlicePairs); - - Map segmentInfoMap = new HashMap<>(); - for (TimeSliceAndHashPartition mySegment : mySegments) { - ArchiveSegment segment = new ArchiveSegment(mySegment.timeSlice, mySegment.hashPartition, - EarlybirdConfig.getMaxSegmentSize()); - SegmentInfo segmentInfo = new SegmentInfo(segment, segmentFactory, segmentSyncConfig); - - segmentInfoMap.put(segmentInfo.getSegment().getSegmentName(), new NotYetBuiltSegment( - segmentInfo, segmentConfig, segmentFactory, 0, segmentSyncConfig)); - } - - return segmentInfoMap; - } - - private List createPairs( - List timeSlices) { - - List timeSlicePairs = new ArrayList<>(); - - for (ArchiveTimeSlicer.ArchiveTimeSlice slice : timeSlices) { - List localPartitions = hashPartitions; - if (localPartitions == null) { - localPartitions = range(slice.getNumHashPartitions()); - } - - for (Integer partition : localPartitions) { - timeSlicePairs.add(new TimeSliceAndHashPartition(slice, partition)); - } - } - return timeSlicePairs; - } - - private List getSegmentsForMyPartition( - List timeSlicePairs) { - - Collections.sort(timeSlicePairs); - - List myTimeSlices = new ArrayList<>(); - for (int i = myPartitionId; i < timeSlicePairs.size(); i += numSegmentBuilderPartitions) { - myTimeSlices.add(timeSlicePairs.get(i)); - } - - LOG.info("Getting segments to be built for partition: {}", myPartitionId); - 
LOG.info("Total number of partitions: {}", numSegmentBuilderPartitions); - LOG.info("Number of segments picked: {}", myTimeSlices.size()); - return myTimeSlices; - } - - /** - * Print out the segmentInfo Map for debugging - */ - private void printSegmentInfoMap(Map segmentInfoMap) { - LOG.info("SegmentInfoMap: "); - for (Map.Entry entry : segmentInfoMap.entrySet()) { - LOG.info(entry.getValue().toString()); - } - LOG.info("Total SegmentInfoMap size: " + segmentInfoMap.size() + ". done."); - } - - /** - * Build indices or refresh state for the segments in the specified segmentInfoMap, which only - * contains the segments that need to build or are building. When a segment has not been built, - * it is built here. If built successfully, it will be removed from the map; otherwise, its - * state will be updated in the map. - * - * Returns true iff this process has built a segment. - */ - @VisibleForTesting - boolean processSegments(Map segmentInfoMap) - throws SegmentInfoConstructionException, SegmentUpdaterException, InterruptedException { - - boolean hasBuiltSegment = false; - - Iterator> iter = - segmentInfoMap.entrySet().iterator(); - while (iter.hasNext()) { - Map.Entry entry = iter.next(); - SegmentBuilderSegment originalSegment = entry.getValue(); - - LOG.info("About to process segment: {}", originalSegment.getSegmentName()); - long startMillis = System.currentTimeMillis(); - SegmentBuilderSegment updatedSegment = segmentHandler.processSegment(originalSegment); - - if (updatedSegment.isBuilt()) { - iter.remove(); - hasBuiltSegment = true; - - if (originalSegment instanceof NotYetBuiltSegment) { - // Record the total time spent on successfully building a semgent, used to compute the - // average segment building time. - long timeSpent = System.currentTimeMillis() - startMillis; - segmentsBuiltLocally.increment(); - timeSpentOnSuccessfulBuildSecs.add(timeSpent / 1000); - } - } else { - entry.setValue(updatedSegment); - } - - clock.waitFor(getSegmentSleepTime()); - } - - return hasBuiltSegment; - } - - private long getSegmentSleepTime() { - // The Hadoop name node can handle only about 200 requests/sec before it gets overloaded. - // Updating the state of a node that has been built takes about 1 second. In the worst case - // scenario with 800 segment builders, we end up with about 800 requests/sec. Adding a 10 - // second sleep lowers the worst case to about 80 requests/sec. - - long sleepMillis = TimeUnit.SECONDS.toMillis(waitBetweenSegmentsSecs); - - // Use randomization so that we can't get all segment builders hitting it at the exact same time - - int lowerSleepBoundMillis = (int) (sleepMillis * (1.0 - SLEEP_RANDOMIZATION_RATIO)); - int upperSleepBoundMillis = (int) (sleepMillis * (1.0 + SLEEP_RANDOMIZATION_RATIO)); - return randRange(lowerSleepBoundMillis, upperSleepBoundMillis); - } - - /** - * Returns a pseudo-random number between min and max, inclusive. - */ - private int randRange(int min, int max) { - return random.nextInt((max - min) + 1) + min; - } - - /** - * Returns list of integers 0, 1, 2, ..., count-1. 
- */ - private static List range(int count) { - List nums = new ArrayList<>(count); - - for (int i = 0; i < count; i++) { - nums.add(i); - } - - return nums; - } - - private static SegmentSyncConfig getSyncConfig(String scrubGen) { - if (scrubGen == null || scrubGen.isEmpty()) { - throw new RuntimeException( - "Scrub gen expected, but could not get it from the arguments."); - } - - LOG.info("Scrub gen: " + scrubGen); - return new SegmentSyncConfig(Optional.of(scrubGen)); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderApp.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderApp.java deleted file mode 100644 index dc4565ede..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderApp.java +++ /dev/null @@ -1,109 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.util.Collection; - -import com.google.common.collect.ImmutableList; -import com.google.inject.Module; - - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.app.Flaggable; -import com.twitter.inject.server.AbstractTwitterServer; -import com.twitter.util.Future; -import com.twitter.util.Time; - -public class SegmentBuilderApp extends AbstractTwitterServer { - private static final Logger LOG = LoggerFactory.getLogger(SegmentBuilderApp.class); - - public SegmentBuilderApp() { - createFlag("onlyRunOnce", - true, - "whether to stop segment builder after one loop", - Flaggable.ofBoolean()); - - createFlag("waitBetweenLoopsMins", - 60, - "how many minutes to wait between building loops", - Flaggable.ofInt()); - - createFlag("startup_batch_size", - 30, - "How many instances can start and read timeslice info from HDFS at the same time. " - + "If you don't know what this parameter is, please do not change this parameter.", - Flaggable.ofInt()); - - createFlag("instance", - 20, - "the job instance number", - Flaggable.ofInt()); - - createFlag("segmentZkLockExpirationHours", - 0, - "max hours to hold the zookeeper lock while building segment", - Flaggable.ofInt()); - - createFlag("startupSleepMins", - 2L, - "sleep multiplier of startupSleepMins before job runs", - Flaggable.ofLong()); - - createFlag("maxRetriesOnFailure", - 3, - "how many times we should try to rebuild a segment when failure happens", - Flaggable.ofInt()); - - createFlag("hash_partitions", - ImmutableList.of(), - "comma separated hash partition ids, e.g., 0,1,3,4. 
" - + "If not specified, all the partitions will be built.", - Flaggable.ofJavaList(Flaggable.ofInt())); - - createFlag("numSegmentBuilderPartitions", - 100, - "Number of partitions for dividing up all segment builder work", - Flaggable.ofInt()); - - createFlag("waitBetweenSegmentsSecs", - 10, - "Time to sleep between processing segments.", - Flaggable.ofInt()); - - createFlag("waitBeforeQuitMins", - 2, - "How many minutes to sleep before quitting.", - Flaggable.ofInt()); - - createFlag("scrubGen", - "", - "Scrub gen for which segment builders should be run.", - Flaggable.ofString()); - } - - @Override - public void start() { - SegmentBuilder segmentBuilder = injector().instance(SegmentBuilder.class); - closeOnExit((Time time) -> { - segmentBuilder.doShutdown(); - return Future.Unit(); - }); - - LOG.info("Starting run()"); - segmentBuilder.run(); - LOG.info("run() complete"); - - // Now shutdown - shutdown(); - } - - protected void shutdown() { - LOG.info("Calling close() to initiate shutdown"); - close(); - } - - @Override - public Collection javaModules() { - return ImmutableList.of(new SegmentBuilderModule()); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderCoordinator.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderCoordinator.java deleted file mode 100644 index 79925ab5a..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderCoordinator.java +++ /dev/null @@ -1,200 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.io.IOException; -import java.util.Date; -import java.util.Optional; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.archive.DailyStatusBatches; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.util.ScrubGenUtil; -import com.twitter.search.earlybird.partition.HdfsUtil; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.util.Duration; - -/** - * Coordinate between segment builders for scrubbing pipeline. - * When segment builder is running, all of them will try to find a HDFS file indicating if data is - * ready. If the file does not exist, only one of them will go through the files and see if - * scrubbing pipeline has generated all data for this scrub gen. - * - * If the instance that got the lock found all data, it still exists, because otherwise we will - * have one single segmentbuilder instance trying to build all segments, which is not what we want. - * But if it exists, then the next time all segmentbuilder instances are scheduled, they will all - * find the file, and will start building segments. 
- */ -class SegmentBuilderCoordinator { - private static final Logger LOG = LoggerFactory.getLogger(SegmentBuilderCoordinator.class); - - private static final Amount ZK_LOCK_EXPIRATION_MIN = Amount.of(5L, Time.MINUTES); - private static final String SEGMENT_BUILDER_SYNC_NODE = "scrub_gen_data_sync"; - private static final String SEGMENT_BUILDER_SYNC_ZK_PATH = - EarlybirdProperty.ZK_APP_ROOT.get() + "/segment_builder_sync"; - private static final String DATA_FULLY_BUILT_FILE = "_data_fully_built"; - static final int FIRST_INSTANCE = 0; - - private static final long NON_FIRST_INSTANCE_SLEEP_BEFORE_RETRY_DURATION_MS = - Duration.fromHours(1).inMillis(); - - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final SegmentSyncConfig syncConfig; - private final Optional scrubGenDayOpt; - private final Optional scrubGenOpt; - private final Clock clock; - - SegmentBuilderCoordinator( - ZooKeeperTryLockFactory zkTryLockFactory, SegmentSyncConfig syncConfig, Clock clock) { - this.zkTryLockFactory = zkTryLockFactory; - this.syncConfig = syncConfig; - this.scrubGenOpt = syncConfig.getScrubGen(); - this.scrubGenDayOpt = scrubGenOpt.map(ScrubGenUtil::parseScrubGenToDate); - this.clock = clock; - } - - - public boolean isScrubGenDataFullyBuilt(int instanceNumber) { - // Only segment builder that takes scrub gen should use isPartitioningOutputReady to coordinate - Preconditions.checkArgument(scrubGenDayOpt.isPresent()); - - final FileSystem hdfs; - try { - hdfs = HdfsUtil.getHdfsFileSystem(); - } catch (IOException e) { - LOG.error("Could not create HDFS file system.", e); - return false; - } - - return isScrubGenDataFullyBuilt( - instanceNumber, - scrubGenDayOpt.get(), - NON_FIRST_INSTANCE_SLEEP_BEFORE_RETRY_DURATION_MS, - hdfs - ); - } - - @VisibleForTesting - boolean isScrubGenDataFullyBuilt( - int instanceNumber, - Date scrubGenDay, - long nonFirstInstanceSleepBeforeRetryDuration, - FileSystem hdfs) { - // Check if the scrub gen has been fully built file exists. - if (checkHaveScrubGenDataFullyBuiltFileOnHdfs(hdfs)) { - return true; - } - - // If it doesn't exist, let first instance see if scrub gen has been fully built and create the - // file. - if (instanceNumber == FIRST_INSTANCE) { - // We were missing some data on HDFS for this scrub gen in previous run, - // but we might've gotten more data in the meantime, check again. - // Only allow instance 0 to do this mainly for 2 reasons: - // 1) Since instances are scheduled in batches, it's possible that a instance from latter - // batch find the fully built file in hdfs and start processing. We end up doing work with - // only partial instances. - // 2) If we sleep before we release lock, it's hard to estimate how long a instance will - // be scheduled. - // For deterministic reason, we simplify a bit and only allow instance 0 to check and write - // data is fully build file to hdfs. - try { - checkIfScrubGenDataIsFullyBuilt(hdfs, scrubGenDay); - } catch (IOException e) { - LOG.error("Failed to grab lock and check scrub gen data.", e); - } - } else { - // for all other instances, sleep for a bit to give time for first instance to check if scrub - // gen has been fully built and create the file, then check again. 
- try { - LOG.info( - "Sleeping for {} ms before re-checking if scrub gen has been fully built file exists", - nonFirstInstanceSleepBeforeRetryDuration); - clock.waitFor(nonFirstInstanceSleepBeforeRetryDuration); - return checkHaveScrubGenDataFullyBuiltFileOnHdfs(hdfs); - } catch (InterruptedException e) { - LOG.warn("Interrupted when sleeping before re-checking if scrub gen has been fully built " - + "file exists", e); - } - } - - // if hasSuccessFileToHdfs returns false, then should always return false in the end. - // next run will find success file for this scrub gen and move forward. - return false; - } - - private void checkIfScrubGenDataIsFullyBuilt( - FileSystem hdfs, Date scrubGenDay) throws IOException { - // Build the lock, try to acquire it, and check the data on HDFS - TryLock lock = zkTryLockFactory.createTryLock( - DatabaseConfig.getLocalHostname(), - SEGMENT_BUILDER_SYNC_ZK_PATH, - SEGMENT_BUILDER_SYNC_NODE, - ZK_LOCK_EXPIRATION_MIN); - Preconditions.checkState(scrubGenOpt.isPresent()); - String scrubGen = scrubGenOpt.get(); - - lock.tryWithLock(() -> { - LOG.info(String.format( - "Obtained ZK lock to check if data for scrub gen %s is ready.", scrubGen)); - final DailyStatusBatches directory = - new DailyStatusBatches(zkTryLockFactory, scrubGenDay); - if (directory.isScrubGenDataFullyBuilt(hdfs) - && createScrubGenDataFullyBuiltFileOnHdfs(hdfs)) { - LOG.info(String.format("All data for scrub gen %s is ready.", scrubGen)); - } else { - LOG.info(String.format("Data for scrub gen %s is not ready yet.", scrubGen)); - } - }); - } - - private boolean createScrubGenDataFullyBuiltFileOnHdfs(FileSystem fs) { - Path path = getScrubGenDataFullyBuiltFilePath(); - try { - fs.mkdirs(new Path(statusReadyHDFSPath())); - if (fs.createNewFile(path)) { - LOG.info("Successfully created file " + path + " on HDFS."); - return true; - } else { - LOG.warn("Failed to create file " + path + " on HDFS."); - } - } catch (IOException e) { - LOG.error("Failed to create file on HDFS " + path.toString(), e); - } - return false; - } - - private boolean checkHaveScrubGenDataFullyBuiltFileOnHdfs(FileSystem fs) { - Path path = getScrubGenDataFullyBuiltFilePath(); - try { - boolean ret = fs.exists(path); - LOG.info("Checking if file exists showing scrubgen is fully built."); - LOG.info("Path checked: {}, Exist check: {}", path, ret); - return ret; - } catch (IOException e) { - LOG.error("Failed to check file on HDFS " + path.toString(), e); - return false; - } - } - - @VisibleForTesting - Path getScrubGenDataFullyBuiltFilePath() { - return new Path(statusReadyHDFSPath(), DATA_FULLY_BUILT_FILE); - } - - @VisibleForTesting - String statusReadyHDFSPath() { - return syncConfig.getHdfsSegmentSyncRootDir() + "/segment_builder_sync"; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderMain.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderMain.java deleted file mode 100644 index 85db7e855..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderMain.java +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -public final class SegmentBuilderMain { - - private SegmentBuilderMain() { } - - public static void main(String[] args) { - new SegmentBuilderApp().main(args); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderModule.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderModule.java deleted 
file mode 100644 index ea0520a0b..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderModule.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.io.File; - -import com.google.inject.Provides; -import com.google.inject.Singleton; - -import com.twitter.app.Flaggable; -import com.twitter.decider.Decider; -import com.twitter.inject.TwitterModule; -import com.twitter.inject.annotations.Flag; -import com.twitter.search.common.config.LoggerConfiguration; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.util.EarlybirdDecider; - -public class SegmentBuilderModule extends TwitterModule { - - private static final String CONFIG_FILE_FLAG_NAME = "config_file"; - private static final String SEGMENT_LOG_DIR_FLAG_NAME = "segment_log_dir"; - - public SegmentBuilderModule() { - createFlag(CONFIG_FILE_FLAG_NAME, - new File("earlybird-search.yml"), - "specify config file", - Flaggable.ofFile()); - - createFlag(SEGMENT_LOG_DIR_FLAG_NAME, - "", - "override log dir from config file", - Flaggable.ofString()); - } - - /** - * Initializes the Earlybird config and the log configuration, and returns an EarlybirdDecider - * object, which will be injected into the SegmentBuilder instance. - * - * @param configFile The config file to use to initialize EarlybirdConfig - * @param segmentLogDir If not empty, used to override the log directory from the config file - * @return An initialized EarlybirdDecider - */ - @Provides - @Singleton - public Decider provideDecider(@Flag(CONFIG_FILE_FLAG_NAME) File configFile, - @Flag(SEGMENT_LOG_DIR_FLAG_NAME) String segmentLogDir) { - // By default Guice will build singletons eagerly: - // https://github.com/google/guice/wiki/Scopes#eager-singletons - // So in order to ensure that the EarlybirdConfig and LoggerConfiguration initializations occur - // before the EarlybirdDecider initialization, we place them here. 
- EarlybirdConfig.init(configFile.getName()); - if (!segmentLogDir.isEmpty()) { - EarlybirdConfig.overrideLogDir(segmentLogDir); - } - new LoggerConfiguration(EarlybirdConfig.getLogPropertiesFile(), EarlybirdConfig.getLogDir()) - .configure(); - - return EarlybirdDecider.initialize(); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderSegment.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderSegment.java deleted file mode 100644 index 428113bf9..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentBuilderSegment.java +++ /dev/null @@ -1,100 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.archive.ArchiveSegment; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - -public abstract class SegmentBuilderSegment { - protected final SegmentInfo segmentInfo; - protected final SegmentConfig segmentConfig; - protected final EarlybirdSegmentFactory earlybirdSegmentFactory; - protected final int alreadyRetriedCount; - protected final SegmentSyncConfig sync; - - public SegmentBuilderSegment(SegmentInfo segmentInfo, - SegmentConfig segmentConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - int alreadyRetriedCount, - SegmentSyncConfig segmentSyncConfig) { - this.segmentConfig = segmentConfig; - this.earlybirdSegmentFactory = earlybirdSegmentFactory; - this.alreadyRetriedCount = alreadyRetriedCount; - this.sync = segmentSyncConfig; - Preconditions.checkState(segmentInfo.getSegment() instanceof ArchiveSegment); - this.segmentInfo = Preconditions.checkNotNull(segmentInfo); - } - - public SegmentInfo getSegmentInfo() { - return segmentInfo; - } - - public String getSegmentName() { - return segmentInfo.getSegmentName(); - } - - public int getAlreadyRetriedCount() { - return alreadyRetriedCount; - } - - /** - * Handle the segment, potentially transitioning to a new state. - * @return The state after handling. - */ - public abstract SegmentBuilderSegment handle() - throws SegmentInfoConstructionException, SegmentUpdaterException; - - public boolean isBuilt() { - return false; - } - - @Override - public String toString() { - return "SegmentBuilderSegment{" - + "segmentInfo=" + segmentInfo - + ", state=" + this.getClass().getSimpleName() - + ", alreadyRetriedCount=" + alreadyRetriedCount + '}'; - } - - /** - * Given a SegmentInfo, create a new one with the same time slice and partitionID but clean - * internal state. 
- */ - protected SegmentInfo createNewSegmentInfo(SegmentInfo oldSegmentInfo) - throws SegmentInfoConstructionException { - Preconditions.checkArgument(oldSegmentInfo.getSegment() instanceof ArchiveSegment); - ArchiveSegment archiveSegment = (ArchiveSegment) oldSegmentInfo.getSegment(); - - try { - ArchiveSegment segment = new ArchiveSegment(archiveSegment.getArchiveTimeSlice(), - archiveSegment.getHashPartitionID(), EarlybirdConfig.getMaxSegmentSize()); - - return new SegmentInfo(segment, earlybirdSegmentFactory, sync); - } catch (IOException e) { - throw new SegmentInfoConstructionException("Error creating new segments", e); - } - } - - protected TryLock getZooKeeperTryLock() { - ZooKeeperTryLockFactory tryLockFactory = segmentConfig.getTryLockFactory(); - String zkRootPath = sync.getZooKeeperSyncFullPath(); - String nodeName = segmentInfo.getZkNodeName(); - Amount expirationTime = segmentConfig.getSegmentZKLockExpirationTime(); - - return tryLockFactory.createTryLock( - DatabaseConfig.getLocalHostname(), - zkRootPath, - nodeName, - expirationTime); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentConfig.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentConfig.java deleted file mode 100644 index e53f060c4..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentConfig.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.archive.ArchiveOnDiskEarlybirdIndexConfig; - -public class SegmentConfig { - private final ArchiveOnDiskEarlybirdIndexConfig earlybirdIndexConfig; - private final Amount segmentZKLockExpirationTime; - private final int maxRetriesOnFailure; - private final ZooKeeperTryLockFactory tryLockFactory; - - public SegmentConfig( - ArchiveOnDiskEarlybirdIndexConfig earlybirdIndexConfig, - Amount segmentZKLockExpirationTime, - int maxRetriesOnFailure, - ZooKeeperTryLockFactory tryLockFactory) { - - this.earlybirdIndexConfig = earlybirdIndexConfig; - this.segmentZKLockExpirationTime = segmentZKLockExpirationTime; - this.maxRetriesOnFailure = maxRetriesOnFailure; - this.tryLockFactory = tryLockFactory; - } - - public ArchiveOnDiskEarlybirdIndexConfig getEarlybirdIndexConfig() { - return earlybirdIndexConfig; - } - - public Amount getSegmentZKLockExpirationTime() { - return segmentZKLockExpirationTime; - } - - public int getMaxRetriesOnFailure() { - return maxRetriesOnFailure; - } - - public ZooKeeperTryLockFactory getTryLockFactory() { - return tryLockFactory; - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentInfoConstructionException.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentInfoConstructionException.java deleted file mode 100644 index d7b69b96c..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentInfoConstructionException.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.io.IOException; - -/** - * Used if exceptions are thrown during creating new SegmentInfo during the indexing loop - */ -class SegmentInfoConstructionException extends Exception { - SegmentInfoConstructionException(String msg, IOException e) { - super(msg, e); - } -} diff --git 
a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentUpdaterException.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentUpdaterException.java deleted file mode 100644 index 5ccbbdc25..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SegmentUpdaterException.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import com.google.common.annotations.VisibleForTesting; - -/** - * Used when when SegmentUpdater fails processing segments. - */ -@VisibleForTesting -class SegmentUpdaterException extends Exception { - SegmentUpdaterException(String msg) { - super(msg); - } -} diff --git a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SomeoneElseIsBuildingSegment.java b/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SomeoneElseIsBuildingSegment.java deleted file mode 100644 index c4f30c70d..000000000 --- a/src/java/com/twitter/search/earlybird/archive/segmentbuilder/SomeoneElseIsBuildingSegment.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.earlybird.archive.segmentbuilder; - -import java.util.concurrent.atomic.AtomicBoolean; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.common.base.Command; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.earlybird.archive.ArchiveHDFSUtils; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - -public class SomeoneElseIsBuildingSegment extends SegmentBuilderSegment { - public SomeoneElseIsBuildingSegment( - SegmentInfo segmentInfo, - SegmentConfig segmentConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - int alreadyRetriedCount, - SegmentSyncConfig sync) { - - super(segmentInfo, segmentConfig, earlybirdSegmentFactory, alreadyRetriedCount, sync); - } - - /** - * This method refreshes local state of a segment. - * 1. Try to grab the ZK lock - * 2a. if got the lock, the segment is not being built; mark segment as NOT_BUILT_YET. - * 2b. otherwise, the segment is being built; keep the SOMEONE_ELSE_IS_BUILDING state - */ - @Override - public SegmentBuilderSegment handle() - throws SegmentInfoConstructionException, SegmentUpdaterException { - - TryLock lock = getZooKeeperTryLock(); - - final AtomicBoolean alreadyBuilt = new AtomicBoolean(false); - boolean gotLock = lock.tryWithLock((Command) () -> { - // The segment might have already finished built by others - if (segmentExistsOnHdfs()) { - alreadyBuilt.set(true); - } - }); - - if (!gotLock) { - return this; - } - - if (alreadyBuilt.get()) { - return new BuiltAndFinalizedSegment( - segmentInfo, segmentConfig, earlybirdSegmentFactory, 0, sync); - } else { - // When a segment failed building, its state might not be clean. 
So, it is necessary to - // create a new SegmentInfo with a clean state - SegmentInfo newSegmentInfo = createNewSegmentInfo(segmentInfo); - return new NotYetBuiltSegment( - newSegmentInfo, - segmentConfig, - earlybirdSegmentFactory, - alreadyRetriedCount + 1, - sync); - } - } - - @VisibleForTesting - boolean segmentExistsOnHdfs() { - return ArchiveHDFSUtils.hasSegmentIndicesOnHDFS(sync, segmentInfo); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/BUILD b/src/java/com/twitter/search/earlybird/common/BUILD deleted file mode 100644 index 797ad1f25..000000000 --- a/src/java/com/twitter/search/earlybird/common/BUILD +++ /dev/null @@ -1,37 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/twitter/elephantbird:core", - "3rdparty/jvm/commons-codec", - "3rdparty/jvm/commons-httpclient", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "decider/src/main/scala", - "finagle/finagle-core/src/main", - "finagle/finagle-thrift/src/main/java", - "finagle/finagle-thrift/src/main/scala", - "scrooge/scrooge-core/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/optional", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util:finagleutil", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/common/util/thrift:thrift-utils", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/thrift/com/twitter/context:twitter-context-scala", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/common:caching-java", - "src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:query-java", - "strato/src/main/scala/com/twitter/strato/opcontext", - "twitter-context/src/main/scala", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/earlybird/common/Base64RequestResponseForLogging.java b/src/java/com/twitter/search/earlybird/common/Base64RequestResponseForLogging.java deleted file mode 100644 index a2f2206ad..000000000 --- a/src/java/com/twitter/search/earlybird/common/Base64RequestResponseForLogging.java +++ /dev/null @@ -1,120 +0,0 @@ -package com.twitter.search.earlybird.common; - -import org.apache.commons.codec.binary.Base64; -import org.apache.thrift.TException; -import org.apache.thrift.TSerializer; -import org.apache.thrift.protocol.TBinaryProtocol; -import org.slf4j.Logger; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public final class Base64RequestResponseForLogging { - private static final Logger GENERAL_LOG = org.slf4j.LoggerFactory.getLogger( - Base64RequestResponseForLogging.class); - private static final Logger FAILED_REQUEST_LOG = org.slf4j.LoggerFactory.getLogger( - Base64RequestResponseForLogging.class.getName() + ".FailedRequests"); - private static final Logger RANDOM_REQUEST_LOG = org.slf4j.LoggerFactory.getLogger( - Base64RequestResponseForLogging.class.getName() + ".RandomRequests"); - private static final Logger SLOW_REQUEST_LOG = org.slf4j.LoggerFactory.getLogger( - Base64RequestResponseForLogging.class.getName() + ".SlowRequests"); - 
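The logging class below encodes failed, slow, and randomly sampled requests and responses as Base64 Thrift blobs so they can be replayed later. As a hypothetical counterpart that is not part of the original code, decoding such a blob back into an EarlybirdRequest would look roughly like this, assuming the standard libthrift and commons-codec APIs:

```java
import org.apache.commons.codec.binary.Base64;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

import com.twitter.search.earlybird.thrift.EarlybirdRequest;

// Hypothetical helper for replaying logged requests: decodes the Base64 payload written by
// the logging class below. Not part of the original code; shown only for illustration.
public final class LoggedRequestDecoderSketch {
  public static EarlybirdRequest decode(String base64Payload) throws TException {
    byte[] bytes = Base64.decodeBase64(base64Payload);
    EarlybirdRequest request = new EarlybirdRequest();
    new TDeserializer(new TBinaryProtocol.Factory()).deserialize(request, bytes);
    // clientRequestTimeMs is cleared before encoding and logged separately, so a replayed
    // request needs it set again if the original timestamp matters.
    return request;
  }
}
```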
- private enum LogType { - FAILED, - RANDOM, - SLOW, - }; - - private final LogType logtype; - private final String logLine; - private final EarlybirdRequest request; - private final EarlybirdResponse response; - private final Base64 base64 = new Base64(); - - // TSerializer is not threadsafe, so create a new one for each request - private final TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory()); - - private Base64RequestResponseForLogging( - LogType logType, String logLine, EarlybirdRequest request, EarlybirdResponse response) { - this.logtype = logType; - this.logLine = logLine; - this.request = request; - this.response = response; - } - - public static Base64RequestResponseForLogging randomRequest( - String logLine, EarlybirdRequest request, EarlybirdResponse response) { - return new Base64RequestResponseForLogging(LogType.RANDOM, logLine, request, response); - } - - public static Base64RequestResponseForLogging failedRequest( - String logLine, EarlybirdRequest request, EarlybirdResponse response) { - return new Base64RequestResponseForLogging(LogType.FAILED, logLine, request, response); - } - - public static Base64RequestResponseForLogging slowRequest( - String logLine, EarlybirdRequest request, EarlybirdResponse response) { - return new Base64RequestResponseForLogging(LogType.SLOW, logLine, request, response); - } - - private String asBase64(EarlybirdRequest clearedRequest) { - try { - // The purpose of this log is to make it easy to re-issue requests in formz to reproduce - // issues. If queries are re-issued as is they will be treated as late-arriving queries and - // dropped due to the clientRequestTimeMs being set to the original query time. For ease of - // use purposes we clear clientRequestTimeMs and log it out separately for the rare case it - // is needed. - clearedRequest.unsetClientRequestTimeMs(); - return base64.encodeToString(serializer.serialize(clearedRequest)); - } catch (TException e) { - GENERAL_LOG.error("Failed to serialize request for logging.", e); - return "failed_to_serialize"; - } - } - - private String asBase64(EarlybirdResponse earlybirdResponse) { - try { - return base64.encodeToString(serializer.serialize(earlybirdResponse)); - } catch (TException e) { - GENERAL_LOG.error("Failed to serialize response for logging.", e); - return "failed_to_serialize"; - } - } - - private String getFormattedMessage() { - String base64Request = asBase64( - EarlybirdRequestUtil.copyAndClearUnnecessaryValuesForLogging(request)); - String base64Response = asBase64(response); - return logLine + ", clientRequestTimeMs: " + request.getClientRequestTimeMs() - + ", " + base64Request + ", " + base64Response; - } - - /** - * Logs the Base64-encoded request and response to the success or failure log. - */ - public void log() { - // Do the serializing/concatting this way so it happens on the background thread for - // async logging - Object logObject = new Object() { - @Override - public String toString() { - return getFormattedMessage(); - } - }; - - switch (logtype) { - case FAILED: - FAILED_REQUEST_LOG.info("{}", logObject); - break; - case RANDOM: - RANDOM_REQUEST_LOG.info("{}", logObject); - break; - case SLOW: - SLOW_REQUEST_LOG.info("{}", logObject); - break; - default: - // Not logging anything for other log types. 
- break; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/common/CaughtUpMonitor.java b/src/java/com/twitter/search/earlybird/common/CaughtUpMonitor.java deleted file mode 100644 index cd6d49c06..000000000 --- a/src/java/com/twitter/search/earlybird/common/CaughtUpMonitor.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.earlybird.common; - -import java.util.concurrent.atomic.AtomicBoolean; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCustomGauge; - -/** - * A monitor which enforces the condition that a single thread's work is caught up, and allows - * other threads to wait to be notified when the work is complete. An AtomicBoolean ensures the - * current status is visible to all threads. - */ -public class CaughtUpMonitor { - private static final Logger LOG = LoggerFactory.getLogger(CaughtUpMonitor.class); - - protected final AtomicBoolean isCaughtUp = new AtomicBoolean(false); - - public CaughtUpMonitor(String statPrefix) { - SearchCustomGauge.export(statPrefix + "_is_caught_up", () -> isCaughtUp() ? 1 : 0); - } - - public boolean isCaughtUp() { - return isCaughtUp.get(); - } - - /** - * Set caught up state, and notify waiting threads if caught up. - */ - public synchronized void setAndNotify(boolean caughtUp) { - isCaughtUp.set(caughtUp); - if (caughtUp) { - // Readers are caught up, notify waiting threads - notifyAll(); - } - } - - /** - * Wait using Object.wait() until caught up or until thread is interrupted. - */ - public synchronized void resetAndWaitUntilCaughtUp() { - LOG.info("Waiting to catch up."); - // Explicitly set isCaughtUp to false before waiting - isCaughtUp.set(false); - try { - while (!isCaughtUp()) { - wait(); - } - } catch (InterruptedException e) { - LOG.error("{} was interrupted while waiting to catch up", Thread.currentThread()); - } - LOG.info("Caught up."); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/ClientIdUtil.java b/src/java/com/twitter/search/earlybird/common/ClientIdUtil.java deleted file mode 100644 index 46b916adf..000000000 --- a/src/java/com/twitter/search/earlybird/common/ClientIdUtil.java +++ /dev/null @@ -1,85 +0,0 @@ -package com.twitter.search.earlybird.common; - -import java.util.Optional; - -import com.twitter.common.optional.Optionals; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.strato.opcontext.Attribution; -import com.twitter.strato.opcontext.HttpEndpoint; - -public final class ClientIdUtil { - // Blenders should always set the EarlybirdRequest.clientId field. It should be set to the Finagle - // client ID of the client that caused the blender to send this request to the roots. If the - // Finagle ID of the blender's client cannot be determined, it will be set to "unknown" (see - // com.twitter.search.common.util.FinagleUtil.UNKNOWN_CLIENT_NAME). However, other services that - // send requests to roots might not set EarlybirdRequest.clientId. - // - // So an "unset" clientId means: EarlybirdRequest.clientId was null. - // An "unknown" clientId means: the client that sent us the request - // tried setting EarlybirdRequest.clientId, but couldn't figure out a good value for it. 
- public static final String UNSET_CLIENT_ID = "unset"; - - private static final String CLIENT_ID_FOR_UNKNOWN_CLIENTS = "unknown_client_id"; - - private static final String CLIENT_ID_PREFIX = "client_id_"; - - private static final String FINAGLE_CLIENT_ID_AND_CLIENT_ID_PATTERN = - "finagle_id_%s_and_client_id_%s"; - - private static final String CLIENT_ID_AND_REQUEST_TYPE = "client_id_%s_and_type_%s"; - - private ClientIdUtil() { - } - - /** Returns the ID of the client that initiated this request or UNSET_CLIENT_ID if not set. */ - public static String getClientIdFromRequest(EarlybirdRequest request) { - return Optional - .ofNullable(request.getClientId()) - .map(String::toLowerCase) - .orElse(UNSET_CLIENT_ID); - } - - /** - * Returns the Strato http endpoint attribution as an Optional. - */ - public static Optional getClientIdFromHttpEndpointAttribution() { - return Optionals - .optional(Attribution.httpEndpoint()) - .map(HttpEndpoint::name) - .map(String::toLowerCase); - } - - /** Formats the given clientId into a string that can be used for stats. */ - public static String formatClientId(String clientId) { - return CLIENT_ID_PREFIX + clientId; - } - - /** - * Formats the given Finagle clientId and the given clientId into a single string that can be used - * for stats, or other purposes where the two IDs need to be combined. - */ - public static String formatFinagleClientIdAndClientId(String finagleClientId, String clientId) { - return String.format(FINAGLE_CLIENT_ID_AND_CLIENT_ID_PATTERN, finagleClientId, clientId); - } - - /** - * Formats the given clientId and requestType into a single string that can be used - * for stats or other purposes. - */ - public static String formatClientIdAndRequestType( - String clientId, String requestType) { - return String.format(CLIENT_ID_AND_REQUEST_TYPE, clientId, requestType); - } - - /** - * Format the quota client id - */ - public static String getQuotaClientId(String clientId) { - if (FinagleUtil.UNKNOWN_CLIENT_NAME.equals(clientId) || UNSET_CLIENT_ID.equals(clientId)) { - return CLIENT_ID_FOR_UNKNOWN_CLIENTS; - } - - return clientId; - } -} diff --git a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestLogger.java b/src/java/com/twitter/search/earlybird/common/EarlybirdRequestLogger.java deleted file mode 100644 index 303507b2b..000000000 --- a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestLogger.java +++ /dev/null @@ -1,365 +0,0 @@ -package com.twitter.search.earlybird.common; - -import java.util.EnumMap; -import java.util.Map; - -import scala.Option; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Maps; - -import com.twitter.context.TwitterContext; -import com.twitter.context.thriftscala.Viewer; -import com.twitter.decider.Decider; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ClientId$; -import com.twitter.search.TwitterContextPermit; -import com.twitter.search.common.constants.thriftjava.ThriftQuerySource; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.logging.RPCLogger; -import com.twitter.search.common.metrics.FailureRatioCounter; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.util.earlybird.TermStatisticsUtil; -import com.twitter.search.common.util.earlybird.ThriftSearchResultUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import 
com.twitter.search.earlybird.thrift.ThriftFacetFieldRequest; -import com.twitter.search.earlybird.thrift.ThriftHistogramSettings; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsRequest; - -import static com.twitter.search.common.util.earlybird.EarlybirdResponseUtil - .responseConsideredFailed; - - -public class EarlybirdRequestLogger extends RPCLogger { - protected enum ExtraFields { - QUERY_MAX_HITS_TO_PROCESS, - COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS, - RELEVANCE_OPTIONS_MAX_HITS_TO_PROCESS, - NUM_HITS_PROCESSED, - QUERY_COST, - CPU_TOTAL, - QUERY_SOURCE, - CLIENT_ID, - FINAGLE_CLIENT_ID - } - - protected enum ShardOnlyExtraFields { - NUM_SEARCHED_SEGMENTS, - SCORING_TIME_NANOS - } - - protected enum RootOnlyExtraFields { - CACHING_ALLOWED, - DEBUG_MODE, - CACHE_HIT, - USER_AGENT, - // See JIRA APPSEC-2303 for IP addresses logging - } - - private static final String LOG_FULL_REQUEST_DETAILS_ON_ERROR_DECIDER_KEY = - "log_full_request_details_on_error"; - private static final String LOG_FULL_REQUEST_DETAILS_RANDOM_FRACTION_DECIDER_KEY = - "log_full_request_details_random_fraction"; - private static final String LOG_FULL_SLOW_REQUEST_DETAILS_RANDOM_FRACTION_DECIDER_KEY = - "log_full_slow_request_details_random_fraction"; - private static final String SLOW_REQUEST_LATENCY_THRESHOLD_MS_DECIDER_KEY = - "slow_request_latency_threshold_ms"; - - private final Decider decider; - private final boolean enableLogUnknownClientRequests; - - private static final Map - FAILURE_RATIO_COUNTER_BY_QUERY_SOURCE = preBuildFailureRatioCounters(); - private static final FailureRatioCounter NO_QUERY_SOURCE_FAILURE_RATIO_COUNTER = - new FailureRatioCounter("earlybird_logger", "query_source", "not_set"); - - static EarlybirdRequestLogger buildForRoot( - String loggerName, int latencyWarnThreshold, Decider decider) { - - return new EarlybirdRequestLogger(loggerName, latencyWarnThreshold, - decider, true, RPCLogger.Fields.values(), ExtraFields.values(), - RootOnlyExtraFields.values()); - } - - static EarlybirdRequestLogger buildForShard( - String loggerName, int latencyWarnThreshold, Decider decider) { - - return new EarlybirdRequestLogger(loggerName, latencyWarnThreshold, - decider, false, RPCLogger.Fields.values(), ExtraFields.values(), - ShardOnlyExtraFields.values()); - } - - @VisibleForTesting - EarlybirdRequestLogger(String loggerName, int latencyWarnThreshold, Decider decider) { - this(loggerName, latencyWarnThreshold, decider, false, RPCLogger.Fields.values(), - ExtraFields.values(), RootOnlyExtraFields.values(), ShardOnlyExtraFields.values()); - } - - private EarlybirdRequestLogger(String loggerName, int latencyWarnThreshold, Decider decider, - boolean enableLogUnknownClientRequests, Enum[]... fieldEnums) { - super(loggerName, fieldEnums); - this.decider = decider; - this.enableLogUnknownClientRequests = enableLogUnknownClientRequests; - setLatencyWarnThreshold(latencyWarnThreshold); - } - - /** - * Logs the given earlybird request and response. - * - * @param request The earlybird request. - * @param response The earlybird response. - * @param timer The time it took to process this request. 
- */ - public void logRequest(EarlybirdRequest request, EarlybirdResponse response, Timer timer) { - try { - LogEntry entry = newLogEntry(); - - setRequestLogEntries(entry, request); - setResponseLogEntries(entry, response); - if (timer != null) { - entry.setField(ExtraFields.CPU_TOTAL, Long.toString(timer.getElapsedCpuTotal())); - } - - boolean wasError = response != null && responseConsideredFailed(response.getResponseCode()); - - long responseTime = response != null ? response.getResponseTime() : 0L; - - String logLine = writeLogLine(entry, responseTime, wasError); - - // This code path is called for pre/post logging - // Prevent same request showing up twice by only logging on post logging - if (response != null && DeciderUtil.isAvailableForRandomRecipient( - decider, LOG_FULL_REQUEST_DETAILS_RANDOM_FRACTION_DECIDER_KEY)) { - Base64RequestResponseForLogging.randomRequest(logLine, request, response).log(); - } - - // Unknown client request logging only applies to pre-logging. - if (enableLogUnknownClientRequests && response == null) { - UnknownClientRequestForLogging unknownClientRequestLogger = - UnknownClientRequestForLogging.unknownClientRequest(logLine, request); - if (unknownClientRequestLogger != null) { - unknownClientRequestLogger.log(); - } - } - - if (wasError - && DeciderUtil.isAvailableForRandomRecipient( - decider, LOG_FULL_REQUEST_DETAILS_ON_ERROR_DECIDER_KEY)) { - new RequestResponseForLogging(request, response).logFailedRequest(); - Base64RequestResponseForLogging.failedRequest(logLine, request, response).log(); - } - - boolean wasSlow = response != null - && responseTime >= DeciderUtil.getAvailability( - decider, SLOW_REQUEST_LATENCY_THRESHOLD_MS_DECIDER_KEY); - if (wasSlow - && DeciderUtil.isAvailableForRandomRecipient( - decider, LOG_FULL_SLOW_REQUEST_DETAILS_RANDOM_FRACTION_DECIDER_KEY)) { - Base64RequestResponseForLogging.slowRequest(logLine, request, response).log(); - } - - FailureRatioCounter failureRatioCounter = - FAILURE_RATIO_COUNTER_BY_QUERY_SOURCE.get(request.getQuerySource()); - if (failureRatioCounter != null) { - failureRatioCounter.requestFinished(!wasError); - } else { - NO_QUERY_SOURCE_FAILURE_RATIO_COUNTER.requestFinished(!wasError); - } - - } catch (Exception e) { - LOG.error("Exception building log entry ", e); - } - } - - private void setRequestLogEntries(LogEntry entry, EarlybirdRequest request) { - entry.setField(Fields.CLIENT_HOST, request.getClientHost()); - entry.setField(Fields.CLIENT_REQUEST_ID, request.getClientRequestID()); - entry.setField(Fields.REQUEST_TYPE, requestTypeForLog(request)); - - if (request.isSetSearchQuery()) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - entry.setField(Fields.QUERY, searchQuery.getSerializedQuery()); - - if (searchQuery.isSetMaxHitsToProcess()) { - entry.setField(ExtraFields.QUERY_MAX_HITS_TO_PROCESS, - Integer.toString(searchQuery.getMaxHitsToProcess())); - } - - if (searchQuery.isSetCollectorParams() - && searchQuery.getCollectorParams().isSetTerminationParams() - && searchQuery.getCollectorParams().getTerminationParams().isSetMaxHitsToProcess()) { - entry.setField(ExtraFields.COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS, - Integer.toString(searchQuery.getCollectorParams().getTerminationParams() - .getMaxHitsToProcess())); - } - - if (searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isSetMaxHitsToProcess()) { - entry.setField(ExtraFields.RELEVANCE_OPTIONS_MAX_HITS_TO_PROCESS, - Integer.toString(searchQuery.getRelevanceOptions().getMaxHitsToProcess())); - } - } - - 
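`logRequest` gates the expensive full-payload dumps behind decider keys: errors are dumped when the on-error decider is on, and normal or slow requests are dumped only for a random fraction, with the slow threshold itself read from a decider. Below is a rough sketch of that gating; `ThreadLocalRandom` stands in for `DeciderUtil.isAvailableForRandomRecipient`, and the fractions and threshold are illustrative values, not the production decider settings.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the decider-gated logging decisions in logRequest(). In Earlybird the
// knobs come from keys such as log_full_request_details_random_fraction and
// slow_request_latency_threshold_ms; here they are plain constructor arguments.
final class FullRequestLoggingDecision {
  private final double randomFraction;      // fraction of all requests to dump
  private final double slowRandomFraction;  // fraction of slow requests to dump
  private final long slowThresholdMs;       // latency above which a request is "slow"

  FullRequestLoggingDecision(double randomFraction, double slowRandomFraction,
                             long slowThresholdMs) {
    this.randomFraction = randomFraction;
    this.slowRandomFraction = slowRandomFraction;
    this.slowThresholdMs = slowThresholdMs;
  }

  boolean logRandomSample() {
    return ThreadLocalRandom.current().nextDouble() < randomFraction;
  }

  boolean logAsSlow(long responseTimeMs) {
    return responseTimeMs >= slowThresholdMs
        && ThreadLocalRandom.current().nextDouble() < slowRandomFraction;
  }

  boolean logAsFailed(boolean wasError) {
    // Failed requests are dumped whenever the on-error decider is enabled.
    return wasError;
  }

  public static void main(String[] args) {
    FullRequestLoggingDecision decision =
        new FullRequestLoggingDecision(0.0001, 0.01, 1_000);
    System.out.println(decision.logAsSlow(2_500));  // true for roughly 1% of slow requests
    System.out.println(decision.logAsFailed(true)); // true
  }
}
```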
entry.setField(Fields.NUM_REQUESTED, Integer.toString(numRequestedForLog(request))); - - if (request.isSetQuerySource()) { - entry.setField(ExtraFields.QUERY_SOURCE, request.getQuerySource().name()); - } - - if (request.isSetClientId()) { - entry.setField(ExtraFields.CLIENT_ID, request.getClientId()); - } - - entry.setField(RootOnlyExtraFields.CACHING_ALLOWED, - Boolean.toString(EarlybirdRequestUtil.isCachingAllowed(request))); - - entry.setField(RootOnlyExtraFields.DEBUG_MODE, Byte.toString(request.getDebugMode())); - - Option clientIdOption = ClientId$.MODULE$.current(); - if (clientIdOption.isDefined()) { - entry.setField(ExtraFields.FINAGLE_CLIENT_ID, clientIdOption.get().name()); - } - - setLogEntriesFromTwitterContext(entry); - } - - @VisibleForTesting - Option getTwitterContext() { - return TwitterContext.acquire(TwitterContextPermit.get()).apply(); - } - - private void setLogEntriesFromTwitterContext(LogEntry entry) { - Option viewerOption = getTwitterContext(); - if (viewerOption.nonEmpty()) { - Viewer viewer = viewerOption.get(); - - if (viewer.userAgent().nonEmpty()) { - String userAgent = viewer.userAgent().get(); - - // we only replace the comma in the user-agent with %2C to make it easily parseable, - // specially with command line tools like cut/sed/awk - userAgent = userAgent.replace(",", "%2C"); - - entry.setField(RootOnlyExtraFields.USER_AGENT, userAgent); - } - } - } - - private void setResponseLogEntries(LogEntry entry, EarlybirdResponse response) { - if (response != null) { - entry.setField(Fields.NUM_RETURNED, Integer.toString(numResultsForLog(response))); - entry.setField(Fields.RESPONSE_CODE, String.valueOf(response.getResponseCode())); - entry.setField(Fields.RESPONSE_TIME_MICROS, Long.toString(response.getResponseTimeMicros())); - if (response.isSetSearchResults()) { - entry.setField(ExtraFields.NUM_HITS_PROCESSED, - Integer.toString(response.getSearchResults().getNumHitsProcessed())); - entry.setField(ExtraFields.QUERY_COST, - Double.toString(response.getSearchResults().getQueryCost())); - if (response.getSearchResults().isSetScoringTimeNanos()) { - entry.setField(ShardOnlyExtraFields.SCORING_TIME_NANOS, - Long.toString(response.getSearchResults().getScoringTimeNanos())); - } - } - if (response.isSetCacheHit()) { - entry.setField(RootOnlyExtraFields.CACHE_HIT, String.valueOf(response.isCacheHit())); - } - if (response.isSetNumSearchedSegments()) { - entry.setField(ShardOnlyExtraFields.NUM_SEARCHED_SEGMENTS, - Integer.toString(response.getNumSearchedSegments())); - } - } - } - - private static int numRequestedForLog(EarlybirdRequest request) { - int num = 0; - if (request.isSetFacetRequest() && request.getFacetRequest().isSetFacetFields()) { - for (ThriftFacetFieldRequest field : request.getFacetRequest().getFacetFields()) { - num += field.getNumResults(); - } - } else if (request.isSetTermStatisticsRequest()) { - num = request.getTermStatisticsRequest().getTermRequestsSize(); - } else if (request.isSetSearchQuery()) { - num = request.getSearchQuery().isSetCollectorParams() - ? request.getSearchQuery().getCollectorParams().getNumResultsToReturn() : 0; - if (request.getSearchQuery().getSearchStatusIdsSize() > 0) { - num = Math.max(num, request.getSearchQuery().getSearchStatusIdsSize()); - } - } - return num; - } - - /** - * Returns the number of results in the given response. If the response is a term stats response, - * then the returned value will be the number of term results. 
If the response is a facet - * response, then the returned value will be the number of facet results. Otherwise, the returned - * value will be the number of search results. - */ - public static int numResultsForLog(EarlybirdResponse response) { - if (response == null) { - return 0; - } else if (response.isSetFacetResults()) { - return ThriftSearchResultUtil.numFacetResults(response.getFacetResults()); - } else if (response.isSetTermStatisticsResults()) { - return response.getTermStatisticsResults().getTermResultsSize(); - } else { - return ThriftSearchResultUtil.numResults(response.getSearchResults()); - } - } - - private static String requestTypeForLog(EarlybirdRequest request) { - StringBuilder requestType = new StringBuilder(64); - if (request.isSetFacetRequest()) { - requestType.append("FACETS"); - int numFields = request.getFacetRequest().getFacetFieldsSize(); - if (numFields > 0) { - // For 1 or 2 fields, just put them in the request type. For more, just log the number. - if (numFields <= 2) { - for (ThriftFacetFieldRequest field : request.getFacetRequest().getFacetFields()) { - requestType.append(":").append(field.getFieldName().toUpperCase()); - } - } else { - requestType.append(":MULTI-").append(numFields); - } - } - } else if (request.isSetTermStatisticsRequest()) { - ThriftTermStatisticsRequest termStatsRequest = request.getTermStatisticsRequest(); - requestType.append("TERMSTATS-") - .append(termStatsRequest.getTermRequestsSize()); - - ThriftHistogramSettings histoSettings = termStatsRequest.getHistogramSettings(); - if (histoSettings != null) { - String binSizeVal = String.valueOf(TermStatisticsUtil.determineBinSize(histoSettings)); - String numBinsVal = String.valueOf(histoSettings.getNumBins()); - requestType.append(":NUMBINS-").append(numBinsVal).append(":BINSIZE-").append(binSizeVal); - } - } else if (request.isSetSearchQuery()) { - requestType.append("SEARCH:"); - requestType.append(request.getSearchQuery().getRankingMode().name()); - // Denote when a from user id is present. - if (request.getSearchQuery().isSetFromUserIDFilter64()) { - requestType.append(":NETWORK-") - .append(request.getSearchQuery().getFromUserIDFilter64Size()); - } - // Denote when required status ids are present. 
- if (request.getSearchQuery().getSearchStatusIdsSize() > 0) { - requestType.append(":IDS-").append(request.getSearchQuery().getSearchStatusIdsSize()); - } - } - return requestType.toString(); - } - - private static Map preBuildFailureRatioCounters() { - Map counterByQuerySource = - new EnumMap<>(ThriftQuerySource.class); - - for (ThriftQuerySource thriftQuerySource : ThriftQuerySource.values()) { - FailureRatioCounter counter = new FailureRatioCounter("earlybird_logger", "query_source", - thriftQuerySource.toString()); - counterByQuerySource.put(thriftQuerySource, counter); - } - - return Maps.immutableEnumMap(counterByQuerySource); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPostLogger.java b/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPostLogger.java deleted file mode 100644 index ab0d709f4..000000000 --- a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPostLogger.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.earlybird.common; - -import com.twitter.decider.Decider; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public final class EarlybirdRequestPostLogger { - private final EarlybirdRequestLogger logger; - - public static EarlybirdRequestPostLogger buildForRoot( - int latencyWarnThreshold, Decider decider) { - - EarlybirdRequestLogger requestLogger = EarlybirdRequestLogger.buildForRoot( - EarlybirdRequestPostLogger.class.getName(), latencyWarnThreshold, decider); - - return new EarlybirdRequestPostLogger(requestLogger); - } - - public static EarlybirdRequestPostLogger buildForShard( - int latencyWarnThreshold, Decider decider) { - - EarlybirdRequestLogger requestLogger = EarlybirdRequestLogger.buildForShard( - EarlybirdRequestPostLogger.class.getName(), latencyWarnThreshold, decider); - - return new EarlybirdRequestPostLogger(requestLogger); - } - - private EarlybirdRequestPostLogger(EarlybirdRequestLogger logger) { - this.logger = logger; - } - - public void logRequest(EarlybirdRequest request, EarlybirdResponse response, Timer timer) { - EarlybirdRequestUtil.updateHitsCounters(request); - logger.logRequest(request, response, timer); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPreLogger.java b/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPreLogger.java deleted file mode 100644 index 66d1d8b29..000000000 --- a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestPreLogger.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird.common; - -import com.twitter.decider.Decider; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -public final class EarlybirdRequestPreLogger { - private final EarlybirdRequestLogger logger; - - public static EarlybirdRequestPreLogger buildForRoot(Decider decider) { - EarlybirdRequestLogger requestLogger = EarlybirdRequestLogger.buildForRoot( - EarlybirdRequestPreLogger.class.getName(), Integer.MAX_VALUE, decider); - - return new EarlybirdRequestPreLogger(requestLogger); - } - - public static EarlybirdRequestPreLogger buildForShard( - int latencyWarnThreshold, Decider decider) { - - EarlybirdRequestLogger requestLogger = EarlybirdRequestLogger.buildForShard( - EarlybirdRequestPreLogger.class.getName(), latencyWarnThreshold, decider); - - return new EarlybirdRequestPreLogger(requestLogger); - } - - private EarlybirdRequestPreLogger(EarlybirdRequestLogger 
logger) { - this.logger = logger; - } - - public void logRequest(EarlybirdRequest request) { - logger.logRequest(request, null, null); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestUtil.java b/src/java/com/twitter/search/earlybird/common/EarlybirdRequestUtil.java deleted file mode 100644 index 6cdd322c5..000000000 --- a/src/java/com/twitter/search/earlybird/common/EarlybirdRequestUtil.java +++ /dev/null @@ -1,244 +0,0 @@ -package com.twitter.search.earlybird.common; - -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchMovingAverage; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; - -public final class EarlybirdRequestUtil { - // This logger is setup to log to a separate set of log files (request_info) and use an - // async logger so as to not block the searcher thread. See search/earlybird/config/log4j.xml - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdRequestUtil.class); - - @VisibleForTesting - static final SearchMovingAverage REQUESTED_NUM_RESULTS_STAT = - SearchMovingAverage.export("requested_num_results"); - - @VisibleForTesting - static final SearchMovingAverage REQUESTED_MAX_HITS_TO_PROCESS_STAT = - SearchMovingAverage.export("requested_max_hits_to_process"); - - @VisibleForTesting - static final SearchMovingAverage REQUESTED_COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS_STAT = - SearchMovingAverage.export("requested_collector_params_max_hits_to_process"); - - @VisibleForTesting - static final SearchMovingAverage REQUESTED_RELEVANCE_OPTIONS_MAX_HITS_TO_PROCESS_STAT = - SearchMovingAverage.export("requested_relevance_options_max_hits_to_process"); - - @VisibleForTesting - static final SearchCounter REQUESTED_MAX_HITS_TO_PROCESS_ARE_DIFFERENT_STAT = - SearchCounter.export("requested_max_hits_to_process_are_different"); - - private static final SearchRateCounter REQUEST_WITH_MORE_THAN_2K_NUM_RESULTS_STAT = - SearchRateCounter.export("request_with_more_than_2k_num_result"); - private static final SearchRateCounter REQUEST_WITH_MORE_THAN_4K_NUM_RESULTS_STAT = - SearchRateCounter.export("request_with_more_than_4k_num_result"); - - // Stats for tracking clock skew between earlybird and the client-specified request timestamp. 
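The pre-logger writes a line before the query runs (no response, no timer), while the post-logger writes the full line afterwards, so a request that never completes still leaves a trace. The sketch below shows how a request handler might bracket its work with the two phases; `Handler`, `Request`, and `Response` are hypothetical stand-ins, and the real calls are `EarlybirdRequestPreLogger.logRequest(request)` and `EarlybirdRequestPostLogger.logRequest(request, response, timer)`.

```java
// Rough sketch of bracketing a request with pre- and post-logging.
final class LoggingHandler {
  interface Handler { Response handle(Request request); }
  static final class Request { }
  static final class Response { }

  private final Handler delegate;

  LoggingHandler(Handler delegate) {
    this.delegate = delegate;
  }

  Response handle(Request request) {
    preLog(request);                       // logged even if the search never finishes
    long startNanos = System.nanoTime();
    Response response = delegate.handle(request);
    long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
    postLog(request, response, elapsedMs); // full line with response stats and latency
    return response;
  }

  private void preLog(Request request) {
    System.out.println("pre: " + request);
  }

  private void postLog(Request request, Response response, long elapsedMs) {
    System.out.println("post: " + request + " -> " + response + " in " + elapsedMs + "ms");
  }

  public static void main(String[] args) {
    new LoggingHandler(request -> new Response()).handle(new Request());
  }
}
```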
- @VisibleForTesting - public static final SearchTimerStats CLIENT_CLOCK_DIFF_ABS = - SearchTimerStats.export("client_clock_diff_abs", TimeUnit.MILLISECONDS, false, true); - @VisibleForTesting - public static final SearchTimerStats CLIENT_CLOCK_DIFF_POS = - SearchTimerStats.export("client_clock_diff_pos", TimeUnit.MILLISECONDS, false, true); - @VisibleForTesting - public static final SearchTimerStats CLIENT_CLOCK_DIFF_NEG = - SearchTimerStats.export("client_clock_diff_neg", TimeUnit.MILLISECONDS, false, true); - @VisibleForTesting - public static final SearchRateCounter CLIENT_CLOCK_DIFF_MISSING = - SearchRateCounter.export("client_clock_diff_missing"); - - private static final int MAX_NUM_RESULTS = 4000; - private static final int OLD_MAX_NUM_RESULTS = 2000; - - private EarlybirdRequestUtil() { - } - - /** - * Logs and fixes some potentially excessive values in the given request. - */ - public static void logAndFixExcessiveValues(EarlybirdRequest request) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - if (searchQuery != null) { - int maxHitsToProcess = 0; - int numResultsToReturn = 0; - - if (searchQuery.isSetCollectorParams()) { - numResultsToReturn = searchQuery.getCollectorParams().getNumResultsToReturn(); - - if (searchQuery.getCollectorParams().isSetTerminationParams()) { - maxHitsToProcess = - searchQuery.getCollectorParams().getTerminationParams().getMaxHitsToProcess(); - } - } - - if (maxHitsToProcess > 50000) { - LOG.warn("Excessive max hits in " + request.toString()); - } - - // We used to limit number of results to 2000. These two counters help us track if we receive - // too many requests with large number of results set. - String warningMessageTemplate = "Exceed %d num result in %s"; - if (numResultsToReturn > MAX_NUM_RESULTS) { - LOG.warn(String.format(warningMessageTemplate, MAX_NUM_RESULTS, request.toString())); - REQUEST_WITH_MORE_THAN_4K_NUM_RESULTS_STAT.increment(); - searchQuery.getCollectorParams().setNumResultsToReturn(MAX_NUM_RESULTS); - } else if (numResultsToReturn > OLD_MAX_NUM_RESULTS) { - LOG.warn(String.format(warningMessageTemplate, OLD_MAX_NUM_RESULTS, request.toString())); - REQUEST_WITH_MORE_THAN_2K_NUM_RESULTS_STAT.increment(); - } - - ThriftSearchRelevanceOptions options = searchQuery.getRelevanceOptions(); - if (options != null) { - if (options.getMaxHitsToProcess() > 50000) { - LOG.warn("Excessive max hits in " + request.toString()); - } - } - } - } - - /** - * Sets {@code request.searchQuery.collectorParams} if they are not already set. 
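`logAndFixExcessiveValues` warns when a request asks for more than the old 2000-result limit and hard-caps anything above 4000. A minimal standalone sketch of that warn-and-clamp behaviour follows, with the thresholds taken from the constants above and logging simplified to stderr.

```java
// Sketch of the warn-and-clamp logic in logAndFixExcessiveValues(): requests over
// the old 2000-result limit are only counted/warned, requests over the 4000 hard
// limit are clamped. Stats counters and SLF4J logging are simplified away.
final class NumResultsClamp {
  private static final int MAX_NUM_RESULTS = 4000;
  private static final int OLD_MAX_NUM_RESULTS = 2000;

  static int clampNumResults(int requested) {
    if (requested > MAX_NUM_RESULTS) {
      System.err.println("Exceed " + MAX_NUM_RESULTS + " num result, clamping: " + requested);
      return MAX_NUM_RESULTS;
    }
    if (requested > OLD_MAX_NUM_RESULTS) {
      System.err.println("Exceed " + OLD_MAX_NUM_RESULTS + " num result: " + requested);
    }
    return requested;
  }

  public static void main(String[] args) {
    System.out.println(clampNumResults(500));   // 500
    System.out.println(clampNumResults(2500));  // 2500, but warned
    System.out.println(clampNumResults(9000));  // 4000
  }
}
```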
- */ - public static void checkAndSetCollectorParams(EarlybirdRequest request) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - if (searchQuery == null) { - return; - } - - if (!searchQuery.isSetCollectorParams()) { - searchQuery.setCollectorParams(new CollectorParams()); - } - if (!searchQuery.getCollectorParams().isSetNumResultsToReturn()) { - searchQuery.getCollectorParams().setNumResultsToReturn(searchQuery.getNumResults()); - } - if (!searchQuery.getCollectorParams().isSetTerminationParams()) { - CollectorTerminationParams terminationParams = new CollectorTerminationParams(); - if (request.isSetTimeoutMs()) { - terminationParams.setTimeoutMs(request.getTimeoutMs()); - } - if (request.isSetMaxQueryCost()) { - terminationParams.setMaxQueryCost(request.getMaxQueryCost()); - } - searchQuery.getCollectorParams().setTerminationParams(terminationParams); - } - setMaxHitsToProcess(searchQuery); - } - - // Early birds will only look for maxHitsToProcess in CollectorParameters.TerminationParameters. - // Priority to set CollectorParameters.TerminationParameters.maxHitsToProcess is - // 1 Collector parameters - // 2 RelevanceParameters - // 3 ThrfitQuery.maxHitsToProcess - private static void setMaxHitsToProcess(ThriftSearchQuery thriftSearchQuery) { - CollectorTerminationParams terminationParams = thriftSearchQuery - .getCollectorParams().getTerminationParams(); - if (!terminationParams.isSetMaxHitsToProcess()) { - if (thriftSearchQuery.isSetRelevanceOptions() - && thriftSearchQuery.getRelevanceOptions().isSetMaxHitsToProcess()) { - terminationParams.setMaxHitsToProcess( - thriftSearchQuery.getRelevanceOptions().getMaxHitsToProcess()); - } else { - terminationParams.setMaxHitsToProcess(thriftSearchQuery.getMaxHitsToProcess()); - } - } - } - - /** - * Creates a copy of the given request and unsets the binary fields to make the logged line for - * this request look nicer. - */ - public static EarlybirdRequest copyAndClearUnnecessaryValuesForLogging(EarlybirdRequest request) { - EarlybirdRequest copiedRequest = request.deepCopy(); - - if (copiedRequest.isSetSearchQuery()) { - // These fields are very large and the binary data doesn't play well with formz - copiedRequest.getSearchQuery().unsetTrustedFilter(); - copiedRequest.getSearchQuery().unsetDirectFollowFilter(); - } - - return copiedRequest; - } - - /** - * Updates some hit-related stats based on the parameters in the given request. 
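`setMaxHitsToProcess` resolves the effective value with a fixed priority: an explicit value already on the collector termination params wins, then the relevance options, then the top-level query field. The sketch below shows just that fallback chain, using `Optional<Integer>` as a stand-in for the thrift `isSet` checks.

```java
import java.util.Optional;

// Sketch of the priority used by setMaxHitsToProcess():
//   1. CollectorParams.TerminationParams.maxHitsToProcess (if already set)
//   2. ThriftSearchRelevanceOptions.maxHitsToProcess
//   3. ThriftSearchQuery.maxHitsToProcess
final class MaxHitsResolver {
  static int resolveMaxHitsToProcess(Optional<Integer> terminationParamsValue,
                                     Optional<Integer> relevanceOptionsValue,
                                     int queryLevelValue) {
    return terminationParamsValue.orElseGet(
        () -> relevanceOptionsValue.orElse(queryLevelValue));
  }

  public static void main(String[] args) {
    System.out.println(resolveMaxHitsToProcess(Optional.of(1000), Optional.of(2000), 3000)); // 1000
    System.out.println(resolveMaxHitsToProcess(Optional.empty(), Optional.of(2000), 3000));  // 2000
    System.out.println(resolveMaxHitsToProcess(Optional.empty(), Optional.empty(), 3000));   // 3000
  }
}
```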
- */ - public static void updateHitsCounters(EarlybirdRequest request) { - if ((request == null) || !request.isSetSearchQuery()) { - return; - } - - ThriftSearchQuery searchQuery = request.getSearchQuery(); - - if (searchQuery.isSetNumResults()) { - REQUESTED_NUM_RESULTS_STAT.addSample(searchQuery.getNumResults()); - } - - if (searchQuery.isSetMaxHitsToProcess()) { - REQUESTED_MAX_HITS_TO_PROCESS_STAT.addSample(searchQuery.getMaxHitsToProcess()); - } - - Integer collectorParamsMaxHitsToProcess = null; - if (searchQuery.isSetCollectorParams() - && searchQuery.getCollectorParams().isSetTerminationParams() - && searchQuery.getCollectorParams().getTerminationParams().isSetMaxHitsToProcess()) { - collectorParamsMaxHitsToProcess = - searchQuery.getCollectorParams().getTerminationParams().getMaxHitsToProcess(); - REQUESTED_COLLECTOR_PARAMS_MAX_HITS_TO_PROCESS_STAT - .addSample(collectorParamsMaxHitsToProcess); - } - - Integer relevanceOptionsMaxHitsToProcess = null; - if (searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isSetMaxHitsToProcess()) { - relevanceOptionsMaxHitsToProcess = searchQuery.getRelevanceOptions().getMaxHitsToProcess(); - REQUESTED_RELEVANCE_OPTIONS_MAX_HITS_TO_PROCESS_STAT - .addSample(relevanceOptionsMaxHitsToProcess); - } - - if ((collectorParamsMaxHitsToProcess != null) - && (relevanceOptionsMaxHitsToProcess != null) - && (collectorParamsMaxHitsToProcess != relevanceOptionsMaxHitsToProcess)) { - REQUESTED_MAX_HITS_TO_PROCESS_ARE_DIFFERENT_STAT.increment(); - } - } - - public static boolean isCachingAllowed(EarlybirdRequest request) { - return !request.isSetCachingParams() || request.getCachingParams().isCache(); - } - - /** - * Track the clock difference between this server and its client's specified request time. - * When there is no clock drift between machines, this will record the inflight time between this - * server and the client. - * - * @param request the incoming earlybird request. - */ - public static void recordClientClockDiff(EarlybirdRequest request) { - if (request.isSetClientRequestTimeMs()) { - final long timeDiff = System.currentTimeMillis() - request.getClientRequestTimeMs(); - final long timeDiffAbs = Math.abs(timeDiff); - if (timeDiff >= 0) { - CLIENT_CLOCK_DIFF_POS.timerIncrement(timeDiffAbs); - } else { - CLIENT_CLOCK_DIFF_NEG.timerIncrement(timeDiffAbs); - } - CLIENT_CLOCK_DIFF_ABS.timerIncrement(timeDiffAbs); - } else { - CLIENT_CLOCK_DIFF_MISSING.increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/common/EarlybirdThriftBackend.java b/src/java/com/twitter/search/earlybird/common/EarlybirdThriftBackend.java deleted file mode 100644 index 52a7fa898..000000000 --- a/src/java/com/twitter/search/earlybird/common/EarlybirdThriftBackend.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.earlybird.common; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import org.apache.thrift.protocol.TProtocolFactory; - -import com.twitter.finagle.Service; -import com.twitter.search.common.util.thrift.ThriftToBytesFilter; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -@Singleton -public class EarlybirdThriftBackend extends EarlybirdService.ServiceToClient { - - /** - * Wrapping the bytes svc back to a EarlybirdService.ServiceToClient, which - * is a EarlybirdService.ServiceIface again. 
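`recordClientClockDiff` splits the difference between the server clock and the client-supplied request timestamp into positive, negative, and absolute stats, with a separate counter when the client timestamp is missing. Here is a standalone sketch of that bookkeeping; plain `AtomicLong`s stand in for `SearchTimerStats` and `SearchRateCounter`.

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of recordClientClockDiff(): when the client sent its own request
// timestamp, record the signed and absolute skew; otherwise count the request
// as "missing".
final class ClientClockDiff {
  static final AtomicLong positiveMs = new AtomicLong();
  static final AtomicLong negativeMs = new AtomicLong();
  static final AtomicLong absoluteMs = new AtomicLong();
  static final AtomicLong missing = new AtomicLong();

  static void record(OptionalLong clientRequestTimeMs) {
    if (clientRequestTimeMs.isPresent()) {
      long diff = System.currentTimeMillis() - clientRequestTimeMs.getAsLong();
      long abs = Math.abs(diff);
      if (diff >= 0) {
        positiveMs.addAndGet(abs);   // client clock behind, or in-flight latency
      } else {
        negativeMs.addAndGet(abs);   // client clock ahead of this server
      }
      absoluteMs.addAndGet(abs);
    } else {
      missing.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    record(OptionalLong.of(System.currentTimeMillis() - 25)); // ~25ms apparent skew
    record(OptionalLong.empty());
    System.out.println(absoluteMs.get() + "ms total, missing=" + missing.get());
  }
}
```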
- */ - @Inject - public EarlybirdThriftBackend( - ThriftToBytesFilter thriftToBytesFilter, - Service byteService, - TProtocolFactory protocolFactory) { - - super(thriftToBytesFilter.andThen(byteService), protocolFactory); - } - -} diff --git a/src/java/com/twitter/search/earlybird/common/NonPagingAssert.java b/src/java/com/twitter/search/earlybird/common/NonPagingAssert.java deleted file mode 100644 index 837adbb0a..000000000 --- a/src/java/com/twitter/search/earlybird/common/NonPagingAssert.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.earlybird.common; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchRateCounter; - -/** - * When incremented, a non-paging alert will be triggered. Use this to assert for bad conditions - * that should generally never happen. - */ -public class NonPagingAssert { - private static final Logger LOG = LoggerFactory.getLogger(NonPagingAssert.class); - - private static final String ASSERT_STAT_PREFIX = "non_paging_assert_"; - - private final String name; - private final SearchRateCounter assertCounter; - - public NonPagingAssert(String name) { - this.name = name; - this.assertCounter = SearchRateCounter.export(ASSERT_STAT_PREFIX + name); - } - - public void assertFailed() { - LOG.error("NonPagingAssert failed: {}", name); - assertCounter.increment(); - } - - public static void assertFailed(String name) { - NonPagingAssert nonPagingAssert = new NonPagingAssert(name); - nonPagingAssert.assertFailed(); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/RequestResponseForLogging.java b/src/java/com/twitter/search/earlybird/common/RequestResponseForLogging.java deleted file mode 100644 index 695ce4503..000000000 --- a/src/java/com/twitter/search/earlybird/common/RequestResponseForLogging.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.earlybird.common; - - -import org.apache.thrift.TException; -import org.apache.thrift.TSerializer; -import org.apache.thrift.protocol.TSimpleJSONProtocol; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public class RequestResponseForLogging { - private static final Logger LOG = LoggerFactory.getLogger( - RequestResponseForLogging.class); - - private static final Logger FAILED_REQUEST_LOG = LoggerFactory.getLogger( - RequestResponseForLogging.class.getName() + ".FailedRequests"); - - private final EarlybirdRequest request; - private final EarlybirdResponse response; - - public RequestResponseForLogging(EarlybirdRequest request, - EarlybirdResponse response) { - this.request = request; - this.response = response; - } - - private String serialize(EarlybirdRequest clearedRequest, EarlybirdResponse theResponse) { - TSerializer serializer = new TSerializer(new TSimpleJSONProtocol.Factory()); - try { - String requestJson = serializer.toString(clearedRequest); - String responseJson = serializer.toString(theResponse); - return "{\"request\":" + requestJson + ", \"response\":" + responseJson + "}"; - } catch (TException e) { - LOG.error("Failed to serialize request/response for logging.", e); - return ""; - } - } - - /** - * Logs the request and response stored in this instance to the failure log file. 
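`NonPagingAssert` pairs an error log line with an exported rate counter so a "should never happen" condition shows up on dashboards without paging anyone. A minimal sketch of that idea follows, with a plain in-memory counter registry and stderr standing in for `SearchRateCounter` and SLF4J.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the NonPagingAssert idea: increment a named counter (exported as
// non_paging_assert_<name>) and log an error, instead of throwing.
final class SoftAssert {
  private static final ConcurrentMap<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

  static void assertFailed(String name) {
    System.err.println("NonPagingAssert failed: " + name);
    COUNTERS.computeIfAbsent("non_paging_assert_" + name, k -> new LongAdder()).increment();
  }

  static long count(String name) {
    LongAdder adder = COUNTERS.get("non_paging_assert_" + name);
    return adder == null ? 0 : adder.sum();
  }

  public static void main(String[] args) {
    assertFailed("unexpected_null_segment"); // hypothetical assert name
    System.out.println(count("unexpected_null_segment")); // 1
  }
}
```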
- */ - public void logFailedRequest() { - // Do the serializing/concatting this way so it happens on the background thread for - // async logging - FAILED_REQUEST_LOG.info("{}", new Object() { - @Override - public String toString() { - return serialize( - EarlybirdRequestUtil.copyAndClearUnnecessaryValuesForLogging(request), response); - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/RequestResponsePair.java b/src/java/com/twitter/search/earlybird/common/RequestResponsePair.java deleted file mode 100644 index 2a6c4b299..000000000 --- a/src/java/com/twitter/search/earlybird/common/RequestResponsePair.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.earlybird.common; - -import org.apache.lucene.search.Query; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public class RequestResponsePair { - private final EarlybirdRequest request; - private final EarlybirdResponse response; - private final org.apache.lucene.search.Query luceneQuery; - - // The serialized query in its final form, after various modifications have been applied to it. - // As a note, we have some code paths in which this can be null, but I don't really see them - // triggered in production right now. - private final com.twitter.search.queryparser.query.Query finalSerializedQuery; - - public RequestResponsePair( - EarlybirdRequest request, - com.twitter.search.queryparser.query.Query finalSerializedQuery, - org.apache.lucene.search.Query luceneQuery, - EarlybirdResponse response) { - this.request = request; - this.luceneQuery = luceneQuery; - this.response = response; - this.finalSerializedQuery = finalSerializedQuery; - } - - public String getFinalSerializedQuery() { - return finalSerializedQuery != null ? finalSerializedQuery.serialize() : "N/A"; - } - - public EarlybirdRequest getRequest() { - return request; - } - - public EarlybirdResponse getResponse() { - return response; - } - - public Query getLuceneQuery() { - return luceneQuery; - } -} diff --git a/src/java/com/twitter/search/earlybird/common/UnknownClientRequestForLogging.java b/src/java/com/twitter/search/earlybird/common/UnknownClientRequestForLogging.java deleted file mode 100644 index f0345d6a2..000000000 --- a/src/java/com/twitter/search/earlybird/common/UnknownClientRequestForLogging.java +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.search.earlybird.common; - -import org.apache.commons.codec.binary.Base64; -import org.apache.thrift.TException; -import org.apache.thrift.TSerializer; -import org.apache.thrift.protocol.TBinaryProtocol; -import org.slf4j.Logger; - -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -/** - * This class logs all requests that misses either the finagle Id or the client Id. 
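`logFailedRequest` avoids serializing the request and response on the searcher thread by handing the logger an object whose `toString()` does the work, so the JSON is only built on the async appender's background thread. Below is a small sketch of that trick with a `Supplier` in place of the Thrift serialization.

```java
import java.util.function.Supplier;

// Sketch of the lazy-serialization trick in logFailedRequest(): the log call only
// captures an object; the expensive string building happens when (and if) the
// logging framework calls toString() on its own thread.
final class LazyLogPayload {
  private final Supplier<String> expensiveSerialization;

  private LazyLogPayload(Supplier<String> expensiveSerialization) {
    this.expensiveSerialization = expensiveSerialization;
  }

  static LazyLogPayload of(Supplier<String> expensiveSerialization) {
    return new LazyLogPayload(expensiveSerialization);
  }

  @Override
  public String toString() {
    return expensiveSerialization.get();
  }

  public static void main(String[] args) {
    Object payload = LazyLogPayload.of(
        () -> "{\"request\":{...}, \"response\":{...}}"); // built lazily, off the request thread
    // An async logger would store 'payload' and call toString() later.
    System.out.println(payload);
  }
}
```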
- */ -public final class UnknownClientRequestForLogging { - private static final Logger GENERAL_LOG = org.slf4j.LoggerFactory.getLogger( - UnknownClientRequestForLogging.class); - private static final Logger LOG = org.slf4j.LoggerFactory.getLogger( - UnknownClientRequestForLogging.class.getName() + ".unknownClientRequests"); - - private final String logLine; - private final EarlybirdRequest request; - private final String clientId; - private final String finagleId; - - private final Base64 base64 = new Base64(); - private final TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory()); - - private UnknownClientRequestForLogging( - String logLine, - EarlybirdRequest request, - String clientId, - String finagleId) { - - this.logLine = logLine; - this.request = request; - this.clientId = clientId; - this.finagleId = finagleId; - } - - /** - * Returns an UnknownClientRequestForLogging instance if a client ID is not set on the given - * earlybird request. If the request has a client ID set, {@code null} is returned. - * - * @param logLine Additional information to propagate to the log file, when logging this request. - * @param request The earlybird request. - */ - public static UnknownClientRequestForLogging unknownClientRequest( - String logLine, EarlybirdRequest request) { - String clientId = ClientIdUtil.getClientIdFromRequest(request); - String finagleId = FinagleUtil.getFinagleClientName(); - - if (clientId.equals(ClientIdUtil.UNSET_CLIENT_ID)) { - return new UnknownClientRequestForLogging(logLine, request, clientId, finagleId); - } else { - return null; - } - } - - private String asBase64() { - try { - // Need to make a deepCopy() here, because the request may still be in use (e.g. if we are - // doing this in the pre-logger), and we should not be modifying crucial fields on the - // EarlybirdRequest in place. 
- EarlybirdRequest clearedRequest = request.deepCopy(); - clearedRequest.unsetClientRequestTimeMs(); - return base64.encodeToString(serializer.serialize(clearedRequest)); - } catch (TException e) { - GENERAL_LOG.error("Failed to serialize request for logging.", e); - return "failed_to_serialize"; - } - } - - public void log() { - LOG.info("{},{},{},{}", clientId, finagleId, logLine, asBase64()); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/config/BUILD b/src/java/com/twitter/search/earlybird/common/config/BUILD deleted file mode 100644 index 4d2634365..000000000 --- a/src/java/com/twitter/search/earlybird/common/config/BUILD +++ /dev/null @@ -1,21 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/commons:commons-lang3", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "3rdparty/jvm/org/yaml:snakeyaml", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/aurora", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/zookeeper", - ], -) diff --git a/src/java/com/twitter/search/earlybird/common/config/EarlybirdConfig.java b/src/java/com/twitter/search/earlybird/common/config/EarlybirdConfig.java deleted file mode 100644 index ed18aab08..000000000 --- a/src/java/com/twitter/search/earlybird/common/config/EarlybirdConfig.java +++ /dev/null @@ -1,363 +0,0 @@ -package com.twitter.search.earlybird.common.config; - -import java.util.Date; -import java.util.List; -import java.util.Map; -import javax.annotation.Nullable; - -import com.google.common.collect.ImmutableMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.aurora.AuroraInstanceKey; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.config.ConfigFile; -import com.twitter.search.common.config.ConfigurationException; -import com.twitter.search.common.config.SearchPenguinVersionsConfig; - -public final class EarlybirdConfig { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdConfig.class); - - private static final String DEFAULT_CONFIG_FILE = "earlybird-search.yml"; - private static final String LATE_TWEET_BUFFER_KEY = "late_tweet_buffer"; - - public static final String EARLYBIRD_ZK_CONFIG_DIR = "/twitter/search/production/earlybird/"; - public static final String EARLYBIRD_CONFIG_DIR = "earlybird/config"; - - public static final String USER_SNAPSHOT_BASE_DIR = "user_snapshot_base_dir"; - - private static volatile ConfigFile earlybirdConfig = null; - private static volatile Map overrideValueMap = ImmutableMap.of(); - - private static String logDirOverride = null; - private static AuroraInstanceKey auroraInstanceKey = null; - - private static int adminPort; - - private EarlybirdConfig() { } - - private static final class PenguinVersionHolder { - private static final PenguinVersion PENGUIN_VERSION_SINGLETON = - SearchPenguinVersionsConfig.getSingleSupportedVersion( - EarlybirdProperty.PENGUIN_VERSION.get()); - private static final byte 
PENGUIN_VERSION_BYTE_VALUE = - PENGUIN_VERSION_SINGLETON.getByteValue(); - } - - public static byte getPenguinVersionByte() { - return PenguinVersionHolder.PENGUIN_VERSION_BYTE_VALUE; - } - - public static PenguinVersion getPenguinVersion() { - return PenguinVersionHolder.PENGUIN_VERSION_SINGLETON; - } - - /** - * Reads the earlybird configuration from the given file. - */ - public static synchronized void init(@Nullable String configFile) { - if (earlybirdConfig == null) { - String file = configFile == null ? DEFAULT_CONFIG_FILE : configFile; - earlybirdConfig = new ConfigFile(EARLYBIRD_CONFIG_DIR, file); - } - } - - public static synchronized void setOverrideValues(Map overrideValues) { - overrideValueMap = ImmutableMap.copyOf(overrideValues); - } - - /** - * Pack all values in a string that can be printed for informational purposes. - * @return the string. - */ - public static String allValuesAsString() { - Map stringMap = earlybirdConfig.getStringMap(); - - StringBuilder stringBuilder = new StringBuilder(); - - stringBuilder.append("Config environment: " + Config.getEnvironment() + "\n\n"); - stringBuilder.append( - String.format("Values from earlybird-search.yml (total %d):\n", stringMap.size())); - - stringMap.forEach((key, value) -> { - stringBuilder.append(String.format(" %s: %s\n", key, value.toString())); - if (overrideValueMap.containsKey(key)) { - stringBuilder.append(String.format( - " override value: %s\n", overrideValueMap.get(key).toString())); - } - }); - - stringBuilder.append(String.format( - "\n\nAll command-line overrides (total: %d):\n", overrideValueMap.size())); - overrideValueMap.forEach((key, value) -> { - stringBuilder.append(String.format(" %s: %s\n", key, value.toString())); - }); - - return stringBuilder.toString(); - } - - /** - * Returns the value of the given property as a string. If the property is not set, a runtime - * exception is thrown. - */ - public static String getString(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (String) overrideValue; - } - - try { - return earlybirdConfig.getString(property); - } catch (ConfigurationException e) { - LOG.error("Fatal error: could not get config string " + property, e); - throw new RuntimeException(e); - } - } - - /** - * Returns the value of the given property as a string. - */ - public static String getString(String property, String defaultValue) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (String) overrideValue; - } - - return earlybirdConfig.getString(property, defaultValue); - } - - /** - * Returns the value of the given property as an integer. If the property is not set, a runtime - * exception is thrown. - */ - public static int getInt(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (int) overrideValue; - } - - try { - return earlybirdConfig.getInt(property); - } catch (ConfigurationException e) { - LOG.error("Fatal error: could not get config int " + property, e); - throw new RuntimeException(e); - } - } - - /** - * Returns the value of the given property as an integer. - */ - public static int getInt(String property, int defaultValue) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (int) overrideValue; - } - - return earlybirdConfig.getInt(property, defaultValue); - } - - /** - * Returns the value of the given property as a double. 
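Every typed getter in `EarlybirdConfig` consults the command-line override map before falling back to the YAML config file and, where applicable, a caller-supplied default. The sketch below shows that lookup order with plain maps standing in for the parsed `earlybird-search.yml` and the overrides; the property names and values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lookup order used by EarlybirdConfig.getInt and friends:
// command-line override map first, then the YAML values, then the default.
final class LayeredConfig {
  private final Map<String, Object> overrides;
  private final Map<String, Object> yamlValues;

  LayeredConfig(Map<String, Object> overrides, Map<String, Object> yamlValues) {
    this.overrides = overrides;
    this.yamlValues = yamlValues;
  }

  int getInt(String property, int defaultValue) {
    Object override = overrides.get(property);
    if (override != null) {
      return (int) override;
    }
    Object fromYaml = yamlValues.get(property);
    return fromYaml != null ? (int) fromYaml : defaultValue;
  }

  public static void main(String[] args) {
    Map<String, Object> overrides = new HashMap<>();
    overrides.put("searcher_threads", 32);            // e.g. passed on the command line

    Map<String, Object> yaml = new HashMap<>();
    yaml.put("searcher_threads", 16);
    yaml.put("late_tweet_buffer", 600);

    LayeredConfig config = new LayeredConfig(overrides, yaml);
    System.out.println(config.getInt("searcher_threads", 8));       // 32 (override wins)
    System.out.println(config.getInt("late_tweet_buffer", 0));      // 600 (yaml value)
    System.out.println(config.getInt("max_segment_size", 1 << 16)); // 65536 (default)
  }
}
```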
- */ - public static double getDouble(String property, double defaultValue) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (double) overrideValue; - } - - return earlybirdConfig.getDouble(property, defaultValue); - } - - /** - * Returns the value of the given property as a long. If the property is not set, a runtime - * exception is thrown. - */ - public static long getLong(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (long) overrideValue; - } - - try { - return earlybirdConfig.getLong(property); - } catch (ConfigurationException e) { - LOG.error("Fatal error: could not get config long " + property, e); - throw new RuntimeException(e); - } - } - - /** - * Returns the value of the given property as a long. - */ - public static long getLong(String property, long defaultValue) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (long) overrideValue; - } - - return earlybirdConfig.getLong(property, defaultValue); - } - - /** - * Returns the value of the given property as a boolean. If the property is not set, a runtime - * exception is thrown. - */ - public static boolean getBool(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (boolean) overrideValue; - } - - try { - return earlybirdConfig.getBool(property); - } catch (ConfigurationException e) { - LOG.error("Fatal error: could not get config boolean " + property, e); - throw new RuntimeException(e); - } - } - - /** - * Returns the value of the given property as a boolean. - */ - public static boolean getBool(String property, boolean defaultValue) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (boolean) overrideValue; - } - - return earlybirdConfig.getBool(property, defaultValue); - } - - /** - * Returns the value of the given property as a date. - */ - public static Date getDate(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (Date) overrideValue; - } - - Date date = (Date) earlybirdConfig.getObject(property, null); - if (date == null) { - throw new RuntimeException("Could not get config date: " + property); - } - return date; - } - - /** - * Returns the value of the given property as a list of strings. - */ - public static List getListOfStrings(String property) { - Object overrideValue = overrideValueMap.get(property); - if (overrideValue != null) { - return (List) overrideValue; - } - - List list = (List) earlybirdConfig.getObject(property, null); - if (list == null) { - throw new RuntimeException("Could not get list of strings: " + property); - } - return list; - } - - /** - * Returns the value of the given property as a map. - */ - @SuppressWarnings("unchecked") - public static Map getMap(String property) { - Map map = (Map) earlybirdConfig.getObject(property, null); - if (map == null) { - throw new RuntimeException("Could not find config property: " + property); - } - return map; - } - - public static int getMaxSegmentSize() { - return EarlybirdConfig.getInt("max_segment_size", 1 << 16); - } - - /** - * Returns the log properties file. 
- */ - public static String getLogPropertiesFile() { - try { - String filename = earlybirdConfig.getString("log_properties_filename"); - return earlybirdConfig.getConfigFilePath(filename); - } catch (ConfigurationException e) { - // Print here rather than use LOG - log was probably not initialized yet. - LOG.error("Fatal error: could not get log properties file", e); - throw new RuntimeException(e); - } - } - - /** - * Returns the log directory. - */ - public static String getLogDir() { - if (logDirOverride != null) { - return logDirOverride; - } else { - return EarlybirdConfig.getString("log_dir"); - } - } - - public static void overrideLogDir(String logDir) { - EarlybirdConfig.logDirOverride = logDir; - } - - public static int getThriftPort() { - return EarlybirdProperty.THRIFT_PORT.get(); - } - - public static int getWarmUpThriftPort() { - return EarlybirdProperty.WARMUP_THRIFT_PORT.get(); - } - - public static int getSearcherThreads() { - return EarlybirdProperty.SEARCHER_THREADS.get(); - } - - public static int getLateTweetBuffer() { - return getInt(LATE_TWEET_BUFFER_KEY); - } - - public static int getAdminPort() { - return adminPort; - } - - public static void setAdminPort(int adminPort) { - EarlybirdConfig.adminPort = adminPort; - } - - public static boolean isRealtimeOrProtected() { - String earlybirdName = EarlybirdProperty.EARLYBIRD_NAME.get(); - return earlybirdName.contains("realtime") || earlybirdName.contains("protected"); - } - - public static boolean consumeUserScrubGeoEvents() { - return EarlybirdProperty.CONSUME_GEO_SCRUB_EVENTS.get(); - } - - @Nullable - public static AuroraInstanceKey getAuroraInstanceKey() { - return auroraInstanceKey; - } - - public static void setAuroraInstanceKey(AuroraInstanceKey auroraInstanceKey) { - EarlybirdConfig.auroraInstanceKey = auroraInstanceKey; - } - - public static boolean isAurora() { - return auroraInstanceKey != null; - } - - public static void setForTests(String property, Object value) { - earlybirdConfig.setForTests(DEFAULT_CONFIG_FILE, property, value); - } - - public static synchronized void clearForTests() { - earlybirdConfig = new ConfigFile(EARLYBIRD_CONFIG_DIR, DEFAULT_CONFIG_FILE); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/config/EarlybirdProperty.java b/src/java/com/twitter/search/earlybird/common/config/EarlybirdProperty.java deleted file mode 100644 index f8534bce5..000000000 --- a/src/java/com/twitter/search/earlybird/common/config/EarlybirdProperty.java +++ /dev/null @@ -1,390 +0,0 @@ -package com.twitter.search.earlybird.common.config; - -import java.lang.reflect.Modifier; -import java.util.Arrays; -import java.util.List; -import java.util.function.BiFunction; -import java.util.function.Function; -import java.util.stream.Collectors; - -import com.google.common.collect.ImmutableList; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.app.Flags; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; - -/** - * Stateless class that represents an Earlybird property that can be specified by a command line - * flag. - *

- * This is a regular Java class instead of enum to have a generic type. - * - * @param - */ -public final class EarlybirdProperty { - - private static final class PropertyType { - - private static final PropertyType BOOLEAN = new PropertyType<>( - Flaggable.ofJavaBoolean(), EarlybirdConfig::getBool, EarlybirdConfig::getBool); - - private static final PropertyType INT = new PropertyType<>( - Flaggable.ofJavaInteger(), EarlybirdConfig::getInt, EarlybirdConfig::getInt); - - private static final PropertyType STRING = new PropertyType<>( - Flaggable.ofString(), EarlybirdConfig::getString, EarlybirdConfig::getString); - - private final Flaggable flaggable; - private final Function getter; - private final BiFunction getterWithDefault; - - private PropertyType(Flaggable flaggable, Function getter, - BiFunction getterWithDefault) { - this.flaggable = flaggable; - this.getter = getter; - this.getterWithDefault = getterWithDefault; - } - } - - public static final EarlybirdProperty PENGUIN_VERSION = - new EarlybirdProperty<>( - "penguin_version", - "The penguin version to index.", - PropertyType.STRING, - false); - - public static final EarlybirdProperty THRIFT_PORT = new EarlybirdProperty<>( - "thrift_port", - "override thrift port from config file", - PropertyType.INT, - false); - - public static final EarlybirdProperty WARMUP_THRIFT_PORT = new EarlybirdProperty<>( - "warmup_thrift_port", - "override warmup thrift port from config file", - PropertyType.INT, - false); - - public static final EarlybirdProperty SEARCHER_THREADS = new EarlybirdProperty<>( - "searcher_threads", - "override number of searcher threads from config file", - PropertyType.INT, - false); - - public static final EarlybirdProperty EARLYBIRD_TIER = new EarlybirdProperty<>( - "earlybird_tier", - "the earlybird tier (e.g. 
tier1), used on Aurora", - PropertyType.STRING, - true); - - public static final EarlybirdProperty REPLICA_ID = new EarlybirdProperty<>( - "replica_id", - "the ID in a partition, used on Aurora", - PropertyType.INT, - true); - - public static final EarlybirdProperty PARTITION_ID = new EarlybirdProperty<>( - "partition_id", - "partition ID, used on Aurora", - PropertyType.INT, - true); - - public static final EarlybirdProperty NUM_PARTITIONS = new EarlybirdProperty<>( - "num_partitions", - "number of partitions, used on Aurora", - PropertyType.INT, - true); - - public static final EarlybirdProperty NUM_INSTANCES = new EarlybirdProperty<>( - "num_instances", - "number of instances in the job, used on Aurora", - PropertyType.INT, - true); - - public static final EarlybirdProperty SERVING_TIMESLICES = new EarlybirdProperty<>( - "serving_timeslices", - "number of time slices to serve, used on Aurora", - PropertyType.INT, - true); - - public static final EarlybirdProperty ROLE = new EarlybirdProperty<>( - "role", - "Role in the service path of Earlybird", - PropertyType.STRING, - true, - true); - - public static final EarlybirdProperty EARLYBIRD_NAME = new EarlybirdProperty<>( - "earlybird_name", - "Name in the service path of Earlybird without hash partition suffix", - PropertyType.STRING, - true, - true); - - public static final EarlybirdProperty ENV = new EarlybirdProperty<>( - "env", - "Environment in the service path of Earlybird", - PropertyType.STRING, - true, - true); - - public static final EarlybirdProperty ZONE = new EarlybirdProperty<>( - "zone", - "Zone (data center) in the service path of Earlybird", - PropertyType.STRING, - true, - true); - - public static final EarlybirdProperty DL_URI = new EarlybirdProperty<>( - "dl_uri", - "DistributedLog URI for default DL reader", - PropertyType.STRING, - false); - - public static final EarlybirdProperty USER_UPDATES_DL_URI = new EarlybirdProperty<>( - "user_updates_dl_uri", - "DistributedLog URI for user updates DL reader", - PropertyType.STRING, - false); - - public static final EarlybirdProperty ANTISOCIAL_USERUPDATES_DL_STREAM = - new EarlybirdProperty<>( - "antisocial_userupdates_dl_stream", - "DL stream name for antisocial user updates without DL version suffix", - PropertyType.STRING, - false); - - public static final EarlybirdProperty ZK_APP_ROOT = new EarlybirdProperty<>( - "zk_app_root", - "SZooKeeper base root path for this application", - PropertyType.STRING, - true); - - public static final EarlybirdProperty SEGMENT_LOAD_FROM_HDFS_ENABLED = - new EarlybirdProperty<>( - "segment_load_from_hdfs_enabled", - "Whether to load segment data from HDFS", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty SEGMENT_FLUSH_TO_HDFS_ENABLED = - new EarlybirdProperty<>( - "segment_flush_to_hdfs_enabled", - "Whether to flush segment data to HDFS", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty HDFS_SEGMENT_SYNC_DIR = new EarlybirdProperty<>( - "hdfs_segment_sync_dir", - "HDFS directory to sync segment data", - PropertyType.STRING, - false); - - public static final EarlybirdProperty HDFS_SEGMENT_UPLOAD_DIR = new EarlybirdProperty<>( - "hdfs_segment_upload_dir", - "HDFS directory to upload segment data", - PropertyType.STRING, - false); - - public static final EarlybirdProperty ARCHIVE_DAILY_STATUS_BATCH_FLUSHING_ENABLED = - new EarlybirdProperty<>( - "archive_daily_status_batch_flushing_enabled", - "Whether to enable archive daily status batch flushing", - PropertyType.BOOLEAN, - false); - - 
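Each `EarlybirdProperty` pairs a flag name with typed accessors into `EarlybirdConfig`, so one definition drives both the command-line flag and the config lookup. The following is a simplified generic sketch of that shape; the config map and the `BiFunction` are stand-ins for `EarlybirdConfig` and its typed getters, flag creation is omitted, and the sample values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Simplified sketch of the EarlybirdProperty<T> shape: a name plus a typed
// "get with default" function bound to some configuration source.
final class TypedProperty<T> {
  private final String name;
  private final BiFunction<String, T, T> getterWithDefault;

  TypedProperty(String name, BiFunction<String, T, T> getterWithDefault) {
    this.name = name;
    this.getterWithDefault = getterWithDefault;
  }

  String name() {
    return name;
  }

  T get(T defaultValue) {
    return getterWithDefault.apply(name, defaultValue);
  }

  public static void main(String[] args) {
    Map<String, Object> config = new HashMap<>();
    config.put("thrift_port", 9090);          // illustrative values
    config.put("penguin_version", "penguin_6");

    TypedProperty<Integer> thriftPort = new TypedProperty<>(
        "thrift_port", (key, def) -> (Integer) config.getOrDefault(key, def));
    TypedProperty<String> penguinVersion = new TypedProperty<>(
        "penguin_version", (key, def) -> (String) config.getOrDefault(key, def));

    System.out.println(thriftPort.get(3000));        // 9090
    System.out.println(penguinVersion.get("none"));  // penguin_6
  }
}
```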
public static final EarlybirdProperty HDFS_INDEX_SYNC_DIR = new EarlybirdProperty<>( - "hdfs_index_sync_dir", - "HDFS directory to sync index data", - PropertyType.STRING, - true); - - public static final EarlybirdProperty READ_INDEX_FROM_PROD_LOCATION = - new EarlybirdProperty<>( - "read_index_from_prod_location", - "Read index from prod to speed up startup on staging / loadtest", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty USE_DECIDER_OVERLAY = new EarlybirdProperty<>( - "use_decider_overlay", - "Whether to use decider overlay", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty DECIDER_OVERLAY_CONFIG = new EarlybirdProperty<>( - "decider_overlay_config", - "Path to decider overlay config", - PropertyType.STRING, - false); - - public static final EarlybirdProperty MAX_CONCURRENT_SEGMENT_INDEXERS = - new EarlybirdProperty<>( - "max_concurrent_segment_indexers", - "Maximum number of segments indexed concurrently", - PropertyType.INT, - false); - - public static final EarlybirdProperty TF_MODELS_ENABLED = - new EarlybirdProperty<>( - "tf_models_enabled", - "Whether tensorflow models should be loaded", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty TF_MODELS_CONFIG_PATH = - new EarlybirdProperty<>( - "tf_models_config_path", - "The configuration path of the yaml file containing the list of tensorflow models to load.", - PropertyType.STRING, - false); - - public static final EarlybirdProperty TF_INTER_OP_THREADS = - new EarlybirdProperty<>( - "tf_inter_op_threads", - "How many tensorflow inter op threads to use. See TF documentation for more information.", - PropertyType.INT, - false); - - public static final EarlybirdProperty TF_INTRA_OP_THREADS = - new EarlybirdProperty<>( - "tf_intra_op_threads", - "How many tensorflow intra op threads to use. See TF documentation for more information.", - PropertyType.INT, - false); - - public static final EarlybirdProperty MAX_ALLOWED_REPLICAS_NOT_IN_SERVER_SET = - new EarlybirdProperty<>( - "max_allowed_replicas_not_in_server_set", - "How many replicas are allowed to be missing from the Earlybird server set.", - PropertyType.INT, - false); - - public static final EarlybirdProperty CHECK_NUM_REPLICAS_IN_SERVER_SET = - new EarlybirdProperty<>( - "check_num_replicas_in_server_set", - "Whether CoordinatedEarlybirdActions should check the number of alive replicas", - PropertyType.BOOLEAN, - false); - - public static final EarlybirdProperty MAX_QUEUE_SIZE = - new EarlybirdProperty<>( - "max_queue_size", - "Maximum size of searcher worker executor queue. 
If <= 0 queue is unbounded.", - PropertyType.INT, - false); - - public static final EarlybirdProperty KAFKA_ENV = - new EarlybirdProperty<>( - "kafka_env", - "The environment to use for kafka topics.", - PropertyType.STRING, - false); - public static final EarlybirdProperty KAFKA_PATH = - new EarlybirdProperty<>( - "kafka_path", - "Wily path to the Search kafka cluster.", - PropertyType.STRING, - false); - public static final EarlybirdProperty TWEET_EVENTS_KAFKA_PATH = - new EarlybirdProperty<>( - "tweet_events_kafka_path", - "Wily path to the tweet-events kafka cluster.", - PropertyType.STRING, - false); - public static final EarlybirdProperty USER_UPDATES_KAFKA_TOPIC = - new EarlybirdProperty<>( - "user_updates_topic", - "Name of the Kafka topic that contain user updates.", - PropertyType.STRING, - false); - public static final EarlybirdProperty USER_SCRUB_GEO_KAFKA_TOPIC = - new EarlybirdProperty<>( - "user_scrub_geo_topic", - "Name of the Kafka topic that contain UserScrubGeoEvents.", - PropertyType.STRING, - false); - public static final EarlybirdProperty EARLYBIRD_SCRUB_GEN = - new EarlybirdProperty<>( - "earlybird_scrub_gen", - "SCRUB_GEN TO DEPLOY", - PropertyType.STRING, - false); - public static final EarlybirdProperty CONSUME_GEO_SCRUB_EVENTS = - new EarlybirdProperty<>( - "consume_geo_scrub_events", - "Whether to consume user scrub geo events or not", - PropertyType.BOOLEAN, - false); - - private static final List> ALL_PROPERTIES = - Arrays.stream(EarlybirdProperty.class.getDeclaredFields()) - .filter(field -> - (field.getModifiers() & Modifier.STATIC) > 0 - && field.getType() == EarlybirdProperty.class) - .map(field -> { - try { - return (EarlybirdProperty) field.get(EarlybirdProperty.class); - } catch (Exception e) { - throw new RuntimeException(e); - } - }) - .collect(Collectors.collectingAndThen(Collectors.toList(), ImmutableList::copyOf)); - - public static ServiceIdentifier getServiceIdentifier() { - return new ServiceIdentifier( - ROLE.get(), - EARLYBIRD_NAME.get(), - ENV.get(), - ZONE.get()); - } - - private final String name; - private final String help; - private final PropertyType type; - private final boolean requiredOnAurora; - private final boolean requiredOnDedicated; - - private EarlybirdProperty(String name, String help, PropertyType type, - boolean requiredOnAurora) { - this(name, help, type, requiredOnAurora, false); - } - - private EarlybirdProperty(String name, String help, PropertyType type, - boolean requiredOnAurora, boolean requiredOnDedicated) { - this.name = name; - this.help = help; - this.type = type; - this.requiredOnAurora = requiredOnAurora; - this.requiredOnDedicated = requiredOnDedicated; - } - - public String name() { - return name; - } - - public boolean isRequiredOnAurora() { - return requiredOnAurora; - } - - public boolean isRequiredOnDedicated() { - return requiredOnDedicated; - } - - public Flag createFlag(Flags flags) { - return flags.createMandatory(name, help, null, type.flaggable); - } - - public T get() { - return type.getter.apply(name); - } - - public T get(T devaultValue) { - return type.getterWithDefault.apply(name, devaultValue); - } - - public static EarlybirdProperty[] values() { - return ALL_PROPERTIES.toArray(new EarlybirdProperty[0]); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/BUILD b/src/java/com/twitter/search/earlybird/common/userupdates/BUILD deleted file mode 100644 index 27a3c8c8f..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/BUILD +++ /dev/null @@ -1,45 
+0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/geo/google:geoGoogle", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-server", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-twitter-science-provider", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-core", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "3rdparty/src/jvm/com/twitter/scalding:core", - "3rdparty/src/jvm/com/twitter/scalding:date", - "3rdparty/src/jvm/com/twitter/scalding:parquet", - "decider/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/hadoop", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/hash", - "src/java/com/twitter/search/common/util/io", - "src/java/com/twitter/search/common/util/io:dl-reader-writer", - "src/java/com/twitter/search/common/util/io:flushable", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/earlybird/common/config", - "src/scala/com/twitter/scalding_internal/error_handling", - "src/scala/com/twitter/scalding_internal/multiformat", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/search/user_table/sources", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/tweetypie:events-java", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/UserScrubGeoMap.java b/src/java/com/twitter/search/earlybird/common/userupdates/UserScrubGeoMap.java deleted file mode 100644 index c0c6c3be7..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/UserScrubGeoMap.java +++ /dev/null @@ -1,100 +0,0 @@ -package com.twitter.search.earlybird.common.userupdates; - -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.tweetypie.thriftjava.UserScrubGeoEvent; - -/** - * Map of users who have actioned to delete location data from their tweets. UserID's are mapped - * to the maxTweetId that will eventually be scrubbed from the index (userId -> maxTweetId). - * - * ConcurrentHashMap is thread safe without synchronizing the whole map. Reads can happen very fast - * while writes are done with a lock. This is ideal since many Earlybird Searcher threads could - * be reading from the map at once, whereas we will only be adding to the map via kafka. - * - * This map is checked against to filter out tweets that should not be returned to geo queries. 
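- * For example, an entry mapping a userId to maxTweetId 1500 means any tweet from that user with
- * an ID of 1500 or lower is treated as geo-scrubbed and filtered out of geo query results.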
- * See: go/realtime-geo-filtering - */ -public class UserScrubGeoMap { - // The number of geo events that contain a user ID already present in the map. This count is used - // to verify the number of users in the map against the number of events consumed from kafka. - private static final SearchCounter USER_SCRUB_GEO_EVENT_EXISTING_USER_COUNT = - SearchCounter.export("user_scrub_geo_event_existing_user_count"); - public static final SearchTimerStats USER_SCRUB_GEO_EVENT_LAG_STAT = - SearchTimerStats.export("user_scrub_geo_event_lag", - TimeUnit.MILLISECONDS, - false, - true); - private ConcurrentHashMap map; - - public UserScrubGeoMap() { - map = new ConcurrentHashMap<>(); - SearchCustomGauge.export("num_users_in_geo_map", this::getNumUsersInMap); - } - - /** - * Ensure that the max_tweet_id in the userScrubGeoEvent is greater than the one already stored - * in the map for the given user id (if any) before updating the entry for this user. - * This will protect Earlybirds from potential issues where out of date UserScrubGeoEvents - * appear in the incoming Kafka stream. - * - * @param userScrubGeoEvent - */ - public void indexUserScrubGeoEvent(UserScrubGeoEvent userScrubGeoEvent) { - long userId = userScrubGeoEvent.getUser_id(); - long newMaxTweetId = userScrubGeoEvent.getMax_tweet_id(); - long oldMaxTweetId = map.getOrDefault(userId, 0L); - if (map.containsKey(userId)) { - USER_SCRUB_GEO_EVENT_EXISTING_USER_COUNT.increment(); - } - map.put(userId, Math.max(oldMaxTweetId, newMaxTweetId)); - USER_SCRUB_GEO_EVENT_LAG_STAT.timerIncrement(computeEventLag(newMaxTweetId)); - } - - /** - * A tweet is geo scrubbed if it is older than the max tweet id that is scrubbed for the tweet's - * author. - * If there is no entry for the tweet's author in the map, then the tweet is not geo scrubbed. - * - * @param tweetId - * @param fromUserId - * @return - */ - public boolean isTweetGeoScrubbed(long tweetId, long fromUserId) { - return tweetId <= map.getOrDefault(fromUserId, 0L); - } - - /** - * The lag (in milliseconds) from when a UserScrubGeoEvent is created, until it is applied to the - * UserScrubGeoMap. Take the maxTweetId found in the current event and convert it to a timestamp. - * The maxTweetId will give us a timestamp closest to when Tweetypie processes macaw-geo requests. 
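- * Because tweet IDs are snowflake IDs, the timestamp embedded in maxTweetId approximates the
- * event creation time, and the lag is the current time minus that timestamp.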
- * - * @param maxTweetId - * @return - */ - private long computeEventLag(long maxTweetId) { - long eventCreatedAtTime = SnowflakeIdParser.getTimestampFromTweetId(maxTweetId); - return System.currentTimeMillis() - eventCreatedAtTime; - } - - public long getNumUsersInMap() { - return map.size(); - } - - public ConcurrentHashMap getMap() { - return map; - } - - public boolean isEmpty() { - return map.isEmpty(); - } - - public boolean isSet(long userId) { - return map.containsKey(userId); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/UserTable.java b/src/java/com/twitter/search/earlybird/common/userupdates/UserTable.java deleted file mode 100644 index 3df08a5df..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/UserTable.java +++ /dev/null @@ -1,572 +0,0 @@ -package com.twitter.search.earlybird.common.userupdates; - -import java.util.Iterator; -import java.util.concurrent.atomic.AtomicReference; -import java.util.function.Predicate; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.util.hash.GeneralLongHashFunction; - -/** - * Table containing metadata about users, like NSFW or Antisocial status. - * Used for result filtering. - */ -public class UserTable { - private static final Logger LOG = LoggerFactory.getLogger(UserTable.class); - - @VisibleForTesting // Not final for testing. - protected static long userUpdateTableMaxCapacity = 1L << 30; - - private static final int DEFAULT_INITIAL_CAPACITY = 1024; - private static final int BYTE_WIDTH = 8; - - private static final String USER_TABLE_CAPACITY = "user_table_capacity"; - private static final String USER_TABLE_SIZE = "user_table_size"; - private static final String - USER_NUM_USERS_WITH_NO_BITS_SET = "user_table_users_with_no_bits_set"; - private static final String USER_TABLE_ANTISOCIAL_USERS = "user_table_antisocial_users"; - private static final String USER_TABLE_OFFENSIVE_USERS = "user_table_offensive_users"; - private static final String USER_TABLE_NSFW_USERS = "user_table_nsfw_users"; - private static final String USER_TABLE_IS_PROTECTED_USERS = "user_table_is_protected_users"; - - /** - * number of users filtered - */ - private static final SearchRateCounter USER_TABLE_USERS_FILTERED_COUNTER = - new SearchRateCounter("user_table_users_filtered"); - - private SearchLongGauge userTableCapacity; - private SearchLongGauge userTableSize; - private SearchLongGauge userTableNumUsersWithNoBitsSet; - private SearchLongGauge userTableAntisocialUsers; - private SearchLongGauge userTableOffensiveUsers; - private SearchLongGauge userTableNsfwUsers; - private SearchLongGauge userTableIsProtectedUsers; - - private final Predicate userIdFilter; - private long lastRecordTimestamp; - - private static final class HashTable { - private int numUsersInTable; - private int numUsersWithNoBitsSet; - // size 8 array contains the number of users who have the bit set at the index (0-7) position - // e.g. 
setBitCounts[0] stores the number of users who have the 0 bit set in their bytes - private long[] setBitCounts; - - private final long[] hash; - private final byte[] bits; - - private final int hashMask; - - HashTable(int size) { - this.hash = new long[size]; - this.bits = new byte[size]; - this.hashMask = size - 1; - this.numUsersInTable = 0; - this.setBitCounts = new long[BYTE_WIDTH]; - } - - protected int hashSize() { - return hash.length; - } - - // If we want to decrease the number of users in the table, we can delete as many users - // as this table returns, by calling filterTableAndCountValidItems. - public void setCountOfNumUsersWithNoBitsSet() { - int count = 0; - for (int i = 0; i < hash.length; i++) { - if ((hash[i] > 0) && (bits[i] == 0)) { - count++; - } - } - - numUsersWithNoBitsSet = count; - } - - public void setSetBitCounts() { - long[] counts = new long[BYTE_WIDTH]; - for (int i = 0; i < hash.length; i++) { - if (hash[i] > 0) { - int tempBits = bits[i] & 0xff; - int curBitPos = 0; - while (tempBits != 0) { - if ((tempBits & 1) != 0) { - counts[curBitPos]++; - } - tempBits = tempBits >>> 1; - curBitPos++; - } - } - } - setBitCounts = counts; - } - } - - public static final int ANTISOCIAL_BIT = 1; - public static final int OFFENSIVE_BIT = 1 << 1; - public static final int NSFW_BIT = 1 << 2; - public static final int IS_PROTECTED_BIT = 1 << 3; - - public long getLastRecordTimestamp() { - return this.lastRecordTimestamp; - } - - public void setLastRecordTimestamp(long lastRecordTimestamp) { - this.lastRecordTimestamp = lastRecordTimestamp; - } - - public void setOffensive(long userID, boolean offensive) { - set(userID, OFFENSIVE_BIT, offensive); - } - - public void setAntisocial(long userID, boolean antisocial) { - set(userID, ANTISOCIAL_BIT, antisocial); - } - - public void setNSFW(long userID, boolean nsfw) { - set(userID, NSFW_BIT, nsfw); - } - - public void setIsProtected(long userID, boolean isProtected) { - set(userID, IS_PROTECTED_BIT, isProtected); - } - - /** - * Adds the given user update to this table. - */ - public boolean indexUserUpdate(UserUpdatesChecker checker, UserUpdate userUpdate) { - if (checker.skipUserUpdate(userUpdate)) { - return false; - } - - switch (userUpdate.updateType) { - case ANTISOCIAL: - setAntisocial(userUpdate.twitterUserID, userUpdate.updateValue != 0); - break; - case NSFW: - setNSFW(userUpdate.twitterUserID, userUpdate.updateValue != 0); - break; - case OFFENSIVE: - setOffensive(userUpdate.twitterUserID, userUpdate.updateValue != 0); - break; - case PROTECTED: - setIsProtected(userUpdate.twitterUserID, userUpdate.updateValue != 0); - break; - default: - return false; - } - - return true; - } - - private final AtomicReference hashTable = new AtomicReference<>(); - - private int hashCode(long userID) { - return (int) GeneralLongHashFunction.hash(userID); - } - - /** - * Returns an iterator for user IDs that have at least one of the bits set. 
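- * The iterator walks the current hash table arrays directly, skipping empty slots and users
- * whose bits are all zero.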
- */ - public Iterator getFlaggedUserIdIterator() { - HashTable table = hashTable.get(); - - final long[] currUserIdTable = table.hash; - final byte[] currBitsTable = table.bits; - return new Iterator() { - private int index = findNext(0); - - private int findNext(int index) { - int startingIndex = index; - while (startingIndex < currUserIdTable.length) { - if (currUserIdTable[startingIndex] != 0 && currBitsTable[startingIndex] != 0) { - break; - } - ++startingIndex; - } - return startingIndex; - } - - @Override - public boolean hasNext() { - return index < currUserIdTable.length; - } - - @Override - public Long next() { - Long r = currUserIdTable[index]; - index = findNext(index + 1); - return r; - } - - @Override - public void remove() { - throw new UnsupportedOperationException(); - } - }; - } - - /** - * Constructs an UserUpdatesTable with an given HashTable instance. - * Use useIdFilter as a Predicate that returns true for the elements - * needed to be kept in the table. - * Use shouldRehash to force a rehasing on the given HashTable. - */ - private UserTable(HashTable hashTable, Predicate userIdFilter, - boolean shouldRehash) { - - Preconditions.checkNotNull(userIdFilter); - - this.hashTable.set(hashTable); - this.userIdFilter = userIdFilter; - - exportUserUpdatesTableStats(); - - LOG.info("User table num users: {}. Users with no bits set: {}. " - + "Antisocial users: {}. Offensive users: {}. Nsfw users: {}. IsProtected users: {}.", - this.getNumUsersInTable(), - this.getNumUsersWithNoBitsSet(), - this.getSetBitCount(ANTISOCIAL_BIT), - this.getSetBitCount(OFFENSIVE_BIT), - this.getSetBitCount(NSFW_BIT), - this.getSetBitCount(IS_PROTECTED_BIT)); - - if (shouldRehash) { - int filteredTableSize = filterTableAndCountValidItems(); - // Having exactly 100% usage can impact lookup. Maintain the table at under 50% usage. - int newTableCapacity = computeDesiredHashTableCapacity(filteredTableSize * 2); - - rehash(newTableCapacity); - - LOG.info("User table num users after rehash: {}. Users with no bits set: {}. " - + "Antisocial users: {}. Offensive users: {}. Nsfw users: {}. 
IsProtected users: {}.", - this.getNumUsersInTable(), - this.getNumUsersWithNoBitsSet(), - this.getSetBitCount(ANTISOCIAL_BIT), - this.getSetBitCount(OFFENSIVE_BIT), - this.getSetBitCount(NSFW_BIT), - this.getSetBitCount(IS_PROTECTED_BIT)); - } - } - - private UserTable(int initialSize, Predicate userIdFilter) { - this(new HashTable(computeDesiredHashTableCapacity(initialSize)), userIdFilter, false); - } - - @VisibleForTesting - public UserTable(int initialSize) { - this(initialSize, userId -> true); - } - - public static UserTable - newTableWithDefaultCapacityAndPredicate(Predicate userIdFilter) { - - return new UserTable(DEFAULT_INITIAL_CAPACITY, userIdFilter); - } - - public static UserTable newTableNonFilteredWithDefaultCapacity() { - return newTableWithDefaultCapacityAndPredicate(userId -> true); - } - - private void exportUserUpdatesTableStats() { - userTableSize = SearchLongGauge.export(USER_TABLE_SIZE); - userTableCapacity = SearchLongGauge.export(USER_TABLE_CAPACITY); - userTableNumUsersWithNoBitsSet = SearchLongGauge.export( - USER_NUM_USERS_WITH_NO_BITS_SET - ); - userTableAntisocialUsers = SearchLongGauge.export(USER_TABLE_ANTISOCIAL_USERS); - userTableOffensiveUsers = SearchLongGauge.export(USER_TABLE_OFFENSIVE_USERS); - userTableNsfwUsers = SearchLongGauge.export(USER_TABLE_NSFW_USERS); - userTableIsProtectedUsers = SearchLongGauge.export(USER_TABLE_IS_PROTECTED_USERS); - - LOG.info( - "Exporting stats for user table. Starting with numUsersInTable={}, usersWithZeroBits={}, " - + "antisocialUsers={}, offensiveUsers={}, nsfwUsers={}, isProtectedUsers={}.", - getNumUsersInTable(), - getNumUsersWithNoBitsSet(), - getSetBitCount(ANTISOCIAL_BIT), - getSetBitCount(OFFENSIVE_BIT), - getSetBitCount(NSFW_BIT), - getSetBitCount(IS_PROTECTED_BIT)); - updateStats(); - } - - private void updateStats() { - HashTable table = this.hashTable.get(); - userTableSize.set(table.numUsersInTable); - userTableNumUsersWithNoBitsSet.set(table.numUsersWithNoBitsSet); - userTableCapacity.set(table.hashSize()); - userTableAntisocialUsers.set(getSetBitCount(ANTISOCIAL_BIT)); - userTableOffensiveUsers.set(getSetBitCount(OFFENSIVE_BIT)); - userTableNsfwUsers.set(getSetBitCount(NSFW_BIT)); - userTableIsProtectedUsers.set(getSetBitCount(IS_PROTECTED_BIT)); - } - - /** - * Computes the size of the hashtable as the first power of two greater than or equal to initialSize - */ - private static int computeDesiredHashTableCapacity(int initialSize) { - long powerOfTwoSize = 2; - while (initialSize > powerOfTwoSize) { - powerOfTwoSize *= 2; - } - if (powerOfTwoSize > Integer.MAX_VALUE) { - LOG.error("Error: powerOfTwoSize overflowed Integer.MAX_VALUE! Initial size: " + initialSize); - powerOfTwoSize = 1 << 30; // max power of 2 - } - - return (int) powerOfTwoSize; - } - - public int getNumUsersInTable() { - return hashTable.get().numUsersInTable; - } - - /** - * Get the number of users who have the bit set at the `userStateBit` position - */ - public long getSetBitCount(int userStateBit) { - int bit = userStateBit; - int bitPosition = 0; - while (bit != 0 && (bit & 1) == 0) { - bit = bit >>> 1; - bitPosition++; - } - return hashTable.get().setBitCounts[bitPosition]; - } - - public Predicate getUserIdFilter() { - return userIdFilter::test; - } - - /** - * Updates a user flag in this table. 
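- * When value is true the bit is OR-ed in; when false it is cleared. Users rejected by the
- * userIdFilter are ignored, a user whose bits would all be zero is never newly added, and the
- * table is grown once it reaches roughly 50% occupancy, up to userUpdateTableMaxCapacity.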
- */ - public final void set(long userID, int bit, boolean value) { - // if userID is filtered return immediately - if (!shouldKeepUser(userID)) { - USER_TABLE_USERS_FILTERED_COUNTER.increment(); - return; - } - - HashTable table = this.hashTable.get(); - - int hashPos = findHashPosition(table, userID); - long item = table.hash[hashPos]; - byte bits = 0; - int bitsDiff = 0; - - if (item != 0) { - byte bitsOriginally = bits = table.bits[hashPos]; - if (value) { - bits |= bit; - } else { - // AND'ing with the inverse map clears the desired bit, but - // doesn't change any of the other bits - bits &= ~bit; - } - - // Find the changed bits after the above operation, it is possible that no bit is changed if - // the input 'bit' is already set/unset in the table. - // Since bitwise operators cannot be directly applied on Byte, Byte is promoted into int to - // apply the operators. When that happens, if the most significant bit of the Byte is set, - // the promoted int has all significant bits set to 1. 0xff bitmask is applied here to make - // sure only the last 8 bits are considered. - bitsDiff = (bitsOriginally & 0xff) ^ (bits & 0xff); - - if (bitsOriginally > 0 && bits == 0) { - table.numUsersWithNoBitsSet++; - } else if (bitsOriginally == 0 && bits > 0) { - table.numUsersWithNoBitsSet--; - } - } else { - if (!value) { - // no need to add this user, since all bits would be false anyway - return; - } - - // New user string. - if (table.numUsersInTable + 1 >= (table.hashSize() >> 1) - && table.hashSize() != userUpdateTableMaxCapacity) { - if (2L * (long) table.hashSize() < userUpdateTableMaxCapacity) { - rehash(2 * table.hashSize()); - table = this.hashTable.get(); - } else { - if (table.hashSize() < (int) userUpdateTableMaxCapacity) { - rehash((int) userUpdateTableMaxCapacity); - table = this.hashTable.get(); - LOG.warn("User update table size reached Integer.MAX_VALUE, performance will degrade."); - } - } - - // Must repeat this operation with the resized hashTable. - hashPos = findHashPosition(table, userID); - } - - item = userID; - bits |= bit; - bitsDiff = bit & 0xff; - - table.numUsersInTable++; - } - - table.hash[hashPos] = item; - table.bits[hashPos] = bits; - - // update setBitCounts for the changed bits after applying the input 'bit' - int curBitsDiffPos = 0; - while (bitsDiff != 0) { - if ((bitsDiff & 1) != 0) { - if (value) { - table.setBitCounts[curBitsDiffPos]++; - } else { - table.setBitCounts[curBitsDiffPos]--; - } - } - bitsDiff = bitsDiff >>> 1; - curBitsDiffPos++; - } - - updateStats(); - } - - public final boolean isSet(long userID, int bits) { - HashTable table = hashTable.get(); - int hashPos = findHashPosition(table, userID); - return table.hash[hashPos] != 0 && (table.bits[hashPos] & bits) != 0; - } - - /** - * Returns true when userIdFilter condition is being met. - * If filter is not present returns true - */ - private boolean shouldKeepUser(long userID) { - return userIdFilter.test(userID); - } - - private int findHashPosition(final HashTable table, final long userID) { - int code = hashCode(userID); - int hashPos = code & table.hashMask; - - // Locate user in hash - long item = table.hash[hashPos]; - - if (item != 0 && item != userID) { - // Conflict: keep searching different locations in - // the hash table. 
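- // The probe step below is derived from the hash code and forced to be odd, so with a
- // power-of-two table size every slot is eventually visited.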
- final int inc = ((code >> 8) + code) | 1; - do { - code += inc; - hashPos = code & table.hashMask; - item = table.hash[hashPos]; - } while (item != 0 && item != userID); - } - - return hashPos; - } - - /** - * Applies the filtering predicate and returns the size of the filtered table. - */ - private synchronized int filterTableAndCountValidItems() { - final HashTable oldTable = this.hashTable.get(); - int newSize = 0; - - int clearNoItemSet = 0; - int clearNoBitsSet = 0; - int clearDontKeepUser = 0; - - for (int i = 0; i < oldTable.hashSize(); i++) { - final long item = oldTable.hash[i]; // this is the userID - final byte bits = oldTable.bits[i]; - - boolean clearSlot = false; - if (item == 0) { - clearSlot = true; - clearNoItemSet++; - } else if (bits == 0) { - clearSlot = true; - clearNoBitsSet++; - } else if (!shouldKeepUser(item)) { - clearSlot = true; - clearDontKeepUser++; - } - - if (clearSlot) { - oldTable.hash[i] = 0; - oldTable.bits[i] = 0; - } else { - newSize += 1; - } - } - - oldTable.setCountOfNumUsersWithNoBitsSet(); - oldTable.setSetBitCounts(); - - LOG.info("Done filtering table: clearNoItemSet={}, clearNoBitsSet={}, clearDontKeepUser={}", - clearNoItemSet, clearNoBitsSet, clearDontKeepUser); - - return newSize; - } - - /** - * Called when hash is too small (> 50% occupied) - */ - private void rehash(final int newSize) { - final HashTable oldTable = this.hashTable.get(); - final HashTable newTable = new HashTable(newSize); - - final int newMask = newTable.hashMask; - final long[] newHash = newTable.hash; - final byte[] newBits = newTable.bits; - - for (int i = 0; i < oldTable.hashSize(); i++) { - final long item = oldTable.hash[i]; - final byte bits = oldTable.bits[i]; - if (item != 0 && bits != 0) { - int code = hashCode(item); - - int hashPos = code & newMask; - assert hashPos >= 0; - if (newHash[hashPos] != 0) { - final int inc = ((code >> 8) + code) | 1; - do { - code += inc; - hashPos = code & newMask; - } while (newHash[hashPos] != 0); - } - newHash[hashPos] = item; - newBits[hashPos] = bits; - newTable.numUsersInTable++; - } - } - - newTable.setCountOfNumUsersWithNoBitsSet(); - newTable.setSetBitCounts(); - this.hashTable.set(newTable); - - updateStats(); - } - - public void setTable(UserTable newTable) { - hashTable.set(newTable.hashTable.get()); - updateStats(); - } - - @VisibleForTesting - protected int getHashTableCapacity() { - return hashTable.get().hashSize(); - } - - @VisibleForTesting - protected int getNumUsersWithNoBitsSet() { - return hashTable.get().numUsersWithNoBitsSet; - } -} diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/UserTableBuilderFromSnapshot.java b/src/java/com/twitter/search/earlybird/common/userupdates/UserTableBuilderFromSnapshot.java deleted file mode 100644 index 76b14de5a..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/UserTableBuilderFromSnapshot.java +++ /dev/null @@ -1,263 +0,0 @@ -package com.twitter.search.earlybird.common.userupdates; - -import java.io.BufferedReader; -import java.io.IOException; -import java.io.InputStreamReader; -import java.util.Arrays; -import java.util.Iterator; -import java.util.List; -import java.util.NoSuchElementException; -import java.util.Optional; -import java.util.Spliterator; -import java.util.Spliterators; -import java.util.concurrent.TimeUnit; -import java.util.function.Predicate; -import java.util.stream.Collectors; -import java.util.stream.Stream; -import java.util.stream.StreamSupport; -import javax.annotation.Nullable; - -import 
org.apache.hadoop.conf.Configuration; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.hadoop.hdfs.HdfsConfiguration; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.hadoop.HdfsUtils; -import com.twitter.scalding.DateRange; -import com.twitter.scalding.Hours; -import com.twitter.scalding.RichDate; -import com.twitter.search.user_table.sources.MostRecentGoodSafetyUserStateSource; -import com.twitter.search.common.indexing.thriftjava.SafetyUserState; -import com.twitter.search.common.util.io.LzoThriftBlockFileReader; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.util.Duration; -import com.twitter.util.Time; - -/** - * Builds a user table from a user safety snapshot on HDFS. - */ -public class UserTableBuilderFromSnapshot { - private static final Logger LOG = LoggerFactory.getLogger(UserTableBuilderFromSnapshot.class); - - private static final int MAX_DAYS_TO_CHECK = 7; - public static final String DATA_DIR = "user_states"; - public static final String METADATA_DIR = "last_updated_ms"; - - private final String snapshotBaseDir; - - private String snapshotDataPath; - private String snapshotMetaDataPath; - private UserTable userTable; - - private long nsfwCount; - private long antisocialCount; - private long isProtectedCount; - - public UserTableBuilderFromSnapshot() { - snapshotBaseDir = - EarlybirdConfig.getString(EarlybirdConfig.USER_SNAPSHOT_BASE_DIR, null); - - LOG.info("Configured user snapshot directory: " + snapshotBaseDir); - } - - private static final class UserUpdate { - public final long userId; - @Nullable public final Boolean antisocial; - @Nullable public final Boolean nsfw; - @Nullable public final Boolean isProtected; - - private UserUpdate(long userId, - @Nullable Boolean antisocial, - @Nullable Boolean nsfw, - @Nullable Boolean isProtected) { - this.userId = userId; - this.antisocial = antisocial; - this.nsfw = nsfw; - this.isProtected = isProtected; - } - - public static UserUpdate fromUserState(SafetyUserState safetyUserState) { - long userId = safetyUserState.getUserID(); - @Nullable Boolean antisocial = null; - @Nullable Boolean nsfw = null; - @Nullable Boolean isProtected = null; - - if (safetyUserState.isIsAntisocial()) { - antisocial = true; - } - if (safetyUserState.isIsNsfw()) { - nsfw = true; - } - if (safetyUserState.isSetIsProtected() && safetyUserState.isIsProtected()) { - isProtected = true; - } - - return new UserUpdate(userId, antisocial, nsfw, isProtected); - } - } - - /** - * Builds a user table from an HDFS user snapshot. - * @return The table, or nothing if something went wrong. - */ - public Optional build(Predicate userFilter) { - userTable = UserTable.newTableWithDefaultCapacityAndPredicate(userFilter); - nsfwCount = 0; - antisocialCount = 0; - isProtectedCount = 0; - - if (snapshotBaseDir == null || snapshotBaseDir.isEmpty()) { - LOG.info("No snapshot directory. 
Can't build user table."); - return Optional.empty(); - } - - LOG.info("Starting to build user table."); - - Stream stream = null; - - try { - setSnapshotPath(); - - stream = getUserUpdates(); - stream.forEach(this::insertUser); - } catch (IOException e) { - LOG.error("IOException while building table: {}", e.getMessage(), e); - - return Optional.empty(); - } finally { - if (stream != null) { - stream.close(); - } - } - - LOG.info("Built user table with {} users, {} nsfw, {} antisocial and {} protected.", - userTable.getNumUsersInTable(), - nsfwCount, - antisocialCount, - isProtectedCount); - - try { - userTable.setLastRecordTimestamp(readTimestampOfLastSeenUpdateFromSnapshot()); - } catch (IOException e) { - LOG.error("IOException reading timestamp of last update: {}", e.getMessage(), e); - return Optional.empty(); - } - - LOG.info("Setting last record timestamp to {}.", userTable.getLastRecordTimestamp()); - - return Optional.of(userTable); - } - - private void setSnapshotPath() { - snapshotDataPath = - new MostRecentGoodSafetyUserStateSource( - snapshotBaseDir, - DATA_DIR, - METADATA_DIR, - DateRange.apply( - RichDate.now().$minus(Hours.apply(MAX_DAYS_TO_CHECK * 24)), - RichDate.now()) - ).partitionHdfsPaths(new HdfsConfiguration()) - ._1() - .head() - .replaceAll("\\*$", ""); - snapshotMetaDataPath = snapshotDataPath.replace(DATA_DIR, METADATA_DIR); - - LOG.info("Snapshot data path: {}", snapshotDataPath); - LOG.info("Snapshot metadata path: {}", snapshotMetaDataPath); - } - - private Stream getUserUpdates() throws IOException { - FileSystem fs = FileSystem.get(new Configuration()); - List lzoFiles = - Arrays.stream(fs.listStatus(new Path(snapshotDataPath), - path -> path.getName().startsWith("part-"))) - .map(fileStatus -> Path.getPathWithoutSchemeAndAuthority(fileStatus.getPath()) - .toString()) - .collect(Collectors.toList()); - - final LzoThriftBlockFileReader thriftReader = - new LzoThriftBlockFileReader<>(lzoFiles, SafetyUserState.class, null); - - Iterator iter = new Iterator() { - private SafetyUserState next; - - @Override - public boolean hasNext() { - if (next != null) { - return true; - } - - do { - try { - next = thriftReader.readNext(); - } catch (IOException e) { - throw new RuntimeException(e); - } - } while (next == null && !thriftReader.isExhausted()); - return next != null; - } - - @Override - public UserUpdate next() { - if (next != null || hasNext()) { - UserUpdate userUpdate = UserUpdate.fromUserState(next); - next = null; - return userUpdate; - } - throw new NoSuchElementException(); - } - }; - - return StreamSupport - .stream( - Spliterators.spliteratorUnknownSize(iter, Spliterator.ORDERED | Spliterator.NONNULL), - false) - .onClose(thriftReader::stop); - } - - private long readTimestampOfLastSeenUpdateFromSnapshot() throws IOException { - String timestampFile = snapshotMetaDataPath + "part-00000"; - BufferedReader buffer = new BufferedReader(new InputStreamReader( - HdfsUtils.getInputStreamSupplier(timestampFile).openStream())); - - long timestampMillis = Long.parseLong(buffer.readLine()); - LOG.info("read timestamp {} from HDFS:{}", timestampMillis, timestampFile); - - Time time = Time.fromMilliseconds(timestampMillis) - .minus(Duration.fromTimeUnit(10, TimeUnit.MINUTES)); - return time.inMilliseconds(); - } - - private void insertUser(UserUpdate userUpdate) { - if (userUpdate == null) { - return; - } - - if (userUpdate.antisocial != null) { - userTable.set( - userUpdate.userId, - UserTable.ANTISOCIAL_BIT, - userUpdate.antisocial); - antisocialCount++; - } - - 
if (userUpdate.nsfw != null) { - userTable.set( - userUpdate.userId, - UserTable.NSFW_BIT, - userUpdate.nsfw); - nsfwCount++; - } - - if (userUpdate.isProtected != null) { - userTable.set( - userUpdate.userId, - UserTable.IS_PROTECTED_BIT, - userUpdate.isProtected); - isProtectedCount++; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdate.java b/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdate.java deleted file mode 100644 index 6cfb0814c..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdate.java +++ /dev/null @@ -1,38 +0,0 @@ -package com.twitter.search.earlybird.common.userupdates; - -import java.util.Date; - -import com.twitter.search.common.indexing.thriftjava.UserUpdateType; - -/** - * Contains an update for a user. - */ -public class UserUpdate { - public final long twitterUserID; - public final UserUpdateType updateType; - public final int updateValue; - private final Date updatedAt; - - public UserUpdate(long twitterUserID, - UserUpdateType updateType, - int updateValue, - Date updatedAt) { - - this.twitterUserID = twitterUserID; - this.updateType = updateType; - this.updateValue = updateValue; - this.updatedAt = (Date) updatedAt.clone(); - } - - @Override public String toString() { - return "UserInfoUpdate[userID=" + twitterUserID + ",updateType=" + updateType - + ",updateValue=" + updateValue + ",updatedAt=" + getUpdatedAt() + "]"; - } - - /** - * Returns a copy of the updated-at date. - */ - public Date getUpdatedAt() { - return (Date) updatedAt.clone(); - } -} diff --git a/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdatesChecker.java b/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdatesChecker.java deleted file mode 100644 index b12558fe1..000000000 --- a/src/java/com/twitter/search/earlybird/common/userupdates/UserUpdatesChecker.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.earlybird.common.userupdates; - -import java.util.Date; -import java.util.concurrent.TimeUnit; - -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.indexing.thriftjava.UserUpdateType; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; - -/** - * Contains logic for deciding whether to apply a certain user update to the {@link UserTable}. - */ -public class UserUpdatesChecker { - private final Date antisocialStartDate; - private final Decider decider; - private final boolean isFullArchiveCluster; - - public UserUpdatesChecker(Clock clock, Decider decider, EarlybirdCluster cluster) { - // How many days of antisocial users to keep. A value of -1 means keeping all user updates. - long antisocialRecordDays = - EarlybirdConfig.getLong("keep_recent_antisocial_user_updates_days", 30); - this.antisocialStartDate = antisocialRecordDays > 0 - ? new Date(clock.nowMillis() - TimeUnit.DAYS.toMillis(antisocialRecordDays)) : null; - this.decider = decider; - this.isFullArchiveCluster = cluster == EarlybirdCluster.FULL_ARCHIVE; - } - - /** - * Decides whether to skip the given UserInfoUpdate. 
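- * Null updates are always skipped, PROTECTED updates are skipped on non-archive clusters, and
- * ANTISOCIAL updates older than the configured retention window are skipped.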
- */ - public boolean skipUserUpdate(UserUpdate userUpdate) { - if (userUpdate == null) { // always skip null updates - return true; - } - - UserUpdateType type = userUpdate.updateType; - - if (type == UserUpdateType.PROTECTED && skipProtectedUserUpdate()) { - return true; - } - - if (type == UserUpdateType.ANTISOCIAL && skipAntisocialUserUpdate(userUpdate)) { - return true; - } - - // NSFW users can continue to tweet even after they are marked as NSFW. That means - // that the snapshot needs to have all NSFW users from the beginning of time. Hence, no NSFW - // users updates check here. - - // pass all checks, do not skip this user update - return false; - } - - // Antisocial/suspended users can't tweet after they are suspended. Thus if our index stores - // tweets from the last 10 days, and they were suspended 60 days ago, we don't need them since - // there will be no tweets from them. We can save space by not storing info about those users. - - // (For archive, at rebuild time we filter out all suspended users tweets, so for a user that - // was suspended before a rebuild, no need to use space to store that the user is suspended) - private boolean skipAntisocialUserUpdate(UserUpdate userUpdate) { - return antisocialStartDate != null && userUpdate.getUpdatedAt().before(antisocialStartDate); - } - - // skip protected user updates for realtime and protected clusters - private boolean skipProtectedUserUpdate() { - return !isFullArchiveCluster; - } -} diff --git a/src/java/com/twitter/search/earlybird/config/BUILD b/src/java/com/twitter/search/earlybird/config/BUILD deleted file mode 100644 index 3bfb3ee1f..000000000 --- a/src/java/com/twitter/search/earlybird/config/BUILD +++ /dev/null @@ -1,21 +0,0 @@ -java_library( - sources = ["**/*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/apache/thrift:libthrift", - "3rdparty/jvm/org/apache/zookeeper:zookeeper-client", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/util/date", - "src/java/com/twitter/search/common/util/zookeeper", - "src/java/com/twitter/search/earlybird/common/config", - ], -) diff --git a/src/java/com/twitter/search/earlybird/config/ServingRange.java b/src/java/com/twitter/search/earlybird/config/ServingRange.java deleted file mode 100644 index 076f3bd80..000000000 --- a/src/java/com/twitter/search/earlybird/config/ServingRange.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.earlybird.config; - -/** - * An interface for abstracting a tier's serving range. - */ -public interface ServingRange { - /** - * Returns the serving range's lowest tweet ID. - */ - long getServingRangeSinceId(); - - /** - * Returns the serving range's highest tweet ID. - */ - long getServingRangeMaxId(); - - /** - * Returns the serving range's earliest time, in seconds since epoch. - */ - long getServingRangeSinceTimeSecondsFromEpoch(); - - /** - * Returns the serving range's latest time, in seconds since epoch. 
- */ - long getServingRangeUntilTimeSecondsFromEpoch(); -} diff --git a/src/java/com/twitter/search/earlybird/config/TierConfig.java b/src/java/com/twitter/search/earlybird/config/TierConfig.java deleted file mode 100644 index 4ef7d339c..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierConfig.java +++ /dev/null @@ -1,175 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.Date; -import java.util.Map; -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.config.ConfigFile; -import com.twitter.search.common.config.ConfigurationException; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.util.date.DateUtil; - -/** - * This class provides APIs to access the tier configurations for a cluster. - * Each tier has tier name, number of partitions, tier start time and end time. - */ -public final class TierConfig { - private static final org.slf4j.Logger LOG = org.slf4j.LoggerFactory.getLogger(TierConfig.class); - - private static final String DEFAULT_CONFIG_DIR = "common/config"; - public static final String DEFAULT_TIER_FILE = "earlybird-tiers.yml"; - - public static final Date DEFAULT_TIER_START_DATE = DateUtil.toDate(2006, 3, 21); - // It's convenient for DEFAULT_TIER_END_DATE to be before ~2100, because then the output of - // FieldTermCounter.getHourValue(DEFAULT_TIER_END_END_DATE) can still fit into an integer. - public static final Date DEFAULT_TIER_END_DATE = DateUtil.toDate(2099, 1, 1); - - public static final String DEFAULT_TIER_NAME = "all"; - public static final boolean DEFAULT_ENABLED = true; - public static final TierInfo.RequestReadType DEFAULT_READ_TYPE = TierInfo.RequestReadType.LIGHT; - - private static ConfigFile tierConfigFile = null; - private static ConfigSource tierConfigSource = null; - - public enum ConfigSource { - LOCAL, - ZOOKEEPER - } - - private TierConfig() { } - - private static synchronized void init() { - if (tierConfigFile == null) { - tierConfigFile = new ConfigFile(DEFAULT_CONFIG_DIR, DEFAULT_TIER_FILE); - tierConfigSource = ConfigSource.LOCAL; - SearchLongGauge.export("tier_config_source_" + tierConfigSource.name()).set(1); - LOG.info("Tier config file " + DEFAULT_TIER_FILE + " is successfully loaded from bundle."); - } - } - - public static ConfigFile getConfigFile() { - init(); - return tierConfigFile; - } - - public static String getConfigFileName() { - return getConfigFile().getConfigFileName(); - } - - /** - * Return all the tier names specified in the config file. - */ - public static Set getTierNames() { - return Config.getConfig().getMapCopy(getConfigFileName()).keySet(); - } - - /** - * Sets the value of the given tier config property to the given value. - */ - public static void setForTests(String property, Object value) { - Config.getConfig().setForTests(DEFAULT_TIER_FILE, property, value); - } - - /** - * Returns the config info for the specified tier. - */ - public static TierInfo getTierInfo(String tierName) { - return getTierInfo(tierName, null /* use current environment */); - } - - /** - * Returns the config info for the specified tier and environment. 
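- * Missing optional keys fall back to the defaults defined above: the default start and end
- * dates, tier_enabled=true, the LIGHT read type, and an effectively unlimited timeslice cap.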
- */ - public static TierInfo getTierInfo(String tierName, @Nullable String environment) { - String tierConfigFileType = getConfigFileName(); - Map tierInfo; - try { - tierInfo = (Map) Config.getConfig() - .getFromEnvironment(environment, tierConfigFileType, tierName); - } catch (ConfigurationException e) { - throw new RuntimeException(e); - } - if (tierInfo == null) { - LOG.error("Cannot find tier config for " - + tierName + "in config file: " + tierConfigFileType); - throw new RuntimeException("Configuration error: " + tierConfigFileType); - } - - Long partitions = (Long) tierInfo.get("number_of_partitions"); - if (partitions == null) { - LOG.error("No number of partition is specified for tier " - + tierName + " in tier config file " + tierConfigFileType); - throw new RuntimeException("Configuration error: " + tierConfigFileType); - } - - Long numTimeslices = (Long) tierInfo.get("serving_timeslices"); - if (numTimeslices == null) { - LOG.info("No max timeslices is specified for tier " - + tierName + " in tier config file " + tierConfigFileType - + ", not setting a cap on number of serving timeslices"); - // NOTE: we use max int32 here because it will ultimately be cast to an int, but the config - // map expects Longs for all integral types. Using Long.MAX_VALUE leads to max serving - // timeslices being set to -1 when it is truncated to an int. - numTimeslices = (long) Integer.MAX_VALUE; - } - - Date tierStartDate = (Date) tierInfo.get("data_range_start_date_inclusive"); - if (tierStartDate == null) { - tierStartDate = DEFAULT_TIER_START_DATE; - } - Date tierEndDate = (Date) tierInfo.get("data_range_end_date_exclusive"); - if (tierEndDate == null) { - tierEndDate = DEFAULT_TIER_END_DATE; - } - - Boolean tierEnabled = (Boolean) tierInfo.get("tier_enabled"); - if (tierEnabled == null) { - tierEnabled = DEFAULT_ENABLED; - } - - TierInfo.RequestReadType readType = - getRequestReadType((String) tierInfo.get("tier_read_type"), DEFAULT_READ_TYPE); - TierInfo.RequestReadType readTypeOverride = - getRequestReadType((String) tierInfo.get("tier_read_type_override"), readType); - - return new TierInfo( - tierName, - tierStartDate, - tierEndDate, - partitions.intValue(), - numTimeslices.intValue(), - tierEnabled, - (String) tierInfo.get("serving_range_since_id_exclusive"), - (String) tierInfo.get("serving_range_max_id_inclusive"), - (Date) tierInfo.get("serving_range_start_date_inclusive_override"), - (Date) tierInfo.get("serving_range_end_date_exclusive_override"), - readType, - readTypeOverride, - Clock.SYSTEM_CLOCK); - } - - public static synchronized void clear() { - tierConfigFile = null; - tierConfigSource = null; - } - - protected static synchronized ConfigSource getTierConfigSource() { - return tierConfigSource; - } - - private static TierInfo.RequestReadType getRequestReadType( - String readTypeEnumName, TierInfo.RequestReadType defaultReadType) { - TierInfo.RequestReadType readType = defaultReadType; - if (readTypeEnumName != null) { - readType = TierInfo.RequestReadType.valueOf(readTypeEnumName.trim().toUpperCase()); - Preconditions.checkState(readType != null); - } - return readType; - } -} diff --git a/src/java/com/twitter/search/earlybird/config/TierInfo.java b/src/java/com/twitter/search/earlybird/config/TierInfo.java deleted file mode 100644 index b4640224f..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierInfo.java +++ /dev/null @@ -1,180 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.Date; - -import 
com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; - -/** - * Properties of a single tier. - */ -public class TierInfo implements ServingRange { - // What I'm seeing historically is that this has been used when adding a new tier. First you - // add it and send dark traffic to it, then possibly grey and then you launch it by turning on - // light traffic. - public static enum RequestReadType { - // Light read: send request, wait for results, and results are returned - LIGHT, - // Dark read: send request, do not wait for results, and results are discarded - DARK, - // Grey read: send request, wait for results, but discard after results come back. - // Same results as dark read; similar latency as light read. - GREY, - } - - private final String tierName; - private final Date dataStartDate; - private final Date dataEndDate; - private final int numPartitions; - private final int maxTimeslices; - private final TierServingBoundaryEndPoint servingRangeSince; - private final TierServingBoundaryEndPoint servingRangeMax; - private final TierServingBoundaryEndPoint servingRangeSinceOverride; - private final TierServingBoundaryEndPoint servingRangeMaxOverride; - - // These two properties are only used by clients of Earlybird (E.g. roots), - // but not by Earlybirds. - private final boolean enabled; - private final RequestReadType readType; - private final RequestReadType readTypeOverride; - - public TierInfo(String tierName, - Date dataStartDate, - Date dataEndDate, - int numPartitions, - int maxTimeslices, - boolean enabled, - String sinceIdString, - String maxIdString, - Date servingStartDateOverride, - Date servingEndDateOverride, - RequestReadType readType, - RequestReadType readTypeOverride, - Clock clock) { - Preconditions.checkArgument(numPartitions > 0); - Preconditions.checkArgument(maxTimeslices > 0); - this.tierName = tierName; - this.dataStartDate = dataStartDate; - this.dataEndDate = dataEndDate; - this.numPartitions = numPartitions; - this.maxTimeslices = maxTimeslices; - this.enabled = enabled; - this.readType = readType; - this.readTypeOverride = readTypeOverride; - this.servingRangeSince = TierServingBoundaryEndPoint - .newTierServingBoundaryEndPoint(sinceIdString, dataStartDate, clock); - this.servingRangeMax = TierServingBoundaryEndPoint - .newTierServingBoundaryEndPoint(maxIdString, dataEndDate, clock); - if (servingStartDateOverride != null) { - this.servingRangeSinceOverride = TierServingBoundaryEndPoint.newTierServingBoundaryEndPoint( - TierServingBoundaryEndPoint.INFERRED_FROM_DATA_RANGE, servingStartDateOverride, clock); - } else { - this.servingRangeSinceOverride = servingRangeSince; - } - - if (servingEndDateOverride != null) { - this.servingRangeMaxOverride = TierServingBoundaryEndPoint.newTierServingBoundaryEndPoint( - TierServingBoundaryEndPoint.INFERRED_FROM_DATA_RANGE, servingEndDateOverride, clock); - } else { - this.servingRangeMaxOverride = servingRangeMax; - } - } - - @VisibleForTesting - public TierInfo(String tierName, - Date dataStartDate, - Date dataEndDate, - int numPartitions, - int maxTimeslices, - boolean enabled, - String sinceIdString, - String maxIdString, - RequestReadType readType, - Clock clock) { - // No overrides: - // servingRangeSinceOverride == servingRangeSince - // servingRangeMaxOverride == servingRangeMax - // readTypeOverride == readType - this(tierName, dataStartDate, dataEndDate, numPartitions, maxTimeslices, enabled, sinceIdString, - maxIdString, null, 
null, readType, readType, clock); - } - - @Override - public String toString() { - return tierName; - } - - public String getTierName() { - return tierName; - } - - public Date getDataStartDate() { - return dataStartDate; - } - - public Date getDataEndDate() { - return dataEndDate; - } - - public int getNumPartitions() { - return numPartitions; - } - - public int getMaxTimeslices() { - return maxTimeslices; - } - - public TierConfig.ConfigSource getSource() { - return TierConfig.getTierConfigSource(); - } - - public boolean isEnabled() { - return enabled; - } - - public boolean isDarkRead() { - return readType == RequestReadType.DARK; - } - - public RequestReadType getReadType() { - return readType; - } - - public RequestReadType getReadTypeOverride() { - return readTypeOverride; - } - - public long getServingRangeSinceId() { - return servingRangeSince.getBoundaryTweetId(); - } - - public long getServingRangeMaxId() { - return servingRangeMax.getBoundaryTweetId(); - } - - long getServingRangeOverrideSinceId() { - return servingRangeSinceOverride.getBoundaryTweetId(); - } - - long getServingRangeOverrideMaxId() { - return servingRangeMaxOverride.getBoundaryTweetId(); - } - - public long getServingRangeSinceTimeSecondsFromEpoch() { - return servingRangeSince.getBoundaryTimeSecondsFromEpoch(); - } - - public long getServingRangeUntilTimeSecondsFromEpoch() { - return servingRangeMax.getBoundaryTimeSecondsFromEpoch(); - } - - long getServingRangeOverrideSinceTimeSecondsFromEpoch() { - return servingRangeSinceOverride.getBoundaryTimeSecondsFromEpoch(); - } - - long getServingRangeOverrideUntilTimeSecondsFromEpoch() { - return servingRangeMaxOverride.getBoundaryTimeSecondsFromEpoch(); - } -} diff --git a/src/java/com/twitter/search/earlybird/config/TierInfoSource.java b/src/java/com/twitter/search/earlybird/config/TierInfoSource.java deleted file mode 100644 index 9835ca6ef..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierInfoSource.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.ArrayList; -import java.util.List; -import java.util.Set; - -import javax.inject.Inject; - -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; - -public class TierInfoSource { - private final ZooKeeperProxy zkClient; - - @Inject - public TierInfoSource(ZooKeeperProxy sZooKeeperClient) { - this.zkClient = sZooKeeperClient; - } - - public List getTierInformation() { - return getTierInfoWithPrefix("tier"); - } - - public String getConfigFileType() { - return TierConfig.getConfigFileName(); - } - - private List getTierInfoWithPrefix(String tierPrefix) { - Set tierNames = TierConfig.getTierNames(); - List tierInfos = new ArrayList<>(); - for (String name : tierNames) { - if (name.startsWith(tierPrefix)) { - TierInfo tierInfo = TierConfig.getTierInfo(name); - tierInfos.add(tierInfo); - } - } - return tierInfos; - } - -} diff --git a/src/java/com/twitter/search/earlybird/config/TierInfoUtil.java b/src/java/com/twitter/search/earlybird/config/TierInfoUtil.java deleted file mode 100644 index 995de7b81..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierInfoUtil.java +++ /dev/null @@ -1,78 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.Comparator; -import java.util.SortedSet; - -import com.google.common.base.Preconditions; - -public final class TierInfoUtil { - public static final Comparator TIER_COMPARATOR = (t1, t2) -> { - // Reverse sort order based on date. 
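- // Comparing t2 to t1 puts tiers with the most recent data start date first.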
- return t2.getDataStartDate().compareTo(t1.getDataStartDate()); - }; - - private TierInfoUtil() { - } - - /** - * Checks that the serving ranges and the override serving ranges of the given tiers do not - * overlap, and do not have gaps. Dark reads tiers are ignored. - */ - public static void checkTierServingRanges(SortedSet tierInfos) { - boolean tierServingRangesOverlap = false; - boolean tierOverrideServingRangesOverlap = false; - boolean tierServingRangesHaveGaps = false; - boolean tierOverrideServingRangesHaveGaps = false; - - TierInfoWrapper previousTierInfoWrapper = null; - TierInfoWrapper previousOverrideTierInfoWrapper = null; - for (TierInfo tierInfo : tierInfos) { - TierInfoWrapper tierInfoWrapper = new TierInfoWrapper(tierInfo, false); - TierInfoWrapper overrideTierInfoWrapper = new TierInfoWrapper(tierInfo, true); - - // Check only the tiers to which we send light reads. - if (!tierInfoWrapper.isDarkRead()) { - if (previousTierInfoWrapper != null) { - if (TierInfoWrapper.servingRangesOverlap(previousTierInfoWrapper, tierInfoWrapper)) { - // In case of rebalancing, we may have an overlap data range while - // overriding with a good serving range. - if (previousOverrideTierInfoWrapper == null - || TierInfoWrapper.servingRangesOverlap( - previousOverrideTierInfoWrapper, overrideTierInfoWrapper)) { - tierServingRangesOverlap = true; - } - } - if (TierInfoWrapper.servingRangesHaveGap(previousTierInfoWrapper, tierInfoWrapper)) { - tierServingRangesHaveGaps = true; - } - } - - previousTierInfoWrapper = tierInfoWrapper; - } - - if (!overrideTierInfoWrapper.isDarkRead()) { - if (previousOverrideTierInfoWrapper != null) { - if (TierInfoWrapper.servingRangesOverlap(previousOverrideTierInfoWrapper, - overrideTierInfoWrapper)) { - tierOverrideServingRangesOverlap = true; - } - if (TierInfoWrapper.servingRangesHaveGap(previousOverrideTierInfoWrapper, - overrideTierInfoWrapper)) { - tierOverrideServingRangesHaveGaps = true; - } - } - - previousOverrideTierInfoWrapper = overrideTierInfoWrapper; - } - } - - Preconditions.checkState(!tierServingRangesOverlap, - "Serving ranges of light reads tiers must not overlap."); - Preconditions.checkState(!tierServingRangesHaveGaps, - "Serving ranges of light reads tiers must not have gaps."); - Preconditions.checkState(!tierOverrideServingRangesOverlap, - "Override serving ranges of light reads tiers must not overlap."); - Preconditions.checkState(!tierOverrideServingRangesHaveGaps, - "Override serving ranges of light reads tiers must not have gaps."); - } -} diff --git a/src/java/com/twitter/search/earlybird/config/TierInfoWrapper.java b/src/java/com/twitter/search/earlybird/config/TierInfoWrapper.java deleted file mode 100644 index b6c3110dd..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierInfoWrapper.java +++ /dev/null @@ -1,89 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.Date; - -import com.google.common.base.Preconditions; - -/** - * A simple wrapper around TierInfo that returns the "real" or the "overriden" values from the given - * {@code TierInfo} instance, based on the given {@code useOverrideTierConfig} flag. 
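- * The static servingRangesOverlap and servingRangesHaveGap helpers compare two wrappers using
- * whichever serving range (real or overridden) each wrapper is configured to expose.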
- */ -public class TierInfoWrapper implements ServingRange { - private final TierInfo tierInfo; - private final boolean useOverrideTierConfig; - - public TierInfoWrapper(TierInfo tierInfo, boolean useOverrideTierConfig) { - this.tierInfo = Preconditions.checkNotNull(tierInfo); - this.useOverrideTierConfig = useOverrideTierConfig; - } - - public String getTierName() { - return tierInfo.getTierName(); - } - - public Date getDataStartDate() { - return tierInfo.getDataStartDate(); - } - - public Date getDataEndDate() { - return tierInfo.getDataEndDate(); - } - - public int getNumPartitions() { - return tierInfo.getNumPartitions(); - } - - public int getMaxTimeslices() { - return tierInfo.getMaxTimeslices(); - } - - public TierConfig.ConfigSource getSource() { - return tierInfo.getSource(); - } - - public boolean isEnabled() { - return tierInfo.isEnabled(); - } - - public boolean isDarkRead() { - return getReadType() == TierInfo.RequestReadType.DARK; - } - - public TierInfo.RequestReadType getReadType() { - return useOverrideTierConfig ? tierInfo.getReadTypeOverride() : tierInfo.getReadType(); - } - - public long getServingRangeSinceId() { - return useOverrideTierConfig - ? tierInfo.getServingRangeOverrideSinceId() - : tierInfo.getServingRangeSinceId(); - } - - public long getServingRangeMaxId() { - return useOverrideTierConfig - ? tierInfo.getServingRangeOverrideMaxId() - : tierInfo.getServingRangeMaxId(); - } - - public long getServingRangeSinceTimeSecondsFromEpoch() { - return useOverrideTierConfig - ? tierInfo.getServingRangeOverrideSinceTimeSecondsFromEpoch() - : tierInfo.getServingRangeSinceTimeSecondsFromEpoch(); - } - - public long getServingRangeUntilTimeSecondsFromEpoch() { - return useOverrideTierConfig - ? tierInfo.getServingRangeOverrideUntilTimeSecondsFromEpoch() - : tierInfo.getServingRangeUntilTimeSecondsFromEpoch(); - } - - public static boolean servingRangesOverlap(TierInfoWrapper tier1, TierInfoWrapper tier2) { - return (tier1.getServingRangeMaxId() > tier2.getServingRangeSinceId()) - && (tier2.getServingRangeMaxId() > tier1.getServingRangeSinceId()); - } - - public static boolean servingRangesHaveGap(TierInfoWrapper tier1, TierInfoWrapper tier2) { - return (tier1.getServingRangeMaxId() < tier2.getServingRangeSinceId()) - || (tier2.getServingRangeMaxId() < tier1.getServingRangeSinceId()); - } -} diff --git a/src/java/com/twitter/search/earlybird/config/TierServingBoundaryEndPoint.java b/src/java/com/twitter/search/earlybird/config/TierServingBoundaryEndPoint.java deleted file mode 100644 index 8a0eb852b..000000000 --- a/src/java/com/twitter/search/earlybird/config/TierServingBoundaryEndPoint.java +++ /dev/null @@ -1,146 +0,0 @@ -package com.twitter.search.earlybird.config; - -import java.util.Date; - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; - -/** - * The start or end boundary of a tier's serving range. - * This is used to add since_id and max_id operators onto search queries. - */ -public class TierServingBoundaryEndPoint { - @VisibleForTesting - public static final String INFERRED_FROM_DATA_RANGE = "inferred_from_data_range"; - public static final String RELATIVE_TO_CURRENT_TIME_MS = "relative_to_current_time_ms"; - - // Either offsetToCurrentTimeMillis is set or (absoluteTweetId and timeBoundarySecondsFromEpoch) - // are set. 
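- // An absolute boundary is resolved once from the config, while a relative boundary is
- // re-evaluated against the clock each time getBoundaryTweetId() is called.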
- @Nullable - private final Long offsetToCurrentTimeMillis; - @Nullable - private final Long absoluteTweetId; - @Nullable - private final Long timeBoundarySecondsFromEpoch; - private final Clock clock; - - TierServingBoundaryEndPoint(Long absoluteTweetId, - Long timeBoundarySecondsFromEpoch, - Long offsetToCurrentTimeMillis, - Clock clock) { - this.offsetToCurrentTimeMillis = offsetToCurrentTimeMillis; - this.absoluteTweetId = absoluteTweetId; - this.timeBoundarySecondsFromEpoch = timeBoundarySecondsFromEpoch; - this.clock = clock; - } - - /** - * Parse the boundary string and construct a TierServingBoundaryEndPoint instance. - * @param boundaryString boundary configuration string. Valid values are: - *

  • - * "inferred_from_data_range" infers serving range from data range. This only works after - * Nov 2010 when Twitter switched to snowflake IDs. - * This is the default value. - *
  • - *
  • - * "absolute_tweet_id_and_timestamp_millis:id:timestamp" a tweet ID/timestamp is given - * explicitly as the serving range - * boundary. - *
  • - *
  • - * "relative_to_current_time_ms:offset" adds offset onto current timestamp in millis to - * compute serving range. - *
  • - * - * @param boundaryDate the data boundary. This is used in conjunction with - * inferred_from_data_date to determine the serving boundary. - * @param clock Clock used to obtain current time, when relative_to_current_time_ms is used. - * Tests pass in a FakeClock. - */ - public static TierServingBoundaryEndPoint newTierServingBoundaryEndPoint(String boundaryString, - Date boundaryDate, - Clock clock) { - if (boundaryString == null || boundaryString.trim().equals( - INFERRED_FROM_DATA_RANGE)) { - return inferBoundaryFromDataRange(boundaryDate, clock); - } else if (boundaryString.trim().startsWith(RELATIVE_TO_CURRENT_TIME_MS)) { - return getRelativeBoundary(boundaryString, clock); - } else { - throw new IllegalStateException("Cannot parse serving range string: " + boundaryString); - } - } - - private static TierServingBoundaryEndPoint inferBoundaryFromDataRange(Date boundaryDate, - Clock clock) { - // infer from data range - // handle default start date and end date, in case the dates are not specified in the config - if (boundaryDate.equals(TierConfig.DEFAULT_TIER_START_DATE)) { - return new TierServingBoundaryEndPoint( - -1L, TierConfig.DEFAULT_TIER_START_DATE.getTime() / 1000, null, clock); - } else if (boundaryDate.equals(TierConfig.DEFAULT_TIER_END_DATE)) { - return new TierServingBoundaryEndPoint( - Long.MAX_VALUE, TierConfig.DEFAULT_TIER_END_DATE.getTime() / 1000, null, clock); - } else { - // convert data start / end dates into since / max ID. - long boundaryTimeMillis = boundaryDate.getTime(); - if (!SnowflakeIdParser.isUsableSnowflakeTimestamp(boundaryTimeMillis)) { - throw new IllegalStateException("Serving time range can not be determined, because " - + boundaryDate + " is before Twitter switched to snowflake tweet IDs."); - } - // Earlybird since_id is inclusive and max_id is exclusive. We substract 1 here. - // Consider example: - // full0: 5000 (inclusive) - 6000 (exclusive) - // full1: 6000 (inclusive) - 7000 (exclusive) - // For tier full0, we should use max_id 5999 instead of 6000. - // For tier full1, we should use since_id 5999 instead of 6000. - // Hence we substract 1 here. - long adjustedTweetId = - SnowflakeIdParser.generateValidStatusId(boundaryTimeMillis, 0) - 1; - Preconditions.checkState(adjustedTweetId >= 0, "boundary tweet ID must be non-negative"); - return new TierServingBoundaryEndPoint( - adjustedTweetId, boundaryTimeMillis / 1000, null, clock); - } - } - - private static TierServingBoundaryEndPoint getRelativeBoundary(String boundaryString, - Clock clock) { - // An offset relative to current time is given - String[] parts = boundaryString.split(":"); - Preconditions.checkState(parts.length == 2); - long offset = Long.parseLong(parts[1]); - return new TierServingBoundaryEndPoint(null, null, offset, clock); - } - - /** - * Returns the tweet ID for this tier boundary. If the tier boundary was created using a tweet ID, - * that tweet ID is returned. Otherwise, a tweet ID is derived from the time boundary. - */ - @VisibleForTesting - public long getBoundaryTweetId() { - // If absoluteTweetId is available, use it. - if (absoluteTweetId != null) { - return absoluteTweetId; - } else { - Preconditions.checkNotNull(offsetToCurrentTimeMillis); - long boundaryTime = clock.nowMillis() + offsetToCurrentTimeMillis; - return SnowflakeIdParser.generateValidStatusId(boundaryTime, 0); - } - } - - /** - * Returns the time boundary for this tier boundary, in seconds since epoch. 
- */ - public long getBoundaryTimeSecondsFromEpoch() { - if (timeBoundarySecondsFromEpoch != null) { - return timeBoundarySecondsFromEpoch; - } else { - Preconditions.checkNotNull(offsetToCurrentTimeMillis); - return (clock.nowMillis() + offsetToCurrentTimeMillis) / 1000; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/document/DeletedStatus.java b/src/java/com/twitter/search/earlybird/document/DeletedStatus.java deleted file mode 100644 index 6da8d23a0..000000000 --- a/src/java/com/twitter/search/earlybird/document/DeletedStatus.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird.document; - -/** - * DeletedStatus is a marker indicating that the specified tweet in the specified - * timeslice has been deleted. - */ -public final class DeletedStatus { - public final long timeSliceID; - public final long statusID; - - public DeletedStatus(long timeSliceID, long statusID) { - this.timeSliceID = timeSliceID; - this.statusID = statusID; - } -} diff --git a/src/java/com/twitter/search/earlybird/document/DocumentFactory.java b/src/java/com/twitter/search/earlybird/document/DocumentFactory.java deleted file mode 100644 index 4745f329c..000000000 --- a/src/java/com/twitter/search/earlybird/document/DocumentFactory.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.earlybird.document; - -import java.io.IOException; -import javax.annotation.Nullable; - -import org.apache.commons.codec.binary.Base64; -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.FieldType; -import org.apache.lucene.index.IndexableField; -import org.apache.thrift.TBase; -import org.apache.thrift.TException; -import org.apache.thrift.TSerializer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.text.OmitNormTextField; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -/** - * Factory that constructs a Lucene document from a thrift object stored in T format. - * - * @param ThriftStatus or ThriftIndexingEvent, to be converted to a Lucene Document. - */ -public abstract class DocumentFactory> { - private static final Logger LOG = LoggerFactory.getLogger(DocumentFactory.class); - private static final int MAX_ALLOWED_INVALID_DOCUMENTS = 100; - - private static final SearchCounter INVALID_DOCUMENTS_COUNTER = - SearchCounter.export("invalid_documents"); - - private final CriticalExceptionHandler criticalExceptionHandler; - - public DocumentFactory(CriticalExceptionHandler criticalExceptionHandler) { - this.criticalExceptionHandler = criticalExceptionHandler; - } - - /** - * Given the thrift representation of a tweet, returns the associated tweetId. - */ - public abstract long getStatusId(T thriftObject); - - /** - * Given the thrift representation of a tweet, returns a Lucene Document with all the fields - * that need to be indexed. - */ - @Nullable - public final Document newDocument(T thriftObject) { - try { - return innerNewDocument(thriftObject); - } catch (Exception e) { - String statusId = "Not available"; - if (thriftObject != null) { - try { - statusId = Long.toString(getStatusId(thriftObject)); - } catch (Exception ex) { - LOG.error("Unable to get tweet id for document", ex); - statusId = "Not parsable"; - } - } - LOG.error("Unexpected exception while indexing. 
Status id: " + statusId, e); - - if (thriftObject != null) { - // Log the status in base64 for debugging - try { - LOG.warn("Bad ThriftStatus. Id: " + statusId + " base 64: " - + Base64.encodeBase64String(new TSerializer().serialize(thriftObject))); - } catch (TException e1) { - // Ignored since this is logging for debugging. - } - } - INVALID_DOCUMENTS_COUNTER.increment(); - if (INVALID_DOCUMENTS_COUNTER.get() > MAX_ALLOWED_INVALID_DOCUMENTS) { - criticalExceptionHandler.handle(this, e); - } - return new Document(); - } - } - - /** - * Given the thrift representation of a tweet, returns a Lucene Document with all the fields - * that need to be indexed. - * - * Return null if the given thrift object is invalid. - * - * @throws IOException if there are problems reading the input of producing the output. Exception - * is handled in {@link #newDocument(TBase)}. - */ - @Nullable - protected abstract Document innerNewDocument(T thriftObject) throws IOException; - - // Helper methods that prevent us from adding null fields to the lucene index - protected void addField(Document document, IndexableField field) { - if (field != null) { - document.add(field); - } - } - - protected Field newField(String data, String fieldName) { - return newField(data, fieldName, OmitNormTextField.TYPE_NOT_STORED); - } - - protected Field newField(String data, String fieldName, FieldType fieldType) { - if (data != null) { - return new Field(fieldName, data, fieldType); - } - return null; - } -} diff --git a/src/java/com/twitter/search/earlybird/document/ThriftDocumentPreprocessor.java b/src/java/com/twitter/search/earlybird/document/ThriftDocumentPreprocessor.java deleted file mode 100644 index 4ef5909e5..000000000 --- a/src/java/com/twitter/search/earlybird/document/ThriftDocumentPreprocessor.java +++ /dev/null @@ -1,170 +0,0 @@ -package com.twitter.search.earlybird.document; - -import java.io.IOException; -import java.util.List; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTruthTableCounter; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeaturesUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentUtil; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; - -import geo.google.datamodel.GeoAddressAccuracy; - -/** - * Used to preprocess a ThriftDocument before indexing. 
- */ -public final class ThriftDocumentPreprocessor { - private static final FieldNameToIdMapping ID_MAP = new EarlybirdFieldConstants(); - private static final String FILTER_LINK_VALUE = EarlybirdThriftDocumentUtil.formatFilter( - EarlybirdFieldConstant.LINKS_FIELD.getFieldName()); - private static final String HAS_LINK_VALUE = EarlybirdFieldConstant.getFacetSkipFieldName( - EarlybirdFieldConstant.LINKS_FIELD.getFieldName()); - - private ThriftDocumentPreprocessor() { - } - - /** - * Processes the given document. - */ - public static ThriftDocument preprocess( - ThriftDocument doc, EarlybirdCluster cluster, ImmutableSchemaInterface schema) - throws IOException { - patchArchiveThriftDocumentAccuracy(doc, cluster); - patchArchiveHasLinks(doc, cluster); - addAllMissingMinEngagementFields(doc, cluster, schema); - return doc; - } - - private static final SearchCounter GEO_SCRUBBED_COUNT = - SearchCounter.export("geo_scrubbed_count"); - private static final SearchCounter GEO_ARCHIVE_PATCHED_ACCURACY_COUNT = - SearchCounter.export("geo_archive_patched_accuracy_count"); - private static final SearchCounter GEO_MISSING_COORDINATE_COUNT = - SearchCounter.export("geo_missing_coordinate_count"); - private static final SearchCounter ARCHIVED_LINKS_FIELD_PATCHED_COUNT = - SearchCounter.export("links_field_patched_count"); - - /** - * Counter for all the combinations of nullcast bit set and nullcast filter set. - * - * Sum over `ThriftDocumentPreprocessor_nullcast_doc_stats__nullcastBitSet_true_*` to get all docs - * with nullcast bit set to true. - */ - private static final SearchTruthTableCounter NULLCAST_DOC_STATS = - SearchTruthTableCounter.export( - "ThriftDocumentPreprocessor_nullcast_doc_stats", - "nullcastBitSet", - "nullcastFilterSet"); - - /*** - * See JIRA SEARCH-7329 - */ - private static void patchArchiveThriftDocumentAccuracy(ThriftDocument doc, - EarlybirdCluster cluster) { - ThriftField geoField = ThriftDocumentUtil.getField( - doc, - EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), - ID_MAP); - if (geoField != null) { - if (!geoField.getFieldData().isSetGeoCoordinate()) { - GEO_MISSING_COORDINATE_COUNT.increment(); - return; - } - - // -1 means that the data is geo scrubbed. - if (geoField.getFieldData().getGeoCoordinate().getAccuracy() == -1) { - doc.getFields().remove(geoField); - GEO_SCRUBBED_COUNT.increment(); - } else if (EarlybirdCluster.isArchive(cluster)) { - // In archive indexing, we base precision on SearchArchiveStatus.getPrecision, which is not - // in the scale we want. We always use POINT_LEVEL scale for now. - geoField.getFieldData().getGeoCoordinate().setAccuracy( - GeoAddressAccuracy.POINT_LEVEL.getCode()); - GEO_ARCHIVE_PATCHED_ACCURACY_COUNT.increment(); - } - } - } - - /** - * See SEARCH-9635 - * This patch is used to replace - * ("field":"internal","term":"__filter_links") with - * ("field":"internal","term":"__has_links"). - */ - private static void patchArchiveHasLinks(ThriftDocument doc, EarlybirdCluster cluster) { - if (!EarlybirdCluster.isArchive(cluster)) { - return; - } - - List fieldList = ThriftDocumentUtil.getFields(doc, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - ID_MAP); - for (ThriftField field : fieldList) { - if (field.getFieldData().getStringValue().equals(FILTER_LINK_VALUE)) { - field.getFieldData().setStringValue(HAS_LINK_VALUE); - ARCHIVED_LINKS_FIELD_PATCHED_COUNT.increment(); - break; - } - } - } - - /** - * Check whether the nullcast bit and nullcast filter are consistent in the given doc. 
- */ - public static boolean isNullcastBitAndFilterConsistent(ThriftDocument doc, - ImmutableSchemaInterface schema) { - return isNullcastBitAndFilterConsistent(doc, schema, NULLCAST_DOC_STATS); - } - - @VisibleForTesting - static boolean isNullcastBitAndFilterConsistent( - ThriftDocument doc, ImmutableSchemaInterface schema, SearchTruthTableCounter nullCastStats) { - final boolean isNullcastBitSet = EarlybirdThriftDocumentUtil.isNullcastBitSet(schema, doc); - final boolean isNullcastFilterSet = EarlybirdThriftDocumentUtil.isNullcastFilterSet(doc); - - // Track stats. - nullCastStats.record(isNullcastBitSet, isNullcastFilterSet); - - return isNullcastBitSet == isNullcastFilterSet; - } - - @VisibleForTesting - static void addAllMissingMinEngagementFields( - ThriftDocument doc, EarlybirdCluster cluster, ImmutableSchemaInterface schema - ) throws IOException { - if (!EarlybirdCluster.isArchive(cluster)) { - return; - } - EarlybirdFieldConstants.EarlybirdFieldConstant encodedFeatureFieldConstant = - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD; - byte[] encodedFeaturesBytes = ThriftDocumentUtil.getBytesValue(doc, - encodedFeatureFieldConstant.getFieldName(), ID_MAP); - if (encodedFeaturesBytes == null) { - return; - } - EarlybirdEncodedFeatures encodedFeatures = EarlybirdEncodedFeaturesUtil.fromBytes( - schema, - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD, - encodedFeaturesBytes, - 0); - for (String field: EarlybirdFieldConstants.MIN_ENGAGEMENT_FIELD_TO_CSF_NAME_MAP.keySet()) { - EarlybirdFieldConstant csfEngagementField = EarlybirdFieldConstants - .MIN_ENGAGEMENT_FIELD_TO_CSF_NAME_MAP.get(field); - Preconditions.checkState(csfEngagementField != null); - int engagementCounter = encodedFeatures.getFeatureValue(csfEngagementField); - EarlybirdThriftDocumentUtil.addNormalizedMinEngagementField(doc, field, engagementCounter); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventDocumentFactory.java b/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventDocumentFactory.java deleted file mode 100644 index d225a387e..000000000 --- a/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventDocumentFactory.java +++ /dev/null @@ -1,246 +0,0 @@ -package com.twitter.search.earlybird.document; - -import java.io.IOException; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.lucene.document.Document; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.schema.SchemaDocumentFactory; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentUtil; -import 
com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.text.filter.NormalizedTokenFilter; -import com.twitter.search.common.util.text.splitter.HashtagMentionPunctuationSplitter; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; - -public class ThriftIndexingEventDocumentFactory extends DocumentFactory { - private static final Logger LOG = - LoggerFactory.getLogger(ThriftIndexingEventDocumentFactory.class); - - private static final FieldNameToIdMapping ID_MAPPING = new EarlybirdFieldConstants(); - private static final long TIMESTAMP_ALLOWED_FUTURE_DELTA_MS = TimeUnit.SECONDS.toMillis(60); - private static final String FILTER_TWEETS_WITH_FUTURE_TWEET_ID_AND_CREATED_AT_DECIDER_KEY = - "filter_tweets_with_future_tweet_id_and_created_at"; - - private static final SearchCounter NUM_TWEETS_WITH_FUTURE_TWEET_ID_AND_CREATED_AT_MS = - SearchCounter.export("num_tweets_with_future_tweet_id_and_created_at_ms"); - private static final SearchCounter NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_FOUND = - SearchCounter.export("num_tweets_with_inconsistent_tweet_id_and_created_at_ms_found"); - private static final SearchCounter - NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_ADJUSTED = - SearchCounter.export("num_tweets_with_inconsistent_tweet_id_and_created_at_ms_adjusted"); - private static final SearchCounter NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_DROPPED - = SearchCounter.export("num_tweets_with_inconsistent_tweet_id_and_created_at_ms_dropped"); - - @VisibleForTesting - static final String ENABLE_ADJUST_CREATED_AT_TIME_IF_MISMATCH_WITH_SNOWFLAKE = - "enable_adjust_created_at_time_if_mismatch_with_snowflake"; - - @VisibleForTesting - static final String ENABLE_DROP_CREATED_AT_TIME_IF_MISMATCH_WITH_SNOWFLAKE = - "enable_drop_created_at_time_if_mismatch_with_snowflake"; - - private final SchemaDocumentFactory schemaDocumentFactory; - private final EarlybirdCluster cluster; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final Decider decider; - private final Schema schema; - private final Clock clock; - - public ThriftIndexingEventDocumentFactory( - Schema schema, - EarlybirdCluster cluster, - Decider decider, - SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - this( - schema, - getSchemaDocumentFactory(schema, cluster, decider), - cluster, - searchIndexingMetricSet, - decider, - Clock.SYSTEM_CLOCK, - criticalExceptionHandler - ); - } - - /** - * Returns a document factory that knows how to convert ThriftDocuments to Documents based on the - * provided schema. - */ - public static SchemaDocumentFactory getSchemaDocumentFactory( - Schema schema, - EarlybirdCluster cluster, - Decider decider) { - return new SchemaDocumentFactory(schema, - Lists.newArrayList( - new TruncationTokenStreamWriter(cluster, decider), - (fieldInfo, stream) -> { - // Strip # @ $ symbols, and break up underscore connected tokens. 
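              // For example (illustrative, following the comment above): a token like "#foo_bar"
              // would be emitted as "foo" and "bar", and "@some_user" as "some" and "user";
              // the exact behavior is defined by NormalizedTokenFilter and
              // HashtagMentionPunctuationSplitter.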
- if (fieldInfo.getFieldType().useTweetSpecificNormalization()) { - return new HashtagMentionPunctuationSplitter(new NormalizedTokenFilter(stream)); - } - - return stream; - })); - } - - @VisibleForTesting - protected ThriftIndexingEventDocumentFactory( - Schema schema, - SchemaDocumentFactory schemaDocumentFactory, - EarlybirdCluster cluster, - SearchIndexingMetricSet searchIndexingMetricSet, - Decider decider, - Clock clock, - CriticalExceptionHandler criticalExceptionHandler) { - super(criticalExceptionHandler); - this.schema = schema; - this.schemaDocumentFactory = schemaDocumentFactory; - this.cluster = cluster; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.decider = decider; - this.clock = clock; - } - - @Override - public long getStatusId(ThriftIndexingEvent event) { - Preconditions.checkNotNull(event); - if (event.isSetDocument() && event.getDocument() != null) { - ThriftDocument thriftDocument = event.getDocument(); - try { - // Ideally, we should not call getSchemaSnapshot() here. But, as this is called only to - // retrieve status id and the ID field is static, this is fine for the purpose. - thriftDocument = ThriftDocumentPreprocessor.preprocess( - thriftDocument, cluster, schema.getSchemaSnapshot()); - } catch (IOException e) { - throw new IllegalStateException("Unable to obtain tweet ID from ThriftDocument", e); - } - return ThriftDocumentUtil.getLongValue( - thriftDocument, EarlybirdFieldConstant.ID_FIELD.getFieldName(), ID_MAPPING); - } else { - throw new IllegalArgumentException("ThriftDocument is null inside ThriftIndexingEvent."); - } - } - - @Override - protected Document innerNewDocument(ThriftIndexingEvent event) throws IOException { - Preconditions.checkNotNull(event); - Preconditions.checkNotNull(event.getDocument()); - - ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); - - // If the tweet id and create_at are in the future, do not index it. - if (areTweetIDAndCreateAtInTheFuture(event) - && DeciderUtil.isAvailableForRandomRecipient(decider, - FILTER_TWEETS_WITH_FUTURE_TWEET_ID_AND_CREATED_AT_DECIDER_KEY)) { - NUM_TWEETS_WITH_FUTURE_TWEET_ID_AND_CREATED_AT_MS.increment(); - return null; - } - - if (isNullcastBitAndFilterConsistent(schemaSnapshot, event)) { - ThriftDocument thriftDocument = - adjustOrDropIfTweetIDAndCreatedAtAreInconsistent( - ThriftDocumentPreprocessor.preprocess(event.getDocument(), cluster, schemaSnapshot)); - - if (thriftDocument != null) { - return schemaDocumentFactory.newDocument(thriftDocument); - } else { - return null; - } - } else { - return null; - } - } - - private ThriftDocument adjustOrDropIfTweetIDAndCreatedAtAreInconsistent(ThriftDocument document) { - final long tweetID = EarlybirdThriftDocumentUtil.getID(document); - // Thrift document is storing created at in seconds. - final long createdAtMs = EarlybirdThriftDocumentUtil.getCreatedAtMs(document); - - if (!SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(tweetID, createdAtMs)) { - // Increment found counter. - NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_FOUND.increment(); - LOG.error( - "Found inconsistent tweet ID and created at timestamp: [tweetID={}], [createdAtMs={}]", - tweetID, createdAtMs); - - if (DeciderUtil.isAvailableForRandomRecipient( - decider, ENABLE_ADJUST_CREATED_AT_TIME_IF_MISMATCH_WITH_SNOWFLAKE)) { - // Update created at (and csf) with the time stamp in snow flake ID. 
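        // The snowflake ID itself encodes the creation time in milliseconds; that timestamp is
        // extracted below and written back into the created_at field and its CSF (in seconds).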
- final long createdAtMsInID = SnowflakeIdParser.getTimestampFromTweetId(tweetID); - EarlybirdThriftDocumentUtil.replaceCreatedAtAndCreatedAtCSF( - document, (int) (createdAtMsInID / 1000)); - - // Increment adjusted counter. - NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_ADJUSTED.increment(); - LOG.error( - "Updated created at to match tweet ID: createdAtMs={}, tweetID={}, createdAtMsInID={}", - createdAtMs, tweetID, createdAtMsInID); - } else if (DeciderUtil.isAvailableForRandomRecipient( - decider, ENABLE_DROP_CREATED_AT_TIME_IF_MISMATCH_WITH_SNOWFLAKE)) { - // Drop and increment counter! - NUM_TWEETS_WITH_INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS_DROPPED.increment(); - LOG.error( - "Dropped tweet with inconsistent ID and timestamp: createdAtMs={}, tweetID={}", - createdAtMs, tweetID); - return null; - } - } - - return document; - } - - private boolean isNullcastBitAndFilterConsistent( - ImmutableSchemaInterface schemaSnapshot, - ThriftIndexingEvent event) { - return ThriftDocumentPreprocessor.isNullcastBitAndFilterConsistent( - event.getDocument(), schemaSnapshot); - } - - /** - * Check if the tweet ID and create_at are in the future and beyond the allowed - * TIMESTAMP_ALLOWED_FUTURE_DELTA_MS range from current time stamp. - */ - private boolean areTweetIDAndCreateAtInTheFuture(ThriftIndexingEvent event) { - ThriftDocument document = event.getDocument(); - - final long tweetID = EarlybirdThriftDocumentUtil.getID(document); - if (tweetID < SnowflakeIdParser.SNOWFLAKE_ID_LOWER_BOUND) { - return false; - } - - final long tweetIDTimestampMs = SnowflakeIdParser.getTimestampFromTweetId(tweetID); - final long allowedFutureTimestampMs = clock.nowMillis() + TIMESTAMP_ALLOWED_FUTURE_DELTA_MS; - - final long createdAtMs = EarlybirdThriftDocumentUtil.getCreatedAtMs(document); - if (tweetIDTimestampMs > allowedFutureTimestampMs && createdAtMs > allowedFutureTimestampMs) { - LOG.error( - "Found future tweet ID and created at timestamp: " - + "[tweetID={}], [createdAtMs={}], [compareDeltaMs={}]", - tweetID, createdAtMs, TIMESTAMP_ALLOWED_FUTURE_DELTA_MS); - return true; - } - - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventUpdateFactory.java b/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventUpdateFactory.java deleted file mode 100644 index 63a4ced1b..000000000 --- a/src/java/com/twitter/search/earlybird/document/ThriftIndexingEventUpdateFactory.java +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.search.earlybird.document; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.lucene.document.Document; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.SchemaDocumentFactory; -import com.twitter.search.common.schema.base.FieldNameToIdMapping; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -/** - 
* Builds a Lucene Document from a ThriftIndexingEvent. A simplified version of - * {@link ThriftIndexingEventDocumentFactory} that can be used for update events, which exclude - * many fields that the tweet indexing events contain. - */ -public class ThriftIndexingEventUpdateFactory extends DocumentFactory { - private static final FieldNameToIdMapping ID_MAPPING = new EarlybirdFieldConstants(); - - private final SchemaDocumentFactory schemaDocumentFactory; - private final EarlybirdCluster cluster; - private final Schema schema; - - public ThriftIndexingEventUpdateFactory( - Schema schema, - EarlybirdCluster cluster, - Decider decider, - CriticalExceptionHandler criticalExceptionHandler) { - this( - schema, - ThriftIndexingEventDocumentFactory.getSchemaDocumentFactory(schema, cluster, decider), - cluster, - criticalExceptionHandler - ); - } - - @VisibleForTesting - protected ThriftIndexingEventUpdateFactory( - Schema schema, - SchemaDocumentFactory schemaDocumentFactory, - EarlybirdCluster cluster, - CriticalExceptionHandler criticalExceptionHandler) { - super(criticalExceptionHandler); - this.schema = schema; - this.schemaDocumentFactory = schemaDocumentFactory; - this.cluster = cluster; - } - - @Override - public long getStatusId(ThriftIndexingEvent event) { - Preconditions.checkNotNull(event); - Preconditions.checkState( - event.isSetDocument(), "ThriftDocument is null inside ThriftIndexingEvent."); - - ThriftDocument thriftDocument; - try { - // Ideally, we should not call getSchemaSnapshot() here. But, as this is called only to - // retrieve status id and the ID field is static, this is fine for the purpose. - thriftDocument = ThriftDocumentPreprocessor.preprocess( - event.getDocument(), cluster, schema.getSchemaSnapshot()); - } catch (IOException e) { - throw new IllegalStateException("Unable to obtain tweet ID from ThriftDocument: " + event, e); - } - return ThriftDocumentUtil.getLongValue( - thriftDocument, EarlybirdFieldConstant.ID_FIELD.getFieldName(), ID_MAPPING); - } - - @Override - protected Document innerNewDocument(ThriftIndexingEvent event) throws IOException { - Preconditions.checkNotNull(event); - Preconditions.checkNotNull(event.getDocument()); - - ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); - - ThriftDocument document = ThriftDocumentPreprocessor.preprocess( - event.getDocument(), cluster, schemaSnapshot); - - return schemaDocumentFactory.newDocument(document); - } -} diff --git a/src/java/com/twitter/search/earlybird/document/TimeSlicedThriftIndexingEvent.java b/src/java/com/twitter/search/earlybird/document/TimeSlicedThriftIndexingEvent.java deleted file mode 100644 index 0e791a008..000000000 --- a/src/java/com/twitter/search/earlybird/document/TimeSlicedThriftIndexingEvent.java +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.search.earlybird.document; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; - -/** - * Object to encapsulate {@link ThriftIndexingEvent} with a time slice ID. 
- */ -public class TimeSlicedThriftIndexingEvent { - private final long timeSliceID; - private final ThriftIndexingEvent thriftIndexingEvent; - - public TimeSlicedThriftIndexingEvent(long timeSliceID, ThriftIndexingEvent thriftIndexingEvent) { - Preconditions.checkNotNull(thriftIndexingEvent); - - this.timeSliceID = timeSliceID; - this.thriftIndexingEvent = thriftIndexingEvent; - } - - public long getStatusID() { - return thriftIndexingEvent.getUid(); - } - - public long getTimeSliceID() { - return timeSliceID; - } - - public ThriftIndexingEvent getThriftIndexingEvent() { - return thriftIndexingEvent; - } - - @Override - public String toString() { - return "TimeSlicedThriftIndexingEvent{" - + "timeSliceID=" + timeSliceID - + ", thriftIndexingEvent=" + thriftIndexingEvent - + '}'; - } -} diff --git a/src/java/com/twitter/search/earlybird/document/TruncationTokenStreamWriter.java b/src/java/com/twitter/search/earlybird/document/TruncationTokenStreamWriter.java deleted file mode 100644 index 830ae7946..000000000 --- a/src/java/com/twitter/search/earlybird/document/TruncationTokenStreamWriter.java +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.search.earlybird.document; - -import com.twitter.common.text.token.TokenProcessor; -import com.twitter.common.text.token.TwitterTokenStream; -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.schema.SchemaDocumentFactory; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; - -public class TruncationTokenStreamWriter implements SchemaDocumentFactory.TokenStreamRewriter { - private static final int NEVER_TRUNCATE_CHARS_BELOW_POSITION = 140; - private static final String TRUNCATE_LONG_TWEETS_DECIDER_KEY_PREFIX = - "truncate_long_tweets_in_"; - private static final String NUM_TWEET_CHARACTERS_SUPPORTED_DECIDER_KEY_PREFIX = - "num_tweet_characters_supported_in_"; - - private static final SearchCounter NUM_TWEETS_TRUNCATED = - SearchCounter.export("num_tweets_truncated"); - private static final SearchLongGauge NUM_TWEET_CHARACTERS_SUPPORTED = - SearchLongGauge.export("num_tweet_characters_supported"); - - private final Decider decider; - private final String truncateLongTweetsDeciderKey; - private final String numCharsSupportedDeciderKey; - - /** - * Creates a TruncationTokenStreamWriter - */ - public TruncationTokenStreamWriter(EarlybirdCluster cluster, Decider decider) { - this.decider = decider; - - this.truncateLongTweetsDeciderKey = - TRUNCATE_LONG_TWEETS_DECIDER_KEY_PREFIX + cluster.name().toLowerCase(); - this.numCharsSupportedDeciderKey = - NUM_TWEET_CHARACTERS_SUPPORTED_DECIDER_KEY_PREFIX + cluster.name().toLowerCase(); - } - - @Override - public TwitterTokenStream rewrite(Schema.FieldInfo fieldInfo, TwitterTokenStream stream) { - if (EarlybirdFieldConstant.TEXT_FIELD.getFieldName().equals(fieldInfo.getName())) { - final int maxPosition = getTruncatePosition(); - NUM_TWEET_CHARACTERS_SUPPORTED.set(maxPosition); - if (maxPosition >= NEVER_TRUNCATE_CHARS_BELOW_POSITION) { - return new TokenProcessor(stream) { - @Override - public final boolean incrementToken() { - if (incrementInputStream()) { - if (offset() < maxPosition) { - return true; - } - NUM_TWEETS_TRUNCATED.increment(); - } - - return 
false; - } - }; - } - } - - return stream; - } - - /** - * Get the truncation position. - * - * @return the truncation position or -1 if truncation is disabled. - */ - private int getTruncatePosition() { - int maxPosition; - if (!DeciderUtil.isAvailableForRandomRecipient(decider, truncateLongTweetsDeciderKey)) { - return -1; - } - maxPosition = DeciderUtil.getAvailability(decider, numCharsSupportedDeciderKey); - - if (maxPosition < NEVER_TRUNCATE_CHARS_BELOW_POSITION) { - // Never truncate below NEVER_TRUNCATE_CHARS_BELOW_POSITION chars - maxPosition = NEVER_TRUNCATE_CHARS_BELOW_POSITION; - } - - return maxPosition; - } -} diff --git a/src/java/com/twitter/search/earlybird/document/TweetDocument.java b/src/java/com/twitter/search/earlybird/document/TweetDocument.java deleted file mode 100644 index 5d4dae6f4..000000000 --- a/src/java/com/twitter/search/earlybird/document/TweetDocument.java +++ /dev/null @@ -1,52 +0,0 @@ -package com.twitter.search.earlybird.document; - -import org.apache.lucene.document.Document; - -/** - * TweetDocument is a record produced by DocumentReader and TweetIndexUpdateReader - * for consumption by the partition indexer. - */ -public final class TweetDocument { - private final long tweetID; - private final long timeSliceID; - private final long eventTimeMs; - private final Document document; - - public TweetDocument( - long tweetID, - long timeSliceID, - long eventTimeMs, - Document document - ) { - this.tweetID = tweetID; - this.timeSliceID = timeSliceID; - this.eventTimeMs = eventTimeMs; - this.document = document; - } - - public long getTweetID() { - return tweetID; - } - - public long getTimeSliceID() { - return timeSliceID; - } - - public long getEventTimeMs() { - return eventTimeMs; - } - - public Document getDocument() { - return document; - } - - @Override - public String toString() { - return "TweetDocument{" - + "tweetID=" + tweetID - + ", timeSliceID=" + timeSliceID - + ", eventTimeMs=" + eventTimeMs - + ", document=" + document - + '}'; - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/AlreadyInServerSetUpdateException.java b/src/java/com/twitter/search/earlybird/exception/AlreadyInServerSetUpdateException.java deleted file mode 100644 index d6db98dad..000000000 --- a/src/java/com/twitter/search/earlybird/exception/AlreadyInServerSetUpdateException.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import com.twitter.common.zookeeper.ServerSet; - -/** - * Used when trying to join a server set when this earlybird is already in a server set. 
- */ -public class AlreadyInServerSetUpdateException extends ServerSet.UpdateException { - public AlreadyInServerSetUpdateException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/BadRequestException.java b/src/java/com/twitter/search/earlybird/exception/BadRequestException.java deleted file mode 100644 index b19db6571..000000000 --- a/src/java/com/twitter/search/earlybird/exception/BadRequestException.java +++ /dev/null @@ -1,11 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class BadRequestException extends Exception { - public BadRequestException(String message, Throwable cause) { - super(message, cause); - } - - public BadRequestException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/ClientException.java b/src/java/com/twitter/search/earlybird/exception/ClientException.java deleted file mode 100644 index 3387f0662..000000000 --- a/src/java/com/twitter/search/earlybird/exception/ClientException.java +++ /dev/null @@ -1,11 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class ClientException extends Exception { - public ClientException(Throwable t) { - super(t); - } - - public ClientException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/CriticalExceptionHandler.java b/src/java/com/twitter/search/earlybird/exception/CriticalExceptionHandler.java deleted file mode 100644 index a2b72511f..000000000 --- a/src/java/com/twitter/search/earlybird/exception/CriticalExceptionHandler.java +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.slf4j.Marker; -import org.slf4j.MarkerFactory; - -import com.twitter.search.common.config.Config; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.EarlybirdStatus; - -/** - * Used for handling exceptions considered critical. - * - * When you handle an exception with this class, two things might happen. - * 1. If earlybirds are still starting, we'll shut them down. - * 2. If earlybirds have started, we'll increment a counter that will cause alerts. - * - * If you want to verify that your code handles exceptions as you expect, you can use the - * helper class ExceptionCauser. - */ -public class CriticalExceptionHandler { - private static final Logger LOG = LoggerFactory.getLogger(CriticalExceptionHandler.class); - private static final Marker FATAL = MarkerFactory.getMarker("FATAL"); - - // This stat should remain at 0 during normal operations. - // This stat being non-zero should trigger alerts. - public static final SearchCounter CRITICAL_EXCEPTION_COUNT = - SearchCounter.export("fatal_exception_count"); - - public static final SearchCounter UNSAFE_MEMORY_ACCESS = - SearchCounter.export("unsafe_memory_access"); - - private Runnable shutdownHook; - - public void setShutdownHook(Runnable shutdownHook) { - this.shutdownHook = shutdownHook; - } - - /** - * Handle a critical exception. - * - * @param thrower Instance of the class where the exception was thrown. - * @param thrown The exception. 
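As an illustrative aside, a minimal sketch of how this handler is typically wired; the server::shutdown hook and the indexEvent call are hypothetical stand-ins:

    CriticalExceptionHandler handler = new CriticalExceptionHandler();
    handler.setShutdownHook(server::shutdown);  // hypothetical shutdown hook
    try {
      indexEvent(event);  // hypothetical indexing call
    } catch (Exception e) {
      // Increments the critical exception counter and, if Earlybird is still starting up,
      // runs the shutdown hook.
      handler.handle(this, e);
    }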
- */ - public void handle(Object thrower, Throwable thrown) { - if (thrown == null) { - return; - } - - try { - handleFatalException(thrower, thrown); - } catch (Throwable e) { - LOG.error("Unexpected exception in EarlybirdExceptionHandler.handle() while handling an " - + "unexpected exception from " + thrower.getClass(), e); - } - } - - @VisibleForTesting - boolean shouldIncrementFatalExceptionCounter(Throwable thrown) { - // See D212952 - // We don't want to get pages when this happens. - for (Throwable t = thrown; t != null; t = t.getCause()) { - if (t instanceof InternalError && t.getMessage() != null - && t.getMessage().contains("unsafe memory access operation")) { - // Don't treat InternalError caused by unsafe memory access operation which is usually - // triggered by SIGBUS for accessing a corrupted memory block. - UNSAFE_MEMORY_ACCESS.increment(); - return false; - } - } - - return true; - } - - /** - * Handle an exception that's considered fatal. - * - * @param thrower instance of the class where the exception was thrown. - * @param thrown The Error or Exception. - */ - private void handleFatalException(Object thrower, Throwable thrown) { - LOG.error(FATAL, "Fatal exception in " + thrower.getClass() + ":", thrown); - - if (shouldIncrementFatalExceptionCounter(thrown)) { - CRITICAL_EXCEPTION_COUNT.increment(); - } - - if (EarlybirdStatus.isStarting()) { - LOG.error(FATAL, "Got fatal exception while starting up, exiting ..."); - if (this.shutdownHook != null) { - this.shutdownHook.run(); - } else { - LOG.error("earlybirdServer not set, can't shut down."); - } - - if (!Config.environmentIsTest()) { - // Sleep for 3 minutes to allow the fatal exception to be caught by observability. - try { - Thread.sleep(3 * 60 * 1000); - } catch (InterruptedException e) { - LOG.error(FATAL, "interupted sleep while shutting down."); - } - LOG.info("Terminate JVM."); - //CHECKSTYLE:OFF RegexpSinglelineJava - // See SEARCH-15256 - System.exit(-1); - //CHECKSTYLE:ON RegexpSinglelineJava - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/EarlybirdException.java b/src/java/com/twitter/search/earlybird/exception/EarlybirdException.java deleted file mode 100644 index fe82488a4..000000000 --- a/src/java/com/twitter/search/earlybird/exception/EarlybirdException.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.earlybird.exception; - -/** - * General Earlybird exception class to use instead of the Java exception class. 
- */ -public class EarlybirdException extends Exception { - public EarlybirdException(Throwable cause) { - super(cause); - } - - public EarlybirdException(String message) { - super(message); - } - - public EarlybirdException(String message, Throwable cause) { - super(message, cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/EarlybirdFinagleServerMonitor.java b/src/java/com/twitter/search/earlybird/exception/EarlybirdFinagleServerMonitor.java deleted file mode 100644 index 92971b48c..000000000 --- a/src/java/com/twitter/search/earlybird/exception/EarlybirdFinagleServerMonitor.java +++ /dev/null @@ -1,25 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import com.twitter.finagle.Failure; -import com.twitter.util.AbstractMonitor; - -public class EarlybirdFinagleServerMonitor extends AbstractMonitor { - private final CriticalExceptionHandler criticalExceptionHandler; - - public EarlybirdFinagleServerMonitor(CriticalExceptionHandler criticalExceptionHandler) { - this.criticalExceptionHandler = criticalExceptionHandler; - } - - @Override - public boolean handle(Throwable e) { - if (e instanceof Failure) { - // skip Finagle failure - return true; - } - - criticalExceptionHandler.handle(this, e); - - // We return true here because we handle all exceptions. - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/EarlybirdRuntimeException.java b/src/java/com/twitter/search/earlybird/exception/EarlybirdRuntimeException.java deleted file mode 100644 index a570324fd..000000000 --- a/src/java/com/twitter/search/earlybird/exception/EarlybirdRuntimeException.java +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class EarlybirdRuntimeException extends RuntimeException { - public EarlybirdRuntimeException(Throwable cause) { - super(cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/EarlybirdStartupException.java b/src/java/com/twitter/search/earlybird/exception/EarlybirdStartupException.java deleted file mode 100644 index 5c40cd0e3..000000000 --- a/src/java/com/twitter/search/earlybird/exception/EarlybirdStartupException.java +++ /dev/null @@ -1,20 +0,0 @@ -package com.twitter.search.earlybird.exception; - -/** - * Thrown by code that is executed during startup and used to communicate to caller that startup - * has failed. Generally results in shutting down of the server, but check on your own if you - * need to. 
- */ -public class EarlybirdStartupException extends Exception { - public EarlybirdStartupException(Throwable cause) { - super(cause); - } - - public EarlybirdStartupException(String message) { - super(message); - } - - public EarlybirdStartupException(String message, Throwable cause) { - super(message, cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/FlushVersionMismatchException.java b/src/java/com/twitter/search/earlybird/exception/FlushVersionMismatchException.java deleted file mode 100644 index e6cee497d..000000000 --- a/src/java/com/twitter/search/earlybird/exception/FlushVersionMismatchException.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import java.io.IOException; - -public class FlushVersionMismatchException extends IOException { - public FlushVersionMismatchException(Throwable cause) { - super(cause); - } - - public FlushVersionMismatchException(String message) { - super(message); - } - - public FlushVersionMismatchException(String message, Throwable cause) { - super(message, cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/MissingKafkaTopicException.java b/src/java/com/twitter/search/earlybird/exception/MissingKafkaTopicException.java deleted file mode 100644 index f6d1c675d..000000000 --- a/src/java/com/twitter/search/earlybird/exception/MissingKafkaTopicException.java +++ /dev/null @@ -1,11 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class MissingKafkaTopicException extends Exception { - public MissingKafkaTopicException(String message) { - super(message); - } - - public MissingKafkaTopicException(String message, Throwable cause) { - super(message, cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/MissingUserException.java b/src/java/com/twitter/search/earlybird/exception/MissingUserException.java deleted file mode 100644 index fab3886f2..000000000 --- a/src/java/com/twitter/search/earlybird/exception/MissingUserException.java +++ /dev/null @@ -1,4 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class MissingUserException extends Exception { -} diff --git a/src/java/com/twitter/search/earlybird/exception/NotInServerSetUpdateException.java b/src/java/com/twitter/search/earlybird/exception/NotInServerSetUpdateException.java deleted file mode 100644 index 1be7e7679..000000000 --- a/src/java/com/twitter/search/earlybird/exception/NotInServerSetUpdateException.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import com.twitter.common.zookeeper.ServerSet; - -/** - * Used when trying to leave a server set when this earlybird is already out of the server set. 
- */ -public class NotInServerSetUpdateException extends ServerSet.UpdateException { - public NotInServerSetUpdateException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/TransientException.java b/src/java/com/twitter/search/earlybird/exception/TransientException.java deleted file mode 100644 index 76f6b8dc6..000000000 --- a/src/java/com/twitter/search/earlybird/exception/TransientException.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird.exception; - -public class TransientException extends Exception { - public TransientException(Throwable t) { - super(t); - } - - public TransientException(String message, Throwable cause) { - super(message, cause); - } - - public TransientException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/UncaughtExceptionHandler.java b/src/java/com/twitter/search/earlybird/exception/UncaughtExceptionHandler.java deleted file mode 100644 index 300e855fa..000000000 --- a/src/java/com/twitter/search/earlybird/exception/UncaughtExceptionHandler.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import com.twitter.util.AbstractMonitor; - -public class UncaughtExceptionHandler extends AbstractMonitor { - private final CriticalExceptionHandler criticalExceptionHandler; - - public UncaughtExceptionHandler() { - this.criticalExceptionHandler = new CriticalExceptionHandler(); - } - - public void setShutdownHook(Runnable shutdown) { - this.criticalExceptionHandler.setShutdownHook(shutdown); - } - - @Override - public boolean handle(Throwable e) { - criticalExceptionHandler.handle(this, e); - - // We return true here because we handle all exceptions. - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird/exception/WrappedKafkaApiException.java b/src/java/com/twitter/search/earlybird/exception/WrappedKafkaApiException.java deleted file mode 100644 index de5126dad..000000000 --- a/src/java/com/twitter/search/earlybird/exception/WrappedKafkaApiException.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.earlybird.exception; - -import org.apache.kafka.common.errors.ApiException; - -/** - * Kafka's ApiException class doesn't retain its stack trace (see its source code). - * As a result a kafka exception that propagates up the call chain can't point to where exactly - * did the exception happen in our code. As a solution, use this class when calling kafka API - * methods. 
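A minimal sketch of the intended wrapping pattern (the consumer variable is illustrative):

    try {
      consumer.commitSync();
    } catch (org.apache.kafka.common.errors.ApiException e) {
      // Re-throw through the wrapper so the stack trace points at this call site.
      throw new WrappedKafkaApiException("Failed to commit consumer offsets", e);
    }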
- */ -public class WrappedKafkaApiException extends RuntimeException { - public WrappedKafkaApiException(ApiException cause) { - super(cause); - } - - public WrappedKafkaApiException(String message, ApiException cause) { - super(message, cause); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/EarlybirdIndexConfigUtil.java b/src/java/com/twitter/search/earlybird/factory/EarlybirdIndexConfigUtil.java deleted file mode 100644 index 7fcef3e0b..000000000 --- a/src/java/com/twitter/search/earlybird/factory/EarlybirdIndexConfigUtil.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.RealtimeEarlybirdIndexConfig; -import com.twitter.search.earlybird.archive.ArchiveOnDiskEarlybirdIndexConfig; -import com.twitter.search.earlybird.archive.ArchiveSearchPartitionManager; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; - -public final class EarlybirdIndexConfigUtil { - private EarlybirdIndexConfigUtil() { - } - - /** - * Creates the index config for this earlybird. - */ - public static EarlybirdIndexConfig createEarlybirdIndexConfig( - Decider decider, SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - if (isArchiveSearch()) { - return new ArchiveOnDiskEarlybirdIndexConfig(decider, searchIndexingMetricSet, - criticalExceptionHandler); - } else if (isProtectedSearch()) { - return new RealtimeEarlybirdIndexConfig( - EarlybirdCluster.PROTECTED, decider, searchIndexingMetricSet, criticalExceptionHandler); - } else if (isRealtimeCG()) { - return new RealtimeEarlybirdIndexConfig( - EarlybirdCluster.REALTIME_CG, decider, searchIndexingMetricSet, criticalExceptionHandler); - } else { - return new RealtimeEarlybirdIndexConfig( - EarlybirdCluster.REALTIME, decider, searchIndexingMetricSet, criticalExceptionHandler); - } - } - - public static boolean isArchiveSearch() { - // Re-reading config on each call so that tests can reliably overwrite this - return EarlybirdConfig.getString("partition_manager", "realtime") - .equals(ArchiveSearchPartitionManager.CONFIG_NAME); - } - - private static boolean isProtectedSearch() { - // Re-reading config on each call so that tests can reliably overwrite this - return EarlybirdConfig.getBool("protected_index", false); - } - - private static boolean isRealtimeCG() { - // Re-reading config on each call so that tests can reliably overwrite this - return EarlybirdConfig.getBool("realtime_cg_index", false); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/EarlybirdKafkaConsumersFactory.java b/src/java/com/twitter/search/earlybird/factory/EarlybirdKafkaConsumersFactory.java deleted file mode 100644 index e360a3df1..000000000 --- a/src/java/com/twitter/search/earlybird/factory/EarlybirdKafkaConsumersFactory.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import org.apache.kafka.clients.consumer.KafkaConsumer; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; - -public interface EarlybirdKafkaConsumersFactory { - /** - * Create a kafka consumer with default records to be polled. 
- */ - KafkaConsumer createKafkaConsumer( - String clientID); - - /** - * Create a kafka consumer with a set number of records to be polled. - */ - KafkaConsumer createKafkaConsumer( - String clientID, int maxPollRecords); -} diff --git a/src/java/com/twitter/search/earlybird/factory/EarlybirdServerFactory.java b/src/java/com/twitter/search/earlybird/factory/EarlybirdServerFactory.java deleted file mode 100644 index c8459a3f0..000000000 --- a/src/java/com/twitter/search/earlybird/factory/EarlybirdServerFactory.java +++ /dev/null @@ -1,353 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.aurora.AuroraInstanceKey; -import com.twitter.search.common.aurora.AuroraSchedulerClient; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.EarlybirdDarkProxy; -import com.twitter.search.earlybird.EarlybirdFinagleServerManager; -import com.twitter.search.earlybird.EarlybirdFuturePoolManager; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.EarlybirdServer; -import com.twitter.search.earlybird.EarlybirdServerSetManager; -import com.twitter.search.earlybird.EarlybirdWarmUpManager; -import com.twitter.search.earlybird.QualityFactor; -import com.twitter.search.earlybird.UpdateableEarlybirdStateManager; -import com.twitter.search.earlybird.archive.ArchiveEarlybirdIndexConfig; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserUpdatesChecker; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.partition.AudioSpaceEventsStreamIndexer; -import com.twitter.search.earlybird.partition.AudioSpaceTable; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.EarlybirdStartup; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.PartitionManager; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.partition.UserScrubGeoEventStreamIndexer; -import com.twitter.search.earlybird.partition.UserUpdatesStreamIndexer; -import com.twitter.search.earlybird.querycache.QueryCacheConfig; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.util.TermCountMonitor; -import com.twitter.search.earlybird.util.TweetCountMonitor; - -/** - * This is the wiring file that builds 
EarlybirdServers. - * Production and test code share this same wiring file. - *

    - * To supply mocks for testing, one can do so by supplying a different - * EarlybirdWiringModule to this wiring file. - */ -public final class EarlybirdServerFactory { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdServerFactory.class); - - /** - * Creates the EarlybirdServer based on the bindings in the given wire module. - * - * @param earlybirdWireModule The wire module that specifies all required bindings. - */ - public EarlybirdServer makeEarlybirdServer(EarlybirdWireModule earlybirdWireModule) - throws IOException { - LOG.info("Started making an Earlybird server"); - CriticalExceptionHandler criticalExceptionHandler = new CriticalExceptionHandler(); - Decider decider = earlybirdWireModule.provideDecider(); - SearchDecider searchDecider = new SearchDecider(decider); - - EarlybirdWireModule.ZooKeeperClients zkClients = earlybirdWireModule.provideZooKeeperClients(); - ZooKeeperTryLockFactory zkTryLockFactory = - zkClients.stateClient.createZooKeeperTryLockFactory(); - - EarlybirdIndexConfig earlybirdIndexConfig = - earlybirdWireModule.provideEarlybirdIndexConfig( - decider, earlybirdWireModule.provideSearchIndexingMetricSet(), - criticalExceptionHandler); - - SearchStatsReceiver earlybirdServerStats = - earlybirdWireModule.provideEarlybirdServerStatsReceiver(); - - EarlybirdSearcherStats tweetsSearcherStats = - earlybirdWireModule.provideTweetsSearcherStats(); - - DynamicPartitionConfig dynamicPartitionConfig = - earlybirdWireModule.provideDynamicPartitionConfig(); - - PartitionConfig partitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - LOG.info("Partition config info [Cluster: {}, Tier: {}, Partition: {}, Replica: {}]", - partitionConfig.getClusterName(), - partitionConfig.getTierName(), - partitionConfig.getIndexingHashPartitionID(), - partitionConfig.getHostPositionWithinHashPartition()); - - Clock clock = earlybirdWireModule.provideClock(); - UserUpdatesChecker userUpdatesChecker = - new UserUpdatesChecker(clock, decider, earlybirdIndexConfig.getCluster()); - - UserTable userTable = UserTable.newTableWithDefaultCapacityAndPredicate( - earlybirdIndexConfig.getUserTableFilter(partitionConfig)::apply); - - UserScrubGeoMap userScrubGeoMap = new UserScrubGeoMap(); - - AudioSpaceTable audioSpaceTable = new AudioSpaceTable(clock); - - SegmentSyncConfig segmentSyncConfig = - earlybirdWireModule.provideSegmentSyncConfig(earlybirdIndexConfig.getCluster()); - - SegmentManager segmentManager = earlybirdWireModule.provideSegmentManager( - dynamicPartitionConfig, - earlybirdIndexConfig, - earlybirdWireModule.provideSearchIndexingMetricSet(), - tweetsSearcherStats, - earlybirdServerStats, - userUpdatesChecker, - segmentSyncConfig, - userTable, - userScrubGeoMap, - clock, - criticalExceptionHandler); - - QueryCacheConfig config = earlybirdWireModule.provideQueryCacheConfig(earlybirdServerStats); - - QueryCacheManager queryCacheManager = earlybirdWireModule.provideQueryCacheManager( - config, - earlybirdIndexConfig, - partitionConfig.getMaxEnabledLocalSegments(), - userTable, - userScrubGeoMap, - earlybirdWireModule.provideQueryCacheUpdateTaskScheduledExecutorFactory(), - earlybirdServerStats, - tweetsSearcherStats, - decider, - criticalExceptionHandler, - clock); - - EarlybirdServerSetManager serverSetManager = earlybirdWireModule.provideServerSetManager( - zkClients.discoveryClient, - dynamicPartitionConfig, - earlybirdServerStats, - EarlybirdConfig.getThriftPort(), - ""); - - EarlybirdWarmUpManager warmUpManager = - 
earlybirdWireModule.provideWarmUpManager(zkClients.discoveryClient, - dynamicPartitionConfig, - earlybirdServerStats, - decider, - clock, - EarlybirdConfig.getWarmUpThriftPort(), - "warmup_"); - - EarlybirdDarkProxy earlybirdDarkProxy = earlybirdWireModule.provideEarlybirdDarkProxy( - new SearchDecider(decider), - earlybirdWireModule.provideFinagleStatsReceiver(), - serverSetManager, - warmUpManager, - partitionConfig.getClusterName()); - - UserUpdatesStreamIndexer userUpdatesStreamIndexer = - earlybirdWireModule.provideUserUpdatesKafkaConsumer(segmentManager); - - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer = - earlybirdWireModule.provideUserScrubGeoEventKafkaConsumer(segmentManager); - - AudioSpaceEventsStreamIndexer audioSpaceEventsStreamIndexer = - earlybirdWireModule.provideAudioSpaceEventsStreamIndexer(audioSpaceTable, clock); - - MultiSegmentTermDictionaryManager.Config termDictionaryConfig = - earlybirdWireModule.provideMultiSegmentTermDictionaryManagerConfig(); - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager = - earlybirdWireModule.provideMultiSegmentTermDictionaryManager( - termDictionaryConfig, - segmentManager, - earlybirdServerStats, - decider, - earlybirdIndexConfig.getCluster()); - - TermCountMonitor termCountMonitor = - earlybirdWireModule.provideTermCountMonitor( - segmentManager, earlybirdWireModule.provideTermCountMonitorScheduledExecutorFactory(), - earlybirdServerStats, - criticalExceptionHandler); - TweetCountMonitor tweetCountMonitor = - earlybirdWireModule.provideTweetCountMonitor( - segmentManager, earlybirdWireModule.provideTweetCountMonitorScheduledExecutorFactory(), - earlybirdServerStats, - criticalExceptionHandler); - - ScoringModelsManager scoringModelsManager = earlybirdWireModule.provideScoringModelsManager( - earlybirdServerStats, - earlybirdIndexConfig - ); - - TensorflowModelsManager tensorflowModelsManager = - earlybirdWireModule.provideTensorflowModelsManager( - earlybirdServerStats, - "tf_loader", - decider, - earlybirdIndexConfig - ); - - AuroraSchedulerClient schedulerClient = null; - AuroraInstanceKey auroraInstanceKey = EarlybirdConfig.getAuroraInstanceKey(); - if (auroraInstanceKey != null) { - schedulerClient = new AuroraSchedulerClient(auroraInstanceKey.getCluster()); - } - - UpdateableEarlybirdStateManager earlybirdStateManager = - earlybirdWireModule.provideUpdateableEarlybirdStateManager( - earlybirdIndexConfig, - dynamicPartitionConfig, - zkClients.stateClient, - schedulerClient, - earlybirdWireModule.provideStateUpdateManagerExecutorFactory(), - scoringModelsManager, - tensorflowModelsManager, - earlybirdServerStats, - new SearchDecider(decider), - criticalExceptionHandler); - - EarlybirdFuturePoolManager futurePoolManager = earlybirdWireModule.provideFuturePoolManager(); - EarlybirdFinagleServerManager finagleServerManager = - earlybirdWireModule.provideFinagleServerManager(criticalExceptionHandler); - - PartitionManager partitionManager = null; - if (EarlybirdIndexConfigUtil.isArchiveSearch()) { - partitionManager = buildArchivePartitionManager( - earlybirdWireModule, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - zkTryLockFactory, - earlybirdIndexConfig, - dynamicPartitionConfig, - segmentManager, - queryCacheManager, - earlybirdServerStats, - serverSetManager, - earlybirdWireModule.providePartitionManagerExecutorFactory(), - earlybirdWireModule.provideSimpleUserUpdateIndexerScheduledExecutorFactory(), - clock, - segmentSyncConfig, - criticalExceptionHandler); - } else { - 
LOG.info("Not creating PartitionManager"); - } - - EarlybirdSegmentFactory earlybirdSegmentFactory = new EarlybirdSegmentFactory( - earlybirdIndexConfig, - earlybirdWireModule.provideSearchIndexingMetricSet(), - tweetsSearcherStats, - clock); - - EarlybirdStartup earlybirdStartup = earlybirdWireModule.provideEarlybirdStartup( - partitionManager, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - audioSpaceEventsStreamIndexer, - dynamicPartitionConfig, - criticalExceptionHandler, - segmentManager, - multiSegmentTermDictionaryManager, - queryCacheManager, - zkTryLockFactory, - serverSetManager, - clock, - segmentSyncConfig, - earlybirdSegmentFactory, - earlybirdIndexConfig.getCluster(), - searchDecider); - - QualityFactor qualityFactor = earlybirdWireModule.provideQualityFactor( - decider, - earlybirdServerStats); - - EarlybirdServer earlybirdServer = new EarlybirdServer( - queryCacheManager, - zkClients.stateClient, - decider, - earlybirdIndexConfig, - dynamicPartitionConfig, - partitionManager, - segmentManager, - audioSpaceTable, - termCountMonitor, - tweetCountMonitor, - earlybirdStateManager, - futurePoolManager, - finagleServerManager, - serverSetManager, - warmUpManager, - earlybirdServerStats, - tweetsSearcherStats, - scoringModelsManager, - tensorflowModelsManager, - clock, - multiSegmentTermDictionaryManager, - earlybirdDarkProxy, - segmentSyncConfig, - earlybirdWireModule.provideQueryTimeoutFactory(), - earlybirdStartup, - qualityFactor, - earlybirdWireModule.provideSearchIndexingMetricSet()); - - earlybirdStateManager.setEarlybirdServer(earlybirdServer); - criticalExceptionHandler.setShutdownHook(earlybirdServer::shutdown); - - return earlybirdServer; - } - - private PartitionManager buildArchivePartitionManager( - EarlybirdWireModule earlybirdWireModule, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - ZooKeeperTryLockFactory zkTryLockFactory, - EarlybirdIndexConfig earlybirdIndexConfig, - DynamicPartitionConfig dynamicPartitionConfig, - SegmentManager segmentManager, - QueryCacheManager queryCacheManager, - SearchStatsReceiver searchStatsReceiver, - EarlybirdServerSetManager serverSetManager, - ScheduledExecutorServiceFactory partitionManagerExecutorServiceFactory, - ScheduledExecutorServiceFactory simpleUserUpdateIndexerExecutorFactory, - Clock clock, - SegmentSyncConfig segmentSyncConfig, - CriticalExceptionHandler criticalExceptionHandler) - throws IOException { - - Preconditions.checkState(earlybirdIndexConfig instanceof ArchiveEarlybirdIndexConfig); - LOG.info("Creating ArchiveSearchPartitionManager"); - return earlybirdWireModule.provideFullArchivePartitionManager( - zkTryLockFactory, - queryCacheManager, - segmentManager, - dynamicPartitionConfig, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - searchStatsReceiver, - (ArchiveEarlybirdIndexConfig) earlybirdIndexConfig, - serverSetManager, - partitionManagerExecutorServiceFactory, - simpleUserUpdateIndexerExecutorFactory, - earlybirdWireModule.provideSearchIndexingMetricSet(), - clock, - segmentSyncConfig, - criticalExceptionHandler); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/EarlybirdWireModule.java b/src/java/com/twitter/search/earlybird/factory/EarlybirdWireModule.java deleted file mode 100644 index a6b67e021..000000000 --- a/src/java/com/twitter/search/earlybird/factory/EarlybirdWireModule.java +++ /dev/null @@ -1,901 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import 
java.io.IOException; -import java.lang.management.ManagementFactory; -import java.util.Optional; -import java.util.concurrent.ScheduledThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.sun.management.OperatingSystemMXBean; - -import org.apache.directory.api.util.Strings; -import org.apache.hadoop.fs.FileSystem; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.common.TopicPartition; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.decider.Decider; -import com.twitter.finagle.stats.MetricsStatsReceiver; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.search.common.aurora.AuroraSchedulerClient; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.file.FileUtils; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchStatsReceiverImpl; -import com.twitter.search.common.partitioning.zookeeper.SearchZkClient; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.termination.QueryTimeoutFactory; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; -import com.twitter.search.common.util.io.kafka.ThriftDeserializer; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; -import com.twitter.search.earlybird.EarlybirdCPUQualityFactor; -import com.twitter.search.earlybird.EarlybirdDarkProxy; -import com.twitter.search.earlybird.EarlybirdFinagleServerManager; -import com.twitter.search.earlybird.EarlybirdFuturePoolManager; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.EarlybirdProductionFinagleServerManager; -import com.twitter.search.earlybird.EarlybirdServerSetManager; -import com.twitter.search.earlybird.EarlybirdWarmUpManager; -import com.twitter.search.earlybird.QualityFactor; -import com.twitter.search.earlybird.ServerSetMember; -import com.twitter.search.earlybird.UpdateableEarlybirdStateManager; -import com.twitter.search.earlybird.archive.ArchiveEarlybirdIndexConfig; -import com.twitter.search.earlybird.archive.ArchiveSearchPartitionManager; -import com.twitter.search.earlybird.common.CaughtUpMonitor; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserUpdatesChecker; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.MissingKafkaTopicException; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import 
com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.partition.AudioSpaceEventsStreamIndexer; -import com.twitter.search.earlybird.partition.AudioSpaceTable; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.EarlybirdIndexFlusher; -import com.twitter.search.earlybird.partition.EarlybirdIndexLoader; -import com.twitter.search.earlybird.partition.EarlybirdKafkaConsumer; -import com.twitter.search.earlybird.partition.EarlybirdStartup; -import com.twitter.search.earlybird.partition.OptimizationAndFlushingCoordinationLock; -import com.twitter.search.earlybird.partition.TimeLimitedHadoopExistsCall; -import com.twitter.search.earlybird.partition.UserScrubGeoEventStreamIndexer; -import com.twitter.search.earlybird.partition.freshstartup.FreshStartupHandler; -import com.twitter.search.earlybird.partition.HdfsUtil; -import com.twitter.search.earlybird.partition.KafkaStartup; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.partition.PartitionManager; -import com.twitter.search.earlybird.partition.PartitionManagerStartup; -import com.twitter.search.earlybird.partition.PartitionWriter; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; -import com.twitter.search.earlybird.partition.StartupUserEventIndexer; -import com.twitter.search.earlybird.partition.TweetCreateHandler; -import com.twitter.search.earlybird.partition.TweetUpdateHandler; -import com.twitter.search.earlybird.partition.UserUpdatesStreamIndexer; -import com.twitter.search.earlybird.querycache.QueryCacheConfig; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdAction; -import com.twitter.search.earlybird.util.EarlybirdDecider; -import com.twitter.search.earlybird.util.TermCountMonitor; -import com.twitter.search.earlybird.util.TweetCountMonitor; -import com.twitter.ubs.thriftjava.AudioSpaceBaseEvent; - -/** - * Production module that provides Earlybird components. - */ -public class EarlybirdWireModule { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdWireModule.class); - private static final int MAX_POLL_RECORDS = 1000; - - /** - * How many threads we will use for building up the query cache during startup. - * The number of threads will be set to 1 after this earlybird is current. - */ - private static final int QUERY_CACHE_NUM_WORKER_THREADS_AT_STARTUP = - EarlybirdConfig.getInt("query_cache_updater_startup_threads", 1); - - /** - * Scheduled executor service factory can be re-used in production. - * All the managers can share the same executor service factory. 
- */ - private final ScheduledExecutorServiceFactory sharedExecutorServiceFactory = - new ScheduledExecutorServiceFactory(); - - private final SearchStatsReceiver sharedSearchStatsReceiver = new SearchStatsReceiverImpl(); - private final StatsReceiver sharedFinagleStatsReceiver = new MetricsStatsReceiver(); - - private final SearchIndexingMetricSet searchIndexingMetricSet = - new SearchIndexingMetricSet(sharedSearchStatsReceiver); - - private final EarlybirdSearcherStats tweetsSearcherStats = - new EarlybirdSearcherStats(sharedSearchStatsReceiver); - - private final CaughtUpMonitor indexCaughtUpMonitor = new CaughtUpMonitor("dl_index"); - - public CaughtUpMonitor provideIndexCaughtUpMonitor() { - return indexCaughtUpMonitor; - } - - private final CaughtUpMonitor kafkaIndexCaughtUpMonitor = new CaughtUpMonitor("kafka_index"); - - public CaughtUpMonitor provideKafkaIndexCaughtUpMonitor() { - return kafkaIndexCaughtUpMonitor; - } - - private final OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock = - new OptimizationAndFlushingCoordinationLock(); - - public OptimizationAndFlushingCoordinationLock provideOptimizationAndFlushingCoordinationLock() { - return optimizationAndFlushingCoordinationLock; - } - - public QueryTimeoutFactory provideQueryTimeoutFactory() { - return new QueryTimeoutFactory(); - } - - public static class ZooKeeperClients { - public ZooKeeperProxy discoveryClient; - public ZooKeeperProxy stateClient; - - public ZooKeeperClients() { - this( - SearchZkClient.getServiceDiscoveryZooKeeperClient(), - SearchZkClient.getSZooKeeperClient()); - } - - public ZooKeeperClients(ZooKeeperProxy discoveryClient, ZooKeeperProxy stateClient) { - this.discoveryClient = discoveryClient; - this.stateClient = stateClient; - } - } - - /** - * Provides the earlybird decider. - */ - public Decider provideDecider() { - return EarlybirdDecider.initialize(); - } - - /** - * Provides the set of ZooKeeper clients to be used by earlybird. - */ - public ZooKeeperClients provideZooKeeperClients() { - return new ZooKeeperClients(); - } - - /** - * Provides the query cache config. - */ - public QueryCacheConfig provideQueryCacheConfig(SearchStatsReceiver searchStatsReceiver) { - return new QueryCacheConfig(searchStatsReceiver); - } - - /** - * Provides the earlybird index config. - */ - public EarlybirdIndexConfig provideEarlybirdIndexConfig( - Decider decider, SearchIndexingMetricSet indexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler) { - return EarlybirdIndexConfigUtil.createEarlybirdIndexConfig(decider, indexingMetricSet, - criticalExceptionHandler); - } - - public DynamicPartitionConfig provideDynamicPartitionConfig() { - return new DynamicPartitionConfig(PartitionConfigUtil.initPartitionConfig()); - } - - /** - * Provides the segment manager to be used by this earlybird. 
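The two-argument ZooKeeperClients constructor above is what makes the wire module swappable for tests: a test supplies its own module and overrides the relevant provide* methods before handing it to EarlybirdServerFactory. A minimal sketch of such an override, assuming a Mockito-style mock for ZooKeeperProxy (the mocking library and the class name are assumptions; only the constructor and provider signature come from the code above):

import static org.mockito.Mockito.mock;

import com.twitter.search.common.util.zookeeper.ZooKeeperProxy;
import com.twitter.search.earlybird.factory.EarlybirdWireModule;

// Illustrative test wiring (not part of the original tree): overrides a single
// provider so the rest of EarlybirdServerFactory.makeEarlybirdServer() can be
// exercised without talking to real ZooKeeper.
public class FakeZkEarlybirdWireModule extends EarlybirdWireModule {
  @Override
  public ZooKeeperClients provideZooKeeperClients() {
    // Both proxies are mocks; discovery and state traffic never leaves the test.
    return new ZooKeeperClients(mock(ZooKeeperProxy.class), mock(ZooKeeperProxy.class));
  }
}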
- */ - public SegmentManager provideSegmentManager( - DynamicPartitionConfig dynamicPartitionConfig, - EarlybirdIndexConfig earlybirdIndexConfig, - SearchIndexingMetricSet partitionIndexingMetricSet, - EarlybirdSearcherStats searcherStats, - SearchStatsReceiver earlybirdServerStats, - UserUpdatesChecker userUpdatesChecker, - SegmentSyncConfig segmentSyncConfig, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - Clock clock, - CriticalExceptionHandler criticalExceptionHandler) { - return new SegmentManager( - dynamicPartitionConfig, - earlybirdIndexConfig, - partitionIndexingMetricSet, - searcherStats, - earlybirdServerStats, - userUpdatesChecker, - segmentSyncConfig, - userTable, - userScrubGeoMap, - clock, - EarlybirdConfig.getMaxSegmentSize(), - criticalExceptionHandler, - provideKafkaIndexCaughtUpMonitor()); - } - - public QueryCacheManager provideQueryCacheManager( - QueryCacheConfig config, - EarlybirdIndexConfig indexConfig, - int maxEnabledSegments, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - ScheduledExecutorServiceFactory queryCacheUpdaterScheduledExecutorFactory, - SearchStatsReceiver searchStatsReceiver, - EarlybirdSearcherStats searcherStats, - Decider decider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - return new QueryCacheManager(config, indexConfig, maxEnabledSegments, userTable, - userScrubGeoMap, queryCacheUpdaterScheduledExecutorFactory, searchStatsReceiver, - searcherStats, decider, criticalExceptionHandler, clock); - } - - public TermCountMonitor provideTermCountMonitor( - SegmentManager segmentManager, ScheduledExecutorServiceFactory executorServiceFactory, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - return new TermCountMonitor(segmentManager, executorServiceFactory, 500, TimeUnit.MILLISECONDS, - searchStatsReceiver, criticalExceptionHandler); - } - - public TweetCountMonitor provideTweetCountMonitor( - SegmentManager segmentManager, - ScheduledExecutorServiceFactory executorServiceFactory, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - return new TweetCountMonitor(segmentManager, executorServiceFactory, 500, - TimeUnit.MILLISECONDS, searchStatsReceiver, criticalExceptionHandler); - } - - /** - * Returns a manager that keeps track of earlybird's global state while it runs. 
- */ - public UpdateableEarlybirdStateManager provideUpdateableEarlybirdStateManager( - EarlybirdIndexConfig earlybirdIndexConfig, - DynamicPartitionConfig dynamicPartitionConfig, - ZooKeeperProxy zooKeeperClient, - AuroraSchedulerClient schedulerClient, - ScheduledExecutorServiceFactory executorServiceFactory, - ScoringModelsManager scoringModelsManager, - TensorflowModelsManager tensorflowModelsManager, - SearchStatsReceiver searchStatsReceiver, - SearchDecider searchDecider, - CriticalExceptionHandler criticalExceptionHandler) { - Clock clock = provideClockForStateManager(); - - return new UpdateableEarlybirdStateManager( - earlybirdIndexConfig, dynamicPartitionConfig, zooKeeperClient, schedulerClient, - executorServiceFactory, scoringModelsManager, tensorflowModelsManager, searchStatsReceiver, - searchDecider, criticalExceptionHandler, - clock); - } - - public Clock provideClockForStateManager() { - return this.provideClock(); - } - - public ScheduledExecutorServiceFactory providePartitionManagerExecutorFactory() { - return sharedExecutorServiceFactory; - } - - public ScheduledExecutorServiceFactory provideStateUpdateManagerExecutorFactory() { - return sharedExecutorServiceFactory; - } - - public ScheduledExecutorServiceFactory provideTermCountMonitorScheduledExecutorFactory() { - return sharedExecutorServiceFactory; - } - - public ScheduledExecutorServiceFactory provideTweetCountMonitorScheduledExecutorFactory() { - return sharedExecutorServiceFactory; - } - - /** - * Provides the ScheduledExecutorServiceFactory that will be used to schedule all query cache - * update tasks. - */ - public ScheduledExecutorServiceFactory provideQueryCacheUpdateTaskScheduledExecutorFactory() { - return new ScheduledExecutorServiceFactory() { - @Override - public QueryCacheUpdaterScheduledExecutorService build( - String threadNameFormat, boolean isDaemon) { - ScheduledThreadPoolExecutor threadpoolExecutor = - new ScheduledThreadPoolExecutor(QUERY_CACHE_NUM_WORKER_THREADS_AT_STARTUP, - buildThreadFactory(threadNameFormat, isDaemon)); - threadpoolExecutor.setMaximumPoolSize(QUERY_CACHE_NUM_WORKER_THREADS_AT_STARTUP); - threadpoolExecutor.setCorePoolSize(QUERY_CACHE_NUM_WORKER_THREADS_AT_STARTUP); - threadpoolExecutor.setExecuteExistingDelayedTasksAfterShutdownPolicy(false); - threadpoolExecutor.setContinueExistingPeriodicTasksAfterShutdownPolicy(false); - threadpoolExecutor.setRemoveOnCancelPolicy(true); - LOG.info("Starting query cache executor with {} thread.", - QUERY_CACHE_NUM_WORKER_THREADS_AT_STARTUP); - - return new QueryCacheUpdaterScheduledExecutorService( - threadpoolExecutor) { - @Override public void setWorkerPoolSizeAfterStartup() { - delegate.setCorePoolSize(1); - delegate.setMaximumPoolSize(1); - LOG.info("Reset query cache executor to be single threaded."); - } - }; - } - }; - } - - public ScheduledExecutorServiceFactory provideSimpleUserUpdateIndexerScheduledExecutorFactory() { - return sharedExecutorServiceFactory; - } - - /** - * Returns the manager that manages the pool of searcher threads. - */ - public EarlybirdFuturePoolManager provideFuturePoolManager() { - return new EarlybirdFuturePoolManager("SearcherWorker"); - } - - /** - * Returns the manager that manages all earlybird finagle servers (warm up and production). - */ - public EarlybirdFinagleServerManager provideFinagleServerManager( - CriticalExceptionHandler criticalExceptionHandler) { - return new EarlybirdProductionFinagleServerManager(criticalExceptionHandler); - } - - /** - * Creates the production serverset manager. 
- */ - public EarlybirdServerSetManager provideServerSetManager( - ZooKeeperProxy discoveryClient, - DynamicPartitionConfig dynamicPartitionConfig, - SearchStatsReceiver searchStatsReceiver, - int port, - String serverSetNamePrefix) { - return new EarlybirdServerSetManager( - searchStatsReceiver, - discoveryClient, - dynamicPartitionConfig.getCurrentPartitionConfig(), - port, - serverSetNamePrefix); - } - - /** - * Creates the warm up serverset manager. - */ - public EarlybirdWarmUpManager provideWarmUpManager( - ZooKeeperProxy discoveryClient, - DynamicPartitionConfig dynamicPartitionConfig, - SearchStatsReceiver searchStatsReceiver, - Decider decider, - Clock clock, - int port, - String serverSetNamePrefix) { - return new EarlybirdWarmUpManager( - new EarlybirdServerSetManager( - searchStatsReceiver, - discoveryClient, - dynamicPartitionConfig.getCurrentPartitionConfig(), - port, - serverSetNamePrefix), - dynamicPartitionConfig.getCurrentPartitionConfig(), - searchIndexingMetricSet, - decider, - clock); - } - - /** - * Returns a dark proxy that knows how to send dark traffic to the warm up earlybird serverset. - */ - public EarlybirdDarkProxy provideEarlybirdDarkProxy( - SearchDecider searchDecider, - StatsReceiver finagleStatsReceiver, - EarlybirdServerSetManager earlybirdServerSetManager, - EarlybirdWarmUpManager earlybirdWarmUpManager, - String clusterName) { - return new EarlybirdDarkProxy(searchDecider, - finagleStatsReceiver.scope("dark_proxy"), - earlybirdServerSetManager, - earlybirdWarmUpManager, - clusterName); - } - - - /** - * Returns the manager for all (non-Tensorflow) scoring models. - */ - public ScoringModelsManager provideScoringModelsManager( - SearchStatsReceiver serverStats, - EarlybirdIndexConfig earlybirdIndexConfig) { - boolean modelsEnabled = EarlybirdConfig.getBool("scoring_models_enabled", false); - if (!modelsEnabled) { - LOG.info("Scoring Models - Disabled in the config. Not loading any models."); - serverStats.getCounter("scoring_models_disabled_in_config").increment(); - return ScoringModelsManager.NO_OP_MANAGER; - } - - String hdfsNameNode = EarlybirdConfig.getString("scoring_models_namenode"); - String hdfsModelsPath = EarlybirdConfig.getString("scoring_models_basedir"); - try { - return ScoringModelsManager.create( - serverStats, hdfsNameNode, hdfsModelsPath, earlybirdIndexConfig.getSchema()); - } catch (IOException e) { - LOG.error("Scoring Models - Error creating ScoringModelsManager", e); - serverStats.getCounter("scoring_models_initialization_errors").increment(); - return ScoringModelsManager.NO_OP_MANAGER; - } - } - - /** - * Provides the manager for all Tensorflow models. - */ - public TensorflowModelsManager provideTensorflowModelsManager( - SearchStatsReceiver serverStats, - String statsPrefix, - Decider decider, - EarlybirdIndexConfig earlybirdIndexConfig) { - - boolean modelsEnabled = EarlybirdProperty.TF_MODELS_ENABLED.get(false); - - if (!modelsEnabled) { - LOG.info("Tensorflow Models - Disabled in the config. 
Not loading any models."); - serverStats.getCounter("tf_models_disabled_in_config").increment(); - return TensorflowModelsManager.createNoOp(statsPrefix); - } - - String modelsConfigPath = - Preconditions.checkNotNull(EarlybirdProperty.TF_MODELS_CONFIG_PATH.get()); - - - int intraOpThreads = Preconditions.checkNotNull(EarlybirdProperty.TF_INTRA_OP_THREADS.get(0)); - int interOpThreads = Preconditions.checkNotNull(EarlybirdProperty.TF_INTER_OP_THREADS.get(0)); - - TensorflowModelsManager.initTensorflowThreadPools(intraOpThreads, interOpThreads); - - return TensorflowModelsManager.createUsingConfigFile( - FileUtils.getFileHandle(modelsConfigPath), - true, - statsPrefix, - () -> DeciderUtil.isAvailableForRandomRecipient( - decider, "enable_tf_serve_models"), - () -> decider.isAvailable("enable_tf_load_models"), - earlybirdIndexConfig.getSchema()); - } - - public SearchStatsReceiver provideEarlybirdServerStatsReceiver() { - return sharedSearchStatsReceiver; - } - - public StatsReceiver provideFinagleStatsReceiver() { - return sharedFinagleStatsReceiver; - } - - public SearchIndexingMetricSet provideSearchIndexingMetricSet() { - return searchIndexingMetricSet; - } - - public EarlybirdSearcherStats provideTweetsSearcherStats() { - return tweetsSearcherStats; - } - - /** - * Provides the clock to be used by this earlybird. - */ - public Clock provideClock() { - return Clock.SYSTEM_CLOCK; - } - - /** - * Provides the config for the multi-segment term dictionary manager. - */ - public MultiSegmentTermDictionaryManager.Config provideMultiSegmentTermDictionaryManagerConfig() { - return new MultiSegmentTermDictionaryManager.Config( - Lists.newArrayList( - EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName())); - } - - /** - * Provides the manager for the term dictionary that spans all segments. - */ - public MultiSegmentTermDictionaryManager provideMultiSegmentTermDictionaryManager( - MultiSegmentTermDictionaryManager.Config termDictionaryConfig, - SegmentManager segmentManager, - SearchStatsReceiver statsReceiver, - Decider decider, - EarlybirdCluster earlybirdCluster) { - return new MultiSegmentTermDictionaryManager( - termDictionaryConfig, segmentManager, statsReceiver, decider, earlybirdCluster); - } - - /** - * Returns the partition manager to be used by the archive earlybirds. 
- */ - public PartitionManager provideFullArchivePartitionManager( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - QueryCacheManager queryCacheManager, - SegmentManager segmentManager, - DynamicPartitionConfig dynamicPartitionConfig, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - SearchStatsReceiver searchStatsReceiver, - ArchiveEarlybirdIndexConfig earlybirdIndexConfig, - ServerSetMember serverSetMember, - ScheduledExecutorServiceFactory executorServiceFactory, - ScheduledExecutorServiceFactory userUpdateIndexerExecutorFactory, - SearchIndexingMetricSet earlybirdSearchIndexingMetricSet, - Clock clock, - SegmentSyncConfig segmentSyncConfig, - CriticalExceptionHandler criticalExceptionHandler) throws IOException { - - return new ArchiveSearchPartitionManager( - zooKeeperTryLockFactory, - queryCacheManager, - segmentManager, - dynamicPartitionConfig, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - searchStatsReceiver, - earlybirdIndexConfig, - serverSetMember, - executorServiceFactory, - userUpdateIndexerExecutorFactory, - earlybirdSearchIndexingMetricSet, - segmentSyncConfig, - clock, - criticalExceptionHandler); - } - - /** - * Provides the SegmentSyncConfig instance to be used by earlybird. - */ - public SegmentSyncConfig provideSegmentSyncConfig(EarlybirdCluster cluster) { - String scrubGen = null; - if (cluster == EarlybirdCluster.FULL_ARCHIVE) { - scrubGen = EarlybirdProperty.EARLYBIRD_SCRUB_GEN.get(); - LOG.info("The scrubGen provided from Aurora is: {}", scrubGen); - Preconditions.checkState(Strings.isNotEmpty(scrubGen)); - } - return new SegmentSyncConfig(Optional.ofNullable(scrubGen)); - } - - protected void storeEarlybirdStartupProducts( - TweetCreateHandler tweetCreateHandler, - PartitionWriter partitionWriter, - EarlybirdIndexFlusher earlybirdIndexFlusher - ) { - // TestWireModule wants to store these for further use. - } - - /** - * What directory are we going to load segments from on startup. - * - * When you're running loadtests or stagingN instances and they don't have a recent index - * flushed, it can take hours to generate a new index with a fresh startup. This slows - * down development. If the read_index_from_prod_location flag is set to true, we will read - * the index from the location where prod instances are flushing their index to. - * Unset it if you want to generate your own index. - * - * @return a string with the directory. - */ - public String getIndexLoadingDirectory() { - boolean readIndexFromProdLocation = EarlybirdProperty.READ_INDEX_FROM_PROD_LOCATION.get(false); - String environment = EarlybirdProperty.ENV.get("no_env_specified"); // default value for tests. - String readIndexDir = EarlybirdProperty.HDFS_INDEX_SYNC_DIR.get(); - - if (readIndexFromProdLocation) { - LOG.info("Will attempt to read index from prod locations"); - LOG.info("Index directory provided: {}", readIndexDir); - // Replacing the path is a bit hacky, but it works ok. - readIndexDir = readIndexDir.replace("/" + environment + "/", "/prod/"); - LOG.info("Will instead use index directory: {}", readIndexDir); - } - - return readIndexDir; - } - - /** - * Indexer for audio space events. 
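The read_index_from_prod_location behaviour described in getIndexLoadingDirectory() above is just a path substitution on the HDFS index sync directory. A small sketch of the rewrite, using made-up paths and a hypothetical staging1 environment:

// Hypothetical values shaped like the properties getIndexLoadingDirectory() reads;
// the directory layout and "staging1" environment are made up for illustration.
public final class IndexDirRewriteSketch {
  public static void main(String[] args) {
    String environment = "staging1";
    String readIndexDir = "/user/search/index_sync/staging1/partition_5";

    // The same substitution the wire module applies when READ_INDEX_FROM_PROD_LOCATION
    // is set: borrow the index most recently flushed by the prod replicas instead of
    // rebuilding one from scratch.
    String prodIndexDir = readIndexDir.replace("/" + environment + "/", "/prod/");

    System.out.println(prodIndexDir); // /user/search/index_sync/prod/partition_5
  }
}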
- */ - public AudioSpaceEventsStreamIndexer provideAudioSpaceEventsStreamIndexer( - AudioSpaceTable audioSpaceTable, - Clock clock) { - try { - return new AudioSpaceEventsStreamIndexer( - FinagleKafkaClientUtils.newKafkaConsumerForAssigning( - "", - new ThriftDeserializer<>(AudioSpaceBaseEvent.class), - "", - 20 - ), audioSpaceTable, clock); - } catch (MissingKafkaTopicException ex) { - LOG.error("Missing kafka stream", ex); - return null; - } - } - - /** - * Returns a class to start the Earlybird. See {@link EarlybirdStartup}. - */ - public EarlybirdStartup provideEarlybirdStartup( - PartitionManager partitionManager, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - AudioSpaceEventsStreamIndexer audioSpaceEventsStreamIndexer, - DynamicPartitionConfig dynamicPartitionConfig, - CriticalExceptionHandler criticalExceptionHandler, - SegmentManager segmentManager, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - QueryCacheManager queryCacheManager, - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - ServerSetMember serverSetMember, - Clock clock, - SegmentSyncConfig segmentSyncConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - EarlybirdCluster cluster, - SearchDecider decider) throws IOException { - if (cluster == EarlybirdCluster.FULL_ARCHIVE) { - return new PartitionManagerStartup(clock, partitionManager); - } - - // Check that the earlybird name is what we're expecting so we can build the kafka topics. - String earlybirdName = EarlybirdProperty.EARLYBIRD_NAME.get(); - Preconditions.checkArgument("earlybird-realtime".equals(earlybirdName) - || "earlybird-protected".equals(earlybirdName) - || "earlybird-realtime-exp0".equals(earlybirdName) - || "earlybird-realtime_cg".equals(earlybirdName)); - - StartupUserEventIndexer startupUserEventIndexer = new StartupUserEventIndexer( - provideSearchIndexingMetricSet(), - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - segmentManager, - clock); - - // Coordinate leaving the serverset to flush segments to HDFS. 
- CoordinatedEarlybirdAction actionCoordinator = new CoordinatedEarlybirdAction( - zooKeeperTryLockFactory, - "segment_flusher", - dynamicPartitionConfig, - serverSetMember, - criticalExceptionHandler, - segmentSyncConfig); - actionCoordinator.setShouldSynchronize(true); - - FileSystem hdfsFileSystem = HdfsUtil.getHdfsFileSystem(); - EarlybirdIndexFlusher earlybirdIndexFlusher = new EarlybirdIndexFlusher( - actionCoordinator, - hdfsFileSystem, - EarlybirdProperty.HDFS_INDEX_SYNC_DIR.get(), - segmentManager, - dynamicPartitionConfig.getCurrentPartitionConfig(), - clock, - new TimeLimitedHadoopExistsCall(hdfsFileSystem), - provideOptimizationAndFlushingCoordinationLock()); - - String baseTopicName = "search_ingester_%s_events_%s_%s"; - - String earlybirdType; - - if ("earlybird-protected".equals(earlybirdName)) { - earlybirdType = "protected"; - } else if ("earlybird-realtime_cg".equals(earlybirdName)) { - earlybirdType = "realtime_cg"; - } else { - earlybirdType = "realtime"; - } - - String tweetTopicName = String.format( - baseTopicName, - "indexing", - earlybirdType, - EarlybirdProperty.KAFKA_ENV.get()); - - String updateTopicName = String.format( - baseTopicName, - "update", - earlybirdType, - EarlybirdProperty.KAFKA_ENV.get()); - - LOG.info("Tweet topic: {}", tweetTopicName); - LOG.info("Update topic: {}", updateTopicName); - - TopicPartition tweetTopic = new TopicPartition( - tweetTopicName, - dynamicPartitionConfig.getCurrentPartitionConfig().getIndexingHashPartitionID()); - TopicPartition updateTopic = new TopicPartition( - updateTopicName, - dynamicPartitionConfig.getCurrentPartitionConfig().getIndexingHashPartitionID()); - - EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory = - provideEarlybirdKafkaConsumersFactory(); - FreshStartupHandler freshStartupHandler = new FreshStartupHandler( - clock, - earlybirdKafkaConsumersFactory, - tweetTopic, - updateTopic, - segmentManager, - EarlybirdConfig.getMaxSegmentSize(), - EarlybirdConfig.getLateTweetBuffer(), - criticalExceptionHandler - ); - - TweetUpdateHandler updateHandler = new TweetUpdateHandler(segmentManager); - - CoordinatedEarlybirdAction postOptimizationRebuilds = new CoordinatedEarlybirdAction( - zooKeeperTryLockFactory, - "post_optimization_rebuilds", - dynamicPartitionConfig, - serverSetMember, - criticalExceptionHandler, - segmentSyncConfig - ); - postOptimizationRebuilds.setShouldSynchronize(true); - CoordinatedEarlybirdAction gcAction = new CoordinatedEarlybirdAction( - zooKeeperTryLockFactory, - "gc_before_optimization", - dynamicPartitionConfig, - serverSetMember, - criticalExceptionHandler, - segmentSyncConfig - ); - gcAction.setShouldSynchronize(true); - - TweetCreateHandler createHandler = new TweetCreateHandler( - segmentManager, - provideSearchIndexingMetricSet(), - criticalExceptionHandler, - multiSegmentTermDictionaryManager, - queryCacheManager, - postOptimizationRebuilds, - gcAction, - EarlybirdConfig.getLateTweetBuffer(), - EarlybirdConfig.getMaxSegmentSize(), - provideKafkaIndexCaughtUpMonitor(), - provideOptimizationAndFlushingCoordinationLock()); - - PartitionWriter partitionWriter = new PartitionWriter( - createHandler, - updateHandler, - criticalExceptionHandler, - PenguinVersion.versionFromByteValue(EarlybirdConfig.getPenguinVersionByte()), - clock); - - KafkaConsumer rawKafkaConsumer = - earlybirdKafkaConsumersFactory.createKafkaConsumer( - "earlybird_tweet_kafka_consumer"); - - EarlybirdKafkaConsumer earlybirdKafkaConsumer = provideKafkaConsumer( - criticalExceptionHandler, - 
rawKafkaConsumer, - tweetTopic, - updateTopic, - partitionWriter, - earlybirdIndexFlusher); - - EarlybirdIndexLoader earlybirdIndexLoader = new EarlybirdIndexLoader( - hdfsFileSystem, - getIndexLoadingDirectory(), // See SEARCH-32839 - EarlybirdProperty.ENV.get("default_env_value"), - dynamicPartitionConfig.getCurrentPartitionConfig(), - earlybirdSegmentFactory, - segmentSyncConfig, - clock); - - this.storeEarlybirdStartupProducts( - createHandler, - partitionWriter, - earlybirdIndexFlusher - ); - - return new KafkaStartup( - segmentManager, - earlybirdKafkaConsumer, - startupUserEventIndexer, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - audioSpaceEventsStreamIndexer, - queryCacheManager, - earlybirdIndexLoader, - freshStartupHandler, - provideSearchIndexingMetricSet(), - multiSegmentTermDictionaryManager, - criticalExceptionHandler, - decider - ); - } - - public QualityFactor provideQualityFactor( - Decider decider, - SearchStatsReceiver searchStatsReceiver - ) { - return new EarlybirdCPUQualityFactor(decider, - ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class), - searchStatsReceiver); - } - - /** - * Returns a new UserUpdatesKafkaConsumer to read user updates. - */ - public UserUpdatesStreamIndexer provideUserUpdatesKafkaConsumer( - SegmentManager segmentManager) { - try { - return new UserUpdatesStreamIndexer( - UserUpdatesStreamIndexer.provideKafkaConsumer(), - EarlybirdProperty.USER_UPDATES_KAFKA_TOPIC.get(), - provideSearchIndexingMetricSet(), - segmentManager); - } catch (MissingKafkaTopicException ex) { - // Yes, it will crash the server. We've never seen this topic missing, but - // we've seen some others, so we had to build this functionality in the - // constructor. If one day this one goes missing, we'll have to figure out - // how to handle it. For now, we crash. - throw new RuntimeException(ex); - } - } - - /** - * Returns a new UserScrubGeosKafkaConsumer to read geo scrubbing events. - */ - public UserScrubGeoEventStreamIndexer provideUserScrubGeoEventKafkaConsumer( - SegmentManager segmentManager) { - try { - return new UserScrubGeoEventStreamIndexer( - UserScrubGeoEventStreamIndexer.provideKafkaConsumer(), - EarlybirdProperty.USER_SCRUB_GEO_KAFKA_TOPIC.get(), - provideSearchIndexingMetricSet(), - segmentManager); - } catch (MissingKafkaTopicException ex) { - /** - * See {@link #provideUserUpdatesKafkaConsumer} - */ - throw new RuntimeException(ex); - } - } - - /** - * Returns a new ProductionEarlybirdKafkaConsumer to read ThriftVersionedEvents. - */ - public EarlybirdKafkaConsumersFactory provideEarlybirdKafkaConsumersFactory() { - return new ProductionEarlybirdKafkaConsumersFactory( - EarlybirdProperty.KAFKA_PATH.get(), - MAX_POLL_RECORDS - ); - } - - /** - * Returns a class to read Tweets in the Earlybird. See {@link EarlybirdKafkaConsumer}. 
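The startup wiring above derives its Kafka topic names purely from the earlybird name and the Kafka environment. A quick worked example, assuming a realtime earlybird with KAFKA_ENV set to prod (both values are illustrative):

// Worked example of the topic naming in provideEarlybirdStartup(). The environment
// value "prod" is an assumption for illustration.
public final class TopicNameSketch {
  public static void main(String[] args) {
    String baseTopicName = "search_ingester_%s_events_%s_%s";
    String earlybirdType = "realtime"; // derived from EARLYBIRD_NAME, as in the wiring above
    String kafkaEnv = "prod";          // stand-in for EarlybirdProperty.KAFKA_ENV.get()

    String tweetTopicName = String.format(baseTopicName, "indexing", earlybirdType, kafkaEnv);
    String updateTopicName = String.format(baseTopicName, "update", earlybirdType, kafkaEnv);

    System.out.println(tweetTopicName);  // search_ingester_indexing_events_realtime_prod
    System.out.println(updateTopicName); // search_ingester_update_events_realtime_prod
    // Each name is then paired with this replica's indexing hash partition ID to form
    // the TopicPartition the consumer is assigned to.
  }
}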
- */ - public EarlybirdKafkaConsumer provideKafkaConsumer( - CriticalExceptionHandler criticalExceptionHandler, - KafkaConsumer rawKafkaConsumer, - TopicPartition tweetTopic, - TopicPartition updateTopic, - PartitionWriter partitionWriter, - EarlybirdIndexFlusher earlybirdIndexFlusher - ) { - return new EarlybirdKafkaConsumer( - rawKafkaConsumer, - provideSearchIndexingMetricSet(), - criticalExceptionHandler, - partitionWriter, - tweetTopic, - updateTopic, - earlybirdIndexFlusher, - provideKafkaIndexCaughtUpMonitor()); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/PartitionConfigUtil.java b/src/java/com/twitter/search/earlybird/factory/PartitionConfigUtil.java deleted file mode 100644 index 3a183a219..000000000 --- a/src/java/com/twitter/search/earlybird/factory/PartitionConfigUtil.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.config.TierConfig; -import com.twitter.search.earlybird.config.TierInfo; -import com.twitter.search.earlybird.partition.PartitionConfig; - -public final class PartitionConfigUtil { - private static final Logger LOG = LoggerFactory.getLogger(PartitionConfigUtil.class); - - private PartitionConfigUtil() { - } - - /** - * Initiate PartitionConfig for earlybirds running on Aurora - */ - public static PartitionConfig initPartitionConfigForAurora(int numOfInstances) { - String tier = EarlybirdProperty.EARLYBIRD_TIER.get(); - int partitionId = EarlybirdProperty.PARTITION_ID.get(); - int replicaId = EarlybirdProperty.REPLICA_ID.get(); - if (tier.equals(PartitionConfig.DEFAULT_TIER_NAME)) { - // realtime or protected earlybird - return new PartitionConfig( - partitionId, - EarlybirdProperty.SERVING_TIMESLICES.get(), - replicaId, - numOfInstances, - EarlybirdProperty.NUM_PARTITIONS.get()); - } else { - // archive earlybird - TierInfo tierInfo = TierConfig.getTierInfo(tier); - return new PartitionConfig(tier, tierInfo.getDataStartDate(), tierInfo.getDataEndDate(), - partitionId, tierInfo.getMaxTimeslices(), replicaId, numOfInstances, - tierInfo.getNumPartitions()); - } - } - - /** - * Tries to create a new PartitionConfig instance based on the Aurora flags - */ - public static PartitionConfig initPartitionConfig() { - return initPartitionConfigForAurora(EarlybirdProperty.NUM_INSTANCES.get()); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/ProductionEarlybirdKafkaConsumersFactory.java b/src/java/com/twitter/search/earlybird/factory/ProductionEarlybirdKafkaConsumersFactory.java deleted file mode 100644 index e024f27ee..000000000 --- a/src/java/com/twitter/search/earlybird/factory/ProductionEarlybirdKafkaConsumersFactory.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import org.apache.kafka.clients.consumer.KafkaConsumer; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.util.io.kafka.CompactThriftDeserializer; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; - -/** - * Responsible for creating kafka consumers. 
- */ -public class ProductionEarlybirdKafkaConsumersFactory implements EarlybirdKafkaConsumersFactory { - private final String kafkaPath; - private final int defaultMaxPollRecords; - - ProductionEarlybirdKafkaConsumersFactory(String kafkaPath, int defaultMaxPollRecords) { - this.kafkaPath = kafkaPath; - this.defaultMaxPollRecords = defaultMaxPollRecords; - } - - /** - * Create a kafka consumer with set maximum of records to be polled. - */ - @Override - public KafkaConsumer createKafkaConsumer( - String clientID, int maxPollRecords) { - return FinagleKafkaClientUtils.newKafkaConsumerForAssigning( - kafkaPath, - new CompactThriftDeserializer<>(ThriftVersionedEvents.class), - clientID, - maxPollRecords); - } - - /** - * Create a kafka consumer with default records to be polled. - */ - @Override - public KafkaConsumer createKafkaConsumer(String clientID) { - return createKafkaConsumer(clientID, defaultMaxPollRecords); - } -} diff --git a/src/java/com/twitter/search/earlybird/factory/QueryCacheUpdaterScheduledExecutorService.java b/src/java/com/twitter/search/earlybird/factory/QueryCacheUpdaterScheduledExecutorService.java deleted file mode 100644 index ff0619441..000000000 --- a/src/java/com/twitter/search/earlybird/factory/QueryCacheUpdaterScheduledExecutorService.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.earlybird.factory; - -import java.util.concurrent.Callable; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.ScheduledFuture; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.common.util.concurrent.ForwardingExecutorService; - -/** - * This delegate type is intended for QueryCacheUpdater because it uses multiple threads to - * create query cache during startup and then switch later to use single thread to update the - * cache. - */ -public abstract class QueryCacheUpdaterScheduledExecutorService - extends ForwardingExecutorService implements ScheduledExecutorService { - public QueryCacheUpdaterScheduledExecutorService(T executor) { - super(executor); - } - - /** - * Sets the number of worker threads in this executor service to an appropriate value after the - * earlybird startup has finished. While earlybird is starting up, we might want this executor - * service to have more threads, in order to parallelize more some start up tasks. But once - * earlybird is up, it might make sense to lower the number of worker threads. 
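This contract is what the query cache executor factory in EarlybirdWireModule relies on: the pool starts with the configured startup thread count and is later collapsed to a single worker once the index is caught up. A stripped-down sketch of the same resize on a bare ScheduledThreadPoolExecutor:

import java.util.concurrent.ScheduledThreadPoolExecutor;

// Stripped-down version of the resize performed by the query cache updater executor:
// start wide for the initial cache build, then shrink to one worker once caught up.
public final class PoolResizeSketch {
  public static void main(String[] args) {
    int startupThreads = 8; // stand-in for query_cache_updater_startup_threads
    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(startupThreads);

    // ... schedule the query cache build tasks here during startup ...

    // Equivalent of setWorkerPoolSizeAfterStartup() in the real delegate:
    executor.setCorePoolSize(1);
    executor.setMaximumPoolSize(1);

    executor.shutdown();
  }
}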
- */ - public abstract void setWorkerPoolSizeAfterStartup(); - - @Override - public ScheduledFuture schedule(Runnable command, long delay, TimeUnit unit) { - return delegate.schedule(command, delay, unit); - } - - @Override - public ScheduledFuture scheduleAtFixedRate( - Runnable command, long initialDelay, long period, TimeUnit unit) { - return delegate.scheduleAtFixedRate(command, initialDelay, period, unit); - } - - @Override - public ScheduledFuture scheduleWithFixedDelay( - Runnable command, long initialDelay, long delay, TimeUnit unit) { - return delegate.scheduleWithFixedDelay(command, initialDelay, delay, unit); - } - - @Override - public ScheduledFuture schedule(Callable callable, long delay, TimeUnit unit) { - return delegate.schedule(callable, delay, unit); - } - - @VisibleForTesting - public T getDelegate() { - return delegate; - } -} diff --git a/src/java/com/twitter/search/earlybird/index/AbstractInMemoryTimeMapper.java b/src/java/com/twitter/search/earlybird/index/AbstractInMemoryTimeMapper.java deleted file mode 100644 index 4e9ef4f7c..000000000 --- a/src/java/com/twitter/search/earlybird/index/AbstractInMemoryTimeMapper.java +++ /dev/null @@ -1,83 +0,0 @@ -package com.twitter.search.earlybird.index; - -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; -import com.twitter.search.core.earlybird.index.util.SearchSortUtils; -import com.twitter.search.earlybird.search.queries.SinceUntilFilter; - -public abstract class AbstractInMemoryTimeMapper implements TimeMapper { - // Reverse map: timestamp to first doc ID seen with that timestamp. - // This is two arrays: the timestamps (sorted), and the doc ids. - protected final IntBlockPool reverseMapTimes; - protected final IntBlockPool reverseMapIds; - protected volatile int reverseMapLastIndex; - - public AbstractInMemoryTimeMapper() { - this.reverseMapTimes = new IntBlockPool(ILLEGAL_TIME, "time_mapper_times"); - this.reverseMapIds = new IntBlockPool(ILLEGAL_TIME, "time_mapper_ids"); - this.reverseMapLastIndex = -1; - } - - protected AbstractInMemoryTimeMapper(int reverseMapLastIndex, - IntBlockPool reverseMapTimes, - IntBlockPool reverseMapIds) { - this.reverseMapTimes = reverseMapTimes; - this.reverseMapIds = reverseMapIds; - this.reverseMapLastIndex = reverseMapLastIndex; - } - - @Override - public final int getLastTime() { - return reverseMapLastIndex == -1 ? ILLEGAL_TIME : reverseMapTimes.get(reverseMapLastIndex); - } - - @Override - public final int getFirstTime() { - return reverseMapLastIndex == -1 ? ILLEGAL_TIME : reverseMapTimes.get(0); - } - - @Override - public final int findFirstDocId(int timeSeconds, int smallestDocID) { - if (timeSeconds == SinceUntilFilter.NO_FILTER || reverseMapLastIndex == -1) { - return smallestDocID; - } - - final int index = SearchSortUtils.binarySearch( - new IntArrayComparator(), 0, reverseMapLastIndex, timeSeconds, false); - - if (index == reverseMapLastIndex && reverseMapTimes.get(index) < timeSeconds) { - // Special case for out of bounds time. - return smallestDocID; - } - - return reverseMapIds.get(index); - } - - protected abstract void setTime(int docID, int timeSeconds); - - protected void doAddMapping(int docID, int timeSeconds) { - setTime(docID, timeSeconds); - int lastTime = getLastTime(); - if (timeSeconds > lastTime) { - // Found a timestamp newer than any timestamp we've seen before. - // Add a reverse mapping to this tweet (the first seen with this timestamp). 
- // - // When indexing out of order tweets, we could have gaps in the timestamps recorded in - // reverseMapTimes. For example, if we get 3 tweets with timestamp T0, T0 + 5, T0 + 3, then we - // will only record T0 and T0 + 5 in reverseMapTimes. However, this should not be an issue, - // because reverseMapTimes is only used by findFirstDocId(), and it's OK for that method to - // return a smaller doc ID than strictly necessary (in this case, findFirstDocId(T0 + 3) will - // return the doc ID of the second tweet, instead of returning the doc ID of the third tweet). - reverseMapTimes.add(timeSeconds); - reverseMapIds.add(docID); - reverseMapLastIndex++; - } - } - - private class IntArrayComparator implements SearchSortUtils.Comparator { - @Override - public int compare(int index, Integer value) { - return Integer.compare(reverseMapTimes.get(index), value); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/DocValuesBasedTimeMapper.java b/src/java/com/twitter/search/earlybird/index/DocValuesBasedTimeMapper.java deleted file mode 100644 index 9b7770de6..000000000 --- a/src/java/com/twitter/search/earlybird/index/DocValuesBasedTimeMapper.java +++ /dev/null @@ -1,146 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.util.analysis.IntTermAttributeImpl; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; - -/** - * A few caveats when using this class: - * - This class only supports in-order createdAt! - * - Before actually using this class, one must call prepareToRead() with a Lucene AtomicReader - * - prepareToRead() will load docID to createdAt mapping into memory, if not already done. - */ -public class DocValuesBasedTimeMapper implements TimeMapper { - private LeafReader reader; - private ColumnStrideFieldIndex docValues; - - protected int minTimestamp = ILLEGAL_TIME; - protected int maxTimestamp = ILLEGAL_TIME; - - /** - * When indexing finishes, this method should be called with a index reader that - * can see all documents. - * @param leafReader Lucene index reader used to access "TweetID" to "createdAt" mapping. - */ - public void initializeWithLuceneReader(LeafReader leafReader, ColumnStrideFieldIndex csf) - throws IOException { - reader = Preconditions.checkNotNull(leafReader); - docValues = Preconditions.checkNotNull(csf); - - // Find the min and max timestamps. - // See SEARCH-5534 - // In the archive, tweets are always sorted in descending order by tweet ID, but - // that does not mean that the documents are necessarily sorted by time. We've observed tweet ID - // generation be decoupled from timestamp creation (i.e. a larger tweet ID having a smaller - // created_at time). 
- minTimestamp = Integer.MAX_VALUE; - maxTimestamp = Integer.MIN_VALUE; - - NumericDocValues onDiskDocValues = reader.getNumericDocValues( - EarlybirdFieldConstants.EarlybirdFieldConstant.CREATED_AT_CSF_FIELD.getFieldName()); - for (int i = 0; i < reader.maxDoc(); ++i) { - Preconditions.checkArgument(onDiskDocValues.advanceExact(i)); - int timestamp = (int) onDiskDocValues.longValue(); - docValues.setValue(i, timestamp); - - if (timestamp < minTimestamp) { - minTimestamp = timestamp; - } - if (timestamp > maxTimestamp) { - maxTimestamp = timestamp; - } - } - } - - @Override - public int getLastTime() { - return maxTimestamp; - } - - @Override - public int getFirstTime() { - return minTimestamp; - } - - @Override - public int getTime(int docID) { - if (docID < 0 || docID > reader.maxDoc()) { - return ILLEGAL_TIME; - } - return (int) docValues.get(docID); - } - - @Override - public int findFirstDocId(int timeSeconds, int smallestDocID) throws IOException { - // In the full archive, the smallest doc id corresponds to largest timestamp. - if (timeSeconds > maxTimestamp) { - return smallestDocID; - } - if (timeSeconds < minTimestamp) { - return reader.maxDoc() - 1; - } - - int docId = DocValuesHelper.getLargestDocIdWithCeilOfValue( - reader, - EarlybirdFieldConstants.EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName(), - IntTermAttributeImpl.copyIntoNewBytesRef(timeSeconds)); - if (docId == DocIdSetIterator.NO_MORE_DOCS) { - return ILLEGAL_TIME; - } - - return docId; - } - - @Override - public TimeMapper optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) { - // DocValuesBasedTimerMapper instances are not flushed or loaded, - // so their optimization is a no-op. - return this; - } - - @Override - public Flushable.Handler getFlushHandler() { - // EarlybirdIndexSegmentData will still try to flush the DocValuesBasedTimeMapper for the - // respective segment, so we need to pass in a DocValuesBasedTimeMapper instance to this - // flusher: otherwise, Flushable.Handler.flush() will throw a NullPointerException. - return new FlushHandler(new DocValuesBasedTimeMapper()); - } - - // Full archive earlybirds don't actually flush or load the DocValuesBasedTimeMapper. This is - // why doFlush() is a no-op, and doLoad() returns a new DocValuesBasedTimeMapper instance - // (initializeWithLuceneReader() will be called at load time to initialize this new - // DocValuesBasedTimeMapper instance). 
- public static class FlushHandler extends Flushable.Handler { - public FlushHandler() { - super(); - } - - public FlushHandler(DocValuesBasedTimeMapper objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) { - } - - @Override - protected DocValuesBasedTimeMapper doLoad(FlushInfo flushInfo, DataDeserializer in) { - return new DocValuesBasedTimeMapper(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/DocValuesBasedTweetIDMapper.java b/src/java/com/twitter/search/earlybird/index/DocValuesBasedTweetIDMapper.java deleted file mode 100644 index 6fe1cee4d..000000000 --- a/src/java/com/twitter/search/earlybird/index/DocValuesBasedTweetIDMapper.java +++ /dev/null @@ -1,149 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; - -/** - * A few caveats when using this class: - * - Before actually using this class, one must call prepareToRead() with a Lucene AtomicReader - * - prepareToRead() will load docID to tweetID mapping into memory, if not already done. - */ -public class DocValuesBasedTweetIDMapper extends TweetIDMapper implements Flushable { - private LeafReader reader; - private ColumnStrideFieldIndex docValues; - - /** - * When indexing finishes, this method should be called with a index reader that - * can see all documents. - * @param leafReader Lucene index reader used to access TweetID to internal ID mapping - */ - public void initializeWithLuceneReader(LeafReader leafReader, ColumnStrideFieldIndex csf) - throws IOException { - reader = Preconditions.checkNotNull(leafReader); - docValues = Preconditions.checkNotNull(csf); - - NumericDocValues onDiskDocValues = reader.getNumericDocValues( - EarlybirdFieldConstants.EarlybirdFieldConstant.ID_CSF_FIELD.getFieldName()); - for (int i = 0; i < reader.maxDoc(); ++i) { - Preconditions.checkArgument(onDiskDocValues.advanceExact(i)); - docValues.setValue(i, onDiskDocValues.longValue()); - } - - // In the archive, tweets are always sorted in descending order of tweet ID. - setMinTweetID(docValues.get(reader.maxDoc() - 1)); - setMaxTweetID(docValues.get(0)); - setMinDocID(0); - setMaxDocID(reader.maxDoc() - 1); - setNumDocs(reader.maxDoc()); - } - - @Override - public int getDocID(long tweetID) throws IOException { - int docId = DocValuesHelper.getFirstDocIdWithValue( - reader, - EarlybirdFieldConstants.EarlybirdFieldConstant.ID_FIELD.getFieldName(), - SortableLongTermAttributeImpl.copyIntoNewBytesRef(tweetID)); - if (docId == DocIdSetIterator.NO_MORE_DOCS) { - return ID_NOT_FOUND; - } - return docId; - } - - @Override - protected int getNextDocIDInternal(int docID) { - // The doc IDs are consecutive and TweetIDMapper already checked the boundary conditions. 
- return docID + 1; - } - - @Override - protected int getPreviousDocIDInternal(int docID) { - // The doc IDs are consecutive and TweetIDMapper already checked the boundary conditions. - return docID - 1; - } - - @Override - public long getTweetID(int internalID) { - if (internalID < 0 || internalID > getMaxDocID()) { - return ID_NOT_FOUND; - } - return docValues.get(internalID); - } - - @Override - protected int addMappingInternal(long tweetID) { - throw new UnsupportedOperationException( - "ArchiveTweetIDMapper should be written through Lucene instead of TweetIDMappingWriter"); - } - - @Override - protected final int findDocIDBoundInternal(long tweetID, - boolean findMaxDocID) throws IOException { - // TermsEnum has a seekCeil() method, but doesn't have a seekFloor() method, so the best we can - // do here is ignore findLow and always return the ceiling if the tweet ID cannot be found. - // However, in practice, we do a seekExact() in both cases: see the inner classes in - // com.twitter.search.core.earlybird.index.inverted.RealtimeIndexTerms. - int docId = DocValuesHelper.getLargestDocIdWithCeilOfValue( - reader, - EarlybirdFieldConstants.EarlybirdFieldConstant.ID_FIELD.getFieldName(), - SortableLongTermAttributeImpl.copyIntoNewBytesRef(tweetID)); - if (docId == DocIdSetIterator.NO_MORE_DOCS) { - return ID_NOT_FOUND; - } - - // The docId is the upper bound of the search, so if we want the lower bound, - // because doc IDs are dense, we subtract one. - return findMaxDocID ? docId : docId - 1; - } - - @Override - public DocIDToTweetIDMapper optimize() { - // DocValuesBasedTweetIDMapper instances are not flushed or loaded, - // so their optimization is a no-op. - return this; - } - - @Override - public Flushable.Handler getFlushHandler() { - // EarlybirdIndexSegmentData will still try to flush the DocValuesBasedTweetIDMapper - // for the respective segment, so we need to pass in a DocValuesBasedTweetIDMapper instance to - // this flusher: otherwise, Flushable.Handler.flush() will throw a NullPointerException. - return new FlushHandler(new DocValuesBasedTweetIDMapper()); - } - - // Full archive earlybirds don't actually flush or load the DocValuesBasedTweetIDMapper. This is - // why doFlush() is a no-op, and doLoad() returns a new DocValuesBasedTweetIDMapper instance - // (initializeWithLuceneReader() will be called at load time to initialize this new - // DocValuesBasedTweetIDMapper instance). 
- public static class FlushHandler extends Flushable.Handler { - public FlushHandler() { - super(); - } - - public FlushHandler(DocValuesBasedTweetIDMapper objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) { - } - - @Override - protected DocValuesBasedTweetIDMapper doLoad(FlushInfo flushInfo, DataDeserializer in) { - return new DocValuesBasedTweetIDMapper(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/DocValuesHelper.java b/src/java/com/twitter/search/earlybird/index/DocValuesHelper.java deleted file mode 100644 index 417a3e640..000000000 --- a/src/java/com/twitter/search/earlybird/index/DocValuesHelper.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.BytesRef; - -public final class DocValuesHelper { - private DocValuesHelper() { - } - - /** - * Reverse lookup. Given a value, returns the first doc ID with this value. This requires a field - * that indexes the values. - * - * @param reader The reader to use to look up field values. - * @param value The value to lookup. - * @param indexField The field containing an index of the values. - */ - public static int getFirstDocIdWithValue( - LeafReader reader, String indexField, BytesRef value) throws IOException { - TermsEnum termsEnum = getTermsEnum(reader, indexField); - if (termsEnum == null || !termsEnum.seekExact(value)) { - return DocIdSetIterator.NO_MORE_DOCS; - } - - DocIdSetIterator docsIterator = termsEnum.postings(null); - return docsIterator.nextDoc(); - } - - /** - * Reverse lookup. Same as getFirstDocIdWithValue(), but if no document with the given value - * exists, the next bigger value is used for looking up the first doc ID. - * - * If there are multiple documents that match the value, all documents will be scanned, and the - * largest doc ID that matches will be returned. - * - * @param reader The reader to use to look up field values. - * @param value The value to lookup. - * @param indexField The field containing an index of the values. 
- */ - public static int getLargestDocIdWithCeilOfValue( - LeafReader reader, String indexField, BytesRef value) throws IOException { - TermsEnum termsEnum = getTermsEnum(reader, indexField); - if (termsEnum == null) { - return DocIdSetIterator.NO_MORE_DOCS; - } - if (termsEnum.seekCeil(value) == TermsEnum.SeekStatus.END) { - return DocIdSetIterator.NO_MORE_DOCS; - } - - DocIdSetIterator docsIterator = termsEnum.postings(null); - int docId = docsIterator.nextDoc(); - while (docsIterator.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { - docId = docsIterator.docID(); - } - return docId; - } - - private static TermsEnum getTermsEnum(LeafReader reader, String indexField) throws IOException { - Terms terms = reader.terms(indexField); - if (terms == null) { - return null; - } - return terms.iterator(); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/EarlybirdSegment.java b/src/java/com/twitter/search/earlybird/index/EarlybirdSegment.java deleted file mode 100644 index a902dc890..000000000 --- a/src/java/com/twitter/search/earlybird/index/EarlybirdSegment.java +++ /dev/null @@ -1,1070 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.Closeable; -import java.io.File; -import java.io.IOException; -import java.time.Instant; -import java.time.ZoneOffset; -import java.time.ZonedDateTime; -import java.time.format.DateTimeFormatter; -import java.util.List; -import java.util.Map; -import java.util.Objects; -import java.util.concurrent.atomic.AtomicReference; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.HashBasedTable; -import com.google.common.collect.Table; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.commons.io.FileUtils; -import org.apache.lucene.document.Document; -import org.apache.lucene.index.DirectoryReader; -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.IndexableField; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.FSDirectory; -import org.apache.lucene.store.IOContext; -import org.apache.lucene.store.IndexOutput; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.base.FeatureConfiguration; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeaturesUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import 
com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentWriter; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.column.DocValuesUpdate; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.exception.FlushVersionMismatchException; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentIndexStats; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.snowflake.id.SnowflakeId; - -public class EarlybirdSegment { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdSegment.class); - private static final Logger UPDATES_ERRORS_LOG = - LoggerFactory.getLogger(EarlybirdSegment.class.getName() + ".UpdatesErrors"); - private static final String SUCCESS_FILE = "EARLYBIRD_SUCCESS"; - private static final DateTimeFormatter HOURLY_COUNT_DATE_TIME_FORMATTER = - DateTimeFormatter.ofPattern("yyyy_MM_dd_HH"); - - @VisibleForTesting - public static final String NUM_TWEETS_CREATED_AT_PATTERN = "num_tweets_%s_%s_created_at_%s"; - - private static final String INVALID_FEATURE_UPDATES_DROPPED_PREFIX = - "invalid_index_feature_update_dropped_"; - - // The number of tweets not indexed because they have been previously indexed. - private static final SearchCounter DUPLICATE_TWEET_SKIPPED_COUNTER = - SearchCounter.export("duplicate_tweet_skipped"); - - // The number of tweets that came out of order. - private static final SearchCounter OUT_OF_ORDER_TWEET_COUNTER = - SearchCounter.export("out_of_order_tweet"); - - // The number partial updates dropped because the field could not be found in the schema. - // This counter is incremented once per field rather than once per partial update event. - // Note: caller may retry update, this counter will be incremented multiple times for same update. - private static final SearchCounter INVALID_FIELDS_IN_PARTIAL_UPDATES = - SearchCounter.export("invalid_fields_in_partial_updates"); - - // The number partial updates dropped because the tweet id could not be found in the segment. - // Note: caller may retry update, this counter will be incremented multiple times for same update. - private static final SearchCounter PARTIAL_UPDATE_FOR_TWEET_NOT_IN_INDEX = - SearchCounter.export("partial_update_for_tweet_id_not_in_index"); - - // The number of partial updates that were applied only partially, because the update could not - // be applied for at least one of the fields. - private static final SearchCounter PARTIAL_UPDATE_PARTIAL_FAILURE = - SearchCounter.export("partial_update_partial_failure"); - - // Both the indexing chain and the index writer are lazily initialized when adding docs for - // the first time. - private final AtomicReference segmentWriterReference = - new AtomicReference<>(); - - // Stats from the PartitionIndexer / SimpleSegmentIndexer. 
- private final SegmentIndexStats indexStats; - private final String segmentName; - private final int maxSegmentSize; - private final long timeSliceID; - private final AtomicReference luceneIndexReader = - new AtomicReference<>(); - private final Directory luceneDir; - private final File luceneDirFile; - private final EarlybirdIndexConfig indexConfig; - private final List closableResources = Lists.newArrayList(); - private long lastInOrderTweetId = 0; - - private final EarlybirdIndexExtensionsFactory extensionsFactory; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final EarlybirdSearcherStats searcherStats; - - private final Map indexedTweetsCounters = Maps.newHashMap(); - private final PerFieldCounters perFieldCounters; - private final Clock clock; - - @VisibleForTesting - public volatile boolean appendedLuceneIndex = false; - - public EarlybirdSegment( - String segmentName, - long timeSliceID, - int maxSegmentSize, - Directory luceneDir, - EarlybirdIndexConfig indexConfig, - SearchIndexingMetricSet searchIndexingMetricSet, - EarlybirdSearcherStats searcherStats, - Clock clock) { - this.segmentName = segmentName; - this.maxSegmentSize = maxSegmentSize; - this.timeSliceID = timeSliceID; - this.luceneDir = luceneDir; - this.indexConfig = indexConfig; - this.indexStats = new SegmentIndexStats(); - this.perFieldCounters = new PerFieldCounters(); - this.extensionsFactory = new TweetSearchIndexExtensionsFactory(); - - if (luceneDir != null && luceneDir instanceof FSDirectory) { - // getDirectory() throws if the luceneDir is already closed. - // To delete a directory, we need to close it first. - // Obtain a reference to the File now, so we can delete it later. - // See SEARCH-5281 - this.luceneDirFile = ((FSDirectory) luceneDir).getDirectory().toFile(); - } else { - this.luceneDirFile = null; - } - this.searchIndexingMetricSet = Preconditions.checkNotNull(searchIndexingMetricSet); - this.searcherStats = searcherStats; - this.clock = clock; - } - - @VisibleForTesting - public Directory getLuceneDirectory() { - return luceneDir; - } - - public SegmentIndexStats getIndexStats() { - return indexStats; - } - - /** - * Returns the smallest tweet ID in this segment. If the segment is not loaded yet, or is empty, - * DocIDToTweetIDMapper.ID_NOT_FOUND is returned (-1). - * - * @return The smallest tweet ID in this segment. - */ - public long getLowestTweetId() { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return DocIDToTweetIDMapper.ID_NOT_FOUND; - } - - DocIDToTweetIDMapper mapper = segmentWriter.getSegmentData().getDocIDToTweetIDMapper(); - int highestDocID = mapper.getPreviousDocID(Integer.MAX_VALUE); - return mapper.getTweetID(highestDocID); - } - - /** - * Returns the cardinality (size) sum of the cardinality of each - * query cache set. - */ - public long getQueryCachesCardinality() { - EarlybirdIndexSegmentWriter writer = getIndexSegmentWriter(); - if (writer == null) { - // The segment is not loaded yet, or the query caches for this segment are not built yet. - return -1; - } - - EarlybirdIndexSegmentData earlybirdIndexSegmentData = writer.getSegmentData(); - return earlybirdIndexSegmentData.getQueryCachesCardinality(); - } - - public List> getQueryCachesData() { - return getIndexSegmentWriter().getSegmentData().getPerQueryCacheCardinality(); - } - - - /** - * Returns the highest tweet ID in this segment. 
If the segment is not loaded yet, or is empty, - * DocIDToTweetIDMapper.ID_NOT_FOUND is returned (-1). - * - * @return The highest tweet ID in this segment. - */ - public long getHighestTweetId() { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return DocIDToTweetIDMapper.ID_NOT_FOUND; - } - - DocIDToTweetIDMapper mapper = segmentWriter.getSegmentData().getDocIDToTweetIDMapper(); - int lowestDocID = mapper.getNextDocID(-1); - return mapper.getTweetID(lowestDocID); - } - - /** - * Optimizes the underlying segment data. - */ - public void optimizeIndexes() throws IOException { - EarlybirdIndexSegmentWriter unoptimizedWriter = segmentWriterReference.get(); - Preconditions.checkNotNull(unoptimizedWriter); - - unoptimizedWriter.forceMerge(); - unoptimizedWriter.close(); - - // Optimize our own data structures in the indexing chain - // In the archive this is pretty much a no-op. - // The indexWriter in writeableSegment should no longer be used and referenced, and - // writeableSegment.writer can be garbage collected at this point. - EarlybirdIndexSegmentData optimized = indexConfig.optimize(unoptimizedWriter.getSegmentData()); - resetSegmentWriterReference(newWriteableSegment(optimized), true); - - addSuccessFile(); - } - - /** - * Returns a new, optimized, realtime segment, by copying the data in this segment. - */ - public EarlybirdSegment makeOptimizedSegment() throws IOException { - EarlybirdIndexSegmentWriter unoptimizedWriter = segmentWriterReference.get(); - Preconditions.checkNotNull(unoptimizedWriter); - EarlybirdSegment optimizedSegment = new EarlybirdSegment( - segmentName, - timeSliceID, - maxSegmentSize, - luceneDir, - indexConfig, - searchIndexingMetricSet, - searcherStats, - clock); - - EarlybirdIndexSegmentData optimizedSegmentData = - indexConfig.optimize(unoptimizedWriter.getSegmentData()); - LOG.info("Done optimizing, setting segment data"); - - optimizedSegment.setSegmentData( - optimizedSegmentData, - indexStats.getPartialUpdateCount(), - indexStats.getOutOfOrderUpdateCount()); - return optimizedSegment; - } - - public String getSegmentName() { - return segmentName; - } - - public boolean isOptimized() { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - return segmentWriter != null && segmentWriter.getSegmentData().isOptimized(); - } - - /** - * Removes the document for the given tweet ID from this segment, if this segment contains a - * document for this tweet ID. - */ - public boolean delete(long tweetID) throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (!hasDocument(tweetID)) { - return false; - } - - segmentWriter.deleteDocuments(new TweetIDQuery(tweetID)); - return true; - } - - protected void updateDocValues(long tweetID, String field, DocValuesUpdate update) - throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - segmentWriter.updateDocValues(new TweetIDQuery(tweetID), field, update); - } - - /** - * Appends the Lucene index from another segment to this segment. 
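 * This is only meaningful for the on-disk (archive) index: the other segment's writer is closed,
 * its Lucene directory is added via addIndexes(), and the combined index is then force-merged.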
- */ - public void append(EarlybirdSegment otherSegment) throws IOException { - if (indexConfig.isIndexStoredOnDisk()) { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - Preconditions.checkNotNull(segmentWriter); - EarlybirdIndexSegmentWriter otherSegmentWriter = otherSegment.segmentWriterReference.get(); - if (otherSegmentWriter != null) { - otherSegmentWriter.close(); - } - segmentWriter.addIndexes(otherSegment.luceneDir); - LOG.info("Calling forceMerge now after appending segment."); - segmentWriter.forceMerge(); - appendedLuceneIndex = true; - LOG.info("Appended {} docs to segment {}. New doc count = {}", - otherSegment.indexStats.getStatusCount(), luceneDir.toString(), - indexStats.getStatusCount()); - - indexStats.setIndexSizeOnDiskInBytes(getSegmentSizeOnDisk()); - } - } - - /** - * Only needed for the on disk archive. - * Creates TwitterIndexReader used for searching. This is shared by all Searchers. - * This method also initializes the Lucene based mappers and CSF for the on disk archive. - * - * This method should be called after optimizing/loading a segment, but before the segment starts - * to serve search queries. - */ - public void warmSegment() throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - Preconditions.checkNotNull(segmentWriter); - - // only need to pre-create reader and initialize mappers and CSF in the on disk archive cluster - if (indexConfig.isIndexStoredOnDisk() && luceneIndexReader.get() == null) { - EarlybirdIndexSegmentAtomicReader luceneAtomicReader = - segmentWriter.getSegmentData().createAtomicReader(); - - luceneIndexReader.set(luceneAtomicReader); - closableResources.add(luceneAtomicReader); - closableResources.add(luceneDir); - } - } - - /** - * Create a tweet index searcher on the segment. - * - * For production search session, the schema snapshot should be always passed in to make sure - * that the schema usage inside scoring is consistent. - * - * For non-production usage, like one-off debugging search, you can use the function call without - * the schema snapshot. - */ - @Nullable - public EarlybirdSingleSegmentSearcher getSearcher( - UserTable userTable, - ImmutableSchemaInterface schemaSnapshot) throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return null; - } - return new EarlybirdSingleSegmentSearcher( - schemaSnapshot, getIndexReader(segmentWriter), userTable, searcherStats, clock); - } - - /** - * Returns a new searcher for this segment. - */ - @Nullable - public EarlybirdSingleSegmentSearcher getSearcher( - UserTable userTable) throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return null; - } - return new EarlybirdSingleSegmentSearcher( - segmentWriter.getSegmentData().getSchema().getSchemaSnapshot(), - getIndexReader(segmentWriter), - userTable, - searcherStats, - clock); - } - - /** - * Returns a new reader for this segment. 
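 * For the on-disk archive this returns the reader pre-created by warmSegment(); for realtime
 * segments a fresh atomic reader is created on each call.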
- */ - @Nullable - public EarlybirdIndexSegmentAtomicReader getIndexReader() throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return null; - } - return getIndexReader(segmentWriter); - } - - private EarlybirdIndexSegmentAtomicReader getIndexReader( - EarlybirdIndexSegmentWriter segmentWriter - ) throws IOException { - EarlybirdIndexSegmentAtomicReader reader = luceneIndexReader.get(); - if (reader != null) { - return reader; - } - Preconditions.checkState(!indexConfig.isIndexStoredOnDisk()); - - // Realtime EB mode. - return segmentWriter.getSegmentData().createAtomicReader(); - } - - /** - * Gets max tweet id in this segment. - * - * @return the tweet id or -1 if not found. - */ - public long getMaxTweetId() { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return -1; - } else { - TweetIDMapper tweetIDMapper = - (TweetIDMapper) segmentWriter.getSegmentData().getDocIDToTweetIDMapper(); - return tweetIDMapper.getMaxTweetID(); - } - } - - private EarlybirdIndexSegmentWriter newWriteableSegment(EarlybirdIndexSegmentData segmentData) - throws IOException { - EarlybirdIndexSegmentWriter old = segmentWriterReference.get(); - if (old != null) { - old.close(); - } - - LOG.info("Creating new segment writer for {} on {}", segmentName, luceneDir); - IndexWriterConfig indexWriterConfig = indexConfig.newIndexWriterConfig(); - return segmentData.createEarlybirdIndexSegmentWriter(indexWriterConfig); - } - - private void resetSegmentWriterReference( - EarlybirdIndexSegmentWriter segmentWriter, boolean previousSegmentWriterAllowed) { - EarlybirdIndexSegmentWriter previousSegmentWriter = - segmentWriterReference.getAndSet(segmentWriter); - if (!previousSegmentWriterAllowed) { - Preconditions.checkState( - previousSegmentWriter == null, - "A previous segment writer must have been set for segment " + segmentName); - } - - // Reset the stats for the number of indexed tweets per hour and recompute them. - // See SEARCH-23619 - for (SearchCounter indexedTweetsCounter : indexedTweetsCounters.values()) { - indexedTweetsCounter.reset(); - } - - if (segmentWriter != null) { - indexStats.setSegmentData(segmentWriter.getSegmentData()); - - if (indexConfig.getCluster() != EarlybirdCluster.FULL_ARCHIVE) { - initHourlyTweetCounts(segmentWriterReference.get()); - } - } else { - // It's important to unset segment data so that there are no references to it - // and it can be GC-ed. - indexStats.unsetSegmentDataAndSaveCounts(); - } - } - - /** - * Add a document if it is not already in segment. - */ - public void addDocument(TweetDocument doc) throws IOException { - if (indexConfig.isIndexStoredOnDisk()) { - addDocumentToArchiveSegment(doc); - } else { - addDocumentToRealtimeSegment(doc); - } - } - - private void addDocumentToArchiveSegment(TweetDocument doc) throws IOException { - // For archive, the document id should come in order, to drop duplicates, only need to - // compare current id with last one. 
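    // Worked example (illustrative IDs): if the stream delivers tweet IDs 105, 104, 104, 106,
    // the second 104 equals lastInOrderTweetId and is skipped as a duplicate, while 106 is
    // greater than lastInOrderTweetId (104), so it is counted as out of order but still indexed.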
- long tweetId = doc.getTweetID(); - if (tweetId == lastInOrderTweetId) { - LOG.warn("Dropped duplicate tweet for archive: {}", tweetId); - DUPLICATE_TWEET_SKIPPED_COUNTER.increment(); - return; - } - - if (tweetId > lastInOrderTweetId && lastInOrderTweetId != 0) { - // Archive orders document from newest to oldest, so this shouldn't happen - LOG.warn("Encountered out-of-order tweet for archive: {}", tweetId); - OUT_OF_ORDER_TWEET_COUNTER.increment(); - } else { - lastInOrderTweetId = tweetId; - } - - addDocumentInternal(doc); - } - - private void addDocumentToRealtimeSegment(TweetDocument doc) throws IOException { - long tweetId = doc.getTweetID(); - boolean outOfOrder = tweetId <= lastInOrderTweetId; - if (outOfOrder) { - OUT_OF_ORDER_TWEET_COUNTER.increment(); - } else { - lastInOrderTweetId = tweetId; - } - - // We only need to call hasDocument() for out-of-order tweets. - if (outOfOrder && hasDocument(tweetId)) { - // We do get duplicates sometimes so you'll see some amount of these. - DUPLICATE_TWEET_SKIPPED_COUNTER.increment(); - } else { - addDocumentInternal(doc); - incrementHourlyTweetCount(doc.getTweetID()); - } - } - - private void addDocumentInternal(TweetDocument tweetDocument) throws IOException { - Document doc = tweetDocument.getDocument(); - - // Never write blank documents into the index. - if (doc == null || doc.getFields() == null || doc.getFields().size() == 0) { - return; - } - - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - EarlybirdIndexSegmentData segmentData = indexConfig.newSegmentData( - maxSegmentSize, - timeSliceID, - luceneDir, - extensionsFactory); - segmentWriter = newWriteableSegment(segmentData); - resetSegmentWriterReference(segmentWriter, false); - } - - Preconditions.checkState(segmentWriter.numDocs() < maxSegmentSize, - "Reached max segment size %s", maxSegmentSize); - - IndexableField[] featuresField = doc.getFields( - EarlybirdFieldConstants.ENCODED_TWEET_FEATURES_FIELD_NAME); - Preconditions.checkState(featuresField.length == 1, - "featuresField.length should be 1, but is %s", featuresField.length); - - // We require the createdAt field to be set so we can properly filter tweets based on time. - IndexableField[] createdAt = - doc.getFields(EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName()); - Preconditions.checkState(createdAt.length == 1); - - EarlybirdEncodedFeatures features = EarlybirdEncodedFeaturesUtil.fromBytes( - indexConfig.getSchema().getSchemaSnapshot(), - EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD, - featuresField[0].binaryValue().bytes, - featuresField[0].binaryValue().offset); - boolean currentDocIsOffensive = features.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG); - perFieldCounters.increment(ThriftIndexingEventType.INSERT, doc); - segmentWriter.addTweet(doc, tweetDocument.getTweetID(), currentDocIsOffensive); - } - - private void incrementHourlyTweetCount(long tweetId) { - // SEARCH-23619, We won't attempt to increment the count for pre-snowflake IDs, since - // extracting an exact create time is pretty tricky at this point, and the stat is mostly - // useful for checking realtime tweet indexing. - if (SnowflakeId.isSnowflakeId(tweetId)) { - long tweetCreateTime = SnowflakeId.unixTimeMillisFromId(tweetId); - String tweetHour = HOURLY_COUNT_DATE_TIME_FORMATTER.format( - ZonedDateTime.ofInstant(Instant.ofEpochMilli(tweetCreateTime), ZoneOffset.UTC)); - - String segmentOptimizedSuffix = isOptimized() ? 
"optimized" : "unoptimized"; - SearchCounter indexedTweetsCounter = indexedTweetsCounters.computeIfAbsent( - tweetHour + "_" + segmentOptimizedSuffix, - (tweetHourKey) -> SearchCounter.export(String.format( - NUM_TWEETS_CREATED_AT_PATTERN, segmentOptimizedSuffix, segmentName, tweetHour))); - indexedTweetsCounter.increment(); - } - } - - private void initHourlyTweetCounts(EarlybirdIndexSegmentWriter segmentWriter) { - DocIDToTweetIDMapper mapper = segmentWriter.getSegmentData().getDocIDToTweetIDMapper(); - int docId = Integer.MIN_VALUE; - while ((docId = mapper.getNextDocID(docId)) != DocIDToTweetIDMapper.ID_NOT_FOUND) { - incrementHourlyTweetCount(mapper.getTweetID(docId)); - } - } - - /** - * Adds the given document for the given tweet ID to the segment, potentially out of order. - */ - public boolean appendOutOfOrder(Document doc, long tweetID) throws IOException { - // Never write blank documents into the index. - if (doc == null || doc.getFields() == null || doc.getFields().size() == 0) { - return false; - } - - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - logAppendOutOfOrderFailure(tweetID, doc, "segment is null"); - return false; - } - - if (!indexConfig.supportOutOfOrderIndexing()) { - logAppendOutOfOrderFailure(tweetID, doc, "out of order indexing not supported"); - return false; - } - - if (!hasDocument(tweetID)) { - logAppendOutOfOrderFailure(tweetID, doc, "tweet ID index lookup failed"); - searchIndexingMetricSet.updateOnMissingTweetCounter.increment(); - perFieldCounters.incrementTweetNotInIndex(ThriftIndexingEventType.OUT_OF_ORDER_APPEND, doc); - return false; - } - - perFieldCounters.increment(ThriftIndexingEventType.OUT_OF_ORDER_APPEND, doc); - segmentWriter.appendOutOfOrder(new TweetIDQuery(tweetID), doc); - indexStats.incrementOutOfOrderUpdateCount(); - return true; - } - - private void logAppendOutOfOrderFailure(long tweetID, Document doc, String reason) { - UPDATES_ERRORS_LOG.debug( - "appendOutOfOrder() failed to apply update document with hash {} on tweet ID {}: {}", - Objects.hashCode(doc), tweetID, reason); - } - - /** - * Determines if this segment contains the given tweet ID. - */ - public boolean hasDocument(long tweetID) throws IOException { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter == null) { - return false; - } - - return segmentWriter.getSegmentData().getDocIDToTweetIDMapper().getDocID(tweetID) - != DocIDToTweetIDMapper.ID_NOT_FOUND; - } - - private static final String VERSION_PROP_NAME = "version"; - private static final String VERSION_DESC_PROP_NAME = "versionDescription"; - private static final String PARTIAL_UPDATES_COUNT = "partialUpdatesCount"; - private static final String OUT_OF_ORDER_UPDATES_COUNT = "outOfOrderUpdatesCount"; - - private void checkIfFlushedDataVersionMatchesExpected(FlushInfo flushInfo) throws IOException { - int expectedVersionNumber = indexConfig.getSchema().getMajorVersionNumber(); - String expectedVersionDesc = indexConfig.getSchema().getVersionDescription(); - int version = flushInfo.getIntProperty(VERSION_PROP_NAME); - final String versionDesc = flushInfo.getStringProperty(VERSION_DESC_PROP_NAME); - - if (version != expectedVersionNumber) { - throw new FlushVersionMismatchException("Flushed version mismatch. 
Expected: " - + expectedVersionNumber + ", but was: " + version); - } - - if (!expectedVersionDesc.equals(versionDesc)) { - final String message = "Flush version " + expectedVersionNumber + " is ambiguous" - + " Expected: " + expectedVersionDesc - + " Found: " + versionDesc - + " Please clean up segments with bad flush version from HDFS and Earlybird local disk."; - throw new FlushVersionMismatchException(message); - } - } - - /** - * Loads the segment data and properties from the given deserializer and flush info. - * - * @param in The deserializer from which the segment's data will be read. - * @param flushInfo The flush info from which the segment's properties will be read. - */ - public void load(DataDeserializer in, FlushInfo flushInfo) throws IOException { - checkIfFlushedDataVersionMatchesExpected(flushInfo); - - int partialUpdatesCount = flushInfo.getIntProperty(PARTIAL_UPDATES_COUNT); - int outOfOrderUpdatesCount = flushInfo.getIntProperty(OUT_OF_ORDER_UPDATES_COUNT); - - EarlybirdIndexSegmentData loadedSegmentData = indexConfig.loadSegmentData( - flushInfo, in, luceneDir, extensionsFactory); - - setSegmentData(loadedSegmentData, partialUpdatesCount, outOfOrderUpdatesCount); - } - - /** - * Update the data backing this EarlyirdSegment. - */ - public void setSegmentData( - EarlybirdIndexSegmentData segmentData, - int partialUpdatesCount, - int outOfOrderUpdatesCount) throws IOException { - resetSegmentWriterReference(newWriteableSegment(segmentData), false); - try { - warmSegment(); - } catch (IOException e) { - LOG.error("Failed to create IndexReader for segment {}. Will destroy unreadable segment.", - segmentName, e); - destroyImmediately(); - throw e; - } - - LOG.info("Starting segment {} with {} partial updates, {} out of order updates and {} deletes.", - segmentName, partialUpdatesCount, outOfOrderUpdatesCount, indexStats.getDeleteCount()); - indexStats.setPartialUpdateCount(partialUpdatesCount); - indexStats.setOutOfOrderUpdateCount(outOfOrderUpdatesCount); - indexStats.setIndexSizeOnDiskInBytes(getSegmentSizeOnDisk()); - } - - /** - * Flushes the this segment's properties to the given FlushInfo instance, and this segment's data - * to the given DataSerializer instance. - * - * @param flushInfo The FlushInfo instance where all segment properties should be added. - * @param out The serializer to which all segment data should be flushed. - */ - public void flush(FlushInfo flushInfo, DataSerializer out) throws IOException { - flushInfo.addIntProperty(VERSION_PROP_NAME, indexConfig.getSchema().getMajorVersionNumber()); - flushInfo.addStringProperty(VERSION_DESC_PROP_NAME, - indexConfig.getSchema().getVersionDescription()); - flushInfo.addIntProperty(PARTIAL_UPDATES_COUNT, indexStats.getPartialUpdateCount()); - flushInfo.addIntProperty(OUT_OF_ORDER_UPDATES_COUNT, indexStats.getOutOfOrderUpdateCount()); - if (segmentWriterReference.get() == null) { - LOG.warn("Segment writer is null. flushInfo: {}", flushInfo); - } else if (segmentWriterReference.get().getSegmentData() == null) { - LOG.warn("Segment data is null. segment writer: {}, flushInfo: {}", - segmentWriterReference.get(), flushInfo); - } - segmentWriterReference.get().getSegmentData().flushSegment(flushInfo, out); - indexStats.setIndexSizeOnDiskInBytes(getSegmentSizeOnDisk()); - } - - /** - * Check to see if this segment can be loaded from an on-disk index, and load it if it can be. - * - * This should only be applicable to the current segment for the on-disk archive. 
It's not - * fully flushed until it's full, but we do have a lucene index on local disk which can be - * used at startup (rather than have to reindex all the current timeslice documents again). - * - * If loaded, the index reader will be pre-created, and the segment will be marked as - * optimized. - * - * If the index directory exists but it cannot be loaded, the index directory will be deleted. - * - * @return true if the index exists on disk, and was loaded. - */ - public boolean tryToLoadExistingIndex() throws IOException { - Preconditions.checkState(segmentWriterReference.get() == null); - if (indexConfig.isIndexStoredOnDisk()) { - if (DirectoryReader.indexExists(luceneDir) && checkSuccessFile()) { - LOG.info("Index directory already exists for {} at {}", segmentName, luceneDir); - - // set the optimized flag, since we don't need to optimize any more, and pre-create - // the index reader (for the on-disk index optimize() is a noop that just sets the - // optimized flag). - EarlybirdIndexSegmentData earlybirdIndexSegmentData = indexConfig.newSegmentData( - maxSegmentSize, - timeSliceID, - luceneDir, - extensionsFactory); - EarlybirdIndexSegmentData optimizedEarlybirdIndexSegmentData = - indexConfig.optimize(earlybirdIndexSegmentData); - resetSegmentWriterReference(newWriteableSegment(optimizedEarlybirdIndexSegmentData), false); - - warmSegment(); - - LOG.info("Used existing lucene index for {} with {} documents", - segmentName, indexStats.getStatusCount()); - - indexStats.setIndexSizeOnDiskInBytes(getSegmentSizeOnDisk()); - - return true; - } else { - // Check if there is an existing lucene dir without a SUCCESS file on disk. - // If so, we will remove it and reindex from scratch. - if (moveFSDirectoryIfExists(luceneDir)) { - // Throw here to be cleaned up and retried by SimpleSegmentIndexer. - throw new IOException("Found invalid existing lucene directory at: " + luceneDir); - } - } - } - return false; - } - - /** - * Partially updates a document with the field value(s) specified by event. - * Returns true if all writes were successful and false if one or more writes fail or if - * tweet id isn't found in the segment. 
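 * For illustration (the field names are hypothetical), an event with uid 123 whose document
 * carries {favCount: 42, retweetCount: 7} resolves each field name through the schema snapshot
 * to a FeatureConfiguration and writes the new value into that feature's column-stride doc
 * values for tweet 123's doc ID, provided featureConfig.validateFeatureUpdate() accepts it.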
- */ - public boolean applyPartialUpdate(ThriftIndexingEvent event) throws IOException { - Preconditions.checkArgument(event.getEventType() == ThriftIndexingEventType.PARTIAL_UPDATE); - Preconditions.checkArgument(event.isSetUid()); - Preconditions.checkArgument(!ThriftDocumentUtil.hasDuplicateFields(event.getDocument())); - ImmutableSchemaInterface schemaSnapshot = indexConfig.getSchema().getSchemaSnapshot(); - - long tweetId = event.getUid(); - ThriftDocument doc = event.getDocument(); - - if (!hasDocument(tweetId)) { - // no need to attempt field writes, fail early - PARTIAL_UPDATE_FOR_TWEET_NOT_IN_INDEX.increment(); - perFieldCounters.incrementTweetNotInIndex( - ThriftIndexingEventType.PARTIAL_UPDATE, doc); - return false; - } - - int invalidFields = 0; - for (ThriftField field : doc.getFields()) { - String featureName = schemaSnapshot.getFieldName(field.getFieldConfigId()); - FeatureConfiguration featureConfig = - schemaSnapshot.getFeatureConfigurationByName(featureName); - if (featureConfig == null) { - INVALID_FIELDS_IN_PARTIAL_UPDATES.increment(); - invalidFields++; - continue; - } - - perFieldCounters.increment(ThriftIndexingEventType.PARTIAL_UPDATE, featureName); - - updateDocValues( - tweetId, - featureName, - (docValues, docID) -> updateFeatureValue(docID, featureConfig, docValues, field)); - } - - if (invalidFields > 0 && invalidFields != doc.getFieldsSize()) { - PARTIAL_UPDATE_PARTIAL_FAILURE.increment(); - } - - if (invalidFields == 0) { - indexStats.incrementPartialUpdateCount(); - } else { - UPDATES_ERRORS_LOG.warn("Failed to apply update for tweetID {}, found {} invalid fields: {}", - tweetId, invalidFields, event); - } - - return invalidFields == 0; - } - - @VisibleForTesting - static void updateFeatureValue(int docID, - FeatureConfiguration featureConfig, - ColumnStrideFieldIndex docValues, - ThriftField updateField) { - int oldValue = Math.toIntExact(docValues.get(docID)); - int newValue = updateField.getFieldData().getIntValue(); - - if (!featureConfig.validateFeatureUpdate(oldValue, newValue)) { - // Counter values can only increase - SearchCounter.export( - INVALID_FEATURE_UPDATES_DROPPED_PREFIX + featureConfig.getName()).increment(); - } else { - docValues.setValue(docID, newValue); - } - } - - /** - * Checks if the provided directory exists and is not empty, - * and if it does moves it out to a diff directory for later inspection. - * @param luceneDirectory the dir to move if it exists. - * @return true iff we found an existing directory. - */ - private static boolean moveFSDirectoryIfExists(Directory luceneDirectory) { - Preconditions.checkState(luceneDirectory instanceof FSDirectory); - File directory = ((FSDirectory) luceneDirectory).getDirectory().toFile(); - if (directory != null && directory.exists() && directory.list().length > 0) { - // Save the bad lucene index by moving it out, for later inspection. - File movedDir = new File(directory.getParent(), - directory.getName() + ".failed." + System.currentTimeMillis()); - LOG.warn("Moving existing non-successful index for {} from {} to {}", - luceneDirectory, directory, movedDir); - boolean success = directory.renameTo(movedDir); - if (!success) { - LOG.warn("Unable to rename non-successful index: {}", luceneDirectory); - } - return true; - } - return false; - } - - /** - * For the on-disk archive, if we were able to successfully merge and flush the Lucene index to - * disk, we mark it explicitly with a SUCCESS file, so that it can be safely reused. 
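 * At startup, tryToLoadExistingIndex() reuses an existing Lucene directory only when both the
 * index and this marker file are present; otherwise the directory is moved aside (see
 * moveFSDirectoryIfExists()) and the timeslice is reindexed from scratch.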
- */ - private void addSuccessFile() throws IOException { - if (indexConfig.isIndexStoredOnDisk()) { - IndexOutput successFile = luceneDir.createOutput(SUCCESS_FILE, IOContext.DEFAULT); - successFile.close(); - } - } - - /** - * Returns the current number of documents in this segment. - */ - public int getNumDocs() throws IOException { - return indexStats.getStatusCount(); - } - - /** - * Reclaim resources used by this segment (E.g. closing lucene index reader). - * Resources will be reclaimed within the calling thread with no delay. - */ - public void destroyImmediately() { - try { - closeSegmentWriter(); - maybeDeleteSegmentOnDisk(); - unloadSegmentFromMemory(); - } finally { - indexConfig.getResourceCloser().closeResourcesImmediately(closableResources); - } - } - - /** - * Close the in-memory resources belonging to this segment. This should allow the in-memory - * segment data to be garbage collected. After closing, the segment is not writable. - */ - public void close() { - if (segmentWriterReference.get() == null) { - LOG.info("Segment {} already closed.", segmentName); - return; - } - - LOG.info("Closing segment {}.", segmentName); - try { - closeSegmentWriter(); - unloadSegmentFromMemory(); - } finally { - indexConfig.getResourceCloser().closeResourcesImmediately(closableResources); - } - } - - private void closeSegmentWriter() { - EarlybirdIndexSegmentWriter segmentWriter = segmentWriterReference.get(); - if (segmentWriter != null) { - closableResources.add(() -> { - LOG.info("Closing writer for segment: {}", segmentName); - segmentWriter.close(); - }); - } - } - - private void maybeDeleteSegmentOnDisk() { - if (indexConfig.isIndexStoredOnDisk()) { - Preconditions.checkState( - luceneDir instanceof FSDirectory, - "On-disk indexes should have an underlying directory that we can close and remove."); - closableResources.add(luceneDir); - - if (luceneDirFile != null && luceneDirFile.exists()) { - closableResources.add(new Closeable() { - @Override - public void close() throws IOException { - FileUtils.deleteDirectory(luceneDirFile); - } - - @Override - public String toString() { - return "delete {" + luceneDirFile + "}"; - } - }); - } - } - } - - private void unloadSegmentFromMemory() { - // Make sure we don't retain a reference to the IndexWriter or SegmentData. - resetSegmentWriterReference(null, true); - } - - private long getSegmentSizeOnDisk() throws IOException { - searchIndexingMetricSet.segmentSizeCheckCount.increment(); - - long totalSize = 0; - if (luceneDir != null) { - for (String file : luceneDir.listAll()) { - totalSize += luceneDir.fileLength(file); - } - } - return totalSize; - } - - ////////////////////////// - // for unit tests only - ////////////////////////// - - public EarlybirdIndexConfig getEarlybirdIndexConfig() { - return indexConfig; - } - - @VisibleForTesting - public boolean checkSuccessFile() { - return new File(luceneDirFile, SUCCESS_FILE).exists(); - } - - @VisibleForTesting - EarlybirdIndexSegmentWriter getIndexSegmentWriter() { - return segmentWriterReference.get(); - } - - // Helper class to encapsulate counter tables, patterns and various ways to increment - private class PerFieldCounters { - // The number of update/append events for each field in the schema. 
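    // For example (hypothetical field name), a PARTIAL_UPDATE event that touches field
    // "fav_count" is exported by incrementForPattern() as the lowercased counter
    // "partial_update_for_field_fav_count".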
- private static final String PER_FIELD_EVENTS_COUNTER_PATTERN = "%s_for_field_%s"; - // The number of dropped update/append events for each field due to tweetId not found - private static final String TWEET_NOT_IN_INDEX_PER_FIELD_EVENTS_COUNTER_PATTERN = - "%s_for_tweet_id_not_in_index_for_field_%s"; - private final Table perFieldTable = - HashBasedTable.create(); - private final Table notInIndexPerFieldTable = - HashBasedTable.create(); - - public void increment( - ThriftIndexingEventType eventType, ThriftDocument doc) { - ImmutableSchemaInterface schemaSnapshot = indexConfig.getSchema().getSchemaSnapshot(); - for (ThriftField field : doc.getFields()) { - String fieldName = schemaSnapshot.getFieldName(field.getFieldConfigId()); - incrementForPattern( - eventType, fieldName, perFieldTable, PER_FIELD_EVENTS_COUNTER_PATTERN); - } - } - - public void incrementTweetNotInIndex( - ThriftIndexingEventType eventType, ThriftDocument doc) { - ImmutableSchemaInterface schemaSnapshot = indexConfig.getSchema().getSchemaSnapshot(); - for (ThriftField field : doc.getFields()) { - String fieldName = schemaSnapshot.getFieldName(field.getFieldConfigId()); - incrementForPattern( - eventType, fieldName, notInIndexPerFieldTable, - TWEET_NOT_IN_INDEX_PER_FIELD_EVENTS_COUNTER_PATTERN); - } - } - - public void increment(ThriftIndexingEventType eventType, Document doc) { - for (IndexableField field : doc.getFields()) { - incrementForPattern( - eventType, field.name(), - perFieldTable, PER_FIELD_EVENTS_COUNTER_PATTERN); - } - } - - public void increment(ThriftIndexingEventType eventType, String fieldName) { - incrementForPattern(eventType, fieldName, perFieldTable, PER_FIELD_EVENTS_COUNTER_PATTERN); - } - - public void incrementTweetNotInIndex(ThriftIndexingEventType eventType, Document doc) { - for (IndexableField field : doc.getFields()) { - incrementForPattern( - eventType, field.name(), - notInIndexPerFieldTable, - TWEET_NOT_IN_INDEX_PER_FIELD_EVENTS_COUNTER_PATTERN); - } - } - - private void incrementForPattern( - ThriftIndexingEventType eventType, String fieldName, - Table counterTable, String pattern) { - - SearchCounter stat; - if (counterTable.contains(eventType, fieldName)) { - stat = counterTable.get(eventType, fieldName); - } else { - stat = SearchCounter.export(String.format(pattern, eventType, fieldName).toLowerCase()); - counterTable.put(eventType, fieldName, stat); - } - stat.increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/EarlybirdSegmentFactory.java b/src/java/com/twitter/search/earlybird/index/EarlybirdSegmentFactory.java deleted file mode 100644 index 1a43cf52b..000000000 --- a/src/java/com/twitter/search/earlybird/index/EarlybirdSegmentFactory.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import org.apache.lucene.store.Directory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.partition.SearchIndexingMetricSet; -import com.twitter.search.earlybird.partition.SegmentSyncInfo; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; - -public class EarlybirdSegmentFactory { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdSegmentFactory.class); - - private final EarlybirdIndexConfig earlybirdIndexConfig; - private final SearchIndexingMetricSet 
searchIndexingMetricSet; - private final EarlybirdSearcherStats searcherStats; - private Clock clock; - - public EarlybirdSegmentFactory( - EarlybirdIndexConfig earlybirdIndexConfig, - SearchIndexingMetricSet searchIndexingMetricSet, - EarlybirdSearcherStats searcherStats, - Clock clock) { - this.earlybirdIndexConfig = earlybirdIndexConfig; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.searcherStats = searcherStats; - this.clock = clock; - } - - public EarlybirdIndexConfig getEarlybirdIndexConfig() { - return earlybirdIndexConfig; - } - - /** - * Creates a new earlybird segment. - */ - public EarlybirdSegment newEarlybirdSegment(Segment segment, SegmentSyncInfo segmentSyncInfo) - throws IOException { - Directory dir = earlybirdIndexConfig.newLuceneDirectory(segmentSyncInfo); - - LOG.info("Creating EarlybirdSegment on " + dir.toString()); - - return new EarlybirdSegment( - segment.getSegmentName(), - segment.getTimeSliceID(), - segment.getMaxSegmentSize(), - dir, - earlybirdIndexConfig, - searchIndexingMetricSet, - searcherStats, - clock); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/EarlybirdSingleSegmentSearcher.java b/src/java/com/twitter/search/earlybird/index/EarlybirdSingleSegmentSearcher.java deleted file mode 100644 index 848a1bd6f..000000000 --- a/src/java/com/twitter/search/earlybird/index/EarlybirdSingleSegmentSearcher.java +++ /dev/null @@ -1,423 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Map.Entry; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.CollectionStatistics; -import org.apache.lucene.search.Collector; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.TermStatistics; -import org.apache.lucene.search.Weight; -import org.apache.lucene.util.BytesRef; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.relevance.features.EarlybirdDocumentFeatures; -import com.twitter.search.common.results.thriftjava.FieldHitAttribution; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.TwitterCollector; -import com.twitter.search.common.search.TwitterIndexSearcher; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.earlybird.EarlybirdSearcher; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import 
com.twitter.search.earlybird.search.EarlybirdLuceneSearcher; -import com.twitter.search.earlybird.search.Hit; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.search.SimpleSearchResults; -import com.twitter.search.earlybird.search.facets.AbstractFacetTermCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsRequestInfo; -import com.twitter.search.earlybird.search.relevance.scoring.RelevanceQuery; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; -import com.twitter.search.earlybird.thrift.ThriftTermResults; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; - -public class EarlybirdSingleSegmentSearcher extends EarlybirdLuceneSearcher { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdSingleSegmentSearcher.class); - - private final EarlybirdIndexSegmentAtomicReader twitterReader; - private final ImmutableSchemaInterface schema; - private final UserTable userTable; - private final long timeSliceID; - - private final EarlybirdSearcherStats searcherStats; - private Clock clock; - - public EarlybirdSingleSegmentSearcher( - ImmutableSchemaInterface schema, - EarlybirdIndexSegmentAtomicReader reader, - UserTable userTable, - EarlybirdSearcherStats searcherStats, - Clock clock) { - super(reader); - this.schema = schema; - this.twitterReader = reader; - this.userTable = userTable; - this.timeSliceID = reader.getSegmentData().getTimeSliceID(); - this.searcherStats = searcherStats; - this.clock = clock; - } - - public final long getTimeSliceID() { - return timeSliceID; - } - - public EarlybirdIndexSegmentAtomicReader getTwitterIndexReader() { - return twitterReader; - } - - /** - * search() main loop. - * This behaves exactly like IndexSearcher.search() if a stock Lucene collector passed in. - * However, if a TwitterCollector is passed in, this class performs Twitter style early - * termination without relying on - * {@link org.apache.lucene.search.CollectionTerminatedException}. - * This method is nearly identical to TwitterIndexSearcher.search() with two differences: - * 1) advances to smallest docID before searching. Important to skip incomplete docs in - * realtime segments. - * 2) skips deletes using twitterReader - */ - @Override - protected void search(List leaves, Weight weight, Collector coll) - throws IOException { - // If an TwitterCollector is passed in, we can do a few extra things in here, such - // as early termination. Otherwise we can just fall back to IndexSearcher.search(). - if (!(coll instanceof TwitterCollector)) { - super.search(leaves, weight, coll); - return; - } - - TwitterCollector collector = (TwitterCollector) coll; - if (collector.isTerminated()) { - return; - } - - LOG.debug("Starting segment {}", timeSliceID); - - // Notify the collector that we're starting this segment, and check for early - // termination criteria again. setNextReader() performs 'expensive' early - // termination checks in some implementations such as TwitterEarlyTerminationCollector. 
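    // The rest of this method follows a fixed per-segment protocol: obtain the leaf collector,
    // build the scorer (which may be null if the query clearly has no hits in this segment),
    // advance the iterator to the segment's smallest doc ID, collect every non-deleted hit until
    // the collector reports isTerminated(), and finally call finishSegment() with the last doc ID
    // visited.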
- LeafCollector leafCollector = collector.getLeafCollector(twitterReader.getContext()); - if (collector.isTerminated()) { - return; - } - - // Initialize the scorer: - // Note that constructing the scorer may actually do real work, such as advancing to the - // first hit. - // The scorer may be null if we can tell right away that the query has no hits: e.g. if the - // first hit does not actually exist. - Scorer scorer = weight.scorer(twitterReader.getContext()); - if (scorer == null) { - LOG.debug("Scorer was null, not searching segment {}", timeSliceID); - collector.finishSegment(DocIdSetIterator.NO_MORE_DOCS); - return; - } - leafCollector.setScorer(scorer); - - // Make sure to start searching at the smallest docID. - DocIdSetIterator docIdSetIterator = scorer.iterator(); - int smallestDocId = twitterReader.getSmallestDocID(); - int docID = docIdSetIterator.advance(smallestDocId); - - // Collect results. - while (docID != DocIdSetIterator.NO_MORE_DOCS) { - // Exclude deleted docs. - if (!twitterReader.getDeletesView().isDeleted(docID)) { - leafCollector.collect(docID); - } - - // Check if we're done after we consumed the document. - if (collector.isTerminated()) { - break; - } - - docID = docIdSetIterator.nextDoc(); - } - - // Always finish the segment, providing the last docID advanced to. - collector.finishSegment(docID); - } - - @Override - public void fillFacetResults( - AbstractFacetTermCollector collector, ThriftSearchResults searchResults) - throws IOException { - if (searchResults == null || searchResults.getResultsSize() == 0) { - return; - } - - EarlybirdIndexSegmentData segmentData = twitterReader.getSegmentData(); - collector.resetFacetLabelProviders( - segmentData.getFacetLabelProviders(), segmentData.getFacetIDMap()); - DocIDToTweetIDMapper docIdMapper = segmentData.getDocIDToTweetIDMapper(); - for (ThriftSearchResult result : searchResults.getResults()) { - int docId = docIdMapper.getDocID(result.getId()); - if (docId < 0) { - continue; - } - - segmentData.getFacetCountingArray().collectForDocId(docId, collector); - collector.fillResultAndClear(result); - } - } - - @Override - public TermStatisticsCollector.TermStatisticsSearchResults collectTermStatistics( - TermStatisticsRequestInfo searchRequestInfo, - EarlybirdSearcher searcher, int requestDebugMode) throws IOException { - TermStatisticsCollector collector = new TermStatisticsCollector( - schema, searchRequestInfo, searcherStats, clock, requestDebugMode); - - search(searchRequestInfo.getLuceneQuery(), collector); - searcher.maybeSetCollectorDebugInfo(collector); - return collector.getResults(); - } - - /** This method is only used for debugging, so it's not optimized for speed */ - @Override - public void explainSearchResults(SearchRequestInfo searchRequestInfo, - SimpleSearchResults hits, - ThriftSearchResults searchResults) throws IOException { - Weight weight = - createWeight(rewrite(searchRequestInfo.getLuceneQuery()), ScoreMode.COMPLETE, 1.0f); - - DocIDToTweetIDMapper docIdMapper = twitterReader.getSegmentData().getDocIDToTweetIDMapper(); - for (int i = 0; i < hits.numHits(); i++) { - final Hit hit = hits.getHit(i); - Preconditions.checkState(hit.getTimeSliceID() == timeSliceID, - "hit: " + hit.toString() + " is not in timeslice: " + timeSliceID); - final ThriftSearchResult result = searchResults.getResults().get(i); - if (!result.isSetMetadata()) { - result.setMetadata(new ThriftSearchResultMetadata() - .setPenguinVersion(EarlybirdConfig.getPenguinVersionByte())); - } - - final int docIdToExplain = 
docIdMapper.getDocID(hit.getStatusID()); - if (docIdToExplain == DocIDToTweetIDMapper.ID_NOT_FOUND) { - result.getMetadata().setExplanation( - "ERROR: Could not find doc ID to explain for " + hit.toString()); - } else { - Explanation explanation; - FieldHitAttribution fieldHitAttribution = result.getMetadata().getFieldHitAttribution(); - if (weight instanceof RelevanceQuery.RelevanceWeight && fieldHitAttribution != null) { - RelevanceQuery.RelevanceWeight relevanceWeight = - (RelevanceQuery.RelevanceWeight) weight; - - explanation = relevanceWeight.explain( - twitterReader.getContext(), docIdToExplain, fieldHitAttribution); - } else { - explanation = weight.explain(twitterReader.getContext(), docIdToExplain); - } - hit.setHasExplanation(true); - result.getMetadata().setExplanation(explanation.toString()); - } - } - } - - @Override - public void fillFacetResultMetadata(Map facetResults, - ImmutableSchemaInterface documentSchema, - byte debugMode) throws IOException { - FacetLabelProvider provider = twitterReader.getFacetLabelProviders( - documentSchema.getFacetFieldByFacetName(EarlybirdFieldConstant.TWIMG_FACET)); - - FacetLabelProvider.FacetLabelAccessor photoAccessor = null; - - if (provider != null) { - photoAccessor = provider.getLabelAccessor(); - } - - for (Entry facetResult : facetResults.entrySet()) { - Term term = facetResult.getKey(); - ThriftFacetCount facetCount = facetResult.getValue(); - - ThriftFacetCountMetadata metadata = facetCount.getMetadata(); - if (metadata == null) { - metadata = new ThriftFacetCountMetadata(); - facetCount.setMetadata(metadata); - } - - fillTermMetadata(term, metadata, photoAccessor, debugMode); - } - } - - @Override - public void fillTermStatsMetadata(ThriftTermStatisticsResults termStatsResults, - ImmutableSchemaInterface documentSchema, - byte debugMode) throws IOException { - - FacetLabelProvider provider = twitterReader.getFacetLabelProviders( - documentSchema.getFacetFieldByFacetName(EarlybirdFieldConstant.TWIMG_FACET)); - - FacetLabelProvider.FacetLabelAccessor photoAccessor = null; - - if (provider != null) { - photoAccessor = provider.getLabelAccessor(); - } - - for (Map.Entry entry - : termStatsResults.termResults.entrySet()) { - - ThriftTermRequest termRequest = entry.getKey(); - if (termRequest.getFieldName().isEmpty()) { - continue; - } - Schema.FieldInfo facetField = schema.getFacetFieldByFacetName(termRequest.getFieldName()); - Term term = null; - if (facetField != null) { - term = new Term(facetField.getName(), termRequest.getTerm()); - } - if (term == null) { - continue; - } - - ThriftFacetCountMetadata metadata = entry.getValue().getMetadata(); - if (metadata == null) { - metadata = new ThriftFacetCountMetadata(); - entry.getValue().setMetadata(metadata); - } - - fillTermMetadata(term, metadata, photoAccessor, debugMode); - } - } - - private void fillTermMetadata(Term term, ThriftFacetCountMetadata metadata, - FacetLabelProvider.FacetLabelAccessor photoAccessor, - byte debugMode) throws IOException { - boolean isTwimg = term.field().equals(EarlybirdFieldConstant.TWIMG_LINKS_FIELD.getFieldName()); - int internalDocID = DocIDToTweetIDMapper.ID_NOT_FOUND; - long statusID = -1; - long userID = -1; - Term facetTerm = term; - - // Deal with the from_user_id facet. 
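As an aside on the explanation plumbing in explainSearchResults above: with stock Lucene, a similar per-hit score breakdown can be obtained from IndexSearcher.explain(). The snippet below is a generic, self-contained sketch against a throwaway in-memory index (the field name, analyzer and query are invented for the example); it is not the Earlybird path, which calls Weight.explain()/RelevanceWeight.explain() directly on its own segment reader. The from_user_id facet handling picks up again right after this sketch.

// Illustrative only: producing a per-hit Explanation with stock Lucene.
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public final class ExplainSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("text", "realtime search over tweets", Field.Store.YES));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TermQuery query = new TermQuery(new Term("text", "search"));
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        // Explanation breaks the score down clause by clause, much like the explanation string
        // stored on the result metadata above.
        Explanation explanation = searcher.explain(query, hit.doc);
        System.out.println(explanation);
      }
    }
  }
}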
- if (term.field().equals(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName())) { - userID = Long.parseLong(term.text()); - facetTerm = new Term(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), - LongTermAttributeImpl.copyIntoNewBytesRef(userID)); - } else if (isTwimg) { - statusID = Long.parseLong(term.text()); - internalDocID = twitterReader.getSegmentData().getDocIDToTweetIDMapper().getDocID(statusID); - } - - if (internalDocID == DocIDToTweetIDMapper.ID_NOT_FOUND) { - // If this is not a twimg, this is how statusID should be looked up - // - // If this is a twimg but we couldn't find the internalDocID, that means this segment, - // or maybe even this earlybird, does not contain the original tweet. Then we treat this as - // a normal facet for now - internalDocID = twitterReader.getOldestDocID(facetTerm); - if (internalDocID >= 0) { - statusID = - twitterReader.getSegmentData().getDocIDToTweetIDMapper().getTweetID(internalDocID); - } else { - statusID = -1; - } - } - - // make sure tweet is not deleted - if (internalDocID < 0 || twitterReader.getDeletesView().isDeleted(internalDocID)) { - return; - } - - if (metadata.isSetStatusId() - && metadata.getStatusId() > 0 - && metadata.getStatusId() <= statusID) { - // we already have the metadata for this facet from an earlier tweet - return; - } - - // now check if this tweet is offensive, e.g. antisocial, nsfw, sensitive - EarlybirdDocumentFeatures documentFeatures = new EarlybirdDocumentFeatures(twitterReader); - documentFeatures.advance(internalDocID); - boolean isOffensiveFlagSet = - documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG); - boolean isSensitiveFlagSet = - documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_SENSITIVE_CONTENT); - boolean offensive = isOffensiveFlagSet || isSensitiveFlagSet; - - // also, user should not be marked as antisocial, nsfw or offensive - if (userID < 0) { - userID = documentFeatures.getFeatureValue(EarlybirdFieldConstant.FROM_USER_ID_CSF); - } - offensive |= userTable.isSet(userID, - UserTable.ANTISOCIAL_BIT - | UserTable.OFFENSIVE_BIT - | UserTable.NSFW_BIT); - - metadata.setStatusId(statusID); - metadata.setTwitterUserId(userID); - metadata.setCreated_at(twitterReader.getSegmentData().getTimeMapper().getTime(internalDocID)); - int langId = (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.LANGUAGE); - Locale lang = ThriftLanguageUtil.getLocaleOf(ThriftLanguage.findByValue(langId)); - metadata.setStatusLanguage(ThriftLanguageUtil.getThriftLanguageOf(lang)); - metadata.setStatusPossiblySensitive(offensive); - if (isTwimg && photoAccessor != null && !metadata.isSetNativePhotoUrl()) { - int termID = twitterReader.getTermID(term); - if (termID != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - BytesRef termPayload = photoAccessor.getTermPayload(termID); - if (termPayload != null) { - metadata.setNativePhotoUrl(termPayload.utf8ToString()); - } - } - } - - if (debugMode > 3) { - StringBuilder sb = new StringBuilder(256); - if (metadata.isSetExplanation()) { - sb.append(metadata.getExplanation()); - } - sb.append(String.format("TweetId=%d (%s %s), UserId=%d (%s %s), Term=%s\n", - statusID, - isOffensiveFlagSet ? "OFFENSIVE" : "", - isSensitiveFlagSet ? "SENSITIVE" : "", - userID, - userTable.isSet(userID, UserTable.ANTISOCIAL_BIT) ? "ANTISOCIAL" : "", - userTable.isSet(userID, UserTable.NSFW_BIT) ? 
"NSFW" : "", - term.toString())); - metadata.setExplanation(sb.toString()); - } - } - - public ImmutableSchemaInterface getSchemaSnapshot() { - return schema; - } - - @Override - public CollectionStatistics collectionStatistics(String field) throws IOException { - return TwitterIndexSearcher.collectionStatistics(field, getIndexReader()); - } - - @Override - public TermStatistics termStatistics(Term term, int docFreq, long totalTermFreq) { - return TwitterIndexSearcher.termStats(term, docFreq, totalTermFreq); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/OptimizedTimeMapper.java b/src/java/com/twitter/search/earlybird/index/OptimizedTimeMapper.java deleted file mode 100644 index 95267cad9..000000000 --- a/src/java/com/twitter/search/earlybird/index/OptimizedTimeMapper.java +++ /dev/null @@ -1,109 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; -import java.util.Arrays; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -/** - * A TimeMapper implementation that stores the timestamps associated with the doc IDs in an array. - */ -public class OptimizedTimeMapper extends AbstractInMemoryTimeMapper implements Flushable { - // Doc id to timestamp map. Timestamps that are negative are out-of-order. - protected final int[] timeMap; - - // Size must be greater than the max doc ID stored in the optimized tweet ID mapper. 
- public OptimizedTimeMapper(RealtimeTimeMapper realtimeTimeMapper, - DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - super(); - int maxDocId = optimizedTweetIdMapper.getPreviousDocID(Integer.MAX_VALUE); - timeMap = new int[maxDocId + 1]; - Arrays.fill(timeMap, ILLEGAL_TIME); - - int docId = maxDocId; - while (docId != DocIDToTweetIDMapper.ID_NOT_FOUND) { - int originalDocId = originalTweetIdMapper.getDocID(optimizedTweetIdMapper.getTweetID(docId)); - Preconditions.checkState(originalDocId != DocIDToTweetIDMapper.ID_NOT_FOUND); - - int docIdTimestamp = realtimeTimeMapper.getTime(originalDocId); - Preconditions.checkState(docIdTimestamp != TimeMapper.ILLEGAL_TIME); - - doAddMapping(docId, docIdTimestamp); - - docId = optimizedTweetIdMapper.getPreviousDocID(docId); - } - } - - private OptimizedTimeMapper(int[] timeMap, - int reverseMapLastIndex, - IntBlockPool reverseMapTimes, - IntBlockPool reverseMapIds) { - super(reverseMapLastIndex, reverseMapTimes, reverseMapIds); - this.timeMap = timeMap; - } - - @Override - public int getTime(int docID) { - return timeMap[docID]; - } - - @Override - protected void setTime(int docID, int timeSeconds) { - timeMap[docID] = timeSeconds; - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - private static final String REVERSE_MAP_LAST_INDEX_PROP = "reverseMapLastIndex"; - private static final String TIMES_SUB_PROP = "times"; - private static final String IDS_SUB_PROP = "ids"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedTimeMapper objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedTimeMapper mapper = getObjectToFlush(); - out.writeIntArray(mapper.timeMap); - flushInfo.addIntProperty(REVERSE_MAP_LAST_INDEX_PROP, mapper.reverseMapLastIndex); - mapper.reverseMapTimes.getFlushHandler().flush( - flushInfo.newSubProperties(TIMES_SUB_PROP), out); - mapper.reverseMapIds.getFlushHandler().flush( - flushInfo.newSubProperties(IDS_SUB_PROP), out); - } - - @Override - protected OptimizedTimeMapper doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - return new OptimizedTimeMapper( - in.readIntArray(), - flushInfo.getIntProperty(REVERSE_MAP_LAST_INDEX_PROP), - new IntBlockPool.FlushHandler().load(flushInfo.getSubProperties(TIMES_SUB_PROP), in), - new IntBlockPool.FlushHandler().load(flushInfo.getSubProperties(IDS_SUB_PROP), in)); - } - } - - @Override - public TimeMapper optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) { - throw new UnsupportedOperationException("OptimizedTimeMapper instances are already optimized."); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/OptimizedTweetIDMapper.java b/src/java/com/twitter/search/earlybird/index/OptimizedTweetIDMapper.java deleted file mode 100644 index a9bdb7b54..000000000 --- a/src/java/com/twitter/search/earlybird/index/OptimizedTweetIDMapper.java +++ /dev/null @@ -1,145 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import 
com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.longs.Long2IntMap; -import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap; -import it.unimi.dsi.fastutil.longs.LongArrays; - -/** - * After a segment is complete, we call {@link EarlybirdSegment#optimizeIndexes()} to compact the - * doc IDs assigned to the tweets in this segment, so that we can do faster ceil and floor lookups. - */ -public class OptimizedTweetIDMapper extends TweetIDMapper { - // Maps doc IDs to tweet IDs. Therefore, it should be sorted in descending order of tweet IDs. - protected final long[] inverseMap; - private final Long2IntMap tweetIdToDocIdMap; - - private OptimizedTweetIDMapper(long[] inverseMap, - long minTweetID, - long maxTweetID, - int minDocID, - int maxDocID) { - super(minTweetID, maxTweetID, minDocID, maxDocID, inverseMap.length); - this.inverseMap = inverseMap; - this.tweetIdToDocIdMap = buildTweetIdToDocIdMap(); - } - - public OptimizedTweetIDMapper(OutOfOrderRealtimeTweetIDMapper source) throws IOException { - super(source.getMinTweetID(), - source.getMaxTweetID(), - 0, - source.getNumDocs() - 1, - source.getNumDocs()); - inverseMap = source.sortTweetIds(); - tweetIdToDocIdMap = buildTweetIdToDocIdMap(); - } - - private Long2IntMap buildTweetIdToDocIdMap() { - int[] values = new int[inverseMap.length]; - for (int i = 0; i < values.length; i++) { - values[i] = i; - } - - Long2IntMap map = new Long2IntOpenHashMap(inverseMap, values); - map.defaultReturnValue(-1); - return map; - } - - @Override - public int getDocID(long tweetID) { - return tweetIdToDocIdMap.getOrDefault(tweetID, ID_NOT_FOUND); - } - - @Override - protected int getNextDocIDInternal(int docID) { - // The doc IDs are consecutive and TweetIDMapper already checked the boundary conditions. - return docID + 1; - } - - @Override - protected int getPreviousDocIDInternal(int docID) { - // The doc IDs are consecutive and TweetIDMapper already checked the boundary conditions. - return docID - 1; - } - - @Override - public long getTweetID(int internalID) { - return inverseMap[internalID]; - } - - @Override - protected int findDocIDBoundInternal(long tweetID, boolean findMaxDocID) { - int docId = tweetIdToDocIdMap.get(tweetID); - if (docId >= 0) { - return docId; - } - - int binarySearchResult = - LongArrays.binarySearch(inverseMap, tweetID, (k1, k2) -> -Long.compare(k1, k2)); - // Since the tweet ID is not present in this mapper, the binary search should return a negative - // value (-insertionPoint - 1). And since TweetIDMapper.findDocIdBound() already verified that - // tweetID is not smaller than all tweet IDs in this mapper, and not larger than all tweet IDs - // in this mapper, the insertionPoint should never be 0 or inverseMap.length. - int insertionPoint = -binarySearchResult - 1; - // The insertion point is the index in the tweet array of the upper bound of the search, so if - // we want the lower bound, because doc IDs are dense, we subtract one. - return findMaxDocID ? 
insertionPoint : insertionPoint - 1; - } - - @Override - protected final int addMappingInternal(final long tweetID) { - throw new UnsupportedOperationException("The OptimizedTweetIDMapper is immutable."); - } - - @Override - public DocIDToTweetIDMapper optimize() { - throw new UnsupportedOperationException("OptimizedTweetIDMapper is already optimized."); - } - - @Override - public FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String MIN_TWEET_ID_PROP_NAME = "MinTweetID"; - private static final String MAX_TWEET_ID_PROP_NAME = "MaxTweetID"; - private static final String MIN_DOC_ID_PROP_NAME = "MinDocID"; - private static final String MAX_DOC_ID_PROP_NAME = "MaxDocID"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OptimizedTweetIDMapper objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) throws IOException { - OptimizedTweetIDMapper objectToFlush = getObjectToFlush(); - flushInfo.addLongProperty(MIN_TWEET_ID_PROP_NAME, objectToFlush.getMinTweetID()); - flushInfo.addLongProperty(MAX_TWEET_ID_PROP_NAME, objectToFlush.getMaxTweetID()); - flushInfo.addIntProperty(MIN_DOC_ID_PROP_NAME, objectToFlush.getMinDocID()); - flushInfo.addIntProperty(MAX_DOC_ID_PROP_NAME, objectToFlush.getMaxDocID()); - out.writeLongArray(objectToFlush.inverseMap); - } - - @Override - protected OptimizedTweetIDMapper doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - return new OptimizedTweetIDMapper(in.readLongArray(), - flushInfo.getLongProperty(MIN_TWEET_ID_PROP_NAME), - flushInfo.getLongProperty(MAX_TWEET_ID_PROP_NAME), - flushInfo.getIntProperty(MIN_DOC_ID_PROP_NAME), - flushInfo.getIntProperty(MAX_DOC_ID_PROP_NAME)); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/OutOfOrderRealtimeTweetIDMapper.java b/src/java/com/twitter/search/earlybird/index/OutOfOrderRealtimeTweetIDMapper.java deleted file mode 100644 index f03e45f50..000000000 --- a/src/java/com/twitter/search/earlybird/index/OutOfOrderRealtimeTweetIDMapper.java +++ /dev/null @@ -1,531 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; -import java.util.Arrays; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -import it.unimi.dsi.fastutil.ints.Int2ByteOpenHashMap; -import it.unimi.dsi.fastutil.ints.Int2LongMap; -import it.unimi.dsi.fastutil.ints.Int2LongOpenHashMap; - -/** - * A mapper that maps tweet IDs to doc IDs based on the tweet timestamps. This mapper guarantees - * that if creationTime(A) > creationTime(B), then docId(A) < docId(B), no matter in which order - * the tweets are added to this mapper. However, if creationTime(A) == creationTime(B), then there - * is no guarantee on the order between docId(A) and docId(B). 
- * - * Essentially, this mapper guarantees that tweets with a later creation time are mapped to smaller - * doc IDs, but it does not provide any ordering for tweets with the same timestamp (down to - * millisecond granularity, which is what Snowflake provides). Our claim is that ordering tweets - * with the same timestamp is not needed, because for the purposes of realtime search, the only - * significant part of the tweet ID is the timestamp. So any such ordering would just be an ordering - * for the Snowflake shards and/or sequence numbers, rather than a time based ordering for tweets. - * - * The mapper uses the following scheme to assign docIDs to tweets: - * +----------+-----------------------------+------------------------------+ - * | Bit 0 | Bits 1 - 27 | Bits 28 - 31 | - * + ---------+-----------------------------+------------------------------+ - * | sign | tweet ID timestamp - | Allow 16 tweets to be posted | - * | always 0 | segment boundary timestamp | on the same millisecond | - * + ---------+-----------------------------+------------------------------+ - * - * Important assumptions: - * * Snowflake IDs have millisecond granularity. Therefore, 27 bits is enough to represent a time - * period of 2^27 / (3600 * 100) = ~37 hours, which is more than enough to cover one realtime - * segment (our realtime segments currently span ~13 hours). - * * At peak times, the tweet posting rate is less than 10,000 tps. Given our current partitioning - * scheme (22 partitions), each realtime earlybird should expect to get less than 500 tweets per - * second, which comes down to less than 1 tweet per millisecond, assuming the partitioning hash - * function distributes the tweets fairly randomly independent of their timestamps. Therefore, - * providing space for 16 tweets (4 bits) in every millisecond should be more than enough to - * accommodate the current requirements, and any potential future changes (higher tweet rate, - * fewer partitions, etc.). - * - * How the mapper works: - * * The tweetId -> docId conversion is implicit (using the tweet's timestamp). - * * We use a IntToByteMap to store the number of tweets for each timestamp, so that we can - * allocate different doc IDs to tweets posted on the same millisecond. The size of this map is: - * segmentSize * 2 (load factor) * 1 (size of byte) = 16MB - * * The docId -> tweetId mappings are stored in an IntToLongMap. The size of this map is: - * segmentSize * 2 (load factor) * 8 (size of long) = 128MB - * * The mapper takes the "segment boundary" (the timestamp of the timeslice ID) as a parameter. - * This segment boundary determines the earliest tweet that this mapper can correctly index - * (it is subtracted from the timestamp of all tweets added to the mapper). Therefore, in order - * to correctly handle late tweets, we move back this segment boundary by twelve hour. - * * Tweets created before (segment boundary - 12 hours) are stored as if their timestamp was the - * segment boundary. - * * The largest timestamp that the mapper can store is: - * LARGEST_RELATIVE_TIMESTAMP = (1 << TIMESTAMP_BITS) - LUCENE_TIMESTAMP_BUFFER. - * Tweets created after (segmentBoundaryTimestamp + LARGEST_RELATIVE_TIMESTAMP) are stored as if - * their timestamp was (segmentBoundaryTimestamp + LARGEST_RELATIVE_TIMESTAMP). 
- * * When a tweet is added, we compute its doc ID as: - * int relativeTimestamp = tweetTimestamp - segmentBoundaryTimestamp; - * int docIdTimestamp = LARGEST_RELATIVE_TIMESTAMP - relativeTimestamp; - * int numTweetsForTimestamp = tweetsPerTimestamp.get(docIdTimestamp); - * int docId = (docIdTimestamp << DOC_ID_BITS) - * + MAX_DOCS_PER_TIMESTAMP - numTweetsForTimestamp - 1 - * - * This doc ID distribution scheme guarantees that tweets created later will be assigned smaller doc - * IDs (as long as we don't have more than 16 tweets created in the same millisecond). However, - * there is no ordering guarantee for tweets created at the same timestamp -- they are assigned doc - * IDs in the order in which they're added to the mapper. - * - * If we have more than 16 tweets created at time T, the mapper will still gracefully handle that - * case: the "extra" tweets will be assigned doc IDs from the pool of doc IDs for timestamp (T + 1). - * However, the ordering guarantee might no longer hold for those "extra" tweets. Also, the "extra" - * tweets might be missed by certain since_id/max_id queries (the findDocIdBound() method might not - * be able to correctly work for these tweet IDs). - */ -public class OutOfOrderRealtimeTweetIDMapper extends TweetIDMapper { - private static final Logger LOG = LoggerFactory.getLogger(OutOfOrderRealtimeTweetIDMapper.class); - - // The number of bits used to represent the tweet timestamp. - private static final int TIMESTAMP_BITS = 27; - - // The number of bits used to represent the number of tweets with a certain timestamp. - @VisibleForTesting - static final int DOC_ID_BITS = Integer.SIZE - TIMESTAMP_BITS - 1; - - // The maximum number of tweets/docs that we can store per timestamp. - @VisibleForTesting - static final int MAX_DOCS_PER_TIMESTAMP = 1 << DOC_ID_BITS; - - // Lucene has some logic that doesn't deal well with doc IDs close to Integer.MAX_VALUE. - // For example, BooleanScorer has a SIZE constant set to 2048, which gets added to the doc IDs - // inside the score() method. So when the doc IDs are close to Integer.MAX_VALUE, this causes an - // overflow, which can send Lucene into an infinite loop. Therefore, we need to make sure that - // we do not assign doc IDs close to Integer.MAX_VALUE. - private static final int LUCENE_TIMESTAMP_BUFFER = 1 << 16; - - @VisibleForTesting - public static final int LATE_TWEETS_TIME_BUFFER_MILLIS = 12 * 3600 * 1000; // 12 hours - - // The largest relative timestamp that this mapper can store. - @VisibleForTesting - static final int LARGEST_RELATIVE_TIMESTAMP = (1 << TIMESTAMP_BITS) - LUCENE_TIMESTAMP_BUFFER; - - private final long segmentBoundaryTimestamp; - private final int segmentSize; - - private final Int2LongOpenHashMap tweetIds; - private final Int2ByteOpenHashMap tweetsPerTimestamp; - - private static final SearchRateCounter BAD_BUCKET_RATE = - SearchRateCounter.export("tweets_assigned_to_bad_timestamp_bucket"); - private static final SearchRateCounter TWEETS_NOT_ASSIGNED_RATE = - SearchRateCounter.export("tweets_not_assigned"); - private static final SearchRateCounter OLD_TWEETS_DROPPED = - SearchRateCounter.export("old_tweets_dropped"); - - public OutOfOrderRealtimeTweetIDMapper(int segmentSize, long timesliceID) { - long firstTimestamp = SnowflakeIdParser.getTimestampFromTweetId(timesliceID); - // Leave a buffer so that we can handle tweets that are up to twelve hours late. 
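The bucket and packing arithmetic described in the class comment above can be made concrete with a short sketch. The constants are inlined from that description (27 timestamp bits, 4 doc-ID bits, a 2^16 buffer kept clear below Integer.MAX_VALUE); the production code additionally rejects tweets older than the segment boundary and exports drop counters, which the sketch omits. The constructor resumes right after it.

// Illustrative sketch of the doc ID scheme: later timestamps map to smaller bucket indexes, and
// within a bucket the first arrival gets the largest of the 16 doc IDs reserved for it.
final class DocIdPackingSketch {
  static final int TIMESTAMP_BITS = 27;
  static final int DOC_ID_BITS = 4;
  static final int MAX_DOCS_PER_TIMESTAMP = 1 << DOC_ID_BITS;                       // 16
  static final int LUCENE_TIMESTAMP_BUFFER = 1 << 16;
  static final int LARGEST_RELATIVE_TIMESTAMP =
      (1 << TIMESTAMP_BITS) - LUCENE_TIMESTAMP_BUFFER;

  // Millisecond bucket for a tweet: newer tweets get smaller bucket indexes.
  // Assumes tweetTimestampMs >= segmentBoundaryTimestampMs; the production code drops older tweets.
  static int docIdTimestamp(long tweetTimestampMs, long segmentBoundaryTimestampMs) {
    long relative = Math.min(tweetTimestampMs - segmentBoundaryTimestampMs,
                             LARGEST_RELATIVE_TIMESTAMP);
    return LARGEST_RELATIVE_TIMESTAMP - (int) relative;
  }

  // Doc ID for the docIndexInBucket-th tweet (1-based) seen in a bucket.
  static int pack(int docIdTimestamp, int docIndexInBucket) {
    return (docIdTimestamp << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - docIndexInBucket;
  }

  public static void main(String[] args) {
    // Bucket 5 owns doc IDs 95 (first arrival) down to 80 (sixteenth arrival) ...
    System.out.println(pack(5, 1));    // 95
    System.out.println(pack(5, 16));   // 80
    // ... and a tweet one millisecond newer lands in bucket 4, i.e. a strictly smaller doc ID.
    System.out.println(pack(4, 1));    // 79
  }
}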
- this.segmentBoundaryTimestamp = firstTimestamp - LATE_TWEETS_TIME_BUFFER_MILLIS; - this.segmentSize = segmentSize; - - tweetIds = new Int2LongOpenHashMap(segmentSize); - tweetIds.defaultReturnValue(ID_NOT_FOUND); - - tweetsPerTimestamp = new Int2ByteOpenHashMap(segmentSize); - tweetsPerTimestamp.defaultReturnValue((byte) ID_NOT_FOUND); - } - - @VisibleForTesting - int getDocIdTimestamp(long tweetId) { - long tweetTimestamp = SnowflakeIdParser.getTimestampFromTweetId(tweetId); - if (tweetTimestamp < segmentBoundaryTimestamp) { - return ID_NOT_FOUND; - } - - long relativeTimestamp = tweetTimestamp - segmentBoundaryTimestamp; - if (relativeTimestamp > LARGEST_RELATIVE_TIMESTAMP) { - relativeTimestamp = LARGEST_RELATIVE_TIMESTAMP; - } - - return LARGEST_RELATIVE_TIMESTAMP - (int) relativeTimestamp; - } - - private int getDocIdForTimestamp(int docIdTimestamp, byte docIndexInTimestamp) { - return (docIdTimestamp << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - docIndexInTimestamp; - } - - @VisibleForTesting - long[] getTweetsForDocIdTimestamp(int docIdTimestamp) { - byte numDocsForTimestamp = tweetsPerTimestamp.get(docIdTimestamp); - if (numDocsForTimestamp == ID_NOT_FOUND) { - // This should never happen in prod, but better to be safe. - return new long[0]; - } - - long[] tweetIdsInBucket = new long[numDocsForTimestamp]; - int startingDocId = (docIdTimestamp << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - 1; - for (int i = 0; i < numDocsForTimestamp; ++i) { - tweetIdsInBucket[i] = tweetIds.get(startingDocId - i); - } - return tweetIdsInBucket; - } - - private int newDocId(long tweetId) { - int expectedDocIdTimestamp = getDocIdTimestamp(tweetId); - if (expectedDocIdTimestamp == ID_NOT_FOUND) { - LOG.info("Dropping tweet {} because it is from before the segment boundary timestamp {}", - tweetId, - segmentBoundaryTimestamp); - OLD_TWEETS_DROPPED.increment(); - return ID_NOT_FOUND; - } - - int docIdTimestamp = expectedDocIdTimestamp; - byte numDocsForTimestamp = tweetsPerTimestamp.get(docIdTimestamp); - - if (numDocsForTimestamp == MAX_DOCS_PER_TIMESTAMP) { - BAD_BUCKET_RATE.increment(); - } - - while ((docIdTimestamp > 0) && (numDocsForTimestamp == MAX_DOCS_PER_TIMESTAMP)) { - --docIdTimestamp; - numDocsForTimestamp = tweetsPerTimestamp.get(docIdTimestamp); - } - - if (numDocsForTimestamp == MAX_DOCS_PER_TIMESTAMP) { - // The relative timestamp 0 already has MAX_DOCS_PER_TIMESTAMP. Can't add more docs. - LOG.error("Tweet {} could not be assigned a doc ID in any bucket, because the bucket for " - + "timestamp 0 is already full: {}", - tweetId, Arrays.toString(getTweetsForDocIdTimestamp(0))); - TWEETS_NOT_ASSIGNED_RATE.increment(); - return ID_NOT_FOUND; - } - - if (docIdTimestamp != expectedDocIdTimestamp) { - LOG.warn("Tweet {} could not be assigned a doc ID in the bucket for its timestamp {}, " - + "because this bucket is full. Instead, it was assigned a doc ID in the bucket for " - + "timestamp {}. 
The tweets in the correct bucket are: {}", - tweetId, - expectedDocIdTimestamp, - docIdTimestamp, - Arrays.toString(getTweetsForDocIdTimestamp(expectedDocIdTimestamp))); - } - - if (numDocsForTimestamp == ID_NOT_FOUND) { - numDocsForTimestamp = 0; - } - ++numDocsForTimestamp; - tweetsPerTimestamp.put(docIdTimestamp, numDocsForTimestamp); - - return getDocIdForTimestamp(docIdTimestamp, numDocsForTimestamp); - } - - @Override - public int getDocID(long tweetId) { - int docIdTimestamp = getDocIdTimestamp(tweetId); - while (docIdTimestamp >= 0) { - int numDocsForTimestamp = tweetsPerTimestamp.get(docIdTimestamp); - int startingDocId = (docIdTimestamp << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - 1; - for (int docId = startingDocId; docId > startingDocId - numDocsForTimestamp; --docId) { - if (tweetIds.get(docId) == tweetId) { - return docId; - } - } - - // If we have MAX_DOCS_PER_TIMESTAMP docs with this timestamp, then we might've mis-assigned - // a tweet to the previous docIdTimestamp bucket. In that case, we need to keep searching. - // Otherwise, the tweet is not in the index. - if (numDocsForTimestamp < MAX_DOCS_PER_TIMESTAMP) { - break; - } - - --docIdTimestamp; - } - - return ID_NOT_FOUND; - } - - @Override - protected int getNextDocIDInternal(int docId) { - // Check if docId + 1 is an assigned doc ID in this mapper. This might be the case when we have - // multiple tweets posted on the same millisecond. - if (tweetIds.get(docId + 1) != ID_NOT_FOUND) { - return docId + 1; - } - - // If (docId + 1) is not assigned, then it means we do not have any more tweets posted at the - // timestamp corresponding to docId. We need to find the next relative timestamp for which this - // mapper has tweets, and return the first tweet for that timestamp. Note that iterating over - // the space of all possible timestamps is faster than iterating over the space of all possible - // doc IDs (it's MAX_DOCS_PER_TIMESTAMP times faster). - int nextDocIdTimestamp = (docId >> DOC_ID_BITS) + 1; - byte numDocsForTimestamp = tweetsPerTimestamp.get(nextDocIdTimestamp); - int maxDocIdTimestamp = getMaxDocID() >> DOC_ID_BITS; - while ((nextDocIdTimestamp <= maxDocIdTimestamp) - && (numDocsForTimestamp == ID_NOT_FOUND)) { - ++nextDocIdTimestamp; - numDocsForTimestamp = tweetsPerTimestamp.get(nextDocIdTimestamp); - } - - if (numDocsForTimestamp != ID_NOT_FOUND) { - return getDocIdForTimestamp(nextDocIdTimestamp, numDocsForTimestamp); - } - - return ID_NOT_FOUND; - } - - @Override - protected int getPreviousDocIDInternal(int docId) { - // Check if docId - 1 is an assigned doc ID in this mapper. This might be the case when we have - // multiple tweets posted on the same millisecond. - if (tweetIds.get(docId - 1) != ID_NOT_FOUND) { - return docId - 1; - } - - // If (docId - 1) is not assigned, then it means we do not have any more tweets posted at the - // timestamp corresponding to docId. We need to find the previous relative timestamp for which - // this mapper has tweets, and return the first tweet for that timestamp. Note that iterating - // over the space of all possible timestamps is faster than iterating over the space of all - // possible doc IDs (it's MAX_DOCS_PER_TIMESTAMP times faster). 
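A self-contained sketch of that bucket walk, shown for the "next" direction (the "previous" direction, whose body follows this sketch, is symmetric). A plain HashMap stands in for the Int2ByteOpenHashMap and every name is illustrative; the sketch only assumes the bucket layout described above, where a bucket's doc IDs are assigned contiguously downward from its largest slot.

// Illustrative only: advancing first tries docId + 1 and otherwise walks whole millisecond
// buckets, which needs up to 16x fewer probes than walking individual doc IDs.
import java.util.HashMap;
import java.util.Map;

final class BucketWalkSketch {
  static final int DOC_ID_BITS = 4;
  static final int MAX_DOCS_PER_TIMESTAMP = 1 << DOC_ID_BITS;

  // bucket -> number of doc IDs assigned in that bucket
  private final Map<Integer, Integer> docsPerBucket = new HashMap<>();

  void addDocs(int bucket, int count) {
    docsPerBucket.put(bucket, count);
  }

  private boolean isAssigned(int docId) {
    int bucket = docId >> DOC_ID_BITS;
    int count = docsPerBucket.getOrDefault(bucket, 0);
    int largest = (bucket << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - 1;
    return docId <= largest && docId > largest - count;
  }

  // Smallest assigned doc ID strictly greater than docId, or -1 if none up to maxBucket.
  int nextDocId(int docId, int maxBucket) {
    if (isAssigned(docId + 1)) {
      return docId + 1;
    }
    for (int bucket = (docId >> DOC_ID_BITS) + 1; bucket <= maxBucket; bucket++) {
      int count = docsPerBucket.getOrDefault(bucket, 0);
      if (count > 0) {
        // The smallest assigned doc ID in this bucket.
        return (bucket << DOC_ID_BITS) + MAX_DOCS_PER_TIMESTAMP - count;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    BucketWalkSketch sketch = new BucketWalkSketch();
    sketch.addDocs(4, 2);   // doc IDs 79 and 78
    sketch.addDocs(7, 1);   // doc ID 127
    System.out.println(sketch.nextDocId(78, 10));   // 79: next doc in the same bucket
    System.out.println(sketch.nextDocId(79, 10));   // 127: skips the empty buckets 5 and 6
  }
}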
- int previousDocIdTimestamp = (docId >> DOC_ID_BITS) - 1; - byte numDocsForTimestamp = tweetsPerTimestamp.get(previousDocIdTimestamp); - int minDocIdTimestamp = getMinDocID() >> DOC_ID_BITS; - while ((previousDocIdTimestamp >= minDocIdTimestamp) - && (numDocsForTimestamp == ID_NOT_FOUND)) { - --previousDocIdTimestamp; - numDocsForTimestamp = tweetsPerTimestamp.get(previousDocIdTimestamp); - } - - if (numDocsForTimestamp != ID_NOT_FOUND) { - return getDocIdForTimestamp(previousDocIdTimestamp, (byte) 1); - } - - return ID_NOT_FOUND; - } - - @Override - public long getTweetID(int docId) { - return tweetIds.get(docId); - } - - @Override - protected int addMappingInternal(long tweetId) { - int docId = newDocId(tweetId); - if (docId == ID_NOT_FOUND) { - return ID_NOT_FOUND; - } - - tweetIds.put(docId, tweetId); - return docId; - } - - @Override - protected int findDocIDBoundInternal(long tweetId, boolean findMaxDocId) { - // Note that it would be incorrect to lookup the doc ID for the given tweet ID and return that - // doc ID, as we would skip over tweets created in the same millisecond but with a lower doc ID. - int docIdTimestamp = getDocIdTimestamp(tweetId); - - // The docIdTimestamp is ID_NOT_FOUND only if the tweet is from before the segment boundary and - // this should never happen here because TweetIDMapper.findDocIdBound ensures that the tweet id - // passed into this method is >= minTweetID which means the tweet is from after the segment - // boundary. - Preconditions.checkState( - docIdTimestamp != ID_NOT_FOUND, - "Tried to find doc id bound for tweet %d which is from before the segment boundary %d", - tweetId, - segmentBoundaryTimestamp); - - // It's OK to return a doc ID that doesn't correspond to any tweet ID in the index, - // as the doc ID is simply used as a starting point and ending point for range queries, - // not a source of truth. - if (findMaxDocId) { - // Return the largest possible doc ID for the timestamp. - return getDocIdForTimestamp(docIdTimestamp, (byte) 1); - } else { - // Return the smallest possible doc ID for the timestamp. - byte tweetsInTimestamp = tweetsPerTimestamp.getOrDefault(docIdTimestamp, (byte) 0); - return getDocIdForTimestamp(docIdTimestamp, tweetsInTimestamp); - } - } - - /** - * Returns the array of all tweet IDs stored in this mapper in a sorted (descending) order. - * Essentially, this method remaps all tweet IDs stored in this mapper to a compressed doc ID - * space of [0, numDocs). - * - * Note that this method is not thread safe, and it's meant to be called only at segment - * optimization time. If addMappingInternal() is called during the execution of this method, - * the behavior is undefined (it will most likely return bad results or throw an exception). - * - * @return An array of all tweet IDs stored in this mapper, in a sorted (descending) order. - */ - public long[] sortTweetIds() { - int numDocs = getNumDocs(); - if (numDocs == 0) { - return new long[0]; - } - - // Add all tweets stored in this mapper to sortTweetIds. - long[] sortedTweetIds = new long[numDocs]; - int sortedTweetIdsIndex = 0; - for (int docId = getMinDocID(); docId != ID_NOT_FOUND; docId = getNextDocID(docId)) { - sortedTweetIds[sortedTweetIdsIndex++] = getTweetID(docId); - } - Preconditions.checkState(sortedTweetIdsIndex == numDocs, - "Could not traverse all documents in the mapper. Expected to find " - + numDocs + " docs, but found only " + sortedTweetIdsIndex); - - // Sort sortedTweetIdsIndex in descending order. 
There's no way to sort a primitive array in - // descending order, so we have to sort it in ascending order and then reverse it. - Arrays.sort(sortedTweetIds); - for (int i = 0; i < numDocs / 2; ++i) { - long tmp = sortedTweetIds[i]; - sortedTweetIds[i] = sortedTweetIds[numDocs - 1 - i]; - sortedTweetIds[numDocs - 1 - i] = tmp; - } - - return sortedTweetIds; - } - - @Override - public DocIDToTweetIDMapper optimize() throws IOException { - return new OptimizedTweetIDMapper(this); - } - - /** - * Returns the largest Tweet ID that this doc ID mapper could handle. The returned Tweet ID - * would be safe to put into the mapper, but any larger ones would not be correctly handled. - */ - public static long calculateMaxTweetID(long timesliceID) { - long numberOfUsableTimestamps = LARGEST_RELATIVE_TIMESTAMP - LATE_TWEETS_TIME_BUFFER_MILLIS; - long firstTimestamp = SnowflakeIdParser.getTimestampFromTweetId(timesliceID); - long lastTimestamp = firstTimestamp + numberOfUsableTimestamps; - return SnowflakeIdParser.generateValidStatusId( - lastTimestamp, SnowflakeIdParser.RESERVED_BITS_MASK); - } - - /** - * Evaluates whether two instances of OutOfOrderRealtimeTweetIDMapper are equal by value. It is - * slow because it has to check every tweet ID/doc ID in the map. - */ - @VisibleForTesting - boolean verySlowEqualsForTests(OutOfOrderRealtimeTweetIDMapper that) { - return getMinTweetID() == that.getMinTweetID() - && getMaxTweetID() == that.getMaxTweetID() - && getMinDocID() == that.getMinDocID() - && getMaxDocID() == that.getMaxDocID() - && segmentBoundaryTimestamp == that.segmentBoundaryTimestamp - && segmentSize == that.segmentSize - && tweetsPerTimestamp.equals(that.tweetsPerTimestamp) - && tweetIds.equals(that.tweetIds); - } - - @Override - public OutOfOrderRealtimeTweetIDMapper.FlushHandler getFlushHandler() { - return new OutOfOrderRealtimeTweetIDMapper.FlushHandler(this); - } - - private OutOfOrderRealtimeTweetIDMapper( - long minTweetID, - long maxTweetID, - int minDocID, - int maxDocID, - long segmentBoundaryTimestamp, - int segmentSize, - int[] docIDs, - long[] tweetIDList - ) { - super(minTweetID, maxTweetID, minDocID, maxDocID, docIDs.length); - - Preconditions.checkState(docIDs.length == tweetIDList.length); - - this.segmentBoundaryTimestamp = segmentBoundaryTimestamp; - this.segmentSize = segmentSize; - - tweetIds = new Int2LongOpenHashMap(segmentSize); - tweetIds.defaultReturnValue(ID_NOT_FOUND); - - tweetsPerTimestamp = new Int2ByteOpenHashMap(segmentSize); - tweetsPerTimestamp.defaultReturnValue((byte) ID_NOT_FOUND); - - for (int i = 0; i < docIDs.length; i++) { - int docID = docIDs[i]; - long tweetID = tweetIDList[i]; - tweetIds.put(docID, tweetID); - - int timestampBucket = docID >> DOC_ID_BITS; - if (tweetsPerTimestamp.containsKey(timestampBucket)) { - tweetsPerTimestamp.addTo(timestampBucket, (byte) 1); - } else { - tweetsPerTimestamp.put(timestampBucket, (byte) 1); - } - } - } - - public static class FlushHandler extends Flushable.Handler { - private static final String MIN_TWEET_ID_PROP_NAME = "MinTweetID"; - private static final String MAX_TWEET_ID_PROP_NAME = "MaxTweetID"; - private static final String MIN_DOC_ID_PROP_NAME = "MinDocID"; - private static final String MAX_DOC_ID_PROP_NAME = "MaxDocID"; - private static final String SEGMENT_BOUNDARY_TIMESTAMP_PROP_NAME = "SegmentBoundaryTimestamp"; - private static final String SEGMENT_SIZE_PROP_NAME = "SegmentSize"; - - public FlushHandler() { - super(); - } - - public FlushHandler(OutOfOrderRealtimeTweetIDMapper objectToFlush) { - 
super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer serializer) throws IOException { - OutOfOrderRealtimeTweetIDMapper mapper = getObjectToFlush(); - - flushInfo.addLongProperty(MIN_TWEET_ID_PROP_NAME, mapper.getMinTweetID()); - flushInfo.addLongProperty(MAX_TWEET_ID_PROP_NAME, mapper.getMaxTweetID()); - flushInfo.addIntProperty(MIN_DOC_ID_PROP_NAME, mapper.getMinDocID()); - flushInfo.addIntProperty(MAX_DOC_ID_PROP_NAME, mapper.getMaxDocID()); - flushInfo.addLongProperty(SEGMENT_BOUNDARY_TIMESTAMP_PROP_NAME, - mapper.segmentBoundaryTimestamp); - flushInfo.addIntProperty(SEGMENT_SIZE_PROP_NAME, mapper.segmentSize); - - serializer.writeInt(mapper.tweetIds.size()); - for (Int2LongMap.Entry entry : mapper.tweetIds.int2LongEntrySet()) { - serializer.writeInt(entry.getIntKey()); - serializer.writeLong(entry.getLongValue()); - } - } - - @Override - protected OutOfOrderRealtimeTweetIDMapper doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - - int size = in.readInt(); - int[] docIds = new int[size]; - long[] tweetIds = new long[size]; - for (int i = 0; i < size; i++) { - docIds[i] = in.readInt(); - tweetIds[i] = in.readLong(); - } - - return new OutOfOrderRealtimeTweetIDMapper( - flushInfo.getLongProperty(MIN_TWEET_ID_PROP_NAME), - flushInfo.getLongProperty(MAX_TWEET_ID_PROP_NAME), - flushInfo.getIntProperty(MIN_DOC_ID_PROP_NAME), - flushInfo.getIntProperty(MAX_DOC_ID_PROP_NAME), - flushInfo.getLongProperty(SEGMENT_BOUNDARY_TIMESTAMP_PROP_NAME), - flushInfo.getIntProperty(SEGMENT_SIZE_PROP_NAME), - docIds, - tweetIds); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/RealtimeTimeMapper.java b/src/java/com/twitter/search/earlybird/index/RealtimeTimeMapper.java deleted file mode 100644 index d78d971e6..000000000 --- a/src/java/com/twitter/search/earlybird/index/RealtimeTimeMapper.java +++ /dev/null @@ -1,149 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.inverted.IntBlockPool; - -import it.unimi.dsi.fastutil.ints.Int2IntMap; -import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap; - -/** - * Maps 32-bit document IDs to seconds-since-epoch timestamps. - */ -public class RealtimeTimeMapper extends AbstractInMemoryTimeMapper { - // Doc id to timestamp map. Timestamps that are negative are out-of-order. 
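The flush/load round trip used by the FlushHandler above (and by RealtimeTimeMapper's handler below) boils down to writing a map as its size followed by key/value pairs, then reading it back and rebuilding the map. A sketch of that round trip follows, with java.io streams standing in for the Earlybird DataSerializer/DataDeserializer and all names invented for the example; the RealtimeTimeMapper fields resume right after it.

// Illustrative only: the real handlers also record min/max IDs and other properties in FlushInfo
// and hand the loaded data to a private constructor.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class FlushRoundTripSketch {
  static byte[] flush(Map<Integer, Long> docIdToTweetId) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(docIdToTweetId.size());
      for (Map.Entry<Integer, Long> entry : docIdToTweetId.entrySet()) {
        out.writeInt(entry.getKey());
        out.writeLong(entry.getValue());
      }
    }
    return bytes.toByteArray();
  }

  static Map<Integer, Long> load(byte[] data) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int size = in.readInt();
      Map<Integer, Long> map = new LinkedHashMap<>(size);
      for (int i = 0; i < size; i++) {
        int docId = in.readInt();
        long tweetId = in.readLong();
        map.put(docId, tweetId);
      }
      return map;
    }
  }
}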
- protected final Int2IntOpenHashMap timeMap; - private final int capacity; - - public RealtimeTimeMapper(int capacity) { - super(); - this.capacity = capacity; - - timeMap = new Int2IntOpenHashMap(capacity); - timeMap.defaultReturnValue(ILLEGAL_TIME); - } - - @Override - public int getTime(int docID) { - return timeMap.get(docID); - } - - @Override - protected void setTime(int docID, int timeSeconds) { - timeMap.put(docID, timeSeconds); - } - - public final void addMapping(int docID, int timeSeconds) { - doAddMapping(docID, timeSeconds); - } - - @Override - public TimeMapper optimize(DocIDToTweetIDMapper originalTweetIdMapper, - DocIDToTweetIDMapper optimizedTweetIdMapper) throws IOException { - return new OptimizedTimeMapper(this, originalTweetIdMapper, optimizedTweetIdMapper); - } - - /** - * Evaluates whether two instances of RealtimeTimeMapper are equal by value. It is - * slow because it has to check every tweet ID/timestamp in the map. - */ - @VisibleForTesting - boolean verySlowEqualsForTests(RealtimeTimeMapper that) { - return reverseMapLastIndex == that.reverseMapLastIndex - && reverseMapIds.verySlowEqualsForTests(that.reverseMapIds) - && reverseMapTimes.verySlowEqualsForTests(that.reverseMapTimes) - && capacity == that.capacity - && timeMap.equals(that.timeMap); - } - - private RealtimeTimeMapper( - int capacity, - int reverseMapLastIndex, - int[] docIds, - int[] timestamps, - IntBlockPool reverseMapTimes, - IntBlockPool reverseMapIds - ) { - super(reverseMapLastIndex, reverseMapTimes, reverseMapIds); - - this.capacity = capacity; - - timeMap = new Int2IntOpenHashMap(capacity); - timeMap.defaultReturnValue(ILLEGAL_TIME); - - Preconditions.checkState(docIds.length == timestamps.length); - - for (int i = 0; i < docIds.length; i++) { - timeMap.put(docIds[i], timestamps[i]); - } - } - - @Override - public RealtimeTimeMapper.FlushHandler getFlushHandler() { - return new RealtimeTimeMapper.FlushHandler(this); - } - - public static class FlushHandler extends Flushable.Handler { - private static final String REVERSE_MAP_LAST_INDEX_PROP = "reverseMapLastIndex"; - private static final String TIMES_SUB_PROP = "times"; - private static final String IDS_SUB_PROP = "ids"; - private static final String CAPACITY_PROP = "capacity"; - - public FlushHandler() { - super(); - } - - public FlushHandler(RealtimeTimeMapper objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer serializer) throws IOException { - RealtimeTimeMapper mapper = getObjectToFlush(); - - flushInfo.addIntProperty(CAPACITY_PROP, mapper.capacity); - flushInfo.addIntProperty(REVERSE_MAP_LAST_INDEX_PROP, mapper.reverseMapLastIndex); - - serializer.writeInt(mapper.timeMap.size()); - for (Int2IntMap.Entry entry : mapper.timeMap.int2IntEntrySet()) { - serializer.writeInt(entry.getIntKey()); - serializer.writeInt(entry.getIntValue()); - } - - mapper.reverseMapTimes.getFlushHandler().flush( - flushInfo.newSubProperties(TIMES_SUB_PROP), serializer); - mapper.reverseMapIds.getFlushHandler().flush( - flushInfo.newSubProperties(IDS_SUB_PROP), serializer); - } - - @Override - protected RealtimeTimeMapper doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - - int size = in.readInt(); - int[] docIds = new int[size]; - int[] timestamps = new int[size]; - for (int i = 0; i < size; i++) { - docIds[i] = in.readInt(); - timestamps[i] = in.readInt(); - } - - return new RealtimeTimeMapper( - flushInfo.getIntProperty(CAPACITY_PROP), - 
flushInfo.getIntProperty(REVERSE_MAP_LAST_INDEX_PROP), - docIds, - timestamps, - new IntBlockPool.FlushHandler().load(flushInfo.getSubProperties(TIMES_SUB_PROP), in), - new IntBlockPool.FlushHandler().load(flushInfo.getSubProperties(IDS_SUB_PROP), in)); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TimeMappingWriter.java b/src/java/com/twitter/search/earlybird/index/TimeMappingWriter.java deleted file mode 100644 index acc1cafe7..000000000 --- a/src/java/com/twitter/search/earlybird/index/TimeMappingWriter.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import org.apache.lucene.util.AttributeSource; - -import com.twitter.search.common.util.analysis.IntTermAttribute; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter; - -public class TimeMappingWriter implements EarlybirdRealtimeIndexSegmentWriter.InvertedDocConsumer { - private IntTermAttribute termAtt; - private final RealtimeTimeMapper mapper; - - public TimeMappingWriter(RealtimeTimeMapper mapper) { - this.mapper = mapper; - } - - @Override - public final void start(AttributeSource attributeSource, boolean currentDocIsOffensive) { - termAtt = attributeSource.addAttribute(IntTermAttribute.class); - } - - @Override - public final void add(int docId, int position) throws IOException { - final int timeSec = termAtt.getTerm(); - mapper.addMapping(docId, timeSec); - } - - @Override - public void finish() { - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetIDMapper.java b/src/java/com/twitter/search/earlybird/index/TweetIDMapper.java deleted file mode 100644 index e97d58cad..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetIDMapper.java +++ /dev/null @@ -1,183 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public abstract class TweetIDMapper implements DocIDToTweetIDMapper, Flushable { - private long minTweetID; - private long maxTweetID; - private int minDocID; - private int maxDocID; - private int numDocs; - - protected TweetIDMapper() { - this(Long.MAX_VALUE, Long.MIN_VALUE, Integer.MAX_VALUE, Integer.MIN_VALUE, 0); - } - - protected TweetIDMapper( - long minTweetID, long maxTweetID, int minDocID, int maxDocID, int numDocs) { - this.minTweetID = minTweetID; - this.maxTweetID = maxTweetID; - this.minDocID = minDocID; - this.maxDocID = maxDocID; - this.numDocs = numDocs; - } - - // Realtime updates minTweetID and maxTweetID in addMapping. - // Archives updates minTweetID and maxTweetID in prepareToRead. - protected void setMinTweetID(long minTweetID) { - this.minTweetID = minTweetID; - } - - protected void setMaxTweetID(long maxTweetID) { - this.maxTweetID = maxTweetID; - } - - protected void setMinDocID(int minDocID) { - this.minDocID = minDocID; - } - - protected void setMaxDocID(int maxDocID) { - this.maxDocID = maxDocID; - } - - protected void setNumDocs(int numDocs) { - this.numDocs = numDocs; - } - - public long getMinTweetID() { - return this.minTweetID; - } - - public long getMaxTweetID() { - return this.maxTweetID; - } - - public int getMinDocID() { - return minDocID; - } - - public int getMaxDocID() { - return maxDocID; - } - - @Override - public int getNumDocs() { - return numDocs; - } - - /** - * Given a tweetId, find the corresponding doc ID to start, or end, a search. 
- * - * In the ordered, dense doc ID mappers, this returns either the doc ID assigned to the tweet ID, - * or doc ID of the next lowest tweet ID, if the tweet is not in the index. In this case - * findMaxDocID is ignored. - * - * In {@link OutOfOrderRealtimeTweetIDMapper}, doc IDs are not ordered within a millisecond, so we - * want to search the entire millisecond bucket for a filter. To accomplish this, - * if findMaxDocId is true we return the largest possible doc ID for that millisecond. - * If findMaxDocId is false, we return the smallest possible doc ID for that millisecond. - * - * The returned doc ID will be between smallestDocID and largestDocID (inclusive). - * The returned doc ID may not be in the index. - */ - public int findDocIdBound(long tweetID, - boolean findMaxDocID, - int smallestDocID, - int largestDocID) throws IOException { - if (tweetID > maxTweetID) { - return smallestDocID; - } - if (tweetID < minTweetID) { - return largestDocID; - } - - int internalID = findDocIDBoundInternal(tweetID, findMaxDocID); - - return Math.max(smallestDocID, Math.min(largestDocID, internalID)); - } - - @Override - public final int getNextDocID(int docID) { - if (numDocs <= 0) { - return ID_NOT_FOUND; - } - if (docID < minDocID) { - return minDocID; - } - if (docID >= maxDocID) { - return ID_NOT_FOUND; - } - return getNextDocIDInternal(docID); - } - - @Override - public final int getPreviousDocID(int docID) { - if (numDocs <= 0) { - return ID_NOT_FOUND; - } - if (docID <= minDocID) { - return ID_NOT_FOUND; - } - if (docID > maxDocID) { - return maxDocID; - } - return getPreviousDocIDInternal(docID); - } - - @Override - public int addMapping(final long tweetID) { - int docId = addMappingInternal(tweetID); - if (docId != ID_NOT_FOUND) { - ++numDocs; - if (tweetID > maxTweetID) { - maxTweetID = tweetID; - } - if (tweetID < minTweetID) { - minTweetID = tweetID; - } - if (docId > maxDocID) { - maxDocID = docId; - } - if (docId < minDocID) { - minDocID = docId; - } - } - - return docId; - } - - /** - * Returns the smallest valid doc ID in this mapper that's strictly higher than the given doc ID. - * If no such doc ID exists, ID_NOT_FOUND must be returned. - * - * The given docID is guaranteed to be in the range [minDocID, maxDocID). - * - * @param docID The current doc ID. - * @return The smallest valid doc ID in this mapper that's strictly higher than the given doc ID, - * or a negative number, if no such doc ID exists. - */ - protected abstract int getNextDocIDInternal(int docID); - - /** - * Returns the smallest valid doc ID in this mapper that's strictly higher than the given doc ID. - * If no such doc ID exists, ID_NOT_FOUND must be returned. - * - * The given docID is guaranteed to be in the range (minDocID, maxDocID]. - * - * @param docID The current doc ID. - * @return The smallest valid doc ID in this mapper that's strictly higher than the given doc ID, - * or a negative number, if no such doc ID exists. - */ - protected abstract int getPreviousDocIDInternal(int docID); - - protected abstract int addMappingInternal(final long tweetID); - - /** - * See {@link TweetIDMapper#findDocIdBound}. 
- */ - protected abstract int findDocIDBoundInternal(long tweetID, - boolean findMaxDocID) throws IOException; -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetIDQuery.java b/src/java/com/twitter/search/earlybird/index/TweetIDQuery.java deleted file mode 100644 index 6f4c39a02..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetIDQuery.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; -import java.util.Arrays; -import java.util.Set; - -import com.google.common.collect.Sets; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.search.IntArrayDocIdSetIterator; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; - -public class TweetIDQuery extends Query { - private final Set tweetIDs = Sets.newHashSet(); - - public TweetIDQuery(long... tweetIDs) { - for (long tweetID : tweetIDs) { - this.tweetIDs.add(tweetID); - } - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - EarlybirdIndexSegmentData segmentData = - ((EarlybirdIndexSegmentAtomicReader) context.reader()).getSegmentData(); - DocIDToTweetIDMapper docIdToTweetIdMapper = segmentData.getDocIDToTweetIDMapper(); - - Set set = Sets.newHashSet(); - for (long tweetID : tweetIDs) { - int docID = docIdToTweetIdMapper.getDocID(tweetID); - if (docID != DocIDToTweetIDMapper.ID_NOT_FOUND) { - set.add(docID); - } - } - - if (set.isEmpty()) { - return DocIdSetIterator.empty(); - } - - int[] docIDs = new int[set.size()]; - int i = 0; - for (int docID : set) { - docIDs[i++] = docID; - } - Arrays.sort(docIDs); - return new IntArrayDocIdSetIterator(docIDs); - } - }; - } - - @Override - public int hashCode() { - return tweetIDs.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof TweetIDQuery)) { - return false; - } - - return tweetIDs.equals(TweetIDQuery.class.cast(obj).tweetIDs); - } - - @Override - public String toString(String field) { - return "TWEET_ID_QUERY: " + tweetIDs; - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetIDToInternalIDMap.java b/src/java/com/twitter/search/earlybird/index/TweetIDToInternalIDMap.java deleted file mode 100644 index 87204a623..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetIDToInternalIDMap.java +++ /dev/null @@ -1,154 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; -import java.util.Arrays; - -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.common.util.io.flushable.Flushable; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; - -public final class 
TweetIDToInternalIDMap implements Flushable { - private final int size; - private final int[] hash; - public final int halfSize; - private final int mask; - public int numMappings; - - static final int PRIME_NUMBER = 37; - - // For FlushHandler.load() use only - private TweetIDToInternalIDMap(final int[] hash, - final int numMappings) { - this.hash = hash; - this.size = hash.length; - this.halfSize = size >> 1; - this.mask = size - 1; - this.numMappings = numMappings; - } - - TweetIDToInternalIDMap(final int size) { - this.hash = new int[size]; - Arrays.fill(hash, DocIDToTweetIDMapper.ID_NOT_FOUND); - this.size = size; - this.halfSize = size >> 1; - this.mask = size - 1; - this.numMappings = 0; - } - - // Slightly different hash function from the one used to partition tweets to Earlybirds. - protected static int hashCode(final long tweetID) { - long timestamp = SnowflakeIdParser.getTimestampFromTweetId(tweetID); - int code = (int) ((timestamp - 1) ^ (timestamp >>> 32)); - code = PRIME_NUMBER * (int) (tweetID & SnowflakeIdParser.RESERVED_BITS_MASK) + code; - return code; - } - - protected static int incrementHashCode(int code) { - return ((code >> 8) + code) | 1; - } - - private int hashPos(int code) { - return code & mask; - } - - /** - * Associates the given tweet ID with the given internal doc ID. - * - * @param tweetID The tweet ID. - * @param internalID The doc ID that should be associated with this tweet ID. - * @param inverseMap The map that stores the doc ID to tweet ID associations. - */ - public void add(final long tweetID, final int internalID, final long[] inverseMap) { - int code = hashCode(tweetID); - int hashPos = hashPos(code); - int value = hash[hashPos]; - assert inverseMap[internalID] == tweetID; - - if (value != DocIDToTweetIDMapper.ID_NOT_FOUND) { - final int inc = incrementHashCode(code); - do { - code += inc; - hashPos = hashPos(code); - value = hash[hashPos]; - } while (value != DocIDToTweetIDMapper.ID_NOT_FOUND); - } - - assert value == DocIDToTweetIDMapper.ID_NOT_FOUND; - - hash[hashPos] = internalID; - numMappings++; - } - - /** - * Returns the doc ID corresponding to the given tweet ID. - * - * @param tweetID The tweet ID. - * @param inverseMap The map that stores the doc ID to tweet ID associations. - * @return The doc ID corresponding to the given tweet ID. 
- */ - public int get(long tweetID, final long[] inverseMap) { - int code = hashCode(tweetID); - int hashPos = hashPos(code); - int value = hash[hashPos]; - - if (value != DocIDToTweetIDMapper.ID_NOT_FOUND && inverseMap[value] != tweetID) { - final int inc = incrementHashCode(code); - - do { - code += inc; - hashPos = hashPos(code); - value = hash[hashPos]; - } while (value != DocIDToTweetIDMapper.ID_NOT_FOUND && inverseMap[value] != tweetID); - } - - if (hashPos == -1) { - return DocIDToTweetIDMapper.ID_NOT_FOUND; - } - return hash[hashPos]; - } - - @Override - public TweetIDToInternalIDMap.FlushHandler getFlushHandler() { - return new FlushHandler(this); - } - - public static final class FlushHandler extends Flushable.Handler { - public FlushHandler() { - super(); - } - - private static final String HASH_ARRAY_SIZE_PROP_NAME = "HashArraySize"; - private static final String MASK_PROP_NAME = "Mask"; - private static final String NUM_MAPPINGS_PROP_NAME = "NumMappings"; - - public FlushHandler(TweetIDToInternalIDMap objectToFlush) { - super(objectToFlush); - } - - @Override - protected void doFlush(FlushInfo flushInfo, DataSerializer out) - throws IOException { - TweetIDToInternalIDMap mapper = getObjectToFlush(); - - flushInfo - .addIntProperty(HASH_ARRAY_SIZE_PROP_NAME, mapper.hash.length) - .addIntProperty(MASK_PROP_NAME, mapper.mask) - .addIntProperty(NUM_MAPPINGS_PROP_NAME, mapper.numMappings); - - out.writeIntArray(mapper.hash); - } - - @Override - protected TweetIDToInternalIDMap doLoad(FlushInfo flushInfo, DataDeserializer in) - throws IOException { - final int[] hash = in.readIntArray(); - - final int numMappings = flushInfo.getIntProperty(NUM_MAPPINGS_PROP_NAME); - - return new TweetIDToInternalIDMap(hash, numMappings); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetSearchIndexExtensionsFactory.java b/src/java/com/twitter/search/earlybird/index/TweetSearchIndexExtensionsFactory.java deleted file mode 100644 index 3e782aa1f..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetSearchIndexExtensionsFactory.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird.index; - -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsData; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsFactory; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdRealtimeIndexExtensionsData; - -public class TweetSearchIndexExtensionsFactory extends EarlybirdIndexExtensionsFactory { - @Override - public EarlybirdRealtimeIndexExtensionsData newRealtimeIndexExtensionsData() { - return new TweetSearchRealtimeIndexExtensionsData(); - } - - @Override - public EarlybirdIndexExtensionsData newLuceneIndexExtensionsData() { - return new TweetSearchLuceneIndexExtensionsData(); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetSearchLuceneIndexExtensionsData.java b/src/java/com/twitter/search/earlybird/index/TweetSearchLuceneIndexExtensionsData.java deleted file mode 100644 index 84ba879e1..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetSearchLuceneIndexExtensionsData.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird.index; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.schema.base.EarlybirdFieldType; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import 
com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.column.ColumnStrideFieldIndex; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdIndexExtensionsData; - -public class TweetSearchLuceneIndexExtensionsData implements EarlybirdIndexExtensionsData { - @Override - public void setupExtensions(EarlybirdIndexSegmentAtomicReader atomicReader) throws IOException { - // If we use stock lucene to back the mappers and column stride fields, - // we need to initialize them - EarlybirdIndexSegmentData segmentData = atomicReader.getSegmentData(); - DocValuesBasedTweetIDMapper tweetIDMapper = - (DocValuesBasedTweetIDMapper) segmentData.getDocIDToTweetIDMapper(); - tweetIDMapper.initializeWithLuceneReader( - atomicReader, - getColumnStrideFieldIndex(segmentData, EarlybirdFieldConstant.ID_CSF_FIELD)); - - DocValuesBasedTimeMapper timeMapper = - (DocValuesBasedTimeMapper) segmentData.getTimeMapper(); - timeMapper.initializeWithLuceneReader( - atomicReader, - getColumnStrideFieldIndex(segmentData, EarlybirdFieldConstant.CREATED_AT_CSF_FIELD)); - } - - private ColumnStrideFieldIndex getColumnStrideFieldIndex( - EarlybirdIndexSegmentData segmentData, EarlybirdFieldConstant csfField) { - String csfFieldName = csfField.getFieldName(); - EarlybirdFieldType fieldType = - segmentData.getSchema().getFieldInfo(csfFieldName).getFieldType(); - Preconditions.checkState(fieldType.isCsfLoadIntoRam()); - return segmentData.getDocValuesManager().addColumnStrideField(csfFieldName, fieldType); - } -} diff --git a/src/java/com/twitter/search/earlybird/index/TweetSearchRealtimeIndexExtensionsData.java b/src/java/com/twitter/search/earlybird/index/TweetSearchRealtimeIndexExtensionsData.java deleted file mode 100644 index 02752d08d..000000000 --- a/src/java/com/twitter/search/earlybird/index/TweetSearchRealtimeIndexExtensionsData.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird.index; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter.InvertedDocConsumerBuilder; -import com.twitter.search.core.earlybird.index.EarlybirdRealtimeIndexSegmentWriter.StoredFieldsConsumerBuilder; -import com.twitter.search.core.earlybird.index.extensions.EarlybirdRealtimeIndexExtensionsData; - -public class TweetSearchRealtimeIndexExtensionsData - implements EarlybirdRealtimeIndexExtensionsData { - @Override - public void createStoredFieldsConsumer(StoredFieldsConsumerBuilder builder) { - // no extensions necessary here - } - - @Override - public void createInvertedDocConsumer(InvertedDocConsumerBuilder builder) { - if (EarlybirdFieldConstant.ID_FIELD.getFieldName().equals(builder.getFieldName())) { - // The tweet ID should've already been added to the tweet ID <-> doc ID mapper. 
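The TweetIDToInternalIDMap shown a little earlier resolves tweet ID lookups with open addressing and a double-hash probe increment. Below is a minimal, self-contained sketch of that probing scheme; the class, the simplified hash function, and the demo data are illustrative, not the production code.

```java
import java.util.Arrays;

/**
 * Sketch of open addressing with a double-hash increment, in the style of
 * TweetIDToInternalIDMap: a power-of-two int table probed until an empty slot
 * (or a slot whose stored doc ID maps back to the key) is found.
 */
public final class LongToIntOpenHashDemo {
  private static final int NOT_FOUND = -1;

  private final int[] table; // values are small "doc IDs"
  private final int mask;    // size is a power of two, so (code & mask) is a cheap modulo

  public LongToIntOpenHashDemo(int powerOfTwoSize) {
    table = new int[powerOfTwoSize];
    Arrays.fill(table, NOT_FOUND);
    mask = powerOfTwoSize - 1;
  }

  private static int hash(long key) {
    return (int) (key ^ (key >>> 32));
  }

  // Same shape as incrementHashCode(): the increment is always odd, so with a
  // power-of-two table the probe sequence visits every slot before repeating.
  private static int increment(int code) {
    return ((code >> 8) + code) | 1;
  }

  /** inverseMap[docId] is expected to already hold the key, mirroring the assert in add(). */
  public void put(long key, int docId, long[] inverseMap) {
    assert inverseMap[docId] == key;
    int code = hash(key);
    int pos = code & mask;
    final int inc = increment(code);
    while (table[pos] != NOT_FOUND) { // keep probing until a free slot is found
      code += inc;
      pos = code & mask;
    }
    table[pos] = docId;
  }

  public int get(long key, long[] inverseMap) {
    int code = hash(key);
    int pos = code & mask;
    final int inc = increment(code);
    // Stop on an empty slot (key absent) or a slot whose doc ID maps back to the key.
    while (table[pos] != NOT_FOUND && inverseMap[table[pos]] != key) {
      code += inc;
      pos = code & mask;
    }
    return table[pos];
  }

  public static void main(String[] args) {
    long[] inverse = {100L, 200L, 300L}; // docId -> key
    LongToIntOpenHashDemo map = new LongToIntOpenHashDemo(8);
    for (int docId = 0; docId < inverse.length; docId++) {
      map.put(inverse[docId], docId, inverse);
    }
    System.out.println(map.get(200L, inverse)); // 1
    System.out.println(map.get(999L, inverse)); // -1
  }
}
```

Keeping the increment odd while the table size is a power of two means the probe sequence cycles through every slot, so a lookup for an absent key always terminates at an empty slot as long as the table is not full.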
- builder.setUseDefaultConsumer(false); - } - - if (EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName().equals(builder.getFieldName())) { - RealtimeTimeMapper timeMapper = (RealtimeTimeMapper) builder.getSegmentData().getTimeMapper(); - builder.addConsumer(new TimeMappingWriter(timeMapper)); - builder.setUseDefaultConsumer(false); - } - } - - @Override - public void setupExtensions(EarlybirdIndexSegmentAtomicReader atomicReader) { - } -} diff --git a/src/java/com/twitter/search/earlybird/index/facets/BUILD b/src/java/com/twitter/search/earlybird/index/facets/BUILD deleted file mode 100644 index 8041b4394..000000000 --- a/src/java/com/twitter/search/earlybird/index/facets/BUILD +++ /dev/null @@ -1,16 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-common", - "3rdparty/jvm/org/apache/lucene:lucene-analyzers-smartcn", - "3rdparty/jvm/org/apache/lucene:lucene-facet", - "3rdparty/jvm/org/apache/lucene:lucene-queries", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/core/earlybird", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird/index/facets/FacetSkipList.java b/src/java/com/twitter/search/earlybird/index/facets/FacetSkipList.java deleted file mode 100644 index 8735f82a2..000000000 --- a/src/java/com/twitter/search/earlybird/index/facets/FacetSkipList.java +++ /dev/null @@ -1,126 +0,0 @@ -package com.twitter.search.earlybird.index.facets; - -import java.io.IOException; -import java.util.HashSet; -import java.util.Iterator; -import java.util.Set; - -import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.TermQuery; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.facets.FacetCountState; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; - -public abstract class FacetSkipList { - public static class SkipTokenStream extends TokenStream { - private CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); - - private Iterator iterator; - private Set facetFields = new HashSet<>(); - - public void add(Schema.FieldInfo field) { - this.facetFields.add(field); - } - - @Override - public final boolean incrementToken() throws IOException { - if (iterator == null) { - iterator = facetFields.iterator(); - } - - while (iterator.hasNext()) { - Schema.FieldInfo field = iterator.next(); - if (field.getFieldType().isStoreFacetSkiplist()) { - termAtt.setEmpty(); - termAtt.append(EarlybirdFieldConstant.getFacetSkipFieldName(field.getName())); - - return true; - } - } - - return false; - } - } - - /** - * Returns a Term query to search in the given facet field. 
- */ - public static Term getSkipListTerm(Schema.FieldInfo facetField) { - if (facetField.getFieldType().isStoreFacetSkiplist()) { - return new Term(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.getFacetSkipFieldName(facetField.getName())); - } - return null; - } - - /** - * Returns a disjunction query that searches in all facet fields in the given facet count state. - */ - public static Query getSkipListQuery(FacetCountState facetCountState) { - Set fieldsWithSkipLists = - facetCountState.getFacetFieldsToCountWithSkipLists(); - - if (fieldsWithSkipLists == null || fieldsWithSkipLists.isEmpty()) { - return null; - } - - Query skipLists; - - if (fieldsWithSkipLists.size() == 1) { - skipLists = new TermQuery(getSkipListTerm(fieldsWithSkipLists.iterator().next())); - } else { - BooleanQuery.Builder disjunctionBuilder = new BooleanQuery.Builder(); - for (Schema.FieldInfo facetField : fieldsWithSkipLists) { - disjunctionBuilder.add( - new TermQuery(new Term( - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.getFacetSkipFieldName(facetField.getName()))), - BooleanClause.Occur.SHOULD); - } - skipLists = disjunctionBuilder.build(); - } - - return skipLists; - } - - /** - * Returns a term request that can be used to get term statistics for the skip list term - * associated with the provided facet. Returns null, if this FacetField is configured to not - * store a skiplist. - */ - public static ThriftTermRequest getSkipListTermRequest(Schema schema, String facetName) { - return getSkipListTermRequest(schema.getFacetFieldByFacetName(facetName)); - } - - /** - * Returns a term request that can be used to get term statistics for the skip list term - * associated with the provided facet. Returns null, if this FacetField is configured to not - * store a skiplist. - */ - public static ThriftTermRequest getSkipListTermRequest(Schema.FieldInfo facetField) { - return facetField != null && facetField.getFieldType().isStoreFacetSkiplist() - ? new ThriftTermRequest( - EarlybirdFieldConstant.getFacetSkipFieldName(facetField.getName())) - .setFieldName(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName()) - : null; - } - - /** - * Returns a term request using the specified fieldName. This is only a temporary solution until - * Blender can access the Schema to pass the FacetIDMap into the method above. 
- * - * @deprecated Temporary solution until Blender - */ - @Deprecated - public static ThriftTermRequest getSkipListTermRequest(String fieldName) { - return new ThriftTermRequest(EarlybirdFieldConstant.getFacetSkipFieldName(fieldName)) - .setFieldName(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName()); - } -} diff --git a/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java b/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java deleted file mode 100644 index 0e12f18c7..000000000 --- a/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java +++ /dev/null @@ -1,155 +0,0 @@ -package com.twitter.search.earlybird.ml; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Optional; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.file.AbstractFile; -import com.twitter.search.common.file.FileUtils; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.schema.DynamicSchema; -import com.twitter.search.common.util.ml.prediction_engine.CompositeFeatureContext; -import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel; -import com.twitter.search.common.util.ml.prediction_engine.ModelLoader; - -import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.CONTEXT; -import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.FeatureContextVersion.CURRENT_VERSION; - -/** - * Loads the scoring models for tweets and provides access to them. - * - * This class relies on a list of ModelLoader objects to retrieve the objects from them. It will - * return the first model found according to the order in the list. - * - * For production, we load models from 2 sources: classpath and HDFS. If a model is available - * from HDFS, we return it, otherwise we use the model from the classpath. - * - * The models used for default requests (i.e. not experiments) MUST be present in the - * classpath, this allows us to avoid errors if they can't be loaded from HDFS. - * Models for experiments can live only in HDFS, so we don't need to redeploy Earlybird if we - * want to test them. - */ -public class ScoringModelsManager { - - private static final Logger LOG = LoggerFactory.getLogger(ScoringModelsManager.class); - - /** - * Used when - * 1. Testing - * 2. The scoring models are disabled in the config - * 3. Exceptions thrown during loading the scoring models - */ - public static final ScoringModelsManager NO_OP_MANAGER = new ScoringModelsManager() { - @Override - public boolean isEnabled() { - return false; - } - }; - - private final ModelLoader[] loaders; - private final DynamicSchema dynamicSchema; - - public ScoringModelsManager(ModelLoader... loaders) { - this.loaders = loaders; - this.dynamicSchema = null; - } - - public ScoringModelsManager(DynamicSchema dynamicSchema, ModelLoader... loaders) { - this.loaders = loaders; - this.dynamicSchema = dynamicSchema; - } - - /** - * Indicates that the scoring models were enabled in the config and were loaded successfully - */ - public boolean isEnabled() { - return true; - } - - public void reload() { - for (ModelLoader loader : loaders) { - loader.run(); - } - } - - /** - * Loads and returns the model with the given name, if one exists. 
- */ - public Optional getModel(String modelName) { - for (ModelLoader loader : loaders) { - Optional model = loader.getModel(modelName); - if (model.isPresent()) { - return model; - } - } - return Optional.absent(); - } - - /** - * Creates an instance that loads models first from HDFS and the classpath resources. - * - * If the models are not found in HDFS, it uses the models from the classpath as fallback. - */ - public static ScoringModelsManager create( - SearchStatsReceiver serverStats, - String hdfsNameNode, - String hdfsBasedPath, - DynamicSchema dynamicSchema) throws IOException { - // Create a composite feature context so we can load both legacy and schema-based models - CompositeFeatureContext featureContext = new CompositeFeatureContext( - CONTEXT, dynamicSchema::getSearchFeatureSchema); - ModelLoader hdfsLoader = createHdfsLoader( - serverStats, hdfsNameNode, hdfsBasedPath, featureContext); - ModelLoader classpathLoader = createClasspathLoader( - serverStats, featureContext); - - // Explicitly load the models from the classpath - classpathLoader.run(); - - ScoringModelsManager manager = new ScoringModelsManager(hdfsLoader, classpathLoader); - LOG.info("Initialized ScoringModelsManager for loading models from HDFS and the classpath"); - return manager; - } - - protected static ModelLoader createHdfsLoader( - SearchStatsReceiver serverStats, - String hdfsNameNode, - String hdfsBasedPath, - CompositeFeatureContext featureContext) { - String hdfsVersionedPath = hdfsBasedPath + "/" + CURRENT_VERSION.getVersionDirectory(); - LOG.info("Starting to load scoring models from HDFS: {}:{}", - hdfsNameNode, hdfsVersionedPath); - return ModelLoader.forHdfsDirectory( - hdfsNameNode, - hdfsVersionedPath, - featureContext, - "scoring_models_hdfs_", - serverStats); - } - - /** - * Creates a loader that loads models from a default location in the classpath. 
- */ - @VisibleForTesting - public static ModelLoader createClasspathLoader( - SearchStatsReceiver serverStats, CompositeFeatureContext featureContext) - throws IOException { - AbstractFile defaultModelsBaseDir = FileUtils.getTmpDirHandle( - ScoringModelsManager.class, - "/com/twitter/search/earlybird/ml/default_models"); - AbstractFile defaultModelsDir = defaultModelsBaseDir.getChild( - CURRENT_VERSION.getVersionDirectory()); - - LOG.info("Starting to load scoring models from the classpath: {}", - defaultModelsDir.getPath()); - return ModelLoader.forDirectory( - defaultModelsDir, - featureContext, - "scoring_models_classpath_", - serverStats); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/AudioSpaceEventsStreamIndexer.java b/src/java/com/twitter/search/earlybird/partition/AudioSpaceEventsStreamIndexer.java deleted file mode 100644 index 2cd655695..000000000 --- a/src/java/com/twitter/search/earlybird/partition/AudioSpaceEventsStreamIndexer.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.earlybird.exception.MissingKafkaTopicException; -import com.twitter.ubs.thriftjava.AudioSpaceBaseEvent; -import com.twitter.ubs.thriftjava.AudioSpaceEvent; -import com.twitter.util.Duration; - -/** - * - * An example publish event looks like this: - * - */ -public class AudioSpaceEventsStreamIndexer extends SimpleStreamIndexer { - private static final Logger LOG = LoggerFactory.getLogger(AudioSpaceEventsStreamIndexer.class); - - private static final String AUDIO_SPACE_EVENTS_TOPIC = "audio_space_events_v1"; - - @VisibleForTesting - // We use this to filter out old space publish events so as to avoid the risk of processing - // old space publish events whose corresponding finish events are no longer in the stream. - // It's unlikely that spaces would last longer than this constant so it should be safe to assume - // that the space whose publish event is older than this age is finished. 
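The age cutoff described in the comment above reduces to a single clock comparison. A small illustrative sketch follows; the class and method names are hypothetical, not the production indexer.

```java
import java.time.Duration;

/** Sketch of the stale publish-event filter: old starts are dropped so a space whose
 *  finish event has already aged out of the stream cannot be left "live" forever. */
final class PublishEventAgeFilterDemo {
  // The production constant is 11 hours; any value below the table's retention works.
  static final long MAX_PUBLISH_EVENT_AGE_MS = Duration.ofHours(11).toMillis();

  /** Returns true if a publish event with this timestamp should still be applied. */
  static boolean shouldApplyPublishEvent(long eventTimestampMs, long nowMs) {
    return (nowMs - eventTimestampMs) < MAX_PUBLISH_EVENT_AGE_MS;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(shouldApplyPublishEvent(now - Duration.ofHours(1).toMillis(), now));  // true
    System.out.println(shouldApplyPublishEvent(now - Duration.ofHours(12).toMillis(), now)); // false
  }
}
```

The 11-hour cutoff stays safely under the 12-hour retention window used by AudioSpaceTable further below, so a publish event is only applied while its potential finish event is still retained.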
- protected static final long MAX_PUBLISH_EVENTS_AGE_MS = - Duration.fromHours(11).inMillis(); - - private final AudioSpaceTable audioSpaceTable; - private final Clock clock; - - public AudioSpaceEventsStreamIndexer( - KafkaConsumer kafkaConsumer, - AudioSpaceTable audioSpaceTable, - Clock clock) throws MissingKafkaTopicException { - super(kafkaConsumer, AUDIO_SPACE_EVENTS_TOPIC); - this.audioSpaceTable = audioSpaceTable; - this.clock = clock; - } - - @Override - protected void validateAndIndexRecord(ConsumerRecord record) { - AudioSpaceBaseEvent baseEvent = record.value(); - - if (baseEvent != null && baseEvent.isSetBroadcast_id() && baseEvent.isSetEvent_metadata()) { - AudioSpaceEvent event = baseEvent.getEvent_metadata(); - String spaceId = baseEvent.getBroadcast_id(); - if (event != null && event.isSet(AudioSpaceEvent._Fields.SPACE_PUBLISH_EVENT)) { - long publishEventAgeMs = clock.nowMillis() - baseEvent.getTime_stamp_millis(); - if (publishEventAgeMs < MAX_PUBLISH_EVENTS_AGE_MS) { - audioSpaceTable.audioSpaceStarts(spaceId); - } - } else if (event != null && event.isSet(AudioSpaceEvent._Fields.SPACE_END_EVENT)) { - audioSpaceTable.audioSpaceFinishes(spaceId); - } - } - } - - @VisibleForTesting - public AudioSpaceTable getAudioSpaceTable() { - return audioSpaceTable; - } - - void printSummary() { - LOG.info(audioSpaceTable.toString()); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/AudioSpaceTable.java b/src/java/com/twitter/search/earlybird/partition/AudioSpaceTable.java deleted file mode 100644 index a7d29d2c9..000000000 --- a/src/java/com/twitter/search/earlybird/partition/AudioSpaceTable.java +++ /dev/null @@ -1,150 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.ArrayDeque; -import java.util.Queue; -import java.util.Set; -import java.util.concurrent.ConcurrentSkipListSet; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.util.Duration; -import com.twitter.util.Time; - -public class AudioSpaceTable { - private static final String STATS_PREFIX = "audio_space_"; - private static final Duration AUDIO_EVENT_EXPIRATION_DURATION = - Duration.fromHours(12); - - private final Set startedSpaces; - private final Set finishedSpaces; - /** - * timestampedSpaceEvents contains both start and finish events. - * This is to aid in the case in which we receive only on or the other for a spaceId -- start or finish - * without doing this, we could potentially never purge from the sets. 
- */ - private final Queue> timestampedSpaceEvents; - private final Clock clock; - - private final SearchRateCounter audioSpaceStarts = - SearchRateCounter.export(STATS_PREFIX + "stream_starts"); - private final SearchRateCounter audioSpaceFinishes = - SearchRateCounter.export(STATS_PREFIX + "stream_finishes"); - private final SearchRateCounter isRunningCalls = - SearchRateCounter.export(STATS_PREFIX + "is_running_calls"); - private final SearchRateCounter audioSpaceDuplicateStarts = - SearchRateCounter.export(STATS_PREFIX + "duplicate_start_events"); - private final SearchRateCounter audioSpaceDuplicateFinishes = - SearchRateCounter.export(STATS_PREFIX + "duplicate_finish_events"); - private final SearchRateCounter startsProcessedAfterCorrespondingFinishes = - SearchRateCounter.export(STATS_PREFIX + "starts_processed_after_corresponding_finishes"); - private final SearchRateCounter finishesProcessedWithoutCorrespondingStarts = - SearchRateCounter.export(STATS_PREFIX + "finishes_processed_without_corresponding_starts"); - - public AudioSpaceTable(Clock clock) { - // We read and write from different threads, so we need a thread-safe set implementation. - startedSpaces = new ConcurrentSkipListSet<>(); - finishedSpaces = new ConcurrentSkipListSet<>(); - timestampedSpaceEvents = new ArrayDeque<>(); - this.clock = clock; - SearchCustomGauge.export(STATS_PREFIX + "live", this::getNumberOfLiveAudioSpaces); - SearchCustomGauge.export(STATS_PREFIX + "retained_starts", startedSpaces::size); - SearchCustomGauge.export(STATS_PREFIX + "retained_finishes", finishedSpaces::size); - } - - private int getNumberOfLiveAudioSpaces() { - // This call is a bit expensive, but I logged it and it's getting called once a minute, at - // the beginning of the minute, so it's fine. - int count = 0; - for (String startedSpace : startedSpaces) { - count += finishedSpaces.contains(startedSpace) ? 0 : 1; - } - return count; - } - - /** - * We keep spaces that have started in the last 12 hours. - * This is called on every start space event received, and cleans up - * the retained spaces so memory usage does not become too high - */ - private void purgeOldSpaces() { - Pair oldest = timestampedSpaceEvents.peek(); - Time now = Time.fromMilliseconds(clock.nowMillis()); - while (oldest != null) { - Duration durationSinceInsert = now.minus(oldest.getFirst()); - if (durationSinceInsert.compareTo(AUDIO_EVENT_EXPIRATION_DURATION) > 0) { - // This event has expired, so we purge it and move on to the next. 
- String oldSpaceId = oldest.getSecond(); - startedSpaces.remove(oldSpaceId); - finishedSpaces.remove(oldSpaceId); - oldest = timestampedSpaceEvents.poll(); - } else { - // Oldest event is not old enough so quit purging - break; - } - } - } - - /** - * Record AudioSpace start event - */ - public void audioSpaceStarts(String spaceId) { - audioSpaceStarts.increment(); - boolean spaceSeenBefore = !startedSpaces.add(spaceId); - if (spaceSeenBefore) { - audioSpaceDuplicateStarts.increment(); - } - - if (finishedSpaces.contains(spaceId)) { - startsProcessedAfterCorrespondingFinishes.increment(); - } - - timestampedSpaceEvents.add(new Pair(Time.fromMilliseconds(clock.nowMillis()), spaceId)); - purgeOldSpaces(); - } - - /** - * Record AudioSpace finish event - */ - public void audioSpaceFinishes(String spaceId) { - audioSpaceFinishes.increment(); - boolean spaceSeenBefore = !finishedSpaces.add(spaceId); - if (spaceSeenBefore) { - audioSpaceDuplicateFinishes.increment(); - } - - if (!startedSpaces.contains(spaceId)) { - finishesProcessedWithoutCorrespondingStarts.increment(); - } - - timestampedSpaceEvents.add(new Pair(Time.fromMilliseconds(clock.nowMillis()), spaceId)); - purgeOldSpaces(); - } - - public boolean isRunning(String spaceId) { - isRunningCalls.increment(); - return startedSpaces.contains(spaceId) && !finishedSpaces.contains(spaceId); - } - - /** - * Print stats on this AudioSpaceTable - * @return Stats string - */ - public String toString() { - return "AudioSpaceTable: Starts: " + audioSpaceStarts.getCounter().get() - + ", Finishes: " + audioSpaceFinishes.getCounter().get() - + ", Retained starts: " + startedSpaces.size() - + ", Retained finishes: " + finishedSpaces.size() - + ", Currently live: " + getNumberOfLiveAudioSpaces(); - } - - public Set getStartedSpaces() { - return startedSpaces; - } - - public Set getFinishedSpaces() { - return finishedSpaces; - } - -} diff --git a/src/java/com/twitter/search/earlybird/partition/BalancingKafkaConsumer.java b/src/java/com/twitter/search/earlybird/partition/BalancingKafkaConsumer.java deleted file mode 100644 index da3fd9536..000000000 --- a/src/java/com/twitter/search/earlybird/partition/BalancingKafkaConsumer.java +++ /dev/null @@ -1,117 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.time.Duration; -import java.util.Arrays; -import java.util.Collections; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.common.TopicPartition; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchRateCounter; - -/** - * BalancingKafkaConsumer is designed to read from the tweets and updates streams in proportion to - * the rates that those streams are written to, i.e. both topics should have nearly the same amount - * of lag. This is important because if one stream gets too far ahead of the other, we could end up - * in a situation where: - * 1. If the tweet stream is ahead of the updates stream, we couldn't apply an update because a - * segment has been optimized, and one of those fields became frozen. - * 2. If the updates stream is ahead of the tweet stream, we might drop updates because they are - * more than a minute old, but the tweets might still not be indexed. 
- * - * Also see 'Consumption Flow Control' in - * https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html - */ -public class BalancingKafkaConsumer { - // If one of the topic-partitions lags the other by more than 10 seconds, - // it's worth it to pause the faster one and let the slower one catch up. - private static final long BALANCE_THRESHOLD_MS = Duration.ofSeconds(10).toMillis(); - private final KafkaConsumer kafkaConsumer; - private final TopicPartition tweetTopic; - private final TopicPartition updateTopic; - private final SearchRateCounter tweetsPaused; - private final SearchRateCounter updatesPaused; - private final SearchRateCounter resumed; - - private long tweetTimestamp = 0; - private long updateTimestamp = 0; - private long pausedAt = 0; - private boolean paused = false; - - public BalancingKafkaConsumer( - KafkaConsumer kafkaConsumer, - TopicPartition tweetTopic, - TopicPartition updateTopic - ) { - this.kafkaConsumer = kafkaConsumer; - this.tweetTopic = tweetTopic; - this.updateTopic = updateTopic; - - String prefix = "balancing_kafka_"; - String suffix = "_topic_paused"; - - tweetsPaused = SearchRateCounter.export(prefix + tweetTopic.topic() + suffix); - updatesPaused = SearchRateCounter.export(prefix + updateTopic.topic() + suffix); - resumed = SearchRateCounter.export(prefix + "topics_resumed"); - } - - /** - * Calls poll on the underlying consumer and pauses topics as necessary. - */ - public ConsumerRecords poll(Duration timeout) { - ConsumerRecords records = kafkaConsumer.poll(timeout); - topicFlowControl(records); - return records; - } - - private void topicFlowControl(ConsumerRecords records) { - for (ConsumerRecord record : records) { - long timestamp = record.timestamp(); - - if (updateTopic.topic().equals(record.topic())) { - updateTimestamp = Math.max(updateTimestamp, timestamp); - } else if (tweetTopic.topic().equals(record.topic())) { - tweetTimestamp = Math.max(tweetTimestamp, timestamp); - } else { - throw new IllegalStateException( - "Unexpected partition " + record.topic() + " in BalancingKafkaConsumer"); - } - } - - if (paused) { - // If we paused and one of the streams is still below the pausedAt point, we want to continue - // reading from just the lagging stream. - if (tweetTimestamp >= pausedAt && updateTimestamp >= pausedAt) { - // We caught up, resume reading from both topics. - paused = false; - kafkaConsumer.resume(Arrays.asList(tweetTopic, updateTopic)); - resumed.increment(); - } - } else { - long difference = Math.abs(tweetTimestamp - updateTimestamp); - - if (difference < BALANCE_THRESHOLD_MS) { - // The streams have approximately the same lag, so no need to pause anything. - return; - } - // The difference is too great, one of the streams is lagging behind the other so we need to - // pause one topic so the other can catch up. 
- paused = true; - pausedAt = Math.max(updateTimestamp, tweetTimestamp); - if (tweetTimestamp > updateTimestamp) { - kafkaConsumer.pause(Collections.singleton(tweetTopic)); - tweetsPaused.increment(); - } else { - kafkaConsumer.pause(Collections.singleton(updateTopic)); - updatesPaused.increment(); - } - } - } - - public void close() { - kafkaConsumer.close(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/CompleteSegmentManager.java b/src/java/com/twitter/search/earlybird/partition/CompleteSegmentManager.java deleted file mode 100644 index 38ff55c08..000000000 --- a/src/java/com/twitter/search/earlybird/partition/CompleteSegmentManager.java +++ /dev/null @@ -1,349 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.Iterator; -import java.util.List; -import java.util.function.Supplier; - -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.segment.SegmentDataProvider; - -/** - * CompleteSegmentManager is used to parallelize indexing of complete (not partial) segments - * on startup. It also populates the fields used by the PartitionManager. - */ -public class CompleteSegmentManager { - private static final Logger LOG = LoggerFactory.getLogger(CompleteSegmentManager.class); - - private static final String INDEX_COMPLETED_SEGMENTS = - "indexing, optimizing and flushing complete segments"; - private static final String LOAD_COMPLETED_SEGMENTS = "loading complete segments"; - private static final String INDEX_UPDATES_FOR_COMPLETED_SEGMENTS = - "indexing updates for complete segments"; - private static final String BUILD_MULTI_SEGMENT_TERM_DICT = - "build multi segment term dictionaries"; - - // Max number of segments being loaded / indexed concurrently. - private final int maxConcurrentSegmentIndexers = - EarlybirdProperty.MAX_CONCURRENT_SEGMENT_INDEXERS.get(3); - - // The state we are building. 
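A condensed, standalone sketch of the pause/resume flow control implemented by BalancingKafkaConsumer above; byte[] payloads and topic handling are simplified, and the consumer is assumed to already be subscribed to both topic-partitions.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/**
 * Sketch of paired-topic flow control: track the max record timestamp seen per topic and,
 * when one topic runs more than a threshold ahead of the other, pause it until the lagging
 * topic catches up.
 */
public final class PairedTopicFlowControlDemo {
  private static final long BALANCE_THRESHOLD_MS = Duration.ofSeconds(10).toMillis();

  private final KafkaConsumer<byte[], byte[]> consumer;
  private final TopicPartition tweets;
  private final TopicPartition updates;

  private long tweetTs = 0;
  private long updateTs = 0;
  private long pausedAt = 0;
  private boolean paused = false;

  public PairedTopicFlowControlDemo(KafkaConsumer<byte[], byte[]> consumer,
                                    TopicPartition tweets,
                                    TopicPartition updates) {
    this.consumer = consumer;
    this.tweets = tweets;
    this.updates = updates;
  }

  /** Polls the underlying consumer and pauses/resumes topics as necessary. */
  public ConsumerRecords<byte[], byte[]> poll(Duration timeout) {
    ConsumerRecords<byte[], byte[]> records = consumer.poll(timeout);
    for (ConsumerRecord<byte[], byte[]> r : records) {
      if (r.topic().equals(tweets.topic())) {
        tweetTs = Math.max(tweetTs, r.timestamp());
      } else if (r.topic().equals(updates.topic())) {
        updateTs = Math.max(updateTs, r.timestamp());
      }
    }

    if (paused) {
      // Resume both topics once the lagging stream has passed the point where we paused.
      if (tweetTs >= pausedAt && updateTs >= pausedAt) {
        consumer.resume(Arrays.asList(tweets, updates));
        paused = false;
      }
    } else if (Math.abs(tweetTs - updateTs) >= BALANCE_THRESHOLD_MS) {
      // One stream is too far ahead: pause it and keep reading only the lagging one.
      paused = true;
      pausedAt = Math.max(tweetTs, updateTs);
      consumer.pause(Collections.singleton(tweetTs > updateTs ? tweets : updates));
    }
    return records;
  }
}
```

Because pausing is keyed off record timestamps rather than offsets, the same logic works even though the two topics have very different message rates.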
- protected final SegmentDataProvider segmentDataProvider; - private final InstrumentedQueue retryQueue; - - private final UserUpdatesStreamIndexer userUpdatesStreamIndexer; - private final UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer; - - private final SegmentManager segmentManager; - private final ZooKeeperTryLockFactory zkTryLockFactory; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final Clock clock; - private MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - private final SegmentSyncConfig segmentSyncConfig; - - private final CriticalExceptionHandler criticalExceptionHandler; - - private boolean interrupted = false; - - public CompleteSegmentManager( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - SegmentDataProvider segmentDataProvider, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - SegmentManager segmentManager, - InstrumentedQueue retryQueue, - SearchIndexingMetricSet searchIndexingMetricSet, - Clock clock, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - SegmentSyncConfig segmentSyncConfig, - CriticalExceptionHandler criticalExceptionHandler) { - this.zkTryLockFactory = zooKeeperTryLockFactory; - this.segmentDataProvider = segmentDataProvider; - this.userUpdatesStreamIndexer = userUpdatesStreamIndexer; - this.userScrubGeoEventStreamIndexer = userScrubGeoEventStreamIndexer; - this.segmentManager = segmentManager; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.clock = clock; - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.segmentSyncConfig = segmentSyncConfig; - this.retryQueue = retryQueue; - this.criticalExceptionHandler = criticalExceptionHandler; - } - - /** - * Indexes all user events. - */ - public void indexUserEvents() { - LOG.info("Loading/indexing user events."); - StartupUserEventIndexer startupUserEventIndexer = new StartupUserEventIndexer( - searchIndexingMetricSet, - userUpdatesStreamIndexer, - userScrubGeoEventStreamIndexer, - segmentManager, - clock - ); - - startupUserEventIndexer.indexAllEvents(); - LOG.info("Finished loading/indexing user events."); - } - - /** - * Loads or indexes from scratch all complete segments. - * - * @param segmentsToIndexProvider A supplier that provides the list of all complete segments. - */ - public void indexCompleteSegments( - Supplier> segmentsToIndexProvider) throws Exception { - List segmentIndexers = Lists.newArrayList(); - - EarlybirdStatus.beginEvent( - INDEX_COMPLETED_SEGMENTS, searchIndexingMetricSet.startupInIndexCompletedSegments); - while (!interrupted && !Thread.currentThread().isInterrupted()) { - try { - // Get the refreshed list of local segment databases. - segmentManager.updateSegments(segmentDataProvider.newSegmentList()); - Iterator segmentsToIndex = segmentsToIndexProvider.get().iterator(); - - // Start up to max concurrent segment indexers. - segmentIndexers.clear(); - while (segmentsToIndex.hasNext() && segmentIndexers.size() < maxConcurrentSegmentIndexers) { - SegmentInfo nextSegment = segmentsToIndex.next(); - if (!nextSegment.isComplete()) { - Thread thread = new Thread(new SingleSegmentIndexer(nextSegment), - "startup-segment-indexer-" + nextSegment.getSegmentName()); - thread.start(); - segmentIndexers.add(thread); - } - } - - // No remaining indexer threads, we're done. 
- if (segmentIndexers.size() == 0) { - LOG.info("Finished indexing complete segments"); - EarlybirdStatus.endEvent( - INDEX_COMPLETED_SEGMENTS, searchIndexingMetricSet.startupInIndexCompletedSegments); - break; - } - - // Wait for threads to complete fully. - LOG.info("Started {} indexing threads", segmentIndexers.size()); - for (Thread thread : segmentIndexers) { - thread.join(); - } - LOG.info("Joined all {} indexing threads", segmentIndexers.size()); - } catch (IOException e) { - LOG.error("IOException in SegmentStartupManager loop", e); - } catch (InterruptedException e) { - interrupted = true; - LOG.error("Interrupted joining segment indexer thread", e); - } - } - } - - /** - * Loads all given complete segments. - * - * @param completeSegments The list of all complete segments to be loaded. - */ - public void loadCompleteSegments(List completeSegments) throws Exception { - if (!interrupted && !Thread.currentThread().isInterrupted()) { - LOG.info("Starting to load {} complete segments.", completeSegments.size()); - EarlybirdStatus.beginEvent( - LOAD_COMPLETED_SEGMENTS, searchIndexingMetricSet.startupInLoadCompletedSegments); - - List segmentThreads = Lists.newArrayList(); - List segmentsToBeLoaded = Lists.newArrayList(); - for (SegmentInfo segmentInfo : completeSegments) { - if (segmentInfo.isEnabled()) { - segmentsToBeLoaded.add(segmentInfo); - Thread segmentLoaderThread = new Thread( - () -> new SegmentLoader(segmentSyncConfig, criticalExceptionHandler) - .load(segmentInfo), - "startup-segment-loader-" + segmentInfo.getSegmentName()); - segmentThreads.add(segmentLoaderThread); - segmentLoaderThread.start(); - } else { - LOG.info("Will not load segment {} because it's disabled.", segmentInfo.getSegmentName()); - } - } - - for (Thread segmentLoaderThread : segmentThreads) { - segmentLoaderThread.join(); - } - - for (SegmentInfo segmentInfo : segmentsToBeLoaded) { - if (!segmentInfo.getSyncInfo().isLoaded()) { - // Throw an exception if a segment could not be loaded: We do not want earlybirds to - // startup with missing segments. - throw new RuntimeException("Could not load segment " + segmentInfo.getSegmentName()); - } - } - - LOG.info("Loaded all complete segments, starting indexing all updates."); - EarlybirdStatus.beginEvent( - INDEX_UPDATES_FOR_COMPLETED_SEGMENTS, - searchIndexingMetricSet.startupInIndexUpdatesForCompletedSegments); - - // Index all updates for all complete segments until we're fully caught up. 
- if (!EarlybirdCluster.isArchive(segmentManager.getEarlybirdIndexConfig().getCluster())) { - segmentThreads.clear(); - for (SegmentInfo segmentInfo : completeSegments) { - if (segmentInfo.isEnabled()) { - Thread segmentUpdatesThread = new Thread( - () -> new SimpleUpdateIndexer( - segmentDataProvider.getSegmentDataReaderSet(), - searchIndexingMetricSet, - retryQueue, - criticalExceptionHandler).indexAllUpdates(segmentInfo), - "startup-complete-segment-update-indexer-" + segmentInfo.getSegmentName()); - segmentThreads.add(segmentUpdatesThread); - segmentUpdatesThread.start(); - } else { - LOG.info("Will not index updates for segment {} because it's disabled.", - segmentInfo.getSegmentName()); - } - } - - for (Thread segmentUpdatesThread : segmentThreads) { - segmentUpdatesThread.join(); - } - } - LOG.info("Indexed updates for all complete segments."); - EarlybirdStatus.endEvent( - INDEX_UPDATES_FOR_COMPLETED_SEGMENTS, - searchIndexingMetricSet.startupInIndexUpdatesForCompletedSegments); - - EarlybirdStatus.endEvent( - LOAD_COMPLETED_SEGMENTS, searchIndexingMetricSet.startupInLoadCompletedSegments); - } - } - - /** - * Builds the term dictionary that spans all earlybird segments. Some fields share the term - * dictionary across segments as an optimization. - */ - public void buildMultiSegmentTermDictionary() { - EarlybirdStatus.beginEvent( - BUILD_MULTI_SEGMENT_TERM_DICT, - searchIndexingMetricSet.startupInMultiSegmentTermDictionaryUpdates); - if (!interrupted && !Thread.currentThread().isInterrupted()) { - LOG.info("Building multi segment term dictionaries."); - boolean built = multiSegmentTermDictionaryManager.buildDictionary(); - LOG.info("Done building multi segment term dictionaries, result: {}", built); - } - EarlybirdStatus.endEvent( - BUILD_MULTI_SEGMENT_TERM_DICT, - searchIndexingMetricSet.startupInMultiSegmentTermDictionaryUpdates); - } - - /** - * Warms up the data in the given segments. The warm up will usually make sure that all necessary - * is loaded in RAM and all relevant data structures are created before the segments starts - * serving real requests. - * - * @param segments The list of segments to warm up. - */ - public final void warmSegments(Iterable segments) throws InterruptedException { - int threadId = 1; - Iterator it = segments.iterator(); - - try { - List segmentWarmers = Lists.newLinkedList(); - while (it.hasNext()) { - - segmentWarmers.clear(); - while (it.hasNext() && segmentWarmers.size() < maxConcurrentSegmentIndexers) { - final SegmentInfo segment = it.next(); - Thread t = new Thread(() -> - new SegmentWarmer(criticalExceptionHandler).warmSegmentIfNecessary(segment), - "startup-warmer-" + threadId++); - - t.start(); - segmentWarmers.add(t); - } - - for (Thread t : segmentWarmers) { - t.join(); - } - } - } catch (InterruptedException e) { - LOG.error("Interrupted segment warmer thread", e); - Thread.currentThread().interrupt(); - throw e; - } - } - - /** - * Indexes a complete segment. - */ - private class SingleSegmentIndexer implements Runnable { - private final SegmentInfo segmentInfo; - - public SingleSegmentIndexer(SegmentInfo segmentInfo) { - this.segmentInfo = segmentInfo; - } - - @Override - public void run() { - // 0) Check if the segment can be loaded. This might copy the segment from HDFS. 
- if (new SegmentLoader(segmentSyncConfig, criticalExceptionHandler) - .downloadSegment(segmentInfo)) { - LOG.info("Will not index segment {} because it was downloaded from HDFS.", - segmentInfo.getSegmentName()); - segmentInfo.setComplete(true); - return; - } - - LOG.info("SingleSegmentIndexer starting for segment: " + segmentInfo); - - // 1) Index all tweets in this segment. - RecordReader tweetReader; - try { - tweetReader = segmentDataProvider.getSegmentDataReaderSet().newDocumentReader(segmentInfo); - if (tweetReader != null) { - tweetReader.setExhaustStream(true); - } - } catch (Exception e) { - throw new RuntimeException("Could not create tweet reader for segment: " + segmentInfo, e); - } - - new SimpleSegmentIndexer(tweetReader, searchIndexingMetricSet).indexSegment(segmentInfo); - - if (!segmentInfo.isComplete() || segmentInfo.isIndexing()) { - throw new RuntimeException("Segment does not appear to be complete: " + segmentInfo); - } - - // 2) Index all updates in this segment (archive earlybirds don't have updates). - if (!EarlybirdCluster.isArchive(segmentManager.getEarlybirdIndexConfig().getCluster())) { - new SimpleUpdateIndexer( - segmentDataProvider.getSegmentDataReaderSet(), - searchIndexingMetricSet, - retryQueue, - criticalExceptionHandler).indexAllUpdates(segmentInfo); - } - - // 3) Optimize the segment. - SegmentOptimizer.optimize(segmentInfo); - - // 4) Flush to HDFS if necessary. - new SegmentHdfsFlusher(zkTryLockFactory, segmentSyncConfig) - .flushSegmentToDiskAndHDFS(segmentInfo); - - // 5) Unload the segment from memory. - segmentInfo.getIndexSegment().close(); - } - } - -} diff --git a/src/java/com/twitter/search/earlybird/partition/DynamicPartitionConfig.java b/src/java/com/twitter/search/earlybird/partition/DynamicPartitionConfig.java deleted file mode 100644 index 946160448..000000000 --- a/src/java/com/twitter/search/earlybird/partition/DynamicPartitionConfig.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; - -/** - * Keeps track of an up-to-date PartitionConfig. The PartitionConfig may be periodically reloaded - * from ZooKeeper. If you need a consistent view of the current partition configuration, make sure - * to grab a reference to a single PartitionConfig using getCurrentPartitionConfig() and reuse that - * object. - */ -public class DynamicPartitionConfig { - private static final Logger LOG = LoggerFactory.getLogger(DynamicPartitionConfig.class); - private static final SearchCounter FAILED_UPDATE_COUNTER_NAME = - SearchCounter.export("dynamic_partition_config_failed_update"); - private static final SearchCounter SUCCESSFUL_UPDATE_COUNTER = - SearchCounter.export("dynamic_partition_config_successful_update"); - // We assume that DynamicPartitionConfig is practically a singleton in Earlybird app. 
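The startup indexing, loading, and warm-up loops in CompleteSegmentManager above all follow the same bounded-concurrency pattern: start at most maxConcurrentSegmentIndexers threads, join the whole batch, then move on to the next batch. A stripped-down sketch of that pattern, with illustrative names rather than the production classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch of batched thread execution: at most maxConcurrent workers run at a time. */
final class BatchedThreadRunnerDemo {
  static void runInBatches(List<Runnable> tasks, int maxConcurrent) throws InterruptedException {
    Iterator<Runnable> it = tasks.iterator();
    while (it.hasNext()) {
      List<Thread> batch = new ArrayList<>();
      // Start up to maxConcurrent workers.
      while (it.hasNext() && batch.size() < maxConcurrent) {
        Thread t = new Thread(it.next());
        t.start();
        batch.add(t);
      }
      // Wait for the whole batch before starting more work.
      for (Thread t : batch) {
        t.join();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    List<Runnable> tasks = new ArrayList<>();
    for (int i = 0; i < 7; i++) {
      final int id = i;
      tasks.add(() -> System.out.println("indexed segment " + id));
    }
    runInBatches(tasks, 3); // mirrors the default maxConcurrentSegmentIndexers of 3
  }
}
```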
- private static final SearchLongGauge NUM_REPLICAS_IN_HASH_PARTITION = - SearchLongGauge.export("dynamic_partition_config_num_replicas_in_hash_partition"); - - private final PartitionConfig curPartitionConfig; - - public DynamicPartitionConfig(PartitionConfig initialConfig) { - this.curPartitionConfig = initialConfig; - NUM_REPLICAS_IN_HASH_PARTITION.set(initialConfig.getNumReplicasInHashPartition()); - } - - public PartitionConfig getCurrentPartitionConfig() { - return curPartitionConfig; - } - - /** - * Verifies that the new partition config is compatible with the old one, and if it is, updates - * the number of replicas per partition based on the new partition config. - */ - public void setCurrentPartitionConfig(PartitionConfig partitionConfig) { - Preconditions.checkNotNull(partitionConfig); - // For now, we only allow the number of replicas in this partition to be dynamically updated. - // Ensure that the only things that have changed between the previous - if (curPartitionConfig.getClusterName().equals(partitionConfig.getClusterName()) - && (curPartitionConfig.getMaxEnabledLocalSegments() - == partitionConfig.getMaxEnabledLocalSegments()) - && (curPartitionConfig.getNumPartitions() == partitionConfig.getNumPartitions()) - && (curPartitionConfig.getTierStartDate().equals(partitionConfig.getTierStartDate())) - && (curPartitionConfig.getTierEndDate().equals(partitionConfig.getTierEndDate())) - && (curPartitionConfig.getTierName().equals(partitionConfig.getTierName()))) { - - if (curPartitionConfig.getNumReplicasInHashPartition() - != partitionConfig.getNumReplicasInHashPartition()) { - SUCCESSFUL_UPDATE_COUNTER.increment(); - curPartitionConfig.setNumReplicasInHashPartition( - partitionConfig.getNumReplicasInHashPartition()); - NUM_REPLICAS_IN_HASH_PARTITION.set(partitionConfig.getNumReplicasInHashPartition()); - } - } else { - FAILED_UPDATE_COUNTER_NAME.increment(); - LOG.warn( - "Attempted to update partition config with inconsistent layout.\n" - + "Current: " + curPartitionConfig.getPartitionConfigDescription() + "\n" - + "New: " + partitionConfig.getPartitionConfigDescription()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndex.java b/src/java/com/twitter/search/earlybird/partition/EarlybirdIndex.java deleted file mode 100644 index cda9ea682..000000000 --- a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndex.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; - -public class EarlybirdIndex { - private final List segmentInfoList; - - public static final int MAX_NUM_OF_NON_OPTIMIZED_SEGMENTS = 2; - - // The Kafka offsets for the tweet create stream and the tweet update stream. Indexing should - // start from these offsets when it resumes. 
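Resuming from these stored offsets amounts to assigning the two partitions and seeking each one before the first poll. A short sketch using the plain Kafka consumer API, with hypothetical partitions and offsets:

```java
import java.util.Arrays;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/** Sketch of replaying indexing from offsets captured alongside a flushed index. */
final class ResumeFromFlushedOffsetsDemo {
  static void resume(KafkaConsumer<byte[], byte[]> consumer,
                     TopicPartition tweetPartition, long tweetOffset,
                     TopicPartition updatePartition, long updateOffset) {
    consumer.assign(Arrays.asList(tweetPartition, updatePartition));
    consumer.seek(tweetPartition, tweetOffset);
    consumer.seek(updatePartition, updateOffset);
    // The next poll() returns records starting at the seeked offsets, so indexing
    // picks up exactly where the flushed index left off.
  }
}
```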
- private final long tweetOffset; - private final long updateOffset; - private final long maxIndexedTweetId; - - public EarlybirdIndex( - List segmentInfoList, - long tweetOffset, - long updateOffset, - long maxIndexedTweetId - ) { - List segmentInfos = new ArrayList<>(segmentInfoList); - Collections.sort(segmentInfos); - this.segmentInfoList = segmentInfos; - this.tweetOffset = tweetOffset; - this.updateOffset = updateOffset; - this.maxIndexedTweetId = maxIndexedTweetId; - } - - public EarlybirdIndex(List segmentInfoList, long tweetOffset, long updateOffset) { - this(segmentInfoList, tweetOffset, updateOffset, -1); - } - - public List getSegmentInfoList() { - return segmentInfoList; - } - - public long getTweetOffset() { - return tweetOffset; - } - - public long getUpdateOffset() { - return updateOffset; - } - - public long getMaxIndexedTweetId() { - return maxIndexedTweetId; - } - - /** - * Returns the number of non-optimized segments in this index. - * @return the number of non-optimized segments in this index. - */ - public int numOfNonOptimizedSegments() { - int numNonOptimized = 0; - for (SegmentInfo segmentInfo : segmentInfoList) { - if (!segmentInfo.isOptimized()) { - numNonOptimized++; - } - } - return numNonOptimized; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexFlusher.java b/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexFlusher.java deleted file mode 100644 index 3804b5cd7..000000000 --- a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexFlusher.java +++ /dev/null @@ -1,371 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.File; -import java.io.IOException; -import java.io.OutputStreamWriter; -import java.text.DateFormat; -import java.text.ParseException; -import java.text.SimpleDateFormat; -import java.time.Duration; -import java.util.ArrayList; -import java.util.Date; -import java.util.SortedMap; -import java.util.TreeMap; -import java.util.concurrent.TimeoutException; - -import scala.runtime.BoxedUnit; - -import com.google.common.base.Preconditions; - -import org.apache.commons.compress.utils.Lists; -import org.apache.commons.lang.RandomStringUtils; -import org.apache.hadoop.fs.FSDataOutputStream; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.earlybird.FlushVersion; -import com.twitter.search.common.util.io.flushable.DataSerializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.util.ActionLogger; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionInterface; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionLockFailed; -import com.twitter.search.earlybird.util.ParallelUtil; - -/** - * Flushes an EarlybirdIndex to HDFS, so that when Earlybird starts, it can read the index from - * HDFS instead of indexing from scratch. 
- * - * The path looks like: - * /smf1/rt2/user/search/earlybird/loadtest/realtime/indexes/flush_version_158/partition_8/index_2020_02_25_02 - */ -public class EarlybirdIndexFlusher { - public enum FlushAttemptResult { - CHECKED_RECENTLY, - FOUND_INDEX, - FLUSH_ATTEMPT_MADE, - FAILED_LOCK_ATTEMPT, - HADOOP_TIMEOUT - } - - @FunctionalInterface - public interface PostFlushOperation { - /** - * Run this after we finish flushing an index, before we rejoin the serverset. - */ - void execute(); - } - - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdIndexFlusher.class); - - private static final SearchCounter FLUSH_SUCCESS_COUNTER = - SearchCounter.export("successfully_flushed_index"); - - public static final String TWEET_KAFKA_OFFSET = "tweet_kafka_offset"; - public static final String UPDATE_KAFKA_OFFSET = "update_kafka_offset"; - public static final String FLUSHED_FROM_REPLICA = "flushed_from_replica"; - public static final String SEGMENTS = "segments"; - public static final String TIMESLICE_ID = "timeslice_id"; - - public static final String DATA_SUFFIX = ".data"; - public static final String INFO_SUFFIX = ".info"; - public static final String INDEX_INFO = "earlybird_index.info"; - - private static final String INDEX_PATH_FORMAT = "%s/flush_version_%d/partition_%d"; - public static final DateFormat INDEX_DATE_SUFFIX = new SimpleDateFormat("yyyy_MM_dd_HH"); - public static final String INDEX_PREFIX = "index_"; - public static final String TMP_PREFIX = "tmp_"; - - // Check if we need to flush every five minutes. - private static final long FLUSH_CHECK_PERIOD = Duration.ofMinutes(5).toMillis(); - - // Make sure we don't keep more than 3 copies of the index in HDFS, so that we don't run out of - // HDFS space. - private static final int INDEX_COPIES = 3; - - private static final NonPagingAssert FLUSHING_TOO_MANY_NON_OPTIMIZED_SEGMENTS = - new NonPagingAssert("flushing_too_many_non_optimized_segments"); - - private final CoordinatedEarlybirdActionInterface actionCoordinator; - private final FileSystem fileSystem; - private final Path indexPath; - private final Clock clock; - private final SegmentManager segmentManager; - private final int replicaId; - private final TimeLimitedHadoopExistsCall timeLimitedHadoopExistsCall; - private final OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock; - - private long checkedAt = 0; - - public EarlybirdIndexFlusher( - CoordinatedEarlybirdActionInterface actionCoordinator, - FileSystem fileSystem, - String indexHDFSPath, - SegmentManager segmentManager, - PartitionConfig partitionConfig, - Clock clock, - TimeLimitedHadoopExistsCall timeLimitedHadoopExistsCall, - OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock - ) { - this.actionCoordinator = actionCoordinator; - this.fileSystem = fileSystem; - this.indexPath = buildPathToIndexes(indexHDFSPath, partitionConfig); - this.segmentManager = segmentManager; - this.clock = clock; - this.replicaId = partitionConfig.getHostPositionWithinHashPartition(); - this.timeLimitedHadoopExistsCall = timeLimitedHadoopExistsCall; - this.optimizationAndFlushingCoordinationLock = optimizationAndFlushingCoordinationLock; - } - - /** - * Periodically checks if an index needs to be uploaded to HDFS, and uploads it if necessary. - * Skips flush if unable to acquire the optimizationAndFlushingCoordinationLock. 
- */ - public FlushAttemptResult flushIfNecessary( - long tweetOffset, - long updateOffset, - PostFlushOperation postFlushOperation) throws Exception { - long now = clock.nowMillis(); - if (now - checkedAt < FLUSH_CHECK_PERIOD) { - return FlushAttemptResult.CHECKED_RECENTLY; - } - - checkedAt = now; - - // Try to aqcuire lock to ensure that we are not in the gc_before_optimization or the - // post_optimization_rebuilds step of optimization. If the lock is not available, then skip - // flushing. - if (!optimizationAndFlushingCoordinationLock.tryLock()) { - return FlushAttemptResult.FAILED_LOCK_ATTEMPT; - } - // Acquired the lock, so wrap the flush in a try/finally block to ensure we release the lock - try { - Path flushPath = pathForHour(); - - try { - // If this doesn't execute on time, it will throw an exception and this function - // finishes its execution. - boolean result = timeLimitedHadoopExistsCall.exists(flushPath); - - if (result) { - return FlushAttemptResult.FOUND_INDEX; - } - } catch (TimeoutException e) { - LOG.warn("Timeout while calling hadoop", e); - return FlushAttemptResult.HADOOP_TIMEOUT; - } - - boolean flushedIndex = false; - try { - // this function returns a boolean. - actionCoordinator.execute("index_flushing", isCoordinated -> - flushIndex(flushPath, isCoordinated, tweetOffset, updateOffset, postFlushOperation)); - flushedIndex = true; - } catch (CoordinatedEarlybirdActionLockFailed e) { - // This only happens when we fail to grab the lock, which is fine because another Earlybird - // is already working on flushing this index, so we don't need to. - LOG.debug("Failed to grab lock", e); - } - - if (flushedIndex) { - // We don't return with a guarantee that we actually flushed something. It's possible - // that the .execute() function above was not able to leave the server set to flush. - return FlushAttemptResult.FLUSH_ATTEMPT_MADE; - } else { - return FlushAttemptResult.FAILED_LOCK_ATTEMPT; - } - } finally { - optimizationAndFlushingCoordinationLock.unlock(); - } - } - - /** - * Create a subpath to the directory with many indexes in it. Will have an index for each hour. - */ - public static Path buildPathToIndexes(String root, PartitionConfig partitionConfig) { - return new Path(String.format( - INDEX_PATH_FORMAT, - root, - FlushVersion.CURRENT_FLUSH_VERSION.getVersionNumber(), - partitionConfig.getIndexingHashPartitionID())); - } - - - /** - * Returns a sorted map from the unix time in millis an index was flushed to the path of an index. - * The last element will be the path of the most recent index. 
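getIndexPathsByTime, defined just below, globs the hourly index_* directories and orders them by the flush time encoded in their names. A local-filesystem analogue of the same idea, using java.nio instead of Hadoop's FileSystem, might look like this sketch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: order hourly index directories by the time encoded in their names.
public final class IndexPathsByTimeSketch {
  private static final String INDEX_PREFIX = "index_";
  private static final SimpleDateFormat INDEX_DATE_SUFFIX = new SimpleDateFormat("yyyy_MM_dd_HH");

  public static SortedMap<Long, Path> indexPathsByTime(Path partitionDir)
      throws IOException, ParseException {
    SortedMap<Long, Path> pathsByTime = new TreeMap<>();
    try (DirectoryStream<Path> dirs = Files.newDirectoryStream(partitionDir, INDEX_PREFIX + "*")) {
      for (Path dir : dirs) {
        String suffix = dir.getFileName().toString().substring(INDEX_PREFIX.length());
        // Key is the flush hour in epoch millis; TreeMap keeps the most recent entry last.
        pathsByTime.put(INDEX_DATE_SUFFIX.parse(suffix).getTime(), dir);
      }
    }
    return pathsByTime;
  }
}
```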
- */ - public static SortedMap getIndexPathsByTime( - Path indexPath, - FileSystem fileSystem - ) throws IOException, ParseException { - LOG.info("Getting index paths from file system: {}", fileSystem.getUri().toASCIIString()); - - SortedMap pathByTime = new TreeMap<>(); - Path globPattern = indexPath.suffix("/" + EarlybirdIndexFlusher.INDEX_PREFIX + "*"); - LOG.info("Lookup glob pattern: {}", globPattern); - - for (FileStatus indexDir : fileSystem.globStatus(globPattern)) { - String name = new File(indexDir.getPath().toString()).getName(); - String dateString = name.substring(EarlybirdIndexFlusher.INDEX_PREFIX.length()); - Date date = EarlybirdIndexFlusher.INDEX_DATE_SUFFIX.parse(dateString); - pathByTime.put(date.getTime(), indexDir.getPath()); - } - LOG.info("Found {} files matching the pattern.", pathByTime.size()); - - return pathByTime; - } - - private boolean flushIndex( - Path flushPath, - boolean isCoordinated, - long tweetOffset, - long updateOffset, - PostFlushOperation postFlushOperation - ) throws Exception { - Preconditions.checkState(isCoordinated); - - if (fileSystem.exists(flushPath)) { - return false; - } - - LOG.info("Starting index flush"); - - // In case the process is killed suddenly, we wouldn't be able to clean up the temporary - // directory, and we don't want other processes to reuse it, so add some randomness. - Path tmpPath = indexPath.suffix("/" + TMP_PREFIX + RandomStringUtils.randomAlphabetic(8)); - boolean creationSucceed = fileSystem.mkdirs(tmpPath); - if (!creationSucceed) { - throw new IOException("Couldn't create HDFS directory at " + flushPath); - } - - LOG.info("Temp path: {}", tmpPath); - try { - ArrayList segmentInfos = Lists.newArrayList(segmentManager.getSegmentInfos( - SegmentManager.Filter.Enabled, SegmentManager.Order.NEW_TO_OLD).iterator()); - segmentManager.logState("Before flushing"); - EarlybirdIndex index = new EarlybirdIndex(segmentInfos, tweetOffset, updateOffset); - ActionLogger.run( - "Flushing index to " + tmpPath, - () -> flushIndex(tmpPath, index)); - } catch (Exception e) { - LOG.error("Exception while flushing index. Rethrowing."); - - if (fileSystem.delete(tmpPath, true)) { - LOG.info("Successfully deleted temp output"); - } else { - LOG.error("Couldn't delete temp output"); - } - - throw e; - } - - // We flush it to a temporary directory, then rename the temporary directory so that it the - // change is atomic, and other Earlybirds will either see the old indexes, or the new, complete - // index, but never an in progress index. 
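The comment above describes the publish step: write into a randomized temporary directory, then rename it into place so readers only ever see a complete index. A self-contained sketch of the same pattern on a local filesystem (the flusher itself goes through Hadoop's FileSystem.mkdirs and rename):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

// Sketch: write into a randomized temporary directory, then rename it into place so that
// readers only ever observe a complete index directory, never a partially written one.
public final class AtomicPublishSketch {
  public static void publish(Path parentDir, Path finalDir, byte[] indexBytes) throws IOException {
    Path tmpDir = parentDir.resolve("tmp_" + UUID.randomUUID());
    Files.createDirectories(tmpDir);
    try {
      Files.write(tmpDir.resolve("segment.data"), indexBytes);
      // The rename is the commit point: after it succeeds the index is visible under finalDir.
      Files.move(tmpDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      // On failure, best-effort cleanup of the temporary directory, mirroring the flusher.
      Files.deleteIfExists(tmpDir.resolve("segment.data"));
      Files.deleteIfExists(tmpDir);
      throw e;
    }
  }
}
```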
- boolean renameSucceeded = fileSystem.rename(tmpPath, flushPath); - if (!renameSucceeded) { - throw new IOException("Couldn't rename HDFS from " + tmpPath + " to " + flushPath); - } - LOG.info("Flushed index to {}", flushPath); - - cleanupOldIndexes(); - - FLUSH_SUCCESS_COUNTER.increment(); - - LOG.info("Executing post flush operation..."); - postFlushOperation.execute(); - - return true; - } - - private void cleanupOldIndexes() throws Exception { - LOG.info("Looking up whether we need to clean up old indexes..."); - SortedMap pathsByTime = - EarlybirdIndexFlusher.getIndexPathsByTime(indexPath, fileSystem); - - while (pathsByTime.size() > INDEX_COPIES) { - Long key = pathsByTime.firstKey(); - Path oldestHourPath = pathsByTime.remove(key); - LOG.info("Deleting old index at path '{}'.", oldestHourPath); - - if (fileSystem.delete(oldestHourPath, true)) { - LOG.info("Successfully deleted old index"); - } else { - LOG.error("Couldn't delete old index"); - } - } - } - - private Path pathForHour() { - Date date = new Date(clock.nowMillis()); - String time = INDEX_DATE_SUFFIX.format(date); - return indexPath.suffix("/" + INDEX_PREFIX + time); - } - - private void flushIndex(Path flushPath, EarlybirdIndex index) throws Exception { - int numOfNonOptimized = index.numOfNonOptimizedSegments(); - if (numOfNonOptimized > EarlybirdIndex.MAX_NUM_OF_NON_OPTIMIZED_SEGMENTS) { - LOG.error( - "Found {} non-optimized segments when flushing to disk!", numOfNonOptimized); - FLUSHING_TOO_MANY_NON_OPTIMIZED_SEGMENTS.assertFailed(); - } - - int numSegments = index.getSegmentInfoList().size(); - int flushingThreadPoolSize = numSegments; - - if (Config.environmentIsTest()) { - // SEARCH-33763: Limit the thread pool size for tests to avoid using too much memory on scoot. - flushingThreadPoolSize = 2; - } - - LOG.info("Flushing index using a thread pool size of {}", flushingThreadPoolSize); - - ParallelUtil.parmap("flush-index", flushingThreadPoolSize, si -> ActionLogger.call( - "Flushing segment " + si.getSegmentName(), - () -> flushSegment(flushPath, si)), index.getSegmentInfoList()); - - FlushInfo indexInfo = new FlushInfo(); - indexInfo.addLongProperty(UPDATE_KAFKA_OFFSET, index.getUpdateOffset()); - indexInfo.addLongProperty(TWEET_KAFKA_OFFSET, index.getTweetOffset()); - indexInfo.addIntProperty(FLUSHED_FROM_REPLICA, replicaId); - - FlushInfo segmentFlushInfos = indexInfo.newSubProperties(SEGMENTS); - for (SegmentInfo segmentInfo : index.getSegmentInfoList()) { - FlushInfo segmentFlushInfo = segmentFlushInfos.newSubProperties(segmentInfo.getSegmentName()); - segmentFlushInfo.addLongProperty(TIMESLICE_ID, segmentInfo.getTimeSliceID()); - } - - Path indexInfoPath = flushPath.suffix("/" + INDEX_INFO); - try (FSDataOutputStream infoOutputStream = fileSystem.create(indexInfoPath)) { - OutputStreamWriter infoFileWriter = new OutputStreamWriter(infoOutputStream); - FlushInfo.flushAsYaml(indexInfo, infoFileWriter); - } - } - - private BoxedUnit flushSegment(Path flushPath, SegmentInfo segmentInfo) throws Exception { - Path segmentPrefix = flushPath.suffix("/" + segmentInfo.getSegmentName()); - Path segmentPath = segmentPrefix.suffix(DATA_SUFFIX); - - FlushInfo flushInfo = new FlushInfo(); - - try (FSDataOutputStream outputStream = fileSystem.create(segmentPath)) { - DataSerializer out = new DataSerializer(segmentPath.toString(), outputStream); - segmentInfo.getIndexSegment().flush(flushInfo, out); - } - - Path infoPath = segmentPrefix.suffix(INFO_SUFFIX); - - try (FSDataOutputStream infoOutputStream = 
fileSystem.create(infoPath)) { - OutputStreamWriter infoFileWriter = new OutputStreamWriter(infoOutputStream); - FlushInfo.flushAsYaml(flushInfo, infoFileWriter); - } - return BoxedUnit.UNIT; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexLoader.java b/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexLoader.java deleted file mode 100644 index 1806bd106..000000000 --- a/src/java/com/twitter/search/earlybird/partition/EarlybirdIndexLoader.java +++ /dev/null @@ -1,224 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.BufferedInputStream; -import java.io.IOException; -import java.time.Duration; -import java.util.List; -import java.util.Optional; -import java.util.SortedMap; - -import com.google.common.base.Stopwatch; - -import org.apache.commons.compress.utils.Lists; -import org.apache.hadoop.fs.FSDataInputStream; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.partitioning.base.TimeSlice; -import com.twitter.search.common.util.io.flushable.DataDeserializer; -import com.twitter.search.common.util.io.flushable.FlushInfo; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.util.ActionLogger; -import com.twitter.search.earlybird.util.ParallelUtil; - -/** - * Loads an index from HDFS, if possible, or indexes all tweets from scratch using a - * FreshStartupHandler. - */ -public class EarlybirdIndexLoader { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdIndexLoader.class); - - public static final String ENV_FOR_TESTS = "test_env"; - - // To determine whether we should or should not load the most recent index from HDFS if available. - public static final long INDEX_FRESHNESS_THRESHOLD_MILLIS = Duration.ofDays(1).toMillis(); - - private static final NonPagingAssert LOADING_TOO_MANY_NON_OPTIMIZED_SEGMENTS = - new NonPagingAssert("loading_too_many_non_optimized_segments"); - - private final FileSystem fileSystem; - private final Path indexPath; - private final PartitionConfig partitionConfig; - private final EarlybirdSegmentFactory earlybirdSegmentFactory; - private final SegmentSyncConfig segmentSyncConfig; - private final Clock clock; - // Aurora environment we're running in: "prod", "loadtest", "staging2" etc. etc - private final String environment; - - public EarlybirdIndexLoader( - FileSystem fileSystem, - String indexHDFSPath, - String environment, - PartitionConfig partitionConfig, - EarlybirdSegmentFactory earlybirdSegmentFactory, - SegmentSyncConfig segmentSyncConfig, - Clock clock - ) { - this.fileSystem = fileSystem; - this.partitionConfig = partitionConfig; - this.earlybirdSegmentFactory = earlybirdSegmentFactory; - this.segmentSyncConfig = segmentSyncConfig; - this.indexPath = EarlybirdIndexFlusher.buildPathToIndexes(indexHDFSPath, partitionConfig); - this.clock = clock; - this.environment = environment; - } - - /** - * Tries to load an index from HDFS for this FlushVersion/Partition/Cluster. Returns an empty - * option if there is no index found. 
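loadFromHDFS, later in this class, only loads the newest flushed index when it is younger than INDEX_FRESHNESS_THRESHOLD_MILLIS; otherwise the server indexes from scratch. That decision, condensed into a sketch with a plain map and a caller-supplied clock value:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.SortedMap;

// Sketch: pick the newest flushed index only if it is fresh enough to be worth loading.
public final class FreshnessCheckSketch {
  private static final long INDEX_FRESHNESS_THRESHOLD_MILLIS = Duration.ofDays(1).toMillis();

  public static Optional<String> chooseIndex(
      SortedMap<Long, String> indexPathsByTime, long nowMillis) {
    if (indexPathsByTime.isEmpty()) {
      return Optional.empty();  // nothing flushed yet: index from scratch
    }
    long newestFlushMillis = indexPathsByTime.lastKey();
    if (nowMillis - newestFlushMillis > INDEX_FRESHNESS_THRESHOLD_MILLIS) {
      return Optional.empty();  // newest index is stale: cheaper to do a fresh startup
    }
    return Optional.of(indexPathsByTime.get(newestFlushMillis));
  }
}
```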
- */ - public Optional loadIndex() { - try { - Optional loadedIndex = - ActionLogger.call("Load index from HDFS.", this::loadFromHDFS); - - if (loadedIndex.isPresent()) { - EarlybirdIndex index = loadedIndex.get(); - int numOfNonOptimized = index.numOfNonOptimizedSegments(); - if (numOfNonOptimized > EarlybirdIndex.MAX_NUM_OF_NON_OPTIMIZED_SEGMENTS) { - // We should never have too many unoptimized segments. If this happens we likely have a - // bug somewhere that caused another Earlybird to flush too many unoptimized segments. - // Use NonPagingAssert to alert the oncall if this happens so they can look into it. - LOG.error("Found {} non-optimized segments when loading from disk!", numOfNonOptimized); - LOADING_TOO_MANY_NON_OPTIMIZED_SEGMENTS.assertFailed(); - - // If there are too many unoptimized segments, optimize the older ones until there are - // only MAX_NUM_OF_NON_OPTIMIZED_SEGMENTS left in the unoptimized state. The segment info - // list is always in order, so we will never try to optimize the most recent segments - // here. - int numSegmentsToOptimize = - numOfNonOptimized - EarlybirdIndex.MAX_NUM_OF_NON_OPTIMIZED_SEGMENTS; - LOG.info("Will try to optimize {} segments", numSegmentsToOptimize); - for (SegmentInfo segmentInfo : index.getSegmentInfoList()) { - if (numSegmentsToOptimize > 0 && !segmentInfo.isOptimized()) { - Stopwatch optimizationStopwatch = Stopwatch.createStarted(); - LOG.info("Starting to optimize segment: {}", segmentInfo.getSegmentName()); - segmentInfo.getIndexSegment().optimizeIndexes(); - numSegmentsToOptimize--; - LOG.info("Optimization of segment {} finished in {}.", - segmentInfo.getSegmentName(), optimizationStopwatch); - } - } - } - - int newNumOfNonOptimized = index.numOfNonOptimizedSegments(); - LOG.info("Loaded {} segments. {} are unoptimized.", - index.getSegmentInfoList().size(), - newNumOfNonOptimized); - - return loadedIndex; - } - } catch (Throwable e) { - LOG.error("Error loading index from HDFS, will index from scratch.", e); - } - - return Optional.empty(); - } - - private Optional loadFromHDFS() throws Exception { - SortedMap pathsByTime = - EarlybirdIndexFlusher.getIndexPathsByTime(indexPath, fileSystem); - - if (pathsByTime.isEmpty()) { - LOG.info("Could not load index from HDFS (path: {}), will index from scratch.", indexPath); - return Optional.empty(); - } - - long mostRecentIndexTimeMillis = pathsByTime.lastKey(); - Path mostRecentIndexPath = pathsByTime.get(mostRecentIndexTimeMillis); - - if (clock.nowMillis() - mostRecentIndexTimeMillis > INDEX_FRESHNESS_THRESHOLD_MILLIS) { - LOG.info("Most recent index in HDFS (path: {}) is old, will do a fresh startup.", - mostRecentIndexPath); - return Optional.empty(); - } - - EarlybirdIndex index = ActionLogger.call( - "loading index from " + mostRecentIndexPath, - () -> loadIndex(mostRecentIndexPath)); - - return Optional.of(index); - } - - private EarlybirdIndex loadIndex(Path flushPath) throws Exception { - Path indexInfoPath = flushPath.suffix("/" + EarlybirdIndexFlusher.INDEX_INFO); - - FlushInfo indexInfo; - try (FSDataInputStream infoInputStream = fileSystem.open(indexInfoPath)) { - indexInfo = FlushInfo.loadFromYaml(infoInputStream); - } - - FlushInfo segmentsFlushInfo = indexInfo.getSubProperties(EarlybirdIndexFlusher.SEGMENTS); - List segmentNames = Lists.newArrayList(segmentsFlushInfo.getKeyIterator()); - - // This should only happen if you're running in stagingN and loading a prod index through - // the read_index_from_prod_location flag. 
In this case, we point to a directory that has - // a lot more than the number of segments we want in staging and we trim this list to the - // desired number. - if (environment.matches("staging\\d")) { - if (segmentNames.size() > partitionConfig.getMaxEnabledLocalSegments()) { - LOG.info("Trimming list of loaded segments from size {} to size {}.", - segmentNames.size(), partitionConfig.getMaxEnabledLocalSegments()); - segmentNames = segmentNames.subList( - segmentNames.size() - partitionConfig.getMaxEnabledLocalSegments(), - segmentNames.size()); - } - } - - List segmentInfoList = ParallelUtil.parmap("load-index", name -> { - FlushInfo subProperties = segmentsFlushInfo.getSubProperties(name); - long timesliceID = subProperties.getLongProperty(EarlybirdIndexFlusher.TIMESLICE_ID); - return ActionLogger.call( - "loading segment " + name, - () -> loadSegment(flushPath, name, timesliceID)); - }, segmentNames); - - return new EarlybirdIndex( - segmentInfoList, - indexInfo.getLongProperty(EarlybirdIndexFlusher.TWEET_KAFKA_OFFSET), - indexInfo.getLongProperty(EarlybirdIndexFlusher.UPDATE_KAFKA_OFFSET)); - } - - private SegmentInfo loadSegment( - Path flushPath, - String segmentName, - long timesliceID - ) throws IOException { - Path segmentPrefix = flushPath.suffix("/" + segmentName); - Path segmentPath = segmentPrefix.suffix(EarlybirdIndexFlusher.DATA_SUFFIX); - - TimeSlice timeSlice = new TimeSlice( - timesliceID, - EarlybirdConfig.getMaxSegmentSize(), - partitionConfig.getIndexingHashPartitionID(), - partitionConfig.getNumPartitions()); - - SegmentInfo segmentInfo = new SegmentInfo( - timeSlice.getSegment(), - earlybirdSegmentFactory, - segmentSyncConfig); - - Path infoPath = segmentPrefix.suffix(EarlybirdIndexFlusher.INFO_SUFFIX); - FlushInfo flushInfo; - try (FSDataInputStream infoInputStream = fileSystem.open(infoPath)) { - flushInfo = FlushInfo.loadFromYaml(infoInputStream); - } - - FSDataInputStream inputStream = fileSystem.open(segmentPath); - - // It's significantly slower to read from the FSDataInputStream on demand, so we - // use a buffered reader to pre-read bigger chunks. 
- int bufferSize = 1 << 22; // 4MB - BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, bufferSize); - - DataDeserializer in = new DataDeserializer(bufferedInputStream, segmentName); - segmentInfo.getIndexSegment().load(in, flushInfo); - - return segmentInfo; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/EarlybirdKafkaConsumer.java b/src/java/com/twitter/search/earlybird/partition/EarlybirdKafkaConsumer.java deleted file mode 100644 index def6b8939..000000000 --- a/src/java/com/twitter/search/earlybird/partition/EarlybirdKafkaConsumer.java +++ /dev/null @@ -1,281 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.Closeable; -import java.time.Duration; -import java.util.Map; -import java.util.concurrent.atomic.AtomicBoolean; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.collect.ImmutableList; - -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.common.TopicPartition; -import org.apache.kafka.common.errors.ApiException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.util.LogFormatUtil; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.CaughtUpMonitor; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.WrappedKafkaApiException; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; - -/** - * Reads TVEs from Kafka and writes them to a PartitionWriter. 
- */ -public class EarlybirdKafkaConsumer implements Closeable { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdKafkaConsumer.class); - - private static final Duration POLL_TIMEOUT = Duration.ofSeconds(1); - private static final String STATS_PREFIX = "earlybird_kafka_consumer_"; - - // See SEARCH-31827 - private static final SearchCounter INGESTING_DONE = - SearchCounter.export(STATS_PREFIX + "ingesting_done"); - private static final SearchRateCounter POLL_LOOP_EXCEPTIONS = - SearchRateCounter.export(STATS_PREFIX + "poll_loop_exceptions"); - private static final SearchRateCounter FLUSHING_EXCEPTIONS = - SearchRateCounter.export(STATS_PREFIX + "flushing_exceptions"); - - private static final SearchTimerStats TIMED_POLLS = - SearchTimerStats.export(STATS_PREFIX + "timed_polls"); - private static final SearchTimerStats TIMED_INDEX_EVENTS = - SearchTimerStats.export(STATS_PREFIX + "timed_index_events"); - - private final AtomicBoolean running = new AtomicBoolean(true); - private final BalancingKafkaConsumer balancingKafkaConsumer; - private final PartitionWriter partitionWriter; - protected final TopicPartition tweetTopic; - protected final TopicPartition updateTopic; - private final KafkaConsumer underlyingKafkaConsumer; - private final CriticalExceptionHandler criticalExceptionHandler; - private final EarlybirdIndexFlusher earlybirdIndexFlusher; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private boolean finishedIngestUntilCurrent; - private final CaughtUpMonitor indexCaughtUpMonitor; - - protected class ConsumeBatchResult { - private boolean isCaughtUp; - private long readRecordsCount; - - public ConsumeBatchResult(boolean isCaughtUp, long readRecordsCount) { - this.isCaughtUp = isCaughtUp; - this.readRecordsCount = readRecordsCount; - } - - public boolean isCaughtUp() { - return isCaughtUp; - } - - public long getReadRecordsCount() { - return readRecordsCount; - } - } - - public EarlybirdKafkaConsumer( - KafkaConsumer underlyingKafkaConsumer, - SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler, - PartitionWriter partitionWriter, - TopicPartition tweetTopic, - TopicPartition updateTopic, - EarlybirdIndexFlusher earlybirdIndexFlusher, - CaughtUpMonitor kafkaIndexCaughtUpMonitor - ) { - this.partitionWriter = partitionWriter; - this.underlyingKafkaConsumer = underlyingKafkaConsumer; - this.criticalExceptionHandler = criticalExceptionHandler; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.tweetTopic = tweetTopic; - this.updateTopic = updateTopic; - this.earlybirdIndexFlusher = earlybirdIndexFlusher; - - LOG.info("Reading from Kafka topics: tweetTopic={}, updateTopic={}", tweetTopic, updateTopic); - underlyingKafkaConsumer.assign(ImmutableList.of(updateTopic, tweetTopic)); - - this.balancingKafkaConsumer = - new BalancingKafkaConsumer(underlyingKafkaConsumer, tweetTopic, updateTopic); - this.finishedIngestUntilCurrent = false; - this.indexCaughtUpMonitor = kafkaIndexCaughtUpMonitor; - } - - /** - * Run the consumer, indexing from Kafka. - */ - @VisibleForTesting - public void run() { - while (isRunning()) { - ConsumeBatchResult result = consumeBatch(true); - indexCaughtUpMonitor.setAndNotify(result.isCaughtUp()); - } - } - - /** - * Reads from Kafka, starting at the given offsets, and applies the events until we are caught up - * with the current streams. 
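ingestUntilCurrent, shown below, seeks both topics to the offsets recorded with the flushed index and replays them until the consumer reaches the end offsets observed at startup. A stripped-down version of that catch-up loop, assuming String keys and byte[] values and omitting the PartitionWriter handoff:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch: replay two partitions from known offsets until the consumer's positions reach
// the end offsets observed at startup. Indexing of the polled records is elided.
public final class CatchUpSketch {
  public static void catchUp(KafkaConsumer<String, byte[]> consumer,
                             TopicPartition tweetTopic, long tweetOffset,
                             TopicPartition updateTopic, long updateOffset) {
    consumer.assign(Arrays.asList(tweetTopic, updateTopic));
    consumer.seek(tweetTopic, tweetOffset);
    consumer.seek(updateTopic, updateOffset);

    Map<TopicPartition, Long> endOffsets =
        consumer.endOffsets(Arrays.asList(tweetTopic, updateTopic));

    boolean caughtUp = false;
    while (!caughtUp) {
      ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
      // In the real consumer, records are handed to the partition writer here.
      caughtUp = consumer.position(tweetTopic) >= endOffsets.get(tweetTopic)
          && consumer.position(updateTopic) >= endOffsets.get(updateTopic);
    }
  }
}
```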
- */ - public void ingestUntilCurrent(long tweetOffset, long updateOffset) { - Preconditions.checkState(!finishedIngestUntilCurrent); - Stopwatch stopwatch = Stopwatch.createStarted(); - LOG.info("Ingest until current: seeking to Kafka offset {} for tweets and {} for updates.", - tweetOffset, updateOffset); - - try { - underlyingKafkaConsumer.seek(tweetTopic, tweetOffset); - underlyingKafkaConsumer.seek(updateTopic, updateOffset); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException("Can't seek to tweet and update offsets", - kafkaApiException); - } - - Map endOffsets; - try { - endOffsets = underlyingKafkaConsumer.endOffsets(ImmutableList.of(tweetTopic, updateTopic)); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException("Can't find end offsets", - kafkaApiException); - } - - if (endOffsets.size() > 0) { - LOG.info(String.format("Records until current: tweets=%,d, updates=%,d", - endOffsets.get(tweetTopic) - tweetOffset + 1, - endOffsets.get(updateTopic) - updateOffset + 1)); - } - - consumeBatchesUntilCurrent(true); - - LOG.info("ingestUntilCurrent finished in {}.", stopwatch); - - partitionWriter.logState(); - INGESTING_DONE.increment(); - finishedIngestUntilCurrent = true; - } - - /** - * Consume tweets and updates from streams until we're up to date. - * - * @return total number of read records. - */ - private long consumeBatchesUntilCurrent(boolean flushingEnabled) { - long totalRecordsRead = 0; - long batchesConsumed = 0; - - while (isRunning()) { - ConsumeBatchResult result = consumeBatch(flushingEnabled); - batchesConsumed++; - totalRecordsRead += result.getReadRecordsCount(); - if (isCurrent(result.isCaughtUp())) { - break; - } - } - - LOG.info("Processed batches: {}", batchesConsumed); - - return totalRecordsRead; - } - - // This method is overriden in MockEarlybirdKafkaConsumer. - public boolean isCurrent(boolean current) { - return current; - } - - /** - * We don't index during flushing, so after the flush is done, the index is stale. - * We need to get to current, before we rejoin the serverset so that upon rejoining we're - * not serving a stale index. - */ - @VisibleForTesting - void getToCurrentPostFlush() { - LOG.info("Getting to current post flush"); - Stopwatch stopwatch = Stopwatch.createStarted(); - - long totalRecordsRead = consumeBatchesUntilCurrent(false); - - LOG.info("Post flush, became current in: {}, after reading {} records.", - stopwatch, LogFormatUtil.formatInt(totalRecordsRead)); - } - - /* - * @return true if we are current after indexing this batch. - */ - @VisibleForTesting - protected ConsumeBatchResult consumeBatch(boolean flushingEnabled) { - long readRecordsCount = 0; - boolean isCaughtUp = false; - - try { - // Poll. - SearchTimer pollTimer = TIMED_POLLS.startNewTimer(); - ConsumerRecords records = - balancingKafkaConsumer.poll(POLL_TIMEOUT); - readRecordsCount += records.count(); - TIMED_POLLS.stopTimerAndIncrement(pollTimer); - - // Index. - SearchTimer indexTimer = TIMED_INDEX_EVENTS.startNewTimer(); - isCaughtUp = partitionWriter.indexBatch(records); - TIMED_INDEX_EVENTS.stopTimerAndIncrement(indexTimer); - } catch (Exception ex) { - POLL_LOOP_EXCEPTIONS.increment(); - LOG.error("Exception in poll loop", ex); - } - - try { - // Possibly flush the index. 
- if (isCaughtUp && flushingEnabled) { - long tweetOffset = 0; - long updateOffset = 0; - - try { - tweetOffset = underlyingKafkaConsumer.position(tweetTopic); - updateOffset = underlyingKafkaConsumer.position(updateTopic); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException("can't get topic positions", kafkaApiException); - } - - EarlybirdIndexFlusher.FlushAttemptResult flushAttemptResult = - earlybirdIndexFlusher.flushIfNecessary( - tweetOffset, updateOffset, this::getToCurrentPostFlush); - - if (flushAttemptResult == EarlybirdIndexFlusher.FlushAttemptResult.FLUSH_ATTEMPT_MADE) { - // Viz might show this as a fairly high number, so we're printing it here to confirm - // the value on the server. - LOG.info("Finished flushing. Index freshness in ms: {}", - LogFormatUtil.formatInt(searchIndexingMetricSet.getIndexFreshnessInMillis())); - } - - if (!finishedIngestUntilCurrent) { - LOG.info("Became current on startup. Tried to flush with result: {}", - flushAttemptResult); - } - } - } catch (Exception ex) { - FLUSHING_EXCEPTIONS.increment(); - LOG.error("Exception while flushing", ex); - } - - return new ConsumeBatchResult(isCaughtUp, readRecordsCount); - } - - public boolean isRunning() { - return running.get() && EarlybirdStatus.getStatusCode() != EarlybirdStatusCode.STOPPING; - } - - public void prepareAfterStartingWithIndex(long maxIndexedTweetId) { - partitionWriter.prepareAfterStartingWithIndex(maxIndexedTweetId); - } - - public void close() { - balancingKafkaConsumer.close(); - running.set(false); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/EarlybirdStartup.java b/src/java/com/twitter/search/earlybird/partition/EarlybirdStartup.java deleted file mode 100644 index e0a2d125d..000000000 --- a/src/java/com/twitter/search/earlybird/partition/EarlybirdStartup.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.Closeable; - -import com.twitter.search.earlybird.exception.EarlybirdStartupException; - -/** - * Handles starting and indexing data for an Earlybird. - */ -@FunctionalInterface -public interface EarlybirdStartup { - /** - * Handles indexing Tweets, Tweet Updates and user updates. Blocks until current, and forks a - * thread to keep the index current. - */ - Closeable start() throws EarlybirdStartupException; -} diff --git a/src/java/com/twitter/search/earlybird/partition/FlowControlException.java b/src/java/com/twitter/search/earlybird/partition/FlowControlException.java deleted file mode 100644 index f7a6bced5..000000000 --- a/src/java/com/twitter/search/earlybird/partition/FlowControlException.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.earlybird.partition; - -/** - * Exception used to cause a ScheduledExecutorService to stop executing. Used when the - * success condition of the class has been achieved. 
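FlowControlException, defined just below, leans on the ScheduledExecutorService contract that a periodic task whose execution throws is not rescheduled. A small usage sketch with an invented success condition:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a periodic task that throws once its goal is reached, which suppresses all
// further scheduled executions of that task.
public final class FlowControlSketch {
  public static void main(String[] args) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger attempts = new AtomicInteger();

    executor.scheduleAtFixedRate(() -> {
      if (attempts.incrementAndGet() >= 3) {  // stand-in for the real success condition
        // Throwing from the task stops the executor from rescheduling it.
        throw new RuntimeException("success condition reached, stop rescheduling");
      }
    }, 0, 100, TimeUnit.MILLISECONDS);

    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    executor.shutdown();
  }
}
```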
- */ -public class FlowControlException extends RuntimeException { - - public FlowControlException() { - super(); - } - - public FlowControlException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/HdfsUtil.java b/src/java/com/twitter/search/earlybird/partition/HdfsUtil.java deleted file mode 100644 index 5c393df50..000000000 --- a/src/java/com/twitter/search/earlybird/partition/HdfsUtil.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; - -import org.apache.hadoop.conf.Configuration; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; - -public final class HdfsUtil { - private HdfsUtil() { - } - - public static FileSystem getHdfsFileSystem() throws IOException { - Configuration config = new Configuration(); - // Since earlybird uses hdfs from different threads, and closes the FileSystem from - // them independently, we want each thread to have its own, new FileSystem. - return FileSystem.newInstance(config); - } - - /** - * Checks if the given segment is present on HDFS - */ - public static boolean segmentExistsOnHdfs(FileSystem fs, SegmentInfo segmentInfo) - throws IOException { - String hdfsBaseDirPrefix = segmentInfo.getSyncInfo().getHdfsUploadDirPrefix(); - FileStatus[] statuses = fs.globStatus(new Path(hdfsBaseDirPrefix)); - return statuses != null && statuses.length > 0; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/ISegmentWriter.java b/src/java/com/twitter/search/earlybird/partition/ISegmentWriter.java deleted file mode 100644 index c4b32ea25..000000000 --- a/src/java/com/twitter/search/earlybird/partition/ISegmentWriter.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; - -public interface ISegmentWriter { - enum Result { - SUCCESS, - FAILURE_RETRYABLE, - FAILURE_NOT_RETRYABLE, - } - - /** - * Indexes the given ThriftVersionedEvents instance (adds it to the segment associated with this - * SegmentWriter instance). - */ - Result indexThriftVersionedEvents(ThriftVersionedEvents tve) throws IOException; - - /** - * Returns the segment info for this segment writer. - */ - SegmentInfo getSegmentInfo(); -} diff --git a/src/java/com/twitter/search/earlybird/partition/IndexingResultCounts.java b/src/java/com/twitter/search/earlybird/partition/IndexingResultCounts.java deleted file mode 100644 index 16235722d..000000000 --- a/src/java/com/twitter/search/earlybird/partition/IndexingResultCounts.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.earlybird.partition; - -/** - * Helper class used to store counts to be logged. - */ -public class IndexingResultCounts { - private int indexingCalls; - private int failureRetriable; - private int failureNotRetriable; - private int indexingSuccess; - - public IndexingResultCounts() { - } - - /** - * Updates the internal counts with a single result. 
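ISegmentWriter above reports one of three Result values per event, and IndexingResultCounts tallies them. One plausible caller is sketched below; the single retry on FAILURE_RETRYABLE is an assumption for illustration, not the partition writer's actual policy:

```java
import java.io.IOException;

import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents;
import com.twitter.search.earlybird.partition.ISegmentWriter;
import com.twitter.search.earlybird.partition.IndexingResultCounts;

// Sketch: one way a caller might drive an ISegmentWriter and aggregate its results.
// The retry-once behavior is an assumption made for this example.
public final class SegmentWriteSketch {
  public static ISegmentWriter.Result indexWithOneRetry(
      ISegmentWriter writer,
      IndexingResultCounts counts,
      ThriftVersionedEvents tve) throws IOException {
    ISegmentWriter.Result result = writer.indexThriftVersionedEvents(tve);
    if (result == ISegmentWriter.Result.FAILURE_RETRYABLE) {
      result = writer.indexThriftVersionedEvents(tve);  // retry once on a retryable failure
    }
    counts.countResult(result);
    return result;
  }
}
```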
- */ - public void countResult(ISegmentWriter.Result result) { - indexingCalls++; - if (result == ISegmentWriter.Result.FAILURE_NOT_RETRYABLE) { - failureNotRetriable++; - } else if (result == ISegmentWriter.Result.FAILURE_RETRYABLE) { - failureRetriable++; - } else if (result == ISegmentWriter.Result.SUCCESS) { - indexingSuccess++; - } - } - - int getIndexingCalls() { - return indexingCalls; - } - - int getFailureRetriable() { - return failureRetriable; - } - - int getFailureNotRetriable() { - return failureNotRetriable; - } - - int getIndexingSuccess() { - return indexingSuccess; - } - - @Override - public String toString() { - return String.format("[calls: %,d, success: %,d, fail not-retryable: %,d, fail retryable: %,d]", - indexingCalls, indexingSuccess, failureNotRetriable, failureRetriable); - } -} - diff --git a/src/java/com/twitter/search/earlybird/partition/InstrumentedQueue.java b/src/java/com/twitter/search/earlybird/partition/InstrumentedQueue.java deleted file mode 100644 index 2f72a2c75..000000000 --- a/src/java/com/twitter/search/earlybird/partition/InstrumentedQueue.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.concurrent.ConcurrentLinkedDeque; -import java.util.concurrent.atomic.AtomicLong; - -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; - -/** - * A queue with metrics on size, enqueue rate and dequeue rate. - */ -public class InstrumentedQueue { - private final SearchRateCounter enqueueRate; - private final SearchRateCounter dequeueRate; - private final AtomicLong queueSize = new AtomicLong(); - - private final ConcurrentLinkedDeque queue; - - public InstrumentedQueue(String statsPrefix) { - SearchLongGauge.export(statsPrefix + "_size", queueSize); - enqueueRate = SearchRateCounter.export(statsPrefix + "_enqueue"); - dequeueRate = SearchRateCounter.export(statsPrefix + "_dequeue"); - - queue = new ConcurrentLinkedDeque<>(); - } - - /** - * Adds a new element to the queue. - */ - public void add(T tve) { - queue.add(tve); - enqueueRate.increment(); - queueSize.incrementAndGet(); - } - - /** - * Returns the first element in the queue. If the queue is empty, {@code null} is returned. 
- */ - public T poll() { - T tve = queue.poll(); - if (tve != null) { - dequeueRate.increment(); - queueSize.decrementAndGet(); - } - return tve; - } - - public long getQueueSize() { - return queueSize.get(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/KafkaStartup.java b/src/java/com/twitter/search/earlybird/partition/KafkaStartup.java deleted file mode 100644 index e413125d7..000000000 --- a/src/java/com/twitter/search/earlybird/partition/KafkaStartup.java +++ /dev/null @@ -1,328 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.Closeable; -import java.util.Optional; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Stopwatch; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.config.Config; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.partition.freshstartup.FreshStartupHandler; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * Handles starting an Earlybird from Kafka topics. - * - * Currently very unoptimized -- future versions will implement parallel indexing and loading - * serialized data from HDFS. See http://go/removing-dl-tdd. - */ -public class KafkaStartup implements EarlybirdStartup { - private static final Logger LOG = LoggerFactory.getLogger(KafkaStartup.class); - - private final EarlybirdKafkaConsumer earlybirdKafkaConsumer; - private final StartupUserEventIndexer startupUserEventIndexer; - private final QueryCacheManager queryCacheManager; - private final SegmentManager segmentManager; - private final EarlybirdIndexLoader earlybirdIndexLoader; - private final FreshStartupHandler freshStartupHandler; - private final UserUpdatesStreamIndexer userUpdatesStreamIndexer; - private final UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final SearchLongGauge loadedIndex; - private final SearchLongGauge freshStartup; - private final MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - private final AudioSpaceEventsStreamIndexer audioSpaceEventsStreamIndexer; - private final CriticalExceptionHandler earlybirdExceptionHandler; - private final SearchDecider decider; - - private static final String FRESH_STARTUP = "fresh startup"; - private static final String INGEST_UNTIL_CURRENT = "ingest until current"; - private static final String LOAD_FLUSHED_INDEX = "load flushed index"; - private static final String SETUP_QUERY_CACHE = "setting up query cache"; - private static final String USER_UPDATES_STARTUP = "user updates startup"; - private static final String AUDIO_SPACES_STARTUP = "audio spaces startup"; - private static final String BUILD_MULTI_SEGMENT_TERM_DICTIONARY = - "build multi segment term dictionary"; - - public KafkaStartup( - SegmentManager segmentManager, - EarlybirdKafkaConsumer earlybirdKafkaConsumer, - StartupUserEventIndexer startupUserEventIndexer, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - 
UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - AudioSpaceEventsStreamIndexer audioSpaceEventsStreamIndexer, - QueryCacheManager queryCacheManager, - EarlybirdIndexLoader earlybirdIndexLoader, - FreshStartupHandler freshStartupHandler, - SearchIndexingMetricSet searchIndexingMetricSet, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - CriticalExceptionHandler earlybirdExceptionHandler, - SearchDecider decider - ) { - this.segmentManager = segmentManager; - this.earlybirdKafkaConsumer = earlybirdKafkaConsumer; - this.startupUserEventIndexer = startupUserEventIndexer; - this.queryCacheManager = queryCacheManager; - this.earlybirdIndexLoader = earlybirdIndexLoader; - this.freshStartupHandler = freshStartupHandler; - this.userUpdatesStreamIndexer = userUpdatesStreamIndexer; - this.userScrubGeoEventStreamIndexer = userScrubGeoEventStreamIndexer; - this.audioSpaceEventsStreamIndexer = audioSpaceEventsStreamIndexer; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.loadedIndex = SearchLongGauge.export("kafka_startup_loaded_index"); - this.freshStartup = SearchLongGauge.export("fresh_startup"); - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.earlybirdExceptionHandler = earlybirdExceptionHandler; - this.decider = decider; - freshStartup.set(0); - } - - private void userEventsStartup() { - LOG.info("Start indexing user events."); - - startupUserEventIndexer.indexAllEvents(); - - LOG.info("Finished loading/indexing user events."); - - // User updates are now current, keep them current by continuing to index from the stream. - LOG.info("Starting to run UserUpdatesStreamIndexer"); - new Thread(userUpdatesStreamIndexer::run, "userupdates-stream-indexer").start(); - - if (EarlybirdConfig.consumeUserScrubGeoEvents()) { - // User scrub geo events are now current, - // keep them current by continuing to index from the stream. - LOG.info("Starting to run UserScrubGeoEventsStreamIndexer"); - new Thread(userScrubGeoEventStreamIndexer::run, - "userScrubGeoEvents-stream-indexer").start(); - } - } - - private void loadAudioSpaceEvents() { - LOG.info("Index audio space events..."); - EarlybirdStatus.beginEvent(AUDIO_SPACES_STARTUP, - searchIndexingMetricSet.startupInAudioSpaceEventIndexer); - - if (audioSpaceEventsStreamIndexer == null) { - LOG.error("Null audioSpaceEventsStreamIndexer"); - return; - } - - if (decider.isAvailable("enable_reading_audio_space_events")) { - Stopwatch stopwatch = Stopwatch.createStarted(); - audioSpaceEventsStreamIndexer.seekToBeginning(); - audioSpaceEventsStreamIndexer.readRecordsUntilCurrent(); - LOG.info("Finished reading audio spaces in {}", stopwatch); - audioSpaceEventsStreamIndexer.printSummary(); - - new Thread(audioSpaceEventsStreamIndexer::run, - "audioSpaceEvents-stream-indexer").start(); - } else { - LOG.info("Reading audio space events not enabled"); - } - - EarlybirdStatus.endEvent(AUDIO_SPACES_STARTUP, - searchIndexingMetricSet.startupInAudioSpaceEventIndexer); - } - - private void tweetsAndUpdatesStartup() throws EarlybirdStartupException { - LOG.info("Index tweets and updates..."); - EarlybirdStatus.beginEvent(LOAD_FLUSHED_INDEX, - searchIndexingMetricSet.startupInLoadFlushedIndex); - EarlybirdIndex index; - - // Set when you want to get a server from starting to ready quickly for development - // purposes. 
- boolean fastDevStartup = EarlybirdConfig.getBool("fast_dev_startup"); - - Optional optIndex = Optional.empty(); - if (!fastDevStartup) { - optIndex = earlybirdIndexLoader.loadIndex(); - } - - if (optIndex.isPresent()) { - loadedIndex.set(1); - LOG.info("Loaded an index."); - index = optIndex.get(); - EarlybirdStatus.endEvent(LOAD_FLUSHED_INDEX, - searchIndexingMetricSet.startupInLoadFlushedIndex); - } else { - LOG.info("Didn't load an index, indexing from scratch."); - freshStartup.set(1); - boolean parallelIndexFromScratch = EarlybirdConfig.getBool( - "parallel_index_from_scratch"); - LOG.info("parallel_index_from_scratch: {}", parallelIndexFromScratch); - EarlybirdStatus.beginEvent(FRESH_STARTUP, - searchIndexingMetricSet.startupInFreshStartup); - try { - if (fastDevStartup) { - index = freshStartupHandler.fastIndexFromScratchForDevelopment(); - } else if (parallelIndexFromScratch) { - index = freshStartupHandler.parallelIndexFromScratch(); - } else { - index = freshStartupHandler.indexFromScratch(); - } - } catch (Exception ex) { - throw new EarlybirdStartupException(ex); - } finally { - EarlybirdStatus.endEvent(FRESH_STARTUP, - searchIndexingMetricSet.startupInFreshStartup); - } - } - - LOG.info("Index has {} segments.", index.getSegmentInfoList().size()); - if (index.getSegmentInfoList().size() > 0) { - LOG.info("Inserting segments into SegmentManager"); - for (SegmentInfo segmentInfo : index.getSegmentInfoList()) { - segmentManager.putSegmentInfo(segmentInfo); - } - - earlybirdKafkaConsumer.prepareAfterStartingWithIndex( - index.getMaxIndexedTweetId() - ); - } - - // Build the Multi segment term dictionary before catching up on indexing to ensure that the - // segments won't roll and delete the oldest segment while a multi segment term dictionary that - // includes that segment is being built. - buildMultiSegmentTermDictionary(); - - segmentManager.logState("Starting ingestUntilCurrent"); - LOG.info("partial updates indexed: {}", segmentManager.getNumPartialUpdates()); - EarlybirdStatus.beginEvent(INGEST_UNTIL_CURRENT, - searchIndexingMetricSet.startupInIngestUntilCurrent); - - earlybirdKafkaConsumer.ingestUntilCurrent(index.getTweetOffset(), index.getUpdateOffset()); - - validateSegments(); - segmentManager.logState("ingestUntilCurrent is done"); - LOG.info("partial updates indexed: {}", segmentManager.getNumPartialUpdates()); - EarlybirdStatus.endEvent(INGEST_UNTIL_CURRENT, - searchIndexingMetricSet.startupInIngestUntilCurrent); - new Thread(earlybirdKafkaConsumer::run, "earlybird-kafka-consumer").start(); - } - - protected void validateSegments() throws EarlybirdStartupException { - if (!Config.environmentIsTest()) { - // Unfortunately, many tests start Earlybirds with 0 indexed documents, so we disable this - // check in tests. - validateSegmentsForNonTest(); - } - } - - protected void validateSegmentsForNonTest() throws EarlybirdStartupException { - // SEARCH-24123: Prevent Earlybird from starting if there are no indexed documents. 
- if (segmentManager.getNumIndexedDocuments() == 0) { - throw new EarlybirdStartupException("Earlybird has zero indexed documents."); - } - } - - private void queryCacheStartup() throws EarlybirdStartupException { - EarlybirdStatus.beginEvent(SETUP_QUERY_CACHE, - searchIndexingMetricSet.startupInQueryCacheUpdates); - try { - queryCacheManager.setupTasksIfNeeded(segmentManager); - } catch (QueryParserException e) { - LOG.error("Exception when setting up query cache tasks"); - throw new EarlybirdStartupException(e); - } - - queryCacheManager.waitUntilAllQueryCachesAreBuilt(); - - // Print the sizes of the query caches so that we can see that they're built. - Iterable segmentInfos = - segmentManager.getSegmentInfos(SegmentManager.Filter.All, SegmentManager.Order.OLD_TO_NEW); - segmentManager.logState("After building query caches"); - for (SegmentInfo segmentInfo : segmentInfos) { - LOG.info("Segment: {}, Total cardinality: {}", segmentInfo.getSegmentName(), - segmentInfo.getIndexSegment().getQueryCachesCardinality()); - } - - // We're done building the query caches for all segments, and the earlybird is ready to become - // current. Restrict all future query cache task runs to one single core, to make sure our - // searcher threads are not impacted. - queryCacheManager.setWorkerPoolSizeAfterStartup(); - EarlybirdStatus.endEvent(SETUP_QUERY_CACHE, - searchIndexingMetricSet.startupInQueryCacheUpdates); - } - - /** - * Closes all currently running Indexers. - */ - @VisibleForTesting - public void shutdownIndexing() { - LOG.info("Shutting down KafkaStartup."); - - earlybirdKafkaConsumer.close(); - userUpdatesStreamIndexer.close(); - userScrubGeoEventStreamIndexer.close(); - // Note that the QueryCacheManager is shut down in EarlybirdServer::shutdown. - } - - private void buildMultiSegmentTermDictionary() { - EarlybirdStatus.beginEvent(BUILD_MULTI_SEGMENT_TERM_DICTIONARY, - searchIndexingMetricSet.startupInMultiSegmentTermDictionaryUpdates); - Stopwatch stopwatch = Stopwatch.createStarted(); - LOG.info("Building multi segment term dictionary"); - multiSegmentTermDictionaryManager.buildDictionary(); - LOG.info("Done with building multi segment term dictionary in {}", stopwatch); - EarlybirdStatus.endEvent(BUILD_MULTI_SEGMENT_TERM_DICTIONARY, - searchIndexingMetricSet.startupInMultiSegmentTermDictionaryUpdates); - } - - private void parallelIndexingStartup() throws EarlybirdStartupException { - Thread userEventsThread = new Thread(this::userEventsStartup, "index-user-events-startup"); - Thread tweetsAndUpdatesThread = new Thread(() -> { - try { - tweetsAndUpdatesStartup(); - } catch (EarlybirdStartupException e) { - earlybirdExceptionHandler.handle(this, e); - } - }, "index-tweets-and-updates-startup"); - Thread audioSpaceEventsThread = new Thread(this::loadAudioSpaceEvents, - "index-audio-space-events-startup"); - userEventsThread.start(); - tweetsAndUpdatesThread.start(); - audioSpaceEventsThread.start(); - - try { - userEventsThread.join(); - } catch (InterruptedException e) { - throw new EarlybirdStartupException("Interrupted while indexing user events"); - } - try { - tweetsAndUpdatesThread.join(); - } catch (InterruptedException e) { - throw new EarlybirdStartupException("Interrupted while indexing tweets and updates"); - } - try { - audioSpaceEventsThread.join(); - } catch (InterruptedException e) { - throw new EarlybirdStartupException("Interrupted while indexing audio space events"); - } - } - - /** - * Does startups and starts indexing. Returns when the earlybird - * is current. 
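start(), shown below, implements the EarlybirdStartup contract from earlier in this diff: block until the index is current, then hand back a Closeable that stops indexing. A hypothetical caller might wire it up like this:

```java
import java.io.Closeable;
import java.io.IOException;

import com.twitter.search.earlybird.exception.EarlybirdStartupException;
import com.twitter.search.earlybird.partition.EarlybirdStartup;

// Sketch: how a server might drive an EarlybirdStartup implementation such as KafkaStartup.
// The surrounding server lifecycle is hypothetical.
public final class StartupDriverSketch {
  private Closeable indexingHandle;

  public void startServing(EarlybirdStartup startup) throws EarlybirdStartupException {
    // Blocks until the index is caught up; indexing continues on background threads.
    indexingHandle = startup.start();
  }

  public void shutdown() throws IOException {
    if (indexingHandle != null) {
      indexingHandle.close();  // for KafkaStartup this shuts down the stream indexers
    }
  }
}
```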
- */ - @Override - public Closeable start() throws EarlybirdStartupException { - parallelIndexingStartup(); - queryCacheStartup(); - - EarlybirdStatus.setStatus(EarlybirdStatusCode.CURRENT); - - return this::shutdownIndexing; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/MultiSegmentTermDictionaryManager.java b/src/java/com/twitter/search/earlybird/partition/MultiSegmentTermDictionaryManager.java deleted file mode 100644 index a1abba74b..000000000 --- a/src/java/com/twitter/search/earlybird/partition/MultiSegmentTermDictionaryManager.java +++ /dev/null @@ -1,314 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.Collections; -import java.util.List; -import java.util.Map; -import java.util.concurrent.TimeUnit; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; -import com.twitter.search.core.earlybird.index.inverted.MultiSegmentTermDictionary; -import com.twitter.search.core.earlybird.index.inverted.MultiSegmentTermDictionaryWithFastutil; -import com.twitter.search.core.earlybird.index.inverted.OptimizedMemoryIndex; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.index.EarlybirdSegment; -import com.twitter.search.earlybird.partition.SegmentManager.Filter; -import com.twitter.search.earlybird.partition.SegmentManager.Order; - -/** - * Manages MultiSegmentTermDictionary's for specific fields on this earlybird. Only manages them - * for optimized segments, and should only regenerate new dictionaries when the list of optimized - * segments changes. 
See SEARCH-10836 - */ -public class MultiSegmentTermDictionaryManager { - private static final Logger LOG = - LoggerFactory.getLogger(MultiSegmentTermDictionaryManager.class); - - @VisibleForTesting - public static final SearchTimerStats TERM_DICTIONARY_CREATION_STATS = - SearchTimerStats.export("multi_segment_term_dictionary_manager_build_dictionary", - TimeUnit.MILLISECONDS, false); - - public static final MultiSegmentTermDictionaryManager NOOP_INSTANCE = - new MultiSegmentTermDictionaryManager( - new Config(Collections.emptyList()), null, null, null, null) { - @Override - public boolean buildDictionary() { - return false; - } - }; - - private static final String MANAGER_DISABLED_DECIDER_KEY_PREFIX = - "multi_segment_term_dictionary_manager_disabled_in_"; - - public static class Config { - private final ImmutableList fieldNames; - - public Config(List fieldNames) { - Preconditions.checkNotNull(fieldNames); - this.fieldNames = ImmutableList.copyOf(fieldNames); - } - - public List managedFieldNames() { - return fieldNames; - } - - public boolean isEnabled() { - return EarlybirdConfig.getBool("multi_segment_term_dictionary_enabled", false); - } - } - - @VisibleForTesting - public static String getManagerDisabledDeciderName(EarlybirdCluster earlybirdCluster) { - return MANAGER_DISABLED_DECIDER_KEY_PREFIX + earlybirdCluster.name().toLowerCase(); - } - - private static final class FieldStats { - private final SearchTimerStats buildTime; - private final SearchLongGauge numTerms; - private final SearchLongGauge numTermEntries; - - private FieldStats(SearchStatsReceiver statsReceiver, String fieldName) { - Preconditions.checkNotNull(fieldName); - Preconditions.checkNotNull(statsReceiver); - - String timerName = String.format( - "multi_segment_term_dictionary_manager_field_%s_build_dictionary", fieldName); - this.buildTime = statsReceiver.getTimerStats( - timerName, TimeUnit.MILLISECONDS, false, false, false); - - String numTermsName = String.format( - "multi_segment_term_dictionary_manager_field_%s_num_terms", fieldName); - this.numTerms = statsReceiver.getLongGauge(numTermsName); - - String numTermEntriesName = String.format( - "multi_segment_term_dictionary_manager_field_%s_num_term_entries", fieldName); - this.numTermEntries = statsReceiver.getLongGauge(numTermEntriesName); - } - } - - private final Config config; - @Nullable private final SegmentManager segmentManager; - @Nullable private final Decider decider; - @Nullable private final EarlybirdCluster earlybirdCluster; - private final ImmutableMap fieldTimerStats; - // A per-field map of multi-segment term dictionaries. Each key is a field. The values are the - // multi-segment term dictionaries for that field. 
- private volatile ImmutableMap multiSegmentTermDictionaryMap; - private List previousSegmentsToMerge; - - public MultiSegmentTermDictionaryManager( - Config config, - SegmentManager segmentManager, - SearchStatsReceiver statsReceiver, - Decider decider, - EarlybirdCluster earlybirdCluster) { - this.config = config; - this.segmentManager = segmentManager; - this.decider = decider; - this.earlybirdCluster = earlybirdCluster; - - this.multiSegmentTermDictionaryMap = ImmutableMap.of(); - this.previousSegmentsToMerge = Lists.newArrayList(); - - ImmutableMap.Builder builder = ImmutableMap.builder(); - if (statsReceiver != null) { - for (String fieldName : config.managedFieldNames()) { - builder.put(fieldName, new FieldStats(statsReceiver, fieldName)); - } - } - this.fieldTimerStats = builder.build(); - } - - /** - * Return the most recently built MultiSegmentTermDictionary for the given field. - * Will return null if the field is not supported by this manager. - */ - @Nullable - public MultiSegmentTermDictionary getMultiSegmentTermDictionary(String fieldName) { - return this.multiSegmentTermDictionaryMap.get(fieldName); - } - - /** - * Build new versions of multi-segment term dictionaries if the manager is enabled, and new - * segments are available. - * @return true if the manager actually ran, and generated new versions of multi-segment term - * dictionaries. - * - * We synchronize this method because it would be a logic error to modify the variables from - * multiple threads simultaneously, and it is possible for two segments to finish optimizing at - * the same time and try to run it. - */ - public synchronized boolean buildDictionary() { - if (!config.isEnabled()) { - return false; - } - - Preconditions.checkNotNull(decider); - Preconditions.checkNotNull(earlybirdCluster); - if (DeciderUtil.isAvailableForRandomRecipient(decider, - getManagerDisabledDeciderName(earlybirdCluster))) { - LOG.info("Multi segment term dictionary manager is disabled via decider for cluster {}.", - earlybirdCluster); - this.multiSegmentTermDictionaryMap = ImmutableMap.of(); - this.previousSegmentsToMerge = Lists.newArrayList(); - return false; - } - - List segmentsToMerge = getSegmentsToMerge(); - - if (differentFromPreviousList(segmentsToMerge)) { - long start = System.currentTimeMillis(); - try { - this.multiSegmentTermDictionaryMap = createNewDictionaries(segmentsToMerge); - this.previousSegmentsToMerge = segmentsToMerge; - return true; - } catch (IOException e) { - LOG.error("Unable to build multi segment term dictionaries", e); - return false; - } finally { - long elapsed = System.currentTimeMillis() - start; - TERM_DICTIONARY_CREATION_STATS.timerIncrement(elapsed); - } - } else { - LOG.warn("No-op for buildDictionary()"); - return false; - } - } - - /** - * Only merge terms from enabled and optimized segments. No need to look at non-enabled segments, - * and we also don't want to use un-optimized segments as their term dictionaries are still - * changing. 
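Conceptually, building the multi-segment dictionary merges each optimized segment's per-field term dictionary into one view of which segments contain each term. The real code merges OptimizedMemoryIndex instances; the sketch below uses plain maps as stand-ins purely to show the shape of the merge:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: merge per-segment term dictionaries (term -> term id in that segment) into a
// single view of term -> (segment index -> term id). Types are simplified placeholders.
public final class TermDictionaryMergeSketch {
  public static Map<String, Map<Integer, Integer>> merge(
      List<Map<String, Integer>> perSegmentTerms) {
    Map<String, Map<Integer, Integer>> merged = new HashMap<>();
    for (int segment = 0; segment < perSegmentTerms.size(); segment++) {
      for (Map.Entry<String, Integer> entry : perSegmentTerms.get(segment).entrySet()) {
        merged.computeIfAbsent(entry.getKey(), term -> new HashMap<>())
            .put(segment, entry.getValue());
      }
    }
    return merged;
  }
}
```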
- */ - private List getSegmentsToMerge() { - Iterable segmentInfos = - segmentManager.getSegmentInfos(Filter.Enabled, Order.OLD_TO_NEW); - - List segmentsToMerge = Lists.newArrayList(); - for (SegmentInfo segmentInfo : segmentInfos) { - if (segmentInfo.getIndexSegment().isOptimized()) { - segmentsToMerge.add(segmentInfo); - } - } - return segmentsToMerge; - } - - private boolean differentFromPreviousList(List segmentsToMerge) { - // there is a potentially different approach here to only check if the - // segmentsToMerge is subsumed by the previousSegmentsToMerge list, and not recompute - // the multi segment term dictionary if so. - // There is a case where a new segment is added, the previously current segment is not yet - // optimized, but the oldest segment is dropped. With this impl, we will recompute to remove - // the dropped segment, however, we will recompute soon again when the - // "previously current segment" is actually optimized. We can potentially delay the first - // merging before the optimization. - if (this.previousSegmentsToMerge.size() == segmentsToMerge.size()) { - for (int i = 0; i < this.previousSegmentsToMerge.size(); i++) { - if (previousSegmentsToMerge.get(i).compareTo(segmentsToMerge.get(i)) != 0) { - return true; - } - } - return false; - } - return true; - } - - /** - * Rebuild the term dictionaries from scratch for all the managed fields. - * Returning a brand new map here with all the fields' term dictionaries so that we can isolate - * failures to build, and only replace the entire map of all the fields are built successfully. - */ - private ImmutableMap createNewDictionaries( - List segments) throws IOException { - - Map map = Maps.newHashMap(); - - for (String field : config.managedFieldNames()) { - LOG.info("Merging term dictionaries for field {}", field); - - List indexesToMerge = findFieldIndexesToMerge(segments, field); - - if (indexesToMerge.isEmpty()) { - LOG.info("No indexes to merge for field {}", field); - } else { - long start = System.currentTimeMillis(); - - MultiSegmentTermDictionary multiSegmentTermDictionary = - mergeDictionaries(field, indexesToMerge); - - map.put(field, multiSegmentTermDictionary); - - long elapsed = System.currentTimeMillis() - start; - LOG.info("Done merging term dictionary for field {}, for {} segments in {}ms", - field, indexesToMerge.size(), elapsed); - - FieldStats fieldStats = fieldTimerStats.get(field); - fieldStats.buildTime.timerIncrement(elapsed); - fieldStats.numTerms.set(multiSegmentTermDictionary.getNumTerms()); - fieldStats.numTermEntries.set(multiSegmentTermDictionary.getNumTermEntries()); - } - } - return ImmutableMap.copyOf(map); - } - - private List findFieldIndexesToMerge( - List segments, String field) throws IOException { - - List indexesToMerge = Lists.newArrayList(); - - for (SegmentInfo segment : segments) { - EarlybirdSegment indexSegment = segment.getIndexSegment(); - Preconditions.checkState(indexSegment.isOptimized(), - "Expect segment to be optimized: %s", segment); - - InvertedIndex fieldIndex = Preconditions.checkNotNull(indexSegment.getIndexReader()) - .getSegmentData().getFieldIndex(field); - - // See SEARCH-11952 - // We will only have a InvertedIndex/OptimizedMemoryIndex here - // in the in-memory non-lucene-based indexes, and not in the archive. We can somewhat - // reasonably extend this to work with the archive by making the dictionaries work with - // TermsEnum's directly instead of OptimizedMemoryIndex's. Leaving this as a further - // extension for now. 
- if (fieldIndex != null) { - if (fieldIndex instanceof OptimizedMemoryIndex) { - indexesToMerge.add((OptimizedMemoryIndex) fieldIndex); - } else { - LOG.info("Found field index for field {} in segment {} of type {}", - field, segment, fieldIndex.getClass()); - } - } else { - LOG.info("Found null field index for field {} in segment {}", field, segment); - } - } - LOG.info("Found good fields for {} out of {} segments", indexesToMerge.size(), - segments.size()); - - return indexesToMerge; - } - - private MultiSegmentTermDictionary mergeDictionaries( - String field, - List indexes) { - // May change this if we get a better implementation in the future. - return new MultiSegmentTermDictionaryWithFastutil(field, indexes); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/OptimizationAndFlushingCoordinationLock.java b/src/java/com/twitter/search/earlybird/partition/OptimizationAndFlushingCoordinationLock.java deleted file mode 100644 index bf39bb653..000000000 --- a/src/java/com/twitter/search/earlybird/partition/OptimizationAndFlushingCoordinationLock.java +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.concurrent.locks.ReentrantLock; - -import com.google.common.annotations.VisibleForTesting; - -/** - * Lock used to ensure that flushing does not occur concurrently with the gc_before_optimization - * and post_optimization_rebuilds actions - see where we call the "lock" method of this class. - * - * Both coordinated actions include a full GC in them, for reasons described in that part - * of the code. After the GC, they wait until indexing has caught up before rejoining the serverset. - * - * If we flush concurrently with these actions, we can pause indexing for a while and waiting - * until we're caught up can take some time, which can affect the memory state negatively. - * For example, the first GC (before optimization) we do so that we have a clean state of memory - * before optimization. - * - * The other reason we lock before executing the actions is because if we have flushing that's - * currently running, once it finishes, we will rejoin the serverset and that can be followed by - * a stop-the-world GC from the actions, which will affect our success rate. 
- */ -public class OptimizationAndFlushingCoordinationLock { - private final ReentrantLock lock; - - public OptimizationAndFlushingCoordinationLock() { - this.lock = new ReentrantLock(); - } - - public void lock() { - lock.lock(); - } - - public void unlock() { - lock.unlock(); - } - - public boolean tryLock() { - return lock.tryLock(); - } - - @VisibleForTesting - public boolean hasQueuedThreads() { - return lock.hasQueuedThreads(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/OptimizingSegmentWriter.java b/src/java/com/twitter/search/earlybird/partition/OptimizingSegmentWriter.java deleted file mode 100644 index 732c2bbe8..000000000 --- a/src/java/com/twitter/search/earlybird/partition/OptimizingSegmentWriter.java +++ /dev/null @@ -1,210 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.concurrent.ConcurrentLinkedQueue; -import java.util.concurrent.atomic.AtomicReference; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.base.Verify; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.util.GCUtil; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.CaughtUpMonitor; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegment; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionInterface; -import com.twitter.util.Future; -import com.twitter.util.Promise; - -/** - * This class optimizes a segment without blocking reads or writes. - * - * In steady state operation (Indexing or Optimized), it delegates operations directly to a - * SegmentWriter. - * - * Optimization is naturally a copying operation -- we don't need to mutate anything internally. - * We need to be able to apply updates to the unoptimized segment while we are creating - * the optimized segment. We also need to be able to apply these updates to the optimized segment, - * but we can't apply updates while a segment is being optimized, because document IDs will be - * changing internally and posting lists could be any state. To deal with this, we queue updates - * that occur during optimization, and then apply them as the last step of optimization. At that - * point, the segment will be optimized and up to date, so we can swap the unoptimized segment for - * the optimized one. - */ -public class OptimizingSegmentWriter implements ISegmentWriter { - private static final Logger LOG = LoggerFactory.getLogger(OptimizingSegmentWriter.class); - - private final AtomicReference state = new AtomicReference<>(State.Indexing); - private final ConcurrentLinkedQueue queuedEvents = - new ConcurrentLinkedQueue<>(); - - private final CriticalExceptionHandler criticalExceptionHandler; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final String segmentName; - private final Promise optimizationPromise = new Promise<>(); - - // We use the lock to ensure that the optimizing thread and the writer thread do not attempt - // to call indexThriftVersionedEvents on the underlying writer simultaneously. - private final Object lock = new Object(); - // The reference to the current writer. Protected by lock. 
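OptimizationAndFlushingCoordinationLock is a thin wrapper around ReentrantLock; what matters is that the flushing path and the coordinated GC actions guard their critical sections with the same instance. A minimal sketch of the flush side, assuming (this file does not show it) that flushing simply blocks on lock() the same way the optimization path does:

    // Hedged sketch: the flush side sharing the coordination lock with the GC actions.
    void flushWithCoordination(OptimizationAndFlushingCoordinationLock coordinationLock,
                               Runnable flushWork) {
      coordinationLock.lock();   // blocks while a gc_before_optimization / rebuild action runs
      try {
        flushWork.run();         // no coordinated action can start while flushing is in progress
      } finally {
        coordinationLock.unlock();
      }
    }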
- private final AtomicReference segmentWriterReference; - - private final CaughtUpMonitor indexCaughtUpMonitor; - - /** - * The state flow: - * Indexing -> Optimizing -> - * ONE OF: - * - Optimized - * - FailedToOptimize - */ - @VisibleForTesting - enum State { - Indexing, - Optimizing, - FailedToOptimize, - Optimized, - } - - public OptimizingSegmentWriter( - SegmentWriter segmentWriter, - CriticalExceptionHandler criticalExceptionHandler, - SearchIndexingMetricSet searchIndexingMetricSet, - CaughtUpMonitor indexCaughtUpMonitor - ) { - Preconditions.checkState(!segmentWriter.getSegmentInfo().isOptimized()); - segmentWriterReference = new AtomicReference<>(segmentWriter); - - this.criticalExceptionHandler = criticalExceptionHandler; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.segmentName = segmentWriter.getSegmentInfo().getSegmentName(); - this.indexCaughtUpMonitor = indexCaughtUpMonitor; - } - - /** - * Start optimizing this segment in the background. Returns a Future that will complete when - * the optimization is complete. - * Acquires the optimizationAndFlushingCoordinationLock before attempting to optimize. - */ - public Future startOptimization( - CoordinatedEarlybirdActionInterface gcAction, - OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock) { - new Thread(() -> { - // Acquire lock to ensure that flushing is not in progress. If the lock is not available, - // then wait until it is. - LOG.info("Acquire coordination lock before beginning gc_before_optimization action."); - try { - optimizationAndFlushingCoordinationLock.lock(); - LOG.info("Successfully acquired coordination lock for gc_before_optimization action."); - gcAction.retryActionUntilRan("gc before optimization", () -> { - LOG.info("Run GC before optimization"); - GCUtil.runGC(); - // Wait for indexing to catch up before gcAction rejoins the serverset. We only need to do - // this if the host has already finished startup. - if (EarlybirdStatus.hasStarted()) { - indexCaughtUpMonitor.resetAndWaitUntilCaughtUp(); - } - }); - } finally { - LOG.info("Finished gc_before_optimization action. " - + "Releasing coordination lock and beginning optimization."); - optimizationAndFlushingCoordinationLock.unlock(); - } - - transition(State.Indexing, State.Optimizing); - - SegmentInfo unoptimizedSegmentInfo = null; - try { - unoptimizedSegmentInfo = segmentWriterReference.get().getSegmentInfo(); - Preconditions.checkState(!unoptimizedSegmentInfo.isOptimized()); - - Stopwatch stopwatch = Stopwatch.createStarted(); - LOG.info("Started optimizing segment data {}.", segmentName); - EarlybirdSegment optimizedSegment = - unoptimizedSegmentInfo.getIndexSegment().makeOptimizedSegment(); - LOG.info("Finished optimizing segment data {} in {}.", segmentName, stopwatch); - - SegmentInfo newSegmentInfo = unoptimizedSegmentInfo - .copyWithEarlybirdSegment(optimizedSegment); - - SegmentWriter optimizedWriter = - new SegmentWriter(newSegmentInfo, searchIndexingMetricSet.updateFreshness); - Verify.verify(optimizedWriter.getSegmentInfo().isOptimized()); - - // We want to apply all updates to the new segment twice, because this first call may apply - // many thousands of updates and take a while to complete. - applyAllPendingUpdates(optimizedWriter); - - // We try to do as little as possible while holding the lock, so the writer can continue - // to make progress. First we apply all the updates that have been queued up before we - // grabbed the lock, then we need to swap the new writer for the old one. 
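Stripped of the Earlybird-specific types, the pattern implemented here and in the synchronized block that follows is: queue every update that arrives while the segment is optimizing, keep applying it to the live unoptimized writer, then drain the queue into the optimized writer twice (the second time under the writer lock) and swap. A self-contained sketch of just that pattern, with illustrative names and plain JDK types:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicReference;

    final class SwappableWriter<E> {
      interface Writer<T> { void index(T event) throws Exception; }

      private final Object lock = new Object();
      private final Queue<E> queued = new ConcurrentLinkedQueue<>();
      private final AtomicReference<Writer<E>> current;
      private volatile boolean optimizing = false;

      SwappableWriter(Writer<E> unoptimized) {
        this.current = new AtomicReference<>(unoptimized);
      }

      // Writer thread: always index on the current writer; additionally queue while optimizing.
      void index(E event) throws Exception {
        synchronized (lock) {
          if (optimizing) {
            queued.add(event);
          }
          current.get().index(event);
        }
      }

      void beginOptimization() {
        optimizing = true;
      }

      // Optimizer thread: after building the optimized writer, drain twice and swap.
      void finishOptimization(Writer<E> optimized) throws Exception {
        drainInto(optimized);           // may be slow; done outside the lock
        synchronized (lock) {
          drainInto(optimized);         // short; nothing can slip in between drain and swap
          current.set(optimized);
          optimizing = false;
        }
      }

      private void drainInto(Writer<E> writer) throws Exception {
        E event;
        while ((event = queued.poll()) != null) {
          writer.index(event);
        }
      }
    }

Draining once outside the lock keeps the locked section short, so the writer thread is blocked only for the handful of updates that arrive during the first drain.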
- synchronized (lock) { - applyAllPendingUpdates(optimizedWriter); - segmentWriterReference.getAndSet(optimizedWriter); - transition(State.Optimizing, State.Optimized); - } - - if (!unoptimizedSegmentInfo.isEnabled()) { - LOG.info("Disabling segment: {}", unoptimizedSegmentInfo.getSegmentName()); - newSegmentInfo.setIsEnabled(false); - } - - optimizationPromise.setValue(newSegmentInfo); - } catch (Throwable e) { - if (unoptimizedSegmentInfo != null) { - unoptimizedSegmentInfo.setFailedOptimize(); - } - - transition(State.Optimizing, State.FailedToOptimize); - optimizationPromise.setException(e); - } - }, "optimizing-segment-writer").start(); - - return optimizationPromise; - } - - private void applyAllPendingUpdates(SegmentWriter segmentWriter) throws IOException { - LOG.info("Applying {} queued updates to segment {}.", queuedEvents.size(), segmentName); - // More events can be enqueued while this method is running, so we track the total applied too. - long eventCount = 0; - Stopwatch stopwatch = Stopwatch.createStarted(); - ThriftVersionedEvents update; - while ((update = queuedEvents.poll()) != null) { - segmentWriter.indexThriftVersionedEvents(update); - eventCount++; - } - LOG.info("Applied {} queued updates to segment {} in {}.", - eventCount, segmentName, stopwatch); - } - - @Override - public Result indexThriftVersionedEvents(ThriftVersionedEvents tve) throws IOException { - synchronized (lock) { - if (state.get() == State.Optimizing) { - queuedEvents.add(tve); - } - return segmentWriterReference.get().indexThriftVersionedEvents(tve); - } - } - - @Override - public SegmentInfo getSegmentInfo() { - return segmentWriterReference.get().getSegmentInfo(); - } - - private void transition(State from, State to) { - Preconditions.checkState(state.compareAndSet(from, to)); - LOG.info("Transitioned from {} to {} for segment {}.", from, to, segmentName); - } - - @VisibleForTesting - public Future getOptimizationPromise() { - return optimizationPromise; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionConfig.java b/src/java/com/twitter/search/earlybird/partition/PartitionConfig.java deleted file mode 100644 index 5d8280ca6..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionConfig.java +++ /dev/null @@ -1,171 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.Date; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.commons.lang3.builder.ToStringBuilder; - -import com.twitter.search.common.config.Config; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.config.TierConfig; - -public class PartitionConfig { - // Which sub-cluster this host belongs to - private final String tierName; - - // Which cluster this host belongs to - private final String clusterName; - - public static final String DEFAULT_TIER_NAME = "all"; - - // the date range of the timeslices this tier will load. The start date is inclusive, while - // the end date is exclusive. 
- private final Date tierStartDate; - private final Date tierEndDate; - - private final int indexingHashPartitionID; // Hash Partition ID assigned for this EB - private final int maxEnabledLocalSegments; // Number of segments to keep - // The position of this host in the ordered list of hosts serving this hash partition - private final int hostPositionWithinHashPartition; - private volatile int numReplicasInHashPartition; - - private final int numPartitions; // Total number of partitions in the current cluster - - public PartitionConfig( - int indexingHashPartitionID, - int maxEnabledLocalSegments, - int hostPositionWithinHashPartition, - int numReplicasInHashPartition, - int numPartitions) { - this(DEFAULT_TIER_NAME, - TierConfig.DEFAULT_TIER_START_DATE, - TierConfig.DEFAULT_TIER_END_DATE, - indexingHashPartitionID, - maxEnabledLocalSegments, - hostPositionWithinHashPartition, - numReplicasInHashPartition, - numPartitions); - } - - public PartitionConfig(String tierName, - Date tierStartDate, - Date tierEndDate, - int indexingHashPartitionID, - int maxEnabledLocalSegments, - int hostPositionWithinHashPartition, - int numReplicasInHashPartition, - int numPartitions) { - this(tierName, tierStartDate, tierEndDate, indexingHashPartitionID, maxEnabledLocalSegments, - hostPositionWithinHashPartition, numReplicasInHashPartition, Config.getEnvironment(), - numPartitions); - } - - public PartitionConfig(String tierName, - Date tierStartDate, - Date tierEndDate, - int indexingHashPartitionID, - int maxEnabledLocalSegments, - int hostPositionWithinHashPartition, - int numReplicasInHashPartition, - String clusterName, - int numPartitions) { - this.tierName = Preconditions.checkNotNull(tierName); - this.clusterName = Preconditions.checkNotNull(clusterName); - this.tierStartDate = Preconditions.checkNotNull(tierStartDate); - this.tierEndDate = Preconditions.checkNotNull(tierEndDate); - this.indexingHashPartitionID = indexingHashPartitionID; - this.maxEnabledLocalSegments = maxEnabledLocalSegments; - this.hostPositionWithinHashPartition = hostPositionWithinHashPartition; - this.numReplicasInHashPartition = numReplicasInHashPartition; - this.numPartitions = numPartitions; - } - - public String getTierName() { - return tierName; - } - - public String getClusterName() { - return clusterName; - } - - public Date getTierStartDate() { - return tierStartDate; - } - - public Date getTierEndDate() { - return tierEndDate; - } - - public int getIndexingHashPartitionID() { - return indexingHashPartitionID; - } - - public int getMaxEnabledLocalSegments() { - return maxEnabledLocalSegments; - } - - public int getHostPositionWithinHashPartition() { - return hostPositionWithinHashPartition; - } - - public int getNumReplicasInHashPartition() { - return numReplicasInHashPartition; - } - - /** - * The number of ways the Tweet and/or user data is partitioned (or sharded) in this Earlybird, in - * this tier. 
- */ - public int getNumPartitions() { - return numPartitions; - } - - public String getPartitionConfigDescription() { - return ToStringBuilder.reflectionToString(this); - } - - public void setNumReplicasInHashPartition(int numReplicas) { - numReplicasInHashPartition = numReplicas; - } - - public static final int DEFAULT_NUM_SERVING_TIMESLICES_FOR_TEST = 18; - public static PartitionConfig getPartitionConfigForTests() { - return getPartitionConfigForTests( - TierConfig.DEFAULT_TIER_START_DATE, - TierConfig.DEFAULT_TIER_END_DATE); - } - - public static PartitionConfig getPartitionConfigForTests(Date tierStartDate, Date tierEndDate) { - return getPartitionConfigForTests( - DEFAULT_NUM_SERVING_TIMESLICES_FOR_TEST, tierStartDate, tierEndDate, 1); - } - - /** - * Returns a PartitionConfig instance configured for tests. - * - * @param numServingTimeslices The number of timeslices that should be served. - * @param tierStartDate The tier's start date. Used only in the full archive earlybirds. - * @param tierEndDate The tier's end date. Used only by in the full archive earlybirds. - * @param numReplicasInHashPartition The number of replicas for each partition. - * @return A PartitionConfig instance configured for tests. - */ - @VisibleForTesting - public static PartitionConfig getPartitionConfigForTests( - int numServingTimeslices, - Date tierStartDate, - Date tierEndDate, - int numReplicasInHashPartition) { - return new PartitionConfig( - EarlybirdConfig.getString("sub_tiers_for_tests", "test"), - tierStartDate, - tierEndDate, - EarlybirdConfig.getInt("hash_partition_for_tests", -1), - numServingTimeslices, - 0, // hostPositionWithinHashPartition - numReplicasInHashPartition, - EarlybirdConfig.getInt("num_partitions_for_tests", -1) - ); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoader.java b/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoader.java deleted file mode 100644 index fabae4595..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoader.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.aurora.AuroraInstanceKey; -import com.twitter.search.common.aurora.AuroraSchedulerClient; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.factory.PartitionConfigUtil; - -public final class PartitionConfigLoader { - private static final Logger LOG = LoggerFactory.getLogger(PartitionConfigLoader.class); - - private PartitionConfigLoader() { - // this never gets called - } - - /** - * Load partition information from the command line arguments and Aurora scheduler. - * - * @return The new PartitionConfig object for this host - */ - public static PartitionConfig getPartitionInfoForMesosConfig( - AuroraSchedulerClient schedulerClient) throws PartitionConfigLoadingException { - AuroraInstanceKey instanceKey = - Preconditions.checkNotNull(EarlybirdConfig.getAuroraInstanceKey()); - int numTasks; - - try { - numTasks = schedulerClient.getActiveTasks( - instanceKey.getRole(), instanceKey.getEnv(), instanceKey.getJobName()).size(); - LOG.info("Found {} active tasks", numTasks); - } catch (IOException e) { - // This can happen when Aurora Scheduler is holding a conclave to elect a new reader. 
- LOG.warn("Failed to get tasks from Aurora scheduler.", e); - throw new PartitionConfigLoadingException("Failed to get tasks from Aurora scheduler."); - } - - return PartitionConfigUtil.initPartitionConfigForAurora(numTasks); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoadingException.java b/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoadingException.java deleted file mode 100644 index fa39ce361..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionConfigLoadingException.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird.partition; - -/** - * An exception thrown when the earlybird layout could not be loaded, or when a host cannot find - * itself in the layout, and the layout has errors (which might be the reason why the host could not - * find itself in the layout). - */ -public class PartitionConfigLoadingException extends Exception { - public PartitionConfigLoadingException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionManager.java b/src/java/com/twitter/search/earlybird/partition/PartitionManager.java deleted file mode 100644 index c71d3d602..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionManager.java +++ /dev/null @@ -1,254 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.segment.SegmentDataProvider; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.search.earlybird.util.OneTaskScheduledExecutorManager; -import com.twitter.search.earlybird.util.PeriodicActionParams; -import com.twitter.search.earlybird.util.ShutdownWaitTimeParams; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * PartitionManager is responsible for indexing data for a partition, including Tweets and Users. 
- */ -public abstract class PartitionManager extends OneTaskScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(PartitionManager.class); - - private static final SearchCounter IGNORED_EXCEPTIONS = - SearchCounter.export("partition_manager_ignored_exceptions"); - - private static final String PARTITION_MANAGER_THREAD_NAME = "PartitionManager"; - private static final boolean THREAD_IS_DAEMON = true; - protected static final String INDEX_CURRENT_SEGMENT = "indexing the current segment"; - protected static final String SETUP_QUERY_CACHE = "setting up query cache"; - - protected final SegmentManager segmentManager; - protected final QueryCacheManager queryCacheManager; - // Should be updated by info read from ZK - protected final DynamicPartitionConfig dynamicPartitionConfig; - - private final SearchIndexingMetricSet searchIndexingMetricSet; - - private boolean partitionManagerFirstLoop = true; - - public PartitionManager(QueryCacheManager queryCacheManager, - SegmentManager segmentManager, - DynamicPartitionConfig dynamicPartitionConfig, - ScheduledExecutorServiceFactory executorServiceFactory, - SearchIndexingMetricSet searchIndexingMetricSet, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - super( - executorServiceFactory, - PARTITION_MANAGER_THREAD_NAME, - THREAD_IS_DAEMON, - PeriodicActionParams.withFixedDelay( - EarlybirdConfig.getInt("time_slice_roll_check_interval_ms", 500), - TimeUnit.MILLISECONDS), - ShutdownWaitTimeParams.indefinitely(), - searchStatsReceiver, - criticalExceptionHandler); - - this.segmentManager = segmentManager; - this.queryCacheManager = queryCacheManager; - this.dynamicPartitionConfig = dynamicPartitionConfig; - this.searchIndexingMetricSet = searchIndexingMetricSet; - } - - /** - * Runs the partition manager. - */ - public final void runImpl() { - if (partitionManagerFirstLoop) { - try { - testHookBeforeStartUp(); - startUp(); - validateSegments(); - segmentManager.logState("After startUp"); - } catch (Throwable t) { - criticalExceptionHandler.handle(this, t); - shutDownIndexing(); - throw new RuntimeException("PartitionManager unhandled exception, stopping scheduler", t); - } - } - - try { - testHookAfterSleep(); - indexingLoop(partitionManagerFirstLoop); - } catch (InterruptedException e) { - LOG.warn("PartitionManager thread interrupted, stoping scheduler", e); - shutDownIndexing(); - throw new RuntimeException("PartitionManager thread interrupted", e); - } catch (Exception e) { - LOG.error("Exception in indexing PartitionManager loop", e); - IGNORED_EXCEPTIONS.increment(); - } catch (Throwable t) { - LOG.error("Unhandled exception in indexing PartitionManager loop", t); - criticalExceptionHandler.handle(this, t); - shutDownIndexing(); - throw new RuntimeException("PartitionManager unhandled exception, stopping scheduler", t); - } finally { - partitionManagerFirstLoop = false; - } - } - - /** - * Returns the SegmentDataProvider instance that will be used to fetch the information for all - * segments. - */ - public abstract SegmentDataProvider getSegmentDataProvider(); - - /** - * Starts up this partition manager. - */ - protected abstract void startUp() throws Exception; - - /** - * Runs one indexing iteration. - * - * @param firstLoop Determines if this is the first time the indexing loop is running. - */ - protected abstract void indexingLoop(boolean firstLoop) throws Exception; - - /** - * Shuts down all indexing. 
- */ - protected abstract void shutDownIndexing(); - - @Override - public void shutdownComponent() { - shutDownIndexing(); - } - - /** - * Notifies all other threads that the partition manager has become current (ie. has indexed all - * available events). - */ - public void becomeCurrent() { - LOG.info("PartitionManager became current"); - if (EarlybirdStatus.isStarting()) { - EarlybirdStatus.setStatus(EarlybirdStatusCode.CURRENT); - } else { - LOG.warn("Could not set statusCode to CURRENT from " + EarlybirdStatus.getStatusCode()); - } - - // Now that we're done starting up, set the query cache thread pool size to one. - queryCacheManager.setWorkerPoolSizeAfterStartup(); - } - - protected void setupQueryCacheIfNeeded() throws QueryParserException { - queryCacheManager.setupTasksIfNeeded(segmentManager); - } - - // Only for tests, used for testing exception handling - private static TestHook testHookBeforeStartUp; - private static TestHook testHookAfterSleep; - - private static void testHookBeforeStartUp() throws Exception { - if (Config.environmentIsTest() && testHookBeforeStartUp != null) { - testHookBeforeStartUp.run(); - } - } - - private static void testHookAfterSleep() throws Exception { - if (Config.environmentIsTest() && testHookAfterSleep != null) { - testHookAfterSleep.run(); - } - } - - @Override - protected void runOneIteration() { - try { - runImpl(); - } catch (Throwable t) { - LOG.error("Unhandled exception in PartitionManager loop", t); - throw new RuntimeException(t.getMessage()); - } - } - - public SearchIndexingMetricSet getSearchIndexingMetricSet() { - return searchIndexingMetricSet; - } - - /** - * Allows tests to run code before the partition manager starts up. - * - * @param testHook The code to run before the start up. - */ - @VisibleForTesting - public static void setTestHookBeforeStartUp(TestHook testHook) { - if (Config.environmentIsTest()) { - testHookBeforeStartUp = testHook; - } else { - throw new RuntimeException("Trying to set startup test hook in non-test code!!"); - } - } - - /** - * Allows tests to run code before the indexing loop. - * - * @param testHook The code to run before the indexing loop. - */ - @VisibleForTesting - public static void setTestHookAfterSleep(TestHook testHook) { - if (Config.environmentIsTest()) { - testHookAfterSleep = testHook; - } else { - throw new RuntimeException("Trying to set test hook in non-test code!!"); - } - } - - /** - * An interface that allows tests to run code at various points in the PartitionManager's - * lyfecycle. - */ - @VisibleForTesting - public interface TestHook { - /** - * Defines the code that should be run. - */ - void run() throws Exception; - } - - /** - * Allows tests to determine if this partition manager is all caught up. - * - * @return {@code true} if this partition manager is caught up, {@code false} otherwise. - */ - @VisibleForTesting - public abstract boolean isCaughtUpForTests(); - - @VisibleForTesting - protected void validateSegments() throws EarlybirdStartupException { - // This is necessary because many tests rely on starting partition manager but not indexing any - // tweets. However, we do not want Earlybirds to start in production if they are not serving any - // tweets. (SEARCH-24238) - if (Config.environmentIsTest()) { - return; - } - validateSegmentsForNonTest(); - } - - @VisibleForTesting - protected void validateSegmentsForNonTest() throws EarlybirdStartupException { - // Subclasses can override this and provide additional checks. 
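The contract for concrete partition managers is: startUp() runs once on the first scheduler iteration, indexingLoop(firstLoop) runs on every iteration after that, and shutDownIndexing() is invoked on fatal errors or shutdown. A minimal, hypothetical subclass sketching that contract (not one of the real implementations; all bodies are illustrative stubs):

    // Hedged sketch of the subclass contract.
    public class MinimalPartitionManager extends PartitionManager {
      private final SegmentDataProvider segmentDataProvider;
      private volatile boolean caughtUp = false;

      public MinimalPartitionManager(
          QueryCacheManager queryCacheManager,
          SegmentManager segmentManager,
          DynamicPartitionConfig dynamicPartitionConfig,
          ScheduledExecutorServiceFactory executorServiceFactory,
          SearchIndexingMetricSet searchIndexingMetricSet,
          SearchStatsReceiver searchStatsReceiver,
          CriticalExceptionHandler criticalExceptionHandler,
          SegmentDataProvider segmentDataProvider) {
        super(queryCacheManager, segmentManager, dynamicPartitionConfig, executorServiceFactory,
            searchIndexingMetricSet, searchStatsReceiver, criticalExceptionHandler);
        this.segmentDataProvider = segmentDataProvider;
      }

      @Override
      public SegmentDataProvider getSegmentDataProvider() {
        return segmentDataProvider;
      }

      @Override
      protected void startUp() {
        // One-time work: discover timeslices, load or create segments, etc.
      }

      @Override
      protected void indexingLoop(boolean firstLoop) throws Exception {
        // Index newly available events and roll timeslices here; once caught up,
        // set up the query cache and report the server as current.
        if (!caughtUp) {
          setupQueryCacheIfNeeded();
          caughtUp = true;
          becomeCurrent();
        }
      }

      @Override
      protected void shutDownIndexing() {
        // Release indexing resources.
      }

      @Override
      public boolean isCaughtUpForTests() {
        return caughtUp;
      }
    }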
- if (segmentManager.getNumIndexedDocuments() == 0) { - throw new EarlybirdStartupException("Earlybird has zero indexed documents."); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionManagerStartup.java b/src/java/com/twitter/search/earlybird/partition/PartitionManagerStartup.java deleted file mode 100644 index f25bdd1bd..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionManagerStartup.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.Closeable; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.earlybird.EarlybirdServer; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; - -/** - * Handles starting and indexing data for a partition, using a PartitionManager. - */ -public class PartitionManagerStartup implements EarlybirdStartup { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdServer.class); - - private final Clock clock; - private final PartitionManager partitionManager; - - public PartitionManagerStartup( - Clock clock, - PartitionManager partitionManager - ) { - this.clock = clock; - this.partitionManager = partitionManager; - } - - @Override - public Closeable start() throws EarlybirdStartupException { - partitionManager.schedule(); - - int count = 0; - - while (EarlybirdStatus.getStatusCode() != EarlybirdStatusCode.CURRENT) { - if (EarlybirdStatus.getStatusCode() == EarlybirdStatusCode.STOPPING) { - return partitionManager; - } - - try { - clock.waitFor(1000); - } catch (InterruptedException e) { - LOG.info("Sleep interrupted, quitting earlybird"); - throw new EarlybirdStartupException("Sleep interrupted"); - } - - // Log every 120 seconds. - if (count++ % 120 == 0) { - LOG.info("Thrift port closed until Earlybird, both indexing and query cache, is current"); - } - } - - return partitionManager; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/PartitionWriter.java b/src/java/com/twitter/search/earlybird/partition/PartitionWriter.java deleted file mode 100644 index 797acd002..000000000 --- a/src/java/com/twitter/search/earlybird/partition/PartitionWriter.java +++ /dev/null @@ -1,109 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.time.Duration; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -/** - * PartitionWriter writes Tweet events and Tweet update events to an Earlybird index. It is - * responsible for creating new segments, adding Tweets to the correct segment, and applying updates - * to the correct segment. 
- */ -public class PartitionWriter { - private static final Logger LOG = LoggerFactory.getLogger(PartitionWriter.class); - private static final String STATS_PREFIX = "partition_writer_"; - - private static final SearchRateCounter MISSING_PENGUIN_VERSION = - SearchRateCounter.export(STATS_PREFIX + "missing_penguin_version"); - private static final Duration CAUGHT_UP_FRESHNESS = Duration.ofSeconds(5); - private static final SearchRateCounter EVENTS_CONSUMED = - SearchRateCounter.export(STATS_PREFIX + "events_consumed"); - - private final PenguinVersion penguinVersion; - private final TweetUpdateHandler updateHandler; - private final TweetCreateHandler createHandler; - private final Clock clock; - private final CriticalExceptionHandler criticalExceptionHandler; - - - - public PartitionWriter( - TweetCreateHandler tweetCreateHandler, - TweetUpdateHandler tweetUpdateHandler, - CriticalExceptionHandler criticalExceptionHandler, - PenguinVersion penguinVersion, - Clock clock - ) { - LOG.info("Creating PartitionWriter."); - this.createHandler = tweetCreateHandler; - this.updateHandler = tweetUpdateHandler; - this.criticalExceptionHandler = criticalExceptionHandler; - this.penguinVersion = penguinVersion; - this.clock = clock; - } - - /** - * Index a batch of TVE records. - */ - public boolean indexBatch(Iterable> records) - throws Exception { - long minTweetAge = Long.MAX_VALUE; - for (ConsumerRecord record : records) { - ThriftVersionedEvents tve = record.value(); - indexTVE(tve); - EVENTS_CONSUMED.increment(); - long tweetAgeInMs = SnowflakeIdParser.getTweetAgeInMs(clock.nowMillis(), tve.getId()); - minTweetAge = Math.min(tweetAgeInMs, minTweetAge); - } - - return minTweetAge < CAUGHT_UP_FRESHNESS.toMillis(); - } - - /** - * Index a ThriftVersionedEvents struct. - */ - @VisibleForTesting - public void indexTVE(ThriftVersionedEvents tve) throws IOException { - ThriftIndexingEvent tie = tve.getVersionedEvents().get(penguinVersion.getByteValue()); - if (tie == null) { - LOG.error("Could not find a ThriftIndexingEvent for PenguinVersion {} in " - + "ThriftVersionedEvents: {}", penguinVersion, tve); - MISSING_PENGUIN_VERSION.increment(); - return; - } - - // An `INSERT` event is used for new Tweets. These are generated from Tweet Create Events from - // TweetyPie. 
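The caught-up signal computed in indexBatch() above depends only on tweet age, which SnowflakeIdParser derives from the tweet ID itself. A standalone sketch of that age math, assuming the standard Snowflake layout (millisecond timestamp above the 22 low worker/sequence bits, offset from the Twitter epoch); the constants here are assumptions, not values read from SnowflakeIdParser:

    import java.time.Duration;

    // Hedged sketch of the Snowflake-based age math behind the caught-up check.
    final class SnowflakeAge {
      // Twitter's Snowflake epoch (2010-11-04T01:42:54.657Z).
      private static final long TWITTER_EPOCH_MILLIS = 1288834974657L;
      private static final Duration CAUGHT_UP_FRESHNESS = Duration.ofSeconds(5);

      static long tweetAgeInMs(long nowMillis, long tweetId) {
        long creationTimeMillis = (tweetId >>> 22) + TWITTER_EPOCH_MILLIS;
        return nowMillis - creationTimeMillis;
      }

      // A batch counts as caught up when its freshest (youngest) tweet is under the threshold.
      static boolean isCaughtUp(long minTweetAgeInMs) {
        return minTweetAgeInMs < CAUGHT_UP_FRESHNESS.toMillis();
      }
    }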
- if (tie.getEventType() == ThriftIndexingEventType.INSERT) { - createHandler.handleTweetCreate(tve); - updateHandler.retryPendingUpdates(tve.getId()); - } else { - updateHandler.handleTweetUpdate(tve, false); - } - } - - public void prepareAfterStartingWithIndex(long maxIndexedTweetId) { - createHandler.prepareAfterStartingWithIndex(maxIndexedTweetId); - } - - void logState() { - LOG.info("PartitionWriter state:"); - LOG.info(String.format(" Events indexed: %,d", EVENTS_CONSUMED.getCount())); - createHandler.logState(); - updateHandler.logState(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SearchIndexingMetricSet.java b/src/java/com/twitter/search/earlybird/partition/SearchIndexingMetricSet.java deleted file mode 100644 index 9edeb56d2..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SearchIndexingMetricSet.java +++ /dev/null @@ -1,208 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.EnumMap; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.earlybird.util.ScheduledExecutorManager; - -/** - * Collection of common metrics used in the indexing, and related code. - * We create a set/holder for them as we want to create all counters only one time, and these - * counters can be used by both SimpleUpdateIndexer, PartitionIndexer, EarlybirdSegment, and others. - */ -public class SearchIndexingMetricSet { - /** - * A proxy for the creation time of the "freshest" tweet that we have in the index. - * It is used in computing the index freshness stat "earlybird_index_freshness_millis". - * - In the realtme clusters, this should match the creation time of highestStatusId. - * - In the archive clusters, this should match the timestamp of the latest indexed day. - */ - public final SearchLongGauge freshestTweetTimeMillis; - - /** The highest indexed tweet ID. Used to compute index freshness. */ - public final SearchLongGauge highestStatusId; - - /** - * The current timeslice's ID. We can compare this to indexer's exported current timeslice ID to - * identify stuck timeslice rolls. - */ - public final SearchLongGauge currentTimesliceId; - - /** The number of archive timeslices that we failed to process. */ - public final SearchCounter archiveTimeSliceBuildFailedCounter; - - /** The number of times we checked a segment's size on disk. */ - public final SearchCounter segmentSizeCheckCount; - - /** The number of segments that have reached their max size. */ - public final SearchCounter maxSegmentSizeReachedCounter; - - /** The number of indexed tweets and the aggregate indexing latencies in microseconds. */ - public final SearchTimerStats statusStats; - /** The number of applied updates and the aggregate indexing latencies in microseconds. */ - public final SearchTimerStats updateStats; - /** The number of retried updates and the aggregate indexing latencies in microseconds. */ - public final SearchTimerStats updateRetryStats; - /** The number of applied user updates and the aggregate indexing latencies in microseconds. 
*/ - public final SearchTimerStats userUpdateIndexingStats; - /** The number of applied userGeoScrubEvents and the aggregate indexing latencies in - * microseconds. */ - public final SearchTimerStats userScrubGeoIndexingStats; - /** The number of updates attempted on missing tweets. */ - public final SearchRateCounter updateOnMissingTweetCounter; - /** The number of updates dropped. */ - public final SearchRateCounter droppedUpdateEvent; - - /** The latencies in microseconds of the PartitionIndexer loop. */ - public final SearchTimerStats partitionIndexerRunLoopCounter; - /** The latencies in microseconds of the PartitionIndexer.indexFromReaders() calls. */ - public final SearchTimerStats partitionIndexerIndexFromReadersCounter; - /** The number of invocations of the PartitionIndexer task. */ - public final SearchCounter partitionIndexerIterationCounter; - - /** The number of unsorted updates handled by SimpleUpdateIndexer. */ - public final SearchCounter simpleUpdateIndexerUnsortedUpdateCounter; - /** The number of unsorted updates with the wrong segment handled by SimpleUpdateIndexer. */ - public final SearchCounter simpleUpdateIndexerUnsortedUpdateWithWrongSegmentCounter; - - /** The number of invocations of the SimpleUserUpdateIndexer task. */ - public final SearchCounter simpleUserUpdateIndexerIterationCounter; - - /** The number of exceptions encountered by SimpleSegmentIndexer while indexing a segment. */ - public final SearchCounter simpleSegmentIndexerExceptionCounter; - - /** - * A map from TIE update type to the creation time of the updated tweet in milliseconds of the - * freshest update we have indexed. - */ - public final EnumMap updateFreshness = - new EnumMap<>(ThriftIndexingEventType.class); - - public final SearchStatsReceiver searchStatsReceiver; - - public static class StartupMetric { - // Switched from 0 to 1 during the event. - private SearchLongGauge duringGauge; - // Switched from 0 to time it takes, in milliseconds. 
- private SearchLongGauge durationMillisGauge; - - StartupMetric(String name) { - this.duringGauge = SearchLongGauge.export(name); - this.durationMillisGauge = SearchLongGauge.export("duration_of_" + name); - } - - public void begin() { - duringGauge.set(1); - } - - public void end(long durationInMillis) { - duringGauge.set(0); - durationMillisGauge.set(durationInMillis); - } - } - - public final StartupMetric startupInProgress; - public final StartupMetric startupInIndexCompletedSegments; - public final StartupMetric startupInLoadCompletedSegments; - public final StartupMetric startupInIndexUpdatesForCompletedSegments; - public final StartupMetric startupInCurrentSegment; - public final StartupMetric startupInUserUpdates; - public final StartupMetric startupInQueryCacheUpdates; - public final StartupMetric startupInMultiSegmentTermDictionaryUpdates; - public final StartupMetric startupInWarmUp; - - // Kafka metrics - public final StartupMetric startupInLoadFlushedIndex; - public final StartupMetric startupInFreshStartup; - public final StartupMetric startupInIngestUntilCurrent; - public final StartupMetric startupInUserUpdatesStartup; - public final StartupMetric startupInUserEventIndexer; - public final StartupMetric startupInAudioSpaceEventIndexer; - - public SearchIndexingMetricSet(SearchStatsReceiver searchStatsReceiver) { - this.freshestTweetTimeMillis = searchStatsReceiver.getLongGauge( - "earlybird_freshest_tweet_timestamp_millis"); - this.highestStatusId = searchStatsReceiver.getLongGauge("highest_indexed_status_id"); - this.currentTimesliceId = searchStatsReceiver.getLongGauge("earlybird_current_timeslice_id"); - this.archiveTimeSliceBuildFailedCounter = searchStatsReceiver.getCounter( - "archive_time_slice_build_failed"); - this.segmentSizeCheckCount = searchStatsReceiver.getCounter("segment_size_check_count"); - this.maxSegmentSizeReachedCounter = searchStatsReceiver.getCounter("max_segment_reached"); - - this.statusStats = searchStatsReceiver.getTimerStats( - "index_status", TimeUnit.MICROSECONDS, false, false, false); - this.updateStats = searchStatsReceiver.getTimerStats( - "updates", TimeUnit.MICROSECONDS, false, false, false); - this.updateRetryStats = searchStatsReceiver.getTimerStats( - "update_retries", TimeUnit.MICROSECONDS, false, false, false); - this.userUpdateIndexingStats = searchStatsReceiver.getTimerStats( - "user_updates", TimeUnit.MICROSECONDS, false, false, false); - this.userScrubGeoIndexingStats = searchStatsReceiver.getTimerStats( - "user_scrub_geo", TimeUnit.MICROSECONDS, false, false, false); - this.updateOnMissingTweetCounter = searchStatsReceiver.getRateCounter( - "index_update_on_missing_tweet"); - this.droppedUpdateEvent = searchStatsReceiver.getRateCounter("dropped_update_event"); - - this.partitionIndexerRunLoopCounter = searchStatsReceiver.getTimerStats( - "partition_indexer_run_loop", TimeUnit.MICROSECONDS, false, true, false); - this.partitionIndexerIndexFromReadersCounter = searchStatsReceiver.getTimerStats( - "partition_indexer_indexFromReaders", TimeUnit.MICROSECONDS, false, true, false); - this.partitionIndexerIterationCounter = searchStatsReceiver.getCounter( - ScheduledExecutorManager.SCHEDULED_EXECUTOR_TASK_PREFIX + "PartitionIndexer"); - - this.simpleUpdateIndexerUnsortedUpdateCounter = searchStatsReceiver.getCounter( - "simple_update_indexer_unsorted_update_count"); - this.simpleUpdateIndexerUnsortedUpdateWithWrongSegmentCounter = searchStatsReceiver.getCounter( - "simple_update_indexer_unsorted_update_with_wrong_segment_count"); - - 
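Each StartupMetric exports a pair of gauges: a 0/1 flag that is raised while a startup phase is running, and a duration_of_* gauge recording how long the phase took. A minimal sketch of wrapping one phase, assuming the caller times it with a Guava Stopwatch (the phase body and choice of metric are illustrative):

    // Hedged sketch: bracketing one startup phase with a StartupMetric.
    void loadFlushedIndexWithMetric(SearchIndexingMetricSet metrics, Runnable loadFlushedIndex) {
      Stopwatch stopwatch = Stopwatch.createStarted();
      metrics.startupInLoadFlushedIndex.begin();   // exports startup_in_load_flushed_index = 1
      try {
        loadFlushedIndex.run();
      } finally {
        // Drops the flag back to 0 and records duration_of_startup_in_load_flushed_index.
        metrics.startupInLoadFlushedIndex.end(stopwatch.elapsed(TimeUnit.MILLISECONDS));
      }
    }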
this.simpleUserUpdateIndexerIterationCounter = searchStatsReceiver.getCounter( - ScheduledExecutorManager.SCHEDULED_EXECUTOR_TASK_PREFIX + "SimpleUserUpdateIndexer"); - - this.simpleSegmentIndexerExceptionCounter = searchStatsReceiver.getCounter( - "exception_while_indexing_segment"); - - for (ThriftIndexingEventType type : ThriftIndexingEventType.values()) { - AtomicLong freshness = new AtomicLong(0); - updateFreshness.put(type, freshness); - String statName = ("index_freshness_" + type + "_age_millis").toLowerCase(); - searchStatsReceiver.getCustomGauge(statName, - () -> System.currentTimeMillis() - freshness.get()); - } - - this.startupInProgress = new StartupMetric("startup_in_progress"); - this.startupInIndexCompletedSegments = new StartupMetric("startup_in_index_completed_segments"); - this.startupInLoadCompletedSegments = new StartupMetric("startup_in_load_completed_segments"); - this.startupInIndexUpdatesForCompletedSegments = - new StartupMetric("startup_in_index_updates_for_completed_segments"); - this.startupInCurrentSegment = new StartupMetric("startup_in_current_segment"); - this.startupInUserUpdates = new StartupMetric("startup_in_user_updates"); - this.startupInQueryCacheUpdates = new StartupMetric("startup_in_query_cache_updates"); - this.startupInMultiSegmentTermDictionaryUpdates = - new StartupMetric("startup_in_multi_segment_dictionary_updates"); - this.startupInWarmUp = new StartupMetric("startup_in_warm_up"); - - this.startupInLoadFlushedIndex = new StartupMetric("startup_in_load_flushed_index"); - this.startupInFreshStartup = new StartupMetric("startup_in_fresh_startup"); - this.startupInIngestUntilCurrent = new StartupMetric("startup_in_ingest_until_current"); - this.startupInUserUpdatesStartup = new StartupMetric("startup_in_user_updates_startup"); - this.startupInUserEventIndexer = new StartupMetric("startup_in_user_events_indexer"); - this.startupInAudioSpaceEventIndexer = - new StartupMetric("startup_in_audio_space_events_indexer"); - - searchStatsReceiver.getCustomGauge("earlybird_index_freshness_millis", - this::getIndexFreshnessInMillis); - - this.searchStatsReceiver = searchStatsReceiver; - } - - long getIndexFreshnessInMillis() { - return System.currentTimeMillis() - freshestTweetTimeMillis.get(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentHdfsFlusher.java b/src/java/com/twitter/search/earlybird/partition/SegmentHdfsFlusher.java deleted file mode 100644 index c048bfa6e..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentHdfsFlusher.java +++ /dev/null @@ -1,247 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.File; -import java.io.IOException; -import java.util.concurrent.TimeUnit; - -import org.apache.commons.io.FileUtils; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.FSDirectory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.base.Command; -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.util.io.flushable.PersistentFile; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; - -/** - * Flush segments to disk and upload them to HDFS. 
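The per-event-type freshness gauges registered in the constructor above only read back a stored timestamp; keeping that timestamp current is the writer's responsibility (the map is handed to SegmentWriter, whose code is not shown here). A hedged sketch of one plausible writer-side policy, keeping the maximum creation time seen per event type:

    // Hedged sketch: record the creation time of the freshest event applied for a given type.
    void recordFreshness(SearchIndexingMetricSet metrics,
                         ThriftIndexingEventType eventType,
                         long eventCreationTimeMillis) {
      // "index_freshness_<type>_age_millis" is exported as now minus this value, so storing
      // the newest creation time keeps the gauge current. Taking the max is one plausible
      // policy; the real SegmentWriter may do something slightly different.
      metrics.updateFreshness.get(eventType).accumulateAndGet(eventCreationTimeMillis, Math::max);
    }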
- */ -public class SegmentHdfsFlusher { - private static final Logger LOG = LoggerFactory.getLogger(SegmentHdfsFlusher.class); - private static final Amount HDFS_UPLOADER_TRY_LOCK_NODE_EXPIRATION_TIME_MILLIS = - Amount.of(1L, Time.HOURS); - - private final SegmentSyncConfig sync; - private final boolean holdLockWhileUploading; - private final ZooKeeperTryLockFactory zkTryLockFactory; - - public SegmentHdfsFlusher(ZooKeeperTryLockFactory zooKeeperTryLockFactory, - SegmentSyncConfig sync, - boolean holdLockWhileUploading) { - this.zkTryLockFactory = zooKeeperTryLockFactory; - this.sync = sync; - this.holdLockWhileUploading = holdLockWhileUploading; - } - - public SegmentHdfsFlusher( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - SegmentSyncConfig sync) { - this(zooKeeperTryLockFactory, sync, true); - } - - private boolean shouldFlushSegment(SegmentInfo segmentInfo) { - return segmentInfo.isEnabled() - && !segmentInfo.getSyncInfo().isFlushed() - && segmentInfo.isComplete() - && segmentInfo.isOptimized() - && !segmentInfo.isFailedOptimize() - && !segmentInfo.getSyncInfo().isLoaded(); - } - - /** - * Flushes a segment to local disk and to HDFS. - */ - public boolean flushSegmentToDiskAndHDFS(SegmentInfo segmentInfo) { - if (!shouldFlushSegment(segmentInfo)) { - return false; - } - try { - if (segmentInfo.isIndexing()) { - LOG.error("Tried to flush current segment!"); - return false; - } - - // Check-and-set the beingUploaded flag from false to true. If the CAS fails, it means the - // segment is being flushed already, or being deleted. In this case, we can just return false. - if (!segmentInfo.casBeingUploaded(false, true)) { - LOG.warn("Tried to flush a segment that's being flushed or deleted."); - return false; - } - - // At this point, the above CAS must have returned false. This mean the beingUploaded flag - // was false, and set to true now. We can proceed with flushing the segment. - try { - checkAndFlushSegmentToHdfs(segmentInfo); - } finally { - segmentInfo.setBeingUploaded(false); - } - return true; - } catch (Exception e) { - LOG.error("Exception while flushing IndexSegment to " - + segmentInfo.getSyncInfo().getHdfsFlushDir(), e); - return false; - } - } - - /** - * First try to acquire a lock in Zookeeper for this segment, so multiple Earlybirds in the same - * partition don't flush or upload the segment at the same time. When the lock is acquired, check - * for the segment in HDFS. If the data already exists, don't flush to disk. - */ - private void checkAndFlushSegmentToHdfs(final SegmentInfo segment) { - LOG.info("Checking and flushing segment {}", segment); - - try { - // Always flush the segment locally. - Directory dir = FSDirectory.open(createFlushDir(segment).toPath()); - segment.flush(dir); - LOG.info("Completed local flush of segment {}. Flush to HDFS enabled: {}", - segment, sync.isFlushToHdfsEnabled()); - } catch (IOException e) { - LOG.error("Failed to flush segment " + segment + " locally", e); - return; - } - - if (!holdLockWhileUploading) { - flushToHdfsIfNecessary(segment); - } else { - TryLock lock = zkTryLockFactory.createTryLock( - DatabaseConfig.getLocalHostname(), - sync.getZooKeeperSyncFullPath(), - sync.getVersionedName(segment.getSegment()), - HDFS_UPLOADER_TRY_LOCK_NODE_EXPIRATION_TIME_MILLIS - ); - - boolean gotLock = lock.tryWithLock((Command) () -> flushToHdfsIfNecessary(segment)); - if (!gotLock) { - LOG.info("Failed to get zk upload lock for segment {}", segment); - } - } - } - - /** - * Check whether the segment has already been flushed to HDFS. 
If not, flush the segment to disk - * and upload the files to HDFS. - * - * If the ZK lock isn't used, there is a race between the existence check and the upload (in - * which another Earlybird can sneak in and upload the segment), so we will potentially upload - * the same segment from different hosts. Thus, the Earlybird hostname is part of the segment's - * path on HDFS. - */ - private void flushToHdfsIfNecessary(SegmentInfo segmentInfo) { - Timer timer = new Timer(TimeUnit.MILLISECONDS); - String status = "flushed"; - try (FileSystem fs = HdfsUtil.getHdfsFileSystem()) { - // If we can't load segments from HDFS, don't bother checking HDFS for the segment - if (sync.isSegmentLoadFromHdfsEnabled() - && (segmentInfo.getSyncInfo().isFlushed() - || HdfsUtil.segmentExistsOnHdfs(fs, segmentInfo))) { - status = "existing"; - } else if (sync.isFlushToHdfsEnabled()) { - copyLocalFilesToHdfs(fs, segmentInfo); - status = "uploaded"; - } - - // whether we uploaded, or someone else did, this segment should now be on HDFS. If - // uploading to HDFS is disabled, we still consider it complete. - segmentInfo.getSyncInfo().setFlushed(true); - } catch (IOException e) { - LOG.error("Failed copying segment {} to HDFS after {} ms", segmentInfo, timer.stop(), e); - status = "exception"; - } finally { - if (timer.running()) { - timer.stop(); - } - LOG.info("Flush of segment {} to HDFS completed in {} milliseconds. Status: {}", - segmentInfo, timer.getElapsed(), status); - } - } - - /** - * Copy local segment files to HDFS. Files are first copied into a temporary directory - * in the form _ and when all the files are written out to HDFS, - * the dir is renamed to _, where it is accessible to other Earlybirds. - */ - private void copyLocalFilesToHdfs(FileSystem fs, SegmentInfo segment) throws IOException { - String hdfsTempBaseDir = segment.getSyncInfo().getHdfsTempFlushDir(); - - // If the temp dir already exists on HDFS, a prior flush must have been interrupted. - // Delete it and start fresh. - removeHdfsTempDir(fs, hdfsTempBaseDir); - - for (String fileName : sync.getAllSyncFileNames(segment)) { - String hdfsFileName = hdfsTempBaseDir + "/" + fileName; - String localBaseDir = segment.getSyncInfo().getLocalSyncDir(); - String localFileName = localBaseDir + "/" + fileName; - - LOG.debug("About to start copying {} to HDFS, from {} to {}", - fileName, localFileName, hdfsFileName); - Timer timer = new Timer(TimeUnit.MILLISECONDS); - fs.copyFromLocalFile(new Path(localFileName), new Path(hdfsFileName)); - LOG.debug("Completed copying {} to HDFS, from {} to {}, in {} ms", - fileName, localFileName, hdfsFileName, timer.stop()); - } - - // now let's rename the dir into its proper form. 
- String hdfsBaseDir = segment.getSyncInfo().getHdfsFlushDir(); - if (fs.rename(new Path(hdfsTempBaseDir), new Path(hdfsBaseDir))) { - LOG.info("Renamed segment dir on HDFS from {} to {}", hdfsTempBaseDir, hdfsBaseDir); - } else { - String errorMessage = String.format("Failed to rename segment dir on HDFS from %s to %s", - hdfsTempBaseDir, hdfsBaseDir); - LOG.error(errorMessage); - - removeHdfsTempDir(fs, hdfsTempBaseDir); - - // Throw an IOException so the calling code knows that the copy failed - throw new IOException(errorMessage); - } - } - - private void removeHdfsTempDir(FileSystem fs, String tempDir) throws IOException { - Path tempDirPath = new Path(tempDir); - if (fs.exists(tempDirPath)) { - LOG.info("Found existing temporary flush dir {} on HDFS, removing", tempDir); - if (!fs.delete(tempDirPath, true /* recursive */)) { - LOG.error("Failed to delete temp dir {}", tempDir); - } - } - } - - // Create or replace the local flush directory - private File createFlushDir(SegmentInfo segmentInfo) throws IOException { - final String flushDirStr = segmentInfo.getSyncInfo().getLocalSyncDir(); - - File flushDir = new File(flushDirStr); - if (flushDir.exists()) { - // Delete just the flushed persistent files if they are there. - // We may also have the lucene on-disk indexed in the same dir here, - // that we do not want to delete. - for (String persistentFile : sync.getPersistentFileNames(segmentInfo)) { - for (String fileName : PersistentFile.getAllFileNames(persistentFile)) { - File file = new File(flushDir, fileName); - if (file.exists()) { - LOG.info("Deleting incomplete flush file {}", file.getAbsolutePath()); - FileUtils.forceDelete(file); - } - } - } - return flushDir; - } - - // Try to create the flush directory - if (!flushDir.mkdirs()) { - throw new IOException("Not able to create segment flush directory \"" + flushDirStr + "\""); - } - - return flushDir; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentIndexStats.java b/src/java/com/twitter/search/earlybird/partition/SegmentIndexStats.java deleted file mode 100644 index ead096fbb..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentIndexStats.java +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.Optional; -import java.util.concurrent.atomic.AtomicInteger; -import java.util.concurrent.atomic.AtomicLong; - -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; - -public class SegmentIndexStats { - private EarlybirdIndexSegmentData segmentData; - - private final AtomicLong indexSizeOnDiskInBytes = new AtomicLong(0); - private final AtomicInteger partialUpdateCount = new AtomicInteger(0); - private final AtomicInteger outOfOrderUpdateCount = new AtomicInteger(0); - - private Optional savedStatusCount = Optional.empty(); - private Optional savedDeletesCount = Optional.empty(); - - public void setSegmentData(EarlybirdIndexSegmentData segmentData) { - this.segmentData = segmentData; - } - - /** - * We'd like to be able to return the last counts after we unload a segment from memory. - */ - public void unsetSegmentDataAndSaveCounts() { - savedStatusCount = Optional.of(getStatusCount()); - savedDeletesCount = Optional.of(getDeleteCount()); - segmentData = null; - } - - /** - * Returns the number of deletes processed by this segment. 
- */ - public int getDeleteCount() { - if (segmentData != null) { - return segmentData.getDeletedDocs().numDeletions(); - } else { - return savedDeletesCount.orElse(0); - } - } - - /** - * Return the number of documents in this segment. - */ - public int getStatusCount() { - if (segmentData != null) { - return segmentData.numDocs(); - } else { - return savedStatusCount.orElse(0); - } - } - - public long getIndexSizeOnDiskInBytes() { - return indexSizeOnDiskInBytes.get(); - } - - public void setIndexSizeOnDiskInBytes(long value) { - indexSizeOnDiskInBytes.set(value); - } - - public int getPartialUpdateCount() { - return partialUpdateCount.get(); - } - - public void incrementPartialUpdateCount() { - partialUpdateCount.incrementAndGet(); - } - - public void setPartialUpdateCount(int value) { - partialUpdateCount.set(value); - } - - public int getOutOfOrderUpdateCount() { - return outOfOrderUpdateCount.get(); - } - - public void incrementOutOfOrderUpdateCount() { - outOfOrderUpdateCount.incrementAndGet(); - } - - public void setOutOfOrderUpdateCount(int value) { - outOfOrderUpdateCount.set(value); - } - - @Override - public String toString() { - StringBuilder sb = new StringBuilder(); - sb.append("Indexed ").append(getStatusCount()).append(" documents, "); - sb.append(getDeleteCount()).append(" deletes, "); - sb.append(getPartialUpdateCount()).append(" partial updates, "); - sb.append(getOutOfOrderUpdateCount()).append(" out of order udpates. "); - sb.append("Index size: ").append(getIndexSizeOnDiskInBytes()); - return sb.toString(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentIndexStatsExporter.java b/src/java/com/twitter/search/earlybird/partition/SegmentIndexStatsExporter.java deleted file mode 100644 index dca484e4c..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentIndexStatsExporter.java +++ /dev/null @@ -1,85 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.twitter.common.base.Supplier; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchMetric; -import com.twitter.search.common.metrics.SearchMetricsRegistry; - -/** - * Exporting per-segment stats collected in {@link SegmentIndexStats}. - * - * This class tries to reuse stat prefixes of "segment_stats_[0-N]_*" where N is the number - * of segments managed by this earlybird. - * For example, stats prefixed with "segment_stats_0_*" always represent the most recent segment. - * As we add more segments (and drop older ones), the same "segment_stats_*" stats end up exporting - * data for different underlying segments. - * - * This is done as an alternative to exporting stats that have the timesliceId in them, which - * would avoid the need for reusing the same stat names, but would create an ever-increasing set - * of unique stats exported by earlybirds. 
- */
-public final class SegmentIndexStatsExporter {
-  private static final class StatReader extends SearchMetric<Long> {
-    private volatile Supplier<? extends Number> counter = () -> 0;
-
-    private StatReader(String name) {
-      super(name);
-    }
-
-    @Override
-    public Long read() {
-      return counter.get().longValue();
-    }
-
-    @Override
-    public void reset() {
-      counter = () -> 0;
-    }
-  }
-
-  private SegmentIndexStatsExporter() {
-  }
-
-  private static final String NAME_PREFIX = "segment_stats_";
-
-  /**
-   * Exports stats for some counts for the given segment:
-   * - status_count: number of tweets indexed
-   * - delete_count: number of deletes indexed
-   * - partial_update_count: number of partial updates indexed
-   * - out_of_order_update_count: number of out of order updates indexed
-   * - segment_size_bytes: the segment size in bytes
-   *
-   * @param segmentInfo The segment for which these stats should be exported.
-   * @param segmentIndex The index of this segment in the list of all segments.
-   */
-  public static void export(SegmentInfo segmentInfo, int segmentIndex) {
-    exportStat(segmentIndex, "status_count",
-        () -> segmentInfo.getIndexStats().getStatusCount());
-    exportStat(segmentIndex, "delete_count",
-        () -> segmentInfo.getIndexStats().getDeleteCount());
-    exportStat(segmentIndex, "partial_update_count",
-        () -> segmentInfo.getIndexStats().getPartialUpdateCount());
-    exportStat(segmentIndex, "out_of_order_update_count",
-        () -> segmentInfo.getIndexStats().getOutOfOrderUpdateCount());
-    exportStat(segmentIndex, "segment_size_bytes",
-        () -> segmentInfo.getIndexStats().getIndexSizeOnDiskInBytes());
-
-    SearchLongGauge timeSliceIdStat =
-        SearchLongGauge.export(NAME_PREFIX + segmentIndex + "_timeslice_id");
-    timeSliceIdStat.set(segmentInfo.getTimeSliceID());
-  }
-
-  private static void exportStat(final int segmentIndex,
-                                 final String nameSuffix,
-                                 Supplier<? extends Number> counter) {
-    final String name = getName(segmentIndex, nameSuffix);
-    StatReader statReader = SearchMetricsRegistry.registerOrGet(
-        () -> new StatReader(name), name, StatReader.class);
-    statReader.counter = counter;
-  }
-
-  private static String getName(final int segmentIndex, final String nameSuffix) {
-    return NAME_PREFIX + segmentIndex + "_" + nameSuffix;
-  }
-}
diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentInfo.java b/src/java/com/twitter/search/earlybird/partition/SegmentInfo.java
deleted file mode 100644
index 684cf2bbe..000000000
--- a/src/java/com/twitter/search/earlybird/partition/SegmentInfo.java
+++ /dev/null
@@ -1,428 +0,0 @@
-package com.twitter.search.earlybird.partition;
-
-import java.io.File;
-import java.io.IOException;
-import java.io.OutputStreamWriter;
-import java.util.concurrent.atomic.AtomicBoolean;
-import java.util.concurrent.atomic.AtomicInteger;
-import java.util.concurrent.atomic.AtomicLong;
-
-import com.google.common.annotations.VisibleForTesting;
-import com.google.common.base.Preconditions;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.lucene.store.Directory;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.twitter.common.collections.Pair;
-import com.twitter.search.common.partitioning.base.Segment;
-import com.twitter.search.common.partitioning.base.TimeSlice;
-import com.twitter.search.common.schema.earlybird.FlushVersion;
-import com.twitter.search.common.util.LogFormatUtil;
-import com.twitter.search.common.util.io.flushable.FlushInfo;
-import com.twitter.search.common.util.io.flushable.PersistentFile;
-import com.twitter.search.earlybird.EarlybirdIndexConfig;
-import com.twitter.search.earlybird.common.config.EarlybirdConfig;
-import com.twitter.search.earlybird.index.EarlybirdSegment;
-import com.twitter.search.earlybird.index.EarlybirdSegmentFactory;
-
-public class SegmentInfo implements Comparable<SegmentInfo> {
-  private static final Logger LOG = LoggerFactory.getLogger(SegmentInfo.class);
-
-  private static final String UPDATE_STREAM_OFFSET_TIMESTAMP = "updateStreamOffsetTimestamp";
-  public static final int INVALID_ID = -1;
-
-  // Delay before deleting a segment
-  private final long timeToWaitBeforeClosingMillis = EarlybirdConfig.getLong(
-      "defer_index_closing_time_millis", 600000L);
-  // How many times deletions are retried.
-  private final AtomicInteger deletionRetries = new AtomicInteger(5);
-
-  // Base segment information, including database name, minStatusId.
-  private final Segment segment;
-
-  // Bits managed by various SegmentProcessors and PartitionManager.
-  private volatile boolean isEnabled = true; // True if the segment is enabled.
-  private volatile boolean isIndexing = false; // True during indexing.
-  private volatile boolean isComplete = false; // True when indexing is complete.
-  private volatile boolean isClosed = false; // True if indexSegment is closed.
-  private volatile boolean wasIndexed = false; // True if the segment was indexed from scratch.
-  private volatile boolean failedOptimize = false; // optimize attempt failed.
-  private AtomicBoolean beingUploaded = new AtomicBoolean(); // segment is being copied to HDFS
-
-  private final SegmentSyncInfo segmentSyncInfo;
-  private final EarlybirdIndexConfig earlybirdIndexConfig;
-
-  private final EarlybirdSegment indexSegment;
-
-  private final AtomicLong updatesStreamOffsetTimestamp = new AtomicLong(0);
-
-  public SegmentInfo(Segment segment,
-                     EarlybirdSegmentFactory earlybirdSegmentFactory,
-                     SegmentSyncConfig syncConfig) throws IOException {
-    this(segment, earlybirdSegmentFactory, new SegmentSyncInfo(syncConfig, segment));
-  }
-
-  @VisibleForTesting
-  public SegmentInfo(Segment segment,
-                     EarlybirdSegmentFactory earlybirdSegmentFactory,
-                     SegmentSyncInfo segmentSyncInfo) throws IOException {
-    this(earlybirdSegmentFactory.newEarlybirdSegment(segment, segmentSyncInfo),
-        segmentSyncInfo,
-        segment,
-        earlybirdSegmentFactory.getEarlybirdIndexConfig());
-  }
-
-  public SegmentInfo(
-      EarlybirdSegment earlybirdSegment,
-      SegmentSyncInfo segmentSyncInfo,
-      Segment segment,
-      EarlybirdIndexConfig earlybirdIndexConfig
-  ) {
-    this.indexSegment = earlybirdSegment;
-    this.segmentSyncInfo = segmentSyncInfo;
-    this.earlybirdIndexConfig = earlybirdIndexConfig;
-    this.segment = segment;
-  }
-
-  public EarlybirdSegment getIndexSegment() {
-    return indexSegment;
-  }
-
-  public SegmentIndexStats getIndexStats() {
-    return indexSegment.getIndexStats();
-  }
-
-  public EarlybirdIndexConfig getEarlybirdIndexConfig() {
-    return earlybirdIndexConfig;
-  }
-
-  public long getTimeSliceID() {
-    return segment.getTimeSliceID();
-  }
-
-  public String getSegmentName() {
-    return segment.getSegmentName();
-  }
-
-  public int getNumPartitions() {
-    return segment.getNumHashPartitions();
-  }
-
-  public boolean isEnabled() {
-    return isEnabled;
-  }
-
-  public void setIsEnabled(boolean isEnabled) {
-    this.isEnabled = isEnabled;
-  }
-
-  public boolean isOptimized() {
-    return indexSegment.isOptimized();
-  }
-
-  public boolean wasIndexed() {
-    return wasIndexed;
-  }
-
-  public void setWasIndexed(boolean wasIndexed) {
-    this.wasIndexed = wasIndexed;
-  }
-
-  public boolean isFailedOptimize() {
-    return
failedOptimize; - } - - public void setFailedOptimize() { - this.failedOptimize = true; - } - - public boolean isIndexing() { - return isIndexing; - } - - public void setIndexing(boolean indexing) { - this.isIndexing = indexing; - } - - public boolean isComplete() { - return isComplete; - } - - public boolean isClosed() { - return isClosed; - } - - public boolean isBeingUploaded() { - return beingUploaded.get(); - } - - public void setBeingUploaded(boolean beingUploaded) { - this.beingUploaded.set(beingUploaded); - } - - public boolean casBeingUploaded(boolean expectation, boolean updateValue) { - return beingUploaded.compareAndSet(expectation, updateValue); - } - - @VisibleForTesting - public void setComplete(boolean complete) { - this.isComplete = complete; - } - - public boolean needsIndexing() { - return isEnabled && !isIndexing && !isComplete; - } - - @Override - public int compareTo(SegmentInfo other) { - return Long.compare(getTimeSliceID(), other.getTimeSliceID()); - } - - @Override - public boolean equals(Object obj) { - return obj instanceof SegmentInfo && compareTo((SegmentInfo) obj) == 0; - } - - @Override - public int hashCode() { - return new Long(getTimeSliceID()).hashCode(); - } - - public long getUpdatesStreamOffsetTimestamp() { - return updatesStreamOffsetTimestamp.get(); - } - - public void setUpdatesStreamOffsetTimestamp(long timestamp) { - updatesStreamOffsetTimestamp.set(timestamp); - } - - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - builder.append(getSegmentName()).append(" ["); - builder.append(isEnabled ? "enabled, " : "disabled, "); - - if (isIndexing) { - builder.append("indexing, "); - } - - if (isComplete) { - builder.append("complete, "); - } - - if (isOptimized()) { - builder.append("optimized, "); - } - - if (wasIndexed) { - builder.append("wasIndexed, "); - } - - builder.append("IndexSync:"); - this.segmentSyncInfo.addDebugInfo(builder); - - return builder.append("]").toString(); - } - - public Segment getSegment() { - return segment; - } - - /** - * Delete the index segment directory corresponding to this segment info. Return true if deleted - * successfully; otherwise, false. - */ - public boolean deleteLocalIndexedSegmentDirectoryImmediately() { - if (isClosed) { - LOG.info("SegmentInfo is already closed: " + toString()); - return true; - } - - Preconditions.checkNotNull(indexSegment, "indexSegment should never be null."); - isClosed = true; - indexSegment.destroyImmediately(); - - SegmentSyncConfig sync = getSyncInfo().getSegmentSyncConfig(); - try { - String dirToClear = sync.getLocalSyncDirName(segment); - FileUtils.forceDelete(new File(dirToClear)); - LOG.info("Deleted segment directory: " + toString()); - return true; - } catch (IOException e) { - LOG.error("Cannot clean up segment directory for segment: " + toString(), e); - return false; - } - } - - /** - * Delete the index segment directory after some configured delay. - * Note that we don't delete segments that are being uploaded. - * If a segment is being uploaded when we try to delete, close() retries the deletion later. - */ - public void deleteIndexSegmentDirectoryAfterDelay() { - LOG.info("Scheduling SegmentInfo for deletion: " + toString()); - getEarlybirdIndexConfig().getResourceCloser().closeResourceQuietlyAfterDelay( - timeToWaitBeforeClosingMillis, () -> { - // Atomically check and set the being uploaded flag, if it is not set. 
- if (beingUploaded.compareAndSet(false, true)) { - // If successfully set the flag to true, we can delete immediately - setIsEnabled(false); - deleteLocalIndexedSegmentDirectoryImmediately(); - LOG.info("Deleted index segment dir for segment: " - + getSegment().getSegmentName()); - } else { - // If the flag is already true (compareAndSet fails), we need to reschedule. - if (deletionRetries.decrementAndGet() > 0) { - LOG.warn("Segment is being uploaded, will retry deletion later. SegmentInfo: " - + getSegment().getSegmentName()); - deleteIndexSegmentDirectoryAfterDelay(); - } else { - LOG.warn("Failed to cleanup index segment dir for segment: " - + getSegment().getSegmentName()); - } - } - }); - } - - public SegmentSyncInfo getSyncInfo() { - return segmentSyncInfo; - } - - public FlushVersion getFlushVersion() { - return FlushVersion.CURRENT_FLUSH_VERSION; - } - - public String getZkNodeName() { - return getSegmentName() + getFlushVersion().getVersionFileExtension(); - } - - static String getSyncDirName(String parentDir, String dbName, String version) { - return parentDir + "/" + dbName + version; - } - - /** - * Parses the segment name from the name of the flushed directory. - */ - public static String getSegmentNameFromFlushedDir(String flushedDir) { - String segmentName = null; - String[] fields = flushedDir.split("/"); - if (fields.length > 0) { - segmentName = fields[fields.length - 1]; - segmentName = segmentName.replaceAll(FlushVersion.DELIMITER + ".*", ""); - } - return segmentName; - } - - /** - * Flushes this segment to the given directory. - * - * @param dir The directory to flush the segment to. - * @throws IOException If the segment could not be flushed. - */ - public void flush(Directory dir) throws IOException { - LOG.info("Flushing segment: {}", getSegmentName()); - try (PersistentFile.Writer writer = PersistentFile.getWriter(dir, getSegmentName())) { - FlushInfo flushInfo = new FlushInfo(); - flushInfo.addLongProperty(UPDATE_STREAM_OFFSET_TIMESTAMP, getUpdatesStreamOffsetTimestamp()); - getIndexSegment().flush(flushInfo, writer.getDataSerializer()); - - OutputStreamWriter infoFileWriter = new OutputStreamWriter(writer.getInfoFileOutputStream()); - FlushInfo.flushAsYaml(flushInfo, infoFileWriter); - } - } - - /** - * Makes a new SegmentInfo out of the current segment info, except that we switch the underlying - * segment. - */ - public SegmentInfo copyWithEarlybirdSegment(EarlybirdSegment optimizedSegment) { - // Take everything from the current segment info that doesn't change for the new segment - // info and rebuild everything that can change. - TimeSlice newTimeSlice = new TimeSlice( - getTimeSliceID(), - EarlybirdConfig.getMaxSegmentSize(), - segment.getHashPartitionID(), - segment.getNumHashPartitions() - ); - Segment newSegment = newTimeSlice.getSegment(); - - return new SegmentInfo( - optimizedSegment, - new SegmentSyncInfo( - segmentSyncInfo.getSegmentSyncConfig(), - newSegment), - newSegment, - earlybirdIndexConfig - ); - } - - /** - * Loads the segment from the given directory. - * - * @param dir The directory to load the segment from. - * @throws IOException If the segment could not be loaded. 
- */ - public void load(Directory dir) throws IOException { - LOG.info("Loading segment: {}", getSegmentName()); - try (PersistentFile.Reader reader = PersistentFile.getReader(dir, getSegmentName())) { - FlushInfo flushInfo = FlushInfo.loadFromYaml(reader.getInfoInputStream()); - setUpdatesStreamOffsetTimestamp(flushInfo.getLongProperty(UPDATE_STREAM_OFFSET_TIMESTAMP)); - getIndexSegment().load(reader.getDataInputStream(), flushInfo); - } - } - - private String getShortStatus() { - if (!isEnabled()) { - return "disabled"; - } - - if (isIndexing()) { - return "indexing"; - } - - if (isComplete()) { - return "indexed"; - } - - return "pending"; - } - - /** - * Get a string to be shown in admin commands which shows the query caches' sizes for this - * segment. - */ - public String getQueryCachesData() { - StringBuilder out = new StringBuilder(); - out.append("Segment: " + getSegmentName() + "\n"); - out.append("Total documents: " + LogFormatUtil.formatInt( - getIndexStats().getStatusCount()) + "\n"); - out.append("Query caches:\n"); - for (Pair data : indexSegment.getQueryCachesData()) { - out.append(" " + data.getFirst()); - out.append(": "); - out.append(LogFormatUtil.formatInt(data.getSecond())); - out.append("\n"); - } - return out.toString(); - } - - public String getSegmentMetadata() { - return "status: " + getShortStatus() + "\n" - + "id: " + getTimeSliceID() + "\n" - + "name: " + getSegmentName() + "\n" - + "statusCount: " + getIndexStats().getStatusCount() + "\n" - + "deleteCount: " + getIndexStats().getDeleteCount() + "\n" - + "partialUpdateCount: " + getIndexStats().getPartialUpdateCount() + "\n" - + "outOfOrderUpdateCount: " + getIndexStats().getOutOfOrderUpdateCount() + "\n" - + "isEnabled: " + isEnabled() + "\n" - + "isIndexing: " + isIndexing() + "\n" - + "isComplete: " + isComplete() + "\n" - + "isFlushed: " + getSyncInfo().isFlushed() + "\n" - + "isOptimized: " + isOptimized() + "\n" - + "isLoaded: " + getSyncInfo().isLoaded() + "\n" - + "wasIndexed: " + wasIndexed() + "\n" - + "queryCachesCardinality: " + indexSegment.getQueryCachesCardinality() + "\n"; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentLoader.java b/src/java/com/twitter/search/earlybird/partition/SegmentLoader.java deleted file mode 100644 index f36233073..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentLoader.java +++ /dev/null @@ -1,300 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.File; -import java.io.IOException; -import java.util.concurrent.TimeUnit; - -import org.apache.commons.io.FileUtils; -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.FileStatus; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.FSDirectory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.util.io.flushable.PersistentFile; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.FlushVersionMismatchException; -import com.twitter.search.earlybird.stats.SegmentSyncStats; - -public class SegmentLoader { - private static final Logger LOG = LoggerFactory.getLogger(SegmentLoader.class); - private static final 
SegmentSyncStats SEGMENT_LOAD_FROM_HDFS_STATS = - new SegmentSyncStats("load_from_hdfs"); - - private final CriticalExceptionHandler criticalExceptionHandler; - private final SegmentSyncConfig segmentSyncConfig; - - private final Clock clock; - - public SegmentLoader(SegmentSyncConfig sync, - CriticalExceptionHandler criticalExceptionHandler) { - this(sync, criticalExceptionHandler, Clock.SYSTEM_CLOCK); - } - - public SegmentLoader(SegmentSyncConfig sync, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - this.criticalExceptionHandler = criticalExceptionHandler; - this.segmentSyncConfig = sync; - this.clock = clock; - } - - public boolean load(SegmentInfo segmentInfo) { - return downloadSegment(segmentInfo) && loadSegmentFromDisk(segmentInfo); - } - - /** - * Determines if the Earlybird should attempt to download the given segment from HDFS. This - * returns true if the segment is not already present on local disk, and the segment does exist - * on HDFS. - */ - public boolean shouldDownloadSegmentWhileInServerSet(SegmentInfo segmentInfo) { - if (isValidSegmentOnDisk(segmentInfo)) { - return false; - } - try (FileSystem fs = HdfsUtil.getHdfsFileSystem()) { - return HdfsUtil.segmentExistsOnHdfs(fs, segmentInfo); - } catch (IOException e) { - LOG.error("Failed to check HDFS for segment " + segmentInfo, e); - return false; - } - } - - /** - * Verifies if the data for the given segment is present on the local disk, and if it's not, - * downloads it from HDFS. - */ - public boolean downloadSegment(SegmentInfo segmentInfo) { - if (!segmentInfo.isEnabled()) { - LOG.debug("Segment is disabled: " + segmentInfo); - return false; - } - - if (segmentInfo.isIndexing() || segmentInfo.getSyncInfo().isLoaded()) { - LOG.debug("Cannot load indexing or loaded segment: " + segmentInfo); - return false; - } - - // Return whether the appropriate version is on disk, and if not, download it from HDFS. - return isValidSegmentOnDisk(segmentInfo) || checkSegmentOnHdfsAndCopyLocally(segmentInfo); - } - - /** - * Loads the data for the given segment from the local disk. - */ - public boolean loadSegmentFromDisk(SegmentInfo segmentInfo) { - if (segmentInfo.isIndexing()) { - LOG.error("Tried to load current segment!"); - return false; - } - - segmentInfo.setIndexing(true); - try { - File flushDir = new File(segmentInfo.getSyncInfo().getLocalSyncDir()); - Directory loadDir = FSDirectory.open(flushDir.toPath()); - - segmentInfo.load(loadDir); - - if (!verifySegmentStatusCountLargeEnough(segmentInfo)) { - SearchRateCounter.export( - "segment_loader_failed_too_few_tweets_in_segment_" + segmentInfo.getSegmentName()) - .increment(); - return false; - } - - segmentInfo.setIndexing(false); - segmentInfo.setComplete(true); - segmentInfo.getSyncInfo().setLoaded(true); - return true; - } catch (FlushVersionMismatchException e) { - handleException(segmentInfo, e); - // If earlybird is in starting state, handler will terminate it - criticalExceptionHandler.handle(this, e); - } catch (Exception e) { - handleException(segmentInfo, e); - } - - SearchRateCounter.export("segment_loader_failed_" + segmentInfo.getSegmentName()).increment(); - return false; - } - - // Check to see if the segment exists on disk, and its checksum passes. 
- private boolean isValidSegmentOnDisk(SegmentInfo segment) { - String loadDirStr = segment.getSyncInfo().getLocalSyncDir(); - File loadDir = new File(loadDirStr); - - if (!loadDir.exists()) { - return false; - } - - for (String persistentFileName : segmentSyncConfig.getPersistentFileNames(segment)) { - if (!verifyInfoChecksum(loadDir, persistentFileName)) { - return false; - } - } - - return true; - } - - private static boolean verifyInfoChecksum(File loadDir, String databaseName) { - if (checksumFileExists(loadDir, databaseName)) { - try { - Directory dir = FSDirectory.open(loadDir.toPath()); - PersistentFile.Reader reader = PersistentFile.getReader(dir, databaseName); - try { - reader.verifyInfoChecksum(); - return true; - } finally { - IOUtils.closeQuietly(reader); - IOUtils.closeQuietly(dir); - } - } catch (PersistentFile.CorruptFileException e) { - LOG.error("Failed checksum verification.", e); - } catch (IOException e) { - LOG.error("Error while trying to read checksum file", e); - } - } - return false; - } - - // Check that the loaded segment's status count is higher than the configured threshold - private boolean verifySegmentStatusCountLargeEnough(SegmentInfo segmentInfo) { - long segmentStatusCount = segmentInfo.getIndexStats().getStatusCount(); - if (segmentStatusCount > segmentSyncConfig.getMinSegmentStatusCountThreshold()) { - return true; - } else if (segmentInfo.getEarlybirdIndexConfig().isIndexStoredOnDisk() - && couldBeMostRecentArchiveSegment(segmentInfo)) { - // The most recent archive earlybird segment is expected to be incomplete - LOG.info("Segment status count (" + segmentStatusCount + ") is below the threshold of " - + segmentSyncConfig.getMinSegmentStatusCountThreshold() - + ", but this is expected because the most recent segment is expected to be incomplete: " - + segmentInfo); - return true; - } else { - // The segment status count is small so the segment is likely incomplete. - LOG.error("Segment status count (" + segmentStatusCount + ") is below the threshold of " - + segmentSyncConfig.getMinSegmentStatusCountThreshold() + ": " + segmentInfo); - segmentInfo.setIndexing(false); - segmentInfo.getSyncInfo().setLoaded(false); - - // Remove segment from local disk - if (!segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately()) { - LOG.error("Failed to cleanup unloadable segment directory."); - } - - return false; - } - } - - // Check if this segment could be the most recent archive earlybird segment (would be on the - // latest tier). Archive segments tend to span around 12 days, so using a conservative threshold - // of 20 days. - private boolean couldBeMostRecentArchiveSegment(SegmentInfo segmentInfo) { - long timesliceAgeMs = - SnowflakeIdParser.getTweetAgeInMs(clock.nowMillis(), segmentInfo.getTimeSliceID()); - return (timesliceAgeMs / 1000 / 60 / 60 / 24) <= 20; - } - - /** - * Check to see if the segment exists on hdfs. Will look for the correct segment version - * uploaded by any of the hosts. - * If the segment exists on hdfs, the segment will be copied from hdfs to the local file - * system, and we will verify the checksum against the copied version. - * @return true iff the segment was copied to local disk, and the checksum is verified. 
- */ - private boolean checkSegmentOnHdfsAndCopyLocally(SegmentInfo segment) { - if (!segmentSyncConfig.isSegmentLoadFromHdfsEnabled()) { - return isValidSegmentOnDisk(segment); - } - - LOG.info("About to start downloading segment from hdfs: " + segment); - Timer timer = new Timer(TimeUnit.MILLISECONDS); - String status = null; - String localBaseDir = segment.getSyncInfo().getLocalSyncDir(); - FileSystem fs = null; - try { - fs = HdfsUtil.getHdfsFileSystem(); - - String hdfsBaseDirPrefix = segment.getSyncInfo().getHdfsSyncDirPrefix(); - FileStatus[] statuses = fs.globStatus(new Path(hdfsBaseDirPrefix)); - if (statuses != null && statuses.length > 0) { - Path hdfsSyncPath = statuses[0].getPath(); - copySegmentFilesFromHdfs(segment, segmentSyncConfig, fs, hdfsSyncPath); - status = "loaded"; - } else { - LOG.info("No segments found in hdfs under: " + hdfsBaseDirPrefix); - status = "notloaded"; - } - fs.close(); - } catch (IOException ex) { - LOG.error("Failed copying segment from hdfs: " + segment + " after: " - + timer.stop() + " ms", ex); - status = "exception"; - SEGMENT_LOAD_FROM_HDFS_STATS.recordError(); - try { - FileUtils.deleteDirectory(new File(localBaseDir)); - } catch (IOException e) { - LOG.error("Error cleaning up local segment directory: " + segment, e); - } - } finally { - timer.stop(); - SEGMENT_LOAD_FROM_HDFS_STATS.actionComplete(timer); - LOG.info("Download from hdfs completed in " - + timer.getElapsed() + " milliseconds: " + segment + " status: " + status); - IOUtils.closeQuietly(fs); - } - - // now check to see if we have successfully copied the segment - return isValidSegmentOnDisk(segment); - } - - private static void copySegmentFilesFromHdfs(SegmentInfo segment, - SegmentSyncConfig syncConfig, - FileSystem fs, - Path hdfsSyncPath) throws IOException { - String localBaseDir = segment.getSyncInfo().getLocalSyncDir(); - File localBaseDirFile = new File(localBaseDir); - FileUtils.deleteQuietly(localBaseDirFile); - if (localBaseDirFile.exists()) { - LOG.warn("Cannot delete the existing path: " + localBaseDir); - } - for (String fileName : syncConfig.getAllSyncFileNames(segment)) { - Path hdfsFilePath = new Path(hdfsSyncPath, fileName); - String localFileName = localBaseDir + "/" + fileName; - LOG.debug("About to start loading from hdfs: " + fileName + " from: " - + hdfsFilePath + " to: " + localFileName); - - Timer timer = new Timer(TimeUnit.MILLISECONDS); - fs.copyToLocalFile(hdfsFilePath, new Path(localFileName)); - LOG.debug("Loaded segment file from hdfs: " + fileName + " from: " - + hdfsFilePath + " to: " + localFileName + " in: " + timer.stop() + " ms."); - } - - LOG.info("Finished downloading segments from " + hdfsSyncPath); - } - - private static boolean checksumFileExists(File loadDir, String databaseName) { - String checksumFileName = PersistentFile.genChecksumFileName(databaseName); - File checksumFile = new File(loadDir, checksumFileName); - - return checksumFile.exists(); - } - - private void handleException(SegmentInfo segmentInfo, Exception e) { - LOG.error("Exception while loading IndexSegment from " - + segmentInfo.getSyncInfo().getLocalSyncDir(), e); - - segmentInfo.setIndexing(false); - segmentInfo.getSyncInfo().setLoaded(false); - if (!segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately()) { - LOG.error("Failed to cleanup unloadable segment directory."); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentManager.java b/src/java/com/twitter/search/earlybird/partition/SegmentManager.java deleted file mode 100644 index 
69736da8d..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentManager.java +++ /dev/null @@ -1,822 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collection; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashSet; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.ConcurrentSkipListMap; -import java.util.stream.Collectors; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.partitioning.base.TimeSlice; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.common.CaughtUpMonitor; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserUpdate; -import com.twitter.search.earlybird.common.userupdates.UserUpdatesChecker; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSegmentFactory; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.search.EarlybirdLuceneSearcher; -import com.twitter.search.earlybird.search.EarlybirdMultiSegmentSearcher; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.tweetypie.thriftjava.UserScrubGeoEvent; - -public class SegmentManager { - private static final Logger LOG = LoggerFactory.getLogger(SegmentManager.class); - private final Clock clock; - private static final String STATS_PREFIX = "segment_manager_"; - private static final SearchLongGauge SEGMENT_COUNT_STATS = - SearchLongGauge.export(STATS_PREFIX + "total_segments"); - private static final SearchCounter OPTIMIZED_SEGMENTS = - SearchCounter.export(STATS_PREFIX + "optimized_segments"); - private static final SearchCounter UNOPTIMIZED_SEGMENTS = - SearchCounter.export(STATS_PREFIX + "unoptimized_segments"); - - public enum Filter { - All(info -> true), - Enabled(SegmentInfo::isEnabled), - NeedsIndexing(SegmentInfo::needsIndexing), - Complete(SegmentInfo::isComplete); - - private final Predicate predicate; - - Filter(Predicate predicate) { - this.predicate = predicate; - } - - private static final Map NAME_INDEX = - Maps.newHashMapWithExpectedSize(Filter.values().length); - - static { - for (Filter filter : Filter.values()) { - NAME_INDEX.put(filter.name().toLowerCase(), filter); - } - } - - /** - * Parses the filter from the given string, based on the filter name. 
- */ - public static Filter fromStringIgnoreCase(String str) { - if (str == null) { - return null; - } - - return NAME_INDEX.get(str.toLowerCase()); - } - } - - public enum Order { - OLD_TO_NEW, - NEW_TO_OLD, - } - - /** - * A listener that gets notified when the list of segments changes. - */ - public interface SegmentUpdateListener { - /** - * Called with the new list of segments when it changes. - * - * @param segments The new list of segments. - */ - void update(Collection segments, String message); - } - - private final List updateListeners = - Collections.synchronizedList(Lists.newLinkedList()); - - private final ConcurrentSkipListMap segmentWriters = - new ConcurrentSkipListMap<>(); - - private final Set badTimesliceIds = new HashSet<>(); - - private final int maxEnabledSegments; - private final int maxSegmentSize; - private final EarlybirdSegmentFactory earlybirdSegmentFactory; - private final UserTable userTable; - private final UserScrubGeoMap userScrubGeoMap; - private final EarlybirdIndexConfig earlybirdIndexConfig; - private final DynamicPartitionConfig dynamicPartitionConfig; - private final UserUpdatesChecker userUpdatesChecker; - private final SegmentSyncConfig segmentSyncConfig; - private final EarlybirdSearcherStats searcherStats; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final CriticalExceptionHandler criticalExceptionHandler; - private final CaughtUpMonitor indexCaughtUpMonitor; - - public SegmentManager( - DynamicPartitionConfig dynamicPartitionConfig, - EarlybirdIndexConfig earlybirdIndexConfig, - SearchIndexingMetricSet searchIndexingMetricSet, - EarlybirdSearcherStats searcherStats, - SearchStatsReceiver earlybirdStatsReceiver, - UserUpdatesChecker userUpdatesChecker, - SegmentSyncConfig segmentSyncConfig, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - Clock clock, - int maxSegmentSize, - CriticalExceptionHandler criticalExceptionHandler, - CaughtUpMonitor indexCaughtUpMonitor) { - - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - - this.userTable = userTable; - this.userScrubGeoMap = userScrubGeoMap; - - this.earlybirdSegmentFactory = new EarlybirdSegmentFactory( - earlybirdIndexConfig, - searchIndexingMetricSet, - searcherStats, - clock); - this.earlybirdIndexConfig = earlybirdIndexConfig; - this.maxEnabledSegments = curPartitionConfig.getMaxEnabledLocalSegments(); - this.dynamicPartitionConfig = dynamicPartitionConfig; - this.userUpdatesChecker = userUpdatesChecker; - this.segmentSyncConfig = segmentSyncConfig; - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.searcherStats = searcherStats; - this.clock = clock; - this.maxSegmentSize = maxSegmentSize; - this.criticalExceptionHandler = criticalExceptionHandler; - this.indexCaughtUpMonitor = indexCaughtUpMonitor; - - earlybirdStatsReceiver.getCustomGauge("total_loaded_segments", - segmentWriters::size); - earlybirdStatsReceiver.getCustomGauge("total_indexed_documents", - this::getNumIndexedDocuments); - earlybirdStatsReceiver.getCustomGauge("total_segment_size_bytes", - this::getTotalSegmentSizeOnDisk); - earlybirdStatsReceiver.getCustomGauge("earlybird_index_depth_millis", - this::getIndexDepthMillis); - } - - /** - * Logs the current state of this segment manager. - * - * @param label A label that should identify the segment manager. 
- */ - public void logState(String label) { - StringBuilder sb = new StringBuilder(); - sb.append("State of SegmentManager (" + label + "):\n"); - sb.append("Number of segments: " + segmentWriters.size()); - boolean hasSegments = false; - for (Map.Entry entry : this.segmentWriters.entrySet()) { - SegmentInfo segmentInfo = entry.getValue().getSegmentInfo(); - hasSegments = true; - - sb.append(String.format("\nSegment (%s): isClosed: %5s, isComplete: %5s, " - + "isEnabled: %5s, isIndexing: %5s, isOptimized: %5s, wasIndexed: %5s", - segmentInfo.getSegmentName(), - segmentInfo.isClosed(), - segmentInfo.isComplete(), - segmentInfo.isEnabled(), - segmentInfo.isIndexing(), - segmentInfo.isOptimized(), - segmentInfo.wasIndexed() - )); - - sb.append(String.format(" | Index stats: %s", segmentInfo.getIndexStats().toString())); - } - if (!hasSegments) { - sb.append(" No segments."); - } - LOG.info(sb.toString()); - } - - - public PartitionConfig getPartitionConfig() { - return dynamicPartitionConfig.getCurrentPartitionConfig(); - } - - public int getMaxEnabledSegments() { - return maxEnabledSegments; - } - - public EarlybirdSegmentFactory getEarlybirdSegmentFactory() { - return earlybirdSegmentFactory; - } - - public EarlybirdIndexConfig getEarlybirdIndexConfig() { - return earlybirdIndexConfig; - } - - public UserTable getUserTable() { - return userTable; - } - - public UserScrubGeoMap getUserScrubGeoMap() { - return userScrubGeoMap; - } - - @VisibleForTesting - public void reset() { - segmentWriters.clear(); - } - - /** - * Returns the list of all segments that match the given filter, in the given order. - */ - public Iterable getSegmentInfos(Filter filter, Order order) { - Comparator comparator; - - if (order == Order.OLD_TO_NEW) { - comparator = Comparator.naturalOrder(); - } else { - comparator = Comparator.reverseOrder(); - } - - return () -> segmentWriters.values().stream() - .map(ISegmentWriter::getSegmentInfo) - .filter(filter.predicate::apply) - .sorted(comparator) - .iterator(); - } - - private void createAndPutSegmentInfo(Segment segment) throws IOException { - LOG.info("Creating new SegmentInfo for segment " + segment.getSegmentName()); - putSegmentInfo(new SegmentInfo(segment, earlybirdSegmentFactory, segmentSyncConfig)); - } - - /** - * Updates the list of segments managed by this manager, based on the given list. - */ - public void updateSegments(List segmentsList) throws IOException { - // Truncate to the amount of segments we want to keep enabled. - List truncatedSegmentList = - SegmentManager.truncateSegmentList(segmentsList, maxEnabledSegments); - - final long newestTimeSliceID = getNewestTimeSliceID(); - final Set segmentsToDisable = new HashSet<>(segmentWriters.keySet()); - - for (Segment segment : truncatedSegmentList) { - final long timeSliceID = segment.getTimeSliceID(); - segmentsToDisable.remove(timeSliceID); - - // On the first loop iteration of the first call to updateSegments(), newestTimeSliceID should - // be set to -1, so the condition should be false. After that, all segments should either be - // newer than the latest process segment, or if we're replacing an old segment, it should have - // a SegmentInfo instance associated with it. - if (timeSliceID <= newestTimeSliceID) { - ISegmentWriter segmentWriter = segmentWriters.get(timeSliceID); - // Old time slice ID. It should have a SegmentInfo instance associated with it. - if (segmentWriter == null) { - if (!badTimesliceIds.contains(timeSliceID)) { - // We're dealing with a bad timeslice. 
Log an error, but do it only once per timeslice. - LOG.error("The SegmentInfo instance associated with an old timeSliceID should never be " - + "null. TimeSliceID: {}", timeSliceID); - badTimesliceIds.add(timeSliceID); - } - } else if (segmentWriter.getSegmentInfo().isClosed()) { - // If the SegmentInfo was closed, create a new one. - LOG.info("SegmentInfo for segment {} is closed.", segment.getSegmentName()); - createAndPutSegmentInfo(segment); - } - } else { - // New time slice ID: create a SegmentInfo instance for it. - createAndPutSegmentInfo(segment); - } - } - - // Anything we didn't see locally can be disabled. - for (Long segmentID : segmentsToDisable) { - disableSegment(segmentID); - } - - // Update segment stats and other exported variables. - updateStats(); - } - - /** - * Re-export stats after a segment has changed, or the set of segments has changed. - */ - public void updateStats() { - // Update the partition count stats. - SEGMENT_COUNT_STATS.set(segmentWriters.size()); - - OPTIMIZED_SEGMENTS.reset(); - UNOPTIMIZED_SEGMENTS.reset(); - for (ISegmentWriter writer : segmentWriters.values()) { - if (writer.getSegmentInfo().isOptimized()) { - OPTIMIZED_SEGMENTS.increment(); - } else { - UNOPTIMIZED_SEGMENTS.increment(); - } - } - } - - private long getIndexDepthMillis() { - long oldestTimeSliceID = getOldestEnabledTimeSliceID(); - if (oldestTimeSliceID == SegmentInfo.INVALID_ID) { - return 0; - } else { - // Compute timestamp from timesliceId, which is also a snowflake tweetId - long timestamp = SnowflakeIdParser.getTimestampFromTweetId(oldestTimeSliceID); - // Set current index depth in milliseconds - long indexDepthInMillis = System.currentTimeMillis() - timestamp; - // Index depth should never be negative. - if (indexDepthInMillis < 0) { - LOG.warn("Negative index depth. Large time skew on this Earlybird?"); - return 0; - } else { - return indexDepthInMillis; - } - } - } - - private void updateExportedSegmentStats() { - int index = 0; - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.Enabled, Order.NEW_TO_OLD)) { - SegmentIndexStatsExporter.export(segmentInfo, index++); - } - } - - // Marks the SegmentInfo object matching this time slice as disabled. - private void disableSegment(long timeSliceID) { - SegmentInfo info = getSegmentInfo(timeSliceID); - if (info == null) { - LOG.warn("Tried to disable missing segment " + timeSliceID); - return; - } - info.setIsEnabled(false); - LOG.info("Disabled segment " + info); - } - - public long getNewestTimeSliceID() { - final Iterator segments = getSegmentInfos(Filter.All, Order.NEW_TO_OLD).iterator(); - return segments.hasNext() ? segments.next().getTimeSliceID() : SegmentInfo.INVALID_ID; - } - - /** - * Returns the timeslice ID of the oldest enabled segment. - */ - public long getOldestEnabledTimeSliceID() { - if (segmentWriters.size() == 0) { - return SegmentInfo.INVALID_ID; - } - ISegmentWriter segmentWriter = segmentWriters.firstEntry().getValue(); - return segmentWriter.getSegmentInfo().getTimeSliceID(); - } - - /** - * Returns the SegmentInfo for the given timeSliceID. - */ - public final SegmentInfo getSegmentInfo(long timeSliceID) { - ISegmentWriter segmentWriter = segmentWriters.get(timeSliceID); - return segmentWriter == null ? null : segmentWriter.getSegmentInfo(); - } - - /** - * Returns the segment info for the segment that should contain the given tweet ID. 
- */ - public final SegmentInfo getSegmentInfoFromStatusID(long tweetID) { - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.All, Order.NEW_TO_OLD)) { - if (tweetID >= segmentInfo.getTimeSliceID()) { - return segmentInfo; - } - } - - return null; - } - - /** - * Removes the segment associated with the given timeslice ID from the segment manager. This will - * also take care of all required clean up related to the segment being removed, such as closing - * its writer. - */ - public boolean removeSegmentInfo(long timeSliceID) { - if (timeSliceID == getNewestTimeSliceID()) { - throw new RuntimeException("Cannot drop segment of current time-slice " + timeSliceID); - } - - ISegmentWriter removed = segmentWriters.get(timeSliceID); - if (removed == null) { - return false; - } - - LOG.info("Removing segment {}", removed.getSegmentInfo()); - Preconditions.checkState(!removed.getSegmentInfo().isEnabled()); - removed.getSegmentInfo().getIndexSegment().close(); - segmentWriters.remove(timeSliceID); - - String segmentName = removed.getSegmentInfo().getSegmentName(); - updateAllListeners("Removed segment " + segmentName); - LOG.info("Removed segment " + segmentName); - updateExportedSegmentStats(); - updateStats(); - return true; - } - - /** - * Add the given SegmentWriter into the segmentWriters map. - * If a segment with the same timesliceID already exists in the map, the old one is replaced - * with the new one; this should only happen in the archive. - * - * The replaced segment is destroyed after a delay to allow in-flight requests to finish. - */ - public ISegmentWriter putSegmentInfo(SegmentInfo info) { - ISegmentWriter usedSegmentWriter; - - SegmentWriter segmentWriter - = new SegmentWriter(info, searchIndexingMetricSet.updateFreshness); - - if (!info.isOptimized()) { - LOG.info("Inserting an optimizing segment writer for segment: {}", - info.getSegmentName()); - - usedSegmentWriter = new OptimizingSegmentWriter( - segmentWriter, - criticalExceptionHandler, - searchIndexingMetricSet, - indexCaughtUpMonitor); - } else { - usedSegmentWriter = segmentWriter; - } - - putSegmentWriter(usedSegmentWriter); - return usedSegmentWriter; - } - - private void putSegmentWriter(ISegmentWriter segmentWriter) { - SegmentInfo newSegmentInfo = segmentWriter.getSegmentInfo(); - SegmentInfo oldSegmentInfo = getSegmentInfo(newSegmentInfo.getTimeSliceID()); - - // Some sanity checks. - if (oldSegmentInfo != null) { - // This map is thread safe, so this put can be considered atomic. - segmentWriters.put(newSegmentInfo.getTimeSliceID(), segmentWriter); - LOG.info("Replaced SegmentInfo with a new one in segmentWriters map. 
" - + "Old SegmentInfo: {} New SegmentInfo: {}", oldSegmentInfo, newSegmentInfo); - - if (!oldSegmentInfo.isClosed()) { - oldSegmentInfo.deleteIndexSegmentDirectoryAfterDelay(); - } - } else { - long newestTimeSliceID = getNewestTimeSliceID(); - if (newestTimeSliceID != SegmentInfo.INVALID_ID - && newestTimeSliceID > newSegmentInfo.getTimeSliceID()) { - LOG.error("Not adding out-of-order segment " + newSegmentInfo); - return; - } - - segmentWriters.put(newSegmentInfo.getTimeSliceID(), segmentWriter); - LOG.info("Added segment " + newSegmentInfo); - } - - updateAllListeners("Added segment " + newSegmentInfo.getTimeSliceID()); - updateExportedSegmentStats(); - updateStats(); - } - - private SegmentInfo createSegmentInfo(long timesliceID) throws IOException { - PartitionConfig partitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - - TimeSlice timeSlice = new TimeSlice( - timesliceID, - maxSegmentSize, - partitionConfig.getIndexingHashPartitionID(), - partitionConfig.getNumPartitions()); - - SegmentInfo segmentInfo = - new SegmentInfo(timeSlice.getSegment(), earlybirdSegmentFactory, segmentSyncConfig); - - return segmentInfo; - } - - /** - * Create a new optimizing segment writer and add it to the map. - */ - public OptimizingSegmentWriter createAndPutOptimizingSegmentWriter( - long timesliceID) throws IOException { - SegmentInfo segmentInfo = createSegmentInfo(timesliceID); - - OptimizingSegmentWriter writer = new OptimizingSegmentWriter( - new SegmentWriter(segmentInfo, searchIndexingMetricSet.updateFreshness), - criticalExceptionHandler, - searchIndexingMetricSet, - indexCaughtUpMonitor); - - putSegmentWriter(writer); - return writer; - } - - /** - * Create a new segment writer. - */ - public SegmentWriter createSegmentWriter(long timesliceID) throws IOException { - SegmentInfo segmentInfo = createSegmentInfo(timesliceID); - - SegmentWriter writer = new SegmentWriter( - segmentInfo, searchIndexingMetricSet.updateFreshness); - - return writer; - } - - private void updateAllListeners(String message) { - List segmentInfos = segmentWriters.values().stream() - .map(ISegmentWriter::getSegmentInfo) - .collect(Collectors.toList()); - for (SegmentUpdateListener listener : updateListeners) { - try { - listener.update(segmentInfos, message); - } catch (Exception e) { - LOG.warn("SegmentManager: Unable to call update() on listener.", e); - } - } - } - - // Returns true if the map contains a SegmentInfo matching the given time slice. - public final boolean hasSegmentInfo(long timeSliceID) { - return segmentWriters.containsKey(timeSliceID); - } - - public void addUpdateListener(SegmentUpdateListener listener) { - updateListeners.add(listener); - } - - /** - * Look up the segment containing the given status id. - * If found, its timeslice id is returned. - * If none found, -1 is returned. - */ - public long lookupTimeSliceID(long statusID) throws IOException { - SegmentInfo segmentInfo = getSegmentInfoForID(statusID); - if (segmentInfo == null) { - return -1; - } - if (!segmentInfo.getIndexSegment().hasDocument(statusID)) { - return -1; - } - - return segmentInfo.getTimeSliceID(); - } - - /** - * Truncates the given segment list to the specified number of segments, by keeping the newest - * segments. - */ - @VisibleForTesting - public static List truncateSegmentList(List segmentList, int maxNumSegments) { - // Maybe cut-off the beginning of the sorted list of IDs. 
- if (maxNumSegments > 0 && maxNumSegments < segmentList.size()) { - return segmentList.subList(segmentList.size() - maxNumSegments, segmentList.size()); - } else { - return segmentList; - } - } - - @VisibleForTesting - public void setOffensive(long userID, boolean offensive) { - userTable.setOffensive(userID, offensive); - } - - @VisibleForTesting - public void setAntisocial(long userID, boolean antisocial) { - userTable.setAntisocial(userID, antisocial); - } - - /** - * Returns a searcher for all segments. - */ - public EarlybirdMultiSegmentSearcher getMultiSearcher(ImmutableSchemaInterface schemaSnapshot) - throws IOException { - return new EarlybirdMultiSegmentSearcher( - schemaSnapshot, - getSearchers(schemaSnapshot, Filter.All, Order.NEW_TO_OLD), - searcherStats, - clock); - } - - /** - * Returns a new searcher for the given segment. - */ - @Nullable - public EarlybirdLuceneSearcher getSearcher( - Segment segment, - ImmutableSchemaInterface schemaSnapshot) throws IOException { - return getSearcher(segment.getTimeSliceID(), schemaSnapshot); - } - - /** - * Get max tweet id across all enabled segments. - * @return max tweet id or -1 if none found - */ - public long getMaxTweetIdFromEnabledSegments() { - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.Enabled, Order.NEW_TO_OLD)) { - long maxTweetId = segmentInfo.getIndexSegment().getMaxTweetId(); - if (maxTweetId != -1) { - return maxTweetId; - } - } - - return -1; - } - - /** - * Create a tweet index searcher on the segment represented by the timeslice id. For production - * search session, the schema snapshot should be always passed in to make sure that the schema - * usage inside scoring is consistent. - * - * For non-production usage, like one-off debugging search, you can use the function call without - * the schema snapshot. - * - * @param timeSliceID the timeslice id, which represents the index segment - * @param schemaSnapshot the schema snapshot - * @return the tweet index searcher - */ - @Nullable - public EarlybirdSingleSegmentSearcher getSearcher( - long timeSliceID, - ImmutableSchemaInterface schemaSnapshot) throws IOException { - SegmentInfo segmentInfo = getSegmentInfo(timeSliceID); - if (segmentInfo == null) { - return null; - } - return segmentInfo.getIndexSegment().getSearcher(userTable, schemaSnapshot); - } - - /** - * Returns a new searcher for the segment with the given timeslice ID. If the given timeslice ID - * does not correspond to any active segment, {@code null} is returned. - * - * @param timeSliceID The segment's timeslice ID. - * @return A new searcher for the segment with the given timeslice ID. 
- */ - @Nullable - public EarlybirdSingleSegmentSearcher getSearcher(long timeSliceID) throws IOException { - SegmentInfo segmentInfo = getSegmentInfo(timeSliceID); - if (segmentInfo == null) { - return null; - } - return segmentInfo.getIndexSegment().getSearcher(userTable); - } - - @Nullable - public EarlybirdResponseCode checkSegment(Segment segment) { - return checkSegmentInternal(getSegmentInfo(segment.getTimeSliceID())); - } - - private static EarlybirdResponseCode checkSegmentInternal(SegmentInfo info) { - if (info == null) { - return EarlybirdResponseCode.PARTITION_NOT_FOUND; - } else if (info.isEnabled()) { - return EarlybirdResponseCode.SUCCESS; - } else { - return EarlybirdResponseCode.PARTITION_DISABLED; - } - } - - private List getSearchers( - ImmutableSchemaInterface schemaSnapshot, - Filter filter, - Order order) throws IOException { - List searchers = Lists.newArrayList(); - for (SegmentInfo segmentInfo : getSegmentInfos(filter, order)) { - EarlybirdSingleSegmentSearcher searcher = - segmentInfo.getIndexSegment().getSearcher(userTable, schemaSnapshot); - if (searcher != null) { - searchers.add(searcher); - } - } - return searchers; - } - - /** - * Gets metadata for segments for debugging purposes. - */ - public List getSegmentMetadata() { - List segmentMetadata = new ArrayList<>(); - for (SegmentInfo segment : getSegmentInfos(Filter.All, Order.OLD_TO_NEW)) { - segmentMetadata.add(segment.getSegmentMetadata()); - } - return segmentMetadata; - } - - /** - * Gets info for query caches to be displayed in an admin page. - */ - public String getQueryCachesData() { - StringBuilder output = new StringBuilder(); - for (SegmentInfo segment : getSegmentInfos(Filter.All, Order.OLD_TO_NEW)) { - output.append(segment.getQueryCachesData() + "\n"); - } - return output.toString(); - } - - /** - * Index the given user update. Returns false if the given update is skipped. - */ - public boolean indexUserUpdate(UserUpdate userUpdate) { - return userTable.indexUserUpdate(userUpdatesChecker, userUpdate); - } - - /** - * Index the given UserScrubGeoEvent. - * @param userScrubGeoEvent - */ - public void indexUserScrubGeoEvent(UserScrubGeoEvent userScrubGeoEvent) { - userScrubGeoMap.indexUserScrubGeoEvent(userScrubGeoEvent); - } - - /** - * Return how many documents this segment manager has indexed in all of its enabled segments. - */ - public long getNumIndexedDocuments() { - // Order here doesn't matter, we just want all enabled segments, and allocate - // as little as needed. - long indexedDocs = 0; - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.Enabled, Order.OLD_TO_NEW)) { - indexedDocs += segmentInfo.getIndexSegment().getIndexStats().getStatusCount(); - } - return indexedDocs; - } - - /** - * Return how many partial updates this segment manager has applied - * in all of its enabled segments. - */ - public long getNumPartialUpdates() { - long partialUpdates = 0; - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.Enabled, Order.OLD_TO_NEW)) { - partialUpdates += segmentInfo.getIndexSegment().getIndexStats().getPartialUpdateCount(); - } - return partialUpdates; - } - - /** - * Returns the segment info for the segment containing the given tweet ID. - */ - public SegmentInfo getSegmentInfoForID(long tweetID) { - ISegmentWriter segmentWriter = getSegmentWriterForID(tweetID); - return segmentWriter == null ? null : segmentWriter.getSegmentInfo(); - } - - /** - * Returns the segment writer for the segment containing the given tweet ID. 
- */ - @Nullable - public ISegmentWriter getSegmentWriterForID(long tweetID) { - Map.Entry entry = segmentWriters.floorEntry(tweetID); - return entry == null ? null : entry.getValue(); - } - - /** - * Remove old segments until we have less than or equal to the number of max enabled segments. - */ - public void removeExcessSegments() { - int removedSegmentCount = 0; - while (segmentWriters.size() > getMaxEnabledSegments()) { - long timesliceID = getOldestEnabledTimeSliceID(); - disableSegment(timesliceID); - removeSegmentInfo(timesliceID); - removedSegmentCount += 1; - } - LOG.info("Segment manager removed {} excess segments", removedSegmentCount); - } - - /** - * Returns total index size on disk across all enabled segments in this segment manager. - */ - private long getTotalSegmentSizeOnDisk() { - long totalIndexSize = 0; - for (SegmentInfo segmentInfo : getSegmentInfos(Filter.Enabled, Order.OLD_TO_NEW)) { - totalIndexSize += segmentInfo.getIndexSegment().getIndexStats().getIndexSizeOnDiskInBytes(); - } - return totalIndexSize; - } - - @VisibleForTesting - ISegmentWriter getSegmentWriterWithoutCreationForTests(long timesliceID) { - return segmentWriters.get(timesliceID); - } - - @VisibleForTesting - ArrayList getTimeSliceIdsForTests() { - return new ArrayList(segmentWriters.keySet()); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentOptimizer.java b/src/java/com/twitter/search/earlybird/partition/SegmentOptimizer.java deleted file mode 100644 index d06d0fffc..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentOptimizer.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.EarlybirdStatus; - -public final class SegmentOptimizer { - private static final Logger LOG = LoggerFactory.getLogger(SegmentOptimizer.class); - - private static final String OPTIMIZING_SEGMENT_EVENT_PATTERN = "optimizing segment %s"; - private static final String OPTIMIZING_SEGMENT_GAUGE_PATTERN = "optimizing_segment_%s"; - - private SegmentOptimizer() { - } - - /** - * Optimize a segment. Returns whether optimization was successful. - */ - public static boolean optimize(SegmentInfo segmentInfo) { - try { - return optimizeThrowing(segmentInfo); - } catch (Exception e) { - // This is a bad situation, as earlybird can't run with too many un-optimized - // segments in memory. 
- LOG.error("Exception while optimizing segment " + segmentInfo.getSegmentName() + ": ", e); - segmentInfo.setFailedOptimize(); - return false; - } - } - - public static boolean needsOptimization(SegmentInfo segmentInfo) { - return segmentInfo.isComplete() && !segmentInfo.isOptimized() - && !segmentInfo.isFailedOptimize() && !segmentInfo.isIndexing(); - } - - private static boolean optimizeThrowing(SegmentInfo segmentInfo) throws IOException { - if (!needsOptimization(segmentInfo)) { - return false; - } - - String gaugeName = - String.format(OPTIMIZING_SEGMENT_GAUGE_PATTERN, segmentInfo.getSegmentName()); - SearchIndexingMetricSet.StartupMetric metric = - new SearchIndexingMetricSet.StartupMetric(gaugeName); - - String eventName = - String.format(OPTIMIZING_SEGMENT_EVENT_PATTERN, segmentInfo.getSegmentName()); - EarlybirdStatus.beginEvent(eventName, metric); - try { - segmentInfo.getIndexSegment().optimizeIndexes(); - } finally { - EarlybirdStatus.endEvent(eventName, metric); - } - - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentSyncConfig.java b/src/java/com/twitter/search/earlybird/partition/SegmentSyncConfig.java deleted file mode 100644 index d7e9bd82e..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentSyncConfig.java +++ /dev/null @@ -1,218 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Collections; -import java.util.Date; -import java.util.Optional; -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.schema.earlybird.FlushVersion; -import com.twitter.search.common.util.io.flushable.PersistentFile; -import com.twitter.search.earlybird.archive.ArchiveSegment; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.util.ScrubGenUtil; -import com.twitter.util.TwitterDateFormat; - -/** - * Encapsulates config information related to reading and writing segments to local filesystem or - * HDFS. - */ -public class SegmentSyncConfig { - public static final String LUCENE_DIR_PREFIX = "lucene_"; - - private final Optional scrubGen; - - public SegmentSyncConfig(Optional scrubGen) { - this.scrubGen = scrubGen; - String scrubGenStat = scrubGen.orElse("unset"); - SearchLongGauge.export("scrub_gen_" + scrubGenStat).set(1); - if (scrubGen.isPresent()) { - // Export a stat for the number of days between the scrub gen date and now - SearchCustomGauge.export("scrub_gen_age_in_days", () -> { - long scrubGenMillis = ScrubGenUtil.parseScrubGenToDate(scrubGen.get()).getTime(); - return TimeUnit.MILLISECONDS.toDays(System.currentTimeMillis() - scrubGenMillis); - }); - } - } - - /** - * Returns the file extension to be used for the current flush version. - */ - public String getVersionFileExtension() { - return FlushVersion.CURRENT_FLUSH_VERSION.getVersionFileExtension(); - } - - /** - * Returns the threshold for how large a segment's status count must be at load time to be - * considered valid. 
- */ - public int getMinSegmentStatusCountThreshold() { - double minSegmentTweetCountProportionThreshold = - EarlybirdConfig.getDouble("min_segment_tweet_count_percentage_threshold", 0) / 100; - return (int) (EarlybirdConfig.getMaxSegmentSize() * minSegmentTweetCountProportionThreshold); - } - - /** - * Determines if this earlybird is allowed to flush segments to HDFS. - */ - public boolean isFlushToHdfsEnabled() { - return EarlybirdProperty.SEGMENT_FLUSH_TO_HDFS_ENABLED.get(false) - // Flush to HDFS is always disabled if FlushVersion is not official. - && FlushVersion.CURRENT_FLUSH_VERSION.isOfficial(); - } - - /** - * Determines if this earlybird is allowed to load segments from HDFS. - */ - public boolean isSegmentLoadFromHdfsEnabled() { - return EarlybirdProperty.SEGMENT_LOAD_FROM_HDFS_ENABLED.get(false); - } - - /** - * Determines if this earlybird is allowed to delete flushed segments. - */ - public boolean isDeleteFlushedSegmentsEnabled() { - return EarlybirdConfig.getBool("segment_dropper_delete_flushed", true); - } - - /** - * Returns the root of the segment directory on the local disk. - */ - public String getLocalSegmentSyncRootDir() { - return EarlybirdConfig.getString("segment_sync_dir", "partitions") - + getScrubGenFlushDirSuffix(); - } - - /** - * Returns the root of the segment directory on HDFS. - */ - public String getHdfsSegmentSyncRootDir() { - return EarlybirdProperty.HDFS_SEGMENT_SYNC_DIR.get("partitions") - + getScrubGenFlushDirSuffix(); - } - - /** - * Returns the HDFS root directory where all segments should be uploaded. - */ - public String getHdfsSegmentUploadRootDir() { - String hdfsSegmentUploadDir = EarlybirdProperty.HDFS_SEGMENT_UPLOAD_DIR.get(null); - return hdfsSegmentUploadDir != null - ? hdfsSegmentUploadDir + getScrubGenFlushDirSuffix() - : getHdfsSegmentSyncRootDir(); - } - - /** - * Returns the ZooKeeper path used for segment sync'ing. - */ - public String getZooKeeperSyncFullPath() { - return EarlybirdProperty.ZK_APP_ROOT.get() + "/" - + EarlybirdConfig.getString("segment_flush_sync_relative_path", "segment_flush_sync"); - } - - /** - * Returns the list of directories that should be persisted for this segment. - */ - public Collection getPersistentFileNames(SegmentInfo segment) { - return Collections.singleton(segment.getSegmentName()); - } - - /** - * Returns the list of all files that should be sync'ed for this segment. - */ - public Collection getAllSyncFileNames(SegmentInfo segment) { - Collection allFileNames = PersistentFile.getAllFileNames(segment.getSegmentName()); - if (segment.getEarlybirdIndexConfig().isIndexStoredOnDisk()) { - allFileNames = new ArrayList<>(allFileNames); - // Just the file name, not the full path - allFileNames.add(getLocalLuceneSyncDirFileName(segment.getSegment())); - } - return allFileNames; - } - - /** - * Returns the local sync directory for the given segment. - */ - public String getLocalSyncDirName(Segment segment) { - return getLocalSegmentSyncRootDir() + "/" + segment.getSegmentName() - + getVersionFileExtension(); - } - - /** - * Returns the local Lucene directory for the given segment. - */ - public String getLocalLuceneSyncDirName(Segment segment) { - return getLocalSyncDirName(segment) + "/" + getLocalLuceneSyncDirFileName(segment); - } - - /** - * Returns the name (not the path) of the Lucene directory for the given segment. 
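// Illustrative note (not part of the original source): how the path helpers above
// compose, using made-up values. Assume segment_sync_dir = "partitions", a scrub gen
// of "20221201", a segment named "timeslice_1234_p_1_of_2", and a flush-version
// extension following the same "_v_<n>" pattern as StatusBatchFlushVersion later in
// this diff, say "_v_99":
//   getLocalSegmentSyncRootDir()       -> "partitions/scrubbed/20221201"
//   getLocalSyncDirName(segment)       -> "partitions/scrubbed/20221201/timeslice_1234_p_1_of_2_v_99"
//   getLocalLuceneSyncDirName(segment) -> ".../timeslice_1234_p_1_of_2_v_99/lucene_realtime"
//                                         for a realtime segment, or ".../lucene_20221130"
//                                         for an archive segment whose data end date is 2022-11-30.
// Every name and value above is hypothetical; the real ones come from EarlybirdConfig
// and the segment metadata.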
- */ - private String getLocalLuceneSyncDirFileName(Segment segment) { - if (segment instanceof ArchiveSegment) { - Date endDate = ((ArchiveSegment) segment).getDataEndDate(); - String endDateString = TwitterDateFormat.apply("yyyyMMdd").format(endDate); - return LUCENE_DIR_PREFIX + endDateString; - } else { - return LUCENE_DIR_PREFIX + "realtime"; - } - } - - /** - * Returns the HDFS sync directory for the given segment. - */ - public String getHdfsSyncDirNamePrefix(Segment segment) { - return getHdfsSegmentSyncRootDir() + "/" + segment.getSegmentName() - + getVersionFileExtension() + "*"; - } - - /** - * Returns the prefix of the HDFS directory where the files for this segment should be uploaded. - */ - public String getHdfsUploadDirNamePrefix(Segment segment) { - return getHdfsSegmentUploadRootDir() + "/" + segment.getSegmentName() - + getVersionFileExtension() + "*"; - } - - /** - * Returns the HDFS directory where the files for this segment should be uploaded. - */ - public String getHdfsFlushDirName(Segment segment) { - return getHdfsSegmentUploadRootDir() + "/" + segment.getSegmentName() - + getVersionFileExtension() + "_" + DatabaseConfig.getLocalHostname(); - } - - /** - * Returns a temp HDFS directory to be used for this segment. - */ - public String getHdfsTempFlushDirName(Segment segment) { - return getHdfsSegmentUploadRootDir() + "/temp_" - + DatabaseConfig.getLocalHostname() + "_" + segment.getSegmentName() - + getVersionFileExtension(); - } - - /** - * Concatenates the name of this segment with the flush version extension. - */ - public String getVersionedName(Segment segment) { - return segment.getSegmentName() + getVersionFileExtension(); - } - - private String getScrubGenFlushDirSuffix() { - return scrubGen - .map(s -> "/scrubbed/" + s) - .orElse(""); - } - - /** - * Returns the scrub gen set for this earlybird. - */ - public Optional getScrubGen() { - return scrubGen; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentSyncInfo.java b/src/java/com/twitter/search/earlybird/partition/SegmentSyncInfo.java deleted file mode 100644 index f204882ca..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentSyncInfo.java +++ /dev/null @@ -1,113 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.search.common.partitioning.base.Segment; - -/** - * Representation for segment sync state, the local and hdfs file locations, as well as the - * current in-memory sync states maintained by earlybirds. - */ -public class SegmentSyncInfo { - // Is this segment loaded from disk? - private volatile boolean loaded = false; - // Has this segment been flushed to disk, and uploaded to HDFS if uploading is enabled? 
- private volatile boolean flushed = false; - // Time when the segment was flushed to local disk - private volatile long flushTimeMillis = 0; - - private final Segment segment; - private final SegmentSyncConfig syncConfig; - private final String localSyncDir; - private final String hdfsFlushDir; - private final String hdfsSyncDirPrefix; - private final String hdfsUploadDirPrefix; - private final String hdfsTempFlushDir; - - @VisibleForTesting - public SegmentSyncInfo(SegmentSyncConfig syncConfig, Segment segment) { - this.segment = segment; - this.syncConfig = syncConfig; - this.localSyncDir = syncConfig.getLocalSyncDirName(segment); - this.hdfsSyncDirPrefix = syncConfig.getHdfsSyncDirNamePrefix(segment); - this.hdfsUploadDirPrefix = syncConfig.getHdfsUploadDirNamePrefix(segment); - this.hdfsFlushDir = syncConfig.getHdfsFlushDirName(segment); - this.hdfsTempFlushDir = syncConfig.getHdfsTempFlushDirName(segment); - } - - public boolean isLoaded() { - return loaded; - } - - public boolean isFlushed() { - return flushed; - } - - public long getFlushTimeMillis() { - return flushTimeMillis; - } - - public String getLocalSyncDir() { - return localSyncDir; - } - - public SegmentSyncConfig getSegmentSyncConfig() { - return syncConfig; - } - - public String getLocalLuceneSyncDir() { - // For archive search this name depends on the end date of the segment, which can change, - // so we cannot pre-compute this in the constructor. - // This should only be used in the on-disk archive. - return syncConfig.getLocalLuceneSyncDirName(segment); - } - - public String getHdfsFlushDir() { - return hdfsFlushDir; - } - - public String getHdfsSyncDirPrefix() { - return hdfsSyncDirPrefix; - } - - public String getHdfsUploadDirPrefix() { - return hdfsUploadDirPrefix; - } - - public String getHdfsTempFlushDir() { - return hdfsTempFlushDir; - } - - public void setLoaded(boolean isLoaded) { - this.loaded = isLoaded; - } - - /** - * Stores the flushing state for this segment. - */ - public void setFlushed(boolean isFlushed) { - if (isFlushed) { - this.flushTimeMillis = System.currentTimeMillis(); - } - this.flushed = isFlushed; - } - - /** - * Adds debug information about the loaded and flushed status of this segment to the given - * StringBuilder. 
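// Illustrative note (not part of the original source): the text appended by the
// method below for each possible state:
//   neither loaded nor flushed -> "[]"
//   loaded only                -> "[loaded]"
//   flushed only               -> "[flushed]"
//   loaded and flushed         -> "[loaded, flushed]"
// The trailing ", " written for each flag is trimmed off before the closing bracket.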
- */ - public void addDebugInfo(StringBuilder builder) { - builder.append("["); - int startLength = builder.length(); - if (loaded) { - builder.append("loaded, "); - } - if (flushed) { - builder.append("flushed, "); - } - if (startLength < builder.length()) { - builder.setLength(builder.length() - 2); - } - builder.append("]"); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentVulture.java b/src/java/com/twitter/search/earlybird/partition/SegmentVulture.java deleted file mode 100644 index 8a07b7f80..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentVulture.java +++ /dev/null @@ -1,380 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.File; -import java.io.IOException; -import java.util.List; -import java.util.Set; -import java.util.SortedSet; -import java.util.TreeSet; - -import javax.annotation.Nonnull; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Sets; - -import org.apache.commons.io.FileUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.schema.earlybird.FlushVersion; -import com.twitter.search.earlybird.archive.ArchiveSearchPartitionManager; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer; -import com.twitter.search.earlybird.archive.ArchiveTimeSlicer.ArchiveTimeSlice; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.factory.EarlybirdIndexConfigUtil; - -/** - * This class removes older flush version segments. - * Considering that we almost never increase status flush versions, old statuses are not cleaned up - * automatically. - */ -public final class SegmentVulture { - private static final Logger LOG = LoggerFactory.getLogger(SegmentVulture.class); - @VisibleForTesting // Not final for testing. - protected static int numIndexFlushVersionsToKeep = - EarlybirdConfig.getInt("number_of_flush_versions_to_keep", 2); - - private SegmentVulture() { - // this never gets called - } - - /** - * Delete old build generations, keep currentGeneration. - */ - @VisibleForTesting - static void removeOldBuildGenerations(String rootDirPath, String currentGeneration) { - File rootDir = new File(rootDirPath); - - if (!rootDir.exists() || !rootDir.isDirectory()) { - LOG.error("Root directory is invalid: " + rootDirPath); - return; - } - - File[] buildGenerations = rootDir.listFiles(); - - for (File generation : buildGenerations) { - if (generation.getName().equals(currentGeneration)) { - LOG.info("Skipping current generation: " + generation.getAbsoluteFile()); - continue; - } - - try { - FileUtils.deleteDirectory(generation); - LOG.info("Deleted old build generation: " + generation.getAbsolutePath()); - } catch (IOException e) { - LOG.error("Failed to delete old build generation at: " + generation.getAbsolutePath(), e); - } - } - LOG.info("Successfully deleted all old generations"); - } - - /** - * Delete all the timeslice data outside the serving range. 
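// Illustrative note (not part of the original source): a worked example of the pruning
// below, with made-up IDs. Suppose the tier's time slices yield a serving range of
// [1000, 5000] for this partition. Flushed segment directories whose time-slice IDs are
// 800 or 6200 fall outside that range and are deleted; directories with IDs between
// 1000 and 5000 are kept. Directories whose names cannot be parsed into a time-slice ID
// are only logged and skipped.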
- */ - @VisibleForTesting - static void removeArchiveTimesliceOutsideServingRange(PartitionConfig partitionConfig, - ArchiveTimeSlicer timeSlicer, SegmentSyncConfig segmentSyncConfig) { - try { - long servingStartTimesliceId = Long.MAX_VALUE; - long servingEndTimesliceId = 0; - int partitionID = partitionConfig.getIndexingHashPartitionID(); - List timeSliceList = timeSlicer.getTimeSlicesInTierRange(); - for (ArchiveTimeSlice timeSlice : timeSliceList) { - if (timeSlice.getMinStatusID(partitionID) < servingStartTimesliceId) { - servingStartTimesliceId = timeSlice.getMinStatusID(partitionID); - } - if (timeSlice.getMaxStatusID(partitionID) > servingEndTimesliceId) { - servingEndTimesliceId = timeSlice.getMaxStatusID(partitionID); - } - } - LOG.info("Got the serving range: [" + servingStartTimesliceId + ", " - + servingEndTimesliceId + "], " + "[" + partitionConfig.getTierStartDate() + ", " - + partitionConfig.getTierEndDate() + ") for tier: " + partitionConfig.getTierName()); - - // The tier configuration does not have valid serving range: do not do anything. - if (servingEndTimesliceId <= servingStartTimesliceId) { - LOG.error("Invalid serving range [" + partitionConfig.getTierStartDate() + ", " - + partitionConfig.getTierEndDate() + "] for tier: " + partitionConfig.getTierName()); - return; - } - - int numDeleted = 0; - File[] segments = getSegmentsOnRootDir(segmentSyncConfig); - for (File segment : segments) { - String segmentName = SegmentInfo.getSegmentNameFromFlushedDir(segment.getName()); - if (segmentName == null) { - LOG.error("Invalid directory for segments: " + segment.getAbsolutePath()); - continue; - } - long timesliceId = Segment.getTimeSliceIdFromName(segmentName); - if (timesliceId < 0) { - LOG.error("Unknown dir/file found: " + segment.getAbsolutePath()); - continue; - } - - if (timesliceId < servingStartTimesliceId || timesliceId > servingEndTimesliceId) { - LOG.info(segment.getAbsolutePath() + " will be deleted for outside serving Range[" - + partitionConfig.getTierStartDate() + ", " + partitionConfig.getTierEndDate() + ")"); - if (deleteSegment(segment)) { - numDeleted++; - } - } - } - LOG.info("Deleted " + numDeleted + " segments out of " + segments.length + " segments"); - } catch (IOException e) { - LOG.error("Can not timeslice based on the document data: ", e); - throw new RuntimeException(e); - } - } - - /** - * Deleted segments from other partitions. When boxes are moved between - * partitions, segments from other partitions may stay, we will have to - * delete them. - */ - @VisibleForTesting - static void removeIndexesFromOtherPartitions(int myPartition, int numPartitions, - SegmentSyncConfig segmentSyncConfig) { - File[] segments = getSegmentsOnRootDir(segmentSyncConfig); - int numDeleted = 0; - for (File segment : segments) { - int segmentNumPartitions = Segment.numPartitionsFromName(segment.getName()); - int segmentPartition = Segment.getPartitionFromName(segment.getName()); - - if (segmentNumPartitions < 0 || segmentPartition < 0) { // Not a segment file, ignoring - LOG.info("Unknown dir/file found: " + segment.getAbsolutePath()); - continue; - } - - if (segmentNumPartitions != numPartitions || segmentPartition != myPartition) { - if (deleteSegment(segment)) { - numDeleted++; - } - } - } - LOG.info("Deleted " + numDeleted + " segments out of " + segments.length + " segments"); - } - - /** - * Delete flushed segments of older flush versions. 
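// Illustrative note (not part of the original source): a worked example of the
// retention rule below. With number_of_flush_versions_to_keep = 2 (the default above),
// a current flush version of 24, and versions {22, 23, 24} found on disk, the suffixes
// for versions 24 and 23 are kept and every flushed segment directory whose name ends
// with the version-22 suffix is deleted. The version numbers are made up for
// illustration.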
- */ - @VisibleForTesting - static void removeOldFlushVersionIndexes(int currentFlushVersion, - SegmentSyncConfig segmentSyncConfig) { - SortedSet indexFlushVersions = - listFlushVersions(segmentSyncConfig, currentFlushVersion); - - if (indexFlushVersions == null - || indexFlushVersions.size() <= numIndexFlushVersionsToKeep) { - return; - } - - Set suffixesToKeep = Sets.newHashSetWithExpectedSize(numIndexFlushVersionsToKeep); - int flushVersionsToKeep = numIndexFlushVersionsToKeep; - while (flushVersionsToKeep > 0 && !indexFlushVersions.isEmpty()) { - Integer oldestFlushVersion = indexFlushVersions.last(); - String flushFileExtension = FlushVersion.getVersionFileExtension(oldestFlushVersion); - if (flushFileExtension != null) { - suffixesToKeep.add(flushFileExtension); - flushVersionsToKeep--; - } else { - LOG.warn("Found unknown flush versions: " + oldestFlushVersion - + " Segments with this flush version will be deleted to recover disk space."); - } - indexFlushVersions.remove(oldestFlushVersion); - } - - String segmentSyncRootDir = segmentSyncConfig.getLocalSegmentSyncRootDir(); - File dir = new File(segmentSyncRootDir); - File[] segments = dir.listFiles(); - - for (File segment : segments) { - boolean keepSegment = false; - for (String suffix : suffixesToKeep) { - if (segment.getName().endsWith(suffix)) { - keepSegment = true; - break; - } - } - if (!keepSegment) { - try { - FileUtils.deleteDirectory(segment); - LOG.info("Deleted old flushed segment: " + segment.getAbsolutePath()); - } catch (IOException e) { - LOG.error("Failed to delete old flushed segment.", e); - } - } - } - } - - private static File[] getSegmentsOnRootDir(SegmentSyncConfig segmentSyncConfig) { - String segmentSyncRootDir = segmentSyncConfig.getLocalSegmentSyncRootDir(); - File dir = new File(segmentSyncRootDir); - File[] segments = dir.listFiles(); - if (segments == null) { - return new File[0]; - } else { - return segments; - } - } - - private static boolean deleteSegment(File segment) { - try { - FileUtils.deleteDirectory(segment); - LOG.info("Deleted segment from other partition: " + segment.getAbsolutePath()); - return true; - } catch (IOException e) { - LOG.error("Failed to delete segment from other partition.", e); - return false; - } - } - - // Returns FlushVersions found on disk. - // Current FlushVersion is always added into the list, even if segments are not found on disk, - // because they may not have appeared yet. - @Nonnull - @VisibleForTesting - static SortedSet listFlushVersions(SegmentSyncConfig sync, int currentFlushVersion) { - TreeSet flushVersions = Sets.newTreeSet(); - - // Always add current flush version. - // It is possible that on startup when this is run, the current flush version - // segments have not appeared yet. 
- flushVersions.add(currentFlushVersion); - - String segmentSyncRootDir = sync.getLocalSegmentSyncRootDir(); - File dir = new File(segmentSyncRootDir); - if (!dir.exists()) { - LOG.info("segmentSyncRootDir [" + segmentSyncRootDir - + "] does not exist"); - return flushVersions; - } - if (!dir.isDirectory()) { - LOG.error("segmentSyncRootDir [" + segmentSyncRootDir - + "] does not point to a directory"); - return flushVersions; - } - if (!dir.canRead()) { - LOG.error("No permission to read from segmentSyncRootDir [" - + segmentSyncRootDir + "]"); - return flushVersions; - } - if (!dir.canWrite()) { - LOG.error("No permission to write to segmentSyncRootDir [" - + segmentSyncRootDir + "]"); - return flushVersions; - } - - File[] segments = dir.listFiles(); - for (File segment : segments) { - String name = segment.getName(); - if (!name.contains(FlushVersion.DELIMITER)) { - // This is a not a segment with a FlushVersion, skip. - LOG.info("Found segment directory without a flush version: " + name); - continue; - } - String[] nameSplits = name.split(FlushVersion.DELIMITER); - if (nameSplits.length != 2) { - LOG.warn("Found segment with bad name: " + segment.getAbsolutePath()); - continue; - } - - // Second half contains flush version - try { - int flushVersion = Integer.parseInt(nameSplits[1]); - flushVersions.add(flushVersion); - } catch (NumberFormatException e) { - LOG.warn("Bad flush version number in segment name: " + segment.getAbsolutePath()); - } - } - return flushVersions; - } - - /** - * Removes old segments in the current build gen. - */ - @VisibleForTesting - static void removeOldSegments(SegmentSyncConfig sync) { - if (!sync.getScrubGen().isPresent()) { - return; - } - - File currentScrubGenSegmentDir = new File(sync.getLocalSegmentSyncRootDir()); - - // The unscrubbed segment root directory, used for rebuilds and for segments created before - // we introduced scrub gens. The getLocalSegmentSyncRootDir should be something like: - // $unscrubbedSegmentDir/scrubbed/$scrub_gen/, - // get unscrubbedSegmentDir from string name here in case scrubbed dir does not exist yet - File unscrubbedSegmentDir = new File(sync.getLocalSegmentSyncRootDir().split("scrubbed")[0]); - if (!unscrubbedSegmentDir.exists()) { - // For a new host that swapped in, it might not have flushed_segment dir yet. - // return directly in that case. - LOG.info(unscrubbedSegmentDir.getAbsoluteFile() + "does not exist, nothing to remove."); - return; - } - Preconditions.checkArgument(unscrubbedSegmentDir.exists()); - for (File file : unscrubbedSegmentDir.listFiles()) { - if (file.getName().matches("scrubbed")) { - continue; - } - try { - LOG.info("Deleting old unscrubbed segment: " + file.getAbsolutePath()); - FileUtils.deleteDirectory(file); - } catch (IOException e) { - LOG.error("Failed to delete directory: " + file.getPath(), e); - } - } - - // Delete all segments from previous scrub generations. - File allScrubbedSegmentsDir = currentScrubGenSegmentDir.getParentFile(); - if (allScrubbedSegmentsDir.exists()) { - for (File file : allScrubbedSegmentsDir.listFiles()) { - if (file.getPath().equals(currentScrubGenSegmentDir.getPath())) { - continue; - } - try { - LOG.info("Deleting old scrubbed segment: " + file.getAbsolutePath()); - FileUtils.deleteDirectory(file); - } catch (IOException e) { - LOG.error("Failed to delete directory: " + file.getPath(), e); - } - } - } - } - - /** - * Removes the data for all unused segments from the local disk. 
This includes: - * - data for old segments - * - data for segments belonging to another partition - * - data for segments belonging to a different flush version. - */ - public static void removeUnusedSegments( - PartitionManager partitionManager, - PartitionConfig partitionConfig, - int schemaMajorVersion, - SegmentSyncConfig segmentSyncConfig) { - - if (EarlybirdIndexConfigUtil.isArchiveSearch()) { - removeOldBuildGenerations( - EarlybirdConfig.getString("root_dir"), - EarlybirdConfig.getString("offline_segment_build_gen") - ); - removeOldSegments(segmentSyncConfig); - - Preconditions.checkState(partitionManager instanceof ArchiveSearchPartitionManager); - removeArchiveTimesliceOutsideServingRange( - partitionConfig, - ((ArchiveSearchPartitionManager) partitionManager).getTimeSlicer(), segmentSyncConfig); - } - - // Remove segments from other partitions - removeIndexesFromOtherPartitions( - partitionConfig.getIndexingHashPartitionID(), - partitionConfig.getNumPartitions(), segmentSyncConfig); - - // Remove old flushed segments - removeOldFlushVersionIndexes(schemaMajorVersion, segmentSyncConfig); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentWarmer.java b/src/java/com/twitter/search/earlybird/partition/SegmentWarmer.java deleted file mode 100644 index 7d59e5618..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentWarmer.java +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -public class SegmentWarmer { - private static final Logger LOG = LoggerFactory.getLogger(SegmentWarmer.class); - - private final CriticalExceptionHandler criticalExceptionHandler; - - public SegmentWarmer(CriticalExceptionHandler criticalExceptionHandler) { - this.criticalExceptionHandler = criticalExceptionHandler; - } - - private boolean shouldWarmSegment(SegmentInfo segmentInfo) { - return segmentInfo.isEnabled() - && segmentInfo.isComplete() - && segmentInfo.isOptimized() - && !segmentInfo.isIndexing(); - } - - /** - * Warms a segment if it is ready to be warmed. Only has an affect on Archive Lucene segments. - */ - public boolean warmSegmentIfNecessary(SegmentInfo segmentInfo) { - if (!shouldWarmSegment(segmentInfo)) { - return false; - } - try { - segmentInfo.getIndexSegment().warmSegment(); - return true; - } catch (IOException e) { - // This is a bad situation, as earlybird can't search a segment that hasn't been warmed up - // So we delete the bad segment, and restart the earlybird if it's in starting phrase, - // otherwise alert. - LOG.error("Failed to warmup segment " + segmentInfo.getSegmentName() - + ". 
Will destroy local unreadable segment.", e); - segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately(); - - criticalExceptionHandler.handle(this, e); - - return false; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SegmentWriter.java b/src/java/com/twitter/search/earlybird/partition/SegmentWriter.java deleted file mode 100644 index 46103840f..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SegmentWriter.java +++ /dev/null @@ -1,239 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.EnumMap; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; - -import com.google.common.collect.HashBasedTable; -import com.google.common.collect.Table; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.index.EarlybirdSegment; -import com.twitter.util.Time; - -public class SegmentWriter implements ISegmentWriter { - - // helper, used for collecting stats - enum FailureReason { - FAILED_INSERT, - FAILED_FOR_TWEET_IN_INDEX, - FAILED_FOR_COMPLETE_SEGMENT - } - - private static final String STAT_PREFIX = "segment_writer_"; - private static final String EVENT_COUNTER = STAT_PREFIX + "%s_%s_segment_%s"; - private static final String EVENT_COUNTER_ALL_SEGMENTS = STAT_PREFIX + "%s_%s_all_segments"; - private static final String EVENT_TIMERS = STAT_PREFIX + "%s_timing"; - private static final String DROPPED_UPDATES_FOR_DISABLED_SEGMENTS = - STAT_PREFIX + "%s_dropped_updates_for_disabled_segments"; - private static final String INDEXING_LATENCY = - STAT_PREFIX + "%s_indexing_latency_ms"; - - private final byte penguinVersion; - private final DocumentFactory updateFactory; - private final DocumentFactory documentFactory; - private final SearchRateCounter missingPenguinVersion; - private final EarlybirdSegment earlybirdSegment; - private final SegmentInfo segmentInfo; - // Stores per segment counters for each (indexing event type, result) pair - // Example stat name - // "segment_writer_partial_update_success_segment_twttr_search_test_start_%d_p_0_of_1" - private final Table statsForUpdateType = - HashBasedTable.create(); - // Stores aggregated counters for each (indexing event type, result) pair across all segments - // Example stat name - // "segment_writer_partial_update_success_all_segments" - private final Table - aggregateStatsForUpdateType = HashBasedTable.create(); - // Stores per segment counters for each (indexing event type, non-retryable failure reason) pair - // Example stat name - // "segment_writer_partial_update_failed_for_tweet_in_index_segment_twttr_search_t_%d_p_0_of_1" - private final Table - failureStatsForUpdateType = HashBasedTable.create(); - // Stores aggregated counters for each (indexing event type, 
non-retryable failure reason) pair - // Example stat name - // "segment_writer_partial_update_failed_for_tweet_in_index_all_segments" - private final Table - aggregateFailureStatsForUpdateType = HashBasedTable.create(); - private final EnumMap eventTimers = - new EnumMap<>(ThriftIndexingEventType.class); - private final EnumMap - droppedUpdatesForDisabledSegments = new EnumMap<>(ThriftIndexingEventType.class); - // We pass this stat from the SearchIndexingMetricSet so that we can share the atomic longs - // between all SegmentWriters and export the largest freshness value across all segments. - private final EnumMap updateFreshness; - private final EnumMap> indexingLatency = - new EnumMap<>(ThriftIndexingEventType.class); - - public SegmentWriter( - SegmentInfo segmentInfo, - EnumMap updateFreshness - ) { - this.segmentInfo = segmentInfo; - this.updateFreshness = updateFreshness; - this.earlybirdSegment = segmentInfo.getIndexSegment(); - this.penguinVersion = EarlybirdConfig.getPenguinVersionByte(); - this.updateFactory = segmentInfo.getEarlybirdIndexConfig().createUpdateFactory(); - this.documentFactory = segmentInfo.getEarlybirdIndexConfig().createDocumentFactory(); - - String segmentName = segmentInfo.getSegmentName(); - for (ThriftIndexingEventType type : ThriftIndexingEventType.values()) { - for (Result result : Result.values()) { - String stat = String.format(EVENT_COUNTER, type, result, segmentName).toLowerCase(); - statsForUpdateType.put(type, result, SearchRateCounter.export(stat)); - - String aggregateStat = - String.format(EVENT_COUNTER_ALL_SEGMENTS, type, result).toLowerCase(); - aggregateStatsForUpdateType.put(type, result, SearchRateCounter.export(aggregateStat)); - } - - for (FailureReason reason : FailureReason.values()) { - String stat = String.format(EVENT_COUNTER, type, reason, segmentName).toLowerCase(); - failureStatsForUpdateType.put(type, reason, SearchRateCounter.export(stat)); - - String aggregateStat = - String.format(EVENT_COUNTER_ALL_SEGMENTS, type, reason).toLowerCase(); - aggregateFailureStatsForUpdateType.put( - type, reason, SearchRateCounter.export(aggregateStat)); - } - - eventTimers.put(type, SearchTimerStats.export( - String.format(EVENT_TIMERS, type).toLowerCase(), - TimeUnit.MICROSECONDS, - false)); - droppedUpdatesForDisabledSegments.put( - type, - SearchRateCounter.export( - String.format(DROPPED_UPDATES_FOR_DISABLED_SEGMENTS, type).toLowerCase())); - indexingLatency.put( - type, - PercentileUtil.createPercentile( - String.format(INDEXING_LATENCY, type).toLowerCase())); - } - - this.missingPenguinVersion = SearchRateCounter.export( - "documents_without_current_penguin_version_" + penguinVersion + "_" + segmentName); - } - - @Override - public synchronized Result indexThriftVersionedEvents(ThriftVersionedEvents tve) - throws IOException { - if (!tve.getVersionedEvents().containsKey(penguinVersion)) { - missingPenguinVersion.increment(); - return Result.FAILURE_NOT_RETRYABLE; - } - - ThriftIndexingEvent tie = tve.getVersionedEvents().get(penguinVersion); - ThriftIndexingEventType eventType = tie.getEventType(); - - if (!segmentInfo.isEnabled()) { - droppedUpdatesForDisabledSegments.get(eventType).increment(); - return Result.SUCCESS; - } - - SearchTimerStats timerStats = eventTimers.get(eventType); - SearchTimer timer = timerStats.startNewTimer(); - - long tweetId = tve.getId(); - Result result = tryApplyIndexingEvent(tweetId, tie); - - if (result == Result.SUCCESS) { - long tweetAgeInMs = SnowflakeIdParser.getTimestampFromTweetId(tweetId); - - 
AtomicLong freshness = updateFreshness.get(tie.getEventType()); - // Note that this is racy at startup because we don't do an atomic swap, but it will be - // approximately accurate, and this stat doesn't matter until we are current. - if (freshness.get() < tweetAgeInMs) { - freshness.set(tweetAgeInMs); - } - - if (tie.isSetCreateTimeMillis()) { - long age = Time.now().inMillis() - tie.getCreateTimeMillis(); - indexingLatency.get(tie.getEventType()).record(age); - } - } - - statsForUpdateType.get(eventType, result).increment(); - aggregateStatsForUpdateType.get(eventType, result).increment(); - timerStats.stopTimerAndIncrement(timer); - - return result; - } - - public SegmentInfo getSegmentInfo() { - return segmentInfo; - } - - public boolean hasTweet(long tweetId) throws IOException { - return earlybirdSegment.hasDocument(tweetId); - } - - private Result tryApplyIndexingEvent(long tweetId, ThriftIndexingEvent tie) throws IOException { - if (applyIndexingEvent(tie, tweetId)) { - return Result.SUCCESS; - } - - if (tie.getEventType() == ThriftIndexingEventType.INSERT) { - // We don't retry inserts - incrementFailureStats(tie, FailureReason.FAILED_INSERT); - return Result.FAILURE_NOT_RETRYABLE; - } - - if (earlybirdSegment.hasDocument(tweetId)) { - // An update fails to be applied for a tweet that is in the index. - incrementFailureStats(tie, FailureReason.FAILED_FOR_TWEET_IN_INDEX); - return Result.FAILURE_NOT_RETRYABLE; - } - - if (segmentInfo.isComplete()) { - // An update is directed at a tweet that is not in the segment (hasDocument(tweetId) failed), - // and the segment is complete (i.e. there will never be new tweets for this segment). - incrementFailureStats(tie, FailureReason.FAILED_FOR_COMPLETE_SEGMENT); - return Result.FAILURE_NOT_RETRYABLE; - } - - // The tweet may arrive later for this event, so it's possible a later try will succeed - return Result.FAILURE_RETRYABLE; - } - - private void incrementFailureStats(ThriftIndexingEvent tie, FailureReason failureReason) { - failureStatsForUpdateType.get(tie.getEventType(), failureReason).increment(); - aggregateFailureStatsForUpdateType.get(tie.getEventType(), failureReason).increment(); - } - - private boolean applyIndexingEvent(ThriftIndexingEvent tie, long tweetId) throws IOException { - switch (tie.getEventType()) { - case OUT_OF_ORDER_APPEND: - return earlybirdSegment.appendOutOfOrder(updateFactory.newDocument(tie), tweetId); - case PARTIAL_UPDATE: - return earlybirdSegment.applyPartialUpdate(tie); - case DELETE: - return earlybirdSegment.delete(tweetId); - case INSERT: - earlybirdSegment.addDocument(buildInsertDocument(tie, tweetId)); - return true; - default: - throw new IllegalArgumentException("Unexpected update type: " + tie.getEventType()); - } - } - - private TweetDocument buildInsertDocument(ThriftIndexingEvent tie, long tweetId) { - return new TweetDocument( - tweetId, - segmentInfo.getTimeSliceID(), - tie.getCreateTimeMillis(), - documentFactory.newDocument(tie)); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SimpleSegmentIndexer.java b/src/java/com/twitter/search/earlybird/partition/SimpleSegmentIndexer.java deleted file mode 100644 index c96f7373c..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SimpleSegmentIndexer.java +++ /dev/null @@ -1,191 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.concurrent.TimeUnit; - -import javax.annotation.Nullable; - -import com.google.common.base.Stopwatch; - - -import org.slf4j.Logger; 
-import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.index.EarlybirdSegment; - -/** - * SimpleSegmentIndex indexes all Tweets for a *complete* segment. It does not index any updates or - * deletes. - */ -public class SimpleSegmentIndexer { - private static final Logger LOG = LoggerFactory.getLogger(SimpleSegmentIndexer.class); - - /** - * If not null, this segment is appended at the end after indexing finishes. - */ - @Nullable - private final SegmentInfo segmentToAppend; - - private final RecordReader tweetReader; - private final SearchIndexingMetricSet partitionIndexingMetricSet; - - // Segment we are indexing. - private EarlybirdSegment indexingSegment; - - // Total number of statuses indexed in this segment. - private long segmentSize = 0; - - public SimpleSegmentIndexer( - RecordReader tweetReader, - SearchIndexingMetricSet partitionIndexingMetricSet) { - this(tweetReader, partitionIndexingMetricSet, null); - } - - public SimpleSegmentIndexer(RecordReader tweetReader, - SearchIndexingMetricSet partitionIndexingMetricSet, - @Nullable SegmentInfo segmentToAppend) { - this.tweetReader = tweetReader; - this.segmentToAppend = segmentToAppend; - this.partitionIndexingMetricSet = partitionIndexingMetricSet; - } - - private boolean shouldIndexSegment(SegmentInfo segmentInfo) { - if (!segmentInfo.isEnabled()) { - return false; - } - - if (segmentToAppend != null) { - return true; - } - - return !segmentInfo.isComplete() - && !segmentInfo.isIndexing() - && !segmentInfo.getSyncInfo().isLoaded(); - } - - /** - * Indexes all tweets for a complete segment. - */ - public boolean indexSegment(SegmentInfo segmentInfo) { - LOG.info("Indexing segment " + segmentInfo.getSegmentName()); - if (!shouldIndexSegment(segmentInfo)) { - return false; - } - - // If we're starting to index, we're not complete, will become complete if we - // were successful here. - segmentInfo.setComplete(false); - - try { - segmentInfo.setIndexing(true); - indexingSegment = segmentInfo.getIndexSegment(); - - // if we're updating the segment, then we'll index only the new available days - // and then append the lucene index from the old segment - // If segmentToAppend is not null, it means we are updating a segment. - if (indexingSegment.tryToLoadExistingIndex()) { - segmentInfo.getSyncInfo().setLoaded(true); - LOG.info("Loaded existing index for " + segmentInfo + ", not indexing."); - } else { - indexingLoop(); - if (segmentToAppend != null) { - indexingSegment.append(segmentToAppend.getIndexSegment()); - } - } - - segmentInfo.setIndexing(false); - segmentInfo.setComplete(true); - segmentInfo.setWasIndexed(true); - LOG.info("Successfully indexed segment " + segmentInfo.getSegmentName()); - return true; - } catch (Exception e) { - LOG.error("Exception while indexing IndexSegment " + segmentInfo - + " after " + indexingSegment.getIndexStats().getStatusCount() + " documents.", e); - partitionIndexingMetricSet.simpleSegmentIndexerExceptionCounter.increment(); - - LOG.warn("Failed to load a new day into full archive. Cleaning up segment: " - + indexingSegment.getSegmentName()); - - // Clean up the lucene dir if it exists. Earlybird will retry loading the new day again later. 
- if (!segmentInfo.deleteLocalIndexedSegmentDirectoryImmediately()) { - LOG.error("Failed to clean up index segment folder after indexing failures."); - } - - return false; - } finally { - if (tweetReader != null) { - tweetReader.stop(); - } - segmentInfo.setIndexing(false); - } - } - - // Indexes a document if available. Returns true if index was updated. - protected boolean indexDocument(TweetDocument tweetDocument) throws IOException { - if (tweetDocument == null) { - return false; - } - - SearchTimer timer = partitionIndexingMetricSet.statusStats.startNewTimer(); - indexingSegment.addDocument(tweetDocument); - partitionIndexingMetricSet.statusStats.stopTimerAndIncrement(timer); - segmentSize++; - return true; - } - - /** - * Indexes all tweets for this segment, until no more tweets are available. - * - * @throws InterruptedException If the thread is interrupted while indexing tweets. - * @throws IOException If there's a problem reading or indexing tweets. - */ - public void indexingLoop() throws InterruptedException, IOException { - Stopwatch stopwatch = Stopwatch.createStarted(); - - Stopwatch readingStopwatch = Stopwatch.createUnstarted(); - Stopwatch indexingStopwatch = Stopwatch.createUnstarted(); - - int indexedDocumentsCount = 0; - SearchLongGauge timeToIndexSegment = SearchLongGauge.export("time_to_index_segment"); - timeToIndexSegment.set(0); - if (tweetReader != null) { - while (!tweetReader.isExhausted() && !Thread.currentThread().isInterrupted()) { - readingStopwatch.start(); - TweetDocument tweetDocument = tweetReader.readNext(); - readingStopwatch.stop(); - - indexingStopwatch.start(); - boolean documentIndexed = indexDocument(tweetDocument); - indexingStopwatch.stop(); - - if (!documentIndexed) { - // No documents waiting to be indexed. Take a nap. - Thread.sleep(10); - } else { - indexedDocumentsCount++; - } - - if (segmentSize >= EarlybirdConfig.getMaxSegmentSize()) { - LOG.error("Reached max segment size " + segmentSize + ", stopping indexer"); - partitionIndexingMetricSet.maxSegmentSizeReachedCounter.increment(); - tweetReader.stop(); - break; - } - } - } - - timeToIndexSegment.set(stopwatch.elapsed(TimeUnit.MILLISECONDS)); - - LOG.info("SimpleSegmentIndexer finished: {}. 
Documents: {}", - indexingSegment.getSegmentName(), indexedDocumentsCount); - LOG.info("Time taken: {}, Reading time: {}, Indexing time: {}", - stopwatch, readingStopwatch, indexingStopwatch); - LOG.info("Total Memory: {}, Free Memory: {}", - Runtime.getRuntime().totalMemory(), Runtime.getRuntime().freeMemory()); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/SimpleStreamIndexer.java b/src/java/com/twitter/search/earlybird/partition/SimpleStreamIndexer.java deleted file mode 100644 index 7b4e72281..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SimpleStreamIndexer.java +++ /dev/null @@ -1,187 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.time.Duration; -import java.util.List; -import java.util.Map; -import java.util.concurrent.atomic.AtomicBoolean; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Verify; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.consumer.OffsetAndTimestamp; -import org.apache.kafka.common.PartitionInfo; -import org.apache.kafka.common.TopicPartition; -import org.apache.kafka.common.errors.WakeupException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.exception.MissingKafkaTopicException; - -/** - * Abstract base class for processing events from Kafka with the goal of indexing them and - * keeping Earlybirds up to date with the latest events. Indexing is defined by the - * implementation. - * - * NOTE: {@link EarlybirdKafkaConsumer} (tweet/tweet events consumer) is doing this in its - * own way, we might merge in the future. - * - * @param (Long) - * @param (Event/Thrift type to be consumed) - */ -public abstract class SimpleStreamIndexer { - private static final Logger LOG = LoggerFactory.getLogger(SimpleStreamIndexer.class); - - private static final Duration POLL_TIMEOUT = Duration.ofMillis(250); - private static final Duration CAUGHT_UP_FRESHNESS = Duration.ofSeconds(5); - - protected static final int MAX_POLL_RECORDS = 1000; - - private final SearchCounter numPollErrors; - protected SearchRateCounter indexingSuccesses; - protected SearchRateCounter indexingFailures; - - protected List topicPartitionList; - protected final KafkaConsumer kafkaConsumer; - private final AtomicBoolean running = new AtomicBoolean(true); - private final String topic; - - private boolean isCaughtUp = false; - - /** - * Create a simple stream indexer. - * - * @throws MissingKafkaTopicException - this shouldn't happen, but in case some - * external stream is not present, we want to have the caller decide how to - * handle it. Some missing streams might be fatal, for others it might not be - * justified to block startup. There's no point in constructing this object if - * a stream is missing, so we don't allow that to happen. 
- */ - public SimpleStreamIndexer(KafkaConsumer kafkaConsumer, - String topic) throws MissingKafkaTopicException { - this.kafkaConsumer = kafkaConsumer; - this.topic = topic; - List partitionInfos = this.kafkaConsumer.partitionsFor(topic); - - if (partitionInfos == null) { - LOG.error("Ooops, no partitions for {}", topic); - NonPagingAssert.assertFailed("missing_topic_" + topic); - throw new MissingKafkaTopicException(topic); - } - LOG.info("Discovered {} partitions for topic: {}", partitionInfos.size(), topic); - - numPollErrors = SearchCounter.export("stream_indexer_poll_errors_" + topic); - - this.topicPartitionList = partitionInfos - .stream() - .map(info -> new TopicPartition(topic, info.partition())) - .collect(Collectors.toList()); - this.kafkaConsumer.assign(topicPartitionList); - } - - /** - * Consume updates on startup until current (eg. until we've seen a record within 5 seconds - * of current time.) - */ - public void readRecordsUntilCurrent() { - do { - ConsumerRecords records = poll(); - - for (ConsumerRecord record : records) { - if (record.timestamp() > System.currentTimeMillis() - CAUGHT_UP_FRESHNESS.toMillis()) { - isCaughtUp = true; - } - validateAndIndexRecord(record); - } - } while (!isCaughtUp()); - } - - /** - * Run the consumer, indexing record values directly into their respective structures. - */ - public void run() { - try { - while (running.get()) { - for (ConsumerRecord record : poll()) { - validateAndIndexRecord(record); - } - } - } catch (WakeupException e) { - if (running.get()) { - LOG.error("Caught wakeup exception while running", e); - } - } finally { - kafkaConsumer.close(); - LOG.info("Consumer closed."); - } - } - - public boolean isCaughtUp() { - return isCaughtUp; - } - - /** - * For every partition in the topic, seek to an offset that has a timestamp greater - * than or equal to the given timestamp. - * @param timestamp - */ - public void seekToTimestamp(Long timestamp) { - Map partitionTimestampMap = topicPartitionList.stream() - .collect(Collectors.toMap(tp -> tp, tp -> timestamp)); - Map partitionOffsetMap = - kafkaConsumer.offsetsForTimes(partitionTimestampMap); - - partitionOffsetMap.forEach((tp, offsetAndTimestamp) -> { - Verify.verify(offsetAndTimestamp != null, - "Couldn't find records after timestamp: " + timestamp); - - kafkaConsumer.seek(tp, offsetAndTimestamp.offset()); - }); - } - - /** - * Seeks the kafka consumer to the beginning. - */ - public void seekToBeginning() { - kafkaConsumer.seekToBeginning(topicPartitionList); - } - - /** - * Polls and returns at most MAX_POLL_RECORDS records. - * @return - */ - @VisibleForTesting - protected ConsumerRecords poll() { - ConsumerRecords records; - try { - records = kafkaConsumer.poll(POLL_TIMEOUT); - } catch (Exception e) { - records = ConsumerRecords.empty(); - if (e instanceof WakeupException) { - throw e; - } else { - LOG.warn("Error polling from {} kafka topic.", topic, e); - numPollErrors.increment(); - } - } - return records; - } - - protected abstract void validateAndIndexRecord(ConsumerRecord record); - - // Shutdown hook which can be called from a seperate thread. Calling consumer.wakeup() interrupts - // the running indexer and causes it to first stop polling for new records before gracefully - // closing the consumer. 
- public void close() { - LOG.info("Shutting down stream indexer for topic {}", topic); - running.set(false); - kafkaConsumer.wakeup(); - } -} - diff --git a/src/java/com/twitter/search/earlybird/partition/SimpleUpdateIndexer.java b/src/java/com/twitter/search/earlybird/partition/SimpleUpdateIndexer.java deleted file mode 100644 index 30d8d3e3f..000000000 --- a/src/java/com/twitter/search/earlybird/partition/SimpleUpdateIndexer.java +++ /dev/null @@ -1,140 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.Optional; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.util.io.dl.DLRecordTimestampUtil; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.segment.SegmentDataReaderSet; - -/** - * Indexes all updates for a complete segment at startup. - */ -public class SimpleUpdateIndexer { - private static final Logger LOG = LoggerFactory.getLogger(SimpleUpdateIndexer.class); - - private final SegmentDataReaderSet readerSet; - private final SearchIndexingMetricSet partitionIndexingMetricSet; - private final InstrumentedQueue retryQueue; - private final CriticalExceptionHandler criticalExceptionHandler; - - public SimpleUpdateIndexer(SegmentDataReaderSet readerSet, - SearchIndexingMetricSet partitionIndexingMetricSet, - InstrumentedQueue retryQueue, - CriticalExceptionHandler criticalExceptionHandler) { - this.readerSet = readerSet; - this.partitionIndexingMetricSet = partitionIndexingMetricSet; - this.retryQueue = retryQueue; - this.criticalExceptionHandler = criticalExceptionHandler; - } - - /** - * Indexes all updates for the given segment. - */ - public void indexAllUpdates(SegmentInfo segmentInfo) { - Preconditions.checkState( - segmentInfo.isEnabled() && segmentInfo.isComplete() && !segmentInfo.isIndexing()); - - try { - readerSet.attachUpdateReaders(segmentInfo); - } catch (IOException e) { - throw new RuntimeException("Could not attach readers for segment: " + segmentInfo, e); - } - - RecordReader reader = - readerSet.getUpdateEventsReaderForSegment(segmentInfo); - if (reader == null) { - return; - } - - LOG.info("Got updates reader (starting timestamp = {}) for segment {}: {}", - DLRecordTimestampUtil.recordIDToTimestamp(reader.getOffset()), - segmentInfo.getSegmentName(), - reader); - - // The segment is complete (we check this in indexAllUpdates()), so we can safely get - // the smallest and largest tweet IDs in this segment. 
- long lowestTweetId = segmentInfo.getIndexSegment().getLowestTweetId(); - long highestTweetId = segmentInfo.getIndexSegment().getHighestTweetId(); - Preconditions.checkArgument( - lowestTweetId > 0, - "Could not get the lowest tweet ID in segment " + segmentInfo.getSegmentName()); - Preconditions.checkArgument( - highestTweetId > 0, - "Could not get the highest tweet ID in segment " + segmentInfo.getSegmentName()); - - SegmentWriter segmentWriter = - new SegmentWriter(segmentInfo, partitionIndexingMetricSet.updateFreshness); - - LOG.info("Starting to index updates for segment: {}", segmentInfo.getSegmentName()); - Stopwatch stopwatch = Stopwatch.createStarted(); - - while (!Thread.currentThread().isInterrupted() && !reader.isCaughtUp()) { - applyUpdate(segmentInfo, reader, segmentWriter, lowestTweetId, highestTweetId); - } - - LOG.info("Finished indexing updates for segment {} in {} seconds.", - segmentInfo.getSegmentName(), - stopwatch.elapsed(TimeUnit.SECONDS)); - } - - private void applyUpdate(SegmentInfo segmentInfo, - RecordReader reader, - SegmentWriter segmentWriter, - long lowestTweetId, - long highestTweetId) { - ThriftVersionedEvents update; - try { - update = reader.readNext(); - } catch (IOException e) { - LOG.error("Exception while reading update for segment: " + segmentInfo.getSegmentName(), e); - criticalExceptionHandler.handle(this, e); - return; - } - if (update == null) { - LOG.warn("Update is not available but reader was not caught up. Segment: {}", - segmentInfo.getSegmentName()); - return; - } - - try { - // If the indexer put this update in the wrong timeslice, add it to the retry queue, and - // let PartitionIndexer retry it (it has logic to apply it to the correct segment). - if ((update.getId() < lowestTweetId) || (update.getId() > highestTweetId)) { - retryQueue.add(update); - return; - } - - // At this point, we are updating a segment that has every tweet it will ever have, - // (the segment is complete), so there is no point queueing an update to retry it. 
- SearchTimer timer = partitionIndexingMetricSet.updateStats.startNewTimer(); - segmentWriter.indexThriftVersionedEvents(update); - partitionIndexingMetricSet.updateStats.stopTimerAndIncrement(timer); - - updateUpdatesStreamTimestamp(segmentInfo); - } catch (IOException e) { - LOG.error("Exception while indexing updates for segment: " + segmentInfo.getSegmentName(), e); - criticalExceptionHandler.handle(this, e); - } - } - - private void updateUpdatesStreamTimestamp(SegmentInfo segmentInfo) { - Optional offset = readerSet.getUpdateEventsStreamOffsetForSegment(segmentInfo); - if (!offset.isPresent()) { - LOG.info("Unable to get updates stream offset for segment: {}", segmentInfo.getSegmentName()); - } else { - long offsetTimeMillis = DLRecordTimestampUtil.recordIDToTimestamp(offset.get()); - segmentInfo.setUpdatesStreamOffsetTimestamp(offsetTimeMillis); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/StartupUserEventIndexer.java b/src/java/com/twitter/search/earlybird/partition/StartupUserEventIndexer.java deleted file mode 100644 index 5a468f3a6..000000000 --- a/src/java/com/twitter/search/earlybird/partition/StartupUserEventIndexer.java +++ /dev/null @@ -1,236 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.sql.Timestamp; -import java.text.DateFormat; -import java.text.SimpleDateFormat; -import java.time.Duration; -import java.util.Date; -import java.util.Optional; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserTableBuilderFromSnapshot; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.factory.EarlybirdIndexConfigUtil; - -/** - * Indexer class responsible for getting the the {@link UserTable} and {@link UserScrubGeoMap} - * indexed up until the current moment. 
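// Illustrative note (not part of the original source): the startup sequence this class
// drives, in order:
//   1. indexAllEvents() wraps everything in the "indexing all user events" startup event.
//   2. indexUserUpdates(): try to build the UserTable from a snapshot; on success, install
//      it in the SegmentManager and seek the user-updates consumer to the snapshot's last
//      record timestamp (falling back to the beginning of the topic if the seek keeps
//      failing); on failure, reindex user updates from scratch from the beginning of the
//      topic. Then consume until caught up.
//   3. indexUserScrubGeoEvents(), only if consuming user scrub geo events is enabled:
//      seek to a derived timestamp (see getTimestampForUserScrubGeoEventKafkaConsumer
//      below) and consume until caught up.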
- */ -public class StartupUserEventIndexer { - private static final Logger LOG = LoggerFactory.getLogger(StartupUserEventIndexer.class); - private static final String LOAD_USER_UPDATE_SNAPSHOT = - "loading user update snapshot"; - private static final String INDEX_ALL_USER_EVENTS = - "indexing all user events"; - private static final NonPagingAssert FAILED_USER_TABLE_HDFS_LOAD - = new NonPagingAssert("failed_user_table_hdfs_load"); - - private static final long MAX_RETRY_MILLIS_FOR_SEEK_TO_TIMESTAMP = - Duration.ofMinutes(1).toMillis(); - private static final long SLEEP_MILLIS_BETWEEN_RETRIES_FOR_SEEK_TO_TIMESTAMP = - Duration.ofSeconds(1).toMillis(); - - private static final long MILLIS_IN_FOURTEEN_DAYS = 1209600000; - private static final long MILLIS_IN_ONE_DAY = 86400000; - - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final UserUpdatesStreamIndexer userUpdatesStreamIndexer; - private final UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer; - private final SegmentManager segmentManager; - private final Clock clock; - - public StartupUserEventIndexer( - SearchIndexingMetricSet searchIndexingMetricSet, - UserUpdatesStreamIndexer userUpdatesStreamIndexer, - UserScrubGeoEventStreamIndexer userScrubGeoEventStreamIndexer, - SegmentManager segmentManager, - Clock clock) { - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.userUpdatesStreamIndexer = userUpdatesStreamIndexer; - this.userScrubGeoEventStreamIndexer = userScrubGeoEventStreamIndexer; - this.segmentManager = segmentManager; - this.clock = clock; - } - - /** - * Index all user events. - */ - public void indexAllEvents() { - EarlybirdStatus.beginEvent( - INDEX_ALL_USER_EVENTS, searchIndexingMetricSet.startupInUserEventIndexer); - - indexUserUpdates(); - if (EarlybirdConfig.consumeUserScrubGeoEvents()) { - indexUserScrubGeoEvents(); - } - - EarlybirdStatus.endEvent( - INDEX_ALL_USER_EVENTS, searchIndexingMetricSet.startupInUserEventIndexer); - } - - /** - * Index user updates until current. - */ - public void indexUserUpdates() { - EarlybirdStatus.beginEvent( - LOAD_USER_UPDATE_SNAPSHOT, searchIndexingMetricSet.startupInUserUpdates); - - Optional userTable = buildUserTable(); - if (userTable.isPresent()) { - segmentManager.getUserTable().setTable(userTable.get()); - LOG.info("Set new user table."); - - if (!seekToTimestampWithRetriesIfNecessary( - userTable.get().getLastRecordTimestamp(), - userUpdatesStreamIndexer)) { - LOG.error("User Updates stream indexer unable to seek to timestamp. " - + "Will seek to beginning."); - userUpdatesStreamIndexer.seekToBeginning(); - } - } else { - LOG.info("Failed to load user update snapshot. Will reindex user updates from scratch."); - FAILED_USER_TABLE_HDFS_LOAD.assertFailed(); - userUpdatesStreamIndexer.seekToBeginning(); - } - - userUpdatesStreamIndexer.readRecordsUntilCurrent(); - LOG.info("Finished catching up on user updates via Kafka"); - - EarlybirdStatus.endEvent( - LOAD_USER_UPDATE_SNAPSHOT, searchIndexingMetricSet.startupInUserUpdates); - } - - /** - * Index UserScrubGeoEvents until current. 
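// Illustrative note (not part of the original source): where the seek timestamp for the
// geo-scrub consumer comes from (see the getTimestamp* methods below).
//   Realtime / protected: System.currentTimeMillis() - MILLIS_IN_FOURTEEN_DAYS,
//     i.e. now - 14 * 86,400,000 ms = now - 1,209,600,000 ms, far enough back to cover
//     the lifetime of the current index.
//   Archive: the configured scrub gen date (parsed as yyyyMMdd) minus MILLIS_IN_ONE_DAY
//     as a one-day buffer; if the date cannot be parsed, -1 is returned and the consumer
//     is seeked to the beginning of the topic instead.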
- */ - public void indexUserScrubGeoEvents() { - seekUserScrubGeoEventKafkaConsumer(); - - SearchTimer timer = new SearchTimer(); - timer.start(); - userScrubGeoEventStreamIndexer.readRecordsUntilCurrent(); - timer.stop(); - - LOG.info("Finished catching up on user scrub geo events via Kafka"); - LOG.info("UserScrubGeoMap contains {} users and finished in {} milliseconds", - segmentManager.getUserScrubGeoMap().getNumUsersInMap(), timer.getElapsed()); - } - - /** - * Seeks UserScrubGeoEventKafkaConsumer using timestamp derived from - * getTimestampForUserScrubGeoEventKafkaConsumer(). - */ - @VisibleForTesting - public void seekUserScrubGeoEventKafkaConsumer() { - long seekTimestamp = getTimestampForUserScrubGeoEventKafkaConsumer(); - if (seekTimestamp == -1) { - userScrubGeoEventStreamIndexer.seekToBeginning(); - } else { - if (!seekToTimestampWithRetriesIfNecessary(seekTimestamp, userScrubGeoEventStreamIndexer)) { - LOG.error("User Scrub Geo stream indexer unable to seek to timestamp. " - + "Will seek to beginning."); - userScrubGeoEventStreamIndexer.seekToBeginning(); - } - } - } - - /** - * Get timestamp to seek UserScrubGeoEventKafkaConsumer to. - * @return - */ - public long getTimestampForUserScrubGeoEventKafkaConsumer() { - if (EarlybirdIndexConfigUtil.isArchiveSearch()) { - return getTimestampForArchive(); - } else { - return getTimestampForRealtime(); - } - } - - /** - * For archive: grab scrub gen from config file and convert date into a timestamp. Add buffer of - * one day. We need all UserScrubGeoEvents since the date of the current scrub gen. - * - * See go/realtime-geo-filtering - * @return - */ - public long getTimestampForArchive() { - try { - String scrubGenString = EarlybirdProperty.EARLYBIRD_SCRUB_GEN.get(); - - DateFormat dateFormat = new SimpleDateFormat("yyyyMMdd"); - Date date = dateFormat.parse(scrubGenString); - return new Timestamp(date.getTime()).getTime() - MILLIS_IN_ONE_DAY; - - } catch (Exception e) { - LOG.error("Could not derive timestamp from scrub gen. " - + "Will seek User Scrub Geo Kafka consumer to beginning of topic"); - } - return -1; - } - - /** - * For realtime/protected: Compute the timestamp 14 days from the current time. This will account - * for all events that have occurred during the lifecylce of the current index. - * - * See go/realtime-geo-filtering - */ - public long getTimestampForRealtime() { - return System.currentTimeMillis() - MILLIS_IN_FOURTEEN_DAYS; - } - - private boolean seekToTimestampWithRetriesIfNecessary( - long lastRecordTimestamp, - SimpleStreamIndexer streamIndexer) { - long initialTimeMillis = clock.nowMillis(); - int numFailures = 0; - while (shouldTrySeekToTimestamp(initialTimeMillis, numFailures)) { - try { - streamIndexer.seekToTimestamp(lastRecordTimestamp); - LOG.info("Seeked consumer to timestamp {} after {} failures", - lastRecordTimestamp, numFailures); - return true; - } catch (Exception e) { - numFailures++; - LOG.info("Caught exception when seeking to timestamp. Num failures: {}. Exception: {}", - numFailures, e); - // Sleep before attempting to retry - try { - clock.waitFor(SLEEP_MILLIS_BETWEEN_RETRIES_FOR_SEEK_TO_TIMESTAMP); - } catch (InterruptedException interruptedException) { - LOG.warn("Interrupted while sleeping between seekToTimestamp retries", - interruptedException); - // Preserve interrupt status. 
- Thread.currentThread().interrupt(); - break; - } - } - } - // Failed to seek to timestamp - return false; - } - - private boolean shouldTrySeekToTimestamp(long initialTimeMillis, int numFailures) { - if (numFailures == 0) { - // no attempts have been made yet, so we should try to seek to timestamp - return true; - } else { - return clock.nowMillis() - initialTimeMillis < MAX_RETRY_MILLIS_FOR_SEEK_TO_TIMESTAMP; - } - } - - protected Optional buildUserTable() { - UserTableBuilderFromSnapshot builder = new UserTableBuilderFromSnapshot(); - return builder.build(segmentManager.getUserTable().getUserIdFilter()); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/StatusBatchFlushVersion.java b/src/java/com/twitter/search/earlybird/partition/StatusBatchFlushVersion.java deleted file mode 100644 index 3175e89e1..000000000 --- a/src/java/com/twitter/search/earlybird/partition/StatusBatchFlushVersion.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird.partition; - -/** - * Keeps track of versioning for flushed status batch data. - */ -public enum StatusBatchFlushVersion { - - VERSION_0("Initial version of status batch flushing", true), - VERSION_1("Switching to use field groups (contains changes to PartitionedBatch)", true), - VERSION_2("Removing support for per-partition _SUCCESS markers", true), - /* Put the semi colon on a separate line to avoid polluting git blame history */; - - public static final StatusBatchFlushVersion CURRENT_FLUSH_VERSION = - StatusBatchFlushVersion.values()[StatusBatchFlushVersion.values().length - 1]; - - public static final String DELIMITER = "_v_"; - - private final String description; - private final boolean isOfficial; - - private StatusBatchFlushVersion(String description, boolean official) { - this.description = description; - isOfficial = official; - } - - public int getVersionNumber() { - return this.ordinal(); - } - - public String getVersionFileExtension() { - return DELIMITER + ordinal(); - } - - public boolean isOfficial() { - return isOfficial; - } - - public String getDescription() { - return description; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/TimeLimitedHadoopExistsCall.java b/src/java/com/twitter/search/earlybird/partition/TimeLimitedHadoopExistsCall.java deleted file mode 100644 index e3781bac7..000000000 --- a/src/java/com/twitter/search/earlybird/partition/TimeLimitedHadoopExistsCall.java +++ /dev/null @@ -1,90 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.concurrent.Callable; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.TimeUnit; - -import com.google.common.util.concurrent.SimpleTimeLimiter; -import com.google.common.util.concurrent.TimeLimiter; - -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; - -/** - * Abstracts details of making time limited calls to hadoop. - * - * During IM-3556 we discovered that hadoop API calls can take a long time (seconds, minutes) - * if the Hadoop clsuter is in a bad state. Our code was generally not prepared for that and - * this caused various issues. This class is a fix on top of the Hadoop API's exists call and - * it introduces a timeout. - * - * The main motivation for having this as an external class is for testability. 
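 *
 * The wrapper below is the standard Guava TimeLimiter pattern. Roughly (a sketch of the same
 * idea, not a copy of the code that follows):
 *
 *   TimeLimiter limiter = SimpleTimeLimiter.create(Executors.newFixedThreadPool(5));
 *   boolean exists = limiter.callWithTimeout(
 *       () -> fileSystem.exists(path), timeLimitInSeconds, TimeUnit.SECONDS);
 *
 * If Hadoop does not answer within the limit, the call fails with a timeout exception instead
 * of blocking the flush-check thread indefinitely.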
- */ -public class TimeLimitedHadoopExistsCall { - private final TimeLimiter hadoopCallsTimeLimiter; - private final FileSystem fileSystem; - private final int timeLimitInSeconds; - - private static final SearchTimerStats EXISTS_CALLS_TIMER = - SearchTimerStats.export("hadoop_exists_calls"); - - private static final SearchCounter EXISTS_CALLS_EXCEPTION = - SearchCounter.export("hadoop_exists_calls_exception"); - - public TimeLimitedHadoopExistsCall(FileSystem fileSystem) { - // This times varies. Sometimes it's very quick, sometimes it takes some amount of seconds. - // Do a rate on hadoop_exists_calls_latency_ms to see for yourself. - this(fileSystem, 30); - } - - public TimeLimitedHadoopExistsCall(FileSystem fileSystem, int timeLimitInSeconds) { - // We do hadoop calls once every "FLUSH_CHECK_PERIOD" minutes. If a call takes - // a long time (say 10 minutes), we'll use a new thread for the next call, to give it - // a chance to complete. - // - // Let's say every call takes 2 hours. After 5 calls, the 6th call won't be able - // to take a thread out of the thread pool and it will time out. That's fair, we don't - // want to keep sending requests to Hadoop if the situation is so dire. - ExecutorService executorService = Executors.newFixedThreadPool(5); - this.hadoopCallsTimeLimiter = SimpleTimeLimiter.create(executorService); - this.fileSystem = fileSystem; - this.timeLimitInSeconds = timeLimitInSeconds; - } - - - protected boolean hadoopExistsCall(Path path) throws IOException { - SearchTimer timer = EXISTS_CALLS_TIMER.startNewTimer(); - boolean res = fileSystem.exists(path); - EXISTS_CALLS_TIMER.stopTimerAndIncrement(timer); - return res; - } - - /** - * Checks if a path exists on Hadoop. - * - * @return true if the path exists. - * @throws Exception see exceptions thrown by callWithTimeout - */ - boolean exists(Path path) throws Exception { - try { - boolean result = hadoopCallsTimeLimiter.callWithTimeout(new Callable() { - @Override - public Boolean call() throws Exception { - return hadoopExistsCall(path); - } - }, timeLimitInSeconds, TimeUnit.SECONDS); - - return result; - } catch (Exception ex) { - EXISTS_CALLS_EXCEPTION.increment(); - // No need to print and rethrow, it will be printed when caught upstream. 
- throw ex; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/TweetCreateHandler.java b/src/java/com/twitter/search/earlybird/partition/TweetCreateHandler.java deleted file mode 100644 index e47f75a09..000000000 --- a/src/java/com/twitter/search/earlybird/partition/TweetCreateHandler.java +++ /dev/null @@ -1,526 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.Iterator; - -import scala.runtime.BoxedUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.base.Verify; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.config.Config; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.util.GCUtil; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.CaughtUpMonitor; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.OutOfOrderRealtimeTweetIDMapper; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.util.CoordinatedEarlybirdActionInterface; -import com.twitter.util.Await; -import com.twitter.util.Duration; -import com.twitter.util.Future; -import com.twitter.util.TimeoutException; - -/** - * This class handles incoming new Tweets. It is responsible for creating segments for the incoming - * Tweets when necessary, triggering optimization on those segments, and writing Tweets to the - * correct segment. - */ -public class TweetCreateHandler { - private static final Logger LOG = LoggerFactory.getLogger(TweetCreateHandler.class); - - public static final long LATE_TWEET_TIME_BUFFER_MS = Duration.fromMinutes(1).inMilliseconds(); - - private static final String STATS_PREFIX = "tweet_create_handler_"; - - // To get a better idea of which of these succeeded and so on, see stats in SegmentManager. 
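  // In brief, handleTweetCreate() further down routes each incoming tweet as follows:
  //   id >= currentSegmentTimesliceBoundary              -> current segment
  //   id <  boundary, previous segment still unoptimized -> previous segment
  //   id <  boundary, no previous segment available      -> current segment, counted under
  //                                                          tweets_in_wrong_segment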
- private IndexingResultCounts indexingResultCounts; - private static final SearchRateCounter TWEETS_IN_WRONG_SEGMENT = - SearchRateCounter.export(STATS_PREFIX + "tweets_in_wrong_segment"); - private static final SearchRateCounter SEGMENTS_CLOSED_EARLY = - SearchRateCounter.export(STATS_PREFIX + "segments_closed_early"); - private static final SearchRateCounter INSERTED_IN_CURRENT_SEGMENT = - SearchRateCounter.export(STATS_PREFIX + "inserted_in_current_segment"); - private static final SearchRateCounter INSERTED_IN_PREVIOUS_SEGMENT = - SearchRateCounter.export(STATS_PREFIX + "inserted_in_previous_segment"); - private static final NewSegmentStats NEW_SEGMENT_STATS = new NewSegmentStats(); - private static final SearchCounter CREATED_SEGMENTS = - SearchCounter.export(STATS_PREFIX + "created_segments"); - private static final SearchRateCounter INCOMING_TWEETS = - SearchRateCounter.export(STATS_PREFIX + "incoming_tweets"); - private static final SearchRateCounter INDEXING_SUCCESS = - SearchRateCounter.export(STATS_PREFIX + "indexing_success"); - private static final SearchRateCounter INDEXING_FAILURE = - SearchRateCounter.export(STATS_PREFIX + "indexing_failure"); - - // Various stats and logging around creation of new segments, put in this - // class so that the code is not watered down too much by this. - private static class NewSegmentStats { - private static final String NEW_SEGMENT_STATS_PREFIX = - STATS_PREFIX + "new_segment_"; - - private static final SearchCounter START_NEW_AFTER_REACHING_LIMIT = - SearchCounter.export(NEW_SEGMENT_STATS_PREFIX + "start_after_reaching_limit"); - private static final SearchCounter START_NEW_AFTER_EXCEEDING_MAX_ID = - SearchCounter.export(NEW_SEGMENT_STATS_PREFIX + "start_after_exceeding_max_id"); - private static final SearchCounter TIMESLICE_SET_TO_CURRENT_ID = - SearchCounter.export(NEW_SEGMENT_STATS_PREFIX + "timeslice_set_to_current_id"); - private static final SearchCounter TIMESLICE_SET_TO_MAX_ID = - SearchCounter.export(NEW_SEGMENT_STATS_PREFIX + "timeslice_set_to_max_id"); - private static final SearchLongGauge TIMESPAN_BETWEEN_MAX_AND_CURRENT = - SearchLongGauge.export(NEW_SEGMENT_STATS_PREFIX + "timespan_between_id_and_max"); - - void recordCreateNewSegment() { - CREATED_SEGMENTS.increment(); - } - - void recordStartAfterReachingTweetsLimit(int numDocs, int numDocsCutoff, - int maxSegmentSize, int lateTweetBuffer) { - START_NEW_AFTER_REACHING_LIMIT.increment(); - LOG.info(String.format( - "Will create new segment: numDocs=%,d, numDocsCutoff=%,d" - + " | maxSegmentSize=%,d, lateTweetBuffer=%,d", - numDocs, numDocsCutoff, maxSegmentSize, lateTweetBuffer)); - } - - void recordStartAfterExceedingLargestValidTweetId(long tweetId, long largestValidTweetId) { - START_NEW_AFTER_EXCEEDING_MAX_ID.increment(); - LOG.info(String.format( - "Will create new segment: tweetDd=%,d, largestValidTweetID for segment=%,d", - tweetId, largestValidTweetId)); - } - - void recordSettingTimesliceToCurrentTweet(long tweetID) { - TIMESLICE_SET_TO_CURRENT_ID.increment(); - LOG.info("Creating new segment: tweet that triggered it has the largest id we've seen. " - + " id={}", tweetID); - } - - void recordSettingTimesliceToMaxTweetId(long tweetID, long maxTweetID) { - TIMESLICE_SET_TO_MAX_ID.increment(); - LOG.info("Creating new segment: tweet that triggered it doesn't have the largest id" - + " we've seen. 
tweetId={}, maxTweetId={}", - tweetID, maxTweetID); - long timeDifference = - SnowflakeIdParser.getTimeDifferenceBetweenTweetIDs(maxTweetID, tweetID); - LOG.info("Time difference between max seen and last seen: {} ms", timeDifference); - TIMESPAN_BETWEEN_MAX_AND_CURRENT.set(timeDifference); - } - - void wrapNewSegmentCreation(long tweetID, long maxTweetID, - long currentSegmentTimesliceBoundary, - long largestValidTweetIDForCurrentSegment) { - long timeDifferenceStartToMax = SnowflakeIdParser.getTimeDifferenceBetweenTweetIDs( - largestValidTweetIDForCurrentSegment, - currentSegmentTimesliceBoundary); - LOG.info("Time between timeslice boundary and largest valid tweet id: {} ms", - timeDifferenceStartToMax); - - LOG.info("Created new segment: (tweetId={}, maxTweetId={}, maxTweetId-tweetId={} " - + " | currentSegmentTimesliceBoundary={}, largestValidTweetIDForSegment={})", - tweetID, maxTweetID, maxTweetID - tweetID, currentSegmentTimesliceBoundary, - largestValidTweetIDForCurrentSegment); - } - } - - - private final SegmentManager segmentManager; - private final MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - private final int maxSegmentSize; - private final int lateTweetBuffer; - - private long maxTweetID = Long.MIN_VALUE; - - private long largestValidTweetIDForCurrentSegment; - private long currentSegmentTimesliceBoundary; - private OptimizingSegmentWriter currentSegment; - private OptimizingSegmentWriter previousSegment; - private final QueryCacheManager queryCacheManager; - private final CriticalExceptionHandler criticalExceptionHandler; - private final SearchIndexingMetricSet searchIndexingMetricSet; - private final CoordinatedEarlybirdActionInterface postOptimizationRebuildsAction; - private final CoordinatedEarlybirdActionInterface gcAction; - private final CaughtUpMonitor indexCaughtUpMonitor; - private final OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock; - - public TweetCreateHandler( - SegmentManager segmentManager, - SearchIndexingMetricSet searchIndexingMetricSet, - CriticalExceptionHandler criticalExceptionHandler, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - QueryCacheManager queryCacheManager, - CoordinatedEarlybirdActionInterface postOptimizationRebuildsAction, - CoordinatedEarlybirdActionInterface gcAction, - int lateTweetBuffer, - int maxSegmentSize, - CaughtUpMonitor indexCaughtUpMonitor, - OptimizationAndFlushingCoordinationLock optimizationAndFlushingCoordinationLock - ) { - this.segmentManager = segmentManager; - this.criticalExceptionHandler = criticalExceptionHandler; - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.queryCacheManager = queryCacheManager; - this.indexingResultCounts = new IndexingResultCounts(); - this.searchIndexingMetricSet = searchIndexingMetricSet; - this.postOptimizationRebuildsAction = postOptimizationRebuildsAction; - this.gcAction = gcAction; - this.indexCaughtUpMonitor = indexCaughtUpMonitor; - - Preconditions.checkState(lateTweetBuffer < maxSegmentSize); - this.lateTweetBuffer = lateTweetBuffer; - this.maxSegmentSize = maxSegmentSize; - this.optimizationAndFlushingCoordinationLock = optimizationAndFlushingCoordinationLock; - } - - void prepareAfterStartingWithIndex(long maxIndexedTweetId) { - LOG.info("Preparing after starting with an index."); - - Iterator segmentInfosIterator = - segmentManager - .getSegmentInfos(SegmentManager.Filter.All, SegmentManager.Order.NEW_TO_OLD) - .iterator(); - - // Setup the last segment. 
- Verify.verify(segmentInfosIterator.hasNext(), "at least one segment expected"); - ISegmentWriter lastWriter = segmentManager.getSegmentWriterForID( - segmentInfosIterator.next().getTimeSliceID()); - Verify.verify(lastWriter != null); - - LOG.info("TweetCreateHandler found last writer: {}", lastWriter.getSegmentInfo().toString()); - this.currentSegmentTimesliceBoundary = lastWriter.getSegmentInfo().getTimeSliceID(); - this.largestValidTweetIDForCurrentSegment = - OutOfOrderRealtimeTweetIDMapper.calculateMaxTweetID(currentSegmentTimesliceBoundary); - this.currentSegment = (OptimizingSegmentWriter) lastWriter; - - if (maxIndexedTweetId == -1) { - maxTweetID = lastWriter.getSegmentInfo().getIndexSegment().getMaxTweetId(); - LOG.info("Max tweet id = {}", maxTweetID); - } else { - // See SEARCH-31032 - maxTweetID = maxIndexedTweetId; - } - - // If we have a previous segment that's not optimized, set it up too, we still need to pick - // it up for optimization and we might still be able to add tweets to it. - if (segmentInfosIterator.hasNext()) { - SegmentInfo previousSegmentInfo = segmentInfosIterator.next(); - if (!previousSegmentInfo.isOptimized()) { - ISegmentWriter previousSegmentWriter = segmentManager.getSegmentWriterForID( - previousSegmentInfo.getTimeSliceID()); - - if (previousSegmentWriter != null) { - LOG.info("Picked previous segment"); - this.previousSegment = (OptimizingSegmentWriter) previousSegmentWriter; - } else { - // Should not happen. - LOG.error("Not found previous segment writer"); - } - } else { - LOG.info("Previous segment info is optimized"); - } - } else { - LOG.info("Previous segment info not found, we only have one segment"); - } - } - - private void updateIndexFreshness() { - searchIndexingMetricSet.highestStatusId.set(maxTweetID); - - long tweetTimestamp = SnowflakeIdParser.getTimestampFromTweetId( - searchIndexingMetricSet.highestStatusId.get()); - searchIndexingMetricSet.freshestTweetTimeMillis.set(tweetTimestamp); - } - - /** - * Index a new TVE representing a Tweet create event. - */ - public void handleTweetCreate(ThriftVersionedEvents tve) throws IOException { - INCOMING_TWEETS.increment(); - long id = tve.getId(); - maxTweetID = Math.max(id, maxTweetID); - - updateIndexFreshness(); - - boolean shouldCreateNewSegment = false; - - if (currentSegment == null) { - shouldCreateNewSegment = true; - LOG.info("Will create new segment: current segment is null"); - } else { - int numDocs = currentSegment.getSegmentInfo().getIndexSegment().getNumDocs(); - int numDocsCutoff = maxSegmentSize - lateTweetBuffer; - if (numDocs >= numDocsCutoff) { - NEW_SEGMENT_STATS.recordStartAfterReachingTweetsLimit(numDocs, numDocsCutoff, - maxSegmentSize, lateTweetBuffer); - shouldCreateNewSegment = true; - } else if (id > largestValidTweetIDForCurrentSegment) { - NEW_SEGMENT_STATS.recordStartAfterExceedingLargestValidTweetId(id, - largestValidTweetIDForCurrentSegment); - shouldCreateNewSegment = true; - } - } - - if (shouldCreateNewSegment) { - createNewSegment(id); - } - - if (previousSegment != null) { - // Inserts and some updates can't be applied to an optimized segment, so we want to wait at - // least LATE_TWEET_TIME_BUFFER between when we created the new segment and when we optimize - // the previous segment, in case there are late tweets. 
- // We leave a large (150k, typically) buffer in the segment so that we don't have to close - // the previousSegment before LATE_TWEET_TIME_BUFFER has passed, but if we index - // lateTweetBuffer Tweets before optimizing, then we must optimize, - // so that we don't insert more than max segment size tweets into the previous segment. - long relativeTweetAgeMs = - SnowflakeIdParser.getTimeDifferenceBetweenTweetIDs(id, currentSegmentTimesliceBoundary); - - boolean needToOptimize = false; - int numDocs = previousSegment.getSegmentInfo().getIndexSegment().getNumDocs(); - String previousSegmentName = previousSegment.getSegmentInfo().getSegmentName(); - if (numDocs >= maxSegmentSize) { - LOG.info(String.format("Previous segment (%s) reached maxSegmentSize, need to optimize it." - + " numDocs=%,d, maxSegmentSize=%,d", previousSegmentName, numDocs, maxSegmentSize)); - needToOptimize = true; - } else if (relativeTweetAgeMs > LATE_TWEET_TIME_BUFFER_MS) { - LOG.info(String.format("Previous segment (%s) is old enough, we can optimize it." - + " Got tweet past time buffer of %,d ms by: %,d ms", previousSegmentName, - LATE_TWEET_TIME_BUFFER_MS, relativeTweetAgeMs - LATE_TWEET_TIME_BUFFER_MS)); - needToOptimize = true; - } - - if (needToOptimize) { - optimizePreviousSegment(); - } - } - - ISegmentWriter segmentWriter; - if (id >= currentSegmentTimesliceBoundary) { - INSERTED_IN_CURRENT_SEGMENT.increment(); - segmentWriter = currentSegment; - } else if (previousSegment != null) { - INSERTED_IN_PREVIOUS_SEGMENT.increment(); - segmentWriter = previousSegment; - } else { - TWEETS_IN_WRONG_SEGMENT.increment(); - LOG.info("Inserting TVE ({}) into the current segment ({}) even though it should have gone " - + "in a previous segment.", id, currentSegmentTimesliceBoundary); - segmentWriter = currentSegment; - } - - SearchTimer timer = searchIndexingMetricSet.statusStats.startNewTimer(); - ISegmentWriter.Result result = segmentWriter.indexThriftVersionedEvents(tve); - searchIndexingMetricSet.statusStats.stopTimerAndIncrement(timer); - - if (result == ISegmentWriter.Result.SUCCESS) { - INDEXING_SUCCESS.increment(); - } else { - INDEXING_FAILURE.increment(); - } - - indexingResultCounts.countResult(result); - } - - /** - * Many tests need to verify behavior with segments optimized & unoptimized, so we need to expose - * this. - */ - @VisibleForTesting - public Future optimizePreviousSegment() { - String segmentName = previousSegment.getSegmentInfo().getSegmentName(); - previousSegment.getSegmentInfo().setIndexing(false); - LOG.info("Optimizing previous segment: {}", segmentName); - segmentManager.logState("Starting optimization for segment: " + segmentName); - - Future future = previousSegment - .startOptimization(gcAction, optimizationAndFlushingCoordinationLock) - .map(this::postOptimizationSteps) - .onFailure(t -> { - criticalExceptionHandler.handle(this, t); - return BoxedUnit.UNIT; - }); - - waitForOptimizationIfInTest(future); - - previousSegment = null; - return future; - } - - /** - * In tests, it's easier if when a segment starts optimizing, we know that it will finish - * optimizing. This way we have no race condition where we're surprised that something that - * started optimizing is not ready. - * - * In prod we don't have this problem. Segments run for 10 hours and optimization is 20 minutes - * so there's no need for extra synchronization. 
- */ - private void waitForOptimizationIfInTest(Future future) { - if (Config.environmentIsTest()) { - try { - Await.ready(future); - LOG.info("Optimizing is done"); - } catch (InterruptedException | TimeoutException ex) { - LOG.info("Exception while optimizing", ex); - } - } - } - - private SegmentInfo postOptimizationSteps(SegmentInfo optimizedSegmentInfo) { - segmentManager.updateStats(); - // See SEARCH-32175 - optimizedSegmentInfo.setComplete(true); - - String segmentName = optimizedSegmentInfo.getSegmentName(); - LOG.info("Finished optimization for segment: " + segmentName); - segmentManager.logState( - "Finished optimization for segment: " + segmentName); - - /* - * Building the multi segment term dictionary causes GC pauses. The reason for this is because - * it's pretty big (possible ~15GB). When it's allocated, we have to copy a lot of data from - * survivor space to old gen. That causes several GC pauses. See SEARCH-33544 - * - * GC pauses are in general not fatal, but since all instances finish a segment at roughly the - * same time, they might happen at the same time and then it's a problem. - * - * Some possible solutions to this problem would be to build this dictionary in some data - * structures that are pre-allocated or to build only the part for the last segment, as - * everything else doesn't change. These solutions are a bit difficult to implement and this - * here is an easy workaround. - * - * Note that we might finish optimizing a segment and then it might take ~60+ minutes until it's - * a particular Earlybird's turn to run this code. The effect of this is going to be that we - * are not going to use the multi segment dictionary for the last two segments, one of which is - * still pretty small. That's not terrible, since right before optimization we're not using - * the dictionary for the last segment anyways, since it's still not optimized. - */ - try { - LOG.info("Acquire coordination lock before beginning post_optimization_rebuilds action."); - optimizationAndFlushingCoordinationLock.lock(); - LOG.info("Successfully acquired coordination lock for post_optimization_rebuilds action."); - postOptimizationRebuildsAction.retryActionUntilRan( - "post optimization rebuilds", () -> { - Stopwatch stopwatch = Stopwatch.createStarted(); - LOG.info("Starting to build multi term dictionary for {}", segmentName); - boolean result = multiSegmentTermDictionaryManager.buildDictionary(); - LOG.info("Done building multi term dictionary for {} in {}, result: {}", - segmentName, stopwatch, result); - queryCacheManager.rebuildQueryCachesAfterSegmentOptimization( - optimizedSegmentInfo); - - // This is a serial full GC and it defragments the memory so things can run smoothly - // until the next segment rolls. What we have observed is that if we don't do that - // later on some earlybirds can have promotion failures on an old gen that hasn't - // reached the initiating occupancy limit and these promotions failures can trigger a - // long (1.5 min) full GC. That usually happens because of fragmentation issues. - GCUtil.runGC(); - // Wait for indexing to catch up before rejoining the serverset. We only need to do - // this if the host has already finished startup. - if (EarlybirdStatus.hasStarted()) { - indexCaughtUpMonitor.resetAndWaitUntilCaughtUp(); - } - }); - } finally { - LOG.info("Finished post_optimization_rebuilds action. 
Releasing coordination lock."); - optimizationAndFlushingCoordinationLock.unlock(); - } - - return optimizedSegmentInfo; - } - - /** - * Many tests rely on precise segment boundaries, so we expose this to allow them to create a - * particular segment. - */ - @VisibleForTesting - public void createNewSegment(long tweetID) throws IOException { - NEW_SEGMENT_STATS.recordCreateNewSegment(); - - if (previousSegment != null) { - // We shouldn't have more than one unoptimized segment, so if we get to this point and the - // previousSegment has not been optimized and set to null, start optimizing it before - // creating the next one. Note that this is a weird case and would only happen if we get - // Tweets with drastically different IDs than we expect, or there is a large amount of time - // where no Tweets are created in this partition. - LOG.error("Creating new segment for Tweet {} when the previous segment {} was not sealed. " - + "Current segment: {}. Documents: {}. largestValidTweetIDForSegment: {}.", - tweetID, - previousSegment.getSegmentInfo().getTimeSliceID(), - currentSegment.getSegmentInfo().getTimeSliceID(), - currentSegment.getSegmentInfo().getIndexSegment().getNumDocs(), - largestValidTweetIDForCurrentSegment); - optimizePreviousSegment(); - SEGMENTS_CLOSED_EARLY.increment(); - } - - previousSegment = currentSegment; - - // We have two cases: - // - // Case 1: - // If the greatest Tweet ID we have seen is tweetID, then when we want to create a new segment - // with that ID, so the Tweet being processed goes into the new segment. - // - // Case 2: - // If the tweetID is bigger than the max tweetID, then this method is being called directly from - // tests, so we didn't update the maxTweetID, so we can create a new segment with the new - // Tweet ID. - // - // Case 3: - // If it's not the greatest Tweet ID we have seen, then we don't want to create a - // segment boundary that is lower than any Tweet IDs in the current segment, because then - // some tweets from the previous segment would be in the wrong segment, so create a segment - // that has a greater ID than any Tweets that we have seen. - // - // Example: - // - We have seen tweets 3, 10, 5, 6. - // - We now see tweet 7 and we decide it's time to create a new segment. - // - The new segment will start at tweet 11. It can't start at tweet 7, because - // tweet 10 will be in the wrong segment. - // - Tweet 7 that we just saw will end up in the previous segment. 
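    // Putting numbers to the example above: after seeing 3, 10, 5, 6 we have maxTweetID == 10.
    // Tweet 7 triggers the roll and 10 > 7, so the boundary becomes 10 + 1 = 11 and tweet 7 is
    // indexed into the previous segment. Had the triggering tweet been 12 instead, then
    // maxTweetID <= tweetID and the new segment would start exactly at 12.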
- if (maxTweetID <= tweetID) { - currentSegmentTimesliceBoundary = tweetID; - NEW_SEGMENT_STATS.recordSettingTimesliceToCurrentTweet(tweetID); - } else { - currentSegmentTimesliceBoundary = maxTweetID + 1; - NEW_SEGMENT_STATS.recordSettingTimesliceToMaxTweetId(tweetID, maxTweetID); - } - currentSegment = segmentManager.createAndPutOptimizingSegmentWriter( - currentSegmentTimesliceBoundary); - - currentSegment.getSegmentInfo().setIndexing(true); - - largestValidTweetIDForCurrentSegment = - OutOfOrderRealtimeTweetIDMapper.calculateMaxTweetID(currentSegmentTimesliceBoundary); - - NEW_SEGMENT_STATS.wrapNewSegmentCreation(tweetID, maxTweetID, - currentSegmentTimesliceBoundary, largestValidTweetIDForCurrentSegment); - - segmentManager.removeExcessSegments(); - } - - void logState() { - LOG.info("TweetCreateHandler:"); - LOG.info(String.format(" tweets sent for indexing: %,d", - indexingResultCounts.getIndexingCalls())); - LOG.info(String.format(" non-retriable failure: %,d", - indexingResultCounts.getFailureNotRetriable())); - LOG.info(String.format(" retriable failure: %,d", - indexingResultCounts.getFailureRetriable())); - LOG.info(String.format(" successfully indexed: %,d", - indexingResultCounts.getIndexingSuccess())); - LOG.info(String.format(" tweets in wrong segment: %,d", TWEETS_IN_WRONG_SEGMENT.getCount())); - LOG.info(String.format(" segments closed early: %,d", SEGMENTS_CLOSED_EARLY.getCount())); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/TweetUpdateHandler.java b/src/java/com/twitter/search/earlybird/partition/TweetUpdateHandler.java deleted file mode 100644 index c4fd7e25c..000000000 --- a/src/java/com/twitter/search/earlybird/partition/TweetUpdateHandler.java +++ /dev/null @@ -1,175 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.SortedMap; -import java.util.TreeMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; - -/** - * This class handles incoming updates to Tweets in the index. - * - * Much of the logic deals with retries. It is very common to get an update before we have gotten - * the Tweet that the update should be applied to. In this case, we queue the update for up to a - * minute, so that we give the original Tweet the chance to be written to the index. 
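 *
 * The one-minute window is measured against the snowflake timestamp of the most recent update
 * seen, not the wall clock, so the rule stays fair even when replaying an old stream. A sketch
 * of the queue-or-drop decision (the same rule queueForRetry() applies below):
 *
 *   long ageMillis = mostRecentUpdateTime - SnowflakeIdParser.getTimestampFromTweetId(id);
 *   if (ageMillis > RETRY_TIME_THRESHOLD_MS) {
 *     // more than a minute of stream time has passed: give up and log the dropped update
 *   } else {
 *     pendingUpdates.computeIfAbsent(id, i -> new ArrayList<>()).add(tve);
 *   }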
- */ -public class TweetUpdateHandler { - private static final Logger LOG = LoggerFactory.getLogger(TweetUpdateHandler.class); - private static final Logger UPDATES_ERRORS_LOG = - LoggerFactory.getLogger(TweetUpdateHandler.class.getName() + ".UpdatesErrors"); - - private static final String STATS_PREFIX = "tweet_update_handler_"; - - private IndexingResultCounts indexingResultCounts; - private static final SearchRateCounter INCOMING_EVENT = - SearchRateCounter.export(STATS_PREFIX + "incoming_event"); - private static final SearchRateCounter QUEUED_FOR_RETRY = - SearchRateCounter.export(STATS_PREFIX + "queued_for_retry"); - private static final SearchRateCounter DROPPED_OLD_EVENT = - SearchRateCounter.export(STATS_PREFIX + "dropped_old_event"); - private static final SearchRateCounter DROPPED_INCOMING_EVENT = - SearchRateCounter.export(STATS_PREFIX + "dropped_incoming_event"); - private static final SearchRateCounter DROPPED_CLEANUP_EVENT = - SearchRateCounter.export(STATS_PREFIX + "dropped_cleanup_event"); - private static final SearchRateCounter DROPPED_NOT_RETRYABLE_EVENT = - SearchRateCounter.export(STATS_PREFIX + "dropped_not_retryable_event"); - private static final SearchRateCounter PICKED_TO_RETRY = - SearchRateCounter.export(STATS_PREFIX + "picked_to_retry"); - private static final SearchRateCounter INDEXED_EVENT = - SearchRateCounter.export(STATS_PREFIX + "indexed_event"); - - private static final long RETRY_TIME_THRESHOLD_MS = 60_000; // one minute. - - private final SortedMap> pendingUpdates = new TreeMap<>(); - private final SegmentManager segmentManager; - - /** - * At this time we cleaned all updates that are more than RETRY_TIME_THRESHOLD_MS old. - */ - private long lastCleanedUpdatesTime = 0; - - /** - * The time of the most recent Tweet that we have applied an update for. We use this to - * determine when we should give up on retrying an update, instead of using the system clock, - * because we may be processing the stream from a long time ago if we are starting up or if - * there is lag in the Kafka topics and we want to let each update get a fair shot at being - * applied. - */ - private long mostRecentUpdateTime = 0; - - public TweetUpdateHandler(SegmentManager segmentManager) { - this.segmentManager = segmentManager; - this.indexingResultCounts = new IndexingResultCounts(); - } - - /** - * Index an update to a Tweet. - */ - public void handleTweetUpdate(ThriftVersionedEvents tve, boolean isRetry) throws IOException { - if (!isRetry) { - INCOMING_EVENT.increment(); - } - long id = tve.getId(); - - mostRecentUpdateTime = - Math.max(SnowflakeIdParser.getTimestampFromTweetId(id), mostRecentUpdateTime); - cleanStaleUpdates(); - - ISegmentWriter writer = segmentManager.getSegmentWriterForID(id); - if (writer == null) { - if (segmentManager.getNumIndexedDocuments() == 0) { - // If we haven't indexed any tweets at all, then we shouldn't drop this update, because it - // might be applied to a Tweet we haven't indexed yet so queue it up for retry. - queueForRetry(id, tve); - } else { - DROPPED_OLD_EVENT.increment(); - } - return; - } - - SegmentWriter.Result result = writer.indexThriftVersionedEvents(tve); - indexingResultCounts.countResult(result); - - if (result == ISegmentWriter.Result.FAILURE_RETRYABLE) { - // If the tweet hasn't arrived yet. 
- queueForRetry(id, tve); - } else if (result == ISegmentWriter.Result.FAILURE_NOT_RETRYABLE) { - DROPPED_NOT_RETRYABLE_EVENT.increment(); - UPDATES_ERRORS_LOG.warn("Failed to apply update for tweetID {}: {}", id, tve); - } else if (result == ISegmentWriter.Result.SUCCESS) { - INDEXED_EVENT.increment(); - } - } - - private void queueForRetry(long id, ThriftVersionedEvents tve) { - long ageMillis = mostRecentUpdateTime - SnowflakeIdParser.getTimestampFromTweetId(id); - if (ageMillis > RETRY_TIME_THRESHOLD_MS) { - DROPPED_INCOMING_EVENT.increment(); - UPDATES_ERRORS_LOG.warn( - "Giving up retrying update for tweetID {}: {} because the retry time has elapsed", - id, tve); - return; - } - - pendingUpdates.computeIfAbsent(id, i -> new ArrayList<>()).add(tve); - QUEUED_FOR_RETRY.increment(); - } - - // Every time we have processed a minute's worth of updates, remove all pending updates that are - // more than a minute old, relative to the most recent Tweet we have seen. - private void cleanStaleUpdates() { - long oldUpdatesThreshold = mostRecentUpdateTime - RETRY_TIME_THRESHOLD_MS; - if (lastCleanedUpdatesTime < oldUpdatesThreshold) { - SortedMap> droppedUpdates = pendingUpdates - .headMap(SnowflakeIdParser.generateValidStatusId(oldUpdatesThreshold, 0)); - for (List events : droppedUpdates.values()) { - for (ThriftVersionedEvents event : events) { - UPDATES_ERRORS_LOG.warn( - "Giving up retrying update for tweetID {}: {} because the retry time has elapsed", - event.getId(), event); - } - DROPPED_CLEANUP_EVENT.increment(events.size()); - } - droppedUpdates.clear(); - - lastCleanedUpdatesTime = mostRecentUpdateTime; - } - } - - /** - * After we successfully indexed tweetID, if we have any pending updates for that tweetID, try to - * apply them again. - */ - public void retryPendingUpdates(long tweetID) throws IOException { - if (pendingUpdates.containsKey(tweetID)) { - for (ThriftVersionedEvents update : pendingUpdates.remove(tweetID)) { - PICKED_TO_RETRY.increment(); - handleTweetUpdate(update, true); - } - } - } - - void logState() { - LOG.info("TweetUpdateHandler:"); - LOG.info(String.format(" tweets sent for indexing: %,d", - indexingResultCounts.getIndexingCalls())); - LOG.info(String.format(" non-retriable failure: %,d", - indexingResultCounts.getFailureNotRetriable())); - LOG.info(String.format(" retriable failure: %,d", - indexingResultCounts.getFailureRetriable())); - LOG.info(String.format(" successfully indexed: %,d", - indexingResultCounts.getIndexingSuccess())); - LOG.info(String.format(" queued for retry: %,d", QUEUED_FOR_RETRY.getCount())); - LOG.info(String.format(" dropped old events: %,d", DROPPED_OLD_EVENT.getCount())); - LOG.info(String.format(" dropped incoming events: %,d", DROPPED_INCOMING_EVENT.getCount())); - LOG.info(String.format(" dropped cleanup events: %,d", DROPPED_CLEANUP_EVENT.getCount())); - LOG.info(String.format(" picked events to retry: %,d", PICKED_TO_RETRY.getCount())); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/UserPartitionUtil.java b/src/java/com/twitter/search/earlybird/partition/UserPartitionUtil.java deleted file mode 100644 index c78d822ab..000000000 --- a/src/java/com/twitter/search/earlybird/partition/UserPartitionUtil.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.google.common.base.Predicate; - -import com.twitter.search.common.util.hash.EarlybirdPartitioningFunction; -import com.twitter.search.common.util.hash.GeneralEarlybirdPartitioningFunction; - -public final class 
UserPartitionUtil { - private UserPartitionUtil() { - } - - /** - * Filter out the users that are not present in this partition. - */ - public static Predicate filterUsersByPartitionPredicate(final PartitionConfig config) { - return new Predicate() { - - private final int partitionID = config.getIndexingHashPartitionID(); - private final int numPartitions = config.getNumPartitions(); - private final EarlybirdPartitioningFunction partitioner = - new GeneralEarlybirdPartitioningFunction(); - - @Override - public boolean apply(Long userId) { - // See SEARCH-6675 - // Right now if the partitioning logic changes in ArchivePartitioning this logic - // needs to be updated too. - return partitioner.getPartition(userId, numPartitions) == partitionID; - } - }; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/UserScrubGeoEventStreamIndexer.java b/src/java/com/twitter/search/earlybird/partition/UserScrubGeoEventStreamIndexer.java deleted file mode 100644 index 7cd3a28b9..000000000 --- a/src/java/com/twitter/search/earlybird/partition/UserScrubGeoEventStreamIndexer.java +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; -import com.twitter.search.common.util.io.kafka.ThriftDeserializer; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.exception.MissingKafkaTopicException; -import com.twitter.tweetypie.thriftjava.TweetEvent; -import com.twitter.tweetypie.thriftjava.UserScrubGeoEvent; - -public class UserScrubGeoEventStreamIndexer extends SimpleStreamIndexer { - private static final Logger LOG = LoggerFactory.getLogger(UserScrubGeoEventStreamIndexer.class); - - protected static String kafkaClientId = "earlybird_user_scrub_geo_kafka_consumer"; - private static final SearchCounter NUM_MISSING_DATA_ERRORS = - SearchCounter.export("num_user_scrub_geo_event_kafka_consumer_num_missing_data_errors"); - - private final SegmentManager segmentManager; - private final SearchIndexingMetricSet searchIndexingMetricSet; - - public UserScrubGeoEventStreamIndexer(KafkaConsumer kafkaConsumer, - String topic, - SearchIndexingMetricSet searchIndexingMetricSet, - SegmentManager segmentManager) - throws MissingKafkaTopicException { - super(kafkaConsumer, topic); - - this.segmentManager = segmentManager; - this.searchIndexingMetricSet = searchIndexingMetricSet; - - indexingSuccesses = SearchRateCounter.export("user_scrub_geo_indexing_successes"); - indexingFailures = SearchRateCounter.export("user_scrub_geo_indexing_failures"); - } - - /** - * Provides UserScrubGeoEvent Kafka Consumer to EarlybirdWireModule. 
- * @return - */ - public static KafkaConsumer provideKafkaConsumer() { - return FinagleKafkaClientUtils.newKafkaConsumerForAssigning( - EarlybirdProperty.TWEET_EVENTS_KAFKA_PATH.get(), - new ThriftDeserializer<>(TweetEvent.class), - kafkaClientId, - MAX_POLL_RECORDS); - } - - @VisibleForTesting - protected void validateAndIndexRecord(ConsumerRecord record) { - TweetEvent event = record.value(); - UserScrubGeoEvent geoEvent; - try { - geoEvent = event.getData().getUser_scrub_geo_event(); - } catch (Exception e) { - LOG.warn("TweetEventData is null for TweetEvent: " + event.toString()); - indexingFailures.increment(); - return; - } - - if (geoEvent == null) { - LOG.warn("UserScrubGeoEvent is null"); - indexingFailures.increment(); - - } else if (!geoEvent.isSetMax_tweet_id() || !geoEvent.isSetUser_id()) { - // We should not consume an event that does not contain both a maxTweetId & userId since we - // we won't have enough data to properly store them in the map. We should, however, keep - // track of these cases since we don't want to miss out on users who have scrubbed their - // geo data from their tweets when applying the UserScrubGeoFilter. - LOG.warn("UserScrubGeoEvent is missing fields: " + geoEvent.toString()); - indexingFailures.increment(); - NUM_MISSING_DATA_ERRORS.increment(); - - } else { - SearchTimer timer = searchIndexingMetricSet.userScrubGeoIndexingStats.startNewTimer(); - segmentManager.indexUserScrubGeoEvent(geoEvent); - indexingSuccesses.increment(); - searchIndexingMetricSet.userScrubGeoIndexingStats.stopTimerAndIncrement(timer); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/UserUpdatesStreamIndexer.java b/src/java/com/twitter/search/earlybird/partition/UserUpdatesStreamIndexer.java deleted file mode 100644 index c264c4a82..000000000 --- a/src/java/com/twitter/search/earlybird/partition/UserUpdatesStreamIndexer.java +++ /dev/null @@ -1,89 +0,0 @@ -package com.twitter.search.earlybird.partition; - -import java.util.Date; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.AntisocialUserUpdate; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.util.io.kafka.CompactThriftDeserializer; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.common.userupdates.UserUpdate; -import com.twitter.search.earlybird.exception.MissingKafkaTopicException; - -public class UserUpdatesStreamIndexer extends SimpleStreamIndexer { - private static final Logger LOG = LoggerFactory.getLogger(UserUpdatesStreamIndexer.class); - - private static final SearchCounter NUM_CORRUPT_DATA_ERRORS = - SearchCounter.export("num_user_updates_kafka_consumer_corrupt_data_errors"); - protected static String kafkaClientId = ""; - - private final SegmentManager segmentManager; - private final SearchIndexingMetricSet searchIndexingMetricSet; - - public UserUpdatesStreamIndexer(KafkaConsumer kafkaConsumer, - String topic, - SearchIndexingMetricSet searchIndexingMetricSet, - SegmentManager segmentManager) - throws MissingKafkaTopicException { - super(kafkaConsumer, topic); - 
this.segmentManager = segmentManager; - this.searchIndexingMetricSet = searchIndexingMetricSet; - - indexingSuccesses = SearchRateCounter.export("user_update_indexing_successes"); - indexingFailures = SearchRateCounter.export("user_update_indexing_failures"); - } - - /** - * Provides user updates kafka consumer to EarlybirdWireModule. - * @return - */ - public static KafkaConsumer provideKafkaConsumer() { - return FinagleKafkaClientUtils.newKafkaConsumerForAssigning( - EarlybirdProperty.KAFKA_PATH.get(), - new CompactThriftDeserializer<>(AntisocialUserUpdate.class), - kafkaClientId, - MAX_POLL_RECORDS); - } - - UserUpdate convertToUserInfoUpdate(AntisocialUserUpdate update) { - return new UserUpdate( - update.getUserID(), - update.getType(), - update.isValue() ? 1 : 0, - new Date(update.getUpdatedAt())); - } - - @VisibleForTesting - protected void validateAndIndexRecord(ConsumerRecord record) { - AntisocialUserUpdate update = record.value(); - if (update == null) { - LOG.warn("null value returned from poll"); - return; - } - if (update.getType() == null) { - LOG.error("User update does not have type set: " + update); - NUM_CORRUPT_DATA_ERRORS.increment(); - return; - } - - SearchTimer timer = searchIndexingMetricSet.userUpdateIndexingStats.startNewTimer(); - boolean isUpdateIndexed = segmentManager.indexUserUpdate( - convertToUserInfoUpdate(update)); - searchIndexingMetricSet.userUpdateIndexingStats.stopTimerAndIncrement(timer); - - if (isUpdateIndexed) { - indexingSuccesses.increment(); - } else { - indexingFailures.increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/FreshStartupHandler.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/FreshStartupHandler.java deleted file mode 100644 index 4b54a56b4..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/FreshStartupHandler.java +++ /dev/null @@ -1,439 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -import java.io.IOException; -import java.time.Duration; -import java.util.ArrayList; -import java.util.HashSet; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import com.google.common.base.Stopwatch; -import com.google.common.base.Verify; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.consumer.OffsetAndTimestamp; -import org.apache.kafka.common.TopicPartition; -import org.apache.kafka.common.errors.ApiException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import static com.twitter.search.common.util.LogFormatUtil.formatInt; - -import com.twitter.search.common.util.GCUtil; -import com.twitter.common.util.Clock; -import com.twitter.search.common.util.LogFormatUtil; -import com.twitter.search.earlybird.common.NonPagingAssert; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.EarlybirdStartupException; -import com.twitter.search.earlybird.exception.WrappedKafkaApiException; -import com.twitter.search.earlybird.factory.EarlybirdKafkaConsumersFactory; -import 
com.twitter.search.earlybird.partition.EarlybirdIndex; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.util.ParallelUtil; - -/** - * Bootstraps an index by indexing tweets and updates in parallel. - * - * DEVELOPMENT - * =========== - * - * 1. In earlybird-search.yml, set the following values in the "production" section: - * - max_segment_size to 200000 - * - late_tweet_buffer to 10000 - * - * 2. In KafkaStartup, don't load the index, replace the .loadIndex call as instructed - * in the file. - * - * 3. In the aurora configs, set serving_timeslices to a low number (like 5) for staging. - */ -public class FreshStartupHandler { - private static final Logger LOG = LoggerFactory.getLogger(FreshStartupHandler.class); - private static final NonPagingAssert BUILDING_FEWER_THAN_SPECIFIED_SEGMENTS = - new NonPagingAssert("building_fewer_than_specified_segments"); - - private final Clock clock; - private final TopicPartition tweetTopic; - private final TopicPartition updateTopic; - private final SegmentManager segmentManager; - private final int maxSegmentSize; - private final int lateTweetBuffer; - private final EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory; - private final CriticalExceptionHandler criticalExceptionHandler; - - public FreshStartupHandler( - Clock clock, - EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory, - TopicPartition tweetTopic, - TopicPartition updateTopic, - SegmentManager segmentManager, - int maxSegmentSize, - int lateTweetBuffer, - CriticalExceptionHandler criticalExceptionHandler - ) { - this.clock = clock; - this.earlybirdKafkaConsumersFactory = earlybirdKafkaConsumersFactory; - this.tweetTopic = tweetTopic; - this.updateTopic = updateTopic; - this.segmentManager = segmentManager; - this.maxSegmentSize = maxSegmentSize; - this.criticalExceptionHandler = criticalExceptionHandler; - this.lateTweetBuffer = lateTweetBuffer; - } - - /** - * Don't index in parallel, just pass some time back that the EarlybirdKafkaConsumer - * can start indexing from. - */ - public EarlybirdIndex indexFromScratch() { - long indexTimePeriod = Duration.ofHours( - EarlybirdConfig.getInt("index_from_scratch_hours", 12) - ).toMillis(); - - return runIndexFromScratch(indexTimePeriod); - } - - public EarlybirdIndex fastIndexFromScratchForDevelopment() { - LOG.info("Running fast index from scratch..."); - return runIndexFromScratch(Duration.ofMinutes(10).toMillis()); - } - - private EarlybirdIndex runIndexFromScratch(long indexTimePeriodMs) { - KafkaConsumer consumerForFindingOffsets = - earlybirdKafkaConsumersFactory.createKafkaConsumer("consumer_for_offsets"); - - long timestamp = clock.nowMillis() - indexTimePeriodMs; - - Map offsets; - try { - offsets = consumerForFindingOffsets - .offsetsForTimes(ImmutableMap.of(tweetTopic, timestamp, updateTopic, timestamp)); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException(kafkaApiException); - } - - return new EarlybirdIndex( - Lists.newArrayList(), - offsets.get(tweetTopic).offset(), - offsets.get(updateTopic).offset()); - } - - - /** - * Index Tweets and updates from scratch, without relying on a serialized index in HDFS. - * - * This function indexes the segments in parallel, limiting the number of segments that - * are currently indexed, due to memory limitations. That's followed by another pass to index - * some updates - see the implementation for more details. 
- * - * The index this function outputs contains N segments, where the first N-1 are optimized and - * the last one is not. - */ - public EarlybirdIndex parallelIndexFromScratch() throws Exception { - Stopwatch parallelIndexStopwatch = Stopwatch.createStarted(); - - LOG.info("Starting parallel fresh startup."); - LOG.info("Max segment size: {}", maxSegmentSize); - LOG.info("Late tweet buffer size: {}", lateTweetBuffer); - - // Once we finish fresh startup and proceed to indexing from the streams, we'll immediately - // start a new segment, since the output of the fresh startup is full segments. - // - // That's why we index max_segments-1 segments here instead of indexing max_segments segments - // and discarding the first one later. - int numSegments = segmentManager.getMaxEnabledSegments() - 1; - LOG.info("Number of segments to build: {}", numSegments); - - // Find end offsets. - KafkaOffsetPair tweetsOffsetRange = findOffsetRangeForTweetsKafkaTopic(); - - ArrayList segmentBuildInfos = makeSegmentBuildInfos( - numSegments, tweetsOffsetRange); - - segmentManager.logState("Before starting fresh startup"); - - // Index tweets and events. - Stopwatch initialIndexStopwatch = Stopwatch.createStarted(); - - // We index at most `MAX_PARALLEL_INDEXED` (MPI) segments at the same time. If we need to - // produce 20 segments here, we'd need memory for MPI unoptimized and 20-MPI optimized segments. - // - // For back of envelope calculations you can assume optimized segments take ~6GB and unoptimized - // ones ~12GB. - final int MAX_PARALLEL_INDEXED = 8; - - List segmentInfos = ParallelUtil.parmap( - "fresh-startup", - MAX_PARALLEL_INDEXED, - segmentBuildInfo -> indexTweetsAndUpdatesForSegment(segmentBuildInfo, segmentBuildInfos), - segmentBuildInfos - ); - - LOG.info("Finished indexing tweets and updates in {}", initialIndexStopwatch); - - PostOptimizationUpdatesIndexer postOptimizationUpdatesIndexer = - new PostOptimizationUpdatesIndexer( - segmentBuildInfos, - earlybirdKafkaConsumersFactory, - updateTopic); - - postOptimizationUpdatesIndexer.indexRestOfUpdates(); - - // Finished indexing tweets and updates. - LOG.info("Segment build infos after we're done:"); - for (SegmentBuildInfo segmentBuildInfo : segmentBuildInfos) { - segmentBuildInfo.logState(); - } - - segmentManager.logState("After finishing fresh startup"); - - LOG.info("Collected {} segment infos", segmentInfos.size()); - LOG.info("Segment names:"); - for (SegmentInfo segmentInfo : segmentInfos) { - LOG.info(segmentInfo.getSegmentName()); - } - - SegmentBuildInfo lastSegmentBuildInfo = segmentBuildInfos.get(segmentBuildInfos.size() - 1); - long finishedUpdatesAtOffset = lastSegmentBuildInfo.getUpdateKafkaOffsetPair().getEndOffset(); - long maxIndexedTweetId = lastSegmentBuildInfo.getMaxIndexedTweetId(); - - LOG.info("Max indexed tweet id: {}", maxIndexedTweetId); - LOG.info("Parallel startup finished in {}", parallelIndexStopwatch); - - // verifyConstructedIndex(segmentBuildInfos); - // Run a GC to free up some memory after the fresh startup. 
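    // Back-of-envelope sizing for the parallel pass above, using the figures from the
    // MAX_PARALLEL_INDEXED comment (~12GB per unoptimized segment, ~6GB per optimized one):
    // building 20 segments with MAX_PARALLEL_INDEXED = 8 peaks at roughly
    // 8 * 12GB + 12 * 6GB = 168GB of heap, which is why the parallelism is capped.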
- GCUtil.runGC(); - logMemoryStats(); - - return new EarlybirdIndex( - segmentInfos, - tweetsOffsetRange.getEndOffset() + 1, - finishedUpdatesAtOffset + 1, - maxIndexedTweetId - ); - } - - private void logMemoryStats() { - double toGB = 1024 * 1024 * 1024; - double totalMemoryGB = Runtime.getRuntime().totalMemory() / toGB; - double freeMemoryGB = Runtime.getRuntime().freeMemory() / toGB; - LOG.info("Memory stats: Total memory GB: {}, Free memory GB: {}", - totalMemoryGB, freeMemoryGB); - } - - /** - * Prints statistics about the constructed index compared to all tweets in the - * tweets stream. - * - * Only run this for testing and debugging purposes, never in prod environment. - */ - private void verifyConstructedIndex(List segmentBuildInfos) - throws IOException { - LOG.info("Verifying constructed index..."); - // Read every tweet from the offset range that we're constructing an index for. - KafkaConsumer tweetsKafkaConsumer = - earlybirdKafkaConsumersFactory.createKafkaConsumer("tweets_verify"); - try { - tweetsKafkaConsumer.assign(ImmutableList.of(tweetTopic)); - tweetsKafkaConsumer.seek(tweetTopic, segmentBuildInfos.get(0).getTweetStartOffset()); - } catch (ApiException apiException) { - throw new WrappedKafkaApiException(apiException); - } - long finalTweetOffset = segmentBuildInfos.get(segmentBuildInfos.size() - 1).getTweetEndOffset(); - boolean done = false; - Set uniqueTweetIds = new HashSet<>(); - long readTweetsCount = 0; - do { - for (ConsumerRecord record - : tweetsKafkaConsumer.poll(Duration.ofSeconds(1))) { - if (record.offset() > finalTweetOffset) { - done = true; - break; - } - readTweetsCount++; - uniqueTweetIds.add(record.value().getId()); - } - } while (!done); - - LOG.info("Total amount of read tweets: {}", formatInt(readTweetsCount)); - // Might be less, due to duplicates. - LOG.info("Unique tweet ids : {}", LogFormatUtil.formatInt(uniqueTweetIds.size())); - - int notFoundInIndex = 0; - for (Long tweetId : uniqueTweetIds) { - boolean found = false; - for (SegmentBuildInfo segmentBuildInfo : segmentBuildInfos) { - if (segmentBuildInfo.getSegmentWriter().hasTweet(tweetId)) { - found = true; - break; - } - } - if (!found) { - notFoundInIndex++; - } - } - - LOG.info("Tweets not found in the index: {}", LogFormatUtil.formatInt(notFoundInIndex)); - - long totalIndexedTweets = 0; - for (SegmentBuildInfo segmentBuildInfo : segmentBuildInfos) { - SegmentInfo si = segmentBuildInfo.getSegmentWriter().getSegmentInfo(); - totalIndexedTweets += si.getIndexStats().getStatusCount(); - } - - LOG.info("Total indexed tweets: {}", formatInt(totalIndexedTweets)); - } - - /** - * Find the end offsets for the tweets Kafka topic this partition is reading - * from. 
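For readers less used to the consumer calls in the method below: beginningOffsets returns the first offset still retained in the partition, while endOffsets returns the offset one past the last record (the next offset that would be written), so the count of retained records is end minus beginning. A standalone sketch with placeholder broker and topic names:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    final class TopicRangeSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        TopicPartition tweets = new TopicPartition("tweet_events", 0); // hypothetical topic
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(
            props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
          Map<TopicPartition, Long> begin = consumer.beginningOffsets(List.of(tweets));
          Map<TopicPartition, Long> end = consumer.endOffsets(List.of(tweets));
          long first = begin.get(tweets); // first retained offset (inclusive)
          long next = end.get(tweets);    // next offset to be written (exclusive)
          System.out.printf("Retained records: %,d%n", next - first);
        }
      }
    }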
- */ - private KafkaOffsetPair findOffsetRangeForTweetsKafkaTopic() { - KafkaConsumer consumerForFindingOffsets = - earlybirdKafkaConsumersFactory.createKafkaConsumer("consumer_for_end_offsets"); - - Map endOffsets; - Map beginningOffsets; - - try { - endOffsets = consumerForFindingOffsets.endOffsets(ImmutableList.of(tweetTopic)); - beginningOffsets = consumerForFindingOffsets.beginningOffsets(ImmutableList.of(tweetTopic)); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException(kafkaApiException); - } finally { - consumerForFindingOffsets.close(); - } - - long tweetsBeginningOffset = beginningOffsets.get(tweetTopic); - long tweetsEndOffset = endOffsets.get(tweetTopic); - LOG.info(String.format("Tweets beginning offset: %,d", tweetsBeginningOffset)); - LOG.info(String.format("Tweets end offset: %,d", tweetsEndOffset)); - LOG.info(String.format("Total amount of records in the stream: %,d", - tweetsEndOffset - tweetsBeginningOffset + 1)); - - return new KafkaOffsetPair(tweetsBeginningOffset, tweetsEndOffset); - } - - /** - * For each segment, we know what offset it begins at. This function finds the tweet ids - * for these offsets. - */ - private void fillTweetIdsForSegmentStarts(List segmentBuildInfos) - throws EarlybirdStartupException { - KafkaConsumer consumerForTweetIds = - earlybirdKafkaConsumersFactory.createKafkaConsumer("consumer_for_tweet_ids", 1); - consumerForTweetIds.assign(ImmutableList.of(tweetTopic)); - - // Find first tweet ids for each segment. - for (SegmentBuildInfo buildInfo : segmentBuildInfos) { - long tweetOffset = buildInfo.getTweetStartOffset(); - ConsumerRecords records; - try { - consumerForTweetIds.seek(tweetTopic, tweetOffset); - records = consumerForTweetIds.poll(Duration.ofSeconds(1)); - } catch (ApiException kafkaApiException) { - throw new WrappedKafkaApiException(kafkaApiException); - } - - if (records.count() > 0) { - ConsumerRecord recordAtOffset = records.iterator().next(); - if (recordAtOffset.offset() != tweetOffset) { - LOG.error(String.format("We were looking for offset %,d. Found a record at offset %,d", - tweetOffset, recordAtOffset.offset())); - } - - buildInfo.setStartTweetId(recordAtOffset.value().getId()); - } else { - throw new EarlybirdStartupException("Didn't get any tweets back for an offset"); - } - } - - // Check that something weird didn't happen where we end up with segment ids - // which are in non-incresing order. - // Goes from oldest to newest. - for (int i = 1; i < segmentBuildInfos.size(); i++) { - long startTweetId = segmentBuildInfos.get(i).getStartTweetId(); - long prevStartTweetId = segmentBuildInfos.get(i - 1).getStartTweetId(); - Verify.verify(prevStartTweetId < startTweetId); - } - } - - /** - * Generate the offsets at which tweets begin and end for each segment that we want - * to create. 
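The offset arithmetic in the method below is easier to follow with concrete numbers. A tiny standalone sketch using the development values quoted in the class comment at the top of this file (max_segment_size 200000, late_tweet_buffer 10000); the stream end offset is made up:

    final class SegmentRangeSketch {
      public static void main(String[] args) {
        long streamEndOffset = 1_000_000L; // last offset in the tweet topic, treated as inclusive
        int maxSegmentSize = 200_000;      // development value from the class comment
        int lateTweetBuffer = 10_000;      // development value from the class comment
        int numSegments = 3;

        int segmentSize = maxSegmentSize - lateTweetBuffer; // 190,000 tweets per segment
        for (int rewind = numSegments; rewind >= 1; rewind--) {
          long start = (streamEndOffset + 1) - (long) rewind * segmentSize;
          long end = start + segmentSize - 1;
          System.out.printf("segment %d: offsets [%,d, %,d]%n", numSegments - rewind, start, end);
        }
        // The newest segment always ends exactly at streamEndOffset, which is what the
        // Verify.verify call below asserts.
      }
    }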
- */ - private ArrayList makeSegmentBuildInfos( - int numSegments, KafkaOffsetPair tweetsOffsets) throws EarlybirdStartupException { - ArrayList segmentBuildInfos = new ArrayList<>(); - - // If we have 3 segments, the starting tweet offsets are: - // end-3N, end-2N, end-N - int segmentSize = maxSegmentSize - lateTweetBuffer; - LOG.info("Segment size: {}", segmentSize); - - long tweetsInStream = tweetsOffsets.getEndOffset() - tweetsOffsets.getBeginOffset() + 1; - double numBuildableSegments = ((double) tweetsInStream) / segmentSize; - - LOG.info("Number of segments we can build: {}", numBuildableSegments); - - int numSegmentsToBuild = numSegments; - int numBuildableSegmentsInt = (int) numBuildableSegments; - - if (numBuildableSegmentsInt < numSegmentsToBuild) { - // This can happen if we get a low amount of tweets such that the ~10 days of tweets stored in - // Kafka are not enough to build the specified number of segments. - LOG.warn("Building {} segments instead of the specified {} segments because there are not " - + "enough tweets", numSegmentsToBuild, numSegments); - BUILDING_FEWER_THAN_SPECIFIED_SEGMENTS.assertFailed(); - numSegmentsToBuild = numBuildableSegmentsInt; - } - - for (int rewind = numSegmentsToBuild; rewind >= 1; rewind--) { - long tweetStartOffset = (tweetsOffsets.getEndOffset() + 1) - (rewind * segmentSize); - long tweetEndOffset = tweetStartOffset + segmentSize - 1; - - int index = segmentBuildInfos.size(); - - segmentBuildInfos.add(new SegmentBuildInfo( - tweetStartOffset, - tweetEndOffset, - index, - rewind == 1 - )); - } - - Verify.verify(segmentBuildInfos.get(segmentBuildInfos.size() - 1) - .getTweetEndOffset() == tweetsOffsets.getEndOffset()); - - LOG.info("Filling start tweet ids ..."); - fillTweetIdsForSegmentStarts(segmentBuildInfos); - - return segmentBuildInfos; - } - - private SegmentInfo indexTweetsAndUpdatesForSegment( - SegmentBuildInfo segmentBuildInfo, - ArrayList segmentBuildInfos) throws Exception { - - PreOptimizationSegmentIndexer preOptimizationSegmentIndexer = - new PreOptimizationSegmentIndexer( - segmentBuildInfo, - segmentBuildInfos, - this.segmentManager, - this.tweetTopic, - this.updateTopic, - this.earlybirdKafkaConsumersFactory, - this.lateTweetBuffer - ); - - return preOptimizationSegmentIndexer.runIndexing(); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/KafkaOffsetPair.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/KafkaOffsetPair.java deleted file mode 100644 index 9300bfb3e..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/KafkaOffsetPair.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -class KafkaOffsetPair { - private final long beginOffset; - private final long endOffset; - - public KafkaOffsetPair(long beginOffset, long endOffset) { - this.beginOffset = beginOffset; - this.endOffset = endOffset; - } - - public boolean includes(long offset) { - return beginOffset <= offset && offset <= endOffset; - } - - public long getBeginOffset() { - return beginOffset; - } - - public long getEndOffset() { - return endOffset; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/PostOptimizationUpdatesIndexer.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/PostOptimizationUpdatesIndexer.java deleted file mode 100644 index 93e6c9362..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/PostOptimizationUpdatesIndexer.java +++ /dev/null @@ -1,169 
+0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -import java.io.IOException; -import java.time.Duration; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.Map; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Stopwatch; -import com.google.common.collect.ImmutableList; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.common.TopicPartition; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.factory.EarlybirdKafkaConsumersFactory; -import com.twitter.search.earlybird.partition.IndexingResultCounts; - -/** - * Indexes updates for all segments after they have been optimized. Some of the updates have been - * indexed before in the PreOptimizationSegmentIndexer, but the rest are indexed here. - */ -class PostOptimizationUpdatesIndexer { - private static final Logger LOG = LoggerFactory.getLogger(PostOptimizationUpdatesIndexer.class); - - private static final String STAT_PREFIX = "post_optimization_"; - private static final String READ_STAT_PREFIX = STAT_PREFIX + "read_updates_for_segment_"; - private static final String APPLIED_STAT_PREFIX = STAT_PREFIX + "applied_updates_for_segment_"; - - private final ArrayList segmentBuildInfos; - private final EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory; - private final TopicPartition updateTopic; - - PostOptimizationUpdatesIndexer( - ArrayList segmentBuildInfos, - EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory, - TopicPartition updateTopic) { - this.segmentBuildInfos = segmentBuildInfos; - this.earlybirdKafkaConsumersFactory = earlybirdKafkaConsumersFactory; - this.updateTopic = updateTopic; - } - - void indexRestOfUpdates() throws IOException { - LOG.info("Indexing rest of updates."); - - long updatesStartOffset = segmentBuildInfos.get(0) - .getUpdateKafkaOffsetPair().getBeginOffset(); - long updatesEndOffset = segmentBuildInfos.get(segmentBuildInfos.size() - 1) - .getUpdateKafkaOffsetPair().getEndOffset(); - - LOG.info(String.format("Total updates to go through: %,d", - updatesEndOffset - updatesStartOffset + 1)); - - KafkaConsumer kafkaConsumer = - earlybirdKafkaConsumersFactory.createKafkaConsumer("index_rest_of_updates"); - kafkaConsumer.assign(ImmutableList.of(updateTopic)); - kafkaConsumer.seek(updateTopic, updatesStartOffset); - - long readEvents = 0; - long foundSegment = 0; - long applied = 0; - - Map perSegmentReadUpdates = new HashMap<>(); - Map perSegmentAppliedUpdates = new HashMap<>(); - Map perSegmentIndexingResultCounts = new HashMap<>(); - - for (int i = 0; i < segmentBuildInfos.size(); i++) { - perSegmentReadUpdates.put(i, SearchRateCounter.export(READ_STAT_PREFIX + i)); - perSegmentAppliedUpdates.put(i, SearchRateCounter.export(APPLIED_STAT_PREFIX + i)); - perSegmentIndexingResultCounts.put(i, new IndexingResultCounts()); - } - - SearchTimerStats pollStats = SearchTimerStats.export( - "final_pass_polls", TimeUnit.NANOSECONDS, false); - SearchTimerStats indexStats = SearchTimerStats.export( - "final_pass_index", TimeUnit.NANOSECONDS, false); - - Stopwatch totalTime = Stopwatch.createStarted(); 
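The do/while loop that follows routes each polled update by tweet id: scan the segments from newest to oldest and take the first whose start tweet id is not larger than the update's tweet id, otherwise the update targets a tweet older than the whole index and is skipped. A compact sketch of just that routing rule, with the segment build infos reduced to a list of start ids (names are illustrative):

    import java.util.List;

    final class UpdateRoutingSketch {
      // Returns the index of the segment an update for tweetId belongs to, or -1 when the
      // tweet predates every segment start (i.e. the tweet is not in this index).
      static int segmentFor(long tweetId, List<Long> segmentStartTweetIds) {
        for (int i = segmentStartTweetIds.size() - 1; i >= 0; i--) {
          if (segmentStartTweetIds.get(i) <= tweetId) {
            return i;
          }
        }
        return -1;
      }

      public static void main(String[] args) {
        List<Long> starts = List.of(100L, 200L, 300L); // oldest to newest segment
        System.out.println(segmentFor(250L, starts));  // 1: middle segment
        System.out.println(segmentFor(50L, starts));   // -1: older than the whole index
      }
    }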
- - boolean done = false; - do { - // Poll events. - SearchTimer pt = pollStats.startNewTimer(); - ConsumerRecords records = - kafkaConsumer.poll(Duration.ofSeconds(1)); - pollStats.stopTimerAndIncrement(pt); - - // Index events. - SearchTimer it = indexStats.startNewTimer(); - for (ConsumerRecord record : records) { - if (record.offset() >= updatesEndOffset) { - done = true; - } - - readEvents++; - - ThriftVersionedEvents tve = record.value(); - long tweetId = tve.getId(); - - // Find segment to apply to. If we can't find a segment, this is an - // update for an old tweet that's not in the index. - int segmentIndex = -1; - for (int i = segmentBuildInfos.size() - 1; i >= 0; i--) { - if (segmentBuildInfos.get(i).getStartTweetId() <= tweetId) { - segmentIndex = i; - foundSegment++; - break; - } - } - - if (segmentIndex != -1) { - SegmentBuildInfo segmentBuildInfo = segmentBuildInfos.get(segmentIndex); - - perSegmentReadUpdates.get(segmentIndex).increment(); - - // Not already applied? - if (!segmentBuildInfo.getUpdateKafkaOffsetPair().includes(record.offset())) { - applied++; - - // Index the update. - // - // IMPORTANT: Note that there you'll see about 2-3% of updates that - // fail as "retryable". This type of failure happens when the update is - // for a tweet that's not found in the index. We found out that we are - // receiving some updates for protected tweets and these are not in the - // realtime index - they are the source of this error. - perSegmentIndexingResultCounts.get(segmentIndex).countResult( - segmentBuildInfo.getSegmentWriter().indexThriftVersionedEvents(tve) - ); - - perSegmentAppliedUpdates.get(segmentIndex).increment(); - } - } - if (record.offset() >= updatesEndOffset) { - break; - } - } - indexStats.stopTimerAndIncrement(it); - - } while (!done); - - LOG.info(String.format("Done in: %s, read %,d events, found segment for %,d, applied %,d", - totalTime, readEvents, foundSegment, applied)); - - LOG.info("Indexing time: {}", indexStats.getElapsedTimeAsString()); - LOG.info("Polling time: {}", pollStats.getElapsedTimeAsString()); - - LOG.info("Per segment indexing result counts:"); - for (int i = 0; i < segmentBuildInfos.size(); i++) { - LOG.info("{} : {}", i, perSegmentIndexingResultCounts.get(i)); - } - - LOG.info("Found and applied per segment:"); - for (int i = 0; i < segmentBuildInfos.size(); i++) { - LOG.info("{}: found: {}, applied: {}", - i, - perSegmentReadUpdates.get(i).getCount(), - perSegmentAppliedUpdates.get(i).getCount()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/PreOptimizationSegmentIndexer.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/PreOptimizationSegmentIndexer.java deleted file mode 100644 index b7e896248..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/PreOptimizationSegmentIndexer.java +++ /dev/null @@ -1,459 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -import java.io.IOException; -import java.time.Duration; -import java.util.ArrayList; -import java.util.Optional; - -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.consumer.OffsetAndTimestamp; -import org.apache.kafka.common.TopicPartition; 
-import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.earlybird.factory.EarlybirdKafkaConsumersFactory; -import com.twitter.search.earlybird.partition.IndexingResultCounts; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentWriter; - -/** - * Responsible for indexing the tweets and updates that need to be applied to a single segment - * before it gets optimized and then optimizing the segment (except if it's the last one). - * - * After that, no more tweets are added to the segment and the rest of the updates are added - * in PostOptimizationUpdatesIndexer. - */ -class PreOptimizationSegmentIndexer { - private static final Logger LOG = LoggerFactory.getLogger(PreOptimizationSegmentIndexer.class); - - private SegmentBuildInfo segmentBuildInfo; - private final ArrayList segmentBuildInfos; - private SegmentManager segmentManager; - private final TopicPartition tweetTopic; - private final TopicPartition updateTopic; - private final EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory; - private final long lateTweetBuffer; - - public PreOptimizationSegmentIndexer( - SegmentBuildInfo segmentBuildInfo, - ArrayList segmentBuildInfos, - SegmentManager segmentManager, - TopicPartition tweetTopic, - TopicPartition updateTopic, - EarlybirdKafkaConsumersFactory earlybirdKafkaConsumersFactory, - long lateTweetBuffer) { - this.segmentBuildInfo = segmentBuildInfo; - this.segmentBuildInfos = segmentBuildInfos; - this.segmentManager = segmentManager; - this.tweetTopic = tweetTopic; - this.updateTopic = updateTopic; - this.earlybirdKafkaConsumersFactory = earlybirdKafkaConsumersFactory; - this.lateTweetBuffer = lateTweetBuffer; - } - - SegmentInfo runIndexing() throws IOException { - LOG.info(String.format("Starting segment building for segment %d. " - + "Tweet offset range [ %,d, %,d ]", - segmentBuildInfo.getIndex(), - segmentBuildInfo.getTweetStartOffset(), - segmentBuildInfo.getTweetEndOffset())); - - Optional firstTweetIdInNextSegment = Optional.empty(); - int index = segmentBuildInfo.getIndex(); - if (index + 1 < segmentBuildInfos.size()) { - firstTweetIdInNextSegment = Optional.of( - segmentBuildInfos.get(index + 1).getStartTweetId()); - } - - // Index tweets. - SegmentTweetsIndexingResult tweetIndexingResult = indexSegmentTweetsFromStream( - tweetTopic, - String.format("tweet_consumer_for_segment_%d", segmentBuildInfo.getIndex()), - firstTweetIdInNextSegment - ); - - // Index updates. - KafkaOffsetPair updatesIndexingOffsets = findUpdateStreamOffsetRange(tweetIndexingResult); - - String updatesConsumerClientId = - String.format("update_consumer_for_segment_%d", segmentBuildInfo.getIndex()); - - LOG.info(String.format("Consumer: %s :: Tweets start time: %d, end time: %d ==> " - + "Updates start offset: %,d, end offset: %,d", - updatesConsumerClientId, - tweetIndexingResult.getMinRecordTimestampMs(), - tweetIndexingResult.getMaxRecordTimestampMs(), - updatesIndexingOffsets.getBeginOffset(), - updatesIndexingOffsets.getEndOffset())); - - indexUpdatesFromStream( - updateTopic, - updatesConsumerClientId, - updatesIndexingOffsets.getBeginOffset(), - updatesIndexingOffsets.getEndOffset(), - tweetIndexingResult.getSegmentWriter() - ); - - if (segmentBuildInfo.isLastSegment()) { - /* - * We don't optimize the last segment for a few reasons: - * - * 1. 
We might have tweets coming next in the stream, which are supposed to end - * up in this segment. - * - * 2. We might have updates coming next in the stream, which need to be applied to - * this segment before it's optimized. - * - * So the segment is kept unoptimized and later we take care of setting up things - * so that PartitionWriter and the tweet create/update handlers can start correctly. - */ - LOG.info("Not optimizing the last segment ({})", segmentBuildInfo.getIndex()); - } else { - Stopwatch optimizationStopwatch = Stopwatch.createStarted(); - try { - LOG.info("Starting to optimize segment: {}", segmentBuildInfo.getIndex()); - tweetIndexingResult.getSegmentWriter().getSegmentInfo() - .getIndexSegment().optimizeIndexes(); - } finally { - LOG.info("Optimization of segment {} finished in {}.", - segmentBuildInfo.getIndex(), optimizationStopwatch); - } - } - - segmentBuildInfo.setUpdateKafkaOffsetPair(updatesIndexingOffsets); - segmentBuildInfo.setMaxIndexedTweetId(tweetIndexingResult.getMaxIndexedTweetId()); - segmentBuildInfo.setSegmentWriter(tweetIndexingResult.getSegmentWriter()); - - return tweetIndexingResult.getSegmentWriter().getSegmentInfo(); - } - - private SegmentTweetsIndexingResult indexSegmentTweetsFromStream( - TopicPartition topicPartition, - String consumerClientId, - Optional firstTweetIdInNextSegment) throws IOException { - long startOffset = segmentBuildInfo.getTweetStartOffset(); - long endOffset = segmentBuildInfo.getTweetEndOffset(); - long marginSize = lateTweetBuffer / 2; - - boolean isFirstSegment = segmentBuildInfo.getIndex() == 0; - - long startReadingAtOffset = startOffset; - if (!isFirstSegment) { - startReadingAtOffset -= marginSize; - } else { - LOG.info("Not moving start offset backwards for segment {}.", segmentBuildInfo.getIndex()); - } - - long endReadingAtOffset = endOffset; - if (firstTweetIdInNextSegment.isPresent()) { - endReadingAtOffset += marginSize; - } else { - LOG.info("Not moving end offset forwards for segment {}.", segmentBuildInfo.getIndex()); - } - - KafkaConsumer tweetsKafkaConsumer = - makeKafkaConsumerForIndexing(consumerClientId, - topicPartition, startReadingAtOffset); - - boolean done = false; - long minIndexedTimestampMs = Long.MAX_VALUE; - long maxIndexedTimestampMs = Long.MIN_VALUE; - int indexedEvents = 0; - - Stopwatch stopwatch = Stopwatch.createStarted(); - - LOG.info("Creating segment writer for timeslice ID {}.", segmentBuildInfo.getStartTweetId()); - SegmentWriter segmentWriter = segmentManager.createSegmentWriter( - segmentBuildInfo.getStartTweetId()); - - /* - * We don't have a guarantee that tweets come in sorted order, so when we're building segment - * X', we try to pick some tweets from the previous and next ranges we're going to index. - * - * We also ignore tweets in the beginning and the end of our tweets range, which are picked - * by the previous or following segment. - * - * Segment X Segment X' Segment X'' - * -------------- o ----------------------------------------- o --------------- - * [~~~~~] ^ [~~~~~] [~~~~~] | [~~~~~] - * | | | | | | - * front margin | front padding (size K) back padding | back margin - * | | - * segment boundary at offset B' (1) B'' - * - * (1) This is at a predetermined tweet offset / tweet id. - * - * For segment X', we start to read tweets at offset B'-K and finish reading - * tweets at offset B''+K. K is a constant. - * - * For middle segments X' - * ====================== - * We move some tweets from the front margin and back margin into segment X'. 
- * Some tweets from the front and back padding are ignored, as they are moved - * into the previous and next segments. - * - * For the first segment - * ===================== - * No front margin, no front padding. We just read from the beginning offset - * and insert everything. - * - * For the last segment - * ==================== - * No back margin, no back padding. We just read until the end. - */ - - SkippedPickedCounter frontMargin = new SkippedPickedCounter("front margin"); - SkippedPickedCounter backMargin = new SkippedPickedCounter("back margin"); - SkippedPickedCounter frontPadding = new SkippedPickedCounter("front padding"); - SkippedPickedCounter backPadding = new SkippedPickedCounter("back padding"); - SkippedPickedCounter regular = new SkippedPickedCounter("regular"); - int totalRead = 0; - long maxIndexedTweetId = -1; - - Stopwatch pollTimer = Stopwatch.createUnstarted(); - Stopwatch indexTimer = Stopwatch.createUnstarted(); - - do { - // This can cause an exception, See P33896 - pollTimer.start(); - ConsumerRecords records = - tweetsKafkaConsumer.poll(Duration.ofSeconds(1)); - pollTimer.stop(); - - indexTimer.start(); - for (ConsumerRecord record : records) { - // Done reading? - if (record.offset() >= endReadingAtOffset) { - done = true; - } - - ThriftVersionedEvents tve = record.value(); - boolean indexTweet = false; - SkippedPickedCounter skippedPickedCounter; - - if (record.offset() < segmentBuildInfo.getTweetStartOffset()) { - // Front margin. - skippedPickedCounter = frontMargin; - if (tve.getId() > segmentBuildInfo.getStartTweetId()) { - indexTweet = true; - } - } else if (record.offset() > segmentBuildInfo.getTweetEndOffset()) { - // Back margin. - skippedPickedCounter = backMargin; - if (firstTweetIdInNextSegment.isPresent() - && tve.getId() < firstTweetIdInNextSegment.get()) { - indexTweet = true; - } - } else if (record.offset() < segmentBuildInfo.getTweetStartOffset() + marginSize) { - // Front padding. - skippedPickedCounter = frontPadding; - if (tve.getId() >= segmentBuildInfo.getStartTweetId()) { - indexTweet = true; - } - } else if (firstTweetIdInNextSegment.isPresent() - && record.offset() > segmentBuildInfo.getTweetEndOffset() - marginSize) { - // Back padding. - skippedPickedCounter = backPadding; - if (tve.getId() < firstTweetIdInNextSegment.get()) { - indexTweet = true; - } - } else { - skippedPickedCounter = regular; - // These we just pick. A tweet that came very late can end up in the wrong - // segment, but it's better for it to be present in a segment than dropped. - indexTweet = true; - } - - if (indexTweet) { - skippedPickedCounter.incrementPicked(); - segmentWriter.indexThriftVersionedEvents(tve); - maxIndexedTweetId = Math.max(maxIndexedTweetId, tve.getId()); - indexedEvents++; - - // Note that records don't necessarily have increasing timestamps. - // Why? The timestamps whatever timestamp we picked when creating the record - // in ingesters and there are many ingesters. - minIndexedTimestampMs = Math.min(minIndexedTimestampMs, record.timestamp()); - maxIndexedTimestampMs = Math.max(maxIndexedTimestampMs, record.timestamp()); - } else { - skippedPickedCounter.incrementSkipped(); - } - totalRead++; - - if (record.offset() >= endReadingAtOffset) { - break; - } - } - indexTimer.stop(); - } while (!done); - - tweetsKafkaConsumer.close(); - - SegmentTweetsIndexingResult result = new SegmentTweetsIndexingResult( - minIndexedTimestampMs, maxIndexedTimestampMs, maxIndexedTweetId, segmentWriter); - - LOG.info("Finished indexing {} tweets for {} in {}. 
Read {} tweets. Result: {}." - + " Time polling: {}, Time indexing: {}.", - indexedEvents, consumerClientId, stopwatch, totalRead, result, - pollTimer, indexTimer); - - // In normal conditions, expect to pick just a few in front and in the back. - LOG.info("SkippedPicked ({}) -- {}, {}, {}, {}, {}", - consumerClientId, frontMargin, frontPadding, backPadding, backMargin, regular); - - return result; - } - - - /** - * After indexing all the tweets for a segment, index updates that need to be applied before - * the segment is optimized. - * - * This is required because some updates (URL updates, cards and Named Entities) can only be - * applied to an unoptimized segment. Luckily, all of these updates should arrive close to when - * the Tweet is created. - */ - private KafkaOffsetPair findUpdateStreamOffsetRange( - SegmentTweetsIndexingResult tweetsIndexingResult) { - KafkaConsumer offsetsConsumer = - earlybirdKafkaConsumersFactory.createKafkaConsumer( - "consumer_for_update_offsets_" + segmentBuildInfo.getIndex()); - - // Start one minute before the first indexed tweet. One minute is excessive, but - // we need to start a bit earlier in case the first tweet we indexed came in - // later than some of its updates. - long updatesStartOffset = offsetForTime(offsetsConsumer, updateTopic, - tweetsIndexingResult.getMinRecordTimestampMs() - Duration.ofMinutes(1).toMillis()); - - // Two cases: - // - // 1. If we're not indexing the last segment, end 10 minutes after the last tweet. So for - // example if we resolve an url in a tweet 3 minutes after the tweet is published, - // we'll apply that update before the segment is optimized. 10 minutes is a bit too - // much, but that doesn't matter a whole lot, since we're indexing about ~10 hours of - // updates. - // - // 2. If we're indexing the last segment, end a bit before the last indexed tweet. We might - // have incoming tweets that are a bit late. In fresh startup, we don't have a mechanism - // to store these tweets to be applied when the tweet arrives, as in TweetUpdateHandler, - // so just stop a bit earlier and let TweetCreateHandler and TweetUpdateHandler deal with - // that. - long millisAdjust; - if (segmentBuildInfo.getIndex() == segmentBuildInfos.size() - 1) { - millisAdjust = -Duration.ofMinutes(1).toMillis(); - } else { - millisAdjust = Duration.ofMinutes(10).toMillis(); - } - long updatesEndOffset = offsetForTime(offsetsConsumer, updateTopic, - tweetsIndexingResult.getMaxRecordTimestampMs() + millisAdjust); - - offsetsConsumer.close(); - - return new KafkaOffsetPair(updatesStartOffset, updatesEndOffset); - } - - /** - * Get the earliest offset with a timestamp >= $timestamp. - * - * The guarantee we get is that if we start reading from here on, we will get - * every single message that came in with a timestamp >= $timestamp. 
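The time-window arithmetic described in the comments of findUpdateStreamOffsetRange above reduces to a few lines. This sketch keeps only the time math, with the same constants (one minute of slack before the earliest indexed tweet, ten minutes after the latest for middle segments, minus one minute for the newest segment); it is an illustration, not the production code path:

    import java.time.Duration;

    final class UpdateWindowSketch {
      // Returns {startTimestampMs, endTimestampMs} bounding the updates applied to a
      // segment before it is optimized.
      static long[] updateWindow(long minTweetTimestampMs,
                                 long maxTweetTimestampMs,
                                 boolean isLastSegment) {
        long start = minTweetTimestampMs - Duration.ofMinutes(1).toMillis();
        long adjust = isLastSegment
            ? -Duration.ofMinutes(1).toMillis()  // stop early; the live update handlers take over
            : Duration.ofMinutes(10).toMillis(); // give URL/card/named-entity updates time to arrive
        return new long[] {start, maxTweetTimestampMs + adjust};
      }

      public static void main(String[] args) {
        long[] window = updateWindow(1_000_000L, 2_000_000L, false);
        System.out.println(window[0] + " .. " + window[1]); // 940000 .. 2600000
      }
    }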
- */ - private long offsetForTime(KafkaConsumer kafkaConsumer, - TopicPartition partition, - long timestamp) { - Preconditions.checkNotNull(kafkaConsumer); - Preconditions.checkNotNull(partition); - - OffsetAndTimestamp offsetAndTimestamp = kafkaConsumer - .offsetsForTimes(ImmutableMap.of(partition, timestamp)) - .get(partition); - if (offsetAndTimestamp == null) { - return -1; - } else { - return offsetAndTimestamp.offset(); - } - } - - private void indexUpdatesFromStream( - TopicPartition topicPartition, - String consumerClientId, - long startOffset, - long endOffset, - SegmentWriter segmentWriter) throws IOException { - KafkaConsumer kafkaConsumer = - makeKafkaConsumerForIndexing(consumerClientId, topicPartition, startOffset); - - // Index TVEs. - boolean done = false; - - Stopwatch pollTimer = Stopwatch.createUnstarted(); - Stopwatch indexTimer = Stopwatch.createUnstarted(); - - SkippedPickedCounter updatesSkippedPicked = new SkippedPickedCounter("streamed_updates"); - IndexingResultCounts indexingResultCounts = new IndexingResultCounts(); - - long segmentTimesliceId = segmentWriter.getSegmentInfo().getTimeSliceID(); - - Stopwatch totalTime = Stopwatch.createStarted(); - - do { - pollTimer.start(); - ConsumerRecords records = - kafkaConsumer.poll(Duration.ofSeconds(1)); - pollTimer.stop(); - - indexTimer.start(); - for (ConsumerRecord record : records) { - if (record.value().getId() < segmentTimesliceId) { - // Doesn't apply to this segment, can be skipped instead of skipping it - // inside the more costly segmentWriter.indexThriftVersionedEvents call. - updatesSkippedPicked.incrementSkipped(); - } else { - if (record.offset() >= endOffset) { - done = true; - } - - updatesSkippedPicked.incrementPicked(); - indexingResultCounts.countResult( - segmentWriter.indexThriftVersionedEvents(record.value())); - } - - if (record.offset() >= endOffset) { - break; - } - } - indexTimer.stop(); - } while (!done); - - // Note that there'll be a decent amount of failed retryable updates. Since we index - // updates in a range that's a bit wider, they can't be applied here. - LOG.info("Client: {}, Finished indexing updates: {}. " - + "Times -- total: {}. polling: {}, indexing: {}. Indexing result counts: {}", - consumerClientId, updatesSkippedPicked, - totalTime, pollTimer, indexTimer, indexingResultCounts); - } - - /** - * Make a consumer that reads from a single partition, starting at some offset. - */ - private KafkaConsumer makeKafkaConsumerForIndexing( - String consumerClientId, - TopicPartition topicPartition, - long offset) { - KafkaConsumer kafkaConsumer = - earlybirdKafkaConsumersFactory.createKafkaConsumer(consumerClientId); - kafkaConsumer.assign(ImmutableList.of(topicPartition)); - kafkaConsumer.seek(topicPartition, offset); - LOG.info("Indexing TVEs. Kafka consumer: {}", consumerClientId); - return kafkaConsumer; - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentBuildInfo.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentBuildInfo.java deleted file mode 100644 index 93d8436c7..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentBuildInfo.java +++ /dev/null @@ -1,92 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.partition.SegmentWriter; - -// Data collected and produced while building a segment. 
-class SegmentBuildInfo { - private static final Logger LOG = LoggerFactory.getLogger(SegmentBuildInfo.class); - - // Inclusive boundaries. [start, end]. - private final long tweetStartOffset; - private final long tweetEndOffset; - private final int index; - private final boolean lastSegment; - - private long startTweetId; - private long maxIndexedTweetId; - private KafkaOffsetPair updateKafkaOffsetPair; - private SegmentWriter segmentWriter; - - public SegmentBuildInfo(long tweetStartOffset, - long tweetEndOffset, - int index, - boolean lastSegment) { - this.tweetStartOffset = tweetStartOffset; - this.tweetEndOffset = tweetEndOffset; - this.index = index; - this.lastSegment = lastSegment; - - this.startTweetId = -1; - this.updateKafkaOffsetPair = null; - this.maxIndexedTweetId = -1; - this.segmentWriter = null; - } - - public void setUpdateKafkaOffsetPair(KafkaOffsetPair updateKafkaOffsetPair) { - this.updateKafkaOffsetPair = updateKafkaOffsetPair; - } - - public KafkaOffsetPair getUpdateKafkaOffsetPair() { - return updateKafkaOffsetPair; - } - - public boolean isLastSegment() { - return lastSegment; - } - - public void setStartTweetId(long startTweetId) { - this.startTweetId = startTweetId; - } - - public long getTweetStartOffset() { - return tweetStartOffset; - } - - public long getTweetEndOffset() { - return tweetEndOffset; - } - - public long getStartTweetId() { - return startTweetId; - } - - public int getIndex() { - return index; - } - - public void setMaxIndexedTweetId(long maxIndexedTweetId) { - this.maxIndexedTweetId = maxIndexedTweetId; - } - - public long getMaxIndexedTweetId() { - return maxIndexedTweetId; - } - - public SegmentWriter getSegmentWriter() { - return segmentWriter; - } - - public void setSegmentWriter(SegmentWriter segmentWriter) { - this.segmentWriter = segmentWriter; - } - - public void logState() { - LOG.info("SegmentBuildInfo (index:{})", index); - LOG.info(String.format(" Start offset: %,d", tweetStartOffset)); - LOG.info(String.format(" End offset: %,d", tweetEndOffset)); - LOG.info(String.format(" Start tweet id: %d", startTweetId)); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentTweetsIndexingResult.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentTweetsIndexingResult.java deleted file mode 100644 index d7a8c1c56..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/SegmentTweetsIndexingResult.java +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -import com.twitter.search.earlybird.partition.SegmentWriter; - -/** - * Data collected and created while indexing tweets for a single segment. 
- */ -class SegmentTweetsIndexingResult { - private final long minRecordTimestampMs; - private final long maxRecordTimestampMs; - private final long maxIndexedTweetId; - private final SegmentWriter segmentWriter; - - public SegmentTweetsIndexingResult(long minRecordTimestampMs, long maxRecordTimestampMs, - long maxIndexedTweetId, - SegmentWriter segmentWriter) { - this.minRecordTimestampMs = minRecordTimestampMs; - this.maxRecordTimestampMs = maxRecordTimestampMs; - this.maxIndexedTweetId = maxIndexedTweetId; - this.segmentWriter = segmentWriter; - } - - public long getMinRecordTimestampMs() { - return minRecordTimestampMs; - } - - public long getMaxRecordTimestampMs() { - return maxRecordTimestampMs; - } - - public SegmentWriter getSegmentWriter() { - return segmentWriter; - } - - public long getMaxIndexedTweetId() { - return maxIndexedTweetId; - } - - @Override - public String toString() { - return String.format("Start time: %d, end time: %d, segment name: %s, max indexed: %d", - minRecordTimestampMs, maxRecordTimestampMs, - segmentWriter.getSegmentInfo().getSegmentName(), - maxIndexedTweetId); - } -} diff --git a/src/java/com/twitter/search/earlybird/partition/freshstartup/SkippedPickedCounter.java b/src/java/com/twitter/search/earlybird/partition/freshstartup/SkippedPickedCounter.java deleted file mode 100644 index f71d73f34..000000000 --- a/src/java/com/twitter/search/earlybird/partition/freshstartup/SkippedPickedCounter.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.earlybird.partition.freshstartup; - -class SkippedPickedCounter { - private long skipped; - private long picked; - private String name; - - public SkippedPickedCounter(String name) { - this.skipped = 0; - this.picked = 0; - this.name = name; - } - - @Override - public String toString() { - return String.format("[%s - picked: %,d, skipped: %,d]", - name, picked, skipped); - } - - void incrementSkipped() { - skipped++; - } - void incrementPicked() { - picked++; - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/CachedFilterQuery.java b/src/java/com/twitter/search/earlybird/querycache/CachedFilterQuery.java deleted file mode 100644 index f0a888430..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/CachedFilterQuery.java +++ /dev/null @@ -1,310 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.io.IOException; -import java.util.Objects; -import java.util.Set; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.QueryCacheResultForSegment; - -/** - * Query to iterate QueryCache result (the cache) - */ -public final class CachedFilterQuery extends Query { - private static final String STAT_PREFIX = "querycache_serving_"; - private static final SearchCounter REWRITE_CALLS = 
SearchCounter.export( - STAT_PREFIX + "rewrite_calls"); - private static final SearchCounter NO_CACHE_FOUND = SearchCounter.export( - STAT_PREFIX + "no_cache_found"); - private static final SearchCounter USED_CACHE_AND_FRESH_DOCS = SearchCounter.export( - STAT_PREFIX + "used_cache_and_fresh_docs"); - private static final SearchCounter USED_CACHE_ONLY = SearchCounter.export( - STAT_PREFIX + "used_cache_only"); - - - public static class NoSuchFilterException extends Exception { - NoSuchFilterException(String filterName) { - super("Filter [" + filterName + "] does not exists"); - } - } - - private static class CachedResultQuery extends Query { - private final QueryCacheResultForSegment cachedResult; - - public CachedResultQuery(QueryCacheResultForSegment cachedResult) { - this.cachedResult = cachedResult; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) - throws IOException { - return cachedResult.getDocIdSet().iterator(); - } - }; - } - - @Override - public int hashCode() { - return cachedResult == null ? 0 : cachedResult.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof CachedResultQuery)) { - return false; - } - - CachedResultQuery query = (CachedResultQuery) obj; - return Objects.equals(cachedResult, query.cachedResult); - } - - @Override - public String toString(String field) { - return "CACHED_RESULT"; - } - } - - private static class CachedResultAndFreshDocsQuery extends Query { - private final Query cacheLuceneQuery; - private final QueryCacheResultForSegment cachedResult; - - public CachedResultAndFreshDocsQuery( - Query cacheLuceneQuery, QueryCacheResultForSegment cachedResult) { - this.cacheLuceneQuery = cacheLuceneQuery; - this.cachedResult = cachedResult; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new Weight(this) { - @Override - public void extractTerms(Set terms) { - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - Scorer scorer = scorer(context); - if ((scorer != null) && (scorer.iterator().advance(doc) == doc)) { - return Explanation.match(0f, "Match on id " + doc); - } - return Explanation.match(0f, "No match on id " + doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Weight luceneWeight; - try { - luceneWeight = cacheLuceneQuery.createWeight(searcher, scoreMode, boost); - } catch (UnsupportedOperationException e) { - // Some queries do not support weights. This is fine, it simply means the query has - // no docs, and means the same thing as a null scorer. - return null; - } - - Scorer luceneScorer = luceneWeight.scorer(context); - if (luceneScorer == null) { - return null; - } - - DocIdSetIterator iterator = new CachedResultDocIdSetIterator( - cachedResult.getSmallestDocID(), - luceneScorer.iterator(), - cachedResult.getDocIdSet().iterator()); - return new ConstantScoreScorer(luceneWeight, 0.0f, scoreMode, iterator); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return true; - } - }; - } - - @Override - public int hashCode() { - return (cacheLuceneQuery == null ? 0 : cacheLuceneQuery.hashCode()) * 13 - + (cachedResult == null ? 
0 : cachedResult.hashCode()); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof CachedResultAndFreshDocsQuery)) { - return false; - } - - CachedResultAndFreshDocsQuery query = (CachedResultAndFreshDocsQuery) obj; - return Objects.equals(cacheLuceneQuery, query.cacheLuceneQuery) - && Objects.equals(cachedResult, query.cachedResult); - } - - @Override - public String toString(String field) { - return "CACHED_RESULT_AND_FRESH_DOCS"; - } - } - - private static final Query DUMMY_FILTER = wrapFilter(new Query() { - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) { - return null; - } - }; - } - - @Override - public int hashCode() { - return System.identityHashCode(this); - } - - @Override - public boolean equals(Object obj) { - return this == obj; - } - - @Override - public String toString(String field) { - return "DUMMY_FILTER"; - } - }); - - private final QueryCacheFilter queryCacheFilter; - - // Lucene Query used to fill the cache - private final Query cacheLuceneQuery; - - public static Query getCachedFilterQuery(String filterName, QueryCacheManager queryCacheManager) - throws NoSuchFilterException { - return wrapFilter(new CachedFilterQuery(filterName, queryCacheManager)); - } - - private static Query wrapFilter(Query filter) { - return new BooleanQuery.Builder() - .add(filter, BooleanClause.Occur.FILTER) - .build(); - } - - private CachedFilterQuery(String filterName, QueryCacheManager queryCacheManager) - throws NoSuchFilterException { - queryCacheFilter = queryCacheManager.getFilter(filterName); - if (queryCacheFilter == null) { - throw new NoSuchFilterException(filterName); - } - queryCacheFilter.incrementUsageStat(); - - // retrieve the query that was used to populate the cache - cacheLuceneQuery = queryCacheFilter.getLuceneQuery(); - } - - /** - * Creates a query base on the cache situation - */ - @Override - public Query rewrite(IndexReader reader) { - EarlybirdIndexSegmentAtomicReader twitterReader = (EarlybirdIndexSegmentAtomicReader) reader; - QueryCacheResultForSegment cachedResult = - twitterReader.getSegmentData().getQueryCacheResult(queryCacheFilter.getFilterName()); - REWRITE_CALLS.increment(); - - if (cachedResult == null || cachedResult.getSmallestDocID() == -1) { - // No cached result, or cache has never been updated - // This happens to the newly created segment, between the segment creation and first - // query cache update - NO_CACHE_FOUND.increment(); - - if (queryCacheFilter.getCacheModeOnly()) { - // since this query cache filter allows cache mode only, we return a query that - // matches no doc - return DUMMY_FILTER; - } - - return wrapFilter(cacheLuceneQuery); - } - - if (!queryCacheFilter.getCacheModeOnly() && // is this a cache mode only filter? - // the following check is only necessary for the realtime segment, which - // grows. Since we decrement docIds in the realtime segment, a reader - // having a smallestDocID less than the one in the cachedResult indicates - // that the segment/reader has new documents. - cachedResult.getSmallestDocID() > twitterReader.getSmallestDocID()) { - // The segment has more documents than the cached result. IOW, there are new - // documents that are not cached. This happens to latest segment that we're indexing to. 
- USED_CACHE_AND_FRESH_DOCS.increment(); - return wrapFilter(new CachedResultAndFreshDocsQuery(cacheLuceneQuery, cachedResult)); - } - - // The segment has not grown since the cache was last updated. - // This happens mostly to old segments that we're no longer indexing to. - USED_CACHE_ONLY.increment(); - return wrapFilter(new CachedResultQuery(cachedResult)); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - final Weight luceneWeight = cacheLuceneQuery.createWeight(searcher, scoreMode, boost); - - return new Weight(this) { - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - return luceneWeight.scorer(context); - } - - @Override - public void extractTerms(Set terms) { - luceneWeight.extractTerms(terms); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - return luceneWeight.explain(context, doc); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return luceneWeight.isCacheable(ctx); - } - }; - } - - @Override - public int hashCode() { - return cacheLuceneQuery == null ? 0 : cacheLuceneQuery.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof CachedFilterQuery)) { - return false; - } - - CachedFilterQuery filter = (CachedFilterQuery) obj; - return Objects.equals(cacheLuceneQuery, filter.cacheLuceneQuery); - } - - @Override - public String toString(String s) { - return "CachedFilterQuery[" + queryCacheFilter.getFilterName() + "]"; - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/CachedResultDocIdSetIterator.java b/src/java/com/twitter/search/earlybird/querycache/CachedResultDocIdSetIterator.java deleted file mode 100644 index 07ad639a3..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/CachedResultDocIdSetIterator.java +++ /dev/null @@ -1,72 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.io.IOException; - -import org.apache.lucene.search.DocIdSetIterator; - -public class CachedResultDocIdSetIterator extends DocIdSetIterator { - // With the realtime index, we grow the doc id negatively. - // Hence the smallest doc id is the ID the latest/newest document in the cache. - private final int cachedSmallestDocID; - - // Documents that were indexed after the last cache update - private final DocIdSetIterator freshDocIdIterator; - // Documents that were cached - private final DocIdSetIterator cachedDocIdIterator; - - private int currentDocId; - private boolean initialized = false; - - public CachedResultDocIdSetIterator(int cachedSmallestDocID, - DocIdSetIterator freshDocIdIterator, - DocIdSetIterator cachedDocIdIterator) { - this.cachedSmallestDocID = cachedSmallestDocID; - - this.freshDocIdIterator = freshDocIdIterator; - this.cachedDocIdIterator = cachedDocIdIterator; - this.currentDocId = -1; - } - - @Override - public int docID() { - return currentDocId; - } - - @Override - public int nextDoc() throws IOException { - if (currentDocId < cachedSmallestDocID) { - currentDocId = freshDocIdIterator.nextDoc(); - } else if (currentDocId != NO_MORE_DOCS) { - if (!initialized) { - // the first time we come in here, currentDocId should be pointing to - // something >= cachedMinDocID. We need to go to the doc after cachedMinDocID. 
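Once the class is complete (advance and cost follow below), its contract can be exercised with Lucene's built-in dense iterators. A small hypothetical check, assuming it runs somewhere CachedResultDocIdSetIterator is visible; ids below 5 stand in for documents indexed after the last cache update, ids 5 and above for cached results:

    import java.io.IOException;

    import org.apache.lucene.search.DocIdSetIterator;

    import com.twitter.search.earlybird.querycache.CachedResultDocIdSetIterator;

    final class CachedResultIteratorCheck {
      public static void main(String[] args) throws IOException {
        DocIdSetIterator fresh = DocIdSetIterator.all(10);   // dense ids 0..9
        DocIdSetIterator cached = DocIdSetIterator.all(10);  // dense ids 0..9
        CachedResultDocIdSetIterator it = new CachedResultDocIdSetIterator(5, fresh, cached);

        StringBuilder seen = new StringBuilder();
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
          seen.append(doc).append(' ');
        }
        // Expected output: 0 1 2 3 4 5 6 7 8 9, each id exactly once.
        System.out.println(seen.toString().trim());
      }
    }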
- currentDocId = cachedDocIdIterator.advance(currentDocId + 1); - initialized = true; - } else { - currentDocId = cachedDocIdIterator.nextDoc(); - } - } - return currentDocId; - } - - @Override - public int advance(int target) throws IOException { - if (target < cachedSmallestDocID) { - currentDocId = freshDocIdIterator.advance(target); - } else if (currentDocId != NO_MORE_DOCS) { - initialized = true; - currentDocId = cachedDocIdIterator.advance(target); - } - - return currentDocId; - } - - @Override - public long cost() { - if (currentDocId < cachedSmallestDocID) { - return freshDocIdIterator.cost(); - } else { - return cachedDocIdIterator.cost(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheConfig.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheConfig.java deleted file mode 100644 index fdee0ba23..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheConfig.java +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.io.File; -import java.io.FileNotFoundException; -import java.io.FileReader; -import java.io.Reader; -import java.util.ArrayList; -import java.util.List; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.yaml.snakeyaml.TypeDescription; -import org.yaml.snakeyaml.Yaml; -import org.yaml.snakeyaml.constructor.Constructor; - -import com.twitter.search.common.config.Config; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; - -// QueryCacheConfig is not thread safe. *Do not* attempt to create multiple QueryCacheConfig -// in different threads -public class QueryCacheConfig { - private static final Logger LOG = LoggerFactory.getLogger(QueryCacheConfig.class); - private static final String DEFAULT_CONFIG_FILE = "querycache.yml"; - private final SearchStatsReceiver statsReceiver; - - private List filters; - - public QueryCacheConfig(SearchStatsReceiver statsReceiver) { - this(locateConfigFile(EarlybirdConfig.getString("query_cache_config_file_name", - DEFAULT_CONFIG_FILE)), statsReceiver); - } - - // package protected constructor for unit test only - QueryCacheConfig(Reader reader, SearchStatsReceiver statsReceiver) { - this.statsReceiver = statsReceiver; - if (reader == null) { - throw new RuntimeException("Query cache config not loaded"); - } - loadConfig(reader); - } - - public List filters() { - return filters; - } - - int getFilterSize() { - return filters.size(); - } - - private static FileReader locateConfigFile(String configFileName) { - File configFile = null; - String dir = Config.locateSearchConfigDir(EarlybirdConfig.EARLYBIRD_CONFIG_DIR, configFileName); - if (dir != null) { - configFile = openConfigFile(dir + "/" + configFileName); - } - if (configFile != null) { - try { - return new FileReader(configFile); - } catch (FileNotFoundException e) { - // This should not happen as the caller should make sure that the file exists before - // calling this function. 
- LOG.error("Unexpected exception", e); - throw new RuntimeException("Query cache config file not loaded!", e); - } - } - return null; - } - - private static File openConfigFile(String configFilePath) { - File configFile = new File(configFilePath); - if (!configFile.exists()) { - LOG.warn("QueryCache config file [" + configFile + "] not found"); - configFile = null; - } else { - LOG.info("Opened QueryCacheFilter config file [" + configFile + "]"); - } - return configFile; - } - - private void loadConfig(Reader reader) { - TypeDescription qcEntryDescription = new TypeDescription(QueryCacheFilter.class); - Constructor constructor = new Constructor(qcEntryDescription); - Yaml yaml = new Yaml(constructor); - - filters = new ArrayList<>(); - - for (Object data : yaml.loadAll(reader)) { - QueryCacheFilter cacheFilter = (QueryCacheFilter) data; - try { - cacheFilter.sanityCheck(); - } catch (QueryCacheFilter.InvalidEntryException e) { - throw new RuntimeException(e); - } - cacheFilter.createQueryCounter(statsReceiver); - filters.add(cacheFilter); - LOG.info("Loaded filter from config {}", cacheFilter.toString()); - } - LOG.info("Total filters loaded: {}", filters.size()); - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheConversionRules.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheConversionRules.java deleted file mode 100644 index 60f1a1f85..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheConversionRules.java +++ /dev/null @@ -1,100 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.util.Arrays; -import java.util.List; -import java.util.Set; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Sets; - -import com.twitter.search.common.constants.QueryCacheConstants; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; - -import static com.twitter.search.common.util.RuleBasedConverter.Rule; - -/** - * Rules to convert exclude operators into cached filters and consolidate them. - * NOTE: this is copied from blender/core/parser/service/queryparser/QueryCacheConversionRules.java - * We should remove the blender one once this is in production. 
- */ -public final class QueryCacheConversionRules { - static final SearchOperator EXCLUDE_ANTISOCIAL = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.ANTISOCIAL); - static final SearchOperator EXCLUDE_SPAM = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.SPAM); - static final SearchOperator EXCLUDE_REPLIES = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.REPLIES); - static final SearchOperator EXCLUDE_NATIVERETWEETS = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.NATIVE_RETWEETS); - - public static final SearchOperator CACHED_EXCLUDE_ANTISOCIAL = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_ANTISOCIAL); - static final SearchOperator CACHED_EXCLUDE_NATIVERETWEETS = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_ANTISOCIAL_AND_NATIVERETWEETS); - static final SearchOperator CACHED_EXCLUDE_SPAM = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_SPAM); - static final SearchOperator CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_SPAM_AND_NATIVERETWEETS); - static final SearchOperator CACHED_EXCLUDE_REPLIES = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_REPLIES); - - private QueryCacheConversionRules() { - } - - public static final List> DEFAULT_RULES = ImmutableList.of( - // basic translation from exclude:filter to cached filter - new Rule<>(new Query[]{EXCLUDE_ANTISOCIAL}, - new Query[]{CACHED_EXCLUDE_ANTISOCIAL}), - - new Rule<>(new Query[]{EXCLUDE_SPAM}, - new Query[]{CACHED_EXCLUDE_SPAM}), - - new Rule<>(new Query[]{EXCLUDE_NATIVERETWEETS}, - new Query[]{CACHED_EXCLUDE_NATIVERETWEETS}), - - new Rule<>(new Query[]{EXCLUDE_REPLIES}, - new Query[]{CACHED_EXCLUDE_REPLIES}), - - // combine two cached filter to a new one - new Rule<>(new Query[]{CACHED_EXCLUDE_SPAM, CACHED_EXCLUDE_NATIVERETWEETS}, - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS}), - - // Remove redundant filters. A cached filter is redundant when it coexist with a - // more strict filter. Note all the filter will filter out antisocial. 
- new Rule<>( - new Query[]{CACHED_EXCLUDE_SPAM, CACHED_EXCLUDE_ANTISOCIAL}, - new Query[]{CACHED_EXCLUDE_SPAM}), - - new Rule<>( - new Query[]{CACHED_EXCLUDE_NATIVERETWEETS, CACHED_EXCLUDE_ANTISOCIAL}, - new Query[]{CACHED_EXCLUDE_NATIVERETWEETS}), - - new Rule<>( - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS, CACHED_EXCLUDE_ANTISOCIAL}, - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS}), - - new Rule<>( - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS, CACHED_EXCLUDE_SPAM}, - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS}), - - new Rule<>( - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS, CACHED_EXCLUDE_NATIVERETWEETS}, - new Query[]{CACHED_EXCLUDE_SPAM_AND_NATIVERETWEETS}) - ); - - public static final List STRIP_ANNOTATIONS_QUERIES; - static { - Set stripAnnotationsQueries = Sets.newHashSet(); - for (Rule rule : DEFAULT_RULES) { - stripAnnotationsQueries.addAll(Arrays.asList(rule.getSources())); - } - STRIP_ANNOTATIONS_QUERIES = ImmutableList.copyOf(stripAnnotationsQueries); - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheFilter.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheFilter.java deleted file mode 100644 index d2726338e..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheFilter.java +++ /dev/null @@ -1,302 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.util.List; -import java.util.TreeMap; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.Query; - -import com.twitter.common.collections.Pair; -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.util.text.regex.Regex; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.queryparser.EarlybirdLuceneQueryVisitor; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.queryparser.parser.SerializedQueryParser; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * The definition of a QueryCache filter/entry, like the name of the filter, the query used - * to populate the cache, update schedule, etc.. - * - * Instances of this class are created by the YAML loader when loading the config file. Most - * members are populated by YAML using setters through reflection. - */ -public class QueryCacheFilter { - // Data structure type supported as cache result holder - public enum ResultSetType { - FixedBitSet, - SparseFixedBitSet - } - - // Fields set directly from YML config file. - private String filterName; // unique name for cached filter - private String query; // serialized query string - private ResultSetType resultType; - private boolean cacheModeOnly; - private List schedule; - private SearchCounter queries; - - // Fields generated based on config (but not directly). 
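// Editorial note: the YAML-populated fields above are typically filled from entries along the
// lines of the following sketch (hypothetical names and values; one YAML document per filter,
// with keys mapped onto the setters below by snakeyaml):
//
//   filterName: exclude_spam
//   query: "<serialized query string>"   # placeholder; real entries use a serialized query
//   resultType: FixedBitSet
//   cacheModeOnly: false
//   schedule:
//     - segment: 0
//       seconds: 60
//     - segment: 1
//       seconds: 600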
- private volatile Pair queryPair; - private TreeMap scheduleMap; // tree map from index to interval - - public class InvalidEntryException extends Exception { - public InvalidEntryException(String message) { - super("Filter [" + filterName + "]: " + message); - } - } - - public static class UpdateInterval { - // Overrides *all* query cache update frequencies to be this value, in seconds. - private final int overrideSecondsForTests = EarlybirdConfig.getInt( - "override_query_cache_update_frequency", -1); - - // Fields set directly from YML config file. - private int segment; - private long seconds; - - public void setSegment(int segment) { - this.segment = segment; - } - - /** - * Sets the update period in seconds. If the override_query_cache_update_frequency parameter is - * specified in the earlybird configuration, its value is used instead (the value passed to this - * method is ignored). - */ - public void setSeconds(long seconds) { - if (overrideSecondsForTests != -1) { - this.seconds = overrideSecondsForTests; - } else { - this.seconds = seconds; - } - } - - public int getSegment() { - return segment; - } - - public long getSeconds() { - return seconds; - } - } - - public void setFilterName(String filterName) throws InvalidEntryException { - sanityCheckFilterName(filterName); - this.filterName = filterName; - } - - /** - * Sets the driving query for this query cache filter. - */ - public void setQuery(String query) throws InvalidEntryException { - if (query == null || query.isEmpty()) { - throw new InvalidEntryException("Empty query string"); - } - - this.query = query; - } - - /** - * Sets the type of the results that will be generated by this query cache filter. - */ - public void setResultType(String resultType) throws InvalidEntryException { - if (ResultSetType.FixedBitSet.toString().equalsIgnoreCase(resultType)) { - this.resultType = ResultSetType.FixedBitSet; - } else if (ResultSetType.SparseFixedBitSet.toString().equalsIgnoreCase(resultType)) { - this.resultType = ResultSetType.SparseFixedBitSet; - } else { - throw new InvalidEntryException("Unregconized result type [" + resultType + "]"); - } - } - - public void setCacheModeOnly(boolean cacheModeOnly) { - this.cacheModeOnly = cacheModeOnly; - } - - public void setSchedule(List schedule) - throws QueryCacheFilter.InvalidEntryException { - sanityCheckSchedule(schedule); - this.schedule = schedule; - this.scheduleMap = createScheduleMap(schedule); - } - - public void createQueryCounter(SearchStatsReceiver statsReceiver) { - queries = statsReceiver.getCounter("cached_filter_" + filterName + "_queries"); - } - - public void incrementUsageStat() { - queries.increment(); - } - - public String getFilterName() { - return filterName; - } - - public String getQueryString() { - return query; - } - - // snakeyaml does not like a getter named getResultType() that does not return a string - public ResultSetType getResultSetType() { - return resultType; - } - - public boolean getCacheModeOnly() { - return cacheModeOnly; - } - - public Query getLuceneQuery() { - return queryPair.getSecond(); - } - - public ThriftSearchQuery getSearchQuery() { - return queryPair.getFirst(); - } - - /** - * Create a new {@link SearchRequestInfo} using {@link #queryPair}. 
- * - * @return a new {@link SearchRequestInfo} - */ - public SearchRequestInfo createSearchRequestInfo() { - ThriftSearchQuery searchQuery = Preconditions.checkNotNull(queryPair.getFirst()); - Query luceneQuery = Preconditions.checkNotNull(queryPair.getSecond()); - - return new SearchRequestInfo( - searchQuery, luceneQuery, new TerminationTracker(Clock.SYSTEM_CLOCK)); - } - - public void setup( - QueryCacheManager queryCacheManager, - UserTable userTable, - EarlybirdCluster earlybirdCluster) throws QueryParserException { - createQuery(queryCacheManager, userTable, earlybirdCluster); - } - - // index corresponds to 'segment' from the config file. this is the index of the - // segment, starting with the current segment (0) and counting backwards in time. - public Amount getUpdateInterval(int index) { - long seconds = scheduleMap.floorEntry(index).getValue().getSeconds(); - return Amount.of(seconds, Time.SECONDS); - } - - private TreeMap createScheduleMap(List scheduleToUse) { - TreeMap map = new TreeMap<>(); - for (UpdateInterval interval : scheduleToUse) { - map.put(interval.segment, interval); - } - return map; - } - - private void createQuery( - QueryCacheManager queryCacheManager, - UserTable userTable, - EarlybirdCluster earlybirdCluster) throws QueryParserException { - - int maxSegmentSize = EarlybirdConfig.getMaxSegmentSize(); - CollectorParams collectionParams = new CollectorParams(); - collectionParams.setNumResultsToReturn(maxSegmentSize); - CollectorTerminationParams terminationParams = new CollectorTerminationParams(); - terminationParams.setMaxHitsToProcess(maxSegmentSize); - collectionParams.setTerminationParams(terminationParams); - - ThriftSearchQuery searchQuery = new ThriftSearchQuery(); - searchQuery.setMaxHitsPerUser(maxSegmentSize); - searchQuery.setCollectorParams(collectionParams); - searchQuery.setSerializedQuery(query); - - final SerializedQueryParser parser = new SerializedQueryParser( - EarlybirdConfig.getPenguinVersion()); - - Query luceneQuery = parser.parse(query).simplify().accept( - new EarlybirdLuceneQueryVisitor( - queryCacheManager.getIndexConfig().getSchema().getSchemaSnapshot(), - queryCacheManager, - userTable, - queryCacheManager.getUserScrubGeoMap(), - earlybirdCluster, - queryCacheManager.getDecider())); - if (luceneQuery == null) { - throw new QueryParserException("Unable to create lucene query from " + query); - } - - queryPair = new Pair<>(searchQuery, luceneQuery); - } - - private void sanityCheckFilterName(String filter) throws InvalidEntryException { - if (filter == null || filter.isEmpty()) { - throw new InvalidEntryException("Missing filter name"); - } - if (Regex.FILTER_NAME_CHECK.matcher(filter).find()) { - throw new InvalidEntryException( - "Invalid character in filter name. 
Chars allowed [a-zA-Z_0-9]"); - } - } - - private void sanityCheckSchedule(List intervals) - throws InvalidEntryException { - // Make sure there's at least 1 interval defined - if (intervals == null || intervals.isEmpty()) { - throw new InvalidEntryException("No schedule defined"); - } - - // Make sure the first interval starts with segment 0 - if (intervals.get(0).getSegment() != 0) { - throw new InvalidEntryException( - "The first interval in the schedule must start from segment 0"); - } - - // Make sure segments are defined in order, and no segment is defined more than twice - int prevSegment = intervals.get(0).getSegment(); - for (int i = 1; i < intervals.size(); ++i) { - int currentSegment = intervals.get(i).getSegment(); - if (prevSegment > currentSegment) { - throw new InvalidEntryException("Segment intervals out of order. Segment " + prevSegment - + " is defined before segment " + currentSegment); - } - - if (prevSegment == intervals.get(i).getSegment()) { - throw new InvalidEntryException("Segment " + prevSegment + " is defined twice"); - } - - prevSegment = currentSegment; - } - } - - protected void sanityCheck() throws InvalidEntryException { - sanityCheckFilterName(filterName); - if (query == null || query.isEmpty()) { - throw new InvalidEntryException("Missing query"); - } - if (resultType == null) { - throw new InvalidEntryException("Missing result type"); - } - if (schedule == null || schedule.size() == 0) { - throw new InvalidEntryException("Missing update schedule"); - } - if (scheduleMap == null || scheduleMap.size() == 0) { - throw new InvalidEntryException("Missing update schedule map"); - } - } - - @Override - public String toString() { - return "filterName: [" + getFilterName() - + "] query: [" + getQueryString() - + "] result type [" + getResultSetType() - + "] schedule: " + schedule; - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheManager.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheManager.java deleted file mode 100644 index 0795df0cc..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheManager.java +++ /dev/null @@ -1,365 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.util.Collection; -import java.util.Collections; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; -import com.google.common.collect.Lists; -import com.google.common.primitives.Longs; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.EarlybirdStatus; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import 
com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; -import com.twitter.search.earlybird.partition.SegmentManager.Filter; -import com.twitter.search.earlybird.partition.SegmentManager.Order; -import com.twitter.search.earlybird.partition.SegmentManager.SegmentUpdateListener; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.EarlybirdStatusCode; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * Main class to manage Earlybird's QueryCache. - * - * Initialize the QueryCache and new segments are notified to the QueryCache subsystem - * through this class. - * - * This class is thread-safe when calling methods that modify the list of tasks that - * we're executing or when we need to traverse all tasks and check something. The way - * thread-safety is achieved here right now is through making methods synchronized. - */ -public class QueryCacheManager implements SegmentUpdateListener { - private static final Logger LOG = LoggerFactory.getLogger(QueryCacheManager.class); - - private static final Amount ZERO_SECONDS = Amount.of(0L, Time.SECONDS); - - private final boolean enabled = EarlybirdConfig.getBool("querycache", false); - - // segments are removed from SegmentInfoMap lazily, and there may be a wait time. - // So, beware that there's short period of time where there's more segments than - // maxEnabledSegments. - private final int maxEnabledSegments; - - private final UserTable userTable; - private final UserScrubGeoMap userScrubGeoMap; - private final EarlybirdIndexConfig indexConfig; - private QueryCacheUpdater updater; - private final Map filters; - private final ScheduledExecutorServiceFactory updaterScheduledExecutorServiceFactory; - - private final SearchStatsReceiver searchStatsReceiver; - - private static final SearchLongGauge NUM_CACHE_ENTRY_STAT = - SearchLongGauge.export("querycache_num_entries"); - - private static final SearchCounter NUM_UPDATE_SEGMENTS_CALLS = - SearchCounter.export("querycache_num_update_segments_calls"); - - private volatile boolean didSetup = false; - - private final EarlybirdSearcherStats searcherStats; - private final Decider decider; - private final CriticalExceptionHandler criticalExceptionHandler; - private final Clock clock; - - public QueryCacheManager( - QueryCacheConfig config, - EarlybirdIndexConfig indexConfig, - int maxEnabledSegments, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - ScheduledExecutorServiceFactory updaterScheduledExecutorServiceFactory, - SearchStatsReceiver searchStatsReceiver, - EarlybirdSearcherStats searcherStats, - Decider decider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - - Preconditions.checkArgument(maxEnabledSegments > 0); - - QueryCacheConfig queryCacheConfig = config; - if (queryCacheConfig == null) { - queryCacheConfig = new QueryCacheConfig(searchStatsReceiver); - } - this.indexConfig = indexConfig; - this.maxEnabledSegments = maxEnabledSegments; - this.userTable = userTable; - this.userScrubGeoMap = userScrubGeoMap; - this.updaterScheduledExecutorServiceFactory = updaterScheduledExecutorServiceFactory; - this.searchStatsReceiver = searchStatsReceiver; - this.searcherStats = searcherStats; - this.filters = new HashMap<>(); - this.decider = decider; - this.criticalExceptionHandler = criticalExceptionHandler; - this.clock = clock; - for (QueryCacheFilter filter : queryCacheConfig.filters()) { - filters.put(filter.getFilterName(), 
filter); - } - NUM_CACHE_ENTRY_STAT.set(filters.size()); - } - - public EarlybirdIndexConfig getIndexConfig() { - return indexConfig; - } - - public UserScrubGeoMap getUserScrubGeoMap() { - return userScrubGeoMap; - } - - /** Setup all update tasks at once, should only be called after Earlybird has loaded/indexed all - * segments during start-up - * - * Only the first call to the function has effect, subsequent calls are no-ops - */ - public void setupTasksIfNeeded(SegmentManager segmentManager) - throws QueryParserException { - setupTasks( - segmentManager.getSegmentInfos(Filter.All, Order.OLD_TO_NEW), - segmentManager.getEarlybirdIndexConfig().getCluster()); - } - - @VisibleForTesting - synchronized void setupTasks( - Iterable newSegments, - EarlybirdCluster earlybirdCluster) throws QueryParserException { - // Setup needs to be done only once after all index caught up. - if (didSetup) { - return; - } - - LOG.info("Setting up {} query cache tasks", filters.values().size()); - - for (QueryCacheFilter filter : filters.values()) { - filter.setup(this, userTable, earlybirdCluster); - } - - if (!enabled()) { - // Note that the definition of disabling the query caches here is "don't compute the caches". - // We still load the queries from the .yml, we still rewrite search queries to use - // cached queries. The reason we are choosing this definition is that it's somewhat simpler - // to implement (no need to turn off rewriting) and because we might get external queries that - // contain cached filters (they're listed in go/searchsyntax). - // - // If we need a stricter definition of turning off query caches, we can implement it too, or - // just tighten this one. - return; - } - - Preconditions.checkState(updater == null); - updater = new QueryCacheUpdater( - filters.values(), - updaterScheduledExecutorServiceFactory, - userTable, - searchStatsReceiver, - searcherStats, - decider, - criticalExceptionHandler, - clock); - - LOG.info("Finished setting up query cache updater."); - - scheduleTasks(newSegments, false); - - didSetup = true; - } - - private void scheduleTasks(Iterable segments, boolean isCurrent) { - List sortedSegments = Lists.newArrayList(segments); - Collections.sort(sortedSegments, (o1, o2) -> { - // sort new to old (o2 and o1 are reversed here) - return Longs.compare(o2.getTimeSliceID(), o1.getTimeSliceID()); - }); - - LOG.info("Scheduling tasks for {} segments.", sortedSegments.size()); - - for (int segmentIndex = 0; segmentIndex < sortedSegments.size(); ++segmentIndex) { - SegmentInfo segmentInfo = sortedSegments.get(segmentIndex); - if (segmentIndex == maxEnabledSegments) { - LOG.warn("Tried to add more segments than MaxEnabledSegments (" + maxEnabledSegments - + "). Removed oldest segment " + segmentInfo.getTimeSliceID()); - continue; - } - addQueryCacheTasksForSegment(segmentInfo, segmentIndex, !isCurrent); - } - } - - /** - * Rebuilds the query cache for the given segment after it was optimized. - */ - public synchronized void rebuildQueryCachesAfterSegmentOptimization( - SegmentInfo optimizedSegment) { - Preconditions.checkState(optimizedSegment.getIndexSegment().isOptimized(), - "Segment " + optimizedSegment.getSegmentName() + " is not optimized."); - - if (!didSetup) { - // Once our indexing is current, we'll just start tasks for all segments, optimized or not. - // Before that event, we don't do anything query cache related. 
- LOG.info("Haven't done initial setup, returning."); - return; - } - - LOG.info("Rebuilding query caches for optimized segment {}", - optimizedSegment.getSegmentName()); - - // The optimized segment should always be the 1st segment (the current segment has index 0). - Stopwatch stopwatch = Stopwatch.createStarted(); - updater.removeAllTasksForSegment(optimizedSegment); - addQueryCacheTasksForSegment(optimizedSegment, 1, true); - - while (!updater.allTasksRanForSegment(optimizedSegment)) { - try { - Thread.sleep(1000); - } catch (InterruptedException e) { - // Ignore - } - } - - LOG.info("Rebuilding all query caches for the optimized segment {} took {}.", - optimizedSegment.getSegmentName(), stopwatch); - } - - /** - * Block until all the tasks inside this manager have ran at least once. - */ - public void waitUntilAllQueryCachesAreBuilt() { - LOG.info("Waiting until all query caches are built..."); - - Stopwatch stopwatch = Stopwatch.createStarted(); - while (!allTasksRan()) { - try { - Thread.sleep(1000); - } catch (InterruptedException ex) { - Thread.currentThread().interrupt(); - } - } - - LOG.info("Ran query cache tasks in: {}", stopwatch); - } - - private void addQueryCacheTasksForSegment( - SegmentInfo segmentInfo, int segmentIndex, boolean scheduleImmediately) { - LOG.info("Adding query cache tasks for segment {}.", segmentInfo.getTimeSliceID()); - double updateIntervalMultiplier = - EarlybirdConfig.getDouble("query_cache_update_interval_multiplier", 1.0); - for (QueryCacheFilter filter : filters.values()) { - Amount updateIntervalFromConfig = filter.getUpdateInterval(segmentIndex); - Amount updateInterval = Amount.of( - (long) (updateIntervalFromConfig.getValue() * updateIntervalMultiplier), - updateIntervalFromConfig.getUnit()); - - Amount initialDelay = scheduleImmediately ? ZERO_SECONDS : updateInterval; - updater.addTask(filter, segmentInfo, updateInterval, initialDelay); - } - } - - /** - * Notify QueryCacheManager of a new list of segments we currently have, so that cache tasks - * can be updated. - * - * @param segments fresh list of all segments - * - * All existing tasks will be canceled/removed/destroyed, new tasks will be created for all - * segments. - */ - @Override - public synchronized void update(Collection segments, String message) { - if (!enabled()) { - return; - } - - // This manager is created right at the beginning of a startup. Before we set it up, - // we'll read tweets and create segments and therefore this method will be called. - // We don't want to start computing query caches during that time, so we just return. - if (!didSetup) { - return; - } - - NUM_UPDATE_SEGMENTS_CALLS.increment(); - - LOG.info("Rescheduling all query cache tasks ({}). Number of segments received = {}.", - message, segments.size()); - updater.clearTasks(); // cancel and remove all scheduled tasks - - // If Earlybird is still starting up, and we get a partition roll, don't delay rebuilding - // the query cache. - boolean isCurrent = EarlybirdStatus.getStatusCode() == EarlybirdStatusCode.CURRENT; - scheduleTasks(segments, isCurrent); - } - - /** - * Determines if all query cache tasks ran at least once (even if they failed). - */ - public synchronized boolean allTasksRan() { - return (!(enabled() && didSetup)) || updater.allTasksRan(); - } - - /** - * Determines if the query cache manager is enabled. - */ - public boolean enabled() { - return enabled; - } - - /** - * Returns the query cache filter with the given name. 
- */ - public QueryCacheFilter getFilter(String filterName) { - return filters.get(filterName); - } - - /** - * Shuts down the query cache manager. - */ - public synchronized void shutdown() throws InterruptedException { - LOG.info("Shutting down QueryCacheManager"); - if (updater != null) { - updater.shutdown(); - updater = null; - } - didSetup = false; // needed for unit test - } - - /** - * After startup, we want only one thread to update the query cache. - */ - public void setWorkerPoolSizeAfterStartup() { - if (this.updater != null) { - this.updater.setWorkerPoolSizeAfterStartup(); - } - } - - public Decider getDecider() { - return this.decider; - } - - ////////////////////////// - // for unit tests only - ////////////////////////// - QueryCacheUpdater getUpdaterForTest() { - return updater; - } - Map getCacheMapForTest() { - return filters; - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheResultCollector.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheResultCollector.java deleted file mode 100644 index 5f69f57a5..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheResultCollector.java +++ /dev/null @@ -1,124 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.io.IOException; - -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.util.BitDocIdSet; -import org.apache.lucene.util.BitSet; -import org.apache.lucene.util.FixedBitSet; -import org.apache.lucene.util.SparseFixedBitSet; - -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.core.earlybird.index.QueryCacheResultForSegment; -import com.twitter.search.earlybird.RecentTweetRestriction; -import com.twitter.search.earlybird.search.AbstractResultsCollector; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.search.SearchResultsInfo; -import com.twitter.search.earlybird.search.queries.SinceUntilFilter; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; - -import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; - -import static com.twitter.search.core.earlybird.index.TimeMapper.ILLEGAL_TIME; - -/** - * Collector to update the query cache (one segment for a filter) - */ -public class QueryCacheResultCollector - extends AbstractResultsCollector { - private static final int UNSET = -1; - - private final QueryCacheFilter queryCacheFilter; - private final Decider decider; - - private BitSet bitSet; - private long cardinality = 0L; - private int startingDocID = UNSET; - - public QueryCacheResultCollector( - ImmutableSchemaInterface schema, - QueryCacheFilter queryCacheFilter, - EarlybirdSearcherStats searcherStats, - Decider decider, - Clock clock, - int requestDebugMode) { - super(schema, - queryCacheFilter.createSearchRequestInfo(), - clock, - searcherStats, - requestDebugMode); - this.queryCacheFilter = queryCacheFilter; - this.decider = decider; - } - - @Override - public void startSegment() throws IOException { - // The doc IDs in the optimized segments are always in the 0 .. (segmentSize - 1) range, so we - // can use a dense bitset to collect the hits. However, unoptimized segments can use any int - // doc IDs, so we have to use a sparse bitset to collect the hits in those segments. 
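// Editorial note (rough sizing under standard Lucene semantics): a FixedBitSet eagerly allocates
// about maxDoc/8 bytes regardless of how many bits end up set (roughly 1 MB for an 8M-document
// segment), while a SparseFixedBitSet allocates its blocks lazily, which is why the sparse variant
// is the cheaper choice for filters with few hits and is used for unoptimized segments below.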
- if (currTwitterReader.getSegmentData().isOptimized()) { - switch (queryCacheFilter.getResultSetType()) { - case FixedBitSet: - bitSet = new FixedBitSet(currTwitterReader.maxDoc()); - break; - case SparseFixedBitSet: - bitSet = new SparseFixedBitSet(currTwitterReader.maxDoc()); - break; - default: - throw new IllegalStateException( - "Unknown ResultSetType: " + queryCacheFilter.getResultSetType().name()); - } - } else { - bitSet = new SparseFixedBitSet(currTwitterReader.maxDoc()); - } - - startingDocID = findStartingDocID(); - cardinality = 0; - } - - @Override - protected void doCollect(long tweetID) { - bitSet.set(curDocId); - cardinality++; - } - - @Override - protected SearchResultsInfo doGetResults() { - return new SearchResultsInfo(); - } - - public QueryCacheResultForSegment getCachedResult() { - // Note that BitSet.cardinality takes linear time in the size of the maxDoc, so we track - // cardinality separately. - return new QueryCacheResultForSegment(new BitDocIdSet(bitSet, cardinality), - cardinality, startingDocID); - } - - /** - * We don't want to return results less than 15 seconds older than the most recently indexed tweet, - * as they might not be completely indexed. - * We can't simply use the first hit, as some cached filters might not have any hits, - * e.g. has_engagement in the protected cluster. - * We can't use a clock because streams can lag. - */ - private int findStartingDocID() throws IOException { - int lastTime = currTwitterReader.getSegmentData().getTimeMapper().getLastTime(); - if (lastTime == ILLEGAL_TIME) { - return NO_MORE_DOCS; - } - - int untilTime = RecentTweetRestriction.queryCacheUntilTime(decider, lastTime); - if (untilTime == 0) { - return currTwitterReader.getSmallestDocID(); - } - - return SinceUntilFilter.getUntilQuery(untilTime) - .createWeight(new IndexSearcher(currTwitterReader), ScoreMode.COMPLETE_NO_SCORES, 1.0f) - .scorer(currTwitterReader.getContext()) - .iterator() - .nextDoc(); - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdateTask.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdateTask.java deleted file mode 100644 index db2ba8d4b..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdateTask.java +++ /dev/null @@ -1,283 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.io.IOException; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.core.earlybird.index.QueryCacheResultForSegment; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.EarlybirdException; -import com.twitter.search.earlybird.index.EarlybirdSegment; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; 
-import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.search.SearchResultsInfo; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.util.ScheduledExecutorTask; - -/** - * Each task is responsible for one filter on one segment. We should have a total - * of num_of_filter * num_of_segments tasks - */ -@VisibleForTesting -class QueryCacheUpdateTask extends ScheduledExecutorTask { - private static final Logger LOG = LoggerFactory.getLogger(QueryCacheUpdateTask.class); - - // See OBSERVE-10347 - private static final boolean EXPORT_STATS = - EarlybirdConfig.getBool("export_query_cache_update_task_stats", false); - - private static final LoadingCache TASK_STATS = - CacheBuilder.newBuilder().build(new CacheLoader() { - @Override - public TaskStats load(String statNamePrefix) { - return new TaskStats(statNamePrefix, EXPORT_STATS); - } - }); - - private static final SearchCounter FINISHED_TASKS = SearchCounter.export( - "querycache_finished_tasks"); - - private final QueryCacheFilter filter; - - // Info/data of the segment this task is responsible for - private final SegmentInfo segmentInfo; - - private final UserTable userTable; - - private volatile boolean ranOnce; - private final TaskStats stats; - private Amount lastRunFinishTime; - - // See SEARCH-4346 - private final String filterAndSegment; - - private final Decider decider; - - private static final class TaskStats { - private final SearchLongGauge numHitsStat; - private final SearchLongGauge updateLatencyStat; - private final SearchCounter updateSuccessCountStat; - private final SearchCounter updateFailureCountStat; - - private TaskStats(String statNamePrefix, boolean exportStats) { - // See SEARCH-3698 - numHitsStat = exportStats ? SearchLongGauge.export(statNamePrefix + "numhit") - : new SearchLongGauge(statNamePrefix + "numhit"); - updateLatencyStat = exportStats - ? SearchLongGauge.export(statNamePrefix + "update_latency_ms") - : new SearchLongGauge(statNamePrefix + "update_latency_ms"); - updateSuccessCountStat = exportStats - ? SearchCounter.export(statNamePrefix + "update_success_count") - : SearchCounter.create(statNamePrefix + "update_success_count"); - updateFailureCountStat = exportStats - ? 
SearchCounter.export(statNamePrefix + "update_failure_count") - : SearchCounter.create(statNamePrefix + "update_failure_count"); - } - } - - private final Amount updateInterval; - private final Amount initialDelay; - - private final EarlybirdSearcherStats searcherStats; - private final CriticalExceptionHandler criticalExceptionHandler; - - /** - * Constructor - * @param filter Filter to be used to populate the cache - * @param segmentInfo Segment this task is responsible for - * @param updateInterval Time between successive updates - * @param initialDelay Time before the first update - * @param updateIterationCounter - * @param decider - */ - public QueryCacheUpdateTask(QueryCacheFilter filter, - SegmentInfo segmentInfo, - UserTable userTable, - Amount updateInterval, - Amount initialDelay, - SearchCounter updateIterationCounter, - EarlybirdSearcherStats searcherStats, - Decider decider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - super(updateIterationCounter, clock); - this.filter = filter; - this.segmentInfo = segmentInfo; - this.userTable = userTable; - this.ranOnce = false; - this.updateInterval = updateInterval; - this.initialDelay = initialDelay; - this.stats = setupStats(); - this.filterAndSegment = String.format( - "QueryCacheFilter: %s | Segment: %d", - filter.getFilterName(), segmentInfo.getTimeSliceID()); - this.searcherStats = searcherStats; - this.criticalExceptionHandler = criticalExceptionHandler; - this.decider = decider; - } - - @Override - protected void runOneIteration() { - try { - if (LOG.isDebugEnabled()) { - LOG.debug( - "[{}] Updating with query [{}] for the {} th time.", - filterAndSegment, - filter.getQueryString(), - stats.updateSuccessCountStat.get() + stats.updateFailureCountStat.get() + 1 - ); - if (lastRunFinishTime != null) { - LOG.debug( - "[{}] Last run, {} th time, finished {} secs ago. Should run every {} secs", - filterAndSegment, - stats.updateSuccessCountStat.get() + stats.updateFailureCountStat.get(), - TimeUnit.NANOSECONDS.toSeconds( - System.nanoTime() - lastRunFinishTime.as(Time.NANOSECONDS)), - updateInterval.as(Time.SECONDS) - ); - } - } - - Timer timer = new Timer(TimeUnit.MILLISECONDS); - SearchResultsInfo result = null; - try { - result = update(); - } catch (Exception e) { - String msg = "Failed to update query cache entry [" + filter.getFilterName() - + "] on segment [" + segmentInfo.getTimeSliceID() + "]"; - LOG.warn(msg, e); - } - - long endTime = timer.stop(); - updateStats(result, endTime); - - if (LOG.isDebugEnabled()) { - LOG.debug("[{}] Updated in {} ms, hit {} docs.", - filterAndSegment, endTime, stats.numHitsStat.read()); - } - // Need to catch throwable here instead of exception so we handle errors like OutOfMemory - // See RB=528695 and SEARCH-4402 - } catch (Throwable t) { - String message = String.format("Got unexpected throwable in %s", getClass().getName()); - LOG.error(message, t); - - // Wrap the Throwable in a FatalEarlybirdException to categorize it and ensure it's - // handled as a fatal exception - criticalExceptionHandler.handle(this, - new EarlybirdException(message, t)); - } finally { - // Earlybird won't become CURRENT until all tasks are run at least once. We don't want - // failed "run" (update) to prevent Earlybird from becoming CURRENT. As long as all tasks - // got a chance to run at least once, we are good to go. 
- ranOnce = true; - - lastRunFinishTime = Amount.of(System.nanoTime(), Time.NANOSECONDS); - } - } - - public boolean ranOnce() { - return ranOnce; - } - - private TaskStats setupStats() { - return TASK_STATS.getUnchecked(statNamePrefix()); - } - - private SearchResultsInfo update() throws IOException { - // There's a chance that the EarlybirdSegment of a SegmentInfo to change at any - // time. Therefore, it's not safe to operate segments on the SegmentInfo level. - // On the archive clusters we create a new EarlybirdSegment and then swap it in when there's - // new data instead of appending to an existing EarlybirdSegment. - EarlybirdSegment earlybirdSegment = segmentInfo.getIndexSegment(); - - EarlybirdSingleSegmentSearcher searcher = earlybirdSegment.getSearcher(userTable); - if (searcher == null) { - LOG.warn("Unable to get searcher from TwitterIndexManager for segment [" - + segmentInfo.getTimeSliceID() + "]. Has it been dropped?"); - return null; - } - - QueryCacheResultCollector collector = new QueryCacheResultCollector( - searcher.getSchemaSnapshot(), filter, searcherStats, decider, clock, 0); - searcher.search(filter.getLuceneQuery(), collector); - - QueryCacheResultForSegment cacheResult = collector.getCachedResult(); - searcher.getTwitterIndexReader().getSegmentData().updateQueryCacheResult( - filter.getFilterName(), cacheResult); - - FINISHED_TASKS.increment(); - - if (LOG.isDebugEnabled()) { - TerminationTracker tracker = collector.getSearchRequestInfo().getTerminationTracker(); - LOG.debug( - "[{}] Updating query finished, start time ms is {}, termination reason is {}", - filterAndSegment, - tracker.getLocalStartTimeMillis(), - tracker.getEarlyTerminationState().getTerminationReason()); - } - - return collector.getResults(); - } - - private void updateStats(SearchResultsInfo result, long endTime) { - if (result != null) { - stats.numHitsStat.set(result.getNumHitsProcessed()); - stats.updateSuccessCountStat.increment(); - } else { - stats.updateFailureCountStat.increment(); - } - stats.updateLatencyStat.set(endTime); - } - - @VisibleForTesting - String statNamePrefix() { - // If we use this and try to display in monviz "ts(partition, single_instance, querycache*)", - // the UI shows "Really expensive query" message. We can keep this around for times when we - // want to start things manually and debug. 
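// Editorial note: for a hypothetical filter named "exclude_spam" on time slice 1234, the prefix
// built below produces stats such as querycache_exclude_spam_1234_numhit,
// querycache_exclude_spam_1234_update_latency_ms and
// querycache_exclude_spam_1234_update_success_count (the suffixes come from TaskStats above).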
- return "querycache_" + filter.getFilterName() + "_" + segmentInfo.getTimeSliceID() + "_"; - } - - public long getTimeSliceID() { - return segmentInfo.getTimeSliceID(); - } - - ////////////////////////// - // for unit tests only - ////////////////////////// - @VisibleForTesting - String getFilterNameForTest() { - return filter.getFilterName(); - } - - @VisibleForTesting - Amount getUpdateIntervalForTest() { - return updateInterval; - } - - @VisibleForTesting - Amount getInitialDelayForTest() { - return initialDelay; - } - - @VisibleForTesting - TaskStats getTaskStatsForTest() { - return stats; - } -} diff --git a/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdater.java b/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdater.java deleted file mode 100644 index f76e197ec..000000000 --- a/src/java/com/twitter/search/earlybird/querycache/QueryCacheUpdater.java +++ /dev/null @@ -1,242 +0,0 @@ -package com.twitter.search.earlybird.querycache; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Iterator; -import java.util.List; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.ScheduledFuture; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.factory.QueryCacheUpdaterScheduledExecutorService; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.util.PeriodicActionParams; -import com.twitter.search.earlybird.util.ScheduledExecutorManager; -import com.twitter.search.earlybird.util.ShutdownWaitTimeParams; - -/** - * Class to manage the scheduler service and all the update tasks. Through this - * class, update tasks are created and scheduled, canceled and removed. - * - * This class is not thread-safe. 
- */ -@VisibleForTesting -final class QueryCacheUpdater extends ScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(QueryCacheUpdater.class); - - private final List tasks; - private final EarlybirdSearcherStats searcherStats; - private final Decider decider; - private final UserTable userTable; - private final Clock clock; - - @VisibleForTesting - static final class Task { - @VisibleForTesting public final QueryCacheUpdateTask updateTask; - @VisibleForTesting public final ScheduledFuture future; - - private Task(QueryCacheUpdateTask updateTask, ScheduledFuture future) { - this.updateTask = updateTask; - this.future = future; - } - } - - public QueryCacheUpdater(Collection cacheFilters, - ScheduledExecutorServiceFactory updaterScheduledExecutorServiceFactory, - UserTable userTable, - SearchStatsReceiver searchStatsReceiver, - EarlybirdSearcherStats searcherStats, - Decider decider, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - super(updaterScheduledExecutorServiceFactory.build("QueryCacheUpdateThread-%d", true), - ShutdownWaitTimeParams.immediately(), searchStatsReceiver, - criticalExceptionHandler, clock); - Preconditions.checkNotNull(cacheFilters); - Preconditions.checkArgument(getExecutor() instanceof QueryCacheUpdaterScheduledExecutorService, - getExecutor().getClass()); - - this.searcherStats = searcherStats; - this.decider = decider; - this.userTable = userTable; - this.clock = clock; - - shouldLog = false; - // One update task per - tasks = Lists.newArrayListWithCapacity(cacheFilters.size() * 20); - - SearchCustomGauge.export( - "querycache_num_tasks", - tasks::size - ); - } - - /** - * Create an update task and add it to the executor - * - * @param filter The filter the task should execute - * @param segmentInfo The segment that this task would be responsible for - * @param updateInterval time in milliseconds between successive updates - * @param initialDelay Introduce a delay when adding the task to the executor - */ - void addTask(QueryCacheFilter filter, SegmentInfo segmentInfo, - Amount updateInterval, Amount initialDelay) { - String filterName = filter.getFilterName(); - String query = filter.getQueryString(); - - // Create the task. - QueryCacheUpdateTask qcTask = new QueryCacheUpdateTask( - filter, - segmentInfo, - userTable, - updateInterval, - initialDelay, - getIterationCounter(), - searcherStats, - decider, - criticalExceptionHandler, - clock); - - long initialDelayAsMS = initialDelay.as(Time.MILLISECONDS); - long updateIntervalAsMS = updateInterval.as(Time.MILLISECONDS); - Preconditions.checkArgument( - initialDelayAsMS >= initialDelay.getValue(), "initial delay unit granularity too small"); - Preconditions.checkArgument( - updateIntervalAsMS >= updateInterval.getValue(), - "update interval unit granularity too small"); - - // Schedule the task. - ScheduledFuture future = scheduleNewTask(qcTask, - PeriodicActionParams.withIntialWaitAndFixedDelay( - initialDelayAsMS, updateIntervalAsMS, TimeUnit.MILLISECONDS - ) - ); - - tasks.add(new Task(qcTask, future)); - - LOG.debug("Added a task for filter [" + filterName - + "] for segment [" + segmentInfo.getTimeSliceID() - + "] with query [" + query - + "] update interval " + updateInterval + " " - + (initialDelay.getValue() == 0 ? 
"without" : "with " + initialDelay) - + " initial delay"); - - } - - void removeAllTasksForSegment(SegmentInfo segmentInfo) { - int removedTasksCount = 0; - for (Iterator it = tasks.iterator(); it.hasNext();) { - Task task = it.next(); - if (task.updateTask.getTimeSliceID() == segmentInfo.getTimeSliceID()) { - task.future.cancel(true); - it.remove(); - removedTasksCount += 1; - } - } - - LOG.info("Removed {} update tasks for segment {}.", removedTasksCount, - segmentInfo.getTimeSliceID()); - } - - public void clearTasks() { - int totalTasks = tasks.size(); - LOG.info("Removing {} update tasks for all segments.", totalTasks); - for (Task task : tasks) { - task.future.cancel(true); - } - tasks.clear(); - LOG.info("Canceled {} QueryCache update tasks", totalTasks); - } - - // Have all tasks run at least once (even if they failed)? - public boolean allTasksRan() { - boolean allTasksRan = true; - for (Task task : tasks) { - if (!task.updateTask.ranOnce()) { - allTasksRan = false; - break; - } - } - - return allTasksRan; - } - - // Have all tasks for this run at least once (even if they failed)? - public boolean allTasksRanForSegment(SegmentInfo segmentInfo) { - boolean allTasksRanForSegment = true; - for (Task task : tasks) { - if ((task.updateTask.getTimeSliceID() == segmentInfo.getTimeSliceID()) - && !task.updateTask.ranOnce()) { - allTasksRanForSegment = false; - break; - } - } - - return allTasksRanForSegment; - } - - /** - * After startup, we want only one thread to update the query cache. - */ - void setWorkerPoolSizeAfterStartup() { - QueryCacheUpdaterScheduledExecutorService executor = - (QueryCacheUpdaterScheduledExecutorService) getExecutor(); - executor.setWorkerPoolSizeAfterStartup(); - LOG.info("Done setting executor core pool size to one"); - } - - @Override - protected void shutdownComponent() { - clearTasks(); - } - - ////////////////////////// - // for unit tests only - ////////////////////////// - - /** - * Returns the list of all query cache updater tasks. This method should be used only in tests. 
- */ - @VisibleForTesting - List getTasksForTest() { - synchronized (tasks) { - return new ArrayList<>(tasks); - } - } - - @VisibleForTesting - int getTasksSize() { - synchronized (tasks) { - return tasks.size(); - } - } - - @VisibleForTesting - boolean tasksContains(Task task) { - synchronized (tasks) { - return tasks.contains(task); - } - } - - @VisibleForTesting - public ScheduledExecutorService getExecutorForTest() { - return getExecutor(); - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/DetectAntisocialVisitor.java b/src/java/com/twitter/search/earlybird/queryparser/DetectAntisocialVisitor.java deleted file mode 100644 index b42ce5cb9..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/DetectAntisocialVisitor.java +++ /dev/null @@ -1,131 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import com.twitter.search.common.constants.QueryCacheConstants; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.query.search.SearchQueryVisitor; - -/** - * Visitor to detect presence of any antisocial / spam operator in a Query. - * Visitor returns true if any operators it detects were found. - */ -public class DetectAntisocialVisitor extends SearchQueryVisitor { - // True if the query contains any operator to include antisocial tweets. - private boolean includeAntisocial = false; - - // True if the query contains any operator to exclude antisocial/spam tweets. - private boolean excludeAntisocial = false; - - // True if the query contains an antisocial tweets filter. - private boolean filterAntisocial = false; - - public boolean hasIncludeAntisocial() { - return includeAntisocial; - } - - public boolean hasExcludeAntisocial() { - return excludeAntisocial; - } - - public boolean hasFilterAntisocial() { - return filterAntisocial; - } - - public boolean hasAnyAntisocialOperator() { - // Top tweets is considered an antisocial operator due to scoring also excluding - // spam tweets. 
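// Editorial note: for example, an exclude:antisocial style operator flips excludeAntisocial,
// a filter:antisocial operator flips filterAntisocial, and an include:antisocial operator flips
// includeAntisocial (see visit(SearchOperator) below); any one of these makes this method
// return true.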
- return hasIncludeAntisocial() || hasExcludeAntisocial() || hasFilterAntisocial(); - } - - @Override public Boolean visit(Disjunction disjunction) throws QueryParserException { - boolean found = false; - for (com.twitter.search.queryparser.query.Query node : disjunction.getChildren()) { - if (node.accept(this)) { - found = true; - } - } - return found; - } - - @Override public Boolean visit(Conjunction conjunction) throws QueryParserException { - boolean found = false; - for (com.twitter.search.queryparser.query.Query node : conjunction.getChildren()) { - if (node.accept(this)) { - found = true; - } - } - return found; - } - - @Override public Boolean visit(SearchOperator operator) throws QueryParserException { - boolean found = false; - switch (operator.getOperatorType()) { - case INCLUDE: - if (SearchOperatorConstants.ANTISOCIAL.equals(operator.getOperand())) { - if (operator.mustNotOccur()) { - excludeAntisocial = true; - } else { - includeAntisocial = true; - } - found = true; - } - break; - case EXCLUDE: - if (SearchOperatorConstants.ANTISOCIAL.equals(operator.getOperand())) { - if (operator.mustNotOccur()) { - includeAntisocial = true; - } else { - excludeAntisocial = true; - } - found = true; - } - break; - case FILTER: - if (SearchOperatorConstants.ANTISOCIAL.equals(operator.getOperand())) { - if (operator.mustNotOccur()) { - excludeAntisocial = true; - } else { - filterAntisocial = true; - } - found = true; - } - break; - case CACHED_FILTER: - if (QueryCacheConstants.EXCLUDE_SPAM.equals(operator.getOperand()) - || QueryCacheConstants.EXCLUDE_SPAM_AND_NATIVERETWEETS.equals(operator.getOperand()) - || QueryCacheConstants.EXCLUDE_ANTISOCIAL.equals(operator.getOperand()) - || QueryCacheConstants.EXCLUDE_ANTISOCIAL_AND_NATIVERETWEETS - .equals(operator.getOperand())) { - - excludeAntisocial = true; - found = true; - } - break; - default: - break; - } - - return found; - } - - @Override - public Boolean visit(SpecialTerm special) throws QueryParserException { - return false; - } - - @Override - public Boolean visit(Phrase phrase) throws QueryParserException { - return false; - } - - @Override - public Boolean visit(Term term) throws QueryParserException { - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/DetectFieldAnnotationVisitor.java b/src/java/com/twitter/search/earlybird/queryparser/DetectFieldAnnotationVisitor.java deleted file mode 100644 index 5c565ce91..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/DetectFieldAnnotationVisitor.java +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.Set; - -import com.google.common.collect.ImmutableSet; - -import com.twitter.search.queryparser.query.BooleanQuery; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Operator; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.QueryVisitor; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.annotation.FieldNameWithBoost; - -/** - * Detects whether the query tree has certain field annotations. 
- */ -public class DetectFieldAnnotationVisitor extends QueryVisitor { - private final ImmutableSet fieldNames; - - /** - * This visitor will return true if the query tree has a FIELD annotation with any of the given - * field names. If the set is empty, any FIELD annotation will match. - */ - public DetectFieldAnnotationVisitor(Set fieldNames) { - this.fieldNames = ImmutableSet.copyOf(fieldNames); - } - - /** - * This visitor will return true if the query tree has a FIELD annotation. - */ - public DetectFieldAnnotationVisitor() { - this.fieldNames = ImmutableSet.of(); - } - - @Override - public Boolean visit(Disjunction disjunction) throws QueryParserException { - return visitQuery(disjunction) || visitBooleanQuery(disjunction); - } - - @Override - public Boolean visit(Conjunction conjunction) throws QueryParserException { - return visitQuery(conjunction) || visitBooleanQuery(conjunction); - } - - @Override - public Boolean visit(Phrase phrase) throws QueryParserException { - return visitQuery(phrase); - } - - @Override - public Boolean visit(Term term) throws QueryParserException { - return visitQuery(term); - } - - @Override - public Boolean visit(Operator operator) throws QueryParserException { - return visitQuery(operator); - } - - @Override - public Boolean visit(SpecialTerm special) throws QueryParserException { - return visitQuery(special); - } - - private Boolean visitQuery(Query query) throws QueryParserException { - if (query.hasAnnotations()) { - for (Annotation annotation : query.getAnnotations()) { - if (!Annotation.Type.FIELD.equals(annotation.getType())) { - continue; - } - if (fieldNames.isEmpty()) { - return true; - } - FieldNameWithBoost value = (FieldNameWithBoost) annotation.getValue(); - if (fieldNames.contains(value.getFieldName())) { - return true; - } - } - } - - return false; - } - - private boolean visitBooleanQuery(BooleanQuery query) throws QueryParserException { - for (Query subQuery : query.getChildren()) { - if (subQuery.accept(this)) { - return true; - } - } - - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/EarlybirdLuceneQueryVisitor.java b/src/java/com/twitter/search/earlybird/queryparser/EarlybirdLuceneQueryVisitor.java deleted file mode 100644 index d78e7d8b1..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/EarlybirdLuceneQueryVisitor.java +++ /dev/null @@ -1,1781 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.Arrays; -import java.util.Collection; -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.TreeSet; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Functions; -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; -import com.google.common.collect.Sets; - -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanClause.Occur; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.BoostQuery; -import org.apache.lucene.search.MatchNoDocsQuery; -import org.apache.lucene.search.PhraseQuery; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.TermQuery; -import org.locationtech.spatial4j.shape.Point; -import org.locationtech.spatial4j.shape.Rectangle; 
-import org.locationtech.spatial4j.shape.impl.PointImpl; -import org.locationtech.spatial4j.shape.impl.RectangleImpl; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.decider.Decider; -import com.twitter.search.common.constants.QueryCacheConstants; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.encoding.features.ByteNormalizer; -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.query.BoostUtils; -import com.twitter.search.common.query.FieldWeightUtil; -import com.twitter.search.common.query.FilteredQuery; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.query.MappableField; -import com.twitter.search.common.schema.ImmutableSchema; -import com.twitter.search.common.schema.SchemaUtil; -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentUtil; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.search.termination.QueryTimeout; -import com.twitter.search.common.util.analysis.IntTermAttributeImpl; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.spatial.GeohashChunkImpl; -import com.twitter.search.common.util.text.HighFrequencyTermPairs; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.querycache.CachedFilterQuery; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.earlybird.search.queries.CSFDisjunctionFilter; -import com.twitter.search.earlybird.search.queries.DocValRangeFilter; -import com.twitter.search.earlybird.search.queries.FeatureValueInAcceptListOrUnsetFilter; -import com.twitter.search.earlybird.search.GeoQuadTreeQueryBuilder; -import com.twitter.search.earlybird.search.queries.MatchAllDocsQuery; -import com.twitter.search.earlybird.search.queries.RequiredStatusIDsFilter; -import com.twitter.search.earlybird.search.queries.SinceMaxIDFilter; -import com.twitter.search.earlybird.search.queries.SinceUntilFilter; -import com.twitter.search.earlybird.search.queries.TermQueryWithSafeToString; -import com.twitter.search.earlybird.search.queries.UserFlagsExcludeFilter; -import com.twitter.search.earlybird.search.queries.UserScrubGeoFilter; -import com.twitter.search.earlybird.search.queries.UserIdMultiSegmentQuery; -import com.twitter.search.earlybird.search.relevance.MinFeatureValueFilter; -import com.twitter.search.earlybird.search.relevance.ScoreFilterQuery; -import 
com.twitter.search.earlybird.search.relevance.scoring.ScoringFunctionProvider; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.QueryNodeUtils; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.annotation.FloatAnnotation; -import com.twitter.search.queryparser.query.search.Link; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.query.search.SearchQueryVisitor; -import com.twitter.search.queryparser.util.GeoCode; -import com.twitter.service.spiderduck.gen.LinkCategory; -import com.twitter.tweetypie.thriftjava.ComposerSource; - -/** - * Visitor for {@link com.twitter.search.queryparser.query.Query}, which produces a Lucene - * Query ({@link Query}). - */ -public class EarlybirdLuceneQueryVisitor extends SearchQueryVisitor { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdLuceneQueryVisitor.class); - - @VisibleForTesting - static final String UNSUPPORTED_OPERATOR_PREFIX = "unsupported_query_operator_"; - - private static final String SMILEY_FORMAT_STRING = "__has_%s_smiley"; - private static final String PHRASE_WILDCARD = "*"; - private static final float DEFAULT_FIELD_WEIGHT = 1.0f; - - private static final SearchCounter SINCE_TIME_INVALID_INT_COUNTER = - SearchCounter.export("EarlybirdLuceneQueryVisitor_since_time_invalid_int"); - private static final SearchCounter UNTIL_TIME_INVALID_INT_COUNTER = - SearchCounter.export("EarlybirdLuceneQueryVisitor_until_time_invalid_int"); - - private static final SearchCounter NUM_QUERIES_BELOW_MIN_ENGAGEMENT_THRESHOLD = - SearchCounter.export( - "EarlybirdLuceneQueryVisitor_num_queries_below_min_engagement_threshold"); - private static final SearchCounter NUM_QUERIES_ABOVE_MIN_ENGAGEMENT_THRESHOLD = - SearchCounter.export( - "EarlybirdLuceneQueryVisitor_num_queries_above_min_engagement_threshold"); - - private static final SearchOperator OPERATOR_CACHED_EXCLUDE_ANTISOCIAL_AND_NATIVERETWEETS = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_ANTISOCIAL_AND_NATIVERETWEETS); - - private static final Map> OPERATORS_BY_SAFE_EXCLUDE_OPERAND = - ImmutableMap.of( - SearchOperatorConstants.TWEET_SPAM, ImmutableList.of( - new SearchOperator(SearchOperator.Type.DOCVAL_RANGE_FILTER, - "extended_encoded_tweet_features.label_spam_flag", "0", "1"), - new SearchOperator(SearchOperator.Type.DOCVAL_RANGE_FILTER, - "extended_encoded_tweet_features.label_spam_hi_rcl_flag", "0", "1"), - new SearchOperator(SearchOperator.Type.DOCVAL_RANGE_FILTER, - "extended_encoded_tweet_features.label_dup_content_flag", "0", "1")), - - SearchOperatorConstants.TWEET_ABUSIVE, ImmutableList.of( - new SearchOperator(SearchOperator.Type.DOCVAL_RANGE_FILTER, - "extended_encoded_tweet_features.label_abusive_flag", "0", "1")), - - SearchOperatorConstants.TWEET_UNSAFE, ImmutableList.of( - new SearchOperator(SearchOperator.Type.DOCVAL_RANGE_FILTER, - "extended_encoded_tweet_features.label_nsfw_hi_prc_flag", "0", "1")) - ); - - private static final ImmutableMap DEFAULT_FIELDS = - 
ImmutableMap.of(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - new FieldWeightDefault(true, DEFAULT_FIELD_WEIGHT)); - - // All Earlybird fields that should have geo scrubbed tweets filtered out when searched. - // See go/realtime-geo-filtering - @VisibleForTesting - public static final List GEO_FIELDS_TO_BE_SCRUBBED = Arrays.asList( - EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), - EarlybirdFieldConstant.PLACE_FIELD.getFieldName(), - EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.PLACE_FULL_NAME_FIELD.getFieldName(), - EarlybirdFieldConstant.PLACE_COUNTRY_CODE_FIELD.getFieldName()); - - // Geo scrubbing doesn't remove user profile location, so when using the geo location type filters - // we only need to filter out geo scrubbed tweets for the geo location types other than - // ThriftGeoLocationSource.USER_PROFILE. - // Separately, we also need to filter out geo scrubbed tweets for the place_id filter. - private static final List GEO_FILTERS_TO_BE_SCRUBBED = Arrays.asList( - EarlybirdFieldConstants.formatGeoType(ThriftGeoLocationSource.GEOTAG), - EarlybirdFieldConstants.formatGeoType(ThriftGeoLocationSource.TWEET_TEXT), - EarlybirdThriftDocumentUtil.formatFilter( - EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName())); - - // queries whose parents are negated. - // used to decide if a negated query is within a negated parent or not. - private final Set parentNegatedQueries = - Sets.newIdentityHashSet(); - - private final ImmutableSchemaInterface schemaSnapshot; - private final ImmutableMap defaultFieldWeightMap; - private final QueryCacheManager queryCacheManager; - private final UserTable userTable; - private final UserScrubGeoMap userScrubGeoMap; - - @Nullable - private final TerminationTracker terminationTracker; - private final Map mappableFieldMap; - private final MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager; - private final Decider decider; - private final EarlybirdCluster earlybirdCluster; - - private float proximityPhraseWeight = 1.0f; - private int proximityPhraseSlop = 255; - private ImmutableMap enabledFieldWeightMap; - private Set queriedFields; - - // If we need to accumulate and collect per-field and per query node hit attribution information, - // this will have a mapping between the query nodes and their unique ranks, as well as the - // attribute collector. 
- @Nullable - private HitAttributeHelper hitAttributeHelper; - - @Nullable - private QueryTimeout queryTimeout; - - public EarlybirdLuceneQueryVisitor( - ImmutableSchemaInterface schemaSnapshot, - QueryCacheManager queryCacheManager, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - EarlybirdCluster earlybirdCluster, - Decider decider) { - this(schemaSnapshot, queryCacheManager, userTable, userScrubGeoMap, null, DEFAULT_FIELDS, - Collections.emptyMap(), null, decider, earlybirdCluster, null); - } - - public EarlybirdLuceneQueryVisitor( - ImmutableSchemaInterface schemaSnapshot, - QueryCacheManager queryCacheManager, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - @Nullable TerminationTracker terminationTracker, - Map fieldWeightMap, - Map mappableFieldMap, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - Decider decider, - EarlybirdCluster earlybirdCluster, - QueryTimeout queryTimeout) { - this.schemaSnapshot = schemaSnapshot; - this.defaultFieldWeightMap = ImmutableMap.copyOf(fieldWeightMap); - this.enabledFieldWeightMap = FieldWeightDefault.getOnlyEnabled(defaultFieldWeightMap); - this.queryCacheManager = queryCacheManager; - this.userTable = userTable; - this.userScrubGeoMap = userScrubGeoMap; - this.mappableFieldMap = Preconditions.checkNotNull(mappableFieldMap); - this.terminationTracker = terminationTracker; - this.multiSegmentTermDictionaryManager = multiSegmentTermDictionaryManager; - this.decider = decider; - this.earlybirdCluster = earlybirdCluster; - this.queryTimeout = queryTimeout; - this.queriedFields = new TreeSet<>(); - } - - public ImmutableMap getEnabledFieldWeightMap() { - return enabledFieldWeightMap; - } - - public ImmutableMap getDefaultFieldWeightMap() { - return defaultFieldWeightMap; - } - - public EarlybirdLuceneQueryVisitor setProximityPhraseWeight(float weight) { - this.proximityPhraseWeight = weight; - return this; - } - - public EarlybirdLuceneQueryVisitor setProximityPhraseSlop(int slop) { - this.proximityPhraseSlop = slop; - return this; - } - - public void setFieldHitAttributeHelper(HitAttributeHelper newHitAttributeHelper) { - this.hitAttributeHelper = newHitAttributeHelper; - } - - @Override - public final Query visit(Disjunction disjunction) throws QueryParserException { - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - List children = disjunction.getChildren(); - // Do a final round of check, if all nodes under a disjunction are MUST, - // treat them all as DEFAULT (SHOULD in Lucene). - boolean allMust = true; - for (com.twitter.search.queryparser.query.Query child : children) { - if (!child.mustOccur()) { - allMust = false; - break; - } - } - if (allMust) { - children = Lists.transform(children, QueryNodeUtils.MAKE_QUERY_DEFAULT); - } - // Actually converting all children now. - for (com.twitter.search.queryparser.query.Query child : children) { - final Query q = child.accept(this); - if (q != null) { - // if a node is marked with MUSTHAVE annotation, we set it to must even if it's a - // disjunction. 
- if (child.mustOccur()) { - bqBuilder.add(q, Occur.MUST); - } else { - bqBuilder.add(q, Occur.SHOULD); - } - } - } - - Query bq = bqBuilder.build(); - float boost = (float) getBoostFromAnnotations(disjunction.getAnnotations()); - if (boost >= 0) { - bq = BoostUtils.maybeWrapInBoostQuery(bq, boost); - } - return bq; - } - - @Override - public Query visit(Conjunction conjunction) throws QueryParserException { - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - List children = conjunction.getChildren(); - boolean hasPositiveTerms = false; - for (com.twitter.search.queryparser.query.Query child : children) { - boolean childMustNotOccur = child.mustNotOccur(); - boolean childAdded = addQuery(bqBuilder, child); - if (childAdded && !childMustNotOccur) { - hasPositiveTerms = true; - } - } - if (!children.isEmpty() && !hasPositiveTerms) { - bqBuilder.add(new MatchAllDocsQuery(), Occur.MUST); - } - - Query bq = bqBuilder.build(); - float boost = (float) getBoostFromAnnotations(conjunction.getAnnotations()); - if (boost >= 0) { - bq = BoostUtils.maybeWrapInBoostQuery(bq, boost); - } - return bq; - } - - @Override - public Query visit(Phrase phrase) throws QueryParserException { - return visit(phrase, false); - } - - @Override - public Query visit(Term term) throws QueryParserException { - return finalizeQuery(createTermQueryDisjunction(term), term); - } - - @Override - public Query visit(SpecialTerm special) throws QueryParserException { - String field; - - switch (special.getType()) { - case HASHTAG: - field = EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName(); - break; - case STOCK: - field = EarlybirdFieldConstant.STOCKS_FIELD.getFieldName(); - break; - case MENTION: - field = EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName(); - break; - default: - field = EarlybirdFieldConstant.TEXT_FIELD.getFieldName(); - } - - String termText = special.getSpecialChar() + special.getValue(); - Query q = createSimpleTermQuery(special, field, termText); - - float boost = (float) getBoostFromAnnotations(special.getAnnotations()); - if (boost >= 0) { - q = BoostUtils.maybeWrapInBoostQuery(q, boost); - } - - return negateQueryIfNodeNegated(special, q); - } - - @Override - public Query visit(Link link) throws QueryParserException { - Query q = createSimpleTermQuery( - link, EarlybirdFieldConstant.LINKS_FIELD.getFieldName(), link.getOperand()); - - float boost = (float) getBoostFromAnnotations(link.getAnnotations()); - if (boost >= 0) { - q = BoostUtils.maybeWrapInBoostQuery(q, boost); - } - - return negateQueryIfNodeNegated(link, q); - } - - @Override - public Query visit(final SearchOperator op) throws QueryParserException { - final Query query; - SearchOperator.Type type = op.getOperatorType(); - - switch (type) { - case TO: - query = visitToOperator(op); - break; - - case FROM: - query = visitFromOperator(op); - break; - - case FILTER: - query = visitFilterOperator(op); - break; - - case INCLUDE: - query = visitIncludeOperator(op); - break; - - case EXCLUDE: - query = visitExcludeOperator(op); - break; - - case LANG: - query = visitLangOperator(op); - break; - - case SOURCE: - query = visitSourceOperator(op); - break; - - case SMILEY: - query = visitSmileyOperator(op); - break; - - case DOCVAL_RANGE_FILTER: - query = visitDocValRangeFilterOperator(op); - break; - - case CACHED_FILTER: - query = visitCachedFilterOperator(op); - break; - - case SCORE_FILTER: - query = visitScoredFilterOperator(op); - break; - - case SINCE_TIME: - query = visitSinceTimeOperator(op); - break; - - case UNTIL_TIME: - 
query = visitUntilTimeOperator(op); - break; - - case SINCE_ID: - query = visitSinceIDOperator(op); - break; - - case MAX_ID: - query = visitMaxIDOperator(op); - break; - - case GEOLOCATION_TYPE: - query = visitGeoLocationTypeOperator(op); - break; - - case GEOCODE: - query = visitGeocodeOperator(op); - break; - - case GEO_BOUNDING_BOX: - query = visitGeoBoundingBoxOperator(op); - break; - - case PLACE: - query = visitPlaceOperator(op); - break; - - case LINK: - // This should never be called - the Link visitor (visitor(Link link)) should be. - query = visitLinkOperator(op); - break; - - case ENTITY_ID: - query = visitEntityIdOperator(op); - break; - - case FROM_USER_ID: - query = visitFromUserIDOperator(op); - break; - - case IN_REPLY_TO_TWEET_ID: - query = visitInReplyToTweetIdOperator(op); - break; - - case IN_REPLY_TO_USER_ID: - query = visitInReplyToUserIdOperator(op); - break; - - case LIKED_BY_USER_ID: - query = visitLikedByUserIdOperator(op); - break; - - case RETWEETED_BY_USER_ID: - query = visitRetweetedByUserIdOperator(op); - break; - - case REPLIED_TO_BY_USER_ID: - query = visitRepliedToByUserIdOperator(op); - break; - - case QUOTED_USER_ID: - query = visitQuotedUserIdOperator(op); - break; - - case QUOTED_TWEET_ID: - query = visitQuotedTweetIdOperator(op); - break; - - case DIRECTED_AT_USER_ID: - query = visitDirectedAtUserIdOperator(op); - break; - - case CONVERSATION_ID: - query = visitConversationIdOperator(op); - break; - - case COMPOSER_SOURCE: - query = visitComposerSourceOperator(op); - break; - - case RETWEETS_OF_TWEET_ID: - query = visitRetweetsOfTweetIdOperator(op); - break; - - case RETWEETS_OF_USER_ID: - query = visitRetweetsOfUserIdOperator(op); - break; - - case LINK_CATEGORY: - query = visitLinkCategoryOperator(op); - break; - - case CARD_NAME: - query = visitCardNameOperator(op); - break; - - case CARD_DOMAIN: - query = visitCardDomainOperator(op); - break; - - case CARD_LANG: - query = visitCardLangOperator(op); - break; - - case HF_TERM_PAIR: - query = visitHFTermPairOperator(op); - break; - - case HF_PHRASE_PAIR: - query = visitHFTermPhrasePairOperator(op); - break; - - case PROXIMITY_GROUP: - Phrase phrase = new Phrase( - Lists.transform(op.getOperands(), - s -> NormalizerHelper.normalizeWithUnknownLocale( - s, EarlybirdConfig.getPenguinVersion()))); - - query = visit(phrase, true); - break; - - case MULTI_TERM_DISJUNCTION: - query = visitMultiTermDisjunction(op); - break; - - case CSF_DISJUNCTION_FILTER: - query = visitCSFDisjunctionFilter(op); - break; - - case SAFETY_EXCLUDE: - query = visitSafetyExclude(op); - break; - - case SPACE_ID: - query = visitSpaceId(op); - break; - - case NAMED_ENTITY: - query = visitNamedEntity(op); - break; - - case NAMED_ENTITY_WITH_TYPE: - query = visitNamedEntityWithType(op); - break; - - case MIN_FAVES: - case MIN_QUALITY_SCORE: - case MIN_REPLIES: - case MIN_RETWEETS: - case MIN_REPUTATION: - query = visitMinFeatureValueOperator(type, op); - break; - - case FEATURE_VALUE_IN_ACCEPT_LIST_OR_UNSET: - query = visitFeatureValueInAcceptListOrUnsetFilterOperator(op); - break; - - case NEAR: - case RELATED_TO_TWEET_ID: - case SINCE: - case SITE: - case UNTIL: - case WITHIN: - case WITHIN_TIME: - query = createUnsupportedOperatorQuery(op); - break; - - case NAMED_CSF_DISJUNCTION_FILTER: - case NAMED_MULTI_TERM_DISJUNCTION: - query = logAndThrowQueryParserException( - "Named disjunction operator could not be converted to a disjunction operator."); - break; - - default: - query = logAndThrowQueryParserException("Unknown operator " 
+ op.toString()); - } - - return negateQueryIfNodeNegated(op, query); - } - - protected Query visitToOperator(SearchOperator op) throws QueryParserException { - return createNormalizedTermQuery( - op, EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitFromOperator(SearchOperator op) throws QueryParserException { - return createNormalizedTermQuery( - op, EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitFilterOperator(SearchOperator op) throws QueryParserException { - return visitFilterOperator(op, false); - } - - protected Query visitIncludeOperator(SearchOperator op) throws QueryParserException { - // Include is a bit funny. If we have [include retweets] we are saying - // do include retweets, which is the default. Also conjunctions re-negate - // whatever node we emit from the visitor. - if (!isParentNegated(op) && !nodeIsNegated(op)) { - // positive include - no-op. - return null; - } - return visitFilterOperator(op, false); - } - - protected Query visitExcludeOperator(SearchOperator op) throws QueryParserException { - // Exclude is a bit funny. If we have -[exclude retweets] we are saying - // dont exclude retweets, which is the default. - if (isParentNegated(op) || nodeIsNegated(op)) { - // Negative exclude. Do nothing - parent will not add this to the list of children. - return null; - } else { - // Positive exclude. - return visitFilterOperator(op, true); - } - } - - protected Query visitFilterOperator(SearchOperator op, boolean negate) - throws QueryParserException { - Query q; - boolean negateQuery = negate; - - if (op.getOperand().equals(SearchOperatorConstants.ANTISOCIAL)) { - // Since the object we use to implement these filters is actually an - // EXCLUDE filter, we need to negate it to get it to work as a regular filter. - q = UserFlagsExcludeFilter.getUserFlagsExcludeFilter(userTable, true, false, false); - negateQuery = !negateQuery; - } else if (op.getOperand().equals(SearchOperatorConstants.OFFENSIVE_USER)) { - q = UserFlagsExcludeFilter.getUserFlagsExcludeFilter(userTable, false, true, false); - negateQuery = !negateQuery; - } else if (op.getOperand().equals(SearchOperatorConstants.ANTISOCIAL_OFFENSIVE_USER)) { - q = UserFlagsExcludeFilter.getUserFlagsExcludeFilter(userTable, true, true, false); - negateQuery = !negateQuery; - } else if (op.getOperand().equals(SearchOperatorConstants.PROTECTED)) { - q = UserFlagsExcludeFilter.getUserFlagsExcludeFilter(userTable, false, false, true); - negateQuery = !negateQuery; - } else if (op.getOperand().equals(SearchOperatorConstants.HAS_ENGAGEMENT)) { - return buildHasEngagementsQuery(); - } else if (op.getOperand().equals(SearchOperatorConstants.SAFE_SEARCH_FILTER)) { - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - bqBuilder.add( - createNoScoreTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.IS_OFFENSIVE), - Occur.SHOULD); - - // The following internal field __filter_sensitive_content - // is not currently built by earlybird. 
- // This means the safe search filter solely operates on the is_offensive bit - bqBuilder.add( - createNoScoreTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdThriftDocumentUtil.formatFilter(SearchOperatorConstants.SENSITIVE_CONTENT)), - Occur.SHOULD); - q = bqBuilder.build(); - negateQuery = !negateQuery; - } else if (op.getOperand().equals(SearchOperatorConstants.RETWEETS)) { - // Special case for filter:retweets - we use the text field search "-rt" - // mostly for legacy reasons. - q = createSimpleTermQuery( - op, - EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), - EarlybirdThriftDocumentBuilder.RETWEET_TERM); - } else if (schemaSnapshot.getFacetFieldByFacetName(op.getOperand()) != null) { - Schema.FieldInfo facetField = schemaSnapshot.getFacetFieldByFacetName(op.getOperand()); - if (facetField.getFieldType().isStoreFacetSkiplist()) { - q = createSimpleTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstant.getFacetSkipFieldName(facetField.getName())); - } else { - // return an empty BQ that doesn't match anything - q = new BooleanQuery.Builder().build(); - } - } else if (op.getOperand().equals(SearchOperatorConstants.VINE_LINK)) { - // Temporary special case for filter:vine_link. The filter is called "vine_link", but it - // should use the internal field "__filter_vine". We need this special case because otherwise - // it would look for the non-existing "__filter_vine_link" field. See SEARCH-9390 - q = createNoScoreTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdThriftDocumentUtil.formatFilter("vine")); - } else { - // The default vanilla filters just use the filter format string and the - // operand text. - q = createNoScoreTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdThriftDocumentUtil.formatFilter(op.getOperand())); - } - // Double check: no filters should have any score contribution. - q = new BoostQuery(q, 0.0f); - return negateQuery ? negateQuery(q) : q; - } - - private Query buildHasEngagementsQuery() { - if (earlybirdCluster == EarlybirdCluster.PROTECTED) { - // Engagements and engagement counts are not indexed on Protected Earlybirds, so there is no need to - // traverse the entire segment with the MinFeatureValueFilter.
See SEARCH-28120 - return new MatchNoDocsQuery(); - } - - Query favFilter = MinFeatureValueFilter.getMinFeatureValueFilter( - EarlybirdFieldConstant.FAVORITE_COUNT.getFieldName(), 1); - Query retweetFilter = MinFeatureValueFilter.getMinFeatureValueFilter( - EarlybirdFieldConstant.RETWEET_COUNT.getFieldName(), 1); - Query replyFilter = MinFeatureValueFilter.getMinFeatureValueFilter( - EarlybirdFieldConstant.REPLY_COUNT.getFieldName(), 1); - return new BooleanQuery.Builder() - .add(favFilter, Occur.SHOULD) - .add(retweetFilter, Occur.SHOULD) - .add(replyFilter, Occur.SHOULD) - .build(); - } - - protected Query visitLangOperator(SearchOperator op) throws QueryParserException { - return createNoScoreTermQuery( - op, EarlybirdFieldConstant.ISO_LANGUAGE_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitSourceOperator(SearchOperator op) throws QueryParserException { - return createNoScoreTermQuery( - op, EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitSmileyOperator(SearchOperator op) throws QueryParserException { - return createSimpleTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - String.format(SMILEY_FORMAT_STRING, op.getOperand())); - } - - protected Query visitDocValRangeFilterOperator(SearchOperator op) throws QueryParserException { - String csfFieldName = op.getOperands().get(0).toLowerCase(); - - ThriftCSFType csfFieldType = schemaSnapshot.getCSFFieldType(csfFieldName); - if (csfFieldType == null) { - throw new QueryParserException("invalid csf field name " + op.getOperands().get(0) - + " used in " + op.serialize()); - } - - try { - if (csfFieldType == ThriftCSFType.DOUBLE - || csfFieldType == ThriftCSFType.FLOAT) { - return DocValRangeFilter.getDocValRangeQuery(csfFieldName, csfFieldType, - Double.parseDouble(op.getOperands().get(1)), - Double.parseDouble(op.getOperands().get(2))); - } else if (csfFieldType == ThriftCSFType.LONG - || csfFieldType == ThriftCSFType.INT - || csfFieldType == ThriftCSFType.BYTE) { - Query query = DocValRangeFilter.getDocValRangeQuery(csfFieldName, csfFieldType, - Long.parseLong(op.getOperands().get(1)), - Long.parseLong(op.getOperands().get(2))); - if (csfFieldName.equals(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName())) { - return wrapQueryInUserScrubGeoFilter(query); - } - return query; - } else { - throw new QueryParserException("invalid ThriftCSFType. 
drop this op: " + op.serialize()); - } - } catch (NumberFormatException e) { - throw new QueryParserException("invalid range numeric type used in " + op.serialize()); - } - } - - protected final Query visitCachedFilterOperator(SearchOperator op) throws QueryParserException { - try { - return CachedFilterQuery.getCachedFilterQuery(op.getOperand(), queryCacheManager); - } catch (CachedFilterQuery.NoSuchFilterException e) { - throw new QueryParserException(e.getMessage(), e); - } - } - - protected final Query visitScoredFilterOperator(SearchOperator op) throws QueryParserException { - final List operands = op.getOperands(); - final String scoreFunction = operands.get(0); - ScoringFunctionProvider.NamedScoringFunctionProvider scoringFunctionProvider = - ScoringFunctionProvider.getScoringFunctionProviderByName(scoreFunction, schemaSnapshot); - if (scoringFunctionProvider == null) { - throw new QueryParserException("Unknown scoring function name [" + scoreFunction - + " ] used as score_filter's operand"); - } - - return ScoreFilterQuery.getScoreFilterQuery( - schemaSnapshot, - scoringFunctionProvider, - Float.parseFloat(operands.get(1)), - Float.parseFloat(operands.get(2))); - } - - protected Query visitSinceTimeOperator(SearchOperator op) { - try { - return SinceUntilFilter.getSinceQuery(Integer.parseInt(op.getOperand())); - } catch (NumberFormatException e) { - LOG.warn("since time is not a valid integer, the date isn't reasonable. drop this op: " - + op.serialize()); - SINCE_TIME_INVALID_INT_COUNTER.increment(); - return null; - } - } - - protected Query visitUntilTimeOperator(SearchOperator op) { - try { - return SinceUntilFilter.getUntilQuery(Integer.parseInt(op.getOperand())); - } catch (NumberFormatException e) { - LOG.warn("until time is not a valid integer, the date isn't reasonable. 
drop this op: " - + op.serialize()); - UNTIL_TIME_INVALID_INT_COUNTER.increment(); - return null; - } - } - - protected Query visitSinceIDOperator(SearchOperator op) { - long id = Long.parseLong(op.getOperand()); - return SinceMaxIDFilter.getSinceIDQuery(id); - } - - protected Query visitMaxIDOperator(SearchOperator op) { - long id = Long.parseLong(op.getOperand()); - return SinceMaxIDFilter.getMaxIDQuery(id); - } - - protected Query visitGeoLocationTypeOperator(SearchOperator op) throws QueryParserException { - String operand = op.getOperand(); - ThriftGeoLocationSource source = ThriftGeoLocationSource.valueOf(operand.toUpperCase()); - // If necessary, this query will be wrapped by the UserScrubGeoFilter within - // the createSimpleTermQuery() helper method - return createNoScoreTermQuery( - op, - EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), - EarlybirdFieldConstants.formatGeoType(source)); - } - - protected Query visitGeocodeOperator(SearchOperator op) throws QueryParserException { - return visitGeocodeOrGeocodePrivateOperator(op); - } - - protected Query visitGeoBoundingBoxOperator(SearchOperator op) throws QueryParserException { - Rectangle rectangle = boundingBoxFromSearchOperator(op); - return wrapQueryInUserScrubGeoFilter( - GeoQuadTreeQueryBuilder.buildGeoQuadTreeQuery(rectangle, terminationTracker)); - } - - protected Query visitPlaceOperator(SearchOperator op) throws QueryParserException { - // This query will be wrapped by the UserScrubGeoFilter within the createSimpleTermQuery() - // helper method - return createSimpleTermQuery( - op, EarlybirdFieldConstant.PLACE_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitLinkOperator(SearchOperator op) throws QueryParserException { - // This should never be called - the Link visitor (visitor(Link link)) should be. 
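- // Defensive handling in case it is reached anyway: if the operator really is a Link node, - // delegate to the Link visitor below; otherwise the operator type and the node class disagree, - // which indicates a bug on the query construction side, so fail the parse.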
- if (op instanceof Link) { - LOG.warn("Unexpected Link operator " + op.serialize()); - return visit((Link) op); - } else { - throw new QueryParserException("Operator type set to " + op.getOperatorName() - + " but it is not an instance of Link [" + op.toString() + "]"); - } - } - - protected Query visitEntityIdOperator(SearchOperator op) throws QueryParserException { - return createSimpleTermQuery( - op, EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitFromUserIDOperator(SearchOperator op) { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName()); - } - - protected Query visitInReplyToTweetIdOperator(SearchOperator op) { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName()); - } - - protected Query visitInReplyToUserIdOperator(SearchOperator op) { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName()); - } - - protected Query visitLikedByUserIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName()); - } - - protected Query visitRetweetedByUserIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.RETWEETED_BY_USER_ID.getFieldName()); - } - - protected Query visitRepliedToByUserIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID.getFieldName()); - } - - protected Query visitQuotedUserIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName()); - } - - protected Query visitQuotedTweetIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName()); - } - - protected Query visitDirectedAtUserIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery(op, - EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName()); - } - - protected Query visitConversationIdOperator(SearchOperator op) throws QueryParserException { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName()); - } - - protected Query visitComposerSourceOperator(SearchOperator op) throws QueryParserException { - Preconditions.checkNotNull(op.getOperand(), "composer_source requires operand"); - try { - ComposerSource composerSource = ComposerSource.valueOf(op.getOperand().toUpperCase()); - return buildNoScoreIntTermQuery( - op, EarlybirdFieldConstant.COMPOSER_SOURCE, composerSource.getValue()); - } catch (IllegalArgumentException e) { - throw new QueryParserException("Invalid operand for composer_source: " + op.getOperand(), e); - } - } - - protected Query visitRetweetsOfTweetIdOperator(SearchOperator op) { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.RETWEET_SOURCE_TWEET_ID_FIELD.getFieldName()); - } - - protected Query visitRetweetsOfUserIdOperator(SearchOperator op) { - return buildLongTermAttributeQuery( - op, EarlybirdFieldConstant.RETWEET_SOURCE_USER_ID_FIELD.getFieldName()); - } - - protected Query visitLinkCategoryOperator(SearchOperator op) { - int linkCategory; - try { - linkCategory = LinkCategory.valueOf(op.getOperand()).getValue(); - } 
catch (IllegalArgumentException e) { - linkCategory = Integer.parseInt(op.getOperand()); - } - - String fieldName = EarlybirdFieldConstant.LINK_CATEGORY_FIELD.getFieldName(); - org.apache.lucene.index.Term term = new org.apache.lucene.index.Term( - fieldName, IntTermAttributeImpl.copyIntoNewBytesRef(linkCategory)); - return wrapQuery( - new TermQueryWithSafeToString(term, Integer.toString(linkCategory)), op, fieldName); - } - - protected Query visitCardNameOperator(SearchOperator op) throws QueryParserException { - return createNoScoreTermQuery( - op, EarlybirdFieldConstant.CARD_NAME_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitCardDomainOperator(SearchOperator op) throws QueryParserException { - return createNoScoreTermQuery( - op, EarlybirdFieldConstant.CARD_DOMAIN_FIELD.getFieldName(), op.getOperand()); - } - - protected Query visitCardLangOperator(SearchOperator op) throws QueryParserException { - return createNoScoreTermQuery( - op, EarlybirdFieldConstant.CARD_LANG.getFieldName(), op.getOperand()); - } - - protected Query visitHFTermPairOperator(SearchOperator op) throws QueryParserException { - final List operands = op.getOperands(); - String termPair = HighFrequencyTermPairs.createPair(op.getOperands().get(0), - op.getOperands().get(1)); - Query q = createSimpleTermQuery(op, ImmutableSchema.HF_TERM_PAIRS_FIELD, termPair); - float boost = Float.parseFloat(operands.get(2)); - if (boost >= 0) { - q = BoostUtils.maybeWrapInBoostQuery(q, boost); - } - return q; - } - - protected Query visitHFTermPhrasePairOperator(SearchOperator op) throws QueryParserException { - final List operands = op.getOperands(); - String termPair = HighFrequencyTermPairs.createPhrasePair(op.getOperands().get(0), - op.getOperands().get(1)); - Query q = createSimpleTermQuery(op, ImmutableSchema.HF_PHRASE_PAIRS_FIELD, termPair); - float boost = Float.parseFloat(operands.get(2)); - if (boost >= 0) { - q = BoostUtils.maybeWrapInBoostQuery(q, boost); - } - return q; - } - - private Query logAndThrowQueryParserException(String message) throws QueryParserException { - LOG.error(message); - throw new QueryParserException(message); - } - - private Query logMissingEntriesAndThrowQueryParserException(String field, SearchOperator op) - throws QueryParserException { - return logAndThrowQueryParserException( - String.format("Missing required %s entries for %s", field, op.serialize())); - } - - // A previous implementation of this operator allowed insertion of - // operands from the thrift search query. This was reverted to ensure simplicity - // of the API, and to keep the serialized query self-contained. - protected final Query visitMultiTermDisjunction(SearchOperator op) throws QueryParserException { - final List operands = op.getOperands(); - final String field = operands.get(0); - - if (isUserIdField(field)) { - List ids = Lists.newArrayList(); - parseLongArgs(operands.subList(1, operands.size()), ids, op); - if (ids.size() > 0) { - // Try to get ranks for the ids from hitAttributeHelper, if they exist. - // Otherwise just pass in an empty list.
- List ranks; - if (hitAttributeHelper != null - && hitAttributeHelper.getExpandedNodeToRankMap().containsKey(op)) { - ranks = hitAttributeHelper.getExpandedNodeToRankMap().get(op); - } else { - ranks = Lists.newArrayList(); - } - return UserIdMultiSegmentQuery.createIdDisjunctionQuery( - "multi_term_disjunction_" + field, - ids, - field, - schemaSnapshot, - multiSegmentTermDictionaryManager, - decider, - earlybirdCluster, - ranks, - hitAttributeHelper, - queryTimeout); - } else { - return logMissingEntriesAndThrowQueryParserException(field, op); - } - } else if (EarlybirdFieldConstant.ID_FIELD.getFieldName().equals(field)) { - List ids = Lists.newArrayList(); - parseLongArgs(operands.subList(1, operands.size()), ids, op); - if (ids.size() > 0) { - return RequiredStatusIDsFilter.getRequiredStatusIDsQuery(ids); - } else { - return logMissingEntriesAndThrowQueryParserException(field, op); - } - } else if (isTweetIdField(field)) { - List ids = Lists.newArrayList(); - parseLongArgs(operands.subList(1, operands.size()), ids, op); - if (ids.size() > 0) { - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - int numClauses = 0; - for (long id : ids) { - if (numClauses >= BooleanQuery.getMaxClauseCount()) { - BooleanQuery saved = bqBuilder.build(); - bqBuilder = new BooleanQuery.Builder(); - bqBuilder.add(saved, BooleanClause.Occur.SHOULD); - numClauses = 1; - } - bqBuilder.add(buildLongTermAttributeQuery(op, field, id), Occur.SHOULD); - ++numClauses; - } - return bqBuilder.build(); - } else { - return logMissingEntriesAndThrowQueryParserException(field, op); - } - } else { - return createUnsupportedOperatorQuery(op); - } - } - - protected final Query visitCSFDisjunctionFilter(SearchOperator op) - throws QueryParserException { - List operands = op.getOperands(); - String field = operands.get(0); - - ThriftCSFType csfType = schemaSnapshot.getCSFFieldType(field); - if (csfType == null) { - throw new QueryParserException("Field must be a CSF"); - } - - if (csfType != ThriftCSFType.LONG) { - throw new QueryParserException("csf_disjunction_filter only works with long fields"); - } - - Set values = new HashSet<>(); - parseLongArgs(operands.subList(1, operands.size()), values, op); - - Query query = CSFDisjunctionFilter.getCSFDisjunctionFilter(field, values); - if (field.equals(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName())) { - return wrapQueryInUserScrubGeoFilter(query); - } - return query; - } - - protected Query visitSafetyExclude(SearchOperator op) throws QueryParserException { - // We do not allow negating safety_exclude operator. Note the operator is internal so if we - // get here, it means there's a bug in the query construction side. - if (isParentNegated(op) || nodeIsNegated(op)) { - throw new QueryParserException("Negating safety_exclude operator is not allowed: " + op); - } - - // Convert the safety filter to other operators depending on cluster setting - // The safety filter is interpreted differently on archive because the underlying safety labels - // in extended encoded field are not available on archive. 
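- // For example, a SearchOperatorConstants.TWEET_SPAM operand expands on non-archive clusters into the - // docval_range_filter operators listed in OPERATORS_BY_SAFE_EXCLUDE_OPERAND (visited together as a - // single Conjunction), while on archive clusters the whole operator falls back to the cached filter - // that excludes antisocial tweets and native retweets.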
- if (EarlybirdCluster.isArchive(earlybirdCluster)) { - return visit(OPERATOR_CACHED_EXCLUDE_ANTISOCIAL_AND_NATIVERETWEETS); - } else { - List children = Lists.newArrayList(); - for (String filterName : op.getOperands()) { - children.addAll( - OPERATORS_BY_SAFE_EXCLUDE_OPERAND.getOrDefault(filterName, ImmutableList.of())); - } - return visit(new Conjunction(children)); - } - } - - protected Query visitNamedEntity(SearchOperator op) throws QueryParserException { - List operands = op.getOperands(); - Preconditions.checkState(operands.size() == 1, - "named_entity: wrong number of operands"); - - return createDisjunction( - operands.get(0).toLowerCase(), - op, - EarlybirdFieldConstant.NAMED_ENTITY_FROM_TEXT_FIELD, - EarlybirdFieldConstant.NAMED_ENTITY_FROM_URL_FIELD); - } - - protected Query visitSpaceId(SearchOperator op) throws QueryParserException { - List operands = op.getOperands(); - Preconditions.checkState(operands.size() == 1, - "space_id: wrong number of operands"); - - return createSimpleTermQuery( - op, - EarlybirdFieldConstant.SPACE_ID_FIELD.getFieldName(), - op.getOperand() - ); - } - - protected Query visitNamedEntityWithType(SearchOperator op) throws QueryParserException { - List operands = op.getOperands(); - Preconditions.checkState(operands.size() == 2, - "named_entity_with_type: wrong number of operands"); - - String name = operands.get(0); - String type = operands.get(1); - return createDisjunction( - String.format("%s:%s", name, type).toLowerCase(), - op, - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD, - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD); - } - - // Create a disjunction query for a given value in one of the given fields - private Query createDisjunction( - String value, SearchOperator operator, EarlybirdFieldConstant... fields) - throws QueryParserException { - BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder(); - for (EarlybirdFieldConstant field : fields) { - booleanQueryBuilder.add( - createSimpleTermQuery(operator, field.getFieldName(), value), Occur.SHOULD); - } - return booleanQueryBuilder.build(); - } - - protected Query visitMinFeatureValueOperator(SearchOperator.Type type, SearchOperator op) { - final List operands = op.getOperands(); - - String featureName; - switch (type) { - case MIN_FAVES: - featureName = EarlybirdFieldConstant.FAVORITE_COUNT.getFieldName(); - break; - case MIN_QUALITY_SCORE: - featureName = EarlybirdFieldConstant.PARUS_SCORE.getFieldName(); - break; - case MIN_REPLIES: - featureName = EarlybirdFieldConstant.REPLY_COUNT.getFieldName(); - break; - case MIN_REPUTATION: - featureName = EarlybirdFieldConstant.USER_REPUTATION.getFieldName(); - break; - case MIN_RETWEETS: - featureName = EarlybirdFieldConstant.RETWEET_COUNT.getFieldName(); - break; - default: - throw new IllegalArgumentException("Unknown min feature type " + type); - } - - double operand = Double.parseDouble(operands.get(0)); - - // SEARCH-16751: Because we use QueryCacheConstants.HAS_ENGAGEMENT as a driving query below, we - // won't return tweets with 0 engagements when we handle a query with a [min_X 0] filter (e.g. - // (* cat [min_faves 0] ). Thus we need to return a MatchAllDocsQuery in that case. - if (operand == 0) { - return new MatchAllDocsQuery(); - } - - // Only perform the rewrite if the operator is a min engagement operator. 
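- // (i.e. min_faves, min_retweets or min_replies; min_reputation and min_quality_score take the - // cached-filter / MinFeatureValueFilter paths below instead).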
- if (isOperatorTypeEngagementFilter(type)) { - return buildQueryForEngagementOperator(op, operands, featureName); - } - - if (type == SearchOperator.Type.MIN_REPUTATION) { - return buildQueryForMinReputationOperator(operands, featureName); - } - - return MinFeatureValueFilter.getMinFeatureValueFilter( - featureName, Double.parseDouble(operands.get(0))); - } - - protected Query visitFeatureValueInAcceptListOrUnsetFilterOperator(SearchOperator op) - throws QueryParserException { - final List operands = op.getOperands(); - final String field = operands.get(0); - - if (isIdCSFField(field)) { - Set ids = Sets.newHashSet(); - parseLongArgs(operands.subList(1, operands.size()), ids, op); - return FeatureValueInAcceptListOrUnsetFilter.getFeatureValueInAcceptListOrUnsetFilter( - field, ids); - } else { - return logAndThrowQueryParserException( - "Invalid CSF field passed to operator " + op.toString()); - } - } - - /** - * Creates a Lucene query for an operator that's not supported by the search service. - * - * NOTE: Developer, if you are writing a class that extends this class, make sure the - * behaviour of this function makes sense for your search service. - * - * @param op The operator that's not supported by the search service. - * @return The Lucene query for this operator - */ - protected Query createUnsupportedOperatorQuery(SearchOperator op) throws QueryParserException { - SearchCounter - .export(UNSUPPORTED_OPERATOR_PREFIX + op.getOperatorType().getOperatorName()) - .increment(); - return visit(op.toPhrase()); - } - - private Query buildNoScoreIntTermQuery( - SearchOperator op, - EarlybirdFieldConstant field, - int termValue) { - org.apache.lucene.index.Term term = new org.apache.lucene.index.Term( - field.getFieldName(), IntTermAttributeImpl.copyIntoNewBytesRef(termValue)); - return wrapQuery( - new TermQueryWithSafeToString(term, Integer.toString(termValue)), op, field.getFieldName()); - } - - private Query buildQueryForMinReputationOperator(List operands, String featureName) { - int operand = (int) Double.parseDouble(operands.get(0)); - // Driving by MinFeatureValueFilter's DocIdSetIterator is very slow, because we have to - // perform an expensive check for all doc IDs in the segment, so we use a cached result to - // drive the query, and use MinFeatureValueFilter as a secondary filter. - String queryCacheFilterName; - if (operand >= 50) { - queryCacheFilterName = QueryCacheConstants.MIN_REPUTATION_50; - } else if (operand >= 36) { - queryCacheFilterName = QueryCacheConstants.MIN_REPUTATION_36; - } else if (operand >= 30) { - queryCacheFilterName = QueryCacheConstants.MIN_REPUTATION_30; - } else { - return MinFeatureValueFilter.getMinFeatureValueFilter(featureName, operand); - } - - try { - Query drivingQuery = CachedFilterQuery.getCachedFilterQuery( - queryCacheFilterName, queryCacheManager); - return new FilteredQuery( - drivingQuery, MinFeatureValueFilter.getDocIdFilterFactory(featureName, operand)); - } catch (Exception e) { - // If the filter is not found, that's OK, it might be our first time running the query cache, - // or there may be no tweets with that high a reputation. - return MinFeatureValueFilter.getMinFeatureValueFilter(featureName, operand); - } - } - - private Query buildQueryForEngagementOperator( - SearchOperator op, List operands, String featureName) { - // Engagements and engagement counts are not indexed on Protected Earlybirds, so there is no - // need to traverse the entire segment with the MinFeatureValueFilter.
SEARCH-28120 - if (earlybirdCluster == EarlybirdCluster.PROTECTED) { - return new MatchNoDocsQuery(); - } - - EarlybirdFieldConstant field = - EarlybirdFieldConstants.CSF_NAME_TO_MIN_ENGAGEMENT_FIELD_MAP.get(featureName); - if (field == null) { - throw new IllegalArgumentException(String.format("Expected the feature to be " - + "FAVORITE_COUNT, REPLY_COUNT, or RETWEET_COUNT. Got %s.", featureName)); - } - int operand = (int) Double.parseDouble(operands.get(0)); - ByteNormalizer normalizer = MinFeatureValueFilter.getMinFeatureValueNormalizer(featureName); - int minValue = normalizer.unsignedByteToInt(normalizer.normalize(operand)); - - // We default to the old behavior of filtering posts instead of consulting the min engagement - // field if the operand is less than some threshold value because it seems, empirically, that - // the old method results in lower query latencies for lower values of the filter operand. - // This threshold can be controlled by the "use_min_engagement_field_threshold" decider. The - // current default value is 90. SEARCH-16102 - int useMinEngagementFieldThreshold = decider.getAvailability( - "use_min_engagement_field_threshold").getOrElse(() -> 0); - if (operand >= useMinEngagementFieldThreshold) { - NUM_QUERIES_ABOVE_MIN_ENGAGEMENT_THRESHOLD.increment(); - } else { - NUM_QUERIES_BELOW_MIN_ENGAGEMENT_THRESHOLD.increment(); - } - if (schemaHasField(field) && operand >= useMinEngagementFieldThreshold) { - return buildNoScoreIntTermQuery(op, field, minValue); - } - // Driving by MinFeatureValueFilter's DocIdSetIterator is very slow, because we have to - // perform an expensive check for all doc IDs in the segment, so we use a cached result to - // drive the query, and use MinFeatureValueFilter as a secondary filter. - try { - Query drivingQuery = minEngagmentsDrivingQuery(op, operand); - return new FilteredQuery( - drivingQuery, MinFeatureValueFilter.getDocIdFilterFactory(featureName, operand)); - } catch (Exception e) { - // If the filter is not found, that's OK, it might be our first time running the query cache, - // or there may be no Tweets with that many engagements (we would only expect this in tests). - return MinFeatureValueFilter.getMinFeatureValueFilter(featureName, operand); - } - } - - private Query minEngagmentsDrivingQuery(SearchOperator operator, int minValue) - throws CachedFilterQuery.NoSuchFilterException, QueryParserException { - // If the min engagements value is large, then many of the hits that have engagement will still - // not match the query, leading to extremely slow queries. Therefore, if there is more than 100 - // engagements, we drive by a more restricted filter. 
See SEARCH-33740 - String filter; - if (minValue < 100) { - filter = QueryCacheConstants.HAS_ENGAGEMENT; - } else if (operator.getOperatorType() == SearchOperator.Type.MIN_FAVES) { - filter = QueryCacheConstants.MIN_FAVES_100; - } else if (operator.getOperatorType() == SearchOperator.Type.MIN_REPLIES) { - filter = QueryCacheConstants.MIN_REPLIES_100; - } else if (operator.getOperatorType() == SearchOperator.Type.MIN_RETWEETS) { - filter = QueryCacheConstants.MIN_RETWEETS_100; - } else { - throw new QueryParserException("Missing engagement filter."); - } - return CachedFilterQuery.getCachedFilterQuery(filter, queryCacheManager); - } - - private boolean isOperatorTypeEngagementFilter(SearchOperator.Type type) { - return type == SearchOperator.Type.MIN_FAVES - || type == SearchOperator.Type.MIN_RETWEETS - || type == SearchOperator.Type.MIN_REPLIES; - } - - private boolean schemaHasField(EarlybirdFieldConstant field) { - return schemaSnapshot.hasField(field.getFieldId()); - } - - // Helper functions - private Query createSimpleTermQuery( - com.twitter.search.queryparser.query.Query node, String field, String text) - throws QueryParserException { - Query baseQuery = new TermQuery(createTerm(field, text)); - if (isGeoFieldThatShouldBeScrubbed(field, text)) { - baseQuery = wrapQueryInUserScrubGeoFilter(baseQuery); - } - return wrapQuery(baseQuery, node, field); - } - - private boolean isGeoFieldThatShouldBeScrubbed(String field, String text) { - if (field.equals(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName())) { - // the internal field is used for the place id filter and the geo location type filters, some - // of which should be scrubbed - return GEO_FILTERS_TO_BE_SCRUBBED.contains(text); - } - return GEO_FIELDS_TO_BE_SCRUBBED.contains(field); - } - - // Like above, but sets boost to 0 to disable the scoring component. This should be used - // for filters that do not impact scoring (such as filter:images). - private Query createNoScoreTermQuery(com.twitter.search.queryparser.query.Query node, - String field, String text) - throws QueryParserException { - Query query = createSimpleTermQuery(node, field, text); - return new BoostQuery(query, 0.0f); // No score contribution. - } - - private Query createNormalizedTermQuery(com.twitter.search.queryparser.query.Query node, - String field, String text) - throws QueryParserException { - return createSimpleTermQuery( - node, - field, - NormalizerHelper.normalizeWithUnknownLocale(text, EarlybirdConfig.getPenguinVersion())); - } - - /** - * Get the boost from the annotation list of a query node. - * Right now this is very simple, we simply extract the value of some annotations and ignore all - * others. Also, if there are multiple annotations that have values, we only use the first one we - * see in the list (although the rewritten query EB receives should already ensure this). - * NOTE: we use simple weight selection logic here based on the assumption that the annotator - * and rewriter will not produce ambiguous weight information. There should always be only one - * weight-bearing annotation for a specific node. - * - * @param annotations The list of annotations of the query node. - * @return The boost for this query node, or -1 if there is no boost, in which case you shouldn't - * apply it at all.
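- * For example, a node carrying a WEIGHT annotation with value 2.0 yields a boost of 2.0, which - * callers then fold into the generated Lucene query via BoostUtils.maybeWrapInBoostQuery.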
- */ - private static double getBoostFromAnnotations(List annotations) { - if (annotations != null) { - for (Annotation anno : annotations) { - switch (anno.getType()) { - case VARIANT: - case SPELLING: - case WEIGHT: - case OPTIONAL: - return ((FloatAnnotation) anno).getValue(); - default: - } - } - } - return -1; - } - - private static double getPhraseProximityFromAnnotations(List annotations) { - if (annotations != null) { - for (Annotation anno : annotations) { - if (anno.getType() == Annotation.Type.PROXIMITY) { - return ((FloatAnnotation) anno).getValue(); - } - } - } - return -1; - } - - private static boolean isOptional(com.twitter.search.queryparser.query.Query node) { - return node.hasAnnotationType(Annotation.Type.OPTIONAL); - } - - private static boolean isProximityGroup(com.twitter.search.queryparser.query.Query node) { - if (node.isTypeOf(com.twitter.search.queryparser.query.Query.QueryType.OPERATOR)) { - SearchOperator op = (SearchOperator) node; - if (op.getOperatorType() == SearchOperator.Type.PROXIMITY_GROUP) { - return true; - } - } - return false; - } - - private final Query simplifyBooleanQuery(BooleanQuery q) { - if (q.clauses() == null || q.clauses().size() != 1) { - return q; - } - - return q.clauses().get(0).getQuery(); - } - - private Query visit(final Phrase phrase, boolean sloppy) throws QueryParserException { - Optional fieldOpt = phrase.getAnnotationOf(Annotation.Type.FIELD); - if (fieldOpt.isPresent()) { - String field = fieldOpt.get().valueToString(); - Schema.FieldInfo fieldInfo = schemaSnapshot.getFieldInfo(field); - if (fieldInfo != null && !fieldInfo.getFieldType().hasPositions()) { - throw new QueryParserException(String.format("Field %s does not support phrase queries " - + "because it does not have position information.", field)); - } - } - BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder(); - Map actualFieldWeights = getFieldWeightMapForNode(phrase); - for (Map.Entry entry : actualFieldWeights.entrySet()) { - PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder(); - int curPos = 0; - for (String term : phrase.getTerms()) { - if (!term.equals(PHRASE_WILDCARD)) { - phraseQueryBuilder.add(createTerm(entry.getKey(), term), curPos); - curPos++; - } else if (curPos != 0) { // "*" at the beginning of a phrase has no effect/meaning - curPos++; - } - } - - // No actual terms added to query - if (curPos == 0) { - break; - } - int annotatedSloppiness = (int) getPhraseProximityFromAnnotations(phrase.getAnnotations()); - if (annotatedSloppiness > 0) { - phraseQueryBuilder.setSlop(annotatedSloppiness); - } else if (sloppy) { - phraseQueryBuilder.setSlop(proximityPhraseSlop); - } - float fieldWeight = entry.getValue(); - float boost = (float) getBoostFromAnnotations(phrase.getAnnotations()); - Query query = phraseQueryBuilder.build(); - if (boost >= 0) { - query = BoostUtils.maybeWrapInBoostQuery(query, boost * fieldWeight); - } else if (fieldWeight != DEFAULT_FIELD_WEIGHT) { - query = BoostUtils.maybeWrapInBoostQuery(query, fieldWeight); - } else { - query = BoostUtils.maybeWrapInBoostQuery(query, proximityPhraseWeight); - } - Occur occur = actualFieldWeights.size() > 1 ?
Occur.SHOULD : Occur.MUST; - queryBuilder.add(wrapQuery(query, phrase, entry.getKey()), occur); - } - Query q = simplifyBooleanQuery(queryBuilder.build()); - return negateQueryIfNodeNegated(phrase, q); - } - - private Query wrapQuery( - org.apache.lucene.search.Query query, - com.twitter.search.queryparser.query.Query node, - String fieldName) { - return EarlybirdQueryHelper.maybeWrapWithTimeout( - EarlybirdQueryHelper.maybeWrapWithHitAttributionCollector( - query, node, schemaSnapshot.getFieldInfo(fieldName), hitAttributeHelper), - node, queryTimeout); - } - - private final boolean nodeIsNegated(com.twitter.search.queryparser.query.Query node) { - if (isParentNegated(node)) { - return !node.mustNotOccur(); - } else { - return node.mustNotOccur(); - } - } - - private final Query negateQuery(Query q) { - return new BooleanQuery.Builder() - .add(q, Occur.MUST_NOT) - .add(new MatchAllDocsQuery(), Occur.MUST) - .build(); - } - - // Simple helper to examine node, and negate the lucene query if necessary. - private final Query negateQueryIfNodeNegated(com.twitter.search.queryparser.query.Query node, - Query query) { - if (query == null) { - return null; - } - return nodeIsNegated(node) ? negateQuery(query) : query; - } - - private boolean isParentNegated(com.twitter.search.queryparser.query.Query query) { - return parentNegatedQueries.contains(query); - } - - private org.apache.lucene.index.Term createTerm(String field, String text) - throws QueryParserException { - Schema.FieldInfo fieldInfo = schemaSnapshot.getFieldInfo(field); - if (fieldInfo == null) { - throw new QueryParserException("Unknown field: " + field); - } - - queriedFields.add(field); - - try { - return new org.apache.lucene.index.Term(field, SchemaUtil.toBytesRef(fieldInfo, text)); - } catch (UnsupportedOperationException e) { - throw new QueryParserException(e.getMessage(), e.getCause()); - } - } - - /** - * Get field weight map for a node, combing default values and its annotations. - */ - private Map getFieldWeightMapForNode( - com.twitter.search.queryparser.query.Query query) throws QueryParserException { - return FieldWeightUtil.combineDefaultWithAnnotation( - query, - defaultFieldWeightMap, - enabledFieldWeightMap, - Functions.identity(), - mappableFieldMap, - Functions.identity()); - } - - private boolean addQuery( - BooleanQuery.Builder bqBuilder, - com.twitter.search.queryparser.query.Query child) throws QueryParserException { - Occur occur = Occur.MUST; - if (child.mustNotOccur()) { - // To build a conjunction, we will not rely on the negation in the child visitor. - // Instead we will add the term as MUST_NOT occur. - // Store this in parentNegatedQueries so the child visitor can do the right thing. - occur = Occur.MUST_NOT; - parentNegatedQueries.add(child); - } else if (isOptional(child) || isProximityGroup(child)) { - occur = Occur.SHOULD; - } - - Query q = child.accept(this); - if (q != null) { - bqBuilder.add(q, occur); - return true; - } - return false; - } - - /** - * Constructs a BooleanQuery from a queryparser Query node. 
- * Adds fields as configured in the fieldWeightMap and specified by termQueryDisjunctionType - * - TermQueryDisjunctionType.ONLY_OPTIONALIZED adds optional fields - * (only resolved_links_text for now), - * - TermQueryDisjunctionType.DROP_OPTIONALIZED adds all other valid fields except - * resolved_links_text (for now), - * - TermQueryDisjunctionType.NORMAL adds all valid fields - * @param query an instance of com.twitter.search.queryparser.query.Query or - * com.twitter.search.queryparser.query.Term - * @return a BooleanQuery consisting of fields from query - */ - private BooleanQuery createTermQueryDisjunction( - com.twitter.search.queryparser.query.Query query) throws QueryParserException { - String normTerm = query.isTypeOf(com.twitter.search.queryparser.query.Query.QueryType.TERM) - ? ((Term) query).getValue() : query.toString(false); - BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder(); - Map actualFieldWeightMap = getFieldWeightMapForNode(query); - Set fieldsToUse = Sets.newLinkedHashSet(actualFieldWeightMap.keySet()); - Occur occur = fieldsToUse.size() > 1 ? Occur.SHOULD : Occur.MUST; - for (String field : fieldsToUse) { - addTermQueryWithField(booleanQueryBuilder, query, normTerm, field, occur, - actualFieldWeightMap.get(field)); - } - return booleanQueryBuilder.build(); - } - - private void addTermQueryWithField( - BooleanQuery.Builder bqBuilder, - com.twitter.search.queryparser.query.Query term, - String normTerm, - String fieldName, - Occur occur, - float fieldWeight) throws QueryParserException { - float boost = (float) getBoostFromAnnotations(term.getAnnotations()); - Query query = createSimpleTermQuery(term, fieldName, normTerm); - if (boost >= 0) { - query = BoostUtils.maybeWrapInBoostQuery(query, boost * fieldWeight); - } else { - query = BoostUtils.maybeWrapInBoostQuery(query, fieldWeight); - } - bqBuilder.add(query, occur); - } - - private Query finalizeQuery(BooleanQuery bq, Term term) { - Query q = simplifyBooleanQuery(bq); - return negateQueryIfNodeNegated(term, q); - } - - private Rectangle boundingBoxFromSearchOperator(SearchOperator op) throws QueryParserException { - Preconditions.checkArgument(op.getOperatorType() == SearchOperator.Type.GEO_BOUNDING_BOX); - Preconditions.checkNotNull(op.getOperands()); - Preconditions.checkState(op.getOperands().size() == 4); - - List operands = op.getOperands(); - try { - // Unfortunately, we store coordinates as floats in our index, which causes a lot of precision - // loss. On the query side, we have to cast into floats to match. - float minLat = (float) Double.parseDouble(operands.get(0)); - float minLon = (float) Double.parseDouble(operands.get(1)); - float maxLat = (float) Double.parseDouble(operands.get(2)); - float maxLon = (float) Double.parseDouble(operands.get(3)); - - Point lowerLeft = new PointImpl(minLon, minLat, GeohashChunkImpl.getSpatialContext()); - Point upperRight = new PointImpl(maxLon, maxLat, GeohashChunkImpl.getSpatialContext()); - return new RectangleImpl(lowerLeft, upperRight, GeohashChunkImpl.getSpatialContext()); - } catch (NumberFormatException e) { - // consider the operator invalid if any of the coordinates cannot be parsed. - throw new QueryParserException("Malformed bounding box operator."
+ op.serialize()); - } - } - - private Query visitGeocodeOrGeocodePrivateOperator(SearchOperator op) - throws QueryParserException { - - GeoCode geoCode = GeoCode.fromOperator(op); - if (geoCode == null) { - throw new QueryParserException("Invalid GeoCode operator:" + op.serialize()); - } - - return wrapQueryInUserScrubGeoFilter( - GeoQuadTreeQueryBuilder.buildGeoQuadTreeQuery(geoCode, terminationTracker)); - } - - private Query wrapQueryInUserScrubGeoFilter(Query baseQuery) { - if (DeciderUtil.isAvailableForRandomRecipient( - decider, "filter_out_geo_scrubbed_tweets_" + earlybirdCluster.getNameForStats())) { - return new FilteredQuery( - baseQuery, - UserScrubGeoFilter.getDocIdFilterFactory(userScrubGeoMap)); - } else { - return baseQuery; - } - } - - private Query buildLongTermAttributeQuery(SearchOperator op, String fieldName) { - return buildLongTermAttributeQuery(op, fieldName, Long.parseLong(op.getOperand())); - } - - private Query buildLongTermAttributeQuery(SearchOperator op, String fieldName, long argValue) { - org.apache.lucene.index.Term term = new org.apache.lucene.index.Term( - fieldName, LongTermAttributeImpl.copyIntoNewBytesRef(argValue)); - return wrapQuery(new TermQueryWithSafeToString(term, Long.toString(argValue)), op, fieldName); - } - - private static void parseLongArgs(List operands, - Collection arguments, - SearchOperator op) throws QueryParserException { - for (String operand : operands) { - try { - arguments.add(Long.parseLong(operand)); - } catch (NumberFormatException e) { - throw new QueryParserException("Invalid long operand in " + op.serialize(), e); - } - } - } - - private static boolean isUserIdField(String field) { - return EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.RETWEET_SOURCE_USER_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.RETWEETED_BY_USER_ID.getFieldName().equals(field) - || EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID.getFieldName().equals(field) - || EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName().equals(field); - } - - private static boolean isTweetIdField(String field) { - return EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.RETWEET_SOURCE_TWEET_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName().equals(field) - || EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName().equals(field); - } - - private static boolean isIdCSFField(String field) { - return EarlybirdFieldConstant.DIRECTED_AT_USER_ID_CSF.getFieldName().equals(field); - } - - public Set getQueriedFields() { - return queriedFields; - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/EarlybirdQueryHelper.java b/src/java/com/twitter/search/earlybird/queryparser/EarlybirdQueryHelper.java deleted file mode 100644 index 45c3fe77a..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/EarlybirdQueryHelper.java +++ /dev/null @@ -1,154 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import javax.annotation.Nullable; - -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import 
com.twitter.search.common.constants.QueryCacheConstants; -import com.twitter.search.common.query.HitAttributeCollector; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.search.termination.QueryTimeout; -import com.twitter.search.common.search.termination.TerminationQuery; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryNodeUtils; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; - -public abstract class EarlybirdQueryHelper { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdQueryHelper.class); - - /** - * Wraps the given query and some clauses to exclude antisocial tweets into a conjunction. - */ - public static Query requireExcludeAntisocial( - Query basicQuery, - QueryCacheManager queryCacheManager) throws QueryParserException { - // Do not set exclude antisocial if they have any other antisocial filters set - Query query = basicQuery; - DetectAntisocialVisitor detectAntisocialVisitor = new DetectAntisocialVisitor(); - query.accept(detectAntisocialVisitor); - if (detectAntisocialVisitor.hasAnyAntisocialOperator()) { - return query; - } - - // No operator found, force antisocial filter. - if (queryCacheManager.enabled()) { - SearchOperator filter = - new SearchOperator(SearchOperator.Type.CACHED_FILTER, - QueryCacheConstants.EXCLUDE_ANTISOCIAL); - - query = QueryNodeUtils.appendAsConjunction(query, filter); - } else { - SearchOperator filter = new SearchOperator(SearchOperator.Type.EXCLUDE, - SearchOperatorConstants.ANTISOCIAL); - - query = QueryNodeUtils.appendAsConjunction(query, filter); - } - return query; - } - - /** - * Wraps the given query into an equivalent query that will also collect hit attribution data. - * - * @param query The original query. - * @param node The query parser node storing this query. - * @param fieldInfo The field in which the given query will be searching. - * @param hitAttributeHelper The helper that will collect all hit attribution data. - * @return An equivalent query that will also collect hit attribution data. - */ - public static final org.apache.lucene.search.Query maybeWrapWithHitAttributionCollector( - org.apache.lucene.search.Query query, - @Nullable com.twitter.search.queryparser.query.Query node, - Schema.FieldInfo fieldInfo, - @Nullable HitAttributeHelper hitAttributeHelper) { - // Prevents lint error for assigning to a function parameter. - org.apache.lucene.search.Query luceneQuery = query; - if (hitAttributeHelper != null && node != null) { - Optional annotation = node.getAnnotationOf(Annotation.Type.NODE_RANK); - - if (annotation.isPresent()) { - Integer nodeRank = (Integer) annotation.get().getValue(); - luceneQuery = wrapWithHitAttributionCollector( - luceneQuery, - fieldInfo, - nodeRank, - hitAttributeHelper.getFieldRankHitAttributeCollector()); - } - } - - return luceneQuery; - } - - /** - * Wraps the given query into an equivalent query that will also collect hit attribution data. - * - * @param query The original query. - * @param nodeRank The rank of the given query in the overall request query. - * @param fieldInfo The field in which the given query will be searching. 
- * @param hitAttributeHelper The helper that will collect all hit attribution data. - * @return An equivalent query that will also collect hit attribution data. - */ - public static final org.apache.lucene.search.Query maybeWrapWithHitAttributionCollector( - org.apache.lucene.search.Query query, - int nodeRank, - Schema.FieldInfo fieldInfo, - @Nullable HitAttributeHelper hitAttributeHelper) { - - org.apache.lucene.search.Query luceneQuery = query; - if (hitAttributeHelper != null && nodeRank != -1) { - Preconditions.checkArgument(nodeRank > 0); - luceneQuery = wrapWithHitAttributionCollector( - luceneQuery, fieldInfo, nodeRank, hitAttributeHelper.getFieldRankHitAttributeCollector()); - } - return luceneQuery; - } - - private static final org.apache.lucene.search.Query wrapWithHitAttributionCollector( - org.apache.lucene.search.Query luceneQuery, - Schema.FieldInfo fieldInfo, - int nodeRank, - HitAttributeCollector hitAttributeCollector) { - Preconditions.checkNotNull(fieldInfo, - "Tried collecting hit attribution for unknown field: " + fieldInfo.getName() - + " luceneQuery: " + luceneQuery); - return hitAttributeCollector.newIdentifiableQuery( - luceneQuery, fieldInfo.getFieldId(), nodeRank); - } - - /** - * Returns a query equivalent to the given query, and with the given timeout enforced. - */ - public static org.apache.lucene.search.Query maybeWrapWithTimeout( - org.apache.lucene.search.Query query, - QueryTimeout timeout) { - if (timeout != null) { - return new TerminationQuery(query, timeout); - } - return query; - } - - /** - * Returns a query equivalent to the given query, and with the given timeout enforced. If the - * given query is negated, it is returned without any modifications. - */ - public static org.apache.lucene.search.Query maybeWrapWithTimeout( - org.apache.lucene.search.Query query, - @Nullable com.twitter.search.queryparser.query.Query node, - QueryTimeout timeout) { - // If the node is looking for negation of something, we don't want to include it in node-level - // timeout checks. In general, nodes keep track of the last doc seen, but non-matching docs - // encountered by "must not occur" node do not reflect overall progress in the index. 
- if (node != null && node.mustNotOccur()) { - return query; - } - return maybeWrapWithTimeout(query, timeout); - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairExtractor.java b/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairExtractor.java deleted file mode 100644 index 83a928185..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairExtractor.java +++ /dev/null @@ -1,211 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.ArrayList; -import java.util.IdentityHashMap; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.util.text.HighFrequencyTermPairs; -import com.twitter.search.queryparser.query.BooleanQuery; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Operator; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.QueryVisitor; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.annotation.Annotation; - -/** - * Iterates over the Query, populating information of an ArrayList of HighFrequencyTermQueryGroup so that - * HighFrequencyTermPairRewriteVisitor can rewrite the query to use hf term pairs. Returns the - * (approximate) number of high frequency terms it has detected. Iff that number is greater than 1 - * it MAY be able to rewrite the query to use the hf_term_pairs field. - * - * The key to HF Term Pair rewriting is understanding which nodes can be combined. This extractor - * accomplishes this job by grouping nodes of the query together. All positive children of a - * conjunction are grouped together, and all negative children of a disjunction are grouped - * together. The end result is a tree of groups, where every child of a single group will have the - * opposite value of isPositive of the parent group. - * - * I'll try to break it down a bit further. Let's assume "a" and "b" are hf terms, and ' - * "[hf_term_pair a b]" represents querying their co-occurence. - * Query (* a b not_hf) can become (* [hf_term_pair a b] not_hf) - * Query (+ -a -b -not_hf) can become (+ -[hf_term_pair a b] -not_hf) - * These two rules represent the bulk of the rewrites that this class makes. - * - * We also keep track of another form of rewrite. A member of a group can be paired up with a member - * of any of its parent groups as long as both groups have the same isPositive value. This - * operation mimics boolean distribution. 
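The scan that follows walks the phrase left to right and records a pair key for every two adjacent high-frequency tokens. A stripped-down sketch of that scan, with a small hypothetical set standing in for HighFrequencyTermPairs.HF_TERM_SET:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HfPhrasePairSketch {
  // Hypothetical stand-in for HighFrequencyTermPairs.HF_TERM_SET.
  private static final Set<String> HF_TERM_SET =
      new HashSet<>(Arrays.asList("the", "a", "de", "la"));

  static List<String> adjacentHfPairs(List<String> phraseTerms) {
    List<String> pairs = new ArrayList<>();
    String lastHfToken = null;
    for (String token : phraseTerms) {
      if (HF_TERM_SET.contains(token)) {
        if (lastHfToken != null) {
          pairs.add(lastHfToken + " " + token); // same "term1 term2" key shape as hfPhrases
        }
        lastHfToken = token;
      } else {
        lastHfToken = null; // a non-HF token breaks adjacency, just like in visit(Phrase)
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    // Only "de la" is recorded; "la cat" is not a pair because "cat" is not high frequency.
    System.out.println(adjacentHfPairs(Arrays.asList("de", "la", "cat"))); // [de la]
  }
}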
As this is probably better explained with an example: - * Query (* a (+ not_hf (* b not_hf2))) can become (* a (+ not_hf (* [hf_term_pair a b ] not_hf2))) - * Query (+ -a (* not_hf (+ -b not_hf2))) can become (+ -a (* not_hf (+ -[hf_term_pair a b] not_hf2))) - */ -public class HighFrequencyTermPairExtractor extends QueryVisitor { - - private final ArrayList groupList; - private final IdentityHashMap groupIds; - - public HighFrequencyTermPairExtractor(ArrayList groupList, - IdentityHashMap groupIds) { - Preconditions.checkNotNull(groupList); - Preconditions.checkArgument(groupList.isEmpty()); - this.groupList = groupList; - this.groupIds = groupIds; - } - - @Override - public Integer visit(Disjunction disjunction) throws QueryParserException { - return visit((BooleanQuery) disjunction); - } - - @Override - public Integer visit(Conjunction conjunction) throws QueryParserException { - return visit((BooleanQuery) conjunction); - } - - /** - * All positive children under a conjunction (negative children under disjunction) belong in the - * same group as booleanQuery. All other children belong in their own, separate, new groups. - * @param booleanQuery - * @return Number of high frequency terms seen by this node and its children - * @throws QueryParserException - */ - private Integer visit(BooleanQuery booleanQuery) throws QueryParserException { - HighFrequencyTermQueryGroup group = getGroupForQuery(booleanQuery); - int numHits = 0; - - for (Query node : booleanQuery.getChildren()) { - boolean neg = node.mustNotOccur(); - if (node.isTypeOf(Query.QueryType.DISJUNCTION)) { - // Disjunctions, being negative conjunctions, are inherently negative nodes. In terms of - // being in a positive or negative group, we must flip their Occur value. - neg = !neg; - } - - if (booleanQuery.isTypeOf(Query.QueryType.DISJUNCTION) && node.mustOccur()) { - // Potential Example: (* a (+ +b not_c)) => (* (+ +b not_c) [hf_term_pair a b 0.05]) - // Implementation is too difficult and would make this rewriter even MORE complicated for - // a rarely used query. For now, we ignore it completely. We might gain some benefit in the - // future if we decide to create a new extractor and rewriter and rewrite this subquery, and - // that wouldn't complicate things too much. - continue; - } - - if (booleanQuery.isTypeOf(Query.QueryType.CONJUNCTION) != neg) { // Add node to current group - groupIds.put(node, group.groupIdx); - group.numMembers++; - } else { // Create a new group - HighFrequencyTermQueryGroup newGroup = - new HighFrequencyTermQueryGroup(groupList.size(), group.groupIdx, !group.isPositive); - newGroup.numMembers++; - groupIds.put(node, newGroup.groupIdx); - groupList.add(newGroup); - } - numHits += node.accept(this); - } - - return numHits; - } - - @Override - public Integer visit(Phrase phrase) throws QueryParserException { - HighFrequencyTermQueryGroup group = getGroupForQuery(phrase); - - int numFound = 0; - if (!phrase.hasAnnotationType(Annotation.Type.OPTIONAL)) { - boolean canBeRewritten = false; - - // Special case: phrases with exactly 2 terms that are both high frequency can be - // rewritten. In all other cases terms will be treated as pre-used hf term phrases. - if (!phrase.hasAnnotations() && phrase.size() == 2 - && HighFrequencyTermPairs.HF_TERM_SET.contains(phrase.getTerms().get(0)) - && HighFrequencyTermPairs.HF_TERM_SET.contains(phrase.getTerms().get(1))) { - canBeRewritten = true; - } - - // Special case: do not treat phrase containing :prox annotation as a real phrase. 
- boolean proximityPhrase = phrase.hasAnnotationType(Annotation.Type.PROXIMITY); - - String lastHFToken = null; - for (String token : phrase.getTerms()) { - if (HighFrequencyTermPairs.HF_TERM_SET.contains(token)) { - group.preusedHFTokens.add(token); - if (group.distributiveToken == null) { - group.distributiveToken = token; - } - if (lastHFToken != null && !proximityPhrase) { - if (canBeRewritten) { - group.hfPhrases.add(lastHFToken + " " + token); - } else { - group.preusedHFPhrases.add(lastHFToken + " " + token); - } - } - lastHFToken = token; - numFound++; - } else { - lastHFToken = null; - } - } - } - - return numFound; - } - - @Override - public Integer visit(Term term) throws QueryParserException { - if (groupList.isEmpty()) { // Shortcut for 1 term queries. - return 0; - } - - HighFrequencyTermQueryGroup group = getGroupForQuery(term); - - if (!term.hasAnnotationType(Annotation.Type.OPTIONAL) - && HighFrequencyTermPairs.HF_TERM_SET.contains(term.getValue())) { - if (!term.hasAnnotations()) { - group.hfTokens.add(term.getValue()); - } else { // Should not remove the annotated term. - group.preusedHFTokens.add(term.getValue()); - } - - if (group.distributiveToken == null) { - group.distributiveToken = term.getValue(); - } - return 1; - } - - return 0; - } - - @Override - public Integer visit(Operator operator) throws QueryParserException { - return 0; - } - - @Override - public Integer visit(SpecialTerm special) throws QueryParserException { - return 0; - } - - /** - * Uses the query's visitor data as an index and returns the group it belongs to. If groupList is - * empty, create a new group and set this group's visitor data to be index 0. - * @param query - * @return the group which query belongs to. - */ - private HighFrequencyTermQueryGroup getGroupForQuery(Query query) { - if (groupList.isEmpty()) { - boolean pos = !query.mustNotOccur(); - if (query instanceof Disjunction) { - pos = !pos; - } - HighFrequencyTermQueryGroup group = new HighFrequencyTermQueryGroup(0, pos); - group.numMembers++; - groupList.add(group); - groupIds.put(query, 0); - } - - return groupList.get(groupIds.get(query)); - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairRewriteVisitor.java b/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairRewriteVisitor.java deleted file mode 100644 index a54b46e5e..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermPairRewriteVisitor.java +++ /dev/null @@ -1,477 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.ArrayList; -import java.util.IdentityHashMap; -import java.util.List; -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.util.text.HighFrequencyTermPairs; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.queryparser.parser.SerializedQueryParser; -import com.twitter.search.queryparser.query.BooleanQuery; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Operator; -import com.twitter.search.queryparser.query.Phrase; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryNodeUtils; -import 
com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.QueryVisitor; -import com.twitter.search.queryparser.query.SpecialTerm; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.search.SearchOperator; - -/** - * Iterates over the Query, modifying it to include high frequency term pairs, replacing - * singular high frequency terms where possible. - * - * Assumes that this will be used IMMEDIATELY after using HighFrequencyTermPairExtractor - * - * There are two primary functions of this visitor: - * 1. Append hf_term_pairs to each group's root node. - * 2. Remove all unnecessary term queries (unnecessary as they are captured by an hf_term_pair) - * - * Every time the visitor finishes visiting a node, HighFrequencyTermQueryGroup.numVisits will be - * incremented for that node's group. When numVisits == numChildren, we know we have just finished - * processing the root of the group. At this point, we must append relevant hf_term_pairs to this - * node. - */ -public class HighFrequencyTermPairRewriteVisitor extends QueryVisitor { - private static final Logger LOG = LoggerFactory.getLogger( - HighFrequencyTermPairRewriteVisitor.class); - private static final SearchRateCounter SEARCH_HF_PAIR_COUNTER = - SearchRateCounter.export("hf_pair_rewrite"); - - private final ArrayList groupList; - private final IdentityHashMap groupIds; - private final boolean allowNegativeOrRewrite; - - /** - * Creates a new HighFrequencyTermPairRewriteVisitor. Should be used only IMMEDIATELY after using - * a HighFrequencyTermPairExtractor - * @param groupList The groups extracted using HighFrequencyTermPairExtractor - * @param groupIds the mapping from query to the HF term query group - */ - public HighFrequencyTermPairRewriteVisitor(ArrayList groupList, - IdentityHashMap groupIds) { - this(groupList, groupIds, true); - } - - /** - * Creates a new HighFrequencyTermPairRewriteVisitor. Should be used only IMMEDIATELY after using - * a HighFrequencyTermPairExtractor - * @param groupList The groups extracted using HighFrequencyTermPairExtractor - * @param groupIds the mapping from query to the HF term query group - * @param allowNegativeOrRewrite whether to allow rewrite for 'or (-terms)' - */ - public HighFrequencyTermPairRewriteVisitor(ArrayList groupList, - IdentityHashMap groupIds, - boolean allowNegativeOrRewrite) { - this.groupList = groupList; - this.groupIds = groupIds; - this.allowNegativeOrRewrite = allowNegativeOrRewrite; - } - - /** - * This method logs successful rewrites, and protects against unsuccessful ones by - * catching all exceptions and restoring the previous query. - */ - public static Query safeRewrite(Query safeQuery, boolean allowNegativeOrRewrite) - throws QueryParserException { - Query query = safeQuery; - - ArrayList groups = Lists.newArrayList(); - IdentityHashMap groupIds = Maps.newIdentityHashMap(); - - // Step 1: extract high frequency term pairs and phrases. - try { - int hfTermsFound = query.accept(new HighFrequencyTermPairExtractor(groups, groupIds)); - if (hfTermsFound < 2) { - return query; - } - } catch (Exception e) { - LOG.error("Exception while extracting high frequency term pairs", e); - return query; - } - - // Step 2: rewrite (safely). 
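safeRewrite() is deliberately best-effort: if the extraction above or the rewrite below throws, the exception is logged and the original query is served (re-parsed from its serialized form). A loose, generic sketch of that fallback contract, not the actual recovery path:

import java.util.function.UnaryOperator;
import java.util.logging.Logger;

final class BestEffortRewrite {
  private static final Logger LOG = Logger.getLogger(BestEffortRewrite.class.getName());

  // Apply an optional optimization; on any failure, keep serving with the unmodified input.
  static <T> T rewriteOrFallback(T input, UnaryOperator<T> rewriter) {
    try {
      return rewriter.apply(input);
    } catch (RuntimeException e) {
      LOG.warning("rewrite failed, falling back to original input: " + e);
      return input;
    }
  }

  public static void main(String[] args) {
    String query = "(* cat dog)";
    System.out.println(rewriteOrFallback(query, q -> q.toUpperCase()));                     // rewritten
    System.out.println(rewriteOrFallback(query, q -> { throw new IllegalStateException(); })); // original
  }
}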
- String original = query.serialize(); - try { - query = query.accept( - new HighFrequencyTermPairRewriteVisitor(groups, groupIds, allowNegativeOrRewrite)) - .simplify(); - String rewrite = query.serialize(); - if (LOG.isDebugEnabled()) { - LOG.debug("Optimized query: " + original + " -> " + rewrite); - } - SEARCH_HF_PAIR_COUNTER.increment(); - return query; - } catch (Exception e) { - LOG.error("Exception rewriting high frequency term pairs", e); - return new SerializedQueryParser(EarlybirdConfig.getPenguinVersion()).parse(original); - } - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param disjunction query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. - */ - @Override - public Query visit(Disjunction disjunction) throws QueryParserException { - return visit((BooleanQuery) disjunction); - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param conjunction query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. - */ - @Override - public Query visit(Conjunction conjunction) throws QueryParserException { - return visit((BooleanQuery) conjunction); - } - - /** - * Applies this visitor to a BooleanQuery. - */ - public Query visit(BooleanQuery booleanQuery) throws QueryParserException { - HighFrequencyTermQueryGroup group = groupList.get(groupIds.get(booleanQuery)); - queryPreprocess(group); - - ArrayList children = Lists.newArrayList(); - for (Query node : booleanQuery.getChildren()) { - if (booleanQuery.isTypeOf(Query.QueryType.DISJUNCTION) && node.mustOccur()) { - // Potential Example: (* a (+ +b not_c)) => (* (+ +b not_c) [hf_term_pair a b 0.05]) - // Implementation is too difficult and would make this rewriter even MORE complicated for - // a rarely used query. For now, we ignore it completely. We might gain some benefit in the - // future if we decide to create a new extractor and rewriter and rewrite this subquery, and - // that wouldn't complicate things too much. - children.add(node); - continue; - } - Query child = node.accept(this); - if (child != null) { - children.add(child); - } - } - - Query newBooleanQuery = booleanQuery.newBuilder().setChildren(children).build(); - - return queryPostprocess(newBooleanQuery, group); - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param phraseToVisit query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. - */ - @Override - public Query visit(Phrase phraseToVisit) throws QueryParserException { - Phrase phrase = phraseToVisit; - - HighFrequencyTermQueryGroup group = groupList.get(groupIds.get(phrase)); - queryPreprocess(group); - - // Remove all high frequency phrases from the query that do not have any annotations. - // This will cause phrase de-duping, which we probably don't care about. - if (!hasAnnotations(phrase) && ( - group.hfPhrases.contains(phrase.getPhraseValue()) - || group.preusedHFPhrases.contains(phrase.getPhraseValue()))) { - // This term will be appended to the end of the query in the form of a pair. - phrase = null; - } - - return queryPostprocess(phrase, group); - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param termToVisit query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. 
- */ - @Override - public Query visit(Term termToVisit) throws QueryParserException { - Term term = termToVisit; - - HighFrequencyTermQueryGroup group = groupList.get(groupIds.get(term)); - queryPreprocess(group); - - // Remove all high frequency terms from the query that do not have any annotations. This will - // do term de-duping within a group, which may effect scoring, but since these are high df - // terms, they don't have much of an impact anyways. - if (!hasAnnotations(term) - && (group.preusedHFTokens.contains(term.getValue()) - || group.hfTokens.contains(term.getValue()))) { - // This term will be appended to the end of the query in the form of a pair. - term = null; - } - - return queryPostprocess(term, group); - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param operator query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. - */ - @Override - public Query visit(Operator operator) throws QueryParserException { - HighFrequencyTermQueryGroup group = groupList.get(groupIds.get(operator)); - queryPreprocess(group); - - return queryPostprocess(operator, group); - } - - /** - * The rewritten query to use the hf_term_pair operators. - * - * @param special query node which must have been previously visited by - * HighFrequencyTermPairExtractor and not had its visitor data cleared. - */ - @Override - public Query visit(SpecialTerm special) throws QueryParserException { - HighFrequencyTermQueryGroup group = groupList.get(groupIds.get(special)); - queryPreprocess(group); - - return queryPostprocess(special, group); - } - - /** - * Before visiting a node's children, we must process its group's distributiveToken. This way, a - * node only has to check its grandparent group for a distributiveToken instead of recursing all - * of the way up to the root of the tree. - */ - private void queryPreprocess(HighFrequencyTermQueryGroup group) { - if (group.distributiveToken == null) { - group.distributiveToken = getAncestorDistributiveToken(group); - } - } - - /** - * If the query isn't the root of the group, returns the query. Otherwise, if the query's - * group has at most one hf term, return the query. Otherwise, returns the query with hf_term_pair - * operators created from the group's hf terms appended to it. - */ - private Query queryPostprocess(@Nullable Query query, HighFrequencyTermQueryGroup group) - throws QueryParserException { - - group.numVisits++; - if (group.numMembers == group.numVisits - && (!group.hfTokens.isEmpty() || !group.preusedHFTokens.isEmpty() - || group.hasPhrases())) { - - group.removePreusedTokens(); - String ancestorDistributiveToken = getAncestorDistributiveToken(group); - - // Need at least 2 tokens to perform a pair rewrite. Try to get one - // additional token from ancestors, and if that fails, from phrases. - if ((group.hfTokens.size() + group.preusedHFTokens.size()) == 1 - && ancestorDistributiveToken != null) { - group.preusedHFTokens.add(ancestorDistributiveToken); - } - if ((group.hfTokens.size() + group.preusedHFTokens.size()) == 1) { - String tokenFromPhrase = group.getTokenFromPhrase(); - if (tokenFromPhrase != null) { - group.preusedHFTokens.add(tokenFromPhrase); - } - } - - return appendPairs(query, group); - } - - return query; - } - - /** - * Returns the distributiveToken of group's grandparent. 
- */ - private String getAncestorDistributiveToken(HighFrequencyTermQueryGroup group) { - String ancestorDistributiveToken = null; - if (group.parentGroupIdx >= 0 && groupList.get(group.parentGroupIdx).parentGroupIdx >= 0) { - ancestorDistributiveToken = - groupList.get(groupList.get(group.parentGroupIdx).parentGroupIdx).distributiveToken; - } - return ancestorDistributiveToken; - } - - /** - * Returns the hf_term_pair operators created using the hf terms of the group appended to query. - * - * @param query The query which the new hf_term_pair operators will be appended to. - * @param group The group which this query belongs to. - * @return The hf_term_pair operators created using the hf terms of the group appended to query. - */ - private Query appendPairs(@Nullable Query query, HighFrequencyTermQueryGroup group) - throws QueryParserException { - - BooleanQuery query2 = createQueryFromGroup(group); - - // If either of the queries are null, do not have to worry about combining them. - if (query2 == null) { - return query; - } else if (query == null) { - return query2; - } - - Query newQuery; - - if (query.isTypeOf(Query.QueryType.CONJUNCTION) - || query.isTypeOf(Query.QueryType.DISJUNCTION)) { - // Adding children in this way is safer when its query is a conjunction or disjunction - // ex. Other way: (+ +de -la -the) => (+ (+ +de -la -the) -[hf_term_pair la the 0.005]) - // This way: (+ +de -la -the) => (+ +de -la -the -[hf_term_pair la the 0.005]) - return ((BooleanQuery.Builder) query.newBuilder()).addChildren(query2.getChildren()).build(); - } else if (!group.isPositive) { - // In lucene, [+ (-term1, -term2, ...)] has non-deterministic behavior and the rewrite is not - // efficient from query execution perspective. So, we will not do this rewrite if it is - // configured that way. - if (!allowNegativeOrRewrite) { - return query; - } - - // Negate both queries to combine, and the append as a conjunction, followed by negating - // whole query. Equivalent to appending as a disjunction. - newQuery = QueryNodeUtils.appendAsConjunction( - query.negate(), - query2.negate() - ); - newQuery = newQuery.makeMustNot(); - } else { - newQuery = QueryNodeUtils.appendAsConjunction(query, query2); - newQuery = newQuery.makeDefault(); - } - - return newQuery; - } - - /** - * Creates a conjunction of term_pairs using the sets of hf terms in HighFrequencyTermQueryGroup - * group. If !group.isPositive, will return a disjunction of negated pairs. If there aren't enough - * hfTokens, will return null. - */ - private BooleanQuery createQueryFromGroup(HighFrequencyTermQueryGroup group) - throws QueryParserException { - - if (!group.hfTokens.isEmpty() || group.preusedHFTokens.size() > 1 || group.hasPhrases()) { - List terms = createTermPairsForGroup(group.hfTokens, - group.preusedHFTokens, - group.hfPhrases, - group.preusedHFPhrases); - - if (group.isPositive) { - return new Conjunction(terms); - } else { - return new Disjunction(Lists.transform(terms, QueryNodeUtils.NEGATE_QUERY)); - } - } - - return null; - } - - /** - * Creates HF_TERM_PAIR terms out of hfTokens and optHFTokens. Attempts to create the minimal - * amount of tokens necessary. optHFToken pairs should be given a weight of 0.0 and not be scored, - * as they are likely already included in the query in a phrase or an annotated term. - * @param hfTokens - * @param optHFTokens - * @return A list of hf_term_pair operators. 
- */ - private List createTermPairsForGroup(Set hfTokens, - Set optHFTokens, - Set hfPhrases, - Set optHFPhrases) { - // Handle sets with only one token. - if (optHFTokens.size() == 1 && hfTokens.size() > 0) { - // (* "a not_hf" b c) => (* "a not_hf" [hf_term_pair a b 0.05] [hf_term_pair b c 0.05]) - // optHFTokens: [a] hfTokens: [b, c] => optHFTokens: [] hfTokens: [a, b, c] - hfTokens.addAll(optHFTokens); - optHFTokens.clear(); - } else if (hfTokens.size() == 1 && optHFTokens.size() > 0) { - // (* "a b" not_hf c) => (* "a b" not_hf [hf_term_pair a b 0.0] [hf_term_pair a c 0.005]) - // optHFTokens: [a, b] hfTokens: [c] => optHFTokens: [a, b] hfTokens: [a, c] - String term = optHFTokens.iterator().next(); - hfTokens.add(term); - } - - List terms = createTermPairs(hfTokens, true, HighFrequencyTermPairs.HF_DEFAULT_WEIGHT); - terms.addAll(createTermPairs(optHFTokens, false, 0)); - terms.addAll(createPhrasePairs(hfPhrases, HighFrequencyTermPairs.HF_DEFAULT_WEIGHT)); - terms.addAll(createPhrasePairs(optHFPhrases, 0)); - - return terms; - } - - /** - * Turns a set of hf terms into a list of hf_term_pair operators. Each term will be used at least - * once in as few pairs as possible. - * @param tokens - * @param createSingle If the set contains only one query, the returned list will contain a single - * Term for that query if createSingle is true, and an empty list otherwise. - * @param weight Each term pair will be given a score boost of serializedWeight. - * @return - */ - private static List createTermPairs(Set tokens, boolean createSingle, - double weight) { - - List terms = Lists.newArrayList(); - if (tokens.size() >= 2) { - int tokensLeft = tokens.size(); - String token1 = null; - for (String token2 : tokens) { - if (token1 == null) { - token1 = token2; - } else { - terms.add(createHFTermPair(token1, token2, weight)); - - if (tokensLeft > 2) { // Only reset if there is more than one token remaining. - token1 = null; - } - } - tokensLeft--; - } - } else if (createSingle && !tokens.isEmpty()) { // Only one high frequency token - // Need to add token as a term because it was removed from the query earlier in rewriting. 
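To make the pairing walk above concrete, here is a self-contained re-trace of the same loop, with a plain string label standing in for the hf_term_pair SearchOperator: every token lands in at least one pair, and with an odd token count the held token is reused for the final pair instead of emitting a singleton.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class PairingSketch {
  static List<String> pairUp(Set<String> tokens) {
    List<String> pairs = new ArrayList<>();
    int tokensLeft = tokens.size();
    String token1 = null;
    for (String token2 : tokens) {
      if (token1 == null) {
        token1 = token2;
      } else {
        pairs.add("[hf_term_pair " + token1 + " " + token2 + "]");
        if (tokensLeft > 2) {
          token1 = null; // only reset while more than one token remains
        }
      }
      tokensLeft--;
    }
    return pairs;
  }

  public static void main(String[] args) {
    // Even count: disjoint pairs.          -> [hf_term_pair a b], [hf_term_pair c d]
    System.out.println(pairUp(new TreeSet<>(Arrays.asList("a", "b", "c", "d"))));
    // Odd count: the held token is reused. -> [hf_term_pair a b], [hf_term_pair a c]
    System.out.println(pairUp(new TreeSet<>(Arrays.asList("a", "b", "c"))));
  }
}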
- Term newTerm = new Term(tokens.iterator().next()); - terms.add(newTerm); - } - - return terms; - } - - private static List createPhrasePairs(Set phrases, double weight) { - List ops = Lists.newArrayList(); - for (String phrase : phrases) { - String[] terms = phrase.split(" "); - assert terms.length == 2; - SearchOperator op = new SearchOperator(SearchOperator.Type.HF_PHRASE_PAIR, - terms[0], terms[1], Double.toString(weight)); - ops.add(op); - } - return ops; - } - - private static SearchOperator createHFTermPair(String token1, String token2, double weight) { - SearchOperator op = new SearchOperator(SearchOperator.Type.HF_TERM_PAIR, - token1, token2, Double.toString(weight)); - return op; - } - - private static boolean hasAnnotations(com.twitter.search.queryparser.query.Query node) { - return node.hasAnnotations(); - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermQueryGroup.java b/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermQueryGroup.java deleted file mode 100644 index f6b40f868..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/HighFrequencyTermQueryGroup.java +++ /dev/null @@ -1,94 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.ArrayList; -import java.util.List; -import java.util.Set; - -import com.google.common.collect.Sets; - -/** - * Used to store information relevant to processing query groups for HighFrequencyTermPairExtractor - * and HighFrequencyTermPairRewriter - */ -public class HighFrequencyTermQueryGroup { - protected final int groupIdx; - protected final int parentGroupIdx; - // The number of nodes in this group. - protected int numMembers = 0; - // For the rewrite visitor: Incremented once at the end of each of this group's nodes' visits. - protected int numVisits = 0; - - // The set of tokens that should be removed from the query if seen as an individual term and - // rewritten in the query as a hf term pair. - protected final Set hfTokens = Sets.newTreeSet(); - - // Tokens that can be used to restrict searches but should not be scored. They will be given a - // weight of 0. - protected final Set preusedHFTokens = Sets.newTreeSet(); - - // Set of phrases that should be removed from the query if seen as an individual phrase and - // rewritten in the query as a hf term phrase pair. - protected final Set hfPhrases = Sets.newTreeSet(); - - // Phrases that can be used to restrict searches but should not be scored. They will be given a - // weight of 0. - protected final Set preusedHFPhrases = Sets.newTreeSet(); - - // The first found hf_term, or the hf_term of an ancestor with the same isPositive value. - protected String distributiveToken = null; - - // If it is a single node group, isPositive is true iff that node is true. - // Otherwise, isPositive is false iff the root of the group is a disjunction. 
- protected final boolean isPositive; - - public HighFrequencyTermQueryGroup(int groupIdx, boolean positive) { - this(groupIdx, -1, positive); - } - - public HighFrequencyTermQueryGroup(int groupIdx, int parentGroupIdx, boolean positive) { - this.groupIdx = groupIdx; - this.parentGroupIdx = parentGroupIdx; - isPositive = positive; - } - - public boolean hasPhrases() { - return !hfPhrases.isEmpty() || !preusedHFPhrases.isEmpty(); - } - - protected List tokensFromPhrases() { - if (!hasPhrases()) { - return null; - } - List tokens = new ArrayList<>(); - for (String phrase : hfPhrases) { - for (String term : phrase.split(" ")) { - tokens.add(term); - } - } - for (String phrase : preusedHFPhrases) { - for (String term : phrase.split(" ")) { - tokens.add(term); - } - } - return tokens; - } - - protected void removePreusedTokens() { - hfTokens.removeAll(preusedHFTokens); - List phraseTokens = tokensFromPhrases(); - if (phraseTokens != null) { - hfTokens.removeAll(phraseTokens); - preusedHFTokens.removeAll(phraseTokens); - } - hfPhrases.removeAll(preusedHFPhrases); - } - - protected String getTokenFromPhrase() { - List phraseTokens = tokensFromPhrases(); - if (phraseTokens != null) { - return phraseTokens.get(0); - } else { - return null; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/LuceneRelevanceQueryVisitor.java b/src/java/com/twitter/search/earlybird/queryparser/LuceneRelevanceQueryVisitor.java deleted file mode 100644 index cf749ad20..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/LuceneRelevanceQueryVisitor.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.search.Query; - -import com.twitter.decider.Decider; -import com.twitter.search.common.query.MappableField; -import com.twitter.search.common.schema.base.FieldWeightDefault; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.search.termination.QueryTimeout; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.querycache.QueryCacheManager; -import com.twitter.search.queryparser.query.search.SearchOperator; - -public class LuceneRelevanceQueryVisitor extends EarlybirdLuceneQueryVisitor { - public LuceneRelevanceQueryVisitor( - ImmutableSchemaInterface schema, - QueryCacheManager queryCacheManager, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - TerminationTracker terminationTracker, - Map fieldWeightMap, - Map mappableFieldMap, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - Decider decider, - EarlybirdCluster earlybirdCluster, - QueryTimeout queryTimeout) { - super( - schema, - queryCacheManager, - userTable, - userScrubGeoMap, - terminationTracker, - fieldWeightMap, - mappableFieldMap, - multiSegmentTermDictionaryManager, - decider, - earlybirdCluster, - queryTimeout); - } - - @VisibleForTesting - protected LuceneRelevanceQueryVisitor( - ImmutableSchemaInterface schema, - QueryCacheManager queryCacheManager, - UserTable userTable, - UserScrubGeoMap userScrubGeoMap, - EarlybirdCluster earlybirdCluster) { - super(schema, - 
queryCacheManager, - userTable, - userScrubGeoMap, - earlybirdCluster, - queryCacheManager.getDecider()); - } - - @Override - protected Query visitSinceIDOperator(SearchOperator op) { - // since_id is handled by the blender for relevance queries, so don't filter on it. - return null; - } -} diff --git a/src/java/com/twitter/search/earlybird/queryparser/ProtectedOperatorQueryRewriter.java b/src/java/com/twitter/search/earlybird/queryparser/ProtectedOperatorQueryRewriter.java deleted file mode 100644 index fd35ac61c..000000000 --- a/src/java/com/twitter/search/earlybird/queryparser/ProtectedOperatorQueryRewriter.java +++ /dev/null @@ -1,153 +0,0 @@ -package com.twitter.search.earlybird.queryparser; - -import java.util.List; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; - -public class ProtectedOperatorQueryRewriter { - private static final String ERROR_MESSAGE = "Positive 'protected' operator must be in the root" - + " query node and the root query node must be a Conjunction."; - private static final Query EXCLUDE_PROTECTED_OPERATOR = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.PROTECTED); - - /** - * Rewrite a query with positive 'protected' operator into an equivalent query without the positive - * 'protected' operator. This method assumes the following preconditions hold: - * 1. 'followedUserIds' is not empty - * 2. the query's root node is of type Conjunction - * 3. the query's root node is not negated - * 4. there is one positive 'protected' operator in the root node - * 5. there is only one 'protected' operator in the whole query - * - * Query with '[include protected]' operator is rewritten into a Disjunction of a query with - * protected Tweets only and a query with public Tweets only. - * For example, - * Original query: - * (* "cat" [include protected]) - * with followedUserIds=[1, 7, 12] where 1 and 7 are protected users - * Rewritten query: - * (+ - * (* "cat" [multi_term_disjunction from_user_id 1 7]) - * (* "cat" [exclude protected]) - * ) - * - * Query with '[filter protected]' operator is rewritten with multi_term_disjunction from_user_id - * operator. 
- * For example, - * Original query: - * (* "cat" [filter protected]) - * with followedUserIds=[1, 7, 12] where 1 and 7 are protected users - * Rewritten query: - * (* "cat" [multi_term_disjunction from_user_id 1 7]) - */ - public Query rewrite(Query parsedQuery, List followedUserIds, UserTable userTable) { - Preconditions.checkState(followedUserIds != null && !followedUserIds.isEmpty(), - "'followedUserIds' should not be empty when positive 'protected' operator exists."); - Preconditions.checkState( - parsedQuery.isTypeOf(com.twitter.search.queryparser.query.Query.QueryType.CONJUNCTION), - ERROR_MESSAGE); - Conjunction parsedConjQuery = (Conjunction) parsedQuery; - List children = parsedConjQuery.getChildren(); - int opIndex = findPositiveProtectedOperatorIndex(children); - Preconditions.checkState(opIndex >= 0, ERROR_MESSAGE); - SearchOperator protectedOp = (SearchOperator) children.get(opIndex); - - ImmutableList.Builder otherChildrenBuilder = ImmutableList.builder(); - otherChildrenBuilder.addAll(children.subList(0, opIndex)); - if (opIndex + 1 < children.size()) { - otherChildrenBuilder.addAll(children.subList(opIndex + 1, children.size())); - } - List otherChildren = otherChildrenBuilder.build(); - - List protectedUserIds = getProtectedUserIds(followedUserIds, userTable); - if (protectedOp.getOperatorType() == SearchOperator.Type.FILTER) { - if (protectedUserIds.isEmpty()) { - // match none query - return Disjunction.EMPTY_DISJUNCTION; - } else { - return parsedConjQuery.newBuilder() - .setChildren(otherChildren) - .addChild(createFromUserIdMultiTermDisjunctionQuery(protectedUserIds)) - .build(); - } - } else { - // 'include' or negated 'exclude' operator - // negated 'exclude' is considered the same as 'include' to be consistent with the logic in - // EarlybirdLuceneQueryVisitor - if (protectedUserIds.isEmpty()) { - // return public only query - return parsedConjQuery.newBuilder() - .setChildren(otherChildren) - .addChild(EXCLUDE_PROTECTED_OPERATOR) - .build(); - } else { - // build a disjunction of protected only query and public only query - Query protectedOnlyQuery = parsedConjQuery.newBuilder() - .setChildren(otherChildren) - .addChild(createFromUserIdMultiTermDisjunctionQuery(protectedUserIds)) - .build(); - Query publicOnlyQuery = parsedConjQuery.newBuilder() - .setChildren(otherChildren) - .addChild(EXCLUDE_PROTECTED_OPERATOR) - .build(); - return new Disjunction(protectedOnlyQuery, publicOnlyQuery); - } - } - } - - private Query createFromUserIdMultiTermDisjunctionQuery(List userIds) { - ImmutableList.Builder operandsBuilder = ImmutableList.builder(); - operandsBuilder - .add(EarlybirdFieldConstants.EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName()); - for (Long userId : userIds) { - operandsBuilder.add(userId.toString()); - } - List operands = operandsBuilder.build(); - return new SearchOperator(SearchOperator.Type.MULTI_TERM_DISJUNCTION, operands); - } - - private List getProtectedUserIds(List followedUserIds, UserTable userTable) { - ImmutableList.Builder protectedUserIds = ImmutableList.builder(); - for (Long userId : followedUserIds) { - if (userTable.isSet(userId, UserTable.IS_PROTECTED_BIT)) { - protectedUserIds.add(userId); - } - } - return protectedUserIds.build(); - } - - private int findPositiveProtectedOperatorIndex(List children) { - for (int i = 0; i < children.size(); i++) { - Query child = children.get(i); - if (child instanceof SearchOperator) { - SearchOperator searchOp = (SearchOperator) child; - if 
(SearchOperatorConstants.PROTECTED.equals(searchOp.getOperand()) - && (isNegateExclude(searchOp) || isPositive(searchOp))) { - return i; - } - } - } - - return -1; - } - - private boolean isNegateExclude(SearchOperator searchOp) { - return searchOp.mustNotOccur() - && searchOp.getOperatorType() == SearchOperator.Type.EXCLUDE; - } - - private boolean isPositive(SearchOperator searchOp) { - return !searchOp.mustNotOccur() - && (searchOp.getOperatorType() == SearchOperator.Type.INCLUDE - || searchOp.getOperatorType() == SearchOperator.Type.FILTER); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/AbstractResultsCollector.java b/src/java/com/twitter/search/earlybird/search/AbstractResultsCollector.java deleted file mode 100644 index d18fcdfda..000000000 --- a/src/java/com/twitter/search/earlybird/search/AbstractResultsCollector.java +++ /dev/null @@ -1,630 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import org.apache.commons.collections.CollectionUtils; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.ScoreMode; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.relevance.features.EarlybirdDocumentFeatures; -import com.twitter.search.common.results.thriftjava.FieldHitAttribution; -import com.twitter.search.common.results.thriftjava.FieldHitList; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.TwitterEarlyTerminationCollector; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.core.earlybird.facets.AbstractFacetCountingArray; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.inverted.QueryCostTracker; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.index.TweetIDMapper; -import com.twitter.search.earlybird.search.facets.FacetLabelCollector; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftFacetLabel; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultGeoLocation; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import 
com.twitter.search.queryparser.util.IdTimeRanges; - -import geo.google.datamodel.GeoCoordinate; - -/** - * Abstract parent class for all results collectors in earlybird. - * This collector should be able to handle both single-segment and - * multi-segment collection. - */ -public abstract class AbstractResultsCollector - extends TwitterEarlyTerminationCollector { - enum IdAndRangeUpdateType { - BEGIN_SEGMENT, - END_SEGMENT, - HIT - } - - // Earlybird used to have a special early termination logic: at segment boundaries - // the collector estimates how much time it'll take to search the next segment. - // If this estimate * 1.5 will cause the request to timeout, the search early terminates. - // That logic is removed in favor of more fine grained checks---now we check timeout - // within a segment, every 2,000,000 docs processed. - private static final int EXPENSIVE_TERMINATION_CHECK_INTERVAL = - EarlybirdConfig.getInt("expensive_termination_check_interval", 2000000); - - private static final long NO_TIME_SLICE_ID = -1; - - protected final R searchRequestInfo; - - // Sometimes maxHitsToProcess can also come from places other than collector params. - // E.g. from searchQuery.getRelevanceOptions(). This provides a way to allow - // subclasses to override the maxHitsToProcess on collector params. - private final long maxHitsToProcessOverride; - - // min and max status id actually considered in the search (may not be a hit) - private long minSearchedStatusID = Long.MAX_VALUE; - private long maxSearchedStatusID = Long.MIN_VALUE; - - private int minSearchedTime = Integer.MAX_VALUE; - private int maxSearchedTime = Integer.MIN_VALUE; - - // per-segment start time. Will be re-started in setNextReader(). - private long segmentStartTime; - - // Current segment being searched. - protected EarlybirdIndexSegmentAtomicReader currTwitterReader; - protected TweetIDMapper tweetIdMapper; - protected TimeMapper timeMapper; - protected long currTimeSliceID = NO_TIME_SLICE_ID; - - private final long queryTime; - - // Time periods, in milliseconds, for which hits are counted. - private final List hitCountsThresholdsMsec; - - // hitCounts[i] is the number of hits that are more recent than hitCountsThresholdsMsec[i] - private final int[] hitCounts; - - private final ImmutableSchemaInterface schema; - - private final EarlybirdSearcherStats searcherStats; - // For collectors that fill in the results' geo locations, this will be used to retrieve the - // documents' lat/lon coordinates. - private GeoCoordinate resultGeoCoordinate; - protected final boolean fillInLatLonForHits; - - protected EarlybirdDocumentFeatures documentFeatures; - protected boolean featuresRequested = false; - - private final FacetLabelCollector facetCollector; - - // debugMode set in request to determine debugging level. 
- private int requestDebugMode; - - // debug info to be returned in earlybird response - protected List debugInfo; - - private int numHitsCollectedPerSegment; - - public AbstractResultsCollector( - ImmutableSchemaInterface schema, - R searchRequestInfo, - Clock clock, - EarlybirdSearcherStats searcherStats, - int requestDebugMode) { - super(searchRequestInfo.getSearchQuery().getCollectorParams(), - searchRequestInfo.getTerminationTracker(), - QueryCostTracker.getTracker(), - EXPENSIVE_TERMINATION_CHECK_INTERVAL, - clock); - - this.schema = schema; - this.searchRequestInfo = searchRequestInfo; - ThriftSearchQuery thriftSearchQuery = searchRequestInfo.getSearchQuery(); - this.maxHitsToProcessOverride = searchRequestInfo.getMaxHitsToProcess(); - this.facetCollector = buildFacetCollector(searchRequestInfo, schema); - - if (searchRequestInfo.getTimestamp() > 0) { - queryTime = searchRequestInfo.getTimestamp(); - } else { - queryTime = System.currentTimeMillis(); - } - hitCountsThresholdsMsec = thriftSearchQuery.getHitCountBuckets(); - hitCounts = hitCountsThresholdsMsec == null || hitCountsThresholdsMsec.size() == 0 - ? null - : new int[hitCountsThresholdsMsec.size()]; - - this.searcherStats = searcherStats; - - Schema.FieldInfo latLonCSFField = - schema.hasField(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName()) - ? schema.getFieldInfo(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName()) - : null; - boolean loadLatLonMapperIntoRam = true; - if (latLonCSFField != null) { - // If the latlon_csf field is explicitly defined, then take the config from the schema. - // If it's not defined, we assume that the latlon mapper is stored in memory. - loadLatLonMapperIntoRam = latLonCSFField.getFieldType().isCsfLoadIntoRam(); - } - // Default to not fill in lat/lon if the lat/lon CSF field is not loaded into RAM - this.fillInLatLonForHits = EarlybirdConfig.getBool("fill_in_lat_lon_for_hits", - loadLatLonMapperIntoRam); - this.requestDebugMode = requestDebugMode; - - if (shouldCollectDetailedDebugInfo()) { - this.debugInfo = new ArrayList<>(); - debugInfo.add("Starting Search"); - } - } - - private static FacetLabelCollector buildFacetCollector( - SearchRequestInfo request, - ImmutableSchemaInterface schema) { - if (CollectionUtils.isEmpty(request.getFacetFieldNames())) { - return null; - } - - // Get all facet field ids requested. - Set requiredFields = Sets.newHashSet(); - for (String fieldName : request.getFacetFieldNames()) { - Schema.FieldInfo field = schema.getFacetFieldByFacetName(fieldName); - if (field != null) { - requiredFields.add(field.getFieldType().getFacetName()); - } - } - - if (requiredFields.size() > 0) { - return new FacetLabelCollector(requiredFields); - } else { - return null; - } - } - - /** - * Subclasses should implement the following methods. - */ - - // Subclasses should process collected hits and construct a final - // AbstractSearchResults object. - protected abstract S doGetResults() throws IOException; - - // Subclasses can override this method to add more collection logic. - protected abstract void doCollect(long tweetID) throws IOException; - - public final ImmutableSchemaInterface getSchema() { - return schema; - } - - // Updates the hit count array - each result only increments the first qualifying bucket. 
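The updateHitCounts method that follows ages each hit by subtracting its snowflake creation time from the query time and increments only the tightest qualifying bucket. A standalone sketch of that bucketing, substituting the public snowflake layout (milliseconds above the low 22 bits, offset by the Twitter epoch) for SnowflakeIdParser; the thresholds in main are made up:

final class HitCountBucketsSketch {
  // Public snowflake layout: the 22 low bits hold worker/sequence ids,
  // the remaining bits are milliseconds since the Twitter epoch.
  private static final long TWITTER_EPOCH_MS = 1288834974657L;

  static long timestampFromTweetId(long tweetId) {
    return (tweetId >>> 22) + TWITTER_EPOCH_MS;
  }

  // Mirrors the bucket update: only the first threshold the hit's age fits under is incremented.
  static void updateHitCounts(long queryTimeMs, long tweetId, long[] thresholdsMs, int[] counts) {
    long age = queryTimeMs - timestampFromTweetId(tweetId);
    for (int i = 0; i < thresholdsMs.length; i++) {
      if (age >= 0 && age < thresholdsMs[i]) {
        counts[i]++; // wider buckets are implied and aggregated later
        break;
      }
    }
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long tweetId = (now - 30_000L - TWITTER_EPOCH_MS) << 22; // a tweet roughly 30s old
    long[] thresholds = {60_000L, 600_000L, 3_600_000L};     // 1m, 10m, 1h
    int[] counts = new int[3];
    updateHitCounts(now, tweetId, thresholds, counts);
    System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // 1 0 0
  }
}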
- protected final void updateHitCounts(long statusId) { - if (hitCounts == null) { - return; - } - - long delta = queryTime - SnowflakeIdParser.getTimestampFromTweetId(statusId); - for (int i = 0; i < hitCountsThresholdsMsec.size(); ++i) { - if (delta >= 0 && delta < hitCountsThresholdsMsec.get(i)) { - hitCounts[i]++; - // Increments to the rest of the count array are implied, and aggregated later, since the - // array is sorted. - break; - } - } - } - - private boolean searchedStatusIDsAndTimesInitialized() { - return maxSearchedStatusID != Long.MIN_VALUE; - } - - // Updates the first searched status ID when starting to search a new segment. - private void updateFirstSearchedStatusID() { - // Only try to update the min/max searched ids, if this segment/reader actually has documents - // See SEARCH-4535 - int minDocID = currTwitterReader.getSmallestDocID(); - if (currTwitterReader.hasDocs() && minDocID >= 0 && !searchedStatusIDsAndTimesInitialized()) { - final long firstStatusID = tweetIdMapper.getTweetID(minDocID); - final int firstStatusTime = timeMapper.getTime(minDocID); - if (shouldCollectDetailedDebugInfo()) { - debugInfo.add( - "updateFirstSearchedStatusID. minDocId=" + minDocID + ", firstStatusID=" - + firstStatusID + ", firstStatusTime=" + firstStatusTime); - } - updateIDandTimeRanges(firstStatusID, firstStatusTime, IdAndRangeUpdateType.BEGIN_SEGMENT); - } - } - - public final R getSearchRequestInfo() { - return searchRequestInfo; - } - - public final long getMinSearchedStatusID() { - return minSearchedStatusID; - } - - public final long getMaxSearchedStatusID() { - return maxSearchedStatusID; - } - - public final int getMinSearchedTime() { - return minSearchedTime; - } - - public boolean isSetMinSearchedTime() { - return minSearchedTime != Integer.MAX_VALUE; - } - - public final int getMaxSearchedTime() { - return maxSearchedTime; - } - - @Override - public final long getMaxHitsToProcess() { - return maxHitsToProcessOverride; - } - - // Notifies classes that a new index segment is about to be searched. - @Override - public final void setNextReader(LeafReaderContext context) throws IOException { - super.setNextReader(context); - setNextReader(context.reader()); - } - - /** - * Notifies the collector that a new segment is about to be searched. - * - * It's easier to use this method from tests, because LeafReader is not a final class, so it can - * be mocked (unlike LeafReaderContext). 
- */ - @VisibleForTesting - public final void setNextReader(LeafReader reader) throws IOException { - if (!(reader instanceof EarlybirdIndexSegmentAtomicReader)) { - throw new RuntimeException("IndexReader type not supported: " + reader.getClass()); - } - - currTwitterReader = (EarlybirdIndexSegmentAtomicReader) reader; - documentFeatures = new EarlybirdDocumentFeatures(currTwitterReader); - tweetIdMapper = (TweetIDMapper) currTwitterReader.getSegmentData().getDocIDToTweetIDMapper(); - timeMapper = currTwitterReader.getSegmentData().getTimeMapper(); - currTimeSliceID = currTwitterReader.getSegmentData().getTimeSliceID(); - updateFirstSearchedStatusID(); - if (shouldCollectDetailedDebugInfo()) { - debugInfo.add("Starting search in segment with timeslice ID: " + currTimeSliceID); - } - - segmentStartTime = getClock().nowMillis(); - startSegment(); - } - - protected abstract void startSegment() throws IOException; - - @Override - protected final void doCollect() throws IOException { - documentFeatures.advance(curDocId); - long tweetID = tweetIdMapper.getTweetID(curDocId); - updateIDandTimeRanges(tweetID, timeMapper.getTime(curDocId), IdAndRangeUpdateType.HIT); - doCollect(tweetID); - numHitsCollectedPerSegment++; - } - - protected void collectFeatures(ThriftSearchResultMetadata metadata) throws IOException { - if (featuresRequested) { - ensureExtraMetadataIsSet(metadata); - - metadata.getExtraMetadata().setDirectedAtUserId( - documentFeatures.getFeatureValue(EarlybirdFieldConstant.DIRECTED_AT_USER_ID_CSF)); - metadata.getExtraMetadata().setQuotedTweetId( - documentFeatures.getFeatureValue(EarlybirdFieldConstant.QUOTED_TWEET_ID_CSF)); - metadata.getExtraMetadata().setQuotedUserId( - documentFeatures.getFeatureValue(EarlybirdFieldConstant.QUOTED_USER_ID_CSF)); - - int cardLangValue = - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.CARD_LANG_CSF); - ThriftLanguage thriftLanguage = ThriftLanguage.findByValue(cardLangValue); - metadata.getExtraMetadata().setCardLang(thriftLanguage); - - long cardNumericUri = - (long) documentFeatures.getFeatureValue(EarlybirdFieldConstant.CARD_URI_CSF); - if (cardNumericUri > 0) { - metadata.getExtraMetadata().setCardUri(String.format("card://%s", cardNumericUri)); - } - } - } - - protected void collectIsProtected( - ThriftSearchResultMetadata metadata, EarlybirdCluster cluster, UserTable userTable) - throws IOException { - // 'isUserProtected' field is only set for archive cluster because only archive cluster user - // table has IS_PROTECTED_BIT populated. - // Since this bit is checked after UserFlagsExcludeFilter checked this bit, there is a slight - // chance that this bit is updated in-between. When that happens, it is possible that we will - // see a small number of protected Tweets in the response when we meant to exclude them. 
- if (cluster == EarlybirdCluster.FULL_ARCHIVE) { - ensureExtraMetadataIsSet(metadata); - long userId = documentFeatures.getFeatureValue(EarlybirdFieldConstant.FROM_USER_ID_CSF); - boolean isProtected = userTable.isSet(userId, UserTable.IS_PROTECTED_BIT); - metadata.getExtraMetadata().setIsUserProtected(isProtected); - } - } - - protected void collectExclusiveConversationAuthorId(ThriftSearchResultMetadata metadata) - throws IOException { - if (searchRequestInfo.isCollectExclusiveConversationAuthorId()) { - long exclusiveConversationAuthorId = documentFeatures.getFeatureValue( - EarlybirdFieldConstant.EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF); - if (exclusiveConversationAuthorId != 0L) { - ensureExtraMetadataIsSet(metadata); - metadata.getExtraMetadata().setExclusiveConversationAuthorId(exclusiveConversationAuthorId); - } - } - } - - // It only makes sense to collectFacets for search types that return individual results (recency, - // relevance and top_tweets), which use the AbstractRelevanceCollector and SearchResultsCollector, - // so this method should only be called from these classes. - protected void collectFacets(ThriftSearchResultMetadata metadata) { - if (currTwitterReader == null) { - return; - } - - AbstractFacetCountingArray facetCountingArray = currTwitterReader.getFacetCountingArray(); - EarlybirdIndexSegmentData segmentData = currTwitterReader.getSegmentData(); - - if (facetCountingArray == null || facetCollector == null) { - return; - } - - facetCollector.resetFacetLabelProviders( - segmentData.getFacetLabelProviders(), - segmentData.getFacetIDMap()); - - facetCountingArray.collectForDocId(curDocId, facetCollector); - - List labels = facetCollector.getLabels(); - if (labels.size() > 0) { - metadata.setFacetLabels(labels); - } - } - - protected void ensureExtraMetadataIsSet(ThriftSearchResultMetadata metadata) { - if (!metadata.isSetExtraMetadata()) { - metadata.setExtraMetadata(new ThriftSearchResultExtraMetadata()); - } - } - - @Override - protected final void doFinishSegment(int lastSearchedDocID) { - if (shouldCollectDetailedDebugInfo()) { - long timeSpentSearchingSegmentInMillis = getClock().nowMillis() - segmentStartTime; - debugInfo.add("Finished segment at doc id: " + lastSearchedDocID); - debugInfo.add("Time spent searching " + currTimeSliceID - + ": " + timeSpentSearchingSegmentInMillis + "ms"); - debugInfo.add("Number of hits collected in segment " + currTimeSliceID + ": " - + numHitsCollectedPerSegment); - } - - if (!currTwitterReader.hasDocs()) { - // Due to race between the reader and the indexing thread, a seemingly empty segment that - // does not have document committed in the posting lists, might already have a document - // inserted into the id/time mappers, which we do not want to take into account. - // If there are no documents in the segment, we don't update searched min/max ids to - // anything. - return; - } else if (lastSearchedDocID == DocIdSetIterator.NO_MORE_DOCS) { - // Segment exhausted. 
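      // The whole segment was searched, so the searched ID/time range is extended
      // using the segment-level bounds (its smallest tweet ID and first time) rather
      // than a per-document value.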
- if (shouldCollectDetailedDebugInfo()) { - debugInfo.add("Segment exhausted"); - } - updateIDandTimeRanges(tweetIdMapper.getMinTweetID(), timeMapper.getFirstTime(), - IdAndRangeUpdateType.END_SEGMENT); - } else if (lastSearchedDocID >= 0) { - long lastSearchedTweetID = tweetIdMapper.getTweetID(lastSearchedDocID); - int lastSearchTweetTime = timeMapper.getTime(lastSearchedDocID); - if (shouldCollectDetailedDebugInfo()) { - debugInfo.add("lastSearchedDocId=" + lastSearchedDocID); - } - updateIDandTimeRanges(lastSearchedTweetID, lastSearchTweetTime, - IdAndRangeUpdateType.END_SEGMENT); - } - - numHitsCollectedPerSegment = 0; - } - - private void updateIDandTimeRanges(long tweetID, int time, IdAndRangeUpdateType updateType) { - // We need to update minSearchedStatusID/maxSearchedStatusID and - // minSearchedTime/maxSearchedTime independently: SEARCH-6139 - minSearchedStatusID = Math.min(minSearchedStatusID, tweetID); - maxSearchedStatusID = Math.max(maxSearchedStatusID, tweetID); - if (time > 0) { - minSearchedTime = Math.min(minSearchedTime, time); - maxSearchedTime = Math.max(maxSearchedTime, time); - } - if (shouldCollectVerboseDebugInfo()) { - debugInfo.add( - String.format("call to updateIDandTimeRanges(%d, %d, %s)" - + " set minSearchStatusID=%d, maxSearchedStatusID=%d," - + " minSearchedTime=%d, maxSearchedTime=%d)", - tweetID, time, updateType.toString(), - minSearchedStatusID, maxSearchedStatusID, - minSearchedTime, maxSearchedTime)); - } - } - - /** - * This is called when a segment is skipped but we would want to do accounting - * for minSearchDocId as well as numDocsProcessed. - */ - public void skipSegment(EarlybirdSingleSegmentSearcher searcher) throws IOException { - setNextReader(searcher.getTwitterIndexReader().getContext()); - trackCompleteSegment(DocIdSetIterator.NO_MORE_DOCS); - if (shouldCollectDetailedDebugInfo()) { - debugInfo.add("Skipping segment: " + currTimeSliceID); - } - } - - /** - * Returns the results collected by this collector. - */ - public final S getResults() throws IOException { - // In order to make pagination work, if minSearchedStatusID is greater than the asked max_id. - // We force the minSearchedStatusID to be max_id + 1. - IdTimeRanges idTimeRanges = searchRequestInfo.getIdTimeRanges(); - if (idTimeRanges != null) { - Optional maxIDInclusive = idTimeRanges.getMaxIDInclusive(); - if (maxIDInclusive.isPresent() && minSearchedStatusID > maxIDInclusive.get()) { - searcherStats.numCollectorAdjustedMinSearchedStatusID.increment(); - minSearchedStatusID = maxIDInclusive.get() + 1; - } - } - - S results = doGetResults(); - results.setNumHitsProcessed((int) getNumHitsProcessed()); - results.setNumSearchedSegments(getNumSearchedSegments()); - if (searchedStatusIDsAndTimesInitialized()) { - results.setMaxSearchedStatusID(maxSearchedStatusID); - results.setMinSearchedStatusID(minSearchedStatusID); - results.setMaxSearchedTime(maxSearchedTime); - results.setMinSearchedTime(minSearchedTime); - } - results.setEarlyTerminated(getEarlyTerminationState().isTerminated()); - if (getEarlyTerminationState().isTerminated()) { - results.setEarlyTerminationReason(getEarlyTerminationState().getTerminationReason()); - } - Map counts = getHitCountMap(); - if (counts != null) { - results.hitCounts.putAll(counts); - } - return results; - } - - /** - * Returns a map of timestamps (specified in the query) to the number of hits that are more recent - * that the respective timestamps. 
- */ - public final Map getHitCountMap() { - int total = 0; - if (hitCounts == null) { - return null; - } - Map map = Maps.newHashMap(); - // since the array is incremental, need to aggregate here. - for (int i = 0; i < hitCounts.length; ++i) { - map.put(hitCountsThresholdsMsec.get(i), total += hitCounts[i]); - } - return map; - } - - /** - * Common helper for collecting per-field hit attribution data (if it's available). - * - * @param metadata the metadata to fill for this hit. - */ - protected final void fillHitAttributionMetadata(ThriftSearchResultMetadata metadata) { - if (searchRequestInfo.getHitAttributeHelper() == null) { - return; - } - - Map> hitAttributeMapping = - searchRequestInfo.getHitAttributeHelper().getHitAttribution(curDocId); - Preconditions.checkNotNull(hitAttributeMapping); - - FieldHitAttribution fieldHitAttribution = new FieldHitAttribution(); - for (Map.Entry> entry : hitAttributeMapping.entrySet()) { - FieldHitList fieldHitList = new FieldHitList(); - fieldHitList.setHitFields(entry.getValue()); - - fieldHitAttribution.putToHitMap(entry.getKey(), fieldHitList); - } - metadata.setFieldHitAttribution(fieldHitAttribution); - } - - /** - * Fill the geo location of the given document in metadata, if we have the lat/lon for it. - * For queries that specify a geolocation, this will also have the distance from - * the location specified in the query, and the location of this document. - */ - protected final void fillResultGeoLocation(ThriftSearchResultMetadata metadata) - throws IOException { - Preconditions.checkNotNull(metadata); - if (currTwitterReader != null && fillInLatLonForHits) { - // See if we can have a lat/lon for this doc. - if (resultGeoCoordinate == null) { - resultGeoCoordinate = new GeoCoordinate(); - } - // Only fill if necessary - if (searchRequestInfo.isCollectResultLocation() - && GeoUtil.decodeLatLonFromInt64( - documentFeatures.getFeatureValue(EarlybirdFieldConstant.LAT_LON_CSF_FIELD), - resultGeoCoordinate)) { - ThriftSearchResultGeoLocation resultLocation = new ThriftSearchResultGeoLocation(); - resultLocation.setLatitude(resultGeoCoordinate.getLatitude()); - resultLocation.setLongitude(resultGeoCoordinate.getLongitude()); - metadata.setResultLocation(resultLocation); - } - } - } - - @Override - public ScoreMode scoreMode() { - return ScoreMode.COMPLETE; - } - - private int terminationDocID = -1; - - @Override - protected void collectedEnoughResults() throws IOException { - // We find 'terminationDocID' once we collect enough results, so that we know the point at which - // we can stop searching. We must do this because with the unordered doc ID mapper, tweets - // are not ordered within a millisecond, so we must search the entire millisecond bucket before - // terminating the search, otherwise we could skip over tweets and have an incorrect - // minSearchedStatusID. - if (curDocId != -1 && terminationDocID == -1) { - long tweetId = tweetIdMapper.getTweetID(curDocId); - // We want to find the highest possible doc ID for this tweetId, so pass true. - boolean findMaxDocID = true; - terminationDocID = tweetIdMapper.findDocIdBound(tweetId, - findMaxDocID, - curDocId, - curDocId); - } - } - - @Override - protected boolean shouldTerminate() { - return curDocId >= terminationDocID; - } - - @Override - public List getDebugInfo() { - return debugInfo; - } - - protected boolean shouldCollectDetailedDebugInfo() { - return requestDebugMode >= 5; - } - - // Use this for per-result debug info. 
Useful for queries with no results - // or a very small number of results. - protected boolean shouldCollectVerboseDebugInfo() { - return requestDebugMode >= 6; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/AntiGamingFilter.java b/src/java/com/twitter/search/earlybird/search/AntiGamingFilter.java deleted file mode 100644 index fbb95443c..000000000 --- a/src/java/com/twitter/search/earlybird/search/AntiGamingFilter.java +++ /dev/null @@ -1,228 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.Comparator; -import java.util.HashSet; -import java.util.Set; -import java.util.SortedSet; -import java.util.TreeSet; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.commons.lang.mutable.MutableInt; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; - -import com.twitter.common_internal.collections.RandomAccessPriorityQueue; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.TwitterIndexSearcher; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -public class AntiGamingFilter { - private interface Acceptor { - boolean accept(int internalDocID) throws IOException; - } - - private NumericDocValues userReputation; - private NumericDocValues fromUserIDs; - - private final Query luceneQuery; - - private boolean termsExtracted = false; - private final Set queryTerms; - - // we ignore these user ids for anti-gaming filtering, because they were explicitly queried for - private Set segmentUserIDWhitelist = null; - // we gather the whitelisted userIDs from all segments here - private Set globalUserIDWhitelist = null; - - /** - * Used to track the number of occurrences of a particular user. - */ - private static final class UserCount - implements RandomAccessPriorityQueue.SignatureProvider { - private long userID; - private int count; - - @Override - public Long getSignature() { - return userID; - } - - @Override - public void clear() { - userID = 0; - count = 0; - } - } - - private static final Comparator USER_COUNT_COMPARATOR = - (d1, d2) -> d1.count == d2.count ? Long.compare(d1.userID, d2.userID) : d1.count - d2.count; - - private final RandomAccessPriorityQueue priorityQueue = - new RandomAccessPriorityQueue(1024, USER_COUNT_COMPARATOR) { - @Override - protected UserCount getSentinelObject() { - return new UserCount(); - } - }; - - private final Acceptor acceptor; - private final int maxHitsPerUser; - - /** - * Creates an AntiGamingFilter that either accepts or rejects tweets from all users. - * This method should only be called in tests. - * - * @param alwaysValue Determines if tweets should always be accepted or rejected. - * @return An AntiGamingFilter that either accepts or rejects tweets from all users. 
- */ - @VisibleForTesting - public static AntiGamingFilter newMock(boolean alwaysValue) { - return new AntiGamingFilter(alwaysValue) { - @Override - public void startSegment(EarlybirdIndexSegmentAtomicReader reader) { - } - }; - } - - private AntiGamingFilter(boolean alwaysValue) { - acceptor = internalDocID -> alwaysValue; - maxHitsPerUser = Integer.MAX_VALUE; - termsExtracted = true; - luceneQuery = null; - queryTerms = null; - } - - public AntiGamingFilter(int maxHitsPerUser, int maxTweepCred, Query luceneQuery) { - this.maxHitsPerUser = maxHitsPerUser; - this.luceneQuery = luceneQuery; - - if (maxTweepCred != -1) { - this.acceptor = internalDocID -> { - long userReputationVal = - userReputation.advanceExact(internalDocID) ? userReputation.longValue() : 0L; - return ((byte) userReputationVal > maxTweepCred) || acceptUser(internalDocID); - }; - } else { - this.acceptor = this::acceptUser; - } - - this.queryTerms = new HashSet<>(); - } - - public Set getUserIDWhitelist() { - return globalUserIDWhitelist; - } - - private boolean acceptUser(int internalDocID) throws IOException { - final long fromUserID = getUserId(internalDocID); - final MutableInt freq = new MutableInt(); - // try to increment UserCount for an user already exist in the priority queue. - boolean incremented = priorityQueue.incrementElement( - fromUserID, element -> freq.setValue(++element.count)); - - // If not incremented, it means the user node does not exist in the priority queue yet. - if (!incremented) { - priorityQueue.updateTop(element -> { - element.userID = fromUserID; - element.count = 1; - freq.setValue(element.count); - }); - } - - if (freq.intValue() <= maxHitsPerUser) { - return true; - } else if (segmentUserIDWhitelist == null) { - return false; - } - return segmentUserIDWhitelist.contains(fromUserID); - } - - /** - * Initializes this filter with the new feature source. This method should be called every time an - * earlybird searcher starts searching in a new segment. - * - * @param reader The reader for the new segment. - */ - public void startSegment(EarlybirdIndexSegmentAtomicReader reader) throws IOException { - if (!termsExtracted) { - extractTerms(reader); - } - - fromUserIDs = - reader.getNumericDocValues(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName()); - - // fill the id whitelist for the current segment. initialize lazily. - segmentUserIDWhitelist = null; - - SortedSet sortedFromUserDocIds = new TreeSet<>(); - for (Term t : queryTerms) { - if (t.field().equals(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName())) { - // Add the operand of the from_user_id operator to the whitelist - long fromUserID = LongTermAttributeImpl.copyBytesRefToLong(t.bytes()); - addUserToWhitelists(fromUserID); - } else if (t.field().equals(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName())) { - // For a [from X] filter, we need to find a document that has the from_user field set to X, - // and then we need to get the value of the from_user_id field for that document and add it - // to the whitelist. We can get the from_user_id value from the fromUserIDs NumericDocValues - // instance, but we need to traverse it in increasing order of doc IDs. So we add a doc ID - // for each term to a sorted set for now, and then we traverse it in increasing doc ID order - // and add the from_user_id values for those docs to the whitelist. 
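  // A minimal sketch of the forward-only constraint described above: Lucene doc-values
  // iterators can only advance, so doc IDs must be visited in increasing order, which is
  // why they are first gathered into a sorted set. The helper below is illustrative and
  // not part of the original class.
  private static java.util.List<Long> readValuesInDocIdOrderSketch(
      org.apache.lucene.index.NumericDocValues values,
      java.util.SortedSet<Integer> docIds) throws java.io.IOException {
    java.util.List<Long> result = new java.util.ArrayList<>();
    for (int docId : docIds) {            // ascending order is required by the iterator
      if (values.advanceExact(docId)) {   // positions the iterator on docId, if present
        result.add(values.longValue());
      }
    }
    return result;
  }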
- int firstInternalDocID = reader.getNewestDocID(t); - if (firstInternalDocID != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - sortedFromUserDocIds.add(firstInternalDocID); - } - } - } - - for (int fromUserDocId : sortedFromUserDocIds) { - addUserToWhitelists(getUserId(fromUserDocId)); - } - - userReputation = - reader.getNumericDocValues(EarlybirdFieldConstant.USER_REPUTATION.getFieldName()); - - // Reset the fromUserIDs NumericDocValues so that the acceptor can use it to iterate over docs. - fromUserIDs = - reader.getNumericDocValues(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName()); - } - - private void extractTerms(IndexReader reader) throws IOException { - Query query = luceneQuery; - for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query; - rewrittenQuery = query.rewrite(reader)) { - query = rewrittenQuery; - } - - // Create a new TwitterIndexSearcher instance here instead of an IndexSearcher instance, to use - // the TwitterIndexSearcher.collectionStatistics() implementation. - query.createWeight(new TwitterIndexSearcher(reader), ScoreMode.COMPLETE, 1.0f) - .extractTerms(queryTerms); - termsExtracted = true; - } - - public boolean accept(int internalDocID) throws IOException { - return acceptor.accept(internalDocID); - } - - private void addUserToWhitelists(long userID) { - if (this.segmentUserIDWhitelist == null) { - this.segmentUserIDWhitelist = new HashSet<>(); - } - if (this.globalUserIDWhitelist == null) { - this.globalUserIDWhitelist = new HashSet<>(); - } - this.segmentUserIDWhitelist.add(userID); - this.globalUserIDWhitelist.add(userID); - } - - @VisibleForTesting - protected long getUserId(int internalDocId) throws IOException { - return fromUserIDs.advanceExact(internalDocId) ? fromUserIDs.longValue() : 0L; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/EarlybirdLuceneSearcher.java b/src/java/com/twitter/search/earlybird/search/EarlybirdLuceneSearcher.java deleted file mode 100644 index 14f742eac..000000000 --- a/src/java/com/twitter/search/earlybird/search/EarlybirdLuceneSearcher.java +++ /dev/null @@ -1,98 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.Map; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.IndexSearcher; - -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.earlybird.EarlybirdSearcher; -import com.twitter.search.earlybird.search.facets.AbstractFacetTermCollector; -import com.twitter.search.earlybird.search.facets.FacetResultsCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsCollector.TermStatisticsSearchResults; -import com.twitter.search.earlybird.search.facets.TermStatisticsRequestInfo; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; - -public abstract class EarlybirdLuceneSearcher extends IndexSearcher { - public EarlybirdLuceneSearcher(IndexReader r) { - super(r); - } - - /** - * Fills facet information for all given search results. - * - * @param collector A collector that knows how collect facet information. - * @param searchResults The search results. 
- */ - public abstract void fillFacetResults( - AbstractFacetTermCollector collector, ThriftSearchResults searchResults) - throws IOException; - - /** - * Fills metadata for all given facet results. - * - * @param facetResults The facet results. - * @param schema The earlybird schema. - * @param debugMode The debug mode for the request that yielded these results. - */ - public abstract void fillFacetResultMetadata( - Map facetResults, - ImmutableSchemaInterface schema, - byte debugMode) throws IOException; - - /** - * Fills metadata for all given term stats results. - * - * @param termStatsResults The term stats results. - * @param schema The earlybird schema. - * @param debugMode The debug mode for the request that yielded these results. - */ - public abstract void fillTermStatsMetadata( - ThriftTermStatisticsResults termStatsResults, - ImmutableSchemaInterface schema, - byte debugMode) throws IOException; - - /** - * Returns the results for the given term stats request. - * - * @param searchRequestInfo Stores the original term stats request and some other useful request - * information. - * @param searcher The searcher that should be used to execute the request. - * @param requestDebugMode The debug mode for this request. - * @return The term stats results for the given request. - */ - public abstract TermStatisticsSearchResults collectTermStatistics( - TermStatisticsRequestInfo searchRequestInfo, - EarlybirdSearcher searcher, - int requestDebugMode) throws IOException; - - /** - * Writes an explanation for the given hits into the given ThriftSearchResults instance. - * - * @param searchRequestInfo Stores the original request and some other useful request context. - * @param hits The hits. - * @param searchResults The ThriftSearchResults where the explanation for the given hits will be - * stored. - */ - // Writes explanations into the searchResults thrift. 
- public abstract void explainSearchResults(SearchRequestInfo searchRequestInfo, - SimpleSearchResults hits, - ThriftSearchResults searchResults) throws IOException; - - public static class FacetSearchResults extends SearchResultsInfo { - private FacetResultsCollector collector; - - public FacetSearchResults(FacetResultsCollector collector) { - this.collector = collector; - } - - public ThriftFacetFieldResults getFacetResults(String facetName, int topK) { - return collector.getFacetResults(facetName, topK); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/EarlybirdMultiSegmentSearcher.java b/src/java/com/twitter/search/earlybird/search/EarlybirdMultiSegmentSearcher.java deleted file mode 100644 index 61bd21ef1..000000000 --- a/src/java/com/twitter/search/earlybird/search/EarlybirdMultiSegmentSearcher.java +++ /dev/null @@ -1,254 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.HashSet; -import java.util.LinkedHashMap; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.MultiReader; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.Collector; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.earlybird.EarlybirdSearcher; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.index.TweetIDMapper; -import com.twitter.search.earlybird.search.facets.AbstractFacetTermCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsCollector; -import com.twitter.search.earlybird.search.facets.TermStatisticsCollector.TermStatisticsSearchResults; -import com.twitter.search.earlybird.search.facets.TermStatisticsRequestInfo; -import com.twitter.search.earlybird.search.queries.SinceMaxIDFilter; -import com.twitter.search.earlybird.search.queries.SinceUntilFilter; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; -import com.twitter.search.queryparser.util.IdTimeRanges; - -public class EarlybirdMultiSegmentSearcher extends EarlybirdLuceneSearcher { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdMultiSegmentSearcher.class); - - private final ImmutableSchemaInterface schema; - private final Map segmentSearchers; - protected final int numSegments; - private final Clock clock; - - // This will prevent us from even considering segments that are out of range. - // It's an important optimization for a certain class of queries. 
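  // The essence of that optimization, as a self-contained sketch: a segment whose tweet ID
  // range does not overlap the query's (since_id, max_id] range can be skipped without
  // executing the query against it. The real checks are SinceMaxIDFilter.sinceMaxIDsInRange
  // and SinceUntilFilter.sinceUntilTimesInRange (used in shouldSkipSegment below); the
  // parameter names here are illustrative.
  private static boolean segmentOverlapsQuerySketch(
      long segmentMinTweetId, long segmentMaxTweetId,
      long sinceIdExclusive, long maxIdInclusive) {
    // Keep the segment only if some tweet ID in [segmentMinTweetId, segmentMaxTweetId]
    // can satisfy since_id < id <= max_id.
    return segmentMaxTweetId > sinceIdExclusive && segmentMinTweetId <= maxIdInclusive;
  }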
- protected IdTimeRanges idTimeRanges = null; - - private final EarlybirdSearcherStats searcherStats; - - public EarlybirdMultiSegmentSearcher( - ImmutableSchemaInterface schema, - List searchers, - EarlybirdSearcherStats searcherStats, - Clock clock) throws IOException { - // NOTE: We pass in an empty MultiReader to super and retain the list of searchers in this - // class since MultiReader does not allow an aggregate of more than Integer.MAX_VALUE docs, - // which some of our larger archive indexes may have. - super(new MultiReader()); - // segmentSearchers are mapped from time slice IDs to searchers so that we can quickly - // find the correct searcher for a given time slice ID (see fillPayload). - // make sure we maintain order of segments, hence a LinkedHashMap instead of just a HashMap - this.segmentSearchers = new LinkedHashMap<>(); - this.schema = schema; - for (EarlybirdSingleSegmentSearcher searcher : searchers) { - if (searcher != null) { - long timeSliceID = searcher.getTimeSliceID(); - this.segmentSearchers.put(timeSliceID, searcher); - } - } - // initializing this after populating the list. previously initialized before, and - // this may have lead to a race condition, although this doesn't seem possible given - // that segments should be an immutable cloned list. - this.numSegments = segmentSearchers.size(); - - this.searcherStats = searcherStats; - this.clock = clock; - } - - public void setIdTimeRanges(IdTimeRanges idTimeRanges) { - this.idTimeRanges = idTimeRanges; - } - - @Override - protected void search(List unusedLeaves, Weight weight, Collector coll) - throws IOException { - Preconditions.checkState(coll instanceof AbstractResultsCollector); - AbstractResultsCollector collector = (AbstractResultsCollector) coll; - - for (EarlybirdSingleSegmentSearcher segmentSearcher : segmentSearchers.values()) { - if (shouldSkipSegment(segmentSearcher)) { - collector.skipSegment(segmentSearcher); - } else { - segmentSearcher.search(weight.getQuery(), collector); - if (collector.isTerminated()) { - break; - } - } - } - } - - @VisibleForTesting - protected boolean shouldSkipSegment(EarlybirdSingleSegmentSearcher segmentSearcher) { - EarlybirdIndexSegmentData segmentData = - segmentSearcher.getTwitterIndexReader().getSegmentData(); - if (idTimeRanges != null) { - if (!SinceMaxIDFilter.sinceMaxIDsInRange( - (TweetIDMapper) segmentData.getDocIDToTweetIDMapper(), - idTimeRanges.getSinceIDExclusive().or(SinceMaxIDFilter.NO_FILTER), - idTimeRanges.getMaxIDInclusive().or(SinceMaxIDFilter.NO_FILTER)) - || !SinceUntilFilter.sinceUntilTimesInRange( - segmentData.getTimeMapper(), - idTimeRanges.getSinceTimeInclusive().or(SinceUntilFilter.NO_FILTER), - idTimeRanges.getUntilTimeExclusive().or(SinceUntilFilter.NO_FILTER))) { - return true; - } - } - return false; - } - - @Override - public void fillFacetResults( - AbstractFacetTermCollector collector, ThriftSearchResults searchResults) throws IOException { - for (EarlybirdSingleSegmentSearcher segmentSearcher : segmentSearchers.values()) { - segmentSearcher.fillFacetResults(collector, searchResults); - } - } - - @Override - public TermStatisticsSearchResults collectTermStatistics( - TermStatisticsRequestInfo searchRequestInfo, - EarlybirdSearcher searcher, - int requestDebugMode) throws IOException { - TermStatisticsCollector collector = new TermStatisticsCollector( - schema, searchRequestInfo, searcherStats, clock, requestDebugMode); - search(collector.getSearchRequestInfo().getLuceneQuery(), collector); - 
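    // Let the top-level EarlybirdSearcher pick up any debug info the collector gathered
    // (only populated for requests with a sufficiently high debug level).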
searcher.maybeSetCollectorDebugInfo(collector); - return collector.getResults(); - } - - @Override - public void explainSearchResults(SearchRequestInfo searchRequestInfo, - SimpleSearchResults hits, ThriftSearchResults searchResults) throws IOException { - for (EarlybirdSingleSegmentSearcher segmentSearcher : segmentSearchers.values()) { - // the hits that are getting passed into this method are hits across - // all searched segments. We need to get the per segment hits and - // generate explanations one segment at a time. - List hitsForCurrentSegment = new ArrayList<>(); - Set tweetIdsForCurrentSegment = new HashSet<>(); - List hitResultsForCurrentSegment = new ArrayList<>(); - - for (Hit hit : hits.hits) { - if (hit.getTimeSliceID() == segmentSearcher.getTimeSliceID()) { - hitsForCurrentSegment.add(hit); - tweetIdsForCurrentSegment.add(hit.statusID); - } - } - for (ThriftSearchResult result : searchResults.getResults()) { - if (tweetIdsForCurrentSegment.contains(result.id)) { - hitResultsForCurrentSegment.add(result); - } - } - ThriftSearchResults resultsForSegment = new ThriftSearchResults() - .setResults(hitResultsForCurrentSegment); - - SimpleSearchResults finalHits = new SimpleSearchResults(hitsForCurrentSegment); - segmentSearcher.explainSearchResults(searchRequestInfo, finalHits, resultsForSegment); - } - // We should not see hits that are not associated with an active segment - List hitsWithUnknownSegment = - Arrays.stream(hits.hits()).filter(hit -> !hit.isHasExplanation()) - .collect(Collectors.toList()); - for (Hit hit : hitsWithUnknownSegment) { - LOG.error("Unable to find segment associated with hit: " + hit.toString()); - } - } - - @Override - public void fillFacetResultMetadata(Map facetResults, - ImmutableSchemaInterface documentSchema, byte debugMode) - throws IOException { - for (EarlybirdSingleSegmentSearcher segmentSearcher : segmentSearchers.values()) { - segmentSearcher.fillFacetResultMetadata(facetResults, documentSchema, debugMode); - } - } - - @Override - public void fillTermStatsMetadata(ThriftTermStatisticsResults termStatsResults, - ImmutableSchemaInterface documentSchema, byte debugMode) - throws IOException { - for (EarlybirdSingleSegmentSearcher segmentSearcher : segmentSearchers.values()) { - segmentSearcher.fillTermStatsMetadata(termStatsResults, documentSchema, debugMode); - } - } - - /** - * The searchers for individual segments will rewrite the query as they see fit, so the multi - * segment searcher does not need to rewrite it. In fact, not rewriting the query here improves - * the request latency by ~5%. - */ - @Override - public Query rewrite(Query original) { - return original; - } - - /** - * The searchers for individual segments will create their own weights. This method only creates - * a dummy weight to pass the Lucene query to the search() method of these individual segment - * searchers. - */ - @Override - public Weight createWeight(Query query, ScoreMode scoreMode, float boost) { - return new DummyWeight(query); - } - - /** - * Dummy weight used solely to pass Lucene Query around. 
- */ - private static final class DummyWeight extends Weight { - private DummyWeight(Query luceneQuery) { - super(luceneQuery); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) { - throw new UnsupportedOperationException(); - } - - @Override - public Scorer scorer(LeafReaderContext context) { - throw new UnsupportedOperationException(); - } - - @Override - public void extractTerms(Set terms) { - throw new UnsupportedOperationException(); - } - - @Override - public boolean isCacheable(LeafReaderContext context) { - return true; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/GeoQuadTreeQueryBuilder.java b/src/java/com/twitter/search/earlybird/search/GeoQuadTreeQueryBuilder.java deleted file mode 100644 index e0fc9f8cb..000000000 --- a/src/java/com/twitter/search/earlybird/search/GeoQuadTreeQueryBuilder.java +++ /dev/null @@ -1,199 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.LinkedHashSet; -import java.util.Set; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.Query; -import org.apache.lucene.spatial.prefix.tree.Cell; -import org.apache.lucene.spatial.prefix.tree.CellIterator; -import org.apache.lucene.util.BytesRef; -import org.locationtech.spatial4j.shape.Rectangle; - -import com.twitter.search.common.query.MultiTermDisjunctionQuery; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.GeoQuadTreeQueryBuilderUtil; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.util.spatial.BoundingBox; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.common.util.spatial.GeohashChunkImpl; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.search.queries.GeoTwoPhaseQuery; -import com.twitter.search.earlybird.search.queries.GeoTwoPhaseQuery.SecondPhaseDocAccepter; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.util.GeoCode; - -import geo.google.datamodel.GeoCoordinate; - -/** - * A class that builds queries to query the quadtree. - */ -public final class GeoQuadTreeQueryBuilder { - private GeoQuadTreeQueryBuilder() { - } - - /** - * Returns a GeoTwoPhaseQuery for the given geocode. - */ - public static Query buildGeoQuadTreeQuery(final GeoCode geocode) { - return buildGeoQuadTreeQuery(geocode, null); - } - - /** - * Returns a GeoTwoPhaseQuery for the given geocode. - * - * @param geocode The geocode. - * @param terminationTracker The tracker that determines when the query needs to terminate. - */ - public static Query buildGeoQuadTreeQuery(GeoCode geocode, - TerminationTracker terminationTracker) { - Query geoHashDisjuntiveQuery = GeoQuadTreeQueryBuilderUtil.buildGeoQuadTreeQuery( - geocode, EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName()); - - // 5. Create post filtering accepter - final SecondPhaseDocAccepter accepter = (geocode.distanceKm != GeoCode.DOUBLE_DISTANCE_NOT_SET) - ? new CenterRadiusAccepter(geocode.latitude, geocode.longitude, geocode.distanceKm) - : GeoTwoPhaseQuery.ALL_DOCS_ACCEPTER; - - return new GeoTwoPhaseQuery(geoHashDisjuntiveQuery, accepter, terminationTracker); - } - - /** - * Construct a query as below: - * 1. Compute all quadtree cells that intersects the bounding box. - * 2. 
Create a disjunction of the geohashes of all the intersecting cells. - * 3. Add a filter to only keep points inside the giving bounding box. - */ - public static Query buildGeoQuadTreeQuery(final Rectangle boundingBox, - final TerminationTracker terminationTracker) - throws QueryParserException { - // 1. Locate the main quadtree cell---the cell containing the bounding box's center point whose - // diagonal is just longer than the bounding box's diagonal. - final Cell centerCell = GeohashChunkImpl.getGeoNodeByBoundingBox(boundingBox); - - // 2. Determine quadtree level to search. - int treeLevel = -1; - if (centerCell != null) { - treeLevel = centerCell.getLevel(); - } else { - // This should not happen. - throw new QueryParserException( - "Unable to locate quadtree cell containing the given bounding box." - + "Bounding box is: " + boundingBox); - } - - // 3. get all quadtree cells at treeLevel that intersects the given bounding box. - CellIterator intersectingCells = - GeohashChunkImpl.getNodesIntersectingBoundingBox(boundingBox, treeLevel); - - // 4. Construct disjunction query - final Set geoHashSet = new LinkedHashSet<>(); - - // Add center node - geoHashSet.add(centerCell.getTokenBytesNoLeaf(new BytesRef())); - // If there are other nodes intersecting query circle, also add them in. - if (intersectingCells != null) { - while (intersectingCells.hasNext()) { - geoHashSet.add(intersectingCells.next().getTokenBytesNoLeaf(new BytesRef())); - } - } - MultiTermDisjunctionQuery geoHashDisjuntiveQuery = new MultiTermDisjunctionQuery( - EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), geoHashSet); - - // 5. Create post filtering accepter - final GeoDocAccepter accepter = new BoundingBoxAccepter(boundingBox); - - return new GeoTwoPhaseQuery(geoHashDisjuntiveQuery, accepter, terminationTracker); - } - - private abstract static class GeoDocAccepter extends SecondPhaseDocAccepter { - private NumericDocValues latLonDocValues; - private final GeoCoordinate geoCoordReuse = new GeoCoordinate(); - - @Override - public void initialize(LeafReaderContext context) throws IOException { - final EarlybirdIndexSegmentAtomicReader reader = - (EarlybirdIndexSegmentAtomicReader) context.reader(); - latLonDocValues = - reader.getNumericDocValues(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName()); - } - - // Decides whether a point should be accepted. - protected abstract boolean acceptPoint(double lat, double lon); - - // Decides whether a document should be accepted based on its geo coordinates. - @Override - public final boolean accept(int internalDocId) throws IOException { - // Cannot obtain valid geo coordinates for the document. Not acceptable. - if (latLonDocValues == null - || !latLonDocValues.advanceExact(internalDocId) - || !GeoUtil.decodeLatLonFromInt64(latLonDocValues.longValue(), geoCoordReuse)) { - return false; - } - - return acceptPoint(geoCoordReuse.getLatitude(), geoCoordReuse.getLongitude()); - } - } - - // Accepts points within a circle defined by a center point and a radius. 
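  // The acceptance test below boils down to "great-circle distance from the query center is
  // less than radiusKm". The production code delegates that distance to
  // BoundingBox.approxDistanceC; purely as an illustration, the standard haversine formula
  // computes such a distance from two lat/lon pairs given in degrees:
  private static double haversineDistanceKmSketch(
      double lat1, double lon1, double lat2, double lon2) {
    final double earthRadiusKm = 6371.0;
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * earthRadiusKm * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
  }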
- private static final class CenterRadiusAccepter extends GeoDocAccepter { - private final double centerLat; - private final double centerLon; - private final double radiusKm; - - public CenterRadiusAccepter(double centerLat, double centerLon, double radiusKm) { - this.centerLat = centerLat; - this.centerLon = centerLon; - this.radiusKm = radiusKm; - } - - @Override - protected boolean acceptPoint(double lat, double lon) { - double actualDistance = - BoundingBox.approxDistanceC(centerLat, centerLon, lat, lon); - if (actualDistance < radiusKm) { - return true; - } else if (Double.isNaN(actualDistance)) { - // There seems to be a rare bug in GeoUtils that computes NaN - // for two identical lat/lon pairs on occasion. Check for that here. - if (lat == centerLat && lon == centerLon) { - return true; - } - } - - return false; - } - - @Override - public String toString() { - return String.format("CenterRadiusAccepter(Center: %.4f, %.4f Radius (km): %.4f)", - centerLat, centerLon, radiusKm); - } - } - - // Accepts points within a BoundingBox - private static final class BoundingBoxAccepter extends GeoDocAccepter { - private final Rectangle boundingBox; - - public BoundingBoxAccepter(Rectangle boundingBox) { - this.boundingBox = boundingBox; - } - - @Override - protected boolean acceptPoint(double lat, double lon) { - return GeohashChunkImpl.isPointInBoundingBox(lat, lon, boundingBox); - - } - - @Override - public String toString() { - return String.format("PointInBoundingBoxAccepter((%.4f, %.4f), (%.4f, %.4f), " - + "crossesDateLine=%b)", - boundingBox.getMinY(), boundingBox.getMinX(), - boundingBox.getMaxY(), boundingBox.getMaxX(), - boundingBox.getCrossesDateLine()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/Hit.java b/src/java/com/twitter/search/earlybird/search/Hit.java deleted file mode 100644 index c8abb5043..000000000 --- a/src/java/com/twitter/search/earlybird/search/Hit.java +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.search.earlybird.search; - -import javax.annotation.Nullable; - -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -/** - * Class that abstracts a document that matches a query we're processing in Earlybird. - */ -public class Hit implements Comparable { - protected long timeSliceID; - protected long statusID; - private boolean hasExplanation; - - @Nullable - protected ThriftSearchResultMetadata metadata; - - public Hit(long timeSliceID, long statusID) { - this.timeSliceID = timeSliceID; - this.statusID = statusID; - this.metadata = null; - } - - public long getTimeSliceID() { - return timeSliceID; - } - - public long getStatusID() { - return statusID; - } - - @Nullable - public ThriftSearchResultMetadata getMetadata() { - return metadata; - } - - public void setMetadata(ThriftSearchResultMetadata metadata) { - this.metadata = metadata; - } - - @Override - public int compareTo(Hit other) { - return -Long.compare(this.statusID, other.statusID); - } - - @Override - public String toString() { - return "Hit[tweetID=" + statusID + ",timeSliceID=" + timeSliceID - + ",score=" + (metadata == null ? 
"null" : metadata.getScore()) + "]"; - } - - public boolean isHasExplanation() { - return hasExplanation; - } - - public void setHasExplanation(boolean hasExplanation) { - this.hasExplanation = hasExplanation; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SearchRequestInfo.java b/src/java/com/twitter/search/earlybird/search/SearchRequestInfo.java deleted file mode 100644 index 51fdf7935..000000000 --- a/src/java/com/twitter/search/earlybird/search/SearchRequestInfo.java +++ /dev/null @@ -1,180 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.util.List; -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.Query; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.earlybird.QualityFactor; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.queryparser.util.IdTimeRanges; - -public class SearchRequestInfo { - private final ThriftSearchQuery searchQuery; - private final Query luceneQuery; - private final boolean collectConversationId; - private final boolean collectResultLocation; - private final boolean getInReplyToStatusId; - private final boolean getReferenceAuthorId; - private final boolean getFromUserId; - private final boolean collectExclusiveConversationAuthorId; - - private final int numResultsRequested; - private final int maxHitsToProcess; - private final List facetFieldNames; - private long timestamp; - - private final TerminationTracker terminationTracker; - - protected final QualityFactor qualityFactor; - - // Set if we want to collect per-field hit attributes for this request. 
- @Nullable - private HitAttributeHelper hitAttributeHelper; - - private IdTimeRanges idTimeRanges; - - private static final int DEFAULT_MAX_HITS = 1000; - - private static final SearchCounter RESET_MAX_HITS_TO_PROCESS_COUNTER = - SearchCounter.export("search_request_info_reset_max_hits_to_process"); - - public SearchRequestInfo( - ThriftSearchQuery searchQuery, - Query luceneQuery, - TerminationTracker terminationTracker) { - this(searchQuery, luceneQuery, terminationTracker, null); - } - - public SearchRequestInfo( - ThriftSearchQuery searchQuery, - Query luceneQuery, - TerminationTracker terminationTracker, - QualityFactor qualityFactor) { - Preconditions.checkNotNull(searchQuery.getCollectorParams()); - Preconditions.checkNotNull(terminationTracker); - - this.searchQuery = searchQuery; - this.luceneQuery = luceneQuery; - this.collectConversationId = searchQuery.isCollectConversationId(); - if (searchQuery.isSetResultMetadataOptions()) { - this.collectResultLocation = searchQuery.getResultMetadataOptions().isGetResultLocation(); - this.getInReplyToStatusId = searchQuery.getResultMetadataOptions().isGetInReplyToStatusId(); - this.getReferenceAuthorId = - searchQuery.getResultMetadataOptions().isGetReferencedTweetAuthorId(); - this.getFromUserId = searchQuery.getResultMetadataOptions().isGetFromUserId(); - this.collectExclusiveConversationAuthorId = - searchQuery.getResultMetadataOptions().isGetExclusiveConversationAuthorId(); - } else { - this.collectResultLocation = false; - this.getInReplyToStatusId = false; - this.getReferenceAuthorId = false; - this.getFromUserId = false; - this.collectExclusiveConversationAuthorId = false; - } - - this.qualityFactor = qualityFactor; - - this.numResultsRequested = searchQuery.getCollectorParams().getNumResultsToReturn(); - this.maxHitsToProcess = calculateMaxHitsToProcess(searchQuery); - this.terminationTracker = terminationTracker; - this.facetFieldNames = searchQuery.getFacetFieldNames(); - } - - /** - * Gets the value to be used as max hits to process for this query. The base class gets it from - * the searchQuery directly, and uses a default if that's not set. - * - * Subclasses can override this to compute a different value for max hits to process. - */ - protected int calculateMaxHitsToProcess(ThriftSearchQuery thriftSearchQuery) { - int maxHits = thriftSearchQuery.getCollectorParams().isSetTerminationParams() - ? 
thriftSearchQuery.getCollectorParams().getTerminationParams().getMaxHitsToProcess() : 0; - - if (maxHits <= 0) { - maxHits = DEFAULT_MAX_HITS; - RESET_MAX_HITS_TO_PROCESS_COUNTER.increment(); - } - return maxHits; - } - - public final ThriftSearchQuery getSearchQuery() { - return this.searchQuery; - } - - public Query getLuceneQuery() { - return luceneQuery; - } - - public final int getNumResultsRequested() { - return numResultsRequested; - } - - public final int getMaxHitsToProcess() { - return maxHitsToProcess; - } - - public boolean isCollectConversationId() { - return collectConversationId; - } - - public boolean isCollectResultLocation() { - return collectResultLocation; - } - - public boolean isGetInReplyToStatusId() { - return getInReplyToStatusId; - } - - public boolean isGetReferenceAuthorId() { - return getReferenceAuthorId; - } - - public boolean isCollectExclusiveConversationAuthorId() { - return collectExclusiveConversationAuthorId; - } - - public final IdTimeRanges getIdTimeRanges() { - return idTimeRanges; - } - - public SearchRequestInfo setIdTimeRanges(IdTimeRanges newIdTimeRanges) { - this.idTimeRanges = newIdTimeRanges; - return this; - } - - public SearchRequestInfo setTimestamp(long newTimestamp) { - this.timestamp = newTimestamp; - return this; - } - - public long getTimestamp() { - return timestamp; - } - - public TerminationTracker getTerminationTracker() { - return this.terminationTracker; - } - - @Nullable - public HitAttributeHelper getHitAttributeHelper() { - return hitAttributeHelper; - } - - public void setHitAttributeHelper(@Nullable HitAttributeHelper hitAttributeHelper) { - this.hitAttributeHelper = hitAttributeHelper; - } - - public List getFacetFieldNames() { - return facetFieldNames; - } - - public boolean isGetFromUserId() { - return getFromUserId; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SearchResultsCollector.java b/src/java/com/twitter/search/earlybird/search/SearchResultsCollector.java deleted file mode 100644 index fff5d1f0f..000000000 --- a/src/java/com/twitter/search/earlybird/search/SearchResultsCollector.java +++ /dev/null @@ -1,188 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.common.util.LongIntConverter; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; - -/** - * This class collects results for Recency queries for delegation to collectors based on query mode - */ -public class SearchResultsCollector - extends AbstractResultsCollector { - private static final 
EarlyTerminationState TERMINATED_COLLECTED_ENOUGH_RESULTS = - new EarlyTerminationState("terminated_collected_enough_results", true); - - protected final List results; - private final Set requestedFeatureIds; - private final EarlybirdCluster cluster; - private final UserTable userTable; - - public SearchResultsCollector( - ImmutableSchemaInterface schema, - SearchRequestInfo searchRequestInfo, - Clock clock, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - int requestDebugMode) { - super(schema, searchRequestInfo, clock, searcherStats, requestDebugMode); - results = new ArrayList<>(); - this.cluster = cluster; - this.userTable = userTable; - - ThriftSearchResultMetadataOptions options = - searchRequestInfo.getSearchQuery().getResultMetadataOptions(); - if (options != null && options.isReturnSearchResultFeatures()) { - requestedFeatureIds = schema.getSearchFeatureSchema().getEntries().keySet(); - } else if (options != null && options.isSetRequestedFeatureIDs()) { - requestedFeatureIds = new HashSet<>(options.getRequestedFeatureIDs()); - } else { - requestedFeatureIds = null; - } - } - - @Override - public void startSegment() throws IOException { - featuresRequested = requestedFeatureIds != null; - } - - @Override - public void doCollect(long tweetID) throws IOException { - Hit hit = new Hit(currTimeSliceID, tweetID); - ThriftSearchResultMetadata metadata = - new ThriftSearchResultMetadata(ThriftSearchResultType.RECENCY) - .setPenguinVersion(EarlybirdConfig.getPenguinVersionByte()); - - // Set tweet language in metadata - ThriftLanguage thriftLanguage = ThriftLanguage.findByValue( - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.LANGUAGE)); - metadata.setLanguage(thriftLanguage); - - // Check and collect hit attribution data, if it's available. 
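    // (When the request configured a HitAttributeHelper, fillHitAttributionMetadata copies
    // its per-hit mapping of matched query nodes to the fields they matched into the
    // result's FieldHitAttribution; otherwise it is a no-op.)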
- fillHitAttributionMetadata(metadata); - - // Set the nullcast flag in metadata - metadata.setIsNullcast(documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG)); - - if (searchRequestInfo.isCollectConversationId()) { - long conversationId = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.CONVERSATION_ID_CSF); - if (conversationId != 0) { - ensureExtraMetadataIsSet(metadata); - metadata.getExtraMetadata().setConversationId(conversationId); - } - } - - fillResultGeoLocation(metadata); - collectRetweetAndReplyMetadata(metadata); - - long fromUserId = documentFeatures.getFeatureValue(EarlybirdFieldConstant.FROM_USER_ID_CSF); - if (requestedFeatureIds != null) { - ThriftSearchResultFeatures features = documentFeatures.getSearchResultFeatures( - getSchema(), requestedFeatureIds::contains); - ensureExtraMetadataIsSet(metadata); - metadata.getExtraMetadata().setFeatures(features); - metadata.setFromUserId(fromUserId); - if (documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_CARD_FLAG)) { - metadata.setCardType( - (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.CARD_TYPE_CSF_FIELD)); - } - } - if (searchRequestInfo.isGetFromUserId()) { - metadata.setFromUserId(fromUserId); - } - - collectExclusiveConversationAuthorId(metadata); - collectFacets(metadata); - collectFeatures(metadata); - collectIsProtected(metadata, cluster, userTable); - hit.setMetadata(metadata); - results.add(hit); - updateHitCounts(tweetID); - } - - private final void collectRetweetAndReplyMetadata(ThriftSearchResultMetadata metadata) - throws IOException { - if (searchRequestInfo.isGetInReplyToStatusId() || searchRequestInfo.isGetReferenceAuthorId()) { - boolean isRetweet = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_RETWEET_FLAG); - boolean isReply = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_REPLY_FLAG); - // Set the isRetweet and isReply metadata so that clients who request retweet and reply - // metadata know whether a result is a retweet or reply or neither. - metadata.setIsRetweet(isRetweet); - metadata.setIsReply(isReply); - - // Only store the shared status id if the hit is a reply or a retweet and - // the getInReplyToStatusId flag is set. - if (searchRequestInfo.isGetInReplyToStatusId() && (isReply || isRetweet)) { - long sharedStatusID = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.SHARED_STATUS_ID_CSF); - if (sharedStatusID != 0) { - metadata.setSharedStatusId(sharedStatusID); - } - } - - // Only store the reference tweet author ID if the hit is a reply or a retweet and the - // getReferenceAuthorId flag is set. - if (searchRequestInfo.isGetReferenceAuthorId() && (isReply || isRetweet)) { - // the REFERENCE_AUTHOR_ID_CSF stores the source tweet author id for all retweets - long referenceAuthorId = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF); - if (referenceAuthorId != 0) { - metadata.setReferencedTweetAuthorId(referenceAuthorId); - } else if (cluster != EarlybirdCluster.FULL_ARCHIVE) { - // we also store the reference author id for retweets, directed at tweets, and self - // threaded tweets separately on Realtime/Protected Earlybirds. This data will be moved to - // the REFERENCE_AUTHOR_ID_CSF and these fields will be deprecated in SEARCH-34958. 
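          // LongIntConverter recombines the two stored 32-bit halves into one user ID.
          // Conceptually (a sketch of the conventional packing, not the converter's code):
          //   long id = ((long) mostSignificantInt << 32) | (leastSignificantInt & 0xFFFFFFFFL);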
- referenceAuthorId = LongIntConverter.convertTwoIntToOneLong( - (int) documentFeatures.getFeatureValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT), - (int) documentFeatures.getFeatureValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT)); - if (referenceAuthorId > 0) { - metadata.setReferencedTweetAuthorId(referenceAuthorId); - } - } - } - } - } - - /** - * This differs from base class because we check against num results collected instead of - * num hits collected. - */ - @Override - public EarlyTerminationState innerShouldCollectMore() throws IOException { - if (results.size() >= searchRequestInfo.getNumResultsRequested()) { - collectedEnoughResults(); - if (shouldTerminate()) { - return setEarlyTerminationState(TERMINATED_COLLECTED_ENOUGH_RESULTS); - } - } - return EarlyTerminationState.COLLECTING; - } - - @Override - public SimpleSearchResults doGetResults() { - // Sort hits by tweet id. - Collections.sort(results); - return new SimpleSearchResults(results); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SearchResultsInfo.java b/src/java/com/twitter/search/earlybird/search/SearchResultsInfo.java deleted file mode 100644 index ff19de98d..000000000 --- a/src/java/com/twitter/search/earlybird/search/SearchResultsInfo.java +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.util.Map; - -import com.google.common.collect.Maps; - -import com.twitter.search.earlybird.search.queries.SinceMaxIDFilter; - -public class SearchResultsInfo { - public static final long NO_ID = SinceMaxIDFilter.NO_FILTER; - public static final int NO_TIME = -1; - - private int numHitsProcessed = 0; - private int numSearchedSegments = 0; - - private boolean earlyTerminated = false; - private String earlyTerminationReason = null; - - private long maxSearchedStatusID = NO_ID; - private long minSearchedStatusID = NO_ID; - - private int maxSearchedTime = NO_TIME; - private int minSearchedTime = NO_TIME; - - // Map from time thresholds (in milliseconds) to number of results more recent than this period. 
- protected final Map hitCounts = Maps.newHashMap(); - - public final int getNumHitsProcessed() { - return numHitsProcessed; - } - - public final void setNumHitsProcessed(int numHitsProcessed) { - this.numHitsProcessed = numHitsProcessed; - } - - public final int getNumSearchedSegments() { - return numSearchedSegments; - } - - public final void setNumSearchedSegments(int numSearchedSegments) { - this.numSearchedSegments = numSearchedSegments; - } - - public final long getMaxSearchedStatusID() { - return maxSearchedStatusID; - } - - public final long getMinSearchedStatusID() { - return minSearchedStatusID; - } - - public final int getMaxSearchedTime() { - return maxSearchedTime; - } - - public final int getMinSearchedTime() { - return minSearchedTime; - } - - public boolean isSetSearchedStatusIDs() { - return maxSearchedStatusID != NO_ID && minSearchedStatusID != NO_ID; - } - - public boolean isSetSearchedTimes() { - return maxSearchedTime != NO_TIME && minSearchedTime != NO_TIME; - } - - public void setMaxSearchedStatusID(long maxSearchedStatusID) { - this.maxSearchedStatusID = maxSearchedStatusID; - } - - public void setMinSearchedStatusID(long minSearchedStatusID) { - this.minSearchedStatusID = minSearchedStatusID; - } - - public void setMaxSearchedTime(int maxSearchedTime) { - this.maxSearchedTime = maxSearchedTime; - } - - public void setMinSearchedTime(int minSearchedTime) { - this.minSearchedTime = minSearchedTime; - } - - public void setEarlyTerminated(boolean earlyTerminated) { - this.earlyTerminated = earlyTerminated; - } - - public boolean isEarlyTerminated() { - return earlyTerminated; - } - - public String getEarlyTerminationReason() { - return earlyTerminationReason; - } - - public void setEarlyTerminationReason(String earlyTerminationReason) { - this.earlyTerminationReason = earlyTerminationReason; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SimpleSearchResults.java b/src/java/com/twitter/search/earlybird/search/SimpleSearchResults.java deleted file mode 100644 index e3e0894fd..000000000 --- a/src/java/com/twitter/search/earlybird/search/SimpleSearchResults.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.util.List; - -public class SimpleSearchResults extends SearchResultsInfo { - protected Hit[] hits; - protected int numHits; - - public SimpleSearchResults(int size) { - this.hits = new Hit[size]; - this.numHits = 0; - } - - public SimpleSearchResults(List hits) { - this.hits = new Hit[hits.size()]; - this.numHits = hits.size(); - hits.toArray(this.hits); - } - - public Hit[] hits() { - return hits; - } - - public int numHits() { - return numHits; - } - - public void setNumHits(int numHits) { - this.numHits = numHits; - } - - public Hit getHit(int hitIndex) { - return hits[hitIndex]; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SocialFilter.java b/src/java/com/twitter/search/earlybird/search/SocialFilter.java deleted file mode 100644 index 9761171bc..000000000 --- a/src/java/com/twitter/search/earlybird/search/SocialFilter.java +++ /dev/null @@ -1,98 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; - -import com.google.common.base.Preconditions; -import com.google.common.primitives.Longs; - -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.common_internal.bloomfilter.BloomFilter; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import 
com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.thrift.ThriftSocialFilterType; - -/** - * Filter class used by the SearchResultsCollector to filter social tweets - * from the hits. - */ -public class SocialFilter { - private interface Acceptor { - boolean accept(long fromUserLong, byte[] userIDInBytes); - } - - private NumericDocValues fromUserID; - private final Acceptor acceptor; - private final long searcherId; - private final BloomFilter trustedFilter; - private final BloomFilter followFilter; - - private class FollowsAcceptor implements Acceptor { - @Override - public boolean accept(long fromUserLong, byte[] userIdInBytes) { - return followFilter.contains(userIdInBytes); - } - } - - private class TrustedAcceptor implements Acceptor { - @Override - public boolean accept(long fromUserLong, byte[] userIdInBytes) { - return trustedFilter.contains(userIdInBytes); - } - } - - private class AllAcceptor implements Acceptor { - @Override - public boolean accept(long fromUserLong, byte[] userIdInBytes) { - return trustedFilter.contains(userIdInBytes) - || followFilter.contains(userIdInBytes) - || fromUserLong == searcherId; - } - } - - public SocialFilter( - ThriftSocialFilterType socialFilterType, - final long searcherId, - final byte[] trustedFilter, - final byte[] followFilter) throws IOException { - Preconditions.checkNotNull(socialFilterType); - Preconditions.checkNotNull(trustedFilter); - Preconditions.checkNotNull(followFilter); - this.searcherId = searcherId; - this.trustedFilter = new BloomFilter(trustedFilter); - this.followFilter = new BloomFilter(followFilter); - - - switch (socialFilterType) { - case FOLLOWS: - this.acceptor = new FollowsAcceptor(); - break; - case TRUSTED: - this.acceptor = new TrustedAcceptor(); - break; - case ALL: - this.acceptor = new AllAcceptor(); - break; - default: - throw new UnsupportedOperationException("Invalid social filter type passed"); - } - } - - public void startSegment(EarlybirdIndexSegmentAtomicReader indexReader) throws IOException { - fromUserID = - indexReader.getNumericDocValues(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName()); - } - - /** - * Determines if the given doc ID should be accepted. - */ - public boolean accept(int internalDocID) throws IOException { - if (!fromUserID.advanceExact(internalDocID)) { - return false; - } - - long fromUserLong = fromUserID.longValue(); - byte[] userIDInBytes = Longs.toByteArray(fromUserLong); - return acceptor.accept(fromUserLong, userIDInBytes); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/SocialSearchResultsCollector.java b/src/java/com/twitter/search/earlybird/search/SocialSearchResultsCollector.java deleted file mode 100644 index 170db4faa..000000000 --- a/src/java/com/twitter/search/earlybird/search/SocialSearchResultsCollector.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.earlybird.search; - -import java.io.IOException; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; - -/** - * A SearchResultsCollector that passes each hit through an optional SocialFilter, - * collecting only tweets posted by accounts in the searcher's social graph (followed or trusted users).
- */ -public class SocialSearchResultsCollector extends SearchResultsCollector { - - private final SocialFilter socialFilter; - - public SocialSearchResultsCollector( - ImmutableSchemaInterface schema, - SearchRequestInfo searchRequestInfo, - SocialFilter socialFilter, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - int requestDebugMode) { - super(schema, searchRequestInfo, Clock.SYSTEM_CLOCK, searcherStats, cluster, userTable, - requestDebugMode); - this.socialFilter = socialFilter; - } - - @Override - public final void doCollect(long tweetID) throws IOException { - if (socialFilter == null || socialFilter.accept(curDocId)) { - results.add(new Hit(currTimeSliceID, tweetID)); - } - } - - @Override - public void startSegment() throws IOException { - if (socialFilter != null) { - socialFilter.startSegment(currTwitterReader); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/AbstractFacetTermCollector.java b/src/java/com/twitter/search/earlybird/search/facets/AbstractFacetTermCollector.java deleted file mode 100644 index eb07d3fd1..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/AbstractFacetTermCollector.java +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.Map; -import java.util.Set; - -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetTermCollector; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -public abstract class AbstractFacetTermCollector implements FacetTermCollector { - private Map facetLabelProviders; - private FacetIDMap facetIdMap; - - /** - * Populates the given ThriftSearchResult instance with the results collected by this collector - * and clears all collected results in this collector. - * - * @param result The ThriftSearchResult instance to be populated with the results collected in - * this collector. - */ - public abstract void fillResultAndClear(ThriftSearchResult result); - - public void resetFacetLabelProviders( - Map facetLabelProvidersToReset, FacetIDMap facetIdMapToReset) { - this.facetLabelProviders = facetLabelProvidersToReset; - this.facetIdMap = facetIdMapToReset; - } - - String findFacetName(int fieldId) { - return fieldId < 0 ? 
null : facetIdMap.getFacetFieldByFacetID(fieldId).getFacetName(); - } - - protected ThriftSearchResultExtraMetadata getExtraMetadata(ThriftSearchResult result) { - ThriftSearchResultMetadata metadata = result.getMetadata(); - if (!metadata.isSetExtraMetadata()) { - metadata.setExtraMetadata(new ThriftSearchResultExtraMetadata()); - } - return metadata.getExtraMetadata(); - } - - protected String getTermFromProvider( - String facetName, long termID, FacetLabelProvider provider) { - return provider.getLabelAccessor().getTermText(termID); - } - - protected String getTermFromFacet(long termID, int fieldID, Set facetsToCollectFrom) { - if (termID == EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - return null; - } - - String facetName = findFacetName(fieldID); - if (!facetsToCollectFrom.contains(facetName)) { - return null; - } - - final FacetLabelProvider provider = facetLabelProviders.get(facetName); - if (provider == null) { - return null; - } - - return getTermFromProvider(facetName, termID, provider); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/DefaultFacetScorer.java b/src/java/com/twitter/search/earlybird/search/facets/DefaultFacetScorer.java deleted file mode 100644 index 729d6ea24..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/DefaultFacetScorer.java +++ /dev/null @@ -1,236 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.ranking.thriftjava.ThriftFacetEarlybirdSortingMode; -import com.twitter.search.common.ranking.thriftjava.ThriftFacetRankingOptions; -import com.twitter.search.common.relevance.features.EarlybirdDocumentFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.core.earlybird.facets.FacetAccumulator; -import com.twitter.search.core.earlybird.facets.FacetCountIterator; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.facets.FacetResultsCollector.Accumulator; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; - -public class DefaultFacetScorer extends FacetScorer { - private static final Logger LOG = LoggerFactory.getLogger(FacetScorer.class.getName()); - private static final double DEFAULT_FEATURE_WEIGHT = 0.0; - private static final byte DEFAULT_PENALTY = 1; - - private static final byte DEFAULT_REPUTATION_MIN = 45; - - private final AntiGamingFilter antiGamingFilter; - - // tweepcreds below this value will not be counted at all - private final byte reputationMinFilterThresholdVal; - - // tweepcreds between reputationMinFilterThresholdVal and this value will be counted - // with a score of 1 - private final byte reputationMinScoreVal; - - private final double userRepWeight; - private final double favoritesWeight; - private final double parusWeight; - private final double parusBase; - private final double queryIndependentPenaltyWeight; - - private final ThriftLanguage uiLang; - private final double langEnglishUIBoost; - private final double langEnglishFacetBoost; - private final double langDefaultBoost; - - private final int antigamingPenalty; - private 
final int offensiveTweetPenalty; - private final int multipleHashtagsOrTrendsPenalty; - - private final int maxScorePerTweet; - private final ThriftFacetEarlybirdSortingMode sortingMode; - - private EarlybirdIndexSegmentAtomicReader reader; - private EarlybirdDocumentFeatures features; - - /** - * Creates a new facet scorer. - */ - public DefaultFacetScorer(ThriftSearchQuery searchQuery, - ThriftFacetRankingOptions rankingOptions, - AntiGamingFilter antiGamingFilter, - ThriftFacetEarlybirdSortingMode sortingMode) { - this.sortingMode = sortingMode; - this.antiGamingFilter = antiGamingFilter; - - maxScorePerTweet = - rankingOptions.isSetMaxScorePerTweet() - ? rankingOptions.getMaxScorePerTweet() - : Integer.MAX_VALUE; - - // filters - reputationMinFilterThresholdVal = - rankingOptions.isSetMinTweepcredFilterThreshold() - ? (byte) (rankingOptions.getMinTweepcredFilterThreshold() & 0xFF) - : DEFAULT_REPUTATION_MIN; - - // weights - // reputationMinScoreVal must be >= reputationMinFilterThresholdVal - reputationMinScoreVal = - (byte) Math.max(rankingOptions.isSetReputationParams() - ? (byte) rankingOptions.getReputationParams().getMin() - : DEFAULT_REPUTATION_MIN, reputationMinFilterThresholdVal); - - parusWeight = - rankingOptions.isSetParusScoreParams() && rankingOptions.getParusScoreParams().isSetWeight() - ? rankingOptions.getParusScoreParams().getWeight() - : DEFAULT_FEATURE_WEIGHT; - // compute this once so that base ** parusScore is backwards-compatible - parusBase = Math.sqrt(1 + parusWeight); - - userRepWeight = - rankingOptions.isSetReputationParams() && rankingOptions.getReputationParams().isSetWeight() - ? rankingOptions.getReputationParams().getWeight() - : DEFAULT_FEATURE_WEIGHT; - - favoritesWeight = - rankingOptions.isSetFavoritesParams() && rankingOptions.getFavoritesParams().isSetWeight() - ? rankingOptions.getFavoritesParams().getWeight() - : DEFAULT_FEATURE_WEIGHT; - - queryIndependentPenaltyWeight = - rankingOptions.isSetQueryIndependentPenaltyWeight() - ? rankingOptions.getQueryIndependentPenaltyWeight() - : DEFAULT_FEATURE_WEIGHT; - - // penalty increment - antigamingPenalty = - rankingOptions.isSetAntigamingPenalty() - ? rankingOptions.getAntigamingPenalty() - : DEFAULT_PENALTY; - - offensiveTweetPenalty = - rankingOptions.isSetOffensiveTweetPenalty() - ? rankingOptions.getOffensiveTweetPenalty() - : DEFAULT_PENALTY; - - multipleHashtagsOrTrendsPenalty = - rankingOptions.isSetMultipleHashtagsOrTrendsPenalty() - ? 
rankingOptions.getMultipleHashtagsOrTrendsPenalty() - : DEFAULT_PENALTY; - - // query information - if (!searchQuery.isSetUiLang() || searchQuery.getUiLang().isEmpty()) { - uiLang = ThriftLanguage.UNKNOWN; - } else { - uiLang = ThriftLanguageUtil.getThriftLanguageOf(searchQuery.getUiLang()); - } - langEnglishUIBoost = rankingOptions.getLangEnglishUIBoost(); - langEnglishFacetBoost = rankingOptions.getLangEnglishFacetBoost(); - langDefaultBoost = rankingOptions.getLangDefaultBoost(); - } - - @Override - protected void startSegment(EarlybirdIndexSegmentAtomicReader segmentReader) throws IOException { - reader = segmentReader; - features = new EarlybirdDocumentFeatures(reader); - if (antiGamingFilter != null) { - antiGamingFilter.startSegment(reader); - } - } - - @Override - public void incrementCounts(Accumulator accumulator, int internalDocID) throws IOException { - FacetCountIterator.IncrementData data = accumulator.accessor.incrementData; - data.accumulators = accumulator.accumulators; - features.advance(internalDocID); - - // Also keep track of the tweet language of tweet themselves. - data.languageId = (int) features.getFeatureValue(EarlybirdFieldConstant.LANGUAGE); - - if (antigamingPenalty > 0 - && antiGamingFilter != null - && !antiGamingFilter.accept(internalDocID)) { - data.weightedCountIncrement = 0; - data.penaltyIncrement = antigamingPenalty; - data.tweepCred = 0; - accumulator.accessor.collect(internalDocID); - return; - } - - if (offensiveTweetPenalty > 0 && features.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG)) { - data.weightedCountIncrement = 0; - data.penaltyIncrement = offensiveTweetPenalty; - data.tweepCred = 0; - accumulator.accessor.collect(internalDocID); - return; - } - - byte userRep = (byte) features.getFeatureValue(EarlybirdFieldConstant.USER_REPUTATION); - - if (userRep < reputationMinFilterThresholdVal) { - // don't penalize - data.weightedCountIncrement = 0; - data.penaltyIncrement = 0; - data.tweepCred = 0; - accumulator.accessor.collect(internalDocID); - return; - } - - // Other non-terminating penalties - int penalty = 0; - if (multipleHashtagsOrTrendsPenalty > 0 - && features.isFlagSet(EarlybirdFieldConstant.HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG)) { - penalty += multipleHashtagsOrTrendsPenalty; - } - - double parus = 0xFF & (byte) features.getFeatureValue(EarlybirdFieldConstant.PARUS_SCORE); - - double score = Math.pow(1 + userRepWeight, Math.max(0, userRep - reputationMinScoreVal)); - - if (parus > 0) { - score += Math.pow(parusBase, parus); - } - - int favoriteCount = - (int) features.getUnnormalizedFeatureValue(EarlybirdFieldConstant.FAVORITE_COUNT); - if (favoriteCount > 0) { - score += favoriteCount * favoritesWeight; - } - - // Language preferences - int tweetLinkLangId = (int) features.getFeatureValue(EarlybirdFieldConstant.LINK_LANGUAGE); - if (tweetLinkLangId == ThriftLanguage.UNKNOWN.getValue()) { - // fall back to use the tweet language itself. 
- tweetLinkLangId = (int) features.getFeatureValue(EarlybirdFieldConstant.LANGUAGE); - } - if (uiLang != ThriftLanguage.UNKNOWN && uiLang.getValue() != tweetLinkLangId) { - if (uiLang == ThriftLanguage.ENGLISH) { - score *= langEnglishUIBoost; - } else if (tweetLinkLangId == ThriftLanguage.ENGLISH.getValue()) { - score *= langEnglishFacetBoost; - } else { - score *= langDefaultBoost; - } - } - - // make sure a single tweet can't contribute too high a score - if (score > maxScorePerTweet) { - score = maxScorePerTweet; - } - - data.weightedCountIncrement = (int) score; - data.penaltyIncrement = penalty; - data.tweepCred = userRep & 0xFF; - accumulator.accessor.collect(internalDocID); - } - - @Override - public FacetAccumulator getFacetAccumulator(FacetLabelProvider labelProvider) { - return new HashingAndPruningFacetAccumulator(labelProvider, queryIndependentPenaltyWeight, - HashingAndPruningFacetAccumulator.getComparator(sortingMode)); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/EntityAnnotationCollector.java b/src/java/com/twitter/search/earlybird/search/facets/EntityAnnotationCollector.java deleted file mode 100644 index 81e07a718..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/EntityAnnotationCollector.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.List; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; - -import com.twitter.escherbird.thriftjava.TweetEntityAnnotation; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; - -public class EntityAnnotationCollector extends AbstractFacetTermCollector { - private List annotations = Lists.newArrayList(); - - @Override - public boolean collect(int docID, long termID, int fieldID) { - - String term = getTermFromFacet(termID, fieldID, - Sets.newHashSet(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName())); - if (StringUtils.isEmpty(term)) { - return false; - } - - String[] idParts = term.split("\\."); - - // Only include the full three-part form of the entity ID: "groupId.domainId.entityId" - // Exclude the less-specific forms we index: "domainId.entityId" and "entityId" - if (idParts.length < 3) { - return false; - } - - annotations.add(new TweetEntityAnnotation( - Long.valueOf(idParts[0]), - Long.valueOf(idParts[1]), - Long.valueOf(idParts[2]))); - - return true; - } - - @Override - public void fillResultAndClear(ThriftSearchResult result) { - getExtraMetadata(result).setEntityAnnotations(ImmutableList.copyOf(annotations)); - annotations.clear(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/ExpandedUrlCollector.java b/src/java/com/twitter/search/earlybird/search/facets/ExpandedUrlCollector.java deleted file mode 100644 index 65721747f..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/ExpandedUrlCollector.java +++ /dev/null @@ -1,118 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.LinkedHashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableSet; - -import org.apache.lucene.util.BytesRef; - -import 
com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultUrl; -import com.twitter.service.spiderduck.gen.MediaTypes; - -/** - * A collector for collecting expanded urls from facets. Note that the only thing connecting this - * collector with expanded URLs is the fact that we only store the expanded url in the facet fields. - */ -public class ExpandedUrlCollector extends AbstractFacetTermCollector { - private static final ImmutableSet FACET_CONTAINS_URL = ImmutableSet.of( - EarlybirdFieldConstant.VIDEOS_FACET, - EarlybirdFieldConstant.IMAGES_FACET, - EarlybirdFieldConstant.NEWS_FACET, - EarlybirdFieldConstant.LINKS_FACET, - EarlybirdFieldConstant.TWIMG_FACET); - - private final Map dedupedUrls = new LinkedHashMap<>(); - - - @Override - protected String getTermFromProvider( - String facetName, - long termID, - FacetLabelProvider provider) { - String url = null; - if (EarlybirdFieldConstant.TWIMG_FACET.equals(facetName)) { - // Special case extraction of media url for twimg. - FacetLabelProvider.FacetLabelAccessor photoAccessor = provider.getLabelAccessor(); - BytesRef termPayload = photoAccessor.getTermPayload(termID); - if (termPayload != null) { - url = termPayload.utf8ToString(); - } - } else { - url = provider.getLabelAccessor().getTermText(termID); - } - return url; - } - - @Override - public boolean collect(int docID, long termID, int fieldID) { - - String url = getTermFromFacet(termID, fieldID, FACET_CONTAINS_URL); - if (url == null || url.isEmpty()) { - return false; - } - - ThriftSearchResultUrl resultUrl = new ThriftSearchResultUrl(); - resultUrl.setOriginalUrl(url); - MediaTypes mediaType = getMediaType(findFacetName(fieldID)); - resultUrl.setMediaType(mediaType); - - // Media links will show up twice: - // - once in image/native_image/video/news facets - // - another time in the links facet - // - // For those urls, we only want to return the media version. If it is non-media version, only - // write to map if doesn't exist already, if media version, overwrite any previous entries. - if (mediaType == MediaTypes.UNKNOWN) { - if (!dedupedUrls.containsKey(url)) { - dedupedUrls.put(url, resultUrl); - } - } else { - dedupedUrls.put(url, resultUrl); - } - - return true; - } - - @Override - public void fillResultAndClear(ThriftSearchResult result) { - result.getMetadata().setTweetUrls(getExpandedUrls()); - dedupedUrls.clear(); - } - - @VisibleForTesting - List getExpandedUrls() { - return ImmutableList.copyOf(dedupedUrls.values()); - } - - /** - * Gets the Spiderduck media type for a given facet name. - * - * @param facetName A given facet name. - * @return {@code MediaTypes} enum corresponding to the facet name. 
- */ - private static MediaTypes getMediaType(String facetName) { - if (facetName == null) { - return MediaTypes.UNKNOWN; - } - - switch (facetName) { - case EarlybirdFieldConstant.TWIMG_FACET: - return MediaTypes.NATIVE_IMAGE; - case EarlybirdFieldConstant.IMAGES_FACET: - return MediaTypes.IMAGE; - case EarlybirdFieldConstant.VIDEOS_FACET: - return MediaTypes.VIDEO; - case EarlybirdFieldConstant.NEWS_FACET: - return MediaTypes.NEWS; - default: - return MediaTypes.UNKNOWN; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/ExplainFacetResultsCollector.java b/src/java/com/twitter/search/earlybird/search/facets/ExplainFacetResultsCollector.java deleted file mode 100644 index 76dc918c7..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/ExplainFacetResultsCollector.java +++ /dev/null @@ -1,159 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftFacetResults; - -public class ExplainFacetResultsCollector extends FacetResultsCollector { - private static final Logger LOG = - LoggerFactory.getLogger(ExplainFacetResultsCollector.class.getName()); - - protected final List> proofs; - protected final Map>> proofAccumulators; - - protected Map facetLabelProviders; - private FacetIDMap facetIDMap; - - /** - * Creates a new facet collector with the ability to provide explanations for the search results. 
- */ - public ExplainFacetResultsCollector( - ImmutableSchemaInterface schema, - FacetSearchRequestInfo searchRequestInfo, - AntiGamingFilter antiGamingFilter, - EarlybirdSearcherStats searcherStats, - Clock clock, - int requestDebugMode) throws IOException { - super(schema, searchRequestInfo, antiGamingFilter, searcherStats, clock, requestDebugMode); - - proofs = new ArrayList<>(128); - - proofAccumulators = Maps.newHashMap(); - for (Schema.FieldInfo facetField : schema.getFacetFields()) { - HashMap> fieldLabelToTweetIdsMap = new HashMap<>(); - proofAccumulators.put(facetField.getFieldType().getFacetName(), fieldLabelToTweetIdsMap); - } - } - - @Override - protected Accumulator newPerSegmentAccumulator(EarlybirdIndexSegmentAtomicReader indexReader) { - Accumulator accumulator = super.newPerSegmentAccumulator(indexReader); - accumulator.accessor.setProofs(proofs); - facetLabelProviders = indexReader.getFacetLabelProviders(); - facetIDMap = indexReader.getFacetIDMap(); - - return accumulator; - } - - @Override - public void doCollect(long tweetID) throws IOException { - proofs.clear(); - - // FacetResultsCollector.doCollect() calls FacetScorer.incrementCounts(), - // FacetResultsCollector.doCollect() creates a FacetResultsCollector.Accumulator, if - // necessary, which contains the accessor (a CompositeFacetIterator) and accumulators - // (FacetAccumulator of each field) - super.doCollect(tweetID); - - for (Pair fieldIdTermIdPair : proofs) { - int fieldID = fieldIdTermIdPair.getFirst(); - long termID = fieldIdTermIdPair.getSecond(); - - // Convert term ID to the term text, a.k.a. facet label - String facetName = facetIDMap.getFacetFieldByFacetID(fieldID).getFacetName(); - if (facetName != null) { - String facetLabel = facetLabelProviders.get(facetName) - .getLabelAccessor().getTermText(termID); - - List tweetIDs = proofAccumulators.get(facetName).get(facetLabel); - if (tweetIDs == null) { - tweetIDs = new ArrayList<>(); - proofAccumulators.get(facetName).put(facetLabel, tweetIDs); - } - - tweetIDs.add(tweetID); - } - } - - // clear it again just to be sure - proofs.clear(); - } - - /** - * Sets explanations for the facet results. - */ - public void setExplanations(ThriftFacetResults facetResults) { - StringBuilder explanation = new StringBuilder(); - - for (Map.Entry facetFieldResultsEntry - : facetResults.getFacetFields().entrySet()) { - String facetName = facetFieldResultsEntry.getKey(); - ThriftFacetFieldResults facetFieldResults = facetFieldResultsEntry.getValue(); - - Map> proofAccumulator = proofAccumulators.get(facetName); - - if (proofAccumulator == null) { - // did not accumulate explanation for this facet type? a bug? - LOG.warn("No explanation accumulated for facet type " + facetName); - continue; - } - - for (ThriftFacetCount facetCount : facetFieldResults.getTopFacets()) { - String facetLabel = facetCount.getFacetLabel(); // a.k.a. term text - ThriftFacetCountMetadata metadata = facetCount.getMetadata(); - - List tweetIDs = proofAccumulator.get(facetLabel); - if (tweetIDs == null) { - // did not accumulate explanation for this facet label? a bug? 
- LOG.warn("No explanation accumulated for " + facetLabel + " of facet type " + facetName); - continue; - } - - explanation.setLength(0); - String oldExplanation = null; - if (metadata.isSetExplanation()) { - // save the old explanation from TwitterInMemoryIndexSearcher.fillTermMetadata() - oldExplanation = metadata.getExplanation(); - // as of 2012/05/29, we have 18 digits tweet IDs - explanation.ensureCapacity(oldExplanation.length() + (18 + 2) + 10); - } else { - // as of 2012/05/29, we have 18 digits tweet IDs - explanation.ensureCapacity(tweetIDs.size() * (18 + 2) + 10); - } - - explanation.append("["); - for (Long tweetID : tweetIDs) { - explanation.append(tweetID) - .append(", "); - } - explanation.setLength(explanation.length() - 2); // remove the last ", " - explanation.append("]\n"); - if (oldExplanation != null) { - explanation.append(oldExplanation); - } - metadata.setExplanation(explanation.toString()); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/FacetLabelCollector.java b/src/java/com/twitter/search/earlybird/search/facets/FacetLabelCollector.java deleted file mode 100644 index 7ea471582..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/FacetLabelCollector.java +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.ArrayList; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetTermCollector; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.thrift.ThriftFacetLabel; - -/** - * A collector for facet labels of given fields. 
- */ -public class FacetLabelCollector implements FacetTermCollector { - - private final Set requiredFields; - private FacetIDMap facetIDMap; - private Map facetLabelProviders; - - private final List labels = new ArrayList<>(); - - public FacetLabelCollector(Set requiredFields) { - this.requiredFields = requiredFields; - } - - public void resetFacetLabelProviders(Map facetLabelProvidersToReset, - FacetIDMap facetIDMapToReset) { - this.facetLabelProviders = facetLabelProvidersToReset; - this.facetIDMap = facetIDMapToReset; - labels.clear(); - } - - @Override - public boolean collect(int docID, long termID, int fieldID) { - String facetName = facetIDMap.getFacetFieldByFacetID(fieldID).getFacetName(); - if (facetName == null || !requiredFields.contains(facetName)) { - return false; - } - if (termID != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND && fieldID >= 0) { - final FacetLabelProvider provider = facetLabelProviders.get(facetName); - if (provider != null) { - FacetLabelProvider.FacetLabelAccessor labelAccessor = provider.getLabelAccessor(); - String label = labelAccessor.getTermText(termID); - int offensiveCount = labelAccessor.getOffensiveCount(termID); - labels.add(new ThriftFacetLabel() - .setFieldName(facetName) - .setLabel(label) - .setOffensiveCount(offensiveCount)); - return true; - } - } - return false; - } - - public List getLabels() { - // Make a copy - return new ArrayList<>(labels); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/FacetRankingModule.java b/src/java/com/twitter/search/earlybird/search/facets/FacetRankingModule.java deleted file mode 100644 index a32ac2253..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/FacetRankingModule.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.ArrayList; -import java.util.List; - -import com.twitter.search.core.earlybird.facets.FacetCountState; -import com.twitter.search.earlybird.search.EarlybirdLuceneSearcher; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; - -public abstract class FacetRankingModule { - public static final List REGISTERED_RANKING_MODULES = - new ArrayList<>(); - - static { - REGISTERED_RANKING_MODULES.add(new SimpleCountRankingModule()); - } - - /** - * Prepares the {@link com.twitter.search.earlybird.thrift.ThriftFacetFieldResults} - * in {@link FacetCountState} before they're returned. This extension point therefore allows - * post-processing the facet results, e.g. for re-ranking or sorting purposes. 
- */ - public abstract void prepareResults( - EarlybirdLuceneSearcher.FacetSearchResults hits, - FacetCountState facetCountState); -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/FacetResultsCollector.java b/src/java/com/twitter/search/earlybird/search/facets/FacetResultsCollector.java deleted file mode 100644 index ba6a920e0..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/FacetResultsCollector.java +++ /dev/null @@ -1,229 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.PriorityQueue; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.ranking.thriftjava.ThriftFacetEarlybirdSortingMode; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.core.earlybird.facets.DummyFacetAccumulator; -import com.twitter.search.core.earlybird.facets.FacetAccumulator; -import com.twitter.search.core.earlybird.facets.FacetCountIterator; -import com.twitter.search.core.earlybird.facets.FacetIDMap; -import com.twitter.search.core.earlybird.facets.FacetIDMap.FacetField; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.LanguageHistogram; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.search.AbstractResultsCollector; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.EarlybirdLuceneSearcher.FacetSearchResults; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; - -public class FacetResultsCollector extends - AbstractResultsCollector { - - private final FacetScorer facetScorer; - private final ThriftFacetEarlybirdSortingMode sortingMode; - - static class Accumulator { - protected final FacetAccumulator[] accumulators; - protected final FacetCountIterator accessor; - protected final FacetIDMap facetIDMap; - - Accumulator(FacetAccumulator[] accumulators, - FacetCountIterator accessor, - FacetIDMap facetIDMap) { - this.accumulators = accumulators; - this.accessor = accessor; - this.facetIDMap = facetIDMap; - } - - FacetAccumulator getFacetAccumulator(String facetName) { - FacetField facet = facetIDMap.getFacetFieldByFacetName(facetName); - return accumulators[facet.getFacetId()]; - } - } - - private Accumulator currentAccumulator; - private List segAccumulators; - private final HashingAndPruningFacetAccumulator.FacetComparator facetComparator; - - /** - * Creates a new FacetResultsCollector for the given facet search request. 
- */ - public FacetResultsCollector( - ImmutableSchemaInterface schema, - FacetSearchRequestInfo searchRequestInfo, - AntiGamingFilter antiGamingFilter, - EarlybirdSearcherStats searcherStats, - Clock clock, - int requestDebugInfo) { - super(schema, searchRequestInfo, clock, searcherStats, requestDebugInfo); - - if (searchRequestInfo.rankingOptions != null - && searchRequestInfo.rankingOptions.isSetSortingMode()) { - this.sortingMode = searchRequestInfo.rankingOptions.getSortingMode(); - } else { - this.sortingMode = ThriftFacetEarlybirdSortingMode.SORT_BY_WEIGHTED_COUNT; - } - - this.facetComparator = HashingAndPruningFacetAccumulator.getComparator(sortingMode); - this.facetScorer = createScorer(antiGamingFilter); - this.segAccumulators = new ArrayList<>(); - } - - @Override - public void startSegment() { - currentAccumulator = null; - } - - @Override - public void doCollect(long tweetID) throws IOException { - if (currentAccumulator == null) { - // Lazily create accumulators. Most segment / query / facet combinations have no hits. - currentAccumulator = newPerSegmentAccumulator(currTwitterReader); - segAccumulators.add(currentAccumulator); - facetScorer.startSegment(currTwitterReader); - } - facetScorer.incrementCounts(currentAccumulator, curDocId); - } - - @Override - public FacetSearchResults doGetResults() { - return new FacetSearchResults(this); - } - - /** - * Returns the top-k facet results for the requested facetName. - */ - public ThriftFacetFieldResults getFacetResults(String facetName, int topK) { - int totalCount = 0; - final Map map = new HashMap<>(); - - LanguageHistogram languageHistogram = new LanguageHistogram(); - - for (Accumulator segAccumulator : segAccumulators) { - FacetAccumulator accumulator = - segAccumulator.getFacetAccumulator(facetName); - Preconditions.checkNotNull(accumulator); - - ThriftFacetFieldResults results = accumulator.getAllFacets(); - if (results == null) { - continue; - } - - totalCount += results.totalCount; - - // merge language histograms from different segments - languageHistogram.addAll(accumulator.getLanguageHistogram()); - - for (ThriftFacetCount facetCount : results.getTopFacets()) { - String label = facetCount.getFacetLabel(); - ThriftFacetCount oldCount = map.get(label); - if (oldCount != null) { - oldCount.setSimpleCount(oldCount.getSimpleCount() + facetCount.getSimpleCount()); - oldCount.setWeightedCount(oldCount.getWeightedCount() + facetCount.getWeightedCount()); - - oldCount.setFacetCount(oldCount.getFacetCount() + facetCount.getFacetCount()); - oldCount.setPenaltyCount(oldCount.getPenaltyCount() + facetCount.getPenaltyCount()); - } else { - map.put(label, facetCount); - } - } - } - - if (map.size() == 0 || totalCount == 0) { - // No results. - return null; - } - - // sort table wrt percentage - PriorityQueue pq = - new PriorityQueue<>(map.size(), facetComparator.getThriftComparator(true)); - pq.addAll(map.values()); - - ThriftFacetFieldResults results = new ThriftFacetFieldResults(); - results.setTopFacets(new ArrayList<>()); - results.setTotalCount(totalCount); - - // Store merged language histogram into thrift object - for (Map.Entry entry - : languageHistogram.getLanguageHistogramAsMap().entrySet()) { - results.putToLanguageHistogram(entry.getKey(), entry.getValue()); - } - - // Get top facets. 
- for (int i = 0; i < topK && i < map.size(); i++) { - ThriftFacetCount facetCount = pq.poll(); - if (facetCount != null) { - results.addToTopFacets(facetCount); - } - } - return results; - } - - protected FacetScorer createScorer(AntiGamingFilter antiGamingFilter) { - if (searchRequestInfo.rankingOptions != null) { - return new DefaultFacetScorer(searchRequestInfo.getSearchQuery(), - searchRequestInfo.rankingOptions, - antiGamingFilter, - sortingMode); - } else { - return new FacetScorer() { - @Override - protected void startSegment(EarlybirdIndexSegmentAtomicReader reader) { - } - - @Override - public void incrementCounts(Accumulator accumulator, int internalDocID) throws IOException { - accumulator.accessor.incrementData.accumulators = accumulator.accumulators; - accumulator.accessor.incrementData.weightedCountIncrement = 1; - accumulator.accessor.incrementData.penaltyIncrement = 0; - accumulator.accessor.incrementData.languageId = ThriftLanguage.UNKNOWN.getValue(); - accumulator.accessor.collect(internalDocID); - } - - @Override - public FacetAccumulator getFacetAccumulator(FacetLabelProvider labelProvider) { - return new HashingAndPruningFacetAccumulator(labelProvider, facetComparator); - } - }; - } - } - - protected Accumulator newPerSegmentAccumulator(EarlybirdIndexSegmentAtomicReader indexReader) { - final FacetIDMap facetIDMap = indexReader.getFacetIDMap(); - final FacetCountIterator accessor = - indexReader.getFacetCountingArray().getIterator( - indexReader, - getSearchRequestInfo().getFacetCountState(), - TweetSearchFacetCountIteratorFactory.FACTORY); - - final FacetAccumulator[] accumulators = - (FacetAccumulator[]) - new FacetAccumulator[facetIDMap.getNumberOfFacetFields()]; - - Map labelProviders = indexReader.getFacetLabelProviders(); - for (FacetField f : facetIDMap.getFacetFields()) { - int id = f.getFacetId(); - if (getSearchRequestInfo().getFacetCountState().isCountField(f.getFieldInfo())) { - accumulators[id] = (FacetAccumulator) facetScorer - .getFacetAccumulator(labelProviders.get(f.getFacetName())); - } else { - // Dummy accumulator does nothing. - accumulators[id] = new DummyFacetAccumulator(); - } - } - - return new Accumulator(accumulators, accessor, facetIDMap); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/FacetScorer.java b/src/java/com/twitter/search/earlybird/search/facets/FacetScorer.java deleted file mode 100644 index 0e8725bac..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/FacetScorer.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; - -import com.twitter.search.core.earlybird.facets.FacetAccumulator; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.search.facets.FacetResultsCollector.Accumulator; - -public abstract class FacetScorer { - protected abstract void startSegment(EarlybirdIndexSegmentAtomicReader reader) throws IOException; - - /** - * Increments facet counts for the given document. - */ - public abstract void incrementCounts(Accumulator accumulator, int internalDocID) - throws IOException; - - /** - * Returns a FacetAccumulator for counting facets. It will use the given FacetLabelProvider - * for facet result labeling.
- */ - public abstract FacetAccumulator getFacetAccumulator(FacetLabelProvider labelProvider); -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/FacetSearchRequestInfo.java b/src/java/com/twitter/search/earlybird/search/facets/FacetSearchRequestInfo.java deleted file mode 100644 index 948b098d2..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/FacetSearchRequestInfo.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import org.apache.lucene.search.Query; - -import com.twitter.search.common.ranking.thriftjava.ThriftFacetRankingOptions; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.core.earlybird.facets.FacetCountState; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; - -public class FacetSearchRequestInfo extends SearchRequestInfo { - protected final FacetCountState facetCountState; - protected final ThriftFacetRankingOptions rankingOptions; - - public FacetSearchRequestInfo(ThriftSearchQuery searchQuery, - ThriftFacetRankingOptions rankingOptions, - Query query, - FacetCountState facetCountState, - TerminationTracker terminationTracker) { - super(searchQuery, query, terminationTracker); - this.facetCountState = facetCountState; - this.rankingOptions = rankingOptions; - } - - public final FacetCountState getFacetCountState() { - return this.facetCountState; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/HashingAndPruningFacetAccumulator.java b/src/java/com/twitter/search/earlybird/search/facets/HashingAndPruningFacetAccumulator.java deleted file mode 100644 index 9415b19ee..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/HashingAndPruningFacetAccumulator.java +++ /dev/null @@ -1,492 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.Arrays; -import java.util.Comparator; -import java.util.PriorityQueue; - -import com.twitter.search.common.ranking.thriftjava.ThriftFacetEarlybirdSortingMode; -import com.twitter.search.core.earlybird.facets.FacetAccumulator; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider; -import com.twitter.search.core.earlybird.facets.FacetLabelProvider.FacetLabelAccessor; -import com.twitter.search.core.earlybird.facets.LanguageHistogram; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; - -public class HashingAndPruningFacetAccumulator extends FacetAccumulator { - private static final int DEFAULT_HASH_SIZE = 4096; - /** - * 4 longs per entry accommodates long termIDs. - * Although entries could be encoded in 3 bytes, 4 ensures that no entry is split - * across cache lines. 
- */ - protected static final int LONGS_PER_ENTRY = 4; - private static final double LOAD_FACTOR = 0.5; - private static final long BITSHIFT_MAX_TWEEPCRED = 32; - private static final long PENALTY_COUNT_MASK = (1L << BITSHIFT_MAX_TWEEPCRED) - 1; - - protected static final long UNASSIGNED = -1; - - protected LanguageHistogram languageHistogram = new LanguageHistogram(); - - protected static final class HashTable { - protected final long[] hash; - protected final int size; - protected final int maxLoad; - protected final int mask; - - public HashTable(int size) { - hash = new long[LONGS_PER_ENTRY * size]; - Arrays.fill(hash, UNASSIGNED); - this.size = size; - // Ensure alignment to LONGS_PER_ENTRY-byte boundaries - this.mask = LONGS_PER_ENTRY * (size - 1); - this.maxLoad = (int) (size * LOAD_FACTOR); - } - - protected void reset() { - Arrays.fill(hash, UNASSIGNED); - } - - private final Cursor cursor = new Cursor(); - - public int findHashPosition(long termID) { - int code = (new Long(termID)).hashCode(); - int hashPos = code & mask; - - if (cursor.readFromHash(hashPos) && (cursor.termID != termID)) { - final int inc = ((code >> 8) + code) | 1; - do { - code += inc; - hashPos = code & this.mask; - } while (cursor.readFromHash(hashPos) && (cursor.termID != termID)); - } - - return hashPos; - } - - /** - * The cursor can be used to access the different fields of a hash entry. - * Callers should always position the cursor with readFromHash() before - * accessing the members. - */ - private final class Cursor { - private int simpleCount; - private int weightedCount; - private int penaltyCount; - private int maxTweepcred; - private long termID; - - public void writeToHash(int position) { - long payload = (((long) maxTweepcred) << BITSHIFT_MAX_TWEEPCRED) - | ((long) penaltyCount); - - assert itemPenaltyCount(payload) == penaltyCount : payload + ", " - + itemPenaltyCount(payload) + " != " + penaltyCount; - assert itemMaxTweepCred(payload) == maxTweepcred; - - hash[position] = termID; - hash[position + 1] = simpleCount; - hash[position + 2] = weightedCount; - hash[position + 3] = payload; - } - - /** Returns the item ID, or UNASSIGNED */ - public boolean readFromHash(int position) { - long entry = hash[position]; - if (entry == UNASSIGNED) { - termID = UNASSIGNED; - return false; - } - - termID = entry; - - simpleCount = (int) hash[position + 1]; - weightedCount = (int) hash[position + 2]; - long payload = hash[position + 3]; - - penaltyCount = itemPenaltyCount(payload); - maxTweepcred = itemMaxTweepCred(payload); - - return true; - } - } - } - - protected static int itemPenaltyCount(long payload) { - return (int) (payload & PENALTY_COUNT_MASK); - } - - protected static int itemMaxTweepCred(long payload) { - return (int) (payload >>> BITSHIFT_MAX_TWEEPCRED); - } - - protected int numItems; - protected final HashTable hashTable; - protected final long[] sortBuffer; - private FacetLabelProvider facetLabelProvider; - - private int totalSimpleCount; - private int totalWeightedCount; - private int totalPenalty; - - static final double DEFAULT_QUERY_INDEPENDENT_PENALTY_WEIGHT = 1.0; - private final double queryIndependentPenaltyWeight; - - private final FacetComparator facetComparator; - - public HashingAndPruningFacetAccumulator(FacetLabelProvider facetLabelProvider, - FacetComparator comparator) { - this(DEFAULT_HASH_SIZE, facetLabelProvider, - DEFAULT_QUERY_INDEPENDENT_PENALTY_WEIGHT, comparator); - } - - public HashingAndPruningFacetAccumulator(FacetLabelProvider facetLabelProvider, - double 
queryIndependentPenaltyWeight, FacetComparator comparator) { - this(DEFAULT_HASH_SIZE, facetLabelProvider, queryIndependentPenaltyWeight, comparator); - } - - /** - * Creates a new, empty HashingAndPruningFacetAccumulator with the given initial size. - * HashSize will be rounded up to the next power-of-2 value. - */ - public HashingAndPruningFacetAccumulator(int hashSize, FacetLabelProvider facetLabelProvider, - double queryIndependentPenaltyWeight, FacetComparator comparator) { - int powerOfTwoSize = 2; - while (hashSize > powerOfTwoSize) { - powerOfTwoSize *= 2; - } - - this.facetComparator = comparator; - hashTable = new HashTable(powerOfTwoSize); - sortBuffer = new long[LONGS_PER_ENTRY * (int) Math.ceil(LOAD_FACTOR * powerOfTwoSize)]; - this.facetLabelProvider = facetLabelProvider; - this.queryIndependentPenaltyWeight = queryIndependentPenaltyWeight; - } - - @Override - public void reset(FacetLabelProvider facetLabelProviderToReset) { - this.facetLabelProvider = facetLabelProviderToReset; - this.numItems = 0; - this.hashTable.reset(); - this.totalSimpleCount = 0; - this.totalPenalty = 0; - this.totalWeightedCount = 0; - languageHistogram.clear(); - } - - - @Override - public int add(long termID, int weightedCounterIncrement, int penaltyIncrement, int tweepCred) { - int hashPos = hashTable.findHashPosition(termID); - - totalPenalty += penaltyIncrement; - totalSimpleCount++; - totalWeightedCount += weightedCounterIncrement; - - if (hashTable.cursor.termID == UNASSIGNED) { - hashTable.cursor.termID = termID; - hashTable.cursor.simpleCount = 1; - hashTable.cursor.weightedCount = weightedCounterIncrement; - hashTable.cursor.penaltyCount = penaltyIncrement; - hashTable.cursor.maxTweepcred = tweepCred; - hashTable.cursor.writeToHash(hashPos); - - numItems++; - if (numItems >= hashTable.maxLoad) { - prune(); - } - return 1; - } else { - - hashTable.cursor.simpleCount++; - hashTable.cursor.weightedCount += weightedCounterIncrement; - - if (tweepCred > hashTable.cursor.maxTweepcred) { - hashTable.cursor.maxTweepcred = tweepCred; - } - - hashTable.cursor.penaltyCount += penaltyIncrement; - hashTable.cursor.writeToHash(hashPos); - return hashTable.cursor.simpleCount; - } - } - - @Override - public void recordLanguage(int languageId) { - languageHistogram.increment(languageId); - } - - @Override - public LanguageHistogram getLanguageHistogram() { - return languageHistogram; - } - - private void prune() { - copyToSortBuffer(); - hashTable.reset(); - - int targetNumItems = (int) (hashTable.maxLoad >> 1); - - int minCount = 2; - int nextMinCount = Integer.MAX_VALUE; - - final int n = LONGS_PER_ENTRY * numItems; - - while (numItems > targetNumItems) { - for (int i = 0; i < n; i += LONGS_PER_ENTRY) { - long item = sortBuffer[i]; - if (item != UNASSIGNED) { - int count = (int) sortBuffer[i + 1]; - if (count < minCount) { - evict(i); - } else if (count < nextMinCount) { - nextMinCount = count; - } - } - } - if (minCount == nextMinCount) { - minCount++; - } else { - minCount = nextMinCount; - } - nextMinCount = Integer.MAX_VALUE; - } - - // rehash - for (int i = 0; i < n; i += LONGS_PER_ENTRY) { - long item = sortBuffer[i]; - if (item != UNASSIGNED) { - final long termID = item; - int hashPos = hashTable.findHashPosition(termID); - for (int j = 0; j < LONGS_PER_ENTRY; ++j) { - hashTable.hash[hashPos + j] = sortBuffer[i + j]; - } - } - } - } - - // overridable for unit test - protected void evict(int index) { - sortBuffer[index] = UNASSIGNED; - numItems--; - } - - @Override - public ThriftFacetFieldResults 
getAllFacets() { - return getTopFacets(numItems); - } - - @Override - public ThriftFacetFieldResults getTopFacets(final int numRequested) { - int n = numRequested > numItems ? numItems : numRequested; - - if (n == 0) { - return null; - } - - ThriftFacetFieldResults facetResults = new ThriftFacetFieldResults(); - facetResults.setTotalCount(totalSimpleCount); - facetResults.setTotalScore(totalWeightedCount); - facetResults.setTotalPenalty(totalPenalty); - - copyToSortBuffer(); - - // sort table using the facet comparator - PriorityQueue pq = new PriorityQueue<>(numItems, facetComparator.getComparator(true)); - - for (int i = 0; i < LONGS_PER_ENTRY * numItems; i += LONGS_PER_ENTRY) { - pq.add(new Item(sortBuffer, i)); - } - - FacetLabelAccessor accessor = facetLabelProvider.getLabelAccessor(); - - for (int i = 0; i < n; i++) { - Item item = pq.poll(); - long id = item.getTermId(); - - int penalty = item.getPenaltyCount() + (int) (queryIndependentPenaltyWeight - * accessor.getOffensiveCount(id)); - ThriftFacetCount result = new ThriftFacetCount().setFacetLabel(accessor.getTermText(id)); - result.setPenaltyCount(penalty); - result.setSimpleCount(item.getSimpleCount()); - result.setWeightedCount(item.getWeightedCount()); - result.setMetadata(new ThriftFacetCountMetadata().setMaxTweepCred(item.getMaxTweetCred())); - - result.setFacetCount(result.getWeightedCount()); - facetResults.addToTopFacets(result); - } - - return facetResults; - } - - // Compacts the hashtable entries in place by removing empty hashes. After - this operation it's no longer a hash table but an array of entries. - private void copyToSortBuffer() { - int upto = 0; - - for (int i = 0; i < hashTable.hash.length; i += LONGS_PER_ENTRY) { - if (hashTable.hash[i] != UNASSIGNED) { - for (int j = 0; j < LONGS_PER_ENTRY; ++j) { - sortBuffer[upto + j] = hashTable.hash[i + j]; - } - upto += LONGS_PER_ENTRY; - } - } - assert upto == numItems * LONGS_PER_ENTRY; - } - - /** - * Sorts facets in the following order: - * 1) ascending by weightedCount - * 2) if weightedCount equal: ascending by simpleCount - * 3) if weightedCount and simpleCount equal: descending by penaltyCount - */ - public static int compareFacetCounts(int weightedCount1, int simpleCount1, int penaltyCount1, - int weightedCount2, int simpleCount2, int penaltyCount2, - boolean simpleCountPrecedence) { - if (simpleCountPrecedence) { - if (simpleCount1 < simpleCount2) { - return -1; - } else if (simpleCount1 > simpleCount2) { - return 1; - } else { - if (weightedCount1 < weightedCount2) { - return -1; - } else if (weightedCount1 > weightedCount2) { - return 1; - } else { - if (penaltyCount1 < penaltyCount2) { - // descending - return 1; - } else if (penaltyCount1 > penaltyCount2) { - return -1; - } else { - return 0; - } - } - } - } else { - if (weightedCount1 < weightedCount2) { - return -1; - } else if (weightedCount1 > weightedCount2) { - return 1; - } else { - if (simpleCount1 < simpleCount2) { - return -1; - } else if (simpleCount1 > simpleCount2) { - return 1; - } else { - if (penaltyCount1 < penaltyCount2) { - // descending - return 1; - } else if (penaltyCount1 > penaltyCount2) { - return -1; - } else { - return 0; - } - } - } - } - } - - public static final class FacetComparator { - private final Comparator thriftComparator; - private final Comparator comparator; - - private FacetComparator(Comparator thriftComparator, - Comparator comparator) { - this.thriftComparator = thriftComparator; - this.comparator = comparator; - } - - public Comparator getThriftComparator()
{ - return getThriftComparator(false); - } - - public Comparator getThriftComparator(boolean reverse) { - return reverse ? getReverseComparator(thriftComparator) : thriftComparator; - } - - private Comparator getComparator(boolean reverse) { - return reverse ? getReverseComparator(comparator) : comparator; - } - } - - public static final FacetComparator SIMPLE_COUNT_COMPARATOR = new FacetComparator( - (facet1, facet2) -> compareFacetCounts( - facet1.weightedCount, facet1.simpleCount, facet1.penaltyCount, - facet2.weightedCount, facet2.simpleCount, facet2.penaltyCount, - true), - (facet1, facet2) -> compareFacetCounts( - facet1.getWeightedCount(), facet1.getSimpleCount(), facet1.getPenaltyCount(), - facet2.getWeightedCount(), facet2.getSimpleCount(), facet2.getPenaltyCount(), - true)); - - - public static final FacetComparator WEIGHTED_COUNT_COMPARATOR = new FacetComparator( - (facet1, facet2) -> compareFacetCounts( - facet1.weightedCount, facet1.simpleCount, facet1.penaltyCount, - facet2.weightedCount, facet2.simpleCount, facet2.penaltyCount, - false), - (facet1, facet2) -> compareFacetCounts( - facet1.getWeightedCount(), facet1.getSimpleCount(), facet1.getPenaltyCount(), - facet2.getWeightedCount(), facet2.getSimpleCount(), facet2.getPenaltyCount(), - false)); - - /** - * Returns the appropriate FacetComparator for the specified sortingMode. - */ - public static FacetComparator getComparator(ThriftFacetEarlybirdSortingMode sortingMode) { - switch (sortingMode) { - case SORT_BY_WEIGHTED_COUNT: - return WEIGHTED_COUNT_COMPARATOR; - case SORT_BY_SIMPLE_COUNT: - default: - return SIMPLE_COUNT_COMPARATOR; - } - } - - private static Comparator getReverseComparator(final Comparator comparator) { - return (t1, t2) -> -comparator.compare(t1, t2); - } - - static final class Item { - private final long[] data; - private final int offset; - - Item(long[] data, int offset) { - this.data = data; - this.offset = offset; - } - - public long getTermId() { - return data[offset]; - } - - public int getSimpleCount() { - return (int) data[offset + 1]; - } - - public int getWeightedCount() { - return (int) data[offset + 2]; - } - - public int getPenaltyCount() { - return itemPenaltyCount(data[offset + 3]); - } - - public int getMaxTweetCred() { - return itemMaxTweepCred(data[offset + 3]); - } - - @Override public int hashCode() { - return (int) (31 * getTermId()); - } - - @Override public boolean equals(Object o) { - return getTermId() == ((Item) o).getTermId(); - } - - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/NamedEntityCollector.java b/src/java/com/twitter/search/earlybird/search/facets/NamedEntityCollector.java deleted file mode 100644 index 2e7be9e30..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/NamedEntityCollector.java +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.List; -import java.util.Map; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; - -import org.apache.commons.lang.StringUtils; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.thrift.NamedEntitySource; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultNamedEntity; - -public class NamedEntityCollector extends AbstractFacetTermCollector { - private static final Map 
NAMED_ENTITY_WITH_TYPE_FIELDS = - ImmutableMap.of( - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_TEXT_FIELD.getFieldName(), - NamedEntitySource.TEXT, - EarlybirdFieldConstant.NAMED_ENTITY_WITH_TYPE_FROM_URL_FIELD.getFieldName(), - NamedEntitySource.URL); - - private List namedEntities = Lists.newArrayList(); - - @Override - public boolean collect(int docID, long termID, int fieldID) { - - String term = getTermFromFacet(termID, fieldID, NAMED_ENTITY_WITH_TYPE_FIELDS.keySet()); - if (StringUtils.isEmpty(term)) { - return false; - } - - int index = term.lastIndexOf(":"); - namedEntities.add(new ThriftSearchResultNamedEntity( - term.substring(0, index), - term.substring(index + 1), - NAMED_ENTITY_WITH_TYPE_FIELDS.get(findFacetName(fieldID)))); - - return true; - } - - @Override - public void fillResultAndClear(ThriftSearchResult result) { - getExtraMetadata(result).setNamedEntities(ImmutableList.copyOf(namedEntities)); - namedEntities.clear(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/RetweetFacetCountIterator.java b/src/java/com/twitter/search/earlybird/search/facets/RetweetFacetCountIterator.java deleted file mode 100644 index 1693b8cf2..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/RetweetFacetCountIterator.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; - -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.facets.CSFFacetCountIterator; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * And iterator for counting retweets. Reads from shared_status_id CSF but doesn't count - * replies. 
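 * A document is only counted when its shared_status_id CSF is set (termID > 0) and its
 * IS_RETWEET_FLAG CSF is non-zero, so replies that also carry a shared status id are skipped.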
- */ -public class RetweetFacetCountIterator extends CSFFacetCountIterator { - private final NumericDocValues featureReaderIsRetweetFlag; - - public RetweetFacetCountIterator( - EarlybirdIndexSegmentAtomicReader reader, - Schema.FieldInfo facetFieldInfo) throws IOException { - super(reader, facetFieldInfo); - featureReaderIsRetweetFlag = - reader.getNumericDocValues(EarlybirdFieldConstant.IS_RETWEET_FLAG.getFieldName()); - } - - @Override - protected boolean shouldCollect(int internalDocID, long termID) throws IOException { - // termID == 0 means that we didn't set shared_status_csf, so don't collect - // (tweet IDs are all positive) - // Also only collect if this doc is a retweet, not a reply - return termID > 0 - && featureReaderIsRetweetFlag.advanceExact(internalDocID) - && (featureReaderIsRetweetFlag.longValue() != 0); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/SimpleCountRankingModule.java b/src/java/com/twitter/search/earlybird/search/facets/SimpleCountRankingModule.java deleted file mode 100644 index b5f31361a..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/SimpleCountRankingModule.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.Iterator; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.facets.FacetCountState; -import com.twitter.search.core.earlybird.facets.FacetCountState.FacetFieldResults; -import com.twitter.search.earlybird.search.EarlybirdLuceneSearcher; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; - -public class SimpleCountRankingModule extends FacetRankingModule { - - @Override - public void prepareResults( - EarlybirdLuceneSearcher.FacetSearchResults hits, - FacetCountState facetCountState) { - Iterator> fieldResultsIterator = - facetCountState.getFacetFieldResultsIterator(); - while (fieldResultsIterator.hasNext()) { - FacetFieldResults state = fieldResultsIterator.next(); - if (!state.isFinished()) { - Schema.FieldInfo facetField = - facetCountState.getSchema().getFacetFieldByFacetName(state.facetName); - state.results = hits.getFacetResults( - facetField.getFieldType().getFacetName(), state.numResultsRequested); - if (state.results != null) { - state.numResultsFound = state.results.getTopFacetsSize(); - } - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/SpaceFacetCollector.java b/src/java/com/twitter/search/earlybird/search/facets/SpaceFacetCollector.java deleted file mode 100644 index 3ceeacb20..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/SpaceFacetCollector.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.ArrayList; -import java.util.List; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.partition.AudioSpaceTable; -import com.twitter.search.earlybird.thrift.AudioSpaceState; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultAudioSpace; - -public class SpaceFacetCollector extends AbstractFacetTermCollector { - private final List spaces = new ArrayList<>(); - - private final AudioSpaceTable audioSpaceTable; - - public SpaceFacetCollector(AudioSpaceTable audioSpaceTable) { - 
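    // The AudioSpaceTable is consulted once per collected hit to decide whether a space is
    // still RUNNING or has ENDED.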
this.audioSpaceTable = audioSpaceTable; - } - - @Override - public boolean collect(int docID, long termID, int fieldID) { - - String spaceId = getTermFromFacet(termID, fieldID, - Sets.newHashSet(EarlybirdFieldConstant.SPACES_FACET)); - if (StringUtils.isEmpty(spaceId)) { - return false; - } - - spaces.add(new ThriftSearchResultAudioSpace(spaceId, - audioSpaceTable.isRunning(spaceId) ? AudioSpaceState.RUNNING - : AudioSpaceState.ENDED)); - - return true; - } - - @Override - public void fillResultAndClear(ThriftSearchResult result) { - getExtraMetadata(result).setSpaces(ImmutableList.copyOf(spaces)); - spaces.clear(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsCollector.java b/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsCollector.java deleted file mode 100644 index 5856d5281..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsCollector.java +++ /dev/null @@ -1,487 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.commons.lang.StringUtils; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchResultsStats; -import com.twitter.search.common.schema.SchemaUtil; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.common.util.earlybird.TermStatisticsUtil; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.search.AbstractResultsCollector; -import com.twitter.search.earlybird.search.SearchResultsInfo; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftHistogramSettings; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; -import com.twitter.search.earlybird.thrift.ThriftTermResults; - -public class TermStatisticsCollector extends AbstractResultsCollector - { - private static final EarlyTerminationState TERMINATED_TERM_STATS_COUNTING_DONE = - new EarlyTerminationState("terminated_term_stats_counting_done", true); - - // Stats for tracking histogram results. - private static final SearchResultsStats TERM_STATS_HISTOGRAM_REQUESTS_WITH_MOVED_BACK_BINS = - SearchResultsStats.export("term_statistics_collector_queries_with_moved_back_bins"); - private static final SearchCounter TERM_STATS_SKIPPED_LARGER_OUT_OF_BOUNDS_HITS = - SearchCounter.export("term_statistics_collector_skipped_larger_out_of_bounds_hits"); - - @VisibleForTesting - static final class TermStatistics { - private final ThriftTermRequest termRequest; - private final Term term; // could be null, for count across all fields - private int termDF = 0; - private int termCount = 0; - private final int[] histogramBins; - - // Per-segment information. 
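    // Both fields are reset in startSegment(): segmentDocsEnum is re-seeked to this term's
    // postings (or left null for the catch-all request), and segmentDone records whether the
    // current segment still has matching docs for this term.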
- private PostingsEnum segmentDocsEnum; // could be null, for count across all fields - private boolean segmentDone; - - @VisibleForTesting - TermStatistics(ThriftTermRequest termRequest, Term term, int numBins) { - this.termRequest = termRequest; - this.term = term; - this.histogramBins = new int[numBins]; - } - - /** - * Take the currently accumulated counts and "move them back" to make room for counts from more - * recent binIds. - * - * For example, if the oldFirstBinID was set to 10, and the histogramBins were {3, 4, 5, 6, 7}, - * after this call with newFirstBinID set to 12, the histogramBins will be set - * to {5, 6, 7, 0, 0}. - * - * @param oldFirstBinID the binId of the firstBin that's been used up to now. - * @param newFirstBinID the new binId of the firstBin that will be used from now on. - * The newFirstBinID is presumed to be larger than the oldFirstBinID, and is asserted. - */ - @VisibleForTesting - void moveBackTermCounts(int oldFirstBinID, int newFirstBinID) { - Preconditions.checkState(oldFirstBinID < newFirstBinID); - // move counts back by this many bins - final int moveBackBy = newFirstBinID - oldFirstBinID; - - this.termCount = 0; - for (int i = 0; i < histogramBins.length; i++) { - int oldCount = histogramBins[i]; - histogramBins[i] = 0; - int newIndex = i - moveBackBy; - if (newIndex >= 0) { - histogramBins[newIndex] = oldCount; - this.termCount += oldCount; - } - } - } - - @VisibleForTesting void countHit(int bin) { - termCount++; - histogramBins[bin]++; - } - - @VisibleForTesting int getTermCount() { - return termCount; - } - - @VisibleForTesting int[] getHistogramBins() { - return histogramBins; - } - } - - private TermStatistics[] termStatistics; - - // Histogram fields. - private int numBins; - private int binSize; - - private int numTimesBinsWereMovedBack = 0; - private int numLargerOutOfBoundsBinsSkipped = 0; - - private static final int SEEN_OUT_OF_RANGE_THRESHOLD = 10; - - private int seenOutOfRange = 0; - - // ID of the first bin - effectively time / binSize. This is calculated - // relative to the first collected in-order hit. - private int firstBinID = -1; - // List of per-segment debug information specifically useful for termstat request debugging. - private List termStatisticsDebugInfo = new ArrayList<>(); - - /** - * Creates a new term stats collector. - */ - public TermStatisticsCollector( - ImmutableSchemaInterface schema, - TermStatisticsRequestInfo searchRequestInfo, - EarlybirdSearcherStats searcherStats, - Clock clock, - int requestDebugMode) { - super(schema, searchRequestInfo, clock, searcherStats, requestDebugMode); - - // Set up the histogram bins. - if (searchRequestInfo.isReturnHistogram()) { - ThriftHistogramSettings histogramSettings = searchRequestInfo.getHistogramSettings(); - this.numBins = histogramSettings.getNumBins(); - binSize = TermStatisticsUtil.determineBinSize(histogramSettings); - } else { - this.numBins = 0; - this.binSize = 0; - } - - // Set up the term statistics array. - List termRequests = searchRequestInfo.getTermRequests(); - if (termRequests == null) { - this.termStatistics = new TermStatistics[0]; - return; - } - - this.termStatistics = new TermStatistics[searchRequestInfo.getTermRequests().size()]; - for (int i = 0; i < searchRequestInfo.getTermRequests().size(); i++) { - final ThriftTermRequest termRequest = searchRequestInfo.getTermRequests().get(i); - - Term term = null; - String fieldName = termRequest.getFieldName(); - if (!StringUtils.isBlank(fieldName)) { - // First check if it's a facet field. 
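        // Facet terms are looked up by facet name; for regular schema fields the term text is
        // converted to the field's type via SchemaUtil.toBytesRef() below.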
- Schema.FieldInfo facetField = schema.getFacetFieldByFacetName(termRequest.getFieldName()); - if (facetField != null) { - term = new Term(facetField.getName(), termRequest.getTerm()); - } else { - // EarlybirdSearcher.validateRequest() should've already checked that the field exists in - // the schema, and that the term can be converted to the type of this field. However, if - // that did not happen for some reason, an exception will be thrown here, which will be - // converted to a TRANSIENT_ERROR response code. - Schema.FieldInfo fieldInfo = schema.getFieldInfo(fieldName); - Preconditions.checkNotNull( - fieldInfo, - "Found a ThriftTermRequest for a field that's not in the schema: " + fieldName - + ". This should've been caught by EarlybirdSearcher.validateRequest()!"); - term = new Term(fieldName, SchemaUtil.toBytesRef(fieldInfo, termRequest.getTerm())); - } - } else { - // NOTE: if the fieldName is empty, this is a catch-all term request for the count across - // all fields. We'll just use a null term in the TermStatistics object. - } - - termStatistics[i] = new TermStatistics(termRequest, term, numBins); - } - } - - @Override - public void startSegment() throws IOException { - termStatisticsDebugInfo.add( - "Starting segment in timestamp range: [" + timeMapper.getFirstTime() - + ", " + timeMapper.getLastTime() + "]"); - for (TermStatistics termStats : termStatistics) { - termStats.segmentDone = true; // until we know it's false later. - TermsEnum termsEnum = null; - if (termStats.term != null) { - Terms terms = currTwitterReader.terms(termStats.term.field()); - if (terms != null) { - termsEnum = terms.iterator(); - if (termsEnum != null && termsEnum.seekExact(termStats.term.bytes())) { - termStats.termDF += termsEnum.docFreq(); // Only meaningful for matchAll queries. - termStats.segmentDocsEnum = - termsEnum.postings(termStats.segmentDocsEnum, PostingsEnum.FREQS); - termStats.segmentDone = termStats.segmentDocsEnum == null - || termStats.segmentDocsEnum.nextDoc() == DocIdSetIterator.NO_MORE_DOCS; - } else { - // this term doesn't exist in this segment. - } - } - } else { - // Catch-all case - termStats.termDF += currTwitterReader.numDocs(); // Only meaningful for matchAll queries. - termStats.segmentDocsEnum = null; - termStats.segmentDone = false; - } - } - } - - private int calculateBin(final int tweetTime) { - if (tweetTime == TimeMapper.ILLEGAL_TIME) { - return -1; - } - - final int binID = Math.abs(tweetTime) / binSize; - final int expectedFirstBinId = binID - numBins + 1; - - if (firstBinID == -1) { - firstBinID = expectedFirstBinId; - } else if (expectedFirstBinId > firstBinID) { - numTimesBinsWereMovedBack++; - final int oldOutOfOrderFirstBinID = firstBinID; - firstBinID = expectedFirstBinId; - // We got a more recent out of order bin, move previous counts back. - for (TermStatistics ts : termStatistics) { - ts.moveBackTermCounts(oldOutOfOrderFirstBinID, firstBinID); - } - } - - final int binIndex = binID - firstBinID; - if (binIndex >= numBins) { - // In-order times should be decreasing, - // and out of order times seen after an in-order tweet should also be smaller than the - // first in-order tweet's time. Will track these and export as a stat. - numLargerOutOfBoundsBinsSkipped++; - return -1; - } else if (binIndex < 0) { - // Early termination criteria. - seenOutOfRange++; - } else { - // Reset the counter, since we want to see consecutive tweets that are out of our bin range - // not single anomalies. 
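      // (Early termination in readyToTerminate() requires SEEN_OUT_OF_RANGE_THRESHOLD
      // consecutive out-of-range hits, so any in-range hit clears the streak.)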
- seenOutOfRange = 0; - } - - return binIndex; - } - - @Override - public void doCollect(long tweetID) throws IOException { - if (searchRequestInfo.isReturnHistogram()) { - final int tweetTime = timeMapper.getTime(curDocId); - final int binIndex = calculateBin(tweetTime); - if (binIndex >= 0) { - for (TermStatistics ts : termStatistics) { - if (!ts.segmentDone) { - countHist(ts, binIndex); - } - } - } - } else { - for (TermStatistics ts : termStatistics) { - if (!ts.segmentDone) { - countNoHist(ts); - } - } - } - } - - @Override - public void skipSegment(EarlybirdSingleSegmentSearcher searcher) { - // Do nothing here. - // We don't do accounting that's done in AbstractResultsCollector for Term Stats - // requests because otherwise the bin ID calculation will be confused. - } - - private boolean advance(TermStatistics ts) throws IOException { - PostingsEnum docsEnum = ts.segmentDocsEnum; - if (docsEnum.docID() < curDocId) { - if (docsEnum.advance(curDocId) == DocIdSetIterator.NO_MORE_DOCS) { - ts.segmentDone = true; - return false; - } - } - return docsEnum.docID() == curDocId; - } - - private boolean countHist(TermStatistics ts, int bin) throws IOException { - if (ts.term != null && !advance(ts)) { - return false; - } - ts.countHit(bin); - return true; - } - - private boolean countNoHist(TermStatistics ts) throws IOException { - if (ts.term != null && !advance(ts)) { - return false; - } - ts.termCount++; - return true; - } - - @Override - public EarlyTerminationState innerShouldCollectMore() { - if (readyToTerminate()) { - return setEarlyTerminationState(TERMINATED_TERM_STATS_COUNTING_DONE); - } - return EarlyTerminationState.COLLECTING; - } - - /** - * The termination logic is simple - we know what our earliest bin is and once we see a result - * that's before our earliest bin, we terminate. - * - * Our results come with increasing internal doc ids, which should correspond to decreasing - * timestamps. See SEARCH-27729, TWEETYPIE-7031. - * - * We early terminate after we have seen enough tweets that are outside of the bin - * range that we want to return. This way we're not terminating too early because of single tweets - * with wrong timestamps. 
- */ - @VisibleForTesting - boolean readyToTerminate() { - return this.seenOutOfRange >= SEEN_OUT_OF_RANGE_THRESHOLD; - } - - @Override - public TermStatisticsSearchResults doGetResults() { - return new TermStatisticsSearchResults(); - } - - public final class TermStatisticsSearchResults extends SearchResultsInfo { - public final List binIds; - public final Map results; - public final int lastCompleteBinId; - public final List termStatisticsDebugInfo; - - private TermStatisticsSearchResults() { - // Initialize term stat debug info - termStatisticsDebugInfo = TermStatisticsCollector.this.termStatisticsDebugInfo; - - if (termStatistics.length > 0) { - results = new HashMap<>(); - - if (searchRequestInfo.isReturnHistogram()) { - binIds = new ArrayList<>(numBins); - int minSearchedTime = TermStatisticsCollector.this.getMinSearchedTime(); - - if (shouldCollectDetailedDebugInfo()) { - termStatisticsDebugInfo.add("minSearchedTime: " + minSearchedTime); - int maxSearchedTime = TermStatisticsCollector.this.getMaxSearchedTime(); - termStatisticsDebugInfo.add("maxSearchedTime: " + maxSearchedTime); - } - - int lastCompleteBin = -1; - - computeFirstBinId(TermStatisticsCollector.this.isSetMinSearchedTime(), minSearchedTime); - trackHistogramResultStats(); - - // Example: - // minSearchTime = 53s - // binSize = 10 - // firstBinId = 5 - // numBins = 4 - // binId = 5, 6, 7, 8 - // binTimeStamp = 50s, 60s, 70s, 80s - for (int i = 0; i < numBins; i++) { - int binId = firstBinID + i; - int binTimeStamp = binId * binSize; - binIds.add(binId); - if (lastCompleteBin == -1 && binTimeStamp > minSearchedTime) { - lastCompleteBin = binId; - } - } - - if (!getEarlyTerminationState().isTerminated()) { - // only if we didn't early terminate we can be sure to use the firstBinID as - // lastCompleteBinId - lastCompleteBinId = firstBinID; - if (shouldCollectDetailedDebugInfo()) { - termStatisticsDebugInfo.add("no early termination"); - } - } else { - lastCompleteBinId = lastCompleteBin; - if (shouldCollectDetailedDebugInfo()) { - termStatisticsDebugInfo.add( - "early terminated for reason: " + getEarlyTerminationReason()); - } - } - if (shouldCollectDetailedDebugInfo()) { - termStatisticsDebugInfo.add("lastCompleteBinId: " + lastCompleteBinId); - } - } else { - binIds = null; - lastCompleteBinId = -1; - } - - for (TermStatistics ts : termStatistics) { - ThriftTermResults termResults = new ThriftTermResults().setTotalCount(ts.termCount); - - if (searchRequestInfo.isReturnHistogram()) { - List list = new ArrayList<>(); - for (int count : ts.histogramBins) { - list.add(count); - } - termResults.setHistogramBins(list); - } - - results.put(ts.termRequest, termResults); - } - } else { - binIds = null; - results = null; - lastCompleteBinId = -1; - } - } - - @Override - public String toString() { - StringBuilder res = new StringBuilder(); - res.append("TermStatisticsSearchResults(\n"); - if (binIds != null) { - res.append(" binIds=").append(binIds).append("\n"); - } - res.append(" lastCompleteBinId=").append(lastCompleteBinId).append("\n"); - if (results != null) { - res.append(" results=").append(results).append("\n"); - } - res.append(")"); - return res.toString(); - } - - public List getTermStatisticsDebugInfo() { - return termStatisticsDebugInfo; - } - } - - /** - * Figure out what the actual firstBinId is for this query. 
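   * If no bins were filled during collection (firstBinID is still -1), it falls back to
   * minSearchedTime / binSize, or to 0 when no segments were searched at all.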
- */ - private void computeFirstBinId(boolean isSetMinSearchedTime, int minSearchedTime) { - if (firstBinID == -1) { - if (!isSetMinSearchedTime) { - // This would only happen if we don't search any segments, which for now we have - // only seen happening if since_time or until_time don't intersect at all with - // the range of the served segments. - firstBinID = 0; - } else { - // Example: - // minSearchedTime = 54 - // binSize = 10 - // firstBinId = 5 - firstBinID = minSearchedTime / binSize; - } - - if (shouldCollectDetailedDebugInfo()) { - termStatisticsDebugInfo.add("firstBinId: " + firstBinID); - } - } - } - - @VisibleForTesting - int getSeenOutOfRange() { - return seenOutOfRange; - } - - private void trackHistogramResultStats() { - if (numLargerOutOfBoundsBinsSkipped > 0) { - TERM_STATS_SKIPPED_LARGER_OUT_OF_BOUNDS_HITS.increment(); - } - - if (numTimesBinsWereMovedBack > 0) { - TERM_STATS_HISTOGRAM_REQUESTS_WITH_MOVED_BACK_BINS.recordResults(numTimesBinsWereMovedBack); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsRequestInfo.java b/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsRequestInfo.java deleted file mode 100644 index 6162f4192..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/TermStatisticsRequestInfo.java +++ /dev/null @@ -1,94 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.util.LinkedList; -import java.util.List; -import java.util.Set; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableSet; - -import org.apache.lucene.search.Query; - -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.common.util.text.NormalizerHelper; -import com.twitter.search.common.util.url.URLUtils; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.thrift.ThriftHistogramSettings; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsRequest; - -public class TermStatisticsRequestInfo extends SearchRequestInfo { - private static final Set FACET_URL_FIELDS_TO_NORMALIZE = new ImmutableSet.Builder() - .add(EarlybirdFieldConstant.IMAGES_FACET) - .add(EarlybirdFieldConstant.VIDEOS_FACET) - .add(EarlybirdFieldConstant.NEWS_FACET) - .build(); - - protected final List termRequests; - protected final ThriftHistogramSettings histogramSettings; - - /** - * Creates a new TermStatisticsRequestInfo instance using the provided query. - */ - public TermStatisticsRequestInfo(ThriftSearchQuery searchQuery, - Query luceneQuery, - ThriftTermStatisticsRequest termStatsRequest, - TerminationTracker terminationTracker) { - super(searchQuery, luceneQuery, terminationTracker); - this.termRequests = termStatsRequest.isSetTermRequests() - ? termStatsRequest.getTermRequests() : new LinkedList<>(); - this.histogramSettings = termStatsRequest.getHistogramSettings(); - if (termStatsRequest.isIncludeGlobalCounts()) { - // Add an empty request to indicate we need a global count across all fields. - termRequests.add(new ThriftTermRequest().setFieldName("").setTerm("")); - } - - // We only normalize TEXT terms and urls. All other terms, e.g. topics (named entities) are - // not normalized. 
Here the assumption is that the caller passes the exact terms back that - // the facet API returned - for (ThriftTermRequest termReq : termRequests) { - if (termReq.getTerm().isEmpty()) { - continue; // the special catch-all term. - } - - if (!termReq.isSetFieldName() - || termReq.getFieldName().equals(EarlybirdFieldConstant.TEXT_FIELD.getFieldName())) { - // normalize the TEXT term as it's normalized during ingestion - termReq.setTerm(NormalizerHelper.normalizeWithUnknownLocale( - termReq.getTerm(), EarlybirdConfig.getPenguinVersion())); - } else if (FACET_URL_FIELDS_TO_NORMALIZE.contains(termReq.getFieldName())) { - // remove the trailing slash from the URL path. This operation is idempotent, - // so either a spiderduck URL or a facet URL can be used here. The latter would just - // be normalized twice, which is fine. - termReq.setTerm(URLUtils.normalizePath(termReq.getTerm())); - } - } - } - - @Override - protected int calculateMaxHitsToProcess(ThriftSearchQuery searchQuery) { - Preconditions.checkNotNull(searchQuery.getCollectorParams()); - if (!searchQuery.getCollectorParams().isSetTerminationParams() - || !searchQuery.getCollectorParams().getTerminationParams().isSetMaxHitsToProcess()) { - // Override the default value to all hits. - return Integer.MAX_VALUE; - } else { - return super.calculateMaxHitsToProcess(searchQuery); - } - } - - public final List getTermRequests() { - return this.termRequests; - } - - public final ThriftHistogramSettings getHistogramSettings() { - return this.histogramSettings; - } - - public final boolean isReturnHistogram() { - return this.histogramSettings != null; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/facets/TweetSearchFacetCountIteratorFactory.java b/src/java/com/twitter/search/earlybird/search/facets/TweetSearchFacetCountIteratorFactory.java deleted file mode 100644 index a46149fc4..000000000 --- a/src/java/com/twitter/search/earlybird/search/facets/TweetSearchFacetCountIteratorFactory.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird.search.facets; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.facets.CSFFacetCountIterator; -import com.twitter.search.core.earlybird.facets.FacetCountIterator; -import com.twitter.search.core.earlybird.facets.FacetCountIteratorFactory; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; - -/** - * Factory of {@link FacetCountIterator} instances for tweet search. - * It provides a special iterator for the retweets facet. 
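 * The retweets facet is counted with a RetweetFacetCountIterator; every other CSF-based facet
 * falls back to a plain CSFFacetCountIterator.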
- */ -public final class TweetSearchFacetCountIteratorFactory extends FacetCountIteratorFactory { - public static final TweetSearchFacetCountIteratorFactory FACTORY = - new TweetSearchFacetCountIteratorFactory(); - - private TweetSearchFacetCountIteratorFactory() { - } - - @Override - public FacetCountIterator getFacetCountIterator( - EarlybirdIndexSegmentAtomicReader reader, - Schema.FieldInfo fieldInfo) throws IOException { - Preconditions.checkNotNull(reader); - Preconditions.checkNotNull(fieldInfo); - Preconditions.checkArgument(fieldInfo.getFieldType().isUseCSFForFacetCounting()); - - String facetName = fieldInfo.getFieldType().getFacetName(); - - if (EarlybirdFieldConstant.RETWEETS_FACET.equals(facetName)) { - return new RetweetFacetCountIterator(reader, fieldInfo); - } else { - return new CSFFacetCountIterator(reader, fieldInfo); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/BadUserRepFilter.java b/src/java/com/twitter/search/earlybird/search/queries/BadUserRepFilter.java deleted file mode 100644 index 3577b8635..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/BadUserRepFilter.java +++ /dev/null @@ -1,115 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -public final class BadUserRepFilter extends Query { - /** - * Creates a query that filters out results coming from users with bad reputation. - * - * @param minTweepCred The lowest acceptable user reputation. - * @return A query that filters out results from bad reputation users. 
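   * Returns null when minTweepCred <= 0, in which case no filtering is necessary.
   *
   * A minimal usage sketch; the enclosing query builder and the minTweepCred variable are
   * illustrative only, not part of this class:
   * <pre>{@code
   * Query badUserRepFilter = BadUserRepFilter.getBadUserRepFilter(minTweepCred);
   * if (badUserRepFilter != null) {
   *   queryBuilder.add(badUserRepFilter, BooleanClause.Occur.FILTER);
   * }
   * }</pre>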
- */ - public static Query getBadUserRepFilter(int minTweepCred) { - if (minTweepCred <= 0) { - return null; - } - - return new BooleanQuery.Builder() - .add(new BadUserRepFilter(minTweepCred), BooleanClause.Occur.FILTER) - .build(); - } - - private final int minTweepCred; - - private BadUserRepFilter(int minTweepCred) { - this.minTweepCred = minTweepCred; - } - - @Override - public int hashCode() { - return minTweepCred; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof BadUserRepFilter)) { - return false; - } - - return minTweepCred == BadUserRepFilter.class.cast(obj).minTweepCred; - } - - @Override - public String toString(String field) { - return "BadUserRepFilter:" + minTweepCred; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader reader = context.reader(); - if (!(reader instanceof EarlybirdIndexSegmentAtomicReader)) { - return new AllDocsIterator(reader); - } - - return new BadUserExcludeDocIdSetIterator( - (EarlybirdIndexSegmentAtomicReader) context.reader(), minTweepCred); - } - }; - } - - private static final class BadUserExcludeDocIdSetIterator extends RangeFilterDISI { - private final NumericDocValues userReputationDocValues; - private final int minTweepCred; - - BadUserExcludeDocIdSetIterator(EarlybirdIndexSegmentAtomicReader indexReader, - int minTweepCred) throws IOException { - super(indexReader); - this.userReputationDocValues = - indexReader.getNumericDocValues(EarlybirdFieldConstant.USER_REPUTATION.getFieldName()); - this.minTweepCred = minTweepCred; - } - - @Override - public boolean shouldReturnDoc() throws IOException { - // We need this explicit casting to byte, because of how we encode and decode features in our - // encoded_tweet_features field. If a feature is an int (uses all 32 bits of the int), then - // encoding the feature and then decoding it preserves its original value. However, if the - // feature does not use the entire int (and especially if it uses bits somewhere in the middle - // of the int), then the feature value is assumed to be unsigned when it goes through this - // process of encoding and decoding. So a user rep of - // RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL (-128) will be correctly encoded as the - // binary value 10000000, but will be treated as an unsigned value when decoded, and therefore - // the decoded value will be 128. - // - // In retrospect, this seems like a really poor design decision. It seems like it would be - // better if all feature values were considered to be signed, even if most features can never - // have negative values. Unfortunately, making this change is not easy, because some features - // store normalized values, so we would also need to change the range of allowed values - // produced by those normalizers, as well as all code that depends on those values. - // - // So for now, just cast this value to a byte, to get the proper negative value. 
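      // For example, the sentinel value -128 is decoded from the CSF as the long value 128;
      // casting to byte restores -128 before comparing against minTweepCred.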
- return userReputationDocValues.advanceExact(docID()) - && ((byte) userReputationDocValues.longValue() >= minTweepCred); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/CSFDisjunctionFilter.java b/src/java/com/twitter/search/earlybird/search/queries/CSFDisjunctionFilter.java deleted file mode 100644 index f5ba12493..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/CSFDisjunctionFilter.java +++ /dev/null @@ -1,87 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Objects; -import java.util.Set; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -/** - * CSFDisjunctionFilter provides an efficient mechanism to query for documents that have a - * long CSF equal to one of the provided values. - */ -public final class CSFDisjunctionFilter extends Query { - private final String csfField; - private final Set values; - - public static Query getCSFDisjunctionFilter(String csfField, Set values) { - return new BooleanQuery.Builder() - .add(new CSFDisjunctionFilter(csfField, values), BooleanClause.Occur.FILTER) - .build(); - } - - private CSFDisjunctionFilter(String csfField, Set values) { - this.csfField = csfField; - this.values = values; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - return new CSFDisjunctionFilterDISI(context.reader(), csfField, values); - } - }; - } - - @Override - public int hashCode() { - return (csfField == null ? 0 : csfField.hashCode()) * 17 - + (values == null ? 
0 : values.hashCode()); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof CSFDisjunctionFilter)) { - return false; - } - - CSFDisjunctionFilter filter = CSFDisjunctionFilter.class.cast(obj); - return Objects.equals(csfField, filter.csfField) && Objects.equals(values, filter.values); - } - - @Override - public String toString(String field) { - return "CSFDisjunctionFilter:" + csfField + ",count:" + values.size(); - } - - private static final class CSFDisjunctionFilterDISI extends RangeFilterDISI { - private final NumericDocValues docValues; - private final Set values; - - private CSFDisjunctionFilterDISI(LeafReader reader, String csfField, Set values) - throws IOException { - super(reader); - this.values = values; - this.docValues = reader.getNumericDocValues(csfField); - } - - @Override - protected boolean shouldReturnDoc() throws IOException { - return docValues.advanceExact(docID()) && values.contains(docValues.longValue()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/DocValRangeFilter.java b/src/java/com/twitter/search/earlybird/search/queries/DocValRangeFilter.java deleted file mode 100644 index b9b5ad68f..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/DocValRangeFilter.java +++ /dev/null @@ -1,195 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Objects; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -/** - * Filters tweets according to the specified CSF field value. - * Note that min value is inclusive, and max value is exclusive. - */ -public final class DocValRangeFilter extends Query { - private final String csfField; - private final ThriftCSFType csfFieldType; - private final Number minValInclusive; - private final Number maxValExclusive; - - /** - * Returns a query that filters hits based on the value of a CSF. - * - * @param csfField The CSF name. - * @param csfFieldType The CSF type. - * @param minVal The minimum acceptable value (inclusive). - * @param maxVal The maximum acceptable value (exclusive). - * @return A query that filters hits based on the value of a CSF. - */ - public static Query getDocValRangeQuery(String csfField, ThriftCSFType csfFieldType, - double minVal, double maxVal) { - return new BooleanQuery.Builder() - .add(new DocValRangeFilter(csfField, csfFieldType, minVal, maxVal), - BooleanClause.Occur.FILTER) - .build(); - } - - /** - * Returns a query that filters hits based on the value of a CSF. - * - * @param csfField The CSF name. - * @param csfFieldType The CSF type. - * @param minVal The minimum acceptable value (inclusive). - * @param maxVal The maximum acceptable value (exclusive). 
- * @return A query that filters hits based on the value of a CSF. - */ - public static Query getDocValRangeQuery(String csfField, ThriftCSFType csfFieldType, - long minVal, long maxVal) { - return new BooleanQuery.Builder() - .add(new DocValRangeFilter(csfField, csfFieldType, minVal, maxVal), - BooleanClause.Occur.FILTER) - .build(); - } - - private DocValRangeFilter(String csfField, ThriftCSFType csfFieldType, - double minVal, double maxVal) { - this.csfField = csfField; - this.csfFieldType = csfFieldType; - this.minValInclusive = new Float(minVal); - this.maxValExclusive = new Float(maxVal); - } - - private DocValRangeFilter(String csfField, ThriftCSFType csfFieldType, - long minVal, long maxVal) { - this.csfField = csfField; - this.csfFieldType = csfFieldType; - this.minValInclusive = new Long(minVal); - this.maxValExclusive = new Long(maxVal); - } - - @Override - public int hashCode() { - return (csfField == null ? 0 : csfField.hashCode()) * 29 - + (csfFieldType == null ? 0 : csfFieldType.hashCode()) * 17 - + minValInclusive.hashCode() * 7 - + maxValExclusive.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof DocValRangeFilter)) { - return false; - } - - DocValRangeFilter filter = DocValRangeFilter.class.cast(obj); - return Objects.equals(csfField, filter.csfField) - && (csfFieldType == filter.csfFieldType) - && minValInclusive.equals(filter.minValInclusive) - && maxValExclusive.equals(filter.maxValExclusive); - } - - @Override - public String toString(String field) { - return "DocValRangeFilter:" + csfField - + ",type:" + csfFieldType.toString() - + ",min:" + this.minValInclusive.toString() - + ",max:" + this.maxValExclusive.toString(); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader reader = context.reader(); - if (csfFieldType == null) { - return new AllDocsIterator(reader); - } - - int smallestDoc = (reader instanceof EarlybirdIndexSegmentAtomicReader) - ? 
((EarlybirdIndexSegmentAtomicReader) reader).getSmallestDocID() : 0; - int largestDoc = reader.maxDoc() - 1; - return new CSFRangeDocIdSetIterator(reader, csfField, csfFieldType, - smallestDoc, largestDoc, - minValInclusive, maxValExclusive); - } - }; - } - - private static final class CSFRangeDocIdSetIterator extends RangeFilterDISI { - private final NumericDocValues numericDocValues; - private final ThriftCSFType csfType; - private final Number minValInclusive; - private final Number maxValExclusive; - - public CSFRangeDocIdSetIterator(LeafReader reader, - String csfField, - ThriftCSFType csfType, - int smallestDocID, - int largestDocID, - Number minValInclusive, - Number maxValExclusive) throws IOException { - super(reader, smallestDocID, largestDocID); - this.numericDocValues = reader.getNumericDocValues(csfField); - this.csfType = csfType; - this.minValInclusive = minValInclusive; - this.maxValExclusive = maxValExclusive; - } - - @Override - protected boolean shouldReturnDoc() throws IOException { - if (!numericDocValues.advanceExact(docID())) { - return false; - } - - long val = numericDocValues.longValue(); - switch (csfType) { - case DOUBLE: - double doubleVal = Double.longBitsToDouble(val); - return doubleVal >= minValInclusive.doubleValue() - && doubleVal < maxValExclusive.doubleValue(); - case FLOAT: - float floatVal = Float.intBitsToFloat((int) val); - return floatVal >= minValInclusive.doubleValue() - && floatVal < maxValExclusive.doubleValue(); - case LONG: - return val >= minValInclusive.longValue() && val < maxValExclusive.longValue(); - case INT: - return val >= minValInclusive.longValue() && (int) val < maxValExclusive.longValue(); - case BYTE: - return (byte) val >= minValInclusive.longValue() - && (byte) val < maxValExclusive.longValue(); - default: - return false; - } - } - } - - ////////////////////////// - // for unit tests only - ////////////////////////// - @VisibleForTesting - public Number getMinValForTest() { - return minValInclusive; - } - - @VisibleForTesting - public Number getMaxValForTest() { - return maxValExclusive; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/FeatureValueInAcceptListOrUnsetFilter.java b/src/java/com/twitter/search/earlybird/search/queries/FeatureValueInAcceptListOrUnsetFilter.java deleted file mode 100644 index e4e9d37a7..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/FeatureValueInAcceptListOrUnsetFilter.java +++ /dev/null @@ -1,113 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -public final class FeatureValueInAcceptListOrUnsetFilter extends Query { - - private final String featureName; - private final Set idsAcceptList; - - /** - * Creates a query that filters for hits that have the given feature unset, or that have the - * given feature set to a value in the given list of IDs. 
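   * Hits whose feature is set to a value outside the accept list are filtered out.
   *
   * A minimal usage sketch; the feature name and ID values are illustrative only, and the
   * Set<Long> element type is assumed from the longValue() comparison in the iterator below:
   * <pre>{@code
   * Set<Long> acceptedIds = ImmutableSet.of(1L, 7L);
   * Query filter = FeatureValueInAcceptListOrUnsetFilter
   *     .getFeatureValueInAcceptListOrUnsetFilter("some_feature_csf", acceptedIds);
   * }</pre>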
- * - * @param featureName The feature. - * @param ids A list of id values this filter will accept for the given feature. - * @return A query that filters out all hits that have the given feature set. - */ - public static Query getFeatureValueInAcceptListOrUnsetFilter(String featureName, Set ids) { - return new BooleanQuery.Builder() - .add(new FeatureValueInAcceptListOrUnsetFilter(featureName, ids), - BooleanClause.Occur.FILTER) - .build(); - } - - @Override - public String toString(String s) { - return String.format("FeatureValueInAcceptListOrUnsetFilter(%s, AcceptList = (%s))", - featureName, - idsAcceptList); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof FeatureValueInAcceptListOrUnsetFilter)) { - return false; - } - - FeatureValueInAcceptListOrUnsetFilter filter = - FeatureValueInAcceptListOrUnsetFilter.class.cast(obj); - return featureName.equals(filter.featureName) && idsAcceptList.equals(filter.idsAcceptList); - } - - @Override - public int hashCode() { - return featureName.hashCode() * 7 + idsAcceptList.hashCode(); - } - - private FeatureValueInAcceptListOrUnsetFilter(String featureName, Set ids) { - this.featureName = Preconditions.checkNotNull(featureName); - this.idsAcceptList = Preconditions.checkNotNull(ids); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - return new FeatureValueInAcceptListOrUnsetDocIdSetIterator( - context.reader(), featureName, idsAcceptList); - } - }; - } - - private static final class FeatureValueInAcceptListOrUnsetDocIdSetIterator - extends RangeFilterDISI { - private final NumericDocValues featureDocValues; - private final Set idsAcceptList; - - FeatureValueInAcceptListOrUnsetDocIdSetIterator( - LeafReader indexReader, String featureName, Set ids) throws IOException { - super(indexReader); - this.featureDocValues = indexReader.getNumericDocValues(featureName); - this.idsAcceptList = ids; - } - - @Override - public boolean shouldReturnDoc() throws IOException { - // If featureDocValues is null, that means there were no documents indexed with the given - // field in the current segment. - // - // The advanceExact() method returns false if it cannot find the given docId in the - // NumericDocValues instance. So if advanceExact() returns false then we know the feature is - // unset. - // However, for realtime Earlybirds we have a custom implementation of NumericDocValues, - // ColumnStrideFieldDocValues, which will contain an entry for every indexed docId and use a - // value of 0 to indicate that a feature is unset. - // - // So to check if a feature is unset for a given docId, we first need to check if we can find - // the docId, and then we additionally need to check if the feature value is 0. 
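      // In short: a missing doc values field, a docId that advanceExact() cannot find, or a
      // stored value of 0 are all treated as "unset" and therefore accepted.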
- return featureDocValues == null - || !featureDocValues.advanceExact(docID()) - || featureDocValues.longValue() == 0 - || idsAcceptList.contains(featureDocValues.longValue()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/GeoTwoPhaseQuery.java b/src/java/com/twitter/search/earlybird/search/queries/GeoTwoPhaseQuery.java deleted file mode 100644 index cfae5f988..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/GeoTwoPhaseQuery.java +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Set; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.ConstantScoreQuery; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.TwoPhaseIterator; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; - - -public class GeoTwoPhaseQuery extends Query { - private static final boolean ENABLE_GEO_EARLY_TERMINATION = - EarlybirdConfig.getBool("early_terminate_geo_searches", true); - - private static final int GEO_TIMEOUT_OVERRIDE = - EarlybirdConfig.getInt("early_terminate_geo_searches_timeout_override", -1); - - // How many geo searches are early terminated due to timeout. 
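  // Exported as the "geo_search_timeout_count" counter.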
- private static final SearchCounter GEO_SEARCH_TIMEOUT_COUNT = - SearchCounter.export("geo_search_timeout_count"); - - private final SecondPhaseDocAccepter accepter; - private final TerminationTracker terminationTracker; - private final ConstantScoreQuery query; - - public GeoTwoPhaseQuery( - Query query, SecondPhaseDocAccepter accepter, TerminationTracker terminationTracker) { - this.accepter = accepter; - this.terminationTracker = terminationTracker; - - this.query = new ConstantScoreQuery(query); - } - - @Override - public Query rewrite(IndexReader reader) throws IOException { - Query rewritten = query.getQuery().rewrite(reader); - if (rewritten != query.getQuery()) { - return new GeoTwoPhaseQuery(rewritten, accepter, terminationTracker); - } - - return this; - } - - @Override - public int hashCode() { - return query.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof GeoTwoPhaseQuery)) { - return false; - } - GeoTwoPhaseQuery that = (GeoTwoPhaseQuery) obj; - return query.equals(that.query) - && accepter.equals(that.accepter) - && terminationTracker.equals(that.terminationTracker); - } - - @Override - public String toString(String field) { - return new StringBuilder("GeoTwoPhaseQuery(") - .append("Accepter(") - .append(accepter.toString()) - .append(") Geohashes(") - .append(query.getQuery().toString(field)) - .append("))") - .toString(); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - Weight innerWeight = query.createWeight(searcher, scoreMode, boost); - return new GeoTwoPhaseWeight(this, innerWeight, accepter, terminationTracker); - } - - private static final class GeoTwoPhaseWeight extends Weight { - private final Weight innerWeight; - private final SecondPhaseDocAccepter accepter; - private final TerminationTracker terminationTracker; - - private GeoTwoPhaseWeight( - Query query, - Weight innerWeight, - SecondPhaseDocAccepter accepter, - TerminationTracker terminationTracker) { - super(query); - this.innerWeight = innerWeight; - this.accepter = accepter; - this.terminationTracker = terminationTracker; - } - - @Override - public void extractTerms(Set terms) { - innerWeight.extractTerms(terms); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - return innerWeight.explain(context, doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Scorer innerScorer = innerWeight.scorer(context); - if (innerScorer == null) { - return null; - } - if (ENABLE_GEO_EARLY_TERMINATION - && (terminationTracker == null || !terminationTracker.useLastSearchedDocIdOnTimeout())) { - innerScorer = new ConstantScoreScorer( - this, - 0.0f, - ScoreMode.COMPLETE_NO_SCORES, - new TimedDocIdSetIterator(innerScorer.iterator(), - terminationTracker, - GEO_TIMEOUT_OVERRIDE, - GEO_SEARCH_TIMEOUT_COUNT)); - } - - accepter.initialize(context); - return new GeoTwoPhaseScorer(this, innerScorer, accepter); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return innerWeight.isCacheable(ctx); - } - } - - private static final class GeoTwoPhaseScorer extends Scorer { - private final Scorer innerScorer; - private final SecondPhaseDocAccepter accepter; - - private GeoTwoPhaseScorer(Weight weight, Scorer innerScorer, SecondPhaseDocAccepter accepter) { - super(weight); - this.innerScorer = innerScorer; - this.accepter = accepter; - } - - @Override - public TwoPhaseIterator twoPhaseIterator() { - 
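      // Phase one iterates the wrapped geohash query; phase two applies the
      // SecondPhaseDocAccepter in matches().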
return new TwoPhaseIterator(innerScorer.iterator()) { - @Override - public boolean matches() throws IOException { - return checkDocExpensive(innerScorer.docID()); - } - - @Override - public float matchCost() { - return 0.0f; - } - }; - } - - @Override - public int docID() { - return iterator().docID(); - } - - @Override - public float score() throws IOException { - return innerScorer.score(); - } - - @Override - public DocIdSetIterator iterator() { - return new DocIdSetIterator() { - private int doNext(int startingDocId) throws IOException { - int docId = startingDocId; - while ((docId != NO_MORE_DOCS) && !checkDocExpensive(docId)) { - docId = innerScorer.iterator().nextDoc(); - } - return docId; - } - - @Override - public int docID() { - return innerScorer.iterator().docID(); - } - - @Override - public int nextDoc() throws IOException { - return doNext(innerScorer.iterator().nextDoc()); - } - - @Override - public int advance(int target) throws IOException { - return doNext(innerScorer.iterator().advance(target)); - } - - @Override - public long cost() { - return 2 * innerScorer.iterator().cost(); - } - }; - } - - @Override - public float getMaxScore(int upTo) throws IOException { - return innerScorer.getMaxScore(upTo); - } - - private boolean checkDocExpensive(int doc) throws IOException { - return accepter.accept(doc); - } - } - - public abstract static class SecondPhaseDocAccepter { - /** - * Initializes this accepter with the given reader context. - */ - public abstract void initialize(LeafReaderContext context) throws IOException; - - /** - * Determines if the given doc ID is accepted by this accepter. - */ - public abstract boolean accept(int doc) throws IOException; - - /** - * Returns a string description for this SecondPhaseDocAccepter instance. 
- */ - public abstract String toString(); - } - - public static final SecondPhaseDocAccepter ALL_DOCS_ACCEPTER = new SecondPhaseDocAccepter() { - @Override - public void initialize(LeafReaderContext context) { } - - @Override - public boolean accept(int doc) { - return true; - } - - @Override - public String toString() { - return "AllDocsAccepter"; - } - }; -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocIdSet.java b/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocIdSet.java deleted file mode 100644 index 27c194678..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocIdSet.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.search.DocIdSet; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.Bits; -import org.apache.lucene.util.RamUsageEstimator; - -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; - -public final class MatchAllDocIdSet extends DocIdSet { - private final LeafReader reader; - - public MatchAllDocIdSet(LeafReader reader) { - this.reader = reader; - } - - @Override - public DocIdSetIterator iterator() throws IOException { - return new AllDocsIterator(reader); - } - - @Override - public Bits bits() throws IOException { - return new Bits() { - @Override - public boolean get(int index) { - return true; - } - - @Override - public int length() { - return reader.maxDoc(); - } - }; - } - - @Override - public long ramBytesUsed() { - return RamUsageEstimator.shallowSizeOf(this); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocsQuery.java b/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocsQuery.java deleted file mode 100644 index 5b2b649f5..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/MatchAllDocsQuery.java +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Set; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; - -/** - * A MatchAllDocsQuery implementation that does not assume that doc IDs are assigned sequentially. - * Instead, it wraps the EarlybirdIndexSegmentAtomicReader into a RangeFilterDISI, and uses - * this iterator to traverse only the valid doc IDs in this segment. - * - * Note that org.apache.lucene.index.MatchAllDocsQuery is final, so we cannot extend it. 
- */ -public class MatchAllDocsQuery extends Query { - private static class MatchAllDocsWeight extends Weight { - private final Weight luceneWeight; - - public MatchAllDocsWeight(Query query, Weight luceneWeight) { - super(query); - this.luceneWeight = luceneWeight; - } - - @Override - public void extractTerms(Set terms) { - luceneWeight.extractTerms(terms); - } - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - return luceneWeight.explain(context, doc); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Preconditions.checkState(context.reader() instanceof EarlybirdIndexSegmentAtomicReader, - "Expected an EarlybirdIndexSegmentAtomicReader, but got a " - + context.reader().getClass().getName() + " instance."); - EarlybirdIndexSegmentAtomicReader reader = - (EarlybirdIndexSegmentAtomicReader) context.reader(); - return new ConstantScoreScorer( - this, 1.0f, ScoreMode.COMPLETE_NO_SCORES, new RangeFilterDISI(reader)); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return luceneWeight.isCacheable(ctx); - } - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - org.apache.lucene.search.MatchAllDocsQuery luceneMatchAllDocsQuery = - new org.apache.lucene.search.MatchAllDocsQuery(); - Weight luceneWeight = luceneMatchAllDocsQuery.createWeight(searcher, scoreMode, boost); - if (!(searcher instanceof EarlybirdSingleSegmentSearcher)) { - return luceneWeight; - } - return new MatchAllDocsWeight(this, luceneWeight); - } - - @Override - public int hashCode() { - return 0; - } - - @Override - public boolean equals(Object obj) { - return obj instanceof MatchAllDocsQuery; - } - - // Copied from org.apache.lucene.search.MatchAllDocsWeight - @Override - public String toString(String field) { - return "*:*"; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/RequiredStatusIDsFilter.java b/src/java/com/twitter/search/earlybird/search/queries/RequiredStatusIDsFilter.java deleted file mode 100644 index e62de315f..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/RequiredStatusIDsFilter.java +++ /dev/null @@ -1,131 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Arrays; -import java.util.Collection; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.search.IntArrayDocIdSetIterator; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.earlybird.index.TweetIDMapper; - -public final class RequiredStatusIDsFilter extends Query { - private final Collection statusIDs; - - public static Query getRequiredStatusIDsQuery(Collection statusIDs) { - return new BooleanQuery.Builder() - .add(new RequiredStatusIDsFilter(statusIDs), BooleanClause.Occur.FILTER) - .build(); - } - - private RequiredStatusIDsFilter(Collection statusIDs) { 
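    // Instances are only created through getRequiredStatusIDsQuery() above, which wraps this
    // filter in a BooleanQuery FILTER clause, so it restricts the result set without
    // contributing to scoring.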
- this.statusIDs = Preconditions.checkNotNull(statusIDs); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader leafReader = context.reader(); - if (!(leafReader instanceof EarlybirdIndexSegmentAtomicReader)) { - return DocIdSetIterator.empty(); - } - - EarlybirdIndexSegmentAtomicReader reader = (EarlybirdIndexSegmentAtomicReader) leafReader; - TweetIDMapper idMapper = (TweetIDMapper) reader.getSegmentData().getDocIDToTweetIDMapper(); - - int docIdsSize = 0; - int[] docIds = new int[statusIDs.size()]; - for (long statusID : statusIDs) { - int docId = idMapper.getDocID(statusID); - if (docId >= 0) { - docIds[docIdsSize++] = docId; - } - } - - Arrays.sort(docIds, 0, docIdsSize); - DocIdSetIterator statusesDISI = - new IntArrayDocIdSetIterator(Arrays.copyOf(docIds, docIdsSize)); - DocIdSetIterator allDocsDISI = new AllDocsIterator(reader); - - // We only want to return IDs for fully indexed documents. So we need to make sure that - // every doc ID we return exists in allDocsDISI. However, allDocsDISI has all documents in - // this segment, so driving by allDocsDISI would be very slow. So we want to drive by - // statusesDISI, and use allDocsDISI as a post-filter. What this comes down to is that we do - // not want to call allDocsDISI.nextDoc(); we only want to call allDocsDISI.advance(), and - // only on the doc IDs returned by statusesDISI. - return new DocIdSetIterator() { - @Override - public int docID() { - return statusesDISI.docID(); - } - - @Override - public int nextDoc() throws IOException { - statusesDISI.nextDoc(); - return advanceToNextFullyIndexedDoc(); - } - - @Override - public int advance(int target) throws IOException { - statusesDISI.advance(target); - return advanceToNextFullyIndexedDoc(); - } - - private int advanceToNextFullyIndexedDoc() throws IOException { - while (docID() != DocIdSetIterator.NO_MORE_DOCS) { - // Check if the current doc is fully indexed. - // If it is, then we can return it. If it's not, then we need to keep searching. 
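              // Both iterators only ever move forward here, so the total work stays roughly
              // proportional to the number of requested status IDs rather than to the segment
              // size. If the current doc is not fully indexed, allDocsDISI.advance() lands on
              // the next fully indexed doc, and statusesDISI is advanced to that point before
              // the check is repeated.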
- int allDocsDocId = allDocsDISI.advance(docID()); - if (allDocsDocId == docID()) { - break; - } - - statusesDISI.advance(allDocsDocId); - } - return docID(); - } - - @Override - public long cost() { - return statusesDISI.cost(); - } - }; - } - }; - } - - @Override - public int hashCode() { - return statusIDs.hashCode(); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof RequiredStatusIDsFilter)) { - return false; - } - - RequiredStatusIDsFilter filter = RequiredStatusIDsFilter.class.cast(obj); - return statusIDs.equals(filter.statusIDs); - } - - @Override - public final String toString(String field) { - return String.format("RequiredStatusIDs[%s]", statusIDs); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/SimpleTermQuery.java b/src/java/com/twitter/search/earlybird/search/queries/SimpleTermQuery.java deleted file mode 100644 index 9981ef2ab..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/SimpleTermQuery.java +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.ConstantScoreScorer; -import org.apache.lucene.search.ConstantScoreWeight; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -/** - * A version of a term query that we can use when we already know the term id (in case where we - * previously looked it up), and have a TermsEnum to get the actual postings. - * - * This is can be used for constant score queries, where only iterating on the postings is required. - */ -class SimpleTermQuery extends Query { - private final TermsEnum termsEnum; - private final long termId; - - public SimpleTermQuery(TermsEnum termsEnum, long termId) { - this.termsEnum = termsEnum; - this.termId = termId; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - return new SimpleTermQueryWeight(scoreMode); - } - - @Override - public int hashCode() { - return (termsEnum == null ? 0 : termsEnum.hashCode()) * 13 + (int) termId; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof SimpleTermQuery)) { - return false; - } - - SimpleTermQuery query = SimpleTermQuery.class.cast(obj); - return (termsEnum == null ? query.termsEnum == null : termsEnum.equals(query.termsEnum)) - && (termId == query.termId); - } - - @Override - public String toString(String field) { - return "SimpleTermQuery(" + field + ":" + termId + ")"; - } - - private class SimpleTermQueryWeight extends ConstantScoreWeight { - private final ScoreMode scoreMode; - - public SimpleTermQueryWeight(ScoreMode scoreMode) { - super(SimpleTermQuery.this, 1.0f); - this.scoreMode = scoreMode; - } - - @Override - public String toString() { - return "weight(" + SimpleTermQuery.this + ")"; - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - termsEnum.seekExact(termId); - - PostingsEnum docs = termsEnum.postings( - null, scoreMode.needsScores() ? 
PostingsEnum.FREQS : PostingsEnum.NONE); - assert docs != null; - return new ConstantScoreScorer(this, 0, scoreMode, docs); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return true; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/SinceMaxIDFilter.java b/src/java/com/twitter/search/earlybird/search/queries/SinceMaxIDFilter.java deleted file mode 100644 index aae8fcf2f..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/SinceMaxIDFilter.java +++ /dev/null @@ -1,211 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; -import com.twitter.search.earlybird.index.TweetIDMapper; - -/** - * Filters tweet ids according to since_id and max_id parameter. - * - * Note that since_id is exclusive and max_id is inclusive. - */ -public final class SinceMaxIDFilter extends Query { - public static final long NO_FILTER = -1; - - private final long sinceIdExclusive; - private final long maxIdInclusive; - - public static Query getSinceMaxIDQuery(long sinceIdExclusive, long maxIdInclusive) { - return new BooleanQuery.Builder() - .add(new SinceMaxIDFilter(sinceIdExclusive, maxIdInclusive), BooleanClause.Occur.FILTER) - .build(); - } - - public static Query getSinceIDQuery(long sinceIdExclusive) { - return new BooleanQuery.Builder() - .add(new SinceMaxIDFilter(sinceIdExclusive, NO_FILTER), BooleanClause.Occur.FILTER) - .build(); - } - - public static Query getMaxIDQuery(long maxIdInclusive) { - return new BooleanQuery.Builder() - .add(new SinceMaxIDFilter(NO_FILTER, maxIdInclusive), BooleanClause.Occur.FILTER) - .build(); - } - - private SinceMaxIDFilter(long sinceIdExclusive, long maxIdInclusive) { - this.sinceIdExclusive = sinceIdExclusive; - this.maxIdInclusive = maxIdInclusive; - } - - @Override - public int hashCode() { - return (int) (sinceIdExclusive * 13 + maxIdInclusive); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof SinceMaxIDFilter)) { - return false; - } - - SinceMaxIDFilter filter = SinceMaxIDFilter.class.cast(obj); - return (sinceIdExclusive == filter.sinceIdExclusive) - && (maxIdInclusive == filter.maxIdInclusive); - } - - @Override - public String toString(String field) { - if (sinceIdExclusive != NO_FILTER && maxIdInclusive != NO_FILTER) { - return "SinceIdFilter:" + sinceIdExclusive + ",MaxIdFilter:" + maxIdInclusive; - } else if (maxIdInclusive != NO_FILTER) { - return "MaxIdFilter:" + maxIdInclusive; - } else { - return "SinceIdFilter:" + sinceIdExclusive; - } - } - - /** - * Determines if this segment is at least partially covered by the given tweet ID range. 
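   * In other words, the requested (sinceIdExclusive, maxIdInclusive] window must overlap the
   * segment's [minTweetID, maxTweetID] range; a bound set to NO_FILTER is ignored.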
- */ - public static boolean sinceMaxIDsInRange( - TweetIDMapper tweetIdMapper, long sinceIdExclusive, long maxIdInclusive) { - // Check for since id out of range. Note that since this ID is exclusive, - // equality is out of range too. - if (sinceIdExclusive != NO_FILTER && sinceIdExclusive >= tweetIdMapper.getMaxTweetID()) { - return false; - } - - // Check for max id in range. - return maxIdInclusive == NO_FILTER || maxIdInclusive >= tweetIdMapper.getMinTweetID(); - } - - // Returns true if this segment is completely covered by these id filters. - private static boolean sinceMaxIdsCoverRange( - TweetIDMapper tweetIdMapper, long sinceIdExclusive, long maxIdInclusive) { - // Check for since_id specified AND since_id newer than than first tweet. - if (sinceIdExclusive != NO_FILTER && sinceIdExclusive >= tweetIdMapper.getMinTweetID()) { - return false; - } - - // Check for max id in range. - return maxIdInclusive == NO_FILTER || maxIdInclusive > tweetIdMapper.getMaxTweetID(); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader reader = context.reader(); - if (!(reader instanceof EarlybirdIndexSegmentAtomicReader)) { - return new AllDocsIterator(reader); - } - - EarlybirdIndexSegmentAtomicReader twitterInMemoryIndexReader = - (EarlybirdIndexSegmentAtomicReader) reader; - TweetIDMapper tweetIdMapper = - (TweetIDMapper) twitterInMemoryIndexReader.getSegmentData().getDocIDToTweetIDMapper(); - - // Important to return a null DocIdSetIterator here, so the Scorer will skip searching - // this segment completely. - if (!sinceMaxIDsInRange(tweetIdMapper, sinceIdExclusive, maxIdInclusive)) { - return null; - } - - // Optimization: just return a match-all iterator when the whole segment is in range. - // This avoids having to do so many status id lookups. - if (sinceMaxIdsCoverRange(tweetIdMapper, sinceIdExclusive, maxIdInclusive)) { - return new AllDocsIterator(reader); - } - - return new SinceMaxIDDocIdSetIterator( - twitterInMemoryIndexReader, sinceIdExclusive, maxIdInclusive); - } - }; - } - - @VisibleForTesting - static class SinceMaxIDDocIdSetIterator extends RangeFilterDISI { - private final DocIDToTweetIDMapper docIdToTweetIdMapper; - private final long sinceIdExclusive; - private final long maxIdInclusive; - - public SinceMaxIDDocIdSetIterator(EarlybirdIndexSegmentAtomicReader reader, - long sinceIdExclusive, - long maxIdInclusive) throws IOException { - super(reader, - findMaxIdDocID(reader, maxIdInclusive), - findSinceIdDocID(reader, sinceIdExclusive)); - this.docIdToTweetIdMapper = reader.getSegmentData().getDocIDToTweetIDMapper(); - this.sinceIdExclusive = sinceIdExclusive; // sinceStatusId == NO_FILTER is OK, it's exclusive - this.maxIdInclusive = maxIdInclusive != NO_FILTER ? maxIdInclusive : Long.MAX_VALUE; - } - - /** - * This is a necessary check when we have out of order tweets in the archive. - * When tweets are out of order, this guarantees that no false positive results are returned. - * I.e. we can still miss some tweets in the specified range, but we never incorrectly return - * anything that's not in the range. 
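     * For example (purely illustrative), if an archive segment happens to store tweet IDs in doc
     * order [105, 99, 110, 103] and the request is since_id=100, max_id=108, the doc ID bounds
     * derived from findDocIdBound() may still leave 99 and 110 inside the scanned range;
     * shouldReturnDoc() below rejects both, because 99 is not greater than since_id and 110 is
     * greater than max_id.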
- */ - @Override - protected boolean shouldReturnDoc() { - final long statusID = docIdToTweetIdMapper.getTweetID(docID()); - return statusID > sinceIdExclusive && statusID <= maxIdInclusive; - } - - private static int findSinceIdDocID( - EarlybirdIndexSegmentAtomicReader reader, long sinceIdExclusive) throws IOException { - TweetIDMapper tweetIdMapper = - (TweetIDMapper) reader.getSegmentData().getDocIDToTweetIDMapper(); - if (sinceIdExclusive != SinceMaxIDFilter.NO_FILTER) { - // We use this as an upper bound on the search, so we want to find the highest possible - // doc ID for this tweet ID. - boolean findMaxDocID = true; - return tweetIdMapper.findDocIdBound( - sinceIdExclusive, - findMaxDocID, - reader.getSmallestDocID(), - reader.maxDoc() - 1); - } else { - return DocIDToTweetIDMapper.ID_NOT_FOUND; - } - } - - private static int findMaxIdDocID( - EarlybirdIndexSegmentAtomicReader reader, long maxIdInclusive) throws IOException { - TweetIDMapper tweetIdMapper = - (TweetIDMapper) reader.getSegmentData().getDocIDToTweetIDMapper(); - if (maxIdInclusive != SinceMaxIDFilter.NO_FILTER) { - // We use this as a lower bound on the search, so we want to find the lowest possible - // doc ID for this tweet ID. - boolean findMaxDocID = false; - return tweetIdMapper.findDocIdBound( - maxIdInclusive, - findMaxDocID, - reader.getSmallestDocID(), - reader.maxDoc() - 1); - } else { - return DocIDToTweetIDMapper.ID_NOT_FOUND; - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/SinceUntilFilter.java b/src/java/com/twitter/search/earlybird/search/queries/SinceUntilFilter.java deleted file mode 100644 index 1f68975c4..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/SinceUntilFilter.java +++ /dev/null @@ -1,137 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -// Filters tweets according to since time and until time (in seconds). -// Note that since time is inclusive, and until time is exclusive. -public final class SinceUntilFilter extends Query { - public static final int NO_FILTER = -1; - - // These are both in seconds since the epoch. 
- private final int minTimeInclusive; - private final int maxTimeExclusive; - - public static Query getSinceQuery(int sinceTimeSeconds) { - return new BooleanQuery.Builder() - .add(new SinceUntilFilter(sinceTimeSeconds, NO_FILTER), BooleanClause.Occur.FILTER) - .build(); - } - - public static Query getUntilQuery(int untilTimeSeconds) { - return new BooleanQuery.Builder() - .add(new SinceUntilFilter(NO_FILTER, untilTimeSeconds), BooleanClause.Occur.FILTER) - .build(); - } - - public static Query getSinceUntilQuery(int sinceTimeSeconds, int untilTimeSeconds) { - return new BooleanQuery.Builder() - .add(new SinceUntilFilter(sinceTimeSeconds, untilTimeSeconds), BooleanClause.Occur.FILTER) - .build(); - } - - private SinceUntilFilter(int sinceTime, int untilTime) { - this.minTimeInclusive = sinceTime != NO_FILTER ? sinceTime : 0; - this.maxTimeExclusive = untilTime != NO_FILTER ? untilTime : Integer.MAX_VALUE; - } - - @Override - public int hashCode() { - return (int) (minTimeInclusive * 17 + maxTimeExclusive); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof SinceUntilFilter)) { - return false; - } - - SinceUntilFilter filter = SinceUntilFilter.class.cast(obj); - return (minTimeInclusive == filter.minTimeInclusive) - && (maxTimeExclusive == filter.maxTimeExclusive); - } - - @Override - public String toString(String field) { - if (minTimeInclusive > 0 && maxTimeExclusive != Integer.MAX_VALUE) { - return "SinceFilter:" + this.minTimeInclusive + ",UntilFilter:" + maxTimeExclusive; - } else if (minTimeInclusive > 0) { - return "SinceFilter:" + this.minTimeInclusive; - } else { - return "UntilFilter:" + this.maxTimeExclusive; - } - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader indexReader = context.reader(); - if (!(indexReader instanceof EarlybirdIndexSegmentAtomicReader)) { - return new AllDocsIterator(indexReader); - } - - EarlybirdIndexSegmentAtomicReader reader = (EarlybirdIndexSegmentAtomicReader) indexReader; - TimeMapper timeMapper = reader.getSegmentData().getTimeMapper(); - int smallestDocID = timeMapper.findFirstDocId(maxTimeExclusive, reader.getSmallestDocID()); - int largestDoc = timeMapper.findFirstDocId(minTimeInclusive, reader.getSmallestDocID()); - int smallestDoc = smallestDocID > 0 ? smallestDocID - 1 : 0; - return new SinceUntilDocIdSetIterator( - reader, - timeMapper, - smallestDoc, - largestDoc, - minTimeInclusive, - maxTimeExclusive); - } - }; - } - - // Returns true if this TimeMapper is at least partially covered by these time filters. 
- public static boolean sinceUntilTimesInRange( - TimeMapper timeMapper, int sinceTime, int untilTime) { - return (sinceTime == NO_FILTER || sinceTime <= timeMapper.getLastTime()) - && (untilTime == NO_FILTER || untilTime >= timeMapper.getFirstTime()); - } - - private static final class SinceUntilDocIdSetIterator extends RangeFilterDISI { - private final TimeMapper timeMapper; - private final int minTimeInclusive; - private final int maxTimeExclusive; - - public SinceUntilDocIdSetIterator(EarlybirdIndexSegmentAtomicReader reader, - TimeMapper timeMapper, - int smallestDocID, - int largestDocID, - int minTimeInclusive, - int maxExclusive) throws IOException { - super(reader, smallestDocID, largestDocID); - this.timeMapper = timeMapper; - this.minTimeInclusive = minTimeInclusive; - this.maxTimeExclusive = maxExclusive; - } - - @Override - protected boolean shouldReturnDoc() { - final int docTime = timeMapper.getTime(docID()); - return docTime >= minTimeInclusive && docTime < maxTimeExclusive; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/TermQueryWithSafeToString.java b/src/java/com/twitter/search/earlybird/search/queries/TermQueryWithSafeToString.java deleted file mode 100644 index 3ae4a0c15..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/TermQueryWithSafeToString.java +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import org.apache.lucene.index.Term; -import org.apache.lucene.search.TermQuery; - -/** - * Work around an issue where IntTerms and LongTerms are not valid utf8, - * so calling toString on any TermQuery containing an IntTerm or a LongTerm may cause exceptions. - * This code should produce the same output as TermQuery.toString - */ -public final class TermQueryWithSafeToString extends TermQuery { - private final String termValueForToString; - - public TermQueryWithSafeToString(Term term, String termValueForToString) { - super(term); - this.termValueForToString = termValueForToString; - } - - @Override - public String toString(String field) { - StringBuilder buffer = new StringBuilder(); - if (!getTerm().field().equals(field)) { - buffer.append(getTerm().field()); - buffer.append(":"); - } - buffer.append(termValueForToString); - return buffer.toString(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/TimedDocIdSetIterator.java b/src/java/com/twitter/search/earlybird/search/queries/TimedDocIdSetIterator.java deleted file mode 100644 index e6d65868b..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/TimedDocIdSetIterator.java +++ /dev/null @@ -1,128 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.search.DocIdSetIterator; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; - -/** - * DocIdSetIterator whose nextDoc() and advance() will early terminate by returning NO_MORE_DOCS - * after the given deadline. 
- */ -public class TimedDocIdSetIterator extends DocIdSetIterator { - // check deadline every NEXT_CALL_TIMEOUT_CHECK_PERIOD calls to nextDoc() - @VisibleForTesting - protected static final int NEXT_CALL_TIMEOUT_CHECK_PERIOD = - EarlybirdConfig.getInt("timed_doc_id_set_next_doc_deadline_check_period", 1000); - - - // check deadline every ADVANCE_CALL_TIMEOUT_CHECK_PERIOD calls to advance() - private static final int ADVANCE_CALL_TIMEOUT_CHECK_PERIOD = - EarlybirdConfig.getInt("timed_doc_id_set_advance_deadline_check_period", 100); - - private final Clock clock; - private final DocIdSetIterator innerIterator; - private final SearchCounter timeoutCountStat; - - @Nullable - private final TerminationTracker terminationTracker; - private final long deadlineMillisFromEpoch; - - private int docId = -1; - private int nextCounter = 0; - private int advanceCounter = 0; - - public TimedDocIdSetIterator(DocIdSetIterator innerIterator, - @Nullable TerminationTracker terminationTracker, - final long timeoutOverride, - @Nullable SearchCounter timeoutCountStat) { - this(innerIterator, terminationTracker, timeoutOverride, timeoutCountStat, Clock.SYSTEM_CLOCK); - } - - protected TimedDocIdSetIterator(DocIdSetIterator innerIterator, - @Nullable TerminationTracker terminationTracker, - final long timeoutOverride, - @Nullable SearchCounter timeoutCountStat, - Clock clock) { - this.clock = clock; - this.innerIterator = innerIterator; - this.timeoutCountStat = timeoutCountStat; - this.terminationTracker = terminationTracker; - - if (terminationTracker == null) { - deadlineMillisFromEpoch = -1; - } else { - if (timeoutOverride > 0) { - deadlineMillisFromEpoch = terminationTracker.getClientStartTimeMillis() + timeoutOverride; - } else { - deadlineMillisFromEpoch = terminationTracker.getTimeoutEndTimeWithReservation(); - } - } - } - - @VisibleForTesting - protected TimedDocIdSetIterator(DocIdSetIterator innerIterator, - final long deadline, - @Nullable SearchCounter timeoutCountStat, - Clock clock) { - this.clock = clock; - this.innerIterator = innerIterator; - this.timeoutCountStat = timeoutCountStat; - this.terminationTracker = null; - - this.deadlineMillisFromEpoch = deadline; - } - - - @Override - public int docID() { - return docId; - } - - @Override - public int nextDoc() throws IOException { - if (++nextCounter % NEXT_CALL_TIMEOUT_CHECK_PERIOD == 0 - && clock.nowMillis() > deadlineMillisFromEpoch) { - if (timeoutCountStat != null) { - timeoutCountStat.increment(); - } - if (terminationTracker != null) { - terminationTracker.setEarlyTerminationState( - EarlyTerminationState.TERMINATED_TIME_OUT_EXCEEDED); - } - - return docId = NO_MORE_DOCS; - } - return docId = innerIterator.nextDoc(); - } - - @Override - public int advance(int target) throws IOException { - if (++advanceCounter % ADVANCE_CALL_TIMEOUT_CHECK_PERIOD == 0 - && clock.nowMillis() > deadlineMillisFromEpoch) { - if (timeoutCountStat != null) { - timeoutCountStat.increment(); - } - if (terminationTracker != null) { - terminationTracker.setEarlyTerminationState( - EarlyTerminationState.TERMINATED_TIME_OUT_EXCEEDED); - } - return docId = NO_MORE_DOCS; - } - - return docId = innerIterator.advance(target); - } - - @Override - public long cost() { - return innerIterator.cost(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/UserFlagsExcludeFilter.java b/src/java/com/twitter/search/earlybird/search/queries/UserFlagsExcludeFilter.java deleted file mode 100644 index a3d0890ff..000000000 --- 
a/src/java/com/twitter/search/earlybird/search/queries/UserFlagsExcludeFilter.java +++ /dev/null @@ -1,128 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.util.AllDocsIterator; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; -import com.twitter.search.earlybird.common.userupdates.UserTable; - -public final class UserFlagsExcludeFilter extends Query { - /** - * Returns a query that filters hits based on their author flags. - * - * @param excludeAntisocial Determines if the filter should exclude hits from antisocial users. - * @param excludeOffensive Determines if the filter should exclude hits from offensive users. - * @param excludeProtected Determines if the filter should exclude hits from protected users - * @return A query that filters hits based on their author flags. - */ - public static Query getUserFlagsExcludeFilter(UserTable userTable, - boolean excludeAntisocial, - boolean excludeOffensive, - boolean excludeProtected) { - return new BooleanQuery.Builder() - .add(new UserFlagsExcludeFilter( - userTable, excludeAntisocial, excludeOffensive, excludeProtected), - BooleanClause.Occur.FILTER) - .build(); - } - - private final UserTable userTable; - private final boolean excludeAntisocial; - private final boolean excludeOffensive; - private final boolean excludeProtected; - - private UserFlagsExcludeFilter( - UserTable userTable, - boolean excludeAntisocial, - boolean excludeOffensive, - boolean excludeProtected) { - this.userTable = userTable; - this.excludeAntisocial = excludeAntisocial; - this.excludeOffensive = excludeOffensive; - this.excludeProtected = excludeProtected; - } - - @Override - public int hashCode() { - return (excludeAntisocial ? 13 : 0) + (excludeOffensive ? 1 : 0) + (excludeProtected ? 2 : 0); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof UserFlagsExcludeFilter)) { - return false; - } - - UserFlagsExcludeFilter filter = UserFlagsExcludeFilter.class.cast(obj); - return (excludeAntisocial == filter.excludeAntisocial) - && (excludeOffensive == filter.excludeOffensive) - && (excludeProtected == filter.excludeProtected); - } - - @Override - public String toString(String field) { - return "UserFlagsExcludeFilter"; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - LeafReader reader = context.reader(); - if (userTable == null) { - return new AllDocsIterator(reader); - } - - final int bits = - (excludeAntisocial ? UserTable.ANTISOCIAL_BIT : 0) - | (excludeOffensive ? UserTable.OFFENSIVE_BIT | UserTable.NSFW_BIT : 0) - | (excludeProtected ? 
UserTable.IS_PROTECTED_BIT : 0); - if (bits != 0) { - return new UserFlagsExcludeDocIdSetIterator(reader, userTable) { - @Override - protected boolean checkUserFlags(UserTable table, long userID) { - return !table.isSet(userID, bits); - } - }; - } - - return new AllDocsIterator(reader); - } - }; - } - - private abstract static class UserFlagsExcludeDocIdSetIterator extends RangeFilterDISI { - private final UserTable userTable; - private final NumericDocValues fromUserID; - - public UserFlagsExcludeDocIdSetIterator( - LeafReader indexReader, UserTable table) throws IOException { - super(indexReader); - userTable = table; - fromUserID = - indexReader.getNumericDocValues(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName()); - } - - @Override - protected boolean shouldReturnDoc() throws IOException { - return fromUserID.advanceExact(docID()) - && checkUserFlags(userTable, fromUserID.longValue()); - } - - protected abstract boolean checkUserFlags(UserTable table, long userID); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/UserIdMultiSegmentQuery.java b/src/java/com/twitter/search/earlybird/search/queries/UserIdMultiSegmentQuery.java deleted file mode 100644 index 891a21bd6..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/UserIdMultiSegmentQuery.java +++ /dev/null @@ -1,528 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Arrays; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.TimeUnit; -import java.util.stream.Collectors; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.BulkScorer; -import org.apache.lucene.search.ConstantScoreQuery; -import org.apache.lucene.search.ConstantScoreWeight; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; -import org.apache.lucene.util.BytesRef; - -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.query.IDDisjunctionQuery; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.IndexedNumericFieldSettings; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.search.termination.QueryTimeout; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import 
com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentData; -import com.twitter.search.core.earlybird.index.inverted.InvertedIndex; -import com.twitter.search.core.earlybird.index.inverted.MultiSegmentTermDictionary; -import com.twitter.search.earlybird.partition.MultiSegmentTermDictionaryManager; -import com.twitter.search.earlybird.queryparser.EarlybirdQueryHelper; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * A variant of a multi-term ID disjunction query (similar to {@link UserIdMultiSegmentQuery}), - * that also uses a {@link MultiSegmentTermDictionary} where available, for more efficient - * term lookups for queries that span multiple segments. - * - * By default, a IDDisjunctionQuery (or Lucene's MultiTermQuery), does a term dictionary lookup - * for all of the terms in its disjunction, and it does it once for each segment (or AtomicReader) - * that the query is searching. - * This means that when the term dictionary is large, and the term lookups are expensive, and when - * we are searching multiple segments, the query needs to make num_terms * num_segments expensive - * term dictionary lookups. - * - * With the help of a MultiSegmentTermDictionary, this multi-term disjunction query implementation - * only does one lookup for all of the segments managed by the MultiSegmentTermDictionary. - * If a segment is not supported by the MultiSegmentTermDictionary (e.g. if it's not optimized yet), - * a regular lookup in that segment's term dictionary will be performed. - * - * Usually, we will make 'num_terms' lookups in the current, un-optimized segment, and then if - * more segments need to be searched, we will make another 'num_terms' lookups, once for all of - * the remaining segments. - * - * When performing lookups in the MultiSegmentTermDictionary, for each supported segment, we save - * a list of termIds from that segment for all the searched terms that appear in that segment. 
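 * In other words, the number of term dictionary lookups drops from roughly
 * num_terms * num_segments to about 2 * num_terms: one pass over the current, un-optimized
 * segment plus a single pass over the multi-segment term dictionary covering the remaining
 * segments (e.g. 500 user IDs across 20 segments means about 1,000 lookups instead of 10,000).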
- * - * For example, when querying for UserIdMultiSegmentQuery with user ids: {1L, 2L, 3L} and - * segments: {1, 2}, where segment 1 has user ids {1L, 2L} indexed under termIds {100, 200}, - * and segment 2 has user ids {1L, 2L, 3L} indexed under termIds {200, 300, 400}, we will build - * up the following map once: - * segment1 -> [100, 200] - * segment2 -> [200, 300, 400] - */ -public class UserIdMultiSegmentQuery extends Query { - @VisibleForTesting - public static final SearchTimerStats TERM_LOOKUP_STATS = - SearchTimerStats.export("multi_segment_query_term_lookup", TimeUnit.NANOSECONDS, false); - public static final SearchTimerStats QUERY_FROM_PRECOMPUTED = - SearchTimerStats.export("multi_segment_query_from_precomputed", TimeUnit.NANOSECONDS, false); - public static final SearchTimerStats QUERY_REGULAR = - SearchTimerStats.export("multi_segment_query_regular", TimeUnit.NANOSECONDS, false); - - @VisibleForTesting - public static final SearchCounter USED_MULTI_SEGMENT_TERM_DICTIONARY_COUNT = SearchCounter.export( - "user_id_multi_segment_query_used_multi_segment_term_dictionary_count"); - @VisibleForTesting - public static final SearchCounter USED_ORIGINAL_TERM_DICTIONARY_COUNT = SearchCounter.export( - "user_id_multi_segment_query_used_original_term_dictionary_count"); - - private static final SearchCounter NEW_QUERY_COUNT = - SearchCounter.export("user_id_multi_segment_new_query_count"); - private static final SearchCounter OLD_QUERY_COUNT = - SearchCounter.export("user_id_multi_segment_old_query_count"); - - private static final HashMap QUERY_COUNT_BY_QUERY_NAME = new HashMap<>(); - private static final HashMap QUERY_COUNT_BY_FIELD_NAME = new HashMap<>(); - - private static final String DECIDER_KEY_PREFIX = "use_multi_segment_id_disjunction_queries_in_"; - - /** - * Returns a new user ID disjunction query. - * - * @param ids The user IDs. - * @param field The field storing the user IDs. - * @param schemaSnapshot A snapshot of earlybird's schema. - * @param multiSegmentTermDictionaryManager The manager for the term dictionaries that span - * multiple segments. - * @param decider The decider. - * @param earlybirdCluster The earlybird cluster. - * @param ranks The hit attribution ranks to be assigned to every user ID. - * @param hitAttributeHelper The helper that tracks hit attributions. - * @param queryTimeout The timeout to be enforced on this query. - * @return A new user ID disjunction query. 
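   * (When the multi-segment term dictionary decider is off for this cluster, the method falls
   * back to a plain IDDisjunctionQuery instead.)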
- */ - public static Query createIdDisjunctionQuery( - String queryName, - List ids, - String field, - ImmutableSchemaInterface schemaSnapshot, - MultiSegmentTermDictionaryManager multiSegmentTermDictionaryManager, - Decider decider, - EarlybirdCluster earlybirdCluster, - List ranks, - @Nullable HitAttributeHelper hitAttributeHelper, - @Nullable QueryTimeout queryTimeout) throws QueryParserException { - QUERY_COUNT_BY_QUERY_NAME.computeIfAbsent(queryName, name -> - SearchCounter.export("multi_segment_query_name_" + name)).increment(); - QUERY_COUNT_BY_FIELD_NAME.computeIfAbsent(field, name -> - SearchCounter.export("multi_segment_query_count_for_field_" + name)).increment(); - - if (DeciderUtil.isAvailableForRandomRecipient(decider, getDeciderName(earlybirdCluster))) { - NEW_QUERY_COUNT.increment(); - MultiSegmentTermDictionary multiSegmentTermDictionary = - multiSegmentTermDictionaryManager.getMultiSegmentTermDictionary(field); - return new UserIdMultiSegmentQuery( - ids, - field, - schemaSnapshot, - multiSegmentTermDictionary, - ranks, - hitAttributeHelper, - queryTimeout); - } else { - OLD_QUERY_COUNT.increment(); - return new IDDisjunctionQuery(ids, field, schemaSnapshot); - } - } - - @VisibleForTesting - public static String getDeciderName(EarlybirdCluster earlybirdCluster) { - return DECIDER_KEY_PREFIX + earlybirdCluster.name().toLowerCase(); - } - - private final boolean useOrderPreservingEncoding; - private final HitAttributeHelper hitAttributeHelper; - private final QueryTimeout queryTimeout; - private final MultiSegmentTermDictionary multiSegmentTermDictionary; - private final Schema.FieldInfo fieldInfo; - private final String field; - private final List ids; - - private final List ranks; - // For each segment where we have a multi-segment term dictionary, this map will contain the - // termIds of all the terms that actually appear in that segment's index. 
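  // Built lazily by createTermIdsPerSegment() the first time a segment backed by the
  // multi-segment term dictionary is searched; segments not covered by the dictionary fall back
  // to regular per-segment term lookups in addTermQueries().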
- @Nullable - private Map> termIdsPerSegment; - - // A wrap class helps to associate termId with corresponding search operator rank if exist - private final class TermRankPair { - private final int termId; - private final int rank; - - TermRankPair(int termId, int rank) { - this.termId = termId; - this.rank = rank; - } - - public int getTermId() { - return termId; - } - - public int getRank() { - return rank; - } - } - - @VisibleForTesting - public UserIdMultiSegmentQuery( - List ids, - String field, - ImmutableSchemaInterface schemaSnapshot, - MultiSegmentTermDictionary termDictionary, - List ranks, - @Nullable HitAttributeHelper hitAttributeHelper, - @Nullable QueryTimeout queryTimeout) { - this.field = field; - this.ids = ids; - this.multiSegmentTermDictionary = termDictionary; - this.ranks = ranks; - this.hitAttributeHelper = hitAttributeHelper; - this.queryTimeout = queryTimeout; - - // check ids and ranks have same size - Preconditions.checkArgument(ranks.size() == 0 || ranks.size() == ids.size()); - // hitAttributeHelper is not null iff ranks is not empty - if (ranks.size() > 0) { - Preconditions.checkNotNull(hitAttributeHelper); - } else { - Preconditions.checkArgument(hitAttributeHelper == null); - } - - if (!schemaSnapshot.hasField(field)) { - throw new IllegalStateException("Tried to search a field which does not exist in schema"); - } - this.fieldInfo = Preconditions.checkNotNull(schemaSnapshot.getFieldInfo(field)); - - IndexedNumericFieldSettings numericFieldSettings = - fieldInfo.getFieldType().getNumericFieldSettings(); - if (numericFieldSettings == null) { - throw new IllegalStateException("Id field is not numerical"); - } - - this.useOrderPreservingEncoding = numericFieldSettings.isUseSortableEncoding(); - } - - /** - * If it hasn't been built yet, build up the map containing termIds of all the terms being - * searched, for all of the segments that are managed by the multi-segment term dictionary. - * - * We only do this once, when we have to search the first segment that's supported by our - * multi-segment term dictionary. - * - * Flow here is to: - * 1. go through all the ids being queried. - * 2. for each id, get the termIds for that term in all of the segments in the term dictionary - * 3. for all of the segments that have that term, add the termId to that segment's list of - * term ids (in the 'termIdsPerSegment' map). - */ - private void createTermIdsPerSegment() { - if (termIdsPerSegment != null) { - // already created the map - return; - } - - long start = System.nanoTime(); - - final BytesRef termRef = useOrderPreservingEncoding - ? 
SortableLongTermAttributeImpl.newBytesRef() - : LongTermAttributeImpl.newBytesRef(); - - termIdsPerSegment = Maps.newHashMap(); - List segmentIndexes = multiSegmentTermDictionary.getSegmentIndexes(); - - for (int idx = 0; idx < ids.size(); ++idx) { - long longTerm = ids.get(idx); - - if (useOrderPreservingEncoding) { - SortableLongTermAttributeImpl.copyLongToBytesRef(termRef, longTerm); - } else { - LongTermAttributeImpl.copyLongToBytesRef(termRef, longTerm); - } - - int[] termIds = multiSegmentTermDictionary.lookupTermIds(termRef); - Preconditions.checkState(segmentIndexes.size() == termIds.length, - "SegmentIndexes: %s, field: %s, termIds: %s", - segmentIndexes.size(), field, termIds.length); - - for (int indexId = 0; indexId < termIds.length; indexId++) { - int termId = termIds[indexId]; - if (termId != EarlybirdIndexSegmentAtomicReader.TERM_NOT_FOUND) { - InvertedIndex fieldIndex = segmentIndexes.get(indexId); - - List termIdsList = termIdsPerSegment.get(fieldIndex); - if (termIdsList == null) { - termIdsList = Lists.newArrayList(); - termIdsPerSegment.put(fieldIndex, termIdsList); - } - termIdsList.add(new TermRankPair( - termId, ranks.size() > 0 ? ranks.get(idx) : -1)); - } - } - } - - long elapsed = System.nanoTime() - start; - TERM_LOOKUP_STATS.timerIncrement(elapsed); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new UserIdMultiSegmentQueryWeight(searcher, scoreMode, boost); - } - - @Override - public int hashCode() { - return Arrays.hashCode( - new Object[] {useOrderPreservingEncoding, queryTimeout, field, ids, ranks}); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof UserIdMultiSegmentQuery)) { - return false; - } - - UserIdMultiSegmentQuery query = UserIdMultiSegmentQuery.class.cast(obj); - return Arrays.equals( - new Object[] {useOrderPreservingEncoding, queryTimeout, field, ids, ranks}, - new Object[] {query.useOrderPreservingEncoding, - query.queryTimeout, - query.field, - query.ids, - query.ranks}); - } - - @Override - public String toString(String fieldName) { - StringBuilder builder = new StringBuilder(); - builder.append(getClass().getSimpleName()).append("[").append(fieldName).append(":"); - for (Long id : this.ids) { - builder.append(id); - builder.append(","); - } - builder.setLength(builder.length() - 1); - builder.append("]"); - return builder.toString(); - } - - private final class UserIdMultiSegmentQueryWeight extends ConstantScoreWeight { - private final IndexSearcher searcher; - private final ScoreMode scoreMode; - - private UserIdMultiSegmentQueryWeight( - IndexSearcher searcher, - ScoreMode scoreMode, - float boost) { - super(UserIdMultiSegmentQuery.this, boost); - this.searcher = searcher; - this.scoreMode = scoreMode; - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - Weight weight = rewrite(context); - if (weight != null) { - return weight.scorer(context); - } else { - return null; - } - } - - @Override - public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { - Weight weight = rewrite(context); - if (weight != null) { - return weight.bulkScorer(context); - } else { - return null; - } - } - - @Override - public void extractTerms(Set terms) { - terms.addAll(ids - .stream() - .map(id -> new Term(field, LongTermAttributeImpl.copyIntoNewBytesRef(id))) - .collect(Collectors.toSet())); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return true; - } - - private Weight 
rewrite(LeafReaderContext context) throws IOException { - final Terms terms = context.reader().terms(field); - if (terms == null) { - // field does not exist - return null; - } - final TermsEnum termsEnum = terms.iterator(); - Preconditions.checkNotNull(termsEnum, "No termsEnum for field: %s", field); - - BooleanQuery bq; - // See if the segment is supported by the multi-segment term dictionary. If so, build up - // the query using the termIds from the multi-segment term dictionary. - // If not (for the current segment), do the term lookups directly in the queried segment. - InvertedIndex fieldIndex = getFieldIndexFromMultiTermDictionary(context); - if (fieldIndex != null) { - createTermIdsPerSegment(); - - USED_MULTI_SEGMENT_TERM_DICTIONARY_COUNT.increment(); - SearchTimer timer = QUERY_FROM_PRECOMPUTED.startNewTimer(); - bq = addPrecomputedTermQueries(fieldIndex, termsEnum); - QUERY_FROM_PRECOMPUTED.stopTimerAndIncrement(timer); - } else { - USED_ORIGINAL_TERM_DICTIONARY_COUNT.increment(); - // This segment is not supported by the multi-segment term dictionary. Lookup terms - // directly. - SearchTimer timer = QUERY_REGULAR.startNewTimer(); - bq = addTermQueries(termsEnum); - QUERY_REGULAR.stopTimerAndIncrement(timer); - } - - return searcher.rewrite(new ConstantScoreQuery(bq)).createWeight( - searcher, scoreMode, score()); - } - - /** - * If the multi-segment term dictionary supports this segment/LeafReader, then return the - * InvertedIndex representing this segment. - * - * If the segment being queried right now is not in the multi-segment term dictionary (e.g. - * if it's not optimized yet), return null. - */ - @Nullable - private InvertedIndex getFieldIndexFromMultiTermDictionary(LeafReaderContext context) - throws IOException { - if (multiSegmentTermDictionary == null) { - return null; - } - - if (context.reader() instanceof EarlybirdIndexSegmentAtomicReader) { - EarlybirdIndexSegmentAtomicReader reader = - (EarlybirdIndexSegmentAtomicReader) context.reader(); - - EarlybirdIndexSegmentData segmentData = reader.getSegmentData(); - InvertedIndex fieldIndex = segmentData.getFieldIndex(field); - - if (multiSegmentTermDictionary.supportSegmentIndex(fieldIndex)) { - return fieldIndex; - } - } - - return null; - } - - private BooleanQuery addPrecomputedTermQueries( - InvertedIndex fieldIndex, - TermsEnum termsEnum) throws IOException { - - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - int numClauses = 0; - - List termRankPairs = termIdsPerSegment.get(fieldIndex); - if (termRankPairs != null) { - for (TermRankPair pair : termRankPairs) { - int termId = pair.getTermId(); - if (numClauses >= BooleanQuery.getMaxClauseCount()) { - BooleanQuery saved = bqBuilder.build(); - bqBuilder = new BooleanQuery.Builder(); - bqBuilder.add(saved, BooleanClause.Occur.SHOULD); - numClauses = 1; - } - - Query query; - if (pair.getRank() != -1) { - query = EarlybirdQueryHelper.maybeWrapWithHitAttributionCollector( - new SimpleTermQuery(termsEnum, termId), - pair.getRank(), - fieldInfo, - hitAttributeHelper); - } else { - query = new SimpleTermQuery(termsEnum, termId); - } - bqBuilder.add(EarlybirdQueryHelper.maybeWrapWithTimeout(query, queryTimeout), - BooleanClause.Occur.SHOULD); - ++numClauses; - } - } - return bqBuilder.build(); - } - - private BooleanQuery addTermQueries(TermsEnum termsEnum) throws IOException { - final BytesRef termRef = useOrderPreservingEncoding - ? 
SortableLongTermAttributeImpl.newBytesRef() - : LongTermAttributeImpl.newBytesRef(); - - BooleanQuery.Builder bqBuilder = new BooleanQuery.Builder(); - int numClauses = 0; - - for (int idx = 0; idx < ids.size(); ++idx) { - long longTerm = ids.get(idx); - if (useOrderPreservingEncoding) { - SortableLongTermAttributeImpl.copyLongToBytesRef(termRef, longTerm); - } else { - LongTermAttributeImpl.copyLongToBytesRef(termRef, longTerm); - } - - if (termsEnum.seekExact(termRef)) { - if (numClauses >= BooleanQuery.getMaxClauseCount()) { - BooleanQuery saved = bqBuilder.build(); - bqBuilder = new BooleanQuery.Builder(); - bqBuilder.add(saved, BooleanClause.Occur.SHOULD); - numClauses = 1; - } - - if (ranks.size() > 0) { - bqBuilder.add(EarlybirdQueryHelper.maybeWrapWithHitAttributionCollector( - new SimpleTermQuery(termsEnum, termsEnum.ord()), - ranks.get(idx), - fieldInfo, - hitAttributeHelper), - BooleanClause.Occur.SHOULD); - } else { - bqBuilder.add(new SimpleTermQuery(termsEnum, termsEnum.ord()), - BooleanClause.Occur.SHOULD); - } - ++numClauses; - } - } - - return bqBuilder.build(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/queries/UserScrubGeoFilter.java b/src/java/com/twitter/search/earlybird/search/queries/UserScrubGeoFilter.java deleted file mode 100644 index 6f66ff54d..000000000 --- a/src/java/com/twitter/search/earlybird/search/queries/UserScrubGeoFilter.java +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.search.earlybird.search.queries; - -import java.io.IOException; -import java.util.Objects; - -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.query.FilteredQuery; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.common.userupdates.UserScrubGeoMap; -import com.twitter.search.earlybird.index.TweetIDMapper; - -/** - * Filter that can be used with searches over geo field postings lists in order to filter out tweets - * that have been geo scrubbed. Determines if a tweet has been geo scrubbed by comparing the - * tweet's id against the max scrubbed tweet id for that tweet's author, which is stored in the - * UserScrubGeoMap. - * - * See: go/realtime-geo-filtering - */ -public class UserScrubGeoFilter implements FilteredQuery.DocIdFilterFactory { - - private UserScrubGeoMap userScrubGeoMap; - - private final SearchRateCounter totalRequestsUsingFilterCounter = - SearchRateCounter.export("user_scrub_geo_filter_total_requests"); - - public static FilteredQuery.DocIdFilterFactory getDocIdFilterFactory( - UserScrubGeoMap userScrubGeoMap) { - return new UserScrubGeoFilter(userScrubGeoMap); - } - - public UserScrubGeoFilter(UserScrubGeoMap userScrubGeoMap) { - this.userScrubGeoMap = userScrubGeoMap; - totalRequestsUsingFilterCounter.increment(); - } - - @Override - public FilteredQuery.DocIdFilter getDocIdFilter(LeafReaderContext context) throws IOException { - // To determine if a given doc has been geo scrubbed we need two pieces of information about the - // doc: the associated tweet id and the user id of the tweet's author. We can get the tweet id - // from the TweetIDMapper for the segment we are currently searching, and we can get the user id - // of the tweet's author by looking up the doc id in the NumericDocValues for the - // FROM_USER_ID_CSF. 
- // - // With this information we can check the UserScrubGeoMap to find out if the tweet has been - // geo scrubbed and filter it out accordingly. - final EarlybirdIndexSegmentAtomicReader currTwitterReader = - (EarlybirdIndexSegmentAtomicReader) context.reader(); - final TweetIDMapper tweetIdMapper = - (TweetIDMapper) currTwitterReader.getSegmentData().getDocIDToTweetIDMapper(); - final NumericDocValues fromUserIdDocValues = currTwitterReader.getNumericDocValues( - EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName()); - return (docId) -> fromUserIdDocValues.advanceExact(docId) - && !userScrubGeoMap.isTweetGeoScrubbed( - tweetIdMapper.getTweetID(docId), fromUserIdDocValues.longValue()); - } - - @Override - public String toString() { - return "UserScrubGeoFilter"; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof UserScrubGeoMap)) { - return false; - } - - UserScrubGeoFilter filter = UserScrubGeoFilter.class.cast(obj); - // filters are considered equal as long as they are using the same UserScrubGeoMap - return Objects.equals(userScrubGeoMap, filter.userScrubGeoMap); - } - - @Override - public int hashCode() { - return userScrubGeoMap == null ? 0 : userScrubGeoMap.hashCode(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringData.java b/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringData.java deleted file mode 100644 index 9d0e85795..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringData.java +++ /dev/null @@ -1,422 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import java.util.Arrays; -import java.util.List; - -import com.google.common.collect.Lists; - -import com.twitter.search.common.constants.SearchCardType; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; - -public class LinearScoringData { - public static final float NO_BOOST_VALUE = 1.0f; - - // A signal value so we can tell if something is unset, also used in explanation. - public static final int UNSET_SIGNAL_VALUE = -999; - - //This is somewhat arbitrary, and is here so that we have some limit on - //how many offline experimental features we support per query - public static final int MAX_OFFLINE_EXPERIMENTAL_FIELDS = 5; - - public enum SkipReason { - NOT_SKIPPED, - ANTIGAMING, - LOW_REPUTATION, - LOW_TEXT_SCORE, - LOW_RETWEET_COUNT, - LOW_FAV_COUNT, - SOCIAL_FILTER, - LOW_FINAL_SCORE - } - - // When you add fields here, make sure you also update the clear() function. - public double luceneScore; - public double textScore; - //I am not sure why this has to be double... 
- public double tokenAt140DividedByNumTokensBucket; - public double userRep; - public double parusScore; - public final double[] offlineExpFeatureValues = new double[MAX_OFFLINE_EXPERIMENTAL_FIELDS]; - - // v1 engagement counters - public double retweetCountPostLog2; - public double favCountPostLog2; - public double replyCountPostLog2; - public double embedsImpressionCount; - public double embedsUrlCount; - public double videoViewCount; - - // v2 engagement counters (that have a v1 counter part) - public double retweetCountV2; - public double favCountV2; - public double replyCountV2; - public double embedsImpressionCountV2; - public double embedsUrlCountV2; - public double videoViewCountV2; - // pure v2 engagement counters, they started v2 only - public double quotedCount; - public double weightedRetweetCount; - public double weightedReplyCount; - public double weightedFavCount; - public double weightedQuoteCount; - - // card related properties - public boolean hasCard; - public byte cardType; - - public boolean hasUrl; - public boolean isReply; - public boolean isRetweet; - public boolean isOffensive; - public boolean hasTrend; - public boolean isFromVerifiedAccount; - public boolean isFromBlueVerifiedAccount; - public boolean isUserSpam; - public boolean isUserNSFW; - public boolean isUserBot; - public boolean isUserAntiSocial; - public boolean hasVisibleLink; - - public double luceneContrib; - public double reputationContrib; - public double textScoreContrib; - public double favContrib; - public double replyContrib; - public double multipleReplyContrib; - public double retweetContrib; - public double parusContrib; - public final double[] offlineExpFeatureContributions = - new double[MAX_OFFLINE_EXPERIMENTAL_FIELDS]; - public double embedsImpressionContrib; - public double embedsUrlContrib; - public double videoViewContrib; - public double quotedContrib; - - public double hasUrlContrib; - public double isReplyContrib; - public double isFollowRetweetContrib; - public double isTrustedRetweetContrib; - - // Value passed in the request (ThriftRankingParams.querySpecificScoreAdjustments) - public double querySpecificScore; - - // Value passed in the request (ThriftRankingParams.authorSpecificScoreAdjustments) - public double authorSpecificScore; - - public double normalizedLuceneScore; - - public int tweetLangId; - public double uiLangMult; - public double userLangMult; - public boolean hasDifferentLang; - public boolean hasEnglishTweetAndDifferentUILang; - public boolean hasEnglishUIAndDifferentTweetLang; - - public int tweetAgeInSeconds; - public double ageDecayMult; - - // Intermediate scores - public double scoreBeforeBoost; - public double scoreAfterBoost; - public double scoreFinal; - public double scoreReturned; - - public SkipReason skipReason; - - public boolean isTrusted; - public boolean isFollow; - public boolean spamUserDampApplied; - public boolean nsfwUserDampApplied; - public boolean botUserDampApplied; - public boolean trustedCircleBoostApplied; - public boolean directFollowBoostApplied; - public boolean outOfNetworkReplyPenaltyApplied; - public boolean hasMultipleHashtagsOrTrends; - - public boolean tweetHasTrendsBoostApplied; - public boolean tweetFromVerifiedAccountBoostApplied; - public boolean tweetFromBlueVerifiedAccountBoostApplied; - public boolean hasCardBoostApplied; - public boolean cardDomainMatchBoostApplied; - public boolean cardAuthorMatchBoostApplied; - public boolean cardTitleMatchBoostApplied; - public boolean cardDescriptionMatchBoostApplied; - - public 
List hitFields; - public boolean hasNoTextHitDemotionApplied; - public boolean hasUrlOnlyHitDemotionApplied; - public boolean hasNameOnlyHitDemotionApplied; - public boolean hasSeparateTextAndNameHitDemotionApplied; - public boolean hasSeparateTextAndUrlHitDemotionApplied; - - public long fromUserId; - // This is actually retweet status ID, or the ID of the original tweet being (natively) retweeted - public long sharedStatusId; - public long referenceAuthorId; // SEARCH-8564 - - public boolean isSelfTweet; - public boolean selfTweetBoostApplied; - public double selfTweetMult; - - public boolean hasImageUrl; - public boolean hasVideoUrl; - public boolean hasMedialUrlBoostApplied; - public boolean hasNewsUrl; - public boolean hasNewsUrlBoostApplied; - - public boolean hasConsumerVideo; - public boolean hasProVideo; - public boolean hasVine; - public boolean hasPeriscope; - public boolean hasNativeImage; - public boolean isNullcast; - public boolean hasQuote; - - public boolean isSensitiveContent; - public boolean hasMultipleMediaFlag; - public boolean profileIsEggFlag; - public boolean isUserNewFlag; - - public int numMentions; - public int numHashtags; - public int linkLanguage; - public int prevUserTweetEngagement; - - public boolean isComposerSourceCamera; - - // health model scores by HML - public double toxicityScore; // go/toxicity - public double pBlockScore; // go/pblock - public double pSpammyTweetScore; // go/pspammytweet - public double pReportedTweetScore; // go/preportedtweet - public double spammyTweetContentScore; // go/spammy-tweet-content - public double experimentalHealthModelScore1; - public double experimentalHealthModelScore2; - public double experimentalHealthModelScore3; - public double experimentalHealthModelScore4; - - public LinearScoringData() { - hitFields = Lists.newArrayList(); - clear(); - } - - // the following three counters were added later and they got denormalized in standard way, - // you can choose to apply scalding (for legacy LinearScoringFunction) or - // not apply (for returning in metadata and display in debug). - public double getEmbedsImpressionCount(boolean scaleForScoring) { - return scaleForScoring ? logWith0(embedsImpressionCount) : embedsImpressionCount; - } - public double getEmbedsUrlCount(boolean scaleForScoring) { - return scaleForScoring ? logWith0(embedsUrlCount) : embedsUrlCount; - } - public double getVideoViewCount(boolean scaleForScoring) { - return scaleForScoring ? logWith0(videoViewCount) : videoViewCount; - } - private static double logWith0(double value) { - return value > 0 ? Math.log(value) : 0.0; - } - - /** - * Returns a string description of all data stored in this instance. - */ - public String getPropertyExplanation() { - StringBuilder sb = new StringBuilder(); - sb.append(hasCard ? "CARD " + SearchCardType.cardTypeFromByteValue(cardType) : ""); - sb.append(hasUrl ? "URL " : ""); - sb.append(isReply ? "REPLY " : ""); - sb.append(isRetweet ? "RETWEET " : ""); - sb.append(isOffensive ? "OFFENSIVE " : ""); - sb.append(hasTrend ? "TREND " : ""); - sb.append(hasMultipleHashtagsOrTrends ? "HASHTAG/TREND+ " : ""); - sb.append(isFromVerifiedAccount ? "VERIFIED " : ""); - sb.append(isFromBlueVerifiedAccount ? "BLUE_VERIFIED " : ""); - sb.append(isUserSpam ? "SPAM " : ""); - sb.append(isUserNSFW ? "NSFW " : ""); - sb.append(isUserBot ? "BOT " : ""); - sb.append(isUserAntiSocial ? "ANTISOCIAL " : ""); - sb.append(isTrusted ? "TRUSTED " : ""); - sb.append(isFollow ? "FOLLOW " : ""); - sb.append(isSelfTweet ? 
"SELF " : ""); - sb.append(hasImageUrl ? "IMAGE " : ""); - sb.append(hasVideoUrl ? "VIDEO " : ""); - sb.append(hasNewsUrl ? "NEWS " : ""); - sb.append(isNullcast ? "NULLCAST" : ""); - sb.append(hasQuote ? "QUOTE" : ""); - sb.append(isComposerSourceCamera ? "Composer Source: CAMERA" : ""); - sb.append(favCountPostLog2 > 0 ? "Faves:" + favCountPostLog2 + " " : ""); - sb.append(retweetCountPostLog2 > 0 ? "Retweets:" + retweetCountPostLog2 + " " : ""); - sb.append(replyCountPostLog2 > 0 ? "Replies:" + replyCountPostLog2 + " " : ""); - sb.append(getEmbedsImpressionCount(false) > 0 - ? "Embedded Imps:" + getEmbedsImpressionCount(false) + " " : ""); - sb.append(getEmbedsUrlCount(false) > 0 - ? "Embedded Urls:" + getEmbedsUrlCount(false) + " " : ""); - sb.append(getVideoViewCount(false) > 0 - ? "Video views:" + getVideoViewCount(false) + " " : ""); - sb.append(weightedRetweetCount > 0 ? "Weighted Retweets:" - + ((int) weightedRetweetCount) + " " : ""); - sb.append(weightedReplyCount > 0 - ? "Weighted Replies:" + ((int) weightedReplyCount) + " " : ""); - sb.append(weightedFavCount > 0 - ? "Weighted Faves:" + ((int) weightedFavCount) + " " : ""); - sb.append(weightedQuoteCount > 0 - ? "Weighted Quotes:" + ((int) weightedQuoteCount) + " " : ""); - return sb.toString(); - } - - /** - * Resets all data stored in this instance. - */ - public void clear() { - luceneScore = UNSET_SIGNAL_VALUE; - textScore = UNSET_SIGNAL_VALUE; - tokenAt140DividedByNumTokensBucket = UNSET_SIGNAL_VALUE; - userRep = UNSET_SIGNAL_VALUE; - retweetCountPostLog2 = UNSET_SIGNAL_VALUE; - favCountPostLog2 = UNSET_SIGNAL_VALUE; - replyCountPostLog2 = UNSET_SIGNAL_VALUE; - parusScore = UNSET_SIGNAL_VALUE; - Arrays.fill(offlineExpFeatureValues, 0); - embedsImpressionCount = UNSET_SIGNAL_VALUE; - embedsUrlCount = UNSET_SIGNAL_VALUE; - videoViewCount = UNSET_SIGNAL_VALUE; - // v2 engagement, these each have a v1 counterpart - retweetCountV2 = UNSET_SIGNAL_VALUE; - favCountV2 = UNSET_SIGNAL_VALUE; - replyCountV2 = UNSET_SIGNAL_VALUE; - embedsImpressionCountV2 = UNSET_SIGNAL_VALUE; - embedsUrlCountV2 = UNSET_SIGNAL_VALUE; - videoViewCountV2 = UNSET_SIGNAL_VALUE; - // new engagement counters, they only have one version with the v2 normalizer - quotedCount = UNSET_SIGNAL_VALUE; - weightedRetweetCount = UNSET_SIGNAL_VALUE; - weightedReplyCount = UNSET_SIGNAL_VALUE; - weightedFavCount = UNSET_SIGNAL_VALUE; - weightedQuoteCount = UNSET_SIGNAL_VALUE; - - hasUrl = false; - isReply = false; - isRetweet = false; - isOffensive = false; - hasTrend = false; - isFromVerifiedAccount = false; - isFromBlueVerifiedAccount = false; - isUserSpam = false; - isUserNSFW = false; - isUserBot = false; - isUserAntiSocial = false; - hasVisibleLink = false; - isNullcast = false; - - luceneContrib = UNSET_SIGNAL_VALUE; - reputationContrib = UNSET_SIGNAL_VALUE; - textScoreContrib = UNSET_SIGNAL_VALUE; - replyContrib = UNSET_SIGNAL_VALUE; - multipleReplyContrib = UNSET_SIGNAL_VALUE; - retweetContrib = UNSET_SIGNAL_VALUE; - favContrib = UNSET_SIGNAL_VALUE; - parusContrib = UNSET_SIGNAL_VALUE; - Arrays.fill(offlineExpFeatureContributions, 0); - embedsImpressionContrib = UNSET_SIGNAL_VALUE; - embedsUrlContrib = UNSET_SIGNAL_VALUE; - videoViewContrib = UNSET_SIGNAL_VALUE; - hasUrlContrib = UNSET_SIGNAL_VALUE; - isReplyContrib = UNSET_SIGNAL_VALUE; - - querySpecificScore = UNSET_SIGNAL_VALUE; - authorSpecificScore = UNSET_SIGNAL_VALUE; - - normalizedLuceneScore = NO_BOOST_VALUE; - - tweetLangId = ThriftLanguage.UNKNOWN.getValue(); - uiLangMult = NO_BOOST_VALUE; - 
userLangMult = NO_BOOST_VALUE; - hasDifferentLang = false; - hasEnglishTweetAndDifferentUILang = false; - hasEnglishUIAndDifferentTweetLang = false; - - tweetAgeInSeconds = 0; - ageDecayMult = NO_BOOST_VALUE; - - // Intermediate scores - scoreBeforeBoost = UNSET_SIGNAL_VALUE; - scoreAfterBoost = UNSET_SIGNAL_VALUE; - scoreFinal = UNSET_SIGNAL_VALUE; - scoreReturned = UNSET_SIGNAL_VALUE; - - skipReason = SkipReason.NOT_SKIPPED; - - isTrusted = false; // Set later - isFollow = false; // Set later - trustedCircleBoostApplied = false; - directFollowBoostApplied = false; - outOfNetworkReplyPenaltyApplied = false; - hasMultipleHashtagsOrTrends = false; - spamUserDampApplied = false; - nsfwUserDampApplied = false; - botUserDampApplied = false; - - tweetHasTrendsBoostApplied = false; - tweetFromVerifiedAccountBoostApplied = false; - tweetFromBlueVerifiedAccountBoostApplied = false; - - fromUserId = UNSET_SIGNAL_VALUE; - sharedStatusId = UNSET_SIGNAL_VALUE; - referenceAuthorId = UNSET_SIGNAL_VALUE; - - isSelfTweet = false; - selfTweetBoostApplied = false; - selfTweetMult = NO_BOOST_VALUE; - - trustedCircleBoostApplied = false; - directFollowBoostApplied = false; - - hasImageUrl = false; - hasVideoUrl = false; - hasMedialUrlBoostApplied = false; - hasNewsUrl = false; - hasNewsUrlBoostApplied = false; - - hasCard = false; - cardType = SearchCardType.UNKNOWN.getByteValue(); - hasCardBoostApplied = false; - cardDomainMatchBoostApplied = false; - cardAuthorMatchBoostApplied = false; - cardTitleMatchBoostApplied = false; - cardDescriptionMatchBoostApplied = false; - - hitFields.clear(); - hasNoTextHitDemotionApplied = false; - hasUrlOnlyHitDemotionApplied = false; - hasNameOnlyHitDemotionApplied = false; - hasSeparateTextAndNameHitDemotionApplied = false; - hasSeparateTextAndUrlHitDemotionApplied = false; - - hasConsumerVideo = false; - hasProVideo = false; - hasVine = false; - hasPeriscope = false; - hasNativeImage = false; - - isSensitiveContent = false; - hasMultipleMediaFlag = false; - profileIsEggFlag = false; - numMentions = 0; - numHashtags = 0; - isUserNewFlag = false; - linkLanguage = 0; - prevUserTweetEngagement = 0; - - isComposerSourceCamera = false; - - // health model scores by HML - toxicityScore = UNSET_SIGNAL_VALUE; - pBlockScore = UNSET_SIGNAL_VALUE; - pSpammyTweetScore = UNSET_SIGNAL_VALUE; - pReportedTweetScore = UNSET_SIGNAL_VALUE; - spammyTweetContentScore = UNSET_SIGNAL_VALUE; - experimentalHealthModelScore1 = UNSET_SIGNAL_VALUE; - experimentalHealthModelScore2 = UNSET_SIGNAL_VALUE; - experimentalHealthModelScore3 = UNSET_SIGNAL_VALUE; - experimentalHealthModelScore4 = UNSET_SIGNAL_VALUE; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringParams.java b/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringParams.java deleted file mode 100644 index 9c049068c..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/LinearScoringParams.java +++ /dev/null @@ -1,304 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import java.util.Arrays; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.search.common.constants.SearchCardType; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.ranking.thriftjava.ThriftAgeDecayRankingParams; -import com.twitter.search.common.ranking.thriftjava.ThriftCardRankingParams; -import 
com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSocialFilterType; - -/* - * The class for all query specific parameters, including the parameters from the relevanceOptions and - * values that are extracted from the request itself. - */ -public class LinearScoringParams { - - public static final double DEFAULT_FEATURE_WEIGHT = 0; - public static final double DEFAULT_FEATURE_MIN_VAL = 0; - public static final double DEFAULT_NO_BOOST = 1.0; - @VisibleForTesting - static final SearchCounter NULL_USER_LANGS_KEY = - SearchCounter.export("linear_scoring_params_null_user_langs_key"); - - public final double luceneWeight; - public final double textScoreWeight; - public final double textScoreMinVal; - public final double retweetWeight; - public final double retweetMinVal; - public final double favWeight; - public final double favMinVal; - public final double replyWeight; - public final double multipleReplyWeight; - public final double multipleReplyMinVal; - public final double isReplyWeight; - public final double parusWeight; - public final double embedsImpressionWeight; - public final double embedsUrlWeight; - public final double videoViewWeight; - public final double quotedCountWeight; - - public final double[] rankingOfflineExpWeights = - new double[LinearScoringData.MAX_OFFLINE_EXPERIMENTAL_FIELDS]; - - public final boolean applyBoosts; - - // Storing ranking params for cards, avoid using maps for faster lookup - public final double[] hasCardBoosts = new double[SearchCardType.values().length]; - public final double[] cardDomainMatchBoosts = new double[SearchCardType.values().length]; - public final double[] cardAuthorMatchBoosts = new double[SearchCardType.values().length]; - public final double[] cardTitleMatchBoosts = new double[SearchCardType.values().length]; - public final double[] cardDescriptionMatchBoosts = new double[SearchCardType.values().length]; - - public final double urlWeight; - public final double reputationWeight; - public final double reputationMinVal; - public final double followRetweetWeight; - public final double trustedRetweetWeight; - - // Adjustments for specific tweets (tweetId -> score) - public final Map querySpecificScoreAdjustments; - - // Adjustments for tweets posted by specific authors (userId -> score) - public final Map authorSpecificScoreAdjustments; - - public final double offensiveDamping; - public final double spamUserDamping; - public final double nsfwUserDamping; - public final double botUserDamping; - public final double trustedCircleBoost; - public final double directFollowBoost; - public final double minScore; - - public final boolean applyFiltersAlways; - - public final boolean useLuceneScoreAsBoost; - public final double maxLuceneScoreBoost; - - public final double langEnglishTweetDemote; - public final double langEnglishUIDemote; - public final double langDefaultDemote; - public final boolean useUserLanguageInfo; - public final double unknownLanguageBoost; - - public final double outOfNetworkReplyPenalty; - - public final boolean useAgeDecay; - public final double ageDecayHalflife; - public final double ageDecayBase; - public final double ageDecaySlope; - - // hit attribute demotions - public final boolean enableHitDemotion; - public final double noTextHitDemotion; - public final double urlOnlyHitDemotion; - public final double nameOnlyHitDemotion; - public 
final double separateTextAndNameHitDemotion; - public final double separateTextAndUrlHitDemotion; - - // trends related params - public final double tweetHasTrendBoost; - public final double multipleHashtagsOrTrendsDamping; - - public final double tweetFromVerifiedAccountBoost; - - public final double tweetFromBlueVerifiedAccountBoost; - - public final ThriftSocialFilterType socialFilterType; - public final int uiLangId; - // Confidences of the understandability of different languages for this user. - public final double[] userLangs = new double[ThriftLanguage.values().length]; - - public final long searcherId; - public final double selfTweetBoost; - - public final double tweetHasMediaUrlBoost; - public final double tweetHasNewsUrlBoost; - - // whether we need meta-data for replies what the reply is to. - public final boolean getInReplyToStatusId; - - // Initialize from a ranking parameter - public LinearScoringParams(ThriftSearchQuery searchQuery, ThriftRankingParams params) { - // weights - luceneWeight = params.isSetLuceneScoreParams() - ? params.getLuceneScoreParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - textScoreWeight = params.isSetTextScoreParams() - ? params.getTextScoreParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - retweetWeight = params.isSetRetweetCountParams() - ? params.getRetweetCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - favWeight = params.isSetFavCountParams() - ? params.getFavCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - replyWeight = params.isSetReplyCountParams() - ? params.getReplyCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - multipleReplyWeight = params.isSetMultipleReplyCountParams() - ? params.getMultipleReplyCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - parusWeight = params.isSetParusScoreParams() - ? params.getParusScoreParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - for (int i = 0; i < LinearScoringData.MAX_OFFLINE_EXPERIMENTAL_FIELDS; i++) { - Byte featureTypeByte = (byte) i; - // default weight is 0, thus contribution for unset feature value will be 0. - rankingOfflineExpWeights[i] = params.getOfflineExperimentalFeatureRankingParamsSize() > 0 - && params.getOfflineExperimentalFeatureRankingParams().containsKey(featureTypeByte) - ? params.getOfflineExperimentalFeatureRankingParams().get(featureTypeByte).getWeight() - : DEFAULT_FEATURE_WEIGHT; - } - embedsImpressionWeight = params.isSetEmbedsImpressionCountParams() - ? params.getEmbedsImpressionCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - embedsUrlWeight = params.isSetEmbedsUrlCountParams() - ? params.getEmbedsUrlCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - videoViewWeight = params.isSetVideoViewCountParams() - ? params.getVideoViewCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - quotedCountWeight = params.isSetQuotedCountParams() - ? 
params.getQuotedCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - - applyBoosts = params.isApplyBoosts(); - - // configure card values - Arrays.fill(hasCardBoosts, DEFAULT_NO_BOOST); - Arrays.fill(cardAuthorMatchBoosts, DEFAULT_NO_BOOST); - Arrays.fill(cardDomainMatchBoosts, DEFAULT_NO_BOOST); - Arrays.fill(cardTitleMatchBoosts, DEFAULT_NO_BOOST); - Arrays.fill(cardDescriptionMatchBoosts, DEFAULT_NO_BOOST); - if (params.isSetCardRankingParams()) { - for (SearchCardType cardType : SearchCardType.values()) { - byte cardTypeIndex = cardType.getByteValue(); - ThriftCardRankingParams rankingParams = params.getCardRankingParams().get(cardTypeIndex); - if (rankingParams != null) { - hasCardBoosts[cardTypeIndex] = rankingParams.getHasCardBoost(); - cardAuthorMatchBoosts[cardTypeIndex] = rankingParams.getAuthorMatchBoost(); - cardDomainMatchBoosts[cardTypeIndex] = rankingParams.getDomainMatchBoost(); - cardTitleMatchBoosts[cardTypeIndex] = rankingParams.getTitleMatchBoost(); - cardDescriptionMatchBoosts[cardTypeIndex] = rankingParams.getDescriptionMatchBoost(); - } - } - } - - urlWeight = params.isSetUrlParams() - ? params.getUrlParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - reputationWeight = params.isSetReputationParams() - ? params.getReputationParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - isReplyWeight = params.isSetIsReplyParams() - ? params.getIsReplyParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - followRetweetWeight = params.isSetDirectFollowRetweetCountParams() - ? params.getDirectFollowRetweetCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - trustedRetweetWeight = params.isSetTrustedCircleRetweetCountParams() - ? params.getTrustedCircleRetweetCountParams().getWeight() : DEFAULT_FEATURE_WEIGHT; - - querySpecificScoreAdjustments = params.getQuerySpecificScoreAdjustments(); - authorSpecificScoreAdjustments = params.getAuthorSpecificScoreAdjustments(); - - // min/max filters - textScoreMinVal = params.isSetTextScoreParams() - ? params.getTextScoreParams().getMin() : DEFAULT_FEATURE_MIN_VAL; - reputationMinVal = params.isSetReputationParams() - ? params.getReputationParams().getMin() : DEFAULT_FEATURE_MIN_VAL; - multipleReplyMinVal = params.isSetMultipleReplyCountParams() - ? params.getMultipleReplyCountParams().getMin() : DEFAULT_FEATURE_MIN_VAL; - retweetMinVal = params.isSetRetweetCountParams() && params.getRetweetCountParams().isSetMin() - ? params.getRetweetCountParams().getMin() : DEFAULT_FEATURE_MIN_VAL; - favMinVal = params.isSetFavCountParams() && params.getFavCountParams().isSetMin() - ? params.getFavCountParams().getMin() : DEFAULT_FEATURE_MIN_VAL; - - // boosts - spamUserDamping = params.isSetSpamUserBoost() ? params.getSpamUserBoost() : 1.0; - nsfwUserDamping = params.isSetNsfwUserBoost() ? params.getNsfwUserBoost() : 1.0; - botUserDamping = params.isSetBotUserBoost() ? 
params.getBotUserBoost() : 1.0; - offensiveDamping = params.getOffensiveBoost(); - trustedCircleBoost = params.getInTrustedCircleBoost(); - directFollowBoost = params.getInDirectFollowBoost(); - - // language boosts - langEnglishTweetDemote = params.getLangEnglishTweetBoost(); - langEnglishUIDemote = params.getLangEnglishUIBoost(); - langDefaultDemote = params.getLangDefaultBoost(); - useUserLanguageInfo = params.isUseUserLanguageInfo(); - unknownLanguageBoost = params.getUnknownLanguageBoost(); - - // hit demotions - enableHitDemotion = params.isEnableHitDemotion(); - noTextHitDemotion = params.getNoTextHitDemotion(); - urlOnlyHitDemotion = params.getUrlOnlyHitDemotion(); - nameOnlyHitDemotion = params.getNameOnlyHitDemotion(); - separateTextAndNameHitDemotion = params.getSeparateTextAndNameHitDemotion(); - separateTextAndUrlHitDemotion = params.getSeparateTextAndUrlHitDemotion(); - - outOfNetworkReplyPenalty = params.getOutOfNetworkReplyPenalty(); - - if (params.isSetAgeDecayParams()) { - // new age decay settings - ThriftAgeDecayRankingParams ageDecayParams = params.getAgeDecayParams(); - ageDecaySlope = ageDecayParams.getSlope(); - ageDecayHalflife = ageDecayParams.getHalflife(); - ageDecayBase = ageDecayParams.getBase(); - useAgeDecay = true; - } else if (params.isSetDeprecatedAgeDecayBase() - && params.isSetDeprecatedAgeDecayHalflife() - && params.isSetDeprecatedAgeDecaySlope()) { - ageDecaySlope = params.getDeprecatedAgeDecaySlope(); - ageDecayHalflife = params.getDeprecatedAgeDecayHalflife(); - ageDecayBase = params.getDeprecatedAgeDecayBase(); - useAgeDecay = true; - } else { - ageDecaySlope = 0.0; - ageDecayHalflife = 0.0; - ageDecayBase = 0.0; - useAgeDecay = false; - } - - // trends - tweetHasTrendBoost = params.getTweetHasTrendBoost(); - multipleHashtagsOrTrendsDamping = params.getMultipleHashtagsOrTrendsBoost(); - - // verified accounts - tweetFromVerifiedAccountBoost = params.getTweetFromVerifiedAccountBoost(); - tweetFromBlueVerifiedAccountBoost = params.getTweetFromBlueVerifiedAccountBoost(); - - // score filter - minScore = params.getMinScore(); - - applyFiltersAlways = params.isApplyFiltersAlways(); - - useLuceneScoreAsBoost = params.isUseLuceneScoreAsBoost(); - maxLuceneScoreBoost = params.getMaxLuceneScoreBoost(); - - searcherId = searchQuery.isSetSearcherId() ? searchQuery.getSearcherId() : -1; - selfTweetBoost = params.getSelfTweetBoost(); - - socialFilterType = searchQuery.getSocialFilterType(); - - // the UI language and the confidences of the languages user can understand. - if (!searchQuery.isSetUiLang() || searchQuery.getUiLang().isEmpty()) { - uiLangId = ThriftLanguage.UNKNOWN.getValue(); - } else { - uiLangId = ThriftLanguageUtil.getThriftLanguageOf(searchQuery.getUiLang()).getValue(); - } - if (searchQuery.getUserLangsSize() > 0) { - for (Map.Entry lang : searchQuery.getUserLangs().entrySet()) { - ThriftLanguage thriftLanguage = lang.getKey(); - // SEARCH-13441 - if (thriftLanguage != null) { - userLangs[thriftLanguage.getValue()] = lang.getValue(); - } else { - NULL_USER_LANGS_KEY.increment(); - } - } - } - - // For now, we will use the same boost for both image, and video. 
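
Nearly every weight in this constructor is read with the same guard: check the Thrift struct's `isSet` accessor and fall back to `DEFAULT_FEATURE_WEIGHT` (0) or a no-op boost of 1.0 when the client did not set the field, so unset features contribute nothing to the linear score and unset boosts leave it unchanged. A reduced sketch of that defaulting idiom, using a hand-written stub in place of the generated Thrift struct (names below are illustrative only):

```java
// Sketch: optional-field defaulting when mapping request params to scoring weights.
// RankingParamsStub stands in for the generated Thrift ThriftRankingParams; only the pattern matters.
final class RankingParamsStub {
  Double favCountWeight;            // null means the client did not set it
  Double spamUserBoost;

  boolean isSetFavCountWeight() { return favCountWeight != null; }
  boolean isSetSpamUserBoost()  { return spamUserBoost != null; }
}

final class ScoringParamsSketch {
  static final double DEFAULT_FEATURE_WEIGHT = 0.0;  // unset feature contributes nothing
  static final double DEFAULT_NO_BOOST = 1.0;        // unset boost is a multiplicative no-op

  final double favWeight;
  final double spamUserDamping;

  ScoringParamsSketch(RankingParamsStub params) {
    favWeight = params.isSetFavCountWeight() ? params.favCountWeight : DEFAULT_FEATURE_WEIGHT;
    spamUserDamping = params.isSetSpamUserBoost() ? params.spamUserBoost : DEFAULT_NO_BOOST;
  }
}
```
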
- tweetHasMediaUrlBoost = params.getTweetHasImageUrlBoost(); - tweetHasNewsUrlBoost = params.getTweetHasNewsUrlBoost(); - - getInReplyToStatusId = - searchQuery.isSetResultMetadataOptions() - && searchQuery.getResultMetadataOptions().isSetGetInReplyToStatusId() - && searchQuery.getResultMetadataOptions().isGetInReplyToStatusId(); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/MinFeatureValueFilter.java b/src/java/com/twitter/search/earlybird/search/relevance/MinFeatureValueFilter.java deleted file mode 100644 index c3c3e3861..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/MinFeatureValueFilter.java +++ /dev/null @@ -1,163 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import java.io.IOException; -import java.util.Objects; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.encoding.features.ByteNormalizer; -import com.twitter.search.common.encoding.features.ClampByteNormalizer; -import com.twitter.search.common.encoding.features.SingleBytePositiveFloatNormalizer; -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.query.FilteredQuery; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.util.RangeFilterDISI; - -public final class MinFeatureValueFilter extends Query implements FilteredQuery.DocIdFilterFactory { - private final String featureName; - private final ByteNormalizer normalizer; - private final double minValue; - - /** - * Creates a query that filters out all hits that have a value smaller than the given threshold - * for the given feature. - * - * @param featureName The feature. - * @param minValue The threshold for the feature values. - * @return A query that filters out all hits that have a value smaller than the given threshold - * for the given feature. - */ - public static Query getMinFeatureValueFilter(String featureName, double minValue) { - return new BooleanQuery.Builder() - .add(new MinFeatureValueFilter(featureName, minValue), BooleanClause.Occur.FILTER) - .build(); - } - - public static FilteredQuery.DocIdFilterFactory getDocIdFilterFactory( - String featureName, double minValue) { - return new MinFeatureValueFilter(featureName, minValue); - } - - /** - * Returns the normalizer that should be used to normalize the values for the given feature. - * - * @param featureName The feature. - * @return The normalizer that should be used to normalize the values for the given feature. 
- */ - @VisibleForTesting - public static ByteNormalizer getMinFeatureValueNormalizer(String featureName) { - if (featureName.equals(EarlybirdFieldConstant.USER_REPUTATION.getFieldName())) { - return new ClampByteNormalizer(0, 100); - } - - if (featureName.equals(EarlybirdFieldConstant.FAVORITE_COUNT.getFieldName()) - || featureName.equals(EarlybirdFieldConstant.PARUS_SCORE.getFieldName()) - || featureName.equals(EarlybirdFieldConstant.REPLY_COUNT.getFieldName()) - || featureName.equals(EarlybirdFieldConstant.RETWEET_COUNT.getFieldName())) { - return new SingleBytePositiveFloatNormalizer(); - } - - throw new IllegalArgumentException("Unknown normalization method for field " + featureName); - } - - @Override - public int hashCode() { - // Probably doesn't make sense to include the schemaSnapshot and normalizer here. - return (int) ((featureName == null ? 0 : featureName.hashCode() * 7) + minValue); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof MinFeatureValueFilter)) { - return false; - } - - // Probably doesn't make sense to include the schemaSnapshot and normalizer here. - MinFeatureValueFilter filter = MinFeatureValueFilter.class.cast(obj); - return Objects.equals(featureName, filter.featureName) && (minValue == filter.minValue); - } - - @Override - public String toString(String field) { - return String.format("MinFeatureValueFilter(%s, %f)", featureName, minValue); - } - - private MinFeatureValueFilter(String featureName, double minValue) { - this.featureName = featureName; - this.normalizer = getMinFeatureValueNormalizer(featureName); - this.minValue = normalizer.normalize(minValue); - } - - @Override - public FilteredQuery.DocIdFilter getDocIdFilter(LeafReaderContext context) throws IOException { - final NumericDocValues featureDocValues = context.reader().getNumericDocValues(featureName); - return (docId) -> featureDocValues.advanceExact(docId) - && ((byte) featureDocValues.longValue() >= minValue); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - return new MinFeatureValueDocIdSetIterator( - context.reader(), featureName, minValue); - } - }; - } - - private static final class MinFeatureValueDocIdSetIterator extends RangeFilterDISI { - private final NumericDocValues featureDocValues; - private final double minValue; - - MinFeatureValueDocIdSetIterator(LeafReader indexReader, - String featureName, - double minValue) throws IOException { - super(indexReader); - this.featureDocValues = indexReader.getNumericDocValues(featureName); - this.minValue = minValue; - } - - @Override - public boolean shouldReturnDoc() throws IOException { - // We need this explicit casting to byte, because of how we encode and decode features in our - // encoded_tweet_features field. If a feature is an int (uses all 32 bits of the int), then - // encoding the feature and then decoding it preserves its original value. However, if the - // feature does not use the entire int (and especially if it uses bits somewhere in the middle - // of the int), then the feature value is assumed to be unsigned when it goes through this - // process of encoding and decoding. 
So a user rep of - // RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL (-128) will be correctly encoded as the - // binary value 10000000, but will be treated as an unsigned value when decoded, and therefore - // the decoded value will be 128. - // - // In retrospect, this seems like a really poor design decision. It seems like it would be - // better if all feature values were considered to be signed, even if most features can never - // have negative values. Unfortunately, making this change is not easy, because some features - // store normalized values, so we would also need to change the range of allowed values - // produced by those normalizers, as well as all code that depends on those values. - // - // So for now, just cast this value to a byte, to get the proper negative value. - return featureDocValues.advanceExact(docID()) - && ((byte) featureDocValues.longValue() >= minValue); - } - } - - public double getMinValue() { - return minValue; - } - - public ByteNormalizer getNormalizer() { - return normalizer; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceHit.java b/src/java/com/twitter/search/earlybird/search/relevance/RelevanceHit.java deleted file mode 100644 index abf312d9f..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceHit.java +++ /dev/null @@ -1,104 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import java.util.Comparator; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import com.twitter.common_internal.collections.RandomAccessPriorityQueue; -import com.twitter.search.common.relevance.features.TweetIntegerShingleSignature; -import com.twitter.search.earlybird.search.Hit; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -public class RelevanceHit extends Hit - implements RandomAccessPriorityQueue.SignatureProvider { - @Nullable - private TweetIntegerShingleSignature signature; - - public RelevanceHit() { - super(Long.MAX_VALUE, Long.MAX_VALUE); - } - - public RelevanceHit(long timeSliceID, long statusID, - TweetIntegerShingleSignature signature, - ThriftSearchResultMetadata metadata) { - super(timeSliceID, statusID); - update(timeSliceID, statusID, signature, metadata); - } - - /** - * Updates the data for this relevance hit. - * - * @param timeSliceID The timeslice ID of the segment that the segment came from. - * @param statusID The hit's tweet ID. - * @param tweetSignature The tweet signature generated for this hit. - * @param metadata The metadata associated with this hit. - */ - public void update(long timeSliceID, long statusID, TweetIntegerShingleSignature tweetSignature, - ThriftSearchResultMetadata metadata) { - this.statusID = statusID; - this.timeSliceID = timeSliceID; - this.metadata = Preconditions.checkNotNull(metadata); - this.signature = Preconditions.checkNotNull(tweetSignature); - } - - /** - * Returns the computed score for this hit. - */ - public float getScore() { - if (metadata != null) { - return (float) metadata.getScore(); - } else { - return ScoringFunction.SKIP_HIT; - } - } - - // We want the score as a double (and not cast to a float) for COMPARATOR_BY_SCORE and - // PQ_COMPARATOR_BY_SCORE so that the results returned from Earlybirds will be sorted based on the - // scores in the ThriftSearchResultMetadata objects (and will not lose precision by being cast to - // floats). 
Thus, the sorted order on Earlybirds and Earlybird Roots will be consistent. - private double getScoreDouble() { - if (metadata != null) { - return metadata.getScore(); - } else { - return (double) ScoringFunction.SKIP_HIT; - } - } - - @Override @Nullable - public TweetIntegerShingleSignature getSignature() { - return signature; - } - - @Override - public String toString() { - return "RelevanceHit[tweetID=" + statusID + ",timeSliceID=" + timeSliceID - + ",score=" + (metadata == null ? "null" : metadata.getScore()) - + ",signature=" + (signature == null ? "null" : signature) + "]"; - } - - public static final Comparator COMPARATOR_BY_SCORE = - (d1, d2) -> { - // if two docs have the same score, then the first one (most recent) wins - if (d1.getScore() == d2.getScore()) { - return Long.compare(d2.getStatusID(), d1.getStatusID()); - } - return Double.compare(d2.getScoreDouble(), d1.getScoreDouble()); - }; - - public static final Comparator PQ_COMPARATOR_BY_SCORE = - (d1, d2) -> { - // Reverse the order - return COMPARATOR_BY_SCORE.compare(d2, d1); - }; - - @Override - public void clear() { - timeSliceID = Long.MAX_VALUE; - statusID = Long.MAX_VALUE; - metadata = null; - signature = null; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchRequestInfo.java b/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchRequestInfo.java deleted file mode 100644 index 0b99ab0da..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchRequestInfo.java +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.search.Query; - -import com.twitter.search.common.search.TerminationTracker; -import com.twitter.search.earlybird.QualityFactor; -import com.twitter.search.earlybird.search.SearchRequestInfo; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; - -public class RelevanceSearchRequestInfo extends SearchRequestInfo { - private final ThriftSearchRelevanceOptions relevanceOptions; - - public RelevanceSearchRequestInfo( - ThriftSearchQuery searchQuery, Query query, - TerminationTracker terminationTracker, QualityFactor qualityFactor) { - super(addResultMetadataOptionsIfUnset(searchQuery), query, terminationTracker, qualityFactor); - this.relevanceOptions = searchQuery.getRelevanceOptions(); - } - - private static ThriftSearchQuery addResultMetadataOptionsIfUnset(ThriftSearchQuery searchQuery) { - if (!searchQuery.isSetResultMetadataOptions()) { - searchQuery.setResultMetadataOptions(new ThriftSearchResultMetadataOptions()); - } - return searchQuery; - } - - @Override - protected int calculateMaxHitsToProcess(ThriftSearchQuery thriftSearchQuery) { - ThriftSearchRelevanceOptions searchRelevanceOptions = thriftSearchQuery.getRelevanceOptions(); - - // Don't use the value from the ThriftSearchQuery object if one is provided in the - // relevance options - int requestedMaxHitsToProcess = searchRelevanceOptions.isSetMaxHitsToProcess() - ? 
searchRelevanceOptions.getMaxHitsToProcess() - : super.calculateMaxHitsToProcess(thriftSearchQuery); - - return qualityFactorMaxHitsToProcess(getNumResultsRequested(), requestedMaxHitsToProcess); - } - - public ThriftSearchRelevanceOptions getRelevanceOptions() { - return this.relevanceOptions; - } - - /** - * Reduces maxHitsToProcess based on quality factor. Never reduces it beyond - * numResults. - * @param numResults - * @param maxHitsToProcess - * @return Reduced maxHitsToProcess. - */ - public int qualityFactorMaxHitsToProcess(int numResults, int maxHitsToProcess) { - Preconditions.checkNotNull(qualityFactor); - - // Do not quality factor if there is no lower bound on maxHitsToProcess. - if (numResults > maxHitsToProcess) { - return maxHitsToProcess; - } - - double currentQualityFactor = qualityFactor.get(); - return Math.max(numResults, (int) (currentQualityFactor * maxHitsToProcess)); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchResults.java b/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchResults.java deleted file mode 100644 index 0dc169dc9..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/RelevanceSearchResults.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import com.twitter.search.earlybird.search.Hit; -import com.twitter.search.earlybird.search.SimpleSearchResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -public class RelevanceSearchResults extends SimpleSearchResults { - public final ThriftSearchResultMetadata[] resultMetadata; - private ThriftSearchResultsRelevanceStats relevanceStats = null; - private long scoringTimeNanos = 0; - - public RelevanceSearchResults(int size) { - super(size); - this.resultMetadata = new ThriftSearchResultMetadata[size]; - } - - public void setHit(Hit hit, int hitIndex) { - hits[hitIndex] = hit; - resultMetadata[hitIndex] = hit.getMetadata(); - } - - public void setRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - this.relevanceStats = relevanceStats; - } - public ThriftSearchResultsRelevanceStats getRelevanceStats() { - return relevanceStats; - } - - public void setScoringTimeNanos(long scoringTimeNanos) { - this.scoringTimeNanos = scoringTimeNanos; - } - - public long getScoringTimeNanos() { - return scoringTimeNanos; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/ScoreFilterQuery.java b/src/java/com/twitter/search/earlybird/search/relevance/ScoreFilterQuery.java deleted file mode 100644 index b3d6184d0..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/ScoreFilterQuery.java +++ /dev/null @@ -1,138 +0,0 @@ -package com.twitter.search.earlybird.search.relevance; - -import java.io.IOException; - -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.search.BooleanClause; -import org.apache.lucene.search.BooleanQuery; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; - -import com.twitter.search.common.query.DefaultFilterWeight; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import 
com.twitter.search.core.earlybird.index.util.RangeFilterDISI; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunctionProvider; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunctionProvider.NamedScoringFunctionProvider; - -/** - * This filter only accepts documents for which the provided - * {@link com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction} - * returns a score that's greater or equal to the passed-in minScore and smaller or equal - * to maxScore. - */ -public final class ScoreFilterQuery extends Query { - private static final float DEFAULT_LUCENE_SCORE = 1.0F; - - private final float minScore; - private final float maxScore; - private final NamedScoringFunctionProvider scoringFunctionProvider; - private final ImmutableSchemaInterface schema; - - /** - * Returns a score filter. - * - * @param schema The schema to use to extract the feature scores. - * @param scoringFunctionProvider The scoring function provider. - * @param minScore The minimum score threshold. - * @param maxScore The maximum score threshold. - * @return A score filter with the given configuration. - */ - public static Query getScoreFilterQuery( - ImmutableSchemaInterface schema, - NamedScoringFunctionProvider scoringFunctionProvider, - float minScore, - float maxScore) { - return new BooleanQuery.Builder() - .add(new ScoreFilterQuery(schema, scoringFunctionProvider, minScore, maxScore), - BooleanClause.Occur.FILTER) - .build(); - } - - private ScoreFilterQuery(ImmutableSchemaInterface schema, - NamedScoringFunctionProvider scoringFunctionProvider, - float minScore, - float maxScore) { - this.schema = schema; - this.scoringFunctionProvider = scoringFunctionProvider; - this.minScore = minScore; - this.maxScore = maxScore; - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - return new DefaultFilterWeight(this) { - @Override - protected DocIdSetIterator getDocIdSetIterator(LeafReaderContext context) throws IOException { - ScoringFunction scoringFunction = scoringFunctionProvider.getScoringFunction(); - scoringFunction.setNextReader((EarlybirdIndexSegmentAtomicReader) context.reader()); - return new ScoreFilterDocIdSetIterator( - context.reader(), scoringFunction, minScore, maxScore); - } - }; - } - - private static final class ScoreFilterDocIdSetIterator extends RangeFilterDISI { - private final ScoringFunction scoringFunction; - private final float minScore; - private final float maxScore; - - public ScoreFilterDocIdSetIterator(LeafReader indexReader, ScoringFunction scoringFunction, - float minScore, float maxScore) throws IOException { - super(indexReader); - this.scoringFunction = scoringFunction; - this.minScore = minScore; - this.maxScore = maxScore; - } - - @Override - protected boolean shouldReturnDoc() throws IOException { - float score = scoringFunction.score(docID(), DEFAULT_LUCENE_SCORE); - return score >= minScore && score <= maxScore; - } - } - - public float getMinScoreForTest() { - return minScore; - } - - public float getMaxScoreForTest() { - return maxScore; - } - - public ScoringFunctionProvider getScoringFunctionProviderForTest() { - return scoringFunctionProvider; - } - - @Override - public int hashCode() { - return (int) (minScore * 29 - + maxScore * 17 - + (scoringFunctionProvider == null ? 
0 : scoringFunctionProvider.hashCode())); - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof ScoreFilterQuery)) { - return false; - } - - ScoreFilterQuery filter = ScoreFilterQuery.class.cast(obj); - return (minScore == filter.minScore) - && (maxScore == filter.maxScore) - && (scoringFunctionProvider == null - ? filter.scoringFunctionProvider == null - : scoringFunctionProvider.equals(filter.scoringFunctionProvider)); - } - - @Override - public String toString(String field) { - return "SCORE_FILTER_QUERY[minScore=" + minScore + ",maxScore=" + maxScore + "]"; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/collectors/AbstractRelevanceCollector.java b/src/java/com/twitter/search/earlybird/search/relevance/collectors/AbstractRelevanceCollector.java deleted file mode 100644 index 5eea104c9..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/collectors/AbstractRelevanceCollector.java +++ /dev/null @@ -1,147 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.collectors; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.facets.LanguageHistogram; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.search.AbstractResultsCollector; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/** - * AbstractRelevanceCollector is a results collector that collects RelevanceHit results - * which include more detailed information than a normal Hit. - */ -public abstract class AbstractRelevanceCollector - extends AbstractResultsCollector { - protected final ScoringFunction scoringFunction; - private final ThriftSearchResultsRelevanceStats relevanceStats; - private final EarlybirdCluster cluster; - private final UserTable userTable; - - // Per-language result counts. - private final LanguageHistogram languageHistogram = new LanguageHistogram(); - - // Accumulated time spend on relevance scoring across all collected hits, including batch scoring. - private long scoringTimeNanos = 0; - - public AbstractRelevanceCollector( - ImmutableSchemaInterface schema, - RelevanceSearchRequestInfo searchRequestInfo, - ScoringFunction scoringFunction, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - Clock clock, - int requestDebugMode) { - super(schema, searchRequestInfo, clock, searcherStats, requestDebugMode); - this.scoringFunction = scoringFunction; - this.relevanceStats = new ThriftSearchResultsRelevanceStats(); - this.cluster = cluster; - this.userTable = userTable; - } - - /** - * Subclasses must implement this method to actually collect a scored relevance hit. 
- */ - protected abstract void doCollectWithScore(long tweetID, float score) throws IOException; - - @Override - public final void startSegment() throws IOException { - scoringFunction.setNextReader(currTwitterReader); - - ThriftSearchResultMetadataOptions options = - searchRequestInfo.getSearchQuery().getResultMetadataOptions(); - featuresRequested = options != null && options.isReturnSearchResultFeatures(); - } - - @Override - protected final void doCollect(long tweetID) throws IOException { - final long scoringStartNanos = getClock().nowNanos(); - float luceneSore = scorer.score(); - final float score = scoringFunction.score(curDocId, luceneSore); - final long scoringEndNanos = getClock().nowNanos(); - addToOverallScoringTimeNanos(scoringStartNanos, scoringEndNanos); - - scoringFunction.updateRelevanceStats(relevanceStats); - - updateHitCounts(tweetID); - - doCollectWithScore(tweetID, score); - } - - protected final void addToOverallScoringTimeNanos(long scoringStartNanos, long scoringEndNanos) { - scoringTimeNanos += scoringEndNanos - scoringStartNanos; - } - - protected final ThriftSearchResultMetadata collectMetadata() throws IOException { - ThriftSearchResultMetadataOptions options = - searchRequestInfo.getSearchQuery().getResultMetadataOptions(); - Preconditions.checkNotNull(options); - ThriftSearchResultMetadata metadata = - Preconditions.checkNotNull(scoringFunction.getResultMetadata(options)); - if (metadata.isSetLanguage()) { - languageHistogram.increment(metadata.getLanguage().getValue()); - } - - // Some additional metadata which is not provided by the scoring function, but - // by accessing the reader directly. - if (currTwitterReader != null) { - fillResultGeoLocation(metadata); - if (searchRequestInfo.isCollectConversationId()) { - long conversationId = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.CONVERSATION_ID_CSF); - if (conversationId != 0) { - ensureExtraMetadataIsSet(metadata); - metadata.getExtraMetadata().setConversationId(conversationId); - } - } - } - - // Check and collect hit attribution data, if it's available. - fillHitAttributionMetadata(metadata); - - long fromUserId = documentFeatures.getFeatureValue(EarlybirdFieldConstant.FROM_USER_ID_CSF); - if (searchRequestInfo.isGetFromUserId()) { - metadata.setFromUserId(fromUserId); - } - - collectExclusiveConversationAuthorId(metadata); - collectFacets(metadata); - collectFeatures(metadata); - collectIsProtected(metadata, cluster, userTable); - - return metadata; - } - - protected final ThriftSearchResultsRelevanceStats getRelevanceStats() { - return relevanceStats; - } - - public final LanguageHistogram getLanguageHistogram() { - return languageHistogram; - } - - @Override - protected final RelevanceSearchResults doGetResults() throws IOException { - final RelevanceSearchResults results = doGetRelevanceResults(); - results.setScoringTimeNanos(scoringTimeNanos); - return results; - } - - /** - * For subclasses to process and aggregate collected hits. 
- */ - protected abstract RelevanceSearchResults doGetRelevanceResults() throws IOException; -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/collectors/BatchRelevanceTopCollector.java b/src/java/com/twitter/search/earlybird/search/relevance/collectors/BatchRelevanceTopCollector.java deleted file mode 100644 index 3e4f6a711..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/collectors/BatchRelevanceTopCollector.java +++ /dev/null @@ -1,118 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.collectors; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.concurrent.TimeUnit; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.search.relevance.scoring.BatchHit; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -/** - * BatchRelevanceTopCollector is similar to the `RelevanceTopCollector` in what it outputs: - * Collects the top numResults by score, filtering out duplicates - * and results with scores equal to Flat.MIN_VALUE. - * The way that it achieves that is different though: it will score documents through the batch score - * function instead of scoring documents one by one. 
- */ -public class BatchRelevanceTopCollector extends RelevanceTopCollector { - protected final List<BatchHit> hits; - - public BatchRelevanceTopCollector( - ImmutableSchemaInterface schema, - RelevanceSearchRequestInfo searchRequestInfo, - ScoringFunction scoringFunction, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - Clock clock, - int requestDebugMode) { - super(schema, searchRequestInfo, scoringFunction, searcherStats, cluster, userTable, clock, - requestDebugMode); - this.hits = new ArrayList<>((int) getMaxHitsToProcess()); - } - - @Override - protected void doCollectWithScore(long tweetID, float score) throws IOException { - Pair<LinearScoringData, ThriftSearchResultFeatures> pair = - scoringFunction.collectFeatures(score); - ThriftSearchResultMetadata metadata = collectMetadata(); - hits.add(new BatchHit(pair.getFirst(), - pair.getSecond(), - metadata, - tweetID, - currTimeSliceID)); - } - - @Override - public EarlyTerminationState innerShouldCollectMore() { - if (hits.size() >= getMaxHitsToProcess()) { - return setEarlyTerminationState(EarlyTerminationState.TERMINATED_MAX_HITS_EXCEEDED); - } - return EarlyTerminationState.COLLECTING; - } - - @Override - protected RelevanceSearchResults doGetRelevanceResults() throws IOException { - final long scoringStartNanos = getClock().nowNanos(); - float[] scores = scoringFunction.batchScore(hits); - final long scoringEndNanos = getClock().nowNanos(); - addToOverallScoringTimeNanos(scoringStartNanos, scoringEndNanos); - exportBatchScoringTime(scoringEndNanos - scoringStartNanos); - - for (int i = 0; i < hits.size(); i++) { - BatchHit hit = hits.get(i); - ThriftSearchResultMetadata metadata = hit.getMetadata(); - - if (!metadata.isSetExtraMetadata()) { - metadata.setExtraMetadata(new ThriftSearchResultExtraMetadata()); - } - metadata.getExtraMetadata().setFeatures(hit.getFeatures()); - - - // Populate the ThriftSearchResultMetadata post batch scoring with information from the - // LinearScoringData, which now includes a score.
- scoringFunction.populateResultMetadataBasedOnScoringData( - searchRequestInfo.getSearchQuery().getResultMetadataOptions(), - metadata, - hit.getScoringData()); - - collectWithScoreInternal( - hit.getTweetID(), - hit.getTimeSliceID(), - scores[i], - metadata - ); - } - return getRelevanceResultsInternal(); - } - - private void exportBatchScoringTime(long scoringTimeNanos) { - ThriftSearchRelevanceOptions relevanceOptions = searchRequestInfo.getRelevanceOptions(); - if (relevanceOptions.isSetRankingParams() - && relevanceOptions.getRankingParams().isSetSelectedTensorflowModel()) { - String model = relevanceOptions.getRankingParams().getSelectedTensorflowModel(); - SearchTimerStats batchScoringPerModelTimer = SearchTimerStats.export( - String.format("batch_scoring_time_for_model_%s", model), - TimeUnit.NANOSECONDS, - false, - true); - batchScoringPerModelTimer.timerIncrement(scoringTimeNanos); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceAllCollector.java b/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceAllCollector.java deleted file mode 100644 index c7cd8d50f..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceAllCollector.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.collectors; - -import java.io.IOException; -import java.util.List; - -import com.google.common.collect.Lists; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.relevance.features.TweetIntegerShingleSignature; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.search.relevance.RelevanceHit; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -/** - * RelevanceAllCollector is a results collector that collects all results sorted by score, - * including signature-duplicates and results skipped by the scoring function. - */ -public class RelevanceAllCollector extends AbstractRelevanceCollector { - // All results. 
- protected final List<RelevanceHit> results; - - public RelevanceAllCollector( - ImmutableSchemaInterface schema, - RelevanceSearchRequestInfo searchRequestInfo, - ScoringFunction scoringFunction, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - Clock clock, - int requestDebugMode) { - super(schema, searchRequestInfo, scoringFunction, searcherStats, cluster, userTable, clock, - requestDebugMode); - this.results = Lists.newArrayList(); - } - - @Override - protected void doCollectWithScore(long tweetID, float score) throws IOException { - ThriftSearchResultMetadata metadata = collectMetadata(); - scoringFunction.populateResultMetadataBasedOnScoringData( - searchRequestInfo.getSearchQuery().getResultMetadataOptions(), - metadata, - scoringFunction.getScoringDataForCurrentDocument()); - results.add(new RelevanceHit( - currTimeSliceID, - tweetID, - TweetIntegerShingleSignature.deserialize(metadata.getSignature()), - metadata)); - } - - @Override - protected RelevanceSearchResults doGetRelevanceResults() { - final int numResults = results.size(); - RelevanceSearchResults searchResults = new RelevanceSearchResults(numResults); - - // Insert hits in decreasing order by score. - results.sort(RelevanceHit.COMPARATOR_BY_SCORE); - for (int i = 0; i < numResults; i++) { - searchResults.setHit(results.get(i), i); - } - searchResults.setRelevanceStats(getRelevanceStats()); - searchResults.setNumHits(numResults); - return searchResults; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceTopCollector.java b/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceTopCollector.java deleted file mode 100644 index ef921070c..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/collectors/RelevanceTopCollector.java +++ /dev/null @@ -1,167 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.collectors; - -import java.io.IOException; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.collections.RandomAccessPriorityQueue; -import com.twitter.search.common.relevance.features.TweetIntegerShingleSignature; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.search.relevance.RelevanceHit; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchRequestInfo; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.search.relevance.scoring.ScoringFunction; -import com.twitter.search.earlybird.stats.EarlybirdSearcherStats; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/** - * RelevanceTopCollector is a results collector that collects the top numResults by - * score, filtering out duplicates. - */ -public class RelevanceTopCollector extends AbstractRelevanceCollector { - // Search results are collected in a min-heap. - protected final RandomAccessPriorityQueue<RelevanceHit, TweetIntegerShingleSignature> minQueue; - - // Number of hits actually added to the min queue after dupe filtering and skipping. - // Less than or equal to numHitsProcessed.
- protected int numHitsCollected; - - // The 'top' of the min heap, or, the lowest scored document in the heap. - private RelevanceHit pqTop; - private float lowestScore = ScoringFunction.SKIP_HIT; - - private final boolean isFilterDupes; - - public RelevanceTopCollector( - ImmutableSchemaInterface schema, - RelevanceSearchRequestInfo searchRequestInfo, - ScoringFunction scoringFunction, - EarlybirdSearcherStats searcherStats, - EarlybirdCluster cluster, - UserTable userTable, - Clock clock, - int requestDebugMode) { - super(schema, searchRequestInfo, scoringFunction, searcherStats, cluster, userTable, clock, - requestDebugMode); - this.minQueue = new RandomAccessPriorityQueue( - searchRequestInfo.getNumResultsRequested(), RelevanceHit.PQ_COMPARATOR_BY_SCORE) { - @Override - protected RelevanceHit getSentinelObject() { - return new RelevanceHit(); // default relevance constructor would create a hit with the - // lowest score possible. - } - }; - this.pqTop = minQueue.top(); - this.isFilterDupes = getSearchRequestInfo().getRelevanceOptions().isFilterDups(); - } - - protected void collectWithScoreInternal( - long tweetID, - long timeSliceID, - float score, - ThriftSearchResultMetadata metadata) { - // This collector cannot handle these scores: - assert !Float.isNaN(score); - - if (score <= lowestScore) { - // Since docs are returned in-order (i.e., increasing doc Id), a document - // with equal score to pqTop.score cannot compete since HitQueue favors - // documents with lower doc Ids. Therefore reject those docs too. - // IMPORTANT: docs skipped by the scoring function will have scores set - // to ScoringFunction.SKIP_HIT, meaning they will not be collected. - return; - } - - boolean dupFound = false; - Preconditions.checkState(metadata.isSetSignature(), - "The signature should be set at metadata collection time, but it is null. " - + "Tweet id = %s, metadata = %s", - tweetID, - metadata); - int signatureInt = metadata.getSignature(); - final TweetIntegerShingleSignature signature = - TweetIntegerShingleSignature.deserialize(signatureInt); - - if (isFilterDupes) { - // update duplicate if any - if (signatureInt != TweetIntegerShingleSignature.DEFAULT_NO_SIGNATURE) { - dupFound = minQueue.incrementElement( - signature, - element -> { - if (score > element.getScore()) { - element.update(timeSliceID, tweetID, signature, metadata); - } - } - ); - } - } - - if (!dupFound) { - numHitsCollected++; - - // if we didn't find a duplicate element to update then we add it now as a new element to the - // pq - pqTop = minQueue.updateTop(top -> top.update(timeSliceID, tweetID, signature, metadata)); - - lowestScore = pqTop.getScore(); - } - } - - @Override - protected void doCollectWithScore(final long tweetID, final float score) throws IOException { - ThriftSearchResultMetadata metadata = collectMetadata(); - scoringFunction.populateResultMetadataBasedOnScoringData( - searchRequestInfo.getSearchQuery().getResultMetadataOptions(), - metadata, - scoringFunction.getScoringDataForCurrentDocument()); - collectWithScoreInternal(tweetID, currTimeSliceID, score, metadata); - } - - @Override - public EarlyTerminationState innerShouldCollectMore() { - // Note that numHitsCollected here might be less than num results collected in the - // TwitterEarlyTerminationCollector, if we hit dups or there are very low scores. 
- if (numHitsCollected >= getMaxHitsToProcess()) { - return setEarlyTerminationState(EarlyTerminationState.TERMINATED_MAX_HITS_EXCEEDED); - } - return EarlyTerminationState.COLLECTING; - } - - @Override - protected RelevanceSearchResults doGetRelevanceResults() throws IOException { - return getRelevanceResultsInternal(); - } - - protected RelevanceSearchResults getRelevanceResultsInternal() { - return resultsFromQueue(minQueue, getSearchRequestInfo().getNumResultsRequested(), - getRelevanceStats()); - } - - private static RelevanceSearchResults resultsFromQueue( - RandomAccessPriorityQueue pq, - int desiredNumResults, - ThriftSearchResultsRelevanceStats relevanceStats) { - // trim first in case we didn't fill up the queue to not get any sentinel values here - int numResults = pq.trim(); - if (numResults > desiredNumResults) { - for (int i = 0; i < numResults - desiredNumResults; i++) { - pq.pop(); - } - numResults = desiredNumResults; - } - RelevanceSearchResults results = new RelevanceSearchResults(numResults); - // insert hits in decreasing order by score - for (int i = numResults - 1; i >= 0; i--) { - RelevanceHit hit = pq.pop(); - results.setHit(hit, i); - } - results.setRelevanceStats(relevanceStats); - results.setNumHits(numResults); - return results; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/BatchHit.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/BatchHit.java deleted file mode 100644 index ed42bf319..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/BatchHit.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; - -public class BatchHit { - private final LinearScoringData scoringData; - private final ThriftSearchResultFeatures features; - private final ThriftSearchResultMetadata metadata; - private final long tweetID; - private final long timeSliceID; - - public BatchHit( - LinearScoringData scoringData, - ThriftSearchResultFeatures features, - ThriftSearchResultMetadata metadata, - long tweetID, - long timeSliceID - ) { - this.scoringData = scoringData; - this.features = features; - this.metadata = metadata; - this.tweetID = tweetID; - this.timeSliceID = timeSliceID; - } - - public LinearScoringData getScoringData() { - return scoringData; - } - - public ThriftSearchResultFeatures getFeatures() { - return features; - } - - public ThriftSearchResultMetadata getMetadata() { - return metadata; - } - - public long getTweetID() { - return tweetID; - } - - public long getTimeSliceID() { - return timeSliceID; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/DefaultScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/DefaultScoringFunction.java deleted file mode 100644 index ead0078d8..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/DefaultScoringFunction.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/* - * A sample scorer, doesn't really do anything, returns the same score for every 
document. - */ -public class DefaultScoringFunction extends ScoringFunction { - private float score; - - public DefaultScoringFunction(ImmutableSchemaInterface schema) { - super(schema); - } - - @Override - protected float score(float luceneQueryScore) { - score = luceneQueryScore; - return luceneQueryScore; - } - - @Override - protected Explanation doExplain(float luceneScore) { - // just an example - this scoring function will go away soon - return Explanation.match(luceneScore, "luceneScore=" + luceneScore); - } - - @Override - public void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - relevanceStats.setNumScored(relevanceStats.getNumScored() + 1); - if (score == ScoringFunction.SKIP_HIT) { - relevanceStats.setNumSkipped(relevanceStats.getNumSkipped() + 1); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/FeatureBasedScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/FeatureBasedScoringFunction.java deleted file mode 100644 index b6fe1f3ad..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/FeatureBasedScoringFunction.java +++ /dev/null @@ -1,1360 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.util.EnumSet; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.TimeUnit; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Iterables; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; -import com.google.common.primitives.Ints; -import com.google.common.primitives.Longs; - -import org.apache.lucene.search.Explanation; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.bloomfilter.BloomFilter; -import com.twitter.search.common.constants.SearchCardType; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.features.ExternalTweetFeature; -import com.twitter.search.common.features.FeatureHandler; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaEntry; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureType; -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.query.QueryCommonFieldHitsVisitor; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.relevance.features.AgeDecay; -import com.twitter.search.common.relevance.features.RelevanceSignalConstants; -import com.twitter.search.common.relevance.text.VisibleTokenRatioNormalizer; -import com.twitter.search.common.results.thriftjava.FieldHitList; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.util.LongIntConverter; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import 
com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.search.relevance.LinearScoringData.SkipReason; -import com.twitter.search.earlybird.search.relevance.LinearScoringParams; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; -import com.twitter.search.earlybird.thrift.ThriftSocialFilterType; - -/** - * Base class for scoring functions that rely on the extracted features stored in LinearScoringData. - * - * Extensions of this class must implement 2 methods: - * - * - computeScore - * - generateExplanationForScoring - * - * They are called for scoring and generating the debug information of the document that it's - * currently being evaluated. The field 'data' holds the features of the document. - */ -public abstract class FeatureBasedScoringFunction extends ScoringFunction { - private static final Logger LOG = LoggerFactory.getLogger(FeatureBasedScoringFunction.class); - - // A multiplier that's applied to all scores to avoid scores too low. - public static final float SCORE_ADJUSTER = 100.0f; - - private static final VisibleTokenRatioNormalizer VISIBLE_TOKEN_RATIO_NORMALIZER = - VisibleTokenRatioNormalizer.createInstance(); - - // Allow default values only for numeric types. - private static final Set ALLOWED_TYPES_FOR_DEFAULT_FEATURE_VALUES = - EnumSet.of(ThriftSearchFeatureType.INT32_VALUE, - ThriftSearchFeatureType.LONG_VALUE, - ThriftSearchFeatureType.DOUBLE_VALUE); - - private static final Set NUMERIC_FEATURES_FOR_WHICH_DEFAULTS_SHOULD_NOT_BE_SET = - ImmutableSet.of(EarlybirdFieldConstant.TWEET_SIGNATURE.getFieldId(), - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT.getFieldId(), - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT.getFieldId()); - - // Name of the scoring function. Used for generating explanations. - private final String functionName; - - private final BloomFilter trustedFilter; - private final BloomFilter followFilter; - - // Current timestamp in seconds. Overridable by unit test or by timestamp set in search query. - private int now; - - private final AntiGamingFilter antiGamingFilter; - - @Nullable - private final AgeDecay ageDecay; - - protected final LinearScoringParams params; // Parameters and query-dependent values. - - // In order for the API calls to retrieve the correct `LinearScoringData` - // for the passed `docId`, we need to maintain a map of `docId` -> `LinearScoringData` - // NOTE: THIS CAN ONLY BE REFERENCED AT HIT COLLECTION TIME, SINCE DOC IDS ARE NOT UNIQUE - // ACROSS SEGMENTS. IT'S NOT USABLE DURING BATCH SCORING. 
- private final Map<Integer, LinearScoringData> docIdToScoringData; - - private final ThriftSearchResultType searchResultType; - - private final UserTable userTable; - - @VisibleForTesting - void setNow(int fakeNow) { - now = fakeNow; - } - - public FeatureBasedScoringFunction( - String functionName, - ImmutableSchemaInterface schema, - ThriftSearchQuery searchQuery, - AntiGamingFilter antiGamingFilter, - ThriftSearchResultType searchResultType, - UserTable userTable) throws IOException { - super(schema); - - this.functionName = functionName; - this.searchResultType = searchResultType; - this.userTable = userTable; - - Preconditions.checkNotNull(searchQuery.getRelevanceOptions()); - ThriftRankingParams rankingParams = searchQuery.getRelevanceOptions().getRankingParams(); - Preconditions.checkNotNull(rankingParams); - - params = new LinearScoringParams(searchQuery, rankingParams); - docIdToScoringData = new HashMap<>(); - - long timestamp = searchQuery.isSetTimestampMsecs() && searchQuery.getTimestampMsecs() > 0 - ? searchQuery.getTimestampMsecs() : System.currentTimeMillis(); - now = Ints.checkedCast(TimeUnit.MILLISECONDS.toSeconds(timestamp)); - - this.antiGamingFilter = antiGamingFilter; - - this.ageDecay = params.useAgeDecay - ? new AgeDecay(params.ageDecayBase, params.ageDecayHalflife, params.ageDecaySlope) - : null; - - if (searchQuery.isSetTrustedFilter()) { - trustedFilter = new BloomFilter(searchQuery.getTrustedFilter()); - } else { - trustedFilter = null; - } - - if (searchQuery.isSetDirectFollowFilter()) { - followFilter = new BloomFilter(searchQuery.getDirectFollowFilter()); - } else { - followFilter = null; - } - } - - @VisibleForTesting - final LinearScoringParams getScoringParams() { - return params; - } - - /** - * Returns the LinearScoringData instance associated with the current doc ID. If it doesn't exist, - * an empty LinearScoringData is created. - */ - @Override - public LinearScoringData getScoringDataForCurrentDocument() { - LinearScoringData data = docIdToScoringData.get(getCurrentDocID()); - if (data == null) { - data = new LinearScoringData(); - docIdToScoringData.put(getCurrentDocID(), data); - } - return data; - } - - @Override - public void setDebugMode(int debugMode) { - super.setDebugMode(debugMode); - } - - /** - * Normalizes the Lucene score, which is unbounded, to the range [1.0, maxLuceneScoreBoost]. - * The normalized value increases almost linearly in the Lucene score range 2.0 ~ 7.0, where - * most queries fall. Rare long-tail queries (like some hashtags) have a high idf and thus a - * high Lucene score, so the normalized value won't differ much between such tweets. - * The normalization function is: - * ls = luceneScore - * norm = min(max, 1 + (max - 1.0) / 2.4 * ln(1 + ls)) - */ - static float normalizeLuceneScore(float luceneScore, float maxBoost) { - return (float) Math.min(maxBoost, 1.0 + (maxBoost - 1.0) / 2.4 * Math.log1p(luceneScore)); - } - - @Override - protected float score(float luceneQueryScore) throws IOException { - return scoreInternal(luceneQueryScore, null); - } - - protected LinearScoringData updateLinearScoringData(float luceneQueryScore) throws IOException { - // Reset the data for each tweet!!! - LinearScoringData data = new LinearScoringData(); - docIdToScoringData.put(getCurrentDocID(), data); - - // Set proper version for engagement counters for this request.
- data.skipReason = SkipReason.NOT_SKIPPED; - data.luceneScore = luceneQueryScore; - data.userRep = (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.USER_REPUTATION); - - if (antiGamingFilter != null && !antiGamingFilter.accept(getCurrentDocID())) { - data.skipReason = SkipReason.ANTIGAMING; - return data; - } - - data.textScore = (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.TEXT_SCORE); - data.tokenAt140DividedByNumTokensBucket = VISIBLE_TOKEN_RATIO_NORMALIZER.denormalize( - (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.VISIBLE_TOKEN_RATIO)); - data.fromUserId = documentFeatures.getFeatureValue(EarlybirdFieldConstant.FROM_USER_ID_CSF); - data.isFollow = followFilter != null - && followFilter.contains(Longs.toByteArray(data.fromUserId)); - data.isTrusted = trustedFilter != null - && trustedFilter.contains(Longs.toByteArray(data.fromUserId)); - data.isFromVerifiedAccount = documentFeatures.isFlagSet( - EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG); - data.isFromBlueVerifiedAccount = documentFeatures.isFlagSet( - EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG); - data.isSelfTweet = data.fromUserId == params.searcherId; - // v1 engagement counters, note that the first three values are post-log2 version - // of the original unnormalized values. - data.retweetCountPostLog2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.RETWEET_COUNT); - data.replyCountPostLog2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.REPLY_COUNT); - data.favCountPostLog2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.FAVORITE_COUNT); - data.embedsImpressionCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT); - data.embedsUrlCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EMBEDS_URL_COUNT); - data.videoViewCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.VIDEO_VIEW_COUNT); - // v2 engagement counters - data.retweetCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.RETWEET_COUNT_V2); - data.replyCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.REPLY_COUNT_V2); - data.favCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.FAVORITE_COUNT_V2); - // other v2 engagement counters - data.embedsImpressionCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EMBEDS_IMPRESSION_COUNT_V2); - data.embedsUrlCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EMBEDS_URL_COUNT_V2); - data.videoViewCountV2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.VIDEO_VIEW_COUNT_V2); - // pure v2 engagement counters without v1 counterpart - data.quotedCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.QUOTE_COUNT); - data.weightedRetweetCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.WEIGHTED_RETWEET_COUNT); - data.weightedReplyCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.WEIGHTED_REPLY_COUNT); - data.weightedFavCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.WEIGHTED_FAVORITE_COUNT); - data.weightedQuoteCount = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.WEIGHTED_QUOTE_COUNT); - - Double querySpecificScoreAdjustment = params.querySpecificScoreAdjustments == null ? 
null - : params.querySpecificScoreAdjustments.get(tweetIDMapper.getTweetID(getCurrentDocID())); - data.querySpecificScore = - querySpecificScoreAdjustment == null ? 0.0 : querySpecificScoreAdjustment; - - data.authorSpecificScore = params.authorSpecificScoreAdjustments == null - ? 0.0 - : params.authorSpecificScoreAdjustments.getOrDefault(data.fromUserId, 0.0); - - // respect social filter type - if (params.socialFilterType != null && !data.isSelfTweet) { - if ((params.socialFilterType == ThriftSocialFilterType.ALL - && !data.isFollow && !data.isTrusted) - || (params.socialFilterType == ThriftSocialFilterType.TRUSTED && !data.isTrusted) - || (params.socialFilterType == ThriftSocialFilterType.FOLLOWS && !data.isFollow)) { - // we can skip this hit as we only want social results in this mode. - data.skipReason = SkipReason.SOCIAL_FILTER; - return data; - } - } - - // 1. first apply all the filters to only non-follow tweets and non-verified accounts, - // but be tender to sentinel values - // unless you specifically asked to apply filters regardless - if (params.applyFiltersAlways - || (!data.isSelfTweet && !data.isFollow && !data.isFromVerifiedAccount - && !data.isFromBlueVerifiedAccount)) { - if (data.userRep < params.reputationMinVal - // don't filter unset userreps, we give them the benefit of doubt and let it - // continue to scoring. userrep is unset when either user just signed up or - // during ingestion time we had trouble getting userrep from reputation service. - && data.userRep != RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL) { - data.skipReason = SkipReason.LOW_REPUTATION; - return data; - } else if (data.textScore < params.textScoreMinVal - // don't filter unset text scores, use goodwill value - && data.textScore != RelevanceSignalConstants.UNSET_TEXT_SCORE_SENTINEL) { - data.skipReason = SkipReason.LOW_TEXT_SCORE; - return data; - } else if (data.retweetCountPostLog2 != LinearScoringData.UNSET_SIGNAL_VALUE - && data.retweetCountPostLog2 < params.retweetMinVal) { - data.skipReason = SkipReason.LOW_RETWEET_COUNT; - return data; - } else if (data.favCountPostLog2 != LinearScoringData.UNSET_SIGNAL_VALUE - && data.favCountPostLog2 < params.favMinVal) { - data.skipReason = SkipReason.LOW_FAV_COUNT; - return data; - } - } - - // if sentinel value is set, assume goodwill score and let scoring continue. - if (data.textScore == RelevanceSignalConstants.UNSET_TEXT_SCORE_SENTINEL) { - data.textScore = RelevanceSignalConstants.GOODWILL_TEXT_SCORE; - } - if (data.userRep == RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL) { - data.userRep = RelevanceSignalConstants.GOODWILL_REPUTATION; - } - - data.tweetAgeInSeconds = now - timeMapper.getTime(getCurrentDocID()); - if (data.tweetAgeInSeconds < 0) { - data.tweetAgeInSeconds = 0; // Age cannot be negative - } - - // The PARUS_SCORE feature should be read as is. 
- data.parusScore = documentFeatures.getFeatureValue(EarlybirdFieldConstant.PARUS_SCORE); - - data.isNullcast = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG); - data.hasUrl = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_LINK_FLAG); - data.hasImageUrl = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG); - data.hasVideoUrl = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG); - data.hasNewsUrl = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_NEWS_URL_FLAG); - data.isReply = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_REPLY_FLAG); - data.isRetweet = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_RETWEET_FLAG); - data.isOffensive = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG); - data.hasTrend = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_TREND_FLAG); - data.hasMultipleHashtagsOrTrends = - documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_MULTIPLE_HASHTAGS_OR_TRENDS_FLAG); - data.isUserSpam = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_USER_SPAM_FLAG); - data.isUserNSFW = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_USER_NSFW_FLAG) - || userTable.isSet(data.fromUserId, UserTable.NSFW_BIT); - data.isUserAntiSocial = - userTable.isSet(data.fromUserId, UserTable.ANTISOCIAL_BIT); - data.isUserBot = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_USER_BOT_FLAG); - data.hasCard = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_CARD_FLAG); - data.cardType = SearchCardType.UNKNOWN.getByteValue(); - if (data.hasCard) { - data.cardType = - (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.CARD_TYPE_CSF_FIELD); - } - data.hasVisibleLink = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_VISIBLE_LINK_FLAG); - - data.hasConsumerVideo = - documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_CONSUMER_VIDEO_FLAG); - data.hasProVideo = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_PRO_VIDEO_FLAG); - data.hasVine = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_VINE_FLAG); - data.hasPeriscope = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_PERISCOPE_FLAG); - data.hasNativeImage = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_NATIVE_IMAGE_FLAG); - data.hasQuote = documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_QUOTE_FLAG); - data.isComposerSourceCamera = - documentFeatures.isFlagSet(EarlybirdFieldConstant.COMPOSER_SOURCE_IS_CAMERA_FLAG); - - // Only read the shared status if the isRetweet or isReply bit is true (minor optimization). - if (data.isRetweet || (params.getInReplyToStatusId && data.isReply)) { - data.sharedStatusId = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.SHARED_STATUS_ID_CSF); - } - - // Only read the reference tweet author ID if the isRetweet or isReply bit - // is true (minor optimization). - if (data.isRetweet || data.isReply) { - // the REFERENCE_AUTHOR_ID_CSF stores the source tweet author id for all retweets - long referenceAuthorId = - documentFeatures.getFeatureValue(EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_CSF); - if (referenceAuthorId > 0) { - data.referenceAuthorId = referenceAuthorId; - } else { - // we also store the reference author id for retweets, directed at tweets, and self threaded - // tweets separately on Realtime/Protected Earlybirds. This data will be moved to the - // REFERENCE_AUTHOR_ID_CSF and these fields will be deprecated in SEARCH-34958. 
- referenceAuthorId = LongIntConverter.convertTwoIntToOneLong( - (int) documentFeatures.getFeatureValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_MOST_SIGNIFICANT_INT), - (int) documentFeatures.getFeatureValue( - EarlybirdFieldConstant.REFERENCE_AUTHOR_ID_LEAST_SIGNIFICANT_INT)); - if (referenceAuthorId > 0) { - data.referenceAuthorId = referenceAuthorId; - } - } - } - - // Convert language to a thrift language and then back to an int in order to - // ensure a value compatible with our current ThriftLanguage definition. - ThriftLanguage tweetLang = ThriftLanguageUtil.safeFindByValue( - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.LANGUAGE)); - data.tweetLangId = tweetLang.getValue(); - // Set the language-related features here so that they can be later used in promotion/demotion - // and also be transferred to ThriftSearchResultMetadata - data.userLangMult = computeUserLangMultiplier(data, params); - data.hasDifferentLang = params.uiLangId != ThriftLanguage.UNKNOWN.getValue() - && params.uiLangId != data.tweetLangId; - data.hasEnglishTweetAndDifferentUILang = data.hasDifferentLang - && data.tweetLangId == ThriftLanguage.ENGLISH.getValue(); - data.hasEnglishUIAndDifferentTweetLang = data.hasDifferentLang - && params.uiLangId == ThriftLanguage.ENGLISH.getValue(); - - // Exposed all these features for the clients. - data.isSensitiveContent = - documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_SENSITIVE_CONTENT); - data.hasMultipleMediaFlag = - documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_MULTIPLE_MEDIA_FLAG); - data.profileIsEggFlag = documentFeatures.isFlagSet(EarlybirdFieldConstant.PROFILE_IS_EGG_FLAG); - data.isUserNewFlag = documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_USER_NEW_FLAG); - data.numMentions = (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.NUM_MENTIONS); - data.numHashtags = (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.NUM_HASHTAGS); - data.linkLanguage = - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.LINK_LANGUAGE); - data.prevUserTweetEngagement = - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.PREV_USER_TWEET_ENGAGEMENT); - - // health model scores by HML - data.toxicityScore = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.TOXICITY_SCORE); - data.pBlockScore = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.PBLOCK_SCORE); - data.pSpammyTweetScore = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.P_SPAMMY_TWEET_SCORE); - data.pReportedTweetScore = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.P_REPORTED_TWEET_SCORE); - data.spammyTweetContentScore = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.SPAMMY_TWEET_CONTENT_SCORE - ); - data.experimentalHealthModelScore1 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_1); - data.experimentalHealthModelScore2 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_2); - data.experimentalHealthModelScore3 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_3); - data.experimentalHealthModelScore4 = documentFeatures.getUnnormalizedFeatureValue( - EarlybirdFieldConstant.EXPERIMENTAL_HEALTH_MODEL_SCORE_4); - - return data; - } - - protected float scoreInternal( - float luceneQueryScore, ExplanationWrapper explanation) throws IOException { - LinearScoringData 
data = updateLinearScoringData(luceneQueryScore); - if (data.skipReason != null && data.skipReason != SkipReason.NOT_SKIPPED) { - return finalizeScore(data, explanation, SKIP_HIT); - } - - double score = computeScore(data, explanation != null); - return postScoreComputation(data, score, true, explanation); - } - - protected float postScoreComputation( - LinearScoringData data, - double score, - boolean boostScoreWithHitAttribution, - ExplanationWrapper explanation) throws IOException { - double modifiedScore = score; - data.scoreBeforeBoost = modifiedScore; - if (params.applyBoosts) { - modifiedScore = - applyBoosts(data, modifiedScore, boostScoreWithHitAttribution, explanation != null); - } - // Final adjustment to avoid too-low scores. - modifiedScore *= SCORE_ADJUSTER; - data.scoreAfterBoost = modifiedScore; - - // 3. final score filter - data.scoreFinal = modifiedScore; - if ((params.applyFiltersAlways || (!data.isSelfTweet && !data.isFollow)) - && modifiedScore < params.minScore) { - data.skipReason = SkipReason.LOW_FINAL_SCORE; - modifiedScore = SKIP_HIT; - } - - // clear field hits - this.fieldHitAttribution = null; - return finalizeScore(data, explanation, modifiedScore); - } - - /** - * Applying promotion/demotion to the scores generated by feature-based scoring functions - * - * @param data Original LinearScoringData (to be modified with boosts here) - * @param score Score generated by the feature-based scoring function - * @param withHitAttribution Determines if hit attribution data should be included. - * @param forExplanation Indicates if the score will be computed for generating the explanation. - * @return Score after applying promotion/demotion - */ - private double applyBoosts( - LinearScoringData data, - double score, - boolean withHitAttribution, - boolean forExplanation) { - double boostedScore = score; - - if (params.useLuceneScoreAsBoost) { - data.normalizedLuceneScore = normalizeLuceneScore( - (float) data.luceneScore, (float) params.maxLuceneScoreBoost); - boostedScore *= data.normalizedLuceneScore; - } - if (data.isOffensive) { - boostedScore *= params.offensiveDamping; - } - if (data.isUserSpam && params.spamUserDamping != LinearScoringData.NO_BOOST_VALUE) { - data.spamUserDampApplied = true; - boostedScore *= params.spamUserDamping; - } - if (data.isUserNSFW && params.nsfwUserDamping != LinearScoringData.NO_BOOST_VALUE) { - data.nsfwUserDampApplied = true; - boostedScore *= params.nsfwUserDamping; - } - if (data.isUserBot && params.botUserDamping != LinearScoringData.NO_BOOST_VALUE) { - data.botUserDampApplied = true; - boostedScore *= params.botUserDamping; - } - - // cards - if (data.hasCard && params.hasCardBoosts[data.cardType] != LinearScoringData.NO_BOOST_VALUE) { - boostedScore *= params.hasCardBoosts[data.cardType]; - data.hasCardBoostApplied = true; - } - - // trends - if (data.hasMultipleHashtagsOrTrends) { - boostedScore *= params.multipleHashtagsOrTrendsDamping; - } else if (data.hasTrend) { - data.tweetHasTrendsBoostApplied = true; - boostedScore *= params.tweetHasTrendBoost; - } - - // Media/News url boosts. 
- if (data.hasImageUrl || data.hasVideoUrl) { - data.hasMedialUrlBoostApplied = true; - boostedScore *= params.tweetHasMediaUrlBoost; - } - if (data.hasNewsUrl) { - data.hasNewsUrlBoostApplied = true; - boostedScore *= params.tweetHasNewsUrlBoost; - } - - if (data.isFromVerifiedAccount) { - data.tweetFromVerifiedAccountBoostApplied = true; - boostedScore *= params.tweetFromVerifiedAccountBoost; - } - - if (data.isFromBlueVerifiedAccount) { - data.tweetFromBlueVerifiedAccountBoostApplied = true; - boostedScore *= params.tweetFromBlueVerifiedAccountBoost; - } - - if (data.isFollow) { - // direct follow, so boost both replies and non-replies. - data.directFollowBoostApplied = true; - boostedScore *= params.directFollowBoost; - } else if (data.isTrusted) { - // trusted circle - if (!data.isReply) { - // non-at-reply, in trusted network - data.trustedCircleBoostApplied = true; - boostedScore *= params.trustedCircleBoost; - } - } else if (data.isReply) { - // at-reply out of my network - data.outOfNetworkReplyPenaltyApplied = true; - boostedScore -= params.outOfNetworkReplyPenalty; - } - - if (data.isSelfTweet) { - data.selfTweetBoostApplied = true; - data.selfTweetMult = params.selfTweetBoost; - boostedScore *= params.selfTweetBoost; - } - - // Language Demotion - // User language based demotion - // The data.userLangMult is set in scoreInternal(), and this setting step is always before - // the applying boosts step - if (params.useUserLanguageInfo) { - boostedScore *= data.userLangMult; - } - // UI language based demotion - if (params.uiLangId != ThriftLanguage.UNKNOWN.getValue() - && params.uiLangId != data.tweetLangId) { - if (data.tweetLangId == ThriftLanguage.ENGLISH.getValue()) { - data.uiLangMult = params.langEnglishTweetDemote; - } else if (params.uiLangId == ThriftLanguage.ENGLISH.getValue()) { - data.uiLangMult = params.langEnglishUIDemote; - } else { - data.uiLangMult = params.langDefaultDemote; - } - } else { - data.uiLangMult = LinearScoringData.NO_BOOST_VALUE; - } - boostedScore *= data.uiLangMult; - - if (params.useAgeDecay) { - // shallow sigmoid with an inflection point at ageDecayHalflife - data.ageDecayMult = ageDecay.getAgeDecayMultiplier(data.tweetAgeInSeconds); - boostedScore *= data.ageDecayMult; - } - - // Hit Attribute Demotion - // Scoring is currently based on tokenized user name, text, and url in the tweet - // If hit attribute collection is enabled, we demote score based on these fields - if (hitAttributeHelper != null && params.enableHitDemotion) { - - Map<Integer, List<String>> hitMap; - if (forExplanation && fieldHitAttribution != null) { - // if this scoring call is for generating an explanation, - // we'll use the fieldHitAttribution found in the search result's metadata because - // collectors are not called during the debug workflow - hitMap = Maps.transformValues(fieldHitAttribution.getHitMap(), FieldHitList::getHitFields); - } else if (withHitAttribution) { - hitMap = hitAttributeHelper.getHitAttribution(getCurrentDocID()); - } else { - hitMap = Maps.newHashMap(); - } - Set<String> uniqueFieldHits = ImmutableSet.copyOf(Iterables.concat(hitMap.values())); - - data.hitFields.addAll(uniqueFieldHits); - // there should always be fields that are hit - // if there aren't, we assume this is a call from 'explain' in debug mode - // do not override hit attribute data if in debug mode - if (!uniqueFieldHits.isEmpty()) { - // demotions based strictly on field hits - if (uniqueFieldHits.size() == 1) { - if (uniqueFieldHits.contains(
EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName())) { - // if url was the only field that was hit, demote - data.hasUrlOnlyHitDemotionApplied = true; - boostedScore *= params.urlOnlyHitDemotion; - } else if (uniqueFieldHits.contains( - EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName())) { - // if name was the only field that was hit, demote - data.hasNameOnlyHitDemotionApplied = true; - boostedScore *= params.nameOnlyHitDemotion; - } - } else if (!uniqueFieldHits.contains(EarlybirdFieldConstant.TEXT_FIELD.getFieldName()) - && !uniqueFieldHits.contains(EarlybirdFieldConstant.MENTIONS_FIELD.getFieldName()) - && !uniqueFieldHits.contains(EarlybirdFieldConstant.HASHTAGS_FIELD.getFieldName()) - && !uniqueFieldHits.contains(EarlybirdFieldConstant.STOCKS_FIELD.getFieldName())) { - // if text or special text was never hit, demote - data.hasNoTextHitDemotionApplied = true; - boostedScore *= params.noTextHitDemotion; - } else if (uniqueFieldHits.size() == 2) { - // demotions based on field hit combinations - // want to demote if we only hit two of the fields (one being text) - // but with separate terms - Set fieldIntersections = QueryCommonFieldHitsVisitor.findIntersection( - hitAttributeHelper.getNodeToRankMap(), - hitMap, - query); - - if (fieldIntersections.isEmpty()) { - if (uniqueFieldHits.contains( - EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName())) { - // if name is hit but has no hits in common with text, demote - // want to demote cases where we hit part of the person's name - // and tweet text separately - data.hasSeparateTextAndNameHitDemotionApplied = true; - boostedScore *= params.separateTextAndNameHitDemotion; - } else if (uniqueFieldHits.contains( - EarlybirdFieldConstant.RESOLVED_LINKS_TEXT_FIELD.getFieldName())) { - // if url is hit but has no hits in common with text, demote - // want to demote cases where we hit a potential domain keyword - // and tweet text separately - data.hasSeparateTextAndUrlHitDemotionApplied = true; - boostedScore *= params.separateTextAndUrlHitDemotion; - } - } - } - } - } - - return boostedScore; - } - - /** - * Compute the user language based demotion multiplier - */ - private static double computeUserLangMultiplier( - LinearScoringData data, LinearScoringParams params) { - if (data.tweetLangId == params.uiLangId - && data.tweetLangId != ThriftLanguage.UNKNOWN.getValue()) { - // Effectively the uiLang is considered a language that user knows with 1.0 confidence. - return LinearScoringData.NO_BOOST_VALUE; - } - - if (params.userLangs[data.tweetLangId] > 0.0) { - return params.userLangs[data.tweetLangId]; - } - - return params.unknownLanguageBoost; - } - - /** - * Computes the score of the document that it's currently being evaluated. - * - * The extracted features from the document are available in the field 'data'. - * - * @param data The LinearScoringData instance that will store the document features. - * @param forExplanation Indicates if the score will be computed for generating the explanation. 
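As a rough illustration of what a computeScore implementation, followed by the boost and SCORE_ADJUSTER steps described earlier, amounts to, here is a toy, self-contained linear scorer; the weights and feature choices are invented for the example and do not correspond to LinearScoringParams:

```java
// Toy linear scorer: a weighted sum of a few document features, then a bounded
// Lucene-score boost and a final SCORE_ADJUSTER-style multiplier.
// All weights below are made up for illustration.
class LinearScoreSketch {
  static double score(double textScore, double userRep, double favCount, double luceneScore) {
    double linear = 0.5 * textScore + 0.3 * userRep + 0.2 * Math.log1p(favCount);
    // Same shape as normalizeLuceneScore above: min(max, 1 + (max - 1) / 2.4 * ln(1 + ls)).
    double maxBoost = 2.0;
    double luceneBoost = Math.min(maxBoost, 1.0 + (maxBoost - 1.0) / 2.4 * Math.log1p(luceneScore));
    return linear * luceneBoost * 100.0;
  }

  public static void main(String[] args) {
    // Example: a reasonably engaging tweet with a mid-range Lucene score.
    System.out.println(score(0.8, 0.6, 12.0, 3.5));
  }
}
```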
- */ - protected abstract double computeScore( - LinearScoringData data, boolean forExplanation) throws IOException; - - private float finalizeScore( - LinearScoringData scoringData, - ExplanationWrapper explanation, - double score) throws IOException { - scoringData.scoreReturned = score; - if (explanation != null) { - explanation.explanation = generateExplanation(scoringData); - } - return (float) score; - } - - @Override - protected void initializeNextSegment(EarlybirdIndexSegmentAtomicReader reader) - throws IOException { - if (antiGamingFilter != null) { - antiGamingFilter.startSegment(reader); - } - } - - /* - * Generate the scoring explanation for debug. - */ - private Explanation generateExplanation(LinearScoringData scoringData) throws IOException { - final List details = Lists.newArrayList(); - - details.add(Explanation.match(0.0f, "[PROPERTIES] " - + scoringData.getPropertyExplanation())); - - // 1. Filters - boolean isHit = scoringData.skipReason == SkipReason.NOT_SKIPPED; - if (scoringData.skipReason == SkipReason.ANTIGAMING) { - details.add(Explanation.noMatch("SKIPPED for antigaming")); - } - if (scoringData.skipReason == SkipReason.LOW_REPUTATION) { - details.add(Explanation.noMatch( - String.format("SKIPPED for low reputation: %.3f < %.3f", - scoringData.userRep, params.reputationMinVal))); - } - if (scoringData.skipReason == SkipReason.LOW_TEXT_SCORE) { - details.add(Explanation.noMatch( - String.format("SKIPPED for low text score: %.3f < %.3f", - scoringData.textScore, params.textScoreMinVal))); - } - if (scoringData.skipReason == SkipReason.LOW_RETWEET_COUNT) { - details.add(Explanation.noMatch( - String.format("SKIPPED for low retweet count: %.3f < %.3f", - scoringData.retweetCountPostLog2, params.retweetMinVal))); - } - if (scoringData.skipReason == SkipReason.LOW_FAV_COUNT) { - details.add(Explanation.noMatch( - String.format("SKIPPED for low fav count: %.3f < %.3f", - scoringData.favCountPostLog2, params.favMinVal))); - } - if (scoringData.skipReason == SkipReason.SOCIAL_FILTER) { - details.add(Explanation.noMatch("SKIPPED for not in the right social circle")); - } - - // 2. Explanation depending on the scoring type - generateExplanationForScoring(scoringData, isHit, details); - - // 3. Explanation depending on boosts - if (params.applyBoosts) { - generateExplanationForBoosts(scoringData, isHit, details); - } - - // 4. Final score filter. - if (scoringData.skipReason == SkipReason.LOW_FINAL_SCORE) { - details.add(Explanation.noMatch("SKIPPED for low final score: " + scoringData.scoreFinal)); - isHit = false; - } - - String hostAndSegment = String.format("%s host = %s segment = %s", - functionName, DatabaseConfig.getLocalHostname(), DatabaseConfig.getDatabase()); - if (isHit) { - return Explanation.match((float) scoringData.scoreFinal, hostAndSegment, details); - } else { - return Explanation.noMatch(hostAndSegment, details); - } - } - - /** - * Generates the explanation for the document that is currently being evaluated. - * - * Implementations of this method must use the 'details' parameter to collect its output. - * - * @param scoringData Scoring components for the document - * @param isHit Indicates whether the document is not skipped - * @param details Details of the explanation. Used to collect the output. - */ - protected abstract void generateExplanationForScoring( - LinearScoringData scoringData, boolean isHit, List details) throws IOException; - - /** - * Generates the boosts part of the explanation for the document that is currently - * being evaluated. 
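A small, self-contained sketch of assembling such a nested explanation with Lucene's Explanation factory methods; the values and labels here are invented, while the real details come from LinearScoringParams and LinearScoringData:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.Explanation;

// Build a nested explanation: each applied boost becomes a child entry, and the
// parent entry carries the score after all boosts.
class BoostExplanationSketch {
  static Explanation explainBoosts(float scoreBeforeBoost, float ageDecayMult, float uiLangMult) {
    List<Explanation> details = new ArrayList<>();
    details.add(Explanation.match(scoreBeforeBoost, "Score before boost"));
    details.add(Explanation.match(ageDecayMult, "[x] AgeDecay"));
    details.add(Explanation.match(uiLangMult, "[x] UI LangMult"));
    float scoreAfterBoost = scoreBeforeBoost * ageDecayMult * uiLangMult;
    return Explanation.match(scoreAfterBoost, "After Boosts and Demotions:", details);
  }

  public static void main(String[] args) {
    System.out.println(explainBoosts(12.0f, 0.9f, 0.75f));
  }
}
```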
- */ - private void generateExplanationForBoosts( - LinearScoringData scoringData, - boolean isHit, - List details) { - List boostDetails = Lists.newArrayList(); - - boostDetails.add(Explanation.match((float) scoringData.scoreBeforeBoost, "Score before boost")); - - - // Lucene score boost - if (params.useLuceneScoreAsBoost) { - boostDetails.add(Explanation.match( - (float) scoringData.normalizedLuceneScore, - String.format("[x] Lucene score boost, luceneScore=%.3f", - scoringData.luceneScore))); - } - - // card boost - if (scoringData.hasCardBoostApplied) { - boostDetails.add(Explanation.match((float) params.hasCardBoosts[scoringData.cardType], - "[x] card boost for type " + SearchCardType.cardTypeFromByteValue(scoringData.cardType))); - } - - // Offensive - if (scoringData.isOffensive) { - boostDetails.add(Explanation.match((float) params.offensiveDamping, "[x] Offensive damping")); - } else { - boostDetails.add(Explanation.match(LinearScoringData.NO_BOOST_VALUE, - String.format("Not Offensive, damping=%.3f", params.offensiveDamping))); - } - - // Spam - if (scoringData.spamUserDampApplied) { - boostDetails.add(Explanation.match((float) params.spamUserDamping, "[x] Spam")); - } - // NSFW - if (scoringData.nsfwUserDampApplied) { - boostDetails.add(Explanation.match((float) params.nsfwUserDamping, "[X] NSFW")); - } - // Bot - if (scoringData.botUserDampApplied) { - boostDetails.add(Explanation.match((float) params.botUserDamping, "[X] Bot")); - } - - // Multiple hashtags or trends - if (scoringData.hasMultipleHashtagsOrTrends) { - boostDetails.add(Explanation.match((float) params.multipleHashtagsOrTrendsDamping, - "[x] Multiple hashtags or trends boost")); - } else { - boostDetails.add(Explanation.match(LinearScoringData.NO_BOOST_VALUE, - String.format("No multiple hashtags or trends, damping=%.3f", - params.multipleHashtagsOrTrendsDamping))); - } - - if (scoringData.tweetHasTrendsBoostApplied) { - boostDetails.add(Explanation.match( - (float) params.tweetHasTrendBoost, "[x] Tweet has trend boost")); - } - - if (scoringData.hasMedialUrlBoostApplied) { - boostDetails.add(Explanation.match( - (float) params.tweetHasMediaUrlBoost, "[x] Media url boost")); - } - - if (scoringData.hasNewsUrlBoostApplied) { - boostDetails.add(Explanation.match( - (float) params.tweetHasNewsUrlBoost, "[x] News url boost")); - } - - boostDetails.add(Explanation.match(0.0f, "[FIELDS HIT] " + scoringData.hitFields)); - - if (scoringData.hasNoTextHitDemotionApplied) { - boostDetails.add(Explanation.match( - (float) params.noTextHitDemotion, "[x] No text hit demotion")); - } - - if (scoringData.hasUrlOnlyHitDemotionApplied) { - boostDetails.add(Explanation.match( - (float) params.urlOnlyHitDemotion, "[x] URL only hit demotion")); - } - - if (scoringData.hasNameOnlyHitDemotionApplied) { - boostDetails.add(Explanation.match( - (float) params.nameOnlyHitDemotion, "[x] Name only hit demotion")); - } - - if (scoringData.hasSeparateTextAndNameHitDemotionApplied) { - boostDetails.add(Explanation.match((float) params.separateTextAndNameHitDemotion, - "[x] Separate text/name demotion")); - } - - if (scoringData.hasSeparateTextAndUrlHitDemotionApplied) { - boostDetails.add(Explanation.match((float) params.separateTextAndUrlHitDemotion, - "[x] Separate text/url demotion")); - } - - if (scoringData.tweetFromVerifiedAccountBoostApplied) { - boostDetails.add(Explanation.match((float) params.tweetFromVerifiedAccountBoost, - "[x] Verified account boost")); - } - - if (scoringData.tweetFromBlueVerifiedAccountBoostApplied) { - 
boostDetails.add(Explanation.match((float) params.tweetFromBlueVerifiedAccountBoost, - "[x] Blue-verified account boost")); - } - - if (scoringData.selfTweetBoostApplied) { - boostDetails.add(Explanation.match((float) params.selfTweetBoost, - "[x] Self tweet boost")); - } - - if (scoringData.skipReason == LinearScoringData.SkipReason.SOCIAL_FILTER) { - boostDetails.add(Explanation.noMatch("SKIPPED for social filter")); - } else { - if (scoringData.directFollowBoostApplied) { - boostDetails.add(Explanation.match((float) params.directFollowBoost, - "[x] Direct follow boost")); - } - if (scoringData.trustedCircleBoostApplied) { - boostDetails.add(Explanation.match((float) params.trustedCircleBoost, - "[x] Trusted circle boost")); - } - if (scoringData.outOfNetworkReplyPenaltyApplied) { - boostDetails.add(Explanation.match((float) params.outOfNetworkReplyPenalty, - "[-] Out of network reply penalty")); - } - } - - // Language demotions - String langDetails = String.format( - "tweetLang=[%s] uiLang=[%s]", - ThriftLanguageUtil.getLocaleOf( - ThriftLanguage.findByValue(scoringData.tweetLangId)).getLanguage(), - ThriftLanguageUtil.getLocaleOf(ThriftLanguage.findByValue(params.uiLangId)).getLanguage()); - if (scoringData.uiLangMult == 1.0) { - boostDetails.add(Explanation.match( - LinearScoringData.NO_BOOST_VALUE, "No UI Language demotion: " + langDetails)); - } else { - boostDetails.add(Explanation.match( - (float) scoringData.uiLangMult, "[x] UI LangMult: " + langDetails)); - } - StringBuilder userLangDetails = new StringBuilder(); - userLangDetails.append("userLang=["); - for (int i = 0; i < params.userLangs.length; i++) { - if (params.userLangs[i] > 0.0) { - String lang = ThriftLanguageUtil.getLocaleOf(ThriftLanguage.findByValue(i)).getLanguage(); - userLangDetails.append(String.format("%s:%.3f,", lang, params.userLangs[i])); - } - } - userLangDetails.append("]"); - if (!params.useUserLanguageInfo) { - boostDetails.add(Explanation.noMatch( - "No User Language Demotion: " + userLangDetails.toString())); - } else { - boostDetails.add(Explanation.match( - (float) scoringData.userLangMult, - "[x] User LangMult: " + userLangDetails.toString())); - } - - // Age decay - String ageDecayDetails = String.format( - "age=%d seconds, slope=%.3f, base=%.1f, half-life=%.0f", - scoringData.tweetAgeInSeconds, params.ageDecaySlope, - params.ageDecayBase, params.ageDecayHalflife); - if (params.useAgeDecay) { - boostDetails.add(Explanation.match( - (float) scoringData.ageDecayMult, "[x] AgeDecay: " + ageDecayDetails)); - } else { - boostDetails.add(Explanation.match(1.0f, "Age decay disabled: " + ageDecayDetails)); - } - - // Score adjuster - boostDetails.add(Explanation.match(SCORE_ADJUSTER, "[x] score adjuster")); - - Explanation boostCombo = isHit - ? Explanation.match((float) scoringData.scoreAfterBoost, - "(MATCH) After Boosts and Demotions:", boostDetails) - : Explanation.noMatch("After Boosts and Demotions:", boostDetails); - - details.add(boostCombo); - } - - @Override - protected Explanation doExplain(float luceneQueryScore) throws IOException { - // Run the scorer again and get the explanation. 
- ExplanationWrapper explanation = new ExplanationWrapper(); - scoreInternal(luceneQueryScore, explanation); - return explanation.explanation; - } - - @Override - public void populateResultMetadataBasedOnScoringData( - ThriftSearchResultMetadataOptions options, - ThriftSearchResultMetadata metadata, - LinearScoringData data) throws IOException { - metadata.setResultType(searchResultType); - metadata.setScore(data.scoreReturned); - metadata.setFromUserId(data.fromUserId); - - if (data.isTrusted) { - metadata.setIsTrusted(true); - } - if (data.isFollow) { - metadata.setIsFollow(true); - } - if (data.skipReason != SkipReason.NOT_SKIPPED) { - metadata.setSkipped(true); - } - if ((data.isRetweet || (params.getInReplyToStatusId && data.isReply)) - && data.sharedStatusId != LinearScoringData.UNSET_SIGNAL_VALUE) { - metadata.setSharedStatusId(data.sharedStatusId); - } - if (data.hasCard) { - metadata.setCardType(data.cardType); - } - - // Optional features. Note: other optional metadata is populated by - // AbstractRelevanceCollector, not the scoring function. - - if (options.isGetLuceneScore()) { - metadata.setLuceneScore(data.luceneScore); - } - if (options.isGetReferencedTweetAuthorId() - && data.referenceAuthorId != LinearScoringData.UNSET_SIGNAL_VALUE) { - metadata.setReferencedTweetAuthorId(data.referenceAuthorId); - } - - if (options.isGetMediaBits()) { - metadata.setHasConsumerVideo(data.hasConsumerVideo); - metadata.setHasProVideo(data.hasProVideo); - metadata.setHasVine(data.hasVine); - metadata.setHasPeriscope(data.hasPeriscope); - boolean hasNativeVideo = - data.hasConsumerVideo || data.hasProVideo || data.hasVine || data.hasPeriscope; - metadata.setHasNativeVideo(hasNativeVideo); - metadata.setHasNativeImage(data.hasNativeImage); - } - - metadata - .setIsOffensive(data.isOffensive) - .setIsReply(data.isReply) - .setIsRetweet(data.isRetweet) - .setHasLink(data.hasUrl) - .setHasTrend(data.hasTrend) - .setHasMultipleHashtagsOrTrends(data.hasMultipleHashtagsOrTrends) - .setRetweetCount((int) data.retweetCountPostLog2) - .setFavCount((int) data.favCountPostLog2) - .setReplyCount((int) data.replyCountPostLog2) - .setEmbedsImpressionCount((int) data.embedsImpressionCount) - .setEmbedsUrlCount((int) data.embedsUrlCount) - .setVideoViewCount((int) data.videoViewCount) - .setResultType(searchResultType) - .setFromVerifiedAccount(data.isFromVerifiedAccount) - .setIsUserSpam(data.isUserSpam) - .setIsUserNSFW(data.isUserNSFW) - .setIsUserBot(data.isUserBot) - .setHasImage(data.hasImageUrl) - .setHasVideo(data.hasVideoUrl) - .setHasNews(data.hasNewsUrl) - .setHasCard(data.hasCard) - .setHasVisibleLink(data.hasVisibleLink) - .setParusScore(data.parusScore) - .setTextScore(data.textScore) - .setUserRep(data.userRep) - .setTokenAt140DividedByNumTokensBucket(data.tokenAt140DividedByNumTokensBucket); - - if (!metadata.isSetExtraMetadata()) { - metadata.setExtraMetadata(new ThriftSearchResultExtraMetadata()); - } - ThriftSearchResultExtraMetadata extraMetadata = metadata.getExtraMetadata(); - - // Promotion/Demotion features - extraMetadata.setUserLangScore(data.userLangMult) - .setHasDifferentLang(data.hasDifferentLang) - .setHasEnglishTweetAndDifferentUILang(data.hasEnglishTweetAndDifferentUILang) - .setHasEnglishUIAndDifferentTweetLang(data.hasEnglishUIAndDifferentTweetLang) - .setHasQuote(data.hasQuote) - .setQuotedCount((int) data.quotedCount) - .setWeightedRetweetCount((int) data.weightedRetweetCount) - .setWeightedReplyCount((int) data.weightedReplyCount) - .setWeightedFavCount((int) 
data.weightedFavCount) - .setWeightedQuoteCount((int) data.weightedQuoteCount) - .setQuerySpecificScore(data.querySpecificScore) - .setAuthorSpecificScore(data.authorSpecificScore) - .setRetweetCountV2((int) data.retweetCountV2) - .setFavCountV2((int) data.favCountV2) - .setReplyCountV2((int) data.replyCountV2) - .setIsComposerSourceCamera(data.isComposerSourceCamera) - .setFromBlueVerifiedAccount(data.isFromBlueVerifiedAccount); - - // Health model scores features - extraMetadata - .setToxicityScore(data.toxicityScore) - .setPBlockScore(data.pBlockScore) - .setPSpammyTweetScore(data.pSpammyTweetScore) - .setPReportedTweetScore(data.pReportedTweetScore) - .setSpammyTweetContentScore(data.spammyTweetContentScore) - .setExperimentalHealthModelScore1(data.experimentalHealthModelScore1) - .setExperimentalHealthModelScore2(data.experimentalHealthModelScore2) - .setExperimentalHealthModelScore3(data.experimentalHealthModelScore3) - .setExperimentalHealthModelScore4(data.experimentalHealthModelScore4); - - // Return all extra features for clients to consume. - if (options.isGetAllFeatures()) { - extraMetadata.setIsSensitiveContent(data.isSensitiveContent) - .setHasMultipleMediaFlag(data.hasMultipleMediaFlag) - .setProfileIsEggFlag(data.profileIsEggFlag) - .setIsUserNewFlag(data.isUserNewFlag) - .setNumMentions(data.numMentions) - .setNumHashtags(data.numHashtags) - .setLinkLanguage(data.linkLanguage) - .setPrevUserTweetEngagement(data.prevUserTweetEngagement); - } - - // Set features in new Feature Access API format, in the future this will be the only part - // needed in this method, we don't need to set any other metadata fields any more. - if (options.isReturnSearchResultFeatures()) { - // If the features are unset, and they were requested, then we can retrieve them. If they are - // already set, then we don't need to re-read the document features, and the reader - // is probably positioned over the wrong document so it will return incorrect results. - if (!extraMetadata.isSetFeatures()) { - // We ignore all features with default values when returning them in the response, - // because it saves a lot of network bandwidth. - ThriftSearchResultFeatures features = createFeaturesForDocument(data, true).getFeatures(); - extraMetadata.setFeatures(features); - } - - // The raw score may have changed since we created the features, so we should update it. - extraMetadata.getFeatures().getDoubleValues() - .put(ExternalTweetFeature.RAW_EARLYBIRD_SCORE.getId(), data.scoreFinal); - } - - metadata - .setIsSelfTweet(data.isSelfTweet) - .setIsUserAntiSocial(data.isUserAntiSocial); - } - - /** - * Create earlybird basic features and dervied features for current document. - * @return a FeatureHandler object where you can keep adding extra feature values, or you can - * call .getFeatures() on it to get a Thrift object to return. 
- */ - protected FeatureHandler createFeaturesForDocument( - LinearScoringData data, boolean ignoreDefaultValues) throws IOException { - ThriftSearchResultFeatures features = documentFeatures.getSearchResultFeatures(getSchema()); - if (!ignoreDefaultValues) { - setDefaultFeatureValues(features); - } - - // add derived features - return new FeatureHandler(features, ignoreDefaultValues) - .addDouble(ExternalTweetFeature.LUCENE_SCORE, data.luceneScore) - .addInt(ExternalTweetFeature.TWEET_AGE_IN_SECS, data.tweetAgeInSeconds) - .addBoolean(ExternalTweetFeature.IS_SELF_TWEET, data.isSelfTweet) - .addBoolean(ExternalTweetFeature.IS_FOLLOW_RETWEET, data.isFollow && data.isRetweet) - .addBoolean(ExternalTweetFeature.IS_TRUSTED_RETWEET, data.isTrusted && data.isRetweet) - .addBoolean(ExternalTweetFeature.AUTHOR_IS_FOLLOW, data.isFollow) - .addBoolean(ExternalTweetFeature.AUTHOR_IS_TRUSTED, data.isTrusted) - .addBoolean(ExternalTweetFeature.AUTHOR_IS_ANTISOCIAL, data.isUserAntiSocial) - .addBoolean(ExternalTweetFeature.HAS_DIFF_LANG, data.hasDifferentLang) - .addBoolean(ExternalTweetFeature.HAS_ENGLISH_TWEET_DIFF_UI_LANG, - data.hasEnglishTweetAndDifferentUILang) - .addBoolean(ExternalTweetFeature.HAS_ENGLISH_UI_DIFF_TWEET_LANG, - data.hasEnglishUIAndDifferentTweetLang) - .addDouble(ExternalTweetFeature.SEARCHER_LANG_SCORE, data.userLangMult) - .addDouble(ExternalTweetFeature.QUERY_SPECIFIC_SCORE, data.querySpecificScore) - .addDouble(ExternalTweetFeature.AUTHOR_SPECIFIC_SCORE, data.authorSpecificScore); - } - - /** - * Adds default values for most numeric features that do not have a value set yet in the given - * ThriftSearchResultFeatures instance. - * - * This method is needed because some models do not work properly with missing features. Instead, - * they expect all features to be present even if they are unset (their values are 0). - */ - protected void setDefaultFeatureValues(ThriftSearchResultFeatures features) { - for (Map.Entry entry - : getSchema().getSearchFeatureSchema().getEntries().entrySet()) { - int featureId = entry.getKey(); - ThriftSearchFeatureSchemaEntry schemaEntry = entry.getValue(); - if (shouldSetDefaultValueForFeature(schemaEntry.getFeatureType(), featureId)) { - switch (schemaEntry.getFeatureType()) { - case INT32_VALUE: - features.getIntValues().putIfAbsent(featureId, 0); - break; - case LONG_VALUE: - features.getLongValues().putIfAbsent(featureId, 0L); - break; - case DOUBLE_VALUE: - features.getDoubleValues().putIfAbsent(featureId, 0.0); - break; - default: - throw new IllegalArgumentException( - "Should set default values only for integer, long or double features. 
Instead, " - + "found feature " + featureId + " of type " + schemaEntry.getFeatureType()); - } - } - } - } - - protected void overrideFeatureValues(ThriftSearchResultFeatures features, - ThriftSearchResultFeatures overrideFeatures) { - LOG.info("Features before override {}", features); - if (overrideFeatures.isSetIntValues()) { - overrideFeatures.getIntValues().forEach(features::putToIntValues); - } - if (overrideFeatures.isSetLongValues()) { - overrideFeatures.getLongValues().forEach(features::putToLongValues); - } - if (overrideFeatures.isSetDoubleValues()) { - overrideFeatures.getDoubleValues().forEach(features::putToDoubleValues); - } - if (overrideFeatures.isSetBoolValues()) { - overrideFeatures.getBoolValues().forEach(features::putToBoolValues); - } - if (overrideFeatures.isSetStringValues()) { - overrideFeatures.getStringValues().forEach(features::putToStringValues); - } - if (overrideFeatures.isSetBytesValues()) { - overrideFeatures.getBytesValues().forEach(features::putToBytesValues); - } - if (overrideFeatures.isSetFeatureStoreDiscreteValues()) { - overrideFeatures.getFeatureStoreDiscreteValues().forEach( - features::putToFeatureStoreDiscreteValues); - } - if (overrideFeatures.isSetSparseBinaryValues()) { - overrideFeatures.getSparseBinaryValues().forEach(features::putToSparseBinaryValues); - } - if (overrideFeatures.isSetSparseContinuousValues()) { - overrideFeatures.getSparseContinuousValues().forEach(features::putToSparseContinuousValues); - } - if (overrideFeatures.isSetGeneralTensorValues()) { - overrideFeatures.getGeneralTensorValues().forEach(features::putToGeneralTensorValues); - } - if (overrideFeatures.isSetStringTensorValues()) { - overrideFeatures.getStringTensorValues().forEach(features::putToStringTensorValues); - } - LOG.info("Features after override {}", features); - } - - /** - * Check if a feature is eligible to have its default value automatically set when absent. - * We have a similar logic for building data record. 
- */ - private static boolean shouldSetDefaultValueForFeature( - ThriftSearchFeatureType type, int featureId) { - return ALLOWED_TYPES_FOR_DEFAULT_FEATURE_VALUES.contains(type) - && !NUMERIC_FEATURES_FOR_WHICH_DEFAULTS_SHOULD_NOT_BE_SET.contains(featureId) - && (ExternalTweetFeature.EARLYBIRD_INDEXED_FEATURE_IDS.contains(featureId) - || ExternalTweetFeature.EARLYBIRD_DERIVED_FEATURE_IDS.contains(featureId)); - } - - @Override - public void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - if (relevanceStats == null) { - return; - } - - LinearScoringData data = getScoringDataForCurrentDocument(); - - if (data.tweetAgeInSeconds > relevanceStats.getOldestScoredTweetAgeInSeconds()) { - relevanceStats.setOldestScoredTweetAgeInSeconds(data.tweetAgeInSeconds); - } - relevanceStats.setNumScored(relevanceStats.getNumScored() + 1); - if (data.scoreReturned == SKIP_HIT) { - relevanceStats.setNumSkipped(relevanceStats.getNumSkipped() + 1); - switch(data.skipReason) { - case ANTIGAMING: - relevanceStats.setNumSkippedForAntiGaming( - relevanceStats.getNumSkippedForAntiGaming() + 1); - break; - case LOW_REPUTATION: - relevanceStats.setNumSkippedForLowReputation( - relevanceStats.getNumSkippedForLowReputation() + 1); - break; - case LOW_TEXT_SCORE: - relevanceStats.setNumSkippedForLowTextScore( - relevanceStats.getNumSkippedForLowTextScore() + 1); - break; - case SOCIAL_FILTER: - relevanceStats.setNumSkippedForSocialFilter( - relevanceStats.getNumSkippedForSocialFilter() + 1); - break; - case LOW_FINAL_SCORE: - relevanceStats.setNumSkippedForLowFinalScore( - relevanceStats.getNumSkippedForLowFinalScore() + 1); - break; - case LOW_RETWEET_COUNT: - break; - default: - LOG.warn("Unknown SkipReason: " + data.skipReason); - } - } - - if (data.isFollow) { - relevanceStats.setNumFromDirectFollows(relevanceStats.getNumFromDirectFollows() + 1); - } - if (data.isTrusted) { - relevanceStats.setNumFromTrustedCircle(relevanceStats.getNumFromTrustedCircle() + 1); - } - if (data.isReply) { - relevanceStats.setNumReplies(relevanceStats.getNumReplies() + 1); - if (data.isTrusted) { - relevanceStats.setNumRepliesTrusted(relevanceStats.getNumRepliesTrusted() + 1); - } else if (!data.isFollow) { - relevanceStats.setNumRepliesOutOfNetwork(relevanceStats.getNumRepliesOutOfNetwork() + 1); - } - } - if (data.isSelfTweet) { - relevanceStats.setNumSelfTweets(relevanceStats.getNumSelfTweets() + 1); - } - if (data.hasImageUrl || data.hasVideoUrl) { - relevanceStats.setNumWithMedia(relevanceStats.getNumWithMedia() + 1); - } - if (data.hasNewsUrl) { - relevanceStats.setNumWithNews(relevanceStats.getNumWithNews() + 1); - } - if (data.isUserSpam) { - relevanceStats.setNumSpamUser(relevanceStats.getNumSpamUser() + 1); - } - if (data.isUserNSFW) { - relevanceStats.setNumOffensive(relevanceStats.getNumOffensive() + 1); - } - if (data.isUserBot) { - relevanceStats.setNumBot(relevanceStats.getNumBot() + 1); - } - } - - @VisibleForTesting - static final class ExplanationWrapper { - private Explanation explanation; - - public Explanation getExplanation() { - return explanation; - } - - @Override - public String toString() { - return explanation.toString(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/LegacyScoreAccumulator.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/LegacyScoreAccumulator.java deleted file mode 100644 index bbe79cf84..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/LegacyScoreAccumulator.java +++ 
/dev/null @@ -1,98 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import com.twitter.search.common.util.ml.prediction_engine.BaseLegacyScoreAccumulator; -import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures; - -/** - * Legacy score accumulator in Earlybird with specific features added. - * This class is created to avoid adding LinearScoringData as a dependency to search's common ML - * library. - * - * @deprecated This class is retired and we suggest to switch to SchemaBasedScoreAccumulator. - */ -@Deprecated -public class LegacyScoreAccumulator extends BaseLegacyScoreAccumulator { - /** - * Constructs with a model and LinearScoringData - */ - LegacyScoreAccumulator(LightweightLinearModel model) { - super(model); - } - - /** - * Update the accumulator score with features, after this function the score should already - * be computed. - * - * @deprecated This function is retired and we suggest to switch to updateScoresWithFeatures in - * SchemaBasedScoreAccumulator. - */ - @Override - @Deprecated - protected void updateScoreWithFeatures(LinearScoringData data) { - addContinuousFeature(TweetScoringFeatures.LUCENE_SCORE, data.luceneScore); - addContinuousFeature(TweetScoringFeatures.TEXT_SCORE, data.textScore); - addContinuousFeature(TweetScoringFeatures.TWEET_AGE_IN_SECONDS, data.tweetAgeInSeconds); - addContinuousFeature(TweetScoringFeatures.REPLY_COUNT, data.replyCountPostLog2); - addContinuousFeature(TweetScoringFeatures.RETWEET_COUNT, data.retweetCountPostLog2); - addContinuousFeature(TweetScoringFeatures.FAV_COUNT, data.favCountPostLog2); - addContinuousFeature(TweetScoringFeatures.REPLY_COUNT_V2, data.replyCountV2); - addContinuousFeature(TweetScoringFeatures.RETWEET_COUNT_V2, data.retweetCountV2); - addContinuousFeature(TweetScoringFeatures.FAV_COUNT_V2, data.favCountV2); - addContinuousFeature(TweetScoringFeatures.EMBEDS_IMPRESSION_COUNT, - data.getEmbedsImpressionCount(false)); - addContinuousFeature(TweetScoringFeatures.EMBEDS_URL_COUNT, data.getEmbedsUrlCount(false)); - addContinuousFeature(TweetScoringFeatures.VIDEO_VIEW_COUNT, data.getVideoViewCount(false)); - addContinuousFeature(TweetScoringFeatures.QUOTED_COUNT, data.quotedCount); - addContinuousFeature(TweetScoringFeatures.WEIGHTED_RETWEET_COUNT, data.weightedRetweetCount); - addContinuousFeature(TweetScoringFeatures.WEIGHTED_REPLY_COUNT, data.weightedReplyCount); - addContinuousFeature(TweetScoringFeatures.WEIGHTED_FAV_COUNT, data.weightedFavCount); - addContinuousFeature(TweetScoringFeatures.WEIGHTED_QUOTE_COUNT, data.weightedQuoteCount); - addBinaryFeature(TweetScoringFeatures.HAS_URL, data.hasUrl); - addBinaryFeature(TweetScoringFeatures.HAS_CARD, data.hasCard); - addBinaryFeature(TweetScoringFeatures.HAS_VINE, data.hasVine); - addBinaryFeature(TweetScoringFeatures.HAS_PERISCOPE, data.hasPeriscope); - addBinaryFeature(TweetScoringFeatures.HAS_NATIVE_IMAGE, data.hasNativeImage); - addBinaryFeature(TweetScoringFeatures.HAS_IMAGE_URL, data.hasImageUrl); - addBinaryFeature(TweetScoringFeatures.HAS_NEWS_URL, data.hasNewsUrl); - addBinaryFeature(TweetScoringFeatures.HAS_VIDEO_URL, data.hasVideoUrl); - addBinaryFeature(TweetScoringFeatures.HAS_CONSUMER_VIDEO, data.hasConsumerVideo); - addBinaryFeature(TweetScoringFeatures.HAS_PRO_VIDEO, data.hasProVideo); - addBinaryFeature(TweetScoringFeatures.HAS_QUOTE, data.hasQuote); - 
addBinaryFeature(TweetScoringFeatures.HAS_TREND, data.hasTrend); - addBinaryFeature(TweetScoringFeatures.HAS_MULTIPLE_HASHTAGS_OR_TRENDS, - data.hasMultipleHashtagsOrTrends); - addBinaryFeature(TweetScoringFeatures.IS_OFFENSIVE, data.isOffensive); - addBinaryFeature(TweetScoringFeatures.IS_REPLY, data.isReply); - addBinaryFeature(TweetScoringFeatures.IS_RETWEET, data.isRetweet); - addBinaryFeature(TweetScoringFeatures.IS_SELF_TWEET, data.isSelfTweet); - addBinaryFeature(TweetScoringFeatures.IS_FOLLOW_RETWEET, data.isRetweet & data.isFollow); - addBinaryFeature(TweetScoringFeatures.IS_TRUSTED_RETWEET, data.isRetweet & data.isTrusted); - addContinuousFeature(TweetScoringFeatures.QUERY_SPECIFIC_SCORE, data.querySpecificScore); - addContinuousFeature(TweetScoringFeatures.AUTHOR_SPECIFIC_SCORE, data.authorSpecificScore); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_FOLLOW, data.isFollow); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_TRUSTED, data.isTrusted); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_VERIFIED, data.isFromVerifiedAccount); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_NSFW, data.isUserNSFW); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_SPAM, data.isUserSpam); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_BOT, data.isUserBot); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_ANTISOCIAL, data.isUserAntiSocial); - addContinuousFeature(TweetScoringFeatures.AUTHOR_REPUTATION, data.userRep); - addContinuousFeature(TweetScoringFeatures.SEARCHER_LANG_SCORE, data.userLangMult); - addBinaryFeature(TweetScoringFeatures.HAS_DIFFERENT_LANG, data.hasDifferentLang); - addBinaryFeature(TweetScoringFeatures.HAS_ENGLISH_TWEET_AND_DIFFERENT_UI_LANG, - data.hasEnglishTweetAndDifferentUILang); - addBinaryFeature(TweetScoringFeatures.HAS_ENGLISH_UI_AND_DIFFERENT_TWEET_LANG, - data.hasEnglishUIAndDifferentTweetLang); - addBinaryFeature(TweetScoringFeatures.IS_SENSITIVE_CONTENT, data.isSensitiveContent); - addBinaryFeature(TweetScoringFeatures.HAS_MULTIPLE_MEDIA, data.hasMultipleMediaFlag); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_PROFILE_EGG, data.profileIsEggFlag); - addBinaryFeature(TweetScoringFeatures.AUTHOR_IS_NEW, data.isUserNewFlag); - addContinuousFeature(TweetScoringFeatures.MENTIONS_COUNT, data.numMentions); - addContinuousFeature(TweetScoringFeatures.HASHTAGS_COUNT, data.numHashtags); - addContinuousFeature(TweetScoringFeatures.LINK_LANGUAGE_ID, data.linkLanguage); - addContinuousFeature(TweetScoringFeatures.LANGUAGE_ID, data.tweetLangId); - addBinaryFeature(TweetScoringFeatures.HAS_VISIBLE_LINK, data.hasVisibleLink); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/LinearScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/LinearScoringFunction.java deleted file mode 100644 index 770f4f49b..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/LinearScoringFunction.java +++ /dev/null @@ -1,237 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.util.List; - -import com.google.common.collect.Lists; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.relevance.features.MutableFeatureNormalizers; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import 
com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.search.relevance.LinearScoringParams; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; - -/** - * Scoring function that uses the weights and boosts provided in the scoring parameters from the - * request. - */ -public class LinearScoringFunction extends FeatureBasedScoringFunction { - private static final double BASE_SCORE = 0.0001; - - public LinearScoringFunction( - ImmutableSchemaInterface schema, - ThriftSearchQuery searchQuery, - AntiGamingFilter antiGamingFilter, - ThriftSearchResultType searchResultType, - UserTable userTable) throws IOException { - super("LinearScoringFunction", schema, searchQuery, antiGamingFilter, searchResultType, - userTable); - } - - @Override - protected double computeScore(LinearScoringData data, boolean forExplanation) throws IOException { - double score = BASE_SCORE; - - data.luceneContrib = params.useLuceneScoreAsBoost - ? 0.0 : params.luceneWeight * data.luceneScore; - - data.reputationContrib = params.reputationWeight * data.userRep; - data.textScoreContrib = params.textScoreWeight * data.textScore; - data.parusContrib = params.parusWeight * data.parusScore; - - // Contributions from engagement counters. Note that we pass "true" to all getters, - // which means all values get scaled down for scoring, since they are unbounded in raw form. - data.retweetContrib = params.retweetWeight * data.retweetCountPostLog2; - data.favContrib = params.favWeight * data.favCountPostLog2; - data.replyContrib = params.replyWeight * data.replyCountPostLog2; - data.embedsImpressionContrib = - params.embedsImpressionWeight * data.getEmbedsImpressionCount(true); - data.embedsUrlContrib = - params.embedsUrlWeight * data.getEmbedsUrlCount(true); - data.videoViewContrib = - params.videoViewWeight * data.getVideoViewCount(true); - data.quotedContrib = - params.quotedCountWeight * data.quotedCount; - - for (int i = 0; i < LinearScoringData.MAX_OFFLINE_EXPERIMENTAL_FIELDS; i++) { - data.offlineExpFeatureContributions[i] = - params.rankingOfflineExpWeights[i] * data.offlineExpFeatureValues[i]; - } - - data.hasUrlContrib = params.urlWeight * (data.hasUrl ? 1.0 : 0.0); - data.isReplyContrib = params.isReplyWeight * (data.isReply ? 1.0 : 0.0); - data.isFollowRetweetContrib = - params.followRetweetWeight * (data.isRetweet && data.isFollow ? 1.0 : 0.0); - data.isTrustedRetweetContrib = - params.trustedRetweetWeight * (data.isRetweet && data.isTrusted ? 1.0 : 0.0); - double replyCountOriginal = getUnscaledReplyCountFeatureValue(); - data.multipleReplyContrib = params.multipleReplyWeight - * (replyCountOriginal < params.multipleReplyMinVal ? 0.0 : replyCountOriginal); - - // We directly use the query specific score as the contribution below, as it doesn't need a weight - // for the contribution computation. 
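// Illustrative sketch (hypothetical weights and feature values, not taken from any real config):
// the linear score is essentially
//   score = BASE_SCORE + sum_i(weight_i * feature_i) + querySpecificScore + authorSpecificScore
// e.g. with reputationWeight = 0.1, userRep = 50, favWeight = 0.5 and favCountPostLog2 = 4, the
// reputation and fav contributions alone would add 0.1 * 50 + 0.5 * 4 = 7.0 to the score.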
- score += data.luceneContrib - + data.reputationContrib - + data.textScoreContrib - + data.replyContrib - + data.multipleReplyContrib - + data.retweetContrib - + data.favContrib - + data.parusContrib - + data.embedsImpressionContrib - + data.embedsUrlContrib - + data.videoViewContrib - + data.quotedContrib - + data.hasUrlContrib - + data.isReplyContrib - + data.isFollowRetweetContrib - + data.isTrustedRetweetContrib - + data.querySpecificScore - + data.authorSpecificScore; - - for (int i = 0; i < LinearScoringData.MAX_OFFLINE_EXPERIMENTAL_FIELDS; i++) { - score += data.offlineExpFeatureContributions[i]; - } - - return score; - } - - /** - * Generates the explanation for the linear score. - */ - @Override - protected void generateExplanationForScoring( - LinearScoringData scoringData, boolean isHit, List details) throws IOException { - // 1. Linear components - final List linearDetails = Lists.newArrayList(); - addLinearElementExplanation( - linearDetails, "[LuceneQueryScore]", - params.luceneWeight, scoringData.luceneScore, scoringData.luceneContrib); - if (scoringData.hasCard) { - if (scoringData.cardAuthorMatchBoostApplied) { - linearDetails.add(Explanation.match( - (float) params.cardAuthorMatchBoosts[scoringData.cardType], - "[x] card author match boost")); - } - if (scoringData.cardDescriptionMatchBoostApplied) { - linearDetails.add(Explanation.match( - (float) params.cardDescriptionMatchBoosts[scoringData.cardType], - "[x] card description match boost")); - } - if (scoringData.cardDomainMatchBoostApplied) { - linearDetails.add(Explanation.match( - (float) params.cardDomainMatchBoosts[scoringData.cardType], - "[x] card domain match boost")); - } - if (scoringData.cardTitleMatchBoostApplied) { - linearDetails.add(Explanation.match( - (float) params.cardTitleMatchBoosts[scoringData.cardType], - "[x] card title match boost")); - } - } - addLinearElementExplanation( - linearDetails, "reputation", - params.reputationWeight, scoringData.userRep, scoringData.reputationContrib); - addLinearElementExplanation( - linearDetails, "text score", - params.textScoreWeight, scoringData.textScore, scoringData.textScoreContrib); - addLinearElementExplanation( - linearDetails, "reply count (log2)", - params.replyWeight, scoringData.replyCountPostLog2, scoringData.replyContrib); - addLinearElementExplanation( - linearDetails, "multi reply", - params.multipleReplyWeight, - getUnscaledReplyCountFeatureValue() > params.multipleReplyMinVal ? 
1 : 0, - scoringData.multipleReplyContrib); - addLinearElementExplanation( - linearDetails, "retweet count (log2)", - params.retweetWeight, scoringData.retweetCountPostLog2, scoringData.retweetContrib); - addLinearElementExplanation( - linearDetails, "fav count (log2)", - params.favWeight, scoringData.favCountPostLog2, scoringData.favContrib); - addLinearElementExplanation( - linearDetails, "parus score", - params.parusWeight, scoringData.parusScore, scoringData.parusContrib); - for (int i = 0; i < LinearScoringData.MAX_OFFLINE_EXPERIMENTAL_FIELDS; i++) { - if (params.rankingOfflineExpWeights[i] != LinearScoringParams.DEFAULT_FEATURE_WEIGHT) { - addLinearElementExplanation(linearDetails, - "ranking exp score offline experimental #" + i, - params.rankingOfflineExpWeights[i], scoringData.offlineExpFeatureValues[i], - scoringData.offlineExpFeatureContributions[i]); - } - } - addLinearElementExplanation(linearDetails, - "embedded tweet impression count", - params.embedsImpressionWeight, scoringData.getEmbedsImpressionCount(false), - scoringData.embedsImpressionContrib); - addLinearElementExplanation(linearDetails, - "embedded tweet url count", - params.embedsUrlWeight, scoringData.getEmbedsUrlCount(false), - scoringData.embedsUrlContrib); - addLinearElementExplanation(linearDetails, - "video view count", - params.videoViewWeight, scoringData.getVideoViewCount(false), - scoringData.videoViewContrib); - addLinearElementExplanation(linearDetails, - "quoted count", - params.quotedCountWeight, scoringData.quotedCount, scoringData.quotedContrib); - - addLinearElementExplanation( - linearDetails, "has url", params.urlWeight, scoringData.hasUrl ? 1.0 : 0.0, - scoringData.hasUrlContrib); - - addLinearElementExplanation( - linearDetails, "is reply", params.isReplyWeight, - scoringData.isReply ? 1.0 : 0.0, scoringData.isReplyContrib); - addLinearElementExplanation( - linearDetails, "is follow retweet", params.followRetweetWeight, - scoringData.isRetweet && scoringData.isFollow ? 1.0 : 0.0, - scoringData.isFollowRetweetContrib); - addLinearElementExplanation( - linearDetails, "is trusted retweet", params.trustedRetweetWeight, - scoringData.isRetweet && scoringData.isTrusted ? 1.0 : 0.0, - scoringData.isTrustedRetweetContrib); - - if (scoringData.querySpecificScore != 0.0) { - linearDetails.add(Explanation.match((float) scoringData.querySpecificScore, - "[+] query specific score adjustment")); - } - if (scoringData.authorSpecificScore != 0.0) { - linearDetails.add(Explanation.match((float) scoringData.authorSpecificScore, - "[+] author specific score adjustment")); - } - - - Explanation linearCombo = isHit - ? 
Explanation.match((float) scoringData.scoreBeforeBoost, - "(MATCH) Linear components, sum of:", linearDetails) - : Explanation.noMatch("Linear components, sum of:", linearDetails); - - - details.add(linearCombo); - } - - private void addLinearElementExplanation(List explanation, - String name, - double weight, - double componentValue, - double contrib) { - if (contrib == 0.0) { - return; - } - explanation.add( - Explanation.match((float) contrib, - String.format("[+] %s=%.3f weight=%.3f", name, componentValue, weight))); - } - - private double getUnscaledReplyCountFeatureValue() throws IOException { - byte featureValue = (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.REPLY_COUNT); - return MutableFeatureNormalizers.BYTE_NORMALIZER.unnormLowerBound(featureValue); - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ModelBasedScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/ModelBasedScoringFunction.java deleted file mode 100644 index 179f684cd..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ModelBasedScoringFunction.java +++ /dev/null @@ -1,151 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.util.List; -import java.util.Map; - -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel; -import com.twitter.search.common.util.ml.prediction_engine.SchemaBasedScoreAccumulator; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.ClientException; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; - -/** - * Scoring function that uses the scoring models specified from the request. 
- */ -public class ModelBasedScoringFunction extends FeatureBasedScoringFunction { - private final SelectedModel[] selectedModels; - private final boolean useLogitScore; - private final boolean isSchemaBased; - - private static final SearchCounter NUM_LEGACY_MODELS = - SearchCounter.export("scoring_function_num_legacy_models"); - private static final SearchCounter NUM_SCHEMA_BASED_MODELS = - SearchCounter.export("scoring_function_num_schema_based_models"); - private static final SearchCounter MIXED_MODEL_TYPES = - SearchCounter.export("scoring_function_mixed_model_types"); - - public ModelBasedScoringFunction( - ImmutableSchemaInterface schema, - ThriftSearchQuery searchQuery, - AntiGamingFilter antiGamingFilter, - ThriftSearchResultType searchResultType, - UserTable userTable, - ScoringModelsManager scoringModelsManager - ) throws IOException, ClientException { - - super("ModelBasedScoringFunction", schema, searchQuery, antiGamingFilter, searchResultType, - userTable); - - ThriftRankingParams rankingParams = searchQuery.getRelevanceOptions().getRankingParams(); - Preconditions.checkNotNull(rankingParams); - - if (rankingParams.getSelectedModelsSize() <= 0) { - throw new ClientException("Scoring type is MODEL_BASED but no models were selected"); - } - - Map models = rankingParams.getSelectedModels(); - - selectedModels = new SelectedModel[models.size()]; - int numSchemaBased = 0; - int i = 0; - for (Map.Entry nameAndWeight : models.entrySet()) { - Optional model = - scoringModelsManager.getModel(nameAndWeight.getKey()); - if (!model.isPresent()) { - throw new ClientException(String.format( - "Scoring function is MODEL_BASED. Selected model '%s' not found", - nameAndWeight.getKey())); - } - selectedModels[i] = - new SelectedModel(nameAndWeight.getKey(), nameAndWeight.getValue(), model.get()); - - if (selectedModels[i].model.isSchemaBased()) { - ++numSchemaBased; - NUM_SCHEMA_BASED_MODELS.increment(); - } else { - NUM_LEGACY_MODELS.increment(); - } - ++i; - } - - // We should either see all models schema-based, or none of them so, if this is not the case, - // we log an error message and fall back to use just the first model, whatever it is. - if (numSchemaBased > 0 && numSchemaBased != selectedModels.length) { - MIXED_MODEL_TYPES.increment(); - throw new ClientException( - "You cannot mix schema-based and non-schema-based models in the same request, " - + "models are: " + models.keySet()); - } - - isSchemaBased = selectedModels[0].model.isSchemaBased(); - useLogitScore = rankingParams.isUseLogitScore(); - } - - @Override - protected double computeScore(LinearScoringData data, boolean forExplanation) throws IOException { - ThriftSearchResultFeatures features = - isSchemaBased ? createFeaturesForDocument(data, false).getFeatures() : null; - - double score = 0; - for (SelectedModel selectedModel : selectedModels) { - double modelScore = isSchemaBased - ? new SchemaBasedScoreAccumulator(selectedModel.model).scoreWith(features, useLogitScore) - : new LegacyScoreAccumulator(selectedModel.model).scoreWith(data, useLogitScore); - score += selectedModel.weight * modelScore; - } - - return score; - } - - @Override - protected void generateExplanationForScoring( - LinearScoringData scoringData, boolean isHit, List details) throws IOException { - boolean schemaBased = selectedModels[0].model.isSchemaBased(); - ThriftSearchResultFeatures features = - schemaBased ? createFeaturesForDocument(scoringData, false).getFeatures() : null; - - // 1. 
Model-based score - final List modelExplanations = Lists.newArrayList(); - float finalScore = 0; - for (SelectedModel selectedModel : selectedModels) { - double modelScore = schemaBased - ? new SchemaBasedScoreAccumulator(selectedModel.model).scoreWith(features, useLogitScore) - : new LegacyScoreAccumulator(selectedModel.model).scoreWith(scoringData, useLogitScore); - float weightedScore = (float) (selectedModel.weight * modelScore); - details.add(Explanation.match( - weightedScore, String.format("model=%s score=%.6f weight=%.3f useLogitScore=%s", - selectedModel.name, modelScore, selectedModel.weight, useLogitScore))); - finalScore += weightedScore; - } - - details.add(Explanation.match( - finalScore, String.format("Total model-based score (hit=%s)", isHit), modelExplanations)); - } - - private static final class SelectedModel { - public final String name; - public final double weight; - public final LightweightLinearModel model; - - private SelectedModel(String name, double weight, LightweightLinearModel model) { - this.name = name; - this.weight = weight; - this.model = model; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/RelevanceQuery.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/RelevanceQuery.java deleted file mode 100644 index b105f3490..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/RelevanceQuery.java +++ /dev/null @@ -1,164 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.util.Objects; -import java.util.Set; - -import javax.annotation.Nullable; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.Term; -import org.apache.lucene.search.Explanation; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.ScoreMode; -import org.apache.lucene.search.Weight; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.results.thriftjava.FieldHitAttribution; - -/** - * A wrapper for a Lucene query which first computes Lucene's query score - * and then delegates to a {@link ScoringFunction} for final score computation. - */ -public class RelevanceQuery extends Query { - private static final Logger LOG = LoggerFactory.getLogger(RelevanceQuery.class.getName()); - - protected final Query luceneQuery; - protected final ScoringFunction scoringFunction; - - // True when the lucene query's score should be ignored for debug explanations. 
- protected final boolean ignoreLuceneQueryScoreExplanation; - - public RelevanceQuery(Query luceneQuery, ScoringFunction scoringFunction) { - this(luceneQuery, scoringFunction, false); - } - - public RelevanceQuery(Query luceneQuery, - ScoringFunction scoringFunction, - boolean ignoreLuceneQueryScoreExplanation) { - this.luceneQuery = luceneQuery; - this.scoringFunction = scoringFunction; - this.ignoreLuceneQueryScoreExplanation = ignoreLuceneQueryScoreExplanation; - } - - public ScoringFunction getScoringFunction() { - return scoringFunction; - } - - public Query getLuceneQuery() { - return luceneQuery; - } - - @Override - public Query rewrite(IndexReader reader) throws IOException { - Query rewritten = luceneQuery.rewrite(reader); - if (rewritten == luceneQuery) { - return this; - } - return new RelevanceQuery(rewritten, scoringFunction, ignoreLuceneQueryScoreExplanation); - } - - @Override - public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) - throws IOException { - Weight luceneWeight = luceneQuery.createWeight(searcher, scoreMode, boost); - if (luceneWeight == null) { - return null; - } - return new RelevanceWeight(searcher, luceneWeight); - } - - public class RelevanceWeight extends Weight { - private final Weight luceneWeight; - - public RelevanceWeight(IndexSearcher searcher, Weight luceneWeight) { - super(RelevanceQuery.this); - this.luceneWeight = luceneWeight; - } - - @Override - public void extractTerms(Set terms) { - this.luceneWeight.extractTerms(terms); - } - - - @Override - public Explanation explain(LeafReaderContext context, int doc) throws IOException { - return explain(context, doc, null); - } - - /** - * Returns an explanation of the scoring for the given document. - * - * @param context The context of the reader that returned this document. - * @param doc The document. - * @param fieldHitAttribution Per-hit field attribution information. - * @return An explanation of the scoring for the given document. - */ - public Explanation explain(LeafReaderContext context, int doc, - @Nullable FieldHitAttribution fieldHitAttribution) throws IOException { - - Explanation luceneExplanation = Explanation.noMatch("LuceneQuery explain skipped"); - if (!ignoreLuceneQueryScoreExplanation) { - // get Lucene score - try { - luceneExplanation = luceneWeight.explain(context, doc); - } catch (Exception e) { - // We sometimes see exceptions resulting from term queries that do not store - // utf8-text, which TermQuery.toString() assumes. Catch here and allow at least - // scoring function explanations to be returned. 
- LOG.error("Exception in explain", e); - luceneExplanation = Explanation.noMatch("LuceneQuery explain failed"); - } - } - - Explanation scoringFunctionExplanation; - scoringFunction.setFieldHitAttribution(fieldHitAttribution); - scoringFunctionExplanation = scoringFunction.explain( - context.reader(), doc, luceneExplanation.getValue().floatValue()); - - // just add a wrapper for a better structure of the final explanation - Explanation luceneExplanationWrapper = Explanation.match( - luceneExplanation.getValue(), "LuceneQuery", luceneExplanation); - - return Explanation.match(scoringFunctionExplanation.getValue(), "RelevanceQuery", - scoringFunctionExplanation, luceneExplanationWrapper); - } - - @Override - public Scorer scorer(LeafReaderContext context) throws IOException { - return luceneWeight.scorer(context); - } - - @Override - public boolean isCacheable(LeafReaderContext ctx) { - return luceneWeight.isCacheable(ctx); - } - } - - @Override - public int hashCode() { - return (luceneQuery == null ? 0 : luceneQuery.hashCode()) - + (scoringFunction == null ? 0 : scoringFunction.hashCode()) * 13; - } - - @Override - public boolean equals(Object obj) { - if (!(obj instanceof RelevanceQuery)) { - return false; - } - - RelevanceQuery query = RelevanceQuery.class.cast(obj); - return Objects.equals(luceneQuery, query.luceneQuery) - && Objects.equals(scoringFunction, query.scoringFunction); - } - - @Override - public String toString(String field) { - return "RelevanceQuery[q=" + luceneQuery.toString(field) + "]"; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/RetweetBasedTopTweetsScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/RetweetBasedTopTweetsScoringFunction.java deleted file mode 100644 index deae8ba66..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/RetweetBasedTopTweetsScoringFunction.java +++ /dev/null @@ -1,165 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.relevance.features.MutableFeatureNormalizers; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/** - * A toptweets query cache index selection scoring function that is based purely on retweet counts. - * The goal of this scoring functon is to deprecate itweet score in entirety. - * - * Once all legacy itweet scores are drained from existing earlybird index, new parus score replaces - * existing itweet score position, then this class will be deprecated, a new scoring function - * using parus score shall replace this. - * - * this scoring function is only used in Query Cache for marking top tweets - * in the background. When searched, those tweets are still ranked with linear or model-based - * scoring function. 
- * - */ -public class RetweetBasedTopTweetsScoringFunction extends ScoringFunction { - private static final double DEFAULT_RECENCY_SCORE_FRACTION = 0.1; - private static final double DEFAULT_SIGMOID_APLHA = 0.008; - private static final int DEFAULT_RECENCY_CENTER_MINUTES = 1080; - - // if you update the default cut off, make sure you update the query cache filter in - // querycache.yml - // - // we know currently each time slice, each partition has about 10K entries in toptweets query - // cache. These are unique tweets. Looking at retweet updates, each time slice, each partition has - // about 650K unique tweets that received retweet. To create roughly similar number of entries in - // query cache, we need top 2% of such tweets, and that sets to min retweet count to 4. - // In this linear scoring function, we will rescale retweet count to [0, 1] range, - // with an input range of [0, 20]. Given the realtime factor's weight of 0.1, that give our - // minimal retweet score threshold to: 4/20 * 0.9 = 0.18. - // Testing on prod showed much higher volume due to the generous setting of max value of 20, - // (highest we have seen is 14). Adjusted to 0.21 which gave us similar volume. - private static final double DEFAULT_CUT_OFF_SCORE = 0.21; - - // Normalize retweet counts from [0, 20] range to [0, 1] range - private static final double MAX_RETWEET_COUNT = 20.0; - private static final double MIN_USER_REPUTATION = 40.0; // matches itweet system threshold - - /** - * The scores for the retweet based top tweets have to be in the [0, 1] interval. So we can't use - * SKIP_HIT as the lowest possible score, and instead have to use Float.MIN_VALUE. - * - * It's OK to use different values for these constants, because they do not interfere with each - * other. This constant is only used in RetweetBasedTopTweetsScoringFunction, which is only used - * to filter the hits for the [score_filter retweets minScore maxScore] operator. So the scores - * returned by RetweetBasedTopTweetsScoringFunction.score() do not have any impact on the final - * hit score. - * - * See EarlybirdLuceneQueryVisitor.visitScoredFilterOperator() and ScoreFilterQuery for more details. - */ - private static final float RETWEET_BASED_TOP_TWEETS_LOWEST_SCORE = Float.MIN_VALUE; - - private final double recencyScoreFraction; - private final double sigmoidAlpha; - private final double cutOffScore; - private final int recencyCenterMinutes; - private final double maxRecency; - - private final int currentTimeSeconds; - - private ThriftSearchResultMetadata metadata = null; - private double score; - private double retweetCount; - - public RetweetBasedTopTweetsScoringFunction(ImmutableSchemaInterface schema) { - this(schema, DEFAULT_RECENCY_SCORE_FRACTION, - DEFAULT_SIGMOID_APLHA, - DEFAULT_CUT_OFF_SCORE, - DEFAULT_RECENCY_CENTER_MINUTES); - } - - /** - * Creates a no decay scoring function (used by top archive). - * Otherwise same as default constructor. - * @param nodecay If no decay is set to true. Alpha is set to 0.0. - */ - public RetweetBasedTopTweetsScoringFunction(ImmutableSchemaInterface schema, boolean nodecay) { - this(schema, DEFAULT_RECENCY_SCORE_FRACTION, - nodecay ? 
0.0 : DEFAULT_SIGMOID_APLHA, - DEFAULT_CUT_OFF_SCORE, - DEFAULT_RECENCY_CENTER_MINUTES); - } - - public RetweetBasedTopTweetsScoringFunction(ImmutableSchemaInterface schema, - double recencyScoreFraction, double sigmoidAlpha, - double cutOffScore, int recencyCenterMinutes) { - super(schema); - this.recencyScoreFraction = recencyScoreFraction; - this.sigmoidAlpha = sigmoidAlpha; - this.cutOffScore = cutOffScore; - this.recencyCenterMinutes = recencyCenterMinutes; - this.maxRecency = computeSigmoid(0); - this.currentTimeSeconds = (int) (System.currentTimeMillis() / 1000); - } - - @Override - protected float score(float luceneQueryScore) throws IOException { - // Reset the data for each tweet!!! - metadata = null; - if (documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG) - || (documentFeatures.getFeatureValue(EarlybirdFieldConstant.USER_REPUTATION) - < MIN_USER_REPUTATION)) { - score = RETWEET_BASED_TOP_TWEETS_LOWEST_SCORE; - } else { - // Note that here we want the post log2 value, as the MAX_RETWEET_COUNT was actually - // set up for that. - retweetCount = MutableFeatureNormalizers.BYTE_NORMALIZER.unnormAndLog2( - (byte) documentFeatures.getFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT)); - final double recencyScore = computeTopTweetRecencyScore(); - - score = (retweetCount / MAX_RETWEET_COUNT) * (1 - recencyScoreFraction) - + recencyScoreFraction * recencyScore; - - if (score < this.cutOffScore) { - score = RETWEET_BASED_TOP_TWEETS_LOWEST_SCORE; - } - } - - return (float) score; - } - - private double computeSigmoid(double x) { - return 1.0f / (1.0f + Math.exp(sigmoidAlpha * (x - recencyCenterMinutes))); - } - - private double computeTopTweetRecencyScore() { - double diffMinutes = - Math.max(0, currentTimeSeconds - timeMapper.getTime(getCurrentDocID())) / 60.0; - return computeSigmoid(diffMinutes) / maxRecency; - } - - @Override - protected Explanation doExplain(float luceneScore) { - return null; - } - - @Override - public ThriftSearchResultMetadata getResultMetadata(ThriftSearchResultMetadataOptions options) { - if (metadata == null) { - metadata = new ThriftSearchResultMetadata() - .setResultType(ThriftSearchResultType.POPULAR) - .setPenguinVersion(EarlybirdConfig.getPenguinVersionByte()); - metadata.setRetweetCount((int) retweetCount); - metadata.setScore(score); - } - return metadata; - } - - @Override - public void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunction.java deleted file mode 100644 index c2b1a4deb..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunction.java +++ /dev/null @@ -1,213 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.util.List; - -import com.google.common.base.Preconditions; - -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.search.Explanation; - -import com.twitter.common.collections.Pair; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.relevance.features.EarlybirdDocumentFeatures; -import com.twitter.search.common.results.thriftjava.FieldHitAttribution; -import 
com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; -import com.twitter.search.queryparser.query.Query; - -/** - * Defines a ranking function which computes the score of a document that matches a query. - */ -public abstract class ScoringFunction { - /** - * Returned by a {@link #score(int, float)} to indicate that a hit should be scored below all. - * - * We have some equality tests like: - * "if (score == ScoringFunction.SKIP_HIT) {...}" (DefaultScoringFunction#updateRelevanceStats) - * We might also have double to float casts. - * - * Such castings seem to work with the equality test, but there might corner cases when casting - * this float value to a double (and back) might not work properly. - * - * If possible, we should choose a constant that is not in the valid score range. Then we can - * turn the float equality tests into Math.abs(...) < EPSILON tests. - */ - public static final float SKIP_HIT = -Float.MAX_VALUE; - - private final ImmutableSchemaInterface schema; - - // The current doc ID and the reader for the current segment should be private, because we don't - // want sub-classes to incorrectly update them. The doc ID should only be updated by the score() - // and explain() methods, and the reader should only be updated by the setNextReader() method. - private int currentDocID = -1; - - protected DocIDToTweetIDMapper tweetIDMapper = null; - protected TimeMapper timeMapper = null; - protected EarlybirdDocumentFeatures documentFeatures; - - protected int debugMode = 0; - protected HitAttributeHelper hitAttributeHelper; - protected Query query; - - protected FieldHitAttribution fieldHitAttribution; - - public ScoringFunction(ImmutableSchemaInterface schema) { - this.schema = Preconditions.checkNotNull(schema); - } - - protected ImmutableSchemaInterface getSchema() { - return schema; - } - - /** - * Updates the reader that will be used to retrieve the tweet IDs and creation times associated - * with scored doc IDs, as well as the values for various CSFs. Should be called every time the - * searcher starts searching in a new segment. 
- */ - public void setNextReader(EarlybirdIndexSegmentAtomicReader reader) throws IOException { - tweetIDMapper = reader.getSegmentData().getDocIDToTweetIDMapper(); - timeMapper = reader.getSegmentData().getTimeMapper(); - documentFeatures = new EarlybirdDocumentFeatures(reader); - initializeNextSegment(reader); - } - - public void setHitAttributeHelperAndQuery(HitAttributeHelper newHitAttributeHelper, - Query parsedQuery) { - this.hitAttributeHelper = newHitAttributeHelper; - this.query = parsedQuery; - } - - public void setFieldHitAttribution(FieldHitAttribution fieldHitAttribution) { - this.fieldHitAttribution = fieldHitAttribution; - } - - public void setDebugMode(int debugMode) { - this.debugMode = debugMode; - } - - /** - * Allow scoring functions to perform more per-segment-specific setup. - */ - protected void initializeNextSegment(EarlybirdIndexSegmentAtomicReader reader) - throws IOException { - // Noop by default - } - - // Updates the current document ID and advances all NumericDocValues to this doc ID. - private void setCurrentDocID(int currentDocID) throws IOException { - this.currentDocID = currentDocID; - documentFeatures.advance(currentDocID); - } - - /** - * Returns the current doc ID stored in this scoring function. - */ - public int getCurrentDocID() { - return currentDocID; - } - - /** - * Compute the score for the current hit. This is not expected to be thread safe. - * - * @param internalDocID internal id of the matching hit - * @param luceneQueryScore the score that lucene's text query computed for this hit - */ - public float score(int internalDocID, float luceneQueryScore) throws IOException { - setCurrentDocID(internalDocID); - return score(luceneQueryScore); - } - - /** - * Compute the score for the current hit. This is not expected to be thread safe. - * - * @param luceneQueryScore the score that lucene's text query computed for this hit - */ - protected abstract float score(float luceneQueryScore) throws IOException; - - /** Returns an explanation for the given hit. */ - public final Explanation explain(IndexReader reader, int internalDocID, float luceneScore) - throws IOException { - setNextReader((EarlybirdIndexSegmentAtomicReader) reader); - setCurrentDocID(internalDocID); - return doExplain(luceneScore); - } - - /** Returns an explanation for the current document. */ - protected abstract Explanation doExplain(float luceneScore) throws IOException; - - /** - * Returns the scoring metadata for the current doc ID. - */ - public ThriftSearchResultMetadata getResultMetadata(ThriftSearchResultMetadataOptions options) - throws IOException { - ThriftSearchResultMetadata metadata = new ThriftSearchResultMetadata(); - metadata.setResultType(ThriftSearchResultType.RELEVANCE); - metadata.setPenguinVersion(EarlybirdConfig.getPenguinVersionByte()); - metadata.setLanguage(ThriftLanguage.findByValue( - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.LANGUAGE))); - metadata.setSignature( - (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.TWEET_SIGNATURE)); - metadata.setIsNullcast(documentFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG)); - return metadata; - } - - /** - * Updates the given ThriftSearchResultsRelevanceStats instance based on the scoring metadata for - * the current doc ID. - */ - public abstract void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats); - - /** - * Score a list of hits. Not thread safe. 
- */ - public float[] batchScore(List hits) throws IOException { - throw new UnsupportedOperationException("This operation (batchScore) is not implemented!"); - } - - /** - * Collect the features and CSFs for the current document. Used for scoring and generating the - * returned metadata. - */ - public Pair collectFeatures( - float luceneQueryScore) throws IOException { - throw new UnsupportedOperationException("This operation (collectFeatures) is not implemented!"); - } - - /** - * Implement this function to populate the result metadata based on the given scoring data. - * Otherwise, this is a no-op. - * - * Scoring functions that implement this should also implement getScoringData(). - */ - public void populateResultMetadataBasedOnScoringData( - ThriftSearchResultMetadataOptions options, - ThriftSearchResultMetadata metadata, - LinearScoringData data) throws IOException { - // Make sure that the scoring data passed in is null because getScoringDataForCurrentDocument() - // returns null by default and if a subclass overrides one of these two methods, it should - // override both. - Preconditions.checkState(data == null, "LinearScoringData should be null"); - } - - /** - * This should only be called at hit collection time because it relies on the internal doc id. - * - * Scoring functions that implement this should also implement the function - * populateResultMetadataBasedOnScoringData(). - */ - public LinearScoringData getScoringDataForCurrentDocument() { - return null; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunctionProvider.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunctionProvider.java deleted file mode 100644 index 5e264e8f8..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/ScoringFunctionProvider.java +++ /dev/null @@ -1,216 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.query.HitAttributeHelper; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.ranking.thriftjava.ThriftScoringFunctionType; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.ClientException; -import com.twitter.search.earlybird.ml.ScoringModelsManager; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.queryparser.query.Query; - -/** - * Returns a scoring function for a particular experiment ID. - * - * Can be used for a/b testing of different scoring formulas. - */ -public abstract class ScoringFunctionProvider { - private static final Logger LOG = LoggerFactory.getLogger(ScoringFunctionProvider.class); - - /** - * Returns the scoring function. 
- */ - public abstract ScoringFunction getScoringFunction() throws IOException, ClientException; - - public static final String RETWEETS_SCORER_NAME = "retweets"; - public static final String NO_SPAM_SCORER_NAME = "no_spam"; - public static final String TEST_SCORER_NAME = "test"; - - // Whether to avoid time decay when scoring top tweets. - // Top archive does not need time decay. - private static final boolean TOP_TWEET_WITH_DECAY = - EarlybirdConfig.getBool("top_tweet_scoring_with_decay", true); - - /** - * Abstract class that can be used for ScoringFunctions that don't throw a ClientException. - * - * It does throw an IOException but it doesn't throw a ClientException so the name can be a bit - * misleading. - */ - public abstract static class NamedScoringFunctionProvider extends ScoringFunctionProvider { - /** - * Returns the scoring function. - */ - public abstract ScoringFunction getScoringFunction() throws IOException; - } - - /** - * Returns the scoring function provider with the given name, or null if no such provider exists. - */ - public static NamedScoringFunctionProvider getScoringFunctionProviderByName( - String name, final ImmutableSchemaInterface schema) { - if (name.equals(NO_SPAM_SCORER_NAME)) { - return new NamedScoringFunctionProvider() { - @Override - public ScoringFunction getScoringFunction() throws IOException { - return new SpamVectorScoringFunction(schema); - } - }; - } else if (name.equals(RETWEETS_SCORER_NAME)) { - return new NamedScoringFunctionProvider() { - @Override - public ScoringFunction getScoringFunction() throws IOException { - // Production top tweet actually uses this. - if (TOP_TWEET_WITH_DECAY) { - return new RetweetBasedTopTweetsScoringFunction(schema); - } else { - return new RetweetBasedTopTweetsScoringFunction(schema, true); - } - } - }; - } else if (name.equals(TEST_SCORER_NAME)) { - return new NamedScoringFunctionProvider() { - @Override - public ScoringFunction getScoringFunction() throws IOException { - return new TestScoringFunction(schema); - } - }; - } - return null; - } - - /** - * Returns default scoring functions for different scoring function type - * and provides fallback behavior if model-based scoring function fails - */ - public static class DefaultScoringFunctionProvider extends ScoringFunctionProvider { - private final EarlybirdRequest request; - private final ImmutableSchemaInterface schema; - private final ThriftSearchQuery searchQuery; - private final AntiGamingFilter antiGamingFilter; - private final UserTable userTable; - private final HitAttributeHelper hitAttributeHelper; - private final Query parsedQuery; - private final ScoringModelsManager scoringModelsManager; - private final TensorflowModelsManager tensorflowModelsManager; - - private static final SearchCounter MODEL_BASED_SCORING_FUNCTION_CREATED = - SearchCounter.export("model_based_scoring_function_created"); - private static final SearchCounter MODEL_BASED_FALLBACK_TO_LINEAR_SCORING_FUNCTION = - SearchCounter.export("model_based_fallback_to_linear_scoring_function"); - - private static final SearchCounter TENSORFLOW_BASED_SCORING_FUNCTION_CREATED = - SearchCounter.export("tensorflow_based_scoring_function_created"); - private static final SearchCounter TENSORFLOW_BASED_FALLBACK_TO_LINEAR_SCORING_FUNCTION = - SearchCounter.export("tensorflow_fallback_to_linear_function_scoring_function"); - - public DefaultScoringFunctionProvider( - final EarlybirdRequest request, - final ImmutableSchemaInterface schema, - final ThriftSearchQuery searchQuery, - final 
AntiGamingFilter antiGamingFilter, - final UserTable userTable, - final HitAttributeHelper hitAttributeHelper, - final Query parsedQuery, - final ScoringModelsManager scoringModelsManager, - final TensorflowModelsManager tensorflowModelsManager) { - this.request = request; - this.schema = schema; - this.searchQuery = searchQuery; - this.antiGamingFilter = antiGamingFilter; - this.userTable = userTable; - this.hitAttributeHelper = hitAttributeHelper; - this.parsedQuery = parsedQuery; - this.scoringModelsManager = scoringModelsManager; - this.tensorflowModelsManager = tensorflowModelsManager; - } - - @Override - public ScoringFunction getScoringFunction() throws IOException, ClientException { - if (searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isSetRankingParams()) { - ThriftRankingParams params = searchQuery.getRelevanceOptions().getRankingParams(); - ThriftScoringFunctionType type = params.isSetType() - ? params.getType() : ThriftScoringFunctionType.LINEAR; // default type - switch (type) { - case LINEAR: - return createLinear(); - case MODEL_BASED: - if (scoringModelsManager.isEnabled()) { - MODEL_BASED_SCORING_FUNCTION_CREATED.increment(); - return createModelBased(); - } else { - // From ScoringModelsManager.NO_OP_MANAGER. Fall back to LinearScoringFunction - MODEL_BASED_FALLBACK_TO_LINEAR_SCORING_FUNCTION.increment(); - return createLinear(); - } - case TENSORFLOW_BASED: - if (tensorflowModelsManager.isEnabled()) { - TENSORFLOW_BASED_SCORING_FUNCTION_CREATED.increment(); - return createTensorflowBased(); - } else { - // Fallback to linear scoring if tf manager is disabled - TENSORFLOW_BASED_FALLBACK_TO_LINEAR_SCORING_FUNCTION.increment(); - return createLinear(); - } - case TOPTWEETS: - return createTopTweets(); - default: - throw new IllegalArgumentException("Unknown scoring type: in " + searchQuery); - } - } else { - LOG.error("No relevance options provided query = " + searchQuery); - return new DefaultScoringFunction(schema); - } - } - - private ScoringFunction createLinear() throws IOException { - LinearScoringFunction scoringFunction = new LinearScoringFunction( - schema, searchQuery, antiGamingFilter, ThriftSearchResultType.RELEVANCE, - userTable); - scoringFunction.setHitAttributeHelperAndQuery(hitAttributeHelper, parsedQuery); - - return scoringFunction; - } - - /** - * For a model-based scoring function, a ClientException will be thrown if the client selects an - * unknown model for the scoring manager.
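As a usage sketch of the named providers and the dispatch above: a caller can look up a provider by name and must handle the unknown-name case, since the lookup returns null. This is a hedged fragment; `schema` is assumed to be an ImmutableSchemaInterface already in scope, and the hand-off to the collector is elided.

```java
// Hypothetical call site for the named scoring function providers.
ScoringFunctionProvider.NamedScoringFunctionProvider provider =
    ScoringFunctionProvider.getScoringFunctionProviderByName(
        ScoringFunctionProvider.RETWEETS_SCORER_NAME, schema);
if (provider == null) {
  throw new IllegalArgumentException("Unknown scorer name");
}
ScoringFunction scoringFunction = provider.getScoringFunction();  // may throw IOException
```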
- * {@link com.twitter.search.earlybird.search.relevance.scoring.ModelBasedScoringFunction} - */ - private ScoringFunction createModelBased() throws IOException, ClientException { - ModelBasedScoringFunction scoringFunction = new ModelBasedScoringFunction( - schema, searchQuery, antiGamingFilter, ThriftSearchResultType.RELEVANCE, userTable, - scoringModelsManager); - scoringFunction.setHitAttributeHelperAndQuery(hitAttributeHelper, parsedQuery); - - return scoringFunction; - } - - private ScoringFunction createTopTweets() throws IOException { - return new LinearScoringFunction( - schema, searchQuery, antiGamingFilter, ThriftSearchResultType.POPULAR, userTable); - } - - private TensorflowBasedScoringFunction createTensorflowBased() - throws IOException, ClientException { - TensorflowBasedScoringFunction tfScoringFunction = new TensorflowBasedScoringFunction( - request, schema, searchQuery, antiGamingFilter, - ThriftSearchResultType.RELEVANCE, userTable, tensorflowModelsManager); - tfScoringFunction.setHitAttributeHelperAndQuery(hitAttributeHelper, parsedQuery); - return tfScoringFunction; - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/SpamVectorScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/SpamVectorScoringFunction.java deleted file mode 100644 index 1d45ad642..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/SpamVectorScoringFunction.java +++ /dev/null @@ -1,85 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.relevance.features.RelevanceSignalConstants; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -public class SpamVectorScoringFunction extends ScoringFunction { - private static final int MIN_TWEEPCRED_WITH_LINK = - EarlybirdConfig.getInt("min_tweepcred_with_non_whitelisted_link", 25); - - // The engagement threshold that prevents us from filtering users with low tweepcred. - private static final int ENGAGEMENTS_NO_FILTER = 1; - - @VisibleForTesting - static final float NOT_SPAM_SCORE = 0.5f; - @VisibleForTesting - static final float SPAM_SCORE = -0.5f; - - public SpamVectorScoringFunction(ImmutableSchemaInterface schema) { - super(schema); - } - - @Override - protected float score(float luceneQueryScore) throws IOException { - if (documentFeatures.isFlagSet(EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG)) { - return NOT_SPAM_SCORE; - } - - int tweepCredThreshold = 0; - if (documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_LINK_FLAG) - && !documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_IMAGE_URL_FLAG) - && !documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_VIDEO_URL_FLAG) - && !documentFeatures.isFlagSet(EarlybirdFieldConstant.HAS_NEWS_URL_FLAG)) { - // Contains a non-media non-news link, definite spam vector. 
- tweepCredThreshold = MIN_TWEEPCRED_WITH_LINK; - } - - int tweepcred = (int) documentFeatures.getFeatureValue(EarlybirdFieldConstant.USER_REPUTATION); - - // For new user, tweepcred is set to a sentinel value of -128, specified at - // src/thrift/com/twitter/search/common/indexing/status.thrift - if (tweepcred >= tweepCredThreshold - || tweepcred == (int) RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL) { - return NOT_SPAM_SCORE; - } - - double retweetCount = - documentFeatures.getUnnormalizedFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT); - double replyCount = - documentFeatures.getUnnormalizedFeatureValue(EarlybirdFieldConstant.REPLY_COUNT); - double favoriteCount = - documentFeatures.getUnnormalizedFeatureValue(EarlybirdFieldConstant.FAVORITE_COUNT); - - // If the tweet has enough engagements, do not mark it as spam. - if (retweetCount + replyCount + favoriteCount >= ENGAGEMENTS_NO_FILTER) { - return NOT_SPAM_SCORE; - } - - return SPAM_SCORE; - } - - @Override - protected Explanation doExplain(float luceneScore) { - return null; - } - - @Override - public ThriftSearchResultMetadata getResultMetadata(ThriftSearchResultMetadataOptions options) { - return null; - } - - @Override - public void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/SparseTensor.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/SparseTensor.java deleted file mode 100644 index 67df06d95..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/SparseTensor.java +++ /dev/null @@ -1,87 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.nio.ByteBuffer; -import java.nio.ByteOrder; - -// Ideally, this part should live somewhere in the Cortex common -// code. Today, it is not possible to create -// a `SparseTensor` that relies only on ByteBuffer. -public class SparseTensor { - - private ByteBuffer sparseIndices; - private ByteBuffer sparseValues; - private ByteBuffer sparseShape; - - private int numDocs; - private final long[] sparseShapeShapeDimension = new long[] {2L}; - private final long inputBitSize = 1 << 63; - - private long numRecordsSeen = 0; - private final long numFeatures; - private int numValuesSeen; - - public SparseTensor(int numDocs, int numFeatures) { - this.numDocs = numDocs; - this.numFeatures = (long) numFeatures; - this.sparseValues = - ByteBuffer - .allocate(numFeatures * numDocs * Float.BYTES) - .order(ByteOrder.LITTLE_ENDIAN); - this.sparseIndices = - ByteBuffer - .allocate(2 * numFeatures * numDocs * Long.BYTES) - .order(ByteOrder.LITTLE_ENDIAN); - this.sparseShape = - ByteBuffer - .allocate(2 * Long.BYTES) - .order(ByteOrder.LITTLE_ENDIAN); - } - - public void incNumRecordsSeen() { - numRecordsSeen++; - } - - /** - * Adds the given value to this tensor. 
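The spam-vector decision rule implemented in SpamVectorScoringFunction.score() above can be summarized as a small standalone method, shown below with the same default thresholds. The class and method names here are illustrative only; this is a distillation for readability, not the production code path.

```java
// Standalone distillation of the spam-vector rule: verified authors and authors with enough
// reputation or engagement are never marked; only low-reputation, zero-engagement tweets are.
public final class SpamVectorRule {
  private static final int MIN_TWEEPCRED_WITH_LINK = 25;   // default for non-media, non-news links
  private static final int ENGAGEMENTS_NO_FILTER = 1;
  private static final int UNSET_REPUTATION_SENTINEL = -128;  // brand-new users

  static boolean isSpamVector(boolean verified, boolean hasNonMediaNonNewsLink,
                              int tweepcred, double engagements) {
    if (verified) {
      return false;                                   // verified accounts are never marked
    }
    int threshold = hasNonMediaNonNewsLink ? MIN_TWEEPCRED_WITH_LINK : 0;
    if (tweepcred >= threshold || tweepcred == UNSET_REPUTATION_SENTINEL) {
      return false;                                   // reputable enough, or reputation not set yet
    }
    return engagements < ENGAGEMENTS_NO_FILTER;       // low reputation and no engagement
  }

  public static void main(String[] args) {
    System.out.println(isSpamVector(false, true, 10, 0));  // true: low tweepcred, link, no engagement
    System.out.println(isSpamVector(false, true, 10, 3));  // false: enough engagements
  }
}
```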
- */ - public void addValue(long featureId, float value) { - sparseValues.putFloat(value); - sparseIndices.putLong(numRecordsSeen); - sparseIndices.putLong(featureId); - numValuesSeen++; - } - - public ByteBuffer getSparseValues() { - sparseValues.limit(numValuesSeen * Float.BYTES); - sparseValues.rewind(); - return sparseValues; - } - - public long[] getSparseValuesShape() { - return new long[] {numValuesSeen}; - } - - public long[] getSparseIndicesShape() { - return new long[] {numValuesSeen, 2L}; - } - - public long[] getSparseShapeShape() { - return sparseShapeShapeDimension; - } - - public ByteBuffer getSparseIndices() { - sparseIndices.limit(2 * numValuesSeen * Long.BYTES); - sparseIndices.rewind(); - return sparseIndices; - } - - /** - * Returns the sparse shape for this tensor. - */ - public ByteBuffer getSparseShape() { - sparseShape.putLong(numRecordsSeen); - sparseShape.putLong(inputBitSize); - sparseShape.rewind(); - return sparseShape; - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/TensorflowBasedScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/TensorflowBasedScoringFunction.java deleted file mode 100644 index 497f4bbc0..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/TensorflowBasedScoringFunction.java +++ /dev/null @@ -1,339 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import java.io.IOException; -import java.nio.FloatBuffer; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableMap; - -import org.apache.lucene.search.Explanation; -import org.tensorflow.Tensor; - -import com.twitter.common.collections.Pair; -import com.twitter.search.common.constants.thriftjava.ThriftQuerySource; -import com.twitter.search.common.features.EarlybirdRankingDerivedFeature; -import com.twitter.search.common.features.FeatureHandler; -import com.twitter.search.common.features.thrift.ThriftSearchResultFeatures; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.util.ml.tensorflow_engine.TensorflowModelsManager; -import com.twitter.search.earlybird.EarlybirdSearcher; -import com.twitter.search.earlybird.common.userupdates.UserTable; -import com.twitter.search.earlybird.exception.ClientException; -import com.twitter.search.earlybird.search.AntiGamingFilter; -import com.twitter.search.earlybird.search.relevance.LinearScoringData; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.modeling.common.TweetFeaturesUtils; -import com.twitter.tfcompute_java.TFModelRunner; - -/** - * TensorflowBasedScoringFunction relies on a TF model for scoring tweets - * Only the `batchScore` part is implemented - */ -public class TensorflowBasedScoringFunction extends FeatureBasedScoringFunction { - private final TFModelRunner tfModelRunner; - - // https://stackoverflow.com/questions/37849322/how-to-understand-the-term-tensor-in-tensorflow - // for more information on this notation - in short, a TF graph is made - // of TF operations and doesn't have a first order notion of tensors - // 
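A small usage sketch for the SparseTensor above, assuming the class is available as shown: values are added in document (row) order, incNumRecordsSeen() advances to the next row, and the getters then expose the COO-style buffers and shapes that get fed to TensorFlow. The feature ids and values below are made up.

```java
SparseTensor tensor = new SparseTensor(/* numDocs */ 2, /* numFeatures */ 3);
tensor.addValue(0L, 1.5f);    // doc 0, feature 0
tensor.addValue(2L, 0.25f);   // doc 0, feature 2
tensor.incNumRecordsSeen();   // move on to doc 1
tensor.addValue(1L, 4.0f);    // doc 1, feature 1
tensor.incNumRecordsSeen();

long[] valuesShape = tensor.getSparseValuesShape();    // {3}
long[] indicesShape = tensor.getSparseIndicesShape();  // {3, 2}: one [row, featureId] pair per value
java.nio.ByteBuffer denseShape = tensor.getSparseShape();  // {numRecordsSeen, inputBitSize}
```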
The notation <node_name>:<output_index> will map to the output of the - // <node_name> node contained in the TF graph. - private static final String INPUT_VALUES = "input_sparse_tensor_values:0"; - private static final String INPUT_INDICES = "input_sparse_tensor_indices:0"; - private static final String INPUT_SHAPE = "input_sparse_tensor_shape:0"; - private static final String OUTPUT_NODE = "output_scores:0"; - - private final Map<Integer, Integer> featureSchemaIdToMlApiId; - private final Map<Long, Float> tweetIdToScoreMap = new HashMap<>(); - private final EarlybirdRequest request; - - public TensorflowBasedScoringFunction( - EarlybirdRequest request, - ImmutableSchemaInterface schema, - ThriftSearchQuery searchQuery, - AntiGamingFilter antiGamingFilter, - ThriftSearchResultType searchResultType, - UserTable userTable, - TensorflowModelsManager tensorflowModelsManager - ) throws IOException, ClientException { - super( - "TensorflowBasedScoringFunction", - schema, - searchQuery, - antiGamingFilter, - searchResultType, - userTable - ); - this.request = request; - String modelName = searchQuery.getRelevanceOptions().getRankingParams().selectedTensorflowModel; - this.featureSchemaIdToMlApiId = tensorflowModelsManager.getFeatureSchemaIdToMlApiId(); - - if (modelName == null) { - throw new ClientException("Scoring type is TENSORFLOW_BASED but no model was selected"); - } else if (!tensorflowModelsManager.getModel(modelName).isPresent()) { - throw new ClientException( - "Scoring type is TENSORFLOW_BASED. Model " - + modelName - + " is not present." - ); - } - - if (searchQuery.getRelevanceOptions().getRankingParams().isEnableHitDemotion()) { - throw new ClientException( - "Hit attribute demotion is not supported with TENSORFLOW_BASED scoring type"); - } - - tfModelRunner = tensorflowModelsManager.getModel(modelName).get(); - } - - /** - * Single item scoring just returns the lucene score to be used during the batching phase.
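For the constructor checks above, a request reaches this scoring path through its relevance options. The sketch below is hedged: the model name is made up, `searchQuery` is assumed to be an existing ThriftSearchQuery, and the fluent setters are the standard Thrift-generated ones.

```java
ThriftRankingParams rankingParams = new ThriftRankingParams()
    .setType(ThriftScoringFunctionType.TENSORFLOW_BASED)
    .setSelectedTensorflowModel("example_tf_model");  // must be known to the TensorflowModelsManager
searchQuery.setRelevanceOptions(
    new ThriftSearchRelevanceOptions().setRankingParams(rankingParams));
// A null or unknown model name, or enabling hit demotion, makes the constructor above
// throw ClientException rather than silently falling back.
```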
- */ - @Override - protected float score(float luceneQueryScore) { - return luceneQueryScore; - } - - @Override - public Pair collectFeatures( - float luceneQueryScore) throws IOException { - LinearScoringData linearScoringData = updateLinearScoringData(luceneQueryScore); - ThriftSearchResultFeatures features = - createFeaturesForDocument(linearScoringData, true).getFeatures(); - - return new Pair<>(linearScoringData, features); - } - - @Override - protected FeatureHandler createFeaturesForDocument( - LinearScoringData linearScoringData, - boolean ignoreDefaultValues) throws IOException { - return super.createFeaturesForDocument(linearScoringData, - ignoreDefaultValues) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_TREND_CLICK, - request.querySource == ThriftQuerySource.TREND_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_TYPED_QUERY, - request.querySource == ThriftQuerySource.TYPED_QUERY) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_TYPEAHEAD_CLICK, - request.querySource == ThriftQuerySource.TYPEAHEAD_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_HASHTAG_CLICK, - request.querySource == ThriftQuerySource.RECENT_SEARCH_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_RECENT_SEARCH_CLICK, - request.querySource == ThriftQuerySource.RECENT_SEARCH_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_PROFILE_CLICK, - request.querySource == ThriftQuerySource.PROFILE_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_API_CALL, - request.querySource == ThriftQuerySource.API_CALL) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_PROMOTED_TREND_CLICK, - request.querySource == ThriftQuerySource.PROMOTED_TREND_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_SAVED_SEARCH_CLICK, - request.querySource == ThriftQuerySource.SAVED_SEARCH_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_CASHTAG_CLICK, - request.querySource == ThriftQuerySource.CASHTAG_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_SPELLING_EXPANSION_REVERT_CLICK, - request.querySource == ThriftQuerySource.SPELLING_EXPANSION_REVERT_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_SPELLING_SUGGESTION_CLICK, - request.querySource == ThriftQuerySource.SPELLING_SUGGESTION_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_LOGGED_OUT_HOME_TREND_CLICK, - request.querySource == ThriftQuerySource.LOGGED_OUT_HOME_TREND_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_RELATED_QUERY_CLICK, - request.querySource == ThriftQuerySource.RELATED_QUERY_CLICK) - .addBoolean(EarlybirdRankingDerivedFeature.QUERY_SOURCE_AUTO_SPELL_CORRECT_REVERT_CLICK, - request.querySource == ThriftQuerySource.AUTO_SPELL_CORRECT_REVERT_CLICK); - } - - /** - * Return scores computed in batchScore() if forExplanation is true. - */ - @Override - protected double computeScore(LinearScoringData data, boolean forExplanation) { - Preconditions.checkState(forExplanation, - "forExplanation is false. 
computeScore() should only be used for explanation creation"); - return tweetIdToScoreMap.get(tweetIDMapper.getTweetID(getCurrentDocID())); - } - - @Override - protected void generateExplanationForScoring( - LinearScoringData scoringData, boolean isHit, List details) { - } - - @VisibleForTesting - SparseTensor createInputTensor(ThriftSearchResultFeatures[] featuresForDocs) { - // Moving this across outside of the request path - // would reduce the allocation cost and make the `ByteBuffer`s - // long lived - would need one per thread. - SparseTensor sparseTensor = - new SparseTensor(featuresForDocs.length, featureSchemaIdToMlApiId.size()); - for (ThriftSearchResultFeatures features : featuresForDocs) { - updateSparseTensor(sparseTensor, features); - } - return sparseTensor; - } - - private void addSchemaBooleanFeatures(SparseTensor sparseTensor, - Map booleanMap) { - if (booleanMap == null || booleanMap.isEmpty()) { - return; - } - for (Map.Entry entry : booleanMap.entrySet()) { - Preconditions.checkState(featureSchemaIdToMlApiId.containsKey(entry.getKey())); - sparseTensor.addValue( - featureSchemaIdToMlApiId.get(entry.getKey()), entry.getValue() ? 1f : 0f); - } - } - - private void addSchemaContinuousFeatures(SparseTensor sparseTensor, - Map valueMap) { - if (valueMap == null || valueMap.isEmpty()) { - return; - } - for (Map.Entry entry : valueMap.entrySet()) { - Integer id = entry.getKey(); - // SEARCH-26795 - if (!TweetFeaturesUtils.isFeatureDiscrete(id)) { - Preconditions.checkState(featureSchemaIdToMlApiId.containsKey(id)); - sparseTensor.addValue( - featureSchemaIdToMlApiId.get(id), entry.getValue().floatValue()); - } - } - } - - private void updateSparseTensor(SparseTensor sparseTensor, ThriftSearchResultFeatures features) { - addSchemaBooleanFeatures(sparseTensor, features.getBoolValues()); - addSchemaContinuousFeatures(sparseTensor, features.getIntValues()); - addSchemaContinuousFeatures(sparseTensor, features.getLongValues()); - addSchemaContinuousFeatures(sparseTensor, features.getDoubleValues()); - - sparseTensor.incNumRecordsSeen(); - } - - private float[] batchScoreInternal(ThriftSearchResultFeatures[] featuresForDocs) { - int nbDocs = featuresForDocs.length; - float[] backingArrayResults = new float[nbDocs]; - SparseTensor sparseTensor = createInputTensor(featuresForDocs); - Tensor sparseValues = - Tensor.create( - Float.class, - sparseTensor.getSparseValuesShape(), - sparseTensor.getSparseValues()); - Tensor sparseIndices = - Tensor.create( - Long.class, - sparseTensor.getSparseIndicesShape(), - sparseTensor.getSparseIndices()); - Tensor sparseShape = - Tensor.create( - Long.class, - sparseTensor.getSparseShapeShape(), - sparseTensor.getSparseShape()); - Map> inputMap = ImmutableMap.of( - INPUT_VALUES, sparseValues, - INPUT_INDICES, sparseIndices, - INPUT_SHAPE, sparseShape - ); - List output = ImmutableList.of(OUTPUT_NODE); - - Map> outputs = tfModelRunner.run( - inputMap, - output, - ImmutableList.of() - ); - Tensor outputTensor = outputs.get(OUTPUT_NODE); - try { - FloatBuffer finalResultBuffer = - FloatBuffer.wrap(backingArrayResults, 0, nbDocs); - - outputTensor.writeTo(finalResultBuffer); - } finally { - // Close tensors to avoid memory leaks - sparseValues.close(); - sparseIndices.close(); - sparseShape.close(); - if (outputTensor != null) { - outputTensor.close(); - } - } - return backingArrayResults; - } - - /** - * Compute the score for a list of hits. Not thread safe. 
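The feed/run/cleanup sequence in batchScoreInternal() closes each Tensor manually in a finally block. An equivalent structure with try-with-resources (org.tensorflow.Tensor is AutoCloseable in the TF 1.x Java API) is sketched below; it reuses the fields and local names from the method above, assumes TFModelRunner.run returns a map from output name to Tensor as used there, and is illustrative rather than a drop-in replacement.

```java
float[] scores = new float[nbDocs];
try (Tensor<Float> values = Tensor.create(Float.class,
         sparseTensor.getSparseValuesShape(), sparseTensor.getSparseValues());
     Tensor<Long> indices = Tensor.create(Long.class,
         sparseTensor.getSparseIndicesShape(), sparseTensor.getSparseIndices());
     Tensor<Long> shape = Tensor.create(Long.class,
         sparseTensor.getSparseShapeShape(), sparseTensor.getSparseShape())) {
  // Feed the three pieces of the sparse input and fetch the score vector.
  Map<String, Tensor<?>> outputs = tfModelRunner.run(
      ImmutableMap.<String, Tensor<?>>of(
          INPUT_VALUES, values, INPUT_INDICES, indices, INPUT_SHAPE, shape),
      ImmutableList.of(OUTPUT_NODE),
      ImmutableList.of());
  try (Tensor<?> outputTensor = outputs.get(OUTPUT_NODE)) {
    outputTensor.writeTo(FloatBuffer.wrap(scores, 0, nbDocs));
  }
}
// Input and output tensors are released even if the model run throws.
```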
- * @return Array of scores - */ - @Override - public float[] batchScore(List hits) throws IOException { - ThriftSearchResultFeatures[] featuresForDocs = new ThriftSearchResultFeatures[hits.size()]; - - for (int i = 0; i < hits.size(); i++) { - // This is a gigantic allocation, but the models are trained to depend on unset values having - // a default. - BatchHit hit = hits.get(i); - ThriftSearchResultFeatures features = hit.getFeatures().deepCopy(); - - // Adjust features of a hit based on overrides provided by relevance options. Should mostly - // be used for debugging purposes. - adjustHitScoringFeatures(hit, features); - - setDefaultFeatureValues(features); - featuresForDocs[i] = features; - } - - float[] scores = batchScoreInternal(featuresForDocs); - float[] finalScores = new float[hits.size()]; - - for (int i = 0; i < hits.size(); i++) { - LinearScoringData data = hits.get(i).getScoringData(); - if (data.skipReason != null && data.skipReason != LinearScoringData.SkipReason.NOT_SKIPPED) { - // If the hit should be skipped, overwrite the score with SKIP_HIT - scores[i] = SKIP_HIT; - } - - // If explanations enabled, Add scores to map. Will be used in computeScore() - if (EarlybirdSearcher.explanationsEnabled(debugMode)) { - tweetIdToScoreMap.put(hits.get(i).getTweetID(), scores[i]); - } - - finalScores[i] = postScoreComputation( - data, - scores[i], - false, // cannot get the hit attribution info for this hit at this point in time - null); - } - return finalScores; - } - - private void adjustHitScoringFeatures(BatchHit hit, ThriftSearchResultFeatures features) { - - if (request.isSetSearchQuery() && request.getSearchQuery().isSetRelevanceOptions()) { - ThriftSearchRelevanceOptions relevanceOptions = - request.getSearchQuery().getRelevanceOptions(); - - if (relevanceOptions.isSetPerTweetFeaturesOverride() - && relevanceOptions.getPerTweetFeaturesOverride().containsKey(hit.getTweetID())) { - overrideFeatureValues( - features, - relevanceOptions.getPerTweetFeaturesOverride().get(hit.getTweetID())); - } - - if (relevanceOptions.isSetPerUserFeaturesOverride() - && relevanceOptions.getPerUserFeaturesOverride().containsKey( - hit.getScoringData().fromUserId)) { - overrideFeatureValues( - features, - relevanceOptions.getPerUserFeaturesOverride().get(hit.getScoringData().fromUserId)); - } - - if (relevanceOptions.isSetGlobalFeaturesOverride()) { - overrideFeatureValues( - features, relevanceOptions.getGlobalFeaturesOverride()); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird/search/relevance/scoring/TestScoringFunction.java b/src/java/com/twitter/search/earlybird/search/relevance/scoring/TestScoringFunction.java deleted file mode 100644 index 6e0c6a36f..000000000 --- a/src/java/com/twitter/search/earlybird/search/relevance/scoring/TestScoringFunction.java +++ /dev/null @@ -1,52 +0,0 @@ -package com.twitter.search.earlybird.search.relevance.scoring; - -import org.apache.lucene.search.Explanation; - -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions; -import com.twitter.search.earlybird.thrift.ThriftSearchResultType; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/** - * A dummy scoring function for test, the score is always tweetId/10000.0 - * Since score_filter: operator requires all score to be between [0, 
1], if you want to use this - * with it, don't use any tweet id larger than 10000 in your test. - */ -public class TestScoringFunction extends ScoringFunction { - private ThriftSearchResultMetadata metadata = null; - private float score; - - public TestScoringFunction(ImmutableSchemaInterface schema) { - super(schema); - } - - @Override - protected float score(float luceneQueryScore) { - long tweetId = tweetIDMapper.getTweetID(getCurrentDocID()); - this.score = (float) (tweetId / 10000.0); - System.out.println(String.format("score for tweet %10d is %6.3f", tweetId, score)); - return this.score; - } - - @Override - protected Explanation doExplain(float luceneScore) { - return null; - } - - @Override - public ThriftSearchResultMetadata getResultMetadata(ThriftSearchResultMetadataOptions options) { - if (metadata == null) { - metadata = new ThriftSearchResultMetadata() - .setResultType(ThriftSearchResultType.RELEVANCE) - .setPenguinVersion(EarlybirdConfig.getPenguinVersionByte()); - metadata.setScore(score); - } - return metadata; - } - - @Override - public void updateRelevanceStats(ThriftSearchResultsRelevanceStats relevanceStats) { - } -} diff --git a/src/java/com/twitter/search/earlybird/segment/DLSegmentDataProvider.java b/src/java/com/twitter/search/earlybird/segment/DLSegmentDataProvider.java deleted file mode 100644 index db3afc3cf..000000000 --- a/src/java/com/twitter/search/earlybird/segment/DLSegmentDataProvider.java +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.search.earlybird.segment; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; -import java.util.Set; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.util.io.dl.DLReaderWriterFactory; -import com.twitter.search.common.util.io.dl.SegmentDLUtil; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; - -/** - * An implementation of SegmentDataProvider using DistributedLog. - */ -public class DLSegmentDataProvider implements SegmentDataProvider { - private final int hashPartitionID; - private final DLReaderWriterFactory dlFactory; - private final SegmentDataReaderSet readerSet; - - public DLSegmentDataProvider( - int hashPartitionID, - EarlybirdIndexConfig earlybirdIndexConfig, - DLReaderWriterFactory dlReaderWriterFactory) throws IOException { - this(hashPartitionID, earlybirdIndexConfig, dlReaderWriterFactory, - Clock.SYSTEM_CLOCK); - } - - public DLSegmentDataProvider( - int hashPartitionID, - EarlybirdIndexConfig earlybirdIndexConfig, - DLReaderWriterFactory dlReaderWriterFactory, - Clock clock) throws IOException { - this.hashPartitionID = hashPartitionID; - this.dlFactory = dlReaderWriterFactory; - this.readerSet = new DLSegmentDataReaderSet( - dlFactory, - earlybirdIndexConfig, - clock); - } - - @Override - public SegmentDataReaderSet getSegmentDataReaderSet() { - return readerSet; - } - - @Override - public List newSegmentList() throws IOException { - Set segmentNames = SegmentDLUtil.getSegmentNames(dlFactory, null, hashPartitionID); - List segmentList = new ArrayList<>(segmentNames.size()); - for (String segmentName : segmentNames) { - Segment segment = Segment.fromSegmentName(segmentName, EarlybirdConfig.getMaxSegmentSize()); - segmentList.add(segment); - } - // Sort the segments by ID. 
- Collections.sort(segmentList); - return segmentList; - } -} diff --git a/src/java/com/twitter/search/earlybird/segment/DLSegmentDataReaderSet.java b/src/java/com/twitter/search/earlybird/segment/DLSegmentDataReaderSet.java deleted file mode 100644 index 88aa02a5c..000000000 --- a/src/java/com/twitter/search/earlybird/segment/DLSegmentDataReaderSet.java +++ /dev/null @@ -1,237 +0,0 @@ -package com.twitter.search.earlybird.segment; - -import java.io.IOException; -import java.util.HashMap; -import java.util.Map; -import java.util.Optional; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Function; -import com.google.common.base.Preconditions; - -import org.apache.thrift.TException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRequestStats; -import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentUtil; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.util.io.ReaderWithStatsFactory; -import com.twitter.search.common.util.io.TransformingRecordReader; -import com.twitter.search.common.util.io.dl.DLMultiStreamReader; -import com.twitter.search.common.util.io.dl.DLReaderWriterFactory; -import com.twitter.search.common.util.io.dl.DLTimestampedReaderFactory; -import com.twitter.search.common.util.io.dl.SegmentDLUtil; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.common.util.io.recordreader.RecordReaderFactory; -import com.twitter.search.common.util.thrift.ThriftUtils; -import com.twitter.search.earlybird.EarlybirdIndexConfig; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.document.DocumentFactory; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.partition.SegmentInfo; - -public class DLSegmentDataReaderSet implements SegmentDataReaderSet { - private static final Logger LOG = LoggerFactory.getLogger(DLSegmentDataReaderSet.class); - - public static final SearchRequestStats STATUS_DL_READ_STATS = - SearchRequestStats.export("status_dlreader", TimeUnit.MICROSECONDS, false); - private static final SearchRequestStats UPDATE_EVENT_DL_READ_STATS = - SearchRequestStats.export("update_events_dlreader", TimeUnit.MICROSECONDS, false); - // The number of tweets not indexed because they failed deserialization. 
- private static final SearchCounter STATUS_SKIPPED_DUE_TO_FAILED_DESERIALIZATION_COUNTER = - SearchCounter.export("statuses_skipped_due_to_failed_deserialization"); - - @VisibleForTesting - public static final int FRESH_READ_THRESHOLD = (int) TimeUnit.MINUTES.toMillis(1); - - private final int documentReadFreshnessThreshold = - EarlybirdConfig.getInt("documents_reader_freshness_threshold_millis", 10000); - private final int updateReadFreshnessThreshold = - EarlybirdConfig.getInt("updates_freshness_threshold_millis", FRESH_READ_THRESHOLD); - private final int dlReaderVersion = EarlybirdConfig.getInt("dl_reader_version"); - - private final DLReaderWriterFactory dlFactory; - private final RecordReaderFactory dlUpdateEventsFactory; - private final EarlybirdIndexConfig indexConfig; - private final Clock clock; - - private RecordReader documentReader; - - // RecordReaders for update events that span all live segments. - private final RecordReader updateEventsReader; - private final DLMultiStreamReader updateEventsMultiReader; - private final Map> updateEventReaders = new HashMap<>(); - - DLSegmentDataReaderSet( - DLReaderWriterFactory dlFactory, - final EarlybirdIndexConfig indexConfig, - Clock clock) throws IOException { - this.dlFactory = dlFactory; - this.indexConfig = indexConfig; - this.clock = clock; - - this.dlUpdateEventsFactory = new ReaderWithStatsFactory( - new DLTimestampedReaderFactory(dlFactory, clock, updateReadFreshnessThreshold), - UPDATE_EVENT_DL_READ_STATS); - this.updateEventsMultiReader = - new DLMultiStreamReader("update_events", dlUpdateEventsFactory, true, clock); - this.updateEventsReader = - new TransformingRecordReader<>(updateEventsMultiReader, record -> - (record != null) ? deserializeTVE(record.getBytes()) : null); - - SearchCustomGauge.export("open_dl_update_events_streams", updateEventReaders::size); - } - - private ThriftVersionedEvents deserializeTVE(byte[] bytes) { - ThriftVersionedEvents event = new ThriftVersionedEvents(); - try { - ThriftUtils.fromCompactBinaryFormat(bytes, event); - return event; - } catch (TException e) { - LOG.error("error deserializing TVE", e); - return null; - } - } - - @Override - public void attachDocumentReaders(SegmentInfo segmentInfo) throws IOException { - // Close any document reader left open before. 
- if (documentReader != null) { - LOG.warn("Previous documentReader not closed: {}", documentReader); - completeSegmentDocs(segmentInfo); - } - documentReader = newDocumentReader(segmentInfo); - } - - @Override - public void attachUpdateReaders(SegmentInfo segmentInfo) throws IOException { - if (updateEventsMultiReader == null) { - return; - } - - String segmentName = segmentInfo.getSegmentName(); - if (getUpdateEventsReaderForSegment(segmentInfo) != null) { - LOG.info("Update events reader for segment {} is already attached.", segmentName); - return; - } - - long updateEventStreamOffsetTimestamp = segmentInfo.getUpdatesStreamOffsetTimestamp(); - LOG.info("Attaching update events reader for segment {} with timestamp: {}.", - segmentName, updateEventStreamOffsetTimestamp); - - String topic = SegmentDLUtil.getDLTopicForUpdateEvents(segmentName, dlReaderVersion); - RecordReader recordReader = - dlUpdateEventsFactory.newRecordReaderForTimestamp(topic, updateEventStreamOffsetTimestamp); - updateEventsMultiReader.addRecordReader(recordReader, topic); - updateEventReaders.put(segmentInfo.getTimeSliceID(), - new TransformingRecordReader<>(recordReader, this::deserializeTVE)); - } - - @Override - public void stopAll() { - if (documentReader != null) { - documentReader.close(); - } - if (updateEventsReader != null) { - updateEventsReader.close(); - } - try { - dlFactory.close(); - } catch (IOException e) { - LOG.error("Exception while closing DL factory", e); - } - } - - @Override - public void completeSegmentDocs(SegmentInfo segmentInfo) { - if (documentReader != null) { - documentReader.close(); - documentReader = null; - } - } - - @Override - public void stopSegmentUpdates(SegmentInfo segmentInfo) { - if (updateEventsMultiReader != null) { - updateEventsMultiReader.removeStream( - SegmentDLUtil.getDLTopicForUpdateEvents(segmentInfo.getSegmentName(), dlReaderVersion)); - updateEventReaders.remove(segmentInfo.getTimeSliceID()); - } - } - - @Override - public RecordReader newDocumentReader(SegmentInfo segmentInfo) throws IOException { - String topic = SegmentDLUtil.getDLTopicForTweets(segmentInfo.getSegmentName(), - EarlybirdConfig.getPenguinVersion(), dlReaderVersion); - final long timeSliceId = segmentInfo.getTimeSliceID(); - final DocumentFactory docFactory = indexConfig.createDocumentFactory(); - - // Create the underlying DLRecordReader wrapped with the tweet reader stats. - RecordReader dlReader = new ReaderWithStatsFactory( - new DLTimestampedReaderFactory( - dlFactory, - clock, - documentReadFreshnessThreshold), - STATUS_DL_READ_STATS) - .newRecordReader(topic); - - // Create the wrapped reader which transforms serialized byte[] to TweetDocument. 
- return new TransformingRecordReader<>( - dlReader, - new Function() { - @Override - public TweetDocument apply(byte[] input) { - ThriftIndexingEvent event = new ThriftIndexingEvent(); - try { - ThriftUtils.fromCompactBinaryFormat(input, event); - } catch (TException e) { - LOG.error("Could not deserialize status document", e); - STATUS_SKIPPED_DUE_TO_FAILED_DESERIALIZATION_COUNTER.increment(); - return null; - } - - Preconditions.checkNotNull(event.getDocument()); - return new TweetDocument( - docFactory.getStatusId(event), - timeSliceId, - EarlybirdThriftDocumentUtil.getCreatedAtMs(event.getDocument()), - docFactory.newDocument(event)); - } - }); - } - - @Override - public RecordReader getDocumentReader() { - return documentReader; - } - - @Override - public RecordReader getUpdateEventsReader() { - return updateEventsReader; - } - - @Override - public RecordReader getUpdateEventsReaderForSegment( - SegmentInfo segmentInfo) { - return updateEventReaders.get(segmentInfo.getTimeSliceID()); - } - - @Override - public Optional getUpdateEventsStreamOffsetForSegment(SegmentInfo segmentInfo) { - String topic = - SegmentDLUtil.getDLTopicForUpdateEvents(segmentInfo.getSegmentName(), dlReaderVersion); - return updateEventsMultiReader.getUnderlyingOffsetForSegmentWithTopic(topic); - } - - @Override - public boolean allCaughtUp() { - return ((getDocumentReader() == null) || getDocumentReader().isCaughtUp()) - && ((getUpdateEventsReader() == null) || getUpdateEventsReader().isCaughtUp()); - } -} diff --git a/src/java/com/twitter/search/earlybird/segment/EmptySegmentDataReaderSet.java b/src/java/com/twitter/search/earlybird/segment/EmptySegmentDataReaderSet.java deleted file mode 100644 index 0d6ad55b5..000000000 --- a/src/java/com/twitter/search/earlybird/segment/EmptySegmentDataReaderSet.java +++ /dev/null @@ -1,72 +0,0 @@ -package com.twitter.search.earlybird.segment; - -import java.util.Optional; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.util.io.EmptyRecordReader; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.partition.SegmentInfo; - -/** - * A SegmentDataReaderSet that returns no data. Uses a DocumentReader that is - * always caught up, but never gets exhausted. - * Can be used for bringing up an earlybird against a static set of segments, - * and will not incorporate any new updates. 
- */ -public class EmptySegmentDataReaderSet implements SegmentDataReaderSet { - public static final EmptySegmentDataReaderSet INSTANCE = new EmptySegmentDataReaderSet(); - - @Override - public void attachDocumentReaders(SegmentInfo segmentInfo) { - } - - @Override - public void attachUpdateReaders(SegmentInfo segmentInfo) { - } - - @Override - public void completeSegmentDocs(SegmentInfo segmentInfo) { - } - - @Override - public void stopSegmentUpdates(SegmentInfo segmentInfo) { - } - - @Override - public void stopAll() { - } - - @Override - public boolean allCaughtUp() { - // ALWAYS CAUGHT UP - return true; - } - - @Override - public RecordReader newDocumentReader(SegmentInfo segmentInfo) - throws Exception { - return null; - } - - @Override - public RecordReader getDocumentReader() { - return new EmptyRecordReader<>(); - } - - @Override - public RecordReader getUpdateEventsReader() { - return null; - } - - @Override - public RecordReader getUpdateEventsReaderForSegment( - SegmentInfo segmentInfo) { - return null; - } - - @Override - public Optional getUpdateEventsStreamOffsetForSegment(SegmentInfo segmentInfo) { - return Optional.of(0L); - } -} diff --git a/src/java/com/twitter/search/earlybird/segment/SegmentDataProvider.java b/src/java/com/twitter/search/earlybird/segment/SegmentDataProvider.java deleted file mode 100644 index 502bbe1f5..000000000 --- a/src/java/com/twitter/search/earlybird/segment/SegmentDataProvider.java +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.search.earlybird.segment; - -/** - * SegmentDataProvider provides information about available segments for indexing. This interface - * abstracts away the actual source of the segment data. It might be a MySQL database, a mock - * object, or a directory of flat files. It also provides access to the segmentInfoMap itself, which - * contains information about the indexing state of Segments. - */ -public interface SegmentDataProvider extends SegmentProvider { - /** - * Returns the set of segment data record readers. - */ - SegmentDataReaderSet getSegmentDataReaderSet(); -} diff --git a/src/java/com/twitter/search/earlybird/segment/SegmentDataReaderSet.java b/src/java/com/twitter/search/earlybird/segment/SegmentDataReaderSet.java deleted file mode 100644 index 84b18c34e..000000000 --- a/src/java/com/twitter/search/earlybird/segment/SegmentDataReaderSet.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.earlybird.segment; - -import java.io.IOException; -import java.util.Optional; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.util.io.recordreader.RecordReader; -import com.twitter.search.earlybird.document.TweetDocument; -import com.twitter.search.earlybird.partition.SegmentInfo; - -/** - * SegmentDataReaderSet provides an interface to create and manage the various - * RecordReaders used to index Earlybird segments. - */ -public interface SegmentDataReaderSet { - /** - * Instruct the document RecordReaders (i.e. document, geo, ... as appropriate) to read from this - * segment. - */ - void attachDocumentReaders(SegmentInfo segmentInfo) throws IOException; - - /** - * Instruct the reader set to add segment to non-document RecordReaders (deletes, features, etc.) - */ - void attachUpdateReaders(SegmentInfo segmentInfo) throws IOException; - - /** - * Mark a segment as "complete", denoting that we are done reading document records from it. 
- * - * This instructs the reader set to stop reading documents from the segment (if it hasn't - * already), although for now geo-document records can still be read. Updates RecordReaders - * (deletes, etc.) may continue to read entries for the segment. - */ - void completeSegmentDocs(SegmentInfo segmentInfo); - - /** - * This instructs the reader set to stop reading updates for the Segment. It - * should remove the segment from all non-document RecordReaders (deletes, etc.) - */ - void stopSegmentUpdates(SegmentInfo segmentInfo); - - /** - * Stops all RecordReaders and closes all resources. - */ - void stopAll(); - - /** - * Returns true if all RecordReaders are 'caught up' with the data sources they - * are reading from. This might mean that the end of a file has been reached, - * or that we are waiting/polling for new records from an append-only database. - */ - boolean allCaughtUp(); - - /** - * Create a new DocumentReader for the given segment that is not managed by this set. - */ - RecordReader newDocumentReader(SegmentInfo segmentInfo) throws Exception; - - /** - * Returns the document reader for the current segment. - */ - RecordReader getDocumentReader(); - - /** - * Returns a combined update events reader for all segments. - */ - RecordReader getUpdateEventsReader(); - - /** - * Returns the update events reader for the given segment. - */ - RecordReader getUpdateEventsReaderForSegment(SegmentInfo segmentInfo); - - /** - * Returns the offset in the update events stream for the given segment that this earlybird should - * start indexing from. - */ - Optional getUpdateEventsStreamOffsetForSegment(SegmentInfo segmentInfo); -} diff --git a/src/java/com/twitter/search/earlybird/segment/SegmentProvider.java b/src/java/com/twitter/search/earlybird/segment/SegmentProvider.java deleted file mode 100644 index 7b8b94554..000000000 --- a/src/java/com/twitter/search/earlybird/segment/SegmentProvider.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.earlybird.segment; - -import java.io.IOException; -import java.util.List; - -import com.twitter.search.common.partitioning.base.Segment; - -public interface SegmentProvider { - /** - * Returns a *new* sorted list of all available segments on disk / db / hdfs / etc. - */ - List newSegmentList() throws IOException; -} diff --git a/src/java/com/twitter/search/earlybird/stats/EarlybirdRPCStats.java b/src/java/com/twitter/search/earlybird/stats/EarlybirdRPCStats.java deleted file mode 100644 index b8f6a67ec..000000000 --- a/src/java/com/twitter/search/earlybird/stats/EarlybirdRPCStats.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.earlybird.stats; - -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchRequestStats; - -/** - * SearchRequestStats with earlybird-specific additional stats. - */ -public final class EarlybirdRPCStats { - private final SearchRequestStats requestStats; - // Number of queries that were terminated early. - private final SearchCounter earlyTerminatedRequests; - - // We do not count client error in the response error rate, but track it separately. 
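The interface methods above imply a per-segment lifecycle, sketched below. This is illustrative only: `indexDocuments` and `applyUpdates` are placeholder helpers, `segmentInfo` is an existing SegmentInfo, and EmptySegmentDataReaderSet merely stands in for a real implementation such as DLSegmentDataReaderSet.

```java
SegmentDataReaderSet readers = EmptySegmentDataReaderSet.INSTANCE;  // or a DLSegmentDataReaderSet

readers.attachDocumentReaders(segmentInfo);             // start reading documents for the segment
readers.attachUpdateReaders(segmentInfo);               // and its update events stream
indexDocuments(readers.getDocumentReader());            // drain TweetDocuments into the index
readers.completeSegmentDocs(segmentInfo);               // no more documents for this segment
applyUpdates(readers.getUpdateEventsReaderForSegment(segmentInfo));  // keep applying updates
readers.stopSegmentUpdates(segmentInfo);                // segment is being dropped
readers.stopAll();                                      // shutdown: close all readers and resources
```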
- private final SearchRateCounter responseClientErrors; - - public EarlybirdRPCStats(String name) { - requestStats = SearchRequestStats.export(name, TimeUnit.MICROSECONDS, true, true); - earlyTerminatedRequests = SearchCounter.export(name + "_early_terminated"); - responseClientErrors = SearchRateCounter.export(name + "_client_error"); - } - - public long getRequestRate() { - return (long) (double) requestStats.getRequestRate().read(); - } - - public long getAverageLatency() { - return (long) (double) requestStats.getTimerStats().read(); - } - - /** - * Records a completed earlybird request. - * @param latencyUs how long the request took to complete, in microseconds. - * @param resultsCount how many results were returned. - * @param success whether the request was successful or not. - * @param earlyTerminated whether the request terminated early or not. - * @param clientError whether the request failure is caused by client errors - */ - public void requestComplete(long latencyUs, long resultsCount, boolean success, - boolean earlyTerminated, boolean clientError) { - // We treat client errors as successes for top-line metrics to prevent bad client requests (like - // malformed queries) from dropping our success rate and generating alerts. - requestStats.requestComplete(latencyUs, resultsCount, success || clientError); - - if (earlyTerminated) { - earlyTerminatedRequests.increment(); - } - if (clientError) { - responseClientErrors.increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/stats/EarlybirdSearcherStats.java b/src/java/com/twitter/search/earlybird/stats/EarlybirdSearcherStats.java deleted file mode 100644 index dcaedafdf..000000000 --- a/src/java/com/twitter/search/earlybird/stats/EarlybirdSearcherStats.java +++ /dev/null @@ -1,213 +0,0 @@ -package com.twitter.search.earlybird.stats; - -import java.util.EnumMap; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchMetricTimerOptions; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.ranking.thriftjava.ThriftScoringFunctionType; -import com.twitter.search.earlybird.EarlybirdSearcher; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; - -/** - * Manages counter and timer stats for EarlybirdSearcher. 
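A hypothetical call site for the EarlybirdRPCStats class above: latency is reported in microseconds, and client errors are passed separately so they count as successes for the top-line success rate while still being tracked on their own counter. The stat name is illustrative.

```java
EarlybirdRPCStats searchStats = new EarlybirdRPCStats("search_rpc");

// A request that took 1.5 ms, returned 20 results, succeeded, was not early-terminated,
// and was not a client error.
searchStats.requestComplete(1_500L, 20, true, false, false);

// A malformed client query: failed + clientError, so it is recorded as a success for the
// top-line rate but increments the separate client-error counter.
searchStats.requestComplete(800L, 0, false, false, true);
```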
- */ -public class EarlybirdSearcherStats { - private static final TimeUnit TIME_UNIT = TimeUnit.MICROSECONDS; - - private final SearchStatsReceiver earlybirdServerStatsReceiver; - - public final SearchCounter thriftQueryWithSerializedQuery; - public final SearchCounter thriftQueryWithLuceneQuery; - public final SearchCounter thriftQueryWithoutTextQuery; - public final SearchCounter addedFilterBadUserRep; - public final SearchCounter addedFilterFromUserIds; - public final SearchCounter addedFilterTweetIds; - public final SearchCounter unsetFiltersForSocialFilterTypeQuery; - public final SearchCounter querySpecificSignalMapTotalSize; - public final SearchCounter querySpecificSignalQueriesUsed; - public final SearchCounter querySpecificSignalQueriesErased; - public final SearchCounter authorSpecificSignalMapTotalSize; - public final SearchCounter authorSpecificSignalQueriesUsed; - public final SearchCounter authorSpecificSignalQueriesErased; - public final SearchCounter nullcastTweetsForceExcluded; - public final SearchCounter nullcastUnexpectedResults; - public final SearchCounter nullcastUnexpectedQueries; - public final SearchCounter relevanceAntiGamingFilterUsed; - public final SearchCounter relevanceAntiGamingFilterNotRequested; - public final SearchCounter relevanceAntiGamingFilterSpecifiedTweetsAndFromUserIds; - public final SearchCounter relevanceAntiGamingFilterSpecifiedTweets; - public final SearchCounter relevanceAntiGamingFilterSpecifiedFromUserIds; - public final SearchCounter numCollectorAdjustedMinSearchedStatusID; - - public final Map numRequestsWithBlankQuery; - private final Map latencyByScoringFunctionType; - private final Map> latencyByScoringFunctionTypeAndClient; - private final Map latencyByTensorflowModel; - - public EarlybirdSearcherStats(SearchStatsReceiver earlybirdServerStatsReceiver) { - this.earlybirdServerStatsReceiver = earlybirdServerStatsReceiver; - - this.thriftQueryWithLuceneQuery = - earlybirdServerStatsReceiver.getCounter("thrift_query_with_lucene_query"); - this.thriftQueryWithSerializedQuery = - earlybirdServerStatsReceiver.getCounter("thrift_query_with_serialized_query"); - this.thriftQueryWithoutTextQuery = - earlybirdServerStatsReceiver.getCounter("thrift_query_without_text_query"); - - this.addedFilterBadUserRep = - earlybirdServerStatsReceiver.getCounter("added_filter_bad_user_rep"); - this.addedFilterFromUserIds = - earlybirdServerStatsReceiver.getCounter("added_filter_from_user_ids"); - this.addedFilterTweetIds = - earlybirdServerStatsReceiver.getCounter("added_filter_tweet_ids"); - - this.unsetFiltersForSocialFilterTypeQuery = - earlybirdServerStatsReceiver.getCounter("unset_filters_for_social_filter_type_query"); - this.querySpecificSignalMapTotalSize = - earlybirdServerStatsReceiver.getCounter("query_specific_signal_map_total_size"); - this.querySpecificSignalQueriesUsed = - earlybirdServerStatsReceiver.getCounter("query_specific_signal_queries_used"); - this.querySpecificSignalQueriesErased = - earlybirdServerStatsReceiver.getCounter("query_specific_signal_queries_erased"); - this.authorSpecificSignalMapTotalSize = - earlybirdServerStatsReceiver.getCounter("author_specific_signal_map_total_size"); - this.authorSpecificSignalQueriesUsed = - earlybirdServerStatsReceiver.getCounter("author_specific_signal_queries_used"); - this.authorSpecificSignalQueriesErased = - earlybirdServerStatsReceiver.getCounter("author_specific_signal_queries_erased"); - this.nullcastTweetsForceExcluded = - 
earlybirdServerStatsReceiver.getCounter("force_excluded_nullcast_result_count"); - this.nullcastUnexpectedResults = - earlybirdServerStatsReceiver.getCounter("unexpected_nullcast_result_count"); - this.nullcastUnexpectedQueries = - earlybirdServerStatsReceiver.getCounter("queries_with_unexpected_nullcast_results"); - this.numCollectorAdjustedMinSearchedStatusID = - earlybirdServerStatsReceiver.getCounter("collector_adjusted_min_searched_status_id"); - - this.relevanceAntiGamingFilterUsed = earlybirdServerStatsReceiver - .getCounter("relevance_anti_gaming_filter_used"); - this.relevanceAntiGamingFilterNotRequested = earlybirdServerStatsReceiver - .getCounter("relevance_anti_gaming_filter_not_requested"); - this.relevanceAntiGamingFilterSpecifiedTweetsAndFromUserIds = earlybirdServerStatsReceiver - .getCounter("relevance_anti_gaming_filter_specified_tweets_and_from_user_ids"); - this.relevanceAntiGamingFilterSpecifiedTweets = earlybirdServerStatsReceiver - .getCounter("relevance_anti_gaming_filter_specified_tweets"); - this.relevanceAntiGamingFilterSpecifiedFromUserIds = earlybirdServerStatsReceiver - .getCounter("relevance_anti_gaming_filter_specified_from_user_ids"); - - this.latencyByScoringFunctionType = new EnumMap<>(ThriftScoringFunctionType.class); - this.latencyByScoringFunctionTypeAndClient = new EnumMap<>(ThriftScoringFunctionType.class); - this.latencyByTensorflowModel = new ConcurrentHashMap<>(); - - for (ThriftScoringFunctionType type : ThriftScoringFunctionType.values()) { - this.latencyByScoringFunctionType.put(type, getTimerStatsByName(getStatsNameByType(type))); - this.latencyByScoringFunctionTypeAndClient.put(type, new ConcurrentHashMap<>()); - } - - this.numRequestsWithBlankQuery = new EnumMap<>(EarlybirdSearcher.QueryMode.class); - - for (EarlybirdSearcher.QueryMode queryMode : EarlybirdSearcher.QueryMode.values()) { - String counterName = - String.format("num_requests_with_blank_query_%s", queryMode.name().toLowerCase()); - - this.numRequestsWithBlankQuery.put( - queryMode, earlybirdServerStatsReceiver.getCounter(counterName)); - } - } - - /** - * Records the latency for a request for the applicable stats. - * @param timer A stopped timer that timed the request. - * @param request The request that was timed. - */ - public void recordRelevanceStats(SearchTimer timer, EarlybirdRequest request) { - Preconditions.checkNotNull(timer); - Preconditions.checkNotNull(request); - Preconditions.checkArgument(!timer.isRunning()); - - ThriftSearchRelevanceOptions relevanceOptions = request.getSearchQuery().getRelevanceOptions(); - - // Only record ranking searches with a set type. - if (!relevanceOptions.isSetRankingParams() - || !relevanceOptions.getRankingParams().isSetType()) { - return; - } - - ThriftRankingParams rankingParams = relevanceOptions.getRankingParams(); - ThriftScoringFunctionType scoringFunctionType = rankingParams.getType(); - - latencyByScoringFunctionType.get(scoringFunctionType).stoppedTimerIncrement(timer); - - if (request.getClientId() != null) { - getTimerStatsByClient(scoringFunctionType, request.getClientId()) - .stoppedTimerIncrement(timer); - } - - if (scoringFunctionType != ThriftScoringFunctionType.TENSORFLOW_BASED) { - return; - } - - String modelName = rankingParams.getSelectedTensorflowModel(); - - if (modelName != null) { - getTimerStatsByTensorflowModel(modelName).stoppedTimerIncrement(timer); - } - } - - /** - * Creates a search timer with options specified by TweetsEarlybirdSearcherStats. - * @return A new SearchTimer. 
- */ - public SearchTimer createTimer() { - return new SearchTimer(new SearchMetricTimerOptions.Builder() - .withTimeUnit(TIME_UNIT) - .build()); - } - - private SearchTimerStats getTimerStatsByClient( - ThriftScoringFunctionType type, - String clientId) { - Map latencyByClient = latencyByScoringFunctionTypeAndClient.get(type); - - return latencyByClient.computeIfAbsent(clientId, - cid -> getTimerStatsByName(getStatsNameByClientAndType(type, cid))); - } - - private SearchTimerStats getTimerStatsByTensorflowModel(String modelName) { - return latencyByTensorflowModel.computeIfAbsent(modelName, - mn -> getTimerStatsByName(getStatsNameByTensorflowModel(mn))); - } - - private SearchTimerStats getTimerStatsByName(String name) { - return earlybirdServerStatsReceiver.getTimerStats( - name, TIME_UNIT, false, true, false); - } - - public static String getStatsNameByType(ThriftScoringFunctionType type) { - return String.format( - "search_relevance_scoring_function_%s_requests", type.name().toLowerCase()); - } - - public static String getStatsNameByClientAndType( - ThriftScoringFunctionType type, - String clientId) { - return String.format("%s_%s", ClientIdUtil.formatClientId(clientId), getStatsNameByType(type)); - } - - public static String getStatsNameByTensorflowModel(String modelName) { - return String.format( - "model_%s_%s", modelName, getStatsNameByType(ThriftScoringFunctionType.TENSORFLOW_BASED)); - } -} diff --git a/src/java/com/twitter/search/earlybird/stats/SegmentSyncStats.java b/src/java/com/twitter/search/earlybird/stats/SegmentSyncStats.java deleted file mode 100644 index e16b35f6e..000000000 --- a/src/java/com/twitter/search/earlybird/stats/SegmentSyncStats.java +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.search.earlybird.stats; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.Timer; - -public class SegmentSyncStats { - private static final String CPU_TOTAL = "_cpu_total_"; - private static final String CPU_USER = "_cpu_user_mode_"; - private static final String CPU_SYS = "_cpu_system_mode_"; - - private final SearchCounter segmentSyncLatency; - private final SearchCounter segmentSyncLatencyCpuTotal; - private final SearchCounter segmentSyncLatencyCpuUserMode; - private final SearchCounter segmentSyncLatencyCpuSystemMode; - private final SearchCounter segmentSyncCount; - private final SearchCounter segmentErrorCount; - - private SegmentSyncStats(SearchCounter segmentSyncLatency, - SearchCounter segmentSyncLatencyCpuTotal, - SearchCounter segmentSyncLatencyCpuUserMode, - SearchCounter segmentSyncLatencyCpuSystemMode, - SearchCounter segmentSyncCount, - SearchCounter segmentErrorCount) { - this.segmentSyncLatency = segmentSyncLatency; - this.segmentSyncLatencyCpuTotal = segmentSyncLatencyCpuTotal; - this.segmentSyncLatencyCpuUserMode = segmentSyncLatencyCpuUserMode; - this.segmentSyncLatencyCpuSystemMode = segmentSyncLatencyCpuSystemMode; - this.segmentSyncCount = segmentSyncCount; - this.segmentErrorCount = segmentErrorCount; - } - - /** - * Creates a new set of stats for the given segment sync action. - * @param action the name to be used for the sync stats. 
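The static naming helpers at the end of EarlybirdSearcherStats are pure string functions, so the exported relevance-latency metric names can be read directly off them. A hypothetical JUnit-style check, shown only to make the naming pattern concrete ("my_model" is a placeholder model name):

```java
@Test
public void relevanceLatencyStatNamesFollowTheDocumentedPattern() {
  // Per scoring function type: search_relevance_scoring_function_<type>_requests
  assertEquals(
      "search_relevance_scoring_function_tensorflow_based_requests",
      EarlybirdSearcherStats.getStatsNameByType(ThriftScoringFunctionType.TENSORFLOW_BASED));

  // Per Tensorflow model: model_<model>_<per-type name>
  assertEquals(
      "model_my_model_search_relevance_scoring_function_tensorflow_based_requests",
      EarlybirdSearcherStats.getStatsNameByTensorflowModel("my_model"));
}
```

Per-client names are additionally prefixed through ClientIdUtil.formatClientId, whose output format is not shown in this patch.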
- */ - public SegmentSyncStats(String action) { - this(SearchCounter.export("segment_" + action + "_latency_ms"), - SearchCounter.export("segment_" + action + "_latency" + CPU_TOTAL + "ms"), - SearchCounter.export("segment_" + action + "_latency" + CPU_USER + "ms"), - SearchCounter.export("segment_" + action + "_latency" + CPU_SYS + "ms"), - SearchCounter.export("segment_" + action + "_count"), - SearchCounter.export("segment_" + action + "_error_count")); - } - - /** - * Records a completed action using the specified timer. - */ - public void actionComplete(Timer timer) { - segmentSyncCount.increment(); - segmentSyncLatency.add(timer.getElapsed()); - segmentSyncLatencyCpuTotal.add(timer.getElapsedCpuTotal()); - segmentSyncLatencyCpuUserMode.add(timer.getElapsedCpuUserMode()); - segmentSyncLatencyCpuSystemMode.add(timer.getElapsedCpuSystemMode()); - } - - public void recordError() { - segmentErrorCount.increment(); - } -} diff --git a/src/java/com/twitter/search/earlybird/tools/EarlybirdThriftRequestDeserializerUtil.java b/src/java/com/twitter/search/earlybird/tools/EarlybirdThriftRequestDeserializerUtil.java deleted file mode 100644 index c6dd20c9d..000000000 --- a/src/java/com/twitter/search/earlybird/tools/EarlybirdThriftRequestDeserializerUtil.java +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.search.earlybird.tools; - -import java.io.BufferedReader; -import java.io.IOException; -import java.nio.charset.Charset; -import java.nio.file.FileSystems; -import java.nio.file.Files; -import java.nio.file.Path; - -import com.google.common.base.Preconditions; - -import org.apache.commons.codec.binary.Base64; -import org.apache.thrift.TDeserializer; -import org.apache.thrift.TException; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -/** - * - * This tool deserializes the collected thrift requests into human readable format. - * - * Takes zero or one parameter: path to the thrift request log file. - * - * To run: Launch main from IntelliJ / Eclipse. - */ -public final class EarlybirdThriftRequestDeserializerUtil { - private static final String DEFAULT_LOG_FILE_LOCATION = "/tmp/eb_req.B64"; - // Not threadsafe. Single thread main(). - private static final Base64 B64 = new Base64(0); - private static final TDeserializer DESERIALIZER = new TDeserializer(); - - private EarlybirdThriftRequestDeserializerUtil() { - } - - /** - * Runs the EarlybirdThriftRequestDeserializerUtil tool with the given command-line arguments. - */ - public static void main(String[] args) throws IOException { - Path logFile = null; - if (args.length == 1) { - logFile = FileSystems.getDefault().getPath(args[0]); - } else if (args.length == 0) { - logFile = FileSystems.getDefault().getPath(DEFAULT_LOG_FILE_LOCATION); - } else { - System.err.println("Usage: takes zero or one parameter (log file path). 
" - + "If no log file is specified, " + DEFAULT_LOG_FILE_LOCATION + " is used."); - //CHECKSTYLE:OFF RegexpSinglelineJava - System.exit(-1); - //CHECKSTYLE:ON RegexpSinglelineJava - } - Preconditions.checkState(logFile.toFile().exists()); - - BufferedReader reader = Files.newBufferedReader(logFile, Charset.defaultCharset()); - try { - String line; - while ((line = reader.readLine()) != null) { - EarlybirdRequest ebRequest = deserializeEBRequest(line); - if (ebRequest != null) { - System.out.println(ebRequest); - } - } - } finally { - reader.close(); - } - } - - private static EarlybirdRequest deserializeEBRequest(String line) { - EarlybirdRequest ebRequest = new EarlybirdRequest(); - byte[] bytes = B64.decode(line); - try { - DESERIALIZER.deserialize(ebRequest, bytes); - } catch (TException e) { - System.err.println("Error deserializing thrift."); - } - return ebRequest; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ActionLogger.java b/src/java/com/twitter/search/earlybird/util/ActionLogger.java deleted file mode 100644 index cc21c7956..000000000 --- a/src/java/com/twitter/search/earlybird/util/ActionLogger.java +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.concurrent.Callable; - -import com.google.common.base.Stopwatch; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public final class ActionLogger { - private static final Logger LOG = LoggerFactory.getLogger(ActionLogger.class); - - private ActionLogger() { - } - - /** - * Run a function, logging a message at the start and end, and the time it took. - */ - public static T call(String message, Callable fn) throws Exception { - LOG.info("Action starting: '{}'.", message); - Stopwatch stopwatch = Stopwatch.createStarted(); - try { - return fn.call(); - } catch (Throwable e) { - LOG.error("Action failed: '{}'.", message, e); - throw e; - } finally { - LOG.info("Action finished in {} '{}'.", stopwatch, message); - } - } - - /** - * Run a function, logging a message at the start and end, and the time it took. - */ - public static void run(String message, CheckedRunnable fn) throws Exception { - call(message, () -> { - fn.run(); - return null; - }); - } - - @FunctionalInterface - public interface CheckedRunnable { - /** - * A nullary function that throws checked exceptions. 
- */ - void run() throws Exception; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdAction.java b/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdAction.java deleted file mode 100644 index ede199588..000000000 --- a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdAction.java +++ /dev/null @@ -1,409 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.Optional; -import java.util.Random; -import java.util.concurrent.atomic.AtomicBoolean; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.base.Stopwatch; - -import org.apache.zookeeper.KeeperException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.base.ExceptionalFunction; -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.zookeeper.ServerSet; -import com.twitter.common.zookeeper.ZooKeeperClient; -import com.twitter.search.common.config.Config; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.util.zktrylock.TryLock; -import com.twitter.search.common.util.zktrylock.ZooKeeperTryLockFactory; -import com.twitter.search.earlybird.ServerSetMember; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; -import com.twitter.search.earlybird.exception.AlreadyInServerSetUpdateException; -import com.twitter.search.earlybird.exception.EarlybirdException; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.exception.NotInServerSetUpdateException; -import com.twitter.search.earlybird.partition.DynamicPartitionConfig; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.partition.SegmentSyncConfig; - -/** - * Utility class for executing tasks on Earlybirds that need to be coordinated across replicas - * on the same hash partition. - * Can be used for things like coordinating optimization on the same timeslice. - * When enabled, a try-lock will be taken out in zookeeper while the task is performed. - * The action will attempt to leave the partition's server set. If the attempt fails, the action - * is aborted. - */ -public class CoordinatedEarlybirdAction implements CoordinatedEarlybirdActionInterface { - private static final Logger LOG = LoggerFactory.getLogger(CoordinatedEarlybirdAction.class); - - private static final Boolean COORDINATED_ACTION_FLAG = Boolean.TRUE; - private static final Boolean NOT_COORDINATED_ACTION_FLAG = Boolean.FALSE; - - private final String actionName; - private final DynamicPartitionConfig dynamicPartitionConfig; - @Nullable private final ServerSetMember serverSetMember; - private final ZooKeeperTryLockFactory zooKeeperTryLockFactory; - - // Whether this action should be coordinated through zookeeper in the first place (could be - // config'ed off). - // If the action is coordinated, this earlybird will leave its server set when performing the - // coordinated action. - private final AtomicBoolean shouldSynchronize; - // Whether this action should ensure that there are enough replicas in the serverset (defined by - // maxAllowedReplicasNotInServerSet) before leaving the serverset. 
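A usage sketch for ActionLogger above. SegmentManager, its methods, and LOG are assumed collaborators introduced only for illustration; they are not part of this patch.

```java
// Hypothetical startup step wrapped with ActionLogger so start, end, and duration are logged.
void loadSegmentsOnStartup(SegmentManager segmentManager) throws Exception {
  ActionLogger.run("flush in-memory segment", segmentManager::flushCurrentSegment);

  int loaded = ActionLogger.call("load serialized segments", segmentManager::loadAllSegments);
  LOG.info("Loaded {} segments at startup.", loaded);
}
```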
- private final boolean checkNumReplicasInServerSet; - // If this many (or more) servers have left the partition, we cannot perform a coordinated action - private final int maxAllowedReplicasNotInServerSet; - // How long to lock out all other replicas in this hash partition for. - // Should be some small multiple of how long the action is expected to take, to allow for longer - // running cases. - private final long zkLockExpirationTimeMinutes; - // Prefix for the zookeeper lock used when coordinating daily updates. - // Full name should include the hash partition number. - private final String zkLockNamePrefix; - // If we're unable to re-join this earlybird's server set during coordinated updates, - // how many times to retry. - private final int joinServerSetRetries; - // How long to sleep between retries if unable to job back into server set. - private final int joinServerSetRetrySleepMillis; - // How long to sleep between leaving the serverset and executing the action - private final int sleepAfterLeaveServerSetMillis; - - // How many times a this action was called within a lock block. - private final SearchCounter numCoordinatedFunctionCalls; - private final SearchCounter numCoordinatedLeaveServersetCalls; - - private final CriticalExceptionHandler criticalExceptionHandler; - private final SegmentSyncConfig segmentSyncConfig; - - /** - * Create a CoordinatedEarlybirdAction. - * - * @param actionName the name to be used for logging and the prefix for config options. - * @param dynamicPartitionConfig maintains the current partitioning configuration for this - * earlybird. Used mainly to determine the hash partition of this earlybird. - * @param serverSetMember the server that this action is running on. To be used to leaving and - * rejoining the server's server set. - */ - public CoordinatedEarlybirdAction( - ZooKeeperTryLockFactory zooKeeperTryLockFactory, - String actionName, - DynamicPartitionConfig dynamicPartitionConfig, - @Nullable ServerSetMember serverSetMember, - CriticalExceptionHandler criticalExceptionHandler, - SegmentSyncConfig segmentSyncConfig) { - this.actionName = actionName; - this.dynamicPartitionConfig = dynamicPartitionConfig; - this.serverSetMember = serverSetMember; - this.criticalExceptionHandler = criticalExceptionHandler; - this.segmentSyncConfig = segmentSyncConfig; - this.zooKeeperTryLockFactory = zooKeeperTryLockFactory; - if (serverSetMember == null) { - Preconditions.checkState(Config.environmentIsTest(), - "Should only have a null server in tests"); - } - - this.shouldSynchronize = new AtomicBoolean( - EarlybirdConfig.getBool(actionName + "_should_synchronize", false)); - - // Export whether or not synchronization is enabled as a stat - SearchCustomGauge.export( - actionName + "_should_synchronize", () -> shouldSynchronize.get() ? 
1 : 0); - - this.checkNumReplicasInServerSet = EarlybirdProperty.CHECK_NUM_REPLICAS_IN_SERVER_SET.get(); - - int numReplicas = - dynamicPartitionConfig.getCurrentPartitionConfig().getNumReplicasInHashPartition(); - this.maxAllowedReplicasNotInServerSet = - EarlybirdProperty.MAX_ALLOWED_REPLICAS_NOT_IN_SERVER_SET.get(numReplicas); - - this.zkLockExpirationTimeMinutes = - EarlybirdConfig.getLong(actionName + "_lock_expiration_time_minutes", 60L); - this.zkLockNamePrefix = actionName + "_for_hash_partition_"; - this.joinServerSetRetries = - EarlybirdConfig.getInt(actionName + "_join_server_set_retries", 20); - this.joinServerSetRetrySleepMillis = - EarlybirdConfig.getInt(actionName + "_join_server_retry_sleep_millis", 2000); - this.sleepAfterLeaveServerSetMillis = - EarlybirdConfig.getInt("coordinated_action_sleep_after_leave_server_set_millis", 30000); - - this.numCoordinatedFunctionCalls = SearchCounter.export(actionName + "_num_coordinated_calls"); - this.numCoordinatedLeaveServersetCalls = - SearchCounter.export(actionName + "_num_coordinated_leave_serverset_calls"); - - if (this.checkNumReplicasInServerSet) { - LOG.info( - "Coordinate action config ({}): allowedNotIn: {}, current number of replicas: {}, " - + "synchronization enabled: {}, checkNumReplicasInServerSet enabled: {}", - actionName, - maxAllowedReplicasNotInServerSet, - dynamicPartitionConfig.getCurrentPartitionConfig().getNumReplicasInHashPartition(), - shouldSynchronize, - this.checkNumReplicasInServerSet); - } else { - LOG.info( - "Coordinate action config ({}): synchronization enabled: {}, " - + "checkNumReplicasInServerSet enabled: {}", - actionName, - shouldSynchronize, - this.checkNumReplicasInServerSet); - } - } - - - @Override - public boolean execute( - String description, - ExceptionalFunction function) - throws E, CoordinatedEarlybirdActionLockFailed { - if (this.shouldSynchronize.get()) { - return executeWithCoordination(description, function); - } else { - return function.apply(NOT_COORDINATED_ACTION_FLAG); - } - } - - enum LeaveServerSetResult { - SUCCESS, - FAILURE, - NOT_IN_SERVER_SET, - NO_SERVER_SET_MEMBER - } - - private LeaveServerSetResult leaveServerSet() { - LOG.info("Leaving serving server set for " + actionName); - try { - serverSetMember.leaveServerSet("CoordinatedAction: " + actionName); - return LeaveServerSetResult.SUCCESS; - } catch (ServerSet.UpdateException ex) { - if (ex instanceof NotInServerSetUpdateException) { - LOG.info("No need to leave; already out of server set during: " - + actionName, ex); - return LeaveServerSetResult.NOT_IN_SERVER_SET; - } else { - LOG.warn("Unable to leave server set during: " + actionName, ex); - return LeaveServerSetResult.FAILURE; - } - } - } - - private LeaveServerSetResult maybeLeaveServerSet() { - if (serverSetMember != null) { - if (serverSetMember.isInServerSet()) { - - if (!checkNumReplicasInServerSet) { - return leaveServerSet(); - } else { - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - final int minNumServers = - curPartitionConfig.getNumReplicasInHashPartition() - maxAllowedReplicasNotInServerSet; - Optional numServerSetMembers = getNumberOfServerSetMembers(); - LOG.info("Checking number of replicas before leaving server set for " + actionName - + ". 
Number of members is: " + numServerSetMembers + " minMembers: " + minNumServers); - if (numServerSetMembers.isPresent() && numServerSetMembers.get() > minNumServers) { - return leaveServerSet(); - } else { - LOG.warn("Not leaving server set during: " + actionName); - return LeaveServerSetResult.FAILURE; - } - } - } else { - LOG.info("Not in server set, no need to leave it."); - return LeaveServerSetResult.NOT_IN_SERVER_SET; - } - } - - return LeaveServerSetResult.NO_SERVER_SET_MEMBER; - } - - private boolean executeWithCoordination( - final String description, - final ExceptionalFunction function) - throws E, CoordinatedEarlybirdActionLockFailed { - PartitionConfig curPartitionConfig = dynamicPartitionConfig.getCurrentPartitionConfig(); - TryLock lock = zooKeeperTryLockFactory.createTryLock( - DatabaseConfig.getLocalHostname(), - segmentSyncConfig.getZooKeeperSyncFullPath(), - zkLockNamePrefix - + curPartitionConfig.getIndexingHashPartitionID(), - Amount.of(zkLockExpirationTimeMinutes, Time.MINUTES) - ); - - final AtomicBoolean success = new AtomicBoolean(false); - - boolean gotLock = lock.tryWithLock(() -> { - Stopwatch actionTiming = Stopwatch.createStarted(); - - LeaveServerSetResult leftServerSet = maybeLeaveServerSet(); - if (leftServerSet == LeaveServerSetResult.FAILURE) { - LOG.info("Failed to leave the server set, will not execute action."); - return; - } - - LOG.info("maybeLeaveServerSet returned: {}", leftServerSet); - - // Sleep for a short time to give the server some time to finish requests that it is currently - // executing and allow roots some time to register that this host has left the server set. - // If we didn't do this and the coordinated action included a full GC, then latency and error - // rate at the root layer would spike higher at the time of the GC. SEARCH-35456 - try { - Thread.sleep(sleepAfterLeaveServerSetMillis); - } catch (InterruptedException ex) { - Thread.currentThread().interrupt(); - } - - LOG.info(actionName + " synchronization action for " + description); - - try { - numCoordinatedFunctionCalls.increment(); - numCoordinatedLeaveServersetCalls.increment(); - - Boolean successValue = function.apply(COORDINATED_ACTION_FLAG); - success.set(successValue); - } finally { - if (leftServerSet == LeaveServerSetResult.SUCCESS) { - joinServerSet(); - } - LOG.info("{} synchronization action for {} completed after {}, success: {}", - actionName, - description, - actionTiming, - success.get()); - } - }); - - if (!gotLock) { - String errorMsg = actionName + ": Failed to get zk indexing lock for " + description; - LOG.info(errorMsg); - throw new CoordinatedEarlybirdActionLockFailed(errorMsg); - } - return success.get(); - } - - @Override - public void retryActionUntilRan(String description, Runnable action) { - Random random = new Random(System.currentTimeMillis()); - - boolean actionExecuted = false; - int attempts = 0; - while (!actionExecuted) { - try { - attempts++; - actionExecuted = this.execute(description, isCoordinated -> { - action.run(); - return true; - }); - } catch (CoordinatedEarlybirdActionLockFailed ex) { - } - - if (!actionExecuted) { - // Variable sleep amount. The reason for the random sleeps - // is so that across multiple earlybirds this doesn't get - // executed in some sequence that depends on something else - // like maybe deploy times. It might be easier to catch possible - // problems if implicit orderings like this are not introduced. 
- long msToSleep = (10 + random.nextInt(5)) * 1000L; - try { - Thread.sleep(msToSleep); - } catch (InterruptedException ex) { - LOG.info("Interrupted while trying to execute"); - Thread.currentThread().interrupt(); - } - } else { - LOG.info("Executed {} after {} attempts", actionName, attempts); - } - } - } - - /** - * Gets the current number of servers in this server's server set. - * @return absent Optional if we encountered an exception getting the number of hosts. - */ - private Optional getNumberOfServerSetMembers() { - try { - return serverSetMember != null ? Optional.of(serverSetMember.getNumberOfServerSetMembers()) - : Optional.empty(); - } catch (InterruptedException ex) { - LOG.warn("Action " + actionName + " was interrupted.", ex); - Thread.currentThread().interrupt(); - return Optional.empty(); - } catch (ZooKeeperClient.ZooKeeperConnectionException | KeeperException ex) { - LOG.warn("Exception during " + actionName, ex); - return Optional.empty(); - } - } - - /** - * After a coordinated action, join back this earlybird's server set with retries - * and sleeps in between. - */ - private void joinServerSet() { - Preconditions.checkNotNull(serverSetMember); - - boolean joined = false; - for (int i = 0; i < joinServerSetRetries; i++) { - try { - serverSetMember.joinServerSet("CoordinatedAction: " + actionName); - joined = true; - break; - } catch (AlreadyInServerSetUpdateException ex) { - // Most likely leaving the server set failed - joined = true; - break; - } catch (ServerSet.UpdateException ex) { - LOG.warn("Unable to join server set after " + actionName + " on attempt " - + i, ex); - if (i < (joinServerSetRetries - 1)) { - try { - Thread.sleep(joinServerSetRetrySleepMillis); - } catch (InterruptedException e) { - LOG.warn("Interrupted while waiting to join back server set for: " + actionName); - // Preserve interrupt status. - Thread.currentThread().interrupt(); - break; - } - } - } - } - if (!joined) { - String message = String.format( - "Unable to join server set after %s, setting fatal flag.", - actionName); - EarlybirdException exception = new EarlybirdException(message); - - LOG.error(message, exception); - criticalExceptionHandler.handle(this, exception); - } - } - - - @Override - public boolean setShouldSynchronize(boolean shouldSynchronizeParam) { - boolean oldValue = this.shouldSynchronize.getAndSet(shouldSynchronizeParam); - LOG.info("Updated shouldSynchronize for: " + actionName + " from " + oldValue - + " to " + shouldSynchronizeParam); - return oldValue; - } - - @Override - @VisibleForTesting - public long getNumCoordinatedFunctionCalls() { - return this.numCoordinatedFunctionCalls.get(); - } - - @Override - @VisibleForTesting - public long getNumCoordinatedLeaveServersetCalls() { - return this.numCoordinatedLeaveServersetCalls.get(); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionInterface.java b/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionInterface.java deleted file mode 100644 index 4414a6bc9..000000000 --- a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionInterface.java +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.search.earlybird.util; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.common.base.ExceptionalFunction; - -public interface CoordinatedEarlybirdActionInterface { - /** - * Executes the provided Function associated with the given segment. - * @param description a name for the action to be exected. 
- * @param function the function to call in a coordinated manner. - * As input, the function will receive a flag indicating whether or not it is being - * called in a coordinated fashion. true if it is, and false otherwise. - * @return true iff the function was executed, and function.apply() returned true; - * throws CoordinatedEarlybirdActionLockFailed if function is not executed (because lock - * acquisition failed). - */ - boolean execute( - String description, - ExceptionalFunction function) - throws E, CoordinatedEarlybirdActionLockFailed; - - /** - * Set whether this action should be synchronized. - * If not, the action is directly applied. If yes, Earlybirds will coordinate executing the - * action via ZooKeeperTryLocks. - */ - boolean setShouldSynchronize(boolean shouldSynchronizeParam); - - /** - * Number of times this coordinated action has been executed. - * @return - */ - @VisibleForTesting - long getNumCoordinatedFunctionCalls(); - - /** - * Number of times we have left the serverset. - * @return - */ - @VisibleForTesting - long getNumCoordinatedLeaveServersetCalls(); - - /** - * Retry until we can run an action on a single instance in the serverset. - * @param description Text description of the action. - * @param action A runnable to be run. - */ - void retryActionUntilRan(String description, Runnable action); -} diff --git a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionLockFailed.java b/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionLockFailed.java deleted file mode 100644 index 52c00f975..000000000 --- a/src/java/com/twitter/search/earlybird/util/CoordinatedEarlybirdActionLockFailed.java +++ /dev/null @@ -1,11 +0,0 @@ -package com.twitter.search.earlybird.util; - -/** - * Thrown when a coordinated earlybird action cannot acquire the ZooKeeper try-lock, so - * the action is not executed. - */ -public class CoordinatedEarlybirdActionLockFailed extends Exception { - public CoordinatedEarlybirdActionLockFailed(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/EarlybirdDecider.java b/src/java/com/twitter/search/earlybird/util/EarlybirdDecider.java deleted file mode 100644 index 6e2740a7a..000000000 --- a/src/java/com/twitter/search/earlybird/util/EarlybirdDecider.java +++ /dev/null @@ -1,128 +0,0 @@ -package com.twitter.search.earlybird.util; - -import scala.Some; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.decider.Decider; -import com.twitter.decider.Decider$; -import com.twitter.decider.RandomRecipient$; -import com.twitter.decider.Recipient; -import com.twitter.decider.decisionmaker.MutableDecisionMaker; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.decider.SearchDeciderFactory; -import com.twitter.search.earlybird.common.config.EarlybirdProperty; - -/** - * A Singleton to let any code in Earlybird have the ability to be guarded by a decider key. - * - * EarlybirdDecider is a thin wrapper around the Twitter Decider library to provide global access to a single - * decider configuration. This way any code anywhere can easily be guarded by a Decider key. The initializer requires - * EarlybirdConfig to be initialized already. Defaults to a NullDecider, which causes all requests for keys to return - * false.
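A hedged sketch of a caller driving the CoordinatedEarlybirdActionInterface above. The generic parameters on ExceptionalFunction are elided in this patch view, so the lambda shape is assumed, and optimizeSegment is a made-up helper used only for illustration.

```java
// Hypothetical caller: coordinate a segment optimization across the replicas of a partition.
boolean optimizeWithCoordination(CoordinatedEarlybirdActionInterface optimizationAction,
                                 long timeSliceId) throws Exception {
  return optimizationAction.execute("optimize timeslice " + timeSliceId, isCoordinated -> {
    // isCoordinated is true when this replica holds the per-partition ZooKeeper try-lock and
    // has left its server set; false when synchronization is config'ed off and the action
    // simply runs in place.
    return optimizeSegment(timeSliceId);  // assumed helper; returns true on success
  });
}
```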
- */ -public final class EarlybirdDecider { - public static final org.slf4j.Logger LOG = - org.slf4j.LoggerFactory.getLogger(EarlybirdDecider.class); - public static final String DECIDER_CONFIG = "./config/earlybird-decider.yml"; - - private static volatile Decider earlybirdDecider = Decider$.MODULE$.NullDecider(); - private static volatile MutableDecisionMaker mutableDecisionMaker; - - private EarlybirdDecider() { } - - /** - * Initializes the global decider accessor. Requires EarlybirdConfig to be initialized. - * - * @return the new decider interface. - */ - public static Decider initialize() { - return initialize(DECIDER_CONFIG); - } - - /** - * Initializes the global decider accessor. Requires EarlybirdConfig to be initialized. - * - * @param configPath path to the base decider config file. - * @return the new decider interface. - */ - @VisibleForTesting public static Decider initialize(String configPath) { - synchronized (EarlybirdDecider.class) { - Preconditions.checkState(earlybirdDecider == Decider$.MODULE$.NullDecider(), - "EarlybirdDecider can be initialized only once."); - - mutableDecisionMaker = new MutableDecisionMaker(); - - if (EarlybirdProperty.USE_DECIDER_OVERLAY.get(false)) { - String category = EarlybirdProperty.DECIDER_OVERLAY_CONFIG.get(); - earlybirdDecider = - SearchDeciderFactory.createDeciderWithoutRefreshBaseWithOverlay( - configPath, category, mutableDecisionMaker); - LOG.info("EarlybirdDecider set to use the decider overlay " + category); - } else { - earlybirdDecider = - SearchDeciderFactory.createDeciderWithRefreshBaseWithoutOverlay( - configPath, mutableDecisionMaker); - LOG.info("EarlybirdDecider set to only use the base config"); - } - return earlybirdDecider; - } - } - - /** - * Check if feature is available based on randomness - * - * @param feature the feature name to test - * @return true if the feature is available, false otherwise - */ - public static boolean isFeatureAvailable(String feature) { - return isFeatureAvailable(feature, RandomRecipient$.MODULE$); - } - - /** - * Check if the feature is available based on the user - * - * The recipient'd id is hashed and used as the value to compare with the decider percentage. Therefore, the same user - * will always get the same result for a given percentage, and higher percentages should always be a superset of the - * lower percentage users. - * - * RandomRecipient can be used to get a random value for every call. - * - * @param feature the feature name to test - * @param recipient the recipient to base a decision on - * @return true if the feature is available, false otherwise - */ - public static boolean isFeatureAvailable(String feature, Recipient recipient) { - if (earlybirdDecider == Decider$.MODULE$.NullDecider()) { - LOG.warn("EarlybirdDecider is uninitialized but requested feature " + feature); - } - - return earlybirdDecider.isAvailable(feature, Some.apply(recipient)); - } - - /** - * Get the raw decider value for a given feature. 
- * - * @param feature the feature name - * @return the integer value of the decider - */ - public static int getAvailability(String feature) { - return DeciderUtil.getAvailability(earlybirdDecider, feature); - } - - public static Decider getDecider() { - checkInitialized(); - return earlybirdDecider; - } - - public static MutableDecisionMaker getMutableDecisionMaker() { - checkInitialized(); - return mutableDecisionMaker; - } - - private static void checkInitialized() { - Preconditions.checkState(earlybirdDecider != Decider$.MODULE$.NullDecider(), - "EarlybirdDecider is not initialized."); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/EarlybirdSearchResultUtil.java b/src/java/com/twitter/search/earlybird/util/EarlybirdSearchResultUtil.java deleted file mode 100644 index 22af0d8b3..000000000 --- a/src/java/com/twitter/search/earlybird/util/EarlybirdSearchResultUtil.java +++ /dev/null @@ -1,182 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.List; -import java.util.Map; -import java.util.Set; - -import javax.annotation.Nullable; - -import com.google.common.collect.ImmutableMap; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.database.DatabaseConfig; -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.common.util.earlybird.ResultsUtil; -import com.twitter.search.common.util.earlybird.ThriftSearchResultUtil; -import com.twitter.search.common.util.earlybird.ThriftSearchResultsRelevanceStatsUtil; -import com.twitter.search.core.earlybird.facets.LanguageHistogram; -import com.twitter.search.earlybird.partition.PartitionConfig; -import com.twitter.search.earlybird.search.Hit; -import com.twitter.search.earlybird.search.SearchResultsInfo; -import com.twitter.search.earlybird.search.SimpleSearchResults; -import com.twitter.search.earlybird.search.relevance.RelevanceSearchResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultDebugInfo; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -// EarlybirdSearchResultUtil contains some simple static methods for constructing -// ThriftSearchResult objects. -public final class EarlybirdSearchResultUtil { - public static final double MIN_LANGUAGE_RATIO_TO_KEEP = 0.002; - - private EarlybirdSearchResultUtil() { } - - /** - * Update result stats on the ThriftSearchResult. - */ - public static void setResultStatistics(ThriftSearchResults results, SearchResultsInfo info) { - results.setNumHitsProcessed(info.getNumHitsProcessed()); - results.setNumPartitionsEarlyTerminated(info.isEarlyTerminated() ? 1 : 0); - if (info.isSetSearchedStatusIDs()) { - results.setMaxSearchedStatusID(info.getMaxSearchedStatusID()); - results.setMinSearchedStatusID(info.getMinSearchedStatusID()); - } - - if (info.isSetSearchedTimes()) { - results.setMaxSearchedTimeSinceEpoch(info.getMaxSearchedTime()); - results.setMinSearchedTimeSinceEpoch(info.getMinSearchedTime()); - } - } - - /** - * Create an EarlyTerminationInfo based on information inside a SearchResultsInfo. 
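A minimal guard sketch for the EarlybirdDecider singleton above. The decider key and the two branch helpers are hypothetical names, not keys or methods from this patch.

```java
// Hypothetical decider guard around a code path ("use_new_relevance_path" is a made-up key).
void runRelevancePath() {
  if (EarlybirdDecider.isFeatureAvailable("use_new_relevance_path")) {
    executeNewRelevancePath();     // assumed helper
  } else {
    executeLegacyRelevancePath();  // assumed helper
  }
}
```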
- */ - public static EarlyTerminationInfo prepareEarlyTerminationInfo(SearchResultsInfo info) { - EarlyTerminationInfo earlyTerminationInfo = new EarlyTerminationInfo(info.isEarlyTerminated()); - if (info.isEarlyTerminated()) { - earlyTerminationInfo.setEarlyTerminationReason(info.getEarlyTerminationReason()); - } - return earlyTerminationInfo; - } - - /** - * Populate language histogram inside ThriftSerachResults. - */ - public static void setLanguageHistogram(ThriftSearchResults results, - LanguageHistogram languageHistogram) { - int sum = 0; - for (int value : languageHistogram.getLanguageHistogram()) { - sum += value; - } - if (sum == 0) { - return; - } - ImmutableMap.Builder builder = ImmutableMap.builder(); - int threshold = (int) (sum * MIN_LANGUAGE_RATIO_TO_KEEP); - for (Map.Entry entry : languageHistogram.getLanguageHistogramAsMap() - .entrySet()) { - if (entry.getValue() > threshold) { - builder.put(entry.getKey(), entry.getValue()); - } - } - Map langCounts = builder.build(); - if (langCounts.size() > 0) { - results.setLanguageHistogram(langCounts); - } - } - - private static void addDebugInfoToResults(List resultArray, - @Nullable PartitionConfig partitionConfig) { - if (partitionConfig == null) { - return; - } - ThriftSearchResultDebugInfo debugInfo = new ThriftSearchResultDebugInfo(); - debugInfo.setHostname(DatabaseConfig.getLocalHostname()); - // These info can also come from EarlybirdServer.get().getPartitionConfig() if we add such a - // getter for partitionConfig(). - debugInfo.setPartitionId(partitionConfig.getIndexingHashPartitionID()); - debugInfo.setTiername(partitionConfig.getTierName()); - debugInfo.setClusterName(partitionConfig.getClusterName()); - - for (ThriftSearchResult result : resultArray) { - result.setDebugInfo(debugInfo); - } - } - - /** - * Write results into the result array. - * @param resultArray the result array to write into. - * @param hits the hits from the search. - * @param partitionConfig partition config used to fill in debug info. Pass in null if no debug - * info should be written into results. - */ - public static void prepareResultsArray(List resultArray, - SimpleSearchResults hits, - @Nullable PartitionConfig partitionConfig) { - for (int i = 0; i < hits.numHits(); i++) { - final Hit hit = hits.getHit(i); - final long id = hit.getStatusID(); - final ThriftSearchResult result = new ThriftSearchResult(id); - final ThriftSearchResultMetadata resultMetadata = hit.getMetadata(); - result.setMetadata(resultMetadata); - resultArray.add(result); - } - addDebugInfoToResults(resultArray, partitionConfig); - } - - /** - * Write results into the result array. - * @param resultArray the result array to write into. - * @param hits the hits from the search. - * @param userIDWhitelist Used to set flag ThriftSearchResultMetadata.dontFilterUser. - * @param partitionConfig partition config used to fill in debug info. Pass in null if no debug - * info should be written into results. 
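To make the cutoff in setLanguageHistogram above concrete: with 50,000 counted tweets the threshold is (int) (50,000 * 0.002) = 100, and because the comparison is strictly greater than the threshold, a language is kept only if it accounts for at least 101 tweets.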
- */ - public static void prepareRelevanceResultsArray(List resultArray, - RelevanceSearchResults hits, - Set userIDWhitelist, - @Nullable PartitionConfig partitionConfig) { - for (int i = 0; i < hits.numHits(); i++) { - final long id = hits.getHit(i).getStatusID(); - final ThriftSearchResult result = new ThriftSearchResult(id); - final ThriftSearchResultMetadata resultMetadata = hits.resultMetadata[i]; - result.setMetadata(resultMetadata); - if (userIDWhitelist != null) { - resultMetadata.setDontFilterUser(userIDWhitelist.contains(resultMetadata.getFromUserId())); - } - - resultArray.add(result); - } - addDebugInfoToResults(resultArray, partitionConfig); - } - - /** - * Merge a List of ThriftSearchResults into a single ThriftSearchResults object. - */ - public static ThriftSearchResults mergeSearchResults(List allSearchResults) { - ThriftSearchResults mergedResults = new ThriftSearchResults(); - mergedResults.setRelevanceStats(new ThriftSearchResultsRelevanceStats()); - - mergedResults.setHitCounts(ResultsUtil.aggregateCountMap(allSearchResults, - ThriftSearchResultUtil.HIT_COUNTS_MAP_GETTER)); - - mergedResults.setLanguageHistogram(ResultsUtil.aggregateCountMap(allSearchResults, - ThriftSearchResultUtil.LANG_MAP_GETTER)); - - for (ThriftSearchResults searchResults : allSearchResults) { - // Add results - mergedResults.getResults().addAll(searchResults.getResults()); - // Update counts - ThriftSearchResultUtil.incrementCounts(mergedResults, searchResults); - // Update relevance stats - if (searchResults.getRelevanceStats() != null) { - ThriftSearchResultsRelevanceStatsUtil.addRelevanceStats(mergedResults.getRelevanceStats(), - searchResults.getRelevanceStats()); - } - } - - return mergedResults; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/FieldTermCounter.java b/src/java/com/twitter/search/earlybird/util/FieldTermCounter.java deleted file mode 100644 index 29cece148..000000000 --- a/src/java/com/twitter/search/earlybird/util/FieldTermCounter.java +++ /dev/null @@ -1,304 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.Calendar; -import java.util.Collections; -import java.util.Map; -import java.util.TimeZone; -import java.util.concurrent.atomic.AtomicInteger; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.commons.lang.mutable.MutableInt; -import org.apache.commons.lang.mutable.MutableLong; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchLongGauge; - -/** - * This class is used to count how many times a field happens in hourly and daily stats. - * It is used by TermCountMonitor for iterating all fields in the index. - * - * There is one exception that this class is also used to count the number of tweets in the index. - * Under the situation, the passed in fieldName would be empty string (as TWEET_COUNT_KEY). - */ -public class FieldTermCounter { - private static final Logger LOG = LoggerFactory.getLogger(FieldTermCounter.class); - - static final TimeZone TIME_ZONE = TimeZone.getTimeZone("GMT"); - static final String TWEET_COUNT_KEY = ""; - - private final String fieldName; - private final int instanceCounter; - - // The first date in format "YYYYMMDDHH" that we want to check counts for. - private final int startCheckHour; - // The last date in format "YYYYMMDDHH" that we want to check counts for. 
- private final int endCheckHour; - // Smallest number of docs we expect to have for each hour. - private final int hourlyMinCount; - //Smallest number of docs we expect to have for each day. - private final int dailyMinCount; - - // Count of tweets for each day, keyed of by the hour in the format "YYYYMMDD". - private final Map exportedHourlyCounts; - - // Count of tweets for each day, keyed of by the day in the format "YYYYMMDD". - private final Map dailyCounts; - - // Only export hourly stats that are below minimum threshold. - private final Map exportedStats; - - private final SearchLongGauge hoursWithNoTweetsStat; - private final SearchLongGauge daysWithNoTweetsStat; - - public FieldTermCounter( - String fieldName, - int instanceCounter, - int startCheckHour, - int endCheckHour, - int hourlyMinCount, - int dailyMinCount) { - this.fieldName = fieldName; - this.instanceCounter = instanceCounter; - this.startCheckHour = startCheckHour; - this.endCheckHour = endCheckHour; - this.hourlyMinCount = hourlyMinCount; - this.dailyMinCount = dailyMinCount; - this.exportedHourlyCounts = Maps.newHashMap(); - this.dailyCounts = Maps.newHashMap(); - this.exportedStats = Maps.newHashMap(); - - this.hoursWithNoTweetsStat = SearchLongGauge.export(getAggregatedNoTweetStatName(true)); - this.daysWithNoTweetsStat = SearchLongGauge.export(getAggregatedNoTweetStatName(false)); - } - - /** - * Updates the stats exported by this class based on the new counts provided in the given map. - */ - public void runWithNewCounts(Map newCounts) { - dailyCounts.clear(); - - // See go/rb/813442/#comment2566569 - // 1. Update all existing hours - updateExistingHourlyCounts(newCounts); - - // 2. Add and export all new hours - addAndExportNewHourlyCounts(newCounts); - - // 3. fill in all the missing hours between know min and max days. - fillMissingHourlyCounts(); - - // 4. Export as a stat, how many hours don't have any tweets (i.e. <= 0) - exportMissingTweetStats(); - } - - // Input: - // . the new hourly count map in the current iteration - // . the existing hourly count map before the current iteration - // If the hourly key matches from the new hourly map to the existing hourly count map, update - // the value of the existing hourly count map to the value from the new hourly count map. - private void updateExistingHourlyCounts(Map newCounts) { - for (Map.Entry exportedCount : exportedHourlyCounts.entrySet()) { - Integer date = exportedCount.getKey(); - AtomicInteger exportedCountValue = exportedCount.getValue(); - - MutableInt newCount = newCounts.get(date); - if (newCount == null) { - exportedCountValue.set(0); - } else { - exportedCountValue.set(newCount.intValue()); - // clean up so that we don't check this date again when we look for new hours - newCounts.remove(date); - } - } - } - - // Input: - // . the new hourly count map in the current iteration - // . the existing hourly count map before the current iteration - // This function is called after the above function of updateExistingHourlyCounts() so that all - // matching key value pairs have been removed from the new hourly count map. - // Move all remaining valid values from the new hourly count map to the existing hourly count - // map. 
- private void addAndExportNewHourlyCounts(Map newCounts) { - for (Map.Entry newCount : newCounts.entrySet()) { - Integer hour = newCount.getKey(); - MutableInt newCountValue = newCount.getValue(); - Preconditions.checkState(!exportedHourlyCounts.containsKey(hour), - "Should have already processed and removed existing hours: " + hour); - - AtomicInteger newStat = new AtomicInteger(newCountValue.intValue()); - exportedHourlyCounts.put(hour, newStat); - } - } - - // Find whether the existing hourly count map has hourly holes. If such holes exist, fill 0 - // values so that they can be exported. - private void fillMissingHourlyCounts() { - // Figure out the time range for which we should have tweets in the index. At the very least, - // this range should cover [startCheckHour, endCheckHour) if endCheckHour is set, or - // [startCheckHour, latestHourInTheIndexWithTweets] if endCheckHour is not set (latest tier or - // realtime cluster). - int startHour = startCheckHour; - int endHour = endCheckHour < getHourValue(Calendar.getInstance(TIME_ZONE)) ? endCheckHour : -1; - for (int next : exportedHourlyCounts.keySet()) { - if (next < startHour) { - startHour = next; - } - if (next > endHour) { - endHour = next; - } - } - - Calendar endHourCal = getCalendarValue(endHour); - Calendar hour = getCalendarValue(startHour); - for (; hour.before(endHourCal); hour.add(Calendar.HOUR_OF_DAY, 1)) { - int hourValue = getHourValue(hour); - if (!exportedHourlyCounts.containsKey(hourValue)) { - exportedHourlyCounts.put(hourValue, new AtomicInteger(0)); - } - } - } - - private void exportMissingTweetStats() { - int hoursWithNoTweets = 0; - int daysWithNoTweets = 0; - - for (Map.Entry hourlyCount : exportedHourlyCounts.entrySet()) { - int hour = hourlyCount.getKey(); - if ((hour < startCheckHour) || (hour >= endCheckHour)) { - continue; - } - - // roll up the days - int day = hour / 100; - MutableLong dayCount = dailyCounts.get(day); - if (dayCount == null) { - dailyCounts.put(day, new MutableLong(hourlyCount.getValue().get())); - } else { - dayCount.setValue(dayCount.longValue() + hourlyCount.getValue().get()); - } - AtomicInteger exportedCountValue = hourlyCount.getValue(); - if (exportedCountValue.get() <= hourlyMinCount) { - // We do not export hourly too few tweets for index fields as it can 10x the existing - // exported stats. - // We might consider whitelisting some high frequency fields later. - if (isFieldForTweet()) { - String statsName = getStatName(hourlyCount.getKey()); - SearchLongGauge stat = SearchLongGauge.export(statsName); - stat.set(exportedCountValue.longValue()); - exportedStats.put(statsName, stat); - } - LOG.warn("Found an hour with too few tweets. Field: <{}> Hour: {} count: {}", - fieldName, hour, exportedCountValue); - hoursWithNoTweets++; - } - } - - for (Map.Entry dailyCount : dailyCounts.entrySet()) { - if (dailyCount.getValue().longValue() <= dailyMinCount) { - LOG.warn("Found a day with too few tweets. Field: <{}> Day: {} count: {}", - fieldName, dailyCount.getKey(), dailyCount.getValue()); - daysWithNoTweets++; - } - } - - hoursWithNoTweetsStat.set(hoursWithNoTweets); - daysWithNoTweetsStat.set(daysWithNoTweets); - } - - // When the fieldName is empty string (as TWEET_COUNT_KEY), it means that we are counting the - // number of tweets for the index, not for some specific fields. 
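A worked example of the integer hour keys described above (format "YYYYMMDDHH") and of the daily roll-up performed in exportMissingTweetStats:

```java
// 2023-03-31 17:00 UTC encodes as an int hour key:
int hourKey = 2023 * 1000000   // year
            + 3 * 10000        // month (1-based in the key)
            + 31 * 100         // day of month
            + 17;              // hour of day
// hourKey == 2023033117; the daily roll-up key is hourKey / 100 == 20230331 ("YYYYMMDD").
```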
- private boolean isFieldForTweet() { - return TWEET_COUNT_KEY.equals(fieldName); - } - - private String getAggregatedNoTweetStatName(boolean hourly) { - if (isFieldForTweet()) { - if (hourly) { - return "hours_with_no_indexed_tweets_v_" + instanceCounter; - } else { - return "days_with_no_indexed_tweets_v_" + instanceCounter; - } - } else { - if (hourly) { - return "hours_with_no_indexed_fields_v_" + fieldName + "_" + instanceCounter; - } else { - return "days_with_no_indexed_fields_v_" + fieldName + "_" + instanceCounter; - } - } - } - - @VisibleForTesting - String getStatName(Integer date) { - return getStatName(fieldName, instanceCounter, date); - } - - @VisibleForTesting - static String getStatName(String field, int instance, Integer date) { - if (TWEET_COUNT_KEY.equals(field)) { - return "tweets_indexed_on_hour_v_" + instance + "_" + date; - } else { - return "tweets_indexed_on_hour_v_" + instance + "_" + field + "_" + date; - } - } - - @VisibleForTesting - Map getExportedCounts() { - return Collections.unmodifiableMap(exportedHourlyCounts); - } - - @VisibleForTesting - Map getDailyCounts() { - return Collections.unmodifiableMap(dailyCounts); - } - - @VisibleForTesting - long getHoursWithNoTweets() { - return hoursWithNoTweetsStat.get(); - } - - @VisibleForTesting - long getDaysWithNoTweets() { - return daysWithNoTweetsStat.get(); - } - - @VisibleForTesting - Map getExportedHourlyCountStats() { - return exportedStats; - } - - /** - * Given a unit time in seconds since epoch UTC, will return the day in format "YYYYMMDDHH" - * as an int. - */ - @VisibleForTesting - static int getHourValue(Calendar cal, int timeSecs) { - cal.setTimeInMillis(timeSecs * 1000L); - return getHourValue(cal); - } - - static int getHourValue(Calendar cal) { - int year = cal.get(Calendar.YEAR) * 1000000; - int month = (cal.get(Calendar.MONTH) + 1) * 10000; // month is 0-based - int day = cal.get(Calendar.DAY_OF_MONTH) * 100; - int hour = cal.get(Calendar.HOUR_OF_DAY); - return year + month + day + hour; - } - - @VisibleForTesting - static Calendar getCalendarValue(int hour) { - Calendar cal = Calendar.getInstance(TIME_ZONE); - - int year = hour / 1000000; - int month = ((hour / 10000) % 100) - 1; // 0-based - int day = (hour / 100) % 100; - int hr = hour % 100; - cal.setTimeInMillis(0); // reset all time fields - cal.set(year, month, day, hr, 0); - return cal; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/Histogram.java b/src/java/com/twitter/search/earlybird/util/Histogram.java deleted file mode 100644 index ccf40a64e..000000000 --- a/src/java/com/twitter/search/earlybird/util/Histogram.java +++ /dev/null @@ -1,160 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.ArrayList; -import java.util.Arrays; -import java.util.List; - -import com.google.common.base.Preconditions; - -/** - * A histogram of int values with arbitrary buckets. - * Keeps a count for each bucket, and a sum of values for each bucket. - * The histogram view is returned as a list of {@link Histogram.Entry}s. - *

- * Bucket boundaries are inclusive on the upper boundaries. Given buckets of [0, 10, 100], - * items will be placed in 4 bins, { X <= 0, 0 < X <= 10, 10 < X <= 100, X > 100 }. - *

    - * This class is not thread safe. - * - */ -public class Histogram { - private final double[] buckets; - private final int[] itemsCount; - private final long[] itemsSum; - private int totalCount; - private long totalSum; - - public static class Entry { - private final String bucketName; - private final int count; - private final double countPercent; - private final double countCumulative; - private final long sum; - private final double sumPercent; - private final double sumCumulative; - - Entry(String bucketName, - int count, double countPercent, double countCumulative, - long sum, double sumPercent, double sumCumulative) { - this.bucketName = bucketName; - this.count = count; - this.countPercent = countPercent; - this.countCumulative = countCumulative; - this.sum = sum; - this.sumPercent = sumPercent; - this.sumCumulative = sumCumulative; - } - - public String getBucketName() { - return bucketName; - } - - public int getCount() { - return count; - } - - public double getCountPercent() { - return countPercent; - } - - public double getCountCumulative() { - return countCumulative; - } - - public long getSum() { - return sum; - } - - public double getSumPercent() { - return sumPercent; - } - - public double getSumCumulative() { - return sumCumulative; - } - } - - /** - * No buckets will put all items into a single bin. - * @param buckets the buckets to use for binnning data. - * An item will be put in bin i if item <= buckets[i] and > buckets[i-1] - * The bucket values must be strictly increasing. - */ - public Histogram(double... buckets) { - Preconditions.checkNotNull(buckets); - this.buckets = new double[buckets.length]; - for (int i = 0; i < buckets.length; i++) { - this.buckets[i] = buckets[i]; - if (i > 0) { - Preconditions.checkState(this.buckets[i - 1] < this.buckets[i], - "Histogram buckets must me strictly increasing: " + Arrays.toString(buckets)); - } - } - this.itemsCount = new int[buckets.length + 1]; - this.itemsSum = new long[buckets.length + 1]; - this.totalCount = 0; - this.totalSum = 0; - } - - /** - * Add the given item to the appropriate bucket. - */ - public void addItem(double item) { - int i = 0; - for (; i < this.buckets.length; i++) { - if (item <= buckets[i]) { - break; - } - } - this.itemsCount[i]++; - this.totalCount++; - this.itemsSum[i] += item; - this.totalSum += item; - } - - /** - * returns the current view of all the bins. - */ - public List entries() { - List entries = new ArrayList<>(this.itemsCount.length); - double countCumulative = 0; - double sumCumulative = 0; - for (int i = 0; i < this.itemsCount.length; i++) { - String bucketName; - if (i < this.buckets.length) { - bucketName = "<= " + this.buckets[i]; - } else if (this.buckets.length > 0) { - bucketName = " > " + this.buckets[this.buckets.length - 1]; - } else { - bucketName = " * "; - } - - int count = this.itemsCount[i]; - double countPercent = this.totalCount == 0 ? 0 : ((double) this.itemsCount[i]) / totalCount; - countCumulative += countPercent; - - long sum = this.itemsSum[i]; - double sumPercent = this.totalSum == 0 ? 0 : ((double) this.itemsSum[i]) / totalSum; - sumCumulative += sumPercent; - - Entry e = new Entry(bucketName, count, countPercent, countCumulative, - sum, sumPercent, sumCumulative); - entries.add(e); - } - return entries; - } - - /** - * Returns total number of items seen. - */ - public int getTotalCount() { - return totalCount; - } - - /** - * Returns sum of all the items seen. 
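A usage sketch for the Histogram class above. Generic parameters are elided in this patch view; entries() is assumed to return a List of Histogram.Entry, as the implementation indicates.

```java
// Hypothetical example: bin request latencies (in ms) into <= 0, (0, 10], (10, 100], > 100.
public static void main(String[] args) {
  Histogram latencyHistogram = new Histogram(0, 10, 100);
  latencyHistogram.addItem(3);    // falls in (0, 10]
  latencyHistogram.addItem(10);   // upper boundaries are inclusive, so also (0, 10]
  latencyHistogram.addItem(250);  // falls in > 100
  for (Histogram.Entry entry : latencyHistogram.entries()) {
    System.out.printf("%s count=%d sum=%d%n",
        entry.getBucketName(), entry.getCount(), entry.getSum());
  }
}
```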
- */ - public long getTotalSum() { - return totalSum; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/IndexViewer.java b/src/java/com/twitter/search/earlybird/util/IndexViewer.java deleted file mode 100644 index d8966a611..000000000 --- a/src/java/com/twitter/search/earlybird/util/IndexViewer.java +++ /dev/null @@ -1,798 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.io.IOException; -import java.io.PrintWriter; -import java.io.UnsupportedEncodingException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; -import java.util.List; -import java.util.Locale; -import java.util.Set; -import java.util.TreeSet; - -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Lists; - -import org.apache.lucene.index.IndexOptions; -import org.apache.lucene.index.NumericDocValues; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.util.BytesRef; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftCSFType; -import com.twitter.search.common.util.analysis.IntTermAttributeImpl; -import com.twitter.search.common.util.analysis.LongTermAttributeImpl; -import com.twitter.search.common.util.analysis.SortableLongTermAttributeImpl; -import com.twitter.search.common.util.spatial.GeoUtil; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.inverted.MPHTermDictionary; -import com.twitter.search.core.earlybird.index.inverted.RealtimeIndexTerms; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; - -import geo.google.datamodel.GeoCoordinate; - -public class IndexViewer { - /** - * Fields whose terms are indexed using - * {@link com.twitter.search.common.util.analysis.IntTermAttribute} - */ - private static final Set INT_TERM_ATTRIBUTE_FIELDS = ImmutableSet.of( - EarlybirdFieldConstant.CREATED_AT_FIELD.getFieldName(), - EarlybirdFieldConstant.LINK_CATEGORY_FIELD.getFieldName(), - EarlybirdFieldConstant - .NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName(), - EarlybirdFieldConstant - .NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName(), - EarlybirdFieldConstant - .NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD.getFieldName(), - EarlybirdFieldConstant.COMPOSER_SOURCE.getFieldName()); - - /** - * Fields whose terms are indexed using - * {@link com.twitter.search.common.util.analysis.LongTermAttribute} - */ - private static final Set LONG_TERM_ATTRIBUTE_FIELDS = ImmutableSet.of( - EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.LIKED_BY_USER_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.QUOTED_TWEET_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.QUOTED_USER_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID.getFieldName(), - EarlybirdFieldConstant.RETWEETED_BY_USER_ID.getFieldName(), - EarlybirdFieldConstant.DIRECTED_AT_USER_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), - 
EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.IN_REPLY_TO_USER_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.RETWEET_SOURCE_TWEET_ID_FIELD.getFieldName(), - EarlybirdFieldConstant.RETWEET_SOURCE_USER_ID_FIELD.getFieldName()); - - /** - * Fields whose terms index using SORTED - * {@link com.twitter.search.common.util.analysis.LongTermAttribute} - */ - private static final Set SORTED_LONG_TERM_ATTRIBUTE_FIELDS = - ImmutableSet.of(EarlybirdFieldConstant.ID_FIELD.getFieldName()); - - private final EarlybirdSingleSegmentSearcher searcher; - private final EarlybirdIndexSegmentAtomicReader twitterReader; - - public long getTimeSliceId() { - return searcher.getTimeSliceID(); - } - - public static class Options { - private boolean dumpHexTerms = false; - private String charset; - private double[] histogramBuckets; - private boolean termLengthHistogram; - - public Options setDumpHexTerms(boolean dumpHexTermsParam) { - this.dumpHexTerms = dumpHexTermsParam; - return this; - } - - public Options setCharset(String charsetParam) { - this.charset = charsetParam; - return this; - } - - public Options setHistogramBuckets(double[] histogramBucketsParam) { - this.histogramBuckets = histogramBucketsParam; - return this; - } - - public Options setTermLengthHistogram(boolean termLengthHistogramParam) { - this.termLengthHistogram = termLengthHistogramParam; - return this; - } - } - - /** - * Data Transfer Object for Terms, encapsulates the "json" serialization - * while maintaining streaming mode - */ - private static class TermDto { - - private final String field; - private final String term; - private final String docFreq; - private final String percent; - private final PostingsEnum docsEnum; - private final TermsEnum termsEnum; - private final Integer maxDocs; - - public TermDto(String field, String term, String docFreq, String percent, - PostingsEnum docsEnum, TermsEnum termsEnum, Integer maxDocs) { - this.field = field; - this.term = term; - this.docFreq = docFreq; - this.percent = percent; - this.docsEnum = docsEnum; - this.termsEnum = termsEnum; - this.maxDocs = maxDocs; - } - - public void write(ViewerWriter writer, - EarlybirdIndexSegmentAtomicReader twitterReader) throws IOException { - writer.beginObject(); - writer.name("field").value(field); - writer.name("term").value(term); - writer.name("docFreq").value(docFreq); - writer.name("percent").value(percent); - if (docsEnum != null) { - appendFrequencyAndPositions(writer, field, docsEnum, twitterReader); - } - if (maxDocs != null) { - appendDocs(writer, termsEnum, maxDocs, twitterReader); - } - writer.endObject(); - } - } - - /** - * Data Transfer Object for Terms, encapsulates the "json" serialization - * while maintaining streaming mode - */ - private static class StatsDto { - - private final String field; - private final String numTerms; - private final String terms; - - - public StatsDto(String field, String numTerms, String terms) { - this.field = field; - this.numTerms = numTerms; - this.terms = terms; - } - - public void write(ViewerWriter writer) throws IOException { - writer.beginObject(); - - writer.name("field").value(field); - writer.name("numTerms").value(numTerms); - writer.name("terms").value(terms); - - writer.endObject(); - } - } - - public IndexViewer(EarlybirdSingleSegmentSearcher searcher) { - this.searcher = searcher; - this.twitterReader = searcher.getTwitterIndexReader(); - } - - private boolean shouldSeekExact(Terms terms, TermsEnum termsEnum) { - return terms instanceof 
RealtimeIndexTerms - || termsEnum instanceof MPHTermDictionary.MPHTermsEnum; - } - - /** - * Dumps all terms for a given tweet id. - * @param writer writer being used - * @param tweetId the tweet id to use - */ - public void dumpTweetDataByTweetId(ViewerWriter writer, long tweetId, Options options) - throws IOException { - int docId = twitterReader.getSegmentData().getDocIDToTweetIDMapper().getDocID(tweetId); - dumpTweetDataByDocId(writer, docId, options); - } - - /** - * Dumps all terms for a given doc id. - * @param writer writer being used - * @param docId the document id to use. - */ - public void dumpTweetDataByDocId(ViewerWriter writer, int docId, Options options) - throws IOException { - writer.beginObject(); - - printHeader(writer); - long tweetID = twitterReader.getSegmentData().getDocIDToTweetIDMapper().getTweetID(docId); - if (docId < twitterReader.maxDoc() && tweetID >= 0) { - writer.name("docId").value(Integer.toString(docId)); - writer.name("tweetId").value(Long.toString(tweetID)); - dumpIndexedFields(writer, docId, options); - dumpCsfFields(writer, docId); - } - writer.endObject(); - } - - /** - * Dumps all tweet IDs in the current segment to the given file. - */ - public void dumpTweetIds(ViewerWriter writer, String logFile, PrintWriter logWriter) - throws IOException { - writeTweetIdsToLogFile(logWriter); - - writer.beginObject(); - writer.name(Long.toString(searcher.getTimeSliceID())).value(logFile); - writer.endObject(); - } - - private void writeTweetIdsToLogFile(PrintWriter logWriter) { - DocIDToTweetIDMapper mapper = twitterReader.getSegmentData().getDocIDToTweetIDMapper(); - int docId = Integer.MIN_VALUE; - while ((docId = mapper.getNextDocID(docId)) != DocIDToTweetIDMapper.ID_NOT_FOUND) { - long tweetId = mapper.getTweetID(docId); - - // Ensure tweet ID is valid and non-deleted - if ((tweetId > 0) && !twitterReader.getDeletesView().isDeleted(docId)) { - logWriter.println(tweetId); - } - } - } - - private void dumpIndexedFields(ViewerWriter writer, int docId, - Options options) throws IOException { - writer.name("indexedFields"); - writer.beginArray(); - writer.newline(); - for (String field : sortedFields()) { - dumpTweetData(writer, field, docId, options); - } - writer.endArray(); - writer.newline(); - } - - private void dumpCsfFields(ViewerWriter writer, int docId) throws IOException { - writer.name("csfFields"); - writer.beginArray(); - writer.newline(); - dumpCSFData(writer, docId); - - writer.endArray(); - } - - /** - * Dumps all CSF values for a given doc id. - * @param writer writer being used - * @param docId the document id to use. 
- */ - private void dumpCSFData(ViewerWriter writer, int docId) throws IOException { - Schema tweetSchema = twitterReader.getSchema(); - - // Sort the FieldInfo objects to generate fixed order to make testing easier - List sortedFieldInfos = new ArrayList<>(tweetSchema.getFieldInfos()); - sortedFieldInfos.sort(Comparator.comparing(Schema.FieldInfo::getFieldId)); - - for (Schema.FieldInfo fieldInfo: sortedFieldInfos) { - String csfFieldInfoName = fieldInfo.getName(); - ThriftCSFType csfType = tweetSchema.getCSFFieldType(csfFieldInfoName); - NumericDocValues csfDocValues = twitterReader.getNumericDocValues(csfFieldInfoName); - // If twitterReader.getNumericDocValues(value.getName()) == null, - // means no NumericDocValue was indexed for the field so ignore - if (csfType != null && csfDocValues != null && csfDocValues.advanceExact(docId)) { - long csfValue = csfDocValues.longValue(); - writer.beginObject(); - writer.name("field").value(formatField(csfFieldInfoName)); - writer.name("value"); - if (csfFieldInfoName.equals(EarlybirdFieldConstant.LAT_LON_CSF_FIELD.getFieldName())) { - writer.value(latlongDecode(csfValue)); - } else if (csfFieldInfoName.equals(EarlybirdFieldConstant.LANGUAGE.getFieldName())) { - writer.value(languageDecode(csfValue)); - } else if (csfFieldInfoName.equals(EarlybirdFieldConstant.CARD_LANG_CSF.getFieldName())) { - writer.value(languageDecode(csfValue)); - } else { - writer.value(Long.toString(csfValue)); - } - writer.endObject(); - writer.newline(); - } - } - } - - /** - * Decipher long value gotten, put into format (lat, lon) - * Decode the stored long value by creating a geocode - */ - private String latlongDecode(long csfValue) { - StringBuilder sb = new StringBuilder(); - GeoCoordinate geoCoordinate = new GeoCoordinate(); - if (GeoUtil.decodeLatLonFromInt64(csfValue, geoCoordinate)) { - sb.append(geoCoordinate.getLatitude()).append(", ").append(geoCoordinate.getLongitude()); - } else { - sb.append(csfValue).append(" (Value Unset or Invalid Coordinate)"); - } - return sb.toString(); - } - - /** - * Decipher long value gotten into string of tweet's language - */ - private String languageDecode(long csfValue) { - StringBuilder sb = new StringBuilder(); - ThriftLanguage languageType = ThriftLanguage.findByValue((int) csfValue); - sb.append(csfValue).append(" (").append(languageType).append(")"); - return sb.toString(); - } - - private void dumpTweetData(ViewerWriter writer, - String field, - int docId, - Options options) throws IOException { - - Terms terms = twitterReader.terms(field); - if (terms != null) { - TermsEnum termsEnum = terms.iterator(); - if (shouldSeekExact(terms, termsEnum)) { - long numTerms = terms.size(); - for (int i = 0; i < numTerms; i++) { - termsEnum.seekExact(i); - dumpTweetDataTerm(writer, field, termsEnum, docId, options); - } - } else { - while (termsEnum.next() != null) { - dumpTweetDataTerm(writer, field, termsEnum, docId, options); - } - } - } - } - - private void dumpTweetDataTerm(ViewerWriter writer, String field, TermsEnum termsEnum, - int docId, Options options) throws IOException { - PostingsEnum docsAndPositionsEnum = termsEnum.postings(null, PostingsEnum.ALL); - if (docsAndPositionsEnum != null && docsAndPositionsEnum.advance(docId) == docId) { - printTerm(writer, field, termsEnum, docsAndPositionsEnum, null, options); - } - } - - /** - * Prints the histogram for the currently viewed index. 
- * @param writer current viewerWriter - * @param field if null, will use all fields - * @param options options for dumping out text - */ - public void dumpHistogram(ViewerWriter writer, String field, Options options) throws IOException { - writer.beginObject(); - printHeader(writer); - writer.name("histogram"); - writer.beginArray(); - writer.newline(); - if (field == null) { - for (String field2 : sortedFields()) { - dumpFieldHistogram(writer, field2, options); - } - } else { - dumpFieldHistogram(writer, field, options); - } - writer.endArray(); - writer.endObject(); - } - - private void dumpFieldHistogram(ViewerWriter writer, String field, Options options) - throws IOException { - Histogram histo = new Histogram(options.histogramBuckets); - - Terms terms = twitterReader.terms(field); - if (terms != null) { - TermsEnum termsEnum = terms.iterator(); - if (shouldSeekExact(terms, termsEnum)) { - long numTerms = terms.size(); - for (int i = 0; i < numTerms; i++) { - termsEnum.seekExact(i); - countHistogram(options, histo, termsEnum); - } - } else { - while (termsEnum.next() != null) { - countHistogram(options, histo, termsEnum); - } - } - printHistogram(writer, field, options, histo); - } - } - - private void printHistogram(ViewerWriter writer, String field, Options options, - Histogram histo) throws IOException { - - String bucket = options.termLengthHistogram ? "termLength" : "df"; - for (Histogram.Entry histEntry : histo.entries()) { - String format = - String.format(Locale.US, - "field: %s %sBucket: %11s count: %10d " - + "percent: %6.2f%% cumulative: %6.2f%% totalCount: %10d" - + " sum: %15d percent: %6.2f%% cumulative: %6.2f%% totalSum: %15d", - formatField(field), - bucket, - histEntry.getBucketName(), - histEntry.getCount(), - histEntry.getCountPercent() * 100.0, - histEntry.getCountCumulative() * 100.0, - histo.getTotalCount(), - histEntry.getSum(), - histEntry.getSumPercent() * 100.0, - histEntry.getSumCumulative() * 100.0, - histo.getTotalSum() - ); - writer.value(format); - writer.newline(); - } - } - - private void countHistogram(Options options, Histogram histo, TermsEnum termsEnum) - throws IOException { - if (options.termLengthHistogram) { - final BytesRef bytesRef = termsEnum.term(); - histo.addItem(bytesRef.length); - } else { - histo.addItem(termsEnum.docFreq()); - } - } - - - /** - * Prints terms and optionally documents for the currently viewed index. - * @param writer writer being used - * @param field if null, will use all fields - * @param term if null will use all terms - * @param maxTerms will print at most this many terms per field. If null will print 0 terms. - * @param maxDocs will print at most this many documents, If null, will not print docs. 
- * @param options options for dumping out text - */ - public void dumpData(ViewerWriter writer, String field, String term, Integer maxTerms, - Integer maxDocs, Options options, boolean shouldSeekToTerm) throws IOException { - - writer.beginObject(); - printHeader(writer); - - writer.name("terms"); - writer.beginArray(); - writer.newline(); - dumpDataInternal(writer, field, term, maxTerms, maxDocs, options, shouldSeekToTerm); - writer.endArray(); - writer.endObject(); - } - - private void dumpDataInternal(ViewerWriter writer, String field, String term, Integer maxTerms, - Integer maxDocs, Options options, boolean shouldSeekToTerm) throws IOException { - - if (field == null) { - dumpDataForAllFields(writer, term, maxTerms, maxDocs, options); - return; - } - if (term == null) { - dumpDataForAllTerms(writer, field, maxTerms, maxDocs, options); - return; - } - Terms terms = twitterReader.terms(field); - if (terms != null) { - TermsEnum termsEnum = terms.iterator(); - TermsEnum.SeekStatus status = termsEnum.seekCeil(new BytesRef(term)); - if (status == TermsEnum.SeekStatus.FOUND) { - printTerm(writer, field, termsEnum, null, maxDocs, options); - } - if (shouldSeekToTerm) { - dumpTermsAfterSeek(writer, field, terms, maxTerms, maxDocs, options, termsEnum, status); - } - } - } - - /** - * if term (cursor) is found for an indexed segment - dump the next termsLeft words - * starting from the current position in the enum. For an indexed segment, - * seekCeil will place the enum at the word or the next "ceiling" term. For - * a realtime index, if the word is not found we do not paginate anything - * We also only paginate if the TermsEnum is not at the end. - */ - private void dumpTermsAfterSeek(ViewerWriter writer, String field, Terms terms, Integer maxTerms, - Integer maxDocs, Options options, TermsEnum termsEnum, TermsEnum.SeekStatus status) - throws IOException { - if (status != TermsEnum.SeekStatus.END) { - // for realtime, to not repeat the found word - if (shouldSeekExact(terms, termsEnum)) { - termsEnum.next(); - } - if (status != TermsEnum.SeekStatus.FOUND) { - // if not found, print out curr term before calling next() - printTerm(writer, field, termsEnum, null, maxDocs, options); - } - for (int termsLeft = maxTerms - 1; termsLeft > 0 && termsEnum.next() != null; termsLeft--) { - printTerm(writer, field, termsEnum, null, maxDocs, options); - } - } - } - - private void dumpDataForAllFields(ViewerWriter writer, String term, Integer maxTerms, - Integer maxDocs, Options options) throws IOException { - for (String field : sortedFields()) { - dumpDataInternal(writer, field, term, maxTerms, maxDocs, options, false); - } - } - - private List sortedFields() { - // Tweet facets are added to a special $facets field, which is not part of the schema. - // We include it here, because seeing the facets for a tweet is generally useful. - List fields = Lists.newArrayList("$facets"); - for (Schema.FieldInfo fieldInfo : twitterReader.getSchema().getFieldInfos()) { - if (fieldInfo.getFieldType().indexOptions() != IndexOptions.NONE) { - fields.add(fieldInfo.getName()); - } - } - Collections.sort(fields); - return fields; - } - - private void dumpDataForAllTerms(ViewerWriter writer, - String field, - Integer maxTerms, - Integer maxDocs, - Options options) throws IOException { - Terms terms = twitterReader.terms(field); - if (terms != null) { - TermsEnum termsEnum = terms.iterator(); - if (shouldSeekExact(terms, termsEnum)) { - long numTerms = terms.size(); - long termToDump = maxTerms == null ? 
0 : Math.min(numTerms, maxTerms); - for (int i = 0; i < termToDump; i++) { - termsEnum.seekExact(i); - printTerm(writer, field, termsEnum, null, maxDocs, options); - } - } else { - int max = maxTerms == null ? 0 : maxTerms; - while (max > 0 && termsEnum.next() != null) { - printTerm(writer, field, termsEnum, null, maxDocs, options); - max--; - } - } - } - } - - private String termToString(String field, BytesRef bytesTerm, Options options) - throws UnsupportedEncodingException { - if (INT_TERM_ATTRIBUTE_FIELDS.contains(field)) { - return Integer.toString(IntTermAttributeImpl.copyBytesRefToInt(bytesTerm)); - } else if (LONG_TERM_ATTRIBUTE_FIELDS.contains(field)) { - return Long.toString(LongTermAttributeImpl.copyBytesRefToLong(bytesTerm)); - } else if (SORTED_LONG_TERM_ATTRIBUTE_FIELDS.contains(field)) { - return Long.toString(SortableLongTermAttributeImpl.copyBytesRefToLong(bytesTerm)); - } else { - if (options != null && options.charset != null && !options.charset.isEmpty()) { - return new String(bytesTerm.bytes, bytesTerm.offset, bytesTerm.length, options.charset); - } else { - return bytesTerm.utf8ToString(); - } - } - } - - private void printTerm(ViewerWriter writer, String field, TermsEnum termsEnum, - PostingsEnum docsEnum, Integer maxDocs, Options options) - throws IOException { - final BytesRef bytesRef = termsEnum.term(); - StringBuilder termToString = new StringBuilder(); - termToString.append(termToString(field, bytesRef, options)); - if (options != null && options.dumpHexTerms) { - termToString.append(" ").append(bytesRef.toString()); - } - final int df = termsEnum.docFreq(); - double dfPercent = ((double) df / this.twitterReader.numDocs()) * 100.0; - TermDto termDto = new TermDto(field, termToString.toString(), Integer.toString(df), - String.format(Locale.US, "%.2f%%", dfPercent), - docsEnum, termsEnum, maxDocs); - termDto.write(writer, twitterReader); - writer.newline(); - } - - private static void appendFrequencyAndPositions(ViewerWriter writer, String field, - PostingsEnum docsEnum, EarlybirdIndexSegmentAtomicReader twitterReader) throws IOException { - final int frequency = docsEnum.freq(); - writer.name("freq").value(Integer.toString(frequency)); - - Schema schema = twitterReader.getSchema(); - Schema.FieldInfo fieldInfo = schema.getFieldInfo(field); - - if (fieldInfo != null - && (fieldInfo.getFieldType().indexOptions() == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS - || fieldInfo.getFieldType().indexOptions() - == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)) { - appendPositions(writer, docsEnum); - } - } - - private static void appendPositions(ViewerWriter writer, PostingsEnum docsAndPositionsEnum) - throws IOException { - writer.name("positions"); - - writer.beginArray(); - final int frequency = docsAndPositionsEnum.freq(); - for (int i = 0; i < frequency; i++) { - int position = docsAndPositionsEnum.nextPosition(); - writer.value(Integer.toString(position)); - } - writer.endArray(); - } - - private static void appendDocs(ViewerWriter writer, TermsEnum termsEnum, int maxDocs, - EarlybirdIndexSegmentAtomicReader twitterReader) - throws IOException { - writer.name("docIds"); - - writer.beginArray(); - - PostingsEnum docs = termsEnum.postings(null, 0); - int docsReturned = 0; - int docId; - boolean endedEarly = false; - DocIDToTweetIDMapper mapper = twitterReader.getSegmentData().getDocIDToTweetIDMapper(); - while ((docId = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { - if (docsReturned < maxDocs) { - docsReturned++; - long tweetID = 
mapper.getTweetID(docId); - - writer.beginObject(); - writer.name("docId").value(Long.toString(docId)); - writer.name("tweetId").value(Long.toString(tweetID)); - writer.endObject(); - } else { - endedEarly = true; - break; - } - } - if (endedEarly) { - writer.beginObject(); - writer.name("status").value("ended early"); - writer.endObject(); - } - writer.endArray(); - } - - /** - * Prints generic stats for all fields in the currently viewed index. - */ - public void dumpStats(ViewerWriter writer) throws IOException { - writer.beginObject(); - - printHeader(writer); - // stats section - writer.name("stats"); - writer.beginArray(); - writer.newline(); - for (String field : sortedFields()) { - Terms terms = twitterReader.terms(field); - if (terms != null) { - printStats(writer, field, terms); - } - } - writer.endArray(); - writer.endObject(); - } - - private void printStats(ViewerWriter writer, String field, Terms terms) throws IOException { - StatsDto statsDto = new StatsDto( - field, String.valueOf(terms.size()), terms.getClass().getCanonicalName()); - statsDto.write(writer); - writer.newline(); - } - - private void printHeader(ViewerWriter writer) throws IOException { - writer.name("timeSliceId").value(Long.toString(this.searcher.getTimeSliceID())); - writer.name("maxDocNumber").value(Integer.toString(this.twitterReader.maxDoc())); - writer.newline(); - } - - private static String formatField(String field) { - return String.format("%20s", field); - } - - /** - * Dumps out the schema of the current segment. - * @param writer to be used for printing - */ - public void dumpSchema(ViewerWriter writer) throws IOException { - writer.beginObject(); - printHeader(writer); - writer.name("schemaFields"); - writer.beginArray(); - writer.newline(); - Schema schema = this.twitterReader.getSchema(); - // The fields in the schema are not sorted. Sort them so that the output is deterministic - Set fieldNameSet = new TreeSet<>(); - for (Schema.FieldInfo fieldInfo: schema.getFieldInfos()) { - fieldNameSet.add(fieldInfo.getName()); - } - for (String fieldName : fieldNameSet) { - writer.value(fieldName); - writer.newline(); - } - writer.endArray(); - writer.endObject(); - } - - /** - * Dumps out the indexed fields inside the current segment. - * Mainly used to help the front end populate the fields. - * @param writer writer to be used for printing - */ - public void dumpFields(ViewerWriter writer) throws IOException { - writer.beginObject(); - printHeader(writer); - writer.name("fields"); - writer.beginArray(); - writer.newline(); - for (String field : sortedFields()) { - writer.value(field); - writer.newline(); - } - writer.endArray(); - writer.endObject(); - } - - /** - * Dumps out the mapping of the tweet/tweetId to - * a docId as well as segment/timeslide pair. - * @param writer writer to be used for writing - * @param tweetId tweetId that is input by user - */ - public void dumpTweetIdToDocIdMapping(ViewerWriter writer, long tweetId) throws IOException { - writer.beginObject(); - printHeader(writer); - writer.name("tweetId").value(Long.toString(tweetId)); - int docId = twitterReader.getSegmentData().getDocIDToTweetIDMapper().getDocID(tweetId); - - writer.name("docId").value(Integer.toString(docId)); - writer.endObject(); - writer.newline(); - } - - /** - * Dumps out the mapping of the docId to - * tweetId and timeslice/segmentId pairs. 
- * @param writer writer to be used for writing - * @param docid docId that is input by user - */ - public void dumpDocIdToTweetIdMapping(ViewerWriter writer, int docid) throws IOException { - writer.beginObject(); - printHeader(writer); - long tweetId = twitterReader.getSegmentData().getDocIDToTweetIDMapper().getTweetID(docid); - - writer.name("tweetId"); - if (tweetId >= 0) { - writer.value(Long.toString(tweetId)); - } else { - writer.value("Does not exist in segment"); - } - writer.name("docid").value(Integer.toString(docid)); - writer.endObject(); - } - - /** - * Print a response indicating that the given tweet id is not found in the index. - * - * Note that this method does not actually need the underlying index, and hence is setup as - * a util function. - */ - public static void writeTweetDoesNotExistResponse(ViewerWriter writer, long tweetId) - throws IOException { - writer.beginObject(); - writer.name("tweetId"); - writer.value(Long.toString(tweetId)); - writer.name("docId"); - writer.value("does not exist on this earlybird."); - writer.endObject(); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/JsonViewerWriter.java b/src/java/com/twitter/search/earlybird/util/JsonViewerWriter.java deleted file mode 100644 index 0672a76be..000000000 --- a/src/java/com/twitter/search/earlybird/util/JsonViewerWriter.java +++ /dev/null @@ -1,68 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.io.IOException; -import java.io.Writer; - -import com.google.gson.stream.JsonWriter; - -/** - * Wrapper class for JsonWriter that implements the - * ViewerWriter interface. - */ -public class JsonViewerWriter implements ViewerWriter { - - private final JsonWriter writer; - private final Writer out; - - public JsonViewerWriter(Writer out) { - this.out = out; - this.writer = new JsonWriter(out); - } - - - @Override - public ViewerWriter beginArray() throws IOException { - writer.beginArray(); - return this; - } - - @Override - public ViewerWriter beginObject() throws IOException { - writer.beginObject(); - return this; - } - - @Override - public ViewerWriter endArray() throws IOException { - writer.endArray(); - return this; - } - - @Override - public ViewerWriter endObject() throws IOException { - writer.endObject(); - return this; - } - - @Override - public ViewerWriter name(String field) throws IOException { - writer.name(field); - return this; - } - - @Override - public ViewerWriter value(String s) throws IOException { - writer.value(s); - return this; - } - - @Override - public ViewerWriter newline() throws IOException { - out.append('\n'); - return this; - } - - public void flush() throws IOException { - out.flush(); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/OneTaskScheduledExecutorManager.java b/src/java/com/twitter/search/earlybird/util/OneTaskScheduledExecutorManager.java deleted file mode 100644 index cdd9d50c3..000000000 --- a/src/java/com/twitter/search/earlybird/util/OneTaskScheduledExecutorManager.java +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.io.Closeable; -import java.io.IOException; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.ScheduledFuture; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -/** 
- * Executes a single periodic task. - */ -public abstract class OneTaskScheduledExecutorManager - extends ScheduledExecutorManager implements Closeable { - private final ScheduledExecutorTask scheduledTask; - private final PeriodicActionParams periodicActionParams; - - public OneTaskScheduledExecutorManager( - ScheduledExecutorServiceFactory executorServiceFactory, - String threadNameFormat, - boolean isDaemon, - PeriodicActionParams periodicActionParams, - ShutdownWaitTimeParams shutdownTiming, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - this(executorServiceFactory.build(threadNameFormat, isDaemon), periodicActionParams, - shutdownTiming, searchStatsReceiver, criticalExceptionHandler); - } - - public OneTaskScheduledExecutorManager( - ScheduledExecutorService executor, - PeriodicActionParams periodicActionParams, - ShutdownWaitTimeParams shutdownTiming, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - this(executor, periodicActionParams, shutdownTiming, searchStatsReceiver, null, - criticalExceptionHandler, Clock.SYSTEM_CLOCK); - } - - public OneTaskScheduledExecutorManager( - ScheduledExecutorService executor, - PeriodicActionParams periodicActionParams, - ShutdownWaitTimeParams shutdownWaitTimeParams, - SearchStatsReceiver searchStatsReceiver, - SearchCounter iterationCounter, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - super(executor, shutdownWaitTimeParams, searchStatsReceiver, iterationCounter, - criticalExceptionHandler, clock); - - this.periodicActionParams = periodicActionParams; - this.scheduledTask = new ScheduledExecutorTask(getIterationCounter(), clock) { - @Override - protected void runOneIteration() { - OneTaskScheduledExecutorManager.this.runOneIteration(); - } - }; - } - - /** - * Schedule the single internally specified task returned by getScheduledTask. - */ - public ScheduledFuture schedule() { - return this.scheduleNewTask( - this.getScheduledTask(), - this.periodicActionParams - ); - } - - /** - * The code that the task executes. 
- */ - protected abstract void runOneIteration(); - - public ScheduledExecutorTask getScheduledTask() { - return scheduledTask; - } - - @Override - public void close() throws IOException { - try { - shutdown(); - } catch (InterruptedException e) { - throw new IOException(e); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ParallelUtil.java b/src/java/com/twitter/search/earlybird/util/ParallelUtil.java deleted file mode 100644 index 9e570b1d9..000000000 --- a/src/java/com/twitter/search/earlybird/util/ParallelUtil.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.List; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.ThreadFactory; -import java.util.stream.Collectors; - -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import com.twitter.util.Await; -import com.twitter.util.Future; -import com.twitter.util.Future$; -import com.twitter.util.FuturePool; -import com.twitter.util.FuturePool$; - -public final class ParallelUtil { - private ParallelUtil() { - } - - public static List parmap(String threadName, CheckedFunction fn, List input) - throws Exception { - return parmap(threadName, input.size(), fn, input); - } - - /** - * Runs a function in parallel across the elements of the list, and throws an exception if any - * of the functions throws, or returns the results. - * - * Uses as many threads as there are elements in the input, so only use this for tasks that - * require significant CPU for each element, and have less elements than the number of cores. - */ - public static List parmap( - String threadName, int threadPoolSize, CheckedFunction fn, List input) - throws Exception { - ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize, - buildThreadFactory(threadName)); - FuturePool futurePool = FuturePool$.MODULE$.apply(executor); - - List> futures = input - .stream() - .map(in -> futurePool.apply(() -> { - try { - return fn.apply(in); - } catch (Exception e) { - throw new RuntimeException(e); - } - })).collect(Collectors.toList()); - - try { - return Await.result(Future$.MODULE$.collect(futures)); - } finally { - executor.shutdownNow(); - } - } - - private static ThreadFactory buildThreadFactory(String threadNameFormat) { - return new ThreadFactoryBuilder() - .setNameFormat(threadNameFormat) - .setDaemon(false) - .build(); - } - - @FunctionalInterface - public interface CheckedFunction { - /** - * A function from T to R that throws checked Exceptions. - */ - R apply(T t) throws Exception; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/PeriodicActionParams.java b/src/java/com/twitter/search/earlybird/util/PeriodicActionParams.java deleted file mode 100644 index b2f148b4b..000000000 --- a/src/java/com/twitter/search/earlybird/util/PeriodicActionParams.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.concurrent.TimeUnit; - -/** - * Specifies timing and type of period actions that we schedule. 
- * - * See: - * https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html - */ -public final class PeriodicActionParams { - private enum DelayType { - FIXED_DELAY, - FIXED_RATE - } - - private long initialDelayDuration; - private long intervalDuration; - private TimeUnit intervalUnit; - private DelayType delayType; - - public long getInitialDelayDuration() { - return initialDelayDuration; - } - - public long getIntervalDuration() { - return intervalDuration; - } - - public TimeUnit getIntervalUnit() { - return intervalUnit; - } - - public DelayType getDelayType() { - return delayType; - } - - private PeriodicActionParams( - DelayType delayType, - long initialDelayDuration, - long intervalDuration, - TimeUnit intervalUnit) { - this.delayType = delayType; - this.intervalDuration = intervalDuration; - this.initialDelayDuration = initialDelayDuration; - this.intervalUnit = intervalUnit; - } - - // Runs start at times start, start+X, start+2*X etc., so they can possibly overlap. - public static PeriodicActionParams atFixedRate( - long intervalDuration, - TimeUnit intervalUnit) { - return new PeriodicActionParams(DelayType.FIXED_RATE, 0, - intervalDuration, intervalUnit); - } - - // Delay between every run. - // The order of what happens is: - // initial delay, run task, wait X time, run task, wait X time, etc. - // Runs can't overlap. - public static PeriodicActionParams withIntialWaitAndFixedDelay( - long initialDelayDuration, - long intervalDuration, - TimeUnit intervalUnit) { - return new PeriodicActionParams(DelayType.FIXED_DELAY, initialDelayDuration, - intervalDuration, intervalUnit); - } - - // Delay between every run. - public static PeriodicActionParams withFixedDelay( - long intervalDuration, - TimeUnit intervalUnit) { - return withIntialWaitAndFixedDelay(0, intervalDuration, intervalUnit); - } - - boolean isFixedDelay() { - return this.delayType == DelayType.FIXED_DELAY; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ScheduledExecutorManager.java b/src/java/com/twitter/search/earlybird/util/ScheduledExecutorManager.java deleted file mode 100644 index 54f39c1ac..000000000 --- a/src/java/com/twitter/search/earlybird/util/ScheduledExecutorManager.java +++ /dev/null @@ -1,150 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.ScheduledFuture; -import java.util.concurrent.TimeUnit; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; - -/** - * Base class for classes that run periodic tasks. 
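For reference, a minimal sketch of how the two PeriodicActionParams modes described above would be constructed before being handed to a scheduler such as the manager below (the interval values are illustrative):

    // Fixed rate: iterations are scheduled at a constant rate from the first start time.
    PeriodicActionParams fixedRate =
        PeriodicActionParams.atFixedRate(60, TimeUnit.SECONDS);

    // Fixed delay: wait 10 seconds, run, then wait 60 seconds after each run completes.
    PeriodicActionParams fixedDelay =
        PeriodicActionParams.withIntialWaitAndFixedDelay(10, 60, TimeUnit.SECONDS);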
- */ -public abstract class ScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(ScheduledExecutorManager.class); - private static final long SHUTDOWN_WAIT_INTERVAL_SEC = 30; - - public static final String SCHEDULED_EXECUTOR_TASK_PREFIX = "scheduled_executor_task_"; - - private final String name; - private final ScheduledExecutorService executor; - - private final ShutdownWaitTimeParams shutdownWaitTimeParams; - - private final SearchCounter iterationCounter; - private final SearchStatsReceiver searchStatsReceiver; - - protected final CriticalExceptionHandler criticalExceptionHandler; - private final Clock clock; - - protected boolean shouldLog = true; - - public ScheduledExecutorManager( - ScheduledExecutorService executor, - ShutdownWaitTimeParams shutdownWaitTimeParams, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - this(executor, shutdownWaitTimeParams, searchStatsReceiver, null, - criticalExceptionHandler, clock); - } - - ScheduledExecutorManager( - ScheduledExecutorService executor, - ShutdownWaitTimeParams shutdownWaitTimeParams, - SearchStatsReceiver searchStatsReceiver, - SearchCounter iterationCounter, - CriticalExceptionHandler criticalExceptionHandler, - Clock clock) { - this.name = getClass().getSimpleName(); - this.executor = executor; - this.criticalExceptionHandler = criticalExceptionHandler; - this.shutdownWaitTimeParams = shutdownWaitTimeParams; - - if (iterationCounter != null) { - this.iterationCounter = iterationCounter; - } else { - this.iterationCounter = searchStatsReceiver.getCounter(SCHEDULED_EXECUTOR_TASK_PREFIX + name); - } - - this.searchStatsReceiver = searchStatsReceiver; - this.clock = clock; - } - - /** - * Schedule a task. - */ - protected final ScheduledFuture scheduleNewTask( - ScheduledExecutorTask task, - PeriodicActionParams periodicActionParams) { - long interval = periodicActionParams.getIntervalDuration(); - TimeUnit timeUnit = periodicActionParams.getIntervalUnit(); - long initialDelay = periodicActionParams.getInitialDelayDuration(); - - if (interval <= 0) { - String message = String.format( - "Not scheduling manager %s for wrong interval %d %s", name, interval, timeUnit); - LOG.error(message); - throw new UnsupportedOperationException(message); - } - - if (shouldLog) { - LOG.info("Scheduling to run {} every {} {} with {}", name, interval, timeUnit, - periodicActionParams.getDelayType()); - } - final ScheduledFuture scheduledFuture; - if (periodicActionParams.isFixedDelay()) { - scheduledFuture = executor.scheduleWithFixedDelay(task, initialDelay, interval, timeUnit); - } else { - scheduledFuture = executor.scheduleAtFixedRate(task, initialDelay, interval, timeUnit); - } - return scheduledFuture; - } - - /** - * Shutdown everything that's running with the executor. 
- */ - public boolean shutdown() throws InterruptedException { - LOG.info("Start shutting down {}.", name); - executor.shutdownNow(); - - boolean terminated = false; - long waitSeconds = shutdownWaitTimeParams.getWaitUnit().toSeconds( - shutdownWaitTimeParams.getWaitDuration() - ); - - if (waitSeconds == 0) { - LOG.info("Not waiting at all for {}, wait time is set to zero.", name); - } else { - while (!terminated && waitSeconds > 0) { - long waitTime = Math.min(waitSeconds, SHUTDOWN_WAIT_INTERVAL_SEC); - terminated = executor.awaitTermination(waitTime, TimeUnit.SECONDS); - waitSeconds -= waitTime; - - if (!terminated) { - LOG.info("Still shutting down {} ...", name); - } - } - } - - LOG.info("Done shutting down {}, terminated: {}", name, terminated); - - shutdownComponent(); - return terminated; - } - - protected ScheduledExecutorService getExecutor() { - return executor; - } - - public final String getName() { - return name; - } - - public SearchCounter getIterationCounter() { - return iterationCounter; - } - - protected final SearchStatsReceiver getSearchStatsReceiver() { - return searchStatsReceiver; - } - - // Override if you need to shutdown additional services. - protected void shutdownComponent() { - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ScheduledExecutorTask.java b/src/java/com/twitter/search/earlybird/util/ScheduledExecutorTask.java deleted file mode 100644 index a6cd074c0..000000000 --- a/src/java/com/twitter/search/earlybird/util/ScheduledExecutorTask.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.earlybird.util; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchCounter; - -public abstract class ScheduledExecutorTask implements Runnable { - private final SearchCounter counter; - protected final Clock clock; - - public ScheduledExecutorTask(SearchCounter counter, Clock clock) { - Preconditions.checkNotNull(counter); - this.counter = counter; - this.clock = clock; - } - - @Override - public final void run() { - counter.increment(); - runOneIteration(); - } - - @VisibleForTesting - protected abstract void runOneIteration(); -} diff --git a/src/java/com/twitter/search/earlybird/util/ScrubGenUtil.java b/src/java/com/twitter/search/earlybird/util/ScrubGenUtil.java deleted file mode 100644 index f2ba966d3..000000000 --- a/src/java/com/twitter/search/earlybird/util/ScrubGenUtil.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.text.ParseException; -import java.util.Date; - -import org.apache.commons.lang3.time.FastDateFormat; - -public final class ScrubGenUtil { - public static final FastDateFormat SCRUB_GEN_DATE_FORMAT = FastDateFormat.getInstance("yyyyMMdd"); - - private ScrubGenUtil() { } - - /** - * Helper method to parse a scrub gen from String to date - * - * @param scrubGen - * @return scrubGen in Date type - */ - public static Date parseScrubGenToDate(String scrubGen) { - try { - return SCRUB_GEN_DATE_FORMAT.parse(scrubGen); - } catch (ParseException e) { - String msg = "Malformed scrub gen date: " + scrubGen; - // If we are running a scrub gen and the date is bad we should quit and not continue. 
- throw new RuntimeException(msg, e); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ShutdownWaitTimeParams.java b/src/java/com/twitter/search/earlybird/util/ShutdownWaitTimeParams.java deleted file mode 100644 index ec056e93a..000000000 --- a/src/java/com/twitter/search/earlybird/util/ShutdownWaitTimeParams.java +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.concurrent.TimeUnit; - -/** - * Specifies how much time do we wait when shutting down a task. - */ -public class ShutdownWaitTimeParams { - private long waitDuration; - private TimeUnit waitUnit; - - public ShutdownWaitTimeParams(long waitDuration, TimeUnit waitUnit) { - this.waitDuration = waitDuration; - this.waitUnit = waitUnit; - } - - public long getWaitDuration() { - return waitDuration; - } - - public TimeUnit getWaitUnit() { - return waitUnit; - } - - /** - * Returns a ShutdownWaitTimeParams instance that instructs the caller to wait indefinitely for - * the task to shut down. - */ - public static ShutdownWaitTimeParams indefinitely() { - return new ShutdownWaitTimeParams(Long.MAX_VALUE, TimeUnit.DAYS); - } - - /** - * Returns a ShutdownWaitTimeParams instance that instructs the caller to shut down the task - * immediately. - */ - public static ShutdownWaitTimeParams immediately() { - return new ShutdownWaitTimeParams(0, TimeUnit.MILLISECONDS); - } -} diff --git a/src/java/com/twitter/search/earlybird/util/TermCountMonitor.java b/src/java/com/twitter/search/earlybird/util/TermCountMonitor.java deleted file mode 100644 index 55d747754..000000000 --- a/src/java/com/twitter/search/earlybird/util/TermCountMonitor.java +++ /dev/null @@ -1,338 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.util.Collections; -import java.util.HashMap; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; -import java.util.function.Function; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.commons.lang.mutable.MutableLong; -import org.apache.lucene.index.IndexOptions; -import org.apache.lucene.index.Terms; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; - -/** - * A background task that periodically gets and exports the number of terms per field that are - * indexed on this earlybird, averaged over all segments. - * Specifically used for making sure that we are not missing terms for any fields in the search - * archives. 
- * The task loops through all the segments that are indexed by this earlybird, and for each segment - * looks at the term counts for all fields in that segment. - * - * Also keeps track of the number of fields whose term counts are missing or below the specified - * threshold in the data that is indexed on this earlybird. - */ -public class TermCountMonitor extends OneTaskScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(TermCountMonitor.class); - - private static final String THREAD_NAME_FORMAT = "TermCountMonitor-%d"; - private static final boolean THREAD_IS_DAEMON = true; - - public static final String RUN_INTERVAL_MINUTES_CONFIG_NAME = - "term_count_monitor_run_interval_minutes"; - - private static Function termStatNameFunc = - field -> "term_count_on_field_" + field; - private static Function tokenStatNameFunc = - field -> "token_count_on_field_" + field; - private static Function missingFieldStatNameFunc = - field -> "term_count_monitor_missing_field_" + field; - - private static class RawFieldCounter { - private MutableLong numTerms = new MutableLong(0L); - private MutableLong numTokens = new MutableLong(0L); - } - - @VisibleForTesting - static class ExportedFieldCounter { - private final AtomicLong numTerms; - private final AtomicLong numTokens; - - ExportedFieldCounter(RawFieldCounter rawCounter) { - this.numTerms = new AtomicLong(rawCounter.numTerms.longValue()); - this.numTokens = new AtomicLong(rawCounter.numTokens.longValue()); - } - - ExportedFieldCounter(long numInitialTerms, long numInitialTokens) { - this.numTerms = new AtomicLong(numInitialTerms); - this.numTokens = new AtomicLong(numInitialTokens); - } - - @VisibleForTesting - long getNumTerms() { - return numTerms.longValue(); - } - - @VisibleForTesting - long getNumTokens() { - return numTokens.longValue(); - } - } - - private final int fieldMinTermCount = - EarlybirdConfig.getInt("term_count_monitor_min_count", 0); - - private final SegmentManager segmentManager; - private final Map missingFields; - private final Map termStats; - private final Map tokenStats; - private final Map exportedCounts; - private final SearchLongGauge termCountOnAllFields; - private final SearchLongGauge tokenCountOnAllFields; - private final SearchLongGauge fieldsWithNoTermCountStat; - private final SearchLongGauge isRunningStat; - private final SearchTimerStats checkTimeStat; - - @Override - protected void runOneIteration() { - LOG.info("Starting to get per-field term counts"); - isRunningStat.set(1); - final SearchTimer timer = checkTimeStat.startNewTimer(); - try { - updateFieldTermCounts(); - } catch (Exception ex) { - LOG.error("Unexpected exception while getting per-field term counts", ex); - } finally { - LOG.info( - "Done getting per-field term counts. Fields with low term counts: {}", - getFieldsWithLowTermCount()); - isRunningStat.set(0); - checkTimeStat.stopTimerAndIncrement(timer); - } - } - - /** - * Create a term count monitor which monitors the number of terms in segments - * managed by the given segment manager.
- */ - public TermCountMonitor( - SegmentManager segmentManager, - ScheduledExecutorServiceFactory executorServiceFactory, - long shutdownWaitDuration, - TimeUnit shutdownWaitUnit, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - super( - executorServiceFactory, - THREAD_NAME_FORMAT, - THREAD_IS_DAEMON, - PeriodicActionParams.atFixedRate( - EarlybirdConfig.getInt(RUN_INTERVAL_MINUTES_CONFIG_NAME, -1), - TimeUnit.MINUTES), - new ShutdownWaitTimeParams( - shutdownWaitDuration, - shutdownWaitUnit - ), - searchStatsReceiver, - criticalExceptionHandler); - this.segmentManager = segmentManager; - this.missingFields = new HashMap<>(); - this.termStats = new HashMap<>(); - this.tokenStats = new HashMap<>(); - this.exportedCounts = new HashMap<>(); - this.termCountOnAllFields = getSearchStatsReceiver().getLongGauge("term_count_on_all_fields"); - this.tokenCountOnAllFields = getSearchStatsReceiver().getLongGauge("token_count_on_all_fields"); - this.fieldsWithNoTermCountStat = - getSearchStatsReceiver().getLongGauge("fields_with_low_term_counts"); - this.isRunningStat = - getSearchStatsReceiver().getLongGauge("term_count_monitor_is_running"); - this.checkTimeStat = - getSearchStatsReceiver().getTimerStats( - "term_count_monitor_check_time", TimeUnit.MILLISECONDS, true, true, false); - } - - private SearchLongGauge getOrCreateLongGauge( - Map gauges, String field, Function nameSupplier) { - SearchLongGauge stat = gauges.get(field); - - if (stat == null) { - stat = getSearchStatsReceiver().getLongGauge(nameSupplier.apply(field)); - gauges.put(field, stat); - } - - return stat; - } - - private void updateFieldTermCounts() { - // 0. Get the current per-field term counts - Map newCounts = getFieldStats(); - LOG.info("Computed field stats for all segments"); - - // 1. Update all existing keys - for (Map.Entry exportedCount : exportedCounts.entrySet()) { - String field = exportedCount.getKey(); - ExportedFieldCounter exportedCountValue = exportedCount.getValue(); - - RawFieldCounter newCount = newCounts.get(field); - if (newCount == null) { - exportedCountValue.numTerms.set(0L); - exportedCountValue.numTokens.set(0L); - } else { - exportedCountValue.numTerms.set(newCount.numTerms.longValue()); - exportedCountValue.numTokens.set(newCount.numTokens.longValue()); - - // clean up so that we don't check this field again when we look for new field - newCounts.remove(field); - } - } - - // 2. Add and export all new fields' term counts - for (Map.Entry newCount: newCounts.entrySet()) { - String field = newCount.getKey(); - Preconditions.checkState(!exportedCounts.containsKey(field), - "Should have already processed and removed existing fields: " + field); - - ExportedFieldCounter newStat = new ExportedFieldCounter(newCount.getValue()); - exportedCounts.put(field, newStat); - } - - // 3. Export as a stat the term counts for all the known fields. - for (Map.Entry exportedCount : exportedCounts.entrySet()) { - String field = exportedCount.getKey(); - ExportedFieldCounter counter = exportedCount.getValue(); - - getOrCreateLongGauge(termStats, field, termStatNameFunc).set(counter.numTerms.get()); - getOrCreateLongGauge(tokenStats, field, tokenStatNameFunc).set(counter.numTokens.get()); - } - - // 4. Export as a stat, number of fields not having enough term counts (i.e. 
<= 0) - int fieldsWithNoTermCounts = 0; - for (Map.Entry fieldTermCount : exportedCounts.entrySet()) { - String field = fieldTermCount.getKey(); - AtomicLong exportedCountValue = fieldTermCount.getValue().numTerms; - if (exportedCountValue.get() <= fieldMinTermCount) { - LOG.warn( - "Found a field with too few term counts. Field: {} count: {}", - field, exportedCountValue); - fieldsWithNoTermCounts++; - } - } - this.fieldsWithNoTermCountStat.set(fieldsWithNoTermCounts); - } - - /** - * Loops through all segments, and for each field gets the average term/token count. - * Based on that, returns a map from each field to its term/token count (average per segment). - */ - private Map getFieldStats() { - Iterable segmentInfos = segmentManager.getSegmentInfos( - SegmentManager.Filter.Enabled, SegmentManager.Order.NEW_TO_OLD); - Map rawCounts = new HashMap<>(); - - ImmutableSchemaInterface schemaSnapshot = - segmentManager.getEarlybirdIndexConfig().getSchema().getSchemaSnapshot(); - Set missingFieldsCandidates = schemaSnapshot - .getFieldInfos() - .stream() - .filter(fieldInfo -> fieldInfo.getFieldType().indexOptions() != IndexOptions.NONE) - .map(Schema.FieldInfo::getName) - .collect(Collectors.toSet()); - int segmentCount = 0; - for (SegmentInfo segmentInfo : segmentInfos) { - segmentCount++; - try { - EarlybirdSingleSegmentSearcher searcher = segmentManager.getSearcher( - segmentInfo.getTimeSliceID(), schemaSnapshot); - if (searcher != null) { - EarlybirdIndexSegmentAtomicReader reader = searcher.getTwitterIndexReader(); - for (Schema.FieldInfo fieldInfo : schemaSnapshot.getFieldInfos()) { - if (fieldInfo.getFieldType().indexOptions() == IndexOptions.NONE) { - continue; - } - - String fieldName = fieldInfo.getName(); - RawFieldCounter count = rawCounts.get(fieldName); - if (count == null) { - count = new RawFieldCounter(); - rawCounts.put(fieldName, count); - } - Terms terms = reader.terms(fieldName); - if (terms != null) { - missingFieldsCandidates.remove(fieldName); - count.numTerms.add(terms.size()); - long sumTotalTermFreq = terms.getSumTotalTermFreq(); - if (sumTotalTermFreq != -1) { - count.numTokens.add(sumTotalTermFreq); - } - } - } - } - } catch (Exception e) { - LOG.error("Exception getting average term count per field: " + segmentInfo, e); - } - } - - // Update missing fields stats. 
- missingFieldsCandidates.forEach( - field -> getOrCreateLongGauge(missingFields, field, missingFieldStatNameFunc).set(1)); - missingFields.keySet().stream() - .filter( - field -> !missingFieldsCandidates.contains(field)) - .forEach( - field -> getOrCreateLongGauge(missingFields, field, missingFieldStatNameFunc).set(0)); - - long totalTermCount = 0; - long totalTokenCount = 0; - if (segmentCount == 0) { - LOG.error("No segments are found to calculate per-field term counts."); - } else { - LOG.debug("TermCountMonitor.getPerFieldTermCount.segmentCount = {}", segmentCount); - LOG.debug(" field: term count (average per segment)"); - for (Map.Entry entry : rawCounts.entrySet()) { - String field = entry.getKey(); - final long averageTermCount = entry.getValue().numTerms.longValue() / segmentCount; - final long averageTokenCount = entry.getValue().numTokens.longValue() / segmentCount; - totalTermCount += entry.getValue().numTerms.longValue(); - totalTokenCount += entry.getValue().numTokens.longValue(); - - LOG.debug(" '{} term': {}", field, averageTermCount); - LOG.debug(" '{} token': {}", field, averageTokenCount); - - entry.getValue().numTerms.setValue(averageTermCount); - entry.getValue().numTokens.setValue(averageTokenCount); - } - } - LOG.info("Total term count: {}", totalTermCount); - LOG.info("Total token count: {}", totalTokenCount); - this.termCountOnAllFields.set(totalTermCount); - this.tokenCountOnAllFields.set(totalTokenCount); - - return rawCounts; - } - - @VisibleForTesting - Map getExportedCounts() { - return Collections.unmodifiableMap(this.exportedCounts); - } - - @VisibleForTesting - long getFieldsWithLowTermCount() { - return fieldsWithNoTermCountStat.get(); - } - - @VisibleForTesting - Map getMissingFields() { - return missingFields; - } -} diff --git a/src/java/com/twitter/search/earlybird/util/TweetCountMonitor.java b/src/java/com/twitter/search/earlybird/util/TweetCountMonitor.java deleted file mode 100644 index e33433656..000000000 --- a/src/java/com/twitter/search/earlybird/util/TweetCountMonitor.java +++ /dev/null @@ -1,447 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.Calendar; -import java.util.Date; -import java.util.List; -import java.util.Map; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicInteger; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Maps; - -import org.apache.commons.lang.mutable.MutableInt; -import org.apache.commons.lang.mutable.MutableLong; -import org.apache.lucene.index.IndexOptions; -import org.apache.lucene.index.PostingsEnum; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.TermsEnum; -import org.apache.lucene.search.DocIdSetIterator; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.search.common.concurrent.ScheduledExecutorServiceFactory; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchStatsReceiver; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.partitioning.base.Segment; -import com.twitter.search.common.schema.base.ImmutableSchemaInterface; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.core.earlybird.index.DocIDToTweetIDMapper; -import 
com.twitter.search.core.earlybird.index.EarlybirdIndexSegmentAtomicReader; -import com.twitter.search.core.earlybird.index.TimeMapper; -import com.twitter.search.earlybird.common.config.EarlybirdConfig; -import com.twitter.search.earlybird.exception.CriticalExceptionHandler; -import com.twitter.search.earlybird.index.EarlybirdSingleSegmentSearcher; -import com.twitter.search.earlybird.partition.SegmentInfo; -import com.twitter.search.earlybird.partition.SegmentManager; - -/** - * A background task that periodically gets and exports the number of tweets per hour that are - * indexed on this earlybird. - * Specifically used for making sure that we are not missing data for any hours in the search - * archives. - * The task loops though all the segments that are indexed by this earlybird, and for each segment - * looks at all the createdAt dates for all of the documents in that segment. - * - * Also keeps track off an exposes as a stat the number of hours that do not have any tweets in the - * min/max range of data that IS indexed on this earlybird. i.e if we only have data for - * 2006/01/01:02 and 2006/01/01:04, it will consider 2006/01/01:03 as a missing hour. - * Hours before 2006/01/01:02 or after 2006/01/01:04 will not be considered as missing. - */ -public class TweetCountMonitor extends OneTaskScheduledExecutorManager { - private static final Logger LOG = LoggerFactory.getLogger(TweetCountMonitor.class); - - private static final String THREAD_NAME_FORMAT = "TweetCountMonitor-%d"; - private static final boolean THREAD_IS_DAEMON = true; - - public static final String RUN_INTERVAL_MINUTES_CONFIG_NAME = - "tweet_count_monitor_run_interval_minutes"; - public static final String START_CHECK_HOUR_CONFIG_NAME = - "tweet_count_monitor_start_check_hour"; - public static final String HOURLY_MIN_COUNT_CONFIG_NAME = - "tweet_count_monitor_hourly_min_count"; - public static final String DAILY_MIN_COUNT_CONFIG_NAME = - "tweet_count_monitor_daily_min_count"; - - @VisibleForTesting - public static final AtomicInteger INSTANCE_COUNTER = new AtomicInteger(0); - - private static final long MILLIS_IN_A_DAY = TimeUnit.DAYS.toMillis(1); - - private final SegmentManager segmentManager; - - private final SearchStatsReceiver searchStatsReceiver; - private final int instanceCounter; - - // The first date in format "YYYYMMDDHH" that we want to check counts for. - private final int startCheckHour; - // The last date in format "YYYYMMDDHH" that we want to check counts for. - private final int endCheckHour; - //Smallest number of docs we expect to have for each day. - private final int dailyMinCount; - // Smallest number of docs we expect to have for each hour. 
- private final int hourlyMinCount; - // Binary stat, set to 0 when the monitor is running - private final SearchLongGauge isRunningStat; - // How long each iteration takes - private final SearchTimerStats checkTimeStat; - - private final Map fieldTermCounters; - private final Map fieldCheckTimeStats; - - /** - * Create a TweetCountMonitor to monitor all segments in the given segmentManager - */ - public TweetCountMonitor( - SegmentManager segmentManager, - ScheduledExecutorServiceFactory executorServiceFactory, - long shutdownWaitDuration, - TimeUnit shutdownWaitUnit, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - this(segmentManager, - EarlybirdConfig.getInt(START_CHECK_HOUR_CONFIG_NAME, 0), - EarlybirdConfig.getInt(RUN_INTERVAL_MINUTES_CONFIG_NAME, -1), - EarlybirdConfig.getInt(HOURLY_MIN_COUNT_CONFIG_NAME, 0), - EarlybirdConfig.getInt(DAILY_MIN_COUNT_CONFIG_NAME, 0), - executorServiceFactory, - shutdownWaitDuration, - shutdownWaitUnit, - searchStatsReceiver, - criticalExceptionHandler); - } - - @VisibleForTesting - TweetCountMonitor( - SegmentManager segmentManager, - int startCheckHourFromConfig, - int schedulePeriodMinutes, - int hourlyMinCount, - int dailyMinCount, - ScheduledExecutorServiceFactory executorServiceFactory, - long shutdownWaitDuration, - TimeUnit shutdownWaitUnit, - SearchStatsReceiver searchStatsReceiver, - CriticalExceptionHandler criticalExceptionHandler) { - super( - executorServiceFactory, - THREAD_NAME_FORMAT, - THREAD_IS_DAEMON, - PeriodicActionParams.atFixedRate( - schedulePeriodMinutes, - TimeUnit.MINUTES - ), - new ShutdownWaitTimeParams( - shutdownWaitDuration, - shutdownWaitUnit - ), - searchStatsReceiver, - criticalExceptionHandler); - this.segmentManager = segmentManager; - this.searchStatsReceiver = searchStatsReceiver; - this.instanceCounter = INSTANCE_COUNTER.incrementAndGet(); - this.hourlyMinCount = hourlyMinCount; - this.dailyMinCount = dailyMinCount; - - String isRunningStatName = "tweet_count_monitor_is_running_v_" + this.instanceCounter; - this.isRunningStat = SearchLongGauge.export(isRunningStatName); - String checkTimeStatName = "tweet_count_monitor_check_time_v_" + this.instanceCounter; - this.checkTimeStat = SearchTimerStats.export(checkTimeStatName, TimeUnit.MILLISECONDS, true); - - this.startCheckHour = Math.max( - startCheckHourFromConfig, - dateToHourValue(segmentManager.getPartitionConfig().getTierStartDate())); - this.endCheckHour = dateToHourValue(segmentManager.getPartitionConfig().getTierEndDate()); - - this.fieldTermCounters = Maps.newHashMap(); - this.fieldTermCounters.put( - FieldTermCounter.TWEET_COUNT_KEY, - new FieldTermCounter( - FieldTermCounter.TWEET_COUNT_KEY, - instanceCounter, - startCheckHour, - endCheckHour, - hourlyMinCount, - dailyMinCount)); - this.fieldCheckTimeStats = Maps.newHashMap(); - } - - private int dateToHourValue(Date date) { - Calendar cal = Calendar.getInstance(FieldTermCounter.TIME_ZONE); - cal.setTime(date); - return FieldTermCounter.getHourValue(cal); - } - - private void updateHourlyCounts() { - // Iterate the current index to count all tweets anf field hits. 
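- // Descriptive note (added for clarity): newCountMap is keyed by field name, plus
- // FieldTermCounter.TWEET_COUNT_KEY for the overall tweet count; each value maps an hour bucket
- // in YYYYMMDDHH form to the number of matching documents.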
- Map> newCountMap = getNewTweetCountMap(); - - for (Map.Entry> newCounts : newCountMap.entrySet()) { - final String fieldName = newCounts.getKey(); - FieldTermCounter termCounter = fieldTermCounters.get(fieldName); - if (termCounter == null) { - termCounter = new FieldTermCounter( - fieldName, - instanceCounter, - startCheckHour, - endCheckHour, - hourlyMinCount, - dailyMinCount); - fieldTermCounters.put(fieldName, termCounter); - } - termCounter.runWithNewCounts(newCounts.getValue()); - } - } - - /** - * Loops through all segments, and all documents in each segment, and for each document - * gets the createdAt timestamp (in seconds) from the TimeMapper. - * Based on that, returns a map with the count of: - * . the number of tweets for each hour - * . the number of tweets corresponding to each field for each hour - */ - private Map> getNewTweetCountMap() { - Iterable segmentInfos = segmentManager.getSegmentInfos( - SegmentManager.Filter.Enabled, SegmentManager.Order.NEW_TO_OLD); - Map> newCountMap = Maps.newHashMap(); - - Map newCounts = Maps.newHashMap(); - newCountMap.put(FieldTermCounter.TWEET_COUNT_KEY, newCounts); - - ImmutableSchemaInterface schemaSnapshot = - segmentManager.getEarlybirdIndexConfig().getSchema().getSchemaSnapshot(); - Calendar cal = Calendar.getInstance(FieldTermCounter.TIME_ZONE); - for (SegmentInfo segmentInfo : segmentInfos) { - try { - EarlybirdSingleSegmentSearcher searcher = segmentManager.getSearcher( - segmentInfo.getTimeSliceID(), schemaSnapshot); - if (searcher != null) { - EarlybirdIndexSegmentAtomicReader reader = searcher.getTwitterIndexReader(); - TimeMapper timeMapper = reader.getSegmentData().getTimeMapper(); - List> outsideEndDateRangeDocList = new ArrayList<>(); - - // Get the number of tweets for each hour. - int docsOutsideEndDateRange = getNewTweetCountsForSegment( - segmentInfo, reader, timeMapper, cal, newCounts); - if (docsOutsideEndDateRange > 0) { - outsideEndDateRangeDocList.add(new Pair<>( - FieldTermCounter.TWEET_COUNT_KEY, docsOutsideEndDateRange)); - } - - // Get the number of tweets with corresponding field for each hour. 
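- // Descriptive note (added for clarity): only indexed fields are counted; fields with
- // IndexOptions.NONE are skipped below.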
- for (Schema.FieldInfo fieldInfo : schemaSnapshot.getFieldInfos()) { - if (fieldInfo.getFieldType().indexOptions() == IndexOptions.NONE) { - continue; - } - - String fieldName = fieldInfo.getName(); - docsOutsideEndDateRange = getNewFieldTweetCountsForSegment( - segmentInfo, reader, timeMapper, cal, fieldName, newCountMap); - if (docsOutsideEndDateRange > 0) { - outsideEndDateRangeDocList.add(new Pair<>(fieldName, docsOutsideEndDateRange)); - } - } - - LOG.info("Inspected segment: " + segmentInfo + " found " - + outsideEndDateRangeDocList.size() - + " fields with documents outside of segment end date."); - for (Pair outsideEndRange : outsideEndDateRangeDocList) { - LOG.info(" outside end date range - segment: " + segmentInfo.getSegmentName() - + " field: " + outsideEndRange.toString()); - } - } - } catch (IOException e) { - LOG.error("Exception getting daily tweet counts for timeslice: " + segmentInfo, e); - } - } - return newCountMap; - } - - private void incrementNumDocsWithIllegalTimeCounter(String segmentName, String fieldSuffix) { - String statName = String.format( - "num_docs_with_illegal_time_for_segment_%s%s_counter", segmentName, fieldSuffix); - SearchCounter counter = SearchCounter.export(statName); - counter.increment(); - } - - private int getNewTweetCountsForSegment( - SegmentInfo segmentInfo, - EarlybirdIndexSegmentAtomicReader reader, - TimeMapper timeMapper, - Calendar cal, - Map newTweetCounts) { - DocIDToTweetIDMapper tweetIdMapper = reader.getSegmentData().getDocIDToTweetIDMapper(); - long dataEndTimeExclusiveMillis = getDataEndTimeExclusiveMillis(segmentInfo); - int docsOutsideEndDateRange = 0; - int docId = Integer.MIN_VALUE; - while ((docId = tweetIdMapper.getNextDocID(docId)) != DocIDToTweetIDMapper.ID_NOT_FOUND) { - UpdateCountType updateCountType = - updateTweetCount(timeMapper, docId, dataEndTimeExclusiveMillis, cal, newTweetCounts); - if (updateCountType == UpdateCountType.ILLEGAL_TIME) { - incrementNumDocsWithIllegalTimeCounter(segmentInfo.getSegmentName(), ""); - } else if (updateCountType == UpdateCountType.OUT_OF_RANGE_TIME) { - docsOutsideEndDateRange++; - } - } - return docsOutsideEndDateRange; - } - - private int getNewFieldTweetCountsForSegment( - SegmentInfo segmentInfo, - EarlybirdIndexSegmentAtomicReader reader, - TimeMapper timeMapper, - Calendar cal, - String field, - Map> newCountMap) throws IOException { - int docsOutsideEndDateRange = 0; - Map fieldTweetCounts = - newCountMap.computeIfAbsent(field, k -> Maps.newHashMap()); - - Terms terms = reader.terms(field); - if (terms == null) { - LOG.warn("Field <" + field + "> is missing terms in segment: " - + segmentInfo.getSegmentName()); - return 0; - } - long startTimeMillis = System.currentTimeMillis(); - - long dataEndTimeExclusiveMillis = getDataEndTimeExclusiveMillis(segmentInfo); - for (TermsEnum termsEnum = terms.iterator(); termsEnum.next() != null;) { - DocIdSetIterator docsIterator = termsEnum.postings(null, PostingsEnum.NONE); - for (int docId = docsIterator.nextDoc(); - docId != DocIdSetIterator.NO_MORE_DOCS; docId = docsIterator.nextDoc()) { - UpdateCountType updateCountType = updateTweetCount( - timeMapper, docId, dataEndTimeExclusiveMillis, cal, fieldTweetCounts); - if (updateCountType == UpdateCountType.ILLEGAL_TIME) { - incrementNumDocsWithIllegalTimeCounter( - segmentInfo.getSegmentName(), "_and_field_" + field); - } else if (updateCountType == UpdateCountType.OUT_OF_RANGE_TIME) { - docsOutsideEndDateRange++; - } - } - } - updateFieldRunTimeStats(field, System.currentTimeMillis() - 
startTimeMillis); - - return docsOutsideEndDateRange; - } - - private enum UpdateCountType { - OK_TIME, - ILLEGAL_TIME, - OUT_OF_RANGE_TIME, - } - - private static UpdateCountType updateTweetCount( - TimeMapper timeMapper, - int docId, - long dataEndTimeExclusiveMillis, - Calendar cal, - Map newTweetCounts) { - int timeSecs = timeMapper.getTime(docId); - if (timeSecs == TimeMapper.ILLEGAL_TIME) { - return UpdateCountType.ILLEGAL_TIME; - } - if (dataEndTimeExclusiveMillis == Segment.NO_DATA_END_TIME - || timeSecs * 1000L < dataEndTimeExclusiveMillis) { - Integer hourlyValue = FieldTermCounter.getHourValue(cal, timeSecs); - MutableInt count = newTweetCounts.get(hourlyValue); - if (count == null) { - count = new MutableInt(0); - newTweetCounts.put(hourlyValue, count); - } - count.increment(); - return UpdateCountType.OK_TIME; - } else { - return UpdateCountType.OUT_OF_RANGE_TIME; - } - } - - /** - * If a segment has an end date, return the last timestamp (exclusive, and in millis) for which - * we expect it to have data. - * @return Segment.NO_DATA_END_TIME if the segment does not have an end date. - */ - private long getDataEndTimeExclusiveMillis(SegmentInfo segmentInfo) { - long dataEndDate = segmentInfo.getSegment().getDataEndDateInclusiveMillis(); - if (dataEndDate == Segment.NO_DATA_END_TIME) { - return Segment.NO_DATA_END_TIME; - } else { - return dataEndDate + MILLIS_IN_A_DAY; - } - } - - private void updateFieldRunTimeStats(String fieldName, long runTimeMs) { - SearchTimerStats timerStats = fieldCheckTimeStats.get(fieldName); - if (timerStats == null) { - final String statName = "tweet_count_monitor_check_time_field_" + fieldName; - timerStats = searchStatsReceiver.getTimerStats( - statName, TimeUnit.MILLISECONDS, false, false, false); - fieldCheckTimeStats.put(fieldName, timerStats); - } - timerStats.timerIncrement(runTimeMs); - } - - @VisibleForTesting - String getStatName(String fieldName, Integer date) { - return FieldTermCounter.getStatName(fieldName, instanceCounter, date); - } - - @VisibleForTesting - Map getExportedCounts(String fieldName) { - if (fieldTermCounters.get(fieldName) == null) { - return null; - } else { - return fieldTermCounters.get(fieldName).getExportedCounts(); - } - } - - @VisibleForTesting - Map getDailyCounts(String fieldName) { - if (fieldTermCounters.get(fieldName) == null) { - return null; - } else { - return fieldTermCounters.get(fieldName).getDailyCounts(); - } - } - - @VisibleForTesting - long getHoursWithNoTweets(String fieldName) { - return fieldTermCounters.get(fieldName).getHoursWithNoTweets(); - } - - @VisibleForTesting - long getDaysWithNoTweets(String fieldName) { - return fieldTermCounters.get(fieldName).getDaysWithNoTweets(); - } - - @VisibleForTesting - Map getExportedHourlyCountStats(String fieldName) { - return fieldTermCounters.get(fieldName).getExportedHourlyCountStats(); - } - - @Override - protected void runOneIteration() { - LOG.info("Starting to get hourly tweet counts"); - final long startTimeMillis = System.currentTimeMillis(); - - isRunningStat.set(1); - try { - updateHourlyCounts(); - } catch (Exception ex) { - LOG.error("Unexpected exception while getting hourly tweet counts", ex); - } finally { - isRunningStat.set(0); - - long elapsedTimeMillis = System.currentTimeMillis() - startTimeMillis; - checkTimeStat.timerIncrement(elapsedTimeMillis); - LOG.info("Done getting daily tweet counts. 
Hours without tweets: " - + getHoursWithNoTweets(FieldTermCounter.TWEET_COUNT_KEY)); - LOG.info("Updating tweet count takes " + (elapsedTimeMillis / 1000) + " secs."); - } - } -} diff --git a/src/java/com/twitter/search/earlybird/util/ViewerWriter.java b/src/java/com/twitter/search/earlybird/util/ViewerWriter.java deleted file mode 100644 index f6b02f1a5..000000000 --- a/src/java/com/twitter/search/earlybird/util/ViewerWriter.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.earlybird.util; - -import java.io.IOException; - -/** - * Interface class for writer. Writer should be passed in - * and have these methods. Currently keeps the hierarchy for - * completed and valid json, methods mirror the ones found in - * JsonWriter - * http://google-gson.googlecode.com/svn/trunk/gson/docs/javadocs/com/google/gson/stream/JsonWriter.html - */ -public interface ViewerWriter { - /** - * Writes a mark for the beginning of an array. - */ - ViewerWriter beginArray() throws IOException; - - /** - * Writes a mark for the beginning of an object. - */ - ViewerWriter beginObject() throws IOException; - - /** - * Writes a mark for the end of an array. - */ - ViewerWriter endArray() throws IOException; - - /** - * Writes a mark for the end of an object. - */ - ViewerWriter endObject() throws IOException; - - /** - * Writes the name (key) of a property. - */ - ViewerWriter name(String field) throws IOException; - - /** - * Writes the value of a property. - */ - ViewerWriter value(String s) throws IOException; - - /** - * Writes a new line. - */ - ViewerWriter newline() throws IOException; -} diff --git a/src/java/com/twitter/search/earlybird_root/BUILD b/src/java/com/twitter/search/earlybird_root/BUILD deleted file mode 100644 index e28e612bd..000000000 --- a/src/java/com/twitter/search/earlybird_root/BUILD +++ /dev/null @@ -1,75 +0,0 @@ -java_library( - name = "earlybird_root-lib", - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/org/slf4j:slf4j-api", - "decider/src/main/scala", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authorization/server", - "finagle/finagle-memcached/src/main/java", - "finagle/finagle-mux/src/main/scala", - "finagle/finagle-thrift/src/main/java", - "finagle/finagle-thrift/src/main/scala", - "finatra/inject/inject-core/src/main/scala", - "finatra/inject/inject-server/src/main/scala/com/twitter/inject/server", - "src/java/com/google/common/util/concurrent", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/caching", - "src/java/com/twitter/search/common/clientstats", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/dark", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/zookeeper", - "src/java/com/twitter/search/common/relevance:ranking", - "src/java/com/twitter/search/common/root", - "src/java/com/twitter/search/common/runtime", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/search", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/common/util/io/periodic", - 
"src/java/com/twitter/search/common/util/zookeeper", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird/config", - "src/java/com/twitter/search/earlybird_root/caching", - "src/java/com/twitter/search/earlybird_root/common", - "src/java/com/twitter/search/earlybird_root/filters", - "src/java/com/twitter/search/earlybird_root/mergers", - "src/java/com/twitter/search/earlybird_root/quota", - "src/java/com/twitter/search/earlybird_root/routers", - "src/java/com/twitter/search/earlybird_root/visitors", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/java/com/twitter/search/queryparser/query/search:search-query-nodes", - "src/thrift/com/twitter/search:benchmark_query-java", - "src/thrift/com/twitter/search:earlybird-java", - "stitch/stitch-core", - "strato/src/main/scala/com/twitter/strato/catalog", - "strato/src/main/scala/com/twitter/strato/client", - "thrift-web-forms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms/model", - ], -) - -jvm_binary( - name = "earlybird_root-binary", - basename = "earlybird_root", - # The main class is reset in the aurora files (it's a required param). - # We need to set it to something here, because hadoop_binary requires it. - main = "com.twitter.search.earlybird_root.RealtimeRootAppMain", - runtime_platform = "java11", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/org/slf4j:slf4j-log4j12", - ":earlybird_root-lib", - "src/java/com/twitter/search/common/logging:search-log4j", - # For /admin/logging. - "twitter-server/slf4j-log4j12/src/main/scala", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/ClientBackupFilter.java b/src/java/com/twitter/search/earlybird_root/ClientBackupFilter.java deleted file mode 100644 index e24333735..000000000 --- a/src/java/com/twitter/search/earlybird_root/ClientBackupFilter.java +++ /dev/null @@ -1,90 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.finagle.client.BackupRequestFilter; -import com.twitter.finagle.service.ResponseClassifier; -import com.twitter.finagle.service.RetryBudgets; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.finagle.util.DefaultTimer; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; -import com.twitter.util.tunable.Tunable; - -public class ClientBackupFilter extends SimpleFilter { - private static final Logger LOG = LoggerFactory.getLogger(ClientBackupFilter.class); - - private final Map> - clientBackupFilters = new ConcurrentHashMap<>(); - private final boolean sendInterupts = false; - private final String statPrefix; - private final Tunable.Mutable maxExtraLoad; - private final StatsReceiver statsReceiver; - private final SearchDecider decider; - private final String backupRequestPrecentExtraLoadDecider; - private final int minSendBackupAfterMs = 1; - - public ClientBackupFilter(String serviceName, - String statPrefix, - StatsReceiver statsReceiver, - SearchDecider decider) { - this.statPrefix = statPrefix; - 
this.backupRequestPrecentExtraLoadDecider = serviceName + "_backup_request_percent_extra_load"; - this.decider = decider; - this.maxExtraLoad = Tunable.mutable("backup_tunable", getMaxExtraLoadFromDecider()); - this.statsReceiver = statsReceiver; - SearchCustomGauge.export(serviceName + "_backup_request_factor", - () -> (maxExtraLoad.apply().isDefined()) ? (double) maxExtraLoad.apply().get() : -1); - } - - private double getMaxExtraLoadFromDecider() { - return ((double) decider.getAvailability(backupRequestPrecentExtraLoadDecider)) / 100 / 100; - } - - private BackupRequestFilter backupFilter(String client) { - return new BackupRequestFilter( - maxExtraLoad, - sendInterupts, - minSendBackupAfterMs, - ResponseClassifier.Default(), - RetryBudgets.newRetryBudget(), - statsReceiver.scope(statPrefix, client, "backup_filter"), - DefaultTimer.getInstance(), - client); - } - - private void updateMaxExtraLoadIfNecessary() { - double maxExtraLoadDeciderValue = getMaxExtraLoadFromDecider(); - if (maxExtraLoad.apply().isDefined() - && !maxExtraLoad.apply().get().equals(maxExtraLoadDeciderValue)) { - LOG.info("Updating maxExtraLoad from {} to {}", - maxExtraLoad.apply().get(), - maxExtraLoadDeciderValue); - maxExtraLoad.set(maxExtraLoadDeciderValue); - } - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - updateMaxExtraLoadIfNecessary(); - - String clientID = ClientIdUtil.getClientIdFromRequest(request); - BackupRequestFilter filter = - clientBackupFilters.computeIfAbsent(clientID, this::backupFilter); - - return filter - .andThen(service) - .apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ClientLatencyFilter.java b/src/java/com/twitter/search/earlybird_root/ClientLatencyFilter.java deleted file mode 100644 index 0106d7a28..000000000 --- a/src/java/com/twitter/search/earlybird_root/ClientLatencyFilter.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.concurrent.ConcurrentHashMap; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.clientstats.RequestCounters; -import com.twitter.search.common.clientstats.RequestCountersEventListener; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.filters.EarlybirdSuccessfulResponseHandler; -import com.twitter.util.Future; - -public class ClientLatencyFilter extends SimpleFilter { - // _client_latency_stats_for_ is intended to measure the latency of requests to services that this - // root depends on. This can be used to measure how long a request takes in transit between when - // it leaves a root and when a root receives the response, in case this latency is significantly - // different than Earlybird measured latency. We break it down by client, so that we can tell - // which customers are being hit by this latency. 
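- // Descriptive note (added for clarity): stat name template; with illustrative values such as
- // prefix "archive" and client "webservice", stats would be exported under
- // "archive_client_latency_stats_for_webservice".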
- private static final String STAT_FORMAT = "%s_client_latency_stats_for_%s"; - - private final ConcurrentHashMap requestCounterForClient = - new ConcurrentHashMap<>(); - private final String prefix; - - public ClientLatencyFilter(String prefix) { - this.prefix = prefix; - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - - RequestCounters requestCounters = requestCounterForClient.computeIfAbsent( - ClientIdUtil.getClientIdFromRequest(request), client -> - new RequestCounters(String.format(STAT_FORMAT, prefix, client))); - - RequestCountersEventListener requestCountersEventListener = - new RequestCountersEventListener<>(requestCounters, Clock.SYSTEM_CLOCK, - EarlybirdSuccessfulResponseHandler.INSTANCE); - return service.apply(request).addEventListener(requestCountersEventListener); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdCacheCommonModule.java b/src/java/com/twitter/search/earlybird_root/EarlybirdCacheCommonModule.java deleted file mode 100644 index 79a67ac3b..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdCacheCommonModule.java +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.finagle.memcached.JavaClient; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.EarlybirdCacheSerializer; -import com.twitter.search.common.caching.SearchCacheBuilder; -import com.twitter.search.common.caching.SearchMemcacheClientConfig; -import com.twitter.search.common.caching.SearchMemcacheClientFactory; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.caching.CacheCommonUtil; -import com.twitter.search.earlybird_root.caching.CacheStats; -import com.twitter.search.earlybird_root.caching.DefaultForcedCacheMissDecider; -import com.twitter.search.earlybird_root.filters.PostCacheRequestTypeCountFilter; -import com.twitter.util.Duration; - -/** - * Provides common bindings for cache related modules. - */ -public class EarlybirdCacheCommonModule extends TwitterModule { - private static final String CACHE_VERSION = "1"; - - @Override - public void configure() { - bind(PostCacheRequestTypeCountFilter.class).in(Singleton.class); - bind(DefaultForcedCacheMissDecider.class).in(Singleton.class); - } - - @Provides - @Singleton - @Named(CacheCommonUtil.NAMED_MAX_CACHE_RESULTS) - Integer provideMaxCacheResults() { - return 100; - } - - @Provides - @Singleton - JavaClient provideMemCacheClient( - StatsReceiver statsReceiver, ServiceIdentifier serviceIdentifier) { - SearchMemcacheClientConfig config = new SearchMemcacheClientConfig(); - config.connectTimeoutMs = Duration.fromMilliseconds(100); - config.requestTimeoutMs = Duration.fromMilliseconds(100); - config.failureAccrualFailuresNumber = 150; - config.failureAccrualFailuresDurationMillis = 30000; - config.failureAccrualDuration = Duration.fromMilliseconds(60000); - - return SearchMemcacheClientFactory.createMtlsClient( - "", - "earlybird_root", - statsReceiver, - config, - serviceIdentifier - ); - } - - /** - * Create a new Earlybird cache. - * - * @param client the memcache client to use. 
- * @param decider the decider to use for the cache. - * @param cachePrefix the common cache prefix for the cache type. - * @param serializedKeyPrefix the common cache prefix for the cluster. - * @param cacheExpiryMillis cache entry ttl in milliseconds. - */ - static Cache createCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - String cachePrefix, - String serializedKeyPrefix, - long cacheExpiryMillis, - int cacheKeyMaxBytes, - int cacheValueMaxBytes) { - return new SearchCacheBuilder( - CACHE_VERSION, - client, - cachePrefix, - serializedKeyPrefix, - cacheExpiryMillis) - .withMaxKeyBytes(cacheKeyMaxBytes) - .withMaxValueBytes(cacheValueMaxBytes) - .withRequestTimeoutCounter(CacheStats.REQUEST_TIMEOUT_COUNTER) - .withRequestFailedCounter(CacheStats.REQUEST_FAILED_COUNTER) - .withCacheSerializer(new EarlybirdCacheSerializer()) - .withForceCacheMissDecider(decider) - .withInProcessCache() - .build(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdChainedScatterGatherService.java b/src/java/com/twitter/search/earlybird_root/EarlybirdChainedScatterGatherService.java deleted file mode 100644 index 1c201ce09..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdChainedScatterGatherService.java +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.List; - -import javax.inject.Inject; - -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * A chain of scatter gather services. - * Regular roots use ScatterGatherService directly. This class is only used by multi-tier roots. - */ -public class EarlybirdChainedScatterGatherService extends - Service>> { - - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdChainedScatterGatherService.class); - - private final List> serviceChain; - - /** - * Construct a ScatterGatherServiceChain, by loading configurations from earlybird-tiers.yml. - */ - @Inject - public EarlybirdChainedScatterGatherService( - EarlybirdServiceChainBuilder serviceChainBuilder, - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - PartitionLoggingSupport partitionLoggingSupport) { - - serviceChain = - serviceChainBuilder.buildServiceChain(scatterGatherSupport, partitionLoggingSupport); - - if (serviceChain.isEmpty()) { - LOG.error("At least one tier has to be enabled."); - throw new RuntimeException("Root does not work with all tiers disabled."); - } - } - - @Override - public Future>> apply(EarlybirdRequestContext requestContext) { - // Hit all tiers in parallel. 
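- // Descriptive note (added for clarity): this issues one request per tier without waiting on any
- // of them; each element of the returned list is the still-pending response future from one tier,
- // in serviceChain order.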
- List> resultList = - Lists.newArrayListWithCapacity(serviceChain.size()); - for (final Service service : serviceChain) { - resultList.add(service.apply(requestContext)); - } - return Future.value(resultList); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdCommonModule.java b/src/java/com/twitter/search/earlybird_root/EarlybirdCommonModule.java deleted file mode 100644 index f6918316d..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdCommonModule.java +++ /dev/null @@ -1,170 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import scala.PartialFunction; - -import com.google.inject.Provides; - -import org.apache.thrift.protocol.TProtocolFactory; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import com.twitter.finagle.service.ReqRep; -import com.twitter.finagle.service.ResponseClass; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.finagle.thrift.RichServerParam; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.dark.DarkProxy; -import com.twitter.search.common.dark.ResolverProxy; -import com.twitter.search.common.partitioning.zookeeper.SearchZkClient; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.RemoteClientBuilder; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.root.ServerSetsConfig; -import com.twitter.search.common.util.zookeeper.ZooKeeperProxy; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; - -/** - * Provides common bindings. - */ -public class EarlybirdCommonModule extends TwitterModule { - static final String NAMED_ALT_CLIENT = "alt_client"; - static final String NAMED_EXP_CLUSTER_CLIENT = "exp_cluster_client"; - - private final Flag altZkRoleFlag = createFlag( - "alt_zk_role", - "", - "The alternative ZooKeeper role", - Flaggable.ofString()); - private final Flag altZkClientEnvFlag = createFlag( - "alt_zk_client_env", - "", - "The alternative zk client environment", - Flaggable.ofString()); - private final Flag altPartitionZkPathFlag = createFlag( - "alt_partition_zk_path", - "", - "The alternative client partition zk path", - Flaggable.ofString()); - - @Override - public void configure() { - bind(InitializeFilter.class).in(Singleton.class); - bind(PreCacheRequestTypeCountFilter.class).in(Singleton.class); - - bind(Clock.class).toInstance(Clock.SYSTEM_CLOCK); - bind(QueryLangStatFilter.Config.class).toInstance(new QueryLangStatFilter.Config(100)); - } - - // Used in SearchRootModule. 
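- // Descriptive note (added for clarity): a ResponseClassifier tells Finagle which request/response
- // pairs to count as failures for stats and failure accrual; RootResponseClassifier presumably
- // flags error EarlybirdResponses as failures.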
- @Provides - @Singleton - PartialFunction provideResponseClassifier() { - return new RootResponseClassifier(); - } - - @Provides - @Singleton - Service providesByteService( - EarlybirdService.ServiceIface svc, - DarkProxy darkProxy, - TProtocolFactory protocolFactory) { - return darkProxy.toFilter().andThen( - new EarlybirdService.Service( - svc, new RichServerParam(protocolFactory, SearchRootModule.SCROOGE_BUFFER_SIZE))); - } - - @Provides - @Singleton - @Named(SearchRootModule.NAMED_SERVICE_INTERFACE) - Class providesServiceInterface() { - return EarlybirdService.ServiceIface.class; - } - - @Provides - @Singleton - ZooKeeperProxy provideZookeeperClient() { - return SearchZkClient.getSZooKeeperClient(); - } - - @Provides - @Singleton - EarlybirdFeatureSchemaMerger provideFeatureSchemaMerger() { - return new EarlybirdFeatureSchemaMerger(); - } - - @Provides - @Singleton - @Nullable - @Named(NAMED_ALT_CLIENT) - ServerSetsConfig provideAltServerSetsConfig() { - if (!altZkRoleFlag.isDefined() || !altZkClientEnvFlag.isDefined()) { - return null; - } - - return new ServerSetsConfig(altZkRoleFlag.apply(), altZkClientEnvFlag.apply()); - } - - @Provides - @Singleton - @Nullable - @Named(NAMED_ALT_CLIENT) - PartitionConfig provideAltPartitionConfig(PartitionConfig defaultPartitionConfig) { - if (!altPartitionZkPathFlag.isDefined()) { - return null; - } - - return new PartitionConfig( - defaultPartitionConfig.getNumPartitions(), altPartitionZkPathFlag.apply()); - } - - @Provides - @Singleton - @Nullable - @Named(NAMED_ALT_CLIENT) - RootClientServiceBuilder provideAltRootClientServiceBuilder( - @Named(NAMED_ALT_CLIENT) @Nullable ServerSetsConfig serverSetsConfig, - @Named(SearchRootModule.NAMED_SERVICE_INTERFACE) Class serviceIface, - ResolverProxy resolverProxy, - RemoteClientBuilder remoteClientBuilder) { - if (serverSetsConfig == null) { - return null; - } - - return new RootClientServiceBuilder<>( - serverSetsConfig, serviceIface, resolverProxy, remoteClientBuilder); - } - - @Provides - @Singleton - @Named(NAMED_EXP_CLUSTER_CLIENT) - RootClientServiceBuilder provideExpClusterRootClientServiceBuilder( - @Named(SearchRootModule.NAMED_EXP_CLUSTER_SERVER_SETS_CONFIG) - ServerSetsConfig serverSetsConfig, - @Named(SearchRootModule.NAMED_SERVICE_INTERFACE) Class serviceIface, - ResolverProxy resolverProxy, - RemoteClientBuilder remoteClientBuilder) { - return new RootClientServiceBuilder<>( - serverSetsConfig, serviceIface, resolverProxy, remoteClientBuilder); - } - - @Provides - @Singleton - MtlsServerSessionTrackerFilter - provideMtlsServerSessionTrackerFilter(StatsReceiver statsReceiver) { - return new MtlsServerSessionTrackerFilter<>(statsReceiver); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdFullArchiveScatterGatherSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdFullArchiveScatterGatherSupport.java deleted file mode 100644 index b62adf579..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdFullArchiveScatterGatherSupport.java +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; - -/** - * The EarlybirdServiceScatterGatherSupport implementation used to fan out requests to the earlybird - * partitions in the full archive tiers. 
- */ -public class EarlybirdFullArchiveScatterGatherSupport extends EarlybirdServiceScatterGatherSupport { - /** Creates a new EarlybirdFullArchiveScatterGatherSupport instance. */ - @Inject - EarlybirdFullArchiveScatterGatherSupport( - PartitionMappingManager partitionMappingManager, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(partitionMappingManager, EarlybirdCluster.FULL_ARCHIVE, featureSchemaMerger); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedScatterGatherSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedScatterGatherSupport.java deleted file mode 100644 index 97bcdd620..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedScatterGatherSupport.java +++ /dev/null @@ -1,25 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; - -/** - * The EarlybirdServiceScatterGatherSupport implementation used to fan out requests to the earlybird - * partitions in the protected cluster. - */ -public class EarlybirdProtectedScatterGatherSupport extends EarlybirdServiceScatterGatherSupport { - /** - * Construct a EarlybirdProtectedScatterGatherSupport to do minUserFanOut, - * used only by protected. The main difference from the base class is that - * if the from user ID is not set, exception is thrown. - */ - @Inject - EarlybirdProtectedScatterGatherSupport( - PartitionMappingManager partitionMappingManager, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(partitionMappingManager, EarlybirdCluster.PROTECTED, featureSchemaMerger); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedValidationBehavior.java b/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedValidationBehavior.java deleted file mode 100644 index ee53cfbae..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedValidationBehavior.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.earlybird_root; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; - -public class EarlybirdProtectedValidationBehavior extends EarlybirdServiceValidationBehavior { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdProtectedValidationBehavior.class); - - @Override - public EarlybirdResponse getResponseIfInvalidRequest(EarlybirdRequest request) { - if (!request.isSetSearchQuery() || request.getSearchQuery() == null) { - String errorMsg = "Invalid EarlybirdRequest, no ThriftSearchQuery specified. " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - ThriftSearchQuery searchQuery = request.getSearchQuery(); - - // Make sure this request is valid for the protected tweets cluster. - if (!searchQuery.isSetFromUserIDFilter64() || searchQuery.getFromUserIDFilter64().isEmpty()) { - String errorMsg = "ThriftSearchQuery.fromUserIDFilter64 not set. " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (!searchQuery.isSetSearcherId()) { - String errorMsg = "ThriftSearchQuery.searcherId not set. 
" + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (searchQuery.getSearcherId() < 0) { - String errorMsg = "Invalid ThriftSearchQuery.searcherId: " + searchQuery.getSearcherId() - + ". " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - return super.getResponseIfInvalidRequest(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedWarmup.java b/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedWarmup.java deleted file mode 100644 index c1b022d66..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdProtectedWarmup.java +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.search.earlybird_root; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -public class EarlybirdProtectedWarmup extends EarlybirdWarmup { - - public EarlybirdProtectedWarmup(Clock clock, WarmupConfig config) { - super(clock, config); - } - - /** - * The protected cluster requires all queries to specify a fromUserIdFilter and a searcherId. - */ - @Override - protected EarlybirdRequest createRequest(int requestId) { - EarlybirdRequest request = super.createRequest(requestId); - - Preconditions.checkState(request.isSetSearchQuery()); - request.getSearchQuery().addToFromUserIDFilter64(requestId); - request.getSearchQuery().setSearcherId(0L); - - return request; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdQueryRewriteFilter.java b/src/java/com/twitter/search/earlybird_root/EarlybirdQueryRewriteFilter.java deleted file mode 100644 index 07e07cb2d..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdQueryRewriteFilter.java +++ /dev/null @@ -1,157 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.List; -import java.util.Map; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Predicate; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.Term; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.rewriter.PredicateQueryNodeDropper; -import com.twitter.search.queryparser.visitors.TermExtractorVisitor; -import com.twitter.util.Future; - -/** - * Filter that rewrites the serialized query on EarlybirdRequest. - * As of now, this filter performs the following rewrites: - * - Drop ":v annotated variants based on decider, if the query has enough term nodes. 
- */ -public class EarlybirdQueryRewriteFilter extends - SimpleFilter { - - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdQueryRewriteFilter.class); - - private static final String DROP_PHRASE_VARIANT_FROM_QUERY_DECIDER_KEY_PATTERN = - "drop_variants_from_%s_%s_queries"; - - // only drop variants from queries with more than this number of terms. - private static final String MIN_TERM_COUNT_FOR_VARIANT_DROPPING_DECIDER_KEY_PATTERN = - "drop_variants_from_%s_%s_queries_term_count_threshold"; - - private static final SearchCounter QUERY_PARSER_FAILURE_COUNT = - SearchCounter.export("query_rewrite_filter_query_parser_failure_count"); - - // We currently add variants only to RECENCY and RELEVANCE requests, but it doesn't hurt to export - // stats for all request types. - @VisibleForTesting - static final Map DROP_VARIANTS_QUERY_COUNTS = - Maps.newEnumMap(EarlybirdRequestType.class); - static { - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - DROP_VARIANTS_QUERY_COUNTS.put( - requestType, - SearchCounter.export(String.format("drop_%s_variants_query_count", - requestType.getNormalizedName()))); - } - } - - private static final Predicate DROP_VARIANTS_PREDICATE = - q -> q.hasAnnotationType(Annotation.Type.VARIANT); - - private static final PredicateQueryNodeDropper DROP_VARIANTS_VISITOR = - new PredicateQueryNodeDropper(DROP_VARIANTS_PREDICATE); - - private final SearchDecider decider; - private final String normalizedSearchRootName; - - @Inject - public EarlybirdQueryRewriteFilter( - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - this.decider = decider; - this.normalizedSearchRootName = normalizedSearchRootName; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - Query query = requestContext.getParsedQuery(); - // If there's no serialized query, no rewrite is necessary. - if (query == null) { - return service.apply(requestContext); - } else { - try { - Query variantsRemoved = maybeRemoveVariants(requestContext, query); - - if (query == variantsRemoved) { - return service.apply(requestContext); - } else { - EarlybirdRequestContext clonedRequestContext = - EarlybirdRequestContext.copyRequestContext(requestContext, variantsRemoved); - - return service.apply(clonedRequestContext); - } - } catch (QueryParserException e) { - // It is not clear here that the QueryParserException is the client's fault, or our fault. - // At this point it is most likely not the client's since we have a legitimate parsed Query - // from the client's request, and it's the rewriting that failed. - // In this case we choose to send the query as is (without the rewrite), instead of - // failing the entire request. 
- QUERY_PARSER_FAILURE_COUNT.increment(); - LOG.warn("Failed to rewrite serialized query: " + query.serialize(), e); - return service.apply(requestContext); - } - } - } - - private Query maybeRemoveVariants(EarlybirdRequestContext requestContext, Query query) - throws QueryParserException { - - if (shouldDropVariants(requestContext, query)) { - Query rewrittenQuery = DROP_VARIANTS_VISITOR.apply(query); - if (!query.equals(rewrittenQuery)) { - DROP_VARIANTS_QUERY_COUNTS.get(requestContext.getEarlybirdRequestType()).increment(); - return rewrittenQuery; - } - } - return query; - } - - private boolean shouldDropVariants(EarlybirdRequestContext requestContext, Query query) - throws QueryParserException { - TermExtractorVisitor termExtractorVisitor = new TermExtractorVisitor(false); - List terms = query.accept(termExtractorVisitor); - - EarlybirdRequestType requestType = requestContext.getEarlybirdRequestType(); - - boolean shouldDropVariants = decider.isAvailable(getDropPhaseVariantDeciderKey(requestType)); - - return terms != null - && terms.size() >= decider.getAvailability( - getMinTermCountForVariantDroppingDeciderKey(requestType)) - && shouldDropVariants; - } - - private String getDropPhaseVariantDeciderKey(EarlybirdRequestType requestType) { - return String.format(DROP_PHRASE_VARIANT_FROM_QUERY_DECIDER_KEY_PATTERN, - normalizedSearchRootName, - requestType.getNormalizedName()); - } - - private String getMinTermCountForVariantDroppingDeciderKey(EarlybirdRequestType requestType) { - return String.format(MIN_TERM_COUNT_FOR_VARIANT_DROPPING_DECIDER_KEY_PATTERN, - normalizedSearchRootName, - requestType.getNormalizedName()); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeCgScatterGatherSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeCgScatterGatherSupport.java deleted file mode 100644 index 1ffd2d247..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeCgScatterGatherSupport.java +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; - -/** - * The EarlybirdServiceScatterGatherSupport implementation used to fan out requests to the earlybird - * partitions in the realtime_cg cluster. - */ -public class EarlybirdRealtimeCgScatterGatherSupport extends EarlybirdServiceScatterGatherSupport { - /** Creates a new EarlybirdRealtimeCgScatterGatherSupport instance. 
*/ - @Inject - EarlybirdRealtimeCgScatterGatherSupport( - PartitionMappingManager partitionMappingManager, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(partitionMappingManager, EarlybirdCluster.REALTIME_CG, featureSchemaMerger); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeScatterGatherSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeScatterGatherSupport.java deleted file mode 100644 index abe694857..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdRealtimeScatterGatherSupport.java +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; - -/** - * The EarlybirdServiceScatterGatherSupport implementation used to fan out requests to the earlybird - * partitions in the realtime cluster. - */ -public class EarlybirdRealtimeScatterGatherSupport extends EarlybirdServiceScatterGatherSupport { - /** Creates a new EarlybirdRealtimeScatterGatherSupport instance. */ - @Inject - EarlybirdRealtimeScatterGatherSupport( - PartitionMappingManager partitionMappingManager, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(partitionMappingManager, EarlybirdCluster.REALTIME, featureSchemaMerger); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdRootQueryUtils.java b/src/java/com/twitter/search/earlybird_root/EarlybirdRootQueryUtils.java deleted file mode 100644 index 979885d09..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdRootQueryUtils.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Map; - -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.earlybird_root.visitors.MultiTermDisjunctionPerPartitionVisitor; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; - -public final class EarlybirdRootQueryUtils { - - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdRootQueryUtils.class); - - private EarlybirdRootQueryUtils() { - } - - /** - * Rewrite 'multi_term_disjunction from_user_id' or 'multi_term_disjunction id' based on partition - * for USER_ID/TWEET_ID partitioned cluster - * @return a map with partition id as key and rewritten query as value. - * If there is no 'multi_term_disjunction from_user_id/id' in query, the map will be empty; if all - * ids are truncated for a partition, it will add a NO_MATCH_CONJUNCTION here. 
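- * For example (hypothetical ids): with two partitions, a query containing a
- * 'multi_term_disjunction from_user_id' node with ids 1, 2 and 3 is rewritten per partition so
- * that each partition's copy keeps only the ids mapped to that partition.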
- */ - public static Map rewriteMultiTermDisjunctionPerPartitionFilter( - Query query, PartitionMappingManager partitionMappingManager, int numPartitions) { - Map m = Maps.newHashMap(); - // If there is no parsed query, just return - if (query == null) { - return m; - } - for (int i = 0; i < numPartitions; ++i) { - MultiTermDisjunctionPerPartitionVisitor visitor = - new MultiTermDisjunctionPerPartitionVisitor(partitionMappingManager, i); - try { - Query q = query.accept(visitor); - if (q != null && q != query) { - m.put(i, q); - } - } catch (QueryParserException e) { - // Should not happen, put and log error here just in case - m.put(i, query); - LOG.error( - "MultiTermDisjuctionPerPartitionVisitor cannot process query: " + query.serialize()); - } - } - return m; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceChainBuilder.java b/src/java/com/twitter/search/earlybird_root/EarlybirdServiceChainBuilder.java deleted file mode 100644 index 9790de9fb..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceChainBuilder.java +++ /dev/null @@ -1,278 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Collections; -import java.util.List; -import java.util.Map; -import java.util.SortedSet; -import java.util.TreeSet; - -import javax.inject.Inject; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.root.ScatterGatherService; -import com.twitter.search.common.root.ScatterGatherSupport; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.config.TierConfig; -import com.twitter.search.earlybird.config.TierInfo; -import com.twitter.search.earlybird.config.TierInfoSource; -import com.twitter.search.earlybird.config.TierInfoUtil; -import com.twitter.search.earlybird.config.TierInfoWrapper; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.EarlybirdService.ServiceIface; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; -import com.twitter.util.Function; -import com.twitter.util.Future; - -@Singleton -public class EarlybirdServiceChainBuilder { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdServiceChainBuilder.class); - - private static final String SEARCH_METHOD_NAME = "search"; - - private static final EarlybirdResponse TIER_SKIPPED_RESPONSE = - new 
EarlybirdResponse(EarlybirdResponseCode.TIER_SKIPPED, 0) - .setSearchResults(new ThriftSearchResults()) - .setDebugString("Request to cluster dropped by decider, or sent as dark read."); - - private final EarlybirdTierThrottleDeciders tierThrottleDeciders; - - private final RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter; - - private final SearchDecider decider; - private final String normalizedSearchRootName; - private final RootClientServiceBuilder clientServiceBuilder; - private final String partitionPath; - private final int numPartitions; - private final SortedSet tierInfos; - private final PartitionAccessController partitionAccessController; - private final StatsReceiver statsReceiver; - - /** - * Construct a ScatterGatherServiceChain, by loading configurations from earlybird-tiers.yml. - */ - @Inject - public EarlybirdServiceChainBuilder( - PartitionConfig partitionConfig, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - EarlybirdTierThrottleDeciders tierThrottleDeciders, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName, - SearchDecider decider, - TierInfoSource tierConfig, - RootClientServiceBuilder clientServiceBuilder, - PartitionAccessController partitionAccessController, - StatsReceiver statsReceiver) { - this.partitionAccessController = partitionAccessController; - this.tierThrottleDeciders = Preconditions.checkNotNull(tierThrottleDeciders); - this.requestContextToEarlybirdRequestFilter = requestContextToEarlybirdRequestFilter; - this.normalizedSearchRootName = normalizedSearchRootName; - this.decider = decider; - this.statsReceiver = statsReceiver; - - List tierInformation = tierConfig.getTierInformation(); - if (tierInformation == null || tierInformation.isEmpty()) { - LOG.error( - "No tier found in config file {} Did you set SEARCH_ENV correctly?", - tierConfig.getConfigFileType()); - throw new RuntimeException("No tier found in tier config file."); - } - - // Get the tier info from the tier config yml file - TreeSet infos = new TreeSet<>(TierInfoUtil.TIER_COMPARATOR); - infos.addAll(tierInformation); - this.tierInfos = Collections.unmodifiableSortedSet(infos); - this.clientServiceBuilder = clientServiceBuilder; - this.partitionPath = partitionConfig.getPartitionPath(); - this.numPartitions = partitionConfig.getNumPartitions(); - - LOG.info("Found the following tiers from config: {}", tierInfos); - } - - /** Builds the chain of services that should be queried on each request. */ - public List> buildServiceChain( - ScatterGatherSupport support, - PartitionLoggingSupport partitionLoggingSupport) { - // Make sure the tier serving ranges do not overlap and do not have gaps. - TierInfoUtil.checkTierServingRanges(tierInfos); - - List> chain = Lists.newArrayList(); - - for (TierInfo tierInfo : tierInfos) { - String tierName = tierInfo.getTierName(); - if (tierInfo.isEnabled()) { - String rewrittenPartitionPath = partitionPath; - // This rewriting rule must match the rewriting rule inside - // EarlybirdServer#joinServerSet(). 
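// As a sketch of the effect (path and tier names here are hypothetical): with a partition path of
// "/root/partitions", clients for a non-default tier named "tier_2020" are built against
// "/root/partitions/tier_2020", while the default tier keeps the bare partition path.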
- if (!TierConfig.DEFAULT_TIER_NAME.equals(tierName)) { - rewrittenPartitionPath = partitionPath + "/" + tierName; - } - - clientServiceBuilder.initializeWithPathSuffix( - tierInfo.getTierName(), - numPartitions, - rewrittenPartitionPath); - - try { - chain.add(createTierService( - support, tierInfo, clientServiceBuilder, partitionLoggingSupport)); - } catch (Exception e) { - LOG.error("Failed to build clients for tier: {}", tierInfo.getTierName()); - throw new RuntimeException(e); - } - - } else { - LOG.info("Skipped disabled tier: {}", tierName); - } - } - - return chain; - } - - private Service createTierService( - ScatterGatherSupport support, - final TierInfo tierInfo, - RootClientServiceBuilder builder, - PartitionLoggingSupport partitionLoggingSupport) { - - final String tierName = tierInfo.getTierName(); - RequestSuccessStats stats = new RequestSuccessStats(tierName); - - List> services = - builder.safeBuildServiceList(SEARCH_METHOD_NAME); - - // Get the client list for this tier, and apply the degradationTrackerFilter to each response. - // - // We currently do this only for the EarlybirdSearchMultiTierAdaptor (the full archive cluster). - // If we want to do this for all clusters (or if we want to apply any other filter to all - // earlybird responses, for other clusters), we should change ScatterGatherService's constructor - // to take in a filter, and apply it there. - ClientBackupFilter backupFilter = new ClientBackupFilter( - "root_" + EarlybirdCluster.FULL_ARCHIVE.getNameForStats(), - tierName, - statsReceiver, - decider); - List> clients = Lists.newArrayList(); - ClientLatencyFilter latencyFilter = new ClientLatencyFilter(tierName); - for (Service client : services) { - clients.add(requestContextToEarlybirdRequestFilter - .andThen(backupFilter) - .andThen(latencyFilter) - .andThen(client)); - } - - clients = SkipPartitionFilter.wrapServices(tierName, clients, partitionAccessController); - - // Build the scatter gather service for this tier. - // Each tier has their own stats. - ScatterGatherService scatterGatherService = - new ScatterGatherService<>( - support, clients, stats, partitionLoggingSupport); - - SimpleFilter tierThrottleFilter = - getTierThrottleFilter(tierInfo, tierName); - - EarlybirdTimeRangeFilter timeRangeFilter = - EarlybirdTimeRangeFilter.newTimeRangeFilterWithQueryRewriter( - (requestContext, userOverride) -> new TierInfoWrapper(tierInfo, userOverride), - decider); - - return tierThrottleFilter - .andThen(timeRangeFilter) - .andThen(scatterGatherService); - } - - private SimpleFilter getTierThrottleFilter( - final TierInfo tierInfo, - final String tierName) { - - // A filter that throttles request rate. 
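// The key comes from EarlybirdTierThrottleDeciders and follows the
// percentage_to_hit_cluster_%s_tier_%s pattern (root name, then tier name); for a hypothetical
// root "archive" and tier "tier_2019" that is percentage_to_hit_cluster_archive_tier_tier_2019.
// Requests not admitted by that decider are answered with the blank TIER_SKIPPED_RESPONSE
// defined above.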
- final String tierThrottleDeciderKey = tierThrottleDeciders.getTierThrottleDeciderKey( - normalizedSearchRootName, tierName); - - SimpleFilter tierThrottleFilter = - new SimpleFilter() { - private final Map readCounts = - getReadCountsMap(); - - private Map getReadCountsMap() { - Map readCountsMap = - Maps.newEnumMap(TierInfo.RequestReadType.class); - for (TierInfo.RequestReadType readType : TierInfo.RequestReadType.values()) { - readCountsMap.put(readType, - SearchCounter.export("earlybird_tier_" + tierName + "_" - + readType.name().toLowerCase() + "_read_count")); - } - return Collections.unmodifiableMap(readCountsMap); - } - - private final SearchCounter tierRequestDroppedByDeciderCount = - SearchCounter.export("earlybird_tier_" + tierName - + "_request_dropped_by_decider_count"); - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - // a blank response is returned when a request is dropped by decider, or - // a request is sent as a dark read. - final Future blankTierResponse = Future.value(TIER_SKIPPED_RESPONSE); - if (tierThrottleDeciders.shouldSendRequestToTier(tierThrottleDeciderKey)) { - TierInfoWrapper tierInfoWrapper = - new TierInfoWrapper(tierInfo, requestContext.useOverrideTierConfig()); - - TierInfo.RequestReadType readType = tierInfoWrapper.getReadType(); - readCounts.get(readType).increment(); - switch (readType) { - case DARK: - // dark read: call backend but do not wait for results - service.apply(requestContext); - return blankTierResponse; - case GREY: - // grey read: call backend, wait for results, but discard results. - return service.apply(requestContext).flatMap( - new Function>() { - @Override - public Future apply(EarlybirdResponse v1) { - // No matter what's returned, always return blankTierResponse. - return blankTierResponse; - } - }); - case LIGHT: - // light read: return the future from the backend service. 
- return service.apply(requestContext); - default: - throw new RuntimeException("Unknown read type: " + readType); - } - } else { - // Request is dropped by throttle decider - tierRequestDroppedByDeciderCount.increment(); - return blankTierResponse; - } - } - }; - return tierThrottleFilter; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceLoggingSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdServiceLoggingSupport.java deleted file mode 100644 index c9b0aa776..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceLoggingSupport.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.Timer; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.earlybird.common.EarlybirdRequestPostLogger; -import com.twitter.search.earlybird.common.EarlybirdRequestPreLogger; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public class EarlybirdServiceLoggingSupport extends - LoggingSupport.DefaultLoggingSupport { - private static final int LATENCY_WARN_THRESHOLD_MS = 100; - - private static final Timer DUMMY_TIMER; - - private final EarlybirdRequestPreLogger requestPreLogger; - private final EarlybirdRequestPostLogger requestPostLogger; - - - static { - DUMMY_TIMER = new Timer(TimeUnit.MILLISECONDS); - DUMMY_TIMER.stop(); - } - - public EarlybirdServiceLoggingSupport(SearchDecider decider) { - requestPreLogger = EarlybirdRequestPreLogger.buildForRoot(decider.getDecider()); - requestPostLogger = EarlybirdRequestPostLogger.buildForRoot(LATENCY_WARN_THRESHOLD_MS, - decider.getDecider()); - } - - @Override - public void prelogRequest(EarlybirdRequest req) { - requestPreLogger.logRequest(req); - } - - @Override - public void postLogRequest( - EarlybirdRequest request, - EarlybirdResponse response, - long latencyNanos) { - - Preconditions.checkNotNull(request); - Preconditions.checkNotNull(response); - - response.setResponseTimeMicros(TimeUnit.NANOSECONDS.toMicros(latencyNanos)); - response.setResponseTime(TimeUnit.NANOSECONDS.toMillis(latencyNanos)); - - requestPostLogger.logRequest(request, response, DUMMY_TIMER); - } - - @Override - public void logExceptions(EarlybirdRequest req, Throwable t) { - ExceptionHandler.logException(req, t); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdServicePartitionLoggingSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdServicePartitionLoggingSupport.java deleted file mode 100644 index eb4533a31..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdServicePartitionLoggingSupport.java +++ /dev/null @@ -1,42 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Map; -import java.util.Random; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class EarlybirdServicePartitionLoggingSupport - extends PartitionLoggingSupport.DefaultPartitionLoggingSupport { - private static final Logger PARTITION_LOG = LoggerFactory.getLogger("partitionLogger"); - - private static final long LATENCY_LOG_PARTITIONS_THRESHOLD_MS 
= 500; - private static final double FRACTION_OF_REQUESTS_TO_LOG = 1.0 / 500.0; - - private final Random random = new Random(); - - @Override - public void logPartitionLatencies(EarlybirdRequestContext requestContext, - String tierName, - Map partitionLatenciesMicros, - long latencyMs) { - String logReason = null; - - if (random.nextFloat() <= FRACTION_OF_REQUESTS_TO_LOG) { - logReason = "randomSample"; - } else if (latencyMs > LATENCY_LOG_PARTITIONS_THRESHOLD_MS) { - logReason = "slow"; - } - - EarlybirdRequest request = requestContext.getRequest(); - if (logReason != null && request.isSetSearchQuery()) { - PARTITION_LOG.info("{};{};{};{};{};{}", tierName, logReason, latencyMs, - partitionLatenciesMicros, request.getClientRequestID(), - request.getSearchQuery().getSerializedQuery()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceScatterGatherSupport.java b/src/java/com/twitter/search/earlybird_root/EarlybirdServiceScatterGatherSupport.java deleted file mode 100644 index 8ca892dc7..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceScatterGatherSupport.java +++ /dev/null @@ -1,202 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.ArrayList; -import java.util.List; -import java.util.Map; -import javax.inject.Inject; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import com.twitter.search.common.partitioning.base.PartitionDataType; -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.root.ScatterGatherSupport; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.mergers.EarlybirdResponseMerger; -import com.twitter.search.earlybird_root.mergers.PartitionResponseAccumulator; -import com.twitter.search.queryparser.query.Query; -import com.twitter.util.Future; - -import static com.twitter.search.earlybird_root.visitors.MultiTermDisjunctionPerPartitionVisitor.NO_MATCH_CONJUNCTION; - -public class EarlybirdServiceScatterGatherSupport - implements ScatterGatherSupport { - - private static final EarlybirdResponse EMPTY_RESPONSE = newEmptyResponse(); - - private final PartitionMappingManager partitionMappingManager; - private final EarlybirdCluster cluster; - private final EarlybirdFeatureSchemaMerger featureSchemaMerger; - - @Inject - protected EarlybirdServiceScatterGatherSupport(PartitionMappingManager partitionMappingManager, - EarlybirdCluster cluster, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - this.partitionMappingManager = partitionMappingManager; - this.cluster = cluster; - this.featureSchemaMerger = featureSchemaMerger; - } - - /** - * Fans out the original request to all partitions. 
- */ - private List fanoutToAllPartitions( - EarlybirdRequestContext requestContext, int numPartitions) { - // We don't need to create a deep copy of the original requestContext for every partition, - // because requests are not rewritten once they get to this level: our roots have filters - // that rewrite the requests at the top-level, but we do not rewrite requests per-partition. - List requestContexts = new ArrayList<>(numPartitions); - for (int i = 0; i < numPartitions; ++i) { - requestContexts.add(requestContext); - } - return requestContexts; - } - - private Map> populateIdsForPartition(EarlybirdRequestContext requestContext) { - Map> perPartitionIds = Maps.newHashMap(); - // Based on partition type, populate map for every partition if needed. - if (partitionMappingManager.getPartitionDataType() == PartitionDataType.USER_ID - && requestContext.getRequest().getSearchQuery().getFromUserIDFilter64Size() > 0) { - for (long userId : requestContext.getRequest().getSearchQuery().getFromUserIDFilter64()) { - int userPartition = partitionMappingManager.getPartitionIdForUserId(userId); - if (!perPartitionIds.containsKey(userPartition)) { - perPartitionIds.put(userPartition, Lists.newArrayList()); - } - perPartitionIds.get(userPartition).add(userId); - } - } else if (partitionMappingManager.getPartitionDataType() == PartitionDataType.TWEET_ID - && requestContext.getRequest().getSearchQuery().getSearchStatusIdsSize() > 0) { - for (long id : requestContext.getRequest().getSearchQuery().getSearchStatusIds()) { - int tweetPartition = partitionMappingManager.getPartitionIdForTweetId(id); - if (!perPartitionIds.containsKey(tweetPartition)) { - perPartitionIds.put(tweetPartition, Lists.newArrayList()); - } - perPartitionIds.get(tweetPartition).add(id); - } - } - return perPartitionIds; - } - - private void setPerPartitionIds(EarlybirdRequest request, List ids) { - if (partitionMappingManager.getPartitionDataType() == PartitionDataType.USER_ID) { - request.getSearchQuery().setFromUserIDFilter64(ids); - } else { - request.getSearchQuery().setSearchStatusIds(Sets.newHashSet(ids)); - } - } - - @Override - public EarlybirdResponse emptyResponse() { - return EMPTY_RESPONSE; - } - - public static final EarlybirdResponse newEmptyResponse() { - return new EarlybirdResponse(EarlybirdResponseCode.PARTITION_SKIPPED, 0) - .setSearchResults(new ThriftSearchResults()); - } - - @Override - public List rewriteRequest( - EarlybirdRequestContext requestContext, int rootNumPartitions) { - int numPartitions = partitionMappingManager.getNumPartitions(); - Preconditions.checkState(rootNumPartitions == numPartitions, - "Root's configured numPartitions is different from that configured in database.yml."); - // Rewrite query based on "multi_term_disjunction id/from_user_id" and partition id if needed. - Map perPartitionQueryMap = - requestContext.getRequest().getSearchQuery().getSearchStatusIdsSize() == 0 - ? EarlybirdRootQueryUtils.rewriteMultiTermDisjunctionPerPartitionFilter( - requestContext.getParsedQuery(), - partitionMappingManager, - numPartitions) - : Maps.newHashMap(); - - // Key: partition Id; Value: valid ids list for this partition - Map> perPartitionIds = populateIdsForPartition(requestContext); - - if (perPartitionQueryMap.isEmpty() && perPartitionIds.isEmpty()) { - return fanoutToAllPartitions(requestContext, numPartitions); - } - - List requestContexts = new ArrayList<>(numPartitions); - for (int i = 0; i < numPartitions; ++i) { - requestContexts.add(null); - } - - // Rewrite per partition queries if exist. 
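// Each partition slot below either reuses or copies the request context with the
// fromUserIDFilter64/searchStatusIds and/or the rewritten query narrowed to that partition, or is
// left null (no request sent) when the partition owns none of the ids or its rewritten query is
// NO_MATCH_CONJUNCTION.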
- for (int i = 0; i < numPartitions; ++i) { - if (perPartitionIds.containsKey(i)) { - if (!perPartitionQueryMap.containsKey(i)) { - // Query does not need to be rewritten for the partition - // But we still need to create a copy, because we're gonna - // set fromUserIDFilter64/searchStatusIds - requestContexts.set(i, requestContext.deepCopy()); - setPerPartitionIds(requestContexts.get(i).getRequest(), perPartitionIds.get(i)); - } else if (perPartitionQueryMap.get(i) != NO_MATCH_CONJUNCTION) { - requestContexts.set(i, EarlybirdRequestContext.copyRequestContext( - requestContext, perPartitionQueryMap.get(i))); - setPerPartitionIds(requestContexts.get(i).getRequest(), perPartitionIds.get(i)); - } - } else if (perPartitionIds.isEmpty()) { - // The fromUserIDFilter64/searchStatusIds field is not set on the original request, - // perPartitionQueryMap should decide if we send a request to this partition or not - if (!perPartitionQueryMap.containsKey(i)) { - // Query does not need to be rewritten for the partition - // Don't need to create a copy, because request context won't be changed afterwards - requestContexts.set(i, requestContext); - } else if (perPartitionQueryMap.get(i) != NO_MATCH_CONJUNCTION) { - requestContexts.set(i, EarlybirdRequestContext.copyRequestContext( - requestContext, perPartitionQueryMap.get(i))); - } - } - } - return requestContexts; - } - - /** - * Merges all the sub-results indexed by the partition id. Sub-results with value null - * indicate an error with that partition such as timeout etc. - */ - @Override - public Future merge(EarlybirdRequestContext requestContext, - List> responses) { - EarlybirdResponseMerger merger = EarlybirdResponseMerger.getResponseMerger( - requestContext, - responses, - new PartitionResponseAccumulator(), - cluster, - featureSchemaMerger, - partitionMappingManager.getNumPartitions()); - return merger.merge(); - } - - @Override - public boolean isSuccess(EarlybirdResponse earlybirdResponse) { - return EarlybirdResponseUtil.isSuccessfulResponse(earlybirdResponse); - } - - @Override - public boolean isTimeout(EarlybirdResponse earlybirdResponse) { - return earlybirdResponse.getResponseCode() == EarlybirdResponseCode.SERVER_TIMEOUT_ERROR; - } - - @Override - public boolean isClientCancel(EarlybirdResponse earlybirdResponse) { - return earlybirdResponse.getResponseCode() == EarlybirdResponseCode.CLIENT_CANCEL_ERROR; - } - - @Override - public EarlybirdResponse errorResponse(String debugString) { - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR) - .setDebugString(debugString); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceValidationBehavior.java b/src/java/com/twitter/search/earlybird_root/EarlybirdServiceValidationBehavior.java deleted file mode 100644 index d145b5e4d..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdServiceValidationBehavior.java +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.search.earlybird_root; - -import org.apache.thrift.TException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.root.ValidationBehavior; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird.thrift.EarlybirdDebugInfo; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; 
-import com.twitter.search.earlybird.thrift.ThriftSearchQuery; - -public class EarlybirdServiceValidationBehavior - extends ValidationBehavior.DefaultValidationBehavior { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdServiceValidationBehavior.class); - - private static final EarlybirdDebugInfo EARLYBIRD_DEBUG_INFO = - new EarlybirdDebugInfo().setHost("earlybird_root"); - - private static final SearchCounter INVALID_SUCCESS_RESPONSE_THRESHOLD_TOO_LOW = - SearchCounter.export("invalid_success_response_threshold_too_low"); - private static final SearchCounter INVALID_SUCCESS_RESPONSE_THRESHOLD_TOO_HIGH = - SearchCounter.export("invalid_success_response_threshold_too_high"); - - protected EarlybirdResponse createErrorResponse(String errorMsg) { - EarlybirdResponse response = new EarlybirdResponse(EarlybirdResponseCode.CLIENT_ERROR, 0); - - // We're changing some ERROR logs to WARN on our side, so we want to ensure - // that the response contains the debug information the client needs to - // resolve the problem. - response.setDebugInfo(EARLYBIRD_DEBUG_INFO); - response.setDebugString(errorMsg); - - return response; - } - - @Override - public EarlybirdResponse getResponseIfInvalidRequest(EarlybirdRequest request) { - // First, fix up the query. - EarlybirdRequestUtil.checkAndSetCollectorParams(request); - EarlybirdRequestUtil.logAndFixExcessiveValues(request); - - try { - request.validate(); - } catch (TException e) { - String errorMsg = "Invalid EarlybirdRequest. " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (request.isSetSearchSegmentId() && request.getSearchSegmentId() <= 0) { - String errorMsg = "Bad time slice ID: " + request.getSearchSegmentId(); - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (request.isSetTermStatisticsRequest() - && request.getTermStatisticsRequest().isSetHistogramSettings() - && request.getTermStatisticsRequest().getHistogramSettings().getNumBins() == 0) { - - String errorMsg = "numBins for term statistics histograms request cannot be zero: " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (!request.isSetSearchQuery() - || request.getSearchQuery() == null) { - String errorMsg = "Invalid EarlybirdRequest, no ThriftSearchQuery specified. " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - ThriftSearchQuery searchQuery = request.getSearchQuery(); - - if (!searchQuery.getCollectorParams().isSetNumResultsToReturn()) { - String errorMsg = "ThriftSearchQuery.numResultsToReturn not set. " + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (searchQuery.getCollectorParams().getNumResultsToReturn() < 0) { - String errorMsg = "Invalid ThriftSearchQuery.collectorParams.numResultsToReturn: " - + searchQuery.getCollectorParams().getNumResultsToReturn() + ". 
" + request; - LOG.warn(errorMsg); - return createErrorResponse(errorMsg); - } - - if (request.isSetSuccessfulResponseThreshold()) { - double successfulResponseThreshold = request.getSuccessfulResponseThreshold(); - if (successfulResponseThreshold <= 0) { - String errorMsg = "Success response threshold is below or equal to 0: " - + successfulResponseThreshold + " request: " + request; - LOG.warn(errorMsg); - INVALID_SUCCESS_RESPONSE_THRESHOLD_TOO_LOW.increment(); - return createErrorResponse(errorMsg); - } else if (successfulResponseThreshold > 1) { - String errorMsg = "Success response threshold is above 1: " + successfulResponseThreshold - + " request: " + request; - LOG.warn(errorMsg); - INVALID_SUCCESS_RESPONSE_THRESHOLD_TOO_HIGH.increment(); - return createErrorResponse(errorMsg); - } - } - - return null; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdTierThrottleDeciders.java b/src/java/com/twitter/search/earlybird_root/EarlybirdTierThrottleDeciders.java deleted file mode 100644 index 82f382ed3..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdTierThrottleDeciders.java +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.decider.SearchDecider; - -/** - * Controls fractions of requests that are sent out to each tier. - */ -@Singleton -public class EarlybirdTierThrottleDeciders { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdTierThrottleDeciders.class); - private static final String TIER_THROTTLE_DECIDER_KEY_FORMAT = - "percentage_to_hit_cluster_%s_tier_%s"; - private final SearchDecider decider; - - /** - * Construct a decider using the singleton decider object injected by Guice for the - * specified tier. - * See {@link com.twitter.search.common.root.SearchRootModule#provideDecider()} - */ - @Inject - public EarlybirdTierThrottleDeciders(SearchDecider decider) { - this.decider = decider; - } - - /** - * Return the throttle decider key for the specified tier. - */ - public String getTierThrottleDeciderKey(String clusterName, String tierName) { - String deciderKey = String.format(TIER_THROTTLE_DECIDER_KEY_FORMAT, clusterName, tierName); - if (!decider.getDecider().feature(deciderKey).exists()) { - LOG.warn("Decider key {} not found. Will always return unavailable.", deciderKey); - } - return deciderKey; - } - - /** - * Check whether a request should be sent to the specified tier. 
- */ - public Boolean shouldSendRequestToTier(final String tierDarkReadDeciderKey) { - return decider.isAvailable(tierDarkReadDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/EarlybirdWarmup.java b/src/java/com/twitter/search/earlybird_root/EarlybirdWarmup.java deleted file mode 100644 index 629f2764d..000000000 --- a/src/java/com/twitter/search/earlybird_root/EarlybirdWarmup.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.earlybird_root; - -import scala.runtime.AbstractFunction0; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.search.common.caching.thriftjava.CachingParams; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.ranking.thriftjava.ThriftRankingParams; -import com.twitter.search.common.ranking.thriftjava.ThriftScoringFunctionType; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchRelevanceOptions; -import com.twitter.util.Future; - -/** - * Warm-up logic for Earlybird Roots. - * Sends 60 rounds of requests with a 1 second timeout between each round. - * The actual number of requests sent by each round can be configured. - */ -public class EarlybirdWarmup extends - SearchRootWarmup { - - private static final int WARMUP_NUM_RESULTS = 20; - - private static final String CLIENT_ID = "earlybird_root_warmup"; - - public EarlybirdWarmup(Clock clock, WarmupConfig config) { - super(clock, config); - } - - @Override - protected EarlybirdRequest createRequest(int requestId) { - String query = "(* " + "warmup" + requestId + ")"; - - return new EarlybirdRequest() - .setSearchQuery( - new ThriftSearchQuery() - .setNumResults(WARMUP_NUM_RESULTS) - .setCollectorParams( - new CollectorParams().setNumResultsToReturn(WARMUP_NUM_RESULTS)) - .setRankingMode(ThriftSearchRankingMode.RELEVANCE) - .setRelevanceOptions(new ThriftSearchRelevanceOptions() - .setRankingParams(new ThriftRankingParams() - .setType(ThriftScoringFunctionType.LINEAR))) - .setSerializedQuery(query)) - .setCachingParams(new CachingParams().setCache(false)) - .setClientId(CLIENT_ID); - } - - @Override - protected Future callService( - final EarlybirdService.ServiceIface service, - final EarlybirdRequest request) { - - return ClientId.apply(CLIENT_ID).asCurrent( - new AbstractFunction0>() { - @Override - public Future apply() { - return service.search(request); - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ExceptionHandler.java b/src/java/com/twitter/search/earlybird_root/ExceptionHandler.java deleted file mode 100644 index 4176f358a..000000000 --- a/src/java/com/twitter/search/earlybird_root/ExceptionHandler.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird_root; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; - -public final class ExceptionHandler { - private static final Logger LOG = LoggerFactory.getLogger(ExceptionHandler.class); - - private ExceptionHandler() { - } - - public static void logException(EarlybirdRequest 
request, Throwable e) { - LOG.error("Exception while handling request: {}", request, e); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/FullArchiveRootAppMain.java b/src/java/com/twitter/search/earlybird_root/FullArchiveRootAppMain.java deleted file mode 100644 index a573d5deb..000000000 --- a/src/java/com/twitter/search/earlybird_root/FullArchiveRootAppMain.java +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.Collection; - -import com.google.inject.Module; - -import com.twitter.search.common.root.SearchRootAppMain; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class FullArchiveRootAppMain extends SearchRootAppMain { - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new FullArchiveRootAppMain().main(args); - } - } - - @Override - protected Collection getAdditionalModules() { - return Arrays.asList( - new EarlybirdCommonModule(), - new EarlybirdCacheCommonModule(), - new FullArchiveRootModule(), - new QuotaModule() - ); - } - - @Override - protected Class getSearchRootServerClass() { - return FullArchiveRootServer.class; - } - - @Override - protected Class getServiceIfaceClass() { - return EarlybirdService.ServiceIface.class; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/FullArchiveRootModule.java b/src/java/com/twitter/search/earlybird_root/FullArchiveRootModule.java deleted file mode 100644 index 1a128856b..000000000 --- a/src/java/com/twitter/search/earlybird_root/FullArchiveRootModule.java +++ /dev/null @@ -1,241 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.List; -import java.util.concurrent.TimeUnit; - -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Key; -import com.google.inject.Provides; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.memcached.JavaClient; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.SplitterService; -import com.twitter.search.common.root.ValidationBehavior; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.config.TierInfoSource; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.caching.DefaultForcedCacheMissDecider; -import com.twitter.search.earlybird_root.caching.RecencyCache; -import com.twitter.search.earlybird_root.caching.RelevanceCache; -import com.twitter.search.earlybird_root.caching.StrictRecencyCache; -import com.twitter.search.earlybird_root.caching.TermStatsCache; -import 
com.twitter.search.earlybird_root.caching.TopTweetsCache; -import com.twitter.search.earlybird_root.caching.TopTweetsServicePostProcessor; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; -import com.twitter.util.Future; - -import static com.twitter.search.earlybird_root.EarlybirdCommonModule.NAMED_ALT_CLIENT; - -public class FullArchiveRootModule extends TwitterModule { - private static final String CLUSTER = "archive_full"; - private static final String ALT_TRAFFIC_PERCENTAGE_DECIDER_KEY = - "full_archive_alt_client_traffic_percentage"; - - private final Flag forceAltClientFlag = createFlag( - "force_alt_client", - false, - "Always sends traffic to the alt client", - Flaggable.ofJavaBoolean()); - - @Override - public void configure() { - bind(Key.get(EarlybirdCluster.class)).toInstance(EarlybirdCluster.FULL_ARCHIVE); - - bind(EarlybirdServiceScatterGatherSupport.class) - .to(EarlybirdFullArchiveScatterGatherSupport.class); - - bind(EarlybirdService.ServiceIface.class).to(FullArchiveRootService.class); - } - - @Provides - LoggingSupport provideLoggingSupport( - SearchDecider decider) { - return new EarlybirdServiceLoggingSupport(decider); - } - - @Provides - PartitionLoggingSupport providePartitionLoggingSupport() { - return new EarlybirdServicePartitionLoggingSupport(); - } - - @Provides - ValidationBehavior provideValidationBehavior() { - return new EarlybirdServiceValidationBehavior(); - } - - @Provides - @Singleton - @Nullable - @Named(NAMED_ALT_CLIENT) - EarlybirdServiceChainBuilder provideAltEarlybirdServiceChainBuilder( - @Named(NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - EarlybirdTierThrottleDeciders tierThrottleDeciders, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName, - SearchDecider decider, - TierInfoSource tierConfig, - @Named(NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - PartitionAccessController partitionAccessController, - StatsReceiver statsReceiver - ) { - if (altPartitionConfig == null || altRootClientServiceBuilder == null) { - return null; - } - - return new EarlybirdServiceChainBuilder( - altPartitionConfig, - requestContextToEarlybirdRequestFilter, - tierThrottleDeciders, - normalizedSearchRootName, - decider, - tierConfig, - altRootClientServiceBuilder, - partitionAccessController, - statsReceiver - ); - } - - @Provides - @Singleton - @Nullable - @Named(NAMED_ALT_CLIENT) - EarlybirdChainedScatterGatherService provideAltEarlybirdChainedScatterGatherService( - @Named(NAMED_ALT_CLIENT) @Nullable EarlybirdServiceChainBuilder altServiceChainBuilder, - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - PartitionLoggingSupport partitionLoggingSupport - ) { - if (altServiceChainBuilder == null) { - return null; - } - - return new EarlybirdChainedScatterGatherService( - altServiceChainBuilder, - scatterGatherSupport, - partitionLoggingSupport - ); - } - - @Provides - @Singleton - Service>> - provideEarlybirdChainedScatterGatherService( - EarlybirdChainedScatterGatherService chainedScatterGatherService, - @Named(NAMED_ALT_CLIENT) @Nullable - EarlybirdChainedScatterGatherService altChainedScatterGatherService, - SearchDecider decider - ) { - if (forceAltClientFlag.apply()) { - if (altChainedScatterGatherService == null) { - throw new RuntimeException( - "alt client 
cannot be null when 'force_alt_client' is set to true"); - } else { - return altChainedScatterGatherService; - } - } - - if (altChainedScatterGatherService == null) { - return chainedScatterGatherService; - } - - return new SplitterService<>( - chainedScatterGatherService, - altChainedScatterGatherService, - decider, - ALT_TRAFFIC_PERCENTAGE_DECIDER_KEY - ); - } - - @Provides - @Singleton - @RecencyCache - Cache provideRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, CLUSTER + "_recency_root", - serializedKeyPrefix, TimeUnit.HOURS.toMillis(2), cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @RelevanceCache - Cache provideRelevanceCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, CLUSTER + "_relevance_root", - serializedKeyPrefix, TimeUnit.HOURS.toMillis(2), cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @StrictRecencyCache - Cache provideStrictRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, CLUSTER + "_strict_recency_root", - serializedKeyPrefix, TimeUnit.HOURS.toMillis(2), cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TermStatsCache - Cache provideTermStatsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, CLUSTER + "_termstats_root", - serializedKeyPrefix, TimeUnit.MINUTES.toMillis(5), cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TopTweetsCache - Cache provideTopTweetsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, CLUSTER + "_toptweets_root", - serializedKeyPrefix, TopTweetsServicePostProcessor.CACHE_AGE_IN_MS, - cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - SearchRootWarmup providesSearchRootWarmup( - Clock clock, - WarmupConfig config) { - return new EarlybirdWarmup(clock, config); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/FullArchiveRootServer.java b/src/java/com/twitter/search/earlybird_root/FullArchiveRootServer.java deleted file mode 100644 
index 5fc77bf12..000000000 --- a/src/java/com/twitter/search/earlybird_root/FullArchiveRootServer.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.SearchRootServer; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class FullArchiveRootServer extends SearchRootServer { - - @Inject - public FullArchiveRootServer(FullArchiveRootService svc, Service byteSvc) { - super(svc, byteSvc); - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/FullArchiveRootService.java b/src/java/com/twitter/search/earlybird_root/FullArchiveRootService.java deleted file mode 100644 index 977c8b974..000000000 --- a/src/java/com/twitter/search/earlybird_root/FullArchiveRootService.java +++ /dev/null @@ -1,148 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.List; -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import com.twitter.search.common.clientstats.FinagleClientStatsFilter; -import com.twitter.search.common.root.LoggingFilter; -import com.twitter.search.common.root.RequestValidationFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird_root.caching.RecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceZeroResultsCacheFilter; -import com.twitter.search.earlybird_root.caching.StrictRecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.TermStatsCacheFilter; -import com.twitter.search.earlybird_root.caching.TopTweetsCacheFilter; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.ClientIdQueryOperatorStatsFilter; -import com.twitter.search.earlybird_root.filters.ClientIdQuotaFilter; -import com.twitter.search.earlybird_root.filters.ClientIdTrackingFilter; -import com.twitter.search.earlybird_root.filters.ClientRequestTimeFilter; -import com.twitter.search.earlybird_root.filters.DeadlineTimeoutStatsFilter; -import com.twitter.search.earlybird_root.filters.EarlybirdFeatureSchemaAnnotateFilter; -import com.twitter.search.earlybird_root.filters.FullArchiveProtectedOperatorFilter; -import com.twitter.search.earlybird_root.filters.InitializeRequestContextFilter; -import com.twitter.search.earlybird_root.filters.IsUserProtectedMetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.MetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.NullcastTrackingFilter; -import com.twitter.search.earlybird_root.filters.PostCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; -import com.twitter.search.earlybird_root.filters.QueryOperatorStatFilter; -import com.twitter.search.earlybird_root.filters.RequestResultStatsFilter; -import com.twitter.search.earlybird_root.filters.RequestSuccessStatsFilter; -import com.twitter.search.earlybird_root.filters.ResponseCodeStatFilter; -import 
com.twitter.search.earlybird_root.filters.ResultTierCountFilter; -import com.twitter.search.earlybird_root.filters.SearchPayloadSizeLocalContextFilter; -import com.twitter.search.earlybird_root.filters.RejectRequestsByQuerySourceFilter; -import com.twitter.search.earlybird_root.filters.StratoAttributionClientIdFilter; -import com.twitter.search.earlybird_root.filters.TopLevelExceptionHandlingFilter; -import com.twitter.util.Future; - -@Singleton -public class FullArchiveRootService implements EarlybirdService.ServiceIface { - - private final Service allFiltersAndService; - - @Inject - public FullArchiveRootService( - TopLevelExceptionHandlingFilter topLevelExceptionHandlingFilter, - ResponseCodeStatFilter responseCodeStatFilter, - LoggingFilter loggingFilter, - RequestValidationFilter validationFilter, - MtlsServerSessionTrackerFilter mtlsFilter, - FinagleClientStatsFilter finagleStatsFilter, - InitializeFilter initializeFilter, - InitializeRequestContextFilter initializeRequestContextFilter, - QueryLangStatFilter queryLangStatFilter, - FullArchiveProtectedOperatorFilter protectedOperatorFilter, - QueryOperatorStatFilter queryOperatorStatFilter, - ClientIdQueryOperatorStatsFilter clientIdQueryOperatorStatsFilter, - IsUserProtectedMetadataTrackingFilter isUserProtectedMetadataTrackingFilter, - RequestResultStatsFilter requestResultStatsFilter, - PreCacheRequestTypeCountFilter preCacheCountFilter, - RecencyCacheFilter recencyCacheFilter, - RelevanceCacheFilter relevanceCacheFilter, - RelevanceZeroResultsCacheFilter relevanceZeroResultsCacheFilter, - StrictRecencyCacheFilter strictRecencyCacheFilter, - TermStatsCacheFilter termStatsCacheFilter, - TopTweetsCacheFilter topTweetsCacheFilter, - PostCacheRequestTypeCountFilter postCacheCountFilter, - ClientIdTrackingFilter clientIdTrackingFilter, - ClientIdQuotaFilter quotaFilter, - RejectRequestsByQuerySourceFilter rejectRequestsByQuerySourceFilter, - MetadataTrackingFilter metadataTrackingFilter, - MultiTierResultsMergeFilter multiTierResultsMergeFilter, - RequestSuccessStatsFilter requestSuccessStatsFilter, - NullcastTrackingFilter nullcastTrackingFilter, - ClientRequestTimeFilter clientRequestTimeFilter, - DeadlineTimeoutStatsFilter deadlineTimeoutStatsFilter, - EarlybirdFeatureSchemaAnnotateFilter featureSchemaAnnotateFilter, - SearchPayloadSizeLocalContextFilter searchPayloadSizeLocalContextFilter, - EarlybirdQueryRewriteFilter queryRewriteFilter, - ResultTierCountFilter resultTierCountFilter, - StratoAttributionClientIdFilter stratoAttributionClientIdFilter, - Service>> chainedScatterGatherService - ) { - - this.allFiltersAndService = - loggingFilter - .andThen(topLevelExceptionHandlingFilter) - .andThen(stratoAttributionClientIdFilter) - .andThen(clientRequestTimeFilter) - .andThen(searchPayloadSizeLocalContextFilter) - .andThen(requestSuccessStatsFilter) - .andThen(requestResultStatsFilter) - .andThen(responseCodeStatFilter) - .andThen(validationFilter) - .andThen(mtlsFilter) - .andThen(finagleStatsFilter) - .andThen(clientIdTrackingFilter) - .andThen(quotaFilter) - .andThen(rejectRequestsByQuerySourceFilter) - .andThen(metadataTrackingFilter) - .andThen(initializeFilter) - .andThen(initializeRequestContextFilter) - .andThen(deadlineTimeoutStatsFilter) - .andThen(queryLangStatFilter) - .andThen(protectedOperatorFilter) - .andThen(queryOperatorStatFilter) - .andThen(clientIdQueryOperatorStatsFilter) - .andThen(isUserProtectedMetadataTrackingFilter) - .andThen(preCacheCountFilter) - .andThen(nullcastTrackingFilter) - 
.andThen(recencyCacheFilter) - .andThen(relevanceCacheFilter) - .andThen(relevanceZeroResultsCacheFilter) - .andThen(strictRecencyCacheFilter) - .andThen(termStatsCacheFilter) - .andThen(topTweetsCacheFilter) - .andThen(postCacheCountFilter) - .andThen(queryRewriteFilter) - .andThen(featureSchemaAnnotateFilter) - .andThen(resultTierCountFilter) - .andThen(multiTierResultsMergeFilter) - .andThen(chainedScatterGatherService); - } - - @Override - public Future getName() { - return Future.value("fullarchive"); - } - - @Override - public Future getStatus() { - throw new UnsupportedOperationException("not supported"); - } - - @Override - public Future search(EarlybirdRequest request) { - return allFiltersAndService.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/InitializeFilter.java b/src/java/com/twitter/search/earlybird_root/InitializeFilter.java deleted file mode 100644 index 5d5659965..000000000 --- a/src/java/com/twitter/search/earlybird_root/InitializeFilter.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.earlybird_root; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.relevance.ranking.ActionChainManager; -import com.twitter.search.common.runtime.ActionChainDebugManager; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -/** - * Initialize request-scope state and clean them at the end. - */ -public class InitializeFilter extends SimpleFilter { - @Override - public Future apply(EarlybirdRequest request, - Service service) { - ActionChainDebugManager.update(new ActionChainManager(request.getDebugMode()), "EarlybirdRoot"); - return service.apply(request).addEventListener(new FutureEventListener() { - @Override - public void onSuccess(EarlybirdResponse response) { - cleanup(); - } - - @Override - public void onFailure(Throwable cause) { - cleanup(); - } - }); - } - - private void cleanup() { - ActionChainDebugManager.clearLocals(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/MultiTierResultsMergeFilter.java b/src/java/com/twitter/search/earlybird_root/MultiTierResultsMergeFilter.java deleted file mode 100644 index d862e5047..000000000 --- a/src/java/com/twitter/search/earlybird_root/MultiTierResultsMergeFilter.java +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.List; - -import javax.inject.Inject; - -import com.twitter.finagle.Filter; -import com.twitter.finagle.Service; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.mergers.EarlybirdResponseMerger; -import com.twitter.search.earlybird_root.mergers.TierResponseAccumulator; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * Filter used to merge results from multiple tiers - */ -public class MultiTierResultsMergeFilter extends - Filter>> { - - private final EarlybirdFeatureSchemaMerger featureSchemaMerger; - - @Inject - public MultiTierResultsMergeFilter(EarlybirdFeatureSchemaMerger featureSchemaMerger) { - this.featureSchemaMerger = featureSchemaMerger; - } - - @Override - public Future apply( - final 
EarlybirdRequestContext request, - Service>> service) { - return service.apply(request).flatMap(Function.func(responses -> merge(request, responses))); - } - - private Future merge( - EarlybirdRequestContext requestContext, - List> responses) { - - // For multi-tier response merging, the number of partitions does not have meaning because - // the response is not uniformly partitioned anymore. We pass Integer.MAX_VALUE for stats - // counting purposes. - EarlybirdResponseMerger merger = EarlybirdResponseMerger.getResponseMerger( - requestContext, - responses, - new TierResponseAccumulator(), - EarlybirdCluster.FULL_ARCHIVE, - featureSchemaMerger, - Integer.MAX_VALUE); - return merger.merge(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/PartitionAccessController.java b/src/java/com/twitter/search/earlybird_root/PartitionAccessController.java deleted file mode 100644 index bf39b39fb..000000000 --- a/src/java/com/twitter/search/earlybird_root/PartitionAccessController.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -/** - * Determines if a root should send requests to certain partitions based on whether they have been turned - * off by decider. - */ -public class PartitionAccessController { - private final String clusterName; - private final SearchDecider decider; - - @Inject - public PartitionAccessController( - @Named(SearchRootModule.NAMED_SEARCH_ROOT_NAME) String clusterName, - @Named(SearchRootModule.NAMED_PARTITION_DECIDER) SearchDecider partitionDecider) { - this.clusterName = clusterName; - this.decider = partitionDecider; - } - - /** - * Should the root send requests to a given partition. - * Designed to be used to quickly stop hitting a partition if there are problems with it.
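 * For example, enabling a hypothetical decider named cluster_realtime_skip_tier_tier1_partition_3
 * would make this root skip partition 3 of tier1 for every client; the more specific client-id
 * and request-type variants of the key checked below narrow the skip further.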
- */ - public boolean canAccessPartition( - String tierName, int partitionNum, String clientId, EarlybirdRequestType requestType) { - - String partitionDeciderName = - String.format("cluster_%s_skip_tier_%s_partition_%s", clusterName, tierName, partitionNum); - if (decider.isAvailable(partitionDeciderName)) { - SearchCounter.export(partitionDeciderName).increment(); - return false; - } - - String clientDeciderName = String.format("cluster_%s_skip_tier_%s_partition_%s_client_id_%s", - clusterName, tierName, partitionNum, clientId); - if (decider.isAvailable(clientDeciderName)) { - SearchCounter.export(clientDeciderName).increment(); - return false; - } - - String requestTypeDeciderName = String.format( - "cluster_%s_skip_tier_%s_partition_%s_request_type_%s", - clusterName, tierName, partitionNum, requestType.getNormalizedName()); - if (decider.isAvailable(requestTypeDeciderName)) { - SearchCounter.export(requestTypeDeciderName).increment(); - return false; - } - - String clientRequestTypeDeciderName = String.format( - "cluster_%s_skip_tier_%s_partition_%s_client_id_%s_request_type_%s", - clusterName, tierName, partitionNum, clientId, requestType.getNormalizedName()); - if (decider.isAvailable(clientRequestTypeDeciderName)) { - SearchCounter.export(clientRequestTypeDeciderName).increment(); - return false; - } - - return true; - } - - public String getClusterName() { - return clusterName; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ProtectedRootAppMain.java b/src/java/com/twitter/search/earlybird_root/ProtectedRootAppMain.java deleted file mode 100644 index 68d155ae6..000000000 --- a/src/java/com/twitter/search/earlybird_root/ProtectedRootAppMain.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.Collection; - -import com.google.inject.Module; - -import com.twitter.search.common.root.SearchRootAppMain; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class ProtectedRootAppMain extends SearchRootAppMain { - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new ProtectedRootAppMain().main(args); - } - } - - @Override - protected Collection getAdditionalModules() { - return Arrays.asList( - new EarlybirdCommonModule(), - new EarlybirdCacheCommonModule(), - new ProtectedRootAppModule(), - new ProtectedScatterGatherModule()); - } - - @Override - protected Class getSearchRootServerClass() { - return ProtectedRootServer.class; - } - - @Override - protected Class getServiceIfaceClass() { - return EarlybirdService.ServiceIface.class; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ProtectedRootAppModule.java b/src/java/com/twitter/search/earlybird_root/ProtectedRootAppModule.java deleted file mode 100644 index 09b137f7f..000000000 --- a/src/java/com/twitter/search/earlybird_root/ProtectedRootAppModule.java +++ /dev/null @@ -1,78 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Key; -import com.google.inject.Provides; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.memcached.JavaClient; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.common.root.PartitionLoggingSupport; -import 
com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.ValidationBehavior; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.caching.DefaultForcedCacheMissDecider; -import com.twitter.search.earlybird_root.caching.RecencyCache; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class ProtectedRootAppModule extends TwitterModule { - @Override - public void configure() { - bind(Key.get(EarlybirdCluster.class)).toInstance(EarlybirdCluster.PROTECTED); - - bind(EarlybirdServiceScatterGatherSupport.class) - .to(EarlybirdProtectedScatterGatherSupport.class); - - bind(EarlybirdService.ServiceIface.class).to(ProtectedRootService.class); - } - - @Provides - @Singleton - LoggingSupport provideLoggingSupport( - SearchDecider decider) { - return new EarlybirdServiceLoggingSupport(decider); - } - - @Provides - @Singleton - PartitionLoggingSupport providePartitionLoggingSupport() { - return new EarlybirdServicePartitionLoggingSupport(); - } - - @Provides - @Singleton - ValidationBehavior providesValidation() { - return new EarlybirdProtectedValidationBehavior(); - } - - @Provides - @Singleton - @RecencyCache - Cache provideRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule - .createCache(client, decider, "realtime_protected_recency_root", serializedKeyPrefix, - 20000L, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - SearchRootWarmup providesSearchRootWarmup( - Clock clock, - WarmupConfig config) { - return new EarlybirdProtectedWarmup(clock, config); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ProtectedRootServer.java b/src/java/com/twitter/search/earlybird_root/ProtectedRootServer.java deleted file mode 100644 index 926250641..000000000 --- a/src/java/com/twitter/search/earlybird_root/ProtectedRootServer.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.SearchRootServer; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class ProtectedRootServer extends SearchRootServer { - - @Inject - public ProtectedRootServer(ProtectedRootService svc, Service byteSvc) { - super(svc, byteSvc); - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/ProtectedRootService.java b/src/java/com/twitter/search/earlybird_root/ProtectedRootService.java deleted file mode 100644 index 2e9b0ed0a..000000000 --- a/src/java/com/twitter/search/earlybird_root/ProtectedRootService.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import 
com.twitter.search.common.clientstats.FinagleClientStatsFilter; -import com.twitter.search.common.root.LoggingFilter; -import com.twitter.search.common.root.RequestValidationFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird_root.caching.RecencyCacheFilter; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.ClientIdTrackingFilter; -import com.twitter.search.earlybird_root.filters.ClientRequestTimeFilter; -import com.twitter.search.earlybird_root.filters.DeadlineTimeoutStatsFilter; -import com.twitter.search.earlybird_root.filters.DropAllProtectedOperatorFilter; -import com.twitter.search.earlybird_root.filters.EarlybirdFeatureSchemaAnnotateFilter; -import com.twitter.search.earlybird_root.filters.InitializeRequestContextFilter; -import com.twitter.search.earlybird_root.filters.MetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.NullcastTrackingFilter; -import com.twitter.search.earlybird_root.filters.PostCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; -import com.twitter.search.earlybird_root.filters.QueryOperatorStatFilter; -import com.twitter.search.earlybird_root.filters.RequestResultStatsFilter; -import com.twitter.search.earlybird_root.filters.ResponseCodeStatFilter; -import com.twitter.search.earlybird_root.filters.SearchPayloadSizeLocalContextFilter; -import com.twitter.search.earlybird_root.filters.StratoAttributionClientIdFilter; -import com.twitter.search.earlybird_root.filters.TopLevelExceptionHandlingFilter; -import com.twitter.util.Future; - -@Singleton -public class ProtectedRootService implements EarlybirdService.ServiceIface { - - private final Service allFiltersAndService; - - @Inject - public ProtectedRootService( - LoggingFilter loggingFilter, - RequestValidationFilter validationFilter, - MtlsServerSessionTrackerFilter mtlsFilter, - FinagleClientStatsFilter finagleStatsFilter, - TopLevelExceptionHandlingFilter topLevelExceptionHandlingFilter, - ResponseCodeStatFilter responseCodeStatFilter, - InitializeFilter initializeFilter, - InitializeRequestContextFilter initializeRequestContextFilter, - QueryLangStatFilter queryLangStatFilter, - DropAllProtectedOperatorFilter dropAllProtectedOperatorFilter, - QueryOperatorStatFilter queryOperatorStatFilter, - RequestResultStatsFilter requestResultStatsFilter, - PreCacheRequestTypeCountFilter preCacheCountFilter, - RecencyCacheFilter recencyCacheFilter, - PostCacheRequestTypeCountFilter postCacheCountFilter, - ClientIdTrackingFilter clientIdTrackingFilter, - MetadataTrackingFilter metadataTrackingFilter, - NullcastTrackingFilter nullcastTrackingFilter, - ClientRequestTimeFilter clientRequestTimeFilter, - DeadlineTimeoutStatsFilter deadlineTimeoutStatsFilter, - EarlybirdFeatureSchemaAnnotateFilter featureSchemaAnnotateFilter, - SearchPayloadSizeLocalContextFilter searchPayloadSizeLocalContextFilter, - @Named(ProtectedScatterGatherModule.NAMED_SCATTER_GATHER_SERVICE) - Service scatterGatherService, - StratoAttributionClientIdFilter stratoAttributionClientIdFilter) { - allFiltersAndService = loggingFilter - .andThen(topLevelExceptionHandlingFilter) - 
.andThen(stratoAttributionClientIdFilter) - .andThen(clientRequestTimeFilter) - .andThen(searchPayloadSizeLocalContextFilter) - .andThen(responseCodeStatFilter) - .andThen(requestResultStatsFilter) - .andThen(validationFilter) - .andThen(mtlsFilter) - .andThen(finagleStatsFilter) - .andThen(clientIdTrackingFilter) - .andThen(metadataTrackingFilter) - .andThen(initializeFilter) - .andThen(initializeRequestContextFilter) - .andThen(deadlineTimeoutStatsFilter) - .andThen(queryLangStatFilter) - .andThen(nullcastTrackingFilter) - .andThen(dropAllProtectedOperatorFilter) - .andThen(queryOperatorStatFilter) - .andThen(preCacheCountFilter) - .andThen(recencyCacheFilter) - .andThen(postCacheCountFilter) - .andThen(featureSchemaAnnotateFilter) - .andThen(scatterGatherService); - } - - - @Override - public Future getName() { - return Future.value("protectedroot"); - } - - @Override - public Future getStatus() { - throw new UnsupportedOperationException("not supported"); - } - - @Override - public Future search(EarlybirdRequest request) { - return allFiltersAndService.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ProtectedScatterGatherModule.java b/src/java/com/twitter/search/earlybird_root/ProtectedScatterGatherModule.java deleted file mode 100644 index 60a02e666..000000000 --- a/src/java/com/twitter/search/earlybird_root/ProtectedScatterGatherModule.java +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.finagle.Service; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; - -public class ProtectedScatterGatherModule extends ScatterGatherModule { - /** - * Provides the scatterGatherService for the protected cluster. 
- */ - @Provides - @Singleton - @Named(NAMED_SCATTER_GATHER_SERVICE) - @Override - public Service provideScatterGatherService( - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - PartitionConfig partitionConfig, - RootClientServiceBuilder rootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_EXP_CLUSTER_CLIENT) - RootClientServiceBuilder - expClusterRootClientServiceBuilder, // unused in protected roots - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider) { - return buildScatterOrSplitterService( - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - partitionConfig, - rootClientServiceBuilder, - altPartitionConfig, - altRootClientServiceBuilder, - statsReceiver, - cluster, - decider - ); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/QuotaModule.java b/src/java/com/twitter/search/earlybird_root/QuotaModule.java deleted file mode 100644 index d013e17d3..000000000 --- a/src/java/com/twitter/search/earlybird_root/QuotaModule.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.concurrent.Executors; -import java.util.concurrent.ScheduledExecutorService; -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.util.concurrent.ThreadFactoryBuilder; -import com.google.common.util.concurrent.TwitterRateLimiterProxyFactory; -import com.google.inject.Provides; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.common.util.Clock; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.ClientIdArchiveAccessFilter; -import com.twitter.search.earlybird_root.filters.ClientIdQuotaFilter; -import com.twitter.search.earlybird_root.filters.DisableClientByTierFilter; -import com.twitter.search.earlybird_root.quota.ConfigBasedQuotaConfig; -import com.twitter.search.earlybird_root.quota.ConfigRepoBasedQuotaManager; - -public class QuotaModule extends TwitterModule { - @VisibleForTesting - public static final String NAMED_QUOTA_CONFIG_PATH = "quotaConfigPath"; - public static final String NAMED_CLIENT_QUOTA_KEY = "clientQuotaKey"; - private static final String NAMED_REQUIRE_QUOTA_CONFIG_FOR_CLIENTS - = "requireQuotaConfigForClients"; - - private final Flag quotaConfigPathFlag = createMandatoryFlag( - "quota_config_path", - "", - "Path to the quota config file", - Flaggable.ofString()); - - private final Flag clientQuotaKeyFlag = createFlag( - "client_quota_key", - "quota", - "The key that will be used to extract client quotas", - Flaggable.ofString()); - - private final Flag requireQuotaConfigForClientsFlag = createFlag( - "require_quota_config_for_clients", - true, - "If true, require a quota value under for each client in the config", - Flaggable.ofJavaBoolean()); - - @Provides - @Singleton - @Named(NAMED_QUOTA_CONFIG_PATH) - 
String provideQuotaConfigPath() { - return quotaConfigPathFlag.apply(); - } - - @Provides - @Singleton - @Named(NAMED_CLIENT_QUOTA_KEY) - String provideClientQuotaKey() { - return clientQuotaKeyFlag.apply(); - } - - @Provides - @Singleton - @Named(NAMED_REQUIRE_QUOTA_CONFIG_FOR_CLIENTS) - boolean provideRequireQuotaConfigForClients() { - return requireQuotaConfigForClientsFlag.apply(); - } - - @Provides - @Singleton - ClientIdQuotaFilter provideConfigRepoBasedClientIdQuotaFilter( - ConfigRepoBasedQuotaManager configRepoBasedQuotaManager, - TwitterRateLimiterProxyFactory rateLimiterProxyFactory) throws Exception { - return new ClientIdQuotaFilter(configRepoBasedQuotaManager, rateLimiterProxyFactory); - } - - @Provides - @Singleton - ConfigBasedQuotaConfig providesConfigBasedQuotaConfig( - @Nullable @Named(NAMED_QUOTA_CONFIG_PATH) String quotaConfigPath, - @Nullable @Named(NAMED_CLIENT_QUOTA_KEY) String clientQuotaKey, - @Nullable @Named(NAMED_REQUIRE_QUOTA_CONFIG_FOR_CLIENTS) boolean requireQuotaConfigForClients, - Clock clock - ) throws Exception { - ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor( - new ThreadFactoryBuilder() - .setNameFormat("quota-config-reloader") - .setDaemon(true) - .build()); - return ConfigBasedQuotaConfig.newConfigBasedQuotaConfig( - quotaConfigPath, clientQuotaKey, requireQuotaConfigForClients, executorService, clock); - } - - @Provides - @Singleton - DisableClientByTierFilter provideDisableClientByTierFilter( - ConfigRepoBasedQuotaManager configRepoBasedQuotaManager, - SearchDecider searchDecider) { - return new DisableClientByTierFilter(configRepoBasedQuotaManager, searchDecider); - } - - @Provides - @Singleton - ClientIdArchiveAccessFilter clientIdArchiveAccessFilter( - ConfigRepoBasedQuotaManager configRepoBasedQuotaManager) { - return new ClientIdArchiveAccessFilter(configRepoBasedQuotaManager); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/README.md b/src/java/com/twitter/search/earlybird_root/README.md deleted file mode 100644 index 750b8cdd6..000000000 --- a/src/java/com/twitter/search/earlybird_root/README.md +++ /dev/null @@ -1,8 +0,0 @@ -# Search Index (Earlybird) Root -Earlybird Roots are fanout services that fan out requests to different Earlybird clusters or partitions. - -## Architecture -![in-network](img/serving.png) - -The superroot serves as the entry point to the Earlybird (Search Index) service. Requests coming to the superroot are first fanned out to the realtime (public) and protected roots in parallel, and may also be fanned out to the archive root if the realtime and protected clusters don't return enough results. -The realtime, protected, and archive roots fan out requests to the earlybird partitions where the index is stored and served. 
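To make the fan-out-and-fallback flow described in this README concrete, here is a minimal, self-contained sketch in plain Java. It is illustrative only: it uses `CompletableFuture` instead of the Finagle `Service`/`Future` stack these roots are actually built on, and every name in it (`ScatterGatherSketch`, `PartitionClient`, `minResults`) is hypothetical rather than taken from the codebase.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the root fan-out: scatter a query to every partition of a
// cluster, gather and merge the per-partition results, and fall back to the archive
// cluster when the realtime results are too few.
final class ScatterGatherSketch {
  /** Stand-in for a per-partition Earlybird client; returns matching tweet IDs. */
  interface PartitionClient {
    CompletableFuture<List<Long>> search(String query);
  }

  private final List<PartitionClient> realtimePartitions;
  private final List<PartitionClient> archivePartitions;
  private final int minResults;

  ScatterGatherSketch(List<PartitionClient> realtimePartitions,
                      List<PartitionClient> archivePartitions,
                      int minResults) {
    this.realtimePartitions = realtimePartitions;
    this.archivePartitions = archivePartitions;
    this.minResults = minResults;
  }

  /** Fan out to the realtime partitions first; widen to the archive only if needed. */
  CompletableFuture<List<Long>> search(String query) {
    return scatter(realtimePartitions, query).thenCompose(realtime ->
        realtime.size() >= minResults
            ? CompletableFuture.completedFuture(realtime)
            : scatter(archivePartitions, query)
                .thenApply(archive ->
                    Stream.concat(realtime.stream(), archive.stream())
                        .collect(Collectors.toList())));
  }

  /** Send the query to every partition in parallel and concatenate the responses. */
  private static CompletableFuture<List<Long>> scatter(
      List<PartitionClient> partitions, String query) {
    List<CompletableFuture<List<Long>>> futures = partitions.stream()
        .map(partition -> partition.search(query))
        .collect(Collectors.toList());
    return CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
        .thenApply(ignored -> futures.stream()
            .flatMap(future -> future.join().stream())
            .collect(Collectors.toList()));
  }
}
```

The real roots do considerably more than this sketch: per-partition and per-tier responses are merged via `EarlybirdResponseMerger`, and every request passes through the filter chains shown in the root service classes in this directory (caching, quota, validation, stats).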
diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppMain.java b/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppMain.java deleted file mode 100644 index 748d4556b..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppMain.java +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.Collection; - -import com.google.inject.Module; - -import com.twitter.search.common.root.SearchRootAppMain; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class RealtimeCgRootAppMain extends SearchRootAppMain { - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new RealtimeCgRootAppMain().main(args); - } - } - - @Override - protected Collection getAdditionalModules() { - return Arrays.asList( - new EarlybirdCommonModule(), - new EarlybirdCacheCommonModule(), - new RealtimeCgRootAppModule(), - new RealtimeCgScatterGatherModule(), - new QuotaModule()); - } - - @Override - protected Class getSearchRootServerClass() { - return RealtimeCgRootServer.class; - } - - @Override - protected Class getServiceIfaceClass() { - return EarlybirdService.ServiceIface.class; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppModule.java b/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppModule.java deleted file mode 100644 index 2e0cde6a2..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootAppModule.java +++ /dev/null @@ -1,152 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Key; -import com.google.inject.Provides; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.memcached.JavaClient; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.ValidationBehavior; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.caching.DefaultForcedCacheMissDecider; -import com.twitter.search.earlybird_root.caching.FacetsCache; -import com.twitter.search.earlybird_root.caching.RecencyCache; -import com.twitter.search.earlybird_root.caching.RelevanceCache; -import com.twitter.search.earlybird_root.caching.StrictRecencyCache; -import com.twitter.search.earlybird_root.caching.TermStatsCache; -import com.twitter.search.earlybird_root.caching.TopTweetsCache; -import com.twitter.search.earlybird_root.caching.TopTweetsServicePostProcessor; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RealtimeCgRootAppModule extends TwitterModule { - private static final long RECENCY_CACHE_TTL_MILLIS = 20000L; - private static final long RELEVANCE_CACHE_TTL_MILLIS = 20000L; - private static final long FACETS_CACHE_TTL_MILLIS = 300000L; - private static final long 
TERMSTATS_CACHE_TTL_MILLIS = 300000L; - - @Override - public void configure() { - bind(Key.get(EarlybirdCluster.class)).toInstance(EarlybirdCluster.REALTIME_CG); - - bind(EarlybirdServiceScatterGatherSupport.class) - .to(EarlybirdRealtimeCgScatterGatherSupport.class); - - bind(EarlybirdService.ServiceIface.class).to(RealtimeCgRootService.class); - } - - @Provides - @Singleton - @RecencyCache - Cache provideRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_cg_recency_root", - serializedKeyPrefix, RECENCY_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @RelevanceCache - Cache provideRelevanceCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_cg_relevance_root", - serializedKeyPrefix, RELEVANCE_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @StrictRecencyCache - Cache provideStrictRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache( - client, decider, "realtime_cg_strict_recency_root", serializedKeyPrefix, - RECENCY_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @FacetsCache - Cache provideFacetsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_cg_facets_root", - serializedKeyPrefix, FACETS_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TermStatsCache - Cache provideTermStatsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_cg_termstats_root", - serializedKeyPrefix, TERMSTATS_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TopTweetsCache - Cache provideTopTweetsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - 
return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_cg_toptweets_root", - serializedKeyPrefix, TopTweetsServicePostProcessor.CACHE_AGE_IN_MS, - cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - SearchRootWarmup providesSearchRootWarmup( - Clock clock, - WarmupConfig config) { - return new EarlybirdWarmup(clock, config); - } - - @Provides - public LoggingSupport provideLoggingSupport( - SearchDecider decider) { - return new EarlybirdServiceLoggingSupport(decider); - } - - @Provides - public PartitionLoggingSupport providePartitionLoggingSupport() { - return new EarlybirdServicePartitionLoggingSupport(); - } - - @Provides - public ValidationBehavior provideValidationBehavior() { - return new EarlybirdServiceValidationBehavior(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootServer.java b/src/java/com/twitter/search/earlybird_root/RealtimeCgRootServer.java deleted file mode 100644 index 5d90da3af..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootServer.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.SearchRootServer; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -@Singleton -public class RealtimeCgRootServer extends SearchRootServer { - - @Inject - public RealtimeCgRootServer(RealtimeCgRootService svc, Service byteSvc) { - super(svc, byteSvc); - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootService.java b/src/java/com/twitter/search/earlybird_root/RealtimeCgRootService.java deleted file mode 100644 index 1e8a9cb76..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeCgRootService.java +++ /dev/null @@ -1,132 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Named; -import javax.inject.Singleton; - - -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import com.twitter.search.common.clientstats.FinagleClientStatsFilter; -import com.twitter.search.common.root.LoggingFilter; -import com.twitter.search.common.root.RequestValidationFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird_root.caching.FacetsCacheFilter; -import com.twitter.search.earlybird_root.caching.RecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceZeroResultsCacheFilter; -import com.twitter.search.earlybird_root.caching.StrictRecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.TermStatsCacheFilter; -import com.twitter.search.earlybird_root.caching.TopTweetsCacheFilter; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.ClientIdQuotaFilter; -import com.twitter.search.earlybird_root.filters.ClientIdTrackingFilter; -import com.twitter.search.earlybird_root.filters.ClientRequestTimeFilter; -import com.twitter.search.earlybird_root.filters.DeadlineTimeoutStatsFilter; -import com.twitter.search.earlybird_root.filters.DropAllProtectedOperatorFilter; -import 
com.twitter.search.earlybird_root.filters.EarlybirdFeatureSchemaAnnotateFilter; -import com.twitter.search.earlybird_root.filters.InitializeRequestContextFilter; -import com.twitter.search.earlybird_root.filters.MetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.NullcastTrackingFilter; -import com.twitter.search.earlybird_root.filters.PostCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; -import com.twitter.search.earlybird_root.filters.QueryOperatorStatFilter; -import com.twitter.search.earlybird_root.filters.RequestResultStatsFilter; -import com.twitter.search.earlybird_root.filters.ResponseCodeStatFilter; -import com.twitter.search.earlybird_root.filters.SearchPayloadSizeLocalContextFilter; -import com.twitter.search.earlybird_root.filters.StratoAttributionClientIdFilter; -import com.twitter.search.earlybird_root.filters.TopLevelExceptionHandlingFilter; -import com.twitter.util.Future; - -@Singleton -public class RealtimeCgRootService implements EarlybirdService.ServiceIface { - - private final Service allFiltersAndService; - - @Inject - public RealtimeCgRootService( - TopLevelExceptionHandlingFilter topLevelExceptionHandlingFilter, - ResponseCodeStatFilter responseCodeStatFilter, - LoggingFilter loggingFilter, - RequestValidationFilter validationFilter, - MtlsServerSessionTrackerFilter mtlsFilter, - FinagleClientStatsFilter finagleStatsFilter, - InitializeFilter initializeFilter, - InitializeRequestContextFilter initializeRequestContextFilter, - QueryLangStatFilter queryLangStatFilter, - DropAllProtectedOperatorFilter dropAllProtectedOperatorFilter, - QueryOperatorStatFilter queryOperatorStatFilter, - RequestResultStatsFilter requestResultStatsFilter, - PreCacheRequestTypeCountFilter preCacheCountFilter, - RecencyCacheFilter recencyCacheFilter, - RelevanceCacheFilter relevanceCacheFilter, - RelevanceZeroResultsCacheFilter relevanceZeroResultsCacheFilter, - StrictRecencyCacheFilter strictRecencyCacheFilter, - FacetsCacheFilter facetsCacheFilter, - TermStatsCacheFilter termStatsCacheFilter, - TopTweetsCacheFilter topTweetsCacheFilter, - PostCacheRequestTypeCountFilter postCacheCountFilter, - ClientIdTrackingFilter clientIdTrackingFilter, - ClientIdQuotaFilter quotaFilter, - MetadataTrackingFilter metadataTrackingFilter, - NullcastTrackingFilter nullcastTrackingFilter, - ClientRequestTimeFilter clientRequestTimeFilter, - DeadlineTimeoutStatsFilter deadlineTimeoutStatsFilter, - EarlybirdFeatureSchemaAnnotateFilter featureSchemaAnnotateFilter, - SearchPayloadSizeLocalContextFilter searchPayloadSizeLocalContextFilter, - @Named(ProtectedScatterGatherModule.NAMED_SCATTER_GATHER_SERVICE) - Service scatterGatherService, - StratoAttributionClientIdFilter stratoAttributionClientIdFilter) { - this.allFiltersAndService = - loggingFilter - .andThen(topLevelExceptionHandlingFilter) - .andThen(stratoAttributionClientIdFilter) - .andThen(clientRequestTimeFilter) - .andThen(searchPayloadSizeLocalContextFilter) - .andThen(responseCodeStatFilter) - .andThen(requestResultStatsFilter) - .andThen(validationFilter) - .andThen(mtlsFilter) - .andThen(finagleStatsFilter) - .andThen(clientIdTrackingFilter) - .andThen(quotaFilter) - .andThen(metadataTrackingFilter) - .andThen(initializeFilter) - .andThen(initializeRequestContextFilter) - .andThen(deadlineTimeoutStatsFilter) - .andThen(queryLangStatFilter) - .andThen(nullcastTrackingFilter) - 
.andThen(dropAllProtectedOperatorFilter) - .andThen(queryOperatorStatFilter) - .andThen(preCacheCountFilter) - .andThen(recencyCacheFilter) - .andThen(relevanceCacheFilter) - .andThen(relevanceZeroResultsCacheFilter) - .andThen(strictRecencyCacheFilter) - .andThen(facetsCacheFilter) - .andThen(termStatsCacheFilter) - .andThen(topTweetsCacheFilter) - .andThen(postCacheCountFilter) - .andThen(featureSchemaAnnotateFilter) - .andThen(scatterGatherService); - } - - @Override - public Future getName() { - return Future.value("realtime_cg root"); - } - - @Override - public Future getStatus() { - throw new UnsupportedOperationException("not supported"); - } - - @Override - public Future search(EarlybirdRequest request) { - return allFiltersAndService.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeCgScatterGatherModule.java b/src/java/com/twitter/search/earlybird_root/RealtimeCgScatterGatherModule.java deleted file mode 100644 index a98189d65..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeCgScatterGatherModule.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; - -public class RealtimeCgScatterGatherModule extends ScatterGatherModule { - private static final Logger LOG = - LoggerFactory.getLogger(RealtimeCgScatterGatherModule.class); - - /** - * Provides a scatter gather service for the realtime_cg cluster. 
- */ - @Provides - @Singleton - @Named(NAMED_SCATTER_GATHER_SERVICE) - @Override - public Service provideScatterGatherService( - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - PartitionConfig partitionConfig, - RootClientServiceBuilder rootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_EXP_CLUSTER_CLIENT) - RootClientServiceBuilder - expClusterRootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider) { - - - return - buildScatterOrSplitterService( - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - partitionConfig, - rootClientServiceBuilder, - altPartitionConfig, - altRootClientServiceBuilder, - statsReceiver, - cluster, - decider - ); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeRootAppMain.java b/src/java/com/twitter/search/earlybird_root/RealtimeRootAppMain.java deleted file mode 100644 index 3fa8ccba6..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeRootAppMain.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.Collection; - -import com.google.inject.Module; - -import com.twitter.search.common.root.SearchRootAppMain; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -public class RealtimeRootAppMain extends SearchRootAppMain { - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new RealtimeRootAppMain().main(args); - } - } - - @Override - protected Collection getAdditionalModules() { - return Arrays.asList( - new EarlybirdCommonModule(), - new EarlybirdCacheCommonModule(), - new RealtimeRootAppModule(), - new RealtimeScatterGatherModule()); - } - - @Override - protected Class getSearchRootServerClass() { - return RealtimeRootServer.class; - } - - @Override - protected Class getServiceIfaceClass() { - return EarlybirdService.ServiceIface.class; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeRootAppModule.java b/src/java/com/twitter/search/earlybird_root/RealtimeRootAppModule.java deleted file mode 100644 index 8e2328fb9..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeRootAppModule.java +++ /dev/null @@ -1,151 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Key; -import com.google.inject.Provides; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.memcached.JavaClient; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.ValidationBehavior; -import 
com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.caching.DefaultForcedCacheMissDecider; -import com.twitter.search.earlybird_root.caching.FacetsCache; -import com.twitter.search.earlybird_root.caching.RecencyCache; -import com.twitter.search.earlybird_root.caching.RelevanceCache; -import com.twitter.search.earlybird_root.caching.StrictRecencyCache; -import com.twitter.search.earlybird_root.caching.TermStatsCache; -import com.twitter.search.earlybird_root.caching.TopTweetsCache; -import com.twitter.search.earlybird_root.caching.TopTweetsServicePostProcessor; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RealtimeRootAppModule extends TwitterModule { - private static final long RECENCY_CACHE_TTL_MILLIS = 20000L; - private static final long RELEVANCE_CACHE_TTL_MILLIS = 20000L; - private static final long FACETS_CACHE_TTL_MILLIS = 300000L; - private static final long TERMSTATS_CACHE_TTL_MILLIS = 300000L; - - @Override - public void configure() { - bind(Key.get(EarlybirdCluster.class)).toInstance(EarlybirdCluster.REALTIME); - - bind(EarlybirdServiceScatterGatherSupport.class) - .to(EarlybirdRealtimeScatterGatherSupport.class); - - bind(EarlybirdService.ServiceIface.class).to(RealtimeRootService.class); - } - - @Provides - @Singleton - @RecencyCache - Cache provideRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_recency_root", - serializedKeyPrefix, RECENCY_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @RelevanceCache - Cache provideRelevanceCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_relevance_root", - serializedKeyPrefix, RELEVANCE_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @StrictRecencyCache - Cache provideStrictRecencyCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_strict_recency_root", - serializedKeyPrefix, RECENCY_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @FacetsCache - Cache provideFacetsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - 
@Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_facets_root", - serializedKeyPrefix, FACETS_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TermStatsCache - Cache provideTermStatsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_termstats_root", - serializedKeyPrefix, TERMSTATS_CACHE_TTL_MILLIS, cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - @Singleton - @TopTweetsCache - Cache provideTopTweetsCache( - JavaClient client, - DefaultForcedCacheMissDecider decider, - @Named(SearchRootModule.NAMED_SERIALIZED_KEY_PREFIX) String serializedKeyPrefix, - @Named(SearchRootModule.NAMED_CACHE_KEY_MAX_BYTES) int cacheKeyMaxBytes, - @Named(SearchRootModule.NAMED_CACHE_VALUE_MAX_BYTES) int cacheValueMaxBytes) { - return EarlybirdCacheCommonModule.createCache(client, decider, "realtime_toptweets_root", - serializedKeyPrefix, TopTweetsServicePostProcessor.CACHE_AGE_IN_MS, - cacheKeyMaxBytes, cacheValueMaxBytes); - } - - @Provides - SearchRootWarmup providesSearchRootWarmup( - Clock clock, - WarmupConfig config) { - return new EarlybirdWarmup(clock, config); - } - - @Provides - public LoggingSupport provideLoggingSupport( - SearchDecider decider) { - return new EarlybirdServiceLoggingSupport(decider); - } - - @Provides - public PartitionLoggingSupport providePartitionLoggingSupport() { - return new EarlybirdServicePartitionLoggingSupport(); - } - - @Provides - public ValidationBehavior provideValidationBehavior() { - return new EarlybirdServiceValidationBehavior(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeRootServer.java b/src/java/com/twitter/search/earlybird_root/RealtimeRootServer.java deleted file mode 100644 index 2b4aed336..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeRootServer.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.SearchRootServer; -import com.twitter.search.earlybird.thrift.EarlybirdService; - -@Singleton -public class RealtimeRootServer extends SearchRootServer { - - @Inject - public RealtimeRootServer(RealtimeRootService svc, Service byteSvc) { - super(svc, byteSvc); - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeRootService.java b/src/java/com/twitter/search/earlybird_root/RealtimeRootService.java deleted file mode 100644 index e13379337..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeRootService.java +++ /dev/null @@ -1,129 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Named; -import javax.inject.Singleton; - - -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import com.twitter.search.common.clientstats.FinagleClientStatsFilter; -import com.twitter.search.common.root.LoggingFilter; -import com.twitter.search.common.root.RequestValidationFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; 
-import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird_root.caching.FacetsCacheFilter; -import com.twitter.search.earlybird_root.caching.RecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceCacheFilter; -import com.twitter.search.earlybird_root.caching.RelevanceZeroResultsCacheFilter; -import com.twitter.search.earlybird_root.caching.StrictRecencyCacheFilter; -import com.twitter.search.earlybird_root.caching.TermStatsCacheFilter; -import com.twitter.search.earlybird_root.caching.TopTweetsCacheFilter; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.ClientIdTrackingFilter; -import com.twitter.search.earlybird_root.filters.ClientRequestTimeFilter; -import com.twitter.search.earlybird_root.filters.DeadlineTimeoutStatsFilter; -import com.twitter.search.earlybird_root.filters.DropAllProtectedOperatorFilter; -import com.twitter.search.earlybird_root.filters.EarlybirdFeatureSchemaAnnotateFilter; -import com.twitter.search.earlybird_root.filters.InitializeRequestContextFilter; -import com.twitter.search.earlybird_root.filters.MetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.NullcastTrackingFilter; -import com.twitter.search.earlybird_root.filters.PostCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; -import com.twitter.search.earlybird_root.filters.QueryOperatorStatFilter; -import com.twitter.search.earlybird_root.filters.RequestResultStatsFilter; -import com.twitter.search.earlybird_root.filters.ResponseCodeStatFilter; -import com.twitter.search.earlybird_root.filters.SearchPayloadSizeLocalContextFilter; -import com.twitter.search.earlybird_root.filters.StratoAttributionClientIdFilter; -import com.twitter.search.earlybird_root.filters.TopLevelExceptionHandlingFilter; -import com.twitter.util.Future; - -@Singleton -public class RealtimeRootService implements EarlybirdService.ServiceIface { - - private final Service allFiltersAndService; - - @Inject - public RealtimeRootService( - TopLevelExceptionHandlingFilter topLevelExceptionHandlingFilter, - ResponseCodeStatFilter responseCodeStatFilter, - LoggingFilter loggingFilter, - RequestValidationFilter validationFilter, - MtlsServerSessionTrackerFilter mtlsFilter, - FinagleClientStatsFilter finagleStatsFilter, - InitializeFilter initializeFilter, - InitializeRequestContextFilter initializeRequestContextFilter, - QueryLangStatFilter queryLangStatFilter, - DropAllProtectedOperatorFilter dropAllProtectedOperatorFilter, - QueryOperatorStatFilter queryOperatorStatFilter, - RequestResultStatsFilter requestResultStatsFilter, - PreCacheRequestTypeCountFilter preCacheCountFilter, - RecencyCacheFilter recencyCacheFilter, - RelevanceCacheFilter relevanceCacheFilter, - RelevanceZeroResultsCacheFilter relevanceZeroResultsCacheFilter, - StrictRecencyCacheFilter strictRecencyCacheFilter, - FacetsCacheFilter facetsCacheFilter, - TermStatsCacheFilter termStatsCacheFilter, - TopTweetsCacheFilter topTweetsCacheFilter, - PostCacheRequestTypeCountFilter postCacheCountFilter, - ClientIdTrackingFilter clientIdTrackingFilter, - MetadataTrackingFilter metadataTrackingFilter, - NullcastTrackingFilter nullcastTrackingFilter, - ClientRequestTimeFilter 
clientRequestTimeFilter, - DeadlineTimeoutStatsFilter deadlineTimeoutStatsFilter, - EarlybirdFeatureSchemaAnnotateFilter featureSchemaAnnotateFilter, - SearchPayloadSizeLocalContextFilter searchPayloadSizeLocalContextFilter, - @Named(ProtectedScatterGatherModule.NAMED_SCATTER_GATHER_SERVICE) - Service scatterGatherService, - StratoAttributionClientIdFilter stratoAttributionClientIdFilter) { - this.allFiltersAndService = - loggingFilter - .andThen(topLevelExceptionHandlingFilter) - .andThen(stratoAttributionClientIdFilter) - .andThen(clientRequestTimeFilter) - .andThen(searchPayloadSizeLocalContextFilter) - .andThen(responseCodeStatFilter) - .andThen(requestResultStatsFilter) - .andThen(validationFilter) - .andThen(mtlsFilter) - .andThen(finagleStatsFilter) - .andThen(clientIdTrackingFilter) - .andThen(metadataTrackingFilter) - .andThen(initializeFilter) - .andThen(initializeRequestContextFilter) - .andThen(deadlineTimeoutStatsFilter) - .andThen(queryLangStatFilter) - .andThen(nullcastTrackingFilter) - .andThen(dropAllProtectedOperatorFilter) - .andThen(queryOperatorStatFilter) - .andThen(preCacheCountFilter) - .andThen(recencyCacheFilter) - .andThen(relevanceCacheFilter) - .andThen(relevanceZeroResultsCacheFilter) - .andThen(strictRecencyCacheFilter) - .andThen(facetsCacheFilter) - .andThen(termStatsCacheFilter) - .andThen(topTweetsCacheFilter) - .andThen(postCacheCountFilter) - .andThen(featureSchemaAnnotateFilter) - .andThen(scatterGatherService); - } - - @Override - public Future getName() { - return Future.value("realtime root"); - } - - @Override - public Future getStatus() { - throw new UnsupportedOperationException("not supported"); - } - - @Override - public Future search(EarlybirdRequest request) { - return allFiltersAndService.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RealtimeScatterGatherModule.java b/src/java/com/twitter/search/earlybird_root/RealtimeScatterGatherModule.java deleted file mode 100644 index 16463e4bc..000000000 --- a/src/java/com/twitter/search/earlybird_root/RealtimeScatterGatherModule.java +++ /dev/null @@ -1,118 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.HashMap; -import java.util.Map; -import javax.annotation.Nullable; -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.root.ScatterGatherService; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.ExperimentCluster; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; -import com.twitter.search.earlybird_root.filters.ScatterGatherWithExperimentRedirectsService; - -public class RealtimeScatterGatherModule extends ScatterGatherModule { - private static final Logger LOG = - 
LoggerFactory.getLogger(RealtimeScatterGatherModule.class); - - /** - * Provides a scatter gather service for the realtime cluster that redirects to experimental - * clusters when the experiment cluster parameter is set on the EarlybirdRequest. - * - * Note: if an alternate client is specified via altPartitionConfig or - * altRootClientServiceBuilder, it will be built and used for the "control" cluster, but the - * experiment cluster takes precedence (if the experiment cluster is set on the request, the - * alternate client will never be used). - */ - @Provides - @Singleton - @Named(NAMED_SCATTER_GATHER_SERVICE) - @Override - public Service provideScatterGatherService( - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - PartitionConfig partitionConfig, - RootClientServiceBuilder rootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_EXP_CLUSTER_CLIENT) - RootClientServiceBuilder - expClusterRootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider) { - - - Service controlService = - buildScatterOrSplitterService( - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - partitionConfig, - rootClientServiceBuilder, - altPartitionConfig, - altRootClientServiceBuilder, - statsReceiver, - cluster, - decider - ); - - Map> - experimentScatterGatherServices = new HashMap<>(); - - LOG.info("Using ScatterGatherWithExperimentRedirectsService"); - LOG.info("Control Partition Path: {}", partitionConfig.getPartitionPath()); - - Arrays.stream(ExperimentCluster.values()) - .filter(v -> v.name().toLowerCase().startsWith("exp")) - .forEach(experimentCluster -> { - String expPartitionPath = partitionConfig.getPartitionPath() - + "-" + experimentCluster.name().toLowerCase(); - - LOG.info("Experiment Partition Path: {}", expPartitionPath); - - experimentScatterGatherServices.put(experimentCluster, - createScatterGatherService( - "", - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - partitionConfig.getNumPartitions(), - expPartitionPath, - expClusterRootClientServiceBuilder, - statsReceiver, - cluster, - decider, - experimentCluster.name().toLowerCase())); - }); - - return new ScatterGatherWithExperimentRedirectsService( - controlService, - experimentScatterGatherServices); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/RootResponseClassifier.java b/src/java/com/twitter/search/earlybird_root/RootResponseClassifier.java deleted file mode 100644 index f9578b648..000000000 --- a/src/java/com/twitter/search/earlybird_root/RootResponseClassifier.java +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.search.earlybird_root; - -import scala.PartialFunction; -import scala.runtime.AbstractPartialFunction; - -import com.twitter.finagle.service.ReqRep; -import com.twitter.finagle.service.ResponseClass; -import com.twitter.finagle.service.ResponseClasses; -import com.twitter.finagle.service.ResponseClassifier; -import 
com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.util.Try; - -public class RootResponseClassifier extends AbstractPartialFunction { - private static final PartialFunction DEFAULT_CLASSIFIER = - ResponseClassifier.Default(); - - private static final SearchRateCounter NOT_EARLYBIRD_REQUEST_COUNTER = - SearchRateCounter.export("response_classifier_not_earlybird_request"); - private static final SearchRateCounter NOT_EARLYBIRD_RESPONSE_COUNTER = - SearchRateCounter.export("response_classifier_not_earlybird_response"); - private static final SearchRateCounter NON_RETRYABLE_FAILURE_COUNTER = - SearchRateCounter.export("response_classifier_non_retryable_failure"); - private static final SearchRateCounter RETRYABLE_FAILURE_COUNTER = - SearchRateCounter.export("response_classifier_retryable_failure"); - private static final SearchRateCounter SUCCESS_COUNTER = - SearchRateCounter.export("response_classifier_success"); - - @Override - public boolean isDefinedAt(ReqRep reqRep) { - if (!(reqRep.request() instanceof EarlybirdService.search_args)) { - NOT_EARLYBIRD_REQUEST_COUNTER.increment(); - return false; - } - - if (!reqRep.response().isThrow() && (!(reqRep.response().get() instanceof EarlybirdResponse))) { - NOT_EARLYBIRD_RESPONSE_COUNTER.increment(); - return false; - } - - return true; - } - - @Override - public ResponseClass apply(ReqRep reqRep) { - Try responseTry = reqRep.response(); - if (responseTry.isThrow()) { - return DEFAULT_CLASSIFIER.apply(reqRep); - } - - // isDefinedAt() guarantees that the response is an EarlybirdResponse instance. 
- EarlybirdResponseCode responseCode = ((EarlybirdResponse) responseTry.get()).getResponseCode(); - switch (responseCode) { - case PARTITION_NOT_FOUND: - case PARTITION_DISABLED: - case PERSISTENT_ERROR: - NON_RETRYABLE_FAILURE_COUNTER.increment(); - return ResponseClasses.NON_RETRYABLE_FAILURE; - case TRANSIENT_ERROR: - RETRYABLE_FAILURE_COUNTER.increment(); - return ResponseClasses.RETRYABLE_FAILURE; - default: - SUCCESS_COUNTER.increment(); - return ResponseClasses.SUCCESS; - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/ScatterGatherModule.java b/src/java/com/twitter/search/earlybird_root/ScatterGatherModule.java deleted file mode 100644 index a0bfa07ac..000000000 --- a/src/java/com/twitter/search/earlybird_root/ScatterGatherModule.java +++ /dev/null @@ -1,167 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.ArrayList; -import java.util.List; - -import javax.annotation.Nullable; -import javax.inject.Named; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.PartitionConfig; -import com.twitter.search.common.root.PartitionLoggingSupport; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.root.RootClientServiceBuilder; -import com.twitter.search.common.root.ScatterGatherService; -import com.twitter.search.common.root.SplitterService; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.config.TierConfig; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; - -public abstract class ScatterGatherModule extends TwitterModule { - private static final Logger LOG = LoggerFactory.getLogger(ScatterGatherModule.class); - - private static final String SEARCH_METHOD_NAME = "search"; - protected static final String ALT_TRAFFIC_PERCENTAGE_DECIDER_KEY_PREFIX = - "alt_client_traffic_percentage_"; - static final String NAMED_SCATTER_GATHER_SERVICE = "scatter_gather_service"; - - /** - * Provides the scatterGatherService for single tier Earlybird clusters (Protected and Realtime). 
- */ - public abstract Service provideScatterGatherService( - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - PartitionConfig partitionConfig, - RootClientServiceBuilder rootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_EXP_CLUSTER_CLIENT) - RootClientServiceBuilder - expClusterRootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider); - - protected final Service buildScatterOrSplitterService( - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - PartitionConfig partitionConfig, - RootClientServiceBuilder rootClientServiceBuilder, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable PartitionConfig altPartitionConfig, - @Named(EarlybirdCommonModule.NAMED_ALT_CLIENT) @Nullable - RootClientServiceBuilder altRootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider - ) { - ScatterGatherService scatterGatherService = - createScatterGatherService( - "", - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - partitionConfig.getNumPartitions(), - partitionConfig.getPartitionPath(), - rootClientServiceBuilder, - statsReceiver, - cluster, - decider, - TierConfig.DEFAULT_TIER_NAME); - - if (altPartitionConfig == null || altRootClientServiceBuilder == null) { - LOG.info("altPartitionConfig or altRootClientServiceBuilder is not available, " - + "not using SplitterService"); - return scatterGatherService; - } - - LOG.info("alt client config available, using SplitterService"); - - ScatterGatherService altScatterGatherService = - createScatterGatherService( - "_alt", - scatterGatherSupport, - requestSuccessStats, - partitionLoggingSupport, - requestContextToEarlybirdRequestFilter, - partitionAccessController, - altPartitionConfig.getNumPartitions(), - altPartitionConfig.getPartitionPath(), - altRootClientServiceBuilder, - statsReceiver, - cluster, - decider, - TierConfig.DEFAULT_TIER_NAME); - - return new SplitterService<>( - scatterGatherService, - altScatterGatherService, - decider, - ALT_TRAFFIC_PERCENTAGE_DECIDER_KEY_PREFIX + cluster.getNameForStats()); - } - - protected ScatterGatherService - createScatterGatherService( - String nameSuffix, - EarlybirdServiceScatterGatherSupport scatterGatherSupport, - RequestSuccessStats requestSuccessStats, - PartitionLoggingSupport partitionLoggingSupport, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - PartitionAccessController partitionAccessController, - int numPartitions, - String partitionPath, - RootClientServiceBuilder rootClientServiceBuilder, - StatsReceiver statsReceiver, - EarlybirdCluster cluster, - SearchDecider decider, - String clientName) { - rootClientServiceBuilder.initializeWithPathSuffix(clientName + nameSuffix, - 
numPartitions, - partitionPath); - - ClientBackupFilter backupFilter = - new ClientBackupFilter( - "root_" + cluster.getNameForStats(), - "earlybird" + nameSuffix, - statsReceiver, - decider); - - ClientLatencyFilter clientLatencyFilter = new ClientLatencyFilter("all" + nameSuffix); - - List> services = new ArrayList<>(); - for (Service service - : rootClientServiceBuilder - .safeBuildServiceList(SEARCH_METHOD_NAME)) { - services.add(requestContextToEarlybirdRequestFilter - .andThen(backupFilter) - .andThen(clientLatencyFilter) - .andThen(service)); - } - services = SkipPartitionFilter.wrapServices(TierConfig.DEFAULT_TIER_NAME, services, - partitionAccessController); - - return new ScatterGatherService<>( - scatterGatherSupport, - services, - requestSuccessStats, - partitionLoggingSupport); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SkipPartitionFilter.java b/src/java/com/twitter/search/earlybird_root/SkipPartitionFilter.java deleted file mode 100644 index 1d9feac4b..000000000 --- a/src/java/com/twitter/search/earlybird_root/SkipPartitionFilter.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.ArrayList; -import java.util.List; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Future; - -/** - * Filter that returns a PARTITION_SKIPPED response instead of sending the request to a partition - * if the partition PartitionAccessController says its disabled for a request. 
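
The `buildScatterOrSplitterService` method above wraps the primary and alt scatter-gather services in a `SplitterService` keyed on `alt_client_traffic_percentage_<cluster>`, diverting a slice of traffic to the alt client. The `SplitterService` implementation itself is not part of this diff; the sketch below shows the general idea only, with the percentage coming from a hypothetical `IntSupplier` rather than the real decider API.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntSupplier;

import com.twitter.finagle.Service;
import com.twitter.util.Future;

/**
 * Minimal sketch of percentage-based traffic splitting between a primary and an
 * alternate downstream service. Illustrative only; the real SplitterService lives
 * in search/common/root and reads its percentage from a decider key.
 */
final class TrafficSplitterSketch<Req, Rep> extends Service<Req, Rep> {
  private final Service<Req, Rep> primary;
  private final Service<Req, Rep> alternate;
  private final IntSupplier altPercentage; // 0..100, e.g. backed by a decider

  TrafficSplitterSketch(Service<Req, Rep> primary,
                        Service<Req, Rep> alternate,
                        IntSupplier altPercentage) {
    this.primary = primary;
    this.alternate = alternate;
    this.altPercentage = altPercentage;
  }

  @Override
  public Future<Rep> apply(Req request) {
    // Divert roughly altPercentage% of requests to the alternate client.
    boolean useAlt = ThreadLocalRandom.current().nextInt(100) < altPercentage.getAsInt();
    return (useAlt ? alternate : primary).apply(request);
  }
}
```
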
- */ -public final class SkipPartitionFilter extends - SimpleFilter { - - private static final Logger LOG = LoggerFactory.getLogger(SkipPartitionFilter.class); - - private final String tierName; - private final int partitionNum; - private final PartitionAccessController controller; - - private SkipPartitionFilter(String tierName, int partitionNum, - PartitionAccessController controller) { - this.tierName = tierName; - this.partitionNum = partitionNum; - this.controller = controller; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - EarlybirdRequest request = requestContext.getRequest(); - if (!controller.canAccessPartition(tierName, partitionNum, request.getClientId(), - EarlybirdRequestType.of(request))) { - return Future.value(EarlybirdServiceScatterGatherSupport.newEmptyResponse()); - } - - return service.apply(requestContext); - } - - /** - * Wrap the services with a SkipPartitionFilter - */ - public static List> wrapServices( - String tierName, - List> clients, - PartitionAccessController controller) { - - LOG.info("Creating SkipPartitionFilters for cluster: {}, tier: {}, partitions 0-{}", - controller.getClusterName(), tierName, clients.size() - 1); - - List> wrappedServices = new ArrayList<>(); - for (int partitionNum = 0; partitionNum < clients.size(); partitionNum++) { - SkipPartitionFilter filter = new SkipPartitionFilter(tierName, partitionNum, controller); - wrappedServices.add(filter.andThen(clients.get(partitionNum))); - } - - return wrappedServices; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SuperRootAppMain.java b/src/java/com/twitter/search/earlybird_root/SuperRootAppMain.java deleted file mode 100644 index 26ab5e5bb..000000000 --- a/src/java/com/twitter/search/earlybird_root/SuperRootAppMain.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Arrays; -import java.util.Collection; - -import com.google.inject.Module; - -import com.twitter.search.common.root.SearchRootAppMain; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.routers.FacetsRequestRouterModule; -import com.twitter.search.earlybird_root.routers.RecencyRequestRouterModule; -import com.twitter.search.earlybird_root.routers.RelevanceRequestRouterModule; -import com.twitter.search.earlybird_root.routers.TermStatsRequestRouterModule; -import com.twitter.search.earlybird_root.routers.TopTweetsRequestRouterModule; - -public class SuperRootAppMain extends SearchRootAppMain { - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new SuperRootAppMain().main(args); - } - } - - @Override - protected Collection getAdditionalModules() { - return Arrays.asList( - new EarlybirdCommonModule(), - new SuperRootAppModule(), - new TermStatsRequestRouterModule(), - new RecencyRequestRouterModule(), - new RelevanceRequestRouterModule(), - new TopTweetsRequestRouterModule(), - new FacetsRequestRouterModule(), - new QuotaModule()); - } - - @Override - protected Class getSearchRootServerClass() { - return SuperRootServer.class; - } - - @Override - protected Class getServiceIfaceClass() { - return EarlybirdService.ServiceIface.class; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SuperRootAppModule.java b/src/java/com/twitter/search/earlybird_root/SuperRootAppModule.java deleted file mode 100644 index 8c36f3aa2..000000000 --- 
a/src/java/com/twitter/search/earlybird_root/SuperRootAppModule.java +++ /dev/null @@ -1,234 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Key; -import com.google.inject.Provides; -import com.google.inject.util.Providers; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersionConfig; -import com.twitter.finagle.Name; -import com.twitter.finagle.Service; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.config.SearchPenguinVersionsConfig; -import com.twitter.search.common.dark.ResolverProxy; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.LoggingSupport; -import com.twitter.search.common.root.RemoteClientBuilder; -import com.twitter.search.common.root.SearchRootWarmup; -import com.twitter.search.common.root.ValidationBehavior; -import com.twitter.search.common.root.WarmupConfig; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdClusterAvailableFilter; -import com.twitter.search.earlybird_root.filters.MarkTweetSourceFilter; -import com.twitter.search.earlybird_root.filters.RequestContextToEarlybirdRequestFilter; -import com.twitter.search.earlybird_root.filters.RequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.ServiceExceptionHandlingFilter; -import com.twitter.search.earlybird_root.filters.ServiceResponseValidationFilter; -import com.twitter.search.earlybird_root.filters.UnsetSuperRootFieldsFilter; -import com.twitter.util.Future; - -public class SuperRootAppModule extends TwitterModule { - private final Flag rootRealtimeFlag = createFlag( - "root-realtime", - "", - "Override the path to root-realtime", - Flaggable.ofString()); - private final Flag rootProtectedFlag = createFlag( - "root-protected", - "", - "Override the path to root-protected", - Flaggable.ofString()); - private final Flag rootArchiveFullFlag = createFlag( - "root-archive-full", - "", - "Override the path to root-archive-full", - Flaggable.ofString()); - private final Flag penguinVersionsFlag = createMandatoryFlag( - "penguin_versions", - "Penguin versions to be tokenized", - "", - Flaggable.ofString()); - - @Override - public void configure() { - // SuperRoot uses all clusters, not just one. We bind EarlybirdCluster to null to indicate that - // there is not one specific cluster to use. 
- bind(Key.get(EarlybirdCluster.class)).toProvider(Providers.of(null)); - - bind(EarlybirdService.ServiceIface.class).to(SuperRootService.class); - } - - @Provides - SearchRootWarmup providesSearchRootWarmup( - Clock clock, - WarmupConfig config) { - return new EarlybirdWarmup(clock, config); - } - - @Provides - @Singleton - @Named(InjectionNames.REALTIME) - private EarlybirdService.ServiceIface providesRealtimeIface( - RemoteClientBuilder builder, - ResolverProxy proxy) throws Exception { - Name name = proxy.resolve(rootRealtimeFlag.apply()); - return builder.createRemoteClient(name, "realtime", "realtime_"); - } - - @Provides - @Singleton - @Named(InjectionNames.REALTIME) - private Service providesRealtimeService( - @Named(InjectionNames.REALTIME) - EarlybirdService.ServiceIface realtimeServiceIface, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - StatsReceiver statsReceiver, - SearchDecider decider) { - return buildClientService( - realtimeServiceIface, - new EarlybirdClusterAvailableFilter(decider, EarlybirdCluster.REALTIME), - new MarkTweetSourceFilter(ThriftTweetSource.REALTIME_CLUSTER), - new ServiceExceptionHandlingFilter(EarlybirdCluster.REALTIME), - new ServiceResponseValidationFilter(EarlybirdCluster.REALTIME), - new RequestTypeCountFilter(EarlybirdCluster.REALTIME.getNameForStats()), - requestContextToEarlybirdRequestFilter, - new UnsetSuperRootFieldsFilter(), - new ClientLatencyFilter(EarlybirdCluster.REALTIME.getNameForStats())); - } - - @Provides - @Singleton - @Named(InjectionNames.FULL_ARCHIVE) - private EarlybirdService.ServiceIface providesFullArchiveIface( - RemoteClientBuilder builder, - ResolverProxy proxy) throws Exception { - Name name = proxy.resolve(rootArchiveFullFlag.apply()); - return builder.createRemoteClient(name, "fullarchive", "full_archive_"); - } - - @Provides - @Singleton - @Named(InjectionNames.FULL_ARCHIVE) - private Service providesFullArchiveService( - @Named(InjectionNames.FULL_ARCHIVE) - EarlybirdService.ServiceIface fullArchiveServiceIface, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - StatsReceiver statsReceiver, - SearchDecider decider) { - return buildClientService( - fullArchiveServiceIface, - new EarlybirdClusterAvailableFilter(decider, EarlybirdCluster.FULL_ARCHIVE), - new MarkTweetSourceFilter(ThriftTweetSource.FULL_ARCHIVE_CLUSTER), - new ServiceExceptionHandlingFilter(EarlybirdCluster.FULL_ARCHIVE), - new ServiceResponseValidationFilter(EarlybirdCluster.FULL_ARCHIVE), - new RequestTypeCountFilter(EarlybirdCluster.FULL_ARCHIVE.getNameForStats()), - requestContextToEarlybirdRequestFilter, - // Disable unset followedUserIds for archive since archive earlybirds rely on this field - // to rewrite query to include protected Tweets - new UnsetSuperRootFieldsFilter(false), - new ClientLatencyFilter(EarlybirdCluster.FULL_ARCHIVE.getNameForStats())); - } - - @Provides - @Singleton - @Named(InjectionNames.PROTECTED) - private EarlybirdService.ServiceIface providesProtectedIface( - RemoteClientBuilder builder, - ResolverProxy proxy) throws Exception { - Name name = proxy.resolve(rootProtectedFlag.apply()); - return builder.createRemoteClient(name, "protected", "protected_"); - } - - @Provides - @Singleton - @Named(InjectionNames.PROTECTED) - private Service providesProtectedService( - @Named(InjectionNames.PROTECTED) - EarlybirdService.ServiceIface protectedServiceIface, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - StatsReceiver 
statsReceiver, - SearchDecider decider) { - return buildClientService( - protectedServiceIface, - new EarlybirdClusterAvailableFilter(decider, EarlybirdCluster.PROTECTED), - new MarkTweetSourceFilter(ThriftTweetSource.REALTIME_PROTECTED_CLUSTER), - new ServiceExceptionHandlingFilter(EarlybirdCluster.PROTECTED), - new ServiceResponseValidationFilter(EarlybirdCluster.PROTECTED), - new RequestTypeCountFilter(EarlybirdCluster.PROTECTED.getNameForStats()), - requestContextToEarlybirdRequestFilter, - new UnsetSuperRootFieldsFilter(), - new ClientLatencyFilter(EarlybirdCluster.PROTECTED.getNameForStats())); - } - - /** - * Builds a Finagle Service based on a EarlybirdService.ServiceIface. - */ - private Service buildClientService( - final EarlybirdService.ServiceIface serviceIface, - EarlybirdClusterAvailableFilter earlybirdClusterAvailableFilter, - MarkTweetSourceFilter markTweetSourceFilter, - ServiceExceptionHandlingFilter serviceExceptionHandlingFilter, - ServiceResponseValidationFilter serviceResponseValidationFilter, - RequestTypeCountFilter requestTypeCountFilter, - RequestContextToEarlybirdRequestFilter requestContextToEarlybirdRequestFilter, - UnsetSuperRootFieldsFilter unsetSuperRootFieldsFilter, - ClientLatencyFilter latencyFilter) { - Service service = - new Service() { - - @Override - public Future apply(EarlybirdRequest requestContext) { - return serviceIface.search(requestContext); - } - }; - - // We should apply ServiceResponseValidationFilter first, to validate the response. - // Then, if the response is valid, we should tag all results with the appropriate tweet source. - // ServiceExceptionHandlingFilter should come last, to catch all possible exceptions (that were - // thrown by the service, or by ServiceResponseValidationFilter and MarkTweetSourceFilter). - // - // But before we do all of this, we should apply the EarlybirdClusterAvailableFilter to see if - // we even need to send the request to this cluster. - return earlybirdClusterAvailableFilter - .andThen(serviceExceptionHandlingFilter) - .andThen(markTweetSourceFilter) - .andThen(serviceResponseValidationFilter) - .andThen(requestTypeCountFilter) - .andThen(requestContextToEarlybirdRequestFilter) - .andThen(latencyFilter) - .andThen(unsetSuperRootFieldsFilter) - .andThen(service); - } - - @Provides - public LoggingSupport provideLoggingSupport( - SearchDecider decider) { - return new EarlybirdServiceLoggingSupport(decider); - } - - @Provides - public ValidationBehavior provideValidationBehavior() { - return new EarlybirdServiceValidationBehavior(); - } - - /** - * Provides the penguin versions that we should use to retokenize the query if requested. 
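
The filter chains assembled in `buildClientService` above (and in `SuperRootService` further down) rely on Finagle's composition rule: in `a.andThen(b).andThen(service)`, `a` is outermost, so it sees the request first and the response last. A minimal, self-contained illustration with hypothetical tagging filters:

```java
import com.twitter.finagle.Service;
import com.twitter.finagle.SimpleFilter;
import com.twitter.util.Await;
import com.twitter.util.Future;

/** Hypothetical filter that tags the request so the composition order is visible. */
final class TagFilter extends SimpleFilter<String, String> {
  private final String tag;

  TagFilter(String tag) {
    this.tag = tag;
  }

  @Override
  public Future<String> apply(String request, Service<String, String> next) {
    // Each filter appends its tag before delegating to the next stage.
    return next.apply(request + " >" + tag);
  }
}

final class FilterOrderDemo {
  public static void main(String[] args) throws Exception {
    Service<String, String> leaf = new Service<String, String>() {
      @Override
      public Future<String> apply(String request) {
        return Future.value("handled(" + request + ")");
      }
    };

    // The first filter in the andThen chain is outermost: it sees the request first.
    Service<String, String> composed =
        new TagFilter("outer").andThen(new TagFilter("inner")).andThen(leaf);

    System.out.println(Await.result(composed.apply("req"))); // handled(req >outer >inner)
  }
}
```

This ordering is why, in the chain built in `buildClientService`, `ServiceExceptionHandlingFilter` sits near the outside: exceptions thrown by the inner validation and tagging filters pass through it on the way back out.
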
- */ - @Provides - @Singleton - public PenguinVersionConfig providePenguinVersions() { - return SearchPenguinVersionsConfig.deserialize(penguinVersionsFlag.apply()); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SuperRootRequestTypeRouter.java b/src/java/com/twitter/search/earlybird_root/SuperRootRequestTypeRouter.java deleted file mode 100644 index 86e84a5c3..000000000 --- a/src/java/com/twitter/search/earlybird_root/SuperRootRequestTypeRouter.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.earlybird_root; - -import java.util.Map; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import com.twitter.finagle.Service; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.ClientErrorException; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.earlybird_root.routers.FacetsRequestRouter; -import com.twitter.search.earlybird_root.routers.RecencyRequestRouter; -import com.twitter.search.earlybird_root.routers.RelevanceRequestRouter; -import com.twitter.search.earlybird_root.routers.RequestRouter; -import com.twitter.search.earlybird_root.routers.TermStatsRequestRouter; -import com.twitter.search.earlybird_root.routers.TopTweetsRequestRouter; -import com.twitter.util.Future; - -@Singleton -public class SuperRootRequestTypeRouter - extends Service { - - private final Map routingMap; - - /** - * constructor - */ - @Inject - public SuperRootRequestTypeRouter( - FacetsRequestRouter facetsRequestRouter, - TermStatsRequestRouter termStatsRequestRouter, - TopTweetsRequestRouter topTweetsRequestRouter, - RecencyRequestRouter recencyRequestRouter, - RelevanceRequestRouter relevanceRequestRouter - ) { - routingMap = Maps.immutableEnumMap( - ImmutableMap.builder() - .put(EarlybirdRequestType.FACETS, facetsRequestRouter) - .put(EarlybirdRequestType.TERM_STATS, termStatsRequestRouter) - .put(EarlybirdRequestType.TOP_TWEETS, topTweetsRequestRouter) - .put(EarlybirdRequestType.RECENCY, recencyRequestRouter) - .put(EarlybirdRequestType.STRICT_RECENCY, recencyRequestRouter) - .put(EarlybirdRequestType.RELEVANCE, relevanceRequestRouter) - .build()); - } - - @Override - public Future apply(EarlybirdRequestContext requestContext) { - EarlybirdRequest request = requestContext.getRequest(); - if (request.getSearchQuery() == null) { - return Future.exception(new ClientErrorException( - "Client must fill in search Query object in request")); - } - - EarlybirdRequestType requestType = requestContext.getEarlybirdRequestType(); - - if (routingMap.containsKey(requestType)) { - RequestRouter router = routingMap.get(requestType); - return router.route(requestContext); - } else { - return Future.exception( - new ClientErrorException( - "Request type " + requestType + " is unsupported. 
" - + "Sorry this api is a bit hard to use.\n" - + "for facets, call earlybirdRequest.setFacetsRequest\n" - + "for termstats, call earluybirdRequest.setTermStatisticsRequest\n" - + "for recency, strict recency, relevance or toptweets,\n" - + " call req.setSearchQuery() and req.getSearchQuery().setRankingMode()\n" - + " with the correct ranking mode and for strict recency call\n" - + " earlybirdRequest.setQuerySource(ThriftQuerySource.GNIP)\n")); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SuperRootServer.java b/src/java/com/twitter/search/earlybird_root/SuperRootServer.java deleted file mode 100644 index e1e9ba266..000000000 --- a/src/java/com/twitter/search/earlybird_root/SuperRootServer.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.SearchRootServer; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird_root.filters.QueryTokenizerFilter; -import com.twitter.search.queryparser.query.QueryParserException; - -@Singleton -public class SuperRootServer extends SearchRootServer { - private final QueryTokenizerFilter queryTokenizerFilter; - - @Inject - public SuperRootServer( - SuperRootService svc, - Service byteSvc, - QueryTokenizerFilter queryTokenizerFilter) { - super(svc, byteSvc); - - this.queryTokenizerFilter = queryTokenizerFilter; - } - - @Override - public void warmup() { - super.warmup(); - - try { - queryTokenizerFilter.performExpensiveInitialization(); - } catch (QueryParserException e) { - throw new RuntimeException(e); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/SuperRootService.java b/src/java/com/twitter/search/earlybird_root/SuperRootService.java deleted file mode 100644 index c11052c35..000000000 --- a/src/java/com/twitter/search/earlybird_root/SuperRootService.java +++ /dev/null @@ -1,121 +0,0 @@ -package com.twitter.search.earlybird_root; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.finagle.mtls.authorization.server.MtlsServerSessionTrackerFilter; -import com.twitter.search.common.clientstats.FinagleClientStatsFilter; -import com.twitter.search.common.root.LoggingFilter; -import com.twitter.search.common.root.RequestValidationFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdService; -import com.twitter.search.earlybird.thrift.EarlybirdStatusResponse; -import com.twitter.search.earlybird_root.filters.ClientIdArchiveAccessFilter; -import com.twitter.search.earlybird_root.filters.ClientIdQuotaFilter; -import com.twitter.search.earlybird_root.filters.ClientIdTrackingFilter; -import com.twitter.search.earlybird_root.filters.ClientRequestTimeFilter; -import com.twitter.search.earlybird_root.filters.DeadlineTimeoutStatsFilter; -import com.twitter.search.earlybird_root.filters.DisableClientByTierFilter; -import com.twitter.search.earlybird_root.filters.EarlybirdFeatureSchemaAnnotateFilter; -import com.twitter.search.earlybird_root.filters.InitializeRequestContextFilter; -import com.twitter.search.earlybird_root.filters.MetadataTrackingFilter; -import com.twitter.search.earlybird_root.filters.NamedMultiTermDisjunctionStatsFilter; -import com.twitter.search.earlybird_root.filters.NullcastTrackingFilter; -import 
com.twitter.search.earlybird_root.filters.PreCacheRequestTypeCountFilter; -import com.twitter.search.earlybird_root.filters.QueryLangStatFilter; -import com.twitter.search.earlybird_root.filters.QueryOperatorStatFilter; -import com.twitter.search.earlybird_root.filters.QueryTokenizerFilter; -import com.twitter.search.earlybird_root.filters.RequestResultStatsFilter; -import com.twitter.search.earlybird_root.filters.RequestSuccessStatsFilter; -import com.twitter.search.earlybird_root.filters.ResponseCodeStatFilter; -import com.twitter.search.earlybird_root.filters.SearchPayloadSizeLocalContextFilter; -import com.twitter.search.earlybird_root.filters.RejectRequestsByQuerySourceFilter; -import com.twitter.search.earlybird_root.filters.StratoAttributionClientIdFilter; -import com.twitter.search.earlybird_root.filters.TopLevelExceptionHandlingFilter; -import com.twitter.search.earlybird_root.filters.VeryRecentTweetsFilter; -import com.twitter.util.Future; - -@Singleton -class SuperRootService implements EarlybirdService.ServiceIface { - private final Service fullSearchMethod; - - @Inject - public SuperRootService( - TopLevelExceptionHandlingFilter topLevelExceptionHandlingFilter, - ResponseCodeStatFilter responseCodeStatFilter, - LoggingFilter loggingFilter, - NamedMultiTermDisjunctionStatsFilter namedMultiTermDisjunctionStatsFilter, - RequestValidationFilter validationFilter, - MtlsServerSessionTrackerFilter mtlsFilter, - FinagleClientStatsFilter finagleStatsFilter, - InitializeFilter initializeFilter, - InitializeRequestContextFilter initializeRequestContextFilter, - QueryLangStatFilter queryLangStatFilter, - QueryOperatorStatFilter queryOperatorStatFilter, - RequestResultStatsFilter requestResultStatsFilter, - PreCacheRequestTypeCountFilter preCacheRequestTypeCountFilter, - ClientIdArchiveAccessFilter clientIdArchiveAccessFilter, - DisableClientByTierFilter disableClientByTierFilter, - ClientIdTrackingFilter clientIdTrackingFilter, - ClientIdQuotaFilter quotaFilter, - RejectRequestsByQuerySourceFilter rejectRequestsByQuerySourceFilter, - MetadataTrackingFilter metadataTrackingFilter, - VeryRecentTweetsFilter veryRecentTweetsFilter, - RequestSuccessStatsFilter requestSuccessStatsFilter, - NullcastTrackingFilter nullcastTrackingFilter, - QueryTokenizerFilter queryTokenizerFilter, - ClientRequestTimeFilter clientRequestTimeFilter, - DeadlineTimeoutStatsFilter deadlineTimeoutStatsFilter, - SuperRootRequestTypeRouter superRootSearchService, - EarlybirdFeatureSchemaAnnotateFilter featureSchemaAnnotateFilter, - SearchPayloadSizeLocalContextFilter searchPayloadSizeLocalContextFilter, - StratoAttributionClientIdFilter stratoAttributionClientIdFilter) { - this.fullSearchMethod = - loggingFilter - .andThen(topLevelExceptionHandlingFilter) - .andThen(stratoAttributionClientIdFilter) - .andThen(clientRequestTimeFilter) - .andThen(searchPayloadSizeLocalContextFilter) - .andThen(requestSuccessStatsFilter) - .andThen(requestResultStatsFilter) - .andThen(responseCodeStatFilter) - .andThen(validationFilter) - .andThen(mtlsFilter) - .andThen(finagleStatsFilter) - .andThen(disableClientByTierFilter) - .andThen(clientIdTrackingFilter) - .andThen(quotaFilter) - .andThen(clientIdArchiveAccessFilter) - .andThen(rejectRequestsByQuerySourceFilter) - .andThen(namedMultiTermDisjunctionStatsFilter) - .andThen(metadataTrackingFilter) - .andThen(veryRecentTweetsFilter) - .andThen(initializeFilter) - .andThen(initializeRequestContextFilter) - .andThen(deadlineTimeoutStatsFilter) - .andThen(queryLangStatFilter) - 
.andThen(nullcastTrackingFilter) - .andThen(queryOperatorStatFilter) - .andThen(preCacheRequestTypeCountFilter) - .andThen(queryTokenizerFilter) - .andThen(featureSchemaAnnotateFilter) - .andThen(superRootSearchService); - } - - @Override - public Future getName() { - return Future.value("superroot"); - } - - @Override - public Future getStatus() { - throw new UnsupportedOperationException("not supported"); - } - - @Override - public Future search(EarlybirdRequest request) { - return fullSearchMethod.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/BUILD b/src/java/com/twitter/search/earlybird_root/caching/BUILD deleted file mode 100644 index 9ea5a4041..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/BUILD +++ /dev/null @@ -1,20 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/slf4j:slf4j-api", - "finatra/inject/inject-core/src/main/scala", - "src/java/com/twitter/search/common/caching", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/root", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird_root/common", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/caching/CacheCommonUtil.java b/src/java/com/twitter/search/earlybird_root/caching/CacheCommonUtil.java deleted file mode 100644 index 6cadf565b..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/CacheCommonUtil.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public final class CacheCommonUtil { - public static final String NAMED_MAX_CACHE_RESULTS = "maxCacheResults"; - - private CacheCommonUtil() { - } - - public static boolean hasResults(EarlybirdResponse response) { - return response.isSetSearchResults() - && (response.getSearchResults().getResults() != null) - && !response.getSearchResults().getResults().isEmpty(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/CacheStats.java b/src/java/com/twitter/search/earlybird_root/caching/CacheStats.java deleted file mode 100644 index 2c0896e0b..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/CacheStats.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.metrics.SearchRateCounter; - -public final class CacheStats { - public static final SearchRateCounter REQUEST_FAILED_COUNTER = - SearchRateCounter.export("memcache_request_failed"); - public static final SearchRateCounter REQUEST_TIMEOUT_COUNTER = - SearchRateCounter.export("memcache_request_timeout"); - - private CacheStats() { - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/DefaultForcedCacheMissDecider.java b/src/java/com/twitter/search/earlybird_root/caching/DefaultForcedCacheMissDecider.java deleted file mode 100644 index fc14e2203..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/DefaultForcedCacheMissDecider.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; - -import 
com.twitter.common.base.Supplier; -import com.twitter.search.common.decider.SearchDecider; - -/** - * A cache miss decider backed by a decider key. - */ -public class DefaultForcedCacheMissDecider implements Supplier { - private static final String DECIDER_KEY = "default_forced_cache_miss_rate"; - private final SearchDecider decider; - - @Inject - public DefaultForcedCacheMissDecider(SearchDecider decider) { - this.decider = decider; - } - - @Override - public Boolean get() { - return decider.isAvailable(DECIDER_KEY); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/EarlybirdCachePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/EarlybirdCachePostProcessor.java deleted file mode 100644 index 04cd08e30..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/EarlybirdCachePostProcessor.java +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.filter.CachePostProcessor; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class EarlybirdCachePostProcessor - extends CachePostProcessor { - - @Override - public final void recordCacheHit(EarlybirdResponse response) { - response.setCacheHit(true); - } - - @Override - public Optional processCacheResponse(EarlybirdRequestContext originalRequest, - EarlybirdResponse cacheResponse) { - return Optional.of(cacheResponse); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/EarlybirdRequestPerClientCacheStats.java b/src/java/com/twitter/search/earlybird_root/caching/EarlybirdRequestPerClientCacheStats.java deleted file mode 100644 index 2b8d96179..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/EarlybirdRequestPerClientCacheStats.java +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; - -import com.twitter.search.common.caching.filter.PerClientCacheStats; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class EarlybirdRequestPerClientCacheStats - extends PerClientCacheStats { - - private String cacheOffByClientStatFormat; - private final Map cacheTurnedOffByClient; - - private String cacheHitsByClientStatFormat; - private final Map cacheHitsByClient; - - public EarlybirdRequestPerClientCacheStats(String cacheRequestType) { - this.cacheOffByClientStatFormat = - cacheRequestType + "_client_id_%s_cache_turned_off_in_request"; - this.cacheTurnedOffByClient = new ConcurrentHashMap<>(); - - this.cacheHitsByClientStatFormat = cacheRequestType + "_client_id_%s_cache_hit_total"; - this.cacheHitsByClient = new ConcurrentHashMap<>(); - } - - @Override - public void recordRequest(EarlybirdRequestContext requestContext) { - if (!EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest())) { - String client = requestContext.getRequest().getClientId(); - SearchRateCounter counter = cacheTurnedOffByClient.computeIfAbsent(client, - cl -> SearchRateCounter.export(String.format(cacheOffByClientStatFormat, cl))); - counter.increment(); - } - } - - @Override - public void recordCacheHit(EarlybirdRequestContext requestContext) { - String client = requestContext.getRequest().getClientId(); - SearchRateCounter 
counter = cacheHitsByClient.computeIfAbsent(client, - cl -> SearchRateCounter.export(String.format(cacheHitsByClientStatFormat, cl))); - counter.increment(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/FacetsCache.java b/src/java/com/twitter/search/earlybird_root/caching/FacetsCache.java deleted file mode 100644 index 84b30502e..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/FacetsCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface FacetsCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheFilter.java deleted file mode 100644 index a06b2eda0..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheFilter.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class FacetsCacheFilter extends - CacheFilter { - /** - * Constructs a new cache filter for facet requests. 
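
`FacetsCacheFilter` above, like the recency and relevance cache filters below, hands the same kinds of collaborators to `CacheFilter`: a `QueryCachePredicate` (should this request consult the cache?), a `CacheRequestNormalizer` (turn the request into a cache key), a `CachePostProcessor` (adjust a cache hit), a `ServicePostProcessor` (decide what to write back after a miss), and per-client stats. `CacheFilter` itself lives under `search/common/caching` and is not shown in this diff; the following is only a rough sketch of how such pieces typically compose in a read-through cache, using simplified stand-in interfaces rather than the real ones.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical, simplified read-through cache flow; not the real CacheFilter. */
final class ReadThroughCacheSketch<Req, Rep> {
  private final Predicate<Req> shouldQueryCache;            // ~ QueryCachePredicate
  private final Function<Req, Optional<String>> normalize;  // ~ CacheRequestNormalizer
  private final Function<Rep, Optional<Rep>> onCacheHit;    // ~ CachePostProcessor
  private final Predicate<Rep> shouldCacheResponse;         // ~ ServicePostProcessor
  private final Map<String, Rep> cache = new ConcurrentHashMap<>();

  ReadThroughCacheSketch(Predicate<Req> shouldQueryCache,
                         Function<Req, Optional<String>> normalize,
                         Function<Rep, Optional<Rep>> onCacheHit,
                         Predicate<Rep> shouldCacheResponse) {
    this.shouldQueryCache = shouldQueryCache;
    this.normalize = normalize;
    this.onCacheHit = onCacheHit;
    this.shouldCacheResponse = shouldCacheResponse;
  }

  Rep apply(Req request, Function<Req, Rep> backend) {
    if (!shouldQueryCache.test(request)) {
      return backend.apply(request);        // decider off or caching not allowed
    }
    Optional<String> key = normalize.apply(request);
    if (!key.isPresent()) {
      return backend.apply(request);        // request cannot be turned into a cache key
    }
    Rep cached = cache.get(key.get());
    if (cached != null) {
      Optional<Rep> usable = onCacheHit.apply(cached);
      if (usable.isPresent()) {
        return usable.get();                // serve the (post-processed) cache hit
      }
    }
    Rep fresh = backend.apply(request);     // cache miss: call the service
    if (shouldCacheResponse.test(fresh)) {
      cache.put(key.get(), fresh);          // write back for later requests
    }
    return fresh;
  }
}
```
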
- */ - @Inject - public FacetsCacheFilter( - @FacetsCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - super(cache, - new FacetsQueryCachePredicate(decider, normalizedSearchRootName), - new FacetsCacheRequestNormalizer(), - new EarlybirdCachePostProcessor(), - new FacetsServicePostProcessor(cache), - new EarlybirdRequestPerClientCacheStats(EarlybirdRequestType.FACETS.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheRequestNormalizer.java deleted file mode 100644 index b89be8ad3..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/FacetsCacheRequestNormalizer.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.FacetsCacheUtil; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class FacetsCacheRequestNormalizer extends - CacheRequestNormalizer { - - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - return Optional.fromNullable(FacetsCacheUtil.normalizeRequestForCache( - requestContext.getRequest())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/FacetsQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/FacetsQueryCachePredicate.java deleted file mode 100644 index c7cb5c454..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/FacetsQueryCachePredicate.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class FacetsQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String facetsCacheEnabledDeciderKey; - - public FacetsQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.facetsCacheEnabledDeciderKey = "facets_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.FACETS == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(facetsCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/FacetsServicePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/FacetsServicePostProcessor.java deleted file mode 100644 index 74984a757..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/FacetsServicePostProcessor.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.FacetsCacheUtil; -import com.twitter.search.common.caching.filter.ServicePostProcessor; -import 
com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class FacetsServicePostProcessor - extends ServicePostProcessor { - - private final Cache cache; - - public FacetsServicePostProcessor(Cache cache) { - this.cache = cache; - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - FacetsCacheUtil.cacheResults(requestContext.getRequest(), serviceResponse, cache); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyAndRelevanceCachePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyAndRelevanceCachePostProcessor.java deleted file mode 100644 index eb0752286..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyAndRelevanceCachePostProcessor.java +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.util.IdTimeRanges; - -public class RecencyAndRelevanceCachePostProcessor extends EarlybirdCachePostProcessor { - - private static final Logger LOG = - LoggerFactory.getLogger(RecencyAndRelevanceCachePostProcessor.class); - - protected Optional postProcessCacheResponse( - EarlybirdRequest earlybirdRequest, - EarlybirdResponse earlybirdResponse, long sinceID, long maxID) { - return CacheUtil.postProcessCacheResult( - earlybirdRequest, earlybirdResponse, sinceID, maxID); - } - - @Override - public final Optional processCacheResponse( - EarlybirdRequestContext requestContext, - EarlybirdResponse cacheResponse) { - EarlybirdRequest originalRequest = requestContext.getRequest(); - Preconditions.checkArgument(originalRequest.isSetSearchQuery()); - - IdTimeRanges ranges; - Query query = requestContext.getParsedQuery(); - if (query != null) { - try { - ranges = IdTimeRanges.fromQuery(query); - } catch (QueryParserException e) { - LOG.error( - "Exception when parsing since and max IDs. 
Request: {} Response: {}", - originalRequest, - cacheResponse, - e); - return Optional.absent(); - } - } else { - ranges = null; - } - - Optional sinceID; - Optional maxID; - if (ranges != null) { - sinceID = ranges.getSinceIDExclusive(); - maxID = ranges.getMaxIDInclusive(); - } else { - sinceID = Optional.absent(); - maxID = Optional.absent(); - } - - return postProcessCacheResponse( - originalRequest, cacheResponse, sinceID.or(0L), maxID.or(Long.MAX_VALUE)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyCache.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyCache.java deleted file mode 100644 index 27b9abc72..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface RecencyCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheFilter.java deleted file mode 100644 index 5d4772f70..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheFilter.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class RecencyCacheFilter extends - CacheFilter { - /** - * Creates a cache filter for earlybird recency requests. 
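
`RecencyAndRelevanceCachePostProcessor` above extracts the since/max ID range from the parsed query (since is exclusive, max is inclusive, defaulting to `0` and `Long.MAX_VALUE`) and hands it to `CacheUtil.postProcessCacheResult` so a cached response can also serve a narrower ID window. That utility is not part of this diff; a minimal sketch of the trimming idea over plain tweet IDs, under those assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class IdRangeTrimSketch {
  /** Keep only IDs in (sinceIdExclusive, maxIdInclusive], mirroring IdTimeRanges semantics. */
  static List<Long> trim(List<Long> cachedResultIds, long sinceIdExclusive, long maxIdInclusive) {
    return cachedResultIds.stream()
        .filter(id -> id > sinceIdExclusive && id <= maxIdInclusive)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Long> cached = Arrays.asList(100L, 200L, 300L, 400L);
    // A narrower request with since_id=100 and max_id=300 can reuse the cached response.
    System.out.println(trim(cached, 100L, 300L)); // [200, 300]
  }
}
```

With the defaults of `0` and `Long.MAX_VALUE`, a query without since/max restrictions keeps the entire cached result set.
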
- */ - @Inject - public RecencyCacheFilter( - @RecencyCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName, - @Named(CacheCommonUtil.NAMED_MAX_CACHE_RESULTS) int maxCacheResults) { - super(cache, - new RecencyQueryCachePredicate(decider, normalizedSearchRootName), - new RecencyCacheRequestNormalizer(), - new RecencyAndRelevanceCachePostProcessor(), - new RecencyServicePostProcessor(cache, maxCacheResults), - new EarlybirdRequestPerClientCacheStats( - EarlybirdRequestType.RECENCY.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheRequestNormalizer.java deleted file mode 100644 index 11d74370c..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyCacheRequestNormalizer.java +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RecencyCacheRequestNormalizer extends - CacheRequestNormalizer { - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - return Optional.fromNullable(CacheUtil.normalizeRequestForCache(requestContext.getRequest())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyQueryCachePredicate.java deleted file mode 100644 index 12778f922..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyQueryCachePredicate.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class RecencyQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String recencyCacheEnabledDeciderKey; - - public RecencyQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.recencyCacheEnabledDeciderKey = "recency_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext request) { - return EarlybirdRequestType.RECENCY == request.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(request.getRequest()) - && decider.isAvailable(recencyCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RecencyServicePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/RecencyServicePostProcessor.java deleted file mode 100644 index 35ab01e99..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RecencyServicePostProcessor.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.CacheUtil; -import 
com.twitter.search.common.caching.filter.ServicePostProcessor; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RecencyServicePostProcessor - extends ServicePostProcessor { - private final Cache cache; - private final int maxCacheResults; - - public RecencyServicePostProcessor( - Cache cache, - int maxCacheResults) { - this.cache = cache; - this.maxCacheResults = maxCacheResults; - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - CacheUtil.cacheResults(cache, requestContext.getRequest(), serviceResponse, maxCacheResults); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCache.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceCache.java deleted file mode 100644 index b44a7950a..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface RelevanceCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheFilter.java deleted file mode 100644 index bd1d718a9..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheFilter.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class RelevanceCacheFilter extends - CacheFilter { - /** - * Creates a cache filter for earlybird relevance requests - */ - @Inject - public RelevanceCacheFilter( - @RelevanceCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - super(cache, - new RelevanceQueryCachePredicate(decider, normalizedSearchRootName), - new RelevanceCacheRequestNormalizer(decider, normalizedSearchRootName), - new RecencyAndRelevanceCachePostProcessor(), - new RelevanceServicePostProcessor(cache), - new EarlybirdRequestPerClientCacheStats( - EarlybirdRequestType.RELEVANCE.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheRequestNormalizer.java deleted file mode 100644 index 6f01eb63d..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceCacheRequestNormalizer.java +++ /dev/null @@ -1,40 +0,0 @@ -package 
com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RelevanceCacheRequestNormalizer extends - CacheRequestNormalizer { - private static final SearchCounter RELEVANCE_FORCE_CACHED_LOGGED_IN_REQUEST = - SearchCounter.export("relevance_force_cached_logged_in_request"); - - private final SearchDecider decider; - private final String relevanceStripPersonalizationFieldsDeciderKey; - - public RelevanceCacheRequestNormalizer( - SearchDecider decider, - String normalizedSearchRootName) { - this.decider = decider; - this.relevanceStripPersonalizationFieldsDeciderKey = - String.format("relevance_%s_force_cache_logged_in_requests", normalizedSearchRootName); - } - - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - boolean cacheLoggedInRequest = - decider.isAvailable(relevanceStripPersonalizationFieldsDeciderKey); - - if (cacheLoggedInRequest) { - RELEVANCE_FORCE_CACHED_LOGGED_IN_REQUEST.increment(); - } - - return Optional.fromNullable(CacheUtil.normalizeRequestForCache( - requestContext.getRequest(), cacheLoggedInRequest)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceQueryCachePredicate.java deleted file mode 100644 index a7767682e..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceQueryCachePredicate.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class RelevanceQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String relevanceCacheEnabledDeciderKey; - - public RelevanceQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.relevanceCacheEnabledDeciderKey = "relevance_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.RELEVANCE == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(relevanceCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceServicePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceServicePostProcessor.java deleted file mode 100644 index 7dcaaaf52..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceServicePostProcessor.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.common.caching.filter.ServicePostProcessor; 
-import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RelevanceServicePostProcessor - extends ServicePostProcessor { - private final Cache cache; - - public RelevanceServicePostProcessor( - Cache cache) { - this.cache = cache; - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - CacheUtil.cacheResults(cache, requestContext.getRequest(), serviceResponse, Integer.MAX_VALUE); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheFilter.java deleted file mode 100644 index f0a3b8cf5..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheFilter.java +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -/** - * A filter that: - * - Strips the request of all personalization fields, normalizes it and looks it up in the cache. - * If it finds a response with 0 results in the cache, it returns it. - * - Caches the response for a personalized query, whenever the response has 0 results. The cache - * key is the normalized request with all personalization fields stripped. - * - * If a query (from a logged in or logged out user) returns 0 results, then the same query will - * always return 0 results, for all users. So we can cache that result. - */ -public class RelevanceZeroResultsCacheFilter - extends CacheFilter { - - /** Creates a filter that caches relevance requests with 0 results. 
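
The Javadoc above captures the key observation behind this filter: an empty result set is user-independent, so the personalization fields can be stripped from the cache key and a single cached empty response shared across all users. A toy sketch of that write/read discipline, using stand-in types rather than the Thrift classes:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the zero-results caching idea described above. */
final class ZeroResultsCacheSketch {
  private final Map<String, String> emptyResponseCache = new ConcurrentHashMap<>();

  /** After calling the backend: cache only responses that carry no results. */
  void maybeCache(String depersonalizedKey, String response, boolean hasResults) {
    if (!hasResults) {
      emptyResponseCache.put(depersonalizedKey, response);
    }
  }

  /** Before calling the backend: an empty result cached for one user applies to every user. */
  Optional<String> lookup(String depersonalizedKey) {
    return Optional.ofNullable(emptyResponseCache.get(depersonalizedKey));
  }
}
```
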
*/ - @Inject - public RelevanceZeroResultsCacheFilter( - @RelevanceCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - super(cache, - new RelevanceZeroResultsQueryCachePredicate(decider, normalizedSearchRootName), - new RelevanceZeroResultsCacheRequestNormalizer(), - new RelevanceZeroResultsCachePostProcessor(), - new RelevanceZeroResultsServicePostProcessor(cache), - new EarlybirdRequestPerClientCacheStats("relevance_zero_results")); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCachePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCachePostProcessor.java deleted file mode 100644 index 41ebeff06..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCachePostProcessor.java +++ /dev/null @@ -1,20 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -public class RelevanceZeroResultsCachePostProcessor extends RecencyAndRelevanceCachePostProcessor { - @Override - protected Optional postProcessCacheResponse( - EarlybirdRequest request, EarlybirdResponse response, long sinceId, long maxId) { - // If a query (from a logged in or logged out user) returns 0 results, then the same query will - // always return 0 results, for all users. So we can cache that result. - if (CacheCommonUtil.hasResults(response)) { - return Optional.absent(); - } - - return Optional.of(response); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheRequestNormalizer.java deleted file mode 100644 index 6e5284588..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsCacheRequestNormalizer.java +++ /dev/null @@ -1,31 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.common.caching.SearchQueryNormalizer; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RelevanceZeroResultsCacheRequestNormalizer - extends CacheRequestNormalizer { - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - // If the query is not personalized, it means that: - // - RelevanceCacheRequestNormalizer has already normalized it into a cacheable query. - // - RelevanceCacheFilter could not find a response for this query in the cache. - // - // So if we try to normalize it here again, we will succeed, but then - // RelevanceZeroResultsCacheFilter will do the same look up in the cache, which will again - // result in a cache miss. There is no need to do this look up twice, so if the query is not - // personalized, return Optional.absent(). - // - // If the query is personalized, strip all personalization fields and normalize the request. 
- if (!SearchQueryNormalizer.queryIsPersonalized(requestContext.getRequest().getSearchQuery())) { - return Optional.absent(); - } - return Optional.fromNullable( - CacheUtil.normalizeRequestForCache(requestContext.getRequest(), true)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsQueryCachePredicate.java deleted file mode 100644 index 8d04bceb2..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsQueryCachePredicate.java +++ /dev/null @@ -1,31 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class RelevanceZeroResultsQueryCachePredicate - extends QueryCachePredicate { - private final SearchDecider decider; - private final String relevanceCacheEnabledDeciderKey; - private final String relevanceZeroResultsCacheEnabledDeciderKey; - - public RelevanceZeroResultsQueryCachePredicate( - SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.relevanceCacheEnabledDeciderKey = - "relevance_cache_enabled_" + normalizedSearchRootName; - this.relevanceZeroResultsCacheEnabledDeciderKey = - "relevance_zero_results_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.RELEVANCE == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(relevanceCacheEnabledDeciderKey) - && decider.isAvailable(relevanceZeroResultsCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsServicePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsServicePostProcessor.java deleted file mode 100644 index 6fe7c405f..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/RelevanceZeroResultsServicePostProcessor.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.CacheUtil; -import com.twitter.search.common.caching.filter.ServicePostProcessor; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RelevanceZeroResultsServicePostProcessor - extends ServicePostProcessor { - - private static final SearchCounter RELEVANCE_RESPONSES_WITH_ZERO_RESULTS_COUNTER = - SearchCounter.export("relevance_responses_with_zero_results"); - - private final Cache cache; - - public RelevanceZeroResultsServicePostProcessor( - Cache cache) { - this.cache = cache; - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - // serviceResponse is the response to a personalized query. 
If it has zero results, then we can - // cache it and reuse it for other requests with the same query. Otherwise, it makes no sense to - // cache this response. - if (!CacheCommonUtil.hasResults(serviceResponse)) { - RELEVANCE_RESPONSES_WITH_ZERO_RESULTS_COUNTER.increment(); - CacheUtil.cacheResults( - cache, requestContext.getRequest(), serviceResponse, Integer.MAX_VALUE); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCache.java b/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCache.java deleted file mode 100644 index b56733227..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface StrictRecencyCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCacheFilter.java deleted file mode 100644 index 22b5b1023..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyCacheFilter.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class StrictRecencyCacheFilter extends - CacheFilter { - /** - * Creates a cache filter for earlybird strict recency requests. 
- */ - @Inject - public StrictRecencyCacheFilter( - @StrictRecencyCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName, - @Named(CacheCommonUtil.NAMED_MAX_CACHE_RESULTS) int maxCacheResults) { - super(cache, - new StrictRecencyQueryCachePredicate(decider, normalizedSearchRootName), - new RecencyCacheRequestNormalizer(), - new RecencyAndRelevanceCachePostProcessor(), - new RecencyServicePostProcessor(cache, maxCacheResults), - new EarlybirdRequestPerClientCacheStats( - EarlybirdRequestType.STRICT_RECENCY.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyQueryCachePredicate.java deleted file mode 100644 index 665b0917f..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/StrictRecencyQueryCachePredicate.java +++ /dev/null @@ -1,25 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class StrictRecencyQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String strictRecencyCacheEnabledDeciderKey; - - public StrictRecencyQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.strictRecencyCacheEnabledDeciderKey = - "strict_recency_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.STRICT_RECENCY == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(strictRecencyCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCache.java b/src/java/com/twitter/search/earlybird_root/caching/TermStatsCache.java deleted file mode 100644 index 3f3458fbc..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface TermStatsCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheFilter.java deleted file mode 100644 index 833b8909f..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheFilter.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; 
-import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class TermStatsCacheFilter extends - CacheFilter { - /** - * Constructs a new cache filter for term stats requests. - */ - @Inject - public TermStatsCacheFilter( - @TermStatsCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - super(cache, - new TermStatsQueryCachePredicate(decider, normalizedSearchRootName), - new TermStatsCacheRequestNormalizer(), - new EarlybirdCachePostProcessor(), - new TermStatsServicePostProcessor(cache), - new EarlybirdRequestPerClientCacheStats( - EarlybirdRequestType.TERM_STATS.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheRequestNormalizer.java deleted file mode 100644 index f804a6eb3..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TermStatsCacheRequestNormalizer.java +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.TermStatsCacheUtil; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class TermStatsCacheRequestNormalizer extends - CacheRequestNormalizer { - - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - return Optional.fromNullable(TermStatsCacheUtil.normalizeForCache(requestContext.getRequest())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TermStatsQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/TermStatsQueryCachePredicate.java deleted file mode 100644 index 34ca56870..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TermStatsQueryCachePredicate.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class TermStatsQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String termstatsCacheEnabledDeciderKey; - - public TermStatsQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.termstatsCacheEnabledDeciderKey = "termstats_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.TERM_STATS == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(termstatsCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TermStatsServicePostProcessor.java 
b/src/java/com/twitter/search/earlybird_root/caching/TermStatsServicePostProcessor.java deleted file mode 100644 index 58e16d371..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TermStatsServicePostProcessor.java +++ /dev/null @@ -1,25 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.TermStatsCacheUtil; -import com.twitter.search.common.caching.filter.ServicePostProcessor; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class TermStatsServicePostProcessor - extends ServicePostProcessor { - private final Cache cache; - - public TermStatsServicePostProcessor(Cache cache) { - this.cache = Preconditions.checkNotNull(cache); - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - TermStatsCacheUtil.cacheResults(cache, requestContext.getRequest(), serviceResponse); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCache.java b/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCache.java deleted file mode 100644 index c79413312..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCache.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import java.lang.annotation.ElementType; -import java.lang.annotation.Retention; -import java.lang.annotation.Target; - -import com.google.inject.BindingAnnotation; - -import static java.lang.annotation.RetentionPolicy.RUNTIME; - -@Retention(RUNTIME) -@Target({ ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD }) -@BindingAnnotation -public @interface TopTweetsCache { -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheFilter.java b/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheFilter.java deleted file mode 100644 index afbbde5a2..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheFilter.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.search.common.caching.Cache; -import com.twitter.search.common.caching.filter.CacheFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.root.SearchRootModule; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class TopTweetsCacheFilter extends - CacheFilter { - /** - * Constructs a new cache filter for top tweets requests. 
- */ - @Inject - public TopTweetsCacheFilter( - @TopTweetsCache Cache cache, - SearchDecider decider, - @Named(SearchRootModule.NAMED_NORMALIZED_SEARCH_ROOT_NAME) String normalizedSearchRootName) { - super(cache, - new TopTweetsQueryCachePredicate(decider, normalizedSearchRootName), - new TopTweetsCacheRequestNormalizer(), - new EarlybirdCachePostProcessor(), - new TopTweetsServicePostProcessor(cache), - new EarlybirdRequestPerClientCacheStats( - EarlybirdRequestType.TOP_TWEETS.getNormalizedName())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheRequestNormalizer.java b/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheRequestNormalizer.java deleted file mode 100644 index f790f97c2..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsCacheRequestNormalizer.java +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.google.common.base.Optional; - -import com.twitter.search.common.caching.TopTweetsCacheUtil; -import com.twitter.search.common.caching.filter.CacheRequestNormalizer; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class TopTweetsCacheRequestNormalizer extends - CacheRequestNormalizer { - - @Override - public Optional normalizeRequest(EarlybirdRequestContext requestContext) { - return Optional.fromNullable( - TopTweetsCacheUtil.normalizeTopTweetsRequestForCache(requestContext.getRequest())); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsQueryCachePredicate.java b/src/java/com/twitter/search/earlybird_root/caching/TopTweetsQueryCachePredicate.java deleted file mode 100644 index 2e8fda2c6..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsQueryCachePredicate.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import com.twitter.search.common.caching.filter.QueryCachePredicate; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -public class TopTweetsQueryCachePredicate extends QueryCachePredicate { - private final SearchDecider decider; - private final String toptweetsCacheEnabledDeciderKey; - - public TopTweetsQueryCachePredicate(SearchDecider decider, String normalizedSearchRootName) { - this.decider = decider; - this.toptweetsCacheEnabledDeciderKey = "toptweets_cache_enabled_" + normalizedSearchRootName; - } - - @Override - public Boolean shouldQueryCache(EarlybirdRequestContext requestContext) { - return EarlybirdRequestType.TOP_TWEETS == requestContext.getEarlybirdRequestType() - && EarlybirdRequestUtil.isCachingAllowed(requestContext.getRequest()) - && decider.isAvailable(toptweetsCacheEnabledDeciderKey); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsServicePostProcessor.java b/src/java/com/twitter/search/earlybird_root/caching/TopTweetsServicePostProcessor.java deleted file mode 100644 index 5812404a1..000000000 --- a/src/java/com/twitter/search/earlybird_root/caching/TopTweetsServicePostProcessor.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.earlybird_root.caching; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.caching.Cache; -import 
com.twitter.search.common.caching.TopTweetsCacheUtil; -import com.twitter.search.common.caching.filter.ServicePostProcessor; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -import static com.google.common.base.Preconditions.checkNotNull; - -public class TopTweetsServicePostProcessor - extends ServicePostProcessor { - private static final Logger LOG = LoggerFactory.getLogger(TopTweetsServicePostProcessor.class); - - public static final int CACHE_AGE_IN_MS = 600000; - public static final int NO_RESULT_CACHE_AGE_IN_MS = 300000; - - private final Cache cache; - - public TopTweetsServicePostProcessor(Cache cache) { - this.cache = checkNotNull(cache); - } - - @Override - public void processServiceResponse(EarlybirdRequestContext requestContext, - EarlybirdResponse serviceResponse) { - - EarlybirdRequest originalRequest = requestContext.getRequest(); - LOG.debug("Writing to top tweets cache. Request: {}, Response: {}", - originalRequest, serviceResponse); - TopTweetsCacheUtil.cacheResults(originalRequest, - serviceResponse, - cache, - CACHE_AGE_IN_MS, - NO_RESULT_CACHE_AGE_IN_MS); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/collectors/BUILD b/src/java/com/twitter/search/earlybird_root/collectors/BUILD deleted file mode 100644 index bbb90ada1..000000000 --- a/src/java/com/twitter/search/earlybird_root/collectors/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/log4j", - "src/java/com/twitter/search/common/relevance:utils", - "src/java/com/twitter/search/common/util/earlybird", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/collectors/MultiwayMergeCollector.java b/src/java/com/twitter/search/earlybird_root/collectors/MultiwayMergeCollector.java deleted file mode 100644 index 8423007ee..000000000 --- a/src/java/com/twitter/search/earlybird_root/collectors/MultiwayMergeCollector.java +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.search.earlybird_root.collectors; - -import java.util.Collections; -import java.util.Comparator; -import java.util.List; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -/** - * Generic MultiwayMergeCollector class for doing k-way merge of earlybird responses - * that takes a comparator and returns a list of results sorted by the comparator. - */ -public abstract class MultiwayMergeCollector { - protected static final Logger LOG = LoggerFactory.getLogger(MultiwayMergeCollector.class); - - private final Comparator resultComparator; - private final int numResponsesToMerge; - private final List results = Lists.newArrayList(); - private int numResponsesAdded = 0; - - /** - * Constructor that does multi way merge and takes in a custom predicate search result filter. - */ - public MultiwayMergeCollector(int numResponses, - Comparator comparator) { - Preconditions.checkNotNull(comparator); - this.resultComparator = comparator; - this.numResponsesToMerge = numResponses; - } - - /** - * Add a single response from one partition, updates stats. 
- * - * @param response response from one partition - */ - public final void addResponse(EarlybirdResponse response) { - // On prod, does it ever happen we receive more responses than numPartitions ? - Preconditions.checkArgument(numResponsesAdded++ < numResponsesToMerge, - String.format("Attempting to merge more than %d responses", numResponsesToMerge)); - if (!isResponseValid(response)) { - return; - } - collectStats(response); - List resultsFromResponse = collectResults(response); - if (resultsFromResponse != null && resultsFromResponse.size() > 0) { - results.addAll(resultsFromResponse); - } - } - - /** - * Parse the EarlybirdResponse and retrieve list of results to be appended. - * - * @param response earlybird response from where results are extracted - * @return resultsList to be appended - */ - protected abstract List collectResults(EarlybirdResponse response); - - /** - * It is recommended that sub-class overrides this function to add custom logic to - * collect more stat and call this base function. - */ - protected void collectStats(EarlybirdResponse response) { - } - - /** - * Get full list of results, after addResponse calls have been invoked. - * - * @return list of results extracted from all EarlybirdResponses that have been collected so far - */ - protected final List getResultsList() { - Collections.sort(results, resultComparator); - return results; - } - - protected abstract boolean isResponseValid(EarlybirdResponse response); -} diff --git a/src/java/com/twitter/search/earlybird_root/collectors/RecencyMergeCollector.java b/src/java/com/twitter/search/earlybird_root/collectors/RecencyMergeCollector.java deleted file mode 100644 index 42d9bdf6a..000000000 --- a/src/java/com/twitter/search/earlybird_root/collectors/RecencyMergeCollector.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.earlybird_root.collectors; - -import java.util.Comparator; -import java.util.List; - -import com.twitter.search.common.relevance.utils.ResultComparators; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; - -/** - * {@link RecencyMergeCollector} inherits {@link MultiwayMergeCollector} for the type - * {@link com.twitter.search.earlybird.thrift.ThriftSearchResult} as the result type. - *
    - * It also implements two public methods to retrieve the top-k or all results. - */ -public class RecencyMergeCollector extends MultiwayMergeCollector { - - // Container for the final results array and also stats like numHitsProcessed etc... - protected final ThriftSearchResults finalResults = new ThriftSearchResults(); - - public RecencyMergeCollector(int numResponses) { - this(numResponses, ResultComparators.ID_COMPARATOR); - } - - protected RecencyMergeCollector(int numResponses, Comparator comparator) { - super(numResponses, comparator); - } - - @Override - protected void collectStats(EarlybirdResponse response) { - super.collectStats(response); - - ThriftSearchResults searchResults = response.getSearchResults(); - if (searchResults.isSetNumHitsProcessed()) { - finalResults.setNumHitsProcessed( - finalResults.getNumHitsProcessed() + searchResults.getNumHitsProcessed()); - } - if (searchResults.isSetNumPartitionsEarlyTerminated()) { - finalResults.setNumPartitionsEarlyTerminated( - finalResults.getNumPartitionsEarlyTerminated() - + searchResults.getNumPartitionsEarlyTerminated()); - } - } - - @Override - protected final List collectResults(EarlybirdResponse response) { - if (response != null - && response.isSetSearchResults() - && response.getSearchResults().getResultsSize() > 0) { - return response.getSearchResults().getResults(); - } else { - return null; - } - } - - /** - * Gets all the results that has been collected. - * - * @return {@link ThriftSearchResults} containing a list of results sorted by provided - * comparator in descending order. - */ - public final ThriftSearchResults getAllSearchResults() { - return finalResults.setResults(getResultsList()); - } - - @Override - protected final boolean isResponseValid(EarlybirdResponse response) { - if (response == null || !response.isSetSearchResults()) { - LOG.warn("searchResults was null: " + response); - return false; - } - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/collectors/RelevanceMergeCollector.java b/src/java/com/twitter/search/earlybird_root/collectors/RelevanceMergeCollector.java deleted file mode 100644 index 7331083a1..000000000 --- a/src/java/com/twitter/search/earlybird_root/collectors/RelevanceMergeCollector.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird_root.collectors; - -import com.twitter.search.common.relevance.utils.ResultComparators; -import com.twitter.search.common.util.earlybird.ThriftSearchResultsRelevanceStatsUtil; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResultsRelevanceStats; - -/** - * RelevanceMergeCollector class extends (@link RecencyMergeCollector} to do k-way merge of - * earlybird responses, but sorted by relevance score. - * - * Note that this is a superset of functionality found in - * {@link com.twitter.search.blender.services.earlybird.relevance.RelevanceCollector} - * If you make changes here, evaluate if they should be made in RelevanceCollector as well. 
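 * A minimal usage sketch (illustrative only; partitionResponses is a hypothetical list of
 * per-partition responses). RecencyMergeCollector sorts merged results by tweet ID, this class
 * by relevance score:
 *
 *   RecencyMergeCollector collector = new RecencyMergeCollector(partitionResponses.size());
 *   for (EarlybirdResponse response : partitionResponses) {
 *     collector.addResponse(response);  // skips invalid responses, accumulates hit stats
 *   }
 *   ThriftSearchResults merged = collector.getAllSearchResults();  // sorted newest-first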
- */ -public class RelevanceMergeCollector extends RecencyMergeCollector { - - public RelevanceMergeCollector(int numResponses) { - super(numResponses, ResultComparators.SCORE_COMPARATOR); - } - - @Override - protected void collectStats(EarlybirdResponse response) { - super.collectStats(response); - - if (!response.getSearchResults().isSetRelevanceStats()) { - return; - } - - if (!finalResults.isSetRelevanceStats()) { - finalResults.setRelevanceStats(new ThriftSearchResultsRelevanceStats()); - } - - ThriftSearchResultsRelevanceStats base = finalResults.getRelevanceStats(); - ThriftSearchResultsRelevanceStats delta = response.getSearchResults().getRelevanceStats(); - - ThriftSearchResultsRelevanceStatsUtil.addRelevanceStats(base, delta); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/BUILD b/src/java/com/twitter/search/earlybird_root/common/BUILD deleted file mode 100644 index 57ad0ffa4..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/BUILD +++ /dev/null @@ -1,22 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/thrift/com/twitter/context:twitter-context-scala", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/common:constants-java", - "src/thrift/com/twitter/search/common:features-java", - "twitter-context/src/main/scala", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/common/ClientErrorException.java b/src/java/com/twitter/search/earlybird_root/common/ClientErrorException.java deleted file mode 100644 index 98c2bb011..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/ClientErrorException.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -public class ClientErrorException extends RuntimeException { - - public ClientErrorException() { - } - - public ClientErrorException(String message) { - super(message); - } - - public ClientErrorException(String message, Throwable cause) { - super(message, cause); - } - - public ClientErrorException(Throwable cause) { - super(cause); - } - - public ClientErrorException(String message, Throwable cause, - boolean enableSuppression, boolean writableStackTrace) { - super(message, cause, enableSuppression, writableStackTrace); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/EarlybirdFeatureSchemaMerger.java b/src/java/com/twitter/search/earlybird_root/common/EarlybirdFeatureSchemaMerger.java deleted file mode 100644 index f91d2d3c4..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/EarlybirdFeatureSchemaMerger.java +++ /dev/null @@ -1,377 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.TreeSet; -import java.util.concurrent.ConcurrentHashMap; - -import javax.annotation.concurrent.ThreadSafe; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Maps; - -import 
org.apache.commons.lang.mutable.MutableInt; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchema; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaSpecifier; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; - -@ThreadSafe -public class EarlybirdFeatureSchemaMerger { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdFeatureSchemaMerger.class); - - private static final SearchLongGauge NUM_FEATURE_SCHEMAS_MAP = SearchLongGauge.export( - "earlybird_feature_schema_cached_cnt"); - - private class Stats { - public final SearchCounter fieldFormatResponses; - public final SearchCounter mapFormatResponses; - public final SearchCounter mapFormatSavedSchemaResponses; - public final SearchCounter mapFormatAllDownstreamMissingSchema; - public final SearchCounter mapFormatOneDownstreamMissingSchema; - public final SearchCounter mapFormatSchemaCachedMismatch; - public final SearchCounter numInvalidRankingModeRequests; - public final SearchCounter numEmptyResponses; - - public Stats(String prefix) { - this.fieldFormatResponses = - SearchCounter.export( - "earlybird_feature_schema_" + prefix + "_field_format_feature_responses"); - this.mapFormatResponses = - SearchCounter.export( - "earlybird_feature_schema_" + prefix + "_map_format_feature_responses"); - this.mapFormatSavedSchemaResponses = - SearchCounter.export( - "earlybird_feature_schema_" + prefix + "_map_format_feature_saved_schema_responses"); - this.mapFormatAllDownstreamMissingSchema = - SearchCounter.export( - "earlybird_feature_schema_" + prefix - + "_map_format_feature_all_downstream_missing_schema_error"); - this.mapFormatOneDownstreamMissingSchema = - SearchCounter.export( - "earlybird_feature_schema_" + prefix - + "_map_format_feature_one_downstream_missing_schema_error"); - this.mapFormatSchemaCachedMismatch = - SearchCounter.export( - "earlybird_feature_schema_" + prefix - + "_map_format_feature_schema_cached_mismatch_error"); - this.numInvalidRankingModeRequests = - SearchCounter.export( - "earlybird_feature_schema_" + prefix + "_num_invalid_ranking_mode_requests"); - this.numEmptyResponses = - SearchCounter.export( - "earlybird_feature_schema_" + prefix - + "_num_empty_response_without_schema"); - } - } - - private final ConcurrentHashMap - featureSchemas = new ConcurrentHashMap<>(); - private final ConcurrentHashMap mergeStats = new ConcurrentHashMap<>(); - - /** - * Get all available cache schema list indicated by the schema specifier. - * @return identifiers for all the cached schema - */ - public List getAvailableSchemaList() { - return ImmutableList.copyOf(featureSchemas.keySet()); - } - - /** - * Iterate all the responses and collect and cache feature schemas from response. - * Set the feature schema for the response in searchResults if needed. 
- * (This is done inside earlybird roots) - * - * @param searchResults the response - * @param requestContext the request, which should record the client cached feature schemas - * @param statPrefix the stats prefix string - * @param successfulResponses all successfull responses from downstream - */ - public void collectAndSetFeatureSchemaInResponse( - ThriftSearchResults searchResults, - EarlybirdRequestContext requestContext, - String statPrefix, - List successfulResponses) { - Stats stats = getOrCreateMergeStat(statPrefix); - EarlybirdRequest request = requestContext.getRequest(); - if (!request.isSetSearchQuery() - || !request.getSearchQuery().isSetResultMetadataOptions() - || !request.getSearchQuery().getResultMetadataOptions().isReturnSearchResultFeatures()) { - // If the client does not want to get all features in map format, do not do anything. - stats.fieldFormatResponses.increment(); - return; - } - - // Find the most occurred schema from per-merge responses and return it in the post-merge - // response. - ThriftSearchFeatureSchemaSpecifier schemaMostOccurred = findMostOccurredSchema( - stats, request, successfulResponses); - if (schemaMostOccurred == null) { - return; - } - - Set availableSchemasInClient = - requestContext.getFeatureSchemasAvailableInClient(); - if (availableSchemasInClient != null && availableSchemasInClient.contains(schemaMostOccurred)) { - // The client already knows the schema that we used for this response, so we don't need to - // send it the full schema, just the ThriftSearchFeatureSchemaSpecifier. - ThriftSearchFeatureSchema schema = new ThriftSearchFeatureSchema(); - schema.setSchemaSpecifier(schemaMostOccurred); - searchResults.setFeatureSchema(schema); - stats.mapFormatResponses.increment(); - stats.mapFormatSavedSchemaResponses.increment(); - } else { - ThriftSearchFeatureSchema schema = featureSchemas.get(schemaMostOccurred); - if (schema != null) { - Preconditions.checkState(schema.isSetEntries()); - Preconditions.checkState(schema.isSetSchemaSpecifier()); - searchResults.setFeatureSchema(schema); - stats.mapFormatResponses.increment(); - } else { - stats.mapFormatSchemaCachedMismatch.increment(); - LOG.error("The feature schema cache misses the schema entry {} it should cache for {}", - schemaMostOccurred, request); - } - } - } - - /** - * Merge the feature schema from each cluster's response and return it to the client. - * (This is done inside superroot) - * @param requestContext the search request context - * @param mergedResponse the merged result inside the superroot - * @param realtimeResponse the realtime tier resposne - * @param protectedResponse the protected tier response - * @param fullArchiveResponse the full archive tier response - * @param statsPrefix - */ - public void mergeFeatureSchemaAcrossClusters( - EarlybirdRequestContext requestContext, - EarlybirdResponse mergedResponse, - String statsPrefix, - EarlybirdResponse realtimeResponse, - EarlybirdResponse protectedResponse, - EarlybirdResponse fullArchiveResponse) { - Stats superrootStats = getOrCreateMergeStat(statsPrefix); - - // Only try to merge feature schema if there are search results. 
- ThriftSearchResults mergedResults = Preconditions.checkNotNull( - mergedResponse.getSearchResults()); - if (mergedResults.getResults().isEmpty()) { - mergedResults.unsetFeatureSchema(); - superrootStats.numEmptyResponses.increment(); - return; - } - - EarlybirdRequest request = requestContext.getRequest(); - if (!request.isSetSearchQuery() - || !request.getSearchQuery().isSetResultMetadataOptions() - || !request.getSearchQuery().getResultMetadataOptions().isReturnSearchResultFeatures()) { - mergedResults.unsetFeatureSchema(); - - // If the client does not want to get all features in map format, do not do anything. - superrootStats.fieldFormatResponses.increment(); - return; - } - if (request.getSearchQuery().getRankingMode() != ThriftSearchRankingMode.RELEVANCE - && request.getSearchQuery().getRankingMode() != ThriftSearchRankingMode.TOPTWEETS - && request.getSearchQuery().getRankingMode() != ThriftSearchRankingMode.RECENCY) { - mergedResults.unsetFeatureSchema(); - - // Only RELEVANCE, TOPTWEETS and RECENCY requests might need a feature schema in the response. - superrootStats.numInvalidRankingModeRequests.increment(); - LOG.warn("Request asked for feature schema, but has incorrect ranking mode: {}", request); - return; - } - superrootStats.mapFormatResponses.increment(); - - ThriftSearchFeatureSchema schema = updateReturnSchemaForClusterResponse( - null, realtimeResponse, request, superrootStats); - schema = updateReturnSchemaForClusterResponse( - schema, protectedResponse, request, superrootStats); - schema = updateReturnSchemaForClusterResponse( - schema, fullArchiveResponse, request, superrootStats); - - if (schema != null) { - if (requestContext.getFeatureSchemasAvailableInClient() != null - && requestContext.getFeatureSchemasAvailableInClient().contains( - schema.getSchemaSpecifier())) { - mergedResults.setFeatureSchema( - new ThriftSearchFeatureSchema().setSchemaSpecifier(schema.getSchemaSpecifier())); - } else { - mergedResults.setFeatureSchema(schema); - } - } else { - superrootStats.mapFormatAllDownstreamMissingSchema.increment(); - LOG.error("The response for request {} is missing feature schema from all clusters", request); - } - } - - /** - * Add the schema to both the schema map and and the schema list if it is not there yet. - * - * @param schema the feature schema for search results - */ - private void addNewSchema(ThriftSearchFeatureSchema schema) { - if (!schema.isSetEntries() - || !schema.isSetSchemaSpecifier() - || featureSchemas.containsKey(schema.getSchemaSpecifier())) { - return; - } - - synchronized (this) { - String oldExportedSchemaName = null; - if (!featureSchemas.isEmpty()) { - oldExportedSchemaName = getExportSchemasName(); - } - - if (featureSchemas.putIfAbsent(schema.getSchemaSpecifier(), schema) == null) { - LOG.info("Add new feature schema {} into the list", schema); - NUM_FEATURE_SCHEMAS_MAP.set(featureSchemas.size()); - - if (oldExportedSchemaName != null) { - SearchLongGauge.export(oldExportedSchemaName).reset(); - } - SearchLongGauge.export(getExportSchemasName()).set(1); - LOG.info("Expanded feature schema: {}", ImmutableList.copyOf(featureSchemas.keySet())); - } - } - } - - private String getExportSchemasName() { - StringBuilder builder = new StringBuilder("earlybird_feature_schema_cached"); - TreeSet exportedVersions = new TreeSet<>(); - - // We do not need checksum for exported vars as all cached schemas are from the majority of the - // responses. 
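    // Illustrative example (hypothetical versions): if schemas with versions 7 and 9 are cached,
    // the exported gauge is named "earlybird_feature_schema_cached_7_9".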
- featureSchemas.keySet().stream().forEach(key -> exportedVersions.add(key.getVersion())); - exportedVersions.stream().forEach(version -> { - builder.append('_'); - builder.append(version); - }); - return builder.toString(); - } - - // Get the updated the feature schema based on the earlybird response from the search cluster. - // . If the existingSchema is not null, the function would return the existing schema. Under the - // situation, we would still check whether the feature in earlybird response is valid. - // . Otherwise, the function would extract the feature schema from the earlybird response. - private ThriftSearchFeatureSchema updateReturnSchemaForClusterResponse( - ThriftSearchFeatureSchema existingSchema, - EarlybirdResponse clusterResponse, - EarlybirdRequest request, - Stats stats) { - // If there is no response or search result for this cluster, do not update returned schema. - if ((clusterResponse == null) || !clusterResponse.isSetSearchResults()) { - return existingSchema; - } - ThriftSearchResults results = clusterResponse.getSearchResults(); - if (results.getResults().isEmpty()) { - return existingSchema; - } - - if (!results.isSetFeatureSchema() || !results.getFeatureSchema().isSetSchemaSpecifier()) { - stats.mapFormatOneDownstreamMissingSchema.increment(); - LOG.error("The downstream response {} is missing feature schema for request {}", - clusterResponse, request); - return existingSchema; - } - - ThriftSearchFeatureSchema schema = results.getFeatureSchema(); - - // Even if existingSchema is already set, we would still try to cache the returned schema. - // In this way, the next time earlybird roots don't have to send the full schema back again. - if (schema.isSetEntries()) { - addNewSchema(schema); - } else if (featureSchemas.containsKey(schema.getSchemaSpecifier())) { - stats.mapFormatSavedSchemaResponses.increment(); - } else { - stats.mapFormatSchemaCachedMismatch.increment(); - LOG.error( - "The feature schema cache misses the schema entry {}, it should cache {} in {}", - schema.getSchemaSpecifier(), request, clusterResponse); - } - - ThriftSearchFeatureSchema updatedSchema = existingSchema; - if (updatedSchema == null) { - updatedSchema = featureSchemas.get(schema.getSchemaSpecifier()); - if (updatedSchema != null) { - Preconditions.checkState(updatedSchema.isSetEntries()); - Preconditions.checkState(updatedSchema.isSetSchemaSpecifier()); - } - } - return updatedSchema; - } - - private ThriftSearchFeatureSchemaSpecifier findMostOccurredSchema( - Stats stats, - EarlybirdRequest request, - List successfulResponses) { - boolean hasResults = false; - Map schemaCount = - Maps.newHashMapWithExpectedSize(successfulResponses.size()); - for (EarlybirdResponse response : successfulResponses) { - if (!response.isSetSearchResults() - || response.getSearchResults().getResultsSize() == 0) { - continue; - } - - hasResults = true; - if (response.getSearchResults().isSetFeatureSchema()) { - ThriftSearchFeatureSchema schema = response.getSearchResults().getFeatureSchema(); - if (schema.isSetSchemaSpecifier()) { - MutableInt cnt = schemaCount.get(schema.getSchemaSpecifier()); - if (cnt != null) { - cnt.increment(); - } else { - schemaCount.put(schema.getSchemaSpecifier(), new MutableInt(1)); - } - - if (schema.isSetEntries()) { - addNewSchema(schema); - } - } - } else { - stats.mapFormatOneDownstreamMissingSchema.increment(); - LOG.error("The downstream response {} is missing feature schema for request {}", - response, request); - } - } - - int numMostOccurred = 0; - 
ThriftSearchFeatureSchemaSpecifier schemaMostOccurred = null; - for (Map.Entry entry : schemaCount.entrySet()) { - if (entry.getValue().toInteger() > numMostOccurred) { - numMostOccurred = entry.getValue().toInteger(); - schemaMostOccurred = entry.getKey(); - } - } - - if (schemaMostOccurred == null && hasResults) { - stats.mapFormatAllDownstreamMissingSchema.increment(); - LOG.error("None of the downstream host returned feature schema for {}", request); - } - return schemaMostOccurred; - } - - private Stats getOrCreateMergeStat(String statPrefix) { - Stats stats = mergeStats.get(statPrefix); - if (stats == null) { - Stats newStats = new Stats(statPrefix); - stats = mergeStats.putIfAbsent(statPrefix, newStats); - if (stats == null) { - stats = newStats; - } - } - return stats; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestContext.java b/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestContext.java deleted file mode 100644 index a9d2840ca..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestContext.java +++ /dev/null @@ -1,227 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import java.util.ArrayList; -import java.util.List; -import java.util.Set; - -import javax.annotation.Nullable; - -import scala.Option; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Sets; - -import com.twitter.common.util.Clock; -import com.twitter.context.thriftscala.Viewer; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaSpecifier; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; - -/** - * A class that wraps a request and additional per-request data that should be passed to services. - * - * This class should be immutable. At the very least, it must be thread-safe. In practice, since - * EarlybirdRequest is a mutable Thrift structure, the users of this class need to make sure that - * once a request is used to create a RequestContext instance, it is not modified. - * - * If the request needs to be modified, a new RequestContext with the modified EarlybirdRequest - * should be created. - */ -public final class EarlybirdRequestContext { - - private static final String OVERRIDE_TIER_CONFIGS_DECIDER_KEY = "use_override_tier_configs"; - - /** - * Creates a new context with the provided earlybird request, and using the given decider. - */ - public static EarlybirdRequestContext newContext( - EarlybirdRequest request, - SearchDecider decider, - Option twitterContextViewer, - Clock clock) throws QueryParserException { - - // Try to capture created time as early as possible. For example, we want to account for query - // parsing time. - long createdTimeMillis = clock.nowMillis(); - - boolean useOverrideTierConfig = decider.isAvailable(OVERRIDE_TIER_CONFIGS_DECIDER_KEY); - - Query parsedQuery = QueryParsingUtils.getParsedQuery(request); - - return new EarlybirdRequestContext( - request, - parsedQuery, - useOverrideTierConfig, - createdTimeMillis, - twitterContextViewer); - } - - /** - * Intersection of the userID and the flock response, which is set in the followedUserIds field. - * This is used for protected cluster. 
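 * Concretely, per the logic below: if the request already has a fromUserIDFilter64 list, it is
 * replaced by its intersection with followedUserIds; otherwise followedUserIds is copied into
 * fromUserIDFilter64.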
- */ - public static EarlybirdRequestContext newContextWithRestrictFromUserIdFilter64( - EarlybirdRequestContext requestContext) { - Preconditions.checkArgument(requestContext.getRequest().isSetFollowedUserIds()); - - EarlybirdRequest request = requestContext.getRequest().deepCopy(); - List toIntersect = request.getFollowedUserIds(); - ThriftSearchQuery searchQuery = request.getSearchQuery(); - if (!searchQuery.isSetFromUserIDFilter64()) { - searchQuery.setFromUserIDFilter64(new ArrayList<>(toIntersect)); - } else { - Set intersection = Sets.intersection( - Sets.newHashSet(searchQuery.getFromUserIDFilter64()), - Sets.newHashSet(toIntersect)); - searchQuery.setFromUserIDFilter64(new ArrayList<>(intersection)); - } - - return new EarlybirdRequestContext(requestContext, request, requestContext.getParsedQuery()); - } - - /** - * Makes an exact copy of the provided request context, by cloning the underlying earlybird - * request. - */ - public static EarlybirdRequestContext copyRequestContext( - EarlybirdRequestContext requestContext, - Query parsedQuery) { - return new EarlybirdRequestContext(requestContext, parsedQuery); - } - - /** - * Creates a new context with the provided request, context and reset both the feature schemas - * cached in client and the feature schemas cached in the local cache. - */ - public static EarlybirdRequestContext newContext( - EarlybirdRequest oldRequest, - EarlybirdRequestContext oldRequestContext, - List featureSchemasAvailableInCache, - List featureSchemasAvailableInClient) { - EarlybirdRequest request = oldRequest.deepCopy(); - request.getSearchQuery().getResultMetadataOptions() - .setFeatureSchemasAvailableInClient(featureSchemasAvailableInCache); - - ImmutableSet featureSchemaSetAvailableInClient = null; - if (featureSchemasAvailableInClient != null) { - featureSchemaSetAvailableInClient = ImmutableSet.copyOf(featureSchemasAvailableInClient); - } - - return new EarlybirdRequestContext( - request, - EarlybirdRequestType.of(request), - oldRequestContext.getParsedQuery(), - oldRequestContext.useOverrideTierConfig(), - oldRequestContext.getCreatedTimeMillis(), - oldRequestContext.getTwitterContextViewer(), - featureSchemaSetAvailableInClient); - } - - public EarlybirdRequestContext deepCopy() { - return new EarlybirdRequestContext(request.deepCopy(), parsedQuery, useOverrideTierConfig, - createdTimeMillis, twitterContextViewer); - } - - private final EarlybirdRequest request; - // EarlybirdRequestType should not change for a given request. Computing it once here so that we - // don't need to compute it from the request every time we want to use it. - private final EarlybirdRequestType earlybirdRequestType; - // The parsed query matching the serialized query in the request. May be null if the request does - // not contain a serialized query. - // If a request's serialized query needs to be rewritten for any reason, a new - // EarlybirdRequestContext should be created, with a new EarlybirdRequest (with a new serialized - // query), and a new parsed query (matching the new serialized query). 
- @Nullable - private final Query parsedQuery; - private final boolean useOverrideTierConfig; - private final long createdTimeMillis; - private final Option twitterContextViewer; - - @Nullable - private final ImmutableSet featureSchemasAvailableInClient; - - private EarlybirdRequestContext( - EarlybirdRequest request, - Query parsedQuery, - boolean useOverrideTierConfig, - long createdTimeMillis, - Option twitterContextViewer) { - this(request, - EarlybirdRequestType.of(request), - parsedQuery, - useOverrideTierConfig, - createdTimeMillis, - twitterContextViewer, - null); - } - - private EarlybirdRequestContext( - EarlybirdRequest request, - EarlybirdRequestType earlybirdRequestType, - Query parsedQuery, - boolean useOverrideTierConfig, - long createdTimeMillis, - Option twitterContextViewer, - @Nullable ImmutableSet featureSchemasAvailableInClient) { - this.request = Preconditions.checkNotNull(request); - this.earlybirdRequestType = earlybirdRequestType; - this.parsedQuery = parsedQuery; - this.useOverrideTierConfig = useOverrideTierConfig; - this.createdTimeMillis = createdTimeMillis; - this.twitterContextViewer = twitterContextViewer; - this.featureSchemasAvailableInClient = featureSchemasAvailableInClient; - } - - private EarlybirdRequestContext(EarlybirdRequestContext otherContext, Query otherParsedQuery) { - this(otherContext, otherContext.getRequest().deepCopy(), otherParsedQuery); - } - - private EarlybirdRequestContext(EarlybirdRequestContext otherContext, - EarlybirdRequest otherRequest, - Query otherParsedQuery) { - this(otherRequest, - otherContext.earlybirdRequestType, - otherParsedQuery, - otherContext.useOverrideTierConfig, - otherContext.createdTimeMillis, - otherContext.twitterContextViewer, - null); - - Preconditions.checkState(request.isSetSearchQuery()); - this.request.getSearchQuery().setSerializedQuery(otherParsedQuery.serialize()); - } - - public EarlybirdRequest getRequest() { - return request; - } - - public boolean useOverrideTierConfig() { - return useOverrideTierConfig; - } - - public EarlybirdRequestType getEarlybirdRequestType() { - return earlybirdRequestType; - } - - @Nullable - public Query getParsedQuery() { - return parsedQuery; - } - - public long getCreatedTimeMillis() { - return createdTimeMillis; - } - - public Option getTwitterContextViewer() { - return twitterContextViewer; - } - - @Nullable - public Set getFeatureSchemasAvailableInClient() { - return featureSchemasAvailableInClient; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestType.java b/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestType.java deleted file mode 100644 index 6082135dc..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestType.java +++ /dev/null @@ -1,68 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import javax.annotation.Nonnull; - -import com.twitter.search.common.constants.thriftjava.ThriftQuerySource; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; - -/** - * Earlybird roots distinguish these types of requests and treat them differently. - */ -public enum EarlybirdRequestType { - FACETS, - RECENCY, - RELEVANCE, - STRICT_RECENCY, - TERM_STATS, - TOP_TWEETS; - - /** - * Returns the type of the given requests. 
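 * For example, based on the checks below: a request with a facet request maps to FACETS, a
 * request with a term statistics request maps to TERM_STATS, and otherwise the search query's
 * ranking mode selects RELEVANCE, TOP_TWEETS or RECENCY (RECENCY becomes STRICT_RECENCY when
 * the query source is GNIP).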
- */ - @Nonnull - public static EarlybirdRequestType of(EarlybirdRequest request) { - if (request.isSetFacetRequest()) { - return FACETS; - } else if (request.isSetTermStatisticsRequest()) { - return TERM_STATS; - } else if (request.isSetSearchQuery() && request.getSearchQuery().isSetRankingMode()) { - ThriftSearchRankingMode rankingMode = request.getSearchQuery().getRankingMode(); - switch (rankingMode) { - case RECENCY: - if (shouldUseStrictRecency(request)) { - return STRICT_RECENCY; - } else { - return RECENCY; - } - case RELEVANCE: - return RELEVANCE; - case TOPTWEETS: - return TOP_TWEETS; - default: - throw new IllegalArgumentException(); - } - } else { - throw new UnsupportedOperationException(); - } - } - - private static boolean shouldUseStrictRecency(EarlybirdRequest request) { - // For now, we decide to do strict merging solely based on the QuerySource, and only for GNIP. - return request.isSetQuerySource() && request.getQuerySource() == ThriftQuerySource.GNIP; - } - - private final String normalizedName; - - EarlybirdRequestType() { - this.normalizedName = name().toLowerCase(); - } - - /** - * Returns the "normalized" name of this request type, that can be used for stat and decider - * names. - */ - public String getNormalizedName() { - return normalizedName; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestUtil.java b/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestUtil.java deleted file mode 100644 index 430aa870c..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/EarlybirdRequestUtil.java +++ /dev/null @@ -1,107 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import com.google.common.base.Optional; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.util.IdTimeRanges; - -public final class EarlybirdRequestUtil { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdRequestUtil.class); - - private EarlybirdRequestUtil() { - } - - /** - * Returns the max ID specified in the query. The max ID is determined based on the max_id - * operator, and the returned value is an inclusive max ID (that is, the returned response is - * allowed to have a tweet with this ID). - * - * If the query is null, could not be parsed or does not have a max_id operator, Optional.absent() - * is returned. - * - * @param query The query. - * @return The max ID specified in the given query (based on the max_id operator). - */ - public static Optional getRequestMaxId(Query query) { - if (query == null) { - return Optional.absent(); - } - - IdTimeRanges idTimeRanges = null; - try { - idTimeRanges = IdTimeRanges.fromQuery(query); - } catch (QueryParserException e) { - LOG.warn("Exception while getting max_id/until_time from query: " + query, e); - } - - if (idTimeRanges == null) { - // An exception was thrown or the query doesn't accept the boundary operators. - return Optional.absent(); - } - - return idTimeRanges.getMaxIDInclusive(); - } - - /** - * Returns the max ID specified in the query, based on the until_time operator. The returned ID - * is inclusive (that is, the returned response is allowed to have a tweet with this ID). 
- * - * If the query is null, could not be parsed or does not have an until_time operator, - * Optional.absent() is returned. - * - * @param query The query. - * @return The max ID specified in the given query (based on the until_time operator). - */ - public static Optional getRequestMaxIdFromUntilTime(Query query) { - if (query == null) { - return Optional.absent(); - } - - IdTimeRanges idTimeRanges = null; - try { - idTimeRanges = IdTimeRanges.fromQuery(query); - } catch (QueryParserException e) { - LOG.warn("Exception while getting max_id/until_time from query: " + query, e); - } - - if (idTimeRanges == null) { - // An exception was thrown or the query doesn't accept the boundary operators. - return Optional.absent(); - } - - Optional queryUntilTimeExclusive = idTimeRanges.getUntilTimeExclusive(); - Optional maxId = Optional.absent(); - if (queryUntilTimeExclusive.isPresent()) { - long timestampMillis = queryUntilTimeExclusive.get() * 1000L; - if (SnowflakeIdParser.isUsableSnowflakeTimestamp(timestampMillis)) { - // Convert timestampMillis to an ID, and subtract 1, because the until_time operator is - // exclusive, and we need to return an inclusive max ID. - maxId = Optional.of(SnowflakeIdParser.generateValidStatusId(timestampMillis, 0) - 1); - } - } - return maxId; - } - - /** - * Creates a copy of the given EarlybirdRequest and unsets all fields that are used - * only by the SuperRoot. - */ - public static EarlybirdRequest unsetSuperRootFields( - EarlybirdRequest request, boolean unsetFollowedUserIds) { - EarlybirdRequest newRequest = request.deepCopy(); - newRequest.unsetGetOlderResults(); - newRequest.unsetGetProtectedTweetsOnly(); - if (unsetFollowedUserIds) { - newRequest.unsetFollowedUserIds(); - } - newRequest.unsetAdjustedProtectedRequestParams(); - newRequest.unsetAdjustedFullArchiveRequestParams(); - return newRequest; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/EarlybirdServiceResponse.java b/src/java/com/twitter/search/earlybird_root/common/EarlybirdServiceResponse.java deleted file mode 100644 index 476cedc72..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/EarlybirdServiceResponse.java +++ /dev/null @@ -1,87 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; - -/** - * A class that wraps an EarlybirdResponse and a flag that determines if a request was sent to a - * service. - */ -public final class EarlybirdServiceResponse { - public static enum ServiceState { - // The service was called (or will be called). - SERVICE_CALLED(true), - - // The service is not available (turned off by a decider, for example). - SERVICE_NOT_AVAILABLE(false), - - // The client did not request results from this service. - SERVICE_NOT_REQUESTED(false), - - // The service is available and the client wants results from this service, but the service - // was not called (because we got enough results from other services, for example). 
- SERVICE_NOT_CALLED(false); - - private final boolean serviceWasCalled; - - private ServiceState(boolean serviceWasCalled) { - this.serviceWasCalled = serviceWasCalled; - } - - public boolean serviceWasCalled() { - return serviceWasCalled; - } - - public boolean serviceWasRequested() { - return this != SERVICE_NOT_REQUESTED; - } - - } - - private final EarlybirdResponse earlybirdResponse; - private final ServiceState serviceState; - - private EarlybirdServiceResponse(@Nullable EarlybirdResponse earlybirdResponse, - ServiceState serviceState) { - this.earlybirdResponse = earlybirdResponse; - this.serviceState = serviceState; - if (!serviceState.serviceWasCalled()) { - Preconditions.checkArgument(earlybirdResponse == null); - } - } - - /** - * Creates a new EarlybirdServiceResponse instance, indicating that the service was not called. - * - * @param serviceState The state of the service. - * @return a new EarlybirdServiceResponse instance, indicating that the service was not called. - */ - public static EarlybirdServiceResponse serviceNotCalled(ServiceState serviceState) { - Preconditions.checkArgument(!serviceState.serviceWasCalled()); - return new EarlybirdServiceResponse(null, serviceState); - } - - /** - * Creates a new EarlybirdServiceResponse instance that wraps the given earlybird response. - * - * @param earlybirdResponse The EarlybirdResponse instance returned by the service. - * @return a new EarlybirdServiceResponse instance that wraps the given earlybird response. - */ - public static EarlybirdServiceResponse serviceCalled(EarlybirdResponse earlybirdResponse) { - return new EarlybirdServiceResponse(earlybirdResponse, ServiceState.SERVICE_CALLED); - } - - /** Returns the wrapped earlybird response. */ - @Nullable - public EarlybirdResponse getResponse() { - return earlybirdResponse; - } - - /** Returns the state of the service. 
*/
- public ServiceState getServiceState() {
- return serviceState;
- }
-}
diff --git a/src/java/com/twitter/search/earlybird_root/common/InjectionNames.java b/src/java/com/twitter/search/earlybird_root/common/InjectionNames.java
deleted file mode 100644 index 8662a1d7b..000000000
--- a/src/java/com/twitter/search/earlybird_root/common/InjectionNames.java
+++ /dev/null
@@ -1,10 +0,0 @@
-package com.twitter.search.earlybird_root.common;
-
-public final class InjectionNames {
-
- public static final String FULL_ARCHIVE = "full_archive";
- public static final String REALTIME = "realtime";
- public static final String PROTECTED = "protected";
-
- private InjectionNames() { }
-}
diff --git a/src/java/com/twitter/search/earlybird_root/common/QueryParsingUtils.java b/src/java/com/twitter/search/earlybird_root/common/QueryParsingUtils.java
deleted file mode 100644 index 0df98b34e..000000000
--- a/src/java/com/twitter/search/earlybird_root/common/QueryParsingUtils.java
+++ /dev/null
@@ -1,86 +0,0 @@
-package com.twitter.search.earlybird_root.common;
-
-import java.util.concurrent.TimeUnit;
-
-import javax.annotation.Nullable;
-
-import com.google.common.annotations.VisibleForTesting;
-import com.google.common.base.Preconditions;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.twitter.search.common.metrics.SearchCounter;
-import com.twitter.search.common.metrics.SearchTimerStats;
-import com.twitter.search.earlybird.thrift.EarlybirdRequest;
-import com.twitter.search.earlybird.thrift.EarlybirdResponse;
-import com.twitter.search.earlybird.thrift.EarlybirdResponseCode;
-import com.twitter.search.queryparser.parser.SerializedQueryParser;
-import com.twitter.search.queryparser.query.Query;
-import com.twitter.search.queryparser.query.QueryParserException;
-import com.twitter.util.Future;
-
-/**
- * Common utils for parsing serialized queries, and handling query parser exceptions.
- */
-public final class QueryParsingUtils {
-
- private static final Logger LOG = LoggerFactory.getLogger(QueryParsingUtils.class);
-
- @VisibleForTesting
- public static final SearchCounter QUERYPARSE_COUNT =
- SearchCounter.export("root_queryparse_count");
- private static final SearchTimerStats QUERYPARSE_TIMER =
- SearchTimerStats.export("root_queryparse_time", TimeUnit.NANOSECONDS, false, true);
- private static final SearchCounter NO_PARSED_QUERY_COUNT =
- SearchCounter.export("root_no_parsed_query_count");
-
- private QueryParsingUtils() { }
-
- /**
- * Takes an earlybird request, and parses its serialized query (if it is set).
- * Expects the required ThriftSearchQuery to be set on the passed in EarlybirdRequest.
- *
- * @param request the earlybird request to parse.
- * @return null if the request does not specify a serialized query.
- * @throws QueryParserException if query parsing fails.
- */
- @Nullable
- static Query getParsedQuery(EarlybirdRequest request) throws QueryParserException {
- // searchQuery is required on EarlybirdRequest.
- Preconditions.checkState(request.isSetSearchQuery()); - Query parsedQuery; - if (request.getSearchQuery().isSetSerializedQuery()) { - long startTime = System.nanoTime(); - try { - String serializedQuery = request.getSearchQuery().getSerializedQuery(); - - parsedQuery = new SerializedQueryParser().parse(serializedQuery); - } finally { - QUERYPARSE_COUNT.increment(); - QUERYPARSE_TIMER.timerIncrement(System.nanoTime() - startTime); - } - } else { - NO_PARSED_QUERY_COUNT.increment(); - parsedQuery = null; - } - return parsedQuery; - } - - /** - * Creates a new EarlybirdResponse with a CLIENT_ERROR response code, to be used as a response - * to a request where we failed to parse a user passed in serialized query. - */ - public static Future newClientErrorResponse( - EarlybirdRequest request, - QueryParserException e) { - - String msg = "Failed to parse query"; - LOG.warn(msg, e); - - EarlybirdResponse errorResponse = - new EarlybirdResponse(EarlybirdResponseCode.CLIENT_ERROR, 0); - errorResponse.setDebugString(msg + ": " + e.getMessage()); - return Future.value(errorResponse); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/common/TwitterContextProvider.java b/src/java/com/twitter/search/earlybird_root/common/TwitterContextProvider.java deleted file mode 100644 index 50b40bb41..000000000 --- a/src/java/com/twitter/search/earlybird_root/common/TwitterContextProvider.java +++ /dev/null @@ -1,20 +0,0 @@ -package com.twitter.search.earlybird_root.common; - -import javax.inject.Singleton; - -import scala.Option; - -import com.twitter.context.TwitterContext; -import com.twitter.context.thriftscala.Viewer; -import com.twitter.search.TwitterContextPermit; - -/** - * This class is needed to provide an easy way for unit tests to "inject" - * a TwitterContext Viewer - */ -@Singleton -public class TwitterContextProvider { - public Option get() { - return TwitterContext.acquire(TwitterContextPermit.get()).apply(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/config/BUILD.bazel b/src/java/com/twitter/search/earlybird_root/config/BUILD.bazel deleted file mode 100644 index 48896ca1c..000000000 --- a/src/java/com/twitter/search/earlybird_root/config/BUILD.bazel +++ /dev/null @@ -1,7 +0,0 @@ -java_library( - sources = ["*.java"], - dependencies = [ - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/earlybird/config", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/config/RootClusterBoundaryInfo.java b/src/java/com/twitter/search/earlybird_root/config/RootClusterBoundaryInfo.java deleted file mode 100644 index 00c72d4e5..000000000 --- a/src/java/com/twitter/search/earlybird_root/config/RootClusterBoundaryInfo.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.earlybird_root.config; - -import java.util.Date; - -import com.twitter.common.util.Clock; -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird.config.TierServingBoundaryEndPoint; - -/** - * Time boundary information for a root cluster. - * Used by EarlybirdTimeRangeFilter. 
- */ -public class RootClusterBoundaryInfo implements ServingRange { - - private final TierServingBoundaryEndPoint servingRangeSince; - private final TierServingBoundaryEndPoint servingRangeMax; - - /** - * Build a time boundary information - */ - public RootClusterBoundaryInfo( - Date startDate, - Date clusterEndDate, - String sinceIdBoundaryString, - String maxIdBoundaryString, - Clock clock) { - this.servingRangeSince = TierServingBoundaryEndPoint - .newTierServingBoundaryEndPoint(sinceIdBoundaryString, startDate, clock); - this.servingRangeMax = TierServingBoundaryEndPoint - .newTierServingBoundaryEndPoint(maxIdBoundaryString, clusterEndDate, clock); - } - - public long getServingRangeSinceId() { - return servingRangeSince.getBoundaryTweetId(); - } - - public long getServingRangeMaxId() { - return servingRangeMax.getBoundaryTweetId(); - } - - public long getServingRangeSinceTimeSecondsFromEpoch() { - return servingRangeSince.getBoundaryTimeSecondsFromEpoch(); - } - - public long getServingRangeUntilTimeSecondsFromEpoch() { - return servingRangeMax.getBoundaryTimeSecondsFromEpoch(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/BUILD b/src/java/com/twitter/search/earlybird_root/filters/BUILD deleted file mode 100644 index 464d15a80..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/BUILD +++ /dev/null @@ -1,40 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/org/slf4j:slf4j-api", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/antlr/com/twitter/search/queryparser/antlr:queryparser-antlr", - "src/java/com/google/common/util/concurrent", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/clientstats", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/root", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util:finagleutil", - "src/java/com/twitter/search/common/util/date", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/common/util/lang", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird/config", - "src/java/com/twitter/search/earlybird_root/common", - "src/java/com/twitter/search/earlybird_root/quota", - "src/java/com/twitter/search/earlybird_root/validators", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/java/com/twitter/search/queryparser/query/search:search-query-nodes", - "src/thrift/com/twitter/context:twitter-context-scala", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/filters/ClientIdArchiveAccessFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ClientIdArchiveAccessFilter.java deleted file mode 100644 index bb7e7b1ab..000000000 --- 
a/src/java/com/twitter/search/earlybird_root/filters/ClientIdArchiveAccessFilter.java +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Optional; - -import javax.inject.Inject; - -import com.google.common.base.Preconditions; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.quota.ClientIdQuotaManager; -import com.twitter.search.earlybird_root.quota.QuotaInfo; -import com.twitter.util.Future; - -public class ClientIdArchiveAccessFilter extends SimpleFilter { - private static final String UNAUTHORIZED_ARCHIVE_ACCESS_COUNTER_PATTERN = - "unauthorized_access_to_full_archive_by_client_%s"; - - private final ClientIdQuotaManager quotaManager; - - /** - * Construct the filter by using ClientIdQuotaManager - */ - @Inject - public ClientIdArchiveAccessFilter(ClientIdQuotaManager quotaManager) { - this.quotaManager = Preconditions.checkNotNull(quotaManager); - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - String clientId = ClientIdUtil.getClientIdFromRequest(request); - - Optional quotaInfoOptional = quotaManager.getQuotaForClient(clientId); - QuotaInfo quotaInfo = quotaInfoOptional.orElseGet(quotaManager::getCommonPoolQuota); - if (!quotaInfo.hasArchiveAccess() && request.isGetOlderResults()) { - SearchCounter unauthorizedArchiveAccessCounter = SearchCounter.export( - String.format(UNAUTHORIZED_ARCHIVE_ACCESS_COUNTER_PATTERN, clientId)); - unauthorizedArchiveAccessCounter.increment(); - - String message = String.format( - "Client %s is not whitelisted for archive access. 
Request access at go/searchquota.", - clientId); - EarlybirdResponse response = new EarlybirdResponse( - EarlybirdResponseCode.QUOTA_EXCEEDED_ERROR, 0) - .setDebugString(message); - return Future.value(response); - } - return service.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ClientIdQueryOperatorStatsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ClientIdQueryOperatorStatsFilter.java deleted file mode 100644 index 750b39198..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ClientIdQueryOperatorStatsFilter.java +++ /dev/null @@ -1,129 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Arrays; -import java.util.EnumSet; -import java.util.HashSet; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.ConcurrentMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.clientstats.RequestCounters; -import com.twitter.search.common.clientstats.RequestCountersEventListener; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.visitors.DetectVisitor; -import com.twitter.util.Future; - -/** -* This filter exports RequestCounters stats for each unique combination of client_id and -* query_operator. RequestCounters produce 19 stats for each prefix, and we have numerous -* clients and operators, so this filter can produce a large number of stats. To keep the -* number of exported stats reasonable we use an allow list of operators. The list currently -* includes the geo operators while we monitor the impacts of realtime geo filtering. See -* SEARCH-33699 for project details. -* -* To find the stats look for query_client_operator_* exported by archive roots. 
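A short sketch of how one such stat prefix is assembled; the client id value is illustrative only:

// COUNTER_PREFIX_PATTERN below is "query_client_operator_%s_%s" (client id, operator type).
String prefix = String.format(COUNTER_PREFIX_PATTERN, "example_client", SearchOperator.Type.GEOCODE);
// prefix == "query_client_operator_example_client_GEOCODE"; one RequestCounters instance is
// created and cached per such prefix, and 19 stats are exported under it.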
-* - **/ - -public class ClientIdQueryOperatorStatsFilter - extends SimpleFilter { - - private static final Logger LOG = LoggerFactory.getLogger(ClientIdQueryOperatorStatsFilter.class); - - public static final String COUNTER_PREFIX_PATTERN = "query_client_operator_%s_%s"; - private final Clock clock; - private final ConcurrentMap requestCountersByClientIdAndOperator = - new ConcurrentHashMap<>(); - private final Set operatorsToRecordStatsFor = new HashSet<>(Arrays.asList( - SearchOperator.Type.GEO_BOUNDING_BOX, - SearchOperator.Type.GEOCODE, - SearchOperator.Type.GEOLOCATION_TYPE, - SearchOperator.Type.NEAR, - SearchOperator.Type.PLACE, - SearchOperator.Type.WITHIN)); - - public ClientIdQueryOperatorStatsFilter() { - this.clock = Clock.SYSTEM_CLOCK; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - EarlybirdRequest req = requestContext.getRequest(); - Query parsedQuery = requestContext.getParsedQuery(); - - if (parsedQuery == null) { - return service.apply(requestContext); - } - - Set operators = getOperators(parsedQuery); - Future response = service.apply(requestContext); - for (SearchOperator.Type operator : operators) { - - RequestCounters clientOperatorCounters = getClientOperatorCounters(req.clientId, operator); - RequestCountersEventListener clientOperatorCountersEventListener = - new RequestCountersEventListener<>( - clientOperatorCounters, clock, EarlybirdSuccessfulResponseHandler.INSTANCE); - - response = response.addEventListener(clientOperatorCountersEventListener); - } - return response; - } - - /** - * Gets or creates RequestCounters for the given clientId and operatorType - */ - private RequestCounters getClientOperatorCounters(String clientId, - SearchOperator.Type operatorType) { - String counterPrefix = String.format(COUNTER_PREFIX_PATTERN, clientId, operatorType.toString()); - RequestCounters clientCounters = requestCountersByClientIdAndOperator.get(counterPrefix); - if (clientCounters == null) { - clientCounters = new RequestCounters(counterPrefix); - RequestCounters existingCounters = - requestCountersByClientIdAndOperator.putIfAbsent(counterPrefix, clientCounters); - if (existingCounters != null) { - clientCounters = existingCounters; - } - } - return clientCounters; - } - - /** - * Returns a set of the SearchOperator types that are: - * 1) used by the query - * 2) included in the allow list: operatorsToRecordStatsFor - */ - private Set getOperators(Query parsedQuery) { - final DetectVisitor detectVisitor = new DetectVisitor(false, SearchOperator.Type.values()); - Set detectedOperatorTypes = EnumSet.noneOf(SearchOperator.Type.class); - - try { - parsedQuery.accept(detectVisitor); - } catch (QueryParserException e) { - LOG.error("Failed to detect SearchOperators in query: " + parsedQuery.toString()); - return detectedOperatorTypes; - } - - for (Query query : detectVisitor.getDetectedQueries()) { - // This detectVisitor only matches on SearchOperators. 
- SearchOperator operator = (SearchOperator) query; - SearchOperator.Type operatorType = operator.getOperatorType(); - if (operatorsToRecordStatsFor.contains(operatorType)) { - detectedOperatorTypes.add(operatorType); - } - } - return detectedOperatorTypes; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ClientIdQuotaFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ClientIdQuotaFilter.java deleted file mode 100644 index 828f92ad4..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ClientIdQuotaFilter.java +++ /dev/null @@ -1,274 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Optional; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.ConcurrentMap; - -import javax.inject.Inject; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; -import com.google.common.util.concurrent.RateLimiterProxy; -import com.google.common.util.concurrent.TwitterRateLimiterProxyFactory; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.quota.ClientIdQuotaManager; -import com.twitter.search.earlybird_root.quota.QuotaInfo; -import com.twitter.util.Future; - -/** - * A filter that tracks and limits the per-client request rate. The ID of the client is determined - * by looking at the Finagle client ID and the EarlybirdRequest.clientId field. - * - * The configuration currently has one config based implementation: see ConfigRepoBasedQuotaManager. - * - * If a client has a quota set, this filter will rate limit the requests from that client based on - * that quota. Otherwise, the client is assumed to use a "common request pool", which has its own - * quota. A quota for the common pool must always exist (even if it's set to 0). - * - * All rate limiters used in this class are tolerant to bursts. See TwitterRateLimiterFactory for - * more details. - * - * If a client sends us more requests than its allowed quota, we keep track of the excess traffic - * and export that number in a counter. However, we rate limit the requests from that client only if - * the QuotaInfo returned from ClientIdQuotaManager has the shouldEnforceQuota property set to true. - * - * If a request is rate limited, the filter will return an EarlybirdResponse with a - * QUOTA_EXCEEDED_ERROR response code. 
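A minimal sketch of that admission decision, with Guava's RateLimiter standing in for the proxies built by TwitterRateLimiterProxyFactory; variable names mirror the filter's apply method, and the special case of a quota of 0 (which the real filter never feeds to a rate limiter) is omitted here:

RateLimiter limiter = RateLimiter.create(quotaInfo.getQuota());  // quota is requests per second
boolean withinQuota = limiter.tryAcquire();
if (!withinQuota && quotaInfo.shouldEnforceQuota()) {
  // Throttled: answer with QUOTA_EXCEEDED_ERROR instead of calling the backend.
  return Future.value(getQuotaExceededResponse(
      finagleClientId, quotaInfo.getQuotaClientId(), quotaInfo.getQuota()));
}
// Requests over an unenforced quota are only counted; everything else proceeds normally.
return service.apply(request);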
- */ -public class ClientIdQuotaFilter extends SimpleFilter { - private static final class ClientQuota { - private final QuotaInfo quotaInfo; - private final boolean shouldAllowRequest; - private final ClientIdRequestCounters requestCounters; - - private ClientQuota( - QuotaInfo quotaInfo, - boolean shouldAllowRequest, - ClientIdRequestCounters requestCounters) { - - this.quotaInfo = quotaInfo; - this.shouldAllowRequest = shouldAllowRequest; - this.requestCounters = requestCounters; - } - } - - private static final class ClientIdRequestCounters { - private static final String REQUESTS_RECEIVED_COUNTER_NAME_PATTERN = - "quota_requests_received_for_client_id_%s"; - - private static final String THROTTLED_REQUESTS_COUNTER_NAME_PATTERN = - "quota_requests_throttled_for_client_id_%s"; - - private static final String REQUESTS_ABOVE_QUOTA_COUNTER_NAME_PATTERN = - "quota_requests_above_quota_for_client_id_%s"; - - private static final String REQUESTS_WITHIN_QUOTA_COUNTER_NAME_PATTERN = - "quota_requests_within_quota_for_client_id_%s"; - - private static final String PER_CLIENT_QUOTA_GAUGE_NAME_PATTERN = - "quota_for_client_id_%s"; - - private final SearchRateCounter throttledRequestsCounter; - private final SearchRateCounter requestsReceivedCounter; - private final SearchRateCounter requestsAboveQuotaCounter; - private final SearchRateCounter requestsWithinQuotaCounter; - private final SearchLongGauge quotaClientGauge; - - private ClientIdRequestCounters(String clientId) { - this.throttledRequestsCounter = SearchRateCounter.export( - String.format(THROTTLED_REQUESTS_COUNTER_NAME_PATTERN, clientId)); - - this.requestsReceivedCounter = SearchRateCounter.export( - String.format(REQUESTS_RECEIVED_COUNTER_NAME_PATTERN, clientId), true); - - this.quotaClientGauge = SearchLongGauge.export( - String.format(PER_CLIENT_QUOTA_GAUGE_NAME_PATTERN, clientId)); - - this.requestsAboveQuotaCounter = SearchRateCounter.export( - String.format(REQUESTS_ABOVE_QUOTA_COUNTER_NAME_PATTERN, clientId)); - - this.requestsWithinQuotaCounter = SearchRateCounter.export( - String.format(REQUESTS_WITHIN_QUOTA_COUNTER_NAME_PATTERN, clientId)); - } - } - - private static final String REQUESTS_RECEIVED_FOR_EMAIL_COUNTER_NAME_PATTERN = - "quota_requests_received_for_email_%s"; - - // We have this aggregate stat only because doing sumany(...) on the - // per-client statistic is too expensive for an alert. - @VisibleForTesting - static final SearchRateCounter TOTAL_REQUESTS_RECEIVED_COUNTER = - SearchRateCounter.export("total_quota_requests_received", true); - - private static final int DEFAULT_BURST_FACTOR_SECONDS = 60; - private static final String QUOTA_STAT_CACHE_SIZE = "quota_stat_cache_size"; - private static final String MISSING_QUOTA_FOR_CLIENT_ID_COUNTER_NAME_PATTERN = - "quota_requests_with_missing_quota_for_client_id_%s"; - - private static final Logger LOG = LoggerFactory.getLogger(ClientIdQuotaFilter.class); - - private final ConcurrentMap rateLimiterProxiesByClientId = - new ConcurrentHashMap<>(); - - private final ClientIdQuotaManager quotaManager; - private final TwitterRateLimiterProxyFactory rateLimiterProxyFactory; - private final LoadingCache clientRequestCounters; - private final LoadingCache emailRequestCounters; - - /** Creates a new ClientIdQuotaFilter instance. 
*/ - @Inject - public ClientIdQuotaFilter(ClientIdQuotaManager quotaManager, - TwitterRateLimiterProxyFactory rateLimiterProxyFactory) { - this.quotaManager = quotaManager; - this.rateLimiterProxyFactory = rateLimiterProxyFactory; - - this.clientRequestCounters = CacheBuilder.newBuilder() - .build(new CacheLoader() { - @Override - public ClientIdRequestCounters load(String clientId) { - return new ClientIdRequestCounters(clientId); - } - }); - this.emailRequestCounters = CacheBuilder.newBuilder() - .build(new CacheLoader() { - @Override - public SearchRateCounter load(String email) { - return SearchRateCounter.export( - String.format(REQUESTS_RECEIVED_FOR_EMAIL_COUNTER_NAME_PATTERN, email)); - } - }); - - SearchCustomGauge.export(QUOTA_STAT_CACHE_SIZE, () -> clientRequestCounters.size()); - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - String finagleClientId = FinagleUtil.getFinagleClientName(); - String requestClientId = ClientIdUtil.getClientIdFromRequest(request); - LOG.debug(String.format("Client id from request or attribution: %s", requestClientId)); - - // Multiple client ids may be grouped into a single quota client id, all the - // unknown or unset client ids for example. - String quotaClientId = ClientIdUtil.getQuotaClientId(requestClientId); - LOG.debug(String.format("Client id used for checking quota: %s", quotaClientId)); - - ClientQuota clientQuota = getClientQuota(quotaClientId); - if (!clientQuota.shouldAllowRequest && clientQuota.quotaInfo.shouldEnforceQuota()) { - clientQuota.requestCounters.throttledRequestsCounter.increment(); - - return Future.value(getQuotaExceededResponse( - finagleClientId, - clientQuota.quotaInfo.getQuotaClientId(), - clientQuota.quotaInfo.getQuota())); - } - - return service.apply(request); - } - - private ClientQuota getClientQuota(String clientId) { - Optional quotaInfoOptional = quotaManager.getQuotaForClient(clientId); - if (!quotaInfoOptional.isPresent()) { - SearchRateCounter noQuotaFoundForClientCounter = SearchRateCounter.export( - String.format(MISSING_QUOTA_FOR_CLIENT_ID_COUNTER_NAME_PATTERN, clientId)); - noQuotaFoundForClientCounter.increment(); - } - - // If a quota was set for this client, use it. Otherwise, use the common pool's quota. - // A quota for the common pool must always exist. - QuotaInfo quotaInfo = quotaInfoOptional.orElseGet(quotaManager::getCommonPoolQuota); - - ClientIdRequestCounters requestCounters = clientRequestCounters - .getUnchecked(quotaInfo.getQuotaClientId()); - emailRequestCounters.getUnchecked(quotaInfo.getQuotaEmail()).increment(); - - // Increment a stat for each request the filter receives. - requestCounters.requestsReceivedCounter.increment(); - - // Also increment the total stat - TOTAL_REQUESTS_RECEIVED_COUNTER.increment(); - - // If shouldEnforceQuota is false, we already know that the request will be allowed. - // However, we still want to update the rate limiter and the stats. - final boolean requestAllowed; - if (quotaInfo.getQuota() == 0) { - // If the quota for this client is set to 0, then the request should not be allowed. - // - // Do not update the rate limiter's rate: RateLimiter only accepts positive rates, and in any - // case, we already know that the request should not be allowed. - requestAllowed = false; - } else { - // The quota is not 0: update the rate limiter with the new quota, and see if the request - // should be allowed. 
- RateLimiterProxy rateLimiterProxy = getClientRateLimiterProxy(quotaInfo.getQuotaClientId(), - quotaInfo.getQuota()); - requestAllowed = rateLimiterProxy.tryAcquire(); - } - - // Report the current quota for each client - requestCounters.quotaClientGauge.set(quotaInfo.getQuota()); - - // Update the corresponding counter, if the request should not be allowed. - if (!requestAllowed) { - requestCounters.requestsAboveQuotaCounter.increment(); - } else { - requestCounters.requestsWithinQuotaCounter.increment(); - } - - // Throttle the request only if the quota for this service should be enforced. - return new ClientQuota(quotaInfo, requestAllowed, requestCounters); - } - - private RateLimiterProxy getClientRateLimiterProxy(String clientId, int rate) { - // If a RateLimiter for this client doesn't exist, create one, - // unless another thread beat us to it. - RateLimiterProxy clientRateLimiterProxy = rateLimiterProxiesByClientId.get(clientId); - if (clientRateLimiterProxy == null) { - clientRateLimiterProxy = - rateLimiterProxyFactory.createRateLimiterProxy(rate, DEFAULT_BURST_FACTOR_SECONDS); - RateLimiterProxy existingClientRateLimiterProxy = - rateLimiterProxiesByClientId.putIfAbsent(clientId, clientRateLimiterProxy); - if (existingClientRateLimiterProxy != null) { - clientRateLimiterProxy = existingClientRateLimiterProxy; - } - LOG.info("Using rate limiter with rate {} for clientId {}.", - clientRateLimiterProxy.getRate(), clientId); - } - - // Update the quota, if needed. - if (clientRateLimiterProxy.getRate() != rate) { - LOG.info("Updating the rate from {} to {} for clientId {}.", - clientRateLimiterProxy.getRate(), rate, clientId); - clientRateLimiterProxy.setRate(rate); - } - - return clientRateLimiterProxy; - } - - private static EarlybirdResponse getQuotaExceededResponse( - String finagleClientId, String quotaClientId, int quota) { - return new EarlybirdResponse(EarlybirdResponseCode.QUOTA_EXCEEDED_ERROR, 0) - .setSearchResults(new ThriftSearchResults()) - .setDebugString(String.format( - "Client %s (finagle client ID %s) has exceeded its request quota of %d. 
" - + "Please request more quota at go/searchquota.", - quotaClientId, finagleClientId, quota)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ClientIdTrackingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ClientIdTrackingFilter.java deleted file mode 100644 index 7da53ae4c..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ClientIdTrackingFilter.java +++ /dev/null @@ -1,148 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.ConcurrentMap; - -import javax.inject.Inject; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.common.collections.Pair; -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.clientstats.RequestCounters; -import com.twitter.search.common.clientstats.RequestCountersEventListener; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.common.util.earlybird.ThriftSearchQueryUtil; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.util.Future; - -/** Tracks the number of queries we get from each client. */ -public class ClientIdTrackingFilter extends SimpleFilter { - // Be careful when changing the names of these stats or adding new ones: make sure that they have - // prefixes/suffixes that will allow us to group them in Viz, without pulling in other stats. - // For example, we'll probably have a Viz graph for client_id_tracker_qps_for_client_id_*_all. - // So if you add a new stat named client_id_tracker_qps_for_client_id_%s_and_new_field_%s_all, - // then the graph will be grouping up the values from both stats, instead of grouping up the - // values only for client_id_tracker_qps_for_client_id_%s_all. 
- @VisibleForTesting - static final String QPS_ALL_STAT_PATTERN = "client_id_tracker_qps_for_%s_all"; - - @VisibleForTesting - static final String QPS_LOGGED_IN_STAT_PATTERN = "client_id_tracker_qps_for_%s_logged_in"; - - @VisibleForTesting - static final String QPS_LOGGED_OUT_STAT_PATTERN = "client_id_tracker_qps_for_%s_logged_out"; - - static final String SUPERROOT_REJECT_REQUESTS_WITH_UNKNOWN_FINAGLE_ID = - "superroot_reject_requests_with_unknown_finagle_id"; - - static final String UNKNOWN_FINAGLE_ID_DEBUG_STRING = "Please specify a finagle client id."; - - private final ConcurrentMap requestCountersByClientId = - new ConcurrentHashMap<>(); - private final ConcurrentMap, RequestCounters> - requestCountersByFinagleIdAndClientId = new ConcurrentHashMap<>(); - private final Clock clock; - private final SearchDecider decider; - - @Inject - public ClientIdTrackingFilter(SearchDecider decider) { - this(decider, Clock.SYSTEM_CLOCK); - } - - @VisibleForTesting - ClientIdTrackingFilter(SearchDecider decider, Clock clock) { - this.decider = decider; - this.clock = clock; - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - String clientId = ClientIdUtil.getClientIdFromRequest(request); - String finagleId = FinagleUtil.getFinagleClientName(); - boolean isLoggedIn = ThriftSearchQueryUtil.requestInitiatedByLoggedInUser(request); - incrementCounters(clientId, finagleId, isLoggedIn); - - if (decider.isAvailable(SUPERROOT_REJECT_REQUESTS_WITH_UNKNOWN_FINAGLE_ID) - && finagleId.equals(FinagleUtil.UNKNOWN_CLIENT_NAME)) { - EarlybirdResponse response = new EarlybirdResponse( - EarlybirdResponseCode.QUOTA_EXCEEDED_ERROR, 0) - .setDebugString(UNKNOWN_FINAGLE_ID_DEBUG_STRING); - return Future.value(response); - } - - RequestCounters clientCounters = getClientCounters(clientId); - RequestCountersEventListener clientCountersEventListener = - new RequestCountersEventListener<>( - clientCounters, clock, EarlybirdSuccessfulResponseHandler.INSTANCE); - RequestCounters finagleIdAndClientCounters = getFinagleIdClientCounters(clientId, finagleId); - RequestCountersEventListener finagleIdAndClientCountersEventListener = - new RequestCountersEventListener<>( - finagleIdAndClientCounters, clock, EarlybirdSuccessfulResponseHandler.INSTANCE); - - return service.apply(request) - .addEventListener(clientCountersEventListener) - .addEventListener(finagleIdAndClientCountersEventListener); - } - - // Returns the RequestCounters instance tracking the requests from the given client ID. - private RequestCounters getClientCounters(String clientId) { - RequestCounters clientCounters = requestCountersByClientId.get(clientId); - if (clientCounters == null) { - clientCounters = new RequestCounters(ClientIdUtil.formatClientId(clientId)); - RequestCounters existingCounters = - requestCountersByClientId.putIfAbsent(clientId, clientCounters); - if (existingCounters != null) { - clientCounters = existingCounters; - } - } - return clientCounters; - } - - // Returns the RequestCounters instance tracking the requests from the given client ID. 
- private RequestCounters getFinagleIdClientCounters(String clientId, String finagleId) { - Pair clientKey = Pair.of(clientId, finagleId); - RequestCounters counters = requestCountersByFinagleIdAndClientId.get(clientKey); - if (counters == null) { - counters = new RequestCounters(ClientIdUtil.formatFinagleClientIdAndClientId( - finagleId, clientId)); - RequestCounters existingCounters = requestCountersByFinagleIdAndClientId.putIfAbsent( - clientKey, counters); - if (existingCounters != null) { - counters = existingCounters; - } - } - return counters; - } - - // Increments the correct counters, based on the given clientId, finagleId, and whether or not the - // request came from a logged in user. - private static void incrementCounters(String clientId, String finagleId, boolean isLoggedIn) { - String clientIdForStats = ClientIdUtil.formatClientId(clientId); - String finagleClientIdAndClientIdForStats = - ClientIdUtil.formatFinagleClientIdAndClientId(finagleId, clientId); - SearchCounter.export(String.format(QPS_ALL_STAT_PATTERN, clientIdForStats)).increment(); - SearchCounter.export(String.format(QPS_ALL_STAT_PATTERN, finagleClientIdAndClientIdForStats)) - .increment(); - if (isLoggedIn) { - SearchCounter.export(String.format(QPS_LOGGED_IN_STAT_PATTERN, clientIdForStats)).increment(); - SearchCounter.export( - String.format(QPS_LOGGED_IN_STAT_PATTERN, finagleClientIdAndClientIdForStats)) - .increment(); - } else { - SearchCounter.export(String.format(QPS_LOGGED_OUT_STAT_PATTERN, clientIdForStats)) - .increment(); - SearchCounter.export( - String.format(QPS_LOGGED_OUT_STAT_PATTERN, finagleClientIdAndClientIdForStats)) - .increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ClientRequestTimeFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ClientRequestTimeFilter.java deleted file mode 100644 index b19da2819..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ClientRequestTimeFilter.java +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -/** A filter that sets the EarlybirdRequest.clientRequestTimeMs field if it's not already set. 
*/ -public class ClientRequestTimeFilter extends SimpleFilter { - private static final SearchCounter CLIENT_REQUEST_TIME_MS_UNSET_COUNTER = - SearchCounter.export("client_request_time_ms_unset"); - - private final Clock clock; - - @Inject - public ClientRequestTimeFilter(Clock clock) { - this.clock = clock; - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - if (!request.isSetClientRequestTimeMs()) { - CLIENT_REQUEST_TIME_MS_UNSET_COUNTER.increment(); - request.setClientRequestTimeMs(clock.nowMillis()); - } - return service.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/DeadlineTimeoutStatsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/DeadlineTimeoutStatsFilter.java deleted file mode 100644 index 0ac1a2113..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/DeadlineTimeoutStatsFilter.java +++ /dev/null @@ -1,188 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.TimeUnit; -import javax.inject.Inject; - -import scala.Option; - -import com.google.common.base.Preconditions; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.finagle.context.Contexts$; -import com.twitter.finagle.context.Deadline; -import com.twitter.finagle.context.Deadline$; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * A filter for comparing the request deadline (set in the finagle request context) with the request - * timeout, as set in the EarlybirdRequest. - * - * Tracks stats per client, for (1) requests where the request deadline is set to expire before the - * EarlybirdRequest timeout, and also (2) requests where the deadline allows enough time for the - * EarlybirdRequest timeout to kick in. - */ -public class DeadlineTimeoutStatsFilter - extends SimpleFilter { - - // All stats maps below are per client id, keyed by the client id. 
- private final LoadingCache requestTimeoutNotSetStats; - private final LoadingCache finagleDeadlineNotSetStats; - private final LoadingCache finagleDeadlineAndRequestTimeoutNotSetStats; - private final LoadingCache requestTimeoutStats; - private final LoadingCache finagleDeadlineStats; - private final LoadingCache deadlineLargerStats; - private final LoadingCache deadlineSmallerStats; - - @Inject - public DeadlineTimeoutStatsFilter(Clock clock) { - this.requestTimeoutNotSetStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchCounter load(String clientId) { - return SearchCounter.export( - "deadline_for_client_id_" + clientId + "_request_timeout_not_set"); - } - }); - this.finagleDeadlineNotSetStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchCounter load(String clientId) { - return SearchCounter.export( - "deadline_for_client_id_" + clientId + "_finagle_deadline_not_set"); - } - }); - this.finagleDeadlineAndRequestTimeoutNotSetStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchCounter load(String clientId) { - return SearchCounter.export( - "deadline_for_client_id_" + clientId - + "_finagle_deadline_and_request_timeout_not_set"); - } - }); - this.requestTimeoutStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchTimerStats load(String clientId) { - return SearchTimerStats.export( - "deadline_for_client_id_" + clientId + "_request_timeout", - TimeUnit.MILLISECONDS, - false, - true, - clock); - } - }); - this.finagleDeadlineStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchTimerStats load(String clientId) { - return SearchTimerStats.export( - "deadline_for_client_id_" + clientId + "_finagle_deadline", - TimeUnit.MILLISECONDS, - false, - true, - clock); - } - }); - this.deadlineLargerStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchTimerStats load(String clientId) { - return SearchTimerStats.export( - "deadline_for_client_id_" + clientId - + "_finagle_deadline_larger_than_request_timeout", - TimeUnit.MILLISECONDS, - false, - true, - clock - ); - } - }); - this.deadlineSmallerStats = CacheBuilder.newBuilder().build( - new CacheLoader() { - public SearchTimerStats load(String clientId) { - return SearchTimerStats.export( - "deadline_for_client_id_" + clientId - + "_finagle_deadline_smaller_than_request_timeout", - TimeUnit.MILLISECONDS, - false, - true, - clock - ); - } - }); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - EarlybirdRequest request = requestContext.getRequest(); - String clientId = ClientIdUtil.getClientIdFromRequest(request); - long requestTimeoutMillis = getRequestTimeout(request); - Option deadline = Contexts$.MODULE$.broadcast().get(Deadline$.MODULE$); - - // Tracking per-client timeouts specified in the EarlybirdRequest. - if (requestTimeoutMillis > 0) { - requestTimeoutStats.getUnchecked(clientId).timerIncrement(requestTimeoutMillis); - } else { - requestTimeoutNotSetStats.getUnchecked(clientId).increment(); - } - - // How much time does this request have, from its deadline start, to the effective deadline. 
- if (deadline.isDefined()) { - long deadlineEndTimeMillis = deadline.get().deadline().inMillis(); - long deadlineStartTimeMillis = deadline.get().timestamp().inMillis(); - long finagleDeadlineTimeMillis = deadlineEndTimeMillis - deadlineStartTimeMillis; - finagleDeadlineStats.getUnchecked(clientId).timerIncrement(finagleDeadlineTimeMillis); - } else { - finagleDeadlineNotSetStats.getUnchecked(clientId).increment(); - } - - // Explicitly track when both are not set. - if (requestTimeoutMillis <= 0 && deadline.isEmpty()) { - finagleDeadlineAndRequestTimeoutNotSetStats.getUnchecked(clientId).increment(); - } - - // If both timeout and the deadline are set, track how much over / under we are, when - // comparing the deadline, and the EarlybirdRequest timeout. - if (requestTimeoutMillis > 0 && deadline.isDefined()) { - long deadlineEndTimeMillis = deadline.get().deadline().inMillis(); - Preconditions.checkState(request.isSetClientRequestTimeMs(), - "Expect ClientRequestTimeFilter to always set the clientRequestTimeMs field. Request: %s", - request); - long requestStartTimeMillis = request.getClientRequestTimeMs(); - long requestEndTimeMillis = requestStartTimeMillis + requestTimeoutMillis; - - long deadlineDiffMillis = deadlineEndTimeMillis - requestEndTimeMillis; - if (deadlineDiffMillis >= 0) { - deadlineLargerStats.getUnchecked(clientId).timerIncrement(deadlineDiffMillis); - } else { - // Track "deadline is smaller" as positive values. - deadlineSmallerStats.getUnchecked(clientId).timerIncrement(-deadlineDiffMillis); - } - } - - return service.apply(requestContext); - } - - private long getRequestTimeout(EarlybirdRequest request) { - if (request.isSetSearchQuery() - && request.getSearchQuery().isSetCollectorParams() - && request.getSearchQuery().getCollectorParams().isSetTerminationParams() - && request.getSearchQuery().getCollectorParams().getTerminationParams().isSetTimeoutMs()) { - - return request.getSearchQuery().getCollectorParams().getTerminationParams().getTimeoutMs(); - } else if (request.isSetTimeoutMs()) { - return request.getTimeoutMs(); - } else { - return -1; - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/DisableClientByTierFilter.java b/src/java/com/twitter/search/earlybird_root/filters/DisableClientByTierFilter.java deleted file mode 100644 index 299d89d0f..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/DisableClientByTierFilter.java +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Optional; - -import javax.inject.Inject; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.quota.ClientIdQuotaManager; -import com.twitter.search.earlybird_root.quota.QuotaInfo; -import com.twitter.util.Future; - -public class DisableClientByTierFilter extends SimpleFilter { - private static final String CLIENT_BLOCKED_RESPONSE_PATTERN = - "Requests of client %s are blocked due to %s disable"; - - private final 
SearchDecider decider; - private final ClientIdQuotaManager quotaManager; - - /** - * Construct the filter by using ClientIdQuotaManager - */ - @Inject - public DisableClientByTierFilter(ClientIdQuotaManager quotaManager, SearchDecider decider) { - this.quotaManager = Preconditions.checkNotNull(quotaManager); - this.decider = decider; - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - String clientId = ClientIdUtil.getClientIdFromRequest(request); - Optional quotaInfoOptional = quotaManager.getQuotaForClient(clientId); - QuotaInfo quotaInfo = quotaInfoOptional.orElseGet(quotaManager::getCommonPoolQuota); - // Tier value should exist: if client's tier value not in config file, it will be - // set to "no_tier" by default in ConfigBasedQuotaConfig - String tier = quotaInfo.getClientTier(); - - Preconditions.checkNotNull(tier); - - if (decider.isAvailable("superroot_unavailable_for_" + tier + "_clients")) { - return Future.value(getClientBlockedResponse(clientId, tier)); - } else { - return service.apply(request); - } - } - - private static EarlybirdResponse getClientBlockedResponse(String clientId, String tier) { - return new EarlybirdResponse(EarlybirdResponseCode.CLIENT_BLOCKED_BY_TIER_ERROR, 0) - .setSearchResults(new ThriftSearchResults() - .setResults(Lists.newArrayList())) - .setDebugString(String.format(CLIENT_BLOCKED_RESPONSE_PATTERN, clientId, tier)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/DropAllProtectedOperatorFilter.java b/src/java/com/twitter/search/earlybird_root/filters/DropAllProtectedOperatorFilter.java deleted file mode 100644 index f7703b58c..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/DropAllProtectedOperatorFilter.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -import com.google.common.annotations.VisibleForTesting; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.visitors.DropAllProtectedOperatorVisitor; -import com.twitter.util.Future; - -public class DropAllProtectedOperatorFilter - extends SimpleFilter { - private static final Logger LOG = - LoggerFactory.getLogger(DropAllProtectedOperatorFilter.class); - private static final SearchCounter QUERY_PARSER_FAILURE_COUNTER = - SearchCounter.export("protected_operator_filter_query_parser_failure_count"); - @VisibleForTesting - static final SearchCounter TOTAL_REQUESTS_COUNTER = - SearchCounter.export("drop_all_protected_operator_filter_total"); - @VisibleForTesting - static final SearchCounter OPERATOR_DROPPED_REQUESTS_COUNTER = - SearchCounter.export("drop_all_protected_operator_filter_operator_dropped"); - - private final DropAllProtectedOperatorVisitor dropProtectedOperatorVisitor; - - @Inject - public DropAllProtectedOperatorFilter( - DropAllProtectedOperatorVisitor dropProtectedOperatorVisitor - ) { - this.dropProtectedOperatorVisitor = dropProtectedOperatorVisitor; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - TOTAL_REQUESTS_COUNTER.increment(); - Query 
query = requestContext.getParsedQuery(); - if (query == null) { - return service.apply(requestContext); - } - - Query processedQuery = query; - try { - processedQuery = query.accept(dropProtectedOperatorVisitor); - } catch (QueryParserException e) { - // this should not happen since we already have a parsed query - QUERY_PARSER_FAILURE_COUNTER.increment(); - LOG.warn( - "Failed to drop protected operator for serialized query: " + query.serialize(), e); - } - - if (processedQuery == query) { - return service.apply(requestContext); - } else { - OPERATOR_DROPPED_REQUESTS_COUNTER.increment(); - EarlybirdRequestContext clonedRequestContext = - EarlybirdRequestContext.copyRequestContext(requestContext, processedQuery); - return service.apply(clonedRequestContext); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdClusterAvailableFilter.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdClusterAvailableFilter.java deleted file mode 100644 index 87b304a48..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdClusterAvailableFilter.java +++ /dev/null @@ -1,85 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Collections; -import java.util.Map; - -import javax.inject.Inject; - -import com.google.common.collect.Maps; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Future; - -/** - * A Finagle filter that determines if a certain cluster is available to the SuperRoot. - * - * Normally, all clusters should be available. However, if there's a problem with our systems, and - * our search clusters are causing issues for other services (time outs, for example), then we might - * want to be disable them, and return errors to our clients. - */ -public class EarlybirdClusterAvailableFilter - extends SimpleFilter { - private final SearchDecider decider; - private final EarlybirdCluster cluster; - private final String allRequestsDeciderKey; - private final Map requestTypeDeciderKeys; - private final Map disabledRequests; - - /** - * Creates a new EarlybirdClusterAvailableFilter instance. - * - * @param decider The decider to use to determine if this cluster is available. - * @param cluster The cluster. 
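The drop-protected-operator filter above only forwards a cloned request context when the visitor actually changed the parsed query; otherwise the original context is passed through untouched. A minimal standalone sketch of that copy-on-change check, with plain strings standing in for the query tree (an illustration, not the real Query API):

```java
import java.util.function.UnaryOperator;

/** Standalone sketch of the copy-on-change pattern; plain strings stand in for the parsed query tree. */
public final class CopyOnChangeSketch {
  static String maybeRewrite(String query, UnaryOperator<String> dropVisitor) {
    String rewritten = dropVisitor.apply(query);
    // If the rewrite is a no-op, return the original object so no copy is made downstream.
    return rewritten.equals(query) ? query : rewritten;
  }

  public static void main(String[] args) {
    String original = "(cat [filter protected])";
    String unchanged = maybeRewrite(original, q -> q);
    String dropped = maybeRewrite(original, q -> q.replace(" [filter protected]", ""));
    System.out.println(unchanged == original); // true: same instance, nothing was copied
    System.out.println(dropped);               // (cat)
  }
}
```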
- */ - @Inject - public EarlybirdClusterAvailableFilter(SearchDecider decider, EarlybirdCluster cluster) { - this.decider = decider; - this.cluster = cluster; - - String clusterName = cluster.getNameForStats(); - this.allRequestsDeciderKey = "superroot_" + clusterName + "_cluster_available_for_all_requests"; - - Map tempDeciderKeys = Maps.newEnumMap(EarlybirdRequestType.class); - Map tempCounters = - Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - String requestTypeName = requestType.getNormalizedName(); - tempDeciderKeys.put(requestType, "superroot_" + clusterName + "_cluster_available_for_" - + requestTypeName + "_requests"); - tempCounters.put(requestType, SearchCounter.export( - "cluster_available_filter_" + clusterName + "_" - + requestTypeName + "_disabled_requests")); - } - requestTypeDeciderKeys = Collections.unmodifiableMap(tempDeciderKeys); - disabledRequests = Collections.unmodifiableMap(tempCounters); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - EarlybirdRequestType requestType = requestContext.getEarlybirdRequestType(); - if (!decider.isAvailable(allRequestsDeciderKey) - || !decider.isAvailable(requestTypeDeciderKeys.get(requestType))) { - disabledRequests.get(requestType).increment(); - return Future.value( - errorResponse("The " + cluster.getNameForStats() + " cluster is not available for " - + requestType.getNormalizedName() + " requests.")); - } - - return service.apply(requestContext); - } - - private EarlybirdResponse errorResponse(String debugMessage) { - return new EarlybirdResponse(EarlybirdResponseCode.PERSISTENT_ERROR, 0) - .setDebugString(debugMessage); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdFeatureSchemaAnnotateFilter.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdFeatureSchemaAnnotateFilter.java deleted file mode 100644 index f1034a514..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdFeatureSchemaAnnotateFilter.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.List; -import javax.inject.Inject; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.features.thrift.ThriftSearchFeatureSchemaSpecifier; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -public class EarlybirdFeatureSchemaAnnotateFilter - extends SimpleFilter { - - private final EarlybirdFeatureSchemaMerger schemaMerger; - - @Inject - public EarlybirdFeatureSchemaAnnotateFilter(EarlybirdFeatureSchemaMerger merger) { - this.schemaMerger = merger; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - return service.apply(annotateRequestContext(requestContext)); - } - - /** - * Annotate the request to indicate the available features schemas before sending to earlybird. 
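The cluster-availability filter precomputes one decider key and one counter per request type at construction time, so the per-request path is a couple of map lookups. A self-contained sketch of that key layout, using a hypothetical cluster name and a simplified stand-in for EarlybirdRequestType:

```java
import java.util.EnumMap;
import java.util.Map;

/** Sketch: precompute one decider key per request type for a given cluster. */
public final class ClusterDeciderKeysSketch {
  // Simplified stand-in for EarlybirdRequestType; the names are illustrative only.
  enum RequestType { RECENCY, RELEVANCE, TERM_STATS, FACETS, TOP_TWEETS }

  public static void main(String[] args) {
    String clusterName = "realtime"; // hypothetical cluster name used in the key
    Map<RequestType, String> deciderKeys = new EnumMap<>(RequestType.class);
    for (RequestType type : RequestType.values()) {
      deciderKeys.put(type, "superroot_" + clusterName + "_cluster_available_for_"
          + type.name().toLowerCase() + "_requests");
    }
    for (Map.Entry<RequestType, String> entry : deciderKeys.entrySet()) {
      System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
  }
}
```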
- * - * @param requestContext the earlybird request context - */ - private EarlybirdRequestContext annotateRequestContext(EarlybirdRequestContext requestContext) { - EarlybirdRequest request = requestContext.getRequest(); - if (request.isSetSearchQuery() - && request.getSearchQuery().isSetResultMetadataOptions() - && request.getSearchQuery().getResultMetadataOptions().isReturnSearchResultFeatures()) { - // Remember the available client side cached features schema in the context and prepare to - // reset it something new. - List featureSchemasAvailableInClient = - request.getSearchQuery().getResultMetadataOptions().getFeatureSchemasAvailableInClient(); - - return EarlybirdRequestContext.newContext( - request, - requestContext, - schemaMerger.getAvailableSchemaList(), // Set the available feature schemas based on - // what is cached in the current root. - featureSchemasAvailableInClient); - } else { - return requestContext; - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdResponseExceptionHandler.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdResponseExceptionHandler.java deleted file mode 100644 index a22d18f9f..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdResponseExceptionHandler.java +++ /dev/null @@ -1,108 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.HashMap; -import java.util.Map; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.ClientErrorException; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** Converts exceptions into EarlybirdResponses with error codes. */ -public class EarlybirdResponseExceptionHandler { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdResponseExceptionHandler.class); - - private final Map requestTypeToCancelledExceptions - = new HashMap<>(); - private final Map requestTypeToTimeoutExceptions - = new HashMap<>(); - private final Map requestTypeToPersistentErrors - = new HashMap<>(); - private final SearchCounter cancelledExceptions; - private final SearchCounter timeoutExceptions; - private final SearchCounter persistentErrors; - - /** - * Creates a new top level filter for handling exceptions. 
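The feature-schema annotate filter only touches requests that ask for search result features; everything else passes through unchanged. A rough, hedged sketch of that conditional negotiation, with plain string lists standing in for the Thrift schema specifiers:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch: only annotate the request with schema info when the client asked for result features. */
public final class SchemaAnnotationSketch {
  static List<String> schemasToSend(boolean returnSearchResultFeatures,
                                    List<String> schemasCachedByClient,
                                    List<String> schemasAvailableInRoot) {
    if (!returnSearchResultFeatures) {
      return Collections.emptyList(); // nothing to negotiate when features were not requested
    }
    // The root advertises what it has cached; the client's own cache is remembered so the
    // merger can later avoid re-sending schemas the client already holds.
    System.out.println("client already caches: " + schemasCachedByClient);
    return schemasAvailableInRoot;
  }

  public static void main(String[] args) {
    List<String> advertised = schemasToSend(
        true,
        Arrays.asList("schema-v7"),
        Arrays.asList("schema-v7", "schema-v8"));
    System.out.println("advertising to earlybird: " + advertised);
  }
}
```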
- */ - public EarlybirdResponseExceptionHandler(String statPrefix) { - this.cancelledExceptions = SearchCounter.export( - statPrefix + "_exception_handler_cancelled_exceptions"); - this.timeoutExceptions = SearchCounter.export( - statPrefix + "_exception_handler_timeout_exceptions"); - this.persistentErrors = SearchCounter.export( - statPrefix + "_exception_handler_persistent_errors"); - - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - String requestTypeNormalized = requestType.getNormalizedName(); - requestTypeToCancelledExceptions.put(requestType, - SearchCounter.export( - statPrefix + "_exception_handler_cancelled_exceptions_" - + requestTypeNormalized)); - requestTypeToTimeoutExceptions.put(requestType, - SearchCounter.export( - statPrefix + "_exception_handler_timeout_exceptions_" - + requestTypeNormalized)); - requestTypeToPersistentErrors.put(requestType, - SearchCounter.export( - statPrefix + "_exception_handler_persistent_errors_" - + requestTypeNormalized)); - } - } - - /** - * If {@code responseFuture} is wraps an exception, converts it to an EarlybirdResponse instance - * with an appropriate error code. - * - * @param request The earlybird request. - * @param responseFuture The response future. - */ - public Future handleException(final EarlybirdRequest request, - Future responseFuture) { - return responseFuture.handle( - new Function() { - @Override - public EarlybirdResponse apply(Throwable t) { - if (t instanceof ClientErrorException) { - ClientErrorException clientExc = (ClientErrorException) t; - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.CLIENT_ERROR) - .setDebugString(clientExc.getMessage()); - } else if (FinagleUtil.isCancelException(t)) { - requestTypeToCancelledExceptions.get(EarlybirdRequestType.of(request)) - .increment(); - cancelledExceptions.increment(); - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.CLIENT_CANCEL_ERROR) - .setDebugString(t.getMessage()); - } else if (FinagleUtil.isTimeoutException(t)) { - requestTypeToTimeoutExceptions.get(EarlybirdRequestType.of(request)) - .increment(); - timeoutExceptions.increment(); - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.SERVER_TIMEOUT_ERROR) - .setDebugString(t.getMessage()); - } else { - // Unexpected exception: log it. 
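The exception handler above boils down to a mapping from exception type to response code. A standalone sketch of that mapping, using JDK exception types as stand-ins for ClientErrorException and the Finagle cancel/timeout checks:

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.TimeoutException;

/** Sketch of the exception-to-response-code mapping, with simplified exception types. */
public final class ExceptionMappingSketch {
  enum ResponseCode { CLIENT_ERROR, CLIENT_CANCEL_ERROR, SERVER_TIMEOUT_ERROR, PERSISTENT_ERROR }

  static ResponseCode toResponseCode(Throwable t) {
    if (t instanceof IllegalArgumentException) {      // stand-in for ClientErrorException
      return ResponseCode.CLIENT_ERROR;
    } else if (t instanceof CancellationException) {  // stand-in for a Finagle cancel
      return ResponseCode.CLIENT_CANCEL_ERROR;
    } else if (t instanceof TimeoutException) {       // stand-in for a Finagle timeout
      return ResponseCode.SERVER_TIMEOUT_ERROR;
    } else {
      return ResponseCode.PERSISTENT_ERROR;           // anything unexpected gets logged and counted
    }
  }

  public static void main(String[] args) {
    System.out.println(toResponseCode(new TimeoutException("backend too slow")));
    System.out.println(toResponseCode(new RuntimeException("boom")));
  }
}
```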
- LOG.error("Caught unexpected exception.", t); - - requestTypeToPersistentErrors.get(EarlybirdRequestType.of(request)) - .increment(); - persistentErrors.increment(); - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.PERSISTENT_ERROR) - .setDebugString(t.getMessage()); - } - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdSuccessfulResponseHandler.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdSuccessfulResponseHandler.java deleted file mode 100644 index 8c05d6609..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdSuccessfulResponseHandler.java +++ /dev/null @@ -1,54 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.search.common.clientstats.RequestCounters; -import com.twitter.search.common.clientstats.RequestCountersEventListener; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; - -import static com.twitter.search.common.util.earlybird.EarlybirdResponseUtil - .responseConsideredFailed; - - -/** - * Checks EarlybirdResponse's response to update stats. - */ -public final class EarlybirdSuccessfulResponseHandler - implements RequestCountersEventListener.SuccessfulResponseHandler { - - public static final EarlybirdSuccessfulResponseHandler INSTANCE = - new EarlybirdSuccessfulResponseHandler(); - - private EarlybirdSuccessfulResponseHandler() { } - - @Override - public void handleSuccessfulResponse( - EarlybirdResponse response, - RequestCounters requestCounters) { - - if (response == null) { - requestCounters.incrementRequestFailedCounter(); - return; - } - - if (response.getResponseCode() == EarlybirdResponseCode.CLIENT_CANCEL_ERROR) { - requestCounters.incrementRequestCancelCounter(); - } else if (response.getResponseCode() == EarlybirdResponseCode.SERVER_TIMEOUT_ERROR) { - requestCounters.incrementRequestTimedOutCounter(); - } else if (responseConsideredFailed(response.getResponseCode())) { - requestCounters.incrementRequestFailedCounter(); - } - - ThriftSearchResults results = response.getSearchResults(); - if (results != null) { - requestCounters.incrementResultCounter(results.getResultsSize()); - } - - ThriftTermStatisticsResults termStats = response.getTermStatisticsResults(); - if (termStats != null) { - requestCounters.incrementResultCounter(termStats.getTermResultsSize()); - } - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeFilterQueryRewriter.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeFilterQueryRewriter.java deleted file mode 100644 index 16b1f60f5..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeFilterQueryRewriter.java +++ /dev/null @@ -1,133 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Collections; -import java.util.List; -import java.util.Map; - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.config.ServingRange; -import 
com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.search.SearchOperator; - -/** - * Adds query filters that filter out tweets outside a tier's serving range. Two tiers might load - * the same timeslice, so if the filtering is not done, the two tiers might return duplicates. The - * mergers should know how to handle the duplicates, but this might decrease the number or the - * quality of the returned results. - */ -public class EarlybirdTimeFilterQueryRewriter { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdTimeFilterQueryRewriter.class); - - private static final Map NO_QUERY_COUNTS; - static { - final Map tempMap = - Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - tempMap.put(requestType, SearchCounter.export( - "time_filter_query_rewriter_" + requestType.getNormalizedName() + "_no_query_count")); - } - NO_QUERY_COUNTS = Collections.unmodifiableMap(tempMap); - } - - @VisibleForTesting - static final Map ADD_SINCE_ID_MAX_ID_DECIDER_KEY_MAP; - static { - final String ADD_SINCE_ID_MAX_ID_DECIDER_KEY_TEMPLATE = - "add_since_id_max_id_operators_to_%s_query"; - final Map tempMap = Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - tempMap.put( - requestType, - String.format(ADD_SINCE_ID_MAX_ID_DECIDER_KEY_TEMPLATE, requestType.getNormalizedName())); - } - ADD_SINCE_ID_MAX_ID_DECIDER_KEY_MAP = Collections.unmodifiableMap(tempMap); - } - - @VisibleForTesting - static final String ADD_SINCE_ID_MAX_ID_TO_NULL_SERIALIZED_QUERIES_DECIDER_KEY = - "add_since_id_max_id_operators_to_null_serialized_queries"; - - private final SearchDecider decider; - private final ServingRangeProvider servingRangeProvider; - - EarlybirdTimeFilterQueryRewriter( - ServingRangeProvider servingRangeProvider, - SearchDecider decider) { - - this.servingRangeProvider = servingRangeProvider; - this.decider = decider; - } - - /** - * Add maxId and sinceId fields to the serialized query. - * - * This must be done after calculating the IdTimeRanges to prevent interfering with calculating - * IdTimeRanges - */ - public EarlybirdRequestContext rewriteRequest(EarlybirdRequestContext requestContext) - throws QueryParserException { - Query q = requestContext.getParsedQuery(); - if (q == null) { - if (requestContext.getEarlybirdRequestType() != EarlybirdRequestType.TERM_STATS) { - LOG.warn("Received request without a parsed query: " + requestContext.getRequest()); - NO_QUERY_COUNTS.get(requestContext.getEarlybirdRequestType()).increment(); - } - - if (!decider.isAvailable(ADD_SINCE_ID_MAX_ID_TO_NULL_SERIALIZED_QUERIES_DECIDER_KEY)) { - return requestContext; - } - } - - return addOperators(requestContext, q); - } - - private EarlybirdRequestContext addOperators( - EarlybirdRequestContext requestContext, - @Nullable Query query) throws QueryParserException { - - // Add the SINCE_ID and MAX_ID operators only if the decider is enabled. 
- if (!decider.isAvailable( - ADD_SINCE_ID_MAX_ID_DECIDER_KEY_MAP.get(requestContext.getEarlybirdRequestType()))) { - return requestContext; - } - - // Note: can't recompute the search operators because the serving range changes in real time - // for the most recent tier. - ServingRange servingRange = servingRangeProvider.getServingRange( - requestContext, requestContext.useOverrideTierConfig()); - - long tierSinceId = servingRange.getServingRangeSinceId(); - SearchOperator sinceId = new SearchOperator(SearchOperator.Type.SINCE_ID, - Long.toString(tierSinceId)); - - long tierMaxId = servingRange.getServingRangeMaxId(); - SearchOperator maxId = new SearchOperator(SearchOperator.Type.MAX_ID, - Long.toString(tierMaxId)); - - List conjunctionChildren = (query == null) - ? Lists.newArrayList(sinceId, maxId) - : Lists.newArrayList(query, sinceId, maxId); - - Query restrictedQuery = new Conjunction(conjunctionChildren).simplify(); - - EarlybirdRequestContext copiedRequestContext = - EarlybirdRequestContext.copyRequestContext(requestContext, restrictedQuery); - - return copiedRequestContext; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeRangeFilter.java b/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeRangeFilter.java deleted file mode 100644 index bd5eda6de..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/EarlybirdTimeRangeFilter.java +++ /dev/null @@ -1,205 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Collections; -import java.util.Map; -import java.util.Optional; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.util.IdTimeRanges; -import com.twitter.util.Future; - -/** - * A Finagle filter used to filter requests to tiers. - * Parses serialized query on Earlybird request, and extracts since / until / since_id / max_id - * operators. This filter then tests whether the request overlaps with the given tier. If there - * is no overlap, an empty response is returned without actually forwarding the requests to the - * underlying service. 
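The time-filter rewriter restricts a query to a tier by AND-ing since_id and max_id bounds onto it, or by using the bounds alone when there is no parsed query. A toy sketch of that conjunction, with an illustrative serialized-operator syntax rather than the real query tree classes:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: restrict a query to a tier's serving range by AND-ing since_id / max_id bounds onto it. */
public final class ServingRangeRestrictionSketch {
  static String restrict(String serializedQuery, long sinceId, long maxId) {
    List<String> children = new ArrayList<>();
    if (serializedQuery != null) {
      children.add(serializedQuery);
    }
    children.add("[since_id " + sinceId + "]"); // illustrative operator syntax
    children.add("[max_id " + maxId + "]");
    return "(" + String.join(" ", children) + ")"; // conjunction of the original query and the bounds
  }

  public static void main(String[] args) {
    System.out.println(restrict("(cat dog)", 1500000000000000L, 1600000000000000L));
    System.out.println(restrict(null, 1L, 9L)); // no parsed query: range operators only
  }
}
```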
- */ -public class EarlybirdTimeRangeFilter extends - SimpleFilter { - - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdTimeRangeFilter.class); - - private static final EarlybirdResponse ERROR_RESPONSE = - new EarlybirdResponse(EarlybirdResponseCode.PERSISTENT_ERROR, 0) - .setSearchResults(new ThriftSearchResults()); - - private final ServingRangeProvider servingRangeProvider; - private final Optional queryRewriter; - - private static final Map FAILED_REQUESTS; - static { - final Map tempMap = - Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - tempMap.put(requestType, SearchCounter.export( - "time_range_filter_" + requestType.getNormalizedName() + "_failed_requests")); - } - FAILED_REQUESTS = Collections.unmodifiableMap(tempMap); - } - - public static EarlybirdTimeRangeFilter newTimeRangeFilterWithQueryRewriter( - ServingRangeProvider servingRangeProvider, - SearchDecider decider) { - - return new EarlybirdTimeRangeFilter(servingRangeProvider, - Optional.of(new EarlybirdTimeFilterQueryRewriter(servingRangeProvider, decider))); - } - - public static EarlybirdTimeRangeFilter newTimeRangeFilterWithoutQueryRewriter( - ServingRangeProvider servingRangeProvider) { - - return new EarlybirdTimeRangeFilter(servingRangeProvider, Optional.empty()); - } - - /** - * Construct a filter that avoids forwarding requests to unrelated tiers - * based on requests' since / until / since_id / max_id. - * @param provider Holds the boundary information. - */ - EarlybirdTimeRangeFilter( - ServingRangeProvider provider, - Optional rewriter) { - - this.servingRangeProvider = provider; - this.queryRewriter = rewriter; - } - - public ServingRangeProvider getServingRangeProvider() { - return servingRangeProvider; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - Query parsedQuery = requestContext.getParsedQuery(); - if (parsedQuery != null) { - // Only perform filtering if serialized query is set. - try { - IdTimeRanges queryRanges = IdTimeRanges.fromQuery(parsedQuery); - if (queryRanges == null) { - // No time ranges in query. - return issueServiceRequest(service, requestContext); - } - - ServingRange servingRange = - servingRangeProvider.getServingRange( - requestContext, requestContext.useOverrideTierConfig()); - - if (queryDoesNotOverlapWithServingRange(queryRanges, servingRange)) { - return Future.value(tierSkippedResponse(requestContext.getEarlybirdRequestType(), - servingRange)); - } else { - return issueServiceRequest(service, requestContext); - } - } catch (QueryParserException e) { - LOG.warn("Unable to get IdTimeRanges from query: " + parsedQuery.serialize()); - // The failure here is not due to a miss-formed query from the client, since we already - // were able to successfully get a parsed Query from the request. - // If we can't determine the time ranges, pass the query along to the tier, and just - // restrict it to the timeranges of the tier. - return issueServiceRequest(service, requestContext); - } - } else { - // There's no serialized query. Just pass through like an identity filter. - return issueServiceRequest(service, requestContext); - } - } - - private boolean queryDoesNotOverlapWithServingRange(IdTimeRanges queryRanges, - ServingRange servingRange) { - // As long as a query overlaps with the tier serving range on either side, - // the request is not filtered. I.e. 
we want to be conservative when doing this filtering, - // because it is just an optimization. We ignore the inclusiveness / exclusiveness of the - // boundaries. If the tier boundary and the query boundry happen to be the same, we do not - // filter the request. - return queryRanges.getSinceIDExclusive().or(0L) - > servingRange.getServingRangeMaxId() - || queryRanges.getMaxIDInclusive().or(Long.MAX_VALUE) - < servingRange.getServingRangeSinceId() - || queryRanges.getSinceTimeInclusive().or(0) - > servingRange.getServingRangeUntilTimeSecondsFromEpoch() - || queryRanges.getUntilTimeExclusive().or(Integer.MAX_VALUE) - < servingRange.getServingRangeSinceTimeSecondsFromEpoch(); - } - - private Future issueServiceRequest( - Service service, - EarlybirdRequestContext requestContext) { - - try { - EarlybirdRequestContext request = requestContext; - if (queryRewriter.isPresent()) { - request = queryRewriter.get().rewriteRequest(requestContext); - } - return service.apply(request); - } catch (QueryParserException e) { - FAILED_REQUESTS.get(requestContext.getEarlybirdRequestType()).increment(); - String msg = "Failed to add time filter operators"; - LOG.error(msg, e); - - // Note that in this case it is not clear whether the error is the client's fault or our - // fault, so we don't necessarily return a CLIENT_ERROR here. - // Currently this actually returns a PERSISTENT_ERROR. - if (requestContext.getRequest().getDebugMode() > 0) { - return Future.value( - ERROR_RESPONSE.deepCopy().setDebugString(msg + ": " + e.getMessage())); - } else { - return Future.value(ERROR_RESPONSE); - } - } - } - - /** - * Creates a tier skipped response, based on the given request type. - * - * For recency, relevance, facets and top tweets requests, this method returns a SUCCESS response - * with no search results and the minSearchedStatusID and maxSearchedStatusID appropriately set. - * For term stats response, it returns a TIER_SKIPPED response, but we need to revisit this. - * - * @param requestType The type of the request. - * @param servingRange The serving range of the tier that we're skipping. - */ - @VisibleForTesting - public static EarlybirdResponse tierSkippedResponse( - EarlybirdRequestType requestType, - ServingRange servingRange) { - String debugMessage = - "Tier skipped because it does not intersect with query time boundaries."; - if (requestType == EarlybirdRequestType.TERM_STATS) { - // If it's a term stats request, return a TIER_SKIPPED response for now. - // But we need to figure out the right thing to do here. - return new EarlybirdResponse(EarlybirdResponseCode.TIER_SKIPPED, 0) - .setDebugString(debugMessage); - } else { - // minIds in ServingRange instances are set to tierLowerBoundary - 1, because the - // since_id operator is exclusive. The max_id operator on the other hand is inclusive, - // so maxIds in ServingRange instances are also set to tierUpperBoundary - 1. - // Here we want both of them to be inclusive, so we need to increment the minId by 1. 
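The overlap test above is deliberately conservative: a tier is skipped only when the query's range lies entirely outside the tier's serving range, and boundary ties are resolved in favor of querying the tier. A simplified sketch of the ID-based part of that check:

```java
/** Sketch of the "query does not overlap serving range" test, reduced to the ID bounds. */
public final class ServingRangeOverlapSketch {
  static boolean doesNotOverlap(long querySinceId, long queryMaxId,
                                long servingSinceId, long servingMaxId) {
    // Skip the tier only when the query window is strictly outside the serving window.
    return querySinceId > servingMaxId || queryMaxId < servingSinceId;
  }

  public static void main(String[] args) {
    System.out.println(doesNotOverlap(900, 1000, 0, 500)); // true: query is newer than the tier
    System.out.println(doesNotOverlap(100, 1000, 0, 500)); // false: ranges intersect, query the tier
  }
}
```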
- return EarlybirdResponseUtil.tierSkippedRootResponse( - servingRange.getServingRangeSinceId() + 1, - servingRange.getServingRangeMaxId(), - debugMessage); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/FullArchiveProtectedOperatorFilter.java b/src/java/com/twitter/search/earlybird_root/filters/FullArchiveProtectedOperatorFilter.java deleted file mode 100644 index 778f118e4..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/FullArchiveProtectedOperatorFilter.java +++ /dev/null @@ -1,167 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.List; - -import javax.inject.Inject; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdDebugInfo; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryNodeUtils; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.visitors.DropAllProtectedOperatorVisitor; -import com.twitter.search.queryparser.visitors.QueryTreeIndex; -import com.twitter.util.Future; - -/** - * Full archive service filter validates requests with a protected operator, appends the - * '[exclude protected]' operator by default, and appends '[filter protected]' operator instead if - * 'getProtectedTweetsOnly' request param is set. A client error response is returned if any of the - * following rules is violated. - * 1. There is at most one 'protected' operator in the query. - * 2. If there is a 'protected' operator, it must be in the query root node. - * 3. The parent node of the 'protected' operator must not be negated and must be a conjunction. - * 4. If there is a positive 'protected' operator, 'followedUserIds' and 'searcherId' request - * params must be set. 
- */ -public class FullArchiveProtectedOperatorFilter extends - SimpleFilter { - private static final Logger LOG = - LoggerFactory.getLogger(FullArchiveProtectedOperatorFilter.class); - private static final SearchOperator EXCLUDE_PROTECTED_OPERATOR = - new SearchOperator(SearchOperator.Type.EXCLUDE, SearchOperatorConstants.PROTECTED); - private static final SearchOperator FILTER_PROTECTED_OPERATOR = - new SearchOperator(SearchOperator.Type.FILTER, SearchOperatorConstants.PROTECTED); - private static final SearchCounter QUERY_PARSER_FAILURE_COUNT = - SearchCounter.export("protected_operator_filter_query_parser_failure_count"); - - private final DropAllProtectedOperatorVisitor dropProtectedOperatorVisitor; - private final SearchDecider decider; - - @Inject - public FullArchiveProtectedOperatorFilter( - DropAllProtectedOperatorVisitor dropProtectedOperatorVisitor, - SearchDecider decider) { - this.dropProtectedOperatorVisitor = dropProtectedOperatorVisitor; - this.decider = decider; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - Query query = requestContext.getParsedQuery(); - if (query == null) { - return service.apply(requestContext); - } - - QueryTreeIndex queryTreeIndex = QueryTreeIndex.buildFor(query); - List nodeList = queryTreeIndex.getNodeList(); - // try to find a protected operator, returns error response if more than one protected - // operator is detected - SearchOperator protectedOperator = null; - for (Query node : nodeList) { - if (node instanceof SearchOperator) { - SearchOperator searchOp = (SearchOperator) node; - if (SearchOperatorConstants.PROTECTED.equals(searchOp.getOperand())) { - if (protectedOperator == null) { - protectedOperator = searchOp; - } else { - return createErrorResponse("Only one 'protected' operator is expected."); - } - } - } - } - - Query processedQuery; - if (protectedOperator == null) { - // no protected operator is detected, append '[exclude protected]' by default - processedQuery = QueryNodeUtils.appendAsConjunction(query, EXCLUDE_PROTECTED_OPERATOR); - } else { - // protected operator must be in the query root node - if (queryTreeIndex.getParentOf(protectedOperator) != query) { - return createErrorResponse("'protected' operator must be in the query root node"); - } - // the query node that contains protected operator must not be negated - if (query.mustNotOccur()) { - return createErrorResponse("The query node that contains a 'protected' operator must not" - + " be negated."); - } - // the query node that contains protected operator must be a conjunction - if (!query.isTypeOf(Query.QueryType.CONJUNCTION)) { - return createErrorResponse("The query node that contains a 'protected' operator must" - + " be a conjunction."); - } - // check the existence of 'followedUserIds' and 'searcherId' if it is a positive operator - if (isPositive(protectedOperator)) { - if (!validateRequestParam(requestContext.getRequest())) { - return createErrorResponse("'followedUserIds' and 'searcherId' are required " - + "by positive 'protected' operator."); - } - } - processedQuery = query; - } - // update processedQuery if 'getProtectedTweetsOnly' is set to true, it takes precedence over - // the existing protected operators - if (requestContext.getRequest().isGetProtectedTweetsOnly()) { - if (!validateRequestParam(requestContext.getRequest())) { - return createErrorResponse("'followedUserIds' and 'searcherId' are required " - + "when 'getProtectedTweetsOnly' is set to true."); - } - try { - processedQuery = 
processedQuery.accept(dropProtectedOperatorVisitor); - } catch (QueryParserException e) { - // this should not happen since we already have a parsed query - QUERY_PARSER_FAILURE_COUNT.increment(); - LOG.warn( - "Failed to drop protected operator for serialized query: " + query.serialize(), e); - } - processedQuery = - QueryNodeUtils.appendAsConjunction(processedQuery, FILTER_PROTECTED_OPERATOR); - } - - if (processedQuery == query) { - return service.apply(requestContext); - } else { - EarlybirdRequestContext clonedRequestContext = - EarlybirdRequestContext.copyRequestContext(requestContext, processedQuery); - return service.apply(clonedRequestContext); - } - } - - private boolean validateRequestParam(EarlybirdRequest request) { - List followedUserIds = request.followedUserIds; - Long searcherId = (request.searchQuery != null && request.searchQuery.isSetSearcherId()) - ? request.searchQuery.getSearcherId() : null; - return followedUserIds != null && !followedUserIds.isEmpty() && searcherId != null; - } - - private boolean isPositive(SearchOperator searchOp) { - boolean isNegateExclude = searchOp.mustNotOccur() - && searchOp.getOperatorType() == SearchOperator.Type.EXCLUDE; - boolean isPositive = !searchOp.mustNotOccur() - && (searchOp.getOperatorType() == SearchOperator.Type.INCLUDE - || searchOp.getOperatorType() == SearchOperator.Type.FILTER); - return isNegateExclude || isPositive; - } - - private Future createErrorResponse(String errorMsg) { - EarlybirdResponse response = new EarlybirdResponse(EarlybirdResponseCode.CLIENT_ERROR, 0); - response.setDebugInfo(new EarlybirdDebugInfo().setHost("full_archive_root")); - response.setDebugString(errorMsg); - return Future.value(response); - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/FullArchiveServingRangeProvider.java b/src/java/com/twitter/search/earlybird_root/filters/FullArchiveServingRangeProvider.java deleted file mode 100644 index e7ec96963..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/FullArchiveServingRangeProvider.java +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Date; -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.util.date.DateUtil; -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class FullArchiveServingRangeProvider implements ServingRangeProvider { - - public static final Date FULL_ARCHIVE_START_DATE = DateUtil.toDate(2006, 3, 21); - private static final int DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO = 48; - - private final SearchDecider decider; - private final String deciderKey; - - public FullArchiveServingRangeProvider( - SearchDecider decider, String deciderKey) { - this.decider = decider; - this.deciderKey = deciderKey; - } - - @Override - public ServingRange getServingRange( - final EarlybirdRequestContext requestContext, boolean useBoundaryOverride) { - return new ServingRange() { - @Override - public long getServingRangeSinceId() { - // we use 1 instead of 0, because the since_id operator is inclusive in earlybirds. - return 1L; - } - - @Override - public long getServingRangeMaxId() { - long servingRangeEndMillis = TimeUnit.HOURS.toMillis( - (decider.featureExists(deciderKey)) - ? 
decider.getAvailability(deciderKey) - : DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO); - - long boundaryTime = requestContext.getCreatedTimeMillis() - servingRangeEndMillis; - return SnowflakeIdParser.generateValidStatusId(boundaryTime, 0); - } - - @Override - public long getServingRangeSinceTimeSecondsFromEpoch() { - return FULL_ARCHIVE_START_DATE.getTime() / 1000; - } - - @Override - public long getServingRangeUntilTimeSecondsFromEpoch() { - long servingRangeEndMillis = TimeUnit.HOURS.toMillis( - (decider.featureExists(deciderKey)) - ? decider.getAvailability(deciderKey) - : DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO); - - long boundaryTime = requestContext.getCreatedTimeMillis() - servingRangeEndMillis; - return boundaryTime / 1000; - } - }; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/InitializeRequestContextFilter.java b/src/java/com/twitter/search/earlybird_root/filters/InitializeRequestContextFilter.java deleted file mode 100644 index a1f5bafa2..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/InitializeRequestContextFilter.java +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Filter; -import com.twitter.finagle.Service; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.QueryParsingUtils; -import com.twitter.search.earlybird_root.common.TwitterContextProvider; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.util.Future; - -/** - * Creates a new RequestContext from an EarlybirdRequest, and passes the RequestContext down to - * the rest of the filter/service chain. - */ -public class InitializeRequestContextFilter extends - Filter { - - @VisibleForTesting - static final SearchCounter FAILED_QUERY_PARSING = - SearchCounter.export("initialize_request_context_filter_query_parsing_failure"); - - private final SearchDecider decider; - private final TwitterContextProvider twitterContextProvider; - private final Clock clock; - - /** - * The constructor of the filter. 
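The full-archive serving range provider converts a "boundary hours ago" decider value into a status ID using the Snowflake ID layout. A hedged sketch of that conversion using the public Snowflake epoch and timestamp shift; the real SnowflakeIdParser also handles pre-Snowflake IDs, which this ignores:

```java
import java.util.concurrent.TimeUnit;

/** Sketch: derive an archive serving-range max ID from an "hours ago" boundary (Snowflake layout assumed). */
public final class ArchiveBoundarySketch {
  // Public Snowflake constants for tweet IDs: the custom epoch and the 22-bit timestamp shift.
  private static final long TWITTER_EPOCH_MILLIS = 1288834974657L;
  private static final int TIMESTAMP_SHIFT = 22;

  static long maxIdForBoundary(long nowMillis, long boundaryHoursAgo) {
    long boundaryMillis = nowMillis - TimeUnit.HOURS.toMillis(boundaryHoursAgo);
    // Smallest possible ID generated at the boundary time.
    return (boundaryMillis - TWITTER_EPOCH_MILLIS) << TIMESTAMP_SHIFT;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println("archive maxId (48h ago): " + maxIdForBoundary(now, 48));
  }
}
```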
- */ - @Inject - public InitializeRequestContextFilter(SearchDecider decider, - TwitterContextProvider twitterContextProvider, - Clock clock) { - this.decider = decider; - this.twitterContextProvider = twitterContextProvider; - this.clock = clock; - } - - @Override - public Future apply( - EarlybirdRequest request, - Service service) { - - EarlybirdRequestUtil.recordClientClockDiff(request); - - EarlybirdRequestContext requestContext; - try { - requestContext = EarlybirdRequestContext.newContext( - request, decider, twitterContextProvider.get(), clock); - } catch (QueryParserException e) { - FAILED_QUERY_PARSING.increment(); - return QueryParsingUtils.newClientErrorResponse(request, e); - } - - return service.apply(requestContext); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/IsUserProtectedMetadataTrackingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/IsUserProtectedMetadataTrackingFilter.java deleted file mode 100644 index 9d7b2c0fe..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/IsUserProtectedMetadataTrackingFilter.java +++ /dev/null @@ -1,80 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.EnumMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultExtraMetadata; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -/** - * Filter tracks the isUserProtected metadata stats returned from Earlybirds. 
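InitializeRequestContextFilter parses the query up front and short-circuits with a client error when parsing fails, rather than forwarding a bad request downstream. A standalone sketch of that parse-or-reject pattern, with a trivial stand-in parser:

```java
import java.util.Optional;

/** Sketch of the parse-or-client-error pattern used when building the request context. */
public final class ParseOrClientErrorSketch {
  static String handleRequest(String rawQuery) {
    Optional<String> parsed = parse(rawQuery);
    if (!parsed.isPresent()) {
      return "CLIENT_ERROR: could not parse query"; // counted and returned without calling the backend
    }
    return "forwarding parsed query: " + parsed.get();
  }

  private static Optional<String> parse(String rawQuery) {
    // Stand-in parser: reject obviously unbalanced parentheses.
    long open = rawQuery.chars().filter(c -> c == '(').count();
    long close = rawQuery.chars().filter(c -> c == ')').count();
    return open == close ? Optional.of(rawQuery.trim()) : Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(handleRequest("(cat dog)"));
    System.out.println(handleRequest("(cat dog"));
  }
}
```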
- */ -public class IsUserProtectedMetadataTrackingFilter - extends SimpleFilter { - private static final String COUNTER_PREFIX = "is_user_protected_metadata_count_filter_"; - @VisibleForTesting - final Map totalCounterByRequestTypeMap; - @VisibleForTesting - final Map isProtectedCounterByRequestTypeMap; - - public IsUserProtectedMetadataTrackingFilter() { - this.totalCounterByRequestTypeMap = new EnumMap<>(EarlybirdRequestType.class); - this.isProtectedCounterByRequestTypeMap = new EnumMap<>(EarlybirdRequestType.class); - for (EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - this.totalCounterByRequestTypeMap.put(requestType, - SearchCounter.export(COUNTER_PREFIX + requestType.getNormalizedName() + "_total")); - this.isProtectedCounterByRequestTypeMap.put(requestType, - SearchCounter.export(COUNTER_PREFIX + requestType.getNormalizedName() + "_is_protected")); - } - } - - @Override - public Future apply( - EarlybirdRequestContext request, - Service service) { - Future response = service.apply(request); - - EarlybirdRequestType requestType = request.getEarlybirdRequestType(); - response.addEventListener(new FutureEventListener() { - @Override - public void onSuccess(EarlybirdResponse response) { - if (!response.isSetSearchResults() || response.getSearchResults().getResults().isEmpty()) { - return; - } - List searchResults = response.getSearchResults().getResults(); - int totalCount = searchResults.size(); - int isUserProtectedCount = 0; - for (ThriftSearchResult searchResult : searchResults) { - if (searchResult.isSetMetadata() && searchResult.getMetadata().isSetExtraMetadata()) { - ThriftSearchResultExtraMetadata extraMetadata = - searchResult.getMetadata().getExtraMetadata(); - if (extraMetadata.isIsUserProtected()) { - isUserProtectedCount++; - } - } - } - IsUserProtectedMetadataTrackingFilter.this - .totalCounterByRequestTypeMap.get(requestType).add(totalCount); - IsUserProtectedMetadataTrackingFilter.this - .isProtectedCounterByRequestTypeMap.get(requestType).add(isUserProtectedCount); - } - - @Override - public void onFailure(Throwable cause) { } - }); - - return response; - } - -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/MarkTweetSourceFilter.java b/src/java/com/twitter/search/earlybird_root/filters/MarkTweetSourceFilter.java deleted file mode 100644 index 2a6321089..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/MarkTweetSourceFilter.java +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Function; -import com.twitter.util.Future; - -public class MarkTweetSourceFilter - extends SimpleFilter { - private final SearchCounter searchResultsNotSet; - - private final ThriftTweetSource tweetSource; - - public MarkTweetSourceFilter(ThriftTweetSource tweetSource) { - this.tweetSource = tweetSource; - searchResultsNotSet = SearchCounter.export( - tweetSource.name().toLowerCase() + "_mark_tweet_source_filter_search_results_not_set"); - } - - 
@Override - public Future apply( - final EarlybirdRequestContext requestContext, - Service service) { - return service.apply(requestContext).map(new Function() { - @Override - public EarlybirdResponse apply(EarlybirdResponse response) { - if (response.getResponseCode() == EarlybirdResponseCode.SUCCESS - && requestContext.getEarlybirdRequestType() != EarlybirdRequestType.TERM_STATS) { - if (!response.isSetSearchResults()) { - searchResultsNotSet.increment(); - } else { - for (ThriftSearchResult searchResult : response.getSearchResults().getResults()) { - searchResult.setTweetSource(tweetSource); - } - } - } - return response; - } - } - ); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/MetadataTrackingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/MetadataTrackingFilter.java deleted file mode 100644 index 8a1b29fc6..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/MetadataTrackingFilter.java +++ /dev/null @@ -1,119 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.List; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchMovingAverage; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadata; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -/** - * Filter that is tracking the engagement stats returned from Earlybirds. 
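MarkTweetSourceFilter stamps a tweet source onto every result by mapping over the response future. A sketch of the same idea using CompletableFuture in place of Finagle's Future, with an illustrative source name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch: tag every result in an async response with its source, as the mark-tweet-source filter does. */
public final class MarkSourceSketch {
  static final class Result {
    final long id;
    String source;
    Result(long id) { this.id = id; }
    @Override public String toString() { return id + ":" + source; }
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<List<Result>> response =
        CompletableFuture.completedFuture(Arrays.asList(new Result(1), new Result(2)));

    // Map over the future and stamp each result; analogous to Future.map in Finagle.
    CompletableFuture<List<Result>> marked = response.thenApply(results -> {
      results.forEach(r -> r.source = "REALTIME_CLUSTER"); // illustrative source name
      return results;
    });

    System.out.println(marked.get());
  }
}
```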
- */ -public class MetadataTrackingFilter extends SimpleFilter { - - private static final String SCORING_SIGNAL_STAT_PREFIX = "scoring_signal_"; - private static final String SCORE_STAT_PATTERN = "client_id_score_tracker_for_%s_x100"; - - @VisibleForTesting - static final SearchMovingAverage SCORING_SIGNAL_FAV_COUNT = - SearchMovingAverage.export(SCORING_SIGNAL_STAT_PREFIX + "fav_count"); - - @VisibleForTesting - static final SearchMovingAverage SCORING_SIGNAL_REPLY_COUNT = - SearchMovingAverage.export(SCORING_SIGNAL_STAT_PREFIX + "reply_count"); - - @VisibleForTesting - static final SearchMovingAverage SCORING_SIGNAL_RETWEET_COUNT = - SearchMovingAverage.export(SCORING_SIGNAL_STAT_PREFIX + "retweet_count"); - - @VisibleForTesting - static final LoadingCache CLIENT_SCORE_METRICS_LOADING_CACHE = - CacheBuilder.newBuilder().build(new CacheLoader() { - public SearchMovingAverage load(String clientId) { - return SearchMovingAverage.export(String.format(SCORE_STAT_PATTERN, clientId)); - } - }); - - @Override - public Future apply(final EarlybirdRequest request, - Service service) { - - Future response = service.apply(request); - - response.addEventListener(new FutureEventListener() { - @Override - public void onSuccess(EarlybirdResponse earlybirdResponse) { - EarlybirdRequestType type = EarlybirdRequestType.of(request); - - if (earlybirdResponse.responseCode == EarlybirdResponseCode.SUCCESS - && type == EarlybirdRequestType.RELEVANCE - && earlybirdResponse.isSetSearchResults() - && earlybirdResponse.getSearchResults().isSetResults()) { - - List searchResults = earlybirdResponse.getSearchResults() - .getResults(); - - long totalFavoriteAmount = 0; - long totalReplyAmount = 0; - long totalRetweetAmount = 0; - double totalScoreX100 = 0; - - for (ThriftSearchResult result : searchResults) { - if (!result.isSetMetadata()) { - continue; - } - - ThriftSearchResultMetadata metadata = result.getMetadata(); - - if (metadata.isSetFavCount()) { - totalFavoriteAmount += metadata.getFavCount(); - } - - if (metadata.isSetReplyCount()) { - totalReplyAmount += metadata.getReplyCount(); - } - - if (metadata.isSetRetweetCount()) { - totalRetweetAmount += metadata.getRetweetCount(); - } - - if (metadata.isSetScore()) { - // Scale up the score by 100 so that scores are at least 1 and visible on viz graph - totalScoreX100 += metadata.getScore() * 100; - } - } - - // We only count present engagement counts but report the full size of the search results. - // This means that we consider the missing counts as being 0. - SCORING_SIGNAL_FAV_COUNT.addSamples(totalFavoriteAmount, searchResults.size()); - SCORING_SIGNAL_REPLY_COUNT.addSamples(totalReplyAmount, searchResults.size()); - SCORING_SIGNAL_RETWEET_COUNT.addSamples(totalRetweetAmount, searchResults.size()); - // Export per client id average scores. 
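The engagement tracking above sums fav, reply, and retweet counts across results, treats missing counts as zero while reporting over the full result count, and scales scores by 100 so small averages stay visible on dashboards. A compact sketch of that aggregation:

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: aggregate engagement counts over results and scale scores by 100 for reporting. */
public final class EngagementAggregationSketch {
  static final class Metadata {
    final int favCount; final int replyCount; final int retweetCount; final double score;
    Metadata(int f, int r, int rt, double s) { favCount = f; replyCount = r; retweetCount = rt; score = s; }
  }

  public static void main(String[] args) {
    List<Metadata> results = Arrays.asList(
        new Metadata(10, 2, 3, 0.42),
        new Metadata(0, 0, 1, 0.08));

    long favs = 0; long replies = 0; long retweets = 0; double scoreX100 = 0;
    for (Metadata m : results) {
      favs += m.favCount;
      replies += m.replyCount;
      retweets += m.retweetCount;
      scoreX100 += m.score * 100; // scale so small scores remain visible on graphs
    }
    // Averages are taken over the full result count, so missing engagement counts act as zeros.
    System.out.printf("avg favs=%.2f avg score(x100)=%.2f%n",
        (double) favs / results.size(), scoreX100 / results.size());
  }
}
```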
- String requestClientId = ClientIdUtil.getClientIdFromRequest(request); - String quotaClientId = ClientIdUtil.getQuotaClientId(requestClientId); - CLIENT_SCORE_METRICS_LOADING_CACHE.getUnchecked(quotaClientId) - .addSamples((long) totalScoreX100, searchResults.size()); - } - } - - @Override - public void onFailure(Throwable cause) { } - }); - - return response; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/NamedMultiTermDisjunctionStatsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/NamedMultiTermDisjunctionStatsFilter.java deleted file mode 100644 index c75864124..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/NamedMultiTermDisjunctionStatsFilter.java +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.List; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.ConcurrentMap; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class NamedMultiTermDisjunctionStatsFilter extends - SimpleFilter { - - private static final String STAT_FORMAT = "named_disjunction_size_client_%s_key_%s"; - // ClientID -> disjunction name -> operand count - private static final ConcurrentMap>> - NAMED_MULTI_TERM_DISJUNCTION_IDS_COUNT = new ConcurrentHashMap<>(); - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - - if (request.getSearchQuery().isSetNamedDisjunctionMap()) { - for (Map.Entry> entry - : request.getSearchQuery().getNamedDisjunctionMap().entrySet()) { - - Map> statsForClient = - NAMED_MULTI_TERM_DISJUNCTION_IDS_COUNT.computeIfAbsent( - request.getClientId(), clientId -> new ConcurrentHashMap<>()); - Percentile stats = statsForClient.computeIfAbsent(entry.getKey(), - keyName -> PercentileUtil.createPercentile( - String.format(STAT_FORMAT, request.getClientId(), keyName))); - - stats.record(entry.getValue().size()); - } - } - - return service.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/NullcastTrackingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/NullcastTrackingFilter.java deleted file mode 100644 index d7003533d..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/NullcastTrackingFilter.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.HashSet; -import java.util.Set; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableSet; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.visitors.DetectPositiveOperatorVisitor; - -/** - * Filter that is tracking the unexpected nullcast results from Earlybirds. 
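The nullcast tracking filter only flags nullcast results that the client did not explicitly ask to score, since scoring requests may legitimately reference nullcast tweets by ID. A minimal sketch of that set difference:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch: flag nullcast results as unexpected unless their IDs were explicitly requested for scoring. */
public final class UnexpectedNullcastSketch {
  public static void main(String[] args) {
    Set<Long> nullcastResultIds = new HashSet<>(Arrays.asList(11L, 12L, 13L));
    Set<Long> explicitlyRequestedIds = new HashSet<>(Arrays.asList(12L));

    Set<Long> unexpected = new HashSet<>(nullcastResultIds);
    unexpected.removeAll(explicitlyRequestedIds); // requested IDs are allowed to be nullcast
    System.out.println("unexpected nullcast ids: " + unexpected); // [11, 13]
  }
}
```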
- */ -public class NullcastTrackingFilter extends SensitiveResultsTrackingFilter { - public NullcastTrackingFilter() { - super("unexpected nullcast tweets", true); - } - - private static final Logger LOG = LoggerFactory.getLogger(NullcastTrackingFilter.class); - - @VisibleForTesting - static final SearchCounter BAD_NULLCAST_QUERY_COUNT = - SearchCounter.export("unexpected_nullcast_query_count"); - - @VisibleForTesting - static final SearchCounter BAD_NULLCAST_RESULT_COUNT = - SearchCounter.export("unexpected_nullcast_result_count"); - - @Override - protected Logger getLogger() { - return LOG; - } - - @Override - protected SearchCounter getSensitiveQueryCounter() { - return BAD_NULLCAST_QUERY_COUNT; - } - - @Override - protected SearchCounter getSensitiveResultsCounter() { - return BAD_NULLCAST_RESULT_COUNT; - } - - @Override - protected Set getSensitiveResults(EarlybirdRequestContext requestContext, - EarlybirdResponse earlybirdResponse) throws Exception { - if (!requestContext.getParsedQuery().accept( - new DetectPositiveOperatorVisitor(SearchOperatorConstants.NULLCAST))) { - return EarlybirdResponseUtil.findUnexpectedNullcastStatusIds( - earlybirdResponse.getSearchResults(), requestContext.getRequest()); - } else { - return new HashSet<>(); - } - } - - /** - * Some Earlybird requests are not searches, instead, they are scoring requests. - * These requests supply a list of IDs to be scored. - * It is OK to return nullcast tweet result if the ID is supplied in the request. - * This extracts the scoring request tweet IDs. - */ - @Override - protected Set getExceptedResults(EarlybirdRequestContext requestContext) { - EarlybirdRequest request = requestContext.getRequest(); - if (request == null - || !request.isSetSearchQuery() - || request.getSearchQuery().getSearchStatusIdsSize() == 0) { - return ImmutableSet.of(); - } - return request.getSearchQuery().getSearchStatusIds(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/PostCacheRequestTypeCountFilter.java b/src/java/com/twitter/search/earlybird_root/filters/PostCacheRequestTypeCountFilter.java deleted file mode 100644 index d83fd1227..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/PostCacheRequestTypeCountFilter.java +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -public class PostCacheRequestTypeCountFilter extends RequestTypeCountFilter { - @Inject - public PostCacheRequestTypeCountFilter() { - super("post_cache"); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/PreCacheRequestTypeCountFilter.java b/src/java/com/twitter/search/earlybird_root/filters/PreCacheRequestTypeCountFilter.java deleted file mode 100644 index e5d2b00c7..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/PreCacheRequestTypeCountFilter.java +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -public class PreCacheRequestTypeCountFilter extends RequestTypeCountFilter { - @Inject - public PreCacheRequestTypeCountFilter() { - super("pre_cache"); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/QueryLangStatFilter.java b/src/java/com/twitter/search/earlybird_root/filters/QueryLangStatFilter.java deleted file mode 100644 index dbbc3d23a..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/QueryLangStatFilter.java +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Map; 
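The query-language stat filter that follows resolves a language by falling back from the query language to the UI language, then to the highest-confidence user language, and finally to "undetermined". A standalone sketch of that fallback chain, where "und" stands in for LocaleUtil.UNDETERMINED_LANGUAGE:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the language fallback chain: query lang, UI lang, best user lang, else "und". */
public final class QueryLangFallbackSketch {
  static String resolveLang(String queryLang, String uiLang, Map<String, Double> userLangs) {
    if (queryLang != null) {
      return queryLang;
    }
    if (uiLang != null) {
      return uiLang;
    }
    String best = null;
    double bestConfidence = Double.MIN_VALUE;
    for (Map.Entry<String, Double> entry : userLangs.entrySet()) {
      if (entry.getValue() > bestConfidence) {
        best = entry.getKey();
        bestConfidence = entry.getValue();
      }
    }
    return best != null ? best : "und"; // stand-in for the undetermined-language constant
  }

  public static void main(String[] args) {
    Map<String, Double> userLangs = new HashMap<>();
    userLangs.put("en", 0.7);
    userLangs.put("ja", 0.9);
    System.out.println(resolveLang(null, null, userLangs)); // ja
  }
}
```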
-import java.util.concurrent.ConcurrentHashMap; -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.lang.ThriftLanguageUtil; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * Export stats for query languages. - */ -@Singleton -public class QueryLangStatFilter - extends SimpleFilter { - - public static class Config { - // We put a limit here in case an error in the client are sending us random lang codes. - private int maxNumberOfLangs; - - public Config(int maxNumberOfLangs) { - this.maxNumberOfLangs = maxNumberOfLangs; - } - - public int getMaxNumberOfLangs() { - return maxNumberOfLangs; - } - } - - @VisibleForTesting - protected static final String LANG_STATS_PREFIX = "num_queries_in_lang_"; - - private final Config config; - private final SearchCounter allCountsForLangsOverMaxNumLang = - SearchCounter.export(LANG_STATS_PREFIX + "overflow"); - - private final ConcurrentHashMap langCounters = - new ConcurrentHashMap<>(); - - @Inject - public QueryLangStatFilter(Config config) { - this.config = config; - } - - private SearchCounter getCounter(String lang) { - Preconditions.checkNotNull(lang); - - SearchCounter counter = langCounters.get(lang); - if (counter == null) { - if (langCounters.size() >= config.getMaxNumberOfLangs()) { - return allCountsForLangsOverMaxNumLang; - } - synchronized (langCounters) { // This double-checked locking is safe, - // since we're using a ConcurrentHashMap - counter = langCounters.get(lang); - if (counter == null) { - counter = SearchCounter.export(LANG_STATS_PREFIX + lang); - langCounters.put(lang, counter); - } - } - } - - return counter; - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - String lang = null; - - ThriftSearchQuery searchQuery = requestContext.getRequest().getSearchQuery(); - - lang = searchQuery.getQueryLang(); - - if (lang == null) { - // fallback to ui lang - lang = searchQuery.getUiLang(); - } - - if (lang == null && searchQuery.isSetUserLangs()) { - // fallback to the user lang with the highest confidence - double maxConfidence = Double.MIN_VALUE; - - for (Map.Entry entry : searchQuery.getUserLangs().entrySet()) { - if (entry.getValue() > maxConfidence) { - lang = ThriftLanguageUtil.getLanguageCodeOf(entry.getKey()); - maxConfidence = entry.getValue(); - } - } - } - - if (lang == null) { - lang = LocaleUtil.UNDETERMINED_LANGUAGE; - } - - getCounter(lang).increment(); - - return service.apply(requestContext); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/QueryOperatorStatFilter.java b/src/java/com/twitter/search/earlybird_root/filters/QueryOperatorStatFilter.java deleted file mode 100644 index 1b17299f9..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/QueryOperatorStatFilter.java +++ /dev/null @@ -1,194 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.EnumSet; -import java.util.Set; -import 
java.util.concurrent.TimeUnit; - -import scala.runtime.BoxedUnit; - -import com.google.common.collect.ImmutableMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.annotation.Annotation; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchOperatorConstants; -import com.twitter.search.queryparser.visitors.DetectAnnotationVisitor; -import com.twitter.search.queryparser.visitors.DetectVisitor; -import com.twitter.util.Future; - -/** - * For a given query, increments counters if that query has a number of search operators or - * annotations applied to it. Used to detect unusual traffic patterns. - */ -public class QueryOperatorStatFilter - extends SimpleFilter { - private static final Logger LOG = LoggerFactory.getLogger(QueryOperatorStatFilter.class); - - private final SearchCounter numQueryOperatorDetectionErrors = - SearchCounter.export("query_operator_detection_errors"); - - private final SearchCounter numQueryOperatorConsideredRequests = - SearchCounter.export("query_operator_requests_considered"); - - private final ImmutableMap filterOperatorStats; - - // Keeps track of the number of queries with a filter applied, whose type we don't care about. - private final SearchCounter numUnknownFilterOperatorRequests = - SearchCounter.export("query_operator_filter_unknown_requests"); - - private final ImmutableMap includeOperatorStats; - - // Keeps track of the number of queries with an include operator applied, whose type we don't - // know about. - private final SearchCounter numUnknownIncludeOperatorRequests = - SearchCounter.export("query_operator_include_unknown_requests"); - - private final ImmutableMap operatorTypeStats; - - private final SearchCounter numVariantRequests = - SearchCounter.export("query_operator_variant_requests"); - - /** - * Construct this QueryOperatorStatFilter by getting the complete set of possible filters a query - * might have and associating each with a counter. 
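 * For example (illustrative, assuming "images" appears in
 * SearchOperatorConstants.VALID_FILTER_OPERANDS): a query containing filter:images
 * would be timed under "query_operator_filter_images_requests" and under the
 * per-type stat "query_operator_filter_requests", while a filter operand outside
 * that set is counted under "query_operator_filter_unknown_requests" instead of a
 * per-operand timer.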
- */ - public QueryOperatorStatFilter() { - - ImmutableMap.Builder filterBuilder = new ImmutableMap.Builder<>(); - for (String operand : SearchOperatorConstants.VALID_FILTER_OPERANDS) { - filterBuilder.put( - operand, - SearchTimerStats.export( - "query_operator_filter_" + operand + "_requests", - TimeUnit.MILLISECONDS, - false, - true)); - } - filterOperatorStats = filterBuilder.build(); - - ImmutableMap.Builder includeBuilder = new ImmutableMap.Builder<>(); - for (String operand : SearchOperatorConstants.VALID_INCLUDE_OPERANDS) { - includeBuilder.put( - operand, - SearchTimerStats.export( - "query_operator_include_" + operand + "_requests", - TimeUnit.MILLISECONDS, - false, - true)); - } - includeOperatorStats = includeBuilder.build(); - - ImmutableMap.Builder operatorBuilder = - new ImmutableMap.Builder<>(); - for (SearchOperator.Type operatorType : SearchOperator.Type.values()) { - operatorBuilder.put( - operatorType, - SearchTimerStats.export( - "query_operator_" + operatorType.name().toLowerCase() + "_requests", - TimeUnit.MILLISECONDS, - false, - true - )); - } - operatorTypeStats = operatorBuilder.build(); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - numQueryOperatorConsideredRequests.increment(); - Query parsedQuery = requestContext.getParsedQuery(); - - if (parsedQuery == null) { - return service.apply(requestContext); - } - - SearchTimer timer = new SearchTimer(); - timer.start(); - - return service.apply(requestContext).ensure(() -> { - timer.stop(); - - try { - updateTimersForOperatorsAndOperands(parsedQuery, timer); - updateCountersIfVariantAnnotation(parsedQuery); - } catch (QueryParserException e) { - LOG.warn("Unable to test if query has operators defined", e); - numQueryOperatorDetectionErrors.increment(); - } - return BoxedUnit.UNIT; - }); - } - - /** - * Tracks request stats for operators and operands. - * - * @param parsedQuery the query to check. - */ - private void updateTimersForOperatorsAndOperands(Query parsedQuery, SearchTimer timer) - throws QueryParserException { - final DetectVisitor detectVisitor = new DetectVisitor(false, SearchOperator.Type.values()); - parsedQuery.accept(detectVisitor); - - Set detectedOperatorTypes = EnumSet.noneOf(SearchOperator.Type.class); - for (Query query : detectVisitor.getDetectedQueries()) { - // This detectVisitor only matches on SearchOperators. 
- SearchOperator operator = (SearchOperator) query; - SearchOperator.Type operatorType = operator.getOperatorType(); - detectedOperatorTypes.add(operatorType); - - if (operatorType == SearchOperator.Type.INCLUDE) { - updateOperandStats( - operator, - includeOperatorStats, - timer, - numUnknownIncludeOperatorRequests); - } - if (operatorType == SearchOperator.Type.FILTER) { - updateOperandStats( - operator, - filterOperatorStats, - timer, - numUnknownFilterOperatorRequests); - } - } - - for (SearchOperator.Type type : detectedOperatorTypes) { - operatorTypeStats.get(type).stoppedTimerIncrement(timer); - } - } - - private void updateOperandStats( - SearchOperator operator, - ImmutableMap operandRequestStats, - SearchTimer timer, - SearchCounter unknownOperandStat) { - String operand = operator.getOperand(); - SearchTimerStats stats = operandRequestStats.get(operand); - - if (stats != null) { - stats.stoppedTimerIncrement(timer); - } else { - unknownOperandStat.increment(); - } - } - - private void updateCountersIfVariantAnnotation(Query parsedQuery) throws QueryParserException { - DetectAnnotationVisitor visitor = new DetectAnnotationVisitor(Annotation.Type.VARIANT); - if (parsedQuery.accept(visitor)) { - numVariantRequests.increment(); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/QueryTokenizerFilter.java b/src/java/com/twitter/search/earlybird_root/filters/QueryTokenizerFilter.java deleted file mode 100644 index e7c8a2c54..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/QueryTokenizerFilter.java +++ /dev/null @@ -1,92 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.TimeUnit; -import javax.inject.Inject; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.common_internal.text.version.PenguinVersionConfig; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.finagle.tracing.Trace; -import com.twitter.finagle.tracing.Tracing; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimer; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.QueryParsingUtils; -import com.twitter.search.queryparser.parser.SerializedQueryParser; -import com.twitter.search.queryparser.parser.SerializedQueryParser.TokenizationOption; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.util.Duration; -import com.twitter.util.Future; - -public class QueryTokenizerFilter extends SimpleFilter { - private static final String PREFIX = "query_tokenizer_"; - private static final SearchRateCounter SUCCESS_COUNTER = - SearchRateCounter.export(PREFIX + "success"); - private static final SearchRateCounter FAILURE_COUNTER = - SearchRateCounter.export(PREFIX + "error"); - private static final SearchRateCounter SKIPPED_COUNTER = - SearchRateCounter.export(PREFIX + "skipped"); - private static final SearchTimerStats QUERY_TOKENIZER_TIME = - SearchTimerStats.export(PREFIX + "time", TimeUnit.MILLISECONDS, false); - - private final TokenizationOption tokenizationOption; - - @Inject - public QueryTokenizerFilter(PenguinVersionConfig penguinversions) { - PenguinVersion[] supportedVersions = penguinversions - 
.getSupportedVersions().toArray(new PenguinVersion[0]); - tokenizationOption = new TokenizationOption(true, supportedVersions); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - if (!requestContext.getRequest().isRetokenizeSerializedQuery() - || !requestContext.getRequest().isSetSearchQuery() - || !requestContext.getRequest().getSearchQuery().isSetSerializedQuery()) { - SKIPPED_COUNTER.increment(); - return service.apply(requestContext); - } - - SearchTimer timer = QUERY_TOKENIZER_TIME.startNewTimer(); - try { - String serializedQuery = requestContext.getRequest().getSearchQuery().getSerializedQuery(); - Query parsedQuery = reparseQuery(serializedQuery); - SUCCESS_COUNTER.increment(); - return service.apply(EarlybirdRequestContext.copyRequestContext(requestContext, parsedQuery)); - } catch (QueryParserException e) { - FAILURE_COUNTER.increment(); - return QueryParsingUtils.newClientErrorResponse(requestContext.getRequest(), e); - } finally { - long elapsed = timer.stop(); - QUERY_TOKENIZER_TIME.timerIncrement(elapsed); - Tracing trace = Trace.apply(); - if (trace.isActivelyTracing()) { - trace.record(PREFIX + "time", Duration.fromMilliseconds(elapsed)); - } - } - } - - public Query reparseQuery(String serializedQuery) throws QueryParserException { - SerializedQueryParser parser = new SerializedQueryParser(tokenizationOption); - return parser.parse(serializedQuery); - } - - /** - * Initializing the query parser can take many seconds. We initialize it at warmup so that - * requests don't time out after we join the serverset. SEARCH-28801 - */ - public void performExpensiveInitialization() throws QueryParserException { - SerializedQueryParser queryParser = new SerializedQueryParser(tokenizationOption); - - // The Korean query parser takes a few seconds on it's own to initialize. - String koreanQuery = "스포츠"; - queryParser.parse(koreanQuery); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RealtimeServingRangeProvider.java b/src/java/com/twitter/search/earlybird_root/filters/RealtimeServingRangeProvider.java deleted file mode 100644 index 856afc2bb..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RealtimeServingRangeProvider.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.TimeUnit; - -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public class RealtimeServingRangeProvider implements ServingRangeProvider { - - private static final int DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO = 240; - - private final SearchDecider decider; - private final String deciderKey; - - public RealtimeServingRangeProvider(SearchDecider decider, String deciderKey) { - this.decider = decider; - this.deciderKey = deciderKey; - } - - @Override - public ServingRange getServingRange( - final EarlybirdRequestContext requestContext, boolean useBoundaryOverride) { - return new ServingRange() { - @Override - public long getServingRangeSinceId() { - long servingRangeStartMillis = TimeUnit.HOURS.toMillis( - (decider.featureExists(deciderKey)) - ? 
decider.getAvailability(deciderKey) - : DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO); - - long boundaryTime = requestContext.getCreatedTimeMillis() - servingRangeStartMillis; - return SnowflakeIdParser.generateValidStatusId(boundaryTime, 0); - } - - @Override - public long getServingRangeMaxId() { - return SnowflakeIdParser.generateValidStatusId( - requestContext.getCreatedTimeMillis(), 0); - } - - @Override - public long getServingRangeSinceTimeSecondsFromEpoch() { - long servingRangeStartMillis = TimeUnit.HOURS.toMillis( - (decider.featureExists(deciderKey)) - ? decider.getAvailability(deciderKey) - : DEFAULT_SERVING_RANGE_BOUNDARY_HOURS_AGO); - - long boundaryTime = requestContext.getCreatedTimeMillis() - servingRangeStartMillis; - return boundaryTime / 1000; - } - - @Override - public long getServingRangeUntilTimeSecondsFromEpoch() { - return requestContext.getCreatedTimeMillis() / 1000; - } - }; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RejectRequestsByQuerySourceFilter.java b/src/java/com/twitter/search/earlybird_root/filters/RejectRequestsByQuerySourceFilter.java deleted file mode 100644 index fb346c7a1..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RejectRequestsByQuerySourceFilter.java +++ /dev/null @@ -1,94 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.HashMap; -import java.util.Map; -import javax.annotation.Nullable; -import javax.inject.Inject; - -import com.google.common.annotations.VisibleForTesting; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.constants.thriftjava.ThriftQuerySource; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.util.Future; - -/** - * Rejects requests based on the query source of the request. Intended to be used at super-root - * or archive-root. If used to reject client request at super-root, the client will get a response - * with empty results and a REQUEST_BLOCKED_ERROR status code. If used at archive-root the client - * will get a response which might contain some results from realtime and protected and the status - * code of the response will depend on how super-root combines responses from the three downstream - * roots. - */ -public class RejectRequestsByQuerySourceFilter extends - SimpleFilter { - - @VisibleForTesting - protected static final String NUM_REJECTED_REQUESTS_STAT_NAME_PATTERN = - "num_root_%s_rejected_requests_with_query_source_%s"; - @VisibleForTesting - protected static final String REJECT_REQUESTS_DECIDER_KEY_PATTERN = - "root_%s_reject_requests_with_query_source_%s"; - private final Map rejectedRequestsCounterPerQuerySource = - new HashMap<>(); - private final Map rejectRequestsDeciderKeyPerQuerySource = - new HashMap<>(); - private final SearchDecider searchDecider; - - - @Inject - public RejectRequestsByQuerySourceFilter( - @Nullable EarlybirdCluster cluster, - SearchDecider searchDecider) { - - this.searchDecider = searchDecider; - - String clusterName = cluster != null - ? 
cluster.getNameForStats() - : EarlybirdCluster.SUPERROOT.getNameForStats(); - - for (ThriftQuerySource querySource : ThriftQuerySource.values()) { - String querySourceName = querySource.name().toLowerCase(); - - rejectedRequestsCounterPerQuerySource.put(querySource, - SearchRateCounter.export( - String.format( - NUM_REJECTED_REQUESTS_STAT_NAME_PATTERN, clusterName, querySourceName))); - - rejectRequestsDeciderKeyPerQuerySource.put(querySource, - String.format( - REJECT_REQUESTS_DECIDER_KEY_PATTERN, clusterName, querySourceName)); - } - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - - ThriftQuerySource querySource = request.isSetQuerySource() - ? request.getQuerySource() - : ThriftQuerySource.UNKNOWN; - - String deciderKey = rejectRequestsDeciderKeyPerQuerySource.get(querySource); - if (searchDecider.isAvailable(deciderKey)) { - rejectedRequestsCounterPerQuerySource.get(querySource).increment(); - return Future.value(getRejectedRequestResponse(querySource, deciderKey)); - } - return service.apply(request); - } - - private static EarlybirdResponse getRejectedRequestResponse( - ThriftQuerySource querySource, String deciderKey) { - return new EarlybirdResponse(EarlybirdResponseCode.REQUEST_BLOCKED_ERROR, 0) - .setSearchResults(new ThriftSearchResults()) - .setDebugString(String.format( - "Request with query source %s is blocked by decider %s", querySource, deciderKey)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RequestContextToEarlybirdRequestFilter.java b/src/java/com/twitter/search/earlybird_root/filters/RequestContextToEarlybirdRequestFilter.java deleted file mode 100644 index 1059b4d30..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RequestContextToEarlybirdRequestFilter.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.TimeUnit; - -import com.twitter.finagle.Filter; -import com.twitter.finagle.Service; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * A filter for transforming a RequestContext to an EarlybirdRequest. 
- */ -public class RequestContextToEarlybirdRequestFilter extends - Filter { - - private static final SearchTimerStats REQUEST_CONTEXT_TRIP_TIME = - SearchTimerStats.export("request_context_trip_time", TimeUnit.MILLISECONDS, false, - true); - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - - long tripTime = System.currentTimeMillis() - requestContext.getCreatedTimeMillis(); - REQUEST_CONTEXT_TRIP_TIME.timerIncrement(tripTime); - - return service.apply(requestContext.getRequest()); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RequestResultStatsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/RequestResultStatsFilter.java deleted file mode 100644 index 95f0f44b5..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RequestResultStatsFilter.java +++ /dev/null @@ -1,185 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.List; -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; -import javax.inject.Inject; - -import scala.runtime.BoxedUnit; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.query.thriftjava.CollectorParams; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.snowflake.id.SnowflakeId; -import com.twitter.util.Function; -import com.twitter.util.Future; - -public class RequestResultStatsFilter - extends SimpleFilter { - private final Clock clock; - private final RequestResultStats stats; - - static class RequestResultStats { - private static final String PREFIX = "request_result_properties_"; - - private final SearchCounter resultsRequestedCount; - private final SearchCounter resultsReturnedCount; - private final SearchCounter maxHitsToProcessCount; - private final SearchCounter hitsProcessedCount; - private final SearchCounter docsProcessedCount; - private final SearchCounter timeoutMsCount; - private Map> requestedNumResultsPercentileByClientId; - private Map> returnedNumResultsPercentileByClientId; - private Map> oldestResultPercentileByClientId; - - RequestResultStats() { - // Request properties - resultsRequestedCount = SearchCounter.export(PREFIX + "results_requested_cnt"); - maxHitsToProcessCount = SearchCounter.export(PREFIX + "max_hits_to_process_cnt"); - timeoutMsCount = SearchCounter.export(PREFIX + "timeout_ms_cnt"); - requestedNumResultsPercentileByClientId = new ConcurrentHashMap<>(); - - // Result properties - resultsReturnedCount = SearchCounter.export(PREFIX + "results_returned_cnt"); - hitsProcessedCount = SearchCounter.export(PREFIX + "hits_processed_cnt"); - docsProcessedCount = SearchCounter.export(PREFIX + "docs_processed_cnt"); - returnedNumResultsPercentileByClientId = new ConcurrentHashMap<>(); - oldestResultPercentileByClientId = new ConcurrentHashMap<>(); - } - - SearchCounter getResultsRequestedCount() { - return 
resultsRequestedCount; - } - - SearchCounter getResultsReturnedCount() { - return resultsReturnedCount; - } - - SearchCounter getMaxHitsToProcessCount() { - return maxHitsToProcessCount; - } - - SearchCounter getHitsProcessedCount() { - return hitsProcessedCount; - } - - SearchCounter getDocsProcessedCount() { - return docsProcessedCount; - } - - SearchCounter getTimeoutMsCount() { - return timeoutMsCount; - } - - Percentile getOldestResultPercentile(String clientId) { - return oldestResultPercentileByClientId.computeIfAbsent(clientId, - key -> PercentileUtil.createPercentile(statName(clientId, "oldest_result_age_seconds"))); - } - - Percentile getRequestedNumResultsPercentile(String clientId) { - return requestedNumResultsPercentileByClientId.computeIfAbsent(clientId, - key -> PercentileUtil.createPercentile(statName(clientId, "requested_num_results"))); - } - - Percentile getReturnedNumResultsPercentile(String clientId) { - return returnedNumResultsPercentileByClientId.computeIfAbsent(clientId, - key -> PercentileUtil.createPercentile(statName(clientId, "returned_num_results"))); - } - - private String statName(String clientId, String suffix) { - return String.format("%s%s_%s", PREFIX, ClientIdUtil.formatClientId(clientId), suffix); - } - } - - @Inject - RequestResultStatsFilter(Clock clock, RequestResultStats stats) { - this.clock = clock; - this.stats = stats; - } - - private void updateRequestStats(EarlybirdRequest request) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - CollectorParams collectorParams = searchQuery.getCollectorParams(); - - if (collectorParams != null) { - stats.getResultsRequestedCount().add(collectorParams.numResultsToReturn); - if (request.isSetClientId()) { - stats.getRequestedNumResultsPercentile(request.getClientId()) - .record(collectorParams.numResultsToReturn); - } - CollectorTerminationParams terminationParams = collectorParams.getTerminationParams(); - if (terminationParams != null) { - if (terminationParams.isSetMaxHitsToProcess()) { - stats.getMaxHitsToProcessCount().add(terminationParams.maxHitsToProcess); - } - if (terminationParams.isSetTimeoutMs()) { - stats.getTimeoutMsCount().add(terminationParams.timeoutMs); - } - } - } else { - if (searchQuery.isSetNumResults()) { - stats.getResultsRequestedCount().add(searchQuery.numResults); - if (request.isSetClientId()) { - stats.getRequestedNumResultsPercentile(request.getClientId()) - .record(searchQuery.numResults); - } - } - if (searchQuery.isSetMaxHitsToProcess()) { - stats.getMaxHitsToProcessCount().add(searchQuery.maxHitsToProcess); - } - if (request.isSetTimeoutMs()) { - stats.getTimeoutMsCount().add(request.timeoutMs); - } - } - } - - private void updateResultsStats(String clientId, ThriftSearchResults results) { - stats.getResultsReturnedCount().add(results.getResultsSize()); - if (results.isSetNumHitsProcessed()) { - stats.getHitsProcessedCount().add(results.numHitsProcessed); - } - - if (clientId != null) { - if (results.getResultsSize() > 0) { - List resultsList = results.getResults(); - - long lastId = resultsList.get(resultsList.size() - 1).getId(); - long tweetTime = SnowflakeId.timeFromId(lastId).inLongSeconds(); - long tweetAge = (clock.nowMillis() / 1000) - tweetTime; - stats.getOldestResultPercentile(clientId).record(tweetAge); - } - - stats.getReturnedNumResultsPercentile(clientId).record(results.getResultsSize()); - } - } - - @Override - public Future apply( - EarlybirdRequest request, - Service service) { - - updateRequestStats(request); - - return 
service.apply(request).onSuccess( - new Function() { - @Override - public BoxedUnit apply(EarlybirdResponse response) { - if (response.isSetSearchResults()) { - updateResultsStats(request.getClientId(), response.searchResults); - } - return BoxedUnit.UNIT; - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RequestSuccessStatsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/RequestSuccessStatsFilter.java deleted file mode 100644 index 7a942d05a..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RequestSuccessStatsFilter.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.TimeUnit; -import javax.inject.Inject; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.root.RequestSuccessStats; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -import static com.twitter.search.common.util.earlybird.EarlybirdResponseUtil.responseConsideredFailed; - - -/** - * Records cancellations, timeouts, and failures for requests that do not go through - * ScatterGatherService (which also updates these stats, but for different requests). - */ -public class RequestSuccessStatsFilter - extends SimpleFilter { - - private final RequestSuccessStats stats; - - @Inject - RequestSuccessStatsFilter(RequestSuccessStats stats) { - this.stats = stats; - } - - - @Override - public Future apply( - EarlybirdRequest request, - Service service) { - - final long startTime = System.nanoTime(); - - return service.apply(request).addEventListener( - new FutureEventListener() { - @Override - public void onSuccess(EarlybirdResponse response) { - boolean success = true; - - if (response.getResponseCode() == EarlybirdResponseCode.CLIENT_CANCEL_ERROR) { - success = false; - stats.getCancelledRequestCount().increment(); - } else if (response.getResponseCode() == EarlybirdResponseCode.SERVER_TIMEOUT_ERROR) { - success = false; - stats.getTimedoutRequestCount().increment(); - } else if (responseConsideredFailed(response.getResponseCode())) { - success = false; - stats.getErroredRequestCount().increment(); - } - - long latencyNanos = System.nanoTime() - startTime; - stats.getRequestLatencyStats().requestComplete( - TimeUnit.NANOSECONDS.toMillis(latencyNanos), 0, success); - } - - @Override - public void onFailure(Throwable cause) { - long latencyNanos = System.nanoTime() - startTime; - stats.getRequestLatencyStats().requestComplete( - TimeUnit.NANOSECONDS.toMillis(latencyNanos), 0, false); - - if (FinagleUtil.isCancelException(cause)) { - stats.getCancelledRequestCount().increment(); - } else if (FinagleUtil.isTimeoutException(cause)) { - stats.getTimedoutRequestCount().increment(); - } else { - stats.getErroredRequestCount().increment(); - } - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/RequestTypeCountFilter.java b/src/java/com/twitter/search/earlybird_root/filters/RequestTypeCountFilter.java deleted file mode 100644 index 477a74ed4..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/RequestTypeCountFilter.java +++ /dev/null @@ -1,105 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import 
com.google.common.base.Preconditions; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; -import com.google.common.collect.ImmutableMap; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.clientstats.RequestCounters; -import com.twitter.search.common.clientstats.RequestCountersEventListener; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.util.Future; - -public class RequestTypeCountFilter - extends SimpleFilter { - private final ImmutableMap typeCounters; - private final RequestCounters allRequestTypesCounter; - private final ImmutableMap> - perTypePerClientCounters; - - /** - * Constructs the filter. - */ - public RequestTypeCountFilter(final String statSuffix) { - ImmutableMap.Builder perTypeBuilder = - ImmutableMap.builder(); - for (EarlybirdRequestType type : EarlybirdRequestType.values()) { - perTypeBuilder.put(type, new RequestCounters( - "request_type_count_filter_" + type.getNormalizedName() + "_" + statSuffix)); - } - typeCounters = perTypeBuilder.build(); - - allRequestTypesCounter = - new RequestCounters("request_type_count_filter_all_" + statSuffix, true); - - ImmutableMap.Builder> - perTypePerClientBuilder = ImmutableMap.builder(); - - // No point in setting any kind of expiration policy for the cache, since the stats will - // continue to be exported, so the objects will not be GCed anyway. - CacheBuilder cacheBuilder = CacheBuilder.newBuilder(); - for (final EarlybirdRequestType requestType : EarlybirdRequestType.values()) { - CacheLoader cacheLoader = - new CacheLoader() { - @Override - public RequestCounters load(String clientId) { - return new RequestCounters("request_type_count_filter_for_" + clientId + "_" - + requestType.getNormalizedName() + "_" + statSuffix); - } - }; - perTypePerClientBuilder.put(requestType, cacheBuilder.build(cacheLoader)); - } - perTypePerClientCounters = perTypePerClientBuilder.build(); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - EarlybirdRequestType requestType = requestContext.getEarlybirdRequestType(); - RequestCounters requestCounters = typeCounters.get(requestType); - Preconditions.checkNotNull(requestCounters); - - // Update the per-type and "all" counters. 
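// Illustrative example (hypothetical client, and assuming
// EarlybirdRequestType.RECENCY.getNormalizedName() returns "recency"): for a RECENCY
// request flowing through the "pre_cache" instance of this filter, the listeners below
// update "request_type_count_filter_recency_pre_cache",
// "request_type_count_filter_all_pre_cache", and a per-client counter named
// "request_type_count_filter_for_<formatted client id>_recency_pre_cache".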
- RequestCountersEventListener requestCountersEventListener = - new RequestCountersEventListener<>( - requestCounters, Clock.SYSTEM_CLOCK, EarlybirdSuccessfulResponseHandler.INSTANCE); - RequestCountersEventListener allRequestTypesEventListener = - new RequestCountersEventListener<>( - allRequestTypesCounter, Clock.SYSTEM_CLOCK, - EarlybirdSuccessfulResponseHandler.INSTANCE); - - RequestCountersEventListener perTypePerClientEventListener = - updatePerTypePerClientCountersListener(requestContext); - - return service.apply(requestContext) - .addEventListener(requestCountersEventListener) - .addEventListener(allRequestTypesEventListener) - .addEventListener(perTypePerClientEventListener); - } - - private RequestCountersEventListener updatePerTypePerClientCountersListener( - EarlybirdRequestContext earlybirdRequestContext) { - EarlybirdRequestType requestType = earlybirdRequestContext.getEarlybirdRequestType(); - LoadingCache perClientCounters = - perTypePerClientCounters.get(requestType); - Preconditions.checkNotNull(perClientCounters); - - String clientId = ClientIdUtil.formatFinagleClientIdAndClientId( - FinagleUtil.getFinagleClientName(), - ClientIdUtil.getClientIdFromRequest(earlybirdRequestContext.getRequest())); - RequestCounters clientCounters = perClientCounters.getUnchecked(clientId); - Preconditions.checkNotNull(clientCounters); - - return new RequestCountersEventListener<>( - clientCounters, Clock.SYSTEM_CLOCK, EarlybirdSuccessfulResponseHandler.INSTANCE); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ResponseCodeStatFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ResponseCodeStatFilter.java deleted file mode 100644 index 50fa78299..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ResponseCodeStatFilter.java +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Map; - -import com.google.common.collect.Maps; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -public class ResponseCodeStatFilter - extends SimpleFilter { - - private final Map responseCodeCounters; - - /** - * Create ResponseCodeStatFilter - */ - public ResponseCodeStatFilter() { - responseCodeCounters = Maps.newEnumMap(EarlybirdResponseCode.class); - for (EarlybirdResponseCode code : EarlybirdResponseCode.values()) { - SearchCounter stat = SearchCounter.export("response_code_" + code.name().toLowerCase()); - responseCodeCounters.put(code, stat); - } - } - - @Override - public Future apply( - final EarlybirdRequest request, - final Service service) { - - return service.apply(request).addEventListener( - new FutureEventListener() { - - @Override - public void onSuccess(final EarlybirdResponse response) { - responseCodeCounters.get(response.getResponseCode()).increment(); - } - - @Override - public void onFailure(final Throwable cause) { } - }); - - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ResultTierCountFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ResultTierCountFilter.java deleted file mode 100644 index 088ab07e7..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ResultTierCountFilter.java +++ 
/dev/null @@ -1,114 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Collection; -import java.util.Collections; -import java.util.Comparator; -import java.util.List; -import java.util.NavigableMap; - -import javax.inject.Inject; -import javax.inject.Singleton; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableSortedMap; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.earlybird.config.TierInfo; -import com.twitter.search.earlybird.config.TierInfoSource; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.snowflake.id.SnowflakeId; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -/** - * A filter to count the tier to which the oldest tweet in the results belong. - */ -@Singleton -public class ResultTierCountFilter - extends SimpleFilter { - - private static final String COUNTER_PREFIX = "result_tier_count"; - private final long firstTweetTimeSinceEpochSec; - private final NavigableMap tierBuckets; - private final SearchCounter allCounter = SearchCounter.export(COUNTER_PREFIX + "_all"); - private final SearchCounter noResultsCounter = - SearchCounter.export(COUNTER_PREFIX + "_no_results"); - - @Inject - @SuppressWarnings("unused") - ResultTierCountFilter(TierInfoSource tierInfoSource) { - List tierInfos = tierInfoSource.getTierInformation(); - tierInfos.sort(Comparator.comparing(TierInfo::getDataStartDate)); - - firstTweetTimeSinceEpochSec = tierInfos.get(0).getServingRangeSinceTimeSecondsFromEpoch(); - - ImmutableSortedMap.Builder builder = ImmutableSortedMap.naturalOrder(); - Collections.reverse(tierInfos); - - for (TierInfo tierInfo : tierInfos) { - SearchCounter searchCounter = SearchCounter.export( - String.format("%s_%s", COUNTER_PREFIX, tierInfo.getTierName())); - builder.put(tierInfo.getServingRangeSinceTimeSecondsFromEpoch(), searchCounter); - - // export cumulative metrics to sum from the latest to a lower tier - Collection counters = builder.build().values(); - SearchCustomGauge.export( - String.format("%s_down_to_%s", COUNTER_PREFIX, tierInfo.getTierName()), - () -> counters.stream() - .mapToLong(SearchCounter::get) - .sum()); - } - - tierBuckets = builder.build(); - } - - @Override - public Future apply( - EarlybirdRequestContext context, - Service service) { - return service.apply(context).addEventListener( - new FutureEventListener() { - @Override - public void onFailure(Throwable cause) { - // do nothing - } - - @Override - public void onSuccess(EarlybirdResponse response) { - record(response); - } - }); - } - - @VisibleForTesting - void record(EarlybirdResponse response) { - if (response.isSetSearchResults()) { - long minResultsStatusId = response.getSearchResults().getResults().stream() - .mapToLong(ThriftSearchResult::getId) - .min() - .orElse(-1); - getBucket(minResultsStatusId).increment(); - } - allCounter.increment(); - } - - private SearchCounter getBucket(long statusId) { - if (statusId < 0) { - return noResultsCounter; - } - - // If non-negative statusId is not a SnowflakeId, the tweet must have been created before - // Twepoch (2010-11-04T01:42:54Z) and thus belongs to full1. 
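// Illustrative example with hypothetical tier boundaries: if tierBuckets maps
// {full1StartSec -> "result_tier_count_full1", newerTierStartSec -> "result_tier_count_<tier>"},
// a snowflake status id minted after newerTierStartSec resolves through floorEntry()
// to the newer tier's counter, while a pre-Twepoch (non-snowflake) id falls back to
// firstTweetTimeSinceEpochSec and is counted against the oldest (full1) bucket.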
- long timeSinceEpochSec = firstTweetTimeSinceEpochSec; - if (SnowflakeId.isSnowflakeId(statusId)) { - timeSinceEpochSec = SnowflakeId.timeFromId(statusId).inSeconds(); - } - - return tierBuckets.floorEntry(timeSinceEpochSec).getValue(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ScatterGatherWithExperimentRedirectsService.java b/src/java/com/twitter/search/earlybird_root/filters/ScatterGatherWithExperimentRedirectsService.java deleted file mode 100644 index 179aa259e..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ScatterGatherWithExperimentRedirectsService.java +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Map; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.search.common.root.ScatterGatherService; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ExperimentCluster; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -public class ScatterGatherWithExperimentRedirectsService - extends Service { - private final Service - controlScatterGatherService; - - private final Map> - experimentScatterGatherServices; - - private static final Logger LOG = - LoggerFactory.getLogger(ScatterGatherWithExperimentRedirectsService.class); - - public ScatterGatherWithExperimentRedirectsService( - Service controlScatterGatherService, - Map> - experimentScatterGatherServices - ) { - this.controlScatterGatherService = controlScatterGatherService; - this.experimentScatterGatherServices = experimentScatterGatherServices; - } - - @Override - public Future apply(EarlybirdRequestContext request) { - if (request.getRequest().isSetExperimentClusterToUse()) { - ExperimentCluster cluster = request.getRequest().getExperimentClusterToUse(); - - if (!experimentScatterGatherServices.containsKey(cluster)) { - String error = String.format( - "Received invalid experiment cluster: %s", cluster.name()); - - LOG.error("{} Request: {}", error, request.getRequest()); - - return Future.value(new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.CLIENT_ERROR) - .setDebugString(error)); - } - - return experimentScatterGatherServices.get(cluster).apply(request); - } - - return controlScatterGatherService.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/SearchPayloadSizeLocalContextFilter.java b/src/java/com/twitter/search/earlybird_root/filters/SearchPayloadSizeLocalContextFilter.java deleted file mode 100644 index 0ce99bd42..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/SearchPayloadSizeLocalContextFilter.java +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.concurrent.atomic.AtomicReference; - -import scala.Option; - -import com.google.common.base.Preconditions; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.finagle.context.Contexts; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.root.SearchPayloadSizeFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -/** - * A filter that sets the clientId in the local context, to be usd later by 
SearchPayloadSizeFilter. - */ -public class SearchPayloadSizeLocalContextFilter - extends SimpleFilter { - private static final SearchCounter CLIENT_ID_CONTEXT_KEY_NOT_SET_COUNTER = SearchCounter.export( - "search_payload_size_local_context_filter_client_id_context_key_not_set"); - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - // In production, the SearchPayloadSizeFilter.CLIENT_ID_CONTEXT_KEY should always be set - // (by ThriftServer). However, it's not set in tests, because tests do not start a ThriftServer. - Option> clientIdOption = - Contexts.local().get(SearchPayloadSizeFilter.CLIENT_ID_CONTEXT_KEY); - if (clientIdOption.isDefined()) { - AtomicReference clientIdReference = clientIdOption.get(); - Preconditions.checkArgument(clientIdReference.get() == null); - clientIdReference.set(request.getClientId()); - } else { - CLIENT_ID_CONTEXT_KEY_NOT_SET_COUNTER.increment(); - } - - return service.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/SensitiveResultsTrackingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/SensitiveResultsTrackingFilter.java deleted file mode 100644 index 082e52dde..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/SensitiveResultsTrackingFilter.java +++ /dev/null @@ -1,140 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.Set; - -import com.google.common.base.Joiner; - -import org.apache.thrift.TException; -import org.slf4j.Logger; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.thrift.ThriftUtils; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; -import com.twitter.util.FutureEventListener; - -/** - * The general framework for earlybird root to track sensitive results. - */ -public abstract class SensitiveResultsTrackingFilter - extends SimpleFilter { - - /** - * The type name is used to distinguish different kinds of sensitive results in log. - */ - private final String typeName; - - /** - * The mark is to control whether to log expensive information. - */ - private final boolean logDetails; - - /** - * Constructor helps distinguish different sensitive content trackers. - * @param typeName The sensitive content's name (e.g. nullcast) - * @param logDetails Whether to log details such as serialized requests and responses - */ - public SensitiveResultsTrackingFilter(final String typeName, boolean logDetails) { - super(); - this.typeName = typeName; - this.logDetails = logDetails; - } - - /** - * Get the LOG that the sensitive results can write to. - */ - protected abstract Logger getLogger(); - - /** - * The counter which counts the number of queries with sensitive results. - */ - protected abstract SearchCounter getSensitiveQueryCounter(); - - /** - * The counter which counts the number of sensitive results. - */ - protected abstract SearchCounter getSensitiveResultsCounter(); - - /** - * The method defines how the sensitive results are identified. - */ - protected abstract Set getSensitiveResults( - EarlybirdRequestContext requestContext, - EarlybirdResponse earlybirdResponse) throws Exception; - - /** - * Get a set of tweets which should be exclude from the sensitive results set. 
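 * A concrete tracker only needs to implement these hooks. NullcastTrackingFilter,
 * earlier in this change, is one such subclass: it calls
 * super("unexpected nullcast tweets", true), detects offending tweets with
 * DetectPositiveOperatorVisitor, and excepts the status ids explicitly supplied by
 * scoring requests. A hypothetical new tracker for another kind of sensitive result
 * would follow the same pattern and override the same five methods.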
- */ - protected abstract Set getExceptedResults(EarlybirdRequestContext requestContext); - - @Override - public final Future apply( - final EarlybirdRequestContext requestContext, - Service service) { - Future response = service.apply(requestContext); - - response.addEventListener(new FutureEventListener() { - @Override - public void onSuccess(EarlybirdResponse earlybirdResponse) { - try { - if (earlybirdResponse.responseCode == EarlybirdResponseCode.SUCCESS - && earlybirdResponse.isSetSearchResults() - && requestContext.getParsedQuery() != null) { - Set statusIds = getSensitiveResults(requestContext, earlybirdResponse); - Set exceptedIds = getExceptedResults(requestContext); - statusIds.removeAll(exceptedIds); - - if (statusIds.size() > 0) { - getSensitiveQueryCounter().increment(); - getSensitiveResultsCounter().add(statusIds.size()); - logContent(requestContext, earlybirdResponse, statusIds); - } - } - } catch (Exception e) { - getLogger().error("Caught exception while trying to log sensitive results for query: {}", - requestContext.getParsedQuery().serialize(), e); - } - } - - @Override - public void onFailure(Throwable cause) { - } - }); - - return response; - } - - private void logContent( - final EarlybirdRequestContext requestContext, - final EarlybirdResponse earlybirdResponse, - final Set statusIds) { - - if (logDetails) { - String base64Request; - try { - base64Request = ThriftUtils.toBase64EncodedString(requestContext.getRequest()); - } catch (TException e) { - base64Request = "Failed to parse base 64 request"; - } - getLogger().error("Found " + typeName - + ": {} | " - + "parsedQuery: {} | " - + "request: {} | " - + "base 64 request: {} | " - + "response: {}", - Joiner.on(",").join(statusIds), - requestContext.getParsedQuery().serialize(), - requestContext.getRequest(), - base64Request, - earlybirdResponse); - } else { - getLogger().error("Found " + typeName + ": {} for parsedQuery {}", - Joiner.on(",").join(statusIds), - requestContext.getParsedQuery().serialize()); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ServiceExceptionHandlingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ServiceExceptionHandlingFilter.java deleted file mode 100644 index 4594aa289..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ServiceExceptionHandlingFilter.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** A per-service filter for handling exceptions. */ -public class ServiceExceptionHandlingFilter - extends SimpleFilter { - private final EarlybirdResponseExceptionHandler exceptionHandler; - - /** Creates a new ServiceExceptionHandlingFilter instance. 
*/ - public ServiceExceptionHandlingFilter(EarlybirdCluster cluster) { - this.exceptionHandler = new EarlybirdResponseExceptionHandler(cluster.getNameForStats()); - } - - @Override - public Future apply( - EarlybirdRequestContext requestContext, - Service service) { - return exceptionHandler.handleException( - requestContext.getRequest(), service.apply(requestContext)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ServiceResponseValidationFilter.java b/src/java/com/twitter/search/earlybird_root/filters/ServiceResponseValidationFilter.java deleted file mode 100644 index 2464be534..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ServiceResponseValidationFilter.java +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import java.util.HashMap; -import java.util.Map; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.earlybird.EarlybirdResponseMergeUtil; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.earlybird_root.validators.FacetsResponseValidator; -import com.twitter.search.earlybird_root.validators.PassThroughResponseValidator; -import com.twitter.search.earlybird_root.validators.ServiceResponseValidator; -import com.twitter.search.earlybird_root.validators.TermStatsResultsValidator; -import com.twitter.search.earlybird_root.validators.TopTweetsResultsValidator; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * Filter responsible for handling invalid response returned by downstream services, and - * translating them into EarlybirdResponseExceptions. 
- */ -public class ServiceResponseValidationFilter - extends SimpleFilter { - - private final Map> - requestTypeToResponseValidators = new HashMap<>(); - private final EarlybirdCluster cluster; - - /** - * Creates a new filter for handling invalid response - */ - public ServiceResponseValidationFilter(EarlybirdCluster cluster) { - this.cluster = cluster; - - ServiceResponseValidator passThroughValidator = - new PassThroughResponseValidator(); - - requestTypeToResponseValidators - .put(EarlybirdRequestType.FACETS, new FacetsResponseValidator(cluster)); - requestTypeToResponseValidators - .put(EarlybirdRequestType.RECENCY, passThroughValidator); - requestTypeToResponseValidators - .put(EarlybirdRequestType.RELEVANCE, passThroughValidator); - requestTypeToResponseValidators - .put(EarlybirdRequestType.STRICT_RECENCY, passThroughValidator); - requestTypeToResponseValidators - .put(EarlybirdRequestType.TERM_STATS, new TermStatsResultsValidator(cluster)); - requestTypeToResponseValidators - .put(EarlybirdRequestType.TOP_TWEETS, new TopTweetsResultsValidator(cluster)); - } - - @Override - public Future apply( - final EarlybirdRequestContext requestContext, - Service service) { - return service.apply(requestContext).flatMap( - new Function>() { - @Override - public Future apply(EarlybirdResponse response) { - if (response == null) { - return Future.exception(new IllegalStateException( - cluster + " returned null response")); - } - - if (response.getResponseCode() == EarlybirdResponseCode.SUCCESS) { - return requestTypeToResponseValidators - .get(requestContext.getEarlybirdRequestType()) - .validate(response); - } - - return Future.value(EarlybirdResponseMergeUtil.transformInvalidResponse( - response, - String.format("Failure from %s (%s)", cluster, response.getResponseCode()))); - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/ServingRangeProvider.java b/src/java/com/twitter/search/earlybird_root/filters/ServingRangeProvider.java deleted file mode 100644 index fb26bd2d7..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/ServingRangeProvider.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; - -public interface ServingRangeProvider { - /** - * Get a ServingRange implementation. - * Usually backed by either TierInfoWrapper or RootClusterBoundaryInfo. - */ - ServingRange getServingRange(EarlybirdRequestContext requestContext, boolean useBoundaryOverride); -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/StratoAttributionClientIdFilter.java b/src/java/com/twitter/search/earlybird_root/filters/StratoAttributionClientIdFilter.java deleted file mode 100644 index aff0c44e1..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/StratoAttributionClientIdFilter.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.earlybird.common.ClientIdUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -/** - * A filter that will set the clientId of the request to the strato HttpEndpoint Attribution. - *

    - * If the clientId is already set to something non-null then that value is used. - * If the clientId is null but Attribution.httpEndpoint() contains a value it will be set as - * the clientId. - */ -public class StratoAttributionClientIdFilter extends - SimpleFilter { - @Override - public Future apply( - EarlybirdRequest request, Service service - ) { - if (request.getClientId() == null) { - ClientIdUtil.getClientIdFromHttpEndpointAttribution().ifPresent(request::setClientId); - } - - return service.apply(request); - } -} - diff --git a/src/java/com/twitter/search/earlybird_root/filters/TopLevelExceptionHandlingFilter.java b/src/java/com/twitter/search/earlybird_root/filters/TopLevelExceptionHandlingFilter.java deleted file mode 100644 index f3db830fd..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/TopLevelExceptionHandlingFilter.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -/** A top level filter for handling exceptions. */ -public class TopLevelExceptionHandlingFilter - extends SimpleFilter { - private final EarlybirdResponseExceptionHandler exceptionHandler; - - /** Creates a new TopLevelExceptionHandlingFilter instance. */ - public TopLevelExceptionHandlingFilter() { - this.exceptionHandler = new EarlybirdResponseExceptionHandler("top_level"); - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - return exceptionHandler.handleException(request, service.apply(request)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/UnsetSuperRootFieldsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/UnsetSuperRootFieldsFilter.java deleted file mode 100644 index a3f24b7b2..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/UnsetSuperRootFieldsFilter.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestUtil; -import com.twitter.util.Future; - -/** - * A filter that unsets some request fields that make sense only on the SuperRoot, before sending - * them to the individual roots. 
- */ -public class UnsetSuperRootFieldsFilter extends SimpleFilter { - private final boolean unsetFollowedUserIds; - - public UnsetSuperRootFieldsFilter() { - this(true); - } - - public UnsetSuperRootFieldsFilter(boolean unsetFollowedUserIds) { - this.unsetFollowedUserIds = unsetFollowedUserIds; - } - - @Override - public Future apply(EarlybirdRequest request, - Service service) { - return service.apply(EarlybirdRequestUtil.unsetSuperRootFields(request, unsetFollowedUserIds)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/filters/VeryRecentTweetsFilter.java b/src/java/com/twitter/search/earlybird_root/filters/VeryRecentTweetsFilter.java deleted file mode 100644 index 6f0678a1e..000000000 --- a/src/java/com/twitter/search/earlybird_root/filters/VeryRecentTweetsFilter.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.earlybird_root.filters; - -import javax.inject.Inject; - -import com.twitter.finagle.Service; -import com.twitter.finagle.SimpleFilter; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class VeryRecentTweetsFilter - extends SimpleFilter { - private static final String DECIDER_KEY = "enable_very_recent_tweets"; - private static final SearchRateCounter VERY_RECENT_TWEETS_NOT_MODIFIED = - SearchRateCounter.export("very_recent_tweets_not_modified"); - private static final SearchRateCounter VERY_RECENT_TWEETS_ENABLED = - SearchRateCounter.export("very_recent_tweets_enabled"); - - private final SearchDecider decider; - - @Inject - public VeryRecentTweetsFilter( - SearchDecider decider - ) { - this.decider = decider; - } - - @Override - public Future apply( - EarlybirdRequest request, - Service service - ) { - if (decider.isAvailable(DECIDER_KEY)) { - VERY_RECENT_TWEETS_ENABLED.increment(); - request.setSkipVeryRecentTweets(false); - } else { - VERY_RECENT_TWEETS_NOT_MODIFIED.increment(); - } - - return service.apply(request); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/img/serving.png b/src/java/com/twitter/search/earlybird_root/img/serving.png deleted file mode 100644 index aca60b55e..000000000 Binary files a/src/java/com/twitter/search/earlybird_root/img/serving.png and /dev/null differ diff --git a/src/java/com/twitter/search/earlybird_root/mergers/AccumulatedResponses.java b/src/java/com/twitter/search/earlybird_root/mergers/AccumulatedResponses.java deleted file mode 100644 index abfebf20d..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/AccumulatedResponses.java +++ /dev/null @@ -1,176 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.List; -import java.util.Map; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.TierResponse; - -/** - * Collection of EarlybirdResponses and associated stats to be merged. - */ -public class AccumulatedResponses { - // The list of the successful responses from all earlybird futures. This does not include empty - // responses resulted from null requests. 
- private final List successResponses; - // The list of the unsuccessful responses from all earlybird futures. - private final List errorResponses; - // the list of max statusIds seen in each earlybird. - private final List maxIds; - // the list of min statusIds seen in each earlybird. - private final List minIds; - - private final EarlyTerminationInfo mergedEarlyTerminationInfo; - private final boolean isMergingAcrossTiers; - private final PartitionCounts partitionCounts; - private final int numSearchedSegments; - - public static final class PartitionCounts { - private final int numPartitions; - private final int numSuccessfulPartitions; - private final List perTierResponse; - - public PartitionCounts(int numPartitions, int numSuccessfulPartitions, List - perTierResponse) { - this.numPartitions = numPartitions; - this.numSuccessfulPartitions = numSuccessfulPartitions; - this.perTierResponse = perTierResponse; - } - - public int getNumPartitions() { - return numPartitions; - } - - public int getNumSuccessfulPartitions() { - return numSuccessfulPartitions; - } - - public List getPerTierResponse() { - return perTierResponse; - } - } - - /** - * Create AccumulatedResponses - */ - public AccumulatedResponses(List successResponses, - List errorResponses, - List maxIds, - List minIds, - EarlyTerminationInfo mergedEarlyTerminationInfo, - boolean isMergingAcrossTiers, - PartitionCounts partitionCounts, - int numSearchedSegments) { - this.successResponses = successResponses; - this.errorResponses = errorResponses; - this.maxIds = maxIds; - this.minIds = minIds; - this.mergedEarlyTerminationInfo = mergedEarlyTerminationInfo; - this.isMergingAcrossTiers = isMergingAcrossTiers; - this.partitionCounts = partitionCounts; - this.numSearchedSegments = numSearchedSegments; - } - - public List getSuccessResponses() { - return successResponses; - } - - public List getErrorResponses() { - return errorResponses; - } - - public List getMaxIds() { - return maxIds; - } - - public List getMinIds() { - return minIds; - } - - public EarlyTerminationInfo getMergedEarlyTerminationInfo() { - return mergedEarlyTerminationInfo; - } - - public boolean foundError() { - return !errorResponses.isEmpty(); - } - - /** - * Tries to return a merged EarlybirdResponse that propagates as much information from the error - * responses as possible. - * - * If all error responses have the same error response code, the merged response will have the - * same error response code, and the debugString/debugInfo on the merged response will be set to - * the debugString/debugInfo of one of the merged responses. - * - * If the error responses have at least 2 different response codes, TRANSIENT_ERROR will be set - * on the merged response. Also, we will look for the most common error response code, and will - * propagate the debugString/debugInfo from an error response with that response code. - */ - public EarlybirdResponse getMergedErrorResponse() { - Preconditions.checkState(!errorResponses.isEmpty()); - - // Find a response that has the most common error response code. 
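The Javadoc above describes the error-merging policy: keep a specific response code only when every failed response agrees on it, otherwise report TRANSIENT_ERROR, and copy the debugString/debugInfo from a response carrying the most common code. A minimal, self-contained sketch of that selection using a stand-in enum; the real code operates on the thrift-generated EarlybirdResponse/EarlybirdResponseCode types, so the names below are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the thrift-generated EarlybirdResponseCode. */
enum ResponseCode { TRANSIENT_ERROR, SERVER_TIMEOUT_ERROR, CLIENT_CANCEL_ERROR }

final class ErrorCodeVote {
  /**
   * Picks the merged error code: the shared code when every error response agrees,
   * TRANSIENT_ERROR otherwise. The loop also tracks a code with the highest count,
   * which is the response whose debugString/debugInfo the merger would propagate.
   */
  static ResponseCode mergeErrorCodes(List<ResponseCode> errorCodes) {
    Map<ResponseCode, Integer> counts = new HashMap<>();
    int maxCount = 0;
    ResponseCode mostCommon = null;
    for (ResponseCode code : errorCodes) {
      int count = counts.merge(code, 1, Integer::sum);
      if (count > maxCount) {
        maxCount = count;   // keep the running maximum current
        mostCommon = code;  // this code currently occurs most often
      }
    }
    // All error responses agreed on one code: propagate it. Otherwise report a transient error.
    return counts.size() == 1 ? mostCommon : ResponseCode.TRANSIENT_ERROR;
  }
}
```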
- int maxCount = 0; - EarlybirdResponse errorResponseWithMostCommonErrorResponseCode = null; - Map<EarlybirdResponseCode, Integer> responseCodeCounts = Maps.newHashMap(); - for (EarlybirdResponse errorResponse : errorResponses) { - EarlybirdResponseCode responseCode = errorResponse.getResponseCode(); - Integer responseCodeCount = responseCodeCounts.get(responseCode); - if (responseCodeCount == null) { - responseCodeCount = 0; - } - ++responseCodeCount; - responseCodeCounts.put(responseCode, responseCodeCount); - if (responseCodeCount > maxCount) { - maxCount = responseCodeCount; - errorResponseWithMostCommonErrorResponseCode = errorResponse; - } - } - - // If all error responses have the same response code, set it on the merged response. - // Otherwise, set TRANSIENT_ERROR on the merged response. - EarlybirdResponseCode mergedResponseCode = EarlybirdResponseCode.TRANSIENT_ERROR; - if (responseCodeCounts.size() == 1) { - mergedResponseCode = responseCodeCounts.keySet().iterator().next(); - } - - EarlybirdResponse mergedResponse = new EarlybirdResponse() - .setResponseCode(mergedResponseCode); - - // Propagate the debugString/debugInfo of the selected error response to the merged response. - Preconditions.checkNotNull(errorResponseWithMostCommonErrorResponseCode); - if (errorResponseWithMostCommonErrorResponseCode.isSetDebugString()) { - mergedResponse.setDebugString(errorResponseWithMostCommonErrorResponseCode.getDebugString()); - } - if (errorResponseWithMostCommonErrorResponseCode.isSetDebugInfo()) { - mergedResponse.setDebugInfo(errorResponseWithMostCommonErrorResponseCode.getDebugInfo()); - } - - // Set the numPartitions and numSuccessfulPartitions on the mergedResponse - mergedResponse.setNumPartitions(partitionCounts.getNumPartitions()); - mergedResponse.setNumSuccessfulPartitions(partitionCounts.getNumSuccessfulPartitions()); - - return mergedResponse; - } - - public boolean isMergingAcrossTiers() { - return isMergingAcrossTiers; - } - - public boolean isMergingPartitionsWithinATier() { - return !isMergingAcrossTiers; - } - - public PartitionCounts getPartitionCounts() { - return partitionCounts; - } - - public int getNumSearchedSegments() { - return numSearchedSegments; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/BUILD b/src/java/com/twitter/search/earlybird_root/mergers/BUILD deleted file mode 100644 index cd818b753..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/BUILD +++ /dev/null @@ -1,26 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/log4j", - "3rdparty/jvm/org/slf4j:slf4j-api", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/quantity", - "src/java/com/twitter/search/common/futures", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance:utils", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/search", - "src/java/com/twitter/search/common/util:finagleutil", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/earlybird_root/collectors", - "src/java/com/twitter/search/earlybird_root/common", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/thrift/com/twitter/search:earlybird-java", - "src/thrift/com/twitter/search/common:query-java", - ], -) diff --git 
a/src/java/com/twitter/search/earlybird_root/mergers/EarlyTerminateTierMergePredicate.java b/src/java/com/twitter/search/earlybird_root/mergers/EarlyTerminateTierMergePredicate.java deleted file mode 100644 index 9bde1eb03..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/EarlyTerminateTierMergePredicate.java +++ /dev/null @@ -1,9 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -public interface EarlyTerminateTierMergePredicate { - /** - * Do we have enough results so far that we can early terminate and not continue onto next tier? - */ - boolean shouldEarlyTerminateTierMerge(int totalResultsFromSuccessfulShards, - boolean foundEarlyTermination); -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseDebugMessageBuilder.java b/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseDebugMessageBuilder.java deleted file mode 100644 index f27f7214b..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseDebugMessageBuilder.java +++ /dev/null @@ -1,176 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - - -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Function; -import com.google.common.base.Joiner; -import com.google.common.collect.Iterables; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.logging.DebugMessageBuilder; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; - -/** - * Collects debug messages to attach to EarlybirdResponse - */ -class EarlybirdResponseDebugMessageBuilder { - private static final Logger LOG = - LoggerFactory.getLogger(EarlybirdResponseDebugMessageBuilder.class); - - private static final Logger TOO_MANY_FAILED_PARTITIONS_LOG = - LoggerFactory.getLogger(String.format("%s_too_many_failed_partitions", - EarlybirdResponseDebugMessageBuilder.class.getName())); - - @VisibleForTesting - protected final SearchCounter insufficientValidResponseCounter = - SearchCounter.export("insufficient_valid_partition_responses_count"); - @VisibleForTesting - protected final SearchCounter validPartitionResponseCounter = - SearchCounter.export("valid_partition_response_count"); - - // the combined debug string for all earlybird responses - private final StringBuilder debugString; - /** - * A message builder backed by the same {@link #debugString} above. 
- */ - private final DebugMessageBuilder debugMessageBuilder; - - private static final Joiner JOINER = Joiner.on(", "); - - EarlybirdResponseDebugMessageBuilder(EarlybirdRequest request) { - this(getDebugLevel(request)); - } - - EarlybirdResponseDebugMessageBuilder(DebugMessageBuilder.Level level) { - this.debugString = new StringBuilder(); - this.debugMessageBuilder = new DebugMessageBuilder(debugString, level); - } - - private static DebugMessageBuilder.Level getDebugLevel(EarlybirdRequest request) { - if (request.isSetDebugMode() && request.getDebugMode() > 0) { - return DebugMessageBuilder.getDebugLevel(request.getDebugMode()); - } else if (request.isSetDebugOptions()) { - return DebugMessageBuilder.Level.DEBUG_BASIC; - } else { - return DebugMessageBuilder.Level.DEBUG_NONE; - } - } - - protected boolean isDebugMode() { - return debugMessageBuilder.getDebugLevel() > 0; - } - - void append(String msg) { - debugString.append(msg); - } - - void debugAndLogWarning(String msg) { - if (isDebugMode()) { - debugString.append(msg).append('\n'); - } - LOG.warn(msg); - } - - void debugDetailed(String format, Object... args) { - debugAtLevel(DebugMessageBuilder.Level.DEBUG_DETAILED, format, args); - } - - void debugVerbose(String format, Object... args) { - debugAtLevel(DebugMessageBuilder.Level.DEBUG_VERBOSE, format, args); - } - - void debugVerbose2(String format, Object... args) { - debugAtLevel(DebugMessageBuilder.Level.DEBUG_VERBOSE_2, format, args); - } - - void debugAtLevel(DebugMessageBuilder.Level level, String format, Object... args) { - boolean levelOK = debugMessageBuilder.isAtLeastLevel(level); - if (levelOK || LOG.isDebugEnabled()) { - // We check both modes here in order to build the formatted message only once. - String message = String.format(format, args); - - LOG.debug(message); - - if (levelOK) { - debugString.append(message).append('\n'); - } - } - } - - String debugString() { - return debugString.toString(); - } - - DebugMessageBuilder getDebugMessageBuilder() { - return debugMessageBuilder; - } - - void logBelowSuccessThreshold(ThriftSearchQuery searchQuery, int numSuccessResponses, - int numPartitions, double successThreshold) { - String rawQuery = (searchQuery != null && searchQuery.isSetRawQuery()) - ? "[" + searchQuery.getRawQuery() + "]" : "null"; - String serializedQuery = (searchQuery != null && searchQuery.isSetSerializedQuery()) - ? "[" + searchQuery.getSerializedQuery() + "]" : "null"; - // Not enough successful responses from partitions. - String errorMessage = String.format( - "Only %d valid responses returned out of %d partitions for raw query: %s" - + " serialized query: %s. 
Lower than threshold of %s", - numSuccessResponses, numPartitions, rawQuery, serializedQuery, successThreshold); - - TOO_MANY_FAILED_PARTITIONS_LOG.warn(errorMessage); - - insufficientValidResponseCounter.increment(); - validPartitionResponseCounter.add(numSuccessResponses); - debugString.append(errorMessage); - } - - - @VisibleForTesting - void logResponseDebugInfo(EarlybirdRequest earlybirdRequest, - String partitionTierName, - EarlybirdResponse response) { - if (response.isSetDebugString() && !response.getDebugString().isEmpty()) { - debugString.append(String.format("Received response from [%s] with debug string [%s]", - partitionTierName, response.getDebugString())).append("\n"); - } - - if (!response.isSetResponseCode()) { - debugAndLogWarning(String.format( - "Received Earlybird null response code for query [%s] from [%s]", - earlybirdRequest, partitionTierName)); - } else if (response.getResponseCode() != EarlybirdResponseCode.SUCCESS - && response.getResponseCode() != EarlybirdResponseCode.PARTITION_SKIPPED - && response.getResponseCode() != EarlybirdResponseCode.PARTITION_DISABLED - && response.getResponseCode() != EarlybirdResponseCode.TIER_SKIPPED) { - debugAndLogWarning(String.format( - "Received Earlybird response error [%s] for query [%s] from [%s]", - response.getResponseCode(), earlybirdRequest, partitionTierName)); - } - - if (debugMessageBuilder.isVerbose2()) { - debugVerbose2("Earlybird [%s] returned response: %s", partitionTierName, response); - } else if (debugMessageBuilder.isVerbose()) { - if (response.isSetSearchResults() && response.getSearchResults().getResultsSize() > 0) { - String ids = JOINER.join(Iterables.transform( - response.getSearchResults().getResults(), - new Function() { - @Nullable - @Override - public Long apply(ThriftSearchResult result) { - return result.getId(); - } - })); - debugVerbose("Earlybird [%s] returned TweetIDs: %s", partitionTierName, ids); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseMerger.java deleted file mode 100644 index e52e70b29..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/EarlybirdResponseMerger.java +++ /dev/null @@ -1,604 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Map; - -import scala.runtime.BoxedUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Optional; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.common.util.earlybird.EarlybirdResponseMergeUtil; -import com.twitter.search.common.util.earlybird.ResultsUtil; -import com.twitter.search.earlybird.thrift.EarlybirdDebugInfo; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import 
com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.collectors.MultiwayMergeCollector; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.earlybird_root.common.EarlybirdRequestUtil; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * Base EarlybirdResponseMerger containing basic logic to merge EarlybirdResponse objects - */ -public abstract class EarlybirdResponseMerger implements EarlyTerminateTierMergePredicate { - private static final Logger LOG = LoggerFactory.getLogger(EarlybirdResponseMerger.class); - private static final Logger MIN_SEARCHED_STATUS_ID_LOGGER = - LoggerFactory.getLogger("MinSearchedStatusIdLogger"); - - private static final SearchCounter NO_SEARCH_RESULT_COUNTER = - SearchCounter.export("no_search_result_count"); - private static final SearchCounter NO_RESPONSES_TO_MERGE = - SearchCounter.export("no_responses_to_merge"); - private static final SearchCounter EARLYBIRD_RESPONSE_NO_MORE_RESULTS = - SearchCounter.export("merger_earlybird_response_no_more_results"); - private static final String PARTITION_OR_TIER_COUNTER_NAME_FORMAT = - "merger_waited_for_response_from_%s_counter"; - private static final String PARTITION_OR_TIER_ERROR_COUNTER_NAME_FORMAT = - "merger_num_error_responses_from_%s"; - private static final String PARTITION_OR_TIER_RESPONSE_CODE_COUNTER_NAME_FORMAT = - "merger_earlybird_response_code_from_%s_%s"; - - protected final EarlybirdResponseDebugMessageBuilder responseMessageBuilder; - protected final EarlybirdRequestContext requestContext; - protected final ImmutableList> responses; - protected AccumulatedResponses accumulatedResponses; - - - @VisibleForTesting - static final Map MERGER_CREATED_STATS = - perRequestTypeCounterImmutableMap("earlybird_response_merger_%s_created_count"); - - @VisibleForTesting - static final Map - MIN_SEARCHED_STATUS_ID_LARGER_THAN_REQUEST_MAX_ID = perRequestTypeCounterImmutableMap( - "merger_%s_min_searched_status_id_larger_than_request_max_id"); - - @VisibleForTesting - static final Map - MIN_SEARCHED_STATUS_ID_LARGER_THAN_REQUEST_UNTIL_TIME = perRequestTypeCounterImmutableMap( - "merger_%s_min_searched_status_id_larger_than_request_until_time"); - - private static Map perRequestTypeCounterImmutableMap( - String statPattern) { - Map statsMap = Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType earlybirdRequestType : EarlybirdRequestType.values()) { - String statName = String.format(statPattern, earlybirdRequestType.getNormalizedName()); - statsMap.put(earlybirdRequestType, SearchCounter.export(statName)); - } - - return Maps.immutableEnumMap(statsMap); - } - - public static final com.google.common.base.Function> - HIT_COUNT_GETTER = - response -> response.getSearchResults() == null - ? 
null - : response.getSearchResults().getHitCounts(); - - private final ChainMerger chainMerger; - - private class ChainMerger { - private final EarlybirdRequestContext requestContext; - private final ResponseAccumulator responseAccumulator; - private final List> responses; - private final EarlybirdResponseDebugMessageBuilder responseMessageBuilder; - private int currentFutureIndex = -1; - - public ChainMerger(EarlybirdRequestContext requestContext, - ResponseAccumulator responseAccumulator, - List> responses, - EarlybirdResponseDebugMessageBuilder responseMessageBuilder) { - this.requestContext = requestContext; - this.responseAccumulator = responseAccumulator; - this.responses = responses; - this.responseMessageBuilder = responseMessageBuilder; - } - - public Future merge() { - // 'responseFutures' should always be sorted. - // When returned by EarlybirdScatterGather service, the responses are sorted by partition ID. - // When returned by EarlybirdChainedScatterGatherService, - // responses are sorted descending by tier start date. See: - // com.twitter.search.earlybird_root.EarlybirdChainedScatterGatherService.TIER_COMPARATOR. - // - // When merging responses from partitions, we want to wait for responses from all partitions, - // so the order in which we wait for those results does not matter. When merging responses - // from tiers, we want to wait for the response from the latest. If we don't need any more - // responses to compute the final response, then we don't need to wait for the responses from - // other tiers. If we cannot terminate early, then we want to wait for the responses from the - // second tier, and so on. - // - // We do not need to have any explicit synchronization, because: - // 1. The callbacks for future_i are set by the flatMap() callback on future_{i-1} (when - // recursively calling merge() inside the flatMap()). - // 2. Before setting the callbacks on future_i, future_{i-1}.flatMap() adds the response - // results to mergeHelper. - // 3. When the callbacks on future_i are set, the memory barrier between - // thread_running_future_{i-1} and thread_running_future_i is crossed. This guarantees - // that thread_running_future_i will see the updates to mergeHelper before it sees the - // callbacks. (Or thread_running_future_{i-1} == thread_running_future_i, in which case - // synchronization is not an issue, and correctness is guarateed by the order in which - // things will run.) - // 4. The same reasoning applies to currentFutureIndex. - - ++currentFutureIndex; - if (currentFutureIndex >= responses.size()) { - return Future.value(getTimedMergedResponse(responseAccumulator.getAccumulatedResults())); - } - - final String partitionTierName = - responseAccumulator.getNameForLogging(currentFutureIndex, responses.size()); - final String nameForEarlybirdResponseCodeStats = - responseAccumulator.getNameForEarlybirdResponseCodeStats( - currentFutureIndex, responses.size()); - - // If a tier in the chain throws an exception, convert it to a null response, and let the - // mergeHelper handle it appropriately. 
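The long comment above is the heart of ChainMerger: per-tier futures are awaited strictly in order (newest tier first), each tier's callback is registered only after the previous tier's results have been added to the accumulator (so no extra synchronization is needed), and the chain stops as soon as the accumulated results satisfy the early-termination predicate. A rough sketch of that control flow using JDK futures; the production code chains com.twitter.util.Future callbacks via flatMap, and the class and method names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

/** Sequentially drains per-tier futures, stopping once enough results were collected. */
final class ChainedMergeSketch<R> {
  private final List<CompletableFuture<R>> tierResponses;
  private final Predicate<List<R>> canStopEarly;
  private final List<R> accumulated = new ArrayList<>();

  ChainedMergeSketch(List<CompletableFuture<R>> tierResponses, Predicate<List<R>> canStopEarly) {
    this.tierResponses = tierResponses;
    this.canStopEarly = canStopEarly;
  }

  CompletableFuture<List<R>> merge(int index) {
    if (index >= tierResponses.size()) {
      // Ran out of tiers: return whatever has been accumulated.
      return CompletableFuture.completedFuture(accumulated);
    }
    return tierResponses.get(index).thenCompose(response -> {
      accumulated.add(response);
      if (canStopEarly.test(accumulated)) {
        // Newer tiers produced enough results; never even wait on the older tiers.
        return CompletableFuture.completedFuture(accumulated);
      }
      // Not enough yet: fall through to the next (older) tier.
      return merge(index + 1);
    });
  }
}
```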
- return responses.get(currentFutureIndex) - .handle(Function.func(t -> { - if (FinagleUtil.isCancelException(t)) { - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.CLIENT_CANCEL_ERROR); - } else if (FinagleUtil.isTimeoutException(t)) { - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.SERVER_TIMEOUT_ERROR); - } else { - SearchCounter.export( - String.format(PARTITION_OR_TIER_ERROR_COUNTER_NAME_FORMAT, partitionTierName)) - .increment(); - if (responseMessageBuilder.isDebugMode()) { - responseMessageBuilder.debugAndLogWarning( - String.format("[%s] failed, exception [%s]", - partitionTierName, t.toString())); - } - LOG.warn("exception response from: " + partitionTierName, t); - return new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR); - } - })) - .flatMap(Function.func(response -> { - Preconditions.checkNotNull(response); - - SearchCounter.export( - String.format(PARTITION_OR_TIER_RESPONSE_CODE_COUNTER_NAME_FORMAT, - nameForEarlybirdResponseCodeStats, - response.getResponseCode().name().toLowerCase())) - .increment(); - - if ((response.getResponseCode() != EarlybirdResponseCode.PARTITION_SKIPPED) - && (response.getResponseCode() != EarlybirdResponseCode.TIER_SKIPPED)) { - SearchCounter.export( - String.format(PARTITION_OR_TIER_COUNTER_NAME_FORMAT, partitionTierName)) - .increment(); - } - - if (response.getResponseCode() == EarlybirdResponseCode.CLIENT_CANCEL_ERROR) { - // the request has been cancelled, no need to proceed - return Future.value(response); - } - - rewriteResponseCodeIfSearchResultsMissing(requestContext, partitionTierName, response); - responseMessageBuilder.logResponseDebugInfo( - requestContext.getRequest(), - partitionTierName, - response); - responseAccumulator.addResponse( - responseMessageBuilder, - requestContext.getRequest(), - response); - - if (responseAccumulator.shouldEarlyTerminateMerge(EarlybirdResponseMerger.this)) { - return Future.value(getTimedMergedResponse( - responseAccumulator.getAccumulatedResults())); - } - return merge(); - })); - } - } - - private void rewriteResponseCodeIfSearchResultsMissing( - EarlybirdRequestContext earlybirdRequestContext, - String partitionTierName, - EarlybirdResponse response) { - // We always require searchResults to be set, even for term stats and facet requests. - // This is because searchResults contains important info such as pagination cursors - // like minSearchStatusId and minSearchedTimeSinceEpoch. - // We expect all successful responses to have searchResults set. - if (response.isSetResponseCode() - && response.getResponseCode() == EarlybirdResponseCode.SUCCESS - && response.getSearchResults() == null) { - NO_SEARCH_RESULT_COUNTER.increment(); - LOG.warn("Received Earlybird response with null searchResults from [{}]" - + " EarlybirdRequest [{}] EarlybirdResponse [{}] ", - partitionTierName, earlybirdRequestContext.getRequest(), response); - response.setResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR); - } - } - - /** - * Construct a EarlybirdResponseMerger to merge responses from multiple partitions or tiers - * based on mode. 
- */ - EarlybirdResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator responseAccumulator) { - this.requestContext = requestContext; - this.responses = ImmutableList.copyOf(responses); - this.responseMessageBuilder = - new EarlybirdResponseDebugMessageBuilder(requestContext.getRequest()); - this.chainMerger = new ChainMerger(requestContext, responseAccumulator, responses, - responseMessageBuilder); - } - - /** - * Get a response merger to merge the given responses. - */ - public static EarlybirdResponseMerger getResponseMerger( - EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator helper, - EarlybirdCluster cluster, - EarlybirdFeatureSchemaMerger featureSchemaMerger, - int numPartitions) { - EarlybirdRequestType type = requestContext.getEarlybirdRequestType(); - MERGER_CREATED_STATS.get(type).increment(); - switch (type) { - case FACETS: - return new FacetResponseMerger(requestContext, responses, helper); - case TERM_STATS: - return new TermStatisticsResponseMerger(requestContext, responses, helper); - case RECENCY: - return new RecencyResponseMerger(requestContext, responses, helper, featureSchemaMerger); - case STRICT_RECENCY: - return new StrictRecencyResponseMerger( - requestContext, responses, helper, featureSchemaMerger, cluster); - case RELEVANCE: - return new RelevanceResponseMerger( - requestContext, responses, helper, featureSchemaMerger, numPartitions); - case TOP_TWEETS: - return new TopTweetsResponseMerger(requestContext, responses, helper); - default: - throw new RuntimeException("EarlybirdRequestType " + type + "is not supported by merge"); - } - } - - /** - * This method can perform two types of merges: - * 1. merge responses within a tier from different partitions. - * 2. merge responses from multiple tiers. - */ - public final Future merge() { - return chainMerger.merge() - .onSuccess(checkMinSearchedStatusIdFunction( - "max_id", - EarlybirdRequestUtil.getRequestMaxId(requestContext.getParsedQuery()), - MIN_SEARCHED_STATUS_ID_LARGER_THAN_REQUEST_MAX_ID.get( - requestContext.getEarlybirdRequestType()))) - .onSuccess(checkMinSearchedStatusIdFunction( - "until_time", - EarlybirdRequestUtil.getRequestMaxIdFromUntilTime(requestContext.getParsedQuery()), - MIN_SEARCHED_STATUS_ID_LARGER_THAN_REQUEST_UNTIL_TIME.get( - requestContext.getEarlybirdRequestType()))); - } - - /** - * Returns the function that checks if the minSearchedStatusID on the merged response is higher - * than the max ID in the request. - */ - private Function checkMinSearchedStatusIdFunction( - final String operator, final Optional requestMaxId, final SearchCounter stat) { - return Function.cons(mergedResponse -> { - if (requestMaxId.isPresent() - && requestMaxId.get() != Long.MAX_VALUE - && (mergedResponse.getResponseCode() == EarlybirdResponseCode.SUCCESS) - && mergedResponse.isSetSearchResults() - && mergedResponse.getSearchResults().isSetMinSearchedStatusID()) { - long minSearchedStatusId = mergedResponse.getSearchResults().getMinSearchedStatusID(); - // We sometimes set minSearchedStatusId = max_id + 1 when a request times out even - // before any search happens. - // Check SEARCH-10134 for more details. - if (minSearchedStatusId > requestMaxId.get() + 1) { - stat.increment(); - String logMessage = "Response has a minSearchedStatusID ({}) larger than request " - + operator + " ({})." 
- + "\nrequest type: {}" - + "\nrequest: {}" - + "\nmerged response: {}" - + "\nSuccessful accumulated responses:"; - List logMessageParams = Lists.newArrayList(); - logMessageParams.add(minSearchedStatusId); - logMessageParams.add(requestMaxId.get()); - logMessageParams.add(requestContext.getEarlybirdRequestType()); - logMessageParams.add(requestContext.getRequest()); - logMessageParams.add(mergedResponse); - for (EarlybirdResponse response : accumulatedResponses.getSuccessResponses()) { - logMessage += "\naccumulated response: {}"; - logMessageParams.add(response); - } - MIN_SEARCHED_STATUS_ID_LOGGER.warn(logMessage, logMessageParams.toArray()); - } - } - }); - } - - private EarlybirdResponse getTimedMergedResponse(AccumulatedResponses accResponses) { - long start = System.nanoTime(); - try { - return getMergedResponse(accResponses); - } finally { - long totalTime = System.nanoTime() - start; - getMergedResponseTimer().timerIncrement(totalTime); - } - } - - private EarlybirdResponse initializeMergedSuccessResponseFromAccumulatedResponses() { - EarlybirdResponse mergedResponse = new EarlybirdResponse(); - - AccumulatedResponses.PartitionCounts partitionCounts = - accumulatedResponses.getPartitionCounts(); - - mergedResponse.setNumPartitions(partitionCounts.getNumPartitions()) - .setNumSuccessfulPartitions(partitionCounts.getNumSuccessfulPartitions()) - .setPerTierResponse(partitionCounts.getPerTierResponse()) - .setNumSearchedSegments(accumulatedResponses.getNumSearchedSegments()); - - mergedResponse.setEarlyTerminationInfo(accumulatedResponses.getMergedEarlyTerminationInfo()); - mergedResponse.setResponseCode(EarlybirdResponseCode.SUCCESS); - - return mergedResponse; - } - - private EarlybirdResponse getMergedResponse(AccumulatedResponses accResponses) { - accumulatedResponses = accResponses; - EarlybirdResponse mergedResponse; - - if (accumulatedResponses.getSuccessResponses().isEmpty() - && !accumulatedResponses.foundError()) { - // No successful or error responses. This means that all tiers / partitions are intentionally - // skipped. Return a blank successful response. - NO_RESPONSES_TO_MERGE.increment(); - mergedResponse = new EarlybirdResponse() - .setResponseCode(EarlybirdResponseCode.SUCCESS) - .setSearchResults(new ThriftSearchResults()) - .setDebugString("No responses to merge, probably because all tiers/partitions " - + "were skipped."); - } else if (accumulatedResponses.isMergingAcrossTiers()) { - mergedResponse = getMergedResponseAcrossTiers(); - } else { - mergedResponse = getMergedResponseAcrossPartitions(); - } - - saveMergedDebugString(mergedResponse); - return mergedResponse; - } - - private EarlybirdResponse getMergedResponseAcrossTiers() { - Preconditions.checkState( - !accumulatedResponses.getSuccessResponses().isEmpty() - || accumulatedResponses.foundError()); - - // When merging across tiers, if we have one failed tier, we should fail the whole - // response. Note that due to early termination, if a tier that is old fails - // but the newer tiers return enough results, the failed tier won't show up - // here in accumulatedResponses -- the only tiers that show up here - // will be successful. - if (accumulatedResponses.foundError()) { - // The TierResponseAccumulator early terminates on the first error, so we should - // never get more than one error. 
This means that the getMergedErrorResponse will - // return an error response with the error code of that one error, and will never - // have to decide which error response to return if the error responses are all - // different. - - // Perhaps we should just return accumulatedResponses.getErrorResponses().get(0); - Preconditions.checkState(accumulatedResponses.getErrorResponses().size() == 1); - return accumulatedResponses.getMergedErrorResponse(); - } else { - EarlybirdResponse mergedResponse = initializeMergedSuccessResponseFromAccumulatedResponses(); - return internalMerge(mergedResponse); - } - } - - private EarlybirdResponse getMergedResponseAcrossPartitions() { - Preconditions.checkState( - !accumulatedResponses.getSuccessResponses().isEmpty() - || accumulatedResponses.foundError()); - - EarlybirdResponse mergedResponse; - - // Unlike tier merging, one failed response doesn't mean the merged response should - // fail. If we have successful responses we can check the success ratio and if it's - // good we can still return a successful merge. - if (!accumulatedResponses.getSuccessResponses().isEmpty()) { - // We have at least one successful response, but still need to check the success ratio. - // mergedResponse is a SUCCESS response after this call, but we will - // set it to failure below if necessary. - mergedResponse = initializeMergedSuccessResponseFromAccumulatedResponses(); - - int numSuccessResponses = mergedResponse.getNumSuccessfulPartitions(); - int numPartitions = mergedResponse.getNumPartitions(); - double successThreshold = getSuccessResponseThreshold(); - if (checkSuccessPartitionRatio(numSuccessResponses, numPartitions, successThreshold)) { - // Success! Proceed with merging. - mergedResponse.setResponseCode(EarlybirdResponseCode.SUCCESS); - mergedResponse = internalMerge(mergedResponse); - } else { - responseMessageBuilder.logBelowSuccessThreshold( - requestContext.getRequest().getSearchQuery(), numSuccessResponses, numPartitions, - successThreshold); - mergedResponse.setResponseCode(EarlybirdResponseCode.TOO_MANY_PARTITIONS_FAILED_ERROR); - } - } else { - mergedResponse = accumulatedResponses.getMergedErrorResponse(); - } - - return mergedResponse; - } - - /** - * Derived classes should implement the logic to merge the specific type of results (recency, - * relevance, Top Tweets, etc.) - */ - protected abstract EarlybirdResponse internalMerge(EarlybirdResponse response); - - protected abstract SearchTimerStats getMergedResponseTimer(); - - /** - * Do we have enough results so far that we can early terminate and not continue onto next tier? - */ - public boolean shouldEarlyTerminateTierMerge(int totalResultsFromSuccessfulShards, - boolean foundEarlyTermination) { - // We take the most conservative approach to tier response merging --- as long as we have - // some results, we should not return anything from the next tier. This may result in a less - // than ideal experience where a page is not full, but the user can still scroll further. 
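getMergedResponseAcrossPartitions above tolerates partial failure: the merge is treated as successful when the fraction of successful partitions reaches a threshold (the request's successfulResponseThreshold if set, otherwise a per-merger default such as 0.9). A small sketch of that check with a worked example; the class and method names are illustrative:

```java
final class PartitionSuccessCheck {
  /** Returns true when enough partitions responded successfully to trust the merged result. */
  static boolean enoughPartitionsSucceeded(int numSuccess, int numPartitions, double threshold) {
    if (threshold <= 0.0 || threshold > 1.0) {
      throw new IllegalArgumentException("Invalid threshold: " + threshold);
    }
    return numSuccess >= numPartitions * threshold;
  }

  public static void main(String[] args) {
    // With 20 partitions and a 0.9 threshold, at least 18 partitions must succeed.
    System.out.println(enoughPartitionsSucceeded(18, 20, 0.9)); // true
    System.out.println(enoughPartitionsSucceeded(17, 20, 0.9)); // false -> TOO_MANY_PARTITIONS_FAILED_ERROR
  }
}
```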
- - return foundEarlyTermination || totalResultsFromSuccessfulShards >= 1; - } - - private void saveMergedDebugString(EarlybirdResponse mergedResponse) { - if (responseMessageBuilder.isDebugMode()) { - String message = responseMessageBuilder.debugString(); - mergedResponse.setDebugString(message); - if (!accumulatedResponses.getSuccessResponses().isEmpty() - && accumulatedResponses.getSuccessResponses().get(0).isSetDebugInfo()) { - - EarlybirdDebugInfo debugInfo = - accumulatedResponses.getSuccessResponses().get(0).getDebugInfo(); - mergedResponse.setDebugInfo(debugInfo); - } - } - } - - private double getSuccessResponseThreshold() { - EarlybirdRequest request = requestContext.getRequest(); - if (request.isSetSuccessfulResponseThreshold()) { - double successfulResponseThreshold = request.getSuccessfulResponseThreshold(); - Preconditions.checkArgument(successfulResponseThreshold > 0, - "Invalid successfulResponseThreshold %s", successfulResponseThreshold); - Preconditions.checkArgument(successfulResponseThreshold <= 1.0, - "Invalid successfulResponseThreshold %s", successfulResponseThreshold); - return successfulResponseThreshold; - } else { - return getDefaultSuccessResponseThreshold(); - } - } - - protected abstract double getDefaultSuccessResponseThreshold(); - - private static boolean checkSuccessPartitionRatio( - int numSuccessResponses, - int numPartitions, - double goodResponseThreshold) { - Preconditions.checkArgument(goodResponseThreshold > 0.0, - "Invalid goodResponseThreshold %s", goodResponseThreshold); - return numSuccessResponses >= (numPartitions * goodResponseThreshold); - } - - /** - * Merge hit counts from all results. - */ - protected Map aggregateHitCountMap() { - Map hitCounts = ResultsUtil - .aggregateCountMap(accumulatedResponses.getSuccessResponses(), HIT_COUNT_GETTER); - if (hitCounts.size() > 0) { - if (responseMessageBuilder.isDebugMode()) { - responseMessageBuilder.append("Hit counts:\n"); - for (Map.Entry entry : hitCounts.entrySet()) { - responseMessageBuilder.append(String.format(" %10s seconds: %d hits\n", - entry.getKey() / 1000, entry.getValue())); - } - } - return hitCounts; - } - return null; - } - - /** - * Returns the number of results to keep as part of merge-collection. - */ - protected final int computeNumResultsToKeep() { - return EarlybirdResponseMergeUtil.computeNumResultsToKeep(requestContext.getRequest()); - } - - /** - * Remove exact duplicates (same id) from the result set. 
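trimExactDups and truncateResults below perform the final clean-up on the merged, already-sorted result list: drop repeated tweet IDs (keeping the first occurrence) and cut the list down to the requested number of results. A plain-Java illustration of that pass over bare IDs; the production code operates on ThriftSearchResult objects and records TrimStats:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class TrimSketch {
  /** Removes duplicate IDs, keeping the first occurrence, then truncates to numResultsToKeep. */
  static List<Long> dedupAndTruncate(List<Long> sortedIds, int numResultsToKeep) {
    // LinkedHashSet keeps the original (already sorted) order while dropping repeats.
    List<Long> deduped = new ArrayList<>(new LinkedHashSet<>(sortedIds));
    int to = Math.min(numResultsToKeep, deduped.size());
    return deduped.subList(0, to);
  }

  public static void main(String[] args) {
    // Results sorted by descending tweet ID, with one duplicate from overlapping partitions.
    List<Long> merged = List.of(105L, 104L, 104L, 101L, 99L);
    System.out.println(dedupAndTruncate(merged, 3)); // [105, 104, 101]
  }
}
```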
- */ - protected static void trimExactDups(ThriftSearchResults searchResults, TrimStats trimStats) { - int numResults = searchResults.getResultsSize(); - List oldResults = searchResults.getResults(); - List newResults = Lists.newArrayListWithCapacity(numResults); - HashSet resultSet = Sets.newHashSetWithExpectedSize(numResults); - - for (ThriftSearchResult result : oldResults) { - if (resultSet.contains(result.getId())) { - trimStats.increaseRemovedDupsCount(); - continue; - } - - newResults.add(result); - resultSet.add(result.getId()); - } - - searchResults.setResults(newResults); - } - - protected final int addResponsesToCollector(MultiwayMergeCollector collector) { - int totalResultSize = 0; - for (EarlybirdResponse response : accumulatedResponses.getSuccessResponses()) { - if (response.isSetSearchResults()) { - totalResultSize += response.getSearchResults().getResultsSize(); - } - collector.addResponse(response); - } - return totalResultSize; - } - - /** - * Given a sorted searchResults (for recency, sorted by ID; for relevance, sorted by score), - * returns the first 'computeNumResultsToKeep()' number of results. - * - * @param searchResults the searchResults to be truncated. - */ - protected final void truncateResults(ThriftSearchResults searchResults, TrimStats trimStats) { - int numResultsRequested = computeNumResultsToKeep(); - - int to = numResultsRequested == Integer.MAX_VALUE ? searchResults.getResultsSize() - : Math.min(numResultsRequested, searchResults.getResultsSize()); - if (searchResults.getResultsSize() > to) { - trimStats.setResultsTruncatedFromTailCount(searchResults.getResultsSize() - to); - - if (to > 0) { - searchResults.setResults(searchResults.getResults().subList(0, to)); - } else { - // No more results for the next page - EARLYBIRD_RESPONSE_NO_MORE_RESULTS.increment(); - searchResults.setResults(Collections.emptyList()); - } - } - } - - EarlybirdRequest getEarlybirdRequest() { - return requestContext.getRequest(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/FacetResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/FacetResponseMerger.java deleted file mode 100644 index 06fc76d18..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/FacetResponseMerger.java +++ /dev/null @@ -1,353 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; -import java.util.HashMap; -import java.util.HashSet; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.TimeUnit; - -import com.google.common.collect.Sets; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.logging.DebugMessageBuilder; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.ranking.thriftjava.ThriftFacetRankingOptions; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.util.earlybird.FacetsResultsUtils; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftFacetCount; -import com.twitter.search.earlybird.thrift.ThriftFacetCountMetadata; -import com.twitter.search.earlybird.thrift.ThriftFacetFieldResults; -import com.twitter.search.earlybird.thrift.ThriftFacetResults; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; 
-import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * Merger class to merge facets EarlybirdResponse objects - */ -public class FacetResponseMerger extends EarlybirdResponseMerger { - private static final Logger LOG = LoggerFactory.getLogger(FacetResponseMerger.class); - - private static final SearchTimerStats TIMER = - SearchTimerStats.export("merge_facets", TimeUnit.NANOSECONDS, false, true); - - private static final double SUCCESSFUL_RESPONSE_THRESHOLD = 0.9; - private final DebugMessageBuilder debugMessageBuilder; - - - /** - * Constructor to create the merger - */ - public FacetResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator mode) { - super(requestContext, responses, mode); - debugMessageBuilder = responseMessageBuilder.getDebugMessageBuilder(); - debugMessageBuilder.verbose("--- Request Received: %s", requestContext.getRequest()); - } - - @Override - protected SearchTimerStats getMergedResponseTimer() { - return TIMER; - } - - @Override - protected double getDefaultSuccessResponseThreshold() { - return SUCCESSFUL_RESPONSE_THRESHOLD; - } - - @Override - protected EarlybirdResponse internalMerge(EarlybirdResponse facetsResponse) { - - final Map facetFieldInfoMap = - new HashMap<>(); - final Set userIDWhitelist = new HashSet<>(); - - // First, parse the responses and build up our facet info map. - boolean termStatsFilteringMode = FacetsResultsUtils.prepareFieldInfoMap( - requestContext.getRequest().getFacetRequest(), facetFieldInfoMap); - // Iterate through all futures and get results. - collectResponsesAndPopulateMap(facetFieldInfoMap, userIDWhitelist); - - // Next, aggregate the top facets and update the blender response. - facetsResponse - .setFacetResults(new ThriftFacetResults() - .setFacetFields(new HashMap<>()) - .setUserIDWhitelist(userIDWhitelist)); - - // keep track of how many facets a user contributed - this map gets reset for every field - Map perFieldAntiGamingMap = new HashMap<>(); - - // this one is used for images and twimges - Map imagesAntiGamingMap = new HashMap<>(); - - Set twimgDedupSet = null; - - for (final Map.Entry entry - : facetFieldInfoMap.entrySet()) { - // reset for each field - String field = entry.getKey(); - final Map antiGamingMap; - if (field.equals(EarlybirdFieldConstant.IMAGES_FACET) - || field.equals(EarlybirdFieldConstant.TWIMG_FACET)) { - antiGamingMap = imagesAntiGamingMap; - } else { - perFieldAntiGamingMap.clear(); - antiGamingMap = perFieldAntiGamingMap; - } - - ThriftFacetFieldResults results = new ThriftFacetFieldResults(); - FacetsResultsUtils.FacetFieldInfo info = entry.getValue(); - results.setTotalCount(info.totalCounts); - results.setTopFacets(new ArrayList<>()); - FacetsResultsUtils.fillTopLanguages(info, results); - if (info.topFacets != null && !info.topFacets.isEmpty()) { - fillFacetFieldResults(info, antiGamingMap, results); - } - - if (field.equals(EarlybirdFieldConstant.TWIMG_FACET)) { - if (twimgDedupSet == null) { - twimgDedupSet = Sets.newHashSet(); - } - FacetsResultsUtils.dedupTwimgFacet(twimgDedupSet, results, debugMessageBuilder); - } - - facetsResponse.getFacetResults().putToFacetFields(entry.getKey(), results); - } - - if (!termStatsFilteringMode) { - // in term stats filtering mode, if doing it here would break term stats filtering - FacetsResultsUtils.mergeTwimgResults( - facetsResponse.getFacetResults(), - Collections.reverseOrder( - FacetsResultsUtils.getFacetCountComparator( - 
requestContext.getRequest().getFacetRequest()))); - } - - // Update the numHitsProcessed on ThriftSearchResults. - int numHitsProcessed = 0; - int numPartitionsEarlyTerminated = 0; - for (EarlybirdResponse earlybirdResponse: accumulatedResponses.getSuccessResponses()) { - ThriftSearchResults searchResults = earlybirdResponse.getSearchResults(); - if (searchResults != null) { - numHitsProcessed += searchResults.getNumHitsProcessed(); - numPartitionsEarlyTerminated += searchResults.getNumPartitionsEarlyTerminated(); - } - } - ThriftSearchResults searchResults = new ThriftSearchResults(); - searchResults.setResults(new ArrayList<>()); // required field - searchResults.setNumHitsProcessed(numHitsProcessed); - searchResults.setNumPartitionsEarlyTerminated(numPartitionsEarlyTerminated); - facetsResponse.setSearchResults(searchResults); - - LOG.debug("Facets call completed successfully: {}", facetsResponse); - - FacetsResultsUtils.fixNativePhotoUrl(facetsResponse); - return facetsResponse; - } - - private void fillFacetFieldResults(FacetsResultsUtils.FacetFieldInfo facetFieldInfo, - Map antiGamingMap, - ThriftFacetFieldResults results) { - int minWeightedCount = 0; - int minSimpleCount = 0; - int maxPenaltyCount = Integer.MAX_VALUE; - double maxPenaltyCountRatio = 1; - boolean excludePossiblySensitiveFacets = false; - boolean onlyReturnFacetsWithDisplayTweet = false; - int maxHitsPerUser = -1; - - EarlybirdRequest request = requestContext.getRequest(); - if (request.getFacetRequest() != null) { - ThriftFacetRankingOptions rankingOptions = request.getFacetRequest().getFacetRankingOptions(); - - if (request.getSearchQuery() != null) { - maxHitsPerUser = request.getSearchQuery().getMaxHitsPerUser(); - } - - if (rankingOptions != null) { - LOG.debug("FacetsResponseMerger: Using rankingOptions={}", rankingOptions); - - if (rankingOptions.isSetMinCount()) { - minWeightedCount = rankingOptions.getMinCount(); - } - if (rankingOptions.isSetMinSimpleCount()) { - minSimpleCount = rankingOptions.getMinSimpleCount(); - } - if (rankingOptions.isSetMaxPenaltyCount()) { - maxPenaltyCount = rankingOptions.getMaxPenaltyCount(); - } - if (rankingOptions.isSetMaxPenaltyCountRatio()) { - maxPenaltyCountRatio = rankingOptions.getMaxPenaltyCountRatio(); - } - if (rankingOptions.isSetExcludePossiblySensitiveFacets()) { - excludePossiblySensitiveFacets = rankingOptions.isExcludePossiblySensitiveFacets(); - } - if (rankingOptions.isSetOnlyReturnFacetsWithDisplayTweet()) { - onlyReturnFacetsWithDisplayTweet = rankingOptions.isOnlyReturnFacetsWithDisplayTweet(); - } - } - } else { - LOG.warn("earlybirdRequest.getFacetRequest() is null"); - } - - ThriftFacetCount[] topFacetsArray = new ThriftFacetCount[facetFieldInfo.topFacets.size()]; - - facetFieldInfo.topFacets.values().toArray(topFacetsArray); - Arrays.sort(topFacetsArray, Collections.reverseOrder( - FacetsResultsUtils.getFacetCountComparator(request.getFacetRequest()))); - - int numResults = capFacetFieldWidth(facetFieldInfo.fieldRequest.numResults); - - if (topFacetsArray.length < numResults) { - numResults = topFacetsArray.length; - } - - int collected = 0; - for (int i = 0; i < topFacetsArray.length; ++i) { - ThriftFacetCount count = topFacetsArray[i]; - - if (onlyReturnFacetsWithDisplayTweet - && (!count.isSetMetadata() || !count.getMetadata().isSetStatusId() - || count.getMetadata().getStatusId() == -1)) { - // status id must be set - continue; - } - - if (excludePossiblySensitiveFacets && count.isSetMetadata() - && count.getMetadata().isStatusPossiblySensitive()) 
{ - // the display tweet may be offensive or NSFW - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2("[%d] FacetsResponseMerger EXCLUDED: offensive or NSFW %s, " - + "explanation: %s", - i, facetCountSummary(count), - count.getMetadata().getExplanation()); - } - continue; - } - - boolean filterOutUser = false; - if (maxHitsPerUser != -1 && count.isSetMetadata()) { - ThriftFacetCountMetadata metadata = count.getMetadata(); - if (!metadata.dontFilterUser) { - long twitterUserId = metadata.getTwitterUserId(); - int numResultsFromUser = 1; - if (twitterUserId != -1) { - Integer perUser = antiGamingMap.get(twitterUserId); - if (perUser != null) { - numResultsFromUser = perUser + 1; - filterOutUser = numResultsFromUser > maxHitsPerUser; - } - antiGamingMap.put(twitterUserId, numResultsFromUser); - } - } - } - - // Filter facets those don't meet the basic criteria. - if (count.getSimpleCount() < minSimpleCount) { - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2( - "[%d] FacetsResponseMerger EXCLUDED: simpleCount:%d < minSimpleCount:%d, %s", - i, count.getSimpleCount(), minSimpleCount, facetCountSummary(count)); - } - continue; - } - if (count.getWeightedCount() < minWeightedCount) { - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2( - "[%d] FacetsResponseMerger EXCLUDED: weightedCount:%d < minWeightedCount:%d, %s", - i, count.getWeightedCount(), minWeightedCount, facetCountSummary(count)); - } - continue; - } - if (filterOutUser) { - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2( - "[%d] FacetsResponseMerger EXCLUDED: antiGaming filterd user: %d: %s", - i, count.getMetadata().getTwitterUserId(), facetCountSummary(count)); - } - continue; - } - if (count.getPenaltyCount() > maxPenaltyCount) { - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2( - "[%d] FacetsResponseMerger EXCLUCED: penaltyCount:%.3f > maxPenaltyCount:%.3f, %s", - i, count.getPenaltyCount(), maxPenaltyCount, facetCountSummary(count)); - } - continue; - } - if (((double) count.getPenaltyCount() / count.getSimpleCount()) > maxPenaltyCountRatio) { - if (DebugMessageBuilder.DEBUG_VERBOSE <= debugMessageBuilder.getDebugLevel()) { - debugMessageBuilder.verbose2( - "[%d] FacetsResponseMerger EXCLUDED: penaltyCountRatio: %.3f > " - + "maxPenaltyCountRatio:%.3f, %s", - i, (double) count.getPenaltyCount() / count.getSimpleCount(), maxPenaltyCountRatio, - facetCountSummary(count)); - } - continue; - } - results.addToTopFacets(count); - - collected++; - if (collected >= numResults) { - break; - } - } - } - - private static int capFacetFieldWidth(int numResults) { - int ret = numResults; - if (numResults <= 0) { - // this in theory should not be allowed, but for now we issue the request with goodwill length - ret = 10; // default to 10 for future merge code to terminate correctly - } - if (numResults >= 100) { - ret = 100; - } - return ret; - } - - private static String facetCountSummary(final ThriftFacetCount count) { - if (count.isSetMetadata()) { - return String.format("Label: %s (s:%d, w:%d, p:%d, score:%.2f, sid:%d (%s))", - count.getFacetLabel(), count.getSimpleCount(), count.getWeightedCount(), - count.getPenaltyCount(), count.getScore(), count.getMetadata().getStatusId(), - count.getMetadata().getStatusLanguage()); - } else { 
- return String.format("Label: %s (s:%d, w:%d, p:%d, score:%.2f)", count.getFacetLabel(), - count.getSimpleCount(), count.getWeightedCount(), count.getPenaltyCount(), - count.getScore()); - } - } - - // Iterate through the backend responses and fill up the FacetFieldInfo map. - private void collectResponsesAndPopulateMap( - final Map facetFieldInfoMap, - final Set userIDWhitelist) { - // Next, iterate through the backend responses. - int i = 0; - for (EarlybirdResponse facetsResponse : accumulatedResponses.getSuccessResponses()) { - if (facetsResponse.isSetFacetResults()) { - LOG.debug("Facet response from earlybird {} is {} ", i, facetsResponse.getFacetResults()); - i++; - ThriftFacetResults facetResults = facetsResponse.getFacetResults(); - if (facetResults.isSetUserIDWhitelist()) { - userIDWhitelist.addAll(facetResults.getUserIDWhitelist()); - } - FacetsResultsUtils.fillFacetFieldInfo( - facetResults, facetFieldInfoMap, - userIDWhitelist); - } - } - LOG.debug("Earlybird facet response total size {}", i); - } -} - diff --git a/src/java/com/twitter/search/earlybird_root/mergers/PartitionResponseAccumulator.java b/src/java/com/twitter/search/earlybird_root/mergers/PartitionResponseAccumulator.java deleted file mode 100644 index 22fcb101c..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/PartitionResponseAccumulator.java +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; - - -public final class PartitionResponseAccumulator extends ResponseAccumulator { - private static final String TARGET_TYPE_PARTITION = "partition"; - - @Override - public String getNameForLogging(int responseIndex, int numTotalResponses) { - return TARGET_TYPE_PARTITION + responseIndex; - } - - @Override - public String getNameForEarlybirdResponseCodeStats(int responseIndex, int numTotalResponses) { - // We do not need to differentiate between partitions: we just want to get the number of - // responses returned by Earlybirds, for each EarlybirdResponseCode. 
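As the comment above notes, partition merging only needs aggregate counts per EarlybirdResponseCode rather than per-partition stats, which is why every partition maps to the same stat name. A tiny generic sketch of keeping such per-code counts with an EnumMap; the production code exports SearchCounter metrics instead:

```java
import java.util.EnumMap;
import java.util.Map;

/** Counts responses per response-code enum value, ignoring which partition sent them. */
final class ResponseCodeCounters<E extends Enum<E>> {
  private final Map<E, Long> counts;

  ResponseCodeCounters(Class<E> codeType) {
    this.counts = new EnumMap<>(codeType);
  }

  void record(E code) {
    counts.merge(code, 1L, Long::sum);
  }

  long get(E code) {
    return counts.getOrDefault(code, 0L);
  }
}
```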
- return TARGET_TYPE_PARTITION; - } - - @Override - boolean shouldEarlyTerminateMerge(EarlyTerminateTierMergePredicate merger) { - return false; - } - - @Override - public void handleSkippedResponse(EarlybirdResponseCode responseCode) { } - - @Override - public void handleErrorResponse(EarlybirdResponse response) { - } - - @Override - public AccumulatedResponses.PartitionCounts getPartitionCounts() { - return new AccumulatedResponses.PartitionCounts(getNumResponses(), - getSuccessResponses().size() + getSuccessfulEmptyResponseCount(), null); - } - - @Override - protected boolean isMergingAcrossTiers() { - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/RecencyResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/RecencyResponseMerger.java deleted file mode 100644 index bc4742493..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/RecencyResponseMerger.java +++ /dev/null @@ -1,638 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.Collections; -import java.util.List; -import java.util.Map; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.common.relevance.utils.ResultComparators; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.collectors.RecencyMergeCollector; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.ALREADY_EARLY_TERMINATED; -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.FILTERED; -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.FILTERED_AND_TRUNCATED; -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.NOT_EARLY_TERMINATED; -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.TERMINATED_GOT_EXACT_NUM_RESULTS; -import static com.twitter.search.earlybird_root.mergers.RecencyResponseMerger - .EarlyTerminationTrimmingStats.Type.TRUNCATED; - -/** - * Merger class to merge recency search EarlybirdResponse objects. 
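RecencyResponseMerger below merges per-partition result lists that are sorted by tweet ID and then drops everything below the merged minSearchedStatusID, taken here to be the largest of the per-partition minimums, which matches the worked example in the clearEarlyTerminationIfReachingTierBottom Javadoc further down. A simplified sketch over bare IDs; the real merger uses RecencyMergeCollector and ThriftSearchResults:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RecencyMergeSketch {
  /** One partition's slice: tweet IDs sorted descending plus the lowest ID it fully searched. */
  record PartitionResults(List<Long> idsDescending, long minSearchedId) { }

  /**
   * Merges partition results by descending tweet ID and drops everything below the merged
   * minSearchedStatusID (the largest of the per-partition minimums). IDs below that bound are
   * not guaranteed to have been searched by every partition, so keeping them would break
   * pagination.
   */
  static List<Long> merge(List<PartitionResults> partitions) {
    long mergedMinSearchedId = partitions.stream()
        .mapToLong(PartitionResults::minSearchedId)
        .max()
        .orElse(Long.MIN_VALUE);

    List<Long> merged = new ArrayList<>();
    for (PartitionResults partition : partitions) {
      merged.addAll(partition.idsDescending());
    }
    merged.sort(Comparator.reverseOrder());
    merged.removeIf(id -> id < mergedMinSearchedId);
    return merged;
  }

  public static void main(String[] args) {
    // Mirrors the example in the Javadoc below: [101, 91, 81] (min 81) and [102, 92] (min 92)
    // merge to [102, 101, 92] with a merged minSearchedStatusID of 92.
    System.out.println(merge(List.of(
        new PartitionResults(List.of(101L, 91L, 81L), 81L),
        new PartitionResults(List.of(102L, 92L), 92L))));
  }
}
```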
- */ -public class RecencyResponseMerger extends EarlybirdResponseMerger { - private static final Logger LOG = LoggerFactory.getLogger(RecencyResponseMerger.class); - - private static final SearchTimerStats RECENCY_TIMER = - SearchTimerStats.export("merge_recency", TimeUnit.NANOSECONDS, false, true); - - @VisibleForTesting - static final String TERMINATED_COLLECTED_ENOUGH_RESULTS = - "terminated_collected_enough_results"; - - // Allowed replication lag relative to all replicas. Replication lag exceeding - // this amount may result in some tweets from the replica not returned in search. - private static final long ALLOWED_REPLICATION_LAG_MS = 10000; - - private static final double SUCCESSFUL_RESPONSE_THRESHOLD = 0.9; - - @VisibleForTesting - static final SearchCounter RECENCY_ZERO_RESULT_COUNT_AFTER_FILTERING_MAX_MIN_IDS = - SearchCounter.export("merger_recency_zero_result_count_after_filtering_max_min_ids"); - - @VisibleForTesting - static final SearchCounter RECENCY_TRIMMED_TOO_MANY_RESULTS_COUNT = - SearchCounter.export("merger_recency_trimmed_too_many_results_count"); - - private static final SearchCounter RECENCY_TIER_MERGE_EARLY_TERMINATED_WITH_NOT_ENOUGH_RESULTS = - SearchCounter.export("merger_recency_tier_merge_early_terminated_with_not_enough_results"); - - private static final SearchCounter RECENCY_CLEARED_EARLY_TERMINATION_COUNT = - SearchCounter.export("merger_recency_cleared_early_termination_count"); - - /** - * Results were truncated because merged results exceeded the requested numResults. - */ - @VisibleForTesting - static final String MERGING_EARLY_TERMINATION_REASON_TRUNCATED = - "root_merging_truncated_results"; - - /** - * Results that were were filtered smaller than merged minSearchedStatusId were filtered out. - */ - @VisibleForTesting - static final String MERGING_EARLY_TERMINATION_REASON_FILTERED = - "root_merging_filtered_results"; - - @VisibleForTesting - static final EarlyTerminationTrimmingStats PARTITION_MERGING_EARLY_TERMINATION_TRIMMING_STATS = - new EarlyTerminationTrimmingStats("recency_partition_merging"); - - @VisibleForTesting - static final EarlyTerminationTrimmingStats TIER_MERGING_EARLY_TERMINATION_TRIMMING_STATS = - new EarlyTerminationTrimmingStats("recency_tier_merging"); - - @VisibleForTesting - static class EarlyTerminationTrimmingStats { - - enum Type { - /** - * The whole result was not terminated at all. - */ - NOT_EARLY_TERMINATED, - /** - * Was terminated before we did any trimming. - */ - ALREADY_EARLY_TERMINATED, - /** - * Was not terminated when merged, but results were filtered due to min/max ranges. - */ - FILTERED, - /** - * Was not terminated when merged, but results were truncated. - */ - TRUNCATED, - /** - * Was not terminated when merged, but results were filtered due to min/max ranges and - * truncated. - */ - FILTERED_AND_TRUNCATED, - /** - * When the search asks for X result, and we get exactly X results back, without trimming - * or truncating on the tail side (min_id side), we still mark the search as early terminated. - * This is because later tiers possibly has more results. - */ - TERMINATED_GOT_EXACT_NUM_RESULTS, - } - - /** - * A counter tracking merged responses for each {@link EarlyTerminationTrimmingStats.Type} - * define above. 
- */ - private final ImmutableMap searchCounterMap; - - EarlyTerminationTrimmingStats(String prefix) { - Map tempMap = Maps.newEnumMap(Type.class); - - tempMap.put(NOT_EARLY_TERMINATED, - SearchCounter.export(prefix + "_not_early_terminated_after_merging")); - tempMap.put(ALREADY_EARLY_TERMINATED, - SearchCounter.export(prefix + "_early_terminated_before_merge_trimming")); - tempMap.put(TRUNCATED, - SearchCounter.export(prefix + "_early_terminated_after_merging_truncated")); - tempMap.put(FILTERED, - SearchCounter.export(prefix + "_early_terminated_after_merging_filtered")); - tempMap.put(FILTERED_AND_TRUNCATED, - SearchCounter.export(prefix + "_early_terminated_after_merging_filtered_and_truncated")); - tempMap.put(TERMINATED_GOT_EXACT_NUM_RESULTS, - SearchCounter.export(prefix + "_early_terminated_after_merging_got_exact_num_results")); - - searchCounterMap = Maps.immutableEnumMap(tempMap); - } - - public SearchCounter getCounterFor(Type type) { - return searchCounterMap.get(type); - } - } - - private final EarlybirdFeatureSchemaMerger featureSchemaMerger; - - public RecencyResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator mode, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(requestContext, responses, mode); - this.featureSchemaMerger = featureSchemaMerger; - } - - @Override - protected double getDefaultSuccessResponseThreshold() { - return SUCCESSFUL_RESPONSE_THRESHOLD; - } - - @Override - protected SearchTimerStats getMergedResponseTimer() { - return RECENCY_TIMER; - } - - @Override - protected EarlybirdResponse internalMerge(EarlybirdResponse mergedResponse) { - // The merged maxSearchedStatusId and minSearchedStatusId - long maxId = findMaxFullySearchedStatusID(); - long minId = findMinFullySearchedStatusID(); - - RecencyMergeCollector collector = new RecencyMergeCollector(responses.size()); - int totalResultSize = addResponsesToCollector(collector); - ThriftSearchResults searchResults = collector.getAllSearchResults(); - - TrimStats trimStats = trimResults(searchResults, minId, maxId); - setMergedMaxSearchedStatusId(searchResults, maxId); - setMergedMinSearchedStatusId( - searchResults, minId, trimStats.getResultsTruncatedFromTailCount() > 0); - - mergedResponse.setSearchResults(searchResults); - - // Override some components of the response as appropriate to real-time. - searchResults.setHitCounts(aggregateHitCountMap()); - if (accumulatedResponses.isMergingPartitionsWithinATier() - && clearEarlyTerminationIfReachingTierBottom(mergedResponse)) { - RECENCY_CLEARED_EARLY_TERMINATION_COUNT.increment(); - } else { - setEarlyTerminationForTrimmedResults(mergedResponse, trimStats); - } - - responseMessageBuilder.debugVerbose("Hits: %s %s", totalResultSize, trimStats); - responseMessageBuilder.debugVerbose( - "Hash Partitioned Earlybird call completed successfully: %s", mergedResponse); - - featureSchemaMerger.collectAndSetFeatureSchemaInResponse( - searchResults, - requestContext, - "merger_recency_tier", - accumulatedResponses.getSuccessResponses()); - - return mergedResponse; - } - - /** - * When we reached tier bottom, pagination can stop working even though we haven't got - * all results. e.g. - * Results from partition 1: [101 91 81], minSearchedStatusId is 81 - * Results from Partition 2: [102 92], minSearchedStatusId is 92, not early terminated. - * - * After merge, we get [102, 101, 92], with minResultId == 92. Since results from - * partition 2 is not early terminated, 92 is the tier bottom here. 
Since results are - * filtered, early termination for merged result is set to true, so blender will call again, - * with maxDocId == 91. This time we get result: - * Results from partition 1: [91 81], minSearchedStatusId is 81 - * Results from partition 2: [], minSearchedStatusId is still 92 - * After merge we get [] and minSearchedStatusId is still 92. No progress can be made on - * pagination and clients get stuck. - * - * So in this case, we clear the early termination flag to tell blender there is no more - * result in this tier. Tweets below tier bottom will be missed, but that also happens - * without this step, as the next pagination call will return empty results anyway. - * So even if there is NOT overlap between tiers, this is still better. - * - * Return true if early termination is cleared due to this, otherwise return false. - * To be safe, we do nothing here to keep existing behavior and only override it in - * StrictRecencyResponseMerger. - */ - protected boolean clearEarlyTerminationIfReachingTierBottom(EarlybirdResponse mergedResponse) { - return false; - } - - /** - * Determines if the merged response should be early-terminated when it has exactly as many - * trimmed results as requested, as is not early-terminated because of other reasons. - */ - protected boolean shouldEarlyTerminateWhenEnoughTrimmedResults() { - return true; - } - - /** - * If the end results were trimmed in any way, reflect that in the response as a query that was - * early terminated. A response can be either (1) truncated because we merged more results than - * what was asked for with numResults, or (2) we filtered results that were smaller than the - * merged minSearchedStatusId. - * - * @param mergedResponse the merged response. - * @param trimStats trim stats for this merge. 
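 *
 * For example (hypothetical numbers): if the client asked for 20 results and the merged,
 * trimmed list was truncated from 30 down to 20, the merged response is marked as early
 * terminated and "root_merging_truncated_results" is added to the merged
 * early-termination reasons.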
- */ - private void setEarlyTerminationForTrimmedResults( - EarlybirdResponse mergedResponse, - TrimStats trimStats) { - - responseMessageBuilder.debugVerbose("Checking for merge trimming, trimStats %s", trimStats); - - EarlyTerminationTrimmingStats stats = getEarlyTerminationTrimmingStats(); - - EarlyTerminationInfo earlyTerminationInfo = mergedResponse.getEarlyTerminationInfo(); - Preconditions.checkNotNull(earlyTerminationInfo); - - if (!earlyTerminationInfo.isEarlyTerminated()) { - if (trimStats.getMinIdFilterCount() > 0 || trimStats.getResultsTruncatedFromTailCount() > 0) { - responseMessageBuilder.debugVerbose("Setting early termination, trimStats: %s, results: %s", - trimStats, mergedResponse); - - earlyTerminationInfo.setEarlyTerminated(true); - addEarlyTerminationReasons(earlyTerminationInfo, trimStats); - - if (trimStats.getMinIdFilterCount() > 0 - && trimStats.getResultsTruncatedFromTailCount() > 0) { - stats.getCounterFor(FILTERED_AND_TRUNCATED).increment(); - } else if (trimStats.getMinIdFilterCount() > 0) { - stats.getCounterFor(FILTERED).increment(); - } else if (trimStats.getResultsTruncatedFromTailCount() > 0) { - stats.getCounterFor(TRUNCATED).increment(); - } else { - Preconditions.checkState(false, "Invalid TrimStats: %s", trimStats); - } - } else if ((computeNumResultsToKeep() == mergedResponse.getSearchResults().getResultsSize()) - && shouldEarlyTerminateWhenEnoughTrimmedResults()) { - earlyTerminationInfo.setEarlyTerminated(true); - earlyTerminationInfo.addToMergedEarlyTerminationReasons( - TERMINATED_COLLECTED_ENOUGH_RESULTS); - stats.getCounterFor(TERMINATED_GOT_EXACT_NUM_RESULTS).increment(); - } else { - stats.getCounterFor(NOT_EARLY_TERMINATED).increment(); - } - } else { - stats.getCounterFor(ALREADY_EARLY_TERMINATED).increment(); - // Even if the results were already marked as early terminated, we can add additional - // reasons for debugging (if the merged results were filtered or truncated). - addEarlyTerminationReasons(earlyTerminationInfo, trimStats); - } - } - - private void addEarlyTerminationReasons( - EarlyTerminationInfo earlyTerminationInfo, - TrimStats trimStats) { - - if (trimStats.getMinIdFilterCount() > 0) { - earlyTerminationInfo.addToMergedEarlyTerminationReasons( - MERGING_EARLY_TERMINATION_REASON_FILTERED); - } - - if (trimStats.getResultsTruncatedFromTailCount() > 0) { - earlyTerminationInfo.addToMergedEarlyTerminationReasons( - MERGING_EARLY_TERMINATION_REASON_TRUNCATED); - } - } - - private EarlyTerminationTrimmingStats getEarlyTerminationTrimmingStats() { - if (accumulatedResponses.isMergingPartitionsWithinATier()) { - return getEarlyTerminationTrimmingStatsForPartitions(); - } else { - return getEarlyTerminationTrimmingStatsForTiers(); - } - } - - protected EarlyTerminationTrimmingStats getEarlyTerminationTrimmingStatsForPartitions() { - return PARTITION_MERGING_EARLY_TERMINATION_TRIMMING_STATS; - } - - protected EarlyTerminationTrimmingStats getEarlyTerminationTrimmingStatsForTiers() { - return TIER_MERGING_EARLY_TERMINATION_TRIMMING_STATS; - } - - /** - * If we get enough results, no need to go on. - * If one of the partitions early terminated, we can't go on or else there could be a gap. 
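 *
 * For example (hypothetical numbers): with 20 results requested, if the partitions of a
 * tier together returned only 15 results but one of them early terminated, we still stop
 * the tier merge (and increment the "not enough results" counter), because continuing to
 * the next tier could leave a gap of unsearched tweets.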
- */ - @Override - public boolean shouldEarlyTerminateTierMerge(int totalResultsFromSuccessfulShards, - boolean foundEarlyTermination) { - - - int resultsRequested = computeNumResultsToKeep(); - - boolean shouldEarlyTerminate = foundEarlyTermination - || totalResultsFromSuccessfulShards >= resultsRequested; - - if (shouldEarlyTerminate && totalResultsFromSuccessfulShards < resultsRequested) { - RECENCY_TIER_MERGE_EARLY_TERMINATED_WITH_NOT_ENOUGH_RESULTS.increment(); - } - - return shouldEarlyTerminate; - } - - /** - * Find the min status id that has been _completely_ searched across all partitions. The - * largest min status id across all partitions. - * - * @return the min searched status id found - */ - protected long findMinFullySearchedStatusID() { - List minIds = accumulatedResponses.getMinIds(); - if (minIds.isEmpty()) { - return Long.MIN_VALUE; - } - - if (accumulatedResponses.isMergingPartitionsWithinATier()) { - // When merging partitions, the min ID should be the largest among the min IDs. - return Collections.max(accumulatedResponses.getMinIds()); - } else { - // When merging tiers, the min ID should be the smallest among the min IDs. - return Collections.min(accumulatedResponses.getMinIds()); - } - } - - /** - * Find the max status id that has been _completely_ searched across all partitions. The - * smallest max status id across all partitions. - * - * This is where we reconcile replication lag by selecting the oldest maxid from the - * partitions searched. - * - * @return the max searched status id found - */ - protected long findMaxFullySearchedStatusID() { - List maxIDs = accumulatedResponses.getMaxIds(); - if (maxIDs.isEmpty()) { - return Long.MAX_VALUE; - } - Collections.sort(maxIDs); - - final long newest = maxIDs.get(maxIDs.size() - 1); - final long newestTimestamp = SnowflakeIdParser.getTimestampFromTweetId(newest); - - for (int i = 0; i < maxIDs.size(); i++) { - long oldest = maxIDs.get(i); - long oldestTimestamp = SnowflakeIdParser.getTimestampFromTweetId(oldest); - long deltaMs = newestTimestamp - oldestTimestamp; - - if (i == 0) { - LOG.debug("Max delta is {}", deltaMs); - } - - if (deltaMs < ALLOWED_REPLICATION_LAG_MS) { - if (i != 0) { - LOG.debug("{} partition replicas lagging more than {} ms", i, ALLOWED_REPLICATION_LAG_MS); - } - return oldest; - } - } - - // Can't get here - by this point oldest == newest, and delta is 0. - return newest; - } - - /** - * Trim the ThriftSearchResults if we have enough results, to return the first - * 'computeNumResultsToKeep()' number of results. 
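 *
 * For example (hypothetical IDs), when merging partitions within a tier with 3 results
 * requested and a merged searched range of [300, 900]: collected results
 * [950, 800, 700, 600, 250] are filtered to [800, 700, 600] (950 and 250 fall outside the
 * range) and then truncated to the requested count if necessary.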
- * - * If we don't have enough results after trimming, this function will first try to back-fill - * older results, then newer results. - * - * @param searchResults ThriftSearchResults that holds the to-be-trimmed list of ThriftSearchResult objects - * @return TrimStats containing statistics about how many results were removed - */ - protected TrimStats trimResults( - ThriftSearchResults searchResults, - long mergedMin, - long mergedMax) { - if (!searchResults.isSetResults() || searchResults.getResultsSize() == 0) { - // no results, no trimming needed - return TrimStats.EMPTY_STATS; - } - - if (requestContext.getRequest().getSearchQuery().isSetSearchStatusIds()) { - // Not a normal search, no trimming needed - return TrimStats.EMPTY_STATS; - } - - TrimStats trimStats = new TrimStats(); - trimExactDups(searchResults, trimStats); - - int numResultsRequested = computeNumResultsToKeep(); - if (shouldSkipTrimmingWhenNotEnoughResults(searchResults, numResultsRequested)) { - ////////////////////////////////////////////////////////// - // We don't have enough results, let's not do trimming - ////////////////////////////////////////////////////////// - return trimStats; - } - - if (accumulatedResponses.isMergingPartitionsWithinATier()) { - trimResultsBasedSearchedRange( - searchResults, trimStats, numResultsRequested, mergedMin, mergedMax); - } - - // Respect "computeNumResultsToKeep()" here, only keep "computeNumResultsToKeep()" results. - truncateResults(searchResults, trimStats); - - return trimStats; - } - - /** - * When there are not enough results, we don't remove results based on the searched range. - * This is a tradeoff: we don't reduce our recall when we already don't have enough - * results, but we can lose results while paginating because we may return results - * outside of the valid searched range. - */ - protected boolean shouldSkipTrimmingWhenNotEnoughResults( - ThriftSearchResults searchResults, int numResultsRequested) { - return searchResults.getResultsSize() <= numResultsRequested; - } - - - /** - * Trim results based on the searched range. The search range [x, y] is determined by: - * x is the maximum of the minimum searched IDs; - * y is the minimum of the maximum searched IDs. - * - * IDs outside of this range are removed. - * If we do not get enough results after the removal, we add IDs back until we get enough results. - * We first add IDs back from the older side. If there are still not enough results, - * we start adding IDs back from the newer side. - */ - private void trimResultsBasedSearchedRange(ThriftSearchResults searchResults, - TrimStats trimStats, - int numResultsRequested, - long mergedMin, - long mergedMax) { - /////////////////////////////////////////////////////////////////// - // we have more results than requested, let's do some trimming - /////////////////////////////////////////////////////////////////// - - // Save the original results before trimming - List<ThriftSearchResult> originalResults = searchResults.getResults(); - - filterResultsByMergedMinMaxIds(searchResults, mergedMax, mergedMin, trimStats); - - // This does happen. It is hard to say what we should do here, so we just return the original - // results here. - if (searchResults.getResultsSize() == 0) { - RECENCY_ZERO_RESULT_COUNT_AFTER_FILTERING_MAX_MIN_IDS.increment(); - searchResults.setResults(originalResults); - - // Clean up the min/max filtered counts, since we're bringing back whatever we just filtered.
- trimStats.clearMaxIdFilterCount(); - trimStats.clearMinIdFilterCount(); - - if (LOG.isDebugEnabled() || responseMessageBuilder.isDebugMode()) { - String errMsg = "No trimming is done as filtered results is empty. " - + "maxId=" + mergedMax + ",minId=" + mergedMin; - LOG.debug(errMsg); - responseMessageBuilder.append(errMsg + "\n"); - } - } else { - // oops! we're trimming too many results. Let's put some back - if (searchResults.getResultsSize() < numResultsRequested) { - RECENCY_TRIMMED_TOO_MANY_RESULTS_COUNT.increment(); - - List trimmedResults = searchResults.getResults(); - long firstTrimmedResultId = trimmedResults.get(0).getId(); - long lastTrimmedResultId = trimmedResults.get(trimmedResults.size() - 1).getId(); - - // First, try to back fill with older results - int i = 0; - for (; i < originalResults.size(); ++i) { - ThriftSearchResult result = originalResults.get(i); - if (result.getId() < lastTrimmedResultId) { - trimmedResults.add(result); - trimStats.decreaseMinIdFilterCount(); - if (trimmedResults.size() >= numResultsRequested) { - break; - } - } - } - - // still not enough results? back fill with newer results - // find the oldest of the newer results - if (trimmedResults.size() < numResultsRequested) { - // still not enough results? back fill with newer results - // find the oldest of the newer results - for (i = originalResults.size() - 1; i >= 0; --i) { - ThriftSearchResult result = originalResults.get(i); - if (result.getId() > firstTrimmedResultId) { - trimmedResults.add(result); - trimStats.decreaseMaxIdFilterCount(); - if (trimmedResults.size() >= numResultsRequested) { - break; - } - } - } - - // newer results were added to the back of the list, re-sort - Collections.sort(trimmedResults, ResultComparators.ID_COMPARATOR); - } - } - } - } - - protected void setMergedMinSearchedStatusId( - ThriftSearchResults searchResults, - long currentMergedMin, - boolean resultsWereTrimmed) { - if (accumulatedResponses.getMinIds().isEmpty()) { - return; - } - - long merged; - if (searchResults == null - || !searchResults.isSetResults() - || searchResults.getResultsSize() == 0) { - merged = currentMergedMin; - } else { - List results = searchResults.getResults(); - long firstResultId = results.get(0).getId(); - long lastResultId = results.get(results.size() - 1).getId(); - merged = Math.min(firstResultId, lastResultId); - if (!resultsWereTrimmed) { - // If the results were trimmed, we want to set minSearchedStatusID to the smallest - // tweet ID in the response. Otherwise, we want to take the min between that, and - // the current minSearchedStatusID. 
- merged = Math.min(merged, currentMergedMin); - } - } - - searchResults.setMinSearchedStatusID(merged); - } - - private void setMergedMaxSearchedStatusId( - ThriftSearchResults searchResults, - long currentMergedMax) { - if (accumulatedResponses.getMaxIds().isEmpty()) { - return; - } - - long merged; - if (searchResults == null - || !searchResults.isSetResults() - || searchResults.getResultsSize() == 0) { - merged = currentMergedMax; - } else { - List results = searchResults.getResults(); - long firstResultId = results.get(0).getId(); - long lastResultId = results.get(results.size() - 1).getId(); - long maxResultId = Math.max(firstResultId, lastResultId); - merged = Math.max(maxResultId, currentMergedMax); - } - - searchResults.setMaxSearchedStatusID(merged); - } - - protected static void filterResultsByMergedMinMaxIds( - ThriftSearchResults results, long maxStatusId, long minStatusId, TrimStats trimStats) { - List trimedResults = - Lists.newArrayListWithCapacity(results.getResultsSize()); - - for (ThriftSearchResult result : results.getResults()) { - long statusId = result.getId(); - - if (statusId > maxStatusId) { - trimStats.increaseMaxIdFilterCount(); - } else if (statusId < minStatusId) { - trimStats.increaseMinIdFilterCount(); - } else { - trimedResults.add(result); - } - } - - results.setResults(trimedResults); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/RelevanceResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/RelevanceResponseMerger.java deleted file mode 100644 index e58e79951..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/RelevanceResponseMerger.java +++ /dev/null @@ -1,268 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.Collections; -import java.util.List; -import java.util.Map; -import java.util.Set; -import java.util.TreeMap; -import java.util.concurrent.TimeUnit; -import java.util.stream.Collectors; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Function; -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.constants.thriftjava.ThriftLanguage; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.common.util.earlybird.ResultsUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.collectors.RelevanceMergeCollector; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * Merger class to merge relevance search EarlybirdResponse objects - */ -public class RelevanceResponseMerger extends EarlybirdResponseMerger { - private static final Logger LOG = LoggerFactory.getLogger(RelevanceResponseMerger.class); - - private static final SearchTimerStats TIMER = - SearchTimerStats.export("merge_relevance", TimeUnit.NANOSECONDS, false, true); - - private static final SearchCounter 
RELVEANCE_TIER_MERGE_EARLY_TERMINATED_WITH_NOT_ENOUGH_RESULTS = - SearchCounter.export("merger_relevance_tier_merge_early_terminated_with_not_enough_results"); - - private static final String PARTITION_NUM_RESULTS_COUNTER_SKIP_STATS = - "merger_relevance_post_trimmed_results_skip_stat_tier_%s_partition_%d"; - - @VisibleForTesting - public static final String PARTITION_NUM_RESULTS_COUNTER_NAME_FORMAT = - "merger_relevance_post_trimmed_results_from_tier_%s_partition_%d"; - - protected static final Function> LANG_MAP_GETTER = - response -> response.getSearchResults() == null - ? null - : response.getSearchResults().getLanguageHistogram(); - - private static final double SUCCESSFUL_RESPONSE_THRESHOLD = 0.8; - - private final EarlybirdFeatureSchemaMerger featureSchemaMerger; - - // The number of partitions are not meaningful when it is invoked through multi-tier merging. - private final int numPartitions; - - public RelevanceResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator mode, - EarlybirdFeatureSchemaMerger featureSchemaMerger, - int numPartitions) { - super(requestContext, responses, mode); - this.featureSchemaMerger = Preconditions.checkNotNull(featureSchemaMerger); - this.numPartitions = numPartitions; - } - - @Override - protected double getDefaultSuccessResponseThreshold() { - return SUCCESSFUL_RESPONSE_THRESHOLD; - } - - @Override - protected SearchTimerStats getMergedResponseTimer() { - return TIMER; - } - - @Override - protected EarlybirdResponse internalMerge(EarlybirdResponse mergedResponse) { - final ThriftSearchQuery searchQuery = requestContext.getRequest().getSearchQuery(); - long maxId = findMaxFullySearchedStatusID(); - long minId = findMinFullySearchedStatusID(); - - Preconditions.checkNotNull(searchQuery); - Preconditions.checkState(searchQuery.isSetRankingMode()); - Preconditions.checkState(searchQuery.getRankingMode() == ThriftSearchRankingMode.RELEVANCE); - - // First get the results in score order (the default comparator for this merge collector). - RelevanceMergeCollector collector = new RelevanceMergeCollector(responses.size()); - int totalResultSize = addResponsesToCollector(collector); - ThriftSearchResults searchResults = collector.getAllSearchResults(); - - TrimStats trimStats = trimResults(searchResults); - featureSchemaMerger.collectAndSetFeatureSchemaInResponse( - searchResults, - requestContext, - "merger_relevance_tier", - accumulatedResponses.getSuccessResponses()); - - mergedResponse.setSearchResults(searchResults); - - searchResults = mergedResponse.getSearchResults(); - searchResults - .setHitCounts(aggregateHitCountMap()) - .setLanguageHistogram(aggregateLanguageHistograms()); - - if (!accumulatedResponses.getMaxIds().isEmpty()) { - searchResults.setMaxSearchedStatusID(maxId); - } - - if (!accumulatedResponses.getMinIds().isEmpty()) { - searchResults.setMinSearchedStatusID(minId); - } - - LOG.debug("Hits: {} Removed duplicates: {}", totalResultSize, trimStats.getRemovedDupsCount()); - LOG.debug("Hash Partition'ed Earlybird call completed successfully: {}", mergedResponse); - - publishNumResultsFromPartitionStatistics(mergedResponse); - - return mergedResponse; - } - - /** - * If any of the partitions has an early termination, the tier merge must also early terminate. - * - * If a partition early terminated (we haven't fully searched that partition), and we instead - * moved onto the next tier, there will be a gap of unsearched results. 
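 *
 * For example (hypothetical IDs): if partition A early terminated after searching down to
 * tweet 600 while partition B searched down to 300, moving on to the next tier now would
 * permanently skip any matching tweets between 300 and 600 that partition A never visited.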
- * - * If our early termination condition was only if we had enough results, we could get bad quality - * results by only looking at 20 hits when asking for 20 results. - */ - @Override - public boolean shouldEarlyTerminateTierMerge(int totalResultsFromSuccessfulShards, - boolean foundEarlyTermination) { - - // Don't use computeNumResultsToKeep because if returnAllResults is true, it will be - // Integer.MAX_VALUE and we will always log a stat that we didn't get enough results - int resultsRequested; - EarlybirdRequest request = requestContext.getRequest(); - if (request.isSetNumResultsToReturnAtRoot()) { - resultsRequested = request.getNumResultsToReturnAtRoot(); - } else { - resultsRequested = request.getSearchQuery().getCollectorParams().getNumResultsToReturn(); - } - if (foundEarlyTermination && totalResultsFromSuccessfulShards < resultsRequested) { - RELVEANCE_TIER_MERGE_EARLY_TERMINATED_WITH_NOT_ENOUGH_RESULTS.increment(); - } - - return foundEarlyTermination; - } - - /** - * Merge language histograms from all queries. - * - * @return Merge per-language count map. - */ - private Map aggregateLanguageHistograms() { - Map totalLangCounts = new TreeMap<>( - ResultsUtil.aggregateCountMap( - accumulatedResponses.getSuccessResponses(), LANG_MAP_GETTER)); - if (totalLangCounts.size() > 0) { - if (responseMessageBuilder.isDebugMode()) { - responseMessageBuilder.append("Language Distrbution:\n"); - int count = 0; - for (Map.Entry entry : totalLangCounts.entrySet()) { - responseMessageBuilder.append( - String.format(" %10s:%6d", entry.getKey(), entry.getValue())); - if (++count % 5 == 0) { - responseMessageBuilder.append("\n"); - } - } - responseMessageBuilder.append("\n"); - } - } - return totalLangCounts; - } - - /** - * Find the min status id that has been searched. Since no results are trimmed for Relevance mode, - * it should be the smallest among the min IDs. - */ - private long findMinFullySearchedStatusID() { - // The min ID should be the smallest among the min IDs - return accumulatedResponses.getMinIds().isEmpty() ? 0 - : Collections.min(accumulatedResponses.getMinIds()); - } - - /** - * Find the max status id that has been searched. Since no results are trimmed for Relevance mode, - * it should be the largest among the max IDs. - */ - private long findMaxFullySearchedStatusID() { - // The max ID should be the largest among the max IDs - return accumulatedResponses.getMaxIds().isEmpty() ? 0 - : Collections.max(accumulatedResponses.getMaxIds()); - } - - /** - * Return all the searchResults except duplicates. 
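 * Unlike the recency merger, no filtering against the merged min/max searched IDs is done
 * here: results are only de-duplicated and then truncated to the requested size.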
- * - * @param searchResults ThriftSearchResults that hold the to be trimmed List - * @return TrimStats containing statistics about how many results being removed - */ - private TrimStats trimResults(ThriftSearchResults searchResults) { - if (!searchResults.isSetResults() || searchResults.getResultsSize() == 0) { - // no results, no trimming needed - return TrimStats.EMPTY_STATS; - } - - if (requestContext.getRequest().getSearchQuery().isSetSearchStatusIds()) { - // Not a normal search, no trimming needed - return TrimStats.EMPTY_STATS; - } - - TrimStats trimStats = new TrimStats(); - trimExactDups(searchResults, trimStats); - - truncateResults(searchResults, trimStats); - - return trimStats; - } - - private void publishNumResultsFromPartitionStatistics(EarlybirdResponse mergedResponse) { - - // Keep track of all of the results that were kept after merging - Set mergedResults = - EarlybirdResponseUtil.getResults(mergedResponse).getResults() - .stream() - .map(result -> result.getId()) - .collect(Collectors.toSet()); - - // For each successful response (pre merge), count how many of its results were kept post merge. - // Increment the appropriate stat. - for (EarlybirdResponse response : accumulatedResponses.getSuccessResponses()) { - if (!response.isSetEarlybirdServerStats()) { - continue; - } - int numResultsKept = 0; - for (ThriftSearchResult result - : EarlybirdResponseUtil.getResults(response).getResults()) { - if (mergedResults.contains(result.getId())) { - ++numResultsKept; - } - } - - // We only update partition stats when the partition ID looks sane. - String tierName = response.getEarlybirdServerStats().getTierName(); - int partition = response.getEarlybirdServerStats().getPartition(); - if (partition >= 0 && partition < numPartitions) { - SearchCounter.export(String.format(PARTITION_NUM_RESULTS_COUNTER_NAME_FORMAT, - tierName, - partition)) - .add(numResultsKept); - } else { - SearchCounter.export(String.format(PARTITION_NUM_RESULTS_COUNTER_SKIP_STATS, - tierName, - partition)).increment(); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/ResponseAccumulator.java b/src/java/com/twitter/search/earlybird_root/mergers/ResponseAccumulator.java deleted file mode 100644 index ad0daa5f3..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/ResponseAccumulator.java +++ /dev/null @@ -1,356 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.ArrayList; -import java.util.EnumMap; -import java.util.List; -import java.util.Map; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.earlybird.ResponseMergerUtils; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; - -/** - * Accumulates EarlybirdResponse's and determines when to early terminate. 
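 *
 * Expected usage, roughly: the merge logic calls addResponse(...) once per partition or
 * tier response, may ask shouldEarlyTerminateMerge(...) whether to stop fanning out
 * further, and finally calls getAccumulatedResults() to build the AccumulatedResponses
 * that drive the actual merge.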
- */ -public abstract class ResponseAccumulator { - - @VisibleForTesting - static class MinMaxSearchedIdStats { - /** How many results did we actually check */ - private final SearchCounter checkedMaxMinSearchedStatusId; - private final SearchCounter unsetMaxSearchedStatusId; - private final SearchCounter unsetMinSearchedStatusId; - private final SearchCounter unsetMaxAndMinSearchedStatusId; - private final SearchCounter sameMinMaxSearchedIdWithoutResults; - private final SearchCounter sameMinMaxSearchedIdWithOneResult; - private final SearchCounter sameMinMaxSearchedIdWithResults; - private final SearchCounter flippedMinMaxSearchedId; - - MinMaxSearchedIdStats(EarlybirdRequestType requestType) { - String statPrefix = "merge_helper_" + requestType.getNormalizedName(); - - checkedMaxMinSearchedStatusId = SearchCounter.export(statPrefix - + "_max_min_searched_id_checks"); - unsetMaxSearchedStatusId = SearchCounter.export(statPrefix - + "_unset_max_searched_status_id"); - unsetMinSearchedStatusId = SearchCounter.export(statPrefix - + "_unset_min_searched_status_id"); - unsetMaxAndMinSearchedStatusId = SearchCounter.export(statPrefix - + "_unset_max_and_min_searched_status_id"); - sameMinMaxSearchedIdWithoutResults = SearchCounter.export(statPrefix - + "_same_min_max_searched_id_without_results"); - sameMinMaxSearchedIdWithOneResult = SearchCounter.export(statPrefix - + "_same_min_max_searched_id_with_one_results"); - sameMinMaxSearchedIdWithResults = SearchCounter.export(statPrefix - + "_same_min_max_searched_id_with_results"); - flippedMinMaxSearchedId = SearchCounter.export(statPrefix - + "_flipped_min_max_searched_id"); - } - - @VisibleForTesting - SearchCounter getCheckedMaxMinSearchedStatusId() { - return checkedMaxMinSearchedStatusId; - } - - @VisibleForTesting - SearchCounter getFlippedMinMaxSearchedId() { - return flippedMinMaxSearchedId; - } - - @VisibleForTesting - SearchCounter getUnsetMaxSearchedStatusId() { - return unsetMaxSearchedStatusId; - } - - @VisibleForTesting - SearchCounter getUnsetMinSearchedStatusId() { - return unsetMinSearchedStatusId; - } - - @VisibleForTesting - SearchCounter getUnsetMaxAndMinSearchedStatusId() { - return unsetMaxAndMinSearchedStatusId; - } - - @VisibleForTesting - SearchCounter getSameMinMaxSearchedIdWithoutResults() { - return sameMinMaxSearchedIdWithoutResults; - } - - @VisibleForTesting - SearchCounter getSameMinMaxSearchedIdWithOneResult() { - return sameMinMaxSearchedIdWithOneResult; - } - - @VisibleForTesting - SearchCounter getSameMinMaxSearchedIdWithResults() { - return sameMinMaxSearchedIdWithResults; - } - } - - @VisibleForTesting - static final Map MIN_MAX_SEARCHED_ID_STATS_MAP; - static { - EnumMap statsMap - = Maps.newEnumMap(EarlybirdRequestType.class); - for (EarlybirdRequestType earlybirdRequestType : EarlybirdRequestType.values()) { - statsMap.put(earlybirdRequestType, new MinMaxSearchedIdStats(earlybirdRequestType)); - } - - MIN_MAX_SEARCHED_ID_STATS_MAP = Maps.immutableEnumMap(statsMap); - } - - // Merge has encountered at least one early terminated response. - private boolean foundEarlyTermination = false; - // Empty but successful response counter (E.g. when a tier or partition is skipped) - private int successfulEmptyResponseCount = 0; - // The list of the successful responses from all earlybird futures. This does not include empty - // responses resulted from null requests. - private final List successResponses = new ArrayList<>(); - // The list of the error responses from all earlybird futures. 
- private final List errorResponses = new ArrayList<>(); - // the list of max statusIds seen in each earlybird. - private final List maxIds = new ArrayList<>(); - // the list of min statusIds seen in each earlybird. - private final List minIds = new ArrayList<>(); - - private int numResponses = 0; - - private int numResultsAccumulated = 0; - private int numSearchedSegments = 0; - - /** - * Returns a string that can be used for logging to identify a single response out of all the - * responses that are being merged. - * - * @param responseIndex the index of a response's partition or tier, depending on the type of - * responses being accumulated. - * @param numTotalResponses the total number of partitions or tiers that are being merged. - */ - public abstract String getNameForLogging(int responseIndex, int numTotalResponses); - - /** - * Returns a string that is used to export per-EarlybirdResponseCode stats for partitions and tiers. - * - * @param responseIndex the index of of a response's partition or tier. - * @param numTotalResponses the total number of partitions or tiers that are being merged. - * @return a string that is used to export per-EarlybirdResponseCode stats for partitions and tiers. - */ - public abstract String getNameForEarlybirdResponseCodeStats( - int responseIndex, int numTotalResponses); - - abstract boolean shouldEarlyTerminateMerge(EarlyTerminateTierMergePredicate merger); - - /** - * Add a EarlybirdResponse - */ - public void addResponse(EarlybirdResponseDebugMessageBuilder responseMessageBuilder, - EarlybirdRequest request, - EarlybirdResponse response) { - numResponses++; - numSearchedSegments += response.getNumSearchedSegments(); - - if (isSkippedResponse(response)) { - // This is an empty response, no processing is required, just need to update statistics. - successfulEmptyResponseCount++; - handleSkippedResponse(response.getResponseCode()); - } else if (isErrorResponse(response)) { - errorResponses.add(response); - handleErrorResponse(response); - } else { - handleSuccessfulResponse(responseMessageBuilder, request, response); - } - } - - private boolean isErrorResponse(EarlybirdResponse response) { - return !response.isSetResponseCode() - || response.getResponseCode() != EarlybirdResponseCode.SUCCESS; - } - - private boolean isSkippedResponse(EarlybirdResponse response) { - return response.isSetResponseCode() - && (response.getResponseCode() == EarlybirdResponseCode.PARTITION_SKIPPED - || response.getResponseCode() == EarlybirdResponseCode.TIER_SKIPPED); - } - - /** - * Record a response corresponding to a skipped partition or skipped tier. - */ - protected abstract void handleSkippedResponse(EarlybirdResponseCode responseCode); - - /** - * Handle an error response - */ - protected abstract void handleErrorResponse(EarlybirdResponse response); - - /** - * Subclasses can override this to perform more successful response handling. - */ - protected void extraSuccessfulResponseHandler(EarlybirdResponse response) { } - - /** - * Whether the helper is for merging results from partitions within a single tier. - */ - protected final boolean isMergingPartitionsWithinATier() { - return !isMergingAcrossTiers(); - } - - /** - * Whether the helper is for merging results across different tiers. - */ - protected abstract boolean isMergingAcrossTiers(); - - - /** - * Record a successful response. 
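 * Adds the response to the successful-response list, accumulates its result count, records
 * its min/max searched status IDs (updating the per-request-type stats), remembers whether
 * any response so far was early terminated, and then lets subclasses do any extra handling.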
- */ - public final void handleSuccessfulResponse( - EarlybirdResponseDebugMessageBuilder responseMessageBuilder, - EarlybirdRequest request, - EarlybirdResponse response) { - successResponses.add(response); - if (response.isSetSearchResults()) { - ThriftSearchResults searchResults = response.getSearchResults(); - numResultsAccumulated += searchResults.getResultsSize(); - - recordMinMaxSearchedIdsAndUpdateStats(responseMessageBuilder, request, response, - searchResults); - } - if (response.isSetEarlyTerminationInfo() - && response.getEarlyTerminationInfo().isEarlyTerminated()) { - foundEarlyTermination = true; - } - extraSuccessfulResponseHandler(response); - } - - private void recordMinMaxSearchedIdsAndUpdateStats( - EarlybirdResponseDebugMessageBuilder responseMessageBuidler, - EarlybirdRequest request, - EarlybirdResponse response, - ThriftSearchResults searchResults) { - - boolean isMaxIdSet = searchResults.isSetMaxSearchedStatusID(); - boolean isMinIdSet = searchResults.isSetMinSearchedStatusID(); - - if (isMaxIdSet) { - maxIds.add(searchResults.getMaxSearchedStatusID()); - } - if (isMinIdSet) { - minIds.add(searchResults.getMinSearchedStatusID()); - } - - updateMinMaxIdStats(responseMessageBuidler, request, response, searchResults, isMaxIdSet, - isMinIdSet); - } - - private void updateMinMaxIdStats( - EarlybirdResponseDebugMessageBuilder responseMessageBuilder, - EarlybirdRequest request, - EarlybirdResponse response, - ThriftSearchResults searchResults, - boolean isMaxIdSet, - boolean isMinIdSet) { - // Now just track the stats. - EarlybirdRequestType requestType = EarlybirdRequestType.of(request); - MinMaxSearchedIdStats minMaxSearchedIdStats = MIN_MAX_SEARCHED_ID_STATS_MAP.get(requestType); - - minMaxSearchedIdStats.checkedMaxMinSearchedStatusId.increment(); - if (isMaxIdSet && isMinIdSet) { - if (searchResults.getMinSearchedStatusID() > searchResults.getMaxSearchedStatusID()) { - // We do not expect this case to happen in production. - minMaxSearchedIdStats.flippedMinMaxSearchedId.increment(); - } else if (searchResults.getResultsSize() == 0 - && searchResults.getMaxSearchedStatusID() == searchResults.getMinSearchedStatusID()) { - minMaxSearchedIdStats.sameMinMaxSearchedIdWithoutResults.increment(); - responseMessageBuilder.debugVerbose( - "Got no results, and same min/max searched ids. Request: %s, Response: %s", - request, response); - } else if (searchResults.getResultsSize() == 1 - && searchResults.getMaxSearchedStatusID() == searchResults.getMinSearchedStatusID()) { - minMaxSearchedIdStats.sameMinMaxSearchedIdWithOneResult.increment(); - responseMessageBuilder.debugVerbose( - "Got one results, and same min/max searched ids. Request: %s, Response: %s", - request, response); - } else if (searchResults.getMaxSearchedStatusID() - == searchResults.getMinSearchedStatusID()) { - minMaxSearchedIdStats.sameMinMaxSearchedIdWithResults.increment(); - responseMessageBuilder.debugVerbose( - "Got multiple results, and same min/max searched ids. Request: %s, Response: %s", - request, response); - } - } else if (!isMaxIdSet && isMinIdSet) { - // We do not expect this case to happen in production. - minMaxSearchedIdStats.unsetMaxSearchedStatusId.increment(); - responseMessageBuilder.debugVerbose( - "Got unset maxSearchedStatusID. Request: %s, Response: %s", request, response); - } else if (isMaxIdSet && !isMinIdSet) { - // We do not expect this case to happen in production. 
- minMaxSearchedIdStats.unsetMinSearchedStatusId.increment(); - responseMessageBuilder.debugVerbose( - "Got unset minSearchedStatusID. Request: %s, Response: %s", request, response); - } else { - Preconditions.checkState(!isMaxIdSet && !isMinIdSet); - minMaxSearchedIdStats.unsetMaxAndMinSearchedStatusId.increment(); - responseMessageBuilder.debugVerbose( - "Got unset maxSearchedStatusID and minSearchedStatusID. Request: %s, Response: %s", - request, response); - } - } - - - /** - * Return partition counts with number of partitions, number of successful responses, and list of - * responses per tier. - */ - public abstract AccumulatedResponses.PartitionCounts getPartitionCounts(); - - public final AccumulatedResponses getAccumulatedResults() { - return new AccumulatedResponses(successResponses, - errorResponses, - maxIds, - minIds, - ResponseMergerUtils.mergeEarlyTerminationInfo(successResponses), - isMergingAcrossTiers(), - getPartitionCounts(), - getNumSearchedSegments()); - } - - // Getters are only intended to be used by subclasses. Other users should get data from - // AccumulatedResponses - - int getNumResponses() { - return numResponses; - } - - int getNumSearchedSegments() { - return numSearchedSegments; - } - - List getSuccessResponses() { - return successResponses; - } - - int getNumResultsAccumulated() { - return numResultsAccumulated; - } - - int getSuccessfulEmptyResponseCount() { - return successfulEmptyResponseCount; - } - - boolean foundError() { - return !errorResponses.isEmpty(); - } - - boolean foundEarlyTermination() { - return foundEarlyTermination; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/StrictRecencyResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/StrictRecencyResponseMerger.java deleted file mode 100644 index 4ea72717e..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/StrictRecencyResponseMerger.java +++ /dev/null @@ -1,297 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.Collections; -import java.util.List; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * A RecencyResponseMerger that prioritizes not losing results during pagination. - * As of now, this merger is used by Gnip to make sure that scrolling returns all results. - * - * The logic used for merging partitions is a bit tricky, because on one hand, we want to make sure - * that we do miss results on the next pagination request; on the other hand, we want to return as - * many results as we can, and we want to set the minSearchedStatusID of the merged response as low - * as we can, in order to minimize the number of pagination requests. - * - * The merging logic is: - * - * Realtime cluster: - * 1. merge results from all partitions - * 2. if at least one partition response is early-terminated, set earlyTerminated = true - * on the merged response - * 3. 
set trimmingMinId = max(minSearchedStatusIDs of all partition responses) - * 4. trim all results to trimmingMinId - * 5. set minSearchedStatusID on the merged response to trimmingMinId - * 6. if we have more than numRequested results: - * - keep only the newest numRequested results - * - set minSearchedStatusID of the merged response to the lowest tweet ID in the response - * 7. if at least one partition response is not early-terminated, set - * tierBottomId = max(minSearchedStatusIDs of all non-early-terminated responses) - * (otherwise, set tierBottomId to some undefined value: -1, Long.MAX_VALUE, etc.) - * 8. if minSearchedStatusID of the merged response is the same as tierBottomId, - * clear the early-termination flag on the merged response - * - * The logic in steps 7 and 8 can be a little tricky to understand. They basically say: when we've - * exhausted the "least deep" partition in the realtime cluster, it's time to move to the full - * archive cluster (if we keep going past the "least deep" partition, we might miss results). - * - * Full archive cluster: - * 1. merge results from all partitions - * 2. if at least one partition response is early-terminated, set earlyTerminated = true - * on the merged response - * 3. set trimmingMinId to: - * - max(minSearchedStatusIDs of early-terminated responses), if at least one partition response - * is early-terminated - * - min(minSearchedStatusIDs of all responses), if all partition responses are not - * early-terminated - * 4. trim all results to trimmingMinId - * 5. set minSearchedStatusID of the merged response to trimmingMinId - * 6. if we have more than numRequested results: - * - keep only the newest numRequested results - * - set minSearchedStatusID of the merged response to the lowest tweet ID in the response - * - * The logic in step 3 can be a little tricky to understand. On one hand, if we always set - * trimmingMinId to the highest minSearchedStatusID, then some tweets at the very bottom of some - * partitions will never be returned. Consider the case: - * - * partition 1 has tweets 10, 8, 6 - * partition 2 has tweets 9, 7, 5 - * - * In this case, we would always trim all results to minId = 6, and tweet 5 would never be returned. - * - * On the other hand, if we always set trimmingMinId to the lowest minSearchedStatusID, then we - * might miss tweets from partitions that early-terminated. Consider the case: - * - * partition 1 has tweets 10, 5, 3, 1 that match our query - * partition 2 has tweets 9, 8, 7, 6, 2 that match our query - * - * If we ask for 3 results, than partition 1 will return tweets 10, 5, 3, and partition 2 will - * return tweets 9, 8, 7. If we set trimmingMinId = min(minSearchedStatusIDs), then the next - * pagination request will have [max_id = 2], and we will miss tweet 6. - * - * So the intuition here is that if we have an early-terminated response, we cannot set - * trimmingMinId to something lower than the minSearchedStatusID returned by that partition - * (otherwise we might miss results from that partition). However, if we've exhausted all - * partitions, then it's OK to not trim any result, because tiers do not intersect, so we will not - * miss any result from the next tier once we get there. 
- */ -public class StrictRecencyResponseMerger extends RecencyResponseMerger { - private static final SearchTimerStats STRICT_RECENCY_TIMER_AVG = - SearchTimerStats.export("merge_recency_strict", TimeUnit.NANOSECONDS, false, true); - - @VisibleForTesting - static final EarlyTerminationTrimmingStats PARTITION_MERGING_EARLY_TERMINATION_TRIMMING_STATS = - new EarlyTerminationTrimmingStats("strict_recency_partition_merging"); - - @VisibleForTesting - static final EarlyTerminationTrimmingStats TIER_MERGING_EARLY_TERMINATION_TRIMMING_STATS = - new EarlyTerminationTrimmingStats("strict_recency_tier_merging"); - - private final EarlybirdCluster cluster; - - public StrictRecencyResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator mode, - EarlybirdFeatureSchemaMerger featureSchemaMerger, - EarlybirdCluster cluster) { - super(requestContext, responses, mode, featureSchemaMerger); - this.cluster = cluster; - } - - @Override - protected SearchTimerStats getMergedResponseTimer() { - return STRICT_RECENCY_TIMER_AVG; - } - - /** - * Unlike {@link com.twitter.search.earlybird_root.mergers.RecencyResponseMerger}, this method - * takes a much simpler approach by just taking the max of the maxSearchedStatusIds. - * - * Also, when no maxSearchedStatusId is available at all, Long.MIN_VALUE is used instead of - * Long.MAX_VALUE. This ensures that we don't return any result in these cases. - */ - @Override - protected long findMaxFullySearchedStatusID() { - return accumulatedResponses.getMaxIds().isEmpty() - ? Long.MIN_VALUE : Collections.max(accumulatedResponses.getMaxIds()); - } - - /** - * This method is subtly different from the base class version: when no minSearchedStatusId is - * available at all, Long.MAX_VALUE is used instead of Long.MIN_VALUE. This ensures that we - * don't return any result in these cases. - */ - @Override - protected long findMinFullySearchedStatusID() { - List minIds = accumulatedResponses.getMinIds(); - if (minIds.isEmpty()) { - return Long.MAX_VALUE; - } - - if (accumulatedResponses.isMergingPartitionsWithinATier()) { - return getTrimmingMinId(); - } - - // When merging tiers, the min ID should be the smallest among the min IDs. - return Collections.min(minIds); - } - - @Override - protected TrimStats trimResults( - ThriftSearchResults searchResults, long mergedMin, long mergedMax) { - if (!searchResults.isSetResults() || searchResults.getResultsSize() == 0) { - // no results, no trimming needed - return TrimStats.EMPTY_STATS; - } - - TrimStats trimStats = new TrimStats(); - trimExactDups(searchResults, trimStats); - filterResultsByMergedMinMaxIds(searchResults, mergedMax, mergedMin, trimStats); - int numResults = computeNumResultsToKeep(); - if (searchResults.getResultsSize() > numResults) { - trimStats.setResultsTruncatedFromTailCount(searchResults.getResultsSize() - numResults); - searchResults.setResults(searchResults.getResults().subList(0, numResults)); - } - - return trimStats; - } - - /** - * This method is different from the base class version because when minResultId is bigger - * than currentMergedMin, we always take minResultId. - * If we don't do this, we would lose results. - * - * Illustration with an example. Assuming we are outside of the lag threshold. - * Num results requested: 3 - * Response 1: min: 100 max: 900 results: 400, 500, 600 - * Response 2: min: 300 max: 700 results: 350, 450, 550 - * - * Merged results: 600, 550, 500 - * Merged max: 900 - * Merged min: we could take 300 (minId), or take 500 (minResultId). 
- * - * If we take minId, and use 300 as the pagination cursor, we'd lose results - * 350 and 450 when we paginate. So we have to take minResultId here. - */ - @Override - protected void setMergedMinSearchedStatusId( - ThriftSearchResults searchResults, - long currentMergedMin, - boolean resultsWereTrimmed) { - if (accumulatedResponses.getMinIds().isEmpty()) { - return; - } - - long minId = currentMergedMin; - if (resultsWereTrimmed - && (searchResults != null) - && searchResults.isSetResults() - && (searchResults.getResultsSize() > 0)) { - List results = searchResults.getResults(); - minId = results.get(results.size() - 1).getId(); - } - - searchResults.setMinSearchedStatusID(minId); - } - - @Override - protected boolean clearEarlyTerminationIfReachingTierBottom(EarlybirdResponse mergedResponse) { - if (EarlybirdCluster.isArchive(cluster)) { - // We don't need to worry about the tier bottom when merging partition responses in the full - // archive cluster: if all partitions were exhausted and we didn't trim the results, then - // the early-terminated flag on the merged response will be false. If at least one partition - // is early-terminated, or we trimmed some results, then the ealry-terminated flag on the - // merged response will be true, and we should continue getting results from this tier before - // we move to the next one. - return false; - } - - ThriftSearchResults searchResults = mergedResponse.getSearchResults(); - if (searchResults.getMinSearchedStatusID() == getTierBottomId()) { - mergedResponse.getEarlyTerminationInfo().setEarlyTerminated(false); - mergedResponse.getEarlyTerminationInfo().unsetMergedEarlyTerminationReasons(); - responseMessageBuilder.debugVerbose( - "Set earlytermination to false because minSearchedStatusId is tier bottom"); - return true; - } - return false; - } - - @Override - protected boolean shouldEarlyTerminateWhenEnoughTrimmedResults() { - return false; - } - - @Override - protected final EarlyTerminationTrimmingStats getEarlyTerminationTrimmingStatsForPartitions() { - return PARTITION_MERGING_EARLY_TERMINATION_TRIMMING_STATS; - } - - @Override - protected final EarlyTerminationTrimmingStats getEarlyTerminationTrimmingStatsForTiers() { - return TIER_MERGING_EARLY_TERMINATION_TRIMMING_STATS; - } - - /** Determines the bottom of the realtime cluster, based on the partition responses. */ - private long getTierBottomId() { - Preconditions.checkState(!EarlybirdCluster.isArchive(cluster)); - - long tierBottomId = -1; - for (EarlybirdResponse response : accumulatedResponses.getSuccessResponses()) { - if (!isEarlyTerminated(response) - && response.isSetSearchResults() - && response.getSearchResults().isSetMinSearchedStatusID() - && (response.getSearchResults().getMinSearchedStatusID() > tierBottomId)) { - tierBottomId = response.getSearchResults().getMinSearchedStatusID(); - } - } - - return tierBottomId; - } - - /** Determines the minId to which all results should be trimmed. 
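 * Roughly: for the realtime cluster this is max(minSearchedStatusIDs); for the full archive
 * cluster it is max(minSearchedStatusIDs of the early-terminated responses) if at least one
 * response early terminated, and min(minSearchedStatusIDs of all responses) otherwise.
 * See the class comment for the reasoning behind these choices.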
*/ - private long getTrimmingMinId() { - List minIds = accumulatedResponses.getMinIds(); - Preconditions.checkArgument(!minIds.isEmpty()); - - if (!EarlybirdCluster.isArchive(cluster)) { - return Collections.max(minIds); - } - - long maxOfEarlyTerminatedMins = -1; - long minOfAllMins = Long.MAX_VALUE; - for (EarlybirdResponse response : accumulatedResponses.getSuccessResponses()) { - if (response.isSetSearchResults() - && response.getSearchResults().isSetMinSearchedStatusID()) { - long minId = response.getSearchResults().getMinSearchedStatusID(); - minOfAllMins = Math.min(minOfAllMins, minId); - if (isEarlyTerminated(response)) { - maxOfEarlyTerminatedMins = Math.max(maxOfEarlyTerminatedMins, minId); - } - } - } - if (maxOfEarlyTerminatedMins >= 0) { - return maxOfEarlyTerminatedMins; - } else { - return minOfAllMins; - } - } - - /** Determines if the given earlybird response is early terminated. */ - private boolean isEarlyTerminated(EarlybirdResponse response) { - return response.isSetEarlyTerminationInfo() - && response.getEarlyTerminationInfo().isEarlyTerminated(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/SuperRootResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/SuperRootResponseMerger.java deleted file mode 100644 index 5f1c7aa87..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/SuperRootResponseMerger.java +++ /dev/null @@ -1,688 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.Collections; -import java.util.List; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.cache.CacheBuilder; -import com.google.common.cache.CacheLoader; -import com.google.common.cache.LoadingCache; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common.util.Clock; -import com.twitter.search.common.futures.Futures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.query.thriftjava.EarlyTerminationInfo; -import com.twitter.search.common.relevance.utils.ResultComparators; -import com.twitter.search.common.search.EarlyTerminationState; -import com.twitter.search.common.util.FinagleUtil; -import com.twitter.search.common.util.earlybird.EarlybirdResponseMergeUtil; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTweetSource; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdServiceResponse; -import com.twitter.util.Function; -import com.twitter.util.Function0; -import com.twitter.util.Future; - -/** 
Utility functions for merging recency and relevance results. */ -public class SuperRootResponseMerger { - private static final Logger LOG = LoggerFactory.getLogger(SuperRootResponseMerger.class); - private static final String ALL_STATS_PREFIX = "superroot_response_merger_"; - - private static final SearchCounter FULL_ARCHIVE_MIN_ID_GREATER_THAN_REALTIME_MIN_ID = - SearchCounter.export("full_archive_min_id_greater_than_realtime_min_id"); - - private static final String ERROR_FORMAT = "%s%s_errors_from_cluster_%s_%s"; - - private final ThriftSearchRankingMode rankingMode; - private final EarlybirdFeatureSchemaMerger featureSchemaMerger; - private final String featureStatPrefix; - private final Clock clock; - private final String rankingModeStatPrefix; - - private final SearchCounter mergedResponseSearchResultsNotSet; - private final SearchCounter invalidMinStatusId; - private final SearchCounter invalidMaxStatusId; - private final SearchCounter noMinIds; - private final SearchCounter noMaxIds; - private final SearchCounter mergedResponses; - private final SearchCounter mergedResponsesWithExactDups; - private final LoadingCache, SearchCounter> dupsStats; - - private static final EarlybirdResponse EMPTY_RESPONSE = - new EarlybirdResponse(EarlybirdResponseCode.SUCCESS, 0) - .setSearchResults(new ThriftSearchResults() - .setResults(Lists.newArrayList())); - - /** - * Creates a new SuperRootResponseMerger instance. - * @param rankingMode The ranking mode to use when merging results. - * @param featureSchemaMerger The merger that can merge feature schema from different tiers. - * @param clock The clock that will be used to merge results. - */ - public SuperRootResponseMerger(ThriftSearchRankingMode rankingMode, - EarlybirdFeatureSchemaMerger featureSchemaMerger, - Clock clock) { - this.rankingModeStatPrefix = rankingMode.name().toLowerCase(); - - this.rankingMode = rankingMode; - this.featureSchemaMerger = featureSchemaMerger; - this.clock = clock; - this.featureStatPrefix = "superroot_" + rankingMode.name().toLowerCase(); - - mergedResponseSearchResultsNotSet = SearchCounter.export( - ALL_STATS_PREFIX + rankingModeStatPrefix + "_merged_response_search_results_not_set"); - invalidMinStatusId = - SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix + "_invalid_min_status_id"); - invalidMaxStatusId = - SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix + "_invalid_max_status_id"); - noMinIds = SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix + "_no_min_ids"); - noMaxIds = SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix + "_no_max_ids"); - mergedResponses = SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix - + "_merged_responses"); - mergedResponsesWithExactDups = - SearchCounter.export(ALL_STATS_PREFIX + rankingModeStatPrefix - + "_merged_responses_with_exact_dups"); - dupsStats = CacheBuilder.newBuilder() - .build(new CacheLoader, SearchCounter>() { - @Override - public SearchCounter load(Pair key) { - return SearchCounter.export( - ALL_STATS_PREFIX + rankingModeStatPrefix + "_merged_responses_with_exact_dups_" - + key.getFirst().name() + "_" + key.getSecond().name()); - } - }); - } - - private void incrErrorCount(String cluster, @Nullable EarlybirdResponse response) { - String cause; - if (response != null) { - cause = response.getResponseCode().name().toLowerCase(); - } else { - cause = "null_response"; - } - String statName = String.format( - ERROR_FORMAT, ALL_STATS_PREFIX, rankingModeStatPrefix, cluster, cause - ); - - 
SearchCounter.export(statName).increment(); - } - - /** - * Merges the given response futures. - * - * @param earlybirdRequestContext The earlybird request. - * @param realtimeResponseFuture The response from the realtime cluster. - * @param protectedResponseFuture The response from the protected cluster. - * @param fullArchiveResponseFuture The response from the full archive cluster. - * @return A future with the merged results. - */ - public Future mergeResponseFutures( - final EarlybirdRequestContext earlybirdRequestContext, - final Future realtimeResponseFuture, - final Future protectedResponseFuture, - final Future fullArchiveResponseFuture) { - Future mergedResponseFuture = Futures.map( - realtimeResponseFuture, protectedResponseFuture, fullArchiveResponseFuture, - new Function0() { - @Override - public EarlybirdResponse apply() { - // If the realtime response is not valid, return an error response. - // Also, the realtime service should always be called. - EarlybirdServiceResponse realtimeResponse = Futures.get(realtimeResponseFuture); - - if (realtimeResponse.getServiceState().serviceWasRequested() - && (!realtimeResponse.getServiceState().serviceWasCalled() - || !EarlybirdResponseMergeUtil.isValidResponse( - realtimeResponse.getResponse()))) { - - incrErrorCount("realtime", realtimeResponse.getResponse()); - return EarlybirdResponseMergeUtil.transformInvalidResponse( - realtimeResponse.getResponse(), "realtime"); - } - - // If we have a protected response and it's not valid, return an error response. - EarlybirdServiceResponse protectedResponse = Futures.get(protectedResponseFuture); - if (protectedResponse.getServiceState().serviceWasCalled()) { - if (!EarlybirdResponseMergeUtil.isValidResponse(protectedResponse.getResponse())) { - incrErrorCount("protected", protectedResponse.getResponse()); - - return EarlybirdResponseMergeUtil.transformInvalidResponse( - protectedResponse.getResponse(), "protected"); - } - } - - // If we have a full archive response, check if it's valid. - EarlybirdServiceResponse fullArchiveResponse = Futures.get(fullArchiveResponseFuture); - boolean archiveHasError = - fullArchiveResponse.getServiceState().serviceWasCalled() - && !EarlybirdResponseMergeUtil.isValidResponse(fullArchiveResponse.getResponse()); - - // Merge the responses. - EarlybirdResponse mergedResponse = mergeResponses( - earlybirdRequestContext, - realtimeResponse.getResponse(), - protectedResponse.getResponse(), - fullArchiveResponse.getResponse()); - - // If the realtime clusters didn't return any results, and the full archive cluster - // returned an error response, return an error merged response. - if (archiveHasError && !EarlybirdResponseUtil.hasResults(mergedResponse)) { - incrErrorCount("full_archive", fullArchiveResponse.getResponse()); - - return EarlybirdResponseMergeUtil.failedEarlybirdResponse( - fullArchiveResponse.getResponse().getResponseCode(), - "realtime clusters had no results and archive cluster response had error"); - } - - // Corner case: the realtime response could have exactly numRequested results, and could - // be exhausted (not early-terminated). In this case, the request should not have been - // sent to the full archive cluster. - // - If the full archive cluster is not available, or was not requested, then we don't - // need to change anything. 
- // - If the full archive cluster is available and was requested (but wasn't hit - // because we found enough results in the realtime cluster), then we should set the - // early-termination flag on the merged response, to indicate that we potentially - // have more results for this query in our index. - if ((fullArchiveResponse.getServiceState() - == EarlybirdServiceResponse.ServiceState.SERVICE_NOT_CALLED) - && !EarlybirdResponseUtil.isEarlyTerminated(realtimeResponse.getResponse())) { - EarlyTerminationInfo earlyTerminationInfo = new EarlyTerminationInfo(true); - earlyTerminationInfo.setEarlyTerminationReason( - EarlyTerminationState.TERMINATED_NUM_RESULTS_EXCEEDED.getTerminationReason()); - mergedResponse.setEarlyTerminationInfo(earlyTerminationInfo); - } - - // If we've exhausted all clusters, set the minSearchedStatusID to 0. - if (!EarlybirdResponseUtil.isEarlyTerminated(mergedResponse)) { - mergedResponse.getSearchResults().setMinSearchedStatusID(0); - } - - return mergedResponse; - } - }); - - // Handle all merging exceptions. - return handleResponseException(mergedResponseFuture, - "Exception thrown while merging responses."); - } - - /** - * Merge the results in the given responses. - * - * @param earlybirdRequestContext The earlybird request context. - * @param realtimeResponse The response from the realtime cluster. - * @param protectedResponse The response from the protected cluster. - * @param fullArchiveResponse The response from the full archive cluster. - * @return The merged response. - */ - private EarlybirdResponse mergeResponses( - EarlybirdRequestContext earlybirdRequestContext, - @Nullable EarlybirdResponse realtimeResponse, - @Nullable EarlybirdResponse protectedResponse, - @Nullable EarlybirdResponse fullArchiveResponse) { - - EarlybirdRequest request = earlybirdRequestContext.getRequest(); - ThriftSearchQuery searchQuery = request.getSearchQuery(); - int numResultsRequested; - - if (request.isSetNumResultsToReturnAtRoot()) { - numResultsRequested = request.getNumResultsToReturnAtRoot(); - } else { - numResultsRequested = searchQuery.getNumResults(); - } - - Preconditions.checkState(numResultsRequested > 0); - - EarlybirdResponse mergedResponse = EMPTY_RESPONSE.deepCopy(); - if ((realtimeResponse != null) - && (realtimeResponse.getResponseCode() != EarlybirdResponseCode.TIER_SKIPPED)) { - mergedResponse = realtimeResponse.deepCopy(); - } - - if (!mergedResponse.isSetSearchResults()) { - mergedResponseSearchResultsNotSet.increment(); - mergedResponse.setSearchResults( - new ThriftSearchResults(Lists.newArrayList())); - } - - // If either the realtime or the full archive response is early-terminated, we want the merged - // response to be early-terminated too. The early-termination flag from the realtime response - // carries over to the merged response, because mergedResponse is just a deep copy of the - // realtime response. So we only need to check the early-termination flag of the full archive - // response. 
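The net effect of the next two blocks can be summarized as follows (an illustrative sketch, not text from the original source):

    // How the merged early-termination flag ends up set:
    //   realtime response early-terminated          -> true (inherited via deepCopy above)
    //   full archive response early-terminated      -> true (copied in the block below)
    //   realtime empty, protected early-terminated  -> true (copied from the protected response)
    //   none of the above                           -> false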
- if ((fullArchiveResponse != null) - && EarlybirdResponseUtil.isEarlyTerminated(fullArchiveResponse)) { - mergedResponse.setEarlyTerminationInfo(fullArchiveResponse.getEarlyTerminationInfo()); - } - - // If realtime has empty results and protected has some results then we copy the early - // termination information if that is present - if (protectedResponse != null - && mergedResponse.getSearchResults().getResults().isEmpty() - && !protectedResponse.getSearchResults().getResults().isEmpty() - && EarlybirdResponseUtil.isEarlyTerminated(protectedResponse)) { - mergedResponse.setEarlyTerminationInfo(protectedResponse.getEarlyTerminationInfo()); - } - - // Merge the results. - List mergedResults = mergeResults( - numResultsRequested, realtimeResponse, protectedResponse, fullArchiveResponse); - - // Trim the merged results if necessary. - boolean resultsTrimmed = false; - if (mergedResults.size() > numResultsRequested - && !(searchQuery.isSetRelevanceOptions() - && searchQuery.getRelevanceOptions().isReturnAllResults())) { - // If we have more results than requested, trim the result list and re-adjust - // minSearchedStatusID. - mergedResults = mergedResults.subList(0, numResultsRequested); - - // Mark early termination in merged response - if (!EarlybirdResponseUtil.isEarlyTerminated(mergedResponse)) { - EarlyTerminationInfo earlyTerminationInfo = new EarlyTerminationInfo(true); - earlyTerminationInfo.setEarlyTerminationReason( - EarlyTerminationState.TERMINATED_NUM_RESULTS_EXCEEDED.getTerminationReason()); - mergedResponse.setEarlyTerminationInfo(earlyTerminationInfo); - } - - resultsTrimmed = true; - } - - mergedResponse.getSearchResults().setResults(mergedResults); - featureSchemaMerger.mergeFeatureSchemaAcrossClusters( - earlybirdRequestContext, - mergedResponse, - featureStatPrefix, - realtimeResponse, - protectedResponse, - fullArchiveResponse); - - // Set the minSearchedStatusID and maxSearchedStatusID fields on the merged response. - setMinSearchedStatusId(mergedResponse, realtimeResponse, protectedResponse, fullArchiveResponse, - resultsTrimmed); - setMaxSearchedStatusId(mergedResponse, realtimeResponse, protectedResponse, - fullArchiveResponse); - - int numRealtimeSearchedSegments = - (realtimeResponse != null && realtimeResponse.isSetNumSearchedSegments()) - ? realtimeResponse.getNumSearchedSegments() - : 0; - - int numProtectedSearchedSegments = - (protectedResponse != null && protectedResponse.isSetNumSearchedSegments()) - ? protectedResponse.getNumSearchedSegments() - : 0; - - int numArchiveSearchedSegments = - (fullArchiveResponse != null && fullArchiveResponse.isSetNumSearchedSegments()) - ? fullArchiveResponse.getNumSearchedSegments() - : 0; - - mergedResponse.setNumSearchedSegments( - numRealtimeSearchedSegments + numProtectedSearchedSegments + numArchiveSearchedSegments); - - if (earlybirdRequestContext.getRequest().getDebugMode() > 0) { - mergedResponse.setDebugString( - mergeClusterDebugStrings(realtimeResponse, protectedResponse, fullArchiveResponse)); - } - - return mergedResponse; - } - - /** - * Merges the given responses. 
- *
- * @param numResults the number of results requested
- * @param realtimeResponse the response from the realtime cluster
- * @param protectedResponse the response from the protected cluster
- * @param fullArchiveResponse the response from the full archive cluster
- * @return the list of merged results
- */
- private List mergeResults(int numResults,
- @Nullable EarlybirdResponse realtimeResponse,
- @Nullable EarlybirdResponse protectedResponse,
- @Nullable EarlybirdResponse fullArchiveResponse) {
- mergedResponses.increment();
- // We first merge the results from the two realtime clusters, Realtime cluster and
- // Realtime Protected Tweets cluster.
- List mergedResults = mergePublicAndProtectedRealtimeResults(
- numResults,
- realtimeResponse,
- protectedResponse,
- fullArchiveResponse,
- clock);
-
- EarlybirdResponseMergeUtil.addResultsToList(mergedResults, fullArchiveResponse,
- ThriftTweetSource.FULL_ARCHIVE_CLUSTER);
-
- List distinctMergedResults =
- EarlybirdResponseMergeUtil.distinctByStatusId(mergedResults, dupsStats);
- if (mergedResults != distinctMergedResults) {
- mergedResponsesWithExactDups.increment();
- }
-
- if (rankingMode == ThriftSearchRankingMode.RELEVANCE
- || rankingMode == ThriftSearchRankingMode.TOPTWEETS) {
- distinctMergedResults.sort(ResultComparators.SCORE_COMPARATOR);
- } else {
- distinctMergedResults.sort(ResultComparators.ID_COMPARATOR);
- }
-
- return distinctMergedResults;
- }
-
- /**
- * Method for merging tweets from protected and realtime clusters
- * - realtime, guaranteed newer than any archive tweets
- * - protected, also realtime, but with a potentially larger window (optional)
- * - archive, public, guaranteed older than any public realtime tweets (optional, used for
- * id limits, *not added to results*)
- * It adds the ThriftSearchResults from protected tweets to the realtimeResponse
- *
- * Algorithm diagram: (with newer tweets at the top)
- * ------------------------------------ <--- protected maxSearchedStatusID
- * |C:Newest protected realtime tweets|
- * | (does not exist if realtime |
- * | maxID >= protected maxID) |
-
- * | ------------------------ | <--- 60 seconds ago
- * |D:Newer protected realtime tweets |
- * | (does not exist if realtime |
- * | maxID >= 60 seconds ago) |
- * ---------- | ------------------------ | <--- public realtime maxSearchedStatusID
- * |A:Public| |E:Automatically valid protected |
- * |realtime| |realtime tweets |
- * ---------- | ------------------------ | <--- public realtime minSearchedStatusID
- * | |
- * ---------- | E if archive is present | <--- public archive maxSearchedStatusID
- * |B:Public| | F if archive is not present |
- * |archive | | |
- * ---------- | ------------------------ | <--- public archive minSearchedStatusID
- * |F:Older protected realtime tweets |
- * | (does not exist if protected |
- * | minID >= public minID) |
- * ------------------------------------ <--- protected minSearchedStatusID
- * Step 1: Select tweets from groups A and E. If this is enough, return them
- * Step 2: Select tweets from groups A, E, and F.
If this is enough, return them - * Step 3: Select tweets from groups A, D, E, and F and return them - * - * There are two primary tradeoffs, both of which favor public tweets: - * (1) Benefit: While public indexing latency is < 60s, auto-updating never misses public tweets - * Cost: Absence of public tweets may delay protected tweets from being searchable for 60s - * (2) Benefit: No failure or delay from the protected cluster will affect realtime results - * Cost: If the protected cluster indexes more slowly, auto-update may miss its tweets - * - * @param fullArchiveTweets - used solely for generating anchor points, not merged in. - */ - @VisibleForTesting - static List mergePublicAndProtectedRealtimeResults( - int numRequested, - EarlybirdResponse realtimeTweets, - EarlybirdResponse realtimeProtectedTweets, - @Nullable EarlybirdResponse fullArchiveTweets, - Clock clock) { - // See which results will actually be used - boolean isRealtimeUsable = EarlybirdResponseUtil.hasResults(realtimeTweets); - boolean isArchiveUsable = EarlybirdResponseUtil.hasResults(fullArchiveTweets); - boolean isProtectedUsable = EarlybirdResponseUtil.hasResults(realtimeProtectedTweets); - - long minId = Long.MIN_VALUE; - long maxId = Long.MAX_VALUE; - if (isRealtimeUsable) { - // Determine the actual upper/lower bounds on the tweet id - if (realtimeTweets.getSearchResults().isSetMinSearchedStatusID()) { - minId = realtimeTweets.getSearchResults().getMinSearchedStatusID(); - } - if (realtimeTweets.getSearchResults().isSetMaxSearchedStatusID()) { - maxId = realtimeTweets.getSearchResults().getMaxSearchedStatusID(); - } - - int justRight = realtimeTweets.getSearchResults().getResultsSize(); - if (isArchiveUsable) { - justRight += fullArchiveTweets.getSearchResults().getResultsSize(); - if (fullArchiveTweets.getSearchResults().isSetMinSearchedStatusID()) { - long fullArchiveMinId = fullArchiveTweets.getSearchResults().getMinSearchedStatusID(); - if (fullArchiveMinId <= minId) { - minId = fullArchiveMinId; - } else { - FULL_ARCHIVE_MIN_ID_GREATER_THAN_REALTIME_MIN_ID.increment(); - } - } - } - if (isProtectedUsable) { - for (ThriftSearchResult result : realtimeProtectedTweets.getSearchResults().getResults()) { - if (result.getId() >= minId && result.getId() <= maxId) { - justRight++; - } - } - } - if (justRight < numRequested) { - // Since this is only used as an upper bound, old (pre-2010) ids are still handled correctly - maxId = Math.max( - maxId, - SnowflakeIdParser.generateValidStatusId( - clock.nowMillis() - Amount.of(60, Time.SECONDS).as(Time.MILLISECONDS), 0)); - } - } - - List mergedSearchResults = Lists.newArrayListWithCapacity(numRequested * 2); - - // Add valid tweets in order of priority: protected, then realtime - // Only add results that are within range (that check only matters for protected) - if (isProtectedUsable) { - EarlybirdResponseMergeUtil.markWithTweetSource( - realtimeProtectedTweets.getSearchResults().getResults(), - ThriftTweetSource.REALTIME_PROTECTED_CLUSTER); - for (ThriftSearchResult result : realtimeProtectedTweets.getSearchResults().getResults()) { - if (result.getId() <= maxId && result.getId() >= minId) { - mergedSearchResults.add(result); - } - } - } - - if (isRealtimeUsable) { - EarlybirdResponseMergeUtil.addResultsToList( - mergedSearchResults, realtimeTweets, ThriftTweetSource.REALTIME_CLUSTER); - } - - // Set the minSearchedStatusID and maxSearchedStatusID on the protected response to the - // minId and maxId that were used to trim the protected results. 
- // This is needed in order to correctly set these IDs on the merged response.
- ThriftSearchResults protectedResults =
- EarlybirdResponseUtil.getResults(realtimeProtectedTweets);
- if ((protectedResults != null)
- && protectedResults.isSetMinSearchedStatusID()
- && (protectedResults.getMinSearchedStatusID() < minId)) {
- protectedResults.setMinSearchedStatusID(minId);
- }
- if ((protectedResults != null)
- && protectedResults.isSetMaxSearchedStatusID()
- && (protectedResults.getMaxSearchedStatusID() > maxId)) {
- realtimeProtectedTweets.getSearchResults().setMaxSearchedStatusID(maxId);
- }
-
- return mergedSearchResults;
- }
-
- /**
- * Merges the debug strings of the given cluster responses.
- *
- * @param realtimeResponse The response from the realtime cluster.
- * @param protectedResponse The response from the protected cluster.
- * @param fullArchiveResponse The response from the full archive cluster.
- * @return The merged debug string.
- */
- public static String mergeClusterDebugStrings(@Nullable EarlybirdResponse realtimeResponse,
- @Nullable EarlybirdResponse protectedResponse,
- @Nullable EarlybirdResponse fullArchiveResponse) {
- StringBuilder sb = new StringBuilder();
- if ((realtimeResponse != null) && realtimeResponse.isSetDebugString()) {
- sb.append("Realtime response: ").append(realtimeResponse.getDebugString());
- }
- if ((protectedResponse != null) && protectedResponse.isSetDebugString()) {
- if (sb.length() > 0) {
- sb.append("\n");
- }
- sb.append("Protected response: ").append(protectedResponse.getDebugString());
- }
- if ((fullArchiveResponse != null) && fullArchiveResponse.isSetDebugString()) {
- if (sb.length() > 0) {
- sb.append("\n");
- }
- sb.append("Full archive response: ").append(fullArchiveResponse.getDebugString());
- }
-
- if (sb.length() == 0) {
- return null;
- }
- return sb.toString();
- }
-
- /**
- * Sets the minSearchedStatusID field on the merged response.
- *
- * @param mergedResponse The merged response.
- * @param fullArchiveResponse The full archive response.
- * @param resultsTrimmed Whether the merged response results were trimmed.
- */
- private void setMinSearchedStatusId(EarlybirdResponse mergedResponse,
- EarlybirdResponse realtimeResponse,
- EarlybirdResponse protectedResponse,
- EarlybirdResponse fullArchiveResponse,
- boolean resultsTrimmed) {
- Preconditions.checkNotNull(mergedResponse.getSearchResults());
- if (resultsTrimmed) {
- // We got more results than we asked for and we trimmed them.
- // Set minSearchedStatusID to the ID of the oldest result.
- ThriftSearchResults searchResults = mergedResponse.getSearchResults();
- if (searchResults.getResultsSize() > 0) {
- List results = searchResults.getResults();
- long lastResultId = results.get(results.size() - 1).getId();
- searchResults.setMinSearchedStatusID(lastResultId);
- }
- return;
- }
-
- // We did not get more results than we asked for. Get the min of the minSearchedStatusIDs of
- // the merged responses.
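A small sketch with made-up IDs illustrates the two branches of this method (the numbers are hypothetical; Lists and Collections are the same utilities already imported by this file):

    // Hypothetical values: the realtime cluster reports minSearchedStatusID = 500, the archive
    // cluster reports 120, and the trimmed merged result list ends with a result whose id is 350.
    long oldestReturnedId = 350L;
    List<Long> clusterMinIds = Lists.newArrayList(500L, 120L);
    long mergedMin = resultsTrimmed
        ? oldestReturnedId                 // trimmed: 350, so paginating clients do not skip (120, 350)
        : Collections.min(clusterMinIds);  // not trimmed: 120, the smallest cluster minimum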
- List minIDs = Lists.newArrayList(); - if (fullArchiveResponse != null - && fullArchiveResponse.isSetSearchResults() - && fullArchiveResponse.getSearchResults().isSetMinSearchedStatusID()) { - minIDs.add(fullArchiveResponse.getSearchResults().getMinSearchedStatusID()); - if (mergedResponse.getSearchResults().isSetMinSearchedStatusID() - && mergedResponse.getSearchResults().getMinSearchedStatusID() - < fullArchiveResponse.getSearchResults().getMinSearchedStatusID()) { - invalidMinStatusId.increment(); - } - } - - if (protectedResponse != null - && !EarlybirdResponseUtil.hasResults(realtimeResponse) - && EarlybirdResponseUtil.hasResults(protectedResponse) - && protectedResponse.getSearchResults().isSetMinSearchedStatusID()) { - minIDs.add(protectedResponse.getSearchResults().getMinSearchedStatusID()); - } - - if (mergedResponse.getSearchResults().isSetMinSearchedStatusID()) { - minIDs.add(mergedResponse.getSearchResults().getMinSearchedStatusID()); - } - - if (!minIDs.isEmpty()) { - mergedResponse.getSearchResults().setMinSearchedStatusID(Collections.min(minIDs)); - } else { - noMinIds.increment(); - } - } - - /** - * Sets the maxSearchedStatusID field on the merged response. - * - * @param mergedResponse The merged response. - * @param fullArchiveResponse The full archive response. - */ - private void setMaxSearchedStatusId(EarlybirdResponse mergedResponse, - EarlybirdResponse realtimeResponse, - EarlybirdResponse protectedResponse, - EarlybirdResponse fullArchiveResponse) { - - Preconditions.checkNotNull(mergedResponse.getSearchResults()); - List maxIDs = Lists.newArrayList(); - if (fullArchiveResponse != null - && fullArchiveResponse.isSetSearchResults() - && fullArchiveResponse.getSearchResults().isSetMaxSearchedStatusID()) { - maxIDs.add(fullArchiveResponse.getSearchResults().getMaxSearchedStatusID()); - if (mergedResponse.getSearchResults().isSetMaxSearchedStatusID() - && fullArchiveResponse.getSearchResults().getMaxSearchedStatusID() - > mergedResponse.getSearchResults().getMaxSearchedStatusID()) { - invalidMaxStatusId.increment(); - } - } - - if (protectedResponse != null - && !EarlybirdResponseUtil.hasResults(realtimeResponse) - && EarlybirdResponseUtil.hasResults(protectedResponse) - && protectedResponse.getSearchResults().isSetMaxSearchedStatusID()) { - - maxIDs.add(protectedResponse.getSearchResults().getMaxSearchedStatusID()); - } - - if (mergedResponse.getSearchResults().isSetMaxSearchedStatusID()) { - maxIDs.add(mergedResponse.getSearchResults().getMaxSearchedStatusID()); - } - - ThriftSearchResults searchResults = mergedResponse.getSearchResults(); - if (searchResults.getResultsSize() > 0) { - List results = searchResults.getResults(); - maxIDs.add(results.get(0).getId()); - } - - if (!maxIDs.isEmpty()) { - mergedResponse.getSearchResults().setMaxSearchedStatusID(Collections.max(maxIDs)); - } else { - noMaxIds.increment(); - } - } - - /** - * Handles exceptions thrown while merging responses. Timeout exceptions are converted to - * SERVER_TIMEOUT_ERROR responses. All other exceptions are converted to PERSISTENT_ERROR - * responses. 
- */
- private Future handleResponseException(
- Future responseFuture, final String debugMsg) {
- return responseFuture.handle(
- new Function() {
- @Override
- public EarlybirdResponse apply(Throwable t) {
- EarlybirdResponseCode responseCode = EarlybirdResponseCode.PERSISTENT_ERROR;
- if (FinagleUtil.isTimeoutException(t)) {
- responseCode = EarlybirdResponseCode.SERVER_TIMEOUT_ERROR;
- }
- EarlybirdResponse response = new EarlybirdResponse(responseCode, 0);
- response.setDebugString(debugMsg + "\n" + t);
- return response;
- }
- });
- }
-}
diff --git a/src/java/com/twitter/search/earlybird_root/mergers/TermStatisticsResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/TermStatisticsResponseMerger.java
deleted file mode 100644
index d23fff64b..000000000
--- a/src/java/com/twitter/search/earlybird_root/mergers/TermStatisticsResponseMerger.java
+++ /dev/null
@@ -1,90 +0,0 @@
-package com.twitter.search.earlybird_root.mergers;
-
-import java.util.Collection;
-import java.util.List;
-import java.util.concurrent.TimeUnit;
-
-import com.google.common.collect.Collections2;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.twitter.search.common.metrics.SearchTimerStats;
-import com.twitter.search.common.util.earlybird.FacetsResultsUtils;
-import com.twitter.search.earlybird.thrift.EarlybirdResponse;
-import com.twitter.search.earlybird.thrift.ThriftTermStatisticsRequest;
-import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults;
-import com.twitter.search.earlybird_root.common.EarlybirdRequestContext;
-import com.twitter.util.Future;
-
-/**
- * Merger class to merge termstats EarlybirdResponse objects
- */
-public class TermStatisticsResponseMerger extends EarlybirdResponseMerger {
- private static final Logger LOG = LoggerFactory.getLogger(TermStatisticsResponseMerger.class);
-
- private static final SearchTimerStats TIMER =
- SearchTimerStats.export("merge_term_stats", TimeUnit.NANOSECONDS, false, true);
-
- private static final double SUCCESSFUL_RESPONSE_THRESHOLD = 0.9;
-
- public TermStatisticsResponseMerger(EarlybirdRequestContext requestContext,
- List> responses,
- ResponseAccumulator mode) {
- super(requestContext, responses, mode);
- }
-
- @Override
- protected SearchTimerStats getMergedResponseTimer() {
- return TIMER;
- }
-
- @Override
- protected double getDefaultSuccessResponseThreshold() {
- return SUCCESSFUL_RESPONSE_THRESHOLD;
- }
-
- @Override
- protected EarlybirdResponse internalMerge(EarlybirdResponse termStatsResponse) {
- ThriftTermStatisticsRequest termStatisticsRequest =
- requestContext.getRequest().getTermStatisticsRequest();
-
- Collection termStatsResults =
- Collections2.filter(accumulatedResponses.getSuccessResponses(),
- earlybirdResponse -> earlybirdResponse.isSetTermStatisticsResults());
-
- ThriftTermStatisticsResults results =
- new ThriftTermResultsMerger(
- termStatsResults,
- termStatisticsRequest.getHistogramSettings())
- .merge();
-
- if (results.getTermResults().isEmpty()) {
- final String line = "No results returned from any backend for term statistics request: {}";
-
- // If the termstats request was not empty and we got empty results, log it as a warning;
- // otherwise log it as a debug.
- if (termStatisticsRequest.getTermRequestsSize() > 0) { - LOG.warn(line, termStatisticsRequest); - } else { - LOG.debug(line, termStatisticsRequest); - } - } - - termStatsResponse.setTermStatisticsResults(results); - termStatsResponse.setSearchResults(ThriftTermResultsMerger.mergeSearchStats(termStatsResults)); - - FacetsResultsUtils.fixNativePhotoUrl(results.getTermResults().values()); - - LOG.debug("TermStats call completed successfully: {}", termStatsResponse); - - return termStatsResponse; - } - - @Override - public boolean shouldEarlyTerminateTierMerge(int totalResultsFromSuccessfulShards, - boolean foundEarlyTermination) { - // To get accurate term stats, must never early terminate - return false; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/ThriftTermResultsMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/ThriftTermResultsMerger.java deleted file mode 100644 index ccfa54aff..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/ThriftTermResultsMerger.java +++ /dev/null @@ -1,472 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Collections; -import java.util.List; -import java.util.Map; - -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.earlybird.FacetsResultsUtils; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftHistogramSettings; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird.thrift.ThriftTermRequest; -import com.twitter.search.earlybird.thrift.ThriftTermResults; -import com.twitter.search.earlybird.thrift.ThriftTermStatisticsResults; - -/** - * Takes multiple successful EarlybirdResponses and merges them. - */ -public class ThriftTermResultsMerger { - private static final Logger LOG = LoggerFactory.getLogger(ThriftTermResultsMerger.class); - - private static final SearchCounter BIN_ID_GAP_COUNTER = - SearchCounter.export("thrift_term_results_merger_found_gap_in_bin_ids"); - private static final SearchCounter MIN_COMPLETE_BIN_ID_ADJUSTED_NULL = - SearchCounter.export("thrift_term_results_merger_min_complete_bin_id_adjusted_null"); - private static final SearchCounter MIN_COMPLETE_BIN_ID_NULL_WITHOUT_BINS = - SearchCounter.export("thrift_term_results_merger_min_complete_bin_id_null_without_bins"); - private static final SearchCounter MIN_COMPLETE_BIN_ID_OUT_OF_RANGE = - SearchCounter.export("thrift_term_results_merger_min_complete_bin_id_out_of_range"); - private static final SearchCounter RESPONSE_WITHOUT_DRIVING_QUERY_HIT = - SearchCounter.export("response_without_driving_query_hit"); - - private static final ThriftTermRequest GLOBAL_COUNT_REQUEST = - new ThriftTermRequest().setFieldName("").setTerm(""); - - /** - * Sorted list of the most recent (and contiguous) numBins binIds across all responses. - * Expected to be an empty list if this request did not ask for histograms, or if it - * did ask for histograms for 0 numBins. 
- */ - @Nonnull - private final List mostRecentBinIds; - /** - * The first binId in the {@link #mostRecentBinIds} list. This value is not meant to be used in - * case mostRecentBinIds is an empty list. - */ - private final int firstBinId; - - /** - * For each unique ThriftTermRequest, stores an array of the total counts for all the binIds - * that we will return, summed up across all earlybird responses. - * - * The values in each totalCounts array correspond to the binIds in the - * {@link #mostRecentBinIds} list. - * - * Key: thrift term request. - * Value: array of the total counts summed up across all earlybird responses for the key's - * term request, corresponding to the binIds in {@link #mostRecentBinIds}. - */ - private final Map mergedTermRequestTotalCounts = Maps.newHashMap(); - /** - * The set of all unique binIds that we are merging. - */ - private final Map termResultsMap = Maps.newHashMap(); - private final ThriftHistogramSettings histogramSettings; - - /** - * Only relevant for merging responses with histogram settings. - * This will be null either if (1) the request is not asking for histograms at all, or if - * (2) numBins was set to 0 (and no bin can be considered complete). - * If not null, the minCompleteBinId will be computed as the max over all merged responses' - * minCompleteBinId's. - */ - @Nullable - private final Integer minCompleteBinId; - - /** - * Create merger with collections of results to merge - */ - public ThriftTermResultsMerger(Collection termStatsResults, - ThriftHistogramSettings histogramSettings) { - this.histogramSettings = histogramSettings; - - Collection filteredTermStatsResults = - filterOutEmptyEarlybirdResponses(termStatsResults); - - this.mostRecentBinIds = findMostRecentBinIds(histogramSettings, filteredTermStatsResults); - this.firstBinId = mostRecentBinIds.isEmpty() - ? Integer.MAX_VALUE // Should not be used if mostRecentBinIds is empty. - : mostRecentBinIds.get(0); - - List minCompleteBinIds = - Lists.newArrayListWithCapacity(filteredTermStatsResults.size()); - for (EarlybirdResponse response : filteredTermStatsResults) { - Preconditions.checkState(response.getResponseCode() == EarlybirdResponseCode.SUCCESS, - "Unsuccessful responses should not be given to ThriftTermResultsMerger."); - Preconditions.checkState(response.getTermStatisticsResults() != null, - "Response given to ThriftTermResultsMerger has no termStatisticsResults."); - - ThriftTermStatisticsResults termStatisticsResults = response.getTermStatisticsResults(); - List binIds = termStatisticsResults.getBinIds(); - - for (Map.Entry entry - : termStatisticsResults.getTermResults().entrySet()) { - ThriftTermRequest termRequest = entry.getKey(); - ThriftTermResults termResults = entry.getValue(); - - adjustTotalCount(termResults, binIds); - addTotalCountData(termRequest, termResults); - - if (histogramSettings != null) { - Preconditions.checkState(termStatisticsResults.isSetBinIds()); - addHistogramData(termRequest, termResults, termStatisticsResults.getBinIds()); - } - } - - if (histogramSettings != null) { - addMinCompleteBinId(minCompleteBinIds, response); - } - } - - minCompleteBinId = minCompleteBinIds.isEmpty() ? null : Collections.max(minCompleteBinIds); - } - - /** - * Take out any earlybird responses that we know did not match anything relevant to the query, - * and may have erroneous binIds. 
- */ - private Collection filterOutEmptyEarlybirdResponses( - Collection termStatsResults) { - List emptyResponses = Lists.newArrayList(); - List nonEmptyResponses = Lists.newArrayList(); - for (EarlybirdResponse response : termStatsResults) { - // Guard against erroneously merging and returning 0 counts when we actually have data to - // return from other partitions. - // When a query doesn't match anything at all on an earlybird, the binIds that are returned - // do not correspond at all to the actual query, and are just based on the data range on the - // earlybird itself. - // We can identify these responses as (1) being non-early terminated, and (2) having 0 - // hits processed. - if (isTermStatResponseEmpty(response)) { - emptyResponses.add(response); - } else { - nonEmptyResponses.add(response); - } - } - - // If all responses were "empty", we will just use those to merge into a new set of empty - // responses, using the binIds provided. - return nonEmptyResponses.isEmpty() ? emptyResponses : nonEmptyResponses; - } - - private boolean isTermStatResponseEmpty(EarlybirdResponse response) { - return response.isSetSearchResults() - && (response.getSearchResults().getNumHitsProcessed() == 0 - || drivingQueryHasNoHits(response)) - && response.isSetEarlyTerminationInfo() - && !response.getEarlyTerminationInfo().isEarlyTerminated(); - } - - /** - * If the global count bins are all 0, then we know the driving query has no hits. - * This check is added as a short term solution for SEARCH-5476. This short term fix requires - * the client to set the includeGlobalCounts to kick in. - */ - private boolean drivingQueryHasNoHits(EarlybirdResponse response) { - ThriftTermStatisticsResults termStatisticsResults = response.getTermStatisticsResults(); - if (termStatisticsResults == null || termStatisticsResults.getTermResults() == null) { - // If there's no term stats response, be conservative and return false. - return false; - } else { - ThriftTermResults globalCounts = - termStatisticsResults.getTermResults().get(GLOBAL_COUNT_REQUEST); - if (globalCounts == null) { - // We cannot tell if driving query has no hits, be conservative and return false. - return false; - } else { - for (Integer i : globalCounts.getHistogramBins()) { - if (i > 0) { - return false; - } - } - RESPONSE_WITHOUT_DRIVING_QUERY_HIT.increment(); - return true; - } - } - } - - private static List findMostRecentBinIds( - ThriftHistogramSettings histogramSettings, - Collection filteredTermStatsResults) { - Integer largestFirstBinId = null; - List binIdsToUse = null; - - if (histogramSettings != null) { - int numBins = histogramSettings.getNumBins(); - for (EarlybirdResponse response : filteredTermStatsResults) { - ThriftTermStatisticsResults termStatisticsResults = response.getTermStatisticsResults(); - Preconditions.checkState(termStatisticsResults.getBinIds().size() == numBins, - "expected all results to have the same numBins. " - + "request numBins: %s, response numBins: %s", - numBins, termStatisticsResults.getBinIds().size()); - - if (termStatisticsResults.getBinIds().size() > 0) { - Integer firstBinId = termStatisticsResults.getBinIds().get(0); - if (largestFirstBinId == null - || largestFirstBinId.intValue() < firstBinId.intValue()) { - largestFirstBinId = firstBinId; - binIdsToUse = termStatisticsResults.getBinIds(); - } - } - } - } - return binIdsToUse == null - ? Collections.emptyList() - // Just in case, make a copy of the binIds so that we don't reuse the same list from one - // of the responses we're merging. 
- : Lists.newArrayList(binIdsToUse); - } - - private void addMinCompleteBinId(List minCompleteBinIds, - EarlybirdResponse response) { - Preconditions.checkNotNull(histogramSettings); - ThriftTermStatisticsResults termStatisticsResults = response.getTermStatisticsResults(); - - if (termStatisticsResults.isSetMinCompleteBinId()) { - // This is the base case. Early terminated or not, this is the proper minCompleteBinId - // that we're told to use for this response. - minCompleteBinIds.add(termStatisticsResults.getMinCompleteBinId()); - } else if (termStatisticsResults.getBinIds().size() > 0) { - // This is the case where no bins were complete. For the purposes of merging, we need to - // mark all the binIds in this response as non-complete by marking the "max(binId)+1" as the - // last complete bin. - // When returning the merged response, we still have a guard for the resulting - // minCompleteBinId being outside of the binIds range, and will set the returned - // minCompleteBinId value to null, if this response's binIds end up being used as the most - // recent ones, and we need to signify that none of the bins are complete. - int binSize = termStatisticsResults.getBinIds().size(); - Integer maxBinId = termStatisticsResults.getBinIds().get(binSize - 1); - minCompleteBinIds.add(maxBinId + 1); - - LOG.debug("Adjusting null minCompleteBinId for response: {}, histogramSettings {}", - response, histogramSettings); - MIN_COMPLETE_BIN_ID_ADJUSTED_NULL.increment(); - } else { - // This should only happen in the case where numBins is set to 0. - Preconditions.checkState(histogramSettings.getNumBins() == 0, - "Expected numBins set to 0. response: %s", response); - Preconditions.checkState(minCompleteBinIds.isEmpty(), - "minCompleteBinIds: %s", minCompleteBinIds); - - LOG.debug("Got null minCompleteBinId with no bins for response: {}, histogramSettings {}", - response, histogramSettings); - MIN_COMPLETE_BIN_ID_NULL_WITHOUT_BINS.increment(); - } - } - - private void addTotalCountData(ThriftTermRequest request, ThriftTermResults results) { - ThriftTermResults termResults = termResultsMap.get(request); - if (termResults == null) { - termResultsMap.put(request, results); - } else { - termResults.setTotalCount(termResults.getTotalCount() + results.getTotalCount()); - if (termResults.isSetMetadata()) { - termResults.setMetadata( - FacetsResultsUtils.mergeFacetMetadata(termResults.getMetadata(), - results.getMetadata(), null)); - } - } - } - - /** - * Set results.totalCount to the sum of hits in only the bins that will be returned in - * the merged response. - */ - private void adjustTotalCount(ThriftTermResults results, List binIds) { - int adjustedTotalCount = 0; - List histogramBins = results.getHistogramBins(); - if ((binIds != null) && (histogramBins != null)) { - Preconditions.checkState( - histogramBins.size() == binIds.size(), - "Expected ThriftTermResults to have the same number of histogramBins as binIds set in " - + " ThriftTermStatisticsResults. 
ThriftTermResults.histogramBins: %s, " - + " ThriftTermStatisticsResults.binIds: %s.", - histogramBins, binIds); - for (int i = 0; i < binIds.size(); ++i) { - if (binIds.get(i) >= firstBinId) { - adjustedTotalCount += histogramBins.get(i); - } - } - } - - results.setTotalCount(adjustedTotalCount); - } - - private void addHistogramData(ThriftTermRequest request, - ThriftTermResults results, - List binIds) { - - int[] requestTotalCounts = mergedTermRequestTotalCounts.get(request); - if (requestTotalCounts == null) { - requestTotalCounts = new int[mostRecentBinIds.size()]; - mergedTermRequestTotalCounts.put(request, requestTotalCounts); - } - - // Only consider these results if they fall into the mostRecentBinIds range. - // - // The list of returned binIds is expected to be both sorted (in ascending order), and - // contiguous, which allows us to use firstBinId to check if it overlaps with the - // mostRecentBinIds range. - if (binIds.size() > 0 && binIds.get(binIds.size() - 1) >= firstBinId) { - int firstBinIndex; - if (binIds.get(0) == firstBinId) { - // This should be the common case when all partitions have the same binIds, - // no need to do a binary search. - firstBinIndex = 0; - } else { - // The firstBinId must be in the binIds range. We can find it using binary search since - // binIds are sorted. - firstBinIndex = Collections.binarySearch(binIds, firstBinId); - Preconditions.checkState(firstBinIndex >= 0, - "Expected to find firstBinId (%s) in the result binIds: %s, " - + "histogramSettings: %s, termRequest: %s", - firstBinId, binIds, histogramSettings, request); - } - - // Skip binIds that are before the smallest binId that we will use in the merged results. - for (int i = firstBinIndex; i < binIds.size(); i++) { - final Integer currentBinValue = results.getHistogramBins().get(i); - requestTotalCounts[i - firstBinIndex] += currentBinValue.intValue(); - } - } - } - - /** - * Return a new ThriftTermStatisticsResults with the total counts merged, and if enabled, - * histogram bins merged. - */ - public ThriftTermStatisticsResults merge() { - ThriftTermStatisticsResults results = new ThriftTermStatisticsResults(termResultsMap); - - if (histogramSettings != null) { - mergeHistogramBins(results); - } - - return results; - } - - - /** - * Takes multiple histogram results and merges them so: - * 1) Counts for the same binId (represents the time) and term are summed - * 2) All results are re-indexed to use the most recent bins found from the union of all bins - */ - private void mergeHistogramBins(ThriftTermStatisticsResults mergedResults) { - - mergedResults.setBinIds(mostRecentBinIds); - mergedResults.setHistogramSettings(histogramSettings); - - setMinCompleteBinId(mergedResults); - - useMostRecentBinsForEachThriftTermResults(); - } - - private void setMinCompleteBinId(ThriftTermStatisticsResults mergedResults) { - if (mostRecentBinIds.isEmpty()) { - Preconditions.checkState(minCompleteBinId == null); - // This is the case where the requested numBins is set to 0. We don't have any binIds, - // and the minCompleteBinId has to be unset. - LOG.debug("Empty binIds returned for mergedResults: {}", mergedResults); - } else { - Preconditions.checkNotNull(minCompleteBinId); - - Integer maxBinId = mostRecentBinIds.get(mostRecentBinIds.size() - 1); - if (minCompleteBinId <= maxBinId) { - mergedResults.setMinCompleteBinId(minCompleteBinId); - } else { - // Leaving the minCompleteBinId unset as it is outside the range of the returned binIds. 
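A numeric illustration of this rule, with hypothetical bin numbers:

    // Suppose the merged responses all have binIds 96..105 (maxBinId = 105) and report
    // minCompleteBinId = 100, minCompleteBinId = 102, and unset (adjusted to 105 + 1 = 106 in
    // addMinCompleteBinId, because that response had no complete bins). The merged value is
    // max(100, 102, 106) = 106 > 105, so minCompleteBinId is left unset on the merged results.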
- LOG.debug("Computed minCompleteBinId: {} is out of maxBinId: {} for mergedResults: {}",
- minCompleteBinId, maxBinId, mergedResults);
- MIN_COMPLETE_BIN_ID_OUT_OF_RANGE.increment();
- }
- }
- }
-
- /**
- * Check that the binIds we are using are contiguous. Increment the provided stat if we find
- * a gap, as we don't expect to find any.
- * See: SEARCH-4362
- *
- * @param sortedBinIds most recent numBins sorted binIds.
- * @param binIdGapCounter stat to increment if we see a gap in the binId range.
- */
- @VisibleForTesting
- static void checkForBinIdGaps(List sortedBinIds, SearchCounter binIdGapCounter) {
- for (int i = sortedBinIds.size() - 1; i > 0; i--) {
- final Integer currentBinId = sortedBinIds.get(i);
- final Integer previousBinId = sortedBinIds.get(i - 1);
-
- if (previousBinId < currentBinId - 1) {
- binIdGapCounter.increment();
- break;
- }
- }
- }
-
- /**
- * Returns a view containing only the last N items from the list
- */
- private static List takeLastN(List lst, int n) {
- Preconditions.checkArgument(n <= lst.size(),
- "Attempting to take more elements than the list has. List size: %s, n: %s", lst.size(), n);
- return lst.subList(lst.size() - n, lst.size());
- }
-
- private void useMostRecentBinsForEachThriftTermResults() {
- for (Map.Entry entry : termResultsMap.entrySet()) {
- ThriftTermRequest request = entry.getKey();
- ThriftTermResults results = entry.getValue();
-
- List histogramBins = Lists.newArrayList();
- results.setHistogramBins(histogramBins);
-
- int[] requestTotalCounts = mergedTermRequestTotalCounts.get(request);
- Preconditions.checkNotNull(requestTotalCounts);
-
- for (int totalCount : requestTotalCounts) {
- histogramBins.add(totalCount);
- }
- }
- }
-
- /**
- * Merges search stats from several earlybird responses and puts them in
- * {@link ThriftSearchResults} structure.
- * - * @param responses earlybird responses to merge the search stats from - * @return merged search stats inside of {@link ThriftSearchResults} structure - */ - public static ThriftSearchResults mergeSearchStats(Collection responses) { - int numHitsProcessed = 0; - int numPartitionsEarlyTerminated = 0; - - for (EarlybirdResponse response : responses) { - ThriftSearchResults searchResults = response.getSearchResults(); - - if (searchResults != null) { - numHitsProcessed += searchResults.getNumHitsProcessed(); - numPartitionsEarlyTerminated += searchResults.getNumPartitionsEarlyTerminated(); - } - } - - ThriftSearchResults searchResults = new ThriftSearchResults(new ArrayList<>()); - searchResults.setNumHitsProcessed(numHitsProcessed); - searchResults.setNumPartitionsEarlyTerminated(numPartitionsEarlyTerminated); - return searchResults; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/TierResponseAccumulator.java b/src/java/com/twitter/search/earlybird_root/mergers/TierResponseAccumulator.java deleted file mode 100644 index 58b7cb877..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/TierResponseAccumulator.java +++ /dev/null @@ -1,97 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.ArrayList; -import java.util.List; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.TierResponse; - -public final class TierResponseAccumulator extends ResponseAccumulator { - private static final String TARGET_TYPE_TIER = "tier"; - - private final List tierResponses = new ArrayList<>(); - // Total number of partitions the request was sent to, across all tiers. - private int totalPartitionsQueriedInAllTiers = 0; - // Among the above partitions, the number of them that returned successful responses. - private int totalSuccessfulPartitionsInAllTiers = 0; - - @Override - public String getNameForLogging(int responseIndex, int numTotalResponses) { - return TARGET_TYPE_TIER + (numTotalResponses - responseIndex); - } - - @Override - public String getNameForEarlybirdResponseCodeStats(int responseIndex, int numTotalResponses) { - return TARGET_TYPE_TIER + (numTotalResponses - responseIndex); - } - - @Override - protected boolean isMergingAcrossTiers() { - return true; - } - - @Override - public boolean shouldEarlyTerminateMerge(EarlyTerminateTierMergePredicate merger) { - if (foundError()) { - return true; - } - - int numResults = 0; - for (EarlybirdResponse resp : getSuccessResponses()) { - if (resp.isSetSearchResults()) { - numResults += resp.getSearchResults().getResultsSize(); - } - } - - return merger.shouldEarlyTerminateTierMerge(numResults, foundEarlyTermination()); - } - - @Override - public void handleSkippedResponse(EarlybirdResponseCode responseCode) { - tierResponses.add(new TierResponse() - .setNumPartitions(0) - .setNumSuccessfulPartitions(0) - .setTierResponseCode(responseCode)); - } - - @Override - public void handleErrorResponse(EarlybirdResponse response) { - // TierResponse, which is only returned if merging results from different tiers. 
- TierResponse tr = new TierResponse(); - if (response != null) { - if (response.isSetResponseCode()) { - tr.setTierResponseCode(response.getResponseCode()); - } else { - tr.setTierResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR); - } - tr.setNumPartitions(response.getNumPartitions()); - tr.setNumSuccessfulPartitions(0); - totalPartitionsQueriedInAllTiers += response.getNumPartitions(); - } else { - tr.setTierResponseCode(EarlybirdResponseCode.TRANSIENT_ERROR) - .setNumPartitions(0) - .setNumSuccessfulPartitions(0); - } - - tierResponses.add(tr); - } - - @Override - public AccumulatedResponses.PartitionCounts getPartitionCounts() { - return new AccumulatedResponses.PartitionCounts(totalPartitionsQueriedInAllTiers, - totalSuccessfulPartitionsInAllTiers, tierResponses); - } - - @Override - public void extraSuccessfulResponseHandler(EarlybirdResponse response) { - // Record tier stats. - totalPartitionsQueriedInAllTiers += response.getNumPartitions(); - totalSuccessfulPartitionsInAllTiers += response.getNumSuccessfulPartitions(); - - tierResponses.add(new TierResponse() - .setNumPartitions(response.getNumPartitions()) - .setNumSuccessfulPartitions(response.getNumSuccessfulPartitions()) - .setTierResponseCode(EarlybirdResponseCode.SUCCESS)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/TopTweetsResponseMerger.java b/src/java/com/twitter/search/earlybird_root/mergers/TopTweetsResponseMerger.java deleted file mode 100644 index 5d76ab4cd..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/TopTweetsResponseMerger.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -import java.util.List; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.collectors.RelevanceMergeCollector; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; - -/** - * Merger class to merge toptweets EarlybirdResponse objects - */ -public class TopTweetsResponseMerger extends EarlybirdResponseMerger { - - private static final double SUCCESSFUL_RESPONSE_THRESHOLD = 0.9; - - private static final SearchTimerStats TIMER = - SearchTimerStats.export("merge_top_tweets", TimeUnit.NANOSECONDS, false, true); - - public TopTweetsResponseMerger(EarlybirdRequestContext requestContext, - List> responses, - ResponseAccumulator mode) { - super(requestContext, responses, mode); - } - - @Override - protected SearchTimerStats getMergedResponseTimer() { - return TIMER; - } - - @Override - protected double getDefaultSuccessResponseThreshold() { - return SUCCESSFUL_RESPONSE_THRESHOLD; - } - - @Override - protected EarlybirdResponse internalMerge(EarlybirdResponse mergedResponse) { - final ThriftSearchQuery searchQuery = requestContext.getRequest().getSearchQuery(); - - Preconditions.checkNotNull(searchQuery); - Preconditions.checkState(searchQuery.isSetRankingMode()); - Preconditions.checkState(searchQuery.getRankingMode() == ThriftSearchRankingMode.TOPTWEETS); - - int numResultsRequested = computeNumResultsToKeep(); - - RelevanceMergeCollector collector = new RelevanceMergeCollector(responses.size()); - - 
addResponsesToCollector(collector); - ThriftSearchResults searchResults = collector.getAllSearchResults(); - if (numResultsRequested < searchResults.getResults().size()) { - searchResults.setResults(searchResults.getResults().subList(0, numResultsRequested)); - } - - mergedResponse.setSearchResults(searchResults); - - return mergedResponse; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/mergers/TrimStats.java b/src/java/com/twitter/search/earlybird_root/mergers/TrimStats.java deleted file mode 100644 index 284f3bc1b..000000000 --- a/src/java/com/twitter/search/earlybird_root/mergers/TrimStats.java +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.search.earlybird_root.mergers; - -/** - * Tracks what situations are encountered when trimming results - */ -class TrimStats { - protected static final TrimStats EMPTY_STATS = new TrimStats(); - - private int maxIdFilterCount = 0; - private int minIdFilterCount = 0; - private int removedDupsCount = 0; - private int resultsTruncatedFromTailCount = 0; - - int getMinIdFilterCount() { - return minIdFilterCount; - } - - int getRemovedDupsCount() { - return removedDupsCount; - } - - int getResultsTruncatedFromTailCount() { - return resultsTruncatedFromTailCount; - } - - void decreaseMaxIdFilterCount() { - maxIdFilterCount--; - } - - void decreaseMinIdFilterCount() { - minIdFilterCount--; - } - - public void clearMaxIdFilterCount() { - this.maxIdFilterCount = 0; - } - - public void clearMinIdFilterCount() { - this.minIdFilterCount = 0; - } - - void increaseMaxIdFilterCount() { - maxIdFilterCount++; - } - - void increaseMinIdFilterCount() { - minIdFilterCount++; - } - - void increaseRemovedDupsCount() { - removedDupsCount++; - } - - void setResultsTruncatedFromTailCount(int resultsTruncatedFromTailCount) { - this.resultsTruncatedFromTailCount = resultsTruncatedFromTailCount; - } - - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - - builder.append("TrimStats{"); - builder.append("maxIdFilterCount=").append(maxIdFilterCount); - builder.append(", minIdFilterCount=").append(minIdFilterCount); - builder.append(", removedDupsCount=").append(removedDupsCount); - builder.append(", resultsTruncatedFromTailCount=").append(resultsTruncatedFromTailCount); - builder.append("}"); - - return builder.toString(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/quota/BUILD b/src/java/com/twitter/search/earlybird_root/quota/BUILD deleted file mode 100644 index 8f81a89fa..000000000 --- a/src/java/com/twitter/search/earlybird_root/quota/BUILD +++ /dev/null @@ -1,15 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/org/json", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/dark", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/common/util/json", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/quota/ClientIdQuotaManager.java b/src/java/com/twitter/search/earlybird_root/quota/ClientIdQuotaManager.java deleted file mode 100644 index 2a5723a3d..000000000 --- a/src/java/com/twitter/search/earlybird_root/quota/ClientIdQuotaManager.java +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.search.earlybird_root.quota; - -import java.util.Optional; - -/** A manager that determines how quota restrictions should be applied for 
each client. */ -public interface ClientIdQuotaManager { - /** - * Returns the quota for the given client, if one is set. - * - * @param clientId The ID of the client. - * @return The quota for the given client (in requests per second), if one is set. - */ - Optional getQuotaForClient(String clientId); - - /** - * Returns the common pool quota. A common pool quota must always be set. - * - * @return The common pool quota (in requests per second). - */ - QuotaInfo getCommonPoolQuota(); - -} diff --git a/src/java/com/twitter/search/earlybird_root/quota/ConfigBasedQuotaConfig.java b/src/java/com/twitter/search/earlybird_root/quota/ConfigBasedQuotaConfig.java deleted file mode 100644 index 6565fdae6..000000000 --- a/src/java/com/twitter/search/earlybird_root/quota/ConfigBasedQuotaConfig.java +++ /dev/null @@ -1,161 +0,0 @@ -package com.twitter.search.earlybird_root.quota; - -import java.io.IOException; -import java.io.InputStream; -import java.nio.charset.StandardCharsets; -import java.util.Iterator; -import java.util.Map; -import java.util.Optional; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.atomic.AtomicReference; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Maps; - -import org.apache.commons.io.IOUtils; -import org.json.JSONException; -import org.json.JSONObject; - -import com.twitter.common.util.Clock; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.util.io.periodic.PeriodicFileLoader; -import com.twitter.search.common.util.json.JSONParsingUtil; - -/** - * Periodically loads a json serialized map that contains the quota information indexed by - * client id. - * - * Each json object from the map is required to have an int property that represents a client's quota. - * The key for the quota property is passed to this class. - * - * Optionally it can have a should_enforce property of type boolean - * - * If this two properties are not present an exception will be thrown. - */ -public class ConfigBasedQuotaConfig extends PeriodicFileLoader { - private static final String UNSET_EMAIL = "unset"; - - private static final String PER_CLIENT_QUOTA_GAUGE_NAME_PATTERN = - "config_based_quota_for_client_id_%s"; - private static final String PER_EMAIL_QUOTA_GAUGE_NAME_PATTERN = - "config_based_quota_for_email_%s"; - - @VisibleForTesting - static final SearchLongGauge TOTAL_QUOTA = - SearchLongGauge.export("total_config_based_quota"); - - @VisibleForTesting - static final SearchLongGauge ENTRIES_COUNT = - SearchLongGauge.export("config_repo_quota_config_entries_count"); - - private final AtomicReference> clientQuotas = - new AtomicReference<>(); - - private String clientQuotaKey; - private boolean requireQuotaConfigForClients; - - /** - * Creates the object that manages loads the config from: quotaConfigPath. It periodically - * reloads the config file using the given executor service. - * - * @param quotaConfigPath Path to configuration file. - * @param executorService ScheduledExecutorService to be used for periodically reloading the file. - * @param clientQuotaKey The key that will be used to extract client quotas. 
- * @param requireQuotaConfigForClients Determines whether a client can be skipped - * if the associated object is missing the quota key - * (ie a client that is a SuperRoot client but the current service is Archive) - */ - public static ConfigBasedQuotaConfig newConfigBasedQuotaConfig( - String quotaConfigPath, - String clientQuotaKey, - boolean requireQuotaConfigForClients, - ScheduledExecutorService executorService, - Clock clock - ) throws Exception { - ConfigBasedQuotaConfig configLoader = new ConfigBasedQuotaConfig( - quotaConfigPath, - clientQuotaKey, - requireQuotaConfigForClients, - executorService, - clock - ); - configLoader.init(); - return configLoader; - } - - public ConfigBasedQuotaConfig( - String quotaConfigPath, - String clientQuotaKey, - boolean requireQuotaConfigForClients, - ScheduledExecutorService executorService, - Clock clock - ) throws Exception { - super("quotaConfig", quotaConfigPath, executorService, clock); - this.clientQuotaKey = clientQuotaKey; - this.requireQuotaConfigForClients = requireQuotaConfigForClients; - } - - /** - * Returns the quota information for a specific client id. - */ - public Optional getQuotaForClient(String clientId) { - return Optional.ofNullable(clientQuotas.get().get(clientId)); - } - - /** - * Load the json format and store it in a map. - */ - @Override - protected void accept(InputStream fileStream) throws JSONException, IOException { - String fileContents = IOUtils.toString(fileStream, StandardCharsets.UTF_8); - JSONObject quotaConfig = new JSONObject(JSONParsingUtil.stripComments(fileContents)); - - Map perEmailQuotas = Maps.newHashMap(); - ImmutableMap.Builder quotasBuilder = new ImmutableMap.Builder<>(); - Iterator clientIds = quotaConfig.keys(); - - long totalQuota = 0; - while (clientIds.hasNext()) { - String clientId = clientIds.next(); - JSONObject clientQuota = quotaConfig.getJSONObject(clientId); - - // Skip clients that don't send requests to this service. 
- // (ie some SuperRoot clients are not Archive clients) - if (!requireQuotaConfigForClients && !clientQuota.has(clientQuotaKey)) { - continue; - } - - int quotaValue = clientQuota.getInt(clientQuotaKey); - boolean shouldEnforce = clientQuota.optBoolean("should_enforce", false); - String tierValue = clientQuota.optString("tier", QuotaInfo.DEFAULT_TIER_VALUE); - boolean archiveAccess = clientQuota.optBoolean("archive_access", - QuotaInfo.DEFAULT_ARCHIVE_ACCESS_VALUE); - String email = clientQuota.optString("email", UNSET_EMAIL); - - quotasBuilder.put( - clientId, - new QuotaInfo(clientId, email, quotaValue, shouldEnforce, tierValue, archiveAccess)); - - SearchLongGauge perClientQuota = SearchLongGauge.export( - String.format(PER_CLIENT_QUOTA_GAUGE_NAME_PATTERN, clientId)); - perClientQuota.set(quotaValue); - totalQuota += quotaValue; - - Integer emailQuota = perEmailQuotas.get(email); - if (emailQuota == null) { - emailQuota = 0; - } - perEmailQuotas.put(email, emailQuota + quotaValue); - } - - clientQuotas.set(quotasBuilder.build()); - TOTAL_QUOTA.set(totalQuota); - ENTRIES_COUNT.set(clientQuotas.get().size()); - - for (String email : perEmailQuotas.keySet()) { - SearchLongGauge.export(String.format(PER_EMAIL_QUOTA_GAUGE_NAME_PATTERN, email)).set( - perEmailQuotas.get(email)); - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/quota/ConfigRepoBasedQuotaManager.java b/src/java/com/twitter/search/earlybird_root/quota/ConfigRepoBasedQuotaManager.java deleted file mode 100644 index a2f3b6e7e..000000000 --- a/src/java/com/twitter/search/earlybird_root/quota/ConfigRepoBasedQuotaManager.java +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.search.earlybird_root.quota; - -import java.util.Optional; - -import javax.inject.Inject; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.dark.ServerSetResolver.SelfServerSetResolver; - -/** - * A config based implementation of the {@code ClientIdQuotaManager} interface. - * It uses a ConfigBasedQuotaConfig object to load the contents of the config. - */ -public class ConfigRepoBasedQuotaManager implements ClientIdQuotaManager { - - public static final String COMMON_POOL_CLIENT_ID = "common_pool"; - - private final ConfigBasedQuotaConfig quotaConfig; - private final SelfServerSetResolver serverSetResolver; - - /** Creates a new ConfigRepoBasedQuotaManager instance. 
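[Editor's aside on the loader above] ConfigBasedQuotaConfig.accept() expects a JSON object keyed by client id, where each entry carries the configured quota under the injected clientQuotaKey plus the optional should_enforce, tier, archive_access and email fields. A minimal standalone sketch of that shape and of reading one entry with org.json follows; the quota key name ("quota") and the client ids are made up for illustration, and the real loader additionally strips // comments (JSONParsingUtil.stripComments) and skips entries missing the quota key when requireQuotaConfigForClients is false.

```java
import java.util.Iterator;
import org.json.JSONObject;

// Standalone sketch only -- not Earlybird code. The quota key ("quota") and the client
// ids are hypothetical; the optional fields mirror the ones read by accept().
public final class QuotaConfigSketch {
  public static void main(String[] args) {
    String fileContents = "{"
        + "\"common_pool\":    {\"quota\": 1000, \"should_enforce\": true, \"email\": \"search-infra\"},"
        + "\"example_client\": {\"quota\": 250, \"tier\": \"high\", \"archive_access\": true}"
        + "}";

    JSONObject quotaConfig = new JSONObject(fileContents);
    Iterator<String> clientIds = quotaConfig.keys();
    while (clientIds.hasNext()) {
      String clientId = clientIds.next();
      JSONObject clientQuota = quotaConfig.getJSONObject(clientId);
      int quotaValue = clientQuota.getInt("quota");
      boolean shouldEnforce = clientQuota.optBoolean("should_enforce", false);
      String tier = clientQuota.optString("tier", "no_tier");
      System.out.println(clientId + ": " + quotaValue + " rps, enforce=" + shouldEnforce
          + ", tier=" + tier);
    }
  }
}
```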
*/ - @Inject - public ConfigRepoBasedQuotaManager( - SelfServerSetResolver serverSetResolver, - ConfigBasedQuotaConfig quotaConfig) { - Preconditions.checkNotNull(quotaConfig); - - this.quotaConfig = quotaConfig; - this.serverSetResolver = serverSetResolver; - } - - @Override - public Optional getQuotaForClient(String clientId) { - Optional quotaForClient = quotaConfig.getQuotaForClient(clientId); - - if (!quotaForClient.isPresent()) { - return Optional.empty(); - } - - QuotaInfo quota = quotaForClient.get(); - - int quotaValue = quota.getQuota(); - int rootInstanceCount = serverSetResolver.getServerSetSize(); - if (rootInstanceCount > 0) { - quotaValue = (int) Math.ceil((double) quotaValue / rootInstanceCount); - } - - return Optional.of( - new QuotaInfo( - quota.getQuotaClientId(), - quota.getQuotaEmail(), - quotaValue, - quota.shouldEnforceQuota(), - quota.getClientTier(), - quota.hasArchiveAccess())); - } - - @Override - public QuotaInfo getCommonPoolQuota() { - Optional commonPoolQuota = getQuotaForClient(COMMON_POOL_CLIENT_ID); - Preconditions.checkState(commonPoolQuota.isPresent()); - return commonPoolQuota.get(); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/quota/QuotaInfo.java b/src/java/com/twitter/search/earlybird_root/quota/QuotaInfo.java deleted file mode 100644 index d672f602b..000000000 --- a/src/java/com/twitter/search/earlybird_root/quota/QuotaInfo.java +++ /dev/null @@ -1,78 +0,0 @@ -package com.twitter.search.earlybird_root.quota; - -import com.google.common.base.Preconditions; - -/** - * Simple container of quota related information. - */ -public class QuotaInfo { - public static final String DEFAULT_TIER_VALUE = "no_tier"; - public static final boolean DEFAULT_ARCHIVE_ACCESS_VALUE = false; - - private final String quotaClientId; - private final String quotaEmail; - private final int quota; - private final boolean shouldEnforceQuota; - private final String clientTier; - private final boolean archiveAccess; - - /** - * Creates a new QuotaInfo object with the given clientId, quota and shouldEnforceQuota. - */ - public QuotaInfo( - String quotaClientId, - String quotaEmail, - int quota, - boolean shouldEnforceQuota, - String clientTier, - boolean archiveAccess) { - this.quotaClientId = Preconditions.checkNotNull(quotaClientId); - this.quotaEmail = Preconditions.checkNotNull(quotaEmail); - this.quota = quota; - this.shouldEnforceQuota = shouldEnforceQuota; - this.clientTier = Preconditions.checkNotNull(clientTier); - this.archiveAccess = archiveAccess; - } - - /** - * Returns the clientId for which we have the QuotaInfo. - */ - public String getQuotaClientId() { - return quotaClientId; - } - - /** - * Returns the email associated with this clientId. - */ - public String getQuotaEmail() { - return quotaEmail; - } - - /** - * Returns the integer based quota for the stored client id. - */ - public int getQuota() { - return quota; - } - - /** - * Returns whether the quota should be enforced or not. - */ - public boolean shouldEnforceQuota() { - return shouldEnforceQuota; - } - - /** - * Return tier info about the client. - */ - public String getClientTier() { - return clientTier; - } - - /** - * Returns whether the client has access to the full archive. 
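[Editor's aside on ConfigRepoBasedQuotaManager.getQuotaForClient above] The configured quota is treated as a fleet-wide number, so each SuperRoot instance enforces only its share, computed with a ceiling division over the current server-set size. A quick arithmetic sketch with hypothetical values:

```java
// Hypothetical values: a 1000 rps configured quota, enforced by a server set of 7 roots.
int configuredQuota = 1000;
int rootInstanceCount = 7;   // from SelfServerSetResolver.getServerSetSize()
int perInstanceQuota = (int) Math.ceil((double) configuredQuota / rootInstanceCount);
// perInstanceQuota == 143; rounding up means the effective fleet-wide ceiling is
// 7 * 143 = 1001 rps, slightly above the configured value rather than below it.
```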
- */ - public boolean hasArchiveAccess() { - return archiveAccess; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/AbstractRecencyAndRelevanceRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/AbstractRecencyAndRelevanceRequestRouter.java deleted file mode 100644 index bf4154e1a..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/AbstractRecencyAndRelevanceRequestRouter.java +++ /dev/null @@ -1,442 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; - -import com.google.common.base.Preconditions; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.futures.Futures; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.util.earlybird.EarlybirdResponseMergeUtil; -import com.twitter.search.earlybird.thrift.AdjustedRequestParams; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchQuery; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.ClientErrorException; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestUtil; -import com.twitter.search.earlybird_root.common.EarlybirdServiceResponse; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.mergers.SuperRootResponseMerger; -import com.twitter.search.queryparser.util.QueryUtil; -import com.twitter.util.Function; -import com.twitter.util.Function0; -import com.twitter.util.Future; - -/** - * For Recency traffic SuperRoot hits realtime and/or protected realtime first and then archive - */ -public abstract class AbstractRecencyAndRelevanceRequestRouter extends RequestRouter { - public static final String FULL_ARCHIVE_AVAILABLE_FOR_GET_PROTECTED_TWEETS_ONLY_DECIDER_KEY = - "superroot_full_archive_cluster_available_for_get_protected_tweets_only_requests"; - public static final String FULL_ARCHIVE_AVAILABLE_FOR_NOT_ENOUGH_PROTECTED_RESULTS_DECIDER_KEY = - "superroot_full_archive_cluster_available_for_requests_without_enough_protected_results"; - - private static final Logger LOG = - LoggerFactory.getLogger(AbstractRecencyAndRelevanceRequestRouter.class); - - private final String skipProtectedClusterDeciderKey; - private final String skipFullArchiveClusterDeciderKey; - - private final SearchCounter realtimeResponseInvalidCounter; - private final SearchCounter realtimeResponseSearchResultsNotSetCounter; - private final SearchCounter minSearchedStatusIdLargerThanRequestMaxIdCounter; - private final SearchCounter minSearchedStatusIdLargerThanRequestUntilTimeCounter; - - private final Service realtime; - private final Service protectedRealtime; - private final Service fullArchive; - private final SuperRootResponseMerger responseMerger; - private final SearchDecider decider; - - AbstractRecencyAndRelevanceRequestRouter( - Service realtime, - 
Service protectedRealtime, - Service fullArchive, - EarlybirdTimeRangeFilter realtimeTimeRangeFilter, - EarlybirdTimeRangeFilter protectedTimeRangeFilter, - EarlybirdTimeRangeFilter fullArchiveTimeRangeFilter, - ThriftSearchRankingMode rankingMode, - Clock clock, - SearchDecider decider, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - LOG.info("Instantiating AbstractRecencyAndRelevanceRequestRouter"); - this.realtime = realtimeTimeRangeFilter.andThen(realtime); - this.protectedRealtime = protectedTimeRangeFilter.andThen(protectedRealtime); - this.fullArchive = fullArchiveTimeRangeFilter.andThen(fullArchive); - this.responseMerger = new SuperRootResponseMerger(rankingMode, featureSchemaMerger, clock); - this.decider = decider; - - String rankingModeForStats = rankingMode.name().toLowerCase(); - skipProtectedClusterDeciderKey = - String.format("superroot_skip_protected_cluster_for_%s_requests", rankingModeForStats); - skipFullArchiveClusterDeciderKey = - String.format("superroot_skip_full_archive_cluster_for_%s_requests", rankingModeForStats); - - realtimeResponseInvalidCounter = - SearchCounter.export(rankingModeForStats + "_realtime_response_invalid"); - realtimeResponseSearchResultsNotSetCounter = - SearchCounter.export(rankingModeForStats + "_realtime_response_search_results_not_set"); - minSearchedStatusIdLargerThanRequestMaxIdCounter = SearchCounter.export( - rankingModeForStats + "_min_searched_status_id_larger_than_request_max_id"); - minSearchedStatusIdLargerThanRequestUntilTimeCounter = SearchCounter.export( - rankingModeForStats + "_min_searched_status_id_larger_than_request_until_time"); - } - - private void checkRequestPreconditions(EarlybirdRequest request) { - // CollectorParams should be set in EarlybirdRequestUtil.checkAndSetCollectorParams(). - Preconditions.checkNotNull(request.getSearchQuery().getCollectorParams()); - - // return a Client error if the num results are less than 0 - if (request.getSearchQuery().getNumResults() < 0) { - throw new ClientErrorException("The request.searchQuery.numResults field can't be negative"); - } - - if (request.getSearchQuery().getCollectorParams().getNumResultsToReturn() < 0) { - throw new ClientErrorException("The request.searchQuery.collectorParams.numResultsToReturn " - + "field can't be negative"); - } - } - - /** - * Hit realtime and/or protected realtime first, if not enough results, then hit archive, - * merge the results. - */ - @Override - public Future route(final EarlybirdRequestContext requestContext) { - EarlybirdRequest request = requestContext.getRequest(); - - this.checkRequestPreconditions(request); - - ArrayList savedRequestResponses = new ArrayList<>(); - - // If clients do not define numResults to return or the numResults requested are 0 - // return an empty EarlyBirdResponse without hitting any service. - if (request.getSearchQuery().getNumResults() == 0 - || request.getSearchQuery().getCollectorParams().getNumResultsToReturn() == 0) { - return Future.value(successNoResultsResponse()); - } - - // Realtime earlybird response is already required. Even if the service is not called - // the result passed to the mergers should be a valid one. - EarlybirdServiceResponse.ServiceState realtimeServiceState = - getRealtimeServiceState(requestContext); - final Future realtimeResponseFuture = - realtimeServiceState.serviceWasCalled() - ? 
getRealtimeResponse(savedRequestResponses, requestContext) - : Future.value(EarlybirdServiceResponse.serviceNotCalled(realtimeServiceState)); - - // If no flock response (followedUserIds) is set, request wont be sent to protected. - EarlybirdServiceResponse.ServiceState protectedServiceState = - getProtectedServiceState(requestContext); - final Future protectedResponseFuture = - protectedServiceState.serviceWasCalled() - ? getProtectedResponse(savedRequestResponses, requestContext) - : Future.value(EarlybirdServiceResponse.serviceNotCalled(protectedServiceState)); - - final Future archiveResponseFuture = - Futures.flatMap(realtimeResponseFuture, protectedResponseFuture, - new Function0>() { - @Override - public Future apply() { - EarlybirdServiceResponse realtimeResponse = Futures.get(realtimeResponseFuture); - EarlybirdServiceResponse protectedResponse = Futures.get(protectedResponseFuture); - EarlybirdServiceResponse.ServiceState fullArchiveServiceState = - getFullArchiveServiceState(requestContext, realtimeResponse, protectedResponse); - return fullArchiveServiceState.serviceWasCalled() - ? getFullArchiveResponse(savedRequestResponses, requestContext, - realtimeResponse.getResponse(), protectedResponse.getResponse()) - : Future.value( - EarlybirdServiceResponse.serviceNotCalled(fullArchiveServiceState)); - } - } - ); - - Future mergedResponse = responseMerger.mergeResponseFutures( - requestContext, realtimeResponseFuture, protectedResponseFuture, archiveResponseFuture); - mergedResponse = mergedResponse - .map(RequestRouterUtil.checkMinSearchedStatusId( - requestContext, - "max_id", - EarlybirdRequestUtil.getRequestMaxId(requestContext.getParsedQuery()), - realtimeResponseFuture, - protectedResponseFuture, - archiveResponseFuture, - minSearchedStatusIdLargerThanRequestMaxIdCounter)) - .map(RequestRouterUtil.checkMinSearchedStatusId( - requestContext, - "until_time", - EarlybirdRequestUtil.getRequestMaxIdFromUntilTime(requestContext.getParsedQuery()), - realtimeResponseFuture, - protectedResponseFuture, - archiveResponseFuture, - minSearchedStatusIdLargerThanRequestUntilTimeCounter)); - - return this.maybeAttachSentRequestsToDebugInfo( - savedRequestResponses, - requestContext, - mergedResponse - ); - } - - private EarlybirdResponse successNoResultsResponse() { - return new EarlybirdResponse(EarlybirdResponseCode.SUCCESS, 0) - .setSearchResults(new ThriftSearchResults().setResults(Collections.emptyList())); - } - - protected abstract boolean shouldSendRequestToFullArchiveCluster( - EarlybirdRequest request, EarlybirdResponse realtimeResponse); - - /** Determines if the protected service is available and if a request should be sent to it. */ - private EarlybirdServiceResponse.ServiceState getProtectedServiceState( - EarlybirdRequestContext requestContext) { - if (!requestContext.getRequest().isSetFollowedUserIds() - || requestContext.getRequest().getFollowedUserIds().isEmpty()) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_REQUESTED; - } - - if (decider.isAvailable(skipProtectedClusterDeciderKey)) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_AVAILABLE; - } - - return EarlybirdServiceResponse.ServiceState.SERVICE_CALLED; - } - - /** Determines if the realtime service is available and if a request should be sent to it. 
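[Editor's aside on route() above] The flow is easier to see without the nested Function plumbing: realtime and protected are queried in parallel, the full-archive call is gated on (and parameterized by) both of their responses, and the merger then waits on all three. Below is a rough, self-contained illustration of that dependency shape using plain JDK futures; the real code uses com.twitter.util.Future and Finagle services, and all names, counts and the "need more results" condition here are invented.

```java
import java.util.concurrent.CompletableFuture;

// Toy model of the SuperRoot fan-out only -- not Earlybird code.
public final class FanOutSketch {
  public static void main(String[] args) {
    // Realtime and protected are kicked off immediately and run in parallel.
    CompletableFuture<String> realtime = CompletableFuture.supplyAsync(() -> "realtime:12");
    CompletableFuture<String> protectedRt = CompletableFuture.supplyAsync(() -> "protected:3");

    // The archive call only starts once both earlier responses are known, because the
    // decision to call it (and the max_id it should use) depends on those responses.
    CompletableFuture<String> archive = realtime
        .thenCombine(protectedRt, (rt, prot) -> resultCount(rt) + resultCount(prot) < 20)
        .thenCompose(needMore -> needMore
            ? CompletableFuture.supplyAsync(() -> "archive:8")
            : CompletableFuture.completedFuture("archive:not-called"));

    // The merge waits on all three responses.
    String merged = realtime
        .thenCombine(protectedRt, (rt, prot) -> rt + " | " + prot)
        .thenCombine(archive, (partial, arch) -> partial + " | " + arch)
        .join();
    System.out.println(merged);
  }

  private static int resultCount(String response) {
    return Integer.parseInt(response.substring(response.indexOf(':') + 1));
  }
}
```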
*/ - private EarlybirdServiceResponse.ServiceState getRealtimeServiceState( - EarlybirdRequestContext requestContext) { - EarlybirdRequest request = requestContext.getRequest(); - - // SERVICE_NOT_REQUESTED should always be returned before other states as - // SuperRootResponseMerger has special logic for this case. - if (request.isSetGetProtectedTweetsOnly() && request.isGetProtectedTweetsOnly()) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_REQUESTED; - } - - return EarlybirdServiceResponse.ServiceState.SERVICE_CALLED; - } - - /** Determines if the full archive service is available and if a request should be sent to it. */ - private EarlybirdServiceResponse.ServiceState getFullArchiveServiceState( - EarlybirdRequestContext requestContext, - EarlybirdServiceResponse publicServiceResponse, - EarlybirdServiceResponse protectedServiceResponse) { - - // SERVICE_NOT_REQUESTED should be always be returned before other states as - // SuperRootResponseMerger has special logic for this case. - if (!requestContext.getRequest().isSetGetOlderResults() - || !requestContext.getRequest().isGetOlderResults()) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_REQUESTED; - } - - // allow requesting full archive service when decider is enabled - if (!decider.isAvailable(FULL_ARCHIVE_AVAILABLE_FOR_GET_PROTECTED_TWEETS_ONLY_DECIDER_KEY) - && requestContext.getRequest().isSetGetProtectedTweetsOnly() - && requestContext.getRequest().isGetProtectedTweetsOnly()) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_REQUESTED; - } - - if (decider.isAvailable(skipFullArchiveClusterDeciderKey)) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_AVAILABLE; - } - - boolean serviceWasCalledForPublic = - getFullArchiveServiceState(requestContext, publicServiceResponse).serviceWasCalled(); - boolean serviceWasCalledForProtected = - decider.isAvailable(FULL_ARCHIVE_AVAILABLE_FOR_NOT_ENOUGH_PROTECTED_RESULTS_DECIDER_KEY) - && getFullArchiveServiceState(requestContext, protectedServiceResponse).serviceWasCalled(); - if (!serviceWasCalledForPublic && !serviceWasCalledForProtected) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_CALLED; - } - - return EarlybirdServiceResponse.ServiceState.SERVICE_CALLED; - } - - private EarlybirdServiceResponse.ServiceState getFullArchiveServiceState( - EarlybirdRequestContext requestContext, - EarlybirdServiceResponse realtimeServiceResponse) { - EarlybirdResponse realtimeResponse = realtimeServiceResponse.getResponse(); - - if (!EarlybirdResponseMergeUtil.isValidResponse(realtimeResponse)) { - realtimeResponseInvalidCounter.increment(); - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_CALLED; - } - - if (!realtimeResponse.isSetSearchResults()) { - realtimeResponseSearchResultsNotSetCounter.increment(); - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_CALLED; - } - - if (!shouldSendRequestToFullArchiveCluster(requestContext.getRequest(), realtimeResponse)) { - return EarlybirdServiceResponse.ServiceState.SERVICE_NOT_CALLED; - } - - return EarlybirdServiceResponse.ServiceState.SERVICE_CALLED; - } - - /** - * Modify the original request context based on the followedUserId field and then send the - * request to the protected cluster. 
- */ - private Future getProtectedResponse( - ArrayList savedRequestResponses, - final EarlybirdRequestContext requestContext) { - EarlybirdRequestContext protectedRequestContext = - EarlybirdRequestContext.newContextWithRestrictFromUserIdFilter64(requestContext); - Preconditions.checkArgument( - protectedRequestContext.getRequest().getSearchQuery().isSetFromUserIDFilter64()); - - // SERVICE_NOT_REQUESTED should be always be returned before other states as - // SuperRootResponseMerger has special logic for this case. - if (protectedRequestContext.getRequest().getSearchQuery().getFromUserIDFilter64().isEmpty()) { - return Future.value(EarlybirdServiceResponse.serviceNotCalled( - EarlybirdServiceResponse.ServiceState.SERVICE_NOT_REQUESTED)); - } - - if (requestContext.getRequest().isSetAdjustedProtectedRequestParams()) { - adjustRequestParams(protectedRequestContext.getRequest(), - requestContext.getRequest().getAdjustedProtectedRequestParams()); - } - - LOG.debug("Request sent to the protected cluster: {}", protectedRequestContext.getRequest()); - return toEarlybirdServiceResponseFuture( - savedRequestResponses, - protectedRequestContext, - "protected", - this.protectedRealtime - ); - } - - private Future getRealtimeResponse( - ArrayList savedRequestResponses, - EarlybirdRequestContext requestContext) { - return toEarlybirdServiceResponseFuture( - savedRequestResponses, - requestContext, - "realtime", - this.realtime); - } - - /** - * Modifying the existing max id filter of the request or appending a new - * max id filter and then send the request to the full archive cluster. - */ - private Future getFullArchiveResponse( - ArrayList savedRequestResponses, - EarlybirdRequestContext requestContext, - EarlybirdResponse realtimeResponse, - EarlybirdResponse protectedResponse) { - long realtimeMinId = getMinSearchedId(realtimeResponse); - long protectedMinId = getMinSearchedId(protectedResponse); - // if both realtime and protected min searched ids are available, the larger(newer) one is used - // to make sure no tweets are left out. However, this means it might introduce duplicates for - // the other response. The response merger will dedup the response. This logic is enabled - // when full archive cluster is available for requests without enough protected results. - long minId = - decider.isAvailable(FULL_ARCHIVE_AVAILABLE_FOR_NOT_ENOUGH_PROTECTED_RESULTS_DECIDER_KEY) - ? Math.max(realtimeMinId, protectedMinId) : realtimeMinId; - - if (minId <= 0) { - // If the realtime response doesn't have a minSearchedStatusID set, get all results from - // the full archive cluster. - minId = Long.MAX_VALUE; - } - - // The [max_id] operator is inclusive in earlybirds. This means that a query with [max_id X] - // will return tweet X, if X matches the rest of the query. So we should add a [max_id (X - 1)] - // operator to the full archive query (instead of [max_id X]). Otherwise, we could end up with - // duplicates. For example: - // - // realtime response: results = [ 100, 90, 80 ], minSearchedStatusID = 80 - // full archive request: [max_id 80] - // full archive response: results = [ 80, 70, 60 ] - // - // In this case, tweet 80 would be returned from both the realtime and full archive clusters. 
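[Editor's aside] Putting the two comments above together with concrete, made-up ids, the boundary passed to the archive query is derived like this:

```java
// Illustrative values only.
long realtimeMinSearchedId = 80L;   // oldest id the realtime cluster searched down to
long protectedMinSearchedId = 95L;  // oldest id the protected cluster searched down to

// With the "not enough protected results" decider enabled, take the newer (larger)
// boundary so the archive query leaves no gap below the shallower cluster...
long minId = Math.max(realtimeMinSearchedId, protectedMinSearchedId); // 95

// ...and subtract 1 because [max_id X] is inclusive: the archive is asked for [max_id 94].
// Ids 80..94 may now come back from both realtime and the archive; the merger dedups them.
long archiveMaxId = minId - 1; // 94
```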
- EarlybirdRequestContext archiveRequestContext = - EarlybirdRequestContext.copyRequestContext( - requestContext, - QueryUtil.addOrReplaceMaxIdFilter( - requestContext.getParsedQuery(), - minId - 1)); - - if (requestContext.getRequest().isSetAdjustedFullArchiveRequestParams()) { - adjustRequestParams(archiveRequestContext.getRequest(), - requestContext.getRequest().getAdjustedFullArchiveRequestParams()); - } - - LOG.debug("Request sent to the full archive cluster: {},", archiveRequestContext.getRequest()); - return toEarlybirdServiceResponseFuture( - savedRequestResponses, - archiveRequestContext, - "archive", - this.fullArchive - ); - } - - private long getMinSearchedId(EarlybirdResponse response) { - return response != null && response.isSetSearchResults() - ? response.getSearchResults().getMinSearchedStatusID() : 0; - } - - private void adjustRequestParams(EarlybirdRequest request, - AdjustedRequestParams adjustedRequestParams) { - ThriftSearchQuery searchQuery = request.getSearchQuery(); - - if (adjustedRequestParams.isSetNumResults()) { - searchQuery.setNumResults(adjustedRequestParams.getNumResults()); - if (searchQuery.isSetCollectorParams()) { - searchQuery.getCollectorParams().setNumResultsToReturn( - adjustedRequestParams.getNumResults()); - } - } - - if (adjustedRequestParams.isSetMaxHitsToProcess()) { - searchQuery.setMaxHitsToProcess(adjustedRequestParams.getMaxHitsToProcess()); - if (searchQuery.isSetRelevanceOptions()) { - searchQuery.getRelevanceOptions().setMaxHitsToProcess( - adjustedRequestParams.getMaxHitsToProcess()); - } - if (searchQuery.isSetCollectorParams() - && searchQuery.getCollectorParams().isSetTerminationParams()) { - searchQuery.getCollectorParams().getTerminationParams().setMaxHitsToProcess( - adjustedRequestParams.getMaxHitsToProcess()); - } - } - - if (adjustedRequestParams.isSetReturnAllResults()) { - if (searchQuery.isSetRelevanceOptions()) { - searchQuery.getRelevanceOptions().setReturnAllResults( - adjustedRequestParams.isReturnAllResults()); - } - } - } - - private Future toEarlybirdServiceResponseFuture( - List savedRequestResponses, - EarlybirdRequestContext requestContext, - String sentTo, - Service service) { - Future responseFuture = service.apply(requestContext); - this.saveRequestResponse( - savedRequestResponses, sentTo, requestContext, responseFuture - ); - - return responseFuture.map(new Function() { - @Override - public EarlybirdServiceResponse apply(EarlybirdResponse response) { - return EarlybirdServiceResponse.serviceCalled(response); - } - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/BUILD b/src/java/com/twitter/search/earlybird_root/routers/BUILD deleted file mode 100644 index 1f9f71b60..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/BUILD +++ /dev/null @@ -1,25 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/org/slf4j:slf4j-api", - "finatra/inject/inject-core/src/main/scala", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/futures", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/util/earlybird", - "src/java/com/twitter/search/earlybird/common", - "src/java/com/twitter/search/earlybird/config", - 
"src/java/com/twitter/search/earlybird_root/common", - "src/java/com/twitter/search/earlybird_root/filters", - "src/java/com/twitter/search/earlybird_root/mergers", - "src/java/com/twitter/search/queryparser", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouter.java deleted file mode 100644 index 9883853f3..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouter.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.finagle.Service; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.util.Future; - -/** - * For Facets traffic SuperRoot forwards all traffic to the realtime cluster. - */ -public class FacetsRequestRouter extends RequestRouter { - - private final Service realtime; - - /** Creates a new FacetsRequestRouter instance to be used by the SuperRoot. */ - @Inject - public FacetsRequestRouter( - @Named(InjectionNames.REALTIME) - Service realtime, - @Named(FacetsRequestRouterModule.TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter timeRangeFilter) { - - this.realtime = timeRangeFilter.andThen(realtime); - } - - @Override - public Future route(EarlybirdRequestContext requestContext) { - return realtime.apply(requestContext); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouterModule.java b/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouterModule.java deleted file mode 100644 index 87aa5852e..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/FacetsRequestRouterModule.java +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.RealtimeServingRangeProvider; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; - -public class FacetsRequestRouterModule extends TwitterModule { - public static final String TIME_RANGE_FILTER = "facets_time_range_filter"; - - public static final String SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_facets_serving_range_boundary_hours_ago"; - - private ServingRangeProvider getServingRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider( - decider, SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - @Provides - @Singleton - @Named(TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesTimeRangeFilter(SearchDecider decider) throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getServingRangeProvider(decider)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouter.java deleted file 
mode 100644 index f870c2e68..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouter.java +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; - -public class RecencyRequestRouter extends AbstractRecencyAndRelevanceRequestRouter { - private static final SearchCounter SKIPPED_ARCHIVE_DUE_TO_REALTIME_EARLY_TERMINATION_COUNTER = - SearchCounter.export("recency_skipped_archive_due_to_realtime_early_termination"); - private static final SearchCounter SKIPPED_ARCHIVE_DUE_TO_REALTIME_ENOUGH_RESULTS_COUNTER = - SearchCounter.export("recency_skipped_archive_due_to_realtime_enough_results"); - - @Inject - public RecencyRequestRouter( - @Named(InjectionNames.REALTIME) - Service realtime, - @Named(InjectionNames.PROTECTED) - Service protectedRealtime, - @Named(InjectionNames.FULL_ARCHIVE) - Service fullArchive, - @Named(RecencyRequestRouterModule.REALTIME_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter realtimeTimeRangeFilter, - @Named(RecencyRequestRouterModule.PROTECTED_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter protectedTimeRangeFilter, - @Named(RecencyRequestRouterModule.FULL_ARCHIVE_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter fullArchiveTimeRangeFilter, - Clock clock, - SearchDecider decider, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(realtime, - protectedRealtime, - fullArchive, - realtimeTimeRangeFilter, - protectedTimeRangeFilter, - fullArchiveTimeRangeFilter, - ThriftSearchRankingMode.RECENCY, - clock, - decider, - featureSchemaMerger); - } - - @Override - protected boolean shouldSendRequestToFullArchiveCluster( - EarlybirdRequest request, EarlybirdResponse realtimeResponse) { - boolean isEarlyTerminated = realtimeResponse.isSetEarlyTerminationInfo() - && realtimeResponse.getEarlyTerminationInfo().isEarlyTerminated(); - if (isEarlyTerminated) { - SKIPPED_ARCHIVE_DUE_TO_REALTIME_EARLY_TERMINATION_COUNTER.increment(); - return false; - } - - // Check if we have the minimum number of results to fulfill the original request. 
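[Editor's aside] Taken together with the result-count check that this comment introduces just below, the recency-mode decision reduces to roughly the following (numbers invented; the reason given for the early-termination case is an inference from the surrounding gap-avoidance logic, not stated in the code):

```java
// Made-up inputs for the two recency-mode checks (early termination, result count).
boolean realtimeEarlyTerminated = false; // realtime searched its whole serving range
int numResultsRequested = 20;
int actualNumResults = 12;               // realtime found fewer hits than requested

// Skip the archive if realtime stopped early (it may still have results on a deeper page)
// or if it already returned enough results; otherwise go older.
boolean queryFullArchive =
    !realtimeEarlyTerminated && actualNumResults < numResultsRequested; // true here
```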
- int numResultsRequested = request.getSearchQuery().getNumResults(); - int actualNumResults = realtimeResponse.getSearchResults().getResultsSize(); - if (actualNumResults >= numResultsRequested) { - SKIPPED_ARCHIVE_DUE_TO_REALTIME_ENOUGH_RESULTS_COUNTER.increment(); - return false; - } - - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouterModule.java b/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouterModule.java deleted file mode 100644 index 009c04a68..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RecencyRequestRouterModule.java +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.FullArchiveServingRangeProvider; -import com.twitter.search.earlybird_root.filters.RealtimeServingRangeProvider; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; - -public class RecencyRequestRouterModule extends TwitterModule { - public static final String FULL_ARCHIVE_TIME_RANGE_FILTER = - "recency_full_archive_time_range_filter"; - public static final String REALTIME_TIME_RANGE_FILTER = - "recency_realtime_time_range_filter"; - public static final String PROTECTED_TIME_RANGE_FILTER = - "recency_protected_time_range_filter"; - - public static final String REALTIME_RANGE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_recency_realtime_serving_range_boundary_hours_ago"; - public static final String PROTECTED_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_recency_protected_serving_range_boundary_hours_ago"; - public static final String FULL_ARCHIVE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_recency_full_archive_serving_range_boundary_hours_ago"; - - private ServingRangeProvider getFullArchiveServingRangeProvider(final SearchDecider decider) - throws Exception { - return new FullArchiveServingRangeProvider( - decider, FULL_ARCHIVE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - private ServingRangeProvider getRealtimeServingRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider( - decider, REALTIME_RANGE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - private ServingRangeProvider getProtectedServingRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider( - decider, PROTECTED_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - @Provides - @Singleton - @Named(FULL_ARCHIVE_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesFullArchiveTimeRangeFilter(SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getFullArchiveServingRangeProvider(decider)); - } - - @Provides - @Singleton - @Named(REALTIME_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesRealtimeTimeRangeFilter(SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getRealtimeServingRangeProvider(decider)); - } - - @Provides - @Singleton - @Named(PROTECTED_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesProtectedTimeRangeFilter(SearchDecider decider) - throws Exception 
{ - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getProtectedServingRangeProvider(decider)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouter.java deleted file mode 100644 index cb7d10504..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouter.java +++ /dev/null @@ -1,100 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import java.util.concurrent.TimeUnit; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.google.common.base.Preconditions; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.Service; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.query.thriftjava.CollectorTerminationParams; -import com.twitter.search.earlybird.thrift.EarlybirdRequest; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode; -import com.twitter.search.earlybird.thrift.ThriftSearchResult; -import com.twitter.search.earlybird_root.common.EarlybirdFeatureSchemaMerger; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; - -public class RelevanceRequestRouter extends AbstractRecencyAndRelevanceRequestRouter { - private static final long MILLIS_IN_ONE_DAY = TimeUnit.DAYS.toMillis(1); - - @Inject - public RelevanceRequestRouter( - @Named(InjectionNames.REALTIME) - Service realtime, - @Named(InjectionNames.PROTECTED) - Service protectedRealtime, - @Named(InjectionNames.FULL_ARCHIVE) - Service fullArchive, - @Named(RelevanceRequestRouterModule.REALTIME_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter realtimeTimeRangeFilter, - @Named(RelevanceRequestRouterModule.PROTECTED_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter protectedTimeRangeFilter, - @Named(RelevanceRequestRouterModule.FULL_ARCHIVE_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter fullArchiveTimeRangeFilter, - Clock clock, - SearchDecider decider, - EarlybirdFeatureSchemaMerger featureSchemaMerger) { - super(realtime, - protectedRealtime, - fullArchive, - realtimeTimeRangeFilter, - protectedTimeRangeFilter, - fullArchiveTimeRangeFilter, - ThriftSearchRankingMode.RELEVANCE, - clock, - decider, - featureSchemaMerger); - } - - @Override - protected boolean shouldSendRequestToFullArchiveCluster( - EarlybirdRequest request, EarlybirdResponse realtimeResponse) { - int numResultsRequested = request.getSearchQuery().getNumResults(); - int numHitsProcessed = realtimeResponse.getSearchResults().isSetNumHitsProcessed() - ? realtimeResponse.getSearchResults().getNumHitsProcessed() - : -1; - if (numHitsProcessed < numResultsRequested) { - // Send query to the full archive cluster, if we went through fewer hits in the realtime - // cluster than the requested number of results. - return true; - } - - // If we have enough hits, don't query the full archive cluster yet. 
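[Editor's aside] Two more checks follow this comment: a hit-processing budget (maxHitsToProcess scaled by the number of successful partitions) and a freshness-gap test that compares Snowflake timestamps of the last returned result and the realtime minSearchedStatusID. Below is a hedged sketch of the gap test only, assuming the standard Snowflake layout (timestamp in the bits above a 22-bit shift, custom epoch 1288834974657 ms) that SnowflakeIdParser encapsulates; the ids are invented.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the one-day-gap check only -- the real code delegates to SnowflakeIdParser.
public final class GapCheckSketch {
  private static final long TWITTER_EPOCH_MS = 1288834974657L; // assumed Snowflake epoch
  private static final long MILLIS_IN_ONE_DAY = TimeUnit.DAYS.toMillis(1);

  static long timestampMs(long tweetId) {
    return (tweetId >>> 22) + TWITTER_EPOCH_MS; // assumed standard Snowflake layout
  }

  public static void main(String[] args) {
    long lastReturnedResultId = 1_500_000_000_000_000_000L; // last result in the response
    long minSearchedStatusId  = 1_499_000_000_000_000_000L; // oldest id realtime scanned

    long gapMs = timestampMs(lastReturnedResultId) - timestampMs(minSearchedStatusId);
    // A gap larger than a day means realtime scanned well past its last returned result,
    // so the client can still page within realtime; skipping the archive for now avoids
    // a hole between archive results and the not-yet-returned realtime ones.
    boolean skipArchiveForNow = gapMs > MILLIS_IN_ONE_DAY;
    System.out.println("gap(ms)=" + gapMs + ", skipArchiveForNow=" + skipArchiveForNow);
  }
}
```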
- int numSuccessfulPartitions = realtimeResponse.getNumSuccessfulPartitions(); - CollectorTerminationParams terminationParams = - request.getSearchQuery().getCollectorParams().getTerminationParams(); - - Preconditions.checkArgument(terminationParams.isSetMaxHitsToProcess()); - int maxHits = terminationParams.getMaxHitsToProcess() * numSuccessfulPartitions; - - if (numHitsProcessed >= maxHits) { - return false; - } - - // Check if there is a gap between the last result and the min status ID of current search. - // If the difference is larger than one day, then we can still get more tweets from the realtime - // cluster, so there's no need to query the full archive cluster just yet. If we don't check - // this, then we might end up with a big gap in the returned results. - int numReturnedResults = realtimeResponse.getSearchResults().getResultsSize(); - if (numReturnedResults > 0) { - ThriftSearchResult lastResult = - realtimeResponse.getSearchResults().getResults().get(numReturnedResults - 1); - long lastResultTimeMillis = SnowflakeIdParser.getTimestampFromTweetId(lastResult.getId()); - long minSearchedStatusID = realtimeResponse.getSearchResults().getMinSearchedStatusID(); - long minSearchedStatusIDTimeMillis = - SnowflakeIdParser.getTimestampFromTweetId(minSearchedStatusID); - if (lastResultTimeMillis - minSearchedStatusIDTimeMillis > MILLIS_IN_ONE_DAY) { - return false; - } - } - - return true; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouterModule.java b/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouterModule.java deleted file mode 100644 index eaed2c25e..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RelevanceRequestRouterModule.java +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.FullArchiveServingRangeProvider; -import com.twitter.search.earlybird_root.filters.RealtimeServingRangeProvider; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; - -public class RelevanceRequestRouterModule extends TwitterModule { - public static final String FULL_ARCHIVE_TIME_RANGE_FILTER = - "relevance_full_archive_time_range_filter"; - public static final String REALTIME_TIME_RANGE_FILTER = - "relevance_realtime_time_range_filter"; - public static final String PROTECTED_TIME_RANGE_FILTER = - "relevance_protected_time_range_filter"; - - public static final String REALTIME_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_relevance_realtime_serving_range_boundary_hours_ago"; - public static final String FULL_ARCHIVE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_relevance_full_archive_serving_range_boundary_hours_ago"; - public static final String PROTECTED_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_relevance_protected_serving_range_boundary_hours_ago"; - - private ServingRangeProvider getFullArchiveServingRangeProvider(final SearchDecider decider) - throws Exception { - return new FullArchiveServingRangeProvider( - decider, FULL_ARCHIVE_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - private ServingRangeProvider getRealtimeServingRangeProvider(final SearchDecider decider) - throws Exception { - 
return new RealtimeServingRangeProvider( - decider, REALTIME_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - private ServingRangeProvider getProtectedServingRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider( - decider, PROTECTED_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - @Provides - @Singleton - @Named(FULL_ARCHIVE_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesFullArchiveTimeRangeFilter(SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getFullArchiveServingRangeProvider(decider)); - } - - @Provides - @Singleton - @Named(REALTIME_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesRealtimeTimeRangeFilter(SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getRealtimeServingRangeProvider(decider)); - } - - @Provides - @Singleton - @Named(PROTECTED_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesProtectedTimeRangeFilter(SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getProtectedServingRangeProvider(decider)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/RequestRouter.java deleted file mode 100644 index e8d01b42c..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RequestRouter.java +++ /dev/null @@ -1,144 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import java.util.ArrayList; -import java.util.List; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.futures.Futures; -import com.twitter.search.earlybird.thrift.EarlybirdDebugInfo; -import com.twitter.search.earlybird.thrift.EarlybirdRequestResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.util.Future; -import com.twitter.util.Try; - -/** - * Responsible for handling requests in superroot. - */ -public abstract class RequestRouter { - private static final Logger LOG = LoggerFactory.getLogger(RequestRouter.class); - - /** - * Saved request and response, to be included in debug info. - */ - class RequestResponse { - // Where is this request sent to. Freeform text like "realtime", "archive", etc. - private String sentTo; - private EarlybirdRequestContext requestContext; - private Future earlybirdResponseFuture; - - RequestResponse(String sentTo, - EarlybirdRequestContext requestContext, - Future earlybirdResponseFuture) { - this.sentTo = sentTo; - this.requestContext = requestContext; - this.earlybirdResponseFuture = earlybirdResponseFuture; - } - - String getSentTo() { - return sentTo; - } - - public EarlybirdRequestContext getRequestContext() { - return requestContext; - } - - Future getEarlybirdResponseFuture() { - return earlybirdResponseFuture; - } - } - - /** - * Forward a request to different clusters and merge the responses back into one response. - * @param requestContext - */ - public abstract Future route(EarlybirdRequestContext requestContext); - - /** - * Save a request (and its response future) to be included in debug info. 
- */ - void saveRequestResponse( - List requestResponses, - String sentTo, - EarlybirdRequestContext earlybirdRequestContext, - Future earlybirdResponseFuture - ) { - requestResponses.add( - new RequestResponse( - sentTo, - earlybirdRequestContext, - earlybirdResponseFuture - ) - ); - } - - Future maybeAttachSentRequestsToDebugInfo( - List requestResponses, - EarlybirdRequestContext requestContext, - Future response - ) { - if (requestContext.getRequest().getDebugMode() >= 4) { - return this.attachSentRequestsToDebugInfo( - response, - requestResponses - ); - } else { - return response; - } - } - - /** - * Attaches saved client requests and their responses to the debug info within the - * main EarlybirdResponse. - */ - Future attachSentRequestsToDebugInfo( - Future currentResponse, - List requestResponses) { - - // Get all the response futures that we're waiting on. - List> allResponseFutures = new ArrayList<>(); - for (RequestResponse rr : requestResponses) { - allResponseFutures.add(rr.getEarlybirdResponseFuture()); - } - - // Pack all the futures into a single future. - Future>> allResponsesFuture = - Futures.collectAll(allResponseFutures); - - return currentResponse.flatMap(mainResponse -> { - if (!mainResponse.isSetDebugInfo()) { - mainResponse.setDebugInfo(new EarlybirdDebugInfo()); - } - - Future responseWithRequests = allResponsesFuture.map(allResponses -> { - // Get all individual response "Trys" and see if we can extract something from them - // that we can attach to the debugInfo. - for (int i = 0; i < allResponses.size(); i++) { - - Try responseTry = allResponses.get(i); - - if (responseTry.isReturn()) { - EarlybirdResponse attachedResponse = responseTry.get(); - - // Don't include the debug string, it's already a part of the main response's - // debug string. 
- attachedResponse.unsetDebugString(); - - EarlybirdRequestResponse reqResp = new EarlybirdRequestResponse(); - reqResp.setSentTo(requestResponses.get(i).getSentTo()); - reqResp.setRequest(requestResponses.get(i).getRequestContext().getRequest()); - reqResp.setResponse(attachedResponse.toString()); - - mainResponse.debugInfo.addToSentRequests(reqResp); - } - } - - return mainResponse; - }); - - return responseWithRequests; - }); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/RequestRouterUtil.java b/src/java/com/twitter/search/earlybird_root/routers/RequestRouterUtil.java deleted file mode 100644 index 785704982..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/RequestRouterUtil.java +++ /dev/null @@ -1,107 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import java.util.List; - -import com.google.common.base.Optional; -import com.google.common.collect.Lists; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.EarlybirdRequestType; -import com.twitter.search.earlybird_root.common.EarlybirdServiceResponse; -import com.twitter.util.Await; -import com.twitter.util.Function; -import com.twitter.util.Future; - -public final class RequestRouterUtil { - private static final Logger LOG = LoggerFactory.getLogger(RequestRouterUtil.class); - - private RequestRouterUtil() { - } - - /** - * Returns the function that checks if the minSearchedStatusID on the merged response is higher - * than the max ID in the request. - * - * @param requestContext The request context that stores the request. - * @param operator The operator that we're checking against (max_id or until_time). - * @param requestMaxId The maxId specified in the request (in the given operator). - * @param realtimeResponseFuture The response from the realtime cluster. - * @param protectedResponseFuture The response from the protected cluster. - * @param fullArchiveResponseFuture The response from the full archive cluster. - * @param stat The stat to increment if minSearchedStatusID on the merged response is higher than - * the max ID in the request. - * @return A function that checks if the minSearchedStatusID on the merged response is higher than - * the max ID in the request. - */ - public static Function checkMinSearchedStatusId( - final EarlybirdRequestContext requestContext, - final String operator, - final Optional requestMaxId, - final Future realtimeResponseFuture, - final Future protectedResponseFuture, - final Future fullArchiveResponseFuture, - final SearchCounter stat) { - return new Function() { - @Override - public EarlybirdResponse apply(EarlybirdResponse mergedResponse) { - if (requestMaxId.isPresent() - && (mergedResponse.getResponseCode() == EarlybirdResponseCode.SUCCESS) - && mergedResponse.isSetSearchResults() - && mergedResponse.getSearchResults().isSetMinSearchedStatusID()) { - long minSearchedStatusId = mergedResponse.getSearchResults().getMinSearchedStatusID(); - if (minSearchedStatusId > requestMaxId.get()) { - stat.increment(); - // We're logging this only for STRICT RECENCY as it was very spammy for all types of - // request. 
We don't expect this to happen for STRICT RECENCY but we're tracking - // with the stat when it happens for RELEVANCE and RECENCY - if (requestContext.getEarlybirdRequestType() == EarlybirdRequestType.STRICT_RECENCY) { - String logMessage = "Response has a minSearchedStatusID ({}) larger than request " - + operator + " ({})." - + "\nrequest type: {}" - + "\nrequest: {}" - + "\nmerged response: {}" - + "\nrealtime response: {}" - + "\nprotected response: {}" - + "\nfull archive response: {}"; - List logMessageParams = Lists.newArrayList(); - logMessageParams.add(minSearchedStatusId); - logMessageParams.add(requestMaxId.get()); - logMessageParams.add(requestContext.getEarlybirdRequestType()); - logMessageParams.add(requestContext.getRequest()); - logMessageParams.add(mergedResponse); - - // The realtime, protected and full archive response futures are "done" at this point: - // we have to wait for them in order to build the merged response. So it's ok to call - // Await.result() here to get the responses: it's a no-op. - try { - logMessageParams.add(Await.result(realtimeResponseFuture).getResponse()); - } catch (Exception e) { - logMessageParams.add(e); - } - try { - logMessageParams.add(Await.result(protectedResponseFuture).getResponse()); - } catch (Exception e) { - logMessageParams.add(e); - } - try { - logMessageParams.add(Await.result(fullArchiveResponseFuture).getResponse()); - } catch (Exception e) { - logMessageParams.add(e); - } - - LOG.warn(logMessage, logMessageParams.toArray()); - } - } - } - - return mergedResponse; - } - }; - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouter.java deleted file mode 100644 index efc568748..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouter.java +++ /dev/null @@ -1,238 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import java.util.ArrayList; -import java.util.List; -import javax.inject.Inject; -import javax.inject.Named; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; - - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finagle.Service; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.util.earlybird.EarlybirdResponseUtil; -import com.twitter.search.earlybird.config.ServingRange; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird.thrift.EarlybirdResponseCode; -import com.twitter.search.earlybird.thrift.ThriftSearchResults; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; -import com.twitter.search.earlybird_root.mergers.EarlybirdResponseMerger; -import com.twitter.search.earlybird_root.mergers.SuperRootResponseMerger; -import com.twitter.search.earlybird_root.mergers.TermStatisticsResponseMerger; -import com.twitter.search.earlybird_root.mergers.TierResponseAccumulator; -import com.twitter.util.Function; -import com.twitter.util.Future; - -import static com.twitter.search.common.util.earlybird.TermStatisticsUtil.determineBinSize; - -/** - * For TermStats traffic SuperRoot hits both realtime and archive in parallel, and then merges 
- * the results. - */ -public class TermStatsRequestRouter extends RequestRouter { - private static final Logger LOG = LoggerFactory.getLogger(TermStatsRequestRouter.class); - - private static final String SUPERROOT_SKIP_FULL_ARCHIVE_CLUSTER_FOR_TERM_STATS_REQUESTS = - "superroot_skip_full_archive_cluster_for_term_stats_requests"; - - private final Service realtimeService; - private final Service fullArchiveService; - - private final SearchDecider decider; - - private final ServingRangeProvider realtimeServingRangeProvider; - - @Inject - public TermStatsRequestRouter( - @Named(InjectionNames.REALTIME) - Service realtime, - @Named(TermStatsRequestRouterModule.REALTIME_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter realtimeTimeRangeFilter, - @Named(InjectionNames.FULL_ARCHIVE) - Service fullArchive, - @Named(TermStatsRequestRouterModule.FULL_ARCHIVE_TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter fullArchiveTimeRangeFilter, - SearchDecider decider) { - LOG.info("Instantiating a TermStatsRequestRouter"); - - this.realtimeService = realtimeTimeRangeFilter - .andThen(realtime); - - this.fullArchiveService = fullArchiveTimeRangeFilter - .andThen(fullArchive); - - this.decider = decider; - this.realtimeServingRangeProvider = realtimeTimeRangeFilter.getServingRangeProvider(); - } - - /** - * Hit both realtime and full-archive clusters then merges term stat request. - */ - @Override - public Future route(EarlybirdRequestContext requestContext) { - List requestResponses = new ArrayList<>(); - - Future realtimeResponseFuture = realtimeService.apply(requestContext); - this.saveRequestResponse(requestResponses, "realtime", requestContext, realtimeResponseFuture); - - Future archiveResponseFuture = - requestContext.getRequest().isGetOlderResults() - && !decider.isAvailable(SUPERROOT_SKIP_FULL_ARCHIVE_CLUSTER_FOR_TERM_STATS_REQUESTS) - ? fullArchiveService.apply(requestContext) - : Future.value(emptyResponse()); - this.saveRequestResponse(requestResponses, "archive", requestContext, archiveResponseFuture); - - Future mergedResponse = - merge(realtimeResponseFuture, archiveResponseFuture, requestContext); - - return this.maybeAttachSentRequestsToDebugInfo( - requestResponses, - requestContext, - mergedResponse - ); - } - - /** - * Merge responses from realtime and full archive clusters. 
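To make the fan-out-then-merge flow in `TermStatsRequestRouter` easier to follow, here is a minimal, self-contained sketch of the same pattern using `com.twitter.util.Future`: two cluster calls are issued in parallel and combined once both complete. The `ParallelMergeSketch` class, the `fetchRealtime`/`fetchArchive` helpers, and the `Integer` payloads are hypothetical stand-ins; the deleted router works on `EarlybirdResponse` futures and delegates the actual merging to `TermStatisticsResponseMerger`.

```java
import java.util.List;

import com.google.common.collect.Lists;

import com.twitter.util.Function;
import com.twitter.util.Future;
import com.twitter.util.Futures;

/** Hypothetical sketch: issue two cluster calls in parallel and merge the results. */
final class ParallelMergeSketch {
  // Stand-ins for the realtime and full-archive Finagle service calls.
  static Future<Integer> fetchRealtime() {
    return Future.value(10);
  }

  static Future<Integer> fetchArchive() {
    return Future.value(32);
  }

  static Future<Integer> fetchAndMerge() {
    Future<Integer> realtime = fetchRealtime(); // both futures are created up front,
    Future<Integer> archive = fetchArchive();   // so the two calls run concurrently
    return Futures.collect(Lists.newArrayList(realtime, archive))
        .map(new Function<List<Integer>, Integer>() {
          @Override
          public Integer apply(List<Integer> responses) {
            // "Merging" here is just a sum; the real router inspects response codes
            // and hands successful responses to a TermStatisticsResponseMerger.
            return responses.get(0) + responses.get(1);
          }
        });
  }
}
```

Note that `Futures.collect` fails fast if either future fails, whereas the deleted `merge()` inspects each response code and can still return a partial (realtime-only) result.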
- */ - private Future merge( - final Future realtimeResponseFuture, - final Future archiveResponseFuture, - final EarlybirdRequestContext requestContext) { - - return realtimeResponseFuture.flatMap( - new Function>() { - @Override - public Future apply(final EarlybirdResponse realtimeResponse) { - if (!EarlybirdResponseUtil.isSuccessfulResponse(realtimeResponse)) { - return Future.value(realtimeResponse); - } - - return archiveResponseFuture.flatMap( - new Function>() { - @Override - public Future apply(EarlybirdResponse archiveResponse) { - if (!EarlybirdResponseUtil.isSuccessfulResponse(archiveResponse)) { - return Future.value( - mergeWithUnsuccessfulArchiveResponse( - requestContext, realtimeResponse, archiveResponse)); - } - - List> responses = - ImmutableList.>builder() - .add(realtimeResponseFuture) - .add(archiveResponseFuture) - .build(); - - EarlybirdResponseMerger merger = new TermStatisticsResponseMerger( - requestContext, responses, new TierResponseAccumulator()); - - return merger.merge().map(new Function() { - @Override - public EarlybirdResponse apply(EarlybirdResponse mergedResponse) { - if (requestContext.getRequest().getDebugMode() > 0) { - mergedResponse.setDebugString( - SuperRootResponseMerger.mergeClusterDebugStrings( - realtimeResponse, null, archiveResponse)); - } - return mergedResponse; - } - }); - } - }); - } - }); - } - - private EarlybirdResponse mergeWithUnsuccessfulArchiveResponse( - EarlybirdRequestContext requestContext, - EarlybirdResponse realtimeResponse, - EarlybirdResponse archiveResponse) { - // If the realtime cluster was skipped, and the full archive returned an error - // response, return the full archive response. - if (isTierSkippedResponse(realtimeResponse)) { - return archiveResponse; - } - - // If the realtime response has results and the full archive cluster returned an error - // response, we return the realtime response. If the client needs more results, it can paginate, - // and on the next request it will get the error response from the full archive cluster. - if (realtimeResponse.isSetTermStatisticsResults() - && !realtimeResponse.getTermStatisticsResults().getTermResults().isEmpty()) { - realtimeResponse.setDebugString( - "Full archive cluster returned an error response (" - + archiveResponse.getResponseCode() + "). " - + SuperRootResponseMerger.mergeClusterDebugStrings( - realtimeResponse, null, archiveResponse)); - return updateMinCompleteBinId(requestContext, realtimeResponse); - } - - // If the realtime response has no results, and the full archive cluster returned an error - // response, return a PERSISTENT_ERROR response, and merge the debug strings from the two - // responses. - EarlybirdResponse mergedResponse = - new EarlybirdResponse(EarlybirdResponseCode.PERSISTENT_ERROR, 0); - mergedResponse.setDebugString( - "Full archive cluster returned an error response (" - + archiveResponse.getResponseCode() - + "), and the realtime response had no results. " - + SuperRootResponseMerger.mergeClusterDebugStrings( - realtimeResponse, null, archiveResponse)); - return mergedResponse; - } - - /** - * If we get a completed realtime response but a failed archive response, the minCompleteBinId we - * return will be incorrect -- the realtime minCompleteBinId is assumed to be the oldest bin - * returned, rather than the bin that intersects the realtime serving boundary. In these cases, we - * need to move the minCompleteBinId forward. - *

    - * Note that we cannot always set the minCompleteBinId for the realtime results to the bin - * intersecting the realtime serving boundary: somewhere in the guts of the merging logic, we set - * the minCompleteBinId of the merged response to the max of the minCompleteBinIds of the original - * responses. :-( - */ - private EarlybirdResponse updateMinCompleteBinId( - EarlybirdRequestContext requestContext, EarlybirdResponse realtimeResponse) { - Preconditions.checkArgument( - realtimeResponse.getTermStatisticsResults().isSetMinCompleteBinId()); - int roundedServingRange = roundServingRangeUpToNearestBinId(requestContext, realtimeResponse); - int minCompleteBinId = Math.max( - roundedServingRange, - realtimeResponse.getTermStatisticsResults().getMinCompleteBinId()); - realtimeResponse.getTermStatisticsResults().setMinCompleteBinId(minCompleteBinId); - return realtimeResponse; - } - - private static EarlybirdResponse emptyResponse() { - return new EarlybirdResponse(EarlybirdResponseCode.SUCCESS, 0) - .setSearchResults(new ThriftSearchResults() - .setResults(Lists.newArrayList())) - .setDebugString("Full archive cluster not requested or not available."); - } - - private static boolean isTierSkippedResponse(EarlybirdResponse response) { - return response.getResponseCode() == EarlybirdResponseCode.TIER_SKIPPED; - } - - /** - * Given a termstats request/response pair, round the serving range for the appropriate cluster up - * to the nearest binId at the appropriate resolution. - */ - private int roundServingRangeUpToNearestBinId( - EarlybirdRequestContext request, EarlybirdResponse response) { - ServingRange servingRange = realtimeServingRangeProvider.getServingRange( - request, request.useOverrideTierConfig()); - long servingRangeStartSecs = servingRange.getServingRangeSinceTimeSecondsFromEpoch(); - int binSize = determineBinSize(response.getTermStatisticsResults().getHistogramSettings()); - return (int) Math.ceil((double) servingRangeStartSecs / binSize); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouterModule.java b/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouterModule.java deleted file mode 100644 index 6b11f5d43..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/TermStatsRequestRouterModule.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.FullArchiveServingRangeProvider; -import com.twitter.search.earlybird_root.filters.RealtimeServingRangeProvider; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; - -public class TermStatsRequestRouterModule extends TwitterModule { - public static final String FULL_ARCHIVE_TIME_RANGE_FILTER = - "term_stats_full_archive_time_range_filter"; - public static final String REALTIME_TIME_RANGE_FILTER = - "term_stats_realtime_time_range_filter"; - - private static final String SUPERROOT_TERM_STATS_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_term_stats_serving_range_boundary_hours_ago"; - - private ServingRangeProvider getFullArchiveTimeRangeProvider(final SearchDecider decider) - throws Exception { - return new FullArchiveServingRangeProvider( - decider, 
SUPERROOT_TERM_STATS_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - private ServingRangeProvider getRealtimeTimeRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider( - decider, SUPERROOT_TERM_STATS_SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - /** - * For term stats full archive cluster spans from 21 March to 2006 to 6 days ago from current time - */ - @Provides - @Singleton - @Named(FULL_ARCHIVE_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesFullArchiveTimeRangeFilter(final SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithQueryRewriter( - getFullArchiveTimeRangeProvider(decider), decider); - } - - /** - * For term stats realtime cluster spans from 6 days ago from current time to a far away date - * into the future - */ - @Provides - @Singleton - @Named(REALTIME_TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesRealtimeTimeRangeFilter(final SearchDecider decider) - throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithQueryRewriter( - getRealtimeTimeRangeProvider(decider), decider); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouter.java b/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouter.java deleted file mode 100644 index 20c2411b1..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouter.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Inject; -import javax.inject.Named; - -import com.twitter.finagle.Service; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.search.earlybird_root.common.EarlybirdRequestContext; -import com.twitter.search.earlybird_root.common.InjectionNames; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.util.Future; - -/** - * For TopTweets traffic SuperRoot forwards all traffic to the realtime cluster. - */ -public class TopTweetsRequestRouter extends RequestRouter { - - private final Service realtime; - - /** Creates a new TopTweetsRequestRouter instance to be used by the SuperRoot. 
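Stepping back to `TermStatsRequestRouterModule` above: both `@Provides` methods wire the same decider key (`superroot_term_stats_serving_range_boundary_hours_ago`) into their serving-range providers, so a single knob moves the boundary between the realtime cluster (roughly the last 6 days) and the full archive cluster (21 March 2006 up to that boundary). A rough sketch of the boundary arithmetic follows; the class, method names, and the example value of 144 hours are hypothetical illustrations, not taken from the deleted code.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the realtime/archive boundary implied by the decider key. */
final class ServingBoundarySketch {
  static long boundarySecondsSinceEpoch(long nowMillis, int boundaryHoursAgo) {
    // e.g. boundaryHoursAgo = 144 would correspond to the "about 6 days" split described
    // in the provider javadocs; the real value comes from the
    // superroot_term_stats_serving_range_boundary_hours_ago decider.
    return TimeUnit.MILLISECONDS.toSeconds(nowMillis)
        - TimeUnit.HOURS.toSeconds(boundaryHoursAgo);
  }

  static boolean servedByRealtime(long tweetTimeSecondsSinceEpoch,
                                  long boundarySecondsSinceEpoch) {
    // Tweets newer than the boundary are answered by the realtime cluster;
    // older tweets fall to the full archive cluster.
    return tweetTimeSecondsSinceEpoch >= boundarySecondsSinceEpoch;
  }
}
```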
*/ - @Inject - public TopTweetsRequestRouter( - @Named(InjectionNames.REALTIME) - Service realtime, - @Named(TopTweetsRequestRouterModule.TIME_RANGE_FILTER) - EarlybirdTimeRangeFilter timeRangeFilter) { - - this.realtime = timeRangeFilter.andThen(realtime); - } - - @Override - public Future route(EarlybirdRequestContext requestContext) { - return realtime.apply(requestContext); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouterModule.java b/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouterModule.java deleted file mode 100644 index 03a247afb..000000000 --- a/src/java/com/twitter/search/earlybird_root/routers/TopTweetsRequestRouterModule.java +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.search.earlybird_root.routers; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.earlybird_root.filters.EarlybirdTimeRangeFilter; -import com.twitter.search.earlybird_root.filters.RealtimeServingRangeProvider; -import com.twitter.search.earlybird_root.filters.ServingRangeProvider; - -public class TopTweetsRequestRouterModule extends TwitterModule { - public static final String TIME_RANGE_FILTER = "top_tweets_time_range_filter"; - - public static final String SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY = - "superroot_top_tweets_serving_range_boundary_hours_ago"; - - private ServingRangeProvider getServingRangeProvider(final SearchDecider decider) - throws Exception { - return new RealtimeServingRangeProvider(decider, SERVING_RANGE_BOUNDARY_HOURS_AGO_DECIDER_KEY); - } - - @Provides - @Singleton - @Named(TIME_RANGE_FILTER) - private EarlybirdTimeRangeFilter providesTimeRangeFilter(SearchDecider decider) throws Exception { - return EarlybirdTimeRangeFilter.newTimeRangeFilterWithoutQueryRewriter( - getServingRangeProvider(decider)); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/BUILD b/src/java/com/twitter/search/earlybird_root/validators/BUILD deleted file mode 100644 index 3a39026c1..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/search/common/schema/earlybird", - "src/thrift/com/twitter/search:earlybird-java", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/validators/FacetsResponseValidator.java b/src/java/com/twitter/search/earlybird_root/validators/FacetsResponseValidator.java deleted file mode 100644 index a40c17a42..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/FacetsResponseValidator.java +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class FacetsResponseValidator implements ServiceResponseValidator { - - private final EarlybirdCluster cluster; - - /** - * Validator for facets responses - */ - public FacetsResponseValidator(EarlybirdCluster cluster) { - this.cluster = cluster; - } - - @Override - public Future validate(EarlybirdResponse response) { - if (!response.isSetSearchResults() || !response.getSearchResults().isSetResults()) { - return Future.exception( - new 
IllegalStateException(cluster + " didn't set search results.")); - } - - if (!response.isSetFacetResults()) { - return Future.exception( - new IllegalStateException( - cluster + " facets response does not have the facetResults field set.")); - } - - if (response.getFacetResults().getFacetFields().isEmpty()) { - return Future.exception( - new IllegalStateException( - cluster + " facets response does not have any facet fields set.")); - } - - return Future.value(response); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/PassThroughResponseValidator.java b/src/java/com/twitter/search/earlybird_root/validators/PassThroughResponseValidator.java deleted file mode 100644 index af4de0cec..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/PassThroughResponseValidator.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -/** A no-op ServiceResponseValidator. */ -public class PassThroughResponseValidator implements ServiceResponseValidator { - @Override - public Future validate(EarlybirdResponse response) { - return Future.value(response); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/SearchResultsValidator.java b/src/java/com/twitter/search/earlybird_root/validators/SearchResultsValidator.java deleted file mode 100644 index 39d4f2392..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/SearchResultsValidator.java +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class SearchResultsValidator - implements ServiceResponseValidator { - - private final EarlybirdCluster cluster; - - public SearchResultsValidator(EarlybirdCluster cluster) { - this.cluster = cluster; - } - - @Override - public Future validate(EarlybirdResponse response) { - if (!response.isSetSearchResults() - || !response.getSearchResults().isSetResults()) { - return Future.exception( - new IllegalStateException(cluster + " didn't set search results")); - } else if (!response.getSearchResults().isSetMaxSearchedStatusID()) { - return Future.exception( - new IllegalStateException(cluster + " didn't set max searched status id")); - } else { - boolean isEarlyTerminated = response.isSetEarlyTerminationInfo() - && response.getEarlyTerminationInfo().isEarlyTerminated(); - if (!isEarlyTerminated && !response.getSearchResults().isSetMinSearchedStatusID()) { - return Future.exception( - new IllegalStateException( - cluster + " neither early terminated nor set min searched status id")); - } else { - return Future.value(response); - } - } - } -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/ServiceResponseValidator.java b/src/java/com/twitter/search/earlybird_root/validators/ServiceResponseValidator.java deleted file mode 100644 index b025d6476..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/ServiceResponseValidator.java +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.util.Future; - -public interface ServiceResponseValidator { - /** - * Interface for validating Service responses - */ - Future validate(R response); -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/TermStatsResultsValidator.java 
b/src/java/com/twitter/search/earlybird_root/validators/TermStatsResultsValidator.java deleted file mode 100644 index 01324f3c5..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/TermStatsResultsValidator.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class TermStatsResultsValidator implements ServiceResponseValidator { - private final EarlybirdCluster cluster; - - public TermStatsResultsValidator(EarlybirdCluster cluster) { - this.cluster = cluster; - } - - @Override - public Future validate(EarlybirdResponse response) { - if (!response.isSetTermStatisticsResults() - || !response.getTermStatisticsResults().isSetTermResults()) { - return Future.exception( - new IllegalStateException(cluster + " returned null term statistics results.")); - } - return Future.value(response); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/validators/TopTweetsResultsValidator.java b/src/java/com/twitter/search/earlybird_root/validators/TopTweetsResultsValidator.java deleted file mode 100644 index a0ad8eb89..000000000 --- a/src/java/com/twitter/search/earlybird_root/validators/TopTweetsResultsValidator.java +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.search.earlybird_root.validators; - -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.earlybird.thrift.EarlybirdResponse; -import com.twitter.util.Future; - -public class TopTweetsResultsValidator implements ServiceResponseValidator { - private final EarlybirdCluster cluster; - - public TopTweetsResultsValidator(EarlybirdCluster cluster) { - this.cluster = cluster; - } - - @Override - public Future validate(EarlybirdResponse response) { - if (!response.isSetSearchResults() || !response.getSearchResults().isSetResults()) { - return Future.exception( - new IllegalStateException(cluster + " didn't set search results.")); - } - return Future.value(response); - } -} diff --git a/src/java/com/twitter/search/earlybird_root/visitors/BUILD b/src/java/com/twitter/search/earlybird_root/visitors/BUILD deleted file mode 100644 index d82aaf4c7..000000000 --- a/src/java/com/twitter/search/earlybird_root/visitors/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/queryparser/query:core-query-nodes", - "src/java/com/twitter/search/queryparser/query/search:search-query-nodes", - ], -) diff --git a/src/java/com/twitter/search/earlybird_root/visitors/MultiTermDisjunctionPerPartitionVisitor.java b/src/java/com/twitter/search/earlybird_root/visitors/MultiTermDisjunctionPerPartitionVisitor.java deleted file mode 100644 index 646b46e2c..000000000 --- a/src/java/com/twitter/search/earlybird_root/visitors/MultiTermDisjunctionPerPartitionVisitor.java +++ /dev/null @@ -1,136 +0,0 @@ -package com.twitter.search.earlybird_root.visitors; - -import java.util.Collections; -import java.util.List; -import java.util.stream.Collectors; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; - -import 
com.twitter.search.common.partitioning.base.PartitionDataType; -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.queryparser.query.Conjunction; -import com.twitter.search.queryparser.query.Disjunction; -import com.twitter.search.queryparser.query.Query; -import com.twitter.search.queryparser.query.Query.Occur; -import com.twitter.search.queryparser.query.QueryParserException; -import com.twitter.search.queryparser.query.search.SearchOperator; -import com.twitter.search.queryparser.query.search.SearchQueryTransformer; - -/** - * Truncate user id or id lists in [multi_term_disjunction from_user_id/id] queries. - * Return null if query has incorrect operators or looked at wrong field. - */ -public class MultiTermDisjunctionPerPartitionVisitor extends SearchQueryTransformer { - private final PartitionMappingManager partitionMappingManager; - private final int partitionId; - private final String targetFieldName; - - public static final Conjunction NO_MATCH_CONJUNCTION = - new Conjunction(Occur.MUST_NOT, Collections.emptyList(), Collections.emptyList()); - - public MultiTermDisjunctionPerPartitionVisitor( - PartitionMappingManager partitionMappingManager, - int partitionId) { - this.partitionMappingManager = partitionMappingManager; - this.partitionId = partitionId; - this.targetFieldName = - partitionMappingManager.getPartitionDataType() == PartitionDataType.USER_ID - ? EarlybirdFieldConstants.EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName() - : EarlybirdFieldConstants.EarlybirdFieldConstant.ID_FIELD.getFieldName(); - } - - private boolean isTargetedQuery(Query query) { - if (query instanceof SearchOperator) { - SearchOperator operator = (SearchOperator) query; - return operator.getOperatorType() == SearchOperator.Type.MULTI_TERM_DISJUNCTION - && operator.getOperand().equals(targetFieldName); - } else { - return false; - } - } - - @Override - public Query visit(Conjunction query) throws QueryParserException { - boolean modified = false; - ImmutableList.Builder children = ImmutableList.builder(); - for (Query child : query.getChildren()) { - Query newChild = child.accept(this); - if (newChild != null) { - // For conjunction case, if any child is "multi_term_disjunction from_user_id" and returns - // Conjunction.NO_MATCH_CONJUNCTION, it should be considered same as match no docs. And - // caller should decide how to deal with it. - if (isTargetedQuery(child) && newChild == NO_MATCH_CONJUNCTION) { - return NO_MATCH_CONJUNCTION; - } - if (newChild != Conjunction.EMPTY_CONJUNCTION - && newChild != Disjunction.EMPTY_DISJUNCTION) { - children.add(newChild); - } - } - if (newChild != child) { - modified = true; - } - } - return modified ? query.newBuilder().setChildren(children.build()).build() : query; - } - - @Override - public Query visit(Disjunction disjunction) throws QueryParserException { - boolean modified = false; - ImmutableList.Builder children = ImmutableList.builder(); - for (Query child : disjunction.getChildren()) { - Query newChild = child.accept(this); - if (newChild != null - && newChild != Conjunction.EMPTY_CONJUNCTION - && newChild != Disjunction.EMPTY_DISJUNCTION - && newChild != NO_MATCH_CONJUNCTION) { - children.add(newChild); - } - if (newChild != child) { - modified = true; - } - } - return modified ? 
disjunction.newBuilder().setChildren(children.build()).build() : disjunction; - } - - @Override - public Query visit(SearchOperator operator) throws QueryParserException { - if (isTargetedQuery(operator)) { - List ids = extractIds(operator); - if (ids.size() > 0) { - List operands = Lists.newArrayList(targetFieldName); - for (long id : ids) { - operands.add(String.valueOf(id)); - } - return operator.newBuilder().setOperands(operands).build(); - } else { - // If the [multi_term_disjunction from_user_id] is a negation (i.e., occur == MUST_NOT), - // and there is no user id left, the whole sub query node does not do anything; if it is - // NOT a negation, then sub query matches nothing. - if (operator.getOccur() == Query.Occur.MUST_NOT) { - return Conjunction.EMPTY_CONJUNCTION; - } else { - return NO_MATCH_CONJUNCTION; - } - } - } - return operator; - } - - private List extractIds(SearchOperator operator) throws QueryParserException { - if (EarlybirdFieldConstants.EarlybirdFieldConstant.ID_FIELD - .getFieldName().equals(targetFieldName)) { - return operator.getOperands().subList(1, operator.getNumOperands()).stream() - .map(Long::valueOf) - .filter(id -> partitionMappingManager.getPartitionIdForTweetId(id) == partitionId) - .collect(Collectors.toList()); - } else { - return operator.getOperands().subList(1, operator.getNumOperands()).stream() - .map(Long::valueOf) - .filter(id -> partitionMappingManager.getPartitionIdForUserId(id) == partitionId) - .collect(Collectors.toList()); - } - } -} diff --git a/src/java/com/twitter/search/feature_update_service/BUILD b/src/java/com/twitter/search/feature_update_service/BUILD deleted file mode 100644 index 449f39d9b..000000000 --- a/src/java/com/twitter/search/feature_update_service/BUILD +++ /dev/null @@ -1,86 +0,0 @@ -java_library( - name = "feature_update_service-lib", - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/fasterxml/jackson/core:jackson-annotations", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/javax/inject:javax.inject", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/apache/thrift:libthrift", - "decider/src/main/scala", - "finagle/finagle-core/src/main", - "finagle/finagle-exp/src/main/scala", - "finagle/finagle-http/src/main/scala", - "finagle/finagle-thrift/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "finatra-internal/decider/src/main/scala", - "finatra-internal/diffy/src/main/scala", - "finatra-internal/mtls-thriftmux/src/main/scala", - "finatra/inject/inject-app/src/main/scala", - "finatra/inject/inject-core/src/main/scala", - "finatra/inject/inject-server/src/main/scala", - "finatra/inject/inject-slf4j/src/main/scala", - "finatra/inject/inject-slf4j/src/main/scala/com/twitter/inject", - "finatra/inject/inject-thrift-client/src/main/scala", - "finatra/inject/inject-utils/src/main/scala", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift:controller", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/exceptions", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/filters", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/modules", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/response", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/routing", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - 
"science/search/feature_update_service/resources", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/constants", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/util:platform_stats_exporter", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/feature_update_service/filters", - "src/java/com/twitter/search/feature_update_service/modules", - "src/java/com/twitter/search/feature_update_service/stats", - "src/java/com/twitter/search/feature_update_service/util", - "src/java/com/twitter/search/ingester/model", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/feature_update_service/thrift:thrift-java", - "src/thrift/com/twitter/tweetypie:service-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - "thrift-web-forms/src/main/java/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms/model", - "twitter-server-internal/src/main/scala", - "twitter-server/server/src/main/scala", - "util/util-app/src/main/scala", - "util/util-core:scala", - "util/util-function/src/main/java", - "util/util-lint/src/main/scala", - "util/util-slf4j-api/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "feature_update_service", - basename = "feature_update_service", - main = "com.twitter.search.feature_update_service.FeatureUpdateServiceThriftServerMain", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":feature_update_service-lib", - "3rdparty/jvm/ch/qos/logback:logback-classic", - "loglens/loglens-logback/src/main/scala/com/twitter/loglens/logback", - "twitter-server-internal/src/main/scala", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/FeatureUpdateController.java b/src/java/com/twitter/search/feature_update_service/FeatureUpdateController.java deleted file mode 100644 index 1613bee3c..000000000 --- a/src/java/com/twitter/search/feature_update_service/FeatureUpdateController.java +++ /dev/null @@ -1,245 +0,0 @@ -package com.twitter.search.feature_update_service; - -import java.util.Arrays; -import java.util.Collections; -import java.util.List; -import javax.inject.Inject; -import javax.inject.Named; - -import scala.runtime.BoxedUnit; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Lists; - -import org.apache.kafka.clients.producer.ProducerRecord; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.decider.Decider; -import com.twitter.finagle.mux.ClientDiscardedRequestException; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.inject.annotations.Flag; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import 
com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.feature_update_service.modules.EarlybirdUtilModule; -import com.twitter.search.feature_update_service.modules.FinagleKafkaProducerModule; -import com.twitter.search.feature_update_service.stats.FeatureUpdateStats; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateRequest; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponse; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponseCode; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateService; -import com.twitter.search.feature_update_service.util.FeatureUpdateValidator; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.tweetypie.thriftjava.GetTweetFieldsOptions; -import com.twitter.tweetypie.thriftjava.GetTweetFieldsRequest; -import com.twitter.tweetypie.thriftjava.TweetInclude; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.tweetypie.thriftjava.TweetVisibilityPolicy; -import com.twitter.util.ExecutorServiceFuturePool; -import com.twitter.util.Function; -import com.twitter.util.Future; -import com.twitter.util.Futures; - -import static com.twitter.tweetypie.thriftjava.Tweet._Fields.CORE_DATA; - -public class FeatureUpdateController implements FeatureUpdateService.ServiceIface { - private static final Logger LOG = LoggerFactory.getLogger(FeatureUpdateController.class); - private static final Logger REQUEST_LOG = - LoggerFactory.getLogger("feature_update_service_requests"); - private static final String KAFKA_SEND_COUNT_FORMAT = "kafka_%s_partition_%d_send_count"; - private static final String WRITE_TO_KAFKA_DECIDER_KEY = "write_events_to_kafka_update_events"; - private static final String WRITE_TO_KAFKA_DECIDER_KEY_REALTIME_CG = - "write_events_to_kafka_update_events_realtime_cg"; - - private final SearchRateCounter droppedKafkaUpdateEvents = - SearchRateCounter.export("dropped_kafka_update_events"); - - private final SearchRateCounter droppedKafkaUpdateEventsRealtimeCg = - SearchRateCounter.export("dropped_kafka_update_events_realtime_cg"); - private final Clock clock; - private final Decider decider; - private final BlockingFinagleKafkaProducer kafkaProducer; - private final BlockingFinagleKafkaProducer kafkaProducerRealtimeCg; - - private final List penguinVersions; - private final FeatureUpdateStats stats; - private final String kafkaUpdateEventsTopicName; - private final String kafkaUpdateEventsTopicNameRealtimeCg; - private final ExecutorServiceFuturePool futurePool; - private final TweetService.ServiceIface tweetService; - - @Inject - public FeatureUpdateController( - Clock clock, - Decider decider, - @Named("KafkaProducer") - BlockingFinagleKafkaProducer kafkaProducer, - @Named("KafkaProducerRealtimeCg") - BlockingFinagleKafkaProducer kafkaProducerRealtimeCg, - @Flag(EarlybirdUtilModule.PENGUIN_VERSIONS_FLAG) String penguinVersions, - FeatureUpdateStats stats, - @Flag(FinagleKafkaProducerModule.KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG) - String kafkaUpdateEventsTopicName, - @Flag(FinagleKafkaProducerModule.KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG_REALTIME_CG) - String kafkaUpdateEventsTopicNameRealtimeCg, - ExecutorServiceFuturePool futurePool, - TweetService.ServiceIface tweetService - ) 
{ - this.clock = clock; - this.decider = decider; - this.kafkaProducer = kafkaProducer; - this.kafkaProducerRealtimeCg = kafkaProducerRealtimeCg; - this.penguinVersions = getPenguinVersions(penguinVersions); - this.stats = stats; - this.kafkaUpdateEventsTopicName = kafkaUpdateEventsTopicName; - this.kafkaUpdateEventsTopicNameRealtimeCg = kafkaUpdateEventsTopicNameRealtimeCg; - this.futurePool = futurePool; - this.tweetService = tweetService; - } - - @Override - public Future process(FeatureUpdateRequest featureUpdate) { - long requestStartTimeMillis = clock.nowMillis(); - - // Export overall and per-client request rate stats - final String requestClientId; - if (featureUpdate.getRequestClientId() != null - && !featureUpdate.getRequestClientId().isEmpty()) { - requestClientId = featureUpdate.getRequestClientId(); - } else if (ClientId.current().nonEmpty()) { - requestClientId = ClientId.current().get().name(); - } else { - requestClientId = "unknown"; - } - stats.clientRequest(requestClientId); - REQUEST_LOG.info("{} {}", requestClientId, featureUpdate); - - FeatureUpdateResponse errorResponse = FeatureUpdateValidator.validate(featureUpdate); - if (errorResponse != null) { - stats.clientResponse(requestClientId, errorResponse.getResponseCode()); - LOG.warn("client error: clientID {} - reason: {}", - requestClientId, errorResponse.getDetailMessage()); - return Future.value(errorResponse); - } - - ThriftIndexingEvent event = featureUpdate.getEvent(); - return writeToKafka(event, requestStartTimeMillis) - .map(responsesList -> { - stats.clientResponse(requestClientId, FeatureUpdateResponseCode.SUCCESS); - // only when both Realtime & RealtimeCG succeed, then it will return a success flag - return new FeatureUpdateResponse(FeatureUpdateResponseCode.SUCCESS); - }) - .handle(Function.func(throwable -> { - FeatureUpdateResponseCode responseCode; - // if either Realtime or RealtimeCG throws an exception, it will return a failure - if (throwable instanceof ClientDiscardedRequestException) { - responseCode = FeatureUpdateResponseCode.CLIENT_CANCEL_ERROR; - LOG.info("ClientDiscardedRequestException received from client: " + requestClientId, - throwable); - } else { - responseCode = FeatureUpdateResponseCode.TRANSIENT_ERROR; - LOG.error("Error occurred while writing to output stream: " - + kafkaUpdateEventsTopicName + ", " - + kafkaUpdateEventsTopicNameRealtimeCg, throwable); - } - stats.clientResponse(requestClientId, responseCode); - return new FeatureUpdateResponse(responseCode) - .setDetailMessage(throwable.getMessage()); - })); - } - - /** - * In writeToKafka(), we use Futures.collect() to aggregate results for two RPC calls - * Futures.collect() means that if either one of the Future fails then it will return an Exception - * only when both Realtime & RealtimeCG succeed, then it will return a success flag - * The FeatureUpdateResponse is more like an ACK message, and the upstream (feature update ingester) - * will not be affected much even if it failed (as long as the kafka message is written) - */ - private Future> writeToKafka(ThriftIndexingEvent event, - long requestStartTimeMillis) { - return Futures.collect(Lists.newArrayList( - writeToKafkaInternal(event, WRITE_TO_KAFKA_DECIDER_KEY, droppedKafkaUpdateEvents, - kafkaUpdateEventsTopicName, -1, kafkaProducer), - Futures.flatten(getUserId(event.getUid()).map( - userId -> writeToKafkaInternal(event, WRITE_TO_KAFKA_DECIDER_KEY_REALTIME_CG, - droppedKafkaUpdateEventsRealtimeCg, - kafkaUpdateEventsTopicNameRealtimeCg, userId, 
kafkaProducerRealtimeCg))))); - - } - - private Future writeToKafkaInternal(ThriftIndexingEvent event, String deciderKey, - SearchRateCounter droppedStats, String topicName, long userId, - BlockingFinagleKafkaProducer producer) { - if (!DeciderUtil.isAvailableForRandomRecipient(decider, deciderKey)) { - droppedStats.increment(); - return Future.Unit(); - } - - ProducerRecord producerRecord = new ProducerRecord<>( - topicName, - convertToThriftVersionedEvents(userId, event)); - - try { - return Futures.flatten(futurePool.apply(() -> - producer.send(producerRecord) - .map(record -> { - SearchCounter.export(String.format( - KAFKA_SEND_COUNT_FORMAT, record.topic(), record.partition())).increment(); - return BoxedUnit.UNIT; - }))); - } catch (Exception e) { - return Future.exception(e); - } - } - - private List getPenguinVersions(String penguinVersionsStr) { - String[] tokens = penguinVersionsStr.split("\\s*,\\s*"); - List listOfPenguinVersions = Lists.newArrayListWithCapacity(tokens.length); - for (String token : tokens) { - listOfPenguinVersions.add(PenguinVersion.valueOf(token.toUpperCase())); - } - LOG.info(String.format("Using Penguin Versions: %s", listOfPenguinVersions)); - return listOfPenguinVersions; - } - - private Future getUserId(long tweetId) { - TweetInclude tweetInclude = new TweetInclude(); - tweetInclude.setTweetFieldId(CORE_DATA.getThriftFieldId()); - GetTweetFieldsOptions getTweetFieldsOptions = new GetTweetFieldsOptions().setTweet_includes( - Collections.singleton(tweetInclude)).setVisibilityPolicy( - TweetVisibilityPolicy.NO_FILTERING); - GetTweetFieldsRequest getTweetFieldsRequest = new GetTweetFieldsRequest().setTweetIds( - Arrays.asList(tweetId)).setOptions(getTweetFieldsOptions); - try { - return tweetService.get_tweet_fields(getTweetFieldsRequest).map( - tweetFieldsResults -> tweetFieldsResults.get( - 0).tweetResult.getFound().tweet.core_data.user_id); - } catch (Exception e) { - return Future.exception(e); - } - } - - private ThriftVersionedEvents convertToThriftVersionedEvents( - long userId, ThriftIndexingEvent event) { - ThriftIndexingEvent thriftIndexingEvent = event.deepCopy() - .setEventType(ThriftIndexingEventType.PARTIAL_UPDATE); - - ImmutableMap.Builder versionedEventsBuilder = - new ImmutableMap.Builder<>(); - for (PenguinVersion penguinVersion : penguinVersions) { - versionedEventsBuilder.put(penguinVersion.getByteValue(), thriftIndexingEvent); - } - - IngesterThriftVersionedEvents thriftVersionedEvents = - new IngesterThriftVersionedEvents(userId, versionedEventsBuilder.build()); - thriftVersionedEvents.setId(thriftIndexingEvent.getUid()); - return thriftVersionedEvents; - } -} diff --git a/src/java/com/twitter/search/feature_update_service/FeatureUpdateResponseClassifier.java b/src/java/com/twitter/search/feature_update_service/FeatureUpdateResponseClassifier.java deleted file mode 100644 index c63e81f46..000000000 --- a/src/java/com/twitter/search/feature_update_service/FeatureUpdateResponseClassifier.java +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.search.feature_update_service; - -import scala.runtime.AbstractPartialFunction; - -import com.twitter.finagle.service.ReqRep; -import com.twitter.finagle.service.ResponseClass; -import com.twitter.finagle.service.ResponseClassifier; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponse; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponseCode; -import com.twitter.util.Try; - -public class FeatureUpdateResponseClassifier - extends 
AbstractPartialFunction { - @Override - public boolean isDefinedAt(ReqRep tuple) { - return true; - } - - @Override - public ResponseClass apply(ReqRep reqRep) { - Try finagleResponse = reqRep.response(); - if (finagleResponse.isThrow()) { - return ResponseClassifier.Default().apply(reqRep); - } - FeatureUpdateResponse response = (FeatureUpdateResponse) finagleResponse.apply(); - FeatureUpdateResponseCode responseCode = response.getResponseCode(); - switch (responseCode) { - case TRANSIENT_ERROR: - case SERVER_TIMEOUT_ERROR: - return ResponseClass.RetryableFailure(); - case PERSISTENT_ERROR: - return ResponseClass.NonRetryableFailure(); - // Client cancellations don't necessarily mean failures on our end. The client decided to - // cancel the request (for example we timed out, so they sent a duplicate request etc.), - // so let's treat them as successes. - case CLIENT_CANCEL_ERROR: - default: - // The other response codes are client errors, and success, and in those cases the server - // behaved correctly, so we classify it as a success. - return ResponseClass.Success(); - } - } -} diff --git a/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServer.java b/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServer.java deleted file mode 100644 index 7f2730560..000000000 --- a/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServer.java +++ /dev/null @@ -1,149 +0,0 @@ -package com.twitter.search.feature_update_service; - -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collection; -import java.util.List; -import java.util.concurrent.TimeUnit; - -import com.google.common.base.Preconditions; -import com.google.inject.Module; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.finagle.Filter; -import com.twitter.finagle.Service; -import com.twitter.finagle.ThriftMux; -import com.twitter.finatra.annotations.DarkTrafficFilterType; -import com.twitter.finatra.decider.modules.DeciderModule$; -import com.twitter.finatra.mtls.thriftmux.modules.MtlsThriftWebFormsModule; -import com.twitter.finatra.mtls.thriftmux.AbstractMtlsThriftServer; -import com.twitter.finatra.thrift.filters.AccessLoggingFilter; -import com.twitter.finatra.thrift.filters.LoggingMDCFilter; -import com.twitter.finatra.thrift.filters.StatsFilter; -import com.twitter.finatra.thrift.filters.ThriftMDCFilter; -import com.twitter.finatra.thrift.filters.TraceIdMDCFilter; -import com.twitter.finatra.thrift.routing.JavaThriftRouter; -import com.twitter.inject.thrift.modules.ThriftClientIdModule$; -import com.twitter.search.common.constants.SearchThriftWebFormsAccess; -import com.twitter.search.common.metrics.BuildInfoStats; -import com.twitter.search.common.util.PlatformStatsExporter; -import com.twitter.search.feature_update_service.filters.ClientIdWhitelistFilter; -import com.twitter.search.feature_update_service.modules.ClientIdWhitelistModule; -import com.twitter.search.feature_update_service.modules.EarlybirdUtilModule; -import com.twitter.search.feature_update_service.modules.FeatureUpdateServiceDiffyModule; -import com.twitter.search.feature_update_service.modules.FinagleKafkaProducerModule; -import com.twitter.search.feature_update_service.modules.FuturePoolModule; -import com.twitter.search.feature_update_service.modules.TweetypieModule; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateService; -import 
com.twitter.thriftwebforms.MethodOptionsAccessConfig; -import com.twitter.util.ExecutorServiceFuturePool; - -public class FeatureUpdateServiceThriftServer extends AbstractMtlsThriftServer { - private static final Logger LOG = - LoggerFactory.getLogger(FeatureUpdateServiceThriftServer.class); - - // Ideally we would not have to access the "environment" flag here and we could instead pass - // a flag to the ThriftWebFormsModule that would either enable or disable thrift web forms. - // However, it is not simple to create our own TwitterModule that both extends the - // ThriftWebFormsModule and consumes an injected flag. - private Flag envFlag = flag().create("environment", - "", - "Environment for service (prod, staging, staging1, devel)", - Flaggable.ofString()); - - FeatureUpdateServiceThriftServer(String[] args) { - BuildInfoStats.export(); - PlatformStatsExporter.exportPlatformStats(); - - flag().parseArgs(args, true); - } - - @Override - @SuppressWarnings("unchecked") - public Collection javaModules() { - List modules = new ArrayList<>(); - modules.addAll(Arrays.asList( - ThriftClientIdModule$.MODULE$, - DeciderModule$.MODULE$, - new ClientIdWhitelistModule(), - new FinagleKafkaProducerModule(), - new EarlybirdUtilModule(), - new FuturePoolModule(), - new FeatureUpdateServiceDiffyModule(), - new TweetypieModule())); - - // Only add the Thrift Web Forms module for non-prod services because we should - // not allow write access to production data through Thrift Web Forms. - String environment = envFlag.apply(); - if ("prod".equals(environment)) { - LOG.info("Not including Thrift Web Forms because the environment is prod"); - } else { - LOG.info("Including Thrift Web Forms because the environment is " + environment); - modules.add( - MtlsThriftWebFormsModule.create( - this, - FeatureUpdateService.ServiceIface.class, - MethodOptionsAccessConfig.byLdapGroup(SearchThriftWebFormsAccess.WRITE_LDAP_GROUP) - ) - ); - } - - return modules; - } - - @Override - public void configureThrift(JavaThriftRouter router) { - router - // Initialize Mapped Diagnostic Context (MDC) for logging - // (see https://logback.qos.ch/manual/mdc.html) - .filter(LoggingMDCFilter.class) - // Inject trace ID in MDC for logging - .filter(TraceIdMDCFilter.class) - // Inject request method and client ID in MDC for logging - .filter(ThriftMDCFilter.class) - // Log client access - .filter(AccessLoggingFilter.class) - // Export basic service stats - .filter(StatsFilter.class) - .filter(ClientIdWhitelistFilter.class) - .add(FeatureUpdateController.class); - } - - @Override - public Service configureService(Service service) { - // Add the DarkTrafficFilter in "front" of the service being served. - return injector() - .instance(Filter.TypeAgnostic.class, DarkTrafficFilterType.class) - .andThen(service); - } - - @Override - public ThriftMux.Server configureThriftServer(ThriftMux.Server server) { - // This cast looks redundant, but it is required for pants to compile this file. 
- return (ThriftMux.Server) server.withResponseClassifier(new FeatureUpdateResponseClassifier()); - } - - @Override - public void postWarmup() { - super.postWarmup(); - - ExecutorServiceFuturePool futurePool = injector().instance(ExecutorServiceFuturePool.class); - Preconditions.checkNotNull(futurePool); - - onExit(() -> { - try { - futurePool.executor().shutdownNow(); - - futurePool.executor().awaitTermination(10L, TimeUnit.SECONDS); - } catch (InterruptedException e) { - LOG.error("Interrupted while awaiting future pool termination", e); - } - - return null; - }); - } -} diff --git a/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServerMain.java b/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServerMain.java deleted file mode 100644 index d19c102a9..000000000 --- a/src/java/com/twitter/search/feature_update_service/FeatureUpdateServiceThriftServerMain.java +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.search.feature_update_service; - -final class FeatureUpdateServiceThriftServerMain { - private FeatureUpdateServiceThriftServerMain() { - // Private constructor to satisfy checkstyle error: - // "Utility classes should not have a public or default constructor)." - } - - public static void main(String[] args) { - new FeatureUpdateServiceThriftServer(args).main(args); - } -} diff --git a/src/java/com/twitter/search/feature_update_service/README.md b/src/java/com/twitter/search/feature_update_service/README.md deleted file mode 100644 index ed28acbc8..000000000 --- a/src/java/com/twitter/search/feature_update_service/README.md +++ /dev/null @@ -1,6 +0,0 @@ -## Feature Update Service -Feature update service is a service that sends tweet feature updates e.g number of retweets, replies and favorites to Earlybird. Earlybird then indexes and uses these features to rank in-network Home Timeline tweets. 
- - - - diff --git a/src/java/com/twitter/search/feature_update_service/filters/BUILD b/src/java/com/twitter/search/feature_update_service/filters/BUILD deleted file mode 100644 index 267acdcff..000000000 --- a/src/java/com/twitter/search/feature_update_service/filters/BUILD +++ /dev/null @@ -1,22 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/inject:guice", - "decider/src/main/scala", - "finatra-internal/thrift/src/main/thrift:thrift-java", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift:controller", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/exceptions", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/filters", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/modules", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/response", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/routing", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/feature_update_service/whitelist", - "src/thrift/com/twitter/search/feature_update_service/thrift:thrift-java", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/filters/ClientIdWhitelistFilter.java b/src/java/com/twitter/search/feature_update_service/filters/ClientIdWhitelistFilter.java deleted file mode 100644 index 077c45067..000000000 --- a/src/java/com/twitter/search/feature_update_service/filters/ClientIdWhitelistFilter.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.feature_update_service.filters; - -import com.google.inject.Inject; -import com.google.inject.Singleton; - -import com.twitter.finagle.Service; -import com.twitter.finatra.thrift.AbstractThriftFilter; -import com.twitter.finatra.thrift.ThriftRequest; -import com.twitter.inject.annotations.Flag; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponse; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponseCode; -import com.twitter.search.feature_update_service.whitelist.ClientIdWhitelist; -import com.twitter.util.Future; - -@Singleton -public class ClientIdWhitelistFilter extends AbstractThriftFilter { - private final boolean enabled; - private final ClientIdWhitelist whitelist; - - private final SearchRateCounter unknownClientIdStat = - SearchRateCounter.export("unknown_client_id"); - private final SearchRateCounter noClientIdStat = - SearchRateCounter.export("no_client_id"); - - @Inject - public ClientIdWhitelistFilter( - ClientIdWhitelist whitelist, - @Flag("client.whitelist.enable") Boolean enabled - ) { - this.whitelist = whitelist; - this.enabled = enabled; - } - - @Override - @SuppressWarnings("unchecked") - public Future apply(ThriftRequest request, Service, R> svc) { - if (!enabled) { - return svc.apply(request); - } - if (request.clientId().isEmpty()) { - noClientIdStat.increment(); - return (Future) Future.value( - new FeatureUpdateResponse(FeatureUpdateResponseCode.MISSING_CLIENT_ERROR) - .setDetailMessage("finagle clientId is required in request")); - - } else if (!whitelist.isClientAllowed(request.clientId().get())) { - // It's safe to use get() in the above condition because - // clientId was already checked for emptiness - 
unknownClientIdStat.increment(); - return (Future) Future.value( - new FeatureUpdateResponse(FeatureUpdateResponseCode.UNKNOWN_CLIENT_ERROR) - .setDetailMessage(String.format( - "request contains unknown finagle clientId: %s", request.clientId().toString()))); - } else { - return svc.apply(request); - } - } -} - diff --git a/src/java/com/twitter/search/feature_update_service/modules/BUILD b/src/java/com/twitter/search/feature_update_service/modules/BUILD deleted file mode 100644 index f7ee145be..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/BUILD +++ /dev/null @@ -1,48 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/yaml:snakeyaml", - "decider/src/main/scala", - "finagle/finagle-core/src/main", - "finagle/finagle-exp/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "finagle/finagle-zipkin-core/src/main/scala", - "finagle/finagle-zipkin-scribe/src/main/scala", - "finatra-internal/mtls-thriftmux/src/main/scala", - "finatra/inject/inject-app/src/main/java/com/twitter/inject/annotations", - "finatra/inject/inject-core/src/main/scala", - "finatra/inject/inject-modules/src/main/scala", - "finatra/inject/inject-modules/src/main/scala/com/twitter/inject/modules", - "finatra/inject/inject-thrift-client/src/main/scala", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift:controller", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/exceptions", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/filters", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/modules", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/response", - "finatra/thrift/src/main/scala/com/twitter/finatra/thrift/routing", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/util/io/kafka", - "src/java/com/twitter/search/common/util/io/periodic", - "src/java/com/twitter/search/feature_update_service/filters", - "src/java/com/twitter/search/feature_update_service/stats", - "src/java/com/twitter/search/feature_update_service/whitelist", - "src/java/com/twitter/spam/finagle", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/feature_update_service/thrift:thrift-java", - "src/thrift/com/twitter/tweetypie:service-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - "util/util-core/src/main/java", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/modules/ClientIdWhitelistModule.java b/src/java/com/twitter/search/feature_update_service/modules/ClientIdWhitelistModule.java deleted file mode 100644 index 705de435d..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/ClientIdWhitelistModule.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import com.google.inject.Provides; -import com.google.inject.Singleton; - -import com.twitter.app.Flaggable; -import com.twitter.inject.TwitterModule; -import 
com.twitter.inject.annotations.Flag; - -import com.twitter.search.feature_update_service.whitelist.ClientIdWhitelist; - -/** - * Provides a ClientIdWhitelist, which periodically loads the - * Feature Update Service client whitelist from ConfigBus - */ -public class ClientIdWhitelistModule extends TwitterModule { - public ClientIdWhitelistModule() { - flag("client.whitelist.path", "", - "Path to client id white list.", Flaggable.ofString()); - flag("client.whitelist.enable", true, - "Enable client whitelist for production.", Flaggable.ofBoolean()); - } - - @Provides - @Singleton - public ClientIdWhitelist provideClientWhitelist( - @Flag("client.whitelist.path") String clientIdWhiteListPath) throws Exception { - return ClientIdWhitelist.initWhitelist(clientIdWhiteListPath); - } - } diff --git a/src/java/com/twitter/search/feature_update_service/modules/EarlybirdUtilModule.java b/src/java/com/twitter/search/feature_update_service/modules/EarlybirdUtilModule.java deleted file mode 100644 index 1f5bc495f..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/EarlybirdUtilModule.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import com.twitter.app.Flaggable; -import com.twitter.inject.TwitterModule; - -public class EarlybirdUtilModule extends TwitterModule { - public static final String PENGUIN_VERSIONS_FLAG = "penguin.versions"; - - public EarlybirdUtilModule() { - flag(PENGUIN_VERSIONS_FLAG, "penguin_6", - "Comma-separated list of supported Penguin versions.", Flaggable.ofString()); - } -} diff --git a/src/java/com/twitter/search/feature_update_service/modules/FeatureUpdateServiceDiffyModule.java b/src/java/com/twitter/search/feature_update_service/modules/FeatureUpdateServiceDiffyModule.java deleted file mode 100644 index d38665624..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/FeatureUpdateServiceDiffyModule.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import com.twitter.decider.Decider; -import com.twitter.inject.Injector; -import com.twitter.finatra.mtls.thriftmux.modules.MtlsJavaDarkTrafficFilterModule; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.util.Function; - - -/** - * Provide a filter that sends dark traffic to diffy, if the diffy.dest command-line parameter - * is non-empty. If diffy.dest is empty, just provide a no-op filter. 
- */ -public class FeatureUpdateServiceDiffyModule extends MtlsJavaDarkTrafficFilterModule { - @Override - public String destFlagName() { - return "diffy.dest"; - } - - @Override - public String defaultClientId() { - return "feature_update_service.origin"; - } - - @Override - public Function enableSampling(Injector injector) { - Decider decider = injector.instance(Decider.class); - return new Function() { - @Override - public Object apply(byte[] v1) { - return DeciderUtil.isAvailableForRandomRecipient(decider, "dark_traffic_filter"); - } - }; - } -} diff --git a/src/java/com/twitter/search/feature_update_service/modules/FinagleKafkaProducerModule.java b/src/java/com/twitter/search/feature_update_service/modules/FinagleKafkaProducerModule.java deleted file mode 100644 index b35177099..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/FinagleKafkaProducerModule.java +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import javax.inject.Named; -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.app.Flaggable; -import com.twitter.common.util.Clock; -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.inject.TwitterModule; -import com.twitter.inject.annotations.Flag; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.util.io.kafka.CompactThriftSerializer; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; -import com.twitter.search.common.util.io.kafka.SearchPartitioner; -import com.twitter.search.common.util.io.kafka.SearchPartitionerRealtimeCg; - -public class FinagleKafkaProducerModule extends TwitterModule { - public static final String KAFKA_DEST_FLAG = "kafka.dest"; - public static final String KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG = - "kafka.topic.name.update_events"; - public static final String KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG_REALTIME_CG = - "kafka.topic.name.update_events_realtime_cg"; - public static final String KAFKA_ENABLE_S2S_AUTH_FLAG = "kafka.enable_s2s_auth"; - - public FinagleKafkaProducerModule() { - flag(KAFKA_DEST_FLAG, "Kafka cluster destination", "", Flaggable.ofString()); - flag(KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG, "", - "Topic name for update events", Flaggable.ofString()); - flag(KAFKA_TOPIC_NAME_UPDATE_EVENTS_FLAG_REALTIME_CG, "", - "Topic name for update events", Flaggable.ofString()); - flag(KAFKA_ENABLE_S2S_AUTH_FLAG, true, "enable s2s authentication configs", - Flaggable.ofBoolean()); - } - - @Provides - @Named("KafkaProducer") - public BlockingFinagleKafkaProducer kafkaProducer( - @Flag(KAFKA_DEST_FLAG) String kafkaDest, - @Flag(KAFKA_ENABLE_S2S_AUTH_FLAG) boolean enableKafkaAuth) { - return FinagleKafkaClientUtils.newFinagleKafkaProducer( - kafkaDest, enableKafkaAuth, new CompactThriftSerializer(), - "search_cluster", SearchPartitioner.class); - } - - @Provides - @Named("KafkaProducerRealtimeCg") - public BlockingFinagleKafkaProducer kafkaProducerRealtimeCg( - @Flag(KAFKA_DEST_FLAG) String kafkaDest, - @Flag(KAFKA_ENABLE_S2S_AUTH_FLAG) boolean enableKafkaAuth) { - return FinagleKafkaClientUtils.newFinagleKafkaProducer( - kafkaDest, enableKafkaAuth, new CompactThriftSerializer(), - "search_cluster", SearchPartitionerRealtimeCg.class); - } - - @Provides - @Singleton - public Clock clock() { - return Clock.SYSTEM_CLOCK; - } -} diff --git a/src/java/com/twitter/search/feature_update_service/modules/FuturePoolModule.java 
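The Kafka module above binds two producers that differ only in their partitioner class (`SearchPartitioner` vs. `SearchPartitionerRealtimeCg`); downstream code selects one by its `@Named` binding. A minimal sketch of such an injection point, assuming standard Guice/Finatra constructor injection (the class and field names below are hypothetical; only the `@Named` keys come from the module above):

```java
import javax.inject.Inject;
import javax.inject.Named;

import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer;

// Hypothetical consumer of the module's bindings; the type parameters are omitted because the
// stripped diff above does not show them.
public class UpdateEventsPublisher {
  private final BlockingFinagleKafkaProducer updateEventsProducer;
  private final BlockingFinagleKafkaProducer realtimeCgProducer;

  @Inject
  public UpdateEventsPublisher(
      @Named("KafkaProducer") BlockingFinagleKafkaProducer updateEventsProducer,
      @Named("KafkaProducerRealtimeCg") BlockingFinagleKafkaProducer realtimeCgProducer) {
    this.updateEventsProducer = updateEventsProducer;
    this.realtimeCgProducer = realtimeCgProducer;
  }
}
```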
b/src/java/com/twitter/search/feature_update_service/modules/FuturePoolModule.java deleted file mode 100644 index 537f67559..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/FuturePoolModule.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import java.util.concurrent.LinkedBlockingQueue; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.inject.Provides; -import com.google.inject.Singleton; - -import com.twitter.inject.TwitterModule; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.feature_update_service.stats.FeatureUpdateStats; -import com.twitter.util.ExecutorServiceFuturePool; -import com.twitter.util.InterruptibleExecutorServiceFuturePool; - -public class FuturePoolModule extends TwitterModule { - /** - * Provide future pool backed by executor service, with bounded thread pool and bounded backing - * queue. - */ - @Provides - @Singleton - public ExecutorServiceFuturePool futurePool() { - // These limits are based on service capacity estimates and testing on staging, - // attempting to give the pool as many resources as possible without overloading anything. - // 100-200 threads is manageable, and the 2000 queue size is based on a conservative upper - // limit that tasks in the queue take 1 MB each, meaning queue maxes out at 2 GB, which should - // be okay given 4 GB RAM with 3 GB reserved heap. - return createFuturePool(100, 200, 2000); - } - - /** - * Create a future pool backed by executor service, with bounded thread pool and bounded backing - * queue. ONLY VISIBILE FOR TESTING; don't invoke outside this class. - */ - @VisibleForTesting - public static ExecutorServiceFuturePool createFuturePool( - int corePoolSize, int maximumPoolSize, int queueCapacity) { - final LinkedBlockingQueue queue = new LinkedBlockingQueue<>(queueCapacity); - - ExecutorServiceFuturePool futurePool = new InterruptibleExecutorServiceFuturePool( - new ThreadPoolExecutor( - corePoolSize, - maximumPoolSize, - 60L, - TimeUnit.SECONDS, - queue)); - - SearchCustomGauge.export(FeatureUpdateStats.PREFIX + "thread_pool_size", - futurePool::poolSize); - SearchCustomGauge.export(FeatureUpdateStats.PREFIX + "work_queue_size", - queue::size); - - return futurePool; - } -} diff --git a/src/java/com/twitter/search/feature_update_service/modules/TweetypieModule.java b/src/java/com/twitter/search/feature_update_service/modules/TweetypieModule.java deleted file mode 100644 index 6fd041cd4..000000000 --- a/src/java/com/twitter/search/feature_update_service/modules/TweetypieModule.java +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.search.feature_update_service.modules; - -import javax.inject.Singleton; - -import com.google.inject.Provides; - -import com.twitter.finagle.Service; -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.mtls.client.MtlsThriftMuxClient; -import com.twitter.finagle.stats.StatsReceiver; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.finagle.zipkin.thrift.ZipkinTracer; -import com.twitter.inject.TwitterModule; -import com.twitter.spam.finagle.FinagleUtil; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.util.Duration; - -public class 
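The future pool above is deliberately bounded on both threads (100 core, 200 max) and queue depth (2000 tasks), based on a roughly 1 MB-per-task estimate against a 3 GB heap. With a plain `ThreadPoolExecutor`, that sizing means a submission beyond the queue capacity is rejected rather than buffered indefinitely. A JDK-only sketch of that behaviour, with the numbers shrunk for illustration:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Shrunk-down, JDK-only illustration of the bounded pool built by createFuturePool(): once core
// and max threads are busy and the queue is full, further submissions are rejected instead of
// buffering without limit.
public final class BoundedPoolDemo {
  public static void main(String[] args) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        2,                               // corePoolSize    (100 in the module above)
        4,                               // maximumPoolSize (200 in the module above)
        60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(10));  // queueCapacity   (2000 in the module above)

    try {
      for (int i = 0; i < 100; i++) {
        executor.execute(() -> {
          try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        });
      }
    } catch (RejectedExecutionException e) {
      // Backpressure: the caller is told the pool is saturated rather than the queue growing forever.
      System.out.println("rejected after pool and queue filled up");
    } finally {
      executor.shutdown();
    }
  }
}
```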
TweetypieModule extends TwitterModule { - @Provides - @Singleton - private ThriftMux.Client providesThriftMuxClient(ServiceIdentifier serviceIdentifier) { - return new MtlsThriftMuxClient(ThriftMux.client()) - .withMutualTls(serviceIdentifier) - .withClientId(new ClientId("feature_update_service.prod")); - } - private static final Duration DEFAULT_CONN_TIMEOUT = Duration.fromSeconds(2); - - private static final Duration TWEET_SERVICE_REQUEST_TIMEOUT = Duration.fromMilliseconds(500); - - private static final int TWEET_SERVICE_RETRIES = 5; - @Provides @Singleton - private TweetService.ServiceIface provideTweetServiceClient( - ThriftMux.Client thriftMux, - StatsReceiver statsReceiver) throws InterruptedException { - // TweetService is TweetService (tweetypie) with different api - // Since TweetService will be primarly used for interacting with - // tweetypie's flexible schema (MH), we will increase request - // timeout and retries but share other settings from TweetService. - @SuppressWarnings("unchecked") - ClientBuilder clientBuilder = FinagleUtil.getClientBuilder() - .name("tweet_service") - .stack(thriftMux) - .tcpConnectTimeout(DEFAULT_CONN_TIMEOUT) - .requestTimeout(TWEET_SERVICE_REQUEST_TIMEOUT) - .retries(TWEET_SERVICE_RETRIES) - .reportTo(statsReceiver) - .tracer(ZipkinTracer.mk(statsReceiver)); - - @SuppressWarnings("unchecked") - final Service finagleClient = - FinagleUtil.createResolvedFinagleClient( - "tweetypie", - "prod", - "tweetypie", - clientBuilder); - - return new TweetService.ServiceToClient(finagleClient); - } -} diff --git a/src/java/com/twitter/search/feature_update_service/stats/BUILD b/src/java/com/twitter/search/feature_update_service/stats/BUILD deleted file mode 100644 index 001463400..000000000 --- a/src/java/com/twitter/search/feature_update_service/stats/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/inject:guice", - "src/java/com/twitter/common/base", - "src/java/com/twitter/search/common/metrics", - "src/thrift/com/twitter/search/feature_update_service/thrift:thrift-java", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/stats/FeatureUpdateStats.java b/src/java/com/twitter/search/feature_update_service/stats/FeatureUpdateStats.java deleted file mode 100644 index aa607e85e..000000000 --- a/src/java/com/twitter/search/feature_update_service/stats/FeatureUpdateStats.java +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.search.feature_update_service.stats; - -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.ConcurrentMap; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponseCode; - -/** Stat tracking for the feature update ingester service. */ -public class FeatureUpdateStats { - public static final String PREFIX = "feature_update_service_"; - - private final SearchRateCounter requestRate = SearchRateCounter.export( - PREFIX + "requests"); - - private ConcurrentMap perClientRequestRate = - new ConcurrentHashMap<>(); - - private ConcurrentMap responseCodeRate = - new ConcurrentHashMap<>(); - - private ConcurrentMap preClientResponseCodeRate = - new ConcurrentHashMap<>(); - - /** - * Record metrics for a single incoming request. - */ - public void clientRequest(String clientID) { - // 1. Track total request rate. It's better to precompute than compute the per client sum at - // query time. 
- requestRate.increment(); - - // 2. Track request rate per client. - incrementPerClientCounter(perClientRequestRate, clientRequestRateKey(clientID)); - } - - /** - * Record metrics for a single response. - */ - public void clientResponse(String clientID, FeatureUpdateResponseCode responseCode) { - String code = responseCode.toString().toLowerCase(); - - // 1. Track rates per response code. - incrementPerClientCounter(responseCodeRate, responseCodeKey(code)); - - // 2. Track rates per client per response code. - incrementPerClientCounter(preClientResponseCodeRate, clientResponseCodeKey(clientID, code)); - } - - /** - * Returns the total number of requests. - */ - public long getRequestRateCount() { - return requestRate.getCount(); - } - - /** - * Returns the total number of requests for the specified client. - */ - public long getClientRequestCount(String clientID) { - String key = clientRequestRateKey(clientID); - if (perClientRequestRate.containsKey(key)) { - return perClientRequestRate.get(key).getCount(); - } - return 0; - } - - /** - * Returns the total number of responses with the specified code. - */ - public long getResponseCodeCount(FeatureUpdateResponseCode responseCode) { - String code = responseCode.toString().toLowerCase(); - String key = responseCodeKey(code); - if (responseCodeRate.containsKey(key)) { - return responseCodeRate.get(key).getCount(); - } - return 0; - } - - /** - * Returns the total number of responses to the specified client with the specified code. - */ - public long getClientResponseCodeCount(String clientID, FeatureUpdateResponseCode responseCode) { - String code = responseCode.toString().toLowerCase(); - String key = clientResponseCodeKey(clientID, code); - if (preClientResponseCodeRate.containsKey(key)) { - return preClientResponseCodeRate.get(key).getCount(); - } - return 0; - } - - private static String clientRequestRateKey(String clientID) { - return String.format(PREFIX + "requests_for_client_id_%s", clientID); - } - - private static String responseCodeKey(String responseCode) { - return String.format(PREFIX + "response_code_%s", responseCode); - } - - private static String clientResponseCodeKey(String clientID, String responseCode) { - return String.format(PREFIX + "response_for_client_id_%s_code_%s", clientID, responseCode); - } - - private void incrementPerClientCounter( - ConcurrentMap rates, - String key - ) { - rates.putIfAbsent(key, SearchRateCounter.export(key)); - rates.get(key).increment(); - } -} diff --git a/src/java/com/twitter/search/feature_update_service/util/BUILD b/src/java/com/twitter/search/feature_update_service/util/BUILD deleted file mode 100644 index 0baf9e722..000000000 --- a/src/java/com/twitter/search/feature_update_service/util/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "src/java/com/twitter/search/common/schema/base", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/feature_update_service/thrift:thrift-java", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/util/FeatureUpdateValidator.java b/src/java/com/twitter/search/feature_update_service/util/FeatureUpdateValidator.java deleted file mode 100644 index c523a083e..000000000 --- a/src/java/com/twitter/search/feature_update_service/util/FeatureUpdateValidator.java +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.search.feature_update_service.util; - - 
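`FeatureUpdateStats` keeps one `SearchRateCounter` per dynamically built key (per client, per response code) in a `ConcurrentMap`, creating the counter on first use via `putIfAbsent`. A JDK-only sketch of the same create-on-first-use pattern, with `AtomicLong` standing in for the internal counter type:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// JDK-only stand-in for the per-key counter pattern used by FeatureUpdateStats
// (AtomicLong plays the role of SearchRateCounter here).
public final class PerKeyCounters {
  private final ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

  public void increment(String key) {
    // computeIfAbsent is the modern equivalent of the putIfAbsent-then-get idiom above:
    // the counter is created once and then reused for every later increment of the same key.
    counters.computeIfAbsent(key, unused -> new AtomicLong()).incrementAndGet();
  }

  public long get(String key) {
    AtomicLong counter = counters.get(key);
    return counter == null ? 0L : counter.get();
  }
}
```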
-import javax.annotation.Nullable; - -import com.twitter.search.common.schema.base.ThriftDocumentUtil; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateRequest; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponse; -import com.twitter.search.feature_update_service.thriftjava.FeatureUpdateResponseCode; - -public final class FeatureUpdateValidator { - - private FeatureUpdateValidator() { } - - /** - * Validates FeatureUpdateRequest - * @param featureUpdate instance of FeatureUpdateRequest with ThriftIndexingEvent - * @return null if valid, instance of FeatureUpdateResponse if not. - * Response will have appropriate error code and message set. - */ - @Nullable - public static FeatureUpdateResponse validate(FeatureUpdateRequest featureUpdate) { - - if (ThriftDocumentUtil.hasDuplicateFields(featureUpdate.getEvent().getDocument())) { - return createResponse( - String.format("duplicate document fields: %s", featureUpdate.toString())); - } - if (!featureUpdate.getEvent().isSetUid()) { - return createResponse(String.format("unset uid: %s", featureUpdate.toString())); - } - - return null; - } - - private static FeatureUpdateResponse createResponse(String errorMsg) { - FeatureUpdateResponseCode responseCode = FeatureUpdateResponseCode.CLIENT_ERROR; - FeatureUpdateResponse response = new FeatureUpdateResponse(responseCode); - response.setDetailMessage(errorMsg); - return response; - } -} diff --git a/src/java/com/twitter/search/feature_update_service/whitelist/BUILD b/src/java/com/twitter/search/feature_update_service/whitelist/BUILD deleted file mode 100644 index 9bd13cf87..000000000 --- a/src/java/com/twitter/search/feature_update_service/whitelist/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/org/yaml:snakeyaml", - "finagle/finagle-core/src/main", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/util/io/periodic", - ], -) diff --git a/src/java/com/twitter/search/feature_update_service/whitelist/ClientIdWhitelist.java b/src/java/com/twitter/search/feature_update_service/whitelist/ClientIdWhitelist.java deleted file mode 100644 index 4718c547e..000000000 --- a/src/java/com/twitter/search/feature_update_service/whitelist/ClientIdWhitelist.java +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.search.feature_update_service.whitelist; - -import java.io.InputStream; -import java.util.Set; -import java.util.concurrent.Executors; -import java.util.concurrent.ScheduledExecutorService; -import java.util.concurrent.atomic.AtomicReference; - -import com.google.common.collect.ImmutableSet; -import com.google.common.util.concurrent.ThreadFactoryBuilder; - -import org.yaml.snakeyaml.Yaml; - -import com.twitter.common.util.Clock; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.search.common.util.io.periodic.PeriodicFileLoader; - -/** - * ClientIdWhitelist extends PeriodicFileLoader to load client whitelist - * from configbus and checks to see if current clientId is allowed - */ -public class ClientIdWhitelist extends PeriodicFileLoader { - - private final AtomicReference> clientIdSet = new AtomicReference<>(); - - - public ClientIdWhitelist(String clientIdWhitelistPath, ScheduledExecutorService executorService, - Clock clock) { - super("ClientIdWhitelist", clientIdWhitelistPath, executorService, clock); - } - - /** - 
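`FeatureUpdateValidator.validate` returns `null` for a valid request and a pre-populated error response otherwise, so callers branch on the null check rather than on exceptions. A small usage sketch under that contract (`featureUpdate` and the downstream `process` call are hypothetical):

```java
// Hypothetical call site: reject invalid requests up front, otherwise continue processing.
FeatureUpdateResponse errorResponse = FeatureUpdateValidator.validate(featureUpdate);
if (errorResponse != null) {
  // CLIENT_ERROR with a detail message such as "unset uid: ..." or "duplicate document fields: ..."
  return Future.value(errorResponse);
}
// Request passed validation; hand it to the rest of the pipeline (hypothetical helper).
return process(featureUpdate);
```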
* Creates the object that manages loads from the clientIdWhitelistpath in config. - * It periodically reloads the client whitelist file using the given executor service. - */ - public static ClientIdWhitelist initWhitelist( - String clientIdWhitelistPath, ScheduledExecutorService executorService, - Clock clock) throws Exception { - ClientIdWhitelist clientIdWhitelist = new ClientIdWhitelist( - clientIdWhitelistPath, executorService, clock); - clientIdWhitelist.init(); - return clientIdWhitelist; - } - - /** - * Creates clock and executor service needed to create a periodic file loading object - * then returns object that accpets file. - * @param clientWhitelistPath - * @return ClientIdWhitelist - * @throws Exception - */ - public static ClientIdWhitelist initWhitelist(String clientWhitelistPath) throws Exception { - Clock clock = Clock.SYSTEM_CLOCK; - ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor( - new ThreadFactoryBuilder() - .setNameFormat("client-whitelist-reloader") - .setDaemon(true) - .build()); - - return initWhitelist(clientWhitelistPath, executorService, clock); - } - @Override - protected void accept(InputStream fileStream) { - ImmutableSet.Builder clientIdBuilder = new ImmutableSet.Builder<>(); - Yaml yaml = new Yaml(); - Set set = yaml.loadAs(fileStream, Set.class); - for (String id : set) { - clientIdBuilder.add(ClientId.apply(id)); - } - clientIdSet.set(clientIdBuilder.build()); - } - - // checks to see if clientId is in set of whitelisted clients - public boolean isClientAllowed(ClientId clientId) { - return clientIdSet.get().contains(clientId); - } -} diff --git a/src/java/com/twitter/search/img/foryou.png b/src/java/com/twitter/search/img/foryou.png deleted file mode 100644 index 6d08febde..000000000 Binary files a/src/java/com/twitter/search/img/foryou.png and /dev/null differ diff --git a/src/java/com/twitter/search/img/in-network.png b/src/java/com/twitter/search/img/in-network.png deleted file mode 100644 index 09caa3df2..000000000 Binary files a/src/java/com/twitter/search/img/in-network.png and /dev/null differ diff --git a/src/java/com/twitter/search/img/indexing.png b/src/java/com/twitter/search/img/indexing.png deleted file mode 100644 index 2704854ab..000000000 Binary files a/src/java/com/twitter/search/img/indexing.png and /dev/null differ diff --git a/src/java/com/twitter/search/img/serving.png b/src/java/com/twitter/search/img/serving.png deleted file mode 100644 index aca60b55e..000000000 Binary files a/src/java/com/twitter/search/img/serving.png and /dev/null differ diff --git a/src/java/com/twitter/search/img/top-search.png b/src/java/com/twitter/search/img/top-search.png deleted file mode 100644 index 267c3aaf2..000000000 Binary files a/src/java/com/twitter/search/img/top-search.png and /dev/null differ diff --git a/src/java/com/twitter/search/ingester/BUILD b/src/java/com/twitter/search/ingester/BUILD deleted file mode 100644 index 391184356..000000000 --- a/src/java/com/twitter/search/ingester/BUILD +++ /dev/null @@ -1,30 +0,0 @@ -target( - name = "ingester-lib", - dependencies = [ - "src/java/com/twitter/search/common/converter/earlybird", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/pipeline/app", - "src/java/com/twitter/search/ingester/pipeline/twitter", - "src/java/com/twitter/search/ingester/pipeline/twitter/engagements", - "src/java/com/twitter/search/ingester/pipeline/twitter/filters", - "src/java/com/twitter/search/ingester/pipeline/twitter/kafka", - 
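`accept` parses the ConfigBus file with SnakeYAML into a `Set` of strings, so the whitelist file is expected to be a plain YAML list of Finagle client ids. A minimal usage sketch of loading and querying the whitelist (the file path and client ids below are placeholders):

```java
// Assumed whitelist file shape: a plain YAML list of Finagle client ids, e.g.
//   - feature_update_service.prod
//   - some_other_client.staging
//
// Hypothetical call site; the path is a placeholder for the real ConfigBus location.
ClientIdWhitelist whitelist = ClientIdWhitelist.initWhitelist("/path/to/client_whitelist.yml");
boolean allowed = whitelist.isClientAllowed(ClientId.apply("feature_update_service.prod"));
```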
"src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse", - "src/java/com/twitter/search/ingester/pipeline/twitter/userupdates", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/java/com/twitter/search/ingester/util/jndi", - ], -) - -jvm_binary( - name = "ingester-binary", - basename = "ingester", - main = "com.twitter.search.ingester.pipeline.app.IngesterPipelineApplication\\$Main", - platform = "java8", - tags = [ - "bazel-compatible", - ], - dependencies = [ - ":ingester-lib", - "src/java/com/twitter/search/common/logging:search-log4j", - ], -) diff --git a/src/java/com/twitter/search/ingester/README.md b/src/java/com/twitter/search/ingester/README.md deleted file mode 100644 index ee0a2b15a..000000000 --- a/src/java/com/twitter/search/ingester/README.md +++ /dev/null @@ -1,10 +0,0 @@ -## Ingesters -Ingesters are services that consume raw tweets and user updates, process them through a series of transformations and write them to kafka topics for Earlybird to consume and subsequently index. - -There are two types of ingesters: -1. Tweet ingesters -2. UserUpdates ingesters - -Tweet ingesters consume raw tweets and extract different fields and features for Earlybird to index. User updates ingester produces user safety information such as whether the user is deactivated, suspended or off-boarded. The user and tweet features produced by ingesters are then used by Earlybird during tweet retieval and ranking. - -Ingesters are made up of a pipeline of stages with each stage performing a different field/feature extraction. The pipeline configuration of the ingesters can be found at science/search/ingester/config diff --git a/src/java/com/twitter/search/ingester/model/BUILD b/src/java/com/twitter/search/ingester/model/BUILD deleted file mode 100644 index 4225e7ff5..000000000 --- a/src/java/com/twitter/search/ingester/model/BUILD +++ /dev/null @@ -1,28 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/org/apache/thrift:libthrift", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "src/java/com/twitter/common/text/token", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/relevance:text", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/common/debug:debug-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/tweetypie:events-java", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/ingester/model/IndexerStatus.java b/src/java/com/twitter/search/ingester/model/IndexerStatus.java deleted file mode 100644 index 6893bbc67..000000000 --- a/src/java/com/twitter/search/ingester/model/IndexerStatus.java +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.search.ingester.model; - -import com.twitter.search.common.debug.DebugEventAccumulator; - -/** - * Interface 
used for stages that process both TwitterMessages and ThriftVersionedEvents. - */ -public interface IndexerStatus extends DebugEventAccumulator { - /** - * Needed by the SortStage. - */ - long getId(); -} diff --git a/src/java/com/twitter/search/ingester/model/IngesterThriftVersionedEvents.java b/src/java/com/twitter/search/ingester/model/IngesterThriftVersionedEvents.java deleted file mode 100644 index b6dd985a8..000000000 --- a/src/java/com/twitter/search/ingester/model/IngesterThriftVersionedEvents.java +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.search.ingester.model; - -import java.util.Map; - -import com.google.common.primitives.Longs; - -import com.twitter.search.common.debug.DebugEventAccumulator; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.partitioning.base.Partitionable; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; - -/** - * Wrap of ThriftVersionedEvents, make it partitionable for the queue writer. - */ -public class IngesterThriftVersionedEvents extends ThriftVersionedEvents - implements Comparable, Partitionable, DebugEventAccumulator { - - // Make userId field easier to be accessed to calculate partition number - private final long userId; - - public IngesterThriftVersionedEvents(long userId) { - this.userId = userId; - } - - public IngesterThriftVersionedEvents(long userId, - Map versionedEvents) { - super(versionedEvents); - this.userId = userId; - } - - public IngesterThriftVersionedEvents(long userId, ThriftVersionedEvents original) { - super(original); - this.userId = userId; - } - - @Override - public int compareTo(ThriftVersionedEvents o) { - return Longs.compare(getId(), o.getId()); - } - - @Override - public long getTweetId() { - return this.getId(); - } - - @Override - public long getUserId() { - return this.userId; - } -} diff --git a/src/java/com/twitter/search/ingester/model/IngesterTweetEvent.java b/src/java/com/twitter/search/ingester/model/IngesterTweetEvent.java deleted file mode 100644 index 1d5fae1b9..000000000 --- a/src/java/com/twitter/search/ingester/model/IngesterTweetEvent.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.ingester.model; - -import com.twitter.search.common.debug.DebugEventAccumulator; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.tweetypie.thriftjava.TweetEvent; - -public class IngesterTweetEvent extends TweetEvent implements DebugEventAccumulator { - // Used for propagating DebugEvents through the ingester stages. 
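The `IngesterThriftVersionedEvents` wrapper above exists so queue writers can route an event by author (`getUserId`) while still ordering events by tweet id (`compareTo` delegates to `getId`). A small sketch of both behaviours, assuming the thrift-generated `setId` setter on the underlying `ThriftVersionedEvents` struct:

```java
// Minimal sketch: sort two wrapped events by tweet id and read back the partitioning keys.
// setId is assumed to be the thrift-generated setter for the struct's id field.
IngesterThriftVersionedEvents a = new IngesterThriftVersionedEvents(1001L);
a.setId(200L);
IngesterThriftVersionedEvents b = new IngesterThriftVersionedEvents(1002L);
b.setId(100L);

List<IngesterThriftVersionedEvents> events = new ArrayList<>(Arrays.asList(a, b));
Collections.sort(events);                  // b (tweet 100) now precedes a (tweet 200)

long tweetId = events.get(0).getTweetId(); // 100
long userId = events.get(0).getUserId();   // 1002: the author id used for partitioning
```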
- private final DebugEvents debugEvents; - - public IngesterTweetEvent() { - this.debugEvents = new DebugEvents(); - } - - @Override - public DebugEvents getDebugEvents() { - return debugEvents; - } -} diff --git a/src/java/com/twitter/search/ingester/model/IngesterTwitterMessage.java b/src/java/com/twitter/search/ingester/model/IngesterTwitterMessage.java deleted file mode 100644 index e89fef845..000000000 --- a/src/java/com/twitter/search/ingester/model/IngesterTwitterMessage.java +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.search.ingester.model; - -import java.util.List; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.primitives.Longs; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.partitioning.base.HashPartitionFunction; -import com.twitter.search.common.partitioning.base.Partitionable; -import com.twitter.search.common.relevance.entities.TwitterMessage; - -/** - * A Twitter "status" object (e.g. a message) - * - */ -public class IngesterTwitterMessage extends TwitterMessage - implements Comparable, IndexerStatus, Partitionable { - private final DebugEvents debugEvents; - - public IngesterTwitterMessage(Long twitterId, List supportedPenguinVersions) { - this(twitterId, supportedPenguinVersions, null); - } - - public IngesterTwitterMessage( - Long twitterId, - List penguinVersions, - @Nullable DebugEvents debugEvents) { - super(twitterId, penguinVersions); - this.debugEvents = debugEvents == null ? new DebugEvents() : debugEvents.deepCopy(); - } - - @Override - public int compareTo(IndexerStatus o) { - return Longs.compare(getId(), o.getId()); - } - - @Override - public boolean equals(Object o) { - return (o instanceof IngesterTwitterMessage) - && compareTo((IngesterTwitterMessage) o) == 0; - } - - @Override - public int hashCode() { - return HashPartitionFunction.hashCode(getId()); - } - - public boolean isIndexable(boolean indexProtectedTweets) { - return getFromUserScreenName().isPresent() - && getId() != INT_FIELD_NOT_PRESENT - && (indexProtectedTweets || !isUserProtected()); - } - - @Override - public long getTweetId() { - return this.getId(); - } - - @Override - public long getUserId() { - Preconditions.checkState(getFromUserTwitterId().isPresent(), "The author user ID is missing"); - return getFromUserTwitterId().get(); - } - - @Override - public DebugEvents getDebugEvents() { - return debugEvents; - } -} diff --git a/src/java/com/twitter/search/ingester/model/KafkaRawRecord.java b/src/java/com/twitter/search/ingester/model/KafkaRawRecord.java deleted file mode 100644 index 85ea70fa7..000000000 --- a/src/java/com/twitter/search/ingester/model/KafkaRawRecord.java +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.search.ingester.model; - -/** - * The raw data in a Kafka record. 
- */ -public class KafkaRawRecord { - private final byte[] data; - private final long readAtTimestampMs; - - public KafkaRawRecord(byte[] data, long readAtTimestampMs) { - this.data = data; - this.readAtTimestampMs = readAtTimestampMs; - } - - public byte[] getData() { - return data; - } - - public long getReadAtTimestampMs() { - return readAtTimestampMs; - } -} diff --git a/src/java/com/twitter/search/ingester/model/PromiseContainer.java b/src/java/com/twitter/search/ingester/model/PromiseContainer.java deleted file mode 100644 index 7d9b2ead9..000000000 --- a/src/java/com/twitter/search/ingester/model/PromiseContainer.java +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.search.ingester.model; - -import com.twitter.util.Promise; - -public class PromiseContainer { - private final Promise promise; - private final U obj; - - public PromiseContainer(Promise promise, U obj) { - this.promise = promise; - this.obj = obj; - } - - public Promise getPromise() { - return promise; - } - - public U getObj() { - return obj; - } -} diff --git a/src/java/com/twitter/search/ingester/model/VisibleTokenRatioUtil.java b/src/java/com/twitter/search/ingester/model/VisibleTokenRatioUtil.java deleted file mode 100644 index 52c8654a5..000000000 --- a/src/java/com/twitter/search/ingester/model/VisibleTokenRatioUtil.java +++ /dev/null @@ -1,42 +0,0 @@ -package com.twitter.search.ingester.model; - -import com.twitter.common.text.token.TokenizedCharSequenceStream; -import com.twitter.common.text.token.attribute.CharSequenceTermAttribute; -import com.twitter.search.common.relevance.text.VisibleTokenRatioNormalizer; - -public class VisibleTokenRatioUtil { - - private static final int TOKEN_DEMARCATION = 140; - - private static final VisibleTokenRatioNormalizer NORMALIZER = - VisibleTokenRatioNormalizer.createInstance(); - - /** - * Take the number of visible tokens and divide by number of total tokens to get the - * visible token percentage (pretending 140 chars is visible as that is old typical tweet - * size). 
Then normalize it down to 4 bits(round it basically) - */ - public int extractAndNormalizeTokenPercentage(TokenizedCharSequenceStream tokenSeqStream) { - - CharSequenceTermAttribute attr = tokenSeqStream.addAttribute(CharSequenceTermAttribute.class); - - int totalTokens = 0; - int numTokensBelowThreshold = 0; - while (tokenSeqStream.incrementToken()) { - totalTokens++; - int offset = attr.getOffset(); - if (offset <= TOKEN_DEMARCATION) { - numTokensBelowThreshold++; - } - } - - double percent; - if (totalTokens > 0) { - percent = numTokensBelowThreshold / (double) totalTokens; - } else { - percent = 1; - } - - return NORMALIZER.normalize(percent); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/app/BUILD b/src/java/com/twitter/search/ingester/pipeline/app/BUILD deleted file mode 100644 index d28a18bd7..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/app/BUILD +++ /dev/null @@ -1,31 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/slf4j:slf4j-api", - "decider/src/main/scala", - "finagle/finagle-core/src/main", - "finagle/finagle-http/src/main/scala", - "servo/decider/src/main/scala", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/pipeline/twitter", - "src/java/com/twitter/search/ingester/pipeline/twitter/kafka", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/java/com/twitter/search/ingester/pipeline/wire", - "src/java/com/twitter/search/ingester/util/jndi", - "src/java/org/apache/commons/pipeline", - "src/thrift/com/twitter/tweetypie:events-java", - "twitter-server/server/src/main/scala", - "util/util-app/src/main/scala", - "util/util-core:scala", - "util/util-lint/src/main/scala", - "util/util-stats/src/main/scala", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/app/IngesterPipelineApplication.java b/src/java/com/twitter/search/ingester/pipeline/app/IngesterPipelineApplication.java deleted file mode 100644 index 2c7d9c952..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/app/IngesterPipelineApplication.java +++ /dev/null @@ -1,195 +0,0 @@ -package com.twitter.search.ingester.pipeline.app; - -import java.io.File; -import java.net.URL; -import java.util.concurrent.CountDownLatch; -import java.util.concurrent.atomic.AtomicBoolean; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.commons.pipeline.Pipeline; -import org.apache.commons.pipeline.PipelineCreationException; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.config.DigesterPipelineFactory; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import com.twitter.app.Flag; -import com.twitter.app.Flaggable; -import com.twitter.search.common.metrics.BuildInfoStats; -import com.twitter.search.ingester.pipeline.wire.ProductionWireModule; -import com.twitter.search.ingester.pipeline.wire.WireModule; -import com.twitter.search.ingester.util.jndi.JndiUtil; -import com.twitter.server.AbstractTwitterServer; -import com.twitter.server.handler.DeciderHandler$; - -/** Starts the ingester/indexer pipeline. 
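The ratio computed above is simply the count of tokens whose offset falls within the first 140 characters divided by the total token count, with empty input treated as fully visible, before the result is quantized by the normalizer. A worked example of just that arithmetic:

```java
// Worked example of the ratio before normalization (the 4-bit quantization itself lives in
// VisibleTokenRatioNormalizer and is not reproduced here).
int totalTokens = 25;
int tokensWithinFirst140Chars = 15;

double percent = totalTokens > 0
    ? tokensWithinFirst140Chars / (double) totalTokens   // 0.6 for this example
    : 1.0;                                                // empty text counts as fully visible
```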
*/ -public class IngesterPipelineApplication extends AbstractTwitterServer { - private static final Logger LOG = LoggerFactory.getLogger(IngesterPipelineApplication.class); - private static final String VERSION_2 = "v2"; - private final Flag pipelineConfigFile = flag().create( - "config_file", - "", - "xml file to load pipeline config from. Required.", - Flaggable.ofString()); - - private final Flag pipelineVersion = flag().create( - "version", - "", - "Specifies if we want to run the acp pipeline or non acp pipeline.", - Flaggable.ofString()); - - private final Flag partitionArg = flag().create( - "shard", - -1, - "The partition this indexer is responsible for.", - Flaggable.ofJavaInteger()); - - private final Flag deciderOverlay = flag().create( - "decider_overlay", - "", - "Decider overlay", - Flaggable.ofString()); - - private final Flag serviceIdentifierFlag = flag().create( - "service_identifier", - "", - "Service identifier for mutual TLS authentication", - Flaggable.ofString()); - - private final Flag environment = flag().create( - "environment", - "", - "Specifies the environment the app is running in. Valid values : prod, staging, " - + "staging1. Required if pipelineVersion == 'v2'", - Flaggable.ofString() - ); - - private final Flag cluster = flag().create( - "cluster", - "", - "Specifies the cluster the app is running in. Valid values : realtime, protected, " - + "realtime_cg, user_updates. Required if pipelineVersion == 'v2'", - Flaggable.ofString() - ); - - private final Flag cores = flag().create( - "cores", - 1F, - "Specifies the number of cores this cluster is using. ", - Flaggable.ofJavaFloat() - ); - - private final CountDownLatch shutdownLatch = new CountDownLatch(1); - - public void shutdown() { - shutdownLatch.countDown(); - } - - private Pipeline pipeline; - - private final AtomicBoolean started = new AtomicBoolean(false); - - private final AtomicBoolean finished = new AtomicBoolean(false); - - /** - * Boilerplate for the Java-friendly AbstractTwitterServer - */ - public static class Main { - public static void main(String[] args) { - new IngesterPipelineApplication().main(args); - } - } - - /** - * Code is based on DigesterPipelineFactory.main. We only require reading in one config file. - */ - @Override - public void main() { - try { - JndiUtil.loadJNDI(); - - ProductionWireModule wireModule = new ProductionWireModule( - deciderOverlay.get().get(), - partitionArg.getWithDefault().get(), - serviceIdentifierFlag.get()); - WireModule.bindWireModule(wireModule); - - addAdminRoute(DeciderHandler$.MODULE$.route( - "ingester", - wireModule.getMutableDecisionMaker(), - wireModule.getDecider())); - - BuildInfoStats.export(); - if (pipelineVersion.get().get().equals(VERSION_2)) { - runPipelineV2(wireModule); - } else { - runPipelineV1(wireModule); - } - LOG.info("Pipeline terminated. Ingester is DOWN."); - } catch (Exception e) { - LOG.error("Exception in pipeline. 
Ingester is DOWN.", e); - throw new RuntimeException(e); - } - } - - @VisibleForTesting - boolean isFinished() { - return finished.get(); - } - - @VisibleForTesting - Pipeline createPipeline(URL pipelineConfigFileURL) throws PipelineCreationException { - DigesterPipelineFactory factory = new DigesterPipelineFactory(pipelineConfigFileURL); - LOG.info("Pipeline created from {}, about to begin processing...", pipelineConfigFileURL); - return factory.createPipeline(); - } - - void runPipelineV1(ProductionWireModule wireModule) throws Exception { - LOG.info("Running Pipeline V1"); - final File pipelineFile = new File(pipelineConfigFile.get().get()); - URL pipelineConfigFileUrl = pipelineFile.toURI().toURL(); - wireModule.setPipelineExceptionHandler(new PipelineExceptionImpl(this)); - runPipelineV1(pipelineConfigFileUrl); - shutdownLatch.await(); - } - - @VisibleForTesting - void runPipelineV1(URL pipelineConfigFileUrl) throws Exception { - pipeline = createPipeline(pipelineConfigFileUrl); - pipeline.start(); - started.set(true); - } - - void runPipelineV2(ProductionWireModule wireModule) throws Exception { - LOG.info("Running Pipeline V2"); - int threadsToSpawn = cores.get().get().intValue() - 1; - RealtimeIngesterPipelineV2 realtimePipeline = new RealtimeIngesterPipelineV2( - environment.get().get(), cluster.get().get(), threadsToSpawn); - wireModule.setPipelineExceptionHandler(new PipelineExceptionImplV2(realtimePipeline)); - realtimePipeline.run(); - } - - @Override - public void onExit() { - try { - LOG.info("Attempting to shutdown gracefully."); - /* - * Iterates over each Stage and calls finish(). The Stage is considered finished when - * its queue is empty. If there is a backup, finish() waits for the queues to empty. - */ - - // We don't call finish() unless the pipeline exists and has started because if any stage - // fails to initialize, no processing is started and not only is calling finish() unnecessary, - // but it will also deadlock any DedicatedThreadStageDriver. 
- if (pipeline != null && started.get()) { - pipeline.finish(); - finished.set(true); - LOG.info("Pipeline exited cleanly."); - } else { - LOG.info("Pipeline not yet started."); - } - } catch (StageException e) { - LOG.error("Unable to shutdown pipeline.", e); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImpl.java b/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImpl.java deleted file mode 100644 index 5ce4892af..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImpl.java +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.search.ingester.pipeline.app; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.ingester.pipeline.util.PipelineExceptionHandler; -import com.twitter.util.Duration; - -public class PipelineExceptionImpl implements PipelineExceptionHandler { - private static final Logger LOG = LoggerFactory.getLogger(PipelineExceptionImpl.class); - - private final IngesterPipelineApplication app; - - public PipelineExceptionImpl(IngesterPipelineApplication app) { - this.app = app; - } - - @Override - public void logAndWait(String msg, Duration waitTime) throws InterruptedException { - LOG.info(msg); - long waitTimeInMilliSecond = waitTime.inMilliseconds(); - Thread.sleep(waitTimeInMilliSecond); - } - - @Override - public void logAndShutdown(String msg) { - LOG.error(msg); - app.shutdown(); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImplV2.java b/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImplV2.java deleted file mode 100644 index 1b576ebdf..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/app/PipelineExceptionImplV2.java +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.search.ingester.pipeline.app; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.ingester.pipeline.util.PipelineExceptionHandler; -import com.twitter.util.Duration; - -public class PipelineExceptionImplV2 implements PipelineExceptionHandler { - private static final Logger LOG = LoggerFactory.getLogger(PipelineExceptionImplV2.class); - private RealtimeIngesterPipelineV2 pipeline; - - public PipelineExceptionImplV2(RealtimeIngesterPipelineV2 pipeline) { - this.pipeline = pipeline; - } - - @Override - public void logAndWait(String msg, Duration waitTime) throws InterruptedException { - LOG.info(msg); - long waitTimeInMilliSecond = waitTime.inMilliseconds(); - Thread.sleep(waitTimeInMilliSecond); - } - - @Override - public void logAndShutdown(String msg) { - LOG.info(msg); - pipeline.shutdown(); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/app/RealtimeIngesterPipelineV2.java b/src/java/com/twitter/search/ingester/pipeline/app/RealtimeIngesterPipelineV2.java deleted file mode 100644 index b3669305c..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/app/RealtimeIngesterPipelineV2.java +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.search.ingester.pipeline.app; -import java.util.List; -import java.util.concurrent.CompletableFuture; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.SynchronousQueue; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.ingester.model.IngesterTweetEvent; -import 
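Both handler implementations above share the same contract: `logAndWait` blocks the calling stage for a backoff interval, while `logAndShutdown` stops the whole pipeline (via the application in v1, via the pipeline object in v2). A sketch of how a stage might use that contract to separate transient from fatal failures; the retry loop, limit, and `tryConnectToDownstream` helper are illustrative only:

```java
// Hypothetical stage-side retry loop built on the PipelineExceptionHandler contract above.
int attempts = 0;
while (!tryConnectToDownstream()) {        // hypothetical transient operation
  attempts++;
  if (attempts > 5) {
    // Fatal: ask the handler to shut the pipeline down (app.shutdown() in v1, pipeline.shutdown() in v2).
    pipelineExceptionHandler.logAndShutdown("downstream unavailable after 5 attempts");
    break;
  }
  // Transient: log and back off before retrying.
  pipelineExceptionHandler.logAndWait("downstream unavailable, retrying", Duration.fromSeconds(5));
}
```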
com.twitter.search.ingester.model.KafkaRawRecord; -import com.twitter.search.ingester.pipeline.twitter.TweetEventDeserializerStage; -import com.twitter.search.ingester.pipeline.twitter.kafka.KafkaConsumerStage; -import com.twitter.search.ingester.pipeline.twitter.kafka.KafkaRawRecordConsumerStage; -import com.twitter.search.ingester.pipeline.util.PipelineV2CreationException; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; - -public class RealtimeIngesterPipelineV2 { - private static final Logger LOG = LoggerFactory.getLogger(RealtimeIngesterPipelineV2.class); - private static final String PROD_ENV = "prod"; - private static final String STAGING_ENV = "staging"; - private static final String STAGING1_ENV = "staging1"; - private static final String REALTIME_CLUSTER = "realtime"; - private static final String PROTECTED_CLUSTER = "protected"; - private static final String REALTIME_CG_CLUSTER = "realtime_cg"; - private static final String KAFKA_CLIENT_ID = ""; - private static final String KAFKA_TOPIC_NAME = ""; - private static final String KAFKA_CONSUMER_GROUP_ID = ""; - private static final String KAFKA_CLUSTER_PATH = ""; - private static final String KAFKA_DECIDER_KEY = "ingester_tweets_consume_from_kafka"; - private static final String STATS_PREFIX = "realtimeingesterpipelinev2"; - private SearchCounter kafkaErrorCount = SearchCounter.create(STATS_PREFIX - + "_kafka_error_count"); - private Boolean running; - private String environment; - private String cluster; - private ExecutorService threadPool; - private KafkaConsumerStage kafkaConsumer; - private TweetEventDeserializerStage tweetEventDeserializerStage; - - public RealtimeIngesterPipelineV2(String environment, String cluster, int threadsToSpawn) throws - PipelineV2CreationException, PipelineStageException { - if (!environment.equals(PROD_ENV) && !environment.equals(STAGING_ENV) - && !environment.equals(STAGING1_ENV)) { - throw new PipelineV2CreationException("invalid value for environment"); - } - - if (!cluster.equals(REALTIME_CLUSTER) - && !cluster.equals(PROTECTED_CLUSTER) && !cluster.equals(REALTIME_CG_CLUSTER)) { - throw new PipelineV2CreationException("invalid value for cluster."); - } - - int numberOfThreads = Math.max(1, threadsToSpawn); - this.environment = environment; - this.cluster = cluster; - this.threadPool = new ThreadPoolExecutor(numberOfThreads, numberOfThreads, 0L, - TimeUnit.MILLISECONDS, new SynchronousQueue<>(), new ThreadPoolExecutor.CallerRunsPolicy()); - initStages(); - } - - private void initStages() throws PipelineStageException { - kafkaConsumer = new KafkaRawRecordConsumerStage(KAFKA_CLIENT_ID, KAFKA_TOPIC_NAME, - KAFKA_CONSUMER_GROUP_ID, KAFKA_CLUSTER_PATH, KAFKA_DECIDER_KEY); - kafkaConsumer.setupStageV2(); - tweetEventDeserializerStage = new TweetEventDeserializerStage(); - tweetEventDeserializerStage.setupStageV2(); - } - - /*** - * Starts the pipeline by starting the polling from Kafka and passing the events to the first - * stage of the pipeline. 
- */ - public void run() { - running = true; - while (running) { - pollFromKafkaAndSendToPipeline(); - } - } - - private void pollFromKafkaAndSendToPipeline() { - try { - List records = kafkaConsumer.pollFromTopic(); - for (KafkaRawRecord record : records) { - processKafkaRecord(record); - } - } catch (PipelineStageException e) { - kafkaErrorCount.increment(); - LOG.error("Error polling from Kafka", e); - } - } - - private void processKafkaRecord(KafkaRawRecord record) { - CompletableFuture stage1 = CompletableFuture.supplyAsync(() -> record, - threadPool); - - CompletableFuture stage2 = stage1.thenApplyAsync((KafkaRawRecord r) -> - tweetEventDeserializerStage.runStageV2(r), threadPool); - - } - - /*** - * Stop the pipeline from processing any further events. - */ - public void shutdown() { - running = false; - kafkaConsumer.cleanupStageV2(); - tweetEventDeserializerStage.cleanupStageV2(); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceCoreFetcher.java b/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceCoreFetcher.java deleted file mode 100644 index b80cd93cf..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceCoreFetcher.java +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.search.ingester.pipeline.strato_fetchers; - -import java.util.List; -import java.util.Set; -import java.util.stream.Collectors; - -import com.twitter.periscope.api.thriftjava.AudioSpacesLookupContext; -import com.twitter.stitch.Stitch; -import com.twitter.strato.catalog.Fetch; -import com.twitter.strato.client.Client; -import com.twitter.strato.client.Fetcher; -import com.twitter.strato.data.Conv; -import com.twitter.strato.thrift.TBaseConv; -import com.twitter.ubs.thriftjava.AudioSpace; -import com.twitter.util.Future; -import com.twitter.util.Try; - -/** - * Fetches from the audio space core strato column. 
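`processKafkaRecord` above chains each record through the v2 stages as `CompletableFuture` steps on the shared thread pool, so deserialization never runs on the Kafka polling thread. A self-contained, JDK-only sketch of the same two-step pattern, with toy types standing in for `KafkaRawRecord` and the deserialized event:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy stand-ins for KafkaRawRecord (byte[]) and the deserialized tweet event (String).
public final class TwoStageDemo {
  public static void main(String[] args) {
    ExecutorService threadPool = Executors.newFixedThreadPool(4);

    byte[] rawRecord = "raw bytes from kafka".getBytes();

    // Stage 1: hand the raw record to the pool; stage 2: "deserialize" it on the same pool.
    CompletableFuture<byte[]> stage1 = CompletableFuture.supplyAsync(() -> rawRecord, threadPool);
    CompletableFuture<String> stage2 =
        stage1.thenApplyAsync(bytes -> new String(bytes), threadPool);

    System.out.println(stage2.join());
    threadPool.shutdown();
  }
}
```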
- */ -public class AudioSpaceCoreFetcher { - private static final String CORE_STRATO_COLUMN = ""; - - private static final AudioSpacesLookupContext - EMPTY_AUDIO_LOOKUP_CONTEXT = new AudioSpacesLookupContext(); - - private final Fetcher fetcher; - - public AudioSpaceCoreFetcher(Client stratoClient) { - fetcher = stratoClient.fetcher( - CORE_STRATO_COLUMN, - true, // enables checking types against catalog - Conv.stringConv(), - TBaseConv.forClass(AudioSpacesLookupContext.class), - TBaseConv.forClass(AudioSpace.class)); - } - - public Future> fetch(String spaceId) { - return Stitch.run(fetcher.fetch(spaceId, EMPTY_AUDIO_LOOKUP_CONTEXT)); - } - - /** - * Use stitch to fetch mulitiple AudioSpace Objects at once - */ - public Future>>> fetchBulkSpaces(Set spaceIds) { - return Stitch.run( - Stitch.collectToTry( - spaceIds - .stream() - .map(spaceId -> fetcher.fetch(spaceId, EMPTY_AUDIO_LOOKUP_CONTEXT)) - .collect(Collectors.toList()) - ) - ); - } - -} diff --git a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceParticipantsFetcher.java b/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceParticipantsFetcher.java deleted file mode 100644 index 591c4e541..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/AudioSpaceParticipantsFetcher.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.ingester.pipeline.strato_fetchers; - -import com.twitter.periscope.api.thriftjava.AudioSpacesLookupContext; -import com.twitter.stitch.Stitch; -import com.twitter.strato.catalog.Fetch; -import com.twitter.strato.client.Client; -import com.twitter.strato.client.Fetcher; -import com.twitter.strato.data.Conv; -import com.twitter.strato.thrift.TBaseConv; -import com.twitter.ubs.thriftjava.Participants; -import com.twitter.util.Future; - -/** - * Fetches from the audio space participants strato column. 
- */ -public class AudioSpaceParticipantsFetcher { - private static final String PARTICIPANTS_STRATO_COLUMN = ""; - - private static final AudioSpacesLookupContext - EMPTY_AUDIO_LOOKUP_CONTEXT = new AudioSpacesLookupContext(); - - private final Fetcher fetcher; - - public AudioSpaceParticipantsFetcher(Client stratoClient) { - fetcher = stratoClient.fetcher( - PARTICIPANTS_STRATO_COLUMN, - true, // enables checking types against catalog - Conv.stringConv(), - TBaseConv.forClass(AudioSpacesLookupContext.class), - TBaseConv.forClass(Participants.class)); - } - - public Future> fetch(String spaceId) { - return Stitch.run(fetcher.fetch(spaceId, EMPTY_AUDIO_LOOKUP_CONTEXT)); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/BUILD b/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/BUILD deleted file mode 100644 index 57f38483a..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/BUILD +++ /dev/null @@ -1,20 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/commons-lang", - "cuad/projects/ner/thrift:thrift-java", - "periscope/api-proxy-thrift/thrift/src/main/thrift:thrift-java", - "scrooge/scrooge-core/src/main/scala", - "src/java/com/twitter/common/collections", - "stitch/stitch-core", - "strato/src/main/scala/com/twitter/strato/catalog", - "strato/src/main/scala/com/twitter/strato/client", - "strato/src/main/scala/com/twitter/strato/thrift", - "twitter-server-internal/src/main/scala", - "ubs/common/src/main/thrift/com/twitter/ubs:broadcast-thrift-java", - "util/util-core:util-core-util", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/NamedEntityFetcher.java b/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/NamedEntityFetcher.java deleted file mode 100644 index fb5cbefeb..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/strato_fetchers/NamedEntityFetcher.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.ingester.pipeline.strato_fetchers; - -import scala.Option; - -import com.twitter.cuad.ner.plain.thriftjava.NamedEntities; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntitiesRequestOptions; -import com.twitter.cuad.ner.thriftjava.ModelFamily; -import com.twitter.cuad.ner.thriftjava.NERCalibrateRequest; -import com.twitter.cuad.thriftjava.CalibrationLevel; -import com.twitter.cuad.thriftjava.NERCandidateSource; -import com.twitter.stitch.Stitch; -import com.twitter.strato.catalog.Fetch; -import com.twitter.strato.client.Client; -import com.twitter.strato.client.Fetcher; -import com.twitter.strato.data.Conv; -import com.twitter.strato.opcontext.ServeWithin; -import com.twitter.strato.thrift.TBaseConv; -import com.twitter.util.Duration; -import com.twitter.util.Future; - -public class NamedEntityFetcher { - private static final String NAMED_ENTITY_STRATO_COLUMN = ""; - - private static final ServeWithin SERVE_WITHIN = new ServeWithin( - Duration.fromMilliseconds(100), Option.empty()); - - private static final NamedEntitiesRequestOptions REQUEST_OPTIONS = - new NamedEntitiesRequestOptions( - new NERCalibrateRequest(CalibrationLevel.HIGH_PRECISION, NERCandidateSource.NER_CRF) - .setModel_family(ModelFamily.CFB)) - .setDisplay_entity_info(false); - - private final Fetcher fetcher; - - public NamedEntityFetcher(Client stratoClient) { - fetcher = stratoClient.fetcher( - NAMED_ENTITY_STRATO_COLUMN, - true, // enables checking 
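Both audio-space fetchers above follow the same shape: construct one typed Strato `Fetcher` against a catalog column, then wrap each lookup in `Stitch.run` to obtain a Finagle `Future`. A minimal, hedged sketch of a call site using the two fetchers for one space; the space id is a placeholder, and the generic parameters are written out here by assumption, since the stripped diff omits them:

```java
// Hypothetical call site; "space-id-placeholder" stands in for a real audio space id, and the
// generic types below are assumed (the diff above drops angle-bracketed parameters).
AudioSpaceCoreFetcher coreFetcher = new AudioSpaceCoreFetcher(stratoClient);
AudioSpaceParticipantsFetcher participantsFetcher = new AudioSpaceParticipantsFetcher(stratoClient);

Future<Fetch.Result<AudioSpace>> space = coreFetcher.fetch("space-id-placeholder");
Future<Fetch.Result<Participants>> participants = participantsFetcher.fetch("space-id-placeholder");
// The two lookups run as independent Stitch queries; a caller can join the futures before
// attaching the space metadata to the tweet being indexed.
```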
types against catalog - Conv.longConv(), - TBaseConv.forClass(NamedEntitiesRequestOptions.class), - TBaseConv.forClass(NamedEntities.class)).serveWithin(SERVE_WITHIN); - } - - public Future> fetch(long tweetId) { - return Stitch.run(fetcher.fetch(tweetId, REQUEST_OPTIONS)); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/AsyncPinkUrlsResolver.java b/src/java/com/twitter/search/ingester/pipeline/twitter/AsyncPinkUrlsResolver.java deleted file mode 100644 index 0b1ae2187..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/AsyncPinkUrlsResolver.java +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Collection; -import java.util.List; -import java.util.Map; - -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Maps; - -import com.twitter.pink_floyd.thrift.ClientIdentifier; -import com.twitter.pink_floyd.thrift.Mask; -import com.twitter.pink_floyd.thrift.Storer; -import com.twitter.pink_floyd.thrift.UrlData; -import com.twitter.pink_floyd.thrift.UrlReadRequest; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * Resolve compressed URL via Pink - */ -public class AsyncPinkUrlsResolver { - private final Storer.ServiceIface storerClient; - private final ClientIdentifier pinkClientId; - private final Mask requestMask; - - // Use ServerSet to construct a metadata store client - public AsyncPinkUrlsResolver(Storer.ServiceIface storerClient, String pinkClientId) { - this.storerClient = storerClient; - this.pinkClientId = ClientIdentifier.valueOf(pinkClientId); - - requestMask = new Mask(); - requestMask.setResolution(true); - requestMask.setHtmlBasics(true); - requestMask.setUrlDirectInfo(true); - } - - /** - * resolve urls calling pink asynchronously - * @param urls urls to resolve - * @return Future map of resolved urls - */ - public Future> resolveUrls( - Collection urls) { - if (urls == null || urls.size() == 0) { - Future.value(Maps.newHashMap()); - } - - List urlsList = ImmutableList.copyOf(urls); - - UrlReadRequest request = new UrlReadRequest(); - request.setUrls(urlsList); - request.setClientId(pinkClientId); - request.setMask(requestMask); - - return storerClient.read(request).map(Function.func( - response -> { - Map resultMap = Maps.newHashMap(); - for (UrlData urlData : response.getData()) { - if (ResolveCompressedUrlsUtils.isResolved(urlData)) { - resultMap.put(urlData.url, ResolveCompressedUrlsUtils.getUrlInfo(urlData)); - } - } - return resultMap; - } - )); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/BUILD b/src/java/com/twitter/search/ingester/pipeline/twitter/BUILD deleted file mode 100644 index 5fd578ba8..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/BUILD +++ /dev/null @@ -1,74 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/commons-io", - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/thrift:libthrift", - "cuad/projects/ner/client/src/main/scala/com/twitter/cuad/ner/client", - "cuad/projects/ner/thrift/src/main/thrift:thrift-java", - "decider/src/main/scala", - "eventbus/client/src/main/scala/com/twitter/eventbus/client", - "finagle/finagle-core/src/main", - "finagle/finagle-thriftmux/src/main/scala", - 
"pink-floyd/pink-common/src/main/java/com/twitter/spiderduck/common", - "scrooge/scrooge-core", - "scrooge/scrooge-serializer/src/main/scala", - "servo/util/src/main/scala", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/metastore/client_v2", - "src/java/com/twitter/search/common/config", - "src/java/com/twitter/search/common/converter/earlybird", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/logging", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning:timeslice-manager", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance:classifiers", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/relevance:scorers", - "src/java/com/twitter/search/common/relevance:text", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/model/engagements", - "src/java/com/twitter/search/ingester/pipeline/strato_fetchers", - "src/java/com/twitter/search/ingester/pipeline/twitter/filters", - "src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/java/com/twitter/search/ingester/pipeline/wire", - "src/java/org/apache/commons/pipeline", - "src/thrift/com/twitter/expandodo:cards-java", - "src/thrift/com/twitter/gizmoduck:thrift-java", - "src/thrift/com/twitter/pink-floyd/thrift:derivatives-java", - "src/thrift/com/twitter/pink-floyd/thrift:thrift-java", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/common/debug:debug-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/timelineservice/server/internal:thrift-java", - "src/thrift/com/twitter/tweetypie:events-java", - "src/thrift/com/twitter/tweetypie:events-scala", - "src/thrift/com/twitter/tweetypie:service-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - "stitch/stitch-core", - "storage/clients/manhattan/client/src/main/scala", - "strato/src/main/scala/com/twitter/strato/catalog", - "ubs/common/src/main/thrift/com/twitter/ubs:broadcast-thrift-java", - "util/util-core:scala", - "util/util-core/src/main/java", - "util/util-function/src/main/java", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/CollectComparableObjectsStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/CollectComparableObjectsStage.java deleted file mode 100644 index f8d98723f..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/CollectComparableObjectsStage.java +++ /dev/null @@ -1,176 +0,0 @@ -/** - * © Copyright 2008, Summize, Inc. All rights reserved. 
- */ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Collections; -import java.util.NavigableSet; -import java.util.TreeSet; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicLong; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.debug.DebugEventUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchTimerStats; - -/** - * Collect incoming objects into batches of the configured size and then - * emit the Collection of objects. Internally uses a TreeSet - * to remove duplicates. Incoming objects MUST implement the Comparable - * interface. - */ -@ConsumedTypes(Comparable.class) -@ProducedTypes(NavigableSet.class) -public class CollectComparableObjectsStage extends TwitterBaseStage { - private static final Logger LOG = LoggerFactory.getLogger(CollectComparableObjectsStage.class); - - // Batch size of the collections we are emitting. - private int batchSize = -1; - - // Top tweets sorts the tweets in reverse order. - private Boolean reverseOrder = false; - - // Batch being constructed. - private TreeSet currentCollection = null; - - // Timestamp (ms) of last batch emission. - private final AtomicLong lastEmitTimeMillis = new AtomicLong(-1); - // If set, will emit a batch (only upon arrival of a new element), if time since last emit has - // exceeded this threshold. - private long emitAfterMillis = -1; - - private SearchCounter sizeBasedEmitCount; - private SearchCounter timeBasedEmitCount; - private SearchCounter sizeAndTimeBasedEmitCount; - private SearchTimerStats batchEmitTimeStats; - - @Override - protected void initStats() { - super.initStats(); - - SearchCustomGauge.export(getStageNamePrefix() + "_last_emit_time", - () -> lastEmitTimeMillis.get()); - - sizeBasedEmitCount = SearchCounter.export(getStageNamePrefix() + "_size_based_emit_count"); - timeBasedEmitCount = SearchCounter.export(getStageNamePrefix() + "_time_based_emit_count"); - sizeAndTimeBasedEmitCount = SearchCounter.export( - getStageNamePrefix() + "_size_and_time_based_emit_count"); - - batchEmitTimeStats = SearchTimerStats.export( - getStageNamePrefix() + "_batch_emit_time", - TimeUnit.MILLISECONDS, - false, // no cpu timers - true); // with percentiles - } - - @Override - protected void doInnerPreprocess() throws StageException { - // We have to initialize this stat here, because initStats() is called before - // doInnerPreprocess(), so at that point the 'clock' is not set yet. - SearchCustomGauge.export(getStageNamePrefix() + "_millis_since_last_emit", - () -> clock.nowMillis() - lastEmitTimeMillis.get()); - - currentCollection = newBatchCollection(); - if (batchSize <= 0) { - throw new StageException(this, "Must set the batchSize parameter to a value >0"); - } - } - - private TreeSet newBatchCollection() { - return new TreeSet<>(reverseOrder ? 
Collections.reverseOrder() : null); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!Comparable.class.isAssignableFrom(obj.getClass())) { - throw new StageException( - this, "Attempt to add a non-comparable object to a sorted collection"); - } - - currentCollection.add(obj); - if (shouldEmit()) { - // We want to trace here when we actually emit the batch, as tweets sit in this stage until - // a batch is full, and we want to see how long they actually stick around. - DebugEventUtil.addDebugEventToCollection( - currentCollection, "CollectComparableObjectsStage.outgoing", clock.nowMillis()); - emitAndCount(currentCollection); - updateLastEmitTime(); - - currentCollection = newBatchCollection(); - } - } - - private boolean shouldEmit() { - if (lastEmitTimeMillis.get() < 0) { - // Initialize lastEmit at the first tweet seen by this stage. - lastEmitTimeMillis.set(clock.nowMillis()); - } - - final boolean sizeBasedEmit = currentCollection.size() >= batchSize; - final boolean timeBasedEmit = - emitAfterMillis > 0 && lastEmitTimeMillis.get() + emitAfterMillis <= clock.nowMillis(); - - if (sizeBasedEmit && timeBasedEmit) { - sizeAndTimeBasedEmitCount.increment(); - return true; - } else if (sizeBasedEmit) { - sizeBasedEmitCount.increment(); - return true; - } else if (timeBasedEmit) { - timeBasedEmitCount.increment(); - return true; - } else { - return false; - } - } - - @Override - public void innerPostprocess() throws StageException { - if (!currentCollection.isEmpty()) { - emitAndCount(currentCollection); - updateLastEmitTime(); - currentCollection = newBatchCollection(); - } - } - - private void updateLastEmitTime() { - long currentEmitTime = clock.nowMillis(); - long previousEmitTime = lastEmitTimeMillis.getAndSet(currentEmitTime); - - // Also stat how long each emit takes. 
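    // An illustrative reading of this timer, with example values that are not from the original
    // configuration: if batchSize is 1000 and emitAfterMillis is 30_000, a slow stream trips the
    // time-based emit and the recorded gap is at least ~30,000ms (the check only runs when a new
    // element arrives), while a fast stream trips the size-based emit and the gap is simply how
    // long it took to collect 1,000 comparable objects. In either case the value recorded below is
    // currentEmitTime minus the previous emit time, i.e. the spacing between consecutive emits,
    // not the cost of emitting a batch.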
- batchEmitTimeStats.timerIncrement(currentEmitTime - previousEmitTime); - } - - public void setBatchSize(Integer size) { - LOG.info("Updating all CollectComparableObjectsStage batchSize to {}.", size); - this.batchSize = size; - } - - public Boolean getReverseOrder() { - return reverseOrder; - } - - public void setReverseOrder(Boolean reverseOrder) { - this.reverseOrder = reverseOrder; - } - - public void setEmitAfterMillis(long emitAfterMillis) { - LOG.info("Setting emitAfterMillis to {}.", emitAfterMillis); - this.emitAfterMillis = emitAfterMillis; - } - - public long getSizeBasedEmitCount() { - return sizeBasedEmitCount.get(); - } - - public long getTimeBasedEmitCount() { - return timeBasedEmitCount.get(); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ComputeTweetSignatureStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ComputeTweetSignatureStage.java deleted file mode 100644 index 960cb6f86..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ComputeTweetSignatureStage.java +++ /dev/null @@ -1,38 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.common.relevance.classifiers.TweetQualityFeatureExtractor; -import com.twitter.search.ingester.model.IngesterTwitterMessage; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class ComputeTweetSignatureStage extends TwitterBaseStage - { - private final TweetQualityFeatureExtractor tweetSignatureExtractor = - new TweetQualityFeatureExtractor(); - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not a TwitterMessage instance: " + obj); - } - - IngesterTwitterMessage message = IngesterTwitterMessage.class.cast(obj); - extract(message); - emitAndCount(message); - } - - private void extract(IngesterTwitterMessage message) { - tweetSignatureExtractor.extractTweetTextFeatures(message); - } - - @Override - protected IngesterTwitterMessage innerRunStageV2(IngesterTwitterMessage message) { - extract(message); - return message; - } -} - diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertDelayedMessageToThriftStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertDelayedMessageToThriftStage.java deleted file mode 100644 index 9a1d61bfa..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertDelayedMessageToThriftStage.java +++ /dev/null @@ -1,95 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.List; - -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.converter.earlybird.DelayedIndexingConverter; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdSchemaCreateTool; -import 
com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.search.ingester.model.IngesterTwitterMessage; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducedTypes(IngesterThriftVersionedEvents.class) -public class ConvertDelayedMessageToThriftStage extends TwitterBaseStage - { - private List penguinVersionList; - private FieldStatExporter fieldStatExporter; - private DelayedIndexingConverter messageConverter; - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - Schema schema; - try { - schema = EarlybirdSchemaCreateTool.buildSchema(Preconditions.checkNotNull(earlybirdCluster)); - } catch (Schema.SchemaValidationException e) { - throw new StageException(this, e); - } - - penguinVersionList = wireModule.getPenguinVersions(); - messageConverter = new DelayedIndexingConverter(schema, decider); - fieldStatExporter = new FieldStatExporter("unsorted_urls", schema, penguinVersionList); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not an IngesterTwitterMessage instance: " + obj); - } - - penguinVersionList = wireModule.getCurrentlyEnabledPenguinVersions(); - fieldStatExporter.updatePenguinVersions(penguinVersionList); - - IngesterTwitterMessage message = IngesterTwitterMessage.class.cast(obj); - for (IngesterThriftVersionedEvents events : buildVersionedEvents(message)) { - fieldStatExporter.addFieldStats(events); - emitAndCount(events); - } - } - - /** - * Method that converts all URL and card related fields and features of a TwitterMessage to a - * ThriftVersionedEvents instance. - * - * @param twitterMessage An IngesterThriftVersionedEvents instance to be converted. - * @return The corresponding ThriftVersionedEvents instance. - */ - private List buildVersionedEvents( - IngesterTwitterMessage twitterMessage) { - List versionedEvents = - messageConverter.convertMessageToOutOfOrderAppendAndFeatureUpdate( - twitterMessage, penguinVersionList); - Preconditions.checkArgument( - versionedEvents.size() == 2, - "DelayedIndexingConverter produced an incorrect number of ThriftVersionedEvents."); - return Lists.newArrayList( - toIngesterThriftVersionedEvents(versionedEvents.get(0), twitterMessage), - toIngesterThriftVersionedEvents(versionedEvents.get(1), twitterMessage)); - } - - private IngesterThriftVersionedEvents toIngesterThriftVersionedEvents( - ThriftVersionedEvents versionedEvents, IngesterTwitterMessage twitterMessage) { - // We don't want to propagate the same DebugEvents instance to multiple - // IngesterThriftVersionedEvents instances, because future stages might want to add new events - // to this list for multiple events at the same time, which would result in a - // ConcurrentModificationException. So we need to create a DebugEvents deep copy. 
- IngesterThriftVersionedEvents ingesterThriftVersionedEvents = - new IngesterThriftVersionedEvents(twitterMessage.getUserId()); - ingesterThriftVersionedEvents.setDarkWrite(false); - ingesterThriftVersionedEvents.setId(twitterMessage.getTweetId()); - ingesterThriftVersionedEvents.setVersionedEvents(versionedEvents.getVersionedEvents()); - ingesterThriftVersionedEvents.setDebugEvents(twitterMessage.getDebugEvents().deepCopy()); - return ingesterThriftVersionedEvents; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertMessageToThriftStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertMessageToThriftStage.java deleted file mode 100644 index 9b4fb6fd9..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertMessageToThriftStage.java +++ /dev/null @@ -1,117 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.io.IOException; -import java.util.List; -import java.util.Optional; - -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.converter.earlybird.BasicIndexingConverter; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdSchemaCreateTool; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.search.ingester.model.IngesterTwitterMessage; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class ConvertMessageToThriftStage extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(ConvertMessageToThriftStage.class); - - private List penguinVersionList; - private String thriftVersionedEventsBranchName; - private FieldStatExporter fieldStatExporter; - private BasicIndexingConverter messageConverter; - - private SearchCounter twitterMessageToTveErrorCount; - - @Override - public void initStats() { - super.initStats(); - twitterMessageToTveErrorCount = SearchCounter.export( - getStageNamePrefix() + "_ingester_convert_twitter_message_to_tve_error_count"); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - Schema schema; - try { - schema = EarlybirdSchemaCreateTool.buildSchema(Preconditions.checkNotNull(earlybirdCluster)); - } catch (Schema.SchemaValidationException e) { - throw new StageException(this, e); - } - - penguinVersionList = wireModule.getPenguinVersions(); - Preconditions.checkState(StringUtils.isNotBlank(thriftVersionedEventsBranchName)); - messageConverter = new BasicIndexingConverter(schema, earlybirdCluster); - fieldStatExporter = new FieldStatExporter("unsorted_tweets", schema, penguinVersionList); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not an IngesterTwitterMessage instance: " + obj); - } - - penguinVersionList = wireModule.getCurrentlyEnabledPenguinVersions(); - 
fieldStatExporter.updatePenguinVersions(penguinVersionList); - - IngesterTwitterMessage message = IngesterTwitterMessage.class.cast(obj); - - Optional maybeEvents = buildVersionedEvents(message); - if (maybeEvents.isPresent()) { - IngesterThriftVersionedEvents events = maybeEvents.get(); - fieldStatExporter.addFieldStats(events); - emitToBranchAndCount(thriftVersionedEventsBranchName, events); - } - - emitAndCount(message); - } - - /** - * Method that converts a TwitterMessage to a ThriftVersionedEvents. - * - * @param twitterMessage An IngesterThriftVersionedEvents instance to be converted. - * @return The corresponding ThriftVersionedEvents. - */ - private Optional buildVersionedEvents( - IngesterTwitterMessage twitterMessage) { - IngesterThriftVersionedEvents ingesterEvents = - new IngesterThriftVersionedEvents(twitterMessage.getUserId()); - ingesterEvents.setDarkWrite(false); - ingesterEvents.setId(twitterMessage.getTweetId()); - - // We will emit both the original TwitterMessage, and the ThriftVersionedEvents instance, so we - // need to make sure they have separate DebugEvents copies. - ingesterEvents.setDebugEvents(twitterMessage.getDebugEvents().deepCopy()); - - try { - ThriftVersionedEvents versionedEvents = - messageConverter.convertMessageToThrift(twitterMessage, true, penguinVersionList); - ingesterEvents.setVersionedEvents(versionedEvents.getVersionedEvents()); - return Optional.of(ingesterEvents); - } catch (IOException e) { - LOG.error("Failed to convert tweet " + twitterMessage.getTweetId() + " from TwitterMessage " - + "to ThriftVersionedEvents for Penguin versions " + penguinVersionList, - e); - twitterMessageToTveErrorCount.increment(); - } - return Optional.empty(); - } - - public void setThriftVersionedEventsBranchName(String thriftVersionedEventsBranchName) { - this.thriftVersionedEventsBranchName = thriftVersionedEventsBranchName; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertToThriftVersionedEventsStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertToThriftVersionedEventsStage.java deleted file mode 100644 index a8b52418f..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ConvertToThriftVersionedEventsStage.java +++ /dev/null @@ -1,83 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import javax.naming.NamingException; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducedTypes(ThriftVersionedEvents.class) -public class ConvertToThriftVersionedEventsStage extends TwitterBaseStage - { - private ThriftVersionedEventsConverter converter; - - @Override - public void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - innerSetup(); - } - - @Override - protected void innerSetup() throws NamingException { - converter = new ThriftVersionedEventsConverter(wireModule.getPenguinVersions()); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof 
IngesterTwitterMessage)) { - throw new StageException(this, "Object is not an IngesterTwitterMessage: " + obj); - } - - IngesterTwitterMessage ingesterTwitterMessage = (IngesterTwitterMessage) obj; - IngesterThriftVersionedEvents maybeEvents = tryToConvert(ingesterTwitterMessage); - - if (maybeEvents == null) { - throw new StageException( - this, "Object is not a retweet or a reply: " + ingesterTwitterMessage); - } - - emitAndCount(maybeEvents); - - } - - @Override - protected IngesterThriftVersionedEvents innerRunStageV2(IngesterTwitterMessage message) { - IngesterThriftVersionedEvents maybeEvents = tryToConvert(message); - - if (maybeEvents == null) { - throw new PipelineStageRuntimeException("Object is not a retweet or reply, does not have to" - + " pass to next stage"); - } - - return maybeEvents; - } - - private IngesterThriftVersionedEvents tryToConvert(IngesterTwitterMessage message) { - converter.updatePenguinVersions(wireModule.getCurrentlyEnabledPenguinVersions()); - - if (!message.isRetweet() && !message.isReplyToTweet()) { - return null; - } - - if (message.isRetweet()) { - return converter.toOutOfOrderAppend( - message.getRetweetMessage().getSharedId(), - EarlybirdFieldConstants.EarlybirdFieldConstant.RETWEETED_BY_USER_ID, - message.getUserId(), - message.getDebugEvents().deepCopy()); - } - - return converter.toOutOfOrderAppend( - message.getInReplyToStatusId().get(), - EarlybirdFieldConstants.EarlybirdFieldConstant.REPLIED_TO_BY_USER_ID, - message.getUserId(), - message.getDebugEvents().deepCopy()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/EventBusReaderStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/EventBusReaderStage.java deleted file mode 100644 index b42828ce5..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/EventBusReaderStage.java +++ /dev/null @@ -1,185 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.concurrent.TimeUnit; - -import javax.naming.NamingException; - -import scala.runtime.BoxedUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.commons.pipeline.Pipeline; -import org.apache.commons.pipeline.StageDriver; -import org.apache.thrift.TBase; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.eventbus.client.EventBusSubscriber; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.ingester.model.PromiseContainer; -import com.twitter.search.ingester.pipeline.util.PipelineUtil; -import com.twitter.util.Await; -import com.twitter.util.Function; -import com.twitter.util.Future; -import com.twitter.util.Promise; - -public abstract class EventBusReaderStage> extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(EventBusReaderStage.class); - - private static final int DECIDER_POLL_INTERVAL_IN_SECS = 5; - - private SearchCounter totalEventsCount; - - private String environment = null; - private String eventBusReaderEnabledDeciderKey; - - private StageDriver stageDriver; - - private EventBusSubscriber eventBusSubscriber = null; - - // XML configuration options - private String eventBusSubscriberId; - private int maxConcurrentEvents; - private SearchDecider searchDecider; - - protected EventBusReaderStage() { - } - - @Override - protected void initStats() { - super.initStats(); - totalEventsCount = SearchCounter.export(getStageNamePrefix() 
+ "_total_events_count"); - } - - @Override - protected void doInnerPreprocess() throws NamingException { - searchDecider = new SearchDecider(decider); - - if (stageDriver == null) { - stageDriver = ((Pipeline) stageContext).getStageDriver(this); - } - - eventBusReaderEnabledDeciderKey = String.format( - getDeciderKeyTemplate(), - earlybirdCluster.getNameForStats(), - environment); - - PipelineUtil.feedStartObjectToStage(this); - } - - protected abstract PromiseContainer eventAndPromiseToContainer( - T incomingEvent, - Promise p); - - private Future processEvent(T incomingEvent) { - Promise p = new Promise<>(); - PromiseContainer promiseContainer = eventAndPromiseToContainer(incomingEvent, p); - totalEventsCount.increment(); - emitAndCount(promiseContainer); - return p; - } - - private void closeEventBusSubscriber() throws Exception { - if (eventBusSubscriber != null) { - Await.result(eventBusSubscriber.close()); - eventBusSubscriber = null; - } - } - - protected abstract Class getThriftClass(); - - protected abstract String getDeciderKeyTemplate(); - - private void startUpEventBusSubscriber() { - // Start reading from eventbus if it is null - if (eventBusSubscriber == null) { - //noinspection unchecked - eventBusSubscriber = wireModule.createEventBusSubscriber( - Function.func(this::processEvent), - getThriftClass(), - eventBusSubscriberId, - maxConcurrentEvents); - - } - Preconditions.checkNotNull(eventBusSubscriber); - } - - /** - * This is only kicked off once with a start object which is ignored. Then we loop - * checking the decider. If it turns off then we close the eventbus reader, - * and if it turns on, then we create a new eventbus reader. - * - * @param obj ignored - */ - @Override - public void innerProcess(Object obj) { - boolean interrupted = false; - - Preconditions.checkNotNull("The environment is not set.", environment); - - int previousEventBusReaderEnabledAvailability = 0; - while (stageDriver.getState() == StageDriver.State.RUNNING) { - int eventBusReaderEnabledAvailability = - searchDecider.getAvailability(eventBusReaderEnabledDeciderKey); - if (previousEventBusReaderEnabledAvailability != eventBusReaderEnabledAvailability) { - LOG.info("EventBusReaderStage availability decider changed from {} to {}.", - previousEventBusReaderEnabledAvailability, eventBusReaderEnabledAvailability); - - // If the availability is 0 then disable the reader, otherwise read from EventBus. - if (eventBusReaderEnabledAvailability == 0) { - try { - closeEventBusSubscriber(); - } catch (Exception e) { - LOG.warn("Exception while closing eventbus subscriber", e); - } - } else { - startUpEventBusSubscriber(); - } - } - previousEventBusReaderEnabledAvailability = eventBusReaderEnabledAvailability; - - try { - clock.waitFor(TimeUnit.SECONDS.toMillis(DECIDER_POLL_INTERVAL_IN_SECS)); - } catch (InterruptedException e) { - interrupted = true; - } - } - LOG.info("StageDriver is not RUNNING anymore, closing EventBus subscriber"); - try { - closeEventBusSubscriber(); - } catch (InterruptedException e) { - interrupted = true; - } catch (Exception e) { - LOG.warn("Exception while closing eventbus subscriber", e); - } finally { - if (interrupted) { - Thread.currentThread().interrupt(); - } - } - } - - // This is needed to set the value from XML config. 
- public void setEventBusSubscriberId(String eventBusSubscriberId) { - this.eventBusSubscriberId = eventBusSubscriberId; - LOG.info("EventBusReaderStage with eventBusSubscriberId: {}", eventBusSubscriberId); - } - - // This is needed to set the value from XML config. - public void setEnvironment(String environment) { - this.environment = environment; - LOG.info("Ingester is running in {}", environment); - } - - // This is needed to set the value from XML config. - public void setMaxConcurrentEvents(int maxConcurrentEvents) { - this.maxConcurrentEvents = maxConcurrentEvents; - } - - @VisibleForTesting - public void setStageDriver(StageDriver stageDriver) { - this.stageDriver = stageDriver; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/FieldStatExporter.java b/src/java/com/twitter/search/ingester/pipeline/twitter/FieldStatExporter.java deleted file mode 100644 index 10ea03c29..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/FieldStatExporter.java +++ /dev/null @@ -1,150 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.List; -import java.util.Set; - -import com.google.common.base.Preconditions; -import com.google.common.collect.HashBasedTable; -import com.google.common.collect.Sets; -import com.google.common.collect.Table; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.schema.SchemaBuilder; -import com.twitter.search.common.schema.base.Schema; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; -import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeaturesUtil; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; - -/** - * This class exports counts of fields that are present on processed tweets. It is used to ensure - * that we are not missing important fields. It is not threadsafe. 
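 * As a concrete illustration (the Penguin byte value and field name are examples, not values
 * taken from this file): with the "unsorted_tweets" prefix passed in by
 * ConvertMessageToThriftStage, Penguin version byte 6, and a schema field named "text",
 * STAT_FORMAT produces a counter named
 * "unsorted_tweets_penguin_6_documents_with_field_text", while a field id such as 999 that is
 * missing from the schema falls back to UNKNOWN_FIELD as
 * "unsorted_tweets_penguin_6_documents_with_unknown_field_999".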
- */ -public class FieldStatExporter { - private static final String STAT_FORMAT = "%s_penguin_%d_documents_with_field_%s"; - private static final String UNKNOWN_FIELD = "%s_penguin_%d_documents_with_unknown_field_%d"; - private final String statPrefix; - private final Schema schema; - private final Table fieldCounters - = HashBasedTable.create(); - private final Set encodedTweetFeaturesFields; - private final Set extendedEncodedTweetFeaturesFields; - - private List penguinVersions; - - FieldStatExporter(String statPrefix, Schema schema, List penguinVersions) { - this.statPrefix = statPrefix; - this.schema = schema; - this.penguinVersions = penguinVersions; - this.encodedTweetFeaturesFields = - getEncodedTweetFeaturesFields(EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD); - this.extendedEncodedTweetFeaturesFields = - getEncodedTweetFeaturesFields(EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD); - - for (PenguinVersion version : penguinVersions) { - for (Schema.FieldInfo info : schema.getFieldInfos()) { - String name = - String.format(STAT_FORMAT, statPrefix, version.getByteValue(), info.getName()); - SearchRateCounter counter = SearchRateCounter.export(name); - fieldCounters.put(version, info.getFieldId(), counter); - } - } - } - - /** - * Exports stats counting the number of fields that are present on each document. - */ - public void addFieldStats(ThriftVersionedEvents event) { - for (PenguinVersion penguinVersion : penguinVersions) { - byte version = penguinVersion.getByteValue(); - ThriftIndexingEvent indexingEvent = event.getVersionedEvents().get(version); - Preconditions.checkNotNull(indexingEvent); - - // We only want to count each field once per tweet. - Set seenFields = Sets.newHashSet(); - for (ThriftField field : indexingEvent.getDocument().getFields()) { - int fieldId = field.getFieldConfigId(); - if (seenFields.add(fieldId)) { - if (fieldId == EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD.getFieldId()) { - exportEncodedFeaturesStats(EarlybirdFieldConstant.ENCODED_TWEET_FEATURES_FIELD, - encodedTweetFeaturesFields, - penguinVersion, - field); - } else if (fieldId - == EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD.getFieldId()) { - exportEncodedFeaturesStats(EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD, - extendedEncodedTweetFeaturesFields, - penguinVersion, - field); - } else if (isFeatureField(field)) { - updateCounterForFeatureField( - field.getFieldConfigId(), field.getFieldData().getIntValue(), penguinVersion); - } else { - SearchRateCounter counter = fieldCounters.get(penguinVersion, fieldId); - if (counter == null) { - counter = SearchRateCounter.export( - String.format(UNKNOWN_FIELD, statPrefix, version, fieldId)); - fieldCounters.put(penguinVersion, fieldId, counter); - } - counter.increment(); - } - } - } - } - } - - private boolean isFeatureField(ThriftField field) { - String fieldName = - EarlybirdFieldConstants.getFieldConstant(field.getFieldConfigId()).getFieldName(); - return fieldName.startsWith(EarlybirdFieldConstants.ENCODED_TWEET_FEATURES_FIELD_NAME - + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR) - || fieldName.startsWith(EarlybirdFieldConstants.EXTENDED_ENCODED_TWEET_FEATURES_FIELD_NAME - + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR); - } - - private Set getEncodedTweetFeaturesFields( - EarlybirdFieldConstant featuresField) { - Set schemaFeatureFields = Sets.newHashSet(); - String baseFieldNamePrefix = - featuresField.getFieldName() + SchemaBuilder.CSF_VIEW_NAME_SEPARATOR; - for (EarlybirdFieldConstant field : 
EarlybirdFieldConstant.values()) { - if (field.getFieldName().startsWith(baseFieldNamePrefix)) { - schemaFeatureFields.add(field); - } - } - return schemaFeatureFields; - } - - private void exportEncodedFeaturesStats(EarlybirdFieldConstant featuresField, - Set schemaFeatureFields, - PenguinVersion penguinVersion, - ThriftField thriftField) { - byte[] encodedFeaturesBytes = thriftField.getFieldData().getBytesValue(); - EarlybirdEncodedFeatures encodedTweetFeatures = EarlybirdEncodedFeaturesUtil.fromBytes( - schema.getSchemaSnapshot(), featuresField, encodedFeaturesBytes, 0); - for (EarlybirdFieldConstant field : schemaFeatureFields) { - updateCounterForFeatureField( - field.getFieldId(), encodedTweetFeatures.getFeatureValue(field), penguinVersion); - } - } - - private void updateCounterForFeatureField(int fieldId, int value, PenguinVersion penguinVersion) { - if (value != 0) { - SearchRateCounter counter = fieldCounters.get(penguinVersion, fieldId); - if (counter == null) { - counter = SearchRateCounter.export( - String.format(UNKNOWN_FIELD, statPrefix, penguinVersion, fieldId)); - fieldCounters.put(penguinVersion, fieldId, counter); - } - counter.increment(); - } - } - - public void updatePenguinVersions(List updatedPenguinVersions) { - penguinVersions = updatedPenguinVersions; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterEventsBySafetyTypeStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/FilterEventsBySafetyTypeStage.java deleted file mode 100644 index 2f8ba9928..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterEventsBySafetyTypeStage.java +++ /dev/null @@ -1,279 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Map; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.TimeUnit; -import javax.annotation.Nonnull; - -import com.google.common.annotations.VisibleForTesting; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchDelayStats; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.ingester.model.IngesterTweetEvent; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; -import com.twitter.tweetypie.thriftjava.Tweet; -import com.twitter.tweetypie.thriftjava.TweetCreateEvent; -import com.twitter.tweetypie.thriftjava.TweetEvent; -import com.twitter.tweetypie.thriftjava.TweetEventData; -import com.twitter.tweetypie.thriftjava.TweetEventFlags; - -/** - * Only lets through the create events that match the specified safety type. - * Also lets through all delete events. 
- */ -@ConsumedTypes(IngesterTweetEvent.class) -@ProducedTypes(IngesterTweetEvent.class) -public class FilterEventsBySafetyTypeStage extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(FilterEventsBySafetyTypeStage.class); - - private SearchCounter totalEventsCount; - private SearchCounter createEventsCount; - private SearchCounter createPublicEventsCount; - private SearchCounter createProtectedEventsCount; - private SearchCounter createRestrictedEventsCount; - private SearchCounter createInvalidSafetyTypeCount; - private SearchCounter deleteEventsCount; - private SearchCounter deletePublicEventsCount; - private SearchCounter deleteProtectedEventsCount; - private SearchCounter deleteRestrictedEventsCount; - private SearchCounter deleteInvalidSafetyTypeCount; - private SearchCounter otherEventsCount; - - private SearchDelayStats tweetCreateDelayStats; - - private long tweetCreateLatencyLogThresholdMillis = -1; - private SafetyType safetyType = null; - private Map> invalidSafetyTypeByEventTypeStatMap = - new ConcurrentHashMap<>(); - - public FilterEventsBySafetyTypeStage() { } - - public FilterEventsBySafetyTypeStage(String safetyType, long tweetCreateLatencyThresholdMillis) { - setSafetyType(safetyType); - this.tweetCreateLatencyLogThresholdMillis = tweetCreateLatencyThresholdMillis; - } - - /** - * To be called by XML config. Can be made private after we delete ACP code. - */ - public void setSafetyType(@Nonnull String safetyTypeString) { - this.safetyType = SafetyType.valueOf(safetyTypeString); - if (this.safetyType == SafetyType.INVALID) { - throw new UnsupportedOperationException( - "Can't create a stage that permits 'INVALID' safetytypes"); - } - } - - @Override - protected void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - totalEventsCount = SearchCounter.export(getStageNamePrefix() + "_total_events_count"); - createEventsCount = SearchCounter.export(getStageNamePrefix() + "_create_events_count"); - createPublicEventsCount = - SearchCounter.export(getStageNamePrefix() + "_create_public_events_count"); - createProtectedEventsCount = - SearchCounter.export(getStageNamePrefix() + "_create_protected_events_count"); - createRestrictedEventsCount = - SearchCounter.export(getStageNamePrefix() + "_create_restricted_events_count"); - createInvalidSafetyTypeCount = - SearchCounter.export(getStageNamePrefix() + "_create_missing_or_unknown_safetytype"); - deleteEventsCount = - SearchCounter.export(getStageNamePrefix() + "_delete_events_count"); - deletePublicEventsCount = - SearchCounter.export(getStageNamePrefix() + "_delete_public_events_count"); - deleteProtectedEventsCount = - SearchCounter.export(getStageNamePrefix() + "_delete_protected_events_count"); - deleteRestrictedEventsCount = - SearchCounter.export(getStageNamePrefix() + "_delete_restricted_events_count"); - deleteInvalidSafetyTypeCount = - SearchCounter.export(getStageNamePrefix() + "_delete_missing_or_unknown_safetytype"); - otherEventsCount = - SearchCounter.export(getStageNamePrefix() + "_other_events_count"); - - tweetCreateDelayStats = SearchDelayStats.export( - "create_histogram_" + getStageNamePrefix(), 90, - TimeUnit.SECONDS, TimeUnit.MILLISECONDS); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (obj instanceof IngesterTweetEvent) { - IngesterTweetEvent tweetEvent = (IngesterTweetEvent) obj; - if (tryToRecordCreateLatency(tweetEvent)) { - emitAndCount(tweetEvent); - } - } else { - 
throw new StageException(this, "Object is not a IngesterTweetEvent: " + obj); - } - } - - @Override - protected IngesterTweetEvent innerRunStageV2(IngesterTweetEvent tweetEvent) { - if (!tryToRecordCreateLatency(tweetEvent)) { - throw new PipelineStageRuntimeException("Event does not have to pass to the next stage."); - } - return tweetEvent; - } - - private boolean tryToRecordCreateLatency(IngesterTweetEvent tweetEvent) { - incrementCounters(tweetEvent); - boolean shouldEmit = shouldEmit(tweetEvent); - if (shouldEmit) { - if (isCreateEvent(tweetEvent.getData())) { - recordCreateLatency(tweetEvent.getData().getTweet_create_event()); - } - } - return shouldEmit; - } - - private void incrementCounters(@Nonnull TweetEvent tweetEvent) { - totalEventsCount.increment(); - SafetyType eventSafetyType = getEventSafetyType(tweetEvent); - - if (isCreateEvent(tweetEvent.getData())) { - createEventsCount.increment(); - switch (eventSafetyType) { - case PUBLIC: - createPublicEventsCount.increment(); - break; - case PROTECTED: - createProtectedEventsCount.increment(); - break; - case RESTRICTED: - createRestrictedEventsCount.increment(); - break; - default: - createInvalidSafetyTypeCount.increment(); - incrementInvalidSafetyTypeStatMap(tweetEvent, "create"); - } - } else if (isDeleteEvent(tweetEvent.getData())) { - deleteEventsCount.increment(); - switch (eventSafetyType) { - case PUBLIC: - deletePublicEventsCount.increment(); - break; - case PROTECTED: - deleteProtectedEventsCount.increment(); - break; - case RESTRICTED: - deleteRestrictedEventsCount.increment(); - break; - default: - deleteInvalidSafetyTypeCount.increment(); - incrementInvalidSafetyTypeStatMap(tweetEvent, "delete"); - } - } else { - otherEventsCount.increment(); - } - } - - private void incrementInvalidSafetyTypeStatMap(TweetEvent tweetEvent, String eventType) { - com.twitter.tweetypie.thriftjava.SafetyType thriftSafetyType = - tweetEvent.getFlags().getSafety_type(); - String safetyTypeString = - thriftSafetyType == null ? "null" : thriftSafetyType.toString().toLowerCase(); - invalidSafetyTypeByEventTypeStatMap.putIfAbsent(eventType, new ConcurrentHashMap<>()); - SearchCounter stat = invalidSafetyTypeByEventTypeStatMap.get(eventType).computeIfAbsent( - safetyTypeString, - safetyTypeStr -> SearchCounter.export( - getStageNamePrefix() - + String.format("_%s_missing_or_unknown_safetytype_%s", - eventType, safetyTypeStr))); - stat.increment(); - } - - @VisibleForTesting - boolean shouldEmit(@Nonnull TweetEvent tweetEvent) { - // Do not emit any undelete events. 
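    // An illustrative walk-through of the decision below (the realtime_cg pairing is inferred only
    // from the "Custom logic for REALTIME_CG cluster" comment): undelete events are always dropped;
    // a stage configured with PUBLIC_OR_PROTECTED emits a create or delete event whose Tweetypie
    // safety type is PUBLIC or PRIVATE (PRIVATE maps to PROTECTED in
    // SafetyType.fromThriftSafetyType) and drops RESTRICTED or missing safety types; any other
    // configuration emits only events whose mapped safety type exactly matches the configured one.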
- if (isUndeleteEvent(tweetEvent.getData())) { - return false; - } - - SafetyType eventSafetyType = getEventSafetyType(tweetEvent); - // Custom logic for REALTIME_CG cluster - if (safetyType == SafetyType.PUBLIC_OR_PROTECTED) { - return eventSafetyType == SafetyType.PUBLIC || eventSafetyType == SafetyType.PROTECTED; - } else { - return eventSafetyType == safetyType; - } - } - - private SafetyType getEventSafetyType(@Nonnull TweetEvent tweetEvent) { - TweetEventFlags tweetEventFlags = tweetEvent.getFlags(); - return SafetyType.fromThriftSafetyType(tweetEventFlags.getSafety_type()); - } - - private boolean isCreateEvent(@Nonnull TweetEventData tweetEventData) { - return tweetEventData.isSet(TweetEventData._Fields.TWEET_CREATE_EVENT); - } - - private boolean isDeleteEvent(@Nonnull TweetEventData tweetEventData) { - return tweetEventData.isSet(TweetEventData._Fields.TWEET_DELETE_EVENT); - } - - private boolean isUndeleteEvent(@Nonnull TweetEventData tweetEventData) { - return tweetEventData.isSet(TweetEventData._Fields.TWEET_UNDELETE_EVENT); - } - - private void recordCreateLatency(TweetCreateEvent tweetCreateEvent) { - Tweet tweet = tweetCreateEvent.getTweet(); - if (tweet != null) { - long tweetCreateLatency = - clock.nowMillis() - SnowflakeIdParser.getTimestampFromTweetId(tweet.getId()); - tweetCreateDelayStats.recordLatency(tweetCreateLatency, TimeUnit.MILLISECONDS); - if (tweetCreateLatency < 0) { - LOG.warn("Received a tweet created in the future: {}", tweet); - } else if (tweetCreateLatencyLogThresholdMillis > 0 - && tweetCreateLatency > tweetCreateLatencyLogThresholdMillis) { - LOG.debug("Found late incoming tweet: {}. Create latency: {}ms. Tweet: {}", - tweet.getId(), tweetCreateLatency, tweet); - } - } - } - - public void setTweetCreateLatencyLogThresholdMillis(long tweetCreateLatencyLogThresholdMillis) { - LOG.info("Setting tweetCreateLatencyLogThresholdMillis to {}.", - tweetCreateLatencyLogThresholdMillis); - this.tweetCreateLatencyLogThresholdMillis = tweetCreateLatencyLogThresholdMillis; - } - - public enum SafetyType { - PUBLIC, - PROTECTED, - RESTRICTED, - PUBLIC_OR_PROTECTED, - INVALID; - - /** Converts a tweetypie SafetyType instance to an instance of this enum. */ - @Nonnull - public static SafetyType fromThriftSafetyType( - com.twitter.tweetypie.thriftjava.SafetyType safetyType) { - if (safetyType == null) { - return INVALID; - } - switch(safetyType) { - case PRIVATE: - return PROTECTED; - case PUBLIC: - return PUBLIC; - case RESTRICTED: - return RESTRICTED; - default: - return INVALID; - } - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterRetweetsAndRepliesStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/FilterRetweetsAndRepliesStage.java deleted file mode 100644 index 7da7178ba..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterRetweetsAndRepliesStage.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; - -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -/** - * Filters out tweets that are not retweets or replies. 
- */ -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducedTypes(IngesterTwitterMessage.class) -public class FilterRetweetsAndRepliesStage extends TwitterBaseStage - { - private static final String EMIT_RETWEET_AND_REPLY_ENGAGEMENTS_DECIDER_KEY = - "ingester_realtime_emit_retweet_and_reply_engagements"; - - private SearchRateCounter filteredRetweetsCount; - private SearchRateCounter filteredRepliesToTweetsCount; - private SearchRateCounter incomingRetweetsAndRepliesToTweetsCount; - - @Override - public void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - filteredRetweetsCount = - SearchRateCounter.export(getStageNamePrefix() + "_filtered_retweets_count"); - filteredRepliesToTweetsCount = - SearchRateCounter.export(getStageNamePrefix() + "_filtered_replies_to_tweets_count"); - incomingRetweetsAndRepliesToTweetsCount = - SearchRateCounter.export( - getStageNamePrefix() + "_incoming_retweets_and_replies_to_tweets_count"); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not an IngesterTwitterMessage: " + obj); - } - - IngesterTwitterMessage status = (IngesterTwitterMessage) obj; - if (tryToFilter(status)) { - emitAndCount(status); - } - } - - @Override - public IngesterTwitterMessage runStageV2(IngesterTwitterMessage message) { - if (!tryToFilter(message)) { - throw new PipelineStageRuntimeException("Does not have to pass to the next stage."); - } - return message; - } - - private boolean tryToFilter(IngesterTwitterMessage status) { - boolean shouldEmit = false; - if (status.isRetweet() || status.isReplyToTweet()) { - incomingRetweetsAndRepliesToTweetsCount.increment(); - if (DeciderUtil.isAvailableForRandomRecipient( - decider, EMIT_RETWEET_AND_REPLY_ENGAGEMENTS_DECIDER_KEY)) { - if (status.isRetweet()) { - filteredRetweetsCount.increment(); - } else { - filteredRepliesToTweetsCount.increment(); - } - shouldEmit = true; - } - } - return shouldEmit; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterTwitterMessageStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/FilterTwitterMessageStage.java deleted file mode 100644 index 61a52c200..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/FilterTwitterMessageStage.java +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.ingester.pipeline.twitter.filters.IngesterValidMessageFilter; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -/** - * Filter out Twitter messages meeting some filtering rule. 
- */ -@ConsumedTypes(TwitterMessage.class) -@ProducesConsumed -public class FilterTwitterMessageStage extends TwitterBaseStage - { - private IngesterValidMessageFilter filter = null; - private SearchRateCounter validMessages; - private SearchRateCounter invalidMessages; - - @Override - protected void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - validMessages = SearchRateCounter.export(getStageNamePrefix() + "_valid_messages"); - invalidMessages = SearchRateCounter.export(getStageNamePrefix() + "_filtered_messages"); - } - - @Override - protected void doInnerPreprocess() { - innerSetup(); - } - - @Override - protected void innerSetup() { - filter = new IngesterValidMessageFilter(decider); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof TwitterMessage)) { - throw new StageException(this, "Object is not a IngesterTwitterMessage: " - + obj); - } - - TwitterMessage message = (TwitterMessage) obj; - if (tryToFilter(message)) { - emitAndCount(message); - } - } - - @Override - protected TwitterMessage innerRunStageV2(TwitterMessage message) { - if (!tryToFilter(message)) { - throw new PipelineStageRuntimeException("Failed to filter, does not have to " - + "pass to the next stage"); - } - return message; - } - - private boolean tryToFilter(TwitterMessage message) { - boolean ableToFilter = false; - if (message != null && filter.accepts(message)) { - validMessages.increment(); - ableToFilter = true; - } else { - invalidMessages.increment(); - } - return ableToFilter; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/LookupUserPropertiesBatchedStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/LookupUserPropertiesBatchedStage.java deleted file mode 100644 index 9e0184d6f..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/LookupUserPropertiesBatchedStage.java +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Collection; -import javax.naming.NamingException; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.BatchedElement; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.search.ingester.pipeline.util.UserPropertiesManager; -import com.twitter.util.Future; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class LookupUserPropertiesBatchedStage extends TwitterBatchedBaseStage - { - - protected UserPropertiesManager userPropertiesManager; - - @Override - protected Class getQueueObjectType() { - return IngesterTwitterMessage.class; - } - - @Override - protected Future> innerProcessBatch(Collection> batch) { - Collection batchedElements = extractOnlyElementsFromBatch(batch); - return userPropertiesManager.populateUserProperties(batchedElements); - } - - @Override - protected boolean needsToBeBatched(IngesterTwitterMessage element) { - return true; - } - - @Override - protected IngesterTwitterMessage transform(IngesterTwitterMessage element) { - return element; - } - - @Override - public synchronized void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - commonInnerSetup(); - } - - @Override - protected void 
innerSetup() throws PipelineStageException, NamingException { - super.innerSetup(); - commonInnerSetup(); - } - - private void commonInnerSetup() throws NamingException { - userPropertiesManager = new UserPropertiesManager(wireModule.getMetastoreClient()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/NamedEntityHandler.java b/src/java/com/twitter/search/ingester/pipeline/twitter/NamedEntityHandler.java deleted file mode 100644 index 617b8183c..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/NamedEntityHandler.java +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Set; - -import scala.Option; - -import com.google.common.collect.ImmutableSet; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.cuad.ner.plain.thriftjava.NamedEntities; -import com.twitter.cuad.ner.plain.thriftjava.NamedEntity; -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.strato_fetchers.NamedEntityFetcher; -import com.twitter.search.ingester.pipeline.util.IngesterStageTimer; -import com.twitter.strato.catalog.Fetch; -import com.twitter.util.Future; - -/** - * Handles the retrieval and population of named entities in TwitterMessages performed - * by ingesters. - */ -class NamedEntityHandler { - private static final Logger LOG = LoggerFactory.getLogger(NamedEntityHandler.class); - - private static final String RETRIEVE_NAMED_ENTITIES_DECIDER_KEY = - "ingester_all_retrieve_named_entities_%s"; - - // Named entities are only extracted in English, Spanish, and Japanese - private static final Set NAMED_ENTITY_LANGUAGES = ImmutableSet.of("en", "es", "ja"); - - private final NamedEntityFetcher namedEntityFetcher; - private final Decider decider; - private final String deciderKey; - - private SearchRateCounter lookupStat; - private SearchRateCounter successStat; - private SearchRateCounter namedEntityCountStat; - private SearchRateCounter errorStat; - private SearchRateCounter emptyResponseStat; - private SearchRateCounter deciderSkippedStat; - private IngesterStageTimer retrieveNamedEntitiesTimer; - - NamedEntityHandler( - NamedEntityFetcher namedEntityFetcher, Decider decider, String statsPrefix, - String deciderSuffix) { - this.namedEntityFetcher = namedEntityFetcher; - this.decider = decider; - this.deciderKey = String.format(RETRIEVE_NAMED_ENTITIES_DECIDER_KEY, deciderSuffix); - - lookupStat = SearchRateCounter.export(statsPrefix + "_lookups"); - successStat = SearchRateCounter.export(statsPrefix + "_success"); - namedEntityCountStat = SearchRateCounter.export(statsPrefix + "_named_entity_count"); - errorStat = SearchRateCounter.export(statsPrefix + "_error"); - emptyResponseStat = SearchRateCounter.export(statsPrefix + "_empty_response"); - deciderSkippedStat = SearchRateCounter.export(statsPrefix + "_decider_skipped"); - retrieveNamedEntitiesTimer = new IngesterStageTimer(statsPrefix + "_request_timer"); - } - - Future> retrieve(IngesterTwitterMessage message) { - lookupStat.increment(); - return namedEntityFetcher.fetch(message.getTweetId()); - } - - void addEntitiesToMessage(IngesterTwitterMessage message, Fetch.Result result) { - retrieveNamedEntitiesTimer.start(); - Option response = result.v(); - if (response.isDefined()) { - successStat.increment(); - for (NamedEntity 
namedEntity : response.get().getEntities()) { - namedEntityCountStat.increment(); - message.addNamedEntity(namedEntity); - } - } else { - emptyResponseStat.increment(); - LOG.debug("Empty NERResponse for named entity query on tweet {}", message.getId()); - } - retrieveNamedEntitiesTimer.stop(); - } - - void incrementErrorCount() { - errorStat.increment(); - } - - boolean shouldRetrieve(IngesterTwitterMessage message) { - // Use decider to control retrieval of named entities. This allows us to shut off retrieval - // if it causes problems. - if (!DeciderUtil.isAvailableForRandomRecipient(decider, deciderKey)) { - deciderSkippedStat.increment(); - return false; - } - - // Named entities are only extracted in certain languages, so we can skip tweets - // in other languages - return NAMED_ENTITY_LANGUAGES.contains(message.getLanguage()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/PopulateCodedLocationsBatchedStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/PopulateCodedLocationsBatchedStage.java deleted file mode 100644 index 89bb803f5..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/PopulateCodedLocationsBatchedStage.java +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Collection; -import javax.naming.NamingException; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.BatchedElement; -import com.twitter.search.ingester.pipeline.util.ManhattanCodedLocationProvider; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.util.Future; - -/** - * Read-only stage for looking up location info and populating it onto messages. - */ -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public final class PopulateCodedLocationsBatchedStage - extends TwitterBatchedBaseStage { - private static final String GEOCODE_DATASET_NAME = "ingester_geocode_profile_location"; - - private ManhattanCodedLocationProvider manhattanCodedLocationProvider = null; - - /** - * Require lat/lon from TwitterMessage instead of lookup from coded_locations, - * do not batch sql, and simply emit messages passed in with regions populated on them - * rather than emitting to indexing queues. 
- */ - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - commonInnerSetup(); - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - super.innerSetup(); - commonInnerSetup(); - } - - private void commonInnerSetup() throws NamingException { - this.manhattanCodedLocationProvider = ManhattanCodedLocationProvider.createWithEndpoint( - wireModule.getJavaManhattanKVEndpoint(), - getStageNamePrefix(), - GEOCODE_DATASET_NAME); - } - - @Override - public void initStats() { - super.initStats(); - } - - @Override - protected Class getQueueObjectType() { - return IngesterTwitterMessage.class; - } - - @Override - protected Future> innerProcessBatch(Collection> batch) { - - Collection batchedElements = extractOnlyElementsFromBatch(batch); - return manhattanCodedLocationProvider.populateCodedLatLon(batchedElements); - } - - @Override - protected boolean needsToBeBatched(IngesterTwitterMessage message) { - return !message.hasGeoLocation() && (message.getLocation() != null) - && !message.getLocation().isEmpty(); - } - - @Override - protected IngesterTwitterMessage transform(IngesterTwitterMessage element) { - return element; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsBatchedStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsBatchedStage.java deleted file mode 100644 index 3bf2ebe7f..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsBatchedStage.java +++ /dev/null @@ -1,387 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.net.URI; -import java.net.URISyntaxException; -import java.util.Collection; -import java.util.Collections; -import java.util.HashSet; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.BatchedElement; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.search.ingester.pipeline.wire.WireModule; -import com.twitter.service.spiderduck.gen.MediaTypes; -import com.twitter.util.Duration; -import com.twitter.util.Function; -import com.twitter.util.Future; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class ResolveCompressedUrlsBatchedStage extends TwitterBatchedBaseStage - { - - private static final int PINK_REQUEST_TIMEOUT_MILLIS = 500; - private static final int PINK_REQUEST_RETRIES = 2; - private static final String PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY = "pink_requests_batch_size"; - private AsyncPinkUrlsResolver urlResolver; - 
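The batched stages in this pipeline (LookupUserPropertiesBatchedStage and PopulateCodedLocationsBatchedStage above, and the URL-resolution stage that begins just above) all specialize TwitterBatchedBaseStage through the same three hooks: a predicate that decides whether an element needs the batched backend call, an element transform, and the per-batch lookup itself. A minimal sketch of that contract follows; the class name, the base-class generic signature, and the Future.value stand-in for the real backend call are assumptions, since the base class itself is not part of this diff.

```java
import java.util.Collection;

import com.twitter.search.ingester.model.IngesterTwitterMessage;
import com.twitter.search.ingester.pipeline.util.BatchedElement;
import com.twitter.util.Future;

// Hypothetical stage, assumed to live in the same package as the stages above.
public class ExampleBatchedLookupStage
    extends TwitterBatchedBaseStage<IngesterTwitterMessage, IngesterTwitterMessage> {

  @Override
  protected Class<IngesterTwitterMessage> getQueueObjectType() {
    return IngesterTwitterMessage.class;
  }

  // Gate: only elements that still need the backend lookup are queued; everything
  // else bypasses the batch and is passed straight to the next stage.
  @Override
  protected boolean needsToBeBatched(IngesterTwitterMessage message) {
    return !message.hasGeoLocation();
  }

  // Identity transform: input and output element types are the same here.
  @Override
  protected IngesterTwitterMessage transform(IngesterTwitterMessage element) {
    return element;
  }

  // One backend call per batch instead of one per tweet. The real stages call
  // UserPropertiesManager, ManhattanCodedLocationProvider, or the URL resolver here;
  // Future.value is only a placeholder so the sketch stands on its own.
  @Override
  protected Future<Collection<IngesterTwitterMessage>> innerProcessBatch(
      Collection<BatchedElement<IngesterTwitterMessage, IngesterTwitterMessage>> batch) {
    return Future.value(extractOnlyElementsFromBatch(batch));
  }
}
```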
private int resolveUrlPercentage = 100; - private String pinkClientId; - private SearchDecider searchDecider; - - // The number of URLs that we attempted to resolve. - private SearchRateCounter linksAttempted; - // The number of URLs that were successfully resolved. - private SearchRateCounter linksSucceeded; - // The number of URLs ignored because they are too long. - private SearchRateCounter linksTooLong; - // The number of URLs truncated because they are too long. - private SearchRateCounter linksTruncated; - - // The number of resolved URLs without a media type. - private SearchRateCounter urlsWithoutMediaType; - // The number of resolved URLs with a specific media type. - private final Map urlsWithMediaTypeMap = - Maps.newEnumMap(MediaTypes.class); - - // The number of tweets for which all URLs were resolved. - private SearchRateCounter tweetsWithResolvedURLs; - // The number of tweets for which some URLs were not resolved. - private SearchRateCounter tweetsWithUnresolvedURLs; - - // How long it takes to fully resolve all URLs in a tweet. - private Percentile millisToResolveAllTweetURLs; - - // max age that a tweet can be before passed down the pipeline - private long tweetMaxAgeToResolve; - - // number of times an element is within quota. - private SearchRateCounter numberOfElementsWithinQuota; - - // number of times element is not within quota. If element not within quota, we dont batch. - private SearchRateCounter numberOfElementsNotWithinQuota; - - // number of times element has urls. - private SearchRateCounter numberOfElementsWithUrls; - - // number of times element does not have urls. If element does not have URL, we dont batch. - private SearchRateCounter numberOfElementsWithoutUrls; - - // number of calls to needsToBeBatched method. - private SearchRateCounter numberOfCallsToNeedsToBeBatched; - - - public void setTweetMaxAgeToResolve(long tweetMaxAgeToResolve) { - this.tweetMaxAgeToResolve = tweetMaxAgeToResolve; - } - - @Override - protected Class getQueueObjectType() { - return IngesterTwitterMessage.class; - } - - @Override - protected boolean needsToBeBatched(IngesterTwitterMessage element) { - numberOfCallsToNeedsToBeBatched.increment(); - boolean isWithinQuota = (element.getId() % 100) < resolveUrlPercentage; - - if (isWithinQuota) { - this.numberOfElementsWithinQuota.increment(); - } else { - this.numberOfElementsNotWithinQuota.increment(); - } - - boolean hasUrls = !element.getExpandedUrlMap().isEmpty(); - - if (hasUrls) { - this.numberOfElementsWithUrls.increment(); - } else { - this.numberOfElementsWithoutUrls.increment(); - } - - return hasUrls && isWithinQuota; - } - - // Identity transformation. 
T and U types are the same - @Override - protected IngesterTwitterMessage transform(IngesterTwitterMessage element) { - return element; - } - - @Override - public void initStats() { - super.initStats(); - commonInnerSetupStats(); - } - - @Override - protected void innerSetupStats() { - super.innerSetupStats(); - commonInnerSetupStats(); - } - - private void commonInnerSetupStats() { - linksAttempted = RelevanceStats.exportRate(getStageNamePrefix() + "_num_links_attempted"); - linksSucceeded = RelevanceStats.exportRate(getStageNamePrefix() + "_num_links_succeeded"); - linksTooLong = RelevanceStats.exportRate(getStageNamePrefix() + "_num_links_toolong"); - linksTruncated = RelevanceStats.exportRate(getStageNamePrefix() + "_num_links_truncated"); - - urlsWithoutMediaType = RelevanceStats.exportRate( - getStageNamePrefix() + "_urls_without_media_type"); - - for (MediaTypes mediaType : MediaTypes.values()) { - urlsWithMediaTypeMap.put( - mediaType, - RelevanceStats.exportRate( - getStageNamePrefix() + "_urls_with_media_type_" + mediaType.name().toLowerCase())); - } - - tweetsWithResolvedURLs = RelevanceStats.exportRate( - getStageNamePrefix() + "_num_tweets_with_resolved_urls"); - tweetsWithUnresolvedURLs = RelevanceStats.exportRate( - getStageNamePrefix() + "_num_tweets_with_unresolved_urls"); - - millisToResolveAllTweetURLs = PercentileUtil.createPercentile( - getStageNamePrefix() + "_millis_to_resolve_all_tweet_urls"); - - numberOfCallsToNeedsToBeBatched = SearchRateCounter.export(getStageNamePrefix() - + "_calls_to_needsToBeBatched"); - - numberOfElementsWithinQuota = SearchRateCounter.export(getStageNamePrefix() - + "_is_within_quota"); - - numberOfElementsNotWithinQuota = SearchRateCounter.export(getStageNamePrefix() - + "_is_not_within_quota"); - - numberOfElementsWithUrls = SearchRateCounter.export(getStageNamePrefix() - + "_has_urls"); - - numberOfElementsWithoutUrls = SearchRateCounter.export(getStageNamePrefix() - + "_does_not_have_urls"); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - searchDecider = new SearchDecider(decider); - // We need to call this after assigning searchDecider because our updateBatchSize function - // depends on the searchDecider. - super.doInnerPreprocess(); - commonInnerSetup(); - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - searchDecider = new SearchDecider(decider); - // We need to call this after assigning searchDecider because our updateBatchSize function - // depends on the searchDecider. - super.innerSetup(); - commonInnerSetup(); - } - - private void commonInnerSetup() throws NamingException { - Preconditions.checkNotNull(pinkClientId); - urlResolver = new AsyncPinkUrlsResolver( - WireModule - .getWireModule() - .getStorer(Duration.fromMilliseconds(PINK_REQUEST_TIMEOUT_MILLIS), - PINK_REQUEST_RETRIES), - pinkClientId); - } - - @Override - protected Future> innerProcessBatch(Collection> batch) { - // Batch urls - Map> urlToTweetsMap = createUrlToTweetMap(batch); - - Set urlsToResolve = urlToTweetsMap.keySet(); - - updateBatchSize(); - - linksAttempted.increment(batch.size()); - // Do the lookup - return urlResolver.resolveUrls(urlsToResolve).map(processResolvedUrlsFunction(batch)); - } - - @Override - protected void updateBatchSize() { - // update batch based on decider - int decidedBatchSize = searchDecider.featureExists(PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY) - ? 
searchDecider.getAvailability(PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY)
-        : batchSize;
-
-    setBatchedStageBatchSize(decidedBatchSize);
-  }
-
-  // If not all URLs for a message were resolved, re-enqueue until maxAge is reached.
-  private Function<Map<String, ResolveCompressedUrlsUtils.UrlInfo>,
-      Collection<IngesterTwitterMessage>>
-      processResolvedUrlsFunction(
-          Collection<BatchedElement<IngesterTwitterMessage, IngesterTwitterMessage>> batch) {
-    return Function.func(resolvedUrls -> {
-      linksSucceeded.increment(resolvedUrls.size());
-
-      for (ResolveCompressedUrlsUtils.UrlInfo urlInfo : resolvedUrls.values()) {
-        if (urlInfo.mediaType != null) {
-          urlsWithMediaTypeMap.get(urlInfo.mediaType).increment();
-        } else {
-          urlsWithoutMediaType.increment();
-        }
-      }
-
-      Set<IngesterTwitterMessage> successfulTweets = Sets.newHashSet();
-
-      for (BatchedElement<IngesterTwitterMessage, IngesterTwitterMessage> batchedElement : batch) {
-        IngesterTwitterMessage message = batchedElement.getItem();
-        Set<String> tweetUrls = message.getExpandedUrlMap().keySet();
-
-        int resolvedUrlCounter = 0;
-
-        for (String url : tweetUrls) {
-          ResolveCompressedUrlsUtils.UrlInfo urlInfo = resolvedUrls.get(url);
-
-          // If the URL didn't resolve, move on to the next one. This might trigger a re-enqueue
-          // if the tweet is still fairly new, but we want to process the remaining URLs for the
-          // case where it is not, and the tweet ends up being passed to the next stage anyway.
-          if (urlInfo == null) {
-            continue;
-          }
-
-          String resolvedUrl = urlInfo.resolvedUrl;
-          Locale locale = urlInfo.language == null ? null
-              : LocaleUtil.getLocaleOf(urlInfo.language);
-
-          if (StringUtils.isNotBlank(resolvedUrl)) {
-            ThriftExpandedUrl expandedUrl = message.getExpandedUrlMap().get(url);
-            resolvedUrlCounter += 1;
-            enrichTweetWithUrlInfo(message, expandedUrl, urlInfo, locale);
-          }
-        }
-        long tweetMessageAge = clock.nowMillis() - message.getDate().getTime();
-
-        if (resolvedUrlCounter == tweetUrls.size()) {
-          millisToResolveAllTweetURLs.record(tweetMessageAge);
-          tweetsWithResolvedURLs.increment();
-          successfulTweets.add(message);
-        } else if (tweetMessageAge > tweetMaxAgeToResolve) {
-          tweetsWithUnresolvedURLs.increment();
-          successfulTweets.add(message);
-        } else {
-          // Re-enqueue if not all URLs were resolved and the tweet is younger than maxAge.
-          reEnqueueAndRetry(batchedElement);
-        }
-      }
-      return successfulTweets;
-    });
-  }
-
-  private Map<String, Set<IngesterTwitterMessage>> createUrlToTweetMap(
-      Collection<BatchedElement<IngesterTwitterMessage, IngesterTwitterMessage>> batch) {
-    Map<String, Set<IngesterTwitterMessage>> urlToTweetsMap = Maps.newHashMap();
-    for (BatchedElement<IngesterTwitterMessage, IngesterTwitterMessage> batchedElement : batch) {
-      IngesterTwitterMessage message = batchedElement.getItem();
-      for (String originalUrl : message.getExpandedUrlMap().keySet()) {
-        Set<IngesterTwitterMessage> messages = urlToTweetsMap.get(originalUrl);
-        if (messages == null) {
-          messages = new HashSet<>();
-          urlToTweetsMap.put(originalUrl, messages);
-        }
-        messages.add(message);
-      }
-    }
-    return Collections.unmodifiableMap(urlToTweetsMap);
-  }
-
-  // Enrich the TwitterMessage's expanded URL with the resolved URL info.
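Before the enrichment helper that follows, it is worth restating the decision the callback above makes for each tweet: emit it once every URL has resolved, emit it anyway once it is older than tweetMaxAgeToResolve, and otherwise re-enqueue the batch element for another attempt. A compact, self-contained sketch of that three-way decision (the enum and method names are illustrative; ages are in milliseconds, as in the stage):

```java
/** Possible outcomes for a tweet after one URL-resolution attempt (illustrative names). */
enum UrlResolutionOutcome { EMIT_ALL_RESOLVED, EMIT_WITH_UNRESOLVED, RE_ENQUEUE }

/** Mirrors the branch structure of processResolvedUrlsFunction above. */
static UrlResolutionOutcome classify(int resolvedUrlCount, int totalUrlCount,
                                     long tweetAgeMillis, long tweetMaxAgeToResolveMillis) {
  if (resolvedUrlCount == totalUrlCount) {
    return UrlResolutionOutcome.EMIT_ALL_RESOLVED;     // every URL resolved: pass downstream
  }
  if (tweetAgeMillis > tweetMaxAgeToResolveMillis) {
    return UrlResolutionOutcome.EMIT_WITH_UNRESOLVED;  // too old to keep retrying: pass as-is
  }
  return UrlResolutionOutcome.RE_ENQUEUE;              // young enough: retry this batch element
}
```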
- private void enrichTweetWithUrlInfo(IngesterTwitterMessage message, - ThriftExpandedUrl expandedUrl, - ResolveCompressedUrlsUtils.UrlInfo urlInfo, - Locale locale) { - String truncatedUrl = maybeTruncate(urlInfo.resolvedUrl); - if (truncatedUrl == null) { - return; - } - - expandedUrl.setCanonicalLastHopUrl(truncatedUrl); - if (urlInfo.mediaType != null) { - // Overwrite url media type with media type from resolved url only if the media type from - // resolved url is not Unknown - if (!expandedUrl.isSetMediaType() || urlInfo.mediaType != MediaTypes.UNKNOWN) { - expandedUrl.setMediaType(urlInfo.mediaType); - } - } - if (urlInfo.linkCategory != null) { - expandedUrl.setLinkCategory(urlInfo.linkCategory); - } - // Note that if there are multiple links in one tweet message, the language of the - // link that got examined later in this for loop will overwrite the values that were - // written before. This is not an optimal design but considering most tweets have - // only one link, or same-language links, this shouldn't be a big issue. - if (locale != null) { - message.setLinkLocale(locale); - } - - if (urlInfo.description != null) { - expandedUrl.setDescription(urlInfo.description); - } - - if (urlInfo.title != null) { - expandedUrl.setTitle(urlInfo.title); - } - } - - // test methods - public void setResolveUrlPercentage(int percentage) { - this.resolveUrlPercentage = percentage; - } - - public void setPinkClientId(String pinkClientId) { - this.pinkClientId = pinkClientId; - } - - public static final int MAX_URL_LENGTH = 1000; - - private String maybeTruncate(String fullUrl) { - if (fullUrl.length() <= MAX_URL_LENGTH) { - return fullUrl; - } - - try { - URI parsed = new URI(fullUrl); - - // Create a URL with an empty query and fragment. - String simplified = new URI(parsed.getScheme(), - parsed.getAuthority(), - parsed.getPath(), - null, - null).toString(); - if (simplified.length() < MAX_URL_LENGTH) { - linksTruncated.increment(); - return simplified; - } - } catch (URISyntaxException e) { - } - - linksTooLong.increment(); - return null; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsPink.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsPink.java deleted file mode 100644 index 4064b590e..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsPink.java +++ /dev/null @@ -1,113 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.List; -import java.util.Map; -import java.util.Set; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.decider.Decider; -import com.twitter.pink_floyd.thrift.ClientIdentifier; -import com.twitter.pink_floyd.thrift.Mask; -import com.twitter.pink_floyd.thrift.Storer; -import com.twitter.pink_floyd.thrift.UrlData; -import com.twitter.pink_floyd.thrift.UrlReadRequest; -import com.twitter.pink_floyd.thrift.UrlReadResponse; -import com.twitter.search.common.decider.SearchDecider; -import com.twitter.util.Await; -import com.twitter.util.Future; -import com.twitter.util.Throw; -import com.twitter.util.Throwables; -import com.twitter.util.Try; - -import static com.twitter.search.ingester.pipeline.twitter.ResolveCompressedUrlsUtils.getUrlInfo; - -/** - * Resolve compressed URL via Pink - */ -public class 
ResolveCompressedUrlsPink { - private static final Logger LOG = LoggerFactory.getLogger(ResolveCompressedUrlsPink.class); - private static final String PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY = "pink_requests_batch_size"; - - private final Storer.ServiceIface storerClient; - private final ClientIdentifier pinkClientId; - private final Mask requestMask; - private final SearchDecider decider; - - // Use ServerSet to construct a metadata store client - public ResolveCompressedUrlsPink(Storer.ServiceIface storerClient, - String pinkClientId, - Decider decider) { - this.storerClient = storerClient; - this.pinkClientId = ClientIdentifier.valueOf(pinkClientId); - this.decider = new SearchDecider(Preconditions.checkNotNull(decider)); - - requestMask = new Mask(); - requestMask.setResolution(true); - requestMask.setHtmlBasics(true); - requestMask.setUrlDirectInfo(true); - } - - /** - * Resolve a set of URLs using PinkFloyd. - */ - public Map resolveUrls(Set urls) { - if (urls == null || urls.size() == 0) { - return null; - } - - List urlsList = ImmutableList.copyOf(urls); - int batchSize = decider.featureExists(PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY) - ? decider.getAvailability(PINK_REQUESTS_BATCH_SIZE_DECIDER_KEY) - : 10000; - int numRequests = (int) Math.ceil(1.0 * urlsList.size() / batchSize); - - List> responseFutures = Lists.newArrayList(); - for (int i = 0; i < numRequests; ++i) { - UrlReadRequest request = new UrlReadRequest(); - request.setUrls( - urlsList.subList(i * batchSize, Math.min(urlsList.size(), (i + 1) * batchSize))); - request.setMask(requestMask); - request.setClientId(pinkClientId); - - // Send all requests in parallel. - responseFutures.add(storerClient.read(request)); - } - - Map resultMap = Maps.newHashMap(); - for (Future responseFuture : responseFutures) { - Try tryResponse = getResponseTry(responseFuture); - if (tryResponse.isThrow()) { - continue; - } - - UrlReadResponse response = tryResponse.get(); - for (UrlData urlData : response.getData()) { - if (ResolveCompressedUrlsUtils.isResolved(urlData)) { - resultMap.put(urlData.url, getUrlInfo(urlData)); - } - } - } - - return resultMap; - } - - private Try getResponseTry(Future responseFuture) { - try { - Try tryResponse = Await.result(responseFuture.liftToTry()); - if (tryResponse.isThrow()) { - Throwable throwable = ((Throw) tryResponse).e(); - LOG.warn("Failed to resolve URLs with Pink Storer.", throwable); - } - return tryResponse; - } catch (Exception e) { - return Throwables.unchecked(e); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsUtils.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsUtils.java deleted file mode 100644 index baa4269cd..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ResolveCompressedUrlsUtils.java +++ /dev/null @@ -1,116 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import javax.annotation.Nullable; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Iterables; - -import org.apache.commons.lang.StringUtils; - -import com.twitter.pink_floyd.thrift.FetchStatusCode; -import com.twitter.pink_floyd.thrift.HtmlBasics; -import com.twitter.pink_floyd.thrift.Resolution; -import com.twitter.pink_floyd.thrift.UrlData; -import com.twitter.service.spiderduck.gen.LinkCategory; -import com.twitter.service.spiderduck.gen.MediaTypes; -import com.twitter.spiderduck.common.URLUtils; - -// Helper class with UrlInfo helper functions -public final class 
ResolveCompressedUrlsUtils { - - private ResolveCompressedUrlsUtils() { } - static class UrlInfo { - public String originalUrl; - @Nullable public String resolvedUrl; - @Nullable public String language; - @Nullable public MediaTypes mediaType; - @Nullable public LinkCategory linkCategory; - @Nullable public String description; - @Nullable public String title; - } - - /** - * Determines if the given UrlData instance is fully resolved. - * - * Based on discussions with the URL services team, we decided that the most correct way to - * determine that a URL was fully resolved is to look at a few response fields: - * - urlDirectInfo: both the media type and link category must be set. - * - htmlBasics: Pink has successfully parsed the resolved link's metadata. - * - resolution: Pink was able to successfully get to the last hop in the redirect chain. - * This is especially important, because some sites have a robots.txt file, which - * prevents Pink from following the redirect chain once it gets to that site. - * In that case, we end up with a "last hop" URL, but the FetchStatusCode is not - * set to OK. We need to ignore these URLs because we don't know if they're really - * the last hop URLs. - * Also, Pink has some restrictions on the page size. For example, it does not - * parse text pages that are larger than 2MB. So if the redirect chain leads Pink - * to one of these pages, it will stop there. And again, we don't know if this is - * the last hop URL or not, so we have to ignore that URL. - * - * @param urlData The UrlData instance. - * @return true if the URL data is fully resolved; false otherwise. - */ - public static boolean isResolved(UrlData urlData) { - // Make sure the mediaType and linkCategory fields are set. - boolean isInfoReady = urlData.isSetUrlDirectInfo() - && urlData.getUrlDirectInfo().isSetMediaType() - && urlData.getUrlDirectInfo().isSetLinkCategory(); - - // The individual HtmlBasics fields might or might not be set, depending on each website. - // However, all fields should be set at the same time, if they are present. Consider the - // resolution complete if at least one of the title, description or language fields is set. - boolean isHtmlReady = urlData.isSetHtmlBasics() - && (StringUtils.isNotEmpty(urlData.getHtmlBasics().getTitle()) - || StringUtils.isNotEmpty(urlData.getHtmlBasics().getDescription()) - || StringUtils.isNotEmpty(urlData.getHtmlBasics().getLang())); - - Resolution resolution = urlData.getResolution(); - boolean isResolutionReady = urlData.isSetResolution() - && StringUtils.isNotEmpty(resolution.getLastHopCanonicalUrl()) - && resolution.getStatus() == FetchStatusCode.OK - && resolution.getLastHopHttpResponseStatusCode() == 200; - - return isHtmlReady && isInfoReady && isResolutionReady; - } - - /** - * Creates a UrlInfo instance from the given URL data. - * - * @param urlData urlData from a resolver response. - * @return the UrlInfo instance. 
- */ - public static UrlInfo getUrlInfo(UrlData urlData) { - Preconditions.checkArgument(urlData.isSetResolution()); - - UrlInfo urlInfo = new UrlInfo(); - urlInfo.originalUrl = urlData.url; - Resolution resolution = urlData.getResolution(); - if (resolution.isSetLastHopCanonicalUrl()) { - urlInfo.resolvedUrl = resolution.lastHopCanonicalUrl; - } else { - // Just in case lastHopCanonicalUrl is not available (which shouldn't happen) - if (resolution.isSetRedirectionChain()) { - urlInfo.resolvedUrl = Iterables.getLast(resolution.redirectionChain); - } else { - urlInfo.resolvedUrl = urlData.url; - } - urlInfo.resolvedUrl = URLUtils.canonicalizeUrl(urlInfo.resolvedUrl); - } - if (urlData.isSetUrlDirectInfo()) { - urlInfo.mediaType = urlData.urlDirectInfo.mediaType; - urlInfo.linkCategory = urlData.urlDirectInfo.linkCategory; - } - if (urlData.isSetHtmlBasics()) { - HtmlBasics htmlBasics = urlData.getHtmlBasics(); - urlInfo.language = htmlBasics.getLang(); - if (htmlBasics.isSetDescription()) { - urlInfo.description = htmlBasics.getDescription(); - } - if (htmlBasics.isSetTitle()) { - urlInfo.title = htmlBasics.getTitle(); - } - } - return urlInfo; - } -} - diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveCardBatchedStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveCardBatchedStage.java deleted file mode 100644 index 705c211c5..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveCardBatchedStage.java +++ /dev/null @@ -1,288 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.net.MalformedURLException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.Set; -import javax.naming.NamingException; - -import com.google.common.collect.Maps; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.stage.StageTimer; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.text.language.LocaleUtil; -import com.twitter.expandodo.thriftjava.Card2; -import com.twitter.mediaservices.commons.tweetmedia.thrift_java.MediaInfo; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.BatchingClient; -import com.twitter.search.ingester.pipeline.util.CardFieldUtil; -import com.twitter.search.ingester.pipeline.util.IngesterStageTimer; -import com.twitter.search.ingester.pipeline.util.ResponseNotReturnedException; -import com.twitter.spiderduck.common.URLUtils; -import com.twitter.tweetypie.thriftjava.GetTweetOptions; -import com.twitter.tweetypie.thriftjava.GetTweetResult; -import com.twitter.tweetypie.thriftjava.GetTweetsRequest; -import com.twitter.tweetypie.thriftjava.MediaEntity; -import com.twitter.tweetypie.thriftjava.StatusState; -import com.twitter.tweetypie.thriftjava.Tweet; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.util.Function; -import com.twitter.util.Future; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class RetrieveCardBatchedStage extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(RetrieveCardBatchedStage.class); - - 
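The javadoc on isResolved above lists three independent readiness checks (URL direct info, HTML basics, and the resolution itself), and getUrlInfo converts a passing UrlData into the lighter UrlInfo view keyed by the original t.co URL. A small usage sketch of the two helpers together, mirroring the loop in ResolveCompressedUrlsPink.resolveUrls; the wrapper class and method name are illustrative and assumed to sit in the same package as ResolveCompressedUrlsUtils, since UrlInfo is package-private:

```java
import java.util.HashMap;
import java.util.Map;

import com.twitter.pink_floyd.thrift.UrlData;
import com.twitter.pink_floyd.thrift.UrlReadResponse;

// Illustrative helper, not part of the stages in this diff.
final class UrlInfoMaps {
  private UrlInfoMaps() { }

  static Map<String, ResolveCompressedUrlsUtils.UrlInfo> toUrlInfoMap(UrlReadResponse response) {
    Map<String, ResolveCompressedUrlsUtils.UrlInfo> resultMap = new HashMap<>();
    for (UrlData urlData : response.getData()) {
      // Keep only entries that pass all three readiness checks described above.
      if (ResolveCompressedUrlsUtils.isResolved(urlData)) {
        // Key by the original (t.co) URL so each tweet can look its own links back up.
        resultMap.put(urlData.url, ResolveCompressedUrlsUtils.getUrlInfo(urlData));
      }
    }
    return resultMap;
  }
}
```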
private static final String CARDS_PLATFORM_KEY = "iPhone-13"; - private int batchSize = 10; - - private SearchRateCounter totalTweets; - private SearchRateCounter tweetsWithCards; - private SearchRateCounter tweetsWithoutCards; - private SearchRateCounter tweetsWithAnimatedGifMediaInfo; - private SearchRateCounter cardsWithName; - private SearchRateCounter cardsWithDomain; - private SearchRateCounter cardsWithTitles; - private SearchRateCounter cardsWithDescriptions; - private SearchRateCounter cardsWithUnknownLanguage; - private SearchRateCounter tweetsNotFound; - private SearchRateCounter malformedUrls; - private SearchRateCounter urlMismatches; - private SearchRateCounter cardExceptions; - private SearchRateCounter cardExceptionTweets; - private StageTimer retrieveCardsTimer; - - private String cardNamePrefix; - // Since there is only one thread executing this stage (although that could potentially be - // changed in the pipeline config), no need to be thread safe. - private static final Map CARD_NAME_STATS = new HashMap<>(); - - private static TweetService.ServiceToClient tweetyPieService; - private BatchingClient cardsClient; - - private String tweetypieClientId = null; - - // Can be overridden in the corresponding pipeline-ingester.*.xml config. - // By default protected tweets are filtered out. - // Only in the protected ingester pipeline is this set to false. - private boolean filterProtected = true; - - @Override - public void initStats() { - super.initStats(); - cardNamePrefix = getStageNamePrefix() + "_card_name_"; - totalTweets = SearchRateCounter.export(getStageNamePrefix() + "_total_tweets"); - tweetsWithCards = SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_cards"); - tweetsWithoutCards = SearchRateCounter.export(getStageNamePrefix() + "_tweets_without_cards"); - tweetsWithAnimatedGifMediaInfo = - SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_animated_gif_media_info"); - cardsWithName = SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_card_name"); - cardsWithDomain = SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_card_domain"); - cardsWithTitles = SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_card_titles"); - cardsWithDescriptions = - SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_card_descriptions"); - cardsWithUnknownLanguage = - SearchRateCounter.export(getStageNamePrefix() + "_tweets_with_unknown_card_lanuage"); - tweetsNotFound = SearchRateCounter.export(getStageNamePrefix() + "_tweets_not_found"); - malformedUrls = SearchRateCounter.export(getStageNamePrefix() + "_malformed_urls"); - urlMismatches = SearchRateCounter.export(getStageNamePrefix() + "_url_mismatches"); - cardExceptions = SearchRateCounter.export(getStageNamePrefix() + "_card_exceptions"); - cardExceptionTweets = - SearchRateCounter.export(getStageNamePrefix() + "_card_exception_tweets"); - retrieveCardsTimer = new IngesterStageTimer(getStageNamePrefix() + "_request_timer"); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - tweetyPieService = wireModule.getTweetyPieClient(tweetypieClientId); - cardsClient = new BatchingClient<>(this::batchRetrieveURLs, batchSize); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, - "Received object of incorrect type: " + obj.getClass().getName()); - } - - IngesterTwitterMessage message = 
(IngesterTwitterMessage) obj; - - cardsClient.call(message.getTweetId()) - .onSuccess(Function.cons(card -> { - updateMessage(message, card); - emitAndCount(message); - })) - .onFailure(Function.cons(exception -> { - if (!(exception instanceof ResponseNotReturnedException)) { - cardExceptionTweets.increment(); - } - - emitAndCount(message); - })); - } - - private Future> batchRetrieveURLs(Set keys) { - retrieveCardsTimer.start(); - totalTweets.increment(keys.size()); - - GetTweetOptions options = new GetTweetOptions() - .setInclude_cards(true) - .setCards_platform_key(CARDS_PLATFORM_KEY) - .setBypass_visibility_filtering(!filterProtected); - - GetTweetsRequest request = new GetTweetsRequest() - .setOptions(options) - .setTweet_ids(new ArrayList<>(keys)); - - return tweetyPieService.get_tweets(request) - .onFailure(throwable -> { - cardExceptions.increment(); - LOG.error("TweetyPie server threw an exception while requesting tweetIds: " - + request.getTweet_ids(), throwable); - return null; - }) - .map(this::createIdToCardMap); - } - - private void updateMessage(IngesterTwitterMessage message, Card2 card) { - tweetsWithCards.increment(); - - String cardName = card.getName().toLowerCase(); - addCardNameToStats(cardName); - message.setCardName(cardName); - cardsWithName.increment(); - message.setCardUrl(card.getUrl()); - - String url = getLastHop(message, card.getUrl()); - if (url != null) { - try { - String domain = URLUtils.getDomainFromURL(url); - message.setCardDomain(domain.toLowerCase()); - cardsWithDomain.increment(); - } catch (MalformedURLException e) { - malformedUrls.increment(); - if (LOG.isDebugEnabled()) { - LOG.debug("Tweet ID {} has a malformed card last hop URL: {}", message.getId(), url); - } - } - } else { - // This happens with retweet. Basically when retrieve card for a retweet, we - // get a card associated with the original tweet, so the tco won't match. - // As of Sep 2014, this seems to be the intended behavior and has been running - // like this for over a year. 
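// To make that concrete, with illustrative (hypothetical) t.co values:
//   message.getExpandedUrlMap().keySet() -> { "https://t.co/AAAA" }   (the retweet's own link)
//   card.getUrl()                        -> "https://t.co/BBBB"       (the original tweet's link)
// getLastHop() finds no entry for the card URL, returns null, and we only bump the counter.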
- urlMismatches.increment(); - } - - message.setCardTitle( - CardFieldUtil.extractBindingValue(CardFieldUtil.TITLE_BINDING_KEY, card)); - if (message.getCardTitle() != null) { - cardsWithTitles.increment(); - } - message.setCardDescription( - CardFieldUtil.extractBindingValue(CardFieldUtil.DESCRIPTION_BINDING_KEY, card)); - if (message.getCardDescription() != null) { - cardsWithDescriptions.increment(); - } - CardFieldUtil.deriveCardLang(message); - if (LocaleUtil.UNKNOWN.getLanguage().equals(message.getCardLang())) { - cardsWithUnknownLanguage.increment(); - } - } - - private Map createIdToCardMap(List listResult) { - Map responseMap = Maps.newHashMap(); - for (GetTweetResult entry : listResult) { - if (entry.isSetTweet() - && entry.isSetTweet_state() - && (entry.getTweet_state() == StatusState.FOUND)) { - long id = entry.getTweet_id(); - if (entry.getTweet().isSetCard2()) { - responseMap.put(id, entry.getTweet().getCard2()); - } else { - // Short-term fix for removal of animated GIF cards -- - // if the tweet contains an animated GIF, create a card based on media entity data - Card2 card = createCardForAnimatedGif(entry.getTweet()); - if (card != null) { - responseMap.put(id, card); - tweetsWithAnimatedGifMediaInfo.increment(); - } else { - tweetsWithoutCards.increment(); - } - } - } else { - tweetsNotFound.increment(); - } - } - return responseMap; - } - - private Card2 createCardForAnimatedGif(Tweet tweet) { - if (tweet.getMediaSize() > 0) { - for (MediaEntity mediaEntity : tweet.getMedia()) { - MediaInfo mediaInfo = mediaEntity.getMedia_info(); - if (mediaInfo != null && mediaInfo.getSetField() == MediaInfo._Fields.ANIMATED_GIF_INFO) { - Card2 card = new Card2(); - card.setName("animated_gif"); - // Use the original compressed URL for the media entity to match existing card URLs - card.setUrl(mediaEntity.getUrl()); - card.setBinding_values(Collections.emptyList()); - - return card; - } - } - } - return null; - } - - // Unfortunately the url returned in the card data is not the last hop - private String getLastHop(IngesterTwitterMessage message, String url) { - if (message.getExpandedUrlMap() != null) { - ThriftExpandedUrl expanded = message.getExpandedUrlMap().get(url); - if ((expanded != null) && expanded.isSetCanonicalLastHopUrl()) { - return expanded.getCanonicalLastHopUrl(); - } - } - return null; - } - - // Used by commons-pipeline and set via the xml config - public void setFilterProtected(boolean filterProtected) { - LOG.info("Filtering protected tweets: {}", filterProtected); - this.filterProtected = filterProtected; - } - - public void setTweetypieClientId(String tweetypieClientId) { - LOG.info("Using tweetypieClientId: {}", tweetypieClientId); - this.tweetypieClientId = tweetypieClientId; - } - - public void setInternalBatchSize(int internalBatchSize) { - this.batchSize = internalBatchSize; - } - - /** - * For each card name, we add a rate counter to observe what kinds of card we're actually - * indexing, and with what rate. 
- */ - private void addCardNameToStats(String cardName) { - SearchRateCounter cardNameCounter = CARD_NAME_STATS.get(cardName); - if (cardNameCounter == null) { - cardNameCounter = SearchRateCounter.export(cardNamePrefix + cardName); - CARD_NAME_STATS.put(cardName, cardNameCounter); - } - cardNameCounter.increment(); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveNamedEntitiesSingleTweetStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveNamedEntitiesSingleTweetStage.java deleted file mode 100644 index 762abefc2..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveNamedEntitiesSingleTweetStage.java +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.concurrent.CompletableFuture; -import javax.naming.NamingException; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.util.Function; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class RetrieveNamedEntitiesSingleTweetStage extends TwitterBaseStage - > { - - private NamedEntityHandler namedEntityHandler; - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - innerSetup(); - } - - @Override - protected void innerSetup() { - namedEntityHandler = new NamedEntityHandler( - wireModule.getNamedEntityFetcher(), decider, getStageNamePrefix(), - "single_tweet"); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not a IngesterTwitterMessage object: " + obj); - } - IngesterTwitterMessage twitterMessage = (IngesterTwitterMessage) obj; - - if (namedEntityHandler.shouldRetrieve(twitterMessage)) { - namedEntityHandler.retrieve(twitterMessage) - .onSuccess(Function.cons(result -> { - namedEntityHandler.addEntitiesToMessage(twitterMessage, result); - emitAndCount(twitterMessage); - })) - .onFailure(Function.cons(throwable -> { - namedEntityHandler.incrementErrorCount(); - emitAndCount(twitterMessage); - })); - } else { - emitAndCount(twitterMessage); - } - } - - @Override - protected CompletableFuture innerRunStageV2(IngesterTwitterMessage - message) { - CompletableFuture cf = new CompletableFuture<>(); - - if (namedEntityHandler.shouldRetrieve(message)) { - namedEntityHandler.retrieve(message) - .onSuccess(Function.cons(result -> { - namedEntityHandler.addEntitiesToMessage(message, result); - cf.complete(message); - })) - .onFailure(Function.cons(throwable -> { - namedEntityHandler.incrementErrorCount(); - cf.complete(message); - })); - } else { - cf.complete(message); - } - - return cf; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceAdminsAndTitleStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceAdminsAndTitleStage.java deleted file mode 100644 index 66918274c..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceAdminsAndTitleStage.java +++ /dev/null @@ -1,246 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.List; -import java.util.Optional; -import java.util.Set; -import java.util.concurrent.CompletableFuture; - -import scala.Option; -import scala.Tuple2; - -import 
com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Lists; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.TwitterMessageUser; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceCoreFetcher; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceParticipantsFetcher; -import com.twitter.strato.catalog.Fetch; -import com.twitter.ubs.thriftjava.AudioSpace; -import com.twitter.ubs.thriftjava.ParticipantUser; -import com.twitter.ubs.thriftjava.Participants; -import com.twitter.util.Function; -import com.twitter.util.Future; -import com.twitter.util.Futures; -import com.twitter.util.Try; - -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class RetrieveSpaceAdminsAndTitleStage extends TwitterBaseStage - > { - - @VisibleForTesting - protected static final String RETRIEVE_SPACE_ADMINS_AND_TITLE_DECIDER_KEY = - "ingester_all_retrieve_space_admins_and_title"; - - private AudioSpaceCoreFetcher coreFetcher; - private AudioSpaceParticipantsFetcher participantsFetcher; - - private SearchRateCounter tweetsWithSpaceAdmins; - private SearchRateCounter tweetsWithSpaceTitle; - private SearchRateCounter coreFetchSuccess; - private SearchRateCounter coreFetchFailure; - private SearchRateCounter participantsFetchSuccess; - private SearchRateCounter participantsFetchFailure; - private SearchRateCounter emptyCore; - private SearchRateCounter emptyParticipants; - private SearchRateCounter emptySpaceTitle; - private SearchRateCounter emptySpaceAdmins; - private SearchRateCounter parallelFetchAttempts; - private SearchRateCounter parallelFetchFailure; - - - @Override - protected void doInnerPreprocess() { - innerSetup(); - } - - @Override - protected void innerSetup() { - coreFetcher = wireModule.getAudioSpaceCoreFetcher(); - participantsFetcher = wireModule.getAudioSpaceParticipantsFetcher(); - - tweetsWithSpaceAdmins = getStageStat("tweets_with_audio_space_admins"); - tweetsWithSpaceTitle = getStageStat("tweets_with_audio_space_title"); - coreFetchSuccess = getStageStat("core_fetch_success"); - coreFetchFailure = getStageStat("core_fetch_failure"); - participantsFetchSuccess = getStageStat("participants_fetch_success"); - participantsFetchFailure = getStageStat("participants_fetch_failure"); - emptyCore = getStageStat("empty_core"); - emptyParticipants = getStageStat("empty_participants"); - emptySpaceTitle = getStageStat("empty_space_title"); - emptySpaceAdmins = getStageStat("empty_space_admins"); - parallelFetchAttempts = getStageStat("parallel_fetch_attempts"); - parallelFetchFailure = getStageStat("parallel_fetch_failure"); - } - - private SearchRateCounter getStageStat(String statSuffix) { - return SearchRateCounter.export(getStageNamePrefix() + "_" + statSuffix); - } - - private Future>, Try>>> - tryRetrieveSpaceAdminAndTitle(IngesterTwitterMessage twitterMessage) { - Set spaceIds = twitterMessage.getSpaceIds(); - - if (spaceIds.isEmpty()) { - return null; - } - - if (!(DeciderUtil.isAvailableForRandomRecipient(decider, - RETRIEVE_SPACE_ADMINS_AND_TITLE_DECIDER_KEY))) { - return null; - } - - String spaceId = 
spaceIds.iterator().next(); - - // Query both columns in parallel. - parallelFetchAttempts.increment(); - Future> core = coreFetcher.fetch(spaceId); - Future> participants = participantsFetcher.fetch(spaceId); - - return Futures.join(core.liftToTry(), participants.liftToTry()); - } - - @Override - protected CompletableFuture innerRunStageV2(IngesterTwitterMessage - twitterMessage) { - Future>, Try>>> - tryRetrieveSpaceAdminAndTitle = tryRetrieveSpaceAdminAndTitle(twitterMessage); - - CompletableFuture cf = new CompletableFuture<>(); - - if (tryRetrieveSpaceAdminAndTitle == null) { - cf.complete(twitterMessage); - } else { - tryRetrieveSpaceAdminAndTitle.onSuccess(Function.cons(tries -> { - handleFutureOnSuccess(tries, twitterMessage); - cf.complete(twitterMessage); - })).onFailure(Function.cons(throwable -> { - handleFutureOnFailure(); - cf.complete(twitterMessage); - })); - } - - return cf; - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not a IngesterTwitterMessage object: " + obj); - } - IngesterTwitterMessage twitterMessage = (IngesterTwitterMessage) obj; - Future>, Try>>> - tryRetrieveSpaceAdminAndTitle = tryRetrieveSpaceAdminAndTitle(twitterMessage); - - if (tryRetrieveSpaceAdminAndTitle == null) { - emitAndCount(twitterMessage); - return; - } - - tryRetrieveSpaceAdminAndTitle.onSuccess(Function.cons(tries -> { - handleFutureOnSuccess(tries, twitterMessage); - emitAndCount(twitterMessage); - })).onFailure(Function.cons(throwable -> { - handleFutureOnFailure(); - emitAndCount(twitterMessage); - })); - } - - private void handleFutureOnSuccess(Tuple2>, - Try>> tries, IngesterTwitterMessage twitterMessage) { - handleCoreFetchTry(tries._1(), twitterMessage); - handleParticipantsFetchTry(tries._2(), twitterMessage); - } - - private void handleFutureOnFailure() { - parallelFetchFailure.increment(); - } - - private void handleCoreFetchTry( - Try> fetchTry, - IngesterTwitterMessage twitterMessage) { - - if (fetchTry.isReturn()) { - coreFetchSuccess.increment(); - addSpaceTitleToMessage(twitterMessage, fetchTry.get().v()); - } else { - coreFetchFailure.increment(); - } - } - - private void handleParticipantsFetchTry( - Try> fetchTry, - IngesterTwitterMessage twitterMessage) { - - if (fetchTry.isReturn()) { - participantsFetchSuccess.increment(); - addSpaceAdminsToMessage(twitterMessage, fetchTry.get().v()); - } else { - participantsFetchFailure.increment(); - } - } - - private void addSpaceTitleToMessage( - IngesterTwitterMessage twitterMessage, - Option audioSpace) { - - if (audioSpace.isDefined()) { - String audioSpaceTitle = audioSpace.get().getTitle(); - if (StringUtils.isNotEmpty(audioSpaceTitle)) { - twitterMessage.setSpaceTitle(audioSpaceTitle); - tweetsWithSpaceTitle.increment(); - } else { - emptySpaceTitle.increment(); - } - } else { - emptyCore.increment(); - } - } - - private void addSpaceAdminsToMessage( - IngesterTwitterMessage twitterMessage, - Option participants) { - - if (participants.isDefined()) { - List admins = getAdminsFromParticipants(participants.get()); - if (!admins.isEmpty()) { - for (ParticipantUser admin : admins) { - addSpaceAdminToMessage(twitterMessage, admin); - } - tweetsWithSpaceAdmins.increment(); - } else { - emptySpaceAdmins.increment(); - } - } else { - emptyParticipants.increment(); - } - } - - private List getAdminsFromParticipants(Participants participants) { - if (!participants.isSetAdmins()) { - return Lists.newArrayList(); 
- } - return participants.getAdmins(); - } - - private void addSpaceAdminToMessage(IngesterTwitterMessage twitterMessage, - ParticipantUser admin) { - TwitterMessageUser.Builder userBuilder = new TwitterMessageUser.Builder(); - if (admin.isSetTwitter_screen_name() - && StringUtils.isNotEmpty(admin.getTwitter_screen_name())) { - userBuilder.withScreenName(Optional.of(admin.getTwitter_screen_name())); - } - if (admin.isSetDisplay_name() && StringUtils.isNotEmpty(admin.getDisplay_name())) { - userBuilder.withDisplayName(Optional.of(admin.getDisplay_name())); - } - twitterMessage.addSpaceAdmin(userBuilder.build()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceIdsStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceIdsStage.java deleted file mode 100644 index 112e6b875..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/RetrieveSpaceIdsStage.java +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Set; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.collect.Sets; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.indexing.thriftjava.ThriftExpandedUrl; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.TwitterMessage; - -@ConsumedTypes(TwitterMessage.class) -@ProducesConsumed -public class RetrieveSpaceIdsStage extends TwitterBaseStage - { - - @VisibleForTesting - protected static final Pattern SPACES_URL_REGEX = - Pattern.compile("^https://twitter\\.com/i/spaces/([a-zA-Z0-9]+)\\S*$"); - - @VisibleForTesting - protected static final String PARSE_SPACE_ID_DECIDER_KEY = "ingester_all_parse_space_id_from_url"; - - private static SearchRateCounter numTweetsWithSpaceIds; - private static SearchRateCounter numTweetsWithMultipleSpaceIds; - - @Override - protected void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - numTweetsWithSpaceIds = SearchRateCounter.export( - getStageNamePrefix() + "_tweets_with_space_ids"); - numTweetsWithMultipleSpaceIds = SearchRateCounter.export( - getStageNamePrefix() + "_tweets_with_multiple_space_ids"); - } - - @Override - public void innerProcess(Object obj) throws StageException { - TwitterMessage message = (TwitterMessage) obj; - tryToRetrieveSpaceId(message); - emitAndCount(message); - } - - private void tryToRetrieveSpaceId(TwitterMessage message) { - if (DeciderUtil.isAvailableForRandomRecipient(decider, PARSE_SPACE_ID_DECIDER_KEY)) { - Set spaceIds = parseSpaceIdsFromMessage(message); - int spaceIdCount = spaceIds.size(); - if (spaceIdCount > 0) { - numTweetsWithSpaceIds.increment(); - if (spaceIdCount > 1) { - numTweetsWithMultipleSpaceIds.increment(); - } - message.setSpaceIds(spaceIds); - } - } - } - - @Override - protected TwitterMessage innerRunStageV2(TwitterMessage message) { - tryToRetrieveSpaceId(message); - return message; - } - - private String parseSpaceIdsFromUrl(String url) { - String spaceId = null; - - if (StringUtils.isNotEmpty(url)) { - Matcher matcher = SPACES_URL_REGEX.matcher(url); - if (matcher.matches()) { - spaceId 
= matcher.group(1); - } - } - return spaceId; - } - - private Set parseSpaceIdsFromMessage(TwitterMessage message) { - Set spaceIds = Sets.newHashSet(); - - for (ThriftExpandedUrl expandedUrl : message.getExpandedUrls()) { - String spaceId = parseSpaceIdsFromUrl(expandedUrl.getExpandedUrl()); - if (StringUtils.isNotEmpty(spaceId)) { - spaceIds.add(spaceId); - } - } - return spaceIds; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/SingleTweetExtractAndGeocodeLatLonStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/SingleTweetExtractAndGeocodeLatLonStage.java deleted file mode 100644 index 5a40020bb..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/SingleTweetExtractAndGeocodeLatLonStage.java +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.GeoObject; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.text.LocationUtils; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -/** - * Read-only stage to extract lat/lon pairs from the tweet text and populate - * the geoLocation field. - *
    - * If the tweet is geotagged by mobile devices, the geo coordinates extracted from the JSON - * is used. - */ -@ConsumedTypes(IngesterTwitterMessage.class) -@ProducesConsumed -public class SingleTweetExtractAndGeocodeLatLonStage extends TwitterBaseStage - { - private static final Logger LOG = - LoggerFactory.getLogger(SingleTweetExtractAndGeocodeLatLonStage.class); - - private SearchRateCounter extractedLatLons; - private SearchRateCounter badLatLons; - - @Override - public void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - extractedLatLons = SearchRateCounter.export(getStageNamePrefix() + "_extracted_lat_lons"); - badLatLons = SearchRateCounter.export(getStageNamePrefix() + "_invalid_lat_lons"); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not IngesterTwitterMessage object: " + obj); - } - - IngesterTwitterMessage message = IngesterTwitterMessage.class.cast(obj); - tryToSetGeoLocation(message); - emitAndCount(message); - } - - @Override - protected IngesterTwitterMessage innerRunStageV2(TwitterMessage message) { - // Previous stage takes in a TwitterMessage and returns a TwitterMessage. I think it was - // done to simplify testing. From this stage onwards, we only count the message that are of type - // IngesterTwitterMessage. - if (!(message instanceof IngesterTwitterMessage)) { - throw new PipelineStageRuntimeException("Message needs to be of type IngesterTwitterMessage"); - } - - IngesterTwitterMessage ingesterTwitterMessage = IngesterTwitterMessage.class.cast(message); - tryToSetGeoLocation(ingesterTwitterMessage); - return ingesterTwitterMessage; - } - - private void tryToSetGeoLocation(IngesterTwitterMessage message) { - if (message.getGeoTaggedLocation() != null) { - message.setGeoLocation(message.getGeoTaggedLocation()); - } else if (message.hasGeoLocation()) { - LOG.warn("Message {} already contains geoLocation", message.getId()); - } else { - try { - GeoObject extracted = extractLatLon(message); - if (extracted != null) { - message.setGeoLocation(extracted); - extractedLatLons.increment(); - } - } catch (NumberFormatException e) { - LOG.debug("Message contains bad latitude and longitude: " + message.getOrigLocation(), e); - badLatLons.increment(); - } catch (Exception e) { - LOG.error("Failed to extract geo location from " + message.getOrigLocation() + " for tweet " - + message.getId(), e); - } - } - } - - private GeoObject extractLatLon(IngesterTwitterMessage message) throws NumberFormatException { - double[] latlon = LocationUtils.extractLatLon(message); - return latlon == null - ? 
null - : new GeoObject(latlon[0], latlon[1], ThriftGeoLocationSource.TWEET_TEXT); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TextFeatureExtractionWorkersStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TextFeatureExtractionWorkersStage.java deleted file mode 100644 index 45e967d43..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TextFeatureExtractionWorkersStage.java +++ /dev/null @@ -1,148 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.concurrent.BlockingQueue; -import java.util.concurrent.ExecutorService; -import javax.naming.NamingException; - -import com.google.common.collect.Queues; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.text.TweetParser; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -@ConsumedTypes(TwitterMessage.class) -@ProducesConsumed -public class TextFeatureExtractionWorkersStage extends TwitterBaseStage - { - private static final Logger LOG = - LoggerFactory.getLogger(TextFeatureExtractionWorkersStage.class); - - private static final int NUM_THREADS = 5; - private static final int MAX_QUEUE_SIZE = 100; - private static final long SLOW_TWEET_TIME_MILLIS = 1000; - private ExecutorService executorService = null; - - // define as static so that FeatureExtractorWorker thread can use it - private static SearchRateCounter slowTweetCounter; - private SearchRateCounter threadErrorCounter; - private SearchRateCounter threadInterruptionCounter; - private final BlockingQueue messageQueue = - Queues.newLinkedBlockingQueue(MAX_QUEUE_SIZE); - private TweetParser tweetParser; - - @Override - public void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - slowTweetCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_feature_extraction_slow_tweet_count"); - SearchCustomGauge.export(getStageNamePrefix() + "_queue_size", - messageQueue::size); - threadErrorCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_thread_error"); - threadInterruptionCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_thread_interruption"); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - innerSetup(); - // anything threading related, we don't need in V2 as of yet. 
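
The thread-pool setup that follows hands incoming tweets to NUM_THREADS parser workers through the bounded messageQueue, so a full queue blocks the upstream pipeline thread instead of dropping tweets. A rough standalone sketch of that pattern, using only JDK types and illustrative names (not the ingester's real classes):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal producer/consumer sketch: a bounded queue plus a fixed pool of parser workers.
final class ParserWorkerPoolSketch {
  private static final int NUM_THREADS = 5;
  private static final int MAX_QUEUE_SIZE = 100;

  private final BlockingQueue<String> messageQueue = new ArrayBlockingQueue<>(MAX_QUEUE_SIZE);
  private final ExecutorService workers = Executors.newFixedThreadPool(NUM_THREADS);

  void start() {
    for (int i = 0; i < NUM_THREADS; i++) {
      workers.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            // Blocks until a message is available.
            String message = messageQueue.take();
            parseAndEmit(message);
          } catch (InterruptedException e) {
            // Restore the interrupt flag so the loop condition sees it and the worker exits.
            Thread.currentThread().interrupt();
          }
        }
      });
    }
  }

  // Called by the pipeline thread; blocks (applies backpressure) when the queue is full.
  void process(String message) throws InterruptedException {
    messageQueue.put(message);
  }

  private void parseAndEmit(String message) {
    // Placeholder for tweetParser.parseTweet(...) followed by emitAndCount(...).
  }

  void shutdown() {
    workers.shutdownNow();
  }
}
```
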
- executorService = wireModule.getThreadPool(NUM_THREADS); - for (int i = 0; i < NUM_THREADS; ++i) { - executorService.submit(new FeatureExtractorWorker()); - } - LOG.info("Initialized {} parsers.", NUM_THREADS); - } - - @Override - protected void innerSetup() { - tweetParser = new TweetParser(); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof TwitterMessage)) { - LOG.error("Object is not a TwitterMessage object: {}", obj); - return; - } - - TwitterMessage message = TwitterMessage.class.cast(obj); - try { - messageQueue.put(message); - } catch (InterruptedException ie) { - LOG.error("Interrupted exception adding to the queue", ie); - } - } - - private boolean tryToParse(TwitterMessage message) { - boolean isAbleToParse = false; - long startTime = clock.nowMillis(); - // Parse tweet and merge the parsed out features into what we already have in the message. - try { - synchronized (this) { - tweetParser.parseTweet(message, false, false); - } - // If parsing failed we don't need to pass the tweet down the pipeline. - isAbleToParse = true; - } catch (Exception e) { - threadErrorCounter.increment(); - LOG.error("Uncaught exception from tweetParser.parseTweet()", e); - } finally { - long elapsedTime = clock.nowMillis() - startTime; - if (elapsedTime > SLOW_TWEET_TIME_MILLIS) { - LOG.debug("Took {}ms to parse tweet {}: {}", elapsedTime, message.getId(), message); - slowTweetCounter.increment(); - } - } - return isAbleToParse; - } - - @Override - protected TwitterMessage innerRunStageV2(TwitterMessage message) { - if (!tryToParse(message)) { - throw new PipelineStageRuntimeException("Failed to parse, not passing to next stage."); - } - - return message; - } - - @Override - public void innerPostprocess() { - if (executorService != null) { - executorService.shutdownNow(); - } - executorService = null; - } - - private class FeatureExtractorWorker implements Runnable { - public void run() { - while (!Thread.currentThread().isInterrupted()) { - TwitterMessage message = null; - try { - message = messageQueue.take(); - } catch (InterruptedException ie) { - threadInterruptionCounter.increment(); - LOG.error("Interrupted exception polling from the queue", ie); - continue; - } finally { - if (tryToParse(message)) { - emitAndCount(message); - } - } - } - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TextQualityEvaluationWorkerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TextQualityEvaluationWorkerStage.java deleted file mode 100644 index 27e5d5c0c..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TextQualityEvaluationWorkerStage.java +++ /dev/null @@ -1,181 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; -import java.util.List; -import java.util.concurrent.BlockingQueue; -import java.util.concurrent.ExecutorService; -import javax.naming.NamingException; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Queues; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.relevance.classifiers.TweetEvaluator; -import 
com.twitter.search.common.relevance.classifiers.TweetOffensiveEvaluator; -import com.twitter.search.common.relevance.classifiers.TweetTextClassifier; -import com.twitter.search.common.relevance.classifiers.TweetTextEvaluator; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.scorers.TweetTextScorer; - -@ConsumedTypes(TwitterMessage.class) -@ProducesConsumed -public class TextQualityEvaluationWorkerStage extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(TextQualityEvaluationWorkerStage.class); - - private static final int NUM_THREADS = 5; - private static final long SLOW_TWEET_TIME_MILLIS = 1000; - // based on the batched branch 3 elements in the queue times 200 tweets per batch. - private static final int MAX_QUEUE_SIZE = 100; - private final BlockingQueue messages = - Queues.newLinkedBlockingQueue(MAX_QUEUE_SIZE); - - private static final String DO_TEXT_QUALITY_EVALUATION_DECIDER_KEY_TEMPLATE = - "ingester_%s_do_text_quality_evaluation"; - - private ExecutorService executorService = null; - private SearchRateCounter unscoredTweetCounter; - private TweetTextClassifier classifier; - private final TweetTextScorer scorer = new TweetTextScorer(null); - // Defined as static so that ClassifierWorker thread can use it - private static SearchRateCounter slowTweetCounter; - private SearchRateCounter threadErrorCounter; - private SearchRateCounter threadInterruptionCounter; - private String deciderKey; - - @Override - public void initStats() { - super.initStats(); - innerSetupStats(); - } - - public SearchRateCounter getUnscoredTweetCounter() { - return unscoredTweetCounter; - } - - @Override - protected void innerSetupStats() { - threadErrorCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_thread_error"); - threadInterruptionCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_thread_interruption"); - unscoredTweetCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_tweets_unscored_count"); - slowTweetCounter = SearchRateCounter.export( - getStageNamePrefix() + "_text_quality_evaluation_slow_tweet_count"); - SearchCustomGauge.export(getStageNamePrefix() + "_queue_size", messages::size); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - innerSetup(); - executorService = wireModule.getThreadPool(NUM_THREADS); - for (int i = 0; i < NUM_THREADS; i++) { - executorService.submit( - new ClassifierWorker()); - } - LOG.info("Initialized {} classfiers and scorers.", NUM_THREADS); - } - - @Override - protected void innerSetup() throws NamingException { - deciderKey = String.format(DO_TEXT_QUALITY_EVALUATION_DECIDER_KEY_TEMPLATE, - earlybirdCluster.getNameForStats()); - List supportedPenguinVersions = wireModule.getPenguinVersions(); - TweetOffensiveEvaluator tweetOffensiveEvaluator = wireModule.getTweetOffensiveEvaluator(); - - ImmutableList evaluators = - ImmutableList.of(tweetOffensiveEvaluator, new TweetTextEvaluator()); - classifier = new TweetTextClassifier( - evaluators, - wireModule.getServiceIdentifier(), - supportedPenguinVersions); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof TwitterMessage)) { - LOG.error("Object is not a TwitterMessage object: {}", obj); - return; - } - - if (decider.isAvailable(deciderKey)) { - TwitterMessage message = TwitterMessage.class.cast(obj); - try { - 
messages.put(message); - } catch (InterruptedException ie) { - LOG.error("Interrupted exception adding to the queue", ie); - } - } else { - unscoredTweetCounter.increment(); - emitAndCount(obj); - } - } - - @Override - protected TwitterMessage innerRunStageV2(TwitterMessage message) { - if (decider.isAvailable(deciderKey)) { - classifyAndScore(message); - } else { - unscoredTweetCounter.increment(); - } - - return message; - } - - private void classifyAndScore(TwitterMessage message) { - long startTime = clock.nowMillis(); - try { - // The tweet signature computed here might not be correct, since we did not resolve the - // tweet URLs yet. This is why BasicIndexingConverter does not set the tweet signature - // feature on the event it builds. - // - // We correct the tweet signature later in the ComputeTweetSignatureStage, and - // DelayedIndexingConverter sets this feature on the URL update event it creates. - synchronized (this) { - scorer.classifyAndScoreTweet(classifier, message); - } - } catch (Exception e) { - threadErrorCounter.increment(); - LOG.error("Uncaught exception from classifyAndScoreTweet", e); - } finally { - long elapsedTime = clock.nowMillis() - startTime; - if (elapsedTime > SLOW_TWEET_TIME_MILLIS) { - LOG.warn("Took {}ms to classify and score tweet {}: {}", - elapsedTime, message.getId(), message); - slowTweetCounter.increment(); - } - } - } - - @Override - public void innerPostprocess() { - if (executorService != null) { - executorService.shutdownNow(); - } - executorService = null; - } - - private class ClassifierWorker implements Runnable { - public void run() { - while (!Thread.currentThread().isInterrupted()) { - TwitterMessage message; - try { - message = messages.take(); - } catch (InterruptedException ie) { - threadInterruptionCounter.increment(); - LOG.error("Interrupted exception polling from the queue", ie); - continue; - } - - // We want to emit even if we couldn't score the tweet. 
- classifyAndScore(message); - emitAndCount(message); - } - } - } -} - diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TextUrlsFeatureExtractionStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TextUrlsFeatureExtractionStage.java deleted file mode 100644 index 3843223d8..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TextUrlsFeatureExtractionStage.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducesConsumed; - -import com.twitter.search.common.relevance.classifiers.TweetOffensiveEvaluator; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.scorers.TweetTextScorer; -import com.twitter.search.common.relevance.text.TweetParser; -import com.twitter.search.ingester.model.IngesterTwitterMessage; - -@ConsumedTypes(TwitterMessage.class) -@ProducesConsumed -public class TextUrlsFeatureExtractionStage extends TwitterBaseStage - { - private final TweetParser tweetParser = new TweetParser(); - private TweetOffensiveEvaluator offensiveEvaluator; - private final TweetTextScorer tweetTextScorer = new TweetTextScorer(null); - - @Override - protected void doInnerPreprocess() { - innerSetup(); - } - - @Override - protected void innerSetup() { - offensiveEvaluator = wireModule.getTweetOffensiveEvaluator(); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not a TwitterMessage instance: " + obj); - } - - IngesterTwitterMessage message = IngesterTwitterMessage.class.cast(obj); - extract(message); - emitAndCount(message); - } - - private void extract(IngesterTwitterMessage message) { - tweetParser.parseUrls(message); - offensiveEvaluator.evaluate(message); - tweetTextScorer.scoreTweet(message); - } - - @Override - protected IngesterTwitterMessage innerRunStageV2(IngesterTwitterMessage message) { - extract(message); - return message; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftTweetParserStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftTweetParserStage.java deleted file mode 100644 index 6a9d4369d..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftTweetParserStage.java +++ /dev/null @@ -1,178 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.List; -import java.util.Map; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.ingester.model.IngesterTweetEvent; -import 
com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.twitter.thriftparse.ThriftTweetParsingException; -import com.twitter.search.ingester.pipeline.twitter.thriftparse.TweetEventParseHelper; -import com.twitter.tweetypie.thriftjava.TweetCreateEvent; -import com.twitter.tweetypie.thriftjava.TweetDeleteEvent; -import com.twitter.tweetypie.thriftjava.TweetEventData; - -@ConsumedTypes(IngesterTweetEvent.class) -@ProducedTypes(IngesterTwitterMessage.class) -public class ThriftTweetParserStage extends TwitterBaseStage { - private static final Logger LOG = LoggerFactory.getLogger(ThriftTweetParserStage.class); - - // TweetEventData is a union of all possible tweet event types. TweetEventData._Fields is an enum - // that corresponds to the fields in that union. So essentially, TweetEventData._Fields tells us - // which tweet event we're getting inside TweetEventData. We want to keep track of how many tweet - // events of each type we're getting. - private final Map tweetEventCounters = - Maps.newEnumMap(TweetEventData._Fields.class); - - private final List tweetCreateEventBranches = Lists.newArrayList(); - private final List tweetDeleteEventBranches = Lists.newArrayList(); - - private boolean shouldIndexProtectedTweets; - private SearchCounter totalEventsCount; - private SearchCounter thriftParsingErrorsCount; - - private List supportedPenguinVersions; - - @Override - protected void initStats() { - super.initStats(); - - for (TweetEventData._Fields field : TweetEventData._Fields.values()) { - tweetEventCounters.put( - field, - this.makeStageCounter(field.name().toLowerCase() + "_count")); - } - totalEventsCount = this.makeStageCounter("total_events_count"); - thriftParsingErrorsCount = this.makeStageCounter("thrift_parsing_errors_count"); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - supportedPenguinVersions = wireModule.getPenguinVersions(); - LOG.info("Supported penguin versions: {}", supportedPenguinVersions); - - shouldIndexProtectedTweets = earlybirdCluster == EarlybirdCluster.PROTECTED - || earlybirdCluster == EarlybirdCluster.REALTIME_CG; - - Preconditions.checkState(!tweetDeleteEventBranches.isEmpty(), - "At least one delete branch must be specified."); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof TweetEventData || obj instanceof IngesterTweetEvent)) { - LOG.error("Object is not a TweetEventData or IngesterTweetEvent: {}", obj); - throw new StageException(this, "Object is not a TweetEventData or IngesterTweetEvent"); - } - - supportedPenguinVersions = wireModule.getCurrentlyEnabledPenguinVersions(); - - try { - IngesterTweetEvent ingesterTweetEvent = (IngesterTweetEvent) obj; - TweetEventData tweetEventData = ingesterTweetEvent.getData(); - DebugEvents debugEvents = ingesterTweetEvent.getDebugEvents(); - - // Determine if the message is a tweet delete event before the next stages mutate it. - IngesterTwitterMessage message = getTwitterMessage(tweetEventData, debugEvents); - boolean shouldEmitMessage = message != null - && message.isIndexable(shouldIndexProtectedTweets); - - if (shouldEmitMessage) { - if (!message.isDeleted()) { - emitAndCount(message); - - for (String tweetCreateEventBranch : tweetCreateEventBranches) { - // If we need to send the message to another branch, we need to make a copy. - // Otherwise, we'll have multiple stages mutating the same object in parallel. 
- IngesterTwitterMessage tweetCreateEventBranchMessage = - getTwitterMessage(tweetEventData, debugEvents); - emitToBranchAndCount(tweetCreateEventBranch, tweetCreateEventBranchMessage); - } - } else { - for (String tweetDeleteEventBranch : tweetDeleteEventBranches) { - // If we need to send the message to another branch, we need to make a copy. - // Otherwise, we'll have multiple stages mutating the same object in parallel. - IngesterTwitterMessage tweetDeleteEventBranchMessage = - getTwitterMessage(tweetEventData, debugEvents); - emitToBranchAndCount(tweetDeleteEventBranch, tweetDeleteEventBranchMessage); - } - } - } - } catch (ThriftTweetParsingException e) { - thriftParsingErrorsCount.increment(); - LOG.error("Failed to parse Thrift tweet event: " + obj, e); - throw new StageException(this, e); - } - } - - @Nullable - private IngesterTwitterMessage getTwitterMessage( - @Nonnull TweetEventData tweetEventData, - @Nullable DebugEvents debugEvents) - throws ThriftTweetParsingException { - totalEventsCount.increment(); - - // TweetEventData is a union of all possible tweet event types. TweetEventData._Fields is an - // enum that corresponds to all TweetEventData fields. By calling TweetEventData.getSetField(), - // we can determine which field is set. - TweetEventData._Fields tweetEventDataField = tweetEventData.getSetField(); - Preconditions.checkNotNull(tweetEventDataField); - tweetEventCounters.get(tweetEventDataField).increment(); - - if (tweetEventDataField == TweetEventData._Fields.TWEET_CREATE_EVENT) { - TweetCreateEvent tweetCreateEvent = tweetEventData.getTweet_create_event(); - return TweetEventParseHelper.getTwitterMessageFromCreationEvent( - tweetCreateEvent, supportedPenguinVersions, debugEvents); - } - if (tweetEventDataField == TweetEventData._Fields.TWEET_DELETE_EVENT) { - TweetDeleteEvent tweetDeleteEvent = tweetEventData.getTweet_delete_event(); - return TweetEventParseHelper.getTwitterMessageFromDeletionEvent( - tweetDeleteEvent, supportedPenguinVersions, debugEvents); - } - return null; - } - - /** - * Sets the branches to which all TweetDeleteEvents should be emitted. - * - * @param tweetDeleteEventBranchNames A comma-separated list of branches. - */ - public void setTweetDeleteEventBranchNames(String tweetDeleteEventBranchNames) { - parseBranches(tweetDeleteEventBranchNames, tweetDeleteEventBranches); - } - - /** - * Sets the additional branches to which all TweetCreateEvents should be emitted. - * - * @param tweetCreateEventBranchNames A comma-separated list of branches. 
- */ - public void setTweetCreateEventBranchNames(String tweetCreateEventBranchNames) { - parseBranches(tweetCreateEventBranchNames, tweetCreateEventBranches); - } - - private void parseBranches(String branchNames, List branches) { - branches.clear(); - for (String branch : branchNames.split(",")) { - String trimmedBranch = branch.trim(); - Preconditions.checkState(!trimmedBranch.isEmpty(), "Branches cannot be empty strings."); - branches.add(trimmedBranch); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftVersionedEventsConverter.java b/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftVersionedEventsConverter.java deleted file mode 100644 index 1cb21e188..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/ThriftVersionedEventsConverter.java +++ /dev/null @@ -1,132 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import com.google.common.collect.Lists; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; -import com.twitter.search.common.schema.thriftjava.ThriftDocument; -import com.twitter.search.common.schema.thriftjava.ThriftField; -import com.twitter.search.common.schema.thriftjava.ThriftFieldData; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; - -/** - * Converter for {@code ThriftVersionedEvents}. - * - */ -public class ThriftVersionedEventsConverter { - private static final long UNUSED_USER_ID = -1L; - - private Iterable penguinVersions; - - public ThriftVersionedEventsConverter(Iterable penguinVersions) { - this.penguinVersions = penguinVersions; - } - - /** - * Creates a DELETE IngesterThriftVersionedEvents instance for the given tweet ID and user ID. - * - * @param tweetId The tweet ID. - * @param userId The user ID. - * @param debugEvents The DebugEvents to propagate to the returned IngesterThriftVersionedEvents - * instance. - * @return A DELETE IngesterThriftVersionedEvents instance with the given tweet and user IDs. - */ - public IngesterThriftVersionedEvents toDelete( - long tweetId, long userId, DebugEvents debugEvents) { - ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() - .setEventType(ThriftIndexingEventType.DELETE) - .setUid(tweetId); - return toThriftVersionedEvents(tweetId, userId, thriftIndexingEvent, debugEvents); - } - - /** - * Creates an OUT_OF_ORDER_APPEND IngesterThriftVersionedEvents instance for the given tweet ID - * and the given value for the given field. - * - * @param tweetId The tweet ID. - * @param field The updated field. - * @param value The new field value. - * @param debugEvents The DebugEvents to propagate to the returned IngesterThriftVersionedEvents - * instance. - * @return An OUT_OF_ORDER_APPEND IngesterThriftVersionedEvents instance with the given tweet ID - * and value for the field. 
- */ - public IngesterThriftVersionedEvents toOutOfOrderAppend( - long tweetId, - EarlybirdFieldConstants.EarlybirdFieldConstant field, - long value, - DebugEvents debugEvents) { - ThriftField updateField = new ThriftField() - .setFieldConfigId(field.getFieldId()) - .setFieldData(new ThriftFieldData().setLongValue(value)); - ThriftDocument document = new ThriftDocument() - .setFields(Lists.newArrayList(updateField)); - ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() - .setEventType(ThriftIndexingEventType.OUT_OF_ORDER_APPEND) - .setUid(tweetId) - .setDocument(document); - return toThriftVersionedEvents(tweetId, UNUSED_USER_ID, thriftIndexingEvent, debugEvents); - } - - - /** - * Creates a PARTIAL_UPDATE IngesterThriftVersionedEvents instance for the given tweet ID and the - * given value for the given feature. - * - * @param tweetId The tweet ID. - * @param feature The updated feature. - * @param value The new feature value. - * @param debugEvents The DebugEvents to propagate to the returned IngesterThriftVersionedEvents - * instance. - * @return A PARTIAL_UPDATE IngesterThriftVersionedEvents instance with the given tweet ID and - * value for the feature. - */ - public IngesterThriftVersionedEvents toPartialUpdate( - long tweetId, - EarlybirdFieldConstants.EarlybirdFieldConstant feature, - int value, - DebugEvents debugEvents) { - ThriftField updateField = new ThriftField() - .setFieldConfigId(feature.getFieldId()) - .setFieldData(new ThriftFieldData().setIntValue(value)); - ThriftDocument document = new ThriftDocument() - .setFields(Lists.newArrayList(updateField)); - ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() - .setEventType(ThriftIndexingEventType.PARTIAL_UPDATE) - .setUid(tweetId) - .setDocument(document); - return toThriftVersionedEvents(tweetId, UNUSED_USER_ID, thriftIndexingEvent, debugEvents); - } - - // Wraps the given ThriftIndexingEvent into a ThriftVersionedEvents instance. 
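
For orientation before the private wrapper below, a hedged usage sketch of this converter as an update-handling stage might call it. The enclosing stage, the parameter variables, and the FAVORITE_COUNT constant name are assumptions for illustration; only the converter methods themselves come from this file.

```java
// Hypothetical helper methods inside an update-handling stage.
private IngesterThriftVersionedEvents buildFavoriteCountUpdate(
    ThriftVersionedEventsConverter converter,
    long tweetId,
    int newFavoriteCount,
    DebugEvents debugEvents) {
  // PARTIAL_UPDATE: a single int feature value attached to an existing tweet document.
  // FAVORITE_COUNT is an assumed constant name, used here only for illustration.
  return converter.toPartialUpdate(
      tweetId,
      EarlybirdFieldConstants.EarlybirdFieldConstant.FAVORITE_COUNT,
      newFavoriteCount,
      debugEvents);
}

private IngesterThriftVersionedEvents buildDelete(
    ThriftVersionedEventsConverter converter, long tweetId, long userId, DebugEvents debugEvents) {
  // DELETE: no document payload, just the tweet and user ids.
  return converter.toDelete(tweetId, userId, debugEvents);
}
```
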
- private IngesterThriftVersionedEvents toThriftVersionedEvents( - long tweetId, long userId, ThriftIndexingEvent thriftIndexingEvent, DebugEvents debugEvents) { - if (!thriftIndexingEvent.isSetCreateTimeMillis() - && (debugEvents != null) - && debugEvents.isSetCreatedAt()) { - thriftIndexingEvent.setCreateTimeMillis(debugEvents.getCreatedAt().getEventTimestampMillis()); - } - - Map versionedEvents = new HashMap<>(); - for (PenguinVersion penguinVersion : penguinVersions) { - versionedEvents.put(penguinVersion.getByteValue(), thriftIndexingEvent); - } - - IngesterThriftVersionedEvents events = - new IngesterThriftVersionedEvents(userId, versionedEvents); - events.setId(tweetId); - events.setDebugEvents(debugEvents); - return events; - } - - public void updatePenguinVersions(List updatePenguinVersions) { - penguinVersions = updatePenguinVersions; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TweetEventDeserializerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TweetEventDeserializerStage.java deleted file mode 100644 index 96d7c2018..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TweetEventDeserializerStage.java +++ /dev/null @@ -1,137 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; -import com.google.common.annotations.VisibleForTesting; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; -import org.apache.thrift.TDeserializer; -import org.apache.thrift.TException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import com.twitter.search.common.debug.DebugEventUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.ingester.model.IngesterTweetEvent; -import com.twitter.search.ingester.model.KafkaRawRecord; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; - -/** - * Deserializes {@link KafkaRawRecord} into IngesterTweetEvent and emits those. 
- */ -@ConsumedTypes(KafkaRawRecord.class) -@ProducedTypes(IngesterTweetEvent.class) -public class TweetEventDeserializerStage extends TwitterBaseStage - { - private static final Logger LOG = LoggerFactory.getLogger(TweetEventDeserializerStage.class); - - // Limit how much the logs get polluted - private static final int MAX_OOM_SERIALIZED_BYTES_LOGGED = 5000; - private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray(); - - private final TDeserializer deserializer = new TDeserializer(); - - private SearchCounter outOfMemoryErrors; - private SearchCounter outOfMemoryErrors2; - private SearchCounter totalEventsCount; - private SearchCounter validEventsCount; - private SearchCounter deserializationErrorsCount; - - @Override - public void initStats() { - super.initStats(); - innerSetupStats(); - } - - @Override - protected void innerSetupStats() { - outOfMemoryErrors = SearchCounter.export(getStageNamePrefix() + "_out_of_memory_errors"); - outOfMemoryErrors2 = SearchCounter.export(getStageNamePrefix() + "_out_of_memory_errors_2"); - totalEventsCount = SearchCounter.export(getStageNamePrefix() + "_total_events_count"); - validEventsCount = SearchCounter.export(getStageNamePrefix() + "_valid_events_count"); - deserializationErrorsCount = - SearchCounter.export(getStageNamePrefix() + "_deserialization_errors_count"); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof KafkaRawRecord)) { - throw new StageException(this, "Object is not a KafkaRawRecord: " + obj); - } - - KafkaRawRecord kafkaRecord = (KafkaRawRecord) obj; - IngesterTweetEvent tweetEvent = tryDeserializeRecord(kafkaRecord); - - if (tweetEvent != null) { - emitAndCount(tweetEvent); - } - } - - @Override - protected IngesterTweetEvent innerRunStageV2(KafkaRawRecord kafkaRawRecord) { - IngesterTweetEvent ingesterTweetEvent = tryDeserializeRecord(kafkaRawRecord); - if (ingesterTweetEvent == null) { - throw new PipelineStageRuntimeException("failed to deserialize KafkaRawRecord : " - + kafkaRawRecord); - } - return ingesterTweetEvent; - } - - private IngesterTweetEvent tryDeserializeRecord(KafkaRawRecord kafkaRecord) { - try { - totalEventsCount.increment(); - IngesterTweetEvent tweetEvent = deserialize(kafkaRecord); - validEventsCount.increment(); - return tweetEvent; - } catch (OutOfMemoryError e) { - try { - outOfMemoryErrors.increment(); - byte[] bytes = kafkaRecord.getData(); - int limit = Math.min(bytes.length, MAX_OOM_SERIALIZED_BYTES_LOGGED); - StringBuilder sb = new StringBuilder(2 * limit + 100) - .append("OutOfMemoryError deserializing ").append(bytes.length).append(" bytes: "); - appendBytesAsHex(sb, bytes, MAX_OOM_SERIALIZED_BYTES_LOGGED); - LOG.error(sb.toString(), e); - } catch (OutOfMemoryError e2) { - outOfMemoryErrors2.increment(); - } - } - - return null; - - } - - private IngesterTweetEvent deserialize(KafkaRawRecord kafkaRecord) { - try { - IngesterTweetEvent ingesterTweetEvent = new IngesterTweetEvent(); - synchronized (this) { - deserializer.deserialize(ingesterTweetEvent, kafkaRecord.getData()); - } - // Record the created_at time and then we first saw this tweet in the ingester for tracking - // down the ingestion pipeline. 
- addDebugEventsToIncomingTweet(ingesterTweetEvent, kafkaRecord.getReadAtTimestampMs()); - return ingesterTweetEvent; - } catch (TException e) { - LOG.error("Unable to deserialize TweetEventData", e); - deserializationErrorsCount.increment(); - } - return null; - } - - private void addDebugEventsToIncomingTweet( - IngesterTweetEvent ingesterTweetEvent, long readAtTimestampMs) { - DebugEventUtil.setCreatedAtDebugEvent( - ingesterTweetEvent, ingesterTweetEvent.getFlags().getTimestamp_ms()); - DebugEventUtil.setProcessingStartedAtDebugEvent(ingesterTweetEvent, readAtTimestampMs); - - // The TweetEventDeserializerStage takes in a byte[] representation of a tweet, so debug events - // are not automatically appended by TwitterBaseStage. We do that explicitly here. - DebugEventUtil.addDebugEvent(ingesterTweetEvent, getFullStageName(), clock.nowMillis()); - } - - @VisibleForTesting - static void appendBytesAsHex(StringBuilder sb, byte[] bytes, int maxLength) { - int limit = Math.min(bytes.length, maxLength); - for (int j = 0; j < limit; j++) { - sb.append(HEX_ARRAY[(bytes[j] >>> 4) & 0x0F]); - sb.append(HEX_ARRAY[bytes[j] & 0x0F]); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBaseStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBaseStage.java deleted file mode 100644 index ba3787bd0..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBaseStage.java +++ /dev/null @@ -1,360 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; - -import java.util.Arrays; -import java.util.Collection; -import java.util.Collections; -import java.util.List; -import java.util.Optional; -import java.util.concurrent.ConcurrentMap; -import java.util.concurrent.TimeUnit; - -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.commons.lang.StringUtils; -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.stage.InstrumentedBaseStage; - -import com.twitter.common.metrics.Metrics; -import com.twitter.common.util.Clock; -import com.twitter.decider.Decider; -import com.twitter.search.common.debug.DebugEventAccumulator; -import com.twitter.search.common.debug.DebugEventUtil; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.search.ingester.pipeline.util.PipelineStageRuntimeException; -import com.twitter.search.ingester.pipeline.wire.WireModule; - -/** - * Common functionality for all stages. - */ -public class TwitterBaseStage extends InstrumentedBaseStage { - // Currently, all stages run in separate threads, so we could use simple maps here. - // However, it seems safer to use concurrent maps, in case we ever change our stage set up. - // The performance impact should be negligible. 
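
Every stage in this package extends this base class. Before the implementation details, a minimal, hypothetical sketch of what a concrete V2 stage looks like; the class name, counter name, and String payload type are assumptions, and the base class's generic parameters are inferred from runStageV2/innerRunStageV2 further down.

```java
// Hypothetical concrete stage; names and types are illustrative only.
public class UppercasingStage extends TwitterBaseStage<String, String> {
  private SearchRateCounter uppercasedCounter;

  @Override
  protected void innerSetupStats() {
    // Per-stage stats are exported under the stage's normalized name prefix.
    uppercasedCounter = SearchRateCounter.export(getStageNamePrefix() + "_uppercased");
  }

  @Override
  protected String innerRunStageV2(String input) {
    // runStageV2() wraps this call with timing stats and the drop-items decider check.
    uppercasedCounter.increment();
    return input.toUpperCase();
  }
}
```
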
- private final ConcurrentMap, SearchRateCounter> branchEmitObjectsRateCounters = - Maps.newConcurrentMap(); - private final ConcurrentMap, SearchRateCounter> - branchEmitBatchObjectsRateCounters = Maps.newConcurrentMap(); - - private String stageNamePrefix = null; - - protected WireModule wireModule; - protected Decider decider; - protected Clock clock; - protected EarlybirdCluster earlybirdCluster; - - private String fullStageName = null; - private Percentile processPercentile = null; - private SearchTimerStats processTimerStats = null; - private SearchRateCounter droppedItems = null; - private SearchLongGauge stageExceptions = null; - - private SearchRateCounter incomingBatchesRateCounter; - private SearchRateCounter incomingBatchObjectsRateCounter; - - private List passThroughToBranches = Collections.emptyList(); - private List additionalEmitToBranches = Collections.emptyList(); - - private boolean passThroughDownstream = false; - private boolean emitDownstream = true; - - private String dropItemsDeciderKey; - - // From XML config. - public void setPassThroughToBranches(String passThroughToBranchesString) { - // This is a comma-delimited string which is a list of branches to which we just - // pass through the incoming object without any processing/filtering. - this.passThroughToBranches = Arrays.asList(passThroughToBranchesString.split(",")); - } - - // From XML config. - public void setAdditionalEmitToBranches(String emitToBranchesString) { - // This is a comma-delimited string which is a list of branches to which we - // will emit when we call actuallyEmitAndCount(obj). - this.additionalEmitToBranches = Arrays.asList(emitToBranchesString.split(",")); - } - - // From XML config. - public void setPassThroughDownstream(boolean passThroughDownstream) { - // If true, we emit the raw object downstream - this.passThroughDownstream = passThroughDownstream; - } - - // From XML config. - public void setEmitDownstream(boolean emitDownstream) { - // If true, we emit the processed object downstream. - this.emitDownstream = emitDownstream; - } - - @Override - public final void innerPreprocess() throws StageException { - try { - setupEssentialObjects(); - doInnerPreprocess(); - } catch (NamingException e) { - throw new StageException(this, "Failed to initialize stage.", e); - } - } - - /*** - * Sets up all necessary objects for this stage of the Pipeline. Previously, this task was done - * by the preprocess() method provided by the ACP library. - * @throws PipelineStageException - */ - public void setupStageV2() throws PipelineStageException { - try { - setupCommonStats(); - innerSetupStats(); - setupEssentialObjects(); - innerSetup(); - } catch (NamingException e) { - throw new PipelineStageException(this, "Failed to initialize stage", e); - } - } - - protected void innerSetup() throws PipelineStageException, NamingException { } - - /*** - * Takes in an argument of type T, processes it and returns an argument of Type R. This is the - * main method of a pipeline stage. - */ - public R runStageV2(T arg) { - long startingTime = startProcessing(); - R processed = innerRunStageV2(arg); - endProcessing(startingTime); - return processed; - } - - /*** - * Takes in an argument of type T, processes it and pushes the processed element to some place. - * This method does not return anything as any time this method is called on a stage, it means - * there is no stage after this one. An example stage is any KafkaProducerStage. 
- */ - public void runFinalStageOfBranchV2(T arg) { - long startingTime = startProcessing(); - innerRunFinalStageOfBranchV2(arg); - endProcessing(startingTime); - } - - protected R innerRunStageV2(T arg) { - return null; - } - - protected void innerRunFinalStageOfBranchV2(T arg) { } - - /*** - * called at the end of a pipeline. Cleans up all resources of the stage. - */ - public void cleanupStageV2() { } - - private void setupEssentialObjects() throws NamingException { - wireModule = WireModule.getWireModule(); - decider = wireModule.getDecider(); - clock = wireModule.getClock(); - earlybirdCluster = wireModule.getEarlybirdCluster(); - dropItemsDeciderKey = - "drop_items_" + earlybirdCluster.getNameForStats() + "_" + fullStageName; - } - - protected void doInnerPreprocess() throws StageException, NamingException { } - - @Override - protected void initStats() { - super.initStats(); - setupCommonStats(); - // Export stage timers - SearchCustomGauge.export(stageNamePrefix + "_queue_size", - () -> Optional.ofNullable(getQueueSizeAverage()).orElse(0.0)); - SearchCustomGauge.export(stageNamePrefix + "_queue_percentage_full", - () -> Optional.ofNullable(getQueuePercentFull()).orElse(0.0)); - - // This only called once on startup - // In some unit tests, getQueueCapacity can return null. Hence this guard is added. - // getQueueCapacity() does not return null here in prod. - SearchLongGauge.export(stageNamePrefix + "_queue_capacity") - .set(getQueueCapacity() == null ? 0 : getQueueCapacity()); - } - - private void setupCommonStats() { - // If the stage is instantiated only once, the class name is used for stats export - // If the stage is instantiated multiple times, the "stageName" specified in the - // pipeline definition xml file is also included. - if (StringUtils.isBlank(this.getStageName())) { - fullStageName = this.getClass().getSimpleName(); - } else { - fullStageName = String.format( - "%s_%s", - this.getClass().getSimpleName(), - this.getStageName()); - } - - stageNamePrefix = Metrics.normalizeName(fullStageName).toLowerCase(); - - droppedItems = SearchRateCounter.export(stageNamePrefix + "_dropped_messages"); - stageExceptions = SearchLongGauge.export(stageNamePrefix + "_stage_exceptions"); - - processTimerStats = SearchTimerStats.export(stageNamePrefix, TimeUnit.NANOSECONDS, - true); - processPercentile = PercentileUtil.createPercentile(stageNamePrefix); - - incomingBatchesRateCounter = SearchRateCounter.export(stageNamePrefix + "_incoming_batches"); - incomingBatchObjectsRateCounter = - SearchRateCounter.export(stageNamePrefix + "_incoming_batch_objects"); - } - - protected void innerSetupStats() { - - } - - protected SearchCounter makeStageCounter(String counterName) { - return SearchCounter.export(getStageNamePrefix() + "_" + counterName); - } - - private SearchRateCounter getEmitObjectsRateCounterFor(Optional maybeBranch) { - return getRateCounterFor(maybeBranch, "emit_objects", branchEmitObjectsRateCounters); - } - - private SearchRateCounter getEmitBatchObjectsRateCounterFor(Optional maybeBranch) { - return getRateCounterFor(maybeBranch, "emit_batch_objects", branchEmitBatchObjectsRateCounters); - } - - private SearchRateCounter getRateCounterFor( - Optional maybeBranch, - String statSuffix, - ConcurrentMap, SearchRateCounter> rateCountersMap) { - SearchRateCounter rateCounter = rateCountersMap.get(maybeBranch); - if (rateCounter == null) { - String branchSuffix = maybeBranch.map(b -> "_" + b.toLowerCase()).orElse(""); - rateCounter = SearchRateCounter.export(stageNamePrefix + 
branchSuffix + "_" + statSuffix); - SearchRateCounter existingRateCounter = rateCountersMap.putIfAbsent(maybeBranch, rateCounter); - if (existingRateCounter != null) { - Preconditions.checkState( - existingRateCounter == rateCounter, - "SearchRateCounter.export() should always return the same stat instance."); - } - } - return rateCounter; - } - - public String getStageNamePrefix() { - return stageNamePrefix; - } - - public String getFullStageName() { - return fullStageName; - } - - @Override - public void process(Object obj) throws StageException { - long startTime = System.nanoTime(); - try { - // this needs to be updated before calling super.process() so that innerProcess can actually - // use the updated incoming rates - updateIncomingBatchStats(obj); - // Track timing events for when tweets enter each stage. - captureStageDebugEvents(obj); - - if (DeciderUtil.isAvailableForRandomRecipient(decider, dropItemsDeciderKey)) { - droppedItems.increment(); - return; - } - - super.process(obj); - - // Now emit the object raw to wherever we need to - emitToPassThroughBranches(obj); - } finally { - long processTime = System.nanoTime() - startTime; - processTimerStats.timerIncrement(processTime); - processPercentile.record(processTime); - stageExceptions.set(stats.getExceptionCount()); - } - } - - protected long startProcessing() { - long startingTime = System.nanoTime(); - checkIfObjectShouldBeEmittedOrThrowRuntimeException(); - return startingTime; - } - - protected void endProcessing(long startingTime) { - long processTime = System.nanoTime() - startingTime; - processTimerStats.timerIncrement(processTime); - processPercentile.record(processTime); - } - - private void checkIfObjectShouldBeEmittedOrThrowRuntimeException() { - if (DeciderUtil.isAvailableForRandomRecipient(decider, dropItemsDeciderKey)) { - droppedItems.increment(); - throw new PipelineStageRuntimeException("Object does not have to be processed and passed" - + " to the next stage"); - } - } - - private void emitToPassThroughBranches(Object obj) { - for (String branch : passThroughToBranches) { - actuallyEmitAndCount(Optional.of(branch), obj); - } - if (passThroughDownstream) { - actuallyEmitAndCount(Optional.empty(), obj); - } - } - - private void updateIncomingBatchStats(Object obj) { - incomingBatchesRateCounter.increment(); - incomingBatchObjectsRateCounter.increment(getBatchSizeForStats(obj)); - } - - protected void captureStageDebugEvents(Object obj) { - if (obj instanceof DebugEventAccumulator) { - DebugEventUtil.addDebugEvent( - (DebugEventAccumulator) obj, getFullStageName(), clock.nowMillis()); - } else if (obj instanceof Collection) { - DebugEventUtil.addDebugEventToCollection( - (Collection) obj, getFullStageName(), clock.nowMillis()); - } else { - SearchCounter debugEventsNotSupportedCounter = SearchCounter.export( - stageNamePrefix + "_debug_events_not_supported_for_" + obj.getClass()); - debugEventsNotSupportedCounter.increment(); - } - } - - protected int getBatchSizeForStats(Object obj) { - return (obj instanceof Collection) ? 
((Collection) obj).size() : 1; - } - - protected void emitAndCount(Object obj) { - for (String branch : additionalEmitToBranches) { - actuallyEmitAndCount(Optional.of(branch), obj); - } - if (emitDownstream) { - actuallyEmitAndCount(Optional.empty(), obj); - } - } - - protected void emitToBranchAndCount(String branch, Object obj) { - actuallyEmitAndCount(Optional.of(branch), obj); - } - - // If the branch is none, emit downstream - private void actuallyEmitAndCount(Optional maybeBranch, Object obj) { - if (maybeBranch.isPresent()) { - emit(maybeBranch.get(), obj); - } else { - emit(obj); - } - getEmitObjectsRateCounterFor(maybeBranch).increment(); - getEmitBatchObjectsRateCounterFor(maybeBranch).increment(getBatchSizeForStats(obj)); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBatchedBaseStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBatchedBaseStage.java deleted file mode 100644 index fda5b6166..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/TwitterBatchedBaseStage.java +++ /dev/null @@ -1,309 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter; -import java.util.ArrayList; -import java.util.Collection; -import java.util.Iterator; -import java.util.Optional; -import java.util.Queue; -import java.util.concurrent.CompletableFuture; -import java.util.concurrent.TimeUnit; -import java.util.stream.Collectors; -import javax.naming.NamingException; - -import scala.runtime.BoxedUnit; - -import com.google.common.collect.Lists; -import com.google.common.collect.Queues; - -import org.apache.commons.pipeline.StageException; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchTimerStats; -import com.twitter.search.ingester.pipeline.util.BatchedElement; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.util.Function; -import com.twitter.util.Future; - -public abstract class TwitterBatchedBaseStage extends - TwitterBaseStage> { - private static final Logger LOG = LoggerFactory.getLogger(TwitterBatchedBaseStage.class); - - protected final Queue> queue = - Queues.newLinkedBlockingQueue(MAX_BATCHING_QUEUE_SIZE); - - private int batchedStageBatchSize = 100; - private int forceProcessAfterMs = 500; - - private long lastProcessingTime; - - private SearchRateCounter timeBasedQueueFlush; - private SearchRateCounter sizeBasedQueueFlush; - private SearchRateCounter eventsFailed; - private SearchRateCounter numberOfCallsToNextBatchIfReady; - private SearchTimerStats batchExecutionTime; - private SearchTimerStats batchFailedExecutionTime; - private SearchRateCounter validElements; - private SearchRateCounter batchedElements; - private SearchRateCounter emittedElements; - private static final int MAX_BATCHING_QUEUE_SIZE = 10000; - - // force the implementing class to set type correctly to avoid catching issues at runtime - protected abstract Class getQueueObjectType(); - - // up to the developer on how each batch is processed. 
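
The queue and the two thresholds above (batchedStageBatchSize, forceProcessAfterMs) drive the flush policy implemented in nextBatchIfReady() further down: release a batch of batchedStageBatchSize elements once enough have accumulated, or flush the whole queue when more than forceProcessAfterMs has passed since the last flush. A rough standalone sketch of that policy, using only JDK types and illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Time-or-size flush policy sketch; not the ingester's real class.
final class TimeOrSizeBatcher<T> {
  private final Queue<T> queue = new ConcurrentLinkedQueue<>();
  private final int batchSize;          // size trigger, e.g. 100 like batchedStageBatchSize
  private final long forceAfterMillis;  // time trigger, e.g. 500 like forceProcessAfterMs
  private long lastFlushMillis;

  TimeOrSizeBatcher(int batchSize, long forceAfterMillis, long nowMillis) {
    this.batchSize = batchSize;
    this.forceAfterMillis = forceAfterMillis;
    this.lastFlushMillis = nowMillis;
  }

  void add(T element) {
    queue.add(element);
  }

  // Returns a batch if either trigger fired, otherwise Optional.empty().
  Optional<List<T>> nextBatchIfReady(long nowMillis) {
    if (queue.isEmpty()) {
      return Optional.empty();
    }
    boolean timeTrigger = nowMillis - lastFlushMillis > forceAfterMillis;
    boolean sizeTrigger = queue.size() >= batchSize;
    if (!timeTrigger && !sizeTrigger) {
      return Optional.empty();
    }
    // Time trigger drains everything that is waiting; size trigger takes exactly one batch.
    int limit = timeTrigger ? queue.size() : batchSize;
    List<T> batch = new ArrayList<>(limit);
    for (int i = 0; i < limit && !queue.isEmpty(); i++) {
      batch.add(queue.poll());
    }
    lastFlushMillis = nowMillis;
    return Optional.of(batch);
  }
}
```
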
- protected abstract Future> innerProcessBatch(Collection> - batch); - - // classes that need to update their batch e.g after a decider change - // can override this - protected void updateBatchSize() { - } - - protected Collection extractOnlyElementsFromBatch(Collection> batch) { - Collection elementsOnly = new ArrayList<>(); - - for (BatchedElement batchedElement : batch) { - elementsOnly.add(batchedElement.getItem()); - } - return elementsOnly; - } - /** - * This function is used to filter the elements that we want to batch. - * e.g. if a tweet has urls batch it to resolve the urls, if it doesn't contain urls - * do not batch. - * - * @param element to be evaluated - */ - protected abstract boolean needsToBeBatched(T element); - - /** - * Tranform from type T to U element. - * T and U might be different types so this function will help with the transformation - * if the incoming T element is filtered out and is bypass directly to the next stage - * that takes incoming objects of type U - * - * @param element incoming element - */ - protected abstract R transform(T element); - - protected void reEnqueueAndRetry(BatchedElement batchedElement) { - queue.add(batchedElement); - } - - @Override - protected void initStats() { - super.initStats(); - commonInnerSetupStats(); - } - - private void commonInnerSetupStats() { - timeBasedQueueFlush = SearchRateCounter.export(getStageNamePrefix() - + "_time_based_queue_flush"); - sizeBasedQueueFlush = SearchRateCounter.export(getStageNamePrefix() - + "_size_based_queue_flush"); - batchExecutionTime = SearchTimerStats.export(getStageNamePrefix() - + "_batch_execution_time", TimeUnit.MILLISECONDS, false, true); - batchFailedExecutionTime = SearchTimerStats.export(getStageNamePrefix() - + "_batch_failed_execution_time", TimeUnit.MILLISECONDS, false, true); - eventsFailed = SearchRateCounter.export(getStageNamePrefix() + "_events_dropped"); - SearchCustomGauge.export(getStageNamePrefix() + "_batched_stage_queue_size", queue::size); - numberOfCallsToNextBatchIfReady = SearchRateCounter.export(getStageNamePrefix() - + "_calls_to_nextBatchIfReady"); - validElements = SearchRateCounter.export(getStageNamePrefix() + "_valid_elements"); - batchedElements = SearchRateCounter.export(getStageNamePrefix() + "_batched_elements"); - emittedElements = SearchRateCounter.export(getStageNamePrefix() + "_emitted_elements"); - } - - @Override - protected void innerSetupStats() { - commonInnerSetupStats(); - } - - // return a possible batch of elements to process. If we have enough for one batch - protected Optional>> nextBatchIfReady() { - numberOfCallsToNextBatchIfReady.increment(); - Optional>> batch = Optional.empty(); - - if (!queue.isEmpty()) { - long elapsed = clock.nowMillis() - lastProcessingTime; - if (elapsed > forceProcessAfterMs) { - batch = Optional.of(Lists.newArrayList(queue)); - timeBasedQueueFlush.increment(); - queue.clear(); - } else if (queue.size() >= batchedStageBatchSize) { - batch = Optional.of(queue.stream() - .limit(batchedStageBatchSize) - .map(element -> queue.remove()) - .collect(Collectors.toList())); - sizeBasedQueueFlush.increment(); - } - } - return batch; - } - - @Override - public void innerProcess(Object obj) throws StageException { - T element; - if (getQueueObjectType().isInstance(obj)) { - element = getQueueObjectType().cast(obj); - } else { - throw new StageException(this, "Trying to add an object of the wrong type to a queue. 
" - + getQueueObjectType().getSimpleName() - + " is the expected type"); - } - - if (!tryToAddElementToBatch(element)) { - emitAndCount(transform(element)); - } - - tryToSendBatchedRequest(); - } - - @Override - protected CompletableFuture innerRunStageV2(T element) { - CompletableFuture completableFuture = new CompletableFuture<>(); - if (!tryToAddElementToBatch(element, completableFuture)) { - completableFuture.complete(transform(element)); - } - - tryToSendBatchedRequestV2(); - - return completableFuture; - } - - private boolean tryToAddElementToBatch(T element, CompletableFuture cf) { - boolean needsToBeBatched = needsToBeBatched(element); - if (needsToBeBatched) { - queue.add(new BatchedElement<>(element, cf)); - } - - return needsToBeBatched; - } - - private boolean tryToAddElementToBatch(T element) { - return tryToAddElementToBatch(element, CompletableFuture.completedFuture(null)); - } - - private void tryToSendBatchedRequest() { - Optional>> maybeToProcess = nextBatchIfReady(); - if (maybeToProcess.isPresent()) { - Collection> batch = maybeToProcess.get(); - lastProcessingTime = clock.nowMillis(); - processBatch(batch, getOnSuccessFunction(lastProcessingTime), - getOnFailureFunction(batch, lastProcessingTime)); - } - } - - private void tryToSendBatchedRequestV2() { - Optional>> maybeToProcess = nextBatchIfReady(); - if (maybeToProcess.isPresent()) { - Collection> batch = maybeToProcess.get(); - lastProcessingTime = clock.nowMillis(); - processBatch(batch, getOnSuccessFunctionV2(batch, lastProcessingTime), - getOnFailureFunctionV2(batch, lastProcessingTime)); - } - } - - private void processBatch(Collection> batch, - Function, BoxedUnit> onSuccess, - Function onFailure) { - updateBatchSize(); - - Future> futureComputation = innerProcessBatch(batch); - - futureComputation.onSuccess(onSuccess); - - futureComputation.onFailure(onFailure); - } - - private Function, BoxedUnit> getOnSuccessFunction(long started) { - return Function.cons((elements) -> { - elements.forEach(this::emitAndCount); - batchExecutionTime.timerIncrement(clock.nowMillis() - started); - }); - } - - private Function, BoxedUnit> getOnSuccessFunctionV2(Collection> - batch, long started) { - return Function.cons((elements) -> { - Iterator> iterator = batch.iterator(); - for (R element : elements) { - if (iterator.hasNext()) { - iterator.next().getCompletableFuture().complete(element); - } else { - LOG.error("Getting Response from Batched Request, but no CompleteableFuture object" - + " to complete."); - } - } - batchExecutionTime.timerIncrement(clock.nowMillis() - started); - - }); - } - - private Function getOnFailureFunction(Collection> - batch, long started) { - return Function.cons((throwable) -> { - batch.forEach(batchedElement -> { - eventsFailed.increment(); - // pass the tweet event down better to index an incomplete event than nothing at all - emitAndCount(transform(batchedElement.getItem())); - }); - batchFailedExecutionTime.timerIncrement(clock.nowMillis() - started); - LOG.error("Failed processing batch", throwable); - }); - } - - private Function getOnFailureFunctionV2(Collection> - batch, long started) { - return Function.cons((throwable) -> { - batch.forEach(batchedElement -> { - eventsFailed.increment(); - R itemTransformed = transform(batchedElement.getItem()); - // complete the future, its better to index an incomplete event than nothing at all - batchedElement.getCompletableFuture().complete(itemTransformed); - }); - batchFailedExecutionTime.timerIncrement(clock.nowMillis() - started); - 
LOG.error("Failed processing batch", throwable); - }); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - try { - commonInnerSetup(); - } catch (PipelineStageException e) { - throw new StageException(this, e); - } - } - - private void commonInnerSetup() throws PipelineStageException, NamingException { - updateBatchSize(); - - if (batchedStageBatchSize < 1) { - throw new PipelineStageException(this, - "Batch size must be set at least to 1 for batched stages but is set to" - + batchedStageBatchSize); - } - - if (forceProcessAfterMs < 1) { - throw new PipelineStageException(this, "forceProcessAfterMs needs to be at least 1 " - + "ms but is set to " + forceProcessAfterMs); - } - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - commonInnerSetup(); - } - - // Setters for configuration parameters - public void setBatchedStageBatchSize(int maxElementsToWaitFor) { - this.batchedStageBatchSize = maxElementsToWaitFor; - } - - public void setForceProcessAfter(int forceProcessAfterMS) { - this.forceProcessAfterMs = forceProcessAfterMS; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/filters/BUILD b/src/java/com/twitter/search/ingester/pipeline/twitter/filters/BUILD deleted file mode 100644 index e5349558e..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/filters/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "decider/src/main/scala", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/filters/IngesterValidMessageFilter.java b/src/java/com/twitter/search/ingester/pipeline/twitter/filters/IngesterValidMessageFilter.java deleted file mode 100644 index 8f32521a4..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/filters/IngesterValidMessageFilter.java +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.filters; - -import java.util.EnumSet; -import java.util.Set; - -import com.twitter.decider.Decider; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.entities.TwitterMessageUtil; - -public class IngesterValidMessageFilter { - public static final String KEEP_NULLCAST_DECIDER_KEY = - "ingester_all_keep_nullcasts"; - public static final String STRIP_SUPPLEMENTARY_EMOJIS_DECIDER_KEY_PREFIX = - "valid_message_filter_strip_supplementary_emojis_"; - - protected final Decider decider; - - public IngesterValidMessageFilter(Decider decider) { - this.decider = decider; - } - - /** - * Evaluate a message to see if it matches the filter or not. - * - * @param message to evaluate - * @return true if this message should be emitted. 
- */ - public boolean accepts(TwitterMessage message) { - return TwitterMessageUtil.validateTwitterMessage( - message, getStripEmojisFields(), acceptNullcast()); - } - - private Set getStripEmojisFields() { - Set stripEmojisFields = - EnumSet.noneOf(TwitterMessageUtil.Field.class); - for (TwitterMessageUtil.Field field : TwitterMessageUtil.Field.values()) { - if (DeciderUtil.isAvailableForRandomRecipient( - decider, - STRIP_SUPPLEMENTARY_EMOJIS_DECIDER_KEY_PREFIX + field.getNameForStats())) { - stripEmojisFields.add(field); - } - } - return stripEmojisFields; - } - - protected final boolean acceptNullcast() { - return DeciderUtil.isAvailableForRandomRecipient(decider, KEEP_NULLCAST_DECIDER_KEY); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/BUILD b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/BUILD deleted file mode 100644 index fc981f0f7..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/BUILD +++ /dev/null @@ -1,32 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/slf4j:slf4j-api", - "decider/src/main/scala", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/util/io/kafka", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/pipeline/twitter", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/java/com/twitter/search/ingester/pipeline/wire", - "src/java/org/apache/commons/pipeline", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/search/common/debug:debug-java", - "util/util-core:util-core-util", - "util/util-core/src/main/java/com/twitter/util/javainterop", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/DeleteUpdateEventsKafkaProducerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/DeleteUpdateEventsKafkaProducerStage.java deleted file mode 100644 index 37ecff5bb..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/DeleteUpdateEventsKafkaProducerStage.java +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; - -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.twitter.ThriftVersionedEventsConverter; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; - -@ConsumedTypes(IngesterTwitterMessage.class) -public class DeleteUpdateEventsKafkaProducerStage extends KafkaProducerStage - { - private ThriftVersionedEventsConverter converter; - - public DeleteUpdateEventsKafkaProducerStage() { - super(); - } - - 
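The filter above builds the set of fields to strip supplementary emojis from by consulting one decider key per `TwitterMessageUtil.Field`. A simplified illustration of that per-field gating, using a plain `Predicate<String>` in place of the Decider and a hypothetical `Field` enum:

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Predicate;

class PerFieldGateExample {
  // Hypothetical stand-in for TwitterMessageUtil.Field.
  enum Field { TEXT, USER_NAME, LOCATION }

  static final String KEY_PREFIX = "valid_message_filter_strip_supplementary_emojis_";

  /** Collect every field whose decider-style key is currently enabled. */
  static Set<Field> enabledFields(Predicate<String> isKeyEnabled) {
    Set<Field> fields = EnumSet.noneOf(Field.class);
    for (Field field : Field.values()) {
      if (isKeyEnabled.test(KEY_PREFIX + field.name().toLowerCase())) {
        fields.add(field);
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // Pretend only the TEXT key is switched on.
    Set<Field> fields = enabledFields(key -> key.endsWith("_text"));
    System.out.println(fields); // [TEXT]
  }
}
```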
public DeleteUpdateEventsKafkaProducerStage(String topicName, String clientId, - String clusterPath) { - super(topicName, clientId, clusterPath); - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - super.innerSetup(); - commonInnerSetup(); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - commonInnerSetup(); - } - - private void commonInnerSetup() throws NamingException { - converter = new ThriftVersionedEventsConverter(wireModule.getPenguinVersions()); - - } - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterTwitterMessage)) { - throw new StageException(this, "Object is not an IngesterTwitterMessage: " + obj); - } - - IngesterTwitterMessage message = (IngesterTwitterMessage) obj; - innerRunFinalStageOfBranchV2(message); - } - - @Override - protected void innerRunFinalStageOfBranchV2(IngesterTwitterMessage message) { - converter.updatePenguinVersions(wireModule.getCurrentlyEnabledPenguinVersions()); - - Preconditions.checkArgument(message.getFromUserTwitterId().isPresent(), - "Missing user ID."); - - super.tryToSendEventsToKafka(converter.toDelete( - message.getTweetId(), message.getUserId(), message.getDebugEvents())); - } - - -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaConsumerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaConsumerStage.java deleted file mode 100644 index 55675fc3c..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaConsumerStage.java +++ /dev/null @@ -1,245 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import java.time.Duration; -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; - -import org.apache.commons.pipeline.Pipeline; -import org.apache.commons.pipeline.StageDriver; -import org.apache.commons.pipeline.StageException; -import org.apache.kafka.clients.consumer.ConsumerRecords; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.common.TopicPartition; -import org.apache.kafka.common.errors.SaslAuthenticationException; -import org.apache.kafka.common.errors.SerializationException; -import org.apache.kafka.common.serialization.Deserializer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.ingester.pipeline.twitter.TwitterBaseStage; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.search.ingester.pipeline.util.PipelineUtil; - -/** - * A stage to read Thrift payloads from a Kafka topic. 
- */ -public abstract class KafkaConsumerStage extends TwitterBaseStage { - private static final Logger LOG = LoggerFactory.getLogger(KafkaConsumerStage.class); - private static final String SHUT_DOWN_ON_AUTH_FAIL = "shut_down_on_authentication_fail"; - private String kafkaClientId; - private String kafkaTopicName; - private String kafkaConsumerGroupId; - private String kafkaClusterPath; - private int maxPollRecords = 1; - private int pollTimeoutMs = 1000; - private boolean partitioned; - private String deciderKey; - private final Deserializer deserializer; - private SearchCounter pollCount; - private SearchCounter deserializationErrorCount; - private SearchRateCounter droppedMessages; - - private KafkaConsumer kafkaConsumer; - - protected KafkaConsumerStage(String kafkaClientId, String kafkaTopicName, - String kafkaConsumerGroupId, String kafkaClusterPath, - String deciderKey, Deserializer deserializer) { - - this.kafkaClientId = kafkaClientId; - this.kafkaTopicName = kafkaTopicName; - this.kafkaConsumerGroupId = kafkaConsumerGroupId; - this.kafkaClusterPath = kafkaClusterPath; - this.deciderKey = deciderKey; - this.deserializer = deserializer; - } - - protected KafkaConsumerStage(Deserializer deserializer) { - this.deserializer = deserializer; - } - - @Override - protected void initStats() { - super.initStats(); - commonInnerSetupStats(); - } - - private void commonInnerSetupStats() { - pollCount = SearchCounter.export(getStageNamePrefix() + "_poll_count"); - deserializationErrorCount = - SearchCounter.export(getStageNamePrefix() + "_deserialization_error_count"); - droppedMessages = - SearchRateCounter.export(getStageNamePrefix() + "_dropped_messages"); - } - - @Override - protected void innerSetupStats() { - commonInnerSetupStats(); - } - - @Override - protected void doInnerPreprocess() { - commonInnerSetup(); - PipelineUtil.feedStartObjectToStage(this); - } - - private void commonInnerSetup() { - Preconditions.checkNotNull(kafkaClientId); - Preconditions.checkNotNull(kafkaClusterPath); - Preconditions.checkNotNull(kafkaTopicName); - - kafkaConsumer = wireModule.newKafkaConsumer( - kafkaClusterPath, - deserializer, - kafkaClientId, - kafkaConsumerGroupId, - maxPollRecords); - if (partitioned) { - kafkaConsumer.assign(Collections.singletonList( - new TopicPartition(kafkaTopicName, wireModule.getPartition()))); - } else { - kafkaConsumer.subscribe(Collections.singleton(kafkaTopicName)); - } - } - - @Override - protected void innerSetup() { - commonInnerSetup(); - } - - @Override - public void innerProcess(Object obj) throws StageException { - StageDriver driver = ((Pipeline) stageContext).getStageDriver(this); - while (driver.getState() == StageDriver.State.RUNNING) { - pollAndEmit(); - } - - LOG.info("StageDriver state is no longer RUNNING, closing Kafka consumer."); - closeKafkaConsumer(); - } - - @VisibleForTesting - void pollAndEmit() throws StageException { - try { - List records = poll(); - for (R record : records) { - emitAndCount(record); - } - } catch (PipelineStageException e) { - throw new StageException(this, e); - } - } - - /*** - * Poll Kafka and get the items from the topic. Record stats. 
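`KafkaConsumerStage` above either assigns itself a single partition of the topic (when `partitioned` is set) or joins a consumer group via subscribe, then polls in a loop for as long as the stage driver is running. A rough sketch of that consume loop with the plain Apache Kafka client, leaving out the wire-module wiring, stats, and decider checks; the broker address, topic name, and running flag are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumeLoopSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder; the stage resolves brokers from the cluster path
    props.put("group.id", "ingester-sketch");
    props.put("max.poll.records", "1");                // the stage defaults to one record per poll

    KafkaConsumer<String, byte[]> consumer =
        new KafkaConsumer<>(props, new StringDeserializer(), new ByteArrayDeserializer());

    boolean partitioned = true;
    String topic = "ingester-topic";                   // placeholder topic name
    if (partitioned) {
      // Pin this instance to the partition matching its own partition ID.
      consumer.assign(Collections.singletonList(new TopicPartition(topic, 0)));
    } else {
      consumer.subscribe(Collections.singleton(topic));
    }

    AtomicBoolean running = new AtomicBoolean(true);
    try {
      while (running.get()) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, byte[]> record : records) {
          // The real stage would deserialize the value and emit it downstream here,
          // optionally dropping it when its decider key is off.
          System.out.println("got " + record.value().length + " bytes");
        }
      }
    } finally {
      consumer.close();
    }
  }
}
```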
- * @return - * @throws PipelineStageException - */ - public List pollFromTopic() throws PipelineStageException { - long startingTime = startProcessing(); - List polledItems = poll(); - endProcessing(startingTime); - return polledItems; - } - - private List poll() throws PipelineStageException { - List recordsFromKafka = new ArrayList<>(); - try { - ConsumerRecords records = kafkaConsumer.poll(Duration.ofMillis(pollTimeoutMs)); - pollCount.increment(); - records.iterator().forEachRemaining(record -> { - if (deciderKey == null || DeciderUtil.isAvailableForRandomRecipient(decider, deciderKey)) { - recordsFromKafka.add(record.value()); - } else { - droppedMessages.increment(); - } - }); - - } catch (SerializationException e) { - deserializationErrorCount.increment(); - LOG.error("Failed to deserialize the value.", e); - } catch (SaslAuthenticationException e) { - if (DeciderUtil.isAvailableForRandomRecipient(decider, SHUT_DOWN_ON_AUTH_FAIL)) { - wireModule.getPipelineExceptionHandler() - .logAndShutdown("Authentication error connecting to Kafka broker: " + e); - } else { - throw new PipelineStageException(this, "Kafka Authentication Error", e); - } - } catch (Exception e) { - throw new PipelineStageException(e); - } - - return recordsFromKafka; - } - - @VisibleForTesting - void closeKafkaConsumer() { - try { - kafkaConsumer.close(); - LOG.info("Kafka kafkaConsumer for {} was closed", getFullStageName()); - } catch (Exception e) { - log.error("Failed to close Kafka kafkaConsumer", e); - } - } - - @Override - public void release() { - closeKafkaConsumer(); - super.release(); - } - - @Override - public void cleanupStageV2() { - closeKafkaConsumer(); - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaClientId(String kafkaClientId) { - this.kafkaClientId = kafkaClientId; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaTopicName(String kafkaTopicName) { - this.kafkaTopicName = kafkaTopicName; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaConsumerGroupId(String kafkaConsumerGroupId) { - this.kafkaConsumerGroupId = kafkaConsumerGroupId; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setMaxPollRecords(int maxPollRecords) { - this.maxPollRecords = maxPollRecords; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setPollTimeoutMs(int pollTimeoutMs) { - this.pollTimeoutMs = pollTimeoutMs; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setPartitioned(boolean partitioned) { - this.partitioned = partitioned; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setDeciderKey(String deciderKey) { - this.deciderKey = deciderKey; - } - - @VisibleForTesting - KafkaConsumer getKafkaConsumer() { - return kafkaConsumer; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaClusterPath(String kafkaClusterPath) { - this.kafkaClusterPath = kafkaClusterPath; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaProducerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaProducerStage.java deleted file mode 100644 index 84252d0da..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaProducerStage.java +++ /dev/null @@ -1,259 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import java.util.Collection; -import java.util.Map; - -import 
javax.naming.NamingException; - -import scala.runtime.BoxedUnit; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -import org.apache.commons.pipeline.StageException; -import org.apache.kafka.clients.producer.ProducerRecord; -import org.apache.kafka.clients.producer.RecordMetadata; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.search.common.debug.DebugEventUtil; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.decider.DeciderUtil; -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; -import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; -import com.twitter.search.common.util.io.kafka.CompactThriftSerializer; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.search.ingester.pipeline.twitter.TwitterBaseStage; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; -import com.twitter.search.ingester.pipeline.wire.IngesterPartitioner; -import com.twitter.util.Await; -import com.twitter.util.Future; - -public class KafkaProducerStage extends TwitterBaseStage { - private static final Logger LOG = LoggerFactory.getLogger(KafkaProducerStage.class); - - private static final Logger LATE_EVENTS_LOG = LoggerFactory.getLogger( - KafkaProducerStage.class.getName() + ".LateEvents"); - - private final Map> processingLatenciesStats = - Maps.newEnumMap(ThriftIndexingEventType.class); - - private String kafkaClientId; - private String kafkaTopicName; - private String kafkaClusterPath; - private SearchCounter sendCount; - private String perPartitionSendCountFormat; - private String deciderKey; - - protected BlockingFinagleKafkaProducer kafkaProducer; - - private int processingLatencyThresholdMillis = 10000; - - public KafkaProducerStage() { } - - public KafkaProducerStage(String topicName, String clientId, String clusterPath) { - this.kafkaTopicName = topicName; - this.kafkaClientId = clientId; - this.kafkaClusterPath = clusterPath; - } - - @Override - protected void initStats() { - super.initStats(); - setupCommonStats(); - } - - private void setupCommonStats() { - sendCount = SearchCounter.export(getStageNamePrefix() + "_send_count"); - perPartitionSendCountFormat = getStageNamePrefix() + "_partition_%d_send_count"; - for (ThriftIndexingEventType eventType : ThriftIndexingEventType.values()) { - processingLatenciesStats.put( - eventType, - PercentileUtil.createPercentile( - getStageNamePrefix() + "_" + eventType.name().toLowerCase() - + "_processing_latency_ms")); - } - } - - @Override - protected void innerSetupStats() { - setupCommonStats(); - } - - private boolean isEnabled() { - if (this.deciderKey != null) { - return DeciderUtil.isAvailableForRandomRecipient(decider, deciderKey); - } else { - // No decider means it's enabled. 
- return true; - } - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - try { - innerSetup(); - } catch (PipelineStageException e) { - throw new StageException(this, e); - } - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - Preconditions.checkNotNull(kafkaClientId); - Preconditions.checkNotNull(kafkaClusterPath); - Preconditions.checkNotNull(kafkaTopicName); - - kafkaProducer = wireModule.newFinagleKafkaProducer( - kafkaClusterPath, - new CompactThriftSerializer(), - kafkaClientId, - IngesterPartitioner.class); - - int numPartitions = wireModule.getPartitionMappingManager().getNumPartitions(); - int numKafkaPartitions = kafkaProducer.partitionsFor(kafkaTopicName).size(); - if (numPartitions != numKafkaPartitions) { - throw new PipelineStageException(String.format( - "Number of partitions for Kafka topic %s (%d) != number of expected partitions (%d)", - kafkaTopicName, numKafkaPartitions, numPartitions)); - } - } - - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterThriftVersionedEvents)) { - throw new StageException(this, "Object is not IngesterThriftVersionedEvents: " + obj); - } - - IngesterThriftVersionedEvents events = (IngesterThriftVersionedEvents) obj; - tryToSendEventsToKafka(events); - } - - protected void tryToSendEventsToKafka(IngesterThriftVersionedEvents events) { - if (!isEnabled()) { - return; - } - - DebugEvents debugEvents = events.getDebugEvents(); - // We don't propagate debug events to Kafka, because they take about 50% - // of the storage space. - events.unsetDebugEvents(); - - ProducerRecord record = new ProducerRecord<>( - kafkaTopicName, - null, - clock.nowMillis(), - null, - events); - - sendRecordToKafka(record).ensure(() -> { - updateEventProcessingLatencyStats(events, debugEvents); - return null; - }); - } - - private Future sendRecordToKafka( - ProducerRecord record) { - Future result; - try { - result = kafkaProducer.send(record); - } catch (Exception e) { - // Even though KafkaProducer.send() returns a Future, it can throw a synchronous exception, - // so we translate synchronous exceptions into a Future.exception so we handle all exceptions - // consistently. - result = Future.exception(e); - } - - return result.onSuccess(recordMetadata -> { - sendCount.increment(); - SearchCounter.export( - String.format(perPartitionSendCountFormat, recordMetadata.partition())).increment(); - return BoxedUnit.UNIT; - }).onFailure(e -> { - stats.incrementExceptions(); - LOG.error("Sending a record failed.", e); - return BoxedUnit.UNIT; - }); - } - - private void updateEventProcessingLatencyStats(IngesterThriftVersionedEvents events, - DebugEvents debugEvents) { - if ((debugEvents != null) && debugEvents.isSetProcessingStartedAt()) { - // Get the one indexing event out of all events we're sending. - Collection indexingEvents = events.getVersionedEvents().values(); - Preconditions.checkState(!indexingEvents.isEmpty()); - ThriftIndexingEventType eventType = indexingEvents.iterator().next().getEventType(); - - // Check if the event took too much time to get to this current point. 
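`KafkaProducerStage` above builds a `ProducerRecord` with an explicit timestamp, sends it, and translates synchronous exceptions from `send()` into the same failure path as asynchronous ones so that all errors are counted in one place. A stripped-down sketch of that pattern with the plain Apache Kafka producer (the stage itself goes through Twitter's Finagle-based wrapper and a custom partitioner); broker address, topic, and payload are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder

    KafkaProducer<String, byte[]> producer =
        new KafkaProducer<>(props, new StringSerializer(), new ByteArraySerializer());

    // Key is null so the configured partitioner decides the partition;
    // the timestamp is set explicitly, as the stage does with clock.nowMillis().
    ProducerRecord<String, byte[]> record = new ProducerRecord<>(
        "ingester-topic", null, System.currentTimeMillis(), null, new byte[] {1, 2, 3});

    try {
      producer.send(record, (metadata, exception) -> {
        if (exception != null) {
          // Asynchronous failure: the stage logs and bumps an exception counter here.
          System.err.println("send failed: " + exception);
        } else {
          // Success: the stage increments an overall and a per-partition send counter.
          System.out.println("sent to partition " + metadata.partition());
        }
      });
    } catch (Exception e) {
      // send() can also throw synchronously (e.g. serialization or buffer exhaustion);
      // the stage funnels these into the same failure handling as the async case.
      System.err.println("send failed synchronously: " + e);
    } finally {
      producer.close();
    }
  }
}
```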
- long processingLatencyMillis = - clock.nowMillis() - debugEvents.getProcessingStartedAt().getEventTimestampMillis(); - processingLatenciesStats.get(eventType).record(processingLatencyMillis); - - if (processingLatencyMillis >= processingLatencyThresholdMillis) { - LATE_EVENTS_LOG.warn("Event of type {} for tweet {} was processed in {}ms: {}", - eventType.name(), - events.getTweetId(), - processingLatencyMillis, - DebugEventUtil.debugEventsToString(debugEvents)); - } - } - } - - public void setProcessingLatencyThresholdMillis(int processingLatencyThresholdMillis) { - this.processingLatencyThresholdMillis = processingLatencyThresholdMillis; - } - - @Override - public void innerPostprocess() throws StageException { - try { - commonCleanup(); - } catch (Exception e) { - throw new StageException(this, e); - } - } - - @Override - public void cleanupStageV2() { - try { - commonCleanup(); - } catch (Exception e) { - LOG.error("Error trying to clean up KafkaProducerStage.", e); - } - } - - private void commonCleanup() throws Exception { - Await.result(kafkaProducer.close()); - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaClientId(String kafkaClientId) { - this.kafkaClientId = kafkaClientId; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaTopicName(String kafkaTopicName) { - this.kafkaTopicName = kafkaTopicName; - } - - @VisibleForTesting - public BlockingFinagleKafkaProducer getKafkaProducer() { - return kafkaProducer; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setDeciderKey(String deciderKey) { - this.deciderKey = deciderKey; - } - - @SuppressWarnings("unused") // set from pipeline config - public void setKafkaClusterPath(String kafkaClusterPath) { - this.kafkaClusterPath = kafkaClusterPath; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaRawRecordConsumerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaRawRecordConsumerStage.java deleted file mode 100644 index 2cb777090..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/KafkaRawRecordConsumerStage.java +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.apache.commons.pipeline.validation.ProducedTypes; -import org.apache.kafka.common.serialization.Deserializer; - -import com.twitter.finatra.kafka.serde.internal.BaseDeserializer; -import com.twitter.search.ingester.model.KafkaRawRecord; -import com.twitter.util.Time; - -/** - * Kafka consumer stage that emits the binary payload wrapped in {@code ByteArray}. 
- */ -@ConsumedTypes(String.class) -@ProducedTypes(KafkaRawRecord.class) -public class KafkaRawRecordConsumerStage extends KafkaConsumerStage { - public KafkaRawRecordConsumerStage() { - super(getDeserializer()); - } - - private static Deserializer getDeserializer() { - return new BaseDeserializer() { - @Override - public KafkaRawRecord deserialize(String topic, byte[] data) { - return new KafkaRawRecord(data, Time.now().inMillis()); - } - }; - } - - public KafkaRawRecordConsumerStage(String kafkaClientId, String kafkaTopicName, - String kafkaConsumerGroupId, String kafkaClusterPath, - String deciderKey) { - super(kafkaClientId, kafkaTopicName, kafkaConsumerGroupId, kafkaClusterPath, deciderKey, - getDeserializer()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/RetweetAndReplyUpdateEventsKafkaProducerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/RetweetAndReplyUpdateEventsKafkaProducerStage.java deleted file mode 100644 index 4227617cb..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/RetweetAndReplyUpdateEventsKafkaProducerStage.java +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import org.apache.commons.pipeline.validation.ConsumedTypes; - -import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; - -@ConsumedTypes(ThriftVersionedEvents.class) -public class RetweetAndReplyUpdateEventsKafkaProducerStage extends KafkaProducerStage - { - public RetweetAndReplyUpdateEventsKafkaProducerStage(String kafkaTopic, String clientId, - String clusterPath) { - super(kafkaTopic, clientId, clusterPath); - } - - public RetweetAndReplyUpdateEventsKafkaProducerStage() { - super(); - } - - @Override - protected void innerRunFinalStageOfBranchV2(IngesterThriftVersionedEvents events) { - super.tryToSendEventsToKafka(events); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/TweetThriftVersionedEventsKafkaProducerStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/TweetThriftVersionedEventsKafkaProducerStage.java deleted file mode 100644 index 9e96471d7..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/kafka/TweetThriftVersionedEventsKafkaProducerStage.java +++ /dev/null @@ -1,108 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.kafka; - -import javax.naming.NamingException; - -import org.apache.commons.pipeline.StageException; -import org.apache.commons.pipeline.validation.ConsumedTypes; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.search.common.debug.DebugEventUtil; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.search.ingester.model.IngesterThriftVersionedEvents; -import com.twitter.search.ingester.pipeline.util.PipelineStageException; - -/** - * Kafka producer stage to write tweet indexing data as {@code ThriftVersionedEvents}. This stage - * also handles extra debug event processing. 
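`KafkaRawRecordConsumerStage` above does no real decoding; its deserializer simply wraps the raw bytes together with the time they were read. A self-contained version of the same idea against the vanilla Kafka `Deserializer` interface, with a hypothetical `RawRecord` class standing in for `KafkaRawRecord`:

```java
import org.apache.kafka.common.serialization.Deserializer;

/** Hypothetical stand-in for KafkaRawRecord: the payload plus the time it was read. */
class RawRecord {
  final byte[] data;
  final long readAtMillis;

  RawRecord(byte[] data, long readAtMillis) {
    this.data = data;
    this.readAtMillis = readAtMillis;
  }
}

/** Pass-through deserializer: keep the bytes as-is and stamp the read time. */
class RawRecordDeserializer implements Deserializer<RawRecord> {
  @Override
  public RawRecord deserialize(String topic, byte[] data) {
    return new RawRecord(data, System.currentTimeMillis());
  }
}
```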
- */ -@ConsumedTypes(IngesterThriftVersionedEvents.class) -public class TweetThriftVersionedEventsKafkaProducerStage extends KafkaProducerStage - { - private static final int PROCESSING_LATENCY_THRESHOLD_FOR_UPDATES_MILLIS = 30000; - - private static final Logger LOG = - LoggerFactory.getLogger(TweetThriftVersionedEventsKafkaProducerStage.class); - - private long processedTweetCount = 0; - - private SearchLongGauge kafkaProducerLag; - - private int debugEventLogPeriod = -1; - - public TweetThriftVersionedEventsKafkaProducerStage(String kafkaTopic, String clientId, - String clusterPath) { - super(kafkaTopic, clientId, clusterPath); - } - - public TweetThriftVersionedEventsKafkaProducerStage() { - super(); - } - - @Override - protected void initStats() { - super.initStats(); - setupCommonStats(); - } - - @Override - protected void innerSetupStats() { - super.innerSetupStats(); - setupCommonStats(); - } - - private void setupCommonStats() { - kafkaProducerLag = SearchLongGauge.export( - getStageNamePrefix() + "_kafka_producer_lag_millis"); - } - - @Override - protected void innerSetup() throws PipelineStageException, NamingException { - super.innerSetup(); - } - - @Override - protected void doInnerPreprocess() throws StageException, NamingException { - super.doInnerPreprocess(); - commonInnerSetup(); - } - - private void commonInnerSetup() { - setProcessingLatencyThresholdMillis(PROCESSING_LATENCY_THRESHOLD_FOR_UPDATES_MILLIS); - } - - @Override - public void innerProcess(Object obj) throws StageException { - if (!(obj instanceof IngesterThriftVersionedEvents)) { - throw new StageException(this, "Object is not IngesterThriftVersionedEvents: " + obj); - } - - IngesterThriftVersionedEvents events = (IngesterThriftVersionedEvents) obj; - innerRunFinalStageOfBranchV2(events); - } - - @Override - protected void innerRunFinalStageOfBranchV2(IngesterThriftVersionedEvents events) { - if ((debugEventLogPeriod > 0) - && (processedTweetCount % debugEventLogPeriod == 0) - && (events.getDebugEvents() != null)) { - LOG.info("DebugEvents for tweet {}: {}", - events.getTweetId(), DebugEventUtil.debugEventsToString(events.getDebugEvents())); - } - processedTweetCount++; - - DebugEvents debugEvents = events.getDebugEvents(); - if ((debugEvents != null) && debugEvents.isSetProcessingStartedAt()) { - kafkaProducerLag.set( - clock.nowMillis() - debugEvents.getProcessingStartedAt().getEventTimestampMillis()); - } - - super.tryToSendEventsToKafka(events); - } - - @SuppressWarnings("unused") // set from pipeline config - public void setDebugEventLogPeriod(int debugEventLogPeriod) { - this.debugEventLogPeriod = debugEventLogPeriod; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/BUILD b/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/BUILD deleted file mode 100644 index bd8f88b26..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/BUILD +++ /dev/null @@ -1,32 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/org/apache/thrift:libthrift", - "mediaservices/commons/src/main/thrift:thrift-java", - "src/java/com/twitter/common/text/language:locale-util", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/search/common/debug", 
- "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/partitioning/snowflakeparser", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/thrift/com/twitter/dataproducts:enrichments_profilegeo-java", - "src/thrift/com/twitter/escherbird:tweet-annotation-java", - "src/thrift/com/twitter/gizmoduck:user-thrift-java", - "src/thrift/com/twitter/search/common/debug:debug-java", - "src/thrift/com/twitter/service/spiderduck/gen:metadata-store-java", - "src/thrift/com/twitter/tweetypie:events-java", - "src/thrift/com/twitter/tweetypie:tweet-java", - "tweetypie/src/scala/com/twitter/tweetypie/tweettext", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/ThriftTweetParsingException.java b/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/ThriftTweetParsingException.java deleted file mode 100644 index a986eec58..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/ThriftTweetParsingException.java +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.thriftparse; - -public final class ThriftTweetParsingException extends Exception { - public ThriftTweetParsingException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/TweetEventParseHelper.java b/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/TweetEventParseHelper.java deleted file mode 100644 index d644898aa..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/thriftparse/TweetEventParseHelper.java +++ /dev/null @@ -1,727 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.thriftparse; - -import java.util.Date; -import java.util.List; -import java.util.Optional; -import javax.annotation.Nonnull; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -import org.apache.commons.lang.StringEscapeUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.dataproducts.enrichments.thriftjava.GeoEntity; -import com.twitter.dataproducts.enrichments.thriftjava.PotentialLocation; -import com.twitter.dataproducts.enrichments.thriftjava.ProfileGeoEnrichment; -import com.twitter.escherbird.thriftjava.TweetEntityAnnotation; -import com.twitter.expandodo.thriftjava.Card2; -import com.twitter.gizmoduck.thriftjava.User; -import com.twitter.mediaservices.commons.tweetmedia.thrift_java.MediaInfo; -import com.twitter.search.common.debug.thriftjava.DebugEvents; -import com.twitter.search.common.metrics.Percentile; -import com.twitter.search.common.metrics.PercentileUtil; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; -import com.twitter.search.common.relevance.entities.GeoObject; -import com.twitter.search.common.relevance.entities.PotentialLocationObject; -import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.entities.TwitterMessage.EscherbirdAnnotation; -import 
com.twitter.search.common.relevance.entities.TwitterMessageUser; -import com.twitter.search.common.relevance.entities.TwitterMessageUtil; -import com.twitter.search.common.relevance.entities.TwitterQuotedMessage; -import com.twitter.search.common.relevance.entities.TwitterRetweetMessage; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.search.ingester.pipeline.util.CardFieldUtil; -import com.twitter.service.spiderduck.gen.MediaTypes; -import com.twitter.tweetypie.thriftjava.DeviceSource; -import com.twitter.tweetypie.thriftjava.DirectedAtUser; -import com.twitter.tweetypie.thriftjava.EscherbirdEntityAnnotations; -import com.twitter.tweetypie.thriftjava.ExclusiveTweetControl; -import com.twitter.tweetypie.thriftjava.GeoCoordinates; -import com.twitter.tweetypie.thriftjava.HashtagEntity; -import com.twitter.tweetypie.thriftjava.MediaEntity; -import com.twitter.tweetypie.thriftjava.MentionEntity; -import com.twitter.tweetypie.thriftjava.Place; -import com.twitter.tweetypie.thriftjava.QuotedTweet; -import com.twitter.tweetypie.thriftjava.Reply; -import com.twitter.tweetypie.thriftjava.Tweet; -import com.twitter.tweetypie.thriftjava.TweetCoreData; -import com.twitter.tweetypie.thriftjava.TweetCreateEvent; -import com.twitter.tweetypie.thriftjava.TweetDeleteEvent; -import com.twitter.tweetypie.thriftjava.UrlEntity; -import com.twitter.tweetypie.tweettext.PartialHtmlEncoding; - -/** - * This is an utility class for converting Thrift TweetEvent messages sent by TweetyPie - * into ingester internal representation, IngesterTwitterMessage. - */ -public final class TweetEventParseHelper { - private static final Logger LOG = LoggerFactory.getLogger(TweetEventParseHelper.class); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_NULL_TEXT = - SearchCounter.export("tweets_with_null_text_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter TWEET_SIZE = SearchCounter.export("tweet_size_from_thrift"); - - @VisibleForTesting - static final Percentile TWEET_SIZE_PERCENTILES = - PercentileUtil.createPercentile("tweet_size_from_thrift"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_CONVERSATION_ID = - SearchCounter.export("tweets_with_conversation_id_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_QUOTE = - SearchCounter.export("tweets_with_quote_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_ANNOTATIONS = - SearchCounter.export("tweets_with_annotation_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_ANNOTATIONS_ADDED = - SearchCounter.export("num_annotations_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_COORDINATE_FIELD = - SearchCounter.export("tweets_with_coordinate_field_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_PLACE_ADDED = - SearchCounter.export("num_places_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_PLACE_FIELD = - SearchCounter.export("tweets_with_place_field_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_PLACE_COUNTRY_CODE = - SearchCounter.export("tweets_with_place_country_code_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_USE_PLACE_FIELD = - SearchCounter.export("tweets_use_place_field_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_CANNOT_PARSE_PLACE_FIELD = - 
SearchCounter.export("tweets_cannot_parse_place_field_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_PROFILE_GEO_ENRICHMENT = - SearchCounter.export("tweets_with_profile_geo_enrichment_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_MENTIONS = - SearchCounter.export("tweets_with_mentions_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_MENTIONS_ADDED = - SearchCounter.export("num_mentions_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_HASHTAGS = - SearchCounter.export("tweets_with_hashtags_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_HASHTAGS_ADDED = - SearchCounter.export("num_hashtags_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_MEDIA_URL = - SearchCounter.export("tweets_with_media_url_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_MEDIA_URLS_ADDED = - SearchCounter.export("num_media_urls_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_PHOTO_MEDIA_URL = - SearchCounter.export("tweets_with_photo_media_url_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_VIDEO_MEDIA_URL = - SearchCounter.export("tweets_with_video_media_url_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_WITH_NON_MEDIA_URL = - SearchCounter.export("tweets_with_non_media_url_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_NON_MEDIA_URLS_ADDED = - SearchCounter.export("num_non_media_urls_from_thrift_cnt"); - - @VisibleForTesting - static final SearchCounter NUM_TWEETS_MISSING_QUOTE_URLS = - SearchCounter.export("num_tweets_missing_quote_urls_cnt"); - - // Utility class, disallow instantiation. - private TweetEventParseHelper() { - } - - /** Builds an IngesterTwitterMessage instance from a TweetCreateEvent. */ - @Nonnull - public static IngesterTwitterMessage getTwitterMessageFromCreationEvent( - @Nonnull TweetCreateEvent createEvent, - @Nonnull List supportedPenguinVersions, - @Nullable DebugEvents debugEvents) throws ThriftTweetParsingException { - - Tweet tweet = createEvent.getTweet(); - if (tweet == null) { - throw new ThriftTweetParsingException("No tweet field in TweetCreateEvent"); - } - - TweetCoreData coreData = tweet.getCore_data(); - if (coreData == null) { - throw new ThriftTweetParsingException("No core_data field in Tweet in TweetCreateEvent"); - } - - User user = createEvent.getUser(); - if (user == null) { - throw new ThriftTweetParsingException("No user field in TweetCreateEvent"); - } - if (!user.isSetProfile()) { - throw new ThriftTweetParsingException("No profile field in User in TweetCreateEvent"); - } - if (!user.isSetSafety()) { - throw new ThriftTweetParsingException("No safety field in User in TweetCreateEvent"); - } - - long twitterId = tweet.getId(); - IngesterTwitterMessage message = new IngesterTwitterMessage( - twitterId, - supportedPenguinVersions, - debugEvents); - - // Set the creation time based on the tweet ID, because it has millisecond granularity, - // and coreData.created_at_secs has only second granularity. 
- message.setDate(new Date(SnowflakeIdParser.getTimestampFromTweetId(twitterId))); - - boolean isNsfw = coreData.isNsfw_admin() || coreData.isNsfw_user(); - boolean hasMediaOrUrlsOrCards = - tweet.getMediaSize() > 0 - || tweet.getUrlsSize() > 0 - || tweet.getCardsSize() > 0 - || tweet.isSetCard2(); - - message.setIsSensitiveContent(isNsfw && hasMediaOrUrlsOrCards); - - message.setFromUser(getFromUser(user)); - if (user.isSetCounts()) { - message.setFollowersCount((int) user.getCounts().getFollowers()); - } - message.setUserProtected(user.getSafety().isIs_protected()); - message.setUserVerified(user.getSafety().isVerified()); - message.setUserBlueVerified(user.getSafety().isIs_blue_verified()); - - if (tweet.isSetLanguage()) { - message.setLanguage(tweet.getLanguage().getLanguage()); // language ID like "en" - } - - if (tweet.isSetSelf_thread_metadata()) { - message.setSelfThread(true); - } - - ExclusiveTweetControl exclusiveTweetControl = tweet.getExclusive_tweet_control(); - if (exclusiveTweetControl != null) { - if (exclusiveTweetControl.isSetConversation_author_id()) { - message.setExclusiveConversationAuthorId( - exclusiveTweetControl.getConversation_author_id()); - } - } - - setDirectedAtUser(message, coreData); - addMentionsToMessage(message, tweet); - addHashtagsToMessage(message, tweet); - addMediaEntitiesToMessage(message, tweet.getId(), tweet.getMedia()); - addUrlsToMessage(message, tweet.getUrls()); - addEscherbirdAnnotationsToMessage(message, tweet); - message.setNullcast(coreData.isNullcast()); - - if (coreData.isSetConversation_id()) { - message.setConversationId(coreData.getConversation_id()); - NUM_TWEETS_WITH_CONVERSATION_ID.increment(); - } - - // quotes - if (tweet.isSetQuoted_tweet()) { - QuotedTweet quotedTweet = tweet.getQuoted_tweet(); - if (quotedTweet.getTweet_id() > 0 && quotedTweet.getUser_id() > 0) { - if (quotedTweet.isSetPermalink()) { - String quotedURL = quotedTweet.getPermalink().getLong_url(); - UrlEntity quotedURLEntity = new UrlEntity(); - quotedURLEntity.setExpanded(quotedURL); - quotedURLEntity.setUrl(quotedTweet.getPermalink().getShort_url()); - quotedURLEntity.setDisplay(quotedTweet.getPermalink().getDisplay_text()); - addUrlsToMessage(message, Lists.newArrayList(quotedURLEntity)); - } else { - LOG.warn("Tweet {} has quoted tweet, but is missing quoted tweet URL: {}", - tweet.getId(), quotedTweet); - NUM_TWEETS_MISSING_QUOTE_URLS.increment(); - } - TwitterQuotedMessage quotedMessage = - new TwitterQuotedMessage( - quotedTweet.getTweet_id(), - quotedTweet.getUser_id()); - message.setQuotedMessage(quotedMessage); - NUM_TWEETS_WITH_QUOTE.increment(); - } - } - - // card fields - if (createEvent.getTweet().isSetCard2()) { - Card2 card = createEvent.getTweet().getCard2(); - message.setCardName(card.getName()); - message.setCardTitle( - CardFieldUtil.extractBindingValue(CardFieldUtil.TITLE_BINDING_KEY, card)); - message.setCardDescription( - CardFieldUtil.extractBindingValue(CardFieldUtil.DESCRIPTION_BINDING_KEY, card)); - CardFieldUtil.deriveCardLang(message); - message.setCardUrl(card.getUrl()); - } - - // Some fields should be set based on the "original" tweet. So if this tweet is a retweet, - // we want to extract those fields from the retweeted tweet. 
- Tweet retweetOrTweet = tweet; - TweetCoreData retweetOrTweetCoreData = coreData; - User retweetOrTweetUser = user; - - // retweets - boolean isRetweet = coreData.isSetShare(); - if (isRetweet) { - retweetOrTweet = createEvent.getSource_tweet(); - retweetOrTweetCoreData = retweetOrTweet.getCore_data(); - retweetOrTweetUser = createEvent.getSource_user(); - - TwitterRetweetMessage retweetMessage = new TwitterRetweetMessage(); - retweetMessage.setRetweetId(twitterId); - - if (retweetOrTweetUser != null) { - if (retweetOrTweetUser.isSetProfile()) { - retweetMessage.setSharedUserDisplayName(retweetOrTweetUser.getProfile().getName()); - } - retweetMessage.setSharedUserTwitterId(retweetOrTweetUser.getId()); - } - - retweetMessage.setSharedDate(new Date(retweetOrTweetCoreData.getCreated_at_secs() * 1000)); - retweetMessage.setSharedId(retweetOrTweet.getId()); - - addMediaEntitiesToMessage(message, retweetOrTweet.getId(), retweetOrTweet.getMedia()); - addUrlsToMessage(message, retweetOrTweet.getUrls()); - - // If a tweet's text is longer than 140 characters, the text for any retweet of that tweet - // will be truncated. And if the original tweet has hashtags or mentions after character 140, - // the Tweetypie event for the retweet will not include those hashtags/mentions, which will - // make the retweet unsearchable by those hashtags/mentions. So in order to avoid this - // problem, we add to the retweet all hashtags/mentions set on the original tweet. - addMentionsToMessage(message, retweetOrTweet); - addHashtagsToMessage(message, retweetOrTweet); - - message.setRetweetMessage(retweetMessage); - } - - // Some fields should be set based on the "original" tweet. - // Only set geo fields if this is not a retweet - if (!isRetweet) { - setGeoFields(message, retweetOrTweetCoreData, retweetOrTweetUser); - setPlacesFields(message, retweetOrTweet); - } - setText(message, retweetOrTweetCoreData); - setInReplyTo(message, retweetOrTweetCoreData, isRetweet); - setDeviceSourceField(message, retweetOrTweet); - - // Profile geo enrichment fields should be set based on this tweet, even if it's a retweet. - setProfileGeoEnrichmentFields(message, tweet); - - // The composer used to create this tweet: standard tweet creator or the camera flow. - setComposerSource(message, tweet); - - return message; - } - - private static void setGeoFields( - TwitterMessage message, TweetCoreData coreData, User user) { - - if (coreData.isSetCoordinates()) { - NUM_TWEETS_WITH_COORDINATE_FIELD.increment(); - GeoCoordinates coords = coreData.getCoordinates(); - message.setGeoTaggedLocation( - GeoObject.createForIngester(coords.getLatitude(), coords.getLongitude())); - - String location = - String.format("GeoAPI:%.4f,%.4f", coords.getLatitude(), coords.getLongitude()); - TwitterMessageUtil.setAndTruncateLocationOnMessage(message, location); - } - - // If the location was not set from the coordinates. 
- if ((message.getOrigLocation() == null) && (user != null) && user.isSetProfile()) { - TwitterMessageUtil.setAndTruncateLocationOnMessage(message, user.getProfile().getLocation()); - } - } - - private static void setPlacesFields(TwitterMessage message, Tweet tweet) { - if (!tweet.isSetPlace()) { - return; - } - - Place place = tweet.getPlace(); - - if (place.isSetContainers() && place.getContainersSize() > 0) { - NUM_TWEETS_WITH_PLACE_FIELD.increment(); - NUM_PLACE_ADDED.add(place.getContainersSize()); - - for (String placeId : place.getContainers()) { - message.addPlace(placeId); - } - } - - Preconditions.checkArgument(place.isSetId(), "Tweet.Place without id."); - message.setPlaceId(place.getId()); - Preconditions.checkArgument(place.isSetFull_name(), "Tweet.Place without full_name."); - message.setPlaceFullName(place.getFull_name()); - if (place.isSetCountry_code()) { - message.setPlaceCountryCode(place.getCountry_code()); - NUM_TWEETS_WITH_PLACE_COUNTRY_CODE.increment(); - } - - if (message.getGeoTaggedLocation() == null) { - Optional location = GeoObject.fromPlace(place); - - if (location.isPresent()) { - NUM_TWEETS_USE_PLACE_FIELD.increment(); - message.setGeoTaggedLocation(location.get()); - } else { - NUM_TWEETS_CANNOT_PARSE_PLACE_FIELD.increment(); - } - } - } - - private static void setText(TwitterMessage message, TweetCoreData coreData) { - /** - * TweetyPie doesn't do a full HTML escaping of the text, only a partial escaping - * so we use their code to unescape it first, then we do - * a second unescaping because when the tweet text itself has HTML escape - * sequences, we want to index the unescaped version, not the escape sequence itself. - * -- - * Yes, we *double* unescape html. About 1-2 tweets per second are double escaped, - * and we probably want to index the real text and not things like '★'. - * Unescaping already unescaped text seems safe in practice. - * -- - * - * This may seem wrong, because one thinks we should index whatever the user posts, - * but given punctuation stripping this creates odd behavior: - * - * If someone tweets & they won't be able to find it by searching for '&' because - * the tweet will be indexed as 'amp' - * - * It would also prevent some tweets from surfacing for certain searches, for example: - * - * User Tweets: John Mayer & Dave Chappelle - * We Unescape To: John Mayer & Dave Chappelle - * We Strip/Normalize To: john mayer dave chappelle - * - * A user searching for 'John Mayer Dave Chappelle' would get the above tweet. - * - * If we didn't double unescape - * - * User Tweets: John Mayer & Dave Chappelle - * We Strip/Normalize To: john mayer amp dave chappelle - * - * A user searching for 'John Mayer Dave Chappelle' would miss the above tweet. - * - * Second example - * - * User Tweets: L'Humanité - * We Unescape To: L'Humanité - * We Strip/Normalize To: l humanite - * - * If we didn't double escape - * - * User Tweets: L'Humanité - * We Strip/Normalize To: l humanit eacute - * - */ - - String text = coreData.isSetText() - ? 
StringEscapeUtils.unescapeHtml(PartialHtmlEncoding.decode(coreData.getText())) - : coreData.getText(); - message.setText(text); - if (text != null) { - long tweetLength = text.length(); - TWEET_SIZE.add(tweetLength); - TWEET_SIZE_PERCENTILES.record(tweetLength); - } else { - NUM_TWEETS_WITH_NULL_TEXT.increment(); - } - } - - private static void setInReplyTo( - TwitterMessage message, TweetCoreData coreData, boolean isRetweet) { - Reply reply = coreData.getReply(); - if (!isRetweet && reply != null) { - String inReplyToScreenName = reply.getIn_reply_to_screen_name(); - long inReplyToUserId = reply.getIn_reply_to_user_id(); - message.replaceToUserWithInReplyToUserIfNeeded(inReplyToScreenName, inReplyToUserId); - } - - if ((reply != null) && reply.isSetIn_reply_to_status_id()) { - message.setInReplyToStatusId(reply.getIn_reply_to_status_id()); - } - } - - private static void setProfileGeoEnrichmentFields(TwitterMessage message, Tweet tweet) { - if (!tweet.isSetProfile_geo_enrichment()) { - return; - } - - ProfileGeoEnrichment profileGeoEnrichment = tweet.getProfile_geo_enrichment(); - List thriftPotentialLocations = - profileGeoEnrichment.getPotential_locations(); - if (!thriftPotentialLocations.isEmpty()) { - NUM_TWEETS_WITH_PROFILE_GEO_ENRICHMENT.increment(); - List potentialLocations = Lists.newArrayList(); - for (PotentialLocation potentialLocation : thriftPotentialLocations) { - GeoEntity geoEntity = potentialLocation.getGeo_entity(); - potentialLocations.add(new PotentialLocationObject(geoEntity.getCountry_code(), - geoEntity.getRegion(), - geoEntity.getLocality())); - } - - message.setPotentialLocations(potentialLocations); - } - } - - private static void setDeviceSourceField(TwitterMessage message, Tweet tweet) { - DeviceSource deviceSource = tweet.getDevice_source(); - TwitterMessageUtil.setSourceOnMessage(message, modifyDeviceSourceWithNofollow(deviceSource)); - } - - /** Builds an IngesterTwitterMessage instance from a TweetDeleteEvent. 
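The long comment in `setText` above motivates double-unescaping: Tweetypie only partially HTML-encodes tweet text, and some tweets additionally carry literal escape sequences, so the ingester decodes twice and indexes the fully unescaped form. A small illustration with commons-lang's `StringEscapeUtils`, standing in for the `PartialHtmlEncoding.decode` plus `unescapeHtml` pair used above:

```java
import org.apache.commons.lang.StringEscapeUtils;

public class DoubleUnescapeSketch {
  public static void main(String[] args) {
    // Text as it might arrive: already escaped once upstream,
    // and containing an escape sequence the user typed or a client added.
    String raw = "John Mayer &amp;amp; Dave Chappelle";

    String once = StringEscapeUtils.unescapeHtml(raw);   // "John Mayer &amp; Dave Chappelle"
    String twice = StringEscapeUtils.unescapeHtml(once); // "John Mayer & Dave Chappelle"

    // Indexing the twice-unescaped form means a search for "dave chappelle" still matches,
    // instead of a stray "amp" token ending up in the document.
    System.out.println(twice);
  }
}
```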
*/ - @Nonnull - public static IngesterTwitterMessage getTwitterMessageFromDeletionEvent( - @Nonnull TweetDeleteEvent deleteEvent, - @Nonnull List supportedPenguinVersions, - @Nullable DebugEvents debugEvents) throws ThriftTweetParsingException { - - Tweet tweet = deleteEvent.getTweet(); - if (tweet == null) { - throw new ThriftTweetParsingException("No tweet field in TweetDeleteEvent"); - } - long tweetId = tweet.getId(); - - TweetCoreData coreData = tweet.getCore_data(); - if (coreData == null) { - throw new ThriftTweetParsingException("No TweetCoreData in TweetDeleteEvent"); - } - long userId = coreData.getUser_id(); - - IngesterTwitterMessage message = new IngesterTwitterMessage( - tweetId, - supportedPenguinVersions, - debugEvents); - message.setDeleted(true); - message.setText("delete"); - message.setFromUser(TwitterMessageUser.createWithNamesAndId("delete", "delete", userId)); - - return message; - } - - private static TwitterMessageUser getFromUser(User user) { - String screenName = user.getProfile().getScreen_name(); - long id = user.getId(); - String displayName = user.getProfile().getName(); - return TwitterMessageUser.createWithNamesAndId(screenName, displayName, id); - } - - private static void addMentionsToMessage(IngesterTwitterMessage message, Tweet tweet) { - List mentions = tweet.getMentions(); - if (mentions != null) { - NUM_TWEETS_WITH_MENTIONS.increment(); - NUM_MENTIONS_ADDED.add(mentions.size()); - for (MentionEntity mention : mentions) { - addMention(message, mention); - } - } - } - - private static void addMention(IngesterTwitterMessage message, MentionEntity mention) { - // Default values. They are weird, but are consistent with JSON parsing behavior. - Optional id = Optional.of(-1L); - Optional screenName = Optional.of(""); - Optional displayName = Optional.of(""); - - if (mention.isSetUser_id()) { - id = Optional.of(mention.getUser_id()); - } - - if (mention.isSetScreen_name()) { - screenName = Optional.of(mention.getScreen_name()); - } - - if (mention.isSetName()) { - displayName = Optional.of(mention.getName()); - } - - TwitterMessageUser mentionedUser = TwitterMessageUser - .createWithOptionalNamesAndId(screenName, displayName, id); - - if (isToUser(mention, message.getToUserObject())) { - message.setToUserObject(mentionedUser); - } - message.addUserToMentions(mentionedUser); - } - - private static boolean isToUser( - MentionEntity mention, Optional optionalToUser) { - if (mention.getFrom_index() == 0) { - return true; - } - if (optionalToUser.isPresent()) { - TwitterMessageUser toUser = optionalToUser.get(); - if (toUser.getId().isPresent()) { - long toUserId = toUser.getId().get(); - return mention.getUser_id() == toUserId; - } - } - return false; - } - - private static void addHashtagsToMessage(IngesterTwitterMessage message, Tweet tweet) { - List hashtags = tweet.getHashtags(); - if (hashtags != null) { - NUM_TWEETS_WITH_HASHTAGS.increment(); - NUM_HASHTAGS_ADDED.add(hashtags.size()); - for (HashtagEntity hashtag : hashtags) { - addHashtag(message, hashtag); - } - } - } - - private static void addHashtag(IngesterTwitterMessage message, HashtagEntity hashtag) { - String hashtagString = hashtag.getText(); - message.addHashtag(hashtagString); - } - - /** Add the given media entities to the given message. 
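In `addMention`/`isToUser` above, a mention is promoted to the message's "to" user either when it sits at offset 0 (the tweet text starts with the @mention, i.e. a classic reply) or when its user ID matches a reply-to user that is already set. A tiny standalone illustration of that rule, with a hypothetical `Mention` holder in place of `MentionEntity`:

```java
import java.util.OptionalLong;

public class ToUserRuleSketch {
  /** Hypothetical holder mirroring MentionEntity's from_index and user_id. */
  static final class Mention {
    final int fromIndex;
    final long userId;
    Mention(int fromIndex, long userId) { this.fromIndex = fromIndex; this.userId = userId; }
  }

  /** A mention becomes the "to" user if it leads the tweet or matches the current reply-to user. */
  static boolean isToUser(Mention mention, OptionalLong currentToUserId) {
    if (mention.fromIndex == 0) {
      return true;
    }
    return currentToUserId.isPresent() && mention.userId == currentToUserId.getAsLong();
  }

  public static void main(String[] args) {
    Mention leading = new Mention(0, 123L);  // "@alice thanks!" -> to-user
    Mention inline = new Mention(10, 456L);  // "hello to @bob"  -> only if already replying to 456
    System.out.println(isToUser(leading, OptionalLong.empty())); // true
    System.out.println(isToUser(inline, OptionalLong.of(456L))); // true
    System.out.println(isToUser(inline, OptionalLong.empty()));  // false
  }
}
```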
*/ - public static void addMediaEntitiesToMessage( - IngesterTwitterMessage message, - long photoStatusId, - @Nullable List medias) { - - if (medias != null) { - NUM_TWEETS_WITH_MEDIA_URL.increment(); - NUM_MEDIA_URLS_ADDED.add(medias.size()); - - boolean hasPhotoMediaUrl = false; - boolean hasVideoMediaUrl = false; - for (MediaEntity media : medias) { - MediaTypes mediaType = null; - if (media.isSetMedia_info()) { - MediaInfo mediaInfo = media.getMedia_info(); - if (mediaInfo != null) { - if (mediaInfo.isSet(MediaInfo._Fields.IMAGE_INFO)) { - mediaType = MediaTypes.NATIVE_IMAGE; - String mediaUrl = media.getMedia_url_https(); - if (mediaUrl != null) { - hasPhotoMediaUrl = true; - message.addPhotoUrl(photoStatusId, mediaUrl); - // Add this link to the expanded URLs too, so that the HAS_NATIVE_IMAGE_FLAG is set - // correctly too. See EncodedFeatureBuilder.updateLinkEncodedFeatures(). - } - } else if (mediaInfo.isSet(MediaInfo._Fields.VIDEO_INFO)) { - mediaType = MediaTypes.VIDEO; - hasVideoMediaUrl = true; - } - } - } - String originalUrl = media.getUrl(); - String expandedUrl = media.getExpanded_url(); - message.addExpandedMediaUrl(originalUrl, expandedUrl, mediaType); - } - - if (hasPhotoMediaUrl) { - NUM_TWEETS_WITH_PHOTO_MEDIA_URL.increment(); - } - if (hasVideoMediaUrl) { - NUM_TWEETS_WITH_VIDEO_MEDIA_URL.increment(); - } - } - } - - /** Adds the given urls to the given message. */ - public static void addUrlsToMessage( - IngesterTwitterMessage message, - @Nullable List urls) { - - if (urls != null) { - NUM_TWEETS_WITH_NON_MEDIA_URL.increment(); - NUM_NON_MEDIA_URLS_ADDED.add(urls.size()); - for (UrlEntity url : urls) { - String originalUrl = url.getUrl(); - String expandedUrl = url.getExpanded(); - message.addExpandedNonMediaUrl(originalUrl, expandedUrl); - } - } - } - - private static void addEscherbirdAnnotationsToMessage( - IngesterTwitterMessage message, Tweet tweet) { - if (tweet.isSetEscherbird_entity_annotations()) { - EscherbirdEntityAnnotations entityAnnotations = tweet.getEscherbird_entity_annotations(); - if (entityAnnotations.isSetEntity_annotations()) { - NUM_TWEETS_WITH_ANNOTATIONS.increment(); - NUM_ANNOTATIONS_ADDED.add(entityAnnotations.getEntity_annotationsSize()); - for (TweetEntityAnnotation entityAnnotation : entityAnnotations.getEntity_annotations()) { - EscherbirdAnnotation escherbirdAnnotation = - new EscherbirdAnnotation(entityAnnotation.getGroupId(), - entityAnnotation.getDomainId(), - entityAnnotation.getEntityId()); - message.addEscherbirdAnnotation(escherbirdAnnotation); - } - } - } - } - - private static void setComposerSource(IngesterTwitterMessage message, Tweet tweet) { - if (tweet.isSetComposer_source()) { - message.setComposerSource(tweet.getComposer_source()); - } - } - - private static String modifyDeviceSourceWithNofollow(@Nullable DeviceSource deviceSource) { - if (deviceSource != null) { - String source = deviceSource.getDisplay(); - int i = source.indexOf("\">"); - if (i == -1) { - return source; - } else { - return source.substring(0, i) + "\" rel=\"nofollow\">" + source.substring(i + 2); - } - } else { - return "Twitter"; - } - } - - private static void setDirectedAtUser( - IngesterTwitterMessage message, - TweetCoreData tweetCoreData) { - if (!tweetCoreData.isSetDirected_at_user()) { - return; - } - - DirectedAtUser directedAtUser = tweetCoreData.getDirected_at_user(); - - if (!directedAtUser.isSetUser_id()) { - return; - } - - message.setDirectedAtUserId(Optional.of(directedAtUser.getUser_id())); - } -} diff --git 
a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/BUILD b/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/BUILD deleted file mode 100644 index 3bbba39ea..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/BUILD +++ /dev/null @@ -1,34 +0,0 @@ -java_library( - sources = ["*.java"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/apache/bookkeeper:bookkeeper-twitter-finagle-provider", - "3rdparty/jvm/org/apache/commons:commons-text", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/slf4j:slf4j-api", - "decider/src/main/scala", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "src/java/com/twitter/common/base", - "src/java/com/twitter/common/collections", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/util/io:dl-reader-writer", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/common/util/io/kafka", - "src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/ingester/model", - "src/java/com/twitter/search/ingester/pipeline/twitter", - "src/java/com/twitter/search/ingester/pipeline/twitter/kafka", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/java/com/twitter/search/ingester/pipeline/wire", - "src/java/org/apache/commons/pipeline", - "src/thrift/com/twitter/gizmoduck:modified_user-gizmoduck_scala", - "src/thrift/com/twitter/gizmoduck:thrift-java", - "src/thrift/com/twitter/search/common:indexing-java", - "util/util-core:util-core-util", - "util/util-core/src/main/java/com/twitter/util/javainterop", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdateIngester.java b/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdateIngester.java deleted file mode 100644 index 03ff94ff2..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdateIngester.java +++ /dev/null @@ -1,292 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.userupdates; - -import java.util.AbstractMap; -import java.util.Collection; -import java.util.Collections; -import java.util.EnumSet; -import java.util.List; -import java.util.Map; -import java.util.Objects; -import java.util.Set; -import java.util.function.Function; -import java.util.stream.Collectors; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.Sets; - -import org.apache.commons.text.CaseUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.collections.Pair; -import com.twitter.decider.Decider; -import com.twitter.finagle.util.DefaultTimer; -import com.twitter.gizmoduck.thriftjava.LifecycleChangeReason; -import com.twitter.gizmoduck.thriftjava.LookupContext; -import com.twitter.gizmoduck.thriftjava.QueryFields; -import com.twitter.gizmoduck.thriftjava.Safety; -import com.twitter.gizmoduck.thriftjava.UpdateDiffItem; -import com.twitter.gizmoduck.thriftjava.User; -import com.twitter.gizmoduck.thriftjava.UserModification; -import com.twitter.gizmoduck.thriftjava.UserService; -import com.twitter.gizmoduck.thriftjava.UserType; -import com.twitter.search.common.indexing.thriftjava.AntisocialUserUpdate; -import com.twitter.search.common.indexing.thriftjava.UserUpdateType; -import 
com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchLongGauge; -import com.twitter.util.Duration; -import com.twitter.util.Future; -import com.twitter.util.TimeoutException; - -/** - * This class ingests {@link UserModification} events and transforms them into a possibly empty list - * of {@link AntisocialUserUpdate}s to be indexed by Earlybirds. - */ -public class UserUpdateIngester { - private static final Logger LOG = LoggerFactory.getLogger(UserUpdateIngester.class); - private static final Duration RESULT_TIMEOUT = Duration.fromSeconds(3); - - private static final List NO_UPDATE = Collections.emptyList(); - - // Map from UserUpdateType to a set of Safety fields to examine. - private static final Map> SAFETY_FIELDS_MAP = - ImmutableMap.of( - UserUpdateType.ANTISOCIAL, - Sets.immutableEnumSet( - Safety._Fields.SUSPENDED, Safety._Fields.DEACTIVATED, Safety._Fields.OFFBOARDED), - UserUpdateType.NSFW, - Sets.immutableEnumSet(Safety._Fields.NSFW_USER, Safety._Fields.NSFW_ADMIN), - UserUpdateType.PROTECTED, Sets.immutableEnumSet(Safety._Fields.IS_PROTECTED)); - - private static final Function FIELD_TO_FIELD_NAME_FUNCTION = - field -> "safety." + CaseUtils.toCamelCase(field.name(), false, '_'); - - private static final Map FIELD_NAME_TO_TYPE_MAP = - SAFETY_FIELDS_MAP.entrySet().stream() - .flatMap( - entry -> entry.getValue().stream() - .map(field -> new AbstractMap.SimpleEntry<>( - FIELD_TO_FIELD_NAME_FUNCTION.apply(field), - entry.getKey()))) - .collect(Collectors.toMap( - AbstractMap.SimpleEntry::getKey, - AbstractMap.SimpleEntry::getValue)); - - private static final Map FIELD_NAME_TO_FIELD_MAP = - SAFETY_FIELDS_MAP.values().stream() - .flatMap(Collection::stream) - .collect(Collectors.toMap( - FIELD_TO_FIELD_NAME_FUNCTION, - Function.identity())); - - private static final LookupContext LOOKUP_CONTEXT = new LookupContext() - .setInclude_deactivated(true) - .setInclude_erased(true) - .setInclude_suspended(true) - .setInclude_offboarded(true) - .setInclude_protected(true); - - private final UserService.ServiceToClient userService; - private final Decider decider; - - private final SearchLongGauge userModificationLatency; - private final SearchCounter unsuccessfulUserModificationCount; - private final SearchCounter byInactiveAccountDeactivationUserModificationCount; - private final SearchCounter irrelevantUserModificationCount; - private final SearchCounter notNormalUserCount; - private final SearchCounter missingSafetyCount; - private final SearchCounter userServiceRequests; - private final SearchCounter userServiceSuccesses; - private final SearchCounter userServiceNoResults; - private final SearchCounter userServiceFailures; - private final SearchCounter userServiceTimeouts; - private final Map, SearchCounter> counterMap; - - public UserUpdateIngester( - String statPrefix, - UserService.ServiceToClient userService, - Decider decider - ) { - this.userService = userService; - this.decider = decider; - - userModificationLatency = - SearchLongGauge.export(statPrefix + "_user_modification_latency_ms"); - unsuccessfulUserModificationCount = - SearchCounter.export(statPrefix + "_unsuccessful_user_modification_count"); - byInactiveAccountDeactivationUserModificationCount = - SearchCounter.export(statPrefix - + "_by_inactive_account_deactivation_user_modification_count"); - irrelevantUserModificationCount = - SearchCounter.export(statPrefix + "_irrelevant_user_modification_count"); - notNormalUserCount = - SearchCounter.export(statPrefix + 
"_not_normal_user_count"); - missingSafetyCount = - SearchCounter.export(statPrefix + "_missing_safety_count"); - userServiceRequests = - SearchCounter.export(statPrefix + "_user_service_requests"); - userServiceSuccesses = - SearchCounter.export(statPrefix + "_user_service_successes"); - userServiceNoResults = - SearchCounter.export(statPrefix + "_user_service_no_results"); - userServiceFailures = - SearchCounter.export(statPrefix + "_user_service_failures"); - userServiceTimeouts = - SearchCounter.export(statPrefix + "_user_service_timeouts"); - counterMap = ImmutableMap., SearchCounter>builder() - .put(Pair.of(UserUpdateType.ANTISOCIAL, true), - SearchCounter.export(statPrefix + "_antisocial_set_count")) - .put(Pair.of(UserUpdateType.ANTISOCIAL, false), - SearchCounter.export(statPrefix + "_antisocial_unset_count")) - .put(Pair.of(UserUpdateType.NSFW, true), - SearchCounter.export(statPrefix + "_nsfw_set_count")) - .put(Pair.of(UserUpdateType.NSFW, false), - SearchCounter.export(statPrefix + "_nsfw_unset_count")) - .put(Pair.of(UserUpdateType.PROTECTED, true), - SearchCounter.export(statPrefix + "_protected_set_count")) - .put(Pair.of(UserUpdateType.PROTECTED, false), - SearchCounter.export(statPrefix + "_protected_unset_count")) - .build(); - } - - /** - * Convert a UserModification event into a (possibly empty) list of antisocial updates for - * Earlybird. - */ - public Future> transform(UserModification userModification) { - userModificationLatency.set(System.currentTimeMillis() - userModification.getUpdated_at_msec()); - - if (!userModification.isSuccess()) { - unsuccessfulUserModificationCount.increment(); - return Future.value(NO_UPDATE); - } - - // To avoid UserTable gets overflowed, we exclude traffic from ByInactiveAccountDeactivation - if (userModification.getUser_audit_data() != null - && userModification.getUser_audit_data().getReason() != null - && userModification.getUser_audit_data().getReason() - == LifecycleChangeReason.BY_INACTIVE_ACCOUNT_DEACTIVATION) { - byInactiveAccountDeactivationUserModificationCount.increment(); - return Future.value(NO_UPDATE); - } - - long userId = userModification.getUser_id(); - Set userUpdateTypes = getUserUpdateTypes(userModification); - if (userUpdateTypes.isEmpty()) { - irrelevantUserModificationCount.increment(); - return Future.value(NO_UPDATE); - } - - Future userFuture = userModification.isSetCreate() - ? Future.value(userModification.getCreate()) - : getUser(userId); - - return userFuture - .map(user -> { - if (user == null) { - return NO_UPDATE; - } else if (user.getUser_type() != UserType.NORMAL) { - LOG.info("User with id={} is not a normal user.", userId); - notNormalUserCount.increment(); - return NO_UPDATE; - } else if (!user.isSetSafety()) { - LOG.info("Safety for User with id={} is missing.", userId); - missingSafetyCount.increment(); - return NO_UPDATE; - } - - if (userModification.isSetUpdate()) { - // Apply relevant updates from UserModification as User returned from Gizmoduck may not - // have reflected them yet. 
- applyUpdates(user, userModification); - } - - return userUpdateTypes.stream() - .map(userUpdateType -> - convertToAntiSocialUserUpdate( - user, userUpdateType, userModification.getUpdated_at_msec())) - .peek(update -> - counterMap.get(Pair.of(update.getType(), update.isValue())).increment()) - .collect(Collectors.toList()); - }) - .onFailure(com.twitter.util.Function.cons(exception -> { - if (exception instanceof UserNotFoundException) { - userServiceNoResults.increment(); - } else if (exception instanceof TimeoutException) { - userServiceTimeouts.increment(); - LOG.error("UserService.get timed out for user id=" + userId, exception); - } else { - userServiceFailures.increment(); - LOG.error("UserService.get failed for user id=" + userId, exception); - } - })); - } - - private static Set getUserUpdateTypes(UserModification userModification) { - Set types = EnumSet.noneOf(UserUpdateType.class); - - if (userModification.isSetUpdate()) { - userModification.getUpdate().stream() - .map(UpdateDiffItem::getField_name) - .map(FIELD_NAME_TO_TYPE_MAP::get) - .filter(Objects::nonNull) - .collect(Collectors.toCollection(() -> types)); - } else if (userModification.isSetCreate() && userModification.getCreate().isSetSafety()) { - Safety safety = userModification.getCreate().getSafety(); - if (safety.isSuspended()) { - types.add(UserUpdateType.ANTISOCIAL); - } - if (safety.isNsfw_admin() || safety.isNsfw_user()) { - types.add(UserUpdateType.NSFW); - } - if (safety.isIs_protected()) { - types.add(UserUpdateType.PROTECTED); - } - } - - return types; - } - - private Future getUser(long userId) { - userServiceRequests.increment(); - return userService.get( - LOOKUP_CONTEXT, - Collections.singletonList(userId), - Collections.singleton(QueryFields.SAFETY)) - .within(DefaultTimer.getInstance(), RESULT_TIMEOUT) - .flatMap(userResults -> { - if (userResults.size() != 1 || !userResults.get(0).isSetUser()) { - return Future.exception(new UserNotFoundException(userId)); - } - - userServiceSuccesses.increment(); - return Future.value(userResults.get(0).getUser()); - }); - } - - private static void applyUpdates(User user, UserModification userModification) { - userModification.getUpdate().stream() - .filter(update -> FIELD_NAME_TO_FIELD_MAP.containsKey(update.getField_name())) - .filter(UpdateDiffItem::isSetAfter) - .forEach(update -> - user.getSafety().setFieldValue( - FIELD_NAME_TO_FIELD_MAP.get(update.getField_name()), - Boolean.valueOf(update.getAfter())) - ); - } - - private AntisocialUserUpdate convertToAntiSocialUserUpdate( - User user, - UserUpdateType userUpdateType, - long updatedAt) { - boolean value = SAFETY_FIELDS_MAP.get(userUpdateType).stream() - .anyMatch(safetyField -> (boolean) user.getSafety().getFieldValue(safetyField)); - return new AntisocialUserUpdate(user.getId(), userUpdateType, value, updatedAt); - } - - class UserNotFoundException extends Exception { - UserNotFoundException(long userId) { - super("User " + userId + " not found."); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipeline.java b/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipeline.java deleted file mode 100644 index 5cbf009d2..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipeline.java +++ /dev/null @@ -1,222 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.userupdates; - -import java.time.Duration; -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; 
-import java.util.concurrent.Semaphore; -import java.util.function.Supplier; - -import scala.runtime.BoxedUnit; - -import com.google.common.base.Preconditions; - -import org.apache.kafka.clients.consumer.ConsumerRecord; -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.producer.ProducerRecord; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.gizmoduck.thriftjava.UserModification; -import com.twitter.search.common.indexing.thriftjava.AntisocialUserUpdate; -import com.twitter.search.common.metrics.SearchCustomGauge; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.util.io.kafka.CompactThriftSerializer; -import com.twitter.search.common.util.io.kafka.ThriftDeserializer; -import com.twitter.search.ingester.pipeline.wire.WireModule; -import com.twitter.util.Future; -import com.twitter.util.Futures; - -/** - * This class reads UserModification events from Kafka, transforms them into AntisocialUserUpdates, - * and writes them to Kafka. - */ -public final class UserUpdatesPipeline { - private static final Logger LOG = LoggerFactory.getLogger(UserUpdatesPipeline.class); - private static final Duration POLL_TIMEOUT = Duration.ofSeconds(1); - private static final int MAX_PENDING_EVENTS = 100; - private static final String KAFKA_CLIENT_ID = ""; - private static final int MAX_POLL_RECORDS = 1; - private static final String USER_MODIFICATIONS_KAFKA_TOPIC = ""; - private static final String USER_UPDATES_KAFKA_TOPIC_PREFIX = ""; - private static final String KAFKA_PRODUCER_DEST = ""; - private static final String KAFKA_CONSUMER_DEST = ""; - - // This semaphore stops us from having more than MAX_PENDING_EVENTS in the pipeline at any point - // in time. - private final Semaphore pendingEvents = new Semaphore(MAX_PENDING_EVENTS); - private final Supplier isRunning; - private final KafkaConsumer userModificationConsumer; - private final UserUpdateIngester userUpdateIngester; - private final SearchRateCounter records; - private final SearchRateCounter success; - private final SearchRateCounter failure; - - private final String userUpdatesKafkaTopic; - private final BlockingFinagleKafkaProducer userUpdatesProducer; - private final Clock clock; - - /** - * Builds the pipeline. - */ - public static UserUpdatesPipeline buildPipeline( - String environment, - WireModule wireModule, - String statsPrefix, - Supplier isRunning, - Clock clock) throws Exception { - - // We only have Gizmoduck clients for staging and prod. 
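- // Any environment that is not staging* or prod fails the Preconditions check below.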
- String gizmoduckClient; - if (environment.startsWith("staging")) { - gizmoduckClient = ""; - } else { - Preconditions.checkState("prod".equals(environment)); - gizmoduckClient = ""; - } - LOG.info("Gizmoduck client: {}", gizmoduckClient); - - String kafkaConsumerGroup = "" + environment; - KafkaConsumer userModificationConsumer = wireModule.newKafkaConsumer( - KAFKA_CONSUMER_DEST, - new ThriftDeserializer<>(UserModification.class), - KAFKA_CLIENT_ID, - kafkaConsumerGroup, - MAX_POLL_RECORDS); - userModificationConsumer.subscribe(Collections.singleton(USER_MODIFICATIONS_KAFKA_TOPIC)); - LOG.info("User modifications topic: {}", USER_MODIFICATIONS_KAFKA_TOPIC); - LOG.info("User updates Kafka topic prefix: {}", USER_UPDATES_KAFKA_TOPIC_PREFIX); - LOG.info("Kafka consumer group: {}", kafkaConsumerGroup); - LOG.info("Kafka client id: {}", KAFKA_CLIENT_ID); - - UserUpdateIngester userUpdateIngester = new UserUpdateIngester( - statsPrefix, - wireModule.getGizmoduckClient(gizmoduckClient), - wireModule.getDecider()); - - String userUpdatesKafkaTopic = USER_UPDATES_KAFKA_TOPIC_PREFIX + environment; - BlockingFinagleKafkaProducer userUpdatesProducer = - wireModule.newFinagleKafkaProducer( - KAFKA_PRODUCER_DEST, - new CompactThriftSerializer(), - KAFKA_CLIENT_ID, - null); - - return new UserUpdatesPipeline( - isRunning, - userModificationConsumer, - userUpdateIngester, - userUpdatesProducer, - userUpdatesKafkaTopic, - clock); - } - - private UserUpdatesPipeline( - Supplier isRunning, - KafkaConsumer userModificationConsumer, - UserUpdateIngester userUpdateIngester, - BlockingFinagleKafkaProducer userUpdatesProducer, - String userUpdatesKafkaTopic, - Clock clock) { - this.isRunning = isRunning; - this.userModificationConsumer = userModificationConsumer; - this.userUpdateIngester = userUpdateIngester; - this.userUpdatesProducer = userUpdatesProducer; - this.userUpdatesKafkaTopic = userUpdatesKafkaTopic; - this.clock = clock; - - String statPrefix = "user_updates_pipeline_"; - SearchCustomGauge.export(statPrefix + "semaphore_permits", pendingEvents::availablePermits); - - records = SearchRateCounter.export(statPrefix + "records_processed_total"); - success = SearchRateCounter.export(statPrefix + "records_processed_success"); - failure = SearchRateCounter.export(statPrefix + "records_processed_failure"); - } - - /** - * Start the user updates pipeline. - */ - public void run() { - while (isRunning.get()) { - try { - pollFromKafka(); - } catch (Throwable e) { - LOG.error("Exception processing event.", e); - } - } - close(); - } - /** - * Polls records from Kafka and handles timeouts, back-pressure, and error handling. - * All consumed messages are passed to the messageHandler. - */ - private void pollFromKafka() throws Exception { - for (ConsumerRecord record - : userModificationConsumer.poll(POLL_TIMEOUT)) { - pendingEvents.acquire(); - records.increment(); - - handleUserModification(record.value()) - .onFailure(e -> { - failure.increment(); - return null; - }) - .onSuccess(u -> { - success.increment(); - return null; - }) - .ensure(() -> { - pendingEvents.release(); - return null; - }); - } - } - - /** - * Handles the business logic for the user updates pipeline: - * 1. Converts incoming event into possibly empty set of AntisocialUserUpdates - * 2. Writes the result to Kafka so that Earlybird can consume it. 
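- * An empty list of updates simply results in no Kafka writes for that event.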
- */ - private Future handleUserModification(UserModification event) { - return userUpdateIngester - .transform(event) - .flatMap(this::writeListToKafka); - } - - private Future writeListToKafka(List updates) { - List> futures = new ArrayList<>(); - for (AntisocialUserUpdate update : updates) { - futures.add(writeToKafka(update)); - } - return Futures.join(futures).onFailure(e -> { - LOG.info("Exception while writing to kafka", e); - return null; - }); - } - - private Future writeToKafka(AntisocialUserUpdate update) { - ProducerRecord record = new ProducerRecord<>( - userUpdatesKafkaTopic, - null, - clock.nowMillis(), - null, - update); - try { - return userUpdatesProducer.send(record).unit(); - } catch (Exception e) { - return Future.exception(e); - } - } - - private void close() { - userModificationConsumer.close(); - try { - // Acquire all of the permits, so we know all pending events have been written. - pendingEvents.acquire(MAX_PENDING_EVENTS); - } catch (Exception e) { - LOG.error("Error shutting down stage", e); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipelineStage.java b/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipelineStage.java deleted file mode 100644 index 77ba0acf0..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/twitter/userupdates/UserUpdatesPipelineStage.java +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.search.ingester.pipeline.twitter.userupdates; - -import java.util.function.Supplier; - -import org.apache.commons.pipeline.Pipeline; -import org.apache.commons.pipeline.StageDriver; -import org.apache.commons.pipeline.StageException; - -import com.twitter.search.ingester.pipeline.twitter.TwitterBaseStage; -import com.twitter.search.ingester.pipeline.util.PipelineUtil; - -/** - * This stage is a shim for the UserUpdatesPipeline. - * - * Eventually the UserUpdatesPipeline will be called directly from a TwitterServer, but this exists - * as a bridge while we migrate. - */ -public class UserUpdatesPipelineStage extends TwitterBaseStage { - // This is 'prod', 'staging', or 'staging1'. 
- private String environment; - private UserUpdatesPipeline userUpdatesPipeline; - - @Override - protected void doInnerPreprocess() throws StageException { - StageDriver driver = ((Pipeline) stageContext).getStageDriver(this); - Supplier booleanSupplier = () -> driver.getState() == StageDriver.State.RUNNING; - try { - userUpdatesPipeline = UserUpdatesPipeline.buildPipeline( - environment, - wireModule, - getStageNamePrefix(), - booleanSupplier, - clock); - - } catch (Exception e) { - throw new StageException(this, e); - } - PipelineUtil.feedStartObjectToStage(this); - } - - @Override - public void innerProcess(Object obj) throws StageException { - userUpdatesPipeline.run(); - } - - @SuppressWarnings("unused") // populated from pipeline config - public void setEnvironment(String environment) { - this.environment = environment; - } - -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/BUILD b/src/java/com/twitter/search/ingester/pipeline/util/BUILD deleted file mode 100644 index 916b58636..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/BUILD +++ /dev/null @@ -1,41 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/code/findbugs:jsr305", - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/commons-lang", - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/apache/commons:commons-math3", - "3rdparty/jvm/org/apache/thrift:libthrift", - "decider/src/main/scala", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/metastore/client_v2", - "src/java/com/twitter/metastore/data", - "src/java/com/twitter/search/common/debug", - "src/java/com/twitter/search/common/encoding/features", - "src/java/com/twitter/search/common/metrics", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/relevance:entities_and_filters", - "src/java/com/twitter/search/common/relevance/features", - "src/java/com/twitter/search/common/schema", - "src/java/com/twitter/search/common/schema/base", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/geocoding", - "src/java/com/twitter/search/common/util/text", - "src/java/com/twitter/search/ingester/model", - "src/java/org/apache/commons/pipeline", - "src/scala/com/twitter/common_internal/analytics/test_user_filter", - "src/thrift/com/twitter/expandodo:cards-java", - "src/thrift/com/twitter/manhattan:internal-scala", - "src/thrift/com/twitter/search/common:indexing-java", - "src/thrift/com/twitter/search/common:schema-java", - "src/thrift/com/twitter/service/metastore/gen:thrift-java", - "stitch/stitch-core", - "storage/clients/manhattan", - "util/util-core:scala", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/util/BatchedElement.java b/src/java/com/twitter/search/ingester/pipeline/util/BatchedElement.java deleted file mode 100644 index 7b78a1fc5..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/BatchedElement.java +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import java.util.concurrent.CompletableFuture; - -public class BatchedElement { - private CompletableFuture completableFuture; - private T item; - - public BatchedElement(T item, CompletableFuture completableFuture) { - this.item = item; - this.completableFuture = completableFuture; - } - - public T getItem() { - return 
item; - } - - public CompletableFuture getCompletableFuture() { - return completableFuture; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/BatchingClient.java b/src/java/com/twitter/search/ingester/pipeline/util/BatchingClient.java deleted file mode 100644 index 222c6f544..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/BatchingClient.java +++ /dev/null @@ -1,105 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import java.util.HashSet; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; - -import com.google.common.collect.Sets; - -import com.twitter.util.Future; -import com.twitter.util.Promise; - -/** - * Batches single requests of type RQ -> Future to an underlying client that supports batch - * calls with multiple values of type RQ. Threadsafe. - */ -public class BatchingClient { - @FunctionalInterface - public interface BatchClient { - /** - * Issue a request to the underlying store which supports batches of requests. - */ - Future> batchGet(Set requests); - } - - /** - * unsentRequests is not threadsafe, and so it must be externally synchronized. - */ - private final HashSet unsentRequests = new HashSet<>(); - - private final ConcurrentHashMap> promises = new ConcurrentHashMap<>(); - - private final BatchClient batchClient; - private final int batchSize; - - public BatchingClient( - BatchClient batchClient, - int batchSize - ) { - this.batchClient = batchClient; - this.batchSize = batchSize; - } - - /** - * Send a request and receive a Future. The future will not be resolved until there are at - least batchSize requests ready to send. - */ - public Future call(RQ request) { - Promise promise = promises.computeIfAbsent(request, r -> new Promise<>()); - - maybeBatchCall(request); - - return promise; - } - - private void maybeBatchCall(RQ request) { - Set frozenRequests; - synchronized (unsentRequests) { - unsentRequests.add(request); - if (unsentRequests.size() < batchSize) { - return; - } - - // Make a copy of requests so we can modify it inside executeBatchCall without additional - synchronization.
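- // The copied batch is flushed immediately; any request missing from the batch response is
- // failed with a ResponseNotReturnedException by executeBatchCall.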
- frozenRequests = new HashSet<>(unsentRequests); - unsentRequests.clear(); - } - - executeBatchCall(frozenRequests); - } - - private void executeBatchCall(Set requests) { - batchClient.batchGet(requests) - .onSuccess(responseMap -> { - for (Map.Entry entry : responseMap.entrySet()) { - Promise promise = promises.remove(entry.getKey()); - if (promise != null) { - promise.become(Future.value(entry.getValue())); - } - } - - Set outstandingRequests = Sets.difference(requests, responseMap.keySet()); - for (RQ request : outstandingRequests) { - Promise promise = promises.remove(request); - if (promise != null) { - promise.become(Future.exception(new ResponseNotReturnedException(request))); - } - } - - return null; - }) - .onFailure(exception -> { - for (RQ request : requests) { - Promise promise = promises.remove(request); - if (promise != null) { - promise.become(Future.exception(exception)); - } - } - - return null; - }); - } -} - diff --git a/src/java/com/twitter/search/ingester/pipeline/util/CardFieldUtil.java b/src/java/com/twitter/search/ingester/pipeline/util/CardFieldUtil.java deleted file mode 100644 index ae82f8764..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/CardFieldUtil.java +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import com.google.common.base.Strings; - -import com.twitter.expandodo.thriftjava.BindingValue; -import com.twitter.expandodo.thriftjava.BindingValueType; -import com.twitter.expandodo.thriftjava.Card2; -import com.twitter.search.common.util.text.LanguageIdentifierHelper; -import com.twitter.search.ingester.model.IngesterTwitterMessage; - -public final class CardFieldUtil { - - private CardFieldUtil() { - /* prevent instantiation */ - } - - /** - * Binding Keys for card fields - */ - public static final String TITLE_BINDING_KEY = "title"; - public static final String DESCRIPTION_BINDING_KEY = "description"; - - /** - * given a bindingKey and card, will return the bindingValue of the given bindingKey - * if present in card.getBinding_values(). If no match is found return null. - */ - public static String extractBindingValue(String bindingKey, Card2 card) { - for (BindingValue bindingValue : card.getBinding_values()) { - if ((bindingValue != null) - && bindingValue.isSetType() - && (bindingValue.getType() == BindingValueType.STRING) - && bindingKey.equals(bindingValue.getKey())) { - return bindingValue.getString_value(); - } - } - return null; - } - - /** - * derives card lang from title + description and sets it in TwitterMessage. 
- */ - public static void deriveCardLang(IngesterTwitterMessage message) { - message.setCardLang(LanguageIdentifierHelper.identifyLanguage(String.format("%s %s", - Strings.nullToEmpty(message.getCardTitle()), - Strings.nullToEmpty(message.getCardDescription()))).getLanguage()); - } -} - diff --git a/src/java/com/twitter/search/ingester/pipeline/util/IngesterStageTimer.java b/src/java/com/twitter/search/ingester/pipeline/util/IngesterStageTimer.java deleted file mode 100644 index a20db70a6..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/IngesterStageTimer.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; -import java.util.concurrent.TimeUnit; -import com.twitter.common.base.MorePreconditions; -import com.twitter.search.common.metrics.SearchTimerStats; -import org.apache.commons.pipeline.stage.StageTimer; -/** - * Adds science stats export to StageTimer - */ -public class IngesterStageTimer extends StageTimer { - private final String name; - private final SearchTimerStats timer; - - public IngesterStageTimer(String statName) { - name = MorePreconditions.checkNotBlank(statName); - timer = SearchTimerStats.export(name, TimeUnit.NANOSECONDS, true); - } - - public String getName() { - return name; - } - - @Override - public void start() { - // This override is not necessary; it is added for code readability. - // super.start puts the current time in startTime - super.start(); - } - - @Override - public void stop() { - super.stop(); - long runTime = System.nanoTime() - startTime.get(); - timer.timerIncrement(runTime); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/ManhattanCodedLocationProvider.java b/src/java/com/twitter/search/ingester/pipeline/util/ManhattanCodedLocationProvider.java deleted file mode 100644 index cc569a939..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/ManhattanCodedLocationProvider.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Iterator; -import java.util.List; -import java.util.Optional; - -import com.google.common.base.Preconditions; - -import com.twitter.search.common.indexing.thriftjava.ThriftGeoLocationSource; -import com.twitter.search.common.indexing.thriftjava.ThriftGeoPoint; -import com.twitter.search.common.indexing.thriftjava.ThriftGeocodeRecord; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.relevance.entities.GeoObject; -import com.twitter.search.common.util.geocoding.ManhattanGeocodeRecordStore; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.stitch.Stitch; -import com.twitter.storage.client.manhattan.kv.JavaManhattanKVEndpoint; -import com.twitter.storage.client.manhattan.kv.ManhattanValue; -import com.twitter.util.Function; -import com.twitter.util.Future; - - -public final class ManhattanCodedLocationProvider { - - private final ManhattanGeocodeRecordStore store; - private final SearchCounter locationsCounter; - - private static final String LOCATIONS_POPULATED_STAT_NAME = "_locations_populated_count"; - - public static ManhattanCodedLocationProvider createWithEndpoint( - JavaManhattanKVEndpoint endpoint, String metricsPrefix, String datasetName) { - return new ManhattanCodedLocationProvider( - ManhattanGeocodeRecordStore.create(endpoint, datasetName), metricsPrefix); - } - - private ManhattanCodedLocationProvider(ManhattanGeocodeRecordStore store, String 
metricPrefix) { - this.locationsCounter = SearchCounter.export(metricPrefix + LOCATIONS_POPULATED_STAT_NAME); - this.store = store; - } - - /** - * Iterates through all given messages, and for each message that has a location set, retrieves - * the coordinates of that location from Manhattan and sets them back on that message. - */ - public Future> populateCodedLatLon( - Collection messages) { - if (messages.isEmpty()) { - return Future.value(messages); - } - - // Batch read requests - List>>> readRequests = - new ArrayList<>(messages.size()); - for (IngesterTwitterMessage message : messages) { - readRequests.add(store.asyncReadFromManhattan(message.getLocation())); - } - Future>>> batchedRequest = - Stitch.run(Stitch.collect(readRequests)); - - return batchedRequest.map(Function.func(optGeoLocations -> { - // Iterate over messages and responses simultaneously - Preconditions.checkState(messages.size() == optGeoLocations.size()); - Iterator messageIterator = messages.iterator(); - Iterator>> optGeoLocationIterator = - optGeoLocations.iterator(); - while (messageIterator.hasNext() && optGeoLocationIterator.hasNext()) { - IngesterTwitterMessage message = messageIterator.next(); - Optional> optGeoLocation = - optGeoLocationIterator.next(); - if (setGeoLocationForMessage(message, optGeoLocation)) { - locationsCounter.increment(); - } - } - return messages; - })); - } - - /** - * Returns whether a valid geolocation was successfully found and saved in the message. - */ - private boolean setGeoLocationForMessage( - IngesterTwitterMessage message, - Optional> optGeoLocation) { - if (optGeoLocation.isPresent()) { - ThriftGeocodeRecord geoLocation = optGeoLocation.get().contents(); - ThriftGeoPoint geoTags = geoLocation.getGeoPoint(); - - if ((geoTags.getLatitude() == GeoObject.DOUBLE_FIELD_NOT_PRESENT) - && (geoTags.getLongitude() == GeoObject.DOUBLE_FIELD_NOT_PRESENT)) { - // This case indicates that we have "negative cache" in coded_locations table, so - // don't try to geocode again. - message.setUncodeableLocation(); - return false; - } else { - GeoObject code = new GeoObject( - geoTags.getLatitude(), - geoTags.getLongitude(), - geoTags.getAccuracy(), - ThriftGeoLocationSource.USER_PROFILE); - message.setGeoLocation(code); - return true; - } - } else { - message.setGeocodeRequired(); - return false; - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PenguinVersionsUtil.java b/src/java/com/twitter/search/ingester/pipeline/util/PenguinVersionsUtil.java deleted file mode 100644 index 323dd201d..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PenguinVersionsUtil.java +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import java.util.ArrayList; -import java.util.List; - -import com.google.common.base.Preconditions; - -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.decider.Decider; - -public final class PenguinVersionsUtil { - - private PenguinVersionsUtil() { /* prevent instantiation */ } - - /** - * Utility method for updating penguinVersions lists via decider availability. We must have - * at least one version available. 
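- * A version is kept when its decider key, enable_penguin_version_<byteValue>, is available.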
- * @param penguinVersions - * @param decider - * @return - */ - public static List filterPenguinVersionsWithDeciders( - List penguinVersions, - Decider decider) { - List updatedPenguinVersions = new ArrayList<>(); - for (PenguinVersion penguinVersion : penguinVersions) { - if (isPenguinVersionAvailable(penguinVersion, decider)) { - updatedPenguinVersions.add(penguinVersion); - } - } - Preconditions.checkArgument(penguinVersions.size() > 0, - "At least one penguin version must be specified."); - - return updatedPenguinVersions; - } - - /** - * Checks penguinVersion decider for availability. - * @param penguinVersion - * @param decider - * @return - */ - public static boolean isPenguinVersionAvailable(PenguinVersion penguinVersion, Decider decider) { - return decider.isAvailable( - String.format("enable_penguin_version_%d", penguinVersion.getByteValue())); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PipelineExceptionHandler.java b/src/java/com/twitter/search/ingester/pipeline/util/PipelineExceptionHandler.java deleted file mode 100644 index fc9dd2a72..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PipelineExceptionHandler.java +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import com.twitter.util.Duration; - -public interface PipelineExceptionHandler { - /** - * Logs the given message and waits the given duration. - */ - void logAndWait(String msg, Duration waitTime) throws InterruptedException; - - /** - * Logs the given message and shutdowns the application. - */ - void logAndShutdown(String msg); -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageException.java b/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageException.java deleted file mode 100644 index 4f4dcddbf..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageException.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -public class PipelineStageException extends Exception { - public PipelineStageException(Object location, String message, Throwable cause) { - super(message + " In Stage : " + location.getClass(), cause); - } - - public PipelineStageException(Throwable cause) { - super(cause); - } - - public PipelineStageException(String message) { - super(message); - } - - public PipelineStageException(Object location, String message) { - super(message + " In Stage : " + location.getClass()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageRuntimeException.java b/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageRuntimeException.java deleted file mode 100644 index 32d237804..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PipelineStageRuntimeException.java +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -public class PipelineStageRuntimeException extends RuntimeException { - public PipelineStageRuntimeException(String msg) { - super(msg); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PipelineUtil.java b/src/java/com/twitter/search/ingester/pipeline/util/PipelineUtil.java deleted file mode 100644 index 58159347b..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PipelineUtil.java +++ /dev/null @@ -1,26 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import com.google.common.base.Preconditions; - -import org.apache.commons.pipeline.Feeder; -import 
org.apache.commons.pipeline.stage.InstrumentedBaseStage; - -public final class PipelineUtil { - - /** - * Feed an object to a specified stage. Used for stages that follow the pattern of - * looping indefinitely in the first call to process() and don't care what the object passed - * in is, but still needs at least one item fed to the stage to start processing. - * - * Examples of stages like this are: EventBusReaderStage and KafkaBytesReaderStage - * - * @param stage stage to enqueue an arbitrary object to. - */ - public static void feedStartObjectToStage(InstrumentedBaseStage stage) { - Feeder stageFeeder = stage.getStageContext().getStageFeeder(stage); - Preconditions.checkNotNull(stageFeeder); - stageFeeder.feed("off to the races"); - } - - private PipelineUtil() { /* prevent instantiation */ } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/PipelineV2CreationException.java b/src/java/com/twitter/search/ingester/pipeline/util/PipelineV2CreationException.java deleted file mode 100644 index 9248050c4..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/PipelineV2CreationException.java +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -public class PipelineV2CreationException extends Exception { - public PipelineV2CreationException(String message) { - super(message); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/ResponseNotReturnedException.java b/src/java/com/twitter/search/ingester/pipeline/util/ResponseNotReturnedException.java deleted file mode 100644 index ad58148cf..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/ResponseNotReturnedException.java +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -public class ResponseNotReturnedException extends Exception { - ResponseNotReturnedException(Object request) { - super("Response not returned in batch for request: " + request); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/util/UserPropertiesManager.java b/src/java/com/twitter/search/ingester/pipeline/util/UserPropertiesManager.java deleted file mode 100644 index d11932289..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/util/UserPropertiesManager.java +++ /dev/null @@ -1,446 +0,0 @@ -package com.twitter.search.ingester.pipeline.util; - -import java.util.Collection; -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Map; -import java.util.Optional; -import java.util.Set; -import javax.annotation.Nullable; - -import com.google.common.annotations.VisibleForTesting; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.Lists; -import com.google.common.collect.Maps; - -import org.apache.thrift.TBase; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.analytics.test_user_filter.TestUserFilter; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.metastore.client_v2.MetastoreClient; -import com.twitter.metastore.data.MetastoreColumn; -import com.twitter.metastore.data.MetastoreException; -import com.twitter.metastore.data.MetastoreRow; -import com.twitter.metastore.data.MetastoreValue; -import com.twitter.search.common.metrics.RelevanceStats; -import com.twitter.search.common.metrics.SearchCounter; -import com.twitter.search.common.metrics.SearchRateCounter; -import com.twitter.search.common.metrics.SearchRequestStats; 
-import com.twitter.search.common.relevance.entities.TwitterMessage; -import com.twitter.search.common.relevance.features.RelevanceSignalConstants; -import com.twitter.search.ingester.model.IngesterTwitterMessage; -import com.twitter.service.metastore.gen.ResponseCode; -import com.twitter.service.metastore.gen.TweepCred; -import com.twitter.util.Function; -import com.twitter.util.Future; - -public class UserPropertiesManager { - private static final Logger LOG = LoggerFactory.getLogger(UserPropertiesManager.class); - - @VisibleForTesting - protected static final List>> COLUMNS = - ImmutableList.of(MetastoreColumn.TWEEPCRED); // contains tweepcred value - - // same spam threshold that is use in tweeypie to spread user level spam to tweets, all tweets - // from user with spam score above such are marked so and removed from search results - @VisibleForTesting - public static final double SPAM_SCORE_THRESHOLD = 4.5; - - @VisibleForTesting - static final SearchRequestStats MANHATTAN_METASTORE_STATS = - SearchRequestStats.export("manhattan_metastore_get", true); - - private static final MetastoreGetColumnStats GET_TWEEP_CRED - = new MetastoreGetColumnStats("tweep_cred"); - - @VisibleForTesting - static final SearchRateCounter MISSING_REPUTATION_COUNTER = RelevanceStats.exportRate( - "num_missing_reputation"); - @VisibleForTesting - static final SearchRateCounter INVALID_REPUTATION_COUNTER = RelevanceStats.exportRate( - "num_invalid_reputation"); - @VisibleForTesting - static final SearchRateCounter ACCEPTED_REPUTATION_COUNTER = RelevanceStats.exportRate( - "num_accepted_reputation"); - @VisibleForTesting - static final SearchRateCounter SKIPPED_REPUTATION_CHECK_COUNTER = RelevanceStats.exportRate( - "num_skipped_reputation_check_for_test_user"); - @VisibleForTesting - static final SearchCounter DEFAULT_REPUTATION_COUNTER = SearchCounter.export( - "messages_default_reputation_count"); - @VisibleForTesting - static final SearchCounter MESSAGE_FROM_TEST_USER = - SearchCounter.export("messages_from_test_user"); - - // User level bits that are spread onto tweets - private static final SearchRateCounter IS_USER_NSFW_COUNTER = RelevanceStats.exportRate( - "num_is_nsfw"); - private static final SearchRateCounter IS_USER_SPAM_COUNTER = RelevanceStats.exportRate( - "num_is_spam"); - - // count how many tweets has "possibly_sensitive" set to true in the original json message - private static final SearchRateCounter IS_SENSITIVE_FROM_JSON_COUNTER = RelevanceStats.exportRate( - "num_is_sensitive_in_json"); - - private static final SearchCounter SENSITIVE_BITS_COUNTER = - SearchCounter.export("messages_sensitive_bits_set_count"); - - private final MetastoreClient metastoreClient; - private final UserPropertiesManager.MetastoreGetColumnStats tweepCredStats; - - /** - * Stats for keeping track of multiGet requests to metastore for a specific data column. - */ - @VisibleForTesting static class MetastoreGetColumnStats { - /** - * No data was returned from metastore for a specific user. - */ - private final SearchCounter notReturned; - /** - * Metastore returned a successful OK response. - */ - private final SearchCounter metastoreSuccess; - /** - * Metastore returned a NOT_FOUND response for a user. - */ - private final SearchCounter metastoreNotFound; - /** - * Metastore returned a BAD_INPUT response for a user. - */ - private final SearchCounter metastoreBadInput; - /** - * Metastore returned a TRANSIENT_ERROR response for a user. 
- */ - private final SearchCounter metastoreTransientError; - /** - * Metastore returned a PERMANENT_ERROR response for a user. - */ - private final SearchCounter metastorePermanentError; - /** - * Metastore returned an unknown response code for a user. - */ - private final SearchCounter metastoreUnknownResponseCode; - /** - * Total number of users that we asked data for in metastore. - */ - private final SearchCounter totalRequests; - - @VisibleForTesting MetastoreGetColumnStats(String columnName) { - String prefix = "manhattan_metastore_get_" + columnName; - notReturned = SearchCounter.export(prefix + "_response_not_returned"); - metastoreSuccess = SearchCounter.export(prefix + "_response_success"); - metastoreNotFound = SearchCounter.export(prefix + "_response_not_found"); - metastoreBadInput = SearchCounter.export(prefix + "_response_bad_input"); - metastoreTransientError = SearchCounter.export(prefix + "_response_transient_error"); - metastorePermanentError = SearchCounter.export(prefix + "_response_permanent_error"); - metastoreUnknownResponseCode = - SearchCounter.export(prefix + "_response_unknown_response_code"); - // Have a distinguishable prefix for the total requests stat so that we can use it to get - // a viz rate against wild-carded "prefix_response_*" stats. - totalRequests = SearchCounter.export(prefix + "_requests"); - } - - /** - * Tracks metastore get column stats for an individual user's response. - * @param responseCode the response code received from metastore. Expected to be null if no - * response came back at all. - */ - private void trackMetastoreResponseCode(@Nullable ResponseCode responseCode) { - totalRequests.increment(); - - if (responseCode == null) { - notReturned.increment(); - } else if (responseCode == ResponseCode.OK) { - metastoreSuccess.increment(); - } else if (responseCode == ResponseCode.NOT_FOUND) { - metastoreNotFound.increment(); - } else if (responseCode == ResponseCode.BAD_INPUT) { - metastoreBadInput.increment(); - } else if (responseCode == ResponseCode.TRANSIENT_ERROR) { - metastoreTransientError.increment(); - } else if (responseCode == ResponseCode.PERMANENT_ERROR) { - metastorePermanentError.increment(); - } else { - metastoreUnknownResponseCode.increment(); - } - } - - @VisibleForTesting long getTotalRequests() { - return totalRequests.get(); - } - - @VisibleForTesting long getNotReturnedCount() { - return notReturned.get(); - } - - @VisibleForTesting long getMetastoreSuccessCount() { - return metastoreSuccess.get(); - } - - @VisibleForTesting long getMetastoreNotFoundCount() { - return metastoreNotFound.get(); - } - - @VisibleForTesting long getMetastoreBadInputCount() { - return metastoreBadInput.get(); - } - - @VisibleForTesting long getMetastoreTransientErrorCount() { - return metastoreTransientError.get(); - } - - @VisibleForTesting long getMetastorePermanentErrorCount() { - return metastorePermanentError.get(); - } - - @VisibleForTesting long getMetastoreUnknownResponseCodeCount() { - return metastoreUnknownResponseCode.get(); - } - } - - /** Class that holds all user properties from Manhattan. 
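- * Currently that is the user's spam score and tweepcred reputation; tweepcred defaults to the
- * unset reputation sentinel until a value is read from the metastore.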
*/ - @VisibleForTesting - protected static class ManhattanUserProperties { - private double spamScore = 0; - private float tweepcred = RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL; // default - - public ManhattanUserProperties setSpamScore(double newSpamScore) { - this.spamScore = newSpamScore; - return this; - } - - public float getTweepcred() { - return tweepcred; - } - - public ManhattanUserProperties setTweepcred(float newTweepcred) { - this.tweepcred = newTweepcred; - return this; - } - } - - public UserPropertiesManager(MetastoreClient metastoreClient) { - this(metastoreClient, GET_TWEEP_CRED); - } - - @VisibleForTesting - UserPropertiesManager( - MetastoreClient metastoreClient, - MetastoreGetColumnStats tweepCredStats) { - this.metastoreClient = metastoreClient; - this.tweepCredStats = tweepCredStats; - } - - /** - * Gets user properties including TWEEPCRED, SpamScore values/flags from metastore for the - * given userids. - * - * @param userIds the list of users for which to get the properties. - * @return mapping from userId to UserProperties. If a user's twepcred score is not present in the - * metastore, of if there was a problem retrieving it, that user's score will not be set in the - * returned map. - */ - @VisibleForTesting - Future> getManhattanUserProperties(final List userIds) { - Preconditions.checkArgument(userIds != null); - if (metastoreClient == null || userIds.isEmpty()) { - return Future.value(Collections.emptyMap()); - } - - final long start = System.currentTimeMillis(); - - return metastoreClient.multiGet(userIds, COLUMNS) - .map(new Function, Map>() { - @Override - public Map apply(Map response) { - long latencyMs = System.currentTimeMillis() - start; - Map resultMap = - Maps.newHashMapWithExpectedSize(userIds.size()); - - for (Long userId : userIds) { - MetastoreRow row = response.get(userId); - processTweepCredColumn(userId, row, resultMap); - } - - MANHATTAN_METASTORE_STATS.requestComplete(latencyMs, resultMap.size(), true); - return resultMap; - } - }) - .handle(new Function>() { - @Override - public Map apply(Throwable t) { - long latencyMs = System.currentTimeMillis() - start; - LOG.error("Exception talking to metastore after " + latencyMs + " ms.", t); - - MANHATTAN_METASTORE_STATS.requestComplete(latencyMs, 0, false); - return Collections.emptyMap(); - } - }); - } - - - /** - * Process the TweepCred column data returned from metastore, takes TweepCred, fills in the - * the resultMap as appropriate. - */ - private void processTweepCredColumn( - Long userId, - MetastoreRow metastoreRow, - Map resultMap) { - MetastoreValue tweepCredValue = - metastoreRow == null ? null : metastoreRow.getValue(MetastoreColumn.TWEEPCRED); - ResponseCode responseCode = tweepCredValue == null ? 
null : tweepCredValue.getResponseCode(); - tweepCredStats.trackMetastoreResponseCode(responseCode); - - if (responseCode == ResponseCode.OK) { - try { - TweepCred tweepCred = tweepCredValue.getValue(); - if (tweepCred != null && tweepCred.isSetScore()) { - ManhattanUserProperties manhattanUserProperties = - getOrCreateManhattanUserProperties(userId, resultMap); - manhattanUserProperties.setTweepcred(tweepCred.getScore()); - } - } catch (MetastoreException e) { - // guaranteed not to be thrown if ResponseCode.OK - LOG.warn("Unexpected MetastoreException parsing userinfo column!", e); - } - } - } - - private static ManhattanUserProperties getOrCreateManhattanUserProperties( - Long userId, Map resultMap) { - - ManhattanUserProperties manhattanUserProperties = resultMap.get(userId); - if (manhattanUserProperties == null) { - manhattanUserProperties = new ManhattanUserProperties(); - resultMap.put(userId, manhattanUserProperties); - } - - return manhattanUserProperties; - } - - /** - * Populates the user properties from the given batch. - */ - public Future> populateUserProperties( - Collection batch) { - Set userIds = new HashSet<>(); - for (IngesterTwitterMessage message : batch) { - if ((message.getUserReputation() == IngesterTwitterMessage.DOUBLE_FIELD_NOT_PRESENT) - && !message.isDeleted()) { - Optional userId = message.getFromUserTwitterId(); - if (userId.isPresent()) { - userIds.add(userId.get()); - } else { - LOG.error("No user id present for tweet {}", message.getId()); - } - } - } - List uniqIds = Lists.newArrayList(userIds); - Collections.sort(uniqIds); // for testing predictability - - Future> manhattanUserPropertiesMap = - getManhattanUserProperties(uniqIds); - - return manhattanUserPropertiesMap.map(Function.func(map -> { - for (IngesterTwitterMessage message : batch) { - if (((message.getUserReputation() != IngesterTwitterMessage.DOUBLE_FIELD_NOT_PRESENT) - && RelevanceSignalConstants.isValidUserReputation( - (int) Math.floor(message.getUserReputation()))) - || message.isDeleted()) { - continue; - } - Optional optionalUserId = message.getFromUserTwitterId(); - if (optionalUserId.isPresent()) { - long userId = optionalUserId.get(); - ManhattanUserProperties manhattanUserProperties = map.get(userId); - - final boolean isTestUser = TestUserFilter.isTestUserId(userId); - if (isTestUser) { - MESSAGE_FROM_TEST_USER.increment(); - } - - // legacy setting of tweepcred - setTweepCred(isTestUser, manhattanUserProperties, message); - - // set additional fields - if (setSensitiveBits(manhattanUserProperties, message)) { - SENSITIVE_BITS_COUNTER.increment(); - } - } - } - return batch; - })); - } - - // good old tweepcred - private void setTweepCred( - boolean isTestUser, - ManhattanUserProperties manhattanUserProperties, - TwitterMessage message) { - float score = RelevanceSignalConstants.UNSET_REPUTATION_SENTINEL; - if (manhattanUserProperties == null) { - if (isTestUser) { - SKIPPED_REPUTATION_CHECK_COUNTER.increment(); - } else { - MISSING_REPUTATION_COUNTER.increment(); - DEFAULT_REPUTATION_COUNTER.increment(); - } - } else if (!RelevanceSignalConstants.isValidUserReputation( - (int) Math.floor(manhattanUserProperties.tweepcred))) { - if (!isTestUser) { - INVALID_REPUTATION_COUNTER.increment(); - DEFAULT_REPUTATION_COUNTER.increment(); - } - } else { - score = manhattanUserProperties.tweepcred; - ACCEPTED_REPUTATION_COUNTER.increment(); - } - message.setUserReputation(score); - } - - // Sets sensitive content, nsfw, and spam flags in TwitterMessage, further - // sets the following bits 
in encoded features: - // EarlybirdFeatureConfiguration.IS_SENSITIVE_FLAG - // EarlybirdFeatureConfiguration.IS_USER_NSFW_FLAG - // EarlybirdFeatureConfiguration.IS_USER_SPAM_FLAG - private boolean setSensitiveBits( - ManhattanUserProperties manhattanUserProperties, - TwitterMessage message) { - if (manhattanUserProperties == null) { - return false; - } - - final boolean isUserSpam = manhattanUserProperties.spamScore > SPAM_SCORE_THRESHOLD; - // SEARCH-17413: Compute the field with gizmoduck data. - final boolean isUserNSFW = false; - final boolean anySensitiveBitSet = isUserSpam || isUserNSFW; - - if (message.isSensitiveContent()) { - // original json has possibly_sensitive = true, count it - IS_SENSITIVE_FROM_JSON_COUNTER.increment(); - } - - if (isUserNSFW) { - // set EarlybirdFeatureConfiguration.IS_USER_NSFW_FLAG - for (PenguinVersion penguinVersion : message.getSupportedPenguinVersions()) { - message.getTweetUserFeatures(penguinVersion).setNsfw(isUserNSFW); - } - IS_USER_NSFW_COUNTER.increment(); - } - if (isUserSpam) { - // set EarlybirdFeatureConfiguration.IS_USER_SPAM_FLAG - for (PenguinVersion penguinVersion : message.getSupportedPenguinVersions()) { - message.getTweetUserFeatures(penguinVersion).setSpam(isUserSpam); - } - IS_USER_SPAM_COUNTER.increment(); - } - - // if any of the sensitive bits are set, we return true - return anySensitiveBitSet; - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/BUILD b/src/java/com/twitter/search/ingester/pipeline/wire/BUILD deleted file mode 100644 index 042d214f2..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/BUILD +++ /dev/null @@ -1,52 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/distributedlog:distributedlog-core", - "3rdparty/jvm/commons-logging", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/kafka:kafka-clients", - "3rdparty/jvm/org/apache/thrift:libthrift", - "cuad/projects/ner/client/src/main/scala/com/twitter/cuad/ner/client", - "decider/src/main/scala", - "eventbus/client/src/main/scala/com/twitter/eventbus/client", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authorization", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/client", - "finagle/finagle-core/src/main", - "finagle/finagle-thrift/src/main/java", - "finagle/finagle-thrift/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "servo/util/src/main/scala", - "src/java/com/twitter/common/quantity", - "src/java/com/twitter/common/util:system-mocks", - "src/java/com/twitter/common_internal/manhattan", - "src/java/com/twitter/common_internal/text/version", - "src/java/com/twitter/common_internal/zookeeper", - "src/java/com/twitter/metastore/client_v2", - "src/java/com/twitter/search/common/decider", - "src/java/com/twitter/search/common/partitioning:timeslice-manager", - "src/java/com/twitter/search/common/partitioning/base", - "src/java/com/twitter/search/common/relevance:classifiers", - "src/java/com/twitter/search/common/schema/earlybird", - "src/java/com/twitter/search/common/util/io:dl-reader-writer", - "src/java/com/twitter/search/common/util/io:record-reader-api", - "src/java/com/twitter/search/common/util/io/kafka", - 
"src/java/com/twitter/search/common/util/thrift:text-protocol", - "src/java/com/twitter/search/ingester/pipeline/strato_fetchers", - "src/java/com/twitter/search/ingester/pipeline/util", - "src/thrift/com/twitter/gizmoduck:thrift-java", - "src/thrift/com/twitter/manhattan:internal-scala", - "src/thrift/com/twitter/manhattan:v1-java", - "src/thrift/com/twitter/pink-floyd/thrift:thrift-java", - "src/thrift/com/twitter/tweetypie:service-java", - "storage/clients/manhattan/client/src/main/scala", - "strato/src/main/scala/com/twitter/strato/client", - "util/util-core:scala", - "util/util-function/src/main/java", - "util/util-stats/src/main/scala", - ], -) diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/IngesterPartitioner.java b/src/java/com/twitter/search/ingester/pipeline/wire/IngesterPartitioner.java deleted file mode 100644 index f126a7370..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/IngesterPartitioner.java +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.search.ingester.pipeline.wire; - -import javax.naming.NamingException; - -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.util.io.kafka.SearchPartitioner; - -/** - * A variant of {@code SearchPartitioner} which retrieves {@code PartitionMappingManager} from - * {@code WireModule}. - * - * Note that the value object has to implement {@code Partitionable}. - */ -public class IngesterPartitioner extends SearchPartitioner { - - public IngesterPartitioner() { - super(getPartitionMappingManager()); - } - - private static PartitionMappingManager getPartitionMappingManager() { - try { - return WireModule.getWireModule().getPartitionMappingManager(); - } catch (NamingException e) { - throw new RuntimeException(e); - } - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/ProductionWireModule.java b/src/java/com/twitter/search/ingester/pipeline/wire/ProductionWireModule.java deleted file mode 100644 index b50962297..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/ProductionWireModule.java +++ /dev/null @@ -1,363 +0,0 @@ -package com.twitter.search.ingester.pipeline.wire; - -import java.util.ArrayList; -import java.util.List; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import javax.annotation.Nullable; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NamingException; - -import scala.Option; -import scala.collection.JavaConversions$; - -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; - -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.producer.Partitioner; -import org.apache.kafka.common.serialization.Deserializer; -import org.apache.kafka.common.serialization.Serializer; -import org.apache.thrift.TBase; -import org.apache.thrift.protocol.TBinaryProtocol; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.decider.Decider; -import com.twitter.decider.DeciderFactory; -import com.twitter.decider.DeciderFactory$; -import com.twitter.decider.decisionmaker.DecisionMaker; -import com.twitter.decider.decisionmaker.MutableDecisionMaker; -import com.twitter.eventbus.client.EventBusSubscriber; -import com.twitter.eventbus.client.EventBusSubscriberBuilder; -import com.twitter.finagle.Service; -import 
com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.builder.ClientConfig; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.mtls.client.MtlsThriftMuxClient; -import com.twitter.finagle.mux.transport.OpportunisticTls; -import com.twitter.finagle.service.RetryPolicy; -import com.twitter.finagle.stats.DefaultStatsReceiver; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.gizmoduck.thriftjava.UserService; -import com.twitter.metastore.client_v2.MetastoreClient; -import com.twitter.pink_floyd.thrift.Storer; -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.relevance.classifiers.TweetOffensiveEvaluator; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.common.util.io.kafka.FinagleKafkaClientUtils; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceCoreFetcher; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceParticipantsFetcher; -import com.twitter.search.ingester.pipeline.strato_fetchers.NamedEntityFetcher; -import com.twitter.search.ingester.pipeline.util.PenguinVersionsUtil; -import com.twitter.search.ingester.pipeline.util.PipelineExceptionHandler; -import com.twitter.storage.client.manhattan.kv.JavaManhattanKVEndpoint; -import com.twitter.storage.client.manhattan.kv.ManhattanKVClient; -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams; -import com.twitter.storage.client.manhattan.kv.ManhattanKVEndpointBuilder; -import com.twitter.strato.client.Client; -import com.twitter.strato.client.Strato; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.util.Duration; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * The injection module that provides all production bindings. 
- */ -public class ProductionWireModule extends WireModule { - private static final Logger LOG = LoggerFactory.getLogger(ProductionWireModule.class); - - private static final String DECIDER_BASE = "config/ingester-indexer-decider.yml"; - private static final String GEOCODE_APP_ID = "search_ingester_readonly"; - private static final String CLUSTER_DEST_NAME = ""; - - private static final String JNDI_GIZMODUCK_DEST = JNDI_PIPELINE_ROOT + "gizmoduckDest"; - - private static final String PENGUIN_VERSIONS_JNDI_NAME = JNDI_PIPELINE_ROOT + "penguinVersions"; - private static final String SEGMENT_BUFFER_SIZE_JNDI_NAME = - JNDI_PIPELINE_ROOT + "segmentBufferSize"; - private static final String SEGMENT_SEAL_DELAY_TIME_MS_JNDI_NAME = - JNDI_PIPELINE_ROOT + "segmentSealDelayTimeMs"; - private static final String JNDI_DL_URI = JNDI_PIPELINE_ROOT + "distributedlog/dlUri"; - private static final String JNDI_DL_CONFIG_FILE = - JNDI_PIPELINE_ROOT + "distributedlog/configFile"; - private static final String CLUSTER_JNDI_NAME = JNDI_PIPELINE_ROOT + "cluster"; - - private static final String TIME_SLICE_MANAGER_ROOT_PATH = ""; - private static final String MAX_TIMESLICES_JNDI_NAME = - TIME_SLICE_MANAGER_ROOT_PATH + "hashPartition/maxTimeSlices"; - private static final String MAX_SEGMENT_SIZE_JNDI_NAME = - TIME_SLICE_MANAGER_ROOT_PATH + "hashPartition/maxSegmentSize"; - private static final String NUM_PARTITIONS_JNDI_NAME = - TIME_SLICE_MANAGER_ROOT_PATH + "hashPartition/numPartitions"; - - private static final String PINK_CLIENT_ID = "search_ingester"; - - private final Decider decider; - private final MutableDecisionMaker mutableDecisionMaker; - private final int partition; - private PipelineExceptionHandler pipelineExceptionHandler; - private final StratoMetaStoreWireModule stratoMetaStoreWireModule; - - private final Client stratoClient; - - private ServiceIdentifier serviceIdentifier = ServiceIdentifier.empty(); - - private List penguinVersions; - - public ProductionWireModule(String deciderOverlay, int partition, Option - serviceIdentifierFlag) { - mutableDecisionMaker = new MutableDecisionMaker(); - decider = DeciderFactory.get() - .withBaseConfig(DECIDER_BASE) - .withOverlayConfig(deciderOverlay) - .withRefreshBase(false) - .withDecisionMakers( - ImmutableList.builder() - .add(mutableDecisionMaker) - .addAll(JavaConversions$.MODULE$.asJavaCollection( - DeciderFactory$.MODULE$.DefaultDecisionMakers())) - .build()) - .apply(); - this.partition = partition; - this.stratoMetaStoreWireModule = new StratoMetaStoreWireModule(this); - if (serviceIdentifierFlag.isDefined()) { - this.serviceIdentifier = - ServiceIdentifier.flagOfServiceIdentifier().parse(serviceIdentifierFlag.get()); - } - - this.stratoClient = Strato.client() - .withMutualTls(serviceIdentifier) - .withRequestTimeout(Duration.fromMilliseconds(500)) - .build(); - } - - public ProductionWireModule(String deciderOverlay, - int partition, - PipelineExceptionHandler pipelineExceptionHandler, - Option serviceIdentifierFlag) { - this(deciderOverlay, partition, serviceIdentifierFlag); - this.pipelineExceptionHandler = pipelineExceptionHandler; - } - - public void setPipelineExceptionHandler(PipelineExceptionHandler pipelineExceptionHandler) { - this.pipelineExceptionHandler = pipelineExceptionHandler; - } - - @Override - public ServiceIdentifier getServiceIdentifier() { - return serviceIdentifier; - } - - @Override - public PartitionMappingManager getPartitionMappingManager() { - return PartitionMappingManager.getInstance(); - } - - @Override - public 
JavaManhattanKVEndpoint getJavaManhattanKVEndpoint() { - Preconditions.checkNotNull(serviceIdentifier, - "Can't create Manhattan client with S2S authentication because Service Identifier is null"); - LOG.info(String.format("Service identifier for Manhattan client: %s", - ServiceIdentifier.asString(serviceIdentifier))); - ManhattanKVClientMtlsParams mtlsParams = ManhattanKVClientMtlsParams.apply(serviceIdentifier, - ManhattanKVClientMtlsParams.apply$default$2(), - OpportunisticTls.Required() - ); - return ManhattanKVEndpointBuilder - .apply(ManhattanKVClient.apply(GEOCODE_APP_ID, CLUSTER_DEST_NAME, mtlsParams)) - .buildJava(); - } - - @Override - public Decider getDecider() { - return decider; - } - - // Since MutableDecisionMaker is needed only for production TwitterServer, this method is defined - // only in ProductionWireModule. - public MutableDecisionMaker getMutableDecisionMaker() { - return mutableDecisionMaker; - } - - @Override - public int getPartition() { - return partition; - } - - @Override - public PipelineExceptionHandler getPipelineExceptionHandler() { - return pipelineExceptionHandler; - } - - @Override - public Storer.ServiceIface getStorer(Duration requestTimeout, int retries) { - TBinaryProtocol.Factory factory = new TBinaryProtocol.Factory(); - - MtlsThriftMuxClient mtlsThriftMuxClient = new MtlsThriftMuxClient( - ThriftMux.client().withClientId(new ClientId(PINK_CLIENT_ID))); - ThriftMux.Client tmuxClient = mtlsThriftMuxClient - .withMutualTls(serviceIdentifier) - .withOpportunisticTls(OpportunisticTls.Required()); - - ClientBuilder< - ThriftClientRequest, - byte[], - ClientConfig.Yes, - ClientConfig.Yes, - ClientConfig.Yes> builder = ClientBuilder.get() - .dest("") - .requestTimeout(requestTimeout) - .retries(retries) - .timeout(requestTimeout.mul(retries)) - .stack(tmuxClient) - .name("pinkclient") - .reportTo(DefaultStatsReceiver.get()); - return new Storer.ServiceToClient(ClientBuilder.safeBuild(builder), factory); - } - - @Override - public MetastoreClient getMetastoreClient() throws NamingException { - return stratoMetaStoreWireModule.getMetastoreClient(this.serviceIdentifier); - } - - @Override - public ExecutorService getThreadPool(int numThreads) { - return Executors.newFixedThreadPool(numThreads); - } - - @Override - public TweetService.ServiceToClient getTweetyPieClient(String tweetypieClientId) - throws NamingException { - return TweetyPieWireModule.getTweetyPieClient(tweetypieClientId, serviceIdentifier); - } - - @Override - public UserService.ServiceToClient getGizmoduckClient(String clientId) - throws NamingException { - Context context = new InitialContext(); - String dest = (String) context.lookup(JNDI_GIZMODUCK_DEST); - - MtlsThriftMuxClient mtlsThriftMuxClient = new MtlsThriftMuxClient( - ThriftMux.client().withClientId(new ClientId(clientId))); - - Service clientBuilder = - ClientBuilder.safeBuild( - ClientBuilder - .get() - .requestTimeout(Duration.fromMilliseconds(800)) - .retryPolicy(RetryPolicy.tries(3)) - .name("search_ingester_gizmoduck_client") - .reportTo(DefaultStatsReceiver.get()) - .daemon(true) - .dest(dest) - .stack(mtlsThriftMuxClient.withMutualTls(serviceIdentifier) - .withOpportunisticTls(OpportunisticTls.Required()))); - return new UserService.ServiceToClient(clientBuilder, new TBinaryProtocol.Factory()); - } - - @Override - public > EventBusSubscriber createEventBusSubscriber( - Function> process, - Class thriftStructClass, - String eventBusSubscriberId, - int maxConcurrentEvents) { - Preconditions.checkNotNull(serviceIdentifier, - 
"Can't create EventBusSubscriber with S2S auth because Service Identifier is null"); - LOG.info(String.format("Service identifier for EventBusSubscriber Manhattan client: %s", - ServiceIdentifier.asString(serviceIdentifier))); - // We set the processTimeoutMs parameter here to be Duration.Top because we do not want to read - // more events from EventBus if we are experiencing back pressure and cannot write them to the - // downstream queue. - return EventBusSubscriberBuilder.apply() - .subscriberId(eventBusSubscriberId) - .skipToLatest(false) - .fromAllZones(true) - .statsReceiver(DefaultStatsReceiver.get().scope("eventbus")) - .thriftStruct(thriftStructClass) - .serviceIdentifier(serviceIdentifier) - .maxConcurrentEvents(maxConcurrentEvents) - .processTimeout(Duration.Top()) - .build(process); - } - - @Override - public Clock getClock() { - return Clock.SYSTEM_CLOCK; - } - - @Override - public TweetOffensiveEvaluator getTweetOffensiveEvaluator() { - return new TweetOffensiveEvaluator(); - } - - @Override - public EarlybirdCluster getEarlybirdCluster() throws NamingException { - Context jndiContext = new InitialContext(); - String clusterName = (String) jndiContext.lookup(CLUSTER_JNDI_NAME); - return EarlybirdCluster.valueOf(clusterName.toUpperCase()); - } - - @Override - public List getPenguinVersions() throws NamingException { - Context context = new InitialContext(); - String penguinVersionsStr = (String) context.lookup(PENGUIN_VERSIONS_JNDI_NAME); - penguinVersions = new ArrayList<>(); - - for (String penguinVersion : penguinVersionsStr.split(",")) { - PenguinVersion pv = PenguinVersion.versionFromByteValue(Byte.parseByte(penguinVersion)); - if (PenguinVersionsUtil.isPenguinVersionAvailable(pv, decider)) { - penguinVersions.add(pv); - } - } - - Preconditions.checkArgument(penguinVersions.size() > 0, - "At least one penguin version must be specified."); - - return penguinVersions; - } - - // We update penguin versions via deciders in order to disable one in case of an emergency. 
- @Override - public List getCurrentlyEnabledPenguinVersions() { - return PenguinVersionsUtil.filterPenguinVersionsWithDeciders(penguinVersions, decider); - } - - @Override - public NamedEntityFetcher getNamedEntityFetcher() { - return new NamedEntityFetcher(stratoClient); - } - - @Override - public AudioSpaceParticipantsFetcher getAudioSpaceParticipantsFetcher() { - return new AudioSpaceParticipantsFetcher(stratoClient); - } - - @Override - public AudioSpaceCoreFetcher getAudioSpaceCoreFetcher() { - return new AudioSpaceCoreFetcher(stratoClient); - } - - @Override - public KafkaConsumer newKafkaConsumer( - String kafkaClusterPath, Deserializer deserializer, String clientId, String groupId, - int maxPollRecords) { - return FinagleKafkaClientUtils.newKafkaConsumer( - kafkaClusterPath, deserializer, clientId, groupId, maxPollRecords); - } - - @Override - public BlockingFinagleKafkaProducer newFinagleKafkaProducer( - String kafkaClusterPath, Serializer serializer, String clientId, - @Nullable Class partitionerClass) { - return FinagleKafkaClientUtils.newFinagleKafkaProducer( - kafkaClusterPath, true, serializer, clientId, partitionerClass); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/StratoMetaStoreWireModule.java b/src/java/com/twitter/search/ingester/pipeline/wire/StratoMetaStoreWireModule.java deleted file mode 100644 index 0f3f5833b..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/StratoMetaStoreWireModule.java +++ /dev/null @@ -1,119 +0,0 @@ -package com.twitter.search.ingester.pipeline.wire; - -import java.util.concurrent.TimeUnit; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NamingException; - -import com.google.common.base.Preconditions; - -import org.apache.thrift.protocol.TBinaryProtocol; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common.quantity.Amount; -import com.twitter.common.quantity.Time; -import com.twitter.common_internal.manhattan.ManhattanClient; -import com.twitter.common_internal.manhattan.ManhattanClientImpl; -import com.twitter.finagle.Service; -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.builder.ClientConfig.Yes; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.mtls.client.MtlsThriftMuxClient; -import com.twitter.finagle.mux.transport.OpportunisticTls; -import com.twitter.finagle.stats.DefaultStatsReceiver; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.manhattan.thriftv1.ConsistencyLevel; -import com.twitter.manhattan.thriftv1.ManhattanCoordinator; -import com.twitter.metastore.client_v2.MetastoreClient; -import com.twitter.metastore.client_v2.MetastoreClientImpl; -import com.twitter.util.Duration; - -public class StratoMetaStoreWireModule { - private WireModule wireModule; - private static final Logger LOG = LoggerFactory.getLogger(StratoMetaStoreWireModule.class); - - public StratoMetaStoreWireModule(WireModule wireModule) { - this.wireModule = wireModule; - } - - private static final String MANHATTAN_SD_ZK_ROLE = - WireModule.JNDI_PIPELINE_ROOT + "manhattanSDZKRole"; - private static final String MANHATTAN_SD_ZK_ENV = - WireModule.JNDI_PIPELINE_ROOT + "manhattanSDZKEnv"; - private static final String MANHATTAN_SD_ZK_NAME = - WireModule.JNDI_PIPELINE_ROOT + "manhattanSDZKName"; - private static final String MANHATTAN_APPLICATION_ID = 
"ingester_starbuck"; - - private static class Options { - // The client id as a string - private final String clientId = "ingester"; - - // The connection timeout in millis - private final long connectTimeout = 50; - - // The request timeout im millis - private final long requestTimeout = 300; - - // Total timeout per call (including retries) - private final long totalTimeout = 500; - - // The maximum number of retries per call - private final int retries = 2; - } - - private final Options options = new Options(); - - private ClientBuilder getClientBuilder( - String name, - ServiceIdentifier serviceIdentifier) { - return getClientBuilder(name, new ClientId(options.clientId), serviceIdentifier); - } - - private ClientBuilder getClientBuilder( - String name, - ClientId clientId, - ServiceIdentifier serviceIdentifier) { - Preconditions.checkNotNull(serviceIdentifier, - "Can't create Metastore Manhattan client with S2S auth because Service Identifier is null"); - LOG.info(String.format("Service identifier for Metastore Manhattan client: %s", - ServiceIdentifier.asString(serviceIdentifier))); - return ClientBuilder.get() - .name(name) - .tcpConnectTimeout(new Duration(TimeUnit.MILLISECONDS.toNanos(options.connectTimeout))) - .requestTimeout(new Duration(TimeUnit.MILLISECONDS.toNanos(options.requestTimeout))) - .timeout(new Duration(TimeUnit.MILLISECONDS.toNanos(options.totalTimeout))) - .retries(options.retries) - .reportTo(DefaultStatsReceiver.get()) - .stack(new MtlsThriftMuxClient(ThriftMux.client()) - .withMutualTls(serviceIdentifier) - .withClientId(clientId) - .withOpportunisticTls(OpportunisticTls.Required())); - } - - /** - * Returns the Metastore client. - */ - public MetastoreClient getMetastoreClient(ServiceIdentifier serviceIdentifier) - throws NamingException { - Context jndiContext = new InitialContext(); - String destString = String.format("/cluster/local/%s/%s/%s", - jndiContext.lookup(MANHATTAN_SD_ZK_ROLE), - jndiContext.lookup(MANHATTAN_SD_ZK_ENV), - jndiContext.lookup(MANHATTAN_SD_ZK_NAME)); - LOG.info("Manhattan serverset Name: {}", destString); - - Service service = - ClientBuilder.safeBuild(getClientBuilder("metastore", serviceIdentifier).dest(destString)); - - ManhattanClient manhattanClient = new ManhattanClientImpl( - new ManhattanCoordinator.ServiceToClient(service, new TBinaryProtocol.Factory()), - MANHATTAN_APPLICATION_ID, - Amount.of((int) options.requestTimeout, Time.MILLISECONDS), - ConsistencyLevel.ONE); - - return new MetastoreClientImpl(manhattanClient); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/TweetyPieWireModule.java b/src/java/com/twitter/search/ingester/pipeline/wire/TweetyPieWireModule.java deleted file mode 100644 index 3f3d67158..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/TweetyPieWireModule.java +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.search.ingester.pipeline.wire; - -import java.util.concurrent.TimeoutException; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NamingException; - -import org.apache.thrift.protocol.TBinaryProtocol; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.twitter.common_internal.zookeeper.TwitterServerSet; -import com.twitter.finagle.Name; -import com.twitter.finagle.Resolvers; -import com.twitter.finagle.Service; -import com.twitter.finagle.ThriftMux; -import com.twitter.finagle.builder.ClientBuilder; -import com.twitter.finagle.builder.ClientConfig; -import 
com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finagle.mtls.client.MtlsThriftMuxClient; -import com.twitter.finagle.mux.transport.OpportunisticTls; -import com.twitter.finagle.service.RetryPolicy; -import com.twitter.finagle.stats.DefaultStatsReceiver; -import com.twitter.finagle.thrift.ClientId; -import com.twitter.finagle.thrift.ThriftClientRequest; -import com.twitter.servo.util.WaitForServerSets; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.util.Await; -import com.twitter.util.Duration; - -final class TweetyPieWireModule { - private static final Logger LOG = LoggerFactory.getLogger(ProductionWireModule.class); - - private static final int TWEETYPIE_CONNECT_TIMEOUT_MS = 100; - private static final int TWEETYPIE_REQUEST_TIMEOUT_MS = 500; - - // This is actually the total tries count, so one initial try, and one more retry (if needed). - private static final int TWEETYPIE_REQUEST_NUM_TRIES = 3; - private static final int TWEETYPIE_TOTAL_TIMEOUT_MS = - TWEETYPIE_REQUEST_TIMEOUT_MS * TWEETYPIE_REQUEST_NUM_TRIES; - - private static final String TWEETYPIE_SD_ZK_ROLE = - WireModule.JNDI_PIPELINE_ROOT + "tweetypieSDZKRole"; - private static final String TWEETYPIE_SD_ZK_ENV = - WireModule.JNDI_PIPELINE_ROOT + "tweetypieSDZKEnv"; - private static final String TWEETYPIE_SD_ZK_NAME = - WireModule.JNDI_PIPELINE_ROOT + "tweetypieSDZKName"; - - private TweetyPieWireModule() { - } - - private static TwitterServerSet.Service getTweetyPieZkServerSetService() - throws NamingException { - Context jndiContext = new InitialContext(); - TwitterServerSet.Service service = new TwitterServerSet.Service( - (String) jndiContext.lookup(TWEETYPIE_SD_ZK_ROLE), - (String) jndiContext.lookup(TWEETYPIE_SD_ZK_ENV), - (String) jndiContext.lookup(TWEETYPIE_SD_ZK_NAME)); - LOG.info("TweetyPie ZK path: {}", TwitterServerSet.getPath(service)); - return service; - } - - static TweetService.ServiceToClient getTweetyPieClient( - String clientIdString, ServiceIdentifier serviceIdentifier) throws NamingException { - TwitterServerSet.Service service = getTweetyPieZkServerSetService(); - - // Use explicit Name types so we can force a wait on resolution (COORD-479) - String destString = String.format("/cluster/local/%s/%s/%s", - service.getRole(), service.getEnv(), service.getName()); - Name destination = Resolvers.eval(destString); - try { - Await.ready(WaitForServerSets.ready(destination, Duration.fromMilliseconds(10000))); - } catch (TimeoutException e) { - LOG.warn("Timed out while resolving Zookeeper ServerSet", e); - } catch (InterruptedException e) { - LOG.warn("Interrupted while resolving Zookeeper ServerSet", e); - Thread.currentThread().interrupt(); - } - - LOG.info("Creating Tweetypie client with ID: {}", clientIdString); - ClientId clientId = new ClientId(clientIdString); - - MtlsThriftMuxClient mtlsThriftMuxClient = new MtlsThriftMuxClient( - ThriftMux.client().withClientId(clientId)); - ThriftMux.Client tmuxClient = mtlsThriftMuxClient - .withMutualTls(serviceIdentifier) - .withOpportunisticTls(OpportunisticTls.Required()); - - ClientBuilder< - ThriftClientRequest, - byte[], - ClientConfig.Yes, - ClientConfig.Yes, - ClientConfig.Yes> builder = ClientBuilder.get() - .stack(tmuxClient) - .name("retrieve_cards_tweetypie_client") - .dest(destination) - .reportTo(DefaultStatsReceiver.get()) - .connectTimeout(Duration.fromMilliseconds(TWEETYPIE_CONNECT_TIMEOUT_MS)) - .requestTimeout(Duration.fromMilliseconds(TWEETYPIE_REQUEST_TIMEOUT_MS)) - 
.timeout(Duration.fromMilliseconds(TWEETYPIE_TOTAL_TIMEOUT_MS)) - .retryPolicy(RetryPolicy.tries( - TWEETYPIE_REQUEST_NUM_TRIES, - RetryPolicy.TimeoutAndWriteExceptionsOnly())); - - Service clientBuilder = ClientBuilder.safeBuild(builder); - - return new TweetService.ServiceToClient(clientBuilder, new TBinaryProtocol.Factory()); - } -} diff --git a/src/java/com/twitter/search/ingester/pipeline/wire/WireModule.java b/src/java/com/twitter/search/ingester/pipeline/wire/WireModule.java deleted file mode 100644 index c6c5f198f..000000000 --- a/src/java/com/twitter/search/ingester/pipeline/wire/WireModule.java +++ /dev/null @@ -1,226 +0,0 @@ -package com.twitter.search.ingester.pipeline.wire; - -import java.util.List; -import java.util.concurrent.ExecutorService; -import javax.annotation.Nullable; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NamingException; - -import org.apache.kafka.clients.consumer.KafkaConsumer; -import org.apache.kafka.clients.producer.Partitioner; -import org.apache.kafka.common.serialization.Deserializer; -import org.apache.kafka.common.serialization.Serializer; -import org.apache.thrift.TBase; - -import com.twitter.common.util.Clock; -import com.twitter.common_internal.text.version.PenguinVersion; -import com.twitter.decider.Decider; -import com.twitter.eventbus.client.EventBusSubscriber; -import com.twitter.finagle.mtls.authentication.ServiceIdentifier; -import com.twitter.finatra.kafka.producers.BlockingFinagleKafkaProducer; -import com.twitter.gizmoduck.thriftjava.UserService; -import com.twitter.metastore.client_v2.MetastoreClient; -import com.twitter.pink_floyd.thrift.Storer; -import com.twitter.search.common.partitioning.base.PartitionMappingManager; -import com.twitter.search.common.relevance.classifiers.TweetOffensiveEvaluator; -import com.twitter.search.common.schema.earlybird.EarlybirdCluster; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceCoreFetcher; -import com.twitter.search.ingester.pipeline.strato_fetchers.AudioSpaceParticipantsFetcher; -import com.twitter.search.ingester.pipeline.strato_fetchers.NamedEntityFetcher; -import com.twitter.search.ingester.pipeline.util.PipelineExceptionHandler; -import com.twitter.storage.client.manhattan.kv.JavaManhattanKVEndpoint; -import com.twitter.tweetypie.thriftjava.TweetService; -import com.twitter.util.Duration; -import com.twitter.util.Function; -import com.twitter.util.Future; - -/** - * An "injection module" that provides bindings for all ingester endpoints that we want to mock out - * in tests. - */ -public abstract class WireModule { - /** The JNDI property to which this module will be bound. */ - private static final String WIRE_MODULE_NAME = ""; - - /** The root name of all properties specified in the twitter-naming-production.*.xml files. */ - public static final String JNDI_PIPELINE_ROOT = ""; - - /** - * (Re)binds the given wire module in JNDI. - * - * @param wireModule The wire module to bind in JNDI. - * @throws NamingException If the wire module cannot be bound in JNDI for some reason. - */ - public static void bindWireModule(WireModule wireModule) throws NamingException { - Context jndiContext = new InitialContext(); - jndiContext.rebind(WIRE_MODULE_NAME, wireModule); - } - - /** - * Returns the wire module bound in JNDI. - * - * @return The wire module bound in JNDI. - * @throws NamingException If there's no wire module bound in JNDI. 
- */ - public static WireModule getWireModule() throws NamingException { - Context jndiContext = new InitialContext(); - return (WireModule) jndiContext.lookup(WIRE_MODULE_NAME); - } - - /** - * Retrieves the service identifier needed for making mtls requests. - * @return The service identifier for the current running service. - */ - public abstract ServiceIdentifier getServiceIdentifier(); - - /** - * Creates a new {@code FinagleKafkaConsumer} with a specified consumer group ID. - */ - public abstract KafkaConsumer newKafkaConsumer( - String kafkaClusterPath, Deserializer deserializer, String clientId, String groupId, - int maxPollRecords); - - /** - * Creates a new {@code FinagleKafkaConsumer} with a specified consumer group ID. - */ - public abstract BlockingFinagleKafkaProducer newFinagleKafkaProducer( - String kafkaClusterPath, Serializer serializer, String clientId, - @Nullable Class partitionerClass); - - /** - * Gets a TweetyPie client. - * - * @param tweetypieClientId Use this string as the client id. - * @return A TweetyPie client - * @throws NamingException - */ - public abstract TweetService.ServiceToClient getTweetyPieClient(String tweetypieClientId) - throws NamingException; - - /** - * Gets a Gizmoduck client. - * - * @param clientId - * @throws NamingException - */ - public abstract UserService.ServiceToClient getGizmoduckClient(String clientId) - throws NamingException; - - /** - * Gets the ManhattanKVEndpoint that should be used for the ManhattanCodedLocationProvider - * - * @return the JavaManhattanKVEndpoint that we need for the ManhattanCodedLocationProvider - * @throws NamingException - */ - public abstract JavaManhattanKVEndpoint getJavaManhattanKVEndpoint() - throws NamingException; - - /** - * Returns the decider to be used by all stages. - * - * @return The decider to be used by all stages. - */ - public abstract Decider getDecider(); - - /** - * Returns the partition ID to be used by all stages. - * - * @return The partition ID to be used by all stages. - */ - public abstract int getPartition(); - - - /** - * Returns the PipelineExceptionHandler instance to be used by all stages. - * - * @return The PipelineExceptionHandler instance to be used by all stages. - * @throws NamingException If building the PipelineExceptionHandler instance requires some - * parameters, and those parameters were not bound in JNDI. - */ - public abstract PipelineExceptionHandler getPipelineExceptionHandler(); - - /** - * Gets the PartitionMappingManager for the Kafka writer. - * - * @return a PartitionMappingManager - */ - public abstract PartitionMappingManager getPartitionMappingManager(); - - /** - * Returns the Metastore client used by the UserPropertiesManager. - * - * @return A Metastore client. - * @throws NamingException - */ - public abstract MetastoreClient getMetastoreClient() throws NamingException; - - /** - * Returns an ExecutorService potentially backed by the specified number of threads. - * - * @param numThreads An advisory value with a suggestion for how large the threadpool should be. - * @return an ExecutorService that might be backed by some threads. - * @throws NamingException - */ - public abstract ExecutorService getThreadPool(int numThreads) throws NamingException; - - /** - * Returns the Storer interface to connect to Pink. - * - * @param requestTimeout The request timeout for the Pink client. - * @param retries The number of Finagle retries. - * @return a Storer.ServiceIface to connect to pink. 
- * - */ - public abstract Storer.ServiceIface getStorer(Duration requestTimeout, int retries) - throws NamingException; - - /** - * Returns an EventBusSubscriber - */ - public abstract > EventBusSubscriber createEventBusSubscriber( - Function> process, - Class thriftStructClass, - String eventBusSubscriberId, - int maxConcurrentEvents); - - /** - * Returns a Clock. - */ - public abstract Clock getClock(); - - /** - * Returns a TweetOffensiveEvaluator. - */ - public abstract TweetOffensiveEvaluator getTweetOffensiveEvaluator(); - - /** - * Returns the cluster. - */ - public abstract EarlybirdCluster getEarlybirdCluster() throws NamingException; - - /** - * Returns the current penguin version(s). - */ - public abstract List getPenguinVersions() throws NamingException; - - /** - * Returns updated penguin version(s) depending on decider availability. - */ - public abstract List getCurrentlyEnabledPenguinVersions(); - - /** - * Returns a named entities strato column fetcher. - */ - public abstract NamedEntityFetcher getNamedEntityFetcher(); - - /** - * Returns audio space participants strato column fetcher. - */ - public abstract AudioSpaceParticipantsFetcher getAudioSpaceParticipantsFetcher(); - - /** - * Returns audio space core strato column fetcher. - */ - public abstract AudioSpaceCoreFetcher getAudioSpaceCoreFetcher(); -} diff --git a/src/java/com/twitter/search/ingester/util/jndi/BUILD b/src/java/com/twitter/search/ingester/util/jndi/BUILD deleted file mode 100644 index 4eaf908dd..000000000 --- a/src/java/com/twitter/search/ingester/util/jndi/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -java_library( - sources = ["*.java"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/search/ingester:legacy", - ], -) diff --git a/src/java/com/twitter/search/ingester/util/jndi/JndiUtil.java b/src/java/com/twitter/search/ingester/util/jndi/JndiUtil.java deleted file mode 100644 index 8f50870cf..000000000 --- a/src/java/com/twitter/search/ingester/util/jndi/JndiUtil.java +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.search.ingester.util.jndi; - -import java.util.Hashtable; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NameNotFoundException; - -import org.apache.naming.config.XmlConfigurator; - -public abstract class JndiUtil { - // This is different from the search repo---twitter-naming-devtest.xml is - // checked in as a resource in src/resources/com/twitter/search/ingester. - public static final String DEFAULT_JNDI_XML = - System.getProperty("jndiXml", "/com/twitter/search/ingester/twitter-naming-devtest.xml"); - protected static String jndiXml = DEFAULT_JNDI_XML; - protected static boolean testingMode = false; - - static { - System.setProperty("javax.xml.parsers.SAXParserFactory", - "org.apache.xerces.jaxp.SAXParserFactoryImpl"); - System.setProperty("javax.xml.parsers.DocumentBuilderFactory", - "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"); - } - - public static void loadJNDI() { - loadJNDI(jndiXml); - } - - protected static void loadJNDI(String jndiXmlFile) { - try { - Hashtable props = new Hashtable<>(); - props.put(Context.INITIAL_CONTEXT_FACTORY, "org.apache.naming.java.javaURLContextFactory"); - Context jndiContext = new InitialContext(props); - try { - jndiContext.lookup("java:comp"); - setTestingModeFromJndiContext(jndiContext); - } catch (NameNotFoundException e) { - // No context. 
- XmlConfigurator.loadConfiguration(JndiUtil.class.getResourceAsStream(jndiXmlFile)); - } - } catch (Exception e) { - throw new RuntimeException(String.format("Failed to load JNDI configuration file=%s %s", - jndiXmlFile, e.getMessage()), e); - } - } - - public static void setJndiXml(String jndiXml) { - JndiUtil.jndiXml = jndiXml; - } - - public static String getJndiXml() { - return jndiXml; - } - - public static void setTestingMode(Boolean testingMode) { - JndiUtil.testingMode = testingMode; - } - - public static boolean isTestingMode() { - return testingMode; - } - - private static void setTestingModeFromJndiContext(Context jndiContext) { - try { - setTestingMode((Boolean) jndiContext.lookup("java:comp/env/testingMode")); - } catch (Exception e) { - setTestingMode(false); - } - } -} diff --git a/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py b/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py deleted file mode 100644 index 167756c01..000000000 --- a/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py +++ /dev/null @@ -1,83 +0,0 @@ -# checkstyle: noqa -from twml.feature_config import FeatureConfigBuilder - - -def get_feature_config(data_spec_path, label): - return ( - FeatureConfigBuilder(data_spec_path=data_spec_path, debug=True) - .batch_add_features( - [ - ("ebd.author_specific_score", "A"), - ("ebd.has_diff_lang", "A"), - ("ebd.has_english_tweet_diff_ui_lang", "A"), - ("ebd.has_english_ui_diff_tweet_lang", "A"), - ("ebd.is_self_tweet", "A"), - ("ebd.tweet_age_in_secs", "A"), - ("encoded_tweet_features.favorite_count", "A"), - ("encoded_tweet_features.from_verified_account_flag", "A"), - ("encoded_tweet_features.has_card_flag", "A"), - # ("encoded_tweet_features.has_consumer_video_flag", "A"), - ("encoded_tweet_features.has_image_url_flag", "A"), - ("encoded_tweet_features.has_link_flag", "A"), - ("encoded_tweet_features.has_multiple_hashtags_or_trends_flag", "A"), - # ("encoded_tweet_features.has_multiple_media_flag", "A"), - ("encoded_tweet_features.has_native_image_flag", "A"), - ("encoded_tweet_features.has_news_url_flag", "A"), - ("encoded_tweet_features.has_periscope_flag", "A"), - ("encoded_tweet_features.has_pro_video_flag", "A"), - ("encoded_tweet_features.has_quote_flag", "A"), - ("encoded_tweet_features.has_trend_flag", "A"), - ("encoded_tweet_features.has_video_url_flag", "A"), - ("encoded_tweet_features.has_vine_flag", "A"), - ("encoded_tweet_features.has_visible_link_flag", "A"), - ("encoded_tweet_features.is_offensive_flag", "A"), - ("encoded_tweet_features.is_reply_flag", "A"), - ("encoded_tweet_features.is_retweet_flag", "A"), - ("encoded_tweet_features.is_sensitive_content", "A"), - # ("encoded_tweet_features.is_user_new_flag", "A"), - ("encoded_tweet_features.language", "A"), - ("encoded_tweet_features.link_language", "A"), - ("encoded_tweet_features.num_hashtags", "A"), - ("encoded_tweet_features.num_mentions", "A"), - # ("encoded_tweet_features.profile_is_egg_flag", "A"), - ("encoded_tweet_features.reply_count", "A"), - ("encoded_tweet_features.retweet_count", "A"), - ("encoded_tweet_features.text_score", "A"), - ("encoded_tweet_features.user_reputation", "A"), - ("extended_encoded_tweet_features.embeds_impression_count", "A"), - ("extended_encoded_tweet_features.embeds_impression_count_v2", "A"), - ("extended_encoded_tweet_features.embeds_url_count", "A"), - ("extended_encoded_tweet_features.embeds_url_count_v2", "A"), - 
("extended_encoded_tweet_features.favorite_count_v2", "A"), - ("extended_encoded_tweet_features.label_abusive_hi_rcl_flag", "A"), - ("extended_encoded_tweet_features.label_dup_content_flag", "A"), - ("extended_encoded_tweet_features.label_nsfw_hi_prc_flag", "A"), - ("extended_encoded_tweet_features.label_nsfw_hi_rcl_flag", "A"), - ("extended_encoded_tweet_features.label_spam_flag", "A"), - ("extended_encoded_tweet_features.label_spam_hi_rcl_flag", "A"), - ("extended_encoded_tweet_features.quote_count", "A"), - ("extended_encoded_tweet_features.reply_count_v2", "A"), - ("extended_encoded_tweet_features.retweet_count_v2", "A"), - ("extended_encoded_tweet_features.weighted_favorite_count", "A"), - ("extended_encoded_tweet_features.weighted_quote_count", "A"), - ("extended_encoded_tweet_features.weighted_reply_count", "A"), - ("extended_encoded_tweet_features.weighted_retweet_count", "A"), - ] - ) - .add_labels( - [ - label, # Tensor index: 0 - "recap.engagement.is_clicked", # Tensor index: 1 - "recap.engagement.is_favorited", # Tensor index: 2 - "recap.engagement.is_open_linked", # Tensor index: 3 - "recap.engagement.is_photo_expanded", # Tensor index: 4 - "recap.engagement.is_profile_clicked", # Tensor index: 5 - "recap.engagement.is_replied", # Tensor index: 6 - "recap.engagement.is_retweeted", # Tensor index: 7 - "recap.engagement.is_video_playback_50", # Tensor index: 8 - "timelines.earlybird_score", # Tensor index: 9 - ] - ) - .define_weight("meta.record_weight/type=earlybird") - .build() - ) diff --git a/src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py b/src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py deleted file mode 100644 index 85b7d7f10..000000000 --- a/src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py +++ /dev/null @@ -1,74 +0,0 @@ -# checkstyle: noqa -from twml.feature_config import FeatureConfigBuilder - - -def get_feature_config(data_spec_path, label): - return FeatureConfigBuilder(data_spec_path=data_spec_path, debug=True) \ - .batch_add_features( - [ - ("ebd.has_diff_lang", "A"), - ("ebd.tweet_age_in_secs", "A"), - ("encoded_tweet_features.composer_source_is_camera_flag", "A"), - ("encoded_tweet_features.favorite_count", "A"), - ("encoded_tweet_features.has_card_flag", "A"), - ("encoded_tweet_features.has_image_url_flag", "A"), - ("encoded_tweet_features.has_native_image_flag", "A"), - ("encoded_tweet_features.has_news_url_flag", "A"), - ("encoded_tweet_features.has_periscope_flag", "A"), - ("encoded_tweet_features.has_pro_video_flag", "A"), - ("encoded_tweet_features.has_quote_flag", "A"), - ("encoded_tweet_features.has_video_url_flag", "A"), - ("encoded_tweet_features.has_vine_flag", "A"), - ("encoded_tweet_features.has_visible_link_flag", "A"), - ("encoded_tweet_features.is_sensitive_content", "A"), - ("encoded_tweet_features.is_user_spam_flag", "A"), - ("encoded_tweet_features.link_language", "A"), - ("encoded_tweet_features.num_hashtags", "A"), - ("encoded_tweet_features.num_mentions", "A"), - ("encoded_tweet_features.reply_count", "A"), - ("encoded_tweet_features.retweet_count", "A"), - ("encoded_tweet_features.text_score", "A"), - ("encoded_tweet_features.user_reputation", "A"), - ("extended_encoded_tweet_features.decayed_favorite_count", "A"), - ("extended_encoded_tweet_features.decayed_quote_count", "A"), - ("extended_encoded_tweet_features.decayed_reply_count", "A"), - ("extended_encoded_tweet_features.decayed_retweet_count", "A"), - 
("extended_encoded_tweet_features.embeds_impression_count_v2", "A"), - ("extended_encoded_tweet_features.embeds_url_count_v2", "A"), - ("extended_encoded_tweet_features.fake_favorite_count", "A"), - ("extended_encoded_tweet_features.fake_quote_count", "A"), - ("extended_encoded_tweet_features.fake_reply_count", "A"), - ("extended_encoded_tweet_features.fake_retweet_count", "A"), - ("extended_encoded_tweet_features.favorite_count_v2", "A"), - ("extended_encoded_tweet_features.label_dup_content_flag", "A"), - ("extended_encoded_tweet_features.label_nsfw_hi_prc_flag", "A"), - ("extended_encoded_tweet_features.label_nsfw_hi_rcl_flag", "A"), - ("extended_encoded_tweet_features.label_spam_hi_rcl_flag", "A"), - ("extended_encoded_tweet_features.periscope_exists", "A"), - ("extended_encoded_tweet_features.periscope_has_been_featured", "A"), - ("extended_encoded_tweet_features.periscope_is_currently_featured", "A"), - ("extended_encoded_tweet_features.periscope_is_from_quality_source", "A"), - ("extended_encoded_tweet_features.periscope_is_live", "A"), - ("extended_encoded_tweet_features.quote_count", "A"), - ("extended_encoded_tweet_features.reply_count_v2", "A"), - ("extended_encoded_tweet_features.retweet_count_v2", "A"), - ("extended_encoded_tweet_features.weighted_favorite_count", "A"), - ("extended_encoded_tweet_features.weighted_quote_count", "A"), - ("extended_encoded_tweet_features.weighted_reply_count", "A"), - ("extended_encoded_tweet_features.weighted_retweet_count", "A"), - ("timelines.earlybird.visible_token_ratio", "A") - ] - ).add_labels([ - label, # Tensor index: 0 - "itl.engagement.is_clicked", # Tensor index: 1 - "itl.engagement.is_favorited", # Tensor index: 2 - "itl.engagement.is_open_linked", # Tensor index: 3 - "itl.engagement.is_photo_expanded", # Tensor index: 4 - "itl.engagement.is_profile_clicked", # Tensor index: 5 - "itl.engagement.is_replied", # Tensor index: 6 - "itl.engagement.is_retweeted", # Tensor index: 7 - "itl.engagement.is_video_playback_50", # Tensor index: 8 - "timelines.earlybird_score", # Tensor index: 9 - ]) \ - .define_weight("meta.record_weight/type=earlybird") \ - .build() diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/BUILD b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/BUILD deleted file mode 100644 index 0e889392e..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/BUILD +++ /dev/null @@ -1,23 +0,0 @@ -python3_library( - name = "libs_py3", - sources = ["*.py"], - tags = ["no-mypy"], - dependencies = [ - "src/python/twitter/deepbird/io", - "src/python/twitter/deepbird/projects/timelines/configs:all_configs", - "twml:twml-nodeps", - ], -) - -python37_binary( - name = "model_earlybird", - source = "train.py", - tags = ["no-mypy"], - dependencies = [ - ":libs_py3", - "3rdparty/python/_closures/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird:model_earlybird", - "src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly:libs_py3", - "src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model:libs_py3", - "twml", - ], -) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md deleted file mode 100644 index 3eb9e6c74..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md +++ /dev/null @@ -1,63 +0,0 @@ -# Earlybird Light Ranker - 
-*Note: the light ranker is an old part of the stack which we are currently in the process of replacing. -The current model was last trained several years ago, and uses some very strange features. -We are working on training a new model, and eventually rebuilding this part of the stack entirely.* - -The Earlybird light ranker is a logistic regression model which predicts the likelihood that a user will engage with a -tweet. -It is intended to be a simplified version of the heavy ranker that can run over a larger number of tweets. - -There are currently 2 main light ranker models in use: one for ranking in-network tweets (`recap_earlybird`), and -another for -out-of-network (UTEG) tweets (`rectweet_earlybird`). Both models are trained using the `train.py` script -included in this directory. They differ mainly in the set of features -used by each model. -The in-network model uses -the `src/python/twitter/deepbird/projects/timelines/configs/recap/feature_config.py` file to define the -feature configuration, while the -out-of-network model uses `src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py`. - -The `train.py` script is essentially a series of hooks provided for Twitter's `twml` framework (included under `twml/`) to execute. - -### Features - -The light ranker feature pipeline is as follows: -![earlybird_features.png](earlybird_features.png) - -Some of these components are explained below: - -- Index Ingester: an indexing pipeline that handles tweets as they are generated. This is the main input of - Earlybird; it produces Tweet Data (the basic information about the tweet: the text, the URLs, media entities, facets, - etc.) and Static Features (features that can be computed directly from a tweet at ingestion time, like whether it has a URL, has - Cards, has quotes, etc.). All information computed here is stored in the index and flushed as each realtime index segment - becomes full. It is loaded back from disk when Earlybird restarts. Note that the features may be computed in a - non-trivial way (like deciding the value of hasUrl); they can be computed and combined from more "raw" - information in the tweet and from other services. -- Signal Ingester: the ingester for Realtime Features, per-tweet features that can change after the tweet has been - indexed, mostly social engagements like retweetCount, favCount, replyCount, etc., along with some (future) spam signals - that are computed from later activity. These are collected and computed in a Heron topology by processing multiple - event streams, and can be extended to support more features. -- User Table Features: another set of per-user features. They come from the User Table Updater, a different input that - processes a stream written by our user service. It is used to store sparse realtime user - information. These per-user features are propagated to the tweet being scored by - looking up the author of the tweet. -- Search Context Features: basically information about the current searcher, like their UI language, the languages they - produce and consume, and the current time (implied). They are combined with Tweet Data to compute some of the - features used in scoring. - -The scoring function in Earlybird uses both static and realtime features. 
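To make the shape of this model concrete, here is a minimal, hypothetical sketch of logistic-regression scoring over a handful of features. The feature names, weights, and bias below are invented purely for illustration and are not the trained production model; the real models are produced by `train.py`, and the legacy lolly model weights are applied by the parser/scorer code under `lolly/` later in this diff.

```python
import math

# Purely illustrative weights and bias -- not the trained production model.
BIAS = -1.0
WEIGHTS = {
    "is_retweet": -0.2,         # static feature
    "has_link": 0.1,            # static feature
    "text_quality_score": 0.8,  # static feature (e.g. a text quality score)
    "fav_count": 0.05,          # realtime engagement feature
    "reply_count": 0.03,        # realtime engagement feature
}

def light_ranker_score(features):
    """Logistic regression: sigmoid of a bias plus a weighted sum of feature values."""
    z = BIAS + sum(WEIGHTS[name] * value
                   for name, value in features.items() if name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

# A tweet with a link, decent text quality, and a little engagement.
print(light_ranker_score(
    {"has_link": 1, "text_quality_score": 0.6, "fav_count": 3, "reply_count": 1}))
```

The legacy lolly scorer included further down in this diff (`lolly/scorer.py`) additionally distinguishes binary features from discretized (bucketed) continuous features when summing the weights.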
Examples of static features used are: - -- Whether the tweet is a retweet -- Whether the tweet contains a link -- Whether this tweet has any trend words at ingestion time -- Whether the tweet is a reply -- A score for the static quality of the text, computed in TweetTextScorer.java in the Ingester. Based on the factors - such as offensiveness, content entropy, "shout" score, length, and readability. -- tweepcred, see top-level README.md - -Examples of realtime features used are: - -- Number of tweet likes/replies/retweets -- pToxicity and pBlock scores provided by health models diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/__init__.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/constants.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/constants.py deleted file mode 100644 index 57178b92c..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/constants.py +++ /dev/null @@ -1,21 +0,0 @@ -# checkstyle: noqa - -INDEX_BY_LABEL = { - "is_clicked": 1, - "is_favorited": 2, - "is_open_linked": 3, - "is_photo_expanded": 4, - "is_profile_clicked": 5, - "is_replied": 6, - "is_retweeted": 7, - "is_video_playback_50": 8 -} - -TARGET_LABEL_IDX = 0 -EB_SCORE_IDX = 9 - -LABEL_NAMES = [label_name for label_name, _ in sorted(INDEX_BY_LABEL.items(), key=lambda item: item[1])] - -PREDICTED_CLASSES = \ - ["tf_target"] + ["tf_" + label_name for label_name in LABEL_NAMES] + ["tf_timelines.earlybird_score"] + \ - ["lolly_target"] + ["lolly_" + label_name for label_name in LABEL_NAMES] + ["lolly_timelines.earlybird_score"] diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/earlybird_features.png b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/earlybird_features.png deleted file mode 100644 index abba44ef1..000000000 Binary files a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/earlybird_features.png and /dev/null differ diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/example_weights.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/example_weights.py deleted file mode 100644 index cf0c38ecc..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/example_weights.py +++ /dev/null @@ -1,43 +0,0 @@ -# checkstyle: noqa -import tensorflow.compat.v1 as tf -from .constants import INDEX_BY_LABEL, LABEL_NAMES - -# TODO: Read these from command line arguments, since they specify the existing example weights in the input data. -DEFAULT_WEIGHT_BY_LABEL = { - "is_clicked": 0.3, - "is_favorited": 1.0, - "is_open_linked": 0.1, - "is_photo_expanded": 0.03, - "is_profile_clicked": 1.0, - "is_replied": 9.0, - "is_retweeted": 1.0, - "is_video_playback_50": 0.01 -} - -def add_weight_arguments(parser): - for label_name in LABEL_NAMES: - parser.add_argument( - _make_weight_cli_argument_name(label_name), - type=float, - default=DEFAULT_WEIGHT_BY_LABEL[label_name], - dest=_make_weight_param_name(label_name) - ) - -def make_weights_tensor(input_weights, label, params): - ''' - Replaces the weights for each positive engagement and keeps the input weights for negative examples. 
- ''' - weight_tensors = [input_weights] - for label_name in LABEL_NAMES: - index, default_weight = INDEX_BY_LABEL[label_name], DEFAULT_WEIGHT_BY_LABEL[label_name] - weight_param_name =_make_weight_param_name(label_name) - weight_tensors.append( - tf.reshape(tf.math.scalar_mul(getattr(params, weight_param_name) - default_weight, label[:, index]), [-1, 1]) - ) - return tf.math.accumulate_n(weight_tensors) - -def _make_weight_cli_argument_name(label_name): - return f"--weight.{label_name}" - -def _make_weight_param_name(label_name): - return f"weight_{label_name}" diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/BUILD b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/BUILD deleted file mode 100644 index a834ba69e..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/BUILD +++ /dev/null @@ -1,18 +0,0 @@ -python3_library( - name = "libs_py3", - sources = ["*.py"], - dependencies = [ - "src/python/twitter/deepbird/io", - "twml:twml-nodeps", - ], -) - -python37_binary( - name = "score", - source = "score.py", - dependencies = [ - ":libs_py3", - "3rdparty/python/_closures/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly:score", - "twml", - ], -) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/__init__.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/data_helpers.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/data_helpers.py deleted file mode 100644 index 723dd626c..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/data_helpers.py +++ /dev/null @@ -1,23 +0,0 @@ -# checkstyle: noqa -import tensorflow.compat.v1 as tf -from ..constants import EB_SCORE_IDX - -# The rationale behind this logic is available at TQ-9678. -def get_lolly_logits(labels): - ''' - :param labels: tf.Tensor of shape (batch size, num labels) with labels as specified by the feature config. - :return: tf.Tensor of shape (batch size) with the extracted lolly logits. - ''' - eb_lolly_scores = get_lolly_scores(labels) - inverse_eb_lolly_scores = tf.math.subtract(1.0, eb_lolly_scores) - lolly_activations = tf.math.subtract(tf.math.log(eb_lolly_scores), tf.math.log(inverse_eb_lolly_scores)) - return lolly_activations - -def get_lolly_scores(labels): - ''' - :param labels: tf.Tensor of shape (batch size, num labels) with labels as specified by the feature config. - :return: tf.Tensor of shape (batch size) with the extracted lolly scores. 
- ''' - logged_eb_lolly_scores = tf.reshape(labels[:, EB_SCORE_IDX], (-1, 1)) - eb_lolly_scores = tf.truediv(logged_eb_lolly_scores, 100.0) - return eb_lolly_scores diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/parsers.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/parsers.py deleted file mode 100644 index cb39c67a7..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/parsers.py +++ /dev/null @@ -1,145 +0,0 @@ -import re - -from twitter.deepbird.io.util import _get_feature_id - - -class Parser(object): - def parse(self, line): - match = re.search(self.pattern(), line) - if match: - return self._parse_match(match) - return None - - def pattern(self): - raise NotImplementedError - - def _parse_match(self, match): - raise NotImplementedError - - -class BiasParser(Parser): - ''' - Parses the bias feature available in lolly model tsv files. - ''' - - def pattern(self): - ''' - Matches lines like: - unified_engagement bias -0.935945 - :return: a RegEx that extracts feature weight. - ''' - return r"\t(bias)\t([^\s]+)" - - def _parse_match(self, match): - return float(match.group(2)) - - -class BinaryFeatureParser(Parser): - ''' - Parses binary features available in lolly model tsv files. - ''' - - def pattern(self): - ''' - Matches lines like: - unified_engagement encoded_tweet_features.is_user_spam_flag -0.181130 - :return: a RegEx that extracts feature name and weight. - ''' - return r"\t([\w\.]+)\t([^\s]+)" - - def _parse_match(self, match): - return (match.group(1), float(match.group(2))) - - -class DiscretizedFeatureParser(Parser): - ''' - Parses discretized features available in lolly model tsv files. - ''' - - def pattern(self): - ''' - Matches lines like: - unified_engagement encoded_tweet_features.user_reputation.dz/dz_model=mdl/dz_range=1.000000e+00_2.000000e+00 0.031004 - :return: a RegEx that extracts feature name, bin boundaries and weight. 
- ''' - return r"([\w\.]+)\.dz\/dz_model=mdl\/dz_range=([^\s]+)\t([^\s]+)" - - def _parse_match(self, match): - left_bin_side, right_bin_side = [float(number) for number in match.group(2).split("_")] - return ( - match.group(1), - left_bin_side, - right_bin_side, - float(match.group(3)) - ) - - -class LollyModelFeaturesParser(Parser): - def __init__(self, bias_parser=BiasParser(), binary_feature_parser=BinaryFeatureParser(), discretized_feature_parser=DiscretizedFeatureParser()): - self._bias_parser = bias_parser - self._binary_feature_parser = binary_feature_parser - self._discretized_feature_parser = discretized_feature_parser - - def parse(self, lolly_model_reader): - parsed_features = { - "bias": None, - "binary": {}, - "discretized": {} - } - def process_line_fn(line): - bias_parser_result = self._bias_parser.parse(line) - if bias_parser_result: - parsed_features["bias"] = bias_parser_result - return - - binary_feature_parser_result = self._binary_feature_parser.parse(line) - if binary_feature_parser_result: - name, value = binary_feature_parser_result - parsed_features["binary"][name] = value - return - - discretized_feature_parser_result = self._discretized_feature_parser.parse(line) - if discretized_feature_parser_result: - name, left_bin, right_bin, weight = discretized_feature_parser_result - discretized_features = parsed_features["discretized"] - if name not in discretized_features: - discretized_features[name] = [] - discretized_features[name].append((left_bin, right_bin, weight)) - - lolly_model_reader.read(process_line_fn) - - return parsed_features - - -class DBv2DataExampleParser(Parser): - ''' - Parses data records printed by the DBv2 train.py build_graph function. - Format: [[dbv2 logit]][[logged lolly logit]][[space separated feature ids]][[space separated feature values]] - ''' - - def __init__(self, lolly_model_reader, lolly_model_features_parser=LollyModelFeaturesParser()): - self.features = lolly_model_features_parser.parse(lolly_model_reader) - self.feature_name_by_dbv2_id = {} - - for feature_name in list(self.features["binary"].keys()) + list(self.features["discretized"].keys()): - self.feature_name_by_dbv2_id[str(_get_feature_id(feature_name))] = feature_name - - def pattern(self): - ''' - :return: a RegEx that extracts dbv2 logit, logged lolly logit, feature ids and feature values. - ''' - return r"\[\[([\w\.\-]+)\]\]\[\[([\w\.\-]+)\]\]\[\[([\w\.\- ]+)\]\]\[\[([\w\. 
]+)\]\]" - - def _parse_match(self, match): - feature_ids = match.group(3).split(" ") - feature_values = match.group(4).split(" ") - - value_by_feature_name = {} - for index in range(len(feature_ids)): - feature_id = feature_ids[index] - if feature_id not in self.feature_name_by_dbv2_id: - print("Missing feature with id: " + str(feature_id)) - continue - value_by_feature_name[self.feature_name_by_dbv2_id[feature_id]] = float(feature_values[index]) - - return value_by_feature_name diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/reader.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/reader.py deleted file mode 100644 index ab33ee4e7..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/reader.py +++ /dev/null @@ -1,8 +0,0 @@ -class LollyModelReader(object): - def __init__(self, lolly_model_file_path): - self._lolly_model_file_path = lolly_model_file_path - - def read(self, process_line_fn): - with open(self._lolly_model_file_path, "r") as file: - for line in file: - process_line_fn(line) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/score.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/score.py deleted file mode 100644 index 5692616c2..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/score.py +++ /dev/null @@ -1,13 +0,0 @@ -import sys - -from .parsers import DBv2DataExampleParser -from .reader import LollyModelReader -from .scorer import LollyModelScorer - - -if __name__ == "__main__": - lolly_model_reader = LollyModelReader(lolly_model_file_path=sys.argv[1]) - lolly_model_scorer = LollyModelScorer(data_example_parser=DBv2DataExampleParser(lolly_model_reader)) - - score = lolly_model_scorer.score(data_example=sys.argv[2]) - print(score) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/scorer.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/scorer.py deleted file mode 100644 index 621c43388..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/scorer.py +++ /dev/null @@ -1,37 +0,0 @@ -class LollyModelScorer(object): - def __init__(self, data_example_parser): - self._data_example_parser = data_example_parser - - def score(self, data_example): - value_by_feature_name = self._data_example_parser.parse(data_example) - features = self._data_example_parser.features - return self._score(value_by_feature_name, features) - - def _score(self, value_by_feature_name, features): - score = features["bias"] - score += self._score_binary_features(features["binary"], value_by_feature_name) - score += self._score_discretized_features(features["discretized"], value_by_feature_name) - return score - - def _score_binary_features(self, binary_features, value_by_feature_name): - score = 0.0 - for binary_feature_name, binary_feature_weight in binary_features.items(): - if binary_feature_name in value_by_feature_name: - score += binary_feature_weight - return score - - def _score_discretized_features(self, discretized_features, value_by_feature_name): - score = 0.0 - for discretized_feature_name, buckets in discretized_features.items(): - if discretized_feature_name in value_by_feature_name: - feature_value = value_by_feature_name[discretized_feature_name] - score += self._find_matching_bucket_weight(buckets, feature_value) - return score - - def 
_find_matching_bucket_weight(self, buckets, feature_value): - for left_side, right_side, weight in buckets: - # The Earlybird Lolly prediction engine discretizer bin membership interval is [a, b) - if feature_value >= left_side and feature_value < right_side: - return weight - - raise LookupError("Couldn't find a matching bucket for the given feature value.") diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/tf_model_initializer_builder.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/tf_model_initializer_builder.py deleted file mode 100644 index 2d0342551..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/lolly/tf_model_initializer_builder.py +++ /dev/null @@ -1,91 +0,0 @@ -from .parsers import LollyModelFeaturesParser - - -class TFModelInitializerBuilder: - - def __init__(self, model_features_parser=LollyModelFeaturesParser()): - self._model_features_parser = model_features_parser - - def build(self, lolly_model_reader): - ''' - :param lolly_model_reader: LollyModelReader instance - :return: tf_model_initializer dictionary of the following format: - { - "features": { - "bias": 0.0, - "binary": { - # (feature name : feature weight) pairs - "feature_name_1": 0.0, - ... - "feature_nameN": 0.0 - }, - "discretized": { - # (feature name : index aligned lists of bin_boundaries and weights - "feature_name_1": { - "bin_boundaries": [1, ..., inf], - "weights": [0.0, ..., 0.0] - } - ... - "feature_name_K": { - "bin_boundaries": [1, ..., inf], - "weights": [0.0, ..., 0.0] - } - } - } - } - ''' - tf_model_initializer = { - "features": {} - } - - features = self._model_features_parser.parse(lolly_model_reader) - tf_model_initializer["features"]["bias"] = features["bias"] - self._set_discretized_features(features["discretized"], tf_model_initializer) - - self._dedup_binary_features(features["binary"], features["discretized"]) - tf_model_initializer["features"]["binary"] = features["binary"] - - return tf_model_initializer - - def _set_discretized_features(self, discretized_features, tf_model_initializer): - if len(discretized_features) == 0: - return - - num_bins = max([len(bins) for bins in discretized_features.values()]) - - bin_boundaries_and_weights = {} - for feature_name in discretized_features: - bin_boundaries_and_weights[feature_name] = self._extract_bin_boundaries_and_weights( - discretized_features[feature_name], num_bins) - - tf_model_initializer["features"]["discretized"] = bin_boundaries_and_weights - - def _dedup_binary_features(self, binary_features, discretized_features): - [binary_features.pop(feature_name) for feature_name in discretized_features] - - def _extract_bin_boundaries_and_weights(self, discretized_feature_buckets, num_bins): - bin_boundary_weight_pairs = [] - - for bucket in discretized_feature_buckets: - bin_boundary_weight_pairs.append([bucket[0], bucket[2]]) - - # The default DBv2 HashingDiscretizer bin membership interval is (a, b] - # - # The Earlybird Lolly prediction engine discretizer bin membership interval is [a, b) - # - # Thus, convert (a, b] to [a, b) by inverting the bin boundaries. 
- for bin_boundary_weight_pair in bin_boundary_weight_pairs: - if bin_boundary_weight_pair[0] < float("inf"): - bin_boundary_weight_pair[0] *= -1 - - while len(bin_boundary_weight_pairs) < num_bins: - bin_boundary_weight_pairs.append([float("inf"), float(0)]) - - bin_boundary_weight_pairs.sort(key=lambda bin_boundary_weight_pair: bin_boundary_weight_pair[0]) - - bin_boundaries, weights = list(zip(*bin_boundary_weight_pairs)) - - return { - "bin_boundaries": bin_boundaries, - "weights": weights - } diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/metrics.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/metrics.py deleted file mode 100644 index 6919914f8..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/metrics.py +++ /dev/null @@ -1,120 +0,0 @@ -# checkstyle: noqa -import tensorflow.compat.v1 as tf -from collections import OrderedDict -from .constants import EB_SCORE_IDX -from .lolly.data_helpers import get_lolly_scores - -import twml - -def get_multi_binary_class_metric_fn(metrics, classes=None, class_dim=1): - """ - This function was copied from twml/metrics.py with the following adjustments: - - Override example weights with the ones set in graph_output. - - Tile labels in order to support per engagement metrics for both TF and Lolly scores. - - Add lolly_tf_score_MSE metric. - Note: All custom lines have a comment that starts with 'Added' - """ - # pylint: disable=invalid-name,dict-keys-not-iterating - if metrics is None: - # remove expensive metrics by default for faster eval - metrics = list(twml.metrics.SUPPORTED_BINARY_CLASS_METRICS.keys()) - metrics.remove('pr_curve') - - def get_eval_metric_ops(graph_output, labels, weights): - """ - graph_output: - dict that is returned by build_graph given input features. - labels: - target labels associated to batch. - weights: - weights of the samples.. - """ - - # Added to support the example weights overriding. - weights = graph_output["weights"] - # Added to support per engagement metrics for both TF and Lolly scores. - labels = tf.tile(labels, [1, 2]) - - eval_metric_ops = OrderedDict() - - preds = graph_output['output'] - - threshold = graph_output['threshold'] if 'threshold' in graph_output else 0.5 - - hard_preds = graph_output.get('hard_output') - if not hard_preds: - hard_preds = tf.greater_equal(preds, threshold) - - shape = labels.get_shape() - - # basic sanity check: multi_metric dimension must exist - assert len(shape) > class_dim, "Dimension specified by class_dim does not exist." - - num_labels = shape[class_dim] - # If we are doing multi-class / multi-label metric, the number of classes / labels must - # be know at graph construction time. This dimension cannot have size None. - assert num_labels is not None, "The multi-metric dimension cannot be None." - assert classes is None or len(classes) == num_labels, ( - "Number of classes must match the number of labels") - - weights_shape = weights.get_shape() if weights is not None else None - if weights_shape is None: - num_weights = None - elif len(weights_shape) > 1: - num_weights = weights_shape[class_dim] - else: - num_weights = 1 - - for i in range(num_labels): - - # add metrics to eval_metric_ops dict - for metric_name in metrics: - metric_name = metric_name.lower() # metric name are case insensitive. - - class_metric_name = metric_name + "_" + (classes[i] if classes is not None else str(i)) - - if class_metric_name in eval_metric_ops: - # avoid adding duplicate metrics. 
- continue - - class_labels = tf.gather(labels, indices=[i], axis=class_dim) - class_preds = tf.gather(preds, indices=[i], axis=class_dim) - class_hard_preds = tf.gather(hard_preds, indices=[i], axis=class_dim) - - if num_weights is None: - class_weights = None - elif num_weights == num_labels: - class_weights = tf.gather(weights, indices=[i], axis=class_dim) - elif num_weights == 1: - class_weights = weights - else: - raise ValueError("num_weights (%d) and num_labels (%d) do not match" - % (num_weights, num_labels)) - - metric_factory, requires_threshold = twml.metrics.SUPPORTED_BINARY_CLASS_METRICS.get(metric_name) - if metric_factory: - value_op, update_op = metric_factory( - labels=class_labels, - predictions=(class_hard_preds if requires_threshold else class_preds), - weights=class_weights, name=class_metric_name) - eval_metric_ops[class_metric_name] = (value_op, update_op) - else: - raise ValueError('Cannot find the metric named ' + metric_name) - - # Added to compare TF and Lolly scores. - eval_metric_ops["lolly_tf_score_MSE"] = get_mse(graph_output["output"], labels) - - return eval_metric_ops - - return get_eval_metric_ops - - -def get_mse(predictions, labels): - lolly_scores = get_lolly_scores(labels) - tf_scores = predictions[:, EB_SCORE_IDX] - squared_lolly_tf_score_diff = tf.square(tf.subtract(tf_scores, lolly_scores)) - - value_op = tf.reduce_mean(squared_lolly_tf_score_diff, name="value_op") - update_op = tf.reduce_mean(squared_lolly_tf_score_diff, name="update_op") - - return value_op, update_op diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/BUILD b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/BUILD deleted file mode 100644 index d8cd264ad..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/BUILD +++ /dev/null @@ -1,8 +0,0 @@ -python3_library( - name = "libs_py3", - sources = ["*.py"], - dependencies = [ - "src/python/twitter/deepbird/io", - "twml:twml-nodeps", - ], -) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/__init__.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/discretizer_builder.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/discretizer_builder.py deleted file mode 100644 index 82c31bde0..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/discretizer_builder.py +++ /dev/null @@ -1,62 +0,0 @@ -from .hashing_utils import make_feature_id - -from twml.contrib.layers.hashing_discretizer import HashingDiscretizer -import numpy as np - - -class TFModelDiscretizerBuilder(object): - def __init__(self, num_bits): - self.num_bits = num_bits - - def build(self, tf_model_initializer): - ''' - :param tf_model_initializer: dictionary of the following format: - { - "features": { - "bias": 0.0, - "binary": { - # (feature name : feature weight) pairs - "feature_name_1": 0.0, - ... - "feature_nameN": 0.0 - }, - "discretized": { - # (feature name : index aligned lists of bin_boundaries and weights - "feature_name_1": { - "bin_boundaries": [1, ..., inf], - "weights": [0.0, ..., 0.0] - } - ... - "feature_name_K": { - "bin_boundaries": [1, ..., inf], - "weights": [0.0, ..., 0.0] - } - } - } - } - :return: a HashingDiscretizer instance. 
- ''' - discretized_features = tf_model_initializer["features"]["discretized"] - - max_bins = 0 - - feature_ids = [] - bin_vals = [] - for feature_name in discretized_features: - bin_boundaries = discretized_features[feature_name]["bin_boundaries"] - feature_id = make_feature_id(feature_name, self.num_bits) - feature_ids.append(feature_id) - np_bin_boundaries = [np.float(bin_boundary) for bin_boundary in bin_boundaries] - bin_vals.append(np_bin_boundaries) - - max_bins = max(max_bins, len(np_bin_boundaries)) - - feature_ids_np = np.array(feature_ids) - bin_vals_np = np.array(bin_vals).flatten() - - return HashingDiscretizer( - feature_ids=feature_ids_np, - bin_vals=bin_vals_np, - n_bin=max_bins, - out_bits=self.num_bits - ) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/hashing_utils.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/hashing_utils.py deleted file mode 100644 index 2c57f8d63..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/hashing_utils.py +++ /dev/null @@ -1,29 +0,0 @@ -from twitter.deepbird.io.util import _get_feature_id - -import numpy as np - - -def numpy_hashing_uniform(the_id, bin_idx, output_bits): - """ - integer_multiplicative_hashing - This is a reimplementation, for testing purposes, of the - c++ version found in hashing_discretizer_impl.cpp - """ - hashing_constant = 2654435761 - N = 32 - with np.errstate(over='ignore'): - the_id *= hashing_constant - the_id += bin_idx - the_id *= hashing_constant - the_id >>= N - output_bits - the_id &= (1 << output_bits) - 1 - return the_id - - -def make_feature_id(name, num_bits): - feature_id = _get_feature_id(name) - return np.int64(limit_bits(feature_id, num_bits)) - - -def limit_bits(value, num_bits): - return value & ((2 ** num_bits) - 1) diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/weights_initializer_builder.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/weights_initializer_builder.py deleted file mode 100644 index 63491ea38..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/tf_model/weights_initializer_builder.py +++ /dev/null @@ -1,34 +0,0 @@ -from .hashing_utils import make_feature_id, numpy_hashing_uniform - -import numpy as np -import tensorflow.compat.v1 as tf -import twml - - -class TFModelWeightsInitializerBuilder(object): - def __init__(self, num_bits): - self.num_bits = num_bits - - def build(self, tf_model_initializer): - ''' - :return: (bias_initializer, weight_initializer) - ''' - initial_weights = np.zeros((2 ** self.num_bits, 1)) - - features = tf_model_initializer["features"] - self._set_binary_feature_weights(initial_weights, features["binary"]) - self._set_discretized_feature_weights(initial_weights, features["discretized"]) - - return tf.constant_initializer(features["bias"]), twml.contrib.initializers.PartitionConstant(initial_weights) - - def _set_binary_feature_weights(self, initial_weights, binary_features): - for feature_name, weight in binary_features.items(): - feature_id = make_feature_id(feature_name, self.num_bits) - initial_weights[feature_id][0] = weight - - def _set_discretized_feature_weights(self, initial_weights, discretized_features): - for feature_name, discretized_feature in discretized_features.items(): - feature_id = make_feature_id(feature_name, self.num_bits) - for bin_idx, weight in enumerate(discretized_feature["weights"]): 
- final_bucket_id = numpy_hashing_uniform(feature_id, bin_idx, self.num_bits) - initial_weights[final_bucket_id][0] = weight diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/train.py b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/train.py deleted file mode 100644 index 6ef181f5f..000000000 --- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/train.py +++ /dev/null @@ -1,212 +0,0 @@ -# checkstyle: noqa -import tensorflow.compat.v1 as tf -from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn -from tensorflow.python.framework import dtypes -from tensorflow.python.ops import array_ops -import tensorflow_hub as hub - -from datetime import datetime -from tensorflow.compat.v1 import logging -from twitter.deepbird.projects.timelines.configs import all_configs -from twml.trainers import DataRecordTrainer -from twml.contrib.calibrators.common_calibrators import build_percentile_discretizer_graph -from twml.contrib.calibrators.common_calibrators import calibrate_discretizer_and_export -from .metrics import get_multi_binary_class_metric_fn -from .constants import TARGET_LABEL_IDX, PREDICTED_CLASSES -from .example_weights import add_weight_arguments, make_weights_tensor -from .lolly.data_helpers import get_lolly_logits -from .lolly.tf_model_initializer_builder import TFModelInitializerBuilder -from .lolly.reader import LollyModelReader -from .tf_model.discretizer_builder import TFModelDiscretizerBuilder -from .tf_model.weights_initializer_builder import TFModelWeightsInitializerBuilder - -import twml - -def get_feature_values(features_values, params): - if params.lolly_model_tsv: - # The default DBv2 HashingDiscretizer bin membership interval is (a, b] - # - # The Earlybird Lolly prediction engine discretizer bin membership interval is [a, b) - # - # TFModelInitializerBuilder converts (a, b] to [a, b) by inverting the bin boundaries. - # - # Thus, invert the feature values, so that HashingDiscretizer can to find the correct bucket. 
- return tf.multiply(features_values, -1.0) - else: - return features_values - -def build_graph(features, label, mode, params, config=None): - weights = None - if "weights" in features: - weights = make_weights_tensor(features["weights"], label, params) - - num_bits = params.input_size_bits - - if mode == "infer": - indices = twml.limit_bits(features["input_sparse_tensor_indices"], num_bits) - dense_shape = tf.stack([features["input_sparse_tensor_shape"][0], 1 << num_bits]) - sparse_tf = tf.SparseTensor( - indices=indices, - values=get_feature_values(features["input_sparse_tensor_values"], params), - dense_shape=dense_shape - ) - else: - features["values"] = get_feature_values(features["values"], params) - sparse_tf = twml.util.convert_to_sparse(features, num_bits) - - if params.lolly_model_tsv: - tf_model_initializer = TFModelInitializerBuilder().build(LollyModelReader(params.lolly_model_tsv)) - bias_initializer, weight_initializer = TFModelWeightsInitializerBuilder(num_bits).build(tf_model_initializer) - discretizer = TFModelDiscretizerBuilder(num_bits).build(tf_model_initializer) - else: - discretizer = hub.Module(params.discretizer_save_dir) - bias_initializer, weight_initializer = None, None - - input_sparse = discretizer(sparse_tf, signature="hashing_discretizer_calibrator") - - logits = twml.layers.full_sparse( - inputs=input_sparse, - output_size=1, - bias_initializer=bias_initializer, - weight_initializer=weight_initializer, - use_sparse_grads=(mode == "train"), - use_binary_values=True, - name="full_sparse_1" - ) - - loss = None - - if mode != "infer": - lolly_activations = get_lolly_logits(label) - - if opt.print_data_examples: - logits = print_data_example(logits, lolly_activations, features) - - if params.replicate_lolly: - loss = tf.reduce_mean(tf.math.squared_difference(logits, lolly_activations)) - else: - batch_size = tf.shape(label)[0] - target_label = tf.reshape(tensor=label[:, TARGET_LABEL_IDX], shape=(batch_size, 1)) - loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=target_label, logits=logits) - loss = twml.util.weighted_average(loss, weights) - - num_labels = tf.shape(label)[1] - eb_scores = tf.tile(lolly_activations, [1, num_labels]) - logits = tf.tile(logits, [1, num_labels]) - logits = tf.concat([logits, eb_scores], axis=1) - - output = tf.nn.sigmoid(logits) - - return {"output": output, "loss": loss, "weights": weights} - -def print_data_example(logits, lolly_activations, features): - return tf.Print( - logits, - [logits, lolly_activations, tf.reshape(features['keys'], (1, -1)), tf.reshape(tf.multiply(features['values'], -1.0), (1, -1))], - message="DATA EXAMPLE = ", - summarize=10000 - ) - -def earlybird_output_fn(graph_output): - export_outputs = { - tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: - tf.estimator.export.PredictOutput( - {"prediction": tf.identity(graph_output["output"], name="output_scores")} - ) - } - return export_outputs - -if __name__ == "__main__": - parser = DataRecordTrainer.add_parser_arguments() - - parser = twml.contrib.calibrators.add_discretizer_arguments(parser) - - parser.add_argument("--label", type=str, help="label for the engagement") - parser.add_argument("--model.use_existing_discretizer", action="store_true", - dest="model_use_existing_discretizer", - help="Load a pre-trained calibration or train a new one") - parser.add_argument("--input_size_bits", type=int) - parser.add_argument("--export_module_name", type=str, default="base_mlp", dest="export_module_name") - 
parser.add_argument("--feature_config", type=str) - parser.add_argument("--replicate_lolly", type=bool, default=False, dest="replicate_lolly", - help="Train a regression model with MSE loss and the logged Earlybird score as a label") - parser.add_argument("--lolly_model_tsv", type=str, required=False, dest="lolly_model_tsv", - help="Initialize with weights and discretizer bins available in the given Lolly model tsv file" - "No discretizer gets trained or loaded if set.") - parser.add_argument("--print_data_examples", type=bool, default=False, dest="print_data_examples", - help="Prints 'DATA EXAMPLE = [[tf logit]][[logged lolly logit]][[feature ids][feature values]]'") - add_weight_arguments(parser) - - opt = parser.parse_args() - - feature_config_module = all_configs.select_feature_config(opt.feature_config) - - feature_config = feature_config_module.get_feature_config(data_spec_path=opt.data_spec, label=opt.label) - - parse_fn = twml.parsers.get_sparse_parse_fn( - feature_config, - keep_fields=("ids", "keys", "values", "batch_size", "total_size", "codes")) - - if not opt.lolly_model_tsv: - if opt.model_use_existing_discretizer: - logging.info("Skipping discretizer calibration [model.use_existing_discretizer=True]") - logging.info(f"Using calibration at {opt.discretizer_save_dir}") - else: - logging.info("Calibrating new discretizer [model.use_existing_discretizer=False]") - calibrator = twml.contrib.calibrators.HashingDiscretizerCalibrator( - opt.discretizer_num_bins, - opt.discretizer_output_size_bits - ) - calibrate_discretizer_and_export(name="recap_earlybird_hashing_discretizer", - params=opt, - calibrator=calibrator, - build_graph_fn=build_percentile_discretizer_graph, - feature_config=feature_config) - - trainer = DataRecordTrainer( - name="earlybird", - params=opt, - build_graph_fn=build_graph, - save_dir=opt.save_dir, - feature_config=feature_config, - metric_fn=get_multi_binary_class_metric_fn( - metrics=["roc_auc"], - classes=PREDICTED_CLASSES - ), - warm_start_from=None - ) - - train_input_fn = trainer.get_train_input_fn(parse_fn=parse_fn) - eval_input_fn = trainer.get_eval_input_fn(parse_fn=parse_fn) - - logging.info("Training and Evaluation ...") - trainingStartTime = datetime.now() - trainer.train_and_evaluate(train_input_fn=train_input_fn, eval_input_fn=eval_input_fn) - trainingEndTime = datetime.now() - logging.info("Training and Evaluation time: " + str(trainingEndTime - trainingStartTime)) - - if trainer._estimator.config.is_chief: - serving_input_in_earlybird = { - "input_sparse_tensor_indices": array_ops.placeholder( - name="input_sparse_tensor_indices", - shape=[None, 2], - dtype=dtypes.int64), - "input_sparse_tensor_values": array_ops.placeholder( - name="input_sparse_tensor_values", - shape=[None], - dtype=dtypes.float32), - "input_sparse_tensor_shape": array_ops.placeholder( - name="input_sparse_tensor_shape", - shape=[2], - dtype=dtypes.int64) - } - serving_input_receiver_fn = build_raw_serving_input_receiver_fn(serving_input_in_earlybird) - twml.contrib.export.export_fn.export_all_models( - trainer=trainer, - export_dir=opt.export_dir, - parse_fn=parse_fn, - serving_input_receiver_fn=serving_input_receiver_fn, - export_output_fn=earlybird_output_fn, - feature_spec=feature_config.get_feature_spec() - ) - logging.info("The export model path is: " + opt.export_dir) diff --git a/src/scala/com/twitter/graph/batch/BUILD.bazel b/src/scala/com/twitter/graph/batch/BUILD.bazel deleted file mode 100644 index 0dcfc85cf..000000000 --- 
a/src/scala/com/twitter/graph/batch/BUILD.bazel +++ /dev/null @@ -1,91 +0,0 @@ -JOB = ["job/**/*"] - -scala_library( - name = "batch", - sources = ["**/*.scala"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "3rdparty/jvm/cascading:cascading-core", - "3rdparty/jvm/cascading:cascading-hadoop", - "3rdparty/jvm/cascading:cascading-local", - "3rdparty/jvm/cascading:cascading-thrift", - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/scalding:args", - "3rdparty/src/jvm/com/twitter/scalding:commons", - "3rdparty/src/jvm/com/twitter/scalding:core", - "3rdparty/src/jvm/com/twitter/scalding:date", - "3rdparty/src/jvm/com/twitter/scalding:parquet", - "3rdparty/src/jvm/com/twitter/summingbird:batch", - "3rdparty/src/jvm/com/twitter/summingbird:client", - "graphstore/common:flock_follows-java", - "src/java/com/twitter/common_internal/util:date_util", - "src/java/com/twitter/twadoop/batch", - "src/java/com/twitter/twadoop/util/dbconfig", - "src/java/com/twitter/twadoop/util/yaml", - "src/protobuf/com/twitter/twadoop", - "src/scala/com/twitter/pluck", - "src/scala/com/twitter/pluck/source/combined_user_source", - "src/scala/com/twitter/pluck/source/jdbc", - "src/scala/com/twitter/scalding_internal/error_handling", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/multiformat", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/wtf/scalding/jobs/common:date_util", - "src/thrift/com/twitter/gizmoduck:user-thrift-java", - "src/thrift/com/twitter/twadoop/user/gen:gen-java", - "util/util-core:scala", - ], -) - -#pants.new build target for the old "dist" -hadoop_binary( - name = "graph-batch-deploy", - main = "com.twitter.scalding.Tool", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweepcred", - ], -) - -# Generated with `capesospy-v2 create_target tweepcred_job science/scalding/mesos/wtf/recos_platform_atla_proc.yaml`, config hash d63a47. 
-scalding_job( - name = "tweepcred_job", - main = "com.twitter.graph.batch.job.tweepcred.TweepcredBatchJob", - args = ["--weighted false --hadoop_config /etc/hadoop/hadoop-conf-proc-atla"], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.reducers", "1200"), - ("hadoop.submitter.disk", "200000m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "24,44,04 * * * *", - hadoop_cluster = "atla-proc", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweepcred", - ], -) diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/ExtractTweepcred.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/ExtractTweepcred.scala deleted file mode 100644 index 568e85251..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/ExtractTweepcred.scala +++ /dev/null @@ -1,83 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -import com.twitter.pluck.source.combined_user_source.MostRecentCombinedUserSnapshotSource -import com.twitter.scalding._ - -/** - * Calculate tweepcred from the given pagerank file. If post_adjust is true, - * reduce pagerank for users with low followers compared to number of - * followings based on existing reputation code. - * Options: - * --input_pagerank: given pagerank - * --user_mass: user mass tsv file, generated by twadoop user_mass job - * --output_pagerank: where to put pagerank file - * --output_tweepcred: where to put tweepcred file - * optional arguments: - * --post_adjust: whether to do post adjust, default true - * - */ -class ExtractTweepcred(args: Args) extends Job(args) { - val POST_ADJUST = args.getOrElse("post_adjust", "true").toBoolean - - val inputPagerank = getInputPagerank(args("input_pagerank")) - .map(() -> ('num_followers, 'num_followings)) { (u: Unit) => - (0, 0) - } - - val userInfo = TypedPipe - .from(MostRecentCombinedUserSnapshotSource) - .flatMap { combinedUser => - val user = Option(combinedUser.user) - val userId = user.map(_.id).getOrElse(0L) - val userExtended = Option(combinedUser.user_extended) - val numFollowers = userExtended.flatMap(u => Option(u.followers)).map(_.toInt).getOrElse(0) - val numFollowings = userExtended.flatMap(u => Option(u.followings)).map(_.toInt).getOrElse(0) - - if (userId == 0L || user.map(_.safety).exists(_.deactivated)) { - None - } else { - Some((userId, 0.0, numFollowers, numFollowings)) - } - } - .toPipe[(Long, Double, Int, Int)]('src_id, 'mass_input, 'num_followers, 'num_followings) - - val pagerankWithSuspended = (inputPagerank ++ userInfo) - .groupBy('src_id) { - _.max('mass_input) - .max('num_followers) - .max('num_followings) - } - - pagerankWithSuspended - .discard('num_followers, 'num_followings) - .write(Tsv(args("output_pagerank"))) - - val adjustedPagerank = - if (POST_ADJUST) { - pagerankWithSuspended - .map(('mass_input, 'num_followers, 'num_followings) -> 'mass_input) { - input: (Double, Int, Int) => - Reputation.adjustReputationsPostCalculation(input._1, input._2, input._3) - } - .normalize('mass_input) - } else { - pagerankWithSuspended - .discard('num_followers, 'num_followings) - } - - val tweepcred = adjustedPagerank - .map('mass_input -> 'mass_input) { input: Double => - Reputation.scaledReputation(input) - } - - tweepcred.write(Tsv(args("output_tweepcred"))) - 
tweepcred.write(Tsv(args("current_tweepcred"))) - tweepcred.write(Tsv(args("today_tweepcred"))) - - def getInputPagerank(fileName: String) = { - Tsv(fileName).read - .mapTo((0, 1) -> ('src_id, 'mass_input)) { input: (Long, Double) => - input - } - } -} diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/PreparePageRankData.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/PreparePageRankData.scala deleted file mode 100644 index 284ba45f8..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/PreparePageRankData.scala +++ /dev/null @@ -1,275 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -import com.twitter.data.proto.Flock -import com.twitter.scalding._ -import com.twitter.pluck.source._ -import com.twitter.pluck.source.combined_user_source.MostRecentCombinedUserSnapshotSource -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.service.interactions.InteractionGraph -import graphstore.common.FlockFollowsJavaDataset -import java.util.TimeZone - -/** - * Prepare the graph data for page rank calculation. Also generate the initial - * pagerank as the starting point. Afterwards, start WeightedPageRank job. - * - * Either read a tsv file for testing or read the following to build the graph - * flock edges Flock.Edge - * real graph input for weights InteractionGraph.Edge - * - * Options: - * --pwd: working directory, will generate the following files there - * numnodes: total number of nodes - * nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior> - * pagerank: the page rank file - * --user_mass: user mass tsv file, generated by twadoop user_mass job - * Optional arguments: - * --input: use the given tsv file instead of flock and real graph - * --weighted: do weighted pagerank, default false - * --flock_edges_only: restrict graph to flock edges, default true - * --input_pagerank: continue pagerank from this - * - * Plus the following options for WeightedPageRank and ExtractTweepcred: - * --output_pagerank: where to put pagerank file - * --output_tweepcred: where to put tweepcred file - * Optional: - * --maxiterations: how many iterations to run. 
Default is 20 - * --jumpprob: probability of a random jump, default is 0.1 - * --threshold: total difference before finishing early, default 0.001 - * --post_adjust: whether to do post adjust, default true - */ -class PreparePageRankData(args: Args) extends Job(args) { - implicit val timeZone: TimeZone = DateOps.UTC - val PWD = args("pwd") - val WEIGHTED = args.getOrElse("weighted", "false").toBoolean - val FLOCK_EDGES_ONLY = args.getOrElse("flock_edges_only", "true").toBoolean - - val ROW_TYPE_1 = 1 - val ROW_TYPE_2 = 2 - - // graph data and user mass - val userMass = getUserMass - val nodesWithPrior = getGraphData(userMass) - val numNodes = nodesWithPrior.groupAll { _.size } - numNodes.write(Tsv(PWD + "/numnodes")) - dumpNodes(nodesWithPrior, PWD + "/nodes"); - - // initial pagerank to start computation - generateInitialPagerank(nodesWithPrior) - - // continue with the calculation - override def next = { - Some(new WeightedPageRank(args)) - } - - /** - * read flock edges - */ - def getFlockEdges = { - DAL - .readMostRecentSnapshotNoOlderThan(FlockFollowsJavaDataset, Days(7)) - .toTypedSource - .flatMapTo('src_id, 'dst_id) { edge: Flock.Edge => - if (edge.getStateId() == Flock.State.Positive.getNumber()) { - Some((edge.getSourceId(), edge.getDestinationId())) - } else { - None - } - } - } - - /** - * read real graph edges with weights - */ - def getRealGraphEdges = { - RealGraphEdgeSource() - .flatMapTo('src_id, 'dst_id, 'weight) { edge: InteractionGraph.Edge => - if (edge.getSourceId() != edge.getDestinationId()) { - val srcId = edge.getSourceId() - val dstId = edge.getDestinationId() - val weight = edge.getWeight().toFloat - Some((srcId, dstId, weight)) - } else { - None - } - } - } - - /** - * combine real graph and flock. If flock_edges_only is true, only take the - * flock edges; otherwise edges are either from flock or from real graph. - * edges weights default to be 1, overwritten by weights from real graph - */ - def getFlockRealGraphEdges = { - val flock = getFlockEdges - - if (WEIGHTED) { - val flockWithWeight = flock - .map(() -> ('weight, 'rowtype)) { (u: Unit) => - (1.0f, ROW_TYPE_1) - } - - val realGraph = getRealGraphEdges - .map(() -> 'rowtype) { (u: Unit) => - (ROW_TYPE_2) - } - - val combined = (flockWithWeight ++ realGraph) - .groupBy('src_id, 'dst_id) { - _.min('rowtype) - .max('weight) // take whichever is bigger - } - - if (FLOCK_EDGES_ONLY) { - combined.filter('rowtype) { (rowtype: Int) => - rowtype == ROW_TYPE_1 - } - } else { - combined - } - } else { - flock.map(() -> ('weight)) { (u: Unit) => - 1.0f - } - }.project('src_id, 'dst_id, 'weight) - } - - def getCsvEdges(fileName: String) = { - Tsv(fileName).read - .mapTo((0, 1, 2) -> ('src_id, 'dst_id, 'weight)) { input: (Long, Long, Float) => - input - } - } - - /* - * Compute user mass based on combined user - */ - def getUserMass = - TypedPipe - .from(MostRecentCombinedUserSnapshotSource) - .flatMap { user => - UserMass.getUserMass(user) - } - .map { userMassInfo => - (userMassInfo.userId, userMassInfo.mass) - } - .toPipe[(Long, Double)]('src_id_input, 'mass_prior) - .normalize('mass_prior) - - /** - * Read either flock/real_graph or a given tsv file - * group by the source id, and output node data structure - * merge with the user_mass. - * return <'src_id, 'dst_ids, 'weights, 'mass_prior> - * - * make sure src_id is the same set as in user_mass, and dst_ids - * are subset of user_mass. 
eg flock has edges like 1->2, - * where both users 1 and 2 do not exist anymore - */ - def getGraphData(userMass: RichPipe) = { - val edges: RichPipe = args.optional("input") match { - case None => getFlockRealGraphEdges - case Some(input) => getCsvEdges(input) - } - - // remove edges where dst_id is not in userMass - val filterByDst = userMass - .joinWithLarger('src_id_input -> 'dst_id, edges) - .discard('src_id_input, 'mass_prior) - - // aggreate by the source id - val nodes = filterByDst - .groupBy('src_id) { - _.mapReduceMap(('dst_id, 'weight) -> ('dst_ids, 'weights)) /* map1 */ { a: (Long, Float) => - (Vector(a._1), if (WEIGHTED) Vector(a._2) else Vector()) - } /* reduce */ { (a: (Vector[Long], Vector[Float]), b: (Vector[Long], Vector[Float])) => - { - (a._1 ++ b._1, a._2 ++ b._2) - } - } /* map2 */ { a: (Vector[Long], Vector[Float]) => - a - } - } - .mapTo( - ('src_id, 'dst_ids, 'weights) -> ('src_id, 'dst_ids, 'weights, 'mass_prior, 'rowtype)) { - input: (Long, Vector[Long], Vector[Float]) => - { - (input._1, input._2.toArray, input._3.toArray, 0.0, ROW_TYPE_1) - } - } - - // get to the same schema - val userMassNodes = userMass - .mapTo(('src_id_input, 'mass_prior) -> ('src_id, 'dst_ids, 'weights, 'mass_prior, 'rowtype)) { - input: (Long, Double) => - { - (input._1, Array[Long](), Array[Float](), input._2, ROW_TYPE_2) - } - } - - // make src_id the same set as in userMass - (nodes ++ userMassNodes) - .groupBy('src_id) { - _.sortBy('rowtype) - .head('dst_ids, 'weights) - .last('mass_prior, 'rowtype) - } - .filter('rowtype) { input: Int => - input == ROW_TYPE_2 - } - } - - /** - * generate the graph data output - */ - def dumpNodes(nodes: RichPipe, fileName: String) = { - mode match { - case Hdfs(_, conf) => nodes.write(SequenceFile(fileName)) - case _ => - nodes - .mapTo((0, 1, 2, 3) -> (0, 1, 2, 3)) { input: (Long, Array[Long], Array[Float], Double) => - (input._1, input._2.mkString(","), input._3.mkString(","), input._4) - } - .write(Tsv(fileName)) - } - } - - /* - * output prior mass or copy the given mass file (merge, normalize) - * to be used as the starting point - */ - def generateInitialPagerank(nodes: RichPipe) = { - val prior = nodes - .project('src_id, 'mass_prior) - - val combined = args.optional("input_pagerank") match { - case None => prior - case Some(fileName) => { - val massInput = Tsv(fileName).read - .mapTo((0, 1) -> ('src_id, 'mass_prior, 'rowtype)) { input: (Long, Double) => - (input._1, input._2, ROW_TYPE_2) - } - - val priorRow = prior - .map(() -> ('rowtype)) { (u: Unit) => - ROW_TYPE_1 - } - - (priorRow ++ massInput) - .groupBy('src_id) { - _.sortBy('rowtype) - .last('mass_prior) - .head('rowtype) - } - // throw away extra nodes from input file - .filter('rowtype) { (rowtype: Int) => - rowtype == ROW_TYPE_1 - } - .discard('rowtype) - .normalize('mass_prior) - } - } - - combined.write(Tsv(PWD + "/pagerank_0")) - } -} diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/README b/src/scala/com/twitter/graph/batch/job/tweepcred/README deleted file mode 100644 index 55ef3b093..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/README +++ /dev/null @@ -1,75 +0,0 @@ -Tweepcred - -Tweepcred is a social network analysis tool that calculates the influence of Twitter users based on their interactions with other users. The tool uses the PageRank algorithm to rank users based on their influence. - -PageRank Algorithm -PageRank is a graph algorithm that was originally developed by Google to determine the importance of web pages in search results. 
The algorithm works by assigning a numerical score to each page based on the number and quality of other pages that link to it. The more links a page has from other high-quality pages, the higher its PageRank score. - -In the Tweepcred project, the PageRank algorithm is used to determine the influence of Twitter users based on their interactions with other users. The graph is constructed by treating Twitter users as nodes, and their interactions (mentions, retweets, etc.) as edges. The PageRank score of a user represents their influence in the network. - -Tweepcred PageRank Implementation -The implementation of the PageRank algorithm in Tweepcred is based on the Hadoop MapReduce framework. The algorithm is split into two stages: preparation and iteration. - -The preparation stage involves constructing the graph of Twitter users and their interactions, and initializing each user's PageRank score to a default value. This stage is implemented in the PreparePageRankData class. - -The iteration stage involves repeatedly calculating and updating the PageRank scores of each user until convergence is reached. This stage is implemented in the UpdatePageRank class, which is run multiple times until the algorithm converges. - -The Tweepcred PageRank implementation also includes a number of optimizations to improve performance and reduce memory usage. These optimizations include block compression, lazy loading, and in-memory caching. - - -========================================== TweepcredBatchJob.scala ========================================== - - -This is a Scala class that represents a batch job for computing the "tweepcred" (Twitter credibility) score for Twitter users using weighted or unweighted PageRank algorithm. The class extends the AnalyticsIterativeBatchJob class, which is part of the Scalding framework used for data processing on Hadoop. - -The class defines various properties and methods that are used to configure and run the batch job. The args parameter represents the command-line arguments that are passed to the batch job, such as the --weighted flag that determines whether to use the weighted PageRank algorithm or not. - -The run method overrides the run method of the base class and prints the batch statistics after the job has finished. The children method defines a list of child jobs that need to be executed as part of the batch job. The messageHeader method returns a string that represents the header of the batch job message. - -========================================== ExtractTweepcred.scala ========================================== - -This class is a Scalding job that calculates "tweepcred" from a given pagerank file. Tweepcred is a measure of reputation for Twitter users that takes into account the number of followers they have and the number of people they follow. If the optional argument post_adjust is set to true (default value), then the pagerank values are adjusted based on the user's follower-to-following ratio. - -The class takes several command-line arguments specifying input and output files and options, and it uses the Scalding library to perform distributed data processing on the input files. It reads in the pagerank file and a user mass file, both in TSV format, and combines them to produce a new pagerank file with the adjusted values. The adjusted pagerank is then used to calculate tweepcred values, which are written to output files. 
- -The code makes use of the MostRecentCombinedUserSnapshotSource class from the com.twitter.pluck.source.combined_user_source package to obtain user information from the user mass file. It also uses the Reputation class to perform the tweepcred calculations and adjustments. - - -========================================== UserMass.scala ========================================== - -The UserMass class is a helper class used to calculate the "mass" of a user on Twitter, as defined by a certain algorithm. The mass score represents the user's reputation and is used in various applications, such as in determining which users should be recommended to follow or which users should have their content highlighted. - -The getUserMass method of the UserMass class takes in a CombinedUser object, which contains information about a Twitter user, and returns an optional UserMassInfo object, which contains the user's ID and calculated mass score. - -The algorithm used to calculate the mass score takes into account various factors such as the user's account age, number of followers and followings, device usage, and safety status (restricted, suspended, verified). The calculation involves adding and multiplying weight factors and adjusting the mass score based on a threshold for the number of friends and followers. - - -========================================== PreparePageRankData.scala ========================================== - -The PreparePageRankData class prepares the graph data for the page rank calculation. It generates the initial pagerank and then starts the WeightedPageRank job. It has the following functionalities: - -It reads the user mass TSV file generated by the twadoop user_mass job. -It reads the graph data, which is either a TSV file or a combination of flock edges and real graph inputs for weights. -It generates the initial pagerank as the starting point for the pagerank computation. -It writes the number of nodes to a TSV file and dumps the nodes to another TSV file. -It has several options like weighted, flock_edges_only, and input_pagerank to fine-tune the pagerank calculation. -It also has options for the WeightedPageRank and ExtractTweepcred jobs, like output_pagerank, output_tweepcred, maxiterations, jumpprob, threshold, and post_adjust. -The PreparePageRankData class has several helper functions like getFlockEdges, getRealGraphEdges, getFlockRealGraphEdges, and getCsvEdges that read the graph data from different sources like DAL, InteractionGraph, or CSV files. It also has the generateInitialPagerank function that generates the initial pagerank from the graph data. - -========================================== WeightedPageRank.scala ========================================== - -WeightedPageRank is a class that performs the weighted PageRank algorithm on a given graph. - -The algorithm starts from a given PageRank value and performs one iteration, then tests for convergence. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job with the updated PageRank as input. If convergence has been reached, the algorithm starts the ExtractTweepcred job instead. - -The class takes in several options, including the working directory, total number of nodes, nodes file, PageRank file, total difference, whether to perform weighted PageRank, the current iteration, maximum iterations to run, probability of a random jump, and whether to do post adjust. 
- -The algorithm reads a nodes file that includes the source node ID, destination node IDs, weights, and mass prior. The algorithm also reads an input PageRank file that includes the source node ID and mass input. The algorithm then performs one iteration of the PageRank algorithm and writes the output PageRank to a file. - -The algorithm tests for convergence by calculating the total difference between the input and output PageRank masses. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job. If convergence has been reached, the algorithm starts the ExtractTweepcred job. - -========================================== Reputation.scala ========================================== - -This is a helper class called Reputation that contains methods for calculating a user's reputation score. The first method called scaledReputation takes a Double parameter raw which represents the user's page rank, and returns a Byte value that represents the user's reputation on a scale of 0 to 100. This method uses a formula that involves converting the logarithm of the page rank to a number between 0 and 100. - -The second method called adjustReputationsPostCalculation takes three parameters: mass (a Double value representing the user's page rank), numFollowers (an Int value representing the number of followers a user has), and numFollowings (an Int value representing the number of users a user is following). This method reduces the page rank of users who have a low number of followers but a high number of followings. It calculates a division factor based on the ratio of followings to followers, and reduces the user's page rank by dividing it by this factor. The method returns the adjusted page rank. diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/Reputation.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/Reputation.scala deleted file mode 100644 index 6c81805fd..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/Reputation.scala +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -/** - * helper class to calculate reputation, borrowed from repo reputations - */ -object Reputation { - - /** - * convert pagerank to tweepcred between 0 and 100, - * take from repo reputations, util/Utils.scala - */ - def scaledReputation(raw: Double): Byte = { - if (raw == 0 || (raw < 1.0e-20)) { - 0 - } else { - // convert log(pagerank) to a number between 0 and 100 - // the two parameters are from a linear fit by converting - // max pagerank -> 95 - // min pagerank -> 15 - val e: Double = 130d + 5.21 * scala.math.log(raw) // log to the base e - val pos = scala.math.rint(e) - val v = if (pos > 100) 100.0 else if (pos < 0) 0.0 else pos - v.toByte - } - } - - // these constants are take from repo reputations, config/production.conf - private val threshAbsNumFriendsReps = 2500 - private val constantDivisionFactorGt_threshFriendsToFollowersRatioReps = 3.0 - private val threshFriendsToFollowersRatioUMass = 0.6 - private val maxDivFactorReps = 50 - - /** - * reduce pagerank of users with low followers but high followings - */ - def adjustReputationsPostCalculation(mass: Double, numFollowers: Int, numFollowings: Int) = { - if (numFollowings > threshAbsNumFriendsReps) { - val friendsToFollowersRatio = (1.0 + numFollowings) / (1.0 + numFollowers) - val divFactor = - scala.math.exp( - constantDivisionFactorGt_threshFriendsToFollowersRatioReps * - (friendsToFollowersRatio - threshFriendsToFollowersRatioUMass) * - 
scala.math.log(scala.math.log(numFollowings)) - ) - mass / ((divFactor min maxDivFactorReps) max 1.0) - } else { - mass - } - } -} diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/TweepcredBatchJob.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/TweepcredBatchJob.scala deleted file mode 100644 index 48c06027b..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/TweepcredBatchJob.scala +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -import com.twitter.scalding._ -import com.twitter.scalding_internal.job._ -import com.twitter.scalding_internal.job.analytics_batch._ - -/** - * Register the beginning of the tweepcred job in analytic batch table - * - * Options: - * --weighted: do weighted pagerank - * --hadoop_config: /etc/hadoop/hadoop-conf-proc-atla - * - */ -class TweepcredBatchJob(args: Args) extends AnalyticsIterativeBatchJob(args) { - - def WEIGHTED = args("weighted").toBoolean - - override def timeout = Hours(36) - override def hasFlow = false - def descriptionSuffix = " weighted=" + args("weighted") - override def batchIncrement = Hours(24) - override def firstTime = RichDate("2015-10-02") - override def batchDescription = classOf[TweepcredBatchJob].getCanonicalName + descriptionSuffix - - override def run = { - val success = super.run - println("Batch Stat: " + messageHeader + " " + jobStat.get.toString) - success - } - - def startTime = dateRange.start - def dateString = startTime.toString("yyyy/MM/dd") - - override def children = { - val BASEDIR = "/user/cassowary/tweepcred/" - val baseDir = BASEDIR + (if (WEIGHTED) "weighted" else "unweighted") + "/daily/" - val tmpDir = baseDir + "tmp" - val outputDir = baseDir + dateString - val pageRankDir = outputDir + "/finalmass" - val tweepcredDir = outputDir + "/finaltweepcred" - val yesterdayStr = (startTime - Days(1)).toString("yyyy/MM/dd") - val yestPageRankDir = baseDir + yesterdayStr + "/finalmass" - val TWEEPCRED = "/tweepcred" - val curRep = (if (WEIGHTED) baseDir else BASEDIR) + "current" - val todayRep = (if (WEIGHTED) baseDir else BASEDIR) + dateString - val newArgs = args + ("pwd", Some(tmpDir)) + - ("output_pagerank", Some(pageRankDir)) + - ("output_tweepcred", Some(tweepcredDir)) + - ("input_pagerank", Some(yestPageRankDir)) + - ("current_tweepcred", Some(curRep + TWEEPCRED)) + - ("today_tweepcred", Some(todayRep + TWEEPCRED)) - - val prJob = new PreparePageRankData(newArgs) - - List(prJob) - } - - private def messageHeader = { - val dateString = dateRange.start.toString("yyyy/MM/dd") - classOf[TweepcredBatchJob].getSimpleName + - (if (WEIGHTED) " weighted " else " unweighted ") + dateString - } -} diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/UserMass.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/UserMass.scala deleted file mode 100644 index 064819bb0..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/UserMass.scala +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -import com.twitter.twadoop.user.gen.CombinedUser -import com.twitter.util.Time -import com.twitter.wtf.scalding.jobs.common.DateUtil - -case class UserMassInfo(userId: Long, mass: Double) - -/** - * helper class to calculate user mass, borrowed from repo reputations - */ -object UserMass { - - private val currentTimestamp = Time.now.inMilliseconds - private val constantDivisionFactorGt_threshFriendsToFollowersRatioUMass = 5.0 - private val threshAbsNumFriendsUMass = 500 - private val threshFriendsToFollowersRatioUMass = 0.6 - private 
val deviceWeightAdditive = 0.5 - private val ageWeightAdditive = 0.2 - private val restrictedWeightMultiplicative = 0.1 - - def getUserMass(combinedUser: CombinedUser): Option[UserMassInfo] = { - val user = Option(combinedUser.user) - val userId = user.map(_.id).getOrElse(0L) - val userExtended = Option(combinedUser.user_extended) - val age = user.map(_.created_at_msec).map(DateUtil.diffDays(_, currentTimestamp)).getOrElse(0) - val isRestricted = user.map(_.safety).exists(_.restricted) - val isSuspended = user.map(_.safety).exists(_.suspended) - val isVerified = user.map(_.safety).exists(_.verified) - val hasValidDevice = user.flatMap(u => Option(u.devices)).exists(_.isSetMessaging_devices) - val numFollowers = userExtended.flatMap(u => Option(u.followers)).map(_.toInt).getOrElse(0) - val numFollowings = userExtended.flatMap(u => Option(u.followings)).map(_.toInt).getOrElse(0) - - if (userId == 0L || user.map(_.safety).exists(_.deactivated)) { - None - } else { - val mass = - if (isSuspended) - 0 - else if (isVerified) - 100 - else { - var score = deviceWeightAdditive * 0.1 + - (if (hasValidDevice) deviceWeightAdditive else 0) - val normalizedAge = if (age > 30) 1.0 else (1.0 min scala.math.log(1.0 + age / 15.0)) - score *= normalizedAge - if (score < 0.01) score = 0.01 - if (isRestricted) score *= restrictedWeightMultiplicative - score = (score min 1.0) max 0 - score *= 100 - score - } - - val friendsToFollowersRatio = (1.0 + numFollowings) / (1.0 + numFollowers) - val adjustedMass = - if (numFollowings > threshAbsNumFriendsUMass && - friendsToFollowersRatio > threshFriendsToFollowersRatioUMass) { - mass / scala.math.exp( - constantDivisionFactorGt_threshFriendsToFollowersRatioUMass * - (friendsToFollowersRatio - threshFriendsToFollowersRatioUMass) - ) - } else { - mass - } - - Some(UserMassInfo(userId, adjustedMass)) - } - } -} diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/WeightedPageRank.scala b/src/scala/com/twitter/graph/batch/job/tweepcred/WeightedPageRank.scala deleted file mode 100644 index 7e06077a1..000000000 --- a/src/scala/com/twitter/graph/batch/job/tweepcred/WeightedPageRank.scala +++ /dev/null @@ -1,235 +0,0 @@ -package com.twitter.graph.batch.job.tweepcred - -import com.twitter.scalding._ - -/** - * weighted page rank for the given graph, start from the given pagerank, - * perform one iteration, test for convergence, if not yet, clone itself - * and start the next page rank job with updated pagerank as input; - * if converged, start ExtractTweepcred job instead - * - * Options: - * --pwd: working directory, will read/generate the following files there - * numnodes: total number of nodes - * nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior> - * pagerank: the page rank file eg pagerank_0, pagerank_1 etc - * totaldiff: the current max pagerank delta - * Optional arguments: - * --weighted: do weighted pagerank, default false - * --curiteration: what is the current iteration, default 0 - * --maxiterations: how many iterations to run. 
Default is 20 - * --jumpprob: probability of a random jump, default is 0.1 - * --threshold: total difference before finishing early, default 0.001 - * - * plus the following options for ExtractTweepcred: - * --user_mass: user mass tsv file, generated by twadoop user_mass job - * --output_pagerank: where to put pagerank file - * --output_tweepcred: where to put tweepcred file - * Optional: - * --post_adjust: whether to do post adjust, default true - * - */ -class WeightedPageRank(args: Args) extends Job(args) { - val ROW_TYPE_1 = 1 - val ROW_TYPE_2 = 2 - - val PWD = args("pwd") - val ALPHA = args.getOrElse("jumpprob", "0.1").toDouble - val WEIGHTED = args.getOrElse("weighted", "false").toBoolean - val THRESHOLD = args.getOrElse("threshold", "0.001").toDouble - val MAXITERATIONS = args.getOrElse("maxiterations", "20").toInt - val CURITERATION = args.getOrElse("curiteration", "0").toInt - - // 'size - val numNodes = getNumNodes(PWD + "/numnodes") - - // 'src_id, 'dst_ids, 'weights, 'mass_prior - val nodes = getNodes(PWD + "/nodes") - - // 'src_id_input, 'mass_input - val inputPagerank = getInputPagerank(PWD + "/pagerank_" + CURITERATION) - - // one iteration of pagerank - val outputPagerank = doPageRank(nodes, inputPagerank) - val outputFileName = PWD + "/pagerank_" + (CURITERATION + 1) - outputPagerank - .project('src_id, 'mass_n) - .write(Tsv(outputFileName)) - - // detect convergence - val totalDiff = outputPagerank - .mapTo(('mass_input, 'mass_n) -> 'mass_diff) { args: (Double, Double) => - scala.math.abs(args._1 - args._2) - } - .groupAll { _.sum[Double]('mass_diff) } - .write(Tsv(PWD + "/totaldiff")) - - /** - * test convergence, if not yet, kick off the next iteration - */ - override def next = { - // the max diff generated above - val totalDiff = Tsv(PWD + "/totaldiff").readAtSubmitter[Double].head - - if (CURITERATION < MAXITERATIONS - 1 && totalDiff > THRESHOLD) { - val newArgs = args + ("curiteration", Some((CURITERATION + 1).toString)) - Some(clone(newArgs)) - } else { - val newArgs = args + ("input_pagerank", Some(outputFileName)) - Some(new ExtractTweepcred(newArgs)) - } - } - - def getInputPagerank(fileName: String) = { - Tsv(fileName).read - .mapTo((0, 1) -> ('src_id_input, 'mass_input)) { input: (Long, Double) => - input - } - } - - /** - * read the pregenerated nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior> - */ - def getNodes(fileName: String) = { - mode match { - case Hdfs(_, conf) => { - SequenceFile(fileName).read - .mapTo((0, 1, 2, 3) -> ('src_id, 'dst_ids, 'weights, 'mass_prior)) { - input: (Long, Array[Long], Array[Float], Double) => - input - } - } - case _ => { - Tsv(fileName).read - .mapTo((0, 1, 2, 3) -> ('src_id, 'dst_ids, 'weights, 'mass_prior)) { - input: (Long, String, String, Double) => - { - ( - input._1, - // convert string to int array - if (input._2 != null && input._2.length > 0) { - input._2.split(",").map { _.toLong } - } else { - Array[Long]() - }, - // convert string to float array - if (input._3 != null && input._3.length > 0) { - input._3.split(",").map { _.toFloat } - } else { - Array[Float]() - }, - input._4 - ) - } - } - } - } - } - - /** - * the total number of nodes, single line file - */ - def getNumNodes(fileName: String) = { - Tsv(fileName).read - .mapTo(0 -> 'size) { input: Long => - input - } - } - - /** - * one iteration of pagerank - * inputPagerank: <'src_id_input, 'mass_input> - * return <'src_id, 'mass_n, 'mass_input> - * - * Here is a highlevel view of the unweighted algorithm: - * let - * N: number of nodes - * 
inputPagerank(N_i): prob of walking to node i, - * d(N_j): N_j's out degree - * then - * pagerankNext(N_i) = (\sum_{j points to i} inputPagerank(N_j) / d_j) - * deadPagerank = (1 - \sum_{i} pagerankNext(N_i)) / N - * randomPagerank(N_i) = userMass(N_i) * ALPHA + deadPagerank * (1-ALPHA) - * pagerankOutput(N_i) = randomPagerank(N_i) + pagerankNext(N_i) * (1-ALPHA) - * - * For weighted algorithm: - * let - * w(N_j, N_i): weight from N_j to N_i - * tw(N_j): N_j's total out weights - * then - * pagerankNext(N_i) = (\sum_{j points to i} inputPagerank(N_j) * w(N_j, N_i) / tw(N_j)) - * - */ - def doPageRank(nodeRows: RichPipe, inputPagerank: RichPipe): RichPipe = { - // 'src_id, 'dst_ids, 'weights, 'mass_prior, 'mass_input - val nodeJoined = nodeRows - .joinWithSmaller('src_id -> 'src_id_input, inputPagerank) - .discard('src_id_input) - - // 'src_id, 'mass_n - val pagerankNext = nodeJoined - .flatMapTo(('dst_ids, 'weights, 'mass_input) -> ('src_id, 'mass_n)) { - args: (Array[Long], Array[Float], Double) => - { - if (args._1.length > 0) { - if (WEIGHTED) { - // weighted distribution - val total: Double = args._2.sum - (args._1 zip args._2).map { idWeight: (Long, Float) => - (idWeight._1, args._3 * idWeight._2 / total) - } - } else { - // equal distribution - val dist: Double = args._3 / args._1.length - args._1.map { id: Long => - (id, dist) - } - } - } else { - //Here is a node that points to no other nodes (dangling) - Nil - } - } - } - .groupBy('src_id) { - _.sum[Double]('mass_n) - } - - // 'sum_mass - val sumPagerankNext = pagerankNext.groupAll { _.sum[Double]('mass_n -> 'sum_mass) } - - // 'deadMass - // single row jobs - // the dead page rank equally distributed to every node - val deadPagerank = sumPagerankNext - .crossWithTiny(numNodes) - .map(('sum_mass, 'size) -> 'deadMass) { input: (Double, Long) => - (1.0 - input._1) / input._2 - } - .discard('size, 'sum_mass) - - // 'src_id_r, 'mass_n_r - // random jump probability plus dead page rank - val randomPagerank = nodeJoined - .crossWithTiny(deadPagerank) - .mapTo(('src_id, 'mass_prior, 'deadMass, 'mass_input) -> ('src_id, 'mass_n, 'mass_input)) { - ranks: (Long, Double, Double, Double) => - (ranks._1, ranks._2 * ALPHA + ranks._3 * (1 - ALPHA), ranks._4) - } - - // 'src_id, 'mass_n - // scale next page rank to 1-ALPHA - val pagerankNextScaled = pagerankNext - .map('mass_n -> ('mass_n, 'mass_input)) { m: Double => - ((1 - ALPHA) * m, 0.0) - } - - // 'src_id, 'mass_n, 'mass_input - // random probability + next probability - (randomPagerank ++ pagerankNextScaled) - .groupBy('src_id) { - _.sum[Double]('mass_input) // keep the input pagerank - .sum[Double]('mass_n) // take the sum - } - } -} diff --git a/src/scala/com/twitter/interaction_graph/README.md b/src/scala/com/twitter/interaction_graph/README.md deleted file mode 100644 index 31b4cf00b..000000000 --- a/src/scala/com/twitter/interaction_graph/README.md +++ /dev/null @@ -1,19 +0,0 @@ -## Real Graph (bqe) - -This project builds a machine learning model using a gradient boosting tree classifier to predict the likelihood of a Twitter user interacting with another user. - -The algorithm works by first creating a labeled dataset of user interactions from a graph of Twitter users. This graph is represented in a BigQuery table where each row represents a directed edge between two users, along with various features such as the number of tweets, follows, favorites, and other metrics related to user behavior. 
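As a rough illustration (a sketch, not the production query), each edge row pairs a source and destination user with its engagement counts; the column and table names below follow the candidates table defined later in this folder, and the user id in the WHERE clause is a placeholder:

```
-- Sketch only: peek at the per-edge engagement features for one source user.
-- Column and table names follow the scoring queries in this folder; 12345 is
-- a placeholder user id.
SELECT
  source_id,
  destination_id,
  num_days,
  num_tweets,
  num_follows,
  num_favorites,
  num_tweet_clicks,
  num_profile_views,
  days_since_last_interaction
FROM `twttr-recos-ml-prod.realgraph.candidates`
WHERE source_id = 12345
ORDER BY num_favorites DESC
LIMIT 10;
```

The labeling step described next attaches a 0/1 outcome to each such edge.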
- -To create the labeled dataset, the algorithm first selects a set of candidate interactions by identifying all edges that were active during a certain time period. It then joins this candidate set with a set of labeled interactions that occurred one day after the candidate period. Positive interactions are labeled as "1" and negative interactions are labeled as "0". The resulting labeled dataset is then used to train a boosted tree classifier model. - -The model is trained using the labeled dataset and various hyperparameters, including the maximum number of iterations and the subsample rate. The algorithm splits the labeled dataset into training and testing sets based on the source user's ID, using a custom data split method. - -Once the model is trained, it can be used to generate a score estimating the probability of a user interacting with another user. - -## Real Graph (scio) - -This project aggregates the number of interactions between pairs of users on Twitter. On a daily basis, there are multiple dataflow jobs that perform this aggregation, which includes public engagements like favorites, retweets, follows, etc. as well as private engagements like profile views, tweet clicks, and whether or not a user has another user in their address book (provided the user has opted in to share their address book). - -After the daily aggregation of interactions, there is a rollup job that aggregates yesterday's aggregation with today's interactions. The rollup job outputs several results, including the daily count of interactions per interaction type between a pair of users, the daily incoming interactions made on a user per interaction type, the rollup aggregation of interactions as a decayed sum between a pair of users, and the rollup aggregation of incoming interactions made on a user. - -Finally, the rollup job outputs the ML-predicted interaction score between the pair of users alongside the rollup aggregation of interactions as a decayed sum between them. diff --git a/src/scala/com/twitter/interaction_graph/bqe/scoring/README.md b/src/scala/com/twitter/interaction_graph/bqe/scoring/README.md deleted file mode 100644 index 0e435feb8..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/scoring/README.md +++ /dev/null @@ -1,58 +0,0 @@ -# Scoring - -This folder contains the SQL files that we use for scoring the real graph edges in BQ. Four steps take place: -- Check that our models are in place. The feature importance query should return 20 rows in total: 10 rows per model, 1 for each feature. -- Follow graph feature generation. This ensures that we have features for all users, regardless of whether they have had any recent activity. -- Candidate generation. This query combines the candidates from the follow graph and the activity graph, and the features from both. -- Scoring. This query scores with two of our prod models and saves the scores to a table, with an additional field that indicates whether an edge is in- or out-of-network. - -## Instructions - -To deploy the job, create a zip file, upload it to packer, and then schedule it with aurora.
- -``` -zip -jr real_graph_scoring src/scala/com/twitter/interaction_graph/bqe/scoring && \ -packer add_version --cluster=atla cassowary real_graph_scoring real_graph_scoring.zip -aurora cron schedule atla/cassowary/prod/real_graph_scoring src/scala/com/twitter/interaction_graph/bqe/scoring/scoring.aurora && \ -aurora cron start atla/cassowary/prod/real_graph_scoring -``` - -# candidates.sql - -This BigQuery (BQ) query does the following: - -1. Declares two variables, date_start and date_end, which are both of type DATE. -2. Sets the date_end variable to the maximum partition ID of the interaction_graph_labels_daily table, using the PARSE_DATE() function to convert the partition ID to a date format. -3. Sets the date_start variable to 30 days prior to the date_end variable, using the DATE_SUB() function. -4. Creates a new table called candidates in the realgraph dataset, partitioned by ds. -5. The query uses three common table expressions (T1, T2, and T3) to join data from two tables (interaction_graph_labels_daily and tweeting_follows) to generate a table containing candidate information and features. -6. The table T3 is the result of a full outer join between T1 and T2, grouping by source_id and destination_id, and aggregating values such as num_tweets, label_types, and the counts of different types of labels (e.g. num_follows, num_favorites, etc.). -7. The T4 table ranks each source_id by the number of num_days and num_tweets, and selects the top 2000 rows for each source_id. -8. Finally, the query selects all columns from the T4 table and appends the date_end variable as a new column named ds. - -Overall, the query generates a table of candidates and their associated features for a particular date range, using data from two tables in the twttr-bq-cassowary-prod and twttr-recos-ml-prod datasets. - -# follow_graph_features.sql - -This BigQuery script creates a table twttr-recos-ml-prod.realgraph.tweeting_follows that includes features for Twitter user interactions, specifically tweet counts and follows. - -First, it sets two variables date_latest_tweet and date_latest_follows to the most recent dates available in two separate tables: twttr-bq-tweetsource-pub-prod.user.public_tweets and twttr-recos-ml-prod.user_events.valid_user_follows, respectively. - -Then, it creates the tweet_count and all_follows CTEs. - -The tweet_count CTE counts the number of tweets made by each user within the last 3 days prior to date_latest_tweet. - -The all_follows CTE retrieves all the follows from the valid_user_follows table that happened on date_latest_follows and left joins it with the tweet_count CTE. It also adds a row number that partitions by the source user ID and orders by the number of tweets in descending order. The final output is filtered to keep only the top 2000 follows per user based on the row number. - -The final SELECT statement combines the all_follows CTE with the date_latest_tweet variable and inserts the results into the twttr-recos-ml-prod.realgraph.tweeting_follows table partitioned by date. - -# scoring.sql - -This BQ code performs operations on a BigQuery table called twttr-recos-ml-prod.realgraph.scores. Here is a step-by-step breakdown of what the code does: - -Declare two variables, date_end and date_latest_follows, and set their values based on the latest partitions in the twttr-bq-cassowary-prod.user.INFORMATION_SCHEMA.PARTITIONS and twttr-recos-ml-prod.user_events.INFORMATION_SCHEMA.PARTITIONS tables that correspond to specific tables, respectively. 
The PARSE_DATE() function is used to convert the partition IDs to date format. - -Delete rows from the twttr-recos-ml-prod.realgraph.scores table where the value of the ds column is equal to date_end. - -Insert rows into the twttr-recos-ml-prod.realgraph.scores table based on a query that generates predicted scores for pairs of user IDs using two machine learning models. Specifically, the query uses the ML.PREDICT() function to apply two machine learning models (twttr-recos-ml-prod.realgraph.prod and twttr-recos-ml-prod.realgraph.prod_explicit) to the twttr-recos-ml-prod.realgraph.candidates table. The resulting predicted scores are joined with the twttr-recos-ml-prod.realgraph.tweeting_follows table, which contains information about the number of tweets made by users and their follow relationships, using a full outer join. The final result includes columns for the source ID, destination ID, predicted score (prob), explicit predicted score (prob_explicit), a binary variable indicating whether the destination ID is followed by the source ID (followed), and the value of date_end for the ds column. If there is no match in the predicted_scores table for a given pair of user IDs, the COALESCE() function is used to return the corresponding values from the tweeting_follows table, with default values of 0.0 for the predicted scores. - diff --git a/src/scala/com/twitter/interaction_graph/bqe/scoring/candidates.sql b/src/scala/com/twitter/interaction_graph/bqe/scoring/candidates.sql deleted file mode 100644 index 89bd30d38..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/scoring/candidates.sql +++ /dev/null @@ -1,42 +0,0 @@ -DECLARE date_start, date_end DATE; -SET date_end = ( - SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS partition_id - FROM `twttr-bq-cassowary-prod.user.INFORMATION_SCHEMA.PARTITIONS` - WHERE partition_id IS NOT NULL AND partition_id != '__NULL__' AND table_name="interaction_graph_labels_daily" -); -SET date_start = DATE_SUB(date_end, INTERVAL 30 DAY); - --- all candidates and their features -CREATE OR REPLACE TABLE `twttr-recos-ml-prod.realgraph.candidates` -PARTITION BY ds -AS -WITH T1 AS ( - SELECT source_id, destination_id, label, dateHour - FROM `twttr-bq-cassowary-prod.user.interaction_graph_labels_daily` - LEFT JOIN UNNEST(labels) AS label - WHERE DATE(dateHour) BETWEEN date_start AND date_end -), T2 AS ( - SELECT source_id, destination_id, num_tweets - FROM `twttr-recos-ml-prod.realgraph.tweeting_follows` -), T3 AS ( -SELECT -COALESCE(T1.source_id, T2.source_id) AS source_id, -COALESCE(T1.destination_id, T2.destination_id) AS destination_id, -COUNT(DISTINCT(T1.dateHour)) AS num_days, -MIN(COALESCE(num_tweets,0)) AS num_tweets, -- all rows' num_tweets should be the same -COALESCE(DATE_DIFF(date_end, DATE(MAX(T1.dateHour)), DAY),30) AS days_since_last_interaction, -COUNT(DISTINCT(label)) AS label_types, -COUNTIF(label="num_follows") AS num_follows, -COUNTIF(label="num_favorites") AS num_favorites, -COUNTIF(label="num_tweet_clicks") AS num_tweet_clicks, -COUNTIF(label="num_profile_views") AS num_profile_views, -FROM T1 -FULL JOIN T2 -USING (source_id, destination_id) -GROUP BY 1,2 -ORDER BY 3 DESC,4 DESC -), T4 AS ( - SELECT RANK() OVER (PARTITION BY source_id ORDER BY num_days DESC, num_tweets DESC) AS rn, * - FROM T3 -) SELECT *, date_end AS ds FROM T4 WHERE rn <= 2000 - diff --git a/src/scala/com/twitter/interaction_graph/bqe/scoring/check_models.sql b/src/scala/com/twitter/interaction_graph/bqe/scoring/check_models.sql deleted file mode 100644 index 
6baecc2ed..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/scoring/check_models.sql +++ /dev/null @@ -1,5 +0,0 @@ -(SELECT * FROM ML.FEATURE_IMPORTANCE(MODEL `twttr-recos-ml-prod.realgraph.prod`) -ORDER BY importance_gain DESC) -UNION ALL -(SELECT * FROM ML.FEATURE_IMPORTANCE(MODEL `twttr-recos-ml-prod.realgraph.prod_explicit`) -ORDER BY importance_gain DESC) diff --git a/src/scala/com/twitter/interaction_graph/bqe/scoring/follow_graph_features.sql b/src/scala/com/twitter/interaction_graph/bqe/scoring/follow_graph_features.sql deleted file mode 100644 index ace7e2f36..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/scoring/follow_graph_features.sql +++ /dev/null @@ -1,28 +0,0 @@ -DECLARE date_latest_tweet, date_latest_follows DATE; -SET date_latest_tweet = ( - SELECT PARSE_DATE('%Y%m%d', SUBSTRING(MAX(partition_id), 1, 8)) AS partition_id - FROM `twttr-bq-tweetsource-pub-prod.user.INFORMATION_SCHEMA.PARTITIONS` - WHERE partition_id IS NOT NULL AND partition_id != '__NULL__' AND table_name="public_tweets"); -SET date_latest_follows = ( - SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS partition_id - FROM `twttr-recos-ml-prod.user_events.INFORMATION_SCHEMA.PARTITIONS` - WHERE partition_id IS NOT NULL AND partition_id != '__NULL__' AND table_name="valid_user_follows"); - --- tweet count candidate features -CREATE OR REPLACE TABLE `twttr-recos-ml-prod.realgraph.tweeting_follows` -PARTITION BY ds -AS -WITH tweet_count AS ( - SELECT userId, COUNT(userId) AS num_tweets - FROM `twttr-bq-tweetsource-pub-prod.user.public_tweets` - WHERE DATE(ts) BETWEEN DATE_SUB(date_latest_tweet, INTERVAL 3 DAY) AND date_latest_tweet - GROUP BY 1 -), all_follows AS ( - SELECT F.sourceId AS source_id, F.destinationId AS destination_id, COALESCE(T.num_tweets,0) AS num_tweets, - ROW_NUMBER() OVER (PARTITION BY F.sourceId ORDER BY T.num_tweets DESC) AS rn - FROM `twttr-recos-ml-prod.user_events.valid_user_follows` F - LEFT JOIN tweet_count T - ON F.destinationId=T.userId - WHERE DATE(F._PARTITIONTIME) = date_latest_follows -) SELECT *, date_latest_tweet AS ds FROM all_follows WHERE rn <= 2000 -; diff --git a/src/scala/com/twitter/interaction_graph/bqe/scoring/scoring.sql b/src/scala/com/twitter/interaction_graph/bqe/scoring/scoring.sql deleted file mode 100644 index 5694c0988..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/scoring/scoring.sql +++ /dev/null @@ -1,52 +0,0 @@ -DECLARE date_end, date_latest_follows DATE; -SET date_end = ( - SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS partition_id - FROM `twttr-bq-cassowary-prod.user.INFORMATION_SCHEMA.PARTITIONS` - WHERE partition_id IS NOT NULL AND partition_id != '__NULL__' AND table_name="interaction_graph_labels_daily" -); -SET date_latest_follows = ( - SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS partition_id - FROM `twttr-recos-ml-prod.user_events.INFORMATION_SCHEMA.PARTITIONS` - WHERE partition_id IS NOT NULL AND partition_id != '__NULL__' AND table_name="valid_user_follows"); - -DELETE -FROM `twttr-recos-ml-prod.realgraph.scores` -WHERE ds = date_end; - --- score candidates (59m) -INSERT INTO `twttr-recos-ml-prod.realgraph.scores` -WITH predicted_scores AS ( - SELECT - source_id, - destination_id, - p1.prob AS prob, - p2.prob AS prob_explicit - FROM ML.PREDICT(MODEL `twttr-recos-ml-prod.realgraph.prod`, - ( - SELECT - * - FROM - `twttr-recos-ml-prod.realgraph.candidates` ) ) S1 - CROSS JOIN UNNEST(S1.predicted_label_probs) AS p1 - JOIN ML.PREDICT(MODEL `twttr-recos-ml-prod.realgraph.prod_explicit`, - ( - SELECT - * 
- FROM - `twttr-recos-ml-prod.realgraph.candidates` ) ) S2 - USING (source_id, destination_id) - CROSS JOIN UNNEST(S2.predicted_label_probs) AS p2 - WHERE p1.label=1 AND p2.label=1 -) -SELECT - COALESCE(predicted_scores.source_id, tweeting_follows.source_id) AS source_id, - COALESCE(predicted_scores.destination_id, tweeting_follows.destination_id) AS destination_id, - COALESCE(prob, 0.0) AS prob, - COALESCE(prob_explicit, 0.0) AS prob_explicit, - (tweeting_follows.source_id IS NOT NULL) AND (tweeting_follows.destination_id IS NOT NULL) AS followed, - date_end AS ds -FROM - predicted_scores - FULL JOIN - `twttr-recos-ml-prod.realgraph.tweeting_follows` tweeting_follows - USING (source_id, destination_id) diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/README.md b/src/scala/com/twitter/interaction_graph/bqe/training/README.md deleted file mode 100644 index 17e94e7f5..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/training/README.md +++ /dev/null @@ -1,60 +0,0 @@ -# Training - -This folder contains the SQL files that we use for training the prod real graph models: -- prod (predicts any interactions the next day) -- prod_explicit (predicts any explicit interactions the next day) - -Three steps take place: -- Candidate generation + feature hydration. This query samples 1% of edges from the `twttr-recos-ml-prod.realgraph.candidates` table, which is already produced daily, and saves them to `twttr-recos-ml-prod.realgraph.candidates_sampled`. We save each day's data according to the statebird batch run date and hence require checks to make sure that the data exists to begin with. -- Label candidates. We join day T's candidates with day T+1's labels while filtering out any negative interactions to get our labeled dataset. We append an additional day's worth of segments for each day. We finally generate the training dataset, which uses all days' labeled data for training, performing negative downsampling to get a roughly 50-50 split of positive to negative labels. -- Training. We use BQML for training our XGBoost models. - -## Instructions - -To deploy the job, create a zip file, upload it to packer, and then schedule it with aurora. - -``` -zip -jr real_graph_training src/scala/com/twitter/interaction_graph/bqe/training && \ -packer add_version --cluster=atla cassowary real_graph_training real_graph_training.zip -aurora cron schedule atla/cassowary/prod/real_graph_training src/scala/com/twitter/interaction_graph/bqe/training/training.aurora && \ -aurora cron start atla/cassowary/prod/real_graph_training -``` - -# candidates.sql - -1. Sets the variable date_candidates to the batch run date, i.e. DATE(TIMESTAMP_MILLIS($start_time$)). -2. Creates a new table candidates_sampled if it does not exist already, seeded with a sample of 100 rows from the candidates_for_training table. -3. Deletes any existing rows from the candidates_sampled table where the ds column matches the date_candidates value, to avoid double-writing. -4. Inserts a sample of rows into the candidates_sampled table from the candidates_for_training table, where the modulo of the absolute value of the FARM_FINGERPRINT of the concatenation of source_id and destination_id is equal to the value of the $mod_remainder$ variable, and where the ds column matches the date_candidates value. - -# check_candidates_exist.sql - -This BigQuery script prepares a table of candidates for training a machine learning model. It does the following: - -1.
Declares two variables date_start and date_end that are 30 days apart; date_end is set to the value of the $start_time$ parameter (a Unix timestamp in milliseconds). -2. Creates a table candidates_for_training that is partitioned by ds (date) and populated with data from several other tables in the database. It joins information from tables of user interactions, tweeting, and interaction graph aggregates, filters out negative edge snapshots, calculates some statistics and aggregates them by source_id and destination_id. Then, it ranks each source_id by the number of days and tweets, selects the top 2000 rows per source_id, and adds date_end as a new column ds. -3. Finally, it selects the ds column from candidates_for_training where ds equals date_end. - -Overall, this script prepares a table of up to 2000 candidate pairs per source user, with statistics and labels, which can be used to train a machine learning model for recommendation purposes. - -# labeled_candidates.sql - -This BQ script does the following: - -1. Defines two variables date_candidates and date_labels as dates based on the $start_time$ parameter. -2. Creates a new table twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$ with default values. -3. Deletes any prior data in the twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$ table for the current date_candidates. -4. Joins the twttr-recos-ml-prod.realgraph.candidates_sampled table with the twttr-bq-cassowary-prod.user.interaction_graph_labels_daily table and the twttr-bq-cassowary-prod.user.interaction_graph_agg_negative_edge_snapshot table. It assigns a label of 1 to pairs with a positive interaction and 0 otherwise, and keeps only the rows that do not appear in the negative edge snapshot. -5. Inserts the joined data into the twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$ table. -6. Calculates the positive rate by counting the number of positive labels and dividing it by the total number of labels. -7. Creates a new table twttr-recos-ml-prod.realgraph.train$table_suffix$ by sampling from the twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$ table, with a downsampling of negative examples to balance the number of positive and negative examples, based on the positive rate calculated in step 6. - -The resulting twttr-recos-ml-prod.realgraph.train$table_suffix$ table is used as a training dataset for a machine learning model. - -# train_model.sql - -This BQ command creates or replaces a machine learning model called twttr-recos-ml-prod.realgraph.prod$table_suffix$. The model is a boosted tree classifier, which is used for binary classification problems. - -The options provided in the command configure the specific settings for the model, such as the number of parallel trees, the maximum number of iterations, and the data split method. The DATA_SPLIT_METHOD parameter is set to CUSTOM, and DATA_SPLIT_COL is set to if_eval, which means the data will be split into training and evaluation sets based on the if_eval column. The IF function assigns a boolean value to if_eval based on a modulo operation over a FARM_FINGERPRINT hash of source_id, so roughly 10% of source users land in the evaluation set. - -The SELECT statement specifies the input data for the model. The columns selected include label (the target variable to be predicted), as well as various features such as num_days, num_tweets, and num_follows that are used to predict the target variable.
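As a quick sanity check on that split (a sketch run by hand against the train table, not part of the scheduled pipeline), the eval fraction implied by the hash rule can be measured directly:

```
-- Sketch only: estimate the share of rows the custom split sends to the
-- evaluation set. The rule mirrors train_model.sql:
-- MOD(ABS(FARM_FINGERPRINT(CAST(source_id AS STRING))), 10) = 0,
-- which holds for roughly 10% of source users.
SELECT
  COUNTIF(MOD(ABS(FARM_FINGERPRINT(CAST(source_id AS STRING))), 10) = 0) / COUNT(*)
    AS eval_fraction
FROM `twttr-recos-ml-prod.realgraph.train$table_suffix$`;
```

Because the split keys on source_id, all of a given source user's edges fall entirely on one side of the split, which keeps a user's behavior from leaking between training and evaluation.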
\ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/candidates.sql b/src/scala/com/twitter/interaction_graph/bqe/training/candidates.sql deleted file mode 100644 index 8c47b8184..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/training/candidates.sql +++ /dev/null @@ -1,18 +0,0 @@ --- get latest partition of candidates with data -DECLARE date_candidates DATE; -SET date_candidates = (SELECT DATE(TIMESTAMP_MILLIS($start_time$))); - -CREATE TABLE IF NOT EXISTS `twttr-recos-ml-prod.realgraph.candidates_sampled` AS -SELECT * FROM `twttr-recos-ml-prod.realgraph.candidates_for_training` LIMIT 100; - --- remove previous output snapshot (if exists) to avoid double-writing -DELETE -FROM `twttr-recos-ml-prod.realgraph.candidates_sampled` -WHERE ds = date_candidates; - --- sample from candidates table instead of recomputing features -INSERT INTO `twttr-recos-ml-prod.realgraph.candidates_sampled` -SELECT * FROM `twttr-recos-ml-prod.realgraph.candidates_for_training` -WHERE MOD(ABS(FARM_FINGERPRINT(CONCAT(source_id, '_', destination_id))), 100) = $mod_remainder$ -AND ds = date_candidates; - diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/check_candidates_exist.sql b/src/scala/com/twitter/interaction_graph/bqe/training/check_candidates_exist.sql deleted file mode 100644 index 5cb380b4f..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/training/check_candidates_exist.sql +++ /dev/null @@ -1,43 +0,0 @@ -DECLARE date_start, date_end DATE; -SET date_end = (SELECT DATE(TIMESTAMP_MILLIS($start_time$))); -SET date_start = DATE_SUB(date_end, INTERVAL 30 DAY); - -CREATE OR REPLACE TABLE `twttr-recos-ml-prod.realgraph.candidates_for_training` -PARTITION BY ds -AS -WITH T1 AS ( - SELECT source_id, destination_id, label, dateHour - FROM `twttr-bq-cassowary-prod.user.interaction_graph_labels_daily` - LEFT JOIN UNNEST(labels) AS label - WHERE DATE(dateHour) BETWEEN date_start AND date_end -), T2 AS ( - SELECT source_id, destination_id, num_tweets - FROM `twttr-recos-ml-prod.realgraph.tweeting_follows` -), T3 AS ( -SELECT -COALESCE(T1.source_id, T2.source_id) AS source_id, -COALESCE(T1.destination_id, T2.destination_id) AS destination_id, -COUNT(DISTINCT(T1.dateHour)) AS num_days, -MIN(COALESCE(num_tweets,0)) AS num_tweets, -- all rows' num_tweets should be the same -COALESCE(DATE_DIFF(date_end, DATE(MAX(T1.dateHour)), DAY),30) AS days_since_last_interaction, -COUNT(DISTINCT(label)) AS label_types, -COUNTIF(label="num_follows") AS num_follows, -COUNTIF(label="num_favorites") AS num_favorites, -COUNTIF(label="num_tweet_clicks") AS num_tweet_clicks, -COUNTIF(label="num_profile_views") AS num_profile_views, -FROM T1 -FULL JOIN T2 -USING (source_id, destination_id) -LEFT JOIN `twttr-bq-cassowary-prod.user.interaction_graph_agg_negative_edge_snapshot` N -USING (source_id, destination_id) -WHERE N.source_id IS NULL AND N.destination_id IS NULL -GROUP BY 1,2 -ORDER BY 3 DESC,4 DESC -), T4 AS ( - SELECT RANK() OVER (PARTITION BY source_id ORDER BY num_days DESC, num_tweets DESC) AS rn, * - FROM T3 -) SELECT *, date_end AS ds FROM T4 WHERE rn <= 2000; - -SELECT ds FROM `twttr-recos-ml-prod.realgraph.candidates_for_training` -WHERE ds = (SELECT DATE(TIMESTAMP_MILLIS($start_time$))) -LIMIT 1 diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/check_labels_exist.sql b/src/scala/com/twitter/interaction_graph/bqe/training/check_labels_exist.sql deleted file mode 100644 index 20a372b4a..000000000 --- 
a/src/scala/com/twitter/interaction_graph/bqe/training/check_labels_exist.sql +++ /dev/null @@ -1,4 +0,0 @@ -SELECT dateHour FROM `twttr-bq-cassowary-prod.user.interaction_graph_labels_daily` -WHERE dateHour = (SELECT TIMESTAMP_ADD(TIMESTAMP_MILLIS($start_time$), INTERVAL 1 DAY)) -LIMIT 1 - diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/labeled_candidates.sql b/src/scala/com/twitter/interaction_graph/bqe/training/labeled_candidates.sql deleted file mode 100644 index 4230ee5c5..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/training/labeled_candidates.sql +++ /dev/null @@ -1,67 +0,0 @@ --- date_labels is 1 day after date_candidates (which is the current batch run's start date) -DECLARE date_candidates, date_labels DATE; -DECLARE positive_rate FLOAT64; -SET date_candidates = (SELECT DATE(TIMESTAMP_MILLIS($start_time$))); -SET date_labels = DATE_ADD(date_candidates, INTERVAL 1 DAY); - -CREATE TABLE IF NOT EXISTS `twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$` AS -SELECT - 0 AS source_id, - 1 AS destination_id, - 1 AS label, - 1 AS num_days, - 1 AS num_tweets, - 1 AS num_follows, - 1 AS num_favorites, - 1 AS num_tweet_clicks, - 1 AS num_profile_views, - 1 AS days_since_last_interaction, - 1 AS label_types, - DATE("2023-01-08") AS ds; - --- delete any prior data to avoid double writing -DELETE -FROM `twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$` -WHERE ds = date_candidates; - --- join labels with candidates with 1 day attribution delay and insert new segment -INSERT INTO `twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$` -WITH label_positive AS ( - SELECT source_id, destination_id - FROM `twttr-bq-cassowary-prod.user.interaction_graph_labels_daily` - WHERE DATE(dateHour)=date_labels -), label_negative AS ( - SELECT source_id, destination_id - FROM `twttr-bq-cassowary-prod.user.interaction_graph_agg_negative_edge_snapshot` -) SELECT - F.source_id, - F.destination_id, - CASE WHEN P.source_id IS NULL THEN 0 ELSE 1 END AS label, - num_days, - num_tweets, - num_follows, - num_favorites, - num_tweet_clicks, - num_profile_views, - days_since_last_interaction, - label_types, - date_candidates AS ds -FROM `twttr-recos-ml-prod.realgraph.candidates_sampled` F -LEFT JOIN label_positive P USING(source_id, destination_id) -LEFT JOIN label_negative N USING(source_id, destination_id) -WHERE N.source_id IS NULL AND N.destination_id IS NULL -AND F.ds=date_candidates -; - --- get positive rate -SET positive_rate = -(SELECT SUM(label)/COUNT(label) AS pct_positive -FROM `twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$` -); - --- create training dataset with negative downsampling (should get ~50-50 split) --- this spans over the cumulative date range of the labeled candidates table. 
-CREATE OR REPLACE TABLE `twttr-recos-ml-prod.realgraph.train$table_suffix$` AS -SELECT * FROM `twttr-recos-ml-prod.realgraph.labeled_candidates$table_suffix$` -WHERE CASE WHEN label = 0 AND RAND() < positive_rate THEN true WHEN label = 1 AND RAND() < (1-positive_rate) THEN true ELSE false END -; diff --git a/src/scala/com/twitter/interaction_graph/bqe/training/train_model.sql b/src/scala/com/twitter/interaction_graph/bqe/training/train_model.sql deleted file mode 100644 index c7a5df501..000000000 --- a/src/scala/com/twitter/interaction_graph/bqe/training/train_model.sql +++ /dev/null @@ -1,27 +0,0 @@ -CREATE OR REPLACE MODEL `twttr-recos-ml-prod.realgraph.prod$table_suffix$` -OPTIONS(MODEL_TYPE='BOOSTED_TREE_CLASSIFIER', - BOOSTER_TYPE = 'GBTREE', - NUM_PARALLEL_TREE = 1, - MAX_ITERATIONS = 20, - TREE_METHOD = 'HIST', - EARLY_STOP = TRUE, - SUBSAMPLE = 0.01, - INPUT_LABEL_COLS = ['label'], - DATA_SPLIT_METHOD = 'CUSTOM', - DATA_SPLIT_COL = 'if_eval') -AS SELECT - label, - source_id, - destination_id, - num_days, - num_tweets, - num_follows, - num_favorites, - num_tweet_clicks, - num_profile_views, - days_since_last_interaction, - label_types, - -- partition train/test by source_id's - IF(MOD(ABS(FARM_FINGERPRINT(CAST(source_id AS STRING))), 10) = 0, true, false) AS if_eval, -FROM `twttr-recos-ml-prod.realgraph.train$table_suffix$` -; diff --git a/src/scala/com/twitter/interaction_graph/injection/BUILD b/src/scala/com/twitter/interaction_graph/injection/BUILD deleted file mode 100644 index 3e9d55ccf..000000000 --- a/src/scala/com/twitter/interaction_graph/injection/BUILD +++ /dev/null @@ -1,25 +0,0 @@ -scala_library( - name = "user_session_inj", - sources = ["UserSessionInjection.scala"], - platform = "java8", - strict_deps = True, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/bijection:scrooge", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/thrift/com/twitter/user_session_store:thrift-scala", - ], -) - -scala_library( - name = "edge_list_injection", - sources = ["EdgeListInjection.scala"], - platform = "java8", - strict_deps = True, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/bijection:scrooge", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/injection/EdgeListInjection.scala b/src/scala/com/twitter/interaction_graph/injection/EdgeListInjection.scala deleted file mode 100644 index c03ad097c..000000000 --- a/src/scala/com/twitter/interaction_graph/injection/EdgeListInjection.scala +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.interaction_graph.injection - -import com.twitter.interaction_graph.thriftscala.EdgeList -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift - -object EdgeListInjection { - final val injection: KeyValInjection[Long, EdgeList] = - KeyValInjection( - Long2BigEndian, - ScalaCompactThrift(EdgeList) - ) -} diff --git a/src/scala/com/twitter/interaction_graph/injection/UserSessionInjection.scala b/src/scala/com/twitter/interaction_graph/injection/UserSessionInjection.scala deleted file mode 100644 index f6c84e184..000000000 --- a/src/scala/com/twitter/interaction_graph/injection/UserSessionInjection.scala +++ 
/dev/null @@ -1,14 +0,0 @@ -package com.twitter.interaction_graph.injection - -import com.twitter.user_session_store.thriftscala.UserSession -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian - -object UserSessionInjection { - final val injection: KeyValInjection[Long, UserSession] = - KeyValInjection( - Long2BigEndian, - ScalaCompactThrift(UserSession) - ) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/README.md b/src/scala/com/twitter/interaction_graph/scio/README.md deleted file mode 100644 index c7ef6d713..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/README.md +++ /dev/null @@ -1,7 +0,0 @@ -# Interaction Graph - -This folder contains the code used in the offline pipeline for real graph v2. - -The ETL jobs are contained in folders prefaced with `agg_*`, while the jobs powering the ml pipeline are in the ml folder. - -Note that the jobs in the ml folder are mostly ETL jobs; the main training and scoring happens within BQML. diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/BUILD deleted file mode 100644 index 3f7e0491e..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/BUILD +++ /dev/null @@ -1,62 +0,0 @@ -scala_library( - name = "agg_address_book", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_agg_address_book_edge_snapshot-scala", - ":interaction_graph_agg_address_book_vertex_snapshot-scala", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "addressbook/jobs/src/main/scala/com/twitter/addressbook/jobs/simplematches:simple_user_matches-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "src/scala/com/twitter/interaction_graph/scio/common", - ], -) - -jvm_binary( - name = "interaction_graph_address_book_scio", - main = "com.twitter.interaction_graph.scio.agg_address_book.InteractionGraphAddressBookJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":agg_address_book", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_address_book_edge_snapshot", - description = "User-user directed edges with addressbook features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_address_book_vertex_snapshot", - description = "User vertex with addressbook features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "snapshot", - tags = ["bazel-compatible"], - 
java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookCounters.scala b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookCounters.scala deleted file mode 100644 index 0d57c4cae..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookCounters.scala +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_address_book - -import com.spotify.scio.ScioMetrics -import org.apache.beam.sdk.metrics.Counter - -trait InteractionGraphAddressBookCountersTrait { - val Namespace = "Interaction Graph Address Book" - - def emailFeatureInc(): Unit - - def phoneFeatureInc(): Unit - - def bothFeatureInc(): Unit -} - -/** - * SCIO counters are used to gather run time statistics - */ -case object InteractionGraphAddressBookCounters extends InteractionGraphAddressBookCountersTrait { - val emailFeatureCounter: Counter = - ScioMetrics.counter(Namespace, "Email Feature") - - val phoneFeatureCounter: Counter = - ScioMetrics.counter(Namespace, "Phone Feature") - - val bothFeatureCounter: Counter = - ScioMetrics.counter(Namespace, "Both Feature") - - override def emailFeatureInc(): Unit = emailFeatureCounter.inc() - - override def phoneFeatureInc(): Unit = phoneFeatureCounter.inc() - - override def bothFeatureInc(): Unit = bothFeatureCounter.inc() -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookJob.scala deleted file mode 100644 index 360b52cee..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookJob.scala +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_address_book - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.addressbook.matches.thriftscala.UserMatchesRecord -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.DiskFormat -import com.twitter.beam.io.dal.DAL.PathLayout -import com.twitter.beam.io.dal.DAL.WriteOptions -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.Vertex -import java.time.Instant -import org.joda.time.Interval - -object InteractionGraphAddressBookJob extends ScioBeamJob[InteractionGraphAddressBookOption] { - override protected def configurePipeline( - scioContext: ScioContext, - pipelineOptions: InteractionGraphAddressBookOption - ): Unit = { - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val dateInterval: Interval = pipelineOptions.interval - implicit lazy val addressBookCounters: InteractionGraphAddressBookCountersTrait = - InteractionGraphAddressBookCounters - - val interactionGraphAddressBookSource = InteractionGraphAddressBookSource(pipelineOptions) - - val addressBook: SCollection[UserMatchesRecord] = - interactionGraphAddressBookSource.readSimpleUserMatches( - dateInterval.withStart(dateInterval.getStart.minusDays(3)) - ) - val (vertex, edges) = 
InteractionGraphAddressBookUtil.process(addressBook) - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - - vertex.saveAsCustomOutput( - "Write Vertex Records", - DAL.writeSnapshot[Vertex]( - InteractionGraphAggAddressBookVertexSnapshotScalaDataset, - PathLayout.DailyPath(pipelineOptions.getOutputPath + "/address_book_vertex_daily"), - Instant.ofEpochMilli(dateInterval.getEndMillis), - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = - WriteOptions(numOfShards = Some((pipelineOptions.getNumberOfShards / 16.0).ceil.toInt)) - ) - ) - - edges.saveAsCustomOutput( - "Write Edge Records", - DAL.writeSnapshot[Edge]( - InteractionGraphAggAddressBookEdgeSnapshotScalaDataset, - PathLayout.DailyPath(pipelineOptions.getOutputPath + "/address_book_edge_daily"), - Instant.ofEpochMilli(dateInterval.getEndMillis), - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookOption.scala b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookOption.scala deleted file mode 100644 index b5c34e94c..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_address_book - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphAddressBookOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. 
Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(16) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookSource.scala b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookSource.scala deleted file mode 100644 index 66e3903bc..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookSource.scala +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_address_book - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.addressbook.jobs.simplematches.SimpleUserMatchesScalaDataset -import com.twitter.addressbook.matches.thriftscala.UserMatchesRecord -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.cde.scio.dal_read.SourceUtil -import org.joda.time.Interval - -case class InteractionGraphAddressBookSource( - pipelineOptions: InteractionGraphAddressBookOption -)( - implicit sc: ScioContext, -) { - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - - def readSimpleUserMatches( - dateInterval: Interval - ): SCollection[UserMatchesRecord] = { - SourceUtil.readMostRecentSnapshotDALDataset[UserMatchesRecord]( - SimpleUserMatchesScalaDataset, - dateInterval, - dalEnvironment) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookUtil.scala b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookUtil.scala deleted file mode 100644 index fc5898ce0..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/InteractionGraphAddressBookUtil.scala +++ /dev/null @@ -1,93 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_address_book - -import com.spotify.scio.values.SCollection -import com.twitter.addressbook.matches.thriftscala.UserMatchesRecord -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.scio.common.InteractionGraphRawInput -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.Vertex - -object InteractionGraphAddressBookUtil { - val EMAIL = "email" - val PHONE = "phone" - val BOTH = "both" - - val DefaultAge = 1 - val DegaultFeatureValue = 1.0 - - def process( - addressBook: SCollection[UserMatchesRecord] - )( - implicit addressBookCounters: InteractionGraphAddressBookCountersTrait - ): (SCollection[Vertex], SCollection[Edge]) = { - // First construct a data with (src, dst, name), where name can be "email", "phone", or "both" - val addressBookTypes: SCollection[((Long, Long), String)] = addressBook.flatMap { record => - record.forwardMatches.toSeq.flatMap { matchDetails => - val matchedUsers = (record.userId, matchDetails.userId) - (matchDetails.matchedByEmail, matchDetails.matchedByPhone) match { - case (true, true) => - Seq((matchedUsers, EMAIL), (matchedUsers, PHONE), (matchedUsers, BOTH)) - case (true, false) => Seq((matchedUsers, EMAIL)) - case (false, true) => Seq((matchedUsers, PHONE)) - case _ => Seq.empty - } - } - } - - // Then construct the 
input data for feature calculation - val addressBookFeatureInput: SCollection[InteractionGraphRawInput] = addressBookTypes - .map { - case ((src, dst), name) => - if (src < dst) - ((src, dst, name), false) - else - ((dst, src, name), true) - }.groupByKey - .flatMap { - case ((src, dst, name), iterator) => - val isReversedValues = iterator.toSeq - // check if (src, dst) is mutual follow - val isMutualFollow = isReversedValues.size == 2 - // get correct srcId and dstId if there is no mutual follow and they are reversed - val (srcId, dstId) = { - if (!isMutualFollow && isReversedValues.head) - (dst, src) - else - (src, dst) - } - // get the feature name and mutual follow name - val (featureName, mfFeatureName) = name match { - case EMAIL => - addressBookCounters.emailFeatureInc() - (FeatureName.AddressBookEmail, FeatureName.AddressBookMutualEdgeEmail) - case PHONE => - addressBookCounters.phoneFeatureInc() - (FeatureName.AddressBookPhone, FeatureName.AddressBookMutualEdgePhone) - case BOTH => - addressBookCounters.bothFeatureInc() - (FeatureName.AddressBookInBoth, FeatureName.AddressBookMutualEdgeInBoth) - } - // construct the TypedPipe for feature calculation - if (isMutualFollow) { - Iterator( - InteractionGraphRawInput(srcId, dstId, featureName, DefaultAge, DegaultFeatureValue), - InteractionGraphRawInput(dstId, srcId, featureName, DefaultAge, DegaultFeatureValue), - InteractionGraphRawInput( - srcId, - dstId, - mfFeatureName, - DefaultAge, - DegaultFeatureValue), - InteractionGraphRawInput(dstId, srcId, mfFeatureName, DefaultAge, DegaultFeatureValue) - ) - } else { - Iterator( - InteractionGraphRawInput(srcId, dstId, featureName, DefaultAge, DegaultFeatureValue)) - } - } - - // Calculate the Features - FeatureGeneratorUtil.getFeatures(addressBookFeatureInput) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_address_book/README.md deleted file mode 100644 index 4d895c71d..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_address_book/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphAddressBook Dataflow Job - -#### IntelliJ -``` -./bazel idea src/scala/com/twitter/interaction_graph/scio/agg_address_book:interaction_graph_address_book_scio -``` - -#### Compile -``` -./bazel build src/scala/com/twitter/interaction_graph/scio/agg_address_book:interaction_graph_address_book_scio -``` - -#### Build Jar -``` -./bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_address_book:interaction_graph_address_book_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-address-book-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_address_book/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-04-13 \ - --bind=profile.output_path=processed/interaction_graph_agg_address_book_dataflow -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_all/BUILD deleted file mode 100644 index 61dc35906..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/BUILD +++ /dev/null @@ -1,175 +0,0 @@ -scala_library( - name = "agg_all", - sources = 
["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_history_aggregated_raw_edge_daily-scala", - ":interaction_graph_history_aggregated_vertex_daily-scala", - ":interaction_graph_aggregated_edge_daily-scala", - ":interaction_graph_aggregated_vertex_daily-scala", - ":interaction_graph_history_aggregated_edge_snapshot-scala", - ":interaction_graph_history_aggregated_vertex_snapshot-scala", - ":real_graph_features-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "src/scala/com/twitter/interaction_graph/scio/agg_address_book:interaction_graph_agg_address_book_edge_snapshot-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_address_book:interaction_graph_agg_address_book_vertex_snapshot-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_agg_client_event_logs_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_agg_client_event_logs_vertex_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_vertex_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_flock:interaction_graph_agg_flock_edge_snapshot-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_flock:interaction_graph_agg_flock_vertex_snapshot-scala", - "src/scala/com/twitter/interaction_graph/scio/common", - "src/scala/com/twitter/interaction_graph/scio/ml/scores:real_graph_in_scores-scala", - "src/scala/com/twitter/interaction_graph/scio/ml/scores:real_graph_oon_scores-scala", - "src/scala/com/twitter/wtf/dataflow/user_events:valid_user_follows-scala", - "src/thrift/com/twitter/wtf/candidate:wtf-candidate-scala", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_history_aggregated_raw_edge_daily", - description = "User-user directed edges with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_history_aggregated_vertex_daily", - description = "User vertex with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -jvm_binary( - name = 
"interaction_graph_aggregation_job_scio", - main = "com.twitter.interaction_graph.scio.agg_all.InteractionGraphAggregationJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":agg_all", - ], -) - -create_datasets( - base_name = "interaction_graph_history_aggregated_edge_snapshot", - description = "User-user directed edges with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_history_aggregated_vertex_snapshot", - description = "User vertex with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_aggregated_edge_daily", - description = "User-user directed edges with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_aggregated_vertex_daily", - description = "User vertex with all features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "real_graph_features", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.injection.UserSessionInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.user_session_store.thriftscala.UserSession", - scala_dependencies = [ - "src/scala/com/twitter/interaction_graph/injection:user_session_inj", - ], -) - -create_datasets( - base_name = "home_light_ranker_top_k_real_graph_features", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.injection.EdgeListInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.interaction_graph.thriftscala.EdgeList", - scala_dependencies = [ - "src/scala/com/twitter/interaction_graph/injection:edge_list_injection", - ], -) diff --git 
a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationConfig.scala b/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationConfig.scala deleted file mode 100644 index 2f9b0da57..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationConfig.scala +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_all - -object InteractionGraphScoringConfig { - - /** - * This is alpha for a variant of the Exponentially weighted moving average, computed as: - * ewma_{t+1} = x_{t+1} + (1-alpha) * ewma_t (ewma_1 = x_1, t > 0) - * We choose alpha such that the half life of weights is 7 days. - * Note that we don't down-weight x_{t+1} (unlike in EWMA) as we only want to decay actions - * as they grow old, not compute the average value. - */ - val ALPHA = 1.0 - val ONE_MINUS_ALPHA = 0.955 -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationJob.scala deleted file mode 100644 index 06942205d..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationJob.scala +++ /dev/null @@ -1,314 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_all - -import com.google.cloud.bigquery.BigQueryOptions -import com.google.cloud.bigquery.QueryJobConfiguration -import com.spotify.scio.ScioContext -import com.spotify.scio.ScioMetrics -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.DiskFormat -import com.twitter.beam.io.dal.DAL.PathLayout -import com.twitter.beam.io.dal.DAL.WriteOptions -import com.twitter.beam.io.exception.DataNotFoundException -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.interaction_graph.scio.agg_all.InteractionGraphAggregationTransform._ -import com.twitter.interaction_graph.scio.common.DateUtil -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.scio.common.UserUtil -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.user_session_store.thriftscala.UserSession -import com.twitter.util.Duration -import com.twitter.wtf.candidate.thriftscala.ScoredEdge -import java.time.Instant -import org.apache.avro.generic.GenericRecord -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead -import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord -import org.apache.beam.sdk.transforms.SerializableFunction -import org.joda.time.Interval -import scala.collection.JavaConverters._ - -object InteractionGraphAggregationJob extends ScioBeamJob[InteractionGraphAggregationOption] { - - // to parse latest date from the BQ table we're reading from - val parseDateRow = new SerializableFunction[SchemaAndRecord, String] { - override def apply(input: SchemaAndRecord): String = { - val genericRecord: GenericRecord = input.getRecord() - genericRecord.get("ds").toString - } - } - - // note that we're using the prob_explicit for real_graph_features (for Home) - val parseRow = new SerializableFunction[SchemaAndRecord, ScoredEdge] { - override def apply(record: SchemaAndRecord): 
ScoredEdge = { - val genericRecord: GenericRecord = record.getRecord() - ScoredEdge( - genericRecord.get("source_id").asInstanceOf[Long], - genericRecord.get("destination_id").asInstanceOf[Long], - genericRecord.get("prob_explicit").asInstanceOf[Double], - genericRecord.get("followed").asInstanceOf[Boolean], - ) - } - } - - override def runPipeline( - sc: ScioContext, - opts: InteractionGraphAggregationOption - ): Unit = { - - val dateStr: String = opts.getDate().value.getStart.toString("yyyyMMdd") - logger.info(s"dateStr $dateStr") - val project: String = "twttr-recos-ml-prod" - val datasetName: String = "realgraph" - val bqTableName: String = "scores" - val fullBqTableName: String = s"$project:$datasetName.$bqTableName" - - if (opts.getDALWriteEnvironment.toLowerCase == "prod") { - val bqClient = - BigQueryOptions.newBuilder.setProjectId(project).build.getService - val query = - s""" - |SELECT total_rows - |FROM `$project.$datasetName.INFORMATION_SCHEMA.PARTITIONS` - |WHERE partition_id ="$dateStr" AND - |table_name="$bqTableName" AND total_rows > 0 - |""".stripMargin - val queryConfig = QueryJobConfiguration.of(query) - val results = bqClient.query(queryConfig).getValues.asScala.toSeq - if (results.isEmpty || results.head.get(0).getLongValue == 0) { - throw new DataNotFoundException(s"$dateStr not present in $fullBqTableName.") - } - } - sc.run() - } - - override protected def configurePipeline( - scioContext: ScioContext, - pipelineOptions: InteractionGraphAggregationOption - ): Unit = { - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val dateInterval: Interval = pipelineOptions.interval - val yesterday = DateUtil.subtract(dateInterval, Duration.fromDays(1)) - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - val dateStr: String = pipelineOptions.getDate().value.getStart.toString("yyyy-MM-dd") - logger.info(s"dateStr $dateStr") - val project: String = "twttr-recos-ml-prod" - val datasetName: String = "realgraph" - val bqTableName: String = "scores" - val fullBqTableName: String = s"$project:$datasetName.$bqTableName" - - val scoreExport: SCollection[ScoredEdge] = - sc.customInput( - s"Read from BQ table $fullBqTableName", - BigQueryIO - .read(parseRow) - .fromQuery(s"""SELECT source_id, destination_id, prob_explicit, followed - |FROM `$project.$datasetName.$bqTableName` - |WHERE ds = '$dateStr'""".stripMargin) - .usingStandardSql() - .withMethod(TypedRead.Method.DEFAULT) - ) - - val source = InteractionGraphAggregationSource(pipelineOptions) - - val (addressEdgeFeatures, addressVertexFeatures) = source.readAddressBookFeatures() - - val (clientEventLogsEdgeFeatures, clientEventLogsVertexFeatures) = - source.readClientEventLogsFeatures(dateInterval) - - val (flockEdgeFeatures, flockVertexFeatures) = source.readFlockFeatures() - - val (directInteractionsEdgeFeatures, directInteractionsVertexFeatures) = - source.readDirectInteractionsFeatures(dateInterval) - - val invalidUsers = UserUtil.getInvalidUsers(source.readFlatUsers()) - - val (prevAggEdge, prevAggVertex) = source.readAggregatedFeatures(yesterday) - - val prevAggregatedVertex: SCollection[Vertex] = - UserUtil - .filterUsersByIdMapping[Vertex]( - prevAggVertex, - invalidUsers, - v => v.userId - ) - - /** Remove status-based features (flock/ab) from current graph, because we only need the 
latest - * This is to allow us to filter and roll-up a smaller dataset, to which we will still add - * back the status-based features for the complete scoredAggregates (that other teams will read). - */ - val prevAggEdgeFiltered = prevAggEdge - .filter { e => - e.sourceId != e.destinationId - } - .withName("filtering status-based edges") - .flatMap(FeatureGeneratorUtil.removeStatusFeatures) - val prevAggEdgeValid: SCollection[Edge] = - UserUtil - .filterUsersByMultipleIdMappings[Edge]( - prevAggEdgeFiltered, - invalidUsers, - Seq(e => e.sourceId, e => e.destinationId) - ) - - val aggregatedActivityVertexDaily = UserUtil - .filterUsersByIdMapping[Vertex]( - FeatureGeneratorUtil - .combineVertexFeatures( - clientEventLogsVertexFeatures ++ - directInteractionsVertexFeatures ++ - addressVertexFeatures ++ - flockVertexFeatures - ), - invalidUsers, - v => v.userId - ) - - // we split up the roll-up of decayed counts between status vs activity/count-based features - val aggregatedActivityEdgeDaily = FeatureGeneratorUtil - .combineEdgeFeatures(clientEventLogsEdgeFeatures ++ directInteractionsEdgeFeatures) - - // Vertex level, Add the decay sum for history and daily - val aggregatedActivityVertex = FeatureGeneratorUtil - .combineVertexFeaturesWithDecay( - prevAggregatedVertex, - aggregatedActivityVertexDaily, - InteractionGraphScoringConfig.ONE_MINUS_ALPHA, - InteractionGraphScoringConfig.ALPHA - ) - - // Edge level, Add the decay sum for history and daily - val aggregatedActivityEdge = FeatureGeneratorUtil - .combineEdgeFeaturesWithDecay( - prevAggEdgeValid, - aggregatedActivityEdgeDaily, - InteractionGraphScoringConfig.ONE_MINUS_ALPHA, - InteractionGraphScoringConfig.ALPHA - ) - .filter(FeatureGeneratorUtil.edgeWithFeatureOtherThanDwellTime) - .withName("removing edges that only have dwell time features") - - val edgeKeyedScores = scoreExport.keyBy { e => (e.sourceId, e.destinationId) } - - val scoredAggregatedActivityEdge = aggregatedActivityEdge - .keyBy { e => (e.sourceId, e.destinationId) } - .withName("join with scores") - .leftOuterJoin(edgeKeyedScores) - .map { - case (_, (e, scoredEdgeOpt)) => - val scoreOpt = scoredEdgeOpt.map(_.score) - e.copy(weight = if (scoreOpt.nonEmpty) { - ScioMetrics.counter("after joining edge with scores", "has score").inc() - scoreOpt - } else { - ScioMetrics.counter("after joining edge with scores", "no score").inc() - None - }) - } - - val combinedFeatures = FeatureGeneratorUtil - .combineEdgeFeatures(aggregatedActivityEdge ++ addressEdgeFeatures ++ flockEdgeFeatures) - .keyBy { e => (e.sourceId, e.destinationId) } - - val aggregatedActivityScoredEdge = - edgeKeyedScores - .withName("join with combined edge features") - .leftOuterJoin(combinedFeatures) - .map { - case (_, (scoredEdge, combinedFeaturesOpt)) => - if (combinedFeaturesOpt.exists(_.features.nonEmpty)) { - ScioMetrics.counter("after joining scored edge with features", "has features").inc() - Edge( - sourceId = scoredEdge.sourceId, - destinationId = scoredEdge.destinationId, - weight = Some(scoredEdge.score), - features = combinedFeaturesOpt.map(_.features).getOrElse(Nil) - ) - } else { - ScioMetrics.counter("after joining scored edge with features", "no features").inc() - Edge( - sourceId = scoredEdge.sourceId, - destinationId = scoredEdge.destinationId, - weight = Some(scoredEdge.score), - features = Nil - ) - } - } - - val realGraphFeatures = - getTopKTimelineFeatures(aggregatedActivityScoredEdge, pipelineOptions.getMaxDestinationIds) - - aggregatedActivityVertex.saveAsCustomOutput( - "Write 
History Aggregated Vertex Records", - DAL.writeSnapshot[Vertex]( - dataset = InteractionGraphHistoryAggregatedVertexSnapshotScalaDataset, - pathLayout = PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_vertex"), - endDate = Instant.ofEpochMilli(dateInterval.getEndMillis), - diskFormat = DiskFormat.Parquet, - environmentOverride = Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards / 10)) - ) - ) - - scoredAggregatedActivityEdge.saveAsCustomOutput( - "Write History Aggregated Edge Records", - DAL.writeSnapshot[Edge]( - dataset = InteractionGraphHistoryAggregatedEdgeSnapshotScalaDataset, - pathLayout = PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_raw_edge"), - endDate = Instant.ofEpochMilli(dateInterval.getEndMillis), - diskFormat = DiskFormat.Parquet, - environmentOverride = Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - - aggregatedActivityVertexDaily.saveAsCustomOutput( - "Write Daily Aggregated Vertex Records", - DAL.write[Vertex]( - dataset = InteractionGraphAggregatedVertexDailyScalaDataset, - pathLayout = - PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_vertex_daily"), - interval = dateInterval, - diskFormat = DiskFormat.Parquet, - environmentOverride = Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards / 10)) - ) - ) - - aggregatedActivityEdgeDaily.saveAsCustomOutput( - "Write Daily Aggregated Edge Records", - DAL.write[Edge]( - dataset = InteractionGraphAggregatedEdgeDailyScalaDataset, - pathLayout = PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_edge_daily"), - interval = dateInterval, - diskFormat = DiskFormat.Parquet, - environmentOverride = Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - - realGraphFeatures.saveAsCustomOutput( - "Write Timeline Real Graph Features", - DAL.writeVersionedKeyVal[KeyVal[Long, UserSession]]( - dataset = RealGraphFeaturesScalaDataset, - pathLayout = - PathLayout.VersionedPath(pipelineOptions.getOutputPath + "/real_graph_features"), - environmentOverride = Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationOption.scala b/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationOption.scala deleted file mode 100644 index 94e7ffae6..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationOption.scala +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_all - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphAggregationOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. 
Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(16) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit - - @Description("BQ Table name for reading scores from") - def getBqTableName: String - def setBqTableName(value: String): Unit - - @Description("max destination ids that we will store for real graph features in TL") - def getMaxDestinationIds: Integer - def setMaxDestinationIds(value: Integer): Unit - - @Description("true if getting scores from BQ instead of DAL-based dataset in GCS") - def getScoresFromBQ: Boolean - def setScoresFromBQ(value: Boolean): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationSource.scala b/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationSource.scala deleted file mode 100644 index b1ea8ff05..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationSource.scala +++ /dev/null @@ -1,182 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_all - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.ReadOptions -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.dal.client.dataset.SnapshotDALDatasetBase -import com.twitter.dal.client.dataset.TimePartitionedDALDataset -import com.twitter.interaction_graph.scio.agg_address_book.InteractionGraphAggAddressBookEdgeSnapshotScalaDataset -import com.twitter.interaction_graph.scio.agg_address_book.InteractionGraphAggAddressBookVertexSnapshotScalaDataset -import com.twitter.interaction_graph.scio.agg_client_event_logs.InteractionGraphAggClientEventLogsEdgeDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_client_event_logs.InteractionGraphAggClientEventLogsVertexDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_direct_interactions.InteractionGraphAggDirectInteractionsEdgeDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_direct_interactions.InteractionGraphAggDirectInteractionsVertexDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_flock.InteractionGraphAggFlockEdgeSnapshotScalaDataset -import com.twitter.interaction_graph.scio.agg_flock.InteractionGraphAggFlockVertexSnapshotScalaDataset -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser -import com.twitter.util.Duration -import org.joda.time.Interval - -case class InteractionGraphAggregationSource( - pipelineOptions: InteractionGraphAggregationOption -)( - implicit sc: ScioContext) { - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - - def readDALDataset[T: Manifest]( - dataset: TimePartitionedDALDataset[T], - interval: Interval, - dalEnvironment: String, - projections: Option[Seq[String]] = None - )( - implicit sc: ScioContext, - ): SCollection[T] = { - sc.customInput( - s"Reading ${dataset.role.name}.${dataset.logicalName}", - DAL.read[T]( - dataset = dataset, - interval = interval, - environmentOverride = 
Environment.valueOf(dalEnvironment), - readOptions = ReadOptions(projections) - ) - ) - } - - def readMostRecentSnapshotDALDataset[T: Manifest]( - dataset: SnapshotDALDatasetBase[T], - dateInterval: Interval, - dalEnvironment: String, - projections: Option[Seq[String]] = None - )( - implicit sc: ScioContext, - ): SCollection[T] = { - sc.customInput( - s"Reading most recent snapshot ${dataset.role.name}.${dataset.logicalName}", - DAL.readMostRecentSnapshot[T]( - dataset, - dateInterval, - Environment.valueOf(dalEnvironment), - readOptions = ReadOptions(projections) - ) - ) - } - - def readMostRecentSnapshotNoOlderThanDALDataset[T: Manifest]( - dataset: SnapshotDALDatasetBase[T], - noOlderThan: Duration, - dalEnvironment: String, - projections: Option[Seq[String]] = None - )( - implicit sc: ScioContext, - ): SCollection[T] = { - sc.customInput( - s"Reading most recent snapshot ${dataset.role.name}.${dataset.logicalName}", - DAL.readMostRecentSnapshotNoOlderThan[T]( - dataset, - noOlderThan, - environmentOverride = Environment.valueOf(dalEnvironment), - readOptions = ReadOptions(projections) - ) - ) - } - - def readAddressBookFeatures(): (SCollection[Edge], SCollection[Vertex]) = { - val edges = readMostRecentSnapshotNoOlderThanDALDataset[Edge]( - dataset = InteractionGraphAggAddressBookEdgeSnapshotScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment, - ) - - val vertex = readMostRecentSnapshotNoOlderThanDALDataset[Vertex]( - dataset = InteractionGraphAggAddressBookVertexSnapshotScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment, - ) - - (edges, vertex) - } - - def readClientEventLogsFeatures( - dateInterval: Interval - ): (SCollection[Edge], SCollection[Vertex]) = { - val edges = readDALDataset[Edge]( - dataset = InteractionGraphAggClientEventLogsEdgeDailyScalaDataset, - dalEnvironment = dalEnvironment, - interval = dateInterval - ) - - val vertex = readDALDataset[Vertex]( - dataset = InteractionGraphAggClientEventLogsVertexDailyScalaDataset, - dalEnvironment = dalEnvironment, - interval = dateInterval - ) - - (edges, vertex) - } - - def readDirectInteractionsFeatures( - dateInterval: Interval - ): (SCollection[Edge], SCollection[Vertex]) = { - val edges = readDALDataset[Edge]( - dataset = InteractionGraphAggDirectInteractionsEdgeDailyScalaDataset, - dalEnvironment = dalEnvironment, - interval = dateInterval - ) - - val vertex = readDALDataset[Vertex]( - dataset = InteractionGraphAggDirectInteractionsVertexDailyScalaDataset, - dalEnvironment = dalEnvironment, - interval = dateInterval - ) - - (edges, vertex) - } - - def readFlockFeatures(): (SCollection[Edge], SCollection[Vertex]) = { - val edges = readMostRecentSnapshotNoOlderThanDALDataset[Edge]( - dataset = InteractionGraphAggFlockEdgeSnapshotScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment, - ) - - val vertex = readMostRecentSnapshotNoOlderThanDALDataset[Vertex]( - dataset = InteractionGraphAggFlockVertexSnapshotScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment, - ) - - (edges, vertex) - } - - def readAggregatedFeatures(dateInterval: Interval): (SCollection[Edge], SCollection[Vertex]) = { - val edges = readMostRecentSnapshotDALDataset[Edge]( - dataset = InteractionGraphHistoryAggregatedEdgeSnapshotScalaDataset, - dalEnvironment = dalEnvironment, - dateInterval = dateInterval - ) - - val vertex = readMostRecentSnapshotDALDataset[Vertex]( - dataset = 
InteractionGraphHistoryAggregatedVertexSnapshotScalaDataset, - dalEnvironment = dalEnvironment, - dateInterval = dateInterval - ) - - (edges, vertex) - } - - def readFlatUsers(): SCollection[FlatUser] = - readMostRecentSnapshotNoOlderThanDALDataset[FlatUser]( - dataset = UsersourceFlatScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment, - projections = Some(Seq("id", "valid_user")) - ) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationTransform.scala b/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationTransform.scala deleted file mode 100644 index c76592c10..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/InteractionGraphAggregationTransform.scala +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_all - -import collection.JavaConverters._ -import com.spotify.scio.values.SCollection -import com.twitter.algebird.mutable.PriorityQueueMonoid -import com.twitter.interaction_graph.scio.common.GraphUtil -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.timelines.real_graph.thriftscala.RealGraphFeatures -import com.twitter.timelines.real_graph.thriftscala.RealGraphFeaturesTest -import com.twitter.timelines.real_graph.v1.thriftscala.{RealGraphFeatures => RealGraphFeaturesV1} -import com.twitter.user_session_store.thriftscala.UserSession -import com.twitter.interaction_graph.scio.common.ConversionUtil._ - -object InteractionGraphAggregationTransform { - val ordering: Ordering[Edge] = Ordering.by(-_.weight.getOrElse(0.0)) - - // converts our Edge thrift into timelines' thrift - def getTopKTimelineFeatures( - scoredAggregatedEdge: SCollection[Edge], - maxDestinationIds: Int - ): SCollection[KeyVal[Long, UserSession]] = { - scoredAggregatedEdge - .filter(_.weight.exists(_ > 0)) - .keyBy(_.sourceId) - .groupByKey - .map { - case (sourceId, edges) => - val (inEdges, outEdges) = edges.partition(GraphUtil.isFollow) - val inTopK = - if (inEdges.isEmpty) Nil - else { - val inTopKQueue = - new PriorityQueueMonoid[Edge](maxDestinationIds)(ordering) - inTopKQueue - .build(inEdges).iterator().asScala.toList.flatMap( - toRealGraphEdgeFeatures(hasTimelinesRequiredFeatures)) - } - val outTopK = - if (outEdges.isEmpty) Nil - else { - val outTopKQueue = - new PriorityQueueMonoid[Edge](maxDestinationIds)(ordering) - outTopKQueue - .build(outEdges).iterator().asScala.toList.flatMap( - toRealGraphEdgeFeatures(hasTimelinesRequiredFeatures)) - } - KeyVal( - sourceId, - UserSession( - userId = Some(sourceId), - realGraphFeatures = Some(RealGraphFeatures.V1(RealGraphFeaturesV1(inTopK, outTopK))), - realGraphFeaturesTest = - Some(RealGraphFeaturesTest.V1(RealGraphFeaturesV1(inTopK, outTopK))) - ) - ) - } - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_all/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_all/README.md deleted file mode 100644 index cedf39b12..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_all/README.md +++ /dev/null @@ -1,38 +0,0 @@ -## InteractionGraphAggregationJob Dataflow Job - -This job aggregates the previous day's history with today's activities, and outputs an updated -history. This history is joined with the explicit scores from real graph's BQML pipeline, and -exported as features for timelines (which is why we're using their thrift). 
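The decay-and-add step this README describes is done by `FeatureGeneratorUtil.combineEdgeFeaturesWithDecay` / `combineVertexFeaturesWithDecay` using the constants in `InteractionGraphScoringConfig` (`ALPHA = 1.0`, `ONE_MINUS_ALPHA = 0.955`). Below is a minimal, self-contained sketch of that roll-up for intuition only; `DecayedAggregationSketch`, `combineWithDecay`, and the sample counts are hypothetical stand-ins, not the job's actual thrift types or Scio transforms.

```scala
object DecayedAggregationSketch {
  // Constants from InteractionGraphScoringConfig: today's value keeps full weight,
  // while the accumulated history is decayed by ONE_MINUS_ALPHA on every daily run.
  val Alpha = 1.0
  val OneMinusAlpha = 0.955

  /** ewma_{t+1} = x_{t+1} + (1 - alpha) * ewma_t, keyed by (sourceId, destinationId). */
  def combineWithDecay(
    history: Map[(Long, Long), Double],
    daily: Map[(Long, Long), Double]
  ): Map[(Long, Long), Double] = {
    (history.keySet ++ daily.keySet).map { key =>
      key -> (Alpha * daily.getOrElse(key, 0.0) + OneMinusAlpha * history.getOrElse(key, 0.0))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val history = Map((1L, 2L) -> 10.0)                 // e.g. decayed fav count up to yesterday
    val daily = Map((1L, 2L) -> 3.0, (1L, 3L) -> 1.0)   // today's new activity
    println(combineWithDecay(history, daily))
    // roughly Map((1,2) -> 12.55, (1,3) -> 1.0): history decays, new activity is added at full weight
  }
}
```

Because alpha is 1.0, today's interactions are added at full weight while yesterday's accumulated value is multiplied by 0.955 on each run, so stale edges fade out gradually rather than being dropped outright.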
- -#### IntelliJ -``` -fastpass create --name rg_agg_all --intellij src/scala/com/twitter/interaction_graph/scio/agg_all:interaction_graph_aggregation_job_scio -``` - -#### Compile -``` -bazel build src/scala/com/twitter/interaction_graph/scio/agg_all:interaction_graph_aggregation_job_scio -``` - -#### Build Jar -``` -bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_all:interaction_graph_aggregation_job_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-aggregation-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_all/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-11-08 \ - --bind=profile.output_path=processed/interaction_graph_aggregation_dataflow -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/BUILD deleted file mode 100644 index 9c14f4d38..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/BUILD +++ /dev/null @@ -1,61 +0,0 @@ -scala_library( - name = "agg_client_event_logs", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_agg_client_event_logs_edge_daily-scala", - ":interaction_graph_agg_client_event_logs_vertex_daily-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "src/scala/com/twitter/interaction_graph/scio/common", - "src/scala/com/twitter/wtf/scalding/jobs/client_event_processing:user_interaction-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/combined:usersource-scala", - ], -) - -jvm_binary( - name = "interaction_graph_client_event_logs_scio", - main = "com.twitter.interaction_graph.scio.agg_client_event_logs.InteractionGraphClientEventLogsJob", - platform = "java8", - dependencies = [ - ":agg_client_event_logs", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_client_event_logs_edge_daily", - description = "User-user directed edges with client events features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_client_event_logs_vertex_daily", - description = "User vertex with client events features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - 
scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsCounters.scala b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsCounters.scala deleted file mode 100644 index cc9793ba8..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsCounters.scala +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_client_event_logs - -import com.spotify.scio.ScioMetrics - -trait InteractionGraphClientEventLogsCountersTrait { - val Namespace = "Interaction Graph Client Event Logs" - def profileViewFeaturesInc(): Unit - def linkOpenFeaturesInc(): Unit - def tweetClickFeaturesInc(): Unit - def tweetImpressionFeaturesInc(): Unit - def catchAllInc(): Unit -} - -case object InteractionGraphClientEventLogsCounters - extends InteractionGraphClientEventLogsCountersTrait { - - val profileViewCounter = ScioMetrics.counter(Namespace, "Profile View Features") - val linkOpenCounter = ScioMetrics.counter(Namespace, "Link Open Features") - val tweetClickCounter = ScioMetrics.counter(Namespace, "Tweet Click Features") - val tweetImpressionCounter = ScioMetrics.counter(Namespace, "Tweet Impression Features") - val catchAllCounter = ScioMetrics.counter(Namespace, "Catch All") - - override def profileViewFeaturesInc(): Unit = profileViewCounter.inc() - - override def linkOpenFeaturesInc(): Unit = linkOpenCounter.inc() - - override def tweetClickFeaturesInc(): Unit = tweetClickCounter.inc() - - override def tweetImpressionFeaturesInc(): Unit = tweetImpressionCounter.inc() - - override def catchAllInc(): Unit = catchAllCounter.inc() -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsJob.scala deleted file mode 100644 index 1a12b33d9..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsJob.scala +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_client_event_logs - -import com.spotify.scio.ScioContext -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.DiskFormat -import com.twitter.beam.io.dal.DAL.WriteOptions -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.interaction_graph.scio.common.UserUtil -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import org.joda.time.Interval - -object InteractionGraphClientEventLogsJob - extends ScioBeamJob[InteractionGraphClientEventLogsOption] { - override protected def configurePipeline( - scioContext: ScioContext, - pipelineOptions: InteractionGraphClientEventLogsOption - ): Unit = { - - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val jobCounters: InteractionGraphClientEventLogsCountersTrait = - InteractionGraphClientEventLogsCounters - - lazy val dateInterval: Interval = pipelineOptions.interval - - val sources = InteractionGraphClientEventLogsSource(pipelineOptions) - - val userInteractions = sources.readUserInteractions(dateInterval) - val 
rawUsers = sources.readCombinedUsers() - val safeUsers = UserUtil.getValidUsers(rawUsers) - - val (vertex, edges) = InteractionGraphClientEventLogsUtil.process(userInteractions, safeUsers) - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - - vertex.saveAsCustomOutput( - "Write Vertex Records", - DAL.write[Vertex]( - InteractionGraphAggClientEventLogsVertexDailyScalaDataset, - PathLayout.DailyPath( - pipelineOptions.getOutputPath + "/aggregated_client_event_logs_vertex_daily"), - dateInterval, - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = - WriteOptions(numOfShards = Some((pipelineOptions.getNumberOfShards / 32.0).ceil.toInt)) - ) - ) - - edges.saveAsCustomOutput( - "Write Edge Records", - DAL.write[Edge]( - InteractionGraphAggClientEventLogsEdgeDailyScalaDataset, - PathLayout.DailyPath( - pipelineOptions.getOutputPath + "/aggregated_client_event_logs_edge_daily"), - dateInterval, - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsOption.scala b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsOption.scala deleted file mode 100644 index 7a07a6913..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_client_event_logs - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphClientEventLogsOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. 
Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(16) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsSource.scala b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsSource.scala deleted file mode 100644 index 1cf2da318..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsSource.scala +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_client_event_logs - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser -import com.twitter.usersource.snapshot.combined.UsersourceScalaDataset -import com.twitter.util.Duration -import com.twitter.cde.scio.dal_read.SourceUtil -import com.twitter.wtf.scalding.client_event_processing.thriftscala.UserInteraction -import com.twitter.wtf.scalding.jobs.client_event_processing.UserInteractionScalaDataset -import org.joda.time.Interval - -case class InteractionGraphClientEventLogsSource( - pipelineOptions: InteractionGraphClientEventLogsOption -)( - implicit sc: ScioContext) { - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - - def readUserInteractions(dateInterval: Interval): SCollection[UserInteraction] = { - - SourceUtil.readDALDataset[UserInteraction]( - dataset = UserInteractionScalaDataset, - interval = dateInterval, - dalEnvironment = dalEnvironment) - - } - - def readCombinedUsers(): SCollection[CombinedUser] = { - - SourceUtil.readMostRecentSnapshotNoOlderThanDALDataset[CombinedUser]( - dataset = UsersourceScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsUtil.scala b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsUtil.scala deleted file mode 100644 index 521a1f07f..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/InteractionGraphClientEventLogsUtil.scala +++ /dev/null @@ -1,137 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_client_event_logs - -import com.spotify.scio.values.SCollection -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.scio.common.FeatureKey -import com.twitter.interaction_graph.scio.common.InteractionGraphRawInput -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.wtf.scalding.client_event_processing.thriftscala.InteractionDetails -import com.twitter.wtf.scalding.client_event_processing.thriftscala.InteractionType -import com.twitter.wtf.scalding.client_event_processing.thriftscala.UserInteraction - -object InteractionGraphClientEventLogsUtil { - - val DefaultAge = 1 - val DefaultFeatureValue = 1.0 - - def process( - userInteractions: SCollection[UserInteraction], - safeUsers: SCollection[Long] - )( - 
implicit jobCounters: InteractionGraphClientEventLogsCountersTrait - ): (SCollection[Vertex], SCollection[Edge]) = { - - val unfilteredFeatureInput = userInteractions - .flatMap { - case UserInteraction( - userId, - _, - interactionType, - InteractionDetails.ProfileClickDetails(profileClick)) - if interactionType == InteractionType.ProfileClicks && userId != profileClick.profileId => - jobCounters.profileViewFeaturesInc() - Seq( - FeatureKey( - userId, - profileClick.profileId, - FeatureName.NumProfileViews) -> DefaultFeatureValue - ) - - case UserInteraction( - userId, - _, - interactionType, - InteractionDetails.TweetClickDetails(tweetClick)) - if interactionType == InteractionType.TweetClicks && - Some(userId) != tweetClick.authorId => - ( - for { - authorId <- tweetClick.authorId - } yield { - jobCounters.tweetClickFeaturesInc() - FeatureKey(userId, authorId, FeatureName.NumTweetClicks) -> DefaultFeatureValue - - } - ).toSeq - - case UserInteraction( - userId, - _, - interactionType, - InteractionDetails.LinkClickDetails(linkClick)) - if interactionType == InteractionType.LinkClicks && - Some(userId) != linkClick.authorId => - ( - for { - authorId <- linkClick.authorId - } yield { - jobCounters.linkOpenFeaturesInc() - FeatureKey(userId, authorId, FeatureName.NumLinkClicks) -> DefaultFeatureValue - } - ).toSeq - - case UserInteraction( - userId, - _, - interactionType, - InteractionDetails.TweetImpressionDetails(tweetImpression)) - if interactionType == InteractionType.TweetImpressions && - Some(userId) != tweetImpression.authorId => - ( - for { - authorId <- tweetImpression.authorId - dwellTime <- tweetImpression.dwellTimeInSec - } yield { - jobCounters.tweetImpressionFeaturesInc() - Seq( - FeatureKey( - userId, - authorId, - FeatureName.NumInspectedStatuses) -> DefaultFeatureValue, - FeatureKey(userId, authorId, FeatureName.TotalDwellTime) -> dwellTime.toDouble - ) - } - ).getOrElse(Nil) - - case _ => - jobCounters.catchAllInc() - Nil - } - .sumByKey - .collect { - case (FeatureKey(srcId, destId, featureName), featureValue) => - InteractionGraphRawInput( - src = srcId, - dst = destId, - name = featureName, - age = 1, - featureValue = featureValue - ) - } - - val filteredFeatureInput = filterForSafeUsers(unfilteredFeatureInput, safeUsers) - - // Calculate the Features - FeatureGeneratorUtil.getFeatures(filteredFeatureInput) - - } - - private def filterForSafeUsers( - featureInput: SCollection[InteractionGraphRawInput], - safeUsers: SCollection[Long] - ): SCollection[InteractionGraphRawInput] = { - - featureInput - .keyBy(_.src) - .withName("Filter out unsafe users") - .intersectByKey(safeUsers) - .values // Fetch only InteractionGraphRawInput - .keyBy(_.dst) - .withName("Filter out unsafe authors") - .intersectByKey(safeUsers) - .values // Fetch only InteractionGraphRawInput - } - -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/README.md deleted file mode 100644 index 6bd1ea2cd..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphClientEventLogs Dataflow Job - -#### IntelliJ -``` -./bazel idea src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_client_event_logs_scio -``` - -#### Compile -``` -./bazel build src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_client_event_logs_scio -``` - -#### Build Jar -``` -./bazel 
bundle src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_client_event_logs_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-client-event-logs-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-04-27 \ - --bind=profile.output_path=processed/interaction_graph_agg_client_event_logs_dataflow -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/BUILD deleted file mode 100644 index 51479c70d..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/BUILD +++ /dev/null @@ -1,65 +0,0 @@ -scala_library( - name = "agg_direct_interactions", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_agg_direct_interactions_edge_daily-scala", - ":interaction_graph_agg_direct_interactions_vertex_daily-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "src/scala/com/twitter/interaction_graph/scio/common", - "src/thrift/com/twitter/timelineservice/server/internal:thrift-scala", - "twadoop_config/configuration/log_categories/group/timeline:timeline_service_favorites-scala", - "twadoop_config/configuration/log_categories/group/tweetypie:tweetypie_media_tag_events-scala", - "tweetsource/common:unhydrated_flat-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/combined:usersource-scala", - ], -) - -jvm_binary( - name = "interaction_graph_agg_direct_interactions_scio", - main = "com.twitter.interaction_graph.scio.agg_direct_interactions.InteractionGraphAggDirectInteractionsJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":agg_direct_interactions", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_direct_interactions_edge_daily", - description = "User-user directed edges with direct interactions features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_direct_interactions_vertex_daily", - description = "User vertex with direct interactions features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - 
"src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsJob.scala deleted file mode 100644 index 0b855cee2..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsJob.scala +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_direct_interactions - -import com.spotify.scio.ScioContext -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.DiskFormat -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.io.fs.multiformat.WriteOptions -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.interaction_graph.scio.common.UserUtil -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import org.joda.time.Interval - -object InteractionGraphAggDirectInteractionsJob - extends ScioBeamJob[InteractionGraphAggDirectInteractionsOption] { - override protected def configurePipeline( - scioContext: ScioContext, - pipelineOptions: InteractionGraphAggDirectInteractionsOption - ): Unit = { - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val dateInterval: Interval = pipelineOptions.interval - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - - val source = InteractionGraphAggDirectInteractionsSource(pipelineOptions) - - val rawUsers = source.readCombinedUsers() - val safeUsers = UserUtil.getValidUsers(rawUsers) - - val rawFavorites = source.readFavorites(dateInterval) - val rawPhotoTags = source.readPhotoTags(dateInterval) - val tweetSource = source.readTweetSource(dateInterval) - - val (vertex, edges) = InteractionGraphAggDirectInteractionsUtil.process( - rawFavorites, - tweetSource, - rawPhotoTags, - safeUsers - ) - - vertex.saveAsCustomOutput( - "Write Vertex Records", - DAL.write[Vertex]( - InteractionGraphAggDirectInteractionsVertexDailyScalaDataset, - PathLayout.DailyPath( - pipelineOptions.getOutputPath + "/aggregated_direct_interactions_vertex_daily"), - dateInterval, - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = - WriteOptions(numOfShards = Some((pipelineOptions.getNumberOfShards / 8.0).ceil.toInt)) - ) - ) - - edges.saveAsCustomOutput( - "Write Edge Records", - DAL.write[Edge]( - InteractionGraphAggDirectInteractionsEdgeDailyScalaDataset, - PathLayout.DailyPath( - pipelineOptions.getOutputPath + "/aggregated_direct_interactions_edge_daily"), - dateInterval, - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsOption.scala 
b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsOption.scala deleted file mode 100644 index 43d3d08df..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_direct_interactions - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphAggDirectInteractionsOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(16) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsSource.scala b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsSource.scala deleted file mode 100644 index 9470b1980..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsSource.scala +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_direct_interactions - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.cde.scio.dal_read.SourceUtil -import com.twitter.timelineservice.thriftscala.ContextualizedFavoriteEvent -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser -import com.twitter.tweetsource.common.thriftscala.UnhydratedFlatTweet -import com.twitter.tweetypie.thriftscala.TweetMediaTagEvent -import com.twitter.usersource.snapshot.combined.UsersourceScalaDataset -import com.twitter.util.Duration -import org.joda.time.Interval -import twadoop_config.configuration.log_categories.group.timeline.TimelineServiceFavoritesScalaDataset -import twadoop_config.configuration.log_categories.group.tweetypie.TweetypieMediaTagEventsScalaDataset -import tweetsource.common.UnhydratedFlatScalaDataset - -case class InteractionGraphAggDirectInteractionsSource( - pipelineOptions: InteractionGraphAggDirectInteractionsOption -)( - implicit sc: ScioContext) { - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - - def readFavorites(dateInterval: Interval): SCollection[ContextualizedFavoriteEvent] = - SourceUtil.readDALDataset[ContextualizedFavoriteEvent]( - dataset = TimelineServiceFavoritesScalaDataset, - interval = dateInterval, - dalEnvironment = dalEnvironment - ) - - def readPhotoTags(dateInterval: Interval): SCollection[TweetMediaTagEvent] = - SourceUtil.readDALDataset[TweetMediaTagEvent]( - dataset = TweetypieMediaTagEventsScalaDataset, - interval = dateInterval, - dalEnvironment = dalEnvironment) - - def readTweetSource(dateInterval: Interval): SCollection[UnhydratedFlatTweet] = - SourceUtil.readDALDataset[UnhydratedFlatTweet]( 
- dataset = UnhydratedFlatScalaDataset, - interval = dateInterval, - dalEnvironment = dalEnvironment) - - def readCombinedUsers(): SCollection[CombinedUser] = - SourceUtil.readMostRecentSnapshotNoOlderThanDALDataset[CombinedUser]( - dataset = UsersourceScalaDataset, - noOlderThan = Duration.fromDays(5), - dalEnvironment = dalEnvironment - ) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsUtil.scala b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsUtil.scala deleted file mode 100644 index 1d996116e..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/InteractionGraphAggDirectInteractionsUtil.scala +++ /dev/null @@ -1,168 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_direct_interactions - -import com.spotify.scio.ScioMetrics -import com.spotify.scio.values.SCollection -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.scio.common.FeatureKey -import com.twitter.interaction_graph.scio.common.InteractionGraphRawInput -import com.twitter.interaction_graph.scio.common.UserUtil.DUMMY_USER_ID -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.timelineservice.thriftscala.ContextualizedFavoriteEvent -import com.twitter.timelineservice.thriftscala.FavoriteEventUnion.Favorite -import com.twitter.tweetsource.common.thriftscala.UnhydratedFlatTweet -import com.twitter.tweetypie.thriftscala.TweetMediaTagEvent - -object InteractionGraphAggDirectInteractionsUtil { - - val DefaultFeatureValue = 1L - - def favouriteFeatures( - rawFavorites: SCollection[ContextualizedFavoriteEvent] - ): SCollection[(FeatureKey, Long)] = { - rawFavorites - .withName("fav features") - .flatMap { event => - event.event match { - case Favorite(e) if e.userId != e.tweetUserId => - ScioMetrics.counter("process", "fav").inc() - Some( - FeatureKey(e.userId, e.tweetUserId, FeatureName.NumFavorites) -> DefaultFeatureValue) - case _ => None - } - } - - } - - def mentionFeatures( - tweetSource: SCollection[UnhydratedFlatTweet] - ): SCollection[(FeatureKey, Long)] = { - tweetSource - .withName("mention features") - .flatMap { - case s if s.shareSourceTweetId.isEmpty => // only for non-retweets - s.atMentionedUserIds - .map { users => - users.toSet.map { uid: Long => - ScioMetrics.counter("process", "mention").inc() - FeatureKey(s.userId, uid, FeatureName.NumMentions) -> DefaultFeatureValue - }.toSeq - } - .getOrElse(Nil) - case _ => - Nil - } - } - - def photoTagFeatures( - rawPhotoTags: SCollection[TweetMediaTagEvent] - ): SCollection[(FeatureKey, Long)] = { - rawPhotoTags - .withName("photo tag features") - .flatMap { p => - p.taggedUserIds.map { (p.userId, _) } - } - .collect { - case (src, dst) if src != dst => - ScioMetrics.counter("process", "photo tag").inc() - FeatureKey(src, dst, FeatureName.NumPhotoTags) -> DefaultFeatureValue - } - } - - def retweetFeatures( - tweetSource: SCollection[UnhydratedFlatTweet] - ): SCollection[(FeatureKey, Long)] = { - tweetSource - .withName("retweet features") - .collect { - case s if s.shareSourceUserId.exists(_ != s.userId) => - ScioMetrics.counter("process", "share tweet").inc() - FeatureKey( - s.userId, - s.shareSourceUserId.get, - FeatureName.NumRetweets) -> DefaultFeatureValue - } - } - - def quotedTweetFeatures( - 
tweetSource: SCollection[UnhydratedFlatTweet] - ): SCollection[(FeatureKey, Long)] = { - tweetSource - .withName("quoted tweet features") - .collect { - case t if t.quotedTweetUserId.isDefined => - ScioMetrics.counter("process", "quote tweet").inc() - FeatureKey( - t.userId, - t.quotedTweetUserId.get, - FeatureName.NumTweetQuotes) -> DefaultFeatureValue - } - } - - def replyTweetFeatures( - tweetSource: SCollection[UnhydratedFlatTweet] - ): SCollection[(FeatureKey, Long)] = { - tweetSource - .withName("reply tweet features") - .collect { - case t if t.inReplyToUserId.isDefined => - ScioMetrics.counter("process", "reply tweet").inc() - FeatureKey(t.userId, t.inReplyToUserId.get, FeatureName.NumReplies) -> DefaultFeatureValue - } - } - - // we create edges to a dummy user id since creating a tweet has no destination id - def createTweetFeatures( - tweetSource: SCollection[UnhydratedFlatTweet] - ): SCollection[(FeatureKey, Long)] = { - tweetSource.withName("create tweet features").map { tweet => - ScioMetrics.counter("process", "create tweet").inc() - FeatureKey(tweet.userId, DUMMY_USER_ID, FeatureName.NumCreateTweets) -> DefaultFeatureValue - } - } - - def process( - rawFavorites: SCollection[ContextualizedFavoriteEvent], - tweetSource: SCollection[UnhydratedFlatTweet], - rawPhotoTags: SCollection[TweetMediaTagEvent], - safeUsers: SCollection[Long] - ): (SCollection[Vertex], SCollection[Edge]) = { - val favouriteInput = favouriteFeatures(rawFavorites) - val mentionInput = mentionFeatures(tweetSource) - val photoTagInput = photoTagFeatures(rawPhotoTags) - val retweetInput = retweetFeatures(tweetSource) - val quotedTweetInput = quotedTweetFeatures(tweetSource) - val replyInput = replyTweetFeatures(tweetSource) - val createTweetInput = createTweetFeatures(tweetSource) - - val allInput = SCollection.unionAll( - Seq( - favouriteInput, - mentionInput, - photoTagInput, - retweetInput, - quotedTweetInput, - replyInput, - createTweetInput - )) - - val filteredFeatureInput = allInput - .keyBy(_._1.src) - .intersectByKey(safeUsers) // filter for safe users - .values - .collect { - case (FeatureKey(src, dst, feature), featureValue) if src != dst => - FeatureKey(src, dst, feature) -> featureValue - } - .sumByKey - .map { - case (FeatureKey(src, dst, feature), featureValue) => - val age = 1 - InteractionGraphRawInput(src, dst, feature, age, featureValue) - } - - FeatureGeneratorUtil.getFeatures(filteredFeatureInput) - } - -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/README.md deleted file mode 100644 index a9e9d3610..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphAggDirectInteractions Dataflow Job - -#### IntelliJ -``` -./bazel idea src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_scio -``` - -#### Compile -``` -./bazel build src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_scio -``` - -#### Build Jar -``` -./bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-agg-direct-interactions-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - 
src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-04-13 \ - --bind=profile.output_path=processed/interaction_graph_agg_direct_interactions_dataflow -``` diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_flock/BUILD deleted file mode 100644 index 3bf51323c..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/BUILD +++ /dev/null @@ -1,70 +0,0 @@ -scala_library( - name = "agg_flock", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_agg_flock_edge_snapshot-scala", - ":interaction_graph_agg_flock_vertex_snapshot-scala", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "flockdb-tools/datasets/flock:flock-blocks-edges-scala", - "flockdb-tools/datasets/flock:flock-mutes-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-abuse-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-spam-edges-scala", - "src/scala/com/twitter/interaction_graph/scio/common", - "src/scala/com/twitter/wtf/dataflow/user_events:valid_user_follows-scala", - "src/thrift/com/twitter/core_workflows/user_model:user_model-scala", - "src/thrift/com/twitter/twadoop/user/gen:gen-java", - "src/thrift/com/twitter/twadoop/user/gen:gen-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/combined:usersource-scala", - ], -) - -jvm_binary( - name = "interaction_graph_agg_flock_scio", - main = "com.twitter.interaction_graph.scio.agg_flock.InteractionGraphAggFlockJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":agg_flock", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_flock_edge_snapshot", - description = "User-user directed edges with flock features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_flock_vertex_snapshot", - description = "User vertex with flock features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockJob.scala 
deleted file mode 100644 index e0a9f934d..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockJob.scala +++ /dev/null @@ -1,84 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_flock - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.DiskFormat -import com.twitter.beam.io.dal.DAL.PathLayout -import com.twitter.beam.io.dal.DAL.WriteOptions -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.interaction_graph.scio.agg_flock.InteractionGraphAggFlockUtil._ -import com.twitter.interaction_graph.scio.common.DateUtil -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.util.Duration -import java.time.Instant -import org.joda.time.Interval - -object InteractionGraphAggFlockJob extends ScioBeamJob[InteractionGraphAggFlockOption] { - override protected def configurePipeline( - scioContext: ScioContext, - pipelineOptions: InteractionGraphAggFlockOption - ): Unit = { - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val dateInterval: Interval = pipelineOptions.interval - - val source = InteractionGraphAggFlockSource(pipelineOptions) - - val embiggenInterval = DateUtil.embiggen(dateInterval, Duration.fromDays(7)) - - val flockFollowsSnapshot = source.readFlockFollowsSnapshot(embiggenInterval) - - // the flock snapshot we're reading from has already been filtered for safe/valid users hence no filtering for safeUsers - val flockFollowsFeature = - getFlockFeatures(flockFollowsSnapshot, FeatureName.NumFollows, dateInterval) - - val flockMutualFollowsFeature = getMutualFollowFeature(flockFollowsFeature) - - val allSCollections = Seq(flockFollowsFeature, flockMutualFollowsFeature) - - val allFeatures = SCollection.unionAll(allSCollections) - - val (vertex, edges) = FeatureGeneratorUtil.getFeatures(allFeatures) - - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - - vertex.saveAsCustomOutput( - "Write Vertex Records", - DAL.writeSnapshot[Vertex]( - InteractionGraphAggFlockVertexSnapshotScalaDataset, - PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_flock_vertex_daily"), - Instant.ofEpochMilli(dateInterval.getEndMillis), - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = - WriteOptions(numOfShards = Some((pipelineOptions.getNumberOfShards / 64.0).ceil.toInt)) - ) - ) - - edges.saveAsCustomOutput( - "Write Edge Records", - DAL.writeSnapshot[Edge]( - InteractionGraphAggFlockEdgeSnapshotScalaDataset, - PathLayout.DailyPath(pipelineOptions.getOutputPath + "/aggregated_flock_edge_daily"), - Instant.ofEpochMilli(dateInterval.getEndMillis), - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockOption.scala 
b/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockOption.scala deleted file mode 100644 index f5ef58b55..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_flock - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphAggFlockOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(16) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockSource.scala b/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockSource.scala deleted file mode 100644 index 726293475..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockSource.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_flock - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.flockdb.tools.datasets.flock.thriftscala.FlockEdge -import com.twitter.cde.scio.dal_read.SourceUtil -import com.twitter.wtf.dataflow.user_events.ValidUserFollowsScalaDataset -import org.joda.time.Interval - -case class InteractionGraphAggFlockSource( - pipelineOptions: InteractionGraphAggFlockOption -)( - implicit sc: ScioContext) { - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - - def readFlockFollowsSnapshot(dateInterval: Interval): SCollection[FlockEdge] = - SourceUtil.readMostRecentSnapshotDALDataset( - dataset = ValidUserFollowsScalaDataset, - dateInterval = dateInterval, - dalEnvironment = dalEnvironment) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockUtil.scala b/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockUtil.scala deleted file mode 100644 index 89858a89a..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/InteractionGraphAggFlockUtil.scala +++ /dev/null @@ -1,63 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_flock - -import com.spotify.scio.values.SCollection -import com.twitter.algebird.Min -import com.twitter.flockdb.tools.datasets.flock.thriftscala.FlockEdge -import com.twitter.interaction_graph.scio.common.InteractionGraphRawInput -import com.twitter.interaction_graph.thriftscala.FeatureName -import java.time.Instant -import java.time.temporal.ChronoUnit -import org.joda.time.Interval - -object InteractionGraphAggFlockUtil { - - def getFlockFeatures( - edges: SCollection[FlockEdge], - featureName: FeatureName, - dateInterval: Interval - ): SCollection[InteractionGraphRawInput] = { - edges - .withName(s"${featureName.toString} - Converting flock edge to 
interaction graph input") - .map { edge => - // NOTE: getUpdatedAt gives time in the seconds resolution - // Because we use .extend() when reading the data source, the updatedAt time might be larger than the dateRange. - // We need to cap them, otherwise, DateUtil.diffDays gives incorrect results. - val start = (edge.updatedAt * 1000L).min(dateInterval.getEnd.toInstant.getMillis) - val end = dateInterval.getStart.toInstant.getMillis - val age = ChronoUnit.DAYS.between( - Instant.ofEpochMilli(start), - Instant.ofEpochMilli(end) - ) + 1 - InteractionGraphRawInput(edge.sourceId, edge.destinationId, featureName, age.toInt, 1.0) - } - - } - - def getMutualFollowFeature( - flockFollowFeature: SCollection[InteractionGraphRawInput] - ): SCollection[InteractionGraphRawInput] = { - flockFollowFeature - .withName("Convert FlockFollows to Mutual Follows") - .map { input => - val sourceId = input.src - val destId = input.dst - - if (sourceId < destId) { - Tuple2(sourceId, destId) -> Tuple2(Set(true), Min(input.age)) // true means follow - } else { - Tuple2(destId, sourceId) -> Tuple2(Set(false), Min(input.age)) // false means followed_by - } - } - .sumByKey - .flatMap { - case ((id1, id2), (followSet, minAge)) if followSet.size == 2 => - val age = minAge.get - Seq( - InteractionGraphRawInput(id1, id2, FeatureName.NumMutualFollows, age, 1.0), - InteractionGraphRawInput(id2, id1, FeatureName.NumMutualFollows, age, 1.0)) - case _ => - Nil - } - } - -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_flock/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_flock/README.md deleted file mode 100644 index 0ff797194..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_flock/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphClientEventLogs Dataflow Job - -#### IntelliJ -``` -./bazel idea src/scala/com/twitter/interaction_graph/scio/agg_flock:interaction_graph_agg_flock_scio -``` - -#### Compile -``` -./bazel build src/scala/com/twitter/interaction_graph/scio/agg_flock:interaction_graph_agg_flock_scio -``` - -#### Build Jar -``` -./bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_flock:interaction_graph_agg_flock_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-agg-flock-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_flock/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-04-13 \ - --bind=profile.output_path=processed/interaction_graph_agg_flock_dataflow -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_negative/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_negative/BUILD deleted file mode 100644 index 1fbe57e1f..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_negative/BUILD +++ /dev/null @@ -1,43 +0,0 @@ -scala_library( - name = "agg_negative", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":real_graph_negative_features-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - 
"flockdb-tools/datasets/flock:flock-blocks-edges-scala", - "flockdb-tools/datasets/flock:flock-mutes-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-abuse-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-spam-edges-scala", - "socialgraph/hadoop/src/main/scala/com/twitter/socialgraph/hadoop:socialgraph-unfollows-scala", - "src/scala/com/twitter/interaction_graph/scio/common", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - ], -) - -jvm_binary( - name = "interaction_graph_negative_scio", - main = "com.twitter.interaction_graph.scio.agg_negative.InteractionGraphNegativeJob", - platform = "java8", - dependencies = [ - ":agg_negative", - ], -) - -create_datasets( - base_name = "real_graph_negative_features", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.injection.UserSessionInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.user_session_store.thriftscala.UserSession", - scala_dependencies = [ - "src/scala/com/twitter/interaction_graph/injection:user_session_inj", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeJob.scala deleted file mode 100644 index 479b67524..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeJob.scala +++ /dev/null @@ -1,155 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_negative - -import com.google.api.services.bigquery.model.TimePartitioning -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.algebird.mutable.PriorityQueueMonoid -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.io.fs.multiformat.WriteOptions -import com.twitter.conversions.DurationOps._ -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.interaction_graph.scio.common.ConversionUtil.hasNegativeFeatures -import com.twitter.interaction_graph.scio.common.ConversionUtil.toRealGraphEdgeFeatures -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil.getEdgeFeature -import com.twitter.interaction_graph.scio.common.GraphUtil -import com.twitter.interaction_graph.scio.common.InteractionGraphRawInput -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.socialgraph.hadoop.SocialgraphUnfollowsScalaDataset -import com.twitter.tcdc.bqblaster.beam.syntax._ -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import com.twitter.timelines.real_graph.thriftscala.RealGraphFeaturesTest -import com.twitter.timelines.real_graph.v1.thriftscala.{RealGraphFeatures => RealGraphFeaturesV1} -import com.twitter.user_session_store.thriftscala.UserSession -import flockdb_tools.datasets.flock.FlockBlocksEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockMutesEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsAbuseEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsSpamEdgesScalaDataset -import java.time.Instant -import 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO - -object InteractionGraphNegativeJob extends ScioBeamJob[InteractionGraphNegativeOption] { - val maxDestinationIds = 500 // p99 is about 500 - def getFeatureCounts(e: Edge): Int = e.features.size - val negativeEdgeOrdering = Ordering.by[Edge, Int](getFeatureCounts) - val negativeEdgeReverseOrdering = negativeEdgeOrdering.reverse - implicit val pqMonoid: PriorityQueueMonoid[Edge] = - new PriorityQueueMonoid[Edge](maxDestinationIds)(negativeEdgeOrdering) - - override protected def configurePipeline( - sc: ScioContext, - opts: InteractionGraphNegativeOption - ): Unit = { - - val endTs = opts.interval.getEndMillis - - // read input datasets - val blocks: SCollection[InteractionGraphRawInput] = - GraphUtil.getFlockFeatures( - readSnapshot(FlockBlocksEdgesScalaDataset, sc), - FeatureName.NumBlocks, - endTs) - - val mutes: SCollection[InteractionGraphRawInput] = - GraphUtil.getFlockFeatures( - readSnapshot(FlockMutesEdgesScalaDataset, sc), - FeatureName.NumMutes, - endTs) - - val abuseReports: SCollection[InteractionGraphRawInput] = - GraphUtil.getFlockFeatures( - readSnapshot(FlockReportAsAbuseEdgesScalaDataset, sc), - FeatureName.NumReportAsAbuses, - endTs) - - val spamReports: SCollection[InteractionGraphRawInput] = - GraphUtil.getFlockFeatures( - readSnapshot(FlockReportAsSpamEdgesScalaDataset, sc), - FeatureName.NumReportAsSpams, - endTs) - - // we only keep unfollows in the past 90 days due to the huge size of this dataset, - // and to prevent permanent "shadow-banning" in the event of accidental unfollows. - // we treat unfollows as less critical than above 4 negative signals, since it deals more with - // interest than health typically, which might change over time. - val unfollows: SCollection[InteractionGraphRawInput] = - GraphUtil - .getSocialGraphFeatures( - readSnapshot(SocialgraphUnfollowsScalaDataset, sc), - FeatureName.NumUnfollows, - endTs) - .filter(_.age < 90) - - // group all features by (src, dest) - val allEdgeFeatures: SCollection[Edge] = - getEdgeFeature(SCollection.unionAll(Seq(blocks, mutes, abuseReports, spamReports, unfollows))) - - val negativeFeatures: SCollection[KeyVal[Long, UserSession]] = - allEdgeFeatures - .keyBy(_.sourceId) - .topByKey(maxDestinationIds)(Ordering.by(_.features.size)) - .map { - case (srcId, pqEdges) => - val topKNeg = - pqEdges.toSeq.flatMap(toRealGraphEdgeFeatures(hasNegativeFeatures)) - KeyVal( - srcId, - UserSession( - userId = Some(srcId), - realGraphFeaturesTest = - Some(RealGraphFeaturesTest.V1(RealGraphFeaturesV1(topKNeg))))) - } - - // save to GCS (via DAL) - negativeFeatures.saveAsCustomOutput( - "Write Negative Edge Label", - DAL.writeVersionedKeyVal( - dataset = RealGraphNegativeFeaturesScalaDataset, - pathLayout = PathLayout.VersionedPath(opts.getOutputPath), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis), - writeOption = WriteOptions(numOfShards = Some(3000)) - ) - ) - - // save to BQ - val ingestionDate = opts.getDate().value.getStart.toDate - val bqDataset = opts.getBqDataset - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("dateHour" -> TypedProjection.fromConstant(ingestionDate)) - val timePartitioning = new TimePartitioning() - .setType("DAY").setField("dateHour").setExpirationMs(21.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[Edge] - .to(s"${bqDataset}.interaction_graph_agg_negative_edge_snapshot") - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId("twttr-recos-ml-prod") - 
.withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition( - BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE - ) // we only want the latest snapshot - - allEdgeFeatures - .saveAsCustomOutput( - s"Save Recommendations to BQ interaction_graph_agg_negative_edge_snapshot", - bqWriter - ) - } - - def readSnapshot[T <: ThriftStruct]( - dataset: SnapshotDALDataset[T], - sc: ScioContext - ): SCollection[T] = { - sc.customInput( - s"Reading most recent snaphost ${dataset.role.name}.${dataset.logicalName}", - DAL.readMostRecentSnapshotNoOlderThan[T](dataset, 7.days) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeOption.scala b/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeOption.scala deleted file mode 100644 index c44dc3396..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_negative/InteractionGraphNegativeOption.scala +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_negative - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphNegativeOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("BQ dataset prefix") - def getBqDataset: String - def setBqDataset(value: String): Unit - -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_negative/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_negative/README.md deleted file mode 100644 index 9df76e7ad..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_negative/README.md +++ /dev/null @@ -1,35 +0,0 @@ -## InteractionGraphNegative Dataflow Job - -#### IntelliJ -``` -fastpass create --name rg_neg --intellij src/scala/com/twitter/interaction_graph/scio/agg_negative -``` - -#### Compile -``` -bazel build src/scala/com/twitter/interaction_graph/scio/agg_negative:interaction_graph_negative_scio -``` - -#### Build Jar -``` -bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_negative:interaction_graph_negative_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-negative-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_negative/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-10-19 \ - --bind=profile.output_path=processed/interaction_graph_agg_negative_dataflow \ - --bind=profile.bq_dataset="twttr-bq-cassowary-prod:user" -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/BUILD b/src/scala/com/twitter/interaction_graph/scio/agg_notifications/BUILD deleted file mode 100644 index 25dfa572b..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/BUILD +++ /dev/null @@ -1,65 +0,0 @@ -scala_library( - name = "agg_notifications", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = 
["bazel-compatible"], - dependencies = [ - ":interaction_graph_agg_notifications_edge_daily-scala", - ":interaction_graph_agg_notifications_vertex_daily-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "src/scala/com/twitter/frigate/data_pipeline_beam/mr-client-event-filtering-job/src/main/scala/com/twitter/client_event_filtering:frigate_filtered_client_events_dataflow-scala", - "src/scala/com/twitter/interaction_graph/scio/common", - "src/scala/com/twitter/wtf/scalding/jobs/client_event_processing:user_interaction-scala", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - "twadoop_config/configuration/log_categories/group/frigate:frigate_notifier-scala", - "tweetsource/public_tweets/src/main/scala/com/twitter/tweetsource/public_tweets:public_tweets-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/combined:usersource-scala", - ], -) - -jvm_binary( - name = "interaction_graph_notifications_scio", - main = "com.twitter.interaction_graph.scio.agg_notifications.InteractionGraphNotificationsJob", - platform = "java8", - dependencies = [ - ":agg_notifications", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_notifications_edge_daily", - description = "User-user directed edges with notification features", - java_schema = "com.twitter.interaction_graph.thriftjava.Edge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Edge", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) - -create_datasets( - base_name = "interaction_graph_agg_notifications_vertex_daily", - description = "User vertex with notification features", - java_schema = "com.twitter.interaction_graph.thriftjava.Vertex", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.Vertex", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationUtil.scala b/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationUtil.scala deleted file mode 100644 index 2ca5a9cf4..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationUtil.scala +++ /dev/null @@ -1,132 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_notifications - -import com.spotify.scio.ScioMetrics -import com.twitter.clientapp.thriftscala.EventNamespace -import com.twitter.clientapp.thriftscala.LogEvent -import com.twitter.interaction_graph.thriftscala.FeatureName - -object InteractionGraphNotificationUtil { - - val PUSH_OPEN_ACTIONS = Set("open", "background_open") - val NTAB_CLICK_ACTIONS = Set("navigate", "click") - val STATUS_ID_REGEX = "^twitter:\\/\\/tweet\\?status_id=([0-9]+).*".r - val TWEET_ID_REGEX = "^twitter:\\/\\/tweet.id=([0-9]+).*".r - - def extractTweetIdFromUrl(url: 
String): Option[Long] = url match { - case STATUS_ID_REGEX(statusId) => - ScioMetrics.counter("regex matching", "status_id=").inc() - Some(statusId.toLong) - case TWEET_ID_REGEX(tweetId) => - ScioMetrics.counter("regex matching", "tweet?id=").inc() - Some(tweetId.toLong) - case _ => None - } - - def getPushNtabEvents(e: LogEvent): Seq[(Long, (Long, FeatureName))] = { - for { - logBase <- e.logBase.toSeq - userId <- logBase.userId.toSeq - namespace <- e.eventNamespace.toSeq - (tweetId, featureName) <- namespace match { - case EventNamespace(_, _, _, _, _, Some(action)) if PUSH_OPEN_ACTIONS.contains(action) => - (for { - details <- e.eventDetails - url <- details.url - tweetId <- extractTweetIdFromUrl(url) - } yield { - ScioMetrics.counter("event type", "push open").inc() - (tweetId, FeatureName.NumPushOpens) - }).toSeq - case EventNamespace(_, Some("ntab"), _, _, _, Some("navigate")) => - val tweetIds = for { - details <- e.eventDetails.toSeq - items <- details.items.toSeq - item <- items - ntabDetails <- item.notificationTabDetails.toSeq - clientEventMetadata <- ntabDetails.clientEventMetadata.toSeq - tweetIds <- clientEventMetadata.tweetIds.toSeq - tweetId <- tweetIds - } yield { - ScioMetrics.counter("event type", "ntab navigate").inc() - tweetId - } - tweetIds.map((_, FeatureName.NumNtabClicks)) - case EventNamespace(_, Some("ntab"), _, _, _, Some("click")) => - val tweetIds = for { - details <- e.eventDetails.toSeq - items <- details.items.toSeq - item <- items - tweetId <- item.id - } yield { - ScioMetrics.counter("event type", "ntab click").inc() - tweetId - } - tweetIds.map((_, FeatureName.NumNtabClicks)) - case _ => Nil - } - } yield (tweetId, (userId, featureName)) - } - - /** - * Returns events corresponding to ntab clicks. We have the tweet id from ntab clicks and can join - * those with public tweets. 
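 - * A rough usage sketch only: the names below mirror how InteractionGraphNotificationsJob in this
 - * package joins these tweetId-keyed events with tweet authors; that job remains the source of truth.
 - * {{{
 - * // events: SCollection[(tweetId, (srcUserId, FeatureName))]
 - * // tweetAuthors: SCollection[(tweetId, authorUserId)]
 - * val edgeCounts = events
 - *   .join(tweetAuthors)
 - *   .map { case (_, ((srcId, feature), destId)) => ((srcId, destId, feature), 1L) }
 - *   .sumByKey
 - * }}}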
- */ - def getNtabEvents(e: LogEvent): Seq[(Long, (Long, FeatureName))] = { - for { - logBase <- e.logBase.toSeq - userId <- logBase.userId.toSeq - namespace <- e.eventNamespace.toSeq - (tweetId, featureName) <- namespace match { - case EventNamespace(_, Some("ntab"), _, _, _, Some("navigate")) => - val tweetIds = for { - details <- e.eventDetails.toSeq - items <- details.items.toSeq - item <- items - ntabDetails <- item.notificationTabDetails.toSeq - clientEventMetadata <- ntabDetails.clientEventMetadata.toSeq - tweetIds <- clientEventMetadata.tweetIds.toSeq - tweetId <- tweetIds - } yield { - ScioMetrics.counter("event type", "ntab navigate").inc() - tweetId - } - tweetIds.map((_, FeatureName.NumNtabClicks)) - case EventNamespace(_, Some("ntab"), _, _, _, Some("click")) => - val tweetIds = for { - details <- e.eventDetails.toSeq - items <- details.items.toSeq - item <- items - tweetId <- item.id - } yield { - ScioMetrics.counter("event type", "ntab click").inc() - tweetId - } - tweetIds.map((_, FeatureName.NumNtabClicks)) - case _ => Nil - } - } yield (tweetId, (userId, featureName)) - } - - /** - * get push open events, keyed by impressionId (as the client event does not always have the tweetId nor the authorId) - */ - def getPushOpenEvents(e: LogEvent): Seq[(String, (Long, FeatureName))] = { - for { - logBase <- e.logBase.toSeq - userId <- logBase.userId.toSeq - namespace <- e.eventNamespace.toSeq - (tweetId, featureName) <- namespace match { - case EventNamespace(_, _, _, _, _, Some(action)) if PUSH_OPEN_ACTIONS.contains(action) => - val impressionIdOpt = for { - details <- e.notificationDetails - impressionId <- details.impressionId - } yield { - ScioMetrics.counter("event type", "push open").inc() - impressionId - } - impressionIdOpt.map((_, FeatureName.NumPushOpens)).toSeq - case _ => Nil - } - } yield (tweetId, (userId, featureName)) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsJob.scala b/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsJob.scala deleted file mode 100644 index 2a01988be..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsJob.scala +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_notifications - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.DiskFormat -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.io.fs.multiformat.ReadOptions -import com.twitter.beam.io.fs.multiformat.WriteOptions -import com.twitter.client_event_filtering.FrigateFilteredClientEventsDataflowScalaDataset -import com.twitter.clientapp.thriftscala.LogEvent -import com.twitter.interaction_graph.scio.common.FeatureGeneratorUtil -import com.twitter.interaction_graph.thriftscala._ -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.tweetsource.public_tweets.PublicTweetsScalaDataset - -object InteractionGraphNotificationsJob extends ScioBeamJob[InteractionGraphNotificationsOption] { - override protected def configurePipeline( - sc: ScioContext, - opts: InteractionGraphNotificationsOption - ): Unit = { - - val pushClientEvents: SCollection[LogEvent] = sc - .customInput( - name = "Read Push Client Events", - DAL - .read( - FrigateFilteredClientEventsDataflowScalaDataset, - opts.interval, - 
DAL.Environment.Prod, - ) - ) - val pushNtabEvents = - pushClientEvents.flatMap(InteractionGraphNotificationUtil.getPushNtabEvents) - - // look back tweets for 2 days because MR gets tweets from 2 days ago. - // Allow a grace period of 24 hours to reduce oncall workload - val graceHours = 24 - val interval2DaysBefore = - opts.interval.withStart(opts.interval.getStart.minusDays(2).plusHours(graceHours)) - val tweetAuthors: SCollection[(Long, Long)] = sc - .customInput( - name = "Read Tweets", - DAL - .read( - dataset = PublicTweetsScalaDataset, - interval = interval2DaysBefore, - environmentOverride = DAL.Environment.Prod, - readOptions = ReadOptions(projections = Some(Seq("tweetId", "userId"))) - ) - ).map { t => (t.tweetId, t.userId) } - - val pushNtabEdgeCounts = pushNtabEvents - .join(tweetAuthors) - .map { - case (_, ((srcId, feature), destId)) => ((srcId, destId, feature), 1L) - } - .withName("summing edge feature counts") - .sumByKey - - val aggPushEdges = pushNtabEdgeCounts - .map { - case ((srcId, destId, featureName), count) => - (srcId, destId) -> Seq( - EdgeFeature(featureName, FeatureGeneratorUtil.initializeTSS(count))) - } - .sumByKey - .map { - case ((srcId, destId), edgeFeatures) => - Edge(srcId, destId, None, edgeFeatures.sortBy(_.name.value)) - } - - aggPushEdges.saveAsCustomOutput( - "Write Edge Records", - DAL.write[Edge]( - InteractionGraphAggNotificationsEdgeDailyScalaDataset, - PathLayout.DailyPath(opts.getOutputPath + "/aggregated_notifications_edge_daily"), - opts.interval, - DiskFormat.Parquet, - Environment.valueOf(opts.getDALWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(opts.getNumberOfShards)) - ) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsOption.scala b/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsOption.scala deleted file mode 100644 index dd1b4c769..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/InteractionGraphNotificationsOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.agg_notifications - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphNotificationsOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. 
Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(8) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/README.md b/src/scala/com/twitter/interaction_graph/scio/agg_notifications/README.md deleted file mode 100644 index f5f274ad8..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/agg_notifications/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphClientEventLogs Dataflow Job - -#### IntelliJ -``` -fastpass create --name rg_labels --intellij src/scala/com/twitter/interaction_graph/scio/agg_notifications -``` - -#### Compile -``` -bazel build src/scala/com/twitter/interaction_graph/scio/agg_notifications:interaction_graph_notifications_scio -``` - -#### Build Jar -``` -bazel bundle src/scala/com/twitter/interaction_graph/scio/agg_notifications:interaction_graph_notifications_scio -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-notifications-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/agg_notifications/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-05-10 \ - --bind=profile.output_path=processed/interaction_graph_agg_notifications_dataflow -``` diff --git a/src/scala/com/twitter/interaction_graph/scio/common/BUILD b/src/scala/com/twitter/interaction_graph/scio/common/BUILD deleted file mode 100644 index 4916728c5..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/BUILD +++ /dev/null @@ -1,31 +0,0 @@ -scala_library( - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "flockdb-tools/datasets/flock/src/main/thrift:thrift-scala", - "src/scala/com/twitter/pluck/source/combined_user_scrooge_source", - "src/thrift/com/twitter/gizmoduck:user-thrift-scala", - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - "src/thrift/com/twitter/socialgraph:thrift-scala", - "src/thrift/com/twitter/twadoop/user/gen:gen-scala", - "src/thrift/com/twitter/user_session_store:thrift-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - ], -) - -scala_library( - name = "feature_groups", - sources = ["FeatureGroups.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/common/CaseClasses.scala b/src/scala/com/twitter/interaction_graph/scio/common/CaseClasses.scala deleted file mode 100644 index d8264fd8e..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/CaseClasses.scala +++ /dev/null @@ -1,21 +0,0 @@ -package 
com.twitter.interaction_graph.scio.common - -import com.twitter.interaction_graph.thriftscala.FeatureName - -/** Interaction Graph Raw Input type defines a common type for edge / vertex feature calculation - * It has fields: (source Id, destination Id, Feature Name, age of this relationship (in days), - * and value to be aggregated) - */ -case class InteractionGraphRawInput( - src: Long, - dst: Long, - name: FeatureName, - age: Int, - featureValue: Double) - -case class FeatureKey( - src: Long, - dest: Long, - name: FeatureName) - -case class Tweepcred(userId: Long, tweepcred: Short) diff --git a/src/scala/com/twitter/interaction_graph/scio/common/ConversionUtil.scala b/src/scala/com/twitter/interaction_graph/scio/common/ConversionUtil.scala deleted file mode 100644 index a23816078..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/ConversionUtil.scala +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.spotify.scio.ScioMetrics -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.TimeSeriesStatistics -import com.twitter.timelines.real_graph.v1.thriftscala.RealGraphEdgeFeatures -import com.twitter.timelines.real_graph.v1.thriftscala.{ - RealGraphEdgeFeature => RealGraphEdgeFeatureV1 -} - -object ConversionUtil { - def toRealGraphEdgeFeatureV1(tss: TimeSeriesStatistics): RealGraphEdgeFeatureV1 = { - RealGraphEdgeFeatureV1( - mean = Some(tss.mean), - ewma = Some(tss.ewma), - m2ForVariance = Some(tss.m2ForVariance), - daysSinceLast = tss.numDaysSinceLast.map(_.toShort), - nonZeroDays = Some(tss.numNonZeroDays.toShort), - elapsedDays = Some(tss.numElapsedDays.toShort), - isMissing = Some(false) - ) - } - - /** - * Checks if the converted `RealGraphEdgeFeatures` has negative edges features. - * Our pipeline includes other negative interactions that aren't in the UserSession thrift - * so we'll just filter them away for now (for parity). - */ - def hasNegativeFeatures(rgef: RealGraphEdgeFeatures): Boolean = { - rgef.numMutes.nonEmpty || - rgef.numBlocks.nonEmpty || - rgef.numReportAsAbuses.nonEmpty || - rgef.numReportAsSpams.nonEmpty - } - - /** - * Checks if the converted `RealGraphEdgeFeatures` has some of the key interaction features present. - * This is adapted from timeline's code here: - */ - def hasTimelinesRequiredFeatures(rgef: RealGraphEdgeFeatures): Boolean = { - rgef.retweetsFeature.nonEmpty || - rgef.favsFeature.nonEmpty || - rgef.mentionsFeature.nonEmpty || - rgef.tweetClicksFeature.nonEmpty || - rgef.linkClicksFeature.nonEmpty || - rgef.profileViewsFeature.nonEmpty || - rgef.dwellTimeFeature.nonEmpty || - rgef.inspectedStatusesFeature.nonEmpty || - rgef.photoTagsFeature.nonEmpty || - rgef.numTweetQuotes.nonEmpty || - rgef.followFeature.nonEmpty || - rgef.mutualFollowFeature.nonEmpty || - rgef.addressBookEmailFeature.nonEmpty || - rgef.addressBookPhoneFeature.nonEmpty - } - - /** - * Convert an Edge into a RealGraphEdgeFeature. - * We return the converted RealGraphEdgeFeature when filterFn is true. - * This is to allow us to filter early on during the conversion if required, rather than map over the whole - * collection of records again to filter. 
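 - * For illustration only (this mirrors the call site in InteractionGraphNegativeJob, which is the
 - * source of truth; `edges: Seq[Edge]` is a hypothetical input here):
 - * {{{
 - * // keep only edges that carry at least one negative interaction feature
 - * val negativeOnly: Seq[RealGraphEdgeFeatures] =
 - *   edges.flatMap(ConversionUtil.toRealGraphEdgeFeatures(ConversionUtil.hasNegativeFeatures))
 - * }}}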
- * - * @param filterFn true if and only if we want to keep the converted feature - */ - def toRealGraphEdgeFeatures( - filterFn: RealGraphEdgeFeatures => Boolean - )( - e: Edge - ): Option[RealGraphEdgeFeatures] = { - val baseFeature = RealGraphEdgeFeatures(destId = e.destinationId) - val aggregatedFeature = e.features.foldLeft(baseFeature) { - case (aggregatedFeature, edgeFeature) => - val f = Some(toRealGraphEdgeFeatureV1(edgeFeature.tss)) - ScioMetrics.counter("toRealGraphEdgeFeatures", edgeFeature.name.name).inc() - edgeFeature.name match { - case FeatureName.NumRetweets => aggregatedFeature.copy(retweetsFeature = f) - case FeatureName.NumFavorites => aggregatedFeature.copy(favsFeature = f) - case FeatureName.NumMentions => aggregatedFeature.copy(mentionsFeature = f) - case FeatureName.NumTweetClicks => aggregatedFeature.copy(tweetClicksFeature = f) - case FeatureName.NumLinkClicks => aggregatedFeature.copy(linkClicksFeature = f) - case FeatureName.NumProfileViews => aggregatedFeature.copy(profileViewsFeature = f) - case FeatureName.TotalDwellTime => aggregatedFeature.copy(dwellTimeFeature = f) - case FeatureName.NumInspectedStatuses => - aggregatedFeature.copy(inspectedStatusesFeature = f) - case FeatureName.NumPhotoTags => aggregatedFeature.copy(photoTagsFeature = f) - case FeatureName.NumFollows => aggregatedFeature.copy(followFeature = f) - case FeatureName.NumMutualFollows => aggregatedFeature.copy(mutualFollowFeature = f) - case FeatureName.AddressBookEmail => aggregatedFeature.copy(addressBookEmailFeature = f) - case FeatureName.AddressBookPhone => aggregatedFeature.copy(addressBookPhoneFeature = f) - case FeatureName.AddressBookInBoth => aggregatedFeature.copy(addressBookInBothFeature = f) - case FeatureName.AddressBookMutualEdgeEmail => - aggregatedFeature.copy(addressBookMutualEdgeEmailFeature = f) - case FeatureName.AddressBookMutualEdgePhone => - aggregatedFeature.copy(addressBookMutualEdgePhoneFeature = f) - case FeatureName.AddressBookMutualEdgeInBoth => - aggregatedFeature.copy(addressBookMutualEdgeInBothFeature = f) - case FeatureName.NumTweetQuotes => aggregatedFeature.copy(numTweetQuotes = f) - case FeatureName.NumBlocks => aggregatedFeature.copy(numBlocks = f) - case FeatureName.NumMutes => aggregatedFeature.copy(numMutes = f) - case FeatureName.NumReportAsSpams => aggregatedFeature.copy(numReportAsSpams = f) - case FeatureName.NumReportAsAbuses => aggregatedFeature.copy(numReportAsAbuses = f) - case _ => aggregatedFeature - } - } - if (filterFn(aggregatedFeature)) - Some(aggregatedFeature.copy(weight = e.weight.orElse(Some(0.0)))) - else None - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/DateUtil.scala b/src/scala/com/twitter/interaction_graph/scio/common/DateUtil.scala deleted file mode 100644 index f791d538a..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/DateUtil.scala +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.twitter.util.Duration -import org.joda.time.Interval - -object DateUtil { - def embiggen(dateInterval: Interval, duration: Duration): Interval = { - - val days = duration.inDays - val newStart = dateInterval.getStart.minusDays(days) - val newEnd = dateInterval.getEnd.plusDays(days) - new Interval(newStart, newEnd) - } - - def subtract(dateInterval: Interval, duration: Duration): Interval = { - val days = duration.inDays - val newStart = dateInterval.getStart.minusDays(days) - val newEnd = dateInterval.getEnd.minusDays(days) - new Interval(newStart, newEnd) 
- } - - def prependDays(dateInterval: Interval, duration: Duration): Interval = { - val days = duration.inDays - val newStart = dateInterval.getStart.minusDays(days) - new Interval(newStart, dateInterval.getEnd.toInstant) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/EdgeFeatureCombiner.scala b/src/scala/com/twitter/interaction_graph/scio/common/EdgeFeatureCombiner.scala deleted file mode 100644 index 004a141bb..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/EdgeFeatureCombiner.scala +++ /dev/null @@ -1,350 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.spotify.scio.ScioMetrics -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.EdgeFeature -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.TimeSeriesStatistics - -object EdgeFeatureCombiner { - def apply(srcId: Long, destId: Long): EdgeFeatureCombiner = new EdgeFeatureCombiner( - instanceEdge = Edge(srcId, destId), - featureMap = Map( - FeatureName.NumRetweets -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumFavorites -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumMentions -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumTweetClicks -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumLinkClicks -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumProfileViews -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumFollows -> new BooleanOrEdgeCombiner, - FeatureName.NumUnfollows -> new BooleanOrEdgeCombiner, - FeatureName.NumMutualFollows -> new BooleanOrEdgeCombiner, - FeatureName.NumBlocks -> new BooleanOrEdgeCombiner, - FeatureName.NumMutes -> new BooleanOrEdgeCombiner, - FeatureName.NumReportAsAbuses -> new BooleanOrEdgeCombiner, - FeatureName.NumReportAsSpams -> new BooleanOrEdgeCombiner, - FeatureName.NumTweetQuotes -> new WeightedAdditiveEdgeCombiner, - FeatureName.AddressBookEmail -> new BooleanOrEdgeCombiner, - FeatureName.AddressBookPhone -> new BooleanOrEdgeCombiner, - FeatureName.AddressBookInBoth -> new BooleanOrEdgeCombiner, - FeatureName.AddressBookMutualEdgeEmail -> new BooleanOrEdgeCombiner, - FeatureName.AddressBookMutualEdgePhone -> new BooleanOrEdgeCombiner, - FeatureName.AddressBookMutualEdgeInBoth -> new BooleanOrEdgeCombiner, - FeatureName.TotalDwellTime -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumInspectedStatuses -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumPhotoTags -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumPushOpens -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumNtabClicks -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtMentions -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtReplies -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtRetweets -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtFavories -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtLinkClicks -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtTweetClicks -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumRtTweetQuotes -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumShares -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumEmailOpen -> new WeightedAdditiveEdgeCombiner, - FeatureName.NumEmailClick -> new WeightedAdditiveEdgeCombiner, - ) - ) -} - -/** - * This class can take in a number of input Edge thrift objects, (all of which are assumed to - * contain information about a single edge) and builds a combined Edge protobuf object, which has - * the union of 
all the input. - *
- * There are two modes of aggregation: one of them just adds the values in, assuming that these are - * from the same day, and the other adds them in a time-decayed manner using the passed-in weights. - *
    - * The input objects features must be disjoint. Also, remember that the edge is directed! - */ -class EdgeFeatureCombiner(instanceEdge: Edge, featureMap: Map[FeatureName, EFeatureCombiner]) { - - /** - * Adds features without any decay. To be used for the same day. - * - * @param edge edge to be added into the combiner - */ - def addFeature(edge: Edge): EdgeFeatureCombiner = { - - val newEdge = - if (edge.weight.isDefined) instanceEdge.copy(weight = edge.weight) else instanceEdge - val newFeatures = featureMap.map { - case (featureName, combiner) => - edge.features.find(_.name.equals(featureName)) match { - case Some(feature) => - val updatedCombiner = - if (combiner.isSet) combiner.updateFeature(feature) else combiner.setFeature(feature) - (featureName, updatedCombiner) - case _ => (featureName, combiner) - } - } - - new EdgeFeatureCombiner(newEdge, newFeatures) - - } - - /** - * Adds features with decays. Used for combining multiple days. - * - * @param edge edge to be added into the combiner - * @param alpha parameters for the decay calculation - * @param day number of days from today - */ - def addFeature(edge: Edge, alpha: Double, day: Int): EdgeFeatureCombiner = { - - val newEdge = if (edge.weight.isDefined) edge.copy(weight = edge.weight) else edge - val newFeatures = featureMap.map { - case (featureName, combiner) => - edge.features.find(_.name.equals(featureName)) match { - case Some(feature) => - val updatedCombiner = - if (combiner.isSet) combiner.updateFeature(feature, alpha, day) - else combiner.setFeature(feature, alpha, day) - ScioMetrics.counter("EdgeFeatureCombiner.addFeature", feature.name.name).inc() - (featureName, updatedCombiner) - case _ => (featureName, combiner) - } - } - new EdgeFeatureCombiner(newEdge, newFeatures) - } - - /** - * Generate the final combined Edge instance - * We return a deterministically sorted list of edge features - * - * @param totalDays total number of days to be combined together - */ - def getCombinedEdge(totalDays: Int): Edge = { - val moreFeatures = featureMap.values - .flatMap { combiner => - combiner.getFinalFeature(totalDays) - }.toList.sortBy(_.name.value) - instanceEdge.copy( - features = moreFeatures - ) - } - -} - -/** - * This portion contains the actual combination logic. For now, we only implement a simple - * additive combiner, but in future we'd like to have things like time-weighted (exponential - * decay, maybe) values. 
- */ - -trait EFeatureCombiner { - val edgeFeature: Option[EdgeFeature] - val startingDay: Int - val endingDay: Int - val timeSeriesStatistics: Option[TimeSeriesStatistics] - - def updateTSS(feature: EdgeFeature, alpha: Double): Option[TimeSeriesStatistics] - - def addToTSS(feature: EdgeFeature): Option[TimeSeriesStatistics] - - def updateFeature(feature: EdgeFeature): EFeatureCombiner - - def updateFeature(feature: EdgeFeature, alpha: Double, day: Int): EFeatureCombiner - - def isSet: Boolean - - def dropFeature: Boolean - - def setFeature(feature: EdgeFeature, alpha: Double, day: Int): EFeatureCombiner - - def setFeature(feature: EdgeFeature): EFeatureCombiner - - def getFinalFeature(totalDays: Int): Option[EdgeFeature] - -} - -case class WeightedAdditiveEdgeCombiner( - override val edgeFeature: Option[EdgeFeature] = None, - override val startingDay: Int = Integer.MAX_VALUE, - override val endingDay: Int = Integer.MIN_VALUE, - override val timeSeriesStatistics: Option[TimeSeriesStatistics] = None) - extends EFeatureCombiner { - - override def updateTSS( - feature: EdgeFeature, - alpha: Double - ): Option[TimeSeriesStatistics] = { - timeSeriesStatistics.map(tss => - InteractionGraphUtils.updateTimeSeriesStatistics(tss, feature.tss.mean, alpha)) - } - - override def addToTSS(feature: EdgeFeature): Option[TimeSeriesStatistics] = { - timeSeriesStatistics.map(tss => - InteractionGraphUtils.addToTimeSeriesStatistics(tss, feature.tss.mean)) - } - - override def updateFeature(feature: EdgeFeature): WeightedAdditiveEdgeCombiner = { - WeightedAdditiveEdgeCombiner( - edgeFeature, - startingDay, - endingDay, - addToTSS(feature) - ) - } - - def setFeature(feature: EdgeFeature, alpha: Double, day: Int): WeightedAdditiveEdgeCombiner = { - val newStartingDay = Math.min(startingDay, day) - val newEndingDay = Math.max(endingDay, day) - - val numDaysSinceLast = - if (feature.tss.numDaysSinceLast.exists(_ > 0)) - feature.tss.numDaysSinceLast - else Some(feature.tss.numElapsedDays - feature.tss.numNonZeroDays + 1) - - val tss = feature.tss.copy( - numDaysSinceLast = numDaysSinceLast, - ewma = alpha * feature.tss.ewma - ) - - val newFeature = EdgeFeature( - name = feature.name, - tss = tss - ) - - WeightedAdditiveEdgeCombiner( - Some(newFeature), - newStartingDay, - newEndingDay, - Some(tss) - ) - } - - def getFinalFeature(totalDays: Int): Option[EdgeFeature] = { - if (edgeFeature.isEmpty || dropFeature) return None - - val newTss = if (totalDays > 0) { - val elapsed = - timeSeriesStatistics.map(tss => tss.numElapsedDays + totalDays - 1 - startingDay) - - val latest = - if (endingDay > 0) Some(totalDays - endingDay) - else - timeSeriesStatistics.flatMap(tss => - tss.numDaysSinceLast.map(numDaysSinceLast => numDaysSinceLast + totalDays - 1)) - - timeSeriesStatistics.map(tss => - tss.copy( - numElapsedDays = elapsed.get, - numDaysSinceLast = latest - )) - } else timeSeriesStatistics - - edgeFeature.map(ef => ef.copy(tss = newTss.get)) - } - - override def updateFeature( - feature: EdgeFeature, - alpha: Double, - day: Int - ): WeightedAdditiveEdgeCombiner = copy( - endingDay = Math.max(endingDay, day), - timeSeriesStatistics = updateTSS(feature, alpha) - ) - - override def dropFeature: Boolean = timeSeriesStatistics.exists(tss => - tss.numDaysSinceLast.exists(_ > InteractionGraphUtils.MAX_DAYS_RETENTION) || - tss.ewma < InteractionGraphUtils.MIN_FEATURE_VALUE) - - override def isSet = edgeFeature.isDefined - - override def setFeature(feature: EdgeFeature): WeightedAdditiveEdgeCombiner = - setFeature(feature, 
1.0, 0) - -} - -/** - * This combiner resets the value to 0 if the latest event being combined = 0. Ignores time decays. - */ -case class BooleanOrEdgeCombiner( - override val edgeFeature: Option[EdgeFeature] = None, - override val startingDay: Int = Integer.MAX_VALUE, - override val endingDay: Int = Integer.MIN_VALUE, - override val timeSeriesStatistics: Option[TimeSeriesStatistics] = None) - extends EFeatureCombiner { - - override def updateTSS( - feature: EdgeFeature, - alpha: Double - ): Option[TimeSeriesStatistics] = { - val value = timeSeriesStatistics.map(tss => Math.floor(tss.ewma)) - val newValue = if (value.exists(_ == 1.0) || feature.tss.mean > 0.0) 1.0 else 0.0 - timeSeriesStatistics.map(tss => - tss.copy( - mean = newValue, - ewma = newValue, - numNonZeroDays = tss.numNonZeroDays + 1 - )) - } - - override def addToTSS(feature: EdgeFeature): Option[TimeSeriesStatistics] = { - val value = timeSeriesStatistics.map(tss => Math.floor(tss.ewma)) - val newValue = if (value.exists(_ == 1.0) || feature.tss.mean > 0.0) 1.0 else 0.0 - timeSeriesStatistics.map(tss => tss.copy(mean = newValue, ewma = newValue)) - } - - override def updateFeature(feature: EdgeFeature): BooleanOrEdgeCombiner = BooleanOrEdgeCombiner( - edgeFeature, - startingDay, - endingDay, - addToTSS(feature) - ) - - def setFeature(feature: EdgeFeature, alpha: Double, day: Int): BooleanOrEdgeCombiner = { - val newStartingDay = Math.min(startingDay, day) - val newEndingDay = Math.max(endingDay, day) - - val numDaysSinceLast = - if (feature.tss.numDaysSinceLast.exists(_ > 0)) - feature.tss.numDaysSinceLast.get - else feature.tss.numElapsedDays - feature.tss.numNonZeroDays + 1 - - val tss = feature.tss.copy( - numDaysSinceLast = Some(numDaysSinceLast), - ewma = alpha * feature.tss.ewma - ) - - val newFeature = EdgeFeature( - name = feature.name, - tss = tss - ) - - BooleanOrEdgeCombiner( - Some(newFeature), - newStartingDay, - newEndingDay, - Some(tss) - ) - } - - override def getFinalFeature(totalDays: Int): Option[EdgeFeature] = - if (timeSeriesStatistics.exists(tss => tss.ewma < 1.0)) None - else { - if (edgeFeature.isEmpty || dropFeature) return None - edgeFeature.map(ef => - ef.copy( - tss = timeSeriesStatistics.get - )) - } - - override def updateFeature( - feature: EdgeFeature, - alpha: Double, - day: Int - ): BooleanOrEdgeCombiner = copy( - endingDay = Math.max(endingDay, day), - timeSeriesStatistics = updateTSS(feature, alpha) - ) - - override def dropFeature: Boolean = false // we will keep rolling up status-based features - - override def isSet = edgeFeature.isDefined - - override def setFeature(feature: EdgeFeature): BooleanOrEdgeCombiner = setFeature(feature, 1.0, 0) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/FeatureGeneratorUtil.scala b/src/scala/com/twitter/interaction_graph/scio/common/FeatureGeneratorUtil.scala deleted file mode 100644 index 56c403522..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/FeatureGeneratorUtil.scala +++ /dev/null @@ -1,263 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.spotify.scio.ScioMetrics -import com.spotify.scio.values.SCollection -import com.twitter.interaction_graph.scio.common.FeatureGroups.DWELL_TIME_FEATURE_LIST -import com.twitter.interaction_graph.scio.common.FeatureGroups.STATUS_FEATURE_LIST -import com.twitter.interaction_graph.scio.common.UserUtil.DUMMY_USER_ID -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.EdgeFeature -import 
com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.TimeSeriesStatistics -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.interaction_graph.thriftscala.VertexFeature - -object FeatureGeneratorUtil { - - // Initialize a TimeSeriesStatistics object by (value, age) pair - def initializeTSS(featureValue: Double, age: Int = 1): TimeSeriesStatistics = - TimeSeriesStatistics( - mean = featureValue, - m2ForVariance = 0.0, - ewma = featureValue, - numElapsedDays = age, - numNonZeroDays = age, - numDaysSinceLast = Some(age) - ) - - /** - * Create vertex feature from InteractionGraphRawInput graph (src, dst, feature name, age, featureValue) - * We will represent non-directional features (eg num_create_tweets) as "outgoing" values. - * @return - */ - def getVertexFeature( - input: SCollection[InteractionGraphRawInput] - ): SCollection[Vertex] = { - // For vertex features we need to calculate both in and out featureValue - val vertexAggregatedFeatureValues = input - .flatMap { input => - if (input.dst != DUMMY_USER_ID) { - Seq( - ((input.src, input.name.value), (input.featureValue, 0.0)), - ((input.dst, input.name.value), (0.0, input.featureValue)) - ) - } else { - // we put the non-directional features as "outgoing" values - Seq(((input.src, input.name.value), (input.featureValue, 0.0))) - } - } - .sumByKey - .map { - case ((userId, nameId), (outEdges, inEdges)) => - (userId, (FeatureName(nameId), outEdges, inEdges)) - }.groupByKey - - vertexAggregatedFeatureValues.map { - case (userId, records) => - // sort features by FeatureName for deterministic order (esp during testing) - val features = records.toSeq.sortBy(_._1.value).flatMap { - case (name, outEdges, inEdges) => - // create out vertex features - val outFeatures = if (outEdges > 0) { - val outTss = initializeTSS(outEdges) - List( - VertexFeature( - name = name, - outgoing = true, - tss = outTss - )) - } else Nil - - // create in vertex features - val inFeatures = if (inEdges > 0) { - val inTss = initializeTSS(inEdges) - List( - VertexFeature( - name = name, - outgoing = false, - tss = inTss - )) - } else Nil - - outFeatures ++ inFeatures - } - Vertex(userId = userId, features = features) - } - } - - /** - * Create edge feature from InteractionGraphRawInput graph (src, dst, feature name, age, featureValue) - * We will exclude all non-directional features (eg num_create_tweets) from all edge aggregates - */ - def getEdgeFeature( - input: SCollection[InteractionGraphRawInput] - ): SCollection[Edge] = { - input - .withName("filter non-directional features") - .flatMap { input => - if (input.dst != DUMMY_USER_ID) { - ScioMetrics.counter("getEdgeFeature", s"directional feature ${input.name.name}").inc() - Some(((input.src, input.dst), (input.name, input.age, input.featureValue))) - } else { - ScioMetrics.counter("getEdgeFeature", s"non-directional feature ${input.name.name}").inc() - None - } - } - .withName("group features by pairs") - .groupByKey - .map { - case ((src, dst), records) => - // sort features by FeatureName for deterministic order (esp during testing) - val features = records.toSeq.sortBy(_._1.value).map { - case (name, age, featureValue) => - val tss = initializeTSS(featureValue, age) - EdgeFeature( - name = name, - tss = tss - ) - } - Edge( - sourceId = src, - destinationId = dst, - weight = Some(0.0), - features = features.toSeq - ) - } - } - - // For same user id, combine different vertex feature records into one record - // The input will assume for 
each (userId, featureName, direction), there will be only one record - def combineVertexFeatures( - vertex: SCollection[Vertex], - ): SCollection[Vertex] = { - vertex - .groupBy { v: Vertex => - v.userId - } - .map { - case (userId, vertexes) => - val combiner = vertexes.foldLeft(VertexFeatureCombiner(userId)) { - case (combiner, vertex) => - combiner.addFeature(vertex) - } - combiner.getCombinedVertex(0) - } - - } - - def combineEdgeFeatures( - edge: SCollection[Edge] - ): SCollection[Edge] = { - edge - .groupBy { e => - (e.sourceId, e.destinationId) - } - .withName("combining edge features for each (src, dst)") - .map { - case ((src, dst), edges) => - val combiner = edges.foldLeft(EdgeFeatureCombiner(src, dst)) { - case (combiner, edge) => - combiner.addFeature(edge) - } - combiner.getCombinedEdge(0) - } - } - - def combineVertexFeaturesWithDecay( - history: SCollection[Vertex], - daily: SCollection[Vertex], - historyWeight: Double, - dailyWeight: Double - ): SCollection[Vertex] = { - - history - .keyBy(_.userId) - .cogroup(daily.keyBy(_.userId)).map { - case (userId, (h, d)) => - // Adding history iterators - val historyCombiner = h.toList.foldLeft(VertexFeatureCombiner(userId)) { - case (combiner, vertex) => - combiner.addFeature(vertex, historyWeight, 0) - } - // Adding daily iterators - val finalCombiner = d.toList.foldLeft(historyCombiner) { - case (combiner, vertex) => - combiner.addFeature(vertex, dailyWeight, 1) - } - - finalCombiner.getCombinedVertex( - 2 - ) // 2 means totally we have 2 days(yesterday and today) data to combine together - } - } - - def combineEdgeFeaturesWithDecay( - history: SCollection[Edge], - daily: SCollection[Edge], - historyWeight: Double, - dailyWeight: Double - ): SCollection[Edge] = { - - history - .keyBy { e => - (e.sourceId, e.destinationId) - } - .withName("combine history and daily edges with decay") - .cogroup(daily.keyBy { e => - (e.sourceId, e.destinationId) - }).map { - case ((src, dst), (h, d)) => - //val combiner = EdgeFeatureCombiner(src, dst) - // Adding history iterators - - val historyCombiner = h.toList.foldLeft(EdgeFeatureCombiner(src, dst)) { - case (combiner, edge) => - combiner.addFeature(edge, historyWeight, 0) - } - - val finalCombiner = d.toList.foldLeft(historyCombiner) { - case (combiner, edge) => - combiner.addFeature(edge, dailyWeight, 1) - } - - finalCombiner.getCombinedEdge( - 2 - ) // 2 means totally we have 2 days(yesterday and today) data to combine together - - } - } - - /** - * Create features from following graph (src, dst, age, featureValue) - * Note that we will filter out vertex features represented as edges from the edge output. 
- */ - def getFeatures( - input: SCollection[InteractionGraphRawInput] - ): (SCollection[Vertex], SCollection[Edge]) = { - (getVertexFeature(input), getEdgeFeature(input)) - } - - // remove the edge features that from flock, address book or sms as we will refresh them on a daily basis - def removeStatusFeatures(e: Edge): Seq[Edge] = { - val updatedFeatureList = e.features.filter { e => - !STATUS_FEATURE_LIST.contains(e.name) - } - if (updatedFeatureList.size > 0) { - val edge = Edge( - sourceId = e.sourceId, - destinationId = e.destinationId, - weight = e.weight, - features = updatedFeatureList - ) - Seq(edge) - } else - Nil - } - - // check if the edge feature has features other than dwell time feature - def edgeWithFeatureOtherThanDwellTime(e: Edge): Boolean = { - e.features.exists { f => - !DWELL_TIME_FEATURE_LIST.contains(f.name) - } - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/FeatureGroups.scala b/src/scala/com/twitter/interaction_graph/scio/common/FeatureGroups.scala deleted file mode 100644 index 89887be99..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/FeatureGroups.scala +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.twitter.interaction_graph.thriftscala.FeatureName - -object FeatureGroups { - - val HEALTH_FEATURE_LIST: Set[FeatureName] = Set( - FeatureName.NumMutes, - FeatureName.NumBlocks, - FeatureName.NumReportAsSpams, - FeatureName.NumReportAsAbuses - ) - - val STATUS_FEATURE_LIST: Set[FeatureName] = Set( - FeatureName.AddressBookEmail, - FeatureName.AddressBookPhone, - FeatureName.AddressBookInBoth, - FeatureName.AddressBookMutualEdgeEmail, - FeatureName.AddressBookMutualEdgePhone, - FeatureName.AddressBookMutualEdgeInBoth, - FeatureName.NumFollows, - FeatureName.NumUnfollows, - FeatureName.NumMutualFollows - ) ++ HEALTH_FEATURE_LIST - - val DWELL_TIME_FEATURE_LIST: Set[FeatureName] = Set( - FeatureName.TotalDwellTime, - FeatureName.NumInspectedStatuses - ) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/GraphUtil.scala b/src/scala/com/twitter/interaction_graph/scio/common/GraphUtil.scala deleted file mode 100644 index f94c136df..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/GraphUtil.scala +++ /dev/null @@ -1,93 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.spotify.scio.ScioMetrics -import com.spotify.scio.values.SCollection -import com.twitter.socialgraph.presto.thriftscala.{Edge => SocialGraphEdge} -import com.twitter.flockdb.tools.datasets.flock.thriftscala.FlockEdge -import com.twitter.interaction_graph.scio.common.FeatureGroups.HEALTH_FEATURE_LIST -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.FeatureName - -import java.time.Instant -import java.time.temporal.ChronoUnit - -object GraphUtil { - - /** - * Convert FlockEdge into common InteractionGraphRawInput class. - * updatedAt field in socialgraph.unfollows is in seconds. 
- */ - def getFlockFeatures( - edges: SCollection[FlockEdge], - featureName: FeatureName, - currentTimeMillis: Long - ): SCollection[InteractionGraphRawInput] = { - edges - .withName(s"${featureName.toString} - Converting flock edge to interaction graph input") - .map { edge => - val age = ChronoUnit.DAYS.between( - Instant.ofEpochMilli(edge.updatedAt * 1000L), // updatedAt is in seconds - Instant.ofEpochMilli(currentTimeMillis) - ) - InteractionGraphRawInput( - edge.sourceId, - edge.destinationId, - featureName, - age.max(0).toInt, - 1.0) - } - } - - /** - * Convert com.twitter.socialgraph.presto.thriftscala.Edge (from unfollows) into common InteractionGraphRawInput class. - * updatedAt field in socialgraph.unfollows is in seconds. - */ - def getSocialGraphFeatures( - edges: SCollection[SocialGraphEdge], - featureName: FeatureName, - currentTimeMillis: Long - ): SCollection[InteractionGraphRawInput] = { - edges - .withName(s"${featureName.toString} - Converting flock edge to interaction graph input") - .map { edge => - val age = ChronoUnit.DAYS.between( - Instant.ofEpochMilli(edge.updatedAt * 1000L), // updatedAt is in seconds - Instant.ofEpochMilli(currentTimeMillis) - ) - InteractionGraphRawInput( - edge.sourceId, - edge.destinationId, - featureName, - age.max(0).toInt, - 1.0) - } - } - def isFollow(edge: Edge): Boolean = { - val result = edge.features - .find(_.name == FeatureName.NumFollows) - .exists(_.tss.mean == 1.0) - result - } - - def filterExtremes(edge: Edge): Boolean = { - if (edge.weight.exists(_.isNaN)) { - ScioMetrics.counter("filter extremes", "nan").inc() - false - } else if (edge.weight.contains(Double.MaxValue)) { - ScioMetrics.counter("filter extremes", "max value").inc() - false - } else if (edge.weight.contains(Double.PositiveInfinity)) { - ScioMetrics.counter("filter extremes", "+ve inf").inc() - false - } else if (edge.weight.exists(_ < 0.0)) { - ScioMetrics.counter("filter extremes", "negative").inc() - false - } else { - true - } - } - - def filterNegative(edge: Edge): Boolean = { - !edge.features.find(ef => HEALTH_FEATURE_LIST.contains(ef.name)).exists(_.tss.mean > 0.0) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/InteractionGraphUtils.scala b/src/scala/com/twitter/interaction_graph/scio/common/InteractionGraphUtils.scala deleted file mode 100644 index be6aa0153..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/InteractionGraphUtils.scala +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.twitter.interaction_graph.thriftscala.TimeSeriesStatistics - -object InteractionGraphUtils { - final val MIN_FEATURE_VALUE = Math.pow(0.955, 60) - final val MAX_DAYS_RETENTION = 60L - final val MILLISECONDS_PER_DAY = 1000 * 60 * 60 * 24 - - def updateTimeSeriesStatistics( - timeSeriesStatistics: TimeSeriesStatistics, - currValue: Double, - alpha: Double - ): TimeSeriesStatistics = { - val numNonZeroDays = timeSeriesStatistics.numNonZeroDays + 1 - - val delta = currValue - timeSeriesStatistics.mean - val updatedMean = timeSeriesStatistics.mean + delta / numNonZeroDays - val m2ForVariance = timeSeriesStatistics.m2ForVariance + delta * (currValue - updatedMean) - val ewma = alpha * currValue + timeSeriesStatistics.ewma - - timeSeriesStatistics.copy( - mean = updatedMean, - m2ForVariance = m2ForVariance, - ewma = ewma, - numNonZeroDays = numNonZeroDays - ) - } - - def addToTimeSeriesStatistics( - timeSeriesStatistics: TimeSeriesStatistics, - currValue: Double - ): TimeSeriesStatistics = { - 
timeSeriesStatistics.copy( - mean = timeSeriesStatistics.mean + currValue, - ewma = timeSeriesStatistics.ewma + currValue - ) - } - -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/UserUtil.scala b/src/scala/com/twitter/interaction_graph/scio/common/UserUtil.scala deleted file mode 100644 index 39ac51006..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/UserUtil.scala +++ /dev/null @@ -1,76 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.spotify.scio.coders.Coder -import com.spotify.scio.values.SCollection -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser - -object UserUtil { - - /** - * placeholder for the destId when representing vertex features with no dest (eg create tweet) - * this will only be aggregated and saved in the vertex datasets but not the edge datasets - */ - val DUMMY_USER_ID = -1L - def getValidUsers(users: SCollection[CombinedUser]): SCollection[Long] = { - users - .flatMap { u => - for { - user <- u.user - if user.id != 0 - safety <- user.safety - if !(safety.suspended || safety.deactivated || safety.restricted || - safety.nsfwUser || safety.nsfwAdmin || safety.erased) - } yield { - user.id - } - } - } - - def getValidFlatUsers(users: SCollection[FlatUser]): SCollection[Long] = { - users - .flatMap { u => - for { - id <- u.id - if id != 0 && u.validUser.contains(true) - } yield { - id - } - } - } - - def getInvalidUsers(users: SCollection[FlatUser]): SCollection[Long] = { - users - .flatMap { user => - for { - valid <- user.validUser - if !valid - id <- user.id - } yield id - } - } - - def filterUsersByIdMapping[T: Coder]( - input: SCollection[T], - usersToBeFiltered: SCollection[Long], - userIdMapping: T => Long - ): SCollection[T] = { - input - .withName("filter users by id") - .keyBy(userIdMapping(_)) - .leftOuterJoin[Long](usersToBeFiltered.map(x => (x, x))) - .collect { - // only return data if the key is not in the list of usersToBeFiltered - case (_, (data, None)) => data - } - } - - def filterUsersByMultipleIdMappings[T: Coder]( - input: SCollection[T], - usersToBeFiltered: SCollection[Long], - userIdMappings: Seq[T => Long] - ): SCollection[T] = { - userIdMappings.foldLeft(input)((data, mapping) => - filterUsersByIdMapping(data, usersToBeFiltered, mapping)) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/common/VertexFeatureCombiner.scala b/src/scala/com/twitter/interaction_graph/scio/common/VertexFeatureCombiner.scala deleted file mode 100644 index fb7ae7947..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/common/VertexFeatureCombiner.scala +++ /dev/null @@ -1,342 +0,0 @@ -package com.twitter.interaction_graph.scio.common - -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.TimeSeriesStatistics -import com.twitter.interaction_graph.thriftscala.Vertex -import com.twitter.interaction_graph.thriftscala.VertexFeature - -object VertexFeatureCombiner { - def apply(userId: Long): VertexFeatureCombiner = new VertexFeatureCombiner( - instanceVertex = Vertex(userId), - featureMap = Map( - (FeatureName.NumRetweets, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRetweets, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumFavorites, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumFavorites, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumMentions, true) -> new 
WeightedAdditiveVertexCombiner, - (FeatureName.NumMentions, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumTweetClicks, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumTweetClicks, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumLinkClicks, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumLinkClicks, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumProfileViews, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumProfileViews, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumFollows, true) -> new ReplacementVertexCombiner, - (FeatureName.NumFollows, false) -> new ReplacementVertexCombiner, - (FeatureName.NumUnfollows, true) -> new ReplacementVertexCombiner, - (FeatureName.NumUnfollows, false) -> new ReplacementVertexCombiner, - (FeatureName.NumMutualFollows, true) -> new ReplacementVertexCombiner, - (FeatureName.NumBlocks, true) -> new ReplacementVertexCombiner, - (FeatureName.NumBlocks, false) -> new ReplacementVertexCombiner, - (FeatureName.NumMutes, true) -> new ReplacementVertexCombiner, - (FeatureName.NumMutes, false) -> new ReplacementVertexCombiner, - (FeatureName.NumReportAsAbuses, true) -> new ReplacementVertexCombiner, - (FeatureName.NumReportAsAbuses, false) -> new ReplacementVertexCombiner, - (FeatureName.NumReportAsSpams, true) -> new ReplacementVertexCombiner, - (FeatureName.NumReportAsSpams, false) -> new ReplacementVertexCombiner, - (FeatureName.NumTweetQuotes, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumTweetQuotes, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumMutualFollows, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookEmail, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookEmail, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookPhone, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookPhone, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookInBoth, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookInBoth, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgeEmail, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgeEmail, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgePhone, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgePhone, false) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgeInBoth, true) -> new ReplacementVertexCombiner, - (FeatureName.AddressBookMutualEdgeInBoth, false) -> new ReplacementVertexCombiner, - (FeatureName.TotalDwellTime, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.TotalDwellTime, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumInspectedStatuses, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumInspectedStatuses, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumPhotoTags, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumPhotoTags, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumPushOpens, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumPushOpens, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumNtabClicks, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumNtabClicks, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtFavories, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtFavories, false) -> new 
WeightedAdditiveVertexCombiner, - (FeatureName.NumRtTweetQuotes, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtTweetQuotes, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtTweetClicks, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtTweetClicks, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtRetweets, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtRetweets, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtReplies, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtReplies, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtLinkClicks, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtLinkClicks, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtMentions, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumRtMentions, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumShares, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumShares, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumEmailOpen, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumEmailOpen, false) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumEmailClick, true) -> new WeightedAdditiveVertexCombiner, - (FeatureName.NumEmailClick, false) -> new WeightedAdditiveVertexCombiner, - ) - ) -} - -/** - * This class can take in a number of input Vertex thrift objects (all of which are assumed to - * contain information about a single vertex) and builds a combined Vertex protobuf object, which - * has the union of all the input. Note that we do a weighted addition for a time-decayed value. - *
    - * The input objects features must be disjoint. Also, remember that the Vertex is directed! - */ -class VertexFeatureCombiner( - instanceVertex: Vertex, - featureMap: Map[(FeatureName, Boolean), VFeatureCombiner]) { - - /** - * Adds features without any decay. To be used for the same day. - * - * @param vertex vertex to be added into the combiner - */ - def addFeature(vertex: Vertex): VertexFeatureCombiner = { - val newVertex = instanceVertex.copy(weight = vertex.weight) - val newFeatures = featureMap.map { - case ((featureName, outgoing), combiner) => - vertex.features.find(f => f.name.equals(featureName) && f.outgoing.equals(outgoing)) match { - case Some(feature) => - val updatedCombiner = - if (combiner.isSet) combiner.updateFeature(feature) else combiner.setFeature(feature) - ((featureName, outgoing), updatedCombiner) - case _ => ((featureName, outgoing), combiner) - } - } - - new VertexFeatureCombiner(newVertex, newFeatures) - } - - /** - * Adds features with decays. Used for combining multiple days. - * - * @param vertex vertex to be added into the combiner - * @param alpha parameters for the decay calculation - * @param day number of days from today - */ - def addFeature(vertex: Vertex, alpha: Double, day: Int): VertexFeatureCombiner = { - - val newVertex = instanceVertex.copy(weight = vertex.weight) - val newFeatures = featureMap.map { - case ((featureName, outgoing), combiner) => - vertex.features.find(f => f.name.equals(featureName) && f.outgoing.equals(outgoing)) match { - case Some(feature) => - val updatedCombiner = - if (combiner.isSet) combiner.updateFeature(feature, alpha, day) - else combiner.setFeature(feature, alpha, day) - ((featureName, outgoing), updatedCombiner) - case _ => ((featureName, outgoing), combiner) - } - } - - new VertexFeatureCombiner(newVertex, newFeatures) - } - - /** - * Generate the final combined Vertex instance - * - * @param totalDays total number of days to be combined together - */ - def getCombinedVertex(totalDays: Int): Vertex = { - val moreFeatures = featureMap.values.flatMap { - case combiner => combiner.getFinalFeature(totalDays) - } - instanceVertex.copy(features = moreFeatures.toSeq) - } - -} - -/** - * This portion contains the actual combination logic. For now, we only implement a simple - * additive combiner, but in future we'd like to have things like time-weighted (exponential - * decay, maybe) values. 
- */ -trait VFeatureCombiner { - val startingDay: Int - val endingDay: Int - val timeSeriesStatistics: Option[TimeSeriesStatistics] - val vertexFeature: Option[VertexFeature] - - def updateTss(feature: VertexFeature, alpha: Double): VFeatureCombiner - def addToTss(feature: VertexFeature): VFeatureCombiner - def updateFeature(feature: VertexFeature, alpha: Double, day: Int): VFeatureCombiner - def updateFeature(feature: VertexFeature): VFeatureCombiner - def isSet: Boolean - def dropFeature: Boolean - def setFeature(feature: VertexFeature, alpha: Double, day: Int): VFeatureCombiner - def setFeature(feature: VertexFeature): VFeatureCombiner - def getFinalFeature(totalDays: Int): Option[VertexFeature] -} - -case class WeightedAdditiveVertexCombiner( - override val vertexFeature: Option[VertexFeature] = None, - override val startingDay: Int = Integer.MAX_VALUE, - override val endingDay: Int = Integer.MIN_VALUE, - override val timeSeriesStatistics: Option[TimeSeriesStatistics] = None) - extends VFeatureCombiner { - override def updateTss( - feature: VertexFeature, - alpha: Double - ): WeightedAdditiveVertexCombiner = copy(timeSeriesStatistics = timeSeriesStatistics.map(tss => - InteractionGraphUtils.updateTimeSeriesStatistics(tss, feature.tss.mean, alpha))) - - override def addToTss(feature: VertexFeature): WeightedAdditiveVertexCombiner = - copy(timeSeriesStatistics = timeSeriesStatistics.map(tss => - InteractionGraphUtils.addToTimeSeriesStatistics(tss, feature.tss.mean))) - - override def updateFeature(feature: VertexFeature, alpha: Double, day: Int): VFeatureCombiner = { - updateTss(feature, alpha).copy( - vertexFeature, - startingDay = startingDay, - endingDay = Math.max(endingDay, day) - ) - } - - override def updateFeature(feature: VertexFeature): VFeatureCombiner = - addToTss(feature) - - override def setFeature(feature: VertexFeature, alpha: Double, day: Int): VFeatureCombiner = { - val newStartingDay = Math.min(startingDay, day) - val newEndingDay = Math.max(endingDay, day) - - val numDaysSinceLast = - if (feature.tss.numDaysSinceLast.exists(_ > 0)) - feature.tss.numDaysSinceLast - else Some(feature.tss.numElapsedDays - feature.tss.numNonZeroDays + 1) - - val tss = feature.tss.copy(numDaysSinceLast = numDaysSinceLast) - - val newFeature = VertexFeature( - name = feature.name, - outgoing = feature.outgoing, - tss = tss - ) - - WeightedAdditiveVertexCombiner( - Some(newFeature), - newStartingDay, - newEndingDay, - Some(tss) - ) - } - - def getFinalFeature(totalDays: Int): Option[VertexFeature] = { - if (vertexFeature.isEmpty || dropFeature) return None - - val newTss = if (totalDays > 0) { - val elapsed = - timeSeriesStatistics.map(tss => tss.numElapsedDays + totalDays - 1 - startingDay) - val latest = - if (endingDay > 0) Some(totalDays - endingDay) - else timeSeriesStatistics.map(tss => tss.numDaysSinceLast.get + totalDays - 1) - - timeSeriesStatistics.map(tss => - tss.copy( - numElapsedDays = elapsed.get, - numDaysSinceLast = latest - )) - } else timeSeriesStatistics - - vertexFeature.map(vf => vf.copy(tss = newTss.get)) - } - - override def setFeature(feature: VertexFeature): VFeatureCombiner = setFeature(feature, 1.0, 0) - override def isSet: Boolean = vertexFeature.isDefined - override def dropFeature: Boolean = - timeSeriesStatistics.exists(tss => - tss.numDaysSinceLast.exists(_ > InteractionGraphUtils.MAX_DAYS_RETENTION) && - tss.ewma < InteractionGraphUtils.MIN_FEATURE_VALUE) -} - -/** - * This combiner always replaces the old value with the current. Ignores time-decays. 
- */ -case class ReplacementVertexCombiner( - override val vertexFeature: Option[VertexFeature] = None, - override val startingDay: Int = Integer.MAX_VALUE, - override val endingDay: Int = Integer.MIN_VALUE, - override val timeSeriesStatistics: Option[TimeSeriesStatistics] = None) - extends VFeatureCombiner { - override def updateTss( - feature: VertexFeature, - alpha: Double - ): ReplacementVertexCombiner = setFeature(feature, 1.0, 0) - - override def addToTss(feature: VertexFeature): ReplacementVertexCombiner = - setFeature(feature, 1.0, 0) - - override def updateFeature( - feature: VertexFeature, - alpha: Double, - day: Int - ): ReplacementVertexCombiner = updateTss(feature, alpha).copy( - vertexFeature, - startingDay = startingDay, - endingDay = Math.max(endingDay, day) - ) - - override def updateFeature(feature: VertexFeature): ReplacementVertexCombiner = - addToTss(feature) - - override def setFeature( - feature: VertexFeature, - alpha: Double, - day: Int - ): ReplacementVertexCombiner = { - val newStartingDay = Math.min(startingDay, day) - val newEndingDay = Math.max(endingDay, day) - - val numDaysSinceLast = - if (feature.tss.numDaysSinceLast.exists(_ > 0)) - feature.tss.numDaysSinceLast - else Some(feature.tss.numElapsedDays - feature.tss.numNonZeroDays + 1) - - val tss = feature.tss.copy(numDaysSinceLast = numDaysSinceLast) - - val newFeature = VertexFeature( - name = feature.name, - outgoing = feature.outgoing, - tss = tss - ) - - ReplacementVertexCombiner( - Some(newFeature), - newStartingDay, - newEndingDay, - Some(tss) - ) - } - - override def getFinalFeature(totalDays: Int): Option[VertexFeature] = { - if (vertexFeature.isEmpty || dropFeature) return None - if (timeSeriesStatistics.exists(tss => tss.ewma < 1.0)) return None - val newTss = if (totalDays > 0) { - val latest = - if (endingDay > 0) totalDays - endingDay - else timeSeriesStatistics.get.numDaysSinceLast.get + totalDays - 1 - - timeSeriesStatistics.map(tss => - tss.copy( - numElapsedDays = 1, - numDaysSinceLast = Some(latest) - )) - } else timeSeriesStatistics - - vertexFeature.map(vf => vf.copy(tss = newTss.get)) - } - - override def setFeature(feature: VertexFeature): VFeatureCombiner = setFeature(feature, 1.0, 0) - override def isSet: Boolean = vertexFeature.isDefined - override def dropFeature: Boolean = - timeSeriesStatistics.exists(tss => - tss.numDaysSinceLast.exists(_ > InteractionGraphUtils.MAX_DAYS_RETENTION) && - tss.ewma < InteractionGraphUtils.MIN_FEATURE_VALUE) -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/labels/BUILD b/src/scala/com/twitter/interaction_graph/scio/ml/labels/BUILD deleted file mode 100644 index f06c0c08d..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/labels/BUILD +++ /dev/null @@ -1,49 +0,0 @@ -scala_library( - name = "labels", - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":interaction_graph_labels_daily-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "consumer-data-tools/src/main/scala/com/twitter/cde/scio/dal_read", - "socialgraph/hadoop/src/main/scala/com/twitter/socialgraph/hadoop:socialgraph-follow-events-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_client_event_logs:interaction_graph_agg_client_event_logs_edge_daily-scala", - 
"src/scala/com/twitter/interaction_graph/scio/agg_direct_interactions:interaction_graph_agg_direct_interactions_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_email:interaction_graph_extended_email_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_notifications:interaction_graph_agg_notifications_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_retweets:interaction_graph_extended_retweet_edge_daily-scala", - "src/scala/com/twitter/interaction_graph/scio/agg_shares:interaction_graph_extended_share_edge_daily-scala", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - ], -) - -jvm_binary( - name = "interaction_graph_labels", - main = "com.twitter.interaction_graph.scio.ml.labels.InteractionGraphLabelsJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":labels", - ], -) - -create_datasets( - base_name = "interaction_graph_labels_daily", - description = "Daily labels", - java_schema = "com.twitter.interaction_graph.thriftjava.EdgeLabel", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.interaction_graph.thriftscala.EdgeLabel", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/interaction_graph:interaction_graph-scala", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsJob.scala b/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsJob.scala deleted file mode 100644 index a6d9999c8..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsJob.scala +++ /dev/null @@ -1,123 +0,0 @@ -package com.twitter.interaction_graph.scio.ml.labels - -import com.google.api.services.bigquery.model.TimePartitioning -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.DiskFormat -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.io.fs.multiformat.WriteOptions -import com.twitter.beam.job.ServiceIdentifierOptions -import com.twitter.cde.scio.dal_read.SourceUtil -import com.twitter.conversions.DurationOps._ -import com.twitter.dal.client.dataset.TimePartitionedDALDataset -import com.twitter.interaction_graph.scio.agg_client_event_logs.InteractionGraphAggClientEventLogsEdgeDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_direct_interactions.InteractionGraphAggDirectInteractionsEdgeDailyScalaDataset -import com.twitter.interaction_graph.scio.agg_notifications.InteractionGraphAggNotificationsEdgeDailyScalaDataset -import com.twitter.interaction_graph.thriftscala.Edge -import com.twitter.interaction_graph.thriftscala.EdgeLabel -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.socialgraph.event.thriftscala.FollowEvent -import com.twitter.socialgraph.hadoop.SocialgraphFollowEventsScalaDataset -import com.twitter.statebird.v2.thriftscala.Environment -import com.twitter.tcdc.bqblaster.beam.syntax._ -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.joda.time.Interval - -object InteractionGraphLabelsJob extends ScioBeamJob[InteractionGraphLabelsOption] { - - override protected def configurePipeline( - scioContext: 
ScioContext, - pipelineOptions: InteractionGraphLabelsOption - ): Unit = { - @transient - implicit lazy val sc: ScioContext = scioContext - implicit lazy val dateInterval: Interval = pipelineOptions.interval - - val bqTableName: String = pipelineOptions.getBqTableName - val dalEnvironment: String = pipelineOptions - .as(classOf[ServiceIdentifierOptions]) - .getEnvironment() - val dalWriteEnvironment = if (pipelineOptions.getDALWriteEnvironment != null) { - pipelineOptions.getDALWriteEnvironment - } else { - dalEnvironment - } - - def readPartition[T: Manifest](dataset: TimePartitionedDALDataset[T]): SCollection[T] = { - SourceUtil.readDALDataset[T]( - dataset = dataset, - interval = dateInterval, - dalEnvironment = dalEnvironment - ) - } - - val follows = readPartition[FollowEvent](SocialgraphFollowEventsScalaDataset) - .flatMap(LabelUtil.fromFollowEvent) - - val directInteractions = - readPartition[Edge](InteractionGraphAggDirectInteractionsEdgeDailyScalaDataset) - .flatMap(LabelUtil.fromInteractionGraphEdge) - - val clientEvents = - readPartition[Edge](InteractionGraphAggClientEventLogsEdgeDailyScalaDataset) - .flatMap(LabelUtil.fromInteractionGraphEdge) - - val pushEvents = - readPartition[Edge](InteractionGraphAggNotificationsEdgeDailyScalaDataset) - .flatMap(LabelUtil.fromInteractionGraphEdge) - - - val labels = groupLabels( - follows ++ - directInteractions ++ - clientEvents ++ - pushEvents) - - labels.saveAsCustomOutput( - "Write Edge Labels", - DAL.write[EdgeLabel]( - InteractionGraphLabelsDailyScalaDataset, - PathLayout.DailyPath(pipelineOptions.getOutputPath), - dateInterval, - DiskFormat.Parquet, - Environment.valueOf(dalWriteEnvironment), - writeOption = WriteOptions(numOfShards = Some(pipelineOptions.getNumberOfShards)) - ) - ) - - // save to BQ - if (pipelineOptions.getBqTableName != null) { - val ingestionTime = pipelineOptions.getDate().value.getStart.toDate - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("dateHour" -> TypedProjection.fromConstant(ingestionTime)) - val timePartitioning = new TimePartitioning() - .setType("DAY").setField("dateHour").setExpirationMs(90.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[EdgeLabel] - .to(bqTableName) - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId("twttr-recos-ml-prod") - .withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) - labels - .saveAsCustomOutput( - s"Save Recommendations to BQ $bqTableName", - bqWriter - ) - } - - } - - def groupLabels(labels: SCollection[EdgeLabel]): SCollection[EdgeLabel] = { - labels - .map { e: EdgeLabel => ((e.sourceId, e.destinationId), e.labels.toSet) } - .sumByKey - .map { case ((srcId, destId), labels) => EdgeLabel(srcId, destId, labels) } - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsOption.scala b/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsOption.scala deleted file mode 100644 index 7c0a9a27a..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/labels/InteractionGraphLabelsOption.scala +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.interaction_graph.scio.ml.labels - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import 
org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphLabelsOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Output bq table name") - def getBqTableName: String - def setBqTableName(value: String): Unit - - @Description("Indicates DAL write environment. Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(10) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/labels/LabelUtil.scala b/src/scala/com/twitter/interaction_graph/scio/ml/labels/LabelUtil.scala deleted file mode 100644 index 350c86c84..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/labels/LabelUtil.scala +++ /dev/null @@ -1,63 +0,0 @@ -package com.twitter.interaction_graph.scio.ml.labels - -import com.spotify.scio.ScioMetrics -import com.twitter.interaction_graph.thriftscala.EdgeFeature -import com.twitter.interaction_graph.thriftscala.EdgeLabel -import com.twitter.interaction_graph.thriftscala.FeatureName -import com.twitter.interaction_graph.thriftscala.{Edge => TEdge} -import com.twitter.socialgraph.event.thriftscala.FollowEvent - -object LabelUtil { - - val LabelExplicit = Set( - FeatureName.NumFollows, - FeatureName.NumFavorites, - FeatureName.NumRetweets, - FeatureName.NumMentions, - FeatureName.NumTweetQuotes, - FeatureName.NumPhotoTags, - FeatureName.NumRtFavories, - FeatureName.NumRtReplies, - FeatureName.NumRtTweetQuotes, - FeatureName.NumRtRetweets, - FeatureName.NumRtMentions, - FeatureName.NumShares, - FeatureName.NumReplies, - ) - - val LabelImplicit = Set( - FeatureName.NumTweetClicks, - FeatureName.NumProfileViews, - FeatureName.NumLinkClicks, - FeatureName.NumPushOpens, - FeatureName.NumNtabClicks, - FeatureName.NumRtTweetClicks, - FeatureName.NumRtLinkClicks, - FeatureName.NumEmailOpen, - FeatureName.NumEmailClick, - ) - - val LabelSet = (LabelExplicit ++ LabelImplicit).map(_.value) - - def fromFollowEvent(f: FollowEvent): Option[EdgeLabel] = { - for { - srcId <- f.sourceId - destId <- f.targetId - } yield EdgeLabel(srcId, destId, labels = Set(FeatureName.NumFollows)) - } - - def fromInteractionGraphEdge(e: TEdge): Option[EdgeLabel] = { - val labels = e.features.collect { - case EdgeFeature(featureName: FeatureName, _) if LabelSet.contains(featureName.value) => - ScioMetrics.counter("fromInteractionGraphEdge", featureName.toString).inc() - featureName - }.toSet - if (labels.nonEmpty) { - Some(EdgeLabel(e.sourceId, e.destinationId, labels)) - } else None - } - - def toTEdge(e: EdgeLabel): EdgeLabel = { - EdgeLabel(e.sourceId, e.destinationId, labels = e.labels) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/labels/README.md b/src/scala/com/twitter/interaction_graph/scio/ml/labels/README.md deleted file mode 100644 index f67a624fb..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/labels/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphLabels Dataflow Job - -#### IntelliJ -``` -fastpass create --name rg_labels --intellij src/scala/com/twitter/interaction_graph/scio/ml/labels -``` - -#### Compile -``` -bazel build 
src/scala/com/twitter/interaction_graph/scio/ml/labels:interaction_graph_labels -``` - -#### Build Jar -``` -bazel bundle src/scala/com/twitter/interaction_graph/scio/ml/labels:interaction_graph_labels -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-labels-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/ml/labels/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-05-15 \ - --bind=profile.output_path=processed/interaction_graph/labels -``` \ No newline at end of file diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/scores/BUILD b/src/scala/com/twitter/interaction_graph/scio/ml/scores/BUILD deleted file mode 100644 index f5f1cacc2..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/scores/BUILD +++ /dev/null @@ -1,54 +0,0 @@ -scala_library( - sources = ["*.scala"], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":real_graph_in_scores-scala", - ":real_graph_oon_scores-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - ], -) - -jvm_binary( - name = "interaction_graph_scores_scio", - main = "com.twitter.interaction_graph.scio.ml.scores.InteractionGraphScoreExportJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":scores", - ], -) - -create_datasets( - base_name = "real_graph_in_scores", - description = "Real Graph in network scores", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.wtf.scalding.jobs.injection.CandidateSeqInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.wtf.candidate.thriftscala.CandidateSeq", - scala_dependencies = [ - "src/scala/com/twitter/wtf/scalding/jobs/injection", - ], -) - -create_datasets( - base_name = "real_graph_oon_scores", - description = "Real Graph OON Scores", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.wtf.scalding.jobs.injection.CandidateSeqInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.wtf.candidate.thriftscala.CandidateSeq", - scala_dependencies = [ - "src/scala/com/twitter/wtf/scalding/jobs/injection", - ], -) diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportJob.scala b/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportJob.scala deleted file mode 100644 index 85e2284c2..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportJob.scala +++ /dev/null @@ -1,134 +0,0 @@ -package com.twitter.interaction_graph.scio.ml.scores - -import com.google.cloud.bigquery.BigQueryOptions -import com.google.cloud.bigquery.QueryJobConfiguration -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.exception.DataNotFoundException -import 
com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.wtf.candidate.thriftscala.Candidate -import com.twitter.wtf.candidate.thriftscala.CandidateSeq -import com.twitter.wtf.candidate.thriftscala.ScoredEdge -import org.apache.avro.generic.GenericRecord -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead -import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord -import org.apache.beam.sdk.transforms.SerializableFunction -import scala.collection.JavaConverters._ - -object InteractionGraphScoreExportJob extends ScioBeamJob[InteractionGraphScoreExportOption] { - - // to parse latest date from the BQ table we're reading from - val parseDateRow = new SerializableFunction[SchemaAndRecord, String] { - override def apply(input: SchemaAndRecord): String = { - val genericRecord: GenericRecord = input.getRecord() - genericRecord.get("ds").toString - } - } - - // to parse each row from the BQ table we're reading from - val parseRow = new SerializableFunction[SchemaAndRecord, ScoredEdge] { - override def apply(record: SchemaAndRecord): ScoredEdge = { - val genericRecord: GenericRecord = record.getRecord() - ScoredEdge( - genericRecord.get("source_id").asInstanceOf[Long], - genericRecord.get("destination_id").asInstanceOf[Long], - genericRecord.get("prob").asInstanceOf[Double], - genericRecord.get("followed").asInstanceOf[Boolean], - ) - } - } - - override def runPipeline( - sc: ScioContext, - opts: InteractionGraphScoreExportOption - ): Unit = { - - val dateStr: String = opts.getDate().value.getStart.toString("yyyyMMdd") - logger.info(s"dateStr $dateStr") - val project: String = "twttr-recos-ml-prod" - val datasetName: String = "realgraph" - val bqTableName: String = "scores" - val fullBqTableName: String = s"$project:$datasetName.$bqTableName" - - if (opts.getDALWriteEnvironment == "PROD") { - val bqClient = - BigQueryOptions.newBuilder.setProjectId("twttr-recos-ml-prod").build.getService - val query = - s""" - |SELECT total_rows - |FROM `$project.$datasetName.INFORMATION_SCHEMA.PARTITIONS` - |WHERE partition_id ="$dateStr" AND - |table_name="$bqTableName" AND total_rows > 0 - |""".stripMargin - val queryConfig = QueryJobConfiguration.of(query) - val results = bqClient.query(queryConfig).getValues.asScala.toSeq - if (results.isEmpty || results.head.get(0).getLongValue == 0) { - throw new DataNotFoundException(s"$dateStr not present in $fullBqTableName.") - } - } - sc.run() - } - - override protected def configurePipeline( - sc: ScioContext, - opts: InteractionGraphScoreExportOption - ): Unit = { - - val dateStr: String = opts.getDate().value.getStart.toString("yyyy-MM-dd") - logger.info(s"dateStr $dateStr") - val project: String = "twttr-recos-ml-prod" - val datasetName: String = "realgraph" - val bqTableName: String = "scores" - val fullBqTableName: String = s"$project:$datasetName.$bqTableName" - - val scoreExport: SCollection[ScoredEdge] = sc - .customInput( - s"Read from BQ table $fullBqTableName", - BigQueryIO - .read(parseRow) - .from(fullBqTableName) - .withSelectedFields(List("source_id", "destination_id", "prob", "followed").asJava) - .withRowRestriction(s"ds = '$dateStr'") - .withMethod(TypedRead.Method.DIRECT_READ) - ) - - val inScores = scoreExport - .collect { - case ScoredEdge(src, dest, score, true) => - (src, Candidate(dest, score)) - } - .groupByKey - .map { - case (src, candidateIter) 
=> KeyVal(src, CandidateSeq(candidateIter.toSeq.sortBy(-_.score))) - } - - val outScores = scoreExport - .collect { - case ScoredEdge(src, dest, score, false) => - (src, Candidate(dest, score)) - } - .groupByKey - .map { - case (src, candidateIter) => KeyVal(src, CandidateSeq(candidateIter.toSeq.sortBy(-_.score))) - } - - inScores.saveAsCustomOutput( - "Write real_graph_in_scores", - DAL.writeVersionedKeyVal( - RealGraphInScoresScalaDataset, - PathLayout.VersionedPath(opts.getOutputPath + "/in"), - ) - ) - outScores.saveAsCustomOutput( - "Write real_graph_oon_scores", - DAL.writeVersionedKeyVal( - RealGraphOonScoresScalaDataset, - PathLayout.VersionedPath(opts.getOutputPath + "/oon"), - ) - ) - } -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportOption.scala b/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportOption.scala deleted file mode 100644 index 3b55c517b..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/scores/InteractionGraphScoreExportOption.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.interaction_graph.scio.ml.scores - -import com.twitter.beam.io.dal.DALOptions -import com.twitter.beam.job.DateRangeOptions -import org.apache.beam.sdk.options.Default -import org.apache.beam.sdk.options.Description -import org.apache.beam.sdk.options.Validation.Required - -trait InteractionGraphScoreExportOption extends DALOptions with DateRangeOptions { - @Required - @Description("Output path for storing the final dataset") - def getOutputPath: String - def setOutputPath(value: String): Unit - - @Description("Indicates DAL write environment. Can be set to dev/stg during local validation") - @Default.String("PROD") - def getDALWriteEnvironment: String - def setDALWriteEnvironment(value: String): Unit - - @Description("Number of shards/partitions for saving the final dataset.") - @Default.Integer(1000) - def getNumberOfShards: Integer - def setNumberOfShards(value: Integer): Unit -} diff --git a/src/scala/com/twitter/interaction_graph/scio/ml/scores/README.md b/src/scala/com/twitter/interaction_graph/scio/ml/scores/README.md deleted file mode 100644 index 51ace9d9a..000000000 --- a/src/scala/com/twitter/interaction_graph/scio/ml/scores/README.md +++ /dev/null @@ -1,34 +0,0 @@ -## InteractionGraphLabels Dataflow Job - -#### IntelliJ -``` -fastpass create --name rg_scores --intellij src/scala/com/twitter/interaction_graph/scio/ml/scores -``` - -#### Compile -``` -bazel build src/scala/com/twitter/interaction_graph/scio/ml/scores -``` - -#### Build Jar -``` -bazel bundle src/scala/com/twitter/interaction_graph/scio/ml/scores -``` - -#### Run Scheduled Job -``` -export PROJECTID=twttr-recos-ml-prod -export REGION=us-central1 -export JOB_NAME=interaction-graph-scores-dataflow - -bin/d6w schedule \ - ${PROJECTID}/${REGION}/${JOB_NAME} \ - src/scala/com/twitter/interaction_graph/scio/ml/scores/config.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.project=${PROJECTID} \ - --bind=profile.region=${REGION} \ - --bind=profile.job_name=${JOB_NAME} \ - --bind=profile.environment=prod \ - --bind=profile.date=2022-06-23 \ - --bind=profile.output_path=manhattan_sequence_files/real_graph_scores_v2 -``` \ No newline at end of file diff --git a/src/scala/com/twitter/recos/decider/BUILD b/src/scala/com/twitter/recos/decider/BUILD deleted file mode 100644 index d1eb8d74f..000000000 --- a/src/scala/com/twitter/recos/decider/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], 
- platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "decider/src/main/scala", - "src/scala/com/twitter/recos/util:recos-util", - ], -) diff --git a/src/scala/com/twitter/recos/decider/BaseDecider.scala b/src/scala/com/twitter/recos/decider/BaseDecider.scala deleted file mode 100644 index 841963631..000000000 --- a/src/scala/com/twitter/recos/decider/BaseDecider.scala +++ /dev/null @@ -1,110 +0,0 @@ -package com.twitter.recos.decider - -import com.twitter.decider.Decider -import com.twitter.decider.DeciderFactory -import com.twitter.decider.RandomRecipient -import com.twitter.decider.Recipient -import com.twitter.decider.SimpleRecipient -import com.twitter.recos.util.TeamUsers - -case class GuestRecipient(id: Long) extends Recipient { - override def isGuest: Boolean = true -} - -sealed trait BaseDecider { - def baseConfig: Option[String] = None - - def overlayConfig: Option[String] = None - - lazy val decider: Decider = DeciderFactory(baseConfig, overlayConfig)() - - def isAvailable(feature: String, recipient: Option[Recipient]): Boolean = - decider.isAvailable(feature, recipient) - - def isAvailable(feature: String): Boolean = isAvailable(feature, None) - - def isAvailableExceptTeam(feature: String, id: Long, isUser: Boolean = true): Boolean = { - if (isUser) TeamUsers.team.contains(id) || isAvailable(feature, Some(SimpleRecipient(id))) - else isAvailable(feature, Some(GuestRecipient(id))) - } -} - -case class RecosDecider(env: String, cluster: String = "atla") extends BaseDecider { - override val baseConfig = Some("/com/twitter/recos/config/decider.yml") - override val overlayConfig = Some( - s"/usr/local/config/overlays/recos/service/prod/$cluster/decider_overlay.yml" - ) - - def shouldCompute(id: Long, displayLocation: String, isUser: Boolean = true): Boolean = { - isAvailableExceptTeam(RecosDecider.recosIncomingTraffic + "_" + displayLocation, id, isUser) - } - - def shouldReturn(id: Long, displayLocation: String, isUser: Boolean = true): Boolean = { - isAvailableExceptTeam(RecosDecider.recosShouldReturn + "_" + displayLocation, id, isUser) - } - - def shouldDarkmode(experiment: String): Boolean = { - isAvailable(RecosDecider.recosShouldDark + "_exp_" + experiment, None) - } - - def shouldScribe(id: Long, isUser: Boolean = true): Boolean = { - if (isUser) (id > 0) && isAvailableExceptTeam(RecosDecider.recosShouldScribe, id, isUser) - else false // TODO: define the behavior for guests - } - - def shouldWriteMomentCapsuleOpenEdge(): Boolean = { - val capsuleOpenDecider = env match { - case "prod" => RecosDecider.recosShouldWriteMomentCapsuleOpenEdge - case _ => RecosDecider.recosShouldWriteMomentCapsuleOpenEdge + RecosDecider.testSuffix - } - - isAvailable(capsuleOpenDecider, Some(RandomRecipient)) - } -} - -object RecosDecider { - val testSuffix = "_test" - - val recosIncomingTraffic: String = "recos_incoming_traffic" - val recosShouldReturn: String = "recos_should_return" - val recosShouldDark: String = "recos_should_dark" - val recosRealtimeBlacklist: String = "recos_realtime_blacklist" - val recosRealtimeDeveloperlist: String = "recos_realtime_developerlist" - val recosShouldScribe: String = "recos_should_scribe" - val recosShouldWriteMomentCapsuleOpenEdge: String = "recos_should_write_moment_capsule_open_edge" -} - -trait GraphDecider extends BaseDecider { - val graphNamePrefix: String - - override val baseConfig = Some("/com/twitter/recos/config/decider.yml") - override val overlayConfig = Some( - 
"/usr/local/config/overlays/recos/service/prod/atla/decider_overlay.yml" - ) -} - -case class UserTweetEntityGraphDecider() extends GraphDecider { - override val graphNamePrefix: String = "user_tweet_entity_graph" - - def tweetSocialProof: Boolean = { - isAvailable("user_tweet_entity_graph_tweet_social_proof") - } - - def entitySocialProof: Boolean = { - isAvailable("user_tweet_entity_graph_entity_social_proof") - } - -} - -case class UserUserGraphDecider() extends GraphDecider { - override val graphNamePrefix: String = "user_user_graph" -} - -case class UserTweetGraphDecider(env: String, dc: String) extends GraphDecider { - override val graphNamePrefix: String = "user-tweet-graph" - - override val baseConfig = Some("/com/twitter/recos/config/user-tweet-graph_decider.yml") - override val overlayConfig = Some( - s"/usr/local/config/overlays/user-tweet-graph/user-tweet-graph/$env/$dc/decider_overlay.yml" - ) -} diff --git a/src/scala/com/twitter/recos/decider/EndpointLoadShedder.scala b/src/scala/com/twitter/recos/decider/EndpointLoadShedder.scala deleted file mode 100644 index 73a06e5af..000000000 --- a/src/scala/com/twitter/recos/decider/EndpointLoadShedder.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.recos.decider - -import com.twitter.decider.Decider -import com.twitter.decider.RandomRecipient -import com.twitter.util.Future -import scala.util.control.NoStackTrace - -/* - Provides deciders-controlled load shedding for a given endpoint. - The format of the decider keys is: - - enable_loadshedding__ - E.g.: - enable_loadshedding_user-tweet-graph_relatedTweets - - Deciders are fractional, so a value of 50.00 will drop 50% of responses. If a decider key is not - defined for a particular endpoint, those requests will always be - served. - - We should therefore aim to define keys for the endpoints we care most about in decider.yml, - so that we can control them during incidents. - */ -class EndpointLoadShedder( - decider: GraphDecider) { - import EndpointLoadShedder._ - - private val keyPrefix = "enable_loadshedding" - - def apply[T](endpointName: String)(serve: => Future[T]): Future[T] = { - val key = s"${keyPrefix}_${decider.graphNamePrefix}_${endpointName}" - if (decider.isAvailable(key, recipient = Some(RandomRecipient))) - Future.exception(LoadSheddingException) - else serve - } -} - -object EndpointLoadShedder { - object LoadSheddingException extends Exception with NoStackTrace -} diff --git a/src/scala/com/twitter/recos/graph_common/ActionEdgeTypeMask.scala b/src/scala/com/twitter/recos/graph_common/ActionEdgeTypeMask.scala deleted file mode 100644 index d29b12bc4..000000000 --- a/src/scala/com/twitter/recos/graph_common/ActionEdgeTypeMask.scala +++ /dev/null @@ -1,99 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.recos.recos_common.thriftscala.SocialProofType - -/** - * The bit mask is used to encode edge types in the top bits of an integer, - * e.g. favorite, retweet, reply and click. Under current segment configuration, each segment - * stores up to 128M edges. Assuming that each node on one side is unique, each segment - * stores up to 128M unique nodes on one side, which occupies the lower 27 bits of an integer. - * This leaves five bits to encode the edge types, which at max can store 32 edge types. - * The following implementation utilizes the top four bits and leaves one free bit out. 
- */ -class ActionEdgeTypeMask extends EdgeTypeMask { - import ActionEdgeTypeMask._ - - override def encode(node: Int, edgeType: Byte): Int = { - if (edgeType == FAVORITE) { - node | EDGEARRAY(FAVORITE) - } else if (edgeType == RETWEET) { - node | EDGEARRAY(RETWEET) - } else if (edgeType == REPLY) { - node | EDGEARRAY(REPLY) - } else if (edgeType == TWEET) { - node | EDGEARRAY(TWEET) - } else { - // Anything that is not a public engagement (i.e. openlink, share, select, etc.) is a "click" - node | EDGEARRAY(CLICK) - } - } - - override def edgeType(node: Int): Byte = { - (node >> 28).toByte - } - - override def restore(node: Int): Int = { - node & MASK - } -} - -object ActionEdgeTypeMask { - - /** - * Reserve the top four bits of each integer to encode the edge type information. - */ - val MASK: Int = - Integer.parseInt("00001111111111111111111111111111", 2) - val CLICK: Byte = 0 - val FAVORITE: Byte = 1 - val RETWEET: Byte = 2 - val REPLY: Byte = 3 - val TWEET: Byte = 4 - val SIZE: Byte = 5 - val UNUSED6: Byte = 6 - val UNUSED7: Byte = 7 - val UNUSED8: Byte = 8 - val UNUSED9: Byte = 9 - val UNUSED10: Byte = 10 - val UNUSED11: Byte = 11 - val UNUSED12: Byte = 12 - val UNUSED13: Byte = 13 - val UNUSED14: Byte = 14 - val UNUSED15: Byte = 15 - val EDGEARRAY: Array[Int] = Array( - 0, - 1 << 28, - 2 << 28, - 3 << 28, - 4 << 28, - 5 << 28, - 6 << 28, - 7 << 28, - 8 << 28, - 9 << 28, - 10 << 28, - 11 << 28, - 12 << 28, - 13 << 28, - 14 << 28, - 15 << 28 - ) - - /** - * Map valid social proof types specified by clients to an array of bytes. If clients do not - * specify any social proof types in thrift, it will return all available social types by - * default. - * - * @param socialProofTypes are the valid socialProofTypes specified by clients - * @return an array of bytes representing valid social proof types - */ - def getUserTweetGraphSocialProofTypes( - socialProofTypes: Option[Seq[SocialProofType]] - ): Array[Byte] = { - socialProofTypes - .map { _.map { _.getValue }.toArray } - .getOrElse((0 until SIZE).toArray) - .map { _.toByte } - } -} diff --git a/src/scala/com/twitter/recos/graph_common/BUILD b/src/scala/com/twitter/recos/graph_common/BUILD deleted file mode 100644 index dd1f455ef..000000000 --- a/src/scala/com/twitter/recos/graph_common/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - strict_deps = False, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "finagle/finagle-stats/src/main/scala", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos:recos-common-scala", - ], -) diff --git a/src/scala/com/twitter/recos/graph_common/BipartiteGraphHelper.scala b/src/scala/com/twitter/recos/graph_common/BipartiteGraphHelper.scala deleted file mode 100644 index 645bb900c..000000000 --- a/src/scala/com/twitter/recos/graph_common/BipartiteGraphHelper.scala +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import scala.collection.mutable.ListBuffer - -/* - * The helper class encodes and decodes tweet ids with tweetypie's card information - * when querying recos salsa library. Inside salsa library, all tweet ids are - * encoded with card information for the purpose of inline filtering. 
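 *
 * A minimal usage sketch (graph, userId and tweetId are assumed to be in scope; in the user-tweet
 * graphs the left side holds user ids and the right side tweet ids):
 * {{{
 * val helper = new BipartiteGraphHelper(graph)
 * // (tweetId, engagementType) pairs for a user, most recent first, de-duplicated
 * val engagements: Seq[(Long, Byte)] = helper.getLeftNodeEdges(userId)
 * // user ids that engaged with a tweet, most recent first, de-duplicated
 * val engagingUsers: Seq[Long] = helper.getRightNodeEdges(tweetId)
 * }}}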
- */ -class BipartiteGraphHelper(graph: BipartiteGraph) { - private val tweetIDMask = new TweetIDMask - - def getLeftNodeEdges(leftNode: Long): Seq[(Long, Byte)] = { - val iterator = graph.getLeftNodeEdges(leftNode) - - val edges: ListBuffer[(Long, Byte)] = ListBuffer() - if (iterator != null) { - while (iterator.hasNext) { - val node = iterator.nextLong() - val engagementType = iterator.currentEdgeType() - edges += ((tweetIDMask.restore(node), engagementType)) - } - } - edges.reverse.distinct // Most recent edges first, no duplications - } - - def getRightNodeEdges(rightNode: Long): Seq[Long] = { - val iterator = graph.getRightNodeEdges(rightNode) - val leftNodes: ListBuffer[Long] = ListBuffer() - if (iterator != null) { - while (iterator.hasNext) { - leftNodes += iterator.nextLong() - } - } - - leftNodes.reverse.distinct // Most recent edges first, no duplications - } -} diff --git a/src/scala/com/twitter/recos/graph_common/FinagleCounterWrapper.scala b/src/scala/com/twitter/recos/graph_common/FinagleCounterWrapper.scala deleted file mode 100644 index 3c4d62b1d..000000000 --- a/src/scala/com/twitter/recos/graph_common/FinagleCounterWrapper.scala +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.finagle.stats.Counter -import com.twitter.graphjet.stats.{Counter => GraphCounter} - -/** - * FinagleCounterWrapper wraps Twitter's Finagle Counter. - * - * This is because GraphJet is an openly available library which does not - * depend on Finagle, but tracks stats using a similar interface. - */ -class FinagleCounterWrapper(counter: Counter) extends GraphCounter { - def incr() = counter.incr() - def incr(delta: Int) = counter.incr(delta) -} diff --git a/src/scala/com/twitter/recos/graph_common/FinagleStatsReceiverWrapper.scala b/src/scala/com/twitter/recos/graph_common/FinagleStatsReceiverWrapper.scala deleted file mode 100644 index ac8bfc883..000000000 --- a/src/scala/com/twitter/recos/graph_common/FinagleStatsReceiverWrapper.scala +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.stats.{StatsReceiver => GraphStatsReceiver} - -/** - * FinagleStatsReceiverWrapper wraps Twitter's Finagle StatsReceiver. - * - * This is because GraphJet is an openly available library which does not - * depend on Finagle, but tracks stats using a similar interface. - */ -case class FinagleStatsReceiverWrapper(statsReceiver: StatsReceiver) extends GraphStatsReceiver { - - def scope(namespace: String) = new FinagleStatsReceiverWrapper(statsReceiver.scope(namespace)) - def counter(name: String) = new FinagleCounterWrapper(statsReceiver.counter(name)) -} diff --git a/src/scala/com/twitter/recos/graph_common/LeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala b/src/scala/com/twitter/recos/graph_common/LeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala deleted file mode 100644 index 7e21b82c7..000000000 --- a/src/scala/com/twitter/recos/graph_common/LeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.bipartite.LeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.graphjet.stats.StatsReceiver - -/** - * The GraphBuilder builds a LeftIndexedPowerLawMultiSegmentBipartiteGraph given a set of - * parameters. 
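 *
 * A minimal construction sketch (the sizes and exponent below are illustrative placeholders, not
 * production settings, and statsReceiver is assumed to be a Finagle StatsReceiver in scope):
 * {{{
 * val config = GraphBuilderConfig(
 *   maxNumSegments = 8,
 *   maxNumEdgesPerSegment = 1 << 26,
 *   expectedNumLeftNodes = 1 << 24,
 *   expectedMaxLeftDegree = 64,
 *   leftPowerLawExponent = 2.0,
 *   expectedNumRightNodes = 1 << 24,
 *   edgeTypeMask = new ActionEdgeTypeMask()
 * )
 * val graph = LeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder(
 *   config,
 *   FinagleStatsReceiverWrapper(statsReceiver)
 * )
 * }}}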
- */ -object LeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder { - - /** - * This encapsulates all the state needed to initialize the in-memory graph. - * - * @param maxNumSegments is the maximum number of segments we'll add to the graph. - * At that point, the oldest segments will start getting dropped - * @param maxNumEdgesPerSegment determines when the implementation decides to fork off a - * new segment - * @param expectedNumLeftNodes is the expected number of left nodes that would be inserted in - * the segment - * @param expectedMaxLeftDegree is the maximum degree expected for any left node - * @param leftPowerLawExponent is the exponent of the LHS power-law graph. see - * [[com.twitter.graphjet.bipartite.edgepool.PowerLawDegreeEdgePool]] - * for details - * @param expectedNumRightNodes is the expected number of right nodes that would be inserted in - * the segment - */ - case class GraphBuilderConfig( - maxNumSegments: Int, - maxNumEdgesPerSegment: Int, - expectedNumLeftNodes: Int, - expectedMaxLeftDegree: Int, - leftPowerLawExponent: Double, - expectedNumRightNodes: Int, - edgeTypeMask: EdgeTypeMask) - - /** - * This apply function returns a mutuable bipartiteGraph - * - * @param graphBuilderConfig is the graph builder config - * - */ - def apply( - graphBuilderConfig: GraphBuilderConfig, - statsReceiverWrapper: StatsReceiver - ): LeftIndexedPowerLawMultiSegmentBipartiteGraph = { - new LeftIndexedPowerLawMultiSegmentBipartiteGraph( - graphBuilderConfig.maxNumSegments, - graphBuilderConfig.maxNumEdgesPerSegment, - graphBuilderConfig.expectedNumLeftNodes, - graphBuilderConfig.expectedMaxLeftDegree, - graphBuilderConfig.leftPowerLawExponent, - graphBuilderConfig.expectedNumRightNodes, - graphBuilderConfig.edgeTypeMask, - statsReceiverWrapper - ) - } -} diff --git a/src/scala/com/twitter/recos/graph_common/MultiSegmentPowerLawBipartiteGraphBuilder.scala b/src/scala/com/twitter/recos/graph_common/MultiSegmentPowerLawBipartiteGraphBuilder.scala deleted file mode 100644 index ca777c97d..000000000 --- a/src/scala/com/twitter/recos/graph_common/MultiSegmentPowerLawBipartiteGraphBuilder.scala +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.stats.StatsReceiver -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph - -/** - * The GraphBuilder builds a MultiSegmentPowerLawBipartiteGraph given a set of parameters. - */ -object MultiSegmentPowerLawBipartiteGraphBuilder { - - /** - * This encapsulates all the state needed to initialize the in-memory graph. - * - * @param maxNumSegments is the maximum number of segments we'll add to the graph. - * At that point, the oldest segments will start getting dropped - * @param maxNumEdgesPerSegment determines when the implementation decides to fork off a - * new segment - * @param expectedNumLeftNodes is the expected number of left nodes that would be inserted in - * the segment - * @param expectedMaxLeftDegree is the maximum degree expected for any left node - * @param leftPowerLawExponent is the exponent of the LHS power-law graph. see - * [[com.twitter.graphjet.bipartite.edgepool.PowerLawDegreeEdgePool]] - * for details - * @param expectedNumRightNodes is the expected number of right nodes that would be inserted in - * the segment - * @param expectedMaxRightDegree is the maximum degree expected for any right node - * @param rightPowerLawExponent is the exponent of the RHS power-law graph. 
see - * [[com.twitter.graphjet.bipartite.edgepool.PowerLawDegreeEdgePool]] - * for details - */ - case class GraphBuilderConfig( - maxNumSegments: Int, - maxNumEdgesPerSegment: Int, - expectedNumLeftNodes: Int, - expectedMaxLeftDegree: Int, - leftPowerLawExponent: Double, - expectedNumRightNodes: Int, - expectedMaxRightDegree: Int, - rightPowerLawExponent: Double) - - /** - * This apply function returns a mutuable bipartiteGraph - * - * @param graphBuilderConfig is the graph builder config - * - */ - def apply( - graphBuilderConfig: GraphBuilderConfig, - statsReceiver: StatsReceiver - ): MultiSegmentPowerLawBipartiteGraph = { - new MultiSegmentPowerLawBipartiteGraph( - graphBuilderConfig.maxNumSegments, - graphBuilderConfig.maxNumEdgesPerSegment, - graphBuilderConfig.expectedNumLeftNodes, - graphBuilderConfig.expectedMaxLeftDegree, - graphBuilderConfig.leftPowerLawExponent, - graphBuilderConfig.expectedNumRightNodes, - graphBuilderConfig.expectedMaxRightDegree, - graphBuilderConfig.rightPowerLawExponent, - new ActionEdgeTypeMask(), - statsReceiver - ) - } -} diff --git a/src/scala/com/twitter/recos/graph_common/NodeInfoHandler.scala b/src/scala/com/twitter/recos/graph_common/NodeInfoHandler.scala deleted file mode 100644 index 5e71f6b03..000000000 --- a/src/scala/com/twitter/recos/graph_common/NodeInfoHandler.scala +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.recos.recos_common.thriftscala.{ - SocialProofType, - GetRecentEdgesRequest, - GetRecentEdgesResponse, - NodeInfo, - RecentEdge -} -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Future - -/** - * Implementation of the Thrift-defined service interface. 
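 *
 * A minimal usage sketch (graphHelper, statsReceiver, request and tweetId are assumed to be in
 * scope; request construction is elided, and in the service these handlers sit behind the
 * corresponding thrift endpoints):
 * {{{
 * val recentEdgesHandler = new LeftNodeEdgesHandler(graphHelper, statsReceiver)
 * val recentEdges: Future[GetRecentEdgesResponse] = recentEdgesHandler(request)
 *
 * val nodeInfoHandler = new RightNodeInfoHandler(graphHelper, statsReceiver)
 * val nodeInfo: Future[NodeInfo] = nodeInfoHandler(tweetId)
 * }}}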
- */ -class LeftNodeEdgesHandler(graphHelper: BipartiteGraphHelper, statsReceiver: StatsReceiver) - extends RequestHandler[GetRecentEdgesRequest, GetRecentEdgesResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - private val CLICK = 0 - private val FAVORITE = 1 - private val RETWEET = 2 - private val REPLY = 3 - private val TWEET = 4 - - override def apply(request: GetRecentEdgesRequest): Future[GetRecentEdgesResponse] = { - trackFutureBlockStats(stats) { - val recentEdges = graphHelper.getLeftNodeEdges(request.requestId).flatMap { - case (node, engagementType) if engagementType == CLICK => - Some(RecentEdge(node, SocialProofType.Click)) - case (node, engagementType) if engagementType == FAVORITE => - Some(RecentEdge(node, SocialProofType.Favorite)) - case (node, engagementType) if engagementType == RETWEET => - Some(RecentEdge(node, SocialProofType.Retweet)) - case (node, engagementType) if engagementType == REPLY => - Some(RecentEdge(node, SocialProofType.Reply)) - case (node, engagementType) if engagementType == TWEET => - Some(RecentEdge(node, SocialProofType.Tweet)) - case _ => - None - } - Future.value(GetRecentEdgesResponse(recentEdges)) - } - } -} - -class RightNodeInfoHandler(graphHelper: BipartiteGraphHelper, statsReceiver: StatsReceiver) - extends RequestHandler[Long, NodeInfo] { - private[this] val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(rightNode: Long): Future[NodeInfo] = { - trackFutureBlockStats(stats) { - val edges = graphHelper.getRightNodeEdges(rightNode) - Future.value(NodeInfo(edges = edges)) - } - } -} diff --git a/src/scala/com/twitter/recos/graph_common/NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala b/src/scala/com/twitter/recos/graph_common/NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala deleted file mode 100644 index ce63644a6..000000000 --- a/src/scala/com/twitter/recos/graph_common/NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala +++ /dev/null @@ -1,63 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.graphjet.stats.StatsReceiver - -/** - * The GraphBuilder builds a NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder given a set of - * parameters. - */ -object NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder { - - /** - * This encapsulates all the state needed to initialize the in-memory graph. - * - * @param maxNumSegments is the maximum number of segments we'll add to the graph. - * At that point, the oldest segments will start getting dropped - * @param maxNumEdgesPerSegment determines when the implementation decides to fork off a - * new segment - * @param expectedNumLeftNodes is the expected number of left nodes that would be inserted in - * the segment - * @param expectedMaxLeftDegree is the maximum degree expected for any left node - * @param leftPowerLawExponent is the exponent of the LHS power-law graph. 
see - * [[com.twitter.graphjet.bipartite.edgepool.PowerLawDegreeEdgePool]] - * for details - * @param expectedNumRightNodes is the expected number of right nodes that would be inserted in - * the segment - * @param numRightNodeMetadataTypes is the max number of node metadata types associated with the - * right nodes - */ - case class GraphBuilderConfig( - maxNumSegments: Int, - maxNumEdgesPerSegment: Int, - expectedNumLeftNodes: Int, - expectedMaxLeftDegree: Int, - leftPowerLawExponent: Double, - expectedNumRightNodes: Int, - numRightNodeMetadataTypes: Int, - edgeTypeMask: EdgeTypeMask) - - /** - * This apply function returns a mutuable bipartiteGraph - * - * @param graphBuilderConfig is the graph builder config - * - */ - def apply( - graphBuilderConfig: GraphBuilderConfig, - statsReceiverWrapper: StatsReceiver - ): NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph = { - new NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph( - graphBuilderConfig.maxNumSegments, - graphBuilderConfig.maxNumEdgesPerSegment, - graphBuilderConfig.expectedNumLeftNodes, - graphBuilderConfig.expectedMaxLeftDegree, - graphBuilderConfig.leftPowerLawExponent, - graphBuilderConfig.expectedNumRightNodes, - graphBuilderConfig.numRightNodeMetadataTypes, - graphBuilderConfig.edgeTypeMask, - statsReceiverWrapper - ) - } -} diff --git a/src/scala/com/twitter/recos/graph_common/RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala b/src/scala/com/twitter/recos/graph_common/RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala deleted file mode 100644 index 353b47d92..000000000 --- a/src/scala/com/twitter/recos/graph_common/RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.scala +++ /dev/null @@ -1,63 +0,0 @@ -package com.twitter.recos.graph_common - -import com.twitter.graphjet.bipartite.RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.graphjet.stats.StatsReceiver - -/** - * The GraphBuilder builds a RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder given a set of - * parameters. - */ -object RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder { - - /** - * This encapsulates all the state needed to initialize the in-memory graph. - * - * @param maxNumSegments is the maximum number of segments we'll add to the graph. - * At that point, the oldest segments will start getting dropped - * @param maxNumEdgesPerSegment determines when the implementation decides to fork off a - * new segment - * @param expectedNumLeftNodes is the expected number of left nodes that would be inserted in - * the segment - * @param expectedMaxLeftDegree is the maximum degree expected for any left node - * @param leftPowerLawExponent is the exponent of the LHS power-law graph. 
see - * [[com.twitter.graphjet.bipartite.edgepool.PowerLawDegreeEdgePool]] - * for details - * @param expectedNumRightNodes is the expected number of right nodes that would be inserted in - * the segment - * @param numRightNodeMetadataTypes is the max number of node metadata types associated with the - * right nodes - */ - case class GraphBuilderConfig( - maxNumSegments: Int, - maxNumEdgesPerSegment: Int, - expectedNumLeftNodes: Int, - expectedMaxLeftDegree: Int, - leftPowerLawExponent: Double, - expectedNumRightNodes: Int, - numRightNodeMetadataTypes: Int, - edgeTypeMask: EdgeTypeMask) - - /** - * This apply function returns a mutuable bipartiteGraph - * - * @param graphBuilderConfig is the graph builder config - * - */ - def apply( - graphBuilderConfig: GraphBuilderConfig, - statsReceiverWrapper: StatsReceiver - ): RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph = { - new RightNodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph( - graphBuilderConfig.maxNumSegments, - graphBuilderConfig.maxNumEdgesPerSegment, - graphBuilderConfig.expectedNumLeftNodes, - graphBuilderConfig.expectedMaxLeftDegree, - graphBuilderConfig.leftPowerLawExponent, - graphBuilderConfig.expectedNumRightNodes, - graphBuilderConfig.numRightNodeMetadataTypes, - graphBuilderConfig.edgeTypeMask, - statsReceiverWrapper - ) - } -} diff --git a/src/scala/com/twitter/recos/hose/common/BUILD b/src/scala/com/twitter/recos/hose/common/BUILD deleted file mode 100644 index 9fcb19b5f..000000000 --- a/src/scala/com/twitter/recos/hose/common/BUILD +++ /dev/null @@ -1,15 +0,0 @@ -scala_library( - sources = ["*.scala"], - strict_deps = False, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "3rdparty/jvm/org/apache/kafka:rosette-kafka", - "finagle/finagle-stats/src/main/scala", - "kafka/finagle-kafka/finatra-kafka/src/main/scala", - "kafka/libs/src/main/scala/com/twitter/kafka/client/processor", - "servo/repo/src/main/scala", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos:recos-internal-scala", - ], -) diff --git a/src/scala/com/twitter/recos/hose/common/BufferedEdgeWriter.scala b/src/scala/com/twitter/recos/hose/common/BufferedEdgeWriter.scala deleted file mode 100644 index f2f5ee056..000000000 --- a/src/scala/com/twitter/recos/hose/common/BufferedEdgeWriter.scala +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.recos.hose.common - -import com.twitter.finagle.stats.{Stat, StatsReceiver} -import com.twitter.logging.Logger -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import java.util.concurrent.Semaphore - -/** - * This class reads a buffer of edges from the concurrently linked queue - * and inserts each edge into the recos graph. - * If the queue is empty the thread will sleep for 100ms and attempt to read from the queue again. 
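 *
 * A minimal wiring sketch (edgeCollector and statsReceiver are assumed to be in scope; the queue
 * and semaphore size are illustrative, since in practice UnifiedGraphWriter creates and owns them):
 * {{{
 * val queue = new java.util.concurrent.ConcurrentLinkedQueue[Array[RecosHoseMessage]]()
 * val queuelimit = new java.util.concurrent.Semaphore(1024)
 * val writer = BufferedEdgeWriter(queue, queuelimit, edgeCollector, statsReceiver, () => true)
 * java.util.concurrent.Executors.newSingleThreadExecutor().submit(writer) // it is a Runnable
 * }}}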
- */ -case class BufferedEdgeWriter( - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - edgeCollector: EdgeCollector, - statsReceiver: StatsReceiver, - isRunning: () => Boolean) - extends Runnable { - val logger = Logger() - private val queueRemoveCounter = statsReceiver.counter("queueRemove") - private val queueSleepCounter = statsReceiver.counter("queueSleep") - - def running: Boolean = { - isRunning() - } - - override def run(): Unit = { - while (running) { - val currentBatch = queue.poll - if (currentBatch != null) { - queueRemoveCounter.incr() - queuelimit.release() - var i = 0 - Stat.time(statsReceiver.stat("batchAddEdge")) { - while (i < currentBatch.length) { - edgeCollector.addEdge(currentBatch(i)) - i = i + 1 - } - } - } else { - queueSleepCounter.incr() - Thread.sleep(100L) - } - } - logger.info(this.getClass.getSimpleName + " is done") - } -} diff --git a/src/scala/com/twitter/recos/hose/common/EdgeCollector.scala b/src/scala/com/twitter/recos/hose/common/EdgeCollector.scala deleted file mode 100644 index c5279496c..000000000 --- a/src/scala/com/twitter/recos/hose/common/EdgeCollector.scala +++ /dev/null @@ -1,42 +0,0 @@ -package com.twitter.recos.hose.common - -import com.twitter.finagle.stats.{Stat, StatsReceiver} -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import java.util.concurrent.Semaphore - -trait EdgeCollector { - def addEdge(message: RecosHoseMessage): Unit -} - -/** - * The class consumes incoming edges and inserts them into a buffer of a specified bufferSize. - * Once the buffer is full of edges, it is written to a concurrently linked queue where the size is bounded by queuelimit. - */ -case class BufferedEdgeCollector( - bufferSize: Int, - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - statsReceiver: StatsReceiver) - extends EdgeCollector { - - private var buffer = new Array[RecosHoseMessage](bufferSize) - private var index = 0 - private val queueAddCounter = statsReceiver.counter("queueAdd") - - override def addEdge(message: RecosHoseMessage): Unit = { - buffer(index) = message - index = index + 1 - if (index >= bufferSize) { - val oldBuffer = buffer - buffer = new Array[RecosHoseMessage](bufferSize) - index = 0 - - Stat.time(statsReceiver.stat("waitEnqueue")) { - queuelimit.acquireUninterruptibly() - } - - queue.add(oldBuffer) - queueAddCounter.incr() - } - } -} diff --git a/src/scala/com/twitter/recos/hose/common/RecosEdgeProcessor.scala b/src/scala/com/twitter/recos/hose/common/RecosEdgeProcessor.scala deleted file mode 100644 index 243fce628..000000000 --- a/src/scala/com/twitter/recos/hose/common/RecosEdgeProcessor.scala +++ /dev/null @@ -1,41 +0,0 @@ -package com.twitter.recos.hose.common - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.util.Future -import org.apache.kafka.clients.consumer.ConsumerRecord - -/** - * The class processes RecosHoseMessage and inserts the message as an edge into a recos graph. 
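 *
 * A minimal usage sketch (bufferedEdgeCollector, statsReceiver and consumerRecord are assumed to
 * be in scope; in UnifiedGraphWriter this processor's process function is handed to an
 * AtLeastOnceProcessor):
 * {{{
 * val processor = RecosEdgeProcessor(bufferedEdgeCollector)(statsReceiver)
 * val done: Future[Unit] = processor.process(consumerRecord)
 * }}}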
- */ -case class RecosEdgeProcessor( - edgeCollector: EdgeCollector -)( - implicit statsReceiver: StatsReceiver) { - - private val scopedStats = statsReceiver.scope("RecosEdgeProcessor") - - private val processEventsCounter = scopedStats.counter("process_events") - private val nullPointerEventCounter = scopedStats.counter("null_pointer_num") - private val errorCounter = scopedStats.counter("process_errors") - - def process(record: ConsumerRecord[String, RecosHoseMessage]): Future[Unit] = { - processEventsCounter.incr() - val message = record.value() - try { - // the message is nullable - if (message != null) { - edgeCollector.addEdge(message) - } else { - nullPointerEventCounter.incr() - } - Future.Unit - } catch { - case e: Throwable => - errorCounter.incr() - e.printStackTrace() - Future.Unit - } - } - -} diff --git a/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriter.scala b/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriter.scala deleted file mode 100644 index bac62e418..000000000 --- a/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriter.scala +++ /dev/null @@ -1,217 +0,0 @@ -package com.twitter.recos.hose.common - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.bipartite.LeftIndexedMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.segment.LeftIndexedBipartiteGraphSegment -import com.twitter.kafka.client.processor.{AtLeastOnceProcessor, ThreadSafeKafkaConsumerClient} -import com.twitter.logging.Logger -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import java.util.concurrent.atomic.AtomicBoolean -import java.util.concurrent.{ConcurrentLinkedQueue, ExecutorService, Executors, Semaphore} - -/** - * The class submits a number of graph writer threads, BufferedEdgeWriter, - * during service startup. One of them is live writer thread, and the other $(numBootstrapWriters - 1) - * are catchup writer threads. All of them consume kafka events from an internal concurrent queue, - * which is populated by kafka reader threads. At bootstrap time, the kafka reader threads look - * back kafka offset from several hours ago and populate the internal concurrent queue. - * Each graph writer thread writes to an individual graph segment separately. - * The (numBootstrapWriters - 1) catchup writer threads will stop once all events - * between current system time at startup and the time in memcache are processed. - * The live writer thread will continue to write all incoming kafka events. - * It lives through the entire life cycle of recos graph service. - */ -trait UnifiedGraphWriter[ - TSegment <: LeftIndexedBipartiteGraphSegment, - TGraph <: LeftIndexedMultiSegmentBipartiteGraph[TSegment]] { writer => - - import UnifiedGraphWriter._ - - def shardId: String - def env: String - def hosename: String - def bufferSize: Int - def consumerNum: Int - def catchupWriterNum: Int - def kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage] - def clientId: String - def statsReceiver: StatsReceiver - - /** - * Adds a RecosHoseMessage to the graph. used by live writer to insert edges to the - * current segment - */ - def addEdgeToGraph(graph: TGraph, recosHoseMessage: RecosHoseMessage): Unit - - /** - * Adds a RecosHoseMessage to the given segment in the graph. 
Used by catch up writers to - * insert edges to non-current (old) segments - */ - def addEdgeToSegment(segment: TSegment, recosHoseMessage: RecosHoseMessage): Unit - - private val log = Logger() - private val isRunning: AtomicBoolean = new AtomicBoolean(true) - private val initialized: AtomicBoolean = new AtomicBoolean(false) - private var processors: Seq[AtLeastOnceProcessor[String, RecosHoseMessage]] = Seq.empty - private var consumers: Seq[ThreadSafeKafkaConsumerClient[String, RecosHoseMessage]] = Seq.empty - private val threadPool: ExecutorService = Executors.newCachedThreadPool() - - def shutdown(): Unit = { - processors.foreach { processor => - processor.close() - } - processors = Seq.empty - consumers.foreach { consumer => - consumer.close() - } - consumers = Seq.empty - threadPool.shutdown() - isRunning.set(false) - } - - def initHose(liveGraph: TGraph): Unit = this.synchronized { - if (!initialized.get) { - initialized.set(true) - - val queue: java.util.Queue[Array[RecosHoseMessage]] = - new ConcurrentLinkedQueue[Array[RecosHoseMessage]]() - val queuelimit: Semaphore = new Semaphore(1024) - - initRecosHoseKafka(queue, queuelimit) - initGrpahWriters(liveGraph, queue, queuelimit) - } else { - throw new RuntimeException("attempt to re-init kafka hose") - } - } - - private def initRecosHoseKafka( - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - ): Unit = { - try { - consumers = (0 until consumerNum).map { index => - new ThreadSafeKafkaConsumerClient( - kafkaConsumerBuilder.clientId(s"clientId-$index").enableAutoCommit(false).config) - } - processors = consumers.zipWithIndex.map { - case (consumer, index) => - val bufferedWriter = BufferedEdgeCollector(bufferSize, queue, queuelimit, statsReceiver) - val processor = RecosEdgeProcessor(bufferedWriter)(statsReceiver) - - AtLeastOnceProcessor[String, RecosHoseMessage]( - s"recos-injector-kafka-$index", - hosename, - consumer, - processor.process, - maxPendingRequests = MaxPendingRequests * bufferSize, - workerThreads = ProcessorThreads, - commitIntervalMs = CommitIntervalMs, - statsReceiver = statsReceiver - ) - } - - log.info(s"starting ${processors.size} recosKafka processors") - processors.foreach { processor => - processor.start() - } - } catch { - case e: Throwable => - e.printStackTrace() - log.error(e, e.toString) - processors.foreach { processor => - processor.close() - } - processors = Seq.empty - consumers.foreach { consumer => - consumer.close() - } - consumers = Seq.empty - } - } - - /** - * Initialize the graph writers, - * by first creating catch up writers to bootstrap the older segments, - * and then assigning a live writer to populate the live segment. - */ - private def initGrpahWriters( - liveGraph: TGraph, - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore - ): Unit = { - // define a number of (numBootstrapWriters - 1) catchup writer threads, each of which will write - // to a separate graph segment. 
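    // Each loop iteration below captures the current live segment for one catchup writer and then
    // rolls the graph forward, so every catchup writer (and finally the live writer) ends up with
    // its own segment.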
- val catchupWriters = (0 until (catchupWriterNum - 1)).map { index => - val segment = liveGraph.getLiveSegment - liveGraph.rollForwardSegment() - getCatchupWriter(segment, queue, queuelimit, index) - } - val threadPool: ExecutorService = Executors.newCachedThreadPool() - - // define one live writer thread - val liveWriter = getLiveWriter(liveGraph, queue, queuelimit) - log.info("starting live graph writer that runs until service shutdown") - threadPool.submit(liveWriter) - log.info( - "starting catchup graph writer, which will terminate as soon as the catchup segment is full" - ) - catchupWriters.map(threadPool.submit(_)) - } - - private def getLiveWriter( - liveGraph: TGraph, - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore - ): BufferedEdgeWriter = { - val liveEdgeCollector = new EdgeCollector { - override def addEdge(message: RecosHoseMessage): Unit = addEdgeToGraph(liveGraph, message) - } - BufferedEdgeWriter( - queue, - queuelimit, - liveEdgeCollector, - statsReceiver.scope("liveWriter"), - isRunning.get - ) - } - - private def getCatchupWriter( - segment: TSegment, - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - catchupWriterIndex: Int - ): BufferedEdgeWriter = { - val catchupEdgeCollector = new EdgeCollector { - var currentNumEdges = 0 - - override def addEdge(message: RecosHoseMessage): Unit = { - currentNumEdges += 1 - addEdgeToSegment(segment, message) - } - } - val maxEdges = segment.getMaxNumEdges - - def runCondition(): Boolean = { - isRunning.get && ((maxEdges - catchupEdgeCollector.currentNumEdges) > bufferSize) - } - - BufferedEdgeWriter( - queue, - queuelimit, - catchupEdgeCollector, - statsReceiver.scope("catcher_" + catchupWriterIndex), - runCondition - ) - } -} - -private object UnifiedGraphWriter { - - // The RecosEdgeProcessor is not thread-safe. Only use one thread to process each instance. - val ProcessorThreads = 1 - // Each one cache at most 1000 * bufferSize requests. - val MaxPendingRequests = 1000 - // Short Commit MS to reduce duplicate messages. - val CommitIntervalMs: Long = 5000 // 5 seconds, Default Kafka value. 
-} diff --git a/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriterMulti.scala b/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriterMulti.scala deleted file mode 100644 index af69a9c2c..000000000 --- a/src/scala/com/twitter/recos/hose/common/UnifiedGraphWriterMulti.scala +++ /dev/null @@ -1,228 +0,0 @@ -package src.scala.com.twitter.recos.hose.common - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.bipartite.LeftIndexedMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.segment.LeftIndexedBipartiteGraphSegment -import com.twitter.kafka.client.processor.AtLeastOnceProcessor -import com.twitter.kafka.client.processor.ThreadSafeKafkaConsumerClient -import com.twitter.logging.Logger -import com.twitter.recos.hose.common.BufferedEdgeCollector -import com.twitter.recos.hose.common.BufferedEdgeWriter -import com.twitter.recos.hose.common.EdgeCollector -import com.twitter.recos.hose.common.RecosEdgeProcessor -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.util.Action -import java.util.concurrent.atomic.AtomicBoolean -import java.util.concurrent.ConcurrentLinkedQueue -import java.util.concurrent.ExecutorService -import java.util.concurrent.Executors -import java.util.concurrent.Semaphore - -/** - * The class is a variation of UnifiedGraphWriter which allows one instance to hold multiple graphs. - */ -trait UnifiedGraphWriterMulti[ - TSegment <: LeftIndexedBipartiteGraphSegment, - TGraph <: LeftIndexedMultiSegmentBipartiteGraph[TSegment]] { writer => - - import UnifiedGraphWriterMulti._ - - def shardId: String - def env: String - def hosename: String - def bufferSize: Int - def consumerNum: Int - def catchupWriterNum: Int - def kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage] - def clientId: String - def statsReceiver: StatsReceiver - - /** - * Adds a RecosHoseMessage to the graph. Used by the live writer to insert edges to the - * current segment - */ - def addEdgeToGraph( - graphs: Seq[(TGraph, Set[Action.Value])], - recosHoseMessage: RecosHoseMessage - ): Unit - - /** - * Adds a RecosHoseMessage to the given segment in the graph.
Used by catch up writers to - * insert edges to non-current (old) segments - */ - def addEdgeToSegment( - segment: Seq[(TSegment, Set[Action.Value])], - recosHoseMessage: RecosHoseMessage - ): Unit - - private val log = Logger() - private val isRunning: AtomicBoolean = new AtomicBoolean(true) - private val initialized: AtomicBoolean = new AtomicBoolean(false) - private var processors: Seq[AtLeastOnceProcessor[String, RecosHoseMessage]] = Seq.empty - private var consumers: Seq[ThreadSafeKafkaConsumerClient[String, RecosHoseMessage]] = Seq.empty - private val threadPool: ExecutorService = Executors.newCachedThreadPool() - - def shutdown(): Unit = { - processors.foreach { processor => - processor.close() - } - processors = Seq.empty - consumers.foreach { consumer => - consumer.close() - } - consumers = Seq.empty - threadPool.shutdown() - isRunning.set(false) - } - - def initHose(liveGraphs: Seq[(TGraph, Set[Action.Value])]): Unit = this.synchronized { - if (!initialized.get) { - initialized.set(true) - - val queue: java.util.Queue[Array[RecosHoseMessage]] = - new ConcurrentLinkedQueue[Array[RecosHoseMessage]]() - val queuelimit: Semaphore = new Semaphore(1024) - - initRecosHoseKafka(queue, queuelimit) - initGrpahWriters(liveGraphs, queue, queuelimit) - } else { - throw new RuntimeException("attempt to re-init kafka hose") - } - } - - private def initRecosHoseKafka( - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - ): Unit = { - try { - consumers = (0 until consumerNum).map { index => - new ThreadSafeKafkaConsumerClient( - kafkaConsumerBuilder.clientId(s"clientId-$index").enableAutoCommit(false).config) - } - processors = consumers.zipWithIndex.map { - case (consumer, index) => - val bufferedWriter = BufferedEdgeCollector(bufferSize, queue, queuelimit, statsReceiver) - val processor = RecosEdgeProcessor(bufferedWriter)(statsReceiver) - - AtLeastOnceProcessor[String, RecosHoseMessage]( - s"recos-injector-kafka-$index", - hosename, - consumer, - processor.process, - maxPendingRequests = MaxPendingRequests * bufferSize, - workerThreads = ProcessorThreads, - commitIntervalMs = CommitIntervalMs, - statsReceiver = statsReceiver - ) - } - - log.info(s"starting ${processors.size} recosKafka processors") - processors.foreach { processor => - processor.start() - } - } catch { - case e: Throwable => - e.printStackTrace() - log.error(e, e.toString) - processors.foreach { processor => - processor.close() - } - processors = Seq.empty - consumers.foreach { consumer => - consumer.close() - } - consumers = Seq.empty - } - } - - /** - * Initialize the graph writers, - * by first creating catch up writers to bootstrap the older segments, - * and then assigning a live writer to populate the live segment. - */ - private def initGrpahWriters( - liveGraphs: Seq[(TGraph, Set[Action.Value])], - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore - ): Unit = { - // define a number of (numBootstrapWriters - 1) catchup writer threads, each of which will write - // to a separate graph segment. 
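    // Each loop iteration below snapshots the current live segment of every graph for one catchup
    // writer and then rolls all graphs forward, so every catchup writer (and finally the live
    // writer) ends up with its own set of segments.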
- val catchupWriters = (0 until (catchupWriterNum - 1)).map { index => - val segments = liveGraphs.map { case (graph, actions) => (graph.getLiveSegment, actions) } - for (liveGraph <- liveGraphs) { - liveGraph._1.rollForwardSegment() - } - getCatchupWriter(segments, queue, queuelimit, index) - } - val threadPool: ExecutorService = Executors.newCachedThreadPool() - - log.info("starting live graph writer that runs until service shutdown") - - // define one live writer thread - val liveWriter = getLiveWriter(liveGraphs, queue, queuelimit) - threadPool.submit(liveWriter) - - log.info( - "starting catchup graph writer, which will terminate as soon as the catchup segment is full" - ) - catchupWriters.map(threadPool.submit(_)) - } - - private def getLiveWriter( - liveGraphs: Seq[(TGraph, Set[Action.Value])], - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - ): BufferedEdgeWriter = { - val liveEdgeCollector = new EdgeCollector { - override def addEdge(message: RecosHoseMessage): Unit = - addEdgeToGraph(liveGraphs, message) - } - BufferedEdgeWriter( - queue, - queuelimit, - liveEdgeCollector, - statsReceiver.scope("liveWriter"), - isRunning.get - ) - } - - private def getCatchupWriter( - segments: Seq[(TSegment, Set[Action.Value])], - queue: java.util.Queue[Array[RecosHoseMessage]], - queuelimit: Semaphore, - catchupWriterIndex: Int, - ): BufferedEdgeWriter = { - val catchupEdgeCollector = new EdgeCollector { - var currentNumEdges = 0 - - override def addEdge(message: RecosHoseMessage): Unit = { - currentNumEdges += 1 - addEdgeToSegment(segments, message) - } - } - val maxEdges = segments.map(_._1.getMaxNumEdges).sum - - def runCondition(): Boolean = { - isRunning.get && ((maxEdges - catchupEdgeCollector.currentNumEdges) > bufferSize) - } - - BufferedEdgeWriter( - queue, - queuelimit, - catchupEdgeCollector, - statsReceiver.scope("catcher_" + catchupWriterIndex), - runCondition - ) - } -} - -private object UnifiedGraphWriterMulti { - - // The RecosEdgeProcessor is not thread-safe. Only use one thread to process each instance. - val ProcessorThreads = 1 - // Each one cache at most 1000 * bufferSize requests. - val MaxPendingRequests = 1000 - // Short Commit MS to reduce duplicate messages. - val CommitIntervalMs: Long = 5000 // 5 seconds, Default Kafka value. 
-} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/BUILD b/src/scala/com/twitter/recos/user_tweet_entity_graph/BUILD deleted file mode 100644 index 779703f07..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/BUILD +++ /dev/null @@ -1,67 +0,0 @@ -scala_library( - name = "user_tweet_entity_graph", - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/cascading:cascading-local", - "3rdparty/jvm/com/backtype:dfs-datastores", - "3rdparty/jvm/com/fasterxml/jackson/module:jackson-module-scala", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/netflix/curator:curator-framework", - "3rdparty/jvm/com/twitter/graphjet", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/kafka:rosette-kafka", - "3rdparty/jvm/org/apache/thrift:libthrift", - "abdecider/src/main/scala", - "decider/src/main/scala", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/server", - "finagle/finagle-core/src/main", - "finagle/finagle-http/src/main/scala", - "finagle/finagle-memcached/src/main/scala", - "finagle/finagle-stats/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util", - "scrooge/scrooge-core/src/main/scala", - "servo/repo/src/main/scala", - "servo/request/src/main/scala", - "servo/util/src/main/scala", - "src/resources/com/twitter/recos:decider", - "src/scala/com/twitter/recos/decider", - "src/scala/com/twitter/recos/graph_common", - "src/scala/com/twitter/recos/hose/common", - "src/scala/com/twitter/recos/model:recos-model", - "src/scala/com/twitter/recos/serviceapi", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos:recos-common-scala", - "src/thrift/com/twitter/recos:recos-internal-scala", - "src/thrift/com/twitter/recos/user_tweet_entity_graph:user_tweet_entity_graph-scala", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms/model", - "twitter-server-internal/src/main/scala", - "twitter-server/server/src/main/scala", - "twitter-server/slf4j-jdk14/src/main/scala/com/twitter/server/logging", - "util/util-app/src/main/scala", - "util/util-hashing/src/main/scala", - "util/util-logging/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "bin", - basename = "user_tweet_entity_graph-server", - main = "com.twitter.recos.user_tweet_entity_graph.Main", - runtime_platform = "java11", - tags = [ - "bazel-compatible", - "known-to-fail-jira:SD-20990", - ], - dependencies = [ - ":user_tweet_entity_graph", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "twitter-server/slf4j-jdk14/src/main/scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/EntitySocialProofRunner.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/EntitySocialProofRunner.scala deleted file mode 100644 index 2f5806fea..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/EntitySocialProofRunner.scala +++ /dev/null @@ -1,167 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import java.util.Random -import com.twitter.concurrent.AsyncQueue -import com.twitter.finagle.stats.StatsReceiver -import 
com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.graphjet.algorithms.{ - RecommendationInfo, - RecommendationType => JavaRecommendationType -} -import com.twitter.graphjet.algorithms.socialproof.{ - NodeMetadataSocialProofGenerator, - NodeMetadataSocialProofResult, - NodeMetadataSocialProofRequest => SocialProofJavaRequest, - SocialProofResponse => SocialProofJavaResponse -} -import com.twitter.logging.Logger -import com.twitter.recos.model.SalsaQueryRunner.SalsaRunnerConfig -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - RecommendationType => ThriftRecommendationType, - RecommendationSocialProofRequest => SocialProofThriftRequest -} -import com.twitter.util.{Future, Try} -import it.unimi.dsi.fastutil.bytes.{Byte2ObjectArrayMap, Byte2ObjectMap} -import it.unimi.dsi.fastutil.ints.{IntOpenHashSet, IntSet} -import it.unimi.dsi.fastutil.longs.{Long2DoubleMap, Long2DoubleOpenHashMap} -import scala.collection.JavaConverters._ - -/** - * EntitySocialProofRunner creates a queue of reader threads, NodeMetadataProofGenerator, - * and each one reads from the graph and computes social proofs. - */ -class EntitySocialProofRunner( - graph: NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph, - salsaRunnerConfig: SalsaRunnerConfig, - statsReceiver: StatsReceiver) { - private val log: Logger = Logger() - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - private val socialProofSizeStat = stats.stat("socialProofSize") - - private val socialProofFailureCounter = stats.counter("failure") - private val pollCounter = stats.counter("poll") - private val pollTimeoutCounter = stats.counter("pollTimeout") - private val offerCounter = stats.counter("offer") - private val pollLatencyStat = stats.stat("pollLatency") - private val socialProofRunnerPool = initSocialProofRunnerPool() - - private def initSocialProofRunnerPool(): AsyncQueue[NodeMetadataSocialProofGenerator] = { - val socialProofQueue = new AsyncQueue[NodeMetadataSocialProofGenerator] - (0 until salsaRunnerConfig.numSalsaRunners).foreach { _ => - socialProofQueue.offer(new NodeMetadataSocialProofGenerator(graph)) - } - socialProofQueue - } - - /** - * Helper method to interpret the output of SocialProofJavaResponse - * - * @param socialProofResponse is the response from running NodeMetadataSocialProof - * @return a sequence of SocialProofResult - */ - private def transformSocialProofResponse( - socialProofResponse: Option[SocialProofJavaResponse] - ): Seq[RecommendationInfo] = { - socialProofResponse match { - case Some(response) => - val scalaResponse = response.getRankedRecommendations.asScala - scalaResponse.foreach { result => - socialProofSizeStat.add( - result.asInstanceOf[NodeMetadataSocialProofResult].getSocialProofSize) - } - scalaResponse.toSeq - case _ => Nil - } - } - - /** - * Helper method to run social proof computation and convert the results to Option - * - * @param socialProof is socialProof reader on bipartite graph - * @param request is the socialProof request - * @return is an option of SocialProofJavaResponse - */ - private def getSocialProofResponse( - socialProof: NodeMetadataSocialProofGenerator, - request: SocialProofJavaRequest, - random: Random - )( - implicit statsReceiver: StatsReceiver - ): Option[SocialProofJavaResponse] = { - val attempt = Try(socialProof.computeRecommendations(request, random)).onFailure { e => - socialProofFailureCounter.incr() - log.error(e, "SocialProof computation failed") - } - attempt.toOption - } 
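
  // Illustrative call pattern (a sketch, not part of the class): the runner is constructed once
  // with a populated graph and then invoked per request via apply() defined below.
  //
  //   val runner = new EntitySocialProofRunner(graph, salsaRunnerConfig, statsReceiver)
  //   val recs: Future[Seq[RecommendationInfo]] = runner(socialProofThriftRequest)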
- - /** - * Attempt to retrieve a NodeMetadataSocialProof thread from the runner pool - * to execute a socialProofRequest - */ - private def handleSocialProofRequest(socialProofRequest: SocialProofJavaRequest) = { - pollCounter.incr() - val t0 = System.currentTimeMillis() - socialProofRunnerPool.poll().map { entitySocialProof => - val pollTime = System.currentTimeMillis - t0 - pollLatencyStat.add(pollTime) - val socialProofResponse = Try { - if (pollTime < salsaRunnerConfig.timeoutSalsaRunner) { - val response = - getSocialProofResponse(entitySocialProof, socialProofRequest, new Random())( - statsReceiver - ) - transformSocialProofResponse(response) - } else { - // if we did not get a social proof in time, then fail fast here and immediately put it back - log.warning("socialProof polling timeout") - pollTimeoutCounter.incr() - throw new RuntimeException("socialProof poll timeout") - Nil - } - } ensure { - socialProofRunnerPool.offer(entitySocialProof) - offerCounter.incr() - } - socialProofResponse.toOption getOrElse Nil - } - } - - /** - * This apply() supports requests coming from the new social proof endpoint in UTEG that works for - * tweet social proof generation, as well as hashtag and url social proof generation. - * Currently this endpoint supports url social proof generation for Guide. - */ - def apply(request: SocialProofThriftRequest): Future[Seq[RecommendationInfo]] = { - val nodeMetadataTypeToIdsMap: Byte2ObjectMap[IntSet] = new Byte2ObjectArrayMap[IntSet]() - request.recommendationIdsForSocialProof.collect { - case (ThriftRecommendationType.Url, urlIds) => - // We must convert the Long url ids into type Int since the underlying library expects Int type metadata ids. - val urlIntIds = urlIds.map(_.toInt) - nodeMetadataTypeToIdsMap.put( - JavaRecommendationType.URL.getValue.toByte, - new IntOpenHashSet(urlIntIds.toArray) - ) - case (ThriftRecommendationType.Hashtag, hashtagIds) => - // We must convert the Long hashtag ids into type Int since the underlying library expects Int type metadata ids. 
- val hashtagIntIds = hashtagIds.map(_.toInt) - nodeMetadataTypeToIdsMap.put( - JavaRecommendationType.HASHTAG.getValue.toByte, - new IntOpenHashSet(hashtagIntIds.toArray) - ) - } - - val leftSeedNodes: Long2DoubleMap = new Long2DoubleOpenHashMap( - request.seedsWithWeights.keys.toArray, - request.seedsWithWeights.values.toArray - ) - - val socialProofRequest = new SocialProofJavaRequest( - nodeMetadataTypeToIdsMap, - leftSeedNodes, - UserTweetEdgeTypeMask.getUserTweetGraphSocialProofTypes(request.socialProofTypes) - ) - - handleSocialProofRequest(socialProofRequest) - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/LoggingUserTweetEntityGraph.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/LoggingUserTweetEntityGraph.scala deleted file mode 100644 index ab1a44324..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/LoggingUserTweetEntityGraph.scala +++ /dev/null @@ -1,103 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.tracing.Trace -import com.twitter.logging.Logger -import com.twitter.recos.user_tweet_entity_graph.thriftscala._ -import com.twitter.util.Future - -trait LoggingUserTweetEntityGraph extends thriftscala.UserTweetEntityGraph.MethodPerEndpoint { - private[this] val accessLog = Logger("access") - - abstract override def recommendTweets( - request: RecommendTweetEntityRequest - ): Future[RecommendTweetEntityResponse] = { - val time = System.currentTimeMillis - super.recommendTweets(request) onSuccess { resp => - accessLog.info( - "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\tRecommendTweetResponse size: %s\t%s in %d ms" - .format( - time, - Trace.id.toString(), - request.requesterId, - request.displayLocation, - request.recommendationTypes, - request.maxResultsByType, - request.excludedTweetIds.map(_.take(5)), - request.excludedTweetIds.map(_.size), - request.seedsWithWeights.take(5), - request.seedsWithWeights.size, - request.maxTweetAgeInMillis, - request.maxUserSocialProofSize, - request.maxTweetSocialProofSize, - request.minUserSocialProofSizes, - request.tweetTypes, - request.socialProofTypes, - request.socialProofTypeUnions, - resp.recommendations.size, - resp.recommendations.take(20).toList map { - case UserTweetEntityRecommendationUnion.TweetRec(tweetRec) => - (tweetRec.tweetId, tweetRec.socialProofByType.map { case (k, v) => (k, v.size) }) - case UserTweetEntityRecommendationUnion.HashtagRec(hashtagRec) => - (hashtagRec.id, hashtagRec.socialProofByType.map { case (k, v) => (k, v.size) }) - case UserTweetEntityRecommendationUnion.UrlRec(urlRec) => - (urlRec.id, urlRec.socialProofByType.map { case (k, v) => (k, v.size) }) - case _ => - throw new Exception("Unsupported recommendation types") - }, - System.currentTimeMillis - time - ) - ) - } onFailure { exc => - accessLog.error( - "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s in %d ms".format( - time, - Trace.id.toString(), - request.requesterId, - request.displayLocation, - request.recommendationTypes, - request.maxResultsByType, - request.excludedTweetIds.map(_.take(5)), - request.excludedTweetIds.map(_.size), - request.seedsWithWeights.take(5), - request.seedsWithWeights.size, - request.maxTweetAgeInMillis, - request.maxUserSocialProofSize, - request.maxTweetSocialProofSize, - request.minUserSocialProofSizes, - request.tweetTypes, - request.socialProofTypes, - request.socialProofTypeUnions, - exc, - System.currentTimeMillis - time - ) - ) - } - } - - abstract override def 
findTweetSocialProofs( - request: SocialProofRequest - ): Future[SocialProofResponse] = { - val time = System.currentTimeMillis - super.findTweetSocialProofs(request) onSuccess { resp => - accessLog.info( - "%s\t%s\t%d\tResponse: %s\tin %d ms".format( - Trace.id.toString, - request.requesterId, - request.seedsWithWeights.size, - resp.socialProofResults.toList, - System.currentTimeMillis - time - ) - ) - } onFailure { exc => - accessLog.info( - "%s\t%s\t%d\tException: %s\tin %d ms".format( - Trace.id.toString, - request.requesterId, - request.seedsWithWeights.size, - exc, - System.currentTimeMillis - time - ) - ) - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/Main.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/Main.scala deleted file mode 100644 index 9bd39d57e..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/Main.scala +++ /dev/null @@ -1,258 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.abdecider.ABDeciderFactory -import com.twitter.abdecider.LoggingABDecider -import com.twitter.app.Flag -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.ThriftMux -import com.twitter.finagle.http.HttpMuxer -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.mtls.server.MtlsStackServer._ -import com.twitter.finagle.mux.transport.OpportunisticTls -import com.twitter.finagle.thrift.ClientId -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.finatra.kafka.domain.KafkaGroupId -import com.twitter.finatra.kafka.domain.SeekStrategy -import com.twitter.finatra.kafka.serde.ScalaSerdes -import com.twitter.frigate.common.util.ElfOwlFilter -import com.twitter.frigate.common.util.ElfOwlFilter.ByLdapGroup -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.logging._ -import com.twitter.recos.decider.UserTweetEntityGraphDecider -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.graph_common.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.model.Constants -import com.twitter.recos.user_tweet_entity_graph.RecosConfig._ -import com.twitter.server.logging.{Logging => JDK14Logging} -import com.twitter.server.Deciderable -import com.twitter.server.TwitterServer -import com.twitter.thriftwebforms.MethodOptions -import com.twitter.thriftwebforms.TwitterServerThriftWebForms -import com.twitter.util.Await -import com.twitter.util.Duration -import java.net.InetSocketAddress -import java.util.concurrent.TimeUnit -import org.apache.kafka.clients.CommonClientConfigs -import org.apache.kafka.common.config.SaslConfigs -import org.apache.kafka.common.config.SslConfigs -import org.apache.kafka.common.security.auth.SecurityProtocol -import org.apache.kafka.common.serialization.StringDeserializer - -object Main extends TwitterServer with JDK14Logging with Deciderable { - profile => - - val shardId: Flag[Int] = flag("shardId", 0, "Shard ID") - val servicePort: Flag[InetSocketAddress] = - flag("service.port", new InetSocketAddress(10143), "Thrift service port") - val logDir: Flag[String] = flag("logdir", "recos", "Logging directory") - val numShards: Flag[Int] = flag("numShards", 1, "Number of shards for this service") - val truststoreLocation: Flag[String] = - flag[String]("truststore_location", "", "Truststore file location") - val 
hoseName: Flag[String] = - flag("hosename", "recos_injector_user_user", "the kafka stream used for incoming edges") - - val dataCenter: Flag[String] = flag("service.cluster", "atla", "Data Center") - val serviceRole: Flag[String] = flag("service.role", "Service Role") - val serviceEnv: Flag[String] = flag("service.env", "Service Env") - val serviceName: Flag[String] = flag("service.name", "Service Name") - - private val maxNumSegments = - flag("maxNumSegments", graphBuilderConfig.maxNumSegments, "the number of segments in the graph") - - private val statsReceiverWrapper = FinagleStatsReceiverWrapper(statsReceiver) - - lazy val clientId = ClientId(s"usertweetentitygraph.${serviceEnv()}") - - private val shutdownTimeout = flag( - "service.shutdownTimeout", - 5.seconds, - "Maximum amount of time to wait for pending requests to complete on shutdown" - ) - - // ********* logging ********** - - lazy val loggingLevel: Level = Level.INFO - lazy val recosLogPath: String = logDir() + "/recos.log" - lazy val graphLogPath: String = logDir() + "/graph.log" - lazy val accessLogPath: String = logDir() + "/access.log" - - override def loggerFactories: List[LoggerFactory] = - List( - LoggerFactory( - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = recosLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "graph", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = graphLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "access", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = accessLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "client_event", - level = Some(loggingLevel), - useParents = false, - handlers = QueueingHandler( - maxQueueSize = 10000, - handler = ScribeHandler( - category = "client_event", - formatter = BareFormatter - ) - ) :: Nil - ) - ) - // ******** Decider ************* - - val graphDecider: UserTweetEntityGraphDecider = UserTweetEntityGraphDecider() - - // ********* ABdecider ********** - - val abDeciderYmlPath: String = "/usr/local/config/abdecider/abdecider.yml" - - val scribeLogger: Option[Logger] = Some(Logger.get("client_event")) - - val abDecider: LoggingABDecider = - ABDeciderFactory( - abDeciderYmlPath = abDeciderYmlPath, - scribeLogger = scribeLogger, - environment = Some("production") - ).buildWithLogging() - - // ********* Recos service ********** - - private def getKafkaBuilder() = { - FinagleKafkaConsumerBuilder[String, RecosHoseMessage]() - .dest("/s/kafka/recommendations:kafka-tls") - .groupId(KafkaGroupId(f"user_tweet_entity_graph-${shardId()}%06d")) - .keyDeserializer(new StringDeserializer) - .valueDeserializer(ScalaSerdes.Thrift[RecosHoseMessage].deserializer) - .seekStrategy(SeekStrategy.REWIND) - .rewindDuration(20.hours) - .withConfig(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_SSL.toString) - .withConfig(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, truststoreLocation()) - .withConfig(SaslConfigs.SASL_MECHANISM, SaslConfigs.GSSAPI_MECHANISM) - .withConfig(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka") - 
.withConfig(SaslConfigs.SASL_KERBEROS_SERVER_NAME, "kafka") - } - def main(): Unit = { - log.info("building graph with maxNumSegments = " + profile.maxNumSegments()) - val graph = NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder( - graphBuilderConfig.copy(maxNumSegments = profile.maxNumSegments()), - statsReceiverWrapper - ) - - val kafkaConfigBuilder = getKafkaBuilder() - - val graphWriter = - UserTweetEntityGraphWriter( - shardId().toString, - serviceEnv(), - hoseName(), - 128, // keep the original setting. - kafkaConfigBuilder, - clientId.name, - statsReceiver, - ) - graphWriter.initHose(graph) - - val tweetRecsRunner = new TweetRecommendationsRunner( - graph, - Constants.salsaRunnerConfig, - statsReceiverWrapper - ) - - val tweetSocialProofRunner = new TweetSocialProofRunner( - graph, - Constants.salsaRunnerConfig, - statsReceiver - ) - - val entitySocialProofRunner = new EntitySocialProofRunner( - graph, - Constants.salsaRunnerConfig, - statsReceiver - ) - - val recommendationHandler = new RecommendationHandler(tweetRecsRunner, statsReceiver) - - /* - * Old social proof handler retained to support old tweet social proof endpoint. - * Future clients should utilize the findRecommendationSocialProofs endpoint which will use - * the more broad "SocialProofHandler" - */ - val tweetSocialProofHandler = new TweetSocialProofHandler( - tweetSocialProofRunner, - graphDecider, - statsReceiver - ) - val socialProofHandler = new SocialProofHandler( - tweetSocialProofRunner, - entitySocialProofRunner, - graphDecider, - statsReceiver - ) - val userTweetEntityGraph = new UserTweetEntityGraph( - recommendationHandler, - tweetSocialProofHandler, - socialProofHandler - ) with LoggingUserTweetEntityGraph - - // For MutualTLS - val serviceIdentifier = ServiceIdentifier( - role = serviceRole(), - service = serviceName(), - environment = serviceEnv(), - zone = dataCenter() - ) - log.info(s"ServiceIdentifier = ${serviceIdentifier.toString}") - - val thriftServer = ThriftMux.server - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .serveIface(servicePort(), userTweetEntityGraph) - - log.info("clientid: " + clientId.toString) - log.info("servicePort: " + servicePort().toString) - - log.info("adding shutdown hook") - onExit { - graphWriter.shutdown() - thriftServer.close(shutdownTimeout().fromNow) - } - log.info("added shutdown hook") - - // Wait on the thriftServer so that shutdownTimeout is respected. - Await.result(thriftServer) - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md b/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md deleted file mode 100644 index 39af44deb..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md +++ /dev/null @@ -1,17 +0,0 @@ -# UserTweetEntityGraph (UTEG) - -## What is it -User Tweet Entity Graph (UTEG) is a Finagle thrift service built on the GraphJet framework. It maintains a graph of user-tweet relationships and serves user recommendations based on traversals in this graph. - -## How is it used on Twitter -UTEG generates the "XXX Liked" out-of-network tweets seen on Twitter's Home Timeline. -The core idea behind UTEG is collaborative filtering. UTEG takes a user's weighted follow graph (i.e. a list of weighted userIds) as input, -performs efficient traversal & aggregation, and returns the top tweets, ranked by both the number of seed users who engaged with each tweet and -those users' weights.
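(Editorial sketch, not part of the deleted README.) To make the collaborative-filtering aggregation described above concrete, here is a minimal sketch in plain Scala — illustrative only, since the production path runs this traversal inside GraphJet over live graph segments, and every name below is hypothetical:

```scala
object UtegScoringSketch {
  /**
   * Score candidate tweets by summing the weights of the seed (followed) users who engaged
   * with them, then keep the top k. `seedsWithWeights` mirrors the weighted follow graph in
   * the request; `engagements` stands in for the in-memory user-tweet engagement edges.
   */
  def topTweets(
    seedsWithWeights: Map[Long, Double],
    engagements: Map[Long, Seq[Long]],
    k: Int
  ): Seq[(Long, Double)] = {
    val scores = scala.collection.mutable.Map.empty[Long, Double].withDefaultValue(0.0)
    for {
      (userId, weight) <- seedsWithWeights
      tweetId <- engagements.getOrElse(userId, Nil)
    } scores(tweetId) += weight
    scores.toSeq.sortBy { case (_, score) => -score }.take(k)
  }
}
```

The real runners add recency windows, social-proof-size thresholds, and per-type result limits on top of this basic weighted count.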
- -UTEG is a stateful service and relies on a Kafka stream to ingest & persist states. It maintains in-memory user engagements over the past -24-48 hours. Older events are dropped and GC'ed. - -For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose high-performance in-memory storage engine. -- https://github.com/twitter/GraphJet -- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/RecommendationHandler.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/RecommendationHandler.scala deleted file mode 100644 index 80749cd76..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/RecommendationHandler.scala +++ /dev/null @@ -1,78 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.frigate.common.util.StatsUtil -import com.twitter.graphjet.algorithms.RecommendationType -import com.twitter.graphjet.algorithms.counting.tweet.TweetMetadataRecommendationInfo -import com.twitter.graphjet.algorithms.counting.tweet.TweetRecommendationInfo -import com.twitter.recos.user_tweet_entity_graph.thriftscala._ -import com.twitter.recos.util.Stats -import com.twitter.servo.request._ -import com.twitter.util.Future - -/** - * Implementation of the Thrift-defined service interface. - * -* A wrapper of magicRecsRunner. - */ -class RecommendationHandler( - tweetRecsRunner: TweetRecommendationsRunner, - statsReceiver: StatsReceiver) - extends RequestHandler[RecommendTweetEntityRequest, RecommendTweetEntityResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - private val socialProofHydrator = new SocialProofHydrator(stats) - - override def apply(request: RecommendTweetEntityRequest): Future[RecommendTweetEntityResponse] = { - val scopedStats: StatsReceiver = stats.scope(request.displayLocation.toString) - - StatsUtil.trackBlockStats(scopedStats) { - val candidatesFuture = tweetRecsRunner.apply(request) - - candidatesFuture.map { candidates => - if (candidates.isEmpty) scopedStats.counter(Stats.EmptyResult).incr() - else scopedStats.counter(Stats.Served).incr(candidates.size) - - RecommendTweetEntityResponse(candidates.flatMap { - _ match { - case tweetRec: TweetRecommendationInfo => - Some( - UserTweetEntityRecommendationUnion.TweetRec( - TweetRecommendation( - tweetRec.getRecommendation, - tweetRec.getWeight, - socialProofHydrator.addTweetSocialProofByType(tweetRec), - socialProofHydrator.addTweetSocialProofs(tweetRec) - ) - ) - ) - case tweetMetadataRec: TweetMetadataRecommendationInfo => - if (tweetMetadataRec.getRecommendationType == RecommendationType.HASHTAG) { - Some( - UserTweetEntityRecommendationUnion.HashtagRec( - HashtagRecommendation( - tweetMetadataRec.getRecommendation, - tweetMetadataRec.getWeight, - socialProofHydrator.addMetadataSocialProofByType(tweetMetadataRec) - ) - ) - ) - } else if (tweetMetadataRec.getRecommendationType == RecommendationType.URL) { - Some( - UserTweetEntityRecommendationUnion.UrlRec( - UrlRecommendation( - tweetMetadataRec.getRecommendation, - tweetMetadataRec.getWeight, - socialProofHydrator.addMetadataSocialProofByType(tweetMetadataRec) - ) - ) - ) - } else { - None: Option[UserTweetEntityRecommendationUnion] - } - case _ => None - } - }) - } - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/RecosConfig.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/RecosConfig.scala deleted file mode 100644 
index c37d2911d..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/RecosConfig.scala +++ /dev/null @@ -1,44 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.graphjet.algorithms.RecommendationType -import com.twitter.recos.model.Constants -import com.twitter.recos.graph_common.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.GraphBuilderConfig - -/** - * The class holds all the config parameters for recos graph. - */ -object RecosConfig { - val maxNumSegments: Int = 8 // this value will be overwritten by a parameter from profile config - val maxNumEdgesPerSegment: Int = 1 << 27 // 134M edges per segment - val expectedNumLeftNodes: Int = 1 << 24 // 16M nodes - val expectedMaxLeftDegree: Int = 64 - val leftPowerLawExponent: Double = 16.0 // steep power law as most nodes will have a small degree - val expectedNumRightNodes: Int = 1 << 24 // 16M nodes - val numRightNodeMetadataTypes: Int = - RecommendationType.METADATASIZE.getValue // two node metadata types: hashtag and url - - val graphBuilderConfig = GraphBuilderConfig( - maxNumSegments = maxNumSegments, - maxNumEdgesPerSegment = maxNumEdgesPerSegment, - expectedNumLeftNodes = expectedNumLeftNodes, - expectedMaxLeftDegree = expectedMaxLeftDegree, - leftPowerLawExponent = leftPowerLawExponent, - expectedNumRightNodes = expectedNumRightNodes, - numRightNodeMetadataTypes = numRightNodeMetadataTypes, - edgeTypeMask = new UserTweetEdgeTypeMask() - ) - - val maxUserSocialProofSize: Int = 10 - val maxTweetSocialProofSize: Int = 10 - val maxTweetAgeInMillis: Long = 24 * 60 * 60 * 1000 - val maxEngagementAgeInMillis: Long = Long.MaxValue - - println("RecosConfig - maxNumSegments " + maxNumSegments) - println("RecosConfig - maxNumEdgesPerSegment " + maxNumEdgesPerSegment) - println("RecosConfig - expectedNumLeftNodes " + expectedNumLeftNodes) - println("RecosConfig - expectedMaxLeftDegree " + expectedMaxLeftDegree) - println("RecosConfig - leftPowerLawExponent " + leftPowerLawExponent) - println("RecosConfig - expectedNumRightNodes " + expectedNumRightNodes) - println("RecosConfig - numRightNodeMetadataTypes " + numRightNodeMetadataTypes) - println("RecosConfig - salsaRunnerConfig " + Constants.salsaRunnerConfig) -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHandler.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHandler.scala deleted file mode 100644 index 8d74cbe37..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHandler.scala +++ /dev/null @@ -1,165 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.algorithms.{ - RecommendationInfo, - RecommendationType => JavaRecommendationType -} -import com.twitter.graphjet.algorithms.socialproof.{ - NodeMetadataSocialProofResult => EntitySocialProofJavaResult, - SocialProofResult => SocialProofJavaResult -} -import com.twitter.recos.decider.UserTweetEntityGraphDecider -import com.twitter.recos.util.Stats -import com.twitter.recos.util.Stats._ -import com.twitter.recos.recos_common.thriftscala.{SocialProofType => SocialProofThriftType} -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - HashtagRecommendation, - TweetRecommendation, - UrlRecommendation, - UserTweetEntityRecommendationUnion, - RecommendationSocialProofRequest => SocialProofThriftRequest, - RecommendationSocialProofResponse => SocialProofThriftResponse, - RecommendationType => 
ThriftRecommendationType -} -import com.twitter.servo.request.RequestHandler -import com.twitter.util.{Future, Try} -import scala.collection.JavaConverters._ - -class SocialProofHandler( - tweetSocialProofRunner: TweetSocialProofRunner, - entitySocialProofRunner: EntitySocialProofRunner, - decider: UserTweetEntityGraphDecider, - statsReceiver: StatsReceiver) - extends RequestHandler[SocialProofThriftRequest, SocialProofThriftResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - private def getThriftSocialProof( - entitySocialProof: EntitySocialProofJavaResult - ): Map[SocialProofThriftType, Map[Long, Seq[Long]]] = { - val socialProofAttempt = Try(entitySocialProof.getSocialProof) - .onFailure { e => - stats.counter(e.getClass.getSimpleName).incr() - } - - socialProofAttempt.toOption match { - case Some(socialProof) if socialProof.isEmpty => - stats.counter(Stats.EmptyResult).incr() - Map.empty[SocialProofThriftType, Map[Long, Seq[Long]]] - case Some(socialProof) if !socialProof.isEmpty => - socialProof.asScala.map { - case (socialProofType, socialProofUserToTweetsMap) => - val userToTweetsSocialProof = socialProofUserToTweetsMap.asScala.map { - case (socialProofUser, connectingTweets) => - (socialProofUser.toLong, connectingTweets.asScala.map(Long2long).toSeq) - }.toMap - (SocialProofThriftType(socialProofType.toInt), userToTweetsSocialProof) - }.toMap - case _ => - Map.empty[SocialProofThriftType, Map[Long, Seq[Long]]] - } - } - - private def getThriftSocialProof( - tweetSocialProof: SocialProofJavaResult - ): Map[SocialProofThriftType, Seq[Long]] = { - val socialProofAttempt = Try(tweetSocialProof.getSocialProof) - .onFailure { e => - stats.counter(e.getClass.getSimpleName).incr() - } - - socialProofAttempt.toOption match { - case Some(socialProof) if socialProof.isEmpty => - stats.counter(Stats.EmptyResult).incr() - Map.empty[SocialProofThriftType, Seq[Long]] - case Some(socialProof) if !socialProof.isEmpty => - socialProof.asScala.map { - case (socialProofType, connectingUsers) => - ( - SocialProofThriftType(socialProofType.toInt), - connectingUsers.asScala.map { Long2long }.toSeq) - }.toMap - case _ => - Map.empty[SocialProofThriftType, Seq[Long]] - } - } - - private def getEntitySocialProof( - request: SocialProofThriftRequest - ): Future[Seq[UserTweetEntityRecommendationUnion]] = { - val socialProofsFuture = entitySocialProofRunner(request) - - socialProofsFuture.map { socialProofs: Seq[RecommendationInfo] => - stats.counter(Stats.Served).incr(socialProofs.size) - socialProofs.flatMap { entitySocialProof: RecommendationInfo => - val entitySocialProofJavaResult = - entitySocialProof.asInstanceOf[EntitySocialProofJavaResult] - if (entitySocialProofJavaResult.getRecommendationType == JavaRecommendationType.URL) { - Some( - UserTweetEntityRecommendationUnion.UrlRec( - UrlRecommendation( - entitySocialProofJavaResult.getNodeMetadataId, - entitySocialProofJavaResult.getWeight, - getThriftSocialProof(entitySocialProofJavaResult) - ) - ) - ) - } else if (entitySocialProofJavaResult.getRecommendationType == JavaRecommendationType.HASHTAG) { - Some( - UserTweetEntityRecommendationUnion.HashtagRec( - HashtagRecommendation( - entitySocialProofJavaResult.getNodeMetadataId, - entitySocialProofJavaResult.getWeight, - getThriftSocialProof(entitySocialProofJavaResult) - ) - ) - ) - } else { - None - } - } - } - } - - private def getTweetSocialProof( - request: SocialProofThriftRequest - ): Future[Seq[UserTweetEntityRecommendationUnion]] = { - val socialProofsFuture = 
tweetSocialProofRunner(request) - - socialProofsFuture.map { socialProofs: Seq[RecommendationInfo] => - stats.counter(Stats.Served).incr(socialProofs.size) - socialProofs.flatMap { tweetSocialProof: RecommendationInfo => - val tweetSocialProofJavaResult = tweetSocialProof.asInstanceOf[SocialProofJavaResult] - Some( - UserTweetEntityRecommendationUnion.TweetRec( - TweetRecommendation( - tweetSocialProofJavaResult.getNode, - tweetSocialProofJavaResult.getWeight, - getThriftSocialProof(tweetSocialProofJavaResult) - ) - ) - ) - } - } - } - - def apply(request: SocialProofThriftRequest): Future[SocialProofThriftResponse] = { - trackFutureBlockStats(stats) { - val recommendationsWithSocialProofFut = Future - .collect { - request.recommendationIdsForSocialProof.keys.map { - case ThriftRecommendationType.Tweet if decider.tweetSocialProof => - getTweetSocialProof(request) - case (ThriftRecommendationType.Url | ThriftRecommendationType.Hashtag) - if decider.entitySocialProof => - getEntitySocialProof(request) - case _ => - Future.Nil - }.toSeq - }.map(_.flatten) - recommendationsWithSocialProofFut.map { recommendationsWithSocialProof => - SocialProofThriftResponse(recommendationsWithSocialProof) - } - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHydrator.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHydrator.scala deleted file mode 100644 index ed44de053..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/SocialProofHydrator.scala +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.algorithms.counting.tweet.{ - TweetMetadataRecommendationInfo, - TweetRecommendationInfo -} -import com.twitter.recos.recos_common.thriftscala.{SocialProof, SocialProofType} - -import scala.collection.JavaConverters._ - -class SocialProofHydrator(statsReceiver: StatsReceiver) { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - private val socialProofsDup = stats.counter("socialProofsDup") - private val socialProofsUni = stats.counter("socialProofsUni") - private val socialProofByTypeDup = stats.counter("socialProofByTypeDup") - private val socialProofByTypeUni = stats.counter("socialProofByTypeUni") - - // If the social proof type is favorite, there are cases that one user favs, unfavs and then favs the same tweet again. - // In this case, UTEG only returns one valid social proof. Note that GraphJet library compares the number of unique users - // with the minSocialProofThreshold, so the threshold checking logic is correct. - // If the social proof type is reply or quote, there are valid cases that one user replies the same tweet multiple times. - // GraphJet does not handle this deduping because this is Twitter specific logic. 
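(Editorial worked example of the dedup rule described above; the `getSocialProofs` method that follows is the actual implementation.) Assuming the hypothetical inputs below:

```scala
// User 12 faved, unfaved, then faved the same tweet again, so they appear twice.
val users    = Seq(12L, 34L, 12L)
val metadata = Seq(100L, 101L, 102L)
// For SocialProofType.Favorite only the first (user, metadata) pair per user is kept,
// preserving order: Seq(SocialProof(12L, Some(100L)), SocialProof(34L, Some(101L)))
```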
- def getSocialProofs( - socialProofType: SocialProofType, - users: Seq[Long], - metadata: Seq[Long] - ): Seq[SocialProof] = { - if (socialProofType == SocialProofType.Favorite && users.size > 1 && users.size != users.distinct.size) { - socialProofsDup.incr() - val unique = users - .zip(metadata) - .foldLeft[Seq[(Long, Long)]](Nil) { (list, next) => - { - val test = list find { _._1 == next._1 } - if (test.isEmpty) next +: list else list - } - } - .reverse - unique.map { case (user, data) => SocialProof(user, Some(data)) } - } else { - socialProofsUni.incr() - users.zip(metadata).map { case (user, data) => SocialProof(user, Some(data)) } - } - - } - - // Extract and dedup social proofs from GraphJet. Only Favorite based social proof needs to dedup. - // Return the social proofs (userId, metadata) pair in SocialProof thrift objects. - def addTweetSocialProofs( - tweet: TweetRecommendationInfo - ): Option[Map[SocialProofType, Seq[SocialProof]]] = { - Some( - tweet.getSocialProof.asScala.map { - case (socialProofType, socialProof) => - val socialProofThriftType = SocialProofType(socialProofType.toByte) - ( - socialProofThriftType, - getSocialProofs( - socialProofThriftType, - socialProof.getConnectingUsers.asScala.map(_.toLong), - socialProof.getMetadata.asScala.map(_.toLong) - ) - ) - }.toMap - ) - } - - def getSocialProofs(users: Seq[Long]): Seq[Long] = { - if (users.size > 1) { - val distinctUsers = users.distinct - if (users.size != distinctUsers.size) { - socialProofByTypeDup.incr() - } else { - socialProofByTypeUni.incr() - } - distinctUsers - } else { - socialProofByTypeUni.incr() - users - } - } - - // Extract and dedup social proofs from GraphJet. All social proof types need to dedup. - // Return the userId social proofs without metadata. - def addTweetSocialProofByType(tweet: TweetRecommendationInfo): Map[SocialProofType, Seq[Long]] = { - tweet.getSocialProof.asScala.map { - case (socialProofType, socialProof) => - ( - SocialProofType(socialProofType.toByte), - getSocialProofs(socialProof.getConnectingUsers.asScala.map(_.toLong)) - ) - }.toMap - } - - // The Hashtag and URL Social Proof. Dedup is not necessary. 
- def addMetadataSocialProofByType( - tweetMetadataRec: TweetMetadataRecommendationInfo - ): Map[SocialProofType, Map[Long, Seq[Long]]] = { - tweetMetadataRec.getSocialProof.asScala.map { - case (socialProofType, socialProof) => - ( - SocialProofType(socialProofType.toByte), - socialProof.asScala.map { - case (authorId, tweetIds) => - (authorId.toLong, tweetIds.asScala.map(_.toLong)) - }.toMap) - }.toMap - } - -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetRecommendationsRunner.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetRecommendationsRunner.scala deleted file mode 100644 index 428c2dd6d..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetRecommendationsRunner.scala +++ /dev/null @@ -1,322 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import java.util.Random -import com.twitter.concurrent.AsyncQueue -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.algorithms._ -import com.twitter.graphjet.algorithms.filters._ -import com.twitter.graphjet.algorithms.counting.TopSecondDegreeByCountResponse -import com.twitter.graphjet.algorithms.counting.tweet.TopSecondDegreeByCountForTweet -import com.twitter.graphjet.algorithms.counting.tweet.TopSecondDegreeByCountRequestForTweet -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedMultiSegmentBipartiteGraph -import com.twitter.logging.Logger -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.model.SalsaQueryRunner.SalsaRunnerConfig -import com.twitter.recos.recos_common.thriftscala.SocialProofType -import com.twitter.recos.user_tweet_entity_graph.thriftscala.RecommendTweetEntityRequest -import com.twitter.recos.user_tweet_entity_graph.thriftscala.TweetEntityDisplayLocation -import com.twitter.recos.user_tweet_entity_graph.thriftscala.TweetType -import com.twitter.recos.util.Stats.trackBlockStats -import com.twitter.util.Future -import com.twitter.util.JavaTimer -import com.twitter.util.Try -import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap -import it.unimi.dsi.fastutil.longs.LongOpenHashSet -import scala.collection.JavaConverters._ - -import com.twitter.graphjet.algorithms.RecommendationType -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - RecommendationType => ThriftRecommendationType -} -import scala.collection.Map -import scala.collection.Set - -object TweetRecommendationsRunner { - private val DefaultTweetTypes: Seq[TweetType] = - Seq(TweetType.Regular, TweetType.Summary, TweetType.Photo, TweetType.Player) - private val DefaultF1ExactSocialProofSize = 1 - private val DefaultRareTweetRecencyMillis: Long = 7.days.inMillis - - /** - * Map valid social proof types specified by clients to an array of bytes. If clients do not - * specify any social proof type unions in thrift, it will return an empty set by default. 
- */ - private def getSocialProofTypeUnions( - socialProofTypeUnions: Option[Set[Seq[SocialProofType]]] - ): Set[Array[Byte]] = { - socialProofTypeUnions - .map { - _.map { - _.map { - _.getValue.toByte - }.toArray - } - } - .getOrElse(Set.empty) - } - - private def getRecommendationTypes( - recommendationTypes: Seq[ThriftRecommendationType] - ): Set[RecommendationType] = { - recommendationTypes.flatMap { - _ match { - case ThriftRecommendationType.Tweet => Some(RecommendationType.TWEET) - case ThriftRecommendationType.Hashtag => Some(RecommendationType.HASHTAG) - case ThriftRecommendationType.Url => Some(RecommendationType.URL) - case _ => - throw new Exception("Unmatched Recommendation Type in getRecommendationTypes") - } - }.toSet - } - - private def convertThriftEnumsToJavaEnums( - maxResults: Option[Map[ThriftRecommendationType, Int]] - ): Map[RecommendationType, Integer] = { - maxResults - .map { - _.flatMap { - _ match { - case (ThriftRecommendationType.Tweet, v) => Some((RecommendationType.TWEET, v: Integer)) - case (ThriftRecommendationType.Hashtag, v) => - Some((RecommendationType.HASHTAG, v: Integer)) - case (ThriftRecommendationType.Url, v) => Some((RecommendationType.URL, v: Integer)) - case _ => - throw new Exception("Unmatched Recommendation Type in convertThriftEnumsToJavaEnums") - } - } - } - .getOrElse(Map.empty) - } - -} - -/** - * The MagicRecsRunner creates a queue of reader threads, MagicRecs, and each one reads from the - * graph and computes recommendations. - */ -class TweetRecommendationsRunner( - bipartiteGraph: NodeMetadataLeftIndexedMultiSegmentBipartiteGraph, - salsaRunnerConfig: SalsaRunnerConfig, - statsReceiverWrapper: FinagleStatsReceiverWrapper) { - - import TweetRecommendationsRunner._ - - private val log: Logger = Logger() - - private val stats = statsReceiverWrapper.statsReceiver.scope(this.getClass.getSimpleName) - private val magicRecsFailureCounter = stats.counter("failure") - private val pollCounter = stats.counter("poll") - private val pollTimeoutCounter = stats.counter("pollTimeout") - private val offerCounter = stats.counter("offer") - private val pollLatencyStat = stats.stat("pollLatency") - - private val magicRecsQueue = new AsyncQueue[TopSecondDegreeByCountForTweet] - (0 until salsaRunnerConfig.numSalsaRunners).foreach { _ => - magicRecsQueue.offer( - new TopSecondDegreeByCountForTweet( - bipartiteGraph, - salsaRunnerConfig.expectedNodesToHitInSalsa, - statsReceiverWrapper.scope(this.getClass.getSimpleName) - ) - ) - } - - private implicit val timer: JavaTimer = new JavaTimer(true) - - private def getBaseFilters( - staleTweetDuration: Long, - tweetTypes: Seq[TweetType] - ) = { - List( - // Keep RecentTweetFilter first since it's the cheapest - new RecentTweetFilter(staleTweetDuration, statsReceiverWrapper), - new TweetCardFilter( - tweetTypes.contains(TweetType.Regular), - tweetTypes.contains(TweetType.Summary), - tweetTypes.contains(TweetType.Photo), - tweetTypes.contains(TweetType.Player), - false, // no promoted tweets - statsReceiverWrapper - ), - new DirectInteractionsFilter(bipartiteGraph, statsReceiverWrapper), - new RequestedSetFilter(statsReceiverWrapper), - new SocialProofTypesFilter(statsReceiverWrapper) - ) - } - - /** - * Helper method to interpret the output of MagicRecs graph - * - * @param magicRecsResponse is the response from running MagicRecs - * @return a sequence of candidate ids, with score and list of social proofs - */ - private def transformMagicRecsResponse( - magicRecsResponse: Option[TopSecondDegreeByCountResponse] 
- ): Seq[RecommendationInfo] = { - val responses = magicRecsResponse match { - case Some(response) => response.getRankedRecommendations.asScala.toSeq - case _ => Nil - } - responses - } - - /** - * Helper function to determine different post-process filtering logic in GraphJet, - * based on display locations - */ - private def getFiltersByDisplayLocations( - displayLocation: TweetEntityDisplayLocation, - whitelistAuthors: LongOpenHashSet, - blacklistAuthors: LongOpenHashSet, - validSocialProofs: Array[Byte] - ) = { - displayLocation match { - case TweetEntityDisplayLocation.MagicRecsF1 => - Seq( - new ANDFilters( - List[ResultFilter]( - new TweetAuthorFilter( - bipartiteGraph, - whitelistAuthors, - new LongOpenHashSet(), - statsReceiverWrapper), - new ExactUserSocialProofSizeFilter( - DefaultF1ExactSocialProofSize, - validSocialProofs, - statsReceiverWrapper - ) - ).asJava, - statsReceiverWrapper - ), - // Blacklist filter must be applied separately from F1's AND filter chain - new TweetAuthorFilter( - bipartiteGraph, - new LongOpenHashSet(), - blacklistAuthors, - statsReceiverWrapper) - ) - case TweetEntityDisplayLocation.MagicRecsRareTweet => - Seq( - new TweetAuthorFilter( - bipartiteGraph, - whitelistAuthors, - blacklistAuthors, - statsReceiverWrapper), - new RecentEdgeMetadataFilter( - DefaultRareTweetRecencyMillis, - UserTweetEdgeTypeMask.Tweet.id.toByte, - statsReceiverWrapper - ) - ) - case _ => - Seq( - new TweetAuthorFilter( - bipartiteGraph, - whitelistAuthors, - blacklistAuthors, - statsReceiverWrapper)) - } - } - - /** - * Helper method to run salsa computation and convert the results to Option - * - * @param magicRecs is magicRecs reader on bipartite graph - * @param magicRecsRequest is the magicRecs request - * @return is an option of MagicRecsResponse - */ - private def getMagicRecsResponse( - magicRecs: TopSecondDegreeByCountForTweet, - magicRecsRequest: TopSecondDegreeByCountRequestForTweet - )( - implicit statsReceiver: StatsReceiver - ): Option[TopSecondDegreeByCountResponse] = { - trackBlockStats(stats) { - val random = new Random() - // compute recs -- need to catch and print exceptions here otherwise they are swallowed - val magicRecsAttempt = - Try(magicRecs.computeRecommendations(magicRecsRequest, random)).onFailure { e => - magicRecsFailureCounter.incr() - log.error(e, "MagicRecs computation failed") - } - magicRecsAttempt.toOption - } - } - - private def getMagicRecsRequest( - request: RecommendTweetEntityRequest - ): TopSecondDegreeByCountRequestForTweet = { - val requesterId = request.requesterId - val leftSeedNodes = new Long2DoubleOpenHashMap( - request.seedsWithWeights.keys.toArray, - request.seedsWithWeights.values.toArray - ) - val tweetsToExcludeArray = new LongOpenHashSet(request.excludedTweetIds.getOrElse(Nil).toArray) - val staleTweetDuration = request.maxTweetAgeInMillis.getOrElse(RecosConfig.maxTweetAgeInMillis) - val staleEngagementDuration = - request.maxEngagementAgeInMillis.getOrElse(RecosConfig.maxEngagementAgeInMillis) - val tweetTypes = request.tweetTypes.getOrElse(DefaultTweetTypes) - val tweetAuthors = new LongOpenHashSet(request.tweetAuthors.getOrElse(Nil).toArray) - val excludedTweetAuthors = new LongOpenHashSet( - request.excludedTweetAuthors.getOrElse(Nil).toArray) - val validSocialProofs = - UserTweetEdgeTypeMask.getUserTweetGraphSocialProofTypes(request.socialProofTypes) - - val resultFilterChain = new ResultFilterChain( - ( - getBaseFilters(staleTweetDuration, tweetTypes) ++ - getFiltersByDisplayLocations( - displayLocation = 
request.displayLocation, - whitelistAuthors = tweetAuthors, - blacklistAuthors = excludedTweetAuthors, - validSocialProofs = validSocialProofs - ) - ).asJava - ) - - new TopSecondDegreeByCountRequestForTweet( - requesterId, - leftSeedNodes, - tweetsToExcludeArray, - getRecommendationTypes(request.recommendationTypes).asJava, - convertThriftEnumsToJavaEnums(request.maxResultsByType).asJava, - UserTweetEdgeTypeMask.SIZE, - request.maxUserSocialProofSize.getOrElse(RecosConfig.maxUserSocialProofSize), - request.maxTweetSocialProofSize.getOrElse(RecosConfig.maxTweetSocialProofSize), - convertThriftEnumsToJavaEnums(request.minUserSocialProofSizes).asJava, - validSocialProofs, - staleTweetDuration, - staleEngagementDuration, - resultFilterChain, - getSocialProofTypeUnions(request.socialProofTypeUnions).asJava - ) - } - - def apply(request: RecommendTweetEntityRequest): Future[Seq[RecommendationInfo]] = { - pollCounter.incr() - val t0 = System.currentTimeMillis - magicRecsQueue.poll().map { magicRecs => - val pollTime = System.currentTimeMillis - t0 - pollLatencyStat.add(pollTime) - val magicRecsResponse = Try { - if (pollTime < salsaRunnerConfig.timeoutSalsaRunner) { - val magicRecsRequest = getMagicRecsRequest(request) - transformMagicRecsResponse( - getMagicRecsResponse(magicRecs, magicRecsRequest)(statsReceiverWrapper.statsReceiver) - ) - } else { - // if we did not get a magicRecs in time, then fail fast here and immediately put it back - log.warning("magicRecsQueue polling timeout") - pollTimeoutCounter.incr() - throw new RuntimeException("magicRecs poll timeout") - Nil - } - } ensure { - magicRecsQueue.offer(magicRecs) - offerCounter.incr() - } - magicRecsResponse.toOption getOrElse Nil - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofHandler.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofHandler.scala deleted file mode 100644 index 6ab493589..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofHandler.scala +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.frigate.common.util.StatsUtil -import com.twitter.graphjet.algorithms.RecommendationInfo -import com.twitter.graphjet.algorithms.socialproof.{SocialProofResult => SocialProofJavaResult} -import com.twitter.recos.decider.UserTweetEntityGraphDecider -import com.twitter.recos.util.Stats -import com.twitter.recos.util.Stats._ -import com.twitter.recos.recos_common.thriftscala.{SocialProofType => SocialProofThriftType} -import com.twitter.recos.user_tweet_entity_graph.thriftscala.TweetRecommendation -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - SocialProofRequest => SocialProofThriftRequest -} -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - SocialProofResponse => SocialProofThriftResponse -} -import com.twitter.servo.request.RequestHandler -import com.twitter.util.Future -import scala.collection.JavaConverters._ - -class TweetSocialProofHandler( - tweetSocialProofRunner: TweetSocialProofRunner, - decider: UserTweetEntityGraphDecider, - statsReceiver: StatsReceiver) - extends RequestHandler[SocialProofThriftRequest, SocialProofThriftResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - def getThriftSocialProof( - tweetSocialProof: SocialProofJavaResult - ): Map[SocialProofThriftType, Seq[Long]] = { - Option(tweetSocialProof.getSocialProof) match { - case Some(socialProof) 
if socialProof.isEmpty => - stats.counter(Stats.EmptyResult).incr() - Map.empty[SocialProofThriftType, Seq[Long]] - case Some(socialProof) if !socialProof.isEmpty => - socialProof.asScala.map { - case (socialProofType, connectingUsers) => - ( - SocialProofThriftType(socialProofType.toInt), - connectingUsers.asScala.map { Long2long }.toSeq) - }.toMap - case _ => - throw new Exception("TweetSocialProofHandler gets wrong TweetSocialProof response") - } - } - - def apply(request: SocialProofThriftRequest): Future[SocialProofThriftResponse] = { - StatsUtil.trackBlockStats(stats) { - if (decider.tweetSocialProof) { - val socialProofsFuture = tweetSocialProofRunner(request) - - socialProofsFuture map { socialProofs: Seq[RecommendationInfo] => - stats.counter(Stats.Served).incr(socialProofs.size) - SocialProofThriftResponse( - socialProofs.flatMap { tweetSocialProof: RecommendationInfo => - val tweetSocialProofJavaResult = tweetSocialProof.asInstanceOf[SocialProofJavaResult] - Some( - TweetRecommendation( - tweetSocialProofJavaResult.getNode, - tweetSocialProofJavaResult.getWeight, - getThriftSocialProof(tweetSocialProofJavaResult) - ) - ) - } - ) - } - } else { - Future.value(SocialProofThriftResponse()) - } - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofRunner.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofRunner.scala deleted file mode 100644 index e0b38e067..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/TweetSocialProofRunner.scala +++ /dev/null @@ -1,168 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import java.util.Random -import com.twitter.concurrent.AsyncQueue -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedMultiSegmentBipartiteGraph -import com.twitter.graphjet.algorithms.RecommendationInfo -import com.twitter.graphjet.algorithms.socialproof.{ - SocialProofResult, - TweetSocialProofGenerator, - SocialProofRequest => SocialProofJavaRequest, - SocialProofResponse => SocialProofJavaResponse -} -import com.twitter.logging.Logger -import com.twitter.recos.model.SalsaQueryRunner.SalsaRunnerConfig -import com.twitter.recos.user_tweet_entity_graph.thriftscala.{ - RecommendationType, - RecommendationSocialProofRequest => RecommendationSocialProofThriftRequest, - SocialProofRequest => SocialProofThriftRequest -} -import com.twitter.util.{Future, Try} -import it.unimi.dsi.fastutil.longs.{Long2DoubleMap, Long2DoubleOpenHashMap, LongArraySet} -import scala.collection.JavaConverters._ - -/** - * TweetSocialProofRunner creates a queue of reader threads, TweetSocialProofGenerator, and each one - * reads from the graph and computes social proofs. 
- */ -class TweetSocialProofRunner( - bipartiteGraph: NodeMetadataLeftIndexedMultiSegmentBipartiteGraph, - salsaRunnerConfig: SalsaRunnerConfig, - statsReceiver: StatsReceiver) { - private val log: Logger = Logger() - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - private val socialProofSizeStat = stats.stat("socialProofSize") - - private val socialProofFailureCounter = stats.counter("failure") - private val pollCounter = stats.counter("poll") - private val pollTimeoutCounter = stats.counter("pollTimeout") - private val offerCounter = stats.counter("offer") - private val pollLatencyStat = stats.stat("pollLatency") - private val socialProofRunnerPool = initSocialProofRunnerPool() - - private def initSocialProofRunnerPool(): AsyncQueue[TweetSocialProofGenerator] = { - val socialProofQueue = new AsyncQueue[TweetSocialProofGenerator] - (0 until salsaRunnerConfig.numSalsaRunners).foreach { _ => - socialProofQueue.offer(new TweetSocialProofGenerator(bipartiteGraph)) - } - socialProofQueue - } - - /** - * Helper method to interpret the output of SocialProofJavaResponse - * - * @param socialProofResponse is the response from running TweetSocialProof - * @return a sequence of SocialProofResult - */ - private def transformSocialProofResponse( - socialProofResponse: Option[SocialProofJavaResponse] - ): Seq[RecommendationInfo] = { - socialProofResponse match { - case Some(response) => - val scalaResponse = response.getRankedRecommendations.asScala - scalaResponse.foreach { result => - socialProofSizeStat.add(result.asInstanceOf[SocialProofResult].getSocialProofSize) - } - scalaResponse.toSeq - case _ => Nil - } - } - - /** - * Helper method to run social proof computation and convert the results to Option - * - * @param socialProof is socialProof reader on bipartite graph - * @param request is the socialProof request - * @return is an option of SocialProofJavaResponse - */ - private def getSocialProofResponse( - socialProof: TweetSocialProofGenerator, - request: SocialProofJavaRequest, - random: Random - )( - implicit statsReceiver: StatsReceiver - ): Option[SocialProofJavaResponse] = { - val attempt = Try(socialProof.computeRecommendations(request, random)).onFailure { e => - socialProofFailureCounter.incr() - log.error(e, "SocialProof computation failed") - } - attempt.toOption - } - - /** - * Attempt to retrieve a TweetSocialProof thread from the runner pool - * to execute a socialProofRequest - */ - private def handleSocialProofRequest(socialProofRequest: SocialProofJavaRequest) = { - pollCounter.incr() - val t0 = System.currentTimeMillis() - socialProofRunnerPool.poll().map { tweetSocialProof => - val pollTime = System.currentTimeMillis - t0 - pollLatencyStat.add(pollTime) - val socialProofResponse = Try { - if (pollTime < salsaRunnerConfig.timeoutSalsaRunner) { - val response = getSocialProofResponse(tweetSocialProof, socialProofRequest, new Random())( - statsReceiver - ) - transformSocialProofResponse(response) - } else { - // if we did not get a social proof in time, then fail fast here and immediately put it back - log.warning("socialProof polling timeout") - pollTimeoutCounter.incr() - throw new RuntimeException("socialProof poll timeout") - Nil - } - } ensure { - socialProofRunnerPool.offer(tweetSocialProof) - offerCounter.incr() - } - socialProofResponse.toOption getOrElse Nil - } - } - - /** - * This apply() supports requests coming from the old tweet social proof endpoint. 
- * Currently this supports clients such as Email Recommendations, MagicRecs, and HomeTimeline. - * In order to avoid heavy migration work, we are retaining this endpoint. - */ - def apply(request: SocialProofThriftRequest): Future[Seq[RecommendationInfo]] = { - val tweetSet = new LongArraySet(request.inputTweets.toArray) - val leftSeedNodes: Long2DoubleMap = new Long2DoubleOpenHashMap( - request.seedsWithWeights.keys.toArray, - request.seedsWithWeights.values.toArray - ) - - val socialProofRequest = new SocialProofJavaRequest( - tweetSet, - leftSeedNodes, - UserTweetEdgeTypeMask.getUserTweetGraphSocialProofTypes(request.socialProofTypes) - ) - - handleSocialProofRequest(socialProofRequest) - } - - /** - * This apply() supports requests coming from the new social proof endpoint in UTEG that works for - * tweet social proof generation, as well as hashtag and url social proof generation. - * Currently this endpoint supports url social proof generation for Guide. - */ - def apply(request: RecommendationSocialProofThriftRequest): Future[Seq[RecommendationInfo]] = { - val tweetIds = request.recommendationIdsForSocialProof.collect { - case (RecommendationType.Tweet, ids) => ids - }.flatten - val tweetSet = new LongArraySet(tweetIds.toArray) - val leftSeedNodes: Long2DoubleMap = new Long2DoubleOpenHashMap( - request.seedsWithWeights.keys.toArray, - request.seedsWithWeights.values.toArray - ) - - val socialProofRequest = new SocialProofJavaRequest( - tweetSet, - leftSeedNodes, - UserTweetEdgeTypeMask.getUserTweetGraphSocialProofTypes(request.socialProofTypes) - ) - - handleSocialProofRequest(socialProofRequest) - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEdgeTypeMask.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEdgeTypeMask.scala deleted file mode 100644 index b8e855ffd..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEdgeTypeMask.scala +++ /dev/null @@ -1,95 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.recos.recos_common.thriftscala.SocialProofType -import com.twitter.recos.util.Action - -/** - * The bit mask is used to encode edge types in the top bits of an integer, - * e.g. favorite, retweet, reply and click. Under current segment configuration, each segment - * stores up to 128M edges. Assuming that each node on one side is unique, each segment - * stores up to 128M unique nodes on one side, which occupies the lower 27 bits of an integer. - * This leaves five bits to encode the edge types, which at max can store 32 edge types. - * The following implementation utilizes the top four bits and leaves one free bit out. 
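 * (Editorial worked example, not in the original file: with this layout, encode(node = 5,
 * edgeType = Favorite.id.toByte = 1) returns 5 | (1 << 28) = 0x10000005; edgeType(0x10000005)
 * recovers 1, and restore(0x10000005) recovers 5, since MASK keeps only the low 28 bits.)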
- */ -class UserTweetEdgeTypeMask extends EdgeTypeMask { - import UserTweetEdgeTypeMask._ - - override def encode(node: Int, edgeType: Byte): Int = { - if (edgeType < 0 || edgeType > SIZE || edgeType == Click.id.toByte) { - throw new IllegalArgumentException("encode: Illegal edge type argument " + edgeType) - } else { - node | (edgeType << 28) - } - } - - override def edgeType(node: Int): Byte = { - (node >>> 28).toByte - } - - override def restore(node: Int): Int = { - node & MASK - } -} - -object UserTweetEdgeTypeMask extends Enumeration { - - type UserTweetEdgeTypeMask = Value - - /** - * Byte values corresponding to the action taken on a tweet, which will be encoded in the - * top 4 bits in a tweet Id - * NOTE: THERE CAN ONLY BE UP TO 16 TYPES - */ - val Click: UserTweetEdgeTypeMask = Value(0) - val Favorite: UserTweetEdgeTypeMask = Value(1) - val Retweet: UserTweetEdgeTypeMask = Value(2) - val Reply: UserTweetEdgeTypeMask = Value(3) - val Tweet: UserTweetEdgeTypeMask = Value(4) - val IsMentioned: UserTweetEdgeTypeMask = Value(5) - val IsMediatagged: UserTweetEdgeTypeMask = Value(6) - val Quote: UserTweetEdgeTypeMask = Value(7) - val Unfavorite: UserTweetEdgeTypeMask = Value(8) - - /** - * Reserve the top four bits of each integer to encode the edge type information. - */ - val MASK: Int = Integer.parseInt("00001111111111111111111111111111", 2) - val SIZE: Int = this.values.size - - /** - * Map valid social proof types specified by clients to an array of bytes. If clients do not - * specify any social proof types in thrift, it will return all available social types by - * default. - * - * @param socialProofTypes are the valid socialProofTypes specified by clients - * @return an array of bytes representing valid social proof types - */ - def getUserTweetGraphSocialProofTypes( - socialProofTypes: Option[Seq[SocialProofType]] - ): Array[Byte] = { - socialProofTypes - .map { _.map { _.getValue }.toArray } - .getOrElse((0 until SIZE).toArray) - .map { _.toByte } - } - - /** - * Converts the action byte in the RecosHoseMessage into GraphJet internal byte mapping - */ - def actionTypeToEdgeType(actionByte: Byte): Byte = { - val edgeType = Action(actionByte) match { - case Action.Favorite => Favorite.id - case Action.Retweet => Retweet.id - case Action.Reply => Reply.id - case Action.Tweet => Tweet.id - case Action.IsMentioned => IsMentioned.id - case Action.IsMediaTagged => IsMediatagged.id - case Action.Quote => Quote.id - case Action.Unfavorite => Unfavorite.id - case _ => - throw new IllegalArgumentException("getEdgeType: Illegal edge type argument " + actionByte) - } - edgeType.toByte - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraph.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraph.scala deleted file mode 100644 index 1ac23fb3b..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraph.scala +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.thrift.ClientId -import com.twitter.finagle.tracing.{Trace, TraceId} -import com.twitter.recos.user_tweet_entity_graph.thriftscala._ -import com.twitter.util.Future - -object UserTweetEntityGraph { - def traceId: TraceId = Trace.id - def clientId: Option[ClientId] = ClientId.current -} - -class UserTweetEntityGraph( - recommendationHandler: RecommendationHandler, - tweetSocialProofHandler: TweetSocialProofHandler, - socialProofHandler: SocialProofHandler) - extends 
thriftscala.UserTweetEntityGraph.MethodPerEndpoint { - - override def recommendTweets( - request: RecommendTweetEntityRequest - ): Future[RecommendTweetEntityResponse] = recommendationHandler(request) - - /** - * Given a query user, its seed users, and a set of input tweets, return the social proofs of - * input tweets if any. - * - * Currently this supports clients such as Email Recommendations, MagicRecs, and HomeTimeline. - * In order to avoid heavy migration work, we are retaining this endpoint. - */ - override def findTweetSocialProofs( - request: SocialProofRequest - ): Future[SocialProofResponse] = tweetSocialProofHandler(request) - - /** - * Find social proof for the specified RecommendationType given a set of input ids of that type. - * Only find social proofs from the specified seed users with the specified social proof types. - * - * Currently this supports url social proof generation for Guide. - * - * This endpoint is flexible enough to support social proof generation for all recommendation - * types, and should be used for all future clients of this service. - */ - override def findRecommendationSocialProofs( - request: RecommendationSocialProofRequest - ): Future[RecommendationSocialProofResponse] = socialProofHandler(request) -} diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraphWriter.scala b/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraphWriter.scala deleted file mode 100644 index eff1b22bd..000000000 --- a/src/scala/com/twitter/recos/user_tweet_entity_graph/UserTweetEntityGraphWriter.scala +++ /dev/null @@ -1,105 +0,0 @@ -package com.twitter.recos.user_tweet_entity_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.algorithms.{RecommendationType, TweetIDMask} -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.segment.NodeMetadataLeftIndexedBipartiteGraphSegment -import com.twitter.recos.hose.common.UnifiedGraphWriter -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.serviceapi.Tweetypie._ - -/** - * The class submits a number of $numBootstrapWriters graph writer threads, BufferedEdgeWriter, - * during service startup. One of them is live writer thread, and the other $(numBootstrapWriters - 1) - * are catchup writer threads. All of them consume kafka events from an internal concurrent queue, - * which is populated by kafka reader threads. At bootstrap time, the kafka reader threads look - * back kafka offset from several hours ago and populate the internal concurrent queue. - * Each graph writer thread writes to an individual graph segment separately. - * The $(numBootstrapWriters - 1) catchup writer threads will stop once all events - * between current system time at startup and the time in memcache are processed. - * The live writer thread will continue to write all incoming kafka events. - * It lives through the entire life cycle of recos graph service. 
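To make the thread layout concrete, a small accounting sketch using values that appear in the writer implementations in this diff (maxNumSegments = 8 is the user_tweet_graph RecosConfig value shown later in this diff and is assumed here for illustration):

```scala
// Illustrative thread accounting for one writer instance (not executable config).
val maxNumSegments      = 8                        // assumed, from the user_tweet_graph RecosConfig below
val consumerNum         = 4                        // kafka reader threads, ~25 MB/s each => ~100 MB/s catch-up
val catchupWriterNum    = maxNumSegments - 1       // 7 catch-up writers, one per historical segment
val liveWriterNum       = 1                        // the single live writer owns the newest segment
val numBootstrapWriters = liveWriterNum + catchupWriterNum  // 8 writer threads submitted at startup
```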
- */ -case class UserTweetEntityGraphWriter( - shardId: String, - env: String, - hosename: String, - bufferSize: Int, - kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage], - clientId: String, - statsReceiver: StatsReceiver) - extends UnifiedGraphWriter[ - NodeMetadataLeftIndexedBipartiteGraphSegment, - NodeMetadataLeftIndexedMultiSegmentBipartiteGraph - ] { - writer => - // The max throughput for each kafka consumer is around 25MB/s - // Use 4 processors for 100MB/s catch-up speed. - val consumerNum: Int = 4 - // Leave 1 Segments to LiveWriter - val catchupWriterNum: Int = RecosConfig.maxNumSegments - 1 - - private final val EMTPY_LEFT_NODE_METADATA = new Array[Array[Int]](1) - - /** - * Adds a RecosHoseMessage to the graph. used by live writer to insert edges to the - * current segment - */ - override def addEdgeToGraph( - graph: NodeMetadataLeftIndexedMultiSegmentBipartiteGraph, - recosHoseMessage: RecosHoseMessage - ): Unit = { - graph.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserTweetEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action), - recosHoseMessage.edgeMetadata.getOrElse(0L), - EMTPY_LEFT_NODE_METADATA, - extractEntities(recosHoseMessage) - ) - } - - /** - * Adds a RecosHoseMessage to the given segment in the graph. Used by catch up writers to - * insert edges to non-current (old) segments - */ - override def addEdgeToSegment( - segment: NodeMetadataLeftIndexedBipartiteGraphSegment, - recosHoseMessage: RecosHoseMessage - ): Unit = { - segment.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserTweetEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action), - recosHoseMessage.edgeMetadata.getOrElse(0L), - EMTPY_LEFT_NODE_METADATA, - extractEntities(recosHoseMessage) - ) - } - - private def getMetaEdge(rightId: Long, cardOption: Option[Byte]): Long = { - cardOption - .map { card => - if (isPhotoCard(card)) TweetIDMask.photo(rightId) - else if (isPlayerCard(card)) TweetIDMask.player(rightId) - else if (isSummaryCard(card)) TweetIDMask.summary(rightId) - else if (isPromotionCard(card)) TweetIDMask.promotion(rightId) - else rightId - } - .getOrElse(rightId) - } - - private def extractEntities(message: RecosHoseMessage): Array[Array[Int]] = { - val entities: Array[Array[Int]] = - new Array[Array[Int]](RecommendationType.METADATASIZE.getValue) - message.entities.foreach { - _.foreach { - case (entityType, ids) => - entities.update(entityType, ids.toArray) - } - } - entities - } - -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/BUILD b/src/scala/com/twitter/recos/user_tweet_graph/BUILD deleted file mode 100644 index 92f06d1c9..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/BUILD +++ /dev/null @@ -1,66 +0,0 @@ -scala_library( - name = "user-tweet-graph", - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/cascading:cascading-local", - "3rdparty/jvm/com/backtype:dfs-datastores", - "3rdparty/jvm/com/fasterxml/jackson/module:jackson-module-scala", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/netflix/curator:curator-framework", - "3rdparty/jvm/com/twitter/graphjet", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/kafka:rosette-kafka", - "3rdparty/jvm/org/apache/thrift:libthrift", - "abdecider/src/main/scala", - 
"decider/src/main/scala", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/server", - "finagle/finagle-core/src/main", - "finagle/finagle-http/src/main/scala", - "finagle/finagle-memcached/src/main/scala", - "finagle/finagle-stats/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util", - "scrooge/scrooge-core/src/main/scala", - "servo/repo/src/main/scala", - "servo/request/src/main/scala", - "servo/util/src/main/scala", - "src/resources/com/twitter/recos:decider", - "src/scala/com/twitter/recos/decider", - "src/scala/com/twitter/recos/graph_common", - "src/scala/com/twitter/recos/hose/common", - "src/scala/com/twitter/recos/model:recos-model", - "src/scala/com/twitter/recos/serviceapi", - "src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers", - "src/scala/com/twitter/recos/util:recos-util", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/recos:recos-common-scala", - "src/thrift/com/twitter/recos:recos-internal-scala", - "src/thrift/com/twitter/recos/user_tweet_graph:user_tweet_graph-scala", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms/model", - "twitter-server-internal/src/main/scala", - "twitter-server/server/src/main/scala", - "twitter-server/slf4j-jdk14/src/main/scala/com/twitter/server/logging", - "util/util-app/src/main/scala", - "util/util-hashing/src/main/scala", - "util/util-logging/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "bin", - basename = "user-tweet-graph-server", - main = "com.twitter.recos.user_tweet_graph.Main", - runtime_platform = "java11", - tags = ["known-to-fail-jira:SD-20771"], - dependencies = [ - ":user-tweet-graph", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "twitter-server/slf4j-jdk14/src/main/scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_tweet_graph/Main.scala b/src/scala/com/twitter/recos/user_tweet_graph/Main.scala deleted file mode 100644 index 2920481f3..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/Main.scala +++ /dev/null @@ -1,291 +0,0 @@ -package com.twitter.recos.user_tweet_graph - -import com.twitter.abdecider.ABDeciderFactory -import com.twitter.abdecider.LoggingABDecider -import com.twitter.app.Flag -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.ThriftMux -import com.twitter.finagle.http.HttpMuxer -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.mtls.client.MtlsStackClient.MtlsThriftMuxClientSyntax -import com.twitter.finagle.mtls.server.MtlsStackServer._ -import com.twitter.finagle.mux.ClientDiscardedRequestException -import com.twitter.finagle.mux.transport.OpportunisticTls -import com.twitter.finagle.service.ReqRep -import com.twitter.finagle.service.ResponseClass -import com.twitter.finagle.thrift.ClientId -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.finatra.kafka.domain.KafkaGroupId -import com.twitter.finatra.kafka.domain.SeekStrategy -import com.twitter.finatra.kafka.serde.ScalaSerdes -import com.twitter.frigate.common.util.ElfOwlFilter -import com.twitter.frigate.common.util.ElfOwlFilter.ByLdapGroup -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph -import com.twitter.logging._ -import com.twitter.recos.decider.EndpointLoadShedder 
-import com.twitter.recos.decider.UserTweetGraphDecider -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.graph_common.MultiSegmentPowerLawBipartiteGraphBuilder -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.user_tweet_graph.RecosConfig._ -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ConsumersBasedRelatedTweetsHandler -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ProducerBasedRelatedTweetsHandler -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.TweetBasedRelatedTweetsHandler -import com.twitter.recos.user_tweet_graph.store.UserRecentFollowersStore -import com.twitter.server.Deciderable -import com.twitter.server.TwitterServer -import com.twitter.server.logging.{Logging => JDK14Logging} -import com.twitter.servo.request._ -import com.twitter.servo.util.ExceptionCounter -import com.twitter.simclusters_v2.common.UserId -import com.twitter.socialgraph.thriftscala.SocialGraphService -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Await -import com.twitter.util.Duration -import com.twitter.util.JavaTimer -import com.twitter.util.Throw -import com.twitter.util.Timer -import java.net.InetSocketAddress -import java.util.concurrent.TimeUnit -import org.apache.kafka.clients.CommonClientConfigs -import org.apache.kafka.common.config.SaslConfigs -import org.apache.kafka.common.config.SslConfigs -import org.apache.kafka.common.security.auth.SecurityProtocol -import org.apache.kafka.common.serialization.StringDeserializer -import scala.reflect.ClassTag - -object Main extends TwitterServer with JDK14Logging with Deciderable { - profile => - - val shardId: Flag[Int] = flag("shardId", 0, "Shard ID") - val servicePort: Flag[InetSocketAddress] = - flag("service.port", new InetSocketAddress(10143), "Thrift service port") - val logDir: Flag[String] = flag("logdir", "recos", "Logging directory") - val numShards: Flag[Int] = flag("numShards", 1, "Number of shards for this service") - val truststoreLocation: Flag[String] = - flag[String]("truststore_location", "", "Truststore file location") - val hoseName: Flag[String] = - flag("hosename", "recos_injector_user_user", "the kafka stream used for incoming edges") - - val dataCenter: Flag[String] = flag("service.cluster", "atla", "Data Center") - val serviceRole: Flag[String] = flag("service.role", "Service Role") - val serviceEnv: Flag[String] = flag("service.env", "Service Env") - val serviceName: Flag[String] = flag("service.name", "Service Name") - - private val maxNumSegments = - flag("maxNumSegments", graphBuilderConfig.maxNumSegments, "the number of segments in the graph") - - private val statsReceiverWrapper = FinagleStatsReceiverWrapper(statsReceiver) - - /** - * A ClientRequestAuthorizer to be used in a request-authorization RequestFilter. 
- */ - lazy val clientAuthorizer: ClientRequestAuthorizer = - ClientRequestAuthorizer.observed( - ClientRequestAuthorizer.permissive, - new ClientRequestObserver(statsReceiver) - ) - - lazy val clientId = ClientId(s"usertweetgraph.${serviceEnv()}") - - private def makeThriftClient[ThriftServiceType: ClassTag]( - dest: String, - label: String, - serviceIdentifier: ServiceIdentifier, - requestTimeout: Duration = 100.milliseconds - ): ThriftServiceType = { - ThriftMux.client - .withClientId(ClientId("usertweetgraph.prod")) - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .withRequestTimeout(requestTimeout) - .withStatsReceiver(statsReceiver.scope("clnt")) - .withResponseClassifier { - case ReqRep(_, Throw(_: ClientDiscardedRequestException)) => ResponseClass.Ignorable - }.build[ThriftServiceType](dest, label) - } - - private val shutdownTimeout = flag( - "service.shutdownTimeout", - 5.seconds, - "Maximum amount of time to wait for pending requests to complete on shutdown" - ) - - /** - * ExceptionCounter for tracking failures from RequestHandler(s). - */ - lazy val exceptionCounter = new ExceptionCounter(statsReceiver) - - /** - * Function for translating exceptions returned by a RequestHandler. Useful - * for cases where underlying exception types should be wrapped in those - * defined in the project's Thrift IDL. - */ - lazy val translateExceptions: PartialFunction[Throwable, Throwable] = { - case t => t - } - - // ********* logging ********** - - lazy val loggingLevel: Level = Level.INFO - lazy val recosLogPath: String = logDir() + "/recos.log" - lazy val graphLogPath: String = logDir() + "/graph.log" - lazy val accessLogPath: String = logDir() + "/access.log" - - override def loggerFactories: List[LoggerFactory] = - List( - LoggerFactory( - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = recosLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "graph", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = graphLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "access", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = accessLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "client_event", - level = Some(loggingLevel), - useParents = false, - handlers = QueueingHandler( - maxQueueSize = 10000, - handler = ScribeHandler( - category = "client_event", - formatter = BareFormatter - ) - ) :: Nil - ) - ) - // ******** Decider ************* - - // ********* ABdecider ********** - - val abDeciderYmlPath: String = "/usr/local/config/abdecider/abdecider.yml" - - val scribeLogger: Option[Logger] = Some(Logger.get("client_event")) - - val abDecider: LoggingABDecider = - ABDeciderFactory( - abDeciderYmlPath = abDeciderYmlPath, - scribeLogger = scribeLogger, - environment = Some("production") - ).buildWithLogging() - - // ********* Recos service ********** - def main(): Unit = { - log.info("building graph with maxNumSegments = " + profile.maxNumSegments()) - - implicit val timer: Timer = new JavaTimer(true) - - val graph = 
MultiSegmentPowerLawBipartiteGraphBuilder( - graphBuilderConfig.copy(maxNumSegments = profile.maxNumSegments()), - statsReceiverWrapper - ) - - val kafkaConfigBuilder = FinagleKafkaConsumerBuilder[String, RecosHoseMessage]() - .dest("/s/kafka/recommendations:kafka-tls") - .groupId(KafkaGroupId(f"user_tweet_graph-${shardId()}%06d")) - .keyDeserializer(new StringDeserializer) - .valueDeserializer(ScalaSerdes.Thrift[RecosHoseMessage].deserializer) - .seekStrategy(SeekStrategy.REWIND) - .rewindDuration(48.hours) - .withConfig(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_SSL.toString) - .withConfig(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, truststoreLocation()) - .withConfig(SaslConfigs.SASL_MECHANISM, SaslConfigs.GSSAPI_MECHANISM) - .withConfig(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka") - .withConfig(SaslConfigs.SASL_KERBEROS_SERVER_NAME, "kafka") - - val graphWriter = - UserTweetGraphWriter( - shardId().toString, - serviceEnv(), - hoseName(), - 128, // keep the original setting. - kafkaConfigBuilder, - clientId.name, - statsReceiver, - ) - graphWriter.initHose(graph) - - // For MutualTLS - val serviceIdentifier = ServiceIdentifier( - role = serviceRole(), - service = serviceName(), - environment = serviceEnv(), - zone = dataCenter() - ) - log.info(s"ServiceIdentifier = ${serviceIdentifier.toString}") - - val socialGraphClient: SocialGraphService.MethodPerEndpoint = - makeThriftClient[SocialGraphService.MethodPerEndpoint]( - "/s/socialgraph/socialgraph", - "socialgraph", - serviceIdentifier) - val userRecentFollowersStore: ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]] = - new UserRecentFollowersStore(socialGraphClient) - - val tweetBasedRelatedTweetsHandler = new TweetBasedRelatedTweetsHandler(graph, statsReceiver) - val consumersBasedRelatedTweetsHandler = - new ConsumersBasedRelatedTweetsHandler(graph, statsReceiver) - val producerBasedRelatedTweetsHandler = - new ProducerBasedRelatedTweetsHandler(graph, userRecentFollowersStore, statsReceiver) - - val decider = UserTweetGraphDecider(serviceEnv(), dataCenter()) - val endpointLoadShedder = new EndpointLoadShedder(decider) - val userTweetGraph = - new UserTweetGraph( - tweetBasedRelatedTweetsHandler, - producerBasedRelatedTweetsHandler, - consumersBasedRelatedTweetsHandler, - endpointLoadShedder)(timer) - - val thriftServer = ThriftMux.server - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .serveIface(servicePort(), userTweetGraph) - - log.info("clientid: " + clientId.toString) - log.info("servicePort: " + servicePort().toString) - - log.info("adding shutdown hook") - onExit { - graphWriter.shutdown() - thriftServer.close(shutdownTimeout().fromNow) - } - log.info("added shutdown hook") - - // Wait on the thriftServer so that shutdownTimeout is respected. - Await.result(thriftServer) - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/README.md b/src/scala/com/twitter/recos/user_tweet_graph/README.md deleted file mode 100644 index e5e8fe35a..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/README.md +++ /dev/null @@ -1,17 +0,0 @@ -# UserTweetGraph (UTG) - -## What is it -User Tweet Graph (UTG) is a Finagle Thrift service built on the GraphJet framework. It maintains a graph of user-tweet engagements and serves user recommendations based on traversals of this graph. - -## How is it used on Twitter -UTG recommends tweets based on collaborative filtering & random walks.
UTG takes a set of seed users or seed tweets as input, and performs -1-hop, 2-hop, or even 3+ hop traversals on the engagement graph. -UTG's user-tweet engagement edges are bidirectional, which enables flexible multi-hop traversals. The flipside is that -UTG is more memory-demanding than other GraphJet services such as UTEG, whose engagement edges are unidirectional. - -UTG is a stateful service and relies on a Kafka stream to ingest & persist state. The Kafka stream is processed and generated by Recos-Injector. -UTG maintains an in-memory graph of user engagements from the past 24-48 hours; older events are dropped and GC'ed. - -For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose, high-performance in-memory storage engine. -- https://github.com/twitter/GraphJet -- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf diff --git a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraph.scala b/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraph.scala deleted file mode 100644 index 6c7ab1bf6..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraph.scala +++ /dev/null @@ -1,98 +0,0 @@ -package com.twitter.recos.user_tweet_graph - -import com.twitter.finagle.thrift.ClientId -import com.twitter.finagle.tracing.Trace -import com.twitter.finagle.tracing.TraceId -import com.twitter.recos.decider.EndpointLoadShedder -import com.twitter.recos.recos_common.thriftscala._ -import com.twitter.recos.user_tweet_graph.thriftscala._ -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Timer -import scala.concurrent.duration.MILLISECONDS -import com.twitter.logging.Logger -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.TweetBasedRelatedTweetsHandler -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ProducerBasedRelatedTweetsHandler -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ConsumersBasedRelatedTweetsHandler -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.common.UserId - -object UserTweetGraph { - def traceId: TraceId = Trace.id - def clientId: Option[ClientId] = ClientId.current -} - -class UserTweetGraph( - tweetBasedRelatedTweetsHandler: TweetBasedRelatedTweetsHandler, - producerBasedRelatedTweetsHandler: ProducerBasedRelatedTweetsHandler, - consumersBasedRelatedTweetsHandler: ConsumersBasedRelatedTweetsHandler, - endpointLoadShedder: EndpointLoadShedder -)( - implicit timer: Timer) - extends thriftscala.UserTweetGraph.MethodPerEndpoint { - - private val defaultTimeout: Duration = Duration(50, MILLISECONDS) - private val EmptyResponse = Future.value(RelatedTweetResponse()) - private val EmptyFeatureResponse = Future.value(UserTweetFeatureResponse()) - - private val log = Logger() - - override def recommendTweets(request: RecommendTweetRequest): Future[RecommendTweetResponse] = - Future.value(RecommendTweetResponse()) - - override def getLeftNodeEdges(request: GetRecentEdgesRequest): Future[GetRecentEdgesResponse] = - Future.value(GetRecentEdgesResponse()) - - override def getRightNode(tweet: Long): Future[NodeInfo] = Future.value(NodeInfo()) - - // deprecated - override def relatedTweets(request: RelatedTweetRequest): Future[RelatedTweetResponse] = - EmptyResponse - - override def tweetBasedRelatedTweets( - request: TweetBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("tweetBasedRelatedTweets") { -
tweetBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-tweet-graph_tweetBasedRelatedTweets" + e) - EmptyResponse - } - - override def producerBasedRelatedTweets( - request: ProducerBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("producerBasedRelatedTweets") { - producerBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-tweet-graph_producerBasedRelatedTweets" + e) - EmptyResponse - } - - override def consumersBasedRelatedTweets( - request: ConsumersBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("consumersBasedRelatedTweets") { - consumersBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-tweet-graph_consumersBasedRelatedTweets" + e) - EmptyResponse - } - - // deprecated - override def userTweetFeatures( - userId: UserId, - tweetId: TweetId - ): Future[UserTweetFeatureResponse] = - EmptyFeatureResponse - -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphConfig.scala b/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphConfig.scala deleted file mode 100644 index 7bd9f08eb..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphConfig.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.recos.user_tweet_graph - -import com.twitter.recos.graph_common.MultiSegmentPowerLawBipartiteGraphBuilder.GraphBuilderConfig - -/** - * The class holds all the config parameters for recos graph. 
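The three active endpoints above repeat the same guard: shed load per endpoint name, cap the handler at the 50 ms default timeout, and degrade to an empty response on failure. A hypothetical refactoring sketch of that composition (it reuses the members defined in the class above and is not part of the original file):

```scala
// Hypothetical helper capturing the repeated guard pattern; relies on endpointLoadShedder,
// defaultTimeout, log, EmptyResponse and the implicit Timer defined in the class above.
private def guarded(endpointName: String)(
  run: => Future[RelatedTweetResponse]
): Future[RelatedTweetResponse] =
  endpointLoadShedder(endpointName) {
    run.raiseWithin(defaultTimeout)          // 50 ms budget
  }.rescue {
    case EndpointLoadShedder.LoadSheddingException =>
      EmptyResponse                          // shed silently
    case e =>
      log.info("user-tweet-graph_" + endpointName + e)
      EmptyResponse                          // degrade to an empty response on any other failure
  }

// e.g. the tweet-based endpoint would then become:
// override def tweetBasedRelatedTweets(request: TweetBasedRelatedTweetRequest) =
//   guarded("tweetBasedRelatedTweets")(tweetBasedRelatedTweetsHandler(request))
```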
- */ -object RecosConfig { - val maxNumSegments: Int = 8 - val maxNumEdgesPerSegment: Int = - (1 << 28) // 268M edges per segment, should be able to include 2 days' data - val expectedNumLeftNodes: Int = - (1 << 26) // should correspond to 67M nodes storage - val expectedMaxLeftDegree: Int = 64 - val leftPowerLawExponent: Double = 16.0 // steep power law as most nodes will have a small degree - val expectedNumRightNodes: Int = (1 << 26) // 67M nodes - val expectedMaxRightDegree: Int = scala.math.pow(1024, 2).toInt // some nodes will be very popular - val rightPowerLawExponent: Double = 4.0 // this will be less steep - - val graphBuilderConfig = GraphBuilderConfig( - maxNumSegments = maxNumSegments, - maxNumEdgesPerSegment = maxNumEdgesPerSegment, - expectedNumLeftNodes = expectedNumLeftNodes, - expectedMaxLeftDegree = expectedMaxLeftDegree, - leftPowerLawExponent = leftPowerLawExponent, - expectedNumRightNodes = expectedNumRightNodes, - expectedMaxRightDegree = expectedMaxRightDegree, - rightPowerLawExponent = rightPowerLawExponent - ) - - println("RecosConfig - maxNumSegments " + maxNumSegments) - println("RecosConfig - maxNumEdgesPerSegment " + maxNumEdgesPerSegment) - println("RecosConfig - expectedNumLeftNodes " + expectedNumLeftNodes) - println("RecosConfig - expectedMaxLeftDegree " + expectedMaxLeftDegree) - println("RecosConfig - leftPowerLawExponent " + leftPowerLawExponent) - println("RecosConfig - expectedNumRightNodes " + expectedNumRightNodes) - println("RecosConfig - expectedMaxRightDegree " + expectedMaxRightDegree) - println("RecosConfig - rightPowerLawExponent " + rightPowerLawExponent) -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphWriter.scala b/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphWriter.scala deleted file mode 100644 index bd7f238a1..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/UserTweetGraphWriter.scala +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.recos.user_tweet_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.recos.util.Action -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import com.twitter.recos.hose.common.UnifiedGraphWriter -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.serviceapi.Tweetypie._ -import com.twitter.recos.user_tweet_graph.util.UserTweetEdgeTypeMask - -/** - * The class submits a number of $numBootstrapWriters graph writer threads, BufferedEdgeWriter, - * during service startup. One of them is live writer thread, and the other $(numBootstrapWriters - 1) - * are catchup writer threads. All of them consume kafka events from an internal concurrent queue, - * which is populated by kafka reader threads. At bootstrap time, the kafka reader threads look - * back kafka offset from several hours ago and populate the internal concurrent queue. - * Each graph writer thread writes to an individual graph segment separately. - * The $(numBootstrapWriters - 1) catchup writer threads will stop once all events - * between current system time at startup and the time in memcache are processed. - * The live writer thread will continue to write all incoming kafka events. - * It lives through the entire life cycle of recos graph service. 
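As a back-of-envelope check, the RecosConfig constants above imply roughly the following per-instance capacity (illustrative arithmetic, not part of the original file):

```scala
// Capacity implied by RecosConfig above.
val maxNumSegments         = 8
val maxNumEdgesPerSegment  = 1 << 28                    // 268,435,456 edges, ~2 days of engagements
val expectedNumLeftNodes   = 1 << 26                    // 67,108,864 users per segment
val expectedNumRightNodes  = 1 << 26                    // 67,108,864 tweets per segment
val expectedMaxRightDegree = math.pow(1024, 2).toInt    // 1,048,576 engagements on the most popular tweets
val totalEdgeCapacity      = maxNumSegments.toLong * maxNumEdgesPerSegment  // ~2.1B edges held in memory
```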
- */ -case class UserTweetGraphWriter( - shardId: String, - env: String, - hosename: String, - bufferSize: Int, - kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage], - clientId: String, - statsReceiver: StatsReceiver) - extends UnifiedGraphWriter[BipartiteGraphSegment, MultiSegmentPowerLawBipartiteGraph] { - writer => - // The max throughput for each kafka consumer is around 25MB/s - // Use 4 processors for 100MB/s catch-up speed. - val consumerNum: Int = 4 - // Leave 1 Segments to LiveWriter - val catchupWriterNum: Int = RecosConfig.maxNumSegments - 1 - - /** - * Adds a RecosHoseMessage to the graph. used by live writer to insert edges to the - * current segment - */ - override def addEdgeToGraph( - graph: MultiSegmentPowerLawBipartiteGraph, - recosHoseMessage: RecosHoseMessage - ): Unit = { - if (Action(recosHoseMessage.action) == Action.Favorite || Action( - recosHoseMessage.action) == Action.Retweet) - graph.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserTweetEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action), - ) - } - - /** - * Adds a RecosHoseMessage to the given segment in the graph. Used by catch up writers to - * insert edges to non-current (old) segments - */ - override def addEdgeToSegment( - segment: BipartiteGraphSegment, - recosHoseMessage: RecosHoseMessage - ): Unit = { - if (Action(recosHoseMessage.action) == Action.Favorite || Action( - recosHoseMessage.action) == Action.Retweet) - segment.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserTweetEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action) - ) - } - - private def getMetaEdge(rightId: Long, cardOption: Option[Byte]): Long = { - cardOption - .map { card => - if (isPhotoCard(card)) TweetIDMask.photo(rightId) - else if (isPlayerCard(card)) TweetIDMask.player(rightId) - else if (isSummaryCard(card)) TweetIDMask.summary(rightId) - else if (isPromotionCard(card)) TweetIDMask.promotion(rightId) - else rightId - } - .getOrElse(rightId) - } - -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/BUILD b/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/BUILD deleted file mode 100644 index 898e5f6ab..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "servo/request/src/main/scala", - "src/scala/com/twitter/recos/user_tweet_graph/store", - "src/scala/com/twitter/recos/user_tweet_graph/util", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos/user_tweet_graph:user_tweet_graph-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala deleted file mode 100644 index 9f807029b..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,68 +0,0 @@ -package com.twitter.recos.user_tweet_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.user_tweet_graph.thriftscala._ -import com.twitter.recos.user_tweet_graph.util.FetchRHSTweetsUtil -import 
com.twitter.recos.user_tweet_graph.util.FilterUtil -import com.twitter.recos.user_tweet_graph.util.GetRelatedTweetCandidatesUtil -import com.twitter.recos.util.Action -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS - -/** - * Implementation of the Thrift-defined service interface for consumersTweetBasedRelatedTweets. - * given a list of consumer userIds, find the tweets they co-engaged with (we're treating input userIds as consumers therefore "consumersTweetBasedRelatedTweets" ) - * example use case: given a list of user's contacts in their address book, find tweets those contacts engaged with - */ -class ConsumersBasedRelatedTweetsHandler( - bipartiteGraph: BipartiteGraph, - statsReceiver: StatsReceiver) - extends RequestHandler[ConsumersBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: ConsumersBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - - val maxResults = request.maxResults.getOrElse(200) - val minScore = request.minScore.getOrElse(0.0) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minCooccurrence = request.minCooccurrence.getOrElse(3) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val consumerSeedSet = request.consumerSeedSet.distinct.filter { userId => - val userDegree = bipartiteGraph.getLeftNodeDegree(userId) - // constrain to users that have <100 engagements to avoid spammy behavior - userDegree < 100 - } - - val rhsTweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - consumerSeedSet, - bipartiteGraph, - Set(Action.Favorite, Action.Retweet) - ) - - val scorePreFactor = 1000.0 / consumerSeedSet.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rhsTweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter(relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId))).take(maxResults) - - stats.stat("response_size").add(relatedTweets.size) - Future.value(RelatedTweetResponse(tweets = relatedTweets)) - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala deleted file mode 100644 index dd73342ec..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,88 +0,0 @@ -package com.twitter.recos.user_tweet_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.user_tweet_graph.thriftscala._ -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS -import com.twitter.simclusters_v2.common.UserId -import com.twitter.storehaus.ReadableStore -import com.twitter.recos.user_tweet_graph.store.UserRecentFollowersStore -import 
com.twitter.recos.user_tweet_graph.util.FetchRHSTweetsUtil -import com.twitter.recos.user_tweet_graph.util.FilterUtil -import com.twitter.recos.user_tweet_graph.util.GetRelatedTweetCandidatesUtil -import com.twitter.recos.util.Action - -/** - * Implementation of the Thrift-defined service interface for producerBasedRelatedTweets. - * - */ -class ProducerBasedRelatedTweetsHandler( - bipartiteGraph: BipartiteGraph, - userRecentFollowersStore: ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]], - statsReceiver: StatsReceiver) - extends RequestHandler[ProducerBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: ProducerBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - val maxResults = request.maxResults.getOrElse(200) - val maxNumFollowers = request.maxNumFollowers.getOrElse(500) - val minScore = request.minScore.getOrElse(0.0) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minCooccurrence = request.minCooccurrence.getOrElse(4) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val followersFut = fetchFollowers(request.producerId, Some(maxNumFollowers)) - followersFut.map { followers => - val rhsTweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - followers, - bipartiteGraph, - Set(Action.Favorite, Action.Retweet) - ) - - val scorePreFactor = 1000.0 / followers.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rhsTweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter { relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId)) - }.take(maxResults) - stats.stat("response_size").add(relatedTweets.size) - RelatedTweetResponse(tweets = relatedTweets) - } - } - } - - private def fetchFollowers( - producerId: Long, - maxNumFollower: Option[Int], - ): Future[Seq[Long]] = { - val query = - UserRecentFollowersStore.Query(producerId, maxNumFollower, None) - - val followersFut = userRecentFollowersStore.get(query) - followersFut.map { followersOpt => - val followers = followersOpt.getOrElse(Seq.empty) - val followerIds = followers.distinct.filter { userId => - val userDegree = bipartiteGraph.getLeftNodeDegree(userId) - // constrain to more active users that have >1 engagement to optimize latency, and <100 engagements to avoid spammy behavior - userDegree > 1 && userDegree < 100 - } - stats.stat("follower_size_after_filter").add(followerIds.size) - followerIds - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala deleted file mode 100644 index 6643bd408..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,93 +0,0 @@ -package com.twitter.recos.user_tweet_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.features.tweet.thriftscala.GraphFeaturesForQuery -import com.twitter.recos.user_tweet_graph.thriftscala._ -import 
com.twitter.recos.user_tweet_graph.util.FilterUtil -import com.twitter.recos.user_tweet_graph.util.FetchRHSTweetsUtil -import com.twitter.recos.user_tweet_graph.util.GetAllInternalTweetIdsUtil -import com.twitter.recos.user_tweet_graph.util.GetRelatedTweetCandidatesUtil -import com.twitter.recos.user_tweet_graph.util.SampleLHSUsersUtil -import com.twitter.recos.util.Action -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS - -/** - * Implementation of the Thrift-defined service interface for tweetBasedRelatedTweets. - * - */ -class TweetBasedRelatedTweetsHandler(bipartiteGraph: BipartiteGraph, statsReceiver: StatsReceiver) - extends RequestHandler[TweetBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: TweetBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - val internalQueryTweetIds = - GetAllInternalTweetIdsUtil.getAllInternalTweetIds(request.tweetId, bipartiteGraph) - - val response = internalQueryTweetIds match { - case head +: Nil => getRelatedTweets(request, head) - case _ => RelatedTweetResponse() - } - Future.value(response) - } - } - - private def getRelatedTweets( - request: TweetBasedRelatedTweetRequest, - maskedTweetId: Long - ): RelatedTweetResponse = { - - val maxNumSamplesPerNeighbor = request.maxNumSamplesPerNeighbor.getOrElse(100) - val maxResults = request.maxResults.getOrElse(200) - val minScore = request.minScore.getOrElse(0.5) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minQueryDegree = request.minQueryDegree.getOrElse(10) - val minCooccurrence = request.minCooccurrence.getOrElse(3) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val queryTweetDegree = bipartiteGraph.getRightNodeDegree(maskedTweetId) - stats.stat("queryTweetDegree").add(queryTweetDegree) - - if (queryTweetDegree < minQueryDegree) { - stats.counter("queryTweetDegreeLessThanMinQueryDegree").incr() - RelatedTweetResponse() - } else { - - val sampledLHSuserIds = - SampleLHSUsersUtil.sampleLHSUsers(maskedTweetId, maxNumSamplesPerNeighbor, bipartiteGraph) - - val rHStweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - sampledLHSuserIds, - bipartiteGraph, - Set(Action.Favorite, Action.Retweet) - ) - - val scorePreFactor = - queryTweetDegree / math.log(queryTweetDegree) / sampledLHSuserIds.distinct.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rHStweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter(relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId))).take(maxResults) - - stats.stat("response_size").add(relatedTweets.size) - RelatedTweetResponse( - tweets = relatedTweets, - queryTweetGraphFeatures = Some(GraphFeaturesForQuery(degree = Some(queryTweetDegree)))) - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/store/BUILD b/src/scala/com/twitter/recos/user_tweet_graph/store/BUILD deleted file mode 100644 index b1c3562b7..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/store/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - 
sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:core", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/socialgraph:thrift-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_tweet_graph/store/UserRecentFollowersStore.scala b/src/scala/com/twitter/recos/user_tweet_graph/store/UserRecentFollowersStore.scala deleted file mode 100644 index 4910e9d71..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/store/UserRecentFollowersStore.scala +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.recos.user_tweet_graph.store - -import com.twitter.simclusters_v2.common.UserId -import com.twitter.socialgraph.thriftscala.EdgesRequest -import com.twitter.socialgraph.thriftscala.EdgesResult -import com.twitter.socialgraph.thriftscala.PageRequest -import com.twitter.socialgraph.thriftscala.RelationshipType -import com.twitter.socialgraph.thriftscala.SrcRelationship -import com.twitter.socialgraph.thriftscala.SocialGraphService -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Time - -class UserRecentFollowersStore( - sgsClient: SocialGraphService.MethodPerEndpoint) - extends ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]] { - - override def get(key: UserRecentFollowersStore.Query): Future[Option[Seq[UserId]]] = { - val edgeRequest = EdgesRequest( - relationship = SrcRelationship(key.userId, RelationshipType.FollowedBy), - // Could have a better guess at count when k.maxAge != None - pageRequest = Some(PageRequest(count = key.maxResults)) - ) - - val lookbackThresholdMillis = key.maxAge - .map(maxAge => (Time.now - maxAge).inMilliseconds) - .getOrElse(0L) - - sgsClient - .edges(Seq(edgeRequest)) - .map(_.flatMap { - case EdgesResult(edges, _, _) => - edges.collect { - case e if e.createdAt >= lookbackThresholdMillis => - e.target - } - }) - .map(Some(_)) - } -} - -object UserRecentFollowersStore { - case class Query( - userId: UserId, - // maxResults - if Some(count), we return only the `count` most recent follows - maxResults: Option[Int] = None, - // maxAge - if Some(duration), return only follows since `Time.now - duration` - maxAge: Option[Duration] = None) -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/BUILD b/src/scala/com/twitter/recos/user_tweet_graph/util/BUILD deleted file mode 100644 index 789b5e3ad..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "snowflake:id", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/scala/com/twitter/recos/util:recos-util", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/recos/user_tweet_graph:user_tweet_graph-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/FetchRHSTweetsUtil.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/FetchRHSTweetsUtil.scala deleted file mode 100644 index dc4ec3020..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/FetchRHSTweetsUtil.scala +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.graphjet.bipartite.MultiSegmentIterator -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import scala.collection.mutable.ListBuffer 
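A usage sketch for the UserRecentFollowersStore defined above. The user id, follower count, and look-back window are hypothetical, and `socialGraphClient` stands for the SocialGraphService client that Main.scala builds earlier in this diff:

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.simclusters_v2.common.UserId
import com.twitter.util.Future

// Hypothetical query: up to 500 followers gained by producer 12345L in the last 7 days.
val store = new UserRecentFollowersStore(socialGraphClient)
val recentFollowers: Future[Option[Seq[UserId]]] =
  store.get(UserRecentFollowersStore.Query(userId = 12345L, maxResults = Some(500), maxAge = Some(7.days)))
```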
-import com.twitter.recos.util.Action - -object FetchRHSTweetsUtil { - // get RHS tweets given LHS users - def fetchRHSTweets( - userIds: Seq[Long], - bipartiteGraph: BipartiteGraph, - allowedActions: Set[Action.Value] - ): Seq[Long] = { - val allowedActionStrings = allowedActions.map(_.toString) - userIds.distinct - .flatMap { userId => - val tweetIdsIterator = bipartiteGraph - .getLeftNodeEdges(userId).asInstanceOf[MultiSegmentIterator[BipartiteGraphSegment]] - - val tweetIds = new ListBuffer[Long]() - if (tweetIdsIterator != null) { - while (tweetIdsIterator.hasNext) { - val rightNode = tweetIdsIterator.nextLong() - val edgeType = tweetIdsIterator.currentEdgeType() - if (allowedActionStrings.contains(UserTweetEdgeTypeMask(edgeType).toString)) - tweetIds += rightNode - } - } - tweetIds.distinct - } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/FilterUtil.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/FilterUtil.scala deleted file mode 100644 index fb5928904..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/FilterUtil.scala +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.util.Duration -import com.twitter.util.Time - -object FilterUtil { - def tweetAgeFilter(tweetId: TweetId, maxAge: Duration): Boolean = { - SnowflakeId - .timeFromIdOpt(tweetId) - .map { tweetTime => tweetTime > Time.now - maxAge }.getOrElse(false) - // If there's no snowflake timestamp, we have no idea when this tweet happened. - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/GetAllInternalTweetIdsUtil.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/GetAllInternalTweetIdsUtil.scala deleted file mode 100644 index 0a5e6ee65..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/GetAllInternalTweetIdsUtil.scala +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.graphjet.bipartite.api.BipartiteGraph - -object GetAllInternalTweetIdsUtil { - - def getAllInternalTweetIds(tweetId: Long, bipartiteGraph: BipartiteGraph): Seq[Long] = { - val internalTweetIds = getAllMasks(tweetId) - sortByDegrees(internalTweetIds, bipartiteGraph) - } - - private def getAllMasks(tweetId: Long): Seq[Long] = { - Seq( - tweetId, - TweetIDMask.summary(tweetId), - TweetIDMask.photo(tweetId), - TweetIDMask.player(tweetId), - TweetIDMask.promotion(tweetId) - ) - } - - private def sortByDegrees( - encodedTweetIds: Seq[Long], - bipartiteGraph: BipartiteGraph - ): Seq[Long] = { - encodedTweetIds - .map { encodedTweetId => (encodedTweetId, bipartiteGraph.getRightNodeDegree(encodedTweetId)) } - .filter { case (_, degree) => degree > 0 } // keep only tweetds with positive degree - .sortBy { case (_, degree) => -degree } // sort by degree in descending order - .map { case (encodedTweetId, _) => encodedTweetId } - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/GetRelatedTweetCandidatesUtil.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/GetRelatedTweetCandidatesUtil.scala deleted file mode 100644 index b093e4c9e..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/GetRelatedTweetCandidatesUtil.scala +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import 
com.twitter.recos.user_tweet_graph.thriftscala._ -import com.twitter.recos.features.tweet.thriftscala.GraphFeaturesForTweet -import com.twitter.graphjet.algorithms.TweetIDMask - -object GetRelatedTweetCandidatesUtil { - private val tweetIDMask = new TweetIDMask - - /** - * calculate scores for each RHS tweet that we get back - * for tweetBasedRelatedTweet, scorePreFactor = queryTweetDegree / log(queryTweetDegree) / LHSuserSize - * and the final score will be a log-cosine score - * for non-tweetBasedRelatedTweet, We don't have a query tweet, to keep scoring function consistent, - * scorePreFactor = 1000.0 / LHSuserSize (queryTweetDegree's average is ~10k, 1000 ~= 10k/log(10k)) - * Though scorePreFactor is applied for all results within a request, it's still useful to make score comparable across requests, - * so we can have a unifed min_score and help with downstream score normalization - * **/ - def getRelatedTweetCandidates( - relatedTweetCandidates: Seq[Long], - minCooccurrence: Int, - minResultDegree: Int, - scorePreFactor: Double, - bipartiteGraph: BipartiteGraph, - ): Seq[RelatedTweet] = { - relatedTweetCandidates - .groupBy(tweetId => tweetId) - .filterKeys(tweetId => bipartiteGraph.getRightNodeDegree(tweetId) > minResultDegree) - .mapValues(_.size) - .filter { case (_, cooccurrence) => cooccurrence >= minCooccurrence } - .toSeq - .map { - case (relatedTweetId, cooccurrence) => - val relatedTweetDegree = bipartiteGraph.getRightNodeDegree(relatedTweetId) - val score = scorePreFactor * cooccurrence / math.log(relatedTweetDegree) - - toRelatedTweet(relatedTweetId, score, relatedTweetDegree, cooccurrence) - } - .sortBy(-_.score) - } - - def toRelatedTweet( - relatedTweetId: Long, - score: Double, - relatedTweetDegree: Int, - cooccurrence: Int - ): RelatedTweet = { - RelatedTweet( - tweetId = tweetIDMask.restore(relatedTweetId), - score = score, - relatedTweetGraphFeatures = Some( - GraphFeaturesForTweet(cooccurrence = Some(cooccurrence), degree = Some(relatedTweetDegree))) - ) - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/SampleLHSUsersUtil.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/SampleLHSUsersUtil.scala deleted file mode 100644 index f265eb9e0..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/SampleLHSUsersUtil.scala +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.graphjet.bipartite.MultiSegmentIterator -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import java.util.Random -import scala.collection.mutable.ListBuffer - -object SampleLHSUsersUtil { - // sample userId nodes - def sampleLHSUsers( - maskedTweetId: Long, - maxNumSamplesPerNeighbor: Int, - bipartiteGraph: BipartiteGraph - ): Seq[Long] = { - val sampledUserIdsIterator = bipartiteGraph - .getRandomRightNodeEdges( - maskedTweetId, - maxNumSamplesPerNeighbor, - new Random(System.currentTimeMillis)).asInstanceOf[MultiSegmentIterator[ - BipartiteGraphSegment - ]] - - val userIds = new ListBuffer[Long]() - if (sampledUserIdsIterator != null) { - while (sampledUserIdsIterator.hasNext) { - val leftNode = sampledUserIdsIterator.nextLong() - // If a user likes too many things, we risk including spammy behavior. 
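To make the scoring comment in GetRelatedTweetCandidatesUtil above concrete, a worked example with hypothetical numbers for the tweet-based case:

```scala
// Hypothetical inputs for one query tweet and one candidate tweet.
val queryTweetDegree = 10000                  // engagements on the query tweet
val sampledLhsUsers  = 100                    // distinct users sampled from those engagements
val scorePreFactor   = queryTweetDegree / math.log(queryTweetDegree) / sampledLhsUsers
// ≈ 10000 / 9.21 / 100 ≈ 10.9

val cooccurrence       = 5                    // sampled users who also engaged with the candidate
val relatedTweetDegree = 1000                 // total engagements on the candidate
val score = scorePreFactor * cooccurrence / math.log(relatedTweetDegree)
// ≈ 10.9 * 5 / 6.91 ≈ 7.9, which is then compared against the endpoint's minScore
```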
- if (bipartiteGraph.getLeftNodeDegree(leftNode) < 100) - userIds += leftNode - } - } - userIds - } -} diff --git a/src/scala/com/twitter/recos/user_tweet_graph/util/UserTweetEdgeTypeMask.scala b/src/scala/com/twitter/recos/user_tweet_graph/util/UserTweetEdgeTypeMask.scala deleted file mode 100644 index 9a55c8e45..000000000 --- a/src/scala/com/twitter/recos/user_tweet_graph/util/UserTweetEdgeTypeMask.scala +++ /dev/null @@ -1,77 +0,0 @@ -package com.twitter.recos.user_tweet_graph.util - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.recos.util.Action - -/** - * The bit mask is used to encode edge types in the top bits of an integer, - * e.g. favorite, retweet, reply and click. Under current segment configuration, each segment - * stores up to 128M edges. Assuming that each node on one side is unique, each segment - * stores up to 128M unique nodes on one side, which occupies the lower 27 bits of an integer. - * This leaves five bits to encode the edge types, which at max can store 32 edge types. - * The following implementation utilizes the top four bits and leaves one free bit out. - */ -class UserTweetEdgeTypeMask extends EdgeTypeMask { - import UserTweetEdgeTypeMask._ - - override def encode(node: Int, edgeType: Byte): Int = { - if (edgeType < 0 || edgeType > SIZE || edgeType == Click.id.toByte) { - throw new IllegalArgumentException("encode: Illegal edge type argument " + edgeType) - } else { - node | (edgeType << 28) - } - } - - override def edgeType(node: Int): Byte = { - (node >>> 28).toByte - } - - override def restore(node: Int): Int = { - node & MASK - } -} - -object UserTweetEdgeTypeMask extends Enumeration { - - type UserTweetEdgeTypeMask = Value - - /** - * Byte values corresponding to the action taken on a tweet, which will be encoded in the - * top 4 bits in a tweet Id - * NOTE: THERE CAN ONLY BE UP TO 16 TYPES - */ - val Click: UserTweetEdgeTypeMask = Value(0) - val Favorite: UserTweetEdgeTypeMask = Value(1) - val Retweet: UserTweetEdgeTypeMask = Value(2) - val Reply: UserTweetEdgeTypeMask = Value(3) - val Tweet: UserTweetEdgeTypeMask = Value(4) - val IsMentioned: UserTweetEdgeTypeMask = Value(5) - val IsMediatagged: UserTweetEdgeTypeMask = Value(6) - val Quote: UserTweetEdgeTypeMask = Value(7) - val Unfavorite: UserTweetEdgeTypeMask = Value(8) - - /** - * Reserve the top four bits of each integer to encode the edge type information. 
- */ - val MASK: Int = Integer.parseInt("00001111111111111111111111111111", 2) - val SIZE: Int = this.values.size - - /** - * Converts the action byte in the RecosHoseMessage into GraphJet internal byte mapping - */ - def actionTypeToEdgeType(actionByte: Byte): Byte = { - val edgeType = Action(actionByte) match { - case Action.Favorite => Favorite.id - case Action.Retweet => Retweet.id - case Action.Reply => Reply.id - case Action.Tweet => Tweet.id - case Action.IsMentioned => IsMentioned.id - case Action.IsMediaTagged => IsMediatagged.id - case Action.Quote => Quote.id - case Action.Unfavorite => Unfavorite.id - case _ => - throw new IllegalArgumentException("getEdgeType: Illegal edge type argument " + actionByte) - } - edgeType.toByte - } -} diff --git a/src/scala/com/twitter/recos/user_user_graph/BUILD b/src/scala/com/twitter/recos/user_user_graph/BUILD deleted file mode 100644 index 12dcbd292..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/BUILD +++ /dev/null @@ -1,45 +0,0 @@ -scala_library( - name = "user_user_graph", - sources = ["*.scala"], - strict_deps = False, - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/google/guava", - "3rdparty/jvm/com/twitter/graphjet", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/org/apache/kafka:rosette-kafka", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/server", - "finagle/finagle-core/src/main", - "finagle/finagle-http/src/main/scala", - "finagle/finagle-memcached/src/main/scala", - "finagle/finagle-stats/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "servo/request/src/main/scala", - "servo/util/src/main/scala", - "src/resources/com/twitter/recos:decider", - "src/scala/com/twitter/recos/decider", - "src/scala/com/twitter/recos/graph_common", - "src/scala/com/twitter/recos/hose/common", - "src/scala/com/twitter/recos/model:recos-model", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos/user_user_graph:user_user_graph-scala", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "twitter-server/slf4j-jdk14/src/main/scala/com/twitter/server/logging", - "util/util-logging/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "bin", - basename = "user_user_graph-server", - main = "com.twitter.recos.user_user_graph.Main", - runtime_platform = "java11", - tags = ["bazel-compatible"], - dependencies = [ - ":user_user_graph", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "twitter-server/slf4j-jdk14/src/main/scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_user_graph/KafkaConfig.scala b/src/scala/com/twitter/recos/user_user_graph/KafkaConfig.scala deleted file mode 100644 index 4ee08df68..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/KafkaConfig.scala +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.recos.user_user_graph - -/** - * The class holds all the config parameters for kafka queue. - */ -object KafkaConfig { - // The size of the RecosHoseMessage array that is written to the concurrently linked queue - // Buffersize of 64 to keep throughput around 64 / (2K edgesPerSec / 150 kafka threads) = 6 seconds, which is lower - // than young gen gc cycle, 20 seconds. So that all the incoming messages will be gced in young gen instead of old gen. 
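Spelling out the sizing arithmetic from the comment above (illustrative numbers; the rates come from that comment, not from measurement):

```scala
val edgesPerSec   = 2000.0                        // assumed total ingest rate
val kafkaThreads  = 150.0                         // reader threads sharing that rate
val perThreadRate = edgesPerSec / kafkaThreads    // ≈ 13.3 messages/sec per thread
val secondsToFill = 64 / perThreadRate            // ≈ 4.8s (the comment above rounds this to ~6s)
// Either way a buffer turns over well within the ~20 second young-gen GC cycle,
// so buffered messages die young instead of being promoted to old gen.
```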
- val bufferSize = 64 - - println("KafkaConfig - bufferSize " + bufferSize) -} diff --git a/src/scala/com/twitter/recos/user_user_graph/LoggingUserUserGraph.scala b/src/scala/com/twitter/recos/user_user_graph/LoggingUserUserGraph.scala deleted file mode 100644 index f8353a975..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/LoggingUserUserGraph.scala +++ /dev/null @@ -1,51 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.logging.Logger -import com.twitter.recos.user_user_graph.thriftscala._ -import com.twitter.util.Future - -trait LoggingUserUserGraph extends thriftscala.UserUserGraph.MethodPerEndpoint { - private[this] val accessLog = Logger("access") - - abstract override def recommendUsers( - request: RecommendUserRequest - ): Future[RecommendUserResponse] = { - val time = System.currentTimeMillis - super.recommendUsers(request) onSuccess { resp => - val timeTaken = System.currentTimeMillis - time - val logText = - s"In ${timeTaken}ms, recommendUsers(${requestToString(request)}), response ${responseToString(resp)}" - accessLog.info(logText) - } onFailure { exc => - val timeTaken = System.currentTimeMillis - time - val logText = s"In ${timeTaken}ms, recommendUsers(${requestToString(request)} returned error" - accessLog.error(exc, logText) - } - } - - private def requestToString(request: RecommendUserRequest): String = { - Seq( - request.requesterId, - request.displayLocation, - request.seedsWithWeights.size, - request.seedsWithWeights.take(5), - request.excludedUserIds.map(_.size).getOrElse(0), - request.excludedUserIds.map(_.take(5)), - request.maxNumResults, - request.maxNumSocialProofs, - request.minUserPerSocialProof, - request.socialProofTypes, - request.maxEdgeEngagementAgeInMillis - ).mkString(",") - } - - private def responseToString(response: RecommendUserResponse): String = { - response.recommendedUsers.toList.map { recUser => - val socialProof = recUser.socialProofs.map { - case (proofType, proofs) => - (proofType, proofs) - } - (recUser.userId, recUser.score, socialProof) - }.toString - } -} diff --git a/src/scala/com/twitter/recos/user_user_graph/Main.scala b/src/scala/com/twitter/recos/user_user_graph/Main.scala deleted file mode 100644 index 55f889c02..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/Main.scala +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.abdecider.ABDeciderFactory -import com.twitter.abdecider.LoggingABDecider -import com.twitter.app.Flag -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.ThriftMux -import com.twitter.finagle.http.HttpMuxer -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.mtls.server.MtlsStackServer._ -import com.twitter.finagle.mux.transport.OpportunisticTls -import com.twitter.finagle.thrift.ClientId -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.finatra.kafka.domain.KafkaGroupId -import com.twitter.finatra.kafka.domain.SeekStrategy -import com.twitter.finatra.kafka.serde.ScalaSerdes -import com.twitter.frigate.common.util.ElfOwlFilter -import com.twitter.frigate.common.util.ElfOwlFilter.ByLdapGroup -import com.twitter.logging._ -import com.twitter.recos.decider.UserUserGraphDecider -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.graph_common.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import 
com.twitter.recos.model.Constants -import com.twitter.recos.user_user_graph.KafkaConfig._ -import com.twitter.recos.user_user_graph.RecosConfig._ -import com.twitter.server.Deciderable -import com.twitter.server.TwitterServer -import com.twitter.server.logging.{Logging => JDK14Logging} -import com.twitter.servo.request._ -import com.twitter.servo.util.ExceptionCounter -import com.twitter.thriftwebforms._ -import com.twitter.util.Await -import com.twitter.util.Duration -import java.net.InetSocketAddress -import java.util.concurrent.TimeUnit -import org.apache.kafka.clients.CommonClientConfigs -import org.apache.kafka.common.config.SaslConfigs -import org.apache.kafka.common.config.SslConfigs -import org.apache.kafka.common.security.auth.SecurityProtocol -import org.apache.kafka.common.serialization.StringDeserializer - -object Main extends TwitterServer with JDK14Logging with Deciderable { - profile => - - val shardId: Flag[Int] = flag("shardId", 0, "Shard ID") - val servicePort: Flag[InetSocketAddress] = - flag("service.port", new InetSocketAddress(10143), "Thrift service port") - val logDir: Flag[String] = flag("logdir", "recos", "Logging directory") - val hoseName: Flag[String] = - flag("hosename", "recos_injector_user_user", "the kafka stream used for incoming edges") - val maxNumSegments: Flag[Int] = - flag("maxNumSegments", graphBuilderConfig.maxNumSegments, "the number of segments in the graph") - val numShards: Flag[Int] = flag("numShards", 1, "Number of shards for this service") - val truststoreLocation: Flag[String] = - flag[String]("truststore_location", "", "Truststore file location") - - val dataCenter: Flag[String] = flag("service.cluster", "atla", "Data Center") - val serviceRole: Flag[String] = flag("service.role", "Service Role") - val serviceEnv: Flag[String] = flag("service.env", "Service Env") - val serviceName: Flag[String] = flag("service.name", "Service Name") - - val statsReceiverWrapper: FinagleStatsReceiverWrapper = FinagleStatsReceiverWrapper( - statsReceiver - ) - - /** - * A ClientRequestAuthorizer to be used in a request-authorization RequestFilter. - */ - lazy val clientAuthorizer: ClientRequestAuthorizer = - ClientRequestAuthorizer.observed( - ClientRequestAuthorizer.permissive, - new ClientRequestObserver(statsReceiver) - ) - - lazy val clientId = ClientId("userusergraph.%s".format(serviceEnv().replace("devel", "dev"))) - - val shutdownTimeout: Flag[Duration] = flag( - "service.shutdownTimeout", - 5.seconds, - "Maximum amount of time to wait for pending requests to complete on shutdown" - ) - - /** - * ExceptionCounter for tracking failures from RequestHandler(s). - */ - lazy val exceptionCounter = new ExceptionCounter(statsReceiver) - - /** - * Function for translating exceptions returned by a RequestHandler. Useful - * for cases where underlying exception types should be wrapped in those - * defined in the project's Thrift IDL. 
- */ - lazy val translateExceptions: PartialFunction[Throwable, Throwable] = { - case t => t - } - - // ********* logging ********** - - lazy val loggingLevel: Level = Level.INFO - lazy val recosLogPath: String = logDir() + "/recos.log" - lazy val graphLogPath: String = logDir() + "/graph.log" - lazy val accessLogPath: String = logDir() + "/access.log" - - override def loggerFactories: List[LoggerFactory] = - List( - LoggerFactory( - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = recosLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "graph", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = graphLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "access", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = accessLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "client_event", - level = Some(loggingLevel), - useParents = false, - handlers = QueueingHandler( - maxQueueSize = 10000, - handler = ScribeHandler( - category = "client_event", - formatter = BareFormatter - ) - ) :: Nil - ) - ) - // ******** Decider ************* - - val recosDecider: UserUserGraphDecider = UserUserGraphDecider() - - // ********* ABdecider ********** - - val abDeciderYmlPath: String = "/usr/local/config/abdecider/abdecider.yml" - - val scribeLogger: Option[Logger] = Some(Logger.get("client_event")) - - val abDecider: LoggingABDecider = - ABDeciderFactory( - abDeciderYmlPath = abDeciderYmlPath, - scribeLogger = scribeLogger, - environment = Some("production") - ).buildWithLogging() - - val ldapGroups = Seq("eng", "cassowary-group", "timeline-team") - - // ********* Recos service ********** - - def main(): Unit = { - log.info("building graph with maxNumSegments = " + profile.maxNumSegments()) - log.info("Reading from: " + hoseName()) - - val graph = NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder( - graphBuilderConfig.copy(maxNumSegments = profile.maxNumSegments()), - statsReceiverWrapper - ) - - val kafkaConfigBuilder = FinagleKafkaConsumerBuilder[String, RecosHoseMessage]() - .dest("/s/kafka/recommendations:kafka-tls") - .groupId(KafkaGroupId(f"user_user_graph-${shardId()}%06d")) - .keyDeserializer(new StringDeserializer) - .valueDeserializer(ScalaSerdes.Thrift[RecosHoseMessage].deserializer) - .seekStrategy(SeekStrategy.REWIND) - .rewindDuration(24.hours) - .withConfig(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_SSL.toString) - .withConfig(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, truststoreLocation()) - .withConfig(SaslConfigs.SASL_MECHANISM, SaslConfigs.GSSAPI_MECHANISM) - .withConfig(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka") - .withConfig(SaslConfigs.SASL_KERBEROS_SERVER_NAME, "kafka") - - val graphWriter = UserUserGraphWriter( - shardId = shardId().toString, - env = serviceEnv(), - hosename = hoseName(), - bufferSize = bufferSize, - kafkaConsumerBuilder = kafkaConfigBuilder, - clientId = clientId.name, - statsReceiver = statsReceiver - ) - graphWriter.initHose(graph) - - val recommendUsersHandler = RecommendUsersHandlerImpl( - graph, - 
Constants.salsaRunnerConfig, - recosDecider, - statsReceiverWrapper - ) - - val recos = new UserUserGraph(recommendUsersHandler) with LoggingUserUserGraph - - // For MutualTLS - val serviceIdentifier = ServiceIdentifier( - role = serviceRole(), - service = serviceName(), - environment = serviceEnv(), - zone = dataCenter() - ) - - val thriftServer = ThriftMux.server - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .serveIface(servicePort(), recos) - - this.addAdminRoute(ElfOwlFilter.getPostbackRoute()) - - val elfowlFilter = ElfOwlFilter( - ByLdapGroup(ldapGroups), - Duration.fromTimeUnit(5, TimeUnit.DAYS) - ) - - log.info(s"ServiceIdentifier = ${serviceIdentifier.toString}") - log.info("clientid: " + clientId.toString) - log.info("servicePort: " + servicePort().toString) - log.info("adding shutdown hook") - onExit { - graphWriter.shutdown() - thriftServer.close(shutdownTimeout().fromNow) - } - log.info("added shutdown hook") - // Wait on the thriftServer so that shutdownTimeout is respected. - Await.result(thriftServer) - } -} diff --git a/src/scala/com/twitter/recos/user_user_graph/README.md b/src/scala/com/twitter/recos/user_user_graph/README.md deleted file mode 100644 index 6412f235c..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/README.md +++ /dev/null @@ -1,17 +0,0 @@ -# UserUserGraph (UUG) - -## What is it -User User Graph (UUG) is a Finagle thrift service built on the GraphJet framework. It maintains a graph of user-user relationships and serves user recommendations based on traversals of this graph. - -## How is it used on Twitter -UUG recommends users to follow based on who the accounts you follow have recently followed. -The core idea behind UUG is collaborative filtering. UUG takes a user's weighted follow graph (i.e. a list of weighted userIds) as input, -performs efficient traversal & aggregation, and returns the top weighted users based on the number of users that engaged with them, as well as -the engaging users' weights. - -UUG is a stateful service and relies on a Kafka stream to ingest & persist state. It maintains an in-memory graph of user engagements from the past -week. Older events are dropped and GC'ed. - -For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose high-performance in-memory storage engine.
-- https://github.com/twitter/GraphJet -- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf diff --git a/src/scala/com/twitter/recos/user_user_graph/RecommendUsersHandler.scala b/src/scala/com/twitter/recos/user_user_graph/RecommendUsersHandler.scala deleted file mode 100644 index fa1978bbb..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/RecommendUsersHandler.scala +++ /dev/null @@ -1,221 +0,0 @@ -package com.twitter.recos.user_user_graph - -import java.util.Random -import com.google.common.collect.Lists -import com.twitter.concurrent.AsyncQueue -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.algorithms.counting.TopSecondDegreeByCountResponse -import com.twitter.graphjet.algorithms.counting.user.TopSecondDegreeByCountForUser -import com.twitter.graphjet.algorithms.counting.user.TopSecondDegreeByCountRequestForUser -import com.twitter.graphjet.algorithms.counting.user.UserRecommendationInfo -import com.twitter.graphjet.algorithms.ConnectingUsersWithMetadata -import com.twitter.graphjet.algorithms.filters._ -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.logging.Logger -import com.twitter.recos.decider.UserUserGraphDecider -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.model.SalsaQueryRunner.SalsaRunnerConfig -import com.twitter.recos.recos_common.thriftscala.UserSocialProofType -import com.twitter.recos.user_user_graph.thriftscala._ -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request.RequestHandler -import com.twitter.util.Future -import com.twitter.util.Try -import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap -import it.unimi.dsi.fastutil.longs.LongOpenHashSet -import scala.collection.JavaConverters._ - -trait RecommendUsersHandler extends RequestHandler[RecommendUserRequest, RecommendUserResponse] - -/** - * Computes user recommendations based on a RecommendUserRequest by using - * TopSecondDegree algorithm in GraphJet. 
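The README above and the handler that follows describe the same second-degree traversal: start from a set of weighted seed users, walk to the accounts they engaged with, and rank candidates by the combined weight of the seeds that reached them. GraphJet's TopSecondDegreeByCount does this incrementally over the live in-memory graph; the sketch below is only a plain-Scala illustration of that counting idea, with made-up data and illustrative names.

```scala
object SecondDegreeCountSketch {
  // Seed user -> weight (the "weighted follow graph" the README describes). Made-up data.
  val seeds: Map[Long, Double] = Map(1L -> 1.0, 2L -> 0.5, 3L -> 0.25)

  // First-degree engagements: seed user -> accounts they recently engaged with. Made-up data.
  val engagements: Map[Long, Seq[Long]] = Map(
    1L -> Seq(100L, 101L),
    2L -> Seq(100L, 102L),
    3L -> Seq(100L, 101L)
  )

  /** Rank second-degree candidates by the summed weight of the seeds that reached them. */
  def topSecondDegree(maxResults: Int): Seq[(Long, Double)] =
    seeds.toSeq
      .flatMap { case (seed, weight) =>
        engagements.getOrElse(seed, Nil).map(candidate => candidate -> weight)
      }
      .groupBy { case (candidate, _) => candidate }
      .map { case (candidate, hits) => candidate -> hits.map { case (_, w) => w }.sum }
      .toSeq
      .sortBy { case (_, score) => -score }
      .take(maxResults)

  def main(args: Array[String]): Unit =
    topSecondDegree(maxResults = 2).foreach { case (id, score) => println(s"candidate=$id score=$score") }
}
```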
- */ -case class RecommendUsersHandlerImpl( - bipartiteGraph: NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph, - salsaRunnerConfig: SalsaRunnerConfig, - decider: UserUserGraphDecider, - statsReceiverWrapper: FinagleStatsReceiverWrapper) - extends RecommendUsersHandler { - - private val log: Logger = Logger(this.getClass.getSimpleName) - private val stats = statsReceiverWrapper.statsReceiver.scope(this.getClass.getSimpleName) - private val failureCounter = stats.counter("failure") - private val recsStat = stats.stat("recs_count") - private val emptyCounter = stats.counter("empty") - private val pollCounter = stats.counter("poll") - private val pollTimeoutCounter = stats.counter("pollTimeout") - private val offerCounter = stats.counter("offer") - private val pollLatencyStat = stats.stat("pollLatency") - private val graphJetQueue = new AsyncQueue[TopSecondDegreeByCountForUser] - (0 until salsaRunnerConfig.numSalsaRunners).foreach { _ => - graphJetQueue.offer( - new TopSecondDegreeByCountForUser( - bipartiteGraph, - salsaRunnerConfig.expectedNodesToHitInSalsa, - statsReceiverWrapper.scope(this.getClass.getSimpleName) - ) - ) - } - - /** - * Given a user_user_graph request, make it conform to GraphJet's request format - */ - private def convertRequestToJava( - request: RecommendUserRequest - ): TopSecondDegreeByCountRequestForUser = { - val queryNode = request.requesterId - val leftSeedNodesWithWeight = new Long2DoubleOpenHashMap( - request.seedsWithWeights.keys.toArray, - request.seedsWithWeights.values.toArray - ) - val toBeFiltered = new LongOpenHashSet(request.excludedUserIds.getOrElse(Nil).toArray) - val maxNumResults = request.maxNumResults.getOrElse(DefaultRequestParams.MaxNumResults) - val maxNumSocialProofs = - request.maxNumSocialProofs.getOrElse(DefaultRequestParams.MaxNumSocialProofs) - val minUserPerSocialProof = convertMinUserPerSocialProofToJava(request.minUserPerSocialProof) - val socialProofTypes = - UserEdgeTypeMask.getUserUserGraphSocialProofTypes(request.socialProofTypes) - val maxRightNodeAgeInMillis = DefaultRequestParams.MaxRightNodeAgeThreshold - val maxEdgeEngagementAgeInMillis = - request.maxEdgeEngagementAgeInMillis.getOrElse(DefaultRequestParams.MaxEdgeAgeThreshold) - val resultFilterChain = new ResultFilterChain( - Lists.newArrayList( - new SocialProofTypesFilter(statsReceiverWrapper), - new RequestedSetFilter(statsReceiverWrapper) - ) - ) - - new TopSecondDegreeByCountRequestForUser( - queryNode, - leftSeedNodesWithWeight, - toBeFiltered, - maxNumResults, - maxNumSocialProofs, - UserEdgeTypeMask.SIZE.toInt, - minUserPerSocialProof, - socialProofTypes, - maxRightNodeAgeInMillis, - maxEdgeEngagementAgeInMillis, - resultFilterChain - ) - } - - /** - * Converts the thrift scala type to the Java equivalent - */ - private def convertMinUserPerSocialProofToJava( - socialProofInScala: Option[scala.collection.Map[UserSocialProofType, Int]] - ): java.util.Map[java.lang.Byte, java.lang.Integer] = { - socialProofInScala - .map { - _.map { - case (key: UserSocialProofType, value: Int) => - (new java.lang.Byte(key.getValue.toByte), new java.lang.Integer(value)) - } - } - .getOrElse(Map.empty[java.lang.Byte, java.lang.Integer]) - .asJava - } - - /** - * Converts a byte-array format of social proofs in Java to its Scala equivalent - */ - private def convertSocialProofsToScala( - socialProofs: java.util.Map[java.lang.Byte, ConnectingUsersWithMetadata] - ): scala.collection.mutable.Map[UserSocialProofType, scala.Seq[Long]] = { - socialProofs.asScala.map { - case 
(socialProofByte, socialProof) => - val proofType = UserSocialProofType(socialProofByte.toByte) - val ids = socialProof.getConnectingUsers.asScala.map(_.toLong) - (proofType, ids) - } - } - - /** - * Converts Java recommendation results to its Scala equivalent - */ - private def convertResponseToScala( - responseOpt: Option[TopSecondDegreeByCountResponse] - ): RecommendUserResponse = { - responseOpt match { - case Some(rawResponse) => - val userSeq = rawResponse.getRankedRecommendations.asScala.toSeq.flatMap { - case userRecs: UserRecommendationInfo => - Some( - RecommendedUser( - userRecs.getRecommendation, - userRecs.getWeight, - convertSocialProofsToScala(userRecs.getSocialProof) - ) - ) - case _ => - None - } - recsStat.add(userSeq.size) - if (userSeq.isEmpty) { - emptyCounter.incr() - } - RecommendUserResponse(userSeq) - case None => - emptyCounter.incr() - RecommendUserResponse(Nil) - } - } - - private def getGraphJetResponse( - graphJet: TopSecondDegreeByCountForUser, - request: TopSecondDegreeByCountRequestForUser, - random: Random - )( - implicit statsReceiver: StatsReceiver - ): Option[TopSecondDegreeByCountResponse] = { - trackBlockStats(stats) { - // compute recs -- need to catch and print exceptions here otherwise they are swallowed - val recAttempt = Try(graphJet.computeRecommendations(request, random)).onFailure { e => - failureCounter.incr() - log.error(e, "GraphJet computation failed") - } - recAttempt.toOption - } - } - - override def apply(request: RecommendUserRequest): Future[RecommendUserResponse] = { - val random = new Random() - val graphJetRequest = convertRequestToJava(request) - pollCounter.incr() - val t0 = System.currentTimeMillis - graphJetQueue.poll().map { graphJetRunner => - val pollTime = System.currentTimeMillis - t0 - pollLatencyStat.add(pollTime) - val response = Try { - if (pollTime < salsaRunnerConfig.timeoutSalsaRunner) { - convertResponseToScala( - getGraphJetResponse( - graphJetRunner, - graphJetRequest, - random - )(statsReceiverWrapper.statsReceiver) - ) - } else { - // if we did not get a runner in time, then fail fast here and immediately put it back - log.warning("GraphJet Queue polling timeout") - pollTimeoutCounter.incr() - throw new RuntimeException("GraphJet poll timeout") - RecommendUserResponse(Nil) - } - } ensure { - graphJetQueue.offer(graphJetRunner) - offerCounter.incr() - } - response.toOption.getOrElse(RecommendUserResponse(Nil)) - } - } - - object DefaultRequestParams { - val MaxNumResults = 100 - val MaxNumSocialProofs = 100 - val MaxRightNodeAgeThreshold: Long = Long.MaxValue - val MaxEdgeAgeThreshold: Long = Long.MaxValue - } -} diff --git a/src/scala/com/twitter/recos/user_user_graph/RecosConfig.scala b/src/scala/com/twitter/recos/user_user_graph/RecosConfig.scala deleted file mode 100644 index 38c17fc5e..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/RecosConfig.scala +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.recos.model.Constants -import com.twitter.recos.graph_common.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraphBuilder.GraphBuilderConfig - -/** - * The class holds all the config parameters for recos graph. 
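The handler above keeps a small pool of TopSecondDegreeByCountForUser runners in an AsyncQueue, borrows one per request, and returns it in an `ensure` block so a failed or timed-out computation cannot leak a runner. Below is a stripped-down sketch of that borrow/return pattern; it assumes Twitter's util-core is on the classpath, and `Runner`/`RunnerPoolSketch` are illustrative stand-ins for the GraphJet runner, not code from the deleted handler.

```scala
import com.twitter.concurrent.AsyncQueue
import com.twitter.util.{Await, Future}

object RunnerPoolSketch {
  // Stand-in for the GraphJet runner; the real pool holds TopSecondDegreeByCountForUser instances.
  final class Runner(val id: Int) {
    def compute(input: String): String = s"runner-$id handled $input"
  }

  private val pool = new AsyncQueue[Runner]
  (0 until 3).foreach(i => pool.offer(new Runner(i)))

  /** Borrow a runner, use it, and always hand it back, even if the work throws. */
  def withRunner[T](work: Runner => T): Future[T] =
    pool.poll().map { runner =>
      try work(runner)
      finally pool.offer(runner) // mirrors the `ensure { graphJetQueue.offer(...) }` above
    }

  def main(args: Array[String]): Unit =
    println(Await.result(withRunner(_.compute("request-42"))))
}
```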
- */ -object RecosConfig { - val maxNumSegments: Int = 5 - val maxNumEdgesPerSegment: Int = 1 << 26 // 64M edges per segment - val expectedNumLeftNodes: Int = 1 << 24 // should correspond to 16M nodes storage - val expectedMaxLeftDegree: Int = 64 - val leftPowerLawExponent: Double = 16.0 // steep power law as most nodes will have a small degree - val expectedNumRightNodes: Int = 1 << 24 // 16M nodes - val numRightNodeMetadataTypes = 1 // UUG does not have node metadata - - val graphBuilderConfig = GraphBuilderConfig( - maxNumSegments = maxNumSegments, - maxNumEdgesPerSegment = maxNumEdgesPerSegment, - expectedNumLeftNodes = expectedNumLeftNodes, - expectedMaxLeftDegree = expectedMaxLeftDegree, - leftPowerLawExponent = leftPowerLawExponent, - expectedNumRightNodes = expectedNumRightNodes, - numRightNodeMetadataTypes = numRightNodeMetadataTypes, - edgeTypeMask = new UserEdgeTypeMask() - ) - - println("RecosConfig - maxNumSegments " + maxNumSegments) - println("RecosConfig - maxNumEdgesPerSegment " + maxNumEdgesPerSegment) - println("RecosConfig - expectedNumLeftNodes " + expectedNumLeftNodes) - println("RecosConfig - expectedMaxLeftDegree " + expectedMaxLeftDegree) - println("RecosConfig - leftPowerLawExponent " + leftPowerLawExponent) - println("RecosConfig - expectedNumRightNodes " + expectedNumRightNodes) - println("RecosConfig - numRightNodeMetadataTypes " + numRightNodeMetadataTypes) - println("RecosConfig - salsaRunnerConfig " + Constants.salsaRunnerConfig) -} diff --git a/src/scala/com/twitter/recos/user_user_graph/UserEdgeTypeMask.scala b/src/scala/com/twitter/recos/user_user_graph/UserEdgeTypeMask.scala deleted file mode 100644 index ac29bebf2..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/UserEdgeTypeMask.scala +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.recos.recos_common.thriftscala.UserSocialProofType - -/** - * The bit mask is used to encode edge types in the top bits of an integer, - * e.g. Follow, Mention, and Mediatag. Under current segment configuration, each segment - * stores up to 128M edges. Assuming that each node on one side is unique, each segment - * stores up to 128M unique nodes on one side, which occupies the lower 27 bits of an integer. - * This leaves five bits to encode the edge types, which at max can store 32 edge types. - * The following implementation utilizes the top four bits and leaves one free bit out. - */ -class UserEdgeTypeMask extends EdgeTypeMask { - import UserEdgeTypeMask._ - override def encode(node: Int, edgeType: Byte): Int = { - require( - edgeType == FOLLOW || edgeType == MENTION || edgeType == MEDIATAG, - s"encode: Illegal edge type argument $edgeType") - node | EDGEARRAY(edgeType) - } - - override def edgeType(node: Int): Byte = { - (node >> 28).toByte - } - - override def restore(node: Int): Int = { - node & MASK - } -} - -object UserEdgeTypeMask { - - /** - * Reserve the top four bits of each integer to encode the edge type information. 
- */ - val MASK: Int = - Integer.parseInt("00001111111111111111111111111111", 2) - val FOLLOW: Byte = 0 - val MENTION: Byte = 1 - val MEDIATAG: Byte = 2 - val SIZE: Byte = 3 - val UNUSED3: Byte = 3 - val UNUSED4: Byte = 4 - val UNUSED5: Byte = 5 - val UNUSED6: Byte = 6 - val UNUSED7: Byte = 7 - val UNUSED8: Byte = 8 - val UNUSED9: Byte = 9 - val UNUSED10: Byte = 10 - val UNUSED11: Byte = 11 - val UNUSED12: Byte = 12 - val UNUSED13: Byte = 13 - val UNUSED14: Byte = 14 - val UNUSED15: Byte = 15 - val EDGEARRAY: Array[Int] = Array( - 0, - 1 << 28, - 2 << 28, - 3 << 28, - 4 << 28, - 5 << 28, - 6 << 28, - 7 << 28, - 8 << 28, - 9 << 28, - 10 << 28, - 11 << 28, - 12 << 28, - 13 << 28, - 14 << 28, - 15 << 28 - ) - - /** - * Map valid social proof types specified by clients to an array of bytes. If clients do not - * specify any social proof types in thrift, it will return all available social types by - * default. - * - * @param socialProofTypes are the valid socialProofTypes specified by clients - * @return an array of bytes representing valid social proof types - */ - def getUserUserGraphSocialProofTypes( - socialProofTypes: Option[Seq[UserSocialProofType]] - ): Array[Byte] = { - socialProofTypes - .map { _.map { _.getValue }.toArray } - .getOrElse((0 until SIZE).toArray) - .map { _.toByte } - } -} diff --git a/src/scala/com/twitter/recos/user_user_graph/UserUserGraph.scala b/src/scala/com/twitter/recos/user_user_graph/UserUserGraph.scala deleted file mode 100644 index 128597f90..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/UserUserGraph.scala +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.finagle.thrift.ClientId -import com.twitter.finagle.tracing.{Trace, TraceId} -import com.twitter.recos.user_user_graph.thriftscala._ -import com.twitter.util.Future - -object UserUserGraph { - def traceId: TraceId = Trace.id - def clientId: Option[ClientId] = ClientId.current -} - -class UserUserGraph(recommendUsersHandler: RecommendUsersHandler) - extends thriftscala.UserUserGraph.MethodPerEndpoint { - - override def recommendUsers(request: RecommendUserRequest): Future[RecommendUserResponse] = - recommendUsersHandler(request) -} diff --git a/src/scala/com/twitter/recos/user_user_graph/UserUserGraphWriter.scala b/src/scala/com/twitter/recos/user_user_graph/UserUserGraphWriter.scala deleted file mode 100644 index 637f29717..000000000 --- a/src/scala/com/twitter/recos/user_user_graph/UserUserGraphWriter.scala +++ /dev/null @@ -1,83 +0,0 @@ -package com.twitter.recos.user_user_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.bipartite.NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph -import com.twitter.graphjet.bipartite.segment.NodeMetadataLeftIndexedBipartiteGraphSegment -import com.twitter.recos.hose.common.UnifiedGraphWriter -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.util.Action - -case class UserUserGraphWriter( - shardId: String, - env: String, - hosename: String, - bufferSize: Int, - kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage], - clientId: String, - statsReceiver: StatsReceiver) - extends UnifiedGraphWriter[ - NodeMetadataLeftIndexedBipartiteGraphSegment, - NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph - ] { - - // The max throughput for each kafka consumer is around 25MB/s - // Use 3 processors for 75MB/s catch-up speed. 
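One detail worth noting in `getUserUserGraphSocialProofTypes` above: when the caller does not restrict social proof types, the helper falls back to every known edge type. The sketch below mirrors that Option handling with a stand-in case class, since the generated thrift enum `UserSocialProofType` is not reproduced here; all names in it are illustrative.

```scala
object SocialProofTypesSketch {
  // Stand-in for the generated thrift enum; only the numeric value matters for the mapping.
  final case class ProofType(value: Int)

  val Follow = ProofType(0)
  val Mention = ProofType(1)
  val Mediatag = ProofType(2)
  val Size = 3 // mirrors UserEdgeTypeMask.SIZE above

  /** Requested types if present, otherwise every known type, as bytes. */
  def toBytes(requested: Option[Seq[ProofType]]): Array[Byte] =
    requested
      .map(_.map(_.value).toArray)
      .getOrElse((0 until Size).toArray)
      .map(_.toByte)

  def main(args: Array[String]): Unit = {
    println(toBytes(Some(Seq(Follow, Mediatag))).mkString(",")) // 0,2
    println(toBytes(None).mkString(","))                        // 0,1,2
  }
}
```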
- val consumerNum: Int = 3 - // Leave 2 Segments for live writer - val catchupWriterNum: Int = RecosConfig.maxNumSegments - 2 - - import UserUserGraphWriter._ - - private def getEdgeType(action: Byte): Byte = { - if (action == Action.Follow.id) { - UserEdgeTypeMask.FOLLOW - } else if (action == Action.Mention.id) { - UserEdgeTypeMask.MENTION - } else if (action == Action.MediaTag.id) { - UserEdgeTypeMask.MEDIATAG - } else { - throw new IllegalArgumentException("getEdgeType: Illegal edge type argument " + action) - } - } - - /** - * Adds a RecosHoseMessage to the graph. used by live writer to insert edges to the - * current segment - */ - override def addEdgeToGraph( - graph: NodeMetadataLeftIndexedPowerLawMultiSegmentBipartiteGraph, - recosHoseMessage: RecosHoseMessage - ): Unit = { - graph.addEdge( - recosHoseMessage.leftId, - recosHoseMessage.rightId, - getEdgeType(recosHoseMessage.action), - recosHoseMessage.edgeMetadata.getOrElse(0L), - EMTPY_NODE_METADATA, - EMTPY_NODE_METADATA - ) - } - - /** - * Adds a RecosHoseMessage to the given segment in the graph. Used by catch up writers to - * insert edges to non-current (old) segments - */ - override def addEdgeToSegment( - segment: NodeMetadataLeftIndexedBipartiteGraphSegment, - recosHoseMessage: RecosHoseMessage - ): Unit = { - segment.addEdge( - recosHoseMessage.leftId, - recosHoseMessage.rightId, - getEdgeType(recosHoseMessage.action), - recosHoseMessage.edgeMetadata.getOrElse(0L), - EMTPY_NODE_METADATA, - EMTPY_NODE_METADATA - ) - } -} - -private object UserUserGraphWriter { - final val EMTPY_NODE_METADATA = new Array[Array[Int]](1) -} diff --git a/src/scala/com/twitter/recos/user_video_graph/BUILD b/src/scala/com/twitter/recos/user_video_graph/BUILD deleted file mode 100644 index f85d7ba96..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/BUILD +++ /dev/null @@ -1,69 +0,0 @@ -scala_library( - name = "user-video-graph", - sources = ["*.scala"], - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "3rdparty/jvm/cascading:cascading-local", - "3rdparty/jvm/com/backtype:dfs-datastores", - "3rdparty/jvm/com/fasterxml/jackson/module:jackson-module-scala", - "3rdparty/jvm/com/google/inject:guice", - "3rdparty/jvm/com/netflix/curator:curator-framework", - "3rdparty/jvm/com/twitter/graphjet", - "3rdparty/jvm/io/netty:netty4-tcnative-boringssl-static", - "3rdparty/jvm/it/unimi/dsi:fastutil", - "3rdparty/jvm/org/apache/hadoop:hadoop-client-default", - "3rdparty/jvm/org/apache/kafka:rosette-kafka", - "3rdparty/jvm/org/apache/thrift:libthrift", - "abdecider/src/main/scala", - "decider/src/main/scala", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication", - "finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/server", - "finagle/finagle-core/src/main", - "finagle/finagle-http/src/main/scala", - "finagle/finagle-memcached/src/main/scala", - "finagle/finagle-stats/src/main/scala", - "finagle/finagle-thriftmux/src/main/scala", - "frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util", - "scrooge/scrooge-core/src/main/scala", - "servo/repo/src/main/scala", - "servo/request/src/main/scala", - "servo/util/src/main/scala", - "src/resources/com/twitter/recos:decider", - "src/scala/com/twitter/recos/decider", - "src/scala/com/twitter/recos/graph_common", - "src/scala/com/twitter/recos/hose/common", - "src/scala/com/twitter/recos/model:recos-model", - "src/scala/com/twitter/recos/serviceapi", - "src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers", - 
"src/scala/com/twitter/recos/user_video_graph/store", - "src/scala/com/twitter/recos/util:recos-util", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/recos:recos-common-scala", - "src/thrift/com/twitter/recos:recos-internal-scala", - "src/thrift/com/twitter/recos/user_video_graph:user_video_graph-scala", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms", - "thrift-web-forms/src/main/scala/com/twitter/thriftwebforms/model", - "twitter-server-internal/src/main/scala", - "twitter-server/server/src/main/scala", - "twitter-server/slf4j-jdk14/src/main/scala/com/twitter/server/logging", - "util/util-app/src/main/scala", - "util/util-hashing/src/main/scala", - "util/util-stats/src/main/scala", - ], -) - -jvm_binary( - name = "bin", - basename = "user-video-graph-server", - main = "com.twitter.recos.user_video_graph.Main", - runtime_platform = "java11", - tags = ["known-to-fail-jira:SD-20771"], - dependencies = [ - ":user-video-graph", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "twitter-server/slf4j-jdk14/src/main/scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_video_graph/LoggingUserVideoGraph.scala b/src/scala/com/twitter/recos/user_video_graph/LoggingUserVideoGraph.scala deleted file mode 100644 index b7747596c..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/LoggingUserVideoGraph.scala +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.finagle.tracing.Trace -import com.twitter.logging.Logger -import com.twitter.recos.recos_common.thriftscala._ -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.util.Future - -trait LoggingUserVideoGraph extends thriftscala.UserVideoGraph.MethodPerEndpoint { - private[this] val accessLog = Logger("access") - -} diff --git a/src/scala/com/twitter/recos/user_video_graph/Main.scala b/src/scala/com/twitter/recos/user_video_graph/Main.scala deleted file mode 100644 index 96b2a6218..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/Main.scala +++ /dev/null @@ -1,294 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.abdecider.ABDeciderFactory -import com.twitter.abdecider.LoggingABDecider -import com.twitter.app.Flag -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.ThriftMux -import com.twitter.finagle.http.HttpMuxer -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.mtls.client.MtlsStackClient.MtlsThriftMuxClientSyntax -import com.twitter.finagle.mtls.server.MtlsStackServer._ -import com.twitter.finagle.mux.ClientDiscardedRequestException -import com.twitter.finagle.mux.transport.OpportunisticTls -import com.twitter.finagle.service.ReqRep -import com.twitter.finagle.service.ResponseClass -import com.twitter.finagle.thrift.ClientId -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.finatra.kafka.domain.KafkaGroupId -import com.twitter.finatra.kafka.domain.SeekStrategy -import com.twitter.finatra.kafka.serde.ScalaSerdes -import com.twitter.frigate.common.util.ElfOwlFilter -import com.twitter.frigate.common.util.ElfOwlFilter.ByLdapGroup -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph -import com.twitter.logging._ -import com.twitter.recos.decider.EndpointLoadShedder -import com.twitter.recos.decider.UserTweetGraphDecider -import com.twitter.recos.graph_common.FinagleStatsReceiverWrapper -import com.twitter.recos.graph_common.MultiSegmentPowerLawBipartiteGraphBuilder 
-import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.user_video_graph.RecosConfig._ -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ConsumersBasedRelatedTweetsHandler -import com.twitter.recos.user_video_graph.relatedTweetHandlers.TweetBasedRelatedTweetsHandler -import com.twitter.recos.user_video_graph.relatedTweetHandlers.ProducerBasedRelatedTweetsHandler -import com.twitter.recos.user_video_graph.store.UserRecentFollowersStore -import com.twitter.server.Deciderable -import com.twitter.server.TwitterServer -import com.twitter.server.logging.{Logging => JDK14Logging} -import com.twitter.servo.request._ -import com.twitter.servo.util.ExceptionCounter -import com.twitter.simclusters_v2.common.UserId -import com.twitter.socialgraph.thriftscala.SocialGraphService -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Await -import com.twitter.util.Duration -import com.twitter.util.JavaTimer -import com.twitter.util.Throw -import com.twitter.util.Timer -import java.net.InetSocketAddress -import java.util.concurrent.TimeUnit -import org.apache.kafka.clients.CommonClientConfigs -import org.apache.kafka.common.config.SaslConfigs -import org.apache.kafka.common.config.SslConfigs -import org.apache.kafka.common.security.auth.SecurityProtocol -import org.apache.kafka.common.serialization.StringDeserializer -import scala.reflect.ClassTag - -object Main extends TwitterServer with JDK14Logging with Deciderable { - profile => - - val shardId: Flag[Int] = flag("shardId", 0, "Shard ID") - val servicePort: Flag[InetSocketAddress] = - flag("service.port", new InetSocketAddress(10143), "Thrift service port") - val logDir: Flag[String] = flag("logdir", "recos", "Logging directory") - val numShards: Flag[Int] = flag("numShards", 1, "Number of shards for this service") - val truststoreLocation: Flag[String] = - flag[String]("truststore_location", "", "Truststore file location") - val hoseName: Flag[String] = - flag("hosename", "recos_injector_user_user", "the kafka stream used for incoming edges") - - val dataCenter: Flag[String] = flag("service.cluster", "atla", "Data Center") - val serviceRole: Flag[String] = flag("service.role", "Service Role") - val serviceEnv: Flag[String] = flag("service.env", "Service Env") - val serviceName: Flag[String] = flag("service.name", "Service Name") - - private val maxNumSegments = - flag("maxNumSegments", graphBuilderConfig.maxNumSegments, "the number of segments in the graph") - - private val statsReceiverWrapper = FinagleStatsReceiverWrapper(statsReceiver) - - /** - * A ClientRequestAuthorizer to be used in a request-authorization RequestFilter. 
- */ - lazy val clientAuthorizer: ClientRequestAuthorizer = - ClientRequestAuthorizer.observed( - ClientRequestAuthorizer.permissive, - new ClientRequestObserver(statsReceiver) - ) - - lazy val clientId = ClientId(s"usertweetgraph.${serviceEnv()}") - - private def makeThriftClient[ThriftServiceType: ClassTag]( - dest: String, - label: String, - serviceIdentifier: ServiceIdentifier, - requestTimeout: Duration = 100.milliseconds - ): ThriftServiceType = { - ThriftMux.client - .withClientId(ClientId("usertweetgraph.prod")) - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .withRequestTimeout(requestTimeout) - .withStatsReceiver(statsReceiver.scope("clnt")) - .withResponseClassifier { - case ReqRep(_, Throw(_: ClientDiscardedRequestException)) => ResponseClass.Ignorable - }.build[ThriftServiceType](dest, label) - } - - private val shutdownTimeout = flag( - "service.shutdownTimeout", - 5.seconds, - "Maximum amount of time to wait for pending requests to complete on shutdown" - ) - - /** - * ExceptionCounter for tracking failures from RequestHandler(s). - */ - lazy val exceptionCounter = new ExceptionCounter(statsReceiver) - - /** - * Function for translating exceptions returned by a RequestHandler. Useful - * for cases where underlying exception types should be wrapped in those - * defined in the project's Thrift IDL. - */ - lazy val translateExceptions: PartialFunction[Throwable, Throwable] = { - case t => t - } - - val DefaultLdapAccessGroup: Seq[String] = Seq("eng", "cassowary-group", "timeline-team") - - // ********* logging ********** - - lazy val loggingLevel: Level = Level.INFO - lazy val recosLogPath: String = logDir() + "/recos.log" - lazy val graphLogPath: String = logDir() + "/graph.log" - lazy val accessLogPath: String = logDir() + "/access.log" - - override def loggerFactories: List[LoggerFactory] = - List( - LoggerFactory( - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = recosLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "graph", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = graphLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "access", - useParents = false, - level = Some(loggingLevel), - handlers = QueueingHandler( - handler = FileHandler( - filename = accessLogPath, - level = Some(loggingLevel), - rollPolicy = Policy.Hourly, - rotateCount = 6, - formatter = new Formatter - ) - ) :: Nil - ), - LoggerFactory( - node = "client_event", - level = Some(loggingLevel), - useParents = false, - handlers = QueueingHandler( - maxQueueSize = 10000, - handler = ScribeHandler( - category = "client_event", - formatter = BareFormatter - ) - ) :: Nil - ) - ) - // ******** Decider ************* - - // ********* ABdecider ********** - - val abDeciderYmlPath: String = "/usr/local/config/abdecider/abdecider.yml" - - val scribeLogger: Option[Logger] = Some(Logger.get("client_event")) - - val abDecider: LoggingABDecider = - ABDeciderFactory( - abDeciderYmlPath = abDeciderYmlPath, - scribeLogger = scribeLogger, - environment = Some("production") - ).buildWithLogging() - - // ********* Recos service ********** - - def main(): Unit = { - log.info("building graph with maxNumSegments = " + profile.maxNumSegments()) - - 
implicit val timer: Timer = new JavaTimer(true) - - val graph = MultiSegmentPowerLawBipartiteGraphBuilder( - graphBuilderConfig.copy(maxNumSegments = profile.maxNumSegments()), - statsReceiverWrapper - ) - - val kafkaConfigBuilder = FinagleKafkaConsumerBuilder[String, RecosHoseMessage]() - .dest("/s/kafka/recommendations:kafka-tls") - .groupId(KafkaGroupId(f"user_video_graph-${shardId()}%06d")) - .keyDeserializer(new StringDeserializer) - .valueDeserializer(ScalaSerdes.Thrift[RecosHoseMessage].deserializer) - .seekStrategy(SeekStrategy.REWIND) - .rewindDuration(48.hours) - .withConfig(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_SSL.toString) - .withConfig(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, truststoreLocation()) - .withConfig(SaslConfigs.SASL_MECHANISM, SaslConfigs.GSSAPI_MECHANISM) - .withConfig(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka") - .withConfig(SaslConfigs.SASL_KERBEROS_SERVER_NAME, "kafka") - - val graphWriter = - UserVideoGraphWriter( - shardId().toString, - serviceEnv(), - hoseName(), - 128, // keep the original setting. - kafkaConfigBuilder, - clientId.name, - statsReceiver, - ) - graphWriter.initHose(graph) - - // For MutualTLS - val serviceIdentifier = ServiceIdentifier( - role = serviceRole(), - service = serviceName(), - environment = serviceEnv(), - zone = dataCenter() - ) - log.info(s"ServiceIdentifier = ${serviceIdentifier.toString}") - - val socialGraphClient: SocialGraphService.MethodPerEndpoint = - makeThriftClient[SocialGraphService.MethodPerEndpoint]( - "/s/socialgraph/socialgraph", - "socialgraph", - serviceIdentifier) - val userRecentFollowersStore: ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]] = - new UserRecentFollowersStore(socialGraphClient) - - val tweetBasedRelatedTweetsHandler = new TweetBasedRelatedTweetsHandler(graph, statsReceiver) - val consumersBasedRelatedTweetsHandler = - new ConsumersBasedRelatedTweetsHandler(graph, statsReceiver) - val producerBasedRelatedTweetsHandler = - new ProducerBasedRelatedTweetsHandler(graph, userRecentFollowersStore, statsReceiver) - - val decider = UserTweetGraphDecider(serviceEnv(), dataCenter()) - val endpointLoadShedder = new EndpointLoadShedder(decider) - val userVideoGraph = - new UserVideoGraph( - tweetBasedRelatedTweetsHandler, - producerBasedRelatedTweetsHandler, - consumersBasedRelatedTweetsHandler, - endpointLoadShedder)(timer) with LoggingUserVideoGraph - - val thriftServer = ThriftMux.server - .withOpportunisticTls(OpportunisticTls.Required) - .withMutualTls(serviceIdentifier) - .serveIface(servicePort(), userVideoGraph) - - log.info("clientid: " + clientId.toString) - log.info("servicePort: " + servicePort().toString) - - log.info("adding shutdown hook") - onExit { - graphWriter.shutdown() - thriftServer.close(shutdownTimeout().fromNow) - } - log.info("added shutdown hook") - - // Wait on the thriftServer so that shutdownTimeout is respected. - Await.result(thriftServer) - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/README.md b/src/scala/com/twitter/recos/user_video_graph/README.md deleted file mode 100644 index 71de5deef..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/README.md +++ /dev/null @@ -1,14 +0,0 @@ -# UserVideoGraph (UVG) - -## What is it -User Video Graph (UVG) is a Finagle thrift service built on the GraphJet framework. It maintains a graph of user-video engagements and serves user recommendations based on traversals of this graph.
- -## How is it used on Twitter -UVG generates video recommendations from a given seed tweet set. It recommends tweets based on collaborative filtering & random walks. - -UVG is a stateful service and relies on a Kafka stream to ingest & persist state. The Kafka stream is processed and generated by Recos-Injector. -It maintains an in-memory graph of user engagements from the past 24-48 hours. Older events are dropped and GC'ed. - -For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose high-performance in-memory storage engine. -- https://github.com/twitter/GraphJet -- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf diff --git a/src/scala/com/twitter/recos/user_video_graph/UserVideoEdgeTypeMask.scala b/src/scala/com/twitter/recos/user_video_graph/UserVideoEdgeTypeMask.scala deleted file mode 100644 index 9a6c577d2..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/UserVideoEdgeTypeMask.scala +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.graphjet.bipartite.api.EdgeTypeMask -import com.twitter.recos.util.Action - -/** - * The bit mask is used to encode edge types in the top bits of an integer, - * e.g. video playback. Under current segment configuration, each segment - * stores up to 128M edges. Assuming that each node on one side is unique, each segment - * stores up to 128M unique nodes on one side, which occupies the lower 27 bits of an integer. - * This leaves five bits to encode the edge types, which at max can store 32 edge types. - * The following implementation utilizes the top four bits and leaves one free bit out. - */ -class UserVideoEdgeTypeMask extends EdgeTypeMask { - import UserVideoEdgeTypeMask._ - - override def encode(node: Int, edgeType: Byte): Int = { - if (edgeType < 0 || edgeType > SIZE) { - throw new IllegalArgumentException("encode: Illegal edge type argument " + edgeType) - } else { - node | (edgeType << 28) - } - } - - override def edgeType(node: Int): Byte = { - (node >>> 28).toByte - } - - override def restore(node: Int): Int = { - node & MASK - } - } - } - -object UserVideoEdgeTypeMask extends Enumeration { - - type UserTweetEdgeTypeMask = Value - - /** - * Byte values corresponding to the action taken on a tweet, which will be encoded in the - * top 4 bits in a tweet Id - * NOTE: THERE CAN ONLY BE UP TO 16 TYPES - */ - val VideoPlayback50: UserTweetEdgeTypeMask = Value(1) - - /** - * Reserve the top four bits of each integer to encode the edge type information.
- */ - val MASK: Int = Integer.parseInt("00001111111111111111111111111111", 2) - val SIZE: Int = this.values.size - - /** - * Converts the action byte in the RecosHoseMessage into GraphJet internal byte mapping - */ - def actionTypeToEdgeType(actionByte: Byte): Byte = { - val edgeType = Action(actionByte) match { - case Action.VideoPlayback50 => VideoPlayback50.id - case _ => - throw new IllegalArgumentException("getEdgeType: Illegal edge type argument " + actionByte) - } - edgeType.toByte - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraph.scala b/src/scala/com/twitter/recos/user_video_graph/UserVideoGraph.scala deleted file mode 100644 index f22486ef3..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraph.scala +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.finagle.thrift.ClientId -import com.twitter.finagle.tracing.Trace -import com.twitter.finagle.tracing.TraceId -import com.twitter.recos.decider.EndpointLoadShedder -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Timer -import scala.concurrent.duration.MILLISECONDS -import com.twitter.logging.Logger -import com.twitter.recos.user_tweet_graph.relatedTweetHandlers.ConsumersBasedRelatedTweetsHandler -import com.twitter.recos.user_video_graph.relatedTweetHandlers.ProducerBasedRelatedTweetsHandler -import com.twitter.recos.user_video_graph.relatedTweetHandlers.TweetBasedRelatedTweetsHandler - -object UserVideoGraph { - def traceId: TraceId = Trace.id - def clientId: Option[ClientId] = ClientId.current -} - -class UserVideoGraph( - tweetBasedRelatedTweetsHandler: TweetBasedRelatedTweetsHandler, - producerBasedRelatedTweetsHandler: ProducerBasedRelatedTweetsHandler, - consumersBasedRelatedTweetsHandler: ConsumersBasedRelatedTweetsHandler, - endpointLoadShedder: EndpointLoadShedder -)( - implicit timer: Timer) - extends thriftscala.UserVideoGraph.MethodPerEndpoint { - - private val defaultTimeout: Duration = Duration(50, MILLISECONDS) - private val EmptyResponse = Future.value(RelatedTweetResponse()) - private val log = Logger() - - override def tweetBasedRelatedTweets( - request: TweetBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("videoGraphTweetBasedRelatedTweets") { - tweetBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-video-graph_tweetBasedRelatedTweets" + e) - EmptyResponse - } - - override def producerBasedRelatedTweets( - request: ProducerBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("producerBasedRelatedTweets") { - producerBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-video-graph_producerBasedRelatedTweets" + e) - EmptyResponse - } - - override def consumersBasedRelatedTweets( - request: ConsumersBasedRelatedTweetRequest - ): Future[RelatedTweetResponse] = - endpointLoadShedder("consumersBasedRelatedTweets") { - consumersBasedRelatedTweetsHandler(request).raiseWithin(defaultTimeout) - }.rescue { - case EndpointLoadShedder.LoadSheddingException => - EmptyResponse - case e => - log.info("user-video-graph_consumersBasedRelatedTweets" + e) - EmptyResponse - } -} diff --git 
a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphConfig.scala b/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphConfig.scala deleted file mode 100644 index c99280133..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphConfig.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.recos.graph_common.MultiSegmentPowerLawBipartiteGraphBuilder.GraphBuilderConfig - -/** - * The class holds all the config parameters for recos graph. - */ -object RecosConfig { - val maxNumSegments: Int = 8 - val maxNumEdgesPerSegment: Int = - (1 << 28) // 268M edges per segment, should be able to include 2 days' data - val expectedNumLeftNodes: Int = - (1 << 26) // should correspond to 67M nodes storage - val expectedMaxLeftDegree: Int = 64 - val leftPowerLawExponent: Double = 16.0 // steep power law as most nodes will have a small degree - val expectedNumRightNodes: Int = (1 << 26) // 67M nodes - val expectedMaxRightDegree: Int = scala.math.pow(1024, 2).toInt // some nodes will be very popular - val rightPowerLawExponent: Double = 4.0 // this will be less steep - - val graphBuilderConfig = GraphBuilderConfig( - maxNumSegments = maxNumSegments, - maxNumEdgesPerSegment = maxNumEdgesPerSegment, - expectedNumLeftNodes = expectedNumLeftNodes, - expectedMaxLeftDegree = expectedMaxLeftDegree, - leftPowerLawExponent = leftPowerLawExponent, - expectedNumRightNodes = expectedNumRightNodes, - expectedMaxRightDegree = expectedMaxRightDegree, - rightPowerLawExponent = rightPowerLawExponent - ) - - println("RecosConfig - maxNumSegments " + maxNumSegments) - println("RecosConfig - maxNumEdgesPerSegment " + maxNumEdgesPerSegment) - println("RecosConfig - expectedNumLeftNodes " + expectedNumLeftNodes) - println("RecosConfig - expectedMaxLeftDegree " + expectedMaxLeftDegree) - println("RecosConfig - leftPowerLawExponent " + leftPowerLawExponent) - println("RecosConfig - expectedNumRightNodes " + expectedNumRightNodes) - println("RecosConfig - expectedMaxRightDegree " + expectedMaxRightDegree) - println("RecosConfig - rightPowerLawExponent " + rightPowerLawExponent) -} diff --git a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphEdgeHttpHandler.scala b/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphEdgeHttpHandler.scala deleted file mode 100644 index b2464016c..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphEdgeHttpHandler.scala +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.finagle.Service -import com.twitter.finagle.http.Request -import com.twitter.finagle.http.Response -import com.twitter.finagle.http.Status -import com.twitter.finagle.http.Version -import com.twitter.frigate.common.util.HTMLUtil -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import com.twitter.graphjet.bipartite.MultiSegmentIterator -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph -import com.twitter.logging.Logger -import com.twitter.util.Future -import java.util.Random -import scala.collection.mutable.ListBuffer - -class UserTweetGraphEdgeHttpHandler(graph: MultiSegmentPowerLawBipartiteGraph) - extends Service[Request, Response] { - private val log = Logger("UserTweetGraphEdgeHttpHandler") - private val tweetIDMask = new TweetIDMask() - - def getCardInfo(rightNode: Long): String = { - val bits: Long = rightNode & TweetIDMask.METAMASK - bits match { - case 
TweetIDMask.PHOTO => "Photo" - case TweetIDMask.PLAYER => "Video" - case TweetIDMask.SUMMARY => "Url" - case TweetIDMask.PROMOTION => "Promotion" - case _ => "Regular" - } - } - - private def getUserEdges(userId: Long): ListBuffer[Edge] = { - val random = new Random() - val iterator = - graph - .getRandomLeftNodeEdges(userId, 10, random).asInstanceOf[MultiSegmentIterator[ - BipartiteGraphSegment - ]] - val tweets = new ListBuffer[Edge]() - if (iterator != null) { - while (iterator.hasNext) { - val rightNode = iterator.nextLong() - val edgeType = iterator.currentEdgeType() - tweets += Edge( - tweetIDMask.restore(rightNode), - UserVideoEdgeTypeMask(edgeType).toString, - getCardInfo(rightNode), - ) - } - } - tweets - } - - def apply(httpRequest: Request): Future[Response] = { - log.info("UserTweetGraphEdgeHttpHandler params: " + httpRequest.getParams()) - val time0 = System.currentTimeMillis - - val tweetId = httpRequest.getLongParam("tweetId") - val queryTweetDegree = graph.getRightNodeDegree(tweetId) - val tweetEdges = getTweetEdges(tweetId) - - val userId = httpRequest.getLongParam("userId") - val queryUserDegree = graph.getLeftNodeDegree(userId) - - val response = Response(Version.Http11, Status.Ok) - val userEdges = getUserEdges(userId) - val elapsed = System.currentTimeMillis - time0 - val comment = ("Please specify \"userId\" or \"tweetId\" param." + - "\n query tweet degree = " + queryTweetDegree + - "\n query user degree = " + queryUserDegree + - "\n done in %d ms
    ").format(elapsed) - val tweetContent = userEdges.toList - .map { edge => - s"TweetId: ${edge.tweetId},\nAction type: ${edge.actionType},\nCard type: ${edge.cardType}" - .replaceAll("\n", " ") - }.mkString("\n
    \n") - - response.setContentString( - HTMLUtil.html.replace("XXXXX", comment + tweetContent + "\n


    \n" + tweetEdges.toString())) - Future.value(response) - } - - private def getTweetEdges(tweetId: Long): ListBuffer[Long] = { - val random = new Random() - val iterator = - graph - .getRandomRightNodeEdges(tweetId, 500, random).asInstanceOf[MultiSegmentIterator[ - BipartiteGraphSegment - ]] - val terms = new ListBuffer[Long]() - if (iterator != null) { - while (iterator.hasNext) { terms += iterator.nextLong() } - } - terms.distinct - } - -} - -case class Edge(tweetId: Long, actionType: String, cardType: String) diff --git a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphWriter.scala b/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphWriter.scala deleted file mode 100644 index 4909e0386..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/UserVideoGraphWriter.scala +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.recos.user_video_graph - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.finatra.kafka.consumers.FinagleKafkaConsumerBuilder -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.graphjet.bipartite.MultiSegmentPowerLawBipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import com.twitter.recos.hose.common.UnifiedGraphWriter -import com.twitter.recos.internal.thriftscala.RecosHoseMessage -import com.twitter.recos.serviceapi.Tweetypie._ - -/** - * The class submits a number of $numBootstrapWriters graph writer threads, BufferedEdgeWriter, - * during service startup. One of them is live writer thread, and the other $(numBootstrapWriters - 1) - * are catchup writer threads. All of them consume kafka events from an internal concurrent queue, - * which is populated by kafka reader threads. At bootstrap time, the kafka reader threads look - * back kafka offset from several hours ago and populate the internal concurrent queue. - * Each graph writer thread writes to an individual graph segment separately. - * The $(numBootstrapWriters - 1) catchup writer threads will stop once all events - * between current system time at startup and the time in memcache are processed. - * The live writer thread will continue to write all incoming kafka events. - * It lives through the entire life cycle of recos graph service. - */ -case class UserVideoGraphWriter( - shardId: String, - env: String, - hosename: String, - bufferSize: Int, - kafkaConsumerBuilder: FinagleKafkaConsumerBuilder[String, RecosHoseMessage], - clientId: String, - statsReceiver: StatsReceiver) - extends UnifiedGraphWriter[BipartiteGraphSegment, MultiSegmentPowerLawBipartiteGraph] { - writer => - // The max throughput for each kafka consumer is around 25MB/s - // Use 4 processors for 100MB/s catch-up speed. - val consumerNum: Int = 4 - // Leave 1 Segments to LiveWriter - val catchupWriterNum: Int = RecosConfig.maxNumSegments - 1 - - /** - * Adds a RecosHoseMessage to the graph. used by live writer to insert edges to the - * current segment - */ - override def addEdgeToGraph( - graph: MultiSegmentPowerLawBipartiteGraph, - recosHoseMessage: RecosHoseMessage - ): Unit = { - graph.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserVideoEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action), - ) - } - - /** - * Adds a RecosHoseMessage to the given segment in the graph. 
Used by catch up writers to - * insert edges to non-current (old) segments - */ - override def addEdgeToSegment( - segment: BipartiteGraphSegment, - recosHoseMessage: RecosHoseMessage - ): Unit = { - segment.addEdge( - recosHoseMessage.leftId, - getMetaEdge(recosHoseMessage.rightId, recosHoseMessage.card), - UserVideoEdgeTypeMask.actionTypeToEdgeType(recosHoseMessage.action) - ) - } - - private def getMetaEdge(rightId: Long, cardOption: Option[Byte]): Long = { - cardOption - .map { card => - if (isPhotoCard(card)) TweetIDMask.photo(rightId) - else if (isPlayerCard(card)) TweetIDMask.player(rightId) - else if (isSummaryCard(card)) TweetIDMask.summary(rightId) - else if (isPromotionCard(card)) TweetIDMask.promotion(rightId) - else rightId - } - .getOrElse(rightId) - } - -} diff --git a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/BUILD b/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/BUILD deleted file mode 100644 index ad9caf129..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "servo/request/src/main/scala", - "src/scala/com/twitter/recos/user_video_graph/store", - "src/scala/com/twitter/recos/user_video_graph/util", - "src/scala/com/twitter/recos/util:recos-util", - "src/thrift/com/twitter/recos/user_video_graph:user_video_graph-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala deleted file mode 100644 index 44a190e0d..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ConsumersBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.recos.user_tweet_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.recos.user_video_graph.util.FetchRHSTweetsUtil -import com.twitter.recos.user_video_graph.util.FilterUtil -import com.twitter.recos.user_video_graph.util.GetRelatedTweetCandidatesUtil -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS - -/** - * Implementation of the Thrift-defined service interface for consumersTweetBasedRelatedTweets. 
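The getMetaEdge helper above folds the tweet's card type into spare bits of the tweet id via graphjet's TweetIDMask, and the masked ids are later unmasked with restore before being returned to callers (see GetRelatedTweetCandidatesUtil below). A rough illustration of that encode/decode idea, using hypothetical bit positions rather than the library's real layout:

```scala
// Hypothetical bit layout, for illustration only: the real masking lives in
// com.twitter.graphjet.algorithms.TweetIDMask.
object CardMaskSketch {
  private val PhotoBit: Long    = 1L << 60
  private val PlayerBit: Long   = 1L << 61
  private val SummaryBit: Long  = 1L << 62
  private val MetaMask: Long    = PhotoBit | PlayerBit | SummaryBit

  def photo(tweetId: Long): Long    = tweetId | PhotoBit
  def player(tweetId: Long): Long   = tweetId | PlayerBit
  def restore(maskedId: Long): Long = maskedId & ~MetaMask

  def main(args: Array[String]): Unit = {
    val original = 123456789012345L // small sample id so the hypothetical high bits are free
    val masked = photo(original)
    assert(restore(masked) == original) // metadata strips cleanly
    println(s"masked=$masked restored=${restore(masked)}")
  }
}
```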
- * given a list of consumer userIds, find the tweets they co-engaged with (we're treating input userIds as consumers therefore "consumersTweetBasedRelatedTweets" ) - * example use case: given a list of user's contacts in their address book, find tweets those contacts engaged with - */ -class ConsumersBasedRelatedTweetsHandler( - bipartiteGraph: BipartiteGraph, - statsReceiver: StatsReceiver) - extends RequestHandler[ConsumersBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: ConsumersBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - - val maxResults = request.maxResults.getOrElse(200) - val minScore = request.minScore.getOrElse(0.0) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minCooccurrence = request.minCooccurrence.getOrElse(3) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val consumerSeedSet = request.consumerSeedSet.distinct.filter { userId => - val userDegree = bipartiteGraph.getLeftNodeDegree(userId) - // constrain to users that have <100 engagements to avoid spammy behavior - userDegree < 100 - } - - val rhsTweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - consumerSeedSet, - bipartiteGraph - ) - - val scorePreFactor = 1000.0 / consumerSeedSet.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rhsTweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter(relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId))).take(maxResults) - - stats.stat("response_size").add(relatedTweets.size) - Future.value(RelatedTweetResponse(tweets = relatedTweets)) - } - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala deleted file mode 100644 index 5f26ded6e..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/ProducerBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.recos.user_video_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.recos.util.Stats._ -import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS -import com.twitter.simclusters_v2.common.UserId -import com.twitter.storehaus.ReadableStore -import com.twitter.recos.user_video_graph.store.UserRecentFollowersStore -import com.twitter.recos.user_video_graph.util.FetchRHSTweetsUtil -import com.twitter.recos.user_video_graph.util.FilterUtil -import com.twitter.recos.user_video_graph.util.GetRelatedTweetCandidatesUtil - -/** - * Implementation of the Thrift-defined service interface for producerBasedRelatedTweets. 
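This handler, like the producer- and tweet-based handlers that follow, delegates scoring to GetRelatedTweetCandidatesUtil: count how often each candidate tweet co-occurs in the seed users' engagement lists, drop low-support candidates, and damp popular tweets by log(degree). A simplified, graph-free sketch of that scoring (the helper and its defaults below are illustrative, not the production utility):

```scala
// Simplified co-occurrence scoring: count how many seed users engaged with each
// candidate tweet, drop rare or low-degree candidates, damp popular tweets by log(degree).
object CoOccurrenceScoringSketch {
  final case class Scored(tweetId: Long, score: Double)

  def score(
    candidateTweetIds: Seq[Long], // RHS tweets fetched for the seed users, with repeats
    degree: Long => Int,          // overall engagement count of a tweet in the graph
    minCooccurrence: Int = 3,
    minResultDegree: Int = 50,
    scorePreFactor: Double = 1000.0
  ): Seq[Scored] =
    candidateTweetIds
      .groupBy(identity)
      .collect {
        case (tweetId, hits)
            if hits.size >= minCooccurrence && degree(tweetId) > minResultDegree =>
          Scored(tweetId, scorePreFactor * hits.size / math.log(degree(tweetId).toDouble))
      }
      .toSeq
      .sortBy(-_.score)

  def main(args: Array[String]): Unit = {
    val candidates = Seq(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L) // tweet 2 co-occurs 4 times, tweet 1 three times
    val degrees = Map(1L -> 80, 2L -> 5000, 3L -> 60).withDefaultValue(0)
    score(candidates, degrees, scorePreFactor = 1000.0 / 4).foreach(println)
  }
}
```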
- * - */ -class ProducerBasedRelatedTweetsHandler( - bipartiteGraph: BipartiteGraph, - userRecentFollowersStore: ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]], - statsReceiver: StatsReceiver) - extends RequestHandler[ProducerBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: ProducerBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - val maxResults = request.maxResults.getOrElse(200) - val maxNumFollowers = request.maxNumFollowers.getOrElse(500) - val minScore = request.minScore.getOrElse(0.0) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minCooccurrence = request.minCooccurrence.getOrElse(4) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val followersFut = fetchFollowers(request.producerId, Some(maxNumFollowers)) - followersFut.map { followers => - val rhsTweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - followers, - bipartiteGraph - ) - - val scorePreFactor = 1000.0 / followers.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rhsTweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter { relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId)) - }.take(maxResults) - stats.stat("response_size").add(relatedTweets.size) - RelatedTweetResponse(tweets = relatedTweets) - } - } - } - - private def fetchFollowers( - producerId: Long, - maxNumFollower: Option[Int], - ): Future[Seq[Long]] = { - val query = - UserRecentFollowersStore.Query(producerId, maxNumFollower, None) - - val followersFut = userRecentFollowersStore.get(query) - followersFut.map { followersOpt => - val followers = followersOpt.getOrElse(Seq.empty) - val followerIds = followers.distinct.filter { userId => - val userDegree = bipartiteGraph.getLeftNodeDegree(userId) - // constrain to more active users that have >1 engagement to optimize latency, and <100 engagements to avoid spammy behavior - userDegree > 1 && userDegree < 500 - } - stats.stat("follower_size_after_filter").add(followerIds.size) - followerIds - } - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala b/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala deleted file mode 100644 index 7150a2f0f..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/relatedTweetHandlers/TweetBasedRelatedTweetsHandler.scala +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.recos.user_video_graph.relatedTweetHandlers - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.features.tweet.thriftscala.GraphFeaturesForQuery -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.recos.user_video_graph.util.FilterUtil -import com.twitter.recos.user_video_graph.util.FetchRHSTweetsUtil -import com.twitter.recos.user_video_graph.util.GetRelatedTweetCandidatesUtil -import com.twitter.recos.user_video_graph.util.GetAllInternalTweetIdsUtil -import com.twitter.recos.user_video_graph.util.SampleLHSUsersUtil -import com.twitter.recos.util.Stats._ 
-import com.twitter.servo.request._ -import com.twitter.util.Duration -import com.twitter.util.Future -import scala.concurrent.duration.HOURS - -/** - * Implementation of the Thrift-defined service interface for tweetBasedRelatedTweets. - * - */ -class TweetBasedRelatedTweetsHandler(bipartiteGraph: BipartiteGraph, statsReceiver: StatsReceiver) - extends RequestHandler[TweetBasedRelatedTweetRequest, RelatedTweetResponse] { - private val stats = statsReceiver.scope(this.getClass.getSimpleName) - - override def apply(request: TweetBasedRelatedTweetRequest): Future[RelatedTweetResponse] = { - trackFutureBlockStats(stats) { - val internalQueryTweetIds = - GetAllInternalTweetIdsUtil.getAllInternalTweetIds(request.tweetId, bipartiteGraph) - - val response = internalQueryTweetIds match { - case head +: Nil => getRelatedTweets(request, head) - case _ => RelatedTweetResponse() - } - Future.value(response) - } - } - - private def getRelatedTweets( - request: TweetBasedRelatedTweetRequest, - maskedTweetId: Long - ): RelatedTweetResponse = { - - val maxNumSamplesPerNeighbor = request.maxNumSamplesPerNeighbor.getOrElse(100) - val maxResults = request.maxResults.getOrElse(200) - val minScore = request.minScore.getOrElse(0.5) - val maxTweetAge = request.maxTweetAgeInHours.getOrElse(48) - val minResultDegree = request.minResultDegree.getOrElse(50) - val minQueryDegree = request.minQueryDegree.getOrElse(10) - val minCooccurrence = request.minCooccurrence.getOrElse(3) - val excludeTweetIds = request.excludeTweetIds.getOrElse(Seq.empty).toSet - - val queryTweetDegree = bipartiteGraph.getRightNodeDegree(maskedTweetId) - stats.stat("queryTweetDegree").add(queryTweetDegree) - - if (queryTweetDegree < minQueryDegree) { - stats.counter("queryTweetDegreeLessThanMinQueryDegree").incr() - RelatedTweetResponse() - } else { - - val sampledLHSuserIds = - SampleLHSUsersUtil.sampleLHSUsers(maskedTweetId, maxNumSamplesPerNeighbor, bipartiteGraph) - - val rHStweetIds = FetchRHSTweetsUtil.fetchRHSTweets( - sampledLHSuserIds, - bipartiteGraph, - ) - - val scorePreFactor = - queryTweetDegree / math.log(queryTweetDegree) / sampledLHSuserIds.distinct.size - val relatedTweetCandidates = GetRelatedTweetCandidatesUtil.getRelatedTweetCandidates( - rHStweetIds, - minCooccurrence, - minResultDegree, - scorePreFactor, - bipartiteGraph) - - val relatedTweets = relatedTweetCandidates - .filter(relatedTweet => - FilterUtil.tweetAgeFilter( - relatedTweet.tweetId, - Duration(maxTweetAge, HOURS)) && (relatedTweet.score > minScore) && (!excludeTweetIds - .contains(relatedTweet.tweetId))).take(maxResults) - - stats.stat("response_size").add(relatedTweets.size) - RelatedTweetResponse( - tweets = relatedTweets, - queryTweetGraphFeatures = Some(GraphFeaturesForQuery(degree = Some(queryTweetDegree)))) - } - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/store/BUILD b/src/scala/com/twitter/recos/user_video_graph/store/BUILD deleted file mode 100644 index b1c3562b7..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/store/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:core", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/socialgraph:thrift-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_video_graph/store/UserRecentFollowersStore.scala b/src/scala/com/twitter/recos/user_video_graph/store/UserRecentFollowersStore.scala deleted file mode 100644 index 
7d1b6df6f..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/store/UserRecentFollowersStore.scala +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.recos.user_video_graph.store - -import com.twitter.simclusters_v2.common.UserId -import com.twitter.socialgraph.thriftscala.EdgesRequest -import com.twitter.socialgraph.thriftscala.EdgesResult -import com.twitter.socialgraph.thriftscala.PageRequest -import com.twitter.socialgraph.thriftscala.RelationshipType -import com.twitter.socialgraph.thriftscala.SrcRelationship -import com.twitter.socialgraph.thriftscala.SocialGraphService -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Time - -class UserRecentFollowersStore( - sgsClient: SocialGraphService.MethodPerEndpoint) - extends ReadableStore[UserRecentFollowersStore.Query, Seq[UserId]] { - - override def get(key: UserRecentFollowersStore.Query): Future[Option[Seq[UserId]]] = { - val edgeRequest = EdgesRequest( - relationship = SrcRelationship(key.userId, RelationshipType.FollowedBy), - // Could have a better guess at count when k.maxAge != None - pageRequest = Some(PageRequest(count = key.maxResults)) - ) - - val lookbackThresholdMillis = key.maxAge - .map(maxAge => (Time.now - maxAge).inMilliseconds) - .getOrElse(0L) - - sgsClient - .edges(Seq(edgeRequest)) - .map(_.flatMap { - case EdgesResult(edges, _, _) => - edges.collect { - case e if e.createdAt >= lookbackThresholdMillis => - e.target - } - }) - .map(Some(_)) - } -} - -object UserRecentFollowersStore { - case class Query( - userId: UserId, - // maxResults - if Some(count), we return only the `count` most recent follows - maxResults: Option[Int] = None, - // maxAge - if Some(duration), return only follows since `Time.now - duration` - maxAge: Option[Duration] = None) -} diff --git a/src/scala/com/twitter/recos/user_video_graph/util/BUILD b/src/scala/com/twitter/recos/user_video_graph/util/BUILD deleted file mode 100644 index a8a1364e1..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/graphjet", - "snowflake:id", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/scala/com/twitter/recos/util:recos-util", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/recos/user_video_graph:user_video_graph-scala", - ], -) diff --git a/src/scala/com/twitter/recos/user_video_graph/util/FetchRHSTweetsUtil.scala b/src/scala/com/twitter/recos/user_video_graph/util/FetchRHSTweetsUtil.scala deleted file mode 100644 index 63041c1d0..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/FetchRHSTweetsUtil.scala +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.recos.user_video_graph.util - -import com.twitter.graphjet.bipartite.MultiSegmentIterator -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import scala.collection.mutable.ListBuffer - -object FetchRHSTweetsUtil { - // get RHS tweets given LHS users - def fetchRHSTweets( - userIds: Seq[Long], - bipartiteGraph: BipartiteGraph - ): Seq[Long] = { - userIds.distinct - .flatMap { userId => - val tweetIdsIterator = bipartiteGraph - .getLeftNodeEdges(userId).asInstanceOf[MultiSegmentIterator[BipartiteGraphSegment]] - - val tweetIds = new ListBuffer[Long]() - if (tweetIdsIterator != null) { - while 
(tweetIdsIterator.hasNext) { - val rightNode = tweetIdsIterator.nextLong() - tweetIds += rightNode - } - } - tweetIds.distinct - } - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/util/FilterUtil.scala b/src/scala/com/twitter/recos/user_video_graph/util/FilterUtil.scala deleted file mode 100644 index ca827070d..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/FilterUtil.scala +++ /dev/null @@ -1,15 +0,0 @@ -package com.twitter.recos.user_video_graph.util - -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.util.Duration -import com.twitter.util.Time - -object FilterUtil { - def tweetAgeFilter(tweetId: TweetId, maxAge: Duration): Boolean = { - SnowflakeId - .timeFromIdOpt(tweetId) - .map { tweetTime => tweetTime > Time.now - maxAge }.getOrElse(false) - // If there's no snowflake timestamp, we have no idea when this tweet happened. - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/util/GetAllInternalTweetIdsUtil.scala b/src/scala/com/twitter/recos/user_video_graph/util/GetAllInternalTweetIdsUtil.scala deleted file mode 100644 index 8628f3a10..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/GetAllInternalTweetIdsUtil.scala +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.recos.user_video_graph.util - -import com.twitter.graphjet.algorithms.TweetIDMask -import com.twitter.graphjet.bipartite.api.BipartiteGraph - -object GetAllInternalTweetIdsUtil { - - def getAllInternalTweetIds(tweetId: Long, bipartiteGraph: BipartiteGraph): Seq[Long] = { - val internalTweetIds = getAllMasks(tweetId) - sortByDegrees(internalTweetIds, bipartiteGraph) - } - - private def getAllMasks(tweetId: Long): Seq[Long] = { - Seq( - tweetId, - TweetIDMask.summary(tweetId), - TweetIDMask.photo(tweetId), - TweetIDMask.player(tweetId), - TweetIDMask.promotion(tweetId) - ) - } - - private def sortByDegrees( - encodedTweetIds: Seq[Long], - bipartiteGraph: BipartiteGraph - ): Seq[Long] = { - encodedTweetIds - .map { encodedTweetId => (encodedTweetId, bipartiteGraph.getRightNodeDegree(encodedTweetId)) } - .filter { case (_, degree) => degree > 0 } // keep only tweetds with positive degree - .sortBy { case (_, degree) => -degree } // sort by degree in descending order - .map { case (encodedTweetId, _) => encodedTweetId } - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/util/GetRelatedTweetCandidatesUtil.scala b/src/scala/com/twitter/recos/user_video_graph/util/GetRelatedTweetCandidatesUtil.scala deleted file mode 100644 index 176e129db..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/GetRelatedTweetCandidatesUtil.scala +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.recos.user_video_graph.util - -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.recos.user_video_graph.thriftscala._ -import com.twitter.recos.features.tweet.thriftscala.GraphFeaturesForTweet -import com.twitter.graphjet.algorithms.TweetIDMask - -object GetRelatedTweetCandidatesUtil { - private val tweetIDMask = new TweetIDMask - - /** - * calculate scores for each RHS tweet that we get back - * for tweetBasedRelatedTweet, scorePreFactor = queryTweetDegree / log(queryTweetDegree) / LHSuserSize - * and the final score will be a log-cosine score - * for non-tweetBasedRelatedTweet, We don't have a query tweet, to keep scoring function consistent, - * scorePreFactor = 1000.0 / LHSuserSize (queryTweetDegree's average is ~10k, 1000 ~= 10k/log(10k)) - * Though scorePreFactor is 
applied for all results within a request, it's still useful to make score comparable across requests, - * so we can have a unifed min_score and help with downstream score normalization - * **/ - def getRelatedTweetCandidates( - relatedTweetCandidates: Seq[Long], - minCooccurrence: Int, - minResultDegree: Int, - scorePreFactor: Double, - bipartiteGraph: BipartiteGraph - ): Seq[RelatedTweet] = { - relatedTweetCandidates - .groupBy(tweetId => tweetId) - .filterKeys(tweetId => bipartiteGraph.getRightNodeDegree(tweetId) > minResultDegree) - .mapValues(_.size) - .filter { case (_, cooccurrence) => cooccurrence >= minCooccurrence } - .toSeq - .map { - case (relatedTweetId, cooccurrence) => - val relatedTweetDegree = bipartiteGraph.getRightNodeDegree(relatedTweetId) - - val score = scorePreFactor * cooccurrence / math.log(relatedTweetDegree) - toRelatedTweet(relatedTweetId, score, relatedTweetDegree, cooccurrence) - } - .sortBy(-_.score) - } - - def toRelatedTweet( - relatedTweetId: Long, - score: Double, - relatedTweetDegree: Int, - cooccurrence: Int - ): RelatedTweet = { - RelatedTweet( - tweetId = tweetIDMask.restore(relatedTweetId), - score = score, - relatedTweetGraphFeatures = Some( - GraphFeaturesForTweet(cooccurrence = Some(cooccurrence), degree = Some(relatedTweetDegree))) - ) - } -} diff --git a/src/scala/com/twitter/recos/user_video_graph/util/SampleLHSUsersUtil.scala b/src/scala/com/twitter/recos/user_video_graph/util/SampleLHSUsersUtil.scala deleted file mode 100644 index b8fd2c2f4..000000000 --- a/src/scala/com/twitter/recos/user_video_graph/util/SampleLHSUsersUtil.scala +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.recos.user_video_graph.util - -import com.twitter.graphjet.bipartite.MultiSegmentIterator -import com.twitter.graphjet.bipartite.api.BipartiteGraph -import com.twitter.graphjet.bipartite.segment.BipartiteGraphSegment -import java.util.Random -import scala.collection.mutable.ListBuffer - -object SampleLHSUsersUtil { - // sample userId nodes - def sampleLHSUsers( - maskedTweetId: Long, - maxNumSamplesPerNeighbor: Int, - bipartiteGraph: BipartiteGraph - ): Seq[Long] = { - val sampledUserIdsIterator = bipartiteGraph - .getRandomRightNodeEdges( - maskedTweetId, - maxNumSamplesPerNeighbor, - new Random(System.currentTimeMillis)).asInstanceOf[MultiSegmentIterator[ - BipartiteGraphSegment - ]] - - val userIds = new ListBuffer[Long]() - if (sampledUserIdsIterator != null) { - while (sampledUserIdsIterator.hasNext) { - val leftNode = sampledUserIdsIterator.nextLong() - // If a user likes too many things, we risk including spammy behavior. - if (bipartiteGraph.getLeftNodeDegree(leftNode) < 100) - userIds += leftNode - } - } - userIds - } -} diff --git a/src/scala/com/twitter/simclusters_v2/README.md b/src/scala/com/twitter/simclusters_v2/README.md deleted file mode 100644 index ae43836af..000000000 --- a/src/scala/com/twitter/simclusters_v2/README.md +++ /dev/null @@ -1,112 +0,0 @@ -# SimClusters: Community-based Representations for Heterogeneous Recommendations at Twitter - -## Overview -SimClusters is as a general-purpose representation layer based on overlapping communities into which users as well as heterogeneous content can be captured as sparse, interpretable vectors to support a multitude of recommendation tasks. - -We build our user and tweet SimClusters embeddings based on the inferred communities, and the representations power our personalized tweet recommendation via our online serving service SimClusters ANN. 
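A useful mental model for the rest of this doc: every SimClusters embedding is a sparse map from community id to weight, and most downstream logic reduces to dot products or cosine similarity between two such maps. A toy sketch with made-up weights:

```scala
// Toy model of SimClusters embeddings: sparse clusterId -> weight maps,
// compared with cosine similarity. The weights below are made up.
object SparseEmbeddingSketch {
  type Embedding = Map[Int, Double] // clusterId -> score

  def dot(a: Embedding, b: Embedding): Double =
    a.iterator.map { case (c, w) => w * b.getOrElse(c, 0.0) }.sum

  def norm(a: Embedding): Double = math.sqrt(a.values.map(w => w * w).sum)

  def cosine(a: Embedding, b: Embedding): Double =
    if (a.isEmpty || b.isEmpty) 0.0 else dot(a, b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    val userInterestedIn: Embedding = Map(17 -> 0.8, 523 -> 0.4, 90210 -> 0.1)
    val tweetEmbedding: Embedding   = Map(17 -> 0.6, 42 -> 0.3)
    println(f"cosine = ${cosine(userInterestedIn, tweetEmbedding)}%.3f")
  }
}
```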
- - -For more details, please read our paper that was published in KDD'2020 Applied Data Science Track: https://www.kdd.org/kdd2020/accepted-papers/view/simclusters-community-based-representations-for-heterogeneous-recommendatio - -## Brief introduction to Simclusters Algorithm - -### Follow relationships as a bipartite graph -Follow relationships on Twitter are perhaps most naturally thought of as directed graph, where each node is a user and each edge represents a Follow. Edges are directed in that User 1 can follow User 2, User 2 can follow User 1 or both User 1 and User 2 can follow each other. - -This directed graph can be also viewed as a bipartite graph, where nodes are grouped into two sets, Producers and Consumers. In this bipartite graph, Producers are the users who are Followed and Consumers are the Followees. Below is a toy example of a follow graph for four users: - - - -> Figure 1 - Left panel: A directed follow graph; Right panel: A bipartite graph representation of the directed graph - -### Community Detection - Known For -The bipartite follow graph can be used to identify groups of Producers who have similar followers, or who are "Known For" a topic. Specifically, the bipartite follow graph can also be represented as an *m x n* matrix (*A*), where consumers are presented as *u* and producers are represented as *v*. - -Producer-producer similarity is computed as the cosine similarity between users who follow each producer. The resulting cosine similarity values can be used to construct a producer-producer similarity graph, where the nodes are producers and edges are weighted by the corresponding cosine similarity value. Noise removal is performed, such that edges with weights below a specified threshold are deleted from the graph. - -After noise removal has been completed, Metropolis-Hastings sampling-based community detection is then run on the Producer-Producer similarity graph to identify a community affiliation for each producer. This algorithm takes in a parameter *k* for the number of communities to be detected. - - - -> Figure 2 - Left panel: Matrix representation of the follow graph depicted in Figure 1; Middle panel: Producer-Producer similarity is estimated by calculating the cosine similarity between the users who follow each producer; Right panel: Cosine similarity scores are used to create the Producer-Producer similarity graph. A clustering algorithm is run on the graph to identify groups of Producers with similar followers. - -Community affiliation scores are then used to construct an *n x k* "Known For" matrix (*V*). This matrix is maximally sparse, and each Producer is affiliated with at most one community. In production, the Known For dataset covers the top 20M producers and k ~= 145000. In other words, we discover around 145k communities based on Twitter's user follow graph. - - - -> Figure 3 - The clustering algorithm returns community affiliation scores for each producer. These scores are represented in matrix V. - -In the example above, Producer 1 is "Known For" community 2, Producer 2 is "Known For" community 1, and so forth. - -### Consumer Embeddings - User InterestedIn -An Interested In matrix (*U*) can be computed by multiplying the matrix representation of the follow graph (*A*) by the Known For matrix (*V*): - - - -In this toy example, consumer 1 is interested in community 1 only, whereas consumer 3 is interested in all three communities. There is also a noise removal step applied to the Interested In matrix. 
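Concretely, one consumer's row of the InterestedIn matrix is just the sum of the KnownFor vectors of the producers they follow. A tiny sketch of that row computation, reusing the toy example above and omitting the noise-removal step:

```scala
// U = A * V for one consumer: sum the KnownFor vectors of everyone they follow.
// Toy data only; the production KnownFor matrix covers ~20M producers and ~145K communities.
object InterestedInSketch {
  type KnownFor = Map[Long, Map[Int, Double]] // producerId -> (clusterId -> score)

  def interestedIn(follows: Seq[Long], knownFor: KnownFor): Map[Int, Double] =
    follows
      .flatMap(producerId => knownFor.getOrElse(producerId, Map.empty).toSeq)
      .groupBy { case (clusterId, _) => clusterId }
      .map { case (clusterId, scores) => clusterId -> scores.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val knownFor: KnownFor = Map(
      1L -> Map(2 -> 1.0), // producer 1 is Known For community 2
      2L -> Map(1 -> 1.0), // producer 2 is Known For community 1
      3L -> Map(1 -> 0.7)
    )
    // Result contains 1 -> 1.7 and 2 -> 1.0: this consumer leans toward communities 1 and 2.
    println(interestedIn(follows = Seq(1L, 2L, 3L), knownFor = knownFor))
  }
}
```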
- -We use the InterestedIn embeddings to capture consumer's long-term interest. The InterestedIn embeddings is one of our major source for consumer-based tweet recommendations. - -### Producer Embeddings -When computing the Known For matrix, each producer can only be Known For a single community. Although this maximally sparse matrix is useful from a computational perspective, we know that our users tweet about many different topics and may be "Known" in many different communities. Producer embeddings ( *Ṽ* ) are used to capture this richer structure of the graph. - -To calculate producer embeddings, the cosine similarity is calculated between each Producer’s follow graph and the Interested In vector for each community. - - - -Producer embeddings are used for producer-based tweet recommendations. For example, we can recommend similar tweets based on an account you just followed. - -### Entity Embeddings -SimClusters can also be used to generate embeddings for different kind of contents, such as -- Tweets (used for Tweet recommendations) -- Topics (used for TopicFollow) - -#### Tweet embeddings -When a tweet is created, its tweet embedding is initialized as an empty vector. -Tweet embeddings are updated each time the tweet is favorited. Specifically, the InterestedIn vector of each user who Fav-ed the tweet is added to the tweet vector. -Since tweet embeddings are updated each time a tweet is favorited, they change over time. - -Tweet embeddings are critical for our tweet recommendation tasks. We can calculate tweet similarity and recommend similar tweets to users based on their tweet engagement history. - -We have a online Heron job that updates the tweet embeddings in realtime, check out [here](summingbird/README.md) for more. - -#### Topic embeddings -Topic embeddings (**R**) are determined by taking the cosine similarity between consumers who are interested in a community and the number of aggregated favorites each consumer has taken on a tweet that has a topic annotation (with some time decay). - - - - -## Project Directory Overview -The whole SimClusters project can be understood as 2 main components -- SimClusters Offline Jobs (Scalding / GCP) -- SimClusters Real-time Streaming Jobs - -### SimClusters Offline Jobs - -**SimClusters Scalding Jobs** - -| Jobs | Code | Description | -|---|---|---| -| KnownFor | [simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala](scalding/update_known_for/UpdateKnownFor20M145K2020.scala) | The job outputs the KnownFor dataset which stores the relationships between clusterId and producerUserId. KnownFor dataset covers the top 20M followed producers. We use this KnownFor dataset (or so-called clusters) to build all other entity embeddings. | -| InterestedIn Embeddings| [simclusters_v2/scalding/InterestedInFromKnownFor.scala](scalding/InterestedInFromKnownFor.scala) | This code implements the job for computing users' interestedIn embedding from the KnownFor dataset. We use this dataset for consumer-based tweet recommendations.| -| Producer Embeddings | [simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala](scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala) | The code implements the job for computer producer embeddings, which represents the content user produces. 
We use this dataset for producer-based tweet recommendations.| -| Semantic Core Entity Embeddings | [simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala](scalding/embedding/EntityToSimClustersEmbeddingsJob.scala) | The job computes the semantic core entity embeddings. It outputs datasets that stores the "SemanticCore entityId -> List(clusterId)" and "clusterId -> List(SemanticCore entityId))" relationships.| -| Topic Embeddings | [simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala](scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala) | Jobs to generate Fav-based Topic-Follow-Graph (TFG) topic embeddings A topic's fav-based TFG embedding is the sum of its followers' fav-based InterestedIn. We use this embedding for topic related recommendations.| - -**SimClusters GCP Jobs** - -We have a GCP pipeline where we build our SimClusters ANN index via BigQuery. This allows us to do fast iterations and build new embeddings more efficiently compared to Scalding. - -All SimClusters related GCP jobs are under [src/scala/com/twitter/simclusters_v2/scio/bq_generation](scio/bq_generation). - -| Jobs | Code | Description | -|---|---|---| -| PushOpenBased SimClusters ANN Index | [EngagementEventBasedClusterToTweetIndexGenerationJob.scala](scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala) | The job builds a clusterId -> TopTweet index based on user-open engagement history. This SANN source is used for candidate generation for Notifications. | -| VideoViewBased SimClusters Index| [EngagementEventBasedClusterToTweetIndexGenerationJob.scala](scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala) | The job builds a clusterId -> TopTweet index based on the user's video view history. 
This SANN source is used for video recommendation on Home.| - -### SimClusters Real-Time Streaming Tweets Jobs - -| Jobs | Code | Description | -|---|---|---| -| Tweet Embedding Job | [simclusters_v2/summingbird/storm/TweetJob.scala](summingbird/storm/TweetJob.scala) | Generate the Tweet embedding and index of tweets for the SimClusters | -| Persistent Tweet Embedding Job| [simclusters_v2/summingbird/storm/PersistentTweetJob.scala](summingbird/storm/PersistentTweetJob.scala) | Persistent the tweet embeddings from MemCache into Manhattan.| \ No newline at end of file diff --git a/src/scala/com/twitter/simclusters_v2/candidate_source/BUILD b/src/scala/com/twitter/simclusters_v2/candidate_source/BUILD deleted file mode 100644 index 7e242cbb9..000000000 --- a/src/scala/com/twitter/simclusters_v2/candidate_source/BUILD +++ /dev/null @@ -1,17 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:core", - "frigate/frigate-common:base", - "frigate/frigate-common/src/main/scala/com/twitter/frigate/common/base", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/simclusters_v2/score", - "src/scala/com/twitter/simclusters_v2/summingbird/stores", - "src/scala/com/twitter/simclusters_v2/tweet_similarity", - "src/thrift/com/twitter/recos/user_tweet_entity_graph:user_tweet_entity_graph-scala", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/wtf/interest:interest-thrift-scala", - "util/util-stats/src/main/scala/com/twitter/finagle/stats", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/candidate_source/ClusterRanker.scala b/src/scala/com/twitter/simclusters_v2/candidate_source/ClusterRanker.scala deleted file mode 100644 index 9ef629a6c..000000000 --- a/src/scala/com/twitter/simclusters_v2/candidate_source/ClusterRanker.scala +++ /dev/null @@ -1,56 +0,0 @@ -package com.twitter.simclusters_v2.candidate_source - -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores - -object ClusterRanker extends Enumeration { - val RankByNormalizedFavScore: ClusterRanker.Value = Value - val RankByFavScore: ClusterRanker.Value = Value - val RankByFollowScore: ClusterRanker.Value = Value - val RankByLogFavScore: ClusterRanker.Value = Value - val RankByNormalizedLogFavScore: ClusterRanker.Value = Value - - /** - * Given a map of clusters, sort out the top scoring clusters by a ranking scheme - * provided by the caller - */ - def getTopKClustersByScore( - clustersWithScores: Map[Int, UserToInterestedInClusterScores], - rankByScore: ClusterRanker.Value, - topK: Int - ): Map[Int, Double] = { - val rankedClustersWithScores = clustersWithScores.map { - case (clusterId, score) => - rankByScore match { - case ClusterRanker.RankByFavScore => - (clusterId, (score.favScore.getOrElse(0.0), score.followScore.getOrElse(0.0))) - case ClusterRanker.RankByFollowScore => - (clusterId, (score.followScore.getOrElse(0.0), score.favScore.getOrElse(0.0))) - case ClusterRanker.RankByLogFavScore => - (clusterId, (score.logFavScore.getOrElse(0.0), score.followScore.getOrElse(0.0))) - case ClusterRanker.RankByNormalizedLogFavScore => - ( - clusterId, - ( - score.logFavScoreClusterNormalizedOnly.getOrElse(0.0), - score.followScore.getOrElse(0.0))) - case ClusterRanker.RankByNormalizedFavScore => - ( - clusterId, - ( - score.favScoreProducerNormalizedOnly.getOrElse(0.0), - score.followScore.getOrElse(0.0))) - case _ => - ( - clusterId, - ( - 
score.favScoreProducerNormalizedOnly.getOrElse(0.0), - score.followScore.getOrElse(0.0))) - } - } - rankedClustersWithScores.toSeq - .sortBy(_._2) // sort in ascending order - .takeRight(topK) - .map { case (clusterId, scores) => clusterId -> math.max(scores._1, 1e-4) } - .toMap - } -} diff --git a/src/scala/com/twitter/simclusters_v2/candidate_source/HeavyRanker.scala b/src/scala/com/twitter/simclusters_v2/candidate_source/HeavyRanker.scala deleted file mode 100644 index 407558ee3..000000000 --- a/src/scala/com/twitter/simclusters_v2/candidate_source/HeavyRanker.scala +++ /dev/null @@ -1,71 +0,0 @@ -package com.twitter.simclusters_v2.candidate_source - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.frigate.common.base.Stats -import com.twitter.simclusters_v2.candidate_source.SimClustersANNCandidateSource.SimClustersTweetCandidate -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ScoreInternalId -import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingPairScoreId -import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore} -import com.twitter.simclusters_v2.thriftscala.{ScoreId => ThriftScoreId} -import com.twitter.util.Future -import com.twitter.storehaus.ReadableStore - -object HeavyRanker { - trait HeavyRanker { - def rank( - scoringAlgorithm: ScoringAlgorithm, - sourceEmbeddingId: SimClustersEmbeddingId, - candidateEmbeddingType: EmbeddingType, - minScore: Double, - candidates: Seq[SimClustersTweetCandidate] - ): Future[Seq[SimClustersTweetCandidate]] - } - - class UniformScoreStoreRanker( - uniformScoringStore: ReadableStore[ThriftScoreId, ThriftScore], - stats: StatsReceiver) - extends HeavyRanker { - val fetchCandidateEmbeddingsStat = stats.scope("fetchCandidateEmbeddings") - - def rank( - scoringAlgorithm: ScoringAlgorithm, - sourceEmbeddingId: SimClustersEmbeddingId, - candidateEmbeddingType: EmbeddingType, - minScore: Double, - candidates: Seq[SimClustersTweetCandidate] - ): Future[Seq[SimClustersTweetCandidate]] = { - val pairScoreIds = candidates.map { candidate => - ThriftScoreId( - scoringAlgorithm, - ScoreInternalId.SimClustersEmbeddingPairScoreId( - SimClustersEmbeddingPairScoreId( - sourceEmbeddingId, - SimClustersEmbeddingId( - candidateEmbeddingType, - sourceEmbeddingId.modelVersion, - InternalId.TweetId(candidate.tweetId) - ) - )) - ) -> candidate.tweetId - }.toMap - - Future - .collect { - Stats.trackMap(fetchCandidateEmbeddingsStat) { - uniformScoringStore.multiGet(pairScoreIds.keySet) - } - } - .map { candidateScores => - candidateScores.toSeq - .collect { - case (pairScoreId, Some(score)) if score.score >= minScore => - SimClustersTweetCandidate(pairScoreIds(pairScoreId), score.score, sourceEmbeddingId) - } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNCandidateSource.scala b/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNCandidateSource.scala deleted file mode 100644 index eb6684e7c..000000000 --- a/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNCandidateSource.scala +++ /dev/null @@ -1,637 +0,0 @@ -package com.twitter.simclusters_v2.candidate_source - -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.StatsReceiver -import 
com.twitter.frigate.common.base.CandidateSource -import com.twitter.frigate.common.base.Stats -import com.twitter.simclusters_v2.candidate_source.HeavyRanker.UniformScoreStoreRanker -import com.twitter.simclusters_v2.candidate_source.SimClustersANNCandidateSource.SimClustersANNConfig -import com.twitter.simclusters_v2.candidate_source.SimClustersANNCandidateSource.SimClustersTweetCandidate -import com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.summingbird.stores.ClusterKey -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ScoreInternalId -import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingPairScoreId -import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore} -import com.twitter.simclusters_v2.thriftscala.{ScoreId => ThriftScoreId} -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Time -import scala.collection.mutable - -/** - * This store looks for tweets whose similarity is close to a Source SimClustersEmbeddingId. - * - * Approximate cosine similarity is the core algorithm to drive this store. - * - * Step 1 - 4 are in "fetchCandidates" method. - * 1. Retrieve the SimClusters Embedding by the SimClustersEmbeddingId - * 2. Fetch top N clusters' top tweets from the clusterTweetCandidatesStore (TopTweetsPerCluster index). - * 3. Calculate all the tweet candidates' dot-product or approximate cosine similarity to source tweets. - * 4. Take top M tweet candidates by the step 3's score - * Step 5-6 are in "reranking" method. - * 5. Calculate the similarity score between source and candidates. - * 6. Return top N candidates by the step 5's score. - * - * Warning: Only turn off the step 5 for User InterestedIn candidate generation. It's the only use - * case in Recos that we use dot-product to rank the tweet candidates. 
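Steps 2 to 4 above amount to a weighted merge of per-cluster top-tweet lists, optionally normalized into an approximate cosine. A condensed in-memory sketch of that accumulation (no stores, tweet-age windows, or reranking; names are illustrative):

```scala
// Approximate cosine between a source embedding and cluster-indexed tweet lists:
// accumulate tweetScore * clusterScore per tweet, track the sum of squared tweet
// scores, then divide by ||source|| * sqrt(candidate norm). Illustrative only.
object ApproxCosineSketch {
  def rank(
    sourceEmbedding: Map[Int, Double],                  // clusterId -> score
    topTweetsPerCluster: Map[Int, Seq[(Long, Double)]]  // clusterId -> (tweetId, score)
  ): Seq[(Long, Double)] = {
    val scores = scala.collection.mutable.Map.empty[Long, Double].withDefaultValue(0.0)
    val norms  = scala.collection.mutable.Map.empty[Long, Double].withDefaultValue(0.0)

    for {
      (clusterId, clusterScore) <- sourceEmbedding
      (tweetId, tweetScore)     <- topTweetsPerCluster.getOrElse(clusterId, Nil)
    } {
      scores(tweetId) += tweetScore * clusterScore
      norms(tweetId)  += tweetScore * tweetScore
    }

    val sourceL2 = math.sqrt(sourceEmbedding.values.map(s => s * s).sum)
    scores.toSeq
      .map { case (tweetId, s) => tweetId -> s / (sourceL2 * math.sqrt(norms(tweetId))) }
      .sortBy { case (_, score) => -score }
  }

  def main(args: Array[String]): Unit = {
    val source = Map(1 -> 0.9, 2 -> 0.3)
    val index  = Map(1 -> Seq(100L -> 0.5, 101L -> 0.2), 2 -> Seq(100L -> 0.4))
    rank(source, index).foreach(println)
  }
}
```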
- */ -case class SimClustersANNCandidateSource( - clusterTweetCandidatesStore: ReadableStore[ClusterKey, Seq[(TweetId, Double)]], - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding], - heavyRanker: HeavyRanker.HeavyRanker, - configs: Map[EmbeddingType, SimClustersANNConfig], - statsReceiver: StatsReceiver) - extends CandidateSource[SimClustersANNCandidateSource.Query, SimClustersTweetCandidate] { - - import SimClustersANNCandidateSource._ - - override val name: String = this.getClass.getName - private val stats = statsReceiver.scope(this.getClass.getName) - - private val fetchSourceEmbeddingStat = stats.scope("fetchSourceEmbedding") - protected val fetchCandidateEmbeddingsStat = stats.scope("fetchCandidateEmbeddings") - private val fetchCandidatesStat = stats.scope("fetchCandidates") - private val rerankingStat = stats.scope("reranking") - - override def get( - query: SimClustersANNCandidateSource.Query - ): Future[Option[Seq[SimClustersTweetCandidate]]] = { - val sourceEmbeddingId = query.sourceEmbeddingId - loadConfig(query) match { - case Some(config) => - for { - maybeSimClustersEmbedding <- Stats.track(fetchSourceEmbeddingStat) { - simClustersEmbeddingStore.get(query.sourceEmbeddingId) - } - maybeFilteredCandidates <- maybeSimClustersEmbedding match { - case Some(sourceEmbedding) => - for { - rawCandidates <- Stats.trackSeq(fetchCandidatesStat) { - fetchCandidates(sourceEmbeddingId, config, sourceEmbedding) - } - rankedCandidates <- Stats.trackSeq(rerankingStat) { - reranking(sourceEmbeddingId, config, rawCandidates) - } - } yield { - fetchCandidatesStat - .stat( - sourceEmbeddingId.embeddingType.name, - sourceEmbeddingId.modelVersion.name).add(rankedCandidates.size) - Some(rankedCandidates) - } - case None => - fetchCandidatesStat - .stat( - sourceEmbeddingId.embeddingType.name, - sourceEmbeddingId.modelVersion.name).add(0) - Future.None - } - } yield { - maybeFilteredCandidates - } - case _ => - // Skip over queries whose config is not defined - Future.None - } - } - - private def fetchCandidates( - sourceEmbeddingId: SimClustersEmbeddingId, - config: SimClustersANNConfig, - sourceEmbedding: SimClustersEmbedding - ): Future[Seq[SimClustersTweetCandidate]] = { - val now = Time.now - val earliestTweetId = SnowflakeId.firstIdFor(now - config.maxTweetCandidateAge) - val latestTweetId = SnowflakeId.firstIdFor(now - config.minTweetCandidateAge) - val clusterIds = - sourceEmbedding - .truncate(config.maxScanClusters).clusterIds - .map { clusterId: ClusterId => - ClusterKey(clusterId, sourceEmbeddingId.modelVersion, config.candidateEmbeddingType) - }.toSet - - Future - .collect { - clusterTweetCandidatesStore.multiGet(clusterIds) - }.map { clusterTweetsMap => - // Use Mutable map to optimize performance. The method is thread-safe. 
- // Set initial map size to around p75 of map size distribution to avoid too many copying - // from extending the size of the mutable hashmap - val candidateScoresMap = - new SimClustersANNCandidateSource.HashMap[TweetId, Double](InitialCandidateMapSize) - val candidateNormalizationMap = - new SimClustersANNCandidateSource.HashMap[TweetId, Double](InitialCandidateMapSize) - - clusterTweetsMap.foreach { - case (ClusterKey(clusterId, _, _, _), Some(tweetScores)) - if sourceEmbedding.contains(clusterId) => - val sourceClusterScore = sourceEmbedding.getOrElse(clusterId) - - for (i <- 0 until Math.min(tweetScores.size, config.maxTopTweetsPerCluster)) { - val (tweetId, score) = tweetScores(i) - - if (!parseTweetId(sourceEmbeddingId).contains(tweetId) && - tweetId >= earliestTweetId && tweetId <= latestTweetId) { - candidateScoresMap.put( - tweetId, - candidateScoresMap.getOrElse(tweetId, 0.0) + score * sourceClusterScore) - if (config.enablePartialNormalization) { - candidateNormalizationMap - .put(tweetId, candidateNormalizationMap.getOrElse(tweetId, 0.0) + score * score) - } - } - } - case _ => () - } - - stats.stat("candidateScoresMap").add(candidateScoresMap.size) - stats.stat("candidateNormalizationMap").add(candidateNormalizationMap.size) - - // Re-Rank the candidate by configuration - val processedCandidateScores = candidateScoresMap.map { - case (candidateId, score) => - // Enable Partial Normalization - val processedScore = - if (config.enablePartialNormalization) { - // We applied the "log" version of partial normalization when we rank candidates - // by log cosine similarity - if (config.rankingAlgorithm == ScoringAlgorithm.PairEmbeddingLogCosineSimilarity) { - score / sourceEmbedding.l2norm / math.log( - 1 + candidateNormalizationMap(candidateId)) - } else { - score / sourceEmbedding.l2norm / math.sqrt(candidateNormalizationMap(candidateId)) - } - } else score - SimClustersTweetCandidate(candidateId, processedScore, sourceEmbeddingId) - }.toSeq - - processedCandidateScores - .sortBy(-_.score) - } - } - - private def reranking( - sourceEmbeddingId: SimClustersEmbeddingId, - config: SimClustersANNConfig, - candidates: Seq[SimClustersTweetCandidate] - ): Future[Seq[SimClustersTweetCandidate]] = { - val rankedCandidates = if (config.enableHeavyRanking) { - heavyRanker - .rank( - scoringAlgorithm = config.rankingAlgorithm, - sourceEmbeddingId = sourceEmbeddingId, - candidateEmbeddingType = config.candidateEmbeddingType, - minScore = config.minScore, - candidates = candidates.take(config.maxReRankingCandidates) - ).map(_.sortBy(-_.score)) - } else { - Future.value(candidates) - } - rankedCandidates.map(_.take(config.maxNumResults)) - } - - private[candidate_source] def loadConfig(query: Query): Option[SimClustersANNConfig] = { - configs.get(query.sourceEmbeddingId.embeddingType).map { baseConfig => - // apply overrides if any - query.overrideConfig match { - case Some(overrides) => - baseConfig.copy( - maxNumResults = overrides.maxNumResults.getOrElse(baseConfig.maxNumResults), - maxTweetCandidateAge = - overrides.maxTweetCandidateAge.getOrElse(baseConfig.maxTweetCandidateAge), - minScore = overrides.minScore.getOrElse(baseConfig.minScore), - candidateEmbeddingType = - overrides.candidateEmbeddingType.getOrElse(baseConfig.candidateEmbeddingType), - enablePartialNormalization = - overrides.enablePartialNormalization.getOrElse(baseConfig.enablePartialNormalization), - enableHeavyRanking = - overrides.enableHeavyRanking.getOrElse(baseConfig.enableHeavyRanking), - rankingAlgorithm = 
overrides.rankingAlgorithm.getOrElse(baseConfig.rankingAlgorithm), - maxReRankingCandidates = - overrides.maxReRankingCandidates.getOrElse(baseConfig.maxReRankingCandidates), - maxTopTweetsPerCluster = - overrides.maxTopTweetsPerCluster.getOrElse(baseConfig.maxTopTweetsPerCluster), - maxScanClusters = overrides.maxScanClusters.getOrElse(baseConfig.maxScanClusters), - minTweetCandidateAge = - overrides.minTweetCandidateAge.getOrElse(baseConfig.minTweetCandidateAge) - ) - case _ => baseConfig - } - } - } -} - -object SimClustersANNCandidateSource { - - final val ProductionMaxNumResults = 200 - final val InitialCandidateMapSize = 16384 - - def apply( - clusterTweetCandidatesStore: ReadableStore[ClusterKey, Seq[(TweetId, Double)]], - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding], - uniformScoringStore: ReadableStore[ThriftScoreId, ThriftScore], - configs: Map[EmbeddingType, SimClustersANNConfig], - statsReceiver: StatsReceiver - ) = new SimClustersANNCandidateSource( - clusterTweetCandidatesStore = clusterTweetCandidatesStore, - simClustersEmbeddingStore = simClustersEmbeddingStore, - heavyRanker = new UniformScoreStoreRanker(uniformScoringStore, statsReceiver), - configs = configs, - statsReceiver = statsReceiver - ) - - private def parseTweetId(embeddingId: SimClustersEmbeddingId): Option[TweetId] = { - embeddingId.internalId match { - case InternalId.TweetId(tweetId) => - Some(tweetId) - case _ => - None - } - } - - case class Query( - sourceEmbeddingId: SimClustersEmbeddingId, - // Only override the config in DDG and Debuggers. - // Use Post-filter for the holdbacks for better cache hit rate. - overrideConfig: Option[SimClustersANNConfigOverride] = None) - - case class SimClustersTweetCandidate( - tweetId: TweetId, - score: Double, - sourceEmbeddingId: SimClustersEmbeddingId) - - class HashMap[A, B](initSize: Int) extends mutable.HashMap[A, B] { - override def initialSize: Int = initSize // 16 - by default - } - - /** - * The Configuration of Each SimClusters ANN Candidate Source. - * Expect One SimClusters Embedding Type mapping to a SimClusters ANN Configuration in Production. - */ - case class SimClustersANNConfig( - // The max number of candidates for a ANN Query - // Please don't override this value in Production. - maxNumResults: Int = ProductionMaxNumResults, - // The max tweet candidate duration from now. - maxTweetCandidateAge: Duration, - // The min score of the candidates - minScore: Double, - // The Candidate Embedding Type of Tweet. - candidateEmbeddingType: EmbeddingType, - // Enables normalization of approximate SimClusters vectors to remove popularity bias - enablePartialNormalization: Boolean, - // Whether to enable Embedding Similarity ranking - enableHeavyRanking: Boolean, - // The ranking algorithm for Source Candidate Similarity - rankingAlgorithm: ScoringAlgorithm, - // The max number of candidates in ReRanking Step - maxReRankingCandidates: Int, - // The max number of Top Tweets from every cluster tweet index - maxTopTweetsPerCluster: Int, - // The max number of Clusters in the source Embeddings. - maxScanClusters: Int, - // The min tweet candidate duration from now. - minTweetCandidateAge: Duration) - - /** - * Contains same fields as [[SimClustersANNConfig]], to specify which fields are to be overriden - * for experimental purposes. - * - * All fields in this class must be optional. 
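The override mechanism is the usual Option-per-field merge performed in loadConfig above: any field left as None falls back to the production default. A stripped-down sketch with hypothetical field names:

```scala
// Stripped-down version of the loadConfig merge: every Option in the override
// falls back to the base config via getOrElse. Field names are illustrative.
final case class BaseConfig(maxNumResults: Int, minScore: Double)
final case class ConfigOverride(
  maxNumResults: Option[Int] = None,
  minScore: Option[Double] = None)

object ConfigMergeSketch {
  def merge(base: BaseConfig, o: ConfigOverride): BaseConfig =
    base.copy(
      maxNumResults = o.maxNumResults.getOrElse(base.maxNumResults),
      minScore = o.minScore.getOrElse(base.minScore)
    )

  def main(args: Array[String]): Unit = {
    val base = BaseConfig(maxNumResults = 200, minScore = 0.7)
    // A debug/experiment request only loosens the score threshold; everything else keeps defaults.
    println(merge(base, ConfigOverride(minScore = Some(0.5)))) // BaseConfig(200, 0.5)
  }
}
```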
- */ - case class SimClustersANNConfigOverride( - maxNumResults: Option[Int] = None, - maxTweetCandidateAge: Option[Duration] = None, - minScore: Option[Double] = None, - candidateEmbeddingType: Option[EmbeddingType] = None, - enablePartialNormalization: Option[Boolean] = None, - enableHeavyRanking: Option[Boolean] = None, - rankingAlgorithm: Option[ScoringAlgorithm] = None, - maxReRankingCandidates: Option[Int] = None, - maxTopTweetsPerCluster: Option[Int] = None, - maxScanClusters: Option[Int] = None, - minTweetCandidateAge: Option[Duration] = None, - enableLookbackSource: Option[Boolean] = None) - - final val DefaultMaxTopTweetsPerCluster = 200 - final val DefaultEnableHeavyRanking = false - object SimClustersANNConfig { - val DefaultSimClustersANNConfig: SimClustersANNConfig = - SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.7, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = false, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = 200, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ) - } - - val LookbackMediaMinDays: Int = 0 - val LookbackMediaMaxDays: Int = 2 - val LookbackMediaMaxTweetsPerDay: Int = 2000 - val maxTopTweetsPerCluster: Int = - (LookbackMediaMaxDays - LookbackMediaMinDays + 1) * LookbackMediaMaxTweetsPerDay - - val LookbackMediaTweetConfig: Map[EmbeddingType, SimClustersANNConfig] = { - val candidateEmbeddingType = EmbeddingType.LogFavLongestL2EmbeddingTweet - val minTweetAge = LookbackMediaMinDays.days - val maxTweetAge = - LookbackMediaMaxDays.days - 1.hour // To compensate for the cache TTL that might push the tweet age beyond max age - val rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity - - val maxScanClusters = 50 - val minScore = 0.5 - Map( - EmbeddingType.FavBasedProducer -> SimClustersANNConfig( - minTweetCandidateAge = minTweetAge, - maxTweetCandidateAge = maxTweetAge, - minScore = - minScore, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = candidateEmbeddingType, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = rankingAlgorithm, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = maxTopTweetsPerCluster, - maxScanClusters = maxScanClusters, - ), - EmbeddingType.LogFavLongestL2EmbeddingTweet -> SimClustersANNConfig( - minTweetCandidateAge = minTweetAge, - maxTweetCandidateAge = maxTweetAge, - minScore = - minScore, // for twistly candidates. 
To specify a higher threshold, use a post-filter - candidateEmbeddingType = candidateEmbeddingType, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = rankingAlgorithm, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = maxTopTweetsPerCluster, - maxScanClusters = maxScanClusters, - ), - EmbeddingType.FavTfgTopic -> SimClustersANNConfig( - minTweetCandidateAge = minTweetAge, - maxTweetCandidateAge = maxTweetAge, - minScore = minScore, - candidateEmbeddingType = candidateEmbeddingType, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = rankingAlgorithm, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = 200, - maxScanClusters = maxScanClusters, - ), - EmbeddingType.LogFavBasedKgoApeTopic -> SimClustersANNConfig( - minTweetCandidateAge = minTweetAge, - maxTweetCandidateAge = maxTweetAge, - minScore = minScore, - candidateEmbeddingType = candidateEmbeddingType, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = rankingAlgorithm, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = 200, - maxScanClusters = maxScanClusters, - ), - ) - } - - val DefaultConfigMappings: Map[EmbeddingType, SimClustersANNConfig] = Map( - EmbeddingType.FavBasedProducer -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedAverageAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. 
To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.RelaxedAggregatableLogFavBasedProducer -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.25, // for twistly candidates. To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 250, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavLongestL2EmbeddingTweet -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.3, // for twistly candidates. 
To specify a higher threshold, use a post-filter - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.FilteredUserInterestedInFromPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.7, // unused, heavy ranking disabled - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = false, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = - ScoringAlgorithm.PairEmbeddingCosineSimilarity, // Unused, heavy ranking disabled - maxReRankingCandidates = 150, // unused, heavy ranking disabled - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.FilteredUserInterestedIn -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.7, // unused, heavy ranking disabled - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = false, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = - ScoringAlgorithm.PairEmbeddingCosineSimilarity, // Unused, heavy ranking disabled - maxReRankingCandidates = 150, // unused, heavy ranking disabled - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.UnfilteredUserInterestedIn -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingLogCosineSimilarity, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.FollowBasedUserInterestedInFromAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 200, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.LogFavBasedUserInterestedInFromAPE -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 200, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.FavTfgTopic -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.5, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 
0.seconds - ), - EmbeddingType.LogFavBasedKgoApeTopic -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.5, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 400, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ), - EmbeddingType.UserNextInterestedIn -> SimClustersANNConfig( - maxTweetCandidateAge = 1.days, - minScore = 0.0, - candidateEmbeddingType = EmbeddingType.LogFavBasedTweet, - enablePartialNormalization = true, - enableHeavyRanking = DefaultEnableHeavyRanking, - rankingAlgorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, - maxReRankingCandidates = 200, - maxTopTweetsPerCluster = DefaultMaxTopTweetsPerCluster, - maxScanClusters = 50, - minTweetCandidateAge = 0.seconds - ) - ) - - /** - * Only cache the candidates if it's not Consumer-source. For example, TweetSource, ProducerSource, - * TopicSource. We don't cache consumer-sources (e.g. UserInterestedIn) since a cached consumer - * object is going rarely hit, since it can't be shared by multiple users. - */ - val CacheableShortTTLEmbeddingTypes: Set[EmbeddingType] = - Set( - EmbeddingType.FavBasedProducer, - EmbeddingType.LogFavLongestL2EmbeddingTweet, - ) - - val CacheableLongTTLEmbeddingTypes: Set[EmbeddingType] = - Set( - EmbeddingType.FavTfgTopic, - EmbeddingType.LogFavBasedKgoApeTopic - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNWrapperCandidateSource.scala b/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNWrapperCandidateSource.scala deleted file mode 100644 index 2ad19e50f..000000000 --- a/src/scala/com/twitter/simclusters_v2/candidate_source/SimClustersANNWrapperCandidateSource.scala +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.simclusters_v2.candidate_source - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.frigate.common.base.CandidateSource -import com.twitter.simclusters_v2.candidate_source.SimClustersANNCandidateSource.LookbackMediaTweetConfig -import com.twitter.simclusters_v2.candidate_source.SimClustersANNCandidateSource.SimClustersTweetCandidate -import com.twitter.util.Future - -/** - * An abstraction layer that implements a lambda structure for ANNCandidate source. - * Allows us to call an online store as well as an offline store from a single query. 
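The per-embedding-type maps above resolve a SimClustersANNConfig for each query embedding type, and SimClustersANNConfigOverride lets a caller selectively replace individual fields. A minimal sketch of how such a lookup-and-merge could work, assuming SimClustersANNConfig is a case class and that DefaultConfigMappings and the default config are in scope (the resolveConfig helper itself is hypothetical, not part of this file):

```scala
// Hypothetical helper: pick the base config for an embedding type, falling back to the
// default, then layer any per-request override on top. Only a few fields are shown.
def resolveConfig(
  embeddingType: EmbeddingType,
  overrideConfig: Option[SimClustersANNConfigOverride]
): SimClustersANNConfig = {
  val base = DefaultConfigMappings.getOrElse(
    embeddingType,
    SimClustersANNConfig.DefaultSimClustersANNConfig)
  overrideConfig.fold(base) { o =>
    base.copy(
      minScore = o.minScore.getOrElse(base.minScore),
      maxTopTweetsPerCluster = o.maxTopTweetsPerCluster.getOrElse(base.maxTopTweetsPerCluster),
      maxReRankingCandidates = o.maxReRankingCandidates.getOrElse(base.maxReRankingCandidates)
    )
  }
}
```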
- */ -case class SimClustersANNWrapperCandidateSource( - onlineANNSource: CandidateSource[SimClustersANNCandidateSource.Query, SimClustersTweetCandidate], - lookbackANNSource: CandidateSource[ - SimClustersANNCandidateSource.Query, - SimClustersTweetCandidate - ], -)( - statsReceiver: StatsReceiver) - extends CandidateSource[SimClustersANNCandidateSource.Query, SimClustersTweetCandidate] { - - override def get( - query: SimClustersANNCandidateSource.Query - ): Future[Option[Seq[SimClustersTweetCandidate]]] = { - - val enableLookbackSource = - query.overrideConfig.exists(_.enableLookbackSource.getOrElse(false)) - - val embeddingType = query.sourceEmbeddingId.embeddingType - val lookbackCandidatesFut = - if (enableLookbackSource && - LookbackMediaTweetConfig.contains(embeddingType)) { - statsReceiver - .counter("lookback_source", embeddingType.toString, "enable").incr() - statsReceiver.counter("lookback_source", "enable").incr() - lookbackANNSource.get(query) - } else { - statsReceiver - .counter("lookback_source", embeddingType.toString, "disable").incr() - Future.None - } - - Future.join(onlineANNSource.get(query), lookbackCandidatesFut).map { - case (onlineCandidates, lookbackCandidates) => - Some( - onlineCandidates.getOrElse(Nil) ++ lookbackCandidates.getOrElse(Nil) - ) - } - } - - override def name: String = this.getClass.getCanonicalName -} diff --git a/src/scala/com/twitter/simclusters_v2/common/BUILD b/src/scala/com/twitter/simclusters_v2/common/BUILD deleted file mode 100644 index 9cf3b3fd7..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "servo/decider", - "src/scala/com/twitter/storehaus_internal/manhattan", - "src/thrift/com/twitter/ml/api:interpretable-model-java", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/common/CosineSimilarityUtil.scala b/src/scala/com/twitter/simclusters_v2/common/CosineSimilarityUtil.scala deleted file mode 100644 index 2a8cc1c46..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/CosineSimilarityUtil.scala +++ /dev/null @@ -1,251 +0,0 @@ -package com.twitter.simclusters_v2.common - -object CosineSimilarityUtil { - - /** - * Sum of squared elements for a given vector v - */ - def sumOfSquares[T](v: Map[T, Double]): Double = { - v.values.foldLeft(0.0) { (sum, value) => sum + value * value } - } - - /** - * Sum of squared elements for a given vector v - */ - def sumOfSquaresArray(v: Array[Double]): Double = { - v.foldLeft(0.0) { (sum, value) => sum + value * value } - } - - /** - * Calculate the l2Norm score - */ - def norm[T](v: Map[T, Double]): Double = { - math.sqrt(sumOfSquares(v)) - } - - /** - * Calculate the l2Norm score - */ - def normArray(v: Array[Double]): Double = { - math.sqrt(sumOfSquaresArray(v)) - } - - /** - * Calculate the logNorm score - */ - def logNorm[T](v: Map[T, Double]): Double = { - math.log(sumOfSquares(v) + 1) - } - - /** - * Calculate the logNorm score - */ - def logNormArray(v: Array[Double]): Double = { - math.log(sumOfSquaresArray(v) + 1) - } - - /** - * Calculate the exp scaled norm score - * */ - def expScaledNorm[T](v: Map[T, Double], exponent: Double): Double = { - math.pow(sumOfSquares(v), exponent) - } - - /** - * Calculate the exp scaled norm score - * */ - def expScaledNormArray(v: Array[Double], 
exponent: Double): Double = { - math.pow(sumOfSquaresArray(v), exponent) - } - - /** - * Calculate the l1Norm score - */ - def l1Norm[T](v: Map[T, Double]): Double = { - v.values.foldLeft(0.0) { (sum, value) => sum + Math.abs(value) } - } - - /** - * Calculate the l1Norm score - */ - def l1NormArray(v: Array[Double]): Double = { - v.foldLeft(0.0) { (sum, value) => sum + Math.abs(value) } - } - - /** - * Divide the weight vector with the applied norm - * Return the original object if the norm is 0 - * - * @param v a map from cluster id to its weight - * @param norm a calculated norm from the given map v - * - * @return a map with normalized weight - */ - def applyNorm[T](v: Map[T, Double], norm: Double): Map[T, Double] = { - if (norm == 0) v else v.mapValues(x => x / norm) - } - - /** - * Divide the weight vector with the applied norm - * Return the original object if the norm is 0 - * - * @param v a an array of weights - * @param norm a calculated norm from the given array v - * - * @return an array with normalized weight in the same order as v - */ - def applyNormArray(v: Array[Double], norm: Double): Array[Double] = { - if (norm == 0) v else v.map(_ / norm) - } - - /** - * Normalize the weight vector for easy cosine similarity calculation. If the input weight vector - * is empty or its norm is 0, return the original map. - * - * @param v a map from cluster id to its weight - * - * @return a map with normalized weight (the norm of the weight vector is 1) - */ - def normalize[T](v: Map[T, Double], maybeNorm: Option[Double] = None): Map[T, Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.norm(v)) - applyNorm(v, norm) - } - - /** - * Normalize the weight vector for easy cosine similarity calculation. If the input weight vector - * is empty or its norm is 0, return the original array. - * - * @param v an array of weights - * - * @return an array with normalized weight (the norm of the weight vector is 1), in the same order as v - */ - def normalizeArray( - v: Array[Double], - maybeNorm: Option[Double] = None - ): Array[Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.normArray(v)) - applyNormArray(v, norm) - } - - /** - * Normalize the weight vector with log norm. If the input weight vector - * is empty or its norm is 0, return the original map. - * - * @param v a map from cluster id to its weight - * - * @return a map with log normalized weight - * */ - def logNormalize[T](v: Map[T, Double], maybeNorm: Option[Double] = None): Map[T, Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.logNorm(v)) - applyNorm(v, norm) - } - - /** - * Normalize the weight vector with log norm. If the input weight vector - * is empty or its norm is 0, return the original array. - * - * @param v an array of weights - * - * @return an array with log normalized weight, in the same order as v - * */ - def logNormalizeArray( - v: Array[Double], - maybeNorm: Option[Double] = None - ): Array[Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.logNormArray(v)) - applyNormArray(v, norm) - } - - /** - * Normalize the weight vector with exponentially scaled norm. If the input weight vector - * is empty or its norm is 0, return the original map. 
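To make the norm variants above concrete, a small worked example with values chosen for easy arithmetic (illustrative only, not test code):

```scala
val v = Map(1 -> 3.0, 2 -> 4.0)
CosineSimilarityUtil.sumOfSquares(v)        // 3*3 + 4*4 = 25.0
CosineSimilarityUtil.norm(v)                // sqrt(25) = 5.0
CosineSimilarityUtil.l1Norm(v)              // |3| + |4| = 7.0
CosineSimilarityUtil.logNorm(v)             // log(25 + 1) ≈ 3.258
CosineSimilarityUtil.expScaledNorm(v, 0.3)  // 25^0.3 ≈ 2.627
CosineSimilarityUtil.normalize(v)           // Map(1 -> 0.6, 2 -> 0.8), unit L2 norm
```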
- * - * @param v a map from cluster id to its weight - * @param exponent the exponent we apply to the weight vector's norm - * - * @return a map with exp scaled normalized weight - * */ - def expScaledNormalize[T]( - v: Map[T, Double], - exponent: Option[Double] = None, - maybeNorm: Option[Double] = None - ): Map[T, Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.expScaledNorm(v, exponent.getOrElse(0.3))) - applyNorm(v, norm) - } - - /** - * Normalize the weight vector with exponentially scaled norm. If the input weight vector - * is empty or its norm is 0, return the original map. - * - * @param v an array of weights - * @param exponent the exponent we apply to the weight vector's norm - * - * @return an array with exp scaled normalized weight, in the same order as v - * */ - def expScaledNormalizeArray( - v: Array[Double], - exponent: Double, - maybeNorm: Option[Double] = None - ): Array[Double] = { - val norm = maybeNorm.getOrElse(CosineSimilarityUtil.expScaledNormArray(v, exponent)) - applyNormArray(v, norm) - } - - /** - * Given two sparse vectors, calculate its dot product. - * - * @param v1 the first map from cluster id to its weight - * @param v2 the second map from cluster id to its weight - * - * @return the dot product of above two sparse vector - */ - def dotProduct[T](v1: Map[T, Double], v2: Map[T, Double]): Double = { - val comparer = v1.size - v2.size - val smaller = if (comparer > 0) v2 else v1 - val bigger = if (comparer > 0) v1 else v2 - - smaller.foldLeft(0.0) { - case (sum, (id, value)) => - sum + bigger.getOrElse(id, 0.0) * value - } - } - - /** - * Given two sparse vectors, calculate its dot product. - * - * @param v1C an array of cluster ids. Must be sorted in ascending order - * @param v1S an array of corresponding cluster scores, of the same length and order as v1c - * @param v2C an array of cluster ids. Must be sorted in ascending order - * @param v2S an array of corresponding cluster scores, of the same length and order as v2c - * - * @return the dot product of above two sparse vector - */ - def dotProductForSortedClusterAndScores( - v1C: Array[Int], - v1S: Array[Double], - v2C: Array[Int], - v2S: Array[Double] - ): Double = { - require(v1C.size == v1S.size) - require(v2C.size == v2S.size) - var i1 = 0 - var i2 = 0 - var product: Double = 0.0 - - while (i1 < v1C.size && i2 < v2C.size) { - if (v1C(i1) == v2C(i2)) { - product += v1S(i1) * v2S(i2) - i1 += 1 - i2 += 1 - } else if (v1C(i1) > v2C(i2)) { - // v2 cluster is lower. Increment it to see if the next one matches v1's - i2 += 1 - } else { - // v1 cluster is lower. 
Increment it to see if the next one matches v2's - i1 += 1 - } - } - product - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/DeciderGateBuilderWithIdHashing.scala b/src/scala/com/twitter/simclusters_v2/common/DeciderGateBuilderWithIdHashing.scala deleted file mode 100644 index 76e10aaa0..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/DeciderGateBuilderWithIdHashing.scala +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.decider.Decider -import com.twitter.servo.decider.{DeciderGateBuilder, DeciderKeyName} -import com.twitter.servo.util.Gate - -class DeciderGateBuilderWithIdHashing(decider: Decider) extends DeciderGateBuilder(decider) { - - def idGateWithHashing[T](key: DeciderKeyName): Gate[T] = { - val feature = keyToFeature(key) - // Only if the decider is neither fully on / off is the object hashed - // This does require an additional call to get the decider availability but that is comparatively cheaper - val convertToHash: T => Long = (obj: T) => { - val availability = feature.availability.getOrElse(0) - if (availability == 10000 || availability == 0) availability - else obj.hashCode - } - idGate(key).contramap[T](convertToHash) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/ModelVersions.scala b/src/scala/com/twitter/simclusters_v2/common/ModelVersions.scala deleted file mode 100644 index 796474ccd..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/ModelVersions.scala +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.thriftscala.ModelVersion - -/** - * The utility to convert SimClusters Model version into different forms. - * Required to register any new SimClusters Model version here. 
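A usage sketch for CosineSimilarityUtil above (values are illustrative): cosine similarity of two sparse vectors is the dot product of their L2-normalized forms, and the sorted-array variant avoids Map allocations on hot paths.

```scala
val v1 = Map(1 -> 0.5, 7 -> 1.0) // clusterId -> weight
val v2 = Map(1 -> 0.2, 9 -> 0.4)

// Map-based: normalize each vector to unit L2 norm, then take the dot product.
val cosine = CosineSimilarityUtil.dotProduct(
  CosineSimilarityUtil.normalize(v1),
  CosineSimilarityUtil.normalize(v2))

// Array-based variant: cluster ids must be sorted ascending, and each score array
// must be in the same order as its id array.
val cosineFromArrays = CosineSimilarityUtil.dotProductForSortedClusterAndScores(
  Array(1, 7), CosineSimilarityUtil.normalizeArray(Array(0.5, 1.0)),
  Array(1, 9), CosineSimilarityUtil.normalizeArray(Array(0.2, 0.4)))
```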
- */ -object ModelVersions { - - val Model20M145KDec11 = "20M_145K_dec11" - val Model20M145KUpdated = "20M_145K_updated" - val Model20M145K2020 = "20M_145K_2020" - - // Use Enum for feature switch - object Enum extends Enumeration { - val Model20M145K2020, Model20M145KUpdated: Value = Value - val enumToSimClustersModelVersionMap: Map[Enum.Value, ModelVersion] = Map( - Model20M145K2020 -> ModelVersion.Model20m145k2020, - Model20M145KUpdated -> ModelVersion.Model20m145kUpdated - ) - } - - // Add the new model version into this map - private val StringToThriftModelVersions: Map[String, ModelVersion] = - Map( - Model20M145KDec11 -> ModelVersion.Model20m145kDec11, - Model20M145KUpdated -> ModelVersion.Model20m145kUpdated, - Model20M145K2020 -> ModelVersion.Model20m145k2020 - ) - - private val ThriftModelVersionToStrings = StringToThriftModelVersions.map(_.swap) - - val AllModelVersions: Set[String] = StringToThriftModelVersions.keySet - - def toModelVersionOption(modelVersionStr: String): Option[ModelVersion] = { - StringToThriftModelVersions.get(modelVersionStr) - } - - implicit def toModelVersion(modelVersionStr: String): ModelVersion = { - StringToThriftModelVersions(modelVersionStr) - } - - implicit def toKnownForModelVersion(modelVersion: ModelVersion): String = { - ThriftModelVersionToStrings(modelVersion) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SeqStandardDeviation.scala b/src/scala/com/twitter/simclusters_v2/common/SeqStandardDeviation.scala deleted file mode 100644 index c8e11c41f..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SeqStandardDeviation.scala +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.simclusters_v2.common - -object SeqStandardDeviation { - - def apply[T](t: Seq[T])(implicit mapper: T => Double): Double = { - if (t.isEmpty) { - 0.0 - } else { - val sum = t.foldLeft(0.0) { - case (temp, score) => - temp + score - } - val mean = sum / t.size - val variance = t.foldLeft(0.0) { (sum, score) => - val v = score - mean - sum + v * v - } / t.size - math.sqrt(variance) - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala deleted file mode 100644 index b8f0179cb..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala +++ /dev/null @@ -1,581 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import scala.collection.mutable -import scala.language.implicitConversions -import scala.util.hashing.MurmurHash3.arrayHash -import scala.util.hashing.MurmurHash3.productHash -import scala.math._ - -/** - * A representation of a SimClusters Embedding, designed for low memory footprint and performance. - * For services that cache millions of embeddings, we found this to significantly reduce allocations, - * memory footprint and overall performance. - * - * Embedding data is stored in pre-sorted arrays rather than structures which use a lot of pointers - * (e.g. Map). A minimal set of lazily-constructed intermediate data is kept. - * - * Be wary of adding further `val` or `lazy val`s to this class; materializing and storing more data - * on these objects could significantly affect in-memory cache performance. 
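As a quick check of SeqStandardDeviation above: it computes the population standard deviation (dividing by the sequence size) and returns 0.0 for an empty sequence. The values below are illustrative; the mapper is passed explicitly for clarity.

```scala
// mean = 2.0, variance = ((1-2)^2 + (2-2)^2 + (3-2)^2) / 3 = 2/3
SeqStandardDeviation(Seq(1.0, 2.0, 3.0))(identity) // sqrt(2.0 / 3.0) ≈ 0.816
SeqStandardDeviation(Seq.empty[Double])(identity)  // 0.0
```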
- * - * Also, if you are using this code in a place where you care about memory footprint, be careful - * not to materialize any of the lazy vals unless you need them. - */ -sealed trait SimClustersEmbedding extends Equals { - import SimClustersEmbedding._ - - /** - * Any compliant implementation of the SimClustersEmbedding trait must ensure that: - * - the cluster and score arrays are ordered as described below - * - the cluster and score arrays are treated as immutable (.hashCode is memoized) - * - the size of all cluster and score arrays is the same - * - all cluster scores are > 0 - * - cluster ids are unique - */ - // In descending score order - this is useful for truncation, where we care most about the highest scoring elements - private[simclusters_v2] val clusterIds: Array[ClusterId] - private[simclusters_v2] val scores: Array[Double] - // In ascending cluster order. This is useful for operations where we try to find the same cluster in another embedding, e.g. dot product - private[simclusters_v2] val sortedClusterIds: Array[ClusterId] - private[simclusters_v2] val sortedScores: Array[Double] - - /** - * Build and return a Set of all clusters in this embedding - */ - lazy val clusterIdSet: Set[ClusterId] = sortedClusterIds.toSet - - /** - * Build and return Seq representation of this embedding - */ - lazy val embedding: Seq[(ClusterId, Double)] = - sortedClusterIds.zip(sortedScores).sortBy(-_._2).toSeq - - /** - * Build and return a Map representation of this embedding - */ - lazy val map: Map[ClusterId, Double] = sortedClusterIds.zip(sortedScores).toMap - - lazy val l1norm: Double = CosineSimilarityUtil.l1NormArray(sortedScores) - - lazy val l2norm: Double = CosineSimilarityUtil.normArray(sortedScores) - - lazy val logNorm: Double = CosineSimilarityUtil.logNormArray(sortedScores) - - lazy val expScaledNorm: Double = - CosineSimilarityUtil.expScaledNormArray(sortedScores, DefaultExponent) - - /** - * The L2 Normalized Embedding. Optimize for Cosine Similarity Calculation. - */ - lazy val normalizedSortedScores: Array[Double] = - CosineSimilarityUtil.applyNormArray(sortedScores, l2norm) - - lazy val logNormalizedSortedScores: Array[Double] = - CosineSimilarityUtil.applyNormArray(sortedScores, logNorm) - - lazy val expScaledNormalizedSortedScores: Array[Double] = - CosineSimilarityUtil.applyNormArray(sortedScores, expScaledNorm) - - /** - * The Standard Deviation of an Embedding. - */ - lazy val std: Double = { - if (scores.isEmpty) { - 0.0 - } else { - val sum = scores.sum - val mean = sum / scores.length - var variance: Double = 0.0 - for (i <- scores.indices) { - val v = scores(i) - mean - variance += (v * v) - } - math.sqrt(variance / scores.length) - } - } - - /** - * Return the score of a given clusterId. - */ - def get(clusterId: ClusterId): Option[Double] = { - var i = 0 - while (i < sortedClusterIds.length) { - val thisId = sortedClusterIds(i) - if (clusterId == thisId) return Some(sortedScores(i)) - if (thisId > clusterId) return None - i += 1 - } - None - } - - /** - * Return the score of a given clusterId. If not exist, return default. 
- */ - def getOrElse(clusterId: ClusterId, default: Double = 0.0): Double = { - require(default >= 0.0) - var i = 0 - while (i < sortedClusterIds.length) { - val thisId = sortedClusterIds(i) - if (clusterId == thisId) return sortedScores(i) - if (thisId > clusterId) return default - i += 1 - } - default - } - - /** - * Return the cluster ids - */ - def getClusterIds(): Array[ClusterId] = clusterIds - - /** - * Return the cluster ids with the highest scores - */ - def topClusterIds(size: Int): Seq[ClusterId] = clusterIds.take(size) - - /** - * Return true if this embedding contains a given clusterId - */ - def contains(clusterId: ClusterId): Boolean = clusterIdSet.contains(clusterId) - - def sum(another: SimClustersEmbedding): SimClustersEmbedding = { - if (another.isEmpty) this - else if (this.isEmpty) another - else { - var i1 = 0 - var i2 = 0 - val l = scala.collection.mutable.ArrayBuffer.empty[(Int, Double)] - while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) { - if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) { - l += Tuple2(sortedClusterIds(i1), sortedScores(i1) + another.sortedScores(i2)) - i1 += 1 - i2 += 1 - } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) { - l += Tuple2(another.sortedClusterIds(i2), another.sortedScores(i2)) - // another cluster is lower. Increment it to see if the next one matches this's - i2 += 1 - } else { - l += Tuple2(sortedClusterIds(i1), sortedScores(i1)) - // this cluster is lower. Increment it to see if the next one matches anothers's - i1 += 1 - } - } - if (i1 == sortedClusterIds.length && i2 != another.sortedClusterIds.length) - // this was shorter. Prepend remaining elements from another - l ++= another.sortedClusterIds.drop(i2).zip(another.sortedScores.drop(i2)) - else if (i1 != sortedClusterIds.length && i2 == another.sortedClusterIds.length) - // another was shorter. 
Prepend remaining elements from this - l ++= sortedClusterIds.drop(i1).zip(sortedScores.drop(i1)) - SimClustersEmbedding(l) - } - } - - def scalarMultiply(multiplier: Double): SimClustersEmbedding = { - require(multiplier > 0.0, "SimClustersEmbedding.scalarMultiply requires multiplier > 0.0") - DefaultSimClustersEmbedding( - clusterIds, - scores.map(_ * multiplier), - sortedClusterIds, - sortedScores.map(_ * multiplier) - ) - } - - def scalarDivide(divisor: Double): SimClustersEmbedding = { - require(divisor > 0.0, "SimClustersEmbedding.scalarDivide requires divisor > 0.0") - DefaultSimClustersEmbedding( - clusterIds, - scores.map(_ / divisor), - sortedClusterIds, - sortedScores.map(_ / divisor) - ) - } - - def dotProduct(another: SimClustersEmbedding): Double = { - CosineSimilarityUtil.dotProductForSortedClusterAndScores( - sortedClusterIds, - sortedScores, - another.sortedClusterIds, - another.sortedScores) - } - - def cosineSimilarity(another: SimClustersEmbedding): Double = { - CosineSimilarityUtil.dotProductForSortedClusterAndScores( - sortedClusterIds, - normalizedSortedScores, - another.sortedClusterIds, - another.normalizedSortedScores) - } - - def logNormCosineSimilarity(another: SimClustersEmbedding): Double = { - CosineSimilarityUtil.dotProductForSortedClusterAndScores( - sortedClusterIds, - logNormalizedSortedScores, - another.sortedClusterIds, - another.logNormalizedSortedScores) - } - - def expScaledCosineSimilarity(another: SimClustersEmbedding): Double = { - CosineSimilarityUtil.dotProductForSortedClusterAndScores( - sortedClusterIds, - expScaledNormalizedSortedScores, - another.sortedClusterIds, - another.expScaledNormalizedSortedScores) - } - - /** - * Return true if this is an empty embedding - */ - def isEmpty: Boolean = sortedClusterIds.isEmpty - - /** - * Return the Jaccard Similarity Score between two embeddings. - * Note: this implementation should be optimized if we start to use it in production - */ - def jaccardSimilarity(another: SimClustersEmbedding): Double = { - if (this.isEmpty || another.isEmpty) { - 0.0 - } else { - val intersect = clusterIdSet.intersect(another.clusterIdSet).size - val union = clusterIdSet.union(another.clusterIdSet).size - intersect.toDouble / union - } - } - - /** - * Return the Fuzzy Jaccard Similarity Score between two embeddings. - * Treat each Simclusters embedding as fuzzy set, calculate the fuzzy set similarity - * metrics of two embeddings - * - * Paper 2.2.1: https://openreview.net/pdf?id=SkxXg2C5FX - */ - def fuzzyJaccardSimilarity(another: SimClustersEmbedding): Double = { - if (this.isEmpty || another.isEmpty) { - 0.0 - } else { - val v1C = sortedClusterIds - val v1S = sortedScores - val v2C = another.sortedClusterIds - val v2S = another.sortedScores - - require(v1C.length == v1S.length) - require(v2C.length == v2S.length) - - var i1 = 0 - var i2 = 0 - var numerator = 0.0 - var denominator = 0.0 - - while (i1 < v1C.length && i2 < v2C.length) { - if (v1C(i1) == v2C(i2)) { - numerator += min(v1S(i1), v2S(i2)) - denominator += max(v1S(i1), v2S(i2)) - i1 += 1 - i2 += 1 - } else if (v1C(i1) > v2C(i2)) { - denominator += v2S(i2) - i2 += 1 - } else { - denominator += v1S(i1) - i1 += 1 - } - } - - while (i1 < v1C.length) { - denominator += v1S(i1) - i1 += 1 - } - while (i2 < v2C.length) { - denominator += v2S(i2) - i2 += 1 - } - - numerator / denominator - } - } - - /** - * Return the Euclidean Distance Score between two embeddings. 
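A small illustration of the two Jaccard variants above (values are illustrative): plain Jaccard only looks at which clusters overlap, while the fuzzy variant weights the overlap by scores, taking the sum of element-wise minimums over the sum of element-wise maximums.

```scala
val a = SimClustersEmbedding(1 -> 0.4, 2 -> 0.6)
val b = SimClustersEmbedding(1 -> 0.2, 3 -> 0.5)

a.jaccardSimilarity(b)      // 1 shared cluster of 3 distinct clusters = 1/3 ≈ 0.333
// numerator   = min(0.4, 0.2)             = 0.2
// denominator = max(0.4, 0.2) + 0.6 + 0.5 = 1.5
a.fuzzyJaccardSimilarity(b) // 0.2 / 1.5 ≈ 0.133
```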
- * Note: this implementation should be optimized if we start to use it in production - */ - def euclideanDistance(another: SimClustersEmbedding): Double = { - val unionClusters = clusterIdSet.union(another.clusterIdSet) - val variance = unionClusters.foldLeft(0.0) { - case (sum, clusterId) => - val distance = math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId)) - sum + distance * distance - } - math.sqrt(variance) - } - - /** - * Return the Manhattan Distance Score between two embeddings. - * Note: this implementation should be optimized if we start to use it in production - */ - def manhattanDistance(another: SimClustersEmbedding): Double = { - val unionClusters = clusterIdSet.union(another.clusterIdSet) - unionClusters.foldLeft(0.0) { - case (sum, clusterId) => - sum + math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId)) - } - } - - /** - * Return the number of overlapping clusters between two embeddings. - */ - def overlappingClusters(another: SimClustersEmbedding): Int = { - var i1 = 0 - var i2 = 0 - var count = 0 - - while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) { - if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) { - count += 1 - i1 += 1 - i2 += 1 - } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) { - // v2 cluster is lower. Increment it to see if the next one matches v1's - i2 += 1 - } else { - // v1 cluster is lower. Increment it to see if the next one matches v2's - i1 += 1 - } - } - count - } - - /** - * Return the largest product cluster scores - */ - def maxElementwiseProduct(another: SimClustersEmbedding): Double = { - var i1 = 0 - var i2 = 0 - var maxProduct: Double = 0.0 - - while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) { - if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) { - val product = sortedScores(i1) * another.sortedScores(i2) - if (product > maxProduct) maxProduct = product - i1 += 1 - i2 += 1 - } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) { - // v2 cluster is lower. Increment it to see if the next one matches v1's - i2 += 1 - } else { - // v1 cluster is lower. Increment it to see if the next one matches v2's - i1 += 1 - } - } - maxProduct - } - - /** - * Return a new SimClustersEmbedding with Max Embedding Size. - * - * Prefer to truncate on embedding construction where possible. Doing so is cheaper. - */ - def truncate(size: Int): SimClustersEmbedding = { - if (clusterIds.length <= size) { - this - } else { - val truncatedClusterIds = clusterIds.take(size) - val truncatedScores = scores.take(size) - val (sortedClusterIds, sortedScores) = - truncatedClusterIds.zip(truncatedScores).sortBy(_._1).unzip - - DefaultSimClustersEmbedding( - truncatedClusterIds, - truncatedScores, - sortedClusterIds, - sortedScores) - } - } - - def toNormalized: SimClustersEmbedding = { - // Additional safety check. Only EmptyEmbedding's l2norm is 0.0. - if (l2norm == 0.0) { - EmptyEmbedding - } else { - this.scalarDivide(l2norm) - } - } - - implicit def toThrift: ThriftSimClustersEmbedding = { - ThriftSimClustersEmbedding( - embedding.map { - case (clusterId, score) => - SimClusterWithScore(clusterId, score) - } - ) - } - - def canEqual(a: Any): Boolean = a.isInstanceOf[SimClustersEmbedding] - - /* We define equality as having the same clusters and scores. 
- * This implementation is arguably incorrect in this case: - * (1 -> 1.0, 2 -> 0.0) == (1 -> 1.0) // equals returns false - * However, compliant implementations of SimClustersEmbedding should not include zero-weight - * clusters, so this implementation should work correctly. - */ - override def equals(that: Any): Boolean = - that match { - case that: SimClustersEmbedding => - that.canEqual(this) && - this.sortedClusterIds.sameElements(that.sortedClusterIds) && - this.sortedScores.sameElements(that.sortedScores) - case _ => false - } - - /** - * hashcode implementation based on the contents of the embedding. As a lazy val, this relies on - * the embedding contents being immutable. - */ - override lazy val hashCode: Int = { - /* Arrays uses object id as hashCode, so different arrays with the same contents hash - * differently. To provide a stable hash code, we take the same approach as how a - * `case class(clusters: Seq[Int], scores: Seq[Double])` would be hashed. See - * ScalaRunTime._hashCode and MurmurHash3.productHash - * https://github.com/scala/scala/blob/2.12.x/src/library/scala/runtime/ScalaRunTime.scala#L167 - * https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/hashing/MurmurHash3.scala#L64 - * - * Note that the hashcode is arguably incorrect in this case: - * (1 -> 1.0, 2 -> 0.0).hashcode == (1 -> 1.0).hashcode // returns false - * However, compliant implementations of SimClustersEmbedding should not include zero-weight - * clusters, so this implementation should work correctly. - */ - productHash((arrayHash(sortedClusterIds), arrayHash(sortedScores))) - } -} - -object SimClustersEmbedding { - val EmptyEmbedding: SimClustersEmbedding = - DefaultSimClustersEmbedding(Array.empty, Array.empty, Array.empty, Array.empty) - - val DefaultExponent: Double = 0.3 - - // Descending by score then ascending by ClusterId - implicit val order: Ordering[(ClusterId, Double)] = - (a: (ClusterId, Double), b: (ClusterId, Double)) => { - b._2 compare a._2 match { - case 0 => a._1 compare b._1 - case c => c - } - } - - /** - * Constructors - * - * These constructors: - * - do not make assumptions about the ordering of the cluster/scores. 
- * - do assume that cluster ids are unique - * - ignore (drop) any cluster whose score is <= 0 - */ - def apply(embedding: (ClusterId, Double)*): SimClustersEmbedding = - buildDefaultSimClustersEmbedding(embedding) - - def apply(embedding: Iterable[(ClusterId, Double)]): SimClustersEmbedding = - buildDefaultSimClustersEmbedding(embedding) - - def apply(embedding: Iterable[(ClusterId, Double)], size: Int): SimClustersEmbedding = - buildDefaultSimClustersEmbedding(embedding, truncate = Some(size)) - - implicit def apply(thriftEmbedding: ThriftSimClustersEmbedding): SimClustersEmbedding = - buildDefaultSimClustersEmbedding(thriftEmbedding.embedding.map(_.toTuple)) - - def apply(thriftEmbedding: ThriftSimClustersEmbedding, truncate: Int): SimClustersEmbedding = - buildDefaultSimClustersEmbedding( - thriftEmbedding.embedding.map(_.toTuple), - truncate = Some(truncate)) - - private def buildDefaultSimClustersEmbedding( - embedding: Iterable[(ClusterId, Double)], - truncate: Option[Int] = None - ): SimClustersEmbedding = { - val truncatedIdAndScores = { - val idsAndScores = embedding.filter(_._2 > 0.0).toArray.sorted(order) - truncate match { - case Some(t) => idsAndScores.take(t) - case _ => idsAndScores - } - } - - if (truncatedIdAndScores.isEmpty) { - EmptyEmbedding - } else { - val (clusterIds, scores) = truncatedIdAndScores.unzip - val (sortedClusterIds, sortedScores) = truncatedIdAndScores.sortBy(_._1).unzip - DefaultSimClustersEmbedding(clusterIds, scores, sortedClusterIds, sortedScores) - } - } - - /** ***** Aggregation Methods ******/ - /** - * A high performance version of Sum a list of SimClustersEmbeddings. - * Suggest using in Online Services to avoid the unnecessary GC. - * For offline or streaming. Please check [[SimClustersEmbeddingMonoid]] - */ - def sum(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = { - if (simClustersEmbeddings.isEmpty) { - EmptyEmbedding - } else { - val sum = simClustersEmbeddings.foldLeft(mutable.Map[ClusterId, Double]()) { - (sum, embedding) => - for (i <- embedding.sortedClusterIds.indices) { - val clusterId = embedding.sortedClusterIds(i) - sum.put(clusterId, embedding.sortedScores(i) + sum.getOrElse(clusterId, 0.0)) - } - sum - } - SimClustersEmbedding(sum) - } - } - - /** - * Support a fixed size SimClustersEmbedding Sum - */ - def sum( - simClustersEmbeddings: Iterable[SimClustersEmbedding], - maxSize: Int - ): SimClustersEmbedding = { - sum(simClustersEmbeddings).truncate(maxSize) - } - - /** - * A high performance version of Mean a list of SimClustersEmbeddings. - * Suggest using in Online Services to avoid the unnecessary GC. 
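To illustrate the constructor behaviour described above (dropping non-positive scores, descending-score ordering for clusterIds, and optional truncation), a short sketch with illustrative values, assuming ClusterId is the usual Int alias:

```scala
val e = SimClustersEmbedding(Seq(5 -> 0.2, 3 -> 0.9, 8 -> 0.0))
e.getClusterIds()      // Array(3, 5): cluster 8 is dropped because its score is not > 0
e.get(5)               // Some(0.2)
e.getOrElse(8)         // 0.0 (the default)
e.topClusterIds(1)     // Seq(3): highest-scoring cluster first

// Size-bounded construction keeps the highest-scoring clusters.
val truncated = SimClustersEmbedding(Seq(5 -> 0.2, 3 -> 0.9, 7 -> 0.5), size = 2)
truncated.clusterIdSet // Set(3, 7)
```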
- */ - def mean(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = { - if (simClustersEmbeddings.isEmpty) { - EmptyEmbedding - } else { - sum(simClustersEmbeddings).scalarDivide(simClustersEmbeddings.size) - } - } - - /** - * Support a fixed size SimClustersEmbedding Mean - */ - def mean( - simClustersEmbeddings: Iterable[SimClustersEmbedding], - maxSize: Int - ): SimClustersEmbedding = { - mean(simClustersEmbeddings).truncate(maxSize) - } -} - -case class DefaultSimClustersEmbedding( - override val clusterIds: Array[ClusterId], - override val scores: Array[Double], - override val sortedClusterIds: Array[ClusterId], - override val sortedScores: Array[Double]) - extends SimClustersEmbedding { - - override def toString: String = - s"DefaultSimClustersEmbedding(${clusterIds.zip(scores).mkString(",")})" -} - -object DefaultSimClustersEmbedding { - // To support existing code which builds embeddings from a Seq - def apply(embedding: Seq[(ClusterId, Double)]): SimClustersEmbedding = SimClustersEmbedding( - embedding) -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingId.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingId.scala deleted file mode 100644 index 0a2fc592f..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingId.scala +++ /dev/null @@ -1,209 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.LocaleEntityId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.TopicId -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersEmbeddingId => ThriftSimClustersEmbeddingId -} -import com.twitter.simclusters_v2.thriftscala.EmbeddingType._ -import com.twitter.simclusters_v2.thriftscala.InternalId.EntityId -import com.twitter.simclusters_v2.thriftscala.InternalId.TweetId -import com.twitter.simclusters_v2.thriftscala.InternalId.UserId -import com.twitter.simclusters_v2.thriftscala.{EmbeddingType => SimClustersEmbeddingType} - -object SimClustersEmbeddingId { - - val DefaultModelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - // Embeddings which is available in Content-Recommender - val TweetEmbeddingTypes: Set[EmbeddingType] = - Set( - FavBasedTweet, - FollowBasedTweet, - LogFavBasedTweet, - LogFavLongestL2EmbeddingTweet - ) - val DefaultTweetEmbeddingType: EmbeddingType = LogFavLongestL2EmbeddingTweet - - val UserInterestedInEmbeddingTypes: Set[EmbeddingType] = - Set( - FavBasedUserInterestedIn, - FollowBasedUserInterestedIn, - LogFavBasedUserInterestedIn, - RecentFollowBasedUserInterestedIn, - FilteredUserInterestedIn, - FavBasedUserInterestedInFromPE, - FollowBasedUserInterestedInFromPE, - LogFavBasedUserInterestedInFromPE, - FilteredUserInterestedInFromPE, - LogFavBasedUserInterestedInFromAPE, - FollowBasedUserInterestedInFromAPE, - UnfilteredUserInterestedIn - ) - val DefaultUserInterestInEmbeddingType: EmbeddingType = FavBasedUserInterestedIn - - val ProducerEmbeddingTypes: Set[EmbeddingType] = - Set( - FavBasedProducer, - FollowBasedProducer, - AggregatableFavBasedProducer, - AggregatableLogFavBasedProducer, - RelaxedAggregatableLogFavBasedProducer, - KnownFor - ) - val DefaultProducerEmbeddingType: EmbeddingType = FavBasedProducer - - val LocaleEntityEmbeddingTypes: Set[EmbeddingType] = - Set( - FavTfgTopic, - LogFavTfgTopic - ) - val 
DefaultLocaleEntityEmbeddingType: EmbeddingType = FavTfgTopic - - val TopicEmbeddingTypes: Set[EmbeddingType] = - Set( - LogFavBasedKgoApeTopic - ) - val DefaultTopicEmbeddingType: EmbeddingType = LogFavBasedKgoApeTopic - - val AllEmbeddingTypes: Set[EmbeddingType] = - TweetEmbeddingTypes ++ - UserInterestedInEmbeddingTypes ++ - ProducerEmbeddingTypes ++ - LocaleEntityEmbeddingTypes ++ - TopicEmbeddingTypes - - def buildTweetId( - tweetId: TweetId, - embeddingType: EmbeddingType = DefaultTweetEmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - assert(TweetEmbeddingTypes.contains(embeddingType)) - ThriftSimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.TweetId(tweetId) - ) - } - - def buildUserInterestedInId( - userId: UserId, - embeddingType: EmbeddingType = DefaultUserInterestInEmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - assert(UserInterestedInEmbeddingTypes.contains(embeddingType)) - ThriftSimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.UserId(userId) - ) - } - - def buildProducerId( - userId: UserId, - embeddingType: EmbeddingType = DefaultProducerEmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - assert(ProducerEmbeddingTypes.contains(embeddingType)) - ThriftSimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.UserId(userId) - ) - } - - def buildLocaleEntityId( - entityId: SemanticCoreEntityId, - language: String, - embeddingType: EmbeddingType = DefaultLocaleEntityEmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - ThriftSimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.LocaleEntityId( - LocaleEntityId(entityId, language) - ) - ) - } - - def buildTopicId( - topicId: TopicId, - language: Option[String] = None, - country: Option[String] = None, - embeddingType: EmbeddingType = DefaultTopicEmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - ThriftSimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.TopicId( - TopicId(topicId, language, country) - ) - ) - } - - // Extractor object for InternalIds that wrap Long - object LongInternalId { - def unapply(iid: InternalId): Option[Long] = iid match { - case InternalId.TweetId(id) => Some(id) - case InternalId.UserId(id) => Some(id) - case InternalId.EntityId(id) => Some(id) - case _ => None - } - } - - // Extractor object for SimClusterEmbeddingIds with InternalIds that wrap Long - object LongSimClustersEmbeddingId { - def unapply(id: ThriftSimClustersEmbeddingId): Option[Long] = - LongInternalId.unapply(id.internalId) - } - - // Only for debuggers. 
- def buildEmbeddingId( - entityId: String, - embeddingType: EmbeddingType, - modelVersion: ModelVersion = DefaultModelVersion - ): ThriftSimClustersEmbeddingId = { - if (TweetEmbeddingTypes.contains(embeddingType)) { - buildTweetId(entityId.toLong, embeddingType, modelVersion) - } else if (UserInterestedInEmbeddingTypes.contains(embeddingType)) { - buildUserInterestedInId(entityId.toLong, embeddingType, modelVersion) - } else if (ProducerEmbeddingTypes.contains(embeddingType)) { - buildProducerId(entityId.toLong, embeddingType, modelVersion) - } else if (LocaleEntityEmbeddingTypes.contains(embeddingType)) { - buildLocaleEntityId(entityId.toLong, "en", embeddingType, modelVersion) - } else if (TopicEmbeddingTypes.contains(embeddingType)) { - buildTopicId( - entityId.toLong, - Some("en"), - embeddingType = embeddingType, - modelVersion = modelVersion) - } else { - throw new IllegalArgumentException(s"Invalid embedding type: $embeddingType") - } - } - - implicit val internalIdOrdering: Ordering[InternalId] = - Ordering.by(internalId => internalId.hashCode()) - - implicit val simClustersEmbeddingIdOrdering: Ordering[ThriftSimClustersEmbeddingId] = - Ordering.by(embeddingId => - (embeddingId.embeddingType.value, embeddingId.modelVersion.value, embeddingId.internalId)) - - // Use Enum for feature switch - object TopicEnum extends Enumeration { - protected case class EmbeddingType(embeddingType: SimClustersEmbeddingType) extends super.Val - import scala.language.implicitConversions - implicit def valueToEmbeddingType(value: Value): EmbeddingType = - value.asInstanceOf[EmbeddingType] - - val FavTfgTopic: Value = EmbeddingType(SimClustersEmbeddingType.FavTfgTopic) - val LogFavBasedKgoApeTopic: Value = EmbeddingType( - SimClustersEmbeddingType.LogFavBasedKgoApeTopic) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingIdCacheKeyBuilder.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingIdCacheKeyBuilder.scala deleted file mode 100644 index 21a54e96c..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingIdCacheKeyBuilder.scala +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId - -/** - * A common library to construct Cache Key for SimClustersEmbeddingId. 
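A usage sketch for the id builders above (the ids are illustrative): when no embedding type or model version is given, the defaults fall back to LogFavLongestL2EmbeddingTweet, FavBasedUserInterestedIn, FavBasedProducer and Model20m145k2020.

```scala
// Tweet embedding id with the default embedding type and model version.
val tweetEmbeddingId = SimClustersEmbeddingId.buildTweetId(tweetId = 1234567890L)

// Producer embedding id with an explicit, non-default embedding type.
val producerEmbeddingId = SimClustersEmbeddingId.buildProducerId(
  userId = 12L,
  embeddingType = EmbeddingType.AggregatableLogFavBasedProducer)
```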
- */ -case class SimClustersEmbeddingIdCacheKeyBuilder( - hash: Array[Byte] => Long, - prefix: String = "") { - - // Example: "CR:SCE:1:2:1234567890ABCDEF" - def apply(embeddingId: SimClustersEmbeddingId): String = { - f"$prefix:SCE:${embeddingId.embeddingType.getValue()}%X:" + - f"${embeddingId.modelVersion.getValue()}%X" + - f":${hash(embeddingId.internalId.toString.getBytes)}%X" - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingMonoid.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingMonoid.scala deleted file mode 100644 index 1b17c9705..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbeddingMonoid.scala +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.algebird.Monoid - -case class SimClustersEmbeddingMonoid() extends Monoid[SimClustersEmbedding] { - - override val zero: SimClustersEmbedding = SimClustersEmbedding.EmptyEmbedding - - override def plus(x: SimClustersEmbedding, y: SimClustersEmbedding): SimClustersEmbedding = { - x.sum(y) - } -} - -object SimClustersEmbeddingMonoid { - - val monoid: Monoid[SimClustersEmbedding] = SimClustersEmbeddingMonoid() - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbedding.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbedding.scala deleted file mode 100644 index c9b86be4f..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbedding.scala +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.common.SimClustersMultiEmbeddingId._ -import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding.{Ids, Values} -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersMultiEmbedding, - SimClustersEmbeddingId, - SimClustersMultiEmbeddingId -} - -/** - * Helper methods for SimClustersMultiEmbedding - */ -object SimClustersMultiEmbedding { - - // Convert a multiEmbedding to a list of (embeddingId, score) - def toSimClustersEmbeddingIdWithScores( - simClustersMultiEmbeddingId: SimClustersMultiEmbeddingId, - simClustersMultiEmbedding: SimClustersMultiEmbedding - ): Seq[(SimClustersEmbeddingId, Double)] = { - simClustersMultiEmbedding match { - case Values(values) => - values.embeddings.zipWithIndex.map { - case (embeddingWithScore, i) => - (toEmbeddingId(simClustersMultiEmbeddingId, i), embeddingWithScore.score) - } - case Ids(ids) => - ids.ids.map(_.toTuple) - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbeddingId.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbeddingId.scala deleted file mode 100644 index 17d0eb0d6..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/SimClustersMultiEmbeddingId.scala +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.simclusters_v2.common - -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - InternalId, - MultiEmbeddingType, - TopicId, - TopicSubId, - SimClustersEmbeddingId => ThriftEmbeddingId, - SimClustersMultiEmbeddingId => ThriftMultiEmbeddingId -} - -/** - * Helper methods for SimClustersMultiEmbeddingId - */ -object SimClustersMultiEmbeddingId { - - private val MultiEmbeddingTypeToEmbeddingType: Map[MultiEmbeddingType, EmbeddingType] = - Map( - MultiEmbeddingType.LogFavApeBasedMuseTopic -> EmbeddingType.LogFavApeBasedMuseTopic, - MultiEmbeddingType.TwiceUserInterestedIn -> EmbeddingType.TwiceUserInterestedIn, - ) - - private val 
EmbeddingTypeToMultiEmbeddingType: Map[EmbeddingType, MultiEmbeddingType] = - MultiEmbeddingTypeToEmbeddingType.map(_.swap) - - def toEmbeddingType(multiEmbeddingType: MultiEmbeddingType): EmbeddingType = { - MultiEmbeddingTypeToEmbeddingType.getOrElse( - multiEmbeddingType, - throw new IllegalArgumentException(s"Invalid type: $multiEmbeddingType")) - } - - def toMultiEmbeddingType(embeddingType: EmbeddingType): MultiEmbeddingType = { - EmbeddingTypeToMultiEmbeddingType.getOrElse( - embeddingType, - throw new IllegalArgumentException(s"Invalid type: $embeddingType") - ) - } - - /** - * Convert a SimClusters Multi-Embedding Id and SubId to SimClusters Embedding Id. - */ - def toEmbeddingId( - simClustersMultiEmbeddingId: ThriftMultiEmbeddingId, - subId: Int - ): ThriftEmbeddingId = { - val internalId = simClustersMultiEmbeddingId.internalId match { - case InternalId.TopicId(topicId) => - InternalId.TopicSubId( - TopicSubId(topicId.entityId, topicId.language, topicId.country, subId)) - case _ => - throw new IllegalArgumentException( - s"Invalid simClusters InternalId ${simClustersMultiEmbeddingId.internalId}") - } - ThriftEmbeddingId( - toEmbeddingType(simClustersMultiEmbeddingId.embeddingType), - simClustersMultiEmbeddingId.modelVersion, - internalId - ) - } - - /** - * Fetch a subId from a SimClusters EmbeddingId. - */ - def toSubId(simClustersEmbeddingId: ThriftEmbeddingId): Int = { - simClustersEmbeddingId.internalId match { - case InternalId.TopicSubId(topicSubId) => - topicSubId.subId - case _ => - throw new IllegalArgumentException( - s"Invalid SimClustersEmbeddingId InternalId type, $simClustersEmbeddingId") - } - } - - /** - * Convert a SimClustersEmbeddingId to SimClustersMultiEmbeddingId. - * Only support the Multi embedding based EmbeddingTypes. 
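To make the multi-embedding id round-trip above concrete, a hedged sketch (the topic id values are illustrative; ThriftMultiEmbeddingId is the thrift SimClustersMultiEmbeddingId as imported in this file, and the helper object is referenced by its common-package name):

```scala
val multiId = ThriftMultiEmbeddingId(
  MultiEmbeddingType.LogFavApeBasedMuseTopic,
  ModelVersion.Model20m145k2020,
  InternalId.TopicId(TopicId(123L, Some("en"), None)))

// Selecting sub-embedding 0 yields a regular embedding id keyed by a TopicSubId.
val embeddingId = SimClustersMultiEmbeddingId.toEmbeddingId(multiId, subId = 0)
// embeddingId.embeddingType == EmbeddingType.LogFavApeBasedMuseTopic
// embeddingId.internalId    == InternalId.TopicSubId(TopicSubId(123L, Some("en"), None, 0))

// toMultiEmbeddingId(embeddingId) restores the TopicId-based multi-embedding id.
```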
- */ - def toMultiEmbeddingId( - simClustersEmbeddingId: ThriftEmbeddingId - ): ThriftMultiEmbeddingId = { - simClustersEmbeddingId.internalId match { - case InternalId.TopicSubId(topicSubId) => - ThriftMultiEmbeddingId( - toMultiEmbeddingType(simClustersEmbeddingId.embeddingType), - simClustersEmbeddingId.modelVersion, - InternalId.TopicId(TopicId(topicSubId.entityId, topicSubId.language, topicSubId.country)) - ) - - case _ => - throw new IllegalArgumentException( - s"Invalid SimClustersEmbeddingId InternalId type, $simClustersEmbeddingId") - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/BUILD b/src/scala/com/twitter/simclusters_v2/common/clustering/BUILD deleted file mode 100644 index f394e109a..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -scala_library( - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "eventdetection/common/src/main/java/com/twitter/eventdetection/common/louvain", - "eventdetection/common/src/main/java/com/twitter/eventdetection/common/model", - "src/java/com/twitter/sbf/graph", - "src/scala/com/twitter/simclusters_v2/scalding/common", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/ClusterRepresentativeSelectionMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/ClusterRepresentativeSelectionMethod.scala deleted file mode 100644 index 42b585abc..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/ClusterRepresentativeSelectionMethod.scala +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights - -/** - * Select a cluster member as cluster representative. - */ -trait ClusterRepresentativeSelectionMethod[T] { - - /** - * The main external-facing method. Sub-classes should implement this method. - * - * @param cluster A set of NeighborWithWeights. - * @param embeddings A map of producer ID -> embedding. - * - * @return UserId of the member chosen as representative. - */ - def selectClusterRepresentative( - cluster: Set[NeighborWithWeights], - embeddings: Map[UserId, T] - ): UserId - -} - -object ClusterRepresentativeSelectionStatistics { - - // Statistics, to be imported where recorded. - val StatClusterRepresentativeSelectionTime = "cluster_representative_selection_total_time_ms" -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/ClusteringMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/ClusteringMethod.scala deleted file mode 100644 index e379e7051..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/ClusteringMethod.scala +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -/** - * Partitions a set of entities into clusters. - * NOTE: The selection/construction of the cluster representatives (e.g. medoid, random, average) is implemented in ClusterRepresentativeSelectionMethod.scala - */ -trait ClusteringMethod { - - /** - * The main external-facing method. Sub-classes should implement this method. - * - * @param embeddings map of entity IDs and corresponding embeddings - * @param similarityFn function that outputs similarity (>=0, the larger, more similar), given two embeddings - * @tparam T embedding type. e.g. 
SimClustersEmbedding - * - * @return A set of sets of entity IDs, each set representing a distinct cluster. - */ - def cluster[T]( - embeddings: Map[Long, T], - similarityFn: (T, T) => Double, - recordStatCallback: (String, Long) => Unit = (_, _) => () - ): Set[Set[Long]] - -} - -object ClusteringStatistics { - - // Statistics, to be imported where recorded. - val StatSimilarityGraphTotalBuildTime = "similarity_graph_total_build_time_ms" - val StatClusteringAlgorithmRunTime = "clustering_algorithm_total_run_time_ms" - val StatMedoidSelectionTime = "medoid_selection_total_time_ms" - val StatComputedSimilarityBeforeFilter = "computed_similarity_before_filter" - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/ConnectedComponentsClusteringMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/ConnectedComponentsClusteringMethod.scala deleted file mode 100644 index 07f785f24..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/ConnectedComponentsClusteringMethod.scala +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.sbf.graph.ConnectedComponents -import com.twitter.sbf.graph.Graph -import com.twitter.util.Stopwatch -import it.unimi.dsi.fastutil.ints.IntSet -import scala.collection.SortedMap -import scala.jdk.CollectionConverters._ - -/** - * Aggregate entities into clusters such that a cluster contains all embeddings with a similarity - * above a configurable threshold to any other embedding. - * - * @param similarityThreshold: When building the edges between entities, edges with weight - * less than or equal to this threshold will be filtered out. - */ -class ConnectedComponentsClusteringMethod( - similarityThreshold: Double) - extends ClusteringMethod { - - import ClusteringStatistics._ - - def cluster[T]( - embeddings: Map[Long, T], - similarityFn: (T, T) => Double, - recordStatCallback: (String, Long) => Unit = (_, _) => () - ): Set[Set[Long]] = { - - val timeSinceGraphBuildStart = Stopwatch.start() - // com.twitter.sbf.graph.Graph expects neighbors to be sorted in ascending order. 
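(As an aside, the comment above refers to the dense, index-based adjacency layout that `com.twitter.sbf.graph.Graph` consumes: every entity is remapped to a contiguous integer index, each node's neighbour array is sorted ascending, and the edge count passed to the constructor counts each undirected edge once. Below is a minimal sketch of that layout using only the `Graph` and `ConnectedComponents` calls that appear in this file; the three-node toy graph is hypothetical.)

```scala
import com.twitter.sbf.graph.{ConnectedComponents, Graph}
import it.unimi.dsi.fastutil.ints.IntSet
import scala.jdk.CollectionConverters._

// Three nodes: 0 and 1 are connected, 2 is isolated.
// Each inner array lists a node's neighbours in ascending index order.
val neighbours: Array[Array[Int]] = Array(Array(1), Array(0), Array.empty[Int])
val nEdges = neighbours.map(_.length).sum / 2 // undirected edges counted once

val graph = new Graph(3, nEdges, neighbours)
val components: Set[Set[Int]] =
  ConnectedComponents
    .connectedComponents(graph).asScala.toSet
    .map { c: IntSet => c.asScala.map(_.toInt).toSet }
// components == Set(Set(0, 1), Set(2))
// The surrounding code then maps these integer indices back to entity ids via sourcesById.
```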
- val sourcesById = SortedMap(embeddings.zipWithIndex.map { - case (source, idx) => idx -> source - }.toSeq: _*) - - val neighbours = sourcesById.map { - case (srcIdx, (_, src)) => - sourcesById - .collect { - case (dstIdx, (_, dst)) if srcIdx != dstIdx => // avoid self-edges - val similarity = similarityFn(src, dst) - recordStatCallback( - StatComputedSimilarityBeforeFilter, - (similarity * 100).toLong // preserve up to two decimal points - ) - if (similarity > similarityThreshold) - Some(dstIdx) - else None - }.flatten.toArray - }.toArray - - recordStatCallback(StatSimilarityGraphTotalBuildTime, timeSinceGraphBuildStart().inMilliseconds) - - val timeSinceClusteringAlgRunStart = Stopwatch.start() - val nEdges = neighbours.map(_.length).sum / 2 // Graph expects count of undirected edges - val graph = new Graph(sourcesById.size, nEdges, neighbours) - - val clusters = ConnectedComponents - .connectedComponents(graph).asScala.toSet - .map { i: IntSet => i.asScala.map(sourcesById(_)._1).toSet } - - recordStatCallback( - StatClusteringAlgorithmRunTime, - timeSinceClusteringAlgRunStart().inMilliseconds) - - clusters - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/LargestDimensionClusteringMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/LargestDimensionClusteringMethod.scala deleted file mode 100644 index 826cc7e08..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/LargestDimensionClusteringMethod.scala +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -/** - * Groups entities by a single embedding dimension with the largest score. - */ -class LargestDimensionClusteringMethod extends ClusteringMethod { - - /** - * @param embeddings map of entity IDs and corresponding embeddings - * @param similarityFn function that outputs discrete value (0.0 or 1.0). - * 1.0 if the dimensions of the highest score (weight) from two given embeddings match. - * 0.0 otherwise. - * e.g. - * case 1: E1=[0.0, 0.1, 0.6, 0.2], E2=[0.1, 0.3, 0.8, 0.0]. similarityFn(E1, E2)=1.0 - * case 2: E1=[0.0, 0.1, 0.6, 0.2], E2=[0.1, 0.4, 0.2, 0.0]. similarityFn(E1, E2)=0.0 - * @tparam T embedding type. e.g. SimClustersEmbedding - * - * @return A set of sets of entity IDs, each set representing a distinct cluster. - */ - override def cluster[T]( - embeddings: Map[Long, T], - similarityFn: (T, T) => Double, - recordStatCallback: (String, Long) => Unit - ): Set[Set[Long]] = { - - // rely on clustering by connected component. - // similarityThreshold=0.1 because it's larger than 0.0 (similarityFn returns 0.0 if two embeddings - // don't share the largest dimension. 
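(The class comment above spells out the 0/1 similarity contract. To make it concrete, the sketch below runs the method end to end on hypothetical dense vectors; a production caller would instead pass `SimClustersEmbedding` values together with `SimilarityFunctions.simClustersMatchingLargestDimension`, defined further down in this package.)

```scala
import com.twitter.simclusters_v2.common.clustering.LargestDimensionClusteringMethod

// Hypothetical dense embeddings keyed by entity id; any T works as long as the
// similarity function understands it.
val embeddings: Map[Long, Array[Double]] = Map(
  1L -> Array(0.0, 0.1, 0.6, 0.2), // largest dimension: 2
  2L -> Array(0.1, 0.3, 0.8, 0.0), // largest dimension: 2
  3L -> Array(0.1, 0.4, 0.2, 0.0)  // largest dimension: 1
)

// 1.0 if the two embeddings peak on the same dimension, 0.0 otherwise,
// matching the contract described in the class comment above.
def matchesLargestDimension(e1: Array[Double], e2: Array[Double]): Double =
  if (e1.indexOf(e1.max) == e2.indexOf(e2.max)) 1.0 else 0.0

val clusters: Set[Set[Long]] =
  new LargestDimensionClusteringMethod()
    .cluster(embeddings, matchesLargestDimension, (_, _) => ())
// clusters == Set(Set(1L, 2L), Set(3L))
```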
- new ConnectedComponentsClusteringMethod(similarityThreshold = 0.1) - .cluster(embeddings, similarityFn, recordStatCallback) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/LouvainClusteringMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/LouvainClusteringMethod.scala deleted file mode 100644 index c3337119b..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/LouvainClusteringMethod.scala +++ /dev/null @@ -1,236 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.eventdetection.common.louvain.LouvainDriver -import com.twitter.eventdetection.common.louvain.NetworkFactory -import com.twitter.eventdetection.common.model.Entity -import com.twitter.eventdetection.common.model.NetworkInput -import com.twitter.eventdetection.common.model.TextEntityValue -import com.twitter.util.Stopwatch -import scala.collection.JavaConverters._ -import scala.math.max - -/** - * Groups entities by the Louvain clustering method. - * @param similarityThreshold: When building the edges between entities, edges with weight - * less than or equal to this threshold will be filtered out. - * @param appliedResolutionFactor: If present, will be used to multiply the applied resolution - * parameter of the Louvain method by this factor. - * Note that the DEFAULT_MAX_RESOLUTION will not be applied. - */ -class LouvainClusteringMethod( - similarityThreshold: Double, - appliedResolutionFactor: Option[Double]) - extends ClusteringMethod { - - import ClusteringStatistics._ - - def cluster[T]( - embeddings: Map[Long, T], - similarityFn: (T, T) => Double, - recordStatCallback: (String, Long) => Unit = (_, _) => () - ): Set[Set[Long]] = { - - // 1. Build the graph on which to run Louvain: - // - Weigh edges by the similarity between the 2 embeddings, - // - Filter out edges with weight <= threshold. - val timeSinceGraphBuildStart = Stopwatch.start() - val edges: Seq[((Long, Long), Double)] = embeddings.toSeq - .combinations(2) - .map { pair: Seq[(Long, T)] => // pair of 2 - val (user1, embedding1) = pair.head - val (user2, embedding2) = pair(1) - val similarity = similarityFn(embedding1, embedding2) - - recordStatCallback( - StatComputedSimilarityBeforeFilter, - (similarity * 100).toLong // preserve up to two decimal places - ) - - ((user1, user2), similarity) - } - .filter(_._2 > similarityThreshold) - .toSeq - - recordStatCallback(StatSimilarityGraphTotalBuildTime, timeSinceGraphBuildStart().inMilliseconds) - - // check if some entities do not have any incoming / outgoing edge - // these are size-1 clusters (i.e. their own) - val individualClusters: Set[Long] = embeddings.keySet -- edges.flatMap { - case ((user1, user2), _) => Set(user1, user2) - }.toSet - - // 2. LouvainDriver uses "Entity" as input, so build 2 mappings - // - Long (entity id) -> Entity - // - Entity -> Long (entity id) - val embeddingIdToEntity: Map[Long, Entity] = embeddings.map { - case (id, _) => id -> Entity(TextEntityValue(id.toString, Some(id.toString)), None) - } - val entityToEmbeddingId: Map[Entity, Long] = embeddingIdToEntity.map { - case (id, e) => e -> id - } - - // 3. 
Create the list of NetworkInput on which to run LouvainDriver - val networkInputList = edges - .map { - case ((fromUserId: Long, toUserId: Long), weight: Double) => - new NetworkInput(embeddingIdToEntity(fromUserId), embeddingIdToEntity(toUserId), weight) - }.toList.asJava - - val timeSinceClusteringAlgRunStart = Stopwatch.start() - val networkDictionary = NetworkFactory.buildDictionary(networkInputList) - val network = NetworkFactory.buildNetwork(networkInputList, networkDictionary) - - if (networkInputList.size() == 0) { - // handle case if no edge at all (only one entity or all entities are too far apart) - embeddings.keySet.map(e => Set(e)) - } else { - // 4. Run clustering algorithm - val clusteredIds = appliedResolutionFactor match { - case Some(res) => - LouvainDriver.clusterAppliedResolutionFactor(network, networkDictionary, res) - case None => LouvainDriver.cluster(network, networkDictionary) - } - - recordStatCallback( - StatClusteringAlgorithmRunTime, - timeSinceClusteringAlgRunStart().inMilliseconds) - - // 5. Post-processing - val atLeast2MembersClusters: Set[Set[Long]] = clusteredIds.asScala - .groupBy(_._2) - .mapValues(_.map { case (e, _) => entityToEmbeddingId(e) }.toSet) - .values.toSet - - atLeast2MembersClusters ++ individualClusters.map { e => Set(e) } - - } - } - - def clusterWithSilhouette[T]( - embeddings: Map[Long, T], - similarityFn: (T, T) => Double, - similarityFnForSil: (T, T) => Double, - recordStatCallback: (String, Long) => Unit = (_, _) => () - ): (Set[Set[Long]], Set[Set[(Long, Double)]]) = { - - // 1. Build the graph on which to run Louvain: - // - Weigh edges by the similarity between the 2 embeddings, - // - Filter out edges with weight <= threshold. - val timeSinceGraphBuildStart = Stopwatch.start() - val edgesSimilarityMap = collection.mutable.Map[(Long, Long), Double]() - - val edges: Seq[((Long, Long), Double)] = embeddings.toSeq - .combinations(2) - .map { pair: Seq[(Long, T)] => // pair of 2 - val (user1, embedding1) = pair.head - val (user2, embedding2) = pair(1) - val similarity = similarityFn(embedding1, embedding2) - val similarityForSil = similarityFnForSil(embedding1, embedding2) - edgesSimilarityMap.put((user1, user2), similarityForSil) - edgesSimilarityMap.put((user2, user1), similarityForSil) - - recordStatCallback( - StatComputedSimilarityBeforeFilter, - (similarity * 100).toLong // preserve up to two decimal places - ) - - ((user1, user2), similarity) - } - .filter(_._2 > similarityThreshold) - .toSeq - - recordStatCallback(StatSimilarityGraphTotalBuildTime, timeSinceGraphBuildStart().inMilliseconds) - - // check if some entities do not have any incoming / outgoing edge - // these are size-1 clusters (i.e. their own) - val individualClusters: Set[Long] = embeddings.keySet -- edges.flatMap { - case ((user1, user2), _) => Set(user1, user2) - }.toSet - - // 2. LouvainDriver uses "Entity" as input, so build 2 mappings - // - Long (entity id) -> Entity - // - Entity -> Long (entity id) - val embeddingIdToEntity: Map[Long, Entity] = embeddings.map { - case (id, _) => id -> Entity(TextEntityValue(id.toString, Some(id.toString)), None) - } - val entityToEmbeddingId: Map[Entity, Long] = embeddingIdToEntity.map { - case (id, e) => e -> id - } - - // 3. 
Create the list of NetworkInput on which to run LouvainDriver - val networkInputList = edges - .map { - case ((fromUserId: Long, toUserId: Long), weight: Double) => - new NetworkInput(embeddingIdToEntity(fromUserId), embeddingIdToEntity(toUserId), weight) - }.toList.asJava - - val timeSinceClusteringAlgRunStart = Stopwatch.start() - val networkDictionary = NetworkFactory.buildDictionary(networkInputList) - val network = NetworkFactory.buildNetwork(networkInputList, networkDictionary) - - val clusters = if (networkInputList.size() == 0) { - // handle case if no edge at all (only one entity or all entities are too far apart) - embeddings.keySet.map(e => Set(e)) - } else { - // 4. Run clustering algorithm - val clusteredIds = appliedResolutionFactor match { - case Some(res) => - LouvainDriver.clusterAppliedResolutionFactor(network, networkDictionary, res) - case None => LouvainDriver.cluster(network, networkDictionary) - } - - recordStatCallback( - StatClusteringAlgorithmRunTime, - timeSinceClusteringAlgRunStart().inMilliseconds) - - // 5. Post-processing - val atLeast2MembersClusters: Set[Set[Long]] = clusteredIds.asScala - .groupBy(_._2) - .mapValues(_.map { case (e, _) => entityToEmbeddingId(e) }.toSet) - .values.toSet - - atLeast2MembersClusters ++ individualClusters.map { e => Set(e) } - - } - - // Calculate silhouette metrics - val contactIdWithSilhouette = clusters.map { - case cluster => - val otherClusters = clusters - cluster - - cluster.map { - case contactId => - if (otherClusters.isEmpty) { - (contactId, 0.0) - } else { - val otherSameClusterContacts = cluster - contactId - - if (otherSameClusterContacts.isEmpty) { - (contactId, 0.0) - } else { - // calculate similarity of given userId with all other users in the same cluster - val a_i = otherSameClusterContacts.map { - case sameClusterContact => - edgesSimilarityMap((contactId, sameClusterContact)) - }.sum / otherSameClusterContacts.size - - // calculate similarity of given userId to all other clusters, find the best nearest cluster - val b_i = otherClusters.map { - case otherCluster => - otherCluster.map { - case otherClusterContact => - edgesSimilarityMap((contactId, otherClusterContact)) - }.sum / otherCluster.size - }.max - - // silhouette (value) of one userId i - val s_i = (a_i - b_i) / max(a_i, b_i) - (contactId, s_i) - } - } - } - } - - (clusters, contactIdWithSilhouette) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/MaxFavScoreRepresentativeSelectionMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/MaxFavScoreRepresentativeSelectionMethod.scala deleted file mode 100644 index fec180d4f..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/MaxFavScoreRepresentativeSelectionMethod.scala +++ /dev/null @@ -1,21 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights - -class MaxFavScoreRepresentativeSelectionMethod[T] extends ClusterRepresentativeSelectionMethod[T] { - - /** - * Identify the member with largest favScoreHalfLife100Days and return it. - * - * @param cluster A set of NeighborWithWeights. - * @param embeddings A map of producer ID -> embedding. 
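(Stepping back to `clusterWithSilhouette` above: it shares the graph construction with `cluster`, and additionally returns a per-member silhouette following the standard definition adapted to similarities. Here `a_i` is the member's average similarity to the rest of its own cluster, `b_i` is its highest average similarity to any other cluster, and `s_i = (a_i - b_i) / max(a_i, b_i)`; members of singleton clusters get 0.0, as in the code above. A tiny illustration with hypothetical values:)

```scala
// Mirrors the s_i formula used in clusterWithSilhouette above.
def silhouette(aI: Double, bI: Double): Double = (aI - bI) / math.max(aI, bI)

silhouette(aI = 0.8, bI = 0.3) //  0.625 -> member sits firmly inside its own cluster
silhouette(aI = 0.3, bI = 0.8) // -0.625 -> member is closer to a neighbouring cluster
```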
- */ - def selectClusterRepresentative( - cluster: Set[NeighborWithWeights], - embeddings: Map[UserId, T], - ): UserId = { - val key = cluster.maxBy { x: NeighborWithWeights => x.favScoreHalfLife100Days.getOrElse(0.0) } - key.neighborId - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/MedoidRepresentativeSelectionMethod.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/MedoidRepresentativeSelectionMethod.scala deleted file mode 100644 index 1b466250f..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/MedoidRepresentativeSelectionMethod.scala +++ /dev/null @@ -1,28 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights - -class MedoidRepresentativeSelectionMethod[T]( - producerProducerSimilarityFn: (T, T) => Double) - extends ClusterRepresentativeSelectionMethod[T] { - - /** - * Identify the medoid of a cluster and return it. - * - * @param cluster A set of NeighborWithWeights. - * @param embeddings A map of producer ID -> embedding. - */ - def selectClusterRepresentative( - cluster: Set[NeighborWithWeights], - embeddings: Map[UserId, T], - ): UserId = { - val key = cluster.maxBy { - id1 => // maxBy because we use similarity, which gets larger as we get closer. - val v = embeddings(id1.neighborId) - cluster - .map(id2 => producerProducerSimilarityFn(v, embeddings(id2.neighborId))).sum - } - key.neighborId - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/clustering/SimilarityFunctions.scala b/src/scala/com/twitter/simclusters_v2/common/clustering/SimilarityFunctions.scala deleted file mode 100644 index 45e449850..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/clustering/SimilarityFunctions.scala +++ /dev/null @@ -1,32 +0,0 @@ -package com.twitter.simclusters_v2.common.clustering - -import com.twitter.simclusters_v2.common.SimClustersEmbedding - -/** - * SimilarityFunctions provide commonly used similarity functions that this clustering library needs. - */ -object SimilarityFunctions { - def simClustersCosineSimilarity: (SimClustersEmbedding, SimClustersEmbedding) => Double = - (e1, e2) => e1.cosineSimilarity(e2) - - def simClustersMatchingLargestDimension: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = (e1, e2) => { - val doesMatchLargestDimension: Boolean = e1 - .topClusterIds(1) - .exists { id1 => - e2.topClusterIds(1).contains(id1) - } - - if (doesMatchLargestDimension) 1.0 - else 0.0 - } - - def simClustersFuzzyJaccardSimilarity: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = (e1, e2) => { - e1.fuzzyJaccardSimilarity(e2) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/ml/BUILD b/src/scala/com/twitter/simclusters_v2/common/ml/BUILD deleted file mode 100644 index e71aa0c59..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/ml/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -# This package/target is separate from other simclusters common packages because the ml/api dep is -# large (350MB+). Having it as a separate target means that we can avoid bundling it with targets -# that do not need it. 
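(Once clusters are formed, a `ClusterRepresentativeSelectionMethod` picks one member per cluster: `MaxFavScoreRepresentativeSelectionMethod` takes the member with the largest `favScoreHalfLife100Days`, while `MedoidRepresentativeSelectionMethod` takes the member whose embedding is most similar to all the others, typically paired with `SimilarityFunctions.simClustersCosineSimilarity`. Below is a minimal sketch of the fav-score variant; the `NeighborWithWeights` constructor call is schematic and only shows the two fields this method actually reads.)

```scala
import com.twitter.simclusters_v2.common.clustering.MaxFavScoreRepresentativeSelectionMethod
import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights

// Hypothetical cluster members; only neighborId and favScoreHalfLife100Days matter here,
// and the constructor call is schematic (other thrift fields are omitted).
val cluster = Set(
  NeighborWithWeights(neighborId = 1L, favScoreHalfLife100Days = Some(0.2)),
  NeighborWithWeights(neighborId = 2L, favScoreHalfLife100Days = Some(0.9)),
  NeighborWithWeights(neighborId = 3L) // an absent score is treated as 0.0
)

// This selection method ignores the embeddings map, so an empty map is enough.
val selector = new MaxFavScoreRepresentativeSelectionMethod[Array[Double]]
val representative = selector.selectClusterRepresentative(cluster, Map.empty)
// representative == 2L
```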
-scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/api/util", - "src/scala/com/twitter/simclusters_v2/common", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/common/ml/SimClustersEmbeddingAdapter.scala b/src/scala/com/twitter/simclusters_v2/common/ml/SimClustersEmbeddingAdapter.scala deleted file mode 100644 index 8ee8291cf..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/ml/SimClustersEmbeddingAdapter.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.simclusters_v2.common.ml - -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.SparseContinuous -import com.twitter.ml.api._ -import com.twitter.ml.api.util.FDsl._ -import com.twitter.simclusters_v2.common.SimClustersEmbedding - -class SimClustersEmbeddingAdapter(embeddingFeature: SparseContinuous) - extends IRecordOneToOneAdapter[SimClustersEmbedding] { - - override def getFeatureContext: FeatureContext = new FeatureContext(embeddingFeature) - - override def adaptToDataRecord(embedding: SimClustersEmbedding): DataRecord = { - val embeddingMap = embedding.embedding.map { - case (clusterId, score) => - (clusterId.toString, score) - }.toMap - - new DataRecord().setFeatureValue(embeddingFeature, embeddingMap) - } -} - -class NormalizedSimClustersEmbeddingAdapter( - embeddingFeature: SparseContinuous, - normFeature: Continuous) - extends IRecordOneToOneAdapter[SimClustersEmbedding] { - - override def getFeatureContext: FeatureContext = new FeatureContext(embeddingFeature, normFeature) - - override def adaptToDataRecord(embedding: SimClustersEmbedding): DataRecord = { - - val normalizedEmbedding = Map( - embedding.sortedClusterIds.map(_.toString).zip(embedding.normalizedSortedScores): _*) - - val dataRecord = new DataRecord().setFeatureValue(embeddingFeature, normalizedEmbedding) - dataRecord.setFeatureValue(normFeature, embedding.l2norm) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/common/package.scala b/src/scala/com/twitter/simclusters_v2/common/package.scala deleted file mode 100644 index 8be5ad089..000000000 --- a/src/scala/com/twitter/simclusters_v2/common/package.scala +++ /dev/null @@ -1,17 +0,0 @@ -package com.twitter.simclusters_v2 - -package object common { - - type TweetId = Long - type UserId = Long - type ClusterId = Int - type SemanticCoreEntityId = Long // Use TopicId if it's a Topic related project. 
- type UTTEntityId = Long - type Timestamp = Long - type Language = String - type Country = String - type LocaleEntity = (Long, Language) - type TopicId = Long - type GroupId = Long - type SpaceId = String -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/AdhocSources.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/AdhocSources.scala deleted file mode 100644 index 63098e137..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/AdhocSources.scala +++ /dev/null @@ -1,164 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -import com.twitter.bijection.scrooge.BinaryScalaCodec -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.scalding.DateRange -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding_internal.source.lzo_scrooge.DailySuffixMostRecentLzoScrooge -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.scalding_internal.source.lzo_scrooge.HourlySuffixMostRecentLzoScrooge -import com.twitter.simclusters_v2.thriftscala._ - -case class EdgeWithDecayedWtsFixedPathSource(path: String) - extends FixedPathLzoScrooge[EdgeWithDecayedWeights](path, EdgeWithDecayedWeights) - -case class UserAndNeighborsFixedPathSource(path: String) - extends FixedPathLzoScrooge[UserAndNeighbors](path, UserAndNeighbors) - -case class NormsAndCountsFixedPathSource(path: String) - extends FixedPathLzoScrooge[NormsAndCounts](path, NormsAndCounts) - -case class UserToInterestedInClustersFixedPathSource(path: String) - extends FixedPathLzoScrooge[UserToInterestedInClusters](path, UserToInterestedInClusters) - -case class TimelineDataExtractorFixedPathSource(path: String) - extends FixedPathLzoScrooge[ReferenceTweets](path, ReferenceTweets) - -case class TweetClusterScoresHourlySuffixSource(path: String, override val dateRange: DateRange) - extends HourlySuffixMostRecentLzoScrooge[TweetAndClusterScores](path, dateRange) - -case class TweetTopKClustersHourlySuffixSource(path: String, override val dateRange: DateRange) - extends HourlySuffixMostRecentLzoScrooge[TweetTopKClustersWithScores]( - path, - dateRange - ) - -case class ClusterTopKTweetsHourlySuffixSource(path: String, override val dateRange: DateRange) - extends HourlySuffixMostRecentLzoScrooge[ClusterTopKTweetsWithScores]( - path, - dateRange - ) - -case class TweetSimilarityUnhydratedPairsSource(path: String, override val dateRange: DateRange) - extends DailySuffixMostRecentLzoScrooge[LabelledTweetPairs]( - path, - dateRange - ) - -case class WTFCandidatesSource(path: String) - extends FixedPathLzoScrooge[Candidates](path, Candidates) - -case class EmbeddingsLiteSource(path: String) - extends FixedPathLzoScrooge[EmbeddingsLite](path, EmbeddingsLite) - -object AdhocKeyValSources { - def interestedInSource(path: String): VersionedKeyValSource[Long, ClustersUserIsInterestedIn] = { - implicit val keyInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val valInject: Injection[ClustersUserIsInterestedIn, Array[Byte]] = - CompactScalaCodec(ClustersUserIsInterestedIn) - VersionedKeyValSource[Long, ClustersUserIsInterestedIn](path) - } - - def clusterDetailsSource(path: String): VersionedKeyValSource[(String, Int), ClusterDetails] = { - implicit val keyInject: Injection[(String, Int), Array[Byte]] = - Bufferable.injectionOf[(String, Int)] - implicit val 
valInject: Injection[ClusterDetails, Array[Byte]] = - CompactScalaCodec(ClusterDetails) - VersionedKeyValSource[(String, Int), ClusterDetails](path) - } - - def bipartiteQualitySource( - path: String - ): VersionedKeyValSource[(String, Int), BipartiteClusterQuality] = { - implicit val keyInject: Injection[(String, Int), Array[Byte]] = - Bufferable.injectionOf[(String, Int)] - implicit val valInject: Injection[BipartiteClusterQuality, Array[Byte]] = - CompactScalaCodec(BipartiteClusterQuality) - VersionedKeyValSource[(String, Int), BipartiteClusterQuality](path) - } - - def entityToClustersSource( - path: String - ): VersionedKeyValSource[SimClustersEmbeddingId, SimClustersEmbedding] = { - implicit val keyInject: Injection[SimClustersEmbeddingId, Array[Byte]] = - BinaryScalaCodec(SimClustersEmbeddingId) - implicit val valInject: Injection[SimClustersEmbedding, Array[Byte]] = - BinaryScalaCodec(SimClustersEmbedding) - VersionedKeyValSource[SimClustersEmbeddingId, SimClustersEmbedding](path) - } - - def clusterToEntitiesSource( - path: String - ): VersionedKeyValSource[SimClustersEmbeddingId, InternalIdEmbedding] = { - implicit val keyInject: Injection[SimClustersEmbeddingId, Array[Byte]] = BinaryScalaCodec( - SimClustersEmbeddingId) - implicit val valInject: Injection[InternalIdEmbedding, Array[Byte]] = - BinaryScalaCodec(InternalIdEmbedding) - VersionedKeyValSource[SimClustersEmbeddingId, InternalIdEmbedding](path) - } - - // For storing producer-simclusters embeddings - def topProducerToClusterEmbeddingsSource( - path: String - ): VersionedKeyValSource[Long, TopSimClustersWithScore] = { - implicit val keyInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val valInject: Injection[TopSimClustersWithScore, Array[Byte]] = - CompactScalaCodec(TopSimClustersWithScore) - VersionedKeyValSource[Long, TopSimClustersWithScore](path) - } - - // For storing producer-simclusters embeddings - def topClusterEmbeddingsToProducerSource( - path: String - ): VersionedKeyValSource[PersistedFullClusterId, TopProducersWithScore] = { - implicit val keyInject: Injection[PersistedFullClusterId, Array[Byte]] = - CompactScalaCodec(PersistedFullClusterId) - implicit val valInject: Injection[TopProducersWithScore, Array[Byte]] = - CompactScalaCodec(TopProducersWithScore) - VersionedKeyValSource[PersistedFullClusterId, TopProducersWithScore](path) - } - - def userToInferredEntitiesSource( - path: String - ): VersionedKeyValSource[Long, SimClustersInferredEntities] = { - implicit val keyInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val valInject: Injection[SimClustersInferredEntities, Array[Byte]] = - CompactScalaCodec(SimClustersInferredEntities) - VersionedKeyValSource[Long, SimClustersInferredEntities](path) - } - - def knownForAdhocSource(path: String): VersionedKeyValSource[Long, ClustersUserIsKnownFor] = { - implicit val keyInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val valInject: Injection[ClustersUserIsKnownFor, Array[Byte]] = - CompactScalaCodec(ClustersUserIsKnownFor) - VersionedKeyValSource[Long, ClustersUserIsKnownFor](path) - } - - def knownForSBFResultsDevelSource( - path: String - ): VersionedKeyValSource[Long, Array[(Int, Float)]] = { - implicit val keyInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val valInject: Injection[Array[(Int, Float)], Array[Byte]] = - Bufferable.injectionOf[Array[(Int, Float)]] - VersionedKeyValSource[Long, Array[(Int, Float)]](path) - } - - // injection to store 
adjlist in the mapped indices space for users - def intermediateSBFResultsDevelSource( - path: String - ): VersionedKeyValSource[Int, List[(Int, Float)]] = { - implicit val keyInject: Injection[Int, Array[Byte]] = Injection.int2BigEndian - implicit val valInject: Injection[List[(Int, Float)], Array[Byte]] = - Bufferable.injectionOf[List[(Int, Float)]] - VersionedKeyValSource[Int, List[(Int, Float)]](path) - } - - def mappedIndicesDevelSource(path: String): VersionedKeyValSource[Int, Long] = { - implicit val keyInject: Injection[Int, Array[Byte]] = Injection.int2BigEndian - implicit val valInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - VersionedKeyValSource[Int, Long](path) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/BUILD b/src/scala/com/twitter/simclusters_v2/hdfs_sources/BUILD deleted file mode 100644 index 4cddde193..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/BUILD +++ /dev/null @@ -1,2216 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":data_sources", - "3rdparty/src/jvm/com/twitter/scalding:core", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/wtf/entity_real_graph:entity_real_graph-thrift-scala", - ], -) - -scala_library( - name = "data_sources", - sources = [], - description = "DAL datasets we wish to expose externally", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":ads_fav_based_simclusters_cluster_to_tweet_index-scala", - ":ads_fav_click_based_simclusters_cluster_to_tweet_index-scala", - ":aggregatable_producer_simclusters_embeddings_by_fav_score-scala", - ":aggregatable_producer_simclusters_embeddings_by_fav_score_2020-scala", - ":aggregatable_producer_simclusters_embeddings_by_fav_score_2020_thrift-scala", - ":aggregatable_producer_simclusters_embeddings_by_fav_score_thrift-scala", - ":aggregatable_producer_simclusters_embeddings_by_follow_score_2020-scala", - ":aggregatable_producer_simclusters_embeddings_by_follow_score_2020_thrift-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score_2020-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score_2020_thrift-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score_relaxed_fav_engagement_threshold_2020-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score_relaxed_fav_engagement_threshold_2020_thrift-scala", - ":aggregatable_producer_simclusters_embeddings_by_log_fav_score_thrift-scala", - ":clusters_members_connected_components_ape_similarity-scala", - ":clusters_members_largest_dim_ape_similarity-scala", - ":clusters_members_largest_dim_ape_similarity_2_day_update-scala", - ":clusters_members_louvain_ape_similarity-scala", - ":co_engagement_top_k_similar_tweets-scala", - ":explore_mbcg_user_embeddings_kv-scala", - ":fav_based_evergreen_content_simclusters_cluster_to_tweet_index-scala", - ":fav_based_simclusters_cluster_to_tweet_index-scala", - ":fav_based_video_simclusters_cluster_to_tweet_index-scala", - ":fav_inferred_language_tfg_topic_embeddings-scala", - 
":fav_tfg_topic_embeddings-scala", - ":fav_tfg_topic_embeddings_2020-scala", - ":fav_tfg_topic_embeddings_2020_parquet-scala", - ":fav_tfg_topic_embeddings_parquet-scala", - ":full_multi_type_graph-scala", - ":geopopular_top_tweet_impressed_topics-scala", - ":hashtag_simclusters_embeddings_updated-scala", - ":interested_in_twice_by_largest_dim-scala", - ":interested_in_twice_by_largest_dim_2_day_update-scala", - ":interested_in_twice_by_largest_dim_fav_score-scala", - ":interested_in_twice_connected_components-scala", - ":interested_in_twice_louvain-scala", - ":log_fav_reverse_index_semantic_core_per_language_simclusters_embeddings-scala", - ":log_fav_semantic_core_per_language_simclusters_embeddings-scala", - ":log_fav_tfg_topic_embeddings-scala", - ":log_fav_tfg_topic_embeddings_parquet-scala", - ":multi_type_graph_for_top_k_right_nodes_thrift_50_m_scio-scala", - ":multi_type_graph_for_top_k_right_nodes_thrift_scio-scala", - ":multi_type_simclusters_right_node_to_clusters_thrift_50_m-scala", - ":multi_type_simclusters_right_node_to_clusters_thrift_fav_90_p_20_m-scala", - ":offline_cluster_top_media_tweets_20M_145K_2020-scala", - ":offline_tweet_recommendations_from_interested_in_20M_145K_2020-scala", - ":offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_0_EL_15-scala", - ":offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_15-scala", - ":offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_50-scala", - ":offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_8_EL_50-scala", - ":offline_tweet_recommendations_from_mts_consumer_embeddings-scala", - ":producer_norms_and_counts-scala", - ":producer_top_k_simcluster_embeddings_by_fav_score-scala", - ":producer_top_k_simcluster_embeddings_by_fav_score_2020-scala", - ":producer_top_k_simcluster_embeddings_by_fav_score_updated-scala", - ":producer_top_k_simcluster_embeddings_by_follow_score-scala", - ":producer_top_k_simcluster_embeddings_by_follow_score_2020-scala", - ":producer_top_k_simcluster_embeddings_by_follow_score_updated-scala", - ":push_open_based_simclusters_cluster_to_tweet_index-scala", - ":reply_based_simclusters_cluster_to_tweet_index-scala", - ":retweet_based_simclusters_cluster_to_tweet_index-scala", - ":reverse_index_hashtag_simclusters_embeddings_updated-scala", - ":reverse_index_semantic_core_per_language_simclusters_embeddings-scala", - ":reverse_index_semantic_core_simclusters_embeddings-scala", - ":reverse_index_semantic_core_simclusters_embeddings_2020-scala", - ":reverse_index_semantic_core_simclusters_embeddings_updated-scala", - ":right_node_cosine_similarity_scio-scala", - ":right_node_sim_hash_scio-scala", - ":rux_faved_top_k_tweets-scala", - ":semantic_core_embeddings_from_producer-scala", - ":semantic_core_per_language_simclusters_embeddings-scala", - ":semantic_core_simclusters_embeddings-scala", - ":semantic_core_simclusters_embeddings_2020-scala", - ":semantic_core_simclusters_embeddings_updated-scala", - ":simcluster_embedding_top_k_producers_by_fav_score-scala", - ":simcluster_embedding_top_k_producers_by_fav_score_2020-scala", - ":simcluster_embedding_top_k_producers_by_fav_score_updated-scala", - ":simcluster_embedding_top_k_producers_by_follow_score-scala", - ":simcluster_embedding_top_k_producers_by_follow_score_2020-scala", - ":simcluster_embedding_top_k_producers_by_follow_score_updated-scala", - ":simclusters_inferred_entities_from_interested_in-scala", - ":simclusters_inferred_entities_from_interested_in_keyed_by_cluster-scala", - 
":simclusters_inferred_entities_from_known_for-scala", - ":simclusters_offline_cluster_top_k_tweets-scala", - ":simclusters_offline_tweet_cluster_scores-scala", - ":simclusters_offline_tweet_top_k_clusters-scala", - ":simclusters_v2_cluster_details-scala", - ":simclusters_v2_cluster_details_20m_145k_2020-scala", - ":simclusters_v2_cluster_details_20m_145k_updated-scala", - ":simclusters_v2_cluster_details_lite-scala", - ":simclusters_v2_cluster_details_lite_20m_145k_2020-scala", - ":simclusters_v2_cluster_details_lite_20m_145k_updated-scala", - ":simclusters_v2_embeddings_lite-scala", - ":simclusters_v2_global_language_embedding-scala", - ":simclusters_v2_global_language_embedding_thrift-scala", - ":simclusters_v2_interested_in-scala", - ":simclusters_v2_interested_in_20M_145K_2020-scala", - ":simclusters_v2_interested_in_20M_145K_updated-scala", - ":simclusters_v2_interested_in_from_aggregatable_producer_embeddings_20M_145K_2020-scala", - ":simclusters_v2_interested_in_from_producer_embeddings_20M_145K_updated-scala", - ":simclusters_v2_interested_in_lite_20M_145K_2020-scala", - ":simclusters_v2_known_for_20M_145K_2020-scala", - ":simclusters_v2_known_for_20M_145K_2020_thrift-scala", - ":simclusters_v2_known_for_20M_145K_dec11-scala", - ":simclusters_v2_known_for_20M_145K_updated-scala", - ":simclusters_v2_known_for_20M_145K_updated_thrift-scala", - ":simclusters_v2_raw_interested_in_20M_145K_2020-scala", - ":simclusters_v2_raw_interested_in_20M_145K_dec11-scala", - ":simclusters_v2_raw_interested_in_20M_145K_updated-scala", - ":simclusters_v2_raw_interested_in_lite_20M_145K_2020-scala", - ":simclusters_v2_raw_known_for_20M_145K_2020-scala", - ":simclusters_v2_raw_known_for_20M_145K_dec11-scala", - ":simclusters_v2_raw_known_for_20M_145K_updated-scala", - ":simclusters_v2_user_to_interested_in_20M_145K_2020-scala", - ":simclusters_v2_user_to_interested_in_20M_145K_dec11-scala", - ":simclusters_v2_user_to_interested_in_20M_145K_updated-scala", - ":simclusters_v2_user_to_interested_in_from_aggregatable_producer_embeddings_20M_145K_2020-scala", - ":simclusters_v2_user_to_interested_in_lite_20M_145K_2020-scala", - ":similar_topics_from_topic_follow_graph-scala", - ":similar_users_by_fav_based_producer_embedding-scala", - ":similar_users_by_follow_based_producer_embedding-scala", - ":top_k_right_nouns-scala", - ":top_k_right_nouns_scio-scala", - ":top_locale_topics_for_producer_from_em-scala", - ":top_producers_for_locale_topics_from_topic_follow_graph-scala", - ":topic_top_producers_em-scala", - ":truncated_multi_type_graph-scala", - ":truncated_multi_type_graph_scio-scala", - ":tweet_evaluation_timelines_reference_set-scala", - ":user_topic_weighted_embedding-scala", - ":user_topic_weighted_embedding_parquet-scala", - ":user_user_fav_graph-scala", - ":user_user_graph-scala", - ":user_user_normalized_graph-scala", - ":video_view_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/common", - ], -) - -create_datasets( - base_name = "user_user_fav_graph", - java_schema = "com.twitter.simclusters_v2.thriftjava.EdgeWithDecayedWeights", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = 
"producer_norms_and_counts", - java_schema = "com.twitter.simclusters_v2.thriftjava.NormsAndCounts", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.NormsAndCounts", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "user_user_normalized_graph", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserAndNeighbors", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserAndNeighbors", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "multi_type_simclusters_right_node_to_clusters_thrift_fav_90_p_20_m", - java_schema = "com.twitter.simclusters_v2.thriftjava.RightNodeWithClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.RightNodeWithClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "multi_type_simclusters_right_node_to_clusters_thrift_50_m", - java_schema = "com.twitter.simclusters_v2.thriftjava.RightNodeWithClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.RightNodeWithClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "user_user_graph", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserAndNeighbors", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserAndNeighbors", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -# InterestedIn -create_datasets( - base_name = "simclusters_v2_raw_interested_in_20M_145K_dec11", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_raw_interested_in_20M_145K_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - 
scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_raw_interested_in_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_raw_interested_in_lite_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "multi_type_graph_for_top_k_right_nodes_thrift_fav_90_p_20_m_scio", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "multi_type_graph_for_top_k_right_nodes_thrift_50_m_scio", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in_20M_145K_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - 
scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in_lite_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_user_to_interested_in_20M_145K_dec11", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToInterestedInClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_user_to_interested_in_20M_145K_updated", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToInterestedInClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_user_to_interested_in_20M_145K_2020", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToInterestedInClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_user_to_interested_in_lite_20M_145K_2020", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToInterestedInClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_user_to_interested_in_from_aggregatable_producer_embeddings_20M_145K_2020", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToInterestedInClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) -# end of InterestedIn - -# KnownFor -create_datasets( - base_name = 
"simclusters_v2_raw_known_for_20M_145K_dec11", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_raw_known_for_20M_145K_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_raw_known_for_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_known_for_20M_145K_dec11", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_known_for_20M_145K_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_known_for_20M_145K_updated_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.UserToKnownForClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToKnownForClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_known_for_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.KnownForInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_known_for_20M_145K_2020_thrift", - java_schema = 
"com.twitter.simclusters_v2.thriftjava.UserToKnownForClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserToKnownForClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -# end of KnownFor - -create_datasets( - base_name = "simclusters_v2_cluster_details", - key_type = "scala.Tuple2[String, Int]", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterDetailsInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClusterDetails", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_cluster_details_lite", - java_schema = "com.twitter.simclusters_v2.thriftjava.ClusterDetailsLite", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.ClusterDetailsLite", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_embeddings_lite", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.EmbeddingsLite", - segment_type = "snapshot", - tags = ["bazel-compatible"], - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_cluster_details_20m_145k_updated", - key_type = "scala.Tuple2[String, Int]", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterDetailsInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClusterDetails", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_cluster_details_lite_20m_145k_updated", - java_schema = "com.twitter.simclusters_v2.thriftjava.ClusterDetailsLite", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.ClusterDetailsLite", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_v2_cluster_details_20m_145k_2020", - key_type = "scala.Tuple2[String, Int]", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterDetailsInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClusterDetails", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_cluster_details_lite_20m_145k_2020", - java_schema = 
"com.twitter.simclusters_v2.thriftjava.ClusterDetailsLite", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.ClusterDetailsLite", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "tweet_evaluation_timelines_reference_set", - description = "A Tweet dataset that contains impressed tweets with engagement labels, parsed from Timelines", - java_schema = "com.twitter.simclusters_v2.thriftjava.ReferenceTweets", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.ReferenceTweets", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "semantic_core_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "semantic_core_simclusters_embeddings_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "semantic_core_simclusters_embeddings_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "semantic_core_per_language_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "log_fav_semantic_core_per_language_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - 
platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "reverse_index_semantic_core_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "reverse_index_semantic_core_simclusters_embeddings_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "reverse_index_semantic_core_simclusters_embeddings_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "reverse_index_semantic_core_per_language_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "log_fav_reverse_index_semantic_core_per_language_simclusters_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "hashtag_simclusters_embeddings_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_tfg_topic_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_tfg_topic_embeddings_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_tfg_topic_embeddings_parquet", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TfgTopicEmbeddings", - segment_type = "snapshot", - tags = ["bazel-compatible"], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "fav_tfg_topic_embeddings_2020_parquet", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TfgTopicEmbeddings", - segment_type = "snapshot", - tags = ["bazel-compatible"], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "fav_inferred_language_tfg_topic_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "log_fav_tfg_topic_embeddings", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "log_fav_tfg_topic_embeddings_parquet", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TfgTopicEmbeddings", - segment_type = "snapshot", - tags = 
["bazel-compatible"], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "reverse_index_hashtag_simclusters_embeddings_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.InternalIdEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_fav_score", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_fav_score_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_fav_score_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_fav_score", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_fav_score_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - 
-create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_fav_score_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_follow_score", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_follow_score_updated", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simcluster_embedding_top_k_producers_by_follow_score_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimClusterEmbeddingTopKProducersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopProducersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_follow_score", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_follow_score_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "producer_top_k_simcluster_embeddings_by_follow_score_2020", - key_type = "Long", - platform = "java8", - role = 
"cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerTopKSimClusterEmbeddingsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "similar_users_by_fav_based_producer_embedding", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimilarUsersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.hermit.candidate.thriftscala.Candidates", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "similar_users_by_follow_based_producer_embedding", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.SimilarUsersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.hermit.candidate.thriftscala.Candidates", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_follow_score_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score_relaxed_fav_engagement_threshold_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_fav_score", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_fav_score_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ProducerEmbeddingsInjections.ProducerSimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score_2020_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_follow_score_2020_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_fav_score_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_fav_score_2020_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "aggregatable_producer_simclusters_embeddings_by_log_fav_score_relaxed_fav_engagement_threshold_2020_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -# TWICE & Clustering datasets -create_datasets( - base_name = "interested_in_twice_by_largest_dim", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersMultiEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "interested_in_twice_by_largest_dim_fav_score", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersMultiEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "interested_in_twice_by_largest_dim_2_day_update", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersMultiEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "interested_in_twice_louvain", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersMultiEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "interested_in_twice_connected_components", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersMultiEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "clusters_members_largest_dim_ape_similarity", - key_type = "com.twitter.simclusters_v2.common.UserId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusteringInjections.OrderedClustersAndMembersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.OrderedClustersAndMembers", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "clusters_members_largest_dim_ape_similarity_2_day_update", - key_type = "com.twitter.simclusters_v2.common.UserId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusteringInjections.OrderedClustersAndMembersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.OrderedClustersAndMembers", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "clusters_members_louvain_ape_similarity", - key_type = "com.twitter.simclusters_v2.common.UserId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusteringInjections.OrderedClustersAndMembersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.OrderedClustersAndMembers", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "clusters_members_connected_components_ape_similarity", - key_type = "com.twitter.simclusters_v2.common.UserId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusteringInjections.OrderedClustersAndMembersInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.OrderedClustersAndMembers", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -# End of TWICE & Clustering datasets - -create_datasets( - base_name = "simclusters_offline_tweet_cluster_scores", - description = "A dataset that contains the scores for tweet and cluster pairs", - java_schema = "com.twitter.simclusters_v2.thriftjava.TweetAndClusterScores", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.thriftscala.TweetAndClusterScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_offline_tweet_top_k_clusters", - description = "A dataset that contains the top clusters for each tweet", - java_schema = "com.twitter.simclusters_v2.thriftjava.TweetTopKClustersWithScores", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TweetTopKClustersWithScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_offline_cluster_top_k_tweets", - description = "A dataset that contains the top tweets for each cluster", - java_schema = "com.twitter.simclusters_v2.thriftjava.ClusterTopKTweetsWithScores", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.ClusterTopKTweetsWithScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "simclusters_inferred_entities_from_known_for", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InferredEntitiesInjections.InferredEntityInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_inferred_entities_from_interested_in", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InferredEntitiesInjections.InferredEntityInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_inferred_entities_from_interested_in_keyed_by_cluster", - key_type = "Int", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InferredEntitiesInjections.InferredEntityKeyedByClusterInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "co_engagement_top_k_similar_tweets", - description = "A dataset that contains the top similar tweets based on co-engagement", - java_schema = "com.twitter.simclusters_v2.thriftjava.TweetTopKTweetsWithScore", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TweetTopKTweetsWithScore", - segment_type = 
"partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "rux_faved_top_k_tweets", - description = "A dataset that contains the top similar tweets based on rux fav-to-impression ratio", - java_schema = "com.twitter.simclusters_v2.thriftjava.TweetTopKTweetsWithScore", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.TweetTopKTweetsWithScore", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "semantic_core_embeddings_from_producer", - key_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.EntitySimClustersEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in_from_producer_embeddings_20M_145K_updated", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_interested_in_from_aggregatable_producer_embeddings_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "geopopular_top_tweet_impressed_topics", - key_type = "String", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SemanticCoreEntitiesInjections.StringToSemanticCoreEntityScoreListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.recos.entities.thriftscala.SemanticCoreEntityScoreList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "similar_topics_from_topic_follow_graph", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SemanticCoreEntitiesInjections.LongToSemanticCoreEntityScoreListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.recos.entities.thriftscala.SemanticCoreEntityScoreList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], 
-) - -create_datasets( - base_name = "top_locale_topics_for_producer_from_em", - key_type = "com.twitter.recos.entities.thriftscala.UserIdWithLocale", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SemanticCoreEntitiesInjections.UserWithLocaleToSemanticCoreEntityScoreListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.recos.entities.thriftscala.SemanticCoreEntityScoreList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "top_producers_for_locale_topics_from_topic_follow_graph", - key_type = "com.twitter.recos.entities.thriftscala.SemanticCoreEntityWithLocale", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SemanticCoreEntitiesInjections.SemanticCoreEntityWithLocaleToUsersScoreListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.recos.entities.thriftscala.UserScoreList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "topic_top_producers_em", - key_type = "com.twitter.recos.entities.thriftscala.SemanticCoreEntityWithLocale", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SemanticCoreEntitiesInjections.SemanticCoreEntityWithLocaleToUsersScoreListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.recos.entities.thriftscala.UserScoreList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "adhoc_abuse_simcluster_features", - java_schema = "com.twitter.simclusters_v2.thriftjava.AdhocSingleSideClusterScores", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.AdhocSingleSideClusterScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "search_abuse_simcluster_features_manhattan", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.SingleSideUserScoresInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SingleSideUserScores", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "adhoc_cross_simcluster_block_interaction_features", - java_schema = "com.twitter.simclusters_v2.thriftjava.AdhocCrossSimClusterInteractionScores", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.AdhocCrossSimClusterInteractionScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "adhoc_cross_simcluster_fav_interaction_features", - java_schema = 
"com.twitter.simclusters_v2.thriftjava.AdhocCrossSimClusterInteractionScores", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.AdhocCrossSimClusterInteractionScores", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "top_k_right_nouns", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.topKRightNounListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "top_k_right_nouns_scio", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.topKRightNounListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_cluster_top_media_tweets_20M_145K_2020", - key_type = "com.twitter.simclusters_v2.thriftscala.DayPartitionedClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopMediaTweetsInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TweetsWithScore", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "truncated_multi_type_graph", - key_type = "com.twitter.simclusters_v2.thriftscala.LeftNode", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.truncatedMultiTypeGraphInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "truncated_multi_type_graph_scio", - key_type = "com.twitter.simclusters_v2.thriftscala.LeftNode", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.truncatedMultiTypeGraphInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "multi_type_graph_for_top_k_right_nodes_thrift_scio", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - 
java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "full_multi_type_graph", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "right_node_sim_hash_scio", - java_schema = "com.twitter.simclusters_v2.thriftjava.RightNodeSimHashSketch", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.RightNodeSimHashSketch", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "right_node_cosine_similarity_scio", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNode", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.similarRightNodesInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimilarRightNodes", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "user_topic_weighted_embedding", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.injection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "user_topic_weighted_embedding_parquet", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.UserTopicWeightedEmbedding", - segment_type = "snapshot", - tags = ["bazel-compatible"], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "explore_mbcg_user_embeddings_kv", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.EntityEmbeddingsInjections.UserMbcgEmbeddingInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.ml.api.thriftscala.Embedding", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_interested_in_20M_145K_2020", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", 
- tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_0_EL_15", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_15", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_50", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_8_EL_50", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_from_mts_consumer_embeddings", - key_type = "Long", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "video_view_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "retweet_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "reply_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "push_open_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "ads_fav_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - 
-create_datasets( - base_name = "ads_fav_click_based_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_based_evergreen_content_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "fav_based_video_simclusters_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_global_language_embedding", - key_type = "String", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.InterestedInInjection.languageInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_v2_global_language_embedding_thrift", - java_schema = "com.twitter.simclusters_v2.thriftjava.LanguageToClusters", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.LanguageToClusters", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataPaths.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataPaths.scala deleted file mode 100644 index 486a21f60..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataPaths.scala +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -object DataPaths { - - val InterestedIn2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_interested_in_20M_145K_2020" - - val InterestedIn2020ThriftPath = - 
"/user/cassowary/manhattan_sequence_files/simclusters_v2_interested_in_20M_145K_2020_thrift" - - val InterestedInLite2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_interested_in_lite_20M_145K_2020" - - val InterestedInLite2020ThriftPath = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_interested_in_lite_20M_145K_2020_thrift" - - val KnownFor2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_known_for_20M_145K_2020" - - // keep this inside /user/cassowary/manhattan_sequence_files/ to use the latest 3 retention policy - val KnownFor2020ThriftDatasetPath = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_known_for_20M_145K_2020_thrift" - - val OfflineClusterTopMediaTweets2020DatasetPath = - "/user/cassowary/manhattan_sequence_files/cluster_top_media_tweets_20M_145K_2020" -} - -/** - * These should only be accessed from simclusters_v2 data pipeline for intermediate data, these - * are not opt-out compliant and shouldn't be exposed externally. - */ -object InternalDataPaths { - // Internal versions, not to be read or written outside of simcluster_v2 - - private[simclusters_v2] val RawInterestedIn2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_raw_interested_in_20M_145K_2020" - - private[simclusters_v2] val RawInterestedInLite2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_raw_interested_in_lite_20M_145K_2020" - - private[simclusters_v2] val RawKnownForDec11Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_raw_known_for_20M_145K_dec11" - - private[simclusters_v2] val RawKnownForUpdatedPath = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_raw_known_for_20M_145K_updated" - - private[simclusters_v2] val RawKnownFor2020Path = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_raw_known_for_20M_145K_2020" -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataSources.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataSources.scala deleted file mode 100644 index c72b25d3f..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/DataSources.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -import com.twitter.scalding.DateOps -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.simclusters_v2.thriftscala.NormsAndCounts -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import java.util.TimeZone - -object DataSources { - - /** - * Reads production normalized graph data from atla-proc - */ - def userUserNormalizedGraphSource(implicit dateRange: DateRange): TypedPipe[UserAndNeighbors] = { - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(14)(DateOps.UTC)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - /** - * Reads production user norms and counts data from atla-proc - */ - def userNormsAndCounts( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[NormsAndCounts] = { - DAL - .readMostRecentSnapshot(ProducerNormsAndCountsScalaDataset, dateRange.prepend(Days(14))) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/EntityEmbeddingsSources.scala 
b/src/scala/com/twitter/simclusters_v2/hdfs_sources/EntityEmbeddingsSources.scala deleted file mode 100644 index a8ad1a69b..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/EntityEmbeddingsSources.scala +++ /dev/null @@ -1,222 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding.DateRange -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.wtf.entity_real_graph.thriftscala.EntityType -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions - -object EntityEmbeddingsSources { - - final val SemanticCoreSimClustersEmbeddingsDec11Dataset = - SemanticCoreSimclustersEmbeddingsScalaDataset - - final val SemanticCoreSimClustersEmbeddingsUpdatedDataset = - SemanticCoreSimclustersEmbeddingsUpdatedScalaDataset - - final val SemanticCoreSimClustersEmbeddings2020Dataset = - SemanticCoreSimclustersEmbeddings2020ScalaDataset - - final val SemanticCorePerLanguageSimClustersEmbeddingsDataset = - SemanticCorePerLanguageSimclustersEmbeddingsScalaDataset - - final val LogFavSemanticCorePerLanguageSimClustersEmbeddingsDataset = - LogFavSemanticCorePerLanguageSimclustersEmbeddingsScalaDataset - - final val HashtagSimClustersEmbeddingsUpdatedDataset = - HashtagSimclustersEmbeddingsUpdatedScalaDataset - - final val ReverseIndexSemanticCoreSimClustersEmbeddingsDec11Dataset = - ReverseIndexSemanticCoreSimclustersEmbeddingsScalaDataset - - final val ReverseIndexSemanticCoreSimClustersEmbeddingsUpdatedDataset = - ReverseIndexSemanticCoreSimclustersEmbeddingsUpdatedScalaDataset - - final val ReverseIndexSemanticCoreSimClustersEmbeddings2020Dataset = - ReverseIndexSemanticCoreSimclustersEmbeddings2020ScalaDataset - - final val ReverseIndexSemanticCorePerLanguageSimClustersEmbeddingsDataset = - ReverseIndexSemanticCorePerLanguageSimclustersEmbeddingsScalaDataset - - final val LogFavReverseIndexSemanticCorePerLanguageSimClustersEmbeddingsDataset = - LogFavReverseIndexSemanticCorePerLanguageSimclustersEmbeddingsScalaDataset - - final val ReverseIndexHashtagSimClustersEmbeddingsUpdatedDataset = - ReverseIndexHashtagSimclustersEmbeddingsUpdatedScalaDataset - - // Fav-based TFG topic embeddings built from user device languages - // Keyed by SimClustersEmbeddingId with InternalId.TopicId ((topic, language) pair, with country = None) - final val FavTfgTopicEmbeddingsDataset = FavTfgTopicEmbeddingsScalaDataset - - final val FavTfgTopicEmbeddingsParquetDataset = FavTfgTopicEmbeddingsParquetScalaDataset - - final val FavTfgTopicEmbeddings2020Dataset = FavTfgTopicEmbeddings2020ScalaDataset - - final val FavTfgTopicEmbeddings2020ParquetDataset = FavTfgTopicEmbeddings2020ParquetScalaDataset - - // Logfav-based TFG topic embeddings built from user device languages - // Keyed by SimClustersEmbeddingId with InternalId.LocaleEntityId ((topic, language) pair) - final val LogFavTfgTopicEmbeddingsDataset = LogFavTfgTopicEmbeddingsScalaDataset - - final val LogFavTfgTopicEmbeddingsParquetDataset = LogFavTfgTopicEmbeddingsParquetScalaDataset - - // Fav-based TFG topic embeddings built from inferred user consumed languages - // Keyed by SimClustersEmbeddingId with InternalId.TopicId ((topic, country, language) tuple) - final val 
FavInferredLanguageTfgTopicEmbeddingsDataset = - FavInferredLanguageTfgTopicEmbeddingsScalaDataset - - private val validSemanticCoreEmbeddingTypes = Seq( - EmbeddingType.FavBasedSematicCoreEntity, - EmbeddingType.FollowBasedSematicCoreEntity - ) - - /** - * Given a fav/follow/etc embedding type and a ModelVersion, retrieve the corresponding dataset to - * (SemanticCore entityId -> List(clusterId)) from a certain dateRange. - */ - def getSemanticCoreEntityEmbeddingsSource( - embeddingType: EmbeddingType, - modelVersion: String, - dateRange: DateRange - ): TypedPipe[(Long, SimClustersEmbedding)] = { - val dataSet = modelVersion match { - case ModelVersions.Model20M145KDec11 => SemanticCoreSimClustersEmbeddingsDec11Dataset - case ModelVersions.Model20M145KUpdated => SemanticCoreSimClustersEmbeddingsUpdatedDataset - case _ => throw new IllegalArgumentException(s"ModelVersion $modelVersion is not supported") - } - assert(validSemanticCoreEmbeddingTypes.contains(embeddingType)) - entityEmbeddingsSource(dataSet, embeddingType, dateRange) - } - - /** - * Given a fav/follow/etc embedding type and a ModelVersion, retrieve the corresponding dataset to - * (clusterId -> List(SemanticCore entityId)) from a certain dateRange. - */ - def getReverseIndexedSemanticCoreEntityEmbeddingsSource( - embeddingType: EmbeddingType, - modelVersion: String, - dateRange: DateRange - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - val dataSet = modelVersion match { - case ModelVersions.Model20M145KDec11 => - ReverseIndexSemanticCoreSimClustersEmbeddingsDec11Dataset - case ModelVersions.Model20M145KUpdated => - ReverseIndexSemanticCoreSimClustersEmbeddingsUpdatedDataset - case ModelVersions.Model20M145K2020 => - ReverseIndexSemanticCoreSimClustersEmbeddings2020Dataset - case _ => throw new IllegalArgumentException(s"ModelVersion $modelVersion is not supported") - } - - assert(validSemanticCoreEmbeddingTypes.contains(embeddingType)) - reverseIndexedEntityEmbeddingsSource(dataSet, embeddingType, dateRange) - } - - // Return the raw DAL dataset reference. Use this if you're writing to DAL. - def getEntityEmbeddingsDataset( - entityType: EntityType, - modelVersion: String, - isEmbeddingsPerLocale: Boolean = false - ): KeyValDALDataset[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] = { - (entityType, modelVersion) match { - case (EntityType.SemanticCore, ModelVersions.Model20M145KDec11) => - SemanticCoreSimClustersEmbeddingsDec11Dataset - case (EntityType.SemanticCore, ModelVersions.Model20M145KUpdated) => - if (isEmbeddingsPerLocale) { - SemanticCorePerLanguageSimClustersEmbeddingsDataset - } else { - SemanticCoreSimClustersEmbeddingsUpdatedDataset - } - case (EntityType.SemanticCore, ModelVersions.Model20M145K2020) => - SemanticCoreSimClustersEmbeddings2020Dataset - case (EntityType.Hashtag, ModelVersions.Model20M145KUpdated) => - HashtagSimClustersEmbeddingsUpdatedDataset - case (entityType, modelVersion) => - throw new IllegalArgumentException( - s"(Entity Type, ModelVersion) ($entityType, $modelVersion) not supported.") - } - } - - // Return the raw DAL dataset reference. Use this if you're writing to DAL. 
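// e.g. getReverseIndexedEntityEmbeddingsDataset(EntityType.SemanticCore, ModelVersions.Model20M145K2020)
// resolves to ReverseIndexSemanticCoreSimClustersEmbeddings2020Dataset (see the match below).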
- def getReverseIndexedEntityEmbeddingsDataset( - entityType: EntityType, - modelVersion: String, - isEmbeddingsPerLocale: Boolean = false - ): KeyValDALDataset[KeyVal[SimClustersEmbeddingId, InternalIdEmbedding]] = { - (entityType, modelVersion) match { - case (EntityType.SemanticCore, ModelVersions.Model20M145KDec11) => - ReverseIndexSemanticCoreSimClustersEmbeddingsDec11Dataset - case (EntityType.SemanticCore, ModelVersions.Model20M145KUpdated) => - if (isEmbeddingsPerLocale) { - ReverseIndexSemanticCorePerLanguageSimClustersEmbeddingsDataset - } else { - ReverseIndexSemanticCoreSimClustersEmbeddingsUpdatedDataset - } - case (EntityType.SemanticCore, ModelVersions.Model20M145K2020) => - ReverseIndexSemanticCoreSimClustersEmbeddings2020Dataset - case (EntityType.Hashtag, ModelVersions.Model20M145KUpdated) => - ReverseIndexHashtagSimClustersEmbeddingsUpdatedDataset - case (entityType, modelVersion) => - throw new IllegalArgumentException( - s"(Entity Type, ModelVersion) ($entityType, $modelVersion) not supported.") - } - } - - private def entityEmbeddingsSource( - dataset: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]], - embeddingType: EmbeddingType, - dateRange: DateRange - ): TypedPipe[(Long, SimClustersEmbedding)] = { - val pipe = DAL - .readMostRecentSnapshot(dataset, dateRange) - .withRemoteReadPolicy(AllowCrossDC) - .toTypedPipe - filterEntityEmbeddingsByType(pipe, embeddingType) - } - - private def reverseIndexedEntityEmbeddingsSource( - dataset: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, InternalIdEmbedding]], - embeddingType: EmbeddingType, - dateRange: DateRange - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - val pipe = DAL - .readMostRecentSnapshot(dataset, dateRange) - .withRemoteReadPolicy(AllowCrossDC) - .toTypedPipe - filterReverseIndexedEntityEmbeddingsByType(pipe, embeddingType) - } - - private[hdfs_sources] def filterEntityEmbeddingsByType( - pipe: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]], - embeddingType: EmbeddingType - ): TypedPipe[(Long, SimClustersEmbedding)] = { - pipe.collect { - case KeyVal( - SimClustersEmbeddingId(_embeddingType, _, InternalId.EntityId(entityId)), - embedding - ) if _embeddingType == embeddingType => - (entityId, embedding) - } - } - - private[hdfs_sources] def filterReverseIndexedEntityEmbeddingsByType( - pipe: TypedPipe[KeyVal[SimClustersEmbeddingId, InternalIdEmbedding]], - embeddingType: EmbeddingType - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - pipe.collect { - case KeyVal( - SimClustersEmbeddingId(_embeddingType, _, InternalId.ClusterId(clusterId)), - embedding - ) if _embeddingType == embeddingType => - val entitiesWithScores = embedding.embedding.collect { - case InternalIdWithScore(InternalId.EntityId(entityId), score) => - SemanticCoreEntityWithScore(entityId, score) - } - (clusterId, entitiesWithScores) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/InterestedInSources.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/InterestedInSources.scala deleted file mode 100644 index 518b0be9f..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/InterestedInSources.scala +++ /dev/null @@ -1,178 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding.{DateOps, DateRange, Days, TypedPipe} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, 
ProcAtla} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import java.util.TimeZone - -object InterestedInSources { - - private val ModelVersionInterestedInDatasetMap: Map[ModelVersion, KeyValDALDataset[ - KeyVal[UserId, ClustersUserIsInterestedIn] - ]] = Map( - ModelVersion.Model20m145kDec11 -> SimclustersV2InterestedInScalaDataset, - ModelVersion.Model20m145kUpdated -> SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - ModelVersion.Model20m145k2020 -> SimclustersV2InterestedIn20M145K2020ScalaDataset - ) - - /** - * Internal version, not PDP compliant, not to be used outside simclusters_v2 - * Reads 20M145KDec11 production InterestedIn data from atla-proc, with a 14-day extended window - */ - private[simclusters_v2] def simClustersRawInterestedInDec11Source( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - - DAL - .readMostRecentSnapshot( - SimclustersV2RawInterestedIn20M145KDec11ScalaDataset, - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Internal version, not PDP compliant, not to be used outside simclusters_v2 - * Reads 20M145KUpdated InterestedIn data from atla-proc, with a 14-day extended window - */ - private[simclusters_v2] def simClustersRawInterestedInUpdatedSource( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - DAL - .readMostRecentSnapshot( - SimclustersV2RawInterestedIn20M145KUpdatedScalaDataset, - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Internal version, not PDP compliant, not to be used outside simclusters_v2 - * Reads 20M145K2020 InterestedIn data from atla-proc, with a 14-day extended window - */ - private[simclusters_v2] def simClustersRawInterestedIn2020Source( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - DAL - .readMostRecentSnapshot( - SimclustersV2RawInterestedIn20M145K2020ScalaDataset, - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - private[simclusters_v2] def simClustersRawInterestedInLite2020Source( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - DAL - .readMostRecentSnapshot( - SimclustersV2RawInterestedInLite20M145K2020ScalaDataset, - dateRange.extend(Days(14)(timeZone))) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Reads 20M145KDec11 production InterestedIn data from atla-proc, with a 14-day extended window - */ - def simClustersInterestedInDec11Source( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - - DAL - .readMostRecentSnapshot( - SimclustersV2InterestedInScalaDataset, - 
dateRange.prepend(Days(14)(timeZone))) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Reads 20M145KUpdated InterestedIn data from atla-proc, with a 14-day extended window - */ - def simClustersInterestedInUpdatedSource( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - DAL - .readMostRecentSnapshot( - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Reads 20M145K2020 InterestedIn data from atla-proc, with a 14-day extended window - */ - def simClustersInterestedIn2020Source( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - DAL - .readMostRecentSnapshot( - SimclustersV2InterestedIn20M145K2020ScalaDataset, - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - - /** - * Reads InterestedIn data based on ModelVersion from atla-proc, with a 14-day extended window - */ - def simClustersInterestedInSource( - modelVersion: ModelVersion, - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - - DAL - .readMostRecentSnapshot( - ModelVersionInterestedInDatasetMap(modelVersion), - dateRange.prepend(Days(14)(timeZone)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/ProducerEmbeddingSources.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/ProducerEmbeddingSources.scala deleted file mode 100644 index 01d391f11..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/ProducerEmbeddingSources.scala +++ /dev/null @@ -1,86 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources - -import com.twitter.scalding.DateRange -import com.twitter.scalding.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.Proc3Atla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore - -object ProducerEmbeddingSources { - - /** - * Helper function to retrieve producer SimClusters embeddings with the legacy `TopSimClustersWithScore` - * value type. 
- */ - def producerEmbeddingSourceLegacy( - embeddingType: EmbeddingType, - modelVersion: ModelVersion - )( - implicit dateRange: DateRange - ): TypedPipe[(Long, TopSimClustersWithScore)] = { - val producerEmbeddingDataset = (embeddingType, modelVersion) match { - case (EmbeddingType.ProducerFollowBasedSemanticCoreEntity, ModelVersion.Model20m145kDec11) => - ProducerTopKSimclusterEmbeddingsByFollowScoreScalaDataset - case (EmbeddingType.ProducerFavBasedSemanticCoreEntity, ModelVersion.Model20m145kDec11) => - ProducerTopKSimclusterEmbeddingsByFavScoreScalaDataset - case ( - EmbeddingType.ProducerFollowBasedSemanticCoreEntity, - ModelVersion.Model20m145kUpdated) => - ProducerTopKSimclusterEmbeddingsByFollowScoreUpdatedScalaDataset - case (EmbeddingType.ProducerFavBasedSemanticCoreEntity, ModelVersion.Model20m145kUpdated) => - ProducerTopKSimclusterEmbeddingsByFavScoreUpdatedScalaDataset - case (_, _) => - throw new ClassNotFoundException( - "Unsupported embedding type: " + embeddingType + " and model version: " + modelVersion) - } - - DAL - .readMostRecentSnapshot(producerEmbeddingDataset).withRemoteReadPolicy( - AllowCrossClusterSameDC) - .toTypedPipe.map { - case KeyVal(producerId, topSimClustersWithScore) => - (producerId, topSimClustersWithScore) - } - } - - def producerEmbeddingSource( - embeddingType: EmbeddingType, - modelVersion: ModelVersion - )( - implicit dateRange: DateRange - ): TypedPipe[(Long, SimClustersEmbedding)] = { - val producerEmbeddingDataset = (embeddingType, modelVersion) match { - case (EmbeddingType.AggregatableLogFavBasedProducer, ModelVersion.Model20m145k2020) => - AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset - case (EmbeddingType.AggregatableFollowBasedProducer, ModelVersion.Model20m145k2020) => - AggregatableProducerSimclustersEmbeddingsByFollowScore2020ScalaDataset - case (EmbeddingType.RelaxedAggregatableLogFavBasedProducer, ModelVersion.Model20m145k2020) => - AggregatableProducerSimclustersEmbeddingsByLogFavScoreRelaxedFavEngagementThreshold2020ScalaDataset - case (_, _) => - throw new ClassNotFoundException( - "Unsupported embedding type: " + embeddingType + " and model version: " + modelVersion) - } - - DAL - .readMostRecentSnapshot( - producerEmbeddingDataset - ) - .withRemoteReadPolicy(ExplicitLocation(Proc3Atla)) - .toTypedPipe - .map { - case KeyVal( - SimClustersEmbeddingId(_, _, InternalId.UserId(producerId: Long)), - embedding: SimClustersEmbedding) => - (producerId, embedding) - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/BUILD b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/BUILD deleted file mode 100644 index 7926b5dac..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/ml/api:embedding-scala", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterDetailsInjection.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterDetailsInjection.scala deleted file mode 100644 index 9f17cbad0..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterDetailsInjection.scala +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.bijection.Bufferable -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - ScalaCompactThrift, - genericInjection -} -import com.twitter.simclusters_v2.thriftscala.ClusterDetails - -object ClusterDetailsInjection { - val injection = KeyValInjection[(String, Int), ClusterDetails]( - genericInjection(Bufferable.injectionOf[(String, Int)]), - ScalaCompactThrift(ClusterDetails) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopMediaTweetsInjection.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopMediaTweetsInjection.scala deleted file mode 100644 index f542e0cbf..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopMediaTweetsInjection.scala +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift -import com.twitter.simclusters_v2.thriftscala.{TweetsWithScore, DayPartitionedClusterId} - -object ClusterTopMediaTweetsInjection { - - val injection = KeyValInjection[DayPartitionedClusterId, TweetsWithScore]( - ScalaCompactThrift(DayPartitionedClusterId), - ScalaCompactThrift(TweetsWithScore) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopTweetsInjection.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopTweetsInjection.scala deleted file mode 100644 index e09176813..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusterTopTweetsInjection.scala +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.FullClusterId - -object ClusterTopTweetsInjection { - - val clusterIdToTopKTweetsInjection = KeyValInjection[FullClusterId, TopKTweetsWithScores]( - ScalaCompactThrift(FullClusterId), - ScalaCompactThrift(TopKTweetsWithScores) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusteringInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusteringInjections.scala deleted file mode 100644 index 22ba173ca..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ClusteringInjections.scala +++ /dev/null @@ -1,16 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaBinaryThrift -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala._ - -object ClusteringInjections { - - final val OrderedClustersAndMembersInjection: KeyValInjection[ - UserId, - OrderedClustersAndMembers 
- ] = - KeyValInjection(Long2BigEndian, ScalaBinaryThrift(OrderedClustersAndMembers)) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/EntityEmbeddingsInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/EntityEmbeddingsInjections.scala deleted file mode 100644 index eb20bf3eb..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/EntityEmbeddingsInjections.scala +++ /dev/null @@ -1,47 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaBinaryThrift -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.ml.api.thriftscala.Embedding -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift - -object EntityEmbeddingsInjections { - - final val EntitySimClustersEmbeddingInjection: KeyValInjection[ - SimClustersEmbeddingId, - SimClustersEmbedding - ] = - KeyValInjection( - ScalaBinaryThrift(SimClustersEmbeddingId), - ScalaBinaryThrift(SimClustersEmbedding) - ) - - final val InternalIdEmbeddingInjection: KeyValInjection[ - SimClustersEmbeddingId, - InternalIdEmbedding - ] = - KeyValInjection( - ScalaBinaryThrift(SimClustersEmbeddingId), - ScalaBinaryThrift(InternalIdEmbedding) - ) - - final val EntitySimClustersMultiEmbeddingInjection: KeyValInjection[ - SimClustersMultiEmbeddingId, - SimClustersMultiEmbedding - ] = - KeyValInjection( - ScalaBinaryThrift(SimClustersMultiEmbeddingId), - ScalaBinaryThrift(SimClustersMultiEmbedding) - ) - - final val UserMbcgEmbeddingInjection: KeyValInjection[ - Long, - Embedding - ] = - KeyValInjection[Long, Embedding]( - Long2BigEndian, - ScalaCompactThrift(Embedding) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InferredEntitiesInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InferredEntitiesInjections.scala deleted file mode 100644 index fcb637a9d..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InferredEntitiesInjections.scala +++ /dev/null @@ -1,27 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Int2BigEndian, - Long2BigEndian, - ScalaCompactThrift -} -import com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities - -object InferredEntitiesInjections { - - final val InferredEntityInjection: KeyValInjection[Long, SimClustersInferredEntities] = - KeyValInjection( - Long2BigEndian, - ScalaCompactThrift(SimClustersInferredEntities) - ) - - final val InferredEntityKeyedByClusterInjection: KeyValInjection[ - Int, - SimClustersInferredEntities - ] = - KeyValInjection( - Int2BigEndian, - ScalaCompactThrift(SimClustersInferredEntities) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InterestedInInjection.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InterestedInInjection.scala deleted file mode 100644 index c9642ee94..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/InterestedInInjection.scala +++ /dev/null @@ -1,13 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import 
com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.StringUtf8 -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn - -object InterestedInInjection { - val injection = KeyValInjection(Long2BigEndian, ScalaCompactThrift(ClustersUserIsInterestedIn)) - val languageInjection = - KeyValInjection(StringUtf8, ScalaCompactThrift(ClustersUserIsInterestedIn)) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/KnownForInjection.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/KnownForInjection.scala deleted file mode 100644 index 9aca921ee..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/KnownForInjection.scala +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Long2BigEndian, - ScalaCompactThrift -} -import com.twitter.simclusters_v2.thriftscala._ - -object KnownForInjection { - val injection = KeyValInjection(Long2BigEndian, ScalaCompactThrift(ClustersUserIsKnownFor)) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/MultiTypeGraphInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/MultiTypeGraphInjections.scala deleted file mode 100644 index f674324c6..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/MultiTypeGraphInjections.scala +++ /dev/null @@ -1,31 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.ScalaCompactThrift -import com.twitter.simclusters_v2.thriftscala.LeftNode -import com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct -import com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList -import com.twitter.simclusters_v2.thriftscala.SimilarRightNodes -import com.twitter.simclusters_v2.thriftscala.CandidateTweetsList -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian - -object MultiTypeGraphInjections { - final val truncatedMultiTypeGraphInjection = - KeyValInjection(ScalaCompactThrift(LeftNode), ScalaCompactThrift(RightNodeWithEdgeWeightList)) - final val topKRightNounListInjection = - KeyValInjection( - ScalaCompactThrift(RightNodeTypeStruct), - ScalaCompactThrift(NounWithFrequencyList)) - final val similarRightNodesInjection = - KeyValInjection[RightNode, SimilarRightNodes]( - ScalaCompactThrift(RightNode), - ScalaCompactThrift(SimilarRightNodes) - ) - final val tweetRecommendationsInjection = - KeyValInjection[Long, CandidateTweetsList]( - Long2BigEndian, - ScalaCompactThrift(CandidateTweetsList) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ProducerEmbeddingsInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ProducerEmbeddingsInjections.scala deleted file mode 100644 index 087b6acc5..000000000 
--- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/ProducerEmbeddingsInjections.scala +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Long2BigEndian, - ScalaBinaryThrift, - ScalaCompactThrift -} -import com.twitter.simclusters_v2.thriftscala.{ - PersistedFullClusterId, - SimClustersEmbedding, - SimClustersEmbeddingId, - TopProducersWithScore, - TopSimClustersWithScore -} - -object ProducerEmbeddingsInjections { - final val ProducerTopKSimClusterEmbeddingsInjection: KeyValInjection[ - Long, - TopSimClustersWithScore - ] = - KeyValInjection( - keyCodec = Long2BigEndian, - valueCodec = ScalaCompactThrift(TopSimClustersWithScore)) - - final val SimClusterEmbeddingTopKProducersInjection: KeyValInjection[ - PersistedFullClusterId, - TopProducersWithScore - ] = - KeyValInjection( - keyCodec = ScalaCompactThrift(PersistedFullClusterId), - valueCodec = ScalaCompactThrift(TopProducersWithScore)) - - final val SimilarUsersInjection: KeyValInjection[Long, Candidates] = - KeyValInjection(keyCodec = Long2BigEndian, valueCodec = ScalaCompactThrift(Candidates)) - - final val ProducerSimClustersEmbeddingInjection: KeyValInjection[ - SimClustersEmbeddingId, - SimClustersEmbedding - ] = - KeyValInjection( - keyCodec = ScalaBinaryThrift(SimClustersEmbeddingId), - valueCodec = ScalaBinaryThrift(SimClustersEmbedding)) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SemanticCoreEntitiesInjections.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SemanticCoreEntitiesInjections.scala deleted file mode 100644 index 10f9d208f..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SemanticCoreEntitiesInjections.scala +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Long2BigEndian, - ScalaCompactThrift, - StringUtf8 -} -import com.twitter.recos.entities.thriftscala.{ - SemanticCoreEntityScoreList, - SemanticCoreEntityWithLocale, - UserIdWithLocale, - UserScoreList -} - -object SemanticCoreEntitiesInjections { - - final val StringToSemanticCoreEntityScoreListInjection: KeyValInjection[ - String, - SemanticCoreEntityScoreList - ] = - KeyValInjection( - StringUtf8, - ScalaCompactThrift(SemanticCoreEntityScoreList) - ) - - final val LongToSemanticCoreEntityScoreListInjection: KeyValInjection[ - Long, - SemanticCoreEntityScoreList - ] = - KeyValInjection( - Long2BigEndian, - ScalaCompactThrift(SemanticCoreEntityScoreList) - ) - - final val UserWithLocaleToSemanticCoreEntityScoreListInjection: KeyValInjection[ - UserIdWithLocale, - SemanticCoreEntityScoreList - ] = - KeyValInjection( - ScalaCompactThrift(UserIdWithLocale), - ScalaCompactThrift(SemanticCoreEntityScoreList) - ) - - final val SemanticCoreEntityWithLocaleToUsersScoreListInjection: KeyValInjection[ - SemanticCoreEntityWithLocale, - UserScoreList - ] = - KeyValInjection( - ScalaCompactThrift(SemanticCoreEntityWithLocale), - ScalaCompactThrift(UserScoreList) - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SingleSideUserScoresInjection.scala 
b/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SingleSideUserScoresInjection.scala deleted file mode 100644 index d3fb79901..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/injections/SingleSideUserScoresInjection.scala +++ /dev/null @@ -1,12 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.injections - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Long2BigEndian, - ScalaCompactThrift -} -import com.twitter.simclusters_v2.thriftscala.SingleSideUserScores - -object SingleSideUserScoresInjection { - val injection = KeyValInjection(Long2BigEndian, ScalaCompactThrift(SingleSideUserScores)) -} diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/BUILD b/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/BUILD deleted file mode 100644 index 0b02e4ce9..000000000 --- a/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/BUILD +++ /dev/null @@ -1,60 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":data_sources", - "3rdparty/src/jvm/com/twitter/scalding:core", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/wtf/entity_real_graph:entity_real_graph-thrift-scala", - ], -) - -scala_library( - name = "data_sources", - sources = [], - description = "DAL datasets we wish to expose externally", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":reverse_index_semantic_core_per_language_simclusters_embeddings_presto-scala", - ":semantic_core_per_language_simclusters_embeddings_presto-scala", - "src/scala/com/twitter/simclusters_v2/common", - ], -) - -create_datasets( - base_name = "reverse_index_semantic_core_per_language_simclusters_embeddings_presto", - java_schema = "com.twitter.simclusters_v2.thriftjava.InternalIdEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.InternalIdEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "semantic_core_per_language_simclusters_embeddings_presto", - java_schema = "com.twitter.simclusters_v2.thriftjava.SimClustersEmbeddingWithId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/EntityEmbeddingsPrestoSources.scala b/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/EntityEmbeddingsPrestoSources.scala deleted file mode 100644 index 740d0fadd..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources/EntityEmbeddingsPrestoSources.scala +++ /dev/null @@ -1,10 +0,0 @@ -package com.twitter.simclusters_v2.hdfs_sources.presto_hdfs_sources - -object EntityEmbeddingsPrestoSources { - - final val SemanticCorePerLanguageSimClustersEmbeddingsDataset = - SemanticCorePerLanguageSimclustersEmbeddingsPrestoScalaDataset - - final val ReverseIndexSemanticCorePerLanguageSimClustersEmbeddingsDataset = - ReverseIndexSemanticCorePerLanguageSimclustersEmbeddingsPrestoScalaDataset -} diff --git a/src/scala/com/twitter/simclusters_v2/images/bipartite_graph.png b/src/scala/com/twitter/simclusters_v2/images/bipartite_graph.png deleted file mode 100644 index 15baf9b82..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/bipartite_graph.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/images/interestedin.png b/src/scala/com/twitter/simclusters_v2/images/interestedin.png deleted file mode 100644 index 28142e633..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/interestedin.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/images/knownfor.png b/src/scala/com/twitter/simclusters_v2/images/knownfor.png deleted file mode 100644 index 7625caf3a..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/knownfor.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/images/producer_embeddings.png b/src/scala/com/twitter/simclusters_v2/images/producer_embeddings.png deleted file mode 100644 index 054e12242..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/producer_embeddings.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/images/producer_producer_similarity.png b/src/scala/com/twitter/simclusters_v2/images/producer_producer_similarity.png deleted file mode 100644 index 616ca56c0..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/producer_producer_similarity.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/images/topic_embeddings.png b/src/scala/com/twitter/simclusters_v2/images/topic_embeddings.png deleted file mode 100644 index 758ad1acf..000000000 Binary files a/src/scala/com/twitter/simclusters_v2/images/topic_embeddings.png and /dev/null differ diff --git a/src/scala/com/twitter/simclusters_v2/scalding/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/BUILD deleted file mode 100644 index eb0a31038..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/BUILD +++ /dev/null @@ -1,521 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/fasterxml/jackson:jackson-module-scala", - "3rdparty/jvm/com/fasterxml/jackson/core:jackson-core", - "3rdparty/jvm/com/fasterxml/jackson/core:jackson-databind", - "3rdparty/jvm/com/fasterxml/jackson/module:jackson-module-scala", - "3rdparty/jvm/com/googlecode/matrix-toolkits-java", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "escherbird/src/scala/com/twitter/escherbird/scalding/source", - "flockdb-tools/datasets/flock:flock-follows-edges-scala", - "src/java/com/twitter/ml/api/constant", - "src/java/com/twitter/sbf/core", - "src/java/com/twitter/sbf/graph", - "src/scala/com/twitter/frigate/user_sampler/common", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/api/bq", - "src/scala/com/twitter/pluck/source/cassowary:sims", - 
"src/scala/com/twitter/pluck/source/core_workflows/user_model:condensed_user_state-scala", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/candidate_source", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/summingbird/common", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/itl", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/wtf/scalding/sims:sims-thrift-scala", - "twadoop_config/configuration/log_categories/group/recos-platform:content_recommender_get_content_recommendations-scala", - "twadoop_config/configuration/log_categories/group/recos-platform:content_recommender_get_topic_tweets_recommendations-scala", - "twadoop_config/configuration/log_categories/group/timeline:timeline_service_favorites-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - "util/util-core:util-core-util", - ], -) - -hadoop_binary( - name = "evd_cluster_similarity", - main = "com.twitter.simclusters_v2.scalding.EigenVectorsForClusterSimilarityAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_evaluation", - main = "com.twitter.simclusters_v2.scalding.ClusterEvaluationAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_evaluation_20m_145k", - main = "com.twitter.simclusters_v2.scalding.ClusterEvaluationFor20M145K", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_evaluation_20m_145k_2020", - main = "com.twitter.simclusters_v2.scalding.ClusterEvaluationFor20M145K2020", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "bp_cluster_evaluation", - main = "com.twitter.simclusters_v2.scalding.BipartiteClusterEvaluation", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "update_knownfor", - main = "com.twitter.simclusters_v2.scalding.UpdateKnownForAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "update_knownfor_prod", - main = "com.twitter.simclusters_v2.scalding.UpdateKnownFor20M145K", - platform 
= "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_details", - main = "com.twitter.simclusters_v2.scalding.ClusterDetailsBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_details_20m_145k_updated", - main = "com.twitter.simclusters_v2.scalding.ClusterDetails20M145KUpdated", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_details_20m_145k_2020", - main = "com.twitter.simclusters_v2.scalding.ClusterDetails20M145K2020", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_details-adhoc", - main = "com.twitter.simclusters_v2.scalding.ClusterDetailsAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "cluster_details-dump", - main = "com.twitter.simclusters_v2.scalding.DumpClusterDetailsAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownForBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_from_producer_embeddings", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromProducerEmbeddingsBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "employee_graph_from_user_user", - main = "com.twitter.simclusters_v2.scalding.EmployeeGraphFromUserUser", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_20m_145k_updated", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownFor20M145KUpdated", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_20m_145k_2020", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownFor20M145K2020", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_lite_20m_145k_2020", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownForLite20M145K2020", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - 
"bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_lite_20m_145k_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownForLite20M145K2020Adhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_from_ape_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromAPE2020AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "interested_in_from_ape_2020", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromAPE2020BatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "known_for_to_mh", - main = "com.twitter.simclusters_v2.scalding.KnownForToMHBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "user_user_normalized_graph", - main = "com.twitter.simclusters_v2.scalding.UserUserNormalizedGraphBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "user_user_graph", - main = "com.twitter.simclusters_v2.scalding.UserUserGraphBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "user_user_graph-adhoc", - main = "com.twitter.simclusters_v2.scalding.UserUserGraphAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "producer_norms_and_counts", - main = "com.twitter.simclusters_v2.scalding.ProducerNormsAndCountsBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "fav_graph", - main = "com.twitter.simclusters_v2.scalding.UserUserFavGraphBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "top_users_similarity_graph", - main = "com.twitter.simclusters_v2.scalding.TopUsersSimilarityGraphApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "top_users_only", - main = "com.twitter.simclusters_v2.scalding.TopUsersOnlyApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -hadoop_binary( - name = "dump_fav_graph_adhoc", - main = 
"com.twitter.simclusters_v2.scalding.DumpFavGraphAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) - -# Generated with `capesospy-v2 create_target interested_in_for_20M_145k_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml`, config hash 8f19bf. -scalding_job( - name = "interested_in_for_20M_145k_2020", - main = "com.twitter.simclusters_v2.scalding.InterestedInFromKnownFor20M145K2020", - args = ["--socialProofThreshold 2 --maxClustersPerUser 50"], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "14 * * * *", - hadoop_cluster = "atla-proc", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":scalding", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluation.scala b/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluation.scala deleted file mode 100644 index 0382b1472..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluation.scala +++ /dev/null @@ -1,513 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Aggregator -import com.twitter.algebird.Monoid -import com.twitter.scalding._ -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.NormsAndCountsFixedPathSource -import com.twitter.simclusters_v2.hdfs_sources.ProducerNormsAndCountsScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedInScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserAndNeighborsFixedPathSource -import com.twitter.simclusters_v2.hdfs_sources.UserUserNormalizedGraphScalaDataset -import com.twitter.simclusters_v2.scalding.BipartiteClusterEvaluationClasses._ -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.BipartiteClusterQuality -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights -import com.twitter.simclusters_v2.thriftscala.NormsAndCounts -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import scala.collection.JavaConverters._ - -object BipartiteClusterEvaluation extends TwitterExecutionApp { - - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - private def getClusterL2Norms( - knownFor: TypedPipe[(Long, Array[(Int, Float)])] - ): Execution[Map[Int, Float]] = { - knownFor - .flatMap { - case (_, clusterArray) => - clusterArray.map { - case (clusterId, score) => - Map(clusterId -> score * score) - } - } - .sum - .getExecution - .map(_.mapValues { x => 
math.sqrt(x).toFloat }) - } - - def l2NormalizeKnownFor( - knownFor: TypedPipe[(Long, Array[(Int, Float)])] - ): Execution[TypedPipe[(Long, Array[(Int, Float)])]] = { - getClusterL2Norms(knownFor).map { clusterToNorms => - knownFor.mapValues { clusterScoresArray => - clusterScoresArray.map { - case (clusterId, score) => - (clusterId, score / clusterToNorms(clusterId)) - } - } - } - } - - /** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:bp_cluster_evaluation && \ - * oscar hdfs --user frigate --host hadoopnest2.atla.twitter.com --bundle bp_cluster_evaluation \ - * --tool com.twitter.simclusters_v2.scalding.BipartiteClusterEvaluation --screen --screen-detached \ - * --tee logs/newBpQuality_updateUnnormalizedScores_interestedInUsing20190329Graph_evaluatedOn20190329Graph_run2 \ - * -- --normsAndCountsDir /user/frigate/your_ldap/producerNormsAndCounts_20190330 \ - * --graphInputDir /user/frigate/your_ldap/user_user_normalized_graph_copiedFromAtlaProc_20190329 \ - * --knownForDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/knownFor \ - * --interestedInDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/interestedInUsing20190329Graph \ - * --outgoingVolumesResultsDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/bpQualityForInterestedInUsing20190329On20190329Graph_outgoingVolumes \ - * --incomingVolumesResultsDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/bpQualityForInterestedInUsing20190329On20190329Graph_incomingVolumes \ - * --outputDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/bpQualityForInterestedInUsing20190329On20190329Graph_perCluster \ - * --toEmailAddress your_ldap@twitter.com --modelVersion 20M_145K_updated - */ - override def job: Execution[Unit] = Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - - val interestedIn = args.optional("interestedInDir") match { - case Some(dir) => - TypedPipe - .from(AdhocKeyValSources.interestedInSource(args("interestedInDir"))) - case None => - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2InterestedInScalaDataset, - Days(20) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(key, value) => (key, value) - } - } - - val inputKnownFor = args - .optional("knownForDir") - .map { location => KnownForSources.readKnownFor(location) } - .getOrElse(KnownForSources.knownFor_20M_Dec11_145K) - - val modelVersion = - args.optional("modelVersion").getOrElse("20M_145K_dec11") - - val useLogFavWeights = args.boolean("useLogFavWeights") - - val shouldL2NormalizeKnownFor = args.boolean("l2NormalizeKnownFor") - - val toEmailAddressOpt = args.optional("toEmailAddress") - - val knownForExec = if (shouldL2NormalizeKnownFor) { - l2NormalizeKnownFor(inputKnownFor) - } else { - Execution.from(inputKnownFor) - } - - val finalExec = knownForExec.flatMap { knownFor => - val graph = args.optional("graphInputDir") match { - case Some(dir) => - TypedPipe.from(UserAndNeighborsFixedPathSource(dir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(20)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - val producerNormsAndCounts = args.optional("normsAndCountsDir") match 
{ - case Some(dir) => - TypedPipe.from(NormsAndCountsFixedPathSource(dir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(ProducerNormsAndCountsScalaDataset, Days(20)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - val clusterIncomingVolumesExec = loadOrMake( - computeClusterIncomingVolumes(knownFor, producerNormsAndCounts, useLogFavWeights), - modelVersion, - args("incomingVolumesResultsDir") - ) - - val resultsWithOutgoingVolumesExec = loadOrMake( - getResultsWithOutgoingVolumes(graph, interestedIn, useLogFavWeights), - modelVersion, - args("outgoingVolumesResultsDir") - ) - - val finalPerClusterResultsExec = - finalPerClusterResults( - knownFor, - interestedIn, - resultsWithOutgoingVolumesExec, - clusterIncomingVolumesExec) - .flatMap { pipe => loadOrMake(pipe, modelVersion, args("outputDir")) } - - finalPerClusterResultsExec.flatMap { finalPerClusterResults => - val perClusterResults = finalPerClusterResults.values - val distributionResultsExec = getClusterResultsSummary(perClusterResults).map { - case Some(summary) => - "Summary of results across clusters: \n" + - Util.prettyJsonMapper.writeValueAsString(summary) - case _ => - "No summary of results! The cluster level results pipe must be empty!" - } - - val overallResultsExec = perClusterResults.sum.toOptionExecution.map { - case Some(overallQuality) => - "Overall Quality: \n" + - Util.prettyJsonMapper.writeValueAsString( - printableBipartiteQuality(overallQuality) - ) - case _ => - "No overall quality! The cluster level results pipe must be empty!" - } - - Execution.zip(distributionResultsExec, overallResultsExec).map { - case (distResults, overallResults) => - toEmailAddressOpt.foreach { address => - Util.sendEmail( - distResults + "\n" + overallResults, - "Bipartite cluster quality for " + modelVersion, - address - ) - } - println(distResults + "\n" + overallResults) - } - } - } - Util.printCounters(finalExec) - } - } - - def getResultsWithOutgoingVolumes( - graph: TypedPipe[UserAndNeighbors], - interestedIn: TypedPipe[(Long, ClustersUserIsInterestedIn)], - useLogFavWeights: Boolean - ): TypedPipe[(Int, BipartiteClusterQuality)] = { - graph - .map { un => (un.userId, un.neighbors) } - // should this be a leftJoin? For now, leaving it as an inner join. If in the future, - // we want to compare two approaches with very different coverages on interestedIn, this - // could become a problem.
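- // (As written, a follower with no interestedIn record is dropped by the inner join below, so that user's edges are also excluded from the outgoing-volume denominators.)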
- .join(interestedIn) - .withReducers(4000) - .flatMap { - case (userId, (neighbors, clusters)) => - getBIResultsFromSingleUser(userId, neighbors, clusters, useLogFavWeights) - } - .sumByKey - .withReducers(600) - .map { - case (clusterId, bir) => - ( - clusterId, - BipartiteClusterQuality( - inClusterFollowEdges = Some(bir.inClusterWeights.isFollowEdge), - inClusterFavEdges = Some(bir.inClusterWeights.isFavEdge), - favWtSumOfInClusterFollowEdges = Some(bir.inClusterWeights.favWtIfFollowEdge), - favWtSumOfInClusterFavEdges = Some(bir.inClusterWeights.favWtIfFavEdge), - outgoingFollowEdges = Some(bir.totalOutgoingVolumes.isFollowEdge), - outgoingFavEdges = Some(bir.totalOutgoingVolumes.isFavEdge), - favWtSumOfOutgoingFollowEdges = Some(bir.totalOutgoingVolumes.favWtIfFollowEdge), - favWtSumOfOutgoingFavEdges = Some(bir.totalOutgoingVolumes.favWtIfFavEdge), - interestedInSize = Some(bir.interestedInSize), - sampledEdges = Some( - bir.edgeSample - .iterator() - .asScala - .toSeq - .map { - case (edge, data) => makeThriftSampledEdge(edge, data) - } - ) - ) - ) - } - } - - def getBIResultsFromSingleUser( - userId: Long, - neighbors: Seq[NeighborWithWeights], - clusters: ClustersUserIsInterestedIn, - useLogFavScores: Boolean - ): List[(Int, BipartiteIntermediateResults)] = { - val neighborsToWeights = neighbors.map { neighborAndWeights => - val isFollowEdge = neighborAndWeights.isFollowed match { - case Some(true) => 1.0 - case _ => 0.0 - } - val favScore = if (useLogFavScores) { - neighborAndWeights.logFavScore.getOrElse(0.0) - } else neighborAndWeights.favScoreHalfLife100Days.getOrElse(0.0) - val isFavEdge = math.min(1, math.ceil(favScore)) - neighborAndWeights.neighborId -> Weights( - isFollowEdge, - isFavEdge, - favScore * isFollowEdge, - favScore - ) - }.toMap - - val outgoingVolumes = Monoid.sum(neighborsToWeights.values)(WeightsMonoid) - - clusters.clusterIdToScores.toList.map { - case (clusterId, scoresStruct) => - val inClusterNeighbors = - (scoresStruct.usersBeingFollowed.getOrElse(Nil) ++ - scoresStruct.usersThatWereFaved.getOrElse(Nil)).toSet - val edgesForSampling = inClusterNeighbors.flatMap { neighborId => - if (neighborsToWeights.contains(neighborId)) { - Some( - (userId, neighborId), - SampledEdgeData( - neighborsToWeights(neighborId).favWtIfFollowEdge, - neighborsToWeights(neighborId).favWtIfFavEdge, - scoresStruct.followScore.getOrElse(0.0), - scoresStruct.favScore.getOrElse(0.0) - ) - ) - } else { - None - } - } - - val inClusterWeights = - Monoid.sum(neighborsToWeights.filterKeys(inClusterNeighbors).values)(WeightsMonoid) - - ( - clusterId, - BipartiteIntermediateResults( - inClusterWeights, - outgoingVolumes, - 1, - samplerMonoid.build(edgesForSampling) - )) - } - } - - def computeClusterIncomingVolumes( - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - producerNormsAndCounts: TypedPipe[NormsAndCounts], - useLogFavWeights: Boolean - ): TypedPipe[(Int, BipartiteClusterQuality)] = { - producerNormsAndCounts - .map { x => (x.userId, x) } - .join(knownFor) - .withReducers(100) - .flatMap { - case (userId, (normsAndCounts, clusters)) => - clusters.map { - case (clusterId, _) => - val followerCount = - normsAndCounts.followerCount.getOrElse(0L).toDouble - val faverCount = normsAndCounts.faverCount.getOrElse(0L).toDouble - val favWtSumOfIncomingFollows = if (useLogFavWeights) { - normsAndCounts.logFavWeightsOnFollowEdgesSum.getOrElse(0.0) - } else { - normsAndCounts.favWeightsOnFollowEdgesSum.getOrElse(0.0) - } - val favWtSumOfIncomingFavs = if (useLogFavWeights) { - 
normsAndCounts.logFavWeightsOnFavEdgesSum.getOrElse(0.0) - } else { - normsAndCounts.favWeightsOnFavEdgesSum.getOrElse(0.0) - } - ( - clusterId, - BipartiteClusterQuality( - incomingFollowEdges = Some(followerCount), - incomingFavEdges = Some(faverCount), - favWtSumOfIncomingFollowEdges = Some(favWtSumOfIncomingFollows), - favWtSumOfIncomingFavEdges = Some(favWtSumOfIncomingFavs) - )) - } - } - .sumByKey - .toTypedPipe - } - - def loadOrMake( - pipe: TypedPipe[(Int, BipartiteClusterQuality)], - modelVersion: String, - path: String - ): Execution[TypedPipe[(Int, BipartiteClusterQuality)]] = { - val mapped = pipe.map { - case (clusterId, struct) => ((modelVersion, clusterId), struct) - } - makeForKeyValSource(mapped, AdhocKeyValSources.bipartiteQualitySource(path), path).map { pipe => - // discard model version - pipe.map { case ((_, clusterId), struct) => (clusterId, struct) } - } - } - - def makeForKeyValSource[K, V]( - pipe: TypedPipe[(K, V)], - dest: VersionedKeyValSource[K, V], - path: String - ): Execution[TypedPipe[(K, V)]] = - Execution.getMode.flatMap { mode => - if (dest.resourceExists(mode)) { - println(s"validated path $path") - Execution.from(TypedPipe.from(dest)) - } else { - println(s"Could not load from $path") - pipe.writeThrough(dest) - } - } - - def precisionOfWholeGraph( - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - interestedIn: TypedPipe[(Long, ClustersUserIsInterestedIn)], - clusterIncomingVolumesExec: Execution[TypedPipe[(Int, BipartiteClusterQuality)]] - ): Execution[Option[Double]] = { - val knownForSizeExec = knownFor.aggregate(Aggregator.size).toOptionExecution - val interestedInSizeExec = - interestedIn.aggregate(Aggregator.size).toOptionExecution - val numExec = clusterIncomingVolumesExec.flatMap { volumes => - volumes.values.flatMap(_.favWtSumOfIncomingFavEdges).sum.toOptionExecution - } - Execution.zip(numExec, interestedInSizeExec, knownForSizeExec).map { - case (Some(num), Some(interestedInSize), Some(knownForSize)) => - Some(num / interestedInSize / knownForSize) - case x @ _ => - println("Precision of whole graph zip: " + x) - None - } - } - - def finalPerClusterResults( - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - interestedIn: TypedPipe[(Long, ClustersUserIsInterestedIn)], - resultsWithOutgoingVolumesExec: Execution[TypedPipe[(Int, BipartiteClusterQuality)]], - incomingVolumesExec: Execution[TypedPipe[(Int, BipartiteClusterQuality)]] - ): Execution[TypedPipe[(Int, BipartiteClusterQuality)]] = { - val knownForTranspose = KnownForSources.transpose(knownFor) - - val precisionOfWholeGraphExec = - precisionOfWholeGraph(knownFor, interestedIn, incomingVolumesExec) - - Execution - .zip(resultsWithOutgoingVolumesExec, incomingVolumesExec, precisionOfWholeGraphExec) - .map { - case (resultsWithOutgoingVolumes, clusterIncomingVolumes, precisionOfWholeGraph) => - println("Precision of whole graph " + precisionOfWholeGraph) - resultsWithOutgoingVolumes - .join(knownForTranspose) - .leftJoin(clusterIncomingVolumes) - .withReducers(500) - .map { - case (clusterId, ((outgoingVolumeQuality, knownForList), incomingVolumesOpt)) => - val incomingVolumes = - incomingVolumesOpt.getOrElse(BipartiteClusterQuality()) - val knownForMap = knownForList.toMap - ( - clusterId, - getFullQuality( - outgoingVolumeQuality, - incomingVolumes, - knownForMap, - precisionOfWholeGraph)) - } - } - } - - def getFullQuality( - qualityWithOutgoingVolumes: BipartiteClusterQuality, - incomingVolumes: BipartiteClusterQuality, - knownFor: Map[Long, Float], - precisionOfWholeGraph: 
Option[Double] - ): BipartiteClusterQuality = { - val newSampledEdges = qualityWithOutgoingVolumes.sampledEdges.map { sampledEdges => - sampledEdges.map { sampledEdge => - val knownForScore = knownFor.getOrElse(sampledEdge.followeeId, 0.0f) - sampledEdge.copy( - predictedFollowScore = sampledEdge.followScoreToCluster.map { x => x * knownForScore }, - predictedFavScore = sampledEdge.favScoreToCluster.map { x => x * knownForScore } - ) - } - } - val correlationOfFavWtIfFollow = newSampledEdges.map { samples => - val pairs = samples.map { s => - (s.predictedFollowScore.getOrElse(0.0), s.favWtIfFollowEdge.getOrElse(0.0)) - } - Util.computeCorrelation(pairs.iterator) - } - val correlationOfFavWtIfFav = newSampledEdges.map { samples => - val pairs = samples.map { s => - (s.predictedFavScore.getOrElse(0.0), s.favWtIfFavEdge.getOrElse(0.0)) - } - Util.computeCorrelation(pairs.iterator) - } - val relativePrecisionNum = { - if (qualityWithOutgoingVolumes.interestedInSize.exists(_ > 0) && knownFor.nonEmpty) { - qualityWithOutgoingVolumes.favWtSumOfInClusterFavEdges - .getOrElse(0.0) / qualityWithOutgoingVolumes.interestedInSize.get / knownFor.size - } else 0.0 - } - val relativePrecision = if (precisionOfWholeGraph.exists(_ > 0.0)) { - Some(relativePrecisionNum / precisionOfWholeGraph.get) - } else None - qualityWithOutgoingVolumes.copy( - incomingFollowEdges = incomingVolumes.incomingFollowEdges, - incomingFavEdges = incomingVolumes.incomingFavEdges, - favWtSumOfIncomingFollowEdges = incomingVolumes.favWtSumOfIncomingFollowEdges, - favWtSumOfIncomingFavEdges = incomingVolumes.favWtSumOfIncomingFavEdges, - knownForSize = Some(knownFor.size), - correlationOfFavWtIfFollowWithPredictedFollow = correlationOfFavWtIfFollow, - correlationOfFavWtIfFavWithPredictedFav = correlationOfFavWtIfFav, - sampledEdges = newSampledEdges, - relativePrecisionUsingFavWtIfFav = relativePrecision, - averagePrecisionOfWholeGraphUsingFavWtIfFav = precisionOfWholeGraph - ) - } -} - -object DumpBpQuality extends TwitterExecutionApp { - def job: Execution[Unit] = Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val inputDir = args("inputDir") - - val clusters = args.list("clusters").map(_.toInt).toSet - val input = - TypedPipe - .from(AdhocKeyValSources.bipartiteQualitySource(inputDir)) - .map { - case ((modelVersion, clusterId), quality) => - ( - (modelVersion, clusterId), - BipartiteClusterEvaluationClasses - .printableBipartiteQuality(quality)) - } - - if (clusters.isEmpty) { - input.printSummary("Bipartite quality") - } else { - input - .collect { - case rec @ ((_, clusterId), quality) if clusters(clusterId) => - Util.prettyJsonMapper - .writeValueAsString(rec) - .replaceAll("\n", " ") - } - .toIterableExecution - .map { strings => println(strings.mkString("\n")) } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluationClasses.scala b/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluationClasses.scala deleted file mode 100644 index f5acc5365..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/BipartiteClusterEvaluationClasses.scala +++ /dev/null @@ -1,316 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.{Monoid, OptionMonoid, Semigroup} -import com.twitter.algebird.mutable.PriorityQueueMonoid -import com.twitter.scalding.Execution -import com.twitter.scalding.typed.TypedPipe -import com.twitter.simclusters_v2.scalding.common.Util -import 
com.twitter.simclusters_v2.scalding.common.Util.Distribution -import com.twitter.simclusters_v2.thriftscala.{BipartiteClusterQuality, SampledEdge} -import java.util.PriorityQueue -import scala.collection.JavaConverters._ - -object BipartiteClusterEvaluationClasses { - case class Weights( - isFollowEdge: Double, - isFavEdge: Double, - favWtIfFollowEdge: Double, - favWtIfFavEdge: Double) - - object WeightsMonoid extends Monoid[Weights] { - override def zero = Weights(0.0, 0.0, 0.0, 0.0) - - override def plus(l: Weights, r: Weights): Weights = { - Weights( - l.isFollowEdge + r.isFollowEdge, - l.isFavEdge + r.isFavEdge, - l.favWtIfFollowEdge + r.favWtIfFollowEdge, - l.favWtIfFavEdge + r.favWtIfFavEdge - ) - } - } - - implicit val wm: Monoid[Weights] = WeightsMonoid - - case class SampledEdgeData( - favWtIfFollowEdge: Double, - favWtIfFavEdge: Double, - followScoreToCluster: Double, - favScoreToCluster: Double) - - implicit val samplerMonoid: PriorityQueueMonoid[((Long, Long), SampledEdgeData)] = - Util.reservoirSamplerMonoidForPairs[(Long, Long), SampledEdgeData](2000)(Util.edgeOrdering) - - implicit val sampledEdgesMonoid: PriorityQueueMonoid[SampledEdge] = - Util.reservoirSamplerMonoid( - 10000, - { sampledEdge: SampledEdge => (sampledEdge.followerId, sampledEdge.followeeId) } - )(Util.edgeOrdering) - - case class BipartiteIntermediateResults( - inClusterWeights: Weights, - totalOutgoingVolumes: Weights, - interestedInSize: Int, - edgeSample: PriorityQueue[((Long, Long), SampledEdgeData)]) { - override def toString: String = { - "BCR(%s, %s, %d, %s)".format( - inClusterWeights, - totalOutgoingVolumes, - interestedInSize, - edgeSample.iterator().asScala.toSeq.toString() - ) - } - } - - object BIRMonoid extends Monoid[BipartiteIntermediateResults] { - override def zero = - BipartiteIntermediateResults(WeightsMonoid.zero, WeightsMonoid.zero, 0, samplerMonoid.zero) - - override def plus( - l: BipartiteIntermediateResults, - r: BipartiteIntermediateResults - ): BipartiteIntermediateResults = { - BipartiteIntermediateResults( - WeightsMonoid.plus(l.inClusterWeights, r.inClusterWeights), - WeightsMonoid.plus(l.totalOutgoingVolumes, r.totalOutgoingVolumes), - l.interestedInSize + r.interestedInSize, - samplerMonoid.plus(l.edgeSample, r.edgeSample) - ) - } - } - - implicit val bIRMonoid: Monoid[BipartiteIntermediateResults] = BIRMonoid - - def makeThriftSampledEdge(edge: (Long, Long), data: SampledEdgeData): SampledEdge = { - val (followerId, followeeId) = edge - SampledEdge( - followerId = followerId, - followeeId = followeeId, - favWtIfFollowEdge = Some(data.favWtIfFollowEdge), - favWtIfFavEdge = Some(data.favWtIfFavEdge), - followScoreToCluster = Some(data.followScoreToCluster), - favScoreToCluster = Some(data.favScoreToCluster) - ) - } - - object ClusterQualitySemigroup extends Semigroup[BipartiteClusterQuality] { - val doubleOM: Monoid[Option[Double]] = new OptionMonoid[Double] - val intOM: Monoid[Option[Int]] = new OptionMonoid[Int] - val longOM: Monoid[Option[Long]] = new OptionMonoid[Long] - - override def plus(l: BipartiteClusterQuality, r: BipartiteClusterQuality) = - BipartiteClusterQuality( - inClusterFollowEdges = doubleOM.plus(l.inClusterFollowEdges, r.inClusterFollowEdges), - inClusterFavEdges = doubleOM.plus(l.inClusterFavEdges, r.inClusterFavEdges), - favWtSumOfInClusterFollowEdges = doubleOM - .plus(l.favWtSumOfInClusterFollowEdges, r.favWtSumOfInClusterFollowEdges), - favWtSumOfInClusterFavEdges = doubleOM - .plus(l.favWtSumOfInClusterFavEdges, r.favWtSumOfInClusterFavEdges), - 
outgoingFollowEdges = doubleOM.plus(l.outgoingFollowEdges, r.outgoingFollowEdges), - outgoingFavEdges = doubleOM.plus(l.outgoingFavEdges, r.outgoingFavEdges), - favWtSumOfOutgoingFollowEdges = doubleOM - .plus(l.favWtSumOfOutgoingFollowEdges, r.favWtSumOfOutgoingFollowEdges), - favWtSumOfOutgoingFavEdges = doubleOM - .plus(l.favWtSumOfOutgoingFavEdges, r.favWtSumOfOutgoingFavEdges), - incomingFollowEdges = doubleOM.plus(l.incomingFollowEdges, r.incomingFollowEdges), - incomingFavEdges = doubleOM.plus(l.incomingFavEdges, r.incomingFavEdges), - favWtSumOfIncomingFollowEdges = doubleOM - .plus(l.favWtSumOfIncomingFollowEdges, r.favWtSumOfIncomingFollowEdges), - favWtSumOfIncomingFavEdges = doubleOM - .plus(l.favWtSumOfIncomingFavEdges, r.favWtSumOfIncomingFavEdges), - interestedInSize = None, - sampledEdges = Some( - sampledEdgesMonoid - .plus( - sampledEdgesMonoid.build(l.sampledEdges.getOrElse(Nil)), - sampledEdgesMonoid.build(r.sampledEdges.getOrElse(Nil)) - ) - .iterator() - .asScala - .toSeq), - knownForSize = intOM.plus(l.knownForSize, r.knownForSize), - correlationOfFavWtIfFollowWithPredictedFollow = None, - correlationOfFavWtIfFavWithPredictedFav = None, - relativePrecisionUsingFavWtIfFav = None, - averagePrecisionOfWholeGraphUsingFavWtIfFav = l.averagePrecisionOfWholeGraphUsingFavWtIfFav - ) - } - - implicit val bcqSemigroup: Semigroup[BipartiteClusterQuality] = - ClusterQualitySemigroup - - case class PrintableBipartiteQuality( - incomingFollowUnweightedRecall: String, - incomingFavUnweightedRecall: String, - incomingFollowWeightedRecall: String, - incomingFavWeightedRecall: String, - outgoingFollowUnweightedRecall: String, - outgoingFavUnweightedRecall: String, - outgoingFollowWeightedRecall: String, - outgoingFavWeightedRecall: String, - incomingFollowEdges: String, - incomingFavEdges: String, - favWtSumOfIncomingFollowEdges: String, - favWtSumOfIncomingFavEdges: String, - outgoingFollowEdges: String, - outgoingFavEdges: String, - favWtSumOfOutgoingFollowEdges: String, - favWtSumOfOutgoingFavEdges: String, - correlationOfFavWtIfFollow: String, - correlationOfFavWtIfFav: String, - relativePrecisionUsingFavWt: String, - averagePrecisionOfWholeGraphUsingFavWt: String, - interestedInSize: String, - knownForSize: String) - - def printableBipartiteQuality(in: BipartiteClusterQuality): PrintableBipartiteQuality = { - def getRatio(numOpt: Option[Double], denOpt: Option[Double]): String = { - val r = if (denOpt.exists(_ > 0)) { - numOpt.getOrElse(0.0) / denOpt.get - } else 0.0 - "%.3f".format(r) - } - - val formatter = new java.text.DecimalFormat("###,###.#") - - def denString(denOpt: Option[Double]): String = - formatter.format(denOpt.getOrElse(0.0)) - - val correlationOfFavWtIfFollow = - in.correlationOfFavWtIfFollowWithPredictedFollow match { - case None => - in.sampledEdges.map { samples => - val pairs = samples.map { s => - (s.predictedFollowScore.getOrElse(0.0), s.favWtIfFollowEdge.getOrElse(0.0)) - } - Util.computeCorrelation(pairs.iterator) - } - case x @ _ => x - } - - val correlationOfFavWtIfFav = - in.correlationOfFavWtIfFavWithPredictedFav match { - case None => - in.sampledEdges.map { samples => - val pairs = samples.map { s => - (s.predictedFavScore.getOrElse(0.0), s.favWtIfFavEdge.getOrElse(0.0)) - } - Util.computeCorrelation(pairs.iterator) - } - case x @ _ => x - } - - PrintableBipartiteQuality( - incomingFollowUnweightedRecall = getRatio(in.inClusterFollowEdges, in.incomingFollowEdges), - incomingFavUnweightedRecall = getRatio(in.inClusterFavEdges, in.incomingFavEdges), - 
incomingFollowWeightedRecall = - getRatio(in.favWtSumOfInClusterFollowEdges, in.favWtSumOfIncomingFollowEdges), - incomingFavWeightedRecall = - getRatio(in.favWtSumOfInClusterFavEdges, in.favWtSumOfIncomingFavEdges), - outgoingFollowUnweightedRecall = getRatio(in.inClusterFollowEdges, in.outgoingFollowEdges), - outgoingFavUnweightedRecall = getRatio(in.inClusterFavEdges, in.outgoingFavEdges), - outgoingFollowWeightedRecall = - getRatio(in.favWtSumOfInClusterFollowEdges, in.favWtSumOfOutgoingFollowEdges), - outgoingFavWeightedRecall = - getRatio(in.favWtSumOfInClusterFavEdges, in.favWtSumOfOutgoingFavEdges), - incomingFollowEdges = denString(in.incomingFollowEdges), - incomingFavEdges = denString(in.incomingFavEdges), - favWtSumOfIncomingFollowEdges = denString(in.favWtSumOfIncomingFollowEdges), - favWtSumOfIncomingFavEdges = denString(in.favWtSumOfIncomingFavEdges), - outgoingFollowEdges = denString(in.outgoingFollowEdges), - outgoingFavEdges = denString(in.outgoingFavEdges), - favWtSumOfOutgoingFollowEdges = denString(in.favWtSumOfOutgoingFollowEdges), - favWtSumOfOutgoingFavEdges = denString(in.favWtSumOfOutgoingFavEdges), - correlationOfFavWtIfFollow = "%.3f" - .format(correlationOfFavWtIfFollow.getOrElse(0.0)), - correlationOfFavWtIfFav = "%.3f" - .format(correlationOfFavWtIfFav.getOrElse(0.0)), - relativePrecisionUsingFavWt = - "%.2g".format(in.relativePrecisionUsingFavWtIfFav.getOrElse(0.0)), - averagePrecisionOfWholeGraphUsingFavWt = - "%.2g".format(in.averagePrecisionOfWholeGraphUsingFavWtIfFav.getOrElse(0.0)), - interestedInSize = in.interestedInSize.getOrElse(0).toString, - knownForSize = in.knownForSize.getOrElse(0).toString - ) - } - - case class ClusterResultsSummary( - numClustersWithZeroInterestedIn: Int, - numClustersWithZeroFollowWtRecall: Int, - numClustersWithZeroFavWtRecall: Int, - numClustersWithZeroFollowAndFavWtRecall: Int, - interestedInSizeDist: Distribution, - outgoingFollowWtRecallDist: Distribution, - outgoingFavWtRecallDist: Distribution, - incomingFollowWtRecallDist: Distribution, - incomingFavWtRecallDist: Distribution, - followCorrelationDist: Distribution, - favCorrelationDist: Distribution, - relativePrecisionDist: Distribution) - - def getClusterResultsSummary( - perClusterResults: TypedPipe[BipartiteClusterQuality] - ): Execution[Option[ClusterResultsSummary]] = { - perClusterResults - .map { clusterQuality => - val printableQuality = printableBipartiteQuality(clusterQuality) - val isFollowRecallZero = - if (!clusterQuality.favWtSumOfInClusterFollowEdges - .exists(_ > 0)) 1 - else 0 - val isFavRecallZero = - if (!clusterQuality.favWtSumOfInClusterFavEdges.exists(_ > 0)) 1 - else 0 - ( - if (!clusterQuality.interestedInSize.exists(_ > 0)) 1 else 0, - isFollowRecallZero, - isFavRecallZero, - isFavRecallZero * isFollowRecallZero, - clusterQuality.interestedInSize.toList.map(_.toDouble), - List(printableQuality.outgoingFollowWeightedRecall.toDouble), - List(printableQuality.outgoingFavWeightedRecall.toDouble), - List(printableQuality.incomingFollowWeightedRecall.toDouble), - List(printableQuality.incomingFavWeightedRecall.toDouble), - List(printableQuality.correlationOfFavWtIfFollow.toDouble), - List(printableQuality.correlationOfFavWtIfFav.toDouble), - List(printableQuality.relativePrecisionUsingFavWt.toDouble) - ) - } - .sum - .toOptionExecution - .map { opt => - opt.map { - case ( - zeroInterestedIn, - zeroFollowRecall, - zeroFavRecall, - zeroFollowAndFavRecall, - interestedInSizeList, - outgoingFollowWtRecallList, - outgoingFavWtRecallList, - 
incomingFollowWtRecallList, - incomingFavWtRecallList, - followCorrelationList, - favCorrelationList, - relativePrecisionList - ) => - ClusterResultsSummary( - numClustersWithZeroInterestedIn = zeroInterestedIn, - numClustersWithZeroFollowWtRecall = zeroFollowRecall, - numClustersWithZeroFavWtRecall = zeroFavRecall, - numClustersWithZeroFollowAndFavWtRecall = zeroFollowAndFavRecall, - interestedInSizeDist = Util.distributionFromArray(interestedInSizeList.toArray), - outgoingFollowWtRecallDist = Util - .distributionFromArray(outgoingFollowWtRecallList.toArray), - outgoingFavWtRecallDist = Util.distributionFromArray(outgoingFavWtRecallList.toArray), - incomingFollowWtRecallDist = Util - .distributionFromArray(incomingFollowWtRecallList.toArray), - incomingFavWtRecallDist = Util.distributionFromArray(incomingFavWtRecallList.toArray), - followCorrelationDist = Util.distributionFromArray(followCorrelationList.toArray), - favCorrelationDist = Util.distributionFromArray(favCorrelationList.toArray), - relativePrecisionDist = Util.distributionFromArray(relativePrecisionList.toArray) - ) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/ClusterDetailsJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/ClusterDetailsJob.scala deleted file mode 100644 index f7aa381c4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/ClusterDetailsJob.scala +++ /dev/null @@ -1,794 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.OptionMonoid -import com.twitter.algebird.QTree -import com.twitter.algebird.QTreeSemigroup -import com.twitter.algebird.Semigroup -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.pluck.source.cassowary.FollowingsCosineSimilaritiesManhattanSource -import com.twitter.pluck.source.cassowary.SimsCandidatesSource -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser - -object ClusterDetailsJob { - case class Scores(followScore: Double, favScore: Double, logFavScore: Double) - - case class IntermediateDetails( - numUsersWithAnyNonZeroScore: Int, - numUsersWithNonZeroFollowScore: Int, - numUsersWithNonZeroFavScore: Int, - favQTree: Option[QTree[Double]], - followQTree: Option[QTree[Double]], - logFavQTree: Option[QTree[Double]], - sumOfSquares: Scores, - sum: Scores, - min: Scores, - max: Scores) - - case class InfoFromUserSource( - fractionMarkedNSFWUser: Double, - languageToFractionDeviceLanguage: Map[String, Double], - countryCodeToFractionKnownForWithCountryCode: Map[String, Double], - languageToFractionInferredLanguage: Map[String, Double]) - - def 
positiveMin(a: Double, b: Double) = { - if (math.min(a, b) == 0.0) math.max(a, b) else math.min(a, b) - } - - case class ClusterDetailsSemigroup(implicit qtreeSemigroup: Semigroup[QTree[Double]]) - extends Semigroup[IntermediateDetails] { - val optionMonoid: OptionMonoid[QTree[Double]] = new OptionMonoid[QTree[Double]]() - override def plus( - left: IntermediateDetails, - right: IntermediateDetails - ): IntermediateDetails = { - IntermediateDetails( - left.numUsersWithAnyNonZeroScore + right.numUsersWithAnyNonZeroScore, - left.numUsersWithNonZeroFollowScore + right.numUsersWithNonZeroFollowScore, - left.numUsersWithNonZeroFavScore + right.numUsersWithNonZeroFavScore, - optionMonoid.plus(left.favQTree, right.favQTree), - optionMonoid.plus(left.followQTree, right.followQTree), - optionMonoid.plus(left.logFavQTree, right.logFavQTree), - Scores( - left.sumOfSquares.followScore + right.sumOfSquares.followScore, - left.sumOfSquares.favScore + right.sumOfSquares.favScore, - left.sumOfSquares.logFavScore + right.sumOfSquares.logFavScore - ), - Scores( - left.sum.followScore + right.sum.followScore, - left.sum.favScore + right.sum.favScore, - left.sum.logFavScore + right.sum.logFavScore - ), - Scores( - positiveMin(left.min.followScore, right.min.followScore), - positiveMin(left.min.favScore, right.min.favScore), - positiveMin(left.min.logFavScore, right.min.logFavScore) - ), - Scores( - math.max(left.max.followScore, right.max.followScore), - math.max(left.max.favScore, right.max.favScore), - math.max(left.max.logFavScore, right.max.logFavScore) - ) - ) - } - } - - def intermediateDetailsPipe( - input: TypedPipe[(Long, ClustersUserIsInterestedIn)], - qtreeSemigroupKParameter: Int - ): TypedPipe[(Int, IntermediateDetails)] = { - implicit val qtSg: Semigroup[QTree[Double]] = - new QTreeSemigroup[Double](qtreeSemigroupKParameter) - implicit val cdSg: Semigroup[IntermediateDetails] = ClusterDetailsSemigroup() - input - .flatMap { - case (userId, clusterScoresStruct) => - val clusterScoresArray = clusterScoresStruct.clusterIdToScores.toArray - clusterScoresArray.map { - case (clusterId, scoresStruct) => - val followScore = scoresStruct.followScore.getOrElse(0.0) - val favScore = scoresStruct.favScore.getOrElse(0.0) - val logFavScore = scoresStruct.logFavScore.getOrElse(0.0) - ( - clusterId, - IntermediateDetails( - numUsersWithAnyNonZeroScore = 1, - numUsersWithNonZeroFollowScore = if (followScore > 0) 1 else 0, - numUsersWithNonZeroFavScore = if (favScore > 0) 1 else 0, - favQTree = if (favScore > 0) Some(QTree(favScore)) else None, - followQTree = if (followScore > 0) Some(QTree(followScore)) else None, - logFavQTree = if (logFavScore > 0) Some(QTree(logFavScore)) else None, - sumOfSquares = Scores( - followScore * followScore, - favScore * favScore, - logFavScore * logFavScore), - sum = Scores(followScore, favScore, logFavScore), - min = Scores(followScore, favScore, logFavScore), - max = Scores(followScore, favScore, logFavScore) - ) - ) - } - } - .sumByKey - // Uncomment for adhoc job - //.withReducers(100) - .toTypedPipe - } - - private def safeGetDoubleOpt(x: Option[Double]): Double = { - x.map { y => if (y.isNaN) 0 else y }.getOrElse(0) - } - - private def getSimilaritiesForAllPairs( - input: TypedPipe[(Long, ClustersUserIsInterestedIn)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[((Int, Int), Scores)] = { - val allClusterPairsBeforeSumByKey = Stat("all_cluster_pairs_before_sum_by_key") - val clusterPairsWithin10Ratio = Stat("cluster_pairs_within_10_ratio") - val clusterPairsBeforeTopK = 
Stat("cluster_pairs_before_thresholding") - - input - .flatMap { - case (userId, clusterScoresStruct) => - val clusterScoresArray = clusterScoresStruct.clusterIdToScores.toArray - (0 until clusterScoresArray.length).flatMap { i => - (0 until clusterScoresArray.length).map { j => - val (clusterI, scoresI) = clusterScoresArray(i) - val (clusterJ, scoresJ) = clusterScoresArray(j) - val ratioOfSizes = - scoresI.numUsersInterestedInThisClusterUpperBound.getOrElse(1).toDouble / - scoresJ.numUsersInterestedInThisClusterUpperBound.getOrElse(1).toDouble - allClusterPairsBeforeSumByKey.inc() - if (ratioOfSizes > 0.1 && ratioOfSizes < 10) { - clusterPairsWithin10Ratio.inc() - } - val followI = safeGetDoubleOpt(scoresI.followScoreClusterNormalizedOnly) - val followJ = safeGetDoubleOpt(scoresJ.followScoreClusterNormalizedOnly) - val follow = followI * followJ - val favI = safeGetDoubleOpt(scoresI.favScoreClusterNormalizedOnly) - val favJ = safeGetDoubleOpt(scoresJ.favScoreClusterNormalizedOnly) - val fav = favI * favJ - val logFavI = safeGetDoubleOpt(scoresI.logFavScoreClusterNormalizedOnly) - val logFavJ = safeGetDoubleOpt(scoresJ.logFavScoreClusterNormalizedOnly) - val logFav = logFavI * logFavJ - ((clusterI, clusterJ), (follow, fav, logFav)) - } - } - } - .sumByKey - // Uncomment for adhoc job - //.withReducers(600) - .map { - case (key, (follow, fav, logFav)) => - clusterPairsBeforeTopK.inc() - (key, Scores(follow, fav, logFav)) - } - } - - private def keepTopNeighbors( - allPairs: TypedPipe[((Int, Int), Scores)], - cosineThreshold: Double - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Int, List[ClusterNeighbor])] = { - val clusterPairsMoreThanThreshold = Stat("cluster_pairs_cosine_gt_" + cosineThreshold) - val clusterPairsAfterTopK = Stat("cluster_pairs_after_topk") - val clustersWithFewNeighbors = Stat(s"clusters_with_fewer_than_100_neighbors") - val clustersWithManyNeighbors = Stat(s"clusters_with_more_than_100_neighbors") - - allPairs - .flatMap { - case ((cI, cJ), Scores(followScore, favScore, logFavScore)) => - if (followScore > cosineThreshold || logFavScore > cosineThreshold || favScore > cosineThreshold) { - clusterPairsMoreThanThreshold.inc() - Some((cI, ClusterNeighbor(cJ, Some(followScore), Some(favScore), Some(logFavScore)))) - } else None - } - .group - .toList - // Uncomment for adhoc job - //.withReducers(40) - .map { - case (key, seq) => - val finalSize = seq.size - clusterPairsAfterTopK.incBy(finalSize) - if (finalSize < 100) { - clustersWithFewNeighbors.inc() - } else { - clustersWithManyNeighbors.inc() - } - ( - key, - seq.sortBy { - case cn: ClusterNeighbor => - -(cn.followCosineSimilarity.getOrElse(0.0) + cn.logFavCosineSimilarity.getOrElse( - 0.0)) / 2 - }) - } - } - - def getTopSimilarClustersWithCosine( - input: TypedPipe[(Long, ClustersUserIsInterestedIn)], - cosineThreshold: Double - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Int, List[ClusterNeighbor])] = { - keepTopNeighbors(getSimilaritiesForAllPairs(input), cosineThreshold) - } - - def getDistributionDetails( - qtree: QTree[Double], - sum: Double, - sumOfSquares: Double, - min: Double, - max: Double, - fullSize: Int - ): DistributionDetails = { - val mean = sum / fullSize - // note that the below is the naive calculation, and not the sample standard dev formula - // that divides by n-1. I don't think it makes a difference at our scale whether we use n or n-1 - // and I'd rather use the simpler one. 
- val stdDev = math.sqrt(sumOfSquares / fullSize - mean * mean) - - def getQB(percentile: Double): QuantileBounds = { - val (lb, ub) = qtree.quantileBounds(percentile) - QuantileBounds(lb, ub) - } - - DistributionDetails( - mean = mean, - standardDeviation = Some(stdDev), - min = Some(min), - p25 = Some(getQB(0.25)), - p50 = Some(getQB(0.5)), - p75 = Some(getQB(0.75)), - p95 = Some(getQB(0.95)), - max = Some(max) - ) - } - - def keepCorrectModel( - input: TypedPipe[(Long, ClustersUserIsInterestedIn)], - modelVersionToKeep: String - )( - implicit uniqId: UniqueID - ): TypedPipe[(Long, ClustersUserIsInterestedIn)] = { - val allRecords = Stat("all_input_records") - val withCorrectVersion = Stat("with_correct_version") - input.filter { - case (_, clusterScoresStruct) => - // allRecords.inc() - val result = clusterScoresStruct.knownForModelVersion == modelVersionToKeep - // if (result) withCorrectVersion.inc() - result - } - } - - def getInfoFromUserSource( - knownFor: TypedPipe[(Int, List[(Long, Float)])], - usersource: TypedPipe[FlatUser], - inferredLanguages: TypedPipe[(Long, Seq[(String, Double)])] - )( - implicit uniqId: UniqueID - ): TypedPipe[(Int, InfoFromUserSource)] = { - val knownForUsers = knownFor.flatMap { - case (clusterId, userScoreList) => - userScoreList.map { - case (userId, _) => - (userId, clusterId) - } - } - - usersource - .collect { - case fuser: FlatUser if fuser.id.isDefined => - ( - fuser.id.get, - ( - fuser.accountCountryCode.getOrElse(""), - fuser.language.getOrElse(""), - fuser.nsfwUser.getOrElse(false) - )) - } - .join(knownForUsers) - .leftJoin(inferredLanguages) - .map { - case (_, (((countryCode, language, nsfw), clusterId), inferredLangsOpt)) => - val nsfwInt = if (nsfw) 1 else 0 - ( - clusterId, - ( - 1, - nsfwInt, - Map(language -> 1), - Map(countryCode -> 1), - inferredLangsOpt.getOrElse(Seq(("", 1.0))).toMap - ) - ) - } - .sumByKey - .mapValues { - case ( - denominator, - nsfwNumerator, - languageNumeratorsMap, - countryNumeratorsMap, - inferredLangsNumeratorsMap) => - InfoFromUserSource( - nsfwNumerator * 1.0 / denominator, - languageNumeratorsMap.mapValues { x => x * 1.0 / denominator }, - countryNumeratorsMap.mapValues { x => x * 1.0 / denominator }, - inferredLangsNumeratorsMap.mapValues { x => x * 1.0 / denominator } - ) - } - } - - /** - * Run the cluster details job and return the details for each cluster - * @param input interestedIn data - * @param qtreeSemigroupKParameter parameter for calculating percentiles using qtree monoid (set to a small number, usually < 7) - * @param modelVersionToKeep which modelVersion to use from interestedIn dataset - * @param knownFor clusterId -> users known for this cluster and their scores - * @param knownForTranspose userId -> clusters this user is known for and their scores - * @param usersource -> user source - * @param simsGraph -> sims graph in the form of userId -> adjacency list - * @param cosineThreshold -> cosine threshold to include a cluster in the list of similar clusters for a given cluster - * @param uniqId - * @return pipe with (modelVersion, clusterId) as the key and ClusterDetails struct as the value. 
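- * - * A minimal invocation sketch (mirroring the defaults that ClusterDetailsAdhoc below passes in; an implicit UniqueID must be in scope, and the optional sources are omitted here): - * {{{ - * ClusterDetailsJob.run( - * input = interestedIn, - * qtreeSemigroupKParameter = 3, - * modelVersionToKeep = "20M_145K_updated", - * knownFor = knownFor, - * knownForTranspose = knownForTranspose, - * usersource = None, - * inferredLanguageSource = None, - * simsGraph = None, - * cosineThreshold = 0.01 - * ) - * }}}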
- */ - def run( - input: TypedPipe[(Long, ClustersUserIsInterestedIn)], - qtreeSemigroupKParameter: Int, - modelVersionToKeep: String, - knownFor: TypedPipe[(Int, List[(Long, Float)])], - knownForTranspose: TypedPipe[(Long, Array[(Int, Float)])], - usersource: Option[TypedPipe[FlatUser]], - inferredLanguageSource: Option[TypedPipe[(Long, Seq[(String, Double)])]], - simsGraph: Option[TypedPipe[(Long, Map[Long, Float])]], - cosineThreshold: Double - )( - implicit uniqId: UniqueID - ): Execution[TypedPipe[((String, Int), ClusterDetails)]] = { - val topSimilarClusters = getTopSimilarClustersWithCosine(input, cosineThreshold) - val infoFromUserSource: TypedPipe[(Int, InfoFromUserSource)] = (for { - us <- usersource - inferredLanguages <- inferredLanguageSource - } yield getInfoFromUserSource(knownFor, us, inferredLanguages)).getOrElse(TypedPipe.empty) - - val clusterEvaluationExec = simsGraph match { - case Some(sg) => - ClusterEvaluation.clusterLevelEvaluation(sg, knownForTranspose, "eval") - case None => - val dummyPipe: TypedPipe[(Int, (Int, ClusterQuality))] = TypedPipe.empty - Execution.from(dummyPipe) - } - - clusterEvaluationExec - .map { clusterIdToSizesAndQualities => - val clusterQualities: TypedPipe[(Int, ClusterQuality)] = - clusterIdToSizesAndQualities.mapValues(_._2) - intermediateDetailsPipe( - keepCorrectModel(input, modelVersionToKeep), - qtreeSemigroupKParameter) - .leftJoin(topSimilarClusters) - .leftJoin(infoFromUserSource) - .leftJoin(clusterQualities) - .join(knownFor) - .map { - case ( - clusterId, - ( - ( - ((intermediateDetails, topSimilarNeighborsOpt), userSourceInfoOpt), - qualityOpt), - knownForUsers) - ) => - val knownForSorted = knownForUsers.sortBy(-_._2).map { - case (userId, score) => - UserWithScore(userId, score) - } - (modelVersionToKeep, clusterId) -> - ClusterDetails( - numUsersWithAnyNonZeroScore = intermediateDetails.numUsersWithAnyNonZeroScore, - numUsersWithNonZeroFavScore = intermediateDetails.numUsersWithNonZeroFavScore, - numUsersWithNonZeroFollowScore = - intermediateDetails.numUsersWithNonZeroFollowScore, - favScoreDistributionDetails = intermediateDetails.favQTree.map { qt => - getDistributionDetails( - qtree = qt, - sum = intermediateDetails.sum.favScore, - sumOfSquares = intermediateDetails.sumOfSquares.favScore, - min = intermediateDetails.min.favScore, - max = intermediateDetails.max.favScore, - fullSize = intermediateDetails.numUsersWithNonZeroFavScore - ) - }, - followScoreDistributionDetails = intermediateDetails.followQTree.map { qt => - getDistributionDetails( - qtree = qt, - sum = intermediateDetails.sum.followScore, - sumOfSquares = intermediateDetails.sumOfSquares.followScore, - min = intermediateDetails.min.followScore, - max = intermediateDetails.max.followScore, - fullSize = intermediateDetails.numUsersWithNonZeroFollowScore - ) - }, - logFavScoreDistributionDetails = intermediateDetails.logFavQTree.map { qt => - getDistributionDetails( - qtree = qt, - sum = intermediateDetails.sum.logFavScore, - sumOfSquares = intermediateDetails.sumOfSquares.logFavScore, - min = intermediateDetails.min.logFavScore, - max = intermediateDetails.max.logFavScore, - // note: user has non-zero fav score iff a user has non-zero log-fav score - fullSize = intermediateDetails.numUsersWithNonZeroFavScore - ) - }, - knownForUsersAndScores = Some(knownForSorted), - neighborClusters = topSimilarNeighborsOpt, - fractionKnownForMarkedNSFWUser = userSourceInfoOpt.map(_.fractionMarkedNSFWUser), - languageToFractionDeviceLanguage = - 
userSourceInfoOpt.map(_.languageToFractionDeviceLanguage), - countryCodeToFractionKnownForWithCountryCode = - userSourceInfoOpt.map(_.countryCodeToFractionKnownForWithCountryCode), - qualityMeasuredOnSimsGraph = qualityOpt, - languageToFractionInferredLanguage = - userSourceInfoOpt.map(_.languageToFractionInferredLanguage), - ) - } - } - } - - def getTruncatedSims( - sims: TypedPipe[Candidates], - maxNeighbors: Int - ): TypedPipe[(Long, Map[Long, Float])] = { - sims.map { cands => - ( - cands.userId, - // These candidates are already sorted, but leaving it in just in case the behavior changes upstream - cands.candidates - .map { c => (c.userId, c.score.toFloat) }.sortBy(-_._2).take(maxNeighbors).toMap - ) - } - } -} - -/** - scalding remote run --main-class com.twitter.simclusters_v2.scalding.ClusterDetailsAdhoc \ - --target src/scala/com/twitter/simclusters_v2/scalding:cluster_details-adhoc \ - --hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ - --user recos-platform -- \ - --date 2020-06-25 \ - --dateForUserSource 2020-06-25 \ - --includeUserSource \ - --outputDir /user/recos-platform/adhoc/your_ldap/cluster_details_inferred_lang - */ -object ClusterDetailsAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val date = DateRange.parse(args("dateForUserSource")) - val (knownFor, knownForTranspose) = - args - .optional("knownForDir").map { location => - ( - KnownForSources.transpose(KnownForSources.readKnownFor(location)), - KnownForSources.readKnownFor(location) - ) - }.getOrElse( - ( - KnownForSources.clusterToKnownFor_20M_145K_updated, - KnownForSources.knownFor_20M_145K_updated - ) - ) - - val interestedIn = args - .optional("inputDir").map { interestedInInputDir => - TypedPipe.from(AdhocKeyValSources.interestedInSource(interestedInInputDir)) - }.getOrElse( - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - Days(14)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - ) - - val userSourceOpt = if (args.boolean("includeUserSource")) { - Some(DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, date).toTypedPipe) - } else None - - val inferredLanguagesOpt = if (args.boolean("includeUserSource")) { - Some(ExternalDataSources.inferredUserProducedLanguageSource) - } else None - - val simsGraphOpt = args.optional("simsForEvalInputDir").map { sgDir => - ClusterDetailsJob.getTruncatedSims( - TypedPipe.from(WTFCandidatesSource(sgDir)), - args.int("maxSimsNeighborsForEval", 20) - ) - } - - Util.printCounters( - ClusterDetailsJob - .run( - interestedIn, - args.int("qtreeSemigroupKParameter", 3), - args.getOrElse("modelVersion", "20M_145K_updated"), - knownFor, - knownForTranspose, - userSourceOpt, - inferredLanguagesOpt, - simsGraphOpt, - cosineThreshold = args.double("cosineThreshold", 0.01) - ).flatMap( - _.writeExecution(AdhocKeyValSources.clusterDetailsSource(args("outputDir")))) - ) - } - } -} - -trait ClusterDetailsBatchTrait extends TwitterScheduledExecutionApp { - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - - def firstTime: String - def batchIncrement: Duration - def manhattanOutputPath: String - def 
clusterDetailsLiteOutputPath: String - def modelVersion: String - def knownForDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - def interestedInDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] - def outputDataset: KeyValDALDataset[KeyVal[(String, Int), ClusterDetails]] - def clusterDetailsLiteOutputDataset: SnapshotDALDataset[ClusterDetailsLite] - - private lazy val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val qtreeSemigroupKParameter = args.int("qtreeSemigroupKParameter", 5) - val maxSimsNeighborsForEval = args.int("maxSimsNeighborsForEval", 20) - val knownForTranspose = - KnownForSources.fromKeyVal( - DAL.readMostRecentSnapshot(knownForDataset, dateRange.extend(Days(7))).toTypedPipe, - modelVersion) - val knownFor = KnownForSources.transpose(knownForTranspose) - val cosineThreshold = args.double("cosineThreshold", 0.01) - val interestedIn = - DAL - .readMostRecentSnapshot(interestedInDataset, dateRange.extend(Days(7))) - .toTypedPipe - .map { - case KeyVal(userId, clustersUserIsInterestedIn) => - (userId, clustersUserIsInterestedIn) - } - val sims = if (modelVersion == ModelVersions.Model20M145K2020) { - // The model version 20m_145k_2020 uses approximate_cosine_follow as the input sims graph - // to cluster users. The same graph is used to evaluate the clusters - TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource()) - .map(_._2) - } else { - TypedPipe.from( - SimsCandidatesSource()( - dateRange = dateRange, - suffixPath = "/classified_candidates_rollup" - )) - } - val resultExec = ClusterDetailsJob - .run( - interestedIn, - qtreeSemigroupKParameter, - modelVersion, - knownFor, - knownForTranspose, - Some(DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, dateRange).toTypedPipe), - Some(ExternalDataSources.inferredUserProducedLanguageSource), - Some( - ClusterDetailsJob.getTruncatedSims(sims, maxNeighbors = maxSimsNeighborsForEval)), - cosineThreshold - ).flatMap { resultUnmapped => - val clusterDetailsExec = resultUnmapped - .map { - case (clusterKey, details) => - KeyVal(clusterKey, details) - }.writeDALVersionedKeyValExecution( - outputDataset, - D.Suffix(manhattanOutputPath) - ) - - val clusterDetailsLiteExec = - resultUnmapped - .map { - case ((_, clusterId), details) - if modelVersion == ModelVersions.Model20M145KDec11 => - ClusterDetailsLite( - FullClusterId(ModelVersion.Model20m145kDec11, clusterId), - details.numUsersWithAnyNonZeroScore, - details.numUsersWithNonZeroFollowScore, - details.numUsersWithNonZeroFavScore, - details.knownForUsersAndScores.getOrElse(Nil) - ) - case ((_, clusterId), details) - if modelVersion == ModelVersions.Model20M145KUpdated => - ClusterDetailsLite( - FullClusterId(ModelVersion.Model20m145kUpdated, clusterId), - details.numUsersWithAnyNonZeroScore, - details.numUsersWithNonZeroFollowScore, - details.numUsersWithNonZeroFavScore, - details.knownForUsersAndScores.getOrElse(Nil) - ) - case ((_, clusterId), details) - if modelVersion == ModelVersions.Model20M145K2020 => - ClusterDetailsLite( - FullClusterId(ModelVersion.Model20m145k2020, clusterId), - details.numUsersWithAnyNonZeroScore, - details.numUsersWithNonZeroFollowScore, - 
details.numUsersWithNonZeroFavScore, - details.knownForUsersAndScores.getOrElse(Nil) - ) - }.writeDALSnapshotExecution( - clusterDetailsLiteOutputDataset, - D.Daily, - D.Suffix(clusterDetailsLiteOutputPath), - D.EBLzo(), - dateRange.end) - - Execution.zip(clusterDetailsExec, clusterDetailsLiteExec) - } - - Util.printCounters(resultExec) - } - } - } - -} - -object ClusterDetailsBatch extends ClusterDetailsBatchTrait { - override val firstTime: String = "2018-07-28" - override val batchIncrement: Duration = Days(7) - - override val manhattanOutputPath: String = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_cluster_details" - - override val clusterDetailsLiteOutputPath: String = - "/user/cassowary/processed/simclusters_v2_cluster_details_lite" - - override val modelVersion: String = ModelVersions.Model20M145KDec11 - override val knownForDataset = SimclustersV2KnownFor20M145KDec11ScalaDataset - override val interestedInDataset = SimclustersV2InterestedInScalaDataset - override val outputDataset = SimclustersV2ClusterDetailsScalaDataset - override val clusterDetailsLiteOutputDataset = - SimclustersV2ClusterDetailsLiteScalaDataset -} - -object ClusterDetails20M145KUpdated extends ClusterDetailsBatchTrait { - override val firstTime: String = "2019-06-16" - override val batchIncrement: Duration = Days(7) - - override val manhattanOutputPath: String = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_cluster_details_20m_145k_updated" - - override val clusterDetailsLiteOutputPath: String = - "/user/cassowary/processed/simclusters_v2_cluster_details_lite_20m_145k_updated" - - override val modelVersion: String = ModelVersions.Model20M145KUpdated - override val knownForDataset = SimclustersV2KnownFor20M145KUpdatedScalaDataset - override val interestedInDataset = SimclustersV2InterestedIn20M145KUpdatedScalaDataset - override val outputDataset = SimclustersV2ClusterDetails20M145KUpdatedScalaDataset - override val clusterDetailsLiteOutputDataset = - SimclustersV2ClusterDetailsLite20M145KUpdatedScalaDataset -} - -/** - * capesospy-v2 update --build_locally --start_cron cluster_details_20m_145k_2020 \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object ClusterDetails20M145K2020 extends ClusterDetailsBatchTrait { - override val firstTime: String = "2020-10-15" - override val batchIncrement: Duration = Days(7) - - override val manhattanOutputPath: String = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_cluster_details_20m_145k_2020" - - override val clusterDetailsLiteOutputPath: String = - "/user/cassowary/processed/simclusters_v2_cluster_details_lite_20m_145k_2020" - - override val modelVersion: String = ModelVersions.Model20M145K2020 - override val knownForDataset = SimclustersV2KnownFor20M145K2020ScalaDataset - override val interestedInDataset = SimclustersV2InterestedIn20M145K2020ScalaDataset - override val outputDataset = SimclustersV2ClusterDetails20M145K2020ScalaDataset - override val clusterDetailsLiteOutputDataset = - SimclustersV2ClusterDetailsLite20M145K2020ScalaDataset -} - -/** -scalding remote run --main-class com.twitter.simclusters_v2.scalding.DumpClusterDetailsAdhoc \ - --target src/scala/com/twitter/simclusters_v2/scalding:cluster_details-dump \ - --user recos-platform -- \ - --date 2020-06-25 \ - --clusterIds 5542 129677 48645 \ - --inputDir /user/recos-platform/adhoc/your_ldap/cluster_details_inferred_lang - */ -object DumpClusterDetailsAdhoc extends TwitterExecutionApp { - def job: Execution[Unit] = - 
Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val clusters = args.list("clusterIds").map(_.toInt).toSet //(1 to 2500).toSet // - TypedPipe - .from(AdhocKeyValSources.clusterDetailsSource(args("inputDir"))) - .filter { case ((modelVersion, clusterId), details) => clusters.contains(clusterId) } - .toIterableExecution - .map { iter => - iter.foreach { x => println(Util.prettyJsonMapper.writeValueAsString(x)) } - } - } - } -} - -/** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:cluster_details && \ - * oscar hdfs --user cassowary --host hadoopnest2.atla.twitter.com --bundle cluster_details \ - * --tool com.twitter.simclusters_v2.scalding.DumpClusterSimilaritiesAdhoc --screen --screen-detached \ - * --tee your_ldap/dumpClusterSimilarities_20200103 -- \ - * --inputDir /user/cassowary/manhattan_sequence_files/simclusters_v2_cluster_details_20m_145k_updated/ \ - * --outputDir adhoc/your_ldap - */ -object DumpClusterSimilaritiesAdhoc extends TwitterExecutionApp { - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - TypedPipe - .from(AdhocKeyValSources.clusterDetailsSource(args("inputDir"))) - .flatMap { - case ((_, clusterId), details) => - details.neighborClusters.getOrElse(Nil).map { neighbor => - val compositeScore = (neighbor.followCosineSimilarity - .getOrElse(0.0) + neighbor.favCosineSimilarity.getOrElse(0.0)) / 2 - ( - clusterId, - neighbor.clusterId, - "%.4f".format(compositeScore) - ) - } - }.writeExecution(TypedTsv(args("outputDir"))) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/ClusterEvaluation.scala b/src/scala/com/twitter/simclusters_v2/scalding/ClusterEvaluation.scala deleted file mode 100644 index 7133382eb..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/ClusterEvaluation.scala +++ /dev/null @@ -1,607 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Monoid -import com.twitter.algebird.mutable.PriorityQueueMonoid -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.pluck.source.cassowary.FollowingsCosineSimilaritiesManhattanSource -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.common.Util.Distribution -import com.twitter.simclusters_v2.thriftscala.ClusterQuality -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import java.util.PriorityQueue -import scala.collection.JavaConverters._ - -object ClusterEvaluation { - - val samplerMonoid: PriorityQueueMonoid[((Long, Long), (Double, Double))] = - Util.reservoirSamplerMonoidForPairs[(Long, Long), (Double, Double)](5000)(Util.edgeOrdering) - - case class ClusterResults( - numEdgesInsideCluster: Int, - wtOfEdgesInsideCluster: Double, - numEdgesOutsideCluster: Int, - wtOfEdgesOutsideCluster: Double, - originalWtAndProductOfNodeScoresSample: 
PriorityQueue[((Long, Long), (Double, Double))]) { - def clusterQuality(clusterSize: Int, averagePrecisionWholeGraph: Double): ClusterQuality = { - val unweightedRecallDenominator = numEdgesInsideCluster + numEdgesOutsideCluster - val unweightedRecall = if (unweightedRecallDenominator > 0) { - numEdgesInsideCluster.toDouble / unweightedRecallDenominator.toDouble - } else 0.0 - - val weightedRecallDenominator = wtOfEdgesInsideCluster + wtOfEdgesOutsideCluster - val weightedRecall = if (weightedRecallDenominator > 0) { - wtOfEdgesInsideCluster / weightedRecallDenominator - } else 0.0 - - val precision = if (clusterSize > 1) { - Some(wtOfEdgesInsideCluster / (clusterSize * (clusterSize - 1))) - } else Some(0.0) - - val relativePrecision = if (averagePrecisionWholeGraph > 0) { - precision.flatMap { p => Some(p / averagePrecisionWholeGraph) } - } else Some(0.0) - - ClusterQuality( - unweightedRecall = Some(unweightedRecall), - weightedRecall = Some(weightedRecall), - unweightedRecallDenominator = Some(unweightedRecallDenominator), - weightedRecallDenominator = Some(weightedRecallDenominator), - relativePrecisionNumerator = precision, - relativePrecision = relativePrecision, - weightAndProductOfNodeScoresCorrelation = Some( - Util.computeCorrelation( - originalWtAndProductOfNodeScoresSample.iterator.asScala.map(_._2))) - ) - } - } - - object ClusterResultsMonoid extends Monoid[ClusterResults] { - override def zero = ClusterResults(0, 0, 0, 0, samplerMonoid.zero) - override def plus(l: ClusterResults, r: ClusterResults) = ClusterResults( - l.numEdgesInsideCluster + r.numEdgesInsideCluster, - l.wtOfEdgesInsideCluster + r.wtOfEdgesInsideCluster, - l.numEdgesOutsideCluster + r.numEdgesOutsideCluster, - l.wtOfEdgesOutsideCluster + r.wtOfEdgesOutsideCluster, - samplerMonoid - .plus(l.originalWtAndProductOfNodeScoresSample, r.originalWtAndProductOfNodeScoresSample) - ) - } - - /** - * Evaluate the quality of a cluster. - * @param memberScores A map with the members of the cluster as the keys and their scores - * inside the cluster as values. The more central a member is inside the score, - * the higher it's score is. - * @param membersAdjLists A map that gives the weighted neighbors of each member in the cluster. - */ - def evaluateCluster( - memberScores: Map[Long, Double], - membersAdjLists: Map[Long, Map[Long, Float]] - ): ClusterResults = { - val resultsIter = membersAdjLists.flatMap { - case (fromNodeId, adjList) => - val fromNodeWt = memberScores.getOrElse(fromNodeId, 0.0) - adjList.map { - case (toNodeId, edgeWt) => - if (memberScores.contains(toNodeId)) { - val productOfMembershipScores = fromNodeWt * memberScores(toNodeId) - ClusterResults( - 1, - edgeWt, - 0, - 0, - samplerMonoid.build( - ((fromNodeId, toNodeId), (edgeWt.toDouble, productOfMembershipScores)))) - } else { - ClusterResults(0, 0, 1, edgeWt, samplerMonoid.zero) - } - } - } - Monoid.sum(resultsIter)(ClusterResultsMonoid) - } - - /** - * Evaluate each cluster with respect to the provided graph. - * @param graph graph represented via the adjacency lists of each node, needs to be symmetrized i.e. if u is in v's adjlist, then v needs to be in u's adjlist as well - * @param clusters cluster memberships of each node. 
- * @param statsPrefix convenience argument to act as prefix for stats counters - * @return key-value pipe with clusterId as key and (size of the cluster, quality struct) as value - */ - def clusterLevelEvaluation( - graph: TypedPipe[(Long, Map[Long, Float])], - clusters: TypedPipe[(Long, Array[(Int, Float)])], - statsPrefix: String = "" - )( - implicit uniqueId: UniqueID - ): Execution[TypedPipe[(Int, (Int, ClusterQuality))]] = { - val numRealClusters = Stat(s"${statsPrefix}/numRealClusters") - val numFakeClusters = Stat(s"${statsPrefix}/numFakeClusters") - - val numNodesAndEdgesExec = graph - .map { - case (nId, nbrMap) => - (1L, nbrMap.size.toLong, nbrMap.values.sum.toDouble) - }.sum.getExecution - - numNodesAndEdgesExec.map { - case (numNodes, numEdges, sumOfAllEdgeWts) => - println("numNodes " + numNodes) - println("numEdges " + numEdges) - println("sumOfAllEdgeWts " + sumOfAllEdgeWts) - - val numFakeClustersForUnassignedNodes = numNodes / 1e4 - - val averagePrecisionWholeGraph = sumOfAllEdgeWts / (numNodes * (numNodes - 1)) - graph - .leftJoin(clusters) - // uncomment for adhoc job - .withReducers(200) - .flatMap { - case (nodeId, (adjList, assignedClustersOpt)) => - val nodeDegree = adjList.size.toLong - val nodeWeightedDegree = adjList.values.sum - assignedClustersOpt match { - case Some(assignedClusters) if assignedClusters.nonEmpty => - assignedClusters.toList.map { - case (clusterId, scoreOfNodeInCluster) => - ( - clusterId, - ( - Map(nodeId -> (scoreOfNodeInCluster.toDouble, adjList)), - 1, - nodeDegree, - nodeWeightedDegree)) - } - case _ => - // For nodes that don't belong to any cluster, create a fake clusterId (0 or lesser) - // and add the node's statistics to that clusterId. We don't need the adjacency lists for - // unassigned nodes, we'll simply track how many edges are incident on those nodes and their weighted sum etc - val fakeClusterId = - (-1 * (math.abs( - Util.hashToLong(nodeId)) % numFakeClustersForUnassignedNodes)).toInt - List( - ( - fakeClusterId, - ( - Map.empty[Long, (Double, Map[Long, Float])], - 1, - nodeDegree, - nodeWeightedDegree))) - } - } - .sumByKey - // uncomment for adhoc job - .withReducers(60) - .map { - case (clusterId, (membersMap, clusterSize, volumeOfCluster, weightedVolumeOfCluster)) => - if (clusterId > 0) { - numRealClusters.inc() - - val scoresMap = - if (clusterId > 0) membersMap.mapValues(_._1) else Map.empty[Long, Double] - val adjListsMap = membersMap.mapValues(_._2) - - val quality = evaluateCluster(scoresMap, adjListsMap) - .clusterQuality(clusterSize, averagePrecisionWholeGraph) - - (clusterId, (clusterSize, quality)) - } else { - // clusterId <= 0 means that this is a fake cluster. 
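// (Editorial note.) For fake clusters only the recall denominators are recorded below, i.e. the count and
// total weight of edges incident on these unassigned nodes, so the graph-wide denominators summed in
// summarizePerClusterResults still add up; no precision or correlation is computed for them.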
- numFakeClusters.inc() - ( - clusterId, - ( - clusterSize, - ClusterQuality( - unweightedRecallDenominator = Some(volumeOfCluster), - weightedRecallDenominator = Some(weightedVolumeOfCluster) - ) - ) - ) - } - } - } - } - - case class OverallResults( - unweightedRecall: Double, - edgesInsideClusters: Long, - allEdges: Long, - allNodes: Int, - weightedRecall: Double, - wtOnEdgesInsideClusters: Double, - wtOnAllEdges: Double, - weightCorrelation: Double, - relativePrecision: Double, - numUnassignedNodes: Int, - numAssignedNodes: Int, - sizeDist: Distribution, - recallDist: Distribution, - weightedRecallDist: Distribution, - relativePrecisionDist: Distribution, - weightCorrelationDist: Distribution, - numClustersWithNegativeCorrelation: Double, - numClustersWithZeroRecall: Double, - numClustersWithLessThanOneRelativePrecision: Double, - numSingletonClusters: Int) - - def summarizePerClusterResults( - perClusterResults: TypedPipe[(Int, (Int, ClusterQuality))] - ): Execution[Option[OverallResults]] = { - perClusterResults - .map { - case (clusterId, (size, quality)) => - val unweightedRecallDen = quality.unweightedRecallDenominator.getOrElse(0.0) - val unweightedRecallNum = quality.unweightedRecall.getOrElse(0.0) * unweightedRecallDen - val weightedRecallDen = quality.weightedRecallDenominator.getOrElse(0.0) - val weightedRecallNum = quality.weightedRecall.getOrElse(0.0) * weightedRecallDen - - val weightCorrelationDen = size - val weightCorrelationNum = - weightCorrelationDen * quality.weightAndProductOfNodeScoresCorrelation - .getOrElse(0.0) - - val relativePrecisionDen = size - val relativePrecisionNum = relativePrecisionDen * quality.relativePrecision.getOrElse(0.0) - - val numClustersWithNegativeCorrelation = - if (weightCorrelationNum < 0 && clusterId > 0) 1 else 0 - val numClustersWithLessThanOneRelativePrecision = - if (quality.relativePrecision.getOrElse(0.0) < 1 && clusterId > 0) 1 else 0 - val numClustersWithZeroRecall = if (weightedRecallNum < 1e-5 && clusterId > 0) 1 else 0 - val numUnassignedNodes = if (clusterId < 1) size else 0 - val numAssignedNodes = if (clusterId > 0) size else 0 - val numSingletonClusters = if (clusterId > 0 && size == 1) 1 else 0 - - ( - unweightedRecallDen, - unweightedRecallNum, - weightedRecallDen, - weightedRecallNum, - weightCorrelationDen, - weightCorrelationNum, - relativePrecisionDen, - relativePrecisionNum, - numClustersWithNegativeCorrelation, - numClustersWithLessThanOneRelativePrecision, - numClustersWithZeroRecall, - List(size.toDouble), - List(quality.unweightedRecall.getOrElse(0.0)), - List(quality.weightedRecall.getOrElse(0.0)), - List(quality.relativePrecision.getOrElse(0.0)), - List(quality.weightAndProductOfNodeScoresCorrelation.getOrElse(0.0)), - numUnassignedNodes, - numAssignedNodes, - numSingletonClusters - ) - } - .sum - .toOptionExecution - .map { opt => - opt.map { - case ( - unweightedRecallDen, - unweightedRecallNum, - weightedRecallDen, - weightedRecallNum, - weightCorrelationDen, - weightCorrelationNum, - relativePrecisionDen, - relativePrecisionNum, - numClustersWithNegativeCorrelation, - numClustersWithLessThanOneRelativePrecision, - numClustersWithZeroRecall, - sizeList, - unweightedRecallList, - weightedRecallList, - relativePrecisionList, - weightCorrelationList, - numUnassignedNodes, - numAssignedNodes, - numSingletonClusters) => - OverallResults( - unweightedRecall = unweightedRecallNum / unweightedRecallDen, - edgesInsideClusters = unweightedRecallNum.toLong, - allEdges = unweightedRecallDen.toLong, - allNodes = 
numAssignedNodes + numUnassignedNodes, - weightedRecall = weightedRecallNum / weightedRecallDen, - wtOnEdgesInsideClusters = weightedRecallNum, - wtOnAllEdges = weightedRecallDen, - weightCorrelation = weightCorrelationNum / weightCorrelationDen, - relativePrecision = relativePrecisionNum / relativePrecisionDen, - numAssignedNodes = numAssignedNodes, - numUnassignedNodes = numUnassignedNodes, - sizeDist = Util.distributionFromArray(sizeList.toArray), - recallDist = Util.distributionFromArray(unweightedRecallList.toArray), - weightedRecallDist = Util.distributionFromArray(weightedRecallList.toArray), - weightCorrelationDist = Util.distributionFromArray(weightCorrelationList.toArray), - relativePrecisionDist = Util.distributionFromArray(relativePrecisionList.toArray), - numClustersWithNegativeCorrelation = numClustersWithNegativeCorrelation, - numClustersWithLessThanOneRelativePrecision = - numClustersWithLessThanOneRelativePrecision, - numClustersWithZeroRecall = numClustersWithZeroRecall, - numSingletonClusters = numSingletonClusters - ) - } - } - } - - /** - * @param graph Input similarity graph, needs to be symmetrized i.e. if u is in v's adjlist, then v needs to be in u's adjlist as well - * @param clusters cluster assignments to be evaluated - * @return summary of results - */ - def overallEvaluation( - graph: TypedPipe[(Long, Map[Long, Float])], - clusters: TypedPipe[(Long, Array[(Int, Float)])], - statsPrefix: String - )( - implicit uniqueId: UniqueID - ): Execution[Option[OverallResults]] = { - clusterLevelEvaluation(graph, clusters, statsPrefix).flatMap(summarizePerClusterResults) - } -} - -/** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:cluster_evaluation && \ - * oscar hdfs --user frigate --host hadoopnest1.atla.twitter.com --bundle cluster_evaluation \ - * --tool com.twitter.simclusters_v2.scalding.ClusterEvaluationAdhoc --screen --screen-detached \ - * --tee logs/clusterQualityFor_updatedUnnormalizedInputScores_usingSims20190318 -- \ - * --simsInputDir /user/frigate/your_ldap/commonDirForClusterEvaluation/classifiedSims_20190314_copiedFromAtlaProc \ - * --topK 20000000 --date 2019-03-18 --minActiveFollowers 400 \ - * --topUsersDir /user/frigate/your_ldap/commonDirForClusterEvaluation/top20MUsers_minActiveFollowers400_20190215 \ - * --maxSimsNeighborsForEval 40 \ - * --preparedSimsGraph /user/frigate/your_ldap/commonDirForClusterEvaluation/symmetrized_classifiedSims20190318_top20MUsers \ - * --outputDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/knownForClusterEvaluation \ - * --knownForDir /user/frigate/your_ldap/dirFor_updatedKnownFor20M_145K_dec11_usingSims20190127_unnormalizedInputScores/knownFor - */ -object ClusterEvaluationAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val knownFor = args - .optional("knownForDir").map { location => - KnownForSources.readKnownFor(location) - }.getOrElse(KnownForSources.knownFor_20M_Dec11_145K) - - val minActiveFollowers = args.int("minActiveFollowers", 400) - val topK = args.int("topK") - val date = DateRange.parse(args("date")) - - val topUsersExec = - TopUsersSimilarityGraph - .topUsers( - DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, date).toTypedPipe, - minActiveFollowers, - topK - ) - .map(_.id) - 
.count("num_top_users") - .make(TypedTsv(args("topUsersDir"))) - - val simsGraphExec = topUsersExec.flatMap { topUsers => - TopUsersSimilarityGraph.makeGraph( - TopUsersSimilarityGraph.getSubgraphFromUserGroupedInput( - TypedPipe.from(WTFCandidatesSource(args("simsInputDir"))), - topUsers, - args.int("maxSimsNeighborsForEval", 40), - degreeThresholdForStat = 5 - ), - args("preparedSimsGraph") - ) - } - - val fullExec = simsGraphExec.flatMap { sims => - ClusterEvaluation - .clusterLevelEvaluation(sims, knownFor, "eval") - .flatMap { clusterResultsPipe => - val clusterResults = clusterResultsPipe.forceToDiskExecution - val outputExec = clusterResults.flatMap { pipe => - pipe - .map { - case (clusterId, (clusterSize, quality)) => - "%d\t%d\t%.2g\t%.2g\t%.1f\t%.2g\t%.2f\t%.2g\t%.2g" - .format( - clusterId, - clusterSize, - quality.unweightedRecall.getOrElse(0.0), - quality.weightedRecall.getOrElse(0.0), - quality.unweightedRecallDenominator.getOrElse(0.0), - quality.weightedRecallDenominator.getOrElse(0.0), - quality.relativePrecision.getOrElse(0.0), - quality.relativePrecisionNumerator.getOrElse(0.0), - quality.weightAndProductOfNodeScoresCorrelation.getOrElse(0.0) - ) - }.writeExecution(TypedTsv(args("outputDir"))) - } - - val printExec = clusterResults.flatMap { pipe => - ClusterEvaluation.summarizePerClusterResults(pipe).map { - case Some(res) => - println("Overall results: " + Util.prettyJsonMapper.writeValueAsString(res)) - case None => - println("No overall results!!! Probably cluster results pipe is empty.") - } - } - - Execution.zip(outputExec, printExec) - } - } - - Util.printCounters(fullExec) - } - } -} - -trait ClusterEvaluationBatch extends TwitterScheduledExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def firstTime: String - - def batchDescription: String - - def batchIncrement: Duration - - private lazy val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(batchDescription), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - val emailAddress: String = "no-reply@twitter.com" - - def knownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - def knownForModelVersion: String - - def baselineKnownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - def baselineKnownForModelVersion: String - - override def scheduledJob: Execution[Unit] = - AnalyticsBatchExecution(execArgs) { implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val baselineKnownFor = - KnownForSources.fromKeyVal( - DAL - .readMostRecentSnapshot(baselineKnownForDALDataset, dateRange.prepend(Days(7))) - .toTypedPipe, - baselineKnownForModelVersion - ) - - val knownFor = - KnownForSources.fromKeyVal( - DAL - .readMostRecentSnapshot(knownForDALDataset, dateRange.prepend(Days(7))) - .toTypedPipe, - knownForModelVersion - ) - - val inputSimsGraph = TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource()) - .map(_._2) - - val minActiveFollowers = args.int("minActiveFollowers") - val topK = args.int("topK") - val maxSimsNeighborsForEval = - args.int("maxSimsNeighborsForEval", 40) - - val topUsers = TopUsersSimilarityGraph - .topUsers( - DAL - .readMostRecentSnapshot(UsersourceFlatScalaDataset, dateRange) - .toTypedPipe, - minActiveFollowers, - topK - ) - .map(_.id) - .count("num_top_users") - - TopUsersSimilarityGraph - .getSubgraphFromUserGroupedInput( - fullGraph = 
inputSimsGraph, - usersToInclude = topUsers, - maxNeighborsPerNode = maxSimsNeighborsForEval, - degreeThresholdForStat = 2 - ) - .forceToDiskExecution - .flatMap { symmetrizedSims => - val baselineResultsExec = ClusterEvaluation - .overallEvaluation(symmetrizedSims, baselineKnownFor, "baselineKnownForEval") - val newResultsExec = ClusterEvaluation - .overallEvaluation(symmetrizedSims, knownFor, "newKnownForEval") - val minSizeOfBiggerClusterForComparison = 10 - val compareExec = CompareClusters.summarize( - CompareClusters.compare( - KnownForSources.transpose(baselineKnownFor), - KnownForSources.transpose(knownFor), - minSizeOfBiggerCluster = minSizeOfBiggerClusterForComparison - )) - - Execution - .zip(baselineResultsExec, newResultsExec, compareExec) - .map { - case (oldResults, newResults, compareResults) => - val emailText = - s"Evaluation Results for baseline knownFor: $baselineKnownForModelVersion \n" + - Util.prettyJsonMapper.writeValueAsString(oldResults) + - "\n\n-------------------\n\n" + - s"Evaluation Results for new knownFor:$knownForModelVersion\n" + - Util.prettyJsonMapper.writeValueAsString(newResults) + - "\n\n-------------------\n\n" + - s"Cosine similarity distribution between $baselineKnownForModelVersion and " + - s"$knownForModelVersion cluster membership vectors for " + - s"clusters with at least $minSizeOfBiggerClusterForComparison members:\n" + - Util.prettyJsonMapper - .writeValueAsString(compareResults) - - Util - .sendEmail( - emailText, - s"Evaluation results comparing $knownForModelVersion with baseline $baselineKnownForModelVersion", - emailAddress) - () - } - } - } - } - } -} - -/** - * capesospy-v2 update --build_locally --start_cron cluster_evaluation_for_20M_145k \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object ClusterEvaluationFor20M145K extends ClusterEvaluationBatch { - override val firstTime: String = "2019-06-11" - - override val batchIncrement: Duration = Days(7) - - override val batchDescription = "com.twitter.simclusters_v2.scalding.ClusterEvaluationFor20M145K" - - override val knownForDALDataset = SimclustersV2KnownFor20M145KUpdatedScalaDataset - - override val knownForModelVersion = ModelVersions.Model20M145KUpdated - - override val baselineKnownForDALDataset = SimclustersV2KnownFor20M145KDec11ScalaDataset - - override val baselineKnownForModelVersion = ModelVersions.Model20M145KDec11 -} - -/** - * capesospy-v2 update --build_locally --start_cron cluster_evaluation_for_20M_145k_2020 \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object ClusterEvaluationFor20M145K2020 extends ClusterEvaluationBatch { - override val firstTime: String = "2021-01-25" - - override val batchIncrement: Duration = Days(7) - - override val batchDescription = - "com.twitter.simclusters_v2.scalding.ClusterEvaluationFor20M145K2020" - - override val knownForDALDataset = SimclustersV2KnownFor20M145K2020ScalaDataset - - override val knownForModelVersion = ModelVersions.Model20M145K2020 - - override val baselineKnownForDALDataset = SimclustersV2KnownFor20M145KUpdatedScalaDataset - - override val baselineKnownForModelVersion = ModelVersions.Model20M145KUpdated -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/CompareClusters.scala b/src/scala/com/twitter/simclusters_v2/scalding/CompareClusters.scala deleted file mode 100644 index 55d538d4a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/CompareClusters.scala +++ /dev/null @@ -1,131 +0,0 @@ -package com.twitter.simclusters_v2.scalding 
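// Illustrative sketch (editorial addition, not part of the original file): a toy, self-contained
// example of the comparison CompareClusters performs below. Each cluster's membership vector under
// an old and a new KnownFor assignment is compared via cosine similarity; the user ids and scores
// here are made up purely for illustration.
object CompareClustersCosineSketch extends App {
  def norm(a: Iterable[Float]): Float = math.sqrt(a.map(x => x * x).sum).toFloat

  def cosine(a: Map[Long, Float], b: Map[Long, Float]): Float = {
    val dot = a.collect { case (id, score) if b.contains(id) => score * b(id) }.sum
    val (aNorm, bNorm) = (norm(a.values), norm(b.values))
    if (aNorm > 0 && bNorm > 0) dot / aNorm / bNorm else 0f
  }

  // Cluster 42's (userId -> membership score) under the old and the new assignment.
  val oldMembers = Map(1L -> 0.9f, 2L -> 0.5f, 3L -> 0.1f)
  val newMembers = Map(1L -> 0.8f, 2L -> 0.4f, 4L -> 0.3f)

  // A value close to 1.0 means cluster 42 kept roughly the same members and scores.
  println(f"cosine(old, new) = ${cosine(oldMembers, newMembers)}%.3f")
}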
- -import com.twitter.scalding.{DateOps, DateParser, Execution, Stat, TypedPipe, TypedTsv, UniqueID} -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.common.{ClusterId, UserId} -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.common.Util.Distribution - -object CompareClusters { - def norm(a: Iterable[Float]): Float = { - math - .sqrt(a.map { x => x * x }.sum).toFloat - } - - def cosine(a: Map[Long, Float], b: Map[Long, Float]): Float = { - val intersect = a.toList.collect { - case (id, score) if b.contains(id) => - score * b(id) - } - val dot = if (intersect.nonEmpty) intersect.sum else 0 - val aNorm = norm(a.values) - val bNorm = norm(b.values) - if (aNorm > 0 && bNorm > 0) { - dot / aNorm / bNorm - } else 0 - } - - /** - * Compare two known-for data set, and generate change in cluster assignment stats - */ - def compareClusterAssignments( - newKnownFor: TypedPipe[(UserId, List[(ClusterId, Float)])], - oldKnownFor: TypedPipe[(UserId, List[(ClusterId, Float)])] - )( - implicit uniqueID: UniqueID - ): Execution[String] = { - - val emptyToSomething = Stat("no_assignment_to_some") - val somethingToEmpty = Stat("some_assignment_to_none") - val emptyToEmpty = Stat("empty_to_empty") - val sameCluster = Stat("same_cluster") - val diffCluster = Stat("diff_cluster") - - val calculateStatExec = newKnownFor - .outerJoin(oldKnownFor) - .map { - case (userId, (newKnownForListOpt, oldKnownForListOpt)) => - val newKnownFor = newKnownForListOpt.getOrElse(Nil) - val oldKnownFor = oldKnownForListOpt.getOrElse(Nil) - - if (newKnownFor.nonEmpty && oldKnownFor.isEmpty) { - emptyToSomething.inc() - } - if (newKnownFor.isEmpty && oldKnownFor.nonEmpty) { - somethingToEmpty.inc() - } - if (newKnownFor.isEmpty && oldKnownFor.isEmpty) { - emptyToEmpty.inc() - } - - if (newKnownFor.nonEmpty && oldKnownFor.nonEmpty) { - val newClusterId = newKnownFor.head._1 - val oldClusterId = oldKnownFor.head._1 - - if (newClusterId == oldClusterId) { - sameCluster.inc() - } else { - diffCluster.inc() - } - } - userId - } - .toIterableExecution - - Util.getCustomCountersString(calculateStatExec) - } - - /** - * Compare two cluster assignments in terms of cosine similarity of corresponding clusters. - * Excludes clusters which are too small - * @param knownForA - * @param knownForB - * @param minSizeOfBiggerCluster Set to 10 or some such. 
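 *                               A cluster is compared only if it has at least this many members in at least one of the two assignments; clusters smaller than this in both are skipped. (Editorial note added for clarity.)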
- * @return - */ - def compare( - knownForA: TypedPipe[(Int, List[(Long, Float)])], - knownForB: TypedPipe[(Int, List[(Long, Float)])], - minSizeOfBiggerCluster: Int - ): TypedPipe[(Int, Float)] = { - knownForA - .outerJoin(knownForB) - .collect { - case (clusterId, (membersInAOpt, membersInBOpt)) - if membersInAOpt.exists(_.size >= minSizeOfBiggerCluster) || membersInBOpt - .exists(_.size >= minSizeOfBiggerCluster) => - val membersInA = - membersInAOpt.map(_.toMap).getOrElse(Map.empty[Long, Float]) - val membersInB = - membersInBOpt.map(_.toMap).getOrElse(Map.empty[Long, Float]) - (clusterId, cosine(membersInA, membersInB)) - } - } - - def summarize(clusterToCosines: TypedPipe[(Int, Float)]): Execution[Option[Distribution]] = { - clusterToCosines.values.map(x => List(x)).sum.toOptionExecution.map { listOpt => - listOpt.map { list => Util.distributionFromArray(list.map(_.toDouble).toArray) } - } - } -} - -object CompareClustersAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - - val knownForA = KnownForSources.transpose(KnownForSources.readKnownFor(args("knownForA"))) - val knownForB = KnownForSources.transpose(KnownForSources.readKnownFor(args("knownForB"))) - - CompareClusters - .compare(knownForA, knownForB, minSizeOfBiggerCluster = 10) - .map { case (cId, cos) => "%d\t%.2f".format(cId, cos) } - .writeExecution(TypedTsv(args("outputDir"))) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/EigenVectorsForSparseSymmetric.scala b/src/scala/com/twitter/simclusters_v2/scalding/EigenVectorsForSparseSymmetric.scala deleted file mode 100644 index 7171e0e7a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/EigenVectorsForSparseSymmetric.scala +++ /dev/null @@ -1,330 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Monoid -import com.twitter.logging.Logger -import com.twitter.scalding.{Execution, TypedPipe, TypedTsv} -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import java.util -import no.uib.cipr.matrix.Matrix -import no.uib.cipr.matrix.sparse.{ArpackSym, LinkedSparseMatrix} -import scala.collection.JavaConverters._ - -object EigenVectorsForSparseSymmetric { - val log: Logger = Logger() - - /** - * Construct matrix from the rows of the matrix, specified as a map. The outer map is indexed by rowId, and the inner maps are indexed by columnId. - * Note that the input matrix is intended to be symmetric. - * - * @param map A map specifying the rows of the matrix. The outer map is indexed by rowId, and the inner maps are indexed by columnId. Both rows and columns are zero-indexed. - * @param nRows number of rows in matrix - * @param nCols number of columns in matrix - * - * @return the constructed matrix - */ - def getMatrix(map: Map[Int, Map[Int, Double]], nRows: Int, nCols: Int): Matrix = { - val nonzeros = map.toSeq.flatMap { - case (i, subMap) => - subMap.toSeq.map { - case (j, value) => - (i, j, value) - } - } - getMatrix(nonzeros, nRows, nCols) - } - - /** - * Construct matrix from iterable of the non-zero entries. Note that the input matrix is intended to be symmetric. - * - * @param nonzeros non-zeros in (i, j, v) format, where i is row, j is column, and v is value. Both rows and columns are zero-indexed. 
- * @param nRows number of rows in matrix - * @param nCols number of columns in matrix - * - * @return the constructed matrix - */ - def getMatrix(nonzeros: Iterable[(Int, Int, Double)], nRows: Int, nCols: Int): Matrix = { - val matrix = new LinkedSparseMatrix(nRows, nCols) - var numEntries = 0 - var maxRow = 0 - var maxCol = 0 - - nonzeros.foreach { - case (i, j, v) => - if (i > maxRow) { - maxRow = i - } - if (j > maxCol) { - maxCol = j - } - numEntries += 1 - matrix.set(i, j, v) - } - log.info( - "Finished building matrix with %d entries and maxRow %d and maxCol %d" - .format(numEntries, maxRow, maxCol)) - - matrix - } - - /** - * Prints out various diagnostics about how much the given matrix differs from a perfect - * symmetric matrix. If (i,j) and (j,i) are different, it sets both of them to be the max of the two. - * Call this function before invoking EVD. - * - * @param matrix Matrix which is modified (if need be) in place. - */ - def ensureMatrixIsSymmetric(matrix: Matrix): Unit = { - var numUnequalEntries = 0 - var numEntriesDifferentBy1Percent = 0 - var numEqualEntries = 0 - var numUnequalDueToZero = 0 - var maxUnequal = (0, 0, 0.0, 0.0) - matrix.iterator().asScala.foreach { entry => - val curr = entry.get() - val opp = matrix.get(entry.column(), entry.row()) - if (curr == opp) { - numEqualEntries += 1 - } else { - numUnequalEntries += 1 - if (opp == 0) { - numUnequalDueToZero += 1 - } - if (opp != 0 && (math.abs(curr - opp) / math.min(curr, opp)) > 0.01) { - numEntriesDifferentBy1Percent += 1 - } - if (opp != 0 && math.abs(curr - opp) > maxUnequal._4) { - maxUnequal = (entry.row(), entry.column(), curr, math.abs(curr - opp)) - } - val max = math.max(curr, opp) - matrix.set(entry.column(), entry.row(), max) - matrix.set(entry.row(), entry.column(), max) - } - } - - var numUnEqualPrinted = 0 - matrix.iterator().asScala.foreach { entry => - val opp = matrix.get(entry.column(), entry.row()) - if (numUnEqualPrinted < 10 && entry.get() != opp) { - numUnEqualPrinted += 1 - log.info( - "Entries for (%d, %d) are %s and %s" - .format(entry.row(), entry.column(), entry.get(), opp)) - } - } - - log.info( - "Num unequal entries: %d, num unequal due to zero: %d, num unequal by 1percent or more: %d, num equal entries: %d, maxUnequal: %s" - .format( - numUnequalEntries, - numUnequalDueToZero, - numEntriesDifferentBy1Percent, - numEqualEntries, - maxUnequal)) - } - - /** - * Get the top-k eigenvalues (largest magnitude) and eigenvectors for an input matrix. - * Top eigenvalues means they're the largest in magnitude. - * Input matrix needs to be perfectly symmetric; if it's not, this function will fail. - * - * Many of the eigenvectors will have very small values along most of the dimensions. This method also - * only retains the bigger entries in an eigenvector. - * - * @param matrix symmetric input matrix. - * @param k how many of the top eigenvectors to get. - * @param ratioToLargestCutoff An entry needs to be at least 1/ratioToLargestCutoff of the biggest entry in that vector to be retained. - * - * @return seq of (eigenvalue, eigenvector) pairs. 
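 *         Pairs are sorted by decreasing eigenvalue; each eigenvector is returned as sparse (index, value) entries sorted by decreasing absolute value. (Editorial note added for clarity.)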
- */ - def getTruncatedEVD( - matrix: Matrix, - k: Int, - ratioToLargestCutoff: Float - ): Seq[(Double, Seq[(Int, Double)])] = { - val solver = new ArpackSym(matrix) - val resultsMap = solver.solve(k, ArpackSym.Ritz.LM).asScala.toMap - val results = resultsMap.toIndexedSeq.sortBy { case (eigValue, _) => -eigValue } - results.zipWithIndex.map { - case ((eigValue, denseVectorJava), index) => - val denseVector = new Array[Double](denseVectorJava.size()) - denseVector.indices.foreach { index => denseVector(index) = denseVectorJava.get(index) } - val denseVectorMax = denseVector.maxBy { entry => math.abs(entry) } - val cutOff = math.abs(denseVectorMax) / ratioToLargestCutoff - val significantEntries = denseVector.zipWithIndex - .filter { case (vectorEntry, _) => math.abs(vectorEntry) >= cutOff } - .sortBy { case (vectorEntry, _) => -1 * math.abs(vectorEntry) } - (eigValue.toDouble, significantEntries.toSeq.map(_.swap)) - } - } - - /** - * Compute U*Diag*Ut - where Diag is a diagonal matrix, and U is a sparse matrix. - * This is primarily for testing - to make sure that the computed eigenvectors can be used to - * reconstruct the original matrix up to some reasonable approximation. - * - * @param diagToUColumns seq of (diagonal entries, associated column in U) - * @param cutoff cutoff for including a value in the result. - * - * @return result of multiplication, returned as a map of the rows in the results. - */ - def uTimesDiagTimesUT( - diagToUColumns: Seq[(Double, Seq[(Int, Double)])], - cutoff: Double - ): Map[Int, Map[Int, Double]] = { - val result = new util.HashMap[Int, util.HashMap[Int, Double]]() - diagToUColumns.foreach { - case (diag, uColumn) => - uColumn.foreach { - case (i, iVal) => - uColumn.foreach { - case (j, jVal) => - val prod = diag * iVal * jVal - if (result.containsKey(i)) { - val newVal = if (result.get(i).containsKey(j)) { - result.get(i).get(j) + prod - } else prod - result.get(i).put(j, newVal) - } else { - result.put(i, new util.HashMap[Int, Double]) - result.get(i).put(j, prod) - } - } - } - } - val unfiltered = result.asScala.toMap.mapValues(_.asScala.toMap) - unfiltered - .mapValues { m => m.filter { case (_, value) => math.abs(value) >= cutoff } } - .filter { case (_, vector) => vector.nonEmpty } - } - - /** Note: This requires a full EVD to correctly compute the inverse! 
:-( */ - def getInverseFromEVD( - evd: Seq[(Double, Seq[(Int, Double)])], - cutoff: Double - ): Map[Int, Map[Int, Double]] = { - val evdInverse = evd.map { - case (eigValue, eigVector) => - (1.0 / eigValue, eigVector) - } - uTimesDiagTimesUT(evdInverse, cutoff) - } -} - -object PCAProjectionMatrixAdhoc extends TwitterExecutionApp { - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, _) => - Execution.withId { _ => - val args = config.getArgs - val k = args.int("k", 100) - val ratioToLargestEntryInVectorCutoff = args.int("ratioToLargestEntryInVectorCutoff", 100) - val minClusterFavers = args.int("minClusterFavers", 1000) - val input = TypedPipe.from(AdhocKeyValSources.clusterDetailsSource(args("inputDir"))) - val outputDir = args("outputDir") - - val filteredClustersExec = - input - .collect { - case ((_, clusterId), details) - if details.numUsersWithNonZeroFavScore > minClusterFavers => - clusterId - } - .toIterableExecution - .map { fc => - val fcSet = fc.toSet - log.info("Number of clusters with favers more than %d is %d" - .format(minClusterFavers, fcSet.size)) - fcSet - } - - filteredClustersExec - .flatMap { filteredClusters => - input.flatMap { - case ((_, clusterId), details) => - if (filteredClusters(clusterId)) { - details.neighborClusters.getOrElse(Nil).collect { - case neighbor - if filteredClusters( - neighbor.clusterId) && neighbor.favCosineSimilarity.isDefined => - (clusterId, neighbor.clusterId, neighbor.favCosineSimilarity.get) - } - } else Nil - }.toIterableExecution - } - .flatMap { edgesIter => - val edges = edgesIter.toSeq - val oldIdToNewId = edges - .flatMap { case (i, j, _) => Seq(i, j) } - .distinct - .zipWithIndex - .toMap - - val mapString = oldIdToNewId.toList - .take(5).map { - case (old, nw) => - Seq(old, nw).mkString(" ") - }.mkString("\n") - log.info("A few entries of OldId to NewId map is") - log.info(mapString) - - val newIdToOldId = oldIdToNewId.map(_.swap) - log.info( - "Num clusters after filtering out those with no neighbors with favers more than %d is %d" - .format(minClusterFavers, oldIdToNewId.size)) - val newEdges = edges.map { - case (oldI, oldJ, value) => - (oldIdToNewId(oldI), oldIdToNewId(oldJ), value) - } - log.info("Going to build matrix") - val matrix = EigenVectorsForSparseSymmetric.getMatrix( - newEdges, - oldIdToNewId.size, - oldIdToNewId.size) - EigenVectorsForSparseSymmetric.ensureMatrixIsSymmetric(matrix) - - log.info("Going to solve now for %d eigenvalues".format(k)) - val tic = System.currentTimeMillis() - val results = EigenVectorsForSparseSymmetric.getTruncatedEVD( - matrix, - k, - ratioToLargestEntryInVectorCutoff) - val toc = System.currentTimeMillis() - log.info("Finished solving in %.2f minutes".format((toc - tic) / 1000 / 60.0)) - - val eigValues = results.map(_._1).map { x => "%.3g".format(x) }.mkString(" ") - val eigValueNorm = math.sqrt(results.map(_._1).map(x => x * x).sum) - val matrixNorm = math.sqrt(matrix.iterator().asScala.map(_.get()).map(x => x * x).sum) - - println( - "matrixNorm %s, eigValueNorm %s, explained fraction %s" - .format(matrixNorm, eigValueNorm, eigValueNorm / matrixNorm)) - - log.info("The eigenvalues are:") - log.info(eigValues) - - val nnzInEigenVectors = results.map(_._2.size).sum - log.info("Average nnz per eigenvector using ratioToLargestCutoff %d is %.2g" - .format(ratioToLargestEntryInVectorCutoff, nnzInEigenVectors * 1.0 / results.size)) - val transposedRaw = results.zipWithIndex.flatMap { - case ((_, eigVector), eigIndex) => - eigVector.map { 
- case (index, vectorEntry) => - val clusterId = newIdToOldId(index) - Map(clusterId -> List((eigIndex, vectorEntry))) - } - } - val transposed = Monoid.sum(transposedRaw).mapValues { rowForCluster => - rowForCluster - .map { - case (dimId, weight) => - "%d:%.2g".format(dimId, weight) - }.mkString(" ") - } - TypedPipe.from(transposed.toSeq).writeExecution(TypedTsv(outputDir)) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromAggregatableProducerEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromAggregatableProducerEmbeddings.scala deleted file mode 100644 index a65f2a44f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromAggregatableProducerEmbeddings.scala +++ /dev/null @@ -1,332 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite.WriteExtension -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedInFromAggregatableProducerEmbeddings20M145K2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2UserToInterestedInFromAggregatableProducerEmbeddings20M145K2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserAndNeighborsFixedPathSource -import com.twitter.simclusters_v2.hdfs_sources.UserUserNormalizedGraphScalaDataset -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusters -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Production job for computing interestedIn data set from the aggregatable producer embeddings for the model version 20M145K2020. - * It writes the data set in KeyVal format to produce a MH DAL data set. - * - * A high level description of this job: - * - Read the APE dataset - * - Apply log1p to the scores from the above dataset as the scores for producers is high - * - Normalize the scores for each producer (offline benchmarking has shown better results from this step.) 
- * - Truncate the number of clusters for each producer from the APE dataset to reduce noise - * - Compute interestedIn - * - * To deploy the job: - * - * capesospy-v2 update --build_locally --start_cron interested_in_from_ape_2020 \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InterestedInFromAPE2020BatchApp extends InterestedInFromAggregatableProducerEmbeddingsBase { - - override val firstTime: RichDate = RichDate("2021-03-03") - - override val batchIncrement: Duration = Days(7) - - override def modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - override def producerEmbeddingsInputKVDataset: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, SimClustersEmbedding] - ] = AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset - - override def interestedInFromAPEOutputKVDataset: KeyValDALDataset[ - KeyVal[UserId, ClustersUserIsInterestedIn] - ] = SimclustersV2InterestedInFromAggregatableProducerEmbeddings20M145K2020ScalaDataset - - override def interestedInFromAPEOutputThriftDatset: SnapshotDALDataset[ - UserToInterestedInClusters - ] = SimclustersV2UserToInterestedInFromAggregatableProducerEmbeddings20M145K2020ScalaDataset -} - -trait InterestedInFromAggregatableProducerEmbeddingsBase extends ScheduledExecutionApp { - def modelVersion: ModelVersion - - def interestedInFromAPEOutputKVDataset: KeyValDALDataset[ - KeyVal[UserId, ClustersUserIsInterestedIn] - ] - - def producerEmbeddingsInputKVDataset: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, SimClustersEmbedding] - ] - - def interestedInFromAPEOutputThriftDatset: SnapshotDALDataset[UserToInterestedInClusters] - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - //Input args for the run - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersFromProducer = args.int("maxClustersPerProducer", 5) - val maxClustersPerUserFinalResult = args.int("maxInterestedInClustersPerUser", 200) - - //Path variables - val interestedInFromProducersPath = - s"/user/cassowary/manhattan_sequence_files/interested_in_from_ape/" + modelVersion - - val interestedInFromProducersThriftPath = - s"/user/cassowary/manhattan_sequence_files/interested_in_from_ape_thrift/" + modelVersion - - val userUserGraph: TypedPipe[UserAndNeighbors] = - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(AllowCrossDC) - .toTypedPipe - - val producerEmbeddings = DAL - .readMostRecentSnapshotNoOlderThan( - producerEmbeddingsInputKVDataset, - Days(30)).withRemoteReadPolicy(AllowCrossClusterSameDC).toTypedPipe.map { - case KeyVal(producer, embeddings) => (producer, embeddings) - } - - val result = InterestedInFromAggregatableProducerEmbeddingsBase.run( - userUserGraph, - producerEmbeddings, - maxClustersFromProducer, - socialProofThreshold, - maxClustersPerUserFinalResult, - modelVersion) - - val keyValExec = - result - .map { case (userId, clusters) => KeyVal(userId, clusters) } - .writeDALVersionedKeyValExecution( - interestedInFromAPEOutputKVDataset, - D.Suffix(interestedInFromProducersPath) - ) - val thriftExec = - result - .map { - case (userId, clusters) => - UserToInterestedInClusters( - userId, - ModelVersions.toKnownForModelVersion(modelVersion), - clusters.clusterIdToScores) - } - .writeDALSnapshotExecution( - interestedInFromAPEOutputThriftDatset, - D.Daily, - D.Suffix(interestedInFromProducersThriftPath), - D.EBLzo(), - 
dateRange.end - ) - Execution.zip(keyValExec, thriftExec).unit - } -} - -/** - * Adhoc job to generate the interestedIn from aggregatable producer embeddings for the model version 20M145K2020 - * - * scalding remote run \ - * --user cassowary \ - * --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - * --principal service_acoount@TWITTER.BIZ \ - * --cluster bluebird-qus1 \ - * --main-class com.twitter.simclusters_v2.scalding.InterestedInFromAPE2020AdhocApp \ - * --target src/scala/com/twitter/simclusters_v2/scalding:interested_in_from_ape_2020-adhoc \ - * --hadoop-properties "mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.memory.mb=8192 mapreduce.reduce.java.opts='-Xmx7618M'" \ - * -- --outputDir /gcs/user/cassowary/adhoc/your_ldap/interested_in_from_ape_2020_keyval --date 2021-03-05 - */ -object InterestedInFromAPE2020AdhocApp extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val outputDir = args("outputDir") - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUserFinalResult = args.int("maxInterestedInClustersPerUser", 200) - val maxClustersFromProducer = args.int("maxClustersFromProducer", 5) - val inputGraph = args.optional("graphInputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - } - - val producerEmbeddings = DAL - .readMostRecentSnapshotNoOlderThan( - AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset, - Days(30)).withRemoteReadPolicy(AllowCrossClusterSameDC).toTypedPipe.map { - case KeyVal(producer, embeddings) => (producer, embeddings) - } - - val result = InterestedInFromAggregatableProducerEmbeddingsBase.run( - inputGraph, - producerEmbeddings, - maxClustersFromProducer, - socialProofThreshold, - maxClustersPerUserFinalResult, - ModelVersion.Model20m145k2020) - - result - .writeExecution(AdhocKeyValSources.interestedInSource(outputDir)) - } -} - -/** - * Helper functions - */ -object InterestedInFromAggregatableProducerEmbeddingsBase { - - /** - * Helper function to prune the embeddings - * @param embeddingsWithScore embeddings - * @param maxClusters number of clusters to keep, per userId - * @param uniqueId for stats - * @return - */ - def getPrunedEmbeddings( - embeddingsWithScore: TypedPipe[(UserId, Seq[(ClusterId, Float)])], - maxClusters: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, Array[(ClusterId, Float)])] = { - val numProducerMappings = Stat("num_producer_embeddings_total") - val numProducersWithLargeClusterMappings = Stat( - "num_producers_with_more_clusters_than_threshold") - val numProducersWithSmallClusterMappings = Stat( - "num_producers_with_clusters_less_than_threshold") - val totalClustersCoverageProducerEmbeddings = Stat("num_clusters_total_producer_embeddings") - embeddingsWithScore.map { - case (producerId, clusterArray) => - numProducerMappings.inc() - val clusterSize = clusterArray.size - totalClustersCoverageProducerEmbeddings.incBy(clusterSize) - val prunedList = if (clusterSize > maxClusters) { - numProducersWithLargeClusterMappings.inc() - clusterArray - .sortBy { - case (_, knownForScore) => -knownForScore - }.take(maxClusters) - } else { - numProducersWithSmallClusterMappings.inc() - 
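// (Editorial note.) At or below maxClusters: keep the producer's full cluster list unchanged.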
clusterArray - } - (producerId, prunedList.toArray) - } - } - - /** - * helper function to remove all scores except follow and logFav - * @param interestedInResult interestedIn clusters for a user - * @return - */ - def getInterestedInDiscardScores( - interestedInResult: TypedPipe[(UserId, List[(ClusterId, UserToInterestedInClusterScores)])] - ): TypedPipe[(UserId, List[(ClusterId, UserToInterestedInClusterScores)])] = { - interestedInResult.map { - case (srcId, fullClusterList) => - val fullClusterListWithDiscardedScores = fullClusterList.map { - case (clusterId, clusterDetails) => - val clusterDetailsWithoutSocial = UserToInterestedInClusterScores( - // We are not planning to use the other scores except for logFav and Follow. - // Hence, setting others as None for now, we can add them back when needed - followScore = clusterDetails.followScore, - logFavScore = clusterDetails.logFavScore, - logFavScoreClusterNormalizedOnly = clusterDetails.logFavScoreClusterNormalizedOnly - ) - (clusterId, clusterDetailsWithoutSocial) - } - (srcId, fullClusterListWithDiscardedScores) - } - } - - /** - * Helper function to normalize the embeddings - * @param embeddings cluster embeddings - * @return - */ - def getNormalizedEmbeddings( - embeddings: TypedPipe[(UserId, Seq[(ClusterId, Float)])] - ): TypedPipe[(UserId, Seq[(ClusterId, Float)])] = { - embeddings.map { - case (userId, clustersWithScores) => - val l2norm = math.sqrt(clustersWithScores.map(_._2).map(score => score * score).sum) - ( - userId, - clustersWithScores.map { - case (clusterId, score) => (clusterId, (score / l2norm).toFloat) - }) - } - } - - def run( - userUserGraph: TypedPipe[UserAndNeighbors], - producerEmbeddings: TypedPipe[(SimClustersEmbeddingId, SimClustersEmbedding)], - maxClustersFromProducer: Int, - socialProofThreshold: Int, - maxClustersPerUserFinalResult: Int, - modelVersion: ModelVersion - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - import InterestedInFromKnownFor._ - - val producerEmbeddingsWithScore: TypedPipe[(UserId, Seq[(ClusterId, Float)])] = - producerEmbeddings.map { - case ( - SimClustersEmbeddingId(embeddingType, modelVersion, InternalId.UserId(producerId)), - simclusterEmbedding) => - ( - producerId, - simclusterEmbedding.embedding.map { simclusterWithScore => - // APE dataset has very high producer scores, hence applying log to smoothen them out before - // computing interestedIn - (simclusterWithScore.clusterId, math.log(1.0 + simclusterWithScore.score).toFloat) - }) - } - - val result = keepOnlyTopClusters( - getInterestedInDiscardScores( - attachNormalizedScores( - userClusterPairsWithoutNormalization( - userUserGraph, - getPrunedEmbeddings( - getNormalizedEmbeddings(producerEmbeddingsWithScore), - maxClustersFromProducer), - socialProofThreshold, - ))), - maxClustersPerUserFinalResult, - ModelVersions.toKnownForModelVersion(modelVersion) - ) - result - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownFor.scala b/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownFor.scala deleted file mode 100644 index ab2cbde2d..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownFor.scala +++ /dev/null @@ -1,666 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Semigroup -import com.twitter.bijection.Injection -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding.TypedPipe -import com.twitter.scalding._ -import 
com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala._ - -/** - * This file implements the job for computing users' interestedIn vector from KnownFor data set. - * - * It reads the UserUserNormalizedGraphScalaDataset to get user-user follow + fav graph, and then - * based on the known-for clusters of each followed/faved user, we calculate how much a user is - * interestedIn a cluster. - */ - -/** - * Production job for computing interestedIn data set for the model version 20M145K2020. - * - * To deploy the job: - * - * capesospy-v2 update --build_locally --start_cron interested_in_for_20M_145k_2020 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InterestedInFromKnownFor20M145K2020 extends InterestedInFromKnownForBatchBase { - override val firstTime: String = "2020-10-06" - override val outputKVDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] = - SimclustersV2RawInterestedIn20M145K2020ScalaDataset - override val outputPath: String = InternalDataPaths.RawInterestedIn2020Path - override val knownForModelVersion: String = ModelVersions.Model20M145K2020 - override val knownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] = - SimclustersV2KnownFor20M145K2020ScalaDataset -} - -/** - * base class for the main logic of computing interestedIn from KnownFor data set. 
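 * Subclasses provide the concrete knownFor input dataset, the output dataset and path, the model version and the batch start date. (Editorial note added for clarity.)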
- */ -trait InterestedInFromKnownForBatchBase extends TwitterScheduledExecutionApp { - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - - def firstTime: String - val batchIncrement: Duration = Days(7) - val lookBackDays: Duration = Days(30) - - def outputKVDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] - def outputPath: String - def knownForModelVersion: String - def knownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - private lazy val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val normalizedGraph = - DAL.readMostRecentSnapshot(UserUserNormalizedGraphScalaDataset).toTypedPipe - val knownFor = KnownForSources.fromKeyVal( - DAL.readMostRecentSnapshot(knownForDALDataset, dateRange.extend(Days(30))).toTypedPipe, - knownForModelVersion - ) - - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUser = args.int("maxClustersPerUser", 50) - - val result = InterestedInFromKnownFor - .run( - normalizedGraph, - knownFor, - socialProofThreshold, - maxClustersPerUser, - knownForModelVersion - ) - - val writeKeyValResultExec = result - .map { case (userId, clusters) => KeyVal(userId, clusters) } - .writeDALVersionedKeyValExecution( - outputKVDataset, - D.Suffix(outputPath) - ) - - // read previous data set for validation purpose - val previousDataset = if (RichDate(firstTime).timestamp != dateRange.start.timestamp) { - DAL - .readMostRecentSnapshot(outputKVDataset, dateRange.prepend(lookBackDays)).toTypedPipe - .map { - case KeyVal(user, interestedIn) => - (user, interestedIn) - } - } else { - TypedPipe.empty - } - - Util.printCounters( - Execution - .zip( - writeKeyValResultExec, - InterestedInFromKnownFor.dataSetStats(result, "NewResult"), - InterestedInFromKnownFor.dataSetStats(previousDataset, "OldResult") - ).unit - ) - } - } - } -} - -/** - * Adhoc job to compute user interestedIn. 
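 * Unlike the batch job above, it reads the user-user graph and the knownFor data from explicit HDFS paths passed in as arguments. (Editorial note added for clarity.)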
- * - * scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding:interested_in_adhoc \ - * --user recos-platform \ - * --submitter hadoopnest2.atla.twitter.com \ - * --main-class com.twitter.simclusters_v2.scalding.InterestedInFromKnownForAdhoc -- \ - * --date 2019-08-26 --outputDir /user/recos-platform/adhoc/simclusters_interested_in_log_fav - */ -object InterestedInFromKnownForAdhoc extends TwitterExecutionApp { - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val normalizedGraph = TypedPipe.from( - UserAndNeighborsFixedPathSource(args("graphInputDir")) - ) - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUser = args.int("maxClustersPerUser", 20) - val knownForModelVersion = args("knownForModelVersion") - val knownFor = KnownForSources.readKnownFor(args("knownForInputDir")) - - val outputSink = AdhocKeyValSources.interestedInSource(args("outputDir")) - Util.printCounters( - InterestedInFromKnownFor - .run( - normalizedGraph, - knownFor, - socialProofThreshold, - maxClustersPerUser, - knownForModelVersion - ).writeExecution(outputSink) - ) - } - } -} - -/** - * Adhoc job to check the output of an adhoc interestedInSource. - */ -object DumpInterestedInAdhoc extends TwitterExecutionApp { - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val users = args.list("users").map(_.toLong).toSet - val input = TypedPipe.from(AdhocKeyValSources.interestedInSource(args("inputDir"))) - input.filter { case (userId, rec) => users.contains(userId) }.toIterableExecution.map { - s => println(s.map(Util.prettyJsonMapper.writeValueAsString).mkString("\n")) - } - } - } -} - -/** - * Helper functions - */ -object InterestedInFromKnownFor { - private def ifNanMake0(x: Double): Double = if (x.isNaN) 0.0 else x - - case class SrcClusterIntermediateInfo( - followScore: Double, - followScoreProducerNormalized: Double, - favScore: Double, - favScoreProducerNormalized: Double, - logFavScore: Double, - logFavScoreProducerNormalized: Double, - followSocialProof: List[Long], - favSocialProof: List[Long]) { - // overriding for the sake of unit tests - override def equals(obj: scala.Any): Boolean = { - obj match { - case that: SrcClusterIntermediateInfo => - math.abs(followScore - that.followScore) < 1e-5 && - math.abs(followScoreProducerNormalized - that.followScoreProducerNormalized) < 1e-5 && - math.abs(favScore - that.favScore) < 1e-5 && - math.abs(favScoreProducerNormalized - that.favScoreProducerNormalized) < 1e-5 && - math.abs(logFavScore - that.logFavScore) < 1e-5 && - math.abs(logFavScoreProducerNormalized - that.logFavScoreProducerNormalized) < 1e-5 && - followSocialProof.toSet == that.followSocialProof.toSet && - favSocialProof.toSet == that.favSocialProof.toSet - case _ => false - } - } - } - - implicit object SrcClusterIntermediateInfoSemigroup - extends Semigroup[SrcClusterIntermediateInfo] { - override def plus( - left: SrcClusterIntermediateInfo, - right: SrcClusterIntermediateInfo - ): SrcClusterIntermediateInfo = { - SrcClusterIntermediateInfo( - followScore = left.followScore + right.followScore, - followScoreProducerNormalized = - left.followScoreProducerNormalized + right.followScoreProducerNormalized, - favScore = left.favScore + right.favScore, - favScoreProducerNormalized = - left.favScoreProducerNormalized + 
right.favScoreProducerNormalized, - logFavScore = left.logFavScore + right.logFavScore, - logFavScoreProducerNormalized = - left.logFavScoreProducerNormalized + right.logFavScoreProducerNormalized, - followSocialProof = - Semigroup.plus(left.followSocialProof, right.followSocialProof).distinct, - favSocialProof = Semigroup.plus(left.favSocialProof, right.favSocialProof).distinct - ) - } - } - - /** - * @param adjacencyLists User-User follow/fav graph - * @param knownFor KnownFor data set. Each user can be known for several clusters with certain - * knownFor weights. - * @param socialProofThreshold A user will only be interested in a cluster if they follow/fav at - * least certain number of users known for this cluster. - * @param uniqueId required for these Stat - * @return - */ - def userClusterPairsWithoutNormalization( - adjacencyLists: TypedPipe[UserAndNeighbors], - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - socialProofThreshold: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[((Long, Int), SrcClusterIntermediateInfo)] = { - val edgesToUsersWithKnownFor = Stat("num_edges_to_users_with_known_for") - val srcDestClusterTriples = Stat("num_src_dest_cluster_triples") - val srcClusterPairsBeforeSocialProofThresholding = - Stat("num_src_cluster_pairs_before_social_proof_thresholding") - val srcClusterPairsAfterSocialProofThresholding = - Stat("num_src_cluster_pairs_after_social_proof_thresholding") - - val edges = adjacencyLists.flatMap { - case UserAndNeighbors(srcId, neighborsWithWeights) => - neighborsWithWeights.map { neighborWithWeights => - ( - neighborWithWeights.neighborId, - neighborWithWeights.copy(neighborId = srcId) - ) - } - } - - implicit val l2b: Long => Array[Byte] = Injection.long2BigEndian - - edges - .sketch(4000) - .join(knownFor) - .flatMap { - case (destId, (srcWithWeights, clusterArray)) => - edgesToUsersWithKnownFor.inc() - clusterArray.toList.map { - case (clusterId, knownForScoreF) => - val knownForScore = math.max(0.0, knownForScoreF.toDouble) - - srcDestClusterTriples.inc() - val followScore = - if (srcWithWeights.isFollowed.contains(true)) knownForScore else 0.0 - val followScoreProducerNormalizedOnly = - srcWithWeights.followScoreNormalizedByNeighborFollowersL2.getOrElse( - 0.0) * knownForScore - val favScore = - srcWithWeights.favScoreHalfLife100Days.getOrElse(0.0) * knownForScore - - val favScoreProducerNormalizedOnly = - srcWithWeights.favScoreHalfLife100DaysNormalizedByNeighborFaversL2.getOrElse( - 0.0) * knownForScore - - val logFavScore = srcWithWeights.logFavScore.getOrElse(0.0) * knownForScore - - val logFavScoreProducerNormalizedOnly = srcWithWeights.logFavScoreL2Normalized - .getOrElse(0.0) * knownForScore - - val followSocialProof = if (srcWithWeights.isFollowed.contains(true)) { - List(destId) - } else Nil - val favSocialProof = if (srcWithWeights.favScoreHalfLife100Days.exists(_ > 0)) { - List(destId) - } else Nil - - ( - (srcWithWeights.neighborId, clusterId), - SrcClusterIntermediateInfo( - followScore, - followScoreProducerNormalizedOnly, - favScore, - favScoreProducerNormalizedOnly, - logFavScore, - logFavScoreProducerNormalizedOnly, - followSocialProof, - favSocialProof - ) - ) - } - } - .sumByKey - .withReducers(10000) - .filter { - case ((_, _), SrcClusterIntermediateInfo(_, _, _, _, _, _, followProof, favProof)) => - srcClusterPairsBeforeSocialProofThresholding.inc() - val distinctSocialProof = (followProof ++ favProof).toSet - val result = distinctSocialProof.size >= socialProofThreshold - if (result) { - 
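// (Editorial note.) This (user, cluster) pair has enough distinct followed/faved producers known for the cluster; keep it.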
srcClusterPairsAfterSocialProofThresholding.inc() - } - result - } - } - - /** - * Add the cluster-level l2 norm scores, and use them to normalize follow/fav scores. - */ - def attachNormalizedScores( - intermediate: TypedPipe[((Long, Int), SrcClusterIntermediateInfo)] - )( - implicit uniqueId: UniqueID - ): TypedPipe[(Long, List[(Int, UserToInterestedInClusterScores)])] = { - - def square(x: Double): Double = x * x - - val clusterCountsAndNorms = - intermediate - .map { - case ( - (_, clusterId), - SrcClusterIntermediateInfo( - followScore, - followScoreProducerNormalizedOnly, - favScore, - favScoreProducerNormalizedOnly, - logFavScore, - logFavScoreProducerNormalizedOnly, - _, - _ - ) - ) => - ( - clusterId, - ( - 1, - square(followScore), - square(followScoreProducerNormalizedOnly), - square(favScore), - square(favScoreProducerNormalizedOnly), - square(logFavScore), - square(logFavScoreProducerNormalizedOnly) - ) - ) - } - .sumByKey - // .withReducers(100) - .map { - case ( - clusterId, - ( - cnt, - squareFollowScore, - squareFollowScoreProducerNormalizedOnly, - squareFavScore, - squareFavScoreProducerNormalizedOnly, - squareLogFavScore, - squareLogFavScoreProducerNormalizedOnly - )) => - ( - clusterId, - ( - cnt, - math.sqrt(squareFollowScore), - math.sqrt(squareFollowScoreProducerNormalizedOnly), - math.sqrt(squareFavScore), - math.sqrt(squareFavScoreProducerNormalizedOnly), - math.sqrt(squareLogFavScore), - math.sqrt(squareLogFavScoreProducerNormalizedOnly) - )) - } - - implicit val i2b: Int => Array[Byte] = Injection.int2BigEndian - - intermediate - .map { - case ((srcId, clusterId), clusterScoresTuple) => - (clusterId, (srcId, clusterScoresTuple)) - } - .sketch(reducers = 900) - .join(clusterCountsAndNorms) - .map { - case ( - clusterId, - ( - ( - srcId, - SrcClusterIntermediateInfo( - followScore, - followScoreProducerNormalizedOnly, - favScore, - favScoreProducerNormalizedOnly, - logFavScore, - logFavScoreProducerNormalizedOnly, // not used for now - followProof, - favProof - ) - ), - ( - cnt, - followNorm, - followProducerNormalizedNorm, - favNorm, - favProducerNormalizedNorm, - logFavNorm, - logFavProducerNormalizedNorm // not used for now - ) - ) - ) => - ( - srcId, - List( - ( - clusterId, - UserToInterestedInClusterScores( - followScore = Some(ifNanMake0(followScore)), - followScoreClusterNormalizedOnly = Some(ifNanMake0(followScore / followNorm)), - followScoreProducerNormalizedOnly = - Some(ifNanMake0(followScoreProducerNormalizedOnly)), - followScoreClusterAndProducerNormalized = Some( - ifNanMake0(followScoreProducerNormalizedOnly / followProducerNormalizedNorm)), - favScore = Some(ifNanMake0(favScore)), - favScoreClusterNormalizedOnly = Some(ifNanMake0(favScore / favNorm)), - favScoreProducerNormalizedOnly = Some(ifNanMake0(favScoreProducerNormalizedOnly)), - favScoreClusterAndProducerNormalized = - Some(ifNanMake0(favScoreProducerNormalizedOnly / favProducerNormalizedNorm)), - usersBeingFollowed = Some(followProof), - usersThatWereFaved = Some(favProof), - numUsersInterestedInThisClusterUpperBound = Some(cnt), - logFavScore = Some(ifNanMake0(logFavScore)), - logFavScoreClusterNormalizedOnly = Some(ifNanMake0(logFavScore / logFavNorm)) - )) - ) - ) - } - .sumByKey - // .withReducers(1000) - .toTypedPipe - } - - /** - * aggregate cluster scores for each user, to be used instead of attachNormalizedScores - * when we donot want to compute cluster-level l2 norm scores - */ - def groupClusterScores( - intermediate: TypedPipe[((Long, Int), SrcClusterIntermediateInfo)] - )( - 
implicit uniqueId: UniqueID - ): TypedPipe[(Long, List[(Int, UserToInterestedInClusterScores)])] = { - - intermediate - .map { - case ( - (srcId, clusterId), - SrcClusterIntermediateInfo( - followScore, - followScoreProducerNormalizedOnly, - favScore, - favScoreProducerNormalizedOnly, - logFavScore, - logFavScoreProducerNormalizedOnly, - followProof, - favProof - ) - ) => - ( - srcId, - List( - ( - clusterId, - UserToInterestedInClusterScores( - followScore = Some(ifNanMake0(followScore)), - followScoreProducerNormalizedOnly = - Some(ifNanMake0(followScoreProducerNormalizedOnly)), - favScore = Some(ifNanMake0(favScore)), - favScoreProducerNormalizedOnly = Some(ifNanMake0(favScoreProducerNormalizedOnly)), - usersBeingFollowed = Some(followProof), - usersThatWereFaved = Some(favProof), - logFavScore = Some(ifNanMake0(logFavScore)), - )) - ) - ) - } - .sumByKey - .withReducers(1000) - .toTypedPipe - } - - /** - * For each user, only keep up to a certain number of clusters. - * @param allInterests user with a list of interestedIn clusters. - * @param maxClustersPerUser number of clusters to keep for each user - * @param knownForModelVersion known for model version - * @param uniqueId required for these Stat - * @return - */ - def keepOnlyTopClusters( - allInterests: TypedPipe[(Long, List[(Int, UserToInterestedInClusterScores)])], - maxClustersPerUser: Int, - knownForModelVersion: String - )( - implicit uniqueId: UniqueID - ): TypedPipe[(Long, ClustersUserIsInterestedIn)] = { - val userClusterPairsBeforeUserTruncation = - Stat("num_user_cluster_pairs_before_user_truncation") - val userClusterPairsAfterUserTruncation = - Stat("num_user_cluster_pairs_after_user_truncation") - val usersWithALotOfClusters = - Stat(s"num_users_with_more_than_${maxClustersPerUser}_clusters") - - allInterests - .map { - case (srcId, fullClusterList) => - userClusterPairsBeforeUserTruncation.incBy(fullClusterList.size) - val truncatedClusters = if (fullClusterList.size > maxClustersPerUser) { - usersWithALotOfClusters.inc() - fullClusterList - .sortBy { - case (_, clusterScores) => - ( - -clusterScores.favScore.getOrElse(0.0), - -clusterScores.logFavScore.getOrElse(0.0), - -clusterScores.followScore.getOrElse(0.0), - -clusterScores.logFavScoreClusterNormalizedOnly.getOrElse(0.0), - -clusterScores.followScoreProducerNormalizedOnly.getOrElse(0.0) - ) - } - .take(maxClustersPerUser) - } else { - fullClusterList - } - userClusterPairsAfterUserTruncation.incBy(truncatedClusters.size) - (srcId, ClustersUserIsInterestedIn(knownForModelVersion, truncatedClusters.toMap)) - } - } - - def run( - adjacencyLists: TypedPipe[UserAndNeighbors], - knownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])], - socialProofThreshold: Int, - maxClustersPerUser: Int, - knownForModelVersion: String - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - keepOnlyTopClusters( - attachNormalizedScores( - userClusterPairsWithoutNormalization( - adjacencyLists, - knownFor, - socialProofThreshold - ) - ), - maxClustersPerUser, - knownForModelVersion - ) - } - - /** - * run the interestedIn job, cluster normalized scores are not attached to user's clusters. 
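A minimal, self-contained sketch of the scoring idea implemented by userClusterPairsWithoutNormalization and sumByKey above, using made-up in-memory data instead of scalding TypedPipes: every followed/faved producer contributes its knownFor weight for a cluster, scaled by the engagement weight, and keepOnlyTopClusters later truncates the resulting per-user cluster list (ordered by favScore, then logFavScore, then followScore).

// Illustrative only; all ids and weights below are hypothetical.
object InterestedInToyExample {
  // producer -> knownFor clusters with weights
  val knownFor: Map[Long, Array[(Int, Float)]] =
    Map(10L -> Array(1 -> 0.8f, 2 -> 0.2f), 11L -> Array(1 -> 0.5f))

  // (user, producer) -> fav engagement weight (favScoreHalfLife100Days)
  val favWeight: Map[(Long, Long), Double] =
    Map((100L, 10L) -> 2.0, (100L, 11L) -> 1.0)

  // (user, cluster) -> favScore = sum over producers of favWeight * knownForScore
  val favScores: Map[(Long, Int), Double] =
    favWeight.toList
      .flatMap {
        case ((user, producer), wt) =>
          knownFor.getOrElse(producer, Array.empty[(Int, Float)]).toList.map {
            case (clusterId, knownForScore) => ((user, clusterId), wt * knownForScore)
          }
      }
      .groupBy { case (key, _) => key }
      .map { case (key, contributions) => key -> contributions.map(_._2).sum }
  // favScores((100L, 1)) == 2.0 * 0.8 + 1.0 * 0.5 ≈ 2.1

  // attachNormalizedScores additionally divides each score by the cluster-level
  // L2 norm, e.g. favScoreClusterNormalizedOnly(u, c) =
  //   favScore(u, c) / sqrt(sum over all users v of favScore(v, c)^2)
}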
- */ - def runWithoutClusterNormalizedScores( - adjacencyLists: TypedPipe[UserAndNeighbors], - knownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])], - socialProofThreshold: Int, - maxClustersPerUser: Int, - knownForModelVersion: String - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - keepOnlyTopClusters( - groupClusterScores( - userClusterPairsWithoutNormalization( - adjacencyLists, - knownFor, - socialProofThreshold - ) - ), - maxClustersPerUser, - knownForModelVersion - ) - } - - /** - * print out some basic stats of the data set to make sure things are not broken - */ - def dataSetStats( - interestedInData: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - dataSetName: String = "" - ): Execution[Unit] = { - - Execution - .zip( - Util.printSummaryOfNumericColumn( - interestedInData.map { - case (user, interestedIn) => - interestedIn.clusterIdToScores.size - }, - Some(s"$dataSetName UserInterestedIn Size") - ), - Util.printSummaryOfNumericColumn( - interestedInData.flatMap { - case (user, interestedIn) => - interestedIn.clusterIdToScores.map { - case (_, scores) => - scores.favScore.getOrElse(0.0) - } - }, - Some(s"$dataSetName UserInterestedIn favScore") - ), - Util.printSummaryOfNumericColumn( - interestedInData.flatMap { - case (user, interestedIn) => - interestedIn.clusterIdToScores.map { - case (_, scores) => - scores.favScoreClusterNormalizedOnly.getOrElse(0.0) - } - }, - Some(s"$dataSetName UserInterestedIn favScoreClusterNormalizedOnly") - ), - Util.printSummaryOfNumericColumn( - interestedInData.flatMap { - case (user, interestedIn) => - interestedIn.clusterIdToScores.map { - case (_, scores) => - scores.logFavScoreClusterNormalizedOnly.getOrElse(0.0) - } - }, - Some(s"$dataSetName UserInterestedIn logFavScoreClusterNormalizedOnly") - ) - ).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownForLite.scala b/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownForLite.scala deleted file mode 100644 index e4b23ae52..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromKnownForLite.scala +++ /dev/null @@ -1,354 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Semigroup -import com.twitter.bijection.Injection -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.{D, WriteExtension} -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.{ - AnalyticsBatchExecution, - AnalyticsBatchExecutionArgs, - BatchDescription, - BatchFirstTime, - BatchIncrement, - TwitterScheduledExecutionApp -} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{ClusterId, ModelVersions, UserId} -import com.twitter.simclusters_v2.hdfs_sources.{ - AdhocKeyValSources, - InternalDataPaths, - SimclustersV2KnownFor20M145K2020ScalaDataset, - SimclustersV2RawInterestedInLite20M145K2020ScalaDataset, - SimclustersV2RawInterestedIn20M145KUpdatedScalaDataset, - UserAndNeighborsFixedPathSource, - UserUserGraphScalaDataset -} -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.{ - ClustersUserIsInterestedIn, - ClustersUserIsKnownFor, - UserAndNeighbors, - UserToInterestedInClusterScores -} -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp 
-import java.util.TimeZone - -/** - * This file implements the job for computing users' interestedIn vector from KnownFor data set. - * - * It reads the UserUserGraphScalaDataset to get user-user follow + fav graph, and then - * based on the known-for clusters of each followed/faved user, we calculate how much a user is - * interestedIn a cluster. - * - * The main differences of the InterestedInFromKnownForLite compared to InterestedInFromKnownFor are - * the following: - * - We read the UserUserGraph dataset that doesnot contain the producer normalized scores - * - We donot compute the cluster normalized scores for the clusters per user - * - For social proof thresholding, we donot keep track of the entire list of follow and - * fav social proofs but rather make use of numFollowSocial and numFavSocial (this introduces - * some noise if follow and fav social proof contain the same users) - * - Store 200 clusters per user compared to 50 in IIKF - * - Runs more frequently compared to weekly in IIKF - */ -/** - * Production job for computing interestedIn data set for the model version 20M145K2020. - * - * To deploy the job: - * - * capesospy-v2 update --build_locally --start_cron interested_in_lite_for_20M_145k_2020 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InterestedInFromKnownForLite20M145K2020 extends InterestedInFromKnownForLite { - override val firstTime: String = "2021-04-24" - override val outputKVDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] = - SimclustersV2RawInterestedInLite20M145K2020ScalaDataset - override val outputPath: String = InternalDataPaths.RawInterestedInLite2020Path - override val knownForModelVersion: String = ModelVersions.Model20M145K2020 - override val knownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] = - SimclustersV2KnownFor20M145K2020ScalaDataset -} -trait InterestedInFromKnownForLite extends TwitterScheduledExecutionApp { - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - - def firstTime: String - val batchIncrement: Duration = Days(2) - val lookBackDays: Duration = Days(30) - - def outputKVDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] - def outputPath: String - def knownForModelVersion: String - def knownForDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - private lazy val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val userUserGraph = - DAL.readMostRecentSnapshot(UserUserGraphScalaDataset).toTypedPipe - val knownFor = KnownForSources.fromKeyVal( - DAL.readMostRecentSnapshot(knownForDALDataset, dateRange.extend(Days(30))).toTypedPipe, - knownForModelVersion - ) - - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUser = args.int("maxClustersPerUser", 200) - - val result = InterestedInFromKnownForLite - .run( - userUserGraph, - knownFor, - socialProofThreshold, - maxClustersPerUser, - knownForModelVersion - ) - - val writeKeyValResultExec = result - .map { - case (userId, clusters) => KeyVal(userId, clusters) - }.writeDALVersionedKeyValExecution( - outputKVDataset, - D.Suffix(outputPath) - ) - 
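A small sketch (toy values, not production code) of the social-proof difference called out in the file comment above: InterestedInFromKnownFor thresholds on the number of distinct users across the follow and fav proofs, while this Lite job sums the two counters, so a producer who is both followed and faved by the user is counted twice.

// Hypothetical example with socialProofThreshold = 2.
val followProof = List(10L)  // the user follows producer 10
val favProof = List(10L)     // and also favs producer 10
val keptByKnownFor = (followProof ++ favProof).toSet.size >= 2  // false: 1 distinct user
val keptByLite = followProof.size + favProof.size >= 2          // true: the overlap is double-counted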
Util.printCounters(writeKeyValResultExec) - } - } - } -} - -/** - * Adhoc job to compute user interestedIn. - * - * scalding remote run \ - * --target src/scala/com/twitter/simclusters_v2/scalding:interested_in_lite_20m_145k_2020-adhoc \ - * --main-class com.twitter.simclusters_v2.scalding.InterestedInFromKnownForLite20M145K2020Adhoc \ - * --user cassowary --cluster bluebird-qus1 \ - * --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - * --principal service_acoount@TWITTER.BIZ \ - * -- \ - * --outputDir /gcs/user/cassowary/adhoc/interested_in_from_knownfor_lite/ \ - * --date 2020-08-25 - */ -object InterestedInFromKnownForLite20M145K2020Adhoc extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val userUserGraph = DAL.readMostRecentSnapshot(UserUserGraphScalaDataset).toTypedPipe - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUser = args.int("maxClustersPerUser", 200) - val knownForModelVersion = ModelVersions.Model20M145K2020 - val knownFor = KnownForSources.fromKeyVal( - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2KnownFor20M145K2020ScalaDataset, - Days(30)).toTypedPipe, - knownForModelVersion - ) - - val outputSink = AdhocKeyValSources.interestedInSource(args("outputDir")) - Util.printCounters( - InterestedInFromKnownForLite - .run( - userUserGraph, - knownFor, - socialProofThreshold, - maxClustersPerUser, - knownForModelVersion - ).writeExecution(outputSink) - ) - } - -} - -object InterestedInFromKnownForLite { - private def ifNanMake0(x: Double): Double = if (x.isNaN) 0.0 else x - - case class SrcClusterIntermediateInfo( - followScore: Double, - favScore: Double, - logFavScore: Double, - numFollowed: Int, - numFaved: Int) { - - // helper function used for test cases - override def equals(obj: scala.Any): Boolean = { - obj match { - case that: SrcClusterIntermediateInfo => - math.abs(followScore - that.followScore) < 1e-5 && - math.abs(favScore - that.favScore) < 1e-5 && - math.abs(logFavScore - that.logFavScore) < 1e-5 && - numFollowed == that.numFollowed && - numFaved == that.numFaved - case _ => false - } - } - } - - implicit object SrcClusterIntermediateInfoSemigroup - extends Semigroup[SrcClusterIntermediateInfo] { - override def plus( - left: SrcClusterIntermediateInfo, - right: SrcClusterIntermediateInfo - ): SrcClusterIntermediateInfo = { - SrcClusterIntermediateInfo( - followScore = left.followScore + right.followScore, - favScore = left.favScore + right.favScore, - logFavScore = left.logFavScore + right.logFavScore, - numFollowed = left.numFollowed + right.numFollowed, - numFaved = left.numFaved + right.numFaved - ) - } - } - - def run( - adjacencyLists: TypedPipe[UserAndNeighbors], - knownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])], - socialProofThreshold: Int, - maxClustersPerUser: Int, - knownForModelVersion: String - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - InterestedInFromKnownFor.keepOnlyTopClusters( - groupClusterScores( - userClusterPairs( - adjacencyLists, - knownFor, - socialProofThreshold - ) - ), - maxClustersPerUser, - knownForModelVersion - ) - } - - def userClusterPairs( - adjacencyLists: TypedPipe[UserAndNeighbors], - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - socialProofThreshold: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[((Long, Int), SrcClusterIntermediateInfo)] = { - val 
edgesToUsersWithKnownFor = Stat("num_edges_to_users_with_known_for") - val srcDestClusterTriples = Stat("num_src_dest_cluster_triples") - val srcClusterPairsBeforeSocialProofThresholding = - Stat("num_src_cluster_pairs_before_social_proof_thresholding") - val srcClusterPairsAfterSocialProofThresholding = - Stat("num_src_cluster_pairs_after_social_proof_thresholding") - - val edges = adjacencyLists.flatMap { - case UserAndNeighbors(srcId, neighborsWithWeights) => - neighborsWithWeights.map { neighborWithWeights => - ( - neighborWithWeights.neighborId, - neighborWithWeights.copy(neighborId = srcId) - ) - } - } - - implicit val l2b: Long => Array[Byte] = Injection.long2BigEndian - - edges - .sketch(4000) - .join(knownFor) - .flatMap { - case (destId, (srcWithWeights, clusterArray)) => - edgesToUsersWithKnownFor.inc() - clusterArray.toList.map { - case (clusterId, knownForScoreF) => - val knownForScore = math.max(0.0, knownForScoreF.toDouble) - - srcDestClusterTriples.inc() - val followScore = - if (srcWithWeights.isFollowed.contains(true)) knownForScore else 0.0 - val favScore = - srcWithWeights.favScoreHalfLife100Days.getOrElse(0.0) * knownForScore - val logFavScore = srcWithWeights.logFavScore.getOrElse(0.0) * knownForScore - val numFollowed = if (srcWithWeights.isFollowed.contains(true)) { - 1 - } else 0 - - val numFaved = if (srcWithWeights.favScoreHalfLife100Days.exists(_ > 0)) { - 1 - } else 0 - - ( - (srcWithWeights.neighborId, clusterId), - SrcClusterIntermediateInfo( - followScore, - favScore, - logFavScore, - numFollowed, - numFaved - ) - ) - } - } - .sumByKey - .withReducers(10000) - .filter { - case ((_, _), SrcClusterIntermediateInfo(_, _, _, numFollowed, numFaved)) => - srcClusterPairsBeforeSocialProofThresholding.inc() - // we donot remove duplicates - val socialProofSize = numFollowed + numFaved - val result = socialProofSize >= socialProofThreshold - if (result) { - srcClusterPairsAfterSocialProofThresholding.inc() - } - result - } - } - - def groupClusterScores( - intermediate: TypedPipe[((Long, Int), SrcClusterIntermediateInfo)] - )( - implicit uniqueId: UniqueID - ): TypedPipe[(Long, List[(Int, UserToInterestedInClusterScores)])] = { - - implicit val i2b: Int => Array[Byte] = Injection.int2BigEndian - - intermediate - .map { - case ( - (srcId, clusterId), - SrcClusterIntermediateInfo( - followScore, - favScore, - logFavScore, - numFollowed, - numFaved - )) => - ( - srcId, - List( - ( - clusterId, - UserToInterestedInClusterScores( - followScore = Some(ifNanMake0(followScore)), - favScore = Some(ifNanMake0(favScore)), - logFavScore = Some(ifNanMake0(logFavScore)), - numUsersBeingFollowed = Some(numFollowed), - numUsersThatWereFaved = Some(numFaved) - )) - ) - ) - } - .sumByKey - // .withReducers(1000) - .toTypedPipe - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromProducerEmbeddingsAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromProducerEmbeddingsAdhocApp.scala deleted file mode 100644 index d924dd693..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/InterestedInFromProducerEmbeddingsAdhocApp.scala +++ /dev/null @@ -1,290 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding.Execution -import com.twitter.scalding.TypedTsv -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import 
com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.ProducerEmbeddingSources -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.DataSources -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedInFromProducerEmbeddings20M145KUpdatedScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserAndNeighborsFixedPathSource -import com.twitter.simclusters_v2.hdfs_sources.UserUserNormalizedGraphScalaDataset -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore -import com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import scala.util.Random - -/** - * This file implements the job for computing users' interestedIn vector from the producerEmbeddings data set. - * - * It reads the UserUserNormalizedGraphScalaDataset to get user-user follow + fav graph, and then - * based on the producerEmbedding clusters of each followed/faved user, we calculate how much a user is - * interestedIn a cluster. To compute the engagement and determine the clusters for the user, we reuse - * the functions defined in InterestedInKnownFor. - * - * Using producerEmbeddings instead of knownFor to obtain interestedIn increases the coverage (especially - * for medium and light users) and also the density of the cluster embeddings for the user. 
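A condensed sketch (simplified types, hypothetical data) of how this job substitutes producer embeddings for knownFor weights before reusing InterestedInFromKnownFor.run: each producer's top SimClusters are pruned to the highest-scoring maxClustersPerProducer entries and reshaped into the same (clusterId, score) array layout that the knownFor dataset provides.

// Illustrative reshaping/pruning, mirroring getPrunedEmbeddings and the
// producerEmbeddingsWithScore map in the jobs below.
def toKnownForShape(
  producerEmbeddings: Map[Long, Seq[(Int, Double)]], // producerId -> (clusterId, score)
  maxClustersPerProducer: Int
): Map[Long, Array[(Int, Float)]] =
  producerEmbeddings.map {
    case (producerId, clusters) =>
      val pruned = clusters.sortBy { case (_, score) => -score }.take(maxClustersPerProducer)
      producerId -> pruned.map { case (clusterId, score) => (clusterId, score.toFloat) }.toArray
  }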
- */ -/** - * Adhoc job to generate the interestedIn from producer embeddings for the model version 20M145KUpdated - * - scalding remote run \ - --target src/scala/com/twitter/simclusters_v2/scalding:interested_in_from_producer_embeddings \ - --main-class com.twitter.simclusters_v2.scalding.InterestedInFromProducerEmbeddingsAdhocApp \ - --user cassowary --cluster bluebird-qus1 \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - -- \ - --outputDir /gcs/user/cassowary/adhoc/interested_in_from_prod_embeddings/ \ - --date 2020-08-25 --typedTsv true - */ -object InterestedInFromProducerEmbeddingsAdhocApp extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val outputDir = args("outputDir") - val inputGraph = args.optional("graphInputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .toTypedPipe - } - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersPerUserFinalResult = args.int("maxInterestedInClustersPerUser", 50) - val maxClustersFromProducer = args.int("maxClustersPerProducer", 25) - val typedTsvTag = args.boolean("typedTsv") - - val embeddingType = - EmbeddingType.ProducerFavBasedSemanticCoreEntity - val modelVersion = ModelVersions.Model20M145KUpdated - val producerEmbeddings = ProducerEmbeddingSources - .producerEmbeddingSourceLegacy(embeddingType, ModelVersions.toModelVersion(modelVersion))( - dateRange.embiggen(Days(7))) - - import InterestedInFromProducerEmbeddingsBatchApp._ - - val numProducerMappings = Stat("num_producer_embeddings_total") - val numProducersWithLargeClusterMappings = Stat( - "num_producers_with_more_clusters_than_threshold") - val numProducersWithSmallClusterMappings = Stat( - "num_producers_with_clusters_less_than_threshold") - val totalClustersCoverageProducerEmbeddings = Stat("num_clusters_total_producer_embeddings") - - val producerEmbeddingsWithScore = producerEmbeddings.map { - case (userId: Long, topSimClusters: TopSimClustersWithScore) => - ( - userId, - topSimClusters.topClusters.toArray - .map { - case (simCluster: SimClusterWithScore) => - (simCluster.clusterId, simCluster.score.toFloat) - } - ) - } - val producerEmbeddingsPruned = producerEmbeddingsWithScore.map { - case (producerId, clusterArray) => - numProducerMappings.inc() - val clusterSize = clusterArray.size - totalClustersCoverageProducerEmbeddings.incBy(clusterSize) - val prunedList = if (clusterSize > maxClustersFromProducer) { - numProducersWithLargeClusterMappings.inc() - clusterArray - .sortBy { - case (_, knownForScore) => -knownForScore - }.take(maxClustersFromProducer) - } else { - numProducersWithSmallClusterMappings.inc() - clusterArray - } - (producerId, prunedList) - } - - val result = InterestedInFromKnownFor - .run( - inputGraph, - producerEmbeddingsPruned, - socialProofThreshold, - maxClustersPerUserFinalResult, - modelVersion - ) - - val resultWithoutSocial = getInterestedInDiscardSocial(result) - - if (typedTsvTag) { - Util.printCounters( - resultWithoutSocial - .map { - case (userId: Long, clusters: ClustersUserIsInterestedIn) => - ( - userId, - clusters.clusterIdToScores.keys.toString() - ) - } - .writeExecution( - TypedTsv(outputDir) - ) - ) - } else { - Util.printCounters( - resultWithoutSocial - 
.writeExecution( - AdhocKeyValSources.interestedInSource(outputDir) - ) - ) - } - } -} - -/** - * Production job for computing interestedIn data set from the producer embeddings for the model version 20M145KUpdated. - * It writes the data set in KeyVal format to produce a MH DAL data set. - * - * To deploy the job: - * - * capesospy-v2 update --build_locally --start_cron - * --start_cron interested_in_from_producer_embeddings - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object InterestedInFromProducerEmbeddingsBatchApp extends ScheduledExecutionApp { - override val firstTime: RichDate = RichDate("2019-11-01") - - override val batchIncrement: Duration = Days(7) - - def getPrunedEmbeddings( - producerEmbeddings: TypedPipe[(Long, TopSimClustersWithScore)], - maxClustersFromProducer: Int - ): TypedPipe[(Long, TopSimClustersWithScore)] = { - producerEmbeddings.map { - case (producerId, producerClusters) => - val prunedProducerClusters = - producerClusters.topClusters - .sortBy { - case simCluster => -simCluster.score.toFloat - }.take(maxClustersFromProducer) - (producerId, TopSimClustersWithScore(prunedProducerClusters, producerClusters.modelVersion)) - } - } - - def getInterestedInDiscardSocial( - interestedInFromProducersResult: TypedPipe[(UserId, ClustersUserIsInterestedIn)] - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - interestedInFromProducersResult.map { - case (srcId, fullClusterList) => - val fullClusterListWithoutSocial = fullClusterList.clusterIdToScores.map { - case (clusterId, clusterDetails) => - val clusterDetailsWithoutSocial = UserToInterestedInClusterScores( - followScore = clusterDetails.followScore, - followScoreClusterNormalizedOnly = clusterDetails.followScoreClusterNormalizedOnly, - followScoreProducerNormalizedOnly = clusterDetails.followScoreProducerNormalizedOnly, - followScoreClusterAndProducerNormalized = - clusterDetails.followScoreClusterAndProducerNormalized, - favScore = clusterDetails.favScore, - favScoreClusterNormalizedOnly = clusterDetails.favScoreClusterNormalizedOnly, - favScoreProducerNormalizedOnly = clusterDetails.favScoreProducerNormalizedOnly, - favScoreClusterAndProducerNormalized = - clusterDetails.favScoreClusterAndProducerNormalized, - // Social proof is currently not being used anywhere else, hence being discarded to reduce space for this dataset - usersBeingFollowed = None, - usersThatWereFaved = None, - numUsersInterestedInThisClusterUpperBound = - clusterDetails.numUsersInterestedInThisClusterUpperBound, - logFavScore = clusterDetails.logFavScore, - logFavScoreClusterNormalizedOnly = clusterDetails.logFavScoreClusterNormalizedOnly, - // Counts of the social proof are maintained - numUsersBeingFollowed = Some(clusterDetails.usersBeingFollowed.getOrElse(Nil).size), - numUsersThatWereFaved = Some(clusterDetails.usersThatWereFaved.getOrElse(Nil).size) - ) - (clusterId, clusterDetailsWithoutSocial) - } - ( - srcId, - ClustersUserIsInterestedIn( - fullClusterList.knownForModelVersion, - fullClusterListWithoutSocial)) - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - //Input args for the run - val socialProofThreshold = args.int("socialProofThreshold", 2) - val maxClustersFromProducer = args.int("maxClustersPerProducer", 25) - val maxClustersPerUserFinalResult = args.int("maxInterestedInClustersPerUser", 50) - - //Path variables - val modelVersionUpdated = 
ModelVersions.toModelVersion(ModelVersions.Model20M145KUpdated) - val rootPath: String = s"/user/cassowary/manhattan_sequence_files" - val interestedInFromProducersPath = - rootPath + "/interested_in_from_producer_embeddings/" + modelVersionUpdated - - //Input adjacency list and producer embeddings - val userUserNormalGraph = - DataSources.userUserNormalizedGraphSource(dateRange.prepend(Days(7))).forceToDisk - val outputKVDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]] = - SimclustersV2InterestedInFromProducerEmbeddings20M145KUpdatedScalaDataset - val producerEmbeddings = ProducerEmbeddingSources - .producerEmbeddingSourceLegacy( - EmbeddingType.ProducerFavBasedSemanticCoreEntity, - modelVersionUpdated)(dateRange.embiggen(Days(7))) - - val producerEmbeddingsPruned = getPrunedEmbeddings(producerEmbeddings, maxClustersFromProducer) - val producerEmbeddingsWithScore = producerEmbeddingsPruned.map { - case (userId: Long, topSimClusters: TopSimClustersWithScore) => - ( - userId, - topSimClusters.topClusters.toArray - .map { - case (simCluster: SimClusterWithScore) => - (simCluster.clusterId, simCluster.score.toFloat) - } - ) - } - - val interestedInFromProducersResult = - InterestedInFromKnownFor.run( - userUserNormalGraph, - producerEmbeddingsWithScore, - socialProofThreshold, - maxClustersPerUserFinalResult, - modelVersionUpdated.toString - ) - - val interestedInFromProducersWithoutSocial = - getInterestedInDiscardSocial(interestedInFromProducersResult) - - val writeKeyValResultExec = interestedInFromProducersWithoutSocial - .map { case (userId, clusters) => KeyVal(userId, clusters) } - .writeDALVersionedKeyValExecution( - outputKVDataset, - D.Suffix(interestedInFromProducersPath) - ) - writeKeyValResultExec - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/KnownForSources.scala b/src/scala/com/twitter/simclusters_v2/scalding/KnownForSources.scala deleted file mode 100644 index 217f521ac..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/KnownForSources.scala +++ /dev/null @@ -1,275 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.logging.Logger -import com.twitter.scalding._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, ProcAtla} -import com.twitter.scalding_internal.job.analytics_batch.{ - AnalyticsBatchExecution, - AnalyticsBatchExecutionArgs, - BatchDescription, - BatchFirstTime, - BatchIncrement, - TwitterScheduledExecutionApp -} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.{ClustersUserIsKnownFor, UserToKnownForClusterScores} -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser -import java.util.TimeZone - -object KnownForSources { - implicit val tz: TimeZone = DateOps.UTC - implicit val parser: DateParser = DateParser.default - - def readDALDataset( - d: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]], - noOlderThan: Duration, - modelVersionToKeep: String - ): TypedPipe[(Long, Array[(Int, Float)])] = { - fromKeyVal( - DAL - 
.readMostRecentSnapshotNoOlderThan(d, noOlderThan) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe, - modelVersionToKeep - ) - } - - def fromKeyVal( - in: TypedPipe[KeyVal[Long, ClustersUserIsKnownFor]], - modelVersionToKeep: String - ): TypedPipe[(Long, Array[(Int, Float)])] = { - in.collect { - case KeyVal(userId, knownForClusters) - if knownForClusters.knownForModelVersion == modelVersionToKeep => - ( - userId, - knownForClusters.clusterIdToScores.toArray - .map { - case (clusterId, scores) => - (clusterId, scores.knownForScore.getOrElse(0.0).toFloat) - } - .sortBy(-_._2)) - } - } - - def toKeyVal( - in: TypedPipe[(Long, Array[(Int, Float)])], - modelVersion: String - ): TypedPipe[KeyVal[Long, ClustersUserIsKnownFor]] = { - in.map { - case (userId, clustersArray) => - val mappedClusters = clustersArray.map { - case (clusterId, score) => - (clusterId, UserToKnownForClusterScores(Some(score))) - }.toMap - KeyVal(userId, ClustersUserIsKnownFor(modelVersion, mappedClusters)) - } - } - - val knownFor_20M_Dec11_145K: TypedPipe[(Long, Array[(Int, Float)])] = readDALDataset( - SimclustersV2KnownFor20M145KDec11ScalaDataset, - Days(30), - ModelVersions.Model20M145KDec11 - ) - - val knownFor_20M_145K_updated: TypedPipe[(Long, Array[(Int, Float)])] = readDALDataset( - SimclustersV2KnownFor20M145KUpdatedScalaDataset, - Days(30), - ModelVersions.Model20M145KUpdated - ) - - val clusterToKnownFor_20M_Dec11_145K: TypedPipe[(Int, List[(Long, Float)])] = - transpose( - knownFor_20M_Dec11_145K - ) - - val clusterToKnownFor_20M_145K_updated: TypedPipe[(Int, List[(Long, Float)])] = - transpose( - knownFor_20M_145K_updated - ) - - private val log = Logger() - - def readKnownFor(textFile: String): TypedPipe[(Long, Array[(Int, Float)])] = { - TypedPipe - .from(TextLine(textFile)) - .flatMap { str => - if (!str.startsWith("#")) { - try { - val tokens = str.trim.split("\\s+") - val res = Array.newBuilder[(Int, Float)] - val userId = tokens(0).toLong - for (i <- 1 until tokens.length) { - val Array(cIdStr, scoreStr) = tokens(i).split(":") - val clusterId = cIdStr.toInt - val score = scoreStr.toFloat - val newEntry = (clusterId, score) - res += newEntry - } - val result = res.result - if (result.nonEmpty) { - Some((userId, res.result())) - } else None - } catch { - case ex: Throwable => - log.warning( - s"Error while loading knownFor from $textFile for line <$str>: " + - ex.getMessage - ) - None - } - } else None - } - } - - def stringifyKnownFor( - input: TypedPipe[(Long, Array[(Int, Float)])] - ): TypedPipe[(Long, String)] = { - input.mapValues { arr => - arr.map { case (clusterId, score) => "%d:%.2g".format(clusterId, score) }.mkString("\t") - } - } - - def writeKnownForTypedTsv( - input: TypedPipe[(Long, Array[(Int, Float)])], - outputDir: String - ): Execution[Unit] = { - stringifyKnownFor(input).writeExecution(TypedTsv(outputDir)) - } - - def makeKnownForTypedTsv( - input: TypedPipe[(Long, Array[(Int, Float)])], - outputDir: String - ): Execution[TypedPipe[(Long, Array[(Int, Float)])]] = { - Execution.getMode.flatMap { mode => - try { - val dest = TextLine(outputDir) - dest.validateTaps(mode) - Execution.from(KnownForSources.readKnownFor(outputDir)) - } catch { - case ivs: InvalidSourceException => - writeKnownForTypedTsv(input, outputDir).map { _ => input } - } - } - - } - - def transpose( - userToCluster: TypedPipe[(Long, Array[(Int, Float)])] - ): TypedPipe[(Int, List[(Long, Float)])] = { - userToCluster - .flatMap { - case (userId, clusterWeightPairs) => - clusterWeightPairs.map { - 
case (clusterId, weight) => - (clusterId, List(userId -> weight)) - } - } - .sumByKey - .toTypedPipe - } -} - -/** -capesospy-v2 update --build_locally --start_cron known_for_to_mh \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object KnownForToMHBatch extends TwitterScheduledExecutionApp { - - import KnownForSources._ - - /** - * A simple update function which updates the source by removing deactivated and suspended users. - * This will be eventually replaced by a regular cluster updating method. - */ - def updateKnownForSource( - knownForSource: TypedPipe[(Long, ClustersUserIsKnownFor)], - userSource: TypedPipe[FlatUser] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Long, ClustersUserIsKnownFor)] = { - val numValidUsers = Stat("num_valid_users") - val numInvalidUsers = Stat("num_invalid_users") - val numKnownForUsersLeft = Stat("num_known_for_users_left") - val numRemovedKnownForUsers = Stat("num_removed_known_for_users") - - val validUsers = - userSource.flatMap { - case flatUser - if !flatUser.deactivated.contains(true) && !flatUser.suspended - .contains(true) - && flatUser.id.nonEmpty => - numValidUsers.inc() - flatUser.id - case _ => - numInvalidUsers.inc() - None - } - - knownForSource.leftJoin(validUsers.asKeys).flatMap { - case (userId, (clustersWithScore, Some(_))) => - numKnownForUsersLeft.inc() - Some((userId, clustersWithScore)) - case _ => - numRemovedKnownForUsers.inc() - None - } - } - - // this should happen before InterestedInFromKnownForBatch - private val firstTime: String = "2019-03-22" - - private val batchIncrement: Duration = Days(7) - - private val outputPath: String = InternalDataPaths.RawKnownForDec11Path - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = - AnalyticsBatchExecution(execArgs) { implicit dateRange => - Execution.withId { implicit uniqueId => - val numKnownForUsers = Stat("num_known_for_users") - - val userSource = - DAL - .readMostRecentSnapshotNoOlderThan(UsersourceFlatScalaDataset, Days(7)) - .toTypedPipe - - val knownForData = DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2RawKnownFor20M145KDec11ScalaDataset, - Days(30)) - .toTypedPipe - .map { - case KeyVal(userId, knownForClusters) => - numKnownForUsers.inc() - (userId, knownForClusters) - } - - val result = updateKnownForSource(knownForData, userSource).map { - case (userId, knownForClusters) => - KeyVal(userId, knownForClusters) - } - - Util.printCounters( - result.writeDALVersionedKeyValExecution( - dataset = SimclustersV2RawKnownFor20M145KDec11ScalaDataset, - pathLayout = D.Suffix(outputPath) - ) - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/ProducerNormsAndCounts.scala b/src/scala/com/twitter/simclusters_v2/scalding/ProducerNormsAndCounts.scala deleted file mode 100644 index abaef09e8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/ProducerNormsAndCounts.scala +++ /dev/null @@ -1,195 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.logging.Logger -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, ProcAtla} -import com.twitter.scalding_internal.job.TwitterExecutionApp -import 
com.twitter.scalding_internal.job.analytics_batch._ -import com.twitter.simclusters_v2.hdfs_sources.{ - NormsAndCountsFixedPathSource, - ProducerNormsAndCountsScalaDataset -} -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.NormsAndCounts - -object ProducerNormsAndCounts { - - def getNormsAndCounts( - input: TypedPipe[Edge] - )( - implicit uniqueID: UniqueID - ): TypedPipe[NormsAndCounts] = { - val numRecordsInNormsAndCounts = Stat("num_records_in_norms_and_counts") - input - .map { - case Edge(srcId, destId, isFollowEdge, favWt) => - val followOrNot = if (isFollowEdge) 1 else 0 - ((srcId, destId), (followOrNot, favWt)) - } - .sumByKey - // Uncomment for adhoc job - //.withReducers(2500) - .map { - case ((srcId, destId), (followOrNot, favWt)) => - val favOrNot = if (favWt > 0) 1 else 0 - val logFavScore = if (favWt > 0) UserUserNormalizedGraph.logTransformation(favWt) else 0.0 - ( - destId, - ( - followOrNot, - favWt * favWt, - favOrNot, - favWt, - favWt * followOrNot.toDouble, - logFavScore * logFavScore, - logFavScore, - logFavScore * followOrNot.toDouble)) - } - .sumByKey - // Uncomment for adhoc job - //.withReducers(500) - .map { - case ( - id, - ( - followCount, - favSumSquare, - favCount, - favSumOnFavEdges, - favSumOnFollowEdges, - logFavSumSquare, - logFavSumOnFavEdges, - logFavSumOnFollowEdges)) => - val followerNorm = math.sqrt(followCount) - val faverNorm = math.sqrt(favSumSquare) - numRecordsInNormsAndCounts.inc() - NormsAndCounts( - userId = id, - followerL2Norm = Some(followerNorm), - faverL2Norm = Some(faverNorm), - followerCount = Some(followCount), - faverCount = Some(favCount), - favWeightsOnFavEdgesSum = Some(favSumOnFavEdges), - favWeightsOnFollowEdgesSum = Some(favSumOnFollowEdges), - logFavL2Norm = Some(math.sqrt(logFavSumSquare)), - logFavWeightsOnFavEdgesSum = Some(logFavSumOnFavEdges), - logFavWeightsOnFollowEdgesSum = Some(logFavSumOnFollowEdges) - ) - } - } - - def run( - halfLifeInDaysForFavScore: Int - )( - implicit uniqueID: UniqueID, - date: DateRange - ): TypedPipe[NormsAndCounts] = { - val input = - UserUserNormalizedGraph.getFollowEdges.map { - case (src, dest) => - Edge(src, dest, isFollowEdge = true, 0.0) - } ++ UserUserNormalizedGraph.getFavEdges(halfLifeInDaysForFavScore).map { - case (src, dest, wt) => - Edge(src, dest, isFollowEdge = false, wt) - } - getNormsAndCounts(input) - } -} - -object ProducerNormsAndCountsBatch extends TwitterScheduledExecutionApp { - private val firstTime: String = "2018-06-16" - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - private val batchIncrement: Duration = Days(7) - private val firstStartDate = DateRange.parse(firstTime).start - private val halfLifeInDaysForFavScore = 100 - - private val outputPath: String = "/user/cassowary/processed/producer_norms_and_counts" - private val log = Logger() - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - Util.printCounters( - ProducerNormsAndCounts - .run(halfLifeInDaysForFavScore) - .writeDALSnapshotExecution( - ProducerNormsAndCountsScalaDataset, - 
D.Daily, - D.Suffix(outputPath), - D.EBLzo(), - dateRange.end) - ) - } - } - } -} - -object ProducerNormsAndCountsAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - implicit val date = DateRange.parse(args.list("date")) - - Util.printCounters( - ProducerNormsAndCounts - .run(halfLifeInDaysForFavScore = 100) - .forceToDiskExecution.flatMap { result => - Execution.zip( - result.writeExecution(NormsAndCountsFixedPathSource(args("outputDir"))), - result.printSummary("Producer norms and counts") - ) - } - ) - } - } -} - -object DumpNormsAndCountsAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - - val users = args.list("users").map(_.toLong).toSet - val input = args.optional("inputDir") match { - case Some(inputDir) => TypedPipe.from(NormsAndCountsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(ProducerNormsAndCountsScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - if (users.isEmpty) { - input.printSummary("Producer norms and counts") - } else { - input - .collect { - case rec if users.contains(rec.userId) => - Util.prettyJsonMapper.writeValueAsString(rec).replaceAll("\n", " ") - } - .toIterableExecution - .map { strings => println(strings.mkString("\n")) } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/TopUsersSimilarityGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/TopUsersSimilarityGraph.scala deleted file mode 100644 index d93bd73ee..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/TopUsersSimilarityGraph.scala +++ /dev/null @@ -1,996 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.Max -import com.twitter.algebird.Monoid -import com.twitter.bijection.scrooge.BinaryScalaCodec -import com.twitter.hermit.candidate.thriftscala.Candidate -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.logging.Logger -import com.twitter.pluck.source.cassowary.FollowingsCosineSimilaritiesManhattanSource -import com.twitter.sbf.core.AlgorithmConfig -import com.twitter.sbf.core.MHAlgorithm -import com.twitter.sbf.core.PredictionStat -import com.twitter.sbf.core.SparseBinaryMatrix -import com.twitter.sbf.core.SparseRealMatrix -import com.twitter.sbf.graph.Graph -import com.twitter.scalding._ -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser -import com.twitter.wtf.scalding.sims.thriftscala.SimilarUserPair -import java.io.PrintWriter -import java.text.DecimalFormat -import java.util -import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.FileSystem -import org.apache.hadoop.fs.Path -import scala.collection.JavaConverters._ - -case class TopUser(id: Long, 
activeFollowerCount: Int, screenName: String) - -case class TopUserWithMappedId(topUser: TopUser, mappedId: Int) - -case class AdjList(sourceId: Long, neighbors: List[(Long, Float)]) - -object TopUsersSimilarityGraph { - val log = Logger() - - def topUsers( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int - ): TypedPipe[TopUser] = { - userSourcePipe - .collect { - case f: FlatUser - if f.activeFollowers.exists(_ >= minActiveFollowers) - && f.followers.isDefined && f.id.isDefined && f.screenName.isDefined - && !f.deactivated.contains(true) && !f.suspended.contains(true) => - TopUser(f.id.get, f.activeFollowers.get.toInt, f.screenName.get) - } - .groupAll - .sortedReverseTake(topK)(Ordering.by(_.activeFollowerCount)) - .values - .flatten - } - - /** - * This function returns the top most followed userIds truncated to topK - * Offers the same functionality as TopUsersSimilarityGraph.topUsers but more efficient - * as we donot store screennames while grouping and sorting the users - */ - def topUserIds( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int - ): TypedPipe[Long] = { - userSourcePipe - .collect { - case f: FlatUser - if f.activeFollowers.exists(_ >= minActiveFollowers) - && f.followers.isDefined && f.id.isDefined && f.screenName.isDefined - && !f.deactivated.contains(true) && !f.suspended.contains(true) => - (f.id.get, f.activeFollowers.get) - } - .groupAll - .sortedReverseTake(topK)(Ordering.by(_._2)) - .values - .flatten - .keys - } - - def topUsersWithMappedIds( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int - ): TypedPipe[TopUserWithMappedId] = { - userSourcePipe - .collect { - case f: FlatUser - if f.activeFollowers.exists(_ >= minActiveFollowers) - && f.followers.isDefined && f.id.isDefined && f.screenName.isDefined - && !f.deactivated.contains(true) && !f.suspended.contains(true) => - TopUser(f.id.get, f.activeFollowers.get.toInt, f.screenName.get) - } - .groupAll - .mapGroup { - case (_, topUserIter) => - topUserIter.zipWithIndex.map { - case (topUser, id) => - TopUserWithMappedId(topUser, id) - } - } - .values - } - - def topUsersWithMappedIdsTopK( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int - ): TypedPipe[TopUserWithMappedId] = { - userSourcePipe - .collect { - case f: FlatUser - if f.activeFollowers.exists(_ >= minActiveFollowers) - && f.followers.isDefined && f.id.isDefined && f.screenName.isDefined - && !f.deactivated.contains(true) && !f.suspended.contains(true) => - TopUser(f.id.get, f.activeFollowers.get.toInt, f.screenName.get) - } - .groupAll - .sortedReverseTake(topK)(Ordering.by(_.activeFollowerCount)) - .map { - case (_, topUserIter) => - topUserIter.zipWithIndex.map { - case (topUser, id) => - TopUserWithMappedId(topUser, id) - } - } - .flatten - } - - /** - * This function returns the top most followed and verified userIds truncated to topK - */ - def vits( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int - ): TypedPipe[Long] = { - userSourcePipe - .collect { - case f: FlatUser - if f.verified.contains(true) && f.id.isDefined && - f.screenName.isDefined && !f.deactivated.contains(true) && !f.suspended.contains( - true) && - f.activeFollowers.exists(_ >= minActiveFollowers) => - (f.id.get, f.activeFollowers.get) - } - .groupAll - .sortedReverseTake(topK)(Ordering.by(_._2)) - .values - .flatten - .keys - } - - def topUsersInMemory( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int - ): 
Execution[List[TopUserWithMappedId]] = { - log.info(s"Will fetch top $topK users with at least $minActiveFollowers many active followers") - topUsers(userSourcePipe, minActiveFollowers, topK).toIterableExecution - .map { idFollowersList => - idFollowersList.toList.sortBy(_.id).zipWithIndex.map { - case (topuser, index) => - TopUserWithMappedId(topuser, index) - } - } - } - - def addSelfLoop( - input: TypedPipe[(Long, Map[Long, Float])], - maxToSelfLoopWeight: Float => Float - ): TypedPipe[(Long, Map[Long, Float])] = { - input - .map { - case (nodeId, neighborMap) if neighborMap.nonEmpty => - val maxEntry = neighborMap.values.max - val selfLoopWeight = maxToSelfLoopWeight(maxEntry) - (nodeId, neighborMap ++ Map(nodeId -> selfLoopWeight)) - case (nodeId, emptyMap) => - (nodeId, emptyMap) - } - } - - def makeGraph( - backfillPipe: TypedPipe[(Long, Map[Long, Float])], - dirToReadFromOrSaveTo: String - ): Execution[TypedPipe[(Long, Map[Long, Float])]] = { - backfillPipe - .map { - case (nodeId, nbrMap) => - val cands = nbrMap.toList.map { case (nId, wt) => Candidate(nId, wt) } - Candidates(nodeId, candidates = cands) - } - .make(new FixedPathLzoScrooge(dirToReadFromOrSaveTo, Candidates)) - .map { tp => - tp.map { - case Candidates(nodeId, cands) => - (nodeId, cands.map { case Candidate(nId, wt, _) => (nId, wt.toFloat) }.toMap) - } - } - } - - def getSubgraphFromUserGroupedInput( - fullGraph: TypedPipe[Candidates], - usersToInclude: TypedPipe[Long], - maxNeighborsPerNode: Int, - degreeThresholdForStat: Int - )( - implicit uniqId: UniqueID - ): TypedPipe[(Long, Map[Long, Float])] = { - val numUsersWithZeroEdges = Stat("num_users_with_zero_edges") - val numUsersWithSmallDegree = Stat("num_users_with_degree_lt_" + degreeThresholdForStat) - val numUsersWithEnoughDegree = Stat("num_users_with_degree_gte_" + degreeThresholdForStat) - - fullGraph - .map { cands => - ( - cands.userId, - // These candidates are already sorted, but leaving it in just in case the behavior changes upstream - cands.candidates - .map { c => (c.userId, c.score) }.sortBy(-_._2).take(maxNeighborsPerNode).toMap - ) - } - .rightJoin(usersToInclude.asKeys) - // uncomment for adhoc job - //.withReducers(110) - .mapValues(_._1) // discard the Unit - .toTypedPipe - .count("num_sims_records_from_top_users") - .flatMap { - case (nodeId, Some(neighborMap)) => - neighborMap.flatMap { - case (neighborId, edgeWt) => - List( - (nodeId, Map(neighborId -> Max(edgeWt.toFloat))), - (neighborId, Map(nodeId -> Max(edgeWt.toFloat))) - ) - } - case (nodeId, None) => List((nodeId, Map.empty[Long, Max[Float]])) - } - .sumByKey - // uncomment for adhoc job - //.withReducers(150) - .toTypedPipe - .mapValues(_.mapValues(_.get)) // get the max for each value in each map - .count("num_sims_records_after_symmetrization_before_keeping_only_top_users") - .join(usersToInclude.asKeys) // only keep records for top users - // uncomment for adhoc job - //.withReducers(100) - .mapValues(_._1) - .toTypedPipe - .map { - case (nodeId, neighborsMap) => - if (neighborsMap.nonEmpty) { - if (neighborsMap.size < degreeThresholdForStat) { - numUsersWithSmallDegree.inc() - } else { - numUsersWithEnoughDegree.inc() - } - } else { - numUsersWithZeroEdges.inc() - } - (nodeId, neighborsMap) - } - .count("num_sims_records_after_symmetrization_only_top_users") - } - - def getSubgraphFromUserGroupedInput( - fullGraph: TypedPipe[Candidates], - usersToInclude: Set[Long], - maxNeighborsPerNode: Int - )( - implicit uniqId: UniqueID - ): TypedPipe[(Long, Map[Long, Float])] = { - val 
numUsersWithZeroEdges = Stat("num_users_with_zero_edges") - val numUsersWithDegreeLessThan10 = Stat("num_users_with_degree_less_than_10") - - val (intIdsToIncludeSorted: Array[Int], longIdsToIncludeSorted: Array[Long]) = - setToSortedArrays(usersToInclude) - log.info("Size of intArray " + intIdsToIncludeSorted.length) - log.info("Size of longArray " + longIdsToIncludeSorted.length) - - fullGraph - .collect { - case candidates - if isIdInIntOrLongArray( - candidates.userId, - intIdsToIncludeSorted, - longIdsToIncludeSorted) => - val sourceId = candidates.userId - val toKeep = candidates.candidates.collect { - case neighbor - if isIdInIntOrLongArray( - neighbor.userId, - intIdsToIncludeSorted, - longIdsToIncludeSorted) => - (neighbor.userId, neighbor.score.toFloat) - }.toList - - val toKeepLength = toKeep.size - if (toKeep.isEmpty) { - numUsersWithZeroEdges.inc() - } else if (toKeepLength < 10) { - numUsersWithDegreeLessThan10.inc() - } - - val knn = if (toKeepLength > maxNeighborsPerNode) { - toKeep.sortBy(_._2).takeRight(maxNeighborsPerNode) - } else toKeep - - knn.flatMap { - case (nbrId, wt) => - List( - (sourceId, Map(nbrId -> Max(wt))), - (nbrId, Map(sourceId -> Max(wt))) - ) - } - } - .flatten - .sumByKey - .toTypedPipe - .mapValues(_.mapValues(_.get)) // get the max for each value in each map - } - - def getInMemorySubgraphFromUserGroupedInput( - fullGraph: TypedPipe[Candidates], - usersToInclude: Set[Long], - maxNeighborsPerNode: Int - )( - implicit uniqId: UniqueID - ): Execution[Iterable[AdjList]] = { - getSubgraphFromUserGroupedInput(fullGraph, usersToInclude, maxNeighborsPerNode).map { - case (sourceId, weightedNeighbors) => - AdjList( - sourceId, - weightedNeighbors.toList.sortBy(_._1) - ) - }.toIterableExecution - } - - def isIdInIntOrLongArray( - id: Long, - intArraySorted: Array[Int], - longArraySorted: Array[Long] - ): Boolean = { - if (id < Integer.MAX_VALUE) { - util.Arrays.binarySearch(intArraySorted, id.toInt) >= 0 - } else { - util.Arrays.binarySearch(longArraySorted, id.toLong) >= 0 - } - } - - /** - * Creates two sorted arrays out of a set, one with ints and one with longs. - * Sorted arrays are only slightly more expensive to search in, but empirically I've found - * that the MapReduce job runs more reliably using them than using Set directly. 
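A tiny usage sketch (hypothetical ids) of the two membership helpers around this comment: the include-set is split once into a sorted int array and a sorted long array, and each lookup binary-searches whichever array the id fits into.

// Hypothetical usage of setToSortedArrays and isIdInIntOrLongArray.
val ids: Set[Long] = Set(42L, 7L, 5000000000L) // the last id exceeds Integer.MAX_VALUE
val (intIds, longIds) = TopUsersSimilarityGraph.setToSortedArrays(ids)
// intIds == Array(7, 42); longIds == Array(5000000000L)
TopUsersSimilarityGraph.isIdInIntOrLongArray(42L, intIds, longIds)         // true
TopUsersSimilarityGraph.isIdInIntOrLongArray(5000000000L, intIds, longIds) // true
TopUsersSimilarityGraph.isIdInIntOrLongArray(8L, intIds, longIds)          // false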
- * - * @param inSet - * - * @return - */ - def setToSortedArrays(inSet: Set[Long]): (Array[Int], Array[Long]) = { - val (intArrayUnconvertedSorted, longArraySorted) = - inSet.toArray.sorted.partition { l => l < Integer.MAX_VALUE } - (intArrayUnconvertedSorted.map(_.toInt), longArraySorted) - } - - def getInMemorySubgraph( - fullGraph: TypedPipe[SimilarUserPair], - usersToInclude: Set[Long], - maxNeighborsPerNode: Int - )( - implicit uniqId: UniqueID - ): Execution[Iterable[AdjList]] = { - val numValidEdges = Stat("num_valid_edges") - val numInvalidEdges = Stat("num_invalid_edges") - - val (intIdsToIncludeSorted: Array[Int], longIdsToIncludeSorted: Array[Long]) = - setToSortedArrays(usersToInclude) - log.info("Size of intArray " + intIdsToIncludeSorted.length) - log.info("Size of longArray " + longIdsToIncludeSorted.length) - - fullGraph - .filter { edge => - val res = - isIdInIntOrLongArray(edge.sourceId, intIdsToIncludeSorted, longIdsToIncludeSorted) && - isIdInIntOrLongArray(edge.destinationId, intIdsToIncludeSorted, longIdsToIncludeSorted) - if (res) { - numValidEdges.inc() - } else { - numInvalidEdges.inc() - } - res - } - .map { edge => (edge.sourceId, (edge.destinationId, edge.cosineScore.toFloat)) } - .group - .sortedReverseTake(maxNeighborsPerNode)(Ordering.by(_._2)) - .toTypedPipe - .flatMap { - case (sourceId, weightedNeighbors) => - weightedNeighbors.flatMap { - case (destId, wt) => - /* - By default, a k-nearest neighbor graph need not be symmetric, since if u is in v's - k nearest neighbors, that doesn't guarantee that v is in u's. - This step adds edges in both directions, but having a Map ensures that each neighbor - only appears once and not twice. Using Max() operator from Algebird, we take the max - weight of (u, v) and (v, u) - it is expected that the two will be pretty much the same. 
- - Example illustrating how Map and Max work together: - Map(1 -> Max(2)) + Map(1 -> Max(3)) = Map(1 -> Max(3)) - */ - List( - (sourceId, Map(destId -> Max(wt))), - (destId, Map(sourceId -> Max(wt))) - ) - } - } - .sumByKey - .map { - case (sourceId, weightedNeighbors) => - AdjList( - sourceId, - weightedNeighbors.toList.map { case (id, maxWt) => (id, maxWt.get) }.sortBy(_._1) - ) - } - .toIterableExecution - } - - def convertIterableToGraph( - adjList: Iterable[AdjList], - verticesMapping: Map[Long, Int], - wtExponent: Float - ): Graph = { - val n = verticesMapping.size - val neighbors: Array[Array[Int]] = new Array[Array[Int]](n) - val wts: Array[Array[Float]] = new Array[Array[Float]](n) - - var numEdges = 0L - var numVertices = 0 - - val iter = adjList.iterator - val verticesWithAtleastOneEdgeBuilder = Set.newBuilder[Long] - - while (iter.hasNext) { - val AdjList(originalId, wtedNeighbors) = iter.next() - val wtedNeighborsSize = wtedNeighbors.size - val newId = verticesMapping(originalId) // throw exception if originalId not in map - if (newId < 0 || newId >= n) { - throw new IllegalStateException( - s"$originalId has been mapped to $newId, which is outside" + - s"the expected range [0, " + (n - 1) + "]") - } - verticesWithAtleastOneEdgeBuilder += originalId - neighbors(newId) = new Array[Int](wtedNeighborsSize) - wts(newId) = new Array[Float](wtedNeighborsSize) - wtedNeighbors.zipWithIndex.foreach { - case ((nbrId, wt), index) => - neighbors(newId)(index) = verticesMapping(nbrId) - wts(newId)(index) = wt - numEdges += 1 - } - - if (math.abs(wtExponent - 1.0) > 1e-5) { - var maxWt = Float.MinValue - for (index <- wts(newId).indices) { - wts(newId)(index) = math.pow(wts(newId)(index), wtExponent).toFloat - if (wts(newId)(index) > maxWt) { - maxWt = wts(newId)(index) - } - } - } - numVertices += 1 - if (numVertices % 100000 == 0) { - log.info(s"Done with $numVertices many vertices.") - } - } - - val verticesWithAtleastOneEdge = verticesWithAtleastOneEdgeBuilder.result() - val verticesWithZeroEdges = verticesMapping.keySet.diff(verticesWithAtleastOneEdge) - - verticesWithZeroEdges.foreach { originalId => - neighbors(verticesMapping(originalId)) = new Array[Int](0) - wts(verticesMapping(originalId)) = new Array[Float](0) - } - - log.info("Number of vertices with zero edges " + verticesWithZeroEdges.size) - log.info("Number of edges " + numEdges) - if (verticesWithZeroEdges.nonEmpty) { - log.info("The vertices with zero edges: " + verticesWithZeroEdges.mkString(",")) - } - - new Graph(n, numEdges / 2, neighbors, wts) - } - - def run( - userSourcePipe: TypedPipe[FlatUser], - minActiveFollowers: Int, - topK: Int, - getSubgraphFn: Set[Long] => Execution[Iterable[AdjList]], - wtExponent: Float - )( - implicit id: UniqueID - ): Execution[(List[TopUserWithMappedId], Graph)] = { - topUsersInMemory( - userSourcePipe, - minActiveFollowers, - topK - ).flatMap { topUsers => - val idMap = topUsers.map { topUser => (topUser.topUser.id, topUser.mappedId) }.toMap - - log.info("Got idMap with " + idMap.size + " entries.") - getSubgraphFn(idMap.keySet) - .map { iterableAdjLists => - log.info("Going to convert iterable to graph") - val tic = System.currentTimeMillis() - val graph = convertIterableToGraph( - iterableAdjLists, - idMap, - wtExponent - ) - val toc = System.currentTimeMillis() - val seconds = (toc - tic) * 1.0 / 1e6 - log.info("Took %.2f seconds to convert iterable to graph".format(seconds)) - (topUsers, graph) - } - } - } - - def runUsingJoin( - mappedUsers: TypedPipe[(Long, Int)], - allEdges: 
TypedPipe[Candidates], - maxNeighborsPerNode: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Int, String)] = { - val numEdgesAfterFirstJoin = Stat("num_edges_after_first_join") - val numEdgesAfterSecondJoin = Stat("num_edges_after_second_join") - val numEdgesLostTopKTruncated = Stat("num_edges_lost_topk_truncated") - val finalNumEdges = Stat("final_num_edges") - - allEdges - .map { cs => (cs.userId, cs.candidates) } - .join(mappedUsers) - .withReducers(6000) - .flatMap { - case (id, (neighbors, mappedId)) => - val before = neighbors.size - val topKNeighbors = neighbors.sortBy(-_.score).take(maxNeighborsPerNode) - val after = topKNeighbors.size - numEdgesLostTopKTruncated.incBy(before - after) - topKNeighbors.map { candidate => - numEdgesAfterFirstJoin.inc() - (candidate.userId, (mappedId, candidate.score.toFloat)) - } - } - .join(mappedUsers) - .withReducers(9000) - .flatMap { - case (id, ((mappedNeighborId, score), mappedId)) => - numEdgesAfterSecondJoin.inc() - List( - (mappedId, Map(mappedNeighborId -> Max(score))), - (mappedNeighborId, Map(mappedId -> Max(score))) - ) - } - .sumByKey - .withReducers(9100) - .map { - case (id, nbrMap) => - val sorted = nbrMap.mapValues(_.get).toList.sortBy(-_._2) - finalNumEdges.incBy(sorted.size) - val str = sorted.map { case (nbrId, wt) => "%d %.2f".format(nbrId, wt) }.mkString(" ") - (id, str) - } - - } - - def writeToHDFSFile(lines: Iterator[String], conf: Configuration, outputFile: String): Unit = { - val fs = FileSystem.newInstance(conf) - val outputStream = fs.create(new Path(outputFile)) - log.info("Will write to " + outputFile) - var numLines = 0 - val tic = System.currentTimeMillis() - try { - val writer = new PrintWriter(outputStream) - while (lines.hasNext) { - writer.println(lines.next()) - numLines += 1 - if (numLines % 1000000 == 0) { - log.info(s"Done writing $numLines lines") - } - } - writer.flush() - writer.close() - } finally { - outputStream.close() - } - val toc = System.currentTimeMillis() - val seconds = (toc - tic) * 1.0 / 1e6 - log.info( - "Finished writing %d lines to %s. Took %.2f seconds".format(numLines, outputFile, seconds)) - } - - def writeToHDFSIfHDFS(lines: Iterator[String], mode: Mode, outputFile: String): Unit = { - mode match { - case Hdfs(_, conf) => - writeToHDFSFile(lines, conf, outputFile) - case _ => () - } - } - - def writeTopUsers(topUsers: List[TopUserWithMappedId], mode: Mode, outputFile: String): Unit = { - val topUsersLines = - topUsers.map { topUser => - // Add 1 to mappedId so as to get 1-indexed ids, which are friendlier to humans. 
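- // Each written line is tab-separated as: id, 1-indexed mappedId, screenName,
- // activeFollowerCount, e.g. (hypothetical values) "12\t1\tjack\t4200".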
- List( - topUser.topUser.id, - topUser.mappedId + 1, - topUser.topUser.screenName, - topUser.topUser.activeFollowerCount - ).mkString("\t") - }.iterator - writeToHDFSIfHDFS(topUsersLines, mode, outputFile) - } - - def readSimsInput(isKeyValSource: Boolean, inputDir: String): TypedPipe[Candidates] = { - if (isKeyValSource) { - log.info("Will treat " + inputDir + " as SequenceFiles input") - val rawInput = FollowingsCosineSimilaritiesManhattanSource(path = inputDir) - TypedPipe.from(rawInput).map(_._2) - } else { - log.info("Will treat " + inputDir + " as LzoScrooge input") - TypedPipe.from(new FixedPathLzoScrooge(inputDir, Candidates)) - } - } -} - -/** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:top_users_only && \ - * oscar hdfs --hadoop-client-memory 120000 --user cassowary --host atla-aor-08-sr1 \ - * --bundle top_users_only --tool com.twitter.simclusters_v2.scalding.ClusterHdfsGraphApp \ - * --screen --screen-detached --tee ldap_logs/SBFOnSubGraphOf100MTopusersWithMappedIds_120GB_RAM \ - * -- --inputDir adhoc/ldap_subgraphOf100MTopUsersWithMappedIds --numNodesPerCommunity 200 \ - * --outputDir adhoc/ldap_SBFOnSubGraphOf100MTopusersWithMappedIds_k500K_120GB_RAM --assumedNumberOfNodes 100200000 - */ -object ClusterHdfsGraphApp extends TwitterExecutionApp { - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val inputDir = args("inputDir") - val numNodesPerCommunity = args.int("numNodesPerCommunity", 200) - val outputDir = args("outputDir") - val assumedNumberOfNodes = args.int("assumedNumberOfNodes") - //val useEdgeWeights = args.boolean("useEdgeWeights") - - val input = TypedPipe.from(TypedTsv[(Int, String)](inputDir)).map { - case (id, nbrStr) => - val nbrsWithWeights = nbrStr.split(" ") - val nbrsArray = nbrsWithWeights.zipWithIndex - .collect { - case (str, index) if index % 2 == 0 => - str.toInt - } - (id, nbrsArray.sorted) - } - - println("Gonna assume total number of nodes is " + assumedNumberOfNodes) - - input.toIterableExecution.flatMap { adjListsIter => - val nbrs: Array[Array[Int]] = new Array[Array[Int]](assumedNumberOfNodes) - var numEdges = 0L - var numVertices = 0 - var maxVertexId = 0 - - val tic = System.currentTimeMillis - adjListsIter.foreach { - case (id, nbrArray) => - if (id >= assumedNumberOfNodes) { - throw new IllegalStateException( - s"Yikes! Entry with id $id, >= assumedNumberOfNodes") - } - nbrs(id) = nbrArray - if (id > maxVertexId) { - maxVertexId = id - } - numEdges += nbrArray.length - numVertices += 1 - if (numVertices % 100000 == 0) { - println(s"Done loading $numVertices many vertices. 
Edges so far: $numEdges") - } - } - (0 until assumedNumberOfNodes).foreach { i => - if (nbrs(i) == null) { - nbrs(i) = Array[Int]() - } - } - val toc = System.currentTimeMillis() - println( - "maxVertexId is " + maxVertexId + ", assumedNumberOfNodes is " + assumedNumberOfNodes) - println( - s"Done loading graph with $assumedNumberOfNodes nodes and $numEdges edges (counting each edge twice)") - println("Number of nodes with at least neighbor is " + numVertices) - println("Time to load the graph " + (toc - tic) / 1000.0 / 60.0 + " minutes") - - val graph = new Graph(assumedNumberOfNodes, numEdges / 2, nbrs, null) - val k = assumedNumberOfNodes / numNodesPerCommunity - println("Will set number of communities to " + k) - val algoConfig = new AlgorithmConfig() - .withCpu(16).withK(k) - .withWtCoeff(10.0).withMaxEpoch(5) - var z = new SparseBinaryMatrix(assumedNumberOfNodes, k) - val err = new PrintWriter(System.err) - - println("Going to initalize from random neighborhoods") - z.initFromBestNeighborhoods( - graph, - (gr: Graph, i: Integer) => algoConfig.rng.nextDouble, - false, - err) - println("Done initializing from random neighborhoods") - - val prec0 = MHAlgorithm.clusterPrecision(graph, z, 0, 1000, algoConfig.rng) - println("Precision of cluster 0:" + prec0.precision) - val prec1 = MHAlgorithm.clusterPrecision(graph, z, 1, 1000, algoConfig.rng) - println("Precision of cluster 1:" + prec1.precision) - println( - "Fraction of empty rows after initializing from random neighborhoods: " + z.emptyRowProportion) - - val tic2 = System.currentTimeMillis - val algo = new MHAlgorithm(algoConfig, graph, z, err) - val optimizedZ = algo.optimize - val toc2 = System.currentTimeMillis - println("Time to optimize: %.2f seconds\n".format((toc2 - tic2) / 1000.0)) - println("Time to initialize & optimize: %.2f seconds\n".format((toc2 - toc) / 1000.0)) - - val srm = MHAlgorithm.heuristicallyScoreClusterAssignments(graph, optimizedZ) - val outputIter = (0 to srm.getNumRows).map { rowId => - val rowWithIndices = srm.getColIdsForRow(rowId) - val rowWithScores = srm.getValuesForRow(rowId) - val str = rowWithIndices - .zip(rowWithScores).map { - case (colId, score) => - "%d:%.2g".format(colId + 1, score) - }.mkString(" ") - "%d %s".format(rowId, str) - } - - TypedPipe.from(outputIter).writeExecution(TypedTsv(outputDir)) - } - } - } -} - -/** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:top_users_only && \ - * oscar hdfs --hadoop-client-memory 60000 --user cassowary --host atla-aor-08-sr1 \ - * --bundle top_users_only --tool com.twitter.simclusters_v2.scalding.ScalableTopUsersSimilarityGraphApp \ - * --screen --screen-detached --tee ldap_logs/SubGraphOf100MTopusersWithMappedIds \ - * -- --mappedUsersDir adhoc/ldap_top100M_mappedUsers \ - * --inputDir adhoc/ldap_approximate_cosine_similarity_follow \ - * --outputDir adhoc/ldap_subgraphOf100MTopUsersWithMappedIds_correct_topK - */ -object ScalableTopUsersSimilarityGraphApp extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val inputDir = args("inputDir") - val mappedUsersDir = args("mappedUsersDir") - val maxNeighbors = args.int("maxNeighbors", 100) - val outputDir = args("outputDir") - - val mappedUsers = TypedPipe - .from(TypedTsv[(Long, Int, String, Int)](mappedUsersDir)) - .map { - case (id, _, 
_, mappedId) => - (id, mappedId) - } - .shard(200) - - val sims = TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource(path = inputDir)) - .map(_._2) - - TopUsersSimilarityGraph - .runUsingJoin( - mappedUsers, - sims, - maxNeighbors - ).writeExecution(TypedTsv(args("outputDir"))) - } - } -} - -/** - * Scalding app using Executions that does the following: - * - * 1. Get the top N most followed users on Twitter - * (also maps them to ids 1 -> N in int space for easier processing) - * 2. For each user from the step above, get the top K most similar users for this user from the - * list of N users from the step above. - * 3. Construct an undirected graph by setting an edge between (u, v) if - * either v is in u's top-K similar users list, or u is in v's top-K similar user's list. - * 4. The weight for the (u, v) edge is set to be the cosine similarity between u and v's - * follower lists, raised to some exponent > 1. - * This last step is a heuristic reweighting procedure to give more importance to edges involving - * more similar users. - * 5. Write the above graph to HDFS in Metis format, - * i.e. one line per node, with the line for each node specifying the list of neighbors along - * with their weights. The first line specifies the number of nodes and the number of edges. - * - * I've tested this Scalding job for values of topK upto 20M. - * - * Example invocation: - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:top_users_similarity_graph && \ - * oscar hdfs --hadoop-client-memory 60000 --host atla-amw-03-sr1 --bundle top_users_similarity_graph \ - * --tool com.twitter.simclusters_v2.scalding.TopUsersSimilarityGraphApp \ - * --hadoop-properties "elephantbird.use.combine.input.format=true;elephantbird.combine.split.size=468435456;mapred.min.split.size=468435456;mapreduce.reduce.memory.mb=5096;mapreduce.reduce.java.opts=-Xmx4400m" \ - * --screen --screen-detached --tee logs/20MSubGraphExecution -- --date 2017-10-24 \ - * --minActiveFollowers 300 --topK 20000000 \ - * --inputUserGroupedDir /user/cassowary/manhattan_sequence_files/approximate_cosine_similarity_follow/ \ - * --groupedInputInSequenceFiles \ - * --maxNeighborsPerNode 100 --wtExponent 2 \ - * --outputTopUsersDir /user/your_ldap/simclusters_graph_prep_q42017/top20MUsers \ - * --outputGraphDir /user/your_ldap/simclusters_graph_prep_q42017/top20Musers_exp2_100neighbors_metis_graph - * - */ -object TopUsersSimilarityGraphApp extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val minActiveFollowers = args.int("minActiveFollowers", 100000) - val topK = args.int("topK") - val date = DateRange.parse(args("date")) - val inputSimilarPairsDir = args.optional("inputSimilarPairsDir") - val inputUserGroupedDir = args.optional("inputUserGroupedDir") - val isGroupedInputSequenceFiles = args.boolean("groupedInputInSequenceFiles") - val outputTopUsersDir = args("outputTopUsersDir") - val maxNeighborsPerNode = args.int("maxNeighborsPerNode", 300) - val wtExponent = args.float("wtExponent", 3.5f) - val outputGraphDir = args("outputGraphDir") - - val userSource = DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, date).toTypedPipe - val exception = new IllegalStateException( - "Please specify only one of inputSimilarPairsDir or inputUserGroupedDir" - ) - - 
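- // Which input to read is decided below: --inputSimilarPairsDir points at SimilarUserPair
- // thrift records, while --inputUserGroupedDir points at Candidates records (read as
- // sequence files when --groupedInputInSequenceFiles is set); exactly one of the two may
- // be supplied.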
(inputSimilarPairsDir, inputUserGroupedDir) match { - case (Some(_), Some(_)) => throw exception - case (None, None) => throw exception - case _ => // no-op - } - - def getSubgraphFn(usersToInclude: Set[Long]) = { - (inputSimilarPairsDir, inputUserGroupedDir) match { - case (Some(similarPairs), None) => - val similarUserPairs: TypedPipe[SimilarUserPair] = - TypedPipe.from( - new FixedPathLzoScrooge( - inputSimilarPairsDir.get, - SimilarUserPair - )) - TopUsersSimilarityGraph.getInMemorySubgraph( - similarUserPairs, - usersToInclude, - maxNeighborsPerNode) - case (None, Some(groupedInput)) => - val candidatesPipe = - TopUsersSimilarityGraph.readSimsInput(isGroupedInputSequenceFiles, groupedInput) - TopUsersSimilarityGraph.getInMemorySubgraphFromUserGroupedInput( - candidatesPipe, - usersToInclude, - maxNeighborsPerNode - ) - case _ => Execution.from(Nil) // we should never get here - } - } - - TopUsersSimilarityGraph - .run( - userSource, - minActiveFollowers, - topK, - getSubgraphFn, - wtExponent - ).flatMap { - case (topUsersList, graph) => - // We're writing to HDFS ourselves, from the submitter node. - // When we use TypedPipe.write, it's failing for large topK, e.g.10M. - // We can make the submitter node have a lot of memory, but it's - // difficult and suboptimal to give this much memory to all mappers. - val topUsersExec = Execution.from( - TopUsersSimilarityGraph - .writeTopUsers(topUsersList, mode, outputTopUsersDir + "/all") - ) - - // We want to make sure the write of the topUsers succeeds, and - // only then write out the graph. A graph without the topUsers is useless. - topUsersExec.map { _ => - // We're writing to HDFS ourselves, from the submitter node. - // When we use TypedPipe.write, it fails due to OOM on the mappers. - // We can make the submitter node have a lot of memory, but it's difficult - // and suboptimal to give this much memory to all mappers. - TopUsersSimilarityGraph.writeToHDFSIfHDFS( - graph - .iterableStringRepresentation(new DecimalFormat("#.###")).iterator().asScala, - mode, - outputGraphDir + "/all" - ) - } - } - } - } - -} - -/** - * App that only outputs the topK users on Twitter by active follower count. Example invocation: - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:top_users_only && \ - * oscar hdfs --hadoop-client-memory 60000 --host atla-aor-08-sr1 --bundle top_users_only \ - * --tool com.twitter.simclusters_v2.scalding.TopUsersOnlyApp \ - * #are these hadoop-properties needed for this job? 
- * #--hadoop-properties "scalding.with.reducers.set.explicitly=true;elephantbird.use.combine.input.format=true;elephantbird.combine.split.size=468435456;mapred.min.split.size=468435456" \ - * --screen --screen-detached --tee logs/10MTopusersOnlyExecution -- --date 2017-10-20 \ - * --minActiveFollowers 500 --topK 10000000 \ - * --outputTopUsersDir /user/your_ldap/simclusters_graph_prep_q42017/top10MUsers - * - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:top_users_only && \ - * oscar hdfs --hadoop-client-memory 60000 --user cassowary --host atla-aor-08-sr1 \ - * --bundle top_users_only --tool com.twitter.simclusters_v2.scalding.TopUsersOnlyApp \ - * --screen --screen-detached --tee ldap_logs/100MTopusersWithMappedIds \ - * -- --date 2019-10-11 --minActiveFollowers 67 --outputTopUsersDir adhoc/ldap_top100M_mappedUsers \ - * --includeMappedIds - */ -object TopUsersOnlyApp extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val minActiveFollowers = args.int("minActiveFollowers", 100000) - val topK = args.int("topK", 20000000) - val date = DateRange.parse(args("date")) - val outputTopUsersDir = args("outputTopUsersDir") - val includeMappedIds = args.boolean("includeMappedIds") - - if (includeMappedIds) { - println("Going to include mappedIds in output") - TopUsersSimilarityGraph - .topUsersWithMappedIds( - DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, date).toTypedPipe, - minActiveFollowers - ) - .map { - case TopUserWithMappedId(TopUser(id, activeFollowerCount, screenName), mappedId) => - (id, activeFollowerCount, screenName, mappedId) - } - .writeExecution(TypedTsv(outputTopUsersDir)) - } else { - TopUsersSimilarityGraph - .topUsersInMemory( - DAL.readMostRecentSnapshot(UsersourceFlatScalaDataset, date).toTypedPipe, - minActiveFollowers, - topK - ).map { topUsersList => - TopUsersSimilarityGraph.writeTopUsers( - topUsersList, - mode, - outputTopUsersDir + "/all") - } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownFor.scala b/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownFor.scala deleted file mode 100644 index f6a3e7612..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownFor.scala +++ /dev/null @@ -1,311 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.{Monoid, Semigroup} -import com.twitter.scalding._ - -object UpdateKnownFor { - - /** - * Convenience datastructure that can summarize key stats about a node's set of - * immediate neighbors. - * - * @param nodeCount number of nodes - * @param sumOfEdgeWeights sum of weights on edges in the neighborhood. - * @param sumOfMembershipWeightedEdgeWeights sum of { edge weight * membership weight } for each node - * in the neighborhood. Membership weight to what is not - * specified in this case class and is instead part of the - * context. - * @param sumOfMembershipWeights sum of membership weight for each node in the - * neighborhood. Membership weight to what is not - * specified in this case class and is instead part of - * the context. 
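- *
- * Illustrative sketch with hypothetical numbers: a neighborhood of 3 nodes with edge weights
- * 0.2, 0.3, 0.5 and membership weights 0.1, 0.2, 0.4 would be summarized as
- * NeighborhoodInformation(3, 1.0f, 0.28f, 0.7f), where 0.28 = 0.2*0.1 + 0.3*0.2 + 0.5*0.4.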
- */ - case class NeighborhoodInformation( - nodeCount: Int, - sumOfEdgeWeights: Float, - sumOfMembershipWeightedEdgeWeights: Float, - sumOfMembershipWeights: Float) - - object NeighborhoodInformationMonoid extends Monoid[NeighborhoodInformation] { - override val zero: NeighborhoodInformation = NeighborhoodInformation(0, 0f, 0f, 0f) - override def plus(l: NeighborhoodInformation, r: NeighborhoodInformation) = - NeighborhoodInformation( - l.nodeCount + r.nodeCount, - l.sumOfEdgeWeights + r.sumOfEdgeWeights, - l.sumOfMembershipWeightedEdgeWeights + r.sumOfMembershipWeightedEdgeWeights, - l.sumOfMembershipWeights + r.sumOfMembershipWeights - ) - } - - case class NodeInformation( - originalClusters: List[Int], - overallStats: NeighborhoodInformation, - statsOfClustersInNeighborhood: Map[Int, NeighborhoodInformation]) - - object NodeInformationSemigroup extends Semigroup[NodeInformation] { - implicit val ctsMonoid: Monoid[NeighborhoodInformation] = NeighborhoodInformationMonoid - - override def plus(l: NodeInformation, r: NodeInformation) = - NodeInformation( - l.originalClusters ++ r.originalClusters, - ctsMonoid.plus(l.overallStats, r.overallStats), - Monoid - .mapMonoid[Int, NeighborhoodInformation].plus( - l.statsOfClustersInNeighborhood, - r.statsOfClustersInNeighborhood) - ) - } - - case class ClusterScoresForNode( - sumScoreIgnoringMembershipScores: Double, - ratioScoreIgnoringMembershipScores: Double, - ratioScoreUsingMembershipScores: Double) - - /** - * Given a user and a cluster: - * True positive weight = sum of edge weights to neighbors who belong to that cluster. - * False negative weight = sum of edge weights to neighbors who don’t belong to that cluster. - * False positive weight = (number of users in the cluster who are not neighbors of the node) * globalAvgEdgeWeight - * Membership-weighted true positive weight = for neighbors who are also in the cluster, sum of edge weight times user membership score in the cluster. - * Membership-weighted false negative weight = for neighbors who are not in the cluster, sum of edge weight times avg membership score across the whole knownFor input. - * Membership-weighted false positive weight = for users in the cluster who are not neighbors of the node, avg global edge weight times user membership score for the cluster. 
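- *
- * Worked example with hypothetical numbers, plugging into the ratio formula below: if the
- * node's neighborhood has total edge weight 10.0, of which 6.0 comes from the 3 neighbors in
- * the cluster, the cluster has 5 members overall and globalAvgEdgeWeight is 0.5, then
- * true positive weight = 6.0, false negative weight = 4.0,
- * false positive weight = (5 - 3) * 0.5 = 1.0, and the ratio score is
- * 6.0 / (6.0 + 4.0 + 1.0) ~= 0.55.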
- * - * Ignoring membership scores, sum formula: - * truePositiveWtFactor*(True positive weight) - false negative weight - false positive weight - * Ignoring membership scores, ratio formula: - * True positive weight / (true positive weight + false negative weight + false positive weight) - * Using membership scores - * Membership-weighted true positive weight / (Membership-weighted true positive weight + Membership-weighted false negative weight + Membership-weighted false positive weight) - * - * @param overallNeighborhoodStats - * @param statsForCluster - * @param clusterSize - * @param sumOfClusterMembershipScores - * @param globalAvgEdgeWeight - * @param truePositiveWtFactor - * - * @return - */ - def getScoresForCluster( - overallNeighborhoodStats: NeighborhoodInformation, - statsForCluster: NeighborhoodInformation, - clusterSize: Int, - sumOfClusterMembershipScores: Double, - globalAvgEdgeWeight: Double, - truePositiveWtFactor: Double - ): ClusterScoresForNode = { - val truePositiveWt = statsForCluster.sumOfEdgeWeights - val falseNegativeWt = overallNeighborhoodStats.sumOfEdgeWeights - truePositiveWt - val falsePositiveWt = (clusterSize - statsForCluster.nodeCount) * globalAvgEdgeWeight - val membershipWeightedTruePositiveWt = statsForCluster.sumOfMembershipWeightedEdgeWeights - val membershipWeightedFalseNegativeWt = - overallNeighborhoodStats.sumOfMembershipWeightedEdgeWeights - membershipWeightedTruePositiveWt - val membershipWeightedFalsePositiveWt = - (sumOfClusterMembershipScores - statsForCluster.sumOfMembershipWeights) * globalAvgEdgeWeight - val sumScore = - truePositiveWtFactor * statsForCluster.sumOfEdgeWeights - falseNegativeWt - falsePositiveWt - val ratioScore = truePositiveWt / (truePositiveWt + falseNegativeWt + falsePositiveWt) - val ratioUsingMemberships = - membershipWeightedTruePositiveWt / (membershipWeightedTruePositiveWt + - membershipWeightedFalsePositiveWt + membershipWeightedFalseNegativeWt) - ClusterScoresForNode(sumScore, ratioScore, ratioUsingMemberships) - } - - def pickBestCluster( - overallNeighborhoodStats: NeighborhoodInformation, - statsOfClustersInNeighborhood: Map[Int, NeighborhoodInformation], - clusterOverallStatsMap: Map[Int, NeighborhoodInformation], - globalAvgEdgeWeight: Double, - truePositiveWtFactor: Double, - clusterScoresToFinalScore: ClusterScoresForNode => Double, - minNeighborsInCluster: Int - ): Option[(Int, Double)] = { - val clusterToScores = statsOfClustersInNeighborhood.toList.flatMap { - case (clusterId, statsInNeighborhood) => - val clusterOverallStats = clusterOverallStatsMap(clusterId) - if (statsInNeighborhood.nodeCount >= minNeighborsInCluster) { - Some( - ( - clusterId, - clusterScoresToFinalScore( - getScoresForCluster( - overallNeighborhoodStats, - statsInNeighborhood, - clusterOverallStats.nodeCount, - clusterOverallStats.sumOfMembershipWeights, - globalAvgEdgeWeight, - truePositiveWtFactor - ) - ) - ) - ) - } else { - None - } - } - if (clusterToScores.nonEmpty) { - Some(clusterToScores.maxBy(_._2)) - } else None - } - - def updateGeneric( - graph: TypedPipe[(Long, Map[Long, Float])], - inputUserToClusters: TypedPipe[(Long, Array[(Int, Float)])], - clusterOverallStatsMap: Map[Int, NeighborhoodInformation], - minNeighborsInCluster: Int, - globalAvgWeight: Double, - avgMembershipScore: Double, - truePositiveWtFactor: Double, - clusterScoresToFinalScore: ClusterScoresForNode => Double - )( - implicit uniqId: UniqueID - ): TypedPipe[(Long, Array[(Int, Float)])] = { - val emptyToSomething = Stat("no_assignment_to_some") - 
val somethingToEmpty = Stat("some_assignment_to_none") - val emptyToEmpty = Stat("empty_to_empty") - val sameCluster = Stat("same_cluster") - val diffCluster = Stat("diff_cluster") - val nodesWithSmallDegree = Stat("nodes_with_degree_lt_" + minNeighborsInCluster) - - collectInformationPerNode(graph, inputUserToClusters, avgMembershipScore) - .mapValues { - case NodeInformation(originalClusters, overallStats, statsOfClustersInNeighborhood) => - val newClusterWithScoreOpt = if (overallStats.nodeCount < minNeighborsInCluster) { - nodesWithSmallDegree.inc() - None - } else { - pickBestCluster( - overallStats, - statsOfClustersInNeighborhood, - clusterOverallStatsMap, - globalAvgWeight, - truePositiveWtFactor, - clusterScoresToFinalScore, - minNeighborsInCluster - ) - } - newClusterWithScoreOpt match { - case Some((newClusterId, score)) => - if (originalClusters.isEmpty) { - emptyToSomething.inc() - } else if (originalClusters.contains(newClusterId)) { - sameCluster.inc() - } else { - diffCluster.inc() - } - Array((newClusterId, score.toFloat)) - case None => - if (originalClusters.isEmpty) { - emptyToEmpty.inc() - } else { - somethingToEmpty.inc() - } - Array.empty[(Int, Float)] - } - } - } - - /** - * Assembles the information we need at a node in order to decide what the new cluster should be. - * So this is where we assemble what the overall - * - * This function is where all the crucial steps take place. First get the cluster that each - * node belongs to, and then broadcast information about this node and cluster membership to each - * of it's neighbors. Now bring together all records with the same nodeId as the key and create - * the NodeInformation dataset. - * @param graph symmetric graph i.e. if u is in v's adj list, then v is in u's adj list. - * @param userToClusters current knownFor. - * @param avgMembershipScore avg. membership score of a node in the knownFor we're updating. - * Useful to deal with nodes which don't belong to any knownFor. 
- * @return pipe with node information for each node - */ - def collectInformationPerNode( - graph: TypedPipe[(Long, Map[Long, Float])], - userToClusters: TypedPipe[(Long, Array[(Int, Float)])], - avgMembershipScore: Double - ): TypedPipe[(Long, NodeInformation)] = { - implicit val nisg: Semigroup[NodeInformation] = NodeInformationSemigroup - graph - .leftJoin(userToClusters) - // uncomment for adhoc job - //.withReducers(200) - .flatMap { - case (nodeId, (adjList, assignedClustersOpt)) => - val assignedClusters = - assignedClustersOpt.map(_.toList).getOrElse(Nil) - val res = adjList.toList.flatMap { - case (neighborId, neighborWeight) => - if (assignedClusters.nonEmpty) { - assignedClusters.map { - case (clusterId, membershipScore) => - val neighborhoodInformationForCluster = NeighborhoodInformation( - 1, - neighborWeight, - membershipScore * neighborWeight, - membershipScore) - val originalClusters = - if (neighborId == nodeId) List(clusterId) - else List.empty[Int] - ( - neighborId, - NodeInformation( - originalClusters, - neighborhoodInformationForCluster, - Map(clusterId -> neighborhoodInformationForCluster))) - } - } else { - List( - ( - neighborId, - NodeInformation( - Nil, - NeighborhoodInformation( - 1, - neighborWeight, - (avgMembershipScore * neighborWeight).toFloat, - avgMembershipScore.toFloat), - Map.empty[Int, NeighborhoodInformation] - ))) - } - } - res - } - .sumByKey - // uncomment for adhoc job - //.withReducers(100) - } - - /** - * Replace incoming knownFor scores with ratioScoreIgnoringMembershipScores - * @param knownFor - * @param simsGraphWithoutSelfLoops - * @param globalAvgWeight - * @param clusterStats - * @param avgMembershipScore - * @return - */ - def newKnownForScores( - knownFor: TypedPipe[(Long, Array[(Int, Float)])], - simsGraphWithoutSelfLoops: TypedPipe[(Long, Map[Long, Float])], - globalAvgWeight: Double, - clusterStats: Map[Int, NeighborhoodInformation], - avgMembershipScore: Double - ): TypedPipe[(Long, Array[(Int, Float)])] = { - collectInformationPerNode(simsGraphWithoutSelfLoops, knownFor, avgMembershipScore) - .mapValues { - case NodeInformation(originalClusters, overallStats, statsOfClustersInNeighborhood) => - originalClusters.map { clusterId => - ( - clusterId, - getScoresForCluster( - overallStats, - statsOfClustersInNeighborhood(clusterId), - clusterStats(clusterId).nodeCount, - clusterStats(clusterId).sumOfMembershipWeights, - globalAvgWeight, - 0 - ).ratioScoreIgnoringMembershipScores.toFloat) - }.toArray - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownForApps.scala b/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownForApps.scala deleted file mode 100644 index 3cffe47b8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/UpdateKnownForApps.scala +++ /dev/null @@ -1,443 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.pluck.source.cassowary.FollowingsCosineSimilaritiesManhattanSource -import com.twitter.pluck.source.cassowary.SimsCandidatesSource -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import 
com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.UpdateKnownFor.ClusterScoresForNode -import com.twitter.simclusters_v2.scalding.UpdateKnownFor.NeighborhoodInformation -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import scala.util.Success - -object UpdateKnownForApps { - - /** - * Average edge weight of an input graph - * @param graph a TypedPipe with nodeId as key and adjacency list as value. We don't care about - * the keys in this method. - * @return avg edge weight wrapped in an option in an execution - */ - def getGlobalAvgWeight(graph: TypedPipe[(Long, Map[Long, Float])]): Execution[Option[Double]] = { - graph.values - .flatMap(_.values) - .map { x => (x.toDouble, 1L) } - .sum - .toOptionExecution - .map { - case Some((sum, cnt)) => - val res = sum / cnt - println("globalAvgWeight is " + res) - Some(res) - case _ => - println("Input graph to globalAvgWeight seems to be empty") - None - } - } - - /** - * Average membership score for a particular knownFor assignment - * @param knownFor TypedPipe from nodeId to the clusters it's been assigned to along with - * membership scores. We don't care about the keys in this method. - * @return average membership score - */ - def getAvgMembershipScore(knownFor: TypedPipe[(Long, Array[(Int, Float)])]): Execution[Double] = { - knownFor.values - .flatMap(_.map(_._2)) - .map { x => (x, 1L) } - .sum - .map { case (num, den) => num / den.toDouble } - .getExecution - .onComplete { - case Success(x) => println("Avg. membership score is " + x) - case _ => println("Failed to calculate avg. membership score") - } - } - - /** - * For each cluster, get two statistics about it: the number of nodes assigned to it, and the - * sum of the membership scores - * - * @param knownFor TypedPipe from nodeId to the clusters it's been assigned to along with - * membership scores. - * @return Map giving the NeighborhoodInformation for each cluster. The nodeCount and - * sumOfMembershipWeights fields in NeighborhoodInformation are populated, others are 0. - */ - def getClusterStats( - knownFor: TypedPipe[(Long, Array[(Int, Float)])] - ): Execution[Map[Int, NeighborhoodInformation]] = { - knownFor - .flatMap { - case (_, clusterArray) => - clusterArray.map { - case (clusterId, score) => - Map(clusterId -> (1, score)) - } - } - .sum - .getExecution - .map { map => - map.mapValues { - case (count, sum) => - NeighborhoodInformation(count, 0, 0, sum) - } - } - } - - /** - * Adds self-loops and also potentially raises all edge weights to an exponent - * (typically exponent > 1, and has the effect of increasing inequality in edge weights to - * "clarify" structure in the graph - currently we just set exponent to 1). - * @param symmetrizedSims input symmetrized similarity graph - * @param exponentForEdgeWeight exponent to raise all edge weights to. 
- * Set to 1.0 to make this a no-op - * @param maxWtToSelfLoopWtMultFactor What to multiply the max wt among non-self-loop edges to - * derive the weight on the self-loop edge. - * @return New graph - */ - def simsGraphForUpdateFromSymmetrizedSims( - symmetrizedSims: TypedPipe[(Long, Map[Long, Float])], - exponentForEdgeWeight: Float, - maxWtToSelfLoopWtMultFactor: Float - ): TypedPipe[(Long, Map[Long, Float])] = { - val expWeighted = symmetrizedSims.mapValues { y => - y.mapValues { x => math.pow(x, exponentForEdgeWeight).toFloat } - } - - TopUsersSimilarityGraph.addSelfLoop( - input = expWeighted, - maxToSelfLoopWeight = { x: Float => x * maxWtToSelfLoopWtMultFactor } - ) - } - - /** - * Runs the job - * @param args args which specify many parameters - * @param inputKnownFor - * @param inputSimsGraph - * @param defaultEmailAddress by default, the email address to send an to email to, which has - * a bunch of evaluation metrics - * @param writeKnownForFunction function that takes a knownFor and writes to some - * persistent location - * @param readKnownForFunction function that reads the knownFor which was written to using the - * writeKnownForFunction - * @param dateRange dateRange, used for reading UserSource - * @param uniqueID need for creating stats - * @return Execution[Unit] encapsulating the whole job - */ - def runUpdateKnownForGeneric( - args: Args, - inputKnownFor: TypedPipe[(Long, Array[(Int, Float)])], - inputSimsGraph: TypedPipe[Candidates], - defaultEmailAddress: String, - writeKnownForFunction: TypedPipe[(Long, Array[(Int, Float)])] => Execution[Unit], - readKnownForFunction: => TypedPipe[(Long, Array[(Int, Float)])], - includeEvaluationResultsInEmail: Boolean - )( - implicit dateRange: DateRange, - uniqueID: UniqueID - ): Execution[Unit] = { - val minActiveFollowers = args.int("minActiveFollowers", 400) - val topK = args.int("topK") - val maxSimsNeighborsForUpdate = - args.int("maxSimsNeighborsForUpdate", 40) - val minNeighborsInCluster = args.int("minNeighborsInCluster", 2) - val maxWtToSelfLoopWtMultFactor = - args.float("maxWtToSelfLoopWtMultFactor", 2) - val exponentForEdgeWeight = args.float("exponentForEdgeWeights", 1.0f) - val updateMethod: ClusterScoresForNode => Double = args("updateMethod") match { - case "sumScoreIgnoringMembershipScores" => { x: ClusterScoresForNode => - x.sumScoreIgnoringMembershipScores - } - case "ratioScoreIgnoringMembershipScores" => { x: ClusterScoresForNode => - x.ratioScoreIgnoringMembershipScores - } - case "ratioScoreUsingMembershipScores" => { x: ClusterScoresForNode => - x.ratioScoreUsingMembershipScores - } - case x @ _ => - throw new Exception(s"value for --updateMethod $x is unknown. 
It must be one of " + - s"[sumScoreIgnoringMembershipScores, ratioScoreIgnoringMembershipScores, ratioScoreUsingMembershipScores]") - } - val truePositiveWtFactor = args.float("truePositiveWtFactor", 10) - val modelVersion = args("outputModelVersion") - val emailAddress = - args.optional("emailAddress").getOrElse(defaultEmailAddress) - - val topUsers = TopUsersSimilarityGraph - .topUserIds( - DAL - .readMostRecentSnapshot(UsersourceFlatScalaDataset, dateRange) - .toTypedPipe, - minActiveFollowers, - topK).count("num_top_users") - - TopUsersSimilarityGraph - .getSubgraphFromUserGroupedInput( - fullGraph = inputSimsGraph, - usersToInclude = topUsers, - maxNeighborsPerNode = maxSimsNeighborsForUpdate, - degreeThresholdForStat = minNeighborsInCluster - ) - .forceToDiskExecution - .flatMap { symmetrizedSims => - val modifiedSims = - UpdateKnownForApps.simsGraphForUpdateFromSymmetrizedSims( - symmetrizedSims = symmetrizedSims, - exponentForEdgeWeight = exponentForEdgeWeight, - maxWtToSelfLoopWtMultFactor = maxWtToSelfLoopWtMultFactor - ) - - val previouslyFamousUsersExec = inputKnownFor - .leftJoin(topUsers.asKeys) - .collect { case (userId, (clusters, None)) => userId } - .getSummaryString( - "Users previously in known for but not in topUsers anymore", - numRecords = 20) - - val clusterStatsExec = UpdateKnownForApps.getClusterStats(inputKnownFor) - - val globalAvgWeightExec = - UpdateKnownForApps.getGlobalAvgWeight(modifiedSims) - - val globalAvgMembershipScoreExec = UpdateKnownForApps.getAvgMembershipScore(inputKnownFor) - - Execution.zip(globalAvgWeightExec, clusterStatsExec, globalAvgMembershipScoreExec).flatMap { - case (Some(globalAvgWeight), clusterStats, globalAvgMembershipScore) => - println("Size of clusterStats: " + clusterStats.size) - println("First few entries from clusterStats: " + clusterStats.take(5)) - println("globalAvgWeight: " + globalAvgWeight) - println("globalAvgMembershipScore: " + globalAvgMembershipScore) - - val knownForWithUnnormalizedScores = UpdateKnownFor - .newKnownForScores( - inputKnownFor, - modifiedSims, - globalAvgWeight, - clusterStats, - globalAvgMembershipScore - ) - val writeNewKnownForExec = writeKnownForFunction( - UpdateKnownFor.updateGeneric( - modifiedSims, - knownForWithUnnormalizedScores, - clusterStats, - minNeighborsInCluster, - globalAvgWeight, - globalAvgMembershipScore, - truePositiveWtFactor, - updateMethod - ) - ) - - writeNewKnownForExec.flatMap { _ => - Util.getCustomCountersString(writeNewKnownForExec).flatMap { customCountersString => - if (includeEvaluationResultsInEmail) { - // It's unfortunate that we're not using the newKnownFor directly, but are instead - // first writing it out and then reading it back in. The reason for doing it in this - // convoluted way is that when we directly use the newKnownFor, the clusterEvaluation - // metrics are being incorrectly computed. 
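- // The evaluation below combines several independent Executions with Execution.zip, which
- // runs them and yields a tuple of their results once all complete. A minimal sketch of the
- // same pattern with toy values (not part of this job):
- //   val combined: Execution[(Int, String)] =
- //     Execution.zip(Execution.from(1), Execution.from("a"))
- //   combined.map { case (i, s) => s"$s$i" } // yields "a1" when run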
- - val newKnownFor = readKnownForFunction - - val newResultsExec = - ClusterEvaluation - .overallEvaluation(symmetrizedSims, newKnownFor, "newKnownForEval") - val oldResultsExec = - ClusterEvaluation - .overallEvaluation(symmetrizedSims, inputKnownFor, "oldKnownForEval") - val minSizeOfBiggerClusterForComparison = 10 - val compareExec = CompareClusters.summarize( - CompareClusters.compare( - KnownForSources.transpose(inputKnownFor), - KnownForSources.transpose(newKnownFor), - minSizeOfBiggerCluster = minSizeOfBiggerClusterForComparison - )) - - Execution - .zip(oldResultsExec, newResultsExec, compareExec, previouslyFamousUsersExec) - .map { - case (oldResults, newResults, compareResults, previouslyFamousUsersString) => - val emailText = "Evaluation Results for existing knownFor:\n" + - Util.prettyJsonMapper.writeValueAsString(oldResults) + - "\n\n-------------------\n\n" + - "Evaluation Results for new knownFor:\n" + - Util.prettyJsonMapper.writeValueAsString(newResults) + - "\n\n-------------------\n\n" + - s"Cosine similarity distribution between cluster membership vectors for " + - s"clusters with at least $minSizeOfBiggerClusterForComparison members\n" + - Util.prettyJsonMapper - .writeValueAsString(compareResults) + - "\n\n-------------------\n\n" + - "Custom counters:\n" + customCountersString + - "\n\n-------------------\n\n" + - previouslyFamousUsersString - - Util - .sendEmail( - emailText, - s"Evaluation results of new knownFor $modelVersion", - emailAddress) - } - } else { - Util - .sendEmail( - customCountersString, - s"Change in cluster assignments for update of knownFor $modelVersion", - emailAddress - ) - Execution.unit - } - - } - } - } - } - } -} - -trait UpdateKnownForBatch extends TwitterScheduledExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def firstTime: String - - val batchIncrement: Duration = Days(30) - - def batchDescription: String - - private lazy val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(batchDescription), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - val emailAddress: String = "no-reply@twitter.com" - - def inputDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - def inputModelVersion: String - - def outputModelVersion: String - - def outputPath: String - - def outputDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] - - override def scheduledJob: Execution[Unit] = - AnalyticsBatchExecution(execArgs) { implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val inputKnownFor = - KnownForSources.readDALDataset(inputDALDataset, Days(30), inputModelVersion) - - val inputSimsGraph = TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource()) - .map(_._2) - - def writeKnownFor(knownFor: TypedPipe[(Long, Array[(Int, Float)])]): Execution[Unit] = { - KnownForSources - .toKeyVal(knownFor, outputModelVersion) - .writeDALVersionedKeyValExecution( - outputDALDataset, - D.Suffix(outputPath) - ) - } - - def readKnownFor = - KnownForSources.readDALDataset(outputDALDataset, Days(1), outputModelVersion) - - UpdateKnownForApps.runUpdateKnownForGeneric( - args, - inputKnownFor, - inputSimsGraph, - emailAddress, - writeKnownFor, - readKnownFor, - includeEvaluationResultsInEmail = false - ) - } - } - } -} - -/** -capesospy-v2 update --build_locally --start_cron update_known_for_20M_145k \ - 
src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object UpdateKnownFor20M145K extends UpdateKnownForBatch { - override val firstTime: String = "2019-06-06" - - override val batchIncrement: Duration = Days(7) - - override val batchDescription: String = - "com.twitter.simclusters_v2.scalding.UpdateKnownFor20M145K" - - override val inputModelVersion: String = ModelVersions.Model20M145KUpdated - - override val inputDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] = - SimclustersV2RawKnownFor20M145KUpdatedScalaDataset - - override val outputModelVersion: String = ModelVersions.Model20M145KUpdated - - override val outputDALDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsKnownFor]] = - SimclustersV2RawKnownFor20M145KUpdatedScalaDataset - - override val outputPath: String = InternalDataPaths.RawKnownForUpdatedPath -} - -/** This one's end-to-end, doesn't save any intermediate data etc. **/ -object UpdateKnownForAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - implicit val date: DateRange = DateRange.parse(args("date")) - val defaultEmailAddress = "your_ldap@twitter.com" - - val inputKnownFor = args.optional("inputKnownForDir") match { - case Some(inputKnownForDir) => KnownForSources.readKnownFor(inputKnownForDir) - case None => KnownForSources.knownFor_20M_Dec11_145K - } - - val inputSimsGraph = TopUsersSimilarityGraph.readSimsInput( - args.boolean("simsInputIsKeyValSource"), - args("simsInputDir") - ) - - def readKnownFor() = KnownForSources.readKnownFor(args("outputDir")) - - UpdateKnownForApps.runUpdateKnownForGeneric( - args, - inputKnownFor, - inputSimsGraph, - defaultEmailAddress, - { input: TypedPipe[(Long, Array[(Int, Float)])] => - KnownForSources.writeKnownForTypedTsv(input, args("outputDir")) - }, - readKnownFor, - includeEvaluationResultsInEmail = true - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/UserUserFavGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/UserUserFavGraph.scala deleted file mode 100644 index 60fb0339d..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/UserUserFavGraph.scala +++ /dev/null @@ -1,445 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.algebird.DecayedValue -import com.twitter.algebird.DecayedValueMonoid -import com.twitter.algebird.Monoid -import com.twitter.algebird.Semigroup -import com.twitter.conversions.DurationOps._ -import com.twitter.logging.Logger -import com.twitter.scalding._ -import com.twitter.scalding.typed.UnsortedGrouped -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch._ -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.DecayedSums -import com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights -import com.twitter.timelineservice.thriftscala.ContextualizedFavoriteEvent 
-import com.twitter.timelineservice.thriftscala.FavoriteEventUnion -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser -import com.twitter.util.Time -import twadoop_config.configuration.log_categories.group.timeline.TimelineServiceFavoritesScalaDataset - -sealed trait FavState - -object Fav extends FavState - -object UnFavWithoutPriorFav extends FavState - -object UnFavWithPriorFav extends FavState - -case class TimestampedFavState(favOrUnfav: FavState, timestampMillis: Long) - -object TimestampedFavStateSemigroup extends Semigroup[TimestampedFavState] { - override def plus(left: TimestampedFavState, right: TimestampedFavState): TimestampedFavState = { - - /** - * Assigning to first, second ensures commutative property - */ - val (first, second) = if (left.timestampMillis < right.timestampMillis) { - (left, right) - } else { - (right, left) - } - (first.favOrUnfav, second.favOrUnfav) match { - case (_, UnFavWithPriorFav) => second - case (UnFavWithPriorFav, UnFavWithoutPriorFav) => - TimestampedFavState(UnFavWithPriorFav, second.timestampMillis) - case (Fav, UnFavWithoutPriorFav) => - TimestampedFavState(UnFavWithPriorFav, second.timestampMillis) - case (UnFavWithoutPriorFav, UnFavWithoutPriorFav) => second - case (_, Fav) => second - } - } -} - -object UserUserFavGraph { - implicit val tz: java.util.TimeZone = DateOps.UTC - // setting the prune threshold in the monoid below to 0.0, since we want to do our own pruning - // outside the monoid, primarily to be able to count how many scores are pruned. - implicit val dvMonoid: Monoid[DecayedValue] = DecayedValueMonoid(0.0) - implicit val lfvSemigroup: Semigroup[TimestampedFavState] = TimestampedFavStateSemigroup - - def getSummedFavGraph( - previousGraphOpt: Option[TypedPipe[EdgeWithDecayedWeights]], - newFavsDateRange: DateRange, - halfLivesInDays: List[Int], - minScoreToKeep: Double - )( - implicit uniqueID: UniqueID - ): TypedPipe[EdgeWithDecayedWeights] = { - val newFavs = DAL.read(TimelineServiceFavoritesScalaDataset, newFavsDateRange).toTypedPipe - val endTime = Time.fromMilliseconds(newFavsDateRange.end.timestamp) - val userSource = - DAL.readMostRecentSnapshotNoOlderThan(UsersourceFlatScalaDataset, Days(7)).toTypedPipe - getSummedFavGraphWithValidUsers( - previousGraphOpt, - newFavs, - halfLivesInDays, - endTime, - minScoreToKeep, - userSource - ) - } - - def getSummedFavGraphWithValidUsers( - previousGraphOpt: Option[TypedPipe[EdgeWithDecayedWeights]], - newFavs: TypedPipe[ContextualizedFavoriteEvent], - halfLivesInDays: List[Int], - endTime: Time, - minScoreToKeep: Double, - userSource: TypedPipe[FlatUser] - )( - implicit uniqueID: UniqueID - ): TypedPipe[EdgeWithDecayedWeights] = { - val fullGraph = getSummedFavGraph( - previousGraphOpt, - newFavs, - halfLivesInDays, - endTime, - minScoreToKeep - ) - removeDeactivedOrSuspendedUsers(fullGraph, userSource) - } - - def processRawFavEvents( - favsOrUnfavs: TypedPipe[ContextualizedFavoriteEvent] - )( - implicit uniqueID: UniqueID - ): TypedPipe[((UserId, TweetId, UserId), TimestampedFavState)] = { - val numFavsBeforeUniq = Stat("num_favs_before_uniq") - val numUnFavsBeforeUniq = Stat("num_unfavs_before_uniq") - val numFinalFavs = Stat("num_final_favs") - val numUnFavsWithPriorFavs = Stat("num_unfavs_with_prior_favs") - val numUnFavsWithoutPriorFavs = Stat("num_unfavs_without_prior_favs") - - favsOrUnfavs - .flatMap { cfe: ContextualizedFavoriteEvent => - cfe.event match { - case 
FavoriteEventUnion.Favorite(fav) => - numFavsBeforeUniq.inc() - Some( - ( - (fav.userId, fav.tweetId, fav.tweetUserId), - TimestampedFavState(Fav, fav.eventTimeMs))) - case FavoriteEventUnion.Unfavorite(unfav) => - numUnFavsBeforeUniq.inc() - Some( - ( - (unfav.userId, unfav.tweetId, unfav.tweetUserId), - TimestampedFavState(UnFavWithoutPriorFav, unfav.eventTimeMs))) - case _ => None - } - } - .sumByKey - .toTypedPipe - .flatMap { - case fav @ (_, TimestampedFavState(Fav, _)) => - numFinalFavs.inc() - Some(fav) - case unfav @ (_, TimestampedFavState(UnFavWithoutPriorFav, _)) => - numUnFavsWithoutPriorFavs.inc() - Some(unfav) - case (_, TimestampedFavState(UnFavWithPriorFav, _)) => - numUnFavsWithPriorFavs.inc() - None - } - } - - private def getGraphFromNewFavsOnly( - newFavs: TypedPipe[ContextualizedFavoriteEvent], - halfLivesInDays: List[Int], - endTime: Time - )( - implicit uniqueID: UniqueID - ): UnsortedGrouped[(UserId, UserId), Map[Int, DecayedValue]] = { - - val numEventsNewerThanEndTime = Stat("num_events_newer_than_endtime") - - processRawFavEvents(newFavs).map { - case ((userId, _, authorId), TimestampedFavState(favOrUnfav, timestampMillis)) => - val halfLifeInDaysToScores = halfLivesInDays.map { halfLifeInDays => - val givenTime = Time.fromMilliseconds(timestampMillis) - if (givenTime > endTime) { - // technically this should never happen, and even if it did happen, - // we shouldn't have to care, but I'm noticing that the weights aren't being computed - // correctly for events that spilled over the edge - numEventsNewerThanEndTime.inc() - } - val timeInSeconds = math.min(givenTime.inSeconds, endTime.inSeconds) - val value = favOrUnfav match { - case Fav => 1.0 - case UnFavWithoutPriorFav => -1.0 - case UnFavWithPriorFav => 0.0 - } - val decayedValue = DecayedValue.build(value, timeInSeconds, halfLifeInDays.days.inSeconds) - halfLifeInDays -> decayedValue - } - ((userId, authorId), halfLifeInDaysToScores.toMap) - }.sumByKey - } - - def getSummedFavGraph( - previousGraphOpt: Option[TypedPipe[EdgeWithDecayedWeights]], - newFavs: TypedPipe[ContextualizedFavoriteEvent], - halfLivesInDays: List[Int], - endTime: Time, - minScoreToKeep: Double - )( - implicit uniqueID: UniqueID - ): TypedPipe[EdgeWithDecayedWeights] = { - val prunedScoresCounter = Stat("num_pruned_scores") - val negativeScoresCounter = Stat("num_negative_scores") - val prunedEdgesCounter = Stat("num_pruned_edges") - val keptEdgesCounter = Stat("num_kept_edges") - val keptScoresCounter = Stat("num_kept_scores") - val numCommonEdges = Stat("num_common_edges") - val numNewEdges = Stat("num_new_edges") - val numOldEdges = Stat("num_old_edges") - - val unprunedOuterJoinedGraph = previousGraphOpt match { - case Some(previousGraph) => - previousGraph - .map { - case EdgeWithDecayedWeights(srcId, destId, decayedSums) => - val ts = decayedSums.lastUpdatedTimestamp.toDouble / 1000 - val map = decayedSums.halfLifeInDaysToDecayedSums.map { - case (halfLifeInDays, value) => - halfLifeInDays -> DecayedValue.build(value, ts, halfLifeInDays.days.inSeconds) - }.toMap - ((srcId, destId), map) - } - .outerJoin(getGraphFromNewFavsOnly(newFavs, halfLivesInDays, endTime)) - .toTypedPipe - case None => - getGraphFromNewFavsOnly(newFavs, halfLivesInDays, endTime).toTypedPipe - .map { - case ((srcId, destId), scoreMap) => - ((srcId, destId), (None, Some(scoreMap))) - } - } - - unprunedOuterJoinedGraph - .flatMap { - case ((srcId, destId), (previousScoreMapOpt, newScoreMapOpt)) => - val latestTimeDecayedValues = halfLivesInDays.map { hlInDays 
=> - hlInDays -> DecayedValue.build(0, endTime.inSeconds, hlInDays.days.inSeconds) - }.toMap - - val updatedDecayedValues = - Monoid.sum( - List(previousScoreMapOpt, newScoreMapOpt, Some(latestTimeDecayedValues)).flatten) - - (previousScoreMapOpt, newScoreMapOpt) match { - case (Some(pm), None) => numOldEdges.inc() - case (None, Some(nm)) => numNewEdges.inc() - case (Some(pm), Some(nm)) => numCommonEdges.inc() - } - - val prunedMap = updatedDecayedValues.flatMap { - case (hlInDays, decayedValue) => - if (decayedValue.value < minScoreToKeep) { - if (decayedValue.value < 0) { - negativeScoresCounter.inc() - } - prunedScoresCounter.inc() - None - } else { - keptScoresCounter.inc() - Some((hlInDays, decayedValue.value)) - } - } - - if (prunedMap.nonEmpty) { - keptEdgesCounter.inc() - Some(EdgeWithDecayedWeights(srcId, destId, DecayedSums(endTime.inMillis, prunedMap))) - } else { - prunedEdgesCounter.inc() - None - } - } - } - - def removeDeactivedOrSuspendedUsers( - full: TypedPipe[EdgeWithDecayedWeights], - userSource: TypedPipe[FlatUser] - )( - implicit uniqueID: UniqueID - ): TypedPipe[EdgeWithDecayedWeights] = { - val numValidUsers = Stat("num_valid_users") - val numInvalidUsers = Stat("num_invalid_users") - val numEdgesBeforeUsersourceJoin = Stat("num_edges_before_join_with_usersource") - val numEdgesWithValidSource = Stat("num_edges_with_valid_source") - val numEdgesWithValidSourceAndDest = Stat("num_edges_with_valid_source_and_dest") - - val validUsers = userSource.flatMap { - case flatUser - if !flatUser.deactivated.contains(true) && !flatUser.suspended.contains(true) - && flatUser.id.nonEmpty => - numValidUsers.inc() - flatUser.id - case _ => - numInvalidUsers.inc() - None - }.forceToDisk // avoid reading in the whole of userSource for both of the joins below - - val toJoin = full.map { edge => - numEdgesBeforeUsersourceJoin.inc() - (edge.sourceId, edge) - } - - toJoin - .join(validUsers.asKeys) - .map { - case (_, (edge, _)) => - numEdgesWithValidSource.inc() - (edge.destinationId, edge) - } - .join(validUsers.asKeys) - .map { - case (_, (edge, _)) => - numEdgesWithValidSourceAndDest.inc() - edge - } - } -} - -/** - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:fav_graph_adhoc && \ - * oscar hdfs --user frigate --host hadoopnest1.atla.twitter.com --bundle fav_graph_adhoc \ - * --tool com.twitter.simclusters_v2.scalding.UserUserFavGraphAdhoc --screen --screen-detached \ - * --tee logs/userUserFavGraphAdhoc_20170101 -- --date 2017-01-01 --halfLivesInDays 14 50 100 \ - * --outputDir /user/frigate/your_ldap/userUserFavGraphAdhoc_20170101_hl14_50_100 - * - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:fav_graph_adhoc && \ - * oscar hdfs --user frigate --host hadoopnest1.atla.twitter.com --bundle fav_graph_adhoc \ - * --tool com.twitter.simclusters_v2.scalding.UserUserFavGraphAdhoc --screen --screen-detached \ - * --tee logs/userUserFavGraphAdhoc_20170102_addPrevious20170101 -- --date 2017-01-02 \ - * --previousGraphDir /user/frigate/your_ldap/userUserFavGraphAdhoc_20170101_hl14_50_100 \ - * --halfLivesInDays 14 50 100 \ - * --outputDir /user/frigate/your_ldap/userUserFavGraphAdhoc_20170102_addPrevious20170101_hl14_50_100 - */ -object UserUserFavGraphAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - 
val previousGraphOpt = args.optional("previousGraphDir").map { dir => - TypedPipe.from(EdgeWithDecayedWtsFixedPathSource(dir)) - } - val favsDateRange = DateRange.parse(args.list("date")) - val halfLives = args.list("halfLivesInDays").map(_.toInt) - val minScoreToKeep = args.double("minScoreToKeep", 1e-5) - val outputDir = args("outputDir") - Util.printCounters( - UserUserFavGraph - .getSummedFavGraph(previousGraphOpt, favsDateRange, halfLives, minScoreToKeep) - .writeExecution(EdgeWithDecayedWtsFixedPathSource(outputDir)) - ) - } - } -} - -/** - * $ capesospy-v2 update --start_cron fav_graph src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object UserUserFavGraphBatch extends TwitterScheduledExecutionApp { - private val firstTime: String = "2017-01-01" - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - private val batchIncrement: Duration = Days(2) - private val firstStartDate = DateRange.parse(firstTime).start - - val outputPath: String = "/user/cassowary/processed/user_user_fav_graph" - val log = Logger() - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val previousGraph = if (dateRange.start.timestamp == firstStartDate.timestamp) { - log.info("Looks like this is the first time, setting previousGraph to None") - None - } else { - Some( - DAL - .readMostRecentSnapshot(UserUserFavGraphScalaDataset, dateRange - batchIncrement) - .toTypedPipe - ) - } - val halfLives = args.list("halfLivesInDays").map(_.toInt) - val minScoreToKeep = args.double("minScoreToKeep", 1e-5) - Util.printCounters( - UserUserFavGraph - .getSummedFavGraph(previousGraph, dateRange, halfLives, minScoreToKeep) - .writeDALSnapshotExecution( - UserUserFavGraphScalaDataset, - D.Daily, - D.Suffix(outputPath), - D.EBLzo(), - dateRange.end) - ) - } - } - } -} - -object DumpFavGraphAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val favGraph = DAL - .readMostRecentSnapshotNoOlderThan(UserUserFavGraphScalaDataset, Days(10)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .collect { - case edge if edge.weights.halfLifeInDaysToDecayedSums.contains(100) => - (edge.sourceId, edge.destinationId, edge.weights.halfLifeInDaysToDecayedSums(100)) - } - - Execution - .sequence( - Seq( - Util.printSummaryOfNumericColumn( - favGraph.map(_._3), - Some("Weight") - ), - Util.printSummaryOfNumericColumn( - favGraph.map(c => math.log10(10.0 + c._3)), - Some("Weight_Log_P10") - ), - Util.printSummaryOfNumericColumn( - favGraph.map(c => math.log10(1.0 + c._3)), - Some("Weight_Log_P1") - ), - Util.printSummaryOfCategoricalColumn(favGraph.map(_._1), Some("SourceId")), - Util.printSummaryOfCategoricalColumn(favGraph.map(_._2), Some("DestId")) - ) - ).flatMap { summarySeq => - println(summarySeq.mkString("\n")) - Execution.unit - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/UserUserGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/UserUserGraph.scala deleted file mode 100644 index bdb1004c7..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/scalding/UserUserGraph.scala +++ /dev/null @@ -1,180 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite.{D, WriteExtension} -import com.twitter.scalding_internal.job.analytics_batch.{ - AnalyticsBatchExecution, - AnalyticsBatchExecutionArgs, - BatchDescription, - BatchFirstTime, - BatchIncrement, - TwitterScheduledExecutionApp -} -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.hdfs_sources.{ - UserAndNeighborsFixedPathSource, - UserUserGraphScalaDataset -} -import com.twitter.simclusters_v2.thriftscala.{NeighborWithWeights, UserAndNeighbors} -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import java.util.TimeZone - -/** - * This is a scheduled version of the user_user_normalized_graph dataset generation job. - * - * The key difference in this implementation is that we donot read the ProducerNormsAndCounts dataset. - * So we no longer store the following producer normalized scores for the edges in the NeigborWithWeights thrift: - * followScoreNormalizedByNeighborFollowersL2, favScoreHalfLife100DaysNormalizedByNeighborFaversL2 and logFavScoreL2Normalized - * - */ -object UserUserGraph { - - def getNeighborWithWeights( - inputEdge: Edge - ): NeighborWithWeights = { - val logFavScore = UserUserNormalizedGraph.logTransformation(inputEdge.favWeight) - NeighborWithWeights( - neighborId = inputEdge.destId, - isFollowed = Some(inputEdge.isFollowEdge), - favScoreHalfLife100Days = Some(inputEdge.favWeight), - logFavScore = Some(logFavScore), - ) - } - - def addWeightsAndAdjListify( - input: TypedPipe[Edge], - maxNeighborsPerUser: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[UserAndNeighbors] = { - val numUsersNeedingNeighborTruncation = Stat("num_users_needing_neighbor_truncation") - val numEdgesAfterTruncation = Stat("num_edges_after_truncation") - val numEdgesBeforeTruncation = Stat("num_edges_before_truncation") - val numFollowEdgesBeforeTruncation = Stat("num_follow_edges_before_truncation") - val numFavEdgesBeforeTruncation = Stat("num_fav_edges_before_truncation") - val numFollowEdgesAfterTruncation = Stat("num_follow_edges_after_truncation") - val numFavEdgesAfterTruncation = Stat("num_fav_edges_after_truncation") - val numRecordsInOutputGraph = Stat("num_records_in_output_graph") - - input - .map { edge => - numEdgesBeforeTruncation.inc() - if (edge.isFollowEdge) numFollowEdgesBeforeTruncation.inc() - if (edge.favWeight > 0) numFavEdgesBeforeTruncation.inc() - (edge.srcId, getNeighborWithWeights(edge)) - } - .group - // .withReducers(10000) - .sortedReverseTake(maxNeighborsPerUser)(Ordering.by { x: NeighborWithWeights => - x.favScoreHalfLife100Days.getOrElse(0.0) - }) - .map { - case (srcId, neighborList) => - if (neighborList.size >= maxNeighborsPerUser) numUsersNeedingNeighborTruncation.inc() - neighborList.foreach { neighbor => - numEdgesAfterTruncation.inc() - if (neighbor.favScoreHalfLife100Days.exists(_ > 0)) numFavEdgesAfterTruncation.inc() - if (neighbor.isFollowed.contains(true)) numFollowEdgesAfterTruncation.inc() - } - numRecordsInOutputGraph.inc() - UserAndNeighbors(srcId, neighborList) - } - } - - def run( - followEdges: TypedPipe[(Long, Long)], - favEdges: TypedPipe[(Long, Long, Double)], - maxNeighborsPerUser: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[UserAndNeighbors] = { - val combined = UserUserNormalizedGraph.combineFollowAndFav(followEdges, favEdges) - 
addWeightsAndAdjListify( - combined, - maxNeighborsPerUser - ) - } -} - -/** - * - * capesospy-v2 update --build_locally --start_cron user_user_follow_fav_graph \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ - -object UserUserGraphBatch extends TwitterScheduledExecutionApp { - private val firstTime: String = "2021-04-24" - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - private val batchIncrement: Duration = Days(2) - private val halfLifeInDaysForFavScore = 100 - - private val outputPath: String = "/user/cassowary/processed/user_user_graph" - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val maxNeighborsPerUser = args.int("maxNeighborsPerUser", 2000) - - Util.printCounters( - UserUserGraph - .run( - UserUserNormalizedGraph.getFollowEdges, - UserUserNormalizedGraph.getFavEdges(halfLifeInDaysForFavScore), - maxNeighborsPerUser - ) - .writeDALSnapshotExecution( - UserUserGraphScalaDataset, - D.Daily, - D.Suffix(outputPath), - D.EBLzo(), - dateRange.end) - ) - } - } - } -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:user_user_graph-adhoc -scalding remote run \ ---user cassowary \ ---keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ ---principal service_acoount@TWITTER.BIZ \ ---cluster bluebird-qus1 \ ---main-class com.twitter.simclusters_v2.scalding.UserUserGraphAdhoc \ ---target src/scala/com/twitter/simclusters_v2/scalding:user_user_graph-adhoc \ --- --date 2021-04-24 --outputDir "/user/cassowary/adhoc/user_user_graph_adhoc" - */ -object UserUserGraphAdhoc extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val maxNeighborsPerUser = args.int("maxNeighborsPerUser", 2000) - val halfLifeInDaysForFavScore = 100 - val outputDir = args("outputDir") - val userAndNeighbors = - UserUserGraph - .run( - UserUserNormalizedGraph.getFollowEdges, - UserUserNormalizedGraph.getFavEdges(halfLifeInDaysForFavScore), - maxNeighborsPerUser) - - Execution - .zip( - userAndNeighbors.writeExecution(UserAndNeighborsFixedPathSource(outputDir)), - userAndNeighbors.writeExecution(TypedTsv(outputDir + "_tsv"))).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/UserUserNormalizedGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/UserUserNormalizedGraph.scala deleted file mode 100644 index 62d878fc6..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/UserUserNormalizedGraph.scala +++ /dev/null @@ -1,453 +0,0 @@ -package com.twitter.simclusters_v2.scalding - -import com.twitter.bijection.Injection -import com.twitter.frigate.user_sampler.common.EmployeeIds -import com.twitter.hashing.KeyHasher -import com.twitter.logging.Logger -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import 
com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights -import com.twitter.simclusters_v2.thriftscala.NormsAndCounts -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import flockdb_tools.datasets.flock.FlockFollowsEdgesScalaDataset - -case class Edge(srcId: Long, destId: Long, isFollowEdge: Boolean, favWeight: Double) - -object UserUserNormalizedGraph { - - // The common function for applying logarithmic transformation - def logTransformation(weight: Double): Double = { - math.max(math.log10(1.0 + weight), 0.0) - } - - def getFollowEdges(implicit dateRange: DateRange, uniqueID: UniqueID): TypedPipe[(Long, Long)] = { - val numInputFollowEdges = Stat("num_input_follow_edges") - DAL - .readMostRecentSnapshot(FlockFollowsEdgesScalaDataset) - .toTypedPipe - .collect { - case edge if edge.state == 0 => - numInputFollowEdges.inc() - (edge.sourceId, edge.destinationId) - } - } - - def transformFavEdges( - input: TypedPipe[EdgeWithDecayedWeights], - halfLifeInDaysForFavScore: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Long, Long, Double)] = { - val numEdgesWithSpecifiedHalfLife = Stat( - s"num_edges_with_specified_half_life_${halfLifeInDaysForFavScore}_days") - val numEdgesWithoutSpecifiedHalfLife = Stat( - s"num_edges_without_specified_half_life_${halfLifeInDaysForFavScore}_days") - input - .flatMap { edge => - if (edge.weights.halfLifeInDaysToDecayedSums.contains(halfLifeInDaysForFavScore)) { - numEdgesWithSpecifiedHalfLife.inc() - Some((edge.sourceId, edge.destinationId, edge.weights.halfLifeInDaysToDecayedSums(100))) - } else { - numEdgesWithoutSpecifiedHalfLife.inc() - None - } - } - } - - def getFavEdges( - halfLifeInDaysForFavScore: Int - )( - implicit dateRange: DateRange, - uniqueID: UniqueID - ): TypedPipe[(Long, Long, Double)] = { - implicit val tz: java.util.TimeZone = DateOps.UTC - transformFavEdges( - DAL - .readMostRecentSnapshot(UserUserFavGraphScalaDataset) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe, - halfLifeInDaysForFavScore - ) - } - - def getNeighborWithWeights( - inputEdge: Edge, - followerL2NormOfDest: Double, - faverL2NormOfDest: Double, - logFavL2Norm: Double - ): NeighborWithWeights = { - val normalizedFollowScore = { - val numerator = if (inputEdge.isFollowEdge) 1.0 else 0.0 - if (followerL2NormOfDest > 0) numerator / followerL2NormOfDest else 0.0 - } - val normalizedFavScore = - if (faverL2NormOfDest > 0) inputEdge.favWeight / faverL2NormOfDest else 0.0 - val logFavScore = if (inputEdge.favWeight > 0) logTransformation(inputEdge.favWeight) else 0.0 - val logFavScoreL2Normalized = if (logFavL2Norm > 0) logFavScore / logFavL2Norm else 0.0 - NeighborWithWeights( - inputEdge.destId, - Some(inputEdge.isFollowEdge), - 
Some(normalizedFollowScore), - Some(inputEdge.favWeight), - Some(normalizedFavScore), - logFavScore = Some(logFavScore), - logFavScoreL2Normalized = Some(logFavScoreL2Normalized) - ) - } - - def addNormalizedWeightsAndAdjListify( - input: TypedPipe[Edge], - maxNeighborsPerUser: Int, - normsAndCountsFull: TypedPipe[NormsAndCounts] - )( - implicit uniqueId: UniqueID - ): TypedPipe[UserAndNeighbors] = { - val numUsersNeedingNeighborTruncation = Stat("num_users_needing_neighbor_truncation") - val numEdgesAfterTruncation = Stat("num_edges_after_truncation") - val numEdgesBeforeTruncation = Stat("num_edges_before_truncation") - val numFollowEdgesBeforeTruncation = Stat("num_follow_edges_before_truncation") - val numFavEdgesBeforeTruncation = Stat("num_fav_edges_before_truncation") - val numFollowEdgesAfterTruncation = Stat("num_follow_edges_after_truncation") - val numFavEdgesAfterTruncation = Stat("num_fav_edges_after_truncation") - val numRecordsInOutputGraph = Stat("num_records_in_output_graph") - - val norms = normsAndCountsFull.map { record => - ( - record.userId, - ( - record.followerL2Norm.getOrElse(0.0), - record.faverL2Norm.getOrElse(0.0), - record.logFavL2Norm.getOrElse(0.0))) - } - - implicit val l2b: Long => Array[Byte] = Injection.long2BigEndian - input - .map { edge => (edge.destId, edge) } - .sketch(reducers = 2000) - .join(norms) - .map { - case (destId, (edge, (followNorm, favNorm, logFavNorm))) => - numEdgesBeforeTruncation.inc() - if (edge.isFollowEdge) numFollowEdgesBeforeTruncation.inc() - if (edge.favWeight > 0) numFavEdgesBeforeTruncation.inc() - (edge.srcId, getNeighborWithWeights(edge, followNorm, favNorm, logFavNorm)) - } - .group - //.withReducers(1000) - .sortedReverseTake(maxNeighborsPerUser)(Ordering.by { x: NeighborWithWeights => - ( - x.favScoreHalfLife100Days.getOrElse(0.0), - x.followScoreNormalizedByNeighborFollowersL2.getOrElse(0.0) - ) - }) - .map { - case (srcId, neighborList) => - if (neighborList.size >= maxNeighborsPerUser) numUsersNeedingNeighborTruncation.inc() - neighborList.foreach { neighbor => - numEdgesAfterTruncation.inc() - if (neighbor.favScoreHalfLife100Days.exists(_ > 0)) numFavEdgesAfterTruncation.inc() - if (neighbor.isFollowed.contains(true)) numFollowEdgesAfterTruncation.inc() - } - numRecordsInOutputGraph.inc() - UserAndNeighbors(srcId, neighborList) - } - } - - def combineFollowAndFav( - followEdges: TypedPipe[(Long, Long)], - favEdges: TypedPipe[(Long, Long, Double)] - ): TypedPipe[Edge] = { - ( - followEdges.map { case (src, dest) => ((src, dest), (1, 0.0)) } ++ - favEdges.map { case (src, dest, wt) => ((src, dest), (0, wt)) } - ).sumByKey - //.withReducers(2500) - .map { - case ((src, dest), (follow, favWt)) => - Edge(src, dest, isFollowEdge = follow > 0, favWt) - } - } - - def run( - followEdges: TypedPipe[(Long, Long)], - favEdges: TypedPipe[(Long, Long, Double)], - normsAndCounts: TypedPipe[NormsAndCounts], - maxNeighborsPerUser: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[UserAndNeighbors] = { - val combined = combineFollowAndFav(followEdges, favEdges) - addNormalizedWeightsAndAdjListify( - combined, - maxNeighborsPerUser, - normsAndCounts - ) - } -} - -object UserUserNormalizedGraphBatch extends TwitterScheduledExecutionApp { - private val firstTime: String = "2018-06-16" - implicit val tz = DateOps.UTC - implicit val parser = DateParser.default - private val batchIncrement: Duration = Days(7) - private val halfLifeInDaysForFavScore = 100 - - private val outputPath: String = 
"/user/cassowary/processed/user_user_normalized_graph" - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val maxNeighborsPerUser = args.int("maxNeighborsPerUser", 2000) - - val producerNormsAndCounts = - DAL.readMostRecentSnapshot(ProducerNormsAndCountsScalaDataset).toTypedPipe - - Util.printCounters( - UserUserNormalizedGraph - .run( - UserUserNormalizedGraph.getFollowEdges, - UserUserNormalizedGraph.getFavEdges(halfLifeInDaysForFavScore), - producerNormsAndCounts, - maxNeighborsPerUser - ) - .writeDALSnapshotExecution( - UserUserNormalizedGraphScalaDataset, - D.Daily, - D.Suffix(outputPath), - D.EBLzo(), - dateRange.end) - ) - } - } - } -} - -object UserUserNormalizedGraphAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def hashToLong(input: Long): Long = { - val bb = java.nio.ByteBuffer.allocate(8) - bb.putLong(input) - Math.abs(KeyHasher.KETAMA.hashKey(bb.array())) - } - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val halfLifeInDaysForFavScore = 100 - val maxNeighborsPerUser = args.int("maxNeighborsPerUser", 2000) - val producerNormsAndCounts = TypedPipe.from( - NormsAndCountsFixedPathSource(args("normsInputDir")) - ) - val favEdges = args.optional("favGraphInputDir") match { - case Some(favGraphInputDir) => - UserUserNormalizedGraph.transformFavEdges( - TypedPipe.from( - EdgeWithDecayedWtsFixedPathSource(favGraphInputDir) - ), - halfLifeInDaysForFavScore - ) - case None => - UserUserNormalizedGraph.getFavEdges(halfLifeInDaysForFavScore) - } - - val followEdges = UserUserNormalizedGraph.getFollowEdges - - Util.printCounters( - UserUserNormalizedGraph - .run( - followEdges, - favEdges, - producerNormsAndCounts, - maxNeighborsPerUser - ).writeExecution(UserAndNeighborsFixedPathSource(args("outputDir"))) - ) - } - } -} - -object DumpUserUserGraphAdhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val input = args.optional("inputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - val users = args.list("users").map(_.toLong).toSet - if (users.isEmpty) { - input.printSummary("Producer norms and counts") - } else { - input - .collect { - case rec if users.contains(rec.userId) => - (Seq(rec.userId.toString) ++ rec.neighbors.map { n => - Util.prettyJsonMapper.writeValueAsString(n).replaceAll("\n", " ") - }).mkString("\n") - } - .toIterableExecution - .map { strings => println(strings.mkString("\n")) } - } - } - } -} - -/* - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:user_user_normalized_graph && 
\ - * oscar hdfs --host hadoopnest2.atla.twitter.com --bundle user_user_normalized_graph \ - * --tool com.twitter.simclusters_v2.scalding.EmployeeGraph --screen --screen-detached \ - * --tee your_ldap/employeeGraph20190809 -- --outputDir adhoc/employeeGraph20190809 - */ -object EmployeeGraph extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val input = args.optional("inputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - val employeeIds = EmployeeIds.buildMerlinClientAndGetEmployees("frigate-scalding.dev") - input - .collect { - case rec if employeeIds.contains(rec.userId) => - rec.neighbors.collect { - case n if employeeIds.contains(n.neighborId) => - ( - rec.userId, - n.neighborId, - n.favScoreHalfLife100Days.getOrElse(0), - n.isFollowed.getOrElse(false)) - } - } - .flatten - .writeExecution(TypedTsv(args("outputDir"))) - - } - } -} -/* - * scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding:employee_graph_from_user_user - * --main-class com.twitter.simclusters_v2.scalding.EmployeeGraphFromUserUser - * --submitter hadoopnest2.atla.twitter.com --user recos-platform -- --graphOutputDir "/user/recos-platform/adhoc/employee_graph_from_user_user/" - */ - -object EmployeeGraphFromUserUser extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val graphOutputDir = args("graphOutputDir") - val input = args.optional("inputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - val employeeIds = EmployeeIds.buildMerlinClientAndGetEmployees("frigate-scalding.dev") - input - .collect { - case rec if employeeIds.contains(rec.userId) => - rec - } - .writeExecution(UserAndNeighborsFixedPathSource(graphOutputDir)) - - } - } -} - -/* - * ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding:user_user_normalized_graph && \ - * oscar hdfs --host hadoopnest2.atla.twitter.com --bundle user_user_normalized_graph \ - * --tool com.twitter.simclusters_v2.scalding.VitGraph --screen --screen-detached \ - * --tee your_ldap/vitGraph20190809 -- --outputDir adhoc/vitGraph20190809 - */ -object VitGraph extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - val minActiveFollowers = args.int("minActiveFollowers") - val topK = args.int("topK") - val input = args.optional("inputDir") match { - case Some(inputDir) => TypedPipe.from(UserAndNeighborsFixedPathSource(inputDir)) - case None => - DAL - .readMostRecentSnapshotNoOlderThan(UserUserNormalizedGraphScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - val userSource = - 
DAL.readMostRecentSnapshotNoOlderThan(UsersourceFlatScalaDataset, Days(30)).toTypedPipe - - TopUsersSimilarityGraph - .vits(userSource, minActiveFollowers, topK).toIterableExecution.flatMap { vitsIter => - val vits = vitsIter.toSet - println(s"Found ${vits.size} many vits. First few: " + vits.take(5).mkString(",")) - input - .collect { - case rec if vits.contains(rec.userId) => - rec.neighbors.collect { - case n if vits.contains(n.neighborId) => - ( - rec.userId, - n.neighborId, - n.favScoreHalfLife100Days.getOrElse(0), - n.isFollowed.getOrElse(false)) - } - } - .flatten - .writeExecution(TypedTsv(args("outputDir"))) - } - - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/common/BUILD deleted file mode 100644 index cbb6e14c0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/BUILD +++ /dev/null @@ -1,14 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/fasterxml/jackson:jackson-module-scala", - "3rdparty/jvm/com/fasterxml/jackson/core:jackson-core", - "3rdparty/jvm/com/fasterxml/jackson/core:jackson-databind", - "3rdparty/jvm/com/fasterxml/jackson/module:jackson-module-scala", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/simclusters_v2/common", - "strato/src/main/scala/com/twitter/strato/scalding", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/PersistentTweetEmbeddingSource.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/PersistentTweetEmbeddingSource.scala deleted file mode 100644 index 355144aa4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/PersistentTweetEmbeddingSource.scala +++ /dev/null @@ -1,60 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.scalding.DateRange -import com.twitter.simclusters_v2.common.Timestamp -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.strato.scalding.StratoManhattanExportSource -import com.twitter.strato.thrift.ScroogeConvImplicits._ - -object PersistentTweetEmbeddingSource { - // hdfs paths - val FavBasedUpdatedHdfsPath: String = - "/atla/proc/user/cassowary/manhattan-exporter/fav_based_tweet_20m_145k_updated_embeddings" - - val LogFavBasedUpdatedHdfsPath: String = - "/atla/proc/user/cassowary/manhattan-exporter/log_fav_based_tweet_20m_145k_updated_embeddings" - - val LogFavBased2020HdfsPath: String = - "/atla/proc/user/cassowary/manhattan-exporter/log_fav_based_tweet_20m_145k_2020_embeddings" - - // Strato columns - val FavBasedUpdatedStratoColumn: String = - "recommendations/simclusters_v2/embeddings/favBasedTweet20M145KUpdated" - - val LogFavBasedUpdatedStratoColumn: String = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145KUpdatedPersistent" - - val LogFavBased2020StratoColumn: String = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145K2020Persistent" - -} - -/** - * The source that read the Manhattan export persistent embeddings - */ -// Defaults to Updated version. 
-class FavBasedPersistentTweetEmbeddingMhExportSource( - hdfsPath: String = PersistentTweetEmbeddingSource.FavBasedUpdatedHdfsPath, - stratoColumnPath: String = PersistentTweetEmbeddingSource.FavBasedUpdatedStratoColumn, - range: DateRange, - serviceIdentifier: ServiceIdentifier = ServiceIdentifier.empty) - extends StratoManhattanExportSource[(TweetId, Timestamp), PersistentSimClustersEmbedding]( - hdfsPath, - range, - stratoColumnPath, - serviceIdentifier = serviceIdentifier - ) -// Defaults to 2020 version. -class LogFavBasedPersistentTweetEmbeddingMhExportSource( - hdfsPath: String = PersistentTweetEmbeddingSource.LogFavBased2020HdfsPath, - stratoColumnPath: String = PersistentTweetEmbeddingSource.LogFavBased2020StratoColumn, - range: DateRange, - serviceIdentifier: ServiceIdentifier = ServiceIdentifier.empty) - extends StratoManhattanExportSource[(TweetId, Timestamp), PersistentSimClustersEmbedding]( - hdfsPath, - range, - stratoColumnPath, - serviceIdentifier = serviceIdentifier - ) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/QTreeMultiAggregator.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/QTreeMultiAggregator.scala deleted file mode 100644 index 970eb3c8e..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/QTreeMultiAggregator.scala +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common - -import com.twitter.algebird._ - -/** - * The reason of creating this class is that we need multiple percentiles and current - * implementations need one QTree per percentile which is unnecessary. This class gets multiple - * percentiles from the same QTree. - */ -case class QTreeMultiAggregator[T](percentiles: Seq[Double])(implicit val num: Numeric[T]) - extends Aggregator[T, QTree[Unit], Map[String, Double]] - with QTreeAggregatorLike[T] { - - require( - percentiles.forall(p => p >= 0.0 && p <= 1.0), - "The given percentile must be of the form 0 <= p <= 1.0" - ) - - override def percentile: Double = 0.0 // Useless but needed for the base class - - override def k: Int = QTreeAggregator.DefaultK - - private def getPercentile(qt: QTree[Unit], p: Double): Double = { - val (lower, upper) = qt.quantileBounds(p) - (lower + upper) / 2 - } - - def present(qt: QTree[Unit]): Map[String, Double] = - percentiles.map { p => p.toString -> getPercentile(qt, p) }.toMap -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/TypedRichPipe.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/TypedRichPipe.scala deleted file mode 100644 index 6e40ecf80..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/TypedRichPipe.scala +++ /dev/null @@ -1,72 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common - -import com.twitter.algebird._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.{Execution, Stat, UniqueID} - -/** - * A richer version of TypedPipe. 
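 *
 * A hypothetical usage sketch (assumes an implicit UniqueID is in scope, as required by the
 * implicit conversion in the companion object below):
 * {{{
 * import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._
 *
 * val users: TypedPipe[Long] = TypedPipe.from(Seq(1L, 2L, 3L))
 * val counted: TypedPipe[Long] = users.count("num_users") // bumps a counter per record
 * val printed: Execution[Unit] = users.printSummary("users") // total size + sampled records
 * }}}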
- */ -class TypedRichPipe[V](pipe: TypedPipe[V]) { - - def count(counterName: String)(implicit uniqueID: UniqueID): TypedPipe[V] = { - val stat = Stat(counterName) - pipe.map { v => - stat.inc() - v - } - } - - /** - * Print a summary of the TypedPipe with total size and some randomly selected records - */ - def getSummary(numRecords: Int = 100): Execution[Option[(Long, String)]] = { - val randomSample = Aggregator.reservoirSample[V](numRecords) - - // more aggregator can be added here - pipe - .aggregate(randomSample.join(Aggregator.size)) - .map { - case (randomSamples, size) => - val samplesStr = randomSamples - .map { sample => - Util.prettyJsonMapper - .writeValueAsString(sample) - .replaceAll("\n", " ") - } - .mkString("\n\t") - - (size, samplesStr) - } - .toOptionExecution - } - - def getSummaryString(name: String, numRecords: Int = 100): Execution[String] = { - getSummary(numRecords) - .map { - case Some((size, string)) => - s"TypedPipeName: $name \nTotal size: $size. \nSample records: \n$string" - case None => s"TypedPipeName: $name is empty" - } - - } - - /** - * Print a summary of the TypedPipe with total size and some randomly selected records - */ - def printSummary(name: String, numRecords: Int = 100): Execution[Unit] = { - getSummaryString(name, numRecords).map { s => println(s) } - } -} - -object TypedRichPipe extends java.io.Serializable { - import scala.language.implicitConversions - - implicit def typedPipeToRichPipe[V]( - pipe: TypedPipe[V] - )( - implicit uniqueID: UniqueID - ): TypedRichPipe[V] = { - new TypedRichPipe(pipe) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/Util.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/Util.scala deleted file mode 100644 index 0ed3812a0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/Util.scala +++ /dev/null @@ -1,305 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common - -import com.fasterxml.jackson.core.JsonGenerator -import com.fasterxml.jackson.databind.ObjectMapper -import com.fasterxml.jackson.databind.ObjectWriter -import com.fasterxml.jackson.module.scala.DefaultScalaModule -import com.fasterxml.jackson.module.scala.ScalaObjectMapper -import com.twitter.algebird.Aggregator -import com.twitter.algebird.Moments -import com.twitter.algebird.MultiAggregator -import com.twitter.algebird.SetSizeAggregator -import com.twitter.algebird.SketchMap -import com.twitter.algebird.SketchMapParams -import com.twitter.algebird.mutable.PriorityQueueMonoid -import com.twitter.bijection.Injection -import com.twitter.hashing.KeyHasher -import com.twitter.scalding.Execution -import com.twitter.scalding.Stat -import com.twitter.scalding.TypedPipe -import com.twitter.scalding.UniqueID -import java.io.File -import java.io.PrintWriter -import scala.sys.process._ - -object Util { - private val formatter = java.text.NumberFormat.getNumberInstance - - private val jsonMapper = { - val mapper = new ObjectMapper() with ScalaObjectMapper - mapper.registerModule(DefaultScalaModule) - mapper.configure(JsonGenerator.Feature.WRITE_NUMBERS_AS_STRINGS, true) - mapper - } - - val prettyJsonMapper: ObjectWriter = jsonMapper.writerWithDefaultPrettyPrinter() - - def getCustomCounters[T](exec: Execution[T]): Execution[Map[String, Long]] = { - exec.getCounters.map { - case (_, counters) => - counters.toMap.collect { - case (key, value) if key.group == "Scalding Custom" => - key.counter -> value - } - } - } - - def getCustomCountersString[T](exec: Execution[T]): Execution[String] = { - 
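// Renders the "Scalding Custom" counters gathered by getCustomCounters as one
// "name:value" pair per line, e.g. (hypothetical output):
//   Printing all custom counters:
//   num_favs_before_uniq:1,234
//   num_final_favs:1,200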
getCustomCounters(exec).map { map => - val customCounterStrings = map.toList.map { - case (key, value) => - s"$key:${formatter.format(value)}" - } - if (customCounterStrings.nonEmpty) { - "Printing all custom counters:\n" + customCounterStrings.mkString("\n") - } else { - "No custom counters to print" - } - } - } - - // Note ideally this should not allow T that is itself Execution[U] i.e. don't accept - // nested executions - def printCounters[T](exec: Execution[T]): Execution[Unit] = { - getCustomCountersString(exec).map { s => println(s) } - } - - /** - * Print some basic stats of a numeric column. - */ - def printSummaryOfNumericColumn[V]( - input: TypedPipe[V], - columnName: Option[String] = None - )( - implicit num: Numeric[V] - ): Execution[String] = { - lazy val randomSampler = Aggregator.reservoirSample[V](100) - - lazy val percentiles = QTreeMultiAggregator(Seq(0.05, 0.25, 0.50, 0.75, 0.95)) - - lazy val moments = Moments.numericAggregator - - val multiAggregator = MultiAggregator( - Aggregator.size, - percentiles, - Aggregator.max, - Aggregator.min, - Aggregator.numericSum, - moments, - randomSampler - ).andThenPresent { - case (size_, percentiles_, max_, min_, sum_, moments_, samples_) => - percentiles_.mapValues(_.toString) ++ Map( - "size" -> size_.toString, - "max" -> max_.toString, - "min" -> min_.toString, - "sum" -> sum_.toString, - "avg" -> moments_.mean.toString, - "stddev" -> moments_.stddev.toString, - "skewness" -> moments_.skewness.toString, - "samples" -> samples_.mkString(",") - ) - } - - input - .aggregate(multiAggregator) - .toIterableExecution - .map { m => - val summary = - s"Column Name: $columnName\nSummary:\n${Util.prettyJsonMapper.writeValueAsString(m)}" - println(summary) - summary - } - } - - /** - * Output some basic stats of a categorical column. - * - * Note that HeavyHitters only work when the distribution is skewed. 
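 *
 * A minimal usage sketch (hypothetical column of author ids; the caller must supply an
 * Injection to bytes for the column type, e.g. Injection.long2BigEndian for Long):
 * {{{
 * implicit val inj: Injection[Long, Array[Byte]] = Injection.long2BigEndian
 * val authorIds: TypedPipe[Long] = TypedPipe.from(Seq(1L, 1L, 2L))
 * val summary: Execution[String] =
 *   Util.printSummaryOfCategoricalColumn(authorIds, Some("AuthorId"))
 * }}}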
- */ - def printSummaryOfCategoricalColumn[V]( - input: TypedPipe[V], - columnName: Option[String] = None - )( - implicit injection: Injection[V, Array[Byte]] - ): Execution[String] = { - - lazy val randomSampler = Aggregator.reservoirSample[V](100) - - lazy val uniqueCounter = new SetSizeAggregator[V](hllBits = 13, maxSetSize = 1000)(injection) - - lazy val sketchMapParams = - SketchMapParams[V](seed = 1618, eps = 0.001, delta = 0.05, heavyHittersCount = 20)(injection) - - lazy val heavyHitter = - SketchMap.aggregator[V, Long](sketchMapParams).composePrepare[V](v => v -> 1L) - - val multiAggregator = MultiAggregator( - Aggregator.size, - uniqueCounter, - heavyHitter, - randomSampler - ).andThenPresent { - case (size_, uniqueSize_, heavyHitter_, sampler_) => - Map( - "size" -> size_.toString, - "unique" -> uniqueSize_.toString, - "samples" -> sampler_.mkString(","), - "heavyHitter" -> heavyHitter_.heavyHitterKeys - .map { key => - val freq = sketchMapParams.frequency(key, heavyHitter_.valuesTable) - key -> freq - } - .sortBy(-_._2).mkString(",") - ) - } - - input - .aggregate(multiAggregator) - .toIterableExecution - .map { m => - val summary = - s"Column Name: $columnName\nSummary:\n${Util.prettyJsonMapper.writeValueAsString(m)}" - println(summary) - summary - } - } - - val edgeOrdering: Ordering[(Long, Long)] = Ordering.by { - case (fromNodeId, toNodeId) => hashToLong(fromNodeId, toNodeId) - } - - def reservoirSamplerMonoidForPairs[K, V]( - sampleSize: Int - )( - implicit ord: Ordering[K] - ): PriorityQueueMonoid[(K, V)] = { - implicit val fullOrdering: Ordering[(K, V)] = Ordering.by(_._1) - new PriorityQueueMonoid[(K, V)](sampleSize) - } - - def reservoirSamplerMonoid[T, U]( - sampleSize: Int, - convert: T => U - )( - implicit ord: Ordering[U] - ): PriorityQueueMonoid[T] = { - new PriorityQueueMonoid[T](sampleSize)(Ordering.by(convert)) - } - - def hashToLong(a: Long, b: Long): Long = { - val bb = java.nio.ByteBuffer.allocate(16) - bb.putLong(a) - bb.putLong(b) - KeyHasher.KETAMA.hashKey(bb.array()) - } - - def hashToLong(a: Long): Long = { - val bb = java.nio.ByteBuffer.allocate(8) - bb.putLong(a) - KeyHasher.KETAMA.hashKey(bb.array()) - } - - // https://en.wikipedia.org/wiki/Pearson_correlation_coefficient - def computeCorrelation(pairedIter: Iterator[(Double, Double)]): Double = { - val (len, xSum, ySum, x2Sum, y2Sum, xySum) = - pairedIter.foldLeft((0.0, 0.0, 0.0, 0.0, 0.0, 0.0)) { - case ((l, xs, ys, x2s, y2s, xys), (x, y)) => - (l + 1, xs + x, ys + y, x2s + x * x, y2s + y * y, xys + x * y) - } - val den = math.sqrt(len * x2Sum - xSum * xSum) * math.sqrt(len * y2Sum - ySum * ySum) - if (den > 0) { - (len * xySum - xSum * ySum) / den - } else 0.0 - } - - // https://en.wikipedia.org/wiki/Cosine_similarity - def cosineSimilarity(pairedIter: Iterator[(Double, Double)]): Double = { - val (xySum, x2Sum, y2Sum) = pairedIter.foldLeft(0.0, 0.0, 0.0) { - case ((xy, x2, y2), (x, y)) => - (xy + x * y, x2 + x * x, y2 + y * y) - } - val den = math.sqrt(x2Sum) * math.sqrt(y2Sum) - if (den > 0) { - xySum / den - } else 0.0 - } - - case class Distribution( - avg: Double, - stdDev: Double, - p1: Double, - p10: Double, - p50: Double, - p90: Double, - p99: Double) - - val emptyDist: Distribution = Distribution(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) - - def distributionFromArray(l: Array[Double]): Distribution = { - val s = l.sorted - val len = l.length - - if (len < 1) { - emptyDist - } else { - def pctToIndex(p: Double): Int = { - val idx = math.round(l.length * p).toInt - if (idx < 0) { - 0 - } else if 
(idx >= len) { - len - 1 - } else { - idx - } - } - - val (sum, sumSquared) = l.foldLeft((0.0, 0.0)) { - case ((curSum, curSumSquared), x) => - (curSum + x, curSumSquared + x * x) - } - - val avg = sum / len - val stdDev = math.sqrt(sumSquared / len - avg * avg) - Distribution( - avg, - stdDev, - p1 = s(pctToIndex(0.01)), - p10 = s(pctToIndex(0.1)), - p50 = s(pctToIndex(0.5)), - p90 = s(pctToIndex(0.9)), - p99 = s(pctToIndex(0.99))) - } - } - - // Calculate cumulative frequency using Scalding Custom Counters. - // Increment all buckets by 1 where value <= bucket_threshold. - case class CumulativeStat( - key: String, - buckets: Seq[Double] - )( - implicit uniqueID: UniqueID) { - - val counters = buckets.map { bucket => - bucket -> Stat(key + "_<=" + bucket.toString) - } - - def incForValue(value: Double): Unit = { - counters.foreach { - case (bucket, stat) => - if (value <= bucket) stat.inc() - } - } - } - - def sendEmail(text: String, subject: String, toAddress: String): String = { - val file = File.createTempFile("somePrefix_", "_someSuffix") - println(s"Email body is at ${file.getPath}") - val writer = new PrintWriter(file) - writer.write(text) - writer.close() - - val mailCmd = s"cat ${file.getPath}" #| Seq("mail", "-s", subject, toAddress) - mailCmd.!! - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/BUILD deleted file mode 100644 index 962a53de0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/BUILD +++ /dev/null @@ -1,8 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/DenseRowMatrix.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/DenseRowMatrix.scala deleted file mode 100644 index eb72e8708..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/DenseRowMatrix.scala +++ /dev/null @@ -1,73 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common.matrix - -import com.twitter.algebird.{ArrayMonoid, BloomFilterMonoid, Monoid, Semigroup} -import com.twitter.algebird.Semigroup._ -import com.twitter.bijection.Injection -import com.twitter.scalding.{TypedPipe, ValuePipe} - -/** - * A class that represents a row-indexed dense matrix, backed by a TypedPipe[(R, Array[Double])]. - * For each row of the TypedPipe, we save an array of values. - * Only use this class when the number of columns is small (say, <100K). 
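 *
 * A small construction sketch (hypothetical user-id rows with 3 columns):
 * {{{
 * implicit val inj: Injection[Long, Array[Byte]] = Injection.long2BigEndian
 * val m = DenseRowMatrix[Long](TypedPipe.from(Seq(
 *   1L -> Array(1.0, 0.0, 2.0),
 *   2L -> Array(0.0, 3.0, 0.0))))
 * val normalized = m.rowL2Normalize // each row rescaled to unit L2 norm
 * val sparse = m.toSparseMatrix     // keeps only the non-zero entries
 * }}}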
- * - * @param pipe underlying pipe - * @param rowOrd ordering function for row type - * @param rowInj injection function for the row type - * @tparam R Type for rows - */ -case class DenseRowMatrix[R]( - pipe: TypedPipe[(R, Array[Double])], -)( - implicit val rowOrd: Ordering[R], - val rowInj: Injection[R, Array[Byte]]) { - - lazy val semigroupArrayV: Semigroup[Array[Double]] = new ArrayMonoid[Double]() - - // convert to a SparseMatrix - lazy val toSparseMatrix: SparseMatrix[R, Int, Double] = { - this.toSparseRowMatrix.toSparseMatrix - } - - // convert to a SparseRowMatrix - lazy val toSparseRowMatrix: SparseRowMatrix[R, Int, Double] = { - SparseRowMatrix( - this.pipe.map { - case (i, values) => - (i, values.zipWithIndex.collect { case (value, j) if value != 0.0 => (j, value) }.toMap) - }, - isSkinnyMatrix = true) - } - - // convert to a TypedPipe - lazy val toTypedPipe: TypedPipe[(R, Array[Double])] = { - this.pipe - } - - // filter the matrix based on a subset of rows - def filterRows(rows: TypedPipe[R]): DenseRowMatrix[R] = { - DenseRowMatrix(this.pipe.join(rows.asKeys).mapValues(_._1)) - } - - // get the l2 norms for all rows. this does not trigger a shuffle. - lazy val rowL2Norms: TypedPipe[(R, Double)] = { - this.pipe.map { - case (row, values) => - row -> math.sqrt(values.map(a => a * a).sum) - } - } - - // normalize the matrix to make sure each row has unit norm - lazy val rowL2Normalize: DenseRowMatrix[R] = { - - DenseRowMatrix(this.pipe.map { - case (row, values) => - val norm = math.sqrt(values.map(v => v * v).sum) - if (norm == 0.0) { - row -> values - } else { - row -> values.map(v => v / norm) - } - }) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseMatrix.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseMatrix.scala deleted file mode 100644 index 55514c350..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseMatrix.scala +++ /dev/null @@ -1,423 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common.matrix - -import com.twitter.algebird.Semigroup -import com.twitter.bijection.Injection -import com.twitter.scalding.{TypedPipe, ValuePipe} - -/** - * A case class that represents a sparse matrix backed by a TypedPipe[(R, C, V)]. - * - * We assume the input does not have more than one value per (row, col), and all the input values - * are non-zero. - * - * We do not except the input pipe are indexed from 0 to numRows or numCols. - * The input can be any type (for example, userId/TweetId/Hashtag). - * We do not convert them to indices, but just use the input as a key to represent the rowId/colId. - * - * Example: - * - * val a = SparseMatrix(TypedPipe.from(Seq((1,1,1.0), (2,2,2.0), (3,3,3.0)))) - * - * val b = a.rowL2Normalize // get a new matrix that has unit-norm each row. - * - * val c = a.multiplySparseMatrix(b) // multiply another matrix - * - * val d = a.transpose // transpose the matrix - * - * @param pipe underlying pipe. We assume the input does not have more than one value per (row, col), - * and all the values are non-zero. 
- * @param rowOrd ordering function for row type - * @param colOrd ordering function for col type - * @param numericV numeric operations for value type - * @param semigroupV semigroup for the value type - * @param rowInj injection function for the row type - * @param colInj injection function for the col type - * @tparam R Type for rows - * @tparam C Type for columns - * @tparam V Type for elements of the matrix - */ -case class SparseMatrix[R, C, V]( - pipe: TypedPipe[(R, C, V)] -)( - implicit override val rowOrd: Ordering[R], - override val colOrd: Ordering[C], - override val numericV: Numeric[V], - override val semigroupV: Semigroup[V], - override val rowInj: Injection[R, Array[Byte]], - override val colInj: Injection[C, Array[Byte]]) - extends TypedPipeMatrix[R, C, V] { - - // number of non-zero values in the matrix - override lazy val nnz: ValuePipe[Long] = { - this.filter((_, _, v) => v != numericV.zero).pipe.map(_ => 1L).sum - } - - // number of non-zero values in each row - lazy val rowNnz: TypedPipe[(R, Long)] = { - this.pipe.collect { - case (row, _, v) if v != numericV.zero => - row -> 1L - }.sumByKey - } - - // get the num of non-zero values for each col. - lazy val colNnz: TypedPipe[(C, Long)] = { - this.transpose.rowNnz - } - - override lazy val uniqueRowIds: TypedPipe[R] = { - this.pipe.map(t => t._1).distinct - } - - override lazy val uniqueColIds: TypedPipe[C] = { - this.pipe.map(t => t._2).distinct - } - - override def getRow(rowId: R): TypedPipe[(C, V)] = { - this.pipe.collect { - case (i, j, value) if i == rowId => - j -> value - } - } - - override def getCol(colId: C): TypedPipe[(R, V)] = { - this.pipe.collect { - case (i, j, value) if j == colId => - i -> value - } - } - - override def get(rowId: R, colId: C): ValuePipe[V] = { - this.pipe.collect { - case (i, j, value) if i == rowId && j == colId => - value - }.sum // this assumes the matrix does not have any duplicates - } - - // filter the matrix based on (row, col, value) - def filter(fn: (R, C, V) => Boolean): SparseMatrix[R, C, V] = { - SparseMatrix(this.pipe.filter { - case (row, col, value) => fn(row, col, value) - }) - } - - // filter the matrix based on a subset of rows - def filterRows(rows: TypedPipe[R]): SparseMatrix[R, C, V] = { - SparseMatrix(this.rowAsKeys.join(rows.asKeys).map { - case (row, ((col, value), _)) => (row, col, value) - }) - } - - // filter the matrix based on a subset of cols - def filterCols(cols: TypedPipe[C]): SparseMatrix[R, C, V] = { - this.transpose.filterRows(cols).transpose - } - - // convert the triplet (row, col, value) to a new (row1, col1, value1) - def tripleApply[R1, C1, V1]( - fn: (R, C, V) => (R1, C1, V1) - )( - implicit rowOrd1: Ordering[R1], - colOrd1: Ordering[C1], - numericV1: Numeric[V1], - semigroupV1: Semigroup[V1], - rowInj: Injection[R1, Array[Byte]], - colInj: Injection[C1, Array[Byte]] - ): SparseMatrix[R1, C1, V1] = { - SparseMatrix(this.pipe.map { - case (row, col, value) => fn(row, col, value) - }) - } - - // get the l1 norms for all rows - lazy val rowL1Norms: TypedPipe[(R, Double)] = { - this.pipe.map { - case (row, _, value) => - row -> numericV.toDouble(value).abs - }.sumByKey - } - - // get the l2 norms for all rows - lazy val rowL2Norms: TypedPipe[(R, Double)] = { - this.pipe - .map { - case (row, _, value) => - row -> numericV.toDouble(value) * numericV.toDouble(value) - } - .sumByKey - .mapValues(math.sqrt) - } - - // normalize the matrix to make sure each row has unit norm - lazy val rowL2Normalize: SparseMatrix[R, C, Double] = { - val result = 
this.rowAsKeys - .join(this.rowL2Norms) - .collect { - case (row, ((col, value), l2norm)) if l2norm > 0.0 => - (row, col, numericV.toDouble(value) / l2norm) - } - - SparseMatrix(result) - } - - // get the l2 norms for all cols - lazy val colL2Norms: TypedPipe[(C, Double)] = { - this.transpose.rowL2Norms - } - - // normalize the matrix to make sure each column has unit norm - lazy val colL2Normalize: SparseMatrix[R, C, Double] = { - this.transpose.rowL2Normalize.transpose - } - - /** - * Take topK non-zero elements from each row. Cols are ordered by the `ordering` function - */ - def sortWithTakePerRow(k: Int)(ordering: Ordering[(C, V)]): TypedPipe[(R, Seq[(C, V)])] = { - this.rowAsKeys.group.sortedTake(k)(ordering) - } - - /** - * Take topK non-zero elements from each column. Rows are ordered by the `ordering` function. - * - */ - def sortWithTakePerCol(k: Int)(ordering: Ordering[(R, V)]): TypedPipe[(C, Seq[(R, V)])] = { - this.transpose.sortWithTakePerRow(k)(ordering) - } - - /** - * Multiply another SparseMatrix. The only requirement is that the col type of current matrix should - * be same with the row type of the other matrix. - * - * @param sparseMatrix another matrix to multiply - * @param numReducersOpt optional parameter to set number of reducers. It uses 1000 by default. - * you can change it based on your applications. - * @param ordering2 ordering function for the column type of another matrix - * @param injection2 injection function for the column type of another matrix - * @tparam C2 col type of another matrix - * - * @return - */ - def multiplySparseMatrix[C2]( - sparseMatrix: SparseMatrix[C, C2, V], - numReducersOpt: Option[Int] = None - )( - implicit ordering2: Ordering[C2], - injection2: Injection[C2, Array[Byte]] - ): SparseMatrix[R, C2, V] = { - implicit val colInjectionFunction: C => Array[Byte] = colInj.toFunction - - val result = - // 1000 is the reducer number used for sketchJoin; 1000 is a number that works well empirically. - // feel free to change this or make this as a param if you find this does not work for your case. - this.transpose.rowAsKeys - .sketch(numReducersOpt.getOrElse(1000)) - .join(sparseMatrix.rowAsKeys) - .map { - case (_, ((row1, value1), (col2, value2))) => - (row1, col2) -> numericV.times(value1, value2) - } - .sumByKey - .map { - case ((row, col), value) => - (row, col, value) - } - - SparseMatrix(result) - } - - /** - * Multiply a SparseRowMatrix. The implementation of this function assume the input SparseRowMatrix - * is a skinny matrix, i.e., with a small number of unique columns. Based on our experience, you can - * think 100K is a small number here. - * - * @param skinnyMatrix another matrix to multiply - * @param numReducersOpt optional parameter to set number of reducers. It uses 1000 by default. - * you can change it based on your applications. 
- * @param ordering2 ordering function for the column type of another matrix - * @param injection2 injection function for the column type of another matrix - * @tparam C2 col type of another matrix - * - * @return - */ - def multiplySkinnySparseRowMatrix[C2]( - skinnyMatrix: SparseRowMatrix[C, C2, V], - numReducersOpt: Option[Int] = None - )( - implicit ordering2: Ordering[C2], - injection2: Injection[C2, Array[Byte]] - ): SparseRowMatrix[R, C2, V] = { - - assert( - skinnyMatrix.isSkinnyMatrix, - "this function only works for skinny sparse row matrix, otherwise you will get out-of-memory problem") - - implicit val colInjectionFunction: C => Array[Byte] = colInj.toFunction - - val result = - // 1000 is the reducer number used for sketchJoin; 1000 is a number that works well empirically. - // feel free to change this or make this as a param if you find this does not work for your case. - this.transpose.rowAsKeys - .sketch(numReducersOpt.getOrElse(1000)) - .join(skinnyMatrix.pipe) - .map { - case (_, ((row1, value1), colMap)) => - row1 -> colMap.mapValues(v => numericV.times(value1, v)) - } - .sumByKey - - SparseRowMatrix(result, skinnyMatrix.isSkinnyMatrix) - } - - /*** - * Multiply a DenseRowMatrix. The result will be also a DenseRowMatrix. - * - * @param denseRowMatrix matrix to multiply - * @param numReducersOpt optional parameter to set number of reducers. It uses 1000 by default. - * you can change it based on your applications - * @return - */ - def multiplyDenseRowMatrix( - denseRowMatrix: DenseRowMatrix[C], - numReducersOpt: Option[Int] = None - ): DenseRowMatrix[R] = { - - implicit val colInjectionFunction: C => Array[Byte] = colInj.toFunction - implicit val arrayVSemiGroup: Semigroup[Array[Double]] = denseRowMatrix.semigroupArrayV - - val result = - // 1000 is the reducer number used for sketchJoin; 1000 is a number that works well empirically. - // feel free to change this or make this as a param if you find this does not work for your case. - this.transpose.rowAsKeys - .sketch(numReducersOpt.getOrElse(1000)) - .join(denseRowMatrix.pipe) - .map { - case (_, ((row1, value1), array)) => - row1 -> array.map(v => numericV.toDouble(value1) * v) - } - .sumByKey - - DenseRowMatrix(result) - } - - // Transpose the matrix. - lazy val transpose: SparseMatrix[C, R, V] = { - SparseMatrix( - this.pipe - .map { - case (row, col, value) => - (col, row, value) - }) - } - - // Create a Key-Val TypedPipe for .join() and other use cases. - lazy val rowAsKeys: TypedPipe[(R, (C, V))] = { - this.pipe - .map { - case (row, col, value) => - (row, (col, value)) - } - } - - // convert to a TypedPipe - lazy val toTypedPipe: TypedPipe[(R, C, V)] = { - this.pipe - } - - lazy val forceToDisk: SparseMatrix[R, C, V] = { - SparseMatrix(this.pipe.forceToDisk) - } - - /** - * Convert the matrix to a SparseRowMatrix. Do this only when the max number of non-zero values per row is - * small (say, not more than 200K). - * - * @isSkinnyMatrix is the resulted matrix skinny, i.e., number of unique colIds is small (<200K). - * Note the difference between `number of unique colIds` and `max number of non-zero values per row`. - * @return - */ - def toSparseRowMatrix(isSkinnyMatrix: Boolean = false): SparseRowMatrix[R, C, V] = { - SparseRowMatrix( - this.pipe.map { - case (i, j, v) => - i -> Map(j -> v) - }.sumByKey, - isSkinnyMatrix) - } - - /** - * Convert the matrix to a DenseRowMatrix - * - * @param numCols the number of columns in the DenseRowMatrix. 
- * @param colToIndexFunction the function to convert colId to the column index in the dense matrix - * @return - */ - def toDenseRowMatrix(numCols: Int, colToIndexFunction: C => Int): DenseRowMatrix[R] = { - this.toSparseRowMatrix(isSkinnyMatrix = true).toDenseRowMatrix(numCols, colToIndexFunction) - } - - /** - * Determines whether we should return a given Iterator given a threshold for the sum of values - * across a row and whether we are looking to stay under or above that value. - * Note that Iterators are mutable/destructive, and even calling .size on it will 'use it up' - * i.e. it no longer hasNext and we no longer have any reference to the head of the collection. - * - * @param columnValueIterator Iterator over column-value pairs. - * @param threshold The threshold for the sum of values - * @param ifMin True if we want to stay at least above that given value - * @return A new SparseMatrix after we have filtered the ineligible rows - */ - private[this] def filterIter( - columnValueIterator: Iterator[(C, V)], - threshold: V, - ifMin: Boolean - ): Iterator[(C, V)] = { - var sum: V = numericV.zero - var it: Iterator[(C, V)] = Iterator.empty - var exceeded = false - while (columnValueIterator.hasNext && !exceeded) { - val (c, v) = columnValueIterator.next - val nextSum = semigroupV.plus(sum, v) - val cmp = numericV.compare(nextSum, threshold) - if ((ifMin && cmp < 0) || (!ifMin && cmp <= 0)) { - it = it ++ Iterator((c, v)) - sum = nextSum - } else { - it = it ++ Iterator((c, v)) - exceeded = true - } - } - (ifMin, exceeded) match { - case (true, true) => it ++ columnValueIterator - case (true, false) => Iterator.empty - case (false, true) => Iterator.empty - case (false, false) => it ++ columnValueIterator - } - } - - /** - * removes entries whose sum over rows do not meet the minimum sum (minSum) - * @param minSum minimum sum for which we want to enforce across all rows - */ - def filterRowsByMinSum(minSum: V): SparseMatrix[R, C, V] = { - val filteredPipe = this.rowAsKeys.group - .mapValueStream(filterIter(_, threshold = minSum, ifMin = true)).map { - case (r, (c, v)) => - (r, c, v) - } - SparseMatrix(filteredPipe) - } - - /** - * removes entries whose sum over rows exceed the maximum sum (maxSum) - * @param maxSum maximum sum for which we want to enforce across all rows - */ - def filterRowsByMaxSum(maxSum: V): SparseMatrix[R, C, V] = { - val filteredPipe = this.rowAsKeys.group - .mapValueStream(filterIter(_, threshold = maxSum, ifMin = false)).map { - case (r, (c, v)) => - (r, c, v) - } - SparseMatrix(filteredPipe) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseRowMatrix.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseRowMatrix.scala deleted file mode 100644 index 767c8f588..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/SparseRowMatrix.scala +++ /dev/null @@ -1,366 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common.matrix - -import com.twitter.algebird.Semigroup -import com.twitter.bijection.Injection -import com.twitter.scalding.TypedPipe -import com.twitter.scalding.ValuePipe -import org.apache.avro.SchemaBuilder.ArrayBuilder -import scala.util.Random - -/** - * A class that represents a row-indexed matrix, backed by a TypedPipe[(R, Map(C, V)]. - * For each row of the TypedPipe, we save the rowId and a map consisting of colIds and their values. - * Only use this class when the max number of non-zero values per row is small (say, <100K). 
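A minimal hypothetical sketch (thresholds made up) of the row-sum filters defined above in `SparseMatrix`: a row survives only if its values sum to at least the minimum and do not exceed the maximum.

```scala
import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix

// Sketch: band-pass filter on the total weight of each row (e.g. per-user engagement volume).
def bandPassRows(userItemMatrix: SparseMatrix[Long, Long, Double]): SparseMatrix[Long, Long, Double] =
  userItemMatrix
    .filterRowsByMinSum(5.0)     // drop rows whose values sum to less than 5.0
    .filterRowsByMaxSum(10000.0) // drop rows whose values sum to more than 10000.0
```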
- * - * Compared to SparseMatrix, this class has some optimizations to efficiently perform some row-wise - * operations. - * - * Also, if the matrix is skinny (i.e., number of unique colIds is small), we have optimized solutions - * for col-wise normalization as well as matrix multiplication (see SparseMatrix.multiplySkinnySparseRowMatrix). - * - * @param pipe underlying pipe - * @param isSkinnyMatrix if the matrix is skinny (i.e., number of unique colIds is small) - * Note the difference between `number of unique colIds` and `max number of non-zero values per row`. - * @param rowOrd ordering function for row type - * @param colOrd ordering function for col type - * @param numericV numeric operations for value type - * @param semigroupV semigroup for the value type - * @param rowInj injection function for the row type - * @param colInj injection function for the col type - * @tparam R Type for rows - * @tparam C Type for columns - * @tparam V Type for elements of the matrix - */ -case class SparseRowMatrix[R, C, V]( - pipe: TypedPipe[(R, Map[C, V])], - isSkinnyMatrix: Boolean -)( - implicit override val rowOrd: Ordering[R], - override val colOrd: Ordering[C], - override val numericV: Numeric[V], - override val semigroupV: Semigroup[V], - override val rowInj: Injection[R, Array[Byte]], - override val colInj: Injection[C, Array[Byte]]) - extends TypedPipeMatrix[R, C, V] { - - // number of non-zero values in the matrix - override lazy val nnz: ValuePipe[Long] = { - this - .filter((_, _, v) => v != numericV.zero) - .pipe - .values - .map(_.size.toLong) - .sum - } - - override def get(rowId: R, colId: C): ValuePipe[V] = { - this.pipe - .collect { - case (i, values) if i == rowId => - values.collect { - case (j, value) if j == colId => value - } - } - .flatten - .sum - } - - override def getRow(rowId: R): TypedPipe[(C, V)] = { - this.pipe.flatMap { - case (i, values) if i == rowId => - values.toSeq - case _ => - Nil - } - } - - override def getCol(colId: C): TypedPipe[(R, V)] = { - this.pipe.flatMap { - case (i, values) => - values.collect { - case (j, value) if j == colId => - i -> value - } - } - } - - override lazy val uniqueRowIds: TypedPipe[R] = { - this.pipe.map(_._1).distinct - } - - override lazy val uniqueColIds: TypedPipe[C] = { - this.pipe.flatMapValues(_.keys).values.distinct - } - - // convert to a SparseMatrix - lazy val toSparseMatrix: SparseMatrix[R, C, V] = { - SparseMatrix(this.pipe.flatMap { - case (i, values) => - values.map { case (j, value) => (i, j, value) } - }) - } - - // convert to a TypedPipe - lazy val toTypedPipe: TypedPipe[(R, Map[C, V])] = { - this.pipe - } - - def filter(fn: (R, C, V) => Boolean): SparseRowMatrix[R, C, V] = { - SparseRowMatrix( - this.pipe - .map { - case (i, values) => - i -> values.filter { case (j, v) => fn(i, j, v) } - } - .filter(_._2.nonEmpty), - isSkinnyMatrix = this.isSkinnyMatrix - ) - } - - // sample the rows in the matrix as defined by samplingRatio - def sampleRows(samplingRatio: Double): SparseRowMatrix[R, C, V] = { - SparseRowMatrix(this.pipe.filter(_ => Random.nextDouble < samplingRatio), this.isSkinnyMatrix) - } - - // filter the matrix based on a subset of rows - def filterRows(rows: TypedPipe[R]): SparseRowMatrix[R, C, V] = { - SparseRowMatrix(this.pipe.join(rows.asKeys).mapValues(_._1), this.isSkinnyMatrix) - } - - // filter the matrix based on a subset of cols - def filterCols(cols: TypedPipe[C]): SparseRowMatrix[R, C, V] = { - this.toSparseMatrix.filterCols(cols).toSparseRowMatrix(this.isSkinnyMatrix) - } - - // convert the 
triplet (row, col, value) to a new (row1, col1, value1) - def tripleApply[R1, C1, V1]( - fn: (R, C, V) => (R1, C1, V1) - )( - implicit rowOrd1: Ordering[R1], - colOrd1: Ordering[C1], - numericV1: Numeric[V1], - semigroupV1: Semigroup[V1], - rowInj: Injection[R1, Array[Byte]], - colInj: Injection[C1, Array[Byte]] - ): SparseRowMatrix[R1, C1, V1] = { - SparseRowMatrix( - this.pipe.flatMap { - case (i, values) => - values - .map { - case (j, v) => fn(i, j, v) - } - .groupBy(_._1) - .mapValues { _.map { case (_, j1, v1) => (j1, v1) }.toMap } - }, - isSkinnyMatrix = this.isSkinnyMatrix - ) - } - - // get the l2 norms for all rows. this does not trigger a shuffle. - lazy val rowL2Norms: TypedPipe[(R, Double)] = { - this.pipe.map { - case (row, values) => - row -> math.sqrt( - values.values - .map(a => numericV.toDouble(a) * numericV.toDouble(a)) - .sum) - } - } - - // normalize the matrix to make sure each row has unit norm - lazy val rowL2Normalize: SparseRowMatrix[R, C, Double] = { - val result = this.pipe.flatMap { - case (row, values) => - val norm = - math.sqrt( - values.values - .map(v => numericV.toDouble(v) * numericV.toDouble(v)) - .sum) - if (norm == 0.0) { - None - } else { - Some(row -> values.mapValues(v => numericV.toDouble(v) / norm)) - } - } - - SparseRowMatrix(result, isSkinnyMatrix = this.isSkinnyMatrix) - } - - // get the l2 norms for all cols - lazy val colL2Norms: TypedPipe[(C, Double)] = { - this.pipe - .flatMap { - case (_, values) => - values.map { - case (col, v) => - col -> numericV.toDouble(v) * numericV.toDouble(v) - } - } - .sumByKey - .mapValues(math.sqrt) - } - - // normalize the matrix to make sure each column has unit norm - lazy val colL2Normalize: SparseRowMatrix[R, C, Double] = { - val result = if (this.isSkinnyMatrix) { - // if this is a skinny matrix, we first put the norm of all columns into a Map, and then use - // this Map inside the mappers without shuffling the whole matrix (which is expensive, see the - // `else` part of this function). - val colL2NormsValuePipe = this.colL2Norms.map { - case (col, norm) => Map(col -> norm) - }.sum - - this.pipe.flatMapWithValue(colL2NormsValuePipe) { - case ((row, values), Some(colNorms)) => - Some(row -> values.flatMap { - case (col, value) => - val colNorm = colNorms.getOrElse(col, 0.0) - if (colNorm == 0.0) { - None - } else { - Some(col -> numericV.toDouble(value) / colNorm) - } - }) - case _ => - None - } - } else { - this.toSparseMatrix.transpose.rowAsKeys - .join(this.colL2Norms) - .collect { - case (col, ((row, value), colNorm)) if colNorm > 0.0 => - row -> Map(col -> numericV.toDouble(value) / colNorm) - } - .sumByKey - .toTypedPipe - } - - SparseRowMatrix(result, isSkinnyMatrix = this.isSkinnyMatrix) - } - - /** - * Take topK non-zero elements from each row. Cols are ordered by the `ordering` function - */ - def sortWithTakePerRow( - k: Int - )( - ordering: Ordering[(C, V)] - ): TypedPipe[(R, Seq[(C, V)])] = { - this.pipe.map { - case (row, values) => - row -> values.toSeq.sorted(ordering).take(k) - } - } - - /** - * Take topK non-zero elements from each column. Rows are ordered by the `ordering` function. - */ - def sortWithTakePerCol( - k: Int - )( - ordering: Ordering[(R, V)] - ): TypedPipe[(C, Seq[(R, V)])] = { - this.toSparseMatrix.sortWithTakePerCol(k)(ordering) - } - - /** - * Similar to .forceToDisk function in TypedPipe, but with an option to specify how many partitions - * to save, which is useful if you want to consolidate the data set or want to tune the number - * of mappers for the next step. 
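For the row- and column-normalization helpers above, a hypothetical sketch (not from the original file) that builds a skinny user-by-cluster `SparseRowMatrix` from an assumed input pipe, column-normalizes it, and keeps the top clusters per user. The `Ordering` / `Numeric` / `Semigroup` / `Injection` implicits for `Long`, `Int` and `Double` are assumed to be in scope.

```scala
import com.twitter.scalding.TypedPipe
import com.twitter.simclusters_v2.scalding.common.matrix.SparseRowMatrix

// Sketch: userClusterScores is an assumed (userId -> clusterId -> score) pipe with few unique clusters.
def topClustersPerUser(
  userClusterScores: TypedPipe[(Long, Map[Int, Double])],
  topK: Int
): TypedPipe[(Long, Seq[(Int, Double)])] = {
  val matrix = SparseRowMatrix(userClusterScores, isSkinnyMatrix = true)
  matrix.colL2Normalize // skinny matrices use the map-side normalization path above
    .sortWithTakePerRow(topK)(Ordering.by[(Int, Double), Double](-_._2))
}
```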
- * - * @param numShardsOpt number of shards to save the data. - * - * @return - */ - def forceToDisk( - numShardsOpt: Option[Int] = None - ): SparseRowMatrix[R, C, V] = { - numShardsOpt - .map { numShards => - SparseRowMatrix(this.pipe.shard(numShards), this.isSkinnyMatrix) - } - .getOrElse { - SparseRowMatrix(this.pipe.forceToDisk, this.isSkinnyMatrix) - } - } - - /** - * transpose current matrix and multiple another Skinny SparseRowMatrix. - * The difference between this and .transpose.multiplySkinnySparseRowMatrix(anotherSparseRowMatrix), - * is that we do not need to do flatten and group again. - * - * One use case is to when we need to compute the column-wise covariance matrix, then we only need - * a.transposeAndMultiplySkinnySparseRowMatrix(a) to get it. - * - * @param anotherSparseRowMatrix it needs to be a skinny SparseRowMatrix - * @numReducersOpt Number of reducers. - */ - def transposeAndMultiplySkinnySparseRowMatrix[C2]( - anotherSparseRowMatrix: SparseRowMatrix[R, C2, V], - numReducersOpt: Option[Int] = None - )( - implicit ordering2: Ordering[C2], - injection2: Injection[C2, Array[Byte]] - ): SparseRowMatrix[C, C2, V] = { - - // it needs to be a skinny SparseRowMatrix, otherwise we will have out-of-memory issue - require(anotherSparseRowMatrix.isSkinnyMatrix) - - SparseRowMatrix( - numReducersOpt - .map { numReducers => - this.pipe - .join(anotherSparseRowMatrix.pipe).withReducers(numReducers) - }.getOrElse(this.pipe - .join(anotherSparseRowMatrix.pipe)) - .flatMap { - case (_, (row1, row2)) => - row1.map { - case (col1, val1) => - col1 -> row2.mapValues(val2 => numericV.times(val1, val2)) - } - } - .sumByKey, - isSkinnyMatrix = true - ) - - } - - /*** - * Multiply a DenseRowMatrix. The result will be also a DenseRowMatrix. - * - * @param denseRowMatrix matrix to multiply - * @param numReducersOpt optional parameter to set number of reducers. It uses 1000 by default. - * you can change it based on your applications - * @return - */ - def multiplyDenseRowMatrix( - denseRowMatrix: DenseRowMatrix[C], - numReducersOpt: Option[Int] = None - ): DenseRowMatrix[R] = { - this.toSparseMatrix.multiplyDenseRowMatrix(denseRowMatrix, numReducersOpt) - } - - /** - * Convert the matrix to a DenseRowMatrix - * - * @param numCols the number of columns in the DenseRowMatrix. 
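The column-covariance use case mentioned in the comment above, as a hypothetical sketch: for a skinny user-by-cluster matrix `a`, transposing and multiplying by itself yields the cluster-by-cluster Gram (uncentered covariance) matrix.

```scala
import com.twitter.simclusters_v2.scalding.common.matrix.SparseRowMatrix

// Sketch: a must be built with isSkinnyMatrix = true, otherwise the require() above fails.
def clusterGram(a: SparseRowMatrix[Long, Int, Double]): SparseRowMatrix[Int, Int, Double] =
  a.transposeAndMultiplySkinnySparseRowMatrix(a, numReducersOpt = Some(2000))
```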
- * @param colToIndexFunction the function to convert colId to the column index in the dense matrix - * @return - */ - def toDenseRowMatrix(numCols: Int, colToIndexFunction: C => Int): DenseRowMatrix[R] = { - DenseRowMatrix(this.pipe.map { - case (row, colMap) => - val array = new Array[Double](numCols) - colMap.foreach { - case (col, value) => - val index = colToIndexFunction(col) - assert(index < numCols && index >= 0, "The converted index is out of range!") - array(index) = numericV.toDouble(value) - } - row -> array - }) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/TypedPipeMatrix.scala b/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/TypedPipeMatrix.scala deleted file mode 100644 index 24e3fb3ad..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/common/matrix/TypedPipeMatrix.scala +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.simclusters_v2.scalding.common.matrix - -import com.twitter.algebird.{Aggregator, Semigroup} -import com.twitter.bijection.Injection -import com.twitter.scalding.{TypedPipe, ValuePipe} - -/** - * A matrix trait for representing a matrix backed by TypedPipe - * - * @tparam R Type for rows - * @tparam C Type for columns - * @tparam V Type for elements of the matrix - */ -abstract class TypedPipeMatrix[R, C, @specialized(Double, Int, Float, Long, Short) V] { - implicit val semigroupV: Semigroup[V] - implicit val numericV: Numeric[V] - implicit val rowOrd: Ordering[R] - implicit val colOrd: Ordering[C] - implicit val rowInj: Injection[R, Array[Byte]] - implicit val colInj: Injection[C, Array[Byte]] - - // num of non-zero elements in the matrix - val nnz: ValuePipe[Long] - - // list of unique rowIds in the matrix - val uniqueRowIds: TypedPipe[R] - - // list of unique unique in the matrix - val uniqueColIds: TypedPipe[C] - - // get a specific row of the matrix - def getRow(rowId: R): TypedPipe[(C, V)] - - // get a specific column of the matrix - def getCol(colId: C): TypedPipe[(R, V)] - - // get the value of an element - def get(rowId: R, colId: C): ValuePipe[V] - - // number of unique rowIds - lazy val numUniqueRows: ValuePipe[Long] = { - this.uniqueRowIds.aggregate(Aggregator.size) - } - - // number of unique unique - lazy val numUniqueCols: ValuePipe[Long] = { - this.uniqueColIds.aggregate(Aggregator.size) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/embedding/BUILD deleted file mode 100644 index 399d64417..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/BUILD +++ /dev/null @@ -1,311 +0,0 @@ -scala_library( - sources = [ - "*.scala", - "common/*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "escherbird/src/scala/com/twitter/escherbird/scalding/source", - "flockdb-tools/datasets/flock:flock-blocks-edges-scala", - "flockdb-tools/datasets/flock:flock-follows-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-abuse-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-spam-edges-scala", - "iesource/processing/events/src/main/scala/com/twitter/iesource/processing/events/batch:server_engagements-scala", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service:user_topic_relation_snapshot-scala", - "src/java/com/twitter/common/text/language:locale-util", - 
"src/scala/com/twitter/frigate/data_pipeline/scalding/magicrecs/magicrecs_notification_lite:magicrecs_notification_lite_1day_lag-scala", - "src/scala/com/twitter/onboarding/relevance/source:utt_account_recommendations-scala", - "src/scala/com/twitter/penguin/scalding/datasets:penguin_user_languages-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:simclusters_v2_embeddings_lite-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/common/matrix", - "src/scala/com/twitter/wtf/entity_real_graph/common", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/onboarding/relevance/candidates:candidates-scala", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/search/adaptive/scribing:adaptive-scribing-scala", - "src/thrift/com/twitter/wtf/entity_real_graph:entity_real_graph-thrift-scala", - "twadoop_config/configuration/log_categories/group/search:adaptive_search-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -hadoop_binary( - name = "entity_embeddings_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.EntityToSimClustersEmbeddingAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "entity_per_language_embeddings_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "semantic_core_entity_embeddings_dec11_model_job", - main = "com.twitter.simclusters_v2.scalding.embedding.SemanticCoreEntityEmbeddingsDec11ModelApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "semantic_core_entity_embeddings_2020_job", - main = "com.twitter.simclusters_v2.scalding.embedding.SemanticCoreEntityEmbeddings2020App", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "semantic_core_entity_embeddings_per_language_job", - main = "com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "locale_entity_simclusters_embedding_v2", - main = "com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingV2ScheduledApp", - platform = "java8", - runtime_platform = "java8", - 
tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "locale_entity_simclusters_embedding_v2-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingV2AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_fav_score", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFavScoreBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_fav_score_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFavScore2020BatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_fav_score_dec11", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFavScoreDec11BatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_follow_score", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFollowScoreBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_follow_score_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFollowScore2020BatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "producer_embeddings_from_interested_in_by_follow_score_dec11", - main = "com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInByFollowScoreDec11BatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "similar_users_by_simclusters_embeddings-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.SimilarUsersBySimClustersEmbeddingAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "similar_users_by_simclusters_embeddings", - main = 
"com.twitter.simclusters_v2.scalding.embedding.SimilarUsersBySimClustersEmbeddingBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "entity_embedding_from_producer_embedding-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.EntityEmbeddingFromProducerEmbeddingAdhocJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "entity_embedding_from_producer_embedding_job", - main = "com.twitter.simclusters_v2.scalding.embedding.EntityEmbeddingFromProducerEmbeddingScheduledJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -# Generated with `capesospy-v2 create_target similar_users_by_simclusters_embeddings_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml`, config hash b8cf4d. -scalding_job( - name = "similar_users_by_simclusters_embeddings_job", - main = "com.twitter.simclusters_v2.scalding.embedding.SimiliarUsersBySimClustersEmbeddingBatchApp", - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "15 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) - -hadoop_binary( - name = "global_simclusters_language_embedding_job", - main = "com.twitter.simclusters_v2.scalding.embedding.GlobalSimClustersLanguageEmbeddingBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":embedding"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityEmbeddingFromProducerEmbeddingJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityEmbeddingFromProducerEmbeddingJob.scala deleted file mode 100644 index 4d2e3c205..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityEmbeddingFromProducerEmbeddingJob.scala +++ /dev/null @@ -1,239 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.onboarding.relevance.candidates.thriftscala.InterestBasedUserRecommendations -import com.twitter.onboarding.relevance.candidates.thriftscala.UTTInterest -import com.twitter.onboarding.relevance.source.UttAccountRecommendationsScalaDataset -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import com.twitter.scalding.Execution -import com.twitter.scalding.RichDate -import com.twitter.scalding.UniqueID -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.typed.UnsortedGrouped -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import 
com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.ProducerEmbeddingSources -import com.twitter.simclusters_v2.hdfs_sources.SemanticCoreEmbeddingsFromProducerScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil._ -import com.twitter.simclusters_v2.thriftscala -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import com.twitter.wtf.scalding.jobs.common.StatsUtil._ -import java.util.TimeZone - -/* - $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_embedding_from_producer_embedding-adhoc - - $ scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.embedding.EntityEmbeddingFromProducerEmbeddingAdhocJob \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_embedding_from_producer_embedding-adhoc \ - --user recos-platform \ - -- --date 2019-10-23 --model_version 20M_145K_updated - */ -object EntityEmbeddingFromProducerEmbeddingAdhocJob extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - // step 1: read in (entity, producer) pairs and remove duplicates - val topK = args.getOrElse("top_k", "100").toInt - - val modelVersion = ModelVersions.toModelVersion( - args.getOrElse("model_version", ModelVersions.Model20M145KUpdated)) - - val entityKnownForProducers = - EntityEmbeddingFromProducerEmbeddingJob - .getNormalizedEntityProducerMatrix(dateRange.embiggen(Days(7))) - .count("num unique entity producer pairs").map { - case (entityId, producerId, score) => (producerId, (entityId, score)) - } - - // step 2: read in producer to simclusters embeddings - - val producersEmbeddingsFollowBased = - ProducerEmbeddingSources.producerEmbeddingSourceLegacy( - EmbeddingType.ProducerFollowBasedSemanticCoreEntity, - modelVersion)(dateRange.embiggen(Days(7))) - - val producersEmbeddingsFavBased = - ProducerEmbeddingSources.producerEmbeddingSourceLegacy( - EmbeddingType.ProducerFavBasedSemanticCoreEntity, - modelVersion)(dateRange.embiggen(Days(7))) - - // step 3: join producer embedding with entity, producer pairs and reformat result into format [SimClustersEmbeddingId, SimClustersEmbedding] - val producerBasedEntityEmbeddingsFollowBased = - EntityEmbeddingFromProducerEmbeddingJob - .computeEmbedding( - producersEmbeddingsFollowBased, - entityKnownForProducers, - topK, - modelVersion, - EmbeddingType.ProducerFollowBasedSemanticCoreEntity).toTypedPipe.count( - "follow_based_entity_count") - - val producerBasedEntityEmbeddingsFavBased = - EntityEmbeddingFromProducerEmbeddingJob - .computeEmbedding( - producersEmbeddingsFavBased, - entityKnownForProducers, - topK, - modelVersion, - EmbeddingType.ProducerFavBasedSemanticCoreEntity).toTypedPipe.count( - "fav_based_entity_count") - - val 
producerBasedEntityEmbeddings = - producerBasedEntityEmbeddingsFollowBased ++ producerBasedEntityEmbeddingsFavBased - - // step 4 write results to file - producerBasedEntityEmbeddings - .count("total_count").writeExecution( - AdhocKeyValSources.entityToClustersSource( - getHdfsPath(isAdhoc = true, isManhattanKeyVal = true, modelVersion, "producer"))) - } - -} - -/* - $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_embedding_from_producer_embedding_job - $ capesospy-v2 update \ - --build_locally \ - --start_cron entity_embedding_from_producer_embedding_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object EntityEmbeddingFromProducerEmbeddingScheduledJob extends ScheduledExecutionApp { - override def firstTime: RichDate = RichDate("2019-10-16") - - override def batchIncrement: Duration = Days(7) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - // parse args: modelVersion, topK - val topK = args.getOrElse("top_k", "100").toInt - // only support dec11 now since updated model is not productionized for producer embedding - val modelVersion = - ModelVersions.toModelVersion( - args.getOrElse("model_version", ModelVersions.Model20M145KUpdated)) - - val entityKnownForProducers = - EntityEmbeddingFromProducerEmbeddingJob - .getNormalizedEntityProducerMatrix(dateRange.embiggen(Days(7))) - .count("num unique entity producer pairs").map { - case (entityId, producerId, score) => (producerId, (entityId, score)) - } - - val favBasedEmbeddings = EntityEmbeddingFromProducerEmbeddingJob - .computeEmbedding( - ProducerEmbeddingSources.producerEmbeddingSourceLegacy( - EmbeddingType.ProducerFavBasedSemanticCoreEntity, - modelVersion)(dateRange.embiggen(Days(7))), - entityKnownForProducers, - topK, - modelVersion, - EmbeddingType.ProducerFavBasedSemanticCoreEntity - ).toTypedPipe.count("follow_based_entity_count") - - val followBasedEmbeddings = EntityEmbeddingFromProducerEmbeddingJob - .computeEmbedding( - ProducerEmbeddingSources.producerEmbeddingSourceLegacy( - EmbeddingType.ProducerFollowBasedSemanticCoreEntity, - modelVersion)(dateRange.embiggen(Days(7))), - entityKnownForProducers, - topK, - modelVersion, - EmbeddingType.ProducerFollowBasedSemanticCoreEntity - ).toTypedPipe.count("fav_based_entity_count") - - val embedding = favBasedEmbeddings ++ followBasedEmbeddings - - embedding - .count("total_count") - .map { - case (embeddingId, embedding) => KeyVal(embeddingId, embedding) - }.writeDALVersionedKeyValExecution( - SemanticCoreEmbeddingsFromProducerScalaDataset, - D.Suffix(getHdfsPath(isAdhoc = false, isManhattanKeyVal = true, modelVersion, "producer")) - ) - - } - -} - -private object EntityEmbeddingFromProducerEmbeddingJob { - def computeEmbedding( - producersEmbeddings: TypedPipe[(Long, TopSimClustersWithScore)], - entityKnownForProducers: TypedPipe[(Long, (Long, Double))], - topK: Int, - modelVersion: ModelVersion, - embeddingType: EmbeddingType - ): UnsortedGrouped[SimClustersEmbeddingId, thriftscala.SimClustersEmbedding] = { - producersEmbeddings - .hashJoin(entityKnownForProducers).flatMap { - case (_, (topSimClustersWithScore, (entityId, producerScore))) => { - val entityEmbedding = topSimClustersWithScore.topClusters - entityEmbedding.map { - case SimClusterWithScore(clusterId, score) => - ( - ( - SimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.EntityId(entityId)), - clusterId), - score * producerScore) - } - 
} - }.sumByKey.map { - case ((embeddingId, clusterId), clusterScore) => - (embeddingId, (clusterId, clusterScore)) - }.group.sortedReverseTake(topK)(Ordering.by(_._2)).mapValues(SimClustersEmbedding - .apply(_).toThrift) - } - - def getNormalizedEntityProducerMatrix( - implicit dateRange: DateRange - ): TypedPipe[(Long, Long, Double)] = { - val uttRecs: TypedPipe[(UTTInterest, InterestBasedUserRecommendations)] = - DAL - .readMostRecentSnapshot(UttAccountRecommendationsScalaDataset).withRemoteReadPolicy( - ExplicitLocation(ProcAtla)).toTypedPipe.map { - case KeyVal(interest, candidates) => (interest, candidates) - } - - uttRecs - .flatMap { - case (interest, candidates) => { - // current populated features - val top20Producers = candidates.recommendations.sortBy(-_.score.getOrElse(0.0d)).take(20) - val producerScorePairs = top20Producers.map { producer => - (producer.candidateUserID, producer.score.getOrElse(0.0)) - } - val scoreSum = producerScorePairs.map(_._2).sum - producerScorePairs.map { - case (producerId, score) => (interest.uttID, producerId, score / scoreSum) - } - } - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala deleted file mode 100644 index 21d68ee22..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala +++ /dev/null @@ -1,354 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.recos.entities.thriftscala.Entity -import com.twitter.recos.entities.thriftscala.Hashtag -import com.twitter.recos.entities.thriftscala.SemanticCoreEntity -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil._ -import com.twitter.simclusters_v2.scalding.embedding.common.EntityEmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.SimClustersEmbeddingJob -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersEmbedding => ThriftSimClustersEmbedding, - _ -} -import com.twitter.wtf.entity_real_graph.common.EntityUtil -import com.twitter.wtf.entity_real_graph.thriftscala.EntityType -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.DataSources -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_embeddings_job-adhoc - * - * ---------------------- Deploy to atla ---------------------- - * $ scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.embedding.EntityToSimClustersEmbeddingAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_embeddings_job-adhoc \ - --user recos-platform \ - -- --date 2019-09-09 --model-version 20M_145K_updated --entity-type SemanticCore - */ -object EntityToSimClustersEmbeddingAdhocApp extends AdhocExecutionApp { - - import EmbeddingUtil._ - import EntityEmbeddingUtil._ - import 
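A tiny in-memory illustration (made-up numbers, not part of the original file) of the aggregation performed by `computeEmbedding` above: an entity's score for a cluster is the sum, over producers known for that entity, of the normalized producer weight times the producer's cluster score.

```scala
// Producer -> (clusterId -> score), and entity -> (producerId, weight) with weights summing to 1,
// as produced by getNormalizedEntityProducerMatrix. All numbers are made up.
val producerClusters: Map[Long, Map[Int, Double]] =
  Map(1L -> Map(10 -> 0.8, 11 -> 0.2), 2L -> Map(10 -> 0.5))
val entityProducers: Map[Long, Seq[(Long, Double)]] =
  Map(100L -> Seq((1L, 0.6), (2L, 0.4)))

val entityEmbedding: Map[Int, Double] =
  entityProducers(100L)
    .flatMap {
      case (producerId, weight) =>
        producerClusters.getOrElse(producerId, Map.empty[Int, Double]).map {
          case (clusterId, score) => clusterId -> weight * score
        }
    }
    .groupBy { case (clusterId, _) => clusterId }
    .map { case (clusterId, scores) => clusterId -> scores.map(_._2).sum }
// entityEmbedding ≈ Map(10 -> 0.68, 11 -> 0.12); the job then keeps the topK clusters per entity.
```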
EntityToSimClustersEmbeddingsJob._ - import EntityUtil._ - import SimClustersEmbeddingJob._ - - def writeOutput( - embeddings: TypedPipe[(SimClustersEmbeddingId, (ClusterId, EmbeddingScore))], - topKEmbeddings: TypedPipe[(SimClustersEmbeddingId, Seq[(ClusterId, EmbeddingScore)])], - jobConfig: EntityEmbeddingsJobConfig - ): Execution[Unit] = { - - val toSimClusterEmbeddingExec = topKEmbeddings - .mapValues(SimClustersEmbedding.apply(_).toThrift) - .writeExecution( - AdhocKeyValSources.entityToClustersSource( - EntityToSimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - isReverseIndex = false, - jobConfig.modelVersion, - jobConfig.entityType))) - - val fromSimClusterEmbeddingExec = - toReverseIndexSimClusterEmbedding(embeddings, jobConfig.topK) - .writeExecution( - AdhocKeyValSources.clusterToEntitiesSource( - EntityToSimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - isReverseIndex = true, - jobConfig.modelVersion, - jobConfig.entityType))) - - Execution.zip(toSimClusterEmbeddingExec, fromSimClusterEmbeddingExec).unit - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val jobConfig = EntityEmbeddingsJobConfig(args, isAdhoc = true) - - val numReducers = args.getOrElse("m", "1000").toInt - - /* - Using the ERG daily dataset in the adhoc job for quick prototyping, note that there may be - issues with scaling the job when productionizing on ERG aggregated dataset. - */ - val entityRealGraphSource = DataSources.entityRealGraphDailyDataSetSource - - val entityUserMatrix: TypedPipe[(Entity, (UserId, Double))] = - (jobConfig.entityType match { - case EntityType.SemanticCore => - getEntityUserMatrix(entityRealGraphSource, jobConfig.halfLife, EntityType.SemanticCore) - case EntityType.Hashtag => - getEntityUserMatrix(entityRealGraphSource, jobConfig.halfLife, EntityType.Hashtag) - case _ => - throw new IllegalArgumentException( - s"Argument [--entity-type] must be provided. 
Supported options [${EntityType.SemanticCore.name}, ${EntityType.Hashtag.name}]") - }).forceToDisk - - val normalizedUserEntityMatrix = - getNormalizedTransposeInputMatrix(entityUserMatrix, numReducers = Some(numReducers)) - - //determine which data source to use based on model version - val simClustersSource = jobConfig.modelVersion match { - case ModelVersion.Model20m145kUpdated => - InterestedInSources.simClustersInterestedInUpdatedSource(dateRange, timeZone) - case _ => - InterestedInSources.simClustersInterestedInDec11Source(dateRange, timeZone) - } - - val embeddings = computeEmbeddings( - simClustersSource, - normalizedUserEntityMatrix, - scoreExtractors, - ModelVersion.Model20m145kUpdated, - toSimClustersEmbeddingId(jobConfig.modelVersion), - numReducers = Some(numReducers * 2) - ) - - val topKEmbeddings = - embeddings.group - .sortedReverseTake(jobConfig.topK)(Ordering.by(_._2)) - .withReducers(numReducers) - - writeOutput(embeddings, topKEmbeddings, jobConfig) - } -} - -/** - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:semantic_core_entity_embeddings_2020_job - * $ capesospy-v2 update \ - --build_locally \ - --start_cron semantic_core_entity_embeddings_2020_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object SemanticCoreEntityEmbeddings2020App extends EntityToSimClustersEmbeddingApp - -trait EntityToSimClustersEmbeddingApp extends ScheduledExecutionApp { - - import EmbeddingUtil._ - import EntityEmbeddingUtil._ - import EntityToSimClustersEmbeddingsJob._ - import EntityUtil._ - import SimClustersEmbeddingJob._ - - override val firstTime: RichDate = RichDate("2023-01-01") - - override val batchIncrement: Duration = Days(7) - - private def writeOutput( - embeddings: TypedPipe[(SimClustersEmbeddingId, (ClusterId, EmbeddingScore))], - topKEmbeddings: TypedPipe[(SimClustersEmbeddingId, Seq[(ClusterId, EmbeddingScore)])], - jobConfig: EntityEmbeddingsJobConfig, - clusterEmbeddingsDataset: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ], - entityEmbeddingsDataset: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, InternalIdEmbedding]] - ): Execution[Unit] = { - - val toSimClustersEmbeddings = - topKEmbeddings - .mapValues(SimClustersEmbedding.apply(_).toThrift) - .map { - case (entityId, topSimClusters) => KeyVal(entityId, topSimClusters) - } - .writeDALVersionedKeyValExecution( - clusterEmbeddingsDataset, - D.Suffix( - EntityToSimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - isReverseIndex = false, - jobConfig.modelVersion, - jobConfig.entityType)) - ) - - val fromSimClustersEmbeddings = - toReverseIndexSimClusterEmbedding(embeddings, jobConfig.topK) - .map { - case (embeddingId, internalIdsWithScore) => - KeyVal(embeddingId, internalIdsWithScore) - } - .writeDALVersionedKeyValExecution( - entityEmbeddingsDataset, - D.Suffix( - EntityToSimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - isReverseIndex = true, - jobConfig.modelVersion, - jobConfig.entityType)) - ) - - Execution.zip(toSimClustersEmbeddings, fromSimClustersEmbeddings).unit - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val jobConfig = EntityEmbeddingsJobConfig(args, isAdhoc = false) - - val embeddingsDataset = EntityEmbeddingsSources.getEntityEmbeddingsDataset( - jobConfig.entityType, - 
ModelVersions.toKnownForModelVersion(jobConfig.modelVersion) - ) - - val reverseIndexEmbeddingsDataset = - EntityEmbeddingsSources.getReverseIndexedEntityEmbeddingsDataset( - jobConfig.entityType, - ModelVersions.toKnownForModelVersion(jobConfig.modelVersion) - ) - - val entityRealGraphSource = - DataSources.entityRealGraphAggregationDataSetSource(dateRange.embiggen(Days(7))) - - val entityUserMatrix: TypedPipe[(Entity, (UserId, Double))] = - getEntityUserMatrix( - entityRealGraphSource, - jobConfig.halfLife, - jobConfig.entityType).forceToDisk - - val normalizedUserEntityMatrix = getNormalizedTransposeInputMatrix(entityUserMatrix) - - val simClustersEmbedding = jobConfig.modelVersion match { - case ModelVersion.Model20m145k2020 => - val simClustersSource2020 = - InterestedInSources.simClustersInterestedIn2020Source(dateRange, timeZone) - computeEmbeddings( - simClustersSource2020, - normalizedUserEntityMatrix, - scoreExtractors, - ModelVersion.Model20m145k2020, - toSimClustersEmbeddingId(ModelVersion.Model20m145k2020) - ) - case modelVersion => - throw new IllegalArgumentException(s"Model Version ${modelVersion.name} not supported") - } - - val topKEmbeddings = - simClustersEmbedding.group.sortedReverseTake(jobConfig.topK)(Ordering.by(_._2)) - - val simClustersEmbeddingsExec = - writeOutput( - simClustersEmbedding, - topKEmbeddings, - jobConfig, - embeddingsDataset, - reverseIndexEmbeddingsDataset) - - // We don't support embeddingsLite for the 2020 model version. - val embeddingsLiteExec = if (jobConfig.modelVersion == ModelVersion.Model20m145kUpdated) { - topKEmbeddings - .collect { - case ( - SimClustersEmbeddingId( - EmbeddingType.FavBasedSematicCoreEntity, - ModelVersion.Model20m145kUpdated, - InternalId.EntityId(entityId)), - clustersWithScores) => - entityId -> clustersWithScores - } - .flatMap { - case (entityId, clustersWithScores) => - clustersWithScores.map { - case (clusterId, score) => EmbeddingsLite(entityId, clusterId, score) - } - case _ => Nil - }.writeDALSnapshotExecution( - SimclustersV2EmbeddingsLiteScalaDataset, - D.Daily, - D.Suffix(embeddingsLitePath(ModelVersion.Model20m145kUpdated, "fav_based")), - D.EBLzo(), - dateRange.end) - } else { - Execution.unit - } - - Execution - .zip(simClustersEmbeddingsExec, embeddingsLiteExec).unit - } -} - -object EntityToSimClustersEmbeddingsJob { - - def toSimClustersEmbeddingId( - modelVersion: ModelVersion - ): (Entity, ScoreType.ScoreType) => SimClustersEmbeddingId = { - case (Entity.SemanticCore(SemanticCoreEntity(entityId, _)), ScoreType.FavScore) => - SimClustersEmbeddingId( - EmbeddingType.FavBasedSematicCoreEntity, - modelVersion, - InternalId.EntityId(entityId)) - case (Entity.SemanticCore(SemanticCoreEntity(entityId, _)), ScoreType.FollowScore) => - SimClustersEmbeddingId( - EmbeddingType.FollowBasedSematicCoreEntity, - modelVersion, - InternalId.EntityId(entityId)) - case (Entity.Hashtag(Hashtag(hashtag)), ScoreType.FavScore) => - SimClustersEmbeddingId( - EmbeddingType.FavBasedHashtagEntity, - modelVersion, - InternalId.Hashtag(hashtag)) - case (Entity.Hashtag(Hashtag(hashtag)), ScoreType.FollowScore) => - SimClustersEmbeddingId( - EmbeddingType.FollowBasedHashtagEntity, - modelVersion, - InternalId.Hashtag(hashtag)) - case (scoreType, entity) => - throw new IllegalArgumentException( - s"(ScoreType, Entity) ($scoreType, ${entity.toString}) not supported") - } - - /** - * Generates the output path for the Entity Embeddings Job. 
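A hypothetical sketch of the key-mapping function just defined, using the hashtag branch; it assumes the same wildcard imports as the surrounding job (which are taken to provide `ScoreType`).

```scala
// Sketch: build the SimClustersEmbeddingId for a fav-based hashtag embedding.
val toId: (Entity, ScoreType.ScoreType) => SimClustersEmbeddingId =
  EntityToSimClustersEmbeddingsJob.toSimClustersEmbeddingId(ModelVersion.Model20m145k2020)

val hashtagId = toId(Entity.Hashtag(Hashtag("scala")), ScoreType.FavScore)
// hashtagId == SimClustersEmbeddingId(
//   EmbeddingType.FavBasedHashtagEntity, ModelVersion.Model20m145k2020, InternalId.Hashtag("scala"))
```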
- * - * Example Adhoc: /user/recos-platform/processed/adhoc/simclusters_embeddings/hashtag/model_20m_145k_updated - * Example Prod: /atla/proc/user/cassowary/processed/simclusters_embeddings/semantic_core/model_20m_145k_dec11 - * - */ - def getHdfsPath( - isAdhoc: Boolean, - isManhattanKeyVal: Boolean, - isReverseIndex: Boolean, - modelVersion: ModelVersion, - entityType: EntityType - ): String = { - - val reverseIndex = if (isReverseIndex) "reverse_index/" else "" - - val entityTypeSuffix = entityType match { - case EntityType.SemanticCore => "semantic_core" - case EntityType.Hashtag => "hashtag" - case _ => "unknown" - } - - val pathSuffix = s"$reverseIndex$entityTypeSuffix" - - EmbeddingUtil.getHdfsPath(isAdhoc, isManhattanKeyVal, modelVersion, pathSuffix) - } - - def embeddingsLitePath(modelVersion: ModelVersion, pathSuffix: String): String = { - s"/user/cassowary/processed/entity_real_graph/simclusters_embedding/lite/$modelVersion/$pathSuffix/" - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/GlobalSimClustersLanguageEmbedding.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/GlobalSimClustersLanguageEmbedding.scala deleted file mode 100644 index 2a66a8a8e..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/GlobalSimClustersLanguageEmbedding.scala +++ /dev/null @@ -1,197 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.UniqueID -import com.twitter.scalding._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite.ExplicitEndTime -import com.twitter.scalding_internal.dalv2.DALWrite.WriteExtension -import com.twitter.scalding_internal.job.RequiredBinaryComparators.ordSer -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.Timestamp -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.InterestedInSources -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.InternalId.ClusterId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2GlobalLanguageEmbeddingScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2GlobalLanguageEmbeddingThriftScalaDataset -import com.twitter.simclusters_v2.thriftscala.LanguageToClusters -import java.util.TimeZone - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron global_simclusters_language_embedding_job \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object GlobalSimClustersLanguageEmbeddingBatchApp extends ScheduledExecutionApp { - - override val firstTime: RichDate = RichDate("2023-03-07") - - override val batchIncrement: Duration = Days(1) - - val outputHdfsDirectory = - 
"/user/cassowary/manhattan_sequence_files/global_simclusters_language_embeddings" - - val outputThriftHdfsDirectory = - "/user/cassowary/processed/global_simclusters_language_embeddings" - - val globalLanguageEmbeddingsKeyValDataset: KeyValDALDataset[ - KeyVal[String, ClustersUserIsInterestedIn] - ] = SimclustersV2GlobalLanguageEmbeddingScalaDataset - - val globalLanguageEmbeddingsThriftDataset: SnapshotDALDataset[LanguageToClusters] = - SimclustersV2GlobalLanguageEmbeddingThriftScalaDataset - - val numOfClustersPerLanguage: Int = 400 - - def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedIn2020Source - - def flattenAndFilterUserInterestedIn( - interestedIn: TypedPipe[(UserId, ClustersUserIsInterestedIn)] - ): TypedPipe[(UserId, (Int, Double))] = { - interestedIn - // Get (userId, Seq[(clusterId, scores)] - .map { - case (user, clusterUserIsInterestedIn) => { - (user, clusterUserIsInterestedIn.clusterIdToScores) - } - } - // Flatten it into (UserId, ClusterId, LogFavScore) - .flatMap { - case (userId, clusterUserIsInterestedIn) => { - clusterUserIsInterestedIn.toSeq.map { - case (clusterId, scores) => { - (userId, (clusterId, scores.logFavScore.getOrElse(0.0))) - } - } - } - }.filter(_._2._2 > 0.0) // Filter out zero scores - } - - def getGlobalSimClustersEmbeddingPerLanguage( - interestedIn: TypedPipe[(UserId, (Int, Double))], - favEdges: TypedPipe[(UserId, TweetId, Timestamp)], - language: TypedPipe[(UserId, (Country, Language))] - ): TypedPipe[(Language, ClustersUserIsInterestedIn)] = { - // Engagement fav edges - val edges = favEdges.map { case (userId, tweetId, ts) => (userId, (tweetId, ts)) } - - // Language information for users - val userLanguage = language.map { - case (userId, (country, lang)) => (userId, lang) - } - val numUsersPerLanguage = userLanguage.map { - case (_, lang) => (lang, 1L) - }.sumByKey - - val embeddings = - interestedIn - .join(edges) // Join InterestedIn and user-tweet engagements - .map { - case (userId, ((clusterId, score), (_, _))) => { - (userId, (clusterId, score)) - } - } - .join(userLanguage) // Join and get cluster scores per language - .map { - case (userId, ((clusterId, score), lang)) => { - ((lang, clusterId), score) - } - } - .sumByKey // Sum the user embeddings per language based on the engagements - .map { case ((lang, clusterId), score) => (lang, (clusterId, score)) } - .join(numUsersPerLanguage) - // We compute the average cluster scores per language - .map { - case (lang, ((clusterId, score), count)) => (lang, (clusterId -> score / count)) - } - .group - .sortedReverseTake(numOfClustersPerLanguage)(Ordering - .by(_._2)) // Take top 400 clusters per language - .flatMap { - case (lang, clusterScores) => { - clusterScores.map { - case (clusterId, score) => (lang, (clusterId, score)) - } - } - }.mapValues { case (clusterId, score) => Map(clusterId -> score) } - - // Build the final SimClusters embeddings per language - embeddings.sumByKey.map { - case (lang, clusterToScore) => { - val clusterScores = clusterToScore.map { - case (clusterId, score) => - clusterId -> UserToInterestedInClusterScores(logFavScore = Some(score)) - } - (lang, ClustersUserIsInterestedIn(ModelVersion.Model20m145k2020.name, clusterScores)) - } - } - } - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - // Read the most recent InterestedIn snapshot from the past 21 days - val 
interestedIn = - InterestedInSources - .simClustersInterestedIn2020Source(dateRange.prepend(Days(21)), timeZone).forceToDisk - - // Get the user tweet fav engagement history from the past 2 days - val userTweetFavEdges = ExternalDataSources.userTweetFavoritesSource - - // Read user language from UserSource - val userLanguages = ExternalDataSources.userSource - - val globalEmbeddings = getGlobalSimClustersEmbeddingPerLanguage( - flattenAndFilterUserInterestedIn(interestedIn), - userTweetFavEdges, - userLanguages) - - // Write results as a key-val dataset - globalEmbeddings - .map { - case (lang, embeddings) => - KeyVal(lang, embeddings) - } - .writeDALVersionedKeyValExecution( - globalLanguageEmbeddingsKeyValDataset, - D.Suffix(outputHdfsDirectory) - ) - - // Write results as a thrift dataset - globalEmbeddings - .map { - case (lang, clusterUserIsInterestedIn) => - LanguageToClusters( - lang, - clusterUserIsInterestedIn.knownForModelVersion, - clusterUserIsInterestedIn.clusterIdToScores - ) - } - .writeDALSnapshotExecution( - globalLanguageEmbeddingsThriftDataset, - D.Daily, - D.Suffix(outputThriftHdfsDirectory), - D.Parquet, - dateRange.`end` - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingV2Job.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingV2Job.scala deleted file mode 100644 index baf604cba..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingV2Job.scala +++ /dev/null @@ -1,248 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.recos.entities.thriftscala.{Entity, SemanticCoreEntity} -import com.twitter.scalding.{DateRange, Days, Duration, Execution, RichDate, TypedPipe, UniqueID} -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common._ -import com.twitter.simclusters_v2.hdfs_sources.{AdhocKeyValSources, EntityEmbeddingsSources} -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ClusterId -import com.twitter.simclusters_v2.scalding.embedding.common.{ - EmbeddingUtil, - ExternalDataSources, - SimClustersEmbeddingBaseJob -} -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - InternalId, - InternalIdEmbedding, - InternalIdWithScore, - LocaleEntityId, - ModelVersion, - SimClustersEmbeddingId -} -import com.twitter.wtf.entity_real_graph.thriftscala.{Edge, FeatureName} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, DataSources, ScheduledExecutionApp} -import java.util.TimeZone - -/** - * Scheduled production job which generates topic embeddings per locale based on Entity Real Graph. - * - * V2 Uses the log transform of the ERG favScores and the SimCluster InterestedIn scores. 
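Before the next file, a tiny in-memory illustration (made-up numbers, not part of the original source) of the per-language averaging done by `getGlobalSimClustersEmbeddingPerLanguage` above: cluster scores of engaging users are summed per language, divided by the total number of users with that language, and only the top clusters are kept.

```scala
// Two engaging "en" users (ids 1 and 2) out of three "en" users overall.
val userClusterScores = Seq((1L, (7, 0.9)), (2L, (7, 0.3)), (2L, (9, 0.6))) // (userId, (clusterId, logFavScore))
val numEnUsers = 3L                                                         // from the user-language source

val summedPerCluster: Map[Int, Double] =
  userClusterScores
    .groupBy { case (_, (clusterId, _)) => clusterId }
    .map { case (clusterId, entries) => clusterId -> entries.map(_._2._2).sum }

val averaged: Map[Int, Double] = summedPerCluster.map { case (c, sum) => c -> sum / numEnUsers }
// averaged ≈ Map(7 -> 0.4, 9 -> 0.2); the job then keeps the top numOfClustersPerLanguage clusters.
```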
- * - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:locale_entity_simclusters_embedding_v2 - * $ capesospy-v2 update \ - --build_locally \ - --start_cron locale_entity_simclusters_embedding_v2 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object LocaleEntitySimClustersEmbeddingV2ScheduledApp - extends LocaleEntitySimClustersEmbeddingV2Job - with ScheduledExecutionApp { - - override val firstTime: RichDate = RichDate("2020-04-08") - - override val batchIncrement: Duration = Days(1) - - override def writeNounToClustersIndex( - output: TypedPipe[(LocaleEntity, Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - output - .map { - case ((entityId, lang), clustersWithScores) => - KeyVal( - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedLocaleSemanticCoreEntity, - ModelVersion.Model20m145kUpdated, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)) - ), - SimClustersEmbedding(clustersWithScores).toThrift - ) - } - .writeDALVersionedKeyValExecution( - EntityEmbeddingsSources.LogFavSemanticCorePerLanguageSimClustersEmbeddingsDataset, - D.Suffix( - EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - ModelVersion.Model20m145kUpdated, - pathSuffix = "log_fav_erg_based_embeddings")) - ) - } - - override def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[(LocaleEntity, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .map { - case (clusterId, nounsWithScore) => - KeyVal( - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedLocaleSemanticCoreEntity, - ModelVersion.Model20m145kUpdated, - InternalId.ClusterId(clusterId) - ), - InternalIdEmbedding(nounsWithScore.map { - case ((entityId, lang), score) => - InternalIdWithScore( - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)), - score) - }) - ) - } - .writeDALVersionedKeyValExecution( - EntityEmbeddingsSources.LogFavReverseIndexSemanticCorePerLanguageSimClustersEmbeddingsDataset, - D.Suffix( - EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - ModelVersion.Model20m145kUpdated, - pathSuffix = "reverse_index_log_fav_erg_based_embeddings")) - ) - } -} - -/** - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:locale_entity_simclusters_embedding_v2-adhoc - * - * $ scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingV2AdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding:locale_entity_simclusters_embedding_v2-adhoc \ - --user recos-platform --reducers 2000\ - -- --date 2020-04-06 - */ -object LocaleEntitySimClustersEmbeddingV2AdhocApp - extends LocaleEntitySimClustersEmbeddingV2Job - with AdhocExecutionApp { - - override def writeNounToClustersIndex( - output: TypedPipe[(LocaleEntity, Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - output - .map { - case ((entityId, lang), clustersWithScores) => - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedLocaleSemanticCoreEntity, - ModelVersion.Model20m145kUpdated, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)) - ) -> SimClustersEmbedding(clustersWithScores).toThrift - - }.writeExecution( - AdhocKeyValSources.entityToClustersSource( - EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - 
ModelVersion.Model20m145kUpdated, - pathSuffix = "log_fav_erg_based_embeddings"))) - } - - override def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[(LocaleEntity, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - output - .map { - case (clusterId, nounsWithScore) => - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedLocaleSemanticCoreEntity, - ModelVersion.Model20m145kUpdated, - InternalId.ClusterId(clusterId) - ) -> - InternalIdEmbedding(nounsWithScore.map { - case ((entityId, lang), score) => - InternalIdWithScore( - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)), - score) - }) - } - .writeExecution( - AdhocKeyValSources.clusterToEntitiesSource( - EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - ModelVersion.Model20m145kUpdated, - pathSuffix = "reverse_index_log_fav_erg_based_embeddings"))) - } -} - -trait LocaleEntitySimClustersEmbeddingV2Job extends SimClustersEmbeddingBaseJob[LocaleEntity] { - - override val numClustersPerNoun = 100 - - override val numNounsPerClusters = 100 - - override val thresholdForEmbeddingScores: Double = 0.001 - - override val numReducersOpt: Option[Int] = Some(8000) - - private val DefaultERGHalfLifeInDays = 14 - - private val MinInterestedInLogFavScore = 0.0 - - implicit val inj: Injection[LocaleEntity, Array[Byte]] = Bufferable.injectionOf[LocaleEntity] - - override def prepareNounToUserMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[LocaleEntity, UserId, Double] = { - - val erg: TypedPipe[(SemanticCoreEntityId, (UserId, Double))] = - DataSources.entityRealGraphAggregationDataSetSource(dateRange.embiggen(Days(7))).flatMap { - case Edge( - userId, - Entity.SemanticCore(SemanticCoreEntity(entityId, _)), - consumerFeatures, - _, - _) if consumerFeatures.exists(_.exists(_.featureName == FeatureName.Favorites)) => - for { - features <- consumerFeatures - favFeatures <- features.find(_.featureName == FeatureName.Favorites) - ewmaMap <- favFeatures.featureValues.ewmaMap - favScore <- ewmaMap.get(DefaultERGHalfLifeInDays) - } yield (entityId, (userId, Math.log(favScore + 1))) - - case _ => None - } - - SparseMatrix[LocaleEntity, UserId, Double]( - erg - .hashJoin(ExternalDataSources.uttEntitiesSource().asKeys).map { - case (entityId, ((userId, score), _)) => (userId, (entityId, score)) - }.join(ExternalDataSources.userSource).map { - case (userId, ((entityId, score), (_, language))) => - ((entityId, language), userId, score) - } - ) - } - - override def prepareUserToClusterMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseRowMatrix[UserId, ClusterId, Double] = { - SparseRowMatrix( - ExternalDataSources.simClustersInterestInLogFavSource(MinInterestedInLogFavScore), - isSkinnyMatrix = true - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingsJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingsJob.scala deleted file mode 100644 index 06c66038c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/LocaleEntitySimClustersEmbeddingsJob.scala +++ /dev/null @@ -1,437 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.recos.entities.thriftscala.Entity -import com.twitter.recos.entities.thriftscala.Hashtag -import 
com.twitter.recos.entities.thriftscala.SemanticCoreEntity -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.hdfs_sources.presto_hdfs_sources._ -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.EntityEmbeddingsSources -import com.twitter.simclusters_v2.hdfs_sources.InterestedInSources -import com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingsJob._ -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil._ -import com.twitter.simclusters_v2.scalding.embedding.common.EntityEmbeddingUtil._ -import com.twitter.simclusters_v2.scalding.embedding.common.SimClustersEmbeddingJob._ -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersEmbedding => ThriftSimClustersEmbedding, - _ -} -import com.twitter.wtf.entity_real_graph.common.EntityUtil -import com.twitter.wtf.entity_real_graph.thriftscala.Edge -import com.twitter.wtf.entity_real_graph.thriftscala.EntityType -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.DataSources -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_per_language_embeddings_job-adhoc - * - * ---------------------- Deploy to atla ---------------------- - * $ scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.embedding.LocaleEntitySimClustersEmbeddingAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding:entity_per_language_embeddings_job-adhoc \ - --user recos-platform \ - -- --date 2019-12-17 --model-version 20M_145K_updated --entity-type SemanticCore - */ -object LocaleEntitySimClustersEmbeddingAdhocApp extends AdhocExecutionApp { - - // Import implicits - - import EntityUtil._ - - def writeOutput( - embeddings: TypedPipe[(SimClustersEmbeddingId, (ClusterId, EmbeddingScore))], - topKEmbeddings: TypedPipe[(SimClustersEmbeddingId, Seq[(ClusterId, EmbeddingScore)])], - jobConfig: EntityEmbeddingsJobConfig - ): Execution[Unit] = { - - val toSimClusterEmbeddingExec = topKEmbeddings - .mapValues(SimClustersEmbedding.apply(_).toThrift) - .writeExecution( - AdhocKeyValSources.entityToClustersSource( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - isReverseIndex = false, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType))) - - val fromSimClusterEmbeddingExec = - toReverseIndexSimClusterEmbedding(embeddings, jobConfig.topK) - .writeExecution( - AdhocKeyValSources.clusterToEntitiesSource( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - isReverseIndex = true, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType))) - - Execution.zip(toSimClusterEmbeddingExec, fromSimClusterEmbeddingExec).unit - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val jobConfig = 
EntityEmbeddingsJobConfig(args, isAdhoc = true) - - val numReducers = args.getOrElse("m", "2000").toInt - - /* - Can use the ERG daily dataset in the adhoc job for quick prototyping, note that there may be - issues with scaling the job when productionizing on ERG aggregated dataset. - */ - val userEntityMatrix: TypedPipe[(UserId, (Entity, Double))] = - getUserEntityMatrix( - jobConfig, - DataSources.entityRealGraphAggregationDataSetSource(dateRange.embiggen(Days(7))), - Some(ExternalDataSources.uttEntitiesSource()) - ).forceToDisk - - //determine which data source to use based on model version - val simClustersSource = jobConfig.modelVersion match { - case ModelVersion.Model20m145kUpdated => - InterestedInSources.simClustersInterestedInUpdatedSource(dateRange, timeZone) - case modelVersion => - throw new IllegalArgumentException( - s"SimClusters model version not supported ${modelVersion.name}") - } - - val entityPerLanguage = userEntityMatrix.join(ExternalDataSources.userSource).map { - case (userId, ((entity, score), (_, language))) => - ((entity, language), (userId, score)) - } - - val normalizedUserEntityMatrix = - getNormalizedTransposeInputMatrix(entityPerLanguage, numReducers = Some(numReducers)) - - val embeddings = computeEmbeddings[(Entity, String)]( - simClustersSource, - normalizedUserEntityMatrix, - scoreExtractors, - ModelVersion.Model20m145kUpdated, - toSimClustersEmbeddingId(jobConfig.modelVersion), - numReducers = Some(numReducers * 2) - ) - - val topKEmbeddings = - embeddings.group - .sortedReverseTake(jobConfig.topK)(Ordering.by(_._2)) - .withReducers(numReducers) - - writeOutput(embeddings, topKEmbeddings, jobConfig) - } -} - -/** - * $ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:semantic_core_entity_embeddings_per_language_job - * $ capesospy-v2 update \ - --build_locally \ - --start_cron semantic_core_entity_embeddings_per_language_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object LocaleEntitySimClustersEmbeddingScheduledApp extends ScheduledExecutionApp { - - // Import implicits - - import EmbeddingUtil._ - import EntityUtil._ - - override val firstTime: RichDate = RichDate("2019-10-22") - - override val batchIncrement: Duration = Days(7) - - private def writeOutput( - embeddings: TypedPipe[(SimClustersEmbeddingId, (ClusterId, EmbeddingScore))], - topKEmbeddings: TypedPipe[(SimClustersEmbeddingId, Seq[(ClusterId, EmbeddingScore)])], - jobConfig: EntityEmbeddingsJobConfig, - clusterEmbeddingsDataset: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ], - entityEmbeddingsDataset: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, InternalIdEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone - ): Execution[Unit] = { - - val thriftSimClustersEmbedding = topKEmbeddings - .mapValues(SimClustersEmbedding.apply(_).toThrift) - - val writeSimClustersEmbeddingKeyValDataset = - thriftSimClustersEmbedding - .map { - case (entityId, topSimClusters) => KeyVal(entityId, topSimClusters) - } - .writeDALVersionedKeyValExecution( - clusterEmbeddingsDataset, - D.Suffix( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - isReverseIndex = false, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType)) - ) - - val writeSimClustersEmbeddingDataset = thriftSimClustersEmbedding - .map { - case (embeddingId, embedding) => SimClustersEmbeddingWithId(embeddingId, embedding) - } - .writeDALSnapshotExecution( - 
SemanticCorePerLanguageSimclustersEmbeddingsPrestoScalaDataset, - D.Daily, - D.Suffix( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - isReverseIndex = false, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType)), - D.EBLzo(), - dateRange.end - ) - - val thriftReversedSimclustersEmbeddings = - toReverseIndexSimClusterEmbedding(embeddings, jobConfig.topK) - - val writeReverseSimClustersEmbeddingKeyValDataset = - thriftReversedSimclustersEmbeddings - .map { - case (embeddingId, internalIdsWithScore) => - KeyVal(embeddingId, internalIdsWithScore) - } - .writeDALVersionedKeyValExecution( - entityEmbeddingsDataset, - D.Suffix( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - isReverseIndex = true, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType)) - ) - - val writeReverseSimClustersEmbeddingDataset = - thriftReversedSimclustersEmbeddings - .map { - case (embeddingId, embedding) => InternalIdEmbeddingWithId(embeddingId, embedding) - }.writeDALSnapshotExecution( - ReverseIndexSemanticCorePerLanguageSimclustersEmbeddingsPrestoScalaDataset, - D.Daily, - D.Suffix( - LocaleEntitySimClustersEmbeddingsJob.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - isReverseIndex = true, - isLogFav = false, - jobConfig.modelVersion, - jobConfig.entityType)), - D.EBLzo(), - dateRange.end - ) - - Execution - .zip( - writeSimClustersEmbeddingDataset, - writeSimClustersEmbeddingKeyValDataset, - writeReverseSimClustersEmbeddingDataset, - writeReverseSimClustersEmbeddingKeyValDataset - ).unit - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val jobConfig = EntityEmbeddingsJobConfig(args, isAdhoc = false) - - val embeddingsDataset = EntityEmbeddingsSources.getEntityEmbeddingsDataset( - jobConfig.entityType, - ModelVersions.toKnownForModelVersion(jobConfig.modelVersion), - isEmbeddingsPerLocale = true - ) - - val reverseIndexEmbeddingsDataset = - EntityEmbeddingsSources.getReverseIndexedEntityEmbeddingsDataset( - jobConfig.entityType, - ModelVersions.toKnownForModelVersion(jobConfig.modelVersion), - isEmbeddingsPerLocale = true - ) - - val userEntityMatrix: TypedPipe[(UserId, (Entity, Double))] = - getUserEntityMatrix( - jobConfig, - DataSources.entityRealGraphAggregationDataSetSource(dateRange.embiggen(Days(7))), - Some(ExternalDataSources.uttEntitiesSource()) - ).forceToDisk - - //determine which data source to use based on model version - val simClustersSource = jobConfig.modelVersion match { - case ModelVersion.Model20m145kUpdated => - InterestedInSources.simClustersInterestedInUpdatedSource(dateRange, timeZone) - case modelVersion => - throw new IllegalArgumentException( - s"SimClusters model version not supported ${modelVersion.name}") - } - - val entityPerLanguage = userEntityMatrix.join(ExternalDataSources.userSource).map { - case (userId, ((entity, score), (_, language))) => - ((entity, language), (userId, score)) - } - - val normalizedUserEntityMatrix = - getNormalizedTransposeInputMatrix(entityPerLanguage, numReducers = Some(3000)) - - val simClustersEmbedding = jobConfig.modelVersion match { - case ModelVersion.Model20m145kUpdated => - computeEmbeddings( - simClustersSource, - normalizedUserEntityMatrix, - scoreExtractors, - ModelVersion.Model20m145kUpdated, - toSimClustersEmbeddingId(ModelVersion.Model20m145kUpdated), - numReducers = Some(8000) - 
) - case modelVersion => - throw new IllegalArgumentException( - s"SimClusters model version not supported ${modelVersion.name}") - } - - val topKEmbeddings = - simClustersEmbedding.group.sortedReverseTake(jobConfig.topK)(Ordering.by(_._2)) - - writeOutput( - simClustersEmbedding, - topKEmbeddings, - jobConfig, - embeddingsDataset, - reverseIndexEmbeddingsDataset) - } -} - -object LocaleEntitySimClustersEmbeddingsJob { - - def getUserEntityMatrix( - jobConfig: EntityEmbeddingsJobConfig, - entityRealGraphSource: TypedPipe[Edge], - semanticCoreEntityIdsToKeep: Option[TypedPipe[Long]], - applyLogTransform: Boolean = false - ): TypedPipe[(UserId, (Entity, Double))] = - jobConfig.entityType match { - case EntityType.SemanticCore => - semanticCoreEntityIdsToKeep match { - case Some(entityIdsToKeep) => - getEntityUserMatrix(entityRealGraphSource, jobConfig.halfLife, EntityType.SemanticCore) - .map { - case (entity, (userId, score)) => - entity match { - case Entity.SemanticCore(SemanticCoreEntity(entityId, _)) => - if (applyLogTransform) { - (entityId, (userId, (entity, Math.log(score + 1)))) - } else { - (entityId, (userId, (entity, score))) - } - case _ => - throw new IllegalArgumentException( - "Job config specified EntityType.SemanticCore, but non-semantic core entity was found.") - } - }.hashJoin(entityIdsToKeep.asKeys).values.map { - case ((userId, (entity, score)), _) => (userId, (entity, score)) - } - case _ => - getEntityUserMatrix(entityRealGraphSource, jobConfig.halfLife, EntityType.SemanticCore) - .map { case (entity, (userId, score)) => (userId, (entity, score)) } - } - case EntityType.Hashtag => - getEntityUserMatrix(entityRealGraphSource, jobConfig.halfLife, EntityType.Hashtag) - .map { case (entity, (userId, score)) => (userId, (entity, score)) } - case _ => - throw new IllegalArgumentException( - s"Argument [--entity-type] must be provided. Supported options [${EntityType.SemanticCore.name}, ${EntityType.Hashtag.name}]") - } - - def toSimClustersEmbeddingId( - modelVersion: ModelVersion - ): ((Entity, String), ScoreType.ScoreType) => SimClustersEmbeddingId = { - case ((Entity.SemanticCore(SemanticCoreEntity(entityId, _)), lang), ScoreType.FavScore) => - SimClustersEmbeddingId( - EmbeddingType.FavBasedSematicCoreEntity, - modelVersion, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang))) - case ((Entity.SemanticCore(SemanticCoreEntity(entityId, _)), lang), ScoreType.FollowScore) => - SimClustersEmbeddingId( - EmbeddingType.FollowBasedSematicCoreEntity, - modelVersion, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang))) - case ((Entity.SemanticCore(SemanticCoreEntity(entityId, _)), lang), ScoreType.LogFavScore) => - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedLocaleSemanticCoreEntity, - modelVersion, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang))) - case ((Entity.Hashtag(Hashtag(hashtag)), _), ScoreType.FavScore) => - SimClustersEmbeddingId( - EmbeddingType.FavBasedHashtagEntity, - modelVersion, - InternalId.Hashtag(hashtag)) - case ((Entity.Hashtag(Hashtag(hashtag)), _), ScoreType.FollowScore) => - SimClustersEmbeddingId( - EmbeddingType.FollowBasedHashtagEntity, - modelVersion, - InternalId.Hashtag(hashtag)) - case (scoreType, entity) => - throw new IllegalArgumentException( - s"(ScoreType, Entity) ($scoreType, ${entity.toString}) not supported") - } - - /** - * Generates the output path for the Entity Embeddings Job. 
- * - * Example Adhoc: /user/recos-platform/processed/adhoc/simclusters_embeddings/hashtag_per_language/model_20m_145k_updated - * Example Prod: /atla/proc/user/cassowary/processed/simclusters_embeddings/semantic_core_per_language/model_20m_145k_updated - * - */ - def getHdfsPath( - isAdhoc: Boolean, - isManhattanKeyVal: Boolean, - isReverseIndex: Boolean, - isLogFav: Boolean, - modelVersion: ModelVersion, - entityType: EntityType - ): String = { - - val reverseIndex = if (isReverseIndex) "reverse_index/" else "" - - val logFav = if (isLogFav) "log_fav/" else "" - - val entityTypeSuffix = entityType match { - case EntityType.SemanticCore => "semantic_core_per_language" - case EntityType.Hashtag => "hashtag_per_language" - case _ => "unknown_per_language" - } - - val pathSuffix = s"$logFav$reverseIndex$entityTypeSuffix" - - EmbeddingUtil.getHdfsPath(isAdhoc, isManhattanKeyVal, modelVersion, pathSuffix) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala deleted file mode 100644 index e78299d66..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala +++ /dev/null @@ -1,701 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil._ -import com.twitter.simclusters_v2.scalding.embedding.common.SimClustersEmbeddingJob -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} -import java.util.TimeZone - -object ProducerEmbeddingsFromInterestedInBatchAppUtil { - import ProducerEmbeddingsFromInterestedIn._ - - val user = System.getenv("USER") - - val rootPath: String = s"/user/$user/manhattan_sequence_files" - - // Helps speed up the multiplication step which can get very big - val numReducersForMatrixMultiplication: Int = 12000 - - /** - * Given the producer x cluster matrix, key by producer / cluster individually, and write output - * to individual DAL datasets - */ - def writeOutput( - producerClusterEmbedding: TypedPipe[((ClusterId, UserId), Double)], - producerTopKEmbeddingsDataset: KeyValDALDataset[KeyVal[Long, TopSimClustersWithScore]], - clusterTopKProducersDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ], - producerTopKEmbeddingsPath: String, - clusterTopKProducersPath: String, - modelVersion: ModelVersion - ): Execution[Unit] = { - val keyedByProducer = - toSimClusterEmbedding(producerClusterEmbedding, topKClustersToKeep, modelVersion) - .map { case (userId, clusters) => KeyVal(userId, clusters) } - .writeDALVersionedKeyValExecution( - producerTopKEmbeddingsDataset, - D.Suffix(producerTopKEmbeddingsPath) - ) - - val keyedBySimCluster = fromSimClusterEmbedding( - producerClusterEmbedding, - topKUsersToKeep, - modelVersion - ).map { - case (clusterId, topProducers) => KeyVal(clusterId, topProducersToThrift(topProducers)) - } - .writeDALVersionedKeyValExecution( - clusterTopKProducersDataset, - D.Suffix(clusterTopKProducersPath) - ) - - 
Execution.zip(keyedByProducer, keyedBySimCluster).unit - } -} - -/** - * Base class for Fav based producer embeddings. Helps reuse the code for different model versions - */ -trait ProducerEmbeddingsFromInterestedInByFavScoreBase extends ScheduledExecutionApp { - import ProducerEmbeddingsFromInterestedIn._ - import ProducerEmbeddingsFromInterestedInBatchAppUtil._ - - def modelVersion: ModelVersion - - val producerTopKEmbeddingsByFavScorePathPrefix: String = - "/producer_top_k_simcluster_embeddings_by_fav_score_" - - val clusterTopKProducersByFavScorePathPrefix: String = - "/simcluster_embedding_top_k_producers_by_fav_score_" - - val minNumFavers: Int = minNumFaversForProducer - - def producerTopKSimclusterEmbeddingsByFavScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] - - def simclusterEmbeddingTopKProducersByFavScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] - - def getInterestedInFn: (DateRange, TimeZone) => TypedPipe[(Long, ClustersUserIsInterestedIn)] - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val producerTopKEmbeddingsByFavScorePathUpdated: String = - rootPath + producerTopKEmbeddingsByFavScorePathPrefix + ModelVersions - .toKnownForModelVersion(modelVersion) - - val clusterTopKProducersByFavScorePathUpdated: String = - rootPath + clusterTopKProducersByFavScorePathPrefix + ModelVersions - .toKnownForModelVersion(modelVersion) - - val producerClusterEmbeddingByFavScore = getProducerClusterEmbedding( - getInterestedInFn(dateRange.embiggen(Days(5)), timeZone), - DataSources.userUserNormalizedGraphSource, - DataSources.userNormsAndCounts, - userToProducerFavScore, - userToClusterFavScore, // Fav score - _.faverCount.exists(_ > minNumFavers), - numReducersForMatrixMultiplication, - modelVersion, - cosineSimilarityThreshold - ).forceToDisk - - writeOutput( - producerClusterEmbeddingByFavScore, - producerTopKSimclusterEmbeddingsByFavScoreDataset, - simclusterEmbeddingTopKProducersByFavScoreDataset, - producerTopKEmbeddingsByFavScorePathUpdated, - clusterTopKProducersByFavScorePathUpdated, - modelVersion - ) - } -} - -/** - * Base class for Follow based producer embeddings. 
Helps reuse the code for different model versions - */ -trait ProducerEmbeddingsFromInterestedInByFollowScoreBase extends ScheduledExecutionApp { - import ProducerEmbeddingsFromInterestedIn._ - import ProducerEmbeddingsFromInterestedInBatchAppUtil._ - - def modelVersion: ModelVersion - - val producerTopKEmbeddingsByFollowScorePathPrefix: String = - "/producer_top_k_simcluster_embeddings_by_follow_score_" - - val clusterTopKProducersByFollowScorePathPrefix: String = - "/simcluster_embedding_top_k_producers_by_follow_score_" - - def producerTopKSimclusterEmbeddingsByFollowScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] - - def simclusterEmbeddingTopKProducersByFollowScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] - - def getInterestedInFn: (DateRange, TimeZone) => TypedPipe[(Long, ClustersUserIsInterestedIn)] - - val minNumFollowers: Int = minNumFollowersForProducer - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val producerTopKEmbeddingsByFollowScorePath: String = - rootPath + producerTopKEmbeddingsByFollowScorePathPrefix + ModelVersions - .toKnownForModelVersion(modelVersion) - - val clusterTopKProducersByFollowScorePath: String = - rootPath + clusterTopKProducersByFollowScorePathPrefix + ModelVersions - .toKnownForModelVersion(modelVersion) - - val producerClusterEmbeddingByFollowScore = getProducerClusterEmbedding( - getInterestedInFn(dateRange.embiggen(Days(5)), timeZone), - DataSources.userUserNormalizedGraphSource, - DataSources.userNormsAndCounts, - userToProducerFollowScore, - userToClusterFollowScore, // Follow score - _.followerCount.exists(_ > minNumFollowers), - numReducersForMatrixMultiplication, - modelVersion, - cosineSimilarityThreshold - ).forceToDisk - - writeOutput( - producerClusterEmbeddingByFollowScore, - producerTopKSimclusterEmbeddingsByFollowScoreDataset, - simclusterEmbeddingTopKProducersByFollowScoreDataset, - producerTopKEmbeddingsByFollowScorePath, - clusterTopKProducersByFollowScorePath, - modelVersion - ) - } -} - -/** - capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_fav_score \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFavScoreBatchApp - extends ProducerEmbeddingsFromInterestedInByFavScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedInUpdatedSource - - override val firstTime: RichDate = RichDate("2019-09-10") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFavScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFavScoreUpdatedScalaDataset - - override def simclusterEmbeddingTopKProducersByFavScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFavScoreUpdatedScalaDataset -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_fav_score_2020 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFavScore2020BatchApp - extends 
ProducerEmbeddingsFromInterestedInByFavScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedIn2020Source - - override val firstTime: RichDate = RichDate("2021-03-01") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFavScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFavScore2020ScalaDataset - - override def simclusterEmbeddingTopKProducersByFavScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFavScore2020ScalaDataset -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_fav_score_dec11 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFavScoreDec11BatchApp - extends ProducerEmbeddingsFromInterestedInByFavScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145kDec11 - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedInDec11Source - - override val firstTime: RichDate = RichDate("2019-11-18") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFavScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFavScoreScalaDataset - - override def simclusterEmbeddingTopKProducersByFavScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFavScoreScalaDataset -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_follow_score \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFollowScoreBatchApp - extends ProducerEmbeddingsFromInterestedInByFollowScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedInUpdatedSource - - override val firstTime: RichDate = RichDate("2019-09-10") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFollowScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFollowScoreUpdatedScalaDataset - - override def simclusterEmbeddingTopKProducersByFollowScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFollowScoreUpdatedScalaDataset -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_follow_score_2020 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFollowScore2020BatchApp - extends ProducerEmbeddingsFromInterestedInByFollowScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, 
ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedIn2020Source - - override val firstTime: RichDate = RichDate("2021-03-01") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFollowScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFollowScore2020ScalaDataset - - override def simclusterEmbeddingTopKProducersByFollowScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFollowScore2020ScalaDataset -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron producer_embeddings_from_interested_in_by_follow_score_dec11 \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ProducerEmbeddingsFromInterestedInByFollowScoreDec11BatchApp - extends ProducerEmbeddingsFromInterestedInByFollowScoreBase { - override def modelVersion: ModelVersion = ModelVersion.Model20m145kDec11 - - override def getInterestedInFn: ( - DateRange, - TimeZone - ) => TypedPipe[(UserId, ClustersUserIsInterestedIn)] = - InterestedInSources.simClustersInterestedInDec11Source - - override val firstTime: RichDate = RichDate("2019-11-18") - - override val batchIncrement: Duration = Days(7) - - override def producerTopKSimclusterEmbeddingsByFollowScoreDataset: KeyValDALDataset[ - KeyVal[Long, TopSimClustersWithScore] - ] = - ProducerTopKSimclusterEmbeddingsByFollowScoreScalaDataset - - override def simclusterEmbeddingTopKProducersByFollowScoreDataset: KeyValDALDataset[ - KeyVal[PersistedFullClusterId, TopProducersWithScore] - ] = - SimclusterEmbeddingTopKProducersByFollowScoreScalaDataset -} - -/** - * Adhoc job to calculate producer's simcluster embeddings, which essentially assigns interestedIn - * SimClusters to each producer, regardless of whether the producer has a knownFor assignment. 
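// Editorial sketch (not part of the original file): the fav- and follow-score batch apps above
// share one pipeline and differ only in the scoring functions and datasets they plug in. The
// case classes and function names below are simplified, illustrative stand-ins for
// NeighborWithWeights / UserToInterestedInClusterScores, not the production API.
object ScoringVariantsSketch {

  final case class EngagementWeights(favScore: Option[Double], followScore: Option[Double])
  final case class ClusterScores(favScore: Option[Double], followScore: Option[Double])

  // Fav-based variant: weight user->producer edges and user->cluster scores by favorites.
  val favProducerScore: EngagementWeights => Double = _.favScore.getOrElse(0.0)
  val favClusterScore: ClusterScores => Double = _.favScore.getOrElse(0.0)

  // Follow-based variant: weight the same edges and scores by follows instead.
  val followProducerScore: EngagementWeights => Double = _.followScore.getOrElse(0.0)
  val followClusterScore: ClusterScores => Double = _.followScore.getOrElse(0.0)

  // The shared pipeline only ever sees the two functions, so adding a new variant
  // (e.g. a new model version) means supplying new inputs, not new pipeline logic.
  def producerContribution(
    edge: EngagementWeights,
    cluster: ClusterScores,
    edgeScore: EngagementWeights => Double,
    clusterScore: ClusterScores => Double
  ): Double = edgeScore(edge) * clusterScore(cluster)

  def main(args: Array[String]): Unit = {
    val edge = EngagementWeights(favScore = Some(0.8), followScore = Some(0.5))
    val cluster = ClusterScores(favScore = Some(0.3), followScore = Some(0.9))
    println(producerContribution(edge, cluster, favProducerScore, favClusterScore))       // ~0.24
    println(producerContribution(edge, cluster, followProducerScore, followClusterScore)) // 0.45
  }
}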
- * -$ ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:producer_embeddings_from_interested_in-adhoc - - $ scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedInAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding:producer_embeddings_from_interested_in-adhoc \ - --user cassowary --cluster bluebird-qus1 \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - -- --date 2020-08-25 --model_version 20M_145K_updated \ - --outputDir /gcs/user/cassowary/adhoc/producerEmbeddings/ - - */ -object ProducerEmbeddingsFromInterestedInAdhocApp extends AdhocExecutionApp { - - import ProducerEmbeddingsFromInterestedIn._ - - private val numReducersForMatrixMultiplication = 12000 - - /** - * Calculate the embedding and writes the results keyed by producers and clusters separately into - * individual locations - */ - private def runAdhocByScore( - interestedInClusters: TypedPipe[(Long, ClustersUserIsInterestedIn)], - userUserNormalGraph: TypedPipe[UserAndNeighbors], - userNormsAndCounts: TypedPipe[NormsAndCounts], - keyedByProducerSinkPath: String, - keyedByClusterSinkPath: String, - userToProducerScoringFn: NeighborWithWeights => Double, - userToClusterScoringFn: UserToInterestedInClusterScores => Double, - userFilter: NormsAndCounts => Boolean, - modelVersion: ModelVersion - )( - implicit uniqueID: UniqueID - ): Execution[Unit] = { - - val producerClusterEmbedding = getProducerClusterEmbedding( - interestedInClusters, - userUserNormalGraph, - userNormsAndCounts, - userToProducerScoringFn, - userToClusterScoringFn, - userFilter, - numReducersForMatrixMultiplication, - modelVersion, - cosineSimilarityThreshold - ).forceToDisk - - val keyByProducerExec = - toSimClusterEmbedding(producerClusterEmbedding, topKClustersToKeep, modelVersion) - .writeExecution( - AdhocKeyValSources.topProducerToClusterEmbeddingsSource(keyedByProducerSinkPath)) - - val keyByClusterExec = - fromSimClusterEmbedding(producerClusterEmbedding, topKUsersToKeep, modelVersion) - .map { case (clusterId, topProducers) => (clusterId, topProducersToThrift(topProducers)) } - .writeExecution( - AdhocKeyValSources.topClusterEmbeddingsToProducerSource(keyedByClusterSinkPath)) - - Execution.zip(keyByProducerExec, keyByClusterExec).unit - } - - // Calculate the embeddings using follow scores - private def runFollowScore( - interestedInClusters: TypedPipe[(Long, ClustersUserIsInterestedIn)], - userUserNormalGraph: TypedPipe[UserAndNeighbors], - userNormsAndCounts: TypedPipe[NormsAndCounts], - modelVersion: ModelVersion, - outputDir: String - )( - implicit uniqueID: UniqueID - ): Execution[Unit] = { - val keyByClusterSinkPath = outputDir + "keyedByCluster/byFollowScore_" + modelVersion - val keyByProducerSinkPath = outputDir + "keyedByProducer/byFollowScore_" + modelVersion - - runAdhocByScore( - interestedInClusters, - userUserNormalGraph, - userNormsAndCounts, - keyedByProducerSinkPath = keyByProducerSinkPath, - keyedByClusterSinkPath = keyByClusterSinkPath, - userToProducerScoringFn = userToProducerFollowScore, - userToClusterScoringFn = userToClusterFollowScore, - _.followerCount.exists(_ > minNumFollowersForProducer), - modelVersion - ) - } - - // Calculate the embeddings using fav scores - private def runFavScore( - interestedInClusters: TypedPipe[(Long, ClustersUserIsInterestedIn)], - userUserNormalGraph: TypedPipe[UserAndNeighbors], - userNormsAndCounts: 
TypedPipe[NormsAndCounts], - modelVersion: ModelVersion, - outputDir: String - )( - implicit uniqueID: UniqueID - ): Execution[Unit] = { - val keyByClusterSinkPath = outputDir + "keyedByCluster/byFavScore_" + modelVersion - val keyByProducerSinkPath = outputDir + "keyedByProducer/byFavScore_" + modelVersion - - runAdhocByScore( - interestedInClusters, - userUserNormalGraph, - userNormsAndCounts, - keyedByProducerSinkPath = keyByProducerSinkPath, - keyedByClusterSinkPath = keyByClusterSinkPath, - userToProducerScoringFn = userToProducerFavScore, - userToClusterScoringFn = userToClusterFavScore, - _.faverCount.exists(_ > minNumFaversForProducer), - modelVersion - ) - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val outputDir = args("outputDir") - - val modelVersion = - ModelVersions.toModelVersion(args.required("model_version")) - - val interestedInClusters = modelVersion match { - case ModelVersion.Model20m145k2020 => - InterestedInSources.simClustersInterestedIn2020Source(dateRange, timeZone).forceToDisk - case ModelVersion.Model20m145kUpdated => - InterestedInSources.simClustersInterestedInUpdatedSource(dateRange, timeZone).forceToDisk - case _ => - InterestedInSources.simClustersInterestedInDec11Source(dateRange, timeZone).forceToDisk - } - - Execution - .zip( - runFavScore( - interestedInClusters, - DataSources.userUserNormalizedGraphSource, - DataSources.userNormsAndCounts, - modelVersion, - outputDir - ), - runFollowScore( - interestedInClusters, - DataSources.userUserNormalizedGraphSource, - DataSources.userNormsAndCounts, - modelVersion, - outputDir - ) - ).unit - } -} - -/** - * Computes the producer's interestedIn cluster embedding. i.e. If a tweet author (producer) is not - * associated with a KnownFor cluster, do a cross-product between - * [user, interestedIn] and [user, producer] to find the similarity matrix [interestedIn, producer]. 
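// Editorial sketch (plain Scala, not the Scalding job): the cross-product described in the
// comment above, computed on in-memory maps. userToClusters plays the role of the
// [user, interestedIn] matrix and userToProducers the [user, producer] engagement matrix; the
// result is producer -> (clusterId -> score), thresholded the way the job filters on
// cosineSimilarityThreshold. All data in main is made up for illustration.
object ProducerClusterCrossProductSketch {

  def producerClusterEmbedding(
    userToClusters: Map[Long, Map[Int, Double]],   // user -> interestedIn cluster scores
    userToProducers: Map[Long, Map[Long, Double]], // user -> engaged producer scores
    threshold: Double
  ): Map[Long, Map[Int, Double]] = {
    // Each (user, producer, cluster) triple contributes edgeScore * clusterScore.
    val contributions = for {
      (user, clusters) <- userToClusters.toSeq
      producers <- userToProducers.get(user).toSeq
      (producer, edgeScore) <- producers.toSeq
      (clusterId, clusterScore) <- clusters.toSeq
    } yield (producer, clusterId, edgeScore * clusterScore)

    contributions
      .groupBy { case (producer, clusterId, _) => (producer, clusterId) }
      .map { case ((producer, clusterId), rows) => (producer, clusterId, rows.map(_._3).sum) }
      .filter { case (_, _, score) => score >= threshold }
      .groupBy { case (producer, _, _) => producer }
      .map { case (producer, rows) =>
        producer -> rows.map { case (_, clusterId, score) => clusterId -> score }.toMap
      }
  }

  def main(args: Array[String]): Unit = {
    val interestedIn = Map(1L -> Map(10 -> 0.6, 11 -> 0.4), 2L -> Map(10 -> 0.9))
    val engagements = Map(1L -> Map(100L -> 1.0), 2L -> Map(100L -> 0.5, 200L -> 1.0))
    // Producer 100 gets cluster 10 from both engaged users: 1.0 * 0.6 + 0.5 * 0.9 = 1.05.
    println(producerClusterEmbedding(interestedIn, engagements, threshold = 0.01))
  }
}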
- */ -object ProducerEmbeddingsFromInterestedIn { - val minNumFollowersForProducer: Int = 100 - val minNumFaversForProducer: Int = 100 - val topKUsersToKeep: Int = 300 - val topKClustersToKeep: Int = 60 - val cosineSimilarityThreshold: Double = 0.01 - - type ClusterId = Int - - def topProducersToThrift(producersWithScore: Seq[(UserId, Double)]): TopProducersWithScore = { - val thrift = producersWithScore.map { producer => - TopProducerWithScore(producer._1, producer._2) - } - TopProducersWithScore(thrift) - } - - def userToProducerFavScore(neighbor: NeighborWithWeights): Double = { - neighbor.favScoreHalfLife100DaysNormalizedByNeighborFaversL2.getOrElse(0.0) - } - - def userToProducerFollowScore(neighbor: NeighborWithWeights): Double = { - neighbor.followScoreNormalizedByNeighborFollowersL2.getOrElse(0.0) - } - - def userToClusterFavScore(clusterScore: UserToInterestedInClusterScores): Double = { - clusterScore.favScoreClusterNormalizedOnly.getOrElse(0.0) - } - - def userToClusterFollowScore(clusterScore: UserToInterestedInClusterScores): Double = { - clusterScore.followScoreClusterNormalizedOnly.getOrElse(0.0) - } - - def getUserSimClustersMatrix( - simClustersSource: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - extractScore: UserToInterestedInClusterScores => Double, - modelVersion: ModelVersion - ): TypedPipe[(UserId, Seq[(Int, Double)])] = { - simClustersSource.collect { - case (userId, clusters) - if ModelVersions.toModelVersion(clusters.knownForModelVersion).equals(modelVersion) => - userId -> clusters.clusterIdToScores - .map { - case (clusterId, clusterScores) => - (clusterId, extractScore(clusterScores)) - }.toSeq.filter(_._2 > 0) - } - } - - /** - * Given a weighted user-producer engagement history matrix, as well as a - * weighted user-interestedInCluster matrix, do the matrix multiplication to yield a weighted - * producer-cluster embedding matrix - */ - def getProducerClusterEmbedding( - interestedInClusters: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - userProducerEngagementGraph: TypedPipe[UserAndNeighbors], - userNormsAndCounts: TypedPipe[NormsAndCounts], - userToProducerScoringFn: NeighborWithWeights => Double, - userToClusterScoringFn: UserToInterestedInClusterScores => Double, - userFilter: NormsAndCounts => Boolean, // function to decide whether to compute embeddings for the user or not - numReducersForMatrixMultiplication: Int, - modelVersion: ModelVersion, - threshold: Double - )( - implicit uid: UniqueID - ): TypedPipe[((ClusterId, UserId), Double)] = { - val userSimClustersMatrix = getUserSimClustersMatrix( - interestedInClusters, - userToClusterScoringFn, - modelVersion - ) - - val userUserNormalizedGraph = getFilteredUserUserNormalizedGraph( - userProducerEngagementGraph, - userNormsAndCounts, - userToProducerScoringFn, - userFilter - ) - - SimClustersEmbeddingJob - .legacyMultiplyMatrices( - userUserNormalizedGraph, - userSimClustersMatrix, - numReducersForMatrixMultiplication - ) - .filter(_._2 >= threshold) - } - - def getFilteredUserUserNormalizedGraph( - userProducerEngagementGraph: TypedPipe[UserAndNeighbors], - userNormsAndCounts: TypedPipe[NormsAndCounts], - userToProducerScoringFn: NeighborWithWeights => Double, - userFilter: NormsAndCounts => Boolean - )( - implicit uid: UniqueID - ): TypedPipe[(UserId, (UserId, Double))] = { - val numUsersCount = Stat("num_users_with_engagements") - val userUserFilteredEdgeCount = Stat("num_filtered_user_user_engagements") - val validUsersCount = Stat("num_valid_users") - - val validUsers = 
userNormsAndCounts.collect { - case user if userFilter(user) => - validUsersCount.inc() - user.userId - } - - userProducerEngagementGraph - .flatMap { userAndNeighbors => - numUsersCount.inc() - userAndNeighbors.neighbors - .map { neighbor => - userUserFilteredEdgeCount.inc() - (neighbor.neighborId, (userAndNeighbors.userId, userToProducerScoringFn(neighbor))) - } - .filter(_._2._2 > 0.0) - } - .join(validUsers.asKeys) - .map { - case (neighborId, ((userId, score), _)) => - (userId, (neighborId, score)) - } - } - - def fromSimClusterEmbedding[T, E]( - resultMatrix: TypedPipe[((ClusterId, T), Double)], - topK: Int, - modelVersion: ModelVersion - ): TypedPipe[(PersistedFullClusterId, Seq[(T, Double)])] = { - resultMatrix - .map { - case ((clusterId, inputId), score) => (clusterId, (inputId, score)) - } - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - .map { - case (clusterId, topEntitiesWithScore) => - PersistedFullClusterId(modelVersion, clusterId) -> topEntitiesWithScore - } - } - - def toSimClusterEmbedding[T]( - resultMatrix: TypedPipe[((ClusterId, T), Double)], - topK: Int, - modelVersion: ModelVersion - )( - implicit ordering: Ordering[T] - ): TypedPipe[(T, TopSimClustersWithScore)] = { - resultMatrix - .map { - case ((clusterId, inputId), score) => (inputId, (clusterId, score)) - } - .group - //.withReducers(3000) // uncomment for producer-simclusters job - .sortedReverseTake(topK)(Ordering.by(_._2)) - .map { - case (inputId, topSimClustersWithScore) => - val topSimClusters = topSimClustersWithScore.map { - case (clusterId, score) => SimClusterWithScore(clusterId, score) - } - inputId -> TopSimClustersWithScore(topSimClusters, modelVersion) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/SimilarUsersBySimClustersEmbedding.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/SimilarUsersBySimClustersEmbedding.scala deleted file mode 100644 index c530614f7..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/SimilarUsersBySimClustersEmbedding.scala +++ /dev/null @@ -1,299 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding - -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.hermit.candidate.thriftscala.Candidate -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.scalding._ -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2._ -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.CosineSimilarityUtil -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron similar_users_by_simclusters_embeddings_job \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object SimilarUsersBySimClustersEmbeddingBatchApp extends ScheduledExecutionApp { - - override val firstTime: RichDate = RichDate("2019-07-10") - - override val batchIncrement: Duration = Days(7) - - private val outputByFav = - "/user/cassowary/manhattan_sequence_files/similar_users_by_simclusters_embeddings/by_fav" - private 
val outputByFollow = - "/user/cassowary/manhattan_sequence_files/similar_users_by_simclusters_embeddings/by_follow" - - private implicit val valueInj: CompactScalaCodec[Candidates] = CompactScalaCodec(Candidates) - - private val topClusterEmbeddingsByFavScore = DAL - .readMostRecentSnapshotNoOlderThan( - ProducerTopKSimclusterEmbeddingsByFavScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { clusterScorePair => clusterScorePair.key -> clusterScorePair.value } - - private val topProducersForClusterEmbeddingByFavScore = DAL - .readMostRecentSnapshotNoOlderThan( - SimclusterEmbeddingTopKProducersByFavScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { producerScoresPair => producerScoresPair.key -> producerScoresPair.value } - - private val topClusterEmbeddingsByFollowScore = DAL - .readMostRecentSnapshotNoOlderThan( - ProducerTopKSimclusterEmbeddingsByFollowScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { clusterScorePair => clusterScorePair.key -> clusterScorePair.value } - - private val topProducersForClusterEmbeddingByFollowScore = DAL - .readMostRecentSnapshotNoOlderThan( - SimclusterEmbeddingTopKProducersByFollowScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { producerScoresPair => producerScoresPair.key -> producerScoresPair.value } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - Execution - .zip( - SimilarUsersBySimClustersEmbedding - .getTopUsersRelatedToUser( - topClusterEmbeddingsByFavScore, - topProducersForClusterEmbeddingByFavScore - ) - .map { case (key, value) => KeyVal(key, value) } - .writeDALVersionedKeyValExecution( - SimilarUsersByFavBasedProducerEmbeddingScalaDataset, - D.Suffix(outputByFav) - ), - SimilarUsersBySimClustersEmbedding - .getTopUsersRelatedToUser( - topClusterEmbeddingsByFollowScore, - topProducersForClusterEmbeddingByFollowScore - ) - .map { case (key, value) => KeyVal(key, value) } - .writeDALVersionedKeyValExecution( - SimilarUsersByFollowBasedProducerEmbeddingScalaDataset, - D.Suffix(outputByFollow) - ) - ).unit - } -} - -/** - * Adhoc job to calculate producer's simcluster embeddings, which essentially assigns interestedIn - * SimClusters to each producer, regardless of whether the producer has a knownFor assignment. 
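// Editorial sketch: the batch and adhoc apps in this file both delegate to
// SimilarUsersBySimClustersEmbedding.getTopUsersRelatedToUser, which L2-normalizes each
// user's cluster embedding and ranks candidates by dot product (cosine similarity). The
// plain-Scala code below illustrates that scoring step only; it is not the
// CosineSimilarityUtil implementation, and the data in main is invented.
object CosineSimilaritySketch {

  def l2Normalize(embedding: Map[Int, Double]): Map[Int, Double] = {
    val norm = math.sqrt(embedding.values.map(v => v * v).sum)
    if (norm == 0.0) embedding
    else embedding.map { case (clusterId, score) => clusterId -> score / norm }
  }

  def dotProduct(a: Map[Int, Double], b: Map[Int, Double]): Double =
    a.iterator.map { case (clusterId, score) => score * b.getOrElse(clusterId, 0.0) }.sum

  // Rank candidate users against one target user and keep the top K by cosine similarity.
  def topSimilar(
    target: Map[Int, Double],
    candidates: Map[Long, Map[Int, Double]],
    topK: Int
  ): Seq[(Long, Double)] = {
    val normalizedTarget = l2Normalize(target)
    candidates.toSeq
      .map { case (userId, emb) => userId -> dotProduct(normalizedTarget, l2Normalize(emb)) }
      .sortBy { case (_, score) => -score }
      .take(topK)
  }

  def main(args: Array[String]): Unit = {
    val target = Map(1 -> 0.8, 2 -> 0.2)
    val candidates = Map(42L -> Map(1 -> 0.7, 2 -> 0.3), 43L -> Map(3 -> 1.0))
    // User 42 shares clusters with the target and scores high; user 43 scores 0.0.
    println(topSimilar(target, candidates, topK = 10))
  }
}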
- * -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding:similar_users_by_simclusters_embeddings-adhoc && \ - oscar hdfs --user recos-platform --screen --tee similar_users_by_simclusters_embeddings --bundle similar_users_by_simclusters_embeddings-adhoc \ - --tool com.twitter.simclusters_v2.scalding.embedding.SimilarUsersBySimClustersEmbeddingAdhocApp \ - -- --date 2019-07-10T00 2019-07-10T23 - */ -object SimilarUsersBySimClustersEmbeddingAdhocApp extends AdhocExecutionApp { - - private val outputByFav = - "/user/recos-platform/adhoc/similar_users_by_simclusters_embeddings/by_fav" - private val outputByFollow = - "/user/recos-platform/adhoc/similar_users_by_simclusters_embeddings/by_follow" - - private val topClusterEmbeddingsByFavScore = DAL - .readMostRecentSnapshotNoOlderThan( - ProducerTopKSimclusterEmbeddingsByFavScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { clusterScorePair => clusterScorePair.key -> clusterScorePair.value } - - private val topProducersForClusterEmbeddingByFavScore = DAL - .readMostRecentSnapshotNoOlderThan( - SimclusterEmbeddingTopKProducersByFavScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { producerScoresPair => producerScoresPair.key -> producerScoresPair.value } - - private val topClusterEmbeddingsByFollowScore = DAL - .readMostRecentSnapshotNoOlderThan( - ProducerTopKSimclusterEmbeddingsByFollowScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { clusterScorePair => clusterScorePair.key -> clusterScorePair.value } - - private val topProducersForClusterEmbeddingByFollowScore = DAL - .readMostRecentSnapshotNoOlderThan( - SimclusterEmbeddingTopKProducersByFollowScoreUpdatedScalaDataset, - Days(14) - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { producerScoresPair => producerScoresPair.key -> producerScoresPair.value } - - implicit val candidatesInj: CompactScalaCodec[Candidates] = CompactScalaCodec(Candidates) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - Execution - .zip( - SimilarUsersBySimClustersEmbedding - .getTopUsersRelatedToUser( - topClusterEmbeddingsByFavScore, - topProducersForClusterEmbeddingByFavScore).writeExecution( - VersionedKeyValSource[Long, Candidates](outputByFav)) - .getCounters - .flatMap { - case (_, counters) => - counters.toMap.toSeq - .sortBy(e => (e._1.group, e._1.counter)) - .foreach { - case (statKey, value) => - println(s"${statKey.group}\t${statKey.counter}\t$value") - } - Execution.unit - }, - SimilarUsersBySimClustersEmbedding - .getTopUsersRelatedToUser( - topClusterEmbeddingsByFollowScore, - topProducersForClusterEmbeddingByFollowScore).writeExecution( - VersionedKeyValSource[Long, Candidates](outputByFollow)) - .getCounters - .flatMap { - case (_, counters) => - counters.toMap.toSeq - .sortBy(e => (e._1.group, e._1.counter)) - .foreach { - case (statKey, value) => - println(s"${statKey.group}\t${statKey.counter}\t$value") - } - Execution.unit - } - ).unit - } -} - -object SimilarUsersBySimClustersEmbedding { - private val maxUsersPerCluster = 300 - private val maxClustersPerUser = 50 - private val topK = 100 - - def getTopUsersRelatedToUser( - clusterScores: TypedPipe[(Long, TopSimClustersWithScore)], - producerScores: TypedPipe[(PersistedFullClusterId, TopProducersWithScore)] - )( 
- implicit uniqueID: UniqueID - ): TypedPipe[(Long, Candidates)] = { - - val numUserUserPair = Stat("num_user_producer_pairs") - val numUserClusterPair = Stat("num_user_cluster_pairs") - val numClusterProducerPair = Stat("num_cluster_producer_pairs") - - val clusterToUserMap = - clusterScores.flatMap { - case (userId, topSimClustersWithScore) => - val targetUserClusters = - topSimClustersWithScore.topClusters.sortBy(-_.score).take(maxClustersPerUser) - - targetUserClusters.map { simClusterWithScore => - numUserClusterPair.inc() - simClusterWithScore.clusterId -> userId - } - } - - val clusterToProducerMap = producerScores.flatMap { - case (persistedFullClusterId, topProducersWithScore) => - numClusterProducerPair.inc() - val targetProducers = topProducersWithScore.topProducers - .sortBy(-_.score) - .take(maxUsersPerCluster) - targetProducers.map { topProducerWithScore => - persistedFullClusterId.clusterId -> topProducerWithScore.userId - } - } - - implicit val intInject: Int => Array[Byte] = Injection.int2BigEndian.toFunction - - val userToProducerMap = - clusterToUserMap.group - .sketch(2000) - .join(clusterToProducerMap.group) - .values - .distinct - .collect({ - //filter self-pair - case userPair if userPair._1 != userPair._2 => - numUserUserPair.inc() - userPair - }) - - val userEmbeddingsAllGrouped = clusterScores.map { - case (userId, topSimClustersWithScore) => - val targetUserClusters = - topSimClustersWithScore.topClusters.sortBy(-_.score).take(maxClustersPerUser) - val embedding = targetUserClusters.map { simClustersWithScore => - simClustersWithScore.clusterId -> simClustersWithScore.score - }.toMap - val embeddingNormalized = CosineSimilarityUtil.normalize(embedding) - userId -> embeddingNormalized - }.forceToDisk - - val userToProducerMapJoinWithEmbedding = - userToProducerMap - .join(userEmbeddingsAllGrouped) - .map { - case (user, (producer, userEmbedding)) => - producer -> (user, userEmbedding) - } - .join(userEmbeddingsAllGrouped) - .map { - case (producer, ((user, userEmbedding), producerEmbedding)) => - user -> (producer, CosineSimilarityUtil.dotProduct(userEmbedding, producerEmbedding)) - } - .group - .sortWithTake(topK)((a, b) => a._2 > b._2) - .map { - case (userId, candidatesList) => - val candidatesSeq = candidatesList - .map { - case (candidateId, score) => Candidate(candidateId, score) - } - userId -> Candidates(userId, candidatesSeq) - } - - userToProducerMapJoinWithEmbedding - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AbuseSimclusterFeaturesScaldingJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AbuseSimclusterFeaturesScaldingJob.scala deleted file mode 100644 index a1d11e2a2..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AbuseSimclusterFeaturesScaldingJob.scala +++ /dev/null @@ -1,178 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.twitter.scalding._ -import com.twitter.scalding.source.TypedText -import com.twitter.scalding_internal.dalv2.DALWrite.{D, _} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.SearchAbuseSimclusterFeaturesManhattanScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.abuse.AbuseSimclusterFeaturesScaldingJob.buildKeyValDataSet -import com.twitter.simclusters_v2.scalding.embedding.abuse.AdhocAbuseSimClusterFeaturesScaldingJob.{ - 
abuseInteractionSearchGraph, - buildSearchAbuseScores, - impressionInteractionSearchGraph -} -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.getUserInterestedInSparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.{ClusterId, UserId} -import com.twitter.simclusters_v2.thriftscala.{ - ModelVersion, - SimClustersEmbedding, - SingleSideUserScores -} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} -import java.util.TimeZone - -object AbuseSimclusterFeaturesScaldingJob { - - val HealthyConsumerKey = "healthyConsumer" - val UnhealthyConsumerKey = "unhealthyConsumer" - val HealthyAuthorKey = "healthyAuthor" - val UnhealthyAuthorKey = "unhealthyAuthor" - - private[this] val EmptySimCluster = SimClustersEmbedding(List()) - - def buildKeyValDataSet( - normalizedSimClusterMatrix: SparseMatrix[UserId, ClusterId, Double], - unhealthyGraph: SparseMatrix[UserId, UserId, Double], - healthyGraph: SparseMatrix[UserId, UserId, Double] - ): TypedPipe[KeyVal[Long, SingleSideUserScores]] = { - - val searchAbuseScores = - buildSearchAbuseScores( - normalizedSimClusterMatrix, - unhealthyGraph = unhealthyGraph, - healthyGraph = healthyGraph - ) - - val pairedScores = SingleSideInteractionTransformation.pairScores( - Map( - HealthyConsumerKey -> searchAbuseScores.healthyConsumerClusterScores, - UnhealthyConsumerKey -> searchAbuseScores.unhealthyConsumerClusterScores, - HealthyAuthorKey -> searchAbuseScores.healthyAuthorClusterScores, - UnhealthyAuthorKey -> searchAbuseScores.unhealthyAuthorClusterScores - ) - ) - - pairedScores - .map { pairedScore => - val userPairInteractionFeatures = PairedInteractionFeatures( - healthyInteractionSimClusterEmbedding = - pairedScore.interactionScores.getOrElse(HealthyConsumerKey, EmptySimCluster), - unhealthyInteractionSimClusterEmbedding = - pairedScore.interactionScores.getOrElse(UnhealthyConsumerKey, EmptySimCluster) - ) - - val authorPairInteractionFeatures = PairedInteractionFeatures( - healthyInteractionSimClusterEmbedding = - pairedScore.interactionScores.getOrElse(HealthyAuthorKey, EmptySimCluster), - unhealthyInteractionSimClusterEmbedding = - pairedScore.interactionScores.getOrElse(UnhealthyAuthorKey, EmptySimCluster) - ) - - val value = SingleSideUserScores( - pairedScore.userId, - consumerHealthyScore = userPairInteractionFeatures.healthySum, - consumerUnhealthyScore = userPairInteractionFeatures.unhealthySum, - authorUnhealthyScore = authorPairInteractionFeatures.unhealthySum, - authorHealthyScore = authorPairInteractionFeatures.healthySum - ) - - KeyVal(pairedScore.userId, value) - } - } -} - -/** - * This job creates single-side features used to predict the abuse reports in search. The features - * are put into manhattan and availabe in feature store. We expect that search will be able to use - * these features directly. They may be useful for other models as well. 
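// Editorial sketch: buildKeyValDataSet above pairs each user's healthy and unhealthy
// interaction embeddings (for both the consumer and author sides) and flattens them into four
// scalar scores. PairedInteractionFeatures.healthySum/unhealthySum are not shown in this
// section, so the sketch assumes they are simple totals of the cluster scores; the interaction
// keys match the ones defined in AbuseSimclusterFeaturesScaldingJob.
object SingleSideScoresSketch {

  final case class SingleSideScores(
    userId: Long,
    consumerHealthy: Double,
    consumerUnhealthy: Double,
    authorHealthy: Double,
    authorUnhealthy: Double
  )

  // Assumed aggregation: total weight of the sparse cluster embedding.
  private def total(embedding: Map[Int, Double]): Double = embedding.values.sum

  // scoresByKey mirrors the Map passed to SingleSideInteractionTransformation.pairScores:
  // one sparse cluster embedding per interaction key; missing keys default to an empty embedding.
  def pairScores(userId: Long, scoresByKey: Map[String, Map[Int, Double]]): SingleSideScores = {
    def embedding(key: String): Map[Int, Double] = scoresByKey.getOrElse(key, Map.empty)
    SingleSideScores(
      userId = userId,
      consumerHealthy = total(embedding("healthyConsumer")),
      consumerUnhealthy = total(embedding("unhealthyConsumer")),
      authorHealthy = total(embedding("healthyAuthor")),
      authorUnhealthy = total(embedding("unhealthyAuthor"))
    )
  }

  def main(args: Array[String]): Unit = {
    val scores = Map(
      "healthyConsumer" -> Map(1 -> 0.5, 2 -> 0.25),
      "unhealthyConsumer" -> Map(3 -> 0.1)
      // Author-side embeddings absent: both author scores fall back to 0.0.
    )
    println(pairScores(12345L, scores)) // SingleSideScores(12345,0.75,0.1,0.0,0.0)
  }
}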
- */ -object SearchAbuseSimclusterFeaturesScaldingJob extends ScheduledExecutionApp { - override def firstTime: RichDate = RichDate("2021-02-01") - - override def batchIncrement: Duration = - Days(7) - - private val OutputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = ModelVersion.Model20m145kUpdated, - pathSuffix = "search_abuse_simcluster_features" - ) - - def buildDataset( - )( - implicit dateRange: DateRange, - ): Execution[TypedPipe[KeyVal[Long, SingleSideUserScores]]] = { - Execution.getMode.map { implicit mode => - val normalizedSimClusterMatrix = getUserInterestedInSparseMatrix.rowL2Normalize - val abuseSearchGraph = abuseInteractionSearchGraph()(dateRange, mode) - val impressionSearchGraph = impressionInteractionSearchGraph()(dateRange, mode) - - buildKeyValDataSet(normalizedSimClusterMatrix, abuseSearchGraph, impressionSearchGraph) - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - // Extend the date range to a total of 19 days. Search keeps 21 days of data. - val dateRangeSearchData = dateRange.prepend(Days(12)) - buildDataset()(dateRangeSearchData).flatMap { dataset => - dataset.writeDALVersionedKeyValExecution( - dataset = SearchAbuseSimclusterFeaturesManhattanScalaDataset, - pathLayout = D.Suffix(OutputPath) - ) - } - } -} - -/** - * You can check the logic of this job by running this query. - * - * scalding remote run \ - * --target src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse:abuse-prod \ - * --main-class com.twitter.simclusters_v2.scalding.embedding.abuse.AdhocSearchAbuseSimclusterFeaturesScaldingJob \ - * --hadoop-properties "mapreduce.job.split.metainfo.maxsize=-1" \ - * --cluster bluebird-qus1 --submitter hadoopnest-bluebird-1.qus1.twitter.com \ - * -- --date 2021-02-01 2021-02-02 \ - * --outputPath AdhocSearchAbuseSimclusterFeaturesScaldingJob-test1 - */ -object AdhocSearchAbuseSimclusterFeaturesScaldingJob extends AdhocExecutionApp { - def toTsv( - datasetExecution: Execution[TypedPipe[KeyVal[Long, SingleSideUserScores]]], - outputPath: String - ): Execution[Unit] = { - datasetExecution.flatMap { dataset => - dataset - .map { keyVal => - ( - keyVal.key, - keyVal.value.consumerHealthyScore, - keyVal.value.consumerUnhealthyScore, - keyVal.value.authorHealthyScore, - keyVal.value.authorUnhealthyScore - ) - } - .writeExecution(TypedText.tsv(outputPath)) - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - toTsv( - SearchAbuseSimclusterFeaturesScaldingJob.buildDataset()(dateRange), - args("outputPath") - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AdhocAbuseSimClusterFeaturesScaldingJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AdhocAbuseSimClusterFeaturesScaldingJob.scala deleted file mode 100644 index 245825b40..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/AdhocAbuseSimClusterFeaturesScaldingJob.scala +++ /dev/null @@ -1,217 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.twitter.ml.api.Feature -import com.twitter.ml.api.util.SRichDataRecord -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Execution -import com.twitter.scalding.UniqueID -import com.twitter.scalding._ -import 
com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.dataset.DAL.DALSourceBuilderExtension -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.search.common.features.ExternalTweetFeature -import com.twitter.search.common.features.SearchContextFeature -import com.twitter.search.tweet_ranking.scalding.datasets.TweetEngagementRawTrainingDataDailyJavaDataset -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.hdfs_sources.AdhocAbuseSimclusterFeaturesScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.NumBlocksP95 -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.getFlockBlocksSparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.getUserInterestedInSparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.UserId -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.CassowaryJob -import java.util.TimeZone - -object AdhocAbuseSimClusterFeatureKeys { - val AbuseAuthorSearchKey = "abuseAuthorSearch" - val AbuseUserSearchKey = "abuseUserSearch" - val ImpressionUserSearchKey = "impressionUserSearch" - val ImpressionAuthorSearchKey = "impressionAuthorSearch" - val FlockBlocksAuthorKey = "blocksAuthorFlockDataset" - val FlockBlocksUserKey = "blocksUserFlockDataset" - val FavScoresAuthorKey = "favsAuthorFromFavGraph" - val FavScoresUserKey = "favsUserFromFavGraph" -} - -/** - * Adhoc job that is still in development. The job builds features that are meant to be useful for - * search. - * - * Features are built from existing SimCluster representations and the interaction graphs. 
- * - * Example command: - * scalding remote run \ - * --target src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse:abuse-adhoc \ - * --main-class com.twitter.simclusters_v2.scalding.embedding.abuse.AdhocAbuseSimClusterFeaturesScaldingJob \ - * --submitter hadoopnest1.atla.twitter.com --user cassowary \ - * --hadoop-properties "mapreduce.job.user.classpath.first=true" -- \ - * --hdfs --date 2020/11/24 2020/12/14 --partitionName second_run --dalEnvironment Prod - */ -object AdhocAbuseSimClusterFeaturesScaldingJob extends AdhocExecutionApp with CassowaryJob { - override def jobName: String = "AdhocAbuseScaldingJob" - - import AdhocAbuseSimClusterFeatureKeys._ - - val tweetAuthorFeature = new Feature.Discrete(ExternalTweetFeature.TWEET_AUTHOR_ID.getName) - val searcherIdFeature = new Feature.Discrete(SearchContextFeature.SEARCHER_ID.getName) - val isReportedFeature = new Feature.Binary(ExternalTweetFeature.IS_REPORTED.getName) - val HalfLifeInDaysForFavScore = 100 - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = ModelVersion.Model20m145kUpdated, - pathSuffix = "abuse_simcluster_features" - ) - - def searchDataRecords( - )( - implicit dateRange: DateRange, - mode: Mode - ) = { - DAL - .read(TweetEngagementRawTrainingDataDailyJavaDataset) - .withRemoteReadPolicy(AllowCrossDC) - .toDataSetPipe - .records - } - - def abuseInteractionSearchGraph( - )( - implicit dateRange: DateRange, - mode: Mode - ): SparseMatrix[UserId, UserId, Double] = { - val abuseMatrixEntries = searchDataRecords() - .flatMap { dataRecord => - val sDataRecord = SRichDataRecord(dataRecord) - val authorIdOption = sDataRecord.getFeatureValueOpt(tweetAuthorFeature) - val userIdOption = sDataRecord.getFeatureValueOpt(searcherIdFeature) - val isReportedOption = sDataRecord.getFeatureValueOpt(isReportedFeature) - - for { - isReported <- isReportedOption if isReported - authorId <- authorIdOption if authorId != 0 - userId <- userIdOption if userId != 0 - } yield { - (userId: UserId, authorId: UserId, 1.0) - } - } - SparseMatrix.apply[UserId, UserId, Double](abuseMatrixEntries) - } - - def impressionInteractionSearchGraph( - )( - implicit dateRange: DateRange, - mode: Mode - ): SparseMatrix[UserId, UserId, Double] = { - val impressionMatrixEntries = searchDataRecords - .flatMap { dataRecord => - val sDataRecord = SRichDataRecord(dataRecord) - val authorIdOption = sDataRecord.getFeatureValueOpt(tweetAuthorFeature) - val userIdOption = sDataRecord.getFeatureValueOpt(searcherIdFeature) - - for { - authorId <- authorIdOption if authorId != 0 - userId <- userIdOption if userId != 0 - } yield { - (userId: UserId, authorId: UserId, 1.0) - } - } - SparseMatrix.apply[UserId, UserId, Double](impressionMatrixEntries) - } - - case class SingleSideScores( - unhealthyConsumerClusterScores: TypedPipe[(UserId, SimClustersEmbedding)], - unhealthyAuthorClusterScores: TypedPipe[(UserId, SimClustersEmbedding)], - healthyConsumerClusterScores: TypedPipe[(UserId, SimClustersEmbedding)], - healthyAuthorClusterScores: TypedPipe[(UserId, SimClustersEmbedding)]) - - def buildSearchAbuseScores( - normalizedSimClusterMatrix: SparseMatrix[UserId, ClusterId, Double], - unhealthyGraph: SparseMatrix[UserId, UserId, Double], - healthyGraph: SparseMatrix[UserId, UserId, Double] - ): SingleSideScores = { - SingleSideScores( - unhealthyConsumerClusterScores = SingleSideInteractionTransformation - .clusterScoresFromGraphs(normalizedSimClusterMatrix, unhealthyGraph), - 
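      // Consumer-side scores aggregate interactions along the rows of the graph (the user who
      // performed the interaction); the .transpose calls below swap rows and columns, so the same
      // clusterScoresFromGraphs computation attributes the interactions to the author side instead.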
unhealthyAuthorClusterScores = SingleSideInteractionTransformation - .clusterScoresFromGraphs(normalizedSimClusterMatrix, unhealthyGraph.transpose), - healthyConsumerClusterScores = SingleSideInteractionTransformation - .clusterScoresFromGraphs(normalizedSimClusterMatrix, healthyGraph), - healthyAuthorClusterScores = SingleSideInteractionTransformation - .clusterScoresFromGraphs(normalizedSimClusterMatrix, healthyGraph.transpose) - ) - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - Execution.getMode.flatMap { implicit mode => - val normalizedSimClusterMatrix = getUserInterestedInSparseMatrix.rowL2Normalize - - val abuseSearchGraph = abuseInteractionSearchGraph() - val impressionSearchGraph = impressionInteractionSearchGraph() - - val searchAbuseScores = buildSearchAbuseScores( - normalizedSimClusterMatrix, - unhealthyGraph = abuseSearchGraph, - healthyGraph = impressionSearchGraph) - - // Step 2a: Read FlockBlocks for unhealthy interactions and user-user-fav for healthy interactions - val flockBlocksSparseGraph = - getFlockBlocksSparseMatrix(NumBlocksP95, dateRange.prepend(Years(1))) - - val favSparseGraph = SparseMatrix.apply[UserId, UserId, Double]( - ExternalDataSources.getFavEdges(HalfLifeInDaysForFavScore)) - - val blocksAbuseScores = buildSearchAbuseScores( - normalizedSimClusterMatrix, - unhealthyGraph = flockBlocksSparseGraph, - healthyGraph = favSparseGraph - ) - - // Step 3. Combine all scores from different sources for users - val pairedScores = SingleSideInteractionTransformation.pairScores( - Map( - // User cluster scores built from the search abuse reports graph - AbuseUserSearchKey -> searchAbuseScores.unhealthyConsumerClusterScores, - // Author cluster scores built from the search abuse reports graph - AbuseAuthorSearchKey -> searchAbuseScores.unhealthyAuthorClusterScores, - // User cluster scores built from the search impression graph - ImpressionUserSearchKey -> searchAbuseScores.healthyConsumerClusterScores, - // Author cluster scores built from the search impression graph - ImpressionAuthorSearchKey -> searchAbuseScores.healthyAuthorClusterScores, - // User cluster scores built from flock blocks graph - FlockBlocksUserKey -> blocksAbuseScores.unhealthyConsumerClusterScores, - // Author cluster scores built from the flock blocks graph - FlockBlocksAuthorKey -> blocksAbuseScores.unhealthyAuthorClusterScores, - // User cluster scores built from the user-user fav graph - FavScoresUserKey -> blocksAbuseScores.healthyConsumerClusterScores, - // Author cluster scores built from the user-user fav graph - FavScoresAuthorKey -> blocksAbuseScores.healthyAuthorClusterScores - ) - ) - - pairedScores.writeDALSnapshotExecution( - AdhocAbuseSimclusterFeaturesScalaDataset, - D.Daily, - D.Suffix(outputPathThrift), - D.Parquet, - dateRange.`end`, - partitions = Set(D.Partition("partition", args("partitionName"), D.PartitionType.String)) - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/BUILD deleted file mode 100644 index 0f162e417..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/BUILD +++ /dev/null @@ -1,74 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - 
"graphstore/common:flock_blocks-java", - "src/java/com/twitter/search/common/features", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/dalv2/dataset", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/search/tweet_ranking/scalding/datasets:tweet_engagement_raw_training_data_daily-java", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:adhoc_abuse_simcluster_features-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:adhoc_cross_simcluster_block_interaction_features-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:adhoc_cross_simcluster_fav_interaction_features-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:search_abuse_simcluster_features_manhattan-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:user_user_fav_graph-scala", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/common/matrix", - "src/scala/com/twitter/simclusters_v2/scalding/embedding", - "src/scala/com/twitter/wtf/entity_real_graph/common", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:cassowary_job", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - ], -) - -hadoop_binary( - name = "abuse-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.abuse.AdhocAbuseScaldingJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":abuse"], -) - -hadoop_binary( - name = "abuse-prod", - main = "com.twitter.simclusters_v2.scalding.embedding.abuse.SearchAbuseSimclusterFeaturesScaldingJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":abuse"], -) - -hadoop_binary( - name = "cross_simcluster-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.abuse.CrossSimClusterFeaturesScaldingJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":abuse", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/CrossSimClusterFeaturesScaldingJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/CrossSimClusterFeaturesScaldingJob.scala deleted file mode 100644 index f2ee98bd4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/CrossSimClusterFeaturesScaldingJob.scala +++ /dev/null @@ -1,149 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Execution -import com.twitter.scalding.UniqueID -import com.twitter.scalding.Years -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.NumBlocksP95 -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.getFlockBlocksSparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.abuse.DataSources.getUserInterestedInTruncatedKMatrix 
-import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ClusterId -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.UserId -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.AdhocCrossSimClusterInteractionScores -import com.twitter.simclusters_v2.thriftscala.ClustersScore -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.CassowaryJob -import com.twitter.simclusters_v2.hdfs_sources.AdhocCrossSimclusterBlockInteractionFeaturesScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.AdhocCrossSimclusterFavInteractionFeaturesScalaDataset -import java.util.TimeZone - -/* -To run: -scalding remote run \ ---user cassowary \ ---submitter hadoopnest1.atla.twitter.com \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse:cross_simcluster-adhoc \ ---main-class com.twitter.simclusters_v2.scalding.embedding.abuse.CrossSimClusterFeaturesScaldingJob \ ---submitter-memory 128192.megabyte --hadoop-properties "mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.memory.mb=8192 mapreduce.reduce.java.opts='-Xmx7618M'" \ --- \ ---date 2021-02-07 \ ---dalEnvironment Prod - */ - -object CrossSimClusterFeaturesUtil { - - /** - * To generate the interaction score for 2 simclusters c1 and c2 for all cluster combinations (I): - * a) Get C - user interestedIn matrix, User * Cluster - * b) Get INT - positive or negative interaction matrix, User * User - * c) Compute C^T*INT - * d) Finally, return C^T*INT*C - */ - def getCrossClusterScores( - userClusterMatrix: SparseMatrix[UserId, ClusterId, Double], - userInteractionMatrix: SparseMatrix[UserId, UserId, Double] - ): SparseMatrix[ClusterId, ClusterId, Double] = { - // intermediate = C^T*INT - val intermediateResult = userClusterMatrix.transpose.multiplySparseMatrix(userInteractionMatrix) - // return intermediate*C - intermediateResult.multiplySparseMatrix(userClusterMatrix) - } -} - -object CrossSimClusterFeaturesScaldingJob extends AdhocExecutionApp with CassowaryJob { - override def jobName: String = "AdhocAbuseCrossSimClusterFeaturesScaldingJob" - - private val outputPathBlocksThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = ModelVersion.Model20m145kUpdated, - pathSuffix = "abuse_cross_simcluster_block_features" - ) - - private val outputPathFavThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = ModelVersion.Model20m145kUpdated, - pathSuffix = "abuse_cross_simcluster_fav_features" - ) - - private val HalfLifeInDaysForFavScore = 100 - - // Adhoc jobs which use all user interestedIn simclusters (default=50) was failing - // Hence truncating the number of clusters - private val MaxNumClustersPerUser = 20 - - import CrossSimClusterFeaturesUtil._ - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val normalizedUserInterestedInMatrix: SparseMatrix[UserId, ClusterId, Double] = - getUserInterestedInTruncatedKMatrix(MaxNumClustersPerUser).rowL2Normalize - - //the below code is to get cross 
simcluster features from flockblocks - negative user-user interactions. - val flockBlocksMatrix: SparseMatrix[UserId, UserId, Double] = - getFlockBlocksSparseMatrix(NumBlocksP95, dateRange.prepend(Years(1))) - - val crossClusterBlockScores: SparseMatrix[ClusterId, ClusterId, Double] = - getCrossClusterScores(normalizedUserInterestedInMatrix, flockBlocksMatrix) - - val blockScores: TypedPipe[AdhocCrossSimClusterInteractionScores] = - crossClusterBlockScores.rowAsKeys - .mapValues(List(_)).sumByKey.toTypedPipe.map { - case (givingClusterId, receivingClustersWithScores) => - AdhocCrossSimClusterInteractionScores( - clusterId = givingClusterId, - clusterScores = receivingClustersWithScores.map { - case (cluster, score) => ClustersScore(cluster, score) - }) - } - - // get cross simcluster features from fav graph - positive user-user interactions - val favGraphMatrix: SparseMatrix[UserId, UserId, Double] = - SparseMatrix.apply[UserId, UserId, Double]( - ExternalDataSources.getFavEdges(HalfLifeInDaysForFavScore)) - - val crossClusterFavScores: SparseMatrix[ClusterId, ClusterId, Double] = - getCrossClusterScores(normalizedUserInterestedInMatrix, favGraphMatrix) - - val favScores: TypedPipe[AdhocCrossSimClusterInteractionScores] = - crossClusterFavScores.rowAsKeys - .mapValues(List(_)).sumByKey.toTypedPipe.map { - case (givingClusterId, receivingClustersWithScores) => - AdhocCrossSimClusterInteractionScores( - clusterId = givingClusterId, - clusterScores = receivingClustersWithScores.map { - case (cluster, score) => ClustersScore(cluster, score) - }) - } - // write both block and fav interaction matrices to hdfs in thrift format - Execution - .zip( - blockScores.writeDALSnapshotExecution( - AdhocCrossSimclusterBlockInteractionFeaturesScalaDataset, - D.Daily, - D.Suffix(outputPathBlocksThrift), - D.Parquet, - dateRange.`end`), - favScores.writeDALSnapshotExecution( - AdhocCrossSimclusterFavInteractionFeaturesScalaDataset, - D.Daily, - D.Suffix(outputPathFavThrift), - D.Parquet, - dateRange.`end`) - ).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/DataSources.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/DataSources.scala deleted file mode 100644 index 20c16abbf..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/DataSources.scala +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.twitter.data.proto.Flock -import com.twitter.scalding.{DateOps, DateRange, Days, RichDate, UniqueID} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.simclusters_v2.hdfs_sources.InterestedInSources -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.{ClusterId, UserId} -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import graphstore.common.FlockBlocksJavaDataset -import java.util.TimeZone - -object DataSources { - - private val ValidEdgeStateId = 0 - val NumBlocksP95 = 49 - - /** - * Helper function to return Sparse Matrix of user's interestedIn clusters and fav scores - * @param dateRange - * @return - */ - def getUserInterestedInSparseMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone - ): SparseMatrix[UserId, ClusterId, Double] = { - val simClusters = ExternalDataSources.simClustersInterestInSource - - val simClusterMatrixEntries = simClusters - .flatMap { keyVal => - keyVal.value.clusterIdToScores.flatMap { - case 
(clusterId, score) => - score.favScore.map { favScore => - (keyVal.key, clusterId, favScore) - } - } - } - - SparseMatrix.apply[UserId, ClusterId, Double](simClusterMatrixEntries) - } - - def getUserInterestedInTruncatedKMatrix( - topK: Int - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[UserId, ClusterId, Double] = { - SparseMatrix( - InterestedInSources - .simClustersInterestedInUpdatedSource(dateRange, timeZone) - .flatMap { - case (userId, clustersUserIsInterestedIn) => - val sortedAndTruncatedList = clustersUserIsInterestedIn.clusterIdToScores - .mapValues(_.favScore.getOrElse(0.0)).filter(_._2 > 0.0).toList.sortBy(-_._2).take( - topK) - sortedAndTruncatedList.map { - case (clusterId, score) => - (userId, clusterId, score) - } - } - ) - } - - /** - * Helper function to return SparseMatrix of user block interactions from the FlockBlocks - * dataset. All users with greater than numBlocks are filtered out - * @param dateRange - * @return - */ - def getFlockBlocksSparseMatrix( - maxNumBlocks: Int, - rangeForData: DateRange - )( - implicit dateRange: DateRange - ): SparseMatrix[UserId, UserId, Double] = { - implicit val tz: java.util.TimeZone = DateOps.UTC - val userGivingBlocks = SparseMatrix.apply[UserId, UserId, Double]( - DAL - .readMostRecentSnapshotNoOlderThan(FlockBlocksJavaDataset, Days(30)) - .toTypedPipe - .flatMap { data: Flock.Edge => - // Consider edges that are valid and have been updated in the past 1 year - if (data.getStateId == ValidEdgeStateId && - rangeForData.contains(RichDate(data.getUpdatedAt * 1000L))) { - Some((data.getSourceId, data.getDestinationId, 1.0)) - } else { - None - } - }) - // Find all users who give less than numBlocksP95 blocks. - // This is to remove those who might be responsible for automatically blocking users - // on the twitter platform. 
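    // With the default threshold NumBlocksP95 = 49 defined above, any user whose block edges sum
    // to more than 49 in the lookback window is dropped here before the matrix is used downstream.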
- val usersWithLegitBlocks = userGivingBlocks.rowL1Norms.collect { - case (userId, l1Norm) if l1Norm <= maxNumBlocks => - userId - } - // retain only those users who give legit blocks (i.e those users who give less than numBlocks95) - userGivingBlocks.filterRows(usersWithLegitBlocks) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/PairedinteractionFeatures.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/PairedinteractionFeatures.scala deleted file mode 100644 index 645519d59..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/PairedinteractionFeatures.scala +++ /dev/null @@ -1,122 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.thriftscala.{SimClusterWithScore, SimClustersEmbedding} -import com.twitter.util.Try - -object ClusterPair { - def apply( - clusterId: ClusterId, - healthyScore: Double, - unhealthyScore: Double - ): Option[ClusterPair] = { - if (healthyScore + unhealthyScore == 0.0) { - None - } else { - Some(new ClusterPair(clusterId, healthyScore, unhealthyScore)) - } - } -} - -case class ClusterPair private ( - clusterId: ClusterId, - healthyScore: Double, - unhealthyScore: Double) { - - def totalScores: Double = healthyScore + unhealthyScore - - def healthRatio: Double = unhealthyScore / (unhealthyScore + healthyScore) -} - -object PairedInteractionFeatures { - def smoothedHealthRatio( - unhealthySum: Double, - healthySum: Double, - smoothingFactor: Double, - prior: Double - ): Double = - (unhealthySum + smoothingFactor * prior) / (unhealthySum + healthySum + smoothingFactor) -} - -/** - * Class used to derive features for abuse models. We pair a healthy embedding with an unhealthy - * embedding. All the public methods on this class are derived features of these embeddings. - * - * @param healthyInteractionSimClusterEmbedding SimCluster embedding of healthy interactions (for - * instance favs or impressions) - * @param unhealthyInteractionSimClusterEmbedding SimCluster embedding of unhealthy interactions - * (for instance blocks or abuse reports) - */ -case class PairedInteractionFeatures( - healthyInteractionSimClusterEmbedding: SimClustersEmbedding, - unhealthyInteractionSimClusterEmbedding: SimClustersEmbedding) { - - private[this] val scorePairs: Seq[ClusterPair] = { - val clusterToScoreMap = healthyInteractionSimClusterEmbedding.embedding.map { - simClusterWithScore => - simClusterWithScore.clusterId -> simClusterWithScore.score - }.toMap - - unhealthyInteractionSimClusterEmbedding.embedding.flatMap { simClusterWithScore => - val clusterId = simClusterWithScore.clusterId - val postiveScoreOption = clusterToScoreMap.get(clusterId) - postiveScoreOption.flatMap { postiveScore => - ClusterPair(clusterId, postiveScore, simClusterWithScore.score) - } - } - } - - /** - * Get the pair of clusters with the most total interactions. - */ - val highestScoreClusterPair: Option[ClusterPair] = - Try(scorePairs.maxBy(_.totalScores)).toOption - - /** - * Get the pair of clusters with the highest unhealthy to healthy ratio. - */ - val highestHealthRatioClusterPair: Option[ClusterPair] = - Try(scorePairs.maxBy(_.healthRatio)).toOption - - /** - * Get the pair of clusters with the lowest unhealthy to healthy ratio. 
- */ - val lowestHealthRatioClusterPair: Option[ClusterPair] = - Try(scorePairs.minBy(_.healthRatio)).toOption - - /** - * Get an embedding whose values are the ratio of unhealthy to healthy for that simcluster. - */ - val healthRatioEmbedding: SimClustersEmbedding = { - val scores = scorePairs.map { pair => - SimClusterWithScore(pair.clusterId, pair.healthRatio) - } - SimClustersEmbedding(scores) - } - - /** - * Sum of the healthy scores for all the simclusters - */ - val healthySum: Double = healthyInteractionSimClusterEmbedding.embedding.map(_.score).sum - - /** - * Sum of the unhealthy scores for all the simclusters - */ - val unhealthySum: Double = unhealthyInteractionSimClusterEmbedding.embedding.map(_.score).sum - - /** - * ratio of unhealthy to healthy for all simclusters - */ - val healthRatio: Double = unhealthySum / (unhealthySum + healthySum) - - /** - * Ratio of unhealthy to healthy for all simclusters that is smoothed toward the prior with when - * we have fewer observations. - * - * @param smoothingFactor The higher this value the more interactions we need to move the returned - * ratio - * @param prior The unhealthy to healthy for all interactions. - */ - def smoothedHealthRatio(smoothingFactor: Double, prior: Double): Double = - PairedInteractionFeatures.smoothedHealthRatio(unhealthySum, healthySum, smoothingFactor, prior) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/SingleSideInteractionTransformation.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/SingleSideInteractionTransformation.scala deleted file mode 100644 index 95577139f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/abuse/SingleSideInteractionTransformation.scala +++ /dev/null @@ -1,154 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.abuse - -import com.google.common.annotations.VisibleForTesting -import com.twitter.scalding._ -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ClusterId -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.UserId -import com.twitter.simclusters_v2.thriftscala.AdhocSingleSideClusterScores -import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding - -/** - * Logic for building a SimCluster represenation of interaction signals. The purpose of this job is - * to model negative behavior (like abuse and blocks). - * - * This is a "SingleSide", because we are only considering one side of the interaction graph to - * build these features. So for instance we would keep track of which simclusters are most likely to - * get reported for abuse regardless of who reported it. Another job will be responsible for - * building the simcluster to simcluster interaction matrix as described in the doc. - */ -object SingleSideInteractionTransformation { - - /** - * Compute a score for every SimCluster. The SimCluster score is a count of the number of - * interactions for each SimCluster. For a user that has many SimClusters, we distribute each of - * their interactions across all of these SimClusters. - * - * @param normalizedUserSimClusters Sparse matrix of User-SimCluster scores. Users are rows and - * SimClusters are columns. This should already by L2normalized. - * It is important that we normalize so that each interaction - * only adds 1 to the counts. - * @param interactionGraph Graph of interactions. 
Rows are the users, columns are not used. - * All values in this graph are assumed to be positive; they are the number of - * interactions. - * - * @return SingleSideClusterFeatures for each SimCluster that has user with an interaction. - */ - def computeClusterFeatures( - normalizedUserSimClusters: SparseMatrix[UserId, ClusterId, Double], - interactionGraph: SparseMatrix[UserId, _, Double] - ): TypedPipe[SimClusterWithScore] = { - - val numReportsForUserEntries = interactionGraph.rowL1Norms.map { - // turn into a vector where we use 1 as the column key for every entry. - case (user, count) => (user, 1, count) - } - - val numReportsForUser = SparseMatrix[UserId, Int, Double](numReportsForUserEntries) - - normalizedUserSimClusters.transpose - .multiplySparseMatrix(numReportsForUser) - .toTypedPipe - .map { - case (clusterId, _, clusterScore: Double) => - SimClusterWithScore(clusterId, clusterScore) - } - } - - /** - * Given that we have the score for each SimCluster and the user's SimClusters, create a - * representation of the user so that the new SimCluster scores are an estimate of the - * interactions for this user. - * - * @param normalizedUserSimClusters sparse matrix of User-SimCluster scores. Users are rows and - * SimClusters are columns. This should already be L2 normalized. - * @param simClusterFeatures For each SimCluster, a score associated with this interaction type. - * - * @return SingleSideAbuseFeatures for each user the SimClusters and scores for this - */ - @VisibleForTesting - private[abuse] def computeUserFeaturesFromClusters( - normalizedUserSimClusters: SparseMatrix[UserId, ClusterId, Double], - simClusterFeatures: TypedPipe[SimClusterWithScore] - ): TypedPipe[(UserId, SimClustersEmbedding)] = { - - normalizedUserSimClusters.toTypedPipe - .map { - case (userId, clusterId, score) => - (clusterId, (userId, score)) - } - .group - // There are at most 140k SimClusters. They should fit in memory - .hashJoin(simClusterFeatures.groupBy(_.clusterId)) - .map { - case (_, ((userId, score), singleSideClusterFeatures)) => - ( - userId, - List( - SimClusterWithScore( - singleSideClusterFeatures.clusterId, - singleSideClusterFeatures.score * score)) - ) - } - .sumByKey - .mapValues(SimClustersEmbedding.apply) - } - - /** - * Combines all the different SimClustersEmbedding for a user into one - * AdhocSingleSideClusterScores. - * - * @param interactionMap The key is an identifier for the embedding type. The typed pipe will have - * embeddings of only for that type of embedding. - * @return Typed pipe with one AdhocSingleSideClusterScores per user. - */ - def pairScores( - interactionMap: Map[String, TypedPipe[(UserId, SimClustersEmbedding)]] - ): TypedPipe[AdhocSingleSideClusterScores] = { - - val combinedInteractions = interactionMap - .map { - case (interactionTypeName, userInteractionFeatures) => - userInteractionFeatures.map { - case (userId, simClustersEmbedding) => - (userId, List((interactionTypeName, simClustersEmbedding))) - } - } - .reduce[TypedPipe[(UserId, List[(String, SimClustersEmbedding)])]] { - case (list1, list2) => - list1 ++ list2 - } - .group - .sumByKey - - combinedInteractions.toTypedPipe - .map { - case (userId, interactionFeatureList) => - AdhocSingleSideClusterScores( - userId, - interactionFeatureList.toMap - ) - } - } - - /** - * Given the SimCluster and interaction graph get the user representation for this interaction. 
- * See the documentation of the underlying methods for more details - * - * @param normalizedUserSimClusters sparse matrix of User-SimCluster scores. Users are rows and - * SimClusters are columns. This should already by L2normalized. - * @param interactionGraph Graph of interactions. Rows are the users, columns are not used. - * All values in this graph are assumed to be positive; they are the number of - * interactions. - * - * @return SimClustersEmbedding for all users in the give SimCluster graphs - */ - def clusterScoresFromGraphs( - normalizedUserSimClusters: SparseMatrix[UserId, ClusterId, Double], - interactionGraph: SparseMatrix[UserId, _, Double] - ): TypedPipe[(UserId, SimClustersEmbedding)] = { - val clusterFeatures = computeClusterFeatures(normalizedUserSimClusters, interactionGraph) - computeUserFeaturesFromClusters(normalizedUserSimClusters, clusterFeatures) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EmbeddingUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EmbeddingUtil.scala deleted file mode 100644 index 9b1e45f89..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EmbeddingUtil.scala +++ /dev/null @@ -1,114 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.common - -import com.twitter.simclusters_v2.thriftscala._ -import java.net.InetAddress -import java.net.UnknownHostException - -object EmbeddingUtil { - - type UserId = Long - type ClusterId = Int - type ProducerId = Long - type EmbeddingScore = Double - type SemanticCoreEntityId = Long - type HashtagId = String - type Language = String - - implicit val internalIdOrdering: Ordering[InternalId] = Ordering.by { - case InternalId.EntityId(id) => id.toString - case InternalId.Hashtag(strId) => strId - case InternalId.ClusterId(iid) => iid.toString - case InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)) => lang + entityId.toString - } - - implicit val embeddingTypeOrdering: Ordering[EmbeddingType] = Ordering.by(_.getValue) - - /** - * We do not need to group by model version since we are making the - * This ordering holds the assumption that we would NEVER generate embeddings for two separate - * SimClusters KnownFor versions under the same dataset. - */ - implicit val SimClustersEmbeddingIdOrdering: Ordering[SimClustersEmbeddingId] = Ordering.by { - case SimClustersEmbeddingId(embeddingType, _, internalId) => (embeddingType, internalId) - } - - val ModelVersionPathMap: Map[ModelVersion, String] = Map( - ModelVersion.Model20m145kDec11 -> "model_20m_145k_dec11", - ModelVersion.Model20m145kUpdated -> "model_20m_145k_updated", - ModelVersion.Model20m145k2020 -> "model_20m_145k_2020" - ) - - /** - * Generates the HDFS output path in order to consolidate the offline embeddings datasets under - * a common directory pattern. - * Prepends "/gcs" if the detected data center is qus1. - * - * @param isAdhoc Whether the dataset was generated from an adhoc run - * @param isManhattanKeyVal Whether the dataset is written as KeyVal and is intended to be imported to Manhattan - * @param modelVersion The model version of SimClusters KnownFor that is used to generate the embedding - * @param pathSuffix Any additional path structure suffixed at the end of the path - * @return The consolidated HDFS path, for example: - * /user/cassowary/adhoc/manhattan_sequence_files/simclusters_embeddings/model_20m_145k_updated/... 
- */ - def getHdfsPath( - isAdhoc: Boolean, - isManhattanKeyVal: Boolean, - modelVersion: ModelVersion, - pathSuffix: String - ): String = { - val adhoc = if (isAdhoc) "adhoc/" else "" - - val user = System.getenv("USER") - - val gcs: String = - try { - InetAddress.getAllByName("metadata.google.internal") // throws Exception if not in GCP. - "/gcs" - } catch { - case _: UnknownHostException => "" - } - - val datasetType = if (isManhattanKeyVal) "manhattan_sequence_files" else "processed" - - val path = s"/user/$user/$adhoc$datasetType/simclusters_embeddings" - - s"$gcs${path}_${ModelVersionPathMap(modelVersion)}_$pathSuffix" - } - - def favScoreExtractor(u: UserToInterestedInClusterScores): (Double, ScoreType.ScoreType) = { - (u.favScoreClusterNormalizedOnly.getOrElse(0.0), ScoreType.FavScore) - } - - def followScoreExtractor(u: UserToInterestedInClusterScores): (Double, ScoreType.ScoreType) = { - (u.followScoreClusterNormalizedOnly.getOrElse(0.0), ScoreType.FollowScore) - } - - def logFavScoreExtractor(u: UserToInterestedInClusterScores): (Double, ScoreType.ScoreType) = { - (u.logFavScoreClusterNormalizedOnly.getOrElse(0.0), ScoreType.LogFavScore) - } - - // Define all scores to extract from the SimCluster InterestedIn source - val scoreExtractors: Seq[UserToInterestedInClusterScores => (Double, ScoreType.ScoreType)] = - Seq( - favScoreExtractor, - followScoreExtractor - ) - - object ScoreType extends Enumeration { - type ScoreType = Value - val FavScore: Value = Value(1) - val FollowScore: Value = Value(2) - val LogFavScore: Value = Value(3) - } - - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ModelVersion20M145KDec11: String = "20M_145K_dec11" - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ModelVersion20M145KUpdated: String = "20M_145K_updated" - - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ModelVersionMap: Map[String, ModelVersion] = Map( - ModelVersion20M145KDec11 -> ModelVersion.Model20m145kDec11, - ModelVersion20M145KUpdated -> ModelVersion.Model20m145kUpdated - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EntityEmbeddingUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EntityEmbeddingUtil.scala deleted file mode 100644 index b9f715f2e..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/EntityEmbeddingUtil.scala +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.common - -import com.twitter.recos.entities.thriftscala.Entity -import com.twitter.scalding.Args -import com.twitter.scalding.TypedPipe -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.UserId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.wtf.entity_real_graph.common.EntityUtil -import com.twitter.wtf.entity_real_graph.thriftscala.Edge -import com.twitter.wtf.entity_real_graph.thriftscala.EntityType -import com.twitter.wtf.entity_real_graph.thriftscala.FeatureName - -object EntityEmbeddingUtil { - - def getEntityUserMatrix( - entityRealGraphSource: TypedPipe[Edge], - halfLife: HalfLifeScores.HalfLifeScoresType, - entityType: EntityType - ): TypedPipe[(Entity, (UserId, Double))] = { - entityRealGraphSource - .flatMap { - case Edge(userId, entity, consumerFeatures, _, _) - if consumerFeatures.exists(_.exists(_.featureName == FeatureName.Favorites)) && - EntityUtil.getEntityType(entity) == entityType => - for { - features 
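            // Keep only edges whose consumer-side features contain a Favorites EWMA bucket for the
            // requested half-life; the extracted favScore becomes the (user, entity) edge weight.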
<- consumerFeatures - favFeatures <- features.find(_.featureName == FeatureName.Favorites) - ewmaMap <- favFeatures.featureValues.ewmaMap - favScore <- ewmaMap.get(halfLife.id) - } yield (entity, (userId, favScore)) - - case _ => None - } - } - - object HalfLifeScores extends Enumeration { - type HalfLifeScoresType = Value - val OneDay: Value = Value(1) - val SevenDays: Value = Value(7) - val FourteenDays: Value = Value(14) - val ThirtyDays: Value = Value(30) - val SixtyDays: Value = Value(60) - } - - case class EntityEmbeddingsJobConfig( - topK: Int, - halfLife: HalfLifeScores.HalfLifeScoresType, - modelVersion: ModelVersion, - entityType: EntityType, - isAdhoc: Boolean) - - object EntityEmbeddingsJobConfig { - - def apply(args: Args, isAdhoc: Boolean): EntityEmbeddingsJobConfig = { - - val entityTypeArg = - EntityType.valueOf(args.getOrElse("entity-type", default = "")) match { - case Some(entityType) => entityType - case _ => - throw new IllegalArgumentException( - s"Argument [--entity-type] must be provided. Supported options [" + - s"${EntityType.SemanticCore.name}, ${EntityType.Hashtag.name}]") - } - - EntityEmbeddingsJobConfig( - topK = args.getOrElse("top-k", default = "100").toInt, - halfLife = HalfLifeScores(args.getOrElse("half-life", default = "14").toInt), - // Fail fast if there is no correct model-version argument - modelVersion = ModelVersions.toModelVersion( - args.getOrElse("model-version", ModelVersions.Model20M145K2020) - ), - // Fail fast if there is no correct entity-type argument - entityType = entityTypeArg, - isAdhoc = isAdhoc - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/ExternalDataSources.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/ExternalDataSources.scala deleted file mode 100644 index 729cb95d0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/ExternalDataSources.scala +++ /dev/null @@ -1,565 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.common - -import com.twitter.algebird.Aggregator -import com.twitter.common.text.language.LocaleUtil -import com.twitter.escherbird.common.thriftscala.Locale -import com.twitter.escherbird.common.thriftscala.LocalizedUser -import com.twitter.escherbird.metadata.thriftscala.FullMetadata -import com.twitter.escherbird.scalding.source.FullMetadataSource -import com.twitter.escherbird.scalding.source.utt.UttSourceScalaDataset -import com.twitter.escherbird.utt.strato.thriftscala.SnapshotType -import com.twitter.escherbird.utt.thriftscala.UttEntityRecord -import com.twitter.interests_ds.jobs.interests_service.UserTopicRelationSnapshotScalaDataset -import com.twitter.interests.thriftscala.InterestRelationType -import com.twitter.interests.thriftscala.UserInterestsRelationSnapshot -import com.twitter.penguin.scalding.datasets.PenguinUserLanguagesScalaDataset -import com.twitter.scalding.DateOps -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Stat -import com.twitter.scalding.TypedPipe -import com.twitter.scalding.UniqueID -import com.twitter.scalding.ValuePipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.UserId -import 
com.twitter.simclusters_v2.common._ -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedIn20M145KUpdatedScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserUserFavGraphScalaDataset -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.common_header.thriftscala.CommonHeader -import com.twitter.common_header.thriftscala.IdType -import com.twitter.common_header.thriftscala.VersionedCommonHeader -import flockdb_tools.datasets.flock.FlockBlocksEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockFollowsEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsAbuseEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsSpamEdgesScalaDataset -import twadoop_config.configuration.log_categories.group.search.AdaptiveSearchScalaDataset -import com.twitter.search.adaptive.scribing.thriftscala.AdaptiveSearchScribeLog -import twadoop_config.configuration.log_categories.group.timeline.TimelineServiceFavoritesScalaDataset -import tweetsource.common.UnhydratedFlatScalaDataset -import com.twitter.frigate.data_pipeline.magicrecs.magicrecs_notifications_lite.thriftscala.MagicRecsNotificationLite -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights -import com.twitter.timelineservice.thriftscala.ContextualizedFavoriteEvent -import com.twitter.timelineservice.thriftscala.FavoriteEventUnion -import com.twitter.tweetsource.common.thriftscala.UnhydratedFlatTweet -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.wtf.entity_real_graph.scalding.common.DatasetConstants -import com.twitter.wtf.entity_real_graph.scalding.common.SemanticCoreFilters -import com.twitter.wtf.scalding.client_event_processing.thriftscala.InteractionDetails -import com.twitter.wtf.scalding.client_event_processing.thriftscala.InteractionType -import com.twitter.wtf.scalding.client_event_processing.thriftscala.TweetImpressionDetails -import com.twitter.frigate.data_pipeline.scalding.magicrecs.magicrecs_notification_lite.MagicrecsNotificationLite1DayLagScalaDataset -import com.twitter.iesource.thriftscala.InteractionEvent -import com.twitter.iesource.thriftscala.InteractionTargetType -import com.twitter.wtf.scalding.jobs.client_event_processing.UserInteractionScalaDataset -import java.util.TimeZone -import com.twitter.interests_ds.jobs.interests_service.UserInterestRelationSnapshotScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.UserId -import com.twitter.scalding.typed.{ValuePipe => TypedValuePipe} -import com.twitter.tweetsource.common.thriftscala.UnhydratedTweet -import tweetsource.common.UnhydratedScalaDataset - -object ExternalDataSources { - val UTTDomain = 131L - val usersourceColumns = Set("id", "account_country_code", "language") - val ValidFlockEdgeStateId = 0 - - def getStandardLanguageCode(language: String): Option[String] = { - val locale = LocaleUtil.getLocaleOf(language) - if (locale == LocaleUtil.UNKNOWN) None else Some(locale.getLanguage) - } - - // Reads UTT Entity Records (`utt_source` dataset) - def getUttEntityRecords(implicit timeZone: TimeZone): TypedPipe[UttEntityRecord] = { - DAL - .readMostRecentSnapshotNoOlderThan(UttSourceScalaDataset, Days(14)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - /** - * Extracts the KGO seeds from the UTT Entity Records. 
- * Uses the most recent "Stable" version by default unless specified otherwise. - * - * @param uttVersion UTT Version to use instead of the default value. - */ - def getLocaleProducerSeedIdsFromUttEntityRecords( - uttVersion: Option[Long] = None - )( - implicit timeZone: TimeZone, - uniqueId: UniqueID - ): TypedPipe[((TopicId, Language), Seq[UserId])] = { - - val topicLangPairCount = Stat("topic_lang_pair_count_all") - val topicLangPairCountEmptySeed = Stat("topic_lang_pair_count_empty_seed") - val topicLangPairCountLteOneSeed = Stat("topic_lang_pair_count_lte_one_seed") - val topicLangPairCountLteFiveSeeds = Stat("topic_lang_pair_count_lte_five_seeds") - val topicLangPairCountLteTenSeeds = Stat("topic_lang_pair_count_lte_ten_seeds") - - val uttEntityRecords: TypedPipe[UttEntityRecord] = getUttEntityRecords - - val uttVersionToUse: ValuePipe[Long] = uttVersion match { - case Some(uttVersionValue) => - TypedValuePipe(uttVersionValue) - case _ => // find the most recent "stable" version as recommended by the SemanticCore team - uttEntityRecords - .filter(_.snapshotType.exists(_ == SnapshotType.Stable)) - .map(_.version) - .distinct - .aggregate(Aggregator.min) // the most recent version is the smallest negative value - } - - val uttEntityRecordsSingleVersion: TypedPipe[UttEntityRecord] = - uttEntityRecords - .filterWithValue(uttVersionToUse) { - case (uttEntityRecord: UttEntityRecord, uttVersionOpt: Option[Long]) => - uttVersionOpt.contains(uttEntityRecord.version) - } - - uttEntityRecordsSingleVersion.flatMap { uttEntityRecord: UttEntityRecord => - val localizedUsers: Seq[LocalizedUser] = - uttEntityRecord.knownForUsers.flatMap(_.localizedUsers).getOrElse(Nil) - - val validLocalizedUsers: Seq[(TopicId, Language, UserId)] = - localizedUsers - .flatMap { - case LocalizedUser(userId: UserId, Some(Locale(Some(language: String), _)), _) => - Some((uttEntityRecord.entityId, language, userId)) - case _ => - None - } - - val localeProducerSeedIds: Seq[((TopicId, Language), Seq[UserId])] = validLocalizedUsers - .groupBy { - case (topicId: TopicId, language: Language, _) => - (topicId, language) - } - .mapValues(_.map(_._3).distinct) // values are distinct producerIds - .toSeq - - localeProducerSeedIds.foreach { // stats - case (_, seedIds: Seq[UserId]) => - topicLangPairCount.inc() - if (seedIds.isEmpty) topicLangPairCountEmptySeed.inc() - if (seedIds.length <= 1) topicLangPairCountLteOneSeed.inc() - if (seedIds.length <= 5) topicLangPairCountLteFiveSeeds.inc() - if (seedIds.length <= 10) topicLangPairCountLteTenSeeds.inc() - } - - localeProducerSeedIds - }.forceToDisk - } - - def uttEntitiesSource( - customFullMetadataSource: Option[TypedPipe[FullMetadata]] = None - )( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[Long] = { - customFullMetadataSource - .getOrElse(fullMetadataSource) - .flatMap { - case fullMetadata if fullMetadata.domainId == UTTDomain => - for { - basicMetadata <- fullMetadata.basicMetadata - indexableFields <- basicMetadata.indexableFields - tags <- indexableFields.tags - if !SemanticCoreFilters.shouldFilterByTags(tags.toSet, DatasetConstants.stopTags) - } yield { - fullMetadata.entityId - } - case _ => None - } - } - - // Get followable topics from Escherbird - def uttFollowableEntitiesSource( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[Long] = { - val followableEntityCount = Stat("followable_entities_count") - val FollowableTag = "utt:followable_topic" - fullMetadataSource - .flatMap { - case fullMetadata if 
fullMetadata.domainId == UTTDomain => - for { - basicMetadata <- fullMetadata.basicMetadata - indexableFields <- basicMetadata.indexableFields - tags <- indexableFields.tags - if tags.contains(FollowableTag) - } yield { - followableEntityCount.inc() - fullMetadata.entityId - } - case _ => None - } - } - - def fullMetadataSource( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[FullMetadata] = { - TypedPipe - .from( - new FullMetadataSource(s"/atla/proc/${FullMetadataSource.DefaultHdfsPath}")()( - dateRange.embiggen(Days(7)))) - } - - def userSource(implicit timeZone: TimeZone): TypedPipe[(UserId, (Country, Language))] = - DAL - .readMostRecentSnapshotNoOlderThan(UsersourceFlatScalaDataset, Days(7)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .withColumns(usersourceColumns) - .toTypedPipe.flatMap { flatUser => - for { - userId <- flatUser.id - country <- flatUser.accountCountryCode - language <- flatUser.language - standardLang <- getStandardLanguageCode(language) - } yield { - (userId, country.toUpperCase, standardLang) - } - }.distinct - .map { case (user, country, lang) => user -> (country, lang) } - - // Build user language source from inferred languages (penguin_user_languages dataset) - def inferredUserConsumedLanguageSource( - implicit timeZone: TimeZone - ): TypedPipe[(UserId, Seq[(Language, Double)])] = { - DAL - .readMostRecentSnapshotNoOlderThan(PenguinUserLanguagesScalaDataset, Days(7)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { kv => - val consumed = kv.value.consumed - .collect { - case scoredString if scoredString.weight > 0.001 => //throw away 5% outliers - (getStandardLanguageCode(scoredString.item), scoredString.weight) - }.collect { - case (Some(language), score) => (language, score) - } - (kv.key, consumed) - } - } - - def inferredUserProducedLanguageSource( - implicit timeZone: TimeZone - ): TypedPipe[(UserId, Seq[(Language, Double)])] = { - DAL - .readMostRecentSnapshotNoOlderThan(PenguinUserLanguagesScalaDataset, Days(7)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { kv => - val produced = kv.value.produced - .collect { - case scoredString if scoredString.weight > 0.15 => //throw away 5% outliers - (getStandardLanguageCode(scoredString.item), scoredString.weight) - }.collect { - case (Some(language), score) => (language, score) - } - (kv.key, produced) - } - } - - def simClustersInterestInSource( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[KeyVal[UserId, ClustersUserIsInterestedIn]] = { - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - def simClustersInterestInLogFavSource( - minLogFavScore: Double - )( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, Map[ClusterId, Double])] = { - simClustersInterestInSource.map { - case KeyVal(userId, clustersUserIsInterestedIn) => - userId -> clustersUserIsInterestedIn.clusterIdToScores - .map { - case (clusterId, scores) => - clusterId -> scores.logFavScore.getOrElse(0.0) - } - .filter(_._2 > minLogFavScore) - .toMap - } - } - - def topicFollowGraphSource( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(TopicId, UserId)] = { - val userTopicFollowCount = Stat("user_topic_follow_count") - DAL - .readMostRecentSnapshotNoOlderThan(UserTopicRelationSnapshotScalaDataset, Days(7)) - 
.withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .collect { - case userInterestsRelationSnapshot: UserInterestsRelationSnapshot - if userInterestsRelationSnapshot.interestType == "UTT" && - userInterestsRelationSnapshot.relation == InterestRelationType.Followed => - (userInterestsRelationSnapshot.interestId, userInterestsRelationSnapshot.userId) - } - .hashJoin(uttFollowableEntitiesSource.asKeys) - .map { - case (topic, (user, _)) => - userTopicFollowCount.inc() - (topic, user) - } - } - - def notInterestedTopicsSource( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(TopicId, UserId)] = { - val userNotInterestedInTopicsCount = Stat("user_not_interested_in_topics_count") - DAL - .readMostRecentSnapshotNoOlderThan( - UserInterestRelationSnapshotScalaDataset, - Days(7)).withRemoteReadPolicy(ExplicitLocation(ProcAtla)).toTypedPipe.collect { - case userInterestsRelationSnapshot: UserInterestsRelationSnapshot - if userInterestsRelationSnapshot.interestType == "UTT" && - userInterestsRelationSnapshot.relation == InterestRelationType.NotInterested => - (userInterestsRelationSnapshot.interestId, userInterestsRelationSnapshot.userId) - } - .hashJoin(uttFollowableEntitiesSource.asKeys) - .map { - case (topic, (user, _)) => - userNotInterestedInTopicsCount.inc() - (topic, user) - } - } - - def tweetSource( - implicit dateRange: DateRange - ): TypedPipe[UnhydratedTweet] = { - DAL - .read(UnhydratedScalaDataset, dateRange).withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - def flatTweetsSource( - implicit dateRange: DateRange - ): TypedPipe[UnhydratedFlatTweet] = { - DAL - .read(UnhydratedFlatScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } - - def userTweetFavoritesSource( - implicit dateRange: DateRange - ): TypedPipe[(UserId, TweetId, Timestamp)] = { - DAL - .read(TimelineServiceFavoritesScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .flatMap { cfe: ContextualizedFavoriteEvent => - cfe.event match { - case FavoriteEventUnion.Favorite(fav) => - Some(fav.userId, fav.tweetId, fav.eventTimeMs) - case _ => - None - } - } - } - - def userTweetImpressionsSource( - dwellSec: Int = 1 - )( - implicit dateRange: DateRange - ): TypedPipe[(UserId, TweetId, Timestamp)] = { - DAL - .read(UserInteractionScalaDataset, dateRange) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .flatMap { - case userInteraction - if userInteraction.interactionType == InteractionType.TweetImpressions => - userInteraction.interactionDetails match { - case InteractionDetails.TweetImpressionDetails( - TweetImpressionDetails(tweetId, _, dwellTimeInSecOpt)) - if dwellTimeInSecOpt.exists(_ >= dwellSec) => - Some(userInteraction.userId, tweetId, userInteraction.timeStamp) - case _ => - None - } - case _ => None - } - } - - def transformFavEdges( - input: TypedPipe[EdgeWithDecayedWeights], - halfLifeInDaysForFavScore: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(Long, Long, Double)] = { - val numEdgesWithSpecifiedHalfLife = Stat( - s"num_edges_with_specified_half_life_${halfLifeInDaysForFavScore}_days") - val numEdgesWithoutSpecifiedHalfLife = Stat( - s"num_edges_without_specified_half_life_${halfLifeInDaysForFavScore}_days") - input - .flatMap { edge => - if (edge.weights.halfLifeInDaysToDecayedSums.contains(halfLifeInDaysForFavScore)) { - numEdgesWithSpecifiedHalfLife.inc() - Some((edge.sourceId, edge.destinationId, 
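              // Note: the lookup on the next line hardcodes the 100-day half-life key rather than
              // reusing halfLifeInDaysForFavScore checked in the guard above; the callers in this
              // package pass HalfLifeInDaysForFavScore = 100, so the two happen to agree.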
edge.weights.halfLifeInDaysToDecayedSums(100))) - } else { - numEdgesWithoutSpecifiedHalfLife.inc() - None - } - } - } - - def getFavEdges( - halfLifeInDaysForFavScore: Int - )( - implicit dateRange: DateRange, - uniqueID: UniqueID - ): TypedPipe[(Long, Long, Double)] = { - implicit val tz: java.util.TimeZone = DateOps.UTC - transformFavEdges( - DAL - .readMostRecentSnapshotNoOlderThan(UserUserFavGraphScalaDataset, Days(14)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe, - halfLifeInDaysForFavScore - ) - } - - def flockReportAsSpamSource( - )( - implicit dateRange: DateRange - ): TypedPipe[(UserId, UserId)] = { - DAL - .readMostRecentSnapshot(FlockReportAsSpamEdgesScalaDataset) - .toTypedPipe - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockBlocksSource( - )( - implicit dateRange: DateRange - ): TypedPipe[(UserId, UserId)] = { - DAL - .readMostRecentSnapshot(FlockBlocksEdgesScalaDataset) - .toTypedPipe - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockFollowsSource( - )( - implicit dateRange: DateRange - ): TypedPipe[(UserId, UserId)] = { - DAL - .readMostRecentSnapshot(FlockFollowsEdgesScalaDataset) - .toTypedPipe - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockReportAsAbuseSource( - )( - implicit dateRange: DateRange - ): TypedPipe[(UserId, UserId)] = { - DAL - .readMostRecentSnapshot(FlockReportAsAbuseEdgesScalaDataset) - .toTypedPipe - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def magicRecsNotficationOpenOrClickEventsSource( - implicit dateRange: DateRange - ): TypedPipe[MagicRecsNotificationLite] = { - DAL - .read(MagicrecsNotificationLite1DayLagScalaDataset, dateRange) - .toTypedPipe - .filter { entry => - // keep entries with a valid userId and tweetId, opened or clicked timestamp defined - val userIdExists = entry.targetUserId.isDefined - val tweetIdExists = entry.tweetId.isDefined - val openOrClickExists = - entry.openTimestampMs.isDefined || entry.ntabClickTimestampMs.isDefined - userIdExists && tweetIdExists && openOrClickExists - } - } - - def ieSourceTweetEngagementsSource(implicit dateRange: DateRange): TypedPipe[InteractionEvent] = { - DAL - .read( - com.twitter.iesource.processing.events.batch.ServerEngagementsScalaDataset, - dateRange).withColumns( - Set("targetId", "targetType", "engagingUserId", "details", "referenceTweet")) - .toTypedPipe - .filter { event => - // filter out logged out users because their favorites are less reliable - event.engagingUserId > 0L && event.targetType == InteractionTargetType.Tweet - } - } - - private def userIdFromBlenderAdaptiveScribeLog( - blenderAdaptiveLog: AdaptiveSearchScribeLog - ): Option[Long] = { - blenderAdaptiveLog.versionedCommonHeader match { - case VersionedCommonHeader.CommonHeader(CommonHeader.ServerHeader(serverHeader)) => - serverHeader.requestInfo match { - case Some(requestInfo) => requestInfo.ids.get(IdType.UserId).map(_.toLong) - case _ => None - } - case _ => None - } - } - - def adaptiveSearchScribeLogsSource( - implicit dateRange: DateRange - ): TypedPipe[(UserId, String)] = { - val searchData: TypedPipe[AdaptiveSearchScribeLog] = - DAL - .read(AdaptiveSearchScalaDataset, dateRange).toTypedPipe - - searchData - .flatMap({ scribeLog: AdaptiveSearchScribeLog => - for { - userId <- 
userIdFromBlenderAdaptiveScribeLog(scribeLog) - // filter out logged out search queries - if userId != 0 - queryString <- scribeLog.requestLog.flatMap(_.request).flatMap(_.rawQuery) - } yield { - (userId, Set(queryString)) - } - }) - // if a user searches for the same query multiple times, there could be duplicates. - // De-dup them to get the distinct queries searched by a user - .sumByKey - .flatMap { - case (userId, distinctQuerySet) => - distinctQuerySet.map { query => - (userId, query) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/SimClustersEmbeddingJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/SimClustersEmbeddingJob.scala deleted file mode 100644 index db5ba807d..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/common/SimClustersEmbeddingJob.scala +++ /dev/null @@ -1,248 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.common - -import com.twitter.scalding.{Args, DateRange, Execution, TypedPipe, UniqueID} -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil._ -import com.twitter.simclusters_v2.thriftscala._ -import java.util.TimeZone - -/** - * This is the base job for computing SimClusters Embedding for any Noun Type on Twitter, such as - * Users, Tweets, Topics, Entities, Channels, etc. - * - * The most straightforward way to understand the SimClusters Embeddings for a Noun is that it is - * a weighted sum of SimClusters InterestedIn vectors from users who are interested in the Noun. - * So for a noun type, you only need to define `prepareNounToUserMatrix` to pass in a matrix which - * represents how much each user is interested in this noun. 
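To make the weighted-sum description above concrete, here is a minimal sketch using plain Scala collections rather than the `SparseMatrix`/`SparseRowMatrix` pipes this job operates on; the names, default thresholds, and simplified normalization are illustrative only (the real job row-L2-normalizes the noun-to-user matrix and column-L2-normalizes the user-to-cluster matrix before multiplying them).

```scala
object NounEmbeddingSketch {
  type UserId = Long
  type ClusterId = Int

  // L2-normalize a sparse vector represented as a Map.
  def l2Normalize[K](v: Map[K, Double]): Map[K, Double] = {
    val norm = math.sqrt(v.values.map(x => x * x).sum)
    if (norm == 0.0) v else v.map { case (k, x) => k -> x / norm }
  }

  // Embedding of one noun = sum over users of (normalized noun -> user weight) multiplied by
  // that user's InterestedIn cluster scores, then thresholded and truncated to the top clusters.
  // The column normalization of the user -> cluster matrix is omitted to keep the sketch short.
  def nounEmbedding(
    nounToUserWeights: Map[UserId, Double],                 // e.g. how much each user engages with this noun
    userInterestedIn: Map[UserId, Map[ClusterId, Double]],  // each user's SimClusters InterestedIn vector
    numClustersPerNoun: Int = 60,
    threshold: Double = 0.01
  ): Seq[(ClusterId, Double)] =
    l2Normalize(nounToUserWeights).toSeq
      .flatMap {
        case (userId, weight) =>
          userInterestedIn.getOrElse(userId, Map.empty[ClusterId, Double]).map {
            case (clusterId, score) => clusterId -> weight * score
          }
      }
      .groupBy { case (clusterId, _) => clusterId }
      .map { case (clusterId, parts) => clusterId -> parts.map(_._2).sum }
      .toSeq
      .filter { case (_, score) => score > threshold }
      .sortBy { case (_, score) => -score }
      .take(numClustersPerNoun)
}
```

In the job itself this product is expressed as a distributed `multiplySkinnySparseRowMatrix`, with the same score thresholding and per-row/per-column top-K truncation applied afterwards.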
- */ -trait SimClustersEmbeddingBaseJob[NounType] { - - def numClustersPerNoun: Int - - def numNounsPerClusters: Int - - def thresholdForEmbeddingScores: Double - - def numReducersOpt: Option[Int] = None - - def prepareNounToUserMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[NounType, UserId, Double] - - def prepareUserToClusterMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseRowMatrix[UserId, ClusterId, Double] - - def writeNounToClustersIndex( - output: TypedPipe[(NounType, Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] - - def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[(NounType, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] - - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val embeddingMatrix: SparseRowMatrix[NounType, ClusterId, Double] = - prepareNounToUserMatrix.rowL2Normalize - .multiplySkinnySparseRowMatrix( - prepareUserToClusterMatrix.colL2Normalize, - numReducersOpt - ) - .filter((_, _, v) => v > thresholdForEmbeddingScores) - - Execution - .zip( - writeNounToClustersIndex( - embeddingMatrix.sortWithTakePerRow(numClustersPerNoun)(Ordering.by(-_._2)) - ), - writeClusterToNounsIndex( - embeddingMatrix.sortWithTakePerCol(numNounsPerClusters)( - Ordering.by(-_._2) - ) - ) - ) - .unit - } - -} - -object SimClustersEmbeddingJob { - - /** - * Multiply the [user, cluster] and [user, T] matrices, and return the cross product. - */ - def computeEmbeddings[T]( - simClustersSource: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - normalizedInputMatrix: TypedPipe[(UserId, (T, Double))], - scoreExtractors: Seq[UserToInterestedInClusterScores => (Double, ScoreType.ScoreType)], - modelVersion: ModelVersion, - toSimClustersEmbeddingId: (T, ScoreType.ScoreType) => SimClustersEmbeddingId, - numReducers: Option[Int] = None - ): TypedPipe[(SimClustersEmbeddingId, (ClusterId, Double))] = { - val userSimClustersMatrix = - getUserSimClustersMatrix(simClustersSource, scoreExtractors, modelVersion) - multiplyMatrices( - normalizedInputMatrix, - userSimClustersMatrix, - toSimClustersEmbeddingId, - numReducers) - } - - def getL2Norm[T]( - inputMatrix: TypedPipe[(T, (UserId, Double))], - numReducers: Option[Int] = None - )( - implicit ordering: Ordering[T] - ): TypedPipe[(T, Double)] = { - val l2Norm = inputMatrix - .mapValues { - case (_, score) => score * score - } - .sumByKey - .mapValues(math.sqrt) - - numReducers match { - case Some(reducers) => l2Norm.withReducers(reducers) - case _ => l2Norm - } - } - - def getNormalizedTransposeInputMatrix[T]( - inputMatrix: TypedPipe[(T, (UserId, Double))], - numReducers: Option[Int] = None - )( - implicit ordering: Ordering[T] - ): TypedPipe[(UserId, (T, Double))] = { - val inputWithNorm = inputMatrix.join(getL2Norm(inputMatrix, numReducers)) - - (numReducers match { - case Some(reducers) => inputWithNorm.withReducers(reducers) - case _ => inputWithNorm - }).map { - case (inputId, ((userId, favScore), norm)) => - (userId, (inputId, favScore / norm)) - } - } - - /** - * Matrix multiplication with the ability to tune the reducer size for better performance - */ - @Deprecated - def legacyMultiplyMatrices[T]( - normalizedTransposeInputMatrix: TypedPipe[(UserId, (T, Double))], - userSimClustersMatrix: 
TypedPipe[(UserId, Seq[(ClusterId, Double)])], - numReducers: Int // Matrix multiplication is expensive. Use this to tune performance - )( - implicit ordering: Ordering[T] - ): TypedPipe[((ClusterId, T), Double)] = { - normalizedTransposeInputMatrix - .join(userSimClustersMatrix) - .withReducers(numReducers) - .flatMap { - case (_, ((inputId, score), clustersWithScores)) => - clustersWithScores.map { - case (clusterId, clusterScore) => - ((clusterId, inputId), score * clusterScore) - } - } - .sumByKey - .withReducers(numReducers + 1) // +1 to distinguish this step from above in Dr. Scalding - } - - def multiplyMatrices[T]( - normalizedTransposeInputMatrix: TypedPipe[(UserId, (T, Double))], - userSimClustersMatrix: TypedPipe[(UserId, Seq[((ClusterId, ScoreType.ScoreType), Double)])], - toSimClustersEmbeddingId: (T, ScoreType.ScoreType) => SimClustersEmbeddingId, - numReducers: Option[Int] = None - ): TypedPipe[(SimClustersEmbeddingId, (ClusterId, Double))] = { - val inputJoinedWithSimClusters = numReducers match { - case Some(reducers) => - normalizedTransposeInputMatrix - .join(userSimClustersMatrix) - .withReducers(reducers) - case _ => - normalizedTransposeInputMatrix.join(userSimClustersMatrix) - } - - val matrixMultiplicationResult = inputJoinedWithSimClusters.flatMap { - case (_, ((inputId, inputScore), clustersWithScores)) => - clustersWithScores.map { - case ((clusterId, scoreType), clusterScore) => - ((clusterId, toSimClustersEmbeddingId(inputId, scoreType)), inputScore * clusterScore) - } - }.sumByKey - - (numReducers match { - case Some(reducers) => - matrixMultiplicationResult.withReducers(reducers + 1) - case _ => matrixMultiplicationResult - }).map { - case ((clusterId, embeddingId), score) => - (embeddingId, (clusterId, score)) - } - } - - def getUserSimClustersMatrix( - simClustersSource: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - scoreExtractors: Seq[UserToInterestedInClusterScores => (Double, ScoreType.ScoreType)], - modelVersion: ModelVersion - ): TypedPipe[(UserId, Seq[((ClusterId, ScoreType.ScoreType), Double)])] = { - simClustersSource.map { - case (userId, clusters) - if ModelVersions.toModelVersion(clusters.knownForModelVersion) == modelVersion => - userId -> clusters.clusterIdToScores.flatMap { - case (clusterId, clusterScores) => - scoreExtractors.map { scoreExtractor => - scoreExtractor(clusterScores) match { - case (score, scoreType) => ((clusterId, scoreType), score) - } - } - }.toSeq - case (userId, _) => userId -> Nil - } - } - - def toReverseIndexSimClusterEmbedding( - embeddings: TypedPipe[(SimClustersEmbeddingId, (ClusterId, EmbeddingScore))], - topK: Int - ): TypedPipe[(SimClustersEmbeddingId, InternalIdEmbedding)] = { - embeddings - .map { - case (embeddingId, (clusterId, score)) => - ( - SimClustersEmbeddingId( - embeddingId.embeddingType, - embeddingId.modelVersion, - InternalId.ClusterId(clusterId)), - (embeddingId.internalId, score)) - } - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - .mapValues { topInternalIdsWithScore => - val internalIdsWithScore = topInternalIdsWithScore.map { - case (internalId, score) => InternalIdWithScore(internalId, score) - } - InternalIdEmbedding(internalIdsWithScore) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFavBasedProducerEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFavBasedProducerEmbeddings.scala deleted file mode 100644 index e4c1d6f58..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFavBasedProducerEmbeddings.scala +++ /dev/null @@ -1,278 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.producer - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.simclusters_v2.hdfs_sources.{ - AggregatableProducerSimclustersEmbeddingsByFavScoreScalaDataset, - AggregatableProducerSimclustersEmbeddingsByFavScoreThriftScalaDataset, - AggregatableProducerSimclustersEmbeddingsByFavScore2020ScalaDataset, - AggregatableProducerSimclustersEmbeddingsByFavScore2020ThriftScalaDataset -} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} -import java.util.TimeZone - -/** - * See AggregatableProducerEmbeddingsBaseApp for an explanation of this job. - * - * Production job: -capesospy-v2 update aggregatable_producer_embeddings_by_fav_score src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableFavBasedProducerEmbeddingsScheduledApp - extends AggregatableFavBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. - private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_fav_score" - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score_thrift" - ) - - override def firstTime: RichDate = RichDate("2020-05-11") - - override def batchIncrement: Duration = Days(7) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByFavScoreScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = AggregatableProducerSimclustersEmbeddingsByFavScoreThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/** - * Production job: -capesospy-v2 update --build_locally --start_cron aggregatable_producer_embeddings_by_fav_score_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableFavBasedProducerEmbeddings2020ScheduledApp - extends AggregatableFavBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. 
- private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_fav_score_20m145k2020" - - // getHdfsPath appends model version str to the pathSuffix - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score_thrift" - ) - - override def firstTime: RichDate = RichDate("2021-03-04") - - override def batchIncrement: Duration = Days(7) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByFavScore2020ScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = AggregatableProducerSimclustersEmbeddingsByFavScore2020ThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/*** - * Adhoc job: - -scalding remote run --user recos-platform \ ---main-class com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddingsAdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_fav_based_producer_embeddings_job-adhoc \ --- --date 2020-05-11 - - */ -object AggregatableFavBasedProducerEmbeddingsAdhocApp - extends AggregatableFavBasedProducerEmbeddingsBaseApp - with AdhocExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - private val outputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score" - ) - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score_thrift" - ) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .flatMap { keyVal => - keyVal.value.embedding.map { simClusterWithScore => - ( - keyVal.key.embeddingType, - keyVal.key.modelVersion, - keyVal.key.internalId, - simClusterWithScore.clusterId, - simClusterWithScore.score - ) - } - } - .writeExecution( - // Write to TSV for easier debugging of the adhoc job. 
- TypedTsv(outputPath) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeExecution( - new FixedPathLzoScrooge(outputPathThrift, SimClustersEmbeddingWithId) - ) - } -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_fav_based_producer_embeddings_job_2020-adhoc -scalding remote run \ ---user cassowary \ ---keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ ---principal service_acoount@TWITTER.BIZ \ ---cluster bluebird-qus1 \ ---main-class com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddings2020AdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_fav_based_producer_embeddings_job_2020-adhoc \ ---hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ --- --date 2020-06-28 - */ -object AggregatableFavBasedProducerEmbeddings2020AdhocApp - extends AggregatableFavBasedProducerEmbeddingsBaseApp - with AdhocExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - private val outputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score" - ) - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_fav_score_thrift" - ) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .flatMap { keyVal => - keyVal.value.embedding.map { simClusterWithScore => - ( - keyVal.key.embeddingType, - keyVal.key.modelVersion, - keyVal.key.internalId, - simClusterWithScore.clusterId, - simClusterWithScore.score - ) - } - } - .writeExecution( - // Write to TSV for easier debugging of the adhoc job. 
- TypedTsv(outputPath) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeExecution( - new FixedPathLzoScrooge(outputPathThrift, SimClustersEmbeddingWithId) - ) - } -} - -trait AggregatableFavBasedProducerEmbeddingsBaseApp extends AggregatableProducerEmbeddingsBaseApp { - override val userToProducerScoringFn: NeighborWithWeights => Double = - _.favScoreHalfLife100Days.getOrElse(0.0) - override val userToClusterScoringFn: UserToInterestedInClusterScores => Double = - _.favScore.getOrElse(0.0) - override val embeddingType: EmbeddingType = EmbeddingType.AggregatableFavBasedProducer -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFollowBasedProducerEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFollowBasedProducerEmbeddings.scala deleted file mode 100644 index d18b66a7f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableFollowBasedProducerEmbeddings.scala +++ /dev/null @@ -1,165 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.producer - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.simclusters_v2.hdfs_sources.AggregatableProducerSimclustersEmbeddingsByFollowScore2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.AggregatableProducerSimclustersEmbeddingsByFollowScore2020ThriftScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingWithId -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * This file implements a new Producer SimClusters Embeddings. - * The differences with existing producer embeddings are: - * - * 1) the embedding scores are not normalized, so that one can aggregate multiple producer embeddings by adding them. - * 2) we use follow scores in the user-producer graph and user-simclusters graph. - */ - -/** - * Production job: -capesospy-v2 update --build_locally --start_cron aggregatable_producer_embeddings_by_follow_score_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableFollowBasedProducerEmbeddings2020ScheduledApp - extends AggregatableFollowBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. 
- private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_follow_score_20m145k2020" - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_follow_score_thrift" - ) - - override def batchIncrement: Duration = Days(7) - - override def firstTime: RichDate = RichDate("2021-11-10") - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByFollowScore2020ScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = AggregatableProducerSimclustersEmbeddingsByFollowScore2020ThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_follow_based_producer_embeddings_job_2020-adhoc -scalding remote run \ ---user cassowary \ ---keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ ---principal service_acoount@TWITTER.BIZ \ ---cluster bluebird-qus1 \ ---main-class com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFollowBasedProducerEmbeddings2020AdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_follow_based_producer_embeddings_job_2020-adhoc \ ---hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ --- --date 2021-11-10 - */ - -object AggregatableFollowBasedProducerEmbeddings2020AdhocApp - extends AggregatableFollowBasedProducerEmbeddingsBaseApp - with AdhocExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - private val outputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_follow_score" - ) - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_follow_score_thrift" - ) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .flatMap { keyVal => - keyVal.value.embedding.map { simClusterWithScore => - ( - keyVal.key.embeddingType, - keyVal.key.modelVersion, - keyVal.key.internalId, - simClusterWithScore.clusterId, - simClusterWithScore.score - ) - } - } - .writeExecution( - // Write to TSV for easier debugging of the adhoc job. 
- TypedTsv(outputPath) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeExecution( - new FixedPathLzoScrooge(outputPathThrift, SimClustersEmbeddingWithId) - ) - } -} - -trait AggregatableFollowBasedProducerEmbeddingsBaseApp - extends AggregatableProducerEmbeddingsBaseApp { - override val userToProducerScoringFn: NeighborWithWeights => Double = - _.followScoreNormalizedByNeighborFollowersL2.getOrElse(0.0) - override val userToClusterScoringFn: UserToInterestedInClusterScores => Double = - _.followScoreClusterNormalizedOnly.getOrElse(0.0) - override val embeddingType: EmbeddingType = EmbeddingType.AggregatableFollowBasedProducer -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableLogFavBasedProducerEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableLogFavBasedProducerEmbeddings.scala deleted file mode 100644 index 8344043b5..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableLogFavBasedProducerEmbeddings.scala +++ /dev/null @@ -1,368 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.producer - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.simclusters_v2.hdfs_sources.{ - AggregatableProducerSimclustersEmbeddingsByLogFavScoreScalaDataset, - AggregatableProducerSimclustersEmbeddingsByLogFavScoreThriftScalaDataset, - AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset, - AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ThriftScalaDataset, - AggregatableProducerSimclustersEmbeddingsByLogFavScoreRelaxedFavEngagementThreshold2020ScalaDataset, - AggregatableProducerSimclustersEmbeddingsByLogFavScoreRelaxedFavEngagementThreshold2020ThriftScalaDataset -} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - ModelVersion, - NeighborWithWeights, - SimClustersEmbedding, - SimClustersEmbeddingId, - SimClustersEmbeddingWithId, - UserToInterestedInClusterScores -} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} -import java.util.TimeZone - -/** - * This file implements a new Producer SimClusters Embeddings. - * The differences with existing producer embeddings are: - * - * 1) the embedding scores are not normalized, so that one can aggregate multiple producer embeddings by adding them. - * 2) we use log-fav scores in the user-producer graph and user-simclusters graph. - * LogFav scores are smoother than fav scores we previously used and they are less sensitive to outliers - * - * - * - * The main difference with other normalized embeddings is the `convertEmbeddingToAggregatableEmbeddings` function - * where we multiply the normalized embedding with producer's norms. The resulted embeddings are then - * unnormalized and aggregatable. 
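A quick illustration of why the scores are deliberately left unnormalized, as a sketch over plain Scala maps rather than the production thrift types: aggregating several producers (or any other group of accounts) then reduces to element-wise addition of their cluster-score maps, which would not be meaningful if each embedding had already been rescaled to unit length.

```scala
object AggregatableEmbeddingSketch {
  type ClusterId = Int
  type Embedding = Map[ClusterId, Double] // unnormalized SimClusters scores

  // Element-wise sum of any number of embeddings; well defined only because the
  // per-producer scores share a common, unnormalized scale.
  def aggregate(embeddings: Seq[Embedding]): Embedding =
    embeddings
      .flatMap(_.toSeq)
      .groupBy { case (clusterId, _) => clusterId }
      .map { case (clusterId, parts) => clusterId -> parts.map(_._2).sum }

  // Example: combining three producer embeddings into one.
  val combined: Embedding = aggregate(
    Seq(
      Map(1 -> 0.8, 7 -> 0.3),
      Map(1 -> 0.5, 42 -> 1.2),
      Map(7 -> 0.1)
    )
  )
  // combined: Map(1 -> 1.3, 7 -> 0.4, 42 -> 1.2), up to floating-point rounding
}
```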
- * - */ -/** - * Production job: -capesospy-v2 update aggregatable_producer_embeddings_by_logfav_score src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableLogFavBasedProducerEmbeddingsScheduledApp - extends AggregatableLogFavBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. - private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_logfav_score" - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_logfav_score_thrift" - ) - - override def batchIncrement: Duration = Days(7) - - override def firstTime: RichDate = RichDate("2020-04-05") - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByLogFavScoreScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = AggregatableProducerSimclustersEmbeddingsByLogFavScoreThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/** - * Production job: -capesospy-v2 update --build_locally --start_cron aggregatable_producer_embeddings_by_logfav_score_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableLogFavBasedProducerEmbeddings2020ScheduledApp - extends AggregatableLogFavBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. 
- private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_logfav_score_20m145k2020" - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_logfav_score_thrift" - ) - - override def batchIncrement: Duration = Days(7) - - override def firstTime: RichDate = RichDate("2021-03-05") - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = AggregatableProducerSimclustersEmbeddingsByLogFavScore2020ThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/** - * Production job: -capesospy-v2 update --build_locally --start_cron aggregatable_producer_embeddings_by_logfav_score_relaxed_fav_engagement_threshold_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object AggregatableLogFavBasedProducerEmbeddingsRelaxedFavEngagementThreshold2020ScheduledApp - extends AggregatableLogFavBasedProducerEmbeddingsBaseApp - with ScheduledExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - override val embeddingType: EmbeddingType = EmbeddingType.RelaxedAggregatableLogFavBasedProducer - - // Relax fav engagement threshold - override val minNumFavers = 15 - - // Not using the EmbeddingUtil.getHdfsPath to preserve the previous functionality. 
- private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/producer_simclusters_aggregatable_embeddings_by_logfav_score_relaxed_fav_engagement_threshold_20m145k2020" - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = - "producer_simclusters_aggregatable_embeddings_by_logfav_score_relaxed_fav_score_threshold_thrift" - ) - - override def batchIncrement: Duration = Days(7) - - override def firstTime: RichDate = RichDate("2021-07-26") - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALVersionedKeyValExecution( - AggregatableProducerSimclustersEmbeddingsByLogFavScoreRelaxedFavEngagementThreshold2020ScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeDALSnapshotExecution( - dataset = - AggregatableProducerSimclustersEmbeddingsByLogFavScoreRelaxedFavEngagementThreshold2020ThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(outputPathThrift), - fmt = D.Parquet, - endDate = dateRange.end - ) - } -} - -/*** - * Adhoc job: - -scalding remote run --user recos-platform \ ---main-class com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddingsAdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_logfav_based_producer_embeddings_job-adhoc \ --- --date 2020-04-08 - - */ -object AggregatableLogFavBasedProducerEmbeddingsAdhocApp - extends AggregatableLogFavBasedProducerEmbeddingsBaseApp - with AdhocExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - - private val outputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_log_fav_score" - ) - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_log_fav_score_thrift" - ) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .flatMap { keyVal => - keyVal.value.embedding.map { simClusterWithScore => - ( - keyVal.key.embeddingType, - keyVal.key.modelVersion, - keyVal.key.internalId, - simClusterWithScore.clusterId, - simClusterWithScore.score - ) - } - } - .writeExecution( - // Write to TSV for easier debugging of the adhoc job. 
- TypedTsv(outputPath) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeExecution( - new FixedPathLzoScrooge(outputPathThrift, SimClustersEmbeddingWithId) - ) - } -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_logfav_based_producer_embeddings_job_2020-adhoc -scalding remote run \ ---user cassowary \ ---keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ ---principal service_acoount@TWITTER.BIZ \ ---cluster bluebird-qus1 \ ---main-class com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddings2020AdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/embedding/producer:aggregatable_logfav_based_producer_embeddings_job_2020-adhoc \ ---hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ --- --date 2020-06-28 - */ - -object AggregatableLogFavBasedProducerEmbeddings2020AdhocApp - extends AggregatableLogFavBasedProducerEmbeddingsBaseApp - with AdhocExecutionApp { - - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - private val outputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_log_fav_score" - ) - - private val outputPathThrift: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = "producer_simclusters_aggregatable_embeddings_by_log_fav_score_thrift" - ) - - override def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .flatMap { keyVal => - keyVal.value.embedding.map { simClusterWithScore => - ( - keyVal.key.embeddingType, - keyVal.key.modelVersion, - keyVal.key.internalId, - simClusterWithScore.clusterId, - simClusterWithScore.score - ) - } - } - .writeExecution( - // Write to TSV for easier debugging of the adhoc job. 
- TypedTsv(outputPath) - ) - } - - override def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - output - .writeExecution( - new FixedPathLzoScrooge(outputPathThrift, SimClustersEmbeddingWithId) - ) - } -} - -trait AggregatableLogFavBasedProducerEmbeddingsBaseApp - extends AggregatableProducerEmbeddingsBaseApp { - override val userToProducerScoringFn: NeighborWithWeights => Double = _.logFavScore.getOrElse(0.0) - override val userToClusterScoringFn: UserToInterestedInClusterScores => Double = - _.logFavScore.getOrElse(0.0) - override val embeddingType: EmbeddingType = EmbeddingType.AggregatableLogFavBasedProducer -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableProducerEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableProducerEmbeddings.scala deleted file mode 100644 index cd6755328..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/AggregatableProducerEmbeddings.scala +++ /dev/null @@ -1,168 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.producer - -import com.twitter.scalding._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scalding_internal.source.lzo_scrooge.FixedPathLzoScrooge -import com.twitter.simclusters_v2.hdfs_sources.{DataSources, InterestedInSources} -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.embedding.ProducerEmbeddingsFromInterestedIn -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.{ - ClusterId, - ProducerId, - UserId -} -import com.twitter.simclusters_v2.scalding.embedding.common.SimClustersEmbeddingBaseJob -import com.twitter.simclusters_v2.thriftscala.{EmbeddingType, _} -import java.util.TimeZone - -/** - * This file implements a new Producer SimClusters Embeddings. - * The differences with existing producer embeddings are: - * - * 1) the embedding scores are not normalized, so that one can aggregate multiple producer embeddings by adding them. - * 2) we use log-fav scores in the user-producer graph and user-simclusters graph. - * LogFav scores are smoother than fav scores we previously used and they are less sensitive to outliers - * - * - * - * The main difference with other normalized embeddings is the `convertEmbeddingToAggregatableEmbeddings` function - * where we multiply the normalized embedding with producer's norms. The resulted embeddings are then - * unnormalized and aggregatable. 
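The `convertEmbeddingToAggregatableEmbeddings` step described above is small enough to sketch in isolation (plain Scala values here; the production code performs the equivalent join against `rowL2Norms` on TypedPipes): the base job L2-normalizes each producer's user-engagement row before the matrix product, so multiplying the finished embedding back by that stored row norm restores a magnitude proportional to the producer's total engagement, which is what makes the resulting embeddings additive.

```scala
object UnNormalizeSketch {
  type UserId = Long
  type ClusterId = Int

  // L2 norm of a producer's raw user-engagement row (e.g. log-fav scores per user).
  def rowL2Norm(row: Map[UserId, Double]): Double =
    math.sqrt(row.values.map(x => x * x).sum)

  // Revert the normalization applied before the matrix multiply: scale every cluster
  // score by the producer's original row norm.
  def toAggregatable(
    normalizedEmbedding: Seq[(ClusterId, Double)],
    norm: Double
  ): Seq[(ClusterId, Double)] =
    normalizedEmbedding.map { case (clusterId, score) => (clusterId, score * norm) }
}
```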
- * - */ -trait AggregatableProducerEmbeddingsBaseApp extends SimClustersEmbeddingBaseJob[ProducerId] { - - val userToProducerScoringFn: NeighborWithWeights => Double - val userToClusterScoringFn: UserToInterestedInClusterScores => Double - val modelVersion: ModelVersion - - // Minimum engagement threshold - val minNumFavers: Int = ProducerEmbeddingsFromInterestedIn.minNumFaversForProducer - - override def numClustersPerNoun: Int = 60 - - override def numNounsPerClusters: Int = 500 // this is not used for now - - override def thresholdForEmbeddingScores: Double = 0.01 - - override def prepareNounToUserMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[ProducerId, UserId, Double] = { - - SparseMatrix( - ProducerEmbeddingsFromInterestedIn - .getFilteredUserUserNormalizedGraph( - DataSources.userUserNormalizedGraphSource, - DataSources.userNormsAndCounts, - userToProducerScoringFn, - _.faverCount.exists( - _ > minNumFavers - ) - ) - .map { - case (userId, (producerId, score)) => - (producerId, userId, score) - }) - } - - override def prepareUserToClusterMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseRowMatrix[UserId, ClusterId, Double] = { - SparseRowMatrix( - ProducerEmbeddingsFromInterestedIn - .getUserSimClustersMatrix( - InterestedInSources - .simClustersInterestedInSource(modelVersion, dateRange.embiggen(Days(5)), timeZone), - userToClusterScoringFn, - modelVersion - ) - .mapValues(_.toMap), - isSkinnyMatrix = true - ) - } - - // in order to make the embeddings aggregatable, we need to revert the normalization - // (multiply the norms) we did when computing embeddings in the base job. - def convertEmbeddingToAggregatableEmbeddings( - embeddings: TypedPipe[(ProducerId, Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(ProducerId, Seq[(ClusterId, Double)])] = { - embeddings.join(prepareNounToUserMatrix.rowL2Norms).map { - case (producerId, (embeddingVec, norm)) => - producerId -> embeddingVec.map { - case (id, score) => (id, score * norm) - } - } - } - - override final def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[(ProducerId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { Execution.unit } // we do not need this for now - - /** - * Override this method to write the manhattan dataset. - */ - def writeToManhattan( - output: TypedPipe[KeyVal[SimClustersEmbeddingId, SimClustersEmbedding]] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] - - /** - * Override this method to writethrough the thrift dataset. 
- */ - def writeToThrift( - output: TypedPipe[SimClustersEmbeddingWithId] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] - - val embeddingType: EmbeddingType - - override final def writeNounToClustersIndex( - output: TypedPipe[(ProducerId, Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val convertedEmbeddings = convertEmbeddingToAggregatableEmbeddings(output) - .map { - case (producerId, topSimClustersWithScore) => - val id = SimClustersEmbeddingId( - embeddingType = embeddingType, - modelVersion = modelVersion, - internalId = InternalId.UserId(producerId)) - - val embeddings = SimClustersEmbedding(topSimClustersWithScore.map { - case (clusterId, score) => SimClusterWithScore(clusterId, score) - }) - - SimClustersEmbeddingWithId(id, embeddings) - } - - val keyValuePairs = convertedEmbeddings.map { simClustersEmbeddingWithId => - KeyVal(simClustersEmbeddingWithId.embeddingId, simClustersEmbeddingWithId.embedding) - } - val manhattanExecution = writeToManhattan(keyValuePairs) - - val thriftExecution = writeToThrift(convertedEmbeddings) - - Execution.zip(manhattanExecution, thriftExecution).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/BUILD.bazel deleted file mode 100644 index d6ff0d162..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/producer/BUILD.bazel +++ /dev/null @@ -1,223 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "escherbird/src/scala/com/twitter/escherbird/scalding/source", - "src/scala/com/twitter/onboarding/relevance/source:utt_account_recommendations-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:aggregatable_producer_simclusters_embeddings_by_fav_score-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:aggregatable_producer_simclusters_embeddings_by_log_fav_score-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/common/matrix", - "src/scala/com/twitter/simclusters_v2/scalding/embedding", - "src/scala/com/twitter/wtf/entity_real_graph/common", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/onboarding/relevance/candidates:candidates-scala", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/wtf/entity_real_graph:entity_real_graph-thrift-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -hadoop_binary( - name = "aggregatable_logfav_based_producer_embeddings_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddingsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - 
"bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_logfav_based_producer_embeddings_job_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddings2020AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_follow_based_producer_embeddings_job_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFollowBasedProducerEmbeddings2020AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_logfav_based_producer_embeddings_job", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddingsScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_logfav_based_producer_embeddings_job_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddings2020ScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_follow_based_producer_embeddings_job_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFollowBasedProducerEmbeddings2020ScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_logfav_based_producer_embeddings_job_relaxed_fav_engagement_threshold_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddingsRelaxedFavEngagementThreshold2020ScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_fav_based_producer_embeddings_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddingsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_fav_based_producer_embeddings_job", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddingsScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -hadoop_binary( - name = "aggregatable_fav_based_producer_embeddings_job_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddings2020AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - 
-hadoop_binary( - name = "aggregatable_fav_based_producer_embeddings_job_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddings2020ScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -# Generated with `capesospy-v2 create_target aggregatable_producer_embeddings_by_logfav_score src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml`, config hash f8229a. -scalding_job( - name = "aggregatable_producer_embeddings_by_logfav_score", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableLogFavBasedProducerEmbeddingsScheduledApp", - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "17 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) - -# Generated with `capesospy-v2 create_target aggregatable_producer_embeddings_by_fav_score src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml`, config hash bc0103. -scalding_job( - name = "aggregatable_producer_embeddings_by_fav_score", - main = "com.twitter.simclusters_v2.scalding.embedding.producer.AggregatableFavBasedProducerEmbeddingsScheduledApp", - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "17 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":producer"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/BUILD deleted file mode 100644 index 2f9cb7ebe..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/BUILD +++ /dev/null @@ -1,196 +0,0 @@ -scala_library( - sources = [ - "*.scala", - "common/*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "escherbird/src/scala/com/twitter/escherbird/scalding/source", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service:user_topic_relation_snapshot-scala", - "src/scala/com/twitter/ml/featurestore/catalog/datasets/timelines:timelines-user-topic-aggregates", - "src/scala/com/twitter/onboarding/relevance/source:utt_account_recommendations-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:simclusters_v2_embeddings_lite-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:user_topic_weighted_embedding-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/common/matrix", - "src/scala/com/twitter/simclusters_v2/scalding/embedding", - 
"src/scala/com/twitter/timelines/prediction/common/aggregates:user_topic_aggregates-scala", - "src/scala/com/twitter/wtf/entity_real_graph/common", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/aggregation_framework/conversion", - "timelines/data_processing/ml_util/aggregation_framework/metrics", - "timelines/data_processing/ml_util/aggregation_framework/scalding", - ], -) - -hadoop_binary( - name = "fav_tfg_topic_embeddings-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddingsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_tfg_topic_embeddings_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddings2020AdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_tfg_topic_embeddings", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddingsScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_tfg_topic_embeddings_2020", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddings2020ScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "logfav_tfg_topic_embeddings-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.LogFavTfgTopicEmbeddingsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "logfav_tfg_topic_embeddings", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.LogFavTfgTopicEmbeddingsScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_inferred_lang_tfg_topic_embeddings-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavInferredLanguageTfgBasedTopicEmbeddingsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_inferred_lang_tfg_topic_embeddings", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavInferredLanguageTfgBasedTopicEmbeddingsScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -hadoop_binary( - name = "fav_tfg_topic_embeddings_2020_copy", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddings2020CopyScheduledApp", - platform = "java8", - 
runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":tfg"], -) - -scalding_job( - name = "fav_weighted_user_topic_tfg_embeddings_adhoc_job", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.EngagementWeightedTfgBasedTopicEmbeddingsAdhocJob", - config = [ - ("hadoop.reduce.jvm.total-memory", "8192m"), - ("hadoop.combine-input", "true"), - ( - "job.args", - ["--date 2021-10-28"], - ), - ], - hadoop_cluster = "atla-proc3", #"qus1-bluebird", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tfg", - ], -) - -scalding_job( - name = "fav_weighted_user_topic_tfg_embeddings_batch_job", - main = "com.twitter.simclusters_v2.scalding.embedding.tfg.EngagementWeightedTfgBasedTopicEmbeddingsScheduleJob", - args = [], - config = [ - ("hadoop.reduce.jvm.total-memory", "8192m"), - ("hadoop.combine-input", "true"), - ("submitter.cluster", "atla"), - ], - cron = "0 1 * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tfg", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/EngagementWeightedTfgBasedTopicEmbeddingsJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/EngagementWeightedTfgBasedTopicEmbeddingsJob.scala deleted file mode 100644 index 5ce6284af..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/EngagementWeightedTfgBasedTopicEmbeddingsJob.scala +++ /dev/null @@ -1,310 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.dal.client.dataset.SnapshotDALDatasetBase -import com.twitter.ml.api.DataSetPipe -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.constant.SharedFeatures -import com.twitter.ml.api.util.SRichDataRecord -import com.twitter.scalding.Execution -import com.twitter.scalding._ -import com.twitter.scalding.typed.UnsortedGrouped -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite.WriteExtension -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources.FavTfgTopicEmbeddings2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserTopicWeightedEmbeddingScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.UserTopicWeightedEmbeddingParquetScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion._ -import com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationConfig -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.DateRangeExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp 
-import java.util.TimeZone - -/** - * Jobs to generate Fav-based engagement weighted Topic-Follow-Graph (TFG) topic embeddings - * The job uses fav based TFG embeddings and fav based engagement to produce a new embedding - */ - -/** - * ./bazel bundle ... - * scalding workflow upload --jobs src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_weighted_user_topic_tfg_embeddings_adhoc_job --autoplay - */ -object EngagementWeightedTfgBasedTopicEmbeddingsAdhocJob - extends AdhocExecutionApp - with EngagementWeightedTfgBasedTopicEmbeddingsBaseJob { - override val outputByFav = - "/user/cassowary/adhoc/manhattan_sequence_files/simclusters_v2_embedding/user_tfgembedding/by_fav" - override val parquetOutputByFav = - "/user/cassowary/adhoc/processed/simclusters_v2_embedding/user_tfgembedding/by_fav/snapshot" -} - -/** - * ./bazel bundle ... - * scalding workflow upload --jobs src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_weighted_user_topic_tfg_embeddings_batch_job --autoplay - */ -object EngagementWeightedTfgBasedTopicEmbeddingsScheduleJob - extends ScheduledExecutionApp - with EngagementWeightedTfgBasedTopicEmbeddingsBaseJob { - override val firstTime: RichDate = RichDate("2021-10-03") - override val batchIncrement: Duration = Days(1) - override val outputByFav = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_embedding/user_tfgembedding/by_fav" - override val parquetOutputByFav = - "/user/cassowary/processed/simclusters_v2_embedding/user_tfgembedding/by_fav/snapshot" -} - -trait EngagementWeightedTfgBasedTopicEmbeddingsBaseJob extends DateRangeExecutionApp { - - val outputByFav: String - val parquetOutputByFav: String - - //root path to read aggregate data - private val aggregateFeatureRootPath = - "/atla/proc2/user/timelines/processed/aggregates_v2" - - private val topKTopicsToKeep = 100 - - private val favContinuousFeature = new Continuous( - "user_topic_aggregate.pair.recap.engagement.is_favorited.any_feature.50.days.count") - - private val parquetDataSource: SnapshotDALDatasetBase[UserTopicWeightedEmbedding] = - UserTopicWeightedEmbeddingParquetScalaDataset - - def sortedTake[K](m: Map[K, Double], keysToKeep: Int): Map[K, Double] = { - m.toSeq.sortBy { case (k, v) => -v }.take(keysToKeep).toMap - } - - case class UserTopicEngagement( - userId: Long, - topicId: Long, - language: String, - country: String, //field is not used - favCount: Double) { - val userLanguageGroup: (Long, String) = (userId, language) - } - - def prepareUserToTopicEmbedding( - favTfgTopicEmbeddings: TypedPipe[(Long, String, SimClustersEmbedding)], - userTopicEngagementCount: TypedPipe[UserTopicEngagement] - )( - implicit uniqueID: UniqueID - ): TypedPipe[((Long, String), Map[Int, Double])] = { - val userTfgEmbeddingsStat = Stat("User Tfg Embeddings Count") - val userTopicTopKEngagementStat = Stat("User Topic Top K engagement count") - val userEngagementStat = Stat("User engagement count") - val tfgEmbeddingsStat = Stat("TFG Embedding Map count") - - //get only top K topics - val userTopKTopicEngagementCount: TypedPipe[UserTopicEngagement] = userTopicEngagementCount - .groupBy(_.userLanguageGroup) - .withReducers(499) - .withDescription("select topK topics") - .sortedReverseTake(topKTopicsToKeep)(Ordering.by(_.favCount)) - .values - .flatten - - //(userId, language), totalCount - val userLanguageEngagementCount: UnsortedGrouped[(Long, String), Double] = - userTopKTopicEngagementCount - .collect { - case UserTopicEngagement(userId, topicId, language, country, favCount) => - 
userTopicTopKEngagementStat.inc() - ((userId, language), favCount) - }.sumByKey - .withReducers(499) - .withDescription("fav count by user") - - //(topicId, language), (userId, favWeight) - val topicUserWithNormalizedWeights: TypedPipe[((Long, String), (Long, Double))] = - userTopKTopicEngagementCount - .groupBy(_.userLanguageGroup) - .join(userLanguageEngagementCount) - .withReducers(499) - .withDescription("join userTopic and user EngagementCount") - .collect { - case ((userId, language), (engagementData, totalCount)) => - userEngagementStat.inc() - ( - (engagementData.topicId, engagementData.language), - (userId, engagementData.favCount / totalCount) - ) - } - - // (topicId, language), embeddingMap - val tfgEmbeddingsMap: TypedPipe[((Long, String), Map[Int, Double])] = favTfgTopicEmbeddings - .map { - case (topicId, language, embedding) => - tfgEmbeddingsStat.inc() - ((topicId, language), embedding.embedding.map(a => a.clusterId -> a.score).toMap) - } - .withDescription("covert sim cluster embedding to map") - - // (userId, language), clusters - val newUserTfgEmbedding = topicUserWithNormalizedWeights - .join(tfgEmbeddingsMap) - .withReducers(799) - .withDescription("join user | topic | favWeight * embedding") - .collect { - case ((topicId, language), ((userId, favWeight), embeddingMap)) => - userTfgEmbeddingsStat.inc() - ((userId, language), embeddingMap.mapValues(_ * favWeight)) - } - .sumByKey - .withReducers(799) - .withDescription("aggregate embedding by user") - - newUserTfgEmbedding.toTypedPipe - } - - def writeOutput( - newUserTfgEmbedding: TypedPipe[((Long, String), Map[Int, Double])], - outputPath: String, - parquetOutputPath: String, - modelVersion: String - )( - implicit uniqueID: UniqueID, - dateRange: DateRange - ): Execution[Unit] = { - val outputRecordStat = Stat("output record count") - val output = newUserTfgEmbedding - .map { - //language has been purposely ignored because the entire logic is based on the fact that - //user is mapped to a language. 
In future if a user is mapped to multiple languages then - //the final output needs to be keyed on (userId, language) - case ((userId, language), embeddingMap) => - outputRecordStat.inc() - val clusterScores = embeddingMap.map { - case (clusterId, score) => - clusterId -> UserToInterestedInClusterScores(favScore = Some(score)) - } - KeyVal(userId, ClustersUserIsInterestedIn(modelVersion, clusterScores)) - } - - val keyValExec = output - .withDescription("write output keyval dataset") - .writeDALVersionedKeyValExecution( - UserTopicWeightedEmbeddingScalaDataset, - D.Suffix(outputPath)) - - val parquetExec = newUserTfgEmbedding - .map { - case ((userId, language), embeddingMap) => - val clusterScores = embeddingMap.map { - case (clusterId, score) => ClustersScore(clusterId, score) - } - UserTopicWeightedEmbedding(userId, clusterScores.toSeq) - } - .withDescription("write output parquet dataset") - .writeDALSnapshotExecution( - parquetDataSource, - D.Daily, - D.Suffix(parquetOutputPath), - D.Parquet, - dateRange.end - ) - Execution.zip(keyValExec, parquetExec).unit - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val end = dateRange.start - val start = end - Days(21) - val featureDateRange = DateRange(start, end - Millisecs(1)) - val outputPath = args.getOrElse("output_path", outputByFav) - val parquetOutputPath = args.getOrElse("parquet_output_path", parquetOutputByFav) - val modelVersion = ModelVersions.Model20M145K2020 - - //define stats counter - val favTfgTopicEmbeddingsStat = Stat("FavTfgTopicEmbeddings") - val userTopicEngagementStat = Stat("UserTopicEngagement") - val userTopicsStat = Stat("UserTopics") - val userLangStat = Stat("UserLanguage") - - //get fav based tfg embeddings - //topic can have different languages and the clusters will be different - //current logic is to filter based on user language - // topicId, lang, embedding - val favTfgTopicEmbeddings: TypedPipe[(Long, String, SimClustersEmbedding)] = DAL - .readMostRecentSnapshot(FavTfgTopicEmbeddings2020ScalaDataset, featureDateRange) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .collect { - case KeyVal( - SimClustersEmbeddingId( - embedType, - modelVersion, - InternalId.LocaleEntityId(LocaleEntityId(entityId, language))), - embedding) => - favTfgTopicEmbeddingsStat.inc() - (entityId, language, embedding) - } - - /* - Ideally, if the timeline aggregate framework provided data with breakdown by language, - it could have been joined with (topic, language) embedding. Since, it is not possible - we fetch the language of the user from other sources. - This returns language for the user so that it could be joined with (topic, language) embedding. - `userSource` returns 1 language per user - `inferredUserConsumedLanguageSource` returns multiple languages with confidence values - */ - val userLangSource = ExternalDataSources.userSource - .map { - case (userId, (country, language)) => - userLangStat.inc() - (userId, (language, country)) - } - - //get userid, topicid, favcount as aggregated dataset - //currently there is no way to get language breakdown from the timeline aggregate framework. 
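    // Note on the aggregate read below: AggregatesV2MostRecentFeatureSource reads the most recent
    // output of the timelines "user_topic_aggregates" store, whose DataRecords carry USER_ID and
    // TOPIC_ID plus aggregate features such as the 50-day favorite count read via
    // favContinuousFeature; records with favCount <= 0 are dropped before the join with
    // userLangSource. Downstream, prepareUserToTopicEmbedding keeps each user's top
    // topKTopicsToKeep topics per language, normalizes their fav counts to sum to 1, and sums the
    // topics' TFG embeddings under those weights (e.g. fav counts of 6 and 4 over two topics
    // become weights 0.6 and 0.4 on the corresponding embeddings).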
- val userTopicEngagementPipe: DataSetPipe = AggregatesV2MostRecentFeatureSource( - rootPath = aggregateFeatureRootPath, - storeName = "user_topic_aggregates", - aggregates = - Set(TimelinesAggregationConfig.userTopicAggregates).flatMap(_.buildTypedAggregateGroups()), - ).read - - val userTopicEngagementCount = userTopicEngagementPipe.records - .flatMap { record => - val sRichDataRecord = SRichDataRecord(record) - val userId: Long = sRichDataRecord.getFeatureValue(SharedFeatures.USER_ID) - val topicId: Long = sRichDataRecord.getFeatureValue(TimelinesSharedFeatures.TOPIC_ID) - val favCount: Double = sRichDataRecord - .getFeatureValueOpt(favContinuousFeature).map(_.toDouble).getOrElse(0.0) - userTopicEngagementStat.inc() - if (favCount > 0) { - List((userId, (topicId, favCount))) - } else None - }.join(userLangSource) - .collect { - case (userId, ((topicId, favCount), (language, country))) => - userTopicsStat.inc() - UserTopicEngagement(userId, topicId, language, country, favCount) - } - .withDescription("User Topic aggregated favcount") - - // combine user, topics, topic_embeddings - // and take weighted aggregate of the tfg embedding - val newUserTfgEmbedding = - prepareUserToTopicEmbedding(favTfgTopicEmbeddings, userTopicEngagementCount) - - writeOutput(newUserTfgEmbedding, outputPath, parquetOutputPath, modelVersion) - - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavInferredLanguageTfgBasedTopicEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavInferredLanguageTfgBasedTopicEmbeddings.scala deleted file mode 100644 index 14604af6a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavInferredLanguageTfgBasedTopicEmbeddings.scala +++ /dev/null @@ -1,66 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.EntityEmbeddingsSources -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - ModelVersion, - SimClustersEmbeddingId, - UserToInterestedInClusterScores, - SimClustersEmbedding => ThriftSimClustersEmbedding -} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} - -/** - * Apps to generate fav-based Topic-Follow-Graph (TFG) topic embeddings from inferred languages - * The fav-based embeddings are built from topic followers' fav-based InterestedIn - */ - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_inferred_lang_tfg_topic_embeddings-adhoc - scalding remote run \ - --user cassowary \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - --cluster bluebird-qus1 \ - --main-class com.twitter.simclusters_v2.scalding.embedding.tfg.FavInferredLanguageTfgBasedTopicEmbeddingsAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_inferred_lang_tfg_topic_embeddings-adhoc \ - --hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ - -- --date 2020-06-28 - */ -object FavInferredLanguageTfgBasedTopicEmbeddingsAdhocApp - extends InferredLanguageTfgBasedTopicEmbeddingsBaseApp - with AdhocExecutionApp { - override val isAdhoc: Boolean = true - override val embeddingType: EmbeddingType = EmbeddingType.FavInferredLanguageTfgTopic - override val embeddingSource: KeyValDALDataset[ - 
KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavInferredLanguageTfgTopicEmbeddingsDataset - override val pathSuffix: String = "fav_inferred_lang_tfg_topic_embeddings" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_inferred_lang_tfg_topic_embeddings -capesospy-v2 update --build_locally --start_cron fav_inferred_lang_tfg_topic_embeddings src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object FavInferredLanguageTfgBasedTopicEmbeddingsScheduledApp - extends InferredLanguageTfgBasedTopicEmbeddingsBaseApp - with ScheduledExecutionApp { - override val isAdhoc: Boolean = false - override val embeddingType: EmbeddingType = EmbeddingType.FavInferredLanguageTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavInferredLanguageTfgTopicEmbeddingsDataset - override val pathSuffix: String = "fav_inferred_lang_tfg_topic_embeddings" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) - - override val firstTime: RichDate = RichDate("2020-07-04") - override val batchIncrement: Duration = Days(1) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala deleted file mode 100644 index d3e2d6525..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala +++ /dev/null @@ -1,172 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDatasetBase -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite.WriteExtension -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.EntityEmbeddingsSources -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.TfgTopicEmbeddings -import com.twitter.simclusters_v2.thriftscala.UserToInterestedInClusterScores -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Jobs to generate Fav-based Topic-Follow-Graph (TFG) topic embeddings - * A topic's fav-based TFG embedding is the sum of its followers' fav-based InterestedIn - */ - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings-adhoc - scalding remote run \ - --user cassowary \ - --keytab 
/var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - --cluster bluebird-qus1 \ - --main-class com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddingsAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings-adhoc \ - --hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ - -- --date 2020-12-08 - */ -object FavTfgTopicEmbeddingsAdhocApp extends TfgBasedTopicEmbeddingsBaseApp with AdhocExecutionApp { - override val isAdhoc: Boolean = true - override val embeddingType: EmbeddingType = EmbeddingType.FavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavTfgTopicEmbeddingsDataset - override val pathSuffix: String = "fav_tfg_topic_embedding" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.FavTfgTopicEmbeddingsParquetDataset - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings -capesospy-v2 update --build_locally --start_cron fav_tfg_topic_embeddings src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object FavTfgTopicEmbeddingsScheduledApp - extends TfgBasedTopicEmbeddingsBaseApp - with ScheduledExecutionApp { - override val isAdhoc: Boolean = false - override val embeddingType: EmbeddingType = EmbeddingType.FavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavTfgTopicEmbeddingsDataset - override val pathSuffix: String = "fav_tfg_topic_embedding" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.FavTfgTopicEmbeddingsParquetDataset - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) - - override val firstTime: RichDate = RichDate("2020-05-25") - override val batchIncrement: Duration = Days(1) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings_2020-adhoc - scalding remote run \ - --user cassowary \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - --cluster bluebird-qus1 \ - --main-class com.twitter.simclusters_v2.scalding.embedding.tfg.FavTfgTopicEmbeddings2020AdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings_2020-adhoc \ - --hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ - -- --date 2020-12-08 - */ -object FavTfgTopicEmbeddings2020AdhocApp - extends TfgBasedTopicEmbeddingsBaseApp - with AdhocExecutionApp { - override val isAdhoc: Boolean = true - override val embeddingType: EmbeddingType = EmbeddingType.FavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavTfgTopicEmbeddings2020Dataset - override val pathSuffix: String = "fav_tfg_topic_embedding" - override val modelVersion: ModelVersion = 
ModelVersion.Model20m145k2020 - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.FavTfgTopicEmbeddings2020ParquetDataset - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings_2020 -capesospy-v2 update --build_locally --start_cron fav_tfg_topic_embeddings_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object FavTfgTopicEmbeddings2020ScheduledApp - extends TfgBasedTopicEmbeddingsBaseApp - with ScheduledExecutionApp { - override val isAdhoc: Boolean = false - override val embeddingType: EmbeddingType = EmbeddingType.FavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavTfgTopicEmbeddings2020Dataset - override val pathSuffix: String = "fav_tfg_topic_embedding" - override val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.FavTfgTopicEmbeddings2020ParquetDataset - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.favScore.getOrElse(0.0) - - override val firstTime: RichDate = RichDate("2021-03-10") - override val batchIncrement: Duration = Days(1) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings_2020_copy -scalding scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:fav_tfg_topic_embeddings_2020_copy - */ - -/** - * This is a copy job where we copy the previous version of TFG and write to a new one. - * The dependent dataset for TFG has been deleted. - * Instead of restarting the entire job, we create this temp hacky solution to keep TFG dataset alive until we deprecate topics. - * Having a table TFG doesn't lead to a big quality concern b/c TFG is built from topic follows, which is relative stable - * and we don't have new topics anymore. 
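 * Concretely, the scheduled app below reads the most recent snapshot of FavTfgTopicEmbeddings2020Dataset (no older than 21 days) and re-writes it as a new version of the same key-val dataset at its usual HDFS path, keeping the dataset fresh without recomputing any embeddings.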
- */ -object FavTfgTopicEmbeddings2020CopyScheduledApp extends ScheduledExecutionApp { - val isAdhoc: Boolean = false - val embeddingType: EmbeddingType = EmbeddingType.FavTfgTopic - val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.FavTfgTopicEmbeddings2020Dataset - val pathSuffix: String = "fav_tfg_topic_embedding" - val modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - override val firstTime: RichDate = RichDate("2023-01-20") - override val batchIncrement: Duration = Days(3) - - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - DAL - .readMostRecentSnapshotNoOlderThan( - EntityEmbeddingsSources.FavTfgTopicEmbeddings2020Dataset, - Days(21)) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .writeDALVersionedKeyValExecution( - EntityEmbeddingsSources.FavTfgTopicEmbeddings2020Dataset, - D.Suffix( - EmbeddingUtil - .getHdfsPath(isAdhoc = isAdhoc, isManhattanKeyVal = true, modelVersion, pathSuffix)) - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/InferredLanguageTfgBasedTopicEmbeddingsBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/InferredLanguageTfgBasedTopicEmbeddingsBaseApp.scala deleted file mode 100644 index 2ee09cc8f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/InferredLanguageTfgBasedTopicEmbeddingsBaseApp.scala +++ /dev/null @@ -1,194 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite.{D, _} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{Country, Language, SimClustersEmbedding, TopicId} -import com.twitter.simclusters_v2.hdfs_sources.InterestedInSources -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.{UserId, _} -import com.twitter.simclusters_v2.scalding.embedding.common.{ - EmbeddingUtil, - ExternalDataSources, - SimClustersEmbeddingBaseJob -} -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - InternalId, - ModelVersion, - SimClustersEmbeddingId, - UserToInterestedInClusterScores, - SimClustersEmbedding => ThriftSimClustersEmbedding, - TopicId => ThriftTopicId -} -import com.twitter.wtf.scalding.jobs.common.DateRangeExecutionApp -import java.util.TimeZone - -/** - * Base app to generate Topic-Follow-Graph (TFG) topic embeddings from inferred languages. - * In this app, topic embeddings are keyed by (topic, language, country). - * Given a (topic t, country c, language l) tuple, the embedding is the sum of the - * InterestedIn of the topic followers whose inferred language has l and account country is c - * The language and the country fields in the keys are optional. - * The app will generate 1) country-language-based 2) language-based 3) global embeddings in one dataset. 
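 * For example, a follow of topic t by a user whose account country is US and whose inferred consumed languages include en with score s contributes rows keyed by (t, Some(en), Some(US)) and (t, Some(en), None) with weight s, plus (t, None, None) with weight 1.0.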
- * It's up to the clients to decide which embeddings to use - */ -trait InferredLanguageTfgBasedTopicEmbeddingsBaseApp - extends SimClustersEmbeddingBaseJob[(TopicId, Option[Language], Option[Country])] - with DateRangeExecutionApp { - - val isAdhoc: Boolean - val embeddingType: EmbeddingType - val embeddingSource: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding]] - val pathSuffix: String - val modelVersion: ModelVersion - def scoreExtractor: UserToInterestedInClusterScores => Double - - override def numClustersPerNoun: Int = 50 - override def numNounsPerClusters: Int = 1 // not used for now. Set to an arbitrary number - override def thresholdForEmbeddingScores: Double = 0.001 - - implicit val inj: Injection[(TopicId, Option[Language], Option[Country]), Array[Byte]] = - Bufferable.injectionOf[(TopicId, Option[Language], Option[Country])] - - // Default to 10K, top 1% for (topic, country, language) follows - // Child classes may want to tune this number for their own use cases. - val minPerCountryFollowers = 10000 - val minFollowers = 100 - - def getTopicUsers( - topicFollowGraph: TypedPipe[(TopicId, UserId)], - userSource: TypedPipe[(UserId, (Country, Language))], - userLanguages: TypedPipe[(UserId, Seq[(Language, Double)])] - ): TypedPipe[((TopicId, Option[Language], Option[Country]), UserId, Double)] = { - topicFollowGraph - .map { case (topic, user) => (user, topic) } - .join(userSource) - .join(userLanguages) - .flatMap { - case (user, ((topic, (country, _)), scoredLangs)) => - scoredLangs.flatMap { - case (lang, score) => - Seq( - ((topic, Some(lang), Some(country)), user, score), // with language and country - ((topic, Some(lang), None), user, score) // with language - ) - } ++ Seq(((topic, None, None), user, 1.0)) // non-language - }.forceToDisk - } - - def getValidTopics( - topicUsers: TypedPipe[((TopicId, Option[Language], Option[Country]), UserId, Double)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(TopicId, Option[Language], Option[Country])] = { - val countryBasedTopics = Stat("country_based_topics") - val nonCountryBasedTopics = Stat("non_country_based_topics") - - val (countryBased, nonCountryBased) = topicUsers.partition { - case ((_, lang, country), _, _) => lang.isDefined && country.isDefined - } - - SparseMatrix(countryBased).rowL1Norms.collect { - case (key, l1Norm) if l1Norm >= minPerCountryFollowers => - countryBasedTopics.inc() - key - } ++ - SparseMatrix(nonCountryBased).rowL1Norms.collect { - case (key, l1Norm) if l1Norm >= minFollowers => - nonCountryBasedTopics.inc() - key - } - } - - override def prepareNounToUserMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[(TopicId, Option[Language], Option[Country]), UserId, Double] = { - val topicUsers = getTopicUsers( - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.userSource, - ExternalDataSources.inferredUserConsumedLanguageSource) - - SparseMatrix[(TopicId, Option[Language], Option[Country]), UserId, Double](topicUsers) - .filterRows(getValidTopics(topicUsers)) - } - - override def prepareUserToClusterMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseRowMatrix[UserId, ClusterId, Double] = - SparseRowMatrix( - InterestedInSources - .simClustersInterestedInSource(modelVersion, dateRange, timeZone) - .map { - case (userId, clustersUserIsInterestedIn) => - userId -> clustersUserIsInterestedIn.clusterIdToScores - .map { - case (clusterId, scores) => - clusterId -> 
scoreExtractor(scores) - } - .filter(_._2 > 0.0) - .toMap - }, - isSkinnyMatrix = true - ) - - override def writeNounToClustersIndex( - output: TypedPipe[((TopicId, Option[Language], Option[Country]), Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val topicEmbeddingCount = Stat(s"topic_embedding_count") - - val tsvExec = - output - .map { - case ((entityId, language, country), clustersWithScores) => - (entityId, language, country, clustersWithScores.take(5).mkString(",")) - } - .shard(5) - .writeExecution(TypedTsv[(TopicId, Option[Language], Option[Country], String)]( - s"/user/recos-platform/adhoc/topic_embedding/$pathSuffix/$ModelVersionPathMap($modelVersion)")) - - val keyValExec = output - .map { - case ((entityId, lang, country), clustersWithScores) => - topicEmbeddingCount.inc() - KeyVal( - SimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.TopicId(ThriftTopicId(entityId, lang, country)) - ), - SimClustersEmbedding(clustersWithScores).toThrift - ) - } - .writeDALVersionedKeyValExecution( - embeddingSource, - D.Suffix( - EmbeddingUtil - .getHdfsPath(isAdhoc = isAdhoc, isManhattanKeyVal = true, modelVersion, pathSuffix)) - ) - if (isAdhoc) - Execution.zip(tsvExec, keyValExec).unit - else - keyValExec - } - - override def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[((TopicId, Option[Language], Option[Country]), Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - Execution.unit // do not need this - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/LogFavTfgBasedTopicEmbeddings.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/LogFavTfgBasedTopicEmbeddings.scala deleted file mode 100644 index 1869b5c64..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/LogFavTfgBasedTopicEmbeddings.scala +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.dal.client.dataset.{KeyValDALDataset, SnapshotDALDatasetBase} -import com.twitter.scalding._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.EntityEmbeddingsSources -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - ModelVersion, - SimClustersEmbeddingId, - TfgTopicEmbeddings, - UserToInterestedInClusterScores, - SimClustersEmbedding => ThriftSimClustersEmbedding -} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} - -/** - * Jobs to generate Logfav-based Topic-Follow-Graph (TFG) topic embeddings - * A topic's logfav-based TFG embedding is the sum of its followers' logfav-based InterestedIn - */ - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:logfav_tfg_topic_embeddings-adhoc - scalding remote run \ - --user cassowary \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - --cluster bluebird-qus1 \ - --main-class com.twitter.simclusters_v2.scalding.embedding.tfg.LogFavTfgTopicEmbeddingsAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:logfav_tfg_topic_embeddings-adhoc \ - --hadoop-properties "scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=4000" \ - -- --date 2020-12-08 - */ -object LogFavTfgTopicEmbeddingsAdhocApp - extends TfgBasedTopicEmbeddingsBaseApp - with 
AdhocExecutionApp { - override val isAdhoc: Boolean = true - override val embeddingType: EmbeddingType = EmbeddingType.LogFavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.LogFavTfgTopicEmbeddingsDataset - override val pathSuffix: String = "logfav_tfg_topic_embedding" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.LogFavTfgTopicEmbeddingsParquetDataset - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.logFavScore.getOrElse(0.0) -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg:logfav_tfg_topic_embeddings -capesospy-v2 update --build_locally --start_cron logfav_tfg_topic_embeddings src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object LogFavTfgTopicEmbeddingsScheduledApp - extends TfgBasedTopicEmbeddingsBaseApp - with ScheduledExecutionApp { - override val isAdhoc: Boolean = false - override val embeddingType: EmbeddingType = EmbeddingType.LogFavTfgTopic - override val embeddingSource: KeyValDALDataset[ - KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding] - ] = EntityEmbeddingsSources.LogFavTfgTopicEmbeddingsDataset - override val pathSuffix: String = "logfav_tfg_topic_embedding" - override val modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated - override def scoreExtractor: UserToInterestedInClusterScores => Double = scores => - scores.logFavScore.getOrElse(0.0) - override val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] = - EntityEmbeddingsSources.LogFavTfgTopicEmbeddingsParquetDataset - override val firstTime: RichDate = RichDate("2020-05-25") - override val batchIncrement: Duration = Days(1) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/README b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/README deleted file mode 100644 index d08ff73f1..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/README +++ /dev/null @@ -1,7 +0,0 @@ -TFG stands for Topic Follow Graph -The TFG topic embeddings are embeddings built from Topic Follow Graph. -Each topic is represented by the sum of its followers' user InterestedIn embeddings. 
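As a rough in-memory sketch of that aggregation (illustrative only; the helper below is hypothetical, while the actual jobs in this directory compute it with Scalding SparseMatrix/SparseRowMatrix products and a minimum-follower threshold):

```scala
// Illustrative only: a topic's TFG embedding as the sum of its followers'
// InterestedIn embeddings (clusterId -> score). Names are hypothetical.
def tfgEmbedding(
  followers: Seq[Long],                      // users following the topic
  interestedIn: Map[Long, Map[Int, Double]]  // userId -> (clusterId -> score)
): Map[Int, Double] =
  followers
    .flatMap(interestedIn.get) // users without an InterestedIn embedding contribute nothing
    .flatten                   // (clusterId, score) pairs across all followers
    .groupBy { case (clusterId, _) => clusterId }
    .map { case (clusterId, scores) => clusterId -> scores.map(_._2).sum }
```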
- -There are two types of embeddings: -logfav - topic embeddings built from followers' logfav-based InterestedIn -fav - topic embeddings built from followers' fav-based InterestedIn diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/TfgBasedTopicEmbeddingsBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/TfgBasedTopicEmbeddingsBaseApp.scala deleted file mode 100644 index 2725bafb5..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/tfg/TfgBasedTopicEmbeddingsBaseApp.scala +++ /dev/null @@ -1,191 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.tfg - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.dal.client.dataset.{KeyValDALDataset, SnapshotDALDatasetBase} -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite.{D, _} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{Language, SimClustersEmbedding, TopicId} -import com.twitter.simclusters_v2.hdfs_sources.InterestedInSources -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.{UserId, _} -import com.twitter.simclusters_v2.scalding.embedding.common.{ - EmbeddingUtil, - ExternalDataSources, - SimClustersEmbeddingBaseJob -} -import com.twitter.simclusters_v2.thriftscala.{ - ClustersScore, - EmbeddingType, - TfgTopicEmbeddings, - InternalId, - LocaleEntityId, - ModelVersion, - SimClustersEmbeddingId, - UserToInterestedInClusterScores, - SimClustersEmbedding => ThriftSimClustersEmbedding, - TopicId => TID -} -import com.twitter.wtf.scalding.jobs.common.DateRangeExecutionApp - -import java.util.TimeZone - -/** - * Base app for the Topic-Follow-Graph (TFG) topic embeddings - * A topic's TFG embedding is represented by the sum of all the users who followed the topic - */ -trait TfgBasedTopicEmbeddingsBaseApp - extends SimClustersEmbeddingBaseJob[(TopicId, Language)] - with DateRangeExecutionApp { - - val isAdhoc: Boolean - val embeddingType: EmbeddingType - val embeddingSource: KeyValDALDataset[KeyVal[SimClustersEmbeddingId, ThriftSimClustersEmbedding]] - val pathSuffix: String - val modelVersion: ModelVersion - val parquetDataSource: SnapshotDALDatasetBase[TfgTopicEmbeddings] - def scoreExtractor: UserToInterestedInClusterScores => Double - - override def numClustersPerNoun: Int = 50 - override def numNounsPerClusters: Int = 1 // not used for now. 
Set to an arbitrary number - override def thresholdForEmbeddingScores: Double = 0.001 - - val minNumFollowers = 100 - - override def prepareNounToUserMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseMatrix[(TopicId, Language), UserId, Double] = { - implicit val inj: Injection[(TopicId, Language), Array[Byte]] = - Bufferable.injectionOf[(TopicId, Language)] - - val topicLangUsers = ExternalDataSources.topicFollowGraphSource - .map { case (topic, user) => (user, topic) } - .join(ExternalDataSources.userSource) - .map { - case (user, (topic, (_, language))) => - ((topic, language), user, 1.0) - } - .forceToDisk - - val validTopicLang = - SparseMatrix(topicLangUsers).rowNnz.filter { - case (_, nzCount) => nzCount >= minNumFollowers - }.keys - - SparseMatrix[(TopicId, Language), UserId, Double](topicLangUsers).filterRows(validTopicLang) - } - - override def prepareUserToClusterMatrix( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): SparseRowMatrix[UserId, ClusterId, Double] = - SparseRowMatrix( - InterestedInSources - .simClustersInterestedInSource(modelVersion, dateRange, timeZone) - .map { - case (userId, clustersUserIsInterestedIn) => - userId -> clustersUserIsInterestedIn.clusterIdToScores - .map { - case (clusterId, scores) => - clusterId -> scoreExtractor(scores) - } - .filter(_._2 > 0.0) - .toMap - }, - isSkinnyMatrix = true - ) - - override def writeNounToClustersIndex( - output: TypedPipe[((TopicId, Language), Seq[(ClusterId, Double)])] - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val topicEmbeddingCount = Stat(s"topic_embedding_count") - val user = System.getenv("USER") - val parquetExec = output - .map { - case ((entityId, language), clustersWithScores) => - TfgTopicEmbeddings( - TID( - entityId = entityId, - language = Some(language), - ), - clusterScore = clustersWithScores.map { - case (clusterId, score) => ClustersScore(clusterId, score) - } - ) - } - .writeDALSnapshotExecution( - parquetDataSource, - D.Daily, - D.Suffix( - EmbeddingUtil.getHdfsPath( - isAdhoc = isAdhoc, - isManhattanKeyVal = false, - modelVersion, - pathSuffix + "/snapshot")), - D.Parquet, - dateRange.end - ) - - val tsvExec = - output - .map { - case ((entityId, language), clustersWithScores) => - (entityId, language, clustersWithScores.mkString(";")) - } - .shard(10) - .writeExecution(TypedTsv[(TopicId, Language, String)]( - s"/user/$user/adhoc/topic_embedding/$pathSuffix/$ModelVersionPathMap($modelVersion)")) - - val keyValExec = output - .flatMap { - case ((entityId, lang), clustersWithScores) => - topicEmbeddingCount.inc() - val embedding = SimClustersEmbedding(clustersWithScores).toThrift - Seq( - KeyVal( - SimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang)) - ), - embedding - ), - KeyVal( - SimClustersEmbeddingId( - embeddingType, - modelVersion, - InternalId.TopicId(TID(entityId, Some(lang), country = None)) - ), - embedding - ), - ) - } - .writeDALVersionedKeyValExecution( - embeddingSource, - D.Suffix( - EmbeddingUtil - .getHdfsPath(isAdhoc = isAdhoc, isManhattanKeyVal = true, modelVersion, pathSuffix)) - ) - if (isAdhoc) - Execution.zip(tsvExec, keyValExec, parquetExec).unit - else - Execution.zip(keyValExec, parquetExec).unit - } - - override def writeClusterToNounsIndex( - output: TypedPipe[(ClusterId, Seq[((TopicId, Language), Double)])] - )( - implicit dateRange: DateRange, - timeZone: 
TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - Execution.unit // do not need this - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/BUILD.bazel deleted file mode 100644 index 86124f1ff..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/BUILD.bazel +++ /dev/null @@ -1,166 +0,0 @@ -scala_library( - sources = [ - "*.scala", - "common/*.scala", - ], - compiler_option_sets = ["fatal_warnings"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/common/clustering", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:aggregatable_producer_simclusters_embeddings_by_log_fav_score-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:clusters_members_connected_components_ape_similarity-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:clusters_members_largest_dim_ape_similarity-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:clusters_members_largest_dim_ape_similarity_2_day_update-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:clusters_members_louvain_ape_similarity-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:interested_in_twice_by_largest_dim-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:interested_in_twice_by_largest_dim_2_day_update-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:interested_in_twice_by_largest_dim_fav_score-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:interested_in_twice_connected_components-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:interested_in_twice_louvain-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:user_user_normalized_graph-scala", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/embedding", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - ], -) - -# ======================== -# ADHOC JOB CONFIGURATIONS -# Note: Please change mapreduce.job.reduces and --num-reducers together. 
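# (For example, each adhoc target below sets ("mapreduce.job.reduces", "4000") in hadoop_properties and passes "--num-reducers 4000" in args; keep the two values in sync when retuning.)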
-# ======================== -scalding_job( - name = "interested_in_twice_largest_dim-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.twice.InterestedInTwiceLargestDimAdhocApp", - args = [ - "--date 2021-08-31", - "--num-reducers 4000", - ], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - hadoop_cluster = "qus1-bluebird", - hadoop_properties = [ - ("mapreduce.job.reduce.slowstart.completedmaps", "1.0"), - ("scalding.with.reducers.set.explicitly", "true"), - ("mapreduce.job.reduces", "4000"), - ("mapreduce.task.timeout", "0"), - ], - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":twice"], -) - -scalding_job( - name = "interested_in_twice_largest_dim_fav_score-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.twice.InterestedInTwiceLargestDimMaxFavScoreAdhocApp", - args = [ - "--date 2022-07-01", - "--num-reducers 4000", - ], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - hadoop_cluster = "qus1-bluebird", - hadoop_properties = [ - ("mapreduce.job.reduce.slowstart.completedmaps", "1.0"), - ("scalding.with.reducers.set.explicitly", "true"), - ("mapreduce.job.reduces", "4000"), - ("mapreduce.task.timeout", "0"), - ], - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":twice"], -) - -scalding_job( - name = "interested_in_twice_louvain-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.twice.InterestedInTwiceLouvainAdhocApp", - args = [ - "--date 2021-08-31", - "--num-reducers 4000", - "--cosine_similarity_threshold 0.5", - ], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - hadoop_cluster = "qus1-bluebird", - hadoop_properties = [ - ("mapreduce.job.reduce.slowstart.completedmaps", "1.0"), - ("scalding.with.reducers.set.explicitly", "true"), - ("mapreduce.job.reduces", "4000"), - ("mapreduce.task.timeout", "0"), - ], - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":twice"], -) - -scalding_job( - name = "interested_in_twice_connected_components-adhoc", - main = "com.twitter.simclusters_v2.scalding.embedding.twice.InterestedInTwiceConnectedComponentsAdhocApp", - args = [ - "--date 2021-08-31", - "--num-reducers 4000", - "--cosine_similarity_threshold 0.5", - ], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - hadoop_cluster = "qus1-bluebird", - hadoop_properties = [ - ("mapreduce.job.reduce.slowstart.completedmaps", "1.0"), - ("scalding.with.reducers.set.explicitly", "true"), - ("mapreduce.job.reduces", "4000"), - ("mapreduce.task.timeout", "0"), - ], - platform = "java8", - role = "cassowary", - runtime_platform 
= "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":twice"], -) - -# ============================ -# SCHEDULED JOB CONFIGURATIONS -# Twice jobs have been descheduled -# ============================ diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwice.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwice.scala deleted file mode 100644 index 5669f8bbd..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwice.scala +++ /dev/null @@ -1,454 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.twice - -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import com.twitter.scalding.Execution -import com.twitter.scalding.RichDate -import com.twitter.scalding.UniqueID -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.clustering.ConnectedComponentsClusteringMethod -import com.twitter.simclusters_v2.common.clustering.LargestDimensionClusteringMethod -import com.twitter.simclusters_v2.common.clustering.LouvainClusteringMethod -import com.twitter.simclusters_v2.common.clustering.MedoidRepresentativeSelectionMethod -import com.twitter.simclusters_v2.common.clustering.MaxFavScoreRepresentativeSelectionMethod -import com.twitter.simclusters_v2.common.clustering.SimilarityFunctions -import com.twitter.simclusters_v2.hdfs_sources.ClustersMembersConnectedComponentsApeSimilarityScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.ClustersMembersLargestDimApeSimilarity2DayUpdateScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.ClustersMembersLargestDimApeSimilarityScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.ClustersMembersLouvainApeSimilarityScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.InterestedInTwiceByLargestDim2DayUpdateScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.InterestedInTwiceByLargestDimScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.InterestedInTwiceByLargestDimFavScoreScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.InterestedInTwiceConnectedComponentsScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.InterestedInTwiceLouvainScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.twice.InterestedInTwiceBaseApp.ProducerEmbeddingSource -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - To build & deploy the TWICE scheduled jobs via workflows: - - scalding workflow upload \ - --workflow interested_in_twice-batch \ - --jobs src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_largest_dim-batch,src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_louvain-batch,src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_connected_components-batch \ - --scm-paths "src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/*" \ - --autoplay \ - - -> See workflow here: https://workflows.twitter.biz/workflow/cassowary/interested_in_twice-batch - - (Use `scalding workflow upload --help` for a breakdown of the different flags) - */*/ - -object InterestedInTwiceLargestDimScheduledApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with ScheduledExecutionApp { - - override def firstTime: 
RichDate = RichDate("2021-09-02") - override def batchIncrement: Duration = Days(7) - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersMatchingLargestDimension - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. - */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runScheduledApp( - new LargestDimensionClusteringMethod(), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_by_largest_dim", - "clusters_members_largest_dim_ape_similarity", - InterestedInTwiceByLargestDimScalaDataset, - ClustersMembersLargestDimApeSimilarityScalaDataset, - args.getOrElse("num-reducers", "4000").toInt - ) - - } - -} - -object InterestedInTwiceLargestDimMaxFavScoreScheduledApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2022-06-30") - override def batchIncrement: Duration = Days(7) - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersMatchingLargestDimension - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. - */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runScheduledApp( - new LargestDimensionClusteringMethod(), - new MaxFavScoreRepresentativeSelectionMethod[SimClustersEmbedding](), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_by_largest_dim_fav_score", - "clusters_members_largest_dim_ape_similarity", - InterestedInTwiceByLargestDimFavScoreScalaDataset, - ClustersMembersLargestDimApeSimilarityScalaDataset, - args.getOrElse("num-reducers", "4000").toInt - ) - - } - -} - -object InterestedInTwiceLouvainScheduledApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2021-09-02") - override def batchIncrement: Duration = Days(7) - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. 
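 * (This variant clusters aggregatable producer embeddings with LouvainClusteringMethod, configured by --cosine_similarity_threshold and an optional --resolution_factor, and represents each cluster by its medoid under cosine similarity.)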
- */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runScheduledApp( - new LouvainClusteringMethod( - args.required("cosine_similarity_threshold").toDouble, - args.optional("resolution_factor").map(_.toDouble)), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_louvain", - "clusters_members_louvain_ape_similarity", - InterestedInTwiceLouvainScalaDataset, - ClustersMembersLouvainApeSimilarityScalaDataset, - args.getOrElse("num-reducers", "4000").toInt - ) - - } - -} - -object InterestedInTwiceConnectedComponentsScheduledApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2021-09-02") - override def batchIncrement: Duration = Days(7) - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. - */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runScheduledApp( - new ConnectedComponentsClusteringMethod( - args.required("cosine_similarity_threshold").toDouble), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_connected_components", - "clusters_members_connected_components_ape_similarity", - InterestedInTwiceConnectedComponentsScalaDataset, - ClustersMembersConnectedComponentsApeSimilarityScalaDataset, - args.getOrElse("num-reducers", "4000").toInt - ) - - } - -} - -/** Production Scalding job that calculates TWICE embeddings in a shorter period (every two days). - * - * Given that the input sources of TWICE are updated more frequently (e.g., user_user_graph is - * updated every 2 day), updating TWICE embedding every 2 day will better capture interests of new - * users and the interest shift of existing users. - * - * To build & deploy the scheduled job via workflows: - * {{{ - * scalding workflow upload \ - * --workflow interested_in_twice_2_day_update-batch \ - * --jobs src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_largest_dim_2_day_update-batch \ - * --scm-paths "src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/*" \ - * --autoplay - * }}} - * - */*/ -object InterestedInTwiceLargestDim2DayUpdateScheduledApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2022-04-06") - override def batchIncrement: Duration = Days(2) - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersMatchingLargestDimension - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. 
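 * (This variant clusters producer embeddings with ConnectedComponentsClusteringMethod, configured by --cosine_similarity_threshold, and again uses the cosine-similarity medoid as each cluster's representative.)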
- */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runScheduledApp( - new LargestDimensionClusteringMethod(), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_by_largest_dim_2_day_update", - "clusters_members_largest_dim_ape_similarity_2_day_update", - InterestedInTwiceByLargestDim2DayUpdateScalaDataset, - ClustersMembersLargestDimApeSimilarity2DayUpdateScalaDataset, - args.getOrElse("num-reducers", "4000").toInt - ) - } -} - -/** - -[Preferred way] To run a locally built adhoc job: - ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_-adhoc - scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_-adhoc - -To build and run a adhoc job with workflows: - scalding workflow upload \ - --workflow interested_in_twice-adhoc \ - --jobs src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_largest_dim-adhoc,src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_louvain-adhoc,src/scala/com/twitter/simclusters_v2/scalding/embedding/twice:interested_in_twice_connected_components-adhoc \ - --scm-paths "src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/*" \ - --autoplay \ - - */*/ -object InterestedInTwiceLargestDimAdhocApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with AdhocExecutionApp { - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersMatchingLargestDimension - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. - */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runAdhocApp( - new LargestDimensionClusteringMethod(), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_by_largest_dim", - "clusters_members_largest_dim_ape_similarity", - args.getOrElse("num-reducers", "4000").toInt - ) - - } -} - -object InterestedInTwiceLargestDimMaxFavScoreAdhocApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with AdhocExecutionApp { - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersMatchingLargestDimension - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. 
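`MedoidRepresentativeSelectionMethod` and `MaxFavScoreRepresentativeSelectionMethod` are likewise defined in `simclusters_v2/common/clustering` rather than in this file. A hedged sketch of the two ways a cluster representative can be chosen; the `Neighbor` shape and names below are illustrative only:

```scala
// Illustrative only; the real selection methods are defined elsewhere in simclusters_v2.
object RepresentativeSelectionSketch {
  final case class Neighbor(id: Long, favScore: Double)

  // Medoid: the member that is, in aggregate, most similar to every other member.
  def medoid[T](
    members: Set[Neighbor],
    embeddings: Map[Long, T],
    similarity: (T, T) => Double
  ): Long =
    members.maxBy { m =>
      (members - m).toSeq.map(o => similarity(embeddings(m.id), embeddings(o.id))).sum
    }.id

  // Max fav score: the member the consumer has faved the most heavily.
  def maxFavScore(members: Set[Neighbor]): Long = members.maxBy(_.favScore).id

  def main(args: Array[String]): Unit = {
    val members = Set(Neighbor(1L, 0.2), Neighbor(2L, 5.0), Neighbor(3L, 1.0))
    val embeddings = Map(1L -> Seq(1.0, 0.0), 2L -> Seq(0.8, 0.2), 3L -> Seq(0.9, 0.1))
    def dot(a: Seq[Double], b: Seq[Double]): Double = a.zip(b).map { case (x, y) => x * y }.sum
    println(medoid(members, embeddings, dot)) // 1: most similar, in aggregate, to the others
    println(maxFavScore(members))             // 2: highest favScore
  }
}
```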
- */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runAdhocApp( - new LargestDimensionClusteringMethod(), - new MaxFavScoreRepresentativeSelectionMethod[SimClustersEmbedding](), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_by_largest_dim_fav_score", - "clusters_members_largest_dim_ape_similarity", - args.getOrElse("num-reducers", "4000").toInt - ) - - } -} - -object InterestedInTwiceLouvainAdhocApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with AdhocExecutionApp { - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. - */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runAdhocApp( - new LouvainClusteringMethod( - args.required("cosine_similarity_threshold").toDouble, - args.optional("resolution_factor").map(_.toDouble)), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_louvain", - "clusters_members_louvain_ape_similarity", - args.getOrElse("num-reducers", "4000").toInt - ) - - } -} - -object InterestedInTwiceConnectedComponentsAdhocApp - extends InterestedInTwiceBaseApp[SimClustersEmbedding] - with AdhocExecutionApp { - - override def producerProducerSimilarityFnForClustering: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - override def producerProducerSimilarityFnForClusterRepresentative: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Double = - SimilarityFunctions.simClustersCosineSimilarity - - /** - * Top-level method of this application. 
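Not part of this file: a minimal standalone illustration of how the clustering thresholds consumed above are read from Scalding `Args` (the argument values below are made up):

```scala
import com.twitter.scalding.Args

// Small illustration of the required / optional / defaulted argument reads used by these apps.
object ArgsSketch {
  def main(argv: Array[String]): Unit = {
    val args = Args("--cosine_similarity_threshold 0.7 --resolution_factor 1.5 --num-reducers 4000")
    val threshold: Double = args.required("cosine_similarity_threshold").toDouble // mandatory
    val resolution: Option[Double] = args.optional("resolution_factor").map(_.toDouble) // optional
    val reducers: Int = args.getOrElse("num-reducers", "4000").toInt // falls back to 4000
    println((threshold, resolution, reducers))
  }
}
```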
- */ - def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - runAdhocApp( - new ConnectedComponentsClusteringMethod( - args.required("cosine_similarity_threshold").toDouble), - new MedoidRepresentativeSelectionMethod[SimClustersEmbedding]( - producerProducerSimilarityFnForClusterRepresentative), - ProducerEmbeddingSource.getAggregatableProducerEmbeddings, - "interested_in_twice_connected_components", - "clusters_members_connected_components_ape_similarity", - args.getOrElse("num-reducers", "4000").toInt - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwiceBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwiceBaseApp.scala deleted file mode 100644 index 585f23630..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/embedding/twice/InterestedInTwiceBaseApp.scala +++ /dev/null @@ -1,495 +0,0 @@ -package com.twitter.simclusters_v2.scalding.embedding.twice - -import com.twitter.bijection.Injection -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Execution -import com.twitter.scalding.Stat -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.common.clustering.ClusteringMethod -import com.twitter.simclusters_v2.common.clustering.ClusteringStatistics._ -import com.twitter.simclusters_v2.common.clustering.ClusterRepresentativeSelectionMethod -import com.twitter.simclusters_v2.common.clustering.ClusterRepresentativeSelectionStatistics._ -import com.twitter.simclusters_v2.hdfs_sources.ProducerEmbeddingSources -import com.twitter.simclusters_v2.hdfs_sources.UserUserGraphScalaDataset -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.MultiEmbeddingType -import com.twitter.simclusters_v2.thriftscala.NeighborWithWeights -import com.twitter.simclusters_v2.thriftscala.OrderedClustersAndMembers -import com.twitter.simclusters_v2.thriftscala.ClusterMembers -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingIdWithScore -import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding -import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding.Ids -import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingByIds -import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersEmbeddingId => SimClustersEmbeddingIdThrift -} -import com.twitter.util.Stopwatch -import java.util.TimeZone -import scala.util.Random.shuffle - -/** - * Base app for computing User InterestedIn multi-embedding 
representation. - * TWICE: Capturing users’ long-term interests using multiple SimClusters embeddings. - * This job will - * - Randomly select K follow/fav actions for each user, - * - cluster the follow/fav actions for each user, - * - for each cluster, construct a representation (e.g. average or medoid). - * - * @tparam T type of producer embedding. e.g. SimClustersEmbedding - */ -trait InterestedInTwiceBaseApp[T] { - - import InterestedInTwiceBaseApp._ - - def modelVersion: ModelVersion = ModelVersion.Model20m145k2020 - - /** - * function to output similarity (>=0, the larger, more similar), given two producer embeddings. - */ - def producerProducerSimilarityFnForClustering: (T, T) => Double - def producerProducerSimilarityFnForClusterRepresentative: (T, T) => Double - - // Sort clusters by decreasing size, fall back to entity ID to break tie - val clusterOrdering: Ordering[Set[Long]] = math.Ordering.by(c => (-c.size, c.min)) - - /** - * Read user-user graph. - */ - def getUserUserGraph( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[UserAndNeighbors] = { - DAL - .readMostRecentSnapshot( - UserUserGraphScalaDataset - ) - .withRemoteReadPolicy(AllowCrossDC) - .toTypedPipe - } - - /** - * Randomly select up to maxNeighborsByUser neighbors for each user. - * Attempts to equally sample both follow and fav edges (e.g. maxNeighborsByUser/2 for each). - * However, if one type of edge is insufficient, backfill with other type up to maxNeighborsByUser neighbours. - * @param userUserGraph User-User follow/fav graph. - * @param maxNeighborsByUser How many neighbors to keep for each user. - */ - def selectMaxProducersPerUser( - userUserGraph: TypedPipe[UserAndNeighbors], - maxNeighborsByUser: Int = MaxNeighborsByUser - )( - implicit uniqueID: UniqueID - ): TypedPipe[UserAndNeighbors] = { - - val numOfFollowEdgesStat = Stat(StatNumOfFollowEdges) - val numOfFavEdgesStat = Stat(StatNumOfFavEdges) - val numOfEdgesCumulativeFrequencyBeforeFilter = Util.CumulativeStat( - StatCFNumProducersPerConsumerBeforeFilter, - StatCFNumProducersPerConsumerBeforeFilterBuckets) - - userUserGraph.map { userAndNeighbors: UserAndNeighbors => - numOfEdgesCumulativeFrequencyBeforeFilter.incForValue(userAndNeighbors.neighbors.size) - - val (followEdges, favEdges) = - userAndNeighbors.neighbors.partition(_.isFollowed.contains(true)) - val randomFollowEdges = shuffle(followEdges) - val randomFavEdges = shuffle(favEdges) - - // interleave follow and fav edges, and select top k - val interleavedTopKEdges: Seq[NeighborWithWeights] = randomFollowEdges - .map(Some(_)) - .zipAll( - randomFavEdges.map(Some(_)), - None, - None - ) // default None value when one edge Seq is shorter than another - .flatMap { - case (followEdgeOpt, favEdgeOpt) => - Seq(followEdgeOpt, favEdgeOpt) - }.flatten - .take(maxNeighborsByUser) - - // edge stats - interleavedTopKEdges - .foreach { edge => - if (edge.isFollowed.contains(true)) numOfFollowEdgesStat.inc() - else numOfFavEdgesStat.inc() - } - - userAndNeighbors.copy(neighbors = interleavedTopKEdges) - } - } - - /** - * Get multi embedding for each user: - * - For each user, join their follow / fav - based neighbors to producer embeddings, - * - Group these neighbors into clusters using the specified clusteringMethod, - * - For each cluster, select the medoid as the representation. - * - * @param userUserGraph User-User follow/fav graph. - * @param producerEmbedding producer embedding dataset. e.g. simclusters embeddings, simhash, etc. 
- * @param clusteringMethod A method to group embeddings together. - * @param maxClustersPerUser How many clusters to keep per user. - * @param clusterRepresentativeSelectionMethod A method to select a cluster representative. - * @param numReducers How many reducers to use for sketch operation. - */ - def getMultiEmbeddingPerUser( - userUserGraph: TypedPipe[UserAndNeighbors], - producerEmbedding: TypedPipe[(UserId, T)], - clusteringMethod: ClusteringMethod, - maxClustersPerUser: Int = MaxClustersPerUser, - clusterRepresentativeSelectionMethod: ClusterRepresentativeSelectionMethod[T], - numReducers: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(UserId, Seq[Set[UserId]], SimClustersMultiEmbedding)] = { - - val truncatedUserUserGraph: TypedPipe[UserAndNeighbors] = selectMaxProducersPerUser( - userUserGraph) - val validEdges: TypedPipe[(UserId, NeighborWithWeights)] = - truncatedUserUserGraph.flatMap { - case UserAndNeighbors(srcId, neighborsWithWeights) => - neighborsWithWeights.map { neighborWithWeights => - ( - neighborWithWeights.neighborId, // producerId - neighborWithWeights.copy(neighborId = srcId)) - } - } - - implicit val l2b: UserId => Array[Byte] = Injection.long2BigEndian - - val totalEdgesNonEmptyProducerEmbeddingsStat = Stat(StatTotalEdgesNonEmptyProducerEmbeddings) - val userClusterPairsBeforeTruncation = Stat(StatNumUserClusterPairsBeforeTruncation) - val userClusterPairsAfterTruncation = Stat(StatNumUserClusterPairsAfterTruncation) - val numUsers = Stat(StatNumUsers) - val numOfClustersCumulativeFrequencyBeforeFilter = - Util.CumulativeStat(StatCFNumOfClustersBeforeFilter, StatCFNumOfClustersBeforeFilterBuckets) - - // map each clustering statistic to a scalding.Stat - val clusteringStatsMap: Map[String, Stat] = Map( - StatSimilarityGraphTotalBuildTime -> Stat(StatSimilarityGraphTotalBuildTime), - StatClusteringAlgorithmRunTime -> Stat(StatClusteringAlgorithmRunTime), - StatMedoidSelectionTime -> Stat(StatMedoidSelectionTime) - ) - val cosineSimilarityCumulativeFrequencyBeforeFilter = Util.CumulativeStat( - StatCFCosineSimilarityBeforeFilter, - StatCFCosineSimilarityBeforeFilterBuckets) - - val clusterRepresentativeSelectionTime = Stat(StatClusterRepresentativeSelectionTime) - - validEdges - .sketch(numReducers) - .join(producerEmbedding) - .map { - case (producerId: UserId, (srcWithWeights: NeighborWithWeights, embedding)) => - totalEdgesNonEmptyProducerEmbeddingsStat.inc() - (srcWithWeights.neighborId, (srcWithWeights.copy(neighborId = producerId), embedding)) - } - .group - .toList - .map { - case (userId: UserId, embeddings: Seq[(NeighborWithWeights, T)]) => - numUsers.inc() - val embeddingsMap: Map[Long, T] = embeddings.map { - case (n: NeighborWithWeights, e) => (n.neighborId, e) - }.toMap - val weightsMap: Map[Long, NeighborWithWeights] = embeddings.map { - case (n: NeighborWithWeights, _) => (n.neighborId, n) - }.toMap - // 1. Cluster embeddings - val clusters: Set[Set[UserId]] = - clusteringMethod - .cluster[T]( - embeddingsMap, - producerProducerSimilarityFnForClustering, - // Map.get() returns an Option, so will not throw. - // Use .foreach() to filter out potential Nones. - (name, incr) => { - clusteringStatsMap.get(name).foreach(ctr => ctr.incBy(incr)) - if (name == StatComputedSimilarityBeforeFilter) - cosineSimilarityCumulativeFrequencyBeforeFilter.incForValue(incr) - } - ) - - // 2. Sort clusters - val sortedClusters: Seq[Set[UserId]] = clusters.toSeq.sorted(clusterOrdering) - - // 3. 
Keep only a max number of clusters (avoid OOM) - userClusterPairsBeforeTruncation.incBy(sortedClusters.size) - numOfClustersCumulativeFrequencyBeforeFilter.incForValue(sortedClusters.size) - val truncatedClusters = sortedClusters.take(maxClustersPerUser) - userClusterPairsAfterTruncation.incBy(truncatedClusters.size) - - // 4. Get list of cluster representatives - val truncatedIdWithScoreList: Seq[SimClustersEmbeddingIdWithScore] = - truncatedClusters.map { members: Set[UserId] => - val clusterRepresentationSelectionElapsed = Stopwatch.start() - val medoid: UserId = clusterRepresentativeSelectionMethod.selectClusterRepresentative( - members.map(id => weightsMap(id)), - embeddingsMap) - clusterRepresentativeSelectionTime.incBy( - clusterRepresentationSelectionElapsed().inMilliseconds) - - SimClustersEmbeddingIdWithScore( - id = SimClustersEmbeddingIdThrift( - EmbeddingType.TwiceUserInterestedIn, - modelVersion, - InternalId.UserId(medoid)), - score = members.size) - } - - ( - userId, - sortedClusters, - SimClustersMultiEmbedding.Ids( - SimClustersMultiEmbeddingByIds(ids = truncatedIdWithScoreList))) - } - } - - /** - * Write the output to disk as a TypedTsv. - */ - def writeOutputToTypedTSV( - output: TypedPipe[(UserId, Seq[Set[UserId]], SimClustersMultiEmbedding)], - userToClusterRepresentativesIndexOutputPath: String, - userToClusterMembersIndexOutputPath: String - ): Execution[(Unit, Unit)] = { - - // write the user -> cluster representatives index - val writeClusterRepresentatives = output - .collect { - case (userId: Long, _, Ids(ids)) => (userId, ids.ids) - } - //.shard(partitions = 1) - .writeExecution(TypedTsv[(UserId, Seq[SimClustersEmbeddingIdWithScore])]( - userToClusterRepresentativesIndexOutputPath)) - - // write the user -> cluster members index - val writeClusterMembers = output - .collect { - case (userId: Long, clusters: Seq[Set[UserId]], _) => (userId, clusters) - } - //.shard(partitions = 1) - .writeExecution(TypedTsv[(UserId, Seq[Set[UserId]])](userToClusterMembersIndexOutputPath)) - - Execution.zip(writeClusterRepresentatives, writeClusterMembers) - - } - - /** - * Write the output to disk as a KeyValDataset. 
- */ - def writeOutputToKeyValDataset( - output: TypedPipe[(UserId, Seq[Set[UserId]], SimClustersMultiEmbedding)], - embeddingType: MultiEmbeddingType, - userToClusterRepresentativesIndexDataset: KeyValDALDataset[ - KeyVal[SimClustersMultiEmbeddingId, SimClustersMultiEmbedding] - ], - userToClusterMembersIndexDataset: KeyValDALDataset[KeyVal[UserId, OrderedClustersAndMembers]], - userToClusterRepresentativesIndexOutputPath: String, - userToClusterMembersIndexOutputPath: String - )( - implicit dateRange: DateRange - ): Execution[(Unit, Unit)] = { - // write the user -> cluster representatives index - val writeClusterRepresentatives = output - .map { - case (userId: UserId, _, embeddings: SimClustersMultiEmbedding) => - KeyVal( - key = SimClustersMultiEmbeddingId( - embeddingType = embeddingType, - modelVersion = modelVersion, - internalId = InternalId.UserId(userId) - ), - value = embeddings - ) - } - .writeDALVersionedKeyValExecution( - userToClusterRepresentativesIndexDataset, - D.Suffix(userToClusterRepresentativesIndexOutputPath), - ExplicitEndTime(dateRange.end) - ) - - // write the user -> cluster members index - val writeClusterMembers = output - .map { - case (userId: UserId, clusters: Seq[Set[UserId]], _) => - KeyVal( - key = userId, - value = OrderedClustersAndMembers(clusters, Some(clusters.map(ClusterMembers(_))))) - } - .writeDALVersionedKeyValExecution( - userToClusterMembersIndexDataset, - D.Suffix(userToClusterMembersIndexOutputPath), - ExplicitEndTime(dateRange.end) - ) - - Execution.zip(writeClusterRepresentatives, writeClusterMembers) - } - - /** - * Main method for scheduled jobs. - */ - def runScheduledApp( - clusteringMethod: ClusteringMethod, - clusterRepresentativeSelectionMethod: ClusterRepresentativeSelectionMethod[T], - producerEmbedding: TypedPipe[(UserId, T)], - userToClusterRepresentativesIndexPathSuffix: String, - userToClusterMembersIndexPathSuffix: String, - userToClusterRepresentativesIndexDataset: KeyValDALDataset[ - KeyVal[SimClustersMultiEmbeddingId, SimClustersMultiEmbedding] - ], - userToClusterMembersIndexDataset: KeyValDALDataset[KeyVal[UserId, OrderedClustersAndMembers]], - numReducers: Int - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - val userToClusterRepresentativesIndexOutputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = userToClusterRepresentativesIndexPathSuffix - ) - - val userToClusterMembersIndexOutputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = false, - isManhattanKeyVal = true, - modelVersion = modelVersion, - pathSuffix = userToClusterMembersIndexPathSuffix - ) - - val execution = Execution.withId { implicit uniqueId => - val output: TypedPipe[(UserId, Seq[Set[UserId]], SimClustersMultiEmbedding)] = - getMultiEmbeddingPerUser( - userUserGraph = getUserUserGraph(dateRange.prepend(Days(30)), implicitly), - producerEmbedding = producerEmbedding, - clusteringMethod = clusteringMethod, - clusterRepresentativeSelectionMethod = clusterRepresentativeSelectionMethod, - numReducers = numReducers - ) - - writeOutputToKeyValDataset( - output = output, - embeddingType = MultiEmbeddingType.TwiceUserInterestedIn, - userToClusterRepresentativesIndexDataset = userToClusterRepresentativesIndexDataset, - userToClusterMembersIndexDataset = userToClusterMembersIndexDataset, - userToClusterRepresentativesIndexOutputPath = userToClusterRepresentativesIndexOutputPath, - 
userToClusterMembersIndexOutputPath = userToClusterMembersIndexOutputPath - ) - - } - - execution.unit - } - - /** - * Main method for adhoc jobs. - */ - def runAdhocApp( - clusteringMethod: ClusteringMethod, - clusterRepresentativeSelectionMethod: ClusterRepresentativeSelectionMethod[T], - producerEmbedding: TypedPipe[(UserId, T)], - userToClusterRepresentativesIndexPathSuffix: String, - userToClusterMembersIndexPathSuffix: String, - numReducers: Int - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueId: UniqueID - ): Execution[Unit] = { - - val userToClusterRepresentativesIndexOutputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = userToClusterRepresentativesIndexPathSuffix - ) - - val userToClusterMembersIndexOutputPath: String = EmbeddingUtil.getHdfsPath( - isAdhoc = true, - isManhattanKeyVal = false, - modelVersion = modelVersion, - pathSuffix = userToClusterMembersIndexPathSuffix - ) - - val execution = Execution.withId { implicit uniqueId => - val output: TypedPipe[(UserId, Seq[Set[UserId]], SimClustersMultiEmbedding)] = - getMultiEmbeddingPerUser( - userUserGraph = getUserUserGraph(dateRange.prepend(Days(30)), implicitly), - producerEmbedding = producerEmbedding, - clusteringMethod = clusteringMethod, - clusterRepresentativeSelectionMethod = clusterRepresentativeSelectionMethod, - numReducers = numReducers - ) - - writeOutputToTypedTSV( - output, - userToClusterRepresentativesIndexOutputPath, - userToClusterMembersIndexOutputPath) - } - - execution.unit - } - -} - -object InterestedInTwiceBaseApp { - - // Statistics - val StatNumOfFollowEdges = "num_of_follow_edges" - val StatNumOfFavEdges = "num_of_fav_edges" - val StatTotalEdgesNonEmptyProducerEmbeddings = "total_edges_with_non_empty_producer_embeddings" - val StatNumUserClusterPairsBeforeTruncation = "num_user_cluster_pairs_before_truncation" - val StatNumUserClusterPairsAfterTruncation = "num_user_cluster_pairs_after_truncation" - val StatNumUsers = "num_users" - // Cumulative Frequency - val StatCFNumProducersPerConsumerBeforeFilter = "num_producers_per_consumer_cf_before_filter" - val StatCFNumProducersPerConsumerBeforeFilterBuckets: Seq[Double] = - Seq(0, 10, 20, 50, 100, 500, 1000) - val StatCFCosineSimilarityBeforeFilter = "cosine_similarity_cf_before_filter" - val StatCFCosineSimilarityBeforeFilterBuckets: Seq[Double] = - Seq(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100) - val StatCFNumOfClustersBeforeFilter = "num_of_clusters_cf_before_filter" - val StatCFNumOfClustersBeforeFilterBuckets: Seq[Double] = - Seq(1, 3, 5, 10, 15, 20, 50, 100, 200, 300, 500) - - val MaxClustersPerUser: Int = 10 - val MaxNeighborsByUser: Int = 500 - - object ProducerEmbeddingSource { - - /** - * Read log-fav based Aggregatable Producer embeddings dataset. 
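`Util.CumulativeStat` is not included in this diff, so its exact semantics are not visible here; one plausible reading of how the bucket lists above (e.g. `Seq(0, 10, 20, 50, 100, 500, 1000)`) get used is a cumulative-frequency counter along these lines:

```scala
// Assumed behaviour only, for intuition; the real Util.CumulativeStat may differ.
object CumulativeStatSketch {
  final class CumulativeStat(buckets: Seq[Double]) {
    private val counts = scala.collection.mutable.Map(buckets.map(_ -> 0L): _*)

    // Record a value: every bucket whose bound is >= value is incremented, so
    // counts(b) ends up approximating "number of observations <= b".
    def incForValue(value: Double): Unit =
      buckets.filter(_ >= value).foreach(b => counts(b) += 1L)

    def snapshot: Seq[(Double, Long)] = buckets.map(b => b -> counts(b))
  }

  def main(args: Array[String]): Unit = {
    val stat = new CumulativeStat(Seq[Double](0, 10, 20, 50, 100, 500, 1000))
    Seq(3.0, 12.0, 12.0, 75.0, 600.0).foreach(stat.incForValue)
    println(stat.snapshot) // (0,0), (10,1), (20,3), (50,3), (100,4), (500,4), (1000,5)
  }
}
```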
- */ - def getAggregatableProducerEmbeddings( - implicit dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(UserId, SimClustersEmbedding)] = - ProducerEmbeddingSources - .producerEmbeddingSource( - EmbeddingType.AggregatableLogFavBasedProducer, - ModelVersion.Model20m145k2020)(dateRange.prepend(Days(30))) - .mapValues(s => SimClustersEmbedding(s)) - - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/BUILD.bazel deleted file mode 100644 index 7615fcf43..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/BUILD.bazel +++ /dev/null @@ -1,72 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/offline_job", - ], -) - -hadoop_binary( - name = "tweet_evaluation_dummy_candidate_adhoc", - main = "com.twitter.simclusters_v2.scalding.DummyCandidateGenerationAdhocJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":evaluation", - ], -) - -hadoop_binary( - name = "tweet_evaluation_timelines_reference_adhoc", - main = "com.twitter.simclusters_v2.scalding.evaluation.AdhocTimelinesDataExtraction", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":evaluation", - ], -) - -hadoop_binary( - name = "tweet_evaluation_timelines_reference_batch", - main = "com.twitter.simclusters_v2.scalding.evaluation.ScheduledTimelinesDataExtractionBatch", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":evaluation", - ], -) - -hadoop_binary( - name = "simcluster_offline_eval_adhoc", - main = "com.twitter.simclusters_v2.scalding.evaluation.SimClustersEvaluationAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":evaluation", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/CandidateEvaluationBase.scala b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/CandidateEvaluationBase.scala deleted file mode 100644 index 24195fff7..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/CandidateEvaluationBase.scala +++ /dev/null @@ -1,163 +0,0 @@ -package com.twitter.simclusters_v2.scalding.evaluation - -import com.twitter.core_workflows.user_model.thriftscala.CondensedUserState -import com.twitter.core_workflows.user_model.thriftscala.UserState -import com.twitter.pluck.source.core_workflows.user_model.CondensedUserStateScalaDataset -import com.twitter.scalding._ -import com.twitter.scalding.source.TypedText -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.thriftscala.CandidateTweets -import com.twitter.simclusters_v2.thriftscala.ReferenceTweets -import scala.util.Random - -/** - * Helper functions to provide user samples by sampling across user states. 
- */ -object UserStateUserSampler { - def getSampleUsersByUserState( - userStateSource: TypedPipe[CondensedUserState], - validStates: Seq[UserState], - samplePercentage: Double - ): TypedPipe[(UserState, Long)] = { - assert(samplePercentage >= 0 && samplePercentage <= 1) - val validStateSet = validStates.toSet - - userStateSource - .collect { - case data if data.userState.isDefined && validStateSet.contains(data.userState.get) => - (data.userState.get, data.uid) - } - .filter(_ => Random.nextDouble() <= samplePercentage) - .forceToDisk - } - - /** - * Given a list of string corresponding to user states, convert them to the UserState type. - * If the input is empty, default to return all available user states - */ - def parseUserStates(strStates: Seq[String]): Seq[UserState] = { - if (strStates.isEmpty) { - UserState.list - } else { - strStates.map { str => - UserState - .valueOf(str).getOrElse( - throw new IllegalArgumentException( - s"Input user_states $str is invalid. Valid states are: " + UserState.list - ) - ) - } - } - } -} - -/** - * A variation of the evaluation base where target users are sampled by user states. - * For each user state of interest (e.x. HEAVY_TWEETER), we run a separate evaluation call, and - * output the evaluation results per user state. This is helpful when we want to horizontally - * compare how users in different user states respond to the candidate tweets. - */ -trait UserStateBasedEvaluationExecutionBase - extends CandidateEvaluationBase - with TwitterExecutionApp { - - def referenceTweets: TypedPipe[ReferenceTweets] - def candidateTweets: TypedPipe[CandidateTweets] - - override def job: Execution[Unit] = { - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - implicit val dateRange: DateRange = - DateRange.parse(args.list("date"))(DateOps.UTC, DateParser.default) - - val outputRootDir = args("outputDir") - val userStates: Seq[UserState] = - UserStateUserSampler.parseUserStates(args.list("user_states")) - val sampleRate = args.double("sample_rate") - - // For each user state we are interested in, run separate executions and write - // the output into individual sub directories - val userStateSource = DAL.read(CondensedUserStateScalaDataset).toTypedPipe - val userIdsByState = - UserStateUserSampler.getSampleUsersByUserState(userStateSource, userStates, sampleRate) - val executionsPerUserState = userStates.map { userState => - val sampleUsers = userIdsByState.collect { case data if data._1 == userState => data._2 } - val outputPath = outputRootDir + "/" + userState + "/" - - super - .runSampledEvaluation(sampleUsers, referenceTweets, candidateTweets) - .writeExecution(TypedText.csv(outputPath)) - } - // Run evaluation for each user state in parallel - Execution.sequence(executionsPerUserState).unit - } - } - } -} - -/** - * A basic flow for evaluating the quality of a set of candidate tweets, typically generated by an - * algorithm (ex. SimClusters), by comparing its engagement rates against a set of reference tweets - * The job goes through the following steps: - * 1. Generate a group of target users on which we measure tweet engagements - * 2. Collect tweets impressed by these users and their engagements on tweets from a labeled - * tweet source (ex. Home Timeline engagement data), and form a reference set - * 3. For each candidate tweet, collect the engagement rates from the reference set - * 4. Run evaluation calculations (ex. 
percentage of intersection, engagement rate, etc) - * - * Each sub class is expected to provide 3 sets of data sources, which are the sample users, - * candidate tweet sources, and reference tweet sources. - */ -trait CandidateEvaluationBase { - private def getSampledReferenceTweets( - referenceTweetEngagements: TypedPipe[ReferenceTweets], - sampleUsers: TypedPipe[Long] - ): TypedPipe[ReferenceTweets] = { - referenceTweetEngagements - .groupBy(_.targetUserId) - .join(sampleUsers.asKeys) - .map { case (targetUserId, (referenceEngagements, _)) => referenceEngagements } - } - - private def getSampledCandidateTweets( - candidateTweets: TypedPipe[CandidateTweets], - sampleUsers: TypedPipe[Long] - ): TypedPipe[CandidateTweets] = { - candidateTweets - .groupBy(_.targetUserId) - .join(sampleUsers.asKeys) - .map { case (_, (tweets, _)) => tweets } - } - - /** - * Evaluation function, should be overridden by implementing sub classes to suit individual - * objectives, such as like engagement rates, CRT, etc. - * @param sampledReference - * @param sampledCandidate - */ - def evaluateResults( - sampledReference: TypedPipe[ReferenceTweets], - sampledCandidate: TypedPipe[CandidateTweets] - ): TypedPipe[String] - - /** - * Given a list of target users, the reference tweet set, and the candidate tweet set, - * calculate the engagement rates on the reference set and the candidate set by these users. - * The evaluation result should be converted into an itemized format - * these users. - * @param referenceTweets - * @param candidateTweets - * @return - */ - def runSampledEvaluation( - targetUserSamples: TypedPipe[Long], - referenceTweets: TypedPipe[ReferenceTweets], - candidateTweets: TypedPipe[CandidateTweets] - ): TypedPipe[String] = { - val sampledCandidate = getSampledCandidateTweets(candidateTweets, targetUserSamples) - val referencePerUser = getSampledReferenceTweets(referenceTweets, targetUserSamples) - - evaluateResults(referencePerUser, sampledCandidate) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationMetricHelper.scala b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationMetricHelper.scala deleted file mode 100644 index 50bc36538..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationMetricHelper.scala +++ /dev/null @@ -1,540 +0,0 @@ -package com.twitter.simclusters_v2.scalding.evaluation - -import com.twitter.scalding.{Execution, TypedPipe, UniqueID} -import com.twitter.simclusters_v2.thriftscala.{ - CandidateTweet, - CandidateTweets, - ReferenceTweet, - ReferenceTweets, - TweetLabels -} -import com.twitter.algebird.Aggregator.size -import com.twitter.scalding.typed.{CoGrouped, ValuePipe} -import com.twitter.util.TwitterDateFormat -import java.util.Calendar - -/** - * Statistics about the number of users who have engaged with tweets - */ -case class UserEngagerCounts( - numDistinctTargetUsers: Long, - numDistinctLikeEngagers: Long, - numDistinctRetweetEngagers: Long) - -/** - * Tweet side statistics, e.x. number of tweets, authors, etc. 
- */ -case class TweetStats( - numTweets: Long, - numDistinctTweets: Long, - numDistinctAuthors: Option[Long], - avgScore: Option[Double]) - -/** - * Helper data container class for storing engagement counts - */ -case class TweetEngagementCounts(like: Long, retweet: Long, click: Long, hasEngagement: Long) - -/** - * Helper data container class for storing engagement rates - */ -case class TweetEngagementRates(like: Double, retweet: Double, click: Double, hasEngagement: Double) - -case class LabelCorrelations( - pearsonCoefficientForLikes: Double, - cosineSimilarityGlobal: Double, - cosineSimilarityPerUserAvg: Double) { - private val f = java.text.NumberFormat.getInstance - def format(): String = { - Seq( - s"\tPearson Coefficient: ${f.format(pearsonCoefficientForLikes)}", - s"\tCosine similarity: ${f.format(cosineSimilarityGlobal)}", - s"\tAverage cosine similarity for all users: ${f.format(cosineSimilarityPerUserAvg)}" - ).mkString("\n") - } -} - -/** - * Helper tweet data container that can hold both the reference label engagements as well as the - * recommendation algorithm's scores. Helpful for evaluating joint data - */ -case class LabeledTweet( - targetUserId: Long, - tweetId: Long, - authorId: Long, - labels: TweetLabels, - algorithmScore: Option[Double]) - -case class LabeledTweetsResults( - tweetStats: TweetStats, - userEngagerCounts: UserEngagerCounts, - tweetEngagementCounts: TweetEngagementCounts, - tweetEngagementRates: TweetEngagementRates, - labelCorrelations: Option[LabelCorrelations] = None) { - private val f = java.text.NumberFormat.getInstance - - def format(title: String = ""): String = { - val str = Seq( - s"Number of tweets: ${f.format(tweetStats.numTweets)}", - s"Number of distinct tweets: ${f.format(tweetStats.numDistinctTweets)}", - s"Number of distinct users targeted: ${f.format(userEngagerCounts.numDistinctTargetUsers)}", - s"Number of distinct authors: ${tweetStats.numDistinctAuthors.map(f.format).getOrElse("N/A")}", - s"Average algorithm score of tweets: ${tweetStats.avgScore.map(f.format).getOrElse("N/A")}", - s"Engager counts:", - s"\tNumber of users who liked tweets: ${f.format(userEngagerCounts.numDistinctLikeEngagers)}", - s"\tNumber of users who retweeted tweets: ${f.format(userEngagerCounts.numDistinctRetweetEngagers)}", - s"Tweet engagement counts:", - s"\tNumber of Likes: ${f.format(tweetEngagementCounts.like)}", - s"\tNumber of Retweets: ${f.format(tweetEngagementCounts.retweet)}", - s"\tNumber of Clicks: ${f.format(tweetEngagementCounts.click)}", - s"\tNumber of tweets with any engagements: ${f.format(tweetEngagementCounts.hasEngagement)}", - s"Tweet engagement rates:", - s"\tRate of Likes: ${f.format(tweetEngagementRates.like * 100)}%", - s"\tRate of Retweets: ${f.format(tweetEngagementRates.retweet * 100)}%", - s"\tRate of Clicks: ${f.format(tweetEngagementRates.click * 100)}%", - s"\tRate of any engagement: ${f.format(tweetEngagementRates.hasEngagement * 100)}%" - ).mkString("\n") - - val correlations = labelCorrelations.map("\n" + _.format()).getOrElse("") - - s"$title\n$str$correlations" - } -} - -case class CandidateResults(tweetStats: TweetStats, numDistinctTargetUsers: Long) { - private val f = java.text.NumberFormat.getInstance - - def format(title: String = ""): String = { - val str = Seq( - s"Number of tweets: ${f.format(tweetStats.numTweets)}", - s"Number of distinct tweets: ${f.format(tweetStats.numDistinctTweets)}", - s"Number of distinct users targeted: ${f.format(numDistinctTargetUsers)}", - s"Number of distinct authors: 
${tweetStats.numDistinctAuthors.map(f.format).getOrElse("N/A")}", - s"Average algorithm score of tweets: ${tweetStats.avgScore.map(f.format).getOrElse("N/A")}" - ).mkString("\n") - s"$title\n$str" - } -} - -/** - * Helper class for evaluating a given candidate tweet set against a reference tweet set. - * It provides aggregation evaluation metrics such as sum of engagements, rate of engagements, etc. - */ -object EvaluationMetricHelper { - private def toLong(bool: Boolean): Long = { - if (bool) 1L else 0L - } - - /** - * Core engagements are user actions that count towards core metrics, e.x. like, RT, etc - */ - private def hasCoreEngagements(labels: TweetLabels): Boolean = { - labels.isRetweeted || - labels.isLiked || - labels.isQuoted || - labels.isReplied - } - - /** - * Whether there are core engagements or click on the tweet - */ - private def hasCoreEngagementsOrClick(labels: TweetLabels): Boolean = { - hasCoreEngagements(labels) || labels.isClicked - } - - /** - * Return outer join of reference tweets and candidate tweets, keyed by (targetUserId, tweetId). - * The output of this can then be reused to fetch the inner join / left / right join, - * without having to redo the expensive join - * - * NOTE: Assumes the uniqueness of keys (i.e. (targetId, tweetId)). Make sure to dedup tweetIds - * for each targetId, otherwise .join() will yield duplicate results. - */ - def outerJoinReferenceAndCandidate( - referencePipe: TypedPipe[ReferenceTweets], - candidatePipe: TypedPipe[CandidateTweets] - ): CoGrouped[(Long, Long), (Option[ReferenceTweet], Option[CandidateTweet])] = { - - val references = referencePipe - .flatMap { refTweets => - refTweets.impressedTweets.map { refTweet => - ((refTweets.targetUserId, refTweet.tweetId), refTweet) - } - } - - val candidates = candidatePipe - .flatMap { candTweets => - candTweets.recommendedTweets.map { candTweet => - ((candTweets.targetUserId, candTweet.tweetId), candTweet) - } - } - - references.outerJoin(candidates).withReducers(50) - } - - /** - * Convert reference tweets to labeled tweets. 
We do this so that we can re-use the common - * metric calculations for labeled tweets on reference tweets - */ - def getLabeledReference(referencePipe: TypedPipe[ReferenceTweets]): TypedPipe[LabeledTweet] = { - referencePipe - .flatMap { refTweets => - refTweets.impressedTweets.map { tweet => - // Reference tweets do not have scores - LabeledTweet(refTweets.targetUserId, tweet.tweetId, tweet.authorId, tweet.labels, None) - } - } - } - - def getUniqueCount[T](pipe: TypedPipe[T])(implicit ord: scala.Ordering[T]): Execution[Long] = { - pipe.distinct - .aggregate(size) - .toOptionExecution - .map(_.getOrElse(0L)) - } - - def countUniqueEngagedUsersBy( - labeledTweetsPipe: TypedPipe[LabeledTweet], - f: TweetLabels => Boolean - ): Execution[Long] = { - getUniqueCount[Long](labeledTweetsPipe.collect { case t if f(t.labels) => t.targetUserId }) - } - - def countUniqueLabeledTargetUsers(labeledTweetsPipe: TypedPipe[LabeledTweet]): Execution[Long] = { - getUniqueCount[Long](labeledTweetsPipe.map(_.targetUserId)) - } - - def countUniqueCandTargetUsers(candidatePipe: TypedPipe[CandidateTweets]): Execution[Long] = { - getUniqueCount[Long](candidatePipe.map(_.targetUserId)) - } - - def countUniqueLabeledAuthors(labeledTweetPipe: TypedPipe[LabeledTweet]): Execution[Long] = { - getUniqueCount[Long](labeledTweetPipe.map(_.authorId)) - } - - /** - * Helper function to calculate the basic engagement rates - */ - def getEngagementRate( - basicStats: TweetStats, - engagementCount: TweetEngagementCounts - ): TweetEngagementRates = { - val numTweets = basicStats.numTweets.toDouble - if (numTweets <= 0) throw new IllegalArgumentException("Invalid tweet counts") - val likeRate = engagementCount.like / numTweets - val rtRate = engagementCount.retweet / numTweets - val clickRate = engagementCount.click / numTweets - val engagementRate = engagementCount.hasEngagement / numTweets - TweetEngagementRates(likeRate, rtRate, clickRate, engagementRate) - } - - /** - * Helper function to calculate the basic stats for a pipe of candidate tweets - */ - def getTweetStatsForCandidateExec( - candidatePipe: TypedPipe[CandidateTweets] - ): Execution[TweetStats] = { - val pipe = candidatePipe.map { candTweets => - (candTweets.targetUserId, candTweets.recommendedTweets) - }.sumByKey // Dedup by targetId, in case there exists multiple entries. 
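As a concrete reading of `getEngagementRate` above: each rate is simply the corresponding engagement count divided by the total number of impressed tweets. A tiny self-contained example, using local stand-ins for the case classes defined earlier in this file:

```scala
// Worked example of the engagement-rate arithmetic; Counts/Rates mirror the
// TweetEngagementCounts/TweetEngagementRates case classes defined above.
object EngagementRateSketch {
  final case class Counts(like: Long, retweet: Long, click: Long, hasEngagement: Long)
  final case class Rates(like: Double, retweet: Double, click: Double, hasEngagement: Double)

  def rates(numTweets: Long, c: Counts): Rates = {
    require(numTweets > 0, "Invalid tweet counts")
    Rates(
      like = c.like.toDouble / numTweets,
      retweet = c.retweet.toDouble / numTweets,
      click = c.click.toDouble / numTweets,
      hasEngagement = c.hasEngagement.toDouble / numTweets)
  }

  def main(args: Array[String]): Unit = {
    // 1,000 impressed tweets with 42 likes, 7 retweets, 120 clicks, 150 with any engagement.
    println(rates(1000L, Counts(42, 7, 120, 150)))
    // Rates(0.042, 0.007, 0.12, 0.15) -> reported as 4.2%, 0.7%, 12%, 15%.
  }
}
```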
- - val distinctTweetPipe = pipe.flatMap(_._2.map(_.tweetId)).distinct.aggregate(size) - - val otherStats = pipe - .map { - case (uid, recommendedTweets) => - val scoreSum = recommendedTweets.flatMap(_.score).sum - (recommendedTweets.size.toLong, scoreSum) - } - .sum - .map { - case (numTweets, scoreSum) => - if (numTweets <= 0) throw new IllegalArgumentException("Invalid tweet counts") - val avgScore = scoreSum / numTweets.toDouble - (numTweets, avgScore) - } - ValuePipe - .fold(distinctTweetPipe, otherStats) { - case (numDistinctTweet, (numTweets, avgScore)) => - // no author side information for candidate tweets yet - TweetStats(numTweets, numDistinctTweet, None, Some(avgScore)) - }.getOrElseExecution(TweetStats(0L, 0L, None, None)) - } - - /** - * Helper function to count the total number of engagements - */ - def getLabeledEngagementCountExec( - labeledTweets: TypedPipe[LabeledTweet] - ): Execution[TweetEngagementCounts] = { - labeledTweets - .map { labeledTweet => - val like = toLong(labeledTweet.labels.isLiked) - val retweet = toLong(labeledTweet.labels.isRetweeted) - val click = toLong(labeledTweet.labels.isClicked) - val hasEngagement = toLong(hasCoreEngagementsOrClick(labeledTweet.labels)) - - (like, retweet, click, hasEngagement) - } - .sum - .map { - case (like, retweet, click, hasEngagement) => - TweetEngagementCounts(like, retweet, click, hasEngagement) - } - .getOrElseExecution(TweetEngagementCounts(0L, 0L, 0L, 0L)) - } - - /** - * Count the total number of unique users who have engaged with tweets - */ - def getTargetUserStatsForLabeledTweetsExec( - labeledTweetsPipe: TypedPipe[LabeledTweet] - ): Execution[UserEngagerCounts] = { - val numUniqueTargetUsersExec = countUniqueLabeledTargetUsers(labeledTweetsPipe) - val numUniqueLikeUsersExec = - countUniqueEngagedUsersBy(labeledTweetsPipe, labels => labels.isLiked) - val numUniqueRetweetUsersExec = - countUniqueEngagedUsersBy(labeledTweetsPipe, labels => labels.isRetweeted) - - Execution - .zip( - numUniqueTargetUsersExec, - numUniqueLikeUsersExec, - numUniqueRetweetUsersExec - ) - .map { - case (numTarget, like, retweet) => - UserEngagerCounts( - numDistinctTargetUsers = numTarget, - numDistinctLikeEngagers = like, - numDistinctRetweetEngagers = retweet - ) - } - } - - /** - * Helper function to calculate the basic stats for a pipe of labeled tweets. - */ - def getTweetStatsForLabeledTweetsExec( - labeledTweetPipe: TypedPipe[LabeledTweet] - ): Execution[TweetStats] = { - val uniqueAuthorsExec = countUniqueLabeledAuthors(labeledTweetPipe) - - val uniqueTweetExec = - labeledTweetPipe.map(_.tweetId).distinct.aggregate(size).getOrElseExecution(0L) - val scoresExec = labeledTweetPipe - .map { t => (t.targetUserId, (1, t.algorithmScore.getOrElse(0.0))) } - .sumByKey // Dedup by targetId, in case there exists multiple entries. - .map { - case (uid, (c1, c2)) => - (c1.toLong, c2) - } - .sum - .map { - case (numTweets, scoreSum) => - if (numTweets <= 0) throw new IllegalArgumentException("Invalid tweet counts") - val avgScore = scoreSum / numTweets.toDouble - (numTweets, Option(avgScore)) - } - .getOrElseExecution((0L, None)) - - Execution - .zip(uniqueAuthorsExec, uniqueTweetExec, scoresExec) - .map { - case (numDistinctAuthors, numUniqueTweets, (numTweets, avgScores)) => - TweetStats(numTweets, numUniqueTweets, Some(numDistinctAuthors), avgScores) - } - } - - /** - * Print a update message to the stdout when a step is done. 
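`LabelCorrelationsHelper`, used further down to fill in `LabelCorrelations`, is not included in this diff. A hedged sketch of the quantities its method names suggest it produces: Pearson correlation and cosine similarity between the binary `isLiked` label and the algorithm score:

```scala
// Sketch of the label/score correlations; the real LabelCorrelationsHelper is defined elsewhere.
object CorrelationSketch {
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty)
    val n = xs.size.toDouble
    val (mx, my) = (xs.sum / n, ys.sum / n)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    if (sx == 0.0 || sy == 0.0) 0.0 else cov / (sx * sy)
  }

  def cosine(xs: Seq[Double], ys: Seq[Double]): Double = {
    val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(xs.map(x => x * x).sum) * math.sqrt(ys.map(y => y * y).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  def main(args: Array[String]): Unit = {
    val liked = Seq(1.0, 0.0, 1.0, 0.0) // isLiked encoded as 0/1
    val score = Seq(0.9, 0.2, 0.7, 0.4) // algorithmScore
    println(pearson(liked, score)) // ~0.93: higher scores line up with liked tweets
    println(cosine(liked, score))
  }
}
```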
- */ - private def printOnCompleteMsg(stepDescription: String, startTimeMillis: Long): Unit = { - val formatDate = TwitterDateFormat("yyyy-MM-dd hh:mm:ss") - val now = Calendar.getInstance().getTime - - val secondsSpent = (now.getTime - startTimeMillis) / 1000 - println( - s"- ${formatDate.format(now)}\tStep complete: $stepDescription\t " + - s"Time spent: ${secondsSpent / 60}m${secondsSpent % 60}s" - ) - } - - /** - * Calculate the metrics of a pipe of [[CandidateTweets]] - */ - private def getEvaluationResultsForCandidates( - candidatePipe: TypedPipe[CandidateTweets] - ): Execution[CandidateResults] = { - val tweetStatsExec = getTweetStatsForCandidateExec(candidatePipe) - val numDistinctTargetUsersExec = countUniqueCandTargetUsers(candidatePipe) - - Execution - .zip(tweetStatsExec, numDistinctTargetUsersExec) - .map { - case (tweetStats, numDistinctTargetUsers) => - CandidateResults(tweetStats, numDistinctTargetUsers) - } - } - - /** - * Calculate the metrics of a pipe of [[LabeledTweet]] - */ - private def getEvaluationResultsForLabeledTweets( - labeledTweetPipe: TypedPipe[LabeledTweet], - getLabelCorrelations: Boolean = false - ): Execution[LabeledTweetsResults] = { - val tweetStatsExec = getTweetStatsForLabeledTweetsExec(labeledTweetPipe) - val userStatsExec = getTargetUserStatsForLabeledTweetsExec(labeledTweetPipe) - val engagementCountExec = getLabeledEngagementCountExec(labeledTweetPipe) - - val correlationsExec = if (getLabelCorrelations) { - Execution - .zip( - LabelCorrelationsHelper.pearsonCoefficientForLike(labeledTweetPipe), - LabelCorrelationsHelper.cosineSimilarityForLike(labeledTweetPipe), - LabelCorrelationsHelper.cosineSimilarityForLikePerUser(labeledTweetPipe) - ).map { - case (pearsonCoeff, globalCos, avgCos) => - Some(LabelCorrelations(pearsonCoeff, globalCos, avgCos)) - } - } else { - ValuePipe(None).getOrElseExecution(None) // Empty pipe with a None value - } - - Execution - .zip(tweetStatsExec, engagementCountExec, userStatsExec, correlationsExec) - .map { - case (tweetStats, engagementCount, engagerCount, correlationsOpt) => - val engagementRate = getEngagementRate(tweetStats, engagementCount) - LabeledTweetsResults( - tweetStats, - engagerCount, - engagementCount, - engagementRate, - correlationsOpt) - } - } - - private def runAllEvalForCandidates( - candidatePipe: TypedPipe[CandidateTweets], - outerJoinPipe: TypedPipe[((Long, Long), (Option[ReferenceTweet], Option[CandidateTweet]))] - ): Execution[(CandidateResults, CandidateResults)] = { - val t0 = System.currentTimeMillis() - - val candidateNotInIntersectionPipe = - outerJoinPipe - .collect { - case ((targetUserId, _), (None, Some(candTweet))) => (targetUserId, Seq(candTweet)) - } - .sumByKey - .map { case (targetUserId, candTweets) => CandidateTweets(targetUserId, candTweets) } - .forceToDisk - - Execution - .zip( - getEvaluationResultsForCandidates(candidatePipe), - getEvaluationResultsForCandidates(candidateNotInIntersectionPipe) - ).onComplete(_ => printOnCompleteMsg("runAllEvalForCandidates()", t0)) - } - - private def runAllEvalForIntersection( - outerJoinPipe: TypedPipe[((Long, Long), (Option[ReferenceTweet], Option[CandidateTweet]))] - )( - implicit uniqueID: UniqueID - ): Execution[(LabeledTweetsResults, LabeledTweetsResults, LabeledTweetsResults)] = { - val t0 = System.currentTimeMillis() - val intersectionTweetsPipe = outerJoinPipe.collect { - case ((targetUserId, tweetId), (Some(refTweet), Some(candTweet))) => - LabeledTweet(targetUserId, tweetId, refTweet.authorId, refTweet.labels, 
candTweet.score) - }.forceToDisk - - val likedTweetsPipe = intersectionTweetsPipe.filter(_.labels.isLiked) - val notLikedTweetsPipe = intersectionTweetsPipe.filter(!_.labels.isLiked) - - Execution - .zip( - getEvaluationResultsForLabeledTweets(intersectionTweetsPipe, getLabelCorrelations = true), - getEvaluationResultsForLabeledTweets(likedTweetsPipe), - getEvaluationResultsForLabeledTweets(notLikedTweetsPipe) - ).onComplete(_ => printOnCompleteMsg("runAllEvalForIntersection()", t0)) - } - - private def runAllEvalForReferences( - referencePipe: TypedPipe[ReferenceTweets], - outerJoinPipe: TypedPipe[((Long, Long), (Option[ReferenceTweet], Option[CandidateTweet]))] - ): Execution[(LabeledTweetsResults, LabeledTweetsResults)] = { - val t0 = System.currentTimeMillis() - val labeledReferenceNotInIntersectionPipe = - outerJoinPipe.collect { - case ((targetUserId, _), (Some(refTweet), None)) => - LabeledTweet(targetUserId, refTweet.tweetId, refTweet.authorId, refTweet.labels, None) - }.forceToDisk - - Execution - .zip( - getEvaluationResultsForLabeledTweets(getLabeledReference(referencePipe)), - getEvaluationResultsForLabeledTweets(labeledReferenceNotInIntersectionPipe) - ).onComplete(_ => printOnCompleteMsg("runAllEvalForReferences()", t0)) - } - - def runAllEvaluations( - referencePipe: TypedPipe[ReferenceTweets], - candidatePipe: TypedPipe[CandidateTweets] - )( - implicit uniqueID: UniqueID - ): Execution[String] = { - val t0 = System.currentTimeMillis() - - // Force everything to disk to maximize data re-use - Execution - .zip( - referencePipe.forceToDiskExecution, - candidatePipe.forceToDiskExecution - ).flatMap { - case (referenceDiskPipe, candidateDiskPipe) => - outerJoinReferenceAndCandidate(referenceDiskPipe, candidateDiskPipe).forceToDiskExecution - .flatMap { outerJoinPipe => - val referenceResultsExec = runAllEvalForReferences(referenceDiskPipe, outerJoinPipe) - val intersectionResultsExec = runAllEvalForIntersection(outerJoinPipe) - val candidateResultsExec = runAllEvalForCandidates(candidateDiskPipe, outerJoinPipe) - - Execution - .zip( - referenceResultsExec, - intersectionResultsExec, - candidateResultsExec - ).map { - case ( - (allReference, referenceNotInIntersection), - (allIntersection, intersectionLiked, intersectionNotLiked), - (allCandidate, candidateNotInIntersection)) => - val timeSpent = (System.currentTimeMillis() - t0) / 1000 - val resultStr = Seq( - "===================================================", - s"Evaluation complete. Took ${timeSpent / 60}m${timeSpent % 60}s ", - allReference.format("-----Metrics for all Reference Tweets-----"), - referenceNotInIntersection.format( - "-----Metrics for Reference Tweets that are not in the intersection-----" - ), - allIntersection.format("-----Metrics for all Intersection Tweets-----"), - intersectionLiked.format("-----Metrics for Liked Intersection Tweets-----"), - intersectionNotLiked.format( - "-----Metrics for not Liked Intersection Tweets-----"), - allCandidate.format("-----Metrics for all Candidate Tweets-----"), - candidateNotInIntersection.format( - "-----Metrics for Candidate Tweets that are not in the intersection-----" - ), - "===================================================\n" - ).mkString("\n") - println(resultStr) - resultStr - } - .onComplete(_ => - printOnCompleteMsg( - "Evaluation complete. 
Check stdout or output logs for results.", - t0)) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationReferenceDataExtraction.scala b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationReferenceDataExtraction.scala deleted file mode 100644 index 47bcabe1b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/EvaluationReferenceDataExtraction.scala +++ /dev/null @@ -1,270 +0,0 @@ -package com.twitter.simclusters_v2.scalding.evaluation - -import com.twitter.ml.api.constant.SharedFeatures.AUTHOR_ID -import com.twitter.ml.api.constant.SharedFeatures.TIMESTAMP -import com.twitter.ml.api.constant.SharedFeatures.TWEET_ID -import com.twitter.ml.api.constant.SharedFeatures.USER_ID -import com.twitter.ml.api.DailySuffixFeatureSource -import com.twitter.ml.api.DataSetPipe -import com.twitter.ml.api.RichDataRecord -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.simclusters_v2.hdfs_sources.TimelineDataExtractorFixedPathSource -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.thriftscala.DisplayLocation -import com.twitter.simclusters_v2.thriftscala.ReferenceTweet -import com.twitter.simclusters_v2.thriftscala.ReferenceTweets -import com.twitter.simclusters_v2.thriftscala.TweetLabels -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.IS_LINGER_IMPRESSION -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.SOURCE_AUTHOR_ID -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.SOURCE_TWEET_ID -import com.twitter.timelines.prediction.features.itl.ITLFeatures -import com.twitter.timelines.prediction.features.recap.RecapFeatures -import java.util.TimeZone - -/** - * A scheduled version of the job to parse Timelines data for impressed and engaged tweets. 
- capesospy-v2 update|create --start_cron tweet_evaluation_timelines_reference_batch src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object ScheduledTimelinesDataExtractionBatch extends TwitterScheduledExecutionApp { - - val outputPath = "/user/cassowary/processed/tweet_evaluation_reference_set/timelines" - - private val firstTime: String = "2019-03-31" - private implicit val tz: TimeZone = DateOps.UTC - private implicit val parser: DateParser = DateParser.default - private val batchIncrement: Duration = Days(1) - - private val execArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(this.getClass.getName.replace("$", "")), - firstTime = BatchFirstTime(RichDate(firstTime)), - lastTime = None, - batchIncrement = BatchIncrement(batchIncrement) - ) - - override def scheduledJob: Execution[Unit] = AnalyticsBatchExecution(execArgs) { - implicit dateRange => - Execution.withId { implicit uniqueId => - Execution.withArgs { args => - val defaultSampleRate = 1.0 - val recaps = - TimelinesEngagementDataExtractor.readTimelinesRecapTweets( - recapTweets = - DailySuffixFeatureSource(TimelinesEngagementDataExtractor.RecapTweetHdfsPath).read, - sampleRate = defaultSampleRate - )(dateRange) - val recTweets = - TimelinesEngagementDataExtractor.readTimelinesRecTweets( - recTweets = - DailySuffixFeatureSource(TimelinesEngagementDataExtractor.RecTweetHdfsPath).read, - sampleRate = defaultSampleRate - )(dateRange) - - (recaps ++ recTweets).writeDALSnapshotExecution( - TweetEvaluationTimelinesReferenceSetScalaDataset, - D.Daily, - D.Suffix(outputPath), - D.EBLzo(), - dateRange.end - ) - } - } - } -} - -/** - * Ad-hoc version of the job to process a subset of the Timeline data, either to catch up with data - * on a particular day, or to generate human readable data for debugging. 
- ./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/evaluation:tweet_evaluation_timelines_reference_adhoc - - oscar hdfs --screen --user cassowary --bundle tweet_evaluation_timelines_reference_adhoc \ - --tool com.twitter.simclusters_v2.scalding.evaluation.AdhocTimelinesDataExtraction \ - -- --date 2018-11-15 --output_dir /user/cassowary/your_ldap/test_htl_data/recap --sample_rate 0.01 \ - --recap --rectweet --output_tsv - */ -object AdhocTimelinesDataExtraction extends TwitterExecutionApp { - - @Override - def job: Execution[Unit] = { - Execution.withArgs { args => - implicit val dateRange: DateRange = - DateRange.parse(args.list("date"))(DateOps.UTC, DateParser.default) - - val outputDir = args("output_dir") - val readRecTweet = args.boolean("rectweet") - val readRecap = args.boolean("recap") - val sampleRate = args.double("sample_rate") - val useTsv = args.boolean("output_tsv") - - if (!readRecTweet && !readRecap) { - throw new IllegalArgumentException("Must read at least some data!") - } - val recTweets = if (readRecTweet) { - println("RecTweets are included in the dataset") - TimelinesEngagementDataExtractor.readTimelinesRecTweets( - recTweets = - DailySuffixFeatureSource(TimelinesEngagementDataExtractor.RecTweetHdfsPath).read, - sampleRate = sampleRate)(dateRange) - } else { - TypedPipe.empty - } - - val recaps = if (readRecap) { - println("Recaps are included in the dataset") - TimelinesEngagementDataExtractor.readTimelinesRecapTweets( - recapTweets = - DailySuffixFeatureSource(TimelinesEngagementDataExtractor.RecapTweetHdfsPath).read, - sampleRate = sampleRate - )(dateRange) - } else { - TypedPipe.empty - } - - val referenceTweets = recaps ++ recTweets - - if (useTsv) { - // Write in plain text in tsv format for human readability - referenceTweets - .map(t => (t.targetUserId, t.impressedTweets)) - .writeExecution(TypedTsv[(Long, Seq[ReferenceTweet])](outputDir)) - } else { - // Write in compact thrift lzo format - referenceTweets - .writeExecution(TimelineDataExtractorFixedPathSource(outputDir)) - } - } - } -} - -/** - * Base class to provide functions to parse tweet engagement data from Home Timeline's data. - * We are mainly interested in 2 tweet data sets from Home Timeline: - * 1. Recap tweet: Tweets + RTs from user's follow graph. We are interested in out of network RTs. - * 2. RecTweet: Out of network tweets not from user's follow graph. - */ -object TimelinesEngagementDataExtractor { - - val RecapTweetHdfsPath = "/atla/proc2/user/timelines/processed/suggests/recap/data_records" - val RecTweetHdfsPath = "/atla/proc2/user/timelines/processed/injections/rectweet/data_records" - - // Timelines name the same feature differently depending on the surface area (ex. recap vs rectweet). - // For each data source we extract the features with different feature names. 
Detail: - def toRecapTweetLabels(record: RichDataRecord): TweetLabels = { - val isClicked = record.getFeatureValue(RecapFeatures.IS_CLICKED) - val isFav = record.getFeatureValue(RecapFeatures.IS_FAVORITED) - val isRT = record.getFeatureValue(RecapFeatures.IS_RETWEETED) - val isQuoted = record.getFeatureValue(RecapFeatures.IS_QUOTED) - val isReplied = record.getFeatureValue(RecapFeatures.IS_REPLIED) - TweetLabels(isClicked, isFav, isRT, isQuoted, isReplied) - } - - def toRecTweetLabels(record: RichDataRecord): TweetLabels = { - // Refer to ITLFeatures for more labels - val isClicked = record.getFeatureValue(ITLFeatures.IS_CLICKED) - val isFav = record.getFeatureValue(ITLFeatures.IS_FAVORITED) - val isRT = record.getFeatureValue(ITLFeatures.IS_RETWEETED) - val isQuoted = record.getFeatureValue(ITLFeatures.IS_QUOTED) - val isReplied = record.getFeatureValue(ITLFeatures.IS_REPLIED) - TweetLabels(isClicked, isFav, isRT, isQuoted, isReplied) - } - - /** - * Return Recap tweets, which are in-network tweets. Here we only filter for Retweets of tweets - * that are outside the user's follow graph. - */ - def readTimelinesRecapTweets( - recapTweets: DataSetPipe, - sampleRate: Double - )( - implicit dateRange: DateRange - ): TypedPipe[ReferenceTweets] = { - // recapTweets are in network tweets. We want to discover RTs of OON tweets. - // For Retweets, we check IS_RETWEET and use SOURCE_TWEET_ID, and then check - // PROBABLY_FROM_FOLLOWED_AUTHOR, which filters in network tweet from user's top 1000 follow graph. - - recapTweets.richRecords - .sample(sampleRate) - .filter { record => - val isInDateRange = dateRange.contains(RichDate(record.getFeatureValue(TIMESTAMP).toLong)) - val isLingeredImpression = record.getFeatureValue(IS_LINGER_IMPRESSION) - val isInNetwork = - record.getFeatureValue(RecapFeatures.PROBABLY_FROM_FOLLOWED_AUTHOR) // approximate - val isRetweet = record.getFeatureValue(RecapFeatures.IS_RETWEET) - isRetweet && (!isInNetwork) && isInDateRange && isLingeredImpression - } - .flatMap { record => - for { - userId <- Option(record.getFeatureValue(USER_ID)).map(_.toLong) - sourceTweetId <- Option(record.getFeatureValue(SOURCE_TWEET_ID)).map( - _.toLong - ) // source tweetId is the RT id - sourceAuthorId <- Option(record.getFeatureValue(SOURCE_AUTHOR_ID)).map(_.toLong) - timestamp <- Option(record.getFeatureValue(TIMESTAMP)).map(_.toLong) - labels = toRecapTweetLabels(record) - } yield { - ( - userId, - Seq( - ReferenceTweet( - sourceTweetId, - sourceAuthorId, - timestamp, - DisplayLocation.TimelinesRecap, - labels)) - ) - } - } - .sumByKey - .map { case (uid, tweetSeq) => ReferenceTweets(uid, tweetSeq) } - } - - /** - * Return RecTweets, which are out of network tweets served in the Timeline. 
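A note on the (userId, Seq(tweet)) shape emitted by readTimelinesRecapTweets above and readTimelinesRecTweets below: scalding's sumByKey combines values with the implicit Seq semigroup, i.e. concatenation, so all of a user's impressed tweets end up in a single list before being wrapped in ReferenceTweets. A minimal in-memory analogue (illustrative only; the ids are made up):

```scala
// groupBy + flatMap plays the role of sumByKey's per-key Seq concatenation.
val perImpression: Seq[(Long, Seq[String])] =
  Seq(1L -> Seq("tweetA"), 2L -> Seq("tweetC"), 1L -> Seq("tweetB"))

val perUser: Map[Long, Seq[String]] =
  perImpression.groupBy(_._1).map { case (userId, rows) => userId -> rows.flatMap(_._2) }

// perUser == Map(1L -> Seq("tweetA", "tweetB"), 2L -> Seq("tweetC"))
```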
- */ - def readTimelinesRecTweets( - recTweets: DataSetPipe, - sampleRate: Double - )( - implicit dateRange: DateRange - ): TypedPipe[ReferenceTweets] = { - // recTweets contain strictly out of network injection tweets - - recTweets.richRecords - .sample(sampleRate) - .filter { record => - val isInDateRange = dateRange.contains(RichDate(record.getFeatureValue(TIMESTAMP).toLong)) - val isLingeredImpression = record.getFeatureValue(IS_LINGER_IMPRESSION) - - isInDateRange && isLingeredImpression - } - .flatMap { record => - for { - userId <- Option(record.getFeatureValue(USER_ID)).map(_.toLong) - tweetId <- Option(record.getFeatureValue(TWEET_ID)).map(_.toLong) - authorId <- Option(record.getFeatureValue(AUTHOR_ID)).map(_.toLong) - timestamp <- Option(record.getFeatureValue(TIMESTAMP)).map(_.toLong) - labels = toRecTweetLabels(record) - } yield { - ( - userId, - Seq( - ReferenceTweet( - tweetId, - authorId, - timestamp, - DisplayLocation.TimelinesRectweet, - labels)) - ) - } - } - .sumByKey - .map { case (uid, tweetSeq) => ReferenceTweets(uid, tweetSeq) } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/LabelCorrelationsHelper.scala b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/LabelCorrelationsHelper.scala deleted file mode 100644 index 8d86f5a6a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/LabelCorrelationsHelper.scala +++ /dev/null @@ -1,61 +0,0 @@ -package com.twitter.simclusters_v2.scalding.evaluation - -import com.twitter.algebird.AveragedValue -import com.twitter.scalding.Execution -import com.twitter.scalding.typed.TypedPipe -import com.twitter.simclusters_v2.scalding.common.Util - -/** - * Utility object for correlation measures between the algorithm scores and the user engagements, - * such as the number of Likes. - */ -object LabelCorrelationsHelper { - - private def toDouble(bool: Boolean): Double = { - if (bool) 1.0 else 0.0 - } - - /** - * Given a pipe of labeled tweets, calculate the cosine similarity between the algorithm scores - * and users' favorite engagements. - */ - def cosineSimilarityForLike(labeledTweets: TypedPipe[LabeledTweet]): Execution[Double] = { - labeledTweets - .map { tweet => (toDouble(tweet.labels.isLiked), tweet.algorithmScore.getOrElse(0.0)) } - .toIterableExecution.map { iter => Util.cosineSimilarity(iter.iterator) } - } - - /** - * Given a pipe of labeled tweets, calculate cosine similarity between algorithm score and users' - * favorites engagements, on a per user basis, and return the average of all cosine - * similarities across all users. - */ - def cosineSimilarityForLikePerUser(labeledTweets: TypedPipe[LabeledTweet]): Execution[Double] = { - val avg = AveragedValue.aggregator.composePrepare[(Unit, Double)](_._2) - - labeledTweets - .map { tweet => - ( - tweet.targetUserId, - Seq((toDouble(tweet.labels.isLiked), tweet.algorithmScore.getOrElse(0.0))) - ) - } - .sumByKey - .map { - case (userId, seq) => - ((), Util.cosineSimilarity(seq.iterator)) - } - .aggregate(avg) - .getOrElseExecution(0.0) - } - - /** - * Calculates the Pearson correlation coefficient for the algorithm scores and user's favorite - * engagement. Note this function call triggers a writeToDisk execution. 
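Both cosineSimilarityForLike and cosineSimilarityForLikePerUser above hand (engagement label, algorithm score) pairs to Util.cosineSimilarity. For reference, a minimal standalone sketch of that statistic (illustrative only; the production Util implementation is not shown in this file and may differ):

```scala
// Cosine similarity of the label vector x and score vector y, streamed as pairs:
//   cos(x, y) = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2)))
def cosineSimilaritySketch(pairs: Iterator[(Double, Double)]): Double = {
  val (dot, normX, normY) = pairs.foldLeft((0.0, 0.0, 0.0)) {
    case ((d, nx, ny), (x, y)) => (d + x * y, nx + x * x, ny + y * y)
  }
  if (normX == 0.0 || normY == 0.0) 0.0
  else dot / (math.sqrt(normX) * math.sqrt(normY))
}
```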
- */ - def pearsonCoefficientForLike(labeledTweets: TypedPipe[LabeledTweet]): Execution[Double] = { - labeledTweets - .map { tweet => (toDouble(tweet.labels.isLiked), tweet.algorithmScore.getOrElse(0.0)) } - .toIterableExecution.map { iter => Util.computeCorrelation(iter.iterator) } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/SimClustersEvaluationAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/evaluation/SimClustersEvaluationAdhocApp.scala deleted file mode 100644 index 3ddded4cf..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/evaluation/SimClustersEvaluationAdhocApp.scala +++ /dev/null @@ -1,210 +0,0 @@ -package com.twitter.simclusters_v2.scalding.evaluation - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.candidate_source.ClusterRanker -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.ClusterTopKTweetsHourlySuffixSource -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedInScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.TweetEvaluationTimelinesReferenceSetScalaDataset -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.CandidateTweet -import com.twitter.simclusters_v2.thriftscala.CandidateTweets -import com.twitter.simclusters_v2.thriftscala.ClusterTopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.DisplayLocation -import com.twitter.simclusters_v2.thriftscala.ReferenceTweets -import com.twitter.simclusters_v2.scalding.offline_job.OfflineRecConfig -import com.twitter.simclusters_v2.scalding.offline_job.OfflineTweetRecommendation -import java.util.TimeZone - -/** - * Do evaluations for SimClusters' tweet recommendations by using offline datasets. - * The job does the following: - * 1. Take in a test date range, for which the offline simclusters rec will be evaluated - * 2. For all users that had tweet impressions in timelines during the period, generate offline - * SimClusters candidate tweets for these users - * 3. Run offline evaluation and return metrics - -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/evaluation:simcluster_offline_eval_adhoc - -Note: Never specify reference date range across more than 1 day! 
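Similarly, pearsonCoefficientForLike in LabelCorrelationsHelper above delegates to Util.computeCorrelation. A standalone sketch of the Pearson correlation it asks for (illustrative only, not the production implementation):

```scala
// Pearson correlation r of streamed (x, y) pairs via running sums:
//   r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
def pearsonSketch(pairs: Iterator[(Double, Double)]): Double = {
  val (n, sx, sy, sxx, syy, sxy) =
    pairs.foldLeft((0L, 0.0, 0.0, 0.0, 0.0, 0.0)) {
      case ((c, ax, ay, axx, ayy, axy), (x, y)) =>
        (c + 1L, ax + x, ay + y, axx + x * x, ayy + y * y, axy + x * y)
    }
  val denom = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
  if (denom == 0.0) 0.0 else (n * sxy - sx * sy) / denom
}
```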
-oscar hdfs --user cassowary --screen --screen-detached --tee your_ldap/prod_percentile \ - --bundle simcluster_offline_eval_adhoc \ - --tool com.twitter.simclusters_v2.scalding.evaluation.SimClustersEvaluationAdhocApp \ - -- --cand_tweet_date 2019-03-04T00 2019-03-04T23 \ - --ref_tweet_date 2019-03-05T00 2019-03-05T01 \ - --timeline_tweet rectweet \ - --sample_rate 0.05 \ - --max_cand_tweets 16000000 \ - --min_tweet_score 0.0 \ - --user_interested_in_dir /user/frigate/your_ldap/interested_in_copiedFromAtlaProc_20190228 \ - --cluster_top_k_dir /user/cassowary/your_ldap/offline_simcluster_20190304/cluster_top_k_tweets \ - --output_dir /user/cassowary/your_ldap/prod_percentile \ - --toEmailAddress your_ldap@twitter.com \ - --testRunName TestingProdOn0305Data - */ -object SimClustersEvaluationAdhocApp extends TwitterExecutionApp { - private val maxTweetResults = 40 - private val maxClustersToQuery = 20 - - @Override - def job: Execution[Unit] = { - Execution.withArgs { args => - Execution.withId { implicit uniqueId => - implicit val tz: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - val candTweetDateRange = DateRange.parse(args.list("cand_tweet_date")) - val refTweetDateRange = DateRange.parse(args.list("ref_tweet_date")) - val toEmailAddressOpt = args.optional("toEmailAddress") - val testRunName = args.optional("testRunName") - - println( - s"Using SimClusters tweets from ${candTweetDateRange.start} to ${candTweetDateRange.end}") - println(s"Using Timelines tweets on the day of ${refTweetDateRange.start}") - - // separate tweets from different display locations for now - val tweetType = args("timeline_tweet") match { - case "rectweet" => DisplayLocation.TimelinesRectweet - case "recap" => DisplayLocation.TimelinesRecap - case e => - throw new IllegalArgumentException(s"$e isn't a valid timeline display location") - } - - val sampleRate = args.double("sample_rate", 1.0) - val validRefPipe = getProdTimelineReference(tweetType, refTweetDateRange, sampleRate) - val targetUserPipe = validRefPipe.map { _.targetUserId } - - // Read a fixed-path in atla if provided, otherwise read prod data from atla for date range - val userInterestInPipe = args.optional("user_interested_in_dir") match { - case Some(fixedPath) => - println(s"user_interested_in_dir is provided at: $fixedPath. Reading fixed path data.") - TypedPipe.from(AdhocKeyValSources.interestedInSource(fixedPath)) - case _ => - println(s"user_interested_in_dir isn't provided. 
Reading prod data.") - interestedInProdSource(candTweetDateRange) - } - - // Offline simulation of this dataset - val clusterTopKDir = args("cluster_top_k_dir") - println(s"cluster_top_k_dir is defined at: $clusterTopKDir") - val clusterTopKPipe = TypedPipe.from( - ClusterTopKTweetsHourlySuffixSource(clusterTopKDir, candTweetDateRange) - ) - - // Configs for offline simcluster tweet recommendation - val maxTweetRecs = args.int("max_cand_tweets", 30000000) - val minTweetScoreThreshold = args.double("min_tweet_score", 0.0) - - val offlineRecConfig = OfflineRecConfig( - maxTweetRecs, - maxTweetResults, - maxClustersToQuery, - minTweetScoreThreshold, - ClusterRanker.RankByNormalizedFavScore - ) - println("SimClusters offline config: " + offlineRecConfig) - - getValidCandidate( - targetUserPipe, - userInterestInPipe, - clusterTopKPipe, - offlineRecConfig, - candTweetDateRange - ).flatMap { validCandPipe => - val outputDir = args("output_dir") - EvaluationMetricHelper.runAllEvaluations(validRefPipe, validCandPipe).map { results => - toEmailAddressOpt.foreach { address => - Util.sendEmail( - results, - "Results from tweet evaluation test bed " + testRunName.getOrElse(""), - address) - } - TypedPipe.from(Seq((results, ""))).writeExecution(TypedTsv[(String, String)](outputDir)) - } - } - } - } - } - - /** - * Given a pipe of raw timelines reference engagement data, collect the engagements that took - * place during the given date range, then sample these engagements - */ - private def getProdTimelineReference( - displayLocation: DisplayLocation, - batchDateRange: DateRange, - sampleRate: Double - )( - implicit tz: TimeZone - ): TypedPipe[ReferenceTweets] = { - // Snapshot data timestamps itself with the last possible time of the day. +1 day to cover it - val snapshotRange = DateRange(batchDateRange.start, batchDateRange.start + Days(1)) - val timelinesRefPipe = DAL - .readMostRecentSnapshot(TweetEvaluationTimelinesReferenceSetScalaDataset, snapshotRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - - timelinesRefPipe - .flatMap { refTweets => - val tweets = refTweets.impressedTweets - .filter { refTweet => - refTweet.timestamp >= batchDateRange.start.timestamp && - refTweet.timestamp <= batchDateRange.end.timestamp && - refTweet.displayLocation == displayLocation - } - if (tweets.nonEmpty) { - Some(ReferenceTweets(refTweets.targetUserId, tweets)) - } else { - None - } - } - .sample(sampleRate) - } - - /** - * Given a list of target users, simulate SimCluster's online serving logic offline for these - * users, then convert them into [[CandidateTweets]] - */ - private def getValidCandidate( - targetUserPipe: TypedPipe[Long], - userIsInterestedInPipe: TypedPipe[(Long, ClustersUserIsInterestedIn)], - clusterTopKTweetsPipe: TypedPipe[ClusterTopKTweetsWithScores], - offlineConfig: OfflineRecConfig, - batchDateRange: DateRange - )( - implicit uniqueID: UniqueID - ): Execution[TypedPipe[CandidateTweets]] = { - OfflineTweetRecommendation - .getTopTweets(offlineConfig, targetUserPipe, userIsInterestedInPipe, clusterTopKTweetsPipe) - .map(_.map { - case (userId, scoredTweets) => - val tweets = scoredTweets.map { tweet => - CandidateTweet(tweet.tweetId, Some(tweet.score), Some(batchDateRange.start.timestamp)) - } - CandidateTweets(userId, tweets) - }) - } - - /** - * Read interested in key-val store from atla-proc from the given date range - */ - private def interestedInProdSource( - dateRange: DateRange - ): TypedPipe[(Long, ClustersUserIsInterestedIn)] = { - implicit val timeZone: 
TimeZone = DateOps.UTC - - DAL - .readMostRecentSnapshot(SimclustersV2InterestedInScalaDataset, dateRange.embiggen(Weeks(1))) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(key, value) => (key, value) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/BUILD.bazel deleted file mode 100644 index ed5c9d6b8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/BUILD.bazel +++ /dev/null @@ -1,74 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "src/scala/com/twitter/onboarding/relevance/source:utt_account_recommendations-scala", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/wtf/entity_real_graph/common", - ], -) - -hadoop_binary( - name = "inferred_entities_from_known_for-adhoc", - main = "com.twitter.simclusters_v2.scalding.inferred_entities.InferredKnownForSemanticCoreEntitiesAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":inferred_entities", - ], -) - -hadoop_binary( - name = "inferred_entities_from_known_for", - main = "com.twitter.simclusters_v2.scalding.inferred_entities.InferredKnownForSemanticCoreEntitiesBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":inferred_entities", - ], -) - -hadoop_binary( - name = "inferred_entities_from_interested_in-adhoc", - main = "com.twitter.simclusters_v2.scalding.inferred_entities.InferredInterestedInSemanticCoreEntitiesAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":inferred_entities", - ], -) - -hadoop_binary( - name = "inferred_entities_from_interested_in", - main = "com.twitter.simclusters_v2.scalding.inferred_entities.InferredInterestedInSemanticCoreEntitiesBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":inferred_entities", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntities.scala b/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntities.scala deleted file mode 100644 index 67b1574b7..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntities.scala +++ /dev/null @@ -1,92 +0,0 @@ -package com.twitter.simclusters_v2.scalding.inferred_entities - -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.typed.TypedPipe -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.EntityEmbeddingsSources -import com.twitter.simclusters_v2.thriftscala.ClusterType -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InferredEntity -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SemanticCoreEntityWithScore 
-import com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities -import com.twitter.simclusters_v2.thriftscala.SimClustersSource -import java.util.TimeZone - -/** - * Opt-out compliance for SimClusters means offering users an option to opt out of clusters that - * have inferred legible meanings. This file sets some of the data sources & thresholds from which - * the inferred entities are considered legible. One should always refer to the sources & constants - * here for SimClusters' inferred entity compliance work - */ -object InferredEntities { - val MHRootPath: String = - "/user/cassowary/manhattan_sequence_files/simclusters_v2_inferred_entities" - - // Convenience objects for defining cluster sources - val InterestedIn2020 = - SimClustersSource(ClusterType.InterestedIn, ModelVersion.Model20m145k2020) - - val Dec11KnownFor = SimClustersSource(ClusterType.KnownFor, ModelVersion.Model20m145kDec11) - - val UpdatedKnownFor = SimClustersSource(ClusterType.KnownFor, ModelVersion.Model20m145kUpdated) - - val KnownFor2020 = SimClustersSource(ClusterType.KnownFor, ModelVersion.Model20m145k2020) - - /** - * This is the threshold at which we consider a simcluster "legible" through an entity - */ - val MinLegibleEntityScore = 0.6 - - /** - * Query for the entity embeddings that are used for SimClusters compliance. We will use these - * entity embeddings for a cluster to allow a user to opt out of a cluster - */ - def getLegibleEntityEmbeddings( - dateRange: DateRange, - timeZone: TimeZone - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - val entityEmbeddings = EntityEmbeddingsSources - .getReverseIndexedSemanticCoreEntityEmbeddingsSource( - EmbeddingType.FavBasedSematicCoreEntity, - ModelVersions.Model20M145K2020, // only support the latest 2020 model - dateRange.embiggen(Days(7)(timeZone)) // read 7 days before & after to give buffer - ) - filterEntityEmbeddingsByScore(entityEmbeddings, MinLegibleEntityScore) - } - - // Return entities whose score are above threshold - def filterEntityEmbeddingsByScore( - entityEmbeddings: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])], - minEntityScore: Double - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - entityEmbeddings.flatMap { - case (clusterId, entities) => - val validEntities = entities.filter { entity => entity.score >= minEntityScore } - if (validEntities.nonEmpty) { - Some((clusterId, validEntities)) - } else { - None - } - - } - } - - /** - * Given inferred entities from different sources, combine the results into job's output format - */ - def combineResults( - results: TypedPipe[(UserId, Seq[InferredEntity])]* - ): TypedPipe[(UserId, SimClustersInferredEntities)] = { - results - .reduceLeft(_ ++ _) - .sumByKey - .map { - case (userId, inferredEntities) => - (userId, SimClustersInferredEntities(inferredEntities)) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntitiesFromInterestedIn.scala b/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntitiesFromInterestedIn.scala deleted file mode 100644 index 212d851e1..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredEntitiesFromInterestedIn.scala +++ /dev/null @@ -1,377 +0,0 @@ -package com.twitter.simclusters_v2.scalding.inferred_entities - -import com.twitter.algebird.Max -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import 
com.twitter.scalding.Execution -import com.twitter.scalding.RichDate -import com.twitter.scalding.TypedPipe -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.UTTEntityId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.EntitySource -import com.twitter.simclusters_v2.thriftscala.InferredEntity -import com.twitter.simclusters_v2.thriftscala.SemanticCoreEntityWithScore -import com.twitter.simclusters_v2.thriftscala.SimClustersInferredEntities -import com.twitter.simclusters_v2.thriftscala.SimClustersSource -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import com.twitter.onboarding.relevance.source.UttAccountRecommendationsScalaDataset -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.wtf.entity_real_graph.scalding.common.SemanticCoreFilters.getValidSemanticCoreEntities -import com.twitter.wtf.entity_real_graph.scalding.common.DataSources - -/** - * Infer interested-in entities for a given user. Depending on how and where the entity source comes - * from, this can be achieved in a number of ways. For example, we can use user->interested-in clusters - * and cluster-> semanticcore entity embeddings to derive user->entity. Or, we can use a producer's - * UTT embeddings and user-user engagement graph to aggregate UTT engagement history. 
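A compact, in-memory illustration of the first approach described above (user->cluster scores joined with cluster->entity embeddings, keeping the max score when the same entity is reachable from several clusters). The production object below does the equivalent with scalding TypedPipes; the names in this sketch are hypothetical:

```scala
// userToClusters: userId -> (clusterId -> interestedIn score)
// clusterToEntities: clusterId -> (entityId, entity score) pairs
def inferEntitiesSketch(
  userToClusters: Map[Long, Map[Int, Double]],
  clusterToEntities: Map[Int, Seq[(Long, Double)]]
): Map[Long, Map[Long, Double]] =
  userToClusters.map {
    case (userId, clusters) =>
      val entities = for {
        clusterId <- clusters.keys.toSeq
        (entityId, entityScore) <- clusterToEntities.getOrElse(clusterId, Nil)
      } yield (entityId, entityScore)
      // dedup: keep the max score per entity across clusters
      userId -> entities.groupBy(_._1).map { case (e, xs) => e -> xs.map(_._2).max }
  }
```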
- */ -object InferredEntitiesFromInterestedIn { - - def getUserToKnownForUttEntities( - dateRange: DateRange, - maxUttEntitiesPerUser: Int - )( - implicit timeZone: TimeZone - ): TypedPipe[(UserId, Seq[(Long, Double)])] = { - - val validEntities = getValidSemanticCoreEntities( - DataSources.semanticCoreMetadataSource(dateRange, timeZone)).distinct.map { entityId => - Set(entityId) - }.sum - - DAL - .readMostRecentSnapshot(UttAccountRecommendationsScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .flatMapWithValue(validEntities) { - // Keep only valid Entities - case (KeyVal(interest, candidates), Some(validUTTEntities)) - if validUTTEntities.contains(interest.uttID) => - candidates.recommendations.map { rec => - (rec.candidateUserID, (interest.uttID, rec.score.getOrElse(0.0))) - } - case _ => None - } - .group - .sortedReverseTake(maxUttEntitiesPerUser)(Ordering.by(_._2)) - .toTypedPipe - } - - def filterUTTEntities( - interestedInEntities: TypedPipe[(UserId, Seq[(UTTEntityId, Int)])], - minSocialProofThreshold: Int, - maxInterestsPerUser: Int - ): TypedPipe[(UserId, Seq[UTTEntityId])] = { - - interestedInEntities - .map { - case (userId, entities) => - val topEntities = entities - .filter(_._2 >= minSocialProofThreshold) - .sortBy(-_._2) - .take(maxInterestsPerUser) - .map(_._1) - - (userId, topEntities) - } - .filter(_._2.nonEmpty) - } - - def getUserToUTTEntities( - userUserGraph: TypedPipe[UserAndNeighbors], - knownForEntities: TypedPipe[(UserId, Seq[UTTEntityId])] - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, Seq[(UTTEntityId, Int)])] = { - val flatEngagementGraph = - userUserGraph - .count("num_user_user_graph_records") - .flatMap { userAndNeighbors => - userAndNeighbors.neighbors.flatMap { neighbor => - val producerId = neighbor.neighborId - val hasFav = neighbor.favScoreHalfLife100Days.exists(_ > 0) - val hasFollow = neighbor.isFollowed.contains(true) - - if (hasFav || hasFollow) { - Some((producerId, userAndNeighbors.userId)) - } else { - None - } - } - } - .count("num_flat_user_user_graph_edges") - - flatEngagementGraph - .join(knownForEntities.count("num_producer_to_entities")) - .withReducers(3000) - .flatMap { - case (producerId, (userId, entities)) => - entities.map { entityId => ((userId, entityId), 1) } - } - .count("num_flat_user_to_entity") - .sumByKey - .withReducers(2999) - .toTypedPipe - .count("num_user_with_entities") - .collect { - case ((userId, uttEntityId), numEngagements) => - (userId, Seq((uttEntityId, numEngagements))) - } - .sumByKey - } - - /** - * Infer entities using user-interestedIn clusters and entity embeddings for those clusters, - * based on a threshold - */ - def getInterestedInFromEntityEmbeddings( - userToInterestedIn: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - clusterToEntities: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])], - inferredFromCluster: Option[SimClustersSource], - inferredFromEntity: Option[EntitySource] - )( - implicit uniqueId: UniqueID - ): TypedPipe[(UserId, Seq[InferredEntity])] = { - val clusterToUsers = userToInterestedIn - .flatMap { - case (userId, clusters) => - clusters.clusterIdToScores.map { - case (clusterId, score) => - (clusterId, (userId, score)) - } - } - .count("num_flat_user_to_interested_in_cluster") - - clusterToUsers - .join(clusterToEntities) - .withReducers(3000) - .map { - case (clusterId, ((userId, interestedInScore), entitiesWithScores)) => - (userId, entitiesWithScores) - } - .flatMap { - case (userId, entitiesWithScore) => - // 
Dedup by entityIds in case user is associated with an entity from different clusters - entitiesWithScore.map { entity => (userId, Map(entity.entityId -> Max(entity.score))) } - } - .sumByKey - .map { - case (userId, entitiesWithMaxScore) => - val inferredEntities = entitiesWithMaxScore.map { entityWithScore => - InferredEntity( - entityId = entityWithScore._1, - score = entityWithScore._2.get, - simclusterSource = inferredFromCluster, - entitySource = inferredFromEntity - ) - }.toSeq - (userId, inferredEntities) - } - .count("num_user_with_inferred_entities") - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron inferred_entities_from_interested_in \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InferredInterestedInSemanticCoreEntitiesBatchApp extends ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2023-01-01") - - override def batchIncrement: Duration = Days(1) - - private val outputPath = InferredEntities.MHRootPath + "/interested_in" - - private val outputPathKeyedByCluster = - InferredEntities.MHRootPath + "/interested_in_keyed_by_cluster" - - import InferredEntitiesFromInterestedIn._ - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - Execution.unit - - val clusterToEntities = InferredEntities - .getLegibleEntityEmbeddings(dateRange, timeZone) - .count("num_legible_cluster_to_entities") - .forceToDisk - - // inferred interests. Only support 2020 model version - val userToClusters2020 = - InterestedInSources.simClustersInterestedIn2020Source(dateRange, timeZone) - - val inferredEntities2020 = getInterestedInFromEntityEmbeddings( - userToInterestedIn = userToClusters2020, - clusterToEntities = clusterToEntities, - inferredFromCluster = Some(InferredEntities.InterestedIn2020), - inferredFromEntity = Some(EntitySource.SimClusters20M145K2020EntityEmbeddingsByFavScore) - )(uniqueID) - .count("num_user_with_inferred_entities_2020") - - val combinedInferredInterests = - InferredEntities.combineResults(inferredEntities2020) - - // output cluster -> entity mapping - val clusterToEntityExec = clusterToEntities - .map { - case (clusterId, entities) => - val inferredEntities = SimClustersInferredEntities( - entities.map(entity => InferredEntity(entity.entityId, entity.score)) - ) - KeyVal(clusterId, inferredEntities) - } - .writeDALVersionedKeyValExecution( - SimclustersInferredEntitiesFromInterestedInKeyedByClusterScalaDataset, - D.Suffix(outputPathKeyedByCluster) - ) - - // output user -> entity mapping - val userToEntityExec = combinedInferredInterests - .map { case (userId, entities) => KeyVal(userId, entities) } - .writeDALVersionedKeyValExecution( - SimclustersInferredEntitiesFromInterestedInScalaDataset, - D.Suffix(outputPath) - ) - - Execution.zip(clusterToEntityExec, userToEntityExec).unit - } -} - -/** -Adhob debugging job. 
Uses Entity Embeddings dataset to infer user interests - -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/ &&\ -scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.inferred_entities.InferredInterestedInSemanticCoreEntitiesAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/inferred_entities:inferred_entities_from_interested_in-adhoc \ - --user recos-platform \ - -- --date 2019-11-11 --email your_ldap@twitter.com - */ -object InferredInterestedInSemanticCoreEntitiesAdhocApp extends AdhocExecutionApp { - import InferredEntitiesFromInterestedIn._ - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val interestedIn = InterestedInSources.simClustersInterestedIn2020Source(dateRange, timeZone) - - val clusterToEntities = InferredEntities - .getLegibleEntityEmbeddings(dateRange, timeZone) - .count("num_legible_cluster_to_entities") - - // Debugging InterestedIn -> EntityEmbeddings approach - val interestedInFromEntityEmbeddings = getInterestedInFromEntityEmbeddings( - interestedIn, - clusterToEntities, - None, - None - )(uniqueID) - - val distribution = Util - .printSummaryOfNumericColumn( - interestedInFromEntityEmbeddings.map { case (k, v) => v.size }, - Some("# of interestedIn entities per user") - ).map { results => - Util.sendEmail(results, "# of interestedIn entities per user", args.getOrElse("email", "")) - } - - Execution - .zip( - distribution, - interestedInFromEntityEmbeddings - .writeExecution( - TypedTsv("/user/recos-platform/adhoc/debug/interested_in_from_entity_embeddings")) - ).unit - } -} - -/** - Adhoc debugging job. Runs through the UTT interest inference, analyzes the size & distribution of - interests per user. 
- -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/ &&\ -scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.inferred_entities.InferredUTTEntitiesFromInterestedInAdhocApp \ - --target src/scala/com/twitter/simclusters_v2/scalding/inferred_entities:inferred_entities_from_interested_in-adhoc \ - --user recos-platform \ - -- --date 2019-11-03 --email your_ldap@twitter.com - */ -object InferredUTTEntitiesFromInterestedInAdhocApp extends AdhocExecutionApp { - import InferredEntitiesFromInterestedIn._ - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val employeeGraphPath = "/user/recos-platform/adhoc/employee_graph_from_user_user/" - val employeeGraph = TypedPipe.from(UserAndNeighborsFixedPathSource(employeeGraphPath)) - - val maxKnownForUttsPerProducer = 100 - val minSocialProofThreshold = 10 - val maxInferredInterestsPerUser = 500 - - // KnownFor UTT entities - val userToUttEntities = getUserToKnownForUttEntities( - dateRange.embiggen(Days(7)), - maxKnownForUttsPerProducer - ).map { case (userId, entities) => (userId, entities.map(_._1)) } - - val userToInterestsEngagementCounts = getUserToUTTEntities(employeeGraph, userToUttEntities) - - val topInterests = filterUTTEntities( - userToInterestsEngagementCounts, - minSocialProofThreshold, - maxInferredInterestsPerUser - ).count("num_users_with_inferred_interests") - - // Debugging UTT entities - val analysis = Util - .printSummaryOfNumericColumn( - topInterests.map { case (k, v) => v.size }, - Some( - "# of UTT entities per user, maxKnownForUtt=100, minSocialProof=10, maxInferredPerUser=500") - ).map { results => - Util.sendEmail(results, "# of UTT entities per user", args.getOrElse("email", "")) - } - - val outputPath = "/user/recos-platform/adhoc/inferred_utt_interests" - - Execution - .zip( - topInterests.writeExecution(TypedTsv(outputPath)), - analysis - ).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredSemanticCoreEntitiesFromKnownFor.scala b/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredSemanticCoreEntitiesFromKnownFor.scala deleted file mode 100644 index 2dfbd5f4b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/InferredSemanticCoreEntitiesFromKnownFor.scala +++ /dev/null @@ -1,244 +0,0 @@ -package com.twitter.simclusters_v2.scalding.inferred_entities - -import com.twitter.escherbird.metadata.thriftscala.FullMetadata -import com.twitter.scalding._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.wtf.entity_real_graph.scalding.common.{DataSources => ERGDataSources} -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Infer Known-For entities based on users' different variations of SimClusters Known-Fors. 
- * The basic idea is to look at the Known-For datasets (User, Cluster) and the entity embeddings - * (Cluster, Entities) to derive the (User, Entities). - */ -object InferredSemanticCoreEntitiesFromKnownFor { - - /** - * Given a (user, cluster) and (cluster, entity) mappings, generate (user, entity) mappings - */ - def getUserToEntities( - userToClusters: TypedPipe[(UserId, Seq[SimClusterWithScore])], - clusterToEntities: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])], - inferredFromCluster: Option[SimClustersSource], - inferredFromEntity: Option[EntitySource], - minEntityScore: Double - ): TypedPipe[(UserId, Seq[InferredEntity])] = { - - val validClusterToEntities = clusterToEntities.flatMap { - case (clusterId, entities) => - entities.collect { - case entity if entity.score >= minEntityScore => - (clusterId, (entity.entityId, entity.score)) - } - } - - userToClusters - .flatMap { - case (userId, clusters) => - clusters.map { cluster => (cluster.clusterId, userId) } - } - .join(validClusterToEntities) - .map { - case (clusterId, (userId, (entityId, score))) => - ((userId, entityId), score) - } - // If a user is known for the same entity through multiple cluster-entity mappings, sum the scores - .sumByKey - .map { - case ((userId, entityId), score) => - (userId, Seq(InferredEntity(entityId, score, inferredFromCluster, inferredFromEntity))) - } - .sumByKey - } - -} - -/** -capesospy-v2 update --build_locally --start_cron \ - inferred_entities_from_known_for \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InferredKnownForSemanticCoreEntitiesBatchApp extends ScheduledExecutionApp { - - import InferredSemanticCoreEntitiesFromKnownFor._ - - override def firstTime: RichDate = RichDate("2023-01-23") - - override def batchIncrement: Duration = Days(1) - - private val outputPath = InferredEntities.MHRootPath + "/known_for" - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val clusterToEntities = EntityEmbeddingsSources - .getReverseIndexedSemanticCoreEntityEmbeddingsSource( - EmbeddingType.FavBasedSematicCoreEntity, - ModelVersions.Model20M145K2020, - dateRange.embiggen(Days(7)) // read 7 days before & after to give buffer - ) - .forceToDisk - - val userToEntities2020 = getUserToEntities( - ProdSources.getUpdatedKnownFor, - clusterToEntities, - Some(InferredEntities.KnownFor2020), - Some(EntitySource.SimClusters20M145K2020EntityEmbeddingsByFavScore), - InferredEntities.MinLegibleEntityScore - ) - - val userToEntities = InferredEntities.combineResults(userToEntities2020) - - userToEntities - .map { case (userId, entities) => KeyVal(userId, entities) } - .writeDALVersionedKeyValExecution( - SimclustersInferredEntitiesFromKnownForScalaDataset, - D.Suffix(outputPath) - ) - } -} - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/inferred_entities:inferred_entities_from_known_for-adhoc && \ - oscar hdfs --user recos-platform --screen --tee your_ldap-logs/ \ - --bundle inferred_entities_from_known_for-adhoc \ - --tool com.twitter.simclusters_v2.scalding.inferred_entities.InferredSemanticCoreEntitiesFromKnownForAdhocApp \ - -- --date 2019-11-02 --email your_ldap@twitter.com - */ -object InferredSemanticCoreEntitiesFromKnownForAdhocApp extends AdhocExecutionApp { - - private def readEntityEmbeddingsFromPath( - path: String - ): TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] = { - TypedPipe - 
.from(AdhocKeyValSources.clusterToEntitiesSource(path)) - .map { - case (embeddingId, embedding) => - embeddingId.internalId match { - case InternalId.ClusterId(clusterId) => - val semanticCoreEntities = embedding.embedding.map { - case InternalIdWithScore(InternalId.EntityId(entityId), score) => - SemanticCoreEntityWithScore(entityId, score) - case _ => - throw new IllegalArgumentException( - "The value to the entity embeddings dataset isn't entityId" - ) - } - (clusterId, semanticCoreEntities) - case _ => - throw new IllegalArgumentException( - "The key to the entity embeddings dataset isn't clusterId" - ) - } - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import InferredSemanticCoreEntitiesFromKnownFor._ - - val entityIdToString: TypedPipe[(Long, String)] = - ERGDataSources.semanticCoreMetadataSource - .collect { - case FullMetadata(domainId, entityId, Some(basicMetadata), _, _, _) - if domainId == 131L && !basicMetadata.indexableFields.exists( - _.tags.exists(_.contains("utt:sensitive_interest"))) => - entityId -> basicMetadata.name - }.distinctBy(_._1) - - val clusterToEntitiesUpdated = EntityEmbeddingsSources - .getReverseIndexedSemanticCoreEntityEmbeddingsSource( - EmbeddingType.FavBasedSematicCoreEntity, - ModelVersions.Model20M145KUpdated, - dateRange.embiggen(Days(4)) // read 4 days before & after to give buffer - ) - .forceToDisk - - // Inferred entities based on Updated version's entity embeddings - val dec11UserToUpdatedEntities = getUserToEntities( - ProdSources.getDec11KnownFor, - clusterToEntitiesUpdated, - Some(InferredEntities.Dec11KnownFor), - Some(EntitySource.SimClusters20M145KUpdatedEntityEmbeddingsByFavScore), - InferredEntities.MinLegibleEntityScore - ) - - val updatedUserToUpdatedEntities = getUserToEntities( - ProdSources.getUpdatedKnownFor, - clusterToEntitiesUpdated, - Some(InferredEntities.UpdatedKnownFor), - Some(EntitySource.SimClusters20M145KUpdatedEntityEmbeddingsByFavScore), - InferredEntities.MinLegibleEntityScore - ) - - // Updated entities data - val entitiesPipe = ( - dec11UserToUpdatedEntities ++ updatedUserToUpdatedEntities - ).sumByKey - - val userToEntitiesWithString = entitiesPipe - .flatMap { - case (userId, entities) => - entities.map { entity => (entity.entityId, (userId, entity)) } - } - .hashJoin(entityIdToString) - .map { - case (entityId, ((userId, inferredEntity), entityStr)) => - (userId, Seq((entityStr, inferredEntity))) - } - .sumByKey - - val outputPath = "/user/recos-platform/adhoc/known_for_inferred_entities_updated" - - val scoreDistribution = Util - .printSummaryOfNumericColumn( - entitiesPipe.flatMap { case (k, v) => v.map(_.score) }, - Some("Distributions of scores, Updated version") - ).map { results => - Util.sendEmail( - results, - "Distributions of scores, Updated version", - args.getOrElse("email", "") - ) - } - - val coverageDistribution = Util - .printSummaryOfNumericColumn( - entitiesPipe.map { case (k, v) => v.size }, - Some("# of knownFor entities per user, Updated version") - ).map { results => - Util.sendEmail( - results, - "# of knownFor entities per user, Updated version", - args.getOrElse("email", "") - ) - } - - Execution - .zip( - userToEntitiesWithString.writeExecution(TypedTsv(outputPath)), - scoreDistribution, - coverageDistribution - ).unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/ProdSources.scala 
b/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/ProdSources.scala deleted file mode 100644 index a1dc71ac8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/inferred_entities/ProdSources.scala +++ /dev/null @@ -1,94 +0,0 @@ -package com.twitter.simclusters_v2.scalding.inferred_entities - -import com.twitter.scalding.{DateRange, Days, TypedPipe} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, ProcAtla} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{ModelVersions, SemanticCoreEntityId, UserId} -import com.twitter.simclusters_v2.hdfs_sources.{ - SimclustersInferredEntitiesFromKnownForScalaDataset, - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - SimclustersV2InterestedInScalaDataset, - SimclustersV2KnownFor20M145KDec11ScalaDataset, - SimclustersV2KnownFor20M145KUpdatedScalaDataset, - UserUserNormalizedGraphScalaDataset -} -import com.twitter.simclusters_v2.scalding.KnownForSources -import com.twitter.simclusters_v2.thriftscala.{ - EntitySource, - SimClusterWithScore, - SimClustersSource, - TopSimClustersWithScore, - UserAndNeighbors -} -import java.util.TimeZone - -/** - * Convenience functions to read data from prod. - */ -object ProdSources { - - // Returns the Dec11 KnownFor from production - def getDec11KnownFor(implicit tz: TimeZone): TypedPipe[(UserId, Seq[SimClusterWithScore])] = - KnownForSources - .readDALDataset( - SimclustersV2KnownFor20M145KDec11ScalaDataset, - Days(30), - ModelVersions.Model20M145KDec11) - .map { - case (userId, clustersArray) => - val clusters = clustersArray.map { - case (clusterId, score) => SimClusterWithScore(clusterId, score) - }.toSeq - (userId, clusters) - } - - // Returns the Updated KnownFor from production - def getUpdatedKnownFor(implicit tz: TimeZone): TypedPipe[(UserId, Seq[SimClusterWithScore])] = - KnownForSources - .readDALDataset( - SimclustersV2KnownFor20M145KUpdatedScalaDataset, - Days(30), - ModelVersions.Model20M145KUpdated - ) - .map { - case (userId, clustersArray) => - val clusters = clustersArray.map { - case (clusterId, score) => SimClusterWithScore(clusterId, score) - }.toSeq - (userId, clusters) - } - - def getInferredEntitiesFromKnownFor( - inferredFromCluster: SimClustersSource, - inferredFromEntity: EntitySource, - dateRange: DateRange - ): TypedPipe[(UserId, Seq[(SemanticCoreEntityId, Double)])] = { - DAL - .readMostRecentSnapshot(SimclustersInferredEntitiesFromKnownForScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(userId, entities) => - val validEntities = - entities.entities - .collect { - case entity - if entity.entitySource.contains(inferredFromEntity) && - entity.simclusterSource.contains(inferredFromCluster) => - (entity.entityId, entity.score) - } - .groupBy(_._1) - .map { case (entityId, scores) => (entityId, scores.map(_._2).max) } - .toSeq - (userId, validEntities) - } - } - - def getUserUserEngagementGraph(dateRange: DateRange): TypedPipe[UserAndNeighbors] = { - DAL - .readMostRecentSnapshot(UserUserNormalizedGraphScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/AllFeatures.scala b/src/scala/com/twitter/simclusters_v2/scalding/mbcg/AllFeatures.scala deleted file mode 100644 index 902c981bd..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/AllFeatures.scala +++ /dev/null @@ -1,58 +0,0 @@ -package com.twitter.simclusters_v2.scalding.mbcg - -import com.google.common.collect.ImmutableSet -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.DataType -import com.twitter.ml.api.Feature -import com.twitter.ml.api.Feature.SparseContinuous -import com.twitter.ml.api.Feature.Tensor -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.api.constant.SharedFeatures -import java.util.{Map => JMap} - -/* -Features used for model-based candidate generation - */ -object TweetAllFeatures { - val tweetId = SharedFeatures.TWEET_ID - val tweetSimclusters = - new SparseContinuous( - "tweet.simcluster.log_fav_based_embedding.20m_145k_2020", - ImmutableSet.of(InferredInterests)) - .asInstanceOf[Feature[JMap[String, Double]]] - val authorF2vProducerEmbedding = - new Tensor( - "tweet.author_follow2vec.producer_embedding_200", - DataType.FLOAT - ) - - private val allFeatures: Seq[Feature[_]] = Seq( - tweetId, - tweetSimclusters, - authorF2vProducerEmbedding - ) - - val featureContext = new FeatureContext(allFeatures: _*) -} - -object UserAllFeatures { - val userId = SharedFeatures.USER_ID - val userSimclusters = - new SparseContinuous( - "user.iiape.log_fav_based_embedding.20m_145k_2020", - ImmutableSet.of(InferredInterests)) - .asInstanceOf[Feature[JMap[String, Double]]] - val userF2vConsumerEmbedding = - new Tensor( - "user.follow2vec.consumer_avg_fol_emb_200", - DataType.FLOAT - ) - - private val allFeatures: Seq[Feature[_]] = Seq( - userId, - userSimclusters, - userF2vConsumerEmbedding - ) - - val featureContext = new FeatureContext(allFeatures: _*) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/mbcg/BUILD.bazel deleted file mode 100644 index 469a917be..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/BUILD.bazel +++ /dev/null @@ -1,314 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/scalding:args", - "3rdparty/src/jvm/com/twitter/scalding:commons", - "3rdparty/src/jvm/com/twitter/scalding:core", - "3rdparty/src/jvm/com/twitter/scalding:date", - "3rdparty/src/jvm/com/twitter/scalding:db", - "3rdparty/src/jvm/com/twitter/scalding:parquet", - "ann/src/main/scala/com/twitter/ann/hnsw", - "ann/src/main/scala/com/twitter/ann/scalding/offline", - "ann/src/main/scala/com/twitter/ann/util", - "geoduck/hadoop/scalding/datasets:userlocation-scala", - "iesource/common/src/main/scala/com/twitter/iesource/common/util", - "iesource/processing/events/src/main/scala/com/twitter/iesource/processing/events/batch" + - ":server_engagements-scala", - "iesource/thrift", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/ml/api/util", - "src/scala/com/twitter/ml/featurestore/catalog/entities/core", - "src/scala/com/twitter/ml/featurestore/catalog/features/geo", - "src/scala/com/twitter/ml/featurestore/lib/batch", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/dalv2/dataset", - "src/scala/com/twitter/scalding_internal/db", - 
"src/scala/com/twitter/scalding_internal/db/jdbc", - "src/scala/com/twitter/scalding_internal/error_handling", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/multiformat", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/scalding_internal/typed", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/ml/api:interpretable-model-java", - "tweetsource/public_tweets/src/main/scala/com/twitter/tweetsource/public_tweets:public_tweets-scala", - "twml/runtime/src/main/scala/com/twitter/twml/runtime/scalding", - "util/util-core:scala", - "util/util-stats/src/main/scala", - ], -) - -scalding_job( - name = "tweet-embedding-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scalding.mbcg.TweetEmbeddingGenerationAdhocJob", - args = [ - "--dateRange 2021-10-30T00 2021-10-30T01", - "--model_name model", - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_model_1104/1635973177/tweet_tower_with_signature", - "--concurrency_level 60", - "--embedding_dimension 128", - "--expected_elements 30000000", - "--max_M 20", - "--ef_construction 200", - "--tweet_embedding_name output", - "--ann_output_path hdfs:///atla/proc/user/cassowary/explore_mbcg/ann_index/test_11_04_adhoc", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("submitter.tier", "preemptible"), - ("hadoop.map.jvm.total-memory", "6144m"), - ], - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "tweet-embedding-generation-batch-job", - main = "com.twitter.simclusters_v2.scalding.mbcg.TweetEmbeddingGenerationBatchJob", - args = [ - "--model_name model", - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_0119_1day_0110_3l_5e_f2v_gpu_resave/tweet_tower_with_signature", - "--concurrency_level 60", - "--embedding_dimension 128", - "--expected_elements 5000000", - "--max_M 40", - "--ef_construction 800", - "--tweet_embedding_name output", - "--f2v_input.feature_store_embedding Follow2VecProducerEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--minFavCount 32", - "--ann_output_path hdfs:///atla/proc/user/cassowary/explore_mbcg/ann_index/0125_batch_index_f2v_minfav", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("hadoop.map.jvm.total-memory", "6144m"), - ("hadoop.submitter.disk", "100g"), - ], - cron = "*/5 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "tweet-embedding-generation-batch-job-alternate", - main = "com.twitter.simclusters_v2.scalding.mbcg.TweetEmbeddingGenerationBatchJobAlternate", - args = [ - "--model_name model", - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_331_329_1e_128em_b128_hn10_all_gpu/tweet_tower_with_signature", - "--concurrency_level 60", - "--embedding_dimension 128", - "--expected_elements 5000000", - "--max_M 40", - 
"--ef_construction 800", - "--tweet_embedding_name output", - "--f2v_input.feature_store_embedding Follow2VecProducerEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--minFavCount 100", - "--indexAllTweets", - "--ann_output_path hdfs:///atla/proc/user/cassowary/explore_mbcg/ann_index/0401_batch_index_f2v_cosine_all_tweets", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("hadoop.map.jvm.total-memory", "6144m"), - ("hadoop.submitter.disk", "100g"), - ], - contact = "no-reply@twitter.com", - cron = "*/5 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "tweet-embedding-generation-batch-job-experimental", - main = "com.twitter.simclusters_v2.scalding.mbcg.TweetEmbeddingGenerationBatchJobExperimental", - args = [ - "--model_name model", - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_0127_1day_0110_3l_10e_128e_normf2v_nocosine_gpu/tweet_tower_with_signature", - "--concurrency_level 60", - "--embedding_dimension 128", - "--expected_elements 5000000", - "--max_M 40", - "--ef_construction 800", - "--tweet_embedding_name output", - "--f2v_input.feature_store_embedding Follow2VecProducerEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--minFavCount 32", - "--ann_output_path hdfs:///atla/proc/user/cassowary/explore_mbcg/ann_index/0128_f2v_1week_batch_index", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("hadoop.map.jvm.total-memory", "6144m"), - ("hadoop.submitter.disk", "100g"), - ], - contact = "no-reply@twitter.com", - cron = "*/5 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "user-embedding-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scalding.mbcg.UserEmbeddingGenerationAdhocJob", - args = [ - "--dateRange 2021-12-01T00 2021-12-01T01", - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_1202_logs_100m_b64_hn10_1127_video_persistent/user_tower_with_signature", - "--embedding_dimension 128", - "--user_embedding_name output", - "--kvs_output_path /user/cassowary/explore_mbcg/user_kvs_store/1207_adhoc_model_store", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("submitter.tier", "preemptible"), - ("hadoop.map.jvm.total-memory", "6144m"), - ], - contact = "no-reply@twitter.com", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - "known-to-fail-jira:SD-20253", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "user-embedding-generation-batch-job", - main = "com.twitter.simclusters_v2.scalding.mbcg.UserEmbeddingGenerationBatchJob", - args = [ - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_0119_1day_0110_3l_5e_f2v_gpu_resave/user_tower_with_signature", - "--embedding_dimension 128", - "--user_embedding_name output", - "--f2v_input.feature_store_embedding FollowBasedConsumerFollow2VecAvgEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--kvs_output_path 
/user/cassowary/explore_mbcg/user_kvs_store/0125_refreshed_model_store_f2v", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("submitter.tier", "preemptible"), - ("hadoop.map.jvm.total-memory", "6144m"), - ], - contact = "no-reply@twitter.com", - cron = "*/30 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "user-embedding-generation-batch-job-alternate", - main = "com.twitter.simclusters_v2.scalding.mbcg.UserEmbeddingGenerationBatchJobAlternate", - args = [ - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_331_329_1e_128em_b128_hn10_all_gpu/user_tower_with_signature", - "--embedding_dimension 128", - "--user_embedding_name output", - "--f2v_input.feature_store_embedding FollowBasedConsumerFollow2VecAvgEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--kvs_output_path /user/cassowary/explore_mbcg/user_kvs_store/0401_refreshed_model_store_all", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("submitter.tier", "preemptible"), - ("hadoop.map.jvm.total-memory", "6144m"), - ], - contact = "no-reply@twitter.com", - cron = "*/30 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) - -scalding_job( - name = "user-embedding-generation-batch-job-experimental", - main = "com.twitter.simclusters_v2.scalding.mbcg.UserEmbeddingGenerationBatchJobExperimental", - args = [ - "--model_path hdfs:///atla/proc/user/cassowary/explore_mbcg/models/tfx_0127_1day_0110_3l_10e_128e_normf2v_nocosine_gpu/user_tower_with_signature", - "--embedding_dimension 128", - "--user_embedding_name output", - "--f2v_input.feature_store_embedding FollowBasedConsumerFollow2VecAvgEmbedding200Dataset", - "--f2v_input.feature_store_major_version 20210708", - "--kvs_output_path /user/cassowary/explore_mbcg/user_kvs_store/0328_f2v_cosine_all_tweets_model_store", - ], - config = [ - ("hadoop.submitter.cpu", 60), - ("hadoop.submitter.jvm.total-memory", "256g"), - ("submitter.tier", "preemptible"), - ("hadoop.map.jvm.total-memory", "6144m"), - ], - contact = "no-reply@twitter.com", - cron = "*/30 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":mbcg"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/RecordAdapters.scala b/src/scala/com/twitter/simclusters_v2/scalding/mbcg/RecordAdapters.scala deleted file mode 100644 index e972a24ae..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/RecordAdapters.scala +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.simclusters_v2.scalding.mbcg - -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.embedding.Embedding -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.api.FloatTensor -import com.twitter.ml.api.GeneralTensor -import com.twitter.ml.api.IRecordOneToOneAdapter -import com.twitter.ml.api.util.FDsl._ -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import 
scala.collection.JavaConverters._ - -/* -Adapters to convert data from MBCG input sources into DataRecords - */ -object TweetSimclusterRecordAdapter - extends IRecordOneToOneAdapter[(Long, PersistentSimClustersEmbedding, Embedding[Float])] { - override def getFeatureContext: FeatureContext = TweetAllFeatures.featureContext - - override def adaptToDataRecord( - tweetFeatures: (Long, PersistentSimClustersEmbedding, Embedding[Float]) - ) = { - val dataRecord = new DataRecord() - val tweetId = tweetFeatures._1 - val tweetEmbedding = tweetFeatures._2 - val f2vEmbedding = tweetFeatures._3 - val simclusterWithScores = tweetEmbedding.embedding.embedding - .map { simclusterWithScore => - // Cluster ID and score for that cluster - (simclusterWithScore._1.toString, simclusterWithScore._2) - }.toMap.asJava - - dataRecord.setFeatureValue(TweetAllFeatures.tweetId, tweetId) - dataRecord.setFeatureValue(TweetAllFeatures.tweetSimclusters, simclusterWithScores) - dataRecord.setFeatureValue( - TweetAllFeatures.authorF2vProducerEmbedding, - GeneralTensor.floatTensor( - new FloatTensor(f2vEmbedding.map(Double.box(_)).asJava) - ) - ) - - dataRecord - } -} - -object UserSimclusterRecordAdapter - extends IRecordOneToOneAdapter[(Long, ClustersUserIsInterestedIn, Embedding[Float])] { - override def getFeatureContext: FeatureContext = TweetAllFeatures.featureContext - - override def adaptToDataRecord( - userSimclusterEmbedding: (Long, ClustersUserIsInterestedIn, Embedding[Float]) - ) = { - val dataRecord = new DataRecord() - val userId = userSimclusterEmbedding._1 - val userEmbedding = userSimclusterEmbedding._2 - val simclusterWithScores = userEmbedding.clusterIdToScores - .filter { - case (_, score) => - score.logFavScore.map(_ >= 0.0).getOrElse(false) - } - .map { - case (clusterId, score) => - (clusterId.toString, score.logFavScore.get) - }.toMap.asJava - val f2vEmbedding = userSimclusterEmbedding._3 - - dataRecord.setFeatureValue(UserAllFeatures.userId, userId) - dataRecord.setFeatureValue(UserAllFeatures.userSimclusters, simclusterWithScores) - dataRecord.setFeatureValue( - UserAllFeatures.userF2vConsumerEmbedding, - GeneralTensor.floatTensor( - new FloatTensor(f2vEmbedding.map(Double.box(_)).asJava) - ) - ) - - dataRecord - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/TweetEmbeddingGenerationJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/mbcg/TweetEmbeddingGenerationJob.scala deleted file mode 100644 index 717e07493..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/TweetEmbeddingGenerationJob.scala +++ /dev/null @@ -1,384 +0,0 @@ -package com.twitter.simclusters_v2.scalding.mbcg - -import com.twitter.ann.common.EntityEmbedding -import com.twitter.ann.common.Cosine -import com.twitter.ann.common.CosineDistance -import com.twitter.ann.common.InnerProduct -import com.twitter.ann.common.InnerProductDistance -import com.twitter.ann.common.ReadWriteFuturePool -import com.twitter.ann.hnsw.TypedHnswIndex -import com.twitter.ann.util.IndexBuilderUtils -import com.twitter.conversions.DurationOps._ -import com.twitter.cortex.deepbird.runtime.prediction_engine.TensorflowPredictionEngineConfig -import com.twitter.cortex.ml.embeddings.common.TweetKind -import com.twitter.cortex.ml.embeddings.common.UserKind -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.stats.NullStatsReceiver -import com.twitter.iesource.common.util.InteractionEventUtils -import com.twitter.iesource.processing.events.batch.ServerEngagementsScalaDataset 
-import com.twitter.iesource.thriftscala.InteractionDetails -import com.twitter.ml.api.embedding.Embedding -import com.twitter.ml.api.FeatureUtil -import com.twitter.ml.api.constant.SharedFeatures -import com.twitter.ml.api.embedding.EmbeddingSerDe -import com.twitter.ml.api.thriftscala -import com.twitter.ml.api.thriftscala.{GeneralTensor => ThriftGeneralTensor} -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.util.ScalaToJavaDataRecordConversions -import com.twitter.ml.featurestore.lib.TweetId -import com.twitter.ml.featurestore.lib.embedding.EmbeddingWithEntity -import com.twitter.scalding.Args -import com.twitter.scalding.DateParser -import com.twitter.scalding.DateRange -import com.twitter.scalding.Execution -import com.twitter.scalding.UniqueID -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.scalding_internal.job.FutureHelper -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.BatchWidth -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.search.common.file.FileUtils -import com.twitter.simclusters_v2.scalding.common.LogFavBasedPersistentTweetEmbeddingMhExportSource -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.tweetsource.common.thriftscala.MediaType -import com.twitter.tweetsource.public_tweets.PublicTweetsScalaDataset -import com.twitter.tweetsource.public_tweets.thriftscala.PublicTweet -import com.twitter.twml.runtime.scalding.TensorflowBatchPredictor -import com.twitter.twml.runtime.scalding.TensorflowBatchPredictor.ScaldingThreadingConfig -import com.twitter.util.FuturePool -import com.twitter.util.logging.Logger -import java.util.TimeZone -import java.util.concurrent.Executors - -/* -This class does the following: -1) Get tweet simcluster features from LogFavBasedPersistentTweetEmbeddingMhExportSource -2) Filter them down to English media tweets that aren't replies or quote tweets using TweetSource -3) Convert the remaining tweets into DataRecords using TweetSimclusterRecordAdapter -4) Run inference using a TF model exported with a DataRecord compatible serving signature -5) Create an ANN index from the generated tweet embeddings - */ -trait TweetEmbeddingGenerationTrait { - implicit val tz: TimeZone = DateOps.UTC - implicit val dp: DateParser = DateParser.default - implicit val updateHours = 4 - - private val inputNodeName = "request:0" - private val outputNodeName = "response:0" - private val functionSignatureName = "serve" - private val predictionRequestTimeout = 5.seconds - private val SupportedLanguages = Set("en") - private val tweetSourceLookback = Days(2) - - private val DEFAULT_F2V_VECTOR: Embedding[Float] = Embedding(Array.fill[Float](200)(0.0f)) - - def getPredictionEngine(modelName: String, modelPath: String): TensorflowBatchPredictor = { - val config = TensorflowPredictionEngineConfig( - modelName = modelName, - modelSource = modelPath, - threadingConfig = 
Some(ScaldingThreadingConfig), - defaultInputNode = inputNodeName, - defaultOutputNode = outputNodeName, - functionSignatureName = functionSignatureName, - statsReceiver = NullStatsReceiver - ) - TensorflowBatchPredictor(config, predictionRequestTimeout) - } - - def getEmbeddingWithEntity(tweetEmbeddingTensor: ThriftGeneralTensor, tweetId: Long) = { - tweetEmbeddingTensor match { - case ThriftGeneralTensor.RawTypedTensor(rawTensor) => - val embedding = EmbeddingSerDe.floatEmbeddingSerDe.fromThrift( - thriftscala.Embedding(Some(rawTensor)) - ) - EmbeddingWithEntity[TweetId](TweetId(tweetId), embedding) - case _ => throw new IllegalArgumentException("tensor is wrong type!") - } - } - - def buildAnnIndex( - pipe: TypedPipe[EmbeddingWithEntity[TweetId]], - args: Args - ): Execution[Unit] = { - def embeddingDimension: Int = args.int("embedding_dimension", 128) - def efConstruction: Int = args.int("ef_construction", 800) - def maxM: Int = args.int("max_M", 40) - val log: Logger = Logger(getClass) - val annOutputPath: String = args("ann_output_path") - - val embeddingWithEntity = pipe.map { - case EmbeddingWithEntity(tweetId, embedding) => - EntityEmbedding[TweetId](tweetId, embedding) - } - val concurrencyLevel = args.int("concurrency_level", 60) - val expectedElements = args.int("expected_elements", 30000000) - val threadPool = Executors.newFixedThreadPool(concurrencyLevel) - val hnswIndex = TypedHnswIndex.serializableIndex[TweetId, InnerProductDistance]( - embeddingDimension, - InnerProduct, - efConstruction, - maxM, - expectedElements, - TweetKind.byteInjection, - ReadWriteFuturePool(FuturePool.apply(threadPool)) - ) - - // Create a timestamped directory to use for recovery in case of index corruption - val timeStampedAnnOutputPath: String = annOutputPath + "/" + (System.currentTimeMillis() / 1000) - val timeStampedAnnOutputDirectory = FileUtils.getFileHandle(timeStampedAnnOutputPath) - - embeddingWithEntity.toIterableExecution - .flatMap { annEmbeddings => - val future = - IndexBuilderUtils.addToIndex(hnswIndex, annEmbeddings.toStream, concurrencyLevel) - val result = future.map { numberUpdates => - log.info(s"Performed $numberUpdates updates") - hnswIndex.toDirectory(timeStampedAnnOutputDirectory) - log.info(s"Finished writing to timestamped index directory - " + - s"$timeStampedAnnOutputDirectory") - } - FutureHelper.executionFrom(result).unit - }.onComplete { _ => - threadPool.shutdown() - Unit - } - } - - def getTweetSimclusterFeatures( - args: Args - )( - implicit dateRange: DateRange - ): TypedPipe[(Long, PersistentSimClustersEmbedding)] = { - val serviceIdEnv = args.getOrElse("sIdEnv", "prod") - val serviceIdRole = args.getOrElse("sIdRole", "cassowary") - val serviceIdZone = args.getOrElse("sIdZone", "atla") - val serviceIdName = args - .getOrElse("sIdName", "tweet-embedding-generation-batch-job") - val serviceId = ServiceIdentifier( - role = serviceIdRole, - service = serviceIdName, - environment = serviceIdEnv, - zone = serviceIdZone) - - val logFavBasedPersistentTweetEmbeddingSource = - new LogFavBasedPersistentTweetEmbeddingMhExportSource( - range = dateRange.prepend(Hours(24)), - serviceIdentifier = serviceId) - val tweetSimclusterEmbeddingTypedPipe = TypedPipe - .from(logFavBasedPersistentTweetEmbeddingSource) - .collect { - case ( - (tweetId, timestamp), - simclusterEmbedding: PersistentSimClustersEmbedding - ) if timestamp == 1L => // 1L corresponds to the LongestL2Norm simcluster embedding - (tweetId.toLong, simclusterEmbedding) - } - - tweetSimclusterEmbeddingTypedPipe - } - 
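 // Illustrative sketch (not part of the original job): step 3 of the pipeline described in the
 // header comment converts each (tweetId, simcluster embedding, Follow2Vec embedding) tuple into a
 // DataRecord via TweetSimclusterRecordAdapter from RecordAdapters.scala, with DEFAULT_F2V_VECTOR
 // used when no producer Follow2Vec embedding is found for the author. `sampleTweetId` and
 // `samplePersistentEmbedding` below are hypothetical placeholders, not values read by this job.
 //
 //   val sampleRecord: DataRecord =
 //     TweetSimclusterRecordAdapter.adaptToDataRecord(
 //       (sampleTweetId, samplePersistentEmbedding, DEFAULT_F2V_VECTOR))
 //
 // In `run`, the same adapter is applied after joining the engagement-filtered simcluster pipe with
 // the Follow2Vec producer embeddings keyed by author id.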
- def getTweetSource()(implicit dateRange: DateRange): TypedPipe[PublicTweet] = { - val recentTweets = DAL - .read(PublicTweetsScalaDataset, dateRange.prepend(tweetSourceLookback)) - .toTypedPipe - - recentTweets - } - - def isVideoTweet(tweet: PublicTweet): Boolean = { - tweet.media.exists { mediaSeq => - mediaSeq.exists { e => - e.mediaType.contains(MediaType.Video) - } - } - } - - def getEngagementFilteredTweets( - minFavCount: Long - )( - implicit dateRange: DateRange - ): TypedPipe[(Long, Int)] = { - val engagementFilteredTweetsPipe = DAL - .read(ServerEngagementsScalaDataset, dateRange.prepend(Days(2))).withRemoteReadPolicy( - AllowCrossDC).toTypedPipe - .collect { - case event if InteractionEventUtils.isTweetType(event) => - val targetTweetId = event.targetId - event.details match { - case InteractionDetails.Favorite(_) => (targetTweetId, 1) - case _ => (targetTweetId, 0) - } - } - .sumByKey - .map { - case (tweetId, count) => (tweetId, count) - } - .filter(_._2 >= minFavCount) - - engagementFilteredTweetsPipe - } - - def run(args: Args)(implicit dateRange: DateRange, idx: UniqueID) = { - val minFavCount = args.int("minFavCount", 32) - val indexAllTweets = args.boolean("indexAllTweets") - - val tweetSimclusterDataset = getTweetSimclusterFeatures(args) - val tweetSourceDataset = getTweetSource() - val engagementFilteredTweetsPipe = getEngagementFilteredTweets(minFavCount) - val inputEmbeddingFormat = UserKind.parser - .getEmbeddingFormat(args, "f2v_input", Some(dateRange.prepend(Days(14)))) - val f2vProducerEmbeddings = inputEmbeddingFormat.getEmbeddings - .map { - case EmbeddingWithEntity(userId, embedding) => (userId.userId, embedding) - } - - val engagementFilteredTweetInfoPipe = tweetSourceDataset - .groupBy(_.tweetId) - .join(engagementFilteredTweetsPipe.groupBy(_._1)) - .map { - case (tweetId, (tweetInfo, tweetFavCount)) => - (tweetId, tweetInfo) - } - - val filteredSimclustersPipe = tweetSimclusterDataset - .groupBy(_._1) - .join(engagementFilteredTweetInfoPipe.groupBy(_._1)) - .map { - case (tweetId, ((_, simclusterEmbedding), (_, tweetInfo))) => - (tweetId, simclusterEmbedding, tweetInfo) - } - .filter { - case (_, _, tweetInfo) => - tweetInfo.quotedTweetTweetId.isEmpty && - tweetInfo.inReplyToTweetId.isEmpty && - tweetInfo.language.exists(SupportedLanguages.contains) && - (indexAllTweets || (!tweetInfo.media.exists(_.isEmpty) && isVideoTweet(tweetInfo))) && - !tweetInfo.nsfwAdmin && - !tweetInfo.nsfwUser - } - .map { - case (tweetId, simclusterEmbedding, tweetInfo) => - (tweetInfo.userId, tweetId, simclusterEmbedding) - } - - val dataRecordsPipe = filteredSimclustersPipe - .groupBy(_._1) - .leftJoin(f2vProducerEmbeddings.groupBy(_._1)) - .values - .map { - case ((authorId1, tweetId, simclusterEmbedding), Some((authorId2, f2vEmbedding))) => - TweetSimclusterRecordAdapter.adaptToDataRecord( - (tweetId, simclusterEmbedding, f2vEmbedding)) - case ((authorId, tweetId, simclusterEmbedding), None) => - TweetSimclusterRecordAdapter.adaptToDataRecord( - (tweetId, simclusterEmbedding, DEFAULT_F2V_VECTOR)) - } - - val modelPath = args.getOrElse("model_path", "") - val batchPredictor = getPredictionEngine(modelName = "tweet_model", modelPath = modelPath) - val tweetIdFeature = SharedFeatures.TWEET_ID - val tweetEmbeddingName = args.getOrElse("tweet_embedding_name", "output") - - val outputPipe = batchPredictor.predict(dataRecordsPipe).map { - case (originalDataRecord, predictedDataRecord) => - val tweetId = originalDataRecord.getFeatureValue(tweetIdFeature) - val 
scalaPredictedDataRecord = - ScalaToJavaDataRecordConversions.javaDataRecord2ScalaDataRecord(predictedDataRecord) - val tweetEmbeddingTensor = - scalaPredictedDataRecord.tensors.get(FeatureUtil.featureIdForName(tweetEmbeddingName)) - val tweetEmbeddingWithEntity = getEmbeddingWithEntity(tweetEmbeddingTensor, tweetId) - tweetEmbeddingWithEntity - } - - buildAnnIndex(outputPipe, args) - } -} - -object TweetEmbeddingGenerationAdhocJob - extends TwitterExecutionApp - with TweetEmbeddingGenerationTrait { - - override def job: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val dateRange: DateRange = DateRange.parse(args.list("dateRange")) - run(args) - } - } -} - -object TweetEmbeddingGenerationBatchJob - extends TwitterScheduledExecutionApp - with TweetEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2021-10-28")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} - -object TweetEmbeddingGenerationBatchJobAlternate - extends TwitterScheduledExecutionApp - with TweetEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2022-03-28")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} - -object TweetEmbeddingGenerationBatchJobExperimental - extends TwitterScheduledExecutionApp - with TweetEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2021-12-12")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/UserEmbeddingGenerationJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/mbcg/UserEmbeddingGenerationJob.scala deleted file mode 100644 index f747764d9..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/mbcg/UserEmbeddingGenerationJob.scala +++ /dev/null @@ -1,270 +0,0 @@ -package com.twitter.simclusters_v2.scalding.mbcg - -import com.twitter.conversions.DurationOps._ -import com.twitter.cortex.deepbird.runtime.prediction_engine.TensorflowPredictionEngineConfig -import com.twitter.cortex.ml.embeddings.common.UserKind -import com.twitter.finagle.stats.NullStatsReceiver -import com.twitter.ml.api.FeatureUtil -import 
com.twitter.ml.api.constant.SharedFeatures -import com.twitter.ml.api.embedding.Embedding -import com.twitter.ml.api.thriftscala -import com.twitter.ml.api.thriftscala.{GeneralTensor => ThriftGeneralTensor} -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.util.ScalaToJavaDataRecordConversions -import com.twitter.ml.featurestore.lib.embedding.EmbeddingWithEntity -import com.twitter.scalding.Args -import com.twitter.scalding.DateParser -import com.twitter.scalding.DateRange -import com.twitter.scalding.Execution -import com.twitter.scalding.UniqueID -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossDC -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecution -import com.twitter.scalding_internal.job.analytics_batch.AnalyticsBatchExecutionArgs -import com.twitter.scalding_internal.job.analytics_batch.BatchDescription -import com.twitter.scalding_internal.job.analytics_batch.BatchFirstTime -import com.twitter.scalding_internal.job.analytics_batch.BatchIncrement -import com.twitter.scalding_internal.job.analytics_batch.BatchWidth -import com.twitter.scalding_internal.job.analytics_batch.TwitterScheduledExecutionApp -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.ExploreMbcgUserEmbeddingsKvScalaDataset -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.twml.runtime.scalding.TensorflowBatchPredictor -import com.twitter.twml.runtime.scalding.TensorflowBatchPredictor.ScaldingThreadingConfig -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.usersource.snapshot.flat.thriftscala.FlatUser -import java.util.TimeZone - -/* -This class does the following: -1) Get user IIAPE Simcluster features that use LogFav scores -2) Filter them down to users whose accounts are not deactivated or suspended -3) Convert the remaining user Simclusters into DataRecords using UserSimclusterRecordAdapter -4) Run inference using a TF model exported with a DataRecord compatible serving signature -5) Write to MH using a KeyVal format - */ -trait UserEmbeddingGenerationTrait { - implicit val tz: TimeZone = DateOps.UTC - implicit val dp: DateParser = DateParser.default - implicit val updateHours = 12 - - private val inputNodeName = "request:0" - private val outputNodeName = "response:0" - private val functionSignatureName = "serve" - private val predictionRequestTimeout = 5.seconds - private val IIAPEHdfsPath: String = - "/atla/proc3/user/cassowary/manhattan_sequence_files/interested_in_from_ape/Model20m145k2020" - - private val DEFAULT_F2V_VECTOR: Embedding[Float] = Embedding(Array.fill[Float](200)(0.0f)) - - def getPredictionEngine(modelName: String, modelPath: String): TensorflowBatchPredictor = { - val config = TensorflowPredictionEngineConfig( - modelName = modelName, - modelSource = modelPath, - threadingConfig = Some(ScaldingThreadingConfig), - defaultInputNode = inputNodeName, - defaultOutputNode = outputNodeName, - functionSignatureName = functionSignatureName, - statsReceiver = NullStatsReceiver - ) - TensorflowBatchPredictor(config, 
predictionRequestTimeout) - } - - def getEmbeddingWithEntity(userEmbeddingTensor: ThriftGeneralTensor, userId: Long) = { - userEmbeddingTensor match { - case ThriftGeneralTensor.RawTypedTensor(rawTensor) => - val embedding = - thriftscala.Embedding(Some(rawTensor)) - KeyVal(userId, embedding) - case _ => throw new IllegalArgumentException("tensor is wrong type!") - } - } - - def writeUserEmbedding( - result: TypedPipe[KeyVal[Long, thriftscala.Embedding]], - args: Args - ): Execution[Unit] = { - result.writeDALVersionedKeyValExecution( - ExploreMbcgUserEmbeddingsKvScalaDataset, - D.Suffix( - args.getOrElse("kvs_output_path", "/user/cassowary/explore_mbcg/user_kvs_store/test") - ) - ) - } - - def getUserSimclusterFeatures( - args: Args - )( - implicit dateRange: DateRange - ): TypedPipe[(Long, ClustersUserIsInterestedIn)] = { - val userSimclusterEmbeddingTypedPipe = TypedPipe - .from(AdhocKeyValSources.interestedInSource(IIAPEHdfsPath)) - .collect { - case ( - userId, - iIAPE: ClustersUserIsInterestedIn - ) => - (userId.toLong, iIAPE) - } - - userSimclusterEmbeddingTypedPipe - } - - def getUserSource()(implicit dateRange: DateRange): TypedPipe[FlatUser] = { - val userSource = - DAL - .readMostRecentSnapshotNoOlderThan(UsersourceFlatScalaDataset, Days(7)) - .withRemoteReadPolicy(AllowCrossDC) - .toTypedPipe - - userSource - } - - def run(args: Args)(implicit dateRange: DateRange, id: UniqueID) = { - val userSimclusterDataset = getUserSimclusterFeatures(args) - val userSourceDataset = getUserSource() - - val inputEmbeddingFormat = UserKind.parser - .getEmbeddingFormat(args, "f2v_input", Some(dateRange.prepend(Days(14)))) - val f2vConsumerEmbeddings = inputEmbeddingFormat.getEmbeddings - .map { - case EmbeddingWithEntity(userId, embedding) => (userId.userId, embedding) - } - - val filteredUserPipe = userSimclusterDataset - .groupBy(_._1) - .join(userSourceDataset.groupBy(_.id.getOrElse(-1L))) - .map { - case (userId, ((_, simclusterEmbedding), userInfo)) => - (userId, simclusterEmbedding, userInfo) - } - .filter { - case (_, _, userInfo) => - !userInfo.deactivated.contains(true) && !userInfo.suspended - .contains(true) - } - .map { - case (userId, simclusterEmbedding, _) => - (userId, simclusterEmbedding) - } - - val dataRecordsPipe = filteredUserPipe - .groupBy(_._1) - .leftJoin(f2vConsumerEmbeddings.groupBy(_._1)) - .values - .map { - case ((userId1, simclusterEmbedding), Some((userId2, f2vEmbedding))) => - UserSimclusterRecordAdapter.adaptToDataRecord( - (userId1, simclusterEmbedding, f2vEmbedding)) - case ((userId, simclusterEmbedding), None) => - UserSimclusterRecordAdapter.adaptToDataRecord( - (userId, simclusterEmbedding, DEFAULT_F2V_VECTOR)) - } - - val modelPath = args.getOrElse("model_path", "") - val batchPredictor = getPredictionEngine(modelName = "tweet_model", modelPath = modelPath) - val userIdFeature = SharedFeatures.USER_ID - val userEmbeddingName = args.getOrElse("user_embedding_name", "output") - - val outputPipe = batchPredictor.predict(dataRecordsPipe).map { - case (originalDataRecord, predictedDataRecord) => - val userId = originalDataRecord.getFeatureValue(userIdFeature) - val scalaPredictedDataRecord = - ScalaToJavaDataRecordConversions.javaDataRecord2ScalaDataRecord(predictedDataRecord) - val userEmbeddingTensor = - scalaPredictedDataRecord.tensors.get(FeatureUtil.featureIdForName(userEmbeddingName)) - val userEmbeddingWithEntity = getEmbeddingWithEntity(userEmbeddingTensor, userId) - userEmbeddingWithEntity - } - - Util.printCounters(writeUserEmbedding(outputPipe, 
args)) - } -} - -object UserEmbeddingGenerationAdhocJob - extends TwitterExecutionApp - with UserEmbeddingGenerationTrait { - - override def job: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val dateRange: DateRange = DateRange.parse(args.list("dateRange")) - run(args) - } - } -} - -object UserEmbeddingGenerationBatchJob - extends TwitterScheduledExecutionApp - with UserEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2021-12-04")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} - -object UserEmbeddingGenerationBatchJobAlternate - extends TwitterScheduledExecutionApp - with UserEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2022-03-28")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} - -object UserEmbeddingGenerationBatchJobExperimental - extends TwitterScheduledExecutionApp - with UserEmbeddingGenerationTrait { - - override def scheduledJob: Execution[Unit] = - Execution.withId { implicit uid => - Execution.withArgs { args => - implicit val tz: TimeZone = DateOps.UTC - val batchFirstTime = BatchFirstTime(RichDate("2021-12-12")(tz, DateParser.default)) - val analyticsArgs = AnalyticsBatchExecutionArgs( - batchDesc = BatchDescription(getClass.getName), - firstTime = batchFirstTime, - batchIncrement = BatchIncrement(Hours(updateHours)), - batchWidth = Some(BatchWidth(Hours(updateHours))) - ) - - AnalyticsBatchExecution(analyticsArgs) { implicit dateRange => - run(args) - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraph.scala deleted file mode 100644 index 5e289c6b8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraph.scala +++ /dev/null @@ -1,514 +0,0 @@ -package com.twitter.simclusters_v2.scalding -package multi_type_graph.assemble_multi_type_graph - -import com.twitter.bijection.scrooge.BinaryScalaCodec -import com.twitter.scalding_internal.job.RequiredBinaryComparators.ordSer -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.{DateRange, Days, Stat, UniqueID} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.{ - LeftNode, - Noun, - RightNode, - RightNodeType, - RightNodeWithEdgeWeight -} -import java.util.TimeZone -import 
com.twitter.iesource.thriftscala.{InteractionEvent, InteractionType, ReferenceTweet} -import com.twitter.simclusters_v2.common.{Country, Language, TopicId, TweetId, UserId} -import com.twitter.usersource.snapshot.combined.UsersourceScalaDataset -import com.twitter.frigate.data_pipeline.magicrecs.magicrecs_notifications_lite.thriftscala.MagicRecsNotificationLite -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser - -object AssembleMultiTypeGraph { - import Config._ - - implicit val nounOrdering: Ordering[Noun] = new Ordering[Noun] { - // We define an ordering for each noun type as specified in simclusters_v2/multi_type_graph.thrift - // Please make sure we don't remove anything here that's still a part of the union Noun thrift and, - // vice versa, if we add a new noun type to thrift, an ordering for it needs to be added here as well. - def nounTypeOrder(noun: Noun): Int = noun match { - case _: Noun.UserId => 0 - case _: Noun.Country => 1 - case _: Noun.Language => 2 - case _: Noun.Query => 3 - case _: Noun.TopicId => 4 - case _: Noun.TweetId => 5 - } - - override def compare(x: Noun, y: Noun): Int = (x, y) match { - case (Noun.UserId(a), Noun.UserId(b)) => a compare b - case (Noun.Country(a), Noun.Country(b)) => a compare b - case (Noun.Language(a), Noun.Language(b)) => a compare b - case (Noun.Query(a), Noun.Query(b)) => a compare b - case (Noun.TopicId(a), Noun.TopicId(b)) => a compare b - case (Noun.TweetId(a), Noun.TweetId(b)) => a compare b - case (nounA, nounB) => nounTypeOrder(nounA) compare nounTypeOrder(nounB) - } - } - implicit val rightNodeTypeOrdering: Ordering[RightNodeType] = ordSer[RightNodeType] - - implicit val rightNodeTypeWithNounOrdering: Ordering[RightNode] = - new Ordering[RightNode] { - override def compare(x: RightNode, y: RightNode): Int = { - Ordering - .Tuple2(rightNodeTypeOrdering, nounOrdering) - .compare((x.rightNodeType, x.noun), (y.rightNodeType, y.noun)) - } - } - - def getUserTweetInteractionGraph( - tweetInteractionEvents: TypedPipe[InteractionEvent], - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numUserTweetInteractionEntries = Stat("num_user_tweet_interaction_entries") - val numDistinctUserTweetInteractionEntries = Stat("num_distinct_user_tweet_interaction_entries") - val numFavedTweets = Stat("num_faved_tweets") - val numRepliedTweets = Stat("num_replied_tweets") - val numRetweetedTweets = Stat("num_retweeted_tweets") - val userTweetInteractionsByType: TypedPipe[((UserId, RightNodeType), TweetId)] = - tweetInteractionEvents - .flatMap { event => - val referenceTweet: Option[ReferenceTweet] = event.referenceTweet - val targetId: Long = event.targetId - val userId: Long = event.engagingUserId - - // To find the id of the tweet that was interacted with - // For likes, this is the targetId; for retweet or reply, it is the referenceTweet's id - // One thing to note is that for likes, referenceTweet is empty - val (tweetIdOpt, rightNodeTypeOpt) = { - event.interactionType match { - case Some(InteractionType.Favorite) => - // Only allow favorites on original tweets, not retweets, to avoid double-counting - // because we have retweet-type tweets in the data source as well - ( - if (referenceTweet.isEmpty) { - numFavedTweets.inc() - Some(targetId) - } else None, - Some(RightNodeType.FavTweet)) - case Some(InteractionType.Reply) => - numRepliedTweets.inc() - (referenceTweet.map(_.tweetId), Some(RightNodeType.ReplyTweet)) - case Some(InteractionType.Retweet) => - numRetweetedTweets.inc() - 
(referenceTweet.map(_.tweetId), Some(RightNodeType.RetweetTweet)) - case _ => (None, None) - } - } - for { - tweetId <- tweetIdOpt - rightNodeType <- rightNodeTypeOpt - } yield { - numUserTweetInteractionEntries.inc() - ((userId, rightNodeType), tweetId) - } - } - - userTweetInteractionsByType - .mapValues(Set(_)) - .sumByKey - .flatMap { - case ((userId, rightNodeType), tweetIdSet) => - tweetIdSet.map { tweetId => - numDistinctUserTweetInteractionEntries.inc() - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = RightNode(rightNodeType = rightNodeType, noun = Noun.TweetId(tweetId)), - weight = 1.0)) - } - } - } - - def getUserFavGraph( - userUserFavEdges: TypedPipe[(UserId, UserId, Double)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numInputFavEdges = Stat("num_input_fav_edges") - userUserFavEdges.map { - case (srcId, destId, edgeWt) => - numInputFavEdges.inc() - ( - LeftNode.UserId(srcId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.FavUser, noun = Noun.UserId(destId)), - weight = edgeWt)) - } - } - - def getUserFollowGraph( - userUserFollowEdges: TypedPipe[(UserId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numFlockFollowEdges = Stat("num_flock_follow_edges") - userUserFollowEdges.map { - case (srcId, destId) => - numFlockFollowEdges.inc() - ( - LeftNode.UserId(srcId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.FollowUser, noun = Noun.UserId(destId)), - weight = 1.0)) - } - } - - def getUserBlockGraph( - userUserBlockEdges: TypedPipe[(UserId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numFlockBlockEdges = Stat("num_flock_block_edges") - userUserBlockEdges.map { - case (srcId, destId) => - numFlockBlockEdges.inc() - ( - LeftNode.UserId(srcId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.BlockUser, noun = Noun.UserId(destId)), - weight = 1.0)) - } - } - - def getUserAbuseReportGraph( - userUserAbuseReportEdges: TypedPipe[(UserId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numFlockAbuseEdges = Stat("num_flock_abuse_edges") - userUserAbuseReportEdges.map { - case (srcId, destId) => - numFlockAbuseEdges.inc() - ( - LeftNode.UserId(srcId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.AbuseReportUser, noun = Noun.UserId(destId)), - weight = 1.0)) - } - } - - def filterInvalidUsers( - flockEdges: TypedPipe[(UserId, UserId)], - validUsers: TypedPipe[UserId] - ): TypedPipe[(UserId, UserId)] = { - flockEdges - .join(validUsers.asKeys) - // .withReducers(10000) - .map { - case (srcId, (destId, _)) => - (destId, srcId) - } - .join(validUsers.asKeys) - // .withReducers(10000) - .map { - case (destId, (srcId, _)) => - (srcId, destId) - } - } - - def getUserSpamReportGraph( - userUserSpamReportEdges: TypedPipe[(UserId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numFlockSpamEdges = Stat("num_flock_spam_edges") - userUserSpamReportEdges.map { - case (srcId, destId) => - numFlockSpamEdges.inc() - ( - LeftNode.UserId(srcId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.SpamReportUser, noun = Noun.UserId(destId)), - weight = 1.0)) - } - } - - def getUserTopicFollowGraph( - topicUserFollowedByEdges: 
TypedPipe[(TopicId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numTFGEdges = Stat("num_tfg_edges") - topicUserFollowedByEdges.map { - case (topicId, userId) => - numTFGEdges.inc() - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.FollowTopic, noun = Noun.TopicId(topicId)), - weight = 1.0) - ) - } - } - - def getUserSignUpCountryGraph( - userSignUpCountryEdges: TypedPipe[(UserId, (Country, Language))] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numUserSourceEntriesRead = Stat("num_user_source_entries") - userSignUpCountryEdges.map { - case (userId, (country, lang)) => - numUserSourceEntriesRead.inc() - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.SignUpCountry, noun = Noun.Country(country)), - weight = 1.0)) - } - } - - def getMagicRecsNotifOpenOrClickTweetsGraph( - userMRNotifOpenOrClickEvents: TypedPipe[MagicRecsNotificationLite] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numNotifOpenOrClickEntries = Stat("num_notif_open_or_click") - userMRNotifOpenOrClickEvents.flatMap { entry => - numNotifOpenOrClickEntries.inc() - for { - userId <- entry.targetUserId - tweetId <- entry.tweetId - } yield { - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = RightNode( - rightNodeType = RightNodeType.NotifOpenOrClickTweet, - noun = Noun.TweetId(tweetId)), - weight = 1.0)) - } - } - } - - def getUserConsumedLanguagesGraph( - userConsumedLanguageEdges: TypedPipe[(UserId, Seq[(Language, Double)])] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numPenguinSourceEntriesRead = Stat("num_penguin_source_entries") - userConsumedLanguageEdges.flatMap { - case (userId, langWithWeights) => - numPenguinSourceEntriesRead.inc() - langWithWeights.map { - case (lang, weight) => - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = RightNode( - rightNodeType = RightNodeType.ConsumedLanguage, - noun = Noun.Language(lang)), - weight = weight)) - } - } - } - - def getSearchGraph( - userSearchQueryEdges: TypedPipe[(UserId, String)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numSearchQueries = Stat("num_search_queries") - userSearchQueryEdges.map { - case (userId, query) => - numSearchQueries.inc() - ( - LeftNode.UserId(userId), - RightNodeWithEdgeWeight( - rightNode = - RightNode(rightNodeType = RightNodeType.SearchQuery, noun = Noun.Query(query)), - weight = 1.0)) - } - } - - def buildEmployeeGraph( - fullGraph: TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numEmployeeEdges = Stat("num_employee_edges") - val employeeIds = Config.SampledEmployeeIds - fullGraph - .collect { - case (LeftNode.UserId(userId), rightNodeWithWeight) if employeeIds.contains(userId) => - numEmployeeEdges.inc() - (LeftNode.UserId(userId), rightNodeWithWeight) - } - } - - def getTruncatedGraph( - fullGraph: TypedPipe[(LeftNode, RightNodeWithEdgeWeight)], - topKWithFrequency: TypedPipe[(RightNodeType, Seq[(Noun, Double)])] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - val numEntriesTruncatedGraph = Stat("num_entries_truncated_graph") - val numTopKTruncatedNouns = 
Stat("num_topk_truncated_nouns") - - implicit val rightNodeSer: RightNode => Array[Byte] = BinaryScalaCodec(RightNode) - val topNouns: TypedPipe[RightNode] = topKWithFrequency - .flatMap { - case (rightNodeType, nounsList) => - nounsList - .map { - case (nounVal, aggregatedFrequency) => - numTopKTruncatedNouns.inc() - RightNode(rightNodeType, nounVal) - } - } - - fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - (rightNodeWithWeight.rightNode, (leftNode, rightNodeWithWeight)) - } - .sketch(reducers = 5000) - .join(topNouns.asKeys.toTypedPipe) - .map { - case (rightNode, ((left, rightNodeWithWeight), _)) => - numEntriesTruncatedGraph.inc() - (left, rightNodeWithWeight) - } - } - - def getTopKRightNounsWithFrequencies( - fullGraph: TypedPipe[(LeftNode, RightNodeWithEdgeWeight)], - topKConfig: Map[RightNodeType, Int], - minFrequency: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(RightNodeType, Seq[(Noun, Double)])] = { - val maxAcrossRightNounType: Int = topKConfig.valuesIterator.max - fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - (rightNodeWithWeight.rightNode, 1.0) - } - .sumByKey - // .withReducers(20000) - .toTypedPipe - .filter(_._2 >= minFrequency) - .map { - case (rightNode, freq) => - (rightNode.rightNodeType, (rightNode.noun, freq)) - } - .group(rightNodeTypeOrdering) - // Note: if maxAcrossRightNounType is >15M, it might result in OOM on reducer - .sortedReverseTake(maxAcrossRightNounType)(Ordering.by(_._2)) - // An alternative to using group followed by sortedReverseTake is to define TopKMonoids, - // one for each RightNodeType to get the most frequent rightNouns - .map { - case (rightNodeType, nounsListWithFreq) => - val truncatedList = nounsListWithFreq - .sortBy(-_._2) - .take(topKConfig.getOrElse(rightNodeType, NumTopNounsForUnknownRightNodeType)) - (rightNodeType, truncatedList) - } - } - - def getValidUsers( - userSource: TypedPipe[CombinedUser] - )( - implicit uniqueID: UniqueID - ): TypedPipe[UserId] = { - val numValidUsers = Stat("num_valid_users") - userSource - .flatMap { u => - for { - user <- u.user - if user.id != 0 - safety <- user.safety - if !(safety.suspended || safety.deactivated) - } yield { - numValidUsers.inc() - user.id - } - } - } - - def getFullGraph( - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = { - - // list of valid UserIds - to filter out deactivated or suspended user accounts - val userSource: TypedPipe[CombinedUser] = - DAL - .readMostRecentSnapshotNoOlderThan(UsersourceScalaDataset, Days(7)).toTypedPipe - val validUsers: TypedPipe[UserId] = getValidUsers(userSource).forceToDisk - - //Dataset read operations - - // ieSource tweet engagements data for tweet favs, replies, retweets - from last 14 days - val tweetSource: TypedPipe[InteractionEvent] = - ExternalDataSources.ieSourceTweetEngagementsSource(dateRange = - DateRange(dateRange.end - Days(14), dateRange.end)) - - // user-user fav edges - val userUserFavEdges: TypedPipe[(UserId, UserId, Double)] = - ExternalDataSources.getFavEdges(HalfLifeInDaysForFavScore) - - // user-user follow edges - val userUserFollowEdges: TypedPipe[(UserId, UserId)] = - filterInvalidUsers(ExternalDataSources.flockFollowsSource, validUsers) - - // user-user block edges - val userUserBlockEdges: TypedPipe[(UserId, UserId)] = - filterInvalidUsers(ExternalDataSources.flockBlocksSource, validUsers) - - // user-user abuse report edges - val userUserAbuseReportEdges: TypedPipe[(UserId, UserId)] = - 
filterInvalidUsers(ExternalDataSources.flockReportAsAbuseSource, validUsers) - - // user-user spam report edges - val userUserSpamReportEdges: TypedPipe[(UserId, UserId)] = - filterInvalidUsers(ExternalDataSources.flockReportAsSpamSource, validUsers) - - // user-signup country edges - val userSignUpCountryEdges: TypedPipe[(UserId, (Country, Language))] = - ExternalDataSources.userSource - - // user-consumed language edges - val userConsumedLanguageEdges: TypedPipe[(UserId, Seq[(Language, Double)])] = - ExternalDataSources.inferredUserConsumedLanguageSource - - // user-topic follow edges - val topicUserFollowedByEdges: TypedPipe[(TopicId, UserId)] = - ExternalDataSources.topicFollowGraphSource - - // user-MRNotifOpenOrClick events from last 7 days - val userMRNotifOpenOrClickEvents: TypedPipe[MagicRecsNotificationLite] = - ExternalDataSources.magicRecsNotficationOpenOrClickEventsSource(dateRange = - DateRange(dateRange.end - Days(7), dateRange.end)) - - // user-searchQuery strings from last 7 days - val userSearchQueryEdges: TypedPipe[(UserId, String)] = - ExternalDataSources.adaptiveSearchScribeLogsSource(dateRange = - DateRange(dateRange.end - Days(7), dateRange.end)) - - getUserTweetInteractionGraph(tweetSource) ++ - getUserFavGraph(userUserFavEdges) ++ - getUserFollowGraph(userUserFollowEdges) ++ - getUserBlockGraph(userUserBlockEdges) ++ - getUserAbuseReportGraph(userUserAbuseReportEdges) ++ - getUserSpamReportGraph(userUserSpamReportEdges) ++ - getUserSignUpCountryGraph(userSignUpCountryEdges) ++ - getUserConsumedLanguagesGraph(userConsumedLanguageEdges) ++ - getUserTopicFollowGraph(topicUserFollowedByEdges) ++ - getMagicRecsNotifOpenOrClickTweetsGraph(userMRNotifOpenOrClickEvents) ++ - getSearchGraph(userSearchQueryEdges) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphApp.scala deleted file mode 100644 index c341113fb..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphApp.scala +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.simclusters_v2.scalding -package multi_type_graph.assemble_multi_type_graph - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import com.twitter.scalding.RichDate -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.thriftscala.LeftNode -import com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct -import com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList -import com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList -import com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import com.twitter.simclusters_v2.hdfs_sources._ - -/** -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph:multi_type_graph-adhoc -scalding remote run \ ---user cassowary \ ---keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ ---principal service_acoount@TWITTER.BIZ \ ---cluster bluebird-qus1 \ ---main-class 
com.twitter.simclusters_v2.scalding.multi_type_graph.assemble_multi_type_graph.AssembleMultiTypeGraphAdhocApp \ ---target src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph:multi_type_graph-adhoc \ ---hadoop-properties "mapreduce.reduce.memory.mb=8192 mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.java.opts='-Xmx7618M' mapreduce.task.timeout=3600000" \ --- --date 2021-07-10 --outputDir /gcs/user/cassowary/adhoc/your_ldap/multi_type/multi_type - -To run using scalding_job target: -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph:multi_type_graph-adhoc - */ - -object AssembleMultiTypeGraphAdhocApp extends AssembleMultiTypeGraphBaseApp with AdhocExecutionApp { - override val isAdhoc: Boolean = true - override val truncatedMultiTypeGraphMHOutputPath: String = "truncated_graph_mh" - override val topKRightNounsMHOutputPath: String = "top_k_right_nouns_mh" - override val fullMultiTypeGraphThriftOutputPath: String = "full_graph_thrift" - override val truncatedMultiTypeGraphKeyValDataset: KeyValDALDataset[ - KeyVal[LeftNode, RightNodeWithEdgeWeightList] - ] = TruncatedMultiTypeGraphAdhocScalaDataset - override val topKRightNounsKeyValDataset: KeyValDALDataset[ - KeyVal[RightNodeTypeStruct, NounWithFrequencyList] - ] = TopKRightNounsAdhocScalaDataset - override val fullMultiTypeGraphSnapshotDataset: SnapshotDALDataset[MultiTypeGraphEdge] = - FullMultiTypeGraphAdhocScalaDataset -} - -/** -To deploy the job: - -capesospy-v2 update --build_locally \ - --start_cron assemble_multi_type_graph \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object AssembleMultiTypeGraphBatchApp - extends AssembleMultiTypeGraphBaseApp - with ScheduledExecutionApp { - override val isAdhoc: Boolean = false - override val truncatedMultiTypeGraphMHOutputPath: String = "truncated_graph_mh" - override val topKRightNounsMHOutputPath: String = "top_k_right_nouns_mh" - override val fullMultiTypeGraphThriftOutputPath: String = "full_graph_thrift" - override val truncatedMultiTypeGraphKeyValDataset: KeyValDALDataset[ - KeyVal[LeftNode, RightNodeWithEdgeWeightList] - ] = TruncatedMultiTypeGraphScalaDataset - override val topKRightNounsKeyValDataset: KeyValDALDataset[ - KeyVal[RightNodeTypeStruct, NounWithFrequencyList] - ] = TopKRightNounsScalaDataset - override val fullMultiTypeGraphSnapshotDataset: SnapshotDALDataset[MultiTypeGraphEdge] = - FullMultiTypeGraphScalaDataset - override val firstTime: RichDate = RichDate("2021-08-21") - override val batchIncrement: Duration = Days(7) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphBaseApp.scala deleted file mode 100644 index 4f645e522..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphBaseApp.scala +++ /dev/null @@ -1,185 +0,0 @@ -package com.twitter.simclusters_v2.scalding -package multi_type_graph.assemble_multi_type_graph - -import com.twitter.dal.client.dataset.{KeyValDALDataset, SnapshotDALDataset} -import com.twitter.scalding.{Execution, _} -import com.twitter.scalding_internal.dalv2.DALWrite.{D, _} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import 
com.twitter.simclusters_v2.scalding.common.TypedRichPipe.typedPipeToRichPipe -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.thriftscala.{ - LeftNode, - Noun, - NounWithFrequency, - NounWithFrequencyList, - RightNodeType, - RightNodeTypeStruct, - RightNodeWithEdgeWeight, - RightNodeWithEdgeWeightList, - MultiTypeGraphEdge -} -import com.twitter.wtf.scalding.jobs.common.DateRangeExecutionApp -import java.util.TimeZone - -/** - * In this file, we assemble the multi_type_graph user-entity engagement signals - * - * It works as follows and the following datasets are produced as a result: - * - * 1. FullGraph (fullMultiTypeGraphSnapshotDataset) : reads datasets from multiple sources and generates - * a bipartite graph with LeftNode -> RightNode edges, capturing a user's engagement with varied entity types - * - * 2. TruncatedGraph (truncatedMultiTypeGraphKeyValDataset): a truncated version of the FullGraph - * where we only store the topK most frequently occurring RightNodes in the bipartite graph LeftNode -> RightNode - * - * 3. TopKNouns (topKRightNounsKeyValDataset): this stores the topK most frequent Nouns for each engagement type - * Please note that this dataset is currently only being used for the debugger to find which nodes we consider as the - * most frequently occurring, in FullGraph - */ - -trait AssembleMultiTypeGraphBaseApp extends DateRangeExecutionApp { - val truncatedMultiTypeGraphKeyValDataset: KeyValDALDataset[ - KeyVal[LeftNode, RightNodeWithEdgeWeightList] - ] - val topKRightNounsKeyValDataset: KeyValDALDataset[ - KeyVal[RightNodeTypeStruct, NounWithFrequencyList] - ] - val fullMultiTypeGraphSnapshotDataset: SnapshotDALDataset[MultiTypeGraphEdge] - val isAdhoc: Boolean - val truncatedMultiTypeGraphMHOutputPath: String - val topKRightNounsMHOutputPath: String - val fullMultiTypeGraphThriftOutputPath: String - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import Config._ - import AssembleMultiTypeGraph._ - - val numKeysInTruncatedGraph = Stat("num_keys_truncated_mts") - val numKeysInTopKNounsGraph = Stat("num_keys_topk_nouns_mts") - - val fullGraph: TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = - getFullGraph().count("num_entries_full_graph") - - val topKRightNodes: TypedPipe[(RightNodeType, Seq[(Noun, Double)])] = - getTopKRightNounsWithFrequencies( - fullGraph, - TopKConfig, - GlobalDefaultMinFrequencyOfRightNodeType) - - val truncatedGraph: TypedPipe[(LeftNode, RightNodeWithEdgeWeight)] = - getTruncatedGraph(fullGraph, topKRightNodes).count("num_entries_truncated_graph") - - // key transformations - truncated graph, keyed by LeftNode - val truncatedGraphKeyedBySrc: TypedPipe[(LeftNode, RightNodeWithEdgeWeightList)] = - truncatedGraph - .map { - case (LeftNode.UserId(userId), rightNodeWithWeight) => - userId -> List(rightNodeWithWeight) - } - .sumByKey - .map { - case (userId, rightNodeWithWeightList) => - (LeftNode.UserId(userId), RightNodeWithEdgeWeightList(rightNodeWithWeightList)) - } - - // key transformation - topK nouns, keyed by the RightNodeNounType - val topKNounsKeyedByType: TypedPipe[(RightNodeTypeStruct, NounWithFrequencyList)] = - topKRightNodes - .map { - case (rightNodeType, rightNounsWithScoresList) => - val nounsListWithFrequency: Seq[NounWithFrequency] = rightNounsWithScoresList - .map { - case (noun, aggregatedFrequency) => - NounWithFrequency(noun, aggregatedFrequency) - } - 
(RightNodeTypeStruct(rightNodeType), NounWithFrequencyList(nounsListWithFrequency)) - } - - //WriteExecs - truncated graph - val truncatedGraphTsvExec: Execution[Unit] = - truncatedGraphKeyedBySrc.writeExecution( - TypedTsv[(LeftNode, RightNodeWithEdgeWeightList)](AdhocRootPrefix + "truncated_graph_tsv")) - - val truncatedGraphDALExec: Execution[Unit] = truncatedGraphKeyedBySrc - .map { - case (leftNode, rightNodeWithWeightList) => - numKeysInTruncatedGraph.inc() - KeyVal(leftNode, rightNodeWithWeightList) - } - .writeDALVersionedKeyValExecution( - truncatedMultiTypeGraphKeyValDataset, - D.Suffix( - (if (!isAdhoc) - RootPath - else - AdhocRootPrefix) - + truncatedMultiTypeGraphMHOutputPath), - ExplicitEndTime(dateRange.`end`) - ) - - //WriteExec - topK rightnouns - val topKNounsTsvExec: Execution[Unit] = - topKNounsKeyedByType.writeExecution( - TypedTsv[(RightNodeTypeStruct, NounWithFrequencyList)]( - AdhocRootPrefix + "top_k_right_nouns_tsv")) - - // writing topKNouns MH dataset for debugger - val topKNounsDALExec: Execution[Unit] = topKNounsKeyedByType - .map { - case (engagementType, rightList) => - val rightListMH = - NounWithFrequencyList(rightList.nounWithFrequencyList.take(TopKRightNounsForMHDump)) - numKeysInTopKNounsGraph.inc() - KeyVal(engagementType, rightListMH) - } - .writeDALVersionedKeyValExecution( - topKRightNounsKeyValDataset, - D.Suffix( - (if (!isAdhoc) - RootPath - else - AdhocRootPrefix) - + topKRightNounsMHOutputPath), - ExplicitEndTime(dateRange.`end`) - ) - - //WriteExec - fullGraph - val fullGraphDALExec: Execution[Unit] = fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - MultiTypeGraphEdge(leftNode, rightNodeWithWeight) - }.writeDALSnapshotExecution( - fullMultiTypeGraphSnapshotDataset, - D.Daily, - D.Suffix( - (if (!isAdhoc) - RootThriftPath - else - AdhocRootPrefix) - + fullMultiTypeGraphThriftOutputPath), - D.Parquet, - dateRange.`end` - ) - - if (isAdhoc) { - Util.printCounters( - Execution - .zip( - truncatedGraphTsvExec, - topKNounsTsvExec, - truncatedGraphDALExec, - topKNounsDALExec, - fullGraphDALExec).unit) - } else { - Util.printCounters( - Execution.zip(truncatedGraphDALExec, topKNounsDALExec, fullGraphDALExec).unit) - } - - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/BUILD deleted file mode 100644 index 5afed4a7a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/BUILD +++ /dev/null @@ -1,91 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":full_multi_type_graph_adhoc-scala", - ":top_k_right_nouns_adhoc-scala", - ":truncated_multi_type_graph_adhoc-scala", - "3rdparty/src/jvm/com/twitter/scalding:commons", - "3rdparty/src/jvm/com/twitter/scalding:core", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/thrift/com/twitter/twadoop/user/gen:gen-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -scalding_job( - name = "multi_type_graph-adhoc", - main = "com.twitter.simclusters_v2.scalding.multi_type_graph.assemble_multi_type_graph.AssembleMultiTypeGraphAdhocApp", - config = [ - 
("hadoop.map.jvm.total-memory", "8192m"), - ("hadoop.reduce.jvm.total-memory", "8192m"), - ("hadoop.submitter.jvm.total-memory", "8192m"), - ("hadoop.am.jvm.total-memory", "8192m"), - ( - "job.args", - [ - "--date 2021-07-14", - ], - ), - ], - hadoop_cluster = "qus1-bluebird", - hadoop_properties = [("mapreduce.task.timeout", "3600000")], - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [":assemble_multi_type_graph"], -) - -create_datasets( - base_name = "truncated_multi_type_graph_adhoc", - key_type = "com.twitter.simclusters_v2.thriftscala.LeftNode", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.truncatedMultiTypeGraphInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "top_k_right_nouns_adhoc", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.topKRightNounListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "full_multi_type_graph_adhoc", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/Config.scala b/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/Config.scala deleted file mode 100644 index c423262a5..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/Config.scala +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.simclusters_v2.scalding -package multi_type_graph.assemble_multi_type_graph - -import com.twitter.simclusters_v2.thriftscala.RightNodeType - -object Config { - - val User = System.getenv("USER") - val RootPath: String = s"/user/$User/manhattan_sequence_files/multi_type_simclusters/" - val RootThriftPath: String = s"/user/$User/processed/multi_type_simclusters/" - val AdhocRootPrefix = s"/gcs/user/$User/adhoc/multi_type_simclusters/" - val HalfLifeInDaysForFavScore = 100 - val NumTopNounsForUnknownRightNodeType = 20 - val GlobalDefaultMinFrequencyOfRightNodeType = 100 - val TopKRightNounsForMHDump = 1000 - - // the topK most frequent nouns for each engagement type - val TopKConfig: Map[RightNodeType, Int] = Map( - RightNodeType.FollowUser -> 10000000, // 10M, current simclusters_v2 has this value set to 20M, providing this the most weight - RightNodeType.FavUser -> 5000000, - RightNodeType.BlockUser -> 1000000, - RightNodeType.AbuseReportUser -> 1000000, - 
RightNodeType.SpamReportUser -> 1000000, - RightNodeType.FollowTopic -> 5000, - RightNodeType.SignUpCountry -> 200, - RightNodeType.ConsumedLanguage -> 50, - RightNodeType.FavTweet -> 500000, - RightNodeType.ReplyTweet -> 500000, - RightNodeType.RetweetTweet -> 500000, - RightNodeType.NotifOpenOrClickTweet -> 500000, - RightNodeType.SearchQuery -> 500000 - ) - val SampledEmployeeIds: Set[Long] = - Set() -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/BUILD.bazel deleted file mode 100644 index 95ccf5027..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/BUILD.bazel +++ /dev/null @@ -1,126 +0,0 @@ -scala_library( - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/storehaus:algebra", - "3rdparty/src/jvm/com/twitter/storehaus:core", - "graphstore/common:flock_follows-java", - "snowflake:id", - "src/java/com/twitter/ml/api/constant", - "src/java/com/twitter/sbf/graph", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/pluck/source/core_workflows/user_model:condensed_user_state-scala", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/candidate_source", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/summingbird/common", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/itl", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/wtf/scalding/sims:sims-thrift-scala", - "twadoop_config/configuration/log_categories/group/timeline:timeline_service_favorites-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -hadoop_binary( - name = "simclusters_offline_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_job", - ], -) - -hadoop_binary( - name = "simclusters_offline_job", - main = "com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_job", - ], -) - -hadoop_binary( - name = "simclusters_offline_job-repl", - main = "com.twitter.scalding_internal.repl.TwitterScaldingShell", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_job", - 
"science/scalding/scripts:scalding-repl-deps", - ], -) - -hadoop_binary( - name = "dump_cluster_topk_job-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_job.DumpClusterTopKTweetsAdhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_job", - ], -) - -# Generated with `capesospy-v2 create_target offline_tweet_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml`, config hash bb0831. -scalding_job( - name = "offline_tweet_job", - main = "com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobScheduledApp", - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - contact = "no-reply@twitter.com", - cron = "14 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_job", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/OfflineTweetRecommendation.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/OfflineTweetRecommendation.scala deleted file mode 100644 index a34c2e972..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/OfflineTweetRecommendation.scala +++ /dev/null @@ -1,176 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job - -import com.twitter.algebird.Aggregator.size -import com.twitter.algebird.{Aggregator, QTreeAggregatorLowerBound} -import com.twitter.scalding.{Execution, Stat, TypedPipe, UniqueID} -import com.twitter.simclusters_v2.candidate_source._ -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.thriftscala.{ - ClusterTopKTweetsWithScores, - ClustersUserIsInterestedIn -} -import java.nio.ByteBuffer - -case class OfflineRecConfig( - maxTweetRecs: Int, // total number of tweet recs. - maxTweetsPerUser: Int, - maxClustersToQuery: Int, - minTweetScoreThreshold: Double, - rankClustersBy: ClusterRanker.Value) - -/** - * An offline simulation of the tweet rec logic in [[InterestedInTweetCandidateStore]]. - * The main difference is that instead of using Memcache, it uses an offline clusterTopK store as - * the tweet source. - * Also, instead of taking a single userId as input, it processes a pipe of users altogether. 
- */ -object OfflineTweetRecommendation { - - case class ScoredTweet(tweetId: TweetId, score: Double) { - - def toTuple: (TweetId, Double) = { - (tweetId, score) - } - } - - object ScoredTweet { - def apply(tuple: (TweetId, Double)): ScoredTweet = new ScoredTweet(tuple._1, tuple._2) - implicit val scoredOrdering: Ordering[ScoredTweet] = (x: ScoredTweet, y: ScoredTweet) => { - Ordering.Double.compare(x.score, y.score) - } - } - - def getTopTweets( - config: OfflineRecConfig, - targetUsersPipe: TypedPipe[Long], - userIsInterestedInPipe: TypedPipe[(Long, ClustersUserIsInterestedIn)], - clusterTopKTweetsPipe: TypedPipe[ClusterTopKTweetsWithScores] - )( - implicit uniqueID: UniqueID - ): Execution[TypedPipe[(Long, Seq[ScoredTweet])]] = { - val tweetRecommendedCount = Stat("NumTweetsRecomended") - val targetUserCount = Stat("NumTargetUsers") - val userWithRecsCount = Stat("NumUsersWithAtLeastTweetRec") - - // For every user, read the user's interested-in clusters and cluster's weights - val userClusterWeightPipe: TypedPipe[(Int, (Long, Double))] = - targetUsersPipe.asKeys - .join(userIsInterestedInPipe) - .flatMap { - case (userId, (_, clustersWithScores)) => - targetUserCount.inc() - val topClusters = ClusterRanker - .getTopKClustersByScore( - clustersWithScores.clusterIdToScores.toMap, - ClusterRanker.RankByNormalizedFavScore, - config.maxClustersToQuery - ).toList - topClusters.map { - case (clusterId, clusterWeightForUser) => - (clusterId, (userId, clusterWeightForUser)) - } - } - - // For every cluster, read the top tweets in the cluster, and their weights - val clusterTweetWeightPipe: TypedPipe[(Int, List[(Long, Double)])] = - clusterTopKTweetsPipe - .flatMap { cluster => - val tweets = - cluster.topKTweets.toList // Convert to a List, otherwise .flatMap dedups by clusterIds - .flatMap { - case (tid, persistedScores) => - val tweetWeight = persistedScores.score.map(_.value).getOrElse(0.0) - if (tweetWeight > 0) { - Some((tid, tweetWeight)) - } else { - None - } - } - if (tweets.nonEmpty) { - Some((cluster.clusterId, tweets)) - } else { - None - } - } - - // Collect all the tweets from clusters user is interested in - val recommendedTweetsPipe = userClusterWeightPipe - .sketch(4000)(cid => ByteBuffer.allocate(4).putInt(cid).array(), Ordering.Int) - .join(clusterTweetWeightPipe) - .flatMap { - case (_, ((userId, clusterWeight), tweetsPerCluster)) => - tweetsPerCluster.map { - case (tid, tweetWeight) => - val contribution = clusterWeight * tweetWeight - ((userId, tid), contribution) - } - } - .sumByKey - .withReducers(5000) - - // Filter by minimum score threshold - val scoreFilteredTweetsPipe = recommendedTweetsPipe - .collect { - case ((userId, tid), score) if score >= config.minTweetScoreThreshold => - (userId, ScoredTweet(tid, score)) - } - - // Rank top tweets for each user - val topTweetsPerUserPipe = scoreFilteredTweetsPipe.group - .sortedReverseTake(config.maxTweetsPerUser)(ScoredTweet.scoredOrdering) - .flatMap { - case (userId, tweets) => - userWithRecsCount.inc() - tweetRecommendedCount.incBy(tweets.size) - - tweets.map { t => (userId, t) } - } - .forceToDiskExecution - - val topTweetsPipe = topTweetsPerUserPipe - .flatMap { tweets => - approximateScoreAtTopK(tweets.map(_._2.score), config.maxTweetRecs).map { threshold => - tweets - .collect { - case (userId, tweet) if tweet.score >= threshold => - (userId, List(tweet)) - } - .sumByKey - .toTypedPipe - } - } - topTweetsPipe - } - - /** - * Returns the approximate score at the k'th top ranked record using sampling. 
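- * For example, with topK = 1,000,000 and a pipe of 100,000,000 scores (hypothetical sizes, for
- * illustration only), it estimates the 99th-percentile score from a reservoir sample of 100,000
- * elements rather than sorting the full pipe.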
- * This score can then be used to filter for the top K elements in a big pipe where - * K is too big to fit in memory. - * - */ - def approximateScoreAtTopK(pipe: TypedPipe[Double], topK: Int): Execution[Double] = { - val defaultScore = 0.0 - println("approximateScoreAtTopK: topK=" + topK) - pipe - .aggregate(size) - .getOrElseExecution(0L) - .flatMap { len => - println("approximateScoreAtTopK: len=" + len) - val topKPercentile = if (len == 0 || topK > len) 0 else 1 - topK.toDouble / len.toDouble - val randomSample = Aggregator.reservoirSample[Double](Math.max(100000, topK / 100)) - pipe - .aggregate(randomSample) - .getOrElseExecution(List.empty) - .flatMap { sample => - TypedPipe - .from(sample) - .aggregate(QTreeAggregatorLowerBound[Double](topKPercentile)) - .getOrElseExecution(defaultScore) - } - } - .map { score => - println("approximateScoreAtTopK: topK percentile score=" + score) - score - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJob.scala deleted file mode 100644 index 66b458b2a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJob.scala +++ /dev/null @@ -1,176 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job - -import com.twitter.scalding._ -import com.twitter.simclusters_v2.common._ -import com.twitter.simclusters_v2.summingbird.common.{Configs, SimClustersInterestedInUtil} -import com.twitter.simclusters_v2.thriftscala._ -import java.util.TimeZone - -object SimClustersOfflineJob { - import SimClustersOfflineJobUtil._ - import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ - - val modelVersionMap: Map[String, PersistedModelVersion] = Map( - ModelVersions.Model20M145KDec11 -> PersistedModelVersion.Model20m145kDec11, - ModelVersions.Model20M145KUpdated -> PersistedModelVersion.Model20m145kUpdated - ) - - /** - * Get a list of tweets that received at least one fav in the last tweetTtl Duration - */ - def getSubsetOfValidTweets(tweetTtl: Duration)(implicit dateRange: DateRange): TypedPipe[Long] = { - readTimelineFavoriteData(DateRange(dateRange.end - tweetTtl, dateRange.end)).map(_._2).distinct - } - - /** - * Note that this job will write several types of scores into the same data set. Please use filter - * to take the score types you need. 
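- *
- * A hypothetical downstream filter that keeps only the normalized-fav scores (assuming the
- * result pipe is bound to a name such as `scores`):
- *
- *   scores.filter(_.scoreType.contains(PersistedScoreType.NormalizedFav8HrHalfLife))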
- */ - def computeAggregatedTweetClusterScores( - dateRange: DateRange, - userInterestsData: TypedPipe[(Long, ClustersUserIsInterestedIn)], - favoriteData: TypedPipe[(UserId, TweetId, Timestamp)], - previousTweetClusterScores: TypedPipe[TweetAndClusterScores] - )( - implicit timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[TweetAndClusterScores] = { - - val latestTimeStamp = dateRange.end.timestamp - - val currentScores: TypedPipe[ - ((Long, Int, PersistedModelVersion, Option[PersistedScoreType]), PersistedScores) - ] = - favoriteData - .map { - case (userId, tweetId, timestamp) => - (userId, (tweetId, timestamp)) - } - .count("NumFavEvents") - .leftJoin(userInterestsData) - .withReducers(600) - .flatMap { - case (_, ((tweetId, timestamp), Some(userInterests))) => - val clustersWithScores = - SimClustersInterestedInUtil.topClustersWithScores(userInterests) - ( - for { - (clusterId, scores) <- clustersWithScores - if scores.favScore >= Configs.favScoreThresholdForUserInterest( - userInterests.knownForModelVersion) - } yield { - // write several types of scores - Seq( - ( - tweetId, - clusterId, - modelVersionMap(userInterests.knownForModelVersion), - Some(PersistedScoreType.NormalizedFav8HrHalfLife)) -> - // let the score decay to latestTimeStamp - persistedScoresMonoid.plus( - persistedScoresMonoid - .build(scores.clusterNormalizedFavScore, timestamp), - persistedScoresMonoid.build(0.0, latestTimeStamp) - ), - ( - tweetId, - clusterId, - modelVersionMap(userInterests.knownForModelVersion), - Some(PersistedScoreType.NormalizedFollow8HrHalfLife)) -> - // let the score decay to latestTimeStamp - persistedScoresMonoid.plus( - persistedScoresMonoid - .build(scores.clusterNormalizedFollowScore, timestamp), - persistedScoresMonoid.build(0.0, latestTimeStamp) - ), - ( - tweetId, - clusterId, - modelVersionMap(userInterests.knownForModelVersion), - Some(PersistedScoreType.NormalizedLogFav8HrHalfLife)) -> - // let the score decay to latestTimeStamp - persistedScoresMonoid.plus( - persistedScoresMonoid - .build(scores.clusterNormalizedLogFavScore, timestamp), - persistedScoresMonoid.build(0.0, latestTimeStamp) - ) - ) - } - ).flatten - case _ => - Nil - } - .count("NumTweetClusterScoreUpdates") - .sumByLocalKeys // there is a .sumByKey later, so just doing a local sum here. 
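-    // A sketch of the decay trick used above, assuming standard Algebird DecayedValue semantics:
-    // persistedScoresMonoid.plus(build(v, t0), build(0.0, latestTimeStamp)) re-expresses v at
-    // latestTimeStamp, scaling it by 2^(-(latestTimeStamp - t0) / halfLife). A fav scored 1.0
-    // exactly one half-life before latestTimeStamp therefore contributes ~0.5, so per-batch
-    // scores can be summed directly with the previous snapshot in the .sumByKey below.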
- - val previousScores: TypedPipe[ - ((Long, Int, PersistedModelVersion, Option[PersistedScoreType]), PersistedScores) - ] = - previousTweetClusterScores.map { v => - (v.tweetId, v.clusterId, v.modelVersion, v.scoreType) -> v.scores - } - - // add current scores and previous scores - (currentScores ++ previousScores).sumByKey - .withReducers(1000) - .map { - case ((tweetId, clusterId, modelVersion, scoreType), scores) => - TweetAndClusterScores(tweetId, clusterId, modelVersion, scores, scoreType) - } - .count("NumAggregatedTweetClusterScores") - } - - def computeTweetTopKClusters( - latestTweetClusterScores: TypedPipe[TweetAndClusterScores], - topK: Int = Configs.topKClustersPerTweet, - scoreThreshold: Double = Configs.scoreThresholdForEntityTopKClustersCache - )( - implicit timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[TweetTopKClustersWithScores] = { - latestTweetClusterScores - .flatMap { v => - val score = v.scores.score.map(_.value).getOrElse(0.0) - if (score < scoreThreshold) { - None - } else { - Some((v.tweetId, v.modelVersion, v.scoreType) -> (v.clusterId, v.scores)) - } - } - .count("NumAggregatedTweetClusterScoresAfterFilteringInTweetTopK") - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - .map { - case ((tweetId, modelVersion, scoreType), topKClusters) => - TweetTopKClustersWithScores(tweetId, modelVersion, topKClusters.toMap, scoreType) - } - .count("NumTweetTopK") - } - - def computeClusterTopKTweets( - latestTweetClusterScores: TypedPipe[TweetAndClusterScores], - topK: Int = Configs.topKTweetsPerCluster, - scoreThreshold: Double = Configs.scoreThresholdForClusterTopKTweetsCache - )( - implicit timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[ClusterTopKTweetsWithScores] = { - latestTweetClusterScores - .flatMap { v => - val score = v.scores.score.map(_.value).getOrElse(0.0) - if (score < scoreThreshold) { - None - } else { - Some((v.clusterId, v.modelVersion, v.scoreType) -> (v.tweetId, v.scores)) - } - } - .count("NumAggregatedTweetClusterScoresAfterFilteringInClusterTopK") - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - .map { - case ((clusterId, modelVersion, scoreType), topKTweets) => - ClusterTopKTweetsWithScores(clusterId, modelVersion, topKTweets.toMap, scoreType) - } - .count("NumClusterTopK") - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobAdhocApp.scala deleted file mode 100644 index 32acbe020..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobAdhocApp.scala +++ /dev/null @@ -1,197 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job - -import com.twitter.scalding._ -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.ClusterTopKTweetsHourlySuffixSource -import com.twitter.simclusters_v2.hdfs_sources.TweetClusterScoresHourlySuffixSource -import com.twitter.simclusters_v2.hdfs_sources.TweetTopKClustersHourlySuffixSource -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJob._ -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import java.util.TimeZone - -/** -scalding remote run --target 
src/scala/com/twitter/simclusters_v2/scalding/offline_job:simclusters_offline_job-adhoc \ ---user cassowary \ ---submitter hadoopnest2.atla.twitter.com \ ---main-class com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobAdhocApp -- \ ---date 2019-08-10 --batch_hours 24 \ ---output_dir /user/cassowary/your_ldap/offline_simcluster_20190810 ---model_version 20M_145K_updated - */ -object SimClustersOfflineJobAdhocApp extends TwitterExecutionApp { - - import SimClustersOfflineJobUtil._ - import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ - - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - // required - val wholeDateRange: DateRange = DateRange.parse(args.list("date")) - val batchSize: Duration = Hours(args.int("batch_hours")) - - val outputDir = args("output_dir") - - val modelVersion = args.getOrElse("model_version", ModelVersions.Model20M145KUpdated) - - val scoringMethod = args.getOrElse("score", "logFav") - - val tweetClusterScoreOutputPath: String = outputDir + "/tweet_cluster_scores" - - val tweetTopKClustersOutputPath: String = outputDir + "/tweet_top_k_clusters" - - val clusterTopKTweetsOutputPath: String = outputDir + "/cluster_top_k_tweets" - - val fullInterestedInData: TypedPipe[(Long, ClustersUserIsInterestedIn)] = - args.optional("interested_in_path") match { - case Some(dir) => - println("Loading InterestedIn from supplied path " + dir) - TypedPipe.from(AdhocKeyValSources.interestedInSource(dir)) - case None => - println("Loading production InterestedIn data") - readInterestedInScalaDataset(wholeDateRange) - } - - val interestedInData: TypedPipe[(Long, ClustersUserIsInterestedIn)] = - fullInterestedInData.filter(_._2.knownForModelVersion == modelVersion) - - val debugExec = Execution.zip( - fullInterestedInData.printSummary("fullInterestedIn", numRecords = 20), - interestedInData.printSummary("interestedIn", numRecords = 20) - ) - - // recursive function to calculate batches one by one - def runBatch(batchDateRange: DateRange): Execution[Unit] = { - if (batchDateRange.start.timestamp > wholeDateRange.end.timestamp) { - Execution.unit // stops here - } else { - - val previousScores = if (batchDateRange.start == wholeDateRange.start) { - TypedPipe.from(Nil) - } else { - TypedPipe.from( - TweetClusterScoresHourlySuffixSource( - tweetClusterScoreOutputPath, - batchDateRange - batchSize - ) - ) - } - - val latestScores = computeAggregatedTweetClusterScores( - batchDateRange, - interestedInData, - readTimelineFavoriteData(batchDateRange), - previousScores - ) - - val writeLatestScoresExecution = { - Execution.zip( - latestScores.printSummary(name = "TweetEntityScores"), - latestScores - .writeExecution( - TweetClusterScoresHourlySuffixSource( - tweetClusterScoreOutputPath, - batchDateRange - ) - ) - ) - } - - val computeTweetTopKExecution = { - val tweetTopK = computeTweetTopKClusters(latestScores) - Execution.zip( - tweetTopK.printSummary(name = "TweetTopK"), - tweetTopK.writeExecution( - TweetTopKClustersHourlySuffixSource(tweetTopKClustersOutputPath, batchDateRange) - ) - ) - } - - val computeClusterTopKExecution = { - val clusterTopK = computeClusterTopKTweets(latestScores) - Execution.zip( - clusterTopK.printSummary(name = "ClusterTopK"), - clusterTopK.writeExecution( - ClusterTopKTweetsHourlySuffixSource(clusterTopKTweetsOutputPath, batchDateRange) - ) - ) - } - - Execution - .zip( - writeLatestScoresExecution, - computeTweetTopKExecution, - computeClusterTopKExecution - ).flatMap 
{ _ => - // run next batch - runBatch(batchDateRange + batchSize) - } - } - } - - // start from the first batch - Util.printCounters( - Execution.zip( - debugExec, - runBatch( - DateRange(wholeDateRange.start, wholeDateRange.start + batchSize - Millisecs(1))) - ) - ) - } - } -} - -/** -For example: -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/offline_job:dump_cluster_topk_job-adhoc \ ---user cassowary ---main-class com.twitter.simclusters_v2.scalding.offline_job.DumpClusterTopKTweetsAdhoc \ ---submitter hadoopnest2.atla.twitter.com -- \ ---date 2019-08-03 \ ---clusterTopKTweetsPath /atla/proc3/user/cassowary/processed/simclusters/cluster_top_k_tweets/ \ ---clusters 4446 - - */ -object DumpClusterTopKTweetsAdhoc extends TwitterExecutionApp { - - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ - import com.twitter.simclusters_v2.summingbird.common.ThriftDecayedValueMonoid._ - - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - val date = DateRange.parse(args.list("date")) - val path = args("clusterTopKTweetsPath") - val input = TypedPipe.from(ClusterTopKTweetsHourlySuffixSource(path, date)) - val clusters = args.list("clusters").map(_.toInt).toSet - - val dvm = SimClustersOfflineJobUtil.thriftDecayedValueMonoid - if (clusters.isEmpty) { - input.printSummary("Cluster top k tweets") - } else { - input - .collect { - case rec if clusters.contains(rec.clusterId) => - val res = rec.topKTweets - .mapValues { x => - x.score - .map { y => - val enriched = new EnrichedThriftDecayedValue(y)(dvm) - enriched.decayToTimestamp(date.end.timestamp).value - }.getOrElse(0.0) - }.toList.sortBy(-_._2) - rec.clusterId + "\t" + Util.prettyJsonMapper - .writeValueAsString(res).replaceAll("\n", " ") - } - .toIterableExecution - .map { strings => println(strings.mkString("\n")) } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobScheduledApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobScheduledApp.scala deleted file mode 100644 index 8be6537d1..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobScheduledApp.scala +++ /dev/null @@ -1,113 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job - -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJob._ -import com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobUtil._ -import com.twitter.simclusters_v2.thriftscala.TweetAndClusterScores -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * The offline job runs every 12 hours, and save these two data sets to HDFS. 
- * - * capesospy-v2 update --build_locally --start_cron \ - * --start_cron offline_tweet_job src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object SimClustersOfflineJobScheduledApp extends ScheduledExecutionApp { - import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ - - private val tweetClusterScoresDatasetPath: String = - "/user/cassowary/processed/simclusters/tweet_cluster_scores" - private val tweetTopKClustersDatasetPath: String = - "/user/cassowary/processed/simclusters/tweet_top_k_clusters" - private val clusterTopKTweetsDatasetPath: String = - "/user/cassowary/processed/simclusters/cluster_top_k_tweets" - - override def batchIncrement: Duration = Hours(12) - - override def firstTime: RichDate = RichDate("2020-05-25") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val previousTweetClusterScores: TypedPipe[TweetAndClusterScores] = - if (firstTime.timestamp == dateRange.start.timestamp) { // if it is the first batch - TypedPipe.from(Nil) - } else { - DAL - .readMostRecentSnapshot( - SimclustersOfflineTweetClusterScoresScalaDataset, - dateRange - batchIncrement - ) - .toTypedPipe - .count("NumPreviousTweetClusterScores") - } - - // we have to use some way to throw away old tweets, otherwise the data set will be growing - // all the time. We only keep the tweets that received at least 1 engagement in the last day. - // This parameter can be adjusted - val tweetsToKeep = getSubsetOfValidTweets(Days(1)) - .count("NumTweetsToKeep") - - val updatedTweetClusterScores = computeAggregatedTweetClusterScores( - dateRange, - readInterestedInScalaDataset(dateRange), - readTimelineFavoriteData(dateRange), - previousTweetClusterScores - ).map { tweetClusterScore => - tweetClusterScore.tweetId -> tweetClusterScore - } - .count("NumUpdatedTweetClusterScoresBeforeFiltering") - .join(tweetsToKeep.asKeys) // filter out invalid tweets - .map { - case (_, (tweetClusterScore, _)) => tweetClusterScore - } - .count("NumUpdatedTweetClusterScores") - .forceToDisk - - val tweetTopKClusters = computeTweetTopKClusters(updatedTweetClusterScores) - .count("NumTweetTopKSaved") - val clusterTopKTweets = computeClusterTopKTweets(updatedTweetClusterScores) - .count("NumClusterTopKSaved") - - val writeTweetClusterScoresExec = updatedTweetClusterScores - .writeDALSnapshotExecution( - SimclustersOfflineTweetClusterScoresScalaDataset, - D.Hourly, // note that we use hourly in order to make it flexible for hourly batch size - D.Suffix(tweetClusterScoresDatasetPath), - D.EBLzo(), - dateRange.end - ) - - val writeTweetTopKClustersExec = tweetTopKClusters - .writeDALSnapshotExecution( - SimclustersOfflineTweetTopKClustersScalaDataset, - D.Hourly, // note that we use hourly in order to make it flexible for hourly batch size - D.Suffix(tweetTopKClustersDatasetPath), - D.EBLzo(), - dateRange.end - ) - - val writeClusterTopKTweetsExec = clusterTopKTweets - .writeDALSnapshotExecution( - SimclustersOfflineClusterTopKTweetsScalaDataset, - D.Hourly, // note that we use hourly in order to make it flexible for hourly batch size - D.Suffix(clusterTopKTweetsDatasetPath), - D.EBLzo(), - dateRange.end - ) - - Execution - .zip(writeTweetClusterScoresExec, writeTweetTopKClustersExec, writeClusterTopKTweetsExec) - .unit - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobUtil.scala 
b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobUtil.scala deleted file mode 100644 index 50e91f5b4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/SimClustersOfflineJobUtil.scala +++ /dev/null @@ -1,97 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job - -import com.twitter.algebird.{DecayedValueMonoid, Monoid, OptionMonoid} -import com.twitter.algebird_internal.thriftscala.{DecayedValue => ThriftDecayedValue} -import com.twitter.scalding.{TypedPipe, _} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, ProcAtla} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{Timestamp, TweetId, UserId} -import com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.summingbird.common.{Configs, ThriftDecayedValueMonoid} -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.timelineservice.thriftscala.{ContextualizedFavoriteEvent, FavoriteEventUnion} -import java.util.TimeZone -import twadoop_config.configuration.log_categories.group.timeline.TimelineServiceFavoritesScalaDataset - -object SimClustersOfflineJobUtil { - - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - implicit val modelVersionOrdering: Ordering[PersistedModelVersion] = - Ordering.by(_.value) - - implicit val scoreTypeOrdering: Ordering[PersistedScoreType] = - Ordering.by(_.value) - - implicit val persistedScoresOrdering: Ordering[PersistedScores] = Ordering.by( - _.score.map(_.value).getOrElse(0.0) - ) - - implicit val decayedValueMonoid: DecayedValueMonoid = DecayedValueMonoid(0.0) - - implicit val thriftDecayedValueMonoid: ThriftDecayedValueMonoid = - new ThriftDecayedValueMonoid(Configs.HalfLifeInMs)(decayedValueMonoid) - - implicit val persistedScoresMonoid: PersistedScoresMonoid = - new PersistedScoresMonoid()(thriftDecayedValueMonoid) - - def readInterestedInScalaDataset( - implicit dateRange: DateRange - ): TypedPipe[(Long, ClustersUserIsInterestedIn)] = { - //read SimClusters InterestedIn datasets - DAL - .readMostRecentSnapshot( - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - dateRange.embiggen(Days(30)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(key, value) => (key, value) - } - } - - def readTimelineFavoriteData( - implicit dateRange: DateRange - ): TypedPipe[(UserId, TweetId, Timestamp)] = { - DAL - .read(TimelineServiceFavoritesScalaDataset, dateRange) // Note: this is a hourly source - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .flatMap { cfe: ContextualizedFavoriteEvent => - cfe.event match { - case FavoriteEventUnion.Favorite(fav) => - Some((fav.userId, fav.tweetId, fav.eventTimeMs)) - case _ => - None - } - - } - } - - class PersistedScoresMonoid( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) - extends Monoid[PersistedScores] { - - private val optionalThriftDecayedValueMonoid = - new OptionMonoid[ThriftDecayedValue]() - - override val zero: PersistedScores = PersistedScores() - - override def plus(x: PersistedScores, y: PersistedScores): PersistedScores = { - PersistedScores( - optionalThriftDecayedValueMonoid.plus( - x.score, - y.score - ) - ) - } - - def build(value: Double, timeInMs: Double): PersistedScores = { - PersistedScores(Some(thriftDecayedValueMonoid.build(value, timeInMs))) - } - } - -} diff --git 
a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/BUILD.bazel deleted file mode 100644 index 437e716d4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/BUILD.bazel +++ /dev/null @@ -1,81 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-only"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/storehaus:algebra", - "3rdparty/src/jvm/com/twitter/storehaus:core", - "graphstore/common:flock_follows-java", - "snowflake:id", - "src/java/com/twitter/ml/api/constant", - "src/java/com/twitter/sbf/graph", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/pluck/source/core_workflows/user_model:condensed_user_state-scala", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/candidate_source", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/offline_job", - "src/scala/com/twitter/simclusters_v2/summingbird/common", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/itl", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/thrift/com/twitter/hermit/candidate:hermit-candidate-scala", - "src/thrift/com/twitter/wtf/scalding/sims:sims-thrift-scala", - "twadoop_config/configuration/log_categories/group/timeline:timeline_service_favorites-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -hadoop_binary( - name = "tweet_embedding-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_job.SimClustersTweetEmbeddingAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":adhoc", - ], -) - -hadoop_binary( - name = "tweet_embedding_evaluation_samples-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_job.TweetSimilarityEvaluationSamplingAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":adhoc", - ], -) - -hadoop_binary( - name = "tweet_embedding_evaluation-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_job.TweetSimilarityEvaluationAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":adhoc", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/README b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/README deleted file mode 100644 index c5f963e67..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/README +++ /dev/null @@ -1,5 +0,0 @@ -To reproduce, you need to run - -1. 
SimClustersTweetEmbeddingAdhocApp to generate cluster -> top tweets and tweet -> top clusters data sets -2. TweetSimilarityEvaluationSamplingAdhocApp to sample a subset of tweets that you want to compute some metrics on -3. TweetSimilarityEvaluationAdhocApp to perform the evaluation diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/SimClustersTweetEmbeddingAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/SimClustersTweetEmbeddingAdhocApp.scala deleted file mode 100644 index 5b2f382af..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/SimClustersTweetEmbeddingAdhocApp.scala +++ /dev/null @@ -1,211 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job.adhoc - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.scalding._ -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, ProcAtla} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{ClusterId, TweetId, UserId} -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2InterestedIn20M145KUpdatedScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.{SparseMatrix, SparseRowMatrix} -import com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobUtil -import com.twitter.simclusters_v2.summingbird.common.{Configs, SimClustersInterestedInUtil} -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import java.util.TimeZone - -/** - * Adhoc job for computing Tweet SimClusters embeddings. - * The output of this job includes two data sets: tweet -> top clusters (or Tweet Embedding), and cluster -> top tweets. - * These data sets are supposed to be the snapshot of the two index at the end of the dataRange you run. - * - * Note that you can also use the output from SimClustersOfflineJobScheduledApp for analysis purpose. - * The outputs from that job might be more close to the data we use in production. - * The benefit of having this job is to keep the flexibility of experiment different ideas. - * - * It is recommended to put at least 2 days in the --date (dataRange in the code) in order to make sure - * we have enough engagement data for tweets have more engagements in the last 1+ days. - * - * - * There are several parameters to tune in the job. They are explained in the inline comments. - * - * - * To run the job: - scalding remote run \ - --target src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc:tweet_embedding-adhoc \ - --user recos-platform \ - --reducers 1000 \ - --main-class com.twitter.simclusters_v2.scalding.offline_job.adhoc.SimClustersTweetEmbeddingAdhocApp -- \ - --date 2021-01-27 2021-01-28 \ - --score_type logFav \ - --output_dir /user/recos-platform/adhoc/tweet_embedding_01_27_28_unnormalized_t9 - */ -object SimClustersTweetEmbeddingAdhocApp extends AdhocExecutionApp { - - import SimClustersOfflineJobUtil._ - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val outputDir = args("output_dir") - - // what interestedIn score to use. 
logFav is what we use in production - val scoringMethod = args.getOrElse("score_type", "logFav") - - // whether to use normalized score in the cluster -> top tweets. - // Currently, we do not do this in production. DONOT turn it on unless you know what you are doing. - // NOTE that for scalding args, "--run_normalized" will just set the arg to be true, and - // even you use "--run_normalized false", it will still be true. - val usingNormalizedScoringFunction = args.boolean("run_normalized") - - // filter out tweets that has less than X favs in the dateRange. - val tweetFavThreshold = args.long("tweet_fav_threshold", 0L) - - // tweet -> top clusters will be saved in this subfolder - val tweetTopKClustersOutputPath: String = outputDir + "/tweet_top_k_clusters" - - // cluster -> top tweets will be saved in this subfolder - val clusterTopKTweetsOutputPath: String = outputDir + "/cluster_top_k_tweets" - - val interestedInData: TypedPipe[(Long, ClustersUserIsInterestedIn)] = - DAL - .readMostRecentSnapshot( - SimclustersV2InterestedIn20M145KUpdatedScalaDataset, - dateRange.embiggen(Days(14)) - ) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { - case KeyVal(key, value) => (key, value) - } - - // read user-tweet fav data. set the weight to be a decayed value. they will be decayed to the dateRang.end - val userTweetFavData: SparseMatrix[UserId, TweetId, Double] = - SparseMatrix(readTimelineFavoriteData(dateRange)).tripleApply { - case (userId, tweetId, timestamp) => - ( - userId, - tweetId, - thriftDecayedValueMonoid - .plus( - thriftDecayedValueMonoid.build(1.0, timestamp), - thriftDecayedValueMonoid.build(0.0, dateRange.end.timestamp) - ) - .value) - } - - // filter out tweets without x favs - val tweetSubset = - userTweetFavData.colNnz.filter( - _._2 > tweetFavThreshold.toDouble - ) // keep tweets with at least x favs - - val userTweetFavDataSubset = userTweetFavData.filterCols(tweetSubset.keys) - - // construct user-simclusters matrix - val userSimClustersInterestedInData: SparseRowMatrix[UserId, ClusterId, Double] = - SparseRowMatrix( - interestedInData.map { - case (userId, clusters) => - val topClustersWithScores = - SimClustersInterestedInUtil - .topClustersWithScores(clusters) - .collect { - case (clusterId, scores) - if scores.favScore > Configs - .favScoreThresholdForUserInterest( - clusters.knownForModelVersion - ) => // this is the same threshold used in the summingbird job - scoringMethod match { - case "fav" => - clusterId -> scores.clusterNormalizedFavScore - case "follow" => - clusterId -> scores.clusterNormalizedFollowScore - case "logFav" => - clusterId -> scores.clusterNormalizedLogFavScore - case _ => - throw new IllegalArgumentException( - "score_type can only be fav, follow or logFav") - } - } - .filter(_._2 > 0.0) - .toMap - userId -> topClustersWithScores - }, - isSkinnyMatrix = true - ) - - // multiply tweet -> user matrix with user -> cluster matrix to get tweet -> cluster matrix - val tweetClusterScoreMatrix = if (usingNormalizedScoringFunction) { - userTweetFavDataSubset.transpose.rowL2Normalize - .multiplySkinnySparseRowMatrix(userSimClustersInterestedInData) - } else { - userTweetFavDataSubset.transpose.multiplySkinnySparseRowMatrix( - userSimClustersInterestedInData) - } - - // get the tweet -> top clusters by taking top K in each row - val tweetTopClusters = tweetClusterScoreMatrix - .sortWithTakePerRow(Configs.topKClustersPerTweet)(Ordering.by(-_._2)) - .fork - - // get the cluster -> top tweets by taking top K in each colum - val 
clusterTopTweets = tweetClusterScoreMatrix - .sortWithTakePerCol(Configs.topKTweetsPerCluster)(Ordering.by(-_._2)) - .fork - - // injections for saving a list - implicit val inj1: Injection[List[(Int, Double)], Array[Byte]] = - Bufferable.injectionOf[List[(Int, Double)]] - implicit val inj2: Injection[List[(Long, Double)], Array[Byte]] = - Bufferable.injectionOf[List[(Long, Double)]] - - // save the data sets and also output to some tsv files for eyeballing the results - Execution - .zip( - tweetTopClusters - .mapValues(_.toList) - .writeExecution( - VersionedKeyValSource[TweetId, List[(ClusterId, Double)]](tweetTopKClustersOutputPath) - ), - tweetTopClusters - .map { - case (tweetId, topKClusters) => - tweetId -> topKClusters - .map { - case (clusterId, score) => - s"$clusterId:" + "%.3g".format(score) - } - .mkString(",") - } - .writeExecution( - TypedTsv(tweetTopKClustersOutputPath + "_tsv") - ), - tweetSubset.writeExecution(TypedTsv(tweetTopKClustersOutputPath + "_tweet_favs")), - clusterTopTweets - .mapValues(_.toList) - .writeExecution( - VersionedKeyValSource[ClusterId, List[(TweetId, Double)]](clusterTopKTweetsOutputPath) - ), - clusterTopTweets - .map { - case (clusterId, topKTweets) => - clusterId -> topKTweets - .map { - case (tweetId, score) => s"$tweetId:" + "%.3g".format(score) - } - .mkString(",") - } - .writeExecution( - TypedTsv(clusterTopKTweetsOutputPath + "_tsv") - ) - ) - .unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/TweetSimilarityEvaluationAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/TweetSimilarityEvaluationAdhocApp.scala deleted file mode 100644 index ea64df6e2..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc/TweetSimilarityEvaluationAdhocApp.scala +++ /dev/null @@ -1,362 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_job.adhoc - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.scalding._ -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.simclusters_v2.common.{ClusterId, CosineSimilarityUtil, TweetId} -import com.twitter.simclusters_v2.scalding.common.matrix.SparseRowMatrix -import com.twitter.simclusters_v2.scalding.offline_job.SimClustersOfflineJobUtil -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import java.util.TimeZone - -/** - * - * A job to sample some tweets for evaluation. - * - * we bucket tweets by the log(# of fav + 1) and randomly pick 1000 for each bucket for evaluation. 
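- * For example, a tweet with 99 favs falls into bucket math.log10(99 + 1).toInt = 2 and a tweet
- * with 10,000 favs into bucket 4; tweets with fewer than 10 favs are excluded up front.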
- * - * to run the job: - * - scalding remote run \ - --target src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc:tweet_embedding_evaluation_samples-adhoc \ - --user recos-platform \ - --reducers 1000 \ - --main-class com.twitter.simclusters_v2.scalding.offline_job.adhoc.TweetSimilarityEvaluationSamplingAdhocApp -- \ - --date 2021-01-27 2021-01-28 \ - --output /user/recos-platform/adhoc/tweet_embedding_01_27_28_sample_tweets - */ -object TweetSimilarityEvaluationSamplingAdhocApp extends AdhocExecutionApp { - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val random = new java.util.Random(args.long("seed", 20200322L)) - - // # of tweets in each bucket - val topK = args.int("bucket_size", 1000) - - val output = args("output") - - SimClustersOfflineJobUtil - .readTimelineFavoriteData(dateRange) - .map { - case (_, tweetId, _) => - tweetId -> 1L - } - .sumByKey - .filter(_._2 >= 10L) // only consider tweets with more than 10 favs - .map { - case (tweetId, tweetFavs) => - val bucket = math.log10(tweetFavs + 1.0).toInt - bucket -> (tweetId, random.nextDouble()) - } - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - .flatMap { - case (bucket, tweets) => - val bucketSize = tweets.length - tweets.map { - case (tweetId, _) => - (tweetId, bucket, bucketSize) - } - } - .writeExecution( - TypedTsv[(Long, Int, Int)](output) - ) - - } -} - -/** - * - * A job for evaluating the performance of an approximate nearest neighbor search method with a brute - * force method. - * - * Evaluation method: - * - * After getting the embeddings for these tweets, we bucketize tweets based on the number of favs they have - * (i.e., math.log10(numFavors).toInt), and then randomly select 1000 tweets from each bucket. - * We do not include tweets with fewer than 10 favs. We compute the nearest neighbors (in terms of cosine similarity) - * for these tweets using the brute force method and use up to top 100 neighbors with the cosine - * similarity score >0.8 for each tweet as ground-truth set G. - * - * We then compute the nearest neighbors for these tweets based on the approximate nearest neighbor search: for each tweet, we find the top clusters, and then find top tweets in each cluster as potential candidates. We rank these potential candidates by the cosine similarity scores and take top 100 as prediction set P. We evaluate the precision and recall using - * - * Precision = |P \intersect G| / |P| - * Recall = |P \intersect G| / |G| - * - * Note that |P| and |G| can be different, when there are not many neighbors returned. 
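- * For example (hypothetical counts): if |G| = 80, |P| = 100 and |P \intersect G| = 60, then
- * Precision = 60 / 100 = 0.6 and Recall = 60 / 80 = 0.75.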
- * - scalding remote run \ - --target src/scala/com/twitter/simclusters_v2/scalding/offline_job/adhoc:tweet_embedding_evaluation-adhoc \ - --user recos-platform \ - --reducers 1000 \ - --main-class com.twitter.simclusters_v2.scalding.offline_job.adhoc.TweetSimilarityEvaluationAdhocApp -- \ - --date 2021-01-27 \ - --tweet_top_k /user/recos-platform/adhoc/tweet_embedding_01_27_28_unnormalized_t9/tweet_top_k_clusters \ - --cluster_top_k /user/recos-platform/adhoc/tweet_embedding_01_27_28_unnormalized_t9/cluster_top_k_tweets \ - --tweets /user/recos-platform/adhoc/tweet_embedding_01_27_28_sample_tweets \ - --output /user/recos-platform/adhoc/tweet_embedding_evaluation_01_27_28_t05_k50_1 - */ -object TweetSimilarityEvaluationAdhocApp extends AdhocExecutionApp { - - implicit val inj1: Injection[List[(Int, Double)], Array[Byte]] = - Bufferable.injectionOf[List[(Int, Double)]] - implicit val inj2: Injection[List[(Long, Double)], Array[Byte]] = - Bufferable.injectionOf[List[(Long, Double)]] - - // Take top 20 candidates, the score * 100 - private def formatList(candidates: Seq[(TweetId, Double)]): Seq[(TweetId, Int)] = { - candidates.take(10).map { - case (clusterId, score) => - (clusterId, (score * 100).toInt) - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - // path to read the tweet -> top cluster data set. should be the same from the SimClustersTweetEmbeddingAdhocApp job - val tweetTopKClustersPath = args("tweet_top_k") - - // path to read the cluster -> top tweets data set. should be the same from the SimClustersTweetEmbeddingAdhocApp job - val clusterTopKTweetsPath = args("cluster_top_k") - - // path to read the sampled tweets, should be the same from TweetSimilarityEvaluationSamplingAdhocApp - val tweetsPath = args("tweets") - - // see the comment of this class. this is to determine which tweet should be ground truth - val threshold = args.double("threshold", 0.8) - - // see the comment of this class. this is to determine which tweet should be ground truth - val topK = args.int("topK", 100) - - // output path for evaluation results - val output = args("output") - - // read tweet -> top clusters data set - val tweetTopKClusters: SparseRowMatrix[TweetId, ClusterId, Double] = - SparseRowMatrix( - TypedPipe - .from( - VersionedKeyValSource[TweetId, List[(ClusterId, Double)]](tweetTopKClustersPath) - ) - .mapValues(_.filter(_._2 > 0.001).toMap), - isSkinnyMatrix = true - ).rowL2Normalize - - // read cluster -> top tweets data set - val clusterTopTweets: SparseRowMatrix[ClusterId, TweetId, Double] = - SparseRowMatrix( - TypedPipe - .from( - VersionedKeyValSource[ClusterId, List[(TweetId, Double)]](clusterTopKTweetsPath) - ) - .mapValues(_.filter(_._2 > 0.02).toMap), - isSkinnyMatrix = false - ) - - // read the sampled tweets from TweetSimilarityEvaluationSamplingAdhocApp - val tweetSubset = TypedPipe.from(TypedTsv[(Long, Int, Int)](tweetsPath)) - - // the tweet -> top clusters for the sampled tweets - val tweetEmbeddingSubset = - tweetTopKClusters.filterRows(tweetSubset.map(_._1)) - - // compute ground-truth top similar tweets for each sampled tweets. - // for each sampled tweets, we compute their similarity with every tweets in the tweet -> top clusters data set. 
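-    // (Since tweetTopKClusters was rowL2Normalize-d when it was read and tweetEmbeddingSubset is
-    // a row subset of it, the dot products computed here should be cosine similarities.)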
- // we filter out those with similarity score smaller than the threshold and keep top k as the ground truth similar tweets - val groundTruthData = tweetTopKClusters.toSparseMatrix - .multiplySkinnySparseRowMatrix( - tweetEmbeddingSubset.toSparseMatrix.transpose.toSparseRowMatrix(true), - numReducersOpt = Some(5000) - ) - .toSparseMatrix - .transpose - .filter((_, _, v) => v > threshold) - .sortWithTakePerRow(topK)(Ordering.by(-_._2)) - - // compute approximate similar tweets for each sampled tweets. - // this is achieved by multiplying "sampled_tweets -> top clusters" matrix with "cluster -> top tweets" matrix. - // note that in the implementation, we first compute the transponse of this matrix in order to ultlize the optimization done on skinny matrices - val predictionData = clusterTopTweets.toSparseMatrix.transpose - .multiplySkinnySparseRowMatrix( - tweetEmbeddingSubset.toSparseMatrix.transpose.toSparseRowMatrix(true), - numReducersOpt = Some(5000) - ) - .toSparseMatrix - .transpose - .toTypedPipe - .map { - case (queryTweet, candidateTweet, _) => - (queryTweet, candidateTweet) - } - .join(tweetEmbeddingSubset.toTypedPipe) - .map { - case (queryId, (candidateId, queryEmbedding)) => - candidateId -> (queryId, queryEmbedding) - } - .join(tweetTopKClusters.toTypedPipe) - .map { - case (candidateId, ((queryId, queryEmbedding), candidateEmbedding)) => - queryId -> (candidateId, CosineSimilarityUtil - .dotProduct( - queryEmbedding, - candidateEmbedding - )) - } - .filter(_._2._2 > threshold) - .group - .sortedReverseTake(topK)(Ordering.by(_._2)) - - // Exist in Ground Truth but not exist in Predication - val potentialData = - groundTruthData - .leftJoin(predictionData) - .map { - case (tweetId, (groundTruthCandidates, predictedCandidates)) => - val predictedCandidateSet = predictedCandidates.toSeq.flatten.map(_._1).toSet - val potentialTweets = groundTruthCandidates.filterNot { - case (candidateId, _) => - predictedCandidateSet.contains(candidateId) - } - (tweetId, potentialTweets) - } - - val debuggingData = - groundTruthData - .leftJoin(predictionData) - .map { - case (tweetId, (groundTruthTweets, maybepredictedTweets)) => - val predictedTweets = maybepredictedTweets.toSeq.flatten - val predictedTweetSet = predictedTweets.map(_._1).toSet - val potentialTweets = groundTruthTweets.filterNot { - case (candidateId, _) => - predictedTweetSet.contains(candidateId) - } - - ( - tweetId, - Seq( - formatList(potentialTweets), - formatList(groundTruthTweets), - formatList(predictedTweets))) - } - - // for each tweet, compare the approximate topk and ground-truth topk. - // compute precision and recall, then averaging them per bucket. 
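-    // (The six metric components are summed per (bucket, bucketSize) key; dividing each by
-    // `count`, the number of sampled tweets in the bucket, yields the per-bucket means.)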
- val eval = tweetSubset - .map { - case (tweetId, bucket, bucketSize) => - tweetId -> (bucket, bucketSize) - } - .leftJoin(groundTruthData) - .leftJoin(predictionData) - .map { - case (_, (((bucket, bucketSize), groundTruthOpt), predictionOpt)) => - val groundTruth = groundTruthOpt.getOrElse(Nil).map(_._1) - val prediction = predictionOpt.getOrElse(Nil).map(_._1) - - assert(groundTruth.distinct.size == groundTruth.size) - assert(prediction.distinct.size == prediction.size) - - val intersection = groundTruth.toSet.intersect(prediction.toSet) - - val precision = - if (prediction.nonEmpty) - intersection.size.toDouble / prediction.size.toDouble - else 0.0 - val recall = - if (groundTruth.nonEmpty) - intersection.size.toDouble / groundTruth.size.toDouble - else 0.0 - - ( - bucket, - bucketSize) -> (groundTruth.size, prediction.size, intersection.size, precision, recall, 1.0) - } - .sumByKey - .map { - case ( - (bucket, bucketSize), - (groundTruthSum, predictionSum, interSectionSum, precisionSum, recallSum, count)) => - ( - bucket, - bucketSize, - groundTruthSum / count, - predictionSum / count, - interSectionSum / count, - precisionSum / count, - recallSum / count, - count) - } - - // output the eval results and some sample results for eyeballing - Execution - .zip( - eval - .writeExecution(TypedTsv(output)), - groundTruthData - .map { - case (tweetId, neighbors) => - tweetId -> neighbors - .map { - case (id, score) => s"$id:$score" - } - .mkString(",") - } - .writeExecution( - TypedTsv(args("output") + "_ground_truth") - ), - predictionData - .map { - case (tweetId, neighbors) => - tweetId -> neighbors - .map { - case (id, score) => s"$id:$score" - } - .mkString(",") - } - .writeExecution( - TypedTsv(args("output") + "_prediction") - ), - potentialData - .map { - case (tweetId, neighbors) => - tweetId -> neighbors - .map { - case (id, score) => s"$id:$score" - } - .mkString(",") - }.writeExecution( - TypedTsv(args("output") + "_potential") - ), - debuggingData - .map { - case (tweetId, candidateList) => - val value = candidateList - .map { candidates => - candidates - .map { - case (id, score) => - s"${id}D$score" - }.mkString("C") - }.mkString("B") - s"${tweetId}A$value" - }.writeExecution( - TypedTsv(args("output") + "_debugging") - ) - ) - .unit - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/BUILD.bazel deleted file mode 100644 index f1293bac2..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/BUILD.bazel +++ /dev/null @@ -1,27 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-only"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/scalding", - "tweetsource/public_tweets/src/main/scala/com/twitter/tweetsource/public_tweets:public_tweets-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - ], -) - -hadoop_binary( - name = "offline_cluster_top_media_tweets_20M_145K_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.offline_tweets.AdhocClusterTopTweetsJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":offline_tweets", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/ClusterTopMediaTweetsJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/ClusterTopMediaTweetsJob.scala 
deleted file mode 100644 index f966a6a93..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/ClusterTopMediaTweetsJob.scala +++ /dev/null @@ -1,267 +0,0 @@ -package com.twitter.simclusters_v2.scalding.offline_tweets - -import com.twitter.algebird.Aggregator.size -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.Args -import com.twitter.scalding.DateOps -import com.twitter.scalding.DateParser -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import com.twitter.scalding.Execution -import com.twitter.scalding.Hours -import com.twitter.scalding.RichDate -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite.WriteExtension -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.Timestamp -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.hdfs_sources.DataPaths -import com.twitter.simclusters_v2.hdfs_sources.OfflineClusterTopMediaTweets20M145K2020ScalaDataset -import com.twitter.simclusters_v2.scalding.common.LogFavBasedPersistentTweetEmbeddingMhExportSource -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.DayPartitionedClusterId -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.TweetWithScore -import com.twitter.simclusters_v2.thriftscala.TweetsWithScore -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.tweetsource.common.thriftscala.MediaType -import com.twitter.tweetsource.common.thriftscala.UnhydratedFlatTweet -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import java.text.SimpleDateFormat - -object ClusterTopTweetsJob { - - def serviceIdentifier(zone: String, env: String): ServiceIdentifier = ServiceIdentifier( - role = "cassowary", - service = "offline_cluster_top_media_tweets_20M_145K_2020", - environment = env, - zone = zone - ) - - private def isMediaTweet(tweet: UnhydratedFlatTweet): Boolean = { - tweet.media.exists { mediaSeq => - mediaSeq.exists { e => - e.mediaType.contains(MediaType.Video) - } - } - } - - private val dateFormatter = new SimpleDateFormat("yyyy-MM-dd") - - def getClusterTopMediaTweets( - persistentEmbeddingPipe: TypedPipe[((TweetId, Timestamp), PersistentSimClustersEmbedding)], - tweetSourcePipe: TypedPipe[UnhydratedFlatTweet], - maxTweetsPerClusterPerPartition: Int - ): TypedPipe[(DayPartitionedClusterId, Seq[(TweetId, Double)])] = { - val mediaTweetsPipe = tweetSourcePipe.collect { - case tweet if isMediaTweet(tweet) => (tweet.tweetId, ()) - } - - val tweetEmbeddingsPipe: TypedPipe[(TweetId, (Int, Double))] = { - persistentEmbeddingPipe.collect { - case ((tweetId, timestamp), persistentEmbedding) - if timestamp == 1L => // 1L is the longest L2 embedding - - persistentEmbedding.embedding.embedding.map { clusterWithScore => - (tweetId, (clusterWithScore.clusterId, clusterWithScore.score)) - } - }.flatten - } - - mediaTweetsPipe - .join(tweetEmbeddingsPipe) - .withReducers(2000) - .map { - case (tweetId, ((), (clusterId, score))) => - val 
dayPartition = dateFormatter.format(SnowflakeId(tweetId).time.inMilliseconds) - ((clusterId, dayPartition), Seq((tweetId, score))) - } - .sumByKey - .mapValues(_.sortBy(-_._2).take(maxTweetsPerClusterPerPartition)) - .map { case ((cid, partition), values) => (DayPartitionedClusterId(cid, partition), values) } - } - - // Convert to Manhattan compatible format - def toKeyVal( - clusterTopTweets: TypedPipe[(DayPartitionedClusterId, Seq[(TweetId, Double)])], - ): TypedPipe[KeyVal[DayPartitionedClusterId, TweetsWithScore]] = { - clusterTopTweets.map { - case (key, tweetsWithScores) => - val thrift = tweetsWithScores.map { t => TweetWithScore(t._1, t._2) } - KeyVal(key, TweetsWithScore(thrift)) - } - } -} - -/** - * Scheduled job. Runs every couple of hours (check the .yaml for exact cron schedule). - * Reads 21 days of tweets, and the most recent persistent tweet embeddings from a Manhattan dump. - * It outputs a clusterId-> List[tweetId] index. - -capesospy-v2 update --build_locally --start_cron \ -offline_cluster_top_media_tweets_20M_145K_2020 src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object ClusterTopMediaTweets20M145K2020BatchJob extends ScheduledExecutionApp { - override def firstTime: RichDate = RichDate("2021-08-29") - - override def batchIncrement: Duration = Hours(3) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - // max public tweet has 21 days. read 1 day fewer go give some buffer - val lookbackDateRange = dateRange.prepend(Days(21)) - - val tweetSource: TypedPipe[UnhydratedFlatTweet] = - ExternalDataSources.flatTweetsSource(lookbackDateRange) - - val persistentEmbeddingPipe: TypedPipe[ - ((TweetId, Timestamp), PersistentSimClustersEmbedding) - ] = - TypedPipe.from( - new LogFavBasedPersistentTweetEmbeddingMhExportSource( - range = lookbackDateRange, - serviceIdentifier = ClusterTopTweetsJob.serviceIdentifier(args("zone"), args("env")) - )) - - val maxTweetsPerClusterPerPartition = 1200 - - val dailyClusterTopTweets = ClusterTopTweetsJob.getClusterTopMediaTweets( - persistentEmbeddingPipe, - tweetSource, - maxTweetsPerClusterPerPartition - ) - - val keyValPipe: TypedPipe[KeyVal[DayPartitionedClusterId, TweetsWithScore]] = - ClusterTopTweetsJob.toKeyVal(dailyClusterTopTweets) - - keyValPipe - .writeDALVersionedKeyValExecution( - OfflineClusterTopMediaTweets20M145K2020ScalaDataset, - D.Suffix(DataPaths.OfflineClusterTopMediaTweets2020DatasetPath) - ) - } -} - -/** -Adhoc debugging job. Uses Entity Embeddings dataset to infer user interests - -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/ &&\ -scalding remote run \ - --main-class com.twitter.simclusters_v2.scalding.offline_tweets.AdhocClusterTopMediaTweetsJob \ - --target src/scala/com/twitter/simclusters_v2/scalding/offline_tweets/:offline_cluster_top_media_tweets_20M_145K_2020-adhoc \ - --user cassowary \ - -- --output_dir /scratch_user/cassowary/your_ldap --date 2021-08-30 --zone atla --env prod --email your_ldap@twitter.com - */ -object AdhocClusterTopMediaTweetsJob extends AdhocExecutionApp { - - /** - * Run some stat analysis on the results, such as the number of tweets in a cluster, tweet score - * distributions, etc. - * - * Ideally works on 1 day data only. 
If multiple days data are passed in, it'll aggregate over - * multiple days anyway - */ - def analyzeClusterResults( - clusterTopTweets: TypedPipe[(DayPartitionedClusterId, Seq[(TweetId, Double)])] - ): Execution[String] = { - - val tweetSizeExec = Util.printSummaryOfNumericColumn( - clusterTopTweets.map { case (_, tweets) => tweets.size }, - columnName = Some("Tweet size distribution of clusters") - ) - - val scoreDistExec = Util.printSummaryOfNumericColumn( - clusterTopTweets.flatMap(_._2.map(_._2)), - columnName = Some("Score distribution of the tweets") - ) - - val numClustersExec = - clusterTopTweets.map(_._1._1).distinct.aggregate(size).getOrElseExecution(0L) - - val numTweetsExec = - clusterTopTweets.flatMap(_._2.map(_._1)).distinct.aggregate(size).getOrElseExecution(0L) - - Execution.zip(tweetSizeExec, scoreDistExec, numClustersExec, numTweetsExec).map { - case (tweetSizeDist, scoreDist, numClusters, numTweets) => - s""" - |Number of unique tweets = $numTweets - |Number of clusters = $numClusters - |------------------------ - |$tweetSizeDist - |------------------------ - |$scoreDist - |""".stripMargin - } - } - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val startTime = System.currentTimeMillis() - Execution.withArgs { args => - Execution.getMode.flatMap { implicit mode => - implicit val dateRange: DateRange = - DateRange.parse(args.list("date"))(DateOps.UTC, DateParser.default) - - val outputDir = args("output_dir") - - val maxTweetsPerCluster = 100 - - // max public tweet has 21 days. read 1 day fewer go give some buffer - val lookbackDateRange = dateRange.prepend(Days(21)) - - val tweetSource: TypedPipe[UnhydratedFlatTweet] = - ExternalDataSources.flatTweetsSource(lookbackDateRange) - - val persistentEmbeddingPipe: TypedPipe[ - ((TweetId, Timestamp), PersistentSimClustersEmbedding) - ] = - TypedPipe.from( - new LogFavBasedPersistentTweetEmbeddingMhExportSource( - range = lookbackDateRange, - serviceIdentifier = ClusterTopTweetsJob.serviceIdentifier(args("zone"), args("env")) - )) - - val results = ClusterTopTweetsJob.getClusterTopMediaTweets( - persistentEmbeddingPipe, - tweetSource, - maxTweetsPerCluster - ) - analyzeClusterResults(TypedPipe.empty) - .flatMap { distributions => - val timeTakenMin = (System.currentTimeMillis() - startTime) / 60000 - val text = - s""" - | AdhocClusterTopMediaTweetsJob finished on: $dateRange. - | Time taken: $timeTakenMin minutes. - | maxTweetsPerCluster: $maxTweetsPerCluster. 
- | output_dir: $outputDir - | - | $distributions - """.stripMargin - Util.sendEmail(text, "AdhocClusterTopMediaTweetsJob finished.", args("email")) - - results - .writeExecution(TypedTsv(outputDir)) - } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/optout/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/optout/BUILD.bazel deleted file mode 100644 index af06cdd1a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/optout/BUILD.bazel +++ /dev/null @@ -1,81 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "src/scala/com/twitter/octain/p13n/batch:p13n_preferences-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:data_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/inferred_entities", - "src/scala/com/twitter/wtf/scalding/jobs/common:cassowary_job", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - ], -) - -hadoop_binary( - name = "known_for_optout-adhoc", - main = "com.twitter.simclusters_v2.scalding.optout.KnownForOptOutAdhocJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":optout", - ], -) - -hadoop_binary( - name = "known_for_optout_daily", - main = "com.twitter.simclusters_v2.scalding.optout.KnownForOptOutDailyBatchJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":optout", - ], -) - -hadoop_binary( - name = "interested_in_optout-adhoc", - main = "com.twitter.simclusters_v2.scalding.optout.InterestedInOptOutAdhocJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - "known-to-fail-jira:SD-14439", - ], - dependencies = [ - ":optout", - ], -) - -hadoop_binary( - name = "interested_in_optout_daily", - main = "com.twitter.simclusters_v2.scalding.optout.InterestedInOptOutDailyBatchJob", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":optout", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/optout/InterestedInOptOut.scala b/src/scala/com/twitter/simclusters_v2/scalding/optout/InterestedInOptOut.scala deleted file mode 100644 index 3e24c7b0c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/optout/InterestedInOptOut.scala +++ /dev/null @@ -1,269 +0,0 @@ -package com.twitter.simclusters_v2.scalding.optout - -import com.twitter.dal.client.dataset.{KeyValDALDataset, SnapshotDALDataset} -import com.twitter.scalding.{ - Args, - DateRange, - Days, - Duration, - Execution, - RichDate, - TypedPipe, - TypedTsv, - UniqueID -} -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{ClusterId, ModelVersions, SemanticCoreEntityId, UserId} -import com.twitter.simclusters_v2.hdfs_sources._ -import 
com.twitter.simclusters_v2.scalding.inferred_entities.InferredEntities -import com.twitter.simclusters_v2.thriftscala.{ - ClusterType, - ClustersUserIsInterestedIn, - SemanticCoreEntityWithScore, - UserToInterestedInClusters -} -import com.twitter.wtf.scalding.jobs.common.{AdhocExecutionApp, ScheduledExecutionApp} -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import java.util.TimeZone - -object InterestedInOptOut { - - def filterOptedOutInterestedIn( - interestedInPipe: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - optedOutEntities: TypedPipe[(UserId, Set[SemanticCoreEntityId])], - clusterToEntities: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] - ): TypedPipe[(UserId, ClustersUserIsInterestedIn)] = { - - val validInterestedIn = SimClustersOptOutUtil.filterOptedOutClusters( - userToClusters = interestedInPipe.mapValues(_.clusterIdToScores.keySet.toSeq), - optedOutEntities = optedOutEntities, - legibleClusters = clusterToEntities - ) - - interestedInPipe - .leftJoin(validInterestedIn) - .mapValues { - case (originalInterestedIn, validInterestedInOpt) => - val validInterestedIn = validInterestedInOpt.getOrElse(Seq()).toSet - - originalInterestedIn.copy( - clusterIdToScores = originalInterestedIn.clusterIdToScores.filterKeys(validInterestedIn) - ) - } - .filter(_._2.clusterIdToScores.nonEmpty) - } - - /** - * Writes InterestedIn data to HDFS - */ - def writeInterestedInOutputExecution( - interestedIn: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - interestedInDataset: KeyValDALDataset[KeyVal[Long, ClustersUserIsInterestedIn]], - outputPath: String - ): Execution[Unit] = { - interestedIn - .map { case (k, v) => KeyVal(k, v) } - .writeDALVersionedKeyValExecution( - interestedInDataset, - D.Suffix(outputPath) - ) - } - - /** - * Convert InterestedIn to thrift structs, then write to HDFS - */ - def writeInterestedInThriftOutputExecution( - interestedIn: TypedPipe[(UserId, ClustersUserIsInterestedIn)], - modelVersion: String, - interestedInThriftDatset: SnapshotDALDataset[UserToInterestedInClusters], - thriftOutputPath: String, - dateRange: DateRange - ): Execution[Unit] = { - interestedIn - .map { - case (userId, clusters) => - UserToInterestedInClusters(userId, modelVersion, clusters.clusterIdToScores) - } - .writeDALSnapshotExecution( - interestedInThriftDatset, - D.Daily, - D.Suffix(thriftOutputPath), - D.EBLzo(), - dateRange.end - ) - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron interested_in_optout_daily \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object InterestedInOptOutDailyBatchJob extends ScheduledExecutionApp { - - override def firstTime: RichDate = RichDate("2019-11-24") - - override def batchIncrement: Duration = Days(1) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val userOptoutEntities = - SimClustersOptOutUtil - .getP13nOptOutSources(dateRange.embiggen(Days(4)), ClusterType.InterestedIn) - .count("num_users_with_optouts") - .forceToDisk - - val interestedIn2020Pipe = InterestedInSources - .simClustersRawInterestedIn2020Source(dateRange, timeZone) - .count("num_users_with_2020_interestedin") - - val interestedInLite2020Pipe = InterestedInSources - .simClustersRawInterestedInLite2020Source(dateRange, timeZone) - .count("num_users_with_2020_interestedin_lite") - - val clusterToEntities = InferredEntities - 
.getLegibleEntityEmbeddings(dateRange.prepend(Days(21)), timeZone) - .count("num_cluster_to_entities") - - val filtered2020InterestedIn = InterestedInOptOut - .filterOptedOutInterestedIn(interestedIn2020Pipe, userOptoutEntities, clusterToEntities) - .count("num_users_with_compliant_2020_interestedin") - - val write2020Exec = InterestedInOptOut.writeInterestedInOutputExecution( - filtered2020InterestedIn, - SimclustersV2InterestedIn20M145K2020ScalaDataset, - DataPaths.InterestedIn2020Path - ) - - val write2020ThriftExec = InterestedInOptOut.writeInterestedInThriftOutputExecution( - filtered2020InterestedIn, - ModelVersions.Model20M145K2020, - SimclustersV2UserToInterestedIn20M145K2020ScalaDataset, - DataPaths.InterestedIn2020ThriftPath, - dateRange - ) - - val sanityCheck2020Exec = SimClustersOptOutUtil.sanityCheckAndSendEmail( - oldNumClustersPerUser = interestedIn2020Pipe.map(_._2.clusterIdToScores.size), - newNumClustersPerUser = filtered2020InterestedIn.map(_._2.clusterIdToScores.size), - modelVersion = ModelVersions.Model20M145K2020, - alertEmail = SimClustersOptOutUtil.AlertEmail - ) - - val filtered2020InterestedInLite = InterestedInOptOut - .filterOptedOutInterestedIn(interestedInLite2020Pipe, userOptoutEntities, clusterToEntities) - .count("num_users_with_compliant_2020_interestedin_lite") - - val write2020LiteExec = InterestedInOptOut.writeInterestedInOutputExecution( - filtered2020InterestedInLite, - SimclustersV2InterestedInLite20M145K2020ScalaDataset, - DataPaths.InterestedInLite2020Path - ) - - val write2020LiteThriftExec = InterestedInOptOut.writeInterestedInThriftOutputExecution( - filtered2020InterestedInLite, - ModelVersions.Model20M145K2020, - SimclustersV2UserToInterestedInLite20M145K2020ScalaDataset, - DataPaths.InterestedInLite2020ThriftPath, - dateRange - ) - - val sanityCheck2020LiteExec = SimClustersOptOutUtil.sanityCheckAndSendEmail( - oldNumClustersPerUser = interestedInLite2020Pipe.map(_._2.clusterIdToScores.size), - newNumClustersPerUser = filtered2020InterestedInLite.map(_._2.clusterIdToScores.size), - modelVersion = ModelVersions.Model20M145K2020, - alertEmail = SimClustersOptOutUtil.AlertEmail - ) - - Util.printCounters( - Execution.zip( - Execution.zip( - write2020Exec, - write2020ThriftExec, - sanityCheck2020Exec), - Execution.zip( - write2020LiteExec, - write2020LiteThriftExec, - sanityCheck2020LiteExec - ) - ) - ) - } -} - -/** - * For debugging only. 
Does a filtering run and prints the differences before/after the opt out - - scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/optout:interested_in_optout-adhoc \ - --user cassowary --cluster bluebird-qus1 \ - --main-class com.twitter.simclusters_v2.scalding.optout.InterestedInOptOutAdhocJob -- \ - --keytab /var/lib/tss/keys/fluffy/keytabs/client/cassowary.keytab \ - --principal service_acoount@TWITTER.BIZ \ - -- \ - --outputDir /user/cassowary/adhoc/interestedin_optout \ - --date 2020-09-03 - */ -object InterestedInOptOutAdhocJob extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val outputDir = args("outputDir") - - val interestedInPipe = InterestedInSources - .simClustersInterestedInUpdatedSource(dateRange, timeZone) - .count("num_users_with_interestedin") - - val userOptoutEntities: TypedPipe[(UserId, Set[SemanticCoreEntityId])] = - SimClustersOptOutUtil - .getP13nOptOutSources(dateRange.embiggen(Days(4)), ClusterType.InterestedIn) - .count("num_users_with_optouts") - - val clusterToEntities = InferredEntities - .getLegibleEntityEmbeddings(dateRange, timeZone) - .count("num_cluster_to_entities") - - val filteredInterestedInPipe = InterestedInOptOut - .filterOptedOutInterestedIn( - interestedInPipe, - userOptoutEntities, - clusterToEntities - ) - .count("num_users_with_interestedin_after_optout") - - val output = interestedInPipe - .join(filteredInterestedInPipe) - .filter { - case (userId, (originalInterestedIn, filtered)) => - originalInterestedIn.clusterIdToScores != filtered.clusterIdToScores - } - .join(userOptoutEntities) - .map { - case (userId, ((originalInterestedIn, filtered), optoutEntities)) => - Seq( - "userId=" + userId, - "originalInterestedInVersion=" + originalInterestedIn.knownForModelVersion, - "originalInterestedIn=" + originalInterestedIn.clusterIdToScores.keySet, - "filteredInterestedIn=" + filtered.knownForModelVersion, - "filteredInterestedIn=" + filtered.clusterIdToScores.keySet, - "optoutEntities=" + optoutEntities - ).mkString("\t") - } - - Util.printCounters( - output.writeExecution(TypedTsv(outputDir)) - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/optout/KnownForOptOut.scala b/src/scala/com/twitter/simclusters_v2/scalding/optout/KnownForOptOut.scala deleted file mode 100644 index 621e7f994..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/optout/KnownForOptOut.scala +++ /dev/null @@ -1,198 +0,0 @@ -package com.twitter.simclusters_v2.scalding.optout - -import com.twitter.scalding.Args -import com.twitter.scalding.DateRange -import com.twitter.scalding.Days -import com.twitter.scalding.Duration -import com.twitter.scalding.Execution -import com.twitter.scalding.RichDate -import com.twitter.scalding.TypedPipe -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.SemanticCoreEntityId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import 
com.twitter.simclusters_v2.hdfs_sources._ -import com.twitter.simclusters_v2.thriftscala.ClusterType -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsKnownFor -import com.twitter.simclusters_v2.thriftscala.SemanticCoreEntityWithScore -import com.twitter.simclusters_v2.thriftscala.UserToKnownForClusters -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import com.twitter.simclusters_v2.scalding.common.TypedRichPipe._ -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.scalding.inferred_entities.InferredEntities - -/** - * Creates opt-out compliant KnownFor datasets based on plain user -> KnownFor data and users' - * opt-out selections from YourTwitterData. In essence, we remove any cluster whose inferred - * entities were opted out by the user. - * The opted out KnownFor dataset should be the default dataset to be consumed, instead of the - * plain KnownFor, which is not opt-out compliant. - */ -object KnownForOptOut { - - def filterOptedOutKnownFor( - knownForPipe: TypedPipe[(UserId, ClustersUserIsKnownFor)], - optedOutEntities: TypedPipe[(UserId, Set[SemanticCoreEntityId])], - clusterToEntities: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] - ): TypedPipe[(UserId, ClustersUserIsKnownFor)] = { - - val validKnownFor = SimClustersOptOutUtil.filterOptedOutClusters( - userToClusters = knownForPipe.mapValues(_.clusterIdToScores.keySet.toSeq), - optedOutEntities = optedOutEntities, - legibleClusters = clusterToEntities - ) - - knownForPipe - .leftJoin(validKnownFor) - .mapValues { - case (originalKnownFors, validKnownForOpt) => - val validKnownFor = validKnownForOpt.getOrElse(Seq()).toSet - - originalKnownFors.copy( - clusterIdToScores = originalKnownFors.clusterIdToScores.filterKeys(validKnownFor) - ) - } - .filter(_._2.clusterIdToScores.nonEmpty) - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ - --start_cron known_for_optout_daily \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ -object KnownForOptOutDailyBatchJob extends ScheduledExecutionApp { - override def firstTime: RichDate = RichDate("2021-03-29") - - override def batchIncrement: Duration = Days(1) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val optedOutEntitiesPipe = SimClustersOptOutUtil - .getP13nOptOutSources(dateRange.embiggen(Days(2)), ClusterType.KnownFor) - .forceToDisk - - val clusterToEntitiesPipe = InferredEntities.getLegibleEntityEmbeddings(dateRange, timeZone) - - val knownFor2020 = DAL - .readMostRecentSnapshot( - SimclustersV2RawKnownFor20M145K2020ScalaDataset, - dateRange.embiggen(Days(10))) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { case KeyVal(k, v) => (k, v) } - .count("num_users_with_2020_knownfor") - - val filtered2020KnownForExec = { - val filtered2020KnownForData = KnownForOptOut - .filterOptedOutKnownFor( - knownForPipe = knownFor2020, - optedOutEntities = optedOutEntitiesPipe, - clusterToEntities = clusterToEntitiesPipe - ) - .count("num_users_with_compliant_2020_knownfor") - .forceToDisk - - Execution - .zip( - filtered2020KnownForData - .map { case (k, v) => KeyVal(k, v) } - .writeDALVersionedKeyValExecution( - SimclustersV2KnownFor20M145K2020ScalaDataset, - D.Suffix(DataPaths.KnownFor2020Path) - ), - filtered2020KnownForData - .map { - case (userId, 
ClustersUserIsKnownFor(modelVersion, clusters)) => - UserToKnownForClusters(userId, modelVersion, clusters) - } - .writeDALSnapshotExecution( - dataset = SimclustersV2KnownFor20M145K2020ThriftScalaDataset, - updateStep = D.Daily, - pathLayout = D.Suffix(DataPaths.KnownFor2020ThriftDatasetPath), - fmt = D.Parquet, - endDate = dateRange.end - ) - ).unit - } - - Util.printCounters(filtered2020KnownForExec) - - } -} - -/** - * For debugging only. Does a filtering run and prints the differences before/after the opt out -./bazel bundle src/scala/com/twitter/simclusters_v2/scalding/optout:knownfor_optout-adhoc && \ - oscar hdfs --user recos-platform --screen --tee your_ldap \ - --bundle knownfor_optout-adhoc \ - --tool com.twitter.simclusters_v2.scalding.optout.KnownForOptOutAdhocJob \ - -- --date 2019-10-12 - */ -object KnownForOptOutAdhocJob extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val knownForPipe = DAL - .readMostRecentSnapshotNoOlderThan(SimclustersV2RawKnownFor20M145KDec11ScalaDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .map { case KeyVal(k, v) => (k, v) } - .count("num_users_with_knownfor") - - val userOptoutEntities: TypedPipe[(UserId, Set[SemanticCoreEntityId])] = - SimClustersOptOutUtil - .getP13nOptOutSources(dateRange.embiggen(Days(4)), ClusterType.KnownFor) - .count("num_users_with_optouts") - - val clusterToEntities = InferredEntities - .getLegibleEntityEmbeddings(dateRange, timeZone) - .count("num_cluster_to_entities") - - val filteredKnownForPipe = KnownForOptOut.filterOptedOutKnownFor( - knownForPipe, - userOptoutEntities, - clusterToEntities - ) - - val output = knownForPipe - .join(filteredKnownForPipe) - .collect { - case (userId, (originalKnownFor, filtered)) - if originalKnownFor.clusterIdToScores != filtered.clusterIdToScores => - (userId, (originalKnownFor, filtered)) - } - .join(userOptoutEntities) - .map { - case (userId, ((originalKnownFor, filtered), optoutEntities)) => - Seq( - "userId=" + userId, - "originalKnownFor=" + originalKnownFor, - "filteredKnownFor=" + filtered, - "optoutEntities=" + optoutEntities - ).mkString("\t") - } - - val outputPath = "/user/recos-platform/adhoc/knownfor_optout" - output.writeExecution(TypedTsv(outputPath)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/optout/SimClustersOptOutUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/optout/SimClustersOptOutUtil.scala deleted file mode 100644 index 3b9ad7779..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/optout/SimClustersOptOutUtil.scala +++ /dev/null @@ -1,166 +0,0 @@ -package com.twitter.simclusters_v2.scalding.optout - -import com.twitter.algebird.Aggregator.size -import com.twitter.algebird.QTreeAggregatorLowerBound -import com.twitter.octain.identifiers.thriftscala.RawId -import com.twitter.octain.p13n.batch.P13NPreferencesScalaDataset -import com.twitter.octain.p13n.preferences.CompositeInterest -import com.twitter.scalding.DateRange -import com.twitter.scalding.Execution -import com.twitter.scalding.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.SemanticCoreEntityId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.scalding.common.Util -import 
com.twitter.simclusters_v2.thriftscala.ClusterType -import com.twitter.simclusters_v2.thriftscala.SemanticCoreEntityWithScore -import com.twitter.wtf.interest.thriftscala.Interest - -/** - * Opts out InterestedIn clusters based on clusters' entity embeddings. If a user opted out an - * entity and the user also is interested in a cluster with that entity embedding, unlink the - * user from that entity. - */ -object SimClustersOptOutUtil { - - /** - * Reads User's Your Twitter Data opt-out selections - */ - def getP13nOptOutSources( - dateRange: DateRange, - clusterType: ClusterType - ): TypedPipe[(UserId, Set[SemanticCoreEntityId])] = { - DAL - .readMostRecentSnapshot( - P13NPreferencesScalaDataset, - dateRange - ) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe - .map { record => (record.id, record.preferences) } - .flatMap { - case (RawId.UserId(userId), p13nPreferences) => - val optedOutEntities = p13nPreferences.interestPreferences - .map { preference => - preference.disabledInterests - .collect { - case CompositeInterest.RecommendationInterest(recInterest) - if clusterType == ClusterType.InterestedIn => - recInterest.interest match { - case Interest.SemanticEntityInterest(semanticCoreInterest) => - Some(semanticCoreInterest.entityId) - case _ => - None - } - - case CompositeInterest.RecommendationKnownFor(recInterest) - if clusterType == ClusterType.KnownFor => - recInterest.interest match { - case Interest.SemanticEntityInterest(semanticCoreInterest) => - Some(semanticCoreInterest.entityId) - case _ => - None - } - }.flatten.toSet - }.getOrElse(Set.empty) - if (optedOutEntities.nonEmpty) { - Some((userId, optedOutEntities)) - } else { - None - } - case _ => - None - } - } - - /** - * Remove user's clusters whose inferred entity embeddings are opted out. Will retain the user - * entry in the pipe even if all the clusters are filtered out. - */ - def filterOptedOutClusters( - userToClusters: TypedPipe[(UserId, Seq[ClusterId])], - optedOutEntities: TypedPipe[(UserId, Set[SemanticCoreEntityId])], - legibleClusters: TypedPipe[(ClusterId, Seq[SemanticCoreEntityWithScore])] - ): TypedPipe[(UserId, Seq[ClusterId])] = { - - val inMemoryValidClusterToEntities = - legibleClusters - .mapValues(_.map(_.entityId).toSet) - .map(Map(_)).sum - - userToClusters - .leftJoin(optedOutEntities) - .mapWithValue(inMemoryValidClusterToEntities) { - case ((userId, (userClusters, optedOutEntitiesOpt)), validClusterToEntitiesOpt) => - val optedOutEntitiesSet = optedOutEntitiesOpt.getOrElse(Set.empty) - val validClusterToEntities = validClusterToEntitiesOpt.getOrElse(Map.empty) - - val clustersAfterOptOut = userClusters.filter { clusterId => - val isClusterOptedOut = validClusterToEntities - .getOrElse(clusterId, Set.empty) - .intersect(optedOutEntitiesSet) - .nonEmpty - !isClusterOptedOut - }.distinct - - (userId, clustersAfterOptOut) - } - .filter { _._2.nonEmpty } - } - - val AlertEmail = "no-reply@twitter.com" - - /** - * Does sanity check on the results, to make sure the opt out outputs are comparable to the - * raw version. 
If the delta in the number of users >= 0.1% or median of number of clusters per - * user >= 1%, send alert emails - */ - def sanityCheckAndSendEmail( - oldNumClustersPerUser: TypedPipe[Int], - newNumClustersPerUser: TypedPipe[Int], - modelVersion: String, - alertEmail: String - ): Execution[Unit] = { - val oldNumUsersExec = oldNumClustersPerUser.aggregate(size).toOptionExecution - val newNumUsersExec = newNumClustersPerUser.aggregate(size).toOptionExecution - - val oldMedianExec = oldNumClustersPerUser - .aggregate(QTreeAggregatorLowerBound(0.5)) - .toOptionExecution - - val newMedianExec = newNumClustersPerUser - .aggregate(QTreeAggregatorLowerBound(0.5)) - .toOptionExecution - - Execution - .zip(oldNumUsersExec, newNumUsersExec, oldMedianExec, newMedianExec) - .map { - case (Some(oldNumUsers), Some(newNumUsers), Some(oldMedian), Some(newMedian)) => - val deltaNum = (newNumUsers - oldNumUsers).toDouble / oldNumUsers.toDouble - val deltaMedian = (oldMedian - newMedian) / oldMedian - val message = - s"num users before optout=$oldNumUsers,\n" + - s"num users after optout=$newNumUsers,\n" + - s"median num clusters per user before optout=$oldMedian,\n" + - s"median num clusters per user after optout=$newMedian\n" - - println(message) - if (Math.abs(deltaNum) >= 0.001 || Math.abs(deltaMedian) >= 0.01) { - Util.sendEmail( - message, - s"Anomaly in $modelVersion opt out job. Please check cluster optout jobs in Eagleeye", - alertEmail - ) - } - case err => - Util.sendEmail( - err.toString(), - s"Anomaly in $modelVersion opt out job. Please check cluster optout jobs in Eagleeye", - alertEmail - ) - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/BUILD deleted file mode 100644 index d048c58ca..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/BUILD +++ /dev/null @@ -1,168 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "escherbird/src/scala/com/twitter/escherbird/scalding/jobs/exportentities:entities-scala", - "escherbird/src/scala/com/twitter/escherbird/scalding/source/utt:utt_source-scala", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service:user_topic_relation_snapshot-scala", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/candidate_source", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/wtf/scalding/jobs/common:em_util", - "timelines/data_processing/jobs/metrics/per_topic_metrics:per_topic_aggregate_engagement-scala", - ], -) - -hadoop_binary( - name = "geopopular_top_tweets_impressed_topics", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.GeoPopularTopicsBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "geopopular_top_tweets_impressed_topics_adhoc", - main = 
"com.twitter.simclusters_v2.scalding.topic_recommendations.GeoPopularTopicsAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "similar_topics_from_topic_follow_graph", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.SimilarTopicsFromTopicFollowGraphScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "similar_topics_from_topic_follow_graph-adhoc", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.SimilarTopicsFromTopicFollowGraphAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "top_topics_for_producers_from_em", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.TopicsForProducersFromEMBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "top_topics_for_producers_from_em-adhoc", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.TopicsForProducersFromEMAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "top_producers_for_topics_from_topic_follow_graph", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.ProducersForTopicsFromTopicFollowGraphBatchApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -hadoop_binary( - name = "top_producers_for_topics_from_topic_follow_graph-adhoc", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.ProducersForTopicsFromTopicFollowGraphAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) - -# Generated with `capesospy-v2 create_target popular_topics_per_country src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml`, config hash beffad. 
-scalding_job( - name = "popular_topics_per_country", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.GeoPopularTopicsBatchApp", - args = ["--maxTopics 100"], - config = [ - ("hadoop.combine-input", "true"), - ("hadoop.map.jvm.total-memory", "3072m"), - ("hadoop.queue", "cassowary.default"), - ("hadoop.reduce.jvm.total-memory", "3072m"), - ("hadoop.submitter.jvm.total-memory", "5120m"), - ("submitter.tier", "preemptible"), - ], - cron = "16 * * * *", - hadoop_cluster = "atla-proc3", - platform = "java8", - role = "cassowary", - runtime_platform = "java8", - tags = [ - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":topic_recommendations", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/GeoPopularTopicsApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/GeoPopularTopicsApp.scala deleted file mode 100644 index df4a0707c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/GeoPopularTopicsApp.scala +++ /dev/null @@ -1,165 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations - -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.recos.entities.thriftscala.SemanticCoreEntity -import com.twitter.recos.entities.thriftscala.SemanticCoreEntityScoreList -import com.twitter.recos.entities.thriftscala.SemanticEntityScore -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding.Execution -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.Proc2Atla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.SemanticCoreEntityId -import com.twitter.simclusters_v2.hdfs_sources.GeopopularTopTweetImpressedTopicsScalaDataset -import com.twitter.timelines.per_topic_metrics.thriftscala.PerTopicAggregateEngagementMetric -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import timelines.data_processing.jobs.metrics.per_topic_metrics.PerTopicAggregateEngagementScalaDataset - -/** - scalding remote run \ - --target src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations:geopopular_top_tweets_impressed_topics_adhoc \ - --main-class com.twitter.simclusters_v2.scalding.topic_recommendations.GeoPopularTopicsAdhocApp \ - --submitter hadoopnest1.atla.twitter.com --user recos-platform \ - -- \ - --date 2020-03-28 --output_dir /user/recos-platform/adhoc/your_ldap/topics_country_counts - */ -object GeoPopularTopicsAdhocApp extends AdhocExecutionApp { - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val maxTopicsPerCountry = args.int("maxTopics", 2000) - val typedTsv = args.boolean("tsv") - implicit val inj: Injection[List[(SemanticCoreEntityId, Double)], Array[Byte]] = - Bufferable.injectionOf[List[(SemanticCoreEntityId, Double)]] - - val perTopicEngagementLogData = DAL - .read(PerTopicAggregateEngagementScalaDataset, dateRange.prepend(Days(7))) - .toTypedPipe - val topicsWithEngagement = - GeoPopularTopicsApp - .getPopularTopicsFromLogs(perTopicEngagementLogData, maxTopicsPerCountry) - 
.mapValues(_.toList) - - if (typedTsv) { - topicsWithEngagement.writeExecution( - TypedTsv(args("/user/recos-platform/adhoc/your_ldap/topics_country_counts_tsv")) - ) - } else { - topicsWithEngagement.writeExecution( - VersionedKeyValSource[String, List[(SemanticCoreEntityId, Double)]](args("output_dir")) - ) - } - } -} - -/** - capesospy-v2 update --build_locally \ - --start_cron popular_topics_per_country \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object GeoPopularTopicsBatchApp extends ScheduledExecutionApp { - override val firstTime: RichDate = RichDate("2020-04-06") - - override val batchIncrement: Duration = Days(1) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val maxTopicsPerCountry = args.int("maxTopics", 2000) - - val geoPopularTopicsPath: String = - "/user/cassowary/manhattan_sequence_files/geo_popular_top_tweet_impressed_topics" - - // Read engagement logs from the past 7 days - val perTopicEngagementLogData = DAL - .read(PerTopicAggregateEngagementScalaDataset, dateRange.prepend(Days(7))) - .withRemoteReadPolicy(ExplicitLocation(Proc2Atla)) - .toTypedPipe - - val topicsWithScores = - GeoPopularTopicsApp.getPopularTopicsFromLogs(perTopicEngagementLogData, maxTopicsPerCountry) - - val topicsWithEntityScores = topicsWithScores - .mapValues(_.map { - case (topicid, topicScore) => - SemanticEntityScore(SemanticCoreEntity(entityId = topicid), topicScore) - }) - .mapValues(SemanticCoreEntityScoreList(_)) - - val writeKeyValResultExec = topicsWithEntityScores - .map { case (country, topics) => KeyVal(country, topics) } - .writeDALVersionedKeyValExecution( - GeopopularTopTweetImpressedTopicsScalaDataset, - D.Suffix(geoPopularTopicsPath) - ) - writeKeyValResultExec - } -} - -object GeoPopularTopicsApp { - - def getPopularTopicsFromLogs( - engagementLogs: TypedPipe[PerTopicAggregateEngagementMetric], - maxTopics: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[(String, Seq[(SemanticCoreEntityId, Double)])] = { - val numTopicEngagementsRead = Stat("num_topic_engagements_read") - val intermediate = engagementLogs - .map { - case PerTopicAggregateEngagementMetric( - topicId, - dateId, - country, - page, - item, - engagementType, - engagementCount, - algorithmType, - annotationType) => - numTopicEngagementsRead.inc() - ( - topicId, - dateId, - country, - page, - item, - engagementType, - engagementCount, - algorithmType, - annotationType) - } - - // We want to find the topics with the most impressed tweets in each country - // This will ensure that the topics suggested as recommendations also have tweets that can be recommended - intermediate - .collect { - case (topicId, _, Some(country), _, item, engagementType, engagementCount, _, _) - if item == "Tweet" && engagementType == "impression" => - ((country, topicId), engagementCount) - } - .sumByKey // returns country-wise engagements for topics - .map { - case ((country, topicId), totalEngagementCountryCount) => - (country, (topicId, totalEngagementCountryCount.toDouble)) - } - .group - .sortedReverseTake(maxTopics)(Ordering.by(_._2)) - .toTypedPipe - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/ProducersForTopicsFromTopicFollowGraph.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/ProducersForTopicsFromTopicFollowGraph.scala deleted file mode 100644 index b6d6b567b..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/ProducersForTopicsFromTopicFollowGraph.scala +++ /dev/null @@ -1,206 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations - -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.recos.entities.thriftscala._ -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.TopicId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.DataSources -import com.twitter.simclusters_v2.hdfs_sources.TopProducersForLocaleTopicsFromTopicFollowGraphScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ProducerId -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * In this file, we compute the top producers for a topic from the Topic Follow Graph - * - * It works as follows: - * - * 1. Producer embedding: List of users who follow the producer's profile and follow atleast one topic - * - * 2. Topic embedding: List of users who follow the topic - * - * 3. Score(producer, topic) = cosine similarity of the producer and topic embedding as defined above - * - * 4. Please note that we compute the top producers for each topic locale. 
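A minimal sketch of the producer-topic score described in steps 1-3 above, treating both embeddings as plain sets of follower user ids (the names are hypothetical; the job itself builds L2-normalized SparseMatrix embeddings and multiplies them rather than materializing sets):

```scala
// Hedged illustration only; the original job expresses this as a sparse matrix product.
// Cosine similarity of two binary follower-set embeddings:
//   score = |producerFollowers ∩ topicFollowers| / sqrt(|producerFollowers| * |topicFollowers|)
def producerTopicScore(producerFollowers: Set[Long], topicFollowers: Set[Long]): Double = {
  if (producerFollowers.isEmpty || topicFollowers.isEmpty) 0.0
  else {
    val overlap = producerFollowers.intersect(topicFollowers).size.toDouble
    overlap / math.sqrt(producerFollowers.size.toDouble * topicFollowers.size.toDouble)
  }
}
```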
- */ - -/** -scalding remote run --user cassowary \ - --target src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations:top_producers_for_topics_from_topic_follow_graph-adhoc \ - --main-class com.twitter.simclusters_v2.scalding.topic_recommendations.ProducersForTopicsFromTopicFollowGraphAdhocApp \ - --submitter hadoopnest1.atla.twitter.com \ - -- --date 2021-01-06 --minActiveFollowers 400 --maxProducersPerTopic 50 \ - --output_dir_producers_per_topic /user/cassowary/adhoc/ldap/ttf_profile_pages_topics_to_producers - */ - -object ProducersForTopicsFromTopicFollowGraphAdhocApp extends AdhocExecutionApp { - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import ProducersForTopicsFromTopicFollowGraph._ - val outputDirProducersPerTopic = args("output_dir_producers_per_topic") - val minActiveFollowersForProducer = args.int("minActiveFollowers", 400) - val maxProducersPerTopicPerLocale = args.int("maxProducersPerTopic", 50) - val minTopicFollows = args.int("minTopicFollows", 100) - - val topicsFollowedByProducersFollowers = getTopicsFromProducersFollowers( - DataSources - .userUserNormalizedGraphSource(dateRange.prepend(Days(7))), - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.userSource, - ExternalDataSources.inferredUserConsumedLanguageSource, - minActiveFollowersForProducer, - minTopicFollows - ) - - sortAndGetTopProducersPerLocaleTopic( - topicsFollowedByProducersFollowers, - maxProducersPerTopicPerLocale).writeExecution(TypedTsv(outputDirProducersPerTopic)) - - } -} - -/** -capesospy-v2 update --build_locally \ - --start_cron top_producers_for_topics_from_topic_follow_graph \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ - -object ProducersForTopicsFromTopicFollowGraphBatchApp extends ScheduledExecutionApp { - override val firstTime: RichDate = RichDate("2020-10-01") - - override val batchIncrement: Duration = Days(1) - - private val topProducersForLocaleTopicsPath: String = - "/user/cassowary/manhattan_sequence_files/top_producers_for_topics_from_topic_follow_graph" - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import ProducersForTopicsFromTopicFollowGraph._ - val minActiveFollowersForProducer = args.int("minActiveFollowers", 400) - val maxProducersPerTopicPerLocale = args.int("maxProducersPerTopic", 50) - val minTopicFollows = args.int("minTopicFollows", 100) - - val topicsFollowedByProducersFollowers = getTopicsFromProducersFollowers( - DataSources - .userUserNormalizedGraphSource(dateRange.prepend(Days(7))), - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.userSource, - ExternalDataSources.inferredUserConsumedLanguageSource, - minActiveFollowersForProducer, - minTopicFollows - ) - - sortAndGetTopProducersPerLocaleTopic( - topicsFollowedByProducersFollowers, - maxProducersPerTopicPerLocale) - .map { - case ((topicId, languageOpt, countryOpt), producersWithScores) => - KeyVal( - SemanticCoreEntityWithLocale( - entityId = topicId, - context = Locale(language = languageOpt, country = countryOpt)), - UserScoreList(producersWithScores.map { - case (producerId, producerScore) => - UserWithScore(userId = producerId, score = producerScore) - }) - ) - }.writeDALVersionedKeyValExecution( - TopProducersForLocaleTopicsFromTopicFollowGraphScalaDataset, - D.Suffix(topProducersForLocaleTopicsPath), - version = 
ExplicitEndTime(dateRange.end) - ) - } -} - -object ProducersForTopicsFromTopicFollowGraph { - - implicit val sparseMatrixInj: Injection[ - (ProducerId, Option[Language], Option[Country]), - Array[Byte] - ] = - Bufferable.injectionOf[(ProducerId, Option[Language], Option[Country])] - - // This function takes the producer to topics map and generates the sorted and - // truncated top producers ranked list for each locale topic - def sortAndGetTopProducersPerLocaleTopic( - producerToTopics: TypedPipe[(ProducerId, (TopicId, Option[Language], Option[Country]), Double)], - maxProducersPerLocaleTopic: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[((TopicId, Option[Language], Option[Country]), List[(ProducerId, Double)])] = { - val numTopicsWithLocales = Stat("num_topics_with_locales") - producerToTopics - .map { - case (producerId, (topicId, languageOpt, countryOpt), score) => - ((topicId, languageOpt, countryOpt), Seq((producerId, score))) - } - .sumByKey.mapValues { producersList => - numTopicsWithLocales.inc() - producersList.sortBy(-_._2).take(maxProducersPerLocaleTopic).toList - }.toTypedPipe - } - - def getTopicsFromProducersFollowers( - userUserGraph: TypedPipe[UserAndNeighbors], - followedTopicsToUsers: TypedPipe[(TopicId, UserId)], - userSource: TypedPipe[(UserId, (Country, Language))], - userLanguages: TypedPipe[(UserId, Seq[(Language, Double)])], - minActiveFollowersForProducer: Int, - minTopicFollows: Int - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(ProducerId, (TopicId, Option[Language], Option[Country]), Double)] = { - - val usersFollowingTopics: TypedPipe[UserId] = followedTopicsToUsers.map(_._2).distinct - val producerToUsersSparseMatrix: SparseMatrix[ProducerId, UserId, Double] = - TopicsForProducersUtils - .getProducersToFollowedByUsersSparseMatrix( - userUserGraph, - minActiveFollowersForProducer).filterCols(usersFollowingTopics).rowL2Normalize - - val userToTopicsSparseSkinnyMatrix: SparseMatrix[ - UserId, - (TopicId, Option[Language], Option[Country]), - Double - ] = - TopicsForProducersUtils - .getFollowedTopicsToUserSparseMatrix( - followedTopicsToUsers, - userSource, - userLanguages, - minTopicFollows).rowL2Normalize.transpose - - // Obtain the Producer to Locale Topics Matrix - val producersToLocaleTopicsMatrix: SparseMatrix[ - ProducerId, - (TopicId, Option[Language], Option[Country]), - Double - ] = - producerToUsersSparseMatrix.multiplySparseMatrix(userToTopicsSparseSkinnyMatrix) - - producersToLocaleTopicsMatrix.toTypedPipe - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/SimilarTopicsFromTopicFollowGraphApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/SimilarTopicsFromTopicFollowGraphApp.scala deleted file mode 100644 index 78bd6d658..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/SimilarTopicsFromTopicFollowGraphApp.scala +++ /dev/null @@ -1,222 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations - -import com.twitter.escherbird.scalding.source.FullMetadataSource -import com.twitter.interests_ds.jobs.interests_service.UserTopicRelationSnapshotScalaDataset -import com.twitter.interests.thriftscala.InterestRelationType -import com.twitter.interests.thriftscala.UserInterestsRelationSnapshot -import com.twitter.recos.entities.thriftscala.SemanticCoreEntity -import com.twitter.recos.entities.thriftscala.SemanticCoreEntityScoreList -import 
com.twitter.recos.entities.thriftscala.SemanticEntityScore -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.SemanticCoreEntityId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.SimilarTopicsFromTopicFollowGraphScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.SparseRowMatrix -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * In this file, we compute the similarities between topics based on how often they are co-followed - * by users. - * - * Similarity(i, j) = #co-follow(i,j) / sqrt(#follow(i) * #follow(j)) - * - * It works as follows: - * - * 1. it first reads the data set of user to topics follow graph, and construct a sparse matrix M with - * N rows and K columns, where N is the number of users, and K is the number of topics. - * In the matrix, M(u,i) = 1 if user u follows topic i; otherwise it is 0. In the sparse matrix, - * we only save non-zero elements. - * - * 2. we do l2-normalization for each column of the matrix M, to get a normalized version M'. - * - * 3. we get topic-topic similarity matrix S = M'.transpose.multiply(M'). The resulting matrix will - * contain the similarities between all topics, i.e., S(i,j) is the similarity we mentioned above. - * - * 4. for each topic, we only keep its K similar topics with largest similarity scores, while not - * including those with scores lower than a threshold. 
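A minimal in-memory sketch of the Similarity(i, j) formula above (the names are hypothetical; the job itself computes the same quantity as the product of the column-L2-normalized follow matrix with its transpose, and only afterwards applies the top-K and threshold truncation of step 4):

```scala
// Hedged illustration only: Similarity(i, j) = #co-follow(i, j) / sqrt(#follow(i) * #follow(j)).
def topicSimilarities(userToTopics: Map[Long, Set[Long]]): Map[(Long, Long), Double] = {
  // #follow(i): how many users follow topic i
  val followCounts: Map[Long, Int] =
    userToTopics.values.flatten.groupBy(identity).map { case (topic, users) => topic -> users.size }

  // #co-follow(i, j): how many users follow both i and j, for i != j
  val coFollowCounts: Map[(Long, Long), Int] =
    userToTopics.values
      .flatMap(topics => for (i <- topics; j <- topics if i != j) yield (i, j))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

  coFollowCounts.map {
    case ((i, j), coFollows) =>
      (i, j) -> coFollows / math.sqrt(followCounts(i).toDouble * followCounts(j).toDouble)
  }
}
```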
- * - */ -/** - * capesospy-v2 update --build_locally \ - * --start_cron similar_topics_from_topic_follow_graph \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object SimilarTopicsFromTopicFollowGraphScheduledApp extends ScheduledExecutionApp { - import SimilarTopics._ - - private val outputPath: String = - "/user/cassowary/manhattan_sequence_files/similar_topics_from_topics_follow_graph" - - override def firstTime: RichDate = RichDate("2020-05-07") - - override def batchIncrement: Duration = Days(7) - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val numSimilarTopics = args.int("numSimilarTopics", default = 100) - val scoreThreshold = args.double("scoreThreshold", default = 0.01) - - val numOutputTopics = Stat("NumOutputTopics") - - computeSimilarTopics( - getExplicitFollowedTopics, - getFollowableTopics, - numSimilarTopics, - scoreThreshold) - .map { - case (topicId, similarTopics) => - numOutputTopics.inc() - KeyVal( - topicId, - SemanticCoreEntityScoreList(similarTopics.map { - case (similarTopicId, score) => - SemanticEntityScore(SemanticCoreEntity(similarTopicId), score) - })) - } - .writeDALVersionedKeyValExecution( - SimilarTopicsFromTopicFollowGraphScalaDataset, - D.Suffix(outputPath), - version = ExplicitEndTime(dateRange.end) - ) - } - -} - -/** - scalding remote run --user cassowary --reducers 2000 \ - --target src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations:similar_topics_from_topic_follow_graph-adhoc \ - --main-class com.twitter.simclusters_v2.scalding.topic_recommendations.SimilarTopicsFromTopicFollowGraphAdhocApp \ - --submitter hadoopnest1.atla.twitter.com \ - -- --date 2020-04-28 - */ -object SimilarTopicsFromTopicFollowGraphAdhocApp extends AdhocExecutionApp { - import SimilarTopics._ - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val numSimilarTopics = args.int("numSimilarTopics", default = 100) - val scoreThreshold = args.double("scoreThreshold", default = 0.01) - - val numOutputTopics = Stat("NumOutputTopics") - - computeSimilarTopics( - getExplicitFollowedTopics, - getFollowableTopics, - numSimilarTopics, - scoreThreshold) - .map { - case (topicId, similarTopics) => - numOutputTopics.inc() - topicId -> similarTopics - .collect { - case (similarTopic, score) if similarTopic != topicId => - s"$similarTopic:$score" - } - .mkString(",") - } - .writeExecution( - TypedTsv("/user/cassowary/adhoc/topic_recos/similar_topics") - ) - } - -} - -object SimilarTopics { - - val UTTDomain: Long = 131L - - val FollowableTag: String = "utt:followable_topic" - - def getFollowableTopics( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[SemanticCoreEntityId] = { - val NumFollowableTopics = Stat("NumFollowableTopics") - - TypedPipe - .from( - new FullMetadataSource("/atla/proc" + FullMetadataSource.DefaultHdfsPath)()( - dateRange.embiggen(Days(7)))) - .flatMap { - case fullMetadata if fullMetadata.domainId == UTTDomain => - for { - basicMetadata <- fullMetadata.basicMetadata - indexableFields <- basicMetadata.indexableFields - tags <- indexableFields.tags - if tags.contains(FollowableTag) - } yield { - NumFollowableTopics.inc() - fullMetadata.entityId - } - case _ => None - } - .forceToDisk - } - - def getExplicitFollowedTopics( - implicit dateRange: DateRange, - timeZone: TimeZone, - 
uniqueID: UniqueID - ): TypedPipe[(UserId, Map[SemanticCoreEntityId, Double])] = { - - DAL - .readMostRecentSnapshotNoOlderThan(UserTopicRelationSnapshotScalaDataset, Days(7)) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - .collect { - case userInterestsRelationSnapshot: UserInterestsRelationSnapshot - if userInterestsRelationSnapshot.interestType == "UTT" && - userInterestsRelationSnapshot.relation == InterestRelationType.Followed => - ( - userInterestsRelationSnapshot.userId, - Map(userInterestsRelationSnapshot.interestId -> 1.0)) - } - .sumByKey - } - - def computeSimilarTopics( - userTopicsFollowGraph: TypedPipe[(UserId, Map[SemanticCoreEntityId, Double])], - followableTopics: TypedPipe[SemanticCoreEntityId], - numSimilarTopics: Int, - scoreThreshold: Double - ): TypedPipe[(SemanticCoreEntityId, Seq[(SemanticCoreEntityId, Double)])] = { - val userTopicFollowGraph = - SparseRowMatrix[UserId, SemanticCoreEntityId, Double]( - userTopicsFollowGraph, - isSkinnyMatrix = true) - .filterCols(followableTopics) // filter out unfollowable topics - .colL2Normalize // normalization - // due to the small number of the topics, - // Scalding only allocates 1-2 mappers for the next step which makes it take unnecessarily long time. - // Changing it to 10 to make it a bit faster - .forceToDisk(numShardsOpt = Some(10)) - - userTopicFollowGraph - .transposeAndMultiplySkinnySparseRowMatrix(userTopicFollowGraph) - .filter { (i, j, v) => - // exclude topic itself from being considered as similar; also the similarity score should - // be larger than a threshold - i != j && v > scoreThreshold - } - .sortWithTakePerRow(numSimilarTopics)(Ordering.by(-_._2)) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersFromEM.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersFromEM.scala deleted file mode 100644 index 44eb83f1e..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersFromEM.scala +++ /dev/null @@ -1,261 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.recos.entities.thriftscala._ -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.SemanticCoreEntityId -import com.twitter.simclusters_v2.common.TopicId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.DataSources -import com.twitter.simclusters_v2.hdfs_sources.TopLocaleTopicsForProducerFromEmScalaDataset -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ProducerId -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import com.twitter.wtf.scalding.jobs.common.EMRunner -import java.util.TimeZone - -/** - * In this file, we compute the top topics for a producer to be shown on the Topics To Follow Module on Profile Pages - * - * The top topics for a producer are computed 
using the Expectation-Maximization (EM) approach - * - * It works as follows: - * - * 1. Obtain the background model distribution of number of followers for a topic - * - * 2. Obtain the domain model distribution of the number of producer's followers who follow a topic - * - * 4. Iteratively, use the Expectation-Maximization approach to get the best estimate of the domain model's topic distribution for a producer - * - * 5. for each producer, we only keep its top K topics with highest weights in the domain model's topic distribution after the EM step - * - * 6. Please note that we also store the locale info for each producer along with the topics - */ -/** -scalding remote run --user cassowary --reducers 2000 \ - --target src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations:top_topics_for_producers_from_em-adhoc \ - --main-class com.twitter.simclusters_v2.scalding.topic_recommendations.TopicsForProducersFromEMAdhocApp \ - --submitter hadoopnest1.atla.twitter.com \ - -- --date 2020-07-05 --minActiveFollowers 10000 --minTopicFollowsThreshold 100 --maxTopicsPerProducerPerLocale 50 \ - --output_dir_topics_per_producer /user/cassowary/adhoc/your_ldap/ttf_profile_pages_producers_to_topics - */ -object TopicsForProducersFromEMAdhocApp extends AdhocExecutionApp { - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import TopicsForProducersFromEM._ - val outputDirTopicsPerProducer = args("output_dir_topics_per_producer") - val minActiveFollowersForProducer = args.int("minActiveFollowers", 100) - val minTopicFollowsThreshold = args.int("minNumTopicFollows", 100) - val maxTopicsPerProducerPerLocale = args.int("maxTopicsPerProducer", 100) - val lambda = args.double("lambda", 0.95) - - val numEMSteps = args.int("numEM", 100) - - val topicsFollowedByProducersFollowers: TypedPipe[ - (ProducerId, (TopicId, Option[Language], Option[Country]), Double) - ] = getTopLocaleTopicsForProducersFromEM( - DataSources - .userUserNormalizedGraphSource(dateRange.prepend(Days(7))), - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.userSource, - ExternalDataSources.inferredUserConsumedLanguageSource, - minActiveFollowersForProducer, - minTopicFollowsThreshold, - lambda, - numEMSteps - ) - - val topTopicsPerLocaleProducerTsvExec = sortAndGetTopLocaleTopicsPerProducer( - topicsFollowedByProducersFollowers, - maxTopicsPerProducerPerLocale - ).writeExecution( - TypedTsv(outputDirTopicsPerProducer) - ) - - topTopicsPerLocaleProducerTsvExec - } -} - -/** -capesospy-v2 update --build_locally \ - --start_cron top_topics_for_producers_from_em \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object TopicsForProducersFromEMBatchApp extends ScheduledExecutionApp { - override val firstTime: RichDate = RichDate("2020-07-26") - - override val batchIncrement: Duration = Days(7) - - private val topTopicsPerProducerFromEMPath: String = - "/user/cassowary/manhattan_sequence_files/top_topics_for_producers_from_em" - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - import TopicsForProducersFromEM._ - - // threshold of the minimum number of active followers needed for a user to be considered as a producer - val minActiveFollowersForProducer = args.int("minActiveFollowers", 100) - - // threshold of the topic locale follows score needed for a topic to be considered as valid - val 
minTopicFollowsThreshold = args.int("minNumTopicFollows", 100) - - val maxTopicsPerProducer = args.int("maxTopicsPerProducer", 100) - - // lambda parameter for the EM algorithm - val lambda = args.double("lambda", 0.95) - - // number of EM iterations - val numEMSteps = args.int("numEM", 100) - - // (producer, locale) -> List<(topics, scores)> from Expectation Maximization approach - val topicsFollowedByProducersFollowers = getTopLocaleTopicsForProducersFromEM( - DataSources - .userUserNormalizedGraphSource(dateRange.prepend(Days(7))), - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.userSource, - ExternalDataSources.inferredUserConsumedLanguageSource, - minActiveFollowersForProducer, - minTopicFollowsThreshold, - lambda, - numEMSteps - ) - - val topLocaleTopicsForProducersFromEMKeyValExec = - sortAndGetTopLocaleTopicsPerProducer( - topicsFollowedByProducersFollowers, - maxTopicsPerProducer - ).map { - case ((producerId, languageOpt, countryOpt), topicsWithScores) => - KeyVal( - UserIdWithLocale( - userId = producerId, - locale = Locale(language = languageOpt, country = countryOpt)), - SemanticCoreEntityScoreList(topicsWithScores.map { - case (topicid, topicScore) => - SemanticEntityScore(SemanticCoreEntity(entityId = topicid), score = topicScore) - }) - ) - }.writeDALVersionedKeyValExecution( - TopLocaleTopicsForProducerFromEmScalaDataset, - D.Suffix(topTopicsPerProducerFromEMPath), - version = ExplicitEndTime(dateRange.end) - ) - topLocaleTopicsForProducersFromEMKeyValExec - } -} - -object TopicsForProducersFromEM { - - private val MinProducerTopicScoreThreshold = 0.0 - - implicit val sparseMatrixInj: Injection[ - (SemanticCoreEntityId, Option[Language], Option[Country]), - Array[Byte] - ] = - Bufferable.injectionOf[(SemanticCoreEntityId, Option[Language], Option[Country])] - - // This function takes the producer to topics map and generates the sorted and - // truncated top locale topics ranked list for each producer - def sortAndGetTopLocaleTopicsPerProducer( - producerToTopics: TypedPipe[(ProducerId, (TopicId, Option[Language], Option[Country]), Double)], - maxTopicsPerProducerPerLocale: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[((ProducerId, Option[Language], Option[Country]), List[(TopicId, Double)])] = { - val numProducersWithLocales = Stat("num_producers_with_locales") - producerToTopics - .map { - case (producerId, (topicId, languageOpt, countryOpt), score) => - ((producerId, languageOpt, countryOpt), Seq((topicId, score))) - }.sumByKey.mapValues { topicsList: Seq[(TopicId, Double)] => - numProducersWithLocales.inc() - topicsList - .filter(_._2 >= MinProducerTopicScoreThreshold).sortBy(-_._2).take( - maxTopicsPerProducerPerLocale).toList - }.toTypedPipe - } - - def getTopLocaleTopicsForProducersFromEM( - userUserGraph: TypedPipe[UserAndNeighbors], - followedTopicsToUsers: TypedPipe[(TopicId, UserId)], - userSource: TypedPipe[(UserId, (Country, Language))], - userLanguages: TypedPipe[(UserId, Seq[(Language, Double)])], - minActiveFollowersForProducer: Int, - minTopicFollowsThreshold: Int, - lambda: Double, - numEMSteps: Int - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(ProducerId, (TopicId, Option[Language], Option[Country]), Double)] = { - - // Obtain Producer To Users Matrix - val producersToUsersMatrix: SparseMatrix[ProducerId, UserId, Double] = - TopicsForProducersUtils.getProducersToFollowedByUsersSparseMatrix( - userUserGraph, - minActiveFollowersForProducer) - - // Obtain Users to 
TopicsWithLocales Matrix - val topicToUsersMatrix: SparseMatrix[ - (TopicId, Option[Language], Option[Country]), - UserId, - Double - ] = TopicsForProducersUtils.getFollowedTopicsToUserSparseMatrix( - followedTopicsToUsers, - userSource, - userLanguages, - minTopicFollowsThreshold) - - // Domain input probability distribution is the Map(topics->followers) per producer locale - val domainInputModel = producersToUsersMatrix - .multiplySparseMatrix(topicToUsersMatrix.transpose).toTypedPipe.map { - case (producerId, (topicId, languageOpt, countryOpt), dotProduct) => - ((producerId, languageOpt, countryOpt), Map(topicId -> dotProduct)) - }.sumByKey.toTypedPipe.map { - case ((producerId, languageOpt, countryOpt), topicsDomainInputMap) => - ((languageOpt, countryOpt), (producerId, topicsDomainInputMap)) - } - - // BackgroundModel is the Map(topics -> Expected value of the number of users who follow the topic) - val backgroundModel = topicToUsersMatrix.rowL1Norms.map { - case ((topicId, languageOpt, countryOpt), numFollowersOfTopic) => - ((languageOpt, countryOpt), Map(topicId -> numFollowersOfTopic)) - }.sumByKey - - val resultsFromEMForEachLocale = domainInputModel.hashJoin(backgroundModel).flatMap { - case ( - (languageOpt, countryOpt), - ((producerId, domainInputTopicFollowersMap), backgroundModelTopicFollowersMap)) => - val emScoredTopicsForEachProducerPerLocale = EMRunner.estimateDomainModel( - domainInputTopicFollowersMap, - backgroundModelTopicFollowersMap, - lambda, - numEMSteps) - - emScoredTopicsForEachProducerPerLocale.map { - case (topicId, topicScore) => - (producerId, (topicId, languageOpt, countryOpt), topicScore) - } - } - resultsFromEMForEachLocale - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersUtils.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersUtils.scala deleted file mode 100644 index 94a2404b8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/TopicsForProducersUtils.scala +++ /dev/null @@ -1,103 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.scalding._ -import com.twitter.simclusters_v2.common.{Country, Language, SemanticCoreEntityId, TopicId, UserId} -import com.twitter.simclusters_v2.scalding.common.matrix.SparseMatrix -import com.twitter.simclusters_v2.scalding.embedding.common.EmbeddingUtil.ProducerId -import com.twitter.simclusters_v2.thriftscala.UserAndNeighbors - -object TopicsForProducersUtils { - - implicit val sparseMatrixInj: Injection[ - (SemanticCoreEntityId, Option[Language], Option[Country]), - Array[Byte] - ] = - Bufferable.injectionOf[(SemanticCoreEntityId, Option[Language], Option[Country])] - - // This function provides the set of 'valid' topics, i.e topics with atleast a certain number of - // follows. This helps remove some noisy topic associations to producers in the dataset. 
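The same thresholding idea can be sketched on plain collections (hypothetical names; the implementation below counts non-zero entries per row of a SparseMatrix instead).

object ValidTopicsSketch {
  // Illustrative only: keep topics with at least `minTopicFollows` distinct followers.
  def validTopics(
    topicFollows: Seq[(Long, Long)], // (topicId, userId) follow edges
    minTopicFollows: Int
  ): Set[Long] =
    topicFollows
      .groupBy(_._1)
      .collect {
        case (topicId, edges) if edges.map(_._2).distinct.size >= minTopicFollows => topicId
      }
      .toSet
}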
- def getValidTopics( - topicUsers: TypedPipe[((TopicId, Option[Language], Option[Country]), UserId, Double)], - minTopicFollowsThreshold: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[(TopicId, Option[Language], Option[Country])] = { - val numValidTopics = Stat("num_valid_topics") - SparseMatrix(topicUsers).rowNnz.collect { - case (topicsWithLocaleKey, numFollows) if numFollows >= minTopicFollowsThreshold => - numValidTopics.inc() - topicsWithLocaleKey - } - } - - // Get the users with atleast minNumUserFollowers following - def getValidProducers( - userToFollowersEdges: TypedPipe[(UserId, UserId, Double)], - minNumUserFollowers: Int - )( - implicit uniqueID: UniqueID - ): TypedPipe[ProducerId] = { - val numProducersForTopics = Stat("num_producers_for_topics") - SparseMatrix(userToFollowersEdges).rowL1Norms.collect { - case (userId, l1Norm) if l1Norm >= minNumUserFollowers => - numProducersForTopics.inc() - userId - } - } - - // This function returns the User to Followed Topics Matrix - def getFollowedTopicsToUserSparseMatrix( - followedTopicsToUsers: TypedPipe[(TopicId, UserId)], - userCountryAndLanguage: TypedPipe[(UserId, (Country, Language))], - userLanguages: TypedPipe[(UserId, Seq[(Language, Double)])], - minTopicFollowsThreshold: Int - )( - implicit uniqueID: UniqueID - ): SparseMatrix[(TopicId, Option[Language], Option[Country]), UserId, Double] = { - val localeTopicsWithUsers: TypedPipe[ - ((TopicId, Option[Language], Option[Country]), UserId, Double) - ] = - followedTopicsToUsers - .map { case (topic, user) => (user, topic) } - .join(userCountryAndLanguage) - .join(userLanguages) - .withDescription("joining user locale information") - .flatMap { - case (user, ((topic, (country, _)), scoredLangs)) => - scoredLangs.flatMap { - case (lang, score) => - // To compute the top topics with/without language and country level personalization - // So the same dataset has 3 keys for each topicId (unless it gets filtered after): - // (TopicId, Language, Country), (TopicId, Language, None), (TopicId, None, None) - Seq( - ((topic, Some(lang), Some(country)), user, score), // with language and country - ((topic, Some(lang), None), user, score) // with language - ) - } ++ Seq(((topic, None, None), user, 1.0)) // no locale - } - SparseMatrix(localeTopicsWithUsers).filterRowsByMinSum(minTopicFollowsThreshold) - } - - // This function returns the Producers To User Followers Matrix - def getProducersToFollowedByUsersSparseMatrix( - userUserGraph: TypedPipe[UserAndNeighbors], - minActiveFollowers: Int, - )( - implicit uniqueID: UniqueID - ): SparseMatrix[ProducerId, UserId, Double] = { - - val numEdgesFromUsersToFollowers = Stat("num_edges_from_users_to_followers") - - val userToFollowersEdges: TypedPipe[(UserId, UserId, Double)] = - userUserGraph - .flatMap { userAndNeighbors => - userAndNeighbors.neighbors - .collect { - case neighbor if neighbor.isFollowed.getOrElse(false) => - numEdgesFromUsersToFollowers.inc() - (neighbor.neighborId, userAndNeighbors.userId, 1.0) - } - } - SparseMatrix(userToFollowersEdges).filterRowsByMinSum(minActiveFollowers) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/BUILD deleted file mode 100644 index e27970b99..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/BUILD +++ /dev/null @@ -1,70 +0,0 @@ -scala_library( 
- sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":topic_recommendations_test_datarecords-java", - ":topic_recommendations_train_datarecords-java", - "escherbird/src/scala/com/twitter/escherbird/scalding/jobs/exportentities:entities-scala", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service", - "interests-ds/src/main/scala/com/twitter/interests_ds/jobs/interests_service:user_topic_relation_snapshot-scala", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/dalv2/dataset", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/candidate_source", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "timelines/data_processing/jobs/metrics/per_topic_metrics:per_topic_aggregate_engagement-scala", - "twml/runtime/src/main/scala/com/twitter/twml/runtime/scalding", - ], -) - -hadoop_binary( - name = "training_data_for_topic_recommendations-adhoc", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations.UserTopicFeatureHydrationAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":model_based_topic_recommendations", - ], -) - -hadoop_binary( - name = "training_data_for_topic_recommendations", - main = "com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations.UserTopicFeatureHydrationScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":model_based_topic_recommendations", - ], -) - -create_datarecord_datasets( - base_name = "topic_recommendations_train_datarecords", - platform = "java8", - role = "cassowary", - segment_type = "snapshot", - tags = ["bazel-compatible"], -) - -create_datarecord_datasets( - base_name = "topic_recommendations_test_datarecords", - platform = "java8", - role = "cassowary", - segment_type = "snapshot", - tags = ["bazel-compatible"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/DataSources.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/DataSources.scala deleted file mode 100644 index baa25590f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/DataSources.scala +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations - -import com.twitter.scalding.{DateRange, Days, Stat, TypedPipe, UniqueID} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, Proc3Atla} -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.common.{Language, TopicId, UserId} -import com.twitter.simclusters_v2.hdfs_sources.FavTfgTopicEmbeddingsScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import 
com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - InternalId, - LocaleEntityId, - ModelVersion, - SimClustersEmbeddingId -} -import java.util.TimeZone - -/** - * DataSources object to read datasets for the model based topic recommendations - */ -object DataSources { - - private val topicEmbeddingDataset = FavTfgTopicEmbeddingsScalaDataset - private val topicEmbeddingType = EmbeddingType.FavTfgTopic - - /** - * Get user InterestedIn data, filter popular clusters and return fav-scores interestedIn embedding for user - */ - def getUserInterestedInData( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[(UserId, Map[Int, Double])] = { - val numUserInterestedInInput = Stat("num_user_interested_in") - ExternalDataSources.simClustersInterestInSource - .map { - case KeyVal(userId, clustersUserIsInterestedIn) => - val clustersPostFiltering = clustersUserIsInterestedIn.clusterIdToScores.filter { - case (clusterId, clusterScores) => - // filter out popular clusters (i.e clusters with > 5M users interested in it) from the user embedding - clusterScores.numUsersInterestedInThisClusterUpperBound.exists( - _ < UserInterestedInReadableStore.MaxClusterSizeForUserInterestedInDataset) - } - numUserInterestedInInput.inc() - (userId, clustersPostFiltering.mapValues(_.favScore.getOrElse(0.0)).toMap) - } - } - - def getPerLanguageTopicEmbeddings( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): TypedPipe[((TopicId, Language), Map[Int, Double])] = { - val numTFGPerLanguageEmbeddings = Stat("num_per_language_tfg_embeddings") - DAL - .readMostRecentSnapshotNoOlderThan(topicEmbeddingDataset, Days(30)) - .withRemoteReadPolicy(ExplicitLocation(Proc3Atla)) - .toTypedPipe - .map { - case KeyVal(k, v) => (k, v) - }.collect { - case ( - SimClustersEmbeddingId( - embedType, - ModelVersion.Model20m145kUpdated, - InternalId.LocaleEntityId(LocaleEntityId(entityId, lang))), - embedding) if (embedType == topicEmbeddingType) => - numTFGPerLanguageEmbeddings.inc() - ((entityId, lang), embedding.embedding.map(_.toTuple).toMap) - }.forceToDisk - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserFeatures.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserFeatures.scala deleted file mode 100644 index f2af4b62d..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserFeatures.scala +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations - -import com.twitter.ml.api.{Feature, FeatureContext} -import com.twitter.ml.api.constant.SharedFeatures - -object UserFeatures { - val UserIdFeature = SharedFeatures.USER_ID // User-id - - val UserSimClusterFeatures = - new Feature.SparseContinuous( - "user.simclusters.interested_in" - ) // User's interestedIn simcluster embeddding - - val UserCountryFeature = new Feature.Text("user.country") // user's country code - - val UserLanguageFeature = new Feature.Text("user.language") // user's language - - val FollowedTopicIdFeatures = - new Feature.SparseBinary( - "followed_topics.id" - ) // SparseBinary features for the set of followed topics - - val NotInterestedTopicIdFeatures = - new Feature.SparseBinary( - "not_interested_topics.id" - ) // SparseBinary features 
for the set of not-interested topics - - val FollowedTopicSimClusterAvgFeatures = - new Feature.SparseContinuous( - "followed_topics.simclusters.avg" - ) // Average SimCluster Embedding of the followed topics - - val NotInterestedTopicSimClusterAvgFeatures = - new Feature.SparseContinuous( - "not_interested_topics.simclusters.avg" - ) // Average SimCluster Embedding of the followed topics - - val TargetTopicIdFeatures = new Feature.Discrete("target_topic.id") // target topic-id - - val TargetTopicSimClustersFeature = - new Feature.SparseContinuous( - "target_topic.simclusters" - ) // SimCluster embedding of the target topic - - val FeatureContext = new FeatureContext( - UserIdFeature, - UserSimClusterFeatures, - UserCountryFeature, - UserLanguageFeature, - FollowedTopicIdFeatures, - NotInterestedTopicIdFeatures, - FollowedTopicSimClusterAvgFeatures, - NotInterestedTopicSimClusterAvgFeatures, - TargetTopicIdFeatures, - TargetTopicSimClustersFeature - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicDataRecordAdapter.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicDataRecordAdapter.scala deleted file mode 100644 index 9e9c0378c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicDataRecordAdapter.scala +++ /dev/null @@ -1,64 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations - -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.{DataRecord, FeatureContext, IRecordOneToOneAdapter} - -case class UserTopicTrainingSample( - userId: Long, - followedTopics: Set[Long], - notInterestedTopics: Set[Long], - userCountry: String, - userLanguage: String, - targetTopicId: Int, - userInterestedInSimClusters: Map[Int, Double], - followedTopicsSimClusters: Map[Int, Double], - notInterestedTopicsSimClusters: Map[Int, Double]) - -class UserTopicDataRecordAdapter extends IRecordOneToOneAdapter[UserTopicTrainingSample] { - import UserFeatures._ - - /** - * Get its feature context used to annotate the data. - * - * @return feature context - */ - override def getFeatureContext: FeatureContext = UserFeatures.FeatureContext - - /** - * Adapt record of type T to DataRecord. 
- * - * @param record raw record of type T - * - * @return a DataRecord - * - * @throws com.twitter.ml.api.InvalidFeatureException - */ - override def adaptToDataRecord(record: UserTopicTrainingSample): DataRecord = { - val dr = new DataRecord() - - dr.setFeatureValue(UserIdFeature, record.userId) - dr.setFeatureValue( - UserSimClusterFeatures, - record.userInterestedInSimClusters.map { - case (id, score) => id.toString -> score - }) - dr.setFeatureValue(FollowedTopicIdFeatures, record.followedTopics.map(_.toString)) - dr.setFeatureValue(NotInterestedTopicIdFeatures, record.notInterestedTopics.map(_.toString)) - dr.setFeatureValue(UserCountryFeature, record.userCountry) - dr.setFeatureValue(UserLanguageFeature, record.userLanguage) - - dr.setFeatureValue( - FollowedTopicSimClusterAvgFeatures, - record.followedTopicsSimClusters.map { - case (id, score) => id.toString -> score - }) - - dr.setFeatureValue( - NotInterestedTopicSimClusterAvgFeatures, - record.notInterestedTopicsSimClusters.map { - case (id, score) => id.toString -> score - }) - dr.setFeatureValue(TargetTopicIdFeatures, record.targetTopicId.toLong) - dr.getRecord - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicModellingTrainingDataCollectionJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicModellingTrainingDataCollectionJob.scala deleted file mode 100644 index 49a73ca32..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations/UserTopicModellingTrainingDataCollectionJob.scala +++ /dev/null @@ -1,449 +0,0 @@ -package com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations - -import com.twitter.algebird.Monoid -import com.twitter.bijection.Injection -import com.twitter.dal.client.dataset.SnapshotDALDatasetBase -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api._ -import com.twitter.scalding.TypedPipe -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.dataset.DALWrite._ -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.TopicId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.wtf.scalding.jobs.common.AdhocExecutionApp -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone -import scala.util.Random -import com.twitter.ml.api.util.FDsl._ -import com.twitter.scalding.source.DailySuffixCsv -import com.twitter.scalding.source.DailySuffixTypedTsv -import com.twitter.simclusters_v2.hdfs_sources.FavTfgTopicEmbeddingsScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.EmbeddingType - -/** - This job is to obtain the training and test data for the model-based approach to topic recommendations: - Approach: - 1. Read FavTfgTopicEmbeddingsScalaDataset - to get topic simclusters embeddings for the followed and not interested in topics - 2. Read SimclustersV2InterestedIn20M145KUpdatedScalaDataset - to get user's interestedIn Simclusters embeddings - 3. Read UsersourceScalaDataset - to get user's countryCode and language - Use the datasets above to get the features for the model and generate DataRecords. 
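In particular, the followed-topic and not-interested-topic SimClusters features are simple averages of the per-topic embeddings, the same averaging that getDataSamplesFromTrainingData performs further below with Monoid.sum. A minimal sketch of that step, with hypothetical names and plain maps:

object TopicEmbeddingAverageSketch {
  // Average the SimClusters embeddings (clusterId -> score) of a set of topics.
  def averageTopicEmbedding(
    topicIds: Set[Long],
    topicEmbeddings: Map[Long, Map[Int, Double]] // topicId -> (clusterId -> score)
  ): Map[Int, Double] = {
    if (topicIds.isEmpty) Map.empty
    else {
      // sum the scores cluster by cluster, then divide by the number of topics
      val summed = topicIds.toSeq
        .flatMap(topicId => topicEmbeddings.getOrElse(topicId, Map.empty[Int, Double]).toSeq)
        .groupBy(_._1)
        .mapValues(_.map(_._2).sum)
      summed.mapValues(_ / topicIds.size).toMap
    }
  }
}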
- */ - -/* -To run: -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/topic_recommendations/model_based_topic_recommendations:training_data_for_topic_recommendations-adhoc \ ---user cassowary \ ---submitter atla-aor-08-sr1 \ ---main-class com.twitter.simclusters_v2.scalding.topic_recommendations.model_based_topic_recommendations.UserTopicFeatureHydrationAdhocApp \ ---submitter-memory 128192.megabyte --hadoop-properties "mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.memory.mb=8192 mapreduce.reduce.java.opts='-Xmx7618M'" \ --- \ ---date 2020-10-14 \ ---outputDir "/user/cassowary/adhoc/your_ldap/user_topic_features_popular_clusters_filtered_oct_16" - */ - -object UserTopicFeatureHydrationAdhocApp extends AdhocExecutionApp { - - import UserTopicModellingJobUtils._ - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - val outputDir = args("outputDir") - val numDataRecordsTraining = Stat("num_data_records_training") - val numDataRecordsTesting = Stat("num_data_records_testing") - val testingRatio = args.double("testingRatio", 0.2) - - val (trainingDataSamples, testDataSamples, sortedVocab) = UserTopicModellingJobUtils.run( - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.notInterestedTopicsSource, - ExternalDataSources.userSource, - DataSources.getUserInterestedInData, - DataSources.getPerLanguageTopicEmbeddings, - testingRatio - ) - - val userTopicAdapter = new UserTopicDataRecordAdapter() - Execution - .zip( - convertTypedPipeToDataSetPipe( - trainingDataSamples.map { train => - numDataRecordsTraining.inc() - train - }, - userTopicAdapter) - .writeExecution( - DailySuffixFeatureSink(outputDir + "/training") - ), - convertTypedPipeToDataSetPipe( - testDataSamples.map { test => - numDataRecordsTesting.inc() - test - }, - userTopicAdapter) - .writeExecution( - DailySuffixFeatureSink(outputDir + "/testing") - ), - sortedVocab - .map { topicsWithSortedIndexes => - topicsWithSortedIndexes.map(_._1) - }.flatten.writeExecution(DailySuffixTypedTsv(outputDir + "/vocab")) - ).unit - } -} - -/** -capesospy-v2 update --build_locally \ - --start_cron training_data_for_topic_recommendations \ - src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ - -object UserTopicFeatureHydrationScheduledApp extends ScheduledExecutionApp { - - import UserTopicModellingJobUtils._ - - private val outputPath: String = - "/user/cassowary/processed/user_topic_modelling" - - override def batchIncrement: Duration = Days(1) - - override def firstTime: RichDate = RichDate("2020-10-13") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val testingRatio = args.double("testingRatio", 0.2) - - val (trainingDataSamples, testDataSamples, sortedVocab) = UserTopicModellingJobUtils.run( - ExternalDataSources.topicFollowGraphSource, - ExternalDataSources.notInterestedTopicsSource, - ExternalDataSources.userSource, - DataSources.getUserInterestedInData, - DataSources.getPerLanguageTopicEmbeddings, - testingRatio - ) - - val userTopicAdapter = new UserTopicDataRecordAdapter() - Execution - .zip( - getTrainTestExec( - trainingDataSamples, - testDataSamples, - TopicRecommendationsTrainDatarecordsJavaDataset, - TopicRecommendationsTestDatarecordsJavaDataset, - outputPath, - userTopicAdapter - ), - sortedVocab - .map { topicsWithSortedIndexes => 
- topicsWithSortedIndexes.map(_._1) - }.flatten.writeExecution(DailySuffixTypedTsv(outputPath + "/vocab")) - ).unit - - } -} - -object UserTopicModellingJobUtils { - - /** - * The main function that produces training and the test data - * - * @param topicFollowGraphSource user with followed topics from TFG - * @param notInterestedTopicsSource user with not interested in topics - * @param userSource user with country and language - * @param userInterestedInData user with interestedin simcluster embeddings - * @param topicPerLanguageEmbeddings topics with simcluster embeddings - * - * @return Tuple (trainingDataSamples, testingDataSamples, sortedTopicsVocab) - */ - def run( - topicFollowGraphSource: TypedPipe[(TopicId, UserId)], - notInterestedTopicsSource: TypedPipe[(TopicId, UserId)], - userSource: TypedPipe[(UserId, (Country, Language))], - userInterestedInData: TypedPipe[(UserId, Map[Int, Double])], - topicPerLanguageEmbeddings: TypedPipe[((TopicId, Language), Map[Int, Double])], - testingRatio: Double - )( - implicit uniqueID: UniqueID, - dateRange: DateRange, - timeZone: TimeZone - ): ( - TypedPipe[UserTopicTrainingSample], - TypedPipe[UserTopicTrainingSample], - TypedPipe[Seq[(TopicId, Int)]] - ) = { - val allFollowableTopics: TypedPipe[TopicId] = - topicFollowGraphSource.map(_._1).distinct - - val allFollowableTopicsWithMappedIds: TypedPipe[(TopicId, Int)] = - allFollowableTopics.groupAll.mapGroup { - case (_, topicIter) => - topicIter.zipWithIndex.map { - case (topicId, mappedId) => - (topicId, mappedId) - } - }.values - - val sortedVocab: TypedPipe[Seq[(TopicId, Int)]] = - allFollowableTopicsWithMappedIds.map(Seq(_)).map(_.sortBy(_._2)) - - val dataTrainingSamples: TypedPipe[UserTopicTrainingSample] = getDataSamplesFromTrainingData( - topicFollowGraphSource, - notInterestedTopicsSource, - userSource, - userInterestedInData, - topicPerLanguageEmbeddings, - allFollowableTopicsWithMappedIds - ) - val (trainSplit, testSplit) = splitByUser(dataTrainingSamples, testingRatio) - - (trainSplit, testSplit, sortedVocab) - } - - /** - * Split the data samples based on user_id into train and test data. This ensures that the same - * user's data records are not part of both train and test data. 
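The same guarantee can also be obtained deterministically by hashing the user id so that every sample from a given user always lands on the same side of the split. The sketch below uses hypothetical names and is not the job's implementation; splitByUser below instead partitions the grouped samples with Random.nextDouble.

object UserLevelSplitSketch {
  // Assign roughly `testingRatio` of users (and all of their samples) to the test set.
  def splitSamplesByUserHash[T](
    samples: Seq[(Long, T)], // (userId, sample)
    testingRatio: Double
  ): (Seq[T], Seq[T]) = {
    val (test, train) = samples.partition {
      case (userId, _) =>
        // hash the userId into [0, 1) so the assignment is stable across runs
        (math.abs(userId.hashCode.toLong) % 1000L) / 1000.0 < testingRatio
    }
    (train.map(_._2), test.map(_._2)) // (training data, testing data)
  }
}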
- */ - def splitByUser( - dataTrainingSamples: TypedPipe[UserTopicTrainingSample], - testingRatio: Double - ): (TypedPipe[UserTopicTrainingSample], TypedPipe[UserTopicTrainingSample]) = { - val (trainSplit, testSplit) = dataTrainingSamples - .map { currSmple => (currSmple.userId, currSmple) }.groupBy(_._1).partition(_ => - Random.nextDouble() > testingRatio) - val trainingData = trainSplit.values.map(_._2) - val testingData = testSplit.values.map(_._2) - (trainingData, testingData) - } - - /** - * To get the target topic for each training data sample for a user from the TopicFollowGraph - * - * @param topicFollowSource - * @return (UserId, Set(allFollowedTopicsExceptTargetTopic), targetTopic) - */ - def getTargetTopicsFromTFG( - topicFollowSource: TypedPipe[(TopicId, UserId)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[(UserId, Set[TopicId], TopicId)] = { - val numTrainingSamples = Stat("num_positive_training_samples") - - val userFollowedTopics = topicFollowSource.swap - .map { - case (userId, topicId) => (userId, Set(topicId)) - }.sumByKey.toTypedPipe - - userFollowedTopics.flatMap { - case (userID, followedTopicsSet) => - followedTopicsSet.map { currFollowedTopic => - numTrainingSamples.inc() - val remainingTopics = followedTopicsSet - currFollowedTopic - (userID, remainingTopics, currFollowedTopic) - } - } - } - - /** - * Helper function that does the intermediate join operation between a user's followed, - * not-interested, interestedIn, country and language typedpipe sources, read from different sources. - */ - - def getFeaturesIntermediateJoin( - topicFollowGraphSource: TypedPipe[(TopicId, UserId)], - notInterestedTopicsSource: TypedPipe[(TopicId, UserId)], - allFollowableTopicsWithMappedIds: TypedPipe[(TopicId, Int)], - userCountryAndLanguage: TypedPipe[(UserId, (Country, Language))], - userInterestedInData: TypedPipe[(UserId, Map[Int, Double])] - )( - implicit uniqueID: UniqueID - ): TypedPipe[ - ( - UserId, - Set[TopicId], - Set[TopicId], - TopicId, - Int, - Country, - Language, - Map[Int, Double] - ) - ] = { - implicit val l2b: Long => Array[Byte] = Injection.long2BigEndian - - val userWithFollowedTargetTopics: TypedPipe[ - (UserId, Set[TopicId], TopicId) - ] = getTargetTopicsFromTFG(topicFollowGraphSource) - - val userWithNotInterestedTopics: TypedPipe[(UserId, Set[TopicId])] = - notInterestedTopicsSource.swap.mapValues(Set(_)).sumByKey.toTypedPipe - - userWithFollowedTargetTopics - .groupBy(_._1).leftJoin(userWithNotInterestedTopics).values.map { - case ((userId, followedTopics, targetFollowedTopic), notInterestedOpt) => - ( - userId, - followedTopics, - targetFollowedTopic, - notInterestedOpt.getOrElse(Set.empty[TopicId])) - } - .map { - case (userId, followedTopics, targetFollowedTopic, notInterestedTopics) => - (targetFollowedTopic, (userId, followedTopics, notInterestedTopics)) - }.join(allFollowableTopicsWithMappedIds).map { - case (targetTopic, ((userId, followedTopics, notInterestedTopics), targetTopicIdx)) => - (userId, followedTopics, notInterestedTopics, targetTopic, targetTopicIdx) - } - .groupBy(_._1).sketch(4000) - .join(userCountryAndLanguage - .groupBy(_._1)).sketch(4000).leftJoin(userInterestedInData) - .values.map { - case ( - ( - (userId, followedTopics, notInterestedTopics, targetTopic, targetTopicIdx), - (_, (userCountry, userLanguage)) - ), - userIntOpt) => - ( - userId, - followedTopics, - notInterestedTopics, - targetTopic, - targetTopicIdx, - userCountry, - userLanguage, - userIntOpt.getOrElse(Map.empty)) - } - } - - /** - * Helper function that 
aggregates user's followed topics, not-interested topics, - * country, language with join operations and generates the UserTopicTrainingSample - * for each DataRecord - */ - def getDataSamplesFromTrainingData( - topicFollowGraphSource: TypedPipe[(TopicId, UserId)], - notInterestedTopicsSource: TypedPipe[(TopicId, UserId)], - userCountryAndLanguage: TypedPipe[(UserId, (Country, Language))], - userInterestedInData: TypedPipe[(UserId, Map[Int, Double])], - topicPerLanguageEmbeddings: TypedPipe[((TopicId, Language), Map[Int, Double])], - allFollowableTopicsWithMappedIds: TypedPipe[(TopicId, Int)] - )( - implicit uniqueID: UniqueID - ): TypedPipe[UserTopicTrainingSample] = { - - implicit val l2b: Long => Array[Byte] = Injection.long2BigEndian - - val allTopicEmbeddingsMap: ValuePipe[Map[(TopicId, Language), Map[Int, Double]]] = - topicPerLanguageEmbeddings.map { - case (topicWithLang, embedding) => - Map(topicWithLang -> embedding) - }.sum - - val userWithFollowedAndNotInterestedTopics = getFeaturesIntermediateJoin( - topicFollowGraphSource, - notInterestedTopicsSource, - allFollowableTopicsWithMappedIds, - userCountryAndLanguage, - userInterestedInData) - - userWithFollowedAndNotInterestedTopics.flatMapWithValue(allTopicEmbeddingsMap) { - case ( - ( - userId, - followedTopics, - notInterestedTopics, - targetTopic, - targetTopicIdx, - userCountry, - userLanguage, - userInt), - Some(allTopicEmbeddings)) => - val averageFollowedTopicsSimClusters = Monoid - .sum(followedTopics.toSeq.map { topicId => - allTopicEmbeddings.getOrElse((topicId, userLanguage), Map.empty) - }).mapValues(v => - v / followedTopics.size) // average simcluster embedding of the followed topics - - val averageNotInterestedTopicsSimClusters = Monoid - .sum(notInterestedTopics.toSeq.map { topicId => - allTopicEmbeddings.getOrElse((topicId, userLanguage), Map.empty) - }).mapValues(v => - v / notInterestedTopics.size) // average simcluster embedding of the notInterested topics - - Some( - UserTopicTrainingSample( - userId, - followedTopics, - notInterestedTopics, - userCountry, - userLanguage, - targetTopicIdx, - userInt, - averageFollowedTopicsSimClusters, - averageNotInterestedTopicsSimClusters - ) - ) - - case _ => - None - } - } - - /** - * Write train and test data - */ - def getTrainTestExec( - trainingData: TypedPipe[UserTopicTrainingSample], - testingData: TypedPipe[UserTopicTrainingSample], - trainDataset: SnapshotDALDatasetBase[DataRecord], - testDataset: SnapshotDALDatasetBase[DataRecord], - outputPath: String, - adapter: IRecordOneToOneAdapter[UserTopicTrainingSample] - )( - implicit dateRange: DateRange - ): Execution[Unit] = { - val trainExec = - convertTypedPipeToDataSetPipe(trainingData, adapter) - .writeDALSnapshotExecution( - trainDataset, - D.Daily, - D.Suffix(s"$outputPath/training"), - D.EBLzo(), - dateRange.end) - val testExec = - convertTypedPipeToDataSetPipe(testingData, adapter) - .writeDALSnapshotExecution( - testDataset, - D.Daily, - D.Suffix(s"$outputPath/testing"), - D.EBLzo(), - dateRange.end) - Execution.zip(trainExec, testExec).unit - } - - /** - * To get the datasetPipe containing datarecords hydrated by datarecordAdapter - * @param userTrainingSamples - * @param adapter - * @return DataSetPipe - */ - def convertTypedPipeToDataSetPipe( - userTrainingSamples: TypedPipe[UserTopicTrainingSample], - adapter: IRecordOneToOneAdapter[UserTopicTrainingSample] - ): DataSetPipe = { - userTrainingSamples.toDataSetPipe(adapter) - } -} diff --git 
a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/BUILD b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/BUILD deleted file mode 100644 index 0a1532588..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/BUILD +++ /dev/null @@ -1,234 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":tweet_similarity_test_datarecords_120min-java", - ":tweet_similarity_test_datarecords_30min-java", - ":tweet_similarity_train_datarecords_120min-java", - ":tweet_similarity_train_datarecords_30min-java", - ":tweet_similarity_unhydrated_pairs_120min-scala", - ":tweet_similarity_unhydrated_pairs_30min-scala", - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "dataproducts/insights/common/common", - "snowflake:id", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/ads/dataservice_account/snapshot/jobs:db_snapshots_promoted_tweets-scala", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/featurestore/catalog/features/recommendations:aggregate", - "src/scala/com/twitter/ml/featurestore/lib/embedding", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/dalv2/dataset", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/tweet_similarity", - "src/scala/com/twitter/wtf/scalding/jobs/client_event_processing:user_interaction-scala", - "twadoop_config/configuration/log_categories/group/timeline:timeline_service_favorites-scala", - "tweetsource/common:unhydrated_flat-scala", - ], -) - -hadoop_binary( - name = "training_data_collection-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.TrainingDataCollectionAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "training_data_collection_30min", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.TrainingDataCollection30MinScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "training_data_collection_120min", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.TrainingDataCollection120MinScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "unhydrated_pair_collection-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.UnhydratedPairsCollectionAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "unhydrated_pair_collection_30min", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.UnhydratedPairsCollection30MinScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags 
= [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "unhydrated_pair_collection_120min", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.UnhydratedPairsCollection120MinScheduledApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "model_eval-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.ModelEvalAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "dataset_topk_analysis-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.DatasetTopKAnalysisAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -hadoop_binary( - name = "dataset_topk_analysis_dump-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.DatasetTopKAnalysisDumpApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":tweet_similarity", - ], -) - -create_datarecord_datasets( - base_name = "tweet_similarity_train_datarecords_30min", - platform = "java8", - role = "cassowary", - segment_type = "partitioned", - tags = ["bazel-compatible"], -) - -create_datarecord_datasets( - base_name = "tweet_similarity_test_datarecords_30min", - platform = "java8", - role = "cassowary", - segment_type = "partitioned", - tags = ["bazel-compatible"], -) - -create_datarecord_datasets( - base_name = "tweet_similarity_train_datarecords_120min", - platform = "java8", - role = "cassowary", - segment_type = "partitioned", - tags = ["bazel-compatible"], -) - -create_datarecord_datasets( - base_name = "tweet_similarity_test_datarecords_120min", - platform = "java8", - role = "cassowary", - segment_type = "partitioned", - tags = ["bazel-compatible"], -) - -create_datasets( - base_name = "tweet_similarity_unhydrated_pairs_30min", - description = "30min coocurrence training pairs before feature hydration", - java_schema = "com.twitter.simclusters_v2.thriftjava.LabelledTweetPairs", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.LabelledTweetPairs", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "tweet_similarity_unhydrated_pairs_120min", - description = "120min coocurrence training pairs before feature hydration", - java_schema = "com.twitter.simclusters_v2.thriftjava.LabelledTweetPairs", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.LabelledTweetPairs", - segment_type = "partitioned", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) diff --git 
a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/DatasetTopKAnalysisJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/DatasetTopKAnalysisJob.scala deleted file mode 100644 index b277dd02e..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/DatasetTopKAnalysisJob.scala +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.ml.api.DailySuffixFeatureSource -import com.twitter.ml.api.DataSetPipe -import com.twitter.ml.api.RichDataRecord -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.Execution -import com.twitter.scalding._ -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.simclusters_v2.tweet_similarity.TweetSimilarityFeatures -import java.util.TimeZone - -object DatasetTopKAnalysisJob { - - case class TweetPairWithStats( - queryTweet: TweetId, - candidateTweet: TweetId, - cooccurrenceCount: Double, - coengagementCount: Double, - coengagementRate: Double) - - def getCoocurrenceTweetPairs(dataset: DataSetPipe): TypedPipe[TweetPairWithStats] = { - val featureContext = dataset.featureContext - - dataset.records - .map { record => - val richDataRecord = new RichDataRecord(record, featureContext) - val coengaged = - if (richDataRecord - .getFeatureValue(TweetSimilarityFeatures.Label) - .booleanValue) 1 - else 0 - ( - ( - richDataRecord.getFeatureValue(TweetSimilarityFeatures.QueryTweetId).toLong, - richDataRecord.getFeatureValue(TweetSimilarityFeatures.CandidateTweetId).toLong), - (1, coengaged) - ) - }.sumByKey - .map { - case ((queryTweet, candidateTweet), (coocurrenceCount, coengagementCount)) => - TweetPairWithStats( - queryTweet, - candidateTweet, - coocurrenceCount.toDouble, - coengagementCount.toDouble, - coengagementCount.toDouble / coocurrenceCount.toDouble - ) - } - } - - def getQueryTweetToCounts(dataset: DataSetPipe): TypedPipe[(Long, (Int, Int))] = { - val featureContext = dataset.featureContext - dataset.records.map { record => - val richDataRecord = new RichDataRecord(record, featureContext) - val coengaged = - if (richDataRecord - .getFeatureValue(TweetSimilarityFeatures.Label) - .booleanValue) 1 - else 0 - ( - richDataRecord.getFeatureValue(TweetSimilarityFeatures.QueryTweetId).toLong, - (1, coengaged) - ) - }.sumByKey - } - - def printGlobalTopKTweetPairsBy( - tweetPairs: TypedPipe[TweetPairWithStats], - k: Int, - orderByFnt: TweetPairWithStats => Double - ): Execution[Unit] = { - val topKTweetPairs = - tweetPairs.groupAll - .sortedReverseTake(k)(Ordering.by(orderByFnt)) - .values - topKTweetPairs.toIterableExecution.map { s => - println(s.map(Util.prettyJsonMapper.writeValueAsString).mkString("\n")) - } - } - - def printTweetTopKTweetsBy( - groupedBy: Grouped[TweetId, TweetPairWithStats], - k: Int, - orderByFnt: TweetPairWithStats => Double, - descending: Boolean = true - ): Execution[Unit] = { - if (descending) { - println("TweetTopKTweets (descending order)") - groupedBy - .sortedReverseTake(k)(Ordering.by(orderByFnt)) - .toIterableExecution - .map { record => println(record.toString()) } - } else { - println("TweetTopKTweets (ascending order)") - groupedBy - .sortedTake(k)(Ordering.by(orderByFnt)) - .toIterableExecution - .map { record => println(record.toString()) } - } - } - - def printTweetPairStatsExec( - tweetPairs: TypedPipe[TweetPairWithStats], - k: Int - ): Execution[Unit] = { - 
Execution - .sequence( - Seq( - Util.printSummaryOfNumericColumn( - tweetPairs.map(_.cooccurrenceCount), - Some("Tweet-pair Coocurrence Count")), - printGlobalTopKTweetPairsBy( - tweetPairs, - k, - { tweetPairs => tweetPairs.cooccurrenceCount }), - Util.printSummaryOfNumericColumn( - tweetPairs.map(_.coengagementCount), - Some("Tweet-pair Coengagement Count")), - printGlobalTopKTweetPairsBy( - tweetPairs, - k, - { tweetPairs => tweetPairs.coengagementCount }), - Util.printSummaryOfNumericColumn( - tweetPairs.map(_.coengagementRate), - Some("Tweet-pair Coengagement Rate")), - printGlobalTopKTweetPairsBy(tweetPairs, k, { tweetPairs => tweetPairs.coengagementRate }) - ) - ).unit - } - - def printPerQueryStatsExec(dataset: DataSetPipe, k: Int): Execution[Unit] = { - val queryToCounts = getQueryTweetToCounts(dataset) - - val topKQueryTweetsByOccurrence = - queryToCounts.groupAll - .sortedReverseTake(k)(Ordering.by { case (_, (cooccurrenceCount, _)) => cooccurrenceCount }) - .values - - val topKQueryTweetsByEngagement = - queryToCounts.groupAll - .sortedReverseTake(k)(Ordering.by { case (_, (_, coengagementCount)) => coengagementCount }) - .values - - Execution - .sequence( - Seq( - Util.printSummaryOfNumericColumn( - queryToCounts.map(_._2._1), - Some("Per-query Total Cooccurrence Count")), - topKQueryTweetsByOccurrence.toIterableExecution.map { s => - println(s.map(Util.prettyJsonMapper.writeValueAsString).mkString("\n")) - }, - Util.printSummaryOfNumericColumn( - queryToCounts.map(_._2._2), - Some("Per-query Total Coengagement Count")), - topKQueryTweetsByEngagement.toIterableExecution.map { s => - println(s.map(Util.prettyJsonMapper.writeValueAsString).mkString("\n")) - } - ) - ).unit - } - - def runTweetTopKTweetsOutputExecs( - tweetPairs: TypedPipe[TweetPairWithStats], - k: Int, - outputPath: String - ): Execution[Unit] = { - tweetPairs - .groupBy(_.queryTweet) - .sortedReverseTake(k)(Ordering.by(_.coengagementRate)) - .writeExecution(TypedTsv(outputPath + "/topK_by_coengagement_rate")) - } -} - -/** To run: - scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity:dataset_topk_analysis-adhoc \ - --user cassowary \ - --submitter hadoopnest2.atla.twitter.com \ - --main-class com.twitter.simclusters_v2.scalding.tweet_similarity.DatasetTopKAnalysisAdhocApp -- \ - --date 2020-02-19 \ - --dataset_path /user/cassowary/adhoc/training_data/2020-02-19_class_balanced/train \ - --output_path /user/cassowary/adhoc/training_data/2020-02-19_class_balanced/train/analysis - * */ -object DatasetTopKAnalysisAdhocApp extends TwitterExecutionApp { - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - def job: Execution[Unit] = Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val dataset: DataSetPipe = DailySuffixFeatureSource(args("dataset_path")).read - val outputPath: String = args("output_path") - val topK: Int = args.int("top_K", default = 10) - - val tweetPairs = DatasetTopKAnalysisJob.getCoocurrenceTweetPairs(dataset) - - Execution - .zip( - DatasetTopKAnalysisJob.printTweetPairStatsExec(tweetPairs, topK), - DatasetTopKAnalysisJob.runTweetTopKTweetsOutputExecs(tweetPairs, topK, outputPath), - DatasetTopKAnalysisJob.printPerQueryStatsExec(dataset, topK) - ).unit - } - } -} - -/** To run: - scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity:dataset_topk_analysis-dump \ - 
--user cassowary \ - --submitter hadoopnest2.atla.twitter.com \ - --main-class com.twitter.simclusters_v2.scalding.tweet_similarity.DatasetTopKAnalysisDumpApp -- \ - --date 2020-02-01 \ - --dataset_path /user/cassowary/adhoc/training_data/2020-02-01/train \ - --tweets 1223105606757695490 \ - --top_K 100 - * */ -object DatasetTopKAnalysisDumpApp extends TwitterExecutionApp { - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - def job: Execution[Unit] = Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val dataset: DataSetPipe = DailySuffixFeatureSource(args("dataset_path")).read - val tweets = args.list("tweets").map(_.toLong).toSet - val topK: Int = args.int("top_K", default = 100) - - val tweetPairs = DatasetTopKAnalysisJob.getCoocurrenceTweetPairs(dataset) - - if (tweets.isEmpty) { - Execution.from(println("Empty query tweets")) - } else { - val filteredGroupby = tweetPairs - .filter { record => tweets.contains(record.queryTweet) } - .groupBy(_.queryTweet) - - Execution - .zip( - //Top K - DatasetTopKAnalysisJob - .printTweetTopKTweetsBy(filteredGroupby, topK, pair => pair.coengagementCount), - //Bottom K - DatasetTopKAnalysisJob.printTweetTopKTweetsBy( - filteredGroupby, - topK, - pair => pair.coengagementCount, - descending = false) - ).unit - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionJob.scala deleted file mode 100644 index 93941a5da..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionJob.scala +++ /dev/null @@ -1,228 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.dal.client.dataset.TimePartitionedDALDataset -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.DataSetPipe -import com.twitter.scalding._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.Proc3Atla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.hdfs_sources.TweetSimilarityUnhydratedPairsSource -import com.twitter.simclusters_v2.scalding.common.LogFavBasedPersistentTweetEmbeddingMhExportSource -import com.twitter.simclusters_v2.scalding.tweet_similarity.TweetPairLabelCollectionUtil.FeaturedTweet -import com.twitter.simclusters_v2.thriftscala.LabelledTweetPairs -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Hydrate tweet pairs with features - */ -object TrainingDataCollectionJob { - val LookbackDays = 2 //lookbackdays considered when looking for author information - val testLookbackHours = 2 //hours in test dataset if doing time-based train/test split - val testRatio = 0.1 //ratio for test dataset if doing query-based train/test split - - def getHydratedDataPipe( - dateRange: DateRange, - useAuthorFeatures: Boolean, - unhydratedPairs: TypedPipe[LabelledTweetPairs] - )( - implicit timeZone: TimeZone - ): DataSetPipe = { - - val persistentEmbeddingRecords = - TypedPipe.from(new LogFavBasedPersistentTweetEmbeddingMhExportSource(range = dateRange)) - - val tweetAuthorPairs = - 
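// Author lookup extends dateRange back by LookbackDays so that tweets created shortly before
// the range still resolve to an author.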
TweetPairLabelCollectionUtil.getTweetAuthorPairs(dateRange.prepend(Days(LookbackDays))) - - val labelledPairs = unhydratedPairs - .map { labelledPair => - ( - FeaturedTweet( - labelledPair.queryFeaturedTweet.tweetId, - labelledPair.queryFeaturedTweet.timestamp, - None, - None), - FeaturedTweet( - labelledPair.candidateFeaturedTweet.tweetId, - labelledPair.candidateFeaturedTweet.timestamp, - None, - None), - labelledPair.label - ) - } - - TweetPairFeatureHydrationUtil.getDataSetPipeWithFeatures( - labelledPairs, - persistentEmbeddingRecords, - tweetAuthorPairs, - useAuthorFeatures) - } - - def getTrainTestExec( - dataSetPipe: DataSetPipe, - splitBy: Option[String], - trainDataset: TimePartitionedDALDataset[DataRecord], - testDataset: TimePartitionedDALDataset[DataRecord], - outputPath: String - )( - implicit timeZone: TimeZone, - dateRange: DateRange - ): Execution[Unit] = { - splitBy match { - case Some("time") => - TrainingDataCollectionUtil.getTrainTestByTimeExec( - dataSetPipe, - dateRange.end - Hours(testLookbackHours), - trainDataset, - testDataset, - outputPath)(dateRange) - case Some("query_tweet") => - TrainingDataCollectionUtil.getTrainTestByQueryExec( - dataSetPipe, - testRatio, - trainDataset, - testDataset, - outputPath)(dateRange) - // Default at no splitting - case _ => - TrainingDataCollectionUtil.getTrainTestByQueryExec( - dataSetPipe, - 0.0, - trainDataset, - testDataset, - outputPath)(dateRange) - } - } -} - -/** To run: -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity:training_data_collection-adhoc \ ---user cassowary \ ---submitter hadoopnest2.atla.twitter.com \ ---hadoop-properties "mapreduce.reduce.java.opts=-Xmx8000m mapreduce.reduce.memory.mb=8000 scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=2000 mapreduce.task.timeout=0" \ ---main-class com.twitter.simclusters_v2.scalding.tweet_similarity.TrainingDataCollectionAdhocApp -- \ ---date 2020-04-15 \ ---input_path /user/cassowary/adhoc/unhydrated_pairs/2020-04-15_30min/ \ ---output_path /user/cassowary/adhoc/training_data/2020-04-15_30min_2xneg_qtweet_split \ ---split_by query_tweet - * */ -object TrainingDataCollectionAdhocApp extends TwitterExecutionApp { - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val useAuthorFeatures: Boolean = args.boolean("use_author_features") - val inputPath: String = args("input_path") - val outputPath: String = args("output_path") - val splitBy: Option[String] = args.optional("split_by") - - val labelledPairs = TypedPipe - .from(TweetSimilarityUnhydratedPairsSource(inputPath, dateRange)) - - val dataSetPipe = TrainingDataCollectionJob.getHydratedDataPipe( - dateRange, - useAuthorFeatures, - labelledPairs - ) - TrainingDataCollectionJob.getTrainTestExec( - dataSetPipe, - splitBy, - TweetSimilarityTrainDatarecords30MinJavaDataset, - TweetSimilarityTestDatarecords30MinJavaDataset, - outputPath - ) - } - } -} - -/** - capesospy-v2 update --build_locally --start_cron \ - training_data_collection_30min src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object TrainingDataCollection30MinScheduledApp extends ScheduledExecutionApp { - - private val outputPath: String = - "/user/cassowary/processed/tweet_similarity/training_data_30min" - - override def 
batchIncrement: Duration = Hours(24) - - override def firstTime: RichDate = RichDate("2020-03-26") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val useAuthorFeatures: Boolean = args.boolean("use_author_features") - val splitBy: Option[String] = args.optional("split_by") - - val unhydratedPairs = DAL - .read(TweetSimilarityUnhydratedPairs30MinScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(Proc3Atla)) - .toTypedPipe - - val dataSetPipe = TrainingDataCollectionJob.getHydratedDataPipe( - dateRange, - useAuthorFeatures, - unhydratedPairs - ) - TrainingDataCollectionJob.getTrainTestExec( - dataSetPipe, - splitBy, - TweetSimilarityTrainDatarecords30MinJavaDataset, - TweetSimilarityTestDatarecords30MinJavaDataset, - outputPath) - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ - training_data_collection_120min src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object TrainingDataCollection120MinScheduledApp extends ScheduledExecutionApp { - - private val outputPath: String = - "/user/cassowary/processed/tweet_similarity/training_data_120min" - - override def batchIncrement: Duration = Hours(24) - - override def firstTime: RichDate = RichDate("2020-03-26") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val useAuthorFeatures: Boolean = args.boolean("use_author_features") - val splitBy: Option[String] = args.optional("split_by") - - val unhydratedPairs = DAL - .read(TweetSimilarityUnhydratedPairs120MinScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(Proc3Atla)) - .toTypedPipe - - val dataSetPipe = TrainingDataCollectionJob.getHydratedDataPipe( - dateRange, - useAuthorFeatures, - unhydratedPairs - ) - - TrainingDataCollectionJob.getTrainTestExec( - dataSetPipe, - splitBy, - TweetSimilarityTrainDatarecords120MinJavaDataset, - TweetSimilarityTestDatarecords120MinJavaDataset, - outputPath) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionUtil.scala deleted file mode 100644 index 4fdc90ec4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TrainingDataCollectionUtil.scala +++ /dev/null @@ -1,138 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.dal.client.dataset.TimePartitionedDALDataset -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.{DataRecord, DataSetPipe} -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.dataset.DALWrite._ -import com.twitter.simclusters_v2.tweet_similarity.TweetSimilarityFeatures -import com.twitter.util.Time -import java.util.Random - -/** - * Collect training data for supervised tweet similarity - */ -object TrainingDataCollectionUtil { - - /** - * Split dataset into train and test based on time - * @param dataset: input dataset - * @param testStartDate: samples before/after testStartDate will be used for training/testing - * @return (train dataset, test dataset) - */ - def splitRecordsByTime( - dataset: DataSetPipe, - testStartDate: RichDate - ): (DataSetPipe, DataSetPipe) = { - val (leftRecords, rightRecords) = dataset.records.partition { record => - // record will be in 
training dataset when both tweets were engaged before testStartDate - (record.getFeatureValue( - TweetSimilarityFeatures.QueryTweetTimestamp) < testStartDate.timestamp) & - (record.getFeatureValue( - TweetSimilarityFeatures.CandidateTweetTimestamp) < testStartDate.timestamp) - } - ( - DataSetPipe(leftRecords, dataset.featureContext), - DataSetPipe(rightRecords, dataset.featureContext)) - } - - /** - * Split dataset into train and test randomly based on query - * @param dataset: input dataset - * @param testRatio: ratio for test - * @return (train dataset, test dataset) - */ - def splitRecordsByQuery(dataset: DataSetPipe, testRatio: Double): (DataSetPipe, DataSetPipe) = { - val queryToRand = dataset.records - .map { record => record.getFeatureValue(TweetSimilarityFeatures.QueryTweetId) } - .distinct - .map { queryTweet => queryTweet -> new Random(Time.now.inMilliseconds).nextDouble() } - .forceToDisk - - val (trainRecords, testRecords) = dataset.records - .groupBy { record => record.getFeatureValue(TweetSimilarityFeatures.QueryTweetId) } - .join(queryToRand) - .values - .partition { - case (_, random) => random > testRatio - } - - ( - DataSetPipe(trainRecords.map { case (record, _) => record }, dataset.featureContext), - DataSetPipe(testRecords.map { case (record, _) => record }, dataset.featureContext)) - } - - /** - * Get the write exec for train and test datasets - * @param dataset: input dataset - * @param testStartDate: samples before/after testStartDate will be used for training/testing - * @param outputPath: output path for the train/test datasets - * @return execution of the the writing exec - */ - def getTrainTestByTimeExec( - dataset: DataSetPipe, - testStartDate: RichDate, - trainDataset: TimePartitionedDALDataset[DataRecord], - testDataset: TimePartitionedDALDataset[DataRecord], - outputPath: String - )( - implicit dateRange: DateRange - ): Execution[Unit] = { - val (trainDataSet, testDataSet) = splitRecordsByTime(dataset, testStartDate) - val trainExecution: Execution[Unit] = trainDataSet - .writeDALExecution(trainDataset, D.Daily, D.Suffix(s"$outputPath/train"), D.EBLzo()) - val trainStatsExecution: Execution[Unit] = - getStatsExec(trainDataSet, s"$outputPath/train_stats") - val testExecution: Execution[Unit] = testDataSet - .writeDALExecution(testDataset, D.Daily, D.Suffix(s"$outputPath/test"), D.EBLzo()) - val testStatsExecution: Execution[Unit] = getStatsExec(testDataSet, s"$outputPath/test_stats") - Execution.zip(trainExecution, trainStatsExecution, testExecution, testStatsExecution).unit - } - - /** - * Get the write exec for train and test datasets - * @param dataset: input dataset - * @param testRatio: samples before/after testStartDate will be used for training/testing - * @param outputPath: output path for the train/test datasets - * @return execution of the the writing exec - */ - def getTrainTestByQueryExec( - dataset: DataSetPipe, - testRatio: Double, - trainDataset: TimePartitionedDALDataset[DataRecord], - testDataset: TimePartitionedDALDataset[DataRecord], - outputPath: String - )( - implicit dateRange: DateRange - ): Execution[Unit] = { - val (trainDataSet, testDataSet) = splitRecordsByQuery(dataset, testRatio) - val trainExecution: Execution[Unit] = trainDataSet - .writeDALExecution(trainDataset, D.Daily, D.Suffix(s"$outputPath/train"), D.EBLzo()) - val trainStatsExecution: Execution[Unit] = - getStatsExec(trainDataSet, s"$outputPath/train_stats") - val testExecution: Execution[Unit] = testDataSet - .writeDALExecution(testDataset, D.Daily, 
D.Suffix(s"$outputPath/test"), D.EBLzo()) - val testStatsExecution: Execution[Unit] = getStatsExec(testDataSet, s"$outputPath/test_stats") - Execution.zip(trainExecution, trainStatsExecution, testExecution, testStatsExecution).unit - } - - /** - * Get the exec for reporting dataset stats - * @param dataset: dataset of interest - * @param outputPath: path for outputting the stats - * @return exec - */ - def getStatsExec(dataset: DataSetPipe, outputPath: String): Execution[Unit] = { - dataset.records - .map { rec => - if (TweetSimilarityFeatures.isCoengaged(rec)) - "total_positive_records" -> 1L - else - "total_negative_records" -> 1L - } - .sumByKey - .shard(1) - .writeExecution(TypedTsv(outputPath)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairFeatureHydrationUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairFeatureHydrationUtil.scala deleted file mode 100644 index 458ea8525..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairFeatureHydrationUtil.scala +++ /dev/null @@ -1,289 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.{DataRecord, DataRecordMerger, DataSetPipe, FeatureContext} -import com.twitter.ml.featurestore.lib.data.EntityIds.Entry -import com.twitter.ml.featurestore.lib.data.{EntityIds, FeatureValuesById, PredictionRecord} -import com.twitter.scalding.typed.TypedPipe -import com.twitter.simclusters_v2.common.SimClustersEmbedding._ -import com.twitter.simclusters_v2.tweet_similarity.ModelBasedTweetSimilaritySimClustersEmbeddingAdapter.{ - NormalizedCandidateEmbAdapter, - NormalizedQueryEmbAdapter -} -import com.twitter.simclusters_v2.tweet_similarity.{ - TweetSimilarityFeatures, - TweetSimilarityFeaturesStoreConfig -} -import com.twitter.simclusters_v2.common.{Timestamp, TweetId, UserId} -import com.twitter.simclusters_v2.scalding.tweet_similarity.TweetPairLabelCollectionUtil.FeaturedTweet -import com.twitter.simclusters_v2.thriftscala.{ - PersistentSimClustersEmbedding, - SimClustersEmbedding => ThriftSimClustersEmbedding -} - -object TweetPairFeatureHydrationUtil { - val QueryTweetConfig = new TweetSimilarityFeaturesStoreConfig("query_tweet_user_id") - val CandidateTweetConfig = new TweetSimilarityFeaturesStoreConfig("candidate_tweet_user_id") - val DataRecordMerger = new DataRecordMerger() - - /** - * Given persistentEmbeddings TypedPipe, extract tweetId, timestamp, and the embedding - * - * @param persistentEmbeddings TypedPipe of ((TweetId, Timestamp), PersistentSimClustersEmbedding), read from PersistentTweetEmbeddingMhExportSource - * - * @return Extracted TypedPipe of (TweetId, (Timestamp, SimClustersEmbedding)) - */ - def extractEmbeddings( - persistentEmbeddings: TypedPipe[((TweetId, Timestamp), PersistentSimClustersEmbedding)] - ): TypedPipe[(TweetId, (Timestamp, ThriftSimClustersEmbedding))] = { - persistentEmbeddings - .collect { - case ((tweetId, _), embedding) if embedding.metadata.updatedAtMs.isDefined => - (tweetId, (embedding.metadata.updatedAtMs.get, embedding.embedding)) - } - } - - /** - * Hydrate the tweet pairs with the latest persistent embeddings before engagement/impression. 
- * - * @param tweetPairs TypedPipe of the (userId, queryFeaturedTweet, candidateFeaturedTweet, label) - * @param persistentEmbeddings TypedPipe of persistentEmbeddings from PersistentTweetEmbeddingMhExportSource - * - * @return TypedPipe of the (userId, queryFeaturedTweet, candidateFeaturedTweet, label) with persistent embeddings set - */ - def getTweetPairsWithPersistentEmbeddings( - tweetPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)], - persistentEmbeddings: TypedPipe[((TweetId, Timestamp), PersistentSimClustersEmbedding)] - ): TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)] = { - val extractedEmbeddings = extractEmbeddings(persistentEmbeddings) - tweetPairs - .groupBy { - case (queryFeaturedTweet, _, _) => queryFeaturedTweet.tweet - } - .join(extractedEmbeddings) - .collect { - case ( - _, - ( - (queryFeaturedTweet, candidateFeaturedTweet, label), - (embeddingTimestamp, embedding))) - if embeddingTimestamp <= queryFeaturedTweet.timestamp => - ((queryFeaturedTweet, candidateFeaturedTweet), (embeddingTimestamp, embedding, label)) - } - .group - .maxBy(_._1) - .map { - case ((queryFeaturedTweet, candidateFeaturedTweet), (_, embedding, label)) => - ( - candidateFeaturedTweet.tweet, - (queryFeaturedTweet.copy(embedding = Some(embedding)), candidateFeaturedTweet, label) - ) - } - .join(extractedEmbeddings) - .collect { - case ( - _, - ( - (queryFeaturedTweet, candidateFeaturedTweet, label), - (embeddingTimestamp, embedding))) - if embeddingTimestamp <= candidateFeaturedTweet.timestamp => - ((queryFeaturedTweet, candidateFeaturedTweet), (embeddingTimestamp, embedding, label)) - } - .group - .maxBy(_._1) - .map { - case ((queryFeaturedTweet, candidateFeaturedTweet), (_, embedding, label)) => - (queryFeaturedTweet, candidateFeaturedTweet.copy(embedding = Some(embedding)), label) - } - } - - /** - * Get tweet pairs with the author userIds - * - * @param tweetPairs TypedPipe of (queryTweet, queryEmbedding, queryTimestamp, candidateTweet, candidateEmbedding, candidateTimestamp, label) - * @param tweetAuthorPairs TypedPipe of (tweetId, author userId) - * - * @return TypedPipe of (queryTweet, queryAuthor, queryEmbedding, queryTimestamp, candidateTweet, candidateAuthor, candidateEmbedding, candidateTimestamp, label) - */ - def getTweetPairsWithAuthors( - tweetPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)], - tweetAuthorPairs: TypedPipe[(TweetId, UserId)] - ): TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)] = { - tweetPairs - //keyed by queryTweet s.t. 
we get queryTweet's author after joining with tweetAuthorPairs - .groupBy { case (queryFeaturedTweet, _, _) => queryFeaturedTweet.tweet } - .join(tweetAuthorPairs) - .values - //keyed by candidateTweet - .groupBy { case ((_, candidateFeaturedTweet, _), _) => candidateFeaturedTweet.tweet } - .join(tweetAuthorPairs) - .values - .map { - case ( - ((queryFeaturedTweet, candidateFeaturedTweet, label), queryAuthor), - candidateAuthor) => - ( - queryFeaturedTweet.copy(author = Some(queryAuthor)), - candidateFeaturedTweet.copy(author = Some(candidateAuthor)), - label - ) - } - } - - /** - * Get tweet pairs with popularity counts - * - * @param tweetPairs TypedPipe of the (userId, queryFeaturedTweet, candidateFeaturedTweet, label) - * - * @return TypedPipe of the (userId, queryFeaturedTweet, candidateFeaturedTweet, tweetPairCount, queryTweetCount, label) - */ - def getTweetPairsWithCounts( - tweetPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)] - ): TypedPipe[(FeaturedTweet, FeaturedTweet, Long, Long, Boolean)] = { - val tweetPairCount = tweetPairs.groupBy { - case (queryFeaturedTweet, candidateFeaturedTweet, _) => - (queryFeaturedTweet.tweet, candidateFeaturedTweet.tweet) - }.size - - val queryTweetCount = tweetPairs.groupBy { - case (queryFeaturedTweet, _, _) => queryFeaturedTweet.tweet - }.size - - tweetPairs - .groupBy { - case (queryFeaturedTweet, candidateFeaturedTweet, _) => - (queryFeaturedTweet.tweet, candidateFeaturedTweet.tweet) - } - .join(tweetPairCount) - .values - .map { - case ((queryFeaturedTweet, candidateFeaturedTweet, label), tweetPairCount) => - (queryFeaturedTweet, candidateFeaturedTweet, tweetPairCount, label) - } - .groupBy { case (queryFeaturedTweet, _, _, _) => queryFeaturedTweet.tweet } - .join(queryTweetCount) - .values - .map { - case ( - (queryFeaturedTweet, candidateFeaturedTweet, tweetPairCount, label), - queryTweetCount) => - (queryFeaturedTweet, candidateFeaturedTweet, tweetPairCount, queryTweetCount, label) - } - } - - /** - * Get training data records - * - * @param tweetPairs TypedPipe of the (userId, queryFeaturedTweet, candidateFeaturedTweet, label) - * @param persistentEmbeddings TypedPipe of persistentEmbeddings from PersistentTweetEmbeddingMhExportSource - * @param tweetAuthorPairs TypedPipe of (tweetId, author userId) - * @param useAuthorFeatures whether to use author features or not - * - * @return DataSetPipe with features and label - */ - def getDataSetPipeWithFeatures( - tweetPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)], - persistentEmbeddings: TypedPipe[((TweetId, Timestamp), PersistentSimClustersEmbedding)], - tweetAuthorPairs: TypedPipe[(TweetId, UserId)], - useAuthorFeatures: Boolean - ): DataSetPipe = { - val featuredTweetPairs = - if (useAuthorFeatures) - getTweetPairsWithCounts( - getTweetPairsWithPersistentEmbeddings( - getTweetPairsWithAuthors(tweetPairs, tweetAuthorPairs), - persistentEmbeddings)) - else - getTweetPairsWithCounts( - getTweetPairsWithPersistentEmbeddings(tweetPairs, persistentEmbeddings)) - - DataSetPipe( - featuredTweetPairs.flatMap { - case (queryFeaturedTweet, candidateFeaturedTweet, tweetPairCount, queryTweetCount, label) => - getDataRecordWithFeatures( - queryFeaturedTweet, - candidateFeaturedTweet, - tweetPairCount, - queryTweetCount, - label) - }, - FeatureContext.merge( - TweetSimilarityFeatures.FeatureContext, - QueryTweetConfig.predictionRecordAdapter.getFeatureContext, - CandidateTweetConfig.predictionRecordAdapter.getFeatureContext - ) - ) - } - - /** - * Given raw features, return a 
DataRecord with all the features - * - * @param queryFeaturedTweet FeaturedTweet for query tweet - * @param candidateFeaturedTweet FeaturedTweet for candidate tweet - * @param tweetPairCount popularity count for the (query tweet, candidate tweet) pair - * @param queryTweetCount popularity count for each query tweet - * @param label true for positive and false for negative - * - * @return - */ - def getDataRecordWithFeatures( - queryFeaturedTweet: FeaturedTweet, - candidateFeaturedTweet: FeaturedTweet, - tweetPairCount: Long, - queryTweetCount: Long, - label: Boolean - ): Option[DataRecord] = { - - for { - queryEmbedding <- queryFeaturedTweet.embedding - candidateEmbedding <- candidateFeaturedTweet.embedding - } yield { - val featureDataRecord = NormalizedQueryEmbAdapter.adaptToDataRecord(queryEmbedding) - DataRecordMerger.merge( - featureDataRecord, - NormalizedCandidateEmbAdapter.adaptToDataRecord(candidateEmbedding)) - featureDataRecord.setFeatureValue( - TweetSimilarityFeatures.QueryTweetId, - queryFeaturedTweet.tweet) - featureDataRecord.setFeatureValue( - TweetSimilarityFeatures.CandidateTweetId, - candidateFeaturedTweet.tweet) - featureDataRecord.setFeatureValue( - TweetSimilarityFeatures.QueryTweetTimestamp, - queryFeaturedTweet.timestamp) - featureDataRecord.setFeatureValue( - TweetSimilarityFeatures.CandidateTweetTimestamp, - candidateFeaturedTweet.timestamp) - featureDataRecord.setFeatureValue( - TweetSimilarityFeatures.CosineSimilarity, - queryEmbedding.cosineSimilarity(candidateEmbedding)) - featureDataRecord.setFeatureValue(TweetSimilarityFeatures.TweetPairCount, tweetPairCount) - featureDataRecord.setFeatureValue(TweetSimilarityFeatures.QueryTweetCount, queryTweetCount) - featureDataRecord.setFeatureValue(TweetSimilarityFeatures.Label, label) - - if (queryFeaturedTweet.author.isDefined && candidateFeaturedTweet.author.isDefined) { - DataRecordMerger.merge( - featureDataRecord, - new DataRecord( - QueryTweetConfig.predictionRecordAdapter.adaptToDataRecord(PredictionRecord( - FeatureValuesById.empty, - EntityIds(Entry( - QueryTweetConfig.bindingIdentifier, - Set(com.twitter.ml.featurestore.lib.UserId(queryFeaturedTweet.author.get)))) - ))) - ) - DataRecordMerger.merge( - featureDataRecord, - new DataRecord( - CandidateTweetConfig.predictionRecordAdapter.adaptToDataRecord(PredictionRecord( - FeatureValuesById.empty, - EntityIds(Entry( - CandidateTweetConfig.bindingIdentifier, - Set(com.twitter.ml.featurestore.lib.UserId(candidateFeaturedTweet.author.get)))) - ))) - ) - } - featureDataRecord - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairLabelCollectionUtil.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairLabelCollectionUtil.scala deleted file mode 100644 index 26a479342..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/TweetPairLabelCollectionUtil.scala +++ /dev/null @@ -1,490 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.ads.entities.db.thriftscala.PromotedTweet -import com.twitter.dataproducts.estimation.ReservoirSampler -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.{DateRange, Execution, TypedTsv} -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.{ExplicitLocation, Proc3Atla, ProcAtla} -import com.twitter.simclusters_v2.common.{SimClustersEmbedding, Timestamp, TweetId, UserId} -import com.twitter.simclusters_v2.scalding.common.Util -import 
com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources -import com.twitter.simclusters_v2.thriftscala.{ - TweetTopKTweetsWithScore, - TweetWithScore, - TweetsWithScore -} -import com.twitter.timelineservice.thriftscala.{ContextualizedFavoriteEvent, FavoriteEventUnion} -import com.twitter.wtf.scalding.client_event_processing.thriftscala.{ - InteractionDetails, - InteractionType, - TweetImpressionDetails -} -import com.twitter.wtf.scalding.jobs.client_event_processing.UserInteractionScalaDataset -import java.util.Random -import scala.collection.mutable.ArrayBuffer -import scala.util.control.Breaks._ -import twadoop_config.configuration.log_categories.group.timeline.TimelineServiceFavoritesScalaDataset - -object TweetPairLabelCollectionUtil { - - case class FeaturedTweet( - tweet: TweetId, - timestamp: Timestamp, //engagement or impression time - author: Option[UserId], - embedding: Option[SimClustersEmbedding]) - extends Ordered[FeaturedTweet] { - - import scala.math.Ordered.orderingToOrdered - - def compare(that: FeaturedTweet): Int = - (this.tweet, this.timestamp, this.author) compare (that.tweet, that.timestamp, that.author) - } - - val MaxFavPerUser: Int = 100 - - /** - * Get all fav events within the given dateRange and where all users' out-degree <= maxOutDegree - * from TimelineServiceFavoritesScalaDataset - * - * @param dateRange date of interest - * @param maxOutgoingDegree max #degrees for the users of interests - * - * @return Filtered fav events, TypedPipe of (userid, tweetid, timestamp) tuples - */ - def getFavEvents( - dateRange: DateRange, - maxOutgoingDegree: Int - ): TypedPipe[(UserId, TweetId, Timestamp)] = { - val fullTimelineFavData: TypedPipe[ContextualizedFavoriteEvent] = - DAL - .read(TimelineServiceFavoritesScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcAtla)) - .toTypedPipe - - val userTweetTuples = fullTimelineFavData - .flatMap { cfe: ContextualizedFavoriteEvent => - cfe.event match { - case FavoriteEventUnion.Favorite(fav) => - Some((fav.userId, (fav.tweetId, fav.eventTimeMs))) - case _ => - None - } - } - //Get users with the out-degree <= maxOutDegree first - val usersWithValidOutDegree = userTweetTuples - .groupBy(_._1) - .withReducers(1000) - .size - .filter(_._2 <= maxOutgoingDegree) - - // Keep only usersWithValidOutDegree in the graph - userTweetTuples - .join(usersWithValidOutDegree).map { - case (userId, ((tweetId, eventTime), _)) => (userId, tweetId, eventTime) - }.forceToDisk - } - - /** - * Get impression events where users stay at the tweets for more than one minute - * - * @param dateRange time range of interest - * - * @return - */ - def getImpressionEvents(dateRange: DateRange): TypedPipe[(UserId, TweetId, Timestamp)] = { - DAL - .read(UserInteractionScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(Proc3Atla)) - .toTypedPipe - .flatMap { - case userInteraction - if userInteraction.interactionType == InteractionType.TweetImpressions => - userInteraction.interactionDetails match { - case InteractionDetails.TweetImpressionDetails( - TweetImpressionDetails(tweetId, _, dwellTimeInSecOpt)) - if dwellTimeInSecOpt.exists(_ >= 1) => - Some(userInteraction.userId, tweetId, userInteraction.timeStamp) - case _ => - None - } - case _ => None - } - .forceToDisk - } - - /** - * Given an events dataset, return a filtered events limited to a given set of tweets - * - * @param events user fav events, a TypedPipe of (userid, tweetid, timestamp) tuples - * @param tweets tweets of interest - * - * @return 
Filtered fav events on the given tweets of interest only, TypedPipe of (userid, tweetid, timestamp) tuples - */ - def getFilteredEvents( - events: TypedPipe[(UserId, TweetId, Timestamp)], - tweets: TypedPipe[TweetId] - ): TypedPipe[(UserId, TweetId, Timestamp)] = { - events - .map { - case (userId, tweetId, eventTime) => (tweetId, (userId, eventTime)) - } - .join(tweets.asKeys) - .withReducers(1000) - .map { - case (tweetId, ((userId, eventTime), _)) => (userId, tweetId, eventTime) - } - } - - /** Get (tweetId, author userId) of a given dateRange - * - * @param dateRange time range of interest - * - * @return TypedPipe of (tweetId, userId) - */ - def getTweetAuthorPairs(dateRange: DateRange): TypedPipe[(TweetId, UserId)] = { - ExternalDataSources - .flatTweetsSource(dateRange) - .collect { - // Exclude retweets and quoted tweets - case record if record.shareSourceTweetId.isEmpty && record.quotedTweetTweetId.isEmpty => - (record.tweetId, record.userId) - } - } - - /** Given a set of tweets, get all non-promoted tweets from the given set - * - * @param promotedTweets TypedPipe of promoted tweets - * @param tweets tweets of interest - * - * @return TypedPipe of tweetId - */ - def getNonPromotedTweets( - promotedTweets: TypedPipe[PromotedTweet], - tweets: TypedPipe[TweetId] - ): TypedPipe[TweetId] = { - promotedTweets - .collect { - case promotedTweet if promotedTweet.tweetId.isDefined => promotedTweet.tweetId.get - } - .asKeys - .rightJoin(tweets.asKeys) - .withReducers(1000) - .filterNot(joined => joined._2._1.isDefined) //filter out those in promotedTweets - .keys - } - - /** - * Given a fav events dataset, return all distinct ordered tweet pairs, labelled by whether they are co-engaged or not - * Note we distinguish between (t1, t2) and (t2, t1) because o.w we introduce bias to training samples - * - * @param events user fav events, a TypedPipe of (userid, featuredTweet) tuples - * @param timeframe two tweets will be considered co-engaged if they are fav-ed within coengagementTimeframe - * @param isCoengaged if pairs are co-engaged - * - * @return labelled tweet pairs, TypedPipe of (userid, featuredTweet1, featuredTweet2, isCoengaged) tuples - */ - def getTweetPairs( - events: TypedPipe[(UserId, FeaturedTweet)], - timeframe: Long, - isCoengaged: Boolean - ): TypedPipe[(UserId, FeaturedTweet, FeaturedTweet, Boolean)] = { - events - .map { - case (userId, featuredTweet) => (userId, Seq(featuredTweet)) - } - .sumByKey - .flatMap { - case (userId, featuredTweets) if featuredTweets.size > 1 => - val sortedFeaturedTweet = featuredTweets.sortBy(_.timestamp) - // Get all distinct ordered pairs that happen within coengagementTimeframe - val distinctPairs = ArrayBuffer[(UserId, FeaturedTweet, FeaturedTweet, Boolean)]() - breakable { - for (i <- sortedFeaturedTweet.indices) { - for (j <- i + 1 until sortedFeaturedTweet.size) { - val featuredTweet1 = sortedFeaturedTweet(i) - val featuredTweet2 = sortedFeaturedTweet(j) - if (math.abs(featuredTweet1.timestamp - featuredTweet2.timestamp) <= timeframe) - distinctPairs ++= Seq( - (userId, featuredTweet1, featuredTweet2, isCoengaged), - (userId, featuredTweet2, featuredTweet1, isCoengaged)) - else - break - } - } - } - distinctPairs - case _ => Nil - } - } - - /** - * Get co-engaged tweet pairs - * - * @param favEvents user fav events, TypedPipe of (userid, tweetid, timestamp) - * @param tweets tweets to be considered - * @param coengagementTimeframe time window for two tweets to be considered as co-engaged - * - * @return TypedPipe of co-engaged tweet 
pairs - */ - def getCoengagedPairs( - favEvents: TypedPipe[(UserId, TweetId, Timestamp)], - tweets: TypedPipe[TweetId], - coengagementTimeframe: Long - ): TypedPipe[(UserId, FeaturedTweet, FeaturedTweet, Boolean)] = { - val userFeaturedTweetPairs = - getFilteredEvents(favEvents, tweets) - .map { - case (user, tweet, timestamp) => (user, FeaturedTweet(tweet, timestamp, None, None)) - } - - getTweetPairs(userFeaturedTweetPairs, coengagementTimeframe, isCoengaged = true) - } - - /** - * Get co-impressed tweet pairs - * - * @param impressionEvents tweet impression events, TypedPipe of (userid, tweetid, timestamp) - * @param tweets set of tweets considered to be part of co-impressed tweet pairs - * @param timeframe time window for two tweets to be considered as co-impressed - * - * @return TypedPipe of co-impressed tweet pairs - */ - def getCoimpressedPairs( - impressionEvents: TypedPipe[(UserId, TweetId, Timestamp)], - tweets: TypedPipe[TweetId], - timeframe: Long - ): TypedPipe[(UserId, FeaturedTweet, FeaturedTweet, Boolean)] = { - val userFeaturedTweetPairs = getFilteredEvents(impressionEvents, tweets) - .map { - case (user, tweet, timestamp) => (user, FeaturedTweet(tweet, timestamp, None, None)) - } - - getTweetPairs(userFeaturedTweetPairs, timeframe, isCoengaged = false) - } - - /** - * Consolidate co-engaged pairs and co-impressed pairs, and compute all the labelled tweet pairs - * Given a pair: - * label = 1 if co-engaged (whether or not it's co-impressed) - * label = 0 if co-impressed and not co-engaged - * - * @param coengagedPairs co-engaged tweet pairs, TypedPipe of (user, queryFeaturedTweet, candidateFeaturedTweet, label) - * @param coimpressedPairs co-impressed tweet pairs, TypedPipe of (user, queryFeaturedTweet, candidateFeaturedTweet, label) - * - * @return labelled tweet pairs, TypedPipe of (queryFeaturedTweet, candidateFeaturedTweet, label) tuples - */ - def computeLabelledTweetPairs( - coengagedPairs: TypedPipe[(UserId, FeaturedTweet, FeaturedTweet, Boolean)], - coimpressedPairs: TypedPipe[(UserId, FeaturedTweet, FeaturedTweet, Boolean)] - ): TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)] = { - (coengagedPairs ++ coimpressedPairs) - .groupBy { - case (userId, queryFeaturedTweet, candidateFeaturedTweet, _) => - (userId, queryFeaturedTweet.tweet, candidateFeaturedTweet.tweet) - } - // consolidate all the labelled pairs into one with the max label - // (label order: co-engagement = true > co-impression = false) - .maxBy { - case (_, _, _, label) => label - } - .values - .map { case (_, queryTweet, candidateTweet, label) => (queryTweet, candidateTweet, label) } - } - - /** - * Get a balanced-class sampling of tweet pairs. - * For each query tweet, we make sure the numbers of positives and negatives are equal. 
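 * For example (hypothetical counts): a query tweet with 30 positive and 120 negative pairs and
 * maxSamplesPerClass = 100 is capped at min(min(30, 120), 100) = 30 samples per class, drawn with
 * a fixed-seed reservoir sampler.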
- * - * @param labelledPairs labelled tweet pairs, TypedPipe of (queryFeaturedTweet, candidateFeaturedTweet, label) tuples - * @param maxSamplesPerClass max number of samples per class - * - * @return sampled labelled pairs after balanced-class sampling - */ - def getQueryTweetBalancedClassPairs( - labelledPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)], - maxSamplesPerClass: Int - ): TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)] = { - val queryTweetToSampleCount = labelledPairs - .map { - case (queryTweet, _, label) => - if (label) (queryTweet.tweet, (1, 0)) else (queryTweet.tweet, (0, 1)) - } - .sumByKey - .map { - case (queryTweet, (posCount, negCount)) => - (queryTweet, Math.min(Math.min(posCount, negCount), maxSamplesPerClass)) - } - - labelledPairs - .groupBy { case (queryTweet, _, _) => queryTweet.tweet } - .join(queryTweetToSampleCount) - .values - .map { - case ((queryTweet, candidateTweet, label), samplePerClass) => - ((queryTweet.tweet, label, samplePerClass), (queryTweet, candidateTweet, label)) - } - .group - .mapGroup { - case ((_, _, samplePerClass), iter) => - val random = new Random(123L) - val sampler = - new ReservoirSampler[(FeaturedTweet, FeaturedTweet, Boolean)](samplePerClass, random) - iter.foreach { pair => sampler.sampleItem(pair) } - sampler.sample.toIterator - } - .values - } - - /** - * Given a user fav dataset, computes the similarity scores (based on engagers) between every tweet pairs - * - * @param events user fav events, a TypedPipe of (userid, tweetid, timestamp) tuples - * @param minInDegree min number of engagement count for the tweets - * @param coengagementTimeframe two tweets will be considered co-engaged if they are fav-ed within coengagementTimeframe - * - * @return tweet similarity based on engagers, a TypedPipe of (tweet1, tweet2, similarity_score) tuples - **/ - def getScoredCoengagedTweetPairs( - events: TypedPipe[(UserId, TweetId, Timestamp)], - minInDegree: Int, - coengagementTimeframe: Long - )( - ): TypedPipe[(TweetId, TweetWithScore)] = { - - // compute tweet norms (based on engagers) - // only keep tweets whose indegree >= minInDegree - val tweetNorms = events - .map { case (_, tweetId, _) => (tweetId, 1.0) } - .sumByKey //the number of engagers per tweetId - .filter(_._2 >= minInDegree) - .mapValues(math.sqrt) - - val edgesWithWeight = events - .map { - case (userId, tweetId, eventTime) => (tweetId, (userId, eventTime)) - } - .join(tweetNorms) - .map { - case (tweetId, ((userId, eventTime), norm)) => - (userId, Seq((tweetId, eventTime, 1 / norm))) - } - - // get cosine similarity - val tweetPairsWithWeight = edgesWithWeight.sumByKey - .flatMap { - case (_, tweets) if tweets.size > 1 => - allUniquePairs(tweets).flatMap { - case ((tweetId1, eventTime1, weight1), (tweetId2, eventTime2, weight2)) => - // consider only co-engagement happened within the given timeframe - if ((eventTime1 - eventTime2).abs <= coengagementTimeframe) { - if (tweetId1 > tweetId2) // each worker generate allUniquePairs in different orders, hence should standardize the pairs - Some(((tweetId2, tweetId1), weight1 * weight2)) - else - Some(((tweetId1, tweetId2), weight1 * weight2)) - } else { - None - } - case _ => - None - } - case _ => Nil - } - tweetPairsWithWeight.sumByKey - .flatMap { - case ((tweetId1, tweetId2), weight) => - Seq( - (tweetId1, TweetWithScore(tweetId2, weight)), - (tweetId2, TweetWithScore(tweetId1, weight)) - ) - case _ => Nil - } - } - - /** - * Get the write exec for per-query stats - * - * @param tweetPairs input dataset - * 
@param outputPath output path for the per-query stats - * @param identifier identifier for the tweetPairs dataset - * - * @return execution of the the writing exec - */ - def getPerQueryStatsExec( - tweetPairs: TypedPipe[(FeaturedTweet, FeaturedTweet, Boolean)], - outputPath: String, - identifier: String - ): Execution[Unit] = { - val queryTweetsToCounts = tweetPairs - .map { - case (queryTweet, _, label) => - if (label) (queryTweet.tweet, (1, 0)) else (queryTweet.tweet, (0, 1)) - } - .sumByKey - .map { case (queryTweet, (posCount, negCount)) => (queryTweet, posCount, negCount) } - - Execution - .zip( - queryTweetsToCounts.writeExecution( - TypedTsv[(TweetId, Int, Int)](s"${outputPath}_$identifier")), - Util.printSummaryOfNumericColumn( - queryTweetsToCounts - .map { case (_, posCount, _) => posCount }, - Some(s"Per-query Positive Count ($identifier)")), - Util.printSummaryOfNumericColumn( - queryTweetsToCounts - .map { case (_, _, negCount) => negCount }, - Some(s"Per-query Negative Count ($identifier)")) - ).unit - } - - /** - * Get the top K similar tweets key-val dataset - * - * @param allTweetPairs all tweet pairs with their similarity scores - * @param k the maximum number of top results for each user - * - * @return key-val top K results for each tweet - */ - def getKeyValTopKSimilarTweets( - allTweetPairs: TypedPipe[(TweetId, TweetWithScore)], - k: Int - )( - ): TypedPipe[(TweetId, TweetsWithScore)] = { - allTweetPairs.group - .sortedReverseTake(k)(Ordering.by(_.score)) - .map { case (tweetId, tweetWithScoreSeq) => (tweetId, TweetsWithScore(tweetWithScoreSeq)) } - } - - /** - * Get the top K similar tweets dataset. - * - * @param allTweetPairs all tweet pairs with their similarity scores - * @param k the maximum number of top results for each user - * - * @return top K results for each tweet - */ - def getTopKSimilarTweets( - allTweetPairs: TypedPipe[(TweetId, TweetWithScore)], - k: Int - )( - ): TypedPipe[TweetTopKTweetsWithScore] = { - allTweetPairs.group - .sortedReverseTake(k)(Ordering.by(_.score)) - .map { - case (tweetId, tweetWithScoreSeq) => - TweetTopKTweetsWithScore(tweetId, TweetsWithScore(tweetWithScoreSeq)) - } - } - - /** - * Given a input sequence, output all unique pairs in this sequence. 
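 * For example, allUniquePairs(Seq(1, 2, 3)) yields Stream((1, 2), (1, 3), (2, 3)).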
- */ - def allUniquePairs[T](input: Seq[T]): Stream[(T, T)] = { - input match { - case Nil => Stream.empty - case seq => - seq.tail.toStream.map(a => (seq.head, a)) #::: allUniquePairs(seq.tail) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/UnhydratedPairsCollectionJob.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/UnhydratedPairsCollectionJob.scala deleted file mode 100644 index 626cc35a8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/UnhydratedPairsCollectionJob.scala +++ /dev/null @@ -1,209 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity - -import com.twitter.ads.dataservice_account.snapshot.jobs.DbSnapshotsPromotedTweetsScalaDataset -import com.twitter.conversions.DurationOps._ -import com.twitter.dal.client.dataset.TimePartitionedDALDataset -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcrevAtla -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.scalding.tweet_similarity.TweetPairLabelCollectionUtil.MaxFavPerUser -import com.twitter.simclusters_v2.thriftscala.LabelledTweetPairs -import com.twitter.simclusters_v2.thriftscala.{FeaturedTweet => FeaturedTweetThrift} -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Collect unhydrated training pairs for supervised tweet similarity. - * Here're the steps for this job - * 1) Consider non-promoted tweets that are created within the given #lookback days - * 2) From the tweets in 1), get co-engaged pairs - * 3) Take all tweets shown in 2), and get co-impressed pairs. Note that we take all tweets (not tweet pairs) in 2). - * That is, a co-impressed pairs (t1,t2) will be considered iff t1 appears in 2) and t2 appears in 2). - * But (t1, t2) doesn't need to appear as a pair in 2). - * 4) Compute labels from co-engaged pairs and co-impressed pairs. - * A pair is true if its user has co-engaged the pair, and is false if otherwise. 
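 * For example (hypothetical tweets t1, t2, t3, all present in step 2): if a user favs t1 and t2
 * within the timeframe, both (t1, t2) and (t2, t1) are emitted with label true; if the same user
 * was shown t1 and t3 within the timeframe but did not co-engage them, (t1, t3) is emitted with
 * label false; and if a pair is both co-engaged and co-impressed for that user, the positive
 * label wins.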
- */ -object UnhydratedPairsCollectionJob { - //tweets have to be created within dateRange - lookbackdays in order to be considered - val LookbackDays = 2 - - def getLabelledPairs( - dateRange: DateRange, - timeframe: Long, - maxSamplesPerClass: Int, - dalDataset: TimePartitionedDALDataset[LabelledTweetPairs], - outputPath: String - )( - implicit timeZone: TimeZone - ): Execution[Unit] = { - - val promotedTweets = DAL - .readMostRecentSnapshot(DbSnapshotsPromotedTweetsScalaDataset, dateRange) - .withRemoteReadPolicy(ExplicitLocation(ProcrevAtla)) - .toTypedPipe - - val tweetAuthorPairs = - TweetPairLabelCollectionUtil.getTweetAuthorPairs(dateRange.prepend(Days(LookbackDays))) - - val tweets = - TweetPairLabelCollectionUtil.getNonPromotedTweets(promotedTweets, tweetAuthorPairs.keys) - - val coengagedPairs = TweetPairLabelCollectionUtil.getCoengagedPairs( - TweetPairLabelCollectionUtil.getFavEvents(dateRange, MaxFavPerUser), - tweets, - timeframe) - - val engagedTweets = coengagedPairs.map { - // Consider only query tweet b/c coengagedPairs contains both (t1,t2) and (t2,t1) - case (_, queryFeaturedTweet, _, _) => queryFeaturedTweet.tweet - }.distinct - - val coimpressedPairs = TweetPairLabelCollectionUtil - .getCoimpressedPairs( - TweetPairLabelCollectionUtil.getImpressionEvents(dateRange), - engagedTweets, - timeframe) - - val rawLabelledPairs = - TweetPairLabelCollectionUtil.computeLabelledTweetPairs(coengagedPairs, coimpressedPairs) - - val labelledPairs = - if (maxSamplesPerClass > 0) - TweetPairLabelCollectionUtil.getQueryTweetBalancedClassPairs( - rawLabelledPairs, - maxSamplesPerClass) - else - rawLabelledPairs - - val perQueryStatsExec = - if (maxSamplesPerClass > 0) { - Execution - .zip( - TweetPairLabelCollectionUtil - .getPerQueryStatsExec(rawLabelledPairs, s"$outputPath/per_query_stats", "raw"), - TweetPairLabelCollectionUtil - .getPerQueryStatsExec(labelledPairs, s"$outputPath/per_query_stats", "final") - ).unit - } else { - TweetPairLabelCollectionUtil.getPerQueryStatsExec( - labelledPairs, - s"$outputPath/per_query_stats", - "final") - } - - Execution - .zip( - labelledPairs - .map { - case (queryFeaturedTweet, candidateFeaturedTweet, label) => - LabelledTweetPairs( - FeaturedTweetThrift( - tweetId = queryFeaturedTweet.tweet, - timestamp = queryFeaturedTweet.timestamp), - FeaturedTweetThrift( - tweetId = candidateFeaturedTweet.tweet, - timestamp = candidateFeaturedTweet.timestamp), - label - ) - } - .writeDALExecution(dalDataset, D.Daily, D.Suffix(outputPath), D.EBLzo())(dateRange), - perQueryStatsExec - ).unit - } -} - -/** To run: - * scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity:unhydrated_pair_collection-adhoc \ - --user cassowary \ - --submitter hadoopnest2.atla.twitter.com \ - --hadoop-properties "mapreduce.reduce.java.opts=-Xmx8000m mapreduce.reduce.memory.mb=8000 scalding.with.reducers.set.explicitly=true mapreduce.job.reduces=2000 mapreduce.task.timeout=0" \ - --main-class com.twitter.simclusters_v2.scalding.tweet_similarity.UnhydratedPairsCollectionAdhocApp -- \ - --date 2020-03-04 \ - --output_path /user/cassowary/adhoc/unhydrated_pairs/2020-03-04_class_balanced \ - --samples_per_query_tweet_class 2000 - * */ -object UnhydratedPairsCollectionAdhocApp extends TwitterExecutionApp { - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val 
dateRange: DateRange = DateRange.parse(args.list("date")) - val maxSamplesPerClass: Int = args.int("samples_per_query_tweet_class", default = 2000) - val timeframe: Int = 30 - val outputPath: String = s"${args("output_path")}_${timeframe}min" - - UnhydratedPairsCollectionJob.getLabelledPairs( - dateRange, - timeframe.minute.inMilliseconds, - maxSamplesPerClass, - TweetSimilarityUnhydratedPairs30MinScalaDataset, - outputPath - ) - } - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ -unhydrated_pair_collection_30min src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object UnhydratedPairsCollection30MinScheduledApp extends ScheduledExecutionApp { - - override def batchIncrement: Duration = Hours(24) - override def firstTime: RichDate = RichDate("2020-03-26") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val maxSamplesPerClass: Int = args.int("samples_per_query_tweet_class", default = 2000) - val timeframe: Int = 30 - val outputPath: String = - s"/user/cassowary/processed/tweet_similarity/unhydrated_pairs_${timeframe}min" - - UnhydratedPairsCollectionJob.getLabelledPairs( - dateRange, - timeframe.minute.inMilliseconds, - maxSamplesPerClass, - TweetSimilarityUnhydratedPairs30MinScalaDataset, - outputPath) - } -} - -/** -capesospy-v2 update --build_locally --start_cron \ -unhydrated_pair_collection_120min src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc3.yaml - */ -object UnhydratedPairsCollection120MinScheduledApp extends ScheduledExecutionApp { - - override def batchIncrement: Duration = Hours(24) - override def firstTime: RichDate = RichDate("2020-03-26") - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - val maxSamplesPerClass: Int = args.int("samples_per_query_tweet_class", default = 2000) - val timeframe: Int = 120 - val outputPath: String = - s"/user/cassowary/processed/tweet_similarity/unhydrated_pairs_${timeframe}min" - - UnhydratedPairsCollectionJob.getLabelledPairs( - dateRange, - timeframe.minute.inMilliseconds, - maxSamplesPerClass, - TweetSimilarityUnhydratedPairs120MinScalaDataset, - outputPath) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/BUILD.bazel deleted file mode 100644 index e231ab769..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/BUILD.bazel +++ /dev/null @@ -1,40 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-only"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:algebra", - "3rdparty/jvm/com/twitter/storehaus:core", - "snowflake:id", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/rux/landing_page/data_pipeline:labeled_rux_service_scribe-scala", - "src/scala/com/twitter/rux/landing_page/data_pipeline:landing_page_labeled_data_record-java", - "src/scala/com/twitter/scalding_internal/dalv2", - "src/scala/com/twitter/scalding_internal/dalv2/dataset", - "src/scala/com/twitter/scalding_internal/job", - "src/scala/com/twitter/scalding_internal/job/analytics_batch", - "src/scala/com/twitter/scalding_internal/source", - "src/scala/com/twitter/scalding_internal/source/lzo_scrooge", - "src/scala/com/twitter/simclusters_v2/common", - 
"src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/summingbird/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:ddg_util", - "twml/runtime/src/main/scala/com/twitter/twml/runtime/scalding", - ], -) - -hadoop_binary( - name = "rux_landing_ddg_analysis-adhoc", - main = "com.twitter.simclusters_v2.scalding.tweet_similarity.evaluation.RUXLandingDdgAnalysisAdhocApp", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":evaluation", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/ModelEvalAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/ModelEvalAdhocApp.scala deleted file mode 100644 index e0d848f95..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/ModelEvalAdhocApp.scala +++ /dev/null @@ -1,91 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity.evaluation - -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.DailySuffixFeatureSource -import com.twitter.ml.api.DataSetPipe -import com.twitter.ml.api.RichDataRecord -import com.twitter.scalding._ -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.tweet_similarity.TweetSimilarityFeatures -import com.twitter.twml.runtime.scalding.TensorflowBatchPredictor -import java.util.TimeZone - -/** - * Scalding execution app for scoring a Dataset against an exported Tensorflow model. - -** Arguments: - * dataset_path - Path for the dataset on hdfs - * date - Date for the dataset paths, required if Daily dataset. - * model_source - Path of the exported model on HDFS. Must start with hdfs:// scheme. - * output_path - Path of the output result file - -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity:model_eval-adhoc \ ---user cassowary \ ---submitter hadoopnest2.atla.twitter.com \ ---main-class com.twitter.simclusters_v2.scalding.tweet_similarity.ModelEvalAdhocApp -- \ ---date 2020-02-19 \ ---dataset_path /user/cassowary/adhoc/training_data/2020-02-19_class_balanced/test \ ---model_path hdfs:///user/cassowary/tweet_similarity/2020-02-07-15-20-15/exported_models/1581253926 \ ---output_path /user/cassowary/adhoc/training_data/2020-02-19_class_balanced/test/prediction_v1 - **/ -object ModelEvalAdhocApp extends TwitterExecutionApp { - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - - /** - * Get predictor for the given model path - * @param modelName name of the model - * @param modelSource path of the exported model on HDFS. Must start with hdfs:// scheme. 
- * @return - */ - def getPredictor(modelName: String, modelSource: String): TensorflowBatchPredictor = { - val defaultInputNode = "request:0" - val defaultOutputNode = "response:0" - TensorflowBatchPredictor(modelName, modelSource, defaultInputNode, defaultOutputNode) - } - - /** - * Given input pipe and predictor, return the predictions in TypedPipe - * @param dataset dataset for prediction - * @param batchPredictor predictor - * @return - */ - def getPrediction( - dataset: DataSetPipe, - batchPredictor: TensorflowBatchPredictor - ): TypedPipe[(Long, Long, Boolean, Double, Double)] = { - val featureContext = dataset.featureContext - val predictionFeature = new Continuous("output") - - batchPredictor - .predict(dataset.records) - .map { - case (originalDataRecord, predictedDataRecord) => - val prediction = new RichDataRecord(predictedDataRecord, featureContext) - .getFeatureValue(predictionFeature).toDouble - val richDataRecord = new RichDataRecord(originalDataRecord, featureContext) - ( - richDataRecord.getFeatureValue(TweetSimilarityFeatures.QueryTweetId).toLong, - richDataRecord.getFeatureValue(TweetSimilarityFeatures.CandidateTweetId).toLong, - richDataRecord.getFeatureValue(TweetSimilarityFeatures.Label).booleanValue, - richDataRecord.getFeatureValue(TweetSimilarityFeatures.CosineSimilarity).toDouble, - prediction - ) - } - } - - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val outputPath: String = args("output_path") - val dataset: DataSetPipe = DailySuffixFeatureSource(args("dataset_path")).read - val modelSource: String = args("model_path") - val modelName: String = "tweet_similarity" - - getPrediction(dataset, getPredictor(modelName, modelSource)) - .writeExecution(TypedTsv[(Long, Long, Boolean, Double, Double)](outputPath)) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/RUXLandingDdgAnalysisAdhocApp.scala b/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/RUXLandingDdgAnalysisAdhocApp.scala deleted file mode 100644 index 8cb575ee5..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation/RUXLandingDdgAnalysisAdhocApp.scala +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.simclusters_v2.scalding.tweet_similarity.evaluation - -import com.twitter.rux.landing_page.data_pipeline.LabeledRuxServiceScribeScalaDataset -import com.twitter.rux.landing_page.data_pipeline.thriftscala.LandingPageLabel -import com.twitter.rux.service.thriftscala.FocalObject -import com.twitter.rux.service.thriftscala.UserContext -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.wtf.scalding.jobs.common.DDGUtil -import java.util.TimeZone - -/** To run: -scalding remote run --target src/scala/com/twitter/simclusters_v2/scalding/tweet_similarity/evaluation:rux_landing_ddg_analysis-adhoc \ ---user cassowary \ ---submitter hadoopnest2.atla.twitter.com \ ---main-class com.twitter.simclusters_v2.scalding.tweet_similarity.evaluation.RUXLandingDdgAnalysisAdhocApp -- \ ---date 2020-04-06 2020-04-13 \ ---ddg model_based_tweet_similarity_10254 \ ---version 1 \ ---output_path /user/cassowary/adhoc/ddg10254 - * */ -object RUXLandingDdgAnalysisAdhocApp extends 
TwitterExecutionApp { - override def job: Execution[Unit] = - Execution.withId { implicit uniqueId => - Execution.withArgs { args: Args => - implicit val timeZone: TimeZone = DateOps.UTC - implicit val dateParser: DateParser = DateParser.default - implicit val dateRange: DateRange = DateRange.parse(args.list("date")) - val ddgName: String = args("ddg") - val ddgVersion: String = args("version") - val outputPath: String = args("output_path") - val now = RichDate.now - - val ruxLabels = getLabeledRuxServiceScribe(dateRange).map { - case (userId, focalTweet, candidateTweet, impression, fav) => - userId -> (focalTweet, candidateTweet, impression, fav) - } - - // getUsersInDDG reads from a snapshot dataset. - // Just prepend dateRange so that we can look back far enough to make sure there is data. - DDGUtil - .getUsersInDDG(ddgName, ddgVersion.toInt)(DateRange(now - Days(7), now)).map { ddgUser => - ddgUser.userId -> (ddgUser.bucket, ddgUser.enterUserState.getOrElse("no_user_state")) - }.join(ruxLabels) - .map { - case (userId, ((bucket, state), (focalTweet, candidateTweet, impression, fav))) => - (userId, bucket, state, focalTweet, candidateTweet, impression, fav) - } - .writeExecution( - TypedTsv[(UserId, String, String, TweetId, TweetId, Int, Int)](s"$outputPath")) - } - } - - def getLabeledRuxServiceScribe( - dateRange: DateRange - ): TypedPipe[(UserId, TweetId, TweetId, Int, Int)] = { - DAL - .read(LabeledRuxServiceScribeScalaDataset, dateRange) - .toTypedPipe.map { record => - ( - record.ruxServiceScribe.userContext, - record.ruxServiceScribe.focalObject, - record.landingPageLabel) - }.flatMap { - case ( - Some(UserContext(Some(userId), _, _, _, _, _, _, _)), - Some(FocalObject.TweetId(tweet)), - Some(labels)) => - labels.map { - case LandingPageLabel.LandingPageFavoriteEvent(favEvent) => - //(focal tweet, impressioned tweet, impression, fav) - (userId, tweet, favEvent.tweetId, 0, 1) - case LandingPageLabel.LandingPageImpressionEvent(impressionEvent) => - (userId, tweet, impressionEvent.tweetId, 1, 0) - } - case _ => Nil - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/BUILD.bazel b/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/BUILD.bazel deleted file mode 100644 index d196a0d99..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/BUILD.bazel +++ /dev/null @@ -1,59 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = [ - "bazel-compatible", - "bazel-only", - ], - dependencies = [ - "src/java/com/twitter/sbf/core", - "src/java/com/twitter/sbf/graph", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:simclusters_v2_embeddings_lite-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/presto_hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/scala/com/twitter/simclusters_v2/scalding/common", - "src/scala/com/twitter/simclusters_v2/scalding/common/matrix", - "src/scala/com/twitter/wtf/entity_real_graph/common", - "src/scala/com/twitter/wtf/entity_real_graph/scalding/common", - "src/scala/com/twitter/wtf/scalding/jobs/common:execution_app", - "src/scala/com/twitter/wtf/scalding/jobs/common:sources", - "src/scala/com/twitter/wtf/scalding/jobs/common:stats_util", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/wtf/entity_real_graph:entity_real_graph-thrift-scala", - 
"usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) - -hadoop_binary( - name = "update_known_for_20m_145k_2020-adhoc", - main = "com.twitter.simclusters_v2.scalding.update_known_for.UpdateKnownFor20M145K2020Adhoc", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":update_known_for", - ], -) - -hadoop_binary( - name = "update_known_for_20m_145k_2020", - main = "com.twitter.simclusters_v2.scalding.update_known_for.UpdateKnownFor20M145K2020", - platform = "java8", - runtime_platform = "java8", - tags = [ - "bazel-compatible", - "bazel-compatible:migrated", - "bazel-only", - ], - dependencies = [ - ":update_known_for", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala b/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala deleted file mode 100644 index 07f070592..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala +++ /dev/null @@ -1,256 +0,0 @@ -package com.twitter.simclusters_v2.scalding.update_known_for - -import com.twitter.bijection.scrooge.BinaryScalaCodec -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.logging.Logger -import com.twitter.pluck.source.cassowary.FollowingsCosineSimilaritiesManhattanSource -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding.DateOps -import com.twitter.scalding.DateParser -import com.twitter.scalding.Days -import com.twitter.scalding.Execution -import com.twitter.scalding.RichDate -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.scalding._ -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.DALWrite.D -import com.twitter.scalding_internal.dalv2.DALWrite._ -import com.twitter.scalding_internal.dalv2.remote_access.AllowCrossClusterSameDC -import com.twitter.scalding_internal.job.TwitterExecutionApp -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.hdfs_sources.InternalDataPaths -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2KnownFor20M145KDec11ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2KnownFor20M145KUpdatedScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.SimclustersV2RawKnownFor20M145K2020ScalaDataset -import com.twitter.simclusters_v2.scalding.KnownForSources -import com.twitter.simclusters_v2.scalding.KnownForSources.fromKeyVal -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.wtf.scalding.jobs.common.ScheduledExecutionApp -import java.util.TimeZone - -/** - * Scheduled job - * - * capesospy-v2 update --build_locally --start_cron update_known_for_20m_145k_2020 \ - * src/scala/com/twitter/simclusters_v2/capesos_config/atla_proc.yaml - */ - -object UpdateKnownFor20M145K2020 extends ScheduledExecutionApp { - - override val firstTime: RichDate = RichDate("2020-10-04") - - override val batchIncrement: Duration = Days(7) - - private val tempLocationPath = "/user/cassowary/temp/simclusters_v2/known_for_20m_145k_2020" - - private 
val simsGraphPath = - "/atla/proc/user/cassowary/manhattan_sequence_files/approximate_cosine_similarity_follow" - - override def runOnDateRange( - args: Args - )( - implicit dateRange: DateRange, - timeZone: TimeZone, - uniqueID: UniqueID - ): Execution[Unit] = { - - Execution.getConfigMode.flatMap { - case (_, mode) => - implicit def valueCodec: BinaryScalaCodec[Candidates] = BinaryScalaCodec(Candidates) - // Step - 1 (DataProcessing): Parameters for getting mapped indices for user-ids - val minActiveFollowers = args.int("minActiveFollowers", 400) - val topK = args.int("topK", 20000000) - - // Step - 2 (DataProcessing): Parameters to remove users not in the topK most followed users from simsGraph - val maxNeighbors = args.int("maxNeighbors", 400) - - // Step - 3 (Final Clustering): Parameters to run the clustering algorithm - /* squareWeightEnable is a boolean flag that changes the edge weights obtained from the - underlying sims graph - a) If false - edge weight between two neighbors is just their cosine similarity. - b) If true - edge weight = cosine_sim * cosine_sim * 10. The squaring makes the higher - weight edges relatively more important; this is based on the intuition that a neighbor - with cosine similarity of 0.1 is more than 2x important compared to a neighbor with - cosine similarity of 0.05. The multiplication with 10 brings the weights back into a - 'nicer' range since squaring will reduce their absolute value. - */ - val squareWeightsEnable = args.boolean("squareWeightsEnable") - - val maxEpochsForClustering = args.int("maxEpochs", 3) - val wtCoeff = args.double("wtCoeff", 10.0) - - val previousKnownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])] = - fromKeyVal( - DAL - .readMostRecentSnapshot( - SimclustersV2RawKnownFor20M145K2020ScalaDataset, - dateRange.embiggen(Days(30))) - .withRemoteReadPolicy(AllowCrossClusterSameDC) - .toTypedPipe, - ModelVersions.Model20M145K2020 - ) - - UpdateKnownForSBFRunner - .runUpdateKnownFor( - TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource(simsGraphPath)) - .map(_._2), - minActiveFollowers, - topK, - maxNeighbors, - tempLocationPath, - previousKnownFor, - maxEpochsForClustering, - squareWeightsEnable, - wtCoeff, - mode - ) - .flatMap { updateKnownFor => - Execution - .zip( - KnownForSources - .toKeyVal(updateKnownFor, ModelVersions.Model20M145K2020) - .writeDALVersionedKeyValExecution( - SimclustersV2RawKnownFor20M145K2020ScalaDataset, - D.Suffix(InternalDataPaths.RawKnownFor2020Path) - ), - UpdateKnownForSBFRunner - .evaluateUpdatedKnownFor(updateKnownFor, previousKnownFor) - .flatMap { emailText => - Util - .sendEmail( - emailText, - s"Change in cluster assignments for new KnownFor ModelVersion: 20M145K2020", - "no-reply@twitter.com") - Execution.unit - } - ).unit - } - } - } -} -/* -knownFor Week-1: -scalding remote run \ ---target src/scala/com/twitter/simclusters_v2/scalding/update_known_for:update_known_for_20m_145k_2020-adhoc \ ---main-class com.twitter.simclusters_v2.scalding.update_known_for.UpdateKnownFor20M145K2020Adhoc \ ---submitter atla-aor-08-sr1 --user cassowary \ ---submitter-memory 128192.megabyte --hadoop-properties "mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.memory.mb=8192 mapreduce.reduce.java.opts='-Xmx7618M'" \ --- \ ---date 2020-08-30 --maxNeighbors 100 --minActiveFollowers 400 --topK 20000000 --numNodesPerCommunity 200 --maxEpochs 4 --squareWeightsEnable --wtCoeff 10.0 \ ---inputSimsDir 
/atla/proc/user/cassowary/manhattan_sequence_files/approximate_cosine_similarity_follow \ ---outputClusterDir /user/cassowary/adhoc/your_ldap/simclusters/clustering_outputs/output_clustering_assignments_2020_readAgain_v4_week_1 - -knownFor Week-2: -scalding remote run \ ---target src/scala/com/twitter/simclusters_v2/scalding/update_known_for:update_known_for_20m_145k_2020-adhoc \ ---main-class com.twitter.simclusters_v2.scalding.update_known_for.UpdateKnownFor20M145K2020Adhoc \ ---submitter atla-aor-08-sr1 --user cassowary \ ---submitter-memory 128192.megabyte --hadoop-properties "mapreduce.map.memory.mb=8192 mapreduce.map.java.opts='-Xmx7618M' mapreduce.reduce.memory.mb=8192 mapreduce.reduce.java.opts='-Xmx7618M'" \ --- \ ---date 2020-08-30 --maxNeighbors 100 --minActiveFollowers 400 --topK 20000000 --numNodesPerCommunity 200 --maxEpochs 4 --squareWeightsEnable --wtCoeff 10.0 \ ---inputSimsDir /atla/proc/user/cassowary/manhattan_sequence_files/approximate_cosine_similarity_follow \ ---inputPreviousKnownForDataSet /user/cassowary/adhoc/your_ldap/simclusters/clustering_outputs/output_clustering_assignments_2020_readAgain_v4_week_1_KeyVal \ ---outputClusterDir /user/cassowary/adhoc/your_ldap/simclusters/clustering_outputs/output_clustering_assignments_2020_readAgain_v4_week_2 - */ - -object UpdateKnownFor20M145K2020Adhoc extends TwitterExecutionApp { - implicit val tz: java.util.TimeZone = DateOps.UTC - implicit val dp = DateParser.default - val log = Logger() - - def job: Execution[Unit] = - Execution.getConfigMode.flatMap { - case (config, mode) => - Execution.withId { implicit uniqueId => - val args = config.getArgs - - implicit def valueCodec: BinaryScalaCodec[Candidates] = BinaryScalaCodec(Candidates) - // Step - 1 (DataProcessing): Parameters for getting mapped indices for user-ids - val minActiveFollowers = args.int("minActiveFollowers", 400) - val topK = args.int("topK", 20000000) - - // Step - 2 (DataProcessing): Parameters to remove users not in the topK most followed users from simsGraph - val clusterAssignmentOutput = args("outputClusterDir") - val maxNeighbors = args.int("maxNeighbors", 400) - - // Step - 3 (Final Clustering): Parameters to run the clustering algorithm - val squareWeightsEnable = args.boolean("squareWeightsEnable") - - val maxEpochsForClustering = args.int("maxEpochs", 3) - val wtCoeff = args.double("wtCoeff", 10.0) - - val simsGraphPath = - "/atla/proc/user/cassowary/manhattan_sequence_files/approximate_cosine_similarity_follow" - // Read in the knownFor dataset, that can be used to initialize the clusters for this week. 
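        // Initialization order: if --inputPreviousKnownForDataSet is provided, the previous
        // assignments are read from that adhoc KeyVal output (this is how the Week-2 command above
        // chains onto the Week-1 run); otherwise the job falls back to the prod snapshot, using the
        // 20M145KDec11 model when --dec11 is passed and the 20M145KUpdated model otherwise.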
- val inputPreviousKnownFor: TypedPipe[(Long, Array[(Int, Float)])] = - args.optional("inputPreviousKnownForDataSet") match { - case Some(inputKnownForDir) => - println( - "Input knownFors provided, using these as the initial cluster assignments for users") - TypedPipe - .from(AdhocKeyValSources.knownForSBFResultsDevelSource(inputKnownForDir)) - case None => - println( - "Using knownFor Assignments from prod as no previous assignment was provided in the input") - if (args.boolean("dec11")) { - KnownForSources - .fromKeyVal( - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2KnownFor20M145KDec11ScalaDataset, - Days(30)).withRemoteReadPolicy(AllowCrossClusterSameDC).toTypedPipe, - ModelVersions.Model20M145KDec11 - ) - } else { - KnownForSources - .fromKeyVal( - DAL - .readMostRecentSnapshotNoOlderThan( - SimclustersV2KnownFor20M145KUpdatedScalaDataset, - Days(30)).withRemoteReadPolicy(AllowCrossClusterSameDC).toTypedPipe, - ModelVersions.Model20M145KUpdated - ) - } - } - UpdateKnownForSBFRunner - .runUpdateKnownFor( - TypedPipe - .from(FollowingsCosineSimilaritiesManhattanSource(simsGraphPath)) - .map(_._2), - minActiveFollowers, - topK, - maxNeighbors, - clusterAssignmentOutput, - inputPreviousKnownFor, - maxEpochsForClustering, - squareWeightsEnable, - wtCoeff, - mode - ) - .flatMap { updateKnownFor => - Execution - .zip( - updateKnownFor - .mapValues(_.toList).writeExecution(TypedTsv(clusterAssignmentOutput)), - updateKnownFor.writeExecution(AdhocKeyValSources.knownForSBFResultsDevelSource( - clusterAssignmentOutput + "_KeyVal")), - UpdateKnownForSBFRunner - .evaluateUpdatedKnownFor(updateKnownFor, inputPreviousKnownFor) - .flatMap { emailText => - Util - .sendEmail( - emailText, - s"Change in cluster assignments for new KnownFor ModelVersion: 20M145K2020" + clusterAssignmentOutput, - "no-reply@twitter.com") - Execution.unit - } - ).unit - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownForSBFRunner.scala b/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownForSBFRunner.scala deleted file mode 100644 index 952e88d13..000000000 --- a/src/scala/com/twitter/simclusters_v2/scalding/update_known_for/UpdateKnownForSBFRunner.scala +++ /dev/null @@ -1,685 +0,0 @@ -package com.twitter.simclusters_v2.scalding.update_known_for - -import com.twitter.algebird.Max -import com.twitter.hermit.candidate.thriftscala.Candidates -import com.twitter.sbf.core.AlgorithmConfig -import com.twitter.sbf.core.MHAlgorithm -import com.twitter.sbf.core.SparseBinaryMatrix -import com.twitter.sbf.core.SparseRealMatrix -import com.twitter.sbf.graph.Graph -import com.twitter.scalding.Days -import com.twitter.scalding.Execution -import com.twitter.scalding.Hdfs -import com.twitter.scalding.Mode -import com.twitter.scalding.Stat -import com.twitter.scalding.TypedTsv -import com.twitter.scalding.UniqueID -import com.twitter.scalding.commons.source.VersionedKeyValSource -import com.twitter.scalding.typed.TypedPipe -import com.twitter.scalding_internal.dalv2.DAL -import com.twitter.scalding_internal.dalv2.remote_access.ExplicitLocation -import com.twitter.scalding_internal.dalv2.remote_access.ProcAtla -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.AdhocKeyValSources -import com.twitter.simclusters_v2.scalding.CompareClusters -import com.twitter.simclusters_v2.scalding.KnownForSources -import com.twitter.simclusters_v2.scalding.TopUser -import 
com.twitter.simclusters_v2.scalding.TopUserWithMappedId -import com.twitter.simclusters_v2.scalding.TopUsersSimilarityGraph -import com.twitter.simclusters_v2.scalding.common.Util -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import java.io.PrintWriter -import java.util.TimeZone -import org.apache.commons.math3.random.JDKRandomGenerator -import org.apache.commons.math3.random.RandomAdaptor -import org.apache.hadoop.fs.FileSystem -import org.apache.hadoop.fs.Path -import scala.collection.mutable - -object UpdateKnownForSBFRunner { - - /** - * The main logic of the job. It works as follows: - * - * 1. read the top 20M users, and convert their UserIds to an integer Id from 0 to 20M in order to use the clustering library - * 2. read the user similarity graph from Sims, and convert their UserIds to the same mapped integer Id - * 3. read the previous known_for data set for initialization of the clustering algorithm; - * for users without previous assignments, we randomly assign them to some unused clusters (if there are any). - * 4. run the clustering algorithm for x iterations (x = 4 in the prod setting) - * 5. output of the clustering result as the new known_for. - * - */ - def runUpdateKnownFor( - simsGraph: TypedPipe[Candidates], - minActiveFollowers: Int, - topK: Int, - maxNeighbors: Int, - tempLocationPath: String, - previousKnownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])], - maxEpochsForClustering: Int, - squareWeightsEnable: Boolean, - wtCoeff: Double, - mode: Mode - )( - implicit - uniqueId: UniqueID, - tz: TimeZone - ): Execution[TypedPipe[(UserId, Array[(ClusterId, Float)])]] = { - - val tempLocationPathSimsGraph = tempLocationPath + "/sims_graph" - val tempLocationPathMappedIds = tempLocationPath + "/mapped_user_ids" - val tempLocationPathClustering = tempLocationPath + "/clustering_output" - - val mappedIdsToUserIds: TypedPipe[(Int, UserId)] = - getTopFollowedUsersWithMappedIds(minActiveFollowers, topK) - .map { - case (id, mappedId) => - (mappedId, id) - } - .shard(partitions = topK / 1e5.toInt) - - val mappedSimsGraphInput: TypedPipe[(Int, List[(Int, Float)])] = - getMappedSimsGraph( - mappedIdsToUserIds, - simsGraph, - maxNeighbors - ) // The simsGraph here consists of the mapped Ids and mapped ngbr Ids and not the original userIds - - val mappedSimsGraphVersionedKeyVal: VersionedKeyValSource[Int, List[(Int, Float)]] = - AdhocKeyValSources.intermediateSBFResultsDevelSource(tempLocationPathSimsGraph) - val mappedIdsToUserIdsVersionedKeyVal: VersionedKeyValSource[Int, UserId] = - AdhocKeyValSources.mappedIndicesDevelSource(tempLocationPathMappedIds) - - // exec to write intermediate results for mapped Sims Graph and mappedIds - val mappedSimsGraphAndMappedIdsWriteExec: Execution[Unit] = Execution - .zip( - mappedSimsGraphInput.writeExecution(mappedSimsGraphVersionedKeyVal), - mappedIdsToUserIds.writeExecution(mappedIdsToUserIdsVersionedKeyVal) - ).unit - - mappedSimsGraphAndMappedIdsWriteExec.flatMap { _ => - // The simsGraph and the mappedIds from userId(long) -> mappedIds are - // having to be written to a temporary location and read again before running - // the clustering algorithm. 
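      // (Writing these pipes out and reading them back acts as a checkpoint: it likely forces the
      // expensive upstream computation to be materialized exactly once before the clustering step,
      // and readIntermediateExec below also deletes the temporary paths after they are read back.)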
- - Execution - .zip( - readIntermediateExec( - TypedPipe.from(mappedSimsGraphVersionedKeyVal), - mode, - tempLocationPathSimsGraph), - readIntermediateExec( - TypedPipe.from(mappedIdsToUserIdsVersionedKeyVal), - mode, - tempLocationPathMappedIds) - ) - .flatMap { - case (mappedSimsGraphInputReadAgain, mappedIdsToUserIdsReadAgain) => - val previousKnownForMappedIdsAssignments: TypedPipe[(Int, List[(ClusterId, Float)])] = - getKnownForWithMappedIds( - previousKnownFor, - mappedIdsToUserIdsReadAgain, - ) - - val clusteringResults = getClusteringAssignments( - mappedSimsGraphInputReadAgain, - previousKnownForMappedIdsAssignments, - maxEpochsForClustering, - squareWeightsEnable, - wtCoeff - ) - clusteringResults - .flatMap { updatedKnownFor => - // convert the list of updated KnownFor to a TypedPipe - convertKnownForListToTypedPipe( - updatedKnownFor, - mode, - tempLocationPathClustering - ) - } - .flatMap { updatedKnownForTypedPipe => - // convert the mapped integer id to raw user ids - val updatedKnownFor = - updatedKnownForTypedPipe - .join(mappedIdsToUserIdsReadAgain) - .values - .swap - .mapValues(_.toArray) - - Execution.from(updatedKnownFor) - } - } - } - } - - /** - * Helper function to compare newKnownFor with the previous week knownFor assignments - */ - def evaluateUpdatedKnownFor( - newKnownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])], - inputKnownFor: TypedPipe[(UserId, Array[(ClusterId, Float)])] - )( - implicit uniqueId: UniqueID - ): Execution[String] = { - - val minSizeOfBiggerClusterForComparison = 10 - - val compareClusterExec = CompareClusters.summarize( - CompareClusters.compare( - KnownForSources.transpose(inputKnownFor), - KnownForSources.transpose(newKnownFor), - minSizeOfBiggerCluster = minSizeOfBiggerClusterForComparison - )) - - val compareProducerExec = CompareClusters.compareClusterAssignments( - newKnownFor.mapValues(_.toList), - inputKnownFor.mapValues(_.toList) - ) - - Execution - .zip(compareClusterExec, compareProducerExec) - .map { - case (compareClusterResults, compareProducerResult) => - s"Cosine similarity distribution between cluster membership vectors for " + - s"clusters with at least $minSizeOfBiggerClusterForComparison members\n" + - Util.prettyJsonMapper - .writeValueAsString(compareClusterResults) + - "\n\n-------------------\n\n" + - "Custom counters:\n" + compareProducerResult + - "\n\n-------------------\n\n" - } - } - - /** - * - * Convert the list of updated KnownFor to a TypedPipe - * - * This step should have been done using TypedPipe.from(updatedKnownForList), however, due to the - * large size of the list, TypedPipe would throw out-of-memory exceptions. So we have to first - * dump it to a temp file on HDFS and using a customized read function to load to TypedPipe - * - */ - def convertKnownForListToTypedPipe( - updatedKnownForList: List[(Int, List[(ClusterId, Float)])], - mode: Mode, - temporaryOutputStringPath: String - ): Execution[TypedPipe[(Int, List[(ClusterId, Float)])]] = { - - val stringOutput = updatedKnownForList.map { - case (mappedUserId, clusterArray) => - assert(clusterArray.isEmpty || clusterArray.length == 1) - val str = if (clusterArray.nonEmpty) { - clusterArray.head._1 + " " + clusterArray.head._2 // each user is known for at most 1 cluster - } else { - "" - } - if (mappedUserId % 100000 == 0) - println(s"MappedIds:$mappedUserId ClusterAssigned$str") - s"$mappedUserId $str" - } - - // using Execution to enforce the order of the following 3 steps: - // 1. write the list of strings to a temp file on HDFS - // 2. 
read the strings to TypedPipe - // 3. delete the temp file - Execution - .from( - // write the output to HDFS; the data will be loaded to Typedpipe later; - // the reason of doing this is that we can not just do TypePipe.from(stringOutput) which - // results in OOM. - TopUsersSimilarityGraph.writeToHDFSIfHDFS( - stringOutput.toIterator, - mode, - temporaryOutputStringPath - ) - ) - .flatMap { _ => - println(s"Start loading the data from $temporaryOutputStringPath") - val clustersWithScores = TypedPipe.from(TypedTsv[String](temporaryOutputStringPath)).map { - mappedIdsWithArrays => - val strArray = mappedIdsWithArrays.trim().split("\\s+") - assert(strArray.length == 3 || strArray.length == 1) - val rowId = strArray(0).toInt - val clusterAssignment: List[(ClusterId, Float)] = - if (strArray.length > 1) { - List((strArray(1).toInt, strArray(2).toFloat)) - } else { - // the knownFors will have users with Array.empty as their assignment if - // the clustering step have empty results for that user. - Nil - } - - if (rowId % 100000 == 0) - println(s"rowId:$rowId ClusterAssigned: $clusterAssignment") - (rowId, clusterAssignment) - } - // return the dataset as an execution and delete the temp location - readIntermediateExec(clustersWithScores, mode, temporaryOutputStringPath) - } - } - - /** - * Helper function to read the dataset as execution and delete the temporary - * location on HDFS for PDP compliance - */ - def readIntermediateExec[K, V]( - dataset: TypedPipe[(K, V)], - mode: Mode, - tempLocationPath: String - ): Execution[TypedPipe[(K, V)]] = { - Execution - .from(dataset) - .flatMap { output => - // delete the temporary outputs for PDP compliance - mode match { - case Hdfs(_, conf) => - val fs = FileSystem.newInstance(conf) - if (fs.deleteOnExit(new Path(tempLocationPath))) { - println(s"Successfully deleted the temporary folder $tempLocationPath!") - } else { - println(s"Failed to delete the temporary folder $tempLocationPath!") - } - case _ => () - } - Execution.from(output) - } - } - - /** - * Converts the userIDs in the sims graph to their mapped integer indices. 
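   * Each user's neighbor list is first truncated to the top maxNeighborsPerNode neighbors by score;
   * the graph is then symmetrized, keeping the maximum score when an edge appears in both directions.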
- * All the users who donot have a mapping are filtered out from the sims graph input - * - * @param mappedUsers mapping of long userIDs to their integer indices - * @param allEdges sims graph - * @param maxNeighborsPerNode number of neighbors for each user - * - * @return simsGraph of users and neighbors with their mapped interger ids - */ - def getMappedSimsGraph( - mappedUsers: TypedPipe[(Int, UserId)], - allEdges: TypedPipe[Candidates], - maxNeighborsPerNode: Int - )( - implicit uniqueId: UniqueID - ): TypedPipe[(Int, List[(Int, Float)])] = { - - val numEdgesAfterFirstJoin = Stat("num_edges_after_first_join") - val numEdgesAfterSecondJoin = Stat("num_edges_after_second_join") - val numEdgesLostTopKTruncated = Stat("num_edges_lost_topk_truncated") - val finalNumEdges = Stat("final_num_edges") - - val mappedUserIdsToIds: TypedPipe[(UserId, Int)] = mappedUsers.swap - allEdges - .map { cs => (cs.userId, cs.candidates) } - // filter the users not present in the mapped userIDs list - .join(mappedUserIdsToIds) - .withReducers(6000) - .flatMap { - case (id, (neighbors, mappedId)) => - val before = neighbors.size - val topKNeighbors = neighbors.sortBy(-_.score).take(maxNeighborsPerNode) - val after = topKNeighbors.size - numEdgesLostTopKTruncated.incBy(before - after) - topKNeighbors.map { candidate => - numEdgesAfterFirstJoin.inc() - (candidate.userId, (mappedId, candidate.score.toFloat)) - } - } - .join(mappedUserIdsToIds) - .withReducers(9000) - .flatMap { - case (id, ((mappedNeighborId, score), mappedId)) => - numEdgesAfterSecondJoin.inc() - // to make the graph symmetric, add those edges back that might have been filtered - // due to maxNeighborsPerNodefor a user but not for its neighbors - List( - (mappedId, Map(mappedNeighborId -> Max(score))), - (mappedNeighborId, Map(mappedId -> Max(score))) - ) - } - .sumByKey - .withReducers(9100) - .map { - case (id, nbrMap) => - // Graph initialization expects neighbors to be sorted in ascending order of ids - val sorted = nbrMap.mapValues(_.get).toList.sortBy(_._1) - finalNumEdges.incBy(sorted.size) - (id, sorted) - } - } - - def getTopFollowedUsersWithMappedIds( - minActiveFollowers: Int, - topK: Int - )( - implicit uniqueId: UniqueID, - timeZone: TimeZone - ): TypedPipe[(Long, Int)] = { - val numTopUsersMappings = Stat("num_top_users_with_mapped_ids") - println("Going to include mappedIds in output") - TopUsersSimilarityGraph - .topUsersWithMappedIdsTopK( - DAL - .readMostRecentSnapshotNoOlderThan( - UsersourceFlatScalaDataset, - Days(30)).withRemoteReadPolicy(ExplicitLocation(ProcAtla)).toTypedPipe, - minActiveFollowers, - topK - ) - .map { - case TopUserWithMappedId(TopUser(id, activeFollowerCount, screenName), mappedId) => - numTopUsersMappings.inc() - (id, mappedId) - } - } - - /** - * Map the userIds in the knownFor dataset to their integer Ids . - */ - def getKnownForWithMappedIds( - knownForDataset: TypedPipe[(UserId, Array[(ClusterId, Float)])], //original userId as the key - mappedIdsWithUserId: TypedPipe[(Int, UserId)] //mapped userId as the key - ): TypedPipe[(Int, List[(ClusterId, Float)])] = { - val userIdsAndTheirMappedIndices = mappedIdsWithUserId.map { - case (mappedId, originalId) => (originalId, mappedId) - } - knownForDataset.join(userIdsAndTheirMappedIndices).map { - case (userId, (userClusterArray, mappedUserId)) => - (mappedUserId, userClusterArray.toList) - } - } - - /** - * Attach the cluster assignments from knownFor dataset to the users in mapped Sims graph . 
- */ - def attachClusterAssignments( - mappedSimsGraph: TypedPipe[(Int, List[(Int, Float)])], - knownForAssignments: TypedPipe[(Int, List[(ClusterId, Float)])], - squareWeights: Boolean - )( - implicit uniqueId: UniqueID - ): TypedPipe[(Int, Array[Int], Array[Float], List[(ClusterId, Float)])] = { - val numPopularUsersWithNoKnownForBefore = Stat( - "num_popular_users_with_no_knownfor_before_but_popular_now") - - val input = mappedSimsGraph.map { - case (id, nbrsList) => - val ngbrIds = nbrsList.map(_._1).toArray - val ngbrWts = if (squareWeights) { - nbrsList.map(_._2).map(currWt => currWt * currWt * 10).toArray - } else { - nbrsList.map(_._2).toArray - } - (id, ngbrIds, ngbrWts) - } - - // input simsGraph consists of popular ppl with most followed users, who might not have been - // a knownFor user in the previous week. So left join with the knownFor dataset, and these - // new popular users will not have any prior cluster assignments while clustering this time - input - .groupBy(_._1) - .leftJoin(knownForAssignments.groupBy(_._1)) - .toTypedPipe - .map { - case (mappedUserId, ((mappedId, ngbrIds, ngbrWts), knownForResult)) => - val clustersList: List[(Int, Float)] = knownForResult match { - case Some(values) => values._2 - case None => - numPopularUsersWithNoKnownForBefore.inc() - List.empty - } - (mappedUserId, ngbrIds, ngbrWts, clustersList) - } - } - - /** - * Initialize graph with users and neighbors with edge weights . - */ - def getGraphFromSimsInput( - mappedSimsIter: Iterable[ - (Int, Array[Int], Array[Float], List[(ClusterId, Float)]) - ], - numUsers: Int - ): Graph = { - val nbrsIds: Array[Array[Int]] = new Array[Array[Int]](numUsers) - val nbrsWts: Array[Array[Float]] = new Array[Array[Float]](numUsers) - var numEdges = 0L - var numVertices = 0 - var numVerticesWithNoNgbrs = 0 - mappedSimsIter.foreach { - case (id, nbrArrayIds, nbArrayScores, _) => - nbrsIds(id) = nbrArrayIds - nbrsWts(id) = nbArrayScores - numEdges += nbrArrayIds.length - numVertices += 1 - if (numVertices % 100000 == 0) { - println(s"Done loading $numVertices many vertices. 
Edges so far: $numEdges") - } - } - - (0 until numUsers).foreach { i => - if (nbrsIds(i) == null) { - numVerticesWithNoNgbrs += 1 - nbrsIds(i) = Array[Int]() - nbrsWts(i) = Array[Float]() - } - } - - println( - s"Done loading graph with $numUsers nodes and $numEdges edges (counting each edge twice)") - println("Number of nodes with at least one neighbor is " + numVertices) - println("Number of nodes with at no neighbors is " + numVerticesWithNoNgbrs) - new Graph(numUsers, numEdges / 2, nbrsIds, nbrsWts) - } - - /** - * Helper function that initializes users to clusters based on previous knownFor assignments - * and for users with no previous assignments, assign them randomly to any of the empty clusters - */ - def initializeSparseBinaryMatrix( - graph: Graph, - mappedSimsGraphIter: Iterable[ - (Int, Array[Int], Array[Float], List[(ClusterId, Float)]) - ], // user with neighbors, neighbor wts and previous knownfor assignments - numUsers: Int, - numClusters: Int, - algoConfig: AlgorithmConfig, - ): SparseBinaryMatrix = { - var clustersSeenFromPreviousWeek: Set[Int] = Set.empty - var emptyClustersFromPreviousWeek: Set[Int] = Set.empty - var usersWithNoAssignmentsFromPreviousWeek: Set[Int] = Set.empty - mappedSimsGraphIter.foreach { - case (id, _, _, knownFor) => - if (knownFor.isEmpty) { - usersWithNoAssignmentsFromPreviousWeek += id - } - knownFor.foreach { - case (clusterId, _) => - clustersSeenFromPreviousWeek += clusterId - } - } - (1 to numClusters).foreach { i => - if (!clustersSeenFromPreviousWeek.contains(i)) emptyClustersFromPreviousWeek += i - } - var z = new SparseBinaryMatrix(numUsers, numClusters) - println("Going to initialize from previous KnownFor") - var zeroIndexedClusterIdsFromPreviousWeek: Set[Int] = Set.empty - for (clusterIdOneIndexed <- emptyClustersFromPreviousWeek) { - zeroIndexedClusterIdsFromPreviousWeek += (clusterIdOneIndexed - 1) - } - // Initialize z - users with no previous assignments are assigned to empty clusters - z.initFromSubsetOfRowsForSpecifiedColumns( - graph, - (gr: Graph, i: Integer) => algoConfig.rng.nextDouble, - zeroIndexedClusterIdsFromPreviousWeek.toArray, - usersWithNoAssignmentsFromPreviousWeek.toArray, - new PrintWriter(System.err) - ) - println("Initialized the empty clusters") - mappedSimsGraphIter.foreach { - case (id, _, _, knownFor) => - val currClustersForUserZeroIndexed = knownFor.map(_._1).map(x => x - 1) - // Users who have a previous cluster assignment are initialized with the same cluster - if (currClustersForUserZeroIndexed.nonEmpty) { - z.updateRow(id, currClustersForUserZeroIndexed.sorted.toArray) - } - } - println("Done initializing from previous knownFor assignment") - z - } - - /** - * Optimize the sparseBinaryMatrix. 
This function runs the clustering epochs and computes the - * cluster assignments for the next week, based on the underlying user-user graph - */ - def optimizeSparseBinaryMatrix( - algoConfig: AlgorithmConfig, - graph: Graph, - z: SparseBinaryMatrix - ): SparseBinaryMatrix = { - val prec0 = MHAlgorithm.clusterPrecision(graph, z, 0, 1000, algoConfig.rng) - println("Precision of cluster 0:" + prec0.precision) - val prec1 = MHAlgorithm.clusterPrecision(graph, z, 1, 1000, algoConfig.rng) - println("Precision of cluster 1:" + prec1.precision) - val algo = new MHAlgorithm(algoConfig, graph, z, new PrintWriter(System.err)) - val optimizedZ = algo.optimize - optimizedZ - } - - /** - * Helper function that takes the heuristically scored association of user to a cluster - * and returns the knownFor result - * @param srm SparseRealMatrix with (row, col) score denoting the membership score of user in the cluster - * @return assignments of users (mapped integer indices) to clusters with knownFor scores. - */ - def getKnownForHeuristicScores(srm: SparseRealMatrix): List[(Int, List[(ClusterId, Float)])] = { - val knownForAssignmentsFromClusterScores = (0 until srm.getNumRows).map { rowId => - val rowWithIndices = srm.getColIdsForRow(rowId) - val rowWithScores = srm.getValuesForRow(rowId) - val allClustersWithScores: Array[(ClusterId, Float)] = - rowWithIndices.zip(rowWithScores).map { - case (colId, score) => (colId + 1, score.toFloat) - } - if (rowId % 100000 == 0) { - println("Inside outputIter:" + rowId + " " + srm.getNumRows) - } - - val clusterAssignmentWithMaxScore: List[(ClusterId, Float)] = - if (allClustersWithScores.length > 1) { - // if sparseBinaryMatrix z has rows with more than one non-zero column (i.e a user - // initialized with more than one cluster), and the clustering algorithm doesnot find - // a better proposal for cluster assignment, the user's multi-cluster membership - // from the initialization step can continue. - // We found that this happens in ~0.1% of the knownFor users. Hence choose the - // cluster with the highest score to deal with such edge cases. 
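          // Illustrative example with hypothetical scores: Array((12, 0.8f), (45, 0.3f)) collapses
          // to List((12, 0.8f)), i.e. the user stays known for cluster 12 only.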
- val result: (ClusterId, Float) = allClustersWithScores.maxBy(_._2) - println( - "Found a user with mappedId: %s with more than 1 cluster assignment:%s; Assigned to the best cluster: %s" - .format( - rowId.toString, - allClustersWithScores.mkString("Array(", ", ", ")"), - result - .toString())) - List(result) - } else { - allClustersWithScores.toList - } - (rowId, clusterAssignmentWithMaxScore) - } - knownForAssignmentsFromClusterScores.toList - } - - /** - * Function that computes the clustering assignments to users - * - * @param mappedSimsGraph user-user graph as input to clustering - * @param previousKnownForAssignments previous week clustering assignments - * @param maxEpochsForClustering number of neighbors for each user - * @param squareWeights boolean flag for the edge weights in the sims graph - * @param wtCoeff wtCoeff - * - * @return users with clusters assigned - */ - def getClusteringAssignments( - mappedSimsGraph: TypedPipe[(Int, List[(Int, Float)])], - previousKnownForAssignments: TypedPipe[(Int, List[(ClusterId, Float)])], - maxEpochsForClustering: Int, - squareWeights: Boolean, - wtCoeff: Double - )( - implicit uniqueId: UniqueID - ): Execution[List[(Int, List[(ClusterId, Float)])]] = { - - attachClusterAssignments( - mappedSimsGraph, - previousKnownForAssignments, - squareWeights).toIterableExecution.flatMap { mappedSimsGraphWithClustersIter => - val tic = System.currentTimeMillis - var maxVertexId = 0 - var maxClusterIdInPreviousAssignment = 0 - mappedSimsGraphWithClustersIter.foreach { - case (id, _, _, knownFor) => - maxVertexId = Math.max(id, maxVertexId) - knownFor.foreach { - case (clusterId, _) => - maxClusterIdInPreviousAssignment = - Math.max(clusterId, maxClusterIdInPreviousAssignment) - } - } - - val numUsersToCluster = - maxVertexId + 1 //since users were mapped with index starting from 0, using zipWithIndex - println("Total number of topK users to be clustered this time:" + numUsersToCluster) - println( - "Total number of clusters in the previous knownFor assignment:" + maxClusterIdInPreviousAssignment) - println("Will set number of communities to " + maxClusterIdInPreviousAssignment) - - // Initialize the graph with users, neighbors and the corresponding edge weights - val graph = getGraphFromSimsInput(mappedSimsGraphWithClustersIter, numUsersToCluster) - val toc = System.currentTimeMillis() - println("Time to load the graph " + (toc - tic) / 1000.0 / 60.0 + " minutes") - - // define the algoConfig parameters - val algoConfig = new AlgorithmConfig() - .withCpu(16).withK(maxClusterIdInPreviousAssignment) - .withWtCoeff(wtCoeff.toDouble) - .withMaxEpoch(maxEpochsForClustering) - algoConfig.divideResultIntoConnectedComponents = false - algoConfig.minClusterSize = 1 - algoConfig.updateImmediately = true - algoConfig.rng = new RandomAdaptor(new JDKRandomGenerator(1)) - - // Initialize a sparseBinaryMatrix with users assigned to their previous week knownFor - // assignments. For those users who do not a prior assignment, we assign - // the (user + the neighbors from the graph) to the empty clusters. 
- // Please note that this neighborhood-based initialization to empty clusters can - // have a few cases where the same user was assigned to more than one cluster - val z = initializeSparseBinaryMatrix( - graph, - mappedSimsGraphWithClustersIter, - numUsersToCluster, - maxClusterIdInPreviousAssignment, - algoConfig - ) - - // Run the epochs of the clustering algorithm to find the new cluster assignments - val tic2 = System.currentTimeMillis - val optimizedZ = optimizeSparseBinaryMatrix(algoConfig, graph, z) - val toc2 = System.currentTimeMillis - println("Time to optimize: %.2f seconds\n".format((toc2 - tic2) / 1000.0)) - println("Time to initialize & optimize: %.2f seconds\n".format((toc2 - toc) / 1000.0)) - - // Attach scores to the cluster assignments - val srm = MHAlgorithm.heuristicallyScoreClusterAssignments(graph, optimizedZ) - - // Get the knownfor assignments of users from the heuristic scores - // assigned based on neigbhorhood of the user and their cluster assignments - // The returned result has userIDs in the mapped integer indices - val knownForAssignmentsFromClusterScores: List[(Int, List[(ClusterId, Float)])] = - getKnownForHeuristicScores(srm) - - Execution.from(knownForAssignmentsFromClusterScores) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BQGenerationUtil.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BQGenerationUtil.scala deleted file mode 100644 index a433bc732..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BQGenerationUtil.scala +++ /dev/null @@ -1,255 +0,0 @@ -package com.twitter.simclusters_v2.scio -package bq_generation.common - -import com.twitter.wtf.beam.bq_embedding_export.BQQueryUtils -import org.joda.time.DateTime - -object BQGenerationUtil { - // Consumer Embeddings BQ table details - val interestedInEmbeddings20M145K2020Table = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_v2_user_to_interested_in_20M_145K_2020", - ) - val mtsConsumerEmbeddingsFav90P20MTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "mts_consumer_embeddings_fav90p_20m", - ) - - // Common SQL path - val TweetFavCountSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_fav_count.sql" - - val NSFWTweetIdDenylistSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/nsfw_tweet_denylist.sql" - - val ClusterTopTweetsIntersectionWithFavBasedIndexSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets_intersection_with_fav_based_index.sql" - - // Read InterestedIn 2020 - def getInterestedIn2020SQL( - queryDate: DateTime, - lookBackDays: Int - ): String = { - s""" - |SELECT userId, - | clusterIdToScores.key AS clusterId, - | clusterIdToScores.value.logFavScore AS userScore, - | clusterIdToScores.value.logFavScoreClusterNormalizedOnly AS clusterNormalizedLogFavScore, - |FROM `$interestedInEmbeddings20M145K2020Table`, UNNEST(clusterIdToScores) AS clusterIdToScores - |WHERE DATE(_PARTITIONTIME) = - | ( -- Get latest partition time - | SELECT MAX(DATE(_PARTITIONTIME)) latest_partition - | FROM `$interestedInEmbeddings20M145K2020Table` - | WHERE Date(_PARTITIONTIME) BETWEEN - | DATE_SUB(Date("${queryDate}"), - | INTERVAL $lookBackDays DAY) AND DATE("$queryDate") - | ) - | AND clusterIdToScores.value.logFavScore > 0.0 # min score threshold for user embedding values - |""".stripMargin - } - - // Read MTS Consumer Embeddings - Fav90P20M config - def getMTSConsumerEmbeddingsFav90P20MSQL( - queryDate: DateTime, - 
lookBackDays: Int - ): String = { - // We read the most recent snapshot of MTS Consumer Embeddings Fav90P20M - s""" - |SELECT userId, - | clusterIdToScores.key AS clusterId, - | clusterIdToScores.value.logFavUserScore AS userScore, - | clusterIdToScores.value.logFavUserScoreClusterNormalized AS clusterNormalizedLogFavScore - | FROM `$mtsConsumerEmbeddingsFav90P20MTable`, UNNEST(embedding.clusterIdToScores) AS clusterIdToScores - |WHERE DATE(ingestionTime) = ( - | -- Get latest partition time - | SELECT MAX(DATE(ingestionTime)) latest_partition - | FROM `$mtsConsumerEmbeddingsFav90P20MTable` - | WHERE Date(ingestionTime) BETWEEN - | DATE_SUB(Date("${queryDate}"), - | INTERVAL $lookBackDays DAY) AND DATE("${queryDate}") - |) AND clusterIdToScores.value.logFavUserScore > 0.0 - |""".stripMargin - } - - /* - * For a specific tweet engagement, retrieve the user id, tweet id, and timestamp - * - * Return: - * String - UserId, TweetId and Timestamp table SQL string format - * Table Schema - * - userId: Long - * - tweetId: Long - * - tsMillis: Long - */ - def getUserTweetEngagementEventPairSQL( - startTime: DateTime, - endTime: DateTime, - userTweetEngagementEventPairSqlPath: String, - userTweetEngagementEventPairTemplateVariable: Map[String, String] - ): String = { - val templateVariables = Map( - "START_TIME" -> startTime.toString(), - "END_TIME" -> endTime.toString(), - "NO_OLDER_TWEETS_THAN_DATE" -> startTime.toString() - ) ++ userTweetEngagementEventPairTemplateVariable - BQQueryUtils.getBQQueryFromSqlFile(userTweetEngagementEventPairSqlPath, templateVariables) - } - - /* - * Retrieve tweets and the # of favs it got from a given time window - * - * Return: - * String - TweetId and fav count table SQL string format - * Table Schema - * - tweetId: Long - * - favCount: Long - */ - def getTweetIdWithFavCountSQL( - startTime: DateTime, - endTime: DateTime, - ): String = { - val templateVariables = - Map( - "START_TIME" -> startTime.toString(), - "END_TIME" -> endTime.toString(), - ) - BQQueryUtils.getBQQueryFromSqlFile(TweetFavCountSQLPath, templateVariables) - } - - /* - * From a given time window, retrieve tweetIds that were created by specific author or media type - * - * Input: - * - startTime: DateTime - * - endTime: DateTime - * - filterMediaType: Option[Int] - * MediaType - * 1: Image - * 2: GIF - * 3: Video - * - filterNSFWAuthor: Boolean - * Whether we want to filter out NSFW tweet authors - * - * Return: - * String - TweetId table SQL string format - * Table Schema - * - tweetId: Long - */ - def getTweetIdWithMediaAndNSFWAuthorFilterSQL( - startTime: DateTime, - endTime: DateTime, - filterMediaType: Option[Int], - filterNSFWAuthor: Boolean - ): String = { - val sql = s""" - |SELECT DISTINCT tweetId - |FROM `twttr-bq-tweetsource-prod.user.unhydrated_flat` tweetsource, UNNEST(media) AS media - |WHERE (DATE(_PARTITIONTIME) >= DATE("${startTime}") AND DATE(_PARTITIONTIME) <= DATE("${endTime}")) AND - | timestamp_millis((1288834974657 + - | ((tweetId & 9223372036850581504) >> 22))) >= TIMESTAMP("${startTime}") - | AND timestamp_millis((1288834974657 + - | ((tweetId & 9223372036850581504) >> 22))) <= TIMESTAMP("${endTime}") - |""".stripMargin - - val filterMediaStr = filterMediaType match { - case Some(mediaType) => s" AND media.media_type =${mediaType}" - case _ => "" - } - val filterNSFWAuthorStr = if (filterNSFWAuthor) " AND nsfwUser = false" else "" - sql + filterMediaStr + filterNSFWAuthorStr - } - - /* - * From a given time window, retrieve tweetIds that fall into the NSFW deny list - * - 
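   * (The query is built from the SQL template at NSFWTweetIdDenylistSQLPath, with START_TIME and
   * END_TIME substituted in.)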
* Input: - * - startTime: DateTime - * - endTime: DateTime - * - * Return: - * String - TweetId table SQL string format - * Table Schema - * - tweetId: Long - */ - def getNSFWTweetIdDenylistSQL( - startTime: DateTime, - endTime: DateTime, - ): String = { - val templateVariables = - Map( - "START_TIME" -> startTime.toString(), - "END_TIME" -> endTime.toString(), - ) - BQQueryUtils.getBQQueryFromSqlFile(NSFWTweetIdDenylistSQLPath, templateVariables) - } - - /* - * From a given cluster id to top k tweets table and a time window, - * (1) Retrieve the latest fav-based top tweets per cluster table within the time window - * (2) Inner join with the given table using cluster id and tweet id - * (3) Create the top k tweets per cluster table for the intersection - * - * Input: - * - startTime: DateTime - * - endTime: DateTime - * - topKTweetsForClusterKeySQL: String, a SQL query - * - * Return: - * String - TopKTweetsForClusterKey table SQL string format - * Table Schema - * - clusterId: Long - * - topKTweetsForClusterKey: (Long, Long) - * - tweetId: Long - * - tweetScore: Long - */ - def generateClusterTopTweetIntersectionWithFavBasedIndexSQL( - startTime: DateTime, - endTime: DateTime, - clusterTopKTweets: Int, - topKTweetsForClusterKeySQL: String - ): String = { - val templateVariables = - Map( - "START_TIME" -> startTime.toString(), - "END_TIME" -> endTime.toString(), - "CLUSTER_TOP_K_TWEETS" -> clusterTopKTweets.toString, - "CLUSTER_TOP_TWEETS_SQL" -> topKTweetsForClusterKeySQL - ) - BQQueryUtils.getBQQueryFromSqlFile( - ClusterTopTweetsIntersectionWithFavBasedIndexSQLPath, - templateVariables) - } - - /* - * Given a list of action types, build a string that indicates the user - * engaged with the tweet - * - * Example use case: We want to build a SQL query that specifies this user engaged - * with tweet with either fav or retweet actions. 
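   * (Each action name referenced this way is expected to be a column with an engaged subfield that
   * is 1 when the user took that action, as in the example SQL below.)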
- * - * Input: - * - actionTypes: Seq("ServerTweetFav", "ServerTweetRetweet") - * - booleanOperator: "OR" - * Output: "ServerTweetFav.engaged = 1 OR ServerTweetRetweet.engaged = 1" - * - * Example SQL: - * SELECT ServerTweetFav, ServerTweetRetweet - * FROM table - * WHERE ServerTweetFav.engaged = 1 OR ServerTweetRetweet.engaged = 1 - */ - def buildActionTypesEngagementIndicatorString( - actionTypes: Seq[String], - booleanOperator: String = "OR" - ): String = { - actionTypes.map(action => f"""${action}.engaged = 1""").mkString(f""" ${booleanOperator} """) - } -} - -case class BQTableDetails( - projectName: String, - tableName: String, - datasetName: String) { - override def toString: String = s"${projectName}.${tableName}.${datasetName}" -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BUILD deleted file mode 100644 index 1ed000bc5..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/BUILD +++ /dev/null @@ -1,10 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/wtf/beam/bq_embedding_export:bq_embedding_export_lib", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/IndexGenerationUtil.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/IndexGenerationUtil.scala deleted file mode 100644 index bfbc00e71..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/common/IndexGenerationUtil.scala +++ /dev/null @@ -1,63 +0,0 @@ -package com.twitter.simclusters_v2.scio -package bq_generation.common - -import com.twitter.algebird_internal.thriftscala.DecayedValue -import com.twitter.simclusters_v2.thriftscala.FullClusterId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.Scores -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.snowflake.id.SnowflakeId -import org.apache.avro.generic.GenericRecord -import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord -import org.apache.beam.sdk.transforms.SerializableFunction -import scala.collection.JavaConverters._ - -object IndexGenerationUtil { - // Function that parses [GenericRecord] results we read from BQ into [TopKTweetsForClusterKey] - def parseClusterTopKTweetsFn(tweetEmbeddingsHalfLife: Int) = - new SerializableFunction[SchemaAndRecord, TopKTweetsForClusterKey] { - override def apply(record: SchemaAndRecord): TopKTweetsForClusterKey = { - val genericRecord: GenericRecord = record.getRecord() - TopKTweetsForClusterKey( - clusterId = FullClusterId( - modelVersion = ModelVersion.Model20m145k2020, - clusterId = genericRecord.get("clusterId").toString.toInt - ), - topKTweetsWithScores = parseTopKTweetsForClusterKeyColumn( - genericRecord, - "topKTweetsForClusterKey", - tweetEmbeddingsHalfLife), - ) - } - } - - // Function that parses the topKTweetsForClusterKey column into [TopKTweetsWithScores] - def parseTopKTweetsForClusterKeyColumn( - genericRecord: GenericRecord, - columnName: String, - tweetEmbeddingsHalfLife: Int - ): TopKTweetsWithScores = { - val tweetScorePairs: java.util.List[GenericRecord] = - genericRecord.get(columnName).asInstanceOf[java.util.List[GenericRecord]] - val tweetIdToScoresMap = tweetScorePairs.asScala - .map((gr: GenericRecord) => { - // Retrieve the tweetId and tweetScore - val tweetId = gr.get("tweetId").toString.toLong - val tweetScore = 
gr.get("tweetScore").toString.toDouble - - // Transform tweetScore into DecayedValue - // Ref: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/DecayedValue.scala - val scaledTime = - SnowflakeId.unixTimeMillisFromId(tweetId) * math.log(2.0) / tweetEmbeddingsHalfLife - val decayedValue = DecayedValue(tweetScore, scaledTime) - - // Update the TopTweets Map - tweetId -> Scores(favClusterNormalized8HrHalfLifeScore = Some(decayedValue)) - }).toMap - TopKTweetsWithScores(topTweetsByFavClusterNormalizedScore = Some(tweetIdToScoresMap)) - } - case class TopKTweetsForClusterKey( - clusterId: FullClusterId, - topKTweetsWithScores: TopKTweetsWithScores) - -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/BUILD deleted file mode 100644 index 441d8a98a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/BUILD +++ /dev/null @@ -1,250 +0,0 @@ -scala_library( - name = "ftr_bq_generation", - sources = [ - "**/*.scala", - ], - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/common", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:offline_tweet_recommendations_decayed_sum-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:offline_tweet_recommendations_ftr_adhoc-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:offline_tweet_recommendations_ftrat5_pop_biased_1000-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:offline_tweet_recommendations_ftrat5_pop_biased_10000-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_decayed_sum_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_ftr_adhoc_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_ftr_pop10000_rnkdecay11_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_ftr_pop1000_rnkdecay11_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_oon_ftr_adhoc_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:simclusters_oon_ftr_pop1000_rnkdecay_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:bq_generation", - ], -) - -jvm_binary( - name = "ftr-tweet-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.FTRAdhocJob", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "iikf2020-decayed-sum-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.IIKF2020DecayedSumBatchJobProd", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "iikf2020-ftrat5-pop1000-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.IIKF2020FTRAt5Pop1000batchJobProd", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "iikf2020-ftrat5-pop10000-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.IIKF2020FTRAt5Pop10000batchJobProd", - 
dependencies = [ - ":ftr_bq_generation", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_ftr_adhoc", - key_type = "Long", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_decayed_sum", - key_type = "Long", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_ftrat5_pop_biased_1000", - key_type = "Long", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "offline_tweet_recommendations_ftrat5_pop_biased_10000", - key_type = "Long", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.tweetRecommendationsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.CandidateTweetsList", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -jvm_binary( - name = "ftr-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.FTRClusterToTweetIndexGenerationAdhoc", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "oon-ftr-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.OONFTRClusterToTweetIndexGenerationAdhoc", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "ftr-tweet-index-generation-pop1000-rnkdecay11-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.FTRPop1000RankDecay11ClusterToTweetIndexGenerationBatch", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "ftr-tweet-index-generation-pop10000-rnkdecay11-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.FTRPop10000RankDecay11ClusterToTweetIndexGenerationBatch", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "oon-ftr-tweet-index-generation-pop1000-rnkdecay-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.OONFTRPop1000RankDecayClusterToTweetIndexGenerationBatch", - dependencies = [ - ":ftr_bq_generation", - ], -) - -jvm_binary( - name = "ftr-tweet-index-generation-decayed-sum-job", - main = 
"com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet.DecayedSumClusterToTweetIndexGenerationBatch", - dependencies = [ - ":ftr_bq_generation", - ], -) - -create_datasets( - base_name = "simclusters_ftr_adhoc_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_oon_ftr_adhoc_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_ftr_pop1000_rnkdecay11_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_ftr_pop10000_rnkdecay11_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_oon_ftr_pop1000_rnkdecay_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) - -create_datasets( - base_name = "simclusters_decayed_sum_cluster_to_tweet_index", - key_type = "com.twitter.simclusters_v2.thriftscala.FullClusterId", - platform = "java8", - role = "cassowary", - scala_schema = 
"com.twitter.simclusters_v2.hdfs_sources.injections.ClusterTopTweetsInjection.clusterIdToTopKTweetsInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/Config.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/Config.scala deleted file mode 100644 index f1b747c86..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/Config.scala +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation.ftr_tweet - -object Config { - // Variables for MH output path - val FTRRootMHPath: String = "manhattan_sequence_files/ftr_tweet_embedding/" - val FTRAdhocpath: String = "adhoc/ftr_tweet_embedding/" - val IIKFFTRAdhocANNOutputPath: String = "ftr_tweets_test/your_ldap_test" - val IIKFFTRAt5Pop1000ANNOutputPath: String = "ftr_tweets/ftr_at_5_pop_biased_1000" - val IIKFFTRAt5Pop10000ANNOutputPath: String = "ftr_tweets/ftr_at_5_pop_biased_10000" - val IIKFDecayedSumANNOutputPath: String = "ftr_tweets/decayed_sum" - - val DecayedSumClusterToTweetIndexOutputPath = "ftr_cluster_to_tweet/decayed_sum" - - val FTRPop1000RankDecay11ClusterToTweetIndexOutputPath = - "ftr_cluster_to_tweet/ftr_pop1000_rnkdecay11" - val FTRPop10000RankDecay11ClusterToTweetIndexOutputPath = - "ftr_cluster_to_tweet/ftr_pop10000_rnkdecay11" - val OONFTRPop1000RankDecayClusterToTweetIndexOutputPath = - "oon_ftr_cluster_to_tweet/oon_ftr_pop1000_rnkdecay" - - // Variables for tweet embeddings generation - val TweetSampleRate = 1 // 100% sample rate - val EngSampleRate = 1 // engagement from 50% of users - val MinTweetFavs = 8 // min favs for tweets - val MinTweetImps = 50 // min impressions for tweets - val MaxTweetFTR = 0.5 // maximum tweet FTR, a way to combat spammy tweets - val MaxUserLogNImps = 5 // maximum number of impressions 1e5 for users - val MaxUserLogNFavs = 4 // maximum number of favs 1e4 for users - val MaxUserFTR = 0.3 // maximum user FTR, a way to combat accounts that fav everything - - val SimClustersTweetEmbeddingsGenerationHalfLife: Int = 28800000 // 8hrs in ms - val SimClustersTweetEmbeddingsGenerationEmbeddingLength = 15 - - // Variables for BQ ANN - val SimClustersANNTopNClustersPerSourceEmbedding: Int = 20 - val SimClustersANNTopMTweetsPerCluster: Int = 50 - val SimClustersANNTopKTweetsPerUserRequest: Int = 200 - - // Cluster-to-tweet index configs - val clusterTopKTweets: Int = 2000 - val maxTweetAgeHours: Int = 24 - val TweetEmbeddingHalfLife: Int = 28800000 // for usage in DecayedValue struct -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FTRJob.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FTRJob.scala deleted file mode 100644 index a6027d3e4..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FTRJob.scala +++ /dev/null @@ -1,242 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation -package ftr_tweet - -import com.google.api.services.bigquery.model.TimePartitioning -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.DateRangeOptions -import 
com.twitter.conversions.DurationOps.richDurationFromInt -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.scio.bq_generation.common.BQTableDetails -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getInterestedIn2020SQL -import com.twitter.simclusters_v2.thriftscala.CandidateTweets -import com.twitter.simclusters_v2.thriftscala.CandidateTweetsList -import com.twitter.tcdc.bqblaster.beam.syntax._ -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import java.time.Instant -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import com.twitter.simclusters_v2.thriftscala.CandidateTweet -import org.apache.avro.generic.GenericData -import scala.collection.mutable.ListBuffer -import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord -import org.apache.beam.sdk.transforms.SerializableFunction -import org.apache.avro.generic.GenericRecord -import com.twitter.wtf.beam.bq_embedding_export.BQQueryUtils - -trait FTRJob extends ScioBeamJob[DateRangeOptions] { - // Configs to set for different type of embeddings and jobs - val isAdhoc: Boolean - val outputTable: BQTableDetails - val keyValDatasetOutputPath: String - val tweetRecommentationsSnapshotDataset: KeyValDALDataset[KeyVal[Long, CandidateTweetsList]] - val scoreKey: String - val scoreColumn: String - - // Base configs - val projectId = "twttr-recos-ml-prod" - val environment: DAL.Env = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - // The time when the job is scheduled - val queryTimestamp = opts.interval.getEnd - - // Parse tweetId candidates column - def parseTweetIdColumn( - genericRecord: GenericRecord, - columnName: String - ): List[CandidateTweet] = { - val tweetIds: GenericData.Array[GenericRecord] = - genericRecord.get(columnName).asInstanceOf[GenericData.Array[GenericRecord]] - val results: ListBuffer[CandidateTweet] = new ListBuffer[CandidateTweet]() - tweetIds.forEach((sc: GenericRecord) => { - results += CandidateTweet( - tweetId = sc.get("tweetId").toString.toLong, - score = Some(sc.get("cosineSimilarityScore").toString.toDouble) - ) - }) - results.toList - } - - //Function that parses the GenericRecord results we read from BQ - val parseUserToTweetRecommendationsFunc = - new SerializableFunction[SchemaAndRecord, UserToTweetRecommendations] { - override def apply(record: SchemaAndRecord): UserToTweetRecommendations = { - val genericRecord: GenericRecord = record.getRecord - UserToTweetRecommendations( - userId = genericRecord.get("userId").toString.toLong, - tweetCandidates = parseTweetIdColumn(genericRecord, "tweets"), - ) - } - } - - val tweetEmbeddingTemplateVariables = - Map( - "START_TIME" -> queryTimestamp.minusDays(1).toString(), - "END_TIME" -> queryTimestamp.toString(), - "TWEET_SAMPLE_RATE" -> Config.TweetSampleRate.toString, - "ENG_SAMPLE_RATE" -> Config.EngSampleRate.toString, - "MIN_TWEET_FAVS" -> Config.MinTweetFavs.toString, - "MIN_TWEET_IMPS" -> Config.MinTweetImps.toString, - "MAX_TWEET_FTR" -> 
Config.MaxTweetFTR.toString, - "MAX_USER_LOG_N_IMPS" -> Config.MaxUserLogNImps.toString, - "MAX_USER_LOG_N_FAVS" -> Config.MaxUserLogNFavs.toString, - "MAX_USER_FTR" -> Config.MaxUserFTR.toString, - "TWEET_EMBEDDING_LENGTH" -> Config.SimClustersTweetEmbeddingsGenerationEmbeddingLength.toString, - "HALFLIFE" -> Config.SimClustersTweetEmbeddingsGenerationHalfLife.toString, - "SCORE_COLUMN" -> scoreColumn, - "SCORE_KEY" -> scoreKey, - ) - - val tweetEmbeddingSql = BQQueryUtils.getBQQueryFromSqlFile( - "/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/ftr_tweet_embeddings.sql", - tweetEmbeddingTemplateVariables) - val consumerEmbeddingSql = getInterestedIn2020SQL(queryTimestamp, 14) - - val tweetRecommendationsTemplateVariables = - Map( - "CONSUMER_EMBEDDINGS_SQL" -> consumerEmbeddingSql, - "TWEET_EMBEDDINGS_SQL" -> tweetEmbeddingSql, - "TOP_N_CLUSTER_PER_SOURCE_EMBEDDING" -> Config.SimClustersANNTopNClustersPerSourceEmbedding.toString, - "TOP_M_TWEETS_PER_CLUSTER" -> Config.SimClustersANNTopMTweetsPerCluster.toString, - "TOP_K_TWEETS_PER_USER_REQUEST" -> Config.SimClustersANNTopKTweetsPerUserRequest.toString, - ) - val tweetRecommendationsSql = BQQueryUtils.getBQQueryFromSqlFile( - "/com/twitter/simclusters_v2/scio/bq_generation/sql/tweets_ann.sql", - tweetRecommendationsTemplateVariables) - - val tweetRecommendations = sc.customInput( - s"SimClusters FTR BQ ANN", - BigQueryIO - .read(parseUserToTweetRecommendationsFunc) - .fromQuery(tweetRecommendationsSql) - .usingStandardSql() - ) - - //Setup BQ writer - val ingestionTime = opts.getDate().value.getEnd.toDate - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("ingestionTime" -> TypedProjection.fromConstant(ingestionTime)) - val timePartitioning = new TimePartitioning() - .setType("HOUR").setField("ingestionTime").setExpirationMs(3.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[CandidateTweets] - .to(outputTable.toString) - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId(projectId) - .withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) - - // Save Tweet ANN results to BQ - tweetRecommendations - .map { userToTweetRecommendations => - { - CandidateTweets( - targetUserId = userToTweetRecommendations.userId, - recommendedTweets = userToTweetRecommendations.tweetCandidates) - } - } - .saveAsCustomOutput(s"WriteToBQTable - $outputTable", bqWriter) - - val RootMHPath: String = Config.FTRRootMHPath - val AdhocRootPath = Config.FTRAdhocpath - - // Save Tweet ANN results as KeyValSnapshotDataset - tweetRecommendations - .map { userToTweetRecommendations => - KeyVal( - userToTweetRecommendations.userId, - CandidateTweetsList(userToTweetRecommendations.tweetCandidates)) - }.saveAsCustomOutput( - name = "WriteFtrTweetRecommendationsToKeyValDataset", - DAL.writeVersionedKeyVal( - tweetRecommentationsSnapshotDataset, - PathLayout.VersionedPath(prefix = - ((if (!isAdhoc) - RootMHPath - else - AdhocRootPath) - + keyValDatasetOutputPath)), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = environment, - ) - ) - } - -} - -object FTRAdhocJob extends FTRJob { - override val isAdhoc = true - override val outputTable: BQTableDetails = - BQTableDetails("twttr-recos-ml-prod", "simclusters", "offline_tweet_recommendations_ftr_adhoc") - override val keyValDatasetOutputPath = 
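The `*TemplateVariables` maps above pair with `{PLACEHOLDER}` tokens in the bundled `.sql` resources (for example `{SCORE_COLUMN}` and `{SCORE_KEY}` in `ftr_tweet_embeddings.sql`). As a hedged illustration of that wiring only: the snippet below shows a minimal placeholder substitution; the real `BQQueryUtils.getBQQueryFromSqlFile` also loads the SQL from the classpath resource, and its exact behaviour may differ.

```scala
// Minimal sketch of {KEY} -> value substitution, for illustration only.
def renderTemplate(sqlTemplate: String, variables: Map[String, String]): String =
  variables.foldLeft(sqlTemplate) {
    case (sql, (key, value)) => sql.replace(s"{$key}", value)
  }

// e.g. renderTemplate("SELECT {SCORE_KEY} FROM t", Map("SCORE_KEY" -> "ftr_at_5"))
// yields "SELECT ftr_at_5 FROM t".
```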
Config.IIKFFTRAdhocANNOutputPath - - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFtrAdhocScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1" -} - -object IIKF2020DecayedSumBatchJobProd extends FTRJob { - override val isAdhoc = false - override val outputTable: BQTableDetails = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_decayed_sum" - ) - override val keyValDatasetOutputPath = Config.IIKFDecayedSumANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsDecayedSumScalaDataset - override val scoreColumn = "dec_sum_logfavScoreClusterNormalizedOnly_embedding" - override val scoreKey = "dec_sum_logfavScoreClusterNormalizedOnly" -} - -object IIKF2020FTRAt5Pop1000batchJobProd extends FTRJob { - override val isAdhoc = false - override val outputTable: BQTableDetails = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_ftrat5_pop_biased_1000") - override val keyValDatasetOutputPath = Config.IIKFFTRAt5Pop1000ANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFtrat5PopBiased1000ScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1" -} - -object IIKF2020FTRAt5Pop10000batchJobProd extends FTRJob { - override val isAdhoc = false - override val outputTable: BQTableDetails = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_ftrat5_pop_biased_10000") - override val keyValDatasetOutputPath = Config.IIKFFTRAt5Pop10000ANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFtrat5PopBiased10000ScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_10000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_10000_rank_decay_1_1" -} - -case class UserToTweetRecommendations( - userId: Long, - tweetCandidates: List[CandidateTweet]) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FtrClusterToTweetIndexGenerationJob.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FtrClusterToTweetIndexGenerationJob.scala deleted file mode 100644 index d7560be53..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/FtrClusterToTweetIndexGenerationJob.scala +++ /dev/null @@ -1,264 +0,0 @@ -package com.twitter.simclusters_v2 -package scio.bq_generation.ftr_tweet - -import com.google.api.services.bigquery.model.TimePartitioning -import com.twitter.conversions.DurationOps.richDurationFromInt -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.dal.DAL.PathLayout -import com.twitter.simclusters_v2.scio.bq_generation.common.IndexGenerationUtil.parseClusterTopKTweetsFn -import java.time.Instant -import com.twitter.beam.job.DateRangeOptions -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import 
com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.scio.bq_generation.common.BQTableDetails -import com.twitter.simclusters_v2.thriftscala.ClusterIdToTopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.FullClusterId -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.tcdc.bqblaster.beam.syntax._ -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import com.twitter.wtf.beam.bq_embedding_export.BQQueryUtils -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO - -trait FTRClusterToTweetIndexGenerationJob extends ScioBeamJob[DateRangeOptions] { - val isAdhoc: Boolean - - val outputTable: BQTableDetails - val keyValDatasetOutputPath: String - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] - - // Base configs - val projectId = "twttr-recos-ml-prod" - val environment: DAL.Env = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - // Variables for Tweet Embedding SQL - val scoreKey: String - val scoreColumn: String - - // Variables for spam treatment - val maxTweetFTR: Double - val maxUserFTR: Double - - // Tweet embeddings parameters - val tweetEmbeddingsLength: Int = Config.SimClustersTweetEmbeddingsGenerationEmbeddingLength - - // Clusters-to-tweet index parameters - val clusterTopKTweets: Int = Config.clusterTopKTweets - val maxTweetAgeHours: Int = Config.maxTweetAgeHours - - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - // The time when the job is scheduled - val queryTimestamp = opts.interval.getEnd - - val tweetEmbeddingTemplateVariables = - Map( - "START_TIME" -> queryTimestamp.minusDays(1).toString(), - "END_TIME" -> queryTimestamp.toString(), - "TWEET_SAMPLE_RATE" -> Config.TweetSampleRate.toString, - "ENG_SAMPLE_RATE" -> Config.EngSampleRate.toString, - "MIN_TWEET_FAVS" -> Config.MinTweetFavs.toString, - "MIN_TWEET_IMPS" -> Config.MinTweetImps.toString, - "MAX_TWEET_FTR" -> maxTweetFTR.toString, - "MAX_USER_LOG_N_IMPS" -> Config.MaxUserLogNImps.toString, - "MAX_USER_LOG_N_FAVS" -> Config.MaxUserLogNFavs.toString, - "MAX_USER_FTR" -> maxUserFTR.toString, - "TWEET_EMBEDDING_LENGTH" -> Config.SimClustersTweetEmbeddingsGenerationEmbeddingLength.toString, - "HALFLIFE" -> Config.SimClustersTweetEmbeddingsGenerationHalfLife.toString, - "SCORE_COLUMN" -> scoreColumn, - "SCORE_KEY" -> scoreKey, - ) - val tweetEmbeddingSql = BQQueryUtils.getBQQueryFromSqlFile( - "/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/ftr_tweet_embeddings.sql", - tweetEmbeddingTemplateVariables) - - val clusterTopTweetsTemplateVariables = - Map( - "CLUSTER_TOP_K_TWEETS" -> Config.clusterTopKTweets.toString, - "TWEET_EMBEDDING_SQL" -> tweetEmbeddingSql - ) - - val clusterTopTweetsSql = BQQueryUtils.getBQQueryFromSqlFile( - "/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets.sql", - clusterTopTweetsTemplateVariables - ) - - // Generate SimClusters cluster-to-tweet index - val topKtweetsForClusterKey = sc.customInput( - s"SimClusters cluster-to-tweet index generation BQ job", - BigQueryIO - .read(parseClusterTopKTweetsFn(Config.TweetEmbeddingHalfLife)) - .fromQuery(clusterTopTweetsSql) - 
.usingStandardSql() - ) - - // Setup BQ writer - val ingestionTime = opts.getDate().value.getEnd.toDate - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("dateHour" -> TypedProjection.fromConstant(ingestionTime)) - val timePartitioning = new TimePartitioning() - .setType("HOUR").setField("dateHour").setExpirationMs(3.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[ClusterIdToTopKTweetsWithScores] - .to(outputTable.toString) - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId(projectId) - .withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) - - // Save SimClusters index to a BQ table - topKtweetsForClusterKey - .map { clusterIdToTopKTweets => - { - ClusterIdToTopKTweetsWithScores( - clusterId = clusterIdToTopKTweets.clusterId, - topKTweetsWithScores = clusterIdToTopKTweets.topKTweetsWithScores - ) - } - } - .saveAsCustomOutput(s"WriteToBQTable - $outputTable", bqWriter) - - // Save SimClusters index as a KeyValSnapshotDataset - topKtweetsForClusterKey - .map { clusterIdToTopKTweets => - KeyVal(clusterIdToTopKTweets.clusterId, clusterIdToTopKTweets.topKTweetsWithScores) - }.saveAsCustomOutput( - name = s"WriteClusterToKeyIndexToKeyValDataset at $keyValDatasetOutputPath", - DAL.writeVersionedKeyVal( - clusterToTweetIndexSnapshotDataset, - PathLayout.VersionedPath(prefix = - ((if (!isAdhoc) - Config.FTRRootMHPath - else - Config.FTRAdhocpath) - + keyValDatasetOutputPath)), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = environment, - ) - ) - } -} - -object FTRClusterToTweetIndexGenerationAdhoc extends FTRClusterToTweetIndexGenerationJob { - override val isAdhoc: Boolean = true - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simcluster_adhoc_test_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - "ftr_tweets_adhoc/ftr_cluster_to_tweet_adhoc" - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersFtrAdhocClusterToTweetIndexScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1" - override val maxUserFTR: Double = Config.MaxUserFTR - override val maxTweetFTR: Double = Config.MaxTweetFTR - -} - -object OONFTRClusterToTweetIndexGenerationAdhoc extends FTRClusterToTweetIndexGenerationJob { - override val isAdhoc: Boolean = true - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simcluster_adhoc_test_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - "oon_ftr_tweets_adhoc/oon_ftr_cluster_to_tweet_adhoc" - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersOonFtrAdhocClusterToTweetIndexScalaDataset - override val scoreColumn = "oon_ftrat5_decayed_pop_bias_1000_rank_decay_embedding" - override val scoreKey = "oon_ftrat5_decayed_pop_bias_1000_rank_decay" - override val maxUserFTR: Double = Config.MaxUserFTR - override val maxTweetFTR: Double = Config.MaxTweetFTR -} - -object FTRPop1000RankDecay11ClusterToTweetIndexGenerationBatch - extends FTRClusterToTweetIndexGenerationJob { - override val 
isAdhoc: Boolean = false - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_ftr_pop1000_rnkdecay11_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - Config.FTRPop1000RankDecay11ClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersFtrPop1000Rnkdecay11ClusterToTweetIndexScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_1000_rank_decay_1_1" - override val maxUserFTR: Double = Config.MaxUserFTR - override val maxTweetFTR: Double = Config.MaxTweetFTR -} - -object FTRPop10000RankDecay11ClusterToTweetIndexGenerationBatch - extends FTRClusterToTweetIndexGenerationJob { - override val isAdhoc: Boolean = false - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_ftr_pop10000_rnkdecay11_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - Config.FTRPop10000RankDecay11ClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersFtrPop10000Rnkdecay11ClusterToTweetIndexScalaDataset - override val scoreColumn = "ftrat5_decayed_pop_bias_10000_rank_decay_1_1_embedding" - override val scoreKey = "ftrat5_decayed_pop_bias_10000_rank_decay_1_1" - override val maxUserFTR: Double = Config.MaxUserFTR - override val maxTweetFTR: Double = Config.MaxTweetFTR -} - -object OONFTRPop1000RankDecayClusterToTweetIndexGenerationBatch - extends FTRClusterToTweetIndexGenerationJob { - override val isAdhoc: Boolean = false - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_oon_ftr_pop1000_rnkdecay_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - Config.OONFTRPop1000RankDecayClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersOonFtrPop1000RnkdecayClusterToTweetIndexScalaDataset - override val scoreColumn = "oon_ftrat5_decayed_pop_bias_1000_rank_decay_embedding" - override val scoreKey = "oon_ftrat5_decayed_pop_bias_1000_rank_decay" - override val maxUserFTR: Double = Config.MaxUserFTR - override val maxTweetFTR: Double = Config.MaxTweetFTR -} - -object DecayedSumClusterToTweetIndexGenerationBatch extends FTRClusterToTweetIndexGenerationJob { - override val isAdhoc: Boolean = false - override val outputTable: BQTableDetails = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_decayed_sum_cluster_to_tweet_index") - override val keyValDatasetOutputPath: String = - Config.DecayedSumClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = SimclustersDecayedSumClusterToTweetIndexScalaDataset - override val scoreColumn = "dec_sum_logfavScoreClusterNormalizedOnly_embedding" - override val scoreKey = "dec_sum_logfavScoreClusterNormalizedOnly" - override val maxUserFTR = 1.0 - override val maxTweetFTR = 1.0 -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/README.md b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/README.md deleted file mode 100644 index 4d9e7d081..000000000 --- 
a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/README.md +++ /dev/null @@ -1,212 +0,0 @@ -# FTR Tweet embeddings - -export GCP_PROJECT_NAME='twttr-recos-ml-prod' - -## Running Adhoc jobs -### Base ftrat5 -``` -rm dist/ftr-tweet-adhoc-job-bundle/ftr-tweet-adhoc-job.jar -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-adhoc-job && \ -bin/d6w create \ -${GCP_PROJECT_NAME}/us-central1/ftr-tweets-ann-adhoc-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-tweets-ann-adhoc-job.d6w \ ---jar dist/ftr-tweet-adhoc-job-bundle/ftr-tweet-adhoc-job.jar \ ---bind=profile.project= -${GCP_PROJECT_NAME} \ ---bind=profile.user_name=your_ldap \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-adhoc-job" \ ---bind=profile.date="2022-08-26T12" \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="ftr-tweets-ann-adhoc-job" --ignore-existing -``` -### ClusterToTweet Index with base ftrat5 -``` -export GCP_PROJECT_NAME='twttr-recos-ml-prod' - -rm dist/ftr-tweet-index-generation-adhoc-job-bundle/ftr-tweet-index-generation-adhoc-job.jar -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-adhoc-job && \ -bin/d6w create \ -${GCP_PROJECT_NAME}/us-central1/ftr-tweet-index-generation-adhoc-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---jar dist/ftr-tweet-index-generation-adhoc-job-bundle/ftr-tweet-index-generation-adhoc-job.jar \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=your_ldap \ ---bind=profile.date="2022-08-27T12" \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-adhoc-job" \ ---bind=profile.job_name="ftr-tweet-index-generation-adhoc-job" --ignore-existing -``` - -### OON ftrat5 -``` -rm dist/oon-ftr-tweet-index-generation-adhoc-job-bundle/oon-ftr-tweet-index-generation-adhoc-job.jar -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:oon-ftr-tweet-index-generation-adhoc-job && \ -bin/d6w create \ -${GCP_PROJECT_NAME}/us-central1/oon-ftr-ann-adhoc-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---jar dist/oon-ftr-tweet-index-generation-adhoc-job-bundle/oon-ftr-tweet-index-generation-adhoc-job.jar \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${USER} \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:oon-ftr-tweet-index-generation-adhoc-job" \ ---bind=profile.date="2022-09-21T12" \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="oon-ftr-ann-adhoc-job" --ignore-existing -``` - - -## Scheduling jobs -### decayed_sum_job -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-07-24T16' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/iikf2020-decayed-sum-ann-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-decayed-sum-ann-batch-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-highmem-4" \ ---bind=profile.job_name="iikf2020-decayed-sum-ann-batch-job" \ ---bind=profile.date=${PROJECT_DATE} 
\ ---bind=profile.environment=prod -``` - -### ftrat5 pop1000 - -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-07-24T17' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/iikf2020-ftrat5-pop1000-ann-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop1000-ann-batch-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-highmem-4" \ ---bind=profile.job_name="iikf2020-ftrat5-pop1000-ann-batch-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - - -### ftrat5 pop10000 -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-07-24T18' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/iikf2020-ftrat5-pop10000-ann-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop10000-ann-batch-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-highmem-4" \ ---bind=profile.job_name="iikf2020-ftrat5-pop10000-ann-batch-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - -### Deschedule -``` -export SERVICE_ACCOUNT='cassowary' - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-decayed-sum-ann-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-ftrat5-pop1000-ann-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-ftrat5-pop10000-ann-batch-job - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-decayed-sum-ann-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-ftrat5-pop1000-ann-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-iikf2020-ftrat5-pop10000-ann-batch-job -``` - -### pop1000-rnkdecay11 -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-08-27T16' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/ftr-pop1000-rnkdecay11-tweet-index-generation-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="ftr-pop1000-rnkdecay11-tweet-index-generation-batch-job" \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-pop1000-rnkdecay11-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - -### pop10000-rnkdecay11 -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-08-27T16' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/ftr-pop10000-rnkdecay11-tweet-index-generation-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="ftr-pop10000-rnkdecay11-tweet-index-generation-batch-job" \ 
---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-pop10000-rnkdecay11-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - -### decayed_sum -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-09-05T16' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/decayed-sum-tweet-index-generation-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="decayed-sum-tweet-index-generation-batch-job" \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:ftr-tweet-index-generation-decayed-sum-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - - -### OON ftrat5 -``` -export SERVICE_ACCOUNT='cassowary' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' -export PROJECT_DATE='2022-09-21T16' - -bin/d6w schedule \ -${GCP_PROJECT_NAME}/us-central1/oon-ftr-pop1000-rnkdecay-tweet-index-generation-batch-job \ -src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w \ ---bind=profile.project=${GCP_PROJECT_NAME} \ ---bind=profile.user_name=${SERVICE_ACCOUNT} \ ---bind=profile.machine="n2-standard-2" \ ---bind=profile.job_name="oon-ftr-pop1000-rnkdecay-tweet-index-generation-batch-job" \ ---bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:oon-ftr-tweet-index-generation-pop1000-rnkdecay-job" \ ---bind=profile.date=${PROJECT_DATE} \ ---bind=profile.environment=prod -``` - -### Deschedule -``` -export SERVICE_ACCOUNT='cassowary' - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-ftr-pop1000-rnkdecay11-tweet-index-generation-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-ftr-pop1000-rnkdecay11-tweet-index-generation-batch-job - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-ftr-pop10000-rnkdecay11-tweet-index-generation-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-ftr-pop10000-rnkdecay11-tweet-index-generation-batch-job - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-decayed-sum-tweet-index-generation-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-decayed-sum-tweet-index-generation-batch-job - -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-oon-ftr-pop1000-rnkdecay-tweet-index-generation-batch-job -aurora cron deschedule atla/${SERVICE_ACCOUNT}/prod/twttr-recos-ml-prod-us-central1-oon-ftr-pop1000-rnkdecay-tweet-index-generation-batch-job -``` diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w deleted file mode 100644 index 39b2f16bf..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-based-simclusters-index-generation-job.d6w +++ /dev/null @@ -1,44 +0,0 @@ -class Profile(Struct): - project = Default(String, 'twttr-recos-ml-prod') - date = 
Required(String) - build_target = Required(String) - job_name = Required(String) - environment = Default(String, 'dev') - machine = Default(String, 'n2-standard-2') - -SimClustersIndexGenerationJob = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='{{profile.environment}}', - build_target='{{profile.build_target}}', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT2H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT4H', - parallelism=1 - ) -) - -jobs=[SimClustersIndexGenerationJob.bind(profile=Profile())] - - - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-tweets-ann-adhoc-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-tweets-ann-adhoc-job.d6w deleted file mode 100644 index cb9afafca..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/ftr-tweets-ann-adhoc-job.d6w +++ /dev/null @@ -1,36 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - build_target = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'ftr-recs-d6w-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='{{profile.build_target}}', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - timeout='PT8H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-decayed-sum-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-decayed-sum-ann-batch-job.d6w deleted file mode 100644 index 759d8e0c2..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-decayed-sum-ann-batch-job.d6w +++ /dev/null @@ -1,35 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'ftr-recs-d6w-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - 
service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:iikf2020-decayed-sum-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - timeout='PT8H', - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop1000-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop1000-ann-batch-job.d6w deleted file mode 100644 index 7c7001400..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop1000-ann-batch-job.d6w +++ /dev/null @@ -1,35 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'ftr-recs-d6w-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:iikf2020-ftrat5-pop1000-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - timeout='PT8H', - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop10000-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop10000-ann-batch-job.d6w deleted file mode 100644 index 24c594aa3..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/iikf2020-ftrat5-pop10000-ann-batch-job.d6w +++ /dev/null @@ -1,35 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'ftr-recs-d6w-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - 
service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet:iikf2020-ftrat5-pop10000-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - timeout='PT8H', - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/BUILD deleted file mode 100644 index ba87e2b54..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/BUILD +++ /dev/null @@ -1,3 +0,0 @@ -resources( - sources = ["*"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/ftr_tweet_embeddings.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/ftr_tweet_embeddings.sql deleted file mode 100644 index fe52dfedb..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/ftr_tweet/sql/ftr_tweet_embeddings.sql +++ /dev/null @@ -1,280 +0,0 @@ -WITH vars AS ( - SELECT - TIMESTAMP('{START_TIME}') AS start_time, - TIMESTAMP('{END_TIME}') AS end_time, - UNIX_MILLIS('{END_TIME}') AS currentTs, - {HALFLIFE} AS halfLife, - {TWEET_SAMPLE_RATE} AS tweet_sample_rate, - {ENG_SAMPLE_RATE} AS eng_user_sample_rate, - {MIN_TWEET_FAVS} AS min_tweet_favs, - {MIN_TWEET_IMPS} AS min_tweet_imps, - {MAX_USER_LOG_N_IMPS} AS max_user_log_n_imps, - {MAX_USER_LOG_N_FAVS} AS max_user_log_n_favs, - {MAX_USER_FTR} AS max_user_ftr, - {MAX_TWEET_FTR} AS max_tweet_ftr, - 700 AS MAX_EXPONENT, -- this is the maximum exponent one can have in bigquery - ), - -- step 1: get impressions and favs - impressions AS ( - SELECT - userIdentifier.userId AS user_id, - item.tweetInfo.actionTweetId AS tweet_id, - item.tweetInfo.actionTweetAuthorInfo.authorId AS author_id, - TRUE AS impressed, - MIN(eventMetadata.sourceTimestampMs) AS minTsMilli - FROM twttr-bql-unified-prod.unified_user_actions.streaming_unified_user_actions, vars - WHERE - actionType = "ClientTweetLingerImpression" - AND DATE(dateHour) BETWEEN DATE(vars.start_time) AND DATE(vars.end_time) - AND TIMESTAMP_MILLIS(eventMetadata.sourceTimestampMs) BETWEEN vars.start_time AND vars.end_time - AND MOD(ABS(farm_fingerprint(item.tweetInfo.actionTweetId || '')), vars.tweet_sample_rate) = 0 - AND MOD(ABS(farm_fingerprint(userIdentifier.userId || '')), vars.eng_user_sample_rate) = 0 - -- Apply tweet age filter here - AND timestamp_millis((1288834974657 + - ((item.tweetInfo.actionTweetId & 9223372036850581504) >> 22))) >= (vars.start_time) - GROUP BY 1, 2, 3 - ), - favs AS ( - SELECT - userIdentifier.userId AS user_id, - item.tweetInfo.actionTweetId AS tweet_id, - item.tweetInfo.actionTweetAuthorInfo.authorId AS author_id, - MIN(eventMetadata.sourceTimestampMs) AS minTsMilli, - -- get last action, and make sure that it's a fav - ARRAY_AGG(actionType ORDER BY eventMetadata.sourceTimestampMs DESC LIMIT 1)[OFFSET(0)] = "ServerTweetFav" AS favorited, - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE - actionType IN ("ServerTweetFav", "ServerTweetUnfav") - AND DATE(dateHour) BETWEEN DATE(vars.start_time) AND DATE(vars.end_time) - AND 
TIMESTAMP_MILLIS(eventMetadata.sourceTimestampMs) BETWEEN vars.start_time AND vars.end_time - AND MOD(ABS(farm_fingerprint(item.tweetInfo.actionTweetId || '')), vars.tweet_sample_rate) = 0 - AND MOD(ABS(farm_fingerprint(userIdentifier.userId || '')), vars.eng_user_sample_rate) = 0 - -- Apply tweet age filter here - AND timestamp_millis((1288834974657 + - ((item.tweetInfo.actionTweetId & 9223372036850581504) >> 22))) >= (vars.start_time) - GROUP BY 1, 2, 3 - HAVING favorited - ), - eng_data AS ( - SELECT - user_id, tweet_id, author_id, impressions.minTsMilli, favorited, impressed - FROM impressions - LEFT JOIN favs USING(user_id, tweet_id, author_id) - ), - eligible_tweets AS ( - SELECT - tweet_id, - author_id, - COUNTIF(favorited) num_favs, - COUNTIF(impressed) num_imps, - COUNTIF(favorited) * 1.0 / COUNTIF(impressed) AS tweet_ftr, - ANY_VALUE(vars.min_tweet_favs) min_tweet_favs, - ANY_VALUE(vars.min_tweet_imps) min_tweet_imps, - ANY_VALUE(vars.max_tweet_ftr) max_tweet_ftr, - FROM eng_data, vars - GROUP BY 1, 2 - HAVING num_favs >= min_tweet_favs -- this is an aggressive filter to make the workflow efficient - AND num_imps >= min_tweet_imps - AND tweet_ftr <= max_tweet_ftr -- filter to combat spam - ), - eligible_users AS ( - SELECT - user_id, - CAST(LOG10(COUNTIF(impressed) + 1) AS INT64) log_n_imps, - CAST(LOG10(COUNTIF(favorited) + 1) AS INT64) log_n_favs, - ANY_VALUE(vars.max_user_log_n_imps) max_user_log_n_imps, - ANY_VALUE(vars.max_user_log_n_favs) max_user_log_n_favs, - ANY_VALUE(vars.max_user_ftr) max_user_ftr, - COUNTIF(favorited) * 1.0 / COUNTIF(impressed) user_ftr - from eng_data, vars - GROUP BY 1 - HAVING - log_n_imps < max_user_log_n_imps - AND log_n_favs < max_user_log_n_favs - AND user_ftr < max_user_ftr - ), - eligible_eng_data AS ( - SELECT - user_id, - eng_data.author_id, - tweet_id, - minTsMilli, - favorited, - impressed - FROM eng_data - INNER JOIN eligible_tweets USING(tweet_id) - INNER JOIN eligible_users USING(user_id) - ), - follow_graph AS ( - SELECT userId, neighbor - FROM `twttr-bq-cassowary-prod.user.user_user_normalized_graph` user_user_graph, unnest(user_user_graph.neighbors) as neighbor - WHERE DATE(_PARTITIONTIME) = - ( -- Get latest partition time - SELECT MAX(DATE(_PARTITIONTIME)) latest_partition - FROM `twttr-bq-cassowary-prod.user.user_user_normalized_graph`, vars - WHERE Date(_PARTITIONTIME) BETWEEN - DATE_SUB(Date(vars.end_time), - INTERVAL 14 DAY) AND DATE(vars.end_time) - ) - AND neighbor.isFollowed is True - ), - extended_eligible_eng_data AS ( - SELECT - user_id, - tweet_id, - minTsMilli, - favorited, - impressed, - neighbor.neighborId is NULL as is_oon_eng - FROM eligible_eng_data left JOIN follow_graph ON (follow_graph.userId = eligible_eng_data.user_id AND follow_graph.neighbor.neighborId = eligible_eng_data.author_id) - ), - -- step 2: merge with iikf - iikf AS ( - SELECT - userId AS user_id, - - clusterIdToScore.key AS clusterId, - clusterIdToScore.value.favScore AS favScore, - clusterIdToScore.value.favScoreClusterNormalizedOnly AS favScoreClusterNormalizedOnly, - clusterIdToScore.value.favScoreProducerNormalizedOnly AS favScoreProducerNormalizedOnly, - - clusterIdToScore.value.logFavScore AS logFavScore, - clusterIdToScore.value.logfavScoreClusterNormalizedOnly AS logfavScoreClusterNormalizedOnly, -- probably no need for cluster normalization anymore - ROW_NUMBER() OVER (PARTITION BY userId ORDER BY clusterIdToScore.value.logFavScore DESC) AS uii_cluster_rank_logfavscore, - ROW_NUMBER() OVER (PARTITION BY userId ORDER BY 
clusterIdToScore.value.logfavScoreClusterNormalizedOnly DESC) AS uii_cluster_rank_logfavscoreclusternormalized, - FROM `twttr-bq-cassowary-prod.user.simclusters_v2_user_to_interested_in_20M_145K_2020`, UNNEST(clusterIdToScores) clusterIdToScore, vars - WHERE DATE(_PARTITIONTIME) = - (-- Get latest partition time - SELECT MAX(DATE(_PARTITIONTIME)) latest_partition - FROM `twttr-bq-cassowary-prod.user.simclusters_v2_user_to_interested_in_20M_145K_2020` - WHERE Date(_PARTITIONTIME) BETWEEN - DATE_SUB(Date(vars.end_time), - INTERVAL 14 DAY) AND DATE(vars.end_time) - ) - AND MOD(ABS(farm_fingerprint(userId || '')), vars.eng_user_sample_rate) = 0 - AND clusterIdToScore.value.logFavScore != 0 - ), - eng_w_uii AS ( - SELECT - T_IMP_FAV.user_id, - T_IMP_FAV.tweet_id, - T_IMP_FAV.impressed, - T_IMP_FAV.favorited, - T_IMP_FAV.minTsMilli, - T_IMP_FAV.is_oon_eng, - - IIKF.clusterId, - IIKF.logFavScore, - IIKF.logfavScoreClusterNormalizedOnly, - IIKF.uii_cluster_rank_logfavscore, - IIKF.uii_cluster_rank_logfavscoreclusternormalized, - FROM extended_eligible_eng_data T_IMP_FAV, vars - INNER JOIN iikf - ON T_IMP_FAV.user_id = IIKF.user_id - WHERE - T_IMP_FAV.impressed - ), - -- step 3: Calculate tweet embedding - tweet_cluster_agg AS ( - SELECT - tweet_id, - clusterId, - - SUM(IF(impressed, logFavScore, 0)) denom_logFavScore, - SUM(IF(favorited, logFavScore, 0)) nom_logFavScore, - - COUNTIF(impressed) n_imps, - COUNTIF(favorited) n_favs, - - COUNTIF(impressed AND uii_cluster_rank_logfavscore <= 5) n_imps_at_5, - COUNTIF(favorited AND uii_cluster_rank_logfavscore <= 5) n_favs_at_5, - - COUNTIF(favorited AND uii_cluster_rank_logfavscore <= 5 AND is_oon_eng) n_oon_favs_at_5, - COUNTIF(impressed AND uii_cluster_rank_logfavscore <= 5 AND is_oon_eng) n_oon_imps_at_5, - - SUM(IF(favorited AND uii_cluster_rank_logfavscore <= 5, 1, 0) * POW(0.5, (currentTs - minTsMilli) / vars.halfLife)) AS decayed_n_favs_at_5, - SUM(IF(impressed AND uii_cluster_rank_logfavscore <= 5, 1, 0) * POW(0.5, (currentTs - minTsMilli) / vars.halfLife)) AS decayed_n_imps_at_5, - - SUM(IF(favorited, logfavScoreClusterNormalizedOnly, 0) * POW(0.5, (currentTs - minTsMilli) / vars.halfLife)) AS dec_sum_logfavScoreClusterNormalizedOnly, - - MIN(minTsMilli) minTsMilli, - - FROM eng_w_uii, vars - GROUP BY 1, 2 - ), - tweet_cluster_intermediate AS ( - SELECT - tweet_id, - clusterId, - minTsMilli, - - n_imps, - n_favs, - - n_favs_at_5, - n_imps_at_5, - n_oon_favs_at_5, - n_oon_imps_at_5, - decayed_n_favs_at_5, - decayed_n_imps_at_5, - - denom_logFavScore, - nom_logFavScore, - - dec_sum_logfavScoreClusterNormalizedOnly, - - SAFE_DIVIDE(n_favs_at_5, n_imps_at_5) AS ftr_at_5, - - SAFE_DIVIDE(n_oon_favs_at_5, n_oon_imps_at_5) AS ftr_oon_at_5, - - row_number() OVER (PARTITION BY tweet_id ORDER BY nom_logFavScore DESC) cluster_nom_logFavScore_ranking, - row_number() OVER (PARTITION BY tweet_id ORDER BY dec_sum_logfavScoreClusterNormalizedOnly DESC) cluster_decSumLogFavClusterNormalized_ranking, - FROM tweet_cluster_agg - ), - tweet_e AS ( - SELECT - tweet_id, - - MIN(minTsMilli) first_serve_millis, - DATE(TIMESTAMP_MILLIS(MIN(minTsMilli))) date_first_serve, - - ARRAY_AGG(STRUCT( - clusterId, - -- the division by MAX_EXPONENT is to avoid overflow operation - ftr_at_5 * (2 / (1+EXP(-1* (decayed_n_favs_at_5/1000))) - 1) * IF(cluster_decSumLogFavClusterNormalized_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_decSumLogFavClusterNormalized_ranking-1))) AS ftrat5_decayed_pop_bias_1000_rank_decay_1_1 - ) ORDER BY ftr_at_5 * (2 / (1+EXP(-1* 
(decayed_n_favs_at_5/1000))) - 1) * IF(cluster_decSumLogFavClusterNormalized_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_decSumLogFavClusterNormalized_ranking-1))) DESC LIMIT {TWEET_EMBEDDING_LENGTH}) ftrat5_decayed_pop_bias_1000_rank_decay_1_1_embedding, - - ARRAY_AGG(STRUCT( - clusterId, - -- the division by MAX_EXPONENT is to avoid overflow operation - ftr_at_5 * (2 / (1+EXP(-1* (decayed_n_favs_at_5/10000))) - 1) * IF(cluster_decSumLogFavClusterNormalized_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_decSumLogFavClusterNormalized_ranking-1))) AS ftrat5_decayed_pop_bias_10000_rank_decay_1_1 - ) ORDER BY ftr_at_5 * (2 / (1+EXP(-1* (decayed_n_favs_at_5/1000))) - 1) * IF(cluster_decSumLogFavClusterNormalized_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_decSumLogFavClusterNormalized_ranking-1))) DESC LIMIT {TWEET_EMBEDDING_LENGTH}) ftrat5_decayed_pop_bias_10000_rank_decay_1_1_embedding, - - ARRAY_AGG(STRUCT( - clusterId, - -- the division by MAX_EXPONENT is to avoid overflow operation - ftr_oon_at_5 * (2 / (1+EXP(-1* (decayed_n_favs_at_5/1000))) - 1) * IF(cluster_nom_logFavScore_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_nom_logFavScore_ranking-1))) AS oon_ftrat5_decayed_pop_bias_1000_rank_decay - ) ORDER BY ftr_oon_at_5 * (2 / (1+EXP(-1* (decayed_n_favs_at_5/1000))) - 1) * IF(cluster_nom_logFavScore_ranking > MAX_EXPONENT, 0, 1.0/(POW(1.1, cluster_nom_logFavScore_ranking-1))) DESC LIMIT {TWEET_EMBEDDING_LENGTH}) oon_ftrat5_decayed_pop_bias_1000_rank_decay_embedding, - - ARRAY_AGG(STRUCT( - clusterId, - dec_sum_logfavScoreClusterNormalizedOnly - ) ORDER BY dec_sum_logfavScoreClusterNormalizedOnly DESC LIMIT {TWEET_EMBEDDING_LENGTH}) dec_sum_logfavScoreClusterNormalizedOnly_embedding, - - FROM tweet_cluster_intermediate, vars - GROUP BY 1 - ), - tweet_e_unnest AS ( - SELECT - tweet_id AS tweetId, - clusterToScores.clusterId AS clusterId, - clusterToScores.{SCORE_KEY} tweetScore - FROM tweet_e, UNNEST({SCORE_COLUMN}) clusterToScores - WHERE clusterToScores.{SCORE_KEY} IS NOT NULL - AND clusterToScores.{SCORE_KEY} > 0 - ) - SELECT - tweetId, - clusterId, - tweetScore - FROM tweet_e_unnest diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/BUILD deleted file mode 100644 index 681483b95..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/BUILD +++ /dev/null @@ -1,167 +0,0 @@ -scala_library( - name = "simclusters_index_generation", - sources = [ - "**/*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources:ads_fav_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:ads_fav_click_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:fav_based_evergreen_content_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:fav_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:fav_based_video_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:push_open_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:reply_based_simclusters_cluster_to_tweet_index-scala", - 
"src/scala/com/twitter/simclusters_v2/hdfs_sources:retweet_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:video_view_based_simclusters_cluster_to_tweet_index-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/common", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:bq_generation", - "unified_user_actions/thrift/src/main/thrift/com/twitter/unified_user_actions:unified_user_actions-scala", - ], -) - -jvm_binary( - name = "fav-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "fav-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "video-view-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.VideoViewBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "video-view-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.VideoViewBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "retweet-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.RetweetBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "retweet-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.RetweetBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "reply-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.ReplyBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "reply-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.ReplyBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "push-open-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.PushOpenBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "push-open-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.PushOpenBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "ads-fav-based-cluster-to-tweet-index-generation-adhoc-job", - main = 
"com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.AdsFavBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "ads-fav-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.AdsFavBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "ads-fav-click-based-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.AdsFavClickBasedClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "ads-fav-click-based-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.AdsFavClickBasedClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "fav-based-evergreen-content-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedEvergreenContentClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "fav-based-evergreen-content-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedEvergreenContentClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "fav-based-video-cluster-to-tweet-index-generation-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedVideoClusterToTweetIndexGenerationAdhocJob", - dependencies = [ - ":simclusters_index_generation", - ], -) - -jvm_binary( - name = "fav-based-video-cluster-to-tweet-index-generation-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.FavBasedVideoClusterToTweetIndexGenerationBatchJob", - dependencies = [ - ":simclusters_index_generation", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/Config.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/Config.scala deleted file mode 100644 index 5f44986e9..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/Config.scala +++ /dev/null @@ -1,82 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation - -object Config { - // Common Root Path - val RootMHPath: String = "manhattan_sequence_files/simclusters_to_tweet_index/" - val RootThriftPath: String = "processed/simclusters_to_tweet_index/" - val AdhocRootPath = "adhoc/simclusters_to_tweet_index/" - // cluster-to-tweet KeyVal Dataset Output Path - val FavBasedClusterToTweetIndexOutputPath = "fav_based_index" - val FavBasedEvergreenContentClusterToTweetIndexOutputPath = "fav_based_evergreen_index" - val FavBasedVideoClusterToTweetIndexOutputPath = "fav_based_video_index" - val VideoViewBasedClusterToTweetIndexOutputPath = "video_view_based_index" - val RetweetBasedClusterToTweetIndexOutputPath = "retweet_based_index" - val ReplyBasedClusterToTweetIndexOutputPath = "reply_based_index" - val PushOpenBasedClusterToTweetIndexOutputPath = 
"push_open_based_index" - val AdsFavBasedClusterToTweetIndexOutputPath = "ads_fav_based_index" - val AdsFavClickBasedClusterToTweetIndexOutputPath = "ads_fav_click_based_index" - - // SQL file path - val simclustersEngagementBasedIndexGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/engagement_based_index_generation.sql" - val unifiedUserTweetActionPairGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/unified_user_tweet_action_pair_generation.sql" - val combinedUserTweetActionPairGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/combined_user_tweet_action_pair_generation.sql" - val adsUserTweetActionPairGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/ads_user_tweet_action_pair_generation.sql" - val evergreenContentUserTweetActionPairGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/evergreen_content_user_tweet_action_pair_generation.sql" - val favBasedVideoTweetActionPairGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/user_video_tweet_fav_engagement_generation.sql" - - // Table name for server/client engagements - val clientEngagementTableName: String = "twttr-bq-iesource-prod.user.client_engagements" - val serverEngagementTableName: String = "twttr-bq-iesource-prod.user.server_engagements" - - // Tweet id column names from UUA - val actionTweetIdColumn: String = "item.tweetInfo.actionTweetId" - val retweetTweetIdColumn: String = "item.tweetInfo.retweetedTweetId" - val replyTweetIdColumn: String = "item.tweetInfo.inReplyToTweetId" - val pushTweetIdColumn: String = "item.notificationInfo.content.tweetNotification.tweetId" - - // Do not enable health or video filters by default - val enableHealthAndVideoFilters: Boolean = false - - // Do not enable top k tweets per cluster intersection with fav-based clusters - val enableIntersectionWithFavBasedClusterTopKTweetsIndex: Boolean = false - - // Min fav/interaction threshold - val minInteractionCount: Int = 50 - val minFavCount: Int = 50 - - // Tweet Embeddings configs - val tweetEmbeddingsLength: Int = 50 - val tweetEmbeddingsHalfLife: Int = 28800000 - - // Cluster-to-tweet index configs - val clusterTopKTweets: Int = 2000 - val maxTweetAgeHours: Int = 24 - val minEngagementPerCluster: Int = 0 - - // Placeholder action type for interactions that don't have undo events (e.g. 
video views) - val PlaceholderActionType: String = "PLACEHOLDER_ACTION_TYPE" - - // Ads event engagement type ids - val AdsFavEngagementTypeIds = Seq(8) // Fav promoted tweet - val AdsClickEngagementTypeIds = Seq( - 1, //URL - 42, // CARD_URL_CLICK - 53, // WEBSITE_CARD_CONTAINER_CLICK - 54, // WEBSITE_CARD_BUTTON_CLICK - 55, // WEBSITE_CARD_IMAGE_CLICK - 56, // WEBSITE_CARD_TITLE_CLICK - 69, // BUYNOW_CARD_CLICK - 70, // BUYNOW_PURCHASE_SUCCESS - 72, // VIDEO_CTA_URL_CLICK - 76, // VIDEO_AD_CTA_URL_CLICK - 80, // VIDEO_CONTENT_CTA_URL_CLICK - 84, // CL_OFFER_CARD_CLICK - ) - -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexFromBQ.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexFromBQ.scala deleted file mode 100644 index 93d6c9ee7..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexFromBQ.scala +++ /dev/null @@ -1,177 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation -package simclusters_index_generation - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getNSFWTweetIdDenylistSQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getTweetIdWithFavCountSQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getTweetIdWithMediaAndNSFWAuthorFilterSQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getUserTweetEngagementEventPairSQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.generateClusterTopTweetIntersectionWithFavBasedIndexSQL -import com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.Config.simclustersEngagementBasedIndexGenerationSQLPath -import com.twitter.simclusters_v2.scio.bq_generation.common.IndexGenerationUtil.TopKTweetsForClusterKey -import com.twitter.simclusters_v2.scio.bq_generation.common.IndexGenerationUtil.parseClusterTopKTweetsFn -import com.twitter.wtf.beam.bq_embedding_export.BQQueryUtils -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.joda.time.DateTime - -object EngagementEventBasedClusterToTweetIndexFromBQ { - - /* - * Reads the user-tweet-interaction table and apply tweet fav count filter - * Returns the post processed table results in SQL string format - * -* Input: - * - startTime: DateTime - * The earliest timestamp from the user-tweet-interaction table - * - endTime: DateTime - * The latest timestamp from the user-tweet-interaction table - * - minFavCount: Int - * Whether we want to enable tweet fav count filters - * -* Return: - * String - Post processed table results in SQL string format - */ - def getTweetInteractionTableWithFavCountFilter( - startTime: DateTime, - endTime: DateTime, - minFavCount: Int - ): String = { - if (minFavCount > 0) { - val tweetFavCountSQL = getTweetIdWithFavCountSQL(startTime, endTime) - s""" - | WITH tweet_fav_count AS (${tweetFavCountSQL}) - | SELECT userId, tweetId, tsMillis - | FROM user_tweet_interaction_with_min_interaction_count_filter - | JOIN tweet_fav_count - | USING(tweetId) - | WHERE tweet_fav_count.favCount >= ${minFavCount} - |""".stripMargin - } else { - // Directly read from the table without applying any filters - s"SELECT userId, tweetId, tsMillis FROM 
user_tweet_interaction_with_min_interaction_count_filter" - } - } - - /* - * Reads the user-tweet-interaction table and apply health and video filters if specified. - * Returns the post processed table results in SQL string format - * - * Input: - * - tableName: String - * Schema of the table - * userId: Long - * tweetId: Long - * tsMillis: Long - * - startTime: DateTime - * The earliest timestamp from the user-tweet-interaction table - * - endTime: DateTime - * The latest timestamp from the user-tweet-interaction table - * - enableHealthAndVideoFilters: Boolean - * Whether we want to enable health filters and video only filters - * - * Return: - * String - Post processed table results in SQL string format - */ - def getTweetInteractionTableWithHealthFilter( - startTime: DateTime, - endTime: DateTime, - enableHealthAndVideoFilters: Boolean, - ): String = { - if (enableHealthAndVideoFilters) { - // Get SQL for tweets with media and NSFW filter - val tweetWithMediaAndNSFWAuthorFilterSQL = getTweetIdWithMediaAndNSFWAuthorFilterSQL( - startTime, - endTime, - filterMediaType = Some(3), // VideoTweets MediaType = 3 - filterNSFWAuthor = true - ) - // Get SQL for NSFW tweet id deny list - val nsfwTweetDenylistSQL = getNSFWTweetIdDenylistSQL(startTime, endTime) - // Combine the health filter SQLs - s""" - |SELECT userId, tweetId, tsMillis FROM user_tweet_interaction_with_fav_count_filter JOIN ( - | ${tweetWithMediaAndNSFWAuthorFilterSQL} - | AND tweetId NOT IN (${nsfwTweetDenylistSQL}) - |) USING(tweetId) - |""".stripMargin - } else { - // Directly read from the table without applying any filters - s"SELECT userId, tweetId, tsMillis FROM user_tweet_interaction_with_fav_count_filter" - } - } - - def getTopKTweetsForClusterKeyBQ( - sc: ScioContext, - queryTimestamp: DateTime, - maxTweetAgeHours: Int, - consumerEmbeddingsSQL: String, - userTweetEngagementEventPairSqlPath: String, - userTweetEngagementEventPairTemplateVariable: Map[String, String], - enableHealthAndVideoFilters: Boolean, - enableFavClusterTopKTweetsIntersection: Boolean, - minInteractionCount: Int, - minFavCount: Int, - tweetEmbeddingsLength: Int, - tweetEmbeddingsHalfLife: Int, - minEngagementPerCluster: Int, - clusterTopKTweets: Int - ): SCollection[TopKTweetsForClusterKey] = { - // Define template variables which we would like to be replaced in the corresponding sql file - val startTime = queryTimestamp.minusHours(maxTweetAgeHours) - val endTime = queryTimestamp - - val indexGenerationTemplateVariables = - Map( - "HALF_LIFE" -> tweetEmbeddingsHalfLife.toString, - "CURRENT_TS" -> queryTimestamp.toString(), - "START_TIME" -> startTime.toString(), - "END_TIME" -> endTime.toString(), - "USER_TWEET_ENGAGEMENT_TABLE_SQL" -> - getUserTweetEngagementEventPairSQL( - startTime, - endTime, - userTweetEngagementEventPairSqlPath, - userTweetEngagementEventPairTemplateVariable - ), - // Min interaction count filter - "MIN_INTERACTION_COUNT" -> minInteractionCount.toString, - // Min fav count filter - "TWEET_INTERACTION_WITH_FAV_COUNT_FILTER_SQL" -> getTweetInteractionTableWithFavCountFilter( - startTime, - endTime, - minFavCount - ), - // Health filter - "TWEET_INTERACTION_WITH_HEALTH_FILTER_SQL" -> getTweetInteractionTableWithHealthFilter( - startTime, - endTime, - enableHealthAndVideoFilters), - "CONSUMER_EMBEDDINGS_SQL" -> consumerEmbeddingsSQL, - "TWEET_EMBEDDING_LENGTH" -> tweetEmbeddingsLength.toString, - "MIN_ENGAGEMENT_PER_CLUSTER" -> minEngagementPerCluster.toString, - "CLUSTER_TOP_K_TWEETS" -> clusterTopKTweets.toString - ) - val query 
= BQQueryUtils.getBQQueryFromSqlFile( - simclustersEngagementBasedIndexGenerationSQLPath, - indexGenerationTemplateVariables) - - val postFilterQuery = if (enableFavClusterTopKTweetsIntersection) { - generateClusterTopTweetIntersectionWithFavBasedIndexSQL( - startTime, - endTime, - clusterTopKTweets, - query) - } else { - query - } - // Generate SimClusters cluster-to-tweet index - sc.customInput( - s"SimClusters cluster-to-tweet index generation BQ job", - BigQueryIO - .read(parseClusterTopKTweetsFn(tweetEmbeddingsHalfLife)) - .fromQuery(postFilterQuery) - .usingStandardSql() - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala deleted file mode 100644 index 46c2af2f0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala +++ /dev/null @@ -1,659 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation -package simclusters_index_generation - -import com.google.api.services.bigquery.model.TimePartitioning -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.DateRangeOptions -import com.twitter.conversions.DurationOps.richDurationFromInt -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.hdfs_sources.AdsFavBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.AdsFavClickBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.FavBasedEvergreenContentSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.FavBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.FavBasedVideoSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.ReplyBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.RetweetBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.VideoViewBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.PushOpenBasedSimclustersClusterToTweetIndexScalaDataset -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.buildActionTypesEngagementIndicatorString -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getInterestedIn2020SQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQTableDetails -import com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.Config.AdsClickEngagementTypeIds -import com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.Config.AdsFavEngagementTypeIds -import com.twitter.simclusters_v2.scio.bq_generation.simclusters_index_generation.EngagementEventBasedClusterToTweetIndexFromBQ.getTopKTweetsForClusterKeyBQ -import 
com.twitter.simclusters_v2.thriftscala.ClusterIdToTopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.FullClusterId -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.tcdc.bqblaster.beam.syntax._ -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import com.twitter.unified_user_actions.thriftscala.ActionType -import java.time.Instant -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.joda.time.DateTime - -trait EngagementEventBasedClusterToTweetIndexGenerationJob extends ScioBeamJob[DateRangeOptions] { - // Configs to set for different type of embeddings and jobs - val isAdhoc: Boolean - val getConsumerEmbeddingsSQLFunc: (DateTime, Int) => String - val outputTable: BQTableDetails - val keyValDatasetOutputPath: String - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] - // Base configs - val projectId = "twttr-recos-ml-prod" - val environment: DAL.Env = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - // Point to different user tweet interaction table generation sql - // UUA-supported events: Config.unifiedUserTweetActionPairGenerationSQLPath - val userTweetEngagementEventPairSqlPath: String - lazy val userTweetEngagementEventPairTemplateVariable: Map[String, String] = Map.empty - - // Enable Video-only filters and health filters (for VideoViewBased embeddings) - val enableHealthAndVideoFilters: Boolean = Config.enableHealthAndVideoFilters - - val enableFavClusterTopKTweetsIntersection: Boolean = - Config.enableIntersectionWithFavBasedClusterTopKTweetsIndex - - // Min fav/interaction threshold - val minInteractionCount: Int = Config.minInteractionCount - val minFavCount: Int = Config.minFavCount - - // Tweet embeddings parameters - val tweetEmbeddingsLength: Int = Config.tweetEmbeddingsLength - val tweetEmbeddingsHalfLife: Int = Config.tweetEmbeddingsHalfLife - - // Clusters-to-tweet index parameters - val clusterTopKTweets: Int = Config.clusterTopKTweets - val maxTweetAgeHours: Int = Config.maxTweetAgeHours - val minEngagementPerCluster: Int = Config.minEngagementPerCluster - - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - // The time when the job is scheduled - val queryTimestamp = opts.interval.getEnd - - // Read consumer embeddings SQL - val consumerEmbeddingsSQL = getConsumerEmbeddingsSQLFunc(queryTimestamp, 21) - - // Generate SimClusters cluster-to-tweet index via BQ - val topKtweetsForClusterKey = - getTopKTweetsForClusterKeyBQ( - sc, - queryTimestamp, - maxTweetAgeHours, - consumerEmbeddingsSQL, - userTweetEngagementEventPairSqlPath, - userTweetEngagementEventPairTemplateVariable, - enableHealthAndVideoFilters, - enableFavClusterTopKTweetsIntersection, - minInteractionCount, - minFavCount, - tweetEmbeddingsLength, - tweetEmbeddingsHalfLife, - minEngagementPerCluster, - clusterTopKTweets - ) - - // Setup BQ writer - val ingestionTime = opts.getDate().value.getEnd.toDate - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("dateHour" -> TypedProjection.fromConstant(ingestionTime)) - val timePartitioning = new TimePartitioning() - .setType("HOUR").setField("dateHour").setExpirationMs(3.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[ClusterIdToTopKTweetsWithScores] - 
.to(outputTable.toString) - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId(projectId) - .withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) - - // Save SimClusters index to a BQ table - topKtweetsForClusterKey - .map { clusterIdToTopKTweets => - { - ClusterIdToTopKTweetsWithScores( - clusterId = clusterIdToTopKTweets.clusterId, - topKTweetsWithScores = clusterIdToTopKTweets.topKTweetsWithScores - ) - } - } - .saveAsCustomOutput(s"WriteToBQTable - ${outputTable}", bqWriter) - - // Save SimClusters index as a KeyValSnapshotDataset - topKtweetsForClusterKey - .map { clusterIdToTopKTweets => - KeyVal(clusterIdToTopKTweets.clusterId, clusterIdToTopKTweets.topKTweetsWithScores) - }.saveAsCustomOutput( - name = s"WriteClusterToKeyIndexToKeyValDataset at ${keyValDatasetOutputPath}", - DAL.writeVersionedKeyVal( - clusterToTweetIndexSnapshotDataset, - PathLayout.VersionedPath(prefix = - ((if (!isAdhoc) - Config.RootMHPath - else - Config.AdhocRootPath) - + keyValDatasetOutputPath)), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = environment, - ) - ) - } -} - -// This abstract class is used to define parameters specific to UUA events. -abstract class UUABasedClusterToTweetIndexGenerationJob - extends EngagementEventBasedClusterToTweetIndexGenerationJob { - // UUA Action types and column names - val contributingActionTypes: Seq[String] - val contributingActionReferenceTweetIdColumn: String = Config.actionTweetIdColumn - val undoActionTypes: Seq[String] - // Default undo tweet id is same as the actionTweetId (e.g. 
for favs these are the same tweet id) - val undoActionReferenceTweetIdColumn: String = Config.actionTweetIdColumn - - // Get the string that represents the list of undo event ids - lazy val undoActionTypesStr: String = { - // Populate the action type list with a placeholder action if its empty - val actionTypes = - if (undoActionTypes.nonEmpty) undoActionTypes - else Seq(Config.PlaceholderActionType) - convertActionTypesSeqToString(actionTypes) - } - - override lazy val userTweetEngagementEventPairTemplateVariable: Map[String, String] = { - Map( - "CONTRIBUTING_ACTION_TYPES_STR" -> convertActionTypesSeqToString(contributingActionTypes), - "CONTRIBUTING_ACTION_TWEET_ID_COLUMN" -> contributingActionReferenceTweetIdColumn, - "UNDO_ACTION_TYPES_STR" -> undoActionTypesStr, - "UNDO_ACTION_TWEET_ID_COLUMN" -> undoActionReferenceTweetIdColumn - ) - } - - /*** - * Convert a list of actions to a string that could be easily used in SQLs - * Example input: Seq("ServerTweetFav", "ClientTweetFav") - * output: "ServerTweetFav","ClientTweetFav" - * SQL use case: SELECT * FROM table WHERE actionType IN ("ServerTweetFav","ClientTweetFav") - */ - private def convertActionTypesSeqToString(actionTypes: Seq[String]): String = { - actionTypes.map(action => f"""\"${action}\"""").mkString(",") - } -} - -abstract class AdsClusterToTweetIndexGenerationJob - extends EngagementEventBasedClusterToTweetIndexGenerationJob { - // Ads contributing action types - fav, click, etc - val contributingActionTypes: Seq[Int] - - override lazy val userTweetEngagementEventPairTemplateVariable: Map[String, String] = { - Map( - "CONTRIBUTING_ACTION_TYPES_STR" -> convertActionTypesSeqToString(contributingActionTypes) - ) - } - private def convertActionTypesSeqToString(actionTypes: Seq[Int]): String = { - actionTypes.map(action => f"""${action}""").mkString(",") - } -} - -object FavBasedClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 8 - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_fav_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.FavBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedSimclustersClusterToTweetIndexScalaDataset -} - -object FavBasedClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 8 - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_fav_based_cluster_to_tweet_index") - 
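The UUA-based jobs above pass their contributing and undo action types into the SQL templates as quoted, comma-separated strings (see the `convertActionTypesSeqToString` example in the code). A small sketch of how such a value could be rendered into a `{PLACEHOLDER}`-style SQL template; the `render` helper and table name below are hypothetical stand-ins, not the actual `BQQueryUtils.getBQQueryFromSqlFile` behavior:

```scala
// Illustrative only: shows how an action-type list becomes an IN (...) clause.
object ActionTypeTemplateSketch {
  // Mirrors the documented convertActionTypesSeqToString example:
  // Seq("ServerTweetFav", "ClientTweetFav") => "ServerTweetFav","ClientTweetFav"
  def quoteActionTypes(actionTypes: Seq[String]): String =
    actionTypes.map(a => "\"" + a + "\"").mkString(",")

  // Hypothetical {KEY} substitution; the real SQL loading/templating lives in BQQueryUtils.
  def render(template: String, vars: Map[String, String]): String =
    vars.foldLeft(template) { case (sql, (key, value)) => sql.replace(s"{$key}", value) }

  def main(args: Array[String]): Unit = {
    val template =
      "SELECT userId, tweetId FROM uua WHERE actionType IN ({CONTRIBUTING_ACTION_TYPES_STR})"
    val rendered = render(
      template,
      Map("CONTRIBUTING_ACTION_TYPES_STR" -> quoteActionTypes(Seq("ServerTweetFav", "ClientTweetFav"))))
    println(rendered)
    // SELECT userId, tweetId FROM uua WHERE actionType IN ("ServerTweetFav","ClientTweetFav")
  }
}
```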
override val keyValDatasetOutputPath = Config.FavBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedSimclustersClusterToTweetIndexScalaDataset -} - -object VideoViewBasedClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq( - ActionType.ClientTweetVideoPlayback50.name) - override val undoActionTypes: Seq[String] = Seq.empty - override val enableHealthAndVideoFilters: Boolean = true - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_video_view_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.VideoViewBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - VideoViewBasedSimclustersClusterToTweetIndexScalaDataset -} - -object VideoViewBasedClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq( - ActionType.ClientTweetVideoPlayback50.name) - override val undoActionTypes: Seq[String] = Seq.empty - override val enableHealthAndVideoFilters: Boolean = true - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_video_view_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.VideoViewBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - VideoViewBasedSimclustersClusterToTweetIndexScalaDataset -} - -object RetweetBasedClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetRetweet.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnretweet.name) - override val undoActionReferenceTweetIdColumn: String = Config.retweetTweetIdColumn - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_retweet_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.RetweetBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - RetweetBasedSimclustersClusterToTweetIndexScalaDataset -} - -object RetweetBasedClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val 
contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetRetweet.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnretweet.name) - override val undoActionReferenceTweetIdColumn: String = Config.retweetTweetIdColumn - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_retweet_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.RetweetBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - RetweetBasedSimclustersClusterToTweetIndexScalaDataset -} - -object ReplyBasedClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.combinedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetReply.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetDelete.name) - override val undoActionReferenceTweetIdColumn: String = Config.replyTweetIdColumn - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 8 - override val minEngagementPerCluster: Int = 3 - // Add supplemental positive signals to the user tweet engagement event template - // We bundle each reply signal with a positive signal (fav or retweet) - val supplementalPositiveSignals: Seq[String] = - Seq(ActionType.ServerTweetFav.name, ActionType.ServerTweetRetweet.name) - override lazy val userTweetEngagementEventPairTemplateVariable: Map[String, String] = { - Map( - "CONTRIBUTING_ACTION_TYPE_STR" -> contributingActionTypes.head, - "UNDO_ACTION_TYPES_STR" -> undoActionTypesStr, - "UNDO_ACTION_TWEET_ID_COLUMN" -> undoActionReferenceTweetIdColumn, - "SUPPLEMENTAL_ACTION_TYPES_ENGAGEMENT_STR" -> buildActionTypesEngagementIndicatorString( - supplementalPositiveSignals) - ) - } - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_reply_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.ReplyBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - ReplyBasedSimclustersClusterToTweetIndexScalaDataset -} - -object ReplyBasedClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.combinedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetReply.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetDelete.name) - override val undoActionReferenceTweetIdColumn: String = Config.replyTweetIdColumn - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 8 - override val minEngagementPerCluster: Int = 3 - // Add supplemental positive signals to the user tweet engagement event template - // We bundle each reply signal with a positive signal (fav or retweet) - val supplementalPositiveSignals: Seq[String] = - Seq(ActionType.ServerTweetFav.name, ActionType.ServerTweetRetweet.name) - override lazy val userTweetEngagementEventPairTemplateVariable: Map[String, 
String] = { - Map( - "CONTRIBUTING_ACTION_TYPE_STR" -> contributingActionTypes.head, - "UNDO_ACTION_TYPES_STR" -> undoActionTypesStr, - "UNDO_ACTION_TWEET_ID_COLUMN" -> undoActionReferenceTweetIdColumn, - "SUPPLEMENTAL_ACTION_TYPES_ENGAGEMENT_STR" -> buildActionTypesEngagementIndicatorString( - supplementalPositiveSignals) - ) - } - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_reply_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.ReplyBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - ReplyBasedSimclustersClusterToTweetIndexScalaDataset -} - -object PushOpenBasedClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ClientNotificationOpen.name) - override val contributingActionReferenceTweetIdColumn: String = Config.pushTweetIdColumn - override val undoActionTypes: Seq[String] = Seq.empty - override val minInteractionCount = 1 - override val minFavCount = 0 - override val enableFavClusterTopKTweetsIntersection = true - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_push_open_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.PushOpenBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - PushOpenBasedSimclustersClusterToTweetIndexScalaDataset -} - -object PushOpenBasedClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.unifiedUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ClientNotificationOpen.name) - override val contributingActionReferenceTweetIdColumn: String = Config.pushTweetIdColumn - override val undoActionTypes: Seq[String] = Seq.empty - override val minInteractionCount = 1 - override val minFavCount = 0 - override val enableFavClusterTopKTweetsIntersection = true - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_push_open_based_cluster_to_tweet_index") - override val keyValDatasetOutputPath = Config.PushOpenBasedClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - PushOpenBasedSimclustersClusterToTweetIndexScalaDataset -} - -object AdsFavBasedClusterToTweetIndexGenerationAdhocJob - extends AdsClusterToTweetIndexGenerationJob { - val isAdhoc: Boolean = true - val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val contributingActionTypes: Seq[Int] = AdsFavEngagementTypeIds // fav - override val tweetEmbeddingsHalfLife: Int = 345600000 // 4 days - // The earliest user tweet engagement event we consider is 7 days ago - // The tweet could be older than 7 days - override val maxTweetAgeHours: Int = 168 // 7 days - override val minInteractionCount: Int = 3 - override val 
minFavCount: Int = 3 - override val minEngagementPerCluster: Int = 2 - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_ads_fav_based_cluster_to_tweet_index") - val keyValDatasetOutputPath: String = Config.AdsFavBasedClusterToTweetIndexOutputPath - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = AdsFavBasedSimclustersClusterToTweetIndexScalaDataset - val userTweetEngagementEventPairSqlPath: String = - Config.adsUserTweetActionPairGenerationSQLPath -} -object AdsFavBasedClusterToTweetIndexGenerationBatchJob - extends AdsClusterToTweetIndexGenerationJob { - val isAdhoc: Boolean = false - val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val contributingActionTypes: Seq[Int] = AdsFavEngagementTypeIds // fav - override val tweetEmbeddingsHalfLife: Int = 345600000 // 4 days - // The earliest user tweet engagement event we consider is 7 days ago - // The tweet could be older than 7 days - override val maxTweetAgeHours: Int = 168 // 7 days - override val minInteractionCount: Int = 3 - override val minFavCount: Int = 3 - override val minEngagementPerCluster: Int = 2 - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_ads_fav_based_cluster_to_tweet_index") - val keyValDatasetOutputPath: String = Config.AdsFavBasedClusterToTweetIndexOutputPath - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = AdsFavBasedSimclustersClusterToTweetIndexScalaDataset - val userTweetEngagementEventPairSqlPath: String = - Config.adsUserTweetActionPairGenerationSQLPath -} - -object AdsFavClickBasedClusterToTweetIndexGenerationAdhocJob - extends AdsClusterToTweetIndexGenerationJob { - val isAdhoc: Boolean = true - val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val contributingActionTypes: Seq[Int] = - AdsFavEngagementTypeIds ++ AdsClickEngagementTypeIds // fav + click - override val tweetEmbeddingsHalfLife: Int = 604800000 // 7 days - // The earliest user tweet engagement event we consider is 21 days ago - // The tweet could be older than 21 days - override val maxTweetAgeHours: Int = 504 // 21 days - override val minInteractionCount: Int = 3 - override val minFavCount: Int = 3 - override val minEngagementPerCluster: Int = 2 - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_ads_fav_click_ sbased_cluster_to_tweet_index") - val keyValDatasetOutputPath: String = Config.AdsFavClickBasedClusterToTweetIndexOutputPath - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = AdsFavClickBasedSimclustersClusterToTweetIndexScalaDataset - val userTweetEngagementEventPairSqlPath: String = - Config.adsUserTweetActionPairGenerationSQLPath -} - -object AdsFavClickBasedClusterToTweetIndexGenerationBatchJob - extends AdsClusterToTweetIndexGenerationJob { - val isAdhoc: Boolean = false - val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val contributingActionTypes: Seq[Int] = - AdsFavEngagementTypeIds ++ AdsClickEngagementTypeIds // fav + click - override val tweetEmbeddingsHalfLife: Int = 604800000 // 7 days - // The earliest user tweet engagement event we consider is 21 days ago - // The tweet could be older than 21 days - override val maxTweetAgeHours: Int = 504 // 21 days - override val minInteractionCount: Int = 3 - override val minFavCount: Int = 3 - 
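The half-life and max-age settings in the ads jobs above are raw milliseconds and hours. A quick sketch of the conversions and of the resulting per-engagement decay multiplier (the `POW(0.5, age / halfLife)` term from the index-generation SQL), with illustrative ages:

```scala
// Illustrative arithmetic for the half-life constants used by the ads jobs above.
object HalfLifeSketch {
  val MillisPerHour: Long = 60L * 60L * 1000L

  def millisToHours(ms: Long): Double = ms.toDouble / MillisPerHour

  // 0.5 ^ (age / halfLife): the weight an engagement retains after ageMillis.
  def decay(ageMillis: Long, halfLifeMillis: Long): Double =
    math.pow(0.5, ageMillis.toDouble / halfLifeMillis)

  def main(args: Array[String]): Unit = {
    println(millisToHours(345600000L)) // 96.0 hours  = 4 days (ads fav half-life)
    println(millisToHours(604800000L)) // 168.0 hours = 7 days (ads fav + click half-life)
    // Under a 4-day half-life, a 2-day-old engagement keeps ~71% of its weight.
    println(decay(48L * MillisPerHour, 345600000L)) // ~0.7071
  }
}
```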
override val minEngagementPerCluster: Int = 2 - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_ads_fav_click_based_cluster_to_tweet_index") - val keyValDatasetOutputPath: String = Config.AdsFavClickBasedClusterToTweetIndexOutputPath - val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = AdsFavClickBasedSimclustersClusterToTweetIndexScalaDataset - val userTweetEngagementEventPairSqlPath: String = - Config.adsUserTweetActionPairGenerationSQLPath -} - -object FavBasedEvergreenContentClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.evergreenContentUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val tweetEmbeddingsHalfLife: Int = 57600000 // 16 hours - override val maxTweetAgeHours: Int = 48 // 2 days - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 0 - override val outputTable = - BQTableDetails( - "twttr-recos-ml-prod", - "simclusters", - "simclusters_fav_based_evergreen_content_cluster_to_tweet_index") - override val keyValDatasetOutputPath = - Config.FavBasedEvergreenContentClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedEvergreenContentSimclustersClusterToTweetIndexScalaDataset -} - -object FavBasedEvergreenContentClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.evergreenContentUserTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val tweetEmbeddingsHalfLife: Int = 57600000 // 16 hours - override val maxTweetAgeHours: Int = 48 // 2 days - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 0 - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_fav_based_evergreen_content_cluster_to_tweet_index") - override val keyValDatasetOutputPath = - Config.FavBasedEvergreenContentClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedEvergreenContentSimclustersClusterToTweetIndexScalaDataset -} - -object FavBasedVideoClusterToTweetIndexGenerationAdhocJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.favBasedVideoTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 0 - override val outputTable = - BQTableDetails( - 
"twttr-recos-ml-prod", - "simclusters", - "simclusters_fav_based_video_cluster_to_tweet_index") - override val keyValDatasetOutputPath = - Config.FavBasedVideoClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedVideoSimclustersClusterToTweetIndexScalaDataset -} - -object FavBasedVideoClusterToTweetIndexGenerationBatchJob - extends UUABasedClusterToTweetIndexGenerationJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val userTweetEngagementEventPairSqlPath: String = - Config.favBasedVideoTweetActionPairGenerationSQLPath - override val contributingActionTypes: Seq[String] = Seq(ActionType.ServerTweetFav.name) - override val undoActionTypes: Seq[String] = Seq(ActionType.ServerTweetUnfav.name) - override val minInteractionCount: Int = 8 - override val minFavCount: Int = 0 - override val outputTable = - BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "simclusters_fav_based_video_cluster_to_tweet_index") - override val keyValDatasetOutputPath = - Config.FavBasedVideoClusterToTweetIndexOutputPath - override val clusterToTweetIndexSnapshotDataset: KeyValDALDataset[ - KeyVal[FullClusterId, TopKTweetsWithScores] - ] = - FavBasedVideoSimclustersClusterToTweetIndexScalaDataset -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/README b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/README deleted file mode 100644 index 4b6a2dc16..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/README +++ /dev/null @@ -1,146 +0,0 @@ -# Adhoc SimClusters Cluster-to-tweet Index Generation Jobs -## Build and bundle the binaries - -``` - bazel bundle src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/... 
-``` - -## Run the adhoc jobs -### To run fav based cluster-to-tweet index generation job (adhoc): -bin/d6w create \ - twttr-recos-ml-prod/us-central1/fav-based-index-generation-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --jar dist/fav-based-cluster-to-tweet-index-generation-adhoc-job-bundle/fav-based-cluster-to-tweet-index-generation-adhoc-job.jar \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-07-15" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.environment=dev \ - --bind=profile.job_name="fav-based-index-generation-adhoc-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:fav-based-cluster-to-tweet-index-generation-adhoc-job" - -### To run VideoView based cluster-to-tweet index generation job (adhoc): -bin/d6w create \ - twttr-recos-ml-prod/us-central1/video-view-based-index-generation-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --jar dist/video-view-based-cluster-to-tweet-index-generation-adhoc-job-bundle/video-view-based-cluster-to-tweet-index-generation-adhoc-job.jar \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-07-15" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.environment=dev \ - --bind=profile.job_name="video-view-based-index-generation-adhoc-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:video-view-based-cluster-to-tweet-index-generation-adhoc-job" - -### To run retweet based cluster-to-tweet index generation job (adhoc): -bin/d6w create \ - twttr-recos-ml-prod/us-central1/retweet-based-index-generation-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --jar dist/retweet-based-cluster-to-tweet-index-generation-adhoc-job-bundle/retweet-based-cluster-to-tweet-index-generation-adhoc-job.jar \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-07-15" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.environment=dev \ - --bind=profile.job_name="retweet-based-index-generation-adhoc-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:retweet-based-cluster-to-tweet-index-generation-adhoc-job" - -### To run reply based cluster-to-tweet index generation job (adhoc): -bin/d6w create \ - twttr-recos-ml-prod/us-central1/reply-based-index-generation-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --jar dist/reply-based-cluster-to-tweet-index-generation-adhoc-job-bundle/reply-based-cluster-to-tweet-index-generation-adhoc-job.jar \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-07-15" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.environment=dev \ - --bind=profile.job_name="reply-based-index-generation-adhoc-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:reply-based-cluster-to-tweet-index-generation-adhoc-job" - -### To run push open based cluster-to-tweet index generation job (adhoc): -bin/d6w create \ - 
twttr-recos-ml-prod/us-central1/push-open-based-index-generation-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --jar dist/push-open-based-cluster-to-tweet-index-generation-adhoc-job-bundle/push-open-based-cluster-to-tweet-index-generation-adhoc-job.jar \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-10-06" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.environment=dev \ - --bind=profile.job_name="push-open-based-index-generation-adhoc-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:push-open-based-cluster-to-tweet-index-generation-adhoc-job" - -# For prod scheduled Cluster-to-tweet Index Generation Jobs -### To run Fav based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/fav-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-07-19" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.job_name="fav-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:fav-based-cluster-to-tweet-index-generation-batch-job" - -### To run VideoView based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/video-view-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-07-19" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.job_name="video-view-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:video-view-based-cluster-to-tweet-index-generation-batch-job" - -### To run Retweet based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/retweet-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-07-19" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.job_name="retweet-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:retweet-based-cluster-to-tweet-index-generation-batch-job" - -### To run Reply based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/reply-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-07-19" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.job_name="reply-based-index-generation-batch-job" \ - 
--bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:reply-based-cluster-to-tweet-index-generation-batch-job" - -### To run Push open based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/push-open-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-10-06" \ - --bind=profile.frequency="PT1H" \ - --bind=profile.job_name="push-open-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:push-open-based-cluster-to-tweet-index-generation-batch-job" - -### To run Ads Fav based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/ads-fav-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-10-06" \ - --bind=profile.frequency="PT3H" \ - --bind=profile.job_name="ads-fav-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:ads-fav-based-cluster-to-tweet-index-generation-batch-job" - -### To run Ads Fav Click based cluster-to-tweet index generation job (batch): - bin/d6w schedule \ - twttr-recos-ml-prod/us-central1/ads-fav-click-based-index-generation-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w \ - --bind=profile.user_name=cassowary \ - --bind=profile.environment=prod \ - --bind=profile.date="2022-12-09" \ - --bind=profile.frequency="PT3H" \ - --bind=profile.job_name="ads-fav-click-based-index-generation-batch-job" \ - --bind=profile.build_target="src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation:ads-fav-click-based-cluster-to-tweet-index-generation-batch-job" - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w deleted file mode 100644 index 8da4ba43c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/simclusters_index_generation/engagement-event-based-simclusters-index-generation-job.d6w +++ /dev/null @@ -1,44 +0,0 @@ -class Profile(Struct): - project = Default(String, 'twttr-recos-ml-prod') - date = Required(String) - build_target = Required(String) - job_name = Required(String) - environment = Default(String, 'dev') - machine = Default(String, 'n2-standard-2') - frequency = Default(String, 'PT1H') - -SimClustersIndexGenerationJob = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - 
worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='{{profile.environment}}', - build_target='{{profile.build_target}}', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='{{profile.frequency}}', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT4H' - ) -) - -jobs=[SimClustersIndexGenerationJob.bind(profile=Profile())] - - - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/BUILD deleted file mode 100644 index ba87e2b54..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/BUILD +++ /dev/null @@ -1,3 +0,0 @@ -resources( - sources = ["*"], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/ads_user_tweet_action_pair_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/ads_user_tweet_action_pair_generation.sql deleted file mode 100644 index c5f1e702a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/ads_user_tweet_action_pair_generation.sql +++ /dev/null @@ -1,38 +0,0 @@ -WITH - vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date - ), - - ads_engagement AS ( - SELECT - userId64 as userId, - promotedTweetId as tweetId, - UNIX_MILLIS(timestamp) AS tsMillis, - lineItemId - FROM `twttr-rev-core-data-prod.core_served_impressions.spend`, vars - WHERE TIMESTAMP(_batchEnd) >= vars.start_date AND TIMESTAMP(_batchEnd) <= vars.end_date - AND - engagementType IN ({CONTRIBUTING_ACTION_TYPES_STR}) - AND lineItemObjective != 9 -- not pre-roll ads - ), - - line_items AS ( - SELECT - id AS lineItemId, - end_time.posixTime AS endTime - FROM - `twttr-rev-core-data-prod.rev_ads_production.line_items` - ) - - -SELECT - userId, - tweetId, - tsMillis -FROM ads_engagement JOIN line_items USING(lineItemId), vars -WHERE - line_items.endTime IS NULL - OR TIMESTAMP_MILLIS(line_items.endTime) >= vars.end_date - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets.sql deleted file mode 100644 index 9150e7c92..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets.sql +++ /dev/null @@ -1,15 +0,0 @@ -WITH tweet_embedding AS ( --- Expected columns: --- tweetId, clusterId, tweetScore - {TWEET_EMBEDDING_SQL} -), -clusters_top_k_tweets AS ( - SELECT clusterId, ARRAY_AGG(STRUCT(tweetId, tweetScore) ORDER BY tweetScore DESC LIMIT {CLUSTER_TOP_K_TWEETS}) AS topKTweetsForClusterKey - FROM tweet_embedding - GROUP BY clusterId -) -SELECT - clusterId, - topKTweetsForClusterKey -FROM clusters_top_k_tweets - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets_intersection_with_fav_based_index.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets_intersection_with_fav_based_index.sql deleted file mode 100644 index 52d13c154..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/cluster_top_tweets_intersection_with_fav_based_index.sql +++ /dev/null @@ -1,59 
+0,0 @@ -WITH - cluster_top_tweets AS ( - {CLUSTER_TOP_TWEETS_SQL} - ), - - flatten_cluster_top_tweets AS ( - SELECT - clusterId, - tweet.tweetId, - tweet.tweetScore, - FROM cluster_top_tweets, UNNEST(topKTweetsForClusterKey) AS tweet - ), - ---- There might be delay or skip for the fav-based dataset. ---- This query retrieved the dateHour of the latest partition available. - latest_fav_cluster_to_tweet AS ( - SELECT - MAX(dateHour) AS latestTimestamp - FROM - `twttr-bq-cassowary-prod.user.simclusters_fav_based_cluster_to_tweet_index` - WHERE - TIMESTAMP(dateHour) >= TIMESTAMP("{START_TIME}") - AND TIMESTAMP(dateHour) <= TIMESTAMP("{END_TIME}") - ), - - flatten_fav_cluster_top_tweets AS ( - SELECT - clusterId.clusterId AS clusterId, - tweet.key AS tweetId - FROM - `twttr-bq-cassowary-prod.user.simclusters_fav_based_cluster_to_tweet_index`, - UNNEST(topKTweetsWithScores.topTweetsByFavClusterNormalizedScore) AS tweet, - latest_fav_cluster_to_tweet - WHERE - dateHour=latest_fav_cluster_to_tweet.latestTimestamp - ), - - flatten_cluster_top_tweets_intersection AS ( - SELECT - clusterId, - flatten_cluster_top_tweets.tweetId, - flatten_cluster_top_tweets.tweetScore - FROM - flatten_cluster_top_tweets - INNER JOIN - flatten_fav_cluster_top_tweets - USING(clusterId, tweetId) - ), - - processed_cluster_top_tweets AS ( - SELECT - clusterId, - ARRAY_AGG(STRUCT(tweetId, tweetScore) ORDER BY tweetScore LIMIT {CLUSTER_TOP_K_TWEETS}) AS topKTweetsForClusterKey - FROM flatten_cluster_top_tweets_intersection - GROUP BY clusterId - ) - - SELECT * - FROM processed_cluster_top_tweets diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/combined_user_tweet_action_pair_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/combined_user_tweet_action_pair_generation.sql deleted file mode 100644 index ed8880c11..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/combined_user_tweet_action_pair_generation.sql +++ /dev/null @@ -1,68 +0,0 @@ -WITH - vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date, - TIMESTAMP("{NO_OLDER_TWEETS_THAN_DATE}") AS no_older_tweets_than_date - ), - - -- Get raw user-tweet interaction events from UUA - actions_unioned AS ( - SELECT - userIdentifier.userId AS userId, - item.tweetInfo.actionTweetId AS tweetId, - eventMetadata.sourceTimestampMs AS tsMillis, - CASE - WHEN actionType = "ServerTweetFav" THEN 1 - WHEN actionType = "ServerTweetUnfav" THEN -1 - END AS favAction, - CASE - WHEN actionType = "ServerTweetReply" THEN 1 - WHEN actionType = "ServerTweetDelete" THEN -1 - END AS replyAction, - CASE - WHEN actionType = "ServerTweetRetweet" THEN 1 - WHEN actionType = "ServerTweetUnretweet" THEN -1 - END AS retweetAction, - IF(actionType = "ClientTweetVideoPlayback50", 1, NULL) AS videoPlayback50Action - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(vars.start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(vars.end_date) - AND (actionType = "ServerTweetReply" - OR actionType = "ServerTweetRetweet" - OR actionType = "ServerTweetFav" - OR actionType = "ServerTweetUnfav" - OR actionType = "ClientTweetVideoPlayback50" - ) - ), - - user_tweet_action_pairs AS ( - SELECT - userId, - tweetId, - -- Get the most recent fav event - ARRAY_AGG(IF(favAction IS NOT NULL, 
STRUCT(favAction AS engaged, tsMillis), NULL) IGNORE NULLS ORDER BY tsMillis DESC LIMIT 1)[OFFSET(0)] as ServerTweetFav, - -- Get the most recent reply / unreply event - ARRAY_AGG(IF(replyAction IS NOT NULL,STRUCT(replyAction AS engaged, tsMillis), NULL) IGNORE NULLS ORDER BY tsMillis DESC LIMIT 1)[OFFSET(0)] as ServerTweetReply, - -- Get the most recent retweet / unretweet event - ARRAY_AGG(IF(retweetAction IS NOT NULL, STRUCT(retweetAction AS engaged, tsMillis), NULL) IGNORE NULLS ORDER BY tsMillis DESC LIMIT 1)[OFFSET(0)] as ServerTweetRetweet, - -- Get the most recent video view event - ARRAY_AGG(IF(videoPlayback50Action IS NOT NULL, STRUCT(videoPlayback50Action AS engaged, tsMillis), NULL) IGNORE NULLS ORDER BY tsMillis DESC LIMIT 1)[OFFSET(0)] as ClientTweetVideoPlayback50 - FROM actions_unioned - GROUP BY userId, tweetId - ) - --- Combine signals --- Apply age filter in this step -SELECT - userId, - tweetId, - CAST({CONTRIBUTING_ACTION_TYPE_STR}.tsMillis AS FLOAT64) AS tsMillis -FROM user_tweet_action_pairs, vars -WHERE - {CONTRIBUTING_ACTION_TYPE_STR}.engaged = 1 - AND - ({SUPPLEMENTAL_ACTION_TYPES_ENGAGEMENT_STR}) - AND timestamp_millis((1288834974657 + - ((tweetId & 9223372036850581504) >> 22))) >= vars.no_older_tweets_than_date diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/engagement_based_index_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/engagement_based_index_generation.sql deleted file mode 100644 index 5adce6f4b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/engagement_based_index_generation.sql +++ /dev/null @@ -1,85 +0,0 @@ --- This SQL query generate the cluster to top k tweets index based on tweet engagements. --- The engagement type is decided by USER_TWEET_ENGAGEMENT_TABLE_SQL. 
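-- Sketch of the template contract this query assumes (inferred from how the CTEs below use
-- the substituted SQL; the original file does not spell it out):
--   {USER_TWEET_ENGAGEMENT_TABLE_SQL} should return one row per engagement with columns
--     userId, tweetId, tsMillis (engagement time in epoch millis);
--     the *_user_tweet_action_pair_generation.sql queries in this directory (ads, combined,
--     evergreen, unified, video-fav) all produce exactly this shape.
--   {CONSUMER_EMBEDDINGS_SQL} should return SimClusters consumer embeddings with columns
--     userId, clusterId, clusterNormalizedLogFavScore, which are joined on userId below.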
- -with vars as ( - SELECT {HALF_LIFE} AS halfLife, -- Default: 8 hour halfLife in millis - UNIX_MILLIS("{CURRENT_TS}") AS currentTs, - ), - - user_tweet_engagement_pairs AS ( - {USER_TWEET_ENGAGEMENT_TABLE_SQL} - ), - - -- A sequence of filters to get eligible tweetIds for tweet embedding generation - -- Apply min interaction count filter - user_tweet_interaction_with_min_interaction_count_filter AS ( - SELECT userId, user_tweet_engagement_pairs.tweetId, tsMillis - FROM user_tweet_engagement_pairs, vars - JOIN ( - SELECT tweetId, COUNT(DISTINCT(userId)) AS interactionCount - FROM user_tweet_engagement_pairs - GROUP BY tweetId - HAVING interactionCount >= {MIN_INTERACTION_COUNT} -- Only generate tweet embeddings for tweets with >= {MIN_INTERACTION_COUNT} interactions - ) eligible_tweets USING(tweetId) - ), - - -- Apply min fav count filter - user_tweet_interaction_with_fav_count_filter AS ( - {TWEET_INTERACTION_WITH_FAV_COUNT_FILTER_SQL} - ), - - -- Apply health and video filter - user_tweet_interaction_with_health_filter AS ( - {TWEET_INTERACTION_WITH_HEALTH_FILTER_SQL} - ), - - -- Final filtered user tweet interaction table - -- Read the result from the last filter - user_tweet_interaction_processed_table AS ( - SELECT * - FROM user_tweet_interaction_with_health_filter - ), - - -- Read consumer embeddings - consumer_embeddings AS ( - {CONSUMER_EMBEDDINGS_SQL} - ), - - -- Update tweet cluster scores based on interaction events - tweet_cluster_scores AS ( - SELECT tweetId, - STRUCT( - clusterId, - CASE vars.halfLife - -- halfLife = -1 means there is no half life decay and we directly take the sum as the score - WHEN -1 THEN SUM(clusterNormalizedLogFavScore) - ELSE SUM(clusterNormalizedLogFavScore * POW(0.5, (currentTs - tsMillis) / vars.halfLife)) - END AS normalizedScore, - COUNT(*) AS engagementCount) - AS clusterIdToScores - FROM user_tweet_interaction_processed_table, vars - JOIN consumer_embeddings USING(userId) - GROUP BY tweetId, clusterId, vars.halfLife - ), - - -- Generate tweet embeddings - tweet_embeddings_with_top_clusters AS ( - SELECT tweetId, ARRAY_AGG( - clusterIdToScores - ORDER BY clusterIdToScores.normalizedScore DESC - LIMIT {TWEET_EMBEDDING_LENGTH} - ) AS clusterIdToScores - FROM tweet_cluster_scores - GROUP BY tweetId - ), - - clusters_top_k_tweets AS ( - SELECT clusterId, ARRAY_AGG(STRUCT(tweetId, normalizedScore AS tweetScore) ORDER BY normalizedScore DESC LIMIT {CLUSTER_TOP_K_TWEETS}) AS topKTweetsForClusterKey - FROM tweet_embeddings_with_top_clusters, UNNEST(clusterIdToScores) AS clusterIdToScores - WHERE engagementCount >= {MIN_ENGAGEMENT_PER_CLUSTER} - GROUP BY clusterId - ) - -SELECT * -FROM clusters_top_k_tweets - diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/evergreen_content_user_tweet_action_pair_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/evergreen_content_user_tweet_action_pair_generation.sql deleted file mode 100644 index a23763a06..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/evergreen_content_user_tweet_action_pair_generation.sql +++ /dev/null @@ -1,62 +0,0 @@ -WITH - vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date, - ), - - -- Get raw user-tweet interaction events from UUA - raw_engagements AS ( - SELECT - userIdentifier.userId AS userId, - eventMetadata.sourceTimestampMs AS tsMillis, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN {CONTRIBUTING_ACTION_TWEET_ID_COLUMN} - WHEN actionType IN 
({UNDO_ACTION_TYPES_STR}) THEN {UNDO_ACTION_TWEET_ID_COLUMN} - END AS tweetId, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN 1 - WHEN actionType IN ({UNDO_ACTION_TYPES_STR}) THEN -1 - END AS doOrUndo - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(vars.start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(vars.end_date) - AND (actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) - OR actionType IN ({UNDO_ACTION_TYPES_STR})) - ), - - -- Get evergreen tweet ids - evergreen_tweet_ids AS ( - SELECT - tweetId - FROM `twttr-recos-ml-prod.simclusters.evergreen_content_data` - WHERE TIMESTAMP(ts) = - ( -- Get latest partition time - SELECT MAX(TIMESTAMP(ts)) latest_partition - FROM `twttr-recos-ml-prod.simclusters.evergreen_content_data` - WHERE DATE(ts) BETWEEN - DATE_SUB(DATE("{END_TIME}"), - INTERVAL 14 DAY) AND DATE("{END_TIME}") - ) - ), - - -- Join evergreen content table - evergreen_tweets_engagements AS ( - SELECT raw_engagements.* - FROM raw_engagements JOIN evergreen_tweet_ids USING(tweetId) - ), - - -- Group by userId and tweetId - user_tweet_engagement_pairs AS ( - SELECT userId, tweetId, ARRAY_AGG(STRUCT(doOrUndo, tsMillis) ORDER BY tsMillis DESC LIMIT 1) AS details, COUNT(*) AS cnt - FROM evergreen_tweets_engagements - GROUP BY userId, tweetId - ) - --- Remove undo events -SELECT userId, tweetId, CAST(dt.tsMillis AS FLOAT64) AS tsMillis -FROM user_tweet_engagement_pairs, vars -CROSS JOIN UNNEST(details) AS dt -WHERE cnt <= 10 - AND dt.doOrUndo = 1 diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/nsfw_tweet_denylist.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/nsfw_tweet_denylist.sql deleted file mode 100644 index 472218075..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/nsfw_tweet_denylist.sql +++ /dev/null @@ -1,43 +0,0 @@ - SELECT DISTINCT tweetId - FROM `twttr-bq-tweetsource-prod.user.unhydrated_flat`, UNNEST(entity_annotations) AS ea - WHERE - (DATE(_PARTITIONTIME) >= DATE("{START_TIME}") AND DATE(_PARTITIONTIME) <= DATE("{END_TIME}")) AND - timestamp_millis((1288834974657 + - ((tweetId & 9223372036850581504) >> 22))) >= TIMESTAMP("{START_TIME}") - AND timestamp_millis((1288834974657 + - ((tweetId & 9223372036850581504) >> 22))) <= TIMESTAMP("{END_TIME}") - AND ( - ea.entityId IN ( - 883054128338878464, - 1453131634669019141, - 1470464132432347136, - 1167512219786997760, - 1151588902739644416, - 1151920148661489664, - 1155582950991228928, - 738501328687628288, - 1047106191829028865 - ) - OR ( - ea.groupId IN (34, 35) # Cortex media understanding - AND ea.entityId IN ( - 1072916828484038657, - 1133752108212035585, - 1072916828488327170 - ) - ) - OR ( - ea.groupId IN (14) # Agatha Tweet Health Annotations - AND ea.entityId IN ( - 1242898721278324736, - 1230229436697473026, - 1230229470050603008 - ) - ) - OR ( - ea.groupId IN (10) - AND ea.entityId IN ( - 953701302608961536 - ) - ) - ) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_embeddings_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_embeddings_generation.sql deleted file mode 100644 index ffd14729c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_embeddings_generation.sql +++ /dev/null @@ -1,104 +0,0 @@ -with vars as ( - 
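-- Note on the tweet-age filter used further down: tweet IDs are Snowflake IDs, so
--   timestamp_millis(1288834974657 + ((tweetId & 9223372036850581504) >> 22))
-- recovers the tweet creation time without joining a tweet table:
-- 9223372036850581504 (0x7FFFFFFFFFC00000) masks the timestamp bits, >> 22 drops the
-- worker/sequence bits, and 1288834974657 is the Twitter Snowflake epoch offset in millis.
-- This is how noOlderTweetsThanDate is enforced on the fav events below.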
SELECT - UNIX_MILLIS("{QUERY_DATE}") AS currentTs, - TIMESTAMP("{START_TIME}") AS startTime, - TIMESTAMP("{END_TIME}") AS endTime, - {MIN_SCORE_THRESHOLD} AS tweetEmbeddingsMinClusterScore, - {HALF_LIFE} AS halfLife, - TIMESTAMP("{NO_OLDER_TWEETS_THAN_DATE}") AS noOlderTweetsThanDate -), - --- Get raw fav events -raw_favs AS ( - SELECT event.favorite.user_id AS userId, event.favorite.tweet_id AS tweetId, event.favorite.event_time_ms AS tsMillis, 1 AS favOrUnfav - FROM `twttr-bql-timeline-prod.timeline_service_favorites.timeline_service_favorites`, vars - WHERE (DATE(_PARTITIONTIME) = DATE(vars.startTime) OR DATE(_PARTITIONTIME) = DATE(vars.endTime)) AND - TIMESTAMP_MILLIS(event.favorite.event_time_ms) >= vars.startTime - AND TIMESTAMP_MILLIS(event.favorite.event_time_ms) <= vars.endTime - AND event.favorite IS NOT NULL -), - --- Get raw unfav events -raw_unfavs AS ( - SELECT event.unfavorite.user_id AS userId, event.unfavorite.tweet_id AS tweetId, event.unfavorite.event_time_ms AS tsMillis, -1 AS favOrUnfav - FROM `twttr-bql-timeline-prod.timeline_service_favorites.timeline_service_favorites`, vars - WHERE (DATE(_PARTITIONTIME) = DATE(vars.startTime) OR DATE(_PARTITIONTIME) = DATE(vars.endTime)) AND - TIMESTAMP_MILLIS(event.favorite.event_time_ms) >= vars.startTime - AND TIMESTAMP_MILLIS(event.favorite.event_time_ms) <= vars.endTime - AND event.unfavorite IS NOT NULL -), - --- Union fav and unfav events -favs_unioned AS ( - SELECT * FROM raw_favs - UNION ALL - SELECT * FROM raw_unfavs -), - --- Group by user and tweetId -user_tweet_fav_pairs AS ( - SELECT userId, tweetId, ARRAY_AGG(STRUCT(favOrUnfav, tsMillis) ORDER BY tsMillis DESC LIMIT 1) as details, count(*) as cnt - FROM favs_unioned - GROUP BY userId, tweetId -), - --- Remove unfav events -tweet_raw_favs_table AS ( - SELECT userId, tweetId, CAST(dt.tsMillis AS FLOAT64) AS tsMillis - FROM user_tweet_fav_pairs CROSS JOIN UNNEST(details) as dt - WHERE cnt < 3 AND dt.favOrUnfav = 1 -- cnt < 3 to remove crazy fav/unfav users -), - --- Get tweetIds that are eligible for tweet embeddings -tweet_favs_table AS ( - SELECT userId, tweet_raw_favs_table.tweetId, tsMillis - FROM tweet_raw_favs_table, vars - JOIN ( - SELECT tweetId, COUNT(DISTINCT(userId)) AS favCount - FROM tweet_raw_favs_table - GROUP BY tweetId - HAVING favCount >= 8 --we only generate tweet embeddings for tweets with >= 8 favs - ) eligible_tweets USING(tweetId) - -- Apply tweet age filter here - WHERE timestamp_millis((1288834974657 + ((tweet_raw_favs_table.tweetId & 9223372036850581504) >> 22))) >= vars.noOlderTweetsThanDate -), - --- Read consumer embeddings -consumer_embeddings AS ( - {CONSUMER_EMBEDDINGS_SQL} -), - --- Update tweet cluster scores based on fav events -tweet_cluster_scores AS ( - SELECT tweetId, - STRUCT( - clusterId, - CASE vars.halfLife - -- halfLife = -1 means there is no half life/decay and we directly take the sum as the score - WHEN -1 THEN SUM(clusterNormalizedLogFavScore) - ELSE SUM(clusterNormalizedLogFavScore * POW(0.5, (currentTs - tsMillis) / vars.halfLife)) - END AS clusterNormalizedLogFavScore, - COUNT(*) AS favCount) - AS clusterIdToScores - FROM tweet_favs_table, vars - JOIN consumer_embeddings USING(userId) - GROUP BY tweetId, clusterId, vars.halfLife -), - --- Generate tweet embeddings -tweet_embeddings_with_top_clusters AS ( - SELECT tweetId, ARRAY_AGG( - clusterIdToScores - ORDER BY clusterIdToScores.clusterNormalizedLogFavScore DESC - LIMIT {TWEET_EMBEDDING_LENGTH} - ) AS clusterIdToScores - FROM tweet_cluster_scores - GROUP BY tweetId -) 
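-- Worked example of the decay applied in tweet_cluster_scores above (assuming the default
-- 8-hour half life, HALF_LIFE = 28800000 ms): each fav contributes
--   clusterNormalizedLogFavScore * POW(0.5, (currentTs - tsMillis) / halfLife)
-- so a fav from 8 hours ago carries half its original weight and one from 16 hours ago a
-- quarter; halfLife = -1 switches the decay off and the raw scores are summed directly.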
- --- Return (tweetId, clusterId, tweetScore) pairs where tweetScore > tweetEmbeddingsMinClusterScore -SELECT tweetId, - clusterId, - clusterNormalizedLogFavScore AS tweetScore, clusterIdToScores -FROM tweet_embeddings_with_top_clusters, UNNEST(clusterIdToScores) AS clusterIdToScores, vars -WHERE clusterIdToScores.clusterNormalizedLogFavScore > vars.tweetEmbeddingsMinClusterScore diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_fav_count.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_fav_count.sql deleted file mode 100644 index 63b085937..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_fav_count.sql +++ /dev/null @@ -1,38 +0,0 @@ --- Calculate the fav counts for tweets within a given timeframe -with vars as ( - SELECT TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date -), - -favs_unioned AS ( - SELECT - userIdentifier.userId AS userId, - item.tweetInfo.actionTweetId AS tweetId, - eventMetadata.sourceTimestampMs AS tsMillis, - CASE - WHEN actionType = "ServerTweetFav" THEN 1 - WHEN actionType = "ServerTweetUnfav" THEN -1 - END AS favOrUnfav - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(vars.start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(vars.end_date) - AND userIdentifier.userId IS NOT NULL - AND (actionType = "ServerTweetFav" OR actionType = "ServerTweetUnfav") -), - -user_tweet_fav_pairs AS ( - SELECT userId, tweetId, ARRAY_AGG(STRUCT(favOrUnfav, tsMillis) ORDER BY tsMillis DESC LIMIT 1) as details, count(*) as cnt - FROM favs_unioned - GROUP BY userId, tweetId -), - -tweet_raw_favs_table AS ( - SELECT userId, tweetId, CAST(dt.tsMillis AS FLOAT64) AS tsMillis - FROM user_tweet_fav_pairs CROSS JOIN UNNEST(details) as dt - WHERE cnt < 3 AND dt.favOrUnfav = 1 -) - -SELECT tweetId, COUNT(DISTINCT(userId)) AS favCount -FROM tweet_raw_favs_table -GROUP BY tweetId diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweets_ann.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweets_ann.sql deleted file mode 100644 index f9eb10d2b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/tweets_ann.sql +++ /dev/null @@ -1,64 +0,0 @@ --- (step 1) Read consumer embeddings -WITH consumer_embeddings AS ( - {CONSUMER_EMBEDDINGS_SQL} -), --- (step 1) Read tweet embeddings -tweet_embeddings AS ( - {TWEET_EMBEDDINGS_SQL} -), --- (step 1) Compute tweet embeddings norms (we will use this to compute cosine sims later) -tweet_embeddings_norm AS ( - SELECT tweetId, SUM(tweetScore * tweetScore) AS norm - FROM tweet_embeddings - GROUP BY tweetId - HAVING norm > 0.0 -), --- (step 2) Get top N clusters for each consumer embedding. N = 25 in prod -consumer_embeddings_top_n_clusters AS ( - SELECT userId, ARRAY_AGG(STRUCT(clusterId, userScore) ORDER BY userScore DESC LIMIT {TOP_N_CLUSTER_PER_SOURCE_EMBEDDING}) AS topClustersWithScores - FROM consumer_embeddings - GROUP BY userId -), --- (step 2) Get top M tweets for each cluster id. 
M = 100 in prod -clusters_top_m_tweets AS ( - SELECT clusterId, ARRAY_AGG(STRUCT(tweetId, tweetScore) ORDER BY tweetScore DESC LIMIT {TOP_M_TWEETS_PER_CLUSTER}) AS tweets - FROM tweet_embeddings - GROUP BY clusterId -), --- (step 3) Join the results, get top M * N tweets for each user -user_top_mn_tweets AS ( - SELECT userId, consumer_embedding_cluster_score_pairs.userScore AS userScore, clusters_top_m_tweets.clusterId AS clusterId, clusters_top_m_tweets.tweets AS tweets - FROM ( - SELECT userId, clusterId, userScore - FROM consumer_embeddings_top_n_clusters, UNNEST(topClustersWithScores) - ) AS consumer_embedding_cluster_score_pairs - JOIN clusters_top_m_tweets ON consumer_embedding_cluster_score_pairs.clusterId = clusters_top_m_tweets.clusterId -), --- (step 4) Compute the dot product between each user and tweet embedding pair -user_tweet_embedding_dot_product AS ( - SELECT userId, - tweetId, - SUM(userScore * tweetScore) AS dotProductScore - FROM user_top_mn_tweets, UNNEST(tweets) AS tweets - GROUP BY userId, tweetId -), --- (step 5) Compute similarity scores: dot product, cosine sim, log-cosine sim -user_tweet_embedding_similarity_scores AS ( - SELECT userId, - user_tweet_embedding_dot_product.tweetId AS tweetId, - dotProductScore, - SAFE_DIVIDE(dotProductScore, SQRT(tweet_embeddings_norm.norm)) AS cosineSimilarityScore, - SAFE_DIVIDE(dotProductScore, LN(1+tweet_embeddings_norm.norm)) AS logCosineSimilarityScore, - FROM user_tweet_embedding_dot_product - JOIN tweet_embeddings_norm ON user_tweet_embedding_dot_product.tweetId = tweet_embeddings_norm.tweetId -), --- (step 6) Get final top K tweets per user. K = 150 in prod -results AS ( - SELECT userId, ARRAY_AGG(STRUCT(tweetId, dotProductScore, cosineSimilarityScore, logCosineSimilarityScore) - ORDER BY logCosineSimilarityScore DESC LIMIT {TOP_K_TWEETS_PER_USER_REQUEST}) AS tweets - FROM user_tweet_embedding_similarity_scores - GROUP BY userId -) - -SELECT * -FROM results diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/unified_user_tweet_action_pair_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/unified_user_tweet_action_pair_generation.sql deleted file mode 100644 index ad2e1d7bd..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/unified_user_tweet_action_pair_generation.sql +++ /dev/null @@ -1,45 +0,0 @@ -WITH - vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date, - TIMESTAMP("{NO_OLDER_TWEETS_THAN_DATE}") AS no_older_tweets_than_date - ), - - -- Get raw user-tweet interaction events from UUA - interactions_unioned AS ( - SELECT - userIdentifier.userId AS userId, - eventMetadata.sourceTimestampMs AS tsMillis, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN {CONTRIBUTING_ACTION_TWEET_ID_COLUMN} - WHEN actionType IN ({UNDO_ACTION_TYPES_STR}) THEN {UNDO_ACTION_TWEET_ID_COLUMN} - END AS tweetId, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN 1 - WHEN actionType IN ({UNDO_ACTION_TYPES_STR}) THEN -1 - END AS doOrUndo - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(vars.start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(vars.end_date) - AND (actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) - OR actionType IN ({UNDO_ACTION_TYPES_STR})) - ), - - -- Group by 
userId and tweetId - user_tweet_interaction_pairs AS ( - SELECT userId, tweetId, ARRAY_AGG(STRUCT(doOrUndo, tsMillis) ORDER BY tsMillis DESC LIMIT 1) AS details, COUNT(*) AS cnt - FROM interactions_unioned - GROUP BY userId, tweetId - ) - --- Remove undo events --- Apply age filter in this step -SELECT userId, tweetId, CAST(dt.tsMillis AS FLOAT64) AS tsMillis -FROM user_tweet_interaction_pairs, vars -CROSS JOIN UNNEST(details) AS dt -WHERE cnt < 3 - AND dt.doOrUndo = 1 - AND timestamp_millis((1288834974657 + - ((tweetId & 9223372036850581504) >> 22))) >= vars.no_older_tweets_than_date diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/user_video_tweet_fav_engagement_generation.sql b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/user_video_tweet_fav_engagement_generation.sql deleted file mode 100644 index 56b0f73a8..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql/user_video_tweet_fav_engagement_generation.sql +++ /dev/null @@ -1,69 +0,0 @@ -WITH - vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date, - ), - - -- Get raw user-tweet interaction events from UUA (We will use fav engagements here) - raw_engagements AS ( - SELECT - userIdentifier.userId AS userId, - eventMetadata.sourceTimestampMs AS tsMillis, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN {CONTRIBUTING_ACTION_TWEET_ID_COLUMN} - WHEN actionType IN ({UNDO_ACTION_TYPES_STR}) THEN {UNDO_ACTION_TWEET_ID_COLUMN} - END AS tweetId, - CASE - WHEN actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) THEN 1 - WHEN actionType IN ({UNDO_ACTION_TYPES_STR}) THEN -1 - END AS doOrUndo - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(vars.start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(vars.end_date) - AND (actionType IN ({CONTRIBUTING_ACTION_TYPES_STR}) - OR actionType IN ({UNDO_ACTION_TYPES_STR})) - ), - - -- Get video tweet ids - video_tweet_ids AS ( - WITH vars AS ( - SELECT - TIMESTAMP("{START_TIME}") AS start_date, - TIMESTAMP("{END_TIME}") AS end_date - ), - - -- Get raw user-tweet interaction events from UUA - video_view_engagements AS ( - SELECT item.tweetInfo.actionTweetId AS tweetId - FROM `twttr-bql-unified-prod.unified_user_actions_engagements.streaming_unified_user_actions_engagements`, vars - WHERE (DATE(dateHour) >= DATE(vars.start_date) AND DATE(dateHour) <= DATE(vars.end_date)) - AND eventMetadata.sourceTimestampMs >= UNIX_MILLIS(start_date) - AND eventMetadata.sourceTimestampMs <= UNIX_MILLIS(end_date) - AND (actionType IN ("ClientTweetVideoPlayback50") - OR actionType IN ("ClientTweetVideoPlayback95")) - ) - - SELECT DISTINCT(tweetId) - FROM video_view_engagements - ), - - -- Join video tweet ids - video_tweets_engagements AS ( - SELECT raw_engagements.* - FROM raw_engagements JOIN video_tweet_ids USING(tweetId) - ), - - -- Group by userId and tweetId - user_tweet_engagement_pairs AS ( - SELECT userId, tweetId, ARRAY_AGG(STRUCT(doOrUndo, tsMillis) ORDER BY tsMillis DESC LIMIT 1) AS details, COUNT(*) AS cnt - FROM video_tweets_engagements - GROUP BY userId, tweetId - ) - --- Remove undo events -SELECT userId, tweetId, CAST(dt.tsMillis AS FLOAT64) AS tsMillis -FROM user_tweet_engagement_pairs, vars -CROSS JOIN UNNEST(details) AS dt -WHERE dt.doOrUndo = 1 diff --git 
a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/BUILD b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/BUILD deleted file mode 100644 index 43135fdf9..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/BUILD +++ /dev/null @@ -1,110 +0,0 @@ -scala_library( - name = "bq_generation", - sources = [ - "**/*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/job", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_interested_in_20M_145K_2020-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_0_EL_15-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_15-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_50-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_8_EL_50-scala", - "src/scala/com/twitter/simclusters_v2/hdfs_sources:offline_tweet_recommendations_from_mts_consumer_embeddings-scala", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/common", - "src/scala/com/twitter/simclusters_v2/scio/bq_generation/sql", - "src/scala/com/twitter/wtf/beam/bq_embedding_export:bq_embedding_export_lib", - "tcdc/bq_blaster/src/main/scala/com/twitter/tcdc/bqblaster/beam", - ], -) - -jvm_binary( - name = "iikf-tweets-ann-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020TweetsANNBQAdhocJob", - platform = "java8", - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-hl-8-el-50-tweets-ann-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020Hl8El50TweetsANNBQAdhocJob", - platform = "java8", - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020TweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-hl-0-el-15-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020Hl0El15TweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-hl-2-el-15-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020Hl2El15TweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-hl-2-el-50-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020Hl2El50TweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "iikf-hl-8-el-50-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.IIKF2020Hl8El50TweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = 
"mts-consumer-embeddings-tweets-ann-adhoc-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.MTSConsumerEmbeddingsTweetsANNBQAdhocJob", - platform = "java8", - dependencies = [ - ":bq_generation", - ], -) - -jvm_binary( - name = "mts-consumer-embeddings-tweets-ann-batch-job", - main = "com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.MTSConsumerEmbeddingsTweetsANNBQBatchJob", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":bq_generation", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/Config.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/Config.scala deleted file mode 100644 index 9046768bb..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/Config.scala +++ /dev/null @@ -1,33 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation.tweets_ann - -object Config { - /* - * Common root path - */ - val RootMHPath: String = "manhattan_sequence_files/offline_sann/" - val RootThriftPath: String = "processed/offline_sann/" - val AdhocRootPath = "adhoc/offline_sann/" - - /* - * Variables for MH output path - */ - val IIKFANNOutputPath: String = "tweets_ann/iikf" - val IIKFHL0EL15ANNOutputPath: String = "tweets_ann/iikf_hl_0_el_15" - val IIKFHL2EL15ANNOutputPath: String = "tweets_ann/iikf_hl_2_el_15" - val IIKFHL2EL50ANNOutputPath: String = "tweets_ann/iikf_hl_2_el_50" - val IIKFHL8EL50ANNOutputPath: String = "tweets_ann/iikf_hl_8_el_50" - val MTSConsumerEmbeddingsANNOutputPath: String = "tweets_ann/mts_consumer_embeddings" - - /* - * Variables for tweet embeddings generation - */ - val SimClustersTweetEmbeddingsGenerationHalfLife: Int = 28800000 // 8hrs in ms - val SimClustersTweetEmbeddingsGenerationEmbeddingLength: Int = 15 - - /* - * Variables for ANN - */ - val SimClustersANNTopNClustersPerSourceEmbedding: Int = 20 - val SimClustersANNTopMTweetsPerCluster: Int = 50 - val SimClustersANNTopKTweetsPerUserRequest: Int = 200 -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/README b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/README deleted file mode 100644 index 7947963af..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/README +++ /dev/null @@ -1,95 +0,0 @@ -To run iikf-tweets-ann-adhoc-job (adhoc): -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/iikf-tweets-ann-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-adhoc-job.d6w \ - --jar dist/iikf-tweets-ann-adhoc-job.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-03-28" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-tweets-ann-adhoc-job" --ignore-existing - -To run iikf-hl-8-el-50-tweets-ann-adhoc-job (adhoc): -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/iikf-hl-8-el-50-tweets-ann-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-adhoc-job.d6w \ - --jar dist/iikf-hl-8-el-50-tweets-ann-adhoc-job.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-03-28" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-hl-8-el-50-tweets-ann-adhoc-job" --ignore-existing - -To run mts-consumer-embeddings-tweets-ann-adhoc-job (adhoc) -bin/d6w create \ - 
${GCP_PROJECT_NAME}/us-central1/mts-consumer-embeddings-tweets-ann-adhoc-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-adhoc-job.d6w \ - --jar dist/mts-consumer-embeddings-tweets-ann-adhoc-job.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=your_ldap \ - --bind=profile.date="2022-03-28" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="mts-consumer-embeddings-tweets-ann-adhoc-job" --ignore-existing - - -To schedule iikf-tweets-ann-batch-job (batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/iikf-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-tweets-ann-batch-job" - -To schedule iikf-hl-0-el-15-tweets-ann-batch-job (batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/iikf-hl-0-el-15-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-0-el-15-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-hl-0-el-15-tweets-ann-batch-job" - -To schedule iikf-hl-2-el-15-tweets-ann-batch-job (batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/iikf-hl-2-el-15-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-15-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-hl-2-el-15-tweets-ann-batch-job" - -To schedule iikf-hl-2-el-50-tweets-ann-batch-job (batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/iikf-hl-2-el-50-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-50-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-hl-2-el-50-tweets-ann-batch-job" - -To schedule iikf-hl-8-el-50-tweets-ann-batch-job (batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/iikf-hl-8-el-50-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="iikf-hl-8-el-50-tweets-ann-batch-job" - -To schedule mts-consumer-embeddings-tweets-ann-batch-job(batch) -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/mts-consumer-embeddings-tweets-ann-batch-job \ - src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-batch-job.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=cassowary \ - --bind=profile.date="2022-03-26" \ - --bind=profile.machine="n2-highmem-4" \ - --bind=profile.job_name="mts-consumer-embeddings-tweets-ann-batch-job" - - diff --git 
a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNFromBQ.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNFromBQ.scala deleted file mode 100644 index 23663ab9a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNFromBQ.scala +++ /dev/null @@ -1,120 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation -package tweets_ann - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.simclusters_v2.thriftscala.CandidateTweet -import com.twitter.wtf.beam.bq_embedding_export.BQQueryUtils -import org.apache.avro.generic.GenericData -import org.apache.avro.generic.GenericRecord -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord -import org.apache.beam.sdk.transforms.SerializableFunction -import org.joda.time.DateTime -import scala.collection.mutable.ListBuffer - -object TweetsANNFromBQ { - // Default ANN config variables - val topNClustersPerSourceEmbedding = Config.SimClustersANNTopNClustersPerSourceEmbedding - val topMTweetsPerCluster = Config.SimClustersANNTopMTweetsPerCluster - val topKTweetsPerUserRequest = Config.SimClustersANNTopKTweetsPerUserRequest - - // SQL file paths - val tweetsANNSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/tweets_ann.sql" - val tweetsEmbeddingGenerationSQLPath = - s"/com/twitter/simclusters_v2/scio/bq_generation/sql/tweet_embeddings_generation.sql" - - // Function that parses the GenericRecord results we read from BQ - val parseUserToTweetRecommendationsFunc = - new SerializableFunction[SchemaAndRecord, UserToTweetRecommendations] { - override def apply(record: SchemaAndRecord): UserToTweetRecommendations = { - val genericRecord: GenericRecord = record.getRecord() - UserToTweetRecommendations( - userId = genericRecord.get("userId").toString.toLong, - tweetCandidates = parseTweetIdColumn(genericRecord, "tweets"), - ) - } - } - - // Parse tweetId candidates column - def parseTweetIdColumn( - genericRecord: GenericRecord, - columnName: String - ): List[CandidateTweet] = { - val tweetIds: GenericData.Array[GenericRecord] = - genericRecord.get(columnName).asInstanceOf[GenericData.Array[GenericRecord]] - val results: ListBuffer[CandidateTweet] = new ListBuffer[CandidateTweet]() - tweetIds.forEach((sc: GenericRecord) => { - results += CandidateTweet( - tweetId = sc.get("tweetId").toString.toLong, - score = Some(sc.get("logCosineSimilarityScore").toString.toDouble) - ) - }) - results.toList - } - - def getTweetEmbeddingsSQL( - queryDate: DateTime, - consumerEmbeddingsSQL: String, - tweetEmbeddingsSQLPath: String, - tweetEmbeddingsHalfLife: Int, - tweetEmbeddingsLength: Int - ): String = { - // We read one day of fav events to construct our tweet embeddings - val templateVariables = - Map( - "CONSUMER_EMBEDDINGS_SQL" -> consumerEmbeddingsSQL, - "QUERY_DATE" -> queryDate.toString(), - "START_TIME" -> queryDate.minusDays(1).toString(), - "END_TIME" -> queryDate.toString(), - "MIN_SCORE_THRESHOLD" -> 0.0.toString, - "HALF_LIFE" -> tweetEmbeddingsHalfLife.toString, - "TWEET_EMBEDDING_LENGTH" -> tweetEmbeddingsLength.toString, - "NO_OLDER_TWEETS_THAN_DATE" -> queryDate.minusDays(1).toString(), - ) - BQQueryUtils.getBQQueryFromSqlFile(tweetEmbeddingsSQLPath, templateVariables) - } - - def getTweetRecommendationsBQ( - sc: ScioContext, - queryTimestamp: DateTime, - consumerEmbeddingsSQL: String, - tweetEmbeddingsHalfLife: Int, - 
tweetEmbeddingsLength: Int - ): SCollection[UserToTweetRecommendations] = { - // Get the tweet embeddings SQL string based on the provided consumerEmbeddingsSQL - val tweetEmbeddingsSQL = - getTweetEmbeddingsSQL( - queryTimestamp, - consumerEmbeddingsSQL, - tweetsEmbeddingGenerationSQLPath, - tweetEmbeddingsHalfLife, - tweetEmbeddingsLength - ) - - // Define template variables which we would like to be replaced in the corresponding sql file - val templateVariables = - Map( - "CONSUMER_EMBEDDINGS_SQL" -> consumerEmbeddingsSQL, - "TWEET_EMBEDDINGS_SQL" -> tweetEmbeddingsSQL, - "TOP_N_CLUSTER_PER_SOURCE_EMBEDDING" -> topNClustersPerSourceEmbedding.toString, - "TOP_M_TWEETS_PER_CLUSTER" -> topMTweetsPerCluster.toString, - "TOP_K_TWEETS_PER_USER_REQUEST" -> topKTweetsPerUserRequest.toString - ) - val query = BQQueryUtils.getBQQueryFromSqlFile(tweetsANNSQLPath, templateVariables) - - // Run SimClusters ANN on BQ and parse the results - sc.customInput( - s"SimClusters BQ ANN", - BigQueryIO - .read(parseUserToTweetRecommendationsFunc) - .fromQuery(query) - .usingStandardSql() - ) - } - - case class UserToTweetRecommendations( - userId: Long, - tweetCandidates: List[CandidateTweet]) -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNJob.scala b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNJob.scala deleted file mode 100644 index 81a89f3ff..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/TweetsANNJob.scala +++ /dev/null @@ -1,297 +0,0 @@ -package com.twitter.simclusters_v2.scio.bq_generation -package tweets_ann - -import com.google.api.services.bigquery.model.TimePartitioning -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.DateRangeOptions -import com.twitter.conversions.DurationOps.richDurationFromInt -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getMTSConsumerEmbeddingsFav90P20MSQL -import com.twitter.simclusters_v2.scio.bq_generation.common.BQGenerationUtil.getInterestedIn2020SQL -import com.twitter.simclusters_v2.scio.bq_generation.tweets_ann.TweetsANNFromBQ.getTweetRecommendationsBQ -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromInterestedIn20M145K2020ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl0El15ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl2El15ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl2El50ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl8El50ScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.OfflineTweetRecommendationsFromMtsConsumerEmbeddingsScalaDataset -import com.twitter.simclusters_v2.scio.bq_generation.common.BQTableDetails -import com.twitter.simclusters_v2.thriftscala.CandidateTweets -import com.twitter.simclusters_v2.thriftscala.CandidateTweetsList -import 
com.twitter.tcdc.bqblaster.beam.syntax.BigQueryIOHelpers -import com.twitter.tcdc.bqblaster.beam.BQBlasterIO.AvroConverter -import com.twitter.tcdc.bqblaster.core.avro.TypedProjection -import com.twitter.tcdc.bqblaster.core.transform.RootTransform -import java.time.Instant -import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO -import org.joda.time.DateTime - -trait TweetsANNJob extends ScioBeamJob[DateRangeOptions] { - // Configs to set for different type of embeddings and jobs - val isAdhoc: Boolean - val getConsumerEmbeddingsSQLFunc: (DateTime, Int) => String - val outputTable: BQTableDetails - val keyValDatasetOutputPath: String - val tweetRecommentationsSnapshotDataset: KeyValDALDataset[KeyVal[Long, CandidateTweetsList]] - val tweetEmbeddingsGenerationHalfLife: Int = Config.SimClustersTweetEmbeddingsGenerationHalfLife - val tweetEmbeddingsGenerationEmbeddingLength: Int = - Config.SimClustersTweetEmbeddingsGenerationEmbeddingLength - - // Base configs - val projectId = "twttr-recos-ml-prod" - val environment: DAL.Env = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - // The time when the job is scheduled - val queryTimestamp = opts.interval.getEnd - - // Read consumer embeddings SQL - val consumerEmbeddingsSQL = getConsumerEmbeddingsSQLFunc(queryTimestamp, 14) - - // Generate tweet embeddings and tweet ANN results - val tweetRecommendations = - getTweetRecommendationsBQ( - sc, - queryTimestamp, - consumerEmbeddingsSQL, - tweetEmbeddingsGenerationHalfLife, - tweetEmbeddingsGenerationEmbeddingLength - ) - - // Setup BQ writer - val ingestionTime = opts.getDate().value.getEnd.toDate - val bqFieldsTransform = RootTransform - .Builder() - .withPrependedFields("ingestionTime" -> TypedProjection.fromConstant(ingestionTime)) - val timePartitioning = new TimePartitioning() - .setType("HOUR").setField("ingestionTime").setExpirationMs(3.days.inMilliseconds) - val bqWriter = BigQueryIO - .write[CandidateTweets] - .to(outputTable.toString) - .withExtendedErrorInfo() - .withTimePartitioning(timePartitioning) - .withLoadJobProjectId(projectId) - .withThriftSupport(bqFieldsTransform.build(), AvroConverter.Legacy) - .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) - .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) - - // Save Tweet ANN results to BQ - tweetRecommendations - .map { userToTweetRecommendations => - { - CandidateTweets( - targetUserId = userToTweetRecommendations.userId, - recommendedTweets = userToTweetRecommendations.tweetCandidates) - } - } - .saveAsCustomOutput(s"WriteToBQTable - ${outputTable}", bqWriter) - - // Save Tweet ANN results as KeyValSnapshotDataset - tweetRecommendations - .map { userToTweetRecommendations => - KeyVal( - userToTweetRecommendations.userId, - CandidateTweetsList(userToTweetRecommendations.tweetCandidates)) - }.saveAsCustomOutput( - name = "WriteTweetRecommendationsToKeyValDataset", - DAL.writeVersionedKeyVal( - tweetRecommentationsSnapshotDataset, - PathLayout.VersionedPath(prefix = - ((if (!isAdhoc) - Config.RootMHPath - else - Config.AdhocRootPath) - + keyValDatasetOutputPath)), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = environment, - ) - ) - } - -} - -/** - * Scio job for adhoc run for tweet recommendations from IIKF 2020 - */ -object 
IIKF2020TweetsANNBQAdhocJob extends TweetsANNJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-recos-ml-prod", - "multi_type_simclusters", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_adhoc") - override val keyValDatasetOutputPath = Config.IIKFANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020ScalaDataset -} - -/** - * Scio job for adhoc run for tweet recommendations from IIKF 2020 with - * - Half life = 8hrs - * - Embedding Length = 50 - */ -object IIKF2020Hl8El50TweetsANNBQAdhocJob extends TweetsANNJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-recos-ml-prod", - "multi_type_simclusters", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_8_EL_50_adhoc") - override val keyValDatasetOutputPath = Config.IIKFHL8EL50ANNOutputPath - override val tweetEmbeddingsGenerationEmbeddingLength: Int = 50 - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = { - OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl8El50ScalaDataset - } -} - -/** - * Scio job for adhoc run for tweet recommendations from MTS Consumer Embeddings - */ -object MTSConsumerEmbeddingsTweetsANNBQAdhocJob extends TweetsANNJob { - override val isAdhoc = true - override val getConsumerEmbeddingsSQLFunc = getMTSConsumerEmbeddingsFav90P20MSQL - override val outputTable = BQTableDetails( - "twttr-recos-ml-prod", - "multi_type_simclusters", - "offline_tweet_recommendations_from_mts_consumer_embeddings_adhoc") - override val keyValDatasetOutputPath = Config.MTSConsumerEmbeddingsANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromMtsConsumerEmbeddingsScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from IIKF 2020 -The schedule cmd needs to be run only if there is any change in the config - */ -object IIKF2020TweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020") - override val keyValDatasetOutputPath = Config.IIKFANNOutputPath - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020ScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from IIKF 2020 with parameter setup: - - Half Life: None, no decay, direct sum - - Embedding Length: 15 -The schedule cmd needs to be run only if there is any change in the config - */ -object IIKF2020Hl0El15TweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_0_EL_15") - override val keyValDatasetOutputPath = Config.IIKFHL0EL15ANNOutputPath - override val tweetEmbeddingsGenerationHalfLife: Int = -1 - override val 
tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl0El15ScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from IIKF 2020 with parameter setup: - - Half Life: 2hrs - - Embedding Length: 15 -The schedule cmd needs to be run only if there is any change in the config - */ -object IIKF2020Hl2El15TweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_15") - override val keyValDatasetOutputPath = Config.IIKFHL2EL15ANNOutputPath - override val tweetEmbeddingsGenerationHalfLife: Int = 7200000 // 2hrs in ms - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl2El15ScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from IIKF 2020 with parameter setup: - - Half Life: 2hrs - - Embedding Length: 50 -The schedule cmd needs to be run only if there is any change in the config - */ -object IIKF2020Hl2El50TweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_2_EL_50") - override val keyValDatasetOutputPath = Config.IIKFHL2EL50ANNOutputPath - override val tweetEmbeddingsGenerationHalfLife: Int = 7200000 // 2hrs in ms - override val tweetEmbeddingsGenerationEmbeddingLength: Int = 50 - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl2El50ScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from IIKF 2020 with parameter setup: - - Half Life: 8hrs - - Embedding Length: 50 -The schedule cmd needs to be run only if there is any change in the config - */ -object IIKF2020Hl8El50TweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getInterestedIn2020SQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_interested_in_20M_145K_2020_HL_8_EL_50") - override val keyValDatasetOutputPath = Config.IIKFHL8EL50ANNOutputPath - override val tweetEmbeddingsGenerationEmbeddingLength: Int = 50 - override val tweetRecommentationsSnapshotDataset: KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromInterestedIn20M145K2020Hl8El50ScalaDataset -} - -/** -Scio job for batch run for tweet recommendations from MTS Consumer Embeddings -The schedule cmd needs to be run only if there is any change in the config - */ -object MTSConsumerEmbeddingsTweetsANNBQBatchJob extends TweetsANNJob { - override val isAdhoc = false - override val getConsumerEmbeddingsSQLFunc = getMTSConsumerEmbeddingsFav90P20MSQL - override val outputTable = BQTableDetails( - "twttr-bq-cassowary-prod", - "user", - "offline_tweet_recommendations_from_mts_consumer_embeddings") - override val keyValDatasetOutputPath = Config.MTSConsumerEmbeddingsANNOutputPath - override val tweetRecommentationsSnapshotDataset: 
KeyValDALDataset[ - KeyVal[Long, CandidateTweetsList] - ] = - OfflineTweetRecommendationsFromMtsConsumerEmbeddingsScalaDataset -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-0-el-15-tweets-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-0-el-15-tweets-ann-batch-job.d6w deleted file mode 100644 index b86af2653..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-0-el-15-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-hl-0-el-15-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-hl-0-el-15-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-15-tweets-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-15-tweets-ann-batch-job.d6w deleted file mode 100644 index 55a9b5382..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-15-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-hl-2-el-15-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-hl-2-el-15-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git 
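The IIKF jobs above tune tweet-embedding generation with `tweetEmbeddingsGenerationHalfLife` (milliseconds; 7200000 for 2 hours, -1 for "no decay, direct sum") and `tweetEmbeddingsGenerationEmbeddingLength` (15 or 50). The decay itself is applied by code outside this diff; the sketch below only illustrates the exponential half-life weighting those settings suggest, with hypothetical names:

```scala
// Illustrative sketch only: how a half-life in milliseconds could translate an
// engagement's age into a decay weight. halfLifeMs <= 0 mirrors the
// "no decay, direct sum" configuration used by the HL_0 jobs above.
object HalfLifeDecaySketch {
  def decayWeight(ageMs: Long, halfLifeMs: Long): Double =
    if (halfLifeMs <= 0) 1.0
    else math.pow(0.5, ageMs.toDouble / halfLifeMs.toDouble)

  def main(args: Array[String]): Unit = {
    val twoHoursMs = 7200000L
    println(decayWeight(0L, twoHoursMs))          // 1.0 - fresh engagement keeps full weight
    println(decayWeight(twoHoursMs, twoHoursMs))  // 0.5 - one half-life old
    println(decayWeight(6L * 3600000L, -1L))      // 1.0 - decay disabled
  }
}
```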
a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-50-tweets-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-50-tweets-ann-batch-job.d6w deleted file mode 100644 index 6fdd1c2f2..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-2-el-50-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-hl-2-el-50-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-hl-2-el-50-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-adhoc-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-adhoc-job.d6w deleted file mode 100644 index beb0dbc93..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-adhoc-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-hl-8-el-50-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-hl-8-el-50-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-batch-job.d6w 
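The `batch_width`, `first_time`, and `timeout` fields in these d6w profiles are ISO-8601 duration strings (`PT4H` and `PT24H` here; the weekly assemble-multi-type-graph batch config later in this change uses `P1W`). A quick JDK-only way to check what such a string denotes:

```scala
import java.time.{Duration, Period}

// Sanity-checking the ISO-8601 duration strings used by the d6w configs above.
object DurationStringSketch {
  def main(args: Array[String]): Unit = {
    println(Duration.parse("PT4H").toHours)  // 4  - batch_width
    println(Duration.parse("PT24H").toHours) // 24 - timeout
    println(Period.parse("P1W").getDays)     // 7  - week-based widths parse as a Period
  }
}
```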
b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-batch-job.d6w deleted file mode 100644 index beb0dbc93..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-hl-8-el-50-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-hl-8-el-50-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-hl-8-el-50-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-adhoc-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-adhoc-job.d6w deleted file mode 100644 index 6cc067816..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-adhoc-job.d6w +++ /dev/null @@ -1,34 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-tweets-ann-adhoc-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-tweets-ann-adhoc-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT2H', - first_time='{{profile.date}}', - ), - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-batch-job.d6w deleted file mode 100644 index 065a83eec..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/iikf-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class 
Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'iikf-tweets-ann-batch-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:iikf-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-adhoc-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-adhoc-job.d6w deleted file mode 100644 index c7f921708..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-adhoc-job.d6w +++ /dev/null @@ -1,34 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - job_name = Default(String, 'mts-consumer-embeddings-tweets-ann-adhoc-job') - machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:mts-consumer-embeddings-tweets-ann-adhoc-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT2H', - first_time='{{profile.date}}', - ), - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-batch-job.d6w b/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-batch-job.d6w deleted file mode 100644 index d87e68e9f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann/mts-consumer-embeddings-tweets-ann-batch-job.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'prod') - job_name = Default(String, 'mts-consumer-embeddings-tweets-ann-batch-job') - 
machine = Default(String, 'n2-highmem-4') - -job = Job( - name='{{profile.job_name}}', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "date": '{{profile.date}}' - }, - service_identifier='twtr:svc:{{profile.user_name}}:{{profile.job_name}}:{{profile.environment}}:{{profile.cluster}}', - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - environment='prod', - build_target='src/scala/com/twitter/simclusters_v2/scio/bq_generation/tweets_ann:mts-consumer-embeddings-tweets-ann-batch-job', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT4H', - first_time='{{profile.date}}', - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT24H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/common/BUILD b/src/scala/com/twitter/simclusters_v2/scio/common/BUILD deleted file mode 100644 index 1ad664680..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/common/BUILD +++ /dev/null @@ -1,21 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "flockdb-tools/datasets/flock:flock-blocks-edges-scala", - "flockdb-tools/datasets/flock:flock-follows-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-abuse-edges-scala", - "flockdb-tools/datasets/flock:flock-report-as-spam-edges-scala", - "iesource/processing/events/src/main/scala/com/twitter/iesource/processing/events/batch:server_engagements-scala", - "src/scala/com/twitter/simclusters_v2/scalding", - "src/thrift/com/twitter/twadoop/user/gen:gen-scala", - "tweetsource/public_tweets/src/main/scala/com/twitter/tweetsource/public_tweets:public_tweets-scala", - "usersource/snapshot/src/main/scala/com/twitter/usersource/snapshot/flat:usersource_flat-scala", - "usersource/snapshot/src/main/thrift/com/twitter/usersource/snapshot/flat:flat-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/common/ExternalDataSources.scala b/src/scala/com/twitter/simclusters_v2/scio/common/ExternalDataSources.scala deleted file mode 100644 index ed9e1aa2d..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/common/ExternalDataSources.scala +++ /dev/null @@ -1,301 +0,0 @@ -package com.twitter.simclusters_v2.scio.common - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.common.util.Clock -import com.twitter.common_header.thriftscala.CommonHeader -import com.twitter.common_header.thriftscala.IdType -import com.twitter.common_header.thriftscala.VersionedCommonHeader -import com.twitter.frigate.data_pipeline.magicrecs.magicrecs_notifications_lite.thriftscala.MagicRecsNotificationLite -import com.twitter.frigate.data_pipeline.scalding.magicrecs.magicrecs_notification_lite.MagicrecsNotificationLite1DayLagScalaDataset -import com.twitter.iesource.thriftscala.InteractionEvent -import com.twitter.iesource.thriftscala.InteractionTargetType -import 
com.twitter.interests_ds.jobs.interests_service.UserTopicRelationSnapshotScalaDataset -import com.twitter.interests.thriftscala.InterestRelationType -import com.twitter.interests.thriftscala.UserInterestsRelationSnapshot -import com.twitter.penguin.scalding.datasets.PenguinUserLanguagesScalaDataset -import com.twitter.search.adaptive.scribing.thriftscala.AdaptiveSearchScribeLog -import com.twitter.simclusters_v2.hdfs_sources.UserUserFavGraphScalaDataset -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources.ValidFlockEdgeStateId -import com.twitter.simclusters_v2.scalding.embedding.common.ExternalDataSources.getStandardLanguageCode -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser -import flockdb_tools.datasets.flock.FlockBlocksEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockFollowsEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsAbuseEdgesScalaDataset -import flockdb_tools.datasets.flock.FlockReportAsSpamEdgesScalaDataset -import org.joda.time.Interval -import com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights -import com.twitter.usersource.snapshot.combined.UsersourceScalaDataset -import com.twitter.usersource.snapshot.flat.UsersourceFlatScalaDataset -import com.twitter.util.Duration -import twadoop_config.configuration.log_categories.group.search.AdaptiveSearchScalaDataset - -object ExternalDataSources { - def userSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[CombinedUser] = { - sc.customInput( - "ReadUserSource", - DAL - .readMostRecentSnapshotNoOlderThan( - UsersourceScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ) - } - - def userCountrySource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, String)] = { - sc.customInput( - "ReadUserCountrySource", - DAL - .readMostRecentSnapshotNoOlderThan( - UsersourceFlatScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod, - ) - ).flatMap { flatUser => - for { - userId <- flatUser.id - country <- flatUser.accountCountryCode - } yield { - (userId, country.toUpperCase) - } - }.distinct - } - - def userUserFavSource( - noOlderThan: Duration = Duration.fromDays(14) - )( - implicit sc: ScioContext - ): SCollection[EdgeWithDecayedWeights] = { - sc.customInput( - "ReadUserUserFavSource", - DAL - .readMostRecentSnapshotNoOlderThan( - UserUserFavGraphScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ) - } - - def inferredUserConsumedLanguageSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Seq[(String, Double)])] = { - sc.customInput( - "ReadInferredUserConsumedLanguageSource", - DAL - .readMostRecentSnapshotNoOlderThan( - PenguinUserLanguagesScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ).map { kv => - val consumed = kv.value.consumed - .collect { - case scoredString if scoredString.weight > 0.001 => //throw away 5% outliers - (getStandardLanguageCode(scoredString.item), scoredString.weight) - }.collect { - case (Some(language), score) => (language, score) - } - (kv.key, consumed) - } - } - - def flockBlockSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Long)] = { - sc.customInput( - "ReadFlockBlock", - DAL.readMostRecentSnapshotNoOlderThan( - FlockBlocksEdgesScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - 
DAL.Environment.Prod)) - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockFollowSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Long)] = { - sc.customInput( - "ReadFlockFollow", - DAL - .readMostRecentSnapshotNoOlderThan( - FlockFollowsEdgesScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod)) - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockReportAsAbuseSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Long)] = { - sc.customInput( - "ReadFlockReportAsAbuseJava", - DAL - .readMostRecentSnapshotNoOlderThan( - FlockReportAsAbuseEdgesScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod) - ) - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def flockReportAsSpamSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Long)] = { - sc.customInput( - "ReadFlockReportAsSpam", - DAL - .readMostRecentSnapshotNoOlderThan( - FlockReportAsSpamEdgesScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod)) - .collect { - case edge if edge.state == ValidFlockEdgeStateId => - (edge.sourceId, edge.destinationId) - } - } - - def ieSourceTweetEngagementsSource( - interval: Interval - )( - implicit sc: ScioContext - ): SCollection[InteractionEvent] = { - sc.customInput( - "ReadIeSourceTweetEngagementsSource", - DAL - .read( - com.twitter.iesource.processing.events.batch.ServerEngagementsScalaDataset, - interval, - DAL.Environment.Prod, - ) - ).filter { event => - // filter out logged out users because their favorites are less reliable - event.engagingUserId > 0L && event.targetType == InteractionTargetType.Tweet - } - } - - def topicFollowGraphSource( - noOlderThan: Duration = Duration.fromDays(7) - )( - implicit sc: ScioContext - ): SCollection[(Long, Long)] = { - // The implementation here is slightly different than the topicFollowGraphSource function in - // src/scala/com/twitter/simclusters_v2/scalding/embedding/common/ExternalDataSources.scala - // We don't do an additional hashJoin on uttFollowableEntitiesSource. 
- sc.customInput( - "ReadTopicFollowGraphSource", - DAL - .readMostRecentSnapshotNoOlderThan( - UserTopicRelationSnapshotScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ).collect { - case userInterestsRelationSnapshot: UserInterestsRelationSnapshot - if userInterestsRelationSnapshot.interestType == "UTT" && - userInterestsRelationSnapshot.relation == InterestRelationType.Followed => - (userInterestsRelationSnapshot.interestId, userInterestsRelationSnapshot.userId) - } - } - - def magicRecsNotficationOpenOrClickEventsSource( - interval: Interval - )( - implicit sc: ScioContext - ): SCollection[MagicRecsNotificationLite] = { - sc.customInput( - "ReadMagicRecsNotficationOpenOrClickEventsSource", - DAL - .read(MagicrecsNotificationLite1DayLagScalaDataset, interval, DAL.Environment.Prod)) - .filter { entry => - // keep entries with a valid userId and tweetId, opened or clicked timestamp defined - val userIdExists = entry.targetUserId.isDefined - val tweetIdExists = entry.tweetId.isDefined - val openOrClickExists = - entry.openTimestampMs.isDefined || entry.ntabClickTimestampMs.isDefined - userIdExists && tweetIdExists && openOrClickExists - } - } - - def adaptiveSearchScribeLogsSource( - interval: Interval - )( - implicit sc: ScioContext - ): SCollection[(Long, String)] = { - sc.customInput( - "ReadAdaptiveSearchScribeLogsSource", - DAL - .read(AdaptiveSearchScalaDataset, interval, DAL.Environment.Prod)) - .flatMap({ scribeLog: AdaptiveSearchScribeLog => - for { - userId <- userIdFromBlenderAdaptiveScribeLog(scribeLog) - // filter out logged out search queries - if userId != 0 - queryString <- scribeLog.requestLog.flatMap(_.request).flatMap(_.rawQuery) - } yield { - (userId, Set(queryString)) - } - }) - // if a user searches for the same query multiple times, there could be duplicates. 
- // De-dup them to get the distinct queries searched by a user - .sumByKey - .flatMap { - case (userId, distinctQuerySet) => - distinctQuerySet.map { query => - (userId, query) - } - } - } - - private def userIdFromBlenderAdaptiveScribeLog( - blenderAdaptiveLog: AdaptiveSearchScribeLog - ): Option[Long] = { - blenderAdaptiveLog.versionedCommonHeader match { - case VersionedCommonHeader.CommonHeader(CommonHeader.ServerHeader(serverHeader)) => - serverHeader.requestInfo match { - case Some(requestInfo) => requestInfo.ids.get(IdType.UserId).map(_.toLong) - case _ => None - } - case _ => None - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioApp.scala deleted file mode 100644 index 34f9b5f61..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioApp.scala +++ /dev/null @@ -1,39 +0,0 @@ -package com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph - -/** -Build: -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph:assemble-multi-type-graph-scio-adhoc-app - -To kick off an adhoc run: -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/assemble-multi-type-graph-scio-adhoc-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-adhoc.d6w \ - --jar dist/assemble-multi-type-graph-scio-adhoc-app.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=${USER} \ - --bind=profile.date="2021-11-04" \ - --bind=profile.machine="n2-highmem-16" - */ - -object AssembleMultiTypeGraphScioAdhocApp extends AssembleMultiTypeGraphScioBaseApp { - override val isAdhoc: Boolean = true - override val rootMHPath: String = Config.AdhocRootPath - override val rootThriftPath: String = Config.AdhocRootPath -} - -/** -To deploy the job: - -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/assemble-multi-type-graph-scio-batch-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-batch.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=recos-platform \ - --bind=profile.date="2021-11-04" \ - --bind=profile.machine="n2-highmem-16" - */ -object AssembleMultiTypeGraphScioBatchApp extends AssembleMultiTypeGraphScioBaseApp { - override val isAdhoc: Boolean = false - override val rootMHPath: String = Config.RootMHPath - override val rootThriftPath: String = Config.RootThriftPath -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioBaseApp.scala deleted file mode 100644 index 18325e2fc..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraphScioBaseApp.scala +++ /dev/null @@ -1,574 +0,0 @@ -package com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph - -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.DiskFormat -import com.twitter.beam.io.fs.multiformat.PathLayout 
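Note how `adaptiveSearchScribeLogsSource` above emits `(userId, Set(queryString))`, unions the sets per key with `sumByKey`, and then flattens, so a query repeated by the same user is kept once. The same de-duplication idea in plain Scala collections (no Scio), purely for illustration:

```scala
// Plain-collections analogue of the per-user query de-dup done with Scio's sumByKey above.
object QueryDedupSketch {
  def dedupQueries(events: Seq[(Long, String)]): Seq[(Long, String)] =
    events
      .groupBy { case (userId, _) => userId } // gather every query scribed for a user
      .toSeq
      .flatMap { case (userId, rows) =>
        rows.map { case (_, query) => query }.distinct.map(query => (userId, query))
      }

  def main(args: Array[String]): Unit = {
    val events = Seq((1L, "cats"), (1L, "cats"), (1L, "dogs"), (2L, "cats"))
    println(dedupQueries(events).sorted) // List((1,cats), (1,dogs), (2,cats))
  }
}
```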
-import com.twitter.beam.job.DateRangeOptions -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.frigate.data_pipeline.magicrecs.magicrecs_notifications_lite.thriftscala.MagicRecsNotificationLite -import com.twitter.iesource.thriftscala.InteractionEvent -import com.twitter.iesource.thriftscala.InteractionType -import com.twitter.iesource.thriftscala.ReferenceTweet -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.common.Country -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.common.TopicId -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.hdfs_sources.MultiTypeGraphForTopKRightNodesThriftScioScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.TopKRightNounsScioScalaDataset -import com.twitter.simclusters_v2.hdfs_sources.TruncatedMultiTypeGraphScioScalaDataset -import com.twitter.simclusters_v2.scio.common.ExternalDataSources -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.GlobalDefaultMinFrequencyOfRightNodeType -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.HalfLifeInDaysForFavScore -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.NumTopNounsForUnknownRightNodeType -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.SampledEmployeeIds -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.TopKConfig -import com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.Config.TopKRightNounsForMHDump -import com.twitter.simclusters_v2.scio.multi_type_graph.common.MultiTypeGraphUtil -import com.twitter.simclusters_v2.thriftscala.EdgeWithDecayedWeights -import com.twitter.simclusters_v2.thriftscala.LeftNode -import com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge -import com.twitter.simclusters_v2.thriftscala.Noun -import com.twitter.simclusters_v2.thriftscala.NounWithFrequency -import com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.RightNodeType -import com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct -import com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeight -import com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList -import com.twitter.twadoop.user.gen.thriftscala.CombinedUser -import com.twitter.util.Duration -import java.time.Instant -import org.joda.time.Interval - -/** - * Scio version of - * src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraph.scala - */ -trait AssembleMultiTypeGraphScioBaseApp extends ScioBeamJob[DateRangeOptions] { - // Provides an implicit binary thrift scrooge coder by default. 
- override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - - val isAdhoc: Boolean - val rootMHPath: String - val rootThriftPath: String - - val truncatedMultiTypeGraphMHOutputDir: String = - Config.truncatedMultiTypeGraphMHOutputDir - val truncatedMultiTypeGraphThriftOutputDir: String = - Config.truncatedMultiTypeGraphThriftOutputDir - val topKRightNounsMHOutputDir: String = Config.topKRightNounsMHOutputDir - val topKRightNounsOutputDir: String = Config.topKRightNounsOutputDir - - val fullMultiTypeGraphThriftOutputDir: String = - Config.fullMultiTypeGraphThriftOutputDir - val truncatedMultiTypeGraphKeyValDataset: KeyValDALDataset[ - KeyVal[LeftNode, RightNodeWithEdgeWeightList] - ] = TruncatedMultiTypeGraphScioScalaDataset - val topKRightNounsKeyValDataset: KeyValDALDataset[ - KeyVal[RightNodeTypeStruct, NounWithFrequencyList] - ] = TopKRightNounsScioScalaDataset - val topKRightNounsMHKeyValDataset: KeyValDALDataset[ - KeyVal[RightNodeTypeStruct, NounWithFrequencyList] - ] = TopKRightNounsMhScioScalaDataset - val fullMultiTypeGraphSnapshotDataset: SnapshotDALDataset[MultiTypeGraphEdge] = - FullMultiTypeGraphScioScalaDataset - val multiTypeGraphTopKForRightNodesSnapshotDataset: SnapshotDALDataset[ - MultiTypeGraphEdge - ] = - MultiTypeGraphForTopKRightNodesThriftScioScalaDataset - - def getValidUsers( - input: SCollection[CombinedUser] - ): SCollection[UserId] = { - input - .flatMap { u => - for { - user <- u.user - if user.id != 0 - safety <- user.safety - if !(safety.suspended || safety.deactivated) - } yield { - user.id - } - } - } - - def filterInvalidUsers( - flockEdges: SCollection[(UserId, UserId)], - validUsers: SCollection[UserId] - ): SCollection[(UserId, UserId)] = { - val validUsersWithValues = validUsers.map(userId => (userId, ())) - flockEdges - .join(validUsersWithValues) - .map { - case (srcId, (destId, _)) => - (destId, srcId) - } - .join(validUsersWithValues) - .map { - case (destId, (srcId, _)) => - (srcId, destId) - } - } - - def getFavEdges( - input: SCollection[EdgeWithDecayedWeights], - halfLifeInDaysForFavScore: Int, - ): SCollection[(Long, Long, Double)] = { - input - .flatMap { edge => - if (edge.weights.halfLifeInDaysToDecayedSums.contains(halfLifeInDaysForFavScore)) { - Some( - ( - edge.sourceId, - edge.destinationId, - edge.weights.halfLifeInDaysToDecayedSums(halfLifeInDaysForFavScore))) - } else { - None - } - } - } - - def leftRightTuple( - leftNodeUserId: UserId, - rightNodeType: RightNodeType, - rightNoun: Noun, - weight: Double = 1.0 - ): (LeftNode, RightNodeWithEdgeWeight) = { - ( - LeftNode.UserId(leftNodeUserId), - RightNodeWithEdgeWeight( - rightNode = RightNode(rightNodeType = rightNodeType, noun = rightNoun), - weight = weight)) - } - - def getUserFavGraph( - userUserFavEdges: SCollection[(UserId, UserId, Double)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userUserFavEdges.map { - case (srcId, destId, edgeWt) => - leftRightTuple(srcId, RightNodeType.FavUser, Noun.UserId(destId), edgeWt) - } - } - - def getUserFollowGraph( - userUserFollowEdges: SCollection[(UserId, UserId)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userUserFollowEdges.map { - case (srcId, destId) => - leftRightTuple(srcId, RightNodeType.FollowUser, Noun.UserId(destId), 1.0) - } - } - - def getUserBlockGraph( - userUserBlockEdges: SCollection[(UserId, UserId)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userUserBlockEdges.map { - case (srcId, destId) => - 
leftRightTuple(srcId, RightNodeType.BlockUser, Noun.UserId(destId), 1.0) - } - } - - def getUserAbuseReportGraph( - userUserAbuseReportEdges: SCollection[(UserId, UserId)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userUserAbuseReportEdges.map { - case (srcId, destId) => - leftRightTuple(srcId, RightNodeType.AbuseReportUser, Noun.UserId(destId), 1.0) - } - } - - def getUserSpamReportGraph( - userUserSpamReportEdges: SCollection[(UserId, UserId)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userUserSpamReportEdges.map { - case (srcId, destId) => - leftRightTuple(srcId, RightNodeType.SpamReportUser, Noun.UserId(destId), 1.0) - } - } - - def getUserTopicFollowGraph( - topicUserFollowedByEdges: SCollection[(TopicId, UserId)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - topicUserFollowedByEdges.map { - case (topicId, userId) => - leftRightTuple(userId, RightNodeType.FollowTopic, Noun.TopicId(topicId), 1.0) - } - } - - def getUserSignUpCountryGraph( - userSignUpCountryEdges: SCollection[(UserId, Country)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userSignUpCountryEdges.map { - case (userId, country) => - leftRightTuple(userId, RightNodeType.SignUpCountry, Noun.Country(country), 1.0) - } - } - - def getMagicRecsNotifOpenOrClickTweetsGraph( - userMRNotifOpenOrClickEvents: SCollection[MagicRecsNotificationLite] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userMRNotifOpenOrClickEvents.flatMap { entry => - for { - userId <- entry.targetUserId - tweetId <- entry.tweetId - } yield { - leftRightTuple(userId, RightNodeType.NotifOpenOrClickTweet, Noun.TweetId(tweetId), 1.0) - } - } - } - - def getUserConsumedLanguagesGraph( - userConsumedLanguageEdges: SCollection[(UserId, Seq[(Language, Double)])] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userConsumedLanguageEdges.flatMap { - case (userId, langWithWeights) => - langWithWeights.map { - case (lang, weight) => - leftRightTuple(userId, RightNodeType.ConsumedLanguage, Noun.Language(lang), 1.0) - } - } - } - - def getSearchGraph( - userSearchQueryEdges: SCollection[(UserId, String)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - userSearchQueryEdges.map { - case (userId, query) => - leftRightTuple(userId, RightNodeType.SearchQuery, Noun.Query(query), 1.0) - } - } - - def getUserTweetInteractionGraph( - tweetInteractionEvents: SCollection[InteractionEvent], - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - val userTweetInteractionsByType: SCollection[((UserId, TweetId), RightNodeType)] = - tweetInteractionEvents - .flatMap { event => - val referenceTweet: Option[ReferenceTweet] = event.referenceTweet - val targetId: Long = event.targetId - val userId: Long = event.engagingUserId - - // To find the id of the tweet that was interacted with - // For likes, this is the targetId; for retweet or reply, it is the referenceTweet's id - // One thing to note is that for likes, referenceTweet is empty - val (tweetIdOpt, rightNodeTypeOpt) = { - event.interactionType match { - case Some(InteractionType.Favorite) => - // Only allow favorites on original tweets, not retweets, to avoid double-counting - // because we have retweet-type tweets in the data source as well - ( - if (referenceTweet.isEmpty) { - Some(targetId) - } else None, - Some(RightNodeType.FavTweet)) - case Some(InteractionType.Reply) => - (referenceTweet.map(_.tweetId), Some(RightNodeType.ReplyTweet)) - case Some(InteractionType.Retweet) => - (referenceTweet.map(_.tweetId), 
Some(RightNodeType.RetweetTweet)) - case _ => (None, None) - } - } - for { - tweetId <- tweetIdOpt - rightNodeType <- rightNodeTypeOpt - } yield { - ((userId, tweetId), rightNodeType) - } - } - - userTweetInteractionsByType - .mapValues(Set(_)) - .sumByKey - .flatMap { - case ((userId, tweetId), rightNodeTypeSet) => - rightNodeTypeSet.map { rightNodeType => - leftRightTuple(userId, rightNodeType, Noun.TweetId(tweetId), 1.0) - } - } - } - - def getTopKRightNounsWithFrequencies( - fullGraph: SCollection[(LeftNode, RightNodeWithEdgeWeight)], - topKConfig: Map[RightNodeType, Int], - minFrequency: Int, - ): SCollection[(RightNodeType, Seq[(Noun, Double)])] = { - val maxAcrossRightNounType: Int = topKConfig.valuesIterator.max - - fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - (rightNodeWithWeight.rightNode, 1.0) - } - .sumByKey - .filter(_._2 >= minFrequency) - .map { - case (rightNode, freq) => - (rightNode.rightNodeType, (rightNode.noun, freq)) - } - .topByKey(maxAcrossRightNounType)(Ordering.by(_._2)) - .map { - case (rightNodeType, nounsListWithFreq) => - val truncatedList = nounsListWithFreq.toSeq - .sortBy(-_._2) - .take(topKConfig.getOrElse(rightNodeType, NumTopNounsForUnknownRightNodeType)) - (rightNodeType, truncatedList) - } - } - - def getTruncatedGraph( - fullGraph: SCollection[(LeftNode, RightNodeWithEdgeWeight)], - topKWithFrequency: SCollection[(RightNodeType, Seq[(Noun, Double)])] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - val topNouns = topKWithFrequency - .flatMap { - case (rightNodeType, nounsList) => - nounsList - .map { - case (nounVal, aggregatedFrequency) => - RightNode(rightNodeType, nounVal) - } - }.map(nouns => (nouns, ())) - - fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - (rightNodeWithWeight.rightNode, (leftNode, rightNodeWithWeight)) - } - .hashJoin(topNouns) - .map { - case (rightNode, ((left, rightNodeWithWeight), _)) => - (left, rightNodeWithWeight) - } - } - - def buildEmployeeGraph( - graph: SCollection[(LeftNode, RightNodeWithEdgeWeight)] - ): SCollection[(LeftNode, RightNodeWithEdgeWeight)] = { - val employeeIds = SampledEmployeeIds - graph - .collect { - case (LeftNode.UserId(userId), rightNodeWithWeight) if employeeIds.contains(userId) => - (LeftNode.UserId(userId), rightNodeWithWeight) - } - } - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - // Define the implicit ScioContext to read datasets from ExternalDataSources - implicit def scioContext: ScioContext = sc - - // DAL.Environment variable for WriteExecs - val dalEnv = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - // Define date intervals - val interval_7days = - new Interval(opts.interval.getEnd.minusWeeks(1), opts.interval.getEnd.minusMillis(1)) - val interval_14days = - new Interval(opts.interval.getEnd.minusWeeks(2), opts.interval.getEnd.minusMillis(1)) - - /* - * Dataset read operations - */ - // Get list of valid UserIds - to filter out deactivated or suspended user accounts - val validUsers = getValidUsers(ExternalDataSources.userSource(Duration.fromDays(7))) - - // ieSource tweet engagements data for tweet favs, replies, retweets - from last 14 days - val tweetSource = ExternalDataSources.ieSourceTweetEngagementsSource(interval_14days) - - // Read TFlock datasets - val flockFollowSource = ExternalDataSources.flockFollowSource(Duration.fromDays(7)) - val flockBlockSource = ExternalDataSources.flockBlockSource(Duration.fromDays(7)) - val flockReportAsAbuseSource = - 
ExternalDataSources.flockReportAsAbuseSource(Duration.fromDays(7)) - val flockReportAsSpamSource = - ExternalDataSources.flockReportAsSpamSource(Duration.fromDays(7)) - - // user-user fav edges - val userUserFavSource = ExternalDataSources.userUserFavSource(Duration.fromDays(14)) - val userUserFavEdges = getFavEdges(userUserFavSource, HalfLifeInDaysForFavScore) - - // user-user follow edges - val userUserFollowEdges = filterInvalidUsers(flockFollowSource, validUsers) - - // user-user block edges - val userUserBlockEdges = filterInvalidUsers(flockBlockSource, validUsers) - - // user-user abuse report edges - val userUserAbuseReportEdges = filterInvalidUsers(flockReportAsAbuseSource, validUsers) - - // user-user spam report edges - val userUserSpamReportEdges = filterInvalidUsers(flockReportAsSpamSource, validUsers) - - // user-signup country edges - val userSignUpCountryEdges = ExternalDataSources - .userCountrySource(Duration.fromDays(7)) - - // user-consumed language edges - val userConsumedLanguageEdges = - ExternalDataSources.inferredUserConsumedLanguageSource(Duration.fromDays(7)) - - // user-topic follow edges - val topicUserFollowedByEdges = - ExternalDataSources.topicFollowGraphSource(Duration.fromDays(7)) - - // user-MRNotifOpenOrClick events from last 7 days - val userMRNotifOpenOrClickEvents = - ExternalDataSources.magicRecsNotficationOpenOrClickEventsSource(interval_7days) - - // user-searchQuery strings from last 7 days - val userSearchQueryEdges = - ExternalDataSources.adaptiveSearchScribeLogsSource(interval_7days) - - /* - * Generate the full graph - */ - val fullGraph = - getUserTweetInteractionGraph(tweetSource) ++ - getUserFavGraph(userUserFavEdges) ++ - getUserFollowGraph(userUserFollowEdges) ++ - getUserBlockGraph(userUserBlockEdges) ++ - getUserAbuseReportGraph(userUserAbuseReportEdges) ++ - getUserSpamReportGraph(userUserSpamReportEdges) ++ - getUserSignUpCountryGraph(userSignUpCountryEdges) ++ - getUserConsumedLanguagesGraph(userConsumedLanguageEdges) ++ - getUserTopicFollowGraph(topicUserFollowedByEdges) ++ - getMagicRecsNotifOpenOrClickTweetsGraph(userMRNotifOpenOrClickEvents) ++ - getSearchGraph(userSearchQueryEdges) - - // Get Top K RightNodes - val topKRightNodes: SCollection[(RightNodeType, Seq[(Noun, Double)])] = - getTopKRightNounsWithFrequencies( - fullGraph, - TopKConfig, - GlobalDefaultMinFrequencyOfRightNodeType) - - // key transformation - topK nouns, keyed by the RightNodeNounType - val topKNounsKeyedByType: SCollection[(RightNodeTypeStruct, NounWithFrequencyList)] = - topKRightNodes - .map { - case (rightNodeType, rightNounsWithScoresList) => - val nounsListWithFrequency: Seq[NounWithFrequency] = rightNounsWithScoresList - .map { - case (noun, aggregatedFrequency) => - NounWithFrequency(noun, aggregatedFrequency) - } - (RightNodeTypeStruct(rightNodeType), NounWithFrequencyList(nounsListWithFrequency)) - } - - // Get Truncated graph based on the top K RightNodes - val truncatedGraph: SCollection[(LeftNode, RightNodeWithEdgeWeight)] = - getTruncatedGraph(fullGraph, topKRightNodes) - - // key transformations - truncated graph, keyed by LeftNode - // Note: By wrapping and unwrapping with the LeftNode.UserId, we don't have to deal - // with defining our own customer ordering for LeftNode type - val truncatedGraphKeyedBySrc: SCollection[(LeftNode, RightNodeWithEdgeWeightList)] = - truncatedGraph - .collect { - case (LeftNode.UserId(userId), rightNodeWithWeight) => - userId -> List(rightNodeWithWeight) - } - .sumByKey - .map { - case (userId, 
rightNodeWithWeightList) => - (LeftNode.UserId(userId), RightNodeWithEdgeWeightList(rightNodeWithWeightList)) - } - - // WriteExecs - // Write TopK RightNodes to DAL - save all the top K nodes for the clustering step - topKNounsKeyedByType - .map { - case (engagementType, rightList) => - KeyVal(engagementType, rightList) - } - .saveAsCustomOutput( - name = "WriteTopKNouns", - DAL.writeVersionedKeyVal( - topKRightNounsKeyValDataset, - PathLayout.VersionedPath(prefix = - rootMHPath + topKRightNounsOutputDir), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = dalEnv, - ) - ) - - // Write TopK RightNodes to DAL - only take TopKRightNounsForMHDump RightNodes for MH dump - topKNounsKeyedByType - .map { - case (engagementType, rightList) => - val rightListMH = - NounWithFrequencyList(rightList.nounWithFrequencyList.take(TopKRightNounsForMHDump)) - KeyVal(engagementType, rightListMH) - } - .saveAsCustomOutput( - name = "WriteTopKNounsToMHForDebugger", - DAL.writeVersionedKeyVal( - topKRightNounsMHKeyValDataset, - PathLayout.VersionedPath(prefix = - rootMHPath + topKRightNounsMHOutputDir), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = dalEnv, - ) - ) - - // Write truncated graph (MultiTypeGraphTopKForRightNodes) to DAL in KeyVal format - truncatedGraphKeyedBySrc - .map { - case (leftNode, rightNodeWithWeightList) => - KeyVal(leftNode, rightNodeWithWeightList) - }.saveAsCustomOutput( - name = "WriteTruncatedMultiTypeGraph", - DAL.writeVersionedKeyVal( - truncatedMultiTypeGraphKeyValDataset, - PathLayout.VersionedPath(prefix = - rootMHPath + truncatedMultiTypeGraphMHOutputDir), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = dalEnv, - ) - ) - - // Write truncated graph (MultiTypeGraphTopKForRightNodes) to DAL in thrift format - truncatedGraph - .map { - case (leftNode, rightNodeWithWeight) => - MultiTypeGraphEdge(leftNode, rightNodeWithWeight) - }.saveAsCustomOutput( - name = "WriteTruncatedMultiTypeGraphThrift", - DAL.writeSnapshot( - multiTypeGraphTopKForRightNodesSnapshotDataset, - PathLayout.FixedPath(rootThriftPath + truncatedMultiTypeGraphThriftOutputDir), - Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - DiskFormat.Thrift(), - environmentOverride = dalEnv - ) - ) - - // Write full graph to DAL - fullGraph - .map { - case (leftNode, rightNodeWithWeight) => - MultiTypeGraphEdge(leftNode, rightNodeWithWeight) - } - .saveAsCustomOutput( - name = "WriteFullMultiTypeGraph", - DAL.writeSnapshot( - fullMultiTypeGraphSnapshotDataset, - PathLayout.FixedPath(rootThriftPath + fullMultiTypeGraphThriftOutputDir), - Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - DiskFormat.Thrift(), - environmentOverride = dalEnv - ) - ) - - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/BUILD b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/BUILD deleted file mode 100644 index 4ad3bfb53..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/BUILD +++ /dev/null @@ -1,73 +0,0 @@ -scala_library( - name = "assemble-multi-type-graph-scio-lib", - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":full_multi_type_graph_scio-scala", - ":top_k_right_nouns_mh_scio-scala", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/beam/io/manhattan", - 
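`getTopKRightNounsWithFrequencies` above counts how often each right node appears in the full bipartite graph, drops nouns below the minimum frequency, and keeps a per-type top K driven by `TopKConfig`. A simplified plain-Scala sketch of that shape (types reduced to strings, frequencies to counts):

```scala
// Simplified sketch of the per-type "top K nouns by frequency" computation above.
object TopKNounsSketch {
  def topKPerType(
    edges: Seq[(String, String)],  // (rightNodeType, noun), one row per left->right edge
    topKConfig: Map[String, Int],  // per-type K, cf. Config.TopKConfig
    minFrequency: Int,
    defaultK: Int
  ): Map[String, Seq[(String, Int)]] = {
    val frequencies: Seq[(String, String, Int)] =
      edges
        .groupBy(identity)
        .toSeq
        .map { case ((tpe, noun), rows) => (tpe, noun, rows.size) }
        .filter { case (_, _, freq) => freq >= minFrequency }

    frequencies
      .groupBy { case (tpe, _, _) => tpe }
      .map { case (tpe, rows) =>
        val k = topKConfig.getOrElse(tpe, defaultK)
        tpe -> rows
          .map { case (_, noun, freq) => (noun, freq) }
          .sortBy { case (_, freq) => -freq }
          .take(k)
      }
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(
      ("FollowUser", "12"), ("FollowUser", "12"), ("FollowUser", "34"),
      ("SearchQuery", "cats"), ("SearchQuery", "cats"), ("SearchQuery", "dogs"))
    // Prints (in some key order): FollowUser -> List((12,2)), SearchQuery -> List((cats,2))
    println(topKPerType(edges, Map("FollowUser" -> 1), minFrequency = 2, defaultK = 20))
  }
}
```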
"beam-internal/src/main/scala/com/twitter/beam/job", - "beam-internal/src/main/scala/com/twitter/beam/transform", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding/multi_type_graph/assemble_multi_type_graph", - "src/scala/com/twitter/simclusters_v2/scio/common", - "src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common", - ], -) - -jvm_binary( - name = "assemble-multi-type-graph-scio-adhoc-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.AssembleMultiTypeGraphScioAdhocApp", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":assemble-multi-type-graph-scio-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -jvm_binary( - name = "assemble-multi-type-graph-scio-batch-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph.AssembleMultiTypeGraphScioBatchApp", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":assemble-multi-type-graph-scio-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -create_datasets( - base_name = "full_multi_type_graph_scio", - java_schema = "com.twitter.simclusters_v2.thriftjava.MultiTypeGraphEdge", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.MultiTypeGraphEdge", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "top_k_right_nouns_mh_scio", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.topKRightNounListInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList", - scala_dependencies = [ - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/Config.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/Config.scala deleted file mode 100644 index 337789ca1..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/Config.scala +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.simclusters_v2.scio.multi_type_graph.assemble_multi_type_graph - -import com.twitter.simclusters_v2.thriftscala.RightNodeType - -object Config { - val RootMHPath: String = "manhattan_sequence_files/multi_type_graph/" - val RootThriftPath: String = "processed/multi_type_graph/" - val AdhocRootPath = "adhoc/multi_type_graph/" - val truncatedMultiTypeGraphMHOutputDir: String = "truncated_graph_mh" - val truncatedMultiTypeGraphThriftOutputDir: String = "truncated_graph_thrift" - val topKRightNounsMHOutputDir: String = "top_k_right_nouns_mh" - val topKRightNounsOutputDir: String = "top_k_right_nouns" - val fullMultiTypeGraphThriftOutputDir: String = "full_graph_thrift" - val HalfLifeInDaysForFavScore = 100 - val NumTopNounsForUnknownRightNodeType = 20 - val GlobalDefaultMinFrequencyOfRightNodeType = 100 - val TopKRightNounsForMHDump = 
1000 - - // the topK most frequent nouns for each engagement type - val TopKConfig: Map[RightNodeType, Int] = Map( - RightNodeType.FollowUser -> 10000000, // 10M, current simclusters_v2 has this value set to 20M, providing this the most weight - RightNodeType.FavUser -> 5000000, - RightNodeType.BlockUser -> 1000000, - RightNodeType.AbuseReportUser -> 1000000, - RightNodeType.SpamReportUser -> 1000000, - RightNodeType.FollowTopic -> 5000, - RightNodeType.SignUpCountry -> 200, - RightNodeType.ConsumedLanguage -> 50, - RightNodeType.FavTweet -> 500000, - RightNodeType.ReplyTweet -> 500000, - RightNodeType.RetweetTweet -> 500000, - RightNodeType.NotifOpenOrClickTweet -> 500000, - RightNodeType.SearchQuery -> 500000 - ) - val SampledEmployeeIds: Set[Long] = - Set() -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/README.md b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/README.md deleted file mode 100644 index f258c9683..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/README.md +++ /dev/null @@ -1,49 +0,0 @@ -# Pre-requisites - -## Tutorial -Follow the tutorial Batch Job on Dataflow Quickstart on how to run a simple batch job on Dataflow. - -## GCP setup - -Ensure `gcloud` CLI is installed and `application_default_credentials.json` has been generated. - -## Data access - -If you want to run an adhoc job with your ldap, you will need access to multiple LDAP groups to read the datasets. - -# Running the job - -### Running an adhoc job - -```bash -export GCP_PROJECT_NAME='twttr-recos-ml-prod' - -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph:assemble-multi-type-graph-scio-adhoc-app - -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/assemble-multi-type-graph-scio-adhoc-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-adhoc.d6w \ - --jar dist/assemble-multi-type-graph-scio-adho-app.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=${USER} \ - --bind=profile.date="2021-11-04" \ - --bind=profile.machine="n2-highmem-16" -``` - -### Scheduling the job on Workflow - -Scheduling a job will require a service account as `recos-platform`. -Remember this account will need permissions to read all the required dataset. 
- -```bash -export SERVICE_ACCOUNT='recos-platform' -export GCP_PROJECT_NAME='twttr-recos-ml-prod' - -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/assemble-multi-type-graph-scio-batch-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-batch.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name="recos-platform" \ - --bind=profile.date="2021-11-04" \ - --bind=profile.machine="n2-highmem-16" -``` diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-adhoc.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-adhoc.d6w deleted file mode 100644 index 835c48e71..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-adhoc.d6w +++ /dev/null @@ -1,36 +0,0 @@ -# See -# Checkout the README to see how to deploy the job - -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - machine= Default(String, 'n2-highmem-16') - -job = Job( - name='assemble-multi-type-graph-scio-adhoc-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD') - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph:assemble-multi-type-graph-scio-adhoc-app', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT1H', - first_time='{{profile.date}}' - ) - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-batch.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-batch.d6w deleted file mode 100644 index 4734e9c0f..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph/assemble-multi-type-graph-scio-batch.d6w +++ /dev/null @@ -1,41 +0,0 @@ -# See -# Checkout the README to see how to deploy the job - -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'prod') - machine= Default(String, 'n2-highmem-16') - -job = Job( - name='assemble-multi-type-graph-scio-batch-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD') - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/assemble_multi_type_graph:assemble-multi-type-graph-scio-batch-app', 
- gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - environment='prod', - statebird_config=StatebirdConfig( - batch_width='P1W', - first_time='{{profile.date}}' - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT18H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/BUILD b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/BUILD deleted file mode 100644 index d8ca4cd90..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -scala_library( - sources = [ - "*.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scalding", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/MultiTypeGraphUtil.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/MultiTypeGraphUtil.scala deleted file mode 100644 index 4a5cd67de..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common/MultiTypeGraphUtil.scala +++ /dev/null @@ -1,69 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.common - -import com.spotify.scio.ScioContext -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.common.util.Clock -import com.twitter.scalding_internal.job.RequiredBinaryComparators.ordSer -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.TruncatedMultiTypeGraphScioScalaDataset -import com.twitter.simclusters_v2.thriftscala.LeftNode -import com.twitter.simclusters_v2.thriftscala.Noun -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.RightNodeType -import com.twitter.util.Duration - -object MultiTypeGraphUtil { - val RootMHPath: String = "manhattan_sequence_files/multi_type_graph/" - val RootThriftPath: String = "processed/multi_type_graph/" - val AdhocRootPath = "adhoc/multi_type_graph/" - - val nounOrdering: Ordering[Noun] = new Ordering[Noun] { - // We define an ordering for each noun type as specified in simclusters_v2/multi_type_graph.thrift - // Please make sure we don't remove anything here that's still a part of the union Noun thrift and - // vice versa, if we add a new noun type to thrift, an ordering for it needs to added here as well. 
- def nounTypeOrder(noun: Noun): Int = noun match { - case _: Noun.UserId => 0 - case _: Noun.Country => 1 - case _: Noun.Language => 2 - case _: Noun.Query => 3 - case _: Noun.TopicId => 4 - case _: Noun.TweetId => 5 - } - - override def compare(x: Noun, y: Noun): Int = nounTypeOrder(x) compare nounTypeOrder(y) - } - - val rightNodeTypeOrdering: Ordering[RightNodeType] = ordSer[RightNodeType] - - val rightNodeOrdering: Ordering[RightNode] = - new Ordering[RightNode] { - override def compare(x: RightNode, y: RightNode): Int = { - Ordering - .Tuple2(rightNodeTypeOrdering, nounOrdering) - .compare((x.rightNodeType, x.noun), (y.rightNodeType, y.noun)) - } - } - - def getTruncatedMultiTypeGraph( - noOlderThan: Duration = Duration.fromDays(14) - )( - implicit sc: ScioContext - ): SCollection[(Long, RightNode, Double)] = { - sc.customInput( - "ReadTruncatedMultiTypeGraph", - DAL - .readMostRecentSnapshotNoOlderThan( - TruncatedMultiTypeGraphScioScalaDataset, - noOlderThan, - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ).flatMap { - case KeyVal(LeftNode.UserId(userId), rightNodesList) => - rightNodesList.rightNodeWithEdgeWeightList.map(rightNodeWithWeight => - (userId, rightNodeWithWeight.rightNode, rightNodeWithWeight.weight)) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/BUILD b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/BUILD deleted file mode 100644 index fa06b6d7a..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/BUILD +++ /dev/null @@ -1,92 +0,0 @@ -scala_library( - name = "multi-type-graph-scio-sims-lib", - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":right_node_cosine_similarity_scio_adhoc-scala", - ":right_node_sim_hash_scio_adhoc-scala", - "3rdparty/jvm/com/twitter/bijection:scrooge", - "beam-internal/src/main/scala/com/twitter/beam/io/dal", - "beam-internal/src/main/scala/com/twitter/beam/io/manhattan", - "beam-internal/src/main/scala/com/twitter/beam/job", - "beam-internal/src/main/scala/com/twitter/beam/transform", - "beam-internal/src/main/scala/com/twitter/scio_internal/runner/dataflow", - "src/scala/com/twitter/simclusters_v2/hdfs_sources", - "src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/common", - "src/scala/com/twitter/wtf/dataflow/cosine_similarity/common", - ], -) - -jvm_binary( - name = "multi-type-graph-sim-hash-scio-adhoc-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.multi_type_graph_sims.RightNodeSimHashScioAdhocApp", - platform = "java8", - dependencies = [ - ":multi-type-graph-scio-sims-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -jvm_binary( - name = "multi-type-graph-sim-hash-scio-batch-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.multi_type_graph_sims.RightNodeSimHashScioBatchApp", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":multi-type-graph-scio-sims-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -jvm_binary( - name = "multi-type-graph-cosine-similarity-scio-adhoc-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.multi_type_graph_sims.RightNodeCosineSimilarityScioAdhocApp", - platform = "java8", - dependencies = [ - ":multi-type-graph-scio-sims-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -jvm_binary( - name = 
"multi-type-graph-cosine-similarity-scio-batch-app", - main = "com.twitter.simclusters_v2.scio.multi_type_graph.multi_type_graph_sims.RightNodeCosineSimilarityScioBatchApp", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":multi-type-graph-scio-sims-lib", - "beam-internal/src/main/scala/com/twitter/beam/runner/dataflow", - ], -) - -create_datasets( - base_name = "right_node_sim_hash_scio_adhoc", - java_schema = "com.twitter.simclusters_v2.thriftjava.RightNodeSimHashSketch", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.thriftscala.RightNodeSimHashSketch", - segment_type = "snapshot", - tags = ["bazel-compatible"], - java_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-java", - ], - scala_dependencies = [ - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - ], -) - -create_datasets( - base_name = "right_node_cosine_similarity_scio_adhoc", - key_type = "com.twitter.simclusters_v2.thriftscala.RightNode", - platform = "java8", - role = "cassowary", - scala_schema = "com.twitter.simclusters_v2.hdfs_sources.injections.MultiTypeGraphInjections.similarRightNodesInjection", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "com.twitter.simclusters_v2.thriftscala.SimilarRightNodes", - scala_dependencies = [ - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/simclusters_v2/hdfs_sources/injections", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/Config.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/Config.scala deleted file mode 100644 index de0dc39c0..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/Config.scala +++ /dev/null @@ -1,18 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.multi_type_graph_sims - -object Config { - // Config settings for RightNodeSimHashScioBaseApp job - // Number of hashes to generate in the sketch - val numHashes: Int = 8192 // each is a bit, so this results in 1KB uncompressed sketch/user - // Reduce skew by letting each reducers process a limited number of followers/user - val maxNumNeighborsPerReducers: Int = 300000 - val simsHashJobOutputDirectory: String = "right_node/sims/sim_hash" - - // Config settings for RightNodeCosineSimilarityScioBaseApp job - val numSims: Int = 500 - val minCosineSimilarityThreshold: Double = 0.01 - val maxOutDegree: Int = 10000 - val cosineSimJobOutputDirectory = "right_node/sims/cosine_similarity" - -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioApp.scala deleted file mode 100644 index 6c064be9b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioApp.scala +++ /dev/null @@ -1,55 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.multi_type_graph_sims - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.simclusters_v2.hdfs_sources.RightNodeCosineSimilarityScioScalaDataset -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.SimilarRightNodes -import 
com.twitter.wtf.scalding.jobs.cosine_similarity.common.ApproximateMatrixSelfTransposeMultiplicationJob - -/** -Build: -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-cosine-similarity-scio-adhoc-app - -To kick off an adhoc run: -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/multi-type-graph-cosine-similarity-scio-adhoc-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-adhoc.d6w \ - --jar dist/multi-type-graph-cosine-similarity-scio-adhoc-app.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=${USER} \ - --bind=profile.date="2022-01-16" \ - --bind=profile.machine="n2d-highmem-16" --ignore-existing - */ - -object RightNodeCosineSimilarityScioAdhocApp extends RightNodeCosineSimilarityScioBaseApp { - override val isAdhoc = true - override val cosineSimKeyValSnapshotDataset: KeyValDALDataset[ - KeyVal[RightNode, SimilarRightNodes] - ] = - RightNodeCosineSimilarityScioAdhocScalaDataset - override val filterCandidateSimilarityPair: (Double, Double, Double) => Boolean = - ApproximateMatrixSelfTransposeMultiplicationJob.filterCandidateSimilarityPair -} - -/** -To deploy the job: - -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/multi-type-graph-cosine-similarity-scio-batch-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-batch.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=recos-platform \ - --bind=profile.date="2021-12-01" \ - --bind=profile.machine="n2d-highmem-16" - */ -object RightNodeCosineSimilarityScioBatchApp extends RightNodeCosineSimilarityScioBaseApp { - override val isAdhoc = false - override val cosineSimKeyValSnapshotDataset: KeyValDALDataset[ - KeyVal[RightNode, SimilarRightNodes] - ] = - RightNodeCosineSimilarityScioScalaDataset - override val filterCandidateSimilarityPair: (Double, Double, Double) => Boolean = - ApproximateMatrixSelfTransposeMultiplicationJob.filterCandidateSimilarityPair -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioBaseApp.scala deleted file mode 100644 index 963178f7b..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeCosineSimilarityScioBaseApp.scala +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.multi_type_graph_sims - -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.DateRangeOptions -import com.twitter.common.util.Clock -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.scalding_internal.multiformat.format.keyval.KeyVal -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.hdfs_sources.RightNodeSimHashScioScalaDataset -import com.twitter.simclusters_v2.scio.multi_type_graph.common.MultiTypeGraphUtil -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.util.Duration 
-import com.twitter.wtf.dataflow.cosine_similarity.ApproximateMatrixSelfTransposeMultiplicationJob -import java.time.Instant - -trait RightNodeCosineSimilarityScioBaseApp - extends ScioBeamJob[DateRangeOptions] - with ApproximateMatrixSelfTransposeMultiplicationJob[RightNode] { - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - override val ordering: Ordering[RightNode] = MultiTypeGraphUtil.rightNodeOrdering - - val isAdhoc: Boolean - val cosineSimKeyValSnapshotDataset: KeyValDALDataset[KeyVal[RightNode, SimilarRightNodes]] - val rightNodeSimHashSnapshotDataset: SnapshotDALDataset[RightNodeSimHashSketch] = - RightNodeSimHashScioScalaDataset - val cosineSimJobOutputDirectory: String = Config.cosineSimJobOutputDirectory - - override def graph( - implicit sc: ScioContext, - coder: Coder[RightNode] - ): SCollection[(Long, RightNode, Double)] = - MultiTypeGraphUtil.getTruncatedMultiTypeGraph(noOlderThan = Duration.fromDays(14)) - - override def simHashSketches( - implicit sc: ScioContext, - coder: Coder[RightNode] - ): SCollection[(RightNode, Array[Byte])] = { - sc.customInput( - "ReadSimHashSketches", - DAL - .readMostRecentSnapshotNoOlderThan( - rightNodeSimHashSnapshotDataset, - Duration.fromDays(14), - Clock.SYSTEM_CLOCK, - DAL.Environment.Prod - ) - ).map { sketch => - sketch.rightNode -> sketch.simHashOfEngagers.toArray - } - } - - override def configurePipeline( - sc: ScioContext, - opts: DateRangeOptions - ): Unit = { - implicit def scioContext: ScioContext = sc - // DAL.Environment variable for WriteExecs - val dalEnv = if (isAdhoc) DAL.Environment.Dev else DAL.Environment.Prod - - val topKRightNodes: SCollection[(RightNode, SimilarRightNodes)] = topK - .collect { - case (rightNode, simRightNodes) => - val sims = simRightNodes.collect { - case (simRightNode, score) => SimilarRightNode(simRightNode, score) - } - (rightNode, SimilarRightNodes(sims)) - } - - topKRightNodes - .map { - case (rightNode, sims) => KeyVal(rightNode, sims) - }.saveAsCustomOutput( - name = "WriteRightNodeCosineSimilarityDataset", - DAL.writeVersionedKeyVal( - cosineSimKeyValSnapshotDataset, - PathLayout.VersionedPath(prefix = - ((if (!isAdhoc) - MultiTypeGraphUtil.RootMHPath - else - MultiTypeGraphUtil.AdhocRootPath) - + Config.cosineSimJobOutputDirectory)), - instant = Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - environmentOverride = dalEnv, - ) - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioApp.scala deleted file mode 100644 index f485b52ce..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioApp.scala +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.multi_type_graph_sims - -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.simclusters_v2.hdfs_sources.RightNodeSimHashScioScalaDataset -import com.twitter.simclusters_v2.thriftscala.RightNodeSimHashSketch - -/** -Build: -./bazel bundle src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-sim-hash-scio-adhoc-app - -To kick off an adhoc run: -bin/d6w create \ - ${GCP_PROJECT_NAME}/us-central1/multi-type-graph-sim-hash-scio-adhoc-app \ - 
src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-adhoc.d6w \ - --jar dist/multi-type-graph-sim-hash-scio-adhoc-app.jar \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=${USER} \ - --bind=profile.date="2021-12-01" \ - --bind=profile.machine="n2d-highmem-16" --ignore-existing - */ -object RightNodeSimHashScioAdhocApp extends RightNodeSimHashScioBaseApp { - override val isAdhoc: Boolean = true - override val rightNodeSimHashSnapshotDataset: SnapshotDALDataset[RightNodeSimHashSketch] = - RightNodeSimHashScioAdhocScalaDataset -} - -/** -To deploy the job: - -bin/d6w schedule \ - ${GCP_PROJECT_NAME}/us-central1/multi-type-graph-sim-hash-scio-batch-app \ - src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-batch.d6w \ - --bind=profile.project=${GCP_PROJECT_NAME} \ - --bind=profile.user_name=recos-platform \ - --bind=profile.date="2021-12-01" \ - --bind=profile.machine="n2d-highmem-16" - */ -object RightNodeSimHashScioBatchApp extends RightNodeSimHashScioBaseApp { - override val isAdhoc: Boolean = false - override val rightNodeSimHashSnapshotDataset: SnapshotDALDataset[RightNodeSimHashSketch] = - RightNodeSimHashScioScalaDataset -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioBaseApp.scala b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioBaseApp.scala deleted file mode 100644 index e17fe5a15..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/RightNodeSimHashScioBaseApp.scala +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.simclusters_v2.scio -package multi_type_graph.multi_type_graph_sims - -import com.spotify.scio.ScioContext -import com.spotify.scio.coders.Coder -import com.spotify.scio.values.SCollection -import com.twitter.beam.io.dal.DAL -import com.twitter.beam.io.fs.multiformat.DiskFormat -import com.twitter.beam.io.fs.multiformat.PathLayout -import com.twitter.beam.job.DateRangeOptions -import com.twitter.dal.client.dataset.SnapshotDALDataset -import com.twitter.scio_internal.coders.ThriftStructLazyBinaryScroogeCoder -import com.twitter.scio_internal.job.ScioBeamJob -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.scio.multi_type_graph.common.MultiTypeGraphUtil -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.RightNodeSimHashSketch -import com.twitter.util.Duration -import com.twitter.wtf.dataflow.cosine_similarity.SimHashJob -import java.time.Instant - -trait RightNodeSimHashScioBaseApp extends ScioBeamJob[DateRangeOptions] with SimHashJob[RightNode] { - override implicit def scroogeCoder[T <: ThriftStruct: Manifest]: Coder[T] = - ThriftStructLazyBinaryScroogeCoder.scroogeCoder - override val ordering: Ordering[RightNode] = MultiTypeGraphUtil.rightNodeOrdering - - val isAdhoc: Boolean - val rightNodeSimHashSnapshotDataset: SnapshotDALDataset[RightNodeSimHashSketch] - val simsHashJobOutputDirectory: String = Config.simsHashJobOutputDirectory - - override def graph( - implicit sc: ScioContext, - ): SCollection[(Long, RightNode, Double)] = - MultiTypeGraphUtil.getTruncatedMultiTypeGraph(noOlderThan = Duration.fromDays(14)) - - override def configurePipeline(sc: ScioContext, opts: DateRangeOptions): Unit = { - implicit def scioContext: ScioContext = sc - - // DAL.Environment variable for WriteExecs - val dalEnv = if (isAdhoc) 
DAL.Environment.Dev else DAL.Environment.Prod - - val sketches = computeSimHashSketchesForWeightedGraph(graph) - .map { - case (rightNode, sketch, norm) => RightNodeSimHashSketch(rightNode, sketch, norm) - } - - // Write SimHashSketches to DAL - sketches - .saveAsCustomOutput( - name = "WriteSimHashSketches", - DAL.writeSnapshot( - rightNodeSimHashSnapshotDataset, - PathLayout.FixedPath( - ((if (!isAdhoc) - MultiTypeGraphUtil.RootThriftPath - else - MultiTypeGraphUtil.AdhocRootPath) - + simsHashJobOutputDirectory)), - Instant.ofEpochMilli(opts.interval.getEndMillis - 1L), - DiskFormat.Thrift(), - environmentOverride = dalEnv - ) - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-adhoc.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-adhoc.d6w deleted file mode 100644 index 2bdc591cf..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-adhoc.d6w +++ /dev/null @@ -1,33 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - machine = Default(String, 'n2d-highmem-16') - -job = Job( - name='multi-type-graph-cosine-similarity-scio-adhoc-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-cosine-similarity-scio-adhoc-app', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT1H', - first_time='{{profile.date}}' - ) - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-batch.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-batch.d6w deleted file mode 100644 index b88bcd094..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/cosine-similarity-scio-batch.d6w +++ /dev/null @@ -1,39 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'prod') - machine = Default(String, 'n2d-highmem-16') - -job = Job( - name='multi-type-graph-cosine-similarity-scio-batch-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-cosine-similarity-scio-batch-app', - 
gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - environment='prod', - statebird_config=StatebirdConfig( - batch_width='P1W', - first_time='{{profile.date}}' - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT50H' - ) -) - -jobs=[job] - diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-adhoc.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-adhoc.d6w deleted file mode 100644 index ee653aabd..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-adhoc.d6w +++ /dev/null @@ -1,33 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'dev') - machine = Default(String, 'n2d-highmem-16') - -job = Job( - name='multi-type-graph-sim-hash-scio-adhoc-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-sim-hash-scio-adhoc-app', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - statebird_config=StatebirdConfig( - batch_width='PT1H', - first_time='{{profile.date}}' - ) - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-batch.d6w b/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-batch.d6w deleted file mode 100644 index ff6a7b84c..000000000 --- a/src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims/sim-hash-scio-batch.d6w +++ /dev/null @@ -1,38 +0,0 @@ -class Profile(Struct): - project = Required(String) - date = Required(String) - environment = Default(String, 'prod') - machine = Default(String, 'n2d-highmem-16') - -job = Job( - name='multi-type-graph-sim-hash-scio-batch-app', - project='{{profile.project}}', - staging_bucket='{{profile.project}}', - service_account='{{profile.user_name}}-shdw@twttr-dp-svc-accounts.iam.gserviceaccount.com', - region='us-central1', - worker_config=WorkerConfig( - num_workers=2, - worker_machine_type='{{profile.machine}}', - worker_disk_type=WorkerDiskType('HDD'), - ), - extra_args={ - "environment": '{{profile.environment}}', - "date": Quote('{{profile.date}}'), - }, - deployment_config=BatchDeploymentConfig( - role='{{profile.user_name}}', - build_target='src/scala/com/twitter/simclusters_v2/scio/multi_type_graph/multi_type_graph_sims:multi-type-graph-sim-hash-scio-batch-app', - gcp_deployment_credentials='/var/lib/tss/keys/{{profile.user_name}}/cloud/gcp/dp/shadow.json', - environment='prod', - statebird_config=StatebirdConfig( - batch_width='P1W', - first_time='{{profile.date}}' - ), - workflow_config=WorkflowConfig( - play=True, - ), - timeout='PT20H' - ) -) - -jobs=[job] diff --git a/src/scala/com/twitter/simclusters_v2/score/AggregatedScoreStore.scala 
b/src/scala/com/twitter/simclusters_v2/score/AggregatedScoreStore.scala deleted file mode 100644 index 31734f226..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/AggregatedScoreStore.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.thriftscala.{ScoreId => ThriftScoreId, Score => ThriftScore} -import com.twitter.storehaus.ReadableStore - -/** - * A wrapper class, used to aggregate the scores calculated by other score stores. It relies on the - * results of other ScoreStores registered in the ScoreFacadeStore. - */ -trait AggregatedScoreStore extends ReadableStore[ThriftScoreId, ThriftScore] { - - // The underlyingScoreStore relies on [[ScoreFacadeStore]] to finish the dependency injection. - protected var scoreFacadeStore: ReadableStore[ThriftScoreId, ThriftScore] = ReadableStore.empty - - /** - * When registering this store in a ScoreFacadeStore, the facade store calls this function to - * provide references to other score stores. - */ - private[score] def set(facadeStore: ReadableStore[ThriftScoreId, ThriftScore]): Unit = { - this.synchronized { - scoreFacadeStore = facadeStore - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/score/BUILD b/src/scala/com/twitter/simclusters_v2/score/BUILD deleted file mode 100644 index 13e8c07f6..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "finagle/finagle-stats", - "hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common", - "src/scala/com/twitter/simclusters_v2/stores", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/score/Score.scala b/src/scala/com/twitter/simclusters_v2/score/Score.scala deleted file mode 100644 index c12acf97e..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/Score.scala +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore} - -/** - * A uniform value type for all kinds of Calculation Score. - **/ -case class Score(score: Double) { - - implicit lazy val toThrift: ThriftScore = { - ThriftScore(score) - } -} - -object Score { - - /** - * Only support Double Type Thrift score - */ - implicit val fromThriftScore: ThriftScore => Score = { thriftScore => Score(thriftScore.score) } - -} diff --git a/src/scala/com/twitter/simclusters_v2/score/ScoreFacadeStore.scala b/src/scala/com/twitter/simclusters_v2/score/ScoreFacadeStore.scala deleted file mode 100644 index ac084e737..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/ScoreFacadeStore.scala +++ /dev/null @@ -1,103 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.finagle.stats.BroadcastStatsReceiver -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.hermit.store.common.ObservedReadableStore -import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm -import com.twitter.simclusters_v2.thriftscala.{ScoreId => ThriftScoreId} -import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore} -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -/** - * Provide a uniform access layer for all kind of Score. 
- * @param readableStores readable stores indexed by the ScoringAlgorithm they implement - */ -class ScoreFacadeStore private ( - stores: Map[ScoringAlgorithm, ReadableStore[ThriftScoreId, ThriftScore]]) - extends ReadableStore[ThriftScoreId, ThriftScore] { - - override def get(k: ThriftScoreId): Future[Option[ThriftScore]] = { - findStore(k).get(k) - } - - // Override the multiGet for better batch performance. - override def multiGet[K1 <: ThriftScoreId](ks: Set[K1]): Map[K1, Future[Option[ThriftScore]]] = { - if (ks.isEmpty) { - Map.empty - } else { - val head = ks.head - val notSameType = ks.exists(k => k.algorithm != head.algorithm) - if (!notSameType) { - findStore(head).multiGet(ks) - } else { - // Generate a large amount temp objects. - // For better performance, avoid querying the multiGet with more than one kind of embedding - ks.groupBy(id => id.algorithm).flatMap { - case (_, ks) => - findStore(ks.head).multiGet(ks) - } - } - } - } - - // If not store mapping, fast return a IllegalArgumentException. - private def findStore(id: ThriftScoreId): ReadableStore[ThriftScoreId, ThriftScore] = { - stores.get(id.algorithm) match { - case Some(store) => store - case None => - throw new IllegalArgumentException(s"The Scoring Algorithm ${id.algorithm} doesn't exist.") - } - } - -} - -object ScoreFacadeStore { - /* - Build a ScoreFacadeStore which exposes stats for all requests (under "all") and per scoring algorithm: - - score_facade_store/all/ - score_facade_store// - - Stores in aggregatedStores may reference stores in readableStores. An instance of ScoreFacadeStore - is passed to them after instantiation. - */ - def buildWithMetrics( - readableStores: Map[ScoringAlgorithm, ReadableStore[ThriftScoreId, ThriftScore]], - aggregatedStores: Map[ScoringAlgorithm, AggregatedScoreStore], - statsReceiver: StatsReceiver - ) = { - val scopedStatsReceiver = statsReceiver.scope("score_facade_store") - - def wrapStore( - scoringAlgorithm: ScoringAlgorithm, - store: ReadableStore[ThriftScoreId, ThriftScore] - ): ReadableStore[ThriftScoreId, ThriftScore] = { - val sr = BroadcastStatsReceiver( - Seq( - scopedStatsReceiver.scope("all"), - scopedStatsReceiver.scope(scoringAlgorithm.name) - )) - ObservedReadableStore(store)(sr) - } - - val stores = (readableStores ++ aggregatedStores).map { - case (algo, store) => algo -> wrapStore(algo, store) - } - val store = new ScoreFacadeStore(stores = stores) - - /* - AggregatedScores aggregate scores from multiple non-aggregated stores. They access these via the - ScoreFacadeStore itself, and therefore must be passed an instance of it after it has been - constructed. 
- */ - assert( - readableStores.keySet.forall(algorithm => !aggregatedStores.keySet.contains(algorithm)), - "Keys for stores are disjoint") - - aggregatedStores.values.foreach(_.set(store)) - - store - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/score/ScoreId.scala b/src/scala/com/twitter/simclusters_v2/score/ScoreId.scala deleted file mode 100644 index da045ecda..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/ScoreId.scala +++ /dev/null @@ -1,129 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.common.SimClustersEmbeddingId._ -import com.twitter.simclusters_v2.thriftscala.{ - InternalId, - ScoreInternalId, - ScoringAlgorithm, - SimClustersEmbeddingId, - GenericPairScoreId => ThriftGenericPairScoreId, - ScoreId => ThriftScoreId, - SimClustersEmbeddingPairScoreId => ThriftSimClustersEmbeddingPairScoreId -} - -/** - * A uniform Identifier type for all kinds of Calculation Score. - **/ -trait ScoreId { - - def algorithm: ScoringAlgorithm - - /** - * Convert to a Thrift object. Throw a exception if the operation is not override. - */ - implicit def toThrift: ThriftScoreId = - throw new UnsupportedOperationException(s"ScoreId $this doesn't support Thrift format") -} - -object ScoreId { - - implicit val fromThriftScoreId: ThriftScoreId => ScoreId = { - case scoreId @ ThriftScoreId(_, ScoreInternalId.GenericPairScoreId(_)) => - PairScoreId.fromThriftScoreId(scoreId) - case scoreId @ ThriftScoreId(_, ScoreInternalId.SimClustersEmbeddingPairScoreId(_)) => - SimClustersEmbeddingPairScoreId.fromThriftScoreId(scoreId) - } - -} - -/** - * Generic Internal pairwise id. Support all the subtypes in InternalId, which includes TweetId, - * UserId, EntityId and more combination ids. - **/ -trait PairScoreId extends ScoreId { - - def id1: InternalId - def id2: InternalId - - override implicit lazy val toThrift: ThriftScoreId = { - ThriftScoreId( - algorithm, - ScoreInternalId.GenericPairScoreId(ThriftGenericPairScoreId(id1, id2)) - ) - } -} - -object PairScoreId { - - // The default PairScoreId assume id1 <= id2. It used to increase the cache hit rate. - def apply(algorithm: ScoringAlgorithm, id1: InternalId, id2: InternalId): PairScoreId = { - if (internalIdOrdering.lteq(id1, id2)) { - DefaultPairScoreId(algorithm, id1, id2) - } else { - DefaultPairScoreId(algorithm, id2, id1) - } - } - - private case class DefaultPairScoreId( - algorithm: ScoringAlgorithm, - id1: InternalId, - id2: InternalId) - extends PairScoreId - - implicit val fromThriftScoreId: ThriftScoreId => PairScoreId = { - case ThriftScoreId(algorithm, ScoreInternalId.GenericPairScoreId(pairScoreId)) => - DefaultPairScoreId(algorithm, pairScoreId.id1, pairScoreId.id2) - case ThriftScoreId(algorithm, ScoreInternalId.SimClustersEmbeddingPairScoreId(pairScoreId)) => - SimClustersEmbeddingPairScoreId(algorithm, pairScoreId.id1, pairScoreId.id2) - } - -} - -/** - * ScoreId for a pair of SimClustersEmbedding. - * Used for dot product, cosine similarity and other basic embedding operations. 
- */ -trait SimClustersEmbeddingPairScoreId extends PairScoreId { - def embeddingId1: SimClustersEmbeddingId - - def embeddingId2: SimClustersEmbeddingId - - override def id1: InternalId = embeddingId1.internalId - - override def id2: InternalId = embeddingId2.internalId - - override implicit lazy val toThrift: ThriftScoreId = { - ThriftScoreId( - algorithm, - ScoreInternalId.SimClustersEmbeddingPairScoreId( - ThriftSimClustersEmbeddingPairScoreId(embeddingId1, embeddingId2)) - ) - } -} - -object SimClustersEmbeddingPairScoreId { - - // The default PairScoreId assume id1 <= id2. It used to increase the cache hit rate. - def apply( - algorithm: ScoringAlgorithm, - id1: SimClustersEmbeddingId, - id2: SimClustersEmbeddingId - ): SimClustersEmbeddingPairScoreId = { - if (simClustersEmbeddingIdOrdering.lteq(id1, id2)) { - DefaultSimClustersEmbeddingPairScoreId(algorithm, id1, id2) - } else { - DefaultSimClustersEmbeddingPairScoreId(algorithm, id2, id1) - } - } - - private case class DefaultSimClustersEmbeddingPairScoreId( - algorithm: ScoringAlgorithm, - embeddingId1: SimClustersEmbeddingId, - embeddingId2: SimClustersEmbeddingId) - extends SimClustersEmbeddingPairScoreId - - implicit val fromThriftScoreId: ThriftScoreId => SimClustersEmbeddingPairScoreId = { - case ThriftScoreId(algorithm, ScoreInternalId.SimClustersEmbeddingPairScoreId(pairScoreId)) => - SimClustersEmbeddingPairScoreId(algorithm, pairScoreId.id1, pairScoreId.id2) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/score/ScoreStore.scala b/src/scala/com/twitter/simclusters_v2/score/ScoreStore.scala deleted file mode 100644 index 3aea91e1a..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/ScoreStore.scala +++ /dev/null @@ -1,72 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore, ScoreId => ThriftScoreId} -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -/** - * A Score Store is a readableStore with ScoreId as Key and Score as the Value. - * It also needs to include the algorithm type. - * A algorithm type should only be used by one Score Store in the application. - */ -trait ScoreStore[K <: ScoreId] extends ReadableStore[K, Score] { - - def fromThriftScoreId: ThriftScoreId => K - - // Convert to a Thrift version. - def toThriftStore: ReadableStore[ThriftScoreId, ThriftScore] = { - this - .composeKeyMapping[ThriftScoreId](fromThriftScoreId) - .mapValues(_.toThrift) - } -} - -/** - * A generic Pairwise Score store. - * Requires provide both left and right side feature hydration. 
- */ -trait PairScoreStore[K <: PairScoreId, K1, K2, V1, V2] extends ScoreStore[K] { - - def compositeKey1: K => K1 - def compositeKey2: K => K2 - - // Left side feature hydration - def underlyingStore1: ReadableStore[K1, V1] - - // Right side feature hydration - def underlyingStore2: ReadableStore[K2, V2] - - def score: (V1, V2) => Future[Option[Double]] - - override def get(k: K): Future[Option[Score]] = { - for { - vs <- - Future.join(underlyingStore1.get(compositeKey1(k)), underlyingStore2.get(compositeKey2(k))) - v <- vs match { - case (Some(v1), Some(v2)) => - score(v1, v2) - case _ => - Future.None - } - } yield { - v.map(buildScore) - } - } - - override def multiGet[KK <: K](ks: Set[KK]): Map[KK, Future[Option[Score]]] = { - - val v1Map = underlyingStore1.multiGet(ks.map { k => compositeKey1(k) }) - val v2Map = underlyingStore2.multiGet(ks.map { k => compositeKey2(k) }) - - ks.map { k => - k -> Future.join(v1Map(compositeKey1(k)), v2Map(compositeKey2(k))).flatMap { - case (Some(v1), Some(v2)) => - score(v1, v2).map(_.map(buildScore)) - case _ => - Future.value(None) - } - }.toMap - } - - private def buildScore(v: Double): Score = Score(v) -} diff --git a/src/scala/com/twitter/simclusters_v2/score/SimClustersEmbeddingPairScoreStore.scala b/src/scala/com/twitter/simclusters_v2/score/SimClustersEmbeddingPairScoreStore.scala deleted file mode 100644 index ef0143711..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/SimClustersEmbeddingPairScoreStore.scala +++ /dev/null @@ -1,201 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbeddingId, ScoreId => ThriftScoreId} -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -object SimClustersEmbeddingPairScoreStore { - - /** - * Internal Instance of a SimClusters Embedding based Pair Score store. 
- */ - private case class SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding], - score: (SimClustersEmbedding, SimClustersEmbedding) => Future[Option[Double]]) - extends PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] { - - override val compositeKey1: SimClustersEmbeddingPairScoreId => SimClustersEmbeddingId = - _.embeddingId1 - override val compositeKey2: SimClustersEmbeddingPairScoreId => SimClustersEmbeddingId = - _.embeddingId2 - - override def underlyingStore1: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = - simClustersEmbeddingStore - - override def underlyingStore2: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = - simClustersEmbeddingStore - - override def fromThriftScoreId: ThriftScoreId => SimClustersEmbeddingPairScoreId = - SimClustersEmbeddingPairScoreId.fromThriftScoreId - } - - def buildDotProductStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def dotProduct: (SimClustersEmbedding, SimClustersEmbedding) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.dotProduct(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - dotProduct - ) - } - - def buildCosineSimilarityStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def cosineSimilarity: (SimClustersEmbedding, SimClustersEmbedding) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.cosineSimilarity(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - cosineSimilarity - ) - } - - def buildLogCosineSimilarityStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def logNormCosineSimilarity: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.logNormCosineSimilarity(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - logNormCosineSimilarity - ) - } - - def buildExpScaledCosineSimilarityStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def expScaledCosineSimilarity: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.expScaledCosineSimilarity(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - expScaledCosineSimilarity - ) - } - - def buildJaccardSimilarityStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, 
SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def jaccardSimilarity: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.jaccardSimilarity(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - jaccardSimilarity - ) - } - - def buildEuclideanDistanceStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def euclideanDistance: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.euclideanDistance(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - euclideanDistance - ) - } - - def buildManhattanDistanceStore( - simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): PairScoreStore[ - SimClustersEmbeddingPairScoreId, - SimClustersEmbeddingId, - SimClustersEmbeddingId, - SimClustersEmbedding, - SimClustersEmbedding - ] = { - - def manhattanDistance: ( - SimClustersEmbedding, - SimClustersEmbedding - ) => Future[Option[Double]] = { - case (embedding1, embedding2) => - Future.value(Some(embedding1.manhattanDistance(embedding2))) - } - - SimClustersEmbeddingInternalPairScoreStore( - simClustersEmbeddingStore, - manhattanDistance - ) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/score/WeightedSumAggregatedScoreStore.scala b/src/scala/com/twitter/simclusters_v2/score/WeightedSumAggregatedScoreStore.scala deleted file mode 100644 index 8c1552c95..000000000 --- a/src/scala/com/twitter/simclusters_v2/score/WeightedSumAggregatedScoreStore.scala +++ /dev/null @@ -1,84 +0,0 @@ -package com.twitter.simclusters_v2.score - -import com.twitter.simclusters_v2.score.WeightedSumAggregatedScoreStore.WeightedSumAggregatedScoreParameter -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - GenericPairScoreId, - ModelVersion, - ScoreInternalId, - ScoringAlgorithm, - SimClustersEmbeddingId, - Score => ThriftScore, - ScoreId => ThriftScoreId, - SimClustersEmbeddingPairScoreId => ThriftSimClustersEmbeddingPairScoreId -} -import com.twitter.util.Future - -/** - * A generic store wrapper to aggregate the scores of N underlying stores in a weighted fashion. - * - */ -case class WeightedSumAggregatedScoreStore(parameters: Seq[WeightedSumAggregatedScoreParameter]) - extends AggregatedScoreStore { - - override def get(k: ThriftScoreId): Future[Option[ThriftScore]] = { - val underlyingScores = parameters.map { parameter => - scoreFacadeStore - .get(ThriftScoreId(parameter.scoreAlgorithm, parameter.idTransform(k.internalId))) - .map(_.map(s => parameter.scoreTransform(s.score) * parameter.weight)) - } - Future.collect(underlyingScores).map { scores => - if (scores.exists(_.nonEmpty)) { - val newScore = scores.foldLeft(0.0) { - case (sum, maybeScore) => - sum + maybeScore.getOrElse(0.0) - } - Some(ThriftScore(score = newScore)) - } else { - // Return None if all of the underlying score is None. - None - } - } - } -} - -object WeightedSumAggregatedScoreStore { - - /** - * The parameter of WeightedSumAggregatedScoreStore. 
Create 0 to N parameters for a WeightedSum - * AggregatedScore Store. Please evaluate the performance before productionization any new score. - * - * @param scoreAlgorithm the underlying score algorithm name - * @param weight contribution to weighted sum of this sub-score - * @param idTransform transform the source ScoreInternalId to underlying score InternalId. - * @param scoreTransform function to apply to sub-score before adding to weighted sum - */ - case class WeightedSumAggregatedScoreParameter( - scoreAlgorithm: ScoringAlgorithm, - weight: Double, - idTransform: ScoreInternalId => ScoreInternalId, - scoreTransform: Double => Double = identityScoreTransform) - - val SameTypeScoreInternalIdTransform: ScoreInternalId => ScoreInternalId = { id => id } - val identityScoreTransform: Double => Double = { score => score } - - // Convert Generic Internal Id to a SimClustersEmbeddingId - def genericPairScoreIdToSimClustersEmbeddingPairScoreId( - embeddingType1: EmbeddingType, - embeddingType2: EmbeddingType, - modelVersion: ModelVersion - ): ScoreInternalId => ScoreInternalId = { - case id: ScoreInternalId.GenericPairScoreId => - ScoreInternalId.SimClustersEmbeddingPairScoreId( - ThriftSimClustersEmbeddingPairScoreId( - SimClustersEmbeddingId(embeddingType1, modelVersion, id.genericPairScoreId.id1), - SimClustersEmbeddingId(embeddingType2, modelVersion, id.genericPairScoreId.id2) - )) - } - - val simClustersEmbeddingPairScoreIdToGenericPairScoreId: ScoreInternalId => ScoreInternalId = { - case ScoreInternalId.SimClustersEmbeddingPairScoreId(simClustersId) => - ScoreInternalId.GenericPairScoreId( - GenericPairScoreId(simClustersId.id1.internalId, simClustersId.id2.internalId)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/BUILD b/src/scala/com/twitter/simclusters_v2/stores/BUILD deleted file mode 100644 index 11bc8e7e6..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/BUILD +++ /dev/null @@ -1,14 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/storehaus:core", - "hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/storehaus_internal/manhattan", - "src/scala/com/twitter/storehaus_internal/util", - "src/scala/com/twitter/wtf/scalding/jobs/injection", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "storage/clients/manhattan/client/src/main/scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/stores/LanguageFilteredLocaleEntityEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/stores/LanguageFilteredLocaleEntityEmbeddingStore.scala deleted file mode 100644 index e461e1ed2..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/LanguageFilteredLocaleEntityEmbeddingStore.scala +++ /dev/null @@ -1,96 +0,0 @@ -package com.twitter.simclusters_v2.stores - -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.ClusterDetails -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -/** - * Transfer a Entity SimClustersEmbedding to a language filtered embedding. 
- * The new embedding only contains clusters whose main language is the same as the language field in - * the SimClustersEmbeddingId. - * - * This store is special designed for Topic Tweet and Topic Follow Prompt. - * Only support new Ids whose internalId is LocaleEntityId. - */ -@deprecated -case class LanguageFilteredLocaleEntityEmbeddingStore( - underlyingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding], - clusterDetailsStore: ReadableStore[(ModelVersion, ClusterId), ClusterDetails], - composeKeyMapping: SimClustersEmbeddingId => SimClustersEmbeddingId) - extends ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] { - - import LanguageFilteredLocaleEntityEmbeddingStore._ - - override def get(k: SimClustersEmbeddingId): Future[Option[SimClustersEmbedding]] = { - for { - maybeEmbedding <- underlyingStore.get(composeKeyMapping(k)) - maybeFilteredEmbedding <- maybeEmbedding match { - case Some(embedding) => - embeddingsLanguageFilter(k, embedding).map(Some(_)) - case None => - Future.None - } - } yield maybeFilteredEmbedding - } - - private def embeddingsLanguageFilter( - sourceEmbeddingId: SimClustersEmbeddingId, - simClustersEmbedding: SimClustersEmbedding - ): Future[SimClustersEmbedding] = { - val language = getLanguage(sourceEmbeddingId) - val modelVersion = sourceEmbeddingId.modelVersion - - val clusterDetailKeys = simClustersEmbedding.sortedClusterIds.map { clusterId => - (modelVersion, clusterId) - }.toSet - - Future - .collect { - clusterDetailsStore.multiGet(clusterDetailKeys) - }.map { clusterDetailsMap => - simClustersEmbedding.embedding.filter { - case (clusterId, _) => - isDominantLanguage( - language, - clusterDetailsMap.getOrElse((modelVersion, clusterId), None)) - } - }.map(SimClustersEmbedding(_)) - } - - private def isDominantLanguage( - requestLang: String, - clusterDetails: Option[ClusterDetails] - ): Boolean = - clusterDetails match { - case Some(details) => - val dominantLanguage = - details.languageToFractionDeviceLanguage.map { langMap => - langMap.maxBy { - case (_, score) => score - }._1 - } - - dominantLanguage.exists(_.equalsIgnoreCase(requestLang)) - case _ => true - } - -} - -object LanguageFilteredLocaleEntityEmbeddingStore { - - def getLanguage(simClustersEmbeddingId: SimClustersEmbeddingId): String = { - simClustersEmbeddingId match { - case SimClustersEmbeddingId(_, _, InternalId.LocaleEntityId(localeEntityId)) => - localeEntityId.language - case _ => - throw new IllegalArgumentException( - s"The Id $simClustersEmbeddingId doesn't contain Locale info") - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/MultiTypeGraphStore.scala b/src/scala/com/twitter/simclusters_v2/stores/MultiTypeGraphStore.scala deleted file mode 100644 index 656a61696..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/MultiTypeGraphStore.scala +++ /dev/null @@ -1,287 +0,0 @@ -package com.twitter.simclusters_v2.stores -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.common.Language -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.LeftNode -import com.twitter.simclusters_v2.thriftscala.NounWithFrequencyList -import com.twitter.simclusters_v2.thriftscala.RightNode -import com.twitter.simclusters_v2.thriftscala.RightNodeTypeStruct -import com.twitter.simclusters_v2.thriftscala.RightNodeWithEdgeWeightList -import 
com.twitter.simclusters_v2.thriftscala.SimilarRightNodes -import com.twitter.simclusters_v2.thriftscala.CandidateTweetsList -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.Apollo -import com.twitter.storehaus_internal.manhattan.ManhattanRO -import com.twitter.storehaus_internal.manhattan.ManhattanROConfig -import com.twitter.storehaus_internal.util.ApplicationID -import com.twitter.storehaus_internal.util.DatasetName -import com.twitter.storehaus_internal.util.HDFSPath -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.Long2BigEndian -import com.twitter.simclusters_v2.thriftscala.FullClusterId -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores - -object MultiTypeGraphStore { - - implicit val leftNodesInject: Injection[LeftNode, Array[Byte]] = - CompactScalaCodec(LeftNode) - implicit val truncatedMultiTypeGraphInject: Injection[RightNodeWithEdgeWeightList, Array[Byte]] = - CompactScalaCodec(RightNodeWithEdgeWeightList) - implicit val topKNounsListInject: Injection[NounWithFrequencyList, Array[Byte]] = - CompactScalaCodec(NounWithFrequencyList) - implicit val rightNodesStructInject: Injection[RightNodeTypeStruct, Array[Byte]] = - CompactScalaCodec(RightNodeTypeStruct) - implicit val similarRightNodesStructInject: Injection[SimilarRightNodes, Array[Byte]] = - CompactScalaCodec(SimilarRightNodes) - implicit val rightNodesInject: Injection[RightNode, Array[Byte]] = - CompactScalaCodec(RightNode) - implicit val tweetCandidatesInject: Injection[CandidateTweetsList, Array[Byte]] = - CompactScalaCodec(CandidateTweetsList) - implicit val fullClusterIdInject: Injection[FullClusterId, Array[Byte]] = - CompactScalaCodec(FullClusterId) - implicit val topKTweetsWithScoresInject: Injection[TopKTweetsWithScores, Array[Byte]] = - CompactScalaCodec(TopKTweetsWithScores) - implicit val clustersUserIsInterestedInInjection: Injection[ClustersUserIsInterestedIn, Array[ - Byte - ]] = - CompactScalaCodec(ClustersUserIsInterestedIn) - - private val appId = "multi_type_simclusters" - - def getTruncatedMultiTypeGraphRightNodesForUser( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[LeftNode, RightNodeWithEdgeWeightList] = { - ManhattanRO.getReadableStoreWithMtls[LeftNode, RightNodeWithEdgeWeightList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("mts_user_truncated_graph"), - Apollo - ), - mhMtlsParams - ) - } - - def getTopKNounsForRightNodeType( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[RightNodeTypeStruct, NounWithFrequencyList] = { - ManhattanRO.getReadableStoreWithMtls[RightNodeTypeStruct, NounWithFrequencyList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("mts_topk_frequent_nouns"), - Apollo - ), - mhMtlsParams - ) - } - - def getTopKSimilarRightNodes( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[RightNode, SimilarRightNodes] = { - ManhattanRO.getReadableStoreWithMtls[RightNode, SimilarRightNodes]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("mts_topk_similar_right_nodes_scio"), - Apollo - ), - mhMtlsParams - ) - } - - def getOfflineTweetMTSCandidateStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, CandidateTweetsList] = { - ManhattanRO.getReadableStoreWithMtls[Long, CandidateTweetsList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - 
DatasetName("offline_tweet_recommendations_from_mts_consumer_embeddings"), - Apollo - ), - mhMtlsParams - ) - } - - def getOfflineTweet2020CandidateStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, CandidateTweetsList] = { - ManhattanRO.getReadableStoreWithMtls[Long, CandidateTweetsList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("offline_tweet_recommendations_from_interestedin_2020"), - Apollo - ), - mhMtlsParams - ) - } - - def getVideoViewBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("video_view_based_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getRetweetBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("retweet_based_simclusters_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getReplyBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("reply_based_simclusters_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getPushOpenBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("push_open_based_simclusters_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getAdsFavBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("ads_fav_based_simclusters_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getAdsFavClickBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("ads_fav_click_based_simclusters_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getFTRPop1000BasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("ftr_pop1000_rank_decay_1_1_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getFTRPop10000BasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - 
DatasetName("ftr_pop10000_rank_decay_1_1_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getOONFTRPop1000BasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("oon_ftr_pop1000_rnkdecay_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getOfflineLogFavBasedTweetBasedClusterTopKTweets( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[FullClusterId, TopKTweetsWithScores] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("decayed_sum_cluster_to_tweet_index"), - Apollo - ), - mhMtlsParams - ) - } - - def getGlobalSimClustersLanguageEmbeddings( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Language, ClustersUserIsInterestedIn] = { - ManhattanRO - .getReadableStoreWithMtls[Language, ClustersUserIsInterestedIn]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("global_simclusters_language_embeddings"), - Apollo - ), - mhMtlsParams - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/SimClustersEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/stores/SimClustersEmbeddingStore.scala deleted file mode 100644 index 62785e205..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/SimClustersEmbeddingStore.scala +++ /dev/null @@ -1,120 +0,0 @@ -package com.twitter.simclusters_v2.stores - -import com.twitter.decider.Decider -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.hermit.store.common.DeciderableReadableStore -import com.twitter.servo.decider.DeciderKeyEnum -import com.twitter.simclusters_v2.common.DeciderGateBuilderWithIdHashing -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -/** - * Facade of all SimClusters Embedding Store. - * Provide a uniform access layer for all kind of SimClusters Embedding. - */ -case class SimClustersEmbeddingStore( - stores: Map[ - (EmbeddingType, ModelVersion), - ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ]) extends ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] { - - private val lookupStores = - stores - .groupBy(_._1._1).mapValues(_.map { - case ((_, modelVersion), store) => - modelVersion -> store - }) - - override def get(k: SimClustersEmbeddingId): Future[Option[SimClustersEmbedding]] = { - findStore(k) match { - case Some(store) => store.get(k) - case None => Future.None - } - } - - // Override the multiGet for better batch performance. - override def multiGet[K1 <: SimClustersEmbeddingId]( - ks: Set[K1] - ): Map[K1, Future[Option[SimClustersEmbedding]]] = { - if (ks.isEmpty) { - Map.empty - } else { - val head = ks.head - val notSameType = - ks.exists(k => k.embeddingType != head.embeddingType || k.modelVersion != head.modelVersion) - if (!notSameType) { - findStore(head) match { - case Some(store) => store.multiGet(ks) - case None => ks.map(_ -> Future.None).toMap - } - } else { - // Generate a large amount temp objects. 
- // For better performance, avoid querying the multiGet with more than one kind of embedding - ks.groupBy(id => (id.embeddingType, id.modelVersion)).flatMap { - case ((_, _), ks) => - findStore(ks.head) match { - case Some(store) => store.multiGet(ks) - case None => ks.map(_ -> Future.None).toMap - } - } - } - } - } - - private def findStore( - id: SimClustersEmbeddingId - ): Option[ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]] = { - lookupStores.get(id.embeddingType).flatMap(_.get(id.modelVersion)) - } - -} - -object SimClustersEmbeddingStore { - /* - Build a SimClustersEmbeddingStore which wraps all stores in DeciderableReadableStore - */ - def buildWithDecider( - underlyingStores: Map[ - (EmbeddingType, ModelVersion), - ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ], - decider: Decider, - statsReceiver: StatsReceiver - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - // To allow for lazy adding of decider config to enable / disable stores, if a value is not found - // fall back on returning true (equivalent to availability of 10000) - // This overrides default availability of 0 when not decider value is not found - val deciderGateBuilder = new DeciderGateBuilderWithIdHashing(decider.orElse(Decider.True)) - - val deciderKeyEnum = new DeciderKeyEnum { - underlyingStores.keySet.map(key => Value(s"enable_${key._1.name}_${key._2.name}")) - } - - def wrapStore( - embeddingType: EmbeddingType, - modelVersion: ModelVersion, - store: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - val gate = deciderGateBuilder.idGateWithHashing[SimClustersEmbeddingId]( - deciderKeyEnum.withName(s"enable_${embeddingType.name}_${modelVersion.name}")) - - DeciderableReadableStore( - underlying = store, - gate = gate, - statsReceiver = statsReceiver.scope(embeddingType.name, modelVersion.name) - ) - } - - val stores = underlyingStores.map { - case ((embeddingType, modelVersion), store) => - (embeddingType, modelVersion) -> wrapStore(embeddingType, modelVersion, store) - } - - new SimClustersEmbeddingStore(stores = stores) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/SimClustersMultiEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/stores/SimClustersMultiEmbeddingStore.scala deleted file mode 100644 index 0a520439e..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/SimClustersMultiEmbeddingStore.scala +++ /dev/null @@ -1,74 +0,0 @@ -package com.twitter.simclusters_v2.stores - -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.SimClustersMultiEmbeddingId._ -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersMultiEmbedding, - SimClustersEmbeddingId, - SimClustersMultiEmbeddingId -} -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future - -/** - * The helper methods for SimClusters Multi-Embedding based ReadableStore - */ -object SimClustersMultiEmbeddingStore { - - /** - * Only support the Values based Multi-embedding transformation. 
- */ - case class SimClustersMultiEmbeddingWrapperStore( - sourceStore: ReadableStore[SimClustersMultiEmbeddingId, SimClustersMultiEmbedding]) - extends ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] { - - override def get(k: SimClustersEmbeddingId): Future[Option[SimClustersEmbedding]] = { - sourceStore.get(toMultiEmbeddingId(k)).map(_.map(toSimClustersEmbedding(k, _))) - } - - // Override the multiGet for better batch performance. - override def multiGet[K1 <: SimClustersEmbeddingId]( - ks: Set[K1] - ): Map[K1, Future[Option[SimClustersEmbedding]]] = { - if (ks.isEmpty) { - Map.empty - } else { - // Aggregate multiple get requests by MultiEmbeddingId - val multiEmbeddingIds = ks.map { k => - k -> toMultiEmbeddingId(k) - }.toMap - - val multiEmbeddings = sourceStore.multiGet(multiEmbeddingIds.values.toSet) - ks.map { k => - k -> multiEmbeddings(multiEmbeddingIds(k)).map(_.map(toSimClustersEmbedding(k, _))) - }.toMap - } - } - - private def toSimClustersEmbedding( - id: SimClustersEmbeddingId, - multiEmbedding: SimClustersMultiEmbedding - ): SimClustersEmbedding = { - multiEmbedding match { - case SimClustersMultiEmbedding.Values(values) => - val subId = toSubId(id) - if (subId >= values.embeddings.size) { - throw new IllegalArgumentException( - s"SimClustersMultiEmbeddingId $id is over the size of ${values.embeddings.size}") - } else { - values.embeddings(subId).embedding - } - case _ => - throw new IllegalArgumentException( - s"Invalid SimClustersMultiEmbedding $id, $multiEmbedding") - } - } - } - - def toSimClustersEmbeddingStore( - sourceStore: ReadableStore[SimClustersMultiEmbeddingId, SimClustersMultiEmbedding] - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - SimClustersMultiEmbeddingWrapperStore(sourceStore) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/TopicTopProducersStore.scala b/src/scala/com/twitter/simclusters_v2/stores/TopicTopProducersStore.scala deleted file mode 100644 index c733ed157..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/TopicTopProducersStore.scala +++ /dev/null @@ -1,87 +0,0 @@ -package com.twitter.simclusters_v2.stores - -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.recos.entities.thriftscala.{SemanticCoreEntityWithLocale, UserScoreList} -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.{Athena, ManhattanRO, ManhattanROConfig} -import com.twitter.storehaus_internal.util.{ApplicationID, DatasetName, HDFSPath} - -object TopicTopProducersStore { - val appIdDevel = "recos_platform_dev" - val v2DatasetNameDevel = "topic_producers_em" - val v3DatasetNameDevel = "topic_producers_agg" - val v4DatasetNameDevel = "topic_producers_em_erg" - - val appIdProd = "simclusters_v2" - val v1DatasetNameProd = "top_producers_for_topic_from_topic_follow_graph" - val v2DatasetNameProd = "top_producers_for_topic_em" - - implicit val keyInj = CompactScalaCodec(SemanticCoreEntityWithLocale) - implicit val valInj = CompactScalaCodec(UserScoreList) - - def getTopicTopProducerStoreV1Prod( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[SemanticCoreEntityWithLocale, UserScoreList] = - ManhattanRO.getReadableStoreWithMtls[SemanticCoreEntityWithLocale, UserScoreList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appIdProd), - DatasetName(v1DatasetNameProd), - Athena - ), - mhMtlsParams - ) - - def getTopicTopProducerStoreV2Devel( - mhMtlsParams: 
ManhattanKVClientMtlsParams - ): ReadableStore[SemanticCoreEntityWithLocale, UserScoreList] = - ManhattanRO.getReadableStoreWithMtls[SemanticCoreEntityWithLocale, UserScoreList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appIdDevel), - DatasetName(v2DatasetNameDevel), - Athena - ), - mhMtlsParams - ) - - def getTopicTopProducerStoreV2Prod( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[SemanticCoreEntityWithLocale, UserScoreList] = - ManhattanRO.getReadableStoreWithMtls[SemanticCoreEntityWithLocale, UserScoreList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appIdProd), - DatasetName(v2DatasetNameProd), - Athena - ), - mhMtlsParams - ) - - def getTopicTopProducerStoreV3Devel( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[SemanticCoreEntityWithLocale, UserScoreList] = - ManhattanRO.getReadableStoreWithMtls[SemanticCoreEntityWithLocale, UserScoreList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appIdDevel), - DatasetName(v3DatasetNameDevel), - Athena - ), - mhMtlsParams - ) - - def getTopicTopProducerStoreV4Devel( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[SemanticCoreEntityWithLocale, UserScoreList] = - ManhattanRO.getReadableStoreWithMtls[SemanticCoreEntityWithLocale, UserScoreList]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appIdDevel), - DatasetName(v4DatasetNameDevel), - Athena - ), - mhMtlsParams - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/stores/WtfMbcgStore.scala b/src/scala/com/twitter/simclusters_v2/stores/WtfMbcgStore.scala deleted file mode 100644 index 471d4bf2b..000000000 --- a/src/scala/com/twitter/simclusters_v2/stores/WtfMbcgStore.scala +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.simclusters_v2.stores - -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection.{ - Long2BigEndian, - ScalaBinaryThrift -} -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.{Apollo, ManhattanRO, ManhattanROConfig} -import com.twitter.storehaus_internal.util.{ApplicationID, DatasetName, HDFSPath} -import com.twitter.wtf.candidate.thriftscala.CandidateSeq - -object WtfMbcgStore { - - val appId = "recos_platform_apollo" - - implicit val keyInj = Long2BigEndian - implicit val valInj = ScalaBinaryThrift(CandidateSeq) - - def getWtfMbcgStore( - mhMtlsParams: ManhattanKVClientMtlsParams, - datasetName: String - ): ReadableStore[Long, CandidateSeq] = { - ManhattanRO.getReadableStoreWithMtls[Long, CandidateSeq]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName(datasetName), - Apollo - ), - mhMtlsParams - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/BUILD b/src/scala/com/twitter/simclusters_v2/summingbird/BUILD deleted file mode 100644 index f01857d26..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/BUILD +++ /dev/null @@ -1,118 +0,0 @@ -scala_library( - name = "common", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/summingbird/common", - ], -) - -scala_library( - name = "stores", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/summingbird/stores", - ], -) - -scala_library( - name = "webservice", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/simclusters_v2/summingbird/webservice", - 
"twitter-server/slf4j-jdk14/src/main/scala/com/twitter/server/logging", - ], -) - -heron_binary( - name = "tweet-simclusters-storm-binary", - main = "com.twitter.simclusters_v2.summingbird.storm.TweetJobRunner", - platform = "java8", - runtime_platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":common", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "src/scala/com/twitter/simclusters_v2/summingbird/storm", - ], -) - -jvm_app( - name = "tweet-simclusters-storm-job", - binary = ":tweet-simclusters-storm-binary", - bundles = [ - bundle( - fileset = ["config/jaas.conf"], - ), - ], - tags = ["bazel-compatible"], -) - -heron_binary( - name = "persistent-tweet-simclusters-storm-binary", - main = "com.twitter.simclusters_v2.summingbird.storm.PersistentTweetJobRunner", - platform = "java8", - runtime_platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":common", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "src/scala/com/twitter/simclusters_v2/summingbird/storm", - ], -) - -jvm_app( - name = "persistent-tweet-simclusters-storm-job", - binary = ":persistent-tweet-simclusters-storm-binary", - bundles = [ - bundle( - fileset = ["config/jaas.conf"], - ), - ], - tags = ["bazel-compatible"], -) - -heron_binary( - name = "multi-model-tweet-simclusters-storm-binary", - main = "com.twitter.simclusters_v2.summingbird.storm.MultiModelTweetJobRunner", - platform = "java8", - runtime_platform = "java8", - dependencies = [ - ":common", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - "src/scala/com/twitter/simclusters_v2/summingbird/storm", - ], -) - -jvm_app( - name = "multi-model-tweet-simclusters-storm-job", - binary = ":multi-model-tweet-simclusters-storm-binary", - bundles = [ - bundle( - fileset = ["config/jaas.conf"], - ), - ], -) - -jvm_binary( - name = "repl", - basename = "repl-simclusters_v2", - main = "scala.tools.nsc.MainGenericRunner", - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":common", - "3rdparty/jvm/org/scala-lang:scala-compiler", - ], -) - -target( - dependencies = [ - ":common", - ":repl", - ":stores", - ":webservice", - "src/scala/com/twitter/simclusters_v2/summingbird/storm", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/README.md b/src/scala/com/twitter/simclusters_v2/summingbird/README.md deleted file mode 100644 index 026df3a26..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/README.md +++ /dev/null @@ -1,4 +0,0 @@ -Simclusters v2 Online Tweet Embedding Pipeline -============================================== - -The Heron jobs generate the tweet embedding and index of tweets for SimClusters, as well as persistenting the tweet embeddings from MemCache into Manhattan. 
\ No newline at end of file diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/BUILD b/src/scala/com/twitter/simclusters_v2/summingbird/common/BUILD deleted file mode 100644 index 0912b12fe..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/BUILD +++ /dev/null @@ -1,62 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/bijection:core", - "3rdparty/jvm/com/twitter/bijection:util", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/summingbird:client", - "cuad/projects/ner/client", - "cuad/projects/ner/thrift/src/main/thrift:thrift-scala", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/scala/com/twitter/algebird_internal/injection", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/storehaus_internal/manhattan", - "src/scala/com/twitter/storehaus_internal/manhattan/config", - "src/scala/com/twitter/storehaus_internal/memcache", - "src/scala/com/twitter/storehaus_internal/memcache/config", - "src/scala/com/twitter/storehaus_internal/offline", - "src/scala/com/twitter/storehaus_internal/online", - "src/scala/com/twitter/storehaus_internal/util", - "src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits", - "src/scala/com/twitter/summingbird_internal/runner/store_config", - "src/scala/com/twitter/taxi/util/text", - "src/scala/com/twitter/wtf/summingbird/sources/common", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/timelineservice/server/internal:thrift-scala", - "src/thrift/com/twitter/tweetypie:tweet-scala", - "src/thrift/com/twitter/wtf/interest:interest-thrift-scala", - "stitch/stitch-core", - "stitch/stitch-storehaus/src/main/scala", - ], -) - -## smaller build target for external usage -scala_library( - name = "util", - sources = [ - "Configs.scala", - "Implicits.scala", - "ModelVersionProfile.scala", - "Monoids.scala", - "ThriftDecayedValueMonoid.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/bijection:core", - "3rdparty/jvm/com/twitter/bijection:util", - "3rdparty/src/jvm/com/twitter/summingbird:batch", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "src/scala/com/twitter/algebird_internal/injection", - "src/scala/com/twitter/simclusters_v2/common", - "src/thrift/com/twitter/recos/entities:entities-thrift-scala", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/tweetypie:tweet-scala", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/ClientConfigs.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/ClientConfigs.scala deleted file mode 100644 index d288ad692..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/ClientConfigs.scala +++ /dev/null @@ -1,81 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.thrift.ClientId -import com.twitter.storehaus_internal.memcache.ConnectionConfig -import com.twitter.storehaus_internal.memcache.MemcacheConfig -import 
com.twitter.storehaus_internal.util.KeyPrefix -import com.twitter.storehaus_internal.util.TTL -import com.twitter.strato.client.Strato -import com.twitter.strato.client.{Client => StratoClient} - -object ClientConfigs { - - com.twitter.server.Init() // necessary in order to use WilyNS path - - final lazy val simClustersCoreAltCachePath = - "/srv#/prod/local/cache/simclusters_core_alt" - - final lazy val simClustersCoreAltLightCachePath = - "/srv#/prod/local/cache/simclusters_core_alt_light" - - final lazy val develSimClustersCoreCachePath = - "/srv#/test/local/cache/twemcache_simclusters_core" - - final lazy val develSimClustersCoreLightCachePath = - "/srv#/test/local/cache/twemcache_simclusters_core_light" - - final lazy val logFavBasedTweet20M145K2020StratoPath = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145K2020Persistent" - - final lazy val logFavBasedTweet20M145K2020UncachedStratoPath = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145K2020-UNCACHED" - - final lazy val develLogFavBasedTweet20M145K2020StratoPath = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145K2020Devel" - - final lazy val entityClusterScoreMemcacheConfig: (String, ServiceIdentifier) => MemcacheConfig = { - (path: String, serviceIdentifier: ServiceIdentifier) => - new MemcacheConfig { - val connectionConfig: ConnectionConfig = ConnectionConfig(path, serviceIdentifier = serviceIdentifier) - override val keyPrefix: KeyPrefix = KeyPrefix(s"ecs_") - override val ttl: TTL = TTL(8.hours) - } - } - - // note: this should in dedicated cache for tweet - final lazy val tweetTopKClustersMemcacheConfig: (String, ServiceIdentifier) => MemcacheConfig = { - (path: String, serviceIdentifier: ServiceIdentifier) => - new MemcacheConfig { - val connectionConfig: ConnectionConfig = - ConnectionConfig(path, serviceIdentifier = serviceIdentifier) - override val keyPrefix: KeyPrefix = KeyPrefix(s"etk_") - override val ttl: TTL = TTL(2.days) - } - } - - // note: this should in dedicated cache for tweet - final lazy val clusterTopTweetsMemcacheConfig: (String, ServiceIdentifier) => MemcacheConfig = { - (path: String, serviceIdentifier: ServiceIdentifier) => - new MemcacheConfig { - val connectionConfig: ConnectionConfig = - ConnectionConfig(path, serviceIdentifier = serviceIdentifier) - override val keyPrefix: KeyPrefix = KeyPrefix(s"ctkt_") - override val ttl: TTL = TTL(8.hours) - } - } - - final lazy val stratoClient: ServiceIdentifier => StratoClient = { serviceIdentifier => - Strato.client - .withRequestTimeout(2.seconds) - .withMutualTls(serviceIdentifier) - .build() - } - - // thrift client id - private final lazy val thriftClientId: String => ClientId = { env: String => - ClientId(s"simclusters_v2_summingbird.$env") - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/Configs.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/Configs.scala deleted file mode 100644 index d769330f0..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/Configs.scala +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.conversions.DurationOps._ -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.util.Duration - -object Configs { - - final val role = "cassowary" - - final val ZoneAtla: String = "atla" - - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ModelVersion20M145KDec11: String = "20M_145K_dec11" - @deprecated("Use 
'common/ModelVersions'", "2019-09-04") - final val ModelVersion20M145KUpdated: String = "20M_145K_updated" - final val ModelVersion20M145K2020: String = "20M_145K_2020" - - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ModelVersionMap: Map[String, ModelVersion] = Map( - ModelVersion20M145KDec11 -> ModelVersion.Model20m145kDec11, - ModelVersion20M145KUpdated -> ModelVersion.Model20m145kUpdated, - ModelVersion20M145K2020 -> ModelVersion.Model20m145k2020 - ) - - final val favScoreThresholdForUserInterest: String => Double = { - case ModelVersion20M145KDec11 => 0.15 - case ModelVersion20M145KUpdated => 1.0 - case ModelVersion20M145K2020 => 0.3 - case modelVersionStr => throw new Exception(s"$modelVersionStr is not a valid model") - } - - @deprecated("Use 'common/ModelVersions'", "2019-09-04") - final val ReversedModelVersionMap = ModelVersionMap.map(_.swap) - - final val batchesToKeep: Int = 1 - - final val HalfLife: Duration = 8.hours - final val HalfLifeInMs: Long = HalfLife.inMilliseconds - - final val topKTweetsPerCluster: Int = 1600 - - final val topKClustersPerEntity: Int = 50 - - // the config used in offline job only - final val topKClustersPerTweet: Int = 400 - - // minimum score to save clusterIds in entityTopKClusters cache - // entity includes entities other than tweetId. - final val scoreThresholdForEntityTopKClustersCache: Double = 0.02 - - // minimum score to save clusterIds in tweetTopKClusters cache - final val scoreThresholdForTweetTopKClustersCache: Double = 0.02 - - // minimum score to save tweetIds in clusterTopKTweets cache - final val scoreThresholdForClusterTopKTweetsCache: Double = 0.001 - - // minimum score to save entities in clusterTopKEntities cache - final val scoreThresholdForClusterTopKEntitiesCache: Double = 0.001 - - final val MinFavoriteCount = 8 - - final val OldestTweetInLightIndexInMillis = 1.hours.inMillis - - final val OldestTweetFavEventTimeInMillis = 3.days.inMillis - - final val FirstUpdateValue = 1 - - final val TempUpdateValue = -1 -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/EntityUtil.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/EntityUtil.scala deleted file mode 100644 index 4e4bbd7e7..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/EntityUtil.scala +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.cuad.ner.thriftscala.WholeEntityType -import com.twitter.simclusters_v2.summingbird.common.Implicits.thriftDecayedValueMonoid -import com.twitter.simclusters_v2.thriftscala.{Scores, SimClusterEntity, TweetTextEntity} -import scala.collection.Map - -private[summingbird] object EntityUtil { - - def updateScoreWithLatestTimestamp[K]( - scoresMapOption: Option[Map[K, Scores]], - timeInMs: Long - ): Option[Map[K, Scores]] = { - scoresMapOption map { scoresMap => - scoresMap.mapValues(score => updateScoreWithLatestTimestamp(score, timeInMs)) - } - } - - def updateScoreWithLatestTimestamp(score: Scores, timeInMs: Long): Scores = { - score.copy( - favClusterNormalized8HrHalfLifeScore = score.favClusterNormalized8HrHalfLifeScore.map { - decayedValue => thriftDecayedValueMonoid.decayToTimestamp(decayedValue, timeInMs) - }, - followClusterNormalized8HrHalfLifeScore = score.followClusterNormalized8HrHalfLifeScore.map { - decayedValue => thriftDecayedValueMonoid.decayToTimestamp(decayedValue, timeInMs) - } - ) - } - - def entityToString(entity: SimClusterEntity): String = { - entity match { - case 
SimClusterEntity.TweetId(id) => s"t_id:$id" - case SimClusterEntity.SpaceId(id) => s"space_id:$id" - case SimClusterEntity.TweetEntity(textEntity) => - textEntity match { - case TweetTextEntity.Hashtag(str) => s"$str[h_tag]" - case TweetTextEntity.Penguin(penguin) => - s"${penguin.textEntity}[penguin]" - case TweetTextEntity.Ner(ner) => - s"${ner.textEntity}[ner_${WholeEntityType(ner.wholeEntityType)}]" - case TweetTextEntity.SemanticCore(semanticCore) => - s"[sc:${semanticCore.entityId}]" - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/Implicits.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/Implicits.scala deleted file mode 100644 index 79235573f..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/Implicits.scala +++ /dev/null @@ -1,140 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.algebird.DecayedValueMonoid -import com.twitter.algebird.Monoid -import com.twitter.algebird_internal.injection.AlgebirdImplicits -import com.twitter.algebird_internal.thriftscala.{DecayedValue => ThriftDecayedValue} -import com.twitter.bijection.Bufferable -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.summingbird.common.Monoids.ClustersWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.MultiModelClustersWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.MultiModelPersistentSimClustersEmbeddingLongestL2NormMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.MultiModelPersistentSimClustersEmbeddingMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.MultiModelTopKTweetsWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.PersistentSimClustersEmbeddingLongestL2NormMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.PersistentSimClustersEmbeddingMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.ScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.TopKClustersWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.Monoids.TopKTweetsWithScoresMonoid -import com.twitter.simclusters_v2.thriftscala.FullClusterIdBucket -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.summingbird.batch.Batcher -import com.twitter.tweetypie.thriftscala.StatusCounts - -object Implicits { - - // -------------------- Monoids -------------------- // - implicit val decayedValueMonoid: DecayedValueMonoid = DecayedValueMonoid(0.0) - - implicit val thriftDecayedValueMonoid: ThriftDecayedValueMonoid = - new ThriftDecayedValueMonoid(Configs.HalfLifeInMs)(decayedValueMonoid) - - implicit val scoresMonoid: ScoresMonoid = new Monoids.ScoresMonoid() - - implicit val clustersWithScoreMonoid: ClustersWithScoresMonoid = - new Monoids.ClustersWithScoresMonoid()(scoresMonoid) - - implicit val multiModelClustersWithScoresMonoid: Monoid[MultiModelClustersWithScores] = - new MultiModelClustersWithScoresMonoid() - - implicit val topKClustersWithScoresMonoid: Monoid[TopKClustersWithScores] = - new TopKClustersWithScoresMonoid( - Configs.topKClustersPerEntity, - Configs.scoreThresholdForEntityTopKClustersCache - )(thriftDecayedValueMonoid) - - implicit val topKTweetsWithScoresMonoid: Monoid[TopKTweetsWithScores] = - new TopKTweetsWithScoresMonoid( - Configs.topKTweetsPerCluster, - Configs.scoreThresholdForClusterTopKTweetsCache, - 
Configs.OldestTweetFavEventTimeInMillis - )(thriftDecayedValueMonoid) - - implicit val topKTweetsWithScoresLightMonoid: Monoid[TopKTweetsWithScores] = - new TopKTweetsWithScoresMonoid( - Configs.topKTweetsPerCluster, - Configs.scoreThresholdForClusterTopKTweetsCache, - Configs.OldestTweetInLightIndexInMillis - )(thriftDecayedValueMonoid) - - implicit val MultiModeltopKTweetsWithScoresMonoid: Monoid[MultiModelTopKTweetsWithScores] = - new MultiModelTopKTweetsWithScoresMonoid( - )(thriftDecayedValueMonoid) - - implicit val persistentSimClustersEmbeddingMonoid: Monoid[PersistentSimClustersEmbedding] = - new PersistentSimClustersEmbeddingMonoid() - - implicit val persistentSimClustersEmbeddingLongestL2NormMonoid: Monoid[ - PersistentSimClustersEmbedding - ] = - new PersistentSimClustersEmbeddingLongestL2NormMonoid() - - implicit val multiModelPersistentSimClustersEmbeddingMonoid: Monoid[ - MultiModelPersistentSimClustersEmbedding - ] = - new MultiModelPersistentSimClustersEmbeddingMonoid() - - implicit val multiModelPersistentSimClustersEmbeddingLongestL2NormMonoid: Monoid[ - MultiModelPersistentSimClustersEmbedding - ] = new MultiModelPersistentSimClustersEmbeddingLongestL2NormMonoid() - - // -------------------- Codecs -------------------- // - implicit val longIntPairCodec: Injection[(Long, Int), Array[Byte]] = - Bufferable.injectionOf[(Long, Int)] - - implicit val simClusterEntityCodec: Injection[SimClusterEntity, Array[Byte]] = - CompactScalaCodec(SimClusterEntity) - - implicit val fullClusterIdBucket: Injection[FullClusterIdBucket, Array[Byte]] = - CompactScalaCodec(FullClusterIdBucket) - - implicit val clustersWithScoresCodec: Injection[ClustersWithScores, Array[Byte]] = - CompactScalaCodec(ClustersWithScores) - - implicit val topKClustersKeyCodec: Injection[EntityWithVersion, Array[Byte]] = - CompactScalaCodec(EntityWithVersion) - - implicit val topKClustersWithScoresCodec: Injection[TopKClustersWithScores, Array[Byte]] = - CompactScalaCodec(TopKClustersWithScores) - - implicit val fullClusterIdCodec: Injection[FullClusterId, Array[Byte]] = - CompactScalaCodec(FullClusterId) - - implicit val topKEntitiesWithScoresCodec: Injection[TopKEntitiesWithScores, Array[Byte]] = - CompactScalaCodec(TopKEntitiesWithScores) - - implicit val topKTweetsWithScoresCodec: Injection[TopKTweetsWithScores, Array[Byte]] = - CompactScalaCodec(TopKTweetsWithScores) - - implicit val pairedArrayBytesCodec: Injection[(Array[Byte], Array[Byte]), Array[Byte]] = - Bufferable.injectionOf[(Array[Byte], Array[Byte])] - - implicit val entityWithClusterInjection: Injection[(SimClusterEntity, FullClusterIdBucket), Array[ - Byte - ]] = - Injection - .connect[(SimClusterEntity, FullClusterIdBucket), (Array[Byte], Array[Byte]), Array[Byte]] - - implicit val topKClustersCodec: Injection[TopKClusters, Array[Byte]] = - CompactScalaCodec(TopKClusters) - - implicit val topKTweetsCodec: Injection[TopKTweets, Array[Byte]] = - CompactScalaCodec(TopKTweets) - - implicit val simClustersEmbeddingCodec: Injection[SimClustersEmbedding, Array[Byte]] = - CompactScalaCodec(SimClustersEmbedding) - - implicit val persistentSimClustersEmbeddingCodec: Injection[PersistentSimClustersEmbedding, Array[ - Byte - ]] = - CompactScalaCodec(PersistentSimClustersEmbedding) - - implicit val statusCountsCodec: Injection[StatusCounts, Array[Byte]] = - CompactScalaCodec(StatusCounts) - - implicit val thriftDecayedValueCodec: Injection[ThriftDecayedValue, Array[Byte]] = - AlgebirdImplicits.decayedValueCodec - - implicit val batcher: Batcher = Batcher.unit 
-} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/ModelVersionProfile.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/ModelVersionProfile.scala deleted file mode 100644 index ad2c56386..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/ModelVersionProfile.scala +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.util.Duration -import com.twitter.conversions.DurationOps._ -import com.twitter.simclusters_v2.thriftscala.ModelVersion - -case class ModelVersionProfile( - modelVersion: ModelVersion, - usingLogFavScore: Boolean, - // redundant in the current models because the above parameter does the same currently. - coreEmbeddingType: EmbeddingType, - favScoreThresholdForUserInterest: Double, - // these values are shared between all profiles so lets set up defaults - halfLife: Duration = 8.hours, - scoreThresholdForEntityTopKClustersCache: Double = 0.2, - scoreThresholdForTweetTopKClustersCache: Double = 0.02, - scoreThresholdForClusterTopKTweetsCache: Double = 0.001, - scoreThresholdForClusterTopKEntitiesCache: Double = 0.001) - -object ModelVersionProfiles { - final val ModelVersion20M145KUpdated = ModelVersionProfile( - ModelVersion.Model20m145kUpdated, - usingLogFavScore = true, - coreEmbeddingType = EmbeddingType.LogFavBasedTweet, - favScoreThresholdForUserInterest = 1.0 - ) - - final val ModelVersion20M145K2020 = ModelVersionProfile( - ModelVersion.Model20m145k2020, - usingLogFavScore = true, - coreEmbeddingType = EmbeddingType.LogFavBasedTweet, - favScoreThresholdForUserInterest = 0.3 - ) - - final val ModelVersionProfiles: Map[ModelVersion, ModelVersionProfile] = Map( - ModelVersion.Model20m145kUpdated -> ModelVersion20M145KUpdated, - ModelVersion.Model20m145k2020 -> ModelVersion20M145K2020 - ) -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/Monoids.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/Monoids.scala deleted file mode 100644 index 34dd27586..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/Monoids.scala +++ /dev/null @@ -1,478 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.algebird.DecayedValue -import com.twitter.algebird.Monoid -import com.twitter.algebird.OptionMonoid -import com.twitter.algebird.ScMapMonoid -import com.twitter.algebird_internal.thriftscala.{DecayedValue => ThriftDecayedValue} -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.ClustersWithScores -import com.twitter.simclusters_v2.thriftscala.MultiModelClustersWithScores -import com.twitter.simclusters_v2.thriftscala.MultiModelTopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.MultiModelPersistentSimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.Scores -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingMetadata -import com.twitter.simclusters_v2.thriftscala.TopKClustersWithScores -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import com.twitter.snowflake.id.SnowflakeId -import scala.collection.mutable - -/** - * Contains various monoids used in the EntityJob - */ -object 
Monoids { - - class ScoresMonoid(implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) - extends Monoid[Scores] { - - private val optionalThriftDecayedValueMonoid = - new OptionMonoid[ThriftDecayedValue]() - - override val zero: Scores = Scores() - - override def plus(x: Scores, y: Scores): Scores = { - Scores( - optionalThriftDecayedValueMonoid.plus( - x.favClusterNormalized8HrHalfLifeScore, - y.favClusterNormalized8HrHalfLifeScore - ), - optionalThriftDecayedValueMonoid.plus( - x.followClusterNormalized8HrHalfLifeScore, - y.followClusterNormalized8HrHalfLifeScore - ) - ) - } - } - - class ClustersWithScoresMonoid(implicit scoresMonoid: ScoresMonoid) - extends Monoid[ClustersWithScores] { - - private val optionMapMonoid = - new OptionMonoid[collection.Map[Int, Scores]]()(new ScMapMonoid[Int, Scores]()) - - override val zero: ClustersWithScores = ClustersWithScores() - - override def plus(x: ClustersWithScores, y: ClustersWithScores): ClustersWithScores = { - ClustersWithScores( - optionMapMonoid.plus(x.clustersToScore, y.clustersToScore) - ) - } - } - - class MultiModelClustersWithScoresMonoid(implicit scoresMonoid: ScoresMonoid) - extends Monoid[MultiModelClustersWithScores] { - - override val zero: MultiModelClustersWithScores = MultiModelClustersWithScores() - - override def plus( - x: MultiModelClustersWithScores, - y: MultiModelClustersWithScores - ): MultiModelClustersWithScores = { - // We reuse the logic from the Monoid for the Value here - val clustersWithScoreMonoid = Implicits.clustersWithScoreMonoid - - MultiModelClustersWithScores( - MultiModelUtils.mergeTwoMultiModelMaps( - x.multiModelClustersWithScores, - y.multiModelClustersWithScores, - clustersWithScoreMonoid)) - } - } - - class TopKClustersWithScoresMonoid( - topK: Int, - threshold: Double - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) - extends Monoid[TopKClustersWithScores] { - - override val zero: TopKClustersWithScores = TopKClustersWithScores() - - override def plus( - x: TopKClustersWithScores, - y: TopKClustersWithScores - ): TopKClustersWithScores = { - - val mergedFavMap = TopKScoresUtils - .mergeTwoTopKMapWithDecayedValues( - x.topClustersByFavClusterNormalizedScore - .map(_.mapValues( - _.favClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - y.topClustersByFavClusterNormalizedScore - .map(_.mapValues( - _.favClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - topK, - threshold - ).map(_.mapValues(decayedValue => - Scores(favClusterNormalized8HrHalfLifeScore = Some(decayedValue)))) - - val mergedFollowMap = TopKScoresUtils - .mergeTwoTopKMapWithDecayedValues( - x.topClustersByFollowClusterNormalizedScore - .map(_.mapValues( - _.followClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - y.topClustersByFollowClusterNormalizedScore - .map(_.mapValues( - _.followClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - topK, - threshold - ).map(_.mapValues(decayedValue => - Scores(followClusterNormalized8HrHalfLifeScore = Some(decayedValue)))) - - TopKClustersWithScores( - mergedFavMap, - mergedFollowMap - ) - } - } - class TopKTweetsWithScoresMonoid( - topK: Int, - threshold: Double, - tweetAgeThreshold: Long - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) - extends Monoid[TopKTweetsWithScores] { - - override val zero: TopKTweetsWithScores = TopKTweetsWithScores() - - override def plus(x: TopKTweetsWithScores, y: TopKTweetsWithScores): TopKTweetsWithScores = { 
- val oldestTweetId = SnowflakeId.firstIdFor(System.currentTimeMillis() - tweetAgeThreshold) - - val mergedFavMap = TopKScoresUtils - .mergeTwoTopKMapWithDecayedValues( - x.topTweetsByFavClusterNormalizedScore - .map(_.mapValues( - _.favClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - y.topTweetsByFavClusterNormalizedScore - .map(_.mapValues( - _.favClusterNormalized8HrHalfLifeScore.getOrElse(thriftDecayedValueMonoid.zero))), - topK, - threshold - ).map(_.filter(_._1 >= oldestTweetId).mapValues(decayedValue => - Scores(favClusterNormalized8HrHalfLifeScore = Some(decayedValue)))) - - TopKTweetsWithScores(mergedFavMap, None) - } - } - - class MultiModelTopKTweetsWithScoresMonoid( - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) - extends Monoid[MultiModelTopKTweetsWithScores] { - override val zero: MultiModelTopKTweetsWithScores = MultiModelTopKTweetsWithScores() - - override def plus( - x: MultiModelTopKTweetsWithScores, - y: MultiModelTopKTweetsWithScores - ): MultiModelTopKTweetsWithScores = { - // We reuse the logic from the Monoid for the Value here - val topKTweetsWithScoresMonoid = Implicits.topKTweetsWithScoresMonoid - - MultiModelTopKTweetsWithScores( - MultiModelUtils.mergeTwoMultiModelMaps( - x.multiModelTopKTweetsWithScores, - y.multiModelTopKTweetsWithScores, - topKTweetsWithScoresMonoid)) - } - - } - - /** - * Merge two PersistentSimClustersEmbedding. The latest embedding overwrite the old embedding. - * The new count equals to the sum of the count. - */ - class PersistentSimClustersEmbeddingMonoid extends Monoid[PersistentSimClustersEmbedding] { - - override val zero: PersistentSimClustersEmbedding = PersistentSimClustersEmbedding( - ThriftSimClustersEmbedding(), - SimClustersEmbeddingMetadata() - ) - - private val optionLongMonoid = new OptionMonoid[Long]() - - override def plus( - x: PersistentSimClustersEmbedding, - y: PersistentSimClustersEmbedding - ): PersistentSimClustersEmbedding = { - val latest = - if (x.metadata.updatedAtMs.getOrElse(0L) > y.metadata.updatedAtMs.getOrElse(0L)) x else y - latest.copy( - metadata = latest.metadata.copy( - updatedCount = optionLongMonoid.plus(x.metadata.updatedCount, y.metadata.updatedCount))) - } - } - - class MultiModelPersistentSimClustersEmbeddingMonoid - extends Monoid[MultiModelPersistentSimClustersEmbedding] { - - override val zero: MultiModelPersistentSimClustersEmbedding = - MultiModelPersistentSimClustersEmbedding(Map[ModelVersion, PersistentSimClustersEmbedding]()) - - override def plus( - x: MultiModelPersistentSimClustersEmbedding, - y: MultiModelPersistentSimClustersEmbedding - ): MultiModelPersistentSimClustersEmbedding = { - val monoid = Implicits.persistentSimClustersEmbeddingMonoid - - // PersistentSimClustersEmbeddings is the only required thrift object so we need to wrap it - // in Some - MultiModelUtils.mergeTwoMultiModelMaps( - Some(x.multiModelPersistentSimClustersEmbedding), - Some(y.multiModelPersistentSimClustersEmbedding), - monoid) match { - // clean up the empty embeddings - case Some(res) => - MultiModelPersistentSimClustersEmbedding(res.flatMap { - // in some cases the list of SimClustersScore is empty, so we want to remove the - // modelVersion from the list of Models for the embedding - case (modelVersion, persistentSimClustersEmbedding) => - persistentSimClustersEmbedding.embedding.embedding match { - case embedding if embedding.nonEmpty => - Map(modelVersion -> persistentSimClustersEmbedding) - case _ => - None - } - }) - case _ => zero - } - } - } - - 
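  // Editorial note (illustrative, not in the original file): the persistent-embedding monoids
  // resolve conflicts rather than summing scores. With PersistentSimClustersEmbeddingMonoid above,
  // the most recently updated embedding wins wholesale while updatedCount is summed; the
  // longest-L2-norm variant below keeps whichever embedding has the larger l2norm instead.
  // A hypothetical example of the "latest wins, counts sum" behaviour:
  //
  //   val m = new PersistentSimClustersEmbeddingMonoid()
  //   val x = PersistentSimClustersEmbedding(
  //     ThriftSimClustersEmbedding(),
  //     SimClustersEmbeddingMetadata(updatedAtMs = Some(100L), updatedCount = Some(3L)))
  //   val y = PersistentSimClustersEmbedding(
  //     ThriftSimClustersEmbedding(),
  //     SimClustersEmbeddingMetadata(updatedAtMs = Some(200L), updatedCount = Some(5L)))
  //   // m.plus(x, y) keeps y's embedding and returns metadata
  //   // SimClustersEmbeddingMetadata(updatedAtMs = Some(200L), updatedCount = Some(8L))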
/** - * Merge two PersistentSimClustersEmbeddings. The embedding with the longest l2 norm overwrites - * the other embedding. The new count equals to the sum of the count. - */ - class PersistentSimClustersEmbeddingLongestL2NormMonoid - extends Monoid[PersistentSimClustersEmbedding] { - - override val zero: PersistentSimClustersEmbedding = PersistentSimClustersEmbedding( - ThriftSimClustersEmbedding(), - SimClustersEmbeddingMetadata() - ) - - override def plus( - x: PersistentSimClustersEmbedding, - y: PersistentSimClustersEmbedding - ): PersistentSimClustersEmbedding = { - if (SimClustersEmbedding(x.embedding).l2norm >= SimClustersEmbedding(y.embedding).l2norm) x - else y - } - } - - class MultiModelPersistentSimClustersEmbeddingLongestL2NormMonoid - extends Monoid[MultiModelPersistentSimClustersEmbedding] { - - override val zero: MultiModelPersistentSimClustersEmbedding = - MultiModelPersistentSimClustersEmbedding(Map[ModelVersion, PersistentSimClustersEmbedding]()) - - override def plus( - x: MultiModelPersistentSimClustersEmbedding, - y: MultiModelPersistentSimClustersEmbedding - ): MultiModelPersistentSimClustersEmbedding = { - val monoid = Implicits.persistentSimClustersEmbeddingLongestL2NormMonoid - - MultiModelUtils.mergeTwoMultiModelMaps( - Some(x.multiModelPersistentSimClustersEmbedding), - Some(y.multiModelPersistentSimClustersEmbedding), - monoid) match { - // clean up empty embeddings - case Some(res) => - MultiModelPersistentSimClustersEmbedding(res.flatMap { - case (modelVersion, persistentSimClustersEmbedding) => - // in some cases the list of SimClustersScore is empty, so we want to remove the - // modelVersion from the list of Models for the embedding - persistentSimClustersEmbedding.embedding.embedding match { - case embedding if embedding.nonEmpty => - Map(modelVersion -> persistentSimClustersEmbedding) - case _ => - None - } - }) - case _ => zero - } - } - } - - object TopKScoresUtils { - - /** - * Function for merging TopK scores with decayed values. - * - * This is for use with topk scores where all scores are updated at the same time (i.e. most - * time-decayed embedding aggregations). Rather than storing individual scores as algebird.DecayedValue - * and replicating time information for every key, we can store a single timestamp for the entire - * embedding and replicate the decay logic when processing each score. - * - * This should replicate the behaviour of `mergeTwoTopKMapWithDecayedValues` - * - * The logic is: - * - Determine the most recent update and build a DecayedValue for it (decayedValueForLatestTime) - * - For each (cluster, score), decay the score relative to the time of the most-recently updated embedding - * - This is a no-op for scores from the most recently-updated embedding, and will scale scores - * for the older embedding. 
- * - Drop any (cluster, score) which are below the `threshold` score - * - If both input embeddings contribute a score for the same cluster, keep the one with the largest score (after scaling) - * - Sort (cluster, score) by score and keep the `topK` - * - */ - def mergeClusterScoresWithUpdateTimes[Key]( - x: Seq[(Key, Double)], - xUpdatedAtMs: Long, - y: Seq[(Key, Double)], - yUpdatedAtMs: Long, - halfLifeMs: Long, - topK: Int, - threshold: Double - ): Seq[(Key, Double)] = { - val latestUpdate = math.max(xUpdatedAtMs, yUpdatedAtMs) - val decayedValueForLatestTime = DecayedValue.build(0.0, latestUpdate, halfLifeMs) - - val merged = mutable.HashMap[Key, Double]() - - x.foreach { - case (key, score) => - val decayedScore = Implicits.decayedValueMonoid - .plus( - DecayedValue.build(score, xUpdatedAtMs, halfLifeMs), - decayedValueForLatestTime - ).value - if (decayedScore > threshold) - merged += key -> decayedScore - } - - y.foreach { - case (key, score) => - val decayedScore = Implicits.decayedValueMonoid - .plus( - DecayedValue.build(score, yUpdatedAtMs, halfLifeMs), - decayedValueForLatestTime - ).value - if (decayedScore > threshold) - merged.get(key) match { - case Some(existingValue) => - if (decayedScore > existingValue) - merged += key -> decayedScore - case None => - merged += key -> decayedScore - } - } - - merged.toSeq - .sortBy(-_._2) - .take(topK) - } - - /** - * Function for merging to TopK map with decayed values. - * - * First of all, all the values will be decayed to the latest scaled timestamp to be comparable. - * - * If the same key appears at both a and b, the one with larger scaled time (or larger value when - * their scaled times are same) will be taken. The values smaller than the threshold will be dropped. - * - * After merging, if the size is larger than TopK, only scores with topK largest value will be kept. 
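 *
 * Illustrative example (added editorially; the value names are hypothetical): with
 * topK = 2, threshold = 0.001, and an implicit ThriftDecayedValueMonoid in scope,
 * {{{
 *   val a = Some(Map(1 -> ThriftDecayedValue(0.8, olderScaledTime)))
 *   val b = Some(Map(1 -> ThriftDecayedValue(0.5, newerScaledTime),
 *                    2 -> ThriftDecayedValue(0.3, newerScaledTime)))
 *   // Cluster 1's score from `a` is decayed to newerScaledTime first; the larger decayed
 *   // score wins for cluster 1, cluster 2 survives the threshold, and the result is
 *   // truncated to the topK highest scores once it grows past the small buffer noted below.
 *   mergeTwoTopKMapWithDecayedValues(a, b, topK = 2, threshold = 0.001)
 * }}}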
- */ - def mergeTwoTopKMapWithDecayedValues[T]( - a: Option[collection.Map[T, ThriftDecayedValue]], - b: Option[collection.Map[T, ThriftDecayedValue]], - topK: Int, - threshold: Double - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid - ): Option[collection.Map[T, ThriftDecayedValue]] = { - - if (a.isEmpty || a.exists(_.isEmpty)) { - return b - } - - if (b.isEmpty || b.exists(_.isEmpty)) { - return a - } - - val latestScaledTime = (a.get.view ++ b.get.view).map { - case (_, scores) => - scores.scaledTime - }.max - - val decayedValueWithLatestScaledTime = ThriftDecayedValue(0.0, latestScaledTime) - - val merged = mutable.HashMap[T, ThriftDecayedValue]() - - a.foreach { - _.foreach { - case (k, v) => - // decay the value to latest scaled time - val decayedScores = thriftDecayedValueMonoid - .plus(v, decayedValueWithLatestScaledTime) - - // only merge if the value is larger than the threshold - if (decayedScores.value > threshold) { - merged += k -> decayedScores - } - } - } - - b.foreach { - _.foreach { - case (k, v) => - val decayedScores = thriftDecayedValueMonoid - .plus(v, decayedValueWithLatestScaledTime) - - // only merge if the value is larger than the threshold - if (decayedScores.value > threshold) { - if (!merged.contains(k)) { - merged += k -> decayedScores - } else { - // only update if the value is larger than the one already merged - if (decayedScores.value > merged(k).value) { - merged.update(k, decayedScores) - } - } - } - } - } - - // add some buffer size (~ 0.2 * topK) to avoid sorting and taking too frequently - if (merged.size > topK * 1.2) { - Some( - merged.toSeq - .sortBy { case (_, scores) => scores.value * -1 } - .take(topK) - .toMap - ) - } else { - Some(merged) - } - } - } - - object MultiModelUtils { - - /** - * In order to reduce complexity we use the Monoid for the value to plus two MultiModel maps - */ - def mergeTwoMultiModelMaps[T]( - a: Option[collection.Map[ModelVersion, T]], - b: Option[collection.Map[ModelVersion, T]], - monoid: Monoid[T] - ): Option[collection.Map[ModelVersion, T]] = { - (a, b) match { - case (Some(_), None) => a - case (None, Some(_)) => b - case (Some(aa), Some(bb)) => - val res = ModelVersionProfiles.ModelVersionProfiles.foldLeft(Map[ModelVersion, T]()) { - (map, model) => - map + (model._1 -> monoid.plus( - aa.getOrElse(model._1, monoid.zero), - bb.getOrElse(model._1, monoid.zero) - )) - } - Some(res) - case _ => None - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersEmbeddingWithMetadataMonoid.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersEmbeddingWithMetadataMonoid.scala deleted file mode 100644 index 4379eccb9..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersEmbeddingWithMetadataMonoid.scala +++ /dev/null @@ -1,59 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.algebird.{Monoid, OptionMonoid} -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.summingbird.common.Monoids.TopKScoresUtils -import com.twitter.simclusters_v2.thriftscala.{ - SimClustersEmbeddingMetadata, - SimClustersEmbeddingWithMetadata, - SimClustersEmbedding => ThriftSimClustersEmbedding -} - -/** - * Decayed aggregation of embeddings. - * - * When merging 2 embeddings, the older embedding's scores are scaled by time. If a cluster is - * present in both embeddings, the highest score (after scaling) is used in the result. 
- * - * @halfLifeMs - defines how quickly a score decays - * @topK - only the topk clusters with the highest scores are retained in the result - * @threshold - any clusters with weights below threshold are excluded from the result - */ -class SimClustersEmbeddingWithMetadataMonoid( - halfLifeMs: Long, - topK: Int, - threshold: Double) - extends Monoid[SimClustersEmbeddingWithMetadata] { - - override val zero: SimClustersEmbeddingWithMetadata = SimClustersEmbeddingWithMetadata( - ThriftSimClustersEmbedding(), - SimClustersEmbeddingMetadata() - ) - - private val optionLongMonoid = new OptionMonoid[Long]() - private val optionMaxMonoid = - new OptionMonoid[Long]()(com.twitter.algebird.Max.maxSemigroup[Long]) - - override def plus( - x: SimClustersEmbeddingWithMetadata, - y: SimClustersEmbeddingWithMetadata - ): SimClustersEmbeddingWithMetadata = { - - val mergedClusterScores = TopKScoresUtils.mergeClusterScoresWithUpdateTimes( - x = SimClustersEmbedding(x.embedding).embedding, - xUpdatedAtMs = x.metadata.updatedAtMs.getOrElse(0), - y = SimClustersEmbedding(y.embedding).embedding, - yUpdatedAtMs = y.metadata.updatedAtMs.getOrElse(0), - halfLifeMs = halfLifeMs, - topK = topK, - threshold = threshold - ) - SimClustersEmbeddingWithMetadata( - embedding = SimClustersEmbedding(mergedClusterScores).toThrift, - metadata = SimClustersEmbeddingMetadata( - updatedAtMs = optionMaxMonoid.plus(x.metadata.updatedAtMs, y.metadata.updatedAtMs), - updatedCount = optionLongMonoid.plus(x.metadata.updatedCount, y.metadata.updatedCount) - ) - ) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersHashUtil.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersHashUtil.scala deleted file mode 100644 index fff4bb851..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersHashUtil.scala +++ /dev/null @@ -1,14 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -/** - * Provides int to int hash function. Used to batch clusterIds together. 
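 *
 * For example (illustrative; relies on the current numBuckets = 200 below):
 * {{{
 *   SimClustersHashUtil.clusterIdToBucket(4213) // == 4213 % 200 == 13
 *   // so cluster ids 13, 213, 413, ... all land in the same bucket.
 * }}}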
- */ -object SimClustersHashUtil { - def clusterIdToBucket(clusterId: Int): Int = { - clusterId % numBuckets - } - - val numBuckets: Int = 200 - - val getAllBuckets: Seq[Int] = 0.until(numBuckets) -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersInterestedInUtil.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersInterestedInUtil.scala deleted file mode 100644 index 4cd7ff14b..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersInterestedInUtil.scala +++ /dev/null @@ -1,72 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.thriftscala.{ - ClustersUserIsInterestedIn, - ClustersWithScores, - Scores -} - -object SimClustersInterestedInUtil { - - private final val EmptyClustersWithScores = ClustersWithScores() - - case class InterestedInScores( - favScore: Double, - clusterNormalizedFavScore: Double, - clusterNormalizedFollowScore: Double, - clusterNormalizedLogFavScore: Double) - - def topClustersWithScores( - userInterests: ClustersUserIsInterestedIn - ): Seq[(ClusterId, InterestedInScores)] = { - userInterests.clusterIdToScores.toSeq.map { - case (clusterId, scores) => - val favScore = scores.favScore.getOrElse(0.0) - val normalizedFavScore = scores.favScoreClusterNormalizedOnly.getOrElse(0.0) - val normalizedFollowScore = scores.followScoreClusterNormalizedOnly.getOrElse(0.0) - val normalizedLogFavScore = scores.logFavScoreClusterNormalizedOnly.getOrElse(0.0) - - ( - clusterId, - InterestedInScores( - favScore, - normalizedFavScore, - normalizedFollowScore, - normalizedLogFavScore)) - } - } - - def buildClusterWithScores( - clusterScores: Seq[(ClusterId, InterestedInScores)], - timeInMs: Double, - favScoreThresholdForUserInterest: Double - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid - ): ClustersWithScores = { - val scoresMap = clusterScores.collect { - case ( - clusterId, - InterestedInScores( - favScore, - _, - _, - clusterNormalizedLogFavScore)) - // NOTE: the threshold is on favScore, and the computation is on normalizedFavScore - // This threshold reduces the number of unique keys in the cache by 80%, - // based on offline analysis - if favScore >= favScoreThresholdForUserInterest => - - val favClusterNormalized8HrHalfLifeScoreOpt = - Some(thriftDecayedValueMonoid.build(clusterNormalizedLogFavScore, timeInMs)) - - clusterId -> Scores(favClusterNormalized8HrHalfLifeScore = favClusterNormalized8HrHalfLifeScoreOpt) - }.toMap - - if (scoresMap.nonEmpty) { - ClustersWithScores(Some(scoresMap)) - } else { - EmptyClustersWithScores - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersProfile.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersProfile.scala deleted file mode 100644 index ee58bbd67..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/SimClustersProfile.scala +++ /dev/null @@ -1,212 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.summingbird.common.ClientConfigs._ -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.AltSetting.AltSetting -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.Environment.Environment -import 
com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.JobType.JobType -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.AltSetting -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.JobType -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.ModelVersion - -sealed trait SimClustersProfile { - val env: Environment - val alt: AltSetting - val modelVersionStr: String - - lazy val modelVersion: ModelVersion = modelVersionStr -} - -sealed trait SimClustersJobProfile extends SimClustersProfile { - - val jobType: JobType - - final lazy val jobName: String = { - alt match { - case AltSetting.Alt => - s"simclusters_v2_${jobType}_alt_job_$env" - case AltSetting.Esc => - s"simclusters_v2_${jobType}_esc_job_$env" - case _ => - s"simclusters_v2_${jobType}_job_$env" - } - } - - // Build the serviceIdentifier by jobType, env and zone(dc) - final lazy val serviceIdentifier: String => ServiceIdentifier = { zone => - ServiceIdentifier(Configs.role, s"summingbird_$jobName", env.toString, zone) - } - - final lazy val favScoreThresholdForUserInterest: Double = - Configs.favScoreThresholdForUserInterest(modelVersionStr) - - lazy val timelineEventSourceSubscriberId: String = { - val jobTypeStr = jobType match { - case JobType.MultiModelTweet => "multi_model_tweet_" - case JobType.PersistentTweet => "persistent_tweet_" - case JobType.Tweet => "" - } - - val prefix = alt match { - case AltSetting.Alt => - "alt_" - case AltSetting.Esc => - "esc_" - case _ => - "" - } - - s"simclusters_v2_${jobTypeStr}summingbird_$prefix$env" - } - -} - -object SimClustersProfile { - - object JobType extends Enumeration { - type JobType = Value - val Tweet: JobType = Value("tweet") - val PersistentTweet: JobType = Value("persistent_tweet") - val MultiModelTweet: JobType = Value("multimodel_tweet") - } - - object Environment extends Enumeration { - type Environment = Value - val Prod: Environment = Value("prod") - val Devel: Environment = Value("devel") - - def apply(setting: String): Environment = { - if (setting == Prod.toString) { - Prod - } else { - Devel - } - } - } - - object AltSetting extends Enumeration { - type AltSetting = Value - val Normal: AltSetting = Value("normal") - val Alt: AltSetting = Value("alt") - val Esc: AltSetting = Value("esc") - - def apply(setting: String): AltSetting = { - - setting match { - case "alt" => Alt - case "esc" => Esc - case _ => Normal - } - } - } - - case class SimClustersTweetProfile( - env: Environment, - alt: AltSetting, - modelVersionStr: String, - entityClusterScorePath: String, - tweetTopKClustersPath: String, - clusterTopKTweetsPath: String, - coreEmbeddingType: EmbeddingType, - clusterTopKTweetsLightPath: Option[String] = None) - extends SimClustersJobProfile { - - final val jobType: JobType = JobType.Tweet - } - - case class PersistentTweetProfile( - env: Environment, - alt: AltSetting, - modelVersionStr: String, - persistentTweetStratoPath: String, - coreEmbeddingType: EmbeddingType) - extends SimClustersJobProfile { - final val jobType: JobType = JobType.PersistentTweet - } - - final val AltProdTweetJobProfile = SimClustersTweetProfile( - env = Environment.Prod, - alt = AltSetting.Alt, - modelVersionStr = Model20M145K2020, - entityClusterScorePath = simClustersCoreAltCachePath, - tweetTopKClustersPath = simClustersCoreAltCachePath, - clusterTopKTweetsPath = simClustersCoreAltCachePath, - clusterTopKTweetsLightPath = Some(simClustersCoreAltLightCachePath), - 
coreEmbeddingType = EmbeddingType.LogFavBasedTweet - ) - - final val AltDevelTweetJobProfile = SimClustersTweetProfile( - env = Environment.Devel, - alt = AltSetting.Alt, - modelVersionStr = Model20M145K2020, - // using the same devel cache with job - entityClusterScorePath = develSimClustersCoreCachePath, - tweetTopKClustersPath = develSimClustersCoreCachePath, - clusterTopKTweetsPath = develSimClustersCoreCachePath, - clusterTopKTweetsLightPath = Some(develSimClustersCoreLightCachePath), - coreEmbeddingType = EmbeddingType.LogFavBasedTweet, - ) - - final val ProdPersistentTweetProfile = PersistentTweetProfile( - env = Environment.Prod, - alt = AltSetting.Normal, - modelVersionStr = Model20M145K2020, - // This profile is used by the persistent tweet embedding job to update the embedding. We - // use the uncached column to avoid reading stale data - persistentTweetStratoPath = logFavBasedTweet20M145K2020UncachedStratoPath, - coreEmbeddingType = EmbeddingType.LogFavBasedTweet - ) - - final val DevelPersistentTweetProfile = PersistentTweetProfile( - env = Environment.Devel, - alt = AltSetting.Normal, - modelVersionStr = Model20M145K2020, - persistentTweetStratoPath = develLogFavBasedTweet20M145K2020StratoPath, - coreEmbeddingType = EmbeddingType.LogFavBasedTweet - ) - - def fetchTweetJobProfile( - env: Environment, - alt: AltSetting = AltSetting.Normal - ): SimClustersTweetProfile = { - (env, alt) match { - case (Environment.Prod, AltSetting.Alt) => AltProdTweetJobProfile - case (Environment.Devel, AltSetting.Alt) => AltDevelTweetJobProfile - case _ => throw new IllegalArgumentException("Invalid env or alt setting") - } - } - - def fetchPersistentJobProfile( - env: Environment, - alt: AltSetting = AltSetting.Normal - ): PersistentTweetProfile = { - (env, alt) match { - case (Environment.Prod, AltSetting.Normal) => ProdPersistentTweetProfile - case (Environment.Devel, AltSetting.Normal) => DevelPersistentTweetProfile - case _ => throw new IllegalArgumentException("Invalid env or alt setting") - } - } - - /** - * For short term, fav based tweet embedding and log fav based tweets embedding exists at the - * same time. We want to move to log fav based tweet embedding eventually. - * Follow based tweet embeddings exists in both environment. - * A uniform tweet embedding API is the future to replace the existing use case. - */ - final lazy val tweetJobProfileMap: Environment => Map[ - (EmbeddingType, String), - SimClustersTweetProfile - ] = { - case Environment.Prod => - Map( - (EmbeddingType.LogFavBasedTweet, Model20M145K2020) -> AltProdTweetJobProfile - ) - case Environment.Devel => - Map( - (EmbeddingType.LogFavBasedTweet, Model20M145K2020) -> AltDevelTweetJobProfile - ) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/StatsUtil.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/StatsUtil.scala deleted file mode 100644 index 78a34fef2..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/StatsUtil.scala +++ /dev/null @@ -1,22 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.summingbird.{Counter, Group, Name, Platform, Producer} -import com.twitter.summingbird.option.JobId - -object StatsUtil { - - // for adding stats in Producer. 
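As a quick illustration of how the profile settings above turn into concrete job names, here is a hedged re-statement of `SimClustersJobProfile.jobName` using plain strings in place of the `JobType`, `AltSetting` and `Environment` enumerations:

```scala
object JobNameExample {
  // Mirrors the naming scheme in SimClustersJobProfile.jobName; jobType is one of
  // "tweet" / "persistent_tweet" / "multimodel_tweet", env is "prod" or "devel".
  def jobName(jobType: String, alt: String, env: String): String = alt match {
    case "alt" => s"simclusters_v2_${jobType}_alt_job_$env"
    case "esc" => s"simclusters_v2_${jobType}_esc_job_$env"
    case _     => s"simclusters_v2_${jobType}_job_$env"
  }

  def main(args: Array[String]): Unit = {
    println(jobName("tweet", "alt", "prod"))                 // simclusters_v2_tweet_alt_job_prod
    println(jobName("persistent_tweet", "normal", "devel"))  // simclusters_v2_persistent_tweet_job_devel
  }
}
```

`AltProdTweetJobProfile`, for example, resolves to `simclusters_v2_tweet_alt_job_prod`.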
- // this enables us to add new stats by just calling producer.observer("name") - implicit class EnrichedProducer[P <: Platform[P], T]( - producer: Producer[P, T] - )( - implicit jobId: JobId) { - def observe(counter: String): Producer[P, T] = { - val stat = Counter(Group(jobId.get), Name(counter)) - producer.map { v => - stat.incr() - v - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/SummerWithSumValues.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/SummerWithSumValues.scala deleted file mode 100644 index e10718162..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/SummerWithSumValues.scala +++ /dev/null @@ -1,40 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.algebird.Monoid -import com.twitter.summingbird._ - -object SummerWithSumValues { - /* - A common pattern in heron is to use .sumByKeys to aggregate a value in a store, and then continue - processing with the aggregated value. Unfortunately, .sumByKeys returns the existing value from the - store and the delta separately, leaving you to manually combine them. - - Example without sumValues: - - someKeyedProducer - .sumByKeys(score)(monoid) - .map { - case (key, (existingValueOpt, delta)) => - // if you want the value that was actually written to the store, you have to combine - // existingValueOpt and delta yourself - } - - Example with sumValues: - - someKeyedProducer - .sumByKeys(score)(monoid) - .sumValues(monoid) - .map { - case (key, value) => - // `value` is the same as what was written to the store - } - */ - implicit class SummerWithSumValues[P <: Platform[P], K, V]( - summer: Summer[P, K, V]) { - def sumValues(monoid: Monoid[V]): KeyedProducer[P, K, V] = - summer.mapValues { - case (Some(oldV), deltaV) => monoid.plus(oldV, deltaV) - case (None, deltaV) => deltaV - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/ThriftDecayedValueMonoid.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/ThriftDecayedValueMonoid.scala deleted file mode 100644 index af490fc9d..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/ThriftDecayedValueMonoid.scala +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.algebird.DecayedValue -import com.twitter.algebird.DecayedValueMonoid -import com.twitter.algebird.Monoid -import com.twitter.algebird_internal.injection.DecayedValueImplicits -import com.twitter.algebird_internal.thriftscala.{DecayedValue => ThriftDecayedValue} - -/** - * Monoid for ThriftDecayedValue - */ -class ThriftDecayedValueMonoid(halfLifeInMs: Long)(implicit decayedValueMonoid: DecayedValueMonoid) - extends Monoid[ThriftDecayedValue] { - - override val zero: ThriftDecayedValue = DecayedValueImplicits.toThrift(decayedValueMonoid.zero) - - override def plus(x: ThriftDecayedValue, y: ThriftDecayedValue): ThriftDecayedValue = { - DecayedValueImplicits.toThrift( - decayedValueMonoid - .plus(DecayedValueImplicits.toThrift.invert(x), DecayedValueImplicits.toThrift.invert(y)) - ) - } - - def build(value: Double, timeInMs: Double): ThriftDecayedValue = { - DecayedValueImplicits.toThrift( - DecayedValue.build(value, timeInMs, halfLifeInMs) - ) - } - - /** - * decay to a timestamp; note that timestamp should be in Ms, and do not use scaledTime! 
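The `decayToTimestamp` helper documented here relies on a property of decayed-value addition: adding a zero-valued element stamped at a later time contributes nothing to the total, but forces the older value to be decayed forward to that time. A plain-Scala sketch of why that works (simplified stand-in types and raw millisecond timestamps, not algebird's `DecayedValue`):

```scala
object DecayToTimestampExample {
  final case class Decayed(value: Double, timestampMs: Double)

  def decayTo(x: Decayed, tMs: Double, halfLifeMs: Double): Decayed =
    if (tMs <= x.timestampMs) x
    else Decayed(x.value * math.pow(0.5, (tMs - x.timestampMs) / halfLifeMs), tMs)

  // "plus" aligns both operands to the later timestamp before summing their values
  def plus(a: Decayed, b: Decayed, halfLifeMs: Double): Decayed = {
    val t = math.max(a.timestampMs, b.timestampMs)
    Decayed(decayTo(a, t, halfLifeMs).value + decayTo(b, t, halfLifeMs).value, t)
  }

  def main(args: Array[String]): Unit = {
    val halfLifeMs = 8.0 * 60 * 60 * 1000
    val old = Decayed(1.0, 0.0)
    // plus(old, zero-at-t) == old decayed to time t, i.e. value ~0.5 after one half-life
    println(plus(old, Decayed(0.0, halfLifeMs), halfLifeMs))
  }
}
```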
- */ - def decayToTimestamp( - thriftDecayedValue: ThriftDecayedValue, - timestampInMs: Double - ): ThriftDecayedValue = { - this.plus(thriftDecayedValue, this.build(0.0, timestampInMs)) - } -} - -object ThriftDecayedValueMonoid { - // add the implicit class so that a decayed value can direct call .plus, .decayedValueOfTime and - // so on. - implicit class EnrichedThriftDecayedValue( - thriftDecayedValue: ThriftDecayedValue - )( - implicit thriftDecayedValueMonoid: ThriftDecayedValueMonoid) { - def plus(other: ThriftDecayedValue): ThriftDecayedValue = { - thriftDecayedValueMonoid.plus(thriftDecayedValue, other) - } - - // decay to a timestamp; note that timestamp should be in Ms - def decayToTimestamp(timeInMs: Double): ThriftDecayedValue = { - thriftDecayedValueMonoid.decayToTimestamp(thriftDecayedValue, timeInMs) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/common/TweetEntityExtractor.scala b/src/scala/com/twitter/simclusters_v2/summingbird/common/TweetEntityExtractor.scala deleted file mode 100644 index bd6a81baa..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/common/TweetEntityExtractor.scala +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.common - -import com.twitter.recos.entities.thriftscala.NamedEntity -import com.twitter.simclusters_v2.thriftscala.{ - NerKey, - PenguinKey, - SimClusterEntity, - TweetTextEntity -} -import com.twitter.taxi.util.text.{TweetFeatureExtractor, TweetTextFeatures} -import com.twitter.tweetypie.thriftscala.Tweet - -object TweetEntityExtractor { - - private val MaxHashtagsPerTweet: Int = 4 - - private val MaxNersPerTweet: Int = 4 - - private val MaxPenguinsPerTweet: Int = 4 - - private val tweetFeatureExtractor: TweetFeatureExtractor = TweetFeatureExtractor.Default - - private def extractTweetTextFeatures( - text: String, - languageCode: Option[String] - ): TweetTextFeatures = { - if (languageCode.isDefined) { - tweetFeatureExtractor.extract(text, languageCode.get) - } else { - tweetFeatureExtractor.extract(text) - } - } - - def extractEntitiesFromText( - tweet: Option[Tweet], - nerEntitiesOpt: Option[Seq[NamedEntity]] - ): Seq[SimClusterEntity.TweetEntity] = { - - val hashtagEntities = tweet - .flatMap(_.hashtags.map(_.map(_.text))).getOrElse(Nil) - .map { hashtag => TweetTextEntity.Hashtag(hashtag.toLowerCase) }.take(MaxHashtagsPerTweet) - - val nerEntities = nerEntitiesOpt - .getOrElse(Nil).map { namedEntity => - TweetTextEntity - .Ner(NerKey(namedEntity.namedEntity.toLowerCase, namedEntity.entityType.getValue)) - }.take(MaxNersPerTweet) - - val nerEntitySet = nerEntities.map(_.ner.textEntity).toSet - - val penguinEntities = - extractTweetTextFeatures( - tweet.flatMap(_.coreData.map(_.text)).getOrElse(""), - tweet.flatMap(_.language.map(_.language)) - ).phrases - .map(_.normalizedOrOriginal) - .filter { s => - s.charAt(0) != '#' && !nerEntitySet.contains(s) // not included in hashtags and NER - } - .map { penguinStr => TweetTextEntity.Penguin(PenguinKey(penguinStr.toLowerCase)) }.take( - MaxPenguinsPerTweet) - - (hashtagEntities ++ penguinEntities ++ nerEntities).map(e => SimClusterEntity.TweetEntity(e)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ApeTopicEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/ApeTopicEmbeddingStore.scala deleted file mode 100644 index 0eec17b81..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ApeTopicEmbeddingStore.scala +++ /dev/null @@ -1,43 +0,0 @@ -package 
com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.TopicId -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import com.twitter.storehaus.ReadableStore -import com.twitter.strato.client.Client - -object ApeTopicEmbeddingStore { - - private val logFavBasedAPEColumn20M145K2020 = - "recommendations/simclusters_v2/embeddings/logFavBasedAPE20M145K2020" - - private def getStore( - stratoClient: Client, - column: String - ): ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding] = { - StratoStore - .withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](stratoClient, column) - } - - def getFavBasedLocaleEntityEmbedding2020Store( - stratoClient: Client, - ): ReadableStore[TopicId, SimClustersEmbedding] = { - - getStore(stratoClient, logFavBasedAPEColumn20M145K2020) - .composeKeyMapping[TopicId] { topicId => - SimClustersEmbeddingId( - EmbeddingType.LogFavBasedKgoApeTopic, - ModelVersions.Model20M145K2020, - InternalId.TopicId(topicId) - ) - } - .mapValues(SimClustersEmbedding(_)) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/BUILD b/src/scala/com/twitter/simclusters_v2/summingbird/stores/BUILD deleted file mode 100644 index 9e78da7c4..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/BUILD +++ /dev/null @@ -1,32 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/bijection:core", - "3rdparty/jvm/com/twitter/bijection:util", - "3rdparty/jvm/com/twitter/storehaus:core", - "frigate/frigate-common/src/main/scala/com/twitter/frigate/common/store/strato", - "relevance-platform/src/main/scala/com/twitter/relevance_platform/simclustersann/multicluster", - "src/scala/com/twitter/algebird_internal/injection", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/simclusters_v2/summingbird/common", - "src/scala/com/twitter/storehaus_internal/manhattan", - "src/scala/com/twitter/storehaus_internal/manhattan/config", - "src/scala/com/twitter/storehaus_internal/memcache", - "src/scala/com/twitter/storehaus_internal/memcache/config", - "src/scala/com/twitter/storehaus_internal/offline", - "src/scala/com/twitter/storehaus_internal/online", - "src/scala/com/twitter/storehaus_internal/util", - "src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits", - "src/scala/com/twitter/summingbird_internal/runner/store_config", - "src/scala/com/twitter/wtf/summingbird/sources/common", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/timelineservice/server/internal:thrift-scala", - "src/thrift/com/twitter/wtf/interest:interest-thrift-scala", - "src/thrift/com/twitter/wtf/utt:utt-scala", - "strato/src/main/scala/com/twitter/strato/client", - "strato/src/main/scala/com/twitter/strato/mh", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ClusterDetailsReadableStore.scala 
b/src/scala/com/twitter/simclusters_v2/summingbird/stores/ClusterDetailsReadableStore.scala deleted file mode 100644 index a553e7ff8..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ClusterDetailsReadableStore.scala +++ /dev/null @@ -1,67 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.{Bufferable, Injection} -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.thriftscala.ClusterDetails -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.{Athena, ManhattanRO, ManhattanROConfig} -import com.twitter.storehaus_internal.util.{ApplicationID, DatasetName, HDFSPath} -import com.twitter.util.{Future, Memoize} - -object ClusterDetailsReadableStore { - - val modelVersionToDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145KDec11 -> "simclusters_v2_cluster_details", - ModelVersions.Model20M145KUpdated -> "simclusters_v2_cluster_details_20m_145k_updated", - ModelVersions.Model20M145K2020 -> "simclusters_v2_cluster_details_20m_145k_2020" - ) - - val knownModelVersions: String = modelVersionToDatasetMap.keys.mkString(",") - - private val clusterDetailsStores = - Memoize[(ManhattanKVClientMtlsParams, String), ReadableStore[(String, Int), ClusterDetails]] { - case (mhMtlsParams: ManhattanKVClientMtlsParams, datasetName: String) => - getForDatasetName(mhMtlsParams, datasetName) - } - - def getForDatasetName( - mhMtlsParams: ManhattanKVClientMtlsParams, - datasetName: String - ): ReadableStore[(String, Int), ClusterDetails] = { - implicit val keyInjection: Injection[(String, Int), Array[Byte]] = - Bufferable.injectionOf[(String, Int)] - implicit val valueInjection: Injection[ClusterDetails, Array[Byte]] = - CompactScalaCodec(ClusterDetails) - - ManhattanRO.getReadableStoreWithMtls[(String, Int), ClusterDetails]( - ManhattanROConfig( - HDFSPath(""), // not needed - ApplicationID("simclusters_v2"), - DatasetName(datasetName), // this should be correct - Athena - ), - mhMtlsParams - ) - } - - def apply( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[(String, Int), ClusterDetails] = { - new ReadableStore[(String, Int), ClusterDetails] { - override def get(modelVersionAndClusterId: (String, Int)): Future[Option[ClusterDetails]] = { - val (modelVersion, _) = modelVersionAndClusterId - modelVersionToDatasetMap.get(modelVersion) match { - case Some(datasetName) => - clusterDetailsStores((mhMtlsParams, datasetName)).get(modelVersionAndClusterId) - case None => - Future.exception( - new IllegalArgumentException( - "Unknown model version " + modelVersion + ". 
Known modelVersions: " + knownModelVersions) - ) - } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/EntityClusterScoreReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/EntityClusterScoreReadableStore.scala deleted file mode 100644 index b25687f4e..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/EntityClusterScoreReadableStore.scala +++ /dev/null @@ -1,62 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.simclusters_v2.summingbird.common.Implicits.clustersWithScoreMonoid -import com.twitter.simclusters_v2.summingbird.common.Implicits.clustersWithScoresCodec -import com.twitter.storehaus.algebra.MergeableStore -import com.twitter.simclusters_v2.summingbird.common.ClientConfigs -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.thriftscala.ClustersWithScores -import com.twitter.simclusters_v2.thriftscala.FullClusterIdBucket -import com.twitter.simclusters_v2.thriftscala.MultiModelClustersWithScores -import com.twitter.simclusters_v2.thriftscala.SimClusterEntity -import com.twitter.storehaus.Store -import com.twitter.storehaus_internal.memcache.Memcache -import com.twitter.strato.client.Client -import com.twitter.summingbird.batch.BatchID -import com.twitter.summingbird_internal.bijection.BatchPairImplicits -import com.twitter.util.Future -import com.twitter.strato.thrift.ScroogeConvImplicits._ - -object EntityClusterScoreReadableStore { - - private[simclusters_v2] final lazy val onlineMergeableStore: ( - String, - ServiceIdentifier - ) => MergeableStore[ - ((SimClusterEntity, FullClusterIdBucket), BatchID), - ClustersWithScores - ] = { (path: String, serviceIdentifier: ServiceIdentifier) => - Memcache - .getMemcacheStore[((SimClusterEntity, FullClusterIdBucket), BatchID), ClustersWithScores]( - ClientConfigs.entityClusterScoreMemcacheConfig(path, serviceIdentifier) - )( - BatchPairImplicits.keyInjection[(SimClusterEntity, FullClusterIdBucket)]( - Implicits.entityWithClusterInjection - ), - clustersWithScoresCodec, - clustersWithScoreMonoid - ) - } - -} - -object MultiModelEntityClusterScoreReadableStore { - - private[simclusters_v2] def MultiModelEntityClusterScoreReadableStore( - stratoClient: Client, - column: String - ): Store[EntityClusterId, MultiModelClustersWithScores] = { - StratoStore - .withUnitView[(SimClusterEntity, Int), MultiModelClustersWithScores](stratoClient, column) - .composeKeyMapping(_.toTuple) - } - - case class EntityClusterId( - simClusterEntity: SimClusterEntity, - clusterIdBucket: Int) { - lazy val toTuple: (SimClusterEntity, Int) = - (simClusterEntity, clusterIdBucket) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ManhattanFromStratoStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/ManhattanFromStratoStore.scala deleted file mode 100644 index ba9af7f00..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ManhattanFromStratoStore.scala +++ /dev/null @@ -1,108 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.Injection -import com.twitter.finagle.stats.NullStatsReceiver -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.io.Buf -import com.twitter.scrooge.ThriftStruct -import com.twitter.simclusters_v2.common.TweetId -import 
com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore.Timestamp -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.storage.client.manhattan.kv.Guarantee -import com.twitter.storage.client.manhattan.kv.ManhattanKVClient -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storage.client.manhattan.kv.ManhattanKVEndpointBuilder -import com.twitter.storage.client.manhattan.kv.impl.FullBufKey -import com.twitter.storage.client.manhattan.kv.impl.ValueDescriptor -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan_kv.ManhattanEndpointStore -import com.twitter.strato.catalog.Version -import com.twitter.strato.config.MValEncoding -import com.twitter.strato.config.NativeEncoding -import com.twitter.strato.config.PkeyLkey2 -import com.twitter.strato.data.Conv -import com.twitter.strato.data.Type -import com.twitter.strato.mh.ManhattanInjections -import com.twitter.strato.thrift.ScroogeConv -import com.twitter.strato.thrift.ScroogeConvImplicits._ - -object ManhattanFromStratoStore { - /* This enables reading from a MH store where the data is written by Strato. Strato uses a unique - encoding (Conv) which needs to be reconstructed for each MH store based on the type of data that - is written to it. Once that encoding is generated on start-up, we can read from the store like - any other ReadableStore. - */ - def createPersistentTweetStore( - dataset: String, - mhMtlsParams: ManhattanKVClientMtlsParams, - statsReceiver: StatsReceiver = NullStatsReceiver - ): ReadableStore[(TweetId, Timestamp), PersistentSimClustersEmbedding] = { - val appId = "simclusters_embeddings_prod" - val dest = "/s/manhattan/omega.native-thrift" - - val endpoint = createMhEndpoint( - appId = appId, - dest = dest, - mhMtlsParams = mhMtlsParams, - statsReceiver = statsReceiver) - - val ( - keyInj: Injection[(TweetId, Timestamp), FullBufKey], - valueDesc: ValueDescriptor.EmptyValue[PersistentSimClustersEmbedding]) = - injectionsFromPkeyLkeyValueStruct[TweetId, Timestamp, PersistentSimClustersEmbedding]( - dataset = dataset, - pkType = Type.Long, - lkType = Type.Long) - - ManhattanEndpointStore - .readable[(TweetId, Timestamp), PersistentSimClustersEmbedding, FullBufKey]( - endpoint = endpoint, - keyDescBuilder = keyInj, - emptyValDesc = valueDesc) - } - - private def createMhEndpoint( - appId: String, - dest: String, - mhMtlsParams: ManhattanKVClientMtlsParams, - statsReceiver: StatsReceiver = NullStatsReceiver - ) = { - val mhc = ManhattanKVClient.memoizedByDest( - appId = appId, - dest = dest, - mtlsParams = mhMtlsParams - ) - - ManhattanKVEndpointBuilder(mhc) - .defaultGuarantee(Guarantee.SoftDcReadMyWrites) - .statsReceiver(statsReceiver) - .build() - } - - private def injectionsFromPkeyLkeyValueStruct[PK: Conv, LK: Conv, V <: ThriftStruct: Manifest]( - dataset: String, - pkType: Type, - lkType: Type - ): (Injection[(PK, LK), FullBufKey], ValueDescriptor.EmptyValue[V]) = { - // Strato uses a unique encoding (Conv) so we need to rebuild that based on the pkey, lkey and - // value type before converting it to the Manhattan injections for key -> FullBufKey and - // value -> Buf - val valueConv: Conv[V] = ScroogeConv.fromStruct[V] - - val mhEncodingMapping = PkeyLkey2( - pkey = pkType, - lkey = lkType, - value = valueConv.t, - pkeyEncoding = NativeEncoding, - lkeyEncoding = NativeEncoding, - valueEncoding = MValEncoding() - ) - - val (keyInj: Injection[(PK, LK), FullBufKey], valueInj: 
Injection[V, Buf], _, _) = - ManhattanInjections.fromPkeyLkey[PK, LK, V](mhEncodingMapping, dataset, Version.Default) - - val valDesc: ValueDescriptor.EmptyValue[V] = ValueDescriptor.EmptyValue(valueInj) - - (keyInj, valDesc) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/PersistentTweetEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/PersistentTweetEmbeddingStore.scala deleted file mode 100644 index ab9c06240..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/PersistentTweetEmbeddingStore.scala +++ /dev/null @@ -1,104 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.frigate.common.store.strato.StratoFetchableStore -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.SimClustersEmbedding._ -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus.Store -import com.twitter.strato.catalog.Scan.Slice -import com.twitter.strato.client.Client -import com.twitter.strato.thrift.ScroogeConvImplicits._ - -object PersistentTweetEmbeddingStore { - - val LogFavBasedColumn = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145KUpdatedPersistent" - val LogFavBasedColumn20m145k2020 = - "recommendations/simclusters_v2/embeddings/logFavBasedTweet20M145K2020Persistent" - - val LogFavBased20m145k2020Dataset = "log_fav_based_tweet_20m_145k_2020_embeddings" - val LogFavBased20m145kUpdatedDataset = "log_fav_based_tweet_20m_145k_updated_embeddings" - - val DefaultMaxLength = 15 - - def mostRecentTweetEmbeddingStore( - stratoClient: Client, - column: String, - maxLength: Int = DefaultMaxLength - ): ReadableStore[TweetId, SimClustersEmbedding] = { - StratoFetchableStore - .withUnitView[(TweetId, Timestamp), PersistentSimClustersEmbedding](stratoClient, column) - .composeKeyMapping[TweetId]((_, LatestEmbeddingVersion)) - .mapValues(_.embedding.truncate(maxLength)) - } - - def longestL2NormTweetEmbeddingStore( - stratoClient: Client, - column: String - ): ReadableStore[TweetId, SimClustersEmbedding] = - StratoFetchableStore - .withUnitView[(TweetId, Timestamp), PersistentSimClustersEmbedding](stratoClient, column) - .composeKeyMapping[TweetId]((_, LongestL2EmbeddingVersion)) - .mapValues(_.embedding) - - def mostRecentTweetEmbeddingStoreManhattan( - mhMtlsParams: ManhattanKVClientMtlsParams, - dataset: String, - statsReceiver: StatsReceiver, - maxLength: Int = DefaultMaxLength - ): ReadableStore[TweetId, SimClustersEmbedding] = - ManhattanFromStratoStore - .createPersistentTweetStore( - dataset = dataset, - mhMtlsParams = mhMtlsParams, - statsReceiver = statsReceiver - ).composeKeyMapping[TweetId]((_, LatestEmbeddingVersion)) - .mapValues[SimClustersEmbedding](_.embedding.truncate(maxLength)) - - def longestL2NormTweetEmbeddingStoreManhattan( - mhMtlsParams: ManhattanKVClientMtlsParams, - dataset: String, - statsReceiver: StatsReceiver, - maxLength: Int = 50 - ): ReadableStore[TweetId, SimClustersEmbedding] = - ManhattanFromStratoStore - .createPersistentTweetStore( - dataset = dataset, - mhMtlsParams = mhMtlsParams, - statsReceiver = statsReceiver - ).composeKeyMapping[TweetId]((_, LongestL2EmbeddingVersion)) - 
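The persistent tweet embedding stores here are keyed by a `(TweetId, Timestamp)` pair, with two reserved timestamp values acting as named slots: 0L for the latest embedding and 1L for the embedding with the longest L2 norm, as defined further below. A minimal sketch of that key layout (simplified types, not the storehaus store):

```scala
object PersistentEmbeddingKeyExample {
  type TweetId = Long
  type Timestamp = Long

  // reserved LKey versions, mirroring PersistentTweetEmbeddingStore
  val LatestEmbeddingVersion: Timestamp = 0L
  val LongestL2EmbeddingVersion: Timestamp = 1L

  // a reader interested only in the freshest embedding always fixes the LKey to 0L
  def latestKey(tweetId: TweetId): (TweetId, Timestamp) = (tweetId, LatestEmbeddingVersion)

  // a reader interested in the strongest embedding seen so far fixes it to 1L
  def longestL2Key(tweetId: TweetId): (TweetId, Timestamp) = (tweetId, LongestL2EmbeddingVersion)

  def main(args: Array[String]): Unit =
    println(Seq(latestKey(1234567890L), longestL2Key(1234567890L)))
}
```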
.mapValues[SimClustersEmbedding](_.embedding.truncate(maxLength)) - - /** - * The writeable store for Persistent Tweet embedding. Only available in SimClusters package. - */ - private[simclusters_v2] def persistentTweetEmbeddingStore( - stratoClient: Client, - column: String - ): Store[PersistentTweetEmbeddingId, PersistentSimClustersEmbedding] = { - StratoStore - .withUnitView[(TweetId, Timestamp), PersistentSimClustersEmbedding](stratoClient, column) - .composeKeyMapping(_.toTuple) - } - - type Timestamp = Long - - case class PersistentTweetEmbeddingId( - tweetId: TweetId, - timestampInMs: Timestamp = LatestEmbeddingVersion) { - lazy val toTuple: (TweetId, Timestamp) = (tweetId, timestampInMs) - } - - // Special version - reserved for the latest version of the embedding - private[summingbird] val LatestEmbeddingVersion = 0L - // Special version - reserved for the embedding with the longest L2 norm - private[summingbird] val LongestL2EmbeddingVersion = 1L - - // The tweet embedding store keeps at most 20 LKeys - private[stores] val DefaultSlice = Slice[Long](from = None, to = None, limit = None) -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ProducerClusterEmbeddingReadableStores.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/ProducerClusterEmbeddingReadableStores.scala deleted file mode 100644 index e978aa9f9..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/ProducerClusterEmbeddingReadableStores.scala +++ /dev/null @@ -1,101 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.thriftscala.PersistedFullClusterId -import com.twitter.simclusters_v2.thriftscala.TopProducersWithScore -import com.twitter.simclusters_v2.thriftscala.TopSimClustersWithScore -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.Athena -import com.twitter.storehaus_internal.manhattan.ManhattanRO -import com.twitter.storehaus_internal.manhattan.ManhattanROConfig -import com.twitter.storehaus_internal.util.ApplicationID -import com.twitter.storehaus_internal.util.DatasetName -import com.twitter.storehaus_internal.util.HDFSPath - -object ProducerClusterEmbeddingReadableStores { - - implicit val longInject: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val clusterInject: Injection[TopSimClustersWithScore, Array[Byte]] = - CompactScalaCodec(TopSimClustersWithScore) - implicit val producerInject: Injection[TopProducersWithScore, Array[Byte]] = - CompactScalaCodec(TopProducersWithScore) - implicit val clusterIdInject: Injection[PersistedFullClusterId, Array[Byte]] = - CompactScalaCodec(PersistedFullClusterId) - - private val appId = "simclusters_v2" - - def getSimClusterEmbeddingTopKProducersStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[PersistedFullClusterId, TopProducersWithScore] = { - ManhattanRO.getReadableStoreWithMtls[PersistedFullClusterId, TopProducersWithScore]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("simcluster_embedding_top_k_producers_by_fav_score_20m_145k_updated"), - Athena - ), - mhMtlsParams - ) - } - - def getProducerTopKSimClustersEmbeddingsStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, TopSimClustersWithScore] = { - val datasetName = 
"producer_top_k_simcluster_embeddings_by_fav_score_20m_145k_updated" - ManhattanRO.getReadableStoreWithMtls[Long, TopSimClustersWithScore]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName(datasetName), - Athena - ), - mhMtlsParams - ) - } - - def getProducerTopKSimClusters2020EmbeddingsStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, TopSimClustersWithScore] = { - val datasetName = "producer_top_k_simcluster_embeddings_by_fav_score_20m_145k_2020" - ManhattanRO.getReadableStoreWithMtls[Long, TopSimClustersWithScore]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName(datasetName), - Athena - ), - mhMtlsParams - ) - } - - def getSimClusterEmbeddingTopKProducersByFollowStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[PersistedFullClusterId, TopProducersWithScore] = { - ManhattanRO.getReadableStoreWithMtls[PersistedFullClusterId, TopProducersWithScore]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("simcluster_embedding_top_k_producers_by_follow_score_20m_145k_updated"), - Athena - ), - mhMtlsParams - ) - } - - def getProducerTopKSimClustersEmbeddingsByFollowStore( - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, TopSimClustersWithScore] = { - ManhattanRO.getReadableStoreWithMtls[Long, TopSimClustersWithScore]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(appId), - DatasetName("producer_top_k_simcluster_embeddings_by_follow_score_20m_145k_2020"), - Athena - ), - mhMtlsParams - ) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/SemanticCoreEntityEmbeddingStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/SemanticCoreEntityEmbeddingStore.scala deleted file mode 100644 index ccdea937c..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/SemanticCoreEntityEmbeddingStore.scala +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.thriftscala.{ - EmbeddingType, - InternalId, - LocaleEntityId, - SimClustersEmbeddingId, - SimClustersEmbedding => ThriftSimClustersEmbedding -} -import com.twitter.storehaus.ReadableStore -import com.twitter.strato.client.Client -import com.twitter.strato.thrift.ScroogeConvImplicits._ -import com.twitter.simclusters_v2.common.SimClustersEmbedding - -/** - * entity -> List< cluster > - */ -object SemanticCoreEntityEmbeddingStore { - - private val column = - "recommendations/simclusters_v2/embeddings/semanticCoreEntityPerLanguageEmbeddings20M145KUpdated" - - /** - * Default store, wrapped in generic data types. Use this if you know the underlying key struct. 
- */ - private def getDefaultStore( - stratoClient: Client - ): ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding] = { - StratoStore - .withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](stratoClient, column) - } - - def getFavBasedLocaleEntityEmbeddingStore( - stratoClient: Client - ): ReadableStore[LocaleEntityId, SimClustersEmbedding] = { - getDefaultStore(stratoClient) - .composeKeyMapping[LocaleEntityId] { entityId => - SimClustersEmbeddingId( - EmbeddingType.FavBasedSematicCoreEntity, - ModelVersions.Model20M145KUpdated, - InternalId.LocaleEntityId(entityId) - ) - } - .mapValues(SimClustersEmbedding(_)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/SimClustersManhattanReadableStoreForReadWriteDataset.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/SimClustersManhattanReadableStoreForReadWriteDataset.scala deleted file mode 100644 index 63c1e772c..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/SimClustersManhattanReadableStoreForReadWriteDataset.scala +++ /dev/null @@ -1,65 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.storage.client.manhattan.kv.ManhattanKVClient -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storage.client.manhattan.kv.ManhattanKVEndpointBuilder -import com.twitter.storage.client.manhattan.kv.impl.Component -import com.twitter.storage.client.manhattan.kv.impl.DescriptorP1L0 -import com.twitter.storage.client.manhattan.kv.impl.KeyDescriptor -import com.twitter.storage.client.manhattan.kv.impl.ValueDescriptor -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.ManhattanCluster -import com.twitter.storehaus_internal.manhattan.Adama -import com.twitter.storage.client.manhattan.bijections.Bijections.BinaryScalaInjection -import com.twitter.storage.client.manhattan.kv.Guarantee -import com.twitter.conversions.DurationOps._ -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.stitch.Stitch -import com.twitter.storage.client.manhattan.bijections.Bijections.LongInjection -import com.twitter.util.Future - -/** - * Manhattan Readable Store to fetch simcluster embedding from a read-write dataset. - * Only read operations are allowed through this store. - * @param appId The "application id" - * @param datasetName The MH dataset name. - * @param label The human readable label for the finagle thrift client - * @param mtlsParams Client service identifier to use to authenticate with Manhattan service - * @param manhattanCluster Manhattan RW cluster - **/ -class SimClustersManhattanReadableStoreForReadWriteDataset( - appId: String, - datasetName: String, - label: String, - mtlsParams: ManhattanKVClientMtlsParams, - manhattanCluster: ManhattanCluster = Adama) - extends ReadableStore[SimClustersEmbeddingId, ClustersUserIsInterestedIn] { - /* - Setting up a new builder to read from Manhattan RW dataset. This is specifically required for - BeT project where we update the MH RW dataset (every 2 hours) using cloud shuttle service. 
- */ - val destName = manhattanCluster.wilyName - val endPoint = ManhattanKVEndpointBuilder(ManhattanKVClient(appId, destName, mtlsParams, label)) - .defaultGuarantee(Guarantee.SoftDcReadMyWrites) - .build() - - val keyDesc = KeyDescriptor(Component(LongInjection), Component()).withDataset(datasetName) - val valueDesc = ValueDescriptor(BinaryScalaInjection(ClustersUserIsInterestedIn)) - - override def get( - embeddingId: SimClustersEmbeddingId - ): Future[Option[ClustersUserIsInterestedIn]] = { - embeddingId match { - case SimClustersEmbeddingId(theEmbeddingType, theModelVersion, InternalId.UserId(userId)) => - val populatedKey: DescriptorP1L0.FullKey[Long] = keyDesc.withPkey(userId) - // returns result - val mhValue = Stitch.run(endPoint.get(populatedKey, valueDesc)) - mhValue.map { - case Some(x) => Option(x.contents) - case _ => None - } - case _ => Future.None - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TfgTopicEmbeddingsStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/TfgTopicEmbeddingsStore.scala deleted file mode 100644 index 1332c573a..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TfgTopicEmbeddingsStore.scala +++ /dev/null @@ -1,46 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.simclusters_v2.thriftscala.TopicId -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import com.twitter.storehaus.ReadableStore -import com.twitter.strato.client.Client -import com.twitter.strato.thrift.ScroogeConvImplicits._ -import com.twitter.simclusters_v2.common.SimClustersEmbedding - -/** - * TopicId -> List< cluster> - */ -object TfgTopicEmbeddingsStore { - - private val favBasedColumn20M145K2020 = - "recommendations/simclusters_v2/embeddings/favBasedTFGTopic20M145K2020" - - private def getStore( - stratoClient: Client, - column: String - ): ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding] = { - StratoStore - .withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](stratoClient, column) - } - - def getFavBasedLocaleEntityEmbedding2020Store( - stratoClient: Client, - ): ReadableStore[TopicId, SimClustersEmbedding] = { - - getStore(stratoClient, favBasedColumn20M145K2020) - .composeKeyMapping[TopicId] { topicId => - SimClustersEmbeddingId( - EmbeddingType.FavTfgTopic, - ModelVersions.Model20M145K2020, - InternalId.TopicId(topicId) - ) - } - .mapValues(SimClustersEmbedding(_)) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForEntityReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForEntityReadableStore.scala deleted file mode 100644 index baa3fa2a1..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForEntityReadableStore.scala +++ /dev/null @@ -1,36 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.simclusters_v2.summingbird.common.EntityUtil -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.storehaus.ReadableStore -import com.twitter.util.Future -import com.twitter.util.Time - -case 
class TopKClustersForEntityReadableStore( - underlyingStore: ReadableStore[EntityWithVersion, TopKClustersWithScores]) - extends ReadableStore[EntityWithVersion, TopKClustersWithScores] { - - override def multiGet[K1 <: EntityWithVersion]( - ks: Set[K1] - ): Map[K1, Future[Option[TopKClustersWithScores]]] = { - val nowInMs = Time.now.inMilliseconds - underlyingStore - .multiGet(ks) - .mapValues { resFuture => - resFuture.map { resOpt => - resOpt.map { clustersWithScores => - clustersWithScores.copy( - topClustersByFavClusterNormalizedScore = EntityUtil.updateScoreWithLatestTimestamp( - clustersWithScores.topClustersByFavClusterNormalizedScore, - nowInMs - ), - topClustersByFollowClusterNormalizedScore = EntityUtil.updateScoreWithLatestTimestamp( - clustersWithScores.topClustersByFollowClusterNormalizedScore, - nowInMs - ) - ) - } - } - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForTweetReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForTweetReadableStore.scala deleted file mode 100644 index f2381a2a5..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKClustersForTweetReadableStore.scala +++ /dev/null @@ -1,176 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.summingbird.common.Implicits.batcher -import com.twitter.simclusters_v2.summingbird.common.Implicits.topKClustersWithScoresCodec -import com.twitter.simclusters_v2.summingbird.common.Implicits.topKClustersWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.Environment -import com.twitter.simclusters_v2.summingbird.common.ClientConfigs -import com.twitter.simclusters_v2.summingbird.common.Configs -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus.algebra.MergeableStore -import com.twitter.storehaus_internal.memcache.Memcache -import com.twitter.summingbird.batch.BatchID -import com.twitter.summingbird.store.ClientStore -import com.twitter.summingbird_internal.bijection.BatchPairImplicits -import com.twitter.util.Duration -import com.twitter.util.Future - -object TopKClustersForTweetReadableStore { - - private[summingbird] final lazy val onlineMergeableStore: ( - String, - ServiceIdentifier - ) => MergeableStore[(EntityWithVersion, BatchID), TopKClustersWithScores] = { - (storePath: String, serviceIdentifier: ServiceIdentifier) => - Memcache.getMemcacheStore[(EntityWithVersion, BatchID), TopKClustersWithScores]( - ClientConfigs.tweetTopKClustersMemcacheConfig(storePath, serviceIdentifier) - )( - BatchPairImplicits.keyInjection[EntityWithVersion](Implicits.topKClustersKeyCodec), - topKClustersWithScoresCodec, - topKClustersWithScoresMonoid - ) - } - - final lazy val defaultStore: ( - String, - ServiceIdentifier - ) => ReadableStore[EntityWithVersion, TopKClustersWithScores] = { - (storePath: String, serviceIdentifier: ServiceIdentifier) => - // note that DefaultTopKClustersForEntityReadableStore is reused here because they share the - // same structure - TopKClustersForEntityReadableStore( - ClientStore(this.onlineMergeableStore(storePath, serviceIdentifier), Configs.batchesToKeep)) - } -} - -case class TweetKey( - 
tweetId: Long, - modelVersion: String, - embeddingType: EmbeddingType = EmbeddingType.FavBasedTweet, - halfLife: Duration = Configs.HalfLife) { - - lazy val modelVersionThrift: ModelVersion = ModelVersions.toModelVersion(modelVersion) - - lazy val simClustersEmbeddingId: SimClustersEmbeddingId = - SimClustersEmbeddingId(embeddingType, modelVersionThrift, InternalId.TweetId(tweetId)) -} - -object TweetKey { - - def apply(simClustersEmbeddingId: SimClustersEmbeddingId): TweetKey = { - simClustersEmbeddingId match { - case SimClustersEmbeddingId(embeddingType, modelVersion, InternalId.TweetId(tweetId)) => - TweetKey(tweetId, ModelVersions.toKnownForModelVersion(modelVersion), embeddingType) - case id => - throw new IllegalArgumentException(s"Invalid $id for TweetKey") - } - } - -} - -case class TopKClustersForTweetKeyReadableStore( - proxyMap: Map[(EmbeddingType, String), ReadableStore[EntityWithVersion, TopKClustersWithScores]], - halfLifeDuration: Duration, - topKClustersWithScoresToSeq: TopKClustersWithScores => Seq[(Int, Double)], - maxResult: Option[Int] = None) - extends ReadableStore[TweetKey, Seq[(Int, Double)]] { - - private val modifiedProxyMap = proxyMap.map { - case ((embeddingType, modelVersion), proxy) => - (embeddingType, modelVersion) -> proxy.composeKeyMapping { key: TweetKey => - EntityWithVersion( - SimClusterEntity.TweetId(key.tweetId), - // Fast fail if the model version is invalid. - ModelVersions.toModelVersion(modelVersion)) - } - } - - override def multiGet[K1 <: TweetKey]( - keys: Set[K1] - ): Map[K1, Future[Option[Seq[(Int, Double)]]]] = { - val (validKeys, invalidKeys) = keys.partition { tweetKey => - proxyMap.contains((tweetKey.embeddingType, tweetKey.modelVersion)) && - halfLifeDuration.inMilliseconds == Configs.HalfLifeInMs - } - - val resultsFuture = validKeys.groupBy(key => (key.embeddingType, key.modelVersion)).flatMap { - case (typeModelTuple, subKeys) => - modifiedProxyMap(typeModelTuple).multiGet(subKeys) - } - - resultsFuture.mapValues { topKClustersWithScoresFut => - for (topKClustersWithScoresOpt <- topKClustersWithScoresFut) yield { - for { - topKClustersWithScores <- topKClustersWithScoresOpt - } yield { - val results = topKClustersWithScoresToSeq(topKClustersWithScores) - maxResult match { - case Some(max) => - results.take(max) - case None => - results - } - } - } - } ++ invalidKeys.map { key => (key, Future.None) }.toMap - } -} - -object TopKClustersForTweetKeyReadableStore { - // Use Prod cache by default - def defaultProxyMap( - serviceIdentifier: ServiceIdentifier - ): Map[(EmbeddingType, String), ReadableStore[EntityWithVersion, TopKClustersWithScores]] = - SimClustersProfile.tweetJobProfileMap(Environment.Prod).mapValues { profile => - TopKClustersForTweetReadableStore - .defaultStore(profile.clusterTopKTweetsPath, serviceIdentifier) - } - val defaultHalfLife: Duration = Duration.fromMilliseconds(Configs.HalfLifeInMs) - - def defaultStore( - serviceIdentifier: ServiceIdentifier - ): ReadableStore[TweetKey, Seq[(Int, Double)]] = - TopKClustersForTweetKeyReadableStore( - defaultProxyMap(serviceIdentifier), - defaultHalfLife, - getTopClustersWithScoresByFavClusterNormalizedScore - ) - - def overrideLimitDefaultStore( - maxResult: Int, - serviceIdentifier: ServiceIdentifier - ): ReadableStore[TweetKey, Seq[(Int, Double)]] = { - TopKClustersForTweetKeyReadableStore( - defaultProxyMap(serviceIdentifier), - defaultHalfLife, - getTopClustersWithScoresByFavClusterNormalizedScore, - Some(maxResult) - ) - } - - private def 
getTopClustersWithScoresByFavClusterNormalizedScore( - topKClustersWithScores: TopKClustersWithScores - ): Seq[(Int, Double)] = { - { - for { - clusterIdWIthScores <- topKClustersWithScores.topClustersByFavClusterNormalizedScore - } yield { - ( - for { - (clusterId, scores) <- clusterIdWIthScores - favClusterNormalized8HrHalfLifeScore <- scores.favClusterNormalized8HrHalfLifeScore - if favClusterNormalized8HrHalfLifeScore.value > 0.0 - } yield { - clusterId -> favClusterNormalized8HrHalfLifeScore.value - } - ).toSeq.sortBy(-_._2) - } - }.getOrElse(Nil) - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKTweetsForClusterReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKTweetsForClusterReadableStore.scala deleted file mode 100644 index 39284424f..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TopKTweetsForClusterReadableStore.scala +++ /dev/null @@ -1,298 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.frigate.common.store.strato.StratoStore -import com.twitter.relevance_platform.simclustersann.multicluster.ClusterTweetIndexStoreConfig -import com.twitter.simclusters_v2.common.ClusterId -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.summingbird.common.ClientConfigs -import com.twitter.simclusters_v2.summingbird.common.Configs -import com.twitter.simclusters_v2.summingbird.common.EntityUtil -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.summingbird.common.Implicits.batcher -import com.twitter.simclusters_v2.summingbird.common.Implicits.topKTweetsWithScoresCodec -import com.twitter.simclusters_v2.summingbird.common.Implicits.topKTweetsWithScoresMonoid -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.Environment -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.FullClusterId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.MultiModelTopKTweetsWithScores -import com.twitter.simclusters_v2.thriftscala.TopKTweetsWithScores -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus.Store -import com.twitter.storehaus.algebra.MergeableStore -import com.twitter.storehaus_internal.manhattan.ManhattanRO -import com.twitter.storehaus_internal.manhattan.ManhattanROConfig -import com.twitter.storehaus_internal.memcache.Memcache -import com.twitter.storehaus_internal.util.ApplicationID -import com.twitter.storehaus_internal.util.DatasetName -import com.twitter.storehaus_internal.util.HDFSPath -import com.twitter.strato.client.Client -import com.twitter.strato.thrift.ScroogeConvImplicits._ -import com.twitter.summingbird.batch.BatchID -import com.twitter.summingbird.store.ClientStore -import com.twitter.summingbird_internal.bijection.BatchPairImplicits -import com.twitter.util.Duration -import com.twitter.util.Future -import com.twitter.util.Time - -/** - * Comparing to underlyingStore, this store decays all the values to current timestamp - */ -case class 
TopKTweetsForClusterReadableStore( - underlyingStore: ReadableStore[FullClusterId, TopKTweetsWithScores]) - extends ReadableStore[FullClusterId, TopKTweetsWithScores] { - - override def multiGet[K1 <: FullClusterId]( - ks: Set[K1] - ): Map[K1, Future[Option[TopKTweetsWithScores]]] = { - val nowInMs = Time.now.inMilliseconds - underlyingStore - .multiGet(ks) - .mapValues { resFuture => - resFuture.map { resOpt => - resOpt.map { tweetsWithScores => - tweetsWithScores.copy( - topTweetsByFavClusterNormalizedScore = EntityUtil.updateScoreWithLatestTimestamp( - tweetsWithScores.topTweetsByFavClusterNormalizedScore, - nowInMs), - topTweetsByFollowClusterNormalizedScore = EntityUtil.updateScoreWithLatestTimestamp( - tweetsWithScores.topTweetsByFollowClusterNormalizedScore, - nowInMs) - ) - } - } - } - } -} - -object TopKTweetsForClusterReadableStore { - - private[summingbird] final lazy val onlineMergeableStore: ( - String, - ServiceIdentifier - ) => MergeableStore[(FullClusterId, BatchID), TopKTweetsWithScores] = { - (storePath: String, serviceIdentifier: ServiceIdentifier) => - Memcache.getMemcacheStore[(FullClusterId, BatchID), TopKTweetsWithScores]( - ClientConfigs.clusterTopTweetsMemcacheConfig(storePath, serviceIdentifier) - )( - BatchPairImplicits.keyInjection[FullClusterId](Implicits.fullClusterIdCodec), - topKTweetsWithScoresCodec, - topKTweetsWithScoresMonoid - ) - } - - final lazy val defaultStore: ( - String, - ServiceIdentifier - ) => ReadableStore[FullClusterId, TopKTweetsWithScores] = { - (storePath: String, serviceIdentifier: ServiceIdentifier) => - TopKTweetsForClusterReadableStore( - ClientStore( - TopKTweetsForClusterReadableStore.onlineMergeableStore(storePath, serviceIdentifier), - Configs.batchesToKeep - )) - } -} - -object MultiModelTopKTweetsForClusterReadableStore { - - private[simclusters_v2] def MultiModelTopKTweetsForClusterReadableStore( - stratoClient: Client, - column: String - ): Store[Int, MultiModelTopKTweetsWithScores] = { - StratoStore - .withUnitView[Int, MultiModelTopKTweetsWithScores](stratoClient, column) - } -} - -case class ClusterKey( - clusterId: ClusterId, - modelVersion: String, - embeddingType: EmbeddingType = EmbeddingType.FavBasedTweet, - halfLife: Duration = Configs.HalfLife) { - lazy val modelVersionThrift: ModelVersion = ModelVersions.toModelVersion(modelVersion) -} - -case class TopKTweetsForClusterKeyReadableStore( - proxyMap: Map[(EmbeddingType, String), ReadableStore[FullClusterId, TopKTweetsWithScores]], - halfLife: Duration, - topKTweetsWithScoresToSeq: TopKTweetsWithScores => Seq[(Long, Double)], - maxResult: Option[Int] = None) - extends ReadableStore[ClusterKey, Seq[(Long, Double)]] { - - private val modifiedProxyMap = proxyMap.map { - case (typeModelTuple, proxy) => - typeModelTuple -> proxy.composeKeyMapping { key: ClusterKey => - FullClusterId(ModelVersions.toModelVersion(typeModelTuple._2), key.clusterId) - } - } - - override def multiGet[K1 <: ClusterKey]( - keys: Set[K1] - ): Map[K1, Future[Option[Seq[(Long, Double)]]]] = { - val (validKeys, invalidKeys) = keys.partition { clusterKey => - proxyMap.contains( - (clusterKey.embeddingType, clusterKey.modelVersion)) && clusterKey.halfLife == halfLife - } - - val resultsFuture = validKeys.groupBy(key => (key.embeddingType, key.modelVersion)).flatMap { - case (typeModelTuple, subKeys) => - modifiedProxyMap(typeModelTuple).multiGet(subKeys) - } - - resultsFuture.mapValues { topKTweetsWithScoresFut => - for (topKTweetsWithScoresOpt <- topKTweetsWithScoresFut) yield { - for { - 
topKTweetsWithScores <- topKTweetsWithScoresOpt - } yield { - val results = topKTweetsWithScoresToSeq(topKTweetsWithScores) - maxResult match { - case Some(max) => - results.take(max) - case None => - results - } - } - } - } ++ invalidKeys.map { key => (key, Future.None) }.toMap - } -} - -object TopKTweetsForClusterKeyReadableStore { - implicit val fullClusterIdInjection: Injection[FullClusterId, Array[Byte]] = - CompactScalaCodec(FullClusterId) - - // Use Prod cache by default - def defaultProxyMap( - serviceIdentifier: ServiceIdentifier, - ): Map[(EmbeddingType, String), ReadableStore[FullClusterId, TopKTweetsWithScores]] = - SimClustersProfile.tweetJobProfileMap(Environment.Prod).mapValues { profile => - TopKTweetsForClusterReadableStore - .defaultStore(profile.clusterTopKTweetsPath, serviceIdentifier) - } - val defaultHalfLife: Duration = Configs.HalfLife - - def defaultStore( - serviceIdentifier: ServiceIdentifier - ): ReadableStore[ClusterKey, Seq[(Long, Double)]] = - TopKTweetsForClusterKeyReadableStore( - defaultProxyMap(serviceIdentifier), - defaultHalfLife, - getTopTweetsWithScoresByFavClusterNormalizedScore - ) - - def storeUsingFollowClusterNormalizedScore( - serviceIdentifier: ServiceIdentifier - ): ReadableStore[ClusterKey, Seq[(Long, Double)]] = - TopKTweetsForClusterKeyReadableStore( - defaultProxyMap(serviceIdentifier), - defaultHalfLife, - getTopTweetsWithScoresByFollowClusterNormalizedScore - ) - - def overrideLimitDefaultStore( - maxResult: Int, - serviceIdentifier: ServiceIdentifier, - ): ReadableStore[ClusterKey, Seq[(Long, Double)]] = { - TopKTweetsForClusterKeyReadableStore( - defaultProxyMap(serviceIdentifier), - defaultHalfLife, - getTopTweetsWithScoresByFavClusterNormalizedScore, - Some(maxResult) - ) - } - - private def getTopTweetsWithScoresByFavClusterNormalizedScore( - topKTweets: TopKTweetsWithScores - ): Seq[(Long, Double)] = { - { - for { - tweetIdWithScores <- topKTweets.topTweetsByFavClusterNormalizedScore - } yield { - ( - for { - (tweetId, scores) <- tweetIdWithScores - favClusterNormalized8HrHalfLifeScore <- scores.favClusterNormalized8HrHalfLifeScore - if favClusterNormalized8HrHalfLifeScore.value > 0.0 - } yield { - tweetId -> favClusterNormalized8HrHalfLifeScore.value - } - ).toSeq.sortBy(-_._2) - } - }.getOrElse(Nil) - } - - private def getTopTweetsWithScoresByFollowClusterNormalizedScore( - topKTweets: TopKTweetsWithScores - ): Seq[(Long, Double)] = { - { - for { - tweetIdWithScores <- topKTweets.topTweetsByFollowClusterNormalizedScore - } yield { - ( - for { - (tweetId, scores) <- tweetIdWithScores - followClusterNormalized8HrHalfLifeScore <- - scores.followClusterNormalized8HrHalfLifeScore - if followClusterNormalized8HrHalfLifeScore.value > 0.0 - } yield { - tweetId -> followClusterNormalized8HrHalfLifeScore.value - } - ).toSeq.sortBy(-_._2) - } - }.getOrElse(Nil) - } - - def getClusterToTopKTweetsStoreFromManhattanRO( - maxResults: Int, - manhattanConfig: ClusterTweetIndexStoreConfig.Manhattan, - serviceIdentifier: ServiceIdentifier, - ): ReadableStore[ClusterKey, Seq[(TweetId, Double)]] = { - ManhattanRO - .getReadableStoreWithMtls[FullClusterId, TopKTweetsWithScores]( - ManhattanROConfig( - HDFSPath(""), - ApplicationID(manhattanConfig.applicationID), - DatasetName(manhattanConfig.datasetName), - manhattanConfig.manhattanCluster - ), - ManhattanKVClientMtlsParams(serviceIdentifier) - ).composeKeyMapping[ClusterKey] { clusterKey => - FullClusterId( - modelVersion = ModelVersions.toModelVersion(clusterKey.modelVersion), - clusterId = 
clusterKey.clusterId - ) - }.mapValues { topKTweetsWithScores => - // Only return maxResults tweets for each cluster Id - getTopTweetsWithScoresByFavClusterNormalizedScore(topKTweetsWithScores).take(maxResults) - } - } - - def getClusterToTopKTweetsStoreFromMemCache( - maxResults: Int, - memCacheConfig: ClusterTweetIndexStoreConfig.Memcached, - serviceIdentifier: ServiceIdentifier, - ): ReadableStore[ClusterKey, Seq[(TweetId, Double)]] = { - TopKTweetsForClusterReadableStore( - ClientStore( - TopKTweetsForClusterReadableStore - .onlineMergeableStore(memCacheConfig.memcachedDest, serviceIdentifier), - Configs.batchesToKeep - )) - .composeKeyMapping[ClusterKey] { clusterKey => - FullClusterId( - modelVersion = ModelVersions.toModelVersion(clusterKey.modelVersion), - clusterId = clusterKey.clusterId - ) - }.mapValues { topKTweetsWithScores => - // Only return maxResults tweets for each cluster Id - getTopTweetsWithScoresByFavClusterNormalizedScore(topKTweetsWithScores).take(maxResults) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TweetStatusCountsStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/TweetStatusCountsStore.scala deleted file mode 100644 index ce7ee2409..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/TweetStatusCountsStore.scala +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.frigate.common.store.strato.StratoFetchableStore -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.storehaus.ReadableStore -import com.twitter.strato.client.Client -import com.twitter.strato.thrift.ScroogeConvImplicits._ -import com.twitter.tweetypie.thriftscala.{GetTweetOptions, StatusCounts, Tweet} - -object TweetStatusCountsStore { - - def tweetStatusCountsStore( - stratoClient: Client, - column: String - ): ReadableStore[TweetId, StatusCounts] = { - StratoFetchableStore - .withView[TweetId, GetTweetOptions, Tweet](stratoClient, column, getTweetOptions) - .mapValues(_.counts.getOrElse(emptyStatusCount)) - } - - private val emptyStatusCount = StatusCounts() - - private val getTweetOptions = - GetTweetOptions( - includeRetweetCount = true, - includeReplyCount = true, - includeFavoriteCount = true, - includeQuoteCount = true) -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserInterestedInReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserInterestedInReadableStore.scala deleted file mode 100644 index e318c9185..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserInterestedInReadableStore.scala +++ /dev/null @@ -1,263 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.common.ModelVersions -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.UserId -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import com.twitter.simclusters_v2.thriftscala.EmbeddingType -import com.twitter.simclusters_v2.thriftscala.InternalId -import com.twitter.simclusters_v2.thriftscala.ModelVersion -import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.ManhattanCluster -import 
com.twitter.storehaus_internal.manhattan.Athena -import com.twitter.storehaus_internal.manhattan.ManhattanRO -import com.twitter.storehaus_internal.manhattan.ManhattanROConfig -import com.twitter.storehaus_internal.manhattan.Nash -import com.twitter.storehaus_internal.util.ApplicationID -import com.twitter.storehaus_internal.util.DatasetName -import com.twitter.storehaus_internal.util.HDFSPath - -object UserInterestedInReadableStore { - - // Clusters whose size is greater than this will not be considered. This is how the using UTEG - // experiment was run (because it could not process such clusters), and we don't have such a - // restriction for the Summingbird/Memcache implementation, but noticing that we aren't scoring - // tweets correctly in the big clusters. The fix for this seems a little involved, so for now - // let's just exclude such clusters. - val MaxClusterSizeForUserInterestedInDataset: Int = 5e6.toInt - - val modelVersionToDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145KDec11 -> "simclusters_v2_interested_in", - ModelVersions.Model20M145KUpdated -> "simclusters_v2_interested_in_20m_145k_updated", - ModelVersions.Model20M145K2020 -> "simclusters_v2_interested_in_20m_145k_2020" - ) - - // Producer embedding based User InterestedIn. - val modelVersionToDenserDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145KUpdated -> "simclusters_v2_interested_in_from_producer_embeddings_model20m145kupdated" - ) - - val modelVersionToIIAPEDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145K2020 -> "simclusters_v2_interested_in_from_ape_20m145k2020" - ) - - val modelVersionToIIKFLiteDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145K2020 -> "simclusters_v2_interested_in_lite_20m_145k_2020" - ) - - val modelVersionToNextInterestedInDatasetMap: Map[String, String] = Map( - ModelVersions.Model20M145K2020 -> "bet_consumer_embedding_v2" - ) - - val defaultModelVersion: String = ModelVersions.Model20M145KUpdated - val knownModelVersions: String = modelVersionToDatasetMap.keys.mkString(",") - - def defaultStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - modelVersion: String = defaultModelVersion - ): ReadableStore[UserId, ClustersUserIsInterestedIn] = { - if (!modelVersionToDatasetMap.contains(modelVersion)) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". Known model versions: " + knownModelVersions) - } - this.getStore("simclusters_v2", mhMtlsParams, modelVersionToDatasetMap(modelVersion)) - } - - def defaultSimClustersEmbeddingStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - embeddingType: EmbeddingType, - modelVersion: ModelVersion - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - defaultStoreWithMtls(mhMtlsParams, ModelVersions.toKnownForModelVersion(modelVersion)) - .composeKeyMapping[SimClustersEmbeddingId] { - case SimClustersEmbeddingId(theEmbeddingType, theModelVersion, InternalId.UserId(userId)) - if theEmbeddingType == embeddingType && theModelVersion == modelVersion => - userId - }.mapValues( - toSimClustersEmbedding(_, embeddingType, Some(MaxClusterSizeForUserInterestedInDataset))) - } - - def defaultIIKFLiteStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - modelVersion: String = defaultModelVersion - ): ReadableStore[Long, ClustersUserIsInterestedIn] = { - if (!modelVersionToIIKFLiteDatasetMap.contains(modelVersion)) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". 
Known model versions: " + knownModelVersions) - } - getStore("simclusters_v2", mhMtlsParams, modelVersionToIIKFLiteDatasetMap(modelVersion)) - } - - def defaultIIPEStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - modelVersion: String = defaultModelVersion - ): ReadableStore[Long, ClustersUserIsInterestedIn] = { - if (!modelVersionToDatasetMap.contains(modelVersion)) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". Known model versions: " + knownModelVersions) - } - getStore("simclusters_v2", mhMtlsParams, modelVersionToDenserDatasetMap(modelVersion)) - } - - def defaultIIAPEStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - modelVersion: String = defaultModelVersion - ): ReadableStore[Long, ClustersUserIsInterestedIn] = { - if (!modelVersionToDatasetMap.contains(modelVersion)) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". Known model versions: " + knownModelVersions) - } - getStore("simclusters_v2", mhMtlsParams, modelVersionToIIAPEDatasetMap(modelVersion)) - } - - def defaultIIPESimClustersEmbeddingStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - embeddingType: EmbeddingType, - modelVersion: ModelVersion - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - defaultIIPEStoreWithMtls(mhMtlsParams, ModelVersions.toKnownForModelVersion(modelVersion)) - .composeKeyMapping[SimClustersEmbeddingId] { - case SimClustersEmbeddingId(theEmbeddingType, theModelVersion, InternalId.UserId(userId)) - if theEmbeddingType == embeddingType && theModelVersion == modelVersion => - userId - - }.mapValues(toSimClustersEmbedding(_, embeddingType)) - } - - def defaultIIAPESimClustersEmbeddingStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - embeddingType: EmbeddingType, - modelVersion: ModelVersion - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - defaultIIAPEStoreWithMtls(mhMtlsParams, ModelVersions.toKnownForModelVersion(modelVersion)) - .composeKeyMapping[SimClustersEmbeddingId] { - case SimClustersEmbeddingId(theEmbeddingType, theModelVersion, InternalId.UserId(userId)) - if theEmbeddingType == embeddingType && theModelVersion == modelVersion => - userId - }.mapValues(toSimClustersEmbedding(_, embeddingType)) - } - - def defaultNextInterestedInStoreWithMtls( - mhMtlsParams: ManhattanKVClientMtlsParams, - embeddingType: EmbeddingType, - modelVersion: ModelVersion - ): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = { - if (!modelVersionToNextInterestedInDatasetMap.contains( - ModelVersions.toKnownForModelVersion(modelVersion))) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". Known model versions: " + knownModelVersions) - } - val datasetName = modelVersionToNextInterestedInDatasetMap( - ModelVersions.toKnownForModelVersion(modelVersion)) - new SimClustersManhattanReadableStoreForReadWriteDataset( - appId = "kafka_beam_sink_bet_consumer_embedding_prod", - datasetName = datasetName, - label = datasetName, - mtlsParams = mhMtlsParams, - manhattanCluster = Nash - ).mapValues(toSimClustersEmbedding(_, embeddingType)) - } - - def getWithMtls( - appId: String, - mtlsParams: ManhattanKVClientMtlsParams, - modelVersion: String = defaultModelVersion - ): ReadableStore[Long, ClustersUserIsInterestedIn] = { - if (!modelVersionToDatasetMap.contains(modelVersion)) { - throw new IllegalArgumentException( - "Unknown model version: " + modelVersion + ". 
Known model versions: " + knownModelVersions) - } - this.getStore(appId, mtlsParams, modelVersionToDatasetMap(modelVersion)) - } - - /** - * @param appId Manhattan AppId - * @param mtlsParams MltsParams for s2s Authentication - * - * @return ReadableStore of user to cluster interestedIn data set - */ - def getStore( - appId: String, - mtlsParams: ManhattanKVClientMtlsParams, - datasetName: String, - manhattanCluster: ManhattanCluster = Athena - ): ReadableStore[Long, ClustersUserIsInterestedIn] = { - - implicit val keyInjection: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val userInterestsCodec: Injection[ClustersUserIsInterestedIn, Array[Byte]] = - CompactScalaCodec(ClustersUserIsInterestedIn) - - ManhattanRO.getReadableStoreWithMtls[Long, ClustersUserIsInterestedIn]( - ManhattanROConfig( - HDFSPath(""), // not needed - ApplicationID(appId), - DatasetName(datasetName), - manhattanCluster - ), - mtlsParams - ) - } - - /** - * - * @param record ClustersUserIsInterestedIn thrift struct from the MH data set - * @param embeddingType Embedding Type as defined in com.twitter.simclusters_v2.thriftscala.EmbeddingType - * @param maxClusterSizeOpt Option param to set max cluster size. - * We will not filter out clusters based on cluster size if it is None - * @return - */ - def toSimClustersEmbedding( - record: ClustersUserIsInterestedIn, - embeddingType: EmbeddingType, - maxClusterSizeOpt: Option[Int] = None - ): SimClustersEmbedding = { - val embedding = record.clusterIdToScores - .collect { - case (clusterId, clusterScores) if maxClusterSizeOpt.forall { maxClusterSize => - clusterScores.numUsersInterestedInThisClusterUpperBound.exists(_ < maxClusterSize) - } => - val score = embeddingType match { - case EmbeddingType.FavBasedUserInterestedIn => - clusterScores.favScore - case EmbeddingType.FollowBasedUserInterestedIn => - clusterScores.followScore - case EmbeddingType.LogFavBasedUserInterestedIn => - clusterScores.logFavScore - case EmbeddingType.FavBasedUserInterestedInFromPE => - clusterScores.favScore - case EmbeddingType.FollowBasedUserInterestedInFromPE => - clusterScores.followScore - case EmbeddingType.LogFavBasedUserInterestedInFromPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedInFromAPE => - clusterScores.logFavScore - case EmbeddingType.FollowBasedUserInterestedInFromAPE => - clusterScores.followScore - case EmbeddingType.UserNextInterestedIn => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedAverageAddressBookFromIIAPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE => - clusterScores.logFavScore - case EmbeddingType.LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE => - clusterScores.logFavScore - - case _ => - throw new IllegalArgumentException(s"unknown EmbeddingType: $embeddingType") - } - score.map(clusterId -> _) - }.flatten.toMap - - SimClustersEmbedding(embedding) - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserKnownForReadableStore.scala b/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserKnownForReadableStore.scala deleted file mode 
100644 index 8655e605a..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/stores/UserKnownForReadableStore.scala +++ /dev/null @@ -1,75 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.stores - -import com.twitter.bijection.Injection -import com.twitter.bijection.scrooge.CompactScalaCodec -import com.twitter.simclusters_v2.thriftscala.{ClustersUserIsKnownFor, ModelVersion} -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.storehaus.ReadableStore -import com.twitter.storehaus_internal.manhattan.{Athena, ManhattanRO, ManhattanROConfig} -import com.twitter.storehaus_internal.util.{ApplicationID, DatasetName, HDFSPath} -import com.twitter.util.Future - -object UserKnownForReadableStore { - - private val dataSetNameDec11 = "simclusters_v2_known_for_20m_145k_dec11" - private val dataSetNameUpdated = "simclusters_v2_known_for_20m_145k_updated" - private val dataSetName2020 = "simclusters_v2_known_for_20m_145k_2020" - - private def buildForModelVersion( - appId: String, - storeName: String, - mhMtlsParams: ManhattanKVClientMtlsParams - ): ReadableStore[Long, ClustersUserIsKnownFor] = { - implicit val keyInjection: Injection[Long, Array[Byte]] = Injection.long2BigEndian - implicit val knownForCodec: Injection[ClustersUserIsKnownFor, Array[Byte]] = - CompactScalaCodec(ClustersUserIsKnownFor) - - ManhattanRO.getReadableStoreWithMtls[Long, ClustersUserIsKnownFor]( - ManhattanROConfig( - HDFSPath(""), // not needed - ApplicationID(appId), - DatasetName(storeName), - Athena - ), - mhMtlsParams - ) - } - - def get(appId: String, mhMtlsParams: ManhattanKVClientMtlsParams): UserKnownForReadableStore = { - val dec11Store = buildForModelVersion(appId, dataSetNameDec11, mhMtlsParams) - val updatedStore = buildForModelVersion(appId, dataSetNameUpdated, mhMtlsParams) - val version2020Store = buildForModelVersion(appId, dataSetName2020, mhMtlsParams) - - UserKnownForReadableStore(dec11Store, updatedStore, version2020Store) - } - - def getDefaultStore(mhMtlsParams: ManhattanKVClientMtlsParams): UserKnownForReadableStore = - get("simclusters_v2", mhMtlsParams) - -} - -case class Query(userId: Long, modelVersion: ModelVersion = ModelVersion.Model20m145kUpdated) - -/** - * Mainly used in debuggers to fetch the top knownFor clusters across different model versions - */ -case class UserKnownForReadableStore( - knownForStoreDec11: ReadableStore[Long, ClustersUserIsKnownFor], - knownForStoreUpdated: ReadableStore[Long, ClustersUserIsKnownFor], - knownForStore2020: ReadableStore[Long, ClustersUserIsKnownFor]) - extends ReadableStore[Query, ClustersUserIsKnownFor] { - - override def get(query: Query): Future[Option[ClustersUserIsKnownFor]] = { - query.modelVersion match { - case ModelVersion.Model20m145kDec11 => - knownForStoreDec11.get(query.userId) - case ModelVersion.Model20m145kUpdated => - knownForStoreUpdated.get(query.userId) - case ModelVersion.Model20m145k2020 => - knownForStore2020.get(query.userId) - case c => - throw new IllegalArgumentException( - s"Never heard of $c before! 
Is this a new model version?") - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/BUILD b/src/scala/com/twitter/simclusters_v2/summingbird/storm/BUILD deleted file mode 100644 index 62f92f3e7..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/BUILD +++ /dev/null @@ -1,27 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/twitter/algebird:core", - "3rdparty/jvm/com/twitter/algebird:util", - "3rdparty/jvm/com/twitter/bijection:core", - "3rdparty/jvm/com/twitter/bijection:util", - "3rdparty/jvm/com/twitter/storehaus:core", - "3rdparty/jvm/com/twitter/storehaus:memcache", - "3rdparty/src/jvm/com/twitter/storehaus:memcache", - "hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common", - "src/scala/com/twitter/simclusters_v2/summingbird:common", - "src/scala/com/twitter/simclusters_v2/summingbird:stores", - "src/scala/com/twitter/storehaus_internal/memcache/config", - "src/scala/com/twitter/storehaus_internal/online", - "src/scala/com/twitter/summingbird_internal/runner/common", - "src/scala/com/twitter/summingbird_internal/runner/store_config", - "src/scala/com/twitter/summingbird_internal/runner/storm", - "src/scala/com/twitter/summingbird_internal/sources/common", - "src/scala/com/twitter/summingbird_internal/sources/common/remote:TweetEventSource", - "src/scala/com/twitter/summingbird_internal/sources/storm/remote:TweetEventSource", - "src/scala/com/twitter/tormenta_internal/spout/eventbus", - "src/scala/com/twitter/wtf/summingbird/sources/common", - "src/scala/com/twitter/wtf/summingbird/sources/storm", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJob.scala b/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJob.scala deleted file mode 100644 index 1e0703647..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJob.scala +++ /dev/null @@ -1,151 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.storm - -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.summingbird.common.Monoids.PersistentSimClustersEmbeddingLongestL2NormMonoid -import com.twitter.simclusters_v2.summingbird.common.StatsUtil -import com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore.{ - LatestEmbeddingVersion, - LongestL2EmbeddingVersion, - PersistentTweetEmbeddingId -} -import com.twitter.simclusters_v2.thriftscala.{ - PersistentSimClustersEmbedding, - SimClustersEmbedding, - SimClustersEmbeddingMetadata -} -import com.twitter.summingbird.option.JobId -import com.twitter.summingbird.{Platform, Producer, TailProducer} -import com.twitter.timelineservice.thriftscala.Event -import com.twitter.tweetypie.thriftscala.StatusCounts - -/** - * The job to save the qualified tweet SimClustersEmbedding into the Strato Store (backed by Manhattan). - * - * The steps: - * 1. Read from the Favorite Stream. - * 2. Join with the Tweet Status Count Service. - * 3. Filter out the tweets whose favorite count < 8. - * We consider these tweets' SimClusters embeddings too noisy and untrustworthy. - * 4. Update the SimClusters Tweet embedding with timestamp 0L. - * 0L is reserved for the latest tweet embedding. It's also used to maintain the tweet count. - * 5. If the SimClusters Tweet embedding's update count is 2^N with N >= 3, - * persist the embeddings with the timestamp as part of the LK.
- **/ -private[storm] object PersistentTweetJob { - import StatsUtil._ - - private val MinFavoriteCount = 8 - type Timestamp = Long - - val longestL2NormMonoid = new PersistentSimClustersEmbeddingLongestL2NormMonoid() - - def generate[P <: Platform[P]]( - timelineEventSource: Producer[P, Event], - tweetStatusCountService: P#Service[TweetId, StatusCounts], - tweetEmbeddingService: P#Service[TweetId, SimClustersEmbedding], - persistentTweetEmbeddingStoreWithLatestAggregation: P#Store[ - PersistentTweetEmbeddingId, - PersistentSimClustersEmbedding - ], - persistentTweetEmbeddingStoreWithLongestL2NormAggregation: P#Store[ - PersistentTweetEmbeddingId, - PersistentSimClustersEmbedding - ] - )( - implicit jobId: JobId - ): TailProducer[P, Any] = { - - val timelineEvents: Producer[P, (TweetId, Timestamp)] = timelineEventSource - .collect { - case Event.Favorite(favoriteEvent) => - (favoriteEvent.tweetId, favoriteEvent.eventTimeMs) - } - - val filteredEvents = timelineEvents - .leftJoin[StatusCounts](tweetStatusCountService) - .filter { - case (_, (_, Some(statusCounts))) => - // Only consider tweets which has more than 8 favorite - statusCounts.favoriteCount.exists(_ >= MinFavoriteCount) - case _ => - false - } - .leftJoin[SimClustersEmbedding](tweetEmbeddingService) - - val latestAndPersistentEmbeddingProducer = filteredEvents - .collect { - case (tweetId, ((eventTimeMs, _), Some(tweetEmbedding))) => - ( - // This special timestamp is a reserved space for the latest tweet embedding. - PersistentTweetEmbeddingId(tweetId, timestampInMs = LatestEmbeddingVersion), - PersistentSimClustersEmbedding( - tweetEmbedding, - SimClustersEmbeddingMetadata(updatedAtMs = Some(eventTimeMs), updatedCount = Some(1)) - )) - } - .observe("num_of_embedding_updates") - .sumByKey(persistentTweetEmbeddingStoreWithLatestAggregation)( - Implicits.persistentSimClustersEmbeddingMonoid) - .name("latest_embedding_producer") - .flatMap { - case (persistentTweetEmbeddingId, (maybeEmbedding, deltaEmbedding)) => - lastQualifiedUpdatedCount( - maybeEmbedding.flatMap(_.metadata.updatedCount), - deltaEmbedding.metadata.updatedCount - ).map { newUpdateCount => - ( - persistentTweetEmbeddingId.copy(timestampInMs = - deltaEmbedding.metadata.updatedAtMs.getOrElse(0L)), - deltaEmbedding.copy(metadata = - deltaEmbedding.metadata.copy(updatedCount = Some(newUpdateCount))) - ) - } - } - .observe("num_of_extra_embedding") - .sumByKey(persistentTweetEmbeddingStoreWithLatestAggregation)( - Implicits.persistentSimClustersEmbeddingMonoid) - .name("persistent_embeddings_producer") - - val longestL2NormEmbeddingProducer = filteredEvents - .collect { - case (tweetId, ((eventTimeMs, Some(statusCounts)), Some(tweetEmbedding))) => - ( - // This special timestamp is a reserved space for the latest tweet embedding. - PersistentTweetEmbeddingId(tweetId, timestampInMs = LongestL2EmbeddingVersion), - PersistentSimClustersEmbedding( - tweetEmbedding, - SimClustersEmbeddingMetadata( - updatedAtMs = Some(eventTimeMs), - // We're not aggregating the existing embedding, we're replacing it. The count - // therefore needs to be the absolute fav count for this tweet, not the delta. 
- updatedCount = statusCounts.favoriteCount.map(_ + 1) - ) - )) - } - .observe("num_longest_l2_norm_updates") - .sumByKey(persistentTweetEmbeddingStoreWithLongestL2NormAggregation)(longestL2NormMonoid) - .name("longest_l2_norm_embedding_producer") - - latestAndPersistentEmbeddingProducer.also(longestL2NormEmbeddingProducer) - } - - /* - If this change in counts crosses one or more powers of 2 (8,16,32...), return the last boundary - that was crossed. In the case where a count delta is large, it may skip a power of 2, and - thus we may not store embeddings for all 2^(i+3) where 0 <= i <= tweetFavCount. - */ - private def lastQualifiedUpdatedCount( - existingUpdateCount: Option[Long], - deltaUpdateCount: Option[Long] - ): Option[Int] = { - val existing = existingUpdateCount.getOrElse(0L) - val sum = existing + deltaUpdateCount.getOrElse(0L) - qualifiedSet.filter { i => (existing < i) && (i <= sum) }.lastOption - } - - // Only 2 Power n while n >= 3 is qualified for Persistent. The max = 16,777,216 - private lazy val qualifiedSet = 3 - .until(25).map { i => Math.pow(2, i).toInt }.toSet - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJobRunner.scala b/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJobRunner.scala deleted file mode 100644 index b7960d846..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/PersistentTweetJobRunner.scala +++ /dev/null @@ -1,227 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.storm - -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.NullStatsReceiver -import com.twitter.hermit.store.common.ObservedCachedReadableStore -import com.twitter.scalding.Args -import com.twitter.simclusters_v2.common.SimClustersEmbedding -import com.twitter.simclusters_v2.common.TweetId -import com.twitter.simclusters_v2.summingbird.common.Monoids.PersistentSimClustersEmbeddingLongestL2NormMonoid -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.AltSetting -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.Environment -import com.twitter.simclusters_v2.summingbird.common.ClientConfigs -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile -import com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore.PersistentTweetEmbeddingId -import com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore -import com.twitter.simclusters_v2.summingbird.stores.TopKClustersForTweetKeyReadableStore -import com.twitter.simclusters_v2.summingbird.stores.TweetKey -import com.twitter.simclusters_v2.summingbird.stores.TweetStatusCountsStore -import com.twitter.simclusters_v2.thriftscala.PersistentSimClustersEmbedding -import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} -import com.twitter.storehaus.FutureCollector -import com.twitter.summingbird.online.option._ -import com.twitter.summingbird.option._ -import com.twitter.summingbird.storm.Storm -import com.twitter.summingbird.Options -import com.twitter.summingbird.TailProducer -import com.twitter.summingbird_internal.runner.common.JobName -import com.twitter.summingbird_internal.runner.common.SBRunConfig -import com.twitter.summingbird_internal.runner.storm.GenericRunner -import com.twitter.summingbird_internal.runner.storm.StormConfig -import com.twitter.tormenta_internal.spout.eventbus.SubscriberId -import 
com.twitter.tweetypie.thriftscala.StatusCounts -import com.twitter.wtf.summingbird.sources.storm.TimelineEventSource -import java.lang -import java.util.{HashMap => JMap} -import org.apache.heron.api.{Config => HeronConfig} -import org.apache.storm.{Config => BTConfig} - -object PersistentTweetJobRunner { - def main(args: Array[String]): Unit = { - GenericRunner(args, PersistentTweetStormJob(_)) - } -} - -object PersistentTweetStormJob { - - import com.twitter.simclusters_v2.summingbird.common.Implicits._ - - def jLong(num: Long): lang.Long = java.lang.Long.valueOf(num) - def jInt(num: Int): Integer = java.lang.Integer.valueOf(num) - def jFloat(num: Float): lang.Float = java.lang.Float.valueOf(num) - - def apply(args: Args): StormConfig = { - - lazy val env: String = args.getOrElse("env", "prod") - lazy val zone: String = args.getOrElse("dc", "atla") - lazy val alt: String = args.getOrElse("alt", default = "normal") - - lazy val profile = - SimClustersProfile.fetchPersistentJobProfile(Environment(env), AltSetting(alt)) - - lazy val stratoClient = ClientConfigs.stratoClient(profile.serviceIdentifier(zone)) - - lazy val favoriteEventSource = TimelineEventSource( - // Note: do not share the same subsriberId with other jobs. Apply a new one if needed - SubscriberId(profile.timelineEventSourceSubscriberId) - ).kafkaSource - - lazy val persistentTweetEmbeddingStore = - PersistentTweetEmbeddingStore - .persistentTweetEmbeddingStore(stratoClient, profile.persistentTweetStratoPath) - - lazy val persistentTweetEmbeddingStoreWithLatestAggregation: Storm#Store[ - PersistentTweetEmbeddingId, - PersistentSimClustersEmbedding - ] = { - import com.twitter.storehaus.algebra.StoreAlgebra._ - - lazy val mergeableStore = - persistentTweetEmbeddingStore.toMergeable( - mon = Implicits.persistentSimClustersEmbeddingMonoid, - fc = implicitly[FutureCollector]) - - Storm.onlineOnlyStore(mergeableStore) - } - - lazy val persistentTweetEmbeddingStoreWithLongestL2NormAggregation: Storm#Store[ - PersistentTweetEmbeddingId, - PersistentSimClustersEmbedding - ] = { - import com.twitter.storehaus.algebra.StoreAlgebra._ - - val longestL2NormMonoid = new PersistentSimClustersEmbeddingLongestL2NormMonoid() - lazy val mergeableStore = - persistentTweetEmbeddingStore.toMergeable( - mon = longestL2NormMonoid, - fc = implicitly[FutureCollector]) - - Storm.onlineOnlyStore(mergeableStore) - } - - lazy val tweetStatusCountsService: Storm#Service[TweetId, StatusCounts] = - Storm.service( - ObservedCachedReadableStore.from[TweetId, StatusCounts]( - TweetStatusCountsStore.tweetStatusCountsStore(stratoClient, "tweetypie/core.Tweet"), - ttl = 1.minute, - maxKeys = 10000, // 10K is enough for Heron Job. - cacheName = "tweet_status_count", - windowSize = 10000L - )(NullStatsReceiver) - ) - - lazy val tweetEmbeddingService: Storm#Service[TweetId, ThriftSimClustersEmbedding] = - Storm.service( - TopKClustersForTweetKeyReadableStore - .overrideLimitDefaultStore(50, profile.serviceIdentifier(zone)) - .composeKeyMapping { tweetId: TweetId => - TweetKey(tweetId, profile.modelVersionStr, profile.coreEmbeddingType) - }.mapValues { value => SimClustersEmbedding(value).toThrift }) - - new StormConfig { - - val jobName: JobName = JobName(profile.jobName) - - implicit val jobID: JobId = JobId(jobName.toString) - - /** - * Add registrars for chill serialization for user-defined types. 
- */ - override def registrars = - List( - SBRunConfig.register[StatusCounts], - SBRunConfig.register[ThriftSimClustersEmbedding], - SBRunConfig.register[PersistentSimClustersEmbedding] - ) - - /***** Job configuration settings *****/ - /** - * Use vmSettings to configure the VM - */ - override def vmSettings: Seq[String] = Seq() - - private val SourcePerWorker = 1 - private val FlatMapPerWorker = 1 - private val SummerPerWorker = 1 - - private val TotalWorker = 60 - - /** - * Use transformConfig to set Heron options. - */ - override def transformConfig(config: Map[String, AnyRef]): Map[String, AnyRef] = { - - val heronJvmOptions = new JMap[String, AnyRef]() - - val MetaspaceSize = jLong(256L * 1024 * 1024) - val DefaultHeapSize = jLong(2L * 1024 * 1024 * 1024) - val HighHeapSize = jLong(4L * 1024 * 1024 * 1024) - - val TotalCPU = jLong( - SourcePerWorker * 1 + FlatMapPerWorker * 4 + SummerPerWorker * 3 + 1 - ) - - // reserve 4GB for the StreamMgr - val TotalRam = jLong( - DefaultHeapSize * (SourcePerWorker * 1 + FlatMapPerWorker * 4) - + HighHeapSize * SummerPerWorker * 3 - + MetaspaceSize * 8 // Applies to all workers - + 4L * 1024 * 1024 * 1024) - - // These settings help prevent GC issues in the most memory intensive steps of the job by - // dedicating more memory to the new gen heap designated by the -Xmn flag. - Map( - "Tail" -> HighHeapSize - ).foreach { - case (stage, heap) => - HeronConfig.setComponentJvmOptions( - heronJvmOptions, - stage, - s"-Xmx$heap -Xms$heap -Xmn${heap / 2}" - ) - } - - super.transformConfig(config) ++ List( - BTConfig.TOPOLOGY_TEAM_NAME -> "cassowary", - BTConfig.TOPOLOGY_TEAM_EMAIL -> "no-reply@twitter.com", - BTConfig.TOPOLOGY_WORKERS -> jInt(TotalWorker), - BTConfig.TOPOLOGY_ACKER_EXECUTORS -> jInt(0), - BTConfig.TOPOLOGY_MESSAGE_TIMEOUT_SECS -> jInt(30), - BTConfig.TOPOLOGY_WORKER_CHILDOPTS -> List( - "-Djava.security.auth.login.config=config/jaas.conf", - "-Dsun.security.krb5.debug=true", - "-Dcom.twitter.eventbus.client.EnableKafkaSaslTls=true", - "-Dcom.twitter.eventbus.client.zoneName=" + zone, - s"-XX:MaxMetaspaceSize=$MetaspaceSize" - ).mkString(" "), - HeronConfig.TOPOLOGY_CONTAINER_CPU_REQUESTED -> TotalCPU, - HeronConfig.TOPOLOGY_CONTAINER_RAM_REQUESTED -> TotalRam, - "storm.job.uniqueId" -> jobID.get - ) - } - - /** - * Use getNamedOptions to set Summingbird runtime options - * The list of available options: com.twitter.summingbird.online.option - */ - override def getNamedOptions: Map[String, Options] = Map( - "DEFAULT" -> Options() - .set(SummerParallelism(TotalWorker * SummerPerWorker)) - .set(FlatMapParallelism(TotalWorker * FlatMapPerWorker)) - .set(SourceParallelism(TotalWorker * SourcePerWorker)) - .set(CacheSize(10000)) - .set(FlushFrequency(30.seconds)) - ) - - /** Required job generation call for your job, defined in Job.scala */ - override def graph: TailProducer[Storm, Any] = PersistentTweetJob.generate[Storm]( - favoriteEventSource, - tweetStatusCountsService, - tweetEmbeddingService, - persistentTweetEmbeddingStoreWithLatestAggregation, - persistentTweetEmbeddingStoreWithLongestL2NormAggregation - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJob.scala b/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJob.scala deleted file mode 100644 index 54ac8011a..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJob.scala +++ /dev/null @@ -1,232 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.storm - -import 
com.twitter.simclusters_v2.common.ModelVersions._ -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.SimClustersTweetProfile -import com.twitter.simclusters_v2.summingbird.common.Configs -import com.twitter.simclusters_v2.summingbird.common.Implicits -import com.twitter.simclusters_v2.summingbird.common.SimClustersHashUtil -import com.twitter.simclusters_v2.summingbird.common.SimClustersInterestedInUtil -import com.twitter.simclusters_v2.summingbird.common.StatsUtil -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.summingbird._ -import com.twitter.summingbird.option.JobId -import com.twitter.timelineservice.thriftscala.Event -import com.twitter.conversions.DurationOps._ -import com.twitter.timelineservice.thriftscala.EventAliases.FavoriteAlias - -object TweetJob { - - import Implicits._ - import StatsUtil._ - - object NodeName { - final val TweetClusterScoreFlatMapNodeName: String = "TweetClusterScoreFlatMap" - final val TweetClusterUpdatedScoresFlatMapNodeName: String = "TweetClusterUpdatedScoreFlatMap" - final val TweetClusterScoreSummerNodeName: String = "TweetClusterScoreSummer" - final val TweetTopKNodeName: String = "TweetTopKSummer" - final val ClusterTopKTweetsNodeName: String = "ClusterTopKTweetsSummer" - final val ClusterTopKTweetsLightNodeName: String = "ClusterTopKTweetsLightSummer" - } - - def generate[P <: Platform[P]]( - profile: SimClustersTweetProfile, - timelineEventSource: Producer[P, Event], - userInterestedInService: P#Service[Long, ClustersUserIsInterestedIn], - tweetClusterScoreStore: P#Store[(SimClusterEntity, FullClusterIdBucket), ClustersWithScores], - tweetTopKClustersStore: P#Store[EntityWithVersion, TopKClustersWithScores], - clusterTopKTweetsStore: P#Store[FullClusterId, TopKTweetsWithScores], - clusterTopKTweetsLightStore: Option[P#Store[FullClusterId, TopKTweetsWithScores]] - )( - implicit jobId: JobId - ): TailProducer[P, Any] = { - - val userInterestNonEmptyCount = Counter(Group(jobId.get), Name("num_user_interests_non_empty")) - val userInterestEmptyCount = Counter(Group(jobId.get), Name("num_user_interests_empty")) - - val numClustersCount = Counter(Group(jobId.get), Name("num_clusters")) - - val entityClusterPairCount = Counter(Group(jobId.get), Name("num_entity_cluster_pairs_emitted")) - - // Fav QPS is around 6K - val qualifiedFavEvents = timelineEventSource - .collect { - case Event.Favorite(favEvent) - if favEvent.userId != favEvent.tweetUserId && !isTweetTooOld(favEvent) => - (favEvent.userId, favEvent) - } - .observe("num_qualified_favorite_events") - - val entityWithSimClustersProducer = qualifiedFavEvents - .leftJoin(userInterestedInService) - .map { - case (_, (favEvent, userInterestOpt)) => - (favEvent.tweetId, (favEvent, userInterestOpt)) - } - .flatMap { - case (_, (favEvent, Some(userInterests))) => - userInterestNonEmptyCount.incr() - - val timestamp = favEvent.eventTimeMs - - val clustersWithScores = SimClustersInterestedInUtil.topClustersWithScores(userInterests) - - // clusters.size is around 25 in average - numClustersCount.incrBy(clustersWithScores.size) - - val simClusterScoresByHashBucket = clustersWithScores.groupBy { - case (clusterId, _) => SimClustersHashUtil.clusterIdToBucket(clusterId) - } - - for { - (hashBucket, scores) <- simClusterScoresByHashBucket - } yield { - entityClusterPairCount.incr() - - val clusterBucket = FullClusterIdBucket(userInterests.knownForModelVersion, hashBucket) - - val tweetId: SimClusterEntity = 
SimClusterEntity.TweetId(favEvent.tweetId) - - (tweetId, clusterBucket) -> SimClustersInterestedInUtil - .buildClusterWithScores( - scores, - timestamp, - profile.favScoreThresholdForUserInterest - ) - } - case _ => - userInterestEmptyCount.incr() - None - } - .observe("entity_cluster_delta_scores") - .name(NodeName.TweetClusterScoreFlatMapNodeName) - .sumByKey(tweetClusterScoreStore)(clustersWithScoreMonoid) - .name(NodeName.TweetClusterScoreSummerNodeName) - .map { - case ((simClusterEntity, clusterBucket), (oldValueOpt, deltaValue)) => - val updatedClusterIds = deltaValue.clustersToScore.map(_.keySet).getOrElse(Set.empty[Int]) - - (simClusterEntity, clusterBucket) -> clustersWithScoreMonoid.plus( - oldValueOpt - .map { oldValue => - oldValue.copy( - clustersToScore = - oldValue.clustersToScore.map(_.filterKeys(updatedClusterIds.contains)) - ) - }.getOrElse(clustersWithScoreMonoid.zero), - deltaValue - ) - } - .observe("entity_cluster_updated_scores") - .name(NodeName.TweetClusterUpdatedScoresFlatMapNodeName) - - val tweetTopK = entityWithSimClustersProducer - .flatMap { - case ((simClusterEntity, FullClusterIdBucket(modelVersion, _)), clusterWithScores) - if simClusterEntity.isInstanceOf[SimClusterEntity.TweetId] => - clusterWithScores.clustersToScore - .map { clustersToScores => - val topClustersWithFavScores = clustersToScores.mapValues { scores: Scores => - Scores( - favClusterNormalized8HrHalfLifeScore = - scores.favClusterNormalized8HrHalfLifeScore.filter( - _.value >= Configs.scoreThresholdForTweetTopKClustersCache - ) - ) - } - - ( - EntityWithVersion(simClusterEntity, modelVersion), - TopKClustersWithScores(Some(topClustersWithFavScores), None) - ) - } - case _ => - None - - } - .observe("tweet_topk_updates") - .sumByKey(tweetTopKClustersStore)(topKClustersWithScoresMonoid) - .name(NodeName.TweetTopKNodeName) - - val clusterTopKTweets = entityWithSimClustersProducer - .flatMap { - case ((simClusterEntity, FullClusterIdBucket(modelVersion, _)), clusterWithScores) => - simClusterEntity match { - case SimClusterEntity.TweetId(tweetId) => - clusterWithScores.clustersToScore - .map { clustersToScores => - clustersToScores.toSeq.map { - case (clusterId, scores) => - val topTweetsByFavScore = Map( - tweetId -> Scores(favClusterNormalized8HrHalfLifeScore = - scores.favClusterNormalized8HrHalfLifeScore.filter(_.value >= - Configs.scoreThresholdForClusterTopKTweetsCache))) - - ( - FullClusterId(modelVersion, clusterId), - TopKTweetsWithScores(Some(topTweetsByFavScore), None) - ) - } - }.getOrElse(Nil) - case _ => - Nil - } - } - .observe("cluster_topk_tweets_updates") - .sumByKey(clusterTopKTweetsStore)(topKTweetsWithScoresMonoid) - .name(NodeName.ClusterTopKTweetsNodeName) - - val clusterTopKTweetsLight = clusterTopKTweetsLightStore.map { lightStore => - entityWithSimClustersProducer - .flatMap { - case ((simClusterEntity, FullClusterIdBucket(modelVersion, _)), clusterWithScores) => - simClusterEntity match { - case SimClusterEntity.TweetId(tweetId) if isTweetTooOldForLight(tweetId) => - clusterWithScores.clustersToScore - .map { clustersToScores => - clustersToScores.toSeq.map { - case (clusterId, scores) => - val topTweetsByFavScore = Map( - tweetId -> Scores(favClusterNormalized8HrHalfLifeScore = - scores.favClusterNormalized8HrHalfLifeScore.filter(_.value >= - Configs.scoreThresholdForClusterTopKTweetsCache))) - - ( - FullClusterId(modelVersion, clusterId), - TopKTweetsWithScores(Some(topTweetsByFavScore), None) - ) - } - }.getOrElse(Nil) - case _ => - Nil - } - } - 
.observe("cluster_topk_tweets_updates") - .sumByKey(lightStore)(topKTweetsWithScoresLightMonoid) - .name(NodeName.ClusterTopKTweetsLightNodeName) - } - - clusterTopKTweetsLight match { - case Some(lightNode) => - tweetTopK.also(clusterTopKTweets).also(lightNode) - case None => - tweetTopK.also(clusterTopKTweets) - } - } - - // Boolean check to see if the tweet is too old - private def isTweetTooOld(favEvent: FavoriteAlias): Boolean = { - favEvent.tweet.forall { tweet => - SnowflakeId.unixTimeMillisOptFromId(tweet.id).exists { millis => - System.currentTimeMillis() - millis >= Configs.OldestTweetFavEventTimeInMillis - } - } - } - - private def isTweetTooOldForLight(tweetId: Long): Boolean = { - SnowflakeId.unixTimeMillisOptFromId(tweetId).exists { millis => - System.currentTimeMillis() - millis >= Configs.OldestTweetInLightIndexInMillis - } - } - -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJobRunner.scala b/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJobRunner.scala deleted file mode 100644 index 11a94a47b..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/TweetJobRunner.scala +++ /dev/null @@ -1,242 +0,0 @@ -package com.twitter.simclusters_v2.summingbird.storm - -import com.twitter.conversions.DurationOps._ -import com.twitter.heron.util.CommonMetric -import com.twitter.scalding.Args -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.AltSetting -import com.twitter.simclusters_v2.summingbird.common.SimClustersProfile.Environment -import com.twitter.simclusters_v2.summingbird.stores.EntityClusterScoreReadableStore -import com.twitter.simclusters_v2.summingbird.stores.TopKClustersForTweetReadableStore -import com.twitter.simclusters_v2.summingbird.stores.TopKTweetsForClusterReadableStore -import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore -import com.twitter.simclusters_v2.thriftscala._ -import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams -import com.twitter.summingbird.online.option._ -import com.twitter.summingbird.option._ -import com.twitter.summingbird.storm.option.FlatMapStormMetrics -import com.twitter.summingbird.storm.option.SummerStormMetrics -import com.twitter.summingbird.storm.Storm -import com.twitter.summingbird.storm.StormMetric -import com.twitter.summingbird.Options -import com.twitter.summingbird.TailProducer -import com.twitter.summingbird_internal.runner.common.JobName -import com.twitter.summingbird_internal.runner.common.SBRunConfig -import com.twitter.summingbird_internal.runner.storm.GenericRunner -import com.twitter.summingbird_internal.runner.storm.StormConfig -import com.twitter.tormenta_internal.spout.eventbus.SubscriberId -import com.twitter.wtf.summingbird.sources.storm.TimelineEventSource -import java.lang -import org.apache.heron.api.{Config => HeronConfig} -import org.apache.heron.common.basics.ByteAmount -import org.apache.storm.{Config => BTConfig} -import scala.collection.JavaConverters._ - -object TweetJobRunner { - def main(args: Array[String]): Unit = { - GenericRunner(args, TweetStormJob(_)) - } -} - -object TweetStormJob { - - import com.twitter.simclusters_v2.summingbird.common.Implicits._ - - def jLong(num: Long): lang.Long = java.lang.Long.valueOf(num) - def jInt(num: Int): Integer = java.lang.Integer.valueOf(num) - def apply(args: Args): StormConfig = { - - lazy val env: String = args.getOrElse("env", "prod") - lazy val zone: String 
= args.getOrElse("dc", "atla") - - // The only SimClusters ENV is Alt. Will clean up soon. - lazy val profile = SimClustersProfile.fetchTweetJobProfile(Environment(env), AltSetting.Alt) - - lazy val favoriteEventSource = TimelineEventSource( - // Note: do not share the same subsriberId with other jobs. Apply a new one if needed - SubscriberId(profile.timelineEventSourceSubscriberId) - ).source - - lazy val commonMetric = - StormMetric(new CommonMetric(), CommonMetric.NAME, CommonMetric.POLL_INTERVAL) - lazy val flatMapMetrics = FlatMapStormMetrics(Iterable(commonMetric)) - lazy val summerMetrics = SummerStormMetrics(Iterable(commonMetric)) - - lazy val entityClusterScoreStore: Storm#Store[ - (SimClusterEntity, FullClusterIdBucket), - ClustersWithScores - ] = { - Storm.store( - EntityClusterScoreReadableStore - .onlineMergeableStore(profile.entityClusterScorePath, profile.serviceIdentifier(zone))) - } - - lazy val tweetTopKStore: Storm#Store[EntityWithVersion, TopKClustersWithScores] = { - Storm.store( - TopKClustersForTweetReadableStore - .onlineMergeableStore(profile.tweetTopKClustersPath, profile.serviceIdentifier(zone))) - } - - lazy val clusterTopKTweetsStore: Storm#Store[FullClusterId, TopKTweetsWithScores] = { - Storm.store( - TopKTweetsForClusterReadableStore - .onlineMergeableStore(profile.clusterTopKTweetsPath, profile.serviceIdentifier(zone))) - } - - lazy val clusterTopKTweetsLightStore: Option[ - Storm#Store[FullClusterId, TopKTweetsWithScores] - ] = { - profile.clusterTopKTweetsLightPath.map { lightPath => - Storm.store( - TopKTweetsForClusterReadableStore - .onlineMergeableStore(lightPath, profile.serviceIdentifier(zone))) - } - } - - lazy val userInterestedInService: Storm#Service[Long, ClustersUserIsInterestedIn] = { - Storm.service( - UserInterestedInReadableStore.defaultStoreWithMtls( - ManhattanKVClientMtlsParams(profile.serviceIdentifier(zone)), - modelVersion = profile.modelVersionStr - )) - } - - new StormConfig { - - val jobName: JobName = JobName(profile.jobName) - - implicit val jobID: JobId = JobId(jobName.toString) - - /** - * Add registrars for chill serialization for user-defined types. - */ - override def registrars = - List( - SBRunConfig.register[SimClusterEntity], - SBRunConfig.register[FullClusterIdBucket], - SBRunConfig.register[ClustersWithScores], - SBRunConfig.register[EntityWithVersion], - SBRunConfig.register[FullClusterId], - SBRunConfig.register[EntityWithVersion], - SBRunConfig.register[TopKEntitiesWithScores], - SBRunConfig.register[TopKClustersWithScores], - SBRunConfig.register[TopKTweetsWithScores] - ) - - /***** Job configuration settings *****/ - /** - * Use vmSettings to configure the VM - */ - override def vmSettings: Seq[String] = Seq() - - private val SourcePerWorker = 1 - private val FlatMapPerWorker = 3 - private val SummerPerWorker = 3 - - private val TotalWorker = 150 - - /** - * Use transformConfig to set Heron options. 
- */ - override def transformConfig(config: Map[String, AnyRef]): Map[String, AnyRef] = { - val heronConfig = new HeronConfig() - - /** - Component names (subject to change if you add more components, make sure to update this) - Source: Tail-FlatMap-FlatMap-Summer-FlatMap-Source - FlatMap: Tail-FlatMap-FlatMap-Summer-FlatMap, Tail-FlatMap-FlatMap, Tail-FlatMap-FlatMap, - Tail-FlatMap - Summer: Tail-FlatMap-FlatMap-Summer * 2, Tail, Tail.2 - */ - val sourceName = "Tail-FlatMap-FlatMap-Summer-FlatMap-Source" - val flatMapFlatMapSummerFlatMapName = "Tail-FlatMap-FlatMap-Summer-FlatMap" - - // 1 CPU per node, 1 for StreamMgr - // By default, numCpus per component = totalCPUs / total number of components. - // To add more CPUs for a specific component, use heronConfig.setComponentCpu(name, numCPUs) - // add 20% more CPUs to address back pressure issue - val TotalCPU = jLong( - (1.2 * (SourcePerWorker * 1 + FlatMapPerWorker * 4 + SummerPerWorker * 6 + 1)).ceil.toInt) - heronConfig.setContainerCpuRequested(TotalCPU.toDouble) - - // RAM settings - val RamPerSourceGB = 8 - val RamPerSummerFlatMap = 8 - val RamDefaultPerComponent = 4 - - // The extra 4GB is not explicitly assigned to the StreamMgr, so it gets 2GB by default, and - // the remaining 2GB is shared among components. Keeping this configuration for now, since - // it seems stable - val TotalRamRB = - RamPerSourceGB * SourcePerWorker * 1 + - RamDefaultPerComponent * FlatMapPerWorker * 4 + - RamDefaultPerComponent * SummerPerWorker * 6 + - 4 // reserve 4GB for the StreamMgr - - // By default, ramGB per component = totalRAM / total number of components. - // To adjust RAMs for a specific component, use heronConfig.setComponentRam(name, ramGB) - heronConfig.setComponentRam(sourceName, ByteAmount.fromGigabytes(RamPerSourceGB)) - heronConfig.setComponentRam( - flatMapFlatMapSummerFlatMapName, - ByteAmount.fromGigabytes(RamPerSummerFlatMap)) - heronConfig.setContainerRamRequested(ByteAmount.fromGigabytes(TotalRamRB)) - - super.transformConfig(config) ++ List( - BTConfig.TOPOLOGY_TEAM_NAME -> "cassowary", - BTConfig.TOPOLOGY_TEAM_EMAIL -> "no-reply@twitter.com", - BTConfig.TOPOLOGY_WORKERS -> jInt(TotalWorker), - BTConfig.TOPOLOGY_ACKER_EXECUTORS -> jInt(0), - BTConfig.TOPOLOGY_MESSAGE_TIMEOUT_SECS -> jInt(30), - BTConfig.TOPOLOGY_WORKER_CHILDOPTS -> List( - "-XX:MaxMetaspaceSize=256M", - "-Djava.security.auth.login.config=config/jaas.conf", - "-Dsun.security.krb5.debug=true", - "-Dcom.twitter.eventbus.client.EnableKafkaSaslTls=true", - "-Dcom.twitter.eventbus.client.zoneName=" + zone - ).mkString(" "), - "storm.job.uniqueId" -> jobID.get - ) ++ heronConfig.asScala.toMap - } - - /** - * Use getNamedOptions to set Summingbird runtime options - * The list of available options: com.twitter.summingbird.online.option - */ - override def getNamedOptions: Map[String, Options] = Map( - "DEFAULT" -> Options() - .set(FlatMapParallelism(TotalWorker * FlatMapPerWorker)) - .set(SourceParallelism(TotalWorker)) - .set(SummerBatchMultiplier(1000)) - .set(CacheSize(10000)) - .set(flatMapMetrics) - .set(summerMetrics), - TweetJob.NodeName.TweetClusterUpdatedScoresFlatMapNodeName -> Options() - .set(FlatMapParallelism(TotalWorker * FlatMapPerWorker)), - TweetJob.NodeName.TweetClusterScoreSummerNodeName -> Options() - // Most expensive step. Double the capacity. 
- .set(SummerParallelism(TotalWorker * SummerPerWorker * 4)) - .set(FlushFrequency(30.seconds)), - TweetJob.NodeName.ClusterTopKTweetsNodeName -> Options() - .set(SummerParallelism(TotalWorker * SummerPerWorker)) - .set(FlushFrequency(30.seconds)), - TweetJob.NodeName.ClusterTopKTweetsLightNodeName -> Options() - .set(SummerParallelism(TotalWorker * SummerPerWorker)) - .set(FlushFrequency(30.seconds)), - TweetJob.NodeName.TweetTopKNodeName -> Options() - .set(SummerParallelism(TotalWorker * SummerPerWorker)) - .set(FlushFrequency(30.seconds)) - ) - - /** Required job generation call for your job, defined in Job.scala */ - override def graph: TailProducer[Storm, Any] = TweetJob.generate[Storm]( - profile, - favoriteEventSource, - userInterestedInService, - entityClusterScoreStore, - tweetTopKStore, - clusterTopKTweetsStore, - clusterTopKTweetsLightStore - ) - } - } -} diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/persistent_tweet_job_deploy.sh b/src/scala/com/twitter/simclusters_v2/summingbird/storm/persistent_tweet_job_deploy.sh deleted file mode 100755 index 9340c72bb..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/persistent_tweet_job_deploy.sh +++ /dev/null @@ -1,77 +0,0 @@ -#!/bin/bash -# script to deploy simclusters persistent storm job to CI - -set -u -e - -cd "$(git rev-parse --show-toplevel)" - -# shellcheck source=/dev/null -. "$(git rev-parse --show-toplevel)/devprod/source-sh-setup" - -function usage { - cat <// where can only be devel or prod -AURORA_PATH=${AURORA_PATH:="$CLUSTER/$USER/$ENV"} -AURORA_JOB_KEY="${AURORA_PATH}/${JOB_NAME}" - -heron kill "$AURORA_PATH" "$JOB_NAME" || true - -echo "Waiting 5 seconds so heron is sure its dead" -sleep 5 - -echo "AURORA_JOB_KEY: $AURORA_JOB_KEY" - -echo "Starting your topology... for ${ENV} ${JOB_NAME}" -#set -v - -heron submit "${AURORA_PATH}" "dist/${JAR_NAME}" com.twitter.simclusters_v2.summingbird.storm.PersistentTweetJobRunner --env "$ENV" --dc "$CLUSTER" diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_alt_job_deploy.sh b/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_alt_job_deploy.sh deleted file mode 100755 index 67b14d126..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_alt_job_deploy.sh +++ /dev/null @@ -1,78 +0,0 @@ -#!/bin/bash -# script to deploy simcluster storm job to CI - -set -u -e - -cd "$(git rev-parse --show-toplevel)" - -# shellcheck source=/dev/null -. "$(git rev-parse --show-toplevel)/devprod/source-sh-setup" - -function usage { - cat <// where can only be devel or prod -AURORA_PATH=${AURORA_PATH:="$CLUSTER/$USER/$ENV"} -AURORA_JOB_KEY="${AURORA_PATH}/${JOB_NAME}" - -heron kill "$AURORA_PATH" "$JOB_NAME" || true - -echo "Waiting 5 seconds so heron is sure its dead" -sleep 5 - -echo "AURORA_JOB_KEY: $AURORA_JOB_KEY" - -echo "Starting your topology... 
for ${ENV} ${JOB_NAME}" -#set -v - -heron submit "${AURORA_PATH}" "dist/${JAR_NAME}" com.twitter.simclusters_v2.summingbird.storm.TweetJobRunner --env "$ENV" --dc "$CLUSTER" --alt "alt" --usingLogFavScore - diff --git a/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_job_deploy.sh b/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_job_deploy.sh deleted file mode 100755 index b3e4f22d4..000000000 --- a/src/scala/com/twitter/simclusters_v2/summingbird/storm/tweet_job_deploy.sh +++ /dev/null @@ -1,77 +0,0 @@ -#!/bin/bash -# script to deploy simcluster storm job to CI - -set -u -e - -cd "$(git rev-parse --show-toplevel)" - -# shellcheck source=/dev/null -. "$(git rev-parse --show-toplevel)/devprod/source-sh-setup" - -function usage { - cat <// where can only be devel or prod -AURORA_PATH=${AURORA_PATH:="$CLUSTER/$USER/$ENV"} -AURORA_JOB_KEY="${AURORA_PATH}/${JOB_NAME}" - -heron kill "$AURORA_PATH" "$JOB_NAME" || true - -echo "Waiting 5 seconds so heron is sure its dead" -sleep 5 - -echo "AURORA_JOB_KEY: $AURORA_JOB_KEY" - -echo "Starting your topology... for ${ENV} ${JOB_NAME}" -#set -v - -heron submit "${AURORA_PATH}" "dist/${JAR_NAME}" com.twitter.simclusters_v2.summingbird.storm.TweetJobRunner --env "$ENV" --dc "$CLUSTER" diff --git a/src/scala/com/twitter/simclusters_v2/tweet_similarity/BUILD b/src/scala/com/twitter/simclusters_v2/tweet_similarity/BUILD deleted file mode 100644 index 526ee6d23..000000000 --- a/src/scala/com/twitter/simclusters_v2/tweet_similarity/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -scala_library( - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/scala/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/featurestore/catalog/features/recommendations:aggregate", - "src/scala/com/twitter/ml/featurestore/lib/embedding", - "src/scala/com/twitter/simclusters_v2/common", - "src/scala/com/twitter/simclusters_v2/common/ml", - ], -) diff --git a/src/scala/com/twitter/simclusters_v2/tweet_similarity/ModelBasedTweetSimilaritySimClustersEmbeddingAdapter.scala b/src/scala/com/twitter/simclusters_v2/tweet_similarity/ModelBasedTweetSimilaritySimClustersEmbeddingAdapter.scala deleted file mode 100644 index f1c3f8cc2..000000000 --- a/src/scala/com/twitter/simclusters_v2/tweet_similarity/ModelBasedTweetSimilaritySimClustersEmbeddingAdapter.scala +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.simclusters_v2.tweet_similarity - -import com.twitter.ml.api.{DataRecord, DataRecordMerger} -import com.twitter.simclusters_v2.common.ml.{ - SimClustersEmbeddingAdapter, - NormalizedSimClustersEmbeddingAdapter -} -import com.twitter.simclusters_v2.common.SimClustersEmbedding - -object ModelBasedTweetSimilaritySimClustersEmbeddingAdapter { - val QueryEmbAdapter = new SimClustersEmbeddingAdapter(TweetSimilarityFeatures.QueryTweetEmbedding) - val CandidateEmbAdapter = new SimClustersEmbeddingAdapter( - TweetSimilarityFeatures.CandidateTweetEmbedding) - - val NormalizedQueryEmbAdapter = new NormalizedSimClustersEmbeddingAdapter( - TweetSimilarityFeatures.QueryTweetEmbedding, - TweetSimilarityFeatures.QueryTweetEmbeddingNorm) - val NormalizedCandidateEmbAdapter = new NormalizedSimClustersEmbeddingAdapter( - TweetSimilarityFeatures.CandidateTweetEmbedding, - TweetSimilarityFeatures.CandidateTweetEmbeddingNorm) - - def adaptEmbeddingPairToDataRecord( - queryEmbedding: SimClustersEmbedding, - candidateEmbedding: SimClustersEmbedding, - normalized: Boolean - ): DataRecord = { - val DataRecordMerger = new DataRecordMerger() - val queryAdapter = if 
(normalized) NormalizedQueryEmbAdapter else QueryEmbAdapter - val candidateAdapter = if (normalized) NormalizedCandidateEmbAdapter else CandidateEmbAdapter - - val featureDataRecord = queryAdapter.adaptToDataRecord(queryEmbedding) - DataRecordMerger.merge( - featureDataRecord, - candidateAdapter.adaptToDataRecord(candidateEmbedding)) - featureDataRecord - } -} diff --git a/src/scala/com/twitter/simclusters_v2/tweet_similarity/TweetSimilarityFeatures.scala b/src/scala/com/twitter/simclusters_v2/tweet_similarity/TweetSimilarityFeatures.scala deleted file mode 100644 index 0d6b90c95..000000000 --- a/src/scala/com/twitter/simclusters_v2/tweet_similarity/TweetSimilarityFeatures.scala +++ /dev/null @@ -1,54 +0,0 @@ -package com.twitter.simclusters_v2.tweet_similarity - -import com.twitter.ml.api.Feature.{Binary, Continuous, Discrete, SparseContinuous} -import com.twitter.ml.api.util.FDsl._ -import com.twitter.ml.api.{DataRecord, FeatureContext, IRecordOneToOneAdapter} -import com.twitter.ml.featurestore.catalog.features.recommendations.ProducerSimClustersEmbedding -import com.twitter.ml.featurestore.lib.UserId -import com.twitter.ml.featurestore.lib.data.{PredictionRecord, PredictionRecordAdapter} -import com.twitter.ml.featurestore.lib.entity.Entity -import com.twitter.ml.featurestore.lib.feature.BoundFeatureSet - -object TweetSimilarityFeatures { - val QueryTweetId = new Discrete("query_tweet.id") - val CandidateTweetId = new Discrete("candidate_tweet.id") - val QueryTweetEmbedding = new SparseContinuous("query_tweet.simclusters_embedding") - val CandidateTweetEmbedding = new SparseContinuous("candidate_tweet.simclusters_embedding") - val QueryTweetEmbeddingNorm = new Continuous("query_tweet.embedding_norm") - val CandidateTweetEmbeddingNorm = new Continuous("candidate_tweet.embedding_norm") - val QueryTweetTimestamp = new Discrete("query_tweet.timestamp") - val CandidateTweetTimestamp = new Discrete("candidate_tweet.timestamp") - val TweetPairCount = new Discrete("popularity_count.tweet_pair") - val QueryTweetCount = new Discrete("popularity_count.query_tweet") - val CosineSimilarity = new Continuous("meta.cosine_similarity") - val Label = new Binary("co-engagement.label") - - val FeatureContext: FeatureContext = new FeatureContext( - QueryTweetId, - CandidateTweetId, - QueryTweetEmbedding, - CandidateTweetEmbedding, - QueryTweetEmbeddingNorm, - CandidateTweetEmbeddingNorm, - QueryTweetTimestamp, - CandidateTweetTimestamp, - TweetPairCount, - QueryTweetCount, - CosineSimilarity, - Label - ) - - def isCoengaged(dataRecord: DataRecord): Boolean = { - dataRecord.getFeatureValue(Label) - } -} - -class TweetSimilarityFeaturesStoreConfig(identifier: String) { - val bindingIdentifier: Entity[UserId] = Entity[UserId](identifier) - - val featureStoreBoundFeatureSet: BoundFeatureSet = BoundFeatureSet( - ProducerSimClustersEmbedding.FavBasedEmbedding20m145kUpdated.bind(bindingIdentifier)) - - val predictionRecordAdapter: IRecordOneToOneAdapter[PredictionRecord] = - PredictionRecordAdapter.oneToOne(featureStoreBoundFeatureSet) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/BCELabelTransformFromUUADataRecord.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/BCELabelTransformFromUUADataRecord.scala deleted file mode 100644 index 6adf6eaf8..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/BCELabelTransformFromUUADataRecord.scala +++ /dev/null @@ -1,68 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import 
com.twitter.ml.api.Feature -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.api.ITransform -import com.twitter.ml.api.constant.SharedFeatures -import java.lang.{Double => JDouble} - -import com.twitter.timelines.prediction.common.adapters.AdapterConsumer -import com.twitter.timelines.prediction.common.adapters.EngagementLabelFeaturesDataRecordUtils -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.RichDataRecord -import com.twitter.timelines.suggests.common.engagement.thriftscala.EngagementType -import com.twitter.timelines.suggests.common.engagement.thriftscala.Engagement -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import com.twitter.timelines.prediction.features.common.CombinedFeatures - -/** - * To transfrom BCE events UUA data records that contain only continuous dwell time to datarecords that contain corresponding binary label features - * The UUA datarecords inputted would have USER_ID, SOURCE_TWEET_ID,TIMESTAMP and - * 0 or one of (TWEET_DETAIL_DWELL_TIME_MS, PROFILE_DWELL_TIME_MS, FULLSCREEN_VIDEO_DWELL_TIME_MS) features. - * We will use the different engagement TIME_MS to differentiate different engagements, - * and then re-use the function in EngagementTypeConverte to add the binary label to the datarecord. - **/ - -object BCELabelTransformFromUUADataRecord extends ITransform { - - val dwellTimeFeatureToEngagementMap = Map( - TimelinesSharedFeatures.TWEET_DETAIL_DWELL_TIME_MS -> EngagementType.TweetDetailDwell, - TimelinesSharedFeatures.PROFILE_DWELL_TIME_MS -> EngagementType.ProfileDwell, - TimelinesSharedFeatures.FULLSCREEN_VIDEO_DWELL_TIME_MS -> EngagementType.FullscreenVideoDwell - ) - - def dwellFeatureToEngagement( - rdr: RichDataRecord, - dwellTimeFeature: Feature[JDouble], - engagementType: EngagementType - ): Option[Engagement] = { - if (rdr.hasFeature(dwellTimeFeature)) { - Some( - Engagement( - engagementType = engagementType, - timestampMs = rdr.getFeatureValue(SharedFeatures.TIMESTAMP), - weight = Some(rdr.getFeatureValue(dwellTimeFeature)) - )) - } else { - None - } - } - override def transformContext(featureContext: FeatureContext): FeatureContext = { - featureContext.addFeatures( - (CombinedFeatures.TweetDetailDwellEngagements ++ CombinedFeatures.ProfileDwellEngagements ++ CombinedFeatures.FullscreenVideoDwellEngagements).toSeq: _*) - } - override def transform(record: DataRecord): Unit = { - val rdr = new RichDataRecord(record) - val engagements = dwellTimeFeatureToEngagementMap - .map { - case (dwellTimeFeature, engagementType) => - dwellFeatureToEngagement(rdr, dwellTimeFeature, engagementType) - }.flatten.toSeq - - // Re-use BCE( behavior client events) label conversion in EngagementTypeConverter to align with BCE labels generation for offline training data - EngagementLabelFeaturesDataRecordUtils.setDwellTimeFeatures( - rdr, - Some(engagements), - AdapterConsumer.Combined) - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/BUILD b/src/scala/com/twitter/timelines/prediction/common/aggregates/BUILD deleted file mode 100644 index 01c930e8e..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/BUILD +++ /dev/null @@ -1,353 +0,0 @@ -create_datasets( - base_name = "original_author_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/original_author_aggregates/1556496000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = 
"java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.OriginalAuthor", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "twitter_wide_user_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/twitter_wide_user_aggregates/1556496000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.TwitterWideUser", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "twitter_wide_user_author_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/twitter_wide_user_author_aggregates/1556323200000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.TwitterWideUserAuthor", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_aggregates/1556150400000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.User", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_author_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_author_aggregates/1556064000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserAuthor", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "aggregates_canary", - fallback_path = 
"gs://user.timelines.dp.gcp.twttr.net//canaries/processed/aggregates_v2/user_aggregates/1622851200000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.User", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_engager_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_engager_aggregates/1556496000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserEngager", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_original_author_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_original_author_aggregates/1556496000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserOriginalAuthor", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "author_topic_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/author_topic_aggregates/1589932800000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.AuthorTopic", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_topic_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_topic_aggregates/1590278400000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserTopic", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - 
"timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_inferred_topic_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_inferred_topic_aggregates/1599696000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserInferredTopic", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_mention_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_mention_aggregates/1556582400000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserMention", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_request_dow_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_request_dow_aggregates/1556236800000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserRequestDow", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -create_datasets( - base_name = "user_request_hour_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_request_hour_aggregates/1556150400000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserRequestHour", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - - -create_datasets( - base_name = "user_list_aggregates", - fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_list_aggregates/1590624000000", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserList", - segment_type = "snapshot", - 
tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - - -create_datasets( - base_name = "user_media_understanding_annotation_aggregates", - key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey", - platform = "java8", - role = "timelines", - scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserMediaUnderstandingAnnotation", - segment_type = "snapshot", - tags = ["bazel-compatible"], - val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)", - scala_dependencies = [ - ":injections", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) - -scala_library( - sources = [ - "BCELabelTransformFromUUADataRecord.scala", - "FeatureSelectorConfig.scala", - "RecapUserFeatureAggregation.scala", - "RectweetUserFeatureAggregation.scala", - "TimelinesAggregationConfig.scala", - "TimelinesAggregationConfigDetails.scala", - "TimelinesAggregationConfigTrait.scala", - "TimelinesAggregationSources.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":aggregates_canary-scala", - ":author_topic_aggregates-scala", - ":original_author_aggregates-scala", - ":twitter_wide_user_aggregates-scala", - ":twitter_wide_user_author_aggregates-scala", - ":user_aggregates-scala", - ":user_author_aggregates-scala", - ":user_engager_aggregates-scala", - ":user_inferred_topic_aggregates-scala", - ":user_list_aggregates-scala", - ":user_media_understanding_annotation_aggregates-scala", - ":user_mention_aggregates-scala", - ":user_original_author_aggregates-scala", - ":user_request_dow_aggregates-scala", - ":user_request_hour_aggregates-scala", - ":user_topic_aggregates-scala", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/constant", - "src/java/com/twitter/ml/api/matcher", - "src/scala/com/twitter/common/text/util", - "src/scala/com/twitter/dal/client/dataset", - "src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter", - "src/scala/com/twitter/timelines/prediction/features/client_log_event", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/engagement_features", - "src/scala/com/twitter/timelines/prediction/features/escherbird", - "src/scala/com/twitter/timelines/prediction/features/itl", - "src/scala/com/twitter/timelines/prediction/features/list_features", - "src/scala/com/twitter/timelines/prediction/features/p_home_latest", - "src/scala/com/twitter/timelines/prediction/features/real_graph", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/timelines/prediction/features/request_context", - "src/scala/com/twitter/timelines/prediction/features/simcluster", - "src/scala/com/twitter/timelines/prediction/features/time_features", - "src/scala/com/twitter/timelines/prediction/transform/filter", - "src/thrift/com/twitter/timelines/suggests/common:engagement-scala", - "timelines/data_processing/ad_hoc/recap/data_record_preparation:recap_data_records_agg_minimal-java", - "util/util-core:scala", - ], -) - -scala_library( - name = "injections", - sources = [ - 
"FeatureSelectorConfig.scala", - "RecapUserFeatureAggregation.scala", - "RectweetUserFeatureAggregation.scala", - "TimelinesAggregationConfigDetails.scala", - "TimelinesAggregationConfigTrait.scala", - "TimelinesAggregationKeyValInjections.scala", - "TimelinesAggregationSources.scala", - ], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/constant", - "src/java/com/twitter/ml/api/matcher", - "src/scala/com/twitter/common/text/util", - "src/scala/com/twitter/dal/client/dataset", - "src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core", - "src/scala/com/twitter/scalding_internal/multiformat/format", - "src/scala/com/twitter/timelines/prediction/features/client_log_event", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/engagement_features", - "src/scala/com/twitter/timelines/prediction/features/escherbird", - "src/scala/com/twitter/timelines/prediction/features/itl", - "src/scala/com/twitter/timelines/prediction/features/list_features", - "src/scala/com/twitter/timelines/prediction/features/p_home_latest", - "src/scala/com/twitter/timelines/prediction/features/real_graph", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/timelines/prediction/features/request_context", - "src/scala/com/twitter/timelines/prediction/features/semantic_core_features", - "src/scala/com/twitter/timelines/prediction/features/simcluster", - "src/scala/com/twitter/timelines/prediction/features/time_features", - "src/scala/com/twitter/timelines/prediction/transform/filter", - "timelines/data_processing/ad_hoc/recap/data_record_preparation:recap_data_records_agg_minimal-java", - "util/util-core:scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/FeatureSelectorConfig.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/FeatureSelectorConfig.scala deleted file mode 100644 index 1c91ef16c..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/FeatureSelectorConfig.scala +++ /dev/null @@ -1,121 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.ml.api.matcher.FeatureMatcher -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup -import scala.collection.JavaConverters._ - -object FeatureSelectorConfig { - val BasePairsToStore = Seq( - ("twitter_wide_user_aggregate.pair", "*"), - ("twitter_wide_user_author_aggregate.pair", "*"), - ("user_aggregate_v5.continuous.pair", "*"), - ("user_aggregate_v7.pair", "*"), - ("user_author_aggregate_v2.pair", "recap.earlybird.*"), - ("user_author_aggregate_v2.pair", "recap.searchfeature.*"), - ("user_author_aggregate_v2.pair", "recap.tweetfeature.embeds*"), - ("user_author_aggregate_v2.pair", "recap.tweetfeature.link_count*"), - ("user_author_aggregate_v2.pair", "engagement_features.in_network.*"), - ("user_author_aggregate_v2.pair", "recap.tweetfeature.is_reply.*"), - ("user_author_aggregate_v2.pair", "recap.tweetfeature.is_retweet.*"), - ("user_author_aggregate_v2.pair", "recap.tweetfeature.num_mentions.*"), - ("user_author_aggregate_v5.pair", "*"), - ("user_author_aggregate_tweetsource_v1.pair", "*"), - ("user_engager_aggregate.pair", "*"), - ("user_mention_aggregate.pair", "*"), - ("user_request_context_aggregate.dow.pair", "*"), - ("user_request_context_aggregate.hour.pair", "*"), - ("user_aggregate_v6.pair", 
"*"), - ("user_original_author_aggregate_v1.pair", "*"), - ("user_original_author_aggregate_v2.pair", "*"), - ("original_author_aggregate_v1.pair", "*"), - ("original_author_aggregate_v2.pair", "*"), - ("author_topic_aggregate.pair", "*"), - ("user_list_aggregate.pair", "*"), - ("user_topic_aggregate.pair", "*"), - ("user_topic_aggregate_v2.pair", "*"), - ("user_inferred_topic_aggregate.pair", "*"), - ("user_inferred_topic_aggregate_v2.pair", "*"), - ("user_media_annotation_aggregate.pair", "*"), - ("user_media_annotation_aggregate.pair", "*"), - ("user_author_good_click_aggregate.pair", "*"), - ("user_engager_good_click_aggregate.pair", "*") - ) - val PairsToStore = BasePairsToStore ++ Seq( - ("user_aggregate_v2.pair", "*"), - ("user_aggregate_v5.boolean.pair", "*"), - ("user_aggregate_tweetsource_v1.pair", "*"), - ) - - - val LabelsToStore = Seq( - "any_label", - "recap.engagement.is_favorited", - "recap.engagement.is_retweeted", - "recap.engagement.is_replied", - "recap.engagement.is_open_linked", - "recap.engagement.is_profile_clicked", - "recap.engagement.is_clicked", - "recap.engagement.is_photo_expanded", - "recap.engagement.is_video_playback_50", - "recap.engagement.is_video_quality_viewed", - "recap.engagement.is_replied_reply_impressed_by_author", - "recap.engagement.is_replied_reply_favorited_by_author", - "recap.engagement.is_replied_reply_replied_by_author", - "recap.engagement.is_report_tweet_clicked", - "recap.engagement.is_block_clicked", - "recap.engagement.is_mute_clicked", - "recap.engagement.is_dont_like", - "recap.engagement.is_good_clicked_convo_desc_favorited_or_replied", - "recap.engagement.is_good_clicked_convo_desc_v2", - "itl.engagement.is_favorited", - "itl.engagement.is_retweeted", - "itl.engagement.is_replied", - "itl.engagement.is_open_linked", - "itl.engagement.is_profile_clicked", - "itl.engagement.is_clicked", - "itl.engagement.is_photo_expanded", - "itl.engagement.is_video_playback_50" - ) - - val PairGlobsToStore = for { - (prefix, suffix) <- PairsToStore - label <- LabelsToStore - } yield FeatureMatcher.glob(prefix + "." + label + "." + suffix) - - val BaseAggregateV2FeatureSelector = FeatureMatcher - .none() - .or( - FeatureMatcher.glob("meta.user_id"), - FeatureMatcher.glob("meta.author_id"), - FeatureMatcher.glob("entities.original_author_id"), - FeatureMatcher.glob("entities.topic_id"), - FeatureMatcher - .glob("entities.inferred_topic_ids" + TypedAggregateGroup.SparseFeatureSuffix), - FeatureMatcher.glob("timelines.meta.list_id"), - FeatureMatcher.glob("list.id"), - FeatureMatcher - .glob("engagement_features.user_ids.public" + TypedAggregateGroup.SparseFeatureSuffix), - FeatureMatcher - .glob("entities.users.mentioned_screen_names" + TypedAggregateGroup.SparseFeatureSuffix), - FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_dont_like.*"), - FeatureMatcher.glob("user_author_aggregate_v2.pair.any_label.recap.tweetfeature.has_*"), - FeatureMatcher.glob("request_context.country_code"), - FeatureMatcher.glob("request_context.timestamp_gmt_dow"), - FeatureMatcher.glob("request_context.timestamp_gmt_hour"), - FeatureMatcher.glob( - "semantic_core.media_understanding.high_recall.non_sensitive.entity_ids" + TypedAggregateGroup.SparseFeatureSuffix) - ) - - val AggregatesV2ProdFeatureSelector = BaseAggregateV2FeatureSelector - .orList(PairGlobsToStore.asJava) - - val ReducedPairGlobsToStore = (for { - (prefix, suffix) <- BasePairsToStore - label <- LabelsToStore - } yield FeatureMatcher.glob(prefix + "." + label + "." 
+ suffix)) ++ Seq( - FeatureMatcher.glob("user_aggregate_v2.pair.any_label.*"), - FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_favorited.*"), - FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_photo_expanded.*"), - FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_profile_clicked.*") - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/README.md b/src/scala/com/twitter/timelines/prediction/common/aggregates/README.md deleted file mode 100644 index 0bae21a14..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/README.md +++ /dev/null @@ -1,6 +0,0 @@ -## Timelines Aggregation Jobs - -This directory contains the specific definition of aggregate jobs that generate features used by the Heavy Ranker. -The primary files of interest are [`TimelinesAggregationConfigDetails.scala`](TimelinesAggregationConfigDetails.scala), which contains the defintion for the batch aggregate jobs and [`real_time/TimelinesOnlineAggregationConfigBase.scala`](real_time/TimelinesOnlineAggregationConfigBase.scala) which contains the definitions for the real time aggregate jobs. - -The aggregation framework that these jobs are based on is [here](../../../../../../../../timelines/data_processing/ml_util/aggregation_framework). \ No newline at end of file diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/RecapUserFeatureAggregation.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/RecapUserFeatureAggregation.scala deleted file mode 100644 index 657d5a713..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/RecapUserFeatureAggregation.scala +++ /dev/null @@ -1,415 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.ml.api.Feature -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures -import com.twitter.timelines.prediction.features.real_graph.RealGraphDataRecordFeatures -import com.twitter.timelines.prediction.features.recap.RecapFeatures -import com.twitter.timelines.prediction.features.time_features.TimeDataRecordFeatures - -object RecapUserFeatureAggregation { - val RecapFeaturesForAggregation: Set[Feature[_]] = - Set( - RecapFeatures.HAS_IMAGE, - RecapFeatures.HAS_VIDEO, - RecapFeatures.FROM_MUTUAL_FOLLOW, - RecapFeatures.HAS_CARD, - RecapFeatures.HAS_NEWS, - RecapFeatures.REPLY_COUNT, - RecapFeatures.FAV_COUNT, - RecapFeatures.RETWEET_COUNT, - RecapFeatures.BLENDER_SCORE, - RecapFeatures.CONVERSATIONAL_COUNT, - RecapFeatures.IS_BUSINESS_SCORE, - RecapFeatures.CONTAINS_MEDIA, - RecapFeatures.RETWEET_SEARCHER, - RecapFeatures.REPLY_SEARCHER, - RecapFeatures.MENTION_SEARCHER, - RecapFeatures.REPLY_OTHER, - RecapFeatures.RETWEET_OTHER, - RecapFeatures.MATCH_UI_LANG, - RecapFeatures.MATCH_SEARCHER_MAIN_LANG, - RecapFeatures.MATCH_SEARCHER_LANGS, - RecapFeatures.TWEET_COUNT_FROM_USER_IN_SNAPSHOT, - RecapFeatures.TEXT_SCORE, - RealGraphDataRecordFeatures.NUM_RETWEETS_EWMA, - RealGraphDataRecordFeatures.NUM_RETWEETS_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_RETWEETS_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_RETWEETS_DAYS_SINCE_LAST, - RealGraphDataRecordFeatures.NUM_FAVORITES_EWMA, - RealGraphDataRecordFeatures.NUM_FAVORITES_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_FAVORITES_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_FAVORITES_DAYS_SINCE_LAST, - 
RealGraphDataRecordFeatures.NUM_MENTIONS_EWMA, - RealGraphDataRecordFeatures.NUM_MENTIONS_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_MENTIONS_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_MENTIONS_DAYS_SINCE_LAST, - RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_EWMA, - RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_DAYS_SINCE_LAST, - RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_EWMA, - RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_DAYS_SINCE_LAST, - RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_EWMA, - RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_ELAPSED_DAYS, - RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_DAYS_SINCE_LAST, - RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_EWMA, - RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_NON_ZERO_DAYS, - RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_ELAPSED_DAYS, - RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_DAYS_SINCE_LAST - ) - - val RecapLabelsForAggregation: Set[Feature.Binary] = - Set( - RecapFeatures.IS_FAVORITED, - RecapFeatures.IS_RETWEETED, - RecapFeatures.IS_CLICKED, - RecapFeatures.IS_PROFILE_CLICKED, - RecapFeatures.IS_OPEN_LINKED - ) - - val DwellDuration: Set[Feature[_]] = - Set( - TimelinesSharedFeatures.DWELL_TIME_MS, - ) - - val UserFeaturesV2: Set[Feature[_]] = RecapFeaturesForAggregation ++ Set( - RecapFeatures.HAS_VINE, - RecapFeatures.HAS_PERISCOPE, - RecapFeatures.HAS_PRO_VIDEO, - RecapFeatures.HAS_VISIBLE_LINK, - RecapFeatures.BIDIRECTIONAL_FAV_COUNT, - RecapFeatures.UNIDIRECTIONAL_FAV_COUNT, - RecapFeatures.BIDIRECTIONAL_REPLY_COUNT, - RecapFeatures.UNIDIRECTIONAL_REPLY_COUNT, - RecapFeatures.BIDIRECTIONAL_RETWEET_COUNT, - RecapFeatures.UNIDIRECTIONAL_RETWEET_COUNT, - RecapFeatures.EMBEDS_URL_COUNT, - RecapFeatures.EMBEDS_IMPRESSION_COUNT, - RecapFeatures.VIDEO_VIEW_COUNT, - RecapFeatures.IS_RETWEET, - RecapFeatures.IS_REPLY, - RecapFeatures.IS_EXTENDED_REPLY, - RecapFeatures.HAS_LINK, - RecapFeatures.HAS_TREND, - RecapFeatures.LINK_LANGUAGE, - RecapFeatures.NUM_HASHTAGS, - RecapFeatures.NUM_MENTIONS, - RecapFeatures.IS_SENSITIVE, - RecapFeatures.HAS_MULTIPLE_MEDIA, - RecapFeatures.USER_REP, - RecapFeatures.FAV_COUNT_V2, - RecapFeatures.RETWEET_COUNT_V2, - RecapFeatures.REPLY_COUNT_V2, - RecapFeatures.LINK_COUNT, - EngagementDataRecordFeatures.InNetworkFavoritesCount, - EngagementDataRecordFeatures.InNetworkRetweetsCount, - EngagementDataRecordFeatures.InNetworkRepliesCount - ) - - val UserAuthorFeaturesV2: Set[Feature[_]] = Set( - RecapFeatures.HAS_IMAGE, - RecapFeatures.HAS_VINE, - RecapFeatures.HAS_PERISCOPE, - RecapFeatures.HAS_PRO_VIDEO, - RecapFeatures.HAS_VIDEO, - RecapFeatures.HAS_CARD, - RecapFeatures.HAS_NEWS, - RecapFeatures.HAS_VISIBLE_LINK, - RecapFeatures.REPLY_COUNT, - RecapFeatures.FAV_COUNT, - RecapFeatures.RETWEET_COUNT, - RecapFeatures.BLENDER_SCORE, - RecapFeatures.CONVERSATIONAL_COUNT, - RecapFeatures.IS_BUSINESS_SCORE, - RecapFeatures.CONTAINS_MEDIA, - RecapFeatures.RETWEET_SEARCHER, - RecapFeatures.REPLY_SEARCHER, - RecapFeatures.MENTION_SEARCHER, - RecapFeatures.REPLY_OTHER, - RecapFeatures.RETWEET_OTHER, - RecapFeatures.MATCH_UI_LANG, - RecapFeatures.MATCH_SEARCHER_MAIN_LANG, - RecapFeatures.MATCH_SEARCHER_LANGS, - RecapFeatures.TWEET_COUNT_FROM_USER_IN_SNAPSHOT, - 
RecapFeatures.TEXT_SCORE, - RecapFeatures.BIDIRECTIONAL_FAV_COUNT, - RecapFeatures.UNIDIRECTIONAL_FAV_COUNT, - RecapFeatures.BIDIRECTIONAL_REPLY_COUNT, - RecapFeatures.UNIDIRECTIONAL_REPLY_COUNT, - RecapFeatures.BIDIRECTIONAL_RETWEET_COUNT, - RecapFeatures.UNIDIRECTIONAL_RETWEET_COUNT, - RecapFeatures.EMBEDS_URL_COUNT, - RecapFeatures.EMBEDS_IMPRESSION_COUNT, - RecapFeatures.VIDEO_VIEW_COUNT, - RecapFeatures.IS_RETWEET, - RecapFeatures.IS_REPLY, - RecapFeatures.HAS_LINK, - RecapFeatures.HAS_TREND, - RecapFeatures.LINK_LANGUAGE, - RecapFeatures.NUM_HASHTAGS, - RecapFeatures.NUM_MENTIONS, - RecapFeatures.IS_SENSITIVE, - RecapFeatures.HAS_MULTIPLE_MEDIA, - RecapFeatures.FAV_COUNT_V2, - RecapFeatures.RETWEET_COUNT_V2, - RecapFeatures.REPLY_COUNT_V2, - RecapFeatures.LINK_COUNT, - EngagementDataRecordFeatures.InNetworkFavoritesCount, - EngagementDataRecordFeatures.InNetworkRetweetsCount, - EngagementDataRecordFeatures.InNetworkRepliesCount - ) - - val UserAuthorFeaturesV2Count: Set[Feature[_]] = Set( - RecapFeatures.HAS_IMAGE, - RecapFeatures.HAS_VINE, - RecapFeatures.HAS_PERISCOPE, - RecapFeatures.HAS_PRO_VIDEO, - RecapFeatures.HAS_VIDEO, - RecapFeatures.HAS_CARD, - RecapFeatures.HAS_NEWS, - RecapFeatures.HAS_VISIBLE_LINK, - RecapFeatures.FAV_COUNT, - RecapFeatures.CONTAINS_MEDIA, - RecapFeatures.RETWEET_SEARCHER, - RecapFeatures.REPLY_SEARCHER, - RecapFeatures.MENTION_SEARCHER, - RecapFeatures.REPLY_OTHER, - RecapFeatures.RETWEET_OTHER, - RecapFeatures.MATCH_UI_LANG, - RecapFeatures.MATCH_SEARCHER_MAIN_LANG, - RecapFeatures.MATCH_SEARCHER_LANGS, - RecapFeatures.IS_RETWEET, - RecapFeatures.IS_REPLY, - RecapFeatures.HAS_LINK, - RecapFeatures.HAS_TREND, - RecapFeatures.IS_SENSITIVE, - RecapFeatures.HAS_MULTIPLE_MEDIA, - EngagementDataRecordFeatures.InNetworkFavoritesCount - ) - - val UserTopicFeaturesV2Count: Set[Feature[_]] = Set( - RecapFeatures.HAS_IMAGE, - RecapFeatures.HAS_VIDEO, - RecapFeatures.HAS_CARD, - RecapFeatures.HAS_NEWS, - RecapFeatures.FAV_COUNT, - RecapFeatures.CONTAINS_MEDIA, - RecapFeatures.RETWEET_SEARCHER, - RecapFeatures.REPLY_SEARCHER, - RecapFeatures.MENTION_SEARCHER, - RecapFeatures.REPLY_OTHER, - RecapFeatures.RETWEET_OTHER, - RecapFeatures.MATCH_UI_LANG, - RecapFeatures.MATCH_SEARCHER_MAIN_LANG, - RecapFeatures.MATCH_SEARCHER_LANGS, - RecapFeatures.IS_RETWEET, - RecapFeatures.IS_REPLY, - RecapFeatures.HAS_LINK, - RecapFeatures.HAS_TREND, - RecapFeatures.IS_SENSITIVE, - EngagementDataRecordFeatures.InNetworkFavoritesCount, - EngagementDataRecordFeatures.InNetworkRetweetsCount, - TimelinesSharedFeatures.NUM_CAPS, - TimelinesSharedFeatures.ASPECT_RATIO_DEN, - TimelinesSharedFeatures.NUM_NEWLINES, - TimelinesSharedFeatures.IS_360, - TimelinesSharedFeatures.IS_MANAGED, - TimelinesSharedFeatures.IS_MONETIZABLE, - TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE, - TimelinesSharedFeatures.HAS_TITLE, - TimelinesSharedFeatures.HAS_DESCRIPTION, - TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION, - TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION - ) - - val UserFeaturesV5Continuous: Set[Feature[_]] = Set( - TimelinesSharedFeatures.QUOTE_COUNT, - TimelinesSharedFeatures.VISIBLE_TOKEN_RATIO, - TimelinesSharedFeatures.WEIGHTED_FAV_COUNT, - TimelinesSharedFeatures.WEIGHTED_RETWEET_COUNT, - TimelinesSharedFeatures.WEIGHTED_REPLY_COUNT, - TimelinesSharedFeatures.WEIGHTED_QUOTE_COUNT, - TimelinesSharedFeatures.EMBEDS_IMPRESSION_COUNT_V2, - TimelinesSharedFeatures.EMBEDS_URL_COUNT_V2, - TimelinesSharedFeatures.DECAYED_FAVORITE_COUNT, - 
TimelinesSharedFeatures.DECAYED_RETWEET_COUNT, - TimelinesSharedFeatures.DECAYED_REPLY_COUNT, - TimelinesSharedFeatures.DECAYED_QUOTE_COUNT, - TimelinesSharedFeatures.FAKE_FAVORITE_COUNT, - TimelinesSharedFeatures.FAKE_RETWEET_COUNT, - TimelinesSharedFeatures.FAKE_REPLY_COUNT, - TimelinesSharedFeatures.FAKE_QUOTE_COUNT, - TimeDataRecordFeatures.LAST_FAVORITE_SINCE_CREATION_HRS, - TimeDataRecordFeatures.LAST_RETWEET_SINCE_CREATION_HRS, - TimeDataRecordFeatures.LAST_REPLY_SINCE_CREATION_HRS, - TimeDataRecordFeatures.LAST_QUOTE_SINCE_CREATION_HRS, - TimeDataRecordFeatures.TIME_SINCE_LAST_FAVORITE_HRS, - TimeDataRecordFeatures.TIME_SINCE_LAST_RETWEET_HRS, - TimeDataRecordFeatures.TIME_SINCE_LAST_REPLY_HRS, - TimeDataRecordFeatures.TIME_SINCE_LAST_QUOTE_HRS - ) - - val UserFeaturesV5Boolean: Set[Feature[_]] = Set( - TimelinesSharedFeatures.LABEL_ABUSIVE_FLAG, - TimelinesSharedFeatures.LABEL_ABUSIVE_HI_RCL_FLAG, - TimelinesSharedFeatures.LABEL_DUP_CONTENT_FLAG, - TimelinesSharedFeatures.LABEL_NSFW_HI_PRC_FLAG, - TimelinesSharedFeatures.LABEL_NSFW_HI_RCL_FLAG, - TimelinesSharedFeatures.LABEL_SPAM_FLAG, - TimelinesSharedFeatures.LABEL_SPAM_HI_RCL_FLAG, - TimelinesSharedFeatures.PERISCOPE_EXISTS, - TimelinesSharedFeatures.PERISCOPE_IS_LIVE, - TimelinesSharedFeatures.PERISCOPE_HAS_BEEN_FEATURED, - TimelinesSharedFeatures.PERISCOPE_IS_CURRENTLY_FEATURED, - TimelinesSharedFeatures.PERISCOPE_IS_FROM_QUALITY_SOURCE, - TimelinesSharedFeatures.HAS_QUOTE - ) - - val UserAuthorFeaturesV5: Set[Feature[_]] = Set( - TimelinesSharedFeatures.HAS_QUOTE, - TimelinesSharedFeatures.LABEL_ABUSIVE_FLAG, - TimelinesSharedFeatures.LABEL_ABUSIVE_HI_RCL_FLAG, - TimelinesSharedFeatures.LABEL_DUP_CONTENT_FLAG, - TimelinesSharedFeatures.LABEL_NSFW_HI_PRC_FLAG, - TimelinesSharedFeatures.LABEL_NSFW_HI_RCL_FLAG, - TimelinesSharedFeatures.LABEL_SPAM_FLAG, - TimelinesSharedFeatures.LABEL_SPAM_HI_RCL_FLAG - ) - - val UserTweetSourceFeaturesV1Continuous: Set[Feature[_]] = Set( - TimelinesSharedFeatures.NUM_CAPS, - TimelinesSharedFeatures.NUM_WHITESPACES, - TimelinesSharedFeatures.TWEET_LENGTH, - TimelinesSharedFeatures.ASPECT_RATIO_DEN, - TimelinesSharedFeatures.ASPECT_RATIO_NUM, - TimelinesSharedFeatures.BIT_RATE, - TimelinesSharedFeatures.HEIGHT_1, - TimelinesSharedFeatures.HEIGHT_2, - TimelinesSharedFeatures.HEIGHT_3, - TimelinesSharedFeatures.HEIGHT_4, - TimelinesSharedFeatures.VIDEO_DURATION, - TimelinesSharedFeatures.WIDTH_1, - TimelinesSharedFeatures.WIDTH_2, - TimelinesSharedFeatures.WIDTH_3, - TimelinesSharedFeatures.WIDTH_4, - TimelinesSharedFeatures.NUM_MEDIA_TAGS - ) - - val UserTweetSourceFeaturesV1Boolean: Set[Feature[_]] = Set( - TimelinesSharedFeatures.HAS_QUESTION, - TimelinesSharedFeatures.RESIZE_METHOD_1, - TimelinesSharedFeatures.RESIZE_METHOD_2, - TimelinesSharedFeatures.RESIZE_METHOD_3, - TimelinesSharedFeatures.RESIZE_METHOD_4 - ) - - val UserTweetSourceFeaturesV2Continuous: Set[Feature[_]] = Set( - TimelinesSharedFeatures.NUM_EMOJIS, - TimelinesSharedFeatures.NUM_EMOTICONS, - TimelinesSharedFeatures.NUM_NEWLINES, - TimelinesSharedFeatures.NUM_STICKERS, - TimelinesSharedFeatures.NUM_FACES, - TimelinesSharedFeatures.NUM_COLOR_PALLETTE_ITEMS, - TimelinesSharedFeatures.VIEW_COUNT, - TimelinesSharedFeatures.TWEET_LENGTH_TYPE - ) - - val UserTweetSourceFeaturesV2Boolean: Set[Feature[_]] = Set( - TimelinesSharedFeatures.IS_360, - TimelinesSharedFeatures.IS_MANAGED, - TimelinesSharedFeatures.IS_MONETIZABLE, - TimelinesSharedFeatures.IS_EMBEDDABLE, - TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE, - 
TimelinesSharedFeatures.HAS_TITLE, - TimelinesSharedFeatures.HAS_DESCRIPTION, - TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION, - TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION - ) - - val UserAuthorTweetSourceFeaturesV1: Set[Feature[_]] = Set( - TimelinesSharedFeatures.HAS_QUESTION, - TimelinesSharedFeatures.TWEET_LENGTH, - TimelinesSharedFeatures.VIDEO_DURATION, - TimelinesSharedFeatures.NUM_MEDIA_TAGS - ) - - val UserAuthorTweetSourceFeaturesV2: Set[Feature[_]] = Set( - TimelinesSharedFeatures.NUM_CAPS, - TimelinesSharedFeatures.NUM_WHITESPACES, - TimelinesSharedFeatures.ASPECT_RATIO_DEN, - TimelinesSharedFeatures.ASPECT_RATIO_NUM, - TimelinesSharedFeatures.BIT_RATE, - TimelinesSharedFeatures.TWEET_LENGTH_TYPE, - TimelinesSharedFeatures.NUM_EMOJIS, - TimelinesSharedFeatures.NUM_EMOTICONS, - TimelinesSharedFeatures.NUM_NEWLINES, - TimelinesSharedFeatures.NUM_STICKERS, - TimelinesSharedFeatures.NUM_FACES, - TimelinesSharedFeatures.IS_360, - TimelinesSharedFeatures.IS_MANAGED, - TimelinesSharedFeatures.IS_MONETIZABLE, - TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE, - TimelinesSharedFeatures.HAS_TITLE, - TimelinesSharedFeatures.HAS_DESCRIPTION, - TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION, - TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION - ) - - val UserAuthorTweetSourceFeaturesV2Count: Set[Feature[_]] = Set( - TimelinesSharedFeatures.NUM_CAPS, - TimelinesSharedFeatures.ASPECT_RATIO_DEN, - TimelinesSharedFeatures.NUM_NEWLINES, - TimelinesSharedFeatures.IS_360, - TimelinesSharedFeatures.IS_MANAGED, - TimelinesSharedFeatures.IS_MONETIZABLE, - TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE, - TimelinesSharedFeatures.HAS_TITLE, - TimelinesSharedFeatures.HAS_DESCRIPTION, - TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION, - TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION - ) - - val LabelsV2: Set[Feature.Binary] = RecapLabelsForAggregation ++ Set( - RecapFeatures.IS_REPLIED, - RecapFeatures.IS_PHOTO_EXPANDED, - RecapFeatures.IS_VIDEO_PLAYBACK_50 - ) - - val TwitterWideFeatures: Set[Feature[_]] = Set( - RecapFeatures.IS_REPLY, - TimelinesSharedFeatures.HAS_QUOTE, - RecapFeatures.HAS_MENTION, - RecapFeatures.HAS_HASHTAG, - RecapFeatures.HAS_LINK, - RecapFeatures.HAS_CARD, - RecapFeatures.CONTAINS_MEDIA - ) - - val TwitterWideLabels: Set[Feature.Binary] = Set( - RecapFeatures.IS_FAVORITED, - RecapFeatures.IS_RETWEETED, - RecapFeatures.IS_REPLIED - ) - - val ReciprocalLabels: Set[Feature.Binary] = Set( - RecapFeatures.IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR, - RecapFeatures.IS_REPLIED_REPLY_REPLIED_BY_AUTHOR, - RecapFeatures.IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR - ) - - val NegativeEngagementLabels: Set[Feature.Binary] = Set( - RecapFeatures.IS_REPORT_TWEET_CLICKED, - RecapFeatures.IS_BLOCK_CLICKED, - RecapFeatures.IS_MUTE_CLICKED, - RecapFeatures.IS_DONT_LIKE - ) - - val GoodClickLabels: Set[Feature.Binary] = Set( - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V1, - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V2, - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/RectweetUserFeatureAggregation.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/RectweetUserFeatureAggregation.scala deleted file mode 100644 index 12835ef1f..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/RectweetUserFeatureAggregation.scala +++ /dev/null @@ -1,52 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.ml.api.Feature -import 
com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures -import com.twitter.timelines.prediction.features.itl.ITLFeatures - -object RectweetUserFeatureAggregation { - val RectweetLabelsForAggregation: Set[Feature.Binary] = - Set( - ITLFeatures.IS_FAVORITED, - ITLFeatures.IS_RETWEETED, - ITLFeatures.IS_REPLIED, - ITLFeatures.IS_CLICKED, - ITLFeatures.IS_PROFILE_CLICKED, - ITLFeatures.IS_OPEN_LINKED, - ITLFeatures.IS_PHOTO_EXPANDED, - ITLFeatures.IS_VIDEO_PLAYBACK_50 - ) - - val TweetFeatures: Set[Feature[_]] = Set( - ITLFeatures.HAS_IMAGE, - ITLFeatures.HAS_CARD, - ITLFeatures.HAS_NEWS, - ITLFeatures.REPLY_COUNT, - ITLFeatures.FAV_COUNT, - ITLFeatures.REPLY_COUNT, - ITLFeatures.RETWEET_COUNT, - ITLFeatures.MATCHES_UI_LANG, - ITLFeatures.MATCHES_SEARCHER_MAIN_LANG, - ITLFeatures.MATCHES_SEARCHER_LANGS, - ITLFeatures.TEXT_SCORE, - ITLFeatures.LINK_LANGUAGE, - ITLFeatures.NUM_HASHTAGS, - ITLFeatures.NUM_MENTIONS, - ITLFeatures.IS_SENSITIVE, - ITLFeatures.HAS_VIDEO, - ITLFeatures.HAS_LINK, - ITLFeatures.HAS_VISIBLE_LINK, - EngagementDataRecordFeatures.InNetworkFavoritesCount - // nice to have, but currently not hydrated in the RecommendedTweet payload - //EngagementDataRecordFeatures.InNetworkRetweetsCount, - //EngagementDataRecordFeatures.InNetworkRepliesCount - ) - - val ReciprocalLabels: Set[Feature.Binary] = Set( - ITLFeatures.IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR, - ITLFeatures.IS_REPLIED_REPLY_REPLIED_BY_AUTHOR, - ITLFeatures.IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR, - ITLFeatures.IS_REPLIED_REPLY_RETWEETED_BY_AUTHOR, - ITLFeatures.IS_REPLIED_REPLY_QUOTED_BY_AUTHOR - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfig.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfig.scala deleted file mode 100644 index e6581e32e..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfig.scala +++ /dev/null @@ -1,80 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.dal.client.dataset.KeyValDALDataset -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.FeatureContext -import com.twitter.scalding_internal.multiformat.format.keyval -import com.twitter.summingbird.batch.BatchID -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion.CombineCountsPolicy -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateStore -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.OfflineAggregateDataRecordStore -import scala.collection.JavaConverters._ - -object TimelinesAggregationConfig extends TimelinesAggregationConfigTrait { - override def outputHdfsPath: String = "/user/timelines/processed/aggregates_v2" - - def storeToDatasetMap: Map[String, KeyValDALDataset[ - keyval.KeyVal[AggregationKey, (BatchID, DataRecord)] - ]] = Map( - AuthorTopicAggregateStore -> AuthorTopicAggregatesScalaDataset, - UserTopicAggregateStore -> UserTopicAggregatesScalaDataset, - UserInferredTopicAggregateStore -> UserInferredTopicAggregatesScalaDataset, - UserAggregateStore -> UserAggregatesScalaDataset, - UserAuthorAggregateStore -> UserAuthorAggregatesScalaDataset, - UserOriginalAuthorAggregateStore -> UserOriginalAuthorAggregatesScalaDataset, - OriginalAuthorAggregateStore -> OriginalAuthorAggregatesScalaDataset, - UserEngagerAggregateStore -> 
UserEngagerAggregatesScalaDataset, - UserMentionAggregateStore -> UserMentionAggregatesScalaDataset, - TwitterWideUserAggregateStore -> TwitterWideUserAggregatesScalaDataset, - TwitterWideUserAuthorAggregateStore -> TwitterWideUserAuthorAggregatesScalaDataset, - UserRequestHourAggregateStore -> UserRequestHourAggregatesScalaDataset, - UserRequestDowAggregateStore -> UserRequestDowAggregatesScalaDataset, - UserListAggregateStore -> UserListAggregatesScalaDataset, - UserMediaUnderstandingAnnotationAggregateStore -> UserMediaUnderstandingAnnotationAggregatesScalaDataset, - ) - - override def mkPhysicalStore(store: AggregateStore): AggregateStore = store match { - case s: OfflineAggregateDataRecordStore => - s.toOfflineAggregateDataRecordStoreWithDAL(storeToDatasetMap(s.name)) - case _ => throw new IllegalArgumentException("Unsupported logical dataset type.") - } - - object CombineCountPolicies { - val EngagerCountsPolicy: CombineCountsPolicy = mkCountsPolicy("user_engager_aggregate") - val EngagerGoodClickCountsPolicy: CombineCountsPolicy = mkCountsPolicy( - "user_engager_good_click_aggregate") - val RectweetEngagerCountsPolicy: CombineCountsPolicy = - mkCountsPolicy("rectweet_user_engager_aggregate") - val MentionCountsPolicy: CombineCountsPolicy = mkCountsPolicy("user_mention_aggregate") - val RectweetSimclustersTweetCountsPolicy: CombineCountsPolicy = - mkCountsPolicy("rectweet_user_simcluster_tweet_aggregate") - val UserInferredTopicCountsPolicy: CombineCountsPolicy = - mkCountsPolicy("user_inferred_topic_aggregate") - val UserInferredTopicV2CountsPolicy: CombineCountsPolicy = - mkCountsPolicy("user_inferred_topic_aggregate_v2") - val UserMediaUnderstandingAnnotationCountsPolicy: CombineCountsPolicy = - mkCountsPolicy("user_media_annotation_aggregate") - - private[this] def mkCountsPolicy(prefix: String): CombineCountsPolicy = { - val features = TimelinesAggregationConfig.aggregatesToCompute - .filter(_.aggregatePrefix == prefix) - .flatMap(_.allOutputFeatures) - CombineCountsPolicy( - topK = 2, - aggregateContextToPrecompute = new FeatureContext(features.asJava), - hardLimit = Some(20) - ) - } - } -} - -object TimelinesAggregationCanaryConfig extends TimelinesAggregationConfigTrait { - override def outputHdfsPath: String = "/user/timelines/canaries/processed/aggregates_v2" - - override def mkPhysicalStore(store: AggregateStore): AggregateStore = store match { - case s: OfflineAggregateDataRecordStore => - s.toOfflineAggregateDataRecordStoreWithDAL(dalDataset = AggregatesCanaryScalaDataset) - case _ => throw new IllegalArgumentException("Unsupported logical dataset type.") - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigDetails.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigDetails.scala deleted file mode 100644 index aa439deda..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigDetails.scala +++ /dev/null @@ -1,579 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.conversions.DurationOps._ -import com.twitter.ml.api.constant.SharedFeatures.AUTHOR_ID -import com.twitter.ml.api.constant.SharedFeatures.USER_ID -import com.twitter.timelines.data_processing.ml_util.aggregation_framework._ -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.metrics._ -import com.twitter.timelines.data_processing.ml_util.transforms.DownsampleTransform -import 
com.twitter.timelines.data_processing.ml_util.transforms.RichRemoveAuthorIdZero -import com.twitter.timelines.data_processing.ml_util.transforms.RichRemoveUserIdZero -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures -import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures.RichUnifyPublicEngagersTransform -import com.twitter.timelines.prediction.features.list_features.ListFeatures -import com.twitter.timelines.prediction.features.recap.RecapFeatures -import com.twitter.timelines.prediction.features.request_context.RequestContextFeatures -import com.twitter.timelines.prediction.features.semantic_core_features.SemanticCoreFeatures -import com.twitter.timelines.prediction.transform.filter.FilterInNetworkTransform -import com.twitter.timelines.prediction.transform.filter.FilterImageTweetTransform -import com.twitter.timelines.prediction.transform.filter.FilterVideoTweetTransform -import com.twitter.timelines.prediction.transform.filter.FilterOutImageVideoTweetTransform -import com.twitter.util.Duration - -trait TimelinesAggregationConfigDetails extends Serializable { - - import TimelinesAggregationSources._ - - def outputHdfsPath: String - - /** - * Converts the given logical store to a physical store. The reason we do not specify the - * physical store directly with the [[AggregateGroup]] is because of a cyclic dependency when - * create physical stores that are DalDataset with PersonalDataType annotations derived from - * the [[AggregateGroup]]. - * - */ - def mkPhysicalStore(store: AggregateStore): AggregateStore - - def defaultMaxKvSourceFailures: Int = 100 - - val timelinesOfflineAggregateSink = new OfflineStoreCommonConfig { - override def apply(startDate: String) = OfflineAggregateStoreCommonConfig( - outputHdfsPathPrefix = outputHdfsPath, - dummyAppId = "timelines_aggregates_v2_ro", - dummyDatasetPrefix = "timelines_aggregates_v2_ro", - startDate = startDate - ) - } - - val UserAggregateStore = "user_aggregates" - val UserAuthorAggregateStore = "user_author_aggregates" - val UserOriginalAuthorAggregateStore = "user_original_author_aggregates" - val OriginalAuthorAggregateStore = "original_author_aggregates" - val UserEngagerAggregateStore = "user_engager_aggregates" - val UserMentionAggregateStore = "user_mention_aggregates" - val TwitterWideUserAggregateStore = "twitter_wide_user_aggregates" - val TwitterWideUserAuthorAggregateStore = "twitter_wide_user_author_aggregates" - val UserRequestHourAggregateStore = "user_request_hour_aggregates" - val UserRequestDowAggregateStore = "user_request_dow_aggregates" - val UserListAggregateStore = "user_list_aggregates" - val AuthorTopicAggregateStore = "author_topic_aggregates" - val UserTopicAggregateStore = "user_topic_aggregates" - val UserInferredTopicAggregateStore = "user_inferred_topic_aggregates" - val UserMediaUnderstandingAnnotationAggregateStore = - "user_media_understanding_annotation_aggregates" - val AuthorCountryCodeAggregateStore = "author_country_code_aggregates" - val OriginalAuthorCountryCodeAggregateStore = "original_author_country_code_aggregates" - - /** - * Step 3: Configure all aggregates to compute. - * Note that different subsets of aggregates in this list - * can be launched by different summingbird job instances. - * Any given job can be responsible for a set of AggregateGroup - * configs whose outputStores share the same exact startDate. 
- * AggregateGroups that do not share the same inputSource, - * outputStore or startDate MUST be launched using different - * summingbird jobs and passed in a different --start-time argument - * See science/scalding/mesos/timelines/prod.yaml for an example - * of how to configure your own job. - */ - val negativeDownsampleTransform = - DownsampleTransform( - negativeSamplingRate = 0.03, - keepLabels = RecapUserFeatureAggregation.LabelsV2) - val negativeRecTweetDownsampleTransform = DownsampleTransform( - negativeSamplingRate = 0.03, - keepLabels = RectweetUserFeatureAggregation.RectweetLabelsForAggregation - ) - - val userAggregatesV2: AggregateGroup = - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_aggregate_v2", - preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */ - keys = Set(USER_ID), - features = RecapUserFeatureAggregation.UserFeaturesV2, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric, SumMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userAuthorAggregatesV2: Set[AggregateGroup] = { - - /** - * NOTE: We need to remove records from out-of-network authors from the recap input - * records (which now include out-of-network records as well after merging recap and - * rectweet models) that are used to compute user-author aggregates. This is necessary - * to limit the growth rate of user-author aggregates. - */ - val allFeatureAggregates = Set( - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_author_aggregate_v2", - preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero), - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.UserAuthorFeaturesV2, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(SumMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAuthorAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - ) - - val countAggregates: Set[AggregateGroup] = Set( - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_author_aggregate_v2", - preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero), - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.UserAuthorFeaturesV2Count, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAuthorAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - ) - - allFeatureAggregates ++ countAggregates - } - - val userAggregatesV5Continuous: AggregateGroup = - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_aggregate_v5.continuous", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID), - features = RecapUserFeatureAggregation.UserFeaturesV5Continuous, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric, SumMetric, SumSqMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( 
- OfflineAggregateDataRecordStore( - name = UserAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userAuthorAggregatesV5: AggregateGroup = - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_author_aggregate_v5", - preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero), - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.UserAuthorFeaturesV5, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAuthorAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val tweetSourceUserAuthorAggregatesV1: AggregateGroup = - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_author_aggregate_tweetsource_v1", - preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero), - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.UserAuthorTweetSourceFeaturesV1, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric, SumMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAuthorAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userEngagerAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_engager_aggregate", - keys = Set(USER_ID, EngagementDataRecordFeatures.PublicEngagementUserIds), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserEngagerAggregateStore, - startDate = "2016-09-02 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - preTransforms = Seq( - RichRemoveUserIdZero, - RichUnifyPublicEngagersTransform - ) - ) - - val userMentionAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */ - aggregatePrefix = "user_mention_aggregate", - keys = Set(USER_ID, RecapFeatures.MENTIONED_SCREEN_NAMES), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserMentionAggregateStore, - startDate = "2017-03-01 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - val twitterWideUserAggregates = AggregateGroup( - inputSource = timelinesDailyTwitterWideSource, - preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */ - aggregatePrefix = "twitter_wide_user_aggregate", - keys = Set(USER_ID), - features = RecapUserFeatureAggregation.TwitterWideFeatures, - labels = RecapUserFeatureAggregation.TwitterWideLabels, - metrics = Set(CountMetric, SumMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = TwitterWideUserAggregateStore, - startDate = 
"2016-12-28 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val twitterWideUserAuthorAggregates = AggregateGroup( - inputSource = timelinesDailyTwitterWideSource, - preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */ - aggregatePrefix = "twitter_wide_user_author_aggregate", - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.TwitterWideFeatures, - labels = RecapUserFeatureAggregation.TwitterWideLabels, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = TwitterWideUserAuthorAggregateStore, - startDate = "2016-12-28 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - /** - * User-HourOfDay and User-DayOfWeek aggregations, both for recap and rectweet - */ - val userRequestHourAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_request_context_aggregate.hour", - preTransforms = Seq(RichRemoveUserIdZero, negativeDownsampleTransform), - keys = Set(USER_ID, RequestContextFeatures.TIMESTAMP_GMT_HOUR), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserRequestHourAggregateStore, - startDate = "2017-08-01 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userRequestDowAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_request_context_aggregate.dow", - preTransforms = Seq(RichRemoveUserIdZero, negativeDownsampleTransform), - keys = Set(USER_ID, RequestContextFeatures.TIMESTAMP_GMT_DOW), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserRequestDowAggregateStore, - startDate = "2017-08-01 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val authorTopicAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "author_topic_aggregate", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(AUTHOR_ID, TimelinesSharedFeatures.TOPIC_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = AuthorTopicAggregateStore, - startDate = "2020-05-19 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userTopicAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_topic_aggregate", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID, TimelinesSharedFeatures.TOPIC_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserTopicAggregateStore, - startDate = "2020-05-23 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - 
) - - val userTopicAggregatesV2 = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_topic_aggregate_v2", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID, TimelinesSharedFeatures.TOPIC_ID), - features = RecapUserFeatureAggregation.UserTopicFeaturesV2Count, - labels = RecapUserFeatureAggregation.LabelsV2, - includeAnyFeature = false, - includeAnyLabel = false, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserTopicAggregateStore, - startDate = "2020-05-23 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userInferredTopicAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_inferred_topic_aggregate", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID, TimelinesSharedFeatures.INFERRED_TOPIC_IDS), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserInferredTopicAggregateStore, - startDate = "2020-09-09 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userInferredTopicAggregatesV2 = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_inferred_topic_aggregate_v2", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID, TimelinesSharedFeatures.INFERRED_TOPIC_IDS), - features = RecapUserFeatureAggregation.UserTopicFeaturesV2Count, - labels = RecapUserFeatureAggregation.LabelsV2, - includeAnyFeature = false, - includeAnyLabel = false, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserInferredTopicAggregateStore, - startDate = "2020-09-09 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userReciprocalEngagementAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_aggregate_v6", - preTransforms = Seq(RichRemoveUserIdZero), - keys = Set(USER_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.ReciprocalLabels, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - val userOriginalAuthorReciprocalEngagementAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_original_author_aggregate_v1", - preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero), - keys = Set(USER_ID, TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.ReciprocalLabels, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserOriginalAuthorAggregateStore, - startDate = "2018-12-26 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - val originalAuthorReciprocalEngagementAggregates = 
AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "original_author_aggregate_v1", - preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero), - keys = Set(TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.ReciprocalLabels, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = OriginalAuthorAggregateStore, - startDate = "2023-02-25 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - val originalAuthorNegativeEngagementAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "original_author_aggregate_v2", - preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero), - keys = Set(TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.NegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = OriginalAuthorAggregateStore, - startDate = "2023-02-25 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - includeAnyLabel = false - ) - - val userListAggregates: AggregateGroup = - AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_list_aggregate", - keys = Set(USER_ID, ListFeatures.LIST_ID), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserListAggregateStore, - startDate = "2020-05-28 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - preTransforms = Seq(RichRemoveUserIdZero) - ) - - val userMediaUnderstandingAnnotationAggregates: AggregateGroup = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_media_annotation_aggregate", - preTransforms = Seq(RichRemoveUserIdZero), - keys = - Set(USER_ID, SemanticCoreFeatures.mediaUnderstandingHighRecallNonSensitiveEntityIdsFeature), - features = Set.empty, - labels = RecapUserFeatureAggregation.LabelsV2, - metrics = Set(CountMetric), - halfLives = Set(50.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserMediaUnderstandingAnnotationAggregateStore, - startDate = "2021-03-20 00:00", - commonConfig = timelinesOfflineAggregateSink - )) - ) - - val userAuthorGoodClickAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_author_good_click_aggregate", - preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero), - keys = Set(USER_ID, AUTHOR_ID), - features = RecapUserFeatureAggregation.UserAuthorFeaturesV2, - labels = RecapUserFeatureAggregation.GoodClickLabels, - metrics = Set(SumMetric), - halfLives = Set(14.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserAuthorAggregateStore, - startDate = "2016-07-15 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )) - ) - - val userEngagerGoodClickAggregates = AggregateGroup( - inputSource = timelinesDailyRecapMinimalSource, - aggregatePrefix = "user_engager_good_click_aggregate", - keys = 
Set(USER_ID, EngagementDataRecordFeatures.PublicEngagementUserIds), - features = Set.empty, - labels = RecapUserFeatureAggregation.GoodClickLabels, - metrics = Set(CountMetric), - halfLives = Set(14.days), - outputStore = mkPhysicalStore( - OfflineAggregateDataRecordStore( - name = UserEngagerAggregateStore, - startDate = "2016-09-02 00:00", - commonConfig = timelinesOfflineAggregateSink, - maxKvSourceFailures = defaultMaxKvSourceFailures - )), - preTransforms = Seq( - RichRemoveUserIdZero, - RichUnifyPublicEngagersTransform - ) - ) - -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigTrait.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigTrait.scala deleted file mode 100644 index 6fb2e07b7..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigTrait.scala +++ /dev/null @@ -1,50 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationConfig -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateGroup -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup - -trait TimelinesAggregationConfigTrait - extends TimelinesAggregationConfigDetails - with AggregationConfig { - private val aggregateGroups = Set( - authorTopicAggregates, - userTopicAggregates, - userTopicAggregatesV2, - userInferredTopicAggregates, - userInferredTopicAggregatesV2, - userAggregatesV2, - userAggregatesV5Continuous, - userReciprocalEngagementAggregates, - userAuthorAggregatesV5, - userOriginalAuthorReciprocalEngagementAggregates, - originalAuthorReciprocalEngagementAggregates, - tweetSourceUserAuthorAggregatesV1, - userEngagerAggregates, - userMentionAggregates, - twitterWideUserAggregates, - twitterWideUserAuthorAggregates, - userRequestHourAggregates, - userRequestDowAggregates, - userListAggregates, - userMediaUnderstandingAnnotationAggregates, - ) ++ userAuthorAggregatesV2 - - val aggregatesToComputeList: Set[List[TypedAggregateGroup[_]]] = - aggregateGroups.map(_.buildTypedAggregateGroups()) - - override val aggregatesToCompute: Set[TypedAggregateGroup[_]] = aggregatesToComputeList.flatten - - /* - * Feature selection config to save storage space and manhattan query bandwidth. - * Only the most important features found using offline RCE simulations are used - * when actually training and serving. This selector is used by - * [[com.twitter.timelines.data_processing.jobs.timeline_ranking_user_features.TimelineRankingAggregatesV2FeaturesProdJob]] - * but defined here to keep it in sync with the config that computes the aggregates. 
- */ - val AggregatesV2FeatureSelector = FeatureSelectorConfig.AggregatesV2ProdFeatureSelector - - def filterAggregatesGroups(storeNames: Set[String]): Set[AggregateGroup] = { - aggregateGroups.filter(aggregateGroup => storeNames.contains(aggregateGroup.outputStore.name)) - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationKeyValInjections.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationKeyValInjections.scala deleted file mode 100644 index 1f2433b53..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationKeyValInjections.scala +++ /dev/null @@ -1,48 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.ml.api.DataRecord -import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection -import com.twitter.summingbird.batch.BatchID -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.{ - AggregateStore, - AggregationKey, - OfflineAggregateInjections, - TypedAggregateGroup -} - -object TimelinesAggregationKeyValInjections extends TimelinesAggregationConfigTrait { - - import OfflineAggregateInjections.getInjection - - type KVInjection = KeyValInjection[AggregationKey, (BatchID, DataRecord)] - - val AuthorTopic: KVInjection = getInjection(filter(AuthorTopicAggregateStore)) - val UserTopic: KVInjection = getInjection(filter(UserTopicAggregateStore)) - val UserInferredTopic: KVInjection = getInjection(filter(UserInferredTopicAggregateStore)) - val User: KVInjection = getInjection(filter(UserAggregateStore)) - val UserAuthor: KVInjection = getInjection(filter(UserAuthorAggregateStore)) - val UserOriginalAuthor: KVInjection = getInjection(filter(UserOriginalAuthorAggregateStore)) - val OriginalAuthor: KVInjection = getInjection(filter(OriginalAuthorAggregateStore)) - val UserEngager: KVInjection = getInjection(filter(UserEngagerAggregateStore)) - val UserMention: KVInjection = getInjection(filter(UserMentionAggregateStore)) - val TwitterWideUser: KVInjection = getInjection(filter(TwitterWideUserAggregateStore)) - val TwitterWideUserAuthor: KVInjection = getInjection(filter(TwitterWideUserAuthorAggregateStore)) - val UserRequestHour: KVInjection = getInjection(filter(UserRequestHourAggregateStore)) - val UserRequestDow: KVInjection = getInjection(filter(UserRequestDowAggregateStore)) - val UserList: KVInjection = getInjection(filter(UserListAggregateStore)) - val UserMediaUnderstandingAnnotation: KVInjection = getInjection( - filter(UserMediaUnderstandingAnnotationAggregateStore)) - - private def filter(storeName: String): Set[TypedAggregateGroup[_]] = { - val groups = aggregatesToCompute.filter(_.outputStore.name == storeName) - require(groups.nonEmpty) - groups - } - - override def outputHdfsPath: String = "/user/timelines/processed/aggregates_v2" - - // Since this object is not used to execute any online or offline aggregates job, but is meant - // to store all PDT enabled KeyValInjections, we do not need to construct a physical store. - // We use the identity operation as a default. 
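  // (Contrast with TimelinesAggregationConfig.mkPhysicalStore, which wraps each logical
  // OfflineAggregateDataRecordStore into a DAL-backed physical store via
  // toOfflineAggregateDataRecordStoreWithDAL; here the logical store is returned unchanged.)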
- override def mkPhysicalStore(store: AggregateStore): AggregateStore = store -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationSources.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationSources.scala deleted file mode 100644 index c799f22fa..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationSources.scala +++ /dev/null @@ -1,45 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates - -import com.twitter.ml.api.constant.SharedFeatures.TIMESTAMP -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.OfflineAggregateSource -import com.twitter.timelines.prediction.features.p_home_latest.HomeLatestUserAggregatesFeatures -import timelines.data_processing.ad_hoc.recap.data_record_preparation.RecapDataRecordsAggMinimalJavaDataset - -/** - * Any update here should be in sync with [[TimelinesFeatureGroups]] and [[AggMinimalDataRecordGeneratorJob]]. - */ -object TimelinesAggregationSources { - - /** - * This is the recap data records after post-processing in [[GenerateRecapAggMinimalDataRecordsJob]] - */ - val timelinesDailyRecapMinimalSource = OfflineAggregateSource( - name = "timelines_daily_recap", - timestampFeature = TIMESTAMP, - dalDataSet = Some(RecapDataRecordsAggMinimalJavaDataset), - scaldingSuffixType = Some("dal"), - withValidation = true - ) - val timelinesDailyTwitterWideSource = OfflineAggregateSource( - name = "timelines_daily_twitter_wide", - timestampFeature = TIMESTAMP, - scaldingHdfsPath = Some("/user/timelines/processed/suggests/recap/twitter_wide_data_records"), - scaldingSuffixType = Some("daily"), - withValidation = true - ) - - val timelinesDailyListTimelineSource = OfflineAggregateSource( - name = "timelines_daily_list_timeline", - timestampFeature = TIMESTAMP, - scaldingHdfsPath = Some("/user/timelines/processed/suggests/recap/all_features/list"), - scaldingSuffixType = Some("hourly"), - withValidation = true - ) - - val timelinesDailyHomeLatestSource = OfflineAggregateSource( - name = "timelines_daily_home_latest", - timestampFeature = HomeLatestUserAggregatesFeatures.AGGREGATE_TIMESTAMP_MS, - scaldingHdfsPath = Some("/user/timelines/processed/p_home_latest/user_aggregates"), - scaldingSuffixType = Some("daily") - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/AuthorFeaturesAdapter.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/AuthorFeaturesAdapter.scala deleted file mode 100644 index 7cefc67b9..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/AuthorFeaturesAdapter.scala +++ /dev/null @@ -1,70 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType.UserState -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.{DataRecord, Feature, FeatureContext, RichDataRecord} -import com.twitter.ml.featurestore.catalog.entities.core.Author -import com.twitter.ml.featurestore.catalog.features.magicrecs.UserActivity -import com.twitter.ml.featurestore.lib.data.PredictionRecord -import com.twitter.ml.featurestore.lib.feature.{BoundFeature, BoundFeatureSet} -import com.twitter.ml.featurestore.lib.{UserId, Discrete => FSDiscrete} -import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase -import java.lang.{Boolean => JBoolean} -import java.util -import scala.collection.JavaConverters._ - 
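/**
 * Adapts the author's Feature Store user-state feature into timelines DataRecord boolean
 * features (see `userStateToFeatureMap` below).
 *
 * A minimal usage sketch; `hydratedRecord` is an illustrative PredictionRecord assumed to be
 * already hydrated with the bound [[UserActivity.UserState]] feature for the [[Author]] entity:
 *
 * {{{
 * val records: java.util.List[DataRecord] =
 *   AuthorFeaturesAdapter.adaptToDataRecords(hydratedRecord)
 * // A user state of 7 (HEAVY_TWEETER) yields a single DataRecord with
 * // timelines.author.user_state.is_user_heavy_tweeter set to true.
 * }}}
 */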
-object AuthorFeaturesAdapter extends TimelinesAdapterBase[PredictionRecord] { - val UserStateBoundFeature: BoundFeature[UserId, FSDiscrete] = UserActivity.UserState.bind(Author) - val UserFeaturesSet: BoundFeatureSet = BoundFeatureSet(UserStateBoundFeature) - - /** - * Boolean features about viewer's user state. - * enum UserState { - * NEW = 0, - * NEAR_ZERO = 1, - * VERY_LIGHT = 2, - * LIGHT = 3, - * MEDIUM_TWEETER = 4, - * MEDIUM_NON_TWEETER = 5, - * HEAVY_NON_TWEETER = 6, - * HEAVY_TWEETER = 7 - * }(persisted='true') - */ - val IS_USER_NEW = new Binary("timelines.author.user_state.is_user_new", Set(UserState).asJava) - val IS_USER_LIGHT = new Binary("timelines.author.user_state.is_user_light", Set(UserState).asJava) - val IS_USER_MEDIUM_TWEETER = - new Binary("timelines.author.user_state.is_user_medium_tweeter", Set(UserState).asJava) - val IS_USER_MEDIUM_NON_TWEETER = - new Binary("timelines.author.user_state.is_user_medium_non_tweeter", Set(UserState).asJava) - val IS_USER_HEAVY_NON_TWEETER = - new Binary("timelines.author.user_state.is_user_heavy_non_tweeter", Set(UserState).asJava) - val IS_USER_HEAVY_TWEETER = - new Binary("timelines.author.user_state.is_user_heavy_tweeter", Set(UserState).asJava) - val userStateToFeatureMap: Map[Long, Binary] = Map( - 0L -> IS_USER_NEW, - 1L -> IS_USER_LIGHT, - 2L -> IS_USER_LIGHT, - 3L -> IS_USER_LIGHT, - 4L -> IS_USER_MEDIUM_TWEETER, - 5L -> IS_USER_MEDIUM_NON_TWEETER, - 6L -> IS_USER_HEAVY_NON_TWEETER, - 7L -> IS_USER_HEAVY_TWEETER - ) - - val UserStateBooleanFeatures: Set[Feature[_]] = userStateToFeatureMap.values.toSet - - private val allFeatures: Seq[Feature[_]] = UserStateBooleanFeatures.toSeq - override def getFeatureContext: FeatureContext = new FeatureContext(allFeatures: _*) - override def commonFeatures: Set[Feature[_]] = Set.empty - - override def adaptToDataRecords(record: PredictionRecord): util.List[DataRecord] = { - val newRecord = new RichDataRecord(new DataRecord) - record - .getFeatureValue(UserStateBoundFeature) - .flatMap { userState => userStateToFeatureMap.get(userState.value) }.foreach { - booleanFeature => newRecord.setFeatureValue[JBoolean](booleanFeature, true) - } - - List(newRecord.getRecord).asJava - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/BUILD b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/BUILD deleted file mode 100644 index 93f39405d..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/BUILD +++ /dev/null @@ -1,199 +0,0 @@ -heron_binary( - name = "heron-without-jass", - main = "com.twitter.timelines.prediction.common.aggregates.real_time.TypeSafeRunner", - oss = True, - platform = "java8", - runtime_platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - ":real_time", - "3rdparty/jvm/org/slf4j:slf4j-jdk14", - ], -) - -jvm_app( - name = "rta_heron", - binary = ":heron-without-jass", - bundles = [ - bundle( - fileset = ["resources/jaas.conf"], - ), - ], - tags = [ - "bazel-compatible", - "bazel-only", - ], -) - -scala_library( - sources = ["*.scala"], - platform = "java8", - strict_deps = False, - tags = ["bazel-compatible"], - dependencies = [ - ":online-configs", - "3rdparty/src/jvm/com/twitter/summingbird:storm", - "src/java/com/twitter/heron/util", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core:core-features", - "src/scala/com/twitter/ml/api/util", - 
"src/scala/com/twitter/storehaus_internal/memcache", - "src/scala/com/twitter/storehaus_internal/util", - "src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits", - "src/scala/com/twitter/summingbird_internal/runner/store_config", - "src/scala/com/twitter/summingbird_internal/runner/storm", - "src/scala/com/twitter/summingbird_internal/sources/storm/remote:ClientEventSourceScrooge2", - "src/scala/com/twitter/timelines/prediction/adapters/client_log_event", - "src/scala/com/twitter/timelines/prediction/adapters/client_log_event_mr", - "src/scala/com/twitter/timelines/prediction/features/client_log_event", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/list_features", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/timelines/prediction/features/user_health", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/timelines/suggests/common:record-scala", - "timelinemixer/common/src/main/scala/com/twitter/timelinemixer/clients/served_features_cache", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/aggregation_framework/heron", - "timelines/data_processing/ml_util/aggregation_framework/job", - "timelines/data_processing/ml_util/aggregation_framework/metrics", - "timelines/data_processing/ml_util/transforms", - "timelines/src/main/scala/com/twitter/timelines/clients/memcache_common", - "util/util-core:scala", - ], -) - -scala_library( - name = "online-configs", - sources = [ - "AuthorFeaturesAdapter.scala", - "Event.scala", - "FeatureStoreUtils.scala", - "StormAggregateSourceUtils.scala", - "TimelinesOnlineAggregationConfig.scala", - "TimelinesOnlineAggregationConfigBase.scala", - "TimelinesOnlineAggregationSources.scala", - "TimelinesStormAggregateSource.scala", - "TweetFeaturesReadableStore.scala", - "UserFeaturesAdapter.scala", - "UserFeaturesReadableStore.scala", - ], - platform = "java8", - strict_deps = True, - tags = ["bazel-compatible"], - dependencies = [ - ":base-config", - "3rdparty/src/jvm/com/twitter/scalding:db", - "3rdparty/src/jvm/com/twitter/storehaus:core", - "3rdparty/src/jvm/com/twitter/summingbird:core", - "3rdparty/src/jvm/com/twitter/summingbird:online", - "3rdparty/src/jvm/com/twitter/summingbird:storm", - "abuse/detection/src/main/thrift/com/twitter/abuse/detection/mention_interactions:thrift-scala", - "snowflake/src/main/scala/com/twitter/snowflake/id", - "snowflake/src/main/thrift:thrift-scala", - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/constant", - "src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core:core-features", - "src/scala/com/twitter/ml/api/util:datarecord", - "src/scala/com/twitter/ml/featurestore/catalog/datasets/geo:geo-user-location", - "src/scala/com/twitter/ml/featurestore/catalog/datasets/magicrecs:user-features", - "src/scala/com/twitter/ml/featurestore/catalog/entities/core", - "src/scala/com/twitter/ml/featurestore/catalog/features/core:user", - "src/scala/com/twitter/ml/featurestore/catalog/features/geo", - "src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-activity", - "src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-info", - "src/scala/com/twitter/ml/featurestore/catalog/features/trends:tweet_trends_scores", - "src/scala/com/twitter/ml/featurestore/lib/data", - "src/scala/com/twitter/ml/featurestore/lib/dataset/offline", - 
"src/scala/com/twitter/ml/featurestore/lib/export/strato:app-names", - "src/scala/com/twitter/ml/featurestore/lib/feature", - "src/scala/com/twitter/ml/featurestore/lib/online", - "src/scala/com/twitter/ml/featurestore/lib/params", - "src/scala/com/twitter/storehaus_internal/util", - "src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits", - "src/scala/com/twitter/summingbird_internal/runner/store_config", - "src/scala/com/twitter/summingbird_internal/runner/storm", - "src/scala/com/twitter/summingbird_internal/sources/common", - "src/scala/com/twitter/summingbird_internal/sources/common/remote:ClientEventSourceScrooge", - "src/scala/com/twitter/summingbird_internal/sources/storm/remote:ClientEventSourceScrooge2", - "src/scala/com/twitter/timelines/prediction/adapters/client_log_event", - "src/scala/com/twitter/timelines/prediction/adapters/client_log_event_mr", - "src/scala/com/twitter/timelines/prediction/common/adapters:base", - "src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter", - "src/scala/com/twitter/timelines/prediction/common/aggregates", - "src/scala/com/twitter/timelines/prediction/features/client_log_event", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/list_features", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/timelines/prediction/features/user_health", - "src/thrift/com/twitter/clientapp/gen:clientapp-scala", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/timelines/suggests/common:engagement-java", - "src/thrift/com/twitter/timelines/suggests/common:engagement-scala", - "src/thrift/com/twitter/timelines/suggests/common:record-scala", - "src/thrift/com/twitter/timelineservice/injection:thrift-scala", - "src/thrift/com/twitter/timelineservice/server/suggests/logging:thrift-scala", - "strato/src/main/scala/com/twitter/strato/client", - "timelinemixer/common/src/main/scala/com/twitter/timelinemixer/clients/served_features_cache", - "timelines/data_processing/ad_hoc/suggests/common:raw_training_data_creator", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/aggregation_framework/heron:configs", - "timelines/data_processing/ml_util/aggregation_framework/metrics", - "timelines/data_processing/ml_util/transforms", - "timelines/data_processing/util:rich-request", - "tweetsource/common/src/main/thrift:thrift-scala", - "twitter-server-internal/src/main/scala", - "unified_user_actions/client/src/main/scala/com/twitter/unified_user_actions/client/config", - "unified_user_actions/client/src/main/scala/com/twitter/unified_user_actions/client/summingbird", - "unified_user_actions/thrift/src/main/thrift/com/twitter/unified_user_actions:unified_user_actions-scala", - "util/util-core:scala", - "util/util-stats/src/main/scala/com/twitter/finagle/stats", - ], -) - -scala_library( - name = "base-config", - sources = [ - "AuthorFeaturesAdapter.scala", - "TimelinesOnlineAggregationConfigBase.scala", - "TweetFeaturesAdapter.scala", - "UserFeaturesAdapter.scala", - ], - platform = "java8", - strict_deps = True, - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/java/com/twitter/ml/api/constant", - "src/resources/com/twitter/timelines/prediction/common/aggregates/real_time", - "src/scala/com/twitter/ml/api/util:datarecord", - 
"src/scala/com/twitter/ml/featurestore/catalog/datasets/magicrecs:user-features", - "src/scala/com/twitter/ml/featurestore/catalog/entities/core", - "src/scala/com/twitter/ml/featurestore/catalog/features/core:user", - "src/scala/com/twitter/ml/featurestore/catalog/features/geo", - "src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-activity", - "src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-info", - "src/scala/com/twitter/ml/featurestore/catalog/features/trends:tweet_trends_scores", - "src/scala/com/twitter/ml/featurestore/lib/data", - "src/scala/com/twitter/ml/featurestore/lib/feature", - "src/scala/com/twitter/timelines/prediction/common/adapters:base", - "src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter", - "src/scala/com/twitter/timelines/prediction/common/aggregates", - "src/scala/com/twitter/timelines/prediction/features/client_log_event", - "src/scala/com/twitter/timelines/prediction/features/common", - "src/scala/com/twitter/timelines/prediction/features/list_features", - "src/scala/com/twitter/timelines/prediction/features/recap", - "src/scala/com/twitter/timelines/prediction/features/user_health", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/ml/api:feature_context-java", - "src/thrift/com/twitter/timelines/suggests/common:engagement-scala", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/aggregation_framework/heron:base-config", - "timelines/data_processing/ml_util/aggregation_framework/metrics", - "timelines/data_processing/ml_util/transforms", - "util/util-core:scala", - "util/util-core:util-core-util", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/Event.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/Event.scala deleted file mode 100644 index 1bd697d0d..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/Event.scala +++ /dev/null @@ -1,11 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -private[real_time] sealed trait Event[T] { def event: T } - -private[real_time] case class HomeEvent[T](override val event: T) extends Event[T] - -private[real_time] case class ProfileEvent[T](override val event: T) extends Event[T] - -private[real_time] case class SearchEvent[T](override val event: T) extends Event[T] - -private[real_time] case class UuaEvent[T](override val event: T) extends Event[T] diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/FeatureStoreUtils.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/FeatureStoreUtils.scala deleted file mode 100644 index 156d9d35f..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/FeatureStoreUtils.scala +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.finagle.mtls.authentication.ServiceIdentifier -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.ml.featurestore.catalog.datasets.magicrecs.UserFeaturesDataset -import com.twitter.ml.featurestore.catalog.datasets.geo.GeoUserLocationDataset -import com.twitter.ml.featurestore.lib.dataset.DatasetParams -import com.twitter.ml.featurestore.lib.export.strato.FeatureStoreAppNames -import com.twitter.ml.featurestore.lib.online.FeatureStoreClient -import 
com.twitter.ml.featurestore.lib.params.FeatureStoreParams -import com.twitter.strato.client.{Client, Strato} -import com.twitter.strato.opcontext.Attribution.ManhattanAppId -import com.twitter.util.Duration - -private[real_time] object FeatureStoreUtils { - private def mkStratoClient(serviceIdentifier: ServiceIdentifier): Client = - Strato.client - .withMutualTls(serviceIdentifier) - .withRequestTimeout(Duration.fromMilliseconds(50)) - .build() - - private val featureStoreParams: FeatureStoreParams = - FeatureStoreParams( - perDataset = Map( - UserFeaturesDataset.id -> - DatasetParams( - stratoSuffix = Some(FeatureStoreAppNames.Timelines), - attributions = Seq(ManhattanAppId("athena", "timelines_aggregates_v2_features_by_user")) - ), - GeoUserLocationDataset.id -> - DatasetParams( - attributions = Seq(ManhattanAppId("starbuck", "timelines_geo_features_by_user")) - ) - ) - ) - - def mkFeatureStoreClient( - serviceIdentifier: ServiceIdentifier, - statsReceiver: StatsReceiver - ): FeatureStoreClient = { - com.twitter.server.Init() // necessary in order to use WilyNS path - - val stratoClient: Client = mkStratoClient(serviceIdentifier) - val featureStoreClient: FeatureStoreClient = FeatureStoreClient( - featureSet = - UserFeaturesAdapter.UserFeaturesSet ++ AuthorFeaturesAdapter.UserFeaturesSet ++ TweetFeaturesAdapter.TweetFeaturesSet, - client = stratoClient, - statsReceiver = statsReceiver, - featureStoreParams = featureStoreParams - ) - featureStoreClient - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/LocallyReplicatedStore.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/LocallyReplicatedStore.scala deleted file mode 100644 index 42f86fa4f..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/LocallyReplicatedStore.scala +++ /dev/null @@ -1,79 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.storehaus.ReplicatedReadableStore -import com.twitter.storehaus.Store -import com.twitter.timelines.clients.memcache_common._ -import com.twitter.timelines.util.FailOpenHandler -import com.twitter.util.Future - -object ServedFeaturesMemcacheConfigBuilder { - def getTwCacheDestination(cluster: String, isProd: Boolean = false): String = - if (!isProd) { - s"/srv#/test/$cluster/cache//twemcache_timelines_served_features_cache" - } else { - s"/srv#/prod/$cluster/cache/timelines_served_features" - } - - /** - * @cluster The DC of the cache that this client will send requests to. This - * can be different to the DC where the summingbird job is running in. - * @isProd Define if this client is part of a production summingbird job as - * different accesspoints will need to be chosen. - */ - def build(cluster: String, isProd: Boolean = false): StorehausMemcacheConfig = - StorehausMemcacheConfig( - destName = getTwCacheDestination(cluster, isProd), - keyPrefix = "", - requestTimeout = 200.milliseconds, - numTries = 2, - globalTimeout = 400.milliseconds, - tcpConnectTimeout = 200.milliseconds, - connectionAcquisitionTimeout = 200.milliseconds, - numPendingRequests = 1000, - isReadOnly = false - ) -} - -/** - * If lookup key does not exist locally, make a call to the replicated store(s). - * If value exists remotely, write the first returned value to the local store - * and return it. Map any exceptions to None so that the subsequent operations - * may proceed. 
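 *
 * A minimal construction sketch (the store and receiver names are illustrative, not the
 * production wiring):
 *
 * {{{
 * val store: Store[K, V] = new LocallyReplicatedStore(
 *   localStore = localCacheStore,          // in-DC Store[K, V], e.g. memcache-backed
 *   remoteStore = crossDcReplicatedStore,  // ReplicatedReadableStore[K, V] over remote DCs
 *   scopedStatsReceiver = statsReceiver.scope("local_replication")
 * )
 * // get(k) serves local hits directly; on a local miss it reads from the replicated store
 * // and asynchronously writes any returned value back to the local store.
 * }}}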
- */ -class LocallyReplicatedStore[-K, V]( - localStore: Store[K, V], - remoteStore: ReplicatedReadableStore[K, V], - scopedStatsReceiver: StatsReceiver) - extends Store[K, V] { - private[this] val failOpenHandler = new FailOpenHandler(scopedStatsReceiver.scope("failOpen")) - private[this] val localFailsCounter = scopedStatsReceiver.counter("localFails") - private[this] val localWritesCounter = scopedStatsReceiver.counter("localWrites") - private[this] val remoteFailsCounter = scopedStatsReceiver.counter("remoteFails") - - override def get(k: K): Future[Option[V]] = - failOpenHandler { - localStore - .get(k) - .flatMap { - case Some(v) => Future.value(Some(v)) - case _ => { - localFailsCounter.incr() - val replicatedOptFu = remoteStore.get(k) - // async write if result is not empty - replicatedOptFu.onSuccess { - case Some(v) => { - localWritesCounter.incr() - localStore.put((k, Some(v))) - } - case _ => { - remoteFailsCounter.incr() - Unit - } - } - replicatedOptFu - } - } - } { _: Throwable => Future.None } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/StormAggregateSourceUtils.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/StormAggregateSourceUtils.scala deleted file mode 100644 index e72d3392b..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/StormAggregateSourceUtils.scala +++ /dev/null @@ -1,254 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.finagle.stats.Counter -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.ml.api.constant.SharedFeatures -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.DataRecordMerger -import com.twitter.ml.api.Feature -import com.twitter.ml.api.RichDataRecord -import com.twitter.ml.featurestore.catalog.entities.core.Author -import com.twitter.ml.featurestore.catalog.entities.core.Tweet -import com.twitter.ml.featurestore.catalog.entities.core.User -import com.twitter.ml.featurestore.lib.online.FeatureStoreClient -import com.twitter.summingbird.Producer -import com.twitter.summingbird.storm.Storm -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron.RealTimeAggregatesJobConfig -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import java.lang.{Long => JLong} - -import com.twitter.unified_user_actions.thriftscala.ActionType -import com.twitter.unified_user_actions.thriftscala.UnifiedUserAction - -private[real_time] object StormAggregateSourceUtils { - type UserId = Long - type AuthorId = Long - type TweetId = Long - - /** - * Attaches a [[FeatureStoreClient]] to the underyling [[Producer]]. The FeatureStoreClient - * hydrates additional user features. - * - * @param underlyingProducer converts a stream of [[com.twitter.clientapp.thriftscala.LogEvent]] - * to a stream of [[DataRecord]]. 
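 *
 * Hydration happens in up to three stages, each gated by a flag on the job config and each
 * implemented as a Summingbird leftJoin against a Storm service backed by the Feature Store.
 * Schematically (a sketch of the chain built below, not literal code):
 *
 * {{{
 * events                                                        // Producer[Storm, Event[DataRecord]]
 *   leftJoin user features   keyed by SharedFeatures.USER_ID                    (keyedByUserEnabled)
 *   leftJoin author features keyed by TimelinesSharedFeatures.SOURCE_AUTHOR_ID  (keyedByAuthorEnabled)
 *   leftJoin tweet features  keyed by TimelinesSharedFeatures.SOURCE_TWEET_ID   (keyedByTweetEnabled)
 * }}}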
- */ - def wrapByFeatureStoreClient( - underlyingProducer: Producer[Storm, Event[DataRecord]], - jobConfig: RealTimeAggregatesJobConfig, - scopedStatsReceiver: StatsReceiver - ): Producer[Storm, Event[DataRecord]] = { - lazy val keyDataRecordCounter = scopedStatsReceiver.counter("keyDataRecord") - lazy val keyFeatureCounter = scopedStatsReceiver.counter("keyFeature") - lazy val leftDataRecordCounter = scopedStatsReceiver.counter("leftDataRecord") - lazy val rightDataRecordCounter = scopedStatsReceiver.counter("rightDataRecord") - lazy val mergeNumFeaturesCounter = scopedStatsReceiver.counter("mergeNumFeatures") - lazy val authorKeyDataRecordCounter = scopedStatsReceiver.counter("authorKeyDataRecord") - lazy val authorKeyFeatureCounter = scopedStatsReceiver.counter("authorKeyFeature") - lazy val authorLeftDataRecordCounter = scopedStatsReceiver.counter("authorLeftDataRecord") - lazy val authorRightDataRecordCounter = scopedStatsReceiver.counter("authorRightDataRecord") - lazy val authorMergeNumFeaturesCounter = scopedStatsReceiver.counter("authorMergeNumFeatures") - lazy val tweetKeyDataRecordCounter = - scopedStatsReceiver.counter("tweetKeyDataRecord") - lazy val tweetKeyFeatureCounter = scopedStatsReceiver.counter("tweetKeyFeature") - lazy val tweetLeftDataRecordCounter = - scopedStatsReceiver.counter("tweetLeftDataRecord") - lazy val tweetRightDataRecordCounter = - scopedStatsReceiver.counter("tweetRightDataRecord") - lazy val tweetMergeNumFeaturesCounter = - scopedStatsReceiver.counter("tweetMergeNumFeatures") - - @transient lazy val featureStoreClient: FeatureStoreClient = - FeatureStoreUtils.mkFeatureStoreClient( - serviceIdentifier = jobConfig.serviceIdentifier, - statsReceiver = scopedStatsReceiver - ) - - lazy val joinUserFeaturesDataRecordProducer = - if (jobConfig.keyedByUserEnabled) { - lazy val keyedByUserFeaturesStormService: Storm#Service[Set[UserId], DataRecord] = - Storm.service( - new UserFeaturesReadableStore( - featureStoreClient = featureStoreClient, - userEntity = User, - userFeaturesAdapter = UserFeaturesAdapter - ) - ) - - leftJoinDataRecordProducer( - keyFeature = SharedFeatures.USER_ID, - leftDataRecordProducer = underlyingProducer, - rightStormService = keyedByUserFeaturesStormService, - keyDataRecordCounter = keyDataRecordCounter, - keyFeatureCounter = keyFeatureCounter, - leftDataRecordCounter = leftDataRecordCounter, - rightDataRecordCounter = rightDataRecordCounter, - mergeNumFeaturesCounter = mergeNumFeaturesCounter - ) - } else { - underlyingProducer - } - - lazy val joinAuthorFeaturesDataRecordProducer = - if (jobConfig.keyedByAuthorEnabled) { - lazy val keyedByAuthorFeaturesStormService: Storm#Service[Set[AuthorId], DataRecord] = - Storm.service( - new UserFeaturesReadableStore( - featureStoreClient = featureStoreClient, - userEntity = Author, - userFeaturesAdapter = AuthorFeaturesAdapter - ) - ) - - leftJoinDataRecordProducer( - keyFeature = TimelinesSharedFeatures.SOURCE_AUTHOR_ID, - leftDataRecordProducer = joinUserFeaturesDataRecordProducer, - rightStormService = keyedByAuthorFeaturesStormService, - keyDataRecordCounter = authorKeyDataRecordCounter, - keyFeatureCounter = authorKeyFeatureCounter, - leftDataRecordCounter = authorLeftDataRecordCounter, - rightDataRecordCounter = authorRightDataRecordCounter, - mergeNumFeaturesCounter = authorMergeNumFeaturesCounter - ) - } else { - joinUserFeaturesDataRecordProducer - } - - lazy val joinTweetFeaturesDataRecordProducer = { - if (jobConfig.keyedByTweetEnabled) { - lazy val keyedByTweetFeaturesStormService: 
Storm#Service[Set[TweetId], DataRecord] = - Storm.service( - new TweetFeaturesReadableStore( - featureStoreClient = featureStoreClient, - tweetEntity = Tweet, - tweetFeaturesAdapter = TweetFeaturesAdapter - ) - ) - - leftJoinDataRecordProducer( - keyFeature = TimelinesSharedFeatures.SOURCE_TWEET_ID, - leftDataRecordProducer = joinAuthorFeaturesDataRecordProducer, - rightStormService = keyedByTweetFeaturesStormService, - keyDataRecordCounter = tweetKeyDataRecordCounter, - keyFeatureCounter = tweetKeyFeatureCounter, - leftDataRecordCounter = tweetLeftDataRecordCounter, - rightDataRecordCounter = tweetRightDataRecordCounter, - mergeNumFeaturesCounter = tweetMergeNumFeaturesCounter - ) - } else { - joinAuthorFeaturesDataRecordProducer - } - } - - joinTweetFeaturesDataRecordProducer - } - - private[this] lazy val DataRecordMerger = new DataRecordMerger - - /** - * Make join key from the client event data record and return both. - * @param keyFeature Feature to extract join key value: USER_ID, SOURCE_TWEET_ID, etc. - * @param record DataRecord containing client engagement and basic tweet-side features - * @return The return type is a tuple of this key and original data record which will be used - * in the subsequent leftJoin operation. - */ - private[this] def mkKey( - keyFeature: Feature[JLong], - record: DataRecord, - keyDataRecordCounter: Counter, - keyFeatureCounter: Counter - ): Set[Long] = { - keyDataRecordCounter.incr() - val richRecord = new RichDataRecord(record) - if (richRecord.hasFeature(keyFeature)) { - keyFeatureCounter.incr() - val key: Long = richRecord.getFeatureValue(keyFeature).toLong - Set(key) - } else { - Set.empty[Long] - } - } - - /** - * After the leftJoin, merge the client event data record and the joined data record - * into a single data record used for further aggregation. 
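 * The merge is applied in place to the client-event record via [[DataRecordMerger]]; if the
 * join produced no feature-store record, the event record is passed through unchanged.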
- */ - private[this] def mergeDataRecord( - leftRecord: Event[DataRecord], - rightRecordOpt: Option[DataRecord], - leftDataRecordCounter: Counter, - rightDataRecordCounter: Counter, - mergeNumFeaturesCounter: Counter - ): Event[DataRecord] = { - leftDataRecordCounter.incr() - rightRecordOpt.foreach { rightRecord => - rightDataRecordCounter.incr() - DataRecordMerger.merge(leftRecord.event, rightRecord) - mergeNumFeaturesCounter.incr(new RichDataRecord(leftRecord.event).numFeatures()) - } - leftRecord - } - - private[this] def leftJoinDataRecordProducer( - keyFeature: Feature[JLong], - leftDataRecordProducer: Producer[Storm, Event[DataRecord]], - rightStormService: Storm#Service[Set[Long], DataRecord], - keyDataRecordCounter: => Counter, - keyFeatureCounter: => Counter, - leftDataRecordCounter: => Counter, - rightDataRecordCounter: => Counter, - mergeNumFeaturesCounter: => Counter - ): Producer[Storm, Event[DataRecord]] = { - val keyedLeftDataRecordProducer: Producer[Storm, (Set[Long], Event[DataRecord])] = - leftDataRecordProducer.map { - case dataRecord: HomeEvent[DataRecord] => - val key = mkKey( - keyFeature = keyFeature, - record = dataRecord.event, - keyDataRecordCounter = keyDataRecordCounter, - keyFeatureCounter = keyFeatureCounter - ) - (key, dataRecord) - case dataRecord: ProfileEvent[DataRecord] => - val key = Set.empty[Long] - (key, dataRecord) - case dataRecord: SearchEvent[DataRecord] => - val key = Set.empty[Long] - (key, dataRecord) - case dataRecord: UuaEvent[DataRecord] => - val key = Set.empty[Long] - (key, dataRecord) - } - - keyedLeftDataRecordProducer - .leftJoin(rightStormService) - .map { - case (_, (leftRecord, rightRecordOpt)) => - mergeDataRecord( - leftRecord = leftRecord, - rightRecordOpt = rightRecordOpt, - leftDataRecordCounter = leftDataRecordCounter, - rightDataRecordCounter = rightDataRecordCounter, - mergeNumFeaturesCounter = mergeNumFeaturesCounter - ) - } - } - - /** - * Filter Unified User Actions events to include only actions that has home timeline visit prior to landing on the page - */ - def isUuaBCEEventsFromHome(event: UnifiedUserAction): Boolean = { - def breadcrumbViewsContain(view: String): Boolean = - event.eventMetadata.breadcrumbViews.map(_.contains(view)).getOrElse(false) - - (event.actionType) match { - case ActionType.ClientTweetV2Impression if breadcrumbViewsContain("home") => - true - case ActionType.ClientTweetVideoFullscreenV2Impression - if (breadcrumbViewsContain("home") & breadcrumbViewsContain("video")) => - true - case ActionType.ClientProfileV2Impression if breadcrumbViewsContain("home") => - true - case _ => false - } - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfig.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfig.scala deleted file mode 100644 index 8d7a41d21..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfig.scala +++ /dev/null @@ -1,34 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.conversions.DurationOps._ -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron.{ - OnlineAggregationStoresTrait, - RealTimeAggregateStore -} - -object TimelinesOnlineAggregationConfig - extends TimelinesOnlineAggregationDefinitionsTrait - with OnlineAggregationStoresTrait { - - import TimelinesOnlineAggregationSources._ - - override lazy val ProductionStore = 
RealTimeAggregateStore( - memcacheDataSet = "timelines_real_time_aggregates", - isProd = true, - cacheTTL = 5.days - ) - - override lazy val StagingStore = RealTimeAggregateStore( - memcacheDataSet = "twemcache_timelines_real_time_aggregates", - isProd = false, - cacheTTL = 5.days - ) - - override lazy val inputSource = timelinesOnlineAggregateSource - - /** - * AggregateToCompute: This defines the complete set of aggregates to be - * computed by the aggregation job and to be stored in memcache. - */ - override lazy val AggregatesToCompute = ProdAggregates ++ StagingAggregates -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfigBase.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfigBase.scala deleted file mode 100644 index 0d7c072e2..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationConfigBase.scala +++ /dev/null @@ -1,1112 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.conversions.DurationOps._ -import com.twitter.ml.api.Feature -import com.twitter.ml.api.constant.SharedFeatures -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateGroup -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateSource -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateStore -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron.OnlineAggregationConfigTrait -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.metrics.CountMetric -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.metrics.SumMetric -import com.twitter.timelines.data_processing.ml_util.transforms.BinaryUnion -import com.twitter.timelines.data_processing.ml_util.transforms.DownsampleTransform -import com.twitter.timelines.data_processing.ml_util.transforms.IsNewUserTransform -import com.twitter.timelines.data_processing.ml_util.transforms.IsPositionTransform -import com.twitter.timelines.data_processing.ml_util.transforms.LogTransform -import com.twitter.timelines.data_processing.ml_util.transforms.PositionCase -import com.twitter.timelines.data_processing.ml_util.transforms.RichITransform -import com.twitter.timelines.data_processing.ml_util.transforms.RichRemoveUnverifiedUserTransform -import com.twitter.timelines.prediction.features.client_log_event.ClientLogEventDataRecordFeatures -import com.twitter.timelines.prediction.features.common.CombinedFeatures -import com.twitter.timelines.prediction.features.common.CombinedFeatures._ -import com.twitter.timelines.prediction.features.common.ProfileLabelFeatures -import com.twitter.timelines.prediction.features.common.SearchLabelFeatures -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.IS_TOP_FIVE -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.IS_TOP_ONE -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.IS_TOP_TEN -import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures.LOG_POSITION -import com.twitter.timelines.prediction.features.list_features.ListFeatures -import com.twitter.timelines.prediction.features.recap.RecapFeatures -import com.twitter.util.Duration -import java.lang.{Boolean => 
JBoolean} -import java.lang.{Long => JLong} -import scala.io.Source - -object TimelinesOnlineAggregationUtils { - val TweetLabels: Set[Feature[JBoolean]] = CombinedFeatures.EngagementsRealTime - val TweetCoreLabels: Set[Feature[JBoolean]] = CombinedFeatures.CoreEngagements - val TweetDwellLabels: Set[Feature[JBoolean]] = CombinedFeatures.DwellEngagements - val TweetCoreAndDwellLabels: Set[Feature[JBoolean]] = TweetCoreLabels ++ TweetDwellLabels - val PrivateEngagementLabelsV2: Set[Feature[JBoolean]] = CombinedFeatures.PrivateEngagementsV2 - val ProfileCoreLabels: Set[Feature[JBoolean]] = ProfileLabelFeatures.CoreEngagements - val ProfileNegativeEngagementLabels: Set[Feature[JBoolean]] = - ProfileLabelFeatures.NegativeEngagements - val ProfileNegativeEngagementUnionLabels: Set[Feature[JBoolean]] = Set( - ProfileLabelFeatures.IS_NEGATIVE_FEEDBACK_UNION) - val SearchCoreLabels: Set[Feature[JBoolean]] = SearchLabelFeatures.CoreEngagements - val TweetNegativeEngagementLabels: Set[Feature[JBoolean]] = - CombinedFeatures.NegativeEngagementsRealTime - val TweetNegativeEngagementDontLikeLabels: Set[Feature[JBoolean]] = - CombinedFeatures.NegativeEngagementsRealTimeDontLike - val TweetNegativeEngagementSecondaryLabels: Set[Feature[JBoolean]] = - CombinedFeatures.NegativeEngagementsSecondary - val AllTweetNegativeEngagementLabels: Set[Feature[JBoolean]] = - TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels ++ TweetNegativeEngagementSecondaryLabels - val UserAuthorEngagementLabels: Set[Feature[JBoolean]] = CombinedFeatures.UserAuthorEngagements - val ShareEngagementLabels: Set[Feature[JBoolean]] = CombinedFeatures.ShareEngagements - val BookmarkEngagementLabels: Set[Feature[JBoolean]] = CombinedFeatures.BookmarkEngagements - val AllBCEDwellLabels: Set[Feature[JBoolean]] = - CombinedFeatures.TweetDetailDwellEngagements ++ CombinedFeatures.ProfileDwellEngagements ++ CombinedFeatures.FullscreenVideoDwellEngagements - val AllTweetUnionLabels: Set[Feature[JBoolean]] = Set( - CombinedFeatures.IS_IMPLICIT_POSITIVE_FEEDBACK_UNION, - CombinedFeatures.IS_EXPLICIT_POSITIVE_FEEDBACK_UNION, - CombinedFeatures.IS_ALL_NEGATIVE_FEEDBACK_UNION - ) - val AllTweetLabels: Set[Feature[JBoolean]] = - TweetLabels ++ TweetCoreAndDwellLabels ++ AllTweetNegativeEngagementLabels ++ ProfileCoreLabels ++ ProfileNegativeEngagementLabels ++ ProfileNegativeEngagementUnionLabels ++ UserAuthorEngagementLabels ++ SearchCoreLabels ++ ShareEngagementLabels ++ BookmarkEngagementLabels ++ PrivateEngagementLabelsV2 ++ AllBCEDwellLabels ++ AllTweetUnionLabels - - def addFeatureFilterFromResource( - prodGroup: AggregateGroup, - aggRemovalPath: String - ): AggregateGroup = { - val resource = Some(Source.fromResource(aggRemovalPath)) - val lines = resource.map(_.getLines.toSeq) - lines match { - case Some(value) => prodGroup.copy(aggExclusionRegex = value) - case _ => prodGroup - } - } -} - -trait TimelinesOnlineAggregationDefinitionsTrait extends OnlineAggregationConfigTrait { - import TimelinesOnlineAggregationUtils._ - - def inputSource: AggregateSource - def ProductionStore: AggregateStore - def StagingStore: AggregateStore - - val TweetFeatures: Set[Feature[_]] = Set( - ClientLogEventDataRecordFeatures.HasConsumerVideo, - ClientLogEventDataRecordFeatures.PhotoCount - ) - val CandidateTweetSourceFeatures: Set[Feature[_]] = Set( - ClientLogEventDataRecordFeatures.FromRecap, - ClientLogEventDataRecordFeatures.FromRecycled, - ClientLogEventDataRecordFeatures.FromActivity, - ClientLogEventDataRecordFeatures.FromSimcluster, 
- ClientLogEventDataRecordFeatures.FromErg, - ClientLogEventDataRecordFeatures.FromCroon, - ClientLogEventDataRecordFeatures.FromList, - ClientLogEventDataRecordFeatures.FromRecTopic - ) - - def createStagingGroup(prodGroup: AggregateGroup): AggregateGroup = - prodGroup.copy( - outputStore = StagingStore - ) - - // Aggregate user engagements/features by tweet Id. - val tweetEngagement30MinuteCountsProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate user engagements/features by tweet Id. - val tweetVerifiedDontLikeEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v6", - preTransforms = Seq(RichRemoveUnverifiedUserTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val tweetNegativeEngagement6HourCounts = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v2", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val tweetVerifiedNegativeEngagementCounts = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v7", - preTransforms = Seq(RichRemoveUnverifiedUserTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val promotedTweetEngagementRealTimeCounts = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v3.is_promoted", - preTransforms = Seq( - DownsampleTransform( - negativeSamplingRate = 0.0, - keepLabels = Set(ClientLogEventDataRecordFeatures.IsPromoted))), - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(2.hours, 24.hours), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate total engagement counts by tweet Id for non-public - * engagements. Similar to EB's public engagement counts. 
- */ - val tweetEngagementTotalCountsProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val tweetNegativeEngagementTotalCounts = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v2", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = TweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's user id. - */ - val userEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v1", - keys = Set(SharedFeatures.USER_ID), - features = TweetFeatures, - labels = TweetLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's user id. - */ - val userEngagementRealTimeAggregatesV2 = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v2", - keys = Set(SharedFeatures.USER_ID), - features = ClientLogEventDataRecordFeatures.TweetFeaturesV2, - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate author's user state features grouped by viewer's user id. - */ - val userEngagementAuthorUserStateRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v3", - preTransforms = Seq.empty, - keys = Set(SharedFeatures.USER_ID), - features = AuthorFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate author's user state features grouped by viewer's user id. - */ - val userNegativeEngagementAuthorUserStateRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v4", - preTransforms = Seq.empty, - keys = Set(SharedFeatures.USER_ID), - features = AuthorFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's user id, with 48 hour halfLife. 
- */ - val userEngagement48HourRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v5", - keys = Set(SharedFeatures.USER_ID), - features = TweetFeatures, - labels = TweetLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(48.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate author's user state features grouped by viewer's user id. - */ - val userNegativeEngagementAuthorUserState72HourRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_aggregates_v6", - preTransforms = Seq.empty, - keys = Set(SharedFeatures.USER_ID), - features = AuthorFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(72.hours), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate features grouped by source author id: for each author, aggregate features are created - * to quantify engagements (fav, reply, etc.) which the author's tweets have received. - */ - val authorEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = TweetLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate features grouped by source author id: for each author, aggregate features are created - * to quantify negative engagements (mute, block, etc.) which the author's tweets have received. - * - * This aggregate group is not used in Home, but it is used by the Follow Recommendation Service, so we need to keep it for now. - * - */ - val authorNegativeEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_aggregates_v2", - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = TweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate features grouped by source author id: for each author, aggregate features are created - * to quantify negative engagements (don't like) which the author's tweets have received from - * verified users. - */ - val authorVerifiedNegativeEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_aggregates_v3", - preTransforms = Seq(RichRemoveUnverifiedUserTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by topic id.
- */ - val topicEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_aggregates_v1", - keys = Set(TimelinesSharedFeatures.TOPIC_ID), - features = Set.empty, - labels = TweetLabels ++ AllTweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate user engagements / user state by topic id. - */ - val topicEngagementUserStateRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_aggregates_v2", - keys = Set(TimelinesSharedFeatures.TOPIC_ID), - features = UserFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate user negative engagements / user state by topic id. - */ - val topicNegativeEngagementUserStateRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_aggregates_v3", - keys = Set(TimelinesSharedFeatures.TOPIC_ID), - features = UserFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by topic id like real_time_topic_aggregates_v1 but 24hour halfLife - */ - val topicEngagement24HourRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_aggregates_v4", - keys = Set(TimelinesSharedFeatures.TOPIC_ID), - features = Set.empty, - labels = TweetLabels ++ AllTweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate user engagements / user state by tweet Id. - val tweetEngagementUserStateRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v3", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = UserFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate user engagements / user gender by tweet Id. - val tweetEngagementGenderRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v4", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = UserFeaturesAdapter.GenderBooleanFeatures, - labels = - TweetCoreAndDwellLabels ++ TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate user negative engagements / user state by tweet Id. 
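Note on halfLives: every AggregateGroup in this config pairs a CountMetric with one or more half-lives (30.minutes, 24.hours, 48.hours, Duration.Top, etc.). Conceptually, each (key, feature, label, half-life) cell behaves like an exponentially decayed counter: short half-lives emphasize what is happening right now, while Duration.Top corresponds to an undecayed lifetime count. The snippet below is a minimal, self-contained sketch of that idea only; it is not the aggregation framework's implementation, and all names in it are illustrative.

```scala
// Illustrative sketch only: an exponentially decayed counter keyed by a single half-life.
// A half-life of Double.PositiveInfinity plays the role of Duration.Top (no decay).
final case class DecayedCount(value: Double, lastUpdateMs: Long) {

  // Value of the counter at time nowMs, after applying half-life decay.
  def decayedTo(nowMs: Long, halfLifeMs: Double): Double =
    if (halfLifeMs.isPosInfinity) value
    else value * math.pow(0.5, (nowMs - lastUpdateMs).toDouble / halfLifeMs)

  // Record one engagement (e.g. a fav or reply) observed at time nowMs.
  def observe(nowMs: Long, halfLifeMs: Double): DecayedCount =
    DecayedCount(decayedTo(nowMs, halfLifeMs) + 1.0, nowMs)
}

object DecayedCountExample extends App {
  val halfLifeMs = 30 * 60 * 1000.0 // 30 minutes
  val c = DecayedCount(0.0, 0L).observe(0L, halfLifeMs).observe(0L, halfLifeMs)
  println(c.decayedTo(30 * 60 * 1000L, halfLifeMs)) // ~1.0: both engagements have half-decayed
}
```

With a 30-minute half-life, two engagements observed now contribute roughly 1.0 to the counter half an hour later, which is why the short-half-life groups capture near-real-time behavior while the Duration.Top groups accumulate totals.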
- val tweetNegativeEngagementUserStateRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v5", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = UserFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate user negative engagements / user state by tweet Id. - val tweetVerifiedNegativeEngagementUserStateRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_aggregates_v8", - preTransforms = Seq(RichRemoveUnverifiedUserTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = UserFeaturesAdapter.UserStateBooleanFeatures, - labels = TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet engagement labels and candidate tweet source features grouped by user id. - */ - val userCandidateTweetSourceEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_candidate_tweet_source_aggregates_v1", - keys = Set(SharedFeatures.USER_ID), - features = CandidateTweetSourceFeatures, - labels = TweetCoreAndDwellLabels ++ NegativeEngagementsRealTimeDontLike, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet engagement labels and candidate tweet source features grouped by user id. - */ - val userCandidateTweetSourceEngagement48HourRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_candidate_tweet_source_aggregates_v2", - keys = Set(SharedFeatures.USER_ID), - features = CandidateTweetSourceFeatures, - labels = TweetCoreAndDwellLabels ++ NegativeEngagementsRealTimeDontLike, - metrics = Set(CountMetric), - halfLives = Set(48.hours), - outputStore = ProductionStore, - includeAnyFeature = false, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's user id on Profile engagements - */ - val userProfileEngagementRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "profile_real_time_user_aggregates_v1", - preTransforms = Seq(IsNewUserTransform), - keys = Set(SharedFeatures.USER_ID), - features = TweetFeatures, - labels = ProfileCoreLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val NegativeEngagementsUnionTransform = RichITransform( - BinaryUnion( - featuresToUnify = ProfileNegativeEngagementLabels, - outputFeature = ProfileLabelFeatures.IS_NEGATIVE_FEEDBACK_UNION - )) - - /** - * Aggregate tweet features grouped by viewer's user id on Profile negative engagements. 
- */ - val userProfileNegativeEngagementRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "profile_negative_engagement_real_time_user_aggregates_v1", - preTransforms = Seq(NegativeEngagementsUnionTransform), - keys = Set(SharedFeatures.USER_ID), - features = Set.empty, - labels = ProfileNegativeEngagementLabels ++ ProfileNegativeEngagementUnionLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 72.hours, 14.day), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's and author's user ids and on Profile engagements - */ - val userAuthorProfileEngagementRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "user_author_profile_real_time_aggregates_v1", - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = ProfileCoreLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours, 72.hours), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate tweet features grouped by viewer's and author's user ids and on negative Profile engagements - */ - val userAuthorProfileNegativeEngagementRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "user_author_profile_negative_engagement_real_time_aggregates_v1", - preTransforms = Seq(NegativeEngagementsUnionTransform), - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = ProfileNegativeEngagementUnionLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 72.hours, 14.day), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val newUserAuthorEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_new_user_author_aggregates_v1", - preTransforms = Seq(IsNewUserTransform), - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = TweetCoreAndDwellLabels ++ Set( - IS_CLICKED, - IS_PROFILE_CLICKED, - IS_PHOTO_EXPANDED - ), - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val userAuthorEngagementRealTimeAggregatesProd = { - // Computing user-author real-time aggregates is very expensive so we - // take the union of all major negative feedback engagements to create - // a single negtive label for aggregation. We also include a number of - // core positive engagements. 
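The comment above states the cost rationale: rather than aggregating every major negative-feedback label separately per (user, author) key, those labels are first collapsed into a single union label by the BinaryUnion transform defined just below. As a rough illustration of what such a union computes, the sketch here uses plain maps as a stand-in for DataRecords and hypothetical label names; it is not the framework's BinaryUnion implementation.

```scala
// Illustrative sketch only: collapse several boolean engagement labels into one derived
// "union" label, so the aggregation job keeps a single counter per (user, author) key
// instead of one counter per label. Label names are hypothetical stand-ins.
object UnionLabelSketch extends App {
  def unionLabel(
    record: Map[String, Boolean],
    labelsToUnify: Set[String],
    outputLabel: String
  ): Map[String, Boolean] =
    if (labelsToUnify.exists(label => record.getOrElse(label, false)))
      record + (outputLabel -> true)
    else record

  val withUnion = unionLabel(
    record = Map("is_report_clicked" -> true),
    labelsToUnify = Set("is_block_clicked", "is_mute_clicked", "is_report_clicked"),
    outputLabel = "is_negative_feedback_union"
  )
  println(withUnion("is_negative_feedback_union")) // true
}
```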
- val BinaryUnionNegativeEngagements = - BinaryUnion( - featuresToUnify = AllTweetNegativeEngagementLabels, - outputFeature = IS_NEGATIVE_FEEDBACK_UNION - ) - val BinaryUnionNegativeEngagementsTransform = RichITransform(BinaryUnionNegativeEngagements) - - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_author_aggregates_v1", - preTransforms = Seq(BinaryUnionNegativeEngagementsTransform), - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = UserAuthorEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 1.day), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - } - - /** - * Aggregate tweet features grouped by list id. - */ - val listEngagementRealTimeAggregatesProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_list_aggregates_v1", - keys = Set(ListFeatures.LIST_ID), - features = Set.empty, - labels = - TweetCoreAndDwellLabels ++ TweetNegativeEngagementLabels ++ TweetNegativeEngagementDontLikeLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate features grouped by topic of tweet and country from user's location - val topicCountryRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_country_aggregates_v1", - keys = Set(TimelinesSharedFeatures.TOPIC_ID, UserFeaturesAdapter.USER_COUNTRY_ID), - features = Set.empty, - labels = - TweetCoreAndDwellLabels ++ AllTweetNegativeEngagementLabels ++ PrivateEngagementLabelsV2 ++ ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 72.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate features grouped by TweetId_Country from user's location - val tweetCountryRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_country_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID, UserFeaturesAdapter.USER_COUNTRY_ID), - features = Set.empty, - labels = TweetCoreAndDwellLabels ++ AllTweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = true, - includeTimestampFeature = false, - ) - - // Additional aggregate features grouped by TweetId_Country from user's location - val tweetCountryPrivateEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_country_aggregates_v2", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID, UserFeaturesAdapter.USER_COUNTRY_ID), - features = Set.empty, - labels = PrivateEngagementLabelsV2 ++ ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 72.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Aggregate features grouped by TweetId_Country from user's location - val tweetCountryVerifiedNegativeEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_country_aggregates_v3", - preTransforms = Seq(RichRemoveUnverifiedUserTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID, UserFeaturesAdapter.USER_COUNTRY_ID), - features = 
Set.empty, - labels = AllTweetNegativeEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, Duration.Top), - outputStore = ProductionStore, - includeAnyLabel = true, - includeTimestampFeature = false, - ) - - object positionTranforms extends IsPositionTransform { - override val isInPositionRangeFeature: Seq[PositionCase] = - Seq(PositionCase(1, IS_TOP_ONE), PositionCase(5, IS_TOP_FIVE), PositionCase(10, IS_TOP_TEN)) - override val decodedPositionFeature: Feature.Discrete = - ClientLogEventDataRecordFeatures.InjectedPosition - } - - val userPositionEngagementsCountsProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_position_based_user_aggregates_v1", - keys = Set(SharedFeatures.USER_ID), - features = Set(IS_TOP_ONE, IS_TOP_FIVE, IS_TOP_TEN), - labels = TweetCoreAndDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - preTransforms = Seq(positionTranforms), - includeAnyLabel = false, - includeAnyFeature = false, - includeTimestampFeature = false, - ) - - val userPositionEngagementsSumProd = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_position_based_user_sum_aggregates_v2", - keys = Set(SharedFeatures.USER_ID), - features = Set(LOG_POSITION), - labels = TweetCoreAndDwellLabels, - metrics = Set(SumMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - preTransforms = - Seq(new LogTransform(ClientLogEventDataRecordFeatures.InjectedPosition, LOG_POSITION)), - includeAnyLabel = false, - includeAnyFeature = false, - includeTimestampFeature = false, - ) - - // Aggregates for share engagements - val tweetShareEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_share_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val userShareEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_share_aggregates_v1", - keys = Set(SharedFeatures.USER_ID), - features = Set.empty, - labels = ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val userAuthorShareEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_author_share_aggregates_v1", - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val topicShareEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_topic_share_aggregates_v1", - keys = Set(TimelinesSharedFeatures.TOPIC_ID), - features = Set.empty, - labels = ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val authorShareEngagementsRealTimeAggregates = - 
AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_share_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = ShareEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - // Bookmark RTAs - val tweetBookmarkEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_bookmark_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = BookmarkEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val userBookmarkEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_bookmark_aggregates_v1", - keys = Set(SharedFeatures.USER_ID), - features = Set.empty, - labels = BookmarkEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val userAuthorBookmarkEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_author_bookmark_aggregates_v1", - keys = Set(SharedFeatures.USER_ID, TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = BookmarkEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyFeature = true, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val authorBookmarkEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_bookmark_aggregates_v1", - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = Set.empty, - labels = BookmarkEngagementLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate on user level dwell labels from BCE - */ - val userBCEDwellEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_user_bce_dwell_aggregates", - keys = Set(SharedFeatures.USER_ID), - features = Set.empty, - labels = AllBCEDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - /** - * Aggregate on tweet level dwell labels from BCE - */ - val tweetBCEDwellEngagementsRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_tweet_bce_dwell_aggregates", - keys = Set(TimelinesSharedFeatures.SOURCE_TWEET_ID), - features = Set.empty, - labels = AllBCEDwellLabels, - metrics = Set(CountMetric), - halfLives = Set(30.minutes, 24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeTimestampFeature = false, - ) - - val ImplicitPositiveEngagementsUnionTransform = RichITransform( - BinaryUnion( - featuresToUnify = CombinedFeatures.ImplicitPositiveEngagements, - outputFeature = CombinedFeatures.IS_IMPLICIT_POSITIVE_FEEDBACK_UNION - ) - ) - - val ExplicitPositiveEngagementsUnionTransform = RichITransform( - BinaryUnion( - 
featuresToUnify = CombinedFeatures.ExplicitPositiveEngagements, - outputFeature = CombinedFeatures.IS_EXPLICIT_POSITIVE_FEEDBACK_UNION - ) - ) - - val AllNegativeEngagementsUnionTransform = RichITransform( - BinaryUnion( - featuresToUnify = CombinedFeatures.AllNegativeEngagements, - outputFeature = CombinedFeatures.IS_ALL_NEGATIVE_FEEDBACK_UNION - ) - ) - - /** - * Aggregate features for author content preference - */ - val authorContentPreferenceRealTimeAggregates = - AggregateGroup( - inputSource = inputSource, - aggregatePrefix = "real_time_author_content_preference_aggregates", - preTransforms = Seq( - ImplicitPositiveEngagementsUnionTransform, - ExplicitPositiveEngagementsUnionTransform, - AllNegativeEngagementsUnionTransform), - keys = Set(TimelinesSharedFeatures.SOURCE_AUTHOR_ID), - features = - ClientLogEventDataRecordFeatures.AuthorContentPreferenceTweetTypeFeatures ++ AuthorFeaturesAdapter.UserStateBooleanFeatures, - labels = AllTweetUnionLabels, - metrics = Set(CountMetric), - halfLives = Set(24.hours), - outputStore = ProductionStore, - includeAnyLabel = false, - includeAnyFeature = false, - ) - - val FeaturesGeneratedByPreTransforms = Set(LOG_POSITION, IS_TOP_TEN, IS_TOP_FIVE, IS_TOP_ONE) - - val ProdAggregateGroups = Set( - tweetEngagement30MinuteCountsProd, - tweetEngagementTotalCountsProd, - tweetNegativeEngagement6HourCounts, - tweetNegativeEngagementTotalCounts, - userEngagementRealTimeAggregatesProd, - userEngagement48HourRealTimeAggregatesProd, - userNegativeEngagementAuthorUserStateRealTimeAggregates, - userNegativeEngagementAuthorUserState72HourRealTimeAggregates, - authorEngagementRealTimeAggregatesProd, - topicEngagementRealTimeAggregatesProd, - topicEngagement24HourRealTimeAggregatesProd, - tweetEngagementUserStateRealTimeAggregatesProd, - tweetNegativeEngagementUserStateRealTimeAggregates, - userProfileEngagementRealTimeAggregates, - newUserAuthorEngagementRealTimeAggregatesProd, - userAuthorEngagementRealTimeAggregatesProd, - listEngagementRealTimeAggregatesProd, - tweetCountryRealTimeAggregates, - tweetShareEngagementsRealTimeAggregates, - userShareEngagementsRealTimeAggregates, - userAuthorShareEngagementsRealTimeAggregates, - topicShareEngagementsRealTimeAggregates, - authorShareEngagementsRealTimeAggregates, - tweetBookmarkEngagementsRealTimeAggregates, - userBookmarkEngagementsRealTimeAggregates, - userAuthorBookmarkEngagementsRealTimeAggregates, - authorBookmarkEngagementsRealTimeAggregates, - topicCountryRealTimeAggregates, - tweetCountryPrivateEngagementsRealTimeAggregates, - userBCEDwellEngagementsRealTimeAggregates, - tweetBCEDwellEngagementsRealTimeAggregates, - authorContentPreferenceRealTimeAggregates, - authorVerifiedNegativeEngagementRealTimeAggregatesProd, - tweetVerifiedDontLikeEngagementRealTimeAggregatesProd, - tweetVerifiedNegativeEngagementCounts, - tweetVerifiedNegativeEngagementUserStateRealTimeAggregates, - tweetCountryVerifiedNegativeEngagementsRealTimeAggregates - ).map( - addFeatureFilterFromResource( - _, - "com/twitter/timelines/prediction/common/aggregates/real_time/aggregates_to_drop.txt")) - - val StagingAggregateGroups = ProdAggregateGroups.map(createStagingGroup) - - /** - * Contains the fully typed aggregate groups from which important - * values can be derived e.g. the features to be computed, halflives etc. 
- */ - override val ProdAggregates = ProdAggregateGroups.flatMap(_.buildTypedAggregateGroups()) - - override val StagingAggregates = StagingAggregateGroups.flatMap(_.buildTypedAggregateGroups()) - - - override val ProdCommonAggregates = ProdAggregates - .filter(_.keysToAggregate == Set(SharedFeatures.USER_ID)) - - /** - * This defines the set of selected features from a candidate - * that we'd like to send to the served features cache by TLM. - * These should include interesting and necessary features that - * cannot be extracted from LogEvents only by the real-time aggregates - * job. If you are adding new AggregateGroups requiring TLM-side - * candidate features, make sure to add them here. - */ - val candidateFeaturesToCache: Set[Feature[_]] = Set( - TimelinesSharedFeatures.SOURCE_AUTHOR_ID, - RecapFeatures.HASHTAGS, - RecapFeatures.MENTIONED_SCREEN_NAMES, - RecapFeatures.URL_DOMAINS - ) -} - -/** - * This config should only be used to access the aggregate features constructed by the - * aggregation config, and not for implementing an online real-time aggregates job. - */ -object TimelinesOnlineAggregationFeaturesOnlyConfig - extends TimelinesOnlineAggregationDefinitionsTrait { - - private[real_time] case class DummyAggregateSource(name: String, timestampFeature: Feature[JLong]) - extends AggregateSource - - private[real_time] case class DummyAggregateStore(name: String) extends AggregateStore - - override lazy val inputSource = DummyAggregateSource( - name = "timelines_rta", - timestampFeature = SharedFeatures.TIMESTAMP - ) - override lazy val ProductionStore = DummyAggregateStore("timelines_rta") - override lazy val StagingStore = DummyAggregateStore("timelines_rta") - - override lazy val AggregatesToCompute = ProdAggregates ++ StagingAggregates -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationSources.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationSources.scala deleted file mode 100644 index 71e97a1b1..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesOnlineAggregationSources.scala +++ /dev/null @@ -1,5 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -object TimelinesOnlineAggregationSources { - val timelinesOnlineAggregateSource = new TimelinesStormAggregateSource -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesRealTimeAggregatesJob.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesRealTimeAggregatesJob.scala deleted file mode 100644 index e386d4da1..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesRealTimeAggregatesJob.scala +++ /dev/null @@ -1,182 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.DefaultStatsReceiver -import com.twitter.summingbird.Options -import com.twitter.summingbird.online.option.FlatMapParallelism -import com.twitter.summingbird.online.option.SourceParallelism -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron._ -import com.twitter.timelines.data_processing.ml_util.transforms.DownsampleTransform -import com.twitter.timelines.data_processing.ml_util.transforms.RichITransform -import com.twitter.timelines.data_processing.ml_util.transforms.UserDownsampleTransform - -import 
com.twitter.timelines.prediction.common.aggregates.BCELabelTransformFromUUADataRecord - -/** - * Sets up relevant topology parameters. Our primary goal is to handle the - * LogEvent stream and aggregate (sum) on the parsed DataRecords without falling - * behind. Our constraint is the resulting write (and read) QPS to the backing - * memcache store. - * - * If the job is falling behind, add more flatMappers and/or Summers after - * inspecting the viz panels for the respective job (go/heron-ui). An increase in - * Summers (and/or aggregation keys and features in the config) results in an - * increase in memcache QPS (go/cb and search for our cache). Adjust with CacheSize - * settings until QPS is well-controlled. - * - */ -object TimelinesRealTimeAggregatesJobConfigs extends RealTimeAggregatesJobConfigs { - import TimelinesOnlineAggregationUtils._ - - /** - * We remove input records that do not contain a label/engagement as defined in AllTweetLabels, which includes - * explicit user engagements including public, private and impression events. By avoiding ingesting records without - * engagemnts, we guarantee that no distribution shifts occur in computed aggregate features when we add a new spout - * to input aggregate sources. Counterfactual signal is still available since we aggregate on explicit dwell - * engagements. - */ - val NegativeDownsampleTransform = - DownsampleTransform( - negativeSamplingRate = 0.0, - keepLabels = AllTweetLabels, - positiveSamplingRate = 1.0) - - /** - * We downsample positive engagements for devel topology to reduce traffic, aiming for equivalent of 10% of prod traffic. - * First apply consistent downsampling to 10% of users, and then apply downsampling to remove records without - * explicit labels. We apply user-consistent sampling to more closely approximate prod query patterns. - */ - val StagingUserBasedDownsampleTransform = - UserDownsampleTransform( - availability = 1000, - featureName = "rta_devel" - ) - - override val Prod = RealTimeAggregatesJobConfig( - appId = "summingbird_timelines_rta", - topologyWorkers = 1450, - sourceCount = 120, - flatMapCount = 1800, - summerCount = 3850, - cacheSize = 200, - containerRamGigaBytes = 54, - name = "timelines_real_time_aggregates", - teamName = "timelines", - teamEmail = "", - // If one component is hitting GC limit at prod, tune componentToMetaSpaceSizeMap. - // Except for Source bolts. Tune componentToRamGigaBytesMap for Source bolts instead. - componentToMetaSpaceSizeMap = Map( - "Tail-FlatMap" -> "-XX:MaxMetaspaceSize=1024M -XX:MetaspaceSize=1024M", - "Tail" -> "-XX:MaxMetaspaceSize=2560M -XX:MetaspaceSize=2560M" - ), - // If either component is hitting memory limit at prod - // its memory need to increase: either increase total memory of container (containerRamGigaBytes), - // or allocate more memory for one component while keeping total memory unchanged. - componentToRamGigaBytesMap = Map( - "Tail-FlatMap-Source" -> 3, // Home source - "Tail-FlatMap-Source.2" -> 3, // Profile source - "Tail-FlatMap-Source.3" -> 3, // Search source - "Tail-FlatMap-Source.4" -> 3, // UUA source - "Tail-FlatMap" -> 8 - // Tail will use the leftover memory in the container. - // Make sure to tune topologyWorkers and containerRamGigaBytes such that this is greater than 10 GB. 
- ), - topologyNamedOptions = Map( - "TL_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(120)), - "PROFILE_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(30)), - "SEARCH_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(10)), - "UUA_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(10)), - "COMBINED_PRODUCER" -> Options() - .set(FlatMapParallelism(1800)) - ), - // The UUA datarecord for BCE events inputted will not have binary labels populated. - // BCELabelTransform will set the datarecord with binary BCE dwell labels features based on the corresponding dwell_time_ms. - // It's important to have the BCELabelTransformFromUUADataRecord before ProdNegativeDownsampleTransform - // because ProdNegativeDownsampleTransform will remove datarecord that contains no features from AllTweetLabels. - onlinePreTransforms = - Seq(RichITransform(BCELabelTransformFromUUADataRecord), NegativeDownsampleTransform) - ) - - /** - * we downsample 10% computation of devel RTA based on [[StagingNegativeDownsampleTransform]]. - * To better test scalability of topology, we reduce computing resource of components "Tail-FlatMap" - * and "Tail" to be 10% of prod but keep computing resource of component "Tail-FlatMap-Source" unchanged. - * hence flatMapCount=110, summerCount=105 and sourceCount=100. Hence topologyWorkers =(110+105+100)/5 = 63. - */ - override val Devel = RealTimeAggregatesJobConfig( - appId = "summingbird_timelines_rta_devel", - topologyWorkers = 120, - sourceCount = 120, - flatMapCount = 150, - summerCount = 300, - cacheSize = 200, - containerRamGigaBytes = 54, - name = "timelines_real_time_aggregates_devel", - teamName = "timelines", - teamEmail = "", - // If one component is hitting GC limit at prod, tune componentToMetaSpaceSizeMap - // Except for Source bolts. Tune componentToRamGigaBytesMap for Source bolts instead. - componentToMetaSpaceSizeMap = Map( - "Tail-FlatMap" -> "-XX:MaxMetaspaceSize=1024M -XX:MetaspaceSize=1024M", - "Tail" -> "-XX:MaxMetaspaceSize=2560M -XX:MetaspaceSize=2560M" - ), - // If either component is hitting memory limit at prod - // its memory need to increase: either increase total memory of container (containerRamGigaBytes), - // or allocate more memory for one component while keeping total memory unchanged. - componentToRamGigaBytesMap = Map( - "Tail-FlatMap-Source" -> 3, // Home source - "Tail-FlatMap-Source.2" -> 3, // Profile source - "Tail-FlatMap-Source.3" -> 3, // Search source - "Tail-FlatMap-Source.4" -> 3, // UUA source - "Tail-FlatMap" -> 8 - // Tail will use the leftover memory in the container. - // Make sure to tune topologyWorkers and containerRamGigaBytes such that this is greater than 10 GB. 
- ), - topologyNamedOptions = Map( - "TL_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(120)), - "PROFILE_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(30)), - "SEARCH_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(10)), - "UUA_EVENTS_SOURCE" -> Options() - .set(SourceParallelism(10)), - "COMBINED_PRODUCER" -> Options() - .set(FlatMapParallelism(150)) - ), - // It's important to have the BCELabelTransformFromUUADataRecord before ProdNegativeDownsampleTransform - onlinePreTransforms = Seq( - StagingUserBasedDownsampleTransform, - RichITransform(BCELabelTransformFromUUADataRecord), - NegativeDownsampleTransform), - enableUserReindexingNighthawkBtreeStore = true, - enableUserReindexingNighthawkHashStore = true, - userReindexingNighthawkBtreeStoreConfig = NighthawkUnderlyingStoreConfig( - serversetPath = - "/twitter/service/cache-user/test/nighthawk_timelines_real_time_aggregates_btree_test_api", - // NOTE: table names are prefixed to every pkey so keep it short - tableName = "u_r_v1", // (u)ser_(r)eindexing_v1 - // keep ttl <= 1 day because it's keyed on user, and we will have limited hit rates beyond 1 day - cacheTTL = 1.day - ), - userReindexingNighthawkHashStoreConfig = NighthawkUnderlyingStoreConfig( - // For prod: "/s/cache-user/nighthawk_timelines_real_time_aggregates_hash_api", - serversetPath = - "/twitter/service/cache-user/test/nighthawk_timelines_real_time_aggregates_hash_test_api", - // NOTE: table names are prefixed to every pkey so keep it short - tableName = "u_r_v1", // (u)ser_(r)eindexing_v1 - // keep ttl <= 1 day because it's keyed on user, and we will have limited hit rates beyond 1 day - cacheTTL = 1.day - ) - ) -} - -object TimelinesRealTimeAggregatesJob extends RealTimeAggregatesJobBase { - override lazy val statsReceiver = DefaultStatsReceiver.scope("timelines_real_time_aggregates") - override lazy val jobConfigs = TimelinesRealTimeAggregatesJobConfigs - override lazy val aggregatesToCompute = TimelinesOnlineAggregationConfig.AggregatesToCompute -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesStormAggregateSource.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesStormAggregateSource.scala deleted file mode 100644 index 2e096dc07..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TimelinesStormAggregateSource.scala +++ /dev/null @@ -1,185 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.clientapp.thriftscala.LogEvent -import com.twitter.conversions.DurationOps._ -import com.twitter.finagle.stats.Counter -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.constant.SharedFeatures -import com.twitter.snowflake.id.SnowflakeId -import com.twitter.summingbird._ -import com.twitter.summingbird.storm.Storm -import com.twitter.summingbird_internal.sources.AppId -import com.twitter.summingbird_internal.sources.storm.remote.ClientEventSourceScrooge2 -import com.twitter.timelines.data_processing.ad_hoc.suggests.common.AllScribeProcessor -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron.RealTimeAggregatesJobConfig -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.heron.StormAggregateSource -import com.twitter.timelines.prediction.adapters.client_log_event.ClientLogEventAdapter -import com.twitter.timelines.prediction.adapters.client_log_event.ProfileClientLogEventAdapter 
-import com.twitter.timelines.prediction.adapters.client_log_event.SearchClientLogEventAdapter -import com.twitter.timelines.prediction.adapters.client_log_event.UuaEventAdapter -import com.twitter.unified_user_actions.client.config.KafkaConfigs -import com.twitter.unified_user_actions.client.summingbird.UnifiedUserActionsSourceScrooge -import com.twitter.unified_user_actions.thriftscala.UnifiedUserAction -import scala.collection.JavaConverters._ - -/** - * Storm Producer for client events generated on Home, Profile, and Search - */ -class TimelinesStormAggregateSource extends StormAggregateSource { - - override val name = "timelines_rta" - override val timestampFeature = SharedFeatures.TIMESTAMP - - private lazy val TimelinesClientEventSourceName = "TL_EVENTS_SOURCE" - private lazy val ProfileClientEventSourceName = "PROFILE_EVENTS_SOURCE" - private lazy val SearchClientEventSourceName = "SEARCH_EVENTS_SOURCE" - private lazy val UuaEventSourceName = "UUA_EVENTS_SOURCE" - private lazy val CombinedProducerName = "COMBINED_PRODUCER" - private lazy val FeatureStoreProducerName = "FEATURE_STORE_PRODUCER" - - private def isNewUserEvent(event: LogEvent): Boolean = { - event.logBase.flatMap(_.userId).flatMap(SnowflakeId.timeFromIdOpt).exists(_.untilNow < 30.days) - } - - private def mkDataRecords(event: LogEvent, dataRecordCounter: Counter): Seq[DataRecord] = { - val dataRecords: Seq[DataRecord] = - if (AllScribeProcessor.isValidSuggestTweetEvent(event)) { - ClientLogEventAdapter.adaptToDataRecords(event).asScala - } else { - Seq.empty[DataRecord] - } - dataRecordCounter.incr(dataRecords.size) - dataRecords - } - - private def mkProfileDataRecords( - event: LogEvent, - dataRecordCounter: Counter - ): Seq[DataRecord] = { - val dataRecords: Seq[DataRecord] = - ProfileClientLogEventAdapter.adaptToDataRecords(event).asScala - dataRecordCounter.incr(dataRecords.size) - dataRecords - } - - private def mkSearchDataRecords( - event: LogEvent, - dataRecordCounter: Counter - ): Seq[DataRecord] = { - val dataRecords: Seq[DataRecord] = - SearchClientLogEventAdapter.adaptToDataRecords(event).asScala - dataRecordCounter.incr(dataRecords.size) - dataRecords - } - - private def mkUuaDataRecords( - event: UnifiedUserAction, - dataRecordCounter: Counter - ): Seq[DataRecord] = { - val dataRecords: Seq[DataRecord] = - UuaEventAdapter.adaptToDataRecords(event).asScala - dataRecordCounter.incr(dataRecords.size) - dataRecords - } - - override def build( - statsReceiver: StatsReceiver, - jobConfig: RealTimeAggregatesJobConfig - ): Producer[Storm, DataRecord] = { - lazy val scopedStatsReceiver = statsReceiver.scope(getClass.getSimpleName) - lazy val dataRecordCounter = scopedStatsReceiver.counter("dataRecord") - - // Home Timeline Engagements - // Step 1: => LogEvent - lazy val clientEventProducer: Producer[Storm, HomeEvent[LogEvent]] = - ClientEventSourceScrooge2( - appId = AppId(jobConfig.appId), - topic = "julep_client_event_suggests", - resumeAtLastReadOffset = false, - enableTls = true - ).source.map(HomeEvent[LogEvent]).name(TimelinesClientEventSourceName) - - // Profile Engagements - // Step 1: => LogEvent - lazy val profileClientEventProducer: Producer[Storm, ProfileEvent[LogEvent]] = - ClientEventSourceScrooge2( - appId = AppId(jobConfig.appId), - topic = "julep_client_event_profile_real_time_engagement_metrics", - resumeAtLastReadOffset = false, - enableTls = true - ).source - .map(ProfileEvent[LogEvent]) - .name(ProfileClientEventSourceName) - - // Search Engagements - // Step 1: => LogEvent - // Only process 
events for all users to save resource - lazy val searchClientEventProducer: Producer[Storm, SearchEvent[LogEvent]] = - ClientEventSourceScrooge2( - appId = AppId(jobConfig.appId), - topic = "julep_client_event_search_real_time_engagement_metrics", - resumeAtLastReadOffset = false, - enableTls = true - ).source - .map(SearchEvent[LogEvent]) - .name(SearchClientEventSourceName) - - // Unified User Actions (includes Home and other product surfaces) - lazy val uuaEventProducer: Producer[Storm, UuaEvent[UnifiedUserAction]] = - UnifiedUserActionsSourceScrooge( - appId = AppId(jobConfig.appId), - parallelism = 10, - kafkaConfig = KafkaConfigs.ProdUnifiedUserActionsEngagementOnly - ).source - .filter(StormAggregateSourceUtils.isUuaBCEEventsFromHome(_)) - .map(UuaEvent[UnifiedUserAction]) - .name(UuaEventSourceName) - - // Combined - // Step 2: - // (a) Combine - // (b) Transform LogEvent => Seq[DataRecord] - // (c) Apply sampler - lazy val combinedClientEventDataRecordProducer: Producer[Storm, Event[DataRecord]] = - profileClientEventProducer // This becomes the bottom branch - .merge(clientEventProducer) // This becomes the middle branch - .merge(searchClientEventProducer) - .merge(uuaEventProducer) // This becomes the top - .flatMap { // LogEvent => Seq[DataRecord] - case e: HomeEvent[LogEvent] => - mkDataRecords(e.event, dataRecordCounter).map(HomeEvent[DataRecord]) - case e: ProfileEvent[LogEvent] => - mkProfileDataRecords(e.event, dataRecordCounter).map(ProfileEvent[DataRecord]) - case e: SearchEvent[LogEvent] => - mkSearchDataRecords(e.event, dataRecordCounter).map(SearchEvent[DataRecord]) - case e: UuaEvent[UnifiedUserAction] => - mkUuaDataRecords( - e.event, - dataRecordCounter - ).map(UuaEvent[DataRecord]) - } - .flatMap { // Apply sampler - case e: HomeEvent[DataRecord] => - jobConfig.sequentiallyTransform(e.event).map(HomeEvent[DataRecord]) - case e: ProfileEvent[DataRecord] => - jobConfig.sequentiallyTransform(e.event).map(ProfileEvent[DataRecord]) - case e: SearchEvent[DataRecord] => - jobConfig.sequentiallyTransform(e.event).map(SearchEvent[DataRecord]) - case e: UuaEvent[DataRecord] => - jobConfig.sequentiallyTransform(e.event).map(UuaEvent[DataRecord]) - } - .name(CombinedProducerName) - - // Step 3: Join with Feature Store features - lazy val featureStoreDataRecordProducer: Producer[Storm, DataRecord] = - StormAggregateSourceUtils - .wrapByFeatureStoreClient( - underlyingProducer = combinedClientEventDataRecordProducer, - jobConfig = jobConfig, - scopedStatsReceiver = scopedStatsReceiver - ).map(_.event).name(FeatureStoreProducerName) - - featureStoreDataRecordProducer - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesAdapter.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesAdapter.scala deleted file mode 100644 index 0d5c06d7c..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesAdapter.scala +++ /dev/null @@ -1,35 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.Feature -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.featurestore.catalog.entities.core.Tweet -import com.twitter.ml.featurestore.catalog.features.trends.TweetTrendsScores -import com.twitter.ml.featurestore.lib.TweetId -import com.twitter.ml.featurestore.lib.data.PredictionRecord -import com.twitter.ml.featurestore.lib.data.PredictionRecordAdapter -import 
com.twitter.ml.featurestore.lib.feature.BoundFeature -import com.twitter.ml.featurestore.lib.feature.BoundFeatureSet -import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase -import java.util -import scala.collection.JavaConverters._ - -object TweetFeaturesAdapter extends TimelinesAdapterBase[PredictionRecord] { - - private val ContinuousFeatureMap: Map[BoundFeature[TweetId, Double], Feature.Continuous] = Map() - - val TweetFeaturesSet: BoundFeatureSet = new BoundFeatureSet(ContinuousFeatureMap.keys.toSet) - - val AllFeatures: Seq[Feature[_]] = - ContinuousFeatureMap.values.toSeq - - private val adapter = PredictionRecordAdapter.oneToOne(TweetFeaturesSet) - - override def getFeatureContext: FeatureContext = new FeatureContext(AllFeatures: _*) - - override def commonFeatures: Set[Feature[_]] = Set.empty - - override def adaptToDataRecords(record: PredictionRecord): util.List[DataRecord] = { - List(adapter.adaptToDataRecord(record)).asJava - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesReadableStore.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesReadableStore.scala deleted file mode 100644 index b461e179a..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TweetFeaturesReadableStore.scala +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.ml.api.DataRecord -import com.twitter.ml.featurestore.lib.TweetId -import com.twitter.ml.featurestore.lib.data.PredictionRecord -import com.twitter.ml.featurestore.lib.entity.Entity -import com.twitter.ml.featurestore.lib.online.{FeatureStoreClient, FeatureStoreRequest} -import com.twitter.storehaus.ReadableStore -import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase -import com.twitter.util.Future -import scala.collection.JavaConverters._ - -class TweetFeaturesReadableStore( - featureStoreClient: FeatureStoreClient, - tweetEntity: Entity[TweetId], - tweetFeaturesAdapter: TimelinesAdapterBase[PredictionRecord]) - extends ReadableStore[Set[Long], DataRecord] { - - override def multiGet[K <: Set[Long]](keys: Set[K]): Map[K, Future[Option[DataRecord]]] = { - val orderedKeys: Seq[K] = keys.toSeq - val featureStoreRequests: Seq[FeatureStoreRequest] = getFeatureStoreRequests(orderedKeys) - val predictionRecordsFut: Future[Seq[PredictionRecord]] = featureStoreClient( - featureStoreRequests) - - getDataRecordMap(orderedKeys, predictionRecordsFut) - } - - private def getFeatureStoreRequests[K <: Set[Long]]( - orderedKeys: Seq[K] - ): Seq[FeatureStoreRequest] = { - orderedKeys.map { key: Set[Long] => - FeatureStoreRequest( - entityIds = key.map { tweetId => tweetEntity.withId(TweetId(tweetId)) }.toSeq - ) - } - } - - private def getDataRecordMap[K <: Set[Long]]( - orderedKeys: Seq[K], - predictionRecordsFut: Future[Seq[PredictionRecord]] - ): Map[K, Future[Option[DataRecord]]] = { - orderedKeys.zipWithIndex.map { - case (tweetIdSet, index) => - val dataRecordFutOpt: Future[Option[DataRecord]] = predictionRecordsFut.map { - predictionRecords => - predictionRecords.lift(index).flatMap { predictionRecordAtIndex: PredictionRecord => - tweetFeaturesAdapter.adaptToDataRecords(predictionRecordAtIndex).asScala.headOption - } - } - (tweetIdSet, dataRecordFutOpt) - }.toMap - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TypeSafeRunner.scala 
b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TypeSafeRunner.scala deleted file mode 100644 index 92b6618e4..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/TypeSafeRunner.scala +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.summingbird_internal.runner.storm.GenericRunner - -object TypeSafeRunner { - def main(args: Array[String]): Unit = GenericRunner(args, TimelinesRealTimeAggregatesJob(_)) -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesAdapter.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesAdapter.scala deleted file mode 100644 index 8ff39938c..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesAdapter.scala +++ /dev/null @@ -1,108 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType.InferredGender -import com.twitter.dal.personal_data.thriftjava.PersonalDataType.UserState -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Text -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.Feature -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.api.RichDataRecord -import com.twitter.ml.featurestore.catalog.entities.core.User -import com.twitter.ml.featurestore.catalog.features.core.UserAccount -import com.twitter.ml.featurestore.catalog.features.geo.UserLocation -import com.twitter.ml.featurestore.catalog.features.magicrecs.UserActivity -import com.twitter.ml.featurestore.lib.EntityId -import com.twitter.ml.featurestore.lib.data.PredictionRecord -import com.twitter.ml.featurestore.lib.feature.BoundFeature -import com.twitter.ml.featurestore.lib.feature.BoundFeatureSet -import com.twitter.ml.featurestore.lib.UserId -import com.twitter.ml.featurestore.lib.{Discrete => FSDiscrete} -import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase -import com.twitter.timelines.prediction.features.user_health.UserHealthFeatures -import java.lang.{Boolean => JBoolean} -import java.lang.{String => JString} -import java.util -import scala.collection.JavaConverters._ - -object UserFeaturesAdapter extends TimelinesAdapterBase[PredictionRecord] { - val UserStateBoundFeature: BoundFeature[UserId, FSDiscrete] = UserActivity.UserState.bind(User) - - /** - * Boolean features about viewer's user state. 
- * enum UserState { - * NEW = 0, - * NEAR_ZERO = 1, - * VERY_LIGHT = 2, - * LIGHT = 3, - * MEDIUM_TWEETER = 4, - * MEDIUM_NON_TWEETER = 5, - * HEAVY_NON_TWEETER = 6, - * HEAVY_TWEETER = 7 - * }(persisted='true') - */ - val IS_USER_NEW = new Binary("timelines.user_state.is_user_new", Set(UserState).asJava) - val IS_USER_LIGHT = new Binary("timelines.user_state.is_user_light", Set(UserState).asJava) - val IS_USER_MEDIUM_TWEETER = - new Binary("timelines.user_state.is_user_medium_tweeter", Set(UserState).asJava) - val IS_USER_MEDIUM_NON_TWEETER = - new Binary("timelines.user_state.is_user_medium_non_tweeter", Set(UserState).asJava) - val IS_USER_HEAVY_NON_TWEETER = - new Binary("timelines.user_state.is_user_heavy_non_tweeter", Set(UserState).asJava) - val IS_USER_HEAVY_TWEETER = - new Binary("timelines.user_state.is_user_heavy_tweeter", Set(UserState).asJava) - val userStateToFeatureMap: Map[Long, Binary] = Map( - 0L -> IS_USER_NEW, - 1L -> IS_USER_LIGHT, - 2L -> IS_USER_LIGHT, - 3L -> IS_USER_LIGHT, - 4L -> IS_USER_MEDIUM_TWEETER, - 5L -> IS_USER_MEDIUM_NON_TWEETER, - 6L -> IS_USER_HEAVY_NON_TWEETER, - 7L -> IS_USER_HEAVY_TWEETER - ) - - val UserStateBooleanFeatures: Set[Feature[_]] = userStateToFeatureMap.values.toSet - - - val USER_COUNTRY_ID = new Text("geo.user_location.country_code") - val UserCountryCodeFeature: BoundFeature[UserId, String] = - UserLocation.CountryCodeAlpha2.bind(User) - val UserLocationFeatures: Set[Feature[_]] = Set(USER_COUNTRY_ID) - - private val UserVerifiedFeaturesSet = Set( - UserAccount.IsUserVerified.bind(User), - UserAccount.IsUserBlueVerified.bind(User), - UserAccount.IsUserGoldVerified.bind(User), - UserAccount.IsUserGrayVerified.bind(User) - ) - - val UserFeaturesSet: BoundFeatureSet = - BoundFeatureSet(UserStateBoundFeature, UserCountryCodeFeature) ++ - BoundFeatureSet(UserVerifiedFeaturesSet.asInstanceOf[Set[BoundFeature[_ <: EntityId, _]]]) - - private val allFeatures: Seq[Feature[_]] = - UserStateBooleanFeatures.toSeq ++ GenderBooleanFeatures.toSeq ++ - UserLocationFeatures.toSeq ++ Seq(UserHealthFeatures.IsUserVerifiedUnion) - - override def getFeatureContext: FeatureContext = new FeatureContext(allFeatures: _*) - override def commonFeatures: Set[Feature[_]] = Set.empty - - override def adaptToDataRecords(record: PredictionRecord): util.List[DataRecord] = { - val newRecord = new RichDataRecord(new DataRecord) - record - .getFeatureValue(UserStateBoundFeature) - .flatMap { userState => userStateToFeatureMap.get(userState.value) }.foreach { - booleanFeature => newRecord.setFeatureValue[JBoolean](booleanFeature, true) - } - record.getFeatureValue(UserCountryCodeFeature).foreach { countryCodeFeatureValue => - newRecord.setFeatureValue[JString](USER_COUNTRY_ID, countryCodeFeatureValue) - } - - val isUserVerifiedUnion = - UserVerifiedFeaturesSet.exists(feature => record.getFeatureValue(feature).getOrElse(false)) - newRecord.setFeatureValue[JBoolean](UserHealthFeatures.IsUserVerifiedUnion, isUserVerifiedUnion) - - List(newRecord.getRecord).asJava - } -} diff --git a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesReadableStore.scala b/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesReadableStore.scala deleted file mode 100644 index c1931c32b..000000000 --- a/src/scala/com/twitter/timelines/prediction/common/aggregates/real_time/UserFeaturesReadableStore.scala +++ /dev/null @@ -1,37 +0,0 @@ -package com.twitter.timelines.prediction.common.aggregates.real_time - -import 
com.twitter.ml.api.DataRecord -import com.twitter.ml.featurestore.lib.UserId -import com.twitter.ml.featurestore.lib.data.PredictionRecord -import com.twitter.ml.featurestore.lib.entity.Entity -import com.twitter.ml.featurestore.lib.online.{FeatureStoreClient, FeatureStoreRequest} -import com.twitter.storehaus.ReadableStore -import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase -import com.twitter.util.Future -import scala.collection.JavaConverters._ - -class UserFeaturesReadableStore( - featureStoreClient: FeatureStoreClient, - userEntity: Entity[UserId], - userFeaturesAdapter: TimelinesAdapterBase[PredictionRecord]) - extends ReadableStore[Set[Long], DataRecord] { - - override def multiGet[K <: Set[Long]](keys: Set[K]): Map[K, Future[Option[DataRecord]]] = { - val orderedKeys = keys.toSeq - val featureStoreRequests: Seq[FeatureStoreRequest] = orderedKeys.map { key: Set[Long] => - FeatureStoreRequest( - entityIds = key.map(userId => userEntity.withId(UserId(userId))).toSeq - ) - } - val predictionRecordsFut: Future[Seq[PredictionRecord]] = featureStoreClient( - featureStoreRequests) - - orderedKeys.zipWithIndex.map { - case (userId, index) => - val dataRecordFutOpt = predictionRecordsFut.map { predictionRecords => - userFeaturesAdapter.adaptToDataRecords(predictionRecords(index)).asScala.headOption - } - (userId, dataRecordFutOpt) - }.toMap - } -} diff --git a/src/scala/com/twitter/timelines/prediction/features/README.md b/src/scala/com/twitter/timelines/prediction/features/README.md deleted file mode 100644 index d42639a77..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/README.md +++ /dev/null @@ -1,6 +0,0 @@ -## Prediction Features - -This directory contains a collection of `Features` (`com.twitter.ml.api.Feature`) which are definitions of feature names and datatypes which allow the features to be efficiently processed and passed to the different ranking models. -By predefining the features with their names and datatypes, when features are being generated, scribed or used to score they can be identified with only a hash of their name. - -Not all of these features are used in the model, many are experimental or deprecated. 
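As a minimal sketch of that pattern (the package, object, and feature name below are hypothetical and not defined anywhere in this tree), a predefined `Binary` feature can be written into a `DataRecord` the same way the adapters under `prediction/common/aggregates/real_time` do, via `RichDataRecord`:

```scala
package com.twitter.timelines.prediction.features.example

import com.twitter.ml.api.DataRecord
import com.twitter.ml.api.RichDataRecord
import com.twitter.ml.api.Feature.Binary
import java.lang.{Boolean => JBoolean}

object ExampleFeatures {
  // Hypothetical feature: only the predefined name string (and its hash) travels
  // with the DataRecord when it is generated, scribed, or read back for scoring.
  val IS_EXAMPLE_ENGAGEMENT: Binary =
    new Binary("timelines.example.is_example_engagement")

  // Mirrors the adaptToDataRecords pattern used by the adapters in this directory.
  def toDataRecord(engaged: Boolean): DataRecord = {
    val record = new RichDataRecord(new DataRecord)
    record.setFeatureValue[JBoolean](IS_EXAMPLE_ENGAGEMENT, engaged)
    record.getRecord
  }
}
```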
\ No newline at end of file diff --git a/src/scala/com/twitter/timelines/prediction/features/client_log_event/BUILD b/src/scala/com/twitter/timelines/prediction/features/client_log_event/BUILD deleted file mode 100644 index 3d3c34092..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/client_log_event/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/scala/com/twitter/suggests/controller_data", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/timelineservice/server/suggests/logging:thrift-scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/client_log_event/ClientLogEventDataRecordFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/client_log_event/ClientLogEventDataRecordFeatures.scala deleted file mode 100644 index cccb99998..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/client_log_event/ClientLogEventDataRecordFeatures.scala +++ /dev/null @@ -1,169 +0,0 @@ -package com.twitter.timelines.prediction.features.client_log_event - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.Discrete -import scala.collection.JavaConverters._ -import com.twitter.timelineservice.suggests.logging.candidate_tweet_source_id.thriftscala.CandidateTweetSourceId - -object ClientLogEventDataRecordFeatures { - val HasConsumerVideo = new Binary( - "client_log_event.tweet.has_consumer_video", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val PhotoCount = new Continuous( - "client_log_event.tweet.photo_count", - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val HasImage = new Binary( - "client_log_event.tweet.has_image", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val IsReply = - new Binary("client_log_event.tweet.is_reply", Set(PublicReplies, PrivateReplies).asJava) - val IsRetweet = - new Binary("client_log_event.tweet.is_retweet", Set(PublicRetweets, PrivateRetweets).asJava) - val IsPromoted = - new Binary( - "client_log_event.tweet.is_promoted", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HasVisibleLink = new Binary( - "client_log_event.tweet.has_visible_link", - Set(UrlFoundFlag, PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HasHashtag = new Binary( - "client_log_event.tweet.has_hashtag", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val FromMutualFollow = new Binary("client_log_event.tweet.from_mutual_follow") - val IsInNetwork = new Binary("client_log_event.tweet.is_in_network") - val IsNotInNetwork = new Binary("client_log_event.tweet.is_not_in_network") - val FromRecap = new Binary("client_log_event.tweet.from_recap") - val FromRecycled = new Binary("client_log_event.tweet.from_recycled") - val FromActivity = new Binary("client_log_event.tweet.from_activity") - val FromSimcluster = new Binary("client_log_event.tweet.from_simcluster") - val FromErg = new Binary("client_log_event.tweet.from_erg") - val FromCroon = new Binary("client_log_event.tweet.from_croon") - val FromList = new Binary("client_log_event.tweet.from_list") - val FromRecTopic = 
new Binary("client_log_event.tweet.from_rec_topic") - val InjectedPosition = new Discrete("client_log_event.tweet.injectedPosition") - val TextOnly = new Binary("client_log_event.tweet.text_only") - val HasLikedBySocialContext = new Binary("client_log_event.tweet.has_liked_by_social_context") - val HasFollowedBySocialContext = new Binary( - "client_log_event.tweet.has_followed_by_social_context") - val HasTopicSocialContext = new Binary("client_log_event.tweet.has_topic_social_context") - val IsFollowedTopicTweet = new Binary("client_log_event.tweet.is_followed_topic_tweet") - val IsRecommendedTopicTweet = new Binary("client_log_event.tweet.is_recommended_topic_tweet") - val IsTweetAgeLessThan15Seconds = new Binary( - "client_log_event.tweet.tweet_age_less_than_15_seconds") - val IsTweetAgeLessThanOrEqualTo30Minutes = new Binary( - "client_log_event.tweet.tweet_age_lte_30_minutes") - val IsTweetAgeLessThanOrEqualTo1Hour = new Binary("client_log_event.tweet.tweet_age_lte_1_hour") - val IsTweetAgeLessThanOrEqualTo6Hours = new Binary("client_log_event.tweet.tweet_age_lte_6_hours") - val IsTweetAgeLessThanOrEqualTo12Hours = new Binary( - "client_log_event.tweet.tweet_age_lte_12_hours") - val IsTweetAgeGreaterThanOrEqualTo24Hours = new Binary( - "client_log_event.tweet.tweet_age_gte_24_hours") - val HasGreaterThanOrEqualTo100Favs = new Binary("client_log_event.tweet.has_gte_100_favs") - val HasGreaterThanOrEqualTo1KFavs = new Binary("client_log_event.tweet.has_gte_1k_favs") - val HasGreaterThanOrEqualTo10KFavs = new Binary("client_log_event.tweet.has_gte_10k_favs") - val HasGreaterThanOrEqualTo100KFavs = new Binary("client_log_event.tweet.has_gte_100k_favs") - val HasGreaterThanOrEqualTo10Retweets = new Binary("client_log_event.tweet.has_gte_10_retweets") - val HasGreaterThanOrEqualTo100Retweets = new Binary("client_log_event.tweet.has_gte_100_retweets") - val HasGreaterThanOrEqualTo1KRetweets = new Binary("client_log_event.tweet.has_gte_1k_retweets") - - val TweetTypeToFeatureMap: Map[String, Binary] = Map( - "link" -> HasVisibleLink, - "hashtag" -> HasHashtag, - "mutual_follow" -> FromMutualFollow, - "in_network" -> IsInNetwork, - "text_only" -> TextOnly, - "has_liked_by_social_context" -> HasLikedBySocialContext, - "has_followed_by_social_context" -> HasFollowedBySocialContext, - "has_topic_social_context" -> HasTopicSocialContext, - "is_followed_topic_tweet" -> IsFollowedTopicTweet, - "is_recommended_topic_tweet" -> IsRecommendedTopicTweet, - "tweet_age_less_than_15_seconds" -> IsTweetAgeLessThan15Seconds, - "tweet_age_lte_30_minutes" -> IsTweetAgeLessThanOrEqualTo30Minutes, - "tweet_age_lte_1_hour" -> IsTweetAgeLessThanOrEqualTo1Hour, - "tweet_age_lte_6_hours" -> IsTweetAgeLessThanOrEqualTo6Hours, - "tweet_age_lte_12_hours" -> IsTweetAgeLessThanOrEqualTo12Hours, - "tweet_age_gte_24_hours" -> IsTweetAgeGreaterThanOrEqualTo24Hours, - "has_gte_100_favs" -> HasGreaterThanOrEqualTo100Favs, - "has_gte_1k_favs" -> HasGreaterThanOrEqualTo1KFavs, - "has_gte_10k_favs" -> HasGreaterThanOrEqualTo10KFavs, - "has_gte_100k_favs" -> HasGreaterThanOrEqualTo100KFavs, - "has_gte_10_retweets" -> HasGreaterThanOrEqualTo10Retweets, - "has_gte_100_retweets" -> HasGreaterThanOrEqualTo100Retweets, - "has_gte_1k_retweets" -> HasGreaterThanOrEqualTo1KRetweets - ) - - val CandidateTweetSourceIdFeatureMap: Map[Int, Binary] = Map( - CandidateTweetSourceId.RecapTweet.value -> FromRecap, - CandidateTweetSourceId.RecycledTweet.value -> FromRecycled, - CandidateTweetSourceId.RecommendedTweet.value -> FromActivity, - 
CandidateTweetSourceId.Simcluster.value -> FromSimcluster, - CandidateTweetSourceId.ErgTweet.value -> FromErg, - CandidateTweetSourceId.CroonTopicTweet.value -> FromCroon, - CandidateTweetSourceId.CroonTweet.value -> FromCroon, - CandidateTweetSourceId.ListTweet.value -> FromList, - CandidateTweetSourceId.RecommendedTopicTweet.value -> FromRecTopic - ) - - val TweetFeaturesV2: Set[Feature[_]] = Set( - HasImage, - IsReply, - IsRetweet, - HasVisibleLink, - HasHashtag, - FromMutualFollow, - IsInNetwork - ) - - val ContentTweetTypeFeatures: Set[Feature[_]] = Set( - HasImage, - HasVisibleLink, - HasHashtag, - TextOnly, - HasVisibleLink - ) - - val FreshnessTweetTypeFeatures: Set[Feature[_]] = Set( - IsTweetAgeLessThan15Seconds, - IsTweetAgeLessThanOrEqualTo30Minutes, - IsTweetAgeLessThanOrEqualTo1Hour, - IsTweetAgeLessThanOrEqualTo6Hours, - IsTweetAgeLessThanOrEqualTo12Hours, - IsTweetAgeGreaterThanOrEqualTo24Hours - ) - - val SocialProofTweetTypeFeatures: Set[Feature[_]] = Set( - HasLikedBySocialContext, - HasFollowedBySocialContext, - HasTopicSocialContext - ) - - val TopicTweetPreferenceTweetTypeFeatures: Set[Feature[_]] = Set( - IsFollowedTopicTweet, - IsRecommendedTopicTweet - ) - - val TweetPopularityTweetTypeFeatures: Set[Feature[_]] = Set( - HasGreaterThanOrEqualTo100Favs, - HasGreaterThanOrEqualTo1KFavs, - HasGreaterThanOrEqualTo10KFavs, - HasGreaterThanOrEqualTo100KFavs, - HasGreaterThanOrEqualTo10Retweets, - HasGreaterThanOrEqualTo100Retweets, - HasGreaterThanOrEqualTo1KRetweets - ) - - val UserGraphInteractionTweetTypeFeatures: Set[Feature[_]] = Set( - IsInNetwork, - FromMutualFollow, - IsNotInNetwork, - IsPromoted - ) - - val UserContentPreferenceTweetTypeFeatures: Set[Feature[_]] = - ContentTweetTypeFeatures ++ FreshnessTweetTypeFeatures ++ SocialProofTweetTypeFeatures ++ TopicTweetPreferenceTweetTypeFeatures ++ TweetPopularityTweetTypeFeatures ++ UserGraphInteractionTweetTypeFeatures - val AuthorContentPreferenceTweetTypeFeatures: Set[Feature[_]] = - Set(IsInNetwork, FromMutualFollow, IsNotInNetwork) ++ ContentTweetTypeFeatures -} diff --git a/src/scala/com/twitter/timelines/prediction/features/common/BUILD b/src/scala/com/twitter/timelines/prediction/features/common/BUILD deleted file mode 100644 index bfbe764c7..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/common/BUILD +++ /dev/null @@ -1,11 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/ml/api:data-java", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/common/CombinedFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/common/CombinedFeatures.scala deleted file mode 100644 index d995fe2b0..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/common/CombinedFeatures.scala +++ /dev/null @@ -1,536 +0,0 @@ -package com.twitter.timelines.prediction.features.common - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature -import com.twitter.ml.api.FeatureType -import com.twitter.ml.api.Feature.Binary -import java.lang.{Boolean => JBoolean} -import scala.collection.JavaConverters._ - -object CombinedFeatures { - val IS_CLICKED = - new Binary("timelines.engagement.is_clicked", Set(TweetsClicked, EngagementsPrivate).asJava) - val 
IS_DWELLED = - new Binary("timelines.engagement.is_dwelled", Set(TweetsViewed, EngagementsPrivate).asJava) - val IS_DWELLED_IN_BOUNDS_V1 = new Binary( - "timelines.engagement.is_dwelled_in_bounds_v1", - Set(TweetsViewed, EngagementsPrivate).asJava) - val IS_FAVORITED = new Binary( - "timelines.engagement.is_favorited", - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_FOLLOWED = new Binary( - "timelines.engagement.is_followed", - Set(EngagementsPrivate, EngagementsPublic, Follow).asJava) - val IS_IMPRESSED = - new Binary("timelines.engagement.is_impressed", Set(TweetsViewed, EngagementsPrivate).asJava) - val IS_OPEN_LINKED = new Binary( - "timelines.engagement.is_open_linked", - Set(EngagementsPrivate, LinksClickedOn).asJava) - val IS_PHOTO_EXPANDED = new Binary( - "timelines.engagement.is_photo_expanded", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED = new Binary( - "timelines.engagement.is_profile_clicked", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_QUOTED = new Binary( - "timelines.engagement.is_quoted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED = new Binary( - "timelines.engagement.is_replied", - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_RETWEETED = new Binary( - "timelines.engagement.is_retweeted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_RETWEETED_WITHOUT_QUOTE = new Binary( - "timelines.enagagement.is_retweeted_without_quote", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_SHARE_DM_CLICKED = - new Binary("timelines.engagement.is_tweet_share_dm_clicked", Set(EngagementsPrivate).asJava) - val IS_SHARE_DM_SENT = - new Binary("timelines.engagement.is_tweet_share_dm_sent", Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_25 = new Binary( - "timelines.engagement.is_video_playback_25", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_50 = new Binary( - "timelines.engagement.is_video_playback_50", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_75 = new Binary( - "timelines.engagement.is_video_playback_75", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_95 = new Binary( - "timelines.engagement.is_video_playback_95", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_COMPLETE = new Binary( - "timelines.engagement.is_video_playback_complete", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_START = new Binary( - "timelines.engagement.is_video_playback_start", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_VIEWED = new Binary( - "timelines.engagement.is_video_viewed", - Set(MediaEngagementActivities, EngagementsPrivate).asJava) - val IS_VIDEO_QUALITY_VIEWED = new Binary( - "timelines.engagement.is_video_quality_viewed", - Set(MediaEngagementActivities, EngagementsPrivate).asJava - ) - // v1: post click engagements: fav, reply - val IS_GOOD_CLICKED_CONVO_DESC_V1 = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_favorited_or_replied", - Set( - TweetsClicked, - PublicLikes, - PrivateLikes, - PublicReplies, - PrivateReplies, - EngagementsPrivate, - EngagementsPublic).asJava) - // v2: post click engagements: click - val IS_GOOD_CLICKED_CONVO_DESC_V2 = 
new Binary( - "timelines.engagement.is_good_clicked_convo_desc_v2", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_WITH_DWELL_SUM_GTE_60S = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_favorited_or_replied_or_dwell_sum_gte_60_secs", - Set( - TweetsClicked, - PublicLikes, - PrivateLikes, - PublicReplies, - PrivateReplies, - EngagementsPrivate, - EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_FAVORITED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_favorited", - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_REPLIED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_replied", - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_RETWEETED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_retweeted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_CLICKED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_clicked", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_FOLLOWED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_followed", - Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_SHARE_DM_CLICKED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_share_dm_clicked", - Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_PROFILE_CLICKED = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_profile_clicked", - Set(EngagementsPrivate).asJava) - - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_0 = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_uam_gt_0", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_1 = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_uam_gt_1", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_2 = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_uam_gt_2", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_3 = new Binary( - "timelines.engagement.is_good_clicked_convo_desc_uam_gt_3", - Set(EngagementsPrivate, EngagementsPublic).asJava) - - val IS_TWEET_DETAIL_DWELLED = new Binary( - "timelines.engagement.is_tweet_detail_dwelled", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_8_SEC = new Binary( - "timelines.engagement.is_tweet_detail_dwelled_8_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_15_SEC = new Binary( - "timelines.engagement.is_tweet_detail_dwelled_15_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_25_SEC = new Binary( - "timelines.engagement.is_tweet_detail_dwelled_25_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_30_SEC = new Binary( - "timelines.engagement.is_tweet_detail_dwelled_30_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_PROFILE_DWELLED = new Binary( - "timelines.engagement.is_profile_dwelled", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_10_SEC = new Binary( - "timelines.engagement.is_profile_dwelled_10_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_20_SEC = new Binary( - "timelines.engagement.is_profile_dwelled_20_sec", - 
Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_30_SEC = new Binary( - "timelines.engagement.is_profile_dwelled_30_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED = new Binary( - "timelines.engagement.is_fullscreen_video_dwelled", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_5_SEC = new Binary( - "timelines.engagement.is_fullscreen_video_dwelled_5_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_10_SEC = new Binary( - "timelines.engagement.is_fullscreen_video_dwelled_10_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_20_SEC = new Binary( - "timelines.engagement.is_fullscreen_video_dwelled_20_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_30_SEC = new Binary( - "timelines.engagement.is_fullscreen_video_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_15_SEC = new Binary( - "timelines.engagement.is_link_dwelled_15_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_30_SEC = new Binary( - "timelines.engagement.is_link_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_60_SEC = new Binary( - "timelines.engagement.is_link_dwelled_60_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_HOME_LATEST_VISITED = - new Binary("timelines.engagement.is_home_latest_visited", Set(EngagementsPrivate).asJava) - - val IS_BOOKMARKED = - new Binary("timelines.engagement.is_bookmarked", Set(EngagementsPrivate).asJava) - val IS_SHARED = - new Binary("timelines.engagement.is_shared", Set(EngagementsPrivate).asJava) - val IS_SHARE_MENU_CLICKED = - new Binary("timelines.engagement.is_share_menu_clicked", Set(EngagementsPrivate).asJava) - - // Negative engagements - val IS_DONT_LIKE = new Binary("timelines.engagement.is_dont_like", Set(EngagementsPrivate).asJava) - val IS_BLOCK_CLICKED = new Binary( - "timelines.engagement.is_block_clicked", - Set(Blocks, TweetsClicked, EngagementsPrivate, EngagementsPublic).asJava) - val IS_BLOCK_DIALOG_BLOCKED = new Binary( - "timelines.engagement.is_block_dialog_blocked", - Set(Blocks, EngagementsPrivate, EngagementsPublic).asJava) - val IS_MUTE_CLICKED = new Binary( - "timelines.engagement.is_mute_clicked", - Set(Mutes, TweetsClicked, EngagementsPrivate).asJava) - val IS_MUTE_DIALOG_MUTED = - new Binary("timelines.engagement.is_mute_dialog_muted", Set(Mutes, EngagementsPrivate).asJava) - val IS_REPORT_TWEET_CLICKED = new Binary( - "timelines.engagement.is_report_tweet_clicked", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_CARET_CLICKED = - new Binary("timelines.engagement.is_caret_clicked", Set(EngagementsPrivate).asJava) - val IS_NOT_ABOUT_TOPIC = - new Binary("timelines.engagement.is_not_about_topic", Set(EngagementsPrivate).asJava) - val IS_NOT_RECENT = - new Binary("timelines.engagement.is_not_recent", Set(EngagementsPrivate).asJava) - val IS_NOT_RELEVANT = - new Binary("timelines.engagement.is_not_relevant", Set(EngagementsPrivate).asJava) - val IS_SEE_FEWER = - new 
Binary("timelines.engagement.is_see_fewer", Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC = - new Binary("timelines.engagement.is_unfollow_topic", Set(EngagementsPrivate).asJava) - val IS_FOLLOW_TOPIC = - new Binary("timelines.engagement.is_follow_topic", Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN_TOPIC = - new Binary("timelines.engagement.is_not_interested_in_topic", Set(EngagementsPrivate).asJava) - val IS_NEGATIVE_FEEDBACK = - new Binary("timelines.engagement.is_negative_feedback", Set(EngagementsPrivate).asJava) - val IS_IMPLICIT_POSITIVE_FEEDBACK_UNION = - new Binary( - "timelines.engagement.is_implicit_positive_feedback_union", - Set(EngagementsPrivate).asJava) - val IS_EXPLICIT_POSITIVE_FEEDBACK_UNION = - new Binary( - "timelines.engagement.is_explicit_positive_feedback_union", - Set(EngagementsPrivate).asJava) - val IS_ALL_NEGATIVE_FEEDBACK_UNION = - new Binary( - "timelines.engagement.is_all_negative_feedback_union", - Set(EngagementsPrivate).asJava) - // Reciprocal engagements for reply forward engagement - val IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_impressed_by_author", - Set(EngagementsPrivate).asJava) - val IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_favorited_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateLikes, PublicLikes).asJava) - val IS_REPLIED_REPLY_QUOTED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_quoted_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava) - val IS_REPLIED_REPLY_REPLIED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_replied_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateReplies, PublicReplies).asJava) - val IS_REPLIED_REPLY_RETWEETED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_retweeted_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava) - val IS_REPLIED_REPLY_BLOCKED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_blocked_by_author", - Set(Blocks, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_FOLLOWED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_followed_by_author", - Set(EngagementsPrivate, EngagementsPublic, Follow).asJava) - val IS_REPLIED_REPLY_UNFOLLOWED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_unfollowed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_MUTED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_muted_by_author", - Set(Mutes, EngagementsPrivate).asJava) - val IS_REPLIED_REPLY_REPORTED_BY_AUTHOR = new Binary( - "timelines.engagement.is_replied_reply_reported_by_author", - Set(EngagementsPrivate).asJava) - - // Reciprocal engagements for fav forward engagement - val IS_FAVORITED_FAV_FAVORITED_BY_AUTHOR = new Binary( - "timelines.engagement.is_favorited_fav_favorited_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateLikes, PublicLikes).asJava - ) - val IS_FAVORITED_FAV_REPLIED_BY_AUTHOR = new Binary( - "timelines.engagement.is_favorited_fav_replied_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateReplies, PublicReplies).asJava - ) - val IS_FAVORITED_FAV_RETWEETED_BY_AUTHOR = new Binary( - "timelines.engagement.is_favorited_fav_retweeted_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava - ) - val 
IS_FAVORITED_FAV_FOLLOWED_BY_AUTHOR = new Binary( - "timelines.engagement.is_favorited_fav_followed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava - ) - - // define good profile click by considering following engagements (follow, fav, reply, retweet, etc.) at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_FOLLOW = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_follow", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, Follow).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_FAV = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_fav", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateLikes, PublicLikes).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_REPLY = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_reply", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateReplies, PublicReplies).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_RETWEET = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_retweet", - Set( - ProfilesViewed, - ProfilesClicked, - EngagementsPrivate, - PrivateRetweets, - PublicRetweets).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_CLICK = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_tweet_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, TweetsClicked).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_SHARE_DM_CLICK = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_share_dm_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of all binary features above - val IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_engaged", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, EngagementsPublic).asJava) - - // define bad profile click by considering following engagements (user report, tweet report, mute, block, etc) at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_USER_REPORT_CLICK = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_user_report_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_REPORT_CLICK = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_tweet_report_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_MUTE = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_mute", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_BLOCK = new Binary( - "timelines.engagement.is_profile_clicked_and_profile_block", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of bad profile click engagements and existing negative feedback - val IS_NEGATIVE_FEEDBACK_V2 = new Binary( - "timelines.engagement.is_negative_feedback_v2", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_NEGATIVE_FEEDBACK_UNION = new Binary( - "timelines.engagement.is_negative_feedback_union", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // don't like, mute or profile page -> mute - val IS_WEAK_NEGATIVE_FEEDBACK = new Binary( - "timelines.engagement.is_weak_negative_feedback", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // report, block or profile page -> report, block - val IS_STRONG_NEGATIVE_FEEDBACK = new Binary( - 
"timelines.engagement.is_strong_negative_feedback", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // engagement for following user from any surface area - val IS_FOLLOWED_FROM_ANY_SURFACE_AREA = new Binary( - "timelines.engagement.is_followed_from_any_surface_area", - Set(EngagementsPublic, EngagementsPrivate).asJava) - val IS_RELEVANCE_PROMPT_YES_CLICKED = new Binary( - "timelines.engagement.is_relevance_prompt_yes_clicked", - Set(EngagementsPublic, EngagementsPrivate).asJava) - - // Reply downvote engagements - val IS_REPLY_DOWNVOTED = - new Binary("timelines.engagement.is_reply_downvoted", Set(EngagementsPrivate).asJava) - val IS_REPLY_DOWNVOTE_REMOVED = - new Binary("timelines.engagement.is_reply_downvote_removed", Set(EngagementsPrivate).asJava) - - /** - * Contains all engagements that are used/consumed by real-time - * aggregates summingbird jobs. These engagements need to be - * extractable from [[ClientEvent]]. - */ - val EngagementsRealTime: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_DWELLED, - IS_FAVORITED, - IS_FOLLOWED, - IS_OPEN_LINKED, - IS_PHOTO_EXPANDED, - IS_PROFILE_CLICKED, - IS_QUOTED, - IS_REPLIED, - IS_RETWEETED, - IS_RETWEETED_WITHOUT_QUOTE, - IS_SHARE_DM_CLICKED, - IS_SHARE_DM_SENT, - IS_VIDEO_PLAYBACK_50, - IS_VIDEO_VIEWED, - IS_VIDEO_QUALITY_VIEWED - ) - - val NegativeEngagementsRealTime: Set[Feature[JBoolean]] = Set( - IS_REPORT_TWEET_CLICKED, - IS_BLOCK_CLICKED, - IS_MUTE_CLICKED - ) - - val NegativeEngagementsRealTimeDontLike: Set[Feature[JBoolean]] = Set( - IS_DONT_LIKE - ) - - val NegativeEngagementsSecondary: Set[Feature[JBoolean]] = Set( - IS_NOT_INTERESTED_IN_TOPIC, - IS_NOT_ABOUT_TOPIC, - IS_NOT_RECENT, - IS_NOT_RELEVANT, - IS_SEE_FEWER, - IS_UNFOLLOW_TOPIC - ) - - val PrivateEngagements: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_DWELLED, - IS_OPEN_LINKED, - IS_PHOTO_EXPANDED, - IS_PROFILE_CLICKED, - IS_QUOTED, - IS_VIDEO_PLAYBACK_50, - IS_VIDEO_QUALITY_VIEWED - ) - - val ImpressedEngagements: Set[Feature[JBoolean]] = Set( - IS_IMPRESSED - ) - - val PrivateEngagementsV2: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_OPEN_LINKED, - IS_PHOTO_EXPANDED, - IS_PROFILE_CLICKED, - IS_VIDEO_PLAYBACK_50, - IS_VIDEO_QUALITY_VIEWED - ) ++ ImpressedEngagements - - val CoreEngagements: Set[Feature[JBoolean]] = Set( - IS_FAVORITED, - IS_REPLIED, - IS_RETWEETED - ) - - val DwellEngagements: Set[Feature[JBoolean]] = Set( - IS_DWELLED - ) - - val PrivateCoreEngagements: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_OPEN_LINKED, - IS_PHOTO_EXPANDED, - IS_VIDEO_PLAYBACK_50, - IS_VIDEO_QUALITY_VIEWED - ) - - val ConditionalEngagements: Set[Feature[JBoolean]] = Set( - IS_GOOD_CLICKED_CONVO_DESC_V1, - IS_GOOD_CLICKED_CONVO_DESC_V2, - IS_GOOD_CLICKED_WITH_DWELL_SUM_GTE_60S - ) - - val ShareEngagements: Set[Feature[JBoolean]] = Set( - IS_SHARED, - IS_SHARE_MENU_CLICKED - ) - - val BookmarkEngagements: Set[Feature[JBoolean]] = Set( - IS_BOOKMARKED - ) - - val TweetDetailDwellEngagements: Set[Feature[JBoolean]] = Set( - IS_TWEET_DETAIL_DWELLED, - IS_TWEET_DETAIL_DWELLED_8_SEC, - IS_TWEET_DETAIL_DWELLED_15_SEC, - IS_TWEET_DETAIL_DWELLED_25_SEC, - IS_TWEET_DETAIL_DWELLED_30_SEC - ) - - val ProfileDwellEngagements: Set[Feature[JBoolean]] = Set( - IS_PROFILE_DWELLED, - IS_PROFILE_DWELLED_10_SEC, - IS_PROFILE_DWELLED_20_SEC, - IS_PROFILE_DWELLED_30_SEC - ) - - val FullscreenVideoDwellEngagements: Set[Feature[JBoolean]] = Set( - IS_FULLSCREEN_VIDEO_DWELLED, - IS_FULLSCREEN_VIDEO_DWELLED_5_SEC, - IS_FULLSCREEN_VIDEO_DWELLED_10_SEC, - 
IS_FULLSCREEN_VIDEO_DWELLED_20_SEC, - IS_FULLSCREEN_VIDEO_DWELLED_30_SEC - ) - - // Please do not add new engagements here until having estimated the impact - // to capacity requirements. User-author real-time aggregates have a very - // large key space. - val UserAuthorEngagements: Set[Feature[JBoolean]] = CoreEngagements ++ DwellEngagements ++ Set( - IS_CLICKED, - IS_PROFILE_CLICKED, - IS_PHOTO_EXPANDED, - IS_VIDEO_PLAYBACK_50, - IS_NEGATIVE_FEEDBACK_UNION - ) - - val ImplicitPositiveEngagements: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_DWELLED, - IS_OPEN_LINKED, - IS_PROFILE_CLICKED, - IS_QUOTED, - IS_VIDEO_PLAYBACK_50, - IS_VIDEO_QUALITY_VIEWED, - IS_TWEET_DETAIL_DWELLED, - IS_GOOD_CLICKED_CONVO_DESC_V1, - IS_GOOD_CLICKED_CONVO_DESC_V2, - IS_SHARED, - IS_SHARE_MENU_CLICKED, - IS_SHARE_DM_SENT, - IS_SHARE_DM_CLICKED - ) - - val ExplicitPositiveEngagements: Set[Feature[JBoolean]] = CoreEngagements ++ Set( - IS_FOLLOWED, - IS_QUOTED - ) - - val AllNegativeEngagements: Set[Feature[JBoolean]] = - NegativeEngagementsRealTime ++ NegativeEngagementsRealTimeDontLike ++ Set( - IS_NOT_RECENT, - IS_NOT_RELEVANT, - IS_SEE_FEWER - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/common/NonHomeLabelFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/common/NonHomeLabelFeatures.scala deleted file mode 100644 index 369b48b39..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/common/NonHomeLabelFeatures.scala +++ /dev/null @@ -1,97 +0,0 @@ -package com.twitter.timelines.prediction.features.common - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature -import com.twitter.ml.api.Feature.Binary -import java.lang.{Boolean => JBoolean} -import scala.collection.JavaConverters._ - -object ProfileLabelFeatures { - private val prefix = "profile" - - val IS_CLICKED = - new Binary(s"${prefix}.engagement.is_clicked", Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_DWELLED = - new Binary(s"${prefix}.engagement.is_dwelled", Set(TweetsViewed, EngagementsPrivate).asJava) - val IS_FAVORITED = new Binary( - s"${prefix}.engagement.is_favorited", - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED = new Binary( - s"${prefix}.engagement.is_replied", - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_RETWEETED = new Binary( - s"${prefix}.engagement.is_retweeted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - - // Negative engagements - val IS_DONT_LIKE = - new Binary(s"${prefix}.engagement.is_dont_like", Set(EngagementsPrivate).asJava) - val IS_BLOCK_CLICKED = new Binary( - s"${prefix}.engagement.is_block_clicked", - Set(Blocks, TweetsClicked, EngagementsPrivate, EngagementsPublic).asJava) - val IS_MUTE_CLICKED = new Binary( - s"${prefix}.engagement.is_mute_clicked", - Set(Mutes, TweetsClicked, EngagementsPrivate).asJava) - val IS_REPORT_TWEET_CLICKED = new Binary( - s"${prefix}.engagement.is_report_tweet_clicked", - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_NEGATIVE_FEEDBACK_UNION = new Binary( - s"${prefix}.engagement.is_negative_feedback_union", - Set(EngagementsPrivate, Blocks, Mutes, TweetsClicked, EngagementsPublic).asJava) - - val CoreEngagements: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_DWELLED, - IS_FAVORITED, - IS_REPLIED, - IS_RETWEETED - ) - - val NegativeEngagements: Set[Feature[JBoolean]] = Set( - IS_DONT_LIKE, - IS_BLOCK_CLICKED, - 
IS_MUTE_CLICKED, - IS_REPORT_TWEET_CLICKED - ) - -} - -object SearchLabelFeatures { - private val prefix = "search" - - val IS_CLICKED = - new Binary(s"${prefix}.engagement.is_clicked", Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_DWELLED = - new Binary(s"${prefix}.engagement.is_dwelled", Set(TweetsViewed, EngagementsPrivate).asJava) - val IS_FAVORITED = new Binary( - s"${prefix}.engagement.is_favorited", - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED = new Binary( - s"${prefix}.engagement.is_replied", - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_RETWEETED = new Binary( - s"${prefix}.engagement.is_retweeted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_PROFILE_CLICKED_SEARCH_RESULT_USER = new Binary( - s"${prefix}.engagement.is_profile_clicked_search_result_user", - Set(ProfilesClicked, ProfilesViewed, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_SEARCH_RESULT_TWEET = new Binary( - s"${prefix}.engagement.is_profile_clicked_search_result_tweet", - Set(ProfilesClicked, ProfilesViewed, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_TYPEAHEAD_USER = new Binary( - s"${prefix}.engagement.is_profile_clicked_typeahead_user", - Set(ProfilesClicked, ProfilesViewed, EngagementsPrivate).asJava) - - val CoreEngagements: Set[Feature[JBoolean]] = Set( - IS_CLICKED, - IS_DWELLED, - IS_FAVORITED, - IS_REPLIED, - IS_RETWEETED, - IS_PROFILE_CLICKED_SEARCH_RESULT_USER, - IS_PROFILE_CLICKED_SEARCH_RESULT_TWEET, - IS_PROFILE_CLICKED_TYPEAHEAD_USER - ) -} -// Add Tweet Detail labels later diff --git a/src/scala/com/twitter/timelines/prediction/features/common/TimelinesSharedFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/common/TimelinesSharedFeatures.scala deleted file mode 100644 index 99698530f..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/common/TimelinesSharedFeatures.scala +++ /dev/null @@ -1,759 +0,0 @@ -package com.twitter.timelines.prediction.features.common - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.Discrete -import com.twitter.ml.api.Feature.SparseBinary -import com.twitter.ml.api.Feature.SparseContinuous -import com.twitter.ml.api.Feature.Text -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup -import scala.collection.JavaConverters._ - -object TimelinesSharedFeatures extends TimelinesSharedFeatures("") -object InReplyToTweetTimelinesSharedFeatures extends TimelinesSharedFeatures("in_reply_to_tweet") - -/** - * Defines shared features - */ -class TimelinesSharedFeatures(prefix: String) { - private def name(featureName: String): String = { - if (prefix.nonEmpty) { - s"$prefix.$featureName" - } else { - featureName - } - } - - // meta - val EXPERIMENT_META = new SparseBinary( - name("timelines.meta.experiment_meta"), - Set(ExperimentId, ExperimentName).asJava) - - // historically used in the "combined models" to distinguish in-network and out of network tweets. - // now the feature denotes which adapter (recap or rectweet) was used to generate the datarecords. - // and is used by the data collection pipeline to split the training data. 
- val INJECTION_TYPE = new Discrete(name("timelines.meta.injection_type")) - - // Used to indicate which injection module is this - val INJECTION_MODULE_NAME = new Text(name("timelines.meta.injection_module_name")) - - val LIST_ID = new Discrete(name("timelines.meta.list_id")) - val LIST_IS_PINNED = new Binary(name("timelines.meta.list_is_pinned")) - - // internal id per each PS request. mainly to join back commomn features and candidate features later - val PREDICTION_REQUEST_ID = new Discrete(name("timelines.meta.prediction_request_id")) - // internal id per each TLM request. mainly to deduplicate re-served cached tweets in logging - val SERVED_REQUEST_ID = new Discrete(name("timelines.meta.served_request_id")) - // internal id used for join key in kafka logging, equal to servedRequestId if tweet is cached, - // else equal to predictionRequestId - val SERVED_ID = new Discrete(name("timelines.meta.served_id")) - val REQUEST_JOIN_ID = new Discrete(name("timelines.meta.request_join_id")) - - // Internal boolean flag per tweet, whether the tweet is served from RankedTweetsCache: TQ-14050 - // this feature should not be trained on, blacklisted in feature_config: D838346 - val IS_READ_FROM_CACHE = new Binary(name("timelines.meta.is_read_from_cache")) - - // model score discounts - val PHOTO_DISCOUNT = new Continuous(name("timelines.score_discounts.photo")) - val VIDEO_DISCOUNT = new Continuous(name("timelines.score_discounts.video")) - val TWEET_HEIGHT_DISCOUNT = new Continuous(name("timelines.score_discounts.tweet_height")) - val TOXICITY_DISCOUNT = new Continuous(name("timelines.score_discounts.toxicity")) - - // engagements - val ENGAGEMENT_TYPE = new Discrete(name("timelines.engagement.type")) - val PREDICTED_IS_FAVORITED = - new Continuous(name("timelines.engagement_predicted.is_favorited"), Set(EngagementScore).asJava) - val PREDICTED_IS_RETWEETED = - new Continuous(name("timelines.engagement_predicted.is_retweeted"), Set(EngagementScore).asJava) - val PREDICTED_IS_QUOTED = - new Continuous(name("timelines.engagement_predicted.is_quoted"), Set(EngagementScore).asJava) - val PREDICTED_IS_REPLIED = - new Continuous(name("timelines.engagement_predicted.is_replied"), Set(EngagementScore).asJava) - val PREDICTED_IS_OPEN_LINKED = new Continuous( - name("timelines.engagement_predicted.is_open_linked"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_OPEN_LINK = new Continuous( - name("timelines.engagement_predicted.is_good_open_link"), - Set(EngagementScore).asJava) - val PREDICTED_IS_PROFILE_CLICKED = new Continuous( - name("timelines.engagement_predicted.is_profile_clicked"), - Set(EngagementScore).asJava - ) - val PREDICTED_IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED = new Continuous( - name("timelines.engagement_predicted.is_profile_clicked_and_profile_engaged"), - Set(EngagementScore).asJava - ) - val PREDICTED_IS_CLICKED = - new Continuous(name("timelines.engagement_predicted.is_clicked"), Set(EngagementScore).asJava) - val PREDICTED_IS_PHOTO_EXPANDED = new Continuous( - name("timelines.engagement_predicted.is_photo_expanded"), - Set(EngagementScore).asJava - ) - val PREDICTED_IS_FOLLOWED = - new Continuous(name("timelines.engagement_predicted.is_followed"), Set(EngagementScore).asJava) - val PREDICTED_IS_DONT_LIKE = - new Continuous(name("timelines.engagement_predicted.is_dont_like"), Set(EngagementScore).asJava) - val PREDICTED_IS_VIDEO_PLAYBACK_50 = new Continuous( - name("timelines.engagement_predicted.is_video_playback_50"), - Set(EngagementScore).asJava - ) - val 
PREDICTED_IS_VIDEO_QUALITY_VIEWED = new Continuous( - name("timelines.engagement_predicted.is_video_quality_viewed"), - Set(EngagementScore).asJava - ) - val PREDICTED_IS_GOOD_CLICKED_V1 = new Continuous( - name("timelines.engagement_predicted.is_good_clicked_convo_desc_favorited_or_replied"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_CLICKED_V2 = new Continuous( - name("timelines.engagement_predicted.is_good_clicked_convo_desc_v2"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_8_SEC = new Continuous( - name("timelines.engagement_predicted.is_tweet_detail_dwelled_8_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_15_SEC = new Continuous( - name("timelines.engagement_predicted.is_tweet_detail_dwelled_15_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_25_SEC = new Continuous( - name("timelines.engagement_predicted.is_tweet_detail_dwelled_25_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_30_SEC = new Continuous( - name("timelines.engagement_predicted.is_tweet_detail_dwelled_30_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_CLICKED_WITH_DWELL_SUM_GTE_60S = new Continuous( - name( - "timelines.engagement_predicted.is_good_clicked_convo_desc_favorited_or_replied_or_dwell_sum_gte_60_secs"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FAVORITED_FAV_ENGAGED_BY_AUTHOR = new Continuous( - name("timelines.engagement_predicted.is_favorited_fav_engaged_by_author"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_REPORT_TWEET_CLICKED = - new Continuous( - name("timelines.engagement_predicted.is_report_tweet_clicked"), - Set(EngagementScore).asJava) - val PREDICTED_IS_NEGATIVE_FEEDBACK = new Continuous( - name("timelines.engagement_predicted.is_negative_feedback"), - Set(EngagementScore).asJava) - val PREDICTED_IS_NEGATIVE_FEEDBACK_V2 = new Continuous( - name("timelines.engagement_predicted.is_negative_feedback_v2"), - Set(EngagementScore).asJava) - val PREDICTED_IS_WEAK_NEGATIVE_FEEDBACK = new Continuous( - name("timelines.engagement_predicted.is_weak_negative_feedback"), - Set(EngagementScore).asJava) - val PREDICTED_IS_STRONG_NEGATIVE_FEEDBACK = new Continuous( - name("timelines.engagement_predicted.is_strong_negative_feedback"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_DWELLED_IN_BOUNDS_V1 = new Continuous( - name("timelines.engagement_predicted.is_dwelled_in_bounds_v1"), - Set(EngagementScore).asJava) - val PREDICTED_DWELL_NORMALIZED_OVERALL = new Continuous( - name("timelines.engagement_predicted.dwell_normalized_overall"), - Set(EngagementScore).asJava) - val PREDICTED_DWELL_CDF = - new Continuous(name("timelines.engagement_predicted.dwell_cdf"), Set(EngagementScore).asJava) - val PREDICTED_DWELL_CDF_OVERALL = new Continuous( - name("timelines.engagement_predicted.dwell_cdf_overall"), - Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED = - new Continuous(name("timelines.engagement_predicted.is_dwelled"), Set(EngagementScore).asJava) - - val PREDICTED_IS_HOME_LATEST_VISITED = new Continuous( - name("timelines.engagement_predicted.is_home_latest_visited"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_BOOKMARKED = new Continuous( - name("timelines.engagement_predicted.is_bookmarked"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_SHARED = - new Continuous(name("timelines.engagement_predicted.is_shared"), Set(EngagementScore).asJava) - val PREDICTED_IS_SHARE_MENU_CLICKED = new Continuous( - 
name("timelines.engagement_predicted.is_share_menu_clicked"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_PROFILE_DWELLED_20_SEC = new Continuous( - name("timelines.engagement_predicted.is_profile_dwelled_20_sec"), - Set(EngagementScore).asJava) - - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_5_SEC = new Continuous( - name("timelines.engagement_predicted.is_fullscreen_video_dwelled_5_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_10_SEC = new Continuous( - name("timelines.engagement_predicted.is_fullscreen_video_dwelled_10_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_20_SEC = new Continuous( - name("timelines.engagement_predicted.is_fullscreen_video_dwelled_20_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_30_SEC = new Continuous( - name("timelines.engagement_predicted.is_fullscreen_video_dwelled_30_sec"), - Set(EngagementScore).asJava) - - // Please use this timestamp, not the `meta.timestamp`, for the actual served timestamp. - val SERVED_TIMESTAMP = - new Discrete("timelines.meta.timestamp.served", Set(PrivateTimestamp).asJava) - - // timestamp when the engagement has occurred. do not train on these features - val TIMESTAMP_FAVORITED = - new Discrete("timelines.meta.timestamp.engagement.favorited", Set(PublicTimestamp).asJava) - val TIMESTAMP_RETWEETED = - new Discrete("timelines.meta.timestamp.engagement.retweeted", Set(PublicTimestamp).asJava) - val TIMESTAMP_REPLIED = - new Discrete("timelines.meta.timestamp.engagement.replied", Set(PublicTimestamp).asJava) - val TIMESTAMP_PROFILE_CLICKED = new Discrete( - "timelines.meta.timestamp.engagement.profile_clicked", - Set(PrivateTimestamp).asJava) - val TIMESTAMP_CLICKED = - new Discrete("timelines.meta.timestamp.engagement.clicked", Set(PrivateTimestamp).asJava) - val TIMESTAMP_PHOTO_EXPANDED = - new Discrete("timelines.meta.timestamp.engagement.photo_expanded", Set(PrivateTimestamp).asJava) - val TIMESTAMP_DWELLED = - new Discrete("timelines.meta.timestamp.engagement.dwelled", Set(PrivateTimestamp).asJava) - val TIMESTAMP_VIDEO_PLAYBACK_50 = new Discrete( - "timelines.meta.timestamp.engagement.video_playback_50", - Set(PrivateTimestamp).asJava) - // reply engaged by author - val TIMESTAMP_REPLY_FAVORITED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.reply_favorited_by_author", - Set(PublicTimestamp).asJava) - val TIMESTAMP_REPLY_REPLIED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.reply_replied_by_author", - Set(PublicTimestamp).asJava) - val TIMESTAMP_REPLY_RETWEETED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.reply_retweeted_by_author", - Set(PublicTimestamp).asJava) - // fav engaged by author - val TIMESTAMP_FAV_FAVORITED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.fav_favorited_by_author", - Set(PublicTimestamp).asJava) - val TIMESTAMP_FAV_REPLIED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.fav_replied_by_author", - Set(PublicTimestamp).asJava) - val TIMESTAMP_FAV_RETWEETED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.fav_retweeted_by_author", - Set(PublicTimestamp).asJava) - val TIMESTAMP_FAV_FOLLOWED_BY_AUTHOR = new Discrete( - "timelines.meta.timestamp.engagement.fav_followed_by_author", - Set(PublicTimestamp).asJava) - // good click - val TIMESTAMP_GOOD_CLICK_CONVO_DESC_FAVORITED = new Discrete( - "timelines.meta.timestamp.engagement.good_click_convo_desc_favorited", - Set(PrivateTimestamp).asJava) 
- val TIMESTAMP_GOOD_CLICK_CONVO_DESC_REPLIIED = new Discrete( - "timelines.meta.timestamp.engagement.good_click_convo_desc_replied", - Set(PrivateTimestamp).asJava) - val TIMESTAMP_GOOD_CLICK_CONVO_DESC_PROFILE_CLICKED = new Discrete( - "timelines.meta.timestamp.engagement.good_click_convo_desc_profiile_clicked", - Set(PrivateTimestamp).asJava) - val TIMESTAMP_NEGATIVE_FEEDBACK = new Discrete( - "timelines.meta.timestamp.engagement.negative_feedback", - Set(PrivateTimestamp).asJava) - val TIMESTAMP_REPORT_TWEET_CLICK = - new Discrete( - "timelines.meta.timestamp.engagement.report_tweet_click", - Set(PrivateTimestamp).asJava) - val TIMESTAMP_IMPRESSED = - new Discrete("timelines.meta.timestamp.engagement.impressed", Set(PublicTimestamp).asJava) - val TIMESTAMP_TWEET_DETAIL_DWELLED = - new Discrete( - "timelines.meta.timestamp.engagement.tweet_detail_dwelled", - Set(PublicTimestamp).asJava) - val TIMESTAMP_PROFILE_DWELLED = - new Discrete("timelines.meta.timestamp.engagement.profile_dwelled", Set(PublicTimestamp).asJava) - val TIMESTAMP_FULLSCREEN_VIDEO_DWELLED = - new Discrete( - "timelines.meta.timestamp.engagement.fullscreen_video_dwelled", - Set(PublicTimestamp).asJava) - val TIMESTAMP_LINK_DWELLED = - new Discrete("timelines.meta.timestamp.engagement.link_dwelled", Set(PublicTimestamp).asJava) - - // these are used to dup and split the negative instances during streaming processing (kafka) - val TRAINING_FOR_FAVORITED = - new Binary("timelines.meta.training_data.for_favorited", Set(EngagementId).asJava) - val TRAINING_FOR_RETWEETED = - new Binary("timelines.meta.training_data.for_retweeted", Set(EngagementId).asJava) - val TRAINING_FOR_REPLIED = - new Binary("timelines.meta.training_data.for_replied", Set(EngagementId).asJava) - val TRAINING_FOR_PROFILE_CLICKED = - new Binary("timelines.meta.training_data.for_profile_clicked", Set(EngagementId).asJava) - val TRAINING_FOR_CLICKED = - new Binary("timelines.meta.training_data.for_clicked", Set(EngagementId).asJava) - val TRAINING_FOR_PHOTO_EXPANDED = - new Binary("timelines.meta.training_data.for_photo_expanded", Set(EngagementId).asJava) - val TRAINING_FOR_VIDEO_PLAYBACK_50 = - new Binary("timelines.meta.training_data.for_video_playback_50", Set(EngagementId).asJava) - val TRAINING_FOR_NEGATIVE_FEEDBACK = - new Binary("timelines.meta.training_data.for_negative_feedback", Set(EngagementId).asJava) - val TRAINING_FOR_REPORTED = - new Binary("timelines.meta.training_data.for_reported", Set(EngagementId).asJava) - val TRAINING_FOR_DWELLED = - new Binary("timelines.meta.training_data.for_dwelled", Set(EngagementId).asJava) - val TRAINING_FOR_SHARED = - new Binary("timelines.meta.training_data.for_shared", Set(EngagementId).asJava) - val TRAINING_FOR_SHARE_MENU_CLICKED = - new Binary("timelines.meta.training_data.for_share_menu_clicked", Set(EngagementId).asJava) - - // Warning: do not train on these features - val PREDICTED_SCORE = new Continuous(name("timelines.score"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_FAV = new Continuous(name("timelines.score.fav"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_RETWEET = - new Continuous(name("timelines.score.retweet"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_REPLY = - new Continuous(name("timelines.score.reply"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_OPEN_LINK = - new Continuous(name("timelines.score.open_link"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_GOOD_OPEN_LINK = - new Continuous(name("timelines.score.good_open_link"), 
Set(EngagementScore).asJava) - val PREDICTED_SCORE_PROFILE_CLICK = - new Continuous(name("timelines.score.profile_click"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DETAIL_EXPAND = - new Continuous(name("timelines.score.detail_expand"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_PHOTO_EXPAND = - new Continuous(name("timelines.score.photo_expand"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_PLAYBACK_50 = - new Continuous(name("timelines.score.playback_50"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_VIDEO_QUALITY_VIEW = - new Continuous(name("timelines.score.video_quality_view"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DONT_LIKE = - new Continuous(name("timelines.score.dont_like"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_PROFILE_CLICKED_AND_PROFILE_ENGAGED = - new Continuous( - name("timelines.score.profile_clicked_and_profile_engaged"), - Set(EngagementScore).asJava) - val PREDICTED_SCORE_GOOD_CLICKED_V1 = - new Continuous(name("timelines.score.good_clicked_v1"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_GOOD_CLICKED_V2 = - new Continuous(name("timelines.score.good_clicked_v2"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DWELL = - new Continuous(name("timelines.score.dwell"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DWELL_CDF = - new Continuous(name("timelines.score.dwell_cfd"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DWELL_CDF_OVERALL = - new Continuous(name("timelines.score.dwell_cfd_overall"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_DWELL_NORMALIZED_OVERALL = - new Continuous(name("timelines.score.dwell_normalized_overall"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_NEGATIVE_FEEDBACK = - new Continuous(name("timelines.score.negative_feedback"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_NEGATIVE_FEEDBACK_V2 = - new Continuous(name("timelines.score.negative_feedback_v2"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_WEAK_NEGATIVE_FEEDBACK = - new Continuous(name("timelines.score.weak_negative_feedback"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_STRONG_NEGATIVE_FEEDBACK = - new Continuous(name("timelines.score.strong_negative_feedback"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_REPORT_TWEET_CLICKED = - new Continuous(name("timelines.score.report_tweet_clicked"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_UNFOLLOW_TOPIC = - new Continuous(name("timelines.score.unfollow_topic"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_FOLLOW = - new Continuous(name("timelines.score.follow"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_RELEVANCE_PROMPT_YES_CLICKED = - new Continuous( - name("timelines.score.relevance_prompt_yes_clicked"), - Set(EngagementScore).asJava) - val PREDICTED_SCORE_BOOKMARK = - new Continuous(name("timelines.score.bookmark"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_SHARE = - new Continuous(name("timelines.score.share"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_SHARE_MENU_CLICK = - new Continuous(name("timelines.score.share_menu_click"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_PROFILE_DWELLED = - new Continuous(name("timelines.score.good_profile_dwelled"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_TWEET_DETAIL_DWELLED = - new Continuous(name("timelines.score.tweet_detail_dwelled"), Set(EngagementScore).asJava) - val PREDICTED_SCORE_FULLSCREEN_VIDEO_DWELL = - new Continuous(name("timelines.score.fullscreen_video_dwell"), Set(EngagementScore).asJava) - - // hydrated in 
TimelinesSharedFeaturesAdapter that recap adapter calls - val ORIGINAL_AUTHOR_ID = new Discrete(name("entities.original_author_id"), Set(UserId).asJava) - val SOURCE_AUTHOR_ID = new Discrete(name("entities.source_author_id"), Set(UserId).asJava) - val SOURCE_TWEET_ID = new Discrete(name("entities.source_tweet_id"), Set(TweetId).asJava) - val TOPIC_ID = new Discrete(name("entities.topic_id"), Set(SemanticcoreClassification).asJava) - val INFERRED_TOPIC_IDS = - new SparseBinary(name("entities.inferred_topic_ids"), Set(SemanticcoreClassification).asJava) - val INFERRED_TOPIC_ID = TypedAggregateGroup.sparseFeature(INFERRED_TOPIC_IDS) - - val WEIGHTED_FAV_COUNT = new Continuous( - name("timelines.earlybird.weighted_fav_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val WEIGHTED_RETWEET_COUNT = new Continuous( - name("timelines.earlybird.weighted_retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val WEIGHTED_REPLY_COUNT = new Continuous( - name("timelines.earlybird.weighted_reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val WEIGHTED_QUOTE_COUNT = new Continuous( - name("timelines.earlybird.weighted_quote_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val EMBEDS_IMPRESSION_COUNT_V2 = new Continuous( - name("timelines.earlybird.embeds_impression_count_v2"), - Set(CountOfImpression).asJava) - val EMBEDS_URL_COUNT_V2 = new Continuous( - name("timelines.earlybird.embeds_url_count_v2"), - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val DECAYED_FAVORITE_COUNT = new Continuous( - name("timelines.earlybird.decayed_favorite_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val DECAYED_RETWEET_COUNT = new Continuous( - name("timelines.earlybird.decayed_retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val DECAYED_REPLY_COUNT = new Continuous( - name("timelines.earlybird.decayed_reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val DECAYED_QUOTE_COUNT = new Continuous( - name("timelines.earlybird.decayed_quote_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val FAKE_FAVORITE_COUNT = new Continuous( - name("timelines.earlybird.fake_favorite_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val FAKE_RETWEET_COUNT = new Continuous( - name("timelines.earlybird.fake_retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val FAKE_REPLY_COUNT = new Continuous( - name("timelines.earlybird.fake_reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val FAKE_QUOTE_COUNT = new Continuous( - name("timelines.earlybird.fake_quote_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val QUOTE_COUNT = new Continuous( - name("timelines.earlybird.quote_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - - // Safety features - val LABEL_ABUSIVE_FLAG = - new Binary(name("timelines.earlybird.label_abusive_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_ABUSIVE_HI_RCL_FLAG = - new Binary(name("timelines.earlybird.label_abusive_hi_rcl_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_DUP_CONTENT_FLAG = - new Binary(name("timelines.earlybird.label_dup_content_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_NSFW_HI_PRC_FLAG = - new Binary(name("timelines.earlybird.label_nsfw_hi_prc_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_NSFW_HI_RCL_FLAG = - new 
Binary(name("timelines.earlybird.label_nsfw_hi_rcl_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_SPAM_FLAG = - new Binary(name("timelines.earlybird.label_spam_flag"), Set(TweetSafetyLabels).asJava) - val LABEL_SPAM_HI_RCL_FLAG = - new Binary(name("timelines.earlybird.label_spam_hi_rcl_flag"), Set(TweetSafetyLabels).asJava) - - // Periscope features - val PERISCOPE_EXISTS = new Binary( - name("timelines.earlybird.periscope_exists"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val PERISCOPE_IS_LIVE = new Binary( - name("timelines.earlybird.periscope_is_live"), - Set(PrivateBroadcastMetrics, PublicBroadcastMetrics).asJava) - val PERISCOPE_HAS_BEEN_FEATURED = new Binary( - name("timelines.earlybird.periscope_has_been_featured"), - Set(PrivateBroadcastMetrics, PublicBroadcastMetrics).asJava) - val PERISCOPE_IS_CURRENTLY_FEATURED = new Binary( - name("timelines.earlybird.periscope_is_currently_featured"), - Set(PrivateBroadcastMetrics, PublicBroadcastMetrics).asJava - ) - val PERISCOPE_IS_FROM_QUALITY_SOURCE = new Binary( - name("timelines.earlybird.periscope_is_from_quality_source"), - Set(PrivateBroadcastMetrics, PublicBroadcastMetrics).asJava - ) - - val VISIBLE_TOKEN_RATIO = new Continuous(name("timelines.earlybird.visible_token_ratio")) - val HAS_QUOTE = new Binary( - name("timelines.earlybird.has_quote"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val IS_COMPOSER_SOURCE_CAMERA = new Binary( - name("timelines.earlybird.is_composer_source_camera"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - - val EARLYBIRD_SCORE = new Continuous( - name("timelines.earlybird_score"), - Set(EngagementScore).asJava - ) // separating from the rest of "timelines.earlybird." 
namespace - - val DWELL_TIME_MS = new Continuous( - name("timelines.engagement.dwell_time_ms"), - Set(EngagementDurationAndTimestamp, ImpressionMetadata, PrivateTimestamp).asJava) - - val TWEET_DETAIL_DWELL_TIME_MS = new Continuous( - name("timelines.engagement.tweet_detail_dwell_time_ms"), - Set(EngagementDurationAndTimestamp, ImpressionMetadata, PrivateTimestamp).asJava) - - val PROFILE_DWELL_TIME_MS = new Continuous( - name("timelines.engagement.profile_dwell_time_ms"), - Set(EngagementDurationAndTimestamp, ImpressionMetadata, PrivateTimestamp).asJava) - - val FULLSCREEN_VIDEO_DWELL_TIME_MS = new Continuous( - name("timelines.engagement.fullscreen_video_dwell_time_ms"), - Set(EngagementDurationAndTimestamp, ImpressionMetadata, PrivateTimestamp).asJava) - - val LINK_DWELL_TIME_MS = new Continuous( - name("timelines.engagement.link_dwell_time_ms"), - Set(EngagementDurationAndTimestamp, ImpressionMetadata, PrivateTimestamp).asJava) - - val ASPECT_RATIO_DEN = new Continuous( - name("tweetsource.tweet.media.aspect_ratio_den"), - Set(MediaFile, MediaProcessingInformation).asJava) - val ASPECT_RATIO_NUM = new Continuous( - name("tweetsource.tweet.media.aspect_ratio_num"), - Set(MediaFile, MediaProcessingInformation).asJava) - val BIT_RATE = new Continuous( - name("tweetsource.tweet.media.bit_rate"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HEIGHT_2 = new Continuous( - name("tweetsource.tweet.media.height_2"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HEIGHT_1 = new Continuous( - name("tweetsource.tweet.media.height_1"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HEIGHT_3 = new Continuous( - name("tweetsource.tweet.media.height_3"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HEIGHT_4 = new Continuous( - name("tweetsource.tweet.media.height_4"), - Set(MediaFile, MediaProcessingInformation).asJava) - val RESIZE_METHOD_1 = new Discrete( - name("tweetsource.tweet.media.resize_method_1"), - Set(MediaFile, MediaProcessingInformation).asJava) - val RESIZE_METHOD_2 = new Discrete( - name("tweetsource.tweet.media.resize_method_2"), - Set(MediaFile, MediaProcessingInformation).asJava) - val RESIZE_METHOD_3 = new Discrete( - name("tweetsource.tweet.media.resize_method_3"), - Set(MediaFile, MediaProcessingInformation).asJava) - val RESIZE_METHOD_4 = new Discrete( - name("tweetsource.tweet.media.resize_method_4"), - Set(MediaFile, MediaProcessingInformation).asJava) - val VIDEO_DURATION = new Continuous( - name("tweetsource.tweet.media.video_duration"), - Set(MediaFile, MediaProcessingInformation).asJava) - val WIDTH_1 = new Continuous( - name("tweetsource.tweet.media.width_1"), - Set(MediaFile, MediaProcessingInformation).asJava) - val WIDTH_2 = new Continuous( - name("tweetsource.tweet.media.width_2"), - Set(MediaFile, MediaProcessingInformation).asJava) - val WIDTH_3 = new Continuous( - name("tweetsource.tweet.media.width_3"), - Set(MediaFile, MediaProcessingInformation).asJava) - val WIDTH_4 = new Continuous( - name("tweetsource.tweet.media.width_4"), - Set(MediaFile, MediaProcessingInformation).asJava) - val NUM_MEDIA_TAGS = new Continuous( - name("tweetsource.tweet.media.num_tags"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val MEDIA_TAG_SCREEN_NAMES = new SparseBinary( - name("tweetsource.tweet.media.tag_screen_names"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val STICKER_IDS = new SparseBinary( - name("tweetsource.tweet.media.sticker_ids"), - 
Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - - val NUM_COLOR_PALLETTE_ITEMS = new Continuous( - name("tweetsource.v2.tweet.media.num_color_pallette_items"), - Set(MediaFile, MediaProcessingInformation).asJava) - val COLOR_1_RED = new Continuous( - name("tweetsource.v2.tweet.media.color_1_red"), - Set(MediaFile, MediaProcessingInformation).asJava) - val COLOR_1_BLUE = new Continuous( - name("tweetsource.v2.tweet.media.color_1_blue"), - Set(MediaFile, MediaProcessingInformation).asJava) - val COLOR_1_GREEN = new Continuous( - name("tweetsource.v2.tweet.media.color_1_green"), - Set(MediaFile, MediaProcessingInformation).asJava) - val COLOR_1_PERCENTAGE = new Continuous( - name("tweetsource.v2.tweet.media.color_1_percentage"), - Set(MediaFile, MediaProcessingInformation).asJava) - val MEDIA_PROVIDERS = new SparseBinary( - name("tweetsource.v2.tweet.media.providers"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val IS_360 = new Binary( - name("tweetsource.v2.tweet.media.is_360"), - Set(MediaFile, MediaProcessingInformation).asJava) - val VIEW_COUNT = - new Continuous(name("tweetsource.v2.tweet.media.view_count"), Set(MediaContentMetrics).asJava) - val IS_MANAGED = new Binary( - name("tweetsource.v2.tweet.media.is_managed"), - Set(MediaFile, MediaProcessingInformation).asJava) - val IS_MONETIZABLE = new Binary( - name("tweetsource.v2.tweet.media.is_monetizable"), - Set(MediaFile, MediaProcessingInformation).asJava) - val IS_EMBEDDABLE = new Binary( - name("tweetsource.v2.tweet.media.is_embeddable"), - Set(MediaFile, MediaProcessingInformation).asJava) - val CLASSIFICATION_LABELS = new SparseContinuous( - name("tweetsource.v2.tweet.media.classification_labels"), - Set(MediaFile, MediaProcessingInformation).asJava) - - val NUM_STICKERS = new Continuous( - name("tweetsource.v2.tweet.media.num_stickers"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val NUM_FACES = new Continuous( - name("tweetsource.v2.tweet.media.num_faces"), - Set(MediaFile, MediaProcessingInformation).asJava) - val FACE_AREAS = new Continuous( - name("tweetsource.v2.tweet.media.face_areas"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_SELECTED_PREVIEW_IMAGE = new Binary( - name("tweetsource.v2.tweet.media.has_selected_preview_image"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_TITLE = new Binary( - name("tweetsource.v2.tweet.media.has_title"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_DESCRIPTION = new Binary( - name("tweetsource.v2.tweet.media.has_description"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_VISIT_SITE_CALL_TO_ACTION = new Binary( - name("tweetsource.v2.tweet.media.has_visit_site_call_to_action"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_APP_INSTALL_CALL_TO_ACTION = new Binary( - name("tweetsource.v2.tweet.media.has_app_install_call_to_action"), - Set(MediaFile, MediaProcessingInformation).asJava) - val HAS_WATCH_NOW_CALL_TO_ACTION = new Binary( - name("tweetsource.v2.tweet.media.has_watch_now_call_to_action"), - Set(MediaFile, MediaProcessingInformation).asJava) - - val NUM_CAPS = - new Continuous(name("tweetsource.tweet.text.num_caps"), Set(PublicTweets, PrivateTweets).asJava) - val TWEET_LENGTH = - new Continuous(name("tweetsource.tweet.text.length"), Set(PublicTweets, PrivateTweets).asJava) - val TWEET_LENGTH_TYPE = new Discrete( - name("tweetsource.tweet.text.length_type"), - Set(PublicTweets, 
PrivateTweets).asJava) - val NUM_WHITESPACES = new Continuous( - name("tweetsource.tweet.text.num_whitespaces"), - Set(PublicTweets, PrivateTweets).asJava) - val HAS_QUESTION = - new Binary(name("tweetsource.tweet.text.has_question"), Set(PublicTweets, PrivateTweets).asJava) - val NUM_NEWLINES = new Continuous( - name("tweetsource.tweet.text.num_newlines"), - Set(PublicTweets, PrivateTweets).asJava) - val EMOJI_TOKENS = new SparseBinary( - name("tweetsource.v3.tweet.text.emoji_tokens"), - Set(PublicTweets, PrivateTweets).asJava) - val EMOTICON_TOKENS = new SparseBinary( - name("tweetsource.v3.tweet.text.emoticon_tokens"), - Set(PublicTweets, PrivateTweets).asJava) - val NUM_EMOJIS = new Continuous( - name("tweetsource.v3.tweet.text.num_emojis"), - Set(PublicTweets, PrivateTweets).asJava) - val NUM_EMOTICONS = new Continuous( - name("tweetsource.v3.tweet.text.num_emoticons"), - Set(PublicTweets, PrivateTweets).asJava) - val POS_UNIGRAMS = new SparseBinary( - name("tweetsource.v3.tweet.text.pos_unigrams"), - Set(PublicTweets, PrivateTweets).asJava) - val POS_BIGRAMS = new SparseBinary( - name("tweetsource.v3.tweet.text.pos_bigrams"), - Set(PublicTweets, PrivateTweets).asJava) - val TEXT_TOKENS = new SparseBinary( - name("tweetsource.v4.tweet.text.tokens"), - Set(PublicTweets, PrivateTweets).asJava) - - // Health features model scores (see go/toxicity, go/pblock, go/pspammytweet) - val PBLOCK_SCORE = - new Continuous(name("timelines.earlybird.pblock_score"), Set(TweetSafetyScores).asJava) - val TOXICITY_SCORE = - new Continuous(name("timelines.earlybird.toxicity_score"), Set(TweetSafetyScores).asJava) - val EXPERIMENTAL_HEALTH_MODEL_SCORE_1 = - new Continuous( - name("timelines.earlybird.experimental_health_model_score_1"), - Set(TweetSafetyScores).asJava) - val EXPERIMENTAL_HEALTH_MODEL_SCORE_2 = - new Continuous( - name("timelines.earlybird.experimental_health_model_score_2"), - Set(TweetSafetyScores).asJava) - val EXPERIMENTAL_HEALTH_MODEL_SCORE_3 = - new Continuous( - name("timelines.earlybird.experimental_health_model_score_3"), - Set(TweetSafetyScores).asJava) - val EXPERIMENTAL_HEALTH_MODEL_SCORE_4 = - new Continuous( - name("timelines.earlybird.experimental_health_model_score_4"), - Set(TweetSafetyScores).asJava) - val PSPAMMY_TWEET_SCORE = - new Continuous(name("timelines.earlybird.pspammy_tweet_score"), Set(TweetSafetyScores).asJava) - val PREPORTED_TWEET_SCORE = - new Continuous(name("timelines.earlybird.preported_tweet_score"), Set(TweetSafetyScores).asJava) - - // where record was displayed e.g. recap vs ranked timeline vs recycled - // (do NOT use for training in prediction, since this is set post-scoring) - // This differs from TimelinesSharedFeatures.INJECTION_TYPE, which is only - // set to Recap or Rectweet, and is available pre-scoring. - // This also differs from TimeFeatures.IS_TWEET_RECYCLED, which is set - // pre-scoring and indicates if a tweet is being considered for recycling. - // In contrast, DISPLAY_SUGGEST_TYPE == RecycledTweet means the tweet - // was actually served in a recycled tweet module. The two should currently - // have the same value, but need not in future, so please only use - // IS_TWEET_RECYCLED/CANDIDATE_TWEET_SOURCE_ID for training models and - // only use DISPLAY_SUGGEST_TYPE for offline analysis of tweets actually - // served in recycled modules. 
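  // Illustrative sketch (an assumption, not from the original file): for offline analysis
  // this field can be read with the SRichDataRecord helper used elsewhere in this repo,
  // roughly as
  //   val suggestType: Option[java.lang.Long] =
  //     SRichDataRecord(record).getFeatureValueOpt(TimelinesSharedFeatures.DISPLAY_SUGGEST_TYPE)
  // whereas training pipelines should read CANDIDATE_TWEET_SOURCE_ID / IS_TWEET_RECYCLED,
  // per the note above.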
- val DISPLAY_SUGGEST_TYPE = new Discrete(name("recap.display.suggest_type")) - - // Candidate tweet source id - related to DISPLAY_SUGGEST_TYPE above, but this is a - // property of the candidate rather than display location so is safe to use - // in model training, unlike DISPLAY_SUGGEST_TYPE. - val CANDIDATE_TWEET_SOURCE_ID = - new Discrete(name("timelines.meta.candidate_tweet_source_id"), Set(TweetId).asJava) - - // Was at least 50% of this tweet in the user's viewport for at least 500 ms, - // OR did the user engage with the tweet publicly or privately - val IS_LINGER_IMPRESSION = - new Binary(name("timelines.engagement.is_linger_impression"), Set(EngagementsPrivate).asJava) - - // Features to create rollups - val LANGUAGE_GROUP = new Discrete(name("timelines.tweet.text.language_group")) - - // The final position index of the tweet being trained on in the timeline - // served from TLM (could still change later in TLS-API), as recorded by - // PositionIndexLoggingEnvelopeTransform. - val FINAL_POSITION_INDEX = new Discrete(name("timelines.display.final_position_index")) - - // The traceId of the timeline request, can be used to group tweets in the same response. - val TRACE_ID = new Discrete(name("timelines.display.trace_id"), Set(TfeTransactionId).asJava) - - // Whether this tweet was randomly injected into the timeline or not, for exploration purposes - val IS_RANDOM_TWEET = new Binary(name("timelines.display.is_random_tweet")) - - // Whether this tweet was reordered with softmax ranking for explore/exploit, and needs to - // be excluded from exploit only holdback - val IS_SOFTMAX_RANKING_TWEET = new Binary(name("timelines.display.is_softmax_ranking_tweet")) - - // Whether the user viewing the tweet has disabled ranked timeline. - val IS_RANKED_TIMELINE_DISABLER = new Binary( - name("timelines.user_features.is_ranked_timeline_disabler"), - Set(AnnotationValue, GeneralSettings).asJava) - - // Whether the user viewing the tweet was one of those released from DDG 4205 control - // as part of http://go/shrink-4205 process to shrink the quality features holdback. 
- val IS_USER_RELEASED_FROM_QUALITY_HOLDBACK = new Binary( - name("timelines.user_features.is_released_from_quality_holdback"), - Set(ExperimentId, ExperimentName).asJava) - - val INITIAL_PREDICTION_FAV = - new Continuous(name("timelines.initial_prediction.fav"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_RETWEET = - new Continuous(name("timelines.initial_prediction.retweet"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_REPLY = - new Continuous(name("timelines.initial_prediction.reply"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_OPEN_LINK = - new Continuous(name("timelines.initial_prediction.open_link"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_PROFILE_CLICK = - new Continuous(name("timelines.initial_prediction.profile_click"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_VIDEO_PLAYBACK_50 = new Continuous( - name("timelines.initial_prediction.video_playback_50"), - Set(EngagementScore).asJava) - val INITIAL_PREDICTION_DETAIL_EXPAND = - new Continuous(name("timelines.initial_prediction.detail_expand"), Set(EngagementScore).asJava) - val INITIAL_PREDICTION_PHOTO_EXPAND = - new Continuous(name("timelines.initial_prediction.photo_expand"), Set(EngagementScore).asJava) - - val VIEWER_FOLLOWS_ORIGINAL_AUTHOR = - new Binary(name("timelines.viewer_follows_original_author"), Set(Follow).asJava) - - val IS_TOP_ONE = new Binary(name("timelines.position.is_top_one")) - val IS_TOP_FIVE = - new Binary(name(featureName = "timelines.position.is_top_five")) - val IS_TOP_TEN = - new Binary(name(featureName = "timelines.position.is_top_ten")) - - val LOG_POSITION = - new Continuous(name(featureName = "timelines.position.log_10")) - -} diff --git a/src/scala/com/twitter/timelines/prediction/features/engagement_features/BUILD b/src/scala/com/twitter/timelines/prediction/features/engagement_features/BUILD deleted file mode 100644 index f6caadea0..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/engagement_features/BUILD +++ /dev/null @@ -1,12 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/timelineservice/server/suggests/features/engagement_features:thrift-scala", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/transforms", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/engagement_features/EngagementFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/engagement_features/EngagementFeatures.scala deleted file mode 100644 index e65c9db20..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/engagement_features/EngagementFeatures.scala +++ /dev/null @@ -1,246 +0,0 @@ -package com.twitter.timelines.prediction.features.engagement_features - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.logging.Logger -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.Feature -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.SparseBinary -import com.twitter.timelines.data_processing.ml_util.transforms.OneToSomeTransform -import com.twitter.timelines.data_processing.ml_util.transforms.RichITransform -import com.twitter.timelines.data_processing.ml_util.transforms.SparseBinaryUnion -import 
com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup -import com.twitter.timelineservice.suggests.features.engagement_features.thriftscala.{ - EngagementFeatures => ThriftEngagementFeatures -} -import com.twitter.timelineservice.suggests.features.engagement_features.v1.thriftscala.{ - EngagementFeatures => ThriftEngagementFeaturesV1 -} -import scala.collection.JavaConverters._ - -object EngagementFeatures { - private[this] val logger = Logger.get(getClass.getSimpleName) - - sealed trait EngagementFeature - case object Count extends EngagementFeature - case object RealGraphWeightAverage extends EngagementFeature - case object RealGraphWeightMax extends EngagementFeature - case object RealGraphWeightMin extends EngagementFeature - case object RealGraphWeightMissing extends EngagementFeature - case object RealGraphWeightVariance extends EngagementFeature - case object UserIds extends EngagementFeature - - def fromThrift(thriftEngagementFeatures: ThriftEngagementFeatures): Option[EngagementFeatures] = { - thriftEngagementFeatures match { - case thriftEngagementFeaturesV1: ThriftEngagementFeatures.V1 => - Some( - EngagementFeatures( - favoritedBy = thriftEngagementFeaturesV1.v1.favoritedBy, - retweetedBy = thriftEngagementFeaturesV1.v1.retweetedBy, - repliedBy = thriftEngagementFeaturesV1.v1.repliedBy, - ) - ) - case _ => { - logger.error("Unexpected EngagementFeatures version found.") - None - } - } - } - - val empty: EngagementFeatures = EngagementFeatures() -} - -/** - * Contains user IDs who have engaged with a target entity, such as a Tweet, - * and any additional data needed for derived features. - */ -case class EngagementFeatures( - favoritedBy: Seq[Long] = Nil, - retweetedBy: Seq[Long] = Nil, - repliedBy: Seq[Long] = Nil, - realGraphWeightByUser: Map[Long, Double] = Map.empty) { - def isEmpty: Boolean = favoritedBy.isEmpty && retweetedBy.isEmpty && repliedBy.isEmpty - def nonEmpty: Boolean = !isEmpty - def toLogThrift: ThriftEngagementFeatures.V1 = - ThriftEngagementFeatures.V1( - ThriftEngagementFeaturesV1( - favoritedBy = favoritedBy, - retweetedBy = retweetedBy, - repliedBy = repliedBy - ) - ) -} - -/** - * Represents engagement features derived from the Real Graph weight. - * - * These features are from the perspective of the source user, who is viewing their - * timeline, to the destination users (or user), who created engagements. 
- * - * @param count number of engagements present - * @param max max score of the engaging users - * @param mean average score of the engaging users - * @param min minimum score of the engaging users - * @param missing for engagements present, how many Real Graph scores were missing - * @param variance variance of scores of the engaging users - */ -case class RealGraphDerivedEngagementFeatures( - count: Int, - max: Double, - mean: Double, - min: Double, - missing: Int, - variance: Double) - -object EngagementDataRecordFeatures { - import EngagementFeatures._ - - val FavoritedByUserIds = new SparseBinary( - "engagement_features.user_ids.favorited_by", - Set(UserId, PrivateLikes, PublicLikes).asJava) - val RetweetedByUserIds = new SparseBinary( - "engagement_features.user_ids.retweeted_by", - Set(UserId, PrivateRetweets, PublicRetweets).asJava) - val RepliedByUserIds = new SparseBinary( - "engagement_features.user_ids.replied_by", - Set(UserId, PrivateReplies, PublicReplies).asJava) - - val InNetworkFavoritesCount = new Continuous( - "engagement_features.in_network.favorites.count", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val InNetworkRetweetsCount = new Continuous( - "engagement_features.in_network.retweets.count", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val InNetworkRepliesCount = new Continuous( - "engagement_features.in_network.replies.count", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - - // real graph derived features - val InNetworkFavoritesAvgRealGraphWeight = new Continuous( - "engagement_features.real_graph.favorites.avg_weight", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val InNetworkFavoritesMaxRealGraphWeight = new Continuous( - "engagement_features.real_graph.favorites.max_weight", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val InNetworkFavoritesMinRealGraphWeight = new Continuous( - "engagement_features.real_graph.favorites.min_weight", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val InNetworkFavoritesRealGraphWeightMissing = new Continuous( - "engagement_features.real_graph.favorites.missing" - ) - val InNetworkFavoritesRealGraphWeightVariance = new Continuous( - "engagement_features.real_graph.favorites.weight_variance" - ) - - val InNetworkRetweetsMaxRealGraphWeight = new Continuous( - "engagement_features.real_graph.retweets.max_weight", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val InNetworkRetweetsMinRealGraphWeight = new Continuous( - "engagement_features.real_graph.retweets.min_weight", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val InNetworkRetweetsAvgRealGraphWeight = new Continuous( - "engagement_features.real_graph.retweets.avg_weight", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val InNetworkRetweetsRealGraphWeightMissing = new Continuous( - "engagement_features.real_graph.retweets.missing" - ) - val InNetworkRetweetsRealGraphWeightVariance = new Continuous( - "engagement_features.real_graph.retweets.weight_variance" - ) - - val InNetworkRepliesMaxRealGraphWeight = new Continuous( - "engagement_features.real_graph.replies.max_weight", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val InNetworkRepliesMinRealGraphWeight = new Continuous( - "engagement_features.real_graph.replies.min_weight", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val InNetworkRepliesAvgRealGraphWeight = new Continuous( - "engagement_features.real_graph.replies.avg_weight", - 
Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val InNetworkRepliesRealGraphWeightMissing = new Continuous( - "engagement_features.real_graph.replies.missing" - ) - val InNetworkRepliesRealGraphWeightVariance = new Continuous( - "engagement_features.real_graph.replies.weight_variance" - ) - - sealed trait FeatureGroup { - def continuousFeatures: Map[EngagementFeature, Continuous] - def sparseBinaryFeatures: Map[EngagementFeature, SparseBinary] - def allFeatures: Seq[Feature[_]] = - (continuousFeatures.values ++ sparseBinaryFeatures.values).toSeq - } - - case object Favorites extends FeatureGroup { - override val continuousFeatures: Map[EngagementFeature, Continuous] = - Map( - Count -> InNetworkFavoritesCount, - RealGraphWeightAverage -> InNetworkFavoritesAvgRealGraphWeight, - RealGraphWeightMax -> InNetworkFavoritesMaxRealGraphWeight, - RealGraphWeightMin -> InNetworkFavoritesMinRealGraphWeight, - RealGraphWeightMissing -> InNetworkFavoritesRealGraphWeightMissing, - RealGraphWeightVariance -> InNetworkFavoritesRealGraphWeightVariance - ) - - override val sparseBinaryFeatures: Map[EngagementFeature, SparseBinary] = - Map(UserIds -> FavoritedByUserIds) - } - - case object Retweets extends FeatureGroup { - override val continuousFeatures: Map[EngagementFeature, Continuous] = - Map( - Count -> InNetworkRetweetsCount, - RealGraphWeightAverage -> InNetworkRetweetsAvgRealGraphWeight, - RealGraphWeightMax -> InNetworkRetweetsMaxRealGraphWeight, - RealGraphWeightMin -> InNetworkRetweetsMinRealGraphWeight, - RealGraphWeightMissing -> InNetworkRetweetsRealGraphWeightMissing, - RealGraphWeightVariance -> InNetworkRetweetsRealGraphWeightVariance - ) - - override val sparseBinaryFeatures: Map[EngagementFeature, SparseBinary] = - Map(UserIds -> RetweetedByUserIds) - } - - case object Replies extends FeatureGroup { - override val continuousFeatures: Map[EngagementFeature, Continuous] = - Map( - Count -> InNetworkRepliesCount, - RealGraphWeightAverage -> InNetworkRepliesAvgRealGraphWeight, - RealGraphWeightMax -> InNetworkRepliesMaxRealGraphWeight, - RealGraphWeightMin -> InNetworkRepliesMinRealGraphWeight, - RealGraphWeightMissing -> InNetworkRepliesRealGraphWeightMissing, - RealGraphWeightVariance -> InNetworkRepliesRealGraphWeightVariance - ) - - override val sparseBinaryFeatures: Map[EngagementFeature, SparseBinary] = - Map(UserIds -> RepliedByUserIds) - } - - val PublicEngagerSets = Set(FavoritedByUserIds, RetweetedByUserIds, RepliedByUserIds) - val PublicEngagementUserIds = new SparseBinary( - "engagement_features.user_ids.public", - Set(UserId, EngagementsPublic).asJava - ) - val ENGAGER_ID = TypedAggregateGroup.sparseFeature(PublicEngagementUserIds) - - val UnifyPublicEngagersTransform = SparseBinaryUnion( - featuresToUnify = PublicEngagerSets, - outputFeature = PublicEngagementUserIds - ) - - object RichUnifyPublicEngagersTransform extends OneToSomeTransform { - override def apply(dataRecord: DataRecord): Option[DataRecord] = - RichITransform(EngagementDataRecordFeatures.UnifyPublicEngagersTransform)(dataRecord) - override def featuresToTransform: Set[Feature[_]] = - EngagementDataRecordFeatures.UnifyPublicEngagersTransform.featuresToUnify.toSet - } -} diff --git a/src/scala/com/twitter/timelines/prediction/features/escherbird/BUILD b/src/scala/com/twitter/timelines/prediction/features/escherbird/BUILD deleted file mode 100644 index c28786b77..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/escherbird/BUILD +++ /dev/null @@ -1,19 +0,0 @@ -scala_library( - 
sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/tweetypie:tweet-scala", - ], -) - -scala_library( - name = "escherbird-features", - sources = ["EscherbirdFeatures.scala"], - tags = ["bazel-only"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeatures.scala deleted file mode 100644 index 3aaf9b856..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeatures.scala +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.timelines.prediction.features.escherbird - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature -import java.util.{Set => JSet} -import scala.collection.JavaConverters._ - -object EscherbirdFeatures { - val TweetGroupIds = new Feature.SparseBinary("escherbird.tweet_group_ids") - val TweetDomainIds = new Feature.SparseBinary("escherbird.tweet_domain_ids", Set(DomainId).asJava) - val TweetEntityIds = - new Feature.SparseBinary("escherbird.tweet_entity_ids", Set(SemanticcoreClassification).asJava) -} - -case class EscherbirdFeatures( - tweetId: Long, - tweetGroupIds: JSet[String], - tweetDomainIds: JSet[String], - tweetEntityIds: JSet[String]) diff --git a/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeaturesConverter.scala b/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeaturesConverter.scala deleted file mode 100644 index bd3333a03..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/escherbird/EscherbirdFeaturesConverter.scala +++ /dev/null @@ -1,19 +0,0 @@ -package com.twitter.timelines.prediction.features.escherbird - -import com.twitter.tweetypie.thriftscala.Tweet -import scala.collection.JavaConverters._ - -object EscherbirdFeaturesConverter { - val DeprecatedOrTestDomains = Set(1L, 5L, 7L, 9L, 14L, 19L, 20L, 31L) - - def fromTweet(tweet: Tweet): Option[EscherbirdFeatures] = tweet.escherbirdEntityAnnotations.map { - escherbirdEntityAnnotations => - val annotations = escherbirdEntityAnnotations.entityAnnotations - .filterNot(annotation => DeprecatedOrTestDomains.contains(annotation.domainId)) - val tweetGroupIds = annotations.map(_.groupId.toString).toSet.asJava - val tweetDomainIds = annotations.map(_.domainId.toString).toSet.asJava - // An entity is only unique within a given domain - val tweetEntityIds = annotations.map(a => s"${a.domainId}.${a.entityId}").toSet.asJava - EscherbirdFeatures(tweet.id, tweetGroupIds, tweetDomainIds, tweetEntityIds) - } -} diff --git a/src/scala/com/twitter/timelines/prediction/features/followsource/BUILD.bazel b/src/scala/com/twitter/timelines/prediction/features/followsource/BUILD.bazel deleted file mode 100644 index 0ee33acdb..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/followsource/BUILD.bazel +++ /dev/null @@ -1,7 +0,0 @@ -scala_library( - sources = ["*.scala"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/followsource/FollowSourceFeatures.scala 
b/src/scala/com/twitter/timelines/prediction/features/followsource/FollowSourceFeatures.scala deleted file mode 100644 index 012103b14..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/followsource/FollowSourceFeatures.scala +++ /dev/null @@ -1,53 +0,0 @@ -package com.twitter.timelines.prediction.features.followsource - -import com.twitter.ml.api.Feature -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import scala.collection.JavaConverters._ - -object FollowSourceFeatures { - - // Corresponds to an algorithm constant from com.twitter.hermit.profile.HermitProfileConstants - val FollowSourceAlgorithm = new Feature.Text("follow_source.algorithm") - - // Type of follow action: one of "unfollow", "follow", "follow_back", "follow_many", "follow_all" - val FollowAction = new Feature.Text( - "follow_source.action", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - - // Millisecond timestamp when follow occurred - val FollowTimestamp = - new Feature.Discrete("follow_source.follow_timestamp", Set(Follow, PrivateTimestamp).asJava) - - // Age of follow (in minutes) - val FollowAgeMinutes = - new Feature.Continuous("follow_source.follow_age_minutes", Set(Follow).asJava) - - // Tweet ID of tweet details page from where follow happened (if applicable) - val FollowCauseTweetId = new Feature.Discrete("follow_source.cause_tweet_id", Set(TweetId).asJava) - - // String representation of follow client (android, web, iphone, etc). Derived from "client" - // portion of client event namespace. - val FollowClientId = new Feature.Text("follow_source.client_id", Set(ClientType).asJava) - - // If the follow happens via a profile's Following or Followers, - // the id of the profile owner is recorded here. - val FollowAssociationId = - new Feature.Discrete("follow_source.association_id", Set(Follow, UserId).asJava) - - // The "friendly name" here is computed using FollowSourceUtil.getSource. It represents - // a grouping on a few client events that reflect where the event occurred. For example, - // events on the tweet details page are grouped using "tweetDetails": - // case (Some("web"), Some("permalink"), _, _, _) => "tweetDetails" - // case (Some("iphone"), Some("tweet"), _, _, _) => "tweetDetails" - // case (Some("android"), Some("tweet"), _, _, _) => "tweetDetails" - val FollowSourceFriendlyName = new Feature.Text("follow_source.friendly_name", Set(Follow).asJava) - - // Up to two sources and actions that preceded the follow (for example, a profile visit - // through a mention click, which itself was on a tweet detail page reached through a tweet - // click in the Home tab). See go/followsource for more details and examples. 
- // The "source" here is computed using FollowSourceUtil.getSource - val PreFollowAction1 = new Feature.Text("follow_source.pre_follow_action_1", Set(Follow).asJava) - val PreFollowAction2 = new Feature.Text("follow_source.pre_follow_action_2", Set(Follow).asJava) - val PreFollowSource1 = new Feature.Text("follow_source.pre_follow_source_1", Set(Follow).asJava) - val PreFollowSource2 = new Feature.Text("follow_source.pre_follow_source_2", Set(Follow).asJava) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/itl/BUILD b/src/scala/com/twitter/timelines/prediction/features/itl/BUILD deleted file mode 100644 index 6fc497bf3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/itl/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/itl/ITLFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/itl/ITLFeatures.scala deleted file mode 100644 index 3351e5c11..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/itl/ITLFeatures.scala +++ /dev/null @@ -1,575 +0,0 @@ -package com.twitter.timelines.prediction.features.itl - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.Discrete -import com.twitter.ml.api.Feature.SparseBinary -import scala.collection.JavaConverters._ - -object ITLFeatures { - // engagement - val IS_RETWEETED = - new Binary("itl.engagement.is_retweeted", Set(PublicRetweets, PrivateRetweets).asJava) - val IS_FAVORITED = - new Binary("itl.engagement.is_favorited", Set(PublicLikes, PrivateLikes).asJava) - val IS_REPLIED = - new Binary("itl.engagement.is_replied", Set(PublicReplies, PrivateReplies).asJava) - // v1: post click engagements: fav, reply - val IS_GOOD_CLICKED_CONVO_DESC_V1 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_favorited_or_replied", - Set( - PublicLikes, - PrivateLikes, - PublicReplies, - PrivateReplies, - EngagementsPrivate, - EngagementsPublic).asJava) - // v2: post click engagements: click - val IS_GOOD_CLICKED_CONVO_DESC_V2 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_v2", - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_GOOD_CLICKED_CONVO_DESC_FAVORITED = new Binary( - "itl.engagement.is_good_clicked_convo_desc_favorited", - Set(PublicLikes, PrivateLikes).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_REPLIED = new Binary( - "itl.engagement.is_good_clicked_convo_desc_replied", - Set(PublicReplies, PrivateReplies).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_RETWEETED = new Binary( - "itl.engagement.is_good_clicked_convo_desc_retweeted", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_CLICKED = new Binary( - "itl.engagement.is_good_clicked_convo_desc_clicked", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_FOLLOWED = - new Binary("itl.engagement.is_good_clicked_convo_desc_followed", Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_SHARE_DM_CLICKED = new Binary( - "itl.engagement.is_good_clicked_convo_desc_share_dm_clicked", - Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_PROFILE_CLICKED = new Binary( - 
"itl.engagement.is_good_clicked_convo_desc_profile_clicked", - Set(EngagementsPrivate).asJava) - - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_0 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_uam_gt_0", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_1 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_uam_gt_1", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_2 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_uam_gt_2", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_3 = new Binary( - "itl.engagement.is_good_clicked_convo_desc_uam_gt_3", - Set(EngagementsPrivate, EngagementsPublic).asJava) - - val IS_TWEET_DETAIL_DWELLED = new Binary( - "itl.engagement.is_tweet_detail_dwelled", - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_TWEET_DETAIL_DWELLED_8_SEC = new Binary( - "itl.engagement.is_tweet_detail_dwelled_8_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_15_SEC = new Binary( - "itl.engagement.is_tweet_detail_dwelled_15_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_25_SEC = new Binary( - "itl.engagement.is_tweet_detail_dwelled_25_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_30_SEC = new Binary( - "itl.engagement.is_tweet_detail_dwelled_30_sec", - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_PROFILE_DWELLED = new Binary( - "itl.engagement.is_profile_dwelled", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_10_SEC = new Binary( - "itl.engagement.is_profile_dwelled_10_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_20_SEC = new Binary( - "itl.engagement.is_profile_dwelled_20_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_30_SEC = new Binary( - "itl.engagement.is_profile_dwelled_30_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED = new Binary( - "itl.engagement.is_fullscreen_video_dwelled", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_5_SEC = new Binary( - "itl.engagement.is_fullscreen_video_dwelled_5_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_10_SEC = new Binary( - "itl.engagement.is_fullscreen_video_dwelled_10_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_20_SEC = new Binary( - "itl.engagement.is_fullscreen_video_dwelled_20_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_30_SEC = new Binary( - "itl.engagement.is_fullscreen_video_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_15_SEC = new Binary( - "itl.engagement.is_link_dwelled_15_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_30_SEC = new Binary( - "itl.engagement.is_link_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_60_SEC = new Binary( - "itl.engagement.is_link_dwelled_60_sec", - Set(MediaEngagementActivities, 
EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_QUOTED = - new Binary("itl.engagement.is_quoted", Set(PublicRetweets, PrivateRetweets).asJava) - val IS_RETWEETED_WITHOUT_QUOTE = new Binary( - "itl.engagement.is_retweeted_without_quote", - Set(PublicRetweets, PrivateRetweets).asJava) - val IS_CLICKED = new Binary( - "itl.engagement.is_clicked", - Set(EngagementsPrivate, TweetsClicked, LinksClickedOn).asJava) - val IS_PROFILE_CLICKED = new Binary( - "itl.engagement.is_profile_clicked", - Set(EngagementsPrivate, TweetsClicked, ProfilesViewed, ProfilesClicked).asJava) - val IS_DWELLED = new Binary("itl.engagement.is_dwelled", Set(EngagementsPrivate).asJava) - val IS_DWELLED_IN_BOUNDS_V1 = - new Binary("itl.engagement.is_dwelled_in_bounds_v1", Set(EngagementsPrivate).asJava) - val DWELL_NORMALIZED_OVERALL = - new Continuous("itl.engagement.dwell_normalized_overall", Set(EngagementsPrivate).asJava) - val DWELL_CDF_OVERALL = - new Continuous("itl.engagement.dwell_cdf_overall", Set(EngagementsPrivate).asJava) - val DWELL_CDF = new Continuous("itl.engagement.dwell_cdf", Set(EngagementsPrivate).asJava) - - val IS_DWELLED_1S = new Binary("itl.engagement.is_dwelled_1s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_2S = new Binary("itl.engagement.is_dwelled_2s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_3S = new Binary("itl.engagement.is_dwelled_3s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_4S = new Binary("itl.engagement.is_dwelled_4s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_5S = new Binary("itl.engagement.is_dwelled_5s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_6S = new Binary("itl.engagement.is_dwelled_6s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_7S = new Binary("itl.engagement.is_dwelled_7s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_8S = new Binary("itl.engagement.is_dwelled_8s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_9S = new Binary("itl.engagement.is_dwelled_9s", Set(EngagementsPrivate).asJava) - val IS_DWELLED_10S = new Binary("itl.engagement.is_dwelled_10s", Set(EngagementsPrivate).asJava) - - val IS_SKIPPED_1S = new Binary("itl.engagement.is_skipped_1s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_2S = new Binary("itl.engagement.is_skipped_2s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_3S = new Binary("itl.engagement.is_skipped_3s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_4S = new Binary("itl.engagement.is_skipped_4s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_5S = new Binary("itl.engagement.is_skipped_5s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_6S = new Binary("itl.engagement.is_skipped_6s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_7S = new Binary("itl.engagement.is_skipped_7s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_8S = new Binary("itl.engagement.is_skipped_8s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_9S = new Binary("itl.engagement.is_skipped_9s", Set(EngagementsPrivate).asJava) - val IS_SKIPPED_10S = new Binary("itl.engagement.is_skipped_10s", Set(EngagementsPrivate).asJava) - - val IS_FOLLOWED = - new Binary("itl.engagement.is_followed", Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_IMPRESSED = new Binary("itl.engagement.is_impressed", Set(EngagementsPrivate).asJava) - val IS_OPEN_LINKED = - new Binary("itl.engagement.is_open_linked", Set(EngagementsPrivate, LinksClickedOn).asJava) - val IS_PHOTO_EXPANDED = new Binary( - "itl.engagement.is_photo_expanded", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_VIDEO_VIEWED = - 
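The per-second dwell labels above form a cumulative ladder: a dwell of N seconds fires every `is_dwelled_{t}s` label with t ≤ N. A small standalone sketch of that mapping, assuming the threshold list 1..10 implied by the names (object and helper names are hypothetical):

```scala
// Standalone sketch (not part of the deleted file): given an observed dwell
// time, list which itl.engagement.is_dwelled_{1..10}s labels above would fire.
object DwellLabelSketch {
  private val thresholdsSeconds: Seq[Int] = 1 to 10

  def firedDwellLabels(dwellSeconds: Double): Seq[String] =
    thresholdsSeconds
      .filter(dwellSeconds >= _)
      .map(t => s"itl.engagement.is_dwelled_${t}s")

  def main(args: Array[String]): Unit = {
    // A 4.5 second dwell fires the 1s..4s labels and nothing above.
    println(firedDwellLabels(4.5))
  }
}
```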
new Binary("itl.engagement.is_video_viewed", Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_VIDEO_PLAYBACK_50 = new Binary( - "itl.engagement.is_video_playback_50", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_VIDEO_QUALITY_VIEWED = new Binary( - "itl.engagement.is_video_quality_viewed", - Set(EngagementsPrivate, EngagementsPublic).asJava - ) - val IS_BOOKMARKED = - new Binary("itl.engagement.is_bookmarked", Set(EngagementsPrivate).asJava) - val IS_SHARED = - new Binary("itl.engagement.is_shared", Set(EngagementsPrivate).asJava) - val IS_SHARE_MENU_CLICKED = - new Binary("itl.engagement.is_share_menu_clicked", Set(EngagementsPrivate).asJava) - - // Negative engagements - val IS_DONT_LIKE = - new Binary("itl.engagement.is_dont_like", Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_BLOCK_CLICKED = new Binary( - "itl.engagement.is_block_clicked", - Set(TweetsClicked, EngagementsPrivate, EngagementsPublic).asJava) - val IS_BLOCK_DIALOG_BLOCKED = new Binary( - "itl.engagement.is_block_dialog_blocked", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_MUTE_CLICKED = - new Binary("itl.engagement.is_mute_clicked", Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_MUTE_DIALOG_MUTED = - new Binary("itl.engagement.is_mute_dialog_muted", Set(EngagementsPrivate).asJava) - val IS_REPORT_TWEET_CLICKED = new Binary( - "itl.engagement.is_report_tweet_clicked", - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_CARET_CLICKED = - new Binary("itl.engagement.is_caret_clicked", Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_NOT_ABOUT_TOPIC = - new Binary("itl.engagement.is_not_about_topic", Set(EngagementsPrivate).asJava) - val IS_NOT_RECENT = - new Binary("itl.engagement.is_not_recent", Set(EngagementsPrivate).asJava) - val IS_NOT_RELEVANT = - new Binary("itl.engagement.is_not_relevant", Set(EngagementsPrivate).asJava) - val IS_SEE_FEWER = - new Binary("itl.engagement.is_see_fewer", Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC = - new Binary("itl.engagement.is_unfollow_topic", Set(EngagementsPrivate).asJava) - val IS_FOLLOW_TOPIC = - new Binary("itl.engagement.is_follow_topic", Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN_TOPIC = - new Binary("itl.engagement.is_not_interested_in_topic", Set(EngagementsPrivate).asJava) - val IS_HOME_LATEST_VISITED = - new Binary("itl.engagement.is_home_latest_visited", Set(EngagementsPrivate).asJava) - - // This derived label is the logical OR of IS_DONT_LIKE, IS_BLOCK_CLICKED, IS_MUTE_CLICKED and IS_REPORT_TWEET_CLICKED - val IS_NEGATIVE_FEEDBACK = - new Binary("itl.engagement.is_negative_feedback", Set(EngagementsPrivate).asJava) - - // Reciprocal engagements for reply forward engagement - val IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_impressed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_favorited_by_author", - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_QUOTED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_quoted_by_author", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_REPLIED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_replied_by_author", - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_RETWEETED_BY_AUTHOR = new 
Binary( - "itl.engagement.is_replied_reply_retweeted_by_author", - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_BLOCKED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_blocked_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_FOLLOWED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_followed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_UNFOLLOWED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_unfollowed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_MUTED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_muted_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_REPORTED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_reported_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - - // This derived label is the logical OR of REPLY_REPLIED, REPLY_FAVORITED, REPLY_RETWEETED - val IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR = new Binary( - "itl.engagement.is_replied_reply_engaged_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava) - - // Reciprocal engagements for fav forward engagement - val IS_FAVORITED_FAV_FAVORITED_BY_AUTHOR = new Binary( - "itl.engagement.is_favorited_fav_favorited_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateLikes, PublicLikes).asJava - ) - val IS_FAVORITED_FAV_REPLIED_BY_AUTHOR = new Binary( - "itl.engagement.is_favorited_fav_replied_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateReplies, PublicReplies).asJava - ) - val IS_FAVORITED_FAV_RETWEETED_BY_AUTHOR = new Binary( - "itl.engagement.is_favorited_fav_retweeted_by_author", - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava - ) - val IS_FAVORITED_FAV_FOLLOWED_BY_AUTHOR = new Binary( - "itl.engagement.is_favorited_fav_followed_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava - ) - // This derived label is the logical OR of FAV_REPLIED, FAV_FAVORITED, FAV_RETWEETED, FAV_FOLLOWED - val IS_FAVORITED_FAV_ENGAGED_BY_AUTHOR = new Binary( - "itl.engagement.is_favorited_fav_engaged_by_author", - Set(EngagementsPrivate, EngagementsPublic).asJava - ) - - // define good profile click by considering following engagements (follow, fav, reply, retweet, etc.) 
at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_FOLLOW = new Binary( - "itl.engagement.is_profile_clicked_and_profile_follow", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, Follow).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_FAV = new Binary( - "itl.engagement.is_profile_clicked_and_profile_fav", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateLikes, PublicLikes).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_REPLY = new Binary( - "itl.engagement.is_profile_clicked_and_profile_reply", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateReplies, PublicReplies).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_RETWEET = new Binary( - "itl.engagement.is_profile_clicked_and_profile_retweet", - Set( - ProfilesViewed, - ProfilesClicked, - EngagementsPrivate, - PrivateRetweets, - PublicRetweets).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_CLICK = new Binary( - "itl.engagement.is_profile_clicked_and_profile_tweet_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, TweetsClicked).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_SHARE_DM_CLICK = new Binary( - "itl.engagement.is_profile_clicked_and_profile_share_dm_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of all binary features above - val IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED = new Binary( - "itl.engagement.is_profile_clicked_and_profile_engaged", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, EngagementsPublic).asJava) - - // define bad profile click by considering following engagements (user report, tweet report, mute, block, etc) at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_USER_REPORT_CLICK = new Binary( - "itl.engagement.is_profile_clicked_and_profile_user_report_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_REPORT_CLICK = new Binary( - "itl.engagement.is_profile_clicked_and_profile_tweet_report_click", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_MUTE = new Binary( - "itl.engagement.is_profile_clicked_and_profile_mute", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_BLOCK = new Binary( - "itl.engagement.is_profile_clicked_and_profile_block", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of bad profile click engagements and existing negative feedback - val IS_NEGATIVE_FEEDBACK_V2 = new Binary( - "itl.engagement.is_negative_feedback_v2", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // engagement for following user from any surface area - val IS_FOLLOWED_FROM_ANY_SURFACE_AREA = new Binary( - "itl.engagement.is_followed_from_any_surface_area", - Set(EngagementsPublic, EngagementsPrivate).asJava) - - // Relevance prompt tweet engagements - val IS_RELEVANCE_PROMPT_YES_CLICKED = - new Binary("itl.engagement.is_relevance_prompt_yes_clicked", Set(EngagementsPrivate).asJava) - - // Reply downvote engagements - val IS_REPLY_DOWNVOTED = - new Binary("itl.engagement.is_reply_downvoted", Set(EngagementsPrivate).asJava) - val IS_REPLY_DOWNVOTE_REMOVED = - new Binary("itl.engagement.is_reply_downvote_removed", Set(EngagementsPrivate).asJava) - - // features from RecommendedTweet - val RECTWEET_SCORE = new Continuous("itl.recommended_tweet_features.rectweet_score") - val NUM_FAVORITING_USERS = new 
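The "union of all binary features above" comment on `is_profile_clicked_and_profile_engaged` can be read as: the profile was clicked and any of the listed profile-page engagements then happened. A hedged sketch of that union with hypothetical inputs:

```scala
// Sketch of the derived "good profile click" union described above; inputs and
// names are hypothetical, only the OR-of-components relationship is from the comment.
object ProfileClickSketch {
  def isProfileClickedAndEngaged(
    follow: Boolean,
    fav: Boolean,
    reply: Boolean,
    retweet: Boolean,
    tweetClick: Boolean,
    shareDmClick: Boolean
  ): Boolean =
    follow || fav || reply || retweet || tweetClick || shareDmClick
}
```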
Continuous("itl.recommended_tweet_features.num_favoriting_users") - val NUM_FOLLOWING_USERS = new Continuous("itl.recommended_tweet_features.num_following_users") - val CONTENT_SOURCE_TYPE = new Discrete("itl.recommended_tweet_features.content_source_type") - - val RECOS_SCORE = new Continuous( - "itl.recommended_tweet_features.recos_score", - Set(EngagementScore, UsersRealGraphScore, UsersSalsaScore).asJava) - val AUTHOR_REALGRAPH_SCORE = new Continuous( - "itl.recommended_tweet_features.realgraph_score", - Set(UsersRealGraphScore).asJava) - val AUTHOR_SARUS_SCORE = new Continuous( - "itl.recommended_tweet_features.sarus_score", - Set(EngagementScore, UsersSalsaScore).asJava) - - val NUM_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.num_interacting_users", - Set(EngagementScore).asJava - ) - val MAX_REALGRAPH_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.max_realgraph_score_of_interacting_users", - Set(UsersRealGraphScore, EngagementScore).asJava - ) - val SUM_REALGRAPH_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.sum_realgraph_score_of_interacting_users", - Set(UsersRealGraphScore, EngagementScore).asJava - ) - val AVG_REALGRAPH_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.avg_realgraph_score_of_interacting_users", - Set(UsersRealGraphScore, EngagementScore).asJava - ) - val MAX_SARUS_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.max_sarus_score_of_interacting_users", - Set(EngagementScore, UsersSalsaScore).asJava - ) - val SUM_SARUS_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.sum_sarus_score_of_interacting_users", - Set(EngagementScore, UsersSalsaScore).asJava - ) - val AVG_SARUS_SCORE_OF_INTERACTING_USERS = new Continuous( - "itl.recommended_tweet_features.avg_sarus_score_of_interacting_users", - Set(EngagementScore, UsersSalsaScore).asJava - ) - - val NUM_INTERACTING_FOLLOWINGS = new Continuous( - "itl.recommended_tweet_features.num_interacting_followings", - Set(EngagementScore).asJava - ) - - // features from HydratedTweetFeatures - val REAL_GRAPH_WEIGHT = - new Continuous("itl.hydrated_tweet_features.real_graph_weight", Set(UsersRealGraphScore).asJava) - val SARUS_GRAPH_WEIGHT = new Continuous("itl.hydrated_tweet_features.sarus_graph_weight") - val FROM_TOP_ENGAGED_USER = new Binary("itl.hydrated_tweet_features.from_top_engaged_user") - val FROM_TOP_INFLUENCER = new Binary("itl.hydrated_tweet_features.from_top_influencer") - val TOPIC_SIM_SEARCHER_INTERSTED_IN_AUTHOR_KNOWN_FOR = new Continuous( - "itl.hydrated_tweet_features.topic_sim_searcher_interested_in_author_known_for" - ) - val TOPIC_SIM_SEARCHER_AUTHOR_BOTH_INTERESTED_IN = new Continuous( - "itl.hydrated_tweet_features.topic_sim_searcher_author_both_interested_in" - ) - val TOPIC_SIM_SEARCHER_AUTHOR_BOTH_KNOWN_FOR = new Continuous( - "itl.hydrated_tweet_features.topic_sim_searcher_author_both_known_for" - ) - val USER_REP = new Continuous("itl.hydrated_tweet_features.user_rep") - val NORMALIZED_PARUS_SCORE = new Continuous("itl.hydrated_tweet_features.normalized_parus_score") - val CONTAINS_MEDIA = new Binary("itl.hydrated_tweet_features.contains_media") - val FROM_NEARBY = new Binary("itl.hydrated_tweet_features.from_nearby") - val TOPIC_SIM_SEARCHER_INTERESTED_IN_TWEET = new Continuous( - "itl.hydrated_tweet_features.topic_sim_searcher_interested_in_tweet" - ) - val MATCHES_UI_LANG = new Binary( - 
"itl.hydrated_tweet_features.matches_ui_lang", - Set(ProvidedLanguage, InferredLanguage).asJava) - val MATCHES_SEARCHER_MAIN_LANG = new Binary( - "itl.hydrated_tweet_features.matches_searcher_main_lang", - Set(ProvidedLanguage, InferredLanguage).asJava - ) - val MATCHES_SEARCHER_LANGS = new Binary( - "itl.hydrated_tweet_features.matches_searcher_langs", - Set(ProvidedLanguage, InferredLanguage).asJava) - val HAS_CARD = new Binary( - "itl.hydrated_tweet_features.has_card", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_IMAGE = new Binary( - "itl.hydrated_tweet_features.has_image", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NATIVE_IMAGE = new Binary( - "itl.hydrated_tweet_features.has_native_image", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VIDEO = new Binary("itl.hydrated_tweet_features.has_video") - val HAS_CONSUMER_VIDEO = new Binary( - "itl.hydrated_tweet_features.has_consumer_video", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_PRO_VIDEO = new Binary( - "itl.hydrated_tweet_features.has_pro_video", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_PERISCOPE = new Binary( - "itl.hydrated_tweet_features.has_periscope", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VINE = new Binary( - "itl.hydrated_tweet_features.has_vine", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NATIVE_VIDEO = new Binary( - "itl.hydrated_tweet_features.has_native_video", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_LINK = new Binary( - "itl.hydrated_tweet_features.has_link", - Set(UrlFoundFlag, PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val LINK_COUNT = new Continuous( - "itl.hydrated_tweet_features.link_count", - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val URL_DOMAINS = new SparseBinary( - "itl.hydrated_tweet_features.url_domains", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VISIBLE_LINK = new Binary( - "itl.hydrated_tweet_features.has_visible_link", - Set(UrlFoundFlag, PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NEWS = new Binary( - "itl.hydrated_tweet_features.has_news", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_TREND = new Binary( - "itl.hydrated_tweet_features.has_trend", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val BLENDER_SCORE = - new Continuous("itl.hydrated_tweet_features.blender_score", Set(EngagementScore).asJava) - val PARUS_SCORE = - new Continuous("itl.hydrated_tweet_features.parus_score", Set(EngagementScore).asJava) - val TEXT_SCORE = - new Continuous("itl.hydrated_tweet_features.text_score", Set(EngagementScore).asJava) - val BIDIRECTIONAL_REPLY_COUNT = new Continuous( - "itl.hydrated_tweet_features.bidirectional_reply_count", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val UNIDIRECTIONAL_REPLY_COUNT = new Continuous( - "itl.hydrated_tweet_features.unidirectional_reply_count", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val BIDIRECTIONAL_RETWEET_COUNT = new Continuous( - "itl.hydrated_tweet_features.bidirectional_retweet_count", - Set(CountOfPrivateRetweets, 
CountOfPublicRetweets).asJava - ) - val UNIDIRECTIONAL_RETWEET_COUNT = new Continuous( - "itl.hydrated_tweet_features.unidirectional_retweet_count", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val BIDIRECTIONAL_FAV_COUNT = new Continuous( - "itl.hydrated_tweet_features.bidirectional_fav_count", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val UNIDIRECTIONAL_FAV_COUNT = new Continuous( - "itl.hydrated_tweet_features.unidirectional_fav_count", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val CONVERSATION_COUNT = new Continuous("itl.hydrated_tweet_features.conversation_count") - val FAV_COUNT = new Continuous( - "itl.hydrated_tweet_features.fav_count", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val REPLY_COUNT = new Continuous( - "itl.hydrated_tweet_features.reply_count", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val RETWEET_COUNT = new Continuous( - "itl.hydrated_tweet_features.retweet_count", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val PREV_USER_TWEET_ENGAGEMENT = new Continuous( - "itl.hydrated_tweet_features.prev_user_tweet_enagagement", - Set(EngagementScore, EngagementsPrivate, EngagementsPublic).asJava - ) - val IS_SENSITIVE = new Binary("itl.hydrated_tweet_features.is_sensitive") - val HAS_MULTIPLE_MEDIA = new Binary( - "itl.hydrated_tweet_features.has_multiple_media", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_MULTIPLE_HASHTAGS_OR_TRENDS = new Binary( - "itl.hydrated_tweet_features.has_multiple_hashtag_or_trend", - Set( - UserVisibleFlag, - CountOfPrivateTweetEntitiesAndMetadata, - CountOfPublicTweetEntitiesAndMetadata).asJava) - val IS_AUTHOR_PROFILE_EGG = - new Binary("itl.hydrated_tweet_features.is_author_profile_egg", Set(ProfileImage).asJava) - val IS_AUTHOR_NEW = - new Binary("itl.hydrated_tweet_features.is_author_new", Set(UserType, UserState).asJava) - val NUM_MENTIONS = new Continuous( - "itl.hydrated_tweet_features.num_mentions", - Set( - UserVisibleFlag, - CountOfPrivateTweetEntitiesAndMetadata, - CountOfPublicTweetEntitiesAndMetadata).asJava) - val NUM_HASHTAGS = new Continuous( - "itl.hydrated_tweet_features.num_hashtags", - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val LANGUAGE = new Discrete( - "itl.hydrated_tweet_features.language", - Set(ProvidedLanguage, InferredLanguage).asJava) - val LINK_LANGUAGE = new Continuous( - "itl.hydrated_tweet_features.link_language", - Set(ProvidedLanguage, InferredLanguage).asJava) - val IS_AUTHOR_NSFW = - new Binary("itl.hydrated_tweet_features.is_author_nsfw", Set(UserType).asJava) - val IS_AUTHOR_SPAM = - new Binary("itl.hydrated_tweet_features.is_author_spam", Set(UserType).asJava) - val IS_AUTHOR_BOT = new Binary("itl.hydrated_tweet_features.is_author_bot", Set(UserType).asJava) - val IS_OFFENSIVE = new Binary("itl.hydrated_tweet_features.is_offensive") - val FROM_VERIFIED_ACCOUNT = - new Binary("itl.hydrated_tweet_features.from_verified_account", Set(UserVerifiedFlag).asJava) - val EMBEDS_IMPRESSION_COUNT = new Continuous( - "itl.hydrated_tweet_features.embeds_impression_count", - Set(CountOfImpression).asJava) - val EMBEDS_URL_COUNT = - new Continuous("itl.hydrated_tweet_features.embeds_url_count", Set(UrlFoundFlag).asJava) - val FAV_COUNT_V2 = new Continuous( - "recap.earlybird.fav_count_v2", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val RETWEET_COUNT_V2 = new Continuous( - "recap.earlybird.retweet_count_v2", - 
Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val REPLY_COUNT_V2 = new Continuous( - "recap.earlybird.reply_count_v2", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/list_features/BUILD b/src/scala/com/twitter/timelines/prediction/features/list_features/BUILD deleted file mode 100644 index 6fc497bf3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/list_features/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/list_features/ListFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/list_features/ListFeatures.scala deleted file mode 100644 index ffb00d1f6..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/list_features/ListFeatures.scala +++ /dev/null @@ -1,24 +0,0 @@ -package com.twitter.timelines.prediction.features.list_features - -import com.twitter.ml.api.Feature.{Binary, Discrete} -import com.twitter.ml.api.FeatureContext -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import scala.collection.JavaConverters._ - -object ListFeatures { - - // list.id is used for list tweet injections in home. timelines.meta.list_id is used for list tweets in list timeline. - val LIST_ID = new Discrete("list.id") - - val VIEWER_IS_OWNER = - new Binary("list.viewer.is_owner", Set(ListsNonpublicList, ListsPublicList).asJava) - val VIEWER_IS_SUBSCRIBER = new Binary("list.viewer.is_subscriber") - val IS_PINNED_LIST = new Binary("list.is_pinned") - - val featureContext = new FeatureContext( - LIST_ID, - VIEWER_IS_OWNER, - VIEWER_IS_SUBSCRIBER, - IS_PINNED_LIST - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/p_home_latest/BUILD b/src/scala/com/twitter/timelines/prediction/features/p_home_latest/BUILD deleted file mode 100644 index 6fc497bf3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/p_home_latest/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/p_home_latest/HomeLatestUserFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/p_home_latest/HomeLatestUserFeatures.scala deleted file mode 100644 index 65d721a05..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/p_home_latest/HomeLatestUserFeatures.scala +++ /dev/null @@ -1,49 +0,0 @@ -package com.twitter.timelines.prediction.features.p_home_latest - -import com.twitter.ml.api.Feature.{Continuous, Discrete} -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import scala.collection.JavaConverters._ - -object HomeLatestUserFeatures { - val LAST_LOGIN_TIMESTAMP_MS = - new Discrete("home_latest.user_feature.last_login_timestamp_ms", Set(PrivateTimestamp).asJava) -} - -object HomeLatestUserAggregatesFeatures { - - /** - * Used as `timestampFeature` in `OfflineAggregateSource` required by feature aggregations, set to - * the `dateRange` end timestamp by default - */ - val AGGREGATE_TIMESTAMP_MS = - new 
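`ListFeatures` above bundles its features into a `FeatureContext`, which declares the set of features a model or transform expects. As a usage illustration only, a narrower context can be declared the same way; the grouping below is hypothetical and assumes the same varargs constructor used in `ListFeatures`:

```scala
// Hypothetical grouping that reuses the FeatureContext pattern shown in ListFeatures above.
import com.twitter.ml.api.FeatureContext
import com.twitter.timelines.prediction.features.list_features.ListFeatures

object ViewerListContextSketch {
  val viewerRelationshipContext: FeatureContext = new FeatureContext(
    ListFeatures.VIEWER_IS_OWNER,
    ListFeatures.VIEWER_IS_SUBSCRIBER
  )
}
```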
Discrete("home_latest.user_feature.aggregate_timestamp_ms", Set(PrivateTimestamp).asJava) - val HOME_TOP_IMPRESSIONS = - new Continuous("home_latest.user_feature.home_top_impressions", Set(CountOfImpression).asJava) - val HOME_LATEST_IMPRESSIONS = - new Continuous( - "home_latest.user_feature.home_latest_impressions", - Set(CountOfImpression).asJava) - val HOME_TOP_LAST_LOGIN_TIMESTAMP_MS = - new Discrete( - "home_latest.user_feature.home_top_last_login_timestamp_ms", - Set(PrivateTimestamp).asJava) - val HOME_LATEST_LAST_LOGIN_TIMESTAMP_MS = - new Discrete( - "home_latest.user_feature.home_latest_last_login_timestamp_ms", - Set(PrivateTimestamp).asJava) - val HOME_LATEST_MOST_RECENT_CLICK_TIMESTAMP_MS = - new Discrete( - "home_latest.user_feature.home_latest_most_recent_click_timestamp_ms", - Set(PrivateTimestamp).asJava) -} - -case class HomeLatestUserFeatures(userId: Long, lastLoginTimestampMs: Long) - -case class HomeLatestUserAggregatesFeatures( - userId: Long, - aggregateTimestampMs: Long, - homeTopImpressions: Option[Double], - homeLatestImpressions: Option[Double], - homeTopLastLoginTimestampMs: Option[Long], - homeLatestLastLoginTimestampMs: Option[Long], - homeLatestMostRecentClickTimestampMs: Option[Long]) diff --git a/src/scala/com/twitter/timelines/prediction/features/ppmi/BUILD b/src/scala/com/twitter/timelines/prediction/features/ppmi/BUILD deleted file mode 100644 index babba31bb..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/ppmi/BUILD +++ /dev/null @@ -1,8 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/ppmi/PpmiFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/ppmi/PpmiFeatures.scala deleted file mode 100644 index 7e6d1dea8..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/ppmi/PpmiFeatures.scala +++ /dev/null @@ -1,7 +0,0 @@ -package com.twitter.timelines.prediction.features.ppmi - -import com.twitter.ml.api.Feature.Continuous - -object PpmiDataRecordFeatures { - val PPMI_SCORE = new Continuous("ppmi.source_author.score") -} diff --git a/src/scala/com/twitter/timelines/prediction/features/real_graph/BUILD b/src/scala/com/twitter/timelines/prediction/features/real_graph/BUILD deleted file mode 100644 index 868acec21..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/real_graph/BUILD +++ /dev/null @@ -1,15 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/featurestore/catalog/entities/core", - "src/scala/com/twitter/ml/featurestore/catalog/entities/timelines", - "src/scala/com/twitter/ml/featurestore/catalog/features/timelines:realgraph", - "src/scala/com/twitter/ml/featurestore/lib/entity", - "src/scala/com/twitter/ml/featurestore/lib/feature", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/timelines/real_graph:real_graph-scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatureStoreFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatureStoreFeatures.scala deleted file mode 100644 index 7c52349aa..000000000 --- 
a/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatureStoreFeatures.scala +++ /dev/null @@ -1,232 +0,0 @@ -package com.twitter.timelines.prediction.features.real_graph - -import com.twitter.ml.featurestore.catalog.entities.core.UserAuthor -import com.twitter.ml.featurestore.catalog.features.timelines.RealGraph -import com.twitter.ml.featurestore.lib.EdgeEntityId -import com.twitter.ml.featurestore.lib.UserId -import com.twitter.ml.featurestore.lib.feature.BoundFeatureSet -import com.twitter.ml.featurestore.lib.feature.Feature -import com.twitter.ml.featurestore.lib.feature.FeatureSet - -object RealGraphDataRecordFeatureStoreFeatures { - val boundUserAuthorfeatureSet: BoundFeatureSet = FeatureSet( - RealGraph.DestId, - RealGraph.AddressBookEmail.DaysSinceLast, - RealGraph.AddressBookEmail.ElapsedDays, - RealGraph.AddressBookEmail.Ewma, - RealGraph.AddressBookEmail.IsMissing, - RealGraph.AddressBookEmail.Mean, - RealGraph.AddressBookEmail.NonZeroDays, - RealGraph.AddressBookEmail.Variance, - RealGraph.AddressBookInBoth.DaysSinceLast, - RealGraph.AddressBookInBoth.ElapsedDays, - RealGraph.AddressBookInBoth.Ewma, - RealGraph.AddressBookInBoth.IsMissing, - RealGraph.AddressBookInBoth.Mean, - RealGraph.AddressBookInBoth.NonZeroDays, - RealGraph.AddressBookInBoth.Variance, - RealGraph.AddressBookMutualEdgeEmail.DaysSinceLast, - RealGraph.AddressBookMutualEdgeEmail.ElapsedDays, - RealGraph.AddressBookMutualEdgeEmail.Ewma, - RealGraph.AddressBookMutualEdgeEmail.IsMissing, - RealGraph.AddressBookMutualEdgeEmail.Mean, - RealGraph.AddressBookMutualEdgeEmail.NonZeroDays, - RealGraph.AddressBookMutualEdgeEmail.Variance, - RealGraph.AddressBookMutualEdgeInBoth.DaysSinceLast, - RealGraph.AddressBookMutualEdgeInBoth.ElapsedDays, - RealGraph.AddressBookMutualEdgeInBoth.Ewma, - RealGraph.AddressBookMutualEdgeInBoth.IsMissing, - RealGraph.AddressBookMutualEdgeInBoth.Mean, - RealGraph.AddressBookMutualEdgeInBoth.NonZeroDays, - RealGraph.AddressBookMutualEdgeInBoth.Variance, - RealGraph.AddressBookMutualEdgePhone.DaysSinceLast, - RealGraph.AddressBookMutualEdgePhone.ElapsedDays, - RealGraph.AddressBookMutualEdgePhone.Ewma, - RealGraph.AddressBookMutualEdgePhone.IsMissing, - RealGraph.AddressBookMutualEdgePhone.Mean, - RealGraph.AddressBookMutualEdgePhone.NonZeroDays, - RealGraph.AddressBookMutualEdgePhone.Variance, - RealGraph.AddressBookPhone.DaysSinceLast, - RealGraph.AddressBookPhone.ElapsedDays, - RealGraph.AddressBookPhone.Ewma, - RealGraph.AddressBookPhone.IsMissing, - RealGraph.AddressBookPhone.Mean, - RealGraph.AddressBookPhone.NonZeroDays, - RealGraph.AddressBookPhone.Variance, - RealGraph.DirectMessages.DaysSinceLast, - RealGraph.DirectMessages.ElapsedDays, - RealGraph.DirectMessages.Ewma, - RealGraph.DirectMessages.IsMissing, - RealGraph.DirectMessages.Mean, - RealGraph.DirectMessages.NonZeroDays, - RealGraph.DirectMessages.Variance, - RealGraph.DwellTime.DaysSinceLast, - RealGraph.DwellTime.ElapsedDays, - RealGraph.DwellTime.Ewma, - RealGraph.DwellTime.IsMissing, - RealGraph.DwellTime.Mean, - RealGraph.DwellTime.NonZeroDays, - RealGraph.DwellTime.Variance, - RealGraph.Follow.DaysSinceLast, - RealGraph.Follow.ElapsedDays, - RealGraph.Follow.Ewma, - RealGraph.Follow.IsMissing, - RealGraph.Follow.Mean, - RealGraph.Follow.NonZeroDays, - RealGraph.Follow.Variance, - RealGraph.InspectedStatuses.DaysSinceLast, - RealGraph.InspectedStatuses.ElapsedDays, - RealGraph.InspectedStatuses.Ewma, - RealGraph.InspectedStatuses.IsMissing, - RealGraph.InspectedStatuses.Mean, - 
RealGraph.InspectedStatuses.NonZeroDays, - RealGraph.InspectedStatuses.Variance, - RealGraph.Likes.DaysSinceLast, - RealGraph.Likes.ElapsedDays, - RealGraph.Likes.Ewma, - RealGraph.Likes.IsMissing, - RealGraph.Likes.Mean, - RealGraph.Likes.NonZeroDays, - RealGraph.Likes.Variance, - RealGraph.LinkClicks.DaysSinceLast, - RealGraph.LinkClicks.ElapsedDays, - RealGraph.LinkClicks.Ewma, - RealGraph.LinkClicks.IsMissing, - RealGraph.LinkClicks.Mean, - RealGraph.LinkClicks.NonZeroDays, - RealGraph.LinkClicks.Variance, - RealGraph.Mentions.DaysSinceLast, - RealGraph.Mentions.ElapsedDays, - RealGraph.Mentions.Ewma, - RealGraph.Mentions.IsMissing, - RealGraph.Mentions.Mean, - RealGraph.Mentions.NonZeroDays, - RealGraph.Mentions.Variance, - RealGraph.MutualFollow.DaysSinceLast, - RealGraph.MutualFollow.ElapsedDays, - RealGraph.MutualFollow.Ewma, - RealGraph.MutualFollow.IsMissing, - RealGraph.MutualFollow.Mean, - RealGraph.MutualFollow.NonZeroDays, - RealGraph.MutualFollow.Variance, - RealGraph.NumTweetQuotes.DaysSinceLast, - RealGraph.NumTweetQuotes.ElapsedDays, - RealGraph.NumTweetQuotes.Ewma, - RealGraph.NumTweetQuotes.IsMissing, - RealGraph.NumTweetQuotes.Mean, - RealGraph.NumTweetQuotes.NonZeroDays, - RealGraph.NumTweetQuotes.Variance, - RealGraph.PhotoTags.DaysSinceLast, - RealGraph.PhotoTags.ElapsedDays, - RealGraph.PhotoTags.Ewma, - RealGraph.PhotoTags.IsMissing, - RealGraph.PhotoTags.Mean, - RealGraph.PhotoTags.NonZeroDays, - RealGraph.PhotoTags.Variance, - RealGraph.ProfileViews.DaysSinceLast, - RealGraph.ProfileViews.ElapsedDays, - RealGraph.ProfileViews.Ewma, - RealGraph.ProfileViews.IsMissing, - RealGraph.ProfileViews.Mean, - RealGraph.ProfileViews.NonZeroDays, - RealGraph.ProfileViews.Variance, - RealGraph.Retweets.DaysSinceLast, - RealGraph.Retweets.ElapsedDays, - RealGraph.Retweets.Ewma, - RealGraph.Retweets.IsMissing, - RealGraph.Retweets.Mean, - RealGraph.Retweets.NonZeroDays, - RealGraph.Retweets.Variance, - RealGraph.SmsFollow.DaysSinceLast, - RealGraph.SmsFollow.ElapsedDays, - RealGraph.SmsFollow.Ewma, - RealGraph.SmsFollow.IsMissing, - RealGraph.SmsFollow.Mean, - RealGraph.SmsFollow.NonZeroDays, - RealGraph.SmsFollow.Variance, - RealGraph.TweetClicks.DaysSinceLast, - RealGraph.TweetClicks.ElapsedDays, - RealGraph.TweetClicks.Ewma, - RealGraph.TweetClicks.IsMissing, - RealGraph.TweetClicks.Mean, - RealGraph.TweetClicks.NonZeroDays, - RealGraph.TweetClicks.Variance, - RealGraph.Weight - ).bind(UserAuthor) - - private[this] val edgeFeatures: Seq[RealGraph.EdgeFeature] = Seq( - RealGraph.AddressBookEmail, - RealGraph.AddressBookInBoth, - RealGraph.AddressBookMutualEdgeEmail, - RealGraph.AddressBookMutualEdgeInBoth, - RealGraph.AddressBookMutualEdgePhone, - RealGraph.AddressBookPhone, - RealGraph.DirectMessages, - RealGraph.DwellTime, - RealGraph.Follow, - RealGraph.InspectedStatuses, - RealGraph.Likes, - RealGraph.LinkClicks, - RealGraph.Mentions, - RealGraph.MutualFollow, - RealGraph.PhotoTags, - RealGraph.ProfileViews, - RealGraph.Retweets, - RealGraph.SmsFollow, - RealGraph.TweetClicks - ) - - val htlDoubleFeatures: Set[Feature[EdgeEntityId[UserId, UserId], Double]] = { - val features = edgeFeatures.flatMap { ef => - Seq(ef.Ewma, ef.Mean, ef.Variance) - } ++ Seq(RealGraph.Weight) - features.toSet - } - - val htlLongFeatures: Set[Feature[EdgeEntityId[UserId, UserId], Long]] = { - val features = edgeFeatures.flatMap { ef => - Seq(ef.DaysSinceLast, ef.ElapsedDays, ef.NonZeroDays) - } - features.toSet - } - - private val edgeFeatureToLegacyName = Map( - RealGraph.AddressBookEmail -> 
"num_address_book_email", - RealGraph.AddressBookInBoth -> "num_address_book_in_both", - RealGraph.AddressBookMutualEdgeEmail -> "num_address_book_mutual_edge_email", - RealGraph.AddressBookMutualEdgeInBoth -> "num_address_book_mutual_edge_in_both", - RealGraph.AddressBookMutualEdgePhone -> "num_address_book_mutual_edge_phone", - RealGraph.AddressBookPhone -> "num_address_book_phone", - RealGraph.DirectMessages -> "direct_messages", - RealGraph.DwellTime -> "total_dwell_time", - RealGraph.Follow -> "num_follow", - RealGraph.InspectedStatuses -> "num_inspected_tweets", - RealGraph.Likes -> "num_favorites", - RealGraph.LinkClicks -> "num_link_clicks", - RealGraph.Mentions -> "num_mentions", - RealGraph.MutualFollow -> "num_mutual_follow", - RealGraph.PhotoTags -> "num_photo_tags", - RealGraph.ProfileViews -> "num_profile_views", - RealGraph.Retweets -> "num_retweets", - RealGraph.SmsFollow -> "num_sms_follow", - RealGraph.TweetClicks -> "num_tweet_clicks", - ) - - def convertFeatureToLegacyName( - prefix: String, - variance: String = "variance" - ): Map[Feature[EdgeEntityId[UserId, UserId], _ >: Long with Double <: AnyVal], String] = - edgeFeatureToLegacyName.flatMap { - case (k, v) => - Seq( - k.NonZeroDays -> s"${prefix}.${v}.non_zero_days", - k.DaysSinceLast -> s"${prefix}.${v}.days_since_last", - k.ElapsedDays -> s"${prefix}.${v}.elapsed_days", - k.Ewma -> s"${prefix}.${v}.ewma", - k.Mean -> s"${prefix}.${v}.mean", - k.Variance -> s"${prefix}.${v}.${variance}", - ) - } ++ Map( - RealGraph.Weight -> (prefix + ".weight") - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatures.scala deleted file mode 100644 index 4c1915944..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/real_graph/RealGraphDataRecordFeatures.scala +++ /dev/null @@ -1,534 +0,0 @@ -package com.twitter.timelines.prediction.features.real_graph - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature._ -import com.twitter.timelines.real_graph.v1.thriftscala.RealGraphEdgeFeature -import scala.collection.JavaConverters._ - - -object RealGraphDataRecordFeatures { - // the source user id - val SRC_ID = new Discrete("realgraph.src_id", Set(UserId).asJava) - // the destination user id - val DST_ID = new Discrete("realgraph.dst_id", Set(UserId).asJava) - // real graph weight - val WEIGHT = new Continuous("realgraph.weight", Set(UsersRealGraphScore).asJava) - // the number of retweets that the source user sent to the destination user - val NUM_RETWEETS_MEAN = - new Continuous("realgraph.num_retweets.mean", Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_EWMA = - new Continuous("realgraph.num_retweets.ewma", Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_VARIANCE = - new Continuous("realgraph.num_retweets.variance", Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_NON_ZERO_DAYS = new Continuous( - "realgraph.num_retweets.non_zero_days", - Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_ELAPSED_DAYS = new Continuous( - "realgraph.num_retweets.elapsed_days", - Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_retweets.days_since_last", - Set(PrivateRetweets, PublicRetweets).asJava) - val NUM_RETWEETS_IS_MISSING = - new Binary("realgraph.num_retweets.is_missing", Set(PrivateRetweets, 
PublicRetweets).asJava) - // the number of favories that the source user sent to the destination user - val NUM_FAVORITES_MEAN = - new Continuous("realgraph.num_favorites.mean", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_EWMA = - new Continuous("realgraph.num_favorites.ewma", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_VARIANCE = - new Continuous("realgraph.num_favorites.variance", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_NON_ZERO_DAYS = - new Continuous("realgraph.num_favorites.non_zero_days", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_ELAPSED_DAYS = - new Continuous("realgraph.num_favorites.elapsed_days", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_DAYS_SINCE_LAST = - new Continuous("realgraph.num_favorites.days_since_last", Set(PublicLikes, PrivateLikes).asJava) - val NUM_FAVORITES_IS_MISSING = - new Binary("realgraph.num_favorites.is_missing", Set(PublicLikes, PrivateLikes).asJava) - // the number of mentions that the source user sent to the destination user - val NUM_MENTIONS_MEAN = - new Continuous("realgraph.num_mentions.mean", Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_EWMA = - new Continuous("realgraph.num_mentions.ewma", Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_VARIANCE = new Continuous( - "realgraph.num_mentions.variance", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_NON_ZERO_DAYS = new Continuous( - "realgraph.num_mentions.non_zero_days", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_ELAPSED_DAYS = new Continuous( - "realgraph.num_mentions.elapsed_days", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_mentions.days_since_last", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_MENTIONS_IS_MISSING = new Binary( - "realgraph.num_mentions.is_missing", - Set(EngagementsPrivate, EngagementsPublic).asJava) - // the number of direct messages that the source user sent to the destination user - val NUM_DIRECT_MESSAGES_MEAN = new Continuous( - "realgraph.num_direct_messages.mean", - Set(DmEntitiesAndMetadata, CountOfDms).asJava) - val NUM_DIRECT_MESSAGES_EWMA = new Continuous( - "realgraph.num_direct_messages.ewma", - Set(DmEntitiesAndMetadata, CountOfDms).asJava) - val NUM_DIRECT_MESSAGES_VARIANCE = new Continuous( - "realgraph.num_direct_messages.variance", - Set(DmEntitiesAndMetadata, CountOfDms).asJava) - val NUM_DIRECT_MESSAGES_NON_ZERO_DAYS = new Continuous( - "realgraph.num_direct_messages.non_zero_days", - Set(DmEntitiesAndMetadata, CountOfDms).asJava - ) - val NUM_DIRECT_MESSAGES_ELAPSED_DAYS = new Continuous( - "realgraph.num_direct_messages.elapsed_days", - Set(DmEntitiesAndMetadata, CountOfDms).asJava - ) - val NUM_DIRECT_MESSAGES_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_direct_messages.days_since_last", - Set(DmEntitiesAndMetadata, CountOfDms).asJava - ) - val NUM_DIRECT_MESSAGES_IS_MISSING = new Binary( - "realgraph.num_direct_messages.is_missing", - Set(DmEntitiesAndMetadata, CountOfDms).asJava) - // the number of tweet clicks that the source user sent to the destination user - val NUM_TWEET_CLICKS_MEAN = - new Continuous("realgraph.num_tweet_clicks.mean", Set(TweetsClicked).asJava) - val NUM_TWEET_CLICKS_EWMA = - new Continuous("realgraph.num_tweet_clicks.ewma", Set(TweetsClicked).asJava) - val NUM_TWEET_CLICKS_VARIANCE = - new Continuous("realgraph.num_tweet_clicks.variance", 
Set(TweetsClicked).asJava) - val NUM_TWEET_CLICKS_NON_ZERO_DAYS = - new Continuous("realgraph.num_tweet_clicks.non_zero_days", Set(TweetsClicked).asJava) - val NUM_TWEET_CLICKS_ELAPSED_DAYS = - new Continuous("realgraph.num_tweet_clicks.elapsed_days", Set(TweetsClicked).asJava) - val NUM_TWEET_CLICKS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_tweet_clicks.days_since_last", - Set(TweetsClicked).asJava - ) - val NUM_TWEET_CLICKS_IS_MISSING = - new Binary("realgraph.num_tweet_clicks.is_missing", Set(TweetsClicked).asJava) - // the number of link clicks that the source user sent to the destination user - val NUM_LINK_CLICKS_MEAN = - new Continuous("realgraph.num_link_clicks.mean", Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_EWMA = - new Continuous("realgraph.num_link_clicks.ewma", Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_VARIANCE = - new Continuous("realgraph.num_link_clicks.variance", Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_NON_ZERO_DAYS = new Continuous( - "realgraph.num_link_clicks.non_zero_days", - Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_ELAPSED_DAYS = new Continuous( - "realgraph.num_link_clicks.elapsed_days", - Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_link_clicks.days_since_last", - Set(CountOfTweetEntitiesClicked).asJava) - val NUM_LINK_CLICKS_IS_MISSING = - new Binary("realgraph.num_link_clicks.is_missing", Set(CountOfTweetEntitiesClicked).asJava) - // the number of profile views that the source user sent to the destination user - val NUM_PROFILE_VIEWS_MEAN = - new Continuous("realgraph.num_profile_views.mean", Set(ProfilesViewed).asJava) - val NUM_PROFILE_VIEWS_EWMA = - new Continuous("realgraph.num_profile_views.ewma", Set(ProfilesViewed).asJava) - val NUM_PROFILE_VIEWS_VARIANCE = - new Continuous("realgraph.num_profile_views.variance", Set(ProfilesViewed).asJava) - val NUM_PROFILE_VIEWS_NON_ZERO_DAYS = - new Continuous("realgraph.num_profile_views.non_zero_days", Set(ProfilesViewed).asJava) - val NUM_PROFILE_VIEWS_ELAPSED_DAYS = - new Continuous("realgraph.num_profile_views.elapsed_days", Set(ProfilesViewed).asJava) - val NUM_PROFILE_VIEWS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_profile_views.days_since_last", - Set(ProfilesViewed).asJava - ) - val NUM_PROFILE_VIEWS_IS_MISSING = - new Binary("realgraph.num_profile_views.is_missing", Set(ProfilesViewed).asJava) - // the total dwell time the source user spends on the target user's tweets - val TOTAL_DWELL_TIME_MEAN = - new Continuous("realgraph.total_dwell_time.mean", Set(CountOfImpression).asJava) - val TOTAL_DWELL_TIME_EWMA = - new Continuous("realgraph.total_dwell_time.ewma", Set(CountOfImpression).asJava) - val TOTAL_DWELL_TIME_VARIANCE = - new Continuous("realgraph.total_dwell_time.variance", Set(CountOfImpression).asJava) - val TOTAL_DWELL_TIME_NON_ZERO_DAYS = - new Continuous("realgraph.total_dwell_time.non_zero_days", Set(CountOfImpression).asJava) - val TOTAL_DWELL_TIME_ELAPSED_DAYS = - new Continuous("realgraph.total_dwell_time.elapsed_days", Set(CountOfImpression).asJava) - val TOTAL_DWELL_TIME_DAYS_SINCE_LAST = new Continuous( - "realgraph.total_dwell_time.days_since_last", - Set(CountOfImpression).asJava - ) - val TOTAL_DWELL_TIME_IS_MISSING = - new Binary("realgraph.total_dwell_time.is_missing", Set(CountOfImpression).asJava) - // the number of the target user's tweets that the source user has inspected - val NUM_INSPECTED_TWEETS_MEAN = - new 
Continuous("realgraph.num_inspected_tweets.mean", Set(CountOfImpression).asJava) - val NUM_INSPECTED_TWEETS_EWMA = - new Continuous("realgraph.num_inspected_tweets.ewma", Set(CountOfImpression).asJava) - val NUM_INSPECTED_TWEETS_VARIANCE = - new Continuous("realgraph.num_inspected_tweets.variance", Set(CountOfImpression).asJava) - val NUM_INSPECTED_TWEETS_NON_ZERO_DAYS = new Continuous( - "realgraph.num_inspected_tweets.non_zero_days", - Set(CountOfImpression).asJava - ) - val NUM_INSPECTED_TWEETS_ELAPSED_DAYS = new Continuous( - "realgraph.num_inspected_tweets.elapsed_days", - Set(CountOfImpression).asJava - ) - val NUM_INSPECTED_TWEETS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_inspected_tweets.days_since_last", - Set(CountOfImpression).asJava - ) - val NUM_INSPECTED_TWEETS_IS_MISSING = - new Binary("realgraph.num_inspected_tweets.is_missing", Set(CountOfImpression).asJava) - // the number of photos in which the source user has tagged the target user - val NUM_PHOTO_TAGS_MEAN = new Continuous( - "realgraph.num_photo_tags.mean", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_EWMA = new Continuous( - "realgraph.num_photo_tags.ewma", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_VARIANCE = new Continuous( - "realgraph.num_photo_tags.variance", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_NON_ZERO_DAYS = new Continuous( - "realgraph.num_photo_tags.non_zero_days", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_ELAPSED_DAYS = new Continuous( - "realgraph.num_photo_tags.elapsed_days", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_photo_tags.days_since_last", - Set(EngagementsPrivate, EngagementsPublic).asJava) - val NUM_PHOTO_TAGS_IS_MISSING = new Binary( - "realgraph.num_photo_tags.is_missing", - Set(EngagementsPrivate, EngagementsPublic).asJava) - - val NUM_FOLLOW_MEAN = new Continuous( - "realgraph.num_follow.mean", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_EWMA = new Continuous( - "realgraph.num_follow.ewma", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_VARIANCE = new Continuous( - "realgraph.num_follow.variance", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_NON_ZERO_DAYS = new Continuous( - "realgraph.num_follow.non_zero_days", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_ELAPSED_DAYS = new Continuous( - "realgraph.num_follow.elapsed_days", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_follow.days_since_last", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_FOLLOW_IS_MISSING = new Binary( - "realgraph.num_follow.is_missing", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - // the number of blocks that the source user sent to the destination user - val NUM_BLOCKS_MEAN = - new Continuous("realgraph.num_blocks.mean", Set(CountOfBlocks).asJava) - val NUM_BLOCKS_EWMA = - new Continuous("realgraph.num_blocks.ewma", Set(CountOfBlocks).asJava) - val NUM_BLOCKS_VARIANCE = - new Continuous("realgraph.num_blocks.variance", Set(CountOfBlocks).asJava) - val NUM_BLOCKS_NON_ZERO_DAYS = - new Continuous("realgraph.num_blocks.non_zero_days", 
Set(CountOfBlocks).asJava) - val NUM_BLOCKS_ELAPSED_DAYS = - new Continuous("realgraph.num_blocks.elapsed_days", Set(CountOfBlocks).asJava) - val NUM_BLOCKS_DAYS_SINCE_LAST = - new Continuous("realgraph.num_blocks.days_since_last", Set(CountOfBlocks).asJava) - val NUM_BLOCKS_IS_MISSING = - new Binary("realgraph.num_blocks.is_missing", Set(CountOfBlocks).asJava) - // the number of mutes that the source user sent to the destination user - val NUM_MUTES_MEAN = - new Continuous("realgraph.num_mutes.mean", Set(CountOfMutes).asJava) - val NUM_MUTES_EWMA = - new Continuous("realgraph.num_mutes.ewma", Set(CountOfMutes).asJava) - val NUM_MUTES_VARIANCE = - new Continuous("realgraph.num_mutes.variance", Set(CountOfMutes).asJava) - val NUM_MUTES_NON_ZERO_DAYS = - new Continuous("realgraph.num_mutes.non_zero_days", Set(CountOfMutes).asJava) - val NUM_MUTES_ELAPSED_DAYS = - new Continuous("realgraph.num_mutes.elapsed_days", Set(CountOfMutes).asJava) - val NUM_MUTES_DAYS_SINCE_LAST = - new Continuous("realgraph.num_mutes.days_since_last", Set(CountOfMutes).asJava) - val NUM_MUTES_IS_MISSING = - new Binary("realgraph.num_mutes.is_missing", Set(CountOfMutes).asJava) - // the number of report as abuses that the source user sent to the destination user - val NUM_REPORTS_AS_ABUSES_MEAN = - new Continuous("realgraph.num_report_as_abuses.mean", Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_EWMA = - new Continuous("realgraph.num_report_as_abuses.ewma", Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_VARIANCE = - new Continuous("realgraph.num_report_as_abuses.variance", Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_NON_ZERO_DAYS = - new Continuous("realgraph.num_report_as_abuses.non_zero_days", Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_ELAPSED_DAYS = - new Continuous("realgraph.num_report_as_abuses.elapsed_days", Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_DAYS_SINCE_LAST = - new Continuous( - "realgraph.num_report_as_abuses.days_since_last", - Set(CountOfAbuseReports).asJava) - val NUM_REPORTS_AS_ABUSES_IS_MISSING = - new Binary("realgraph.num_report_as_abuses.is_missing", Set(CountOfAbuseReports).asJava) - // the number of report as spams that the source user sent to the destination user - val NUM_REPORTS_AS_SPAMS_MEAN = - new Continuous( - "realgraph.num_report_as_spams.mean", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_EWMA = - new Continuous( - "realgraph.num_report_as_spams.ewma", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_VARIANCE = - new Continuous( - "realgraph.num_report_as_spams.variance", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_NON_ZERO_DAYS = - new Continuous( - "realgraph.num_report_as_spams.non_zero_days", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_ELAPSED_DAYS = - new Continuous( - "realgraph.num_report_as_spams.elapsed_days", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_DAYS_SINCE_LAST = - new Continuous( - "realgraph.num_report_as_spams.days_since_last", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - val NUM_REPORTS_AS_SPAMS_IS_MISSING = - new Binary( - "realgraph.num_report_as_spams.is_missing", - Set(CountOfAbuseReports, SafetyRelationships).asJava) - - val NUM_MUTUAL_FOLLOW_MEAN = new Continuous( - "realgraph.num_mutual_follow.mean", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, 
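Every real-graph interaction type above declares the same seven statistics (mean, ewma, variance, non_zero_days, elapsed_days, days_since_last, is_missing) with a shared personal-data annotation set. As a hedged refactoring sketch only, not the original code, a small factory could generate the group from a name stem; all names here are hypothetical, and the constructors mirror the `Continuous`/`Binary` usage shown above:

```scala
// Hedged refactoring sketch: generate the seven per-edge-type statistics from a
// name stem and a personal-data annotation set. Not the production code.
import com.twitter.dal.personal_data.thriftjava.PersonalDataType
import com.twitter.ml.api.Feature.{Binary, Continuous}
import scala.collection.JavaConverters._

case class EdgeStatFeatures(
  mean: Continuous,
  ewma: Continuous,
  variance: Continuous,
  nonZeroDays: Continuous,
  elapsedDays: Continuous,
  daysSinceLast: Continuous,
  isMissing: Binary)

object EdgeStatFeatures {
  def forStem(stem: String, pdp: Set[PersonalDataType]): EdgeStatFeatures =
    EdgeStatFeatures(
      mean = new Continuous(s"$stem.mean", pdp.asJava),
      ewma = new Continuous(s"$stem.ewma", pdp.asJava),
      variance = new Continuous(s"$stem.variance", pdp.asJava),
      nonZeroDays = new Continuous(s"$stem.non_zero_days", pdp.asJava),
      elapsedDays = new Continuous(s"$stem.elapsed_days", pdp.asJava),
      daysSinceLast = new Continuous(s"$stem.days_since_last", pdp.asJava),
      isMissing = new Binary(s"$stem.is_missing", pdp.asJava)
    )
  // e.g. EdgeStatFeatures.forStem("realgraph.num_mutes", Set(PersonalDataType.CountOfMutes))
}
```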
- PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_EWMA = new Continuous( - "realgraph.num_mutual_follow.ewma", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_VARIANCE = new Continuous( - "realgraph.num_mutual_follow.variance", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_NON_ZERO_DAYS = new Continuous( - "realgraph.num_mutual_follow.non_zero_days", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_ELAPSED_DAYS = new Continuous( - "realgraph.num_mutual_follow.elapsed_days", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_mutual_follow.days_since_last", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - val NUM_MUTUAL_FOLLOW_IS_MISSING = new Binary( - "realgraph.num_mutual_follow.is_missing", - Set( - Follow, - PrivateAccountsFollowedBy, - PublicAccountsFollowedBy, - PrivateAccountsFollowing, - PublicAccountsFollowing).asJava - ) - - val NUM_SMS_FOLLOW_MEAN = new Continuous( - "realgraph.num_sms_follow.mean", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_EWMA = new Continuous( - "realgraph.num_sms_follow.ewma", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_VARIANCE = new Continuous( - "realgraph.num_sms_follow.variance", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_NON_ZERO_DAYS = new Continuous( - "realgraph.num_sms_follow.non_zero_days", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_ELAPSED_DAYS = new Continuous( - "realgraph.num_sms_follow.elapsed_days", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_sms_follow.days_since_last", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - val NUM_SMS_FOLLOW_IS_MISSING = new Binary( - "realgraph.num_sms_follow.is_missing", - Set(Follow, PrivateAccountsFollowedBy, PublicAccountsFollowedBy).asJava) - - val NUM_ADDRESS_BOOK_EMAIL_MEAN = - new Continuous("realgraph.num_address_book_email.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_EMAIL_EWMA = - new Continuous("realgraph.num_address_book_email.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_EMAIL_VARIANCE = - new Continuous("realgraph.num_address_book_email.variance", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_EMAIL_NON_ZERO_DAYS = new Continuous( - "realgraph.num_address_book_email.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_EMAIL_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_email.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_EMAIL_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_email.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_EMAIL_IS_MISSING = - new Binary("realgraph.num_address_book_email.is_missing", 
Set(AddressBook).asJava) - - val NUM_ADDRESS_BOOK_IN_BOTH_MEAN = - new Continuous("realgraph.num_address_book_in_both.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_IN_BOTH_EWMA = - new Continuous("realgraph.num_address_book_in_both.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_IN_BOTH_VARIANCE = new Continuous( - "realgraph.num_address_book_in_both.variance", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_IN_BOTH_NON_ZERO_DAYS = new Continuous( - "realgraph.num_address_book_in_both.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_IN_BOTH_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_in_both.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_IN_BOTH_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_in_both.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_IN_BOTH_IS_MISSING = new Binary( - "realgraph.num_address_book_in_both.is_missing", - Set(AddressBook).asJava - ) - - val NUM_ADDRESS_BOOK_PHONE_MEAN = - new Continuous("realgraph.num_address_book_phone.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_PHONE_EWMA = - new Continuous("realgraph.num_address_book_phone.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_PHONE_VARIANCE = - new Continuous("realgraph.num_address_book_phone.variance", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_PHONE_NON_ZERO_DAYS = new Continuous( - "realgraph.num_address_book_phone.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_PHONE_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_phone.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_PHONE_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_phone.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_PHONE_IS_MISSING = - new Binary("realgraph.num_address_book_phone.is_missing", Set(AddressBook).asJava) - - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_MEAN = - new Continuous("realgraph.num_address_book_mutual_edge_email.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_EWMA = - new Continuous("realgraph.num_address_book_mutual_edge_email.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_VARIANCE = - new Continuous("realgraph.num_address_book_mutual_edge_email.variance", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_NON_ZERO_DAYS = new Continuous( - "realgraph.num_address_book_mutual_edge_email.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_mutual_edge_email.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_mutual_edge_email.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_EMAIL_IS_MISSING = - new Binary("realgraph.num_address_book_mutual_edge_email.is_missing", Set(AddressBook).asJava) - - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_MEAN = - new Continuous("realgraph.num_address_book_mutual_edge_in_both.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_EWMA = - new Continuous("realgraph.num_address_book_mutual_edge_in_both.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_VARIANCE = new Continuous( - "realgraph.num_address_book_mutual_edge_in_both.variance", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_NON_ZERO_DAYS = new Continuous( - 
"realgraph.num_address_book_mutual_edge_in_both.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_mutual_edge_in_both.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_mutual_edge_in_both.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_IN_BOTH_IS_MISSING = new Binary( - "realgraph.num_address_book_mutual_edge_in_both.is_missing", - Set(AddressBook).asJava - ) - - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_MEAN = - new Continuous("realgraph.num_address_book_mutual_edge_phone.mean", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_EWMA = - new Continuous("realgraph.num_address_book_mutual_edge_phone.ewma", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_VARIANCE = - new Continuous("realgraph.num_address_book_mutual_edge_phone.variance", Set(AddressBook).asJava) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_NON_ZERO_DAYS = new Continuous( - "realgraph.num_address_book_mutual_edge_phone.non_zero_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_ELAPSED_DAYS = new Continuous( - "realgraph.num_address_book_mutual_edge_phone.elapsed_days", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_DAYS_SINCE_LAST = new Continuous( - "realgraph.num_address_book_mutual_edge_phone.days_since_last", - Set(AddressBook).asJava - ) - val NUM_ADDRESS_BOOK_MUTUAL_EDGE_PHONE_IS_MISSING = - new Binary("realgraph.num_address_book_mutual_edge_phone.is_missing", Set(AddressBook).asJava) -} - -case class RealGraphEdgeDataRecordFeatures( - edgeFeatureOpt: Option[RealGraphEdgeFeature], - meanFeature: Continuous, - ewmaFeature: Continuous, - varianceFeature: Continuous, - nonZeroDaysFeature: Continuous, - elapsedDaysFeature: Continuous, - daysSinceLastFeature: Continuous, - isMissingFeature: Binary) diff --git a/src/scala/com/twitter/timelines/prediction/features/recap/BUILD b/src/scala/com/twitter/timelines/prediction/features/recap/BUILD deleted file mode 100644 index 6fc497bf3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/recap/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeatures.scala deleted file mode 100644 index c8ee6da7d..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeatures.scala +++ /dev/null @@ -1,967 +0,0 @@ -package com.twitter.timelines.prediction.features.recap - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.Discrete -import com.twitter.ml.api.Feature.SparseBinary -import com.twitter.ml.api.Feature.Text -import scala.collection.JavaConverters._ - -object RecapFeatures extends RecapFeatures("") -object InReplyToRecapFeatures extends RecapFeatures("in_reply_to_tweet") - -class RecapFeatures(prefix: String) { - private def name(featureName: String): String = { - if (prefix.nonEmpty) { - s"$prefix.$featureName" - } else { - featureName - } 
- } - - val IS_IPAD_CLIENT = new Binary(name("recap.client.is_ipad"), Set(ClientType).asJava) - val IS_WEB_CLIENT = new Binary(name("recap.client.is_web"), Set(ClientType).asJava) - val IS_IPHONE_CLIENT = new Binary(name("recap.client.is_phone"), Set(ClientType).asJava) - val IS_ANDROID_CLIENT = new Binary(name("recap.client.is_android"), Set(ClientType).asJava) - val IS_ANDROID_TABLET_CLIENT = - new Binary(name("recap.client.is_android_tablet"), Set(ClientType).asJava) - - // features from userAgent - val CLIENT_NAME = new Text(name("recap.user_agent.client_name"), Set(ClientType).asJava) - val CLIENT_SOURCE = new Discrete(name("recap.user_agent.client_source"), Set(ClientType).asJava) - val CLIENT_VERSION = new Text(name("recap.user_agent.client_version"), Set(ClientVersion).asJava) - val CLIENT_VERSION_CODE = - new Text(name("recap.user_agent.client_version_code"), Set(ClientVersion).asJava) - val DEVICE = new Text(name("recap.user_agent.device"), Set(DeviceType).asJava) - val FROM_DOG_FOOD = new Binary(name("recap.meta.from_dog_food"), Set(UserAgent).asJava) - val FROM_TWITTER_CLIENT = - new Binary(name("recap.user_agent.from_twitter_client"), Set(UserAgent).asJava) - val MANUFACTURER = new Text(name("recap.user_agent.manufacturer"), Set(UserAgent).asJava) - val MODEL = new Text(name("recap.user_agent.model"), Set(UserAgent).asJava) - val NETWORK_CONNECTION = - new Discrete(name("recap.user_agent.network_connection"), Set(UserAgent).asJava) - val SDK_VERSION = new Text(name("recap.user_agent.sdk_version"), Set(AppId, UserAgent).asJava) - - // engagement - val IS_RETWEETED = new Binary( - name("recap.engagement.is_retweeted"), - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_FAVORITED = new Binary( - name("recap.engagement.is_favorited"), - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED = new Binary( - name("recap.engagement.is_replied"), - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - // v1: post click engagements: fav, reply - val IS_GOOD_CLICKED_CONVO_DESC_V1 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_favorited_or_replied"), - Set( - PublicLikes, - PrivateLikes, - PublicReplies, - PrivateReplies, - EngagementsPrivate, - EngagementsPublic).asJava) - // v2: post click engagements: click - val IS_GOOD_CLICKED_CONVO_DESC_V2 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_v2"), - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_GOOD_CLICKED_CONVO_DESC_FAVORITED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_favorited"), - Set(PublicLikes, PrivateLikes, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_REPLIED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_replied"), - Set(PublicReplies, PrivateReplies, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_RETWEETED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_retweeted"), - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_CLICKED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_clicked"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_FOLLOWED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_followed"), - Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_SHARE_DM_CLICKED = new Binary( - 
name("recap.engagement.is_good_clicked_convo_desc_share_dm_clicked"), - Set(EngagementsPrivate).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_PROFILE_CLICKED = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_profile_clicked"), - Set(EngagementsPrivate).asJava) - - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_0 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_uam_gt_0"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_1 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_uam_gt_1"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_2 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_uam_gt_2"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_GOOD_CLICKED_CONVO_DESC_UAM_GT_3 = new Binary( - name("recap.engagement.is_good_clicked_convo_desc_uam_gt_3"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - - val IS_TWEET_DETAIL_DWELLED = new Binary( - name("recap.engagement.is_tweet_detail_dwelled"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_8_SEC = new Binary( - name("recap.engagement.is_tweet_detail_dwelled_8_sec"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_15_SEC = new Binary( - name("recap.engagement.is_tweet_detail_dwelled_15_sec"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_25_SEC = new Binary( - name("recap.engagement.is_tweet_detail_dwelled_25_sec"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_TWEET_DETAIL_DWELLED_30_SEC = new Binary( - name("recap.engagement.is_tweet_detail_dwelled_30_sec"), - Set(TweetsClicked, EngagementsPrivate).asJava) - - val IS_PROFILE_DWELLED = new Binary( - "recap.engagement.is_profile_dwelled", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_10_SEC = new Binary( - "recap.engagement.is_profile_dwelled_10_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_20_SEC = new Binary( - "recap.engagement.is_profile_dwelled_20_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_DWELLED_30_SEC = new Binary( - "recap.engagement.is_profile_dwelled_30_sec", - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED = new Binary( - "recap.engagement.is_fullscreen_video_dwelled", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_5_SEC = new Binary( - "recap.engagement.is_fullscreen_video_dwelled_5_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_10_SEC = new Binary( - "recap.engagement.is_fullscreen_video_dwelled_10_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_20_SEC = new Binary( - "recap.engagement.is_fullscreen_video_dwelled_20_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_FULLSCREEN_VIDEO_DWELLED_30_SEC = new Binary( - "recap.engagement.is_fullscreen_video_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_15_SEC = new Binary( - "recap.engagement.is_link_dwelled_15_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val 
IS_LINK_DWELLED_30_SEC = new Binary( - "recap.engagement.is_link_dwelled_30_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_LINK_DWELLED_60_SEC = new Binary( - "recap.engagement.is_link_dwelled_60_sec", - Set(MediaEngagementActivities, EngagementTypePrivate, EngagementsPrivate).asJava) - - val IS_QUOTED = new Binary( - name("recap.engagement.is_quoted"), - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_RETWEETED_WITHOUT_QUOTE = new Binary( - name("recap.engagement.is_retweeted_without_quote"), - Set(PublicRetweets, PrivateRetweets, EngagementsPrivate, EngagementsPublic).asJava) - val IS_CLICKED = - new Binary(name("recap.engagement.is_clicked"), Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_DWELLED = new Binary(name("recap.engagement.is_dwelled"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_IN_BOUNDS_V1 = - new Binary(name("recap.engagement.is_dwelled_in_bounds_v1"), Set(EngagementsPrivate).asJava) - val DWELL_NORMALIZED_OVERALL = new Continuous( - name("recap.engagement.dwell_normalized_overall"), - Set(EngagementsPrivate).asJava) - val DWELL_CDF_OVERALL = - new Continuous(name("recap.engagement.dwell_cdf_overall"), Set(EngagementsPrivate).asJava) - val DWELL_CDF = new Continuous(name("recap.engagement.dwell_cdf"), Set(EngagementsPrivate).asJava) - - val IS_DWELLED_1S = - new Binary(name("recap.engagement.is_dwelled_1s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_2S = - new Binary(name("recap.engagement.is_dwelled_2s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_3S = - new Binary(name("recap.engagement.is_dwelled_3s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_4S = - new Binary(name("recap.engagement.is_dwelled_4s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_5S = - new Binary(name("recap.engagement.is_dwelled_5s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_6S = - new Binary(name("recap.engagement.is_dwelled_6s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_7S = - new Binary(name("recap.engagement.is_dwelled_7s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_8S = - new Binary(name("recap.engagement.is_dwelled_8s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_9S = - new Binary(name("recap.engagement.is_dwelled_9s"), Set(EngagementsPrivate).asJava) - val IS_DWELLED_10S = - new Binary(name("recap.engagement.is_dwelled_10s"), Set(EngagementsPrivate).asJava) - - val IS_SKIPPED_1S = - new Binary(name("recap.engagement.is_skipped_1s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_2S = - new Binary(name("recap.engagement.is_skipped_2s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_3S = - new Binary(name("recap.engagement.is_skipped_3s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_4S = - new Binary(name("recap.engagement.is_skipped_4s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_5S = - new Binary(name("recap.engagement.is_skipped_5s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_6S = - new Binary(name("recap.engagement.is_skipped_6s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_7S = - new Binary(name("recap.engagement.is_skipped_7s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_8S = - new Binary(name("recap.engagement.is_skipped_8s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_9S = - new Binary(name("recap.engagement.is_skipped_9s"), Set(EngagementsPrivate).asJava) - val IS_SKIPPED_10S = - new Binary(name("recap.engagement.is_skipped_10s"), Set(EngagementsPrivate).asJava) - - val IS_IMPRESSED = - 
new Binary(name("recap.engagement.is_impressed"), Set(EngagementsPrivate).asJava) - val IS_FOLLOWED = - new Binary("recap.engagement.is_followed", Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_PROFILE_CLICKED = new Binary( - name("recap.engagement.is_profile_clicked"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_OPEN_LINKED = new Binary( - name("recap.engagement.is_open_linked"), - Set(EngagementsPrivate, LinksClickedOn).asJava) - val IS_PHOTO_EXPANDED = - new Binary(name("recap.engagement.is_photo_expanded"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_VIEWED = - new Binary(name("recap.engagement.is_video_viewed"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_START = - new Binary(name("recap.engagement.is_video_playback_start"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_25 = - new Binary(name("recap.engagement.is_video_playback_25"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_50 = - new Binary(name("recap.engagement.is_video_playback_50"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_75 = - new Binary(name("recap.engagement.is_video_playback_75"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_95 = - new Binary(name("recap.engagement.is_video_playback_95"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_PLAYBACK_COMPLETE = - new Binary(name("recap.engagement.is_video_playback_complete"), Set(EngagementsPrivate).asJava) - val IS_VIDEO_VIEWED_AND_PLAYBACK_50 = new Binary( - name("recap.engagement.is_video_viewed_and_playback_50"), - Set(EngagementsPrivate).asJava) - val IS_VIDEO_QUALITY_VIEWED = new Binary( - name("recap.engagement.is_video_quality_viewed"), - Set(EngagementsPrivate).asJava - ) - val IS_TWEET_SHARE_DM_CLICKED = - new Binary(name("recap.engagement.is_tweet_share_dm_clicked"), Set(EngagementsPrivate).asJava) - val IS_TWEET_SHARE_DM_SENT = - new Binary(name("recap.engagement.is_tweet_share_dm_sent"), Set(EngagementsPrivate).asJava) - val IS_BOOKMARKED = - new Binary(name("recap.engagement.is_bookmarked"), Set(EngagementsPrivate).asJava) - val IS_SHARED = - new Binary(name("recap.engagement.is_shared"), Set(EngagementsPrivate).asJava) - val IS_SHARE_MENU_CLICKED = - new Binary(name("recap.engagement.is_share_menu_clicked"), Set(EngagementsPrivate).asJava) - - // Negative engagements - val IS_DONT_LIKE = - new Binary(name("recap.engagement.is_dont_like"), Set(EngagementsPrivate).asJava) - val IS_BLOCK_CLICKED = new Binary( - name("recap.engagement.is_block_clicked"), - Set(TweetsClicked, EngagementsPrivate, EngagementsPublic).asJava) - val IS_BLOCK_DIALOG_BLOCKED = new Binary( - name("recap.engagement.is_block_dialog_blocked"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_MUTE_CLICKED = new Binary( - name("recap.engagement.is_mute_clicked"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_MUTE_DIALOG_MUTED = - new Binary(name("recap.engagement.is_mute_dialog_muted"), Set(EngagementsPrivate).asJava) - val IS_REPORT_TWEET_CLICKED = new Binary( - name("recap.engagement.is_report_tweet_clicked"), - Set(TweetsClicked, EngagementsPrivate).asJava) - val IS_NEGATIVE_FEEDBACK = - new Binary("recap.engagement.is_negative_feedback", Set(EngagementsPrivate).asJava) - val IS_NOT_ABOUT_TOPIC = - new Binary(name("recap.engagement.is_not_about_topic"), Set(EngagementsPrivate).asJava) - val IS_NOT_RECENT = - new Binary(name("recap.engagement.is_not_recent"), Set(EngagementsPrivate).asJava) - val IS_NOT_RELEVANT = - new 
Binary(name("recap.engagement.is_not_relevant"), Set(EngagementsPrivate).asJava) - val IS_SEE_FEWER = - new Binary(name("recap.engagement.is_see_fewer"), Set(EngagementsPrivate).asJava) - val IS_TOPIC_SPEC_NEG_ENGAGEMENT = - new Binary("recap.engagement.is_topic_spec_neg_engagement", Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC = - new Binary("recap.engagement.is_unfollow_topic", Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC_EXPLICIT_POSITIVE_LABEL = - new Binary( - "recap.engagement.is_unfollow_topic_explicit_positive_label", - Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC_IMPLICIT_POSITIVE_LABEL = - new Binary( - "recap.engagement.is_unfollow_topic_implicit_positive_label", - Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC_STRONG_EXPLICIT_NEGATIVE_LABEL = - new Binary( - "recap.engagement.is_unfollow_topic_strong_explicit_negative_label", - Set(EngagementsPrivate).asJava) - val IS_UNFOLLOW_TOPIC_EXPLICIT_NEGATIVE_LABEL = - new Binary( - "recap.engagement.is_unfollow_topic_explicit_negative_label", - Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN = - new Binary("recap.engagement.is_not_interested_in", Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN_EXPLICIT_POSITIVE_LABEL = - new Binary( - "recap.engagement.is_not_interested_in_explicit_positive_label", - Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN_EXPLICIT_NEGATIVE_LABEL = - new Binary( - "recap.engagement.is_not_interested_in_explicit_negative_label", - Set(EngagementsPrivate).asJava) - val IS_CARET_CLICKED = - new Binary(name("recap.engagement.is_caret_clicked"), Set(EngagementsPrivate).asJava) - val IS_FOLLOW_TOPIC = - new Binary("recap.engagement.is_follow_topic", Set(EngagementsPrivate).asJava) - val IS_NOT_INTERESTED_IN_TOPIC = - new Binary("recap.engagement.is_not_interested_in_topic", Set(EngagementsPrivate).asJava) - val IS_HOME_LATEST_VISITED = - new Binary(name("recap.engagement.is_home_latest_visited"), Set(EngagementsPrivate).asJava) - - // Relevance prompt tweet engagements - val IS_RELEVANCE_PROMPT_YES_CLICKED = new Binary( - name("recap.engagement.is_relevance_prompt_yes_clicked"), - Set(EngagementsPrivate).asJava) - val IS_RELEVANCE_PROMPT_NO_CLICKED = new Binary( - name("recap.engagement.is_relevance_prompt_no_clicked"), - Set(EngagementsPrivate).asJava) - val IS_RELEVANCE_PROMPT_IMPRESSED = new Binary( - name("recap.engagement.is_relevance_prompt_impressed"), - Set(EngagementsPrivate).asJava) - - // Reciprocal engagements for reply forward engagement - val IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_impressed_by_author"), - Set(EngagementsPrivate).asJava) - val IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_favorited_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateLikes, PublicLikes).asJava) - val IS_REPLIED_REPLY_QUOTED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_quoted_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava) - val IS_REPLIED_REPLY_REPLIED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_replied_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateReplies, PublicReplies).asJava) - val IS_REPLIED_REPLY_RETWEETED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_retweeted_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava) - val IS_REPLIED_REPLY_BLOCKED_BY_AUTHOR = 
new Binary( - name("recap.engagement.is_replied_reply_blocked_by_author"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_FOLLOWED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_followed_by_author"), - Set(EngagementsPrivate, EngagementsPublic, Follow).asJava) - val IS_REPLIED_REPLY_UNFOLLOWED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_unfollowed_by_author"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - val IS_REPLIED_REPLY_MUTED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_muted_by_author"), - Set(EngagementsPrivate).asJava) - val IS_REPLIED_REPLY_REPORTED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_reported_by_author"), - Set(EngagementsPrivate).asJava) - - // This derived label is the logical OR of REPLY_REPLIED, REPLY_FAVORITED, REPLY_RETWEETED - val IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR = new Binary( - name("recap.engagement.is_replied_reply_engaged_by_author"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - - // Reciprocal engagements for fav forward engagement - val IS_FAVORITED_FAV_FAVORITED_BY_AUTHOR = new Binary( - name("recap.engagement.is_favorited_fav_favorited_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateLikes, PublicLikes).asJava - ) - val IS_FAVORITED_FAV_REPLIED_BY_AUTHOR = new Binary( - name("recap.engagement.is_favorited_fav_replied_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateReplies, PublicReplies).asJava - ) - val IS_FAVORITED_FAV_RETWEETED_BY_AUTHOR = new Binary( - name("recap.engagement.is_favorited_fav_retweeted_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava - ) - val IS_FAVORITED_FAV_FOLLOWED_BY_AUTHOR = new Binary( - name("recap.engagement.is_favorited_fav_followed_by_author"), - Set(EngagementsPrivate, EngagementsPublic, PrivateRetweets, PublicRetweets).asJava - ) - // This derived label is the logical OR of FAV_REPLIED, FAV_FAVORITED, FAV_RETWEETED, FAV_FOLLOWED - val IS_FAVORITED_FAV_ENGAGED_BY_AUTHOR = new Binary( - name("recap.engagement.is_favorited_fav_engaged_by_author"), - Set(EngagementsPrivate, EngagementsPublic).asJava) - - // define good profile click by considering following engagements (follow, fav, reply, retweet, etc.) 
at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_FOLLOW = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_follow"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, Follow).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_FAV = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_fav"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateLikes, PublicLikes).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_REPLY = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_reply"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, PrivateReplies, PublicReplies).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_RETWEET = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_retweet"), - Set( - ProfilesViewed, - ProfilesClicked, - EngagementsPrivate, - PrivateRetweets, - PublicRetweets).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_CLICK = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_tweet_click"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, TweetsClicked).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_SHARE_DM_CLICK = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_share_dm_click"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of all binary features above - val IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_engaged"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate, EngagementsPublic).asJava) - - // define bad profile click by considering following engagements (user report, tweet report, mute, block, etc) at profile page - val IS_PROFILE_CLICKED_AND_PROFILE_USER_REPORT_CLICK = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_user_report_click"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_TWEET_REPORT_CLICK = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_tweet_report_click"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_MUTE = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_mute"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_PROFILE_CLICKED_AND_PROFILE_BLOCK = new Binary( - name("recap.engagement.is_profile_clicked_and_profile_block"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // This derived label is the union of bad profile click engagements and existing negative feedback - val IS_NEGATIVE_FEEDBACK_V2 = new Binary( - name("recap.engagement.is_negative_feedback_v2"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_STRONG_NEGATIVE_FEEDBACK = new Binary( - name("recap.engagement.is_strong_negative_feedback"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - val IS_WEAK_NEGATIVE_FEEDBACK = new Binary( - name("recap.engagement.is_weak_negative_feedback"), - Set(ProfilesViewed, ProfilesClicked, EngagementsPrivate).asJava) - // engagement for following user from any surface area - val IS_FOLLOWED_FROM_ANY_SURFACE_AREA = new Binary( - "recap.engagement.is_followed_from_any_surface_area", - Set(EngagementsPublic, EngagementsPrivate).asJava) - - // Reply downvote engagements - val IS_REPLY_DOWNVOTED = - new Binary(name("recap.engagement.is_reply_downvoted"), Set(EngagementsPrivate).asJava) - val IS_REPLY_DOWNVOTE_REMOVED = - new 
Binary(name("recap.engagement.is_reply_downvote_removed"), Set(EngagementsPrivate).asJava) - - // Other engagements - val IS_GOOD_OPEN_LINK = new Binary( - name("recap.engagement.is_good_open_link"), - Set(EngagementsPrivate, LinksClickedOn).asJava) - val IS_ENGAGED = new Binary( - name("recap.engagement.any"), - Set(EngagementsPrivate, EngagementsPublic).asJava - ) // Deprecated - to be removed shortly - val IS_EARLYBIRD_UNIFIED_ENGAGEMENT = new Binary( - name("recap.engagement.is_unified_engagement"), - Set(EngagementsPrivate, EngagementsPublic).asJava - ) // A subset of IS_ENGAGED specifically intended for use in earlybird models - - // features from ThriftTweetFeatures - val PREV_USER_TWEET_ENGAGEMENT = new Continuous( - name("recap.tweetfeature.prev_user_tweet_enagagement"), - Set(EngagementScore, EngagementsPrivate, EngagementsPublic).asJava) - val IS_SENSITIVE = new Binary(name("recap.tweetfeature.is_sensitive")) - val HAS_MULTIPLE_MEDIA = new Binary( - name("recap.tweetfeature.has_multiple_media"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val IS_AUTHOR_PROFILE_EGG = new Binary(name("recap.tweetfeature.is_author_profile_egg")) - val IS_AUTHOR_NEW = - new Binary(name("recap.tweetfeature.is_author_new"), Set(UserState, UserType).asJava) - val NUM_MENTIONS = new Continuous( - name("recap.tweetfeature.num_mentions"), - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val HAS_MENTION = new Binary(name("recap.tweetfeature.has_mention"), Set(UserVisibleFlag).asJava) - val NUM_HASHTAGS = new Continuous( - name("recap.tweetfeature.num_hashtags"), - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val HAS_HASHTAG = new Binary( - name("recap.tweetfeature.has_hashtag"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val LINK_LANGUAGE = new Continuous( - name("recap.tweetfeature.link_language"), - Set(ProvidedLanguage, InferredLanguage).asJava) - val IS_AUTHOR_NSFW = - new Binary(name("recap.tweetfeature.is_author_nsfw"), Set(UserSafetyLabels, UserType).asJava) - val IS_AUTHOR_SPAM = - new Binary(name("recap.tweetfeature.is_author_spam"), Set(UserSafetyLabels, UserType).asJava) - val IS_AUTHOR_BOT = - new Binary(name("recap.tweetfeature.is_author_bot"), Set(UserSafetyLabels, UserType).asJava) - val SIGNATURE = - new Discrete(name("recap.tweetfeature.signature"), Set(DigitalSignatureNonrepudiation).asJava) - val LANGUAGE = new Discrete( - name("recap.tweetfeature.language"), - Set(ProvidedLanguage, InferredLanguage).asJava) - val FROM_INACTIVE_USER = - new Binary(name("recap.tweetfeature.from_inactive_user"), Set(UserActiveFlag).asJava) - val PROBABLY_FROM_FOLLOWED_AUTHOR = new Binary(name("recap.v3.tweetfeature.probably_from_follow")) - val FROM_MUTUAL_FOLLOW = new Binary(name("recap.tweetfeature.from_mutual_follow")) - val USER_REP = new Continuous(name("recap.tweetfeature.user_rep")) - val FROM_VERIFIED_ACCOUNT = - new Binary(name("recap.tweetfeature.from_verified_account"), Set(UserVerifiedFlag).asJava) - val IS_BUSINESS_SCORE = new Continuous(name("recap.tweetfeature.is_business_score")) - val HAS_CONSUMER_VIDEO = new Binary( - name("recap.tweetfeature.has_consumer_video"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_PRO_VIDEO = new Binary( - name("recap.tweetfeature.has_pro_video"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VINE = new 
Binary( - name("recap.tweetfeature.has_vine"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_PERISCOPE = new Binary( - name("recap.tweetfeature.has_periscope"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NATIVE_VIDEO = new Binary( - name("recap.tweetfeature.has_native_video"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NATIVE_IMAGE = new Binary( - name("recap.tweetfeature.has_native_image"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_CARD = new Binary( - name("recap.tweetfeature.has_card"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_IMAGE = new Binary( - name("recap.tweetfeature.has_image"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_NEWS = new Binary( - name("recap.tweetfeature.has_news"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VIDEO = new Binary( - name("recap.tweetfeature.has_video"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_VISIBLE_LINK = new Binary( - name("recap.tweetfeature.has_visible_link"), - Set(UrlFoundFlag, PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val LINK_COUNT = new Continuous( - name("recap.tweetfeature.link_count"), - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - val HAS_LINK = new Binary( - name("recap.tweetfeature.has_link"), - Set(UrlFoundFlag, PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val IS_OFFENSIVE = new Binary(name("recap.tweetfeature.is_offensive")) - val HAS_TREND = new Binary( - name("recap.tweetfeature.has_trend"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val HAS_MULTIPLE_HASHTAGS_OR_TRENDS = new Binary( - name("recap.tweetfeature.has_multiple_hashtag_or_trend"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val URL_DOMAINS = new SparseBinary( - name("recap.tweetfeature.url_domains"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val CONTAINS_MEDIA = new Binary( - name("recap.tweetfeature.contains_media"), - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val RETWEET_SEARCHER = new Binary(name("recap.tweetfeature.retweet_searcher")) - val REPLY_SEARCHER = new Binary(name("recap.tweetfeature.reply_searcher")) - val MENTION_SEARCHER = - new Binary(name("recap.tweetfeature.mention_searcher"), Set(UserVisibleFlag).asJava) - val REPLY_OTHER = - new Binary(name("recap.tweetfeature.reply_other"), Set(PublicReplies, PrivateReplies).asJava) - val RETWEET_OTHER = new Binary( - name("recap.tweetfeature.retweet_other"), - Set(PublicRetweets, PrivateRetweets).asJava) - val IS_REPLY = - new Binary(name("recap.tweetfeature.is_reply"), Set(PublicReplies, PrivateReplies).asJava) - val IS_RETWEET = - new Binary(name("recap.tweetfeature.is_retweet"), Set(PublicRetweets, PrivateRetweets).asJava) - val IS_EXTENDED_REPLY = new Binary( - name("recap.tweetfeature.is_extended_reply"), - Set(PublicReplies, PrivateReplies).asJava) - val MATCH_UI_LANG = new Binary( - name("recap.tweetfeature.match_ui_lang"), - Set(ProvidedLanguage, InferredLanguage).asJava) - val MATCH_SEARCHER_MAIN_LANG = new Binary( - name("recap.tweetfeature.match_searcher_main_lang"), - 
Set(ProvidedLanguage, InferredLanguage).asJava) - val MATCH_SEARCHER_LANGS = new Binary( - name("recap.tweetfeature.match_searcher_langs"), - Set(ProvidedLanguage, InferredLanguage).asJava) - val BIDIRECTIONAL_REPLY_COUNT = new Continuous( - name("recap.tweetfeature.bidirectional_reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val UNIDIRECTIONAL_REPLY_COUNT = new Continuous( - name("recap.tweetfeature.unidirectional_reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val BIDIRECTIONAL_RETWEET_COUNT = new Continuous( - name("recap.tweetfeature.bidirectional_retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val UNIDIRECTIONAL_RETWEET_COUNT = new Continuous( - name("recap.tweetfeature.unidirectional_retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val BIDIRECTIONAL_FAV_COUNT = new Continuous( - name("recap.tweetfeature.bidirectional_fav_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val UNIDIRECTIONAL_FAV_COUNT = new Continuous( - name("recap.tweetfeature.unidirectiona_fav_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val CONVERSATIONAL_COUNT = new Continuous( - name("recap.tweetfeature.conversational_count"), - Set(CountOfPrivateTweets, CountOfPublicTweets).asJava) - // tweet impressions on an embedded tweet - val EMBEDS_IMPRESSION_COUNT = new Continuous( - name("recap.tweetfeature.embeds_impression_count"), - Set(CountOfImpression).asJava) - // number of URLs that embed the tweet - val EMBEDS_URL_COUNT = new Continuous( - name("recap.tweetfeature.embeds_url_count"), - Set(CountOfPrivateTweetEntitiesAndMetadata, CountOfPublicTweetEntitiesAndMetadata).asJava) - // currently only counts views on Snappy and Amplify pro videos. 
Counts for other videos forthcoming - val VIDEO_VIEW_COUNT = new Continuous( - name("recap.tweetfeature.video_view_count"), - Set( - CountOfTweetEntitiesClicked, - CountOfPrivateTweetEntitiesAndMetadata, - CountOfPublicTweetEntitiesAndMetadata, - EngagementsPrivate, - EngagementsPublic).asJava - ) - val TWEET_COUNT_FROM_USER_IN_SNAPSHOT = new Continuous( - name("recap.tweetfeature.tweet_count_from_user_in_snapshot"), - Set(CountOfPrivateTweets, CountOfPublicTweets).asJava) - val NORMALIZED_PARUS_SCORE = - new Continuous("recap.tweetfeature.normalized_parus_score", Set(EngagementScore).asJava) - val PARUS_SCORE = new Continuous("recap.tweetfeature.parus_score", Set(EngagementScore).asJava) - val REAL_GRAPH_WEIGHT = - new Continuous("recap.tweetfeature.real_graph_weight", Set(UsersRealGraphScore).asJava) - val SARUS_GRAPH_WEIGHT = new Continuous("recap.tweetfeature.sarus_graph_weight") - val TOPIC_SIM_SEARCHER_INTERSTED_IN_AUTHOR_KNOWN_FOR = new Continuous( - "recap.tweetfeature.topic_sim_searcher_interested_in_author_known_for") - val TOPIC_SIM_SEARCHER_AUTHOR_BOTH_INTERESTED_IN = new Continuous( - "recap.tweetfeature.topic_sim_searcher_author_both_interested_in") - val TOPIC_SIM_SEARCHER_AUTHOR_BOTH_KNOWN_FOR = new Continuous( - "recap.tweetfeature.topic_sim_searcher_author_both_known_for") - val TOPIC_SIM_SEARCHER_INTERESTED_IN_TWEET = new Continuous( - "recap.tweetfeature.topic_sim_searcher_interested_in_tweet") - val IS_RETWEETER_PROFILE_EGG = - new Binary(name("recap.v2.tweetfeature.is_retweeter_profile_egg"), Set(UserType).asJava) - val IS_RETWEETER_NEW = - new Binary(name("recap.v2.tweetfeature.is_retweeter_new"), Set(UserType, UserState).asJava) - val IS_RETWEETER_BOT = - new Binary( - name("recap.v2.tweetfeature.is_retweeter_bot"), - Set(UserType, UserSafetyLabels).asJava) - val IS_RETWEETER_NSFW = - new Binary( - name("recap.v2.tweetfeature.is_retweeter_nsfw"), - Set(UserType, UserSafetyLabels).asJava) - val IS_RETWEETER_SPAM = - new Binary( - name("recap.v2.tweetfeature.is_retweeter_spam"), - Set(UserType, UserSafetyLabels).asJava) - val RETWEET_OF_MUTUAL_FOLLOW = new Binary( - name("recap.v2.tweetfeature.retweet_of_mutual_follow"), - Set(PublicRetweets, PrivateRetweets).asJava) - val SOURCE_AUTHOR_REP = new Continuous(name("recap.v2.tweetfeature.source_author_rep")) - val IS_RETWEET_OF_REPLY = new Binary( - name("recap.v2.tweetfeature.is_retweet_of_reply"), - Set(PublicRetweets, PrivateRetweets).asJava) - val RETWEET_DIRECTED_AT_USER_IN_FIRST_DEGREE = new Binary( - name("recap.v2.tweetfeature.is_retweet_directed_at_user_in_first_degree"), - Set(PublicRetweets, PrivateRetweets, Follow).asJava) - val MENTIONED_SCREEN_NAMES = new SparseBinary( - "entities.users.mentioned_screen_names", - Set(DisplayName, UserVisibleFlag).asJava) - val MENTIONED_SCREEN_NAME = new Text( - "entities.users.mentioned_screen_names.member", - Set(DisplayName, UserVisibleFlag).asJava) - val HASHTAGS = new SparseBinary( - "entities.hashtags", - Set(PublicTweetEntitiesAndMetadata, PrivateTweetEntitiesAndMetadata).asJava) - val URL_SLUGS = new SparseBinary(name("recap.linkfeature.url_slugs"), Set(UrlFoundFlag).asJava) - - // features from ThriftSearchResultMetadata - val REPLY_COUNT = new Continuous( - name("recap.searchfeature.reply_count"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - val RETWEET_COUNT = new Continuous( - name("recap.searchfeature.retweet_count"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val FAV_COUNT = new Continuous( - 
name("recap.searchfeature.fav_count"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val BLENDER_SCORE = new Continuous(name("recap.searchfeature.blender_score")) - val TEXT_SCORE = new Continuous(name("recap.searchfeature.text_score")) - - // features related to content source - val SOURCE_TYPE = new Discrete(name("recap.source.type")) - - // features from addressbook - // the author is in the user's email addressbook - val USER_TO_AUTHOR_EMAIL_REACHABLE = - new Binary(name("recap.addressbook.user_to_author_email_reachable"), Set(AddressBook).asJava) - // the author is in the user's phone addressbook - val USER_TO_AUTHOR_PHONE_REACHABLE = - new Binary(name("recap.addressbook.user_to_author_phone_reachable"), Set(AddressBook).asJava) - // the user is in the author's email addressbook - val AUTHOR_TO_USER_EMAIL_REACHABLE = - new Binary(name("recap.addressbook.author_to_user_email_reachable"), Set(AddressBook).asJava) - // the user is in the user's phone addressbook - val AUTHOR_TO_USER_PHONE_REACHABLE = - new Binary(name("recap.addressbook.author_to_user_phone_reachable"), Set(AddressBook).asJava) - - // predicted engagement (these features are used by prediction service to return the predicted engagement probability) - // these should match the names in engagement_to_score_feature_mapping - val PREDICTED_IS_FAVORITED = - new Continuous(name("recap.engagement_predicted.is_favorited"), Set(EngagementScore).asJava) - val PREDICTED_IS_RETWEETED = - new Continuous(name("recap.engagement_predicted.is_retweeted"), Set(EngagementScore).asJava) - val PREDICTED_IS_QUOTED = - new Continuous(name("recap.engagement_predicted.is_quoted"), Set(EngagementScore).asJava) - val PREDICTED_IS_REPLIED = - new Continuous(name("recap.engagement_predicted.is_replied"), Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_OPEN_LINK = new Continuous( - name("recap.engagement_predicted.is_good_open_link"), - Set(EngagementScore).asJava) - val PREDICTED_IS_PROFILE_CLICKED = new Continuous( - name("recap.engagement_predicted.is_profile_clicked"), - Set(EngagementScore).asJava) - val PREDICTED_IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED = new Continuous( - name("recap.engagement_predicted.is_profile_clicked_and_profile_engaged"), - Set(EngagementScore).asJava) - val PREDICTED_IS_CLICKED = - new Continuous(name("recap.engagement_predicted.is_clicked"), Set(EngagementScore).asJava) - val PREDICTED_IS_PHOTO_EXPANDED = new Continuous( - name("recap.engagement_predicted.is_photo_expanded"), - Set(EngagementScore).asJava) - val PREDICTED_IS_DONT_LIKE = - new Continuous(name("recap.engagement_predicted.is_dont_like"), Set(EngagementScore).asJava) - val PREDICTED_IS_VIDEO_PLAYBACK_50 = new Continuous( - name("recap.engagement_predicted.is_video_playback_50"), - Set(EngagementScore).asJava) - val PREDICTED_IS_VIDEO_QUALITY_VIEWED = new Continuous( - name("recap.engagement_predicted.is_video_quality_viewed"), - Set(EngagementScore).asJava) - val PREDICTED_IS_BOOKMARKED = - new Continuous(name("recap.engagement_predicted.is_bookmarked"), Set(EngagementScore).asJava) - val PREDICTED_IS_SHARED = - new Continuous(name("recap.engagement_predicted.is_shared"), Set(EngagementScore).asJava) - val PREDICTED_IS_SHARE_MENU_CLICKED = - new Continuous( - name("recap.engagement_predicted.is_share_menu_clicked"), - Set(EngagementScore).asJava) - val PREDICTED_IS_PROFILE_DWELLED_20_SEC = new Continuous( - name("recap.engagement_predicted.is_profile_dwelled_20_sec"), - Set(EngagementScore).asJava) - val 
PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_5_SEC = new Continuous( - name("recap.engagement_predicted.is_fullscreen_video_dwelled_5_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_10_SEC = new Continuous( - name("recap.engagement_predicted.is_fullscreen_video_dwelled_10_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_20_SEC = new Continuous( - name("recap.engagement_predicted.is_fullscreen_video_dwelled_20_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FULLSCREEN_VIDEO_DWELLED_30_SEC = new Continuous( - name("recap.engagement_predicted.is_fullscreen_video_dwelled_30_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_UNIFIED_ENGAGEMENT = new Continuous( - name("recap.engagement_predicted.is_unified_engagement"), - Set(EngagementScore).asJava) - val PREDICTED_IS_COMPOSE_TRIGGERED = new Continuous( - name("recap.engagement_predicted.is_compose_triggered"), - Set(EngagementScore).asJava) - val PREDICTED_IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR = new Continuous( - name("recap.engagement_predicted.is_replied_reply_impressed_by_author"), - Set(EngagementScore).asJava) - val PREDICTED_IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR = new Continuous( - name("recap.engagement_predicted.is_replied_reply_engaged_by_author"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_CLICKED_V1 = new Continuous( - name("recap.engagement_predicted.is_good_clicked_convo_desc_favorited_or_replied"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_CLICKED_V2 = new Continuous( - name("recap.engagement_predicted.is_good_clicked_convo_desc_v2"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_8_SEC = new Continuous( - name("recap.engagement_predicted.is_tweet_detail_dwelled_8_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_15_SEC = new Continuous( - name("recap.engagement_predicted.is_tweet_detail_dwelled_15_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_25_SEC = new Continuous( - name("recap.engagement_predicted.is_tweet_detail_dwelled_25_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_TWEET_DETAIL_DWELLED_30_SEC = new Continuous( - name("recap.engagement_predicted.is_tweet_detail_dwelled_30_sec"), - Set(EngagementScore).asJava) - val PREDICTED_IS_FAVORITED_FAV_ENGAGED_BY_AUTHOR = new Continuous( - name("recap.engagement_predicted.is_favorited_fav_engaged_by_author"), - Set(EngagementScore).asJava) - val PREDICTED_IS_GOOD_CLICKED_WITH_DWELL_SUM_GTE_60S = new Continuous( - name( - "recap.engagement_predicted.is_good_clicked_convo_desc_favorited_or_replied_or_dwell_sum_gte_60_secs"), - Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_IN_BOUNDS_V1 = new Continuous( - name("recap.engagement_predicted.is_dwelled_in_bounds_v1"), - Set(EngagementScore).asJava) - val PREDICTED_DWELL_NORMALIZED_OVERALL = new Continuous( - name("recap.engagement_predicted.dwell_normalized_overall"), - Set(EngagementScore).asJava) - val PREDICTED_DWELL_CDF = - new Continuous(name("recap.engagement_predicted.dwell_cdf"), Set(EngagementScore).asJava) - val PREDICTED_DWELL_CDF_OVERALL = new Continuous( - name("recap.engagement_predicted.dwell_cdf_overall"), - Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED = - new Continuous(name("recap.engagement_predicted.is_dwelled"), Set(EngagementScore).asJava) - - val PREDICTED_IS_DWELLED_1S = - new Continuous(name("recap.engagement_predicted.is_dwelled_1s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_2S = - new 
Continuous(name("recap.engagement_predicted.is_dwelled_2s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_3S = - new Continuous(name("recap.engagement_predicted.is_dwelled_3s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_4S = - new Continuous(name("recap.engagement_predicted.is_dwelled_4s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_5S = - new Continuous(name("recap.engagement_predicted.is_dwelled_5s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_6S = - new Continuous(name("recap.engagement_predicted.is_dwelled_6s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_7S = - new Continuous(name("recap.engagement_predicted.is_dwelled_7s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_8S = - new Continuous(name("recap.engagement_predicted.is_dwelled_8s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_9S = - new Continuous(name("recap.engagement_predicted.is_dwelled_9s"), Set(EngagementScore).asJava) - val PREDICTED_IS_DWELLED_10S = - new Continuous(name("recap.engagement_predicted.is_dwelled_10s"), Set(EngagementScore).asJava) - - val PREDICTED_IS_SKIPPED_1S = - new Continuous(name("recap.engagement_predicted.is_skipped_1s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_2S = - new Continuous(name("recap.engagement_predicted.is_skipped_2s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_3S = - new Continuous(name("recap.engagement_predicted.is_skipped_3s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_4S = - new Continuous(name("recap.engagement_predicted.is_skipped_4s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_5S = - new Continuous(name("recap.engagement_predicted.is_skipped_5s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_6S = - new Continuous(name("recap.engagement_predicted.is_skipped_6s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_7S = - new Continuous(name("recap.engagement_predicted.is_skipped_7s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_8S = - new Continuous(name("recap.engagement_predicted.is_skipped_8s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_9S = - new Continuous(name("recap.engagement_predicted.is_skipped_9s"), Set(EngagementScore).asJava) - val PREDICTED_IS_SKIPPED_10S = - new Continuous(name("recap.engagement_predicted.is_skipped_10s"), Set(EngagementScore).asJava) - - val PREDICTED_IS_HOME_LATEST_VISITED = new Continuous( - name("recap.engagement_predicted.is_home_latest_visited"), - Set(EngagementScore).asJava) - val PREDICTED_IS_NEGATIVE_FEEDBACK = - new Continuous( - name("recap.engagement_predicted.is_negative_feedback"), - Set(EngagementScore).asJava) - val PREDICTED_IS_NEGATIVE_FEEDBACK_V2 = - new Continuous( - name("recap.engagement_predicted.is_negative_feedback_v2"), - Set(EngagementScore).asJava) - val PREDICTED_IS_WEAK_NEGATIVE_FEEDBACK = - new Continuous( - name("recap.engagement_predicted.is_weak_negative_feedback"), - Set(EngagementScore).asJava) - val PREDICTED_IS_STRONG_NEGATIVE_FEEDBACK = - new Continuous( - name("recap.engagement_predicted.is_strong_negative_feedback"), - Set(EngagementScore).asJava) - val PREDICTED_IS_REPORT_TWEET_CLICKED = - new Continuous( - name("recap.engagement_predicted.is_report_tweet_clicked"), - Set(EngagementScore).asJava) - val PREDICTED_IS_UNFOLLOW_TOPIC = - new Continuous( - name("recap.engagement_predicted.is_unfollow_topic"), - Set(EngagementScore).asJava) - val PREDICTED_IS_RELEVANCE_PROMPT_YES_CLICKED = new Continuous( - 
name("recap.engagement_predicted.is_relevance_prompt_yes_clicked"), - Set(EngagementScore).asJava) - - // engagement for following user from any surface area - val PREDICTED_IS_FOLLOWED_FROM_ANY_SURFACE_AREA = new Continuous( - "recap.engagement_predicted.is_followed_from_any_surface_area", - Set(EngagementScore).asJava) - - - // These are global engagement counts for the Tweets. - val FAV_COUNT_V2 = new Continuous( - name("recap.earlybird.fav_count_v2"), - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava) - val RETWEET_COUNT_V2 = new Continuous( - name("recap.earlybird.retweet_count_v2"), - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava) - val REPLY_COUNT_V2 = new Continuous( - name("recap.earlybird.reply_count_v2"), - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava) - - val HAS_US_POLITICAL_ANNOTATION = new Binary( - name("recap.has_us_political_annotation"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ALL_GROUPS_ANNOTATION = new Binary( - name("recap.has_us_political_all_groups_annotation"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_HIGH_RECALL = new Binary( - name("recap.has_us_political_annotation_high_recall"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_HIGH_RECALL_V2 = new Binary( - name("recap.has_us_political_annotation_high_recall_v2"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_HIGH_PRECISION_V0 = new Binary( - name("recap.has_us_political_annotation_high_precision_v0"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_BALANCED_PRECISION_RECALL_V0 = new Binary( - name("recap.has_us_political_annotation_balanced_precision_recall_v0"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_HIGH_RECALL_V3 = new Binary( - name("recap.has_us_political_annotation_high_recall_v3"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_HIGH_PRECISION_V3 = new Binary( - name("recap.has_us_political_annotation_high_precision_v3"), - Set(SemanticcoreClassification).asJava - ) - - val HAS_US_POLITICAL_ANNOTATION_BALANCED_V3 = new Binary( - name("recap.has_us_political_annotation_balanced_v3"), - Set(SemanticcoreClassification).asJava - ) - -} diff --git a/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeaturesUtils.scala b/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeaturesUtils.scala deleted file mode 100644 index edf152cda..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/recap/RecapFeaturesUtils.scala +++ /dev/null @@ -1,29 +0,0 @@ -package com.twitter.timelines.prediction.features.recap - -object RecapFeaturesUtils { - // This needs to be updated if an engagement model is added or removed from prediction service. 
- val scoreFeatureIdsMap: Map[String, Long] = Map( - RecapFeatures.IS_FAVORITED.getFeatureName -> RecapFeatures.PREDICTED_IS_FAVORITED.getFeatureId, - RecapFeatures.IS_REPLIED.getFeatureName -> RecapFeatures.PREDICTED_IS_REPLIED.getFeatureId, - RecapFeatures.IS_RETWEETED.getFeatureName -> RecapFeatures.PREDICTED_IS_RETWEETED.getFeatureId, - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V1.getFeatureName -> RecapFeatures.PREDICTED_IS_GOOD_CLICKED_V1.getFeatureId, - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V2.getFeatureName -> RecapFeatures.PREDICTED_IS_GOOD_CLICKED_V2.getFeatureId, -// RecapFeatures.IS_NEGATIVE_FEEDBACK_V2.getFeatureName -> RecapFeatures.PREDICTED_IS_NEGATIVE_FEEDBACK_V2.getFeatureId, - RecapFeatures.IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED.getFeatureName -> RecapFeatures.PREDICTED_IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED.getFeatureId, - RecapFeatures.IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR.getFeatureName -> RecapFeatures.PREDICTED_IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR.getFeatureId - ) - - // This needs to be updated if an engagement model is added or removed from prediction service. - val labelFeatureIdToScoreFeatureIdsMap: Map[Long, Long] = Map( - RecapFeatures.IS_FAVORITED.getFeatureId -> RecapFeatures.PREDICTED_IS_FAVORITED.getFeatureId, - RecapFeatures.IS_REPLIED.getFeatureId -> RecapFeatures.PREDICTED_IS_REPLIED.getFeatureId, - RecapFeatures.IS_RETWEETED.getFeatureId -> RecapFeatures.PREDICTED_IS_RETWEETED.getFeatureId, - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V1.getFeatureId -> RecapFeatures.PREDICTED_IS_GOOD_CLICKED_V1.getFeatureId, - RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V2.getFeatureId -> RecapFeatures.PREDICTED_IS_GOOD_CLICKED_V2.getFeatureId, - // RecapFeatures.IS_NEGATIVE_FEEDBACK_V2.getFeatureName -> RecapFeatures.PREDICTED_IS_NEGATIVE_FEEDBACK_V2.getFeatureId, - RecapFeatures.IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED.getFeatureId -> RecapFeatures.PREDICTED_IS_PROFILE_CLICKED_AND_PROFILE_ENGAGED.getFeatureId, - RecapFeatures.IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR.getFeatureId -> RecapFeatures.PREDICTED_IS_REPLIED_REPLY_ENGAGED_BY_AUTHOR.getFeatureId - ) - - val labelFeatureNames: Seq[String] = scoreFeatureIdsMap.keys.toSeq -} diff --git a/src/scala/com/twitter/timelines/prediction/features/request_context/BUILD b/src/scala/com/twitter/timelines/prediction/features/request_context/BUILD deleted file mode 100644 index 6fc497bf3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/request_context/BUILD +++ /dev/null @@ -1,9 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/request_context/RequestContextFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/request_context/RequestContextFeatures.scala deleted file mode 100644 index a7dd28852..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/request_context/RequestContextFeatures.scala +++ /dev/null @@ -1,57 +0,0 @@ -package com.twitter.timelines.prediction.features.request_context - -import com.twitter.ml.api.FeatureContext -import com.twitter.ml.api.Feature._ -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import scala.collection.JavaConverters._ - -object RequestContextFeatures { - val COUNTRY_CODE = - new Text("request_context.country_code", Set(PrivateCountryOrRegion, InferredCountry).asJava) - val LANGUAGE_CODE = 
new Text( - "request_context.language_code", - Set(GeneralSettings, ProvidedLanguage, InferredLanguage).asJava) - val REQUEST_PROVENANCE = new Text("request_context.request_provenance", Set(AppUsage).asJava) - val DISPLAY_WIDTH = new Continuous("request_context.display_width", Set(OtherDeviceInfo).asJava) - val DISPLAY_HEIGHT = new Continuous("request_context.display_height", Set(OtherDeviceInfo).asJava) - val DISPLAY_DPI = new Continuous("request_context.display_dpi", Set(OtherDeviceInfo).asJava) - - // the following features are not Continuous Features because for e.g. continuity between - // 23 and 0 hours cannot be handled that way. instead, we will treat each slice of hours/days - // independently, like a set of sparse binary features. - val TIMESTAMP_GMT_HOUR = - new Discrete("request_context.timestamp_gmt_hour", Set(PrivateTimestamp).asJava) - val TIMESTAMP_GMT_DOW = - new Discrete("request_context.timestamp_gmt_dow", Set(PrivateTimestamp).asJava) - - val IS_GET_INITIAL = new Binary("request_context.is_get_initial") - val IS_GET_MIDDLE = new Binary("request_context.is_get_middle") - val IS_GET_NEWER = new Binary("request_context.is_get_newer") - val IS_GET_OLDER = new Binary("request_context.is_get_older") - - // the following features are not Binary Features because the source field is Option[Boolean], - // and we want to distinguish Some(false) from None. None will be converted to -1. - val IS_POLLING = new Discrete("request_context.is_polling") - val IS_SESSION_START = new Discrete("request_context.is_session_start") - - // Helps distinguish requests from "home" vs "home_latest" (reverse chron home view). - val TIMELINE_KIND = new Text("request_context.timeline_kind") - - val featureContext = new FeatureContext( - COUNTRY_CODE, - LANGUAGE_CODE, - REQUEST_PROVENANCE, - DISPLAY_WIDTH, - DISPLAY_HEIGHT, - DISPLAY_DPI, - TIMESTAMP_GMT_HOUR, - TIMESTAMP_GMT_DOW, - IS_GET_INITIAL, - IS_GET_MIDDLE, - IS_GET_NEWER, - IS_GET_OLDER, - IS_POLLING, - IS_SESSION_START, - TIMELINE_KIND - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/simcluster/BUILD b/src/scala/com/twitter/timelines/prediction/features/simcluster/BUILD deleted file mode 100644 index ec194353b..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/simcluster/BUILD +++ /dev/null @@ -1,13 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala", - "src/thrift/com/twitter/timelines/suggests/common:record-scala", - "timelines/data_processing/ml_util/aggregation_framework:common_types", - "timelines/data_processing/ml_util/aggregation_framework/conversion:for-timelines", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterFeatures.scala deleted file mode 100644 index 4d2b4db81..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterFeatures.scala +++ /dev/null @@ -1,61 +0,0 @@ -package com.twitter.timelines.prediction.features.simcluster - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.ml.api.Feature._ -import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn -import 
com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup -import scala.collection.JavaConverters._ - -class SimclusterFeaturesHelper(statsReceiver: StatsReceiver) { - import SimclusterFeatures._ - - private[this] val scopedStatsReceiver = statsReceiver.scope(getClass.getSimpleName) - private[this] val invalidSimclusterModelVersion = scopedStatsReceiver - .counter("invalidSimclusterModelVersion") - - def fromUserClusterInterestsPair( - userInterestClustersPair: (Long, ClustersUserIsInterestedIn) - ): Option[SimclusterFeatures] = { - val (userId, userInterestClusters) = userInterestClustersPair - if (userInterestClusters.knownForModelVersion == SIMCLUSTER_MODEL_VERSION) { - val userInterestClustersFavScores = for { - (clusterId, scores) <- userInterestClusters.clusterIdToScores - favScore <- scores.favScore - } yield (clusterId.toString, favScore) - Some( - SimclusterFeatures( - userId, - userInterestClusters.knownForModelVersion, - userInterestClustersFavScores.toMap - ) - ) - } else { - // We maintain this counter to make sure that the hardcoded modelVersion we are using is correct. - invalidSimclusterModelVersion.incr - None - } - } -} - -object SimclusterFeatures { - // Check http://go/simclustersv2runbook for production versions - // Our models are trained for this specific model version only. - val SIMCLUSTER_MODEL_VERSION = "20M_145K_dec11" - val prefix = s"simcluster.v2.$SIMCLUSTER_MODEL_VERSION" - - val SIMCLUSTER_USER_INTEREST_CLUSTER_SCORES = new SparseContinuous( - s"$prefix.user_interest_cluster_scores", - Set(EngagementScore, InferredInterests).asJava - ) - val SIMCLUSTER_USER_INTEREST_CLUSTER_IDS = new SparseBinary( - s"$prefix.user_interest_cluster_ids", - Set(InferredInterests).asJava - ) - val SIMCLUSTER_MODEL_VERSION_METADATA = new Text("meta.simcluster_version") -} - -case class SimclusterFeatures( - userId: Long, - modelVersion: String, - interestClusterScoresMap: Map[String, Double]) diff --git a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterTweetFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterTweetFeatures.scala deleted file mode 100644 index 355a89c22..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclusterTweetFeatures.scala +++ /dev/null @@ -1,150 +0,0 @@ -package com.twitter.timelines.prediction.features.simcluster - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.finagle.stats.StatsReceiver -import com.twitter.ml.api.{Feature, FeatureContext} -import com.twitter.ml.api.Feature.{Continuous, SparseBinary, SparseContinuous} -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion._ -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup -import com.twitter.timelines.suggests.common.record.thriftscala.SuggestionRecord -import scala.collection.JavaConverters._ - -class SimclusterTweetFeatures(statsReceiver: StatsReceiver) extends CombineCountsBase { - import SimclusterTweetFeatures._ - - private[this] val scopedStatsReceiver = statsReceiver.scope(getClass.getSimpleName) - private[this] val invalidSimclusterModelVersion = scopedStatsReceiver - .counter("invalidSimclusterModelVersion") - private[this] val getFeaturesFromOverlappingSimclusterIdsCount = scopedStatsReceiver - .counter("getFeaturesFromOverlappingSimclusterIdsCount") - private[this] val emptySimclusterMaps = scopedStatsReceiver - .counter("emptySimclusterMaps") - 
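  // Illustration with hypothetical values: if the viewer's interested-in map (keyed by
  // IntegerClusterId) were Map("103" -> 0.8, "205" -> 0.2) and the tweet's top-K map were
  // Map("103" -> 0.6), the only overlapping cluster is "103", so
  // getFeaturesFromOverlappingSimclusterIds below would emit tweetScore 0.6 and
  // combinedScore 0.8 * 0.6 = 0.48 for the two precomputed count features.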
private[this] val nonOverlappingSimclusterMaps = scopedStatsReceiver - .counter("nonOverlappingSimclusterMaps") - - // Parameters required by CombineCountsBase - override val topK: Int = 5 - override val hardLimit: Option[Int] = None - override val precomputedCountFeatures: Seq[Feature[_]] = Seq( - SIMCLUSTER_TWEET_TOPK_SORT_BY_TWEET_SCORE, - SIMCLUSTER_TWEET_TOPK_SORT_BY_COMBINED_SCORE - ) - - private def getFeaturesFromOverlappingSimclusterIds( - userSimclustersInterestedInMap: Map[String, Double], - tweetSimclustersTopKMap: Map[String, Double] - ): Map[Feature[_], List[Double]] = { - getFeaturesFromOverlappingSimclusterIdsCount.incr - if (userSimclustersInterestedInMap.isEmpty || tweetSimclustersTopKMap.isEmpty) { - emptySimclusterMaps.incr - Map.empty - } else { - val overlappingSimclusterIds = - userSimclustersInterestedInMap.keySet intersect tweetSimclustersTopKMap.keySet - if (overlappingSimclusterIds.isEmpty) { - nonOverlappingSimclusterMaps.incr - Map.empty - } else { - val (combinedScores, tweetScores) = overlappingSimclusterIds.map { id => - val tweetScore = tweetSimclustersTopKMap.getOrElse(id, 0.0) - val combinedScore = userSimclustersInterestedInMap.getOrElse(id, 0.0) * tweetScore - (combinedScore, tweetScore) - }.unzip - Map( - SIMCLUSTER_TWEET_TOPK_SORT_BY_COMBINED_SCORE -> combinedScores.toList, - SIMCLUSTER_TWEET_TOPK_SORT_BY_TWEET_SCORE -> tweetScores.toList - ) - } - } - } - - def getCountFeaturesValuesMap( - suggestionRecord: SuggestionRecord, - simclustersTweetTopKMap: Map[String, Double] - ): Map[Feature[_], List[Double]] = { - val userSimclustersInterestedInMap = formatUserSimclustersInterestedIn(suggestionRecord) - - val tweetSimclustersTopKMap = formatTweetSimclustersTopK(simclustersTweetTopKMap) - - getFeaturesFromOverlappingSimclusterIds(userSimclustersInterestedInMap, tweetSimclustersTopKMap) - } - - def filterByModelVersion( - simclustersMapOpt: Option[Map[String, Double]] - ): Option[Map[String, Double]] = { - simclustersMapOpt.flatMap { simclustersMap => - val filteredSimclustersMap = simclustersMap.filter { - case (clusterId, score) => - // The clusterId format is ModelVersion.IntegerClusterId.ScoreType as specified at - // com.twitter.ml.featurestore.catalog.features.recommendations.SimClustersV2TweetTopClusters - clusterId.contains(SimclusterFeatures.SIMCLUSTER_MODEL_VERSION) - } - - // The assumption is that the simclustersMap will contain clusterIds with the same modelVersion. - // We maintain this counter to make sure that the hardcoded modelVersion we are using is correct. 
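      // Hypothetical example: with SIMCLUSTER_MODEL_VERSION = "20M_145K_dec11", a map like
      // Map("20M_145K_dec11.103.fav" -> 0.7, "someOtherVersion.55.fav" -> 0.4) would be
      // filtered down to Map("20M_145K_dec11.103.fav" -> 0.7); the size difference below
      // then increments invalidSimclusterModelVersion.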
- if (simclustersMap.size > filteredSimclustersMap.size) { - invalidSimclusterModelVersion.incr - } - - if (filteredSimclustersMap.nonEmpty) Some(filteredSimclustersMap) else None - } - } - - val allFeatures: Seq[Feature[_]] = outputFeaturesPostMerge.toSeq ++ Seq( - SIMCLUSTER_TWEET_TOPK_CLUSTER_IDS, - SIMCLUSTER_TWEET_TOPK_CLUSTER_SCORES) - val featureContext = new FeatureContext(allFeatures: _*) -} - -object SimclusterTweetFeatures { - val SIMCLUSTER_TWEET_TOPK_CLUSTER_IDS = new SparseBinary( - s"${SimclusterFeatures.prefix}.tweet_topk_cluster_ids", - Set(InferredInterests).asJava - ) - val SIMCLUSTER_TWEET_TOPK_CLUSTER_SCORES = new SparseContinuous( - s"${SimclusterFeatures.prefix}.tweet_topk_cluster_scores", - Set(EngagementScore, InferredInterests).asJava - ) - - val SIMCLUSTER_TWEET_TOPK_CLUSTER_ID = - TypedAggregateGroup.sparseFeature(SIMCLUSTER_TWEET_TOPK_CLUSTER_IDS) - - val SIMCLUSTER_TWEET_TOPK_SORT_BY_TWEET_SCORE = new Continuous( - s"${SimclusterFeatures.prefix}.tweet_topk_sort_by_tweet_score", - Set(EngagementScore, InferredInterests).asJava - ) - - val SIMCLUSTER_TWEET_TOPK_SORT_BY_COMBINED_SCORE = new Continuous( - s"${SimclusterFeatures.prefix}.tweet_topk_sort_by_combined_score", - Set(EngagementScore, InferredInterests).asJava - ) - - def formatUserSimclustersInterestedIn(suggestionRecord: SuggestionRecord): Map[String, Double] = { - suggestionRecord.userSimclustersInterestedIn - .map { clustersUserIsInterestedIn => - if (clustersUserIsInterestedIn.knownForModelVersion == SimclusterFeatures.SIMCLUSTER_MODEL_VERSION) { - clustersUserIsInterestedIn.clusterIdToScores.collect { - case (clusterId, scores) if scores.favScore.isDefined => - (clusterId.toString, scores.favScore.get) - } - } else Map.empty[String, Double] - }.getOrElse(Map.empty[String, Double]) - .toMap - } - - def formatTweetSimclustersTopK( - simclustersTweetTopKMap: Map[String, Double] - ): Map[String, Double] = { - simclustersTweetTopKMap.collect { - case (clusterId, score) => - // The clusterId format is as specified at - // com.twitter.ml.featurestore.catalog.features.recommendations.SimClustersV2TweetTopClusters - // and we want to extract the IntegerClusterId. - // The split function takes a regex; therefore, we need to escape . and we also need to escape - // \ since they are both special characters. Hence, the double \\. - val clusterIdSplit = clusterId.split("\\.") - val integerClusterId = clusterIdSplit(1) // The IntegerClusterId is at position 1. 
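        // Hypothetical example: "20M_145K_dec11.103.fav".split("\\.") yields
        // Array("20M_145K_dec11", "103", "fav"), so integerClusterId = "103".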
- (integerClusterId, score) - } - } -} diff --git a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclustersScoresFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclustersScoresFeatures.scala deleted file mode 100644 index 0629636c0..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/simcluster/SimclustersScoresFeatures.scala +++ /dev/null @@ -1,43 +0,0 @@ -package com.twitter.timelines.prediction.features.simcluster - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType.SemanticcoreClassification -import com.twitter.ml.api.Feature -import com.twitter.ml.api.Feature.Continuous -import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion.CombineCountsBase -import scala.collection.JavaConverters._ - -object SimclustersScoresFeatures extends CombineCountsBase { - override def topK: Int = 2 - - override def hardLimit: Option[Int] = Some(20) - - val prefix = s"recommendations.sim_clusters_scores" - val TOPIC_CONSUMER_TWEET_EMBEDDING_Cs = new Continuous( - s"$prefix.localized_topic_consumer_tweet_embedding_cosine_similarity", - Set(SemanticcoreClassification).asJava) - val TOPIC_PRODUCER_TWEET_EMBEDDING_Cs = new Continuous( - s"$prefix.topic_producer_tweet_embedding_cosine_similarity", - Set(SemanticcoreClassification).asJava) - val USER_TOPIC_CONSUMER_TWEET_EMBEDDING_COSINE_SIM = new Continuous( - s"$prefix.user_interested_in_localized_topic_consumer_embedding_cosine_similarity", - Set(SemanticcoreClassification).asJava) - val USER_TOPIC_CONSUMER_TWEET_EMBEDDING_DOT_PRODUCT = new Continuous( - s"$prefix.user_interested_in_localized_topic_consumer_embedding_dot_product", - Set(SemanticcoreClassification).asJava) - val USER_TOPIC_PRODUCER_TWEET_EMBEDDING_COSINE_SIM = new Continuous( - s"$prefix.user_interested_in_localized_topic_producer_embedding_cosine_similarity", - Set(SemanticcoreClassification).asJava) - val USER_TOPIC_PRODUCER_TWEET_EMBEDDING_DOT_PRODUCT = new Continuous( - s"$prefix.user_interested_in_localized_topic_producer_embedding_dot_product", - Set(SemanticcoreClassification).asJava) - - override def precomputedCountFeatures: Seq[Feature[_]] = - Seq( - TOPIC_CONSUMER_TWEET_EMBEDDING_Cs, - TOPIC_PRODUCER_TWEET_EMBEDDING_Cs, - USER_TOPIC_CONSUMER_TWEET_EMBEDDING_COSINE_SIM, - USER_TOPIC_CONSUMER_TWEET_EMBEDDING_DOT_PRODUCT, - USER_TOPIC_PRODUCER_TWEET_EMBEDDING_COSINE_SIM, - USER_TOPIC_PRODUCER_TWEET_EMBEDDING_DOT_PRODUCT - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/socialproof/BUILD b/src/scala/com/twitter/timelines/prediction/features/socialproof/BUILD deleted file mode 100644 index 0c00b1e5b..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/socialproof/BUILD +++ /dev/null @@ -1,15 +0,0 @@ -scala_library( - name = "socialproof_features", - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "3rdparty/jvm/com/ibm/icu:icu4j", - "src/java/com/twitter/ml/api:api-base", - "src/scala/com/twitter/ml/api/util", - "src/scala/com/twitter/timelines/util", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/ml/api:data-java", - "src/thrift/com/twitter/timelines/socialproof:socialproof-scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/socialproof/SocialProofFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/socialproof/SocialProofFeatures.scala deleted file mode 100644 index 163ba7efa..000000000 --- 
a/src/scala/com/twitter/timelines/prediction/features/socialproof/SocialProofFeatures.scala +++ /dev/null @@ -1,172 +0,0 @@ -package com.twitter.timelines.prediction.features.socialproof - -import com.twitter.ml.api.DataRecord -import com.twitter.ml.api.Feature.Binary -import com.twitter.ml.api.Feature.Continuous -import com.twitter.ml.api.Feature.SparseBinary -import com.twitter.ml.api.util.FDsl._ -import com.twitter.timelines.prediction.features.socialproof.SocialProofDataRecordFeatures._ -import com.twitter.timelines.socialproof.thriftscala.SocialProof -import com.twitter.timelines.socialproof.v1.thriftscala.SocialProofType -import com.twitter.timelines.util.CommonTypes.UserId -import scala.collection.JavaConverters._ -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ - -abstract class SocialProofUserGroundTruth(userIds: Seq[UserId], count: Int) { - require( - count >= userIds.size, - "count must be equal to or greater than the number of entries in userIds" - ) - // Using Double as the return type to make it more convenient for these values to be used as - // ML feature values. - val displayedUserCount: Double = userIds.size.toDouble - val undisplayedUserCount: Double = count - userIds.size.toDouble - val totalCount: Double = count.toDouble - - def featureDisplayedUsers: SparseBinary - def featureDisplayedUserCount: Continuous - def featureUndisplayedUserCount: Continuous - def featureTotalUserCount: Continuous - - def setFeatures(rec: DataRecord): Unit = { - rec.setFeatureValue(featureDisplayedUsers, toStringSet(userIds)) - rec.setFeatureValue(featureDisplayedUserCount, displayedUserCount) - rec.setFeatureValue(featureUndisplayedUserCount, undisplayedUserCount) - rec.setFeatureValue(featureTotalUserCount, totalCount) - } - protected def toStringSet(value: Seq[Long]): Set[String] = { - value.map(_.toString).toSet - } -} - -case class FavoritedBySocialProofUserGroundTruth(userIds: Seq[UserId] = Seq.empty, count: Int = 0) - extends SocialProofUserGroundTruth(userIds, count) { - - override val featureDisplayedUsers = SocialProofDisplayedFavoritedByUsers - override val featureDisplayedUserCount = SocialProofDisplayedFavoritedByUserCount - override val featureUndisplayedUserCount = SocialProofUndisplayedFavoritedByUserCount - override val featureTotalUserCount = SocialProofTotalFavoritedByUserCount -} - -case class RetweetedBySocialProofUserGroundTruth(userIds: Seq[UserId] = Seq.empty, count: Int = 0) - extends SocialProofUserGroundTruth(userIds, count) { - - override val featureDisplayedUsers = SocialProofDisplayedRetweetedByUsers - override val featureDisplayedUserCount = SocialProofDisplayedRetweetedByUserCount - override val featureUndisplayedUserCount = SocialProofUndisplayedRetweetedByUserCount - override val featureTotalUserCount = SocialProofTotalRetweetedByUserCount -} - -case class RepliedBySocialProofUserGroundTruth(userIds: Seq[UserId] = Seq.empty, count: Int = 0) - extends SocialProofUserGroundTruth(userIds, count) { - - override val featureDisplayedUsers = SocialProofDisplayedRepliedByUsers - override val featureDisplayedUserCount = SocialProofDisplayedRepliedByUserCount - override val featureUndisplayedUserCount = SocialProofUndisplayedRepliedByUserCount - override val featureTotalUserCount = SocialProofTotalRepliedByUserCount -} - -case class SocialProofFeatures( - hasSocialProof: Boolean, - favoritedBy: FavoritedBySocialProofUserGroundTruth = FavoritedBySocialProofUserGroundTruth(), - retweetedBy: RetweetedBySocialProofUserGroundTruth = 
RetweetedBySocialProofUserGroundTruth(), - repliedBy: RepliedBySocialProofUserGroundTruth = RepliedBySocialProofUserGroundTruth()) { - - def setFeatures(dataRecord: DataRecord): Unit = - if (hasSocialProof) { - dataRecord.setFeatureValue(HasSocialProof, hasSocialProof) - favoritedBy.setFeatures(dataRecord) - retweetedBy.setFeatures(dataRecord) - repliedBy.setFeatures(dataRecord) - } -} - -object SocialProofFeatures { - def apply(socialProofs: Seq[SocialProof]): SocialProofFeatures = - socialProofs.foldLeft(SocialProofFeatures(hasSocialProof = socialProofs.nonEmpty))( - (prevFeatures, socialProof) => { - val userIds = socialProof.v1.userIds - val count = socialProof.v1.count - socialProof.v1.socialProofType match { - case SocialProofType.FavoritedBy => - prevFeatures.copy(favoritedBy = FavoritedBySocialProofUserGroundTruth(userIds, count)) - case SocialProofType.RetweetedBy => - prevFeatures.copy(retweetedBy = RetweetedBySocialProofUserGroundTruth(userIds, count)) - case SocialProofType.RepliedBy => - prevFeatures.copy(repliedBy = RepliedBySocialProofUserGroundTruth(userIds, count)) - case _ => - prevFeatures // skip silently instead of breaking jobs, since this isn't used yet - } - }) -} - -object SocialProofDataRecordFeatures { - val HasSocialProof = new Binary("recap.social_proof.has_social_proof") - - val SocialProofDisplayedFavoritedByUsers = new SparseBinary( - "recap.social_proof.list.displayed.favorited_by", - Set(UserId, PublicLikes, PrivateLikes).asJava - ) - val SocialProofDisplayedFavoritedByUserCount = new Continuous( - "recap.social_proof.count.displayed.favorited_by", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val SocialProofUndisplayedFavoritedByUserCount = new Continuous( - "recap.social_proof.count.undisplayed.favorited_by", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val SocialProofTotalFavoritedByUserCount = new Continuous( - "recap.social_proof.count.total.favorited_by", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - - val SocialProofDisplayedRetweetedByUsers = new SparseBinary( - "recap.social_proof.list.displayed.retweeted_by", - Set(UserId, PublicRetweets, PrivateRetweets).asJava - ) - val SocialProofDisplayedRetweetedByUserCount = new Continuous( - "recap.social_proof.count.displayed.retweeted_by", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val SocialProofUndisplayedRetweetedByUserCount = new Continuous( - "recap.social_proof.count.undisplayed.retweeted_by", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val SocialProofTotalRetweetedByUserCount = new Continuous( - "recap.social_proof.count.total.retweeted_by", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - - val SocialProofDisplayedRepliedByUsers = new SparseBinary( - "recap.social_proof.list.displayed.replied_by", - Set(UserId, PublicReplies, PrivateReplies).asJava - ) - val SocialProofDisplayedRepliedByUserCount = new Continuous( - "recap.social_proof.count.displayed.replied_by", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val SocialProofUndisplayedRepliedByUserCount = new Continuous( - "recap.social_proof.count.undisplayed.replied_by", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val SocialProofTotalRepliedByUserCount = new Continuous( - "recap.social_proof.count.total.replied_by", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - - val AllFeatures = Seq( - HasSocialProof, - SocialProofDisplayedFavoritedByUsers, - SocialProofDisplayedFavoritedByUserCount, 
- SocialProofUndisplayedFavoritedByUserCount, - SocialProofTotalFavoritedByUserCount, - SocialProofDisplayedRetweetedByUsers, - SocialProofDisplayedRetweetedByUserCount, - SocialProofUndisplayedRetweetedByUserCount, - SocialProofTotalRetweetedByUserCount, - SocialProofDisplayedRepliedByUsers, - SocialProofDisplayedRepliedByUserCount, - SocialProofUndisplayedRepliedByUserCount, - SocialProofTotalRepliedByUserCount - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/time_features/BUILD b/src/scala/com/twitter/timelines/prediction/features/time_features/BUILD deleted file mode 100644 index b5c49af36..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/time_features/BUILD +++ /dev/null @@ -1,10 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/timelines/time_features:time_features-scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/time_features/TimeDataRecordFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/time_features/TimeDataRecordFeatures.scala deleted file mode 100644 index b398203c3..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/time_features/TimeDataRecordFeatures.scala +++ /dev/null @@ -1,111 +0,0 @@ -package com.twitter.timelines.prediction.features.time_features - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import com.twitter.ml.api.Feature._ -import scala.collection.JavaConverters._ -import com.twitter.util.Duration -import com.twitter.conversions.DurationOps._ - -object TimeDataRecordFeatures { - val TIME_BETWEEN_NON_POLLING_REQUESTS_AVG = new Continuous( - "time_features.time_between_non_polling_requests_avg", - Set(PrivateTimestamp).asJava - ) - val TIME_SINCE_TWEET_CREATION = new Continuous("time_features.time_since_tweet_creation") - val TIME_SINCE_SOURCE_TWEET_CREATION = new Continuous( - "time_features.time_since_source_tweet_creation" - ) - val TIME_SINCE_LAST_NON_POLLING_REQUEST = new Continuous( - "time_features.time_since_last_non_polling_request", - Set(PrivateTimestamp).asJava - ) - val NON_POLLING_REQUESTS_SINCE_TWEET_CREATION = new Continuous( - "time_features.non_polling_requests_since_tweet_creation", - Set(PrivateTimestamp).asJava - ) - val TWEET_AGE_RATIO = new Continuous("time_features.tweet_age_ratio") - val IS_TWEET_RECYCLED = new Binary("time_features.is_tweet_recycled") - // Last Engagement features - val LAST_FAVORITE_SINCE_CREATION_HRS = new Continuous( - "time_features.earlybird.last_favorite_since_creation_hrs", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val LAST_RETWEET_SINCE_CREATION_HRS = new Continuous( - "time_features.earlybird.last_retweet_since_creation_hrs", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val LAST_REPLY_SINCE_CREATION_HRS = new Continuous( - "time_features.earlybird.last_reply_since_creation_hrs", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val LAST_QUOTE_SINCE_CREATION_HRS = new Continuous( - "time_features.earlybird.last_quote_since_creation_hrs", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val TIME_SINCE_LAST_FAVORITE_HRS = new Continuous( - "time_features.earlybird.time_since_last_favorite", - Set(CountOfPrivateLikes, CountOfPublicLikes).asJava - ) - val TIME_SINCE_LAST_RETWEET_HRS = new Continuous( - 
"time_features.earlybird.time_since_last_retweet", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - val TIME_SINCE_LAST_REPLY_HRS = new Continuous( - "time_features.earlybird.time_since_last_reply", - Set(CountOfPrivateReplies, CountOfPublicReplies).asJava - ) - val TIME_SINCE_LAST_QUOTE_HRS = new Continuous( - "time_features.earlybird.time_since_last_quote", - Set(CountOfPrivateRetweets, CountOfPublicRetweets).asJava - ) - - val TIME_SINCE_VIEWER_ACCOUNT_CREATION_SECS = - new Continuous( - "time_features.time_since_viewer_account_creation_secs", - Set(AccountCreationTime, AgeOfAccount).asJava) - - val USER_ID_IS_SNOWFLAKE_ID = - new Binary("time_features.time_user_id_is_snowflake_id", Set(UserType).asJava) - - val IS_30_DAY_NEW_USER = - new Binary("time_features.is_day_30_new_user", Set(AccountCreationTime, AgeOfAccount).asJava) - val IS_12_MONTH_NEW_USER = - new Binary("time_features.is_month_12_new_user", Set(AccountCreationTime, AgeOfAccount).asJava) - val ACCOUNT_AGE_INTERVAL = - new Discrete("time_features.account_age_interval", Set(AgeOfAccount).asJava) -} - -object AccountAgeInterval extends Enumeration { - val LTE_1_DAY, GT_1_DAY_LTE_5_DAY, GT_5_DAY_LTE_14_DAY, GT_14_DAY_LTE_30_DAY = Value - - def fromDuration(accountAge: Duration): Option[AccountAgeInterval.Value] = { - accountAge match { - case a if (a <= 1.day) => Some(LTE_1_DAY) - case a if (1.day < a && a <= 5.days) => Some(GT_1_DAY_LTE_5_DAY) - case a if (5.days < a && a <= 14.days) => Some(GT_5_DAY_LTE_14_DAY) - case a if (14.days < a && a <= 30.days) => Some(GT_14_DAY_LTE_30_DAY) - case _ => None - } - } -} - -case class TimeFeatures( - isTweetRecycled: Boolean, - timeSinceTweetCreation: Double, - isDay30NewUser: Boolean, - isMonth12NewUser: Boolean, - timeSinceSourceTweetCreation: Double, // same as timeSinceTweetCreation for non-retweets - timeSinceViewerAccountCreationSecs: Option[Double], - timeBetweenNonPollingRequestsAvg: Option[Double] = None, - timeSinceLastNonPollingRequest: Option[Double] = None, - nonPollingRequestsSinceTweetCreation: Option[Double] = None, - tweetAgeRatio: Option[Double] = None, - lastFavSinceCreationHrs: Option[Double] = None, - lastRetweetSinceCreationHrs: Option[Double] = None, - lastReplySinceCreationHrs: Option[Double] = None, - lastQuoteSinceCreationHrs: Option[Double] = None, - timeSinceLastFavoriteHrs: Option[Double] = None, - timeSinceLastRetweetHrs: Option[Double] = None, - timeSinceLastReplyHrs: Option[Double] = None, - timeSinceLastQuoteHrs: Option[Double] = None, - accountAgeInterval: Option[AccountAgeInterval.Value] = None) diff --git a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/BUILD b/src/scala/com/twitter/timelines/prediction/features/two_hop_features/BUILD deleted file mode 100644 index a4ad0eabf..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/BUILD +++ /dev/null @@ -1,10 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "graph-feature-service/src/main/thrift/com/twitter/graph_feature_service:graph_feature_service_thrift-scala", - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeatures.scala deleted file mode 100644 index 03a112578..000000000 --- 
a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeatures.scala +++ /dev/null @@ -1,93 +0,0 @@ -package com.twitter.timelines.prediction.features.two_hop_features - -import com.twitter.graph_feature_service.thriftscala.EdgeType -import com.twitter.ml.api.Feature._ -import scala.collection.JavaConverters._ -import TwoHopFeaturesConfig.personalDataTypesMap - -object TwoHopFeaturesDescriptor { - val prefix = "two_hop" - val normalizedPostfix = "normalized" - val leftNodeDegreePostfix = "left_degree" - val rightNodeDegreePostfix = "right_degree" - - type TwoHopFeatureMap = Map[(EdgeType, EdgeType), Continuous] - type TwoHopFeatureNodeDegreeMap = Map[EdgeType, Continuous] - - def apply(edgeTypePairs: Seq[(EdgeType, EdgeType)]): TwoHopFeaturesDescriptor = { - new TwoHopFeaturesDescriptor(edgeTypePairs) - } -} - -class TwoHopFeaturesDescriptor(edgeTypePairs: Seq[(EdgeType, EdgeType)]) { - import TwoHopFeaturesDescriptor._ - - def getLeftEdge(edgeTypePair: (EdgeType, EdgeType)): EdgeType = { - edgeTypePair._1 - } - - def getLeftEdgeName(edgeTypePair: (EdgeType, EdgeType)): String = { - getLeftEdge(edgeTypePair).originalName.toLowerCase - } - - def getRightEdge(edgeTypePair: (EdgeType, EdgeType)): EdgeType = { - edgeTypePair._2 - } - - def getRightEdgeName(edgeTypePair: (EdgeType, EdgeType)): String = { - getRightEdge(edgeTypePair).originalName.toLowerCase - } - - val rawFeaturesMap: TwoHopFeatureMap = edgeTypePairs.map(edgeTypePair => { - val leftEdgeType = getLeftEdge(edgeTypePair) - val leftEdgeName = getLeftEdgeName(edgeTypePair) - val rightEdgeType = getRightEdge(edgeTypePair) - val rightEdgeName = getRightEdgeName(edgeTypePair) - val personalDataTypes = ( - personalDataTypesMap.getOrElse(leftEdgeType, Set.empty) ++ - personalDataTypesMap.getOrElse(rightEdgeType, Set.empty) - ).asJava - val rawFeature = new Continuous(s"$prefix.$leftEdgeName.$rightEdgeName", personalDataTypes) - edgeTypePair -> rawFeature - })(collection.breakOut) - - val leftNodeDegreeFeaturesMap: TwoHopFeatureNodeDegreeMap = edgeTypePairs.map(edgeTypePair => { - val leftEdgeType = getLeftEdge(edgeTypePair) - val leftEdgeName = getLeftEdgeName(edgeTypePair) - val personalDataTypes = personalDataTypesMap.getOrElse(leftEdgeType, Set.empty).asJava - val leftNodeDegreeFeature = - new Continuous(s"$prefix.$leftEdgeName.$leftNodeDegreePostfix", personalDataTypes) - leftEdgeType -> leftNodeDegreeFeature - })(collection.breakOut) - - val rightNodeDegreeFeaturesMap: TwoHopFeatureNodeDegreeMap = edgeTypePairs.map(edgeTypePair => { - val rightEdgeType = getRightEdge(edgeTypePair) - val rightEdgeName = getRightEdgeName(edgeTypePair) - val personalDataTypes = personalDataTypesMap.getOrElse(rightEdgeType, Set.empty).asJava - val rightNodeDegreeFeature = - new Continuous(s"$prefix.$rightEdgeName.$rightNodeDegreePostfix", personalDataTypes) - rightEdgeType -> rightNodeDegreeFeature - })(collection.breakOut) - - val normalizedFeaturesMap: TwoHopFeatureMap = edgeTypePairs.map(edgeTypePair => { - val leftEdgeType = getLeftEdge(edgeTypePair) - val leftEdgeName = getLeftEdgeName(edgeTypePair) - val rightEdgeType = getRightEdge(edgeTypePair) - val rightEdgeName = getRightEdgeName(edgeTypePair) - val personalDataTypes = ( - personalDataTypesMap.getOrElse(leftEdgeType, Set.empty) ++ - personalDataTypesMap.getOrElse(rightEdgeType, Set.empty) - ).asJava - val normalizedFeature = - new Continuous(s"$prefix.$leftEdgeName.$rightEdgeName.$normalizedPostfix", personalDataTypes) - edgeTypePair -> normalizedFeature - 
})(collection.breakOut) - - private val rawFeaturesSeq: Seq[Continuous] = rawFeaturesMap.values.toSeq - private val leftNodeDegreeFeaturesSeq: Seq[Continuous] = leftNodeDegreeFeaturesMap.values.toSeq - private val rightNodeDegreeFeaturesSeq: Seq[Continuous] = rightNodeDegreeFeaturesMap.values.toSeq - private val normalizedFeaturesSeq: Seq[Continuous] = normalizedFeaturesMap.values.toSeq - - val featuresSeq: Seq[Continuous] = - rawFeaturesSeq ++ leftNodeDegreeFeaturesSeq ++ rightNodeDegreeFeaturesSeq ++ normalizedFeaturesSeq -} diff --git a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeaturesConfig.scala b/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeaturesConfig.scala deleted file mode 100644 index ece502e30..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/two_hop_features/TwoHopFeaturesConfig.scala +++ /dev/null @@ -1,30 +0,0 @@ -package com.twitter.timelines.prediction.features.two_hop_features - -import com.twitter.dal.personal_data.thriftjava.PersonalDataType -import com.twitter.graph_feature_service.thriftscala.{EdgeType, FeatureType} - -object TwoHopFeaturesConfig { - val leftEdgeTypes = Seq(EdgeType.Following, EdgeType.Favorite, EdgeType.MutualFollow) - val rightEdgeTypes = Seq( - EdgeType.FollowedBy, - EdgeType.FavoritedBy, - EdgeType.RetweetedBy, - EdgeType.MentionedBy, - EdgeType.MutualFollow) - - val edgeTypePairs: Seq[(EdgeType, EdgeType)] = { - for (leftEdgeType <- leftEdgeTypes; rightEdgeType <- rightEdgeTypes) - yield (leftEdgeType, rightEdgeType) - } - - val featureTypes: Seq[FeatureType] = edgeTypePairs.map(pair => FeatureType(pair._1, pair._2)) - - val personalDataTypesMap: Map[EdgeType, Set[PersonalDataType]] = Map( - EdgeType.Following -> Set(PersonalDataType.CountOfFollowersAndFollowees), - EdgeType.Favorite -> Set( - PersonalDataType.CountOfPrivateLikes, - PersonalDataType.CountOfPublicLikes), - EdgeType.MutualFollow -> Set(PersonalDataType.CountOfFollowersAndFollowees), - EdgeType.FollowedBy -> Set(PersonalDataType.CountOfFollowersAndFollowees) - ) -} diff --git a/src/scala/com/twitter/timelines/prediction/features/user_health/BUILD b/src/scala/com/twitter/timelines/prediction/features/user_health/BUILD deleted file mode 100644 index 598e0c066..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/user_health/BUILD +++ /dev/null @@ -1,10 +0,0 @@ -scala_library( - sources = ["*.scala"], - platform = "java8", - tags = ["bazel-compatible"], - dependencies = [ - "src/java/com/twitter/ml/api:api-base", - "src/thrift/com/twitter/dal/personal_data:personal_data-java", - "src/thrift/com/twitter/timelines/author_features/user_health:thrift-scala", - ], -) diff --git a/src/scala/com/twitter/timelines/prediction/features/user_health/UserHealthFeatures.scala b/src/scala/com/twitter/timelines/prediction/features/user_health/UserHealthFeatures.scala deleted file mode 100644 index 7c8c7f8b1..000000000 --- a/src/scala/com/twitter/timelines/prediction/features/user_health/UserHealthFeatures.scala +++ /dev/null @@ -1,23 +0,0 @@ -package com.twitter.timelines.prediction.features.user_health - -import com.twitter.ml.api.Feature -import com.twitter.timelines.author_features.user_health.thriftscala.UserState -import com.twitter.dal.personal_data.thriftjava.PersonalDataType.{UserState => UserStatePDT} -import com.twitter.dal.personal_data.thriftjava.PersonalDataType._ -import scala.collection.JavaConverters._ - -object UserHealthFeatures { - val UserState = new 
Feature.Discrete("user_health.user_state", Set(UserStatePDT, UserType).asJava) - val IsLightMinusUser = - new Feature.Binary("user_health.is_light_minus_user", Set(UserStatePDT, UserType).asJava) - val AuthorState = - new Feature.Discrete("user_health.author_state", Set(UserStatePDT, UserType).asJava) - val NumAuthorFollowers = - new Feature.Continuous("author_health.num_followers", Set(CountOfFollowersAndFollowees).asJava) - val NumAuthorConnectDays = new Feature.Continuous("author_health.num_connect_days") - val NumAuthorConnect = new Feature.Continuous("author_health.num_connect") - - val IsUserVerifiedUnion = new Feature.Binary("user_account.is_user_verified_union") -} - -case class UserHealthFeatures(id: Long, userStateOpt: Option[UserState]) diff --git a/src/thrift/com/twitter/interaction_graph/BUILD b/src/thrift/com/twitter/interaction_graph/BUILD deleted file mode 100644 index 500c73d77..000000000 --- a/src/thrift/com/twitter/interaction_graph/BUILD +++ /dev/null @@ -1,15 +0,0 @@ -create_thrift_libraries( - base_name = "interaction_graph", - sources = ["*.thrift"], - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - ], - generate_languages = [ - "java", - "scala", - "strato", - ], - provides_java_name = "interaction_graph-thrift-java", - provides_scala_name = "interaction_graph-thrift-scala", -) diff --git a/src/thrift/com/twitter/interaction_graph/interaction_graph.thrift b/src/thrift/com/twitter/interaction_graph/interaction_graph.thrift deleted file mode 100644 index d90df54cf..000000000 --- a/src/thrift/com/twitter/interaction_graph/interaction_graph.thrift +++ /dev/null @@ -1,98 +0,0 @@ -namespace java com.twitter.interaction_graph.thriftjava -#@namespace scala com.twitter.interaction_graph.thriftscala -#@namespace strato com.twitter.interaction_graph - -// These could be either a Vertex or an edge feature name -// when you add a new feature, update VertexFeatureCombiner.java and EdgeFeatureCombiner.java. 
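// Note on TimeSeriesStatistics (defined below): ewma is the exponentially weighted moving
// average of the interaction count, ewma_t = alpha * x_t + (1 - alpha) * ewma_{t-1}.
// As a hypothetical example with alpha = 0.5, a previous ewma of 4.0 and a new daily
// count x_t = 2 gives 0.5 * 2 + 0.5 * 4.0 = 3.0.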
-enum FeatureName { - num_retweets = 1 - num_favorites = 2 - num_mentions = 3 - num_direct_messages = 4 - num_tweet_clicks = 5 - num_link_clicks = 6 - num_profile_views = 7 - num_follows = 8 - num_unfollows = 9 - num_mutual_follows = 10 - address_book_email = 11 - address_book_phone = 12 - address_book_in_both = 13 - address_book_mutual_edge_email = 14 - address_book_mutual_edge_phone = 15 - address_book_mutual_edge_in_both = 16 - total_dwell_time = 17 - num_inspected_statuses = 18 - num_photo_tags = 19 - num_blocks = 20 - num_mutes = 21 - num_report_as_abuses = 22 - num_report_as_spams = 23 - num_tweet_quotes = 24 - num_push_opens = 25 - num_ntab_clicks = 26, - num_rt_favories = 27, - num_rt_replies = 28, - num_rt_tweet_quotes = 29, - num_rt_retweets = 30, - num_rt_mentions = 31, - num_rt_tweet_clicks = 32, - num_rt_link_clicks = 33 - num_shares = 34, - num_email_click = 35, - num_email_open = 36, - num_ntab_dislike_7_days = 37, - num_push_dismiss = 38, - num_push_report_tweet_click = 39, - num_push_report_user_click = 40, - num_replies = 41, - // vertex features after 128 - num_create_tweets = 129, -} -// do remember to update the tests in InteractionGraphAggregationJobTest when adding new features but not updating agg_all - -struct TimeSeriesStatistics { - 1: required double mean; - // For computing variance online: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm - 2: required double m2_for_variance; - 3: required double ewma; // Exponentially weighted moving average: ewma_t = \alpha x_t + (1-\alpha) ewma_{t-1} - 4: required i32 num_elapsed_days; // Total number of days since we started counting this feature - 5: required i32 num_non_zero_days; // Number of days when the interaction was non-zero (used to compute mean/variance) - 6: optional i32 num_days_since_last; // Number of days since the latest interaction happen -}(persisted="true", hasPersonalData = 'false') - -struct VertexFeature { - 1: required FeatureName name; - 2: required bool outgoing; // direction e.g. true is num_retweets_by_user, and false is num_retweets_for_user - 3: required TimeSeriesStatistics tss; -}(persisted="true", hasPersonalData = 'false') - -struct Vertex { - 1: required i64 user_id(personalDataType = 'UserId'); - 2: optional double weight; - 3: list features; -}(persisted="true", hasPersonalData = 'true') - -/* - * These features are for an edge (a->b). 
Examples: - * (i) follow is whether a follows b - * (ii) num_retweets is number of b's tweets retweet by a - */ -struct EdgeFeature { - 1: required FeatureName name; - 2: required TimeSeriesStatistics tss; -}(persisted="true", hasPersonalData = 'false') - -struct Edge { - 1: required i64 source_id(personalDataType = 'UserId'); - 2: required i64 destination_id(personalDataType = 'UserId'); - 3: optional double weight; - 4: list features; -}(persisted="true", hasPersonalData = 'true') - -// these structs below are used by our ml pipeline -struct EdgeLabel { - 1: required i64 source_id(personalDataType = 'UserId'); - 2: required i64 destination_id(personalDataType = 'UserId'); - 3: required set labels(personalDataType = 'AggregateImpressionEngagementData'); -}(persisted="true", hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/recos/recos.thrift b/src/thrift/com/twitter/recos/recos.thrift deleted file mode 100644 index a0c6c8f03..000000000 --- a/src/thrift/com/twitter/recos/recos.thrift +++ /dev/null @@ -1,176 +0,0 @@ -namespace java com.twitter.recos.thriftjava -#@namespace scala com.twitter.recos.thriftscala -namespace rb Recos - -include "com/twitter/recos/features/tweet.thrift" - -enum RecommendTweetDisplayLocation { - HomeTimeline = 0 - Peek = 1 - WelcomeFlow = 2 - NetworkDigest = 3 - BackfillDigest = 4 - NetworkDigestExp1 = 5 - NetworkDigestExp2 = 6 // deprecated - NetworkDigestExp3 = 7 // deprecated - HttpEndpoint = 8 - HomeTimeline1 = 9 - HomeTimeline2 = 10 - HomeTimeline3 = 11 - HomeTimeline4 = 12 - Poptart = 13 - NetworkDigestExp4 = 14 - NetworkDigestExp5 = 15 - NetworkDigestExp6 = 16 - NetworkDigestExp7 = 17 - NetworkDigestExp8 = 18 - NetworkDigestExp9 = 19 - InstantTimeline1 = 20 // AB1 + whitelist - InstantTimeline2 = 21 // AB1 + !whitelist - InstantTimeline3 = 22 // AB2 + whitelist - InstantTimeline4 = 23 // AB2 + !whitelist - BackfillDigestActive = 24 // deprecated - BackfillDigestDormant = 25 // deprecated - ExploreUS = 26 // deprecated - ExploreBR = 27 // deprecated - ExploreIN = 28 // deprecated - ExploreES = 29 // deprecated - ExploreJP = 30 // deprecated - MagicRecs = 31 - MagicRecs1 = 32 - MagicRecs2 = 33 - MagicRecs3 = 34 - SMSDiscover = 35 - FastFollower = 36 - InstantTimeline5 = 37 // for instant timeline experiment - InstantTimeline6 = 38 // for instant timeline experiment - InstantTimeline7 = 39 // for instant timeline experiment - InstantTimeline8 = 40 // for instant timeline experiment - LoggedOutProfile = 41 - LoggedOutPermalink = 42 - Poptart2 = 43 -} - -enum RelatedTweetDisplayLocation { - Permalink = 0 - Permalink1 = 1 - MobilePermalink = 2 - Permalink3 = 3 - Permalink4 = 4 - RelatedTweets = 5 - RelatedTweets1 = 6 - RelatedTweets2 = 7 - RelatedTweets3 = 8 - RelatedTweets4 = 9 - LoggedOutProfile = 10 - LoggedOutPermalink = 11 -} - -enum DDGBucket { - Control = 0 - Treatment = 1 - None = 2 -} - -struct RecommendTweetRequest { - 1: required i64 requesterId // user id of the requesting user - 2: required RecommendTweetDisplayLocation displayLocation // display location from the client - 3: optional i64 clientId // twitter api client id - 4: optional i32 maxResults // number of suggested results to return - 5: optional list excludedTweetIds // list of tweet ids to exclude from response - 6: optional list excludedAuthorIds // list of author ids to exclude from response - 7: optional i64 guestId // guestId - 8: optional string languageCode // Language code - 9: optional string countryCode // Country code - 10: optional string ipAddress // ip address of the 
user - 11: optional string deviceId // udid/uuid of device - 12: optional bool populateTweetFeatures // whether to populate tweet features. RecommendedTweet.tweetFeatures in the response will only be populated if this is set. -} - -struct Bucket { - 1: required string experimentName // name of experiment (or not). experiment could be production or whatever fits - 2: required string bucket // name of bucket (may or may not be a DDG bucket, e.g., production) -} - -struct RelatedTweetRequest { - 1: required i64 tweetId // original tweet id - 2: required RelatedTweetDisplayLocation displayLocation // display location from the client - 3: optional i64 clientId // twitter api client id - 4: optional i64 requesterId // user id of the requesting user - 5: optional i32 maxResults // number of suggested results to return - 6: optional list excludeTweetIds // list of tweet ids to exclude from response - 7: optional list excludedAuthorIds // list of author ids to exclude from response - 8: optional i64 guestId // guestId - 9: optional string languageCode // Language code - 10: optional string countryCode // Country code - 11: optional string ipAddress // ip address of the user - 12: optional string deviceId // udid/uuid of device - 13: optional string userAgent // userAgent of the requesting user -} - -enum SocialProofType { - FollowedBy = 1, - FavoritedBy = 2, - RetweetedBy = 3, - SimilarTo = 4, - RESERVED_2 = 5, - RESERVED_3 = 6, - RESERVED_4 = 7, - RESERVED_5 = 8, - RESERVED_6 = 9, - RESERVED_7 = 10 -} - -enum Algorithm { - Salsa = 1, - PastEmailClicks = 2, - SimilarToEmailClicks = 3, - PastClientEventClicks = 4, - VitNews = 5, - StrongTieScoring = 6, - PollsFromGraph = 7, - PollsBasedOnGeo = 8, - RESERVED_9 = 9, - RESERVED_10 = 10, - RESERVED_11 = 11, -} - -struct RecommendedTweet { - 1: required i64 tweetId - 2: required i64 authorId - 3: required list socialProof - 4: required string feedbackToken - 5: optional list favBy // optionally provide a list of users who fav'ed the tweet if exist - 6: optional tweet.RecommendedTweetFeatures tweetFeatures // the features of a recommended tweet - 7: optional SocialProofType socialProofType // type of social proof. favBy should be deprecated soon - 8: optional string socialProofOverride // should be set only for DDGs, for en-only experiments. SocialProofType is ignored when this field is set - 9: optional Algorithm algorithm // algorithm used - 10: optional double score // score - 11: optional bool isFollowingAuthor // true if the target user follows the author of the tweet -} - -struct RelatedTweet { - 1: required i64 tweetId - 2: required i64 authorId - 3: required double score - 4: required string feedbackToken -} - -struct RecommendTweetResponse { - 1: required list tweets - 2: optional DDGBucket bucket // deprecated - 3: optional Bucket assignedBucket // for client-side experimentation -} - -struct RelatedTweetResponse { - 1: required list tweets // a list of related tweets - 2: optional Bucket assignedBucket // the bucket used for treatment -} - -/** - * The main interface-definition for Recos. 
- */ -service Recos { - RecommendTweetResponse recommendTweets (RecommendTweetRequest request) - RelatedTweetResponse relatedTweets (RelatedTweetRequest request) -} diff --git a/src/thrift/com/twitter/recos/recos_common.thrift b/src/thrift/com/twitter/recos/recos_common.thrift deleted file mode 100644 index ece39b8df..000000000 --- a/src/thrift/com/twitter/recos/recos_common.thrift +++ /dev/null @@ -1,54 +0,0 @@ -namespace java com.twitter.recos.recos_common.thriftjava -namespace py gen.twitter.recos.recos_common -#@namespace scala com.twitter.recos.recos_common.thriftscala -#@namespace strato com.twitter.recos.recos_common -namespace rb Recos - -// Social proof types for user moment recommendations -enum MomentSocialProofType { - PUBLISH = 0 - LIKE = 1 - CAPSULE_OPEN = 2 -} - -// Social proof types for tweet/entity recommendations -enum SocialProofType { - CLICK = 0 - FAVORITE = 1 - RETWEET = 2 - REPLY = 3 - TWEET = 4 - IS_MENTIONED = 5 - IS_MEDIATAGGED = 6 - QUOTE = 7 -} - -struct SocialProof { - 1: required i64 userId - 2: optional i64 metadata -} - -// Social proof types for user recommendations -enum UserSocialProofType { - FOLLOW = 0 - MENTION = 1 - MEDIATAG = 2 -} - -struct GetRecentEdgesRequest { - 1: required i64 requestId // the node to query from - 2: optional i32 maxNumEdges // the max number of recent edges -} - -struct RecentEdge { - 1: required i64 nodeId // the connecting node id - 2: required SocialProofType engagementType // the engagement type of the edge -} - -struct GetRecentEdgesResponse { - 1: required list edges // the _ most recent edges from the query node -} - -struct NodeInfo { - 1: required list edges -} diff --git a/src/thrift/com/twitter/recos/recos_injector.thrift b/src/thrift/com/twitter/recos/recos_injector.thrift deleted file mode 100644 index b11bc5c09..000000000 --- a/src/thrift/com/twitter/recos/recos_injector.thrift +++ /dev/null @@ -1,22 +0,0 @@ -namespace java com.twitter.recos.recos_injector.thriftjava -namespace py gen.twitter.recos.recos_injector -#@namespace scala com.twitter.recos.recos_injector.thriftscala -namespace rb RecosInjector - -####### FOR RECOS INTERNAL USE ONLY -- please do NOT use this in client code ######## - -struct UserTweetAuthorGraphMessage { - 1: required i64 leftId - 2: required i64 rightId - 3: required i8 action - 4: optional i8 card - 5: optional i64 authorId - 6: optional Features features -} - -struct Features { - 1: optional bool hasPhoto - 2: optional bool hasVideo - 3: optional bool hasUrl - 4: optional bool hasHashtag -} diff --git a/src/thrift/com/twitter/recos/user_tweet_entity_graph/BUILD b/src/thrift/com/twitter/recos/user_tweet_entity_graph/BUILD deleted file mode 100644 index ffd17d734..000000000 --- a/src/thrift/com/twitter/recos/user_tweet_entity_graph/BUILD +++ /dev/null @@ -1,19 +0,0 @@ -RECOSGRAPH_SOURCES = ["user_tweet_entity_graph.thrift"] - -create_thrift_libraries( - base_name = "user_tweet_entity_graph", - sources = RECOSGRAPH_SOURCES, - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - "src/thrift/com/twitter/recos:recos-common", - "src/thrift/com/twitter/recos/features:tweet", - ], - generate_languages = [ - "java", - "scala", - "strato", - ], - provides_java_name = "user_tweet_entity_graph-java", - provides_scala_name = "user_tweet_entity_graph-scala", -) diff --git a/src/thrift/com/twitter/recos/user_tweet_entity_graph/CONFIG.ini b/src/thrift/com/twitter/recos/user_tweet_entity_graph/CONFIG.ini deleted file mode 100644 index eae222a68..000000000 --- 
a/src/thrift/com/twitter/recos/user_tweet_entity_graph/CONFIG.ini +++ /dev/null @@ -1,7 +0,0 @@ -; See http://go/CONFIG.ini - -[jira] -project: SD - -[kite] -project: recos diff --git a/src/thrift/com/twitter/recos/user_tweet_entity_graph/user_tweet_entity_graph.thrift b/src/thrift/com/twitter/recos/user_tweet_entity_graph/user_tweet_entity_graph.thrift deleted file mode 100644 index 961fd2bc5..000000000 --- a/src/thrift/com/twitter/recos/user_tweet_entity_graph/user_tweet_entity_graph.thrift +++ /dev/null @@ -1,187 +0,0 @@ -namespace java com.twitter.recos.user_tweet_entity_graph.thriftjava -namespace py gen.twitter.recos.user_tweet_entity_graph -#@namespace scala com.twitter.recos.user_tweet_entity_graph.thriftscala -#@namespace strato com.twitter.recos.user_tweet_entity_graph -namespace rb UserTweetEntityGraph - -include "com/twitter/recos/features/tweet.thrift" -include "com/twitter/recos/recos_common.thrift" - -enum TweetType { - Summary = 0 - Photo = 1 - Player = 2 - Promote = 3 - Regular = 4 -} - -enum RecommendationType { - Tweet = 0 - Hashtag = 1 // Entity type - Url = 2 // Entity type -} - -enum TweetEntityDisplayLocation { - MagicRecs = 0 - HomeTimeline = 1 - HighlightsEmailUrlRecs = 2 - Highlights = 3 - Email = 4 - MagicRecsF1 = 5 - GuideVideo = 6 - MagicRecsRareTweet = 7 - TopArticles = 8 // Twitter Blue most shared articles page - ContentRecommender = 9 - FrigateNTab = 10 -} - -struct RecommendTweetEntityRequest { - // user id of the requesting user - 1: required i64 requesterId - - // display location from the client - 2: required TweetEntityDisplayLocation displayLocation - - // the recommendation entity types to return - 3: required list recommendationTypes - - // seed ids and weights used in left hand side - 4: required map seedsWithWeights - - // number of suggested results per recommendation entity type - 5: optional map maxResultsByType - - // the tweet age threshold in milliseconds - 6: optional i64 maxTweetAgeInMillis - - // list of tweet ids to exclude from response - 7: optional list excludedTweetIds - - // max user social proof size per engagement type - 8: optional i32 maxUserSocialProofSize - - // max tweet social proof size per user - 9: optional i32 maxTweetSocialProofSize - - // min user social proof size per each recommendation entity type - 10: optional map minUserSocialProofSizes - - // summary, photo, player, promote, regular - 11: optional list tweetTypes - - // the list of social proof types to return - 12: optional list socialProofTypes - - // set of groups of social proof types allowed to be combined for comparison against minUserSocialProofSizes. - // e.g. if the input is set>, then the union of those two social proofs - // will be compared against the minUserSocialProofSize of Tweet RecommendationType. - 13: optional set> socialProofTypeUnions - - // the recommendations returned in the response are authored by the following users - 14: optional set tweetAuthors - - // the tweet engagement age threshold in milliseconds - 15: optional i64 maxEngagementAgeInMillis - - // the recommendations will not return any tweet authored by the following users - 16: optional set excludedTweetAuthors -} - -struct TweetRecommendation { - // tweet id - 1: required i64 tweetId - // sum of weights of seed users who engaged with the tweet. - // If a user engaged with the same tweet twice, liked it and retweeted it, then his/her weight was counted twice. 
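  // Hypothetical example: with seedsWithWeights = {42: 1.5} and seed user 42 both liking
  // and retweeting this tweet, score = 1.5 + 1.5 = 3.0.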
- 2: required double score - // user social proofs per engagement type - 3: required map> socialProofByType - // user social proofs along with edge metadata per engagement type. The value of the map is a list of SocialProofs. - 4: optional map> socialProofs -} - -struct HashtagRecommendation { - 1: required i32 id // integer hashtag id, which will be converted to hashtag string by client library. - 2: required double score - // sum of weights of seed users who engaged with the hashtag. - // If a user engaged with the same hashtag twice, liked it and retweeted it, then his/her weight was counted twice. - 3: required map>> socialProofByType - // user and tweet social proofs per engagement type. The key of inner map is user id, and the value of inner map is - // a list of tweet ids that the user engaged with. -} - -struct UrlRecommendation { - 1: required i32 id // integer url id, which will be converted to url string by client library. - 2: required double score - // sum of weights of seed users who engaged with the url. - // If a user engaged with the same url twice, liked it and retweeted it, then his/her weight was counted twice. - 3: required map>> socialProofByType - // user and tweet social proofs per engagement type. The key of inner map is user id, and the value of inner map is - // a list of tweet ids that the user engaged with. -} - -union UserTweetEntityRecommendationUnion { - 1: TweetRecommendation tweetRec - 2: HashtagRecommendation hashtagRec - 3: UrlRecommendation urlRec -} - -struct RecommendTweetEntityResponse { - 1: required list recommendations -} - -struct SocialProofRequest { - 1: required list inputTweets // Only for some tweets we need requst its social proofs. - 2: required map seedsWithWeights // a set of seed users with weights - 3: optional i64 requesterId // id of the requesting user - 4: optional list socialProofTypes // the list of social proof types to return -} - -struct SocialProofResponse { - 1: required list socialProofResults -} - -struct RecommendationSocialProofRequest { - /** - * Clients can request social proof from multiple recommendation types in a single request. - * NOTE: Avoid mixing tweet social proof requests with entity social proof requests as the - * underlying library call retrieves these differently. - */ - 1: required map> recommendationIdsForSocialProof - // These will be the only valid LHS nodes used to fetch social proof. - 2: required map seedsWithWeights - 3: optional i64 requesterId - // The list of valid social proof types to return, e.g. we may only want Favorite and Tweet proofs. - 4: optional list socialProofTypes -} - -struct RecommendationSocialProofResponse { - 1: required list socialProofResults -} - -/** - * The main interface-definition for UserTweetEntityGraph. - */ -service UserTweetEntityGraph { - RecommendTweetEntityResponse recommendTweets (RecommendTweetEntityRequest request) - - /** - * Given a query user, its seed users, and a set of input tweets, return the social proofs of - * input tweets if any. - * - * Currently this supports clients such as Email Recommendations, MagicRecs, and HomeTimeline. - * In order to avoid heavy migration work, we are retaining this endpoint. - */ - SocialProofResponse findTweetSocialProofs(SocialProofRequest request) - - /** - * Find social proof for the specified RecommendationType given a set of input ids of that type. - * Only find social proofs from the specified seed users with the specified social proof types. - * - * Currently this supports url social proof generation for Guide. 
- * - * This endpoint is flexible enough to support social proof generation for all recommendation - * types, and should be used for all future clients of this service. - */ - RecommendationSocialProofResponse findRecommendationSocialProofs(RecommendationSocialProofRequest request) -} - diff --git a/src/thrift/com/twitter/recos/user_tweet_graph/BUILD b/src/thrift/com/twitter/recos/user_tweet_graph/BUILD deleted file mode 100644 index 5f9f68eb3..000000000 --- a/src/thrift/com/twitter/recos/user_tweet_graph/BUILD +++ /dev/null @@ -1,22 +0,0 @@ -RECOSGRAPH_SOURCES = ["user_tweet_graph.thrift"] - -create_thrift_libraries( - base_name = "user_tweet_graph", - sources = RECOSGRAPH_SOURCES, - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - "src/thrift/com/twitter/recos:recos-common", - "src/thrift/com/twitter/recos/features:tweet", - ], - export_roots = [ - "src/thrift/com/twitter/recos/features:tweet", - ], - generate_languages = [ - "java", - "scala", - "strato", - ], - provides_java_name = "user_tweet_graph-java", - provides_scala_name = "user_tweet_graph-scala", -) diff --git a/src/thrift/com/twitter/recos/user_tweet_graph/CONFIG.ini b/src/thrift/com/twitter/recos/user_tweet_graph/CONFIG.ini deleted file mode 100644 index eae222a68..000000000 --- a/src/thrift/com/twitter/recos/user_tweet_graph/CONFIG.ini +++ /dev/null @@ -1,7 +0,0 @@ -; See http://go/CONFIG.ini - -[jira] -project: SD - -[kite] -project: recos diff --git a/src/thrift/com/twitter/recos/user_tweet_graph/user_tweet_graph.thrift b/src/thrift/com/twitter/recos/user_tweet_graph/user_tweet_graph.thrift deleted file mode 100644 index 43f294eb1..000000000 --- a/src/thrift/com/twitter/recos/user_tweet_graph/user_tweet_graph.thrift +++ /dev/null @@ -1,172 +0,0 @@ -namespace java com.twitter.recos.user_tweet_graph.thriftjava -namespace py gen.twitter.recos.user_tweet_graph -#@namespace scala com.twitter.recos.user_tweet_graph.thriftscala -#@namespace strato com.twitter.recos.user_tweet_graph -namespace rb UserTweetGraph - -include "com/twitter/recos/features/tweet.thrift" -include "com/twitter/recos/recos_common.thrift" - -enum TweetType { - Summary = 0 - Photo = 1 - Player = 2 - Promote = 3 - Regular = 4 -} - -enum Algorithm { - Salsa = 0 - SubGraphSalsa = 1 -} - -enum RecommendTweetDisplayLocation { - HomeTimeline = 0 - WelcomeFlow = 1 - NetworkDigest = 2 - BackfillDigest = 3 - HttpEndpoint = 4 - Poptart = 5 - InstantTimeline = 6 - Explore = 7 - MagicRecs = 8 - LoggedOutProfile = 9 - LoggedOutPermalink = 10 - VideoHome = 11 -} - -struct RecommendTweetRequest { - 1: required i64 requesterId // user id of the requesting user - 2: required RecommendTweetDisplayLocation displayLocation // display location from the client - 3: required i32 maxResults // number of suggested results to return - 4: required list excludedTweetIds // list of tweet ids to exclude from response - 5: required map seeds // seeds used in salsa random walk - 6: required i64 tweetRecency // the tweet recency threshold - 7: required i32 minInteraction // minimum interaction threshold - 8: required list includeTweetTypes // summary, photo, player, promote, other - 9: required double resetProbability // reset probability to query node - 10: required double queryNodeWeightFraction // the percentage of weights assigned to query node in seeding - 11: required i32 numRandomWalks // number of random walks - 12: required i32 maxRandomWalkLength // max random walk length - 13: required i32 maxSocialProofSize // max social proof size - 14: required 
Algorithm algorithm // algorithm type - 15: optional list socialProofTypes // the list of social proof types to return -} - -struct RecommendedTweet { - 1: required i64 tweetId - 2: required double score - 3: optional list socialProof // social proof in aggregate - 4: optional map> socialProofPerType // social proofs per engagement type -} - -struct RecommendTweetResponse { - 1: required list tweets -} - -enum RelatedTweetDisplayLocation { - Permalink = 0 - Permalink1 = 1 - MobilePermalink = 2 - Permalink3 = 3 - Permalink4 = 4 - RelatedTweets = 5 - RelatedTweets1 = 6 - RelatedTweets2 = 7 - RelatedTweets3 = 8 - RelatedTweets4 = 9 - LoggedOutProfile = 10 - LoggedOutPermalink = 11 -} - -struct UserTweetFeatureResponse { - 1: optional double favAdamicAdarAvg - 2: optional double favAdamicAdarMax - 3: optional double favLogCosineAvg - 4: optional double favLogCosineMax - 5: optional double retweetAdamicAdarAvg - 6: optional double retweetAdamicAdarMax - 7: optional double retweetLogCosineAvg - 8: optional double retweetLogCosineMax -} - -struct RelatedTweetRequest { - 1: required i64 tweetId // original tweet id - 2: required RelatedTweetDisplayLocation displayLocation // display location from the client - 3: optional string algorithm // additional parameter that the system can interpret - 4: optional i64 requesterId // user id of the requesting user - 5: optional i32 maxResults // number of suggested results to return - 6: optional list excludeTweetIds // list of tweet ids to exclude from response - 7: optional i32 maxNumNeighbors - 8: optional i32 minNeighborDegree - 9: optional i32 maxNumSamplesPerNeighbor - 10: optional i32 minCooccurrence - 11: optional i32 minQueryDegree - 12: optional double maxLowerMultiplicativeDeviation - 13: optional double maxUpperMultiplicativeDeviation - 14: optional bool populateTweetFeatures // whether to populate graph features - 15: optional i32 minResultDegree - 16: optional list additionalTweetIds - 17: optional double minScore - 18: optional i32 maxTweetAgeInHours -} - -struct TweetBasedRelatedTweetRequest { - 1: required i64 tweetId // query tweet id - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minQueryDegree // min degree of query tweet - 5: optional i32 maxNumSamplesPerNeighbor // max number of sampled users who engaged with the query tweet - 6: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 7: optional i32 minResultDegree // min degree of related tweet candidate - 8: optional double minScore // min score of related tweet candidate - 9: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct ProducerBasedRelatedTweetRequest { - 1: required i64 producerId // query producer id - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minQueryDegree // min degree of query producer, e.g. 
number of followers - 5: optional i32 maxNumFollowers // max number of sampled users who follow the query producer - 6: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 7: optional i32 minResultDegree // min degree of related tweet candidate - 8: optional double minScore // min score of related tweet candidate - 9: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct ConsumersBasedRelatedTweetRequest { - 1: required list consumerSeedSet // query consumer userId set - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 5: optional i32 minResultDegree // min degree of related tweet candidate - 6: optional double minScore // min score of related tweet candidate - 7: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct RelatedTweet { - 1: required i64 tweetId - 2: required double score - 3: optional tweet.GraphFeaturesForTweet relatedTweetGraphFeatures -} - -struct RelatedTweetResponse { - 1: required list tweets - 2: optional tweet.GraphFeaturesForQuery queryTweetGraphFeatures -} - -/** - * The main interface-definition for UserTweetGraph. - */ -service UserTweetGraph { - RecommendTweetResponse recommendTweets (RecommendTweetRequest request) - recos_common.GetRecentEdgesResponse getLeftNodeEdges (recos_common.GetRecentEdgesRequest request) - recos_common.NodeInfo getRightNode (i64 node) - RelatedTweetResponse relatedTweets (RelatedTweetRequest request) - RelatedTweetResponse tweetBasedRelatedTweets (TweetBasedRelatedTweetRequest request) - RelatedTweetResponse producerBasedRelatedTweets (ProducerBasedRelatedTweetRequest request) - RelatedTweetResponse consumersBasedRelatedTweets (ConsumersBasedRelatedTweetRequest request) - UserTweetFeatureResponse userTweetFeatures (1: required i64 userId, 2: required i64 tweetId) -} - diff --git a/src/thrift/com/twitter/recos/user_user_graph/BUILD b/src/thrift/com/twitter/recos/user_user_graph/BUILD deleted file mode 100644 index ef53f847a..000000000 --- a/src/thrift/com/twitter/recos/user_user_graph/BUILD +++ /dev/null @@ -1,19 +0,0 @@ -RECOSGRAPH_SOURCES = ["user_user_graph.thrift"] - -create_thrift_libraries( - base_name = "user_user_graph", - sources = RECOSGRAPH_SOURCES, - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - "src/thrift/com/twitter/recos:recos-common", - "src/thrift/com/twitter/recos/features:tweet", - ], - generate_languages = [ - "java", - "scala", - "strato", - ], - provides_java_name = "user_user_graph-java", - provides_scala_name = "user_user_graph-scala", -) diff --git a/src/thrift/com/twitter/recos/user_user_graph/CONFIG.ini b/src/thrift/com/twitter/recos/user_user_graph/CONFIG.ini deleted file mode 100644 index eae222a68..000000000 --- a/src/thrift/com/twitter/recos/user_user_graph/CONFIG.ini +++ /dev/null @@ -1,7 +0,0 @@ -; See http://go/CONFIG.ini - -[jira] -project: SD - -[kite] -project: recos diff --git a/src/thrift/com/twitter/recos/user_user_graph/user_user_graph.thrift b/src/thrift/com/twitter/recos/user_user_graph/user_user_graph.thrift deleted file mode 100644 index 10115c8d9..000000000 --- a/src/thrift/com/twitter/recos/user_user_graph/user_user_graph.thrift +++ /dev/null @@ -1,45 +0,0 @@ -namespace java com.twitter.recos.user_user_graph.thriftjava -namespace py gen.twitter.recos.user_user_graph 
-#@namespace scala com.twitter.recos.user_user_graph.thriftscala -#@namespace strato com.twitter.recos.user_user_graph -namespace rb UserUserGraph - -include "com/twitter/recos/recos_common.thrift" - -enum RecommendUserDisplayLocation { - MagicRecs = 0 - HomeTimeLine = 1 - ConnectTab = 2 -} - -struct RecommendUserRequest { - 1: required i64 requesterId // user id of the requesting user - 2: required RecommendUserDisplayLocation displayLocation // display location from the client - 3: required map seedsWithWeights // seed ids and weights used in left hand side - 4: optional list excludedUserIds // list of users to exclude from response - 5: optional i32 maxNumResults // number of results to return - 6: optional i32 maxNumSocialProofs // number of social proofs per recommendation - 7: optional map minUserPerSocialProof // minimum number of users for each social proof type - 8: optional list socialProofTypes // list of required social proof types. Any recommended user - // must at least have all of these social proof types - 9: optional i64 maxEdgeEngagementAgeInMillis // only events created during this period are counted -} - -struct RecommendedUser { - 1: required i64 userId // user id of recommended user - 2: required double score // weight of the recommended user - 3: required map> socialProofs // the social proofs of the recommended user -} - -struct RecommendUserResponse { - 1: required list recommendedUsers // list of recommended users -} - -/** - * The main interface-definition for UserUserGraph. - */ -service UserUserGraph { - // Given a request for recommendations for a specific user, - // return a list of candidate users along with their social proofs - RecommendUserResponse recommendUsers (RecommendUserRequest request) -} diff --git a/src/thrift/com/twitter/recos/user_video_graph/BUILD b/src/thrift/com/twitter/recos/user_video_graph/BUILD deleted file mode 100644 index f9dcbb8b1..000000000 --- a/src/thrift/com/twitter/recos/user_video_graph/BUILD +++ /dev/null @@ -1,22 +0,0 @@ -RECOSGRAPH_SOURCES = ["user_video_graph.thrift"] - -create_thrift_libraries( - base_name = "user_video_graph", - sources = RECOSGRAPH_SOURCES, - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - "src/thrift/com/twitter/recos:recos-common", - "src/thrift/com/twitter/recos/features:tweet", - ], - export_roots = [ - "src/thrift/com/twitter/recos/features:tweet", - ], - generate_languages = [ - "java", - "scala", - "strato", - ], - provides_java_name = "user_video_graph-java", - provides_scala_name = "user_video_graph-scala", -) diff --git a/src/thrift/com/twitter/recos/user_video_graph/CONFIG.ini b/src/thrift/com/twitter/recos/user_video_graph/CONFIG.ini deleted file mode 100644 index eae222a68..000000000 --- a/src/thrift/com/twitter/recos/user_video_graph/CONFIG.ini +++ /dev/null @@ -1,7 +0,0 @@ -; See http://go/CONFIG.ini - -[jira] -project: SD - -[kite] -project: recos diff --git a/src/thrift/com/twitter/recos/user_video_graph/user_video_graph.thrift b/src/thrift/com/twitter/recos/user_video_graph/user_video_graph.thrift deleted file mode 100644 index a5d83c1d6..000000000 --- a/src/thrift/com/twitter/recos/user_video_graph/user_video_graph.thrift +++ /dev/null @@ -1,64 +0,0 @@ -namespace java com.twitter.recos.user_video_graph.thriftjava -namespace py gen.twitter.recos.user_video_graph -#@namespace scala com.twitter.recos.user_video_graph.thriftscala -#@namespace strato com.twitter.recos.user_video_graph -namespace rb UserVideoGraph - -include 
"com/twitter/recos/features/tweet.thrift" -include "com/twitter/recos/recos_common.thrift" - - -struct TweetBasedRelatedTweetRequest { - 1: required i64 tweetId // query tweet id - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minQueryDegree // min degree of query tweet - 5: optional i32 maxNumSamplesPerNeighbor // max number of sampled users who engaged with the query tweet - 6: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 7: optional i32 minResultDegree // min degree of related tweet candidate - 8: optional double minScore // min score of related tweet candidate - 9: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct ProducerBasedRelatedTweetRequest { - 1: required i64 producerId // query producer id - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minQueryDegree // min degree of query producer, e.g. number of followers - 5: optional i32 maxNumFollowers // max number of sampled users who follow the query producer - 6: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 7: optional i32 minResultDegree // min degree of related tweet candidate - 8: optional double minScore // min score of related tweet candidate - 9: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct ConsumersBasedRelatedTweetRequest { - 1: required list consumerSeedSet // query consumer userId set - 2: optional i32 maxResults // number of suggested results to return - 3: optional list excludeTweetIds // list of tweet ids to exclude from response - 4: optional i32 minCooccurrence // min co-occurrence of related tweet candidate - 5: optional i32 minResultDegree // min degree of related tweet candidate - 6: optional double minScore // min score of related tweet candidate - 7: optional i32 maxTweetAgeInHours // max tweet age in hours of related tweet candidate -} - -struct RelatedTweet { - 1: required i64 tweetId - 2: required double score - 3: optional tweet.GraphFeaturesForTweet relatedTweetGraphFeatures -} - -struct RelatedTweetResponse { - 1: required list tweets - 2: optional tweet.GraphFeaturesForQuery queryTweetGraphFeatures -} - -/** - * The main interface-definition for UserVideoGraph. 
- */ -service UserVideoGraph { - RelatedTweetResponse tweetBasedRelatedTweets (TweetBasedRelatedTweetRequest request) - RelatedTweetResponse producerBasedRelatedTweets (ProducerBasedRelatedTweetRequest request) - RelatedTweetResponse consumersBasedRelatedTweets (ConsumersBasedRelatedTweetRequest request) -} - diff --git a/src/thrift/com/twitter/search/common/ranking/ranking.thrift b/src/thrift/com/twitter/search/common/ranking/ranking.thrift deleted file mode 100644 index bd1cff929..000000000 --- a/src/thrift/com/twitter/search/common/ranking/ranking.thrift +++ /dev/null @@ -1,366 +0,0 @@ -namespace java com.twitter.search.common.ranking.thriftjava -#@namespace scala com.twitter.search.common.ranking.thriftscala -#@namespace strato com.twitter.search.common.ranking -namespace py gen.twitter.search.common.ranking.ranking - -struct ThriftLinearFeatureRankingParams { - // values below this will set the score to the minimal one - 1: optional double min = -1e+100 - // values above this will set the score to the minimal one - 2: optional double max = 1e+100 - 3: optional double weight = 0 -}(persisted='true') - -struct ThriftAgeDecayRankingParams { - // the rate in which the score of older tweets decreases - 1: optional double slope = 0.003 - // the age, in minutes, where the age score of a tweet is half of the latest tweet - 2: optional double halflife = 360.0 - // the minimal age decay score a tweet will have - 3: optional double base = 0.6 -}(persisted='true') - -enum ThriftScoringFunctionType { - LINEAR = 1, - MODEL_BASED = 4, - TENSORFLOW_BASED = 5, - - // deprecated - TOPTWEETS = 2, - EXPERIMENTAL = 3, -} - -// The struct to define a class that is to be dynamically loaded in earlybird for -// experimentation. -struct ThriftExperimentClass { - // the fully qualified class name. - 1: required string name - // data source location (class/jar file) for this dynamic class on HDFS - 2: optional string location - // parameters in key-value pairs for this experimental class - 3: optional map params -}(persisted='true') - -// Deprecated!! -struct ThriftQueryEngagementParams { - // Rate Boosts: given a rate (usually a small fraction), the score will be multiplied by - // (1 + rate) ^ boost - // 0 mean no boost, negative numbers are dampens - 1: optional double retweetRateBoost = 0 - 2: optional double replyRateBoost = 0 - 3: optional double faveRateBoost = 0 -}(persisted='true') - -struct ThriftHostQualityParams { - // Multiplier applied to host score, for tweets that have links. - // A multiplier of 0 means that this boost is not applied - 1: optional double multiplier = 0.0 - - // Do not apply the multiplier to hosts with score above this level. - // If 0, the multiplier will be applied to any host. - 2: optional double maxScoreToModify = 0.0 - - // Do not apply the multiplier to hosts with score below this level. - // If 0, the multiplier will be applied to any host. - 3: optional double minScoreToModify = 0.0 - - // If true, score modification will be applied to hosts that have unknown scores. - // The host-score used will be lower than the score of any known host. - 4: optional bool applyToUnknownHosts = 0 -}(persisted='true') - -struct ThriftCardRankingParams { - 1: optional double hasCardBoost = 1.0 - 2: optional double domainMatchBoost = 1.0 - 3: optional double authorMatchBoost = 1.0 - 4: optional double titleMatchBoost = 1.0 - 5: optional double descriptionMatchBoost = 1.0 -}(persisted='true') - -# The ids are assigned in 'blocks'. 
For adding a new field, find an unused id in the appropriate -# block. Be sure to mention explicitly which ids have been removed so that they are not used again. -struct ThriftRankingParams { - 1: optional ThriftScoringFunctionType type - - // Dynamically loaded scorer and collector for quick experimentation. - 40: optional ThriftExperimentClass expScorer - 41: optional ThriftExperimentClass expCollector - - // we must set it to a value that fits into a float: otherwise - // some earlybird classes that convert it to float will interpret - // it as Float.NEGATIVE_INFINITY, and some comparisons will fail - 2: optional double minScore = -1e+30 - - 10: optional ThriftLinearFeatureRankingParams parusScoreParams - 11: optional ThriftLinearFeatureRankingParams retweetCountParams - 12: optional ThriftLinearFeatureRankingParams replyCountParams - 15: optional ThriftLinearFeatureRankingParams reputationParams - 16: optional ThriftLinearFeatureRankingParams luceneScoreParams - 18: optional ThriftLinearFeatureRankingParams textScoreParams - 19: optional ThriftLinearFeatureRankingParams urlParams - 20: optional ThriftLinearFeatureRankingParams isReplyParams - 21: optional ThriftLinearFeatureRankingParams directFollowRetweetCountParams - 22: optional ThriftLinearFeatureRankingParams trustedCircleRetweetCountParams - 23: optional ThriftLinearFeatureRankingParams favCountParams - 24: optional ThriftLinearFeatureRankingParams multipleReplyCountParams - 27: optional ThriftLinearFeatureRankingParams embedsImpressionCountParams - 28: optional ThriftLinearFeatureRankingParams embedsUrlCountParams - 29: optional ThriftLinearFeatureRankingParams videoViewCountParams - 66: optional ThriftLinearFeatureRankingParams quotedCountParams - - // A map from MutableFeatureType to linear ranking params - 25: optional map offlineExperimentalFeatureRankingParams - - // if min/max for score or ThriftLinearFeatureRankingParams should always be - // applied or only to non-follows, non-self, non-verified - 26: optional bool applyFiltersAlways = 0 - - // Whether to apply promotion/demotion at all for FeatureBasedScoringFunction - 70: optional bool applyBoosts = 1 - - // UI language is english, tweet language is not - 30: optional double langEnglishUIBoost = 0.3 - // tweet language is english, UI language is not - 31: optional double langEnglishTweetBoost = 0.7 - // user language differs from tweet language, and neither is english - 32: optional double langDefaultBoost = 0.1 - // user that produced tweet is marked as spammer by metastore - 33: optional double spamUserBoost = 1.0 - // user that produced tweet is marked as nsfw by metastore - 34: optional double nsfwUserBoost = 1.0 - // user that produced tweet is marked as bot (self similarity) by metastore - 35: optional double botUserBoost = 1.0 - - // An alternative way of using lucene score in the ranking function. - 38: optional bool useLuceneScoreAsBoost = 0 - 39: optional double maxLuceneScoreBoost = 1.2 - - // Use user's consumed and produced languages for scoring - 42: optional bool useUserLanguageInfo = 0 - - // Boost (demotion) if the tweet language is not one of user's understandable languages, - // nor interface language. - 43: optional double unknownLanguageBoost = 0.01 - - // Use topic ids for scoring. - // Deprecated in SEARCH-8616. - 44: optional bool deprecated_useTopicIDsBoost = 0 - // Parameters for topic id scoring. See TopicIDsBoostScorer (and its test) for details. 
- 46: optional double deprecated_maxTopicIDsBoost = 3.0 - 47: optional double deprecated_topicIDsBoostExponent = 2.0; - 48: optional double deprecated_topicIDsBoostSlope = 2.0; - - // Hit Attribute Demotion - 60: optional bool enableHitDemotion = 0 - 61: optional double noTextHitDemotion = 1.0 - 62: optional double urlOnlyHitDemotion = 1.0 - 63: optional double nameOnlyHitDemotion = 1.0 - 64: optional double separateTextAndNameHitDemotion = 1.0 - 65: optional double separateTextAndUrlHitDemotion = 1.0 - - // multiplicative score boost for results deemed offensive - 100: optional double offensiveBoost = 1 - // multiplicative score boost for results in the searcher's social circle - 101: optional double inTrustedCircleBoost = 1 - // multiplicative score dampen for results with more than one hash tag - 102: optional double multipleHashtagsOrTrendsBoost = 1 - // multiplicative score boost for results in the searcher's direct follows - 103: optional double inDirectFollowBoost = 1 - // multiplicative score boost for results that has trends - 104: optional double tweetHasTrendBoost = 1 - // is tweet from verified account? - 106: optional double tweetFromVerifiedAccountBoost = 1 - // is tweet authored by the searcher? (boost is in addition to social boost) - 107: optional double selfTweetBoost = 1 - // multiplicative score boost for a tweet that has image url. - 108: optional double tweetHasImageUrlBoost = 1 - // multiplicative score boost for a tweet that has video url. - 109: optional double tweetHasVideoUrlBoost = 1 - // multiplicative score boost for a tweet that has news url. - 110: optional double tweetHasNewsUrlBoost = 1 - // is tweet from a blue-verified account? - 111: optional double tweetFromBlueVerifiedAccountBoost = 1 (personalDataType = 'UserVerifiedFlag') - - // subtractive penalty applied after boosts for out-of-network replies. - 120: optional double outOfNetworkReplyPenalty = 10.0 - - 150: optional ThriftQueryEngagementParams deprecatedQueryEngagementParams - - 160: optional ThriftHostQualityParams deprecatedHostQualityParams - - // age decay params for regular tweets - 203: optional ThriftAgeDecayRankingParams ageDecayParams - - // for card ranking: map between card name ordinal (defined in com.twitter.search.common.constants.CardConstants) - // to ranking params - 400: optional map cardRankingParams - - // A map from tweet IDs to the score adjustment for that tweet. These are score - // adjustments that include one or more features that can depend on the query - // string. These features aren't indexed by Earlybird, and so their total contribution - // to the scoring function is passed in directly as part of the request. If present, - // the score adjustment for a tweet is directly added to the linear component of the - // scoring function. Since this signal can be made up of multiple features, any - // reweighting or combination of these features is assumed to be done by the caller - // (hence there is no need for a weight parameter -- the weights of the features - // included in this signal have already been incorporated by the caller). - 151: optional map querySpecificScoreAdjustments - - // A map from user ID to the score adjustment for tweets from that author. - // This field provides a way for adjusting the tweets of a specific set of users with a score - // that is not present in the Earlybird features but has to be passed from the clients, such as - // real graph weights or a combination of multiple features. 
- // This field should be used mainly for experimentation since it increases the size of the thrift - // requests. - 154: optional map authorSpecificScoreAdjustments - - // -------- Parameters for ThriftScoringFunctionType.MODEL_BASED -------- - // Selected models along with their weights for the linear combination - 152: optional map selectedModels - 153: optional bool useLogitScore = false - - // -------- Parameters for ThriftScoringFunctionType.TENSORFLOW_BASED -------- - // Selected tensorflow model - 303: optional string selectedTensorflowModel - - // -------- Deprecated Fields -------- - // ID 303 has been used in the past. Resume additional deprecated fields from 304 - 105: optional double deprecatedTweetHasTrendInTrendingQueryBoost = 1 - 200: optional double deprecatedAgeDecaySlope = 0.003 - 201: optional double deprecatedAgeDecayHalflife = 360.0 - 202: optional double deprecatedAgeDecayBase = 0.6 - 204: optional ThriftAgeDecayRankingParams deprecatedAgeDecayForTrendsParams - 301: optional double deprecatedNameQueryConfidence = 0.0 - 302: optional double deprecatedHashtagQueryConfidence = 0.0 - // Whether to use old-style engagement features (normalized by LogNormalizer) - // or new ones (normalized by SingleBytePositiveFloatNormalizer) - 50: optional bool useGranularEngagementFeatures = 0 // DEPRECATED! -}(persisted='true') - -// This sorting mode is used by earlybird to retrieve the top-n facets that -// are returned to blender -enum ThriftFacetEarlybirdSortingMode { - SORT_BY_SIMPLE_COUNT = 0, - SORT_BY_WEIGHTED_COUNT = 1, -} - -// This is the final sort order used by blender after all results from -// the earlybirds are merged -enum ThriftFacetFinalSortOrder { - // using the created_at date of the first tweet that contained the facet - SCORE = 0, - SIMPLE_COUNT = 1, - WEIGHTED_COUNT = 2, - CREATED_AT = 3 -} - -struct ThriftFacetRankingOptions { - // next available field ID = 38 - - // ====================================================================== - // EARLYBIRD SETTINGS - // - // These parameters primarily affect how earlybird creates the top-k - // candidate list to be re-ranked by blender - // ====================================================================== - // Dynamically loaded scorer and collector for quick experimentation. - 26: optional ThriftExperimentClass expScorer - 27: optional ThriftExperimentClass expCollector - - // It should be less than or equal to reputationParams.min, and all - // tweepcreds between the two get a score of 1.0. - 21: optional i32 minTweepcredFilterThreshold - - // the maximum score a single tweet can contribute to the weightedCount - 22: optional i32 maxScorePerTweet - - 15: optional ThriftFacetEarlybirdSortingMode sortingMode - // The number of top candidates earlybird returns to blender - 16: optional i32 numCandidatesFromEarlybird = 100 - - // when to early terminate for facet search, overrides the setting in ThriftSearchQuery - 34: optional i32 maxHitsToProcess = 1000 - - // for anti-gaming we want to limit the maximum amount of hits the same user can - // contribute. Set to -1 to disable the anti-gaming filter. Overrides the setting in - // ThriftSearchQuery - 35: optional i32 maxHitsPerUser = 3 - - // if the tweepcred of the user is bigger than this value it will not be excluded - // by the anti-gaming filter. 
Overrides the setting in ThriftSearchQuery - 36: optional i32 maxTweepcredForAntiGaming = 65 - - // these settings affect how earlybird computes the weightedCount - 2: optional ThriftLinearFeatureRankingParams parusScoreParams - 3: optional ThriftLinearFeatureRankingParams reputationParams - 17: optional ThriftLinearFeatureRankingParams favoritesParams - 33: optional ThriftLinearFeatureRankingParams repliesParams - 37: optional map rankingExpScoreParams - - // penalty counter settings - 6: optional i32 offensiveTweetPenalty // set to -1 to disable the offensive filter - 7: optional i32 antigamingPenalty // set to -1 to disable antigaming filtering - // weight of penalty counts from all tweets containing a facet, not just the tweets - // matching the query - 9: optional double queryIndependentPenaltyWeight // set to 0 to not use query independent penalty weights - // penalty for keyword stuffing - 60: optional i32 multipleHashtagsOrTrendsPenalty - - // Language related boosts, similar to those in relevance ranking options. By default they are - // all 1.0 (no-boost). - // When the user language is english, facet language is not - 11: optional double langEnglishUIBoost = 1.0 - // When the facet language is english, user language is not - 12: optional double langEnglishFacetBoost = 1.0 - // When the user language differs from facet/tweet language, and neither is english - 13: optional double langDefaultBoost = 1.0 - - // ====================================================================== - // BLENDER SETTINGS - // - // Settings for the facet relevance scoring happening in blender - // ====================================================================== - - // This block of parameters are only used in the FacetsFutureManager. - // limits to discard facets - // if a facet has a higher penalty count, it will not be returned - 5: optional i32 maxPenaltyCount - // if a facet has a lower simple count, it will not be returned - 28: optional i32 minSimpleCount - // if a facet has a lower weighted count, it will not be returned - 8: optional i32 minCount - // the maximum allowed value for offensiveCount/facetCount a facet can have in order to be returned - 10: optional double maxPenaltyCountRatio - // if set to true, then facets with offensive display tweets are excluded from the resultset - 29: optional bool excludePossiblySensitiveFacets - // if set to true, then only facets that have a display tweet in their ThriftFacetCountMetadata object - // will be returned to the caller - 30: optional bool onlyReturnFacetsWithDisplayTweet - - // parameters for scoring force-inserted media items - // Please check FacetReRanker.java computeScoreForInserted() for their usage. 
- 38: optional double forceInsertedBackgroundExp = 0.3 - 39: optional double forceInsertedMinBackgroundCount = 2 - 40: optional double forceInsertedMultiplier = 0.01 - - // ----------------------------------------------------- - // weights for the facet ranking formula - 18: optional double simpleCountWeight_DEPRECATED - 19: optional double weightedCountWeight_DEPRECATED - 20: optional double backgroundModelBoost_DEPRECATED - - // ----------------------------------------------------- - // Following parameters are used in the FacetsReRanker - // age decay params - 14: optional ThriftAgeDecayRankingParams ageDecayParams - - // used in the facets reranker - 23: optional double maxNormBoost = 5.0 - 24: optional double globalCountExponent = 3.0 - 25: optional double simpleCountExponent = 3.0 - - 31: optional ThriftFacetFinalSortOrder finalSortOrder - - // Run facets search as if they happen at this specific time (ms since epoch). - 32: optional i64 fakeCurrentTimeMs // not really used anywhere, remove? -}(persisted='true') diff --git a/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift b/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift deleted file mode 100644 index 0d4547264..000000000 --- a/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift +++ /dev/null @@ -1,1416 +0,0 @@ -namespace java com.twitter.search.earlybird.thrift -#@namespace scala com.twitter.search.earlybird.thriftscala -#@namespace strato com.twitter.search.earlybird -namespace py gen.twitter.search.earlybird - -include "com/twitter/ads/adserver/adserver_common.thrift" -include "com/twitter/search/common/caching/caching.thrift" -include "com/twitter/search/common/constants/query.thrift" -include "com/twitter/search/common/constants/search_language.thrift" -include "com/twitter/search/common/conversation/conversation.thrift" -include "com/twitter/search/common/features/features.thrift" -include "com/twitter/search/common/indexing/status.thrift" -include "com/twitter/search/common/query/search.thrift" -include "com/twitter/search/common/ranking/ranking.thrift" -include "com/twitter/search/common/results/expansions.thrift" -include "com/twitter/search/common/results/highlight.thrift" -include "com/twitter/search/common/results/hit_attribution.thrift" -include "com/twitter/search/common/results/hits.thrift" -include "com/twitter/search/common/results/social.thrift" -include "com/twitter/service/spiderduck/gen/metadata_store.thrift" -include "com/twitter/tweetypie/deprecated.thrift" -include "com/twitter/tweetypie/tweet.thrift" -include "com/twitter/escherbird/tweet_annotation.thrift" - -enum ThriftSearchRankingMode { - // good old realtime search mode - RECENCY = 0, - // new super fancy relevance ranking - RELEVANCE = 1, - DEPRECATED_DISCOVERY = 2, - // top tweets ranking mode - TOPTWEETS = 3, - // results from accounts followed by the searcher - FOLLOWS = 4, - - PLACE_HOLDER5 = 5, - PLACE_HOLDER6 = 6, -} - -enum ThriftSearchResultType { - // it's a time-ordered result. - RECENCY = 0, - // it's a highly relevant tweet (aka top tweet). - RELEVANCE = 1, - // top tweet result type - POPULAR = 2, - // promoted tweets (ads) - PROMOTED = 3, - // relevance-ordered (as opposed to time-ordered) tweets generated from a variety of candidates - RELEVANCE_ORDERED = 4, - - PLACE_HOLDER5 = 5, - PLACE_HOLDER6 = 6, -} - -enum ThriftSocialFilterType { - // filter only users that the searcher is directly following. - FOLLOWS = 0, - // filter only users that are in searcher's social circle of trust. 
- TRUSTED = 1, - // filter both follows and trusted. - ALL = 2, - - PLACE_HOLDER3 = 3, - PLACE_HOLDER4 = 4, - -} - -enum ThriftTweetSource { - ///// enums set by Earlybird - REALTIME_CLUSTER = 1, - FULL_ARCHIVE_CLUSTER = 2, - REALTIME_PROTECTED_CLUSTER = 4, - - ///// enums set inside Blender - ADSERVER = 0, - // from top news search, only used in universal search - TOP_NEWS = 3, - // special tweets included just for EventParrot. - FORCE_INCLUDED = 5, - // from Content Recommender - // from topic to Tweet path - CONTENT_RECS_TOPIC_TO_TWEET = 6, - // used for hydrating QIG Tweets (go/qig) - QIG = 8, - // used for TOPTWEETS ranking mode - TOP_TWEET = 9, - // used for experimental candidate sources - EXPERIMENTAL = 7, - // from Scanr service - SCANR = 10, - - PLACE_HOLDER11 = 11, - PLACE_HOLDER12 = 12 -} - -enum NamedEntitySource { - TEXT = 0, - URL = 1, - - PLACE_HOLDER2 = 2, - PLACE_HOLDER3 = 3, - PLACE_HOLDER4 = 4, -} - -enum ExperimentCluster { - EXP0 = 0, // Send requests to the earlybird-realtime-exp0 cluster - PLACE_HOLDER1 = 1, - PLACE_HOLDER2 = 2, -} - -enum AudioSpaceState { - RUNNING = 0, - ENDED = 1, - - PLACE_HOLDER2 = 2, - PLACE_HOLDER3 = 3, - PLACE_HOLDER4 = 4, - PLACE_HOLDER5 = 5, -} - -// Contains all scoring and relevance-filtering related controls and options for Earlybird. -struct ThriftSearchRelevanceOptions { - // Next available field ID: 31 and note that 45 and 50 have been used already - - 2: optional bool filterDups = 0 // filter out duplicate search results - 26: optional bool keepDupWithHigherScore = 1 // keep the duplicate tweet with the higher score - - 3: optional bool proximityScoring = 0 // whether to do proximity scoring or not - 4: optional i32 maxConsecutiveSameUser // filter consecutive results from the same user - 5: optional ranking.ThriftRankingParams rankingParams // composed by blender - // deprecated in favor of the maxHitsToProcess in CollectorParams - 6: optional i32 maxHitsToProcess // when to early-terminate for relevance - 7: optional string experimentName // what relevance experiment is running - 8: optional string experimentBucket // what bucket the user is in; DDG defaults to hard-coded 'control' - 9: optional bool interpretSinceId = 1 // whether to interpret since_id operator - - 24: optional i32 maxHitsPerUser // Overrides ThriftSearchQuery.maxHitsPerUser - - // only used by discovery for capping direct follow tweets - 10: optional i32 maxConsecutiveDirectFollows - - // Note - the orderByRelevance flag is critical to understanding how merging - // and trimming works in relevance mode in the search root. - // - // When orderByRelevance is true, results are trimmed in score-order. This means the - // client will get the top results from (maxHitsToProcess * numHashPartitions) hits, - // ordered by score. - // - // When orderByRelevance is false, results are trimmed in id-order. This means the - // client will get the top results from an approximation of maxHitsToProcess hits - // (across the entire corpus). These results ordered by ID. - 14: optional bool orderByRelevance = 0 - - // Max blending count for results returned due to from:user rewrites - 16: optional i32 maxUserBlendCount - - // The weight for proximity phrases generated while translating the serialized query to the - // lucene query. - 19: optional double proximityPhraseWeight = 1.0 - 20: optional i32 proximityPhraseSlop = 255 - - // Override the weights of searchable fields. 
- // Negative weight means the the field is not enabled for search by default, - // but if it is (e.g., by annotation), the absolute value of the weight shall be - // used (if the annotation does not specify a weight). - 21: optional map fieldWeightMapOverride - - // whether disable the coordination in the rewritten disjunction query, term query and phrase query - // the details can be found in LuceneVisitor - 22: optional bool deprecated_disableCoord = 0 - - // Root only. Returns all results seen by root to the client without trimming - // if set to true. - 23: optional bool returnAllResults - - // DEPRECATED: All v2 counters will be used explicitly in the scoring function and - // returned in their own field (in either metadata or feature map in response). - 25: optional bool useEngagementCountersV2 = 0 - - // -------- PERSONALIZATION-RELATED RELEVANCE OPTIONS -------- - // Take special care with these options when reasoning about caching. - - // Deprecated in SEARCH-8616. - 45: optional map deprecated_topicIDWeights - - // Collect hit attribution on queries and likedByUserIDFilter64-enhanced queries to - // get likedByUserIds list in metadata field. - // NOTE: this flag has no affect on fromUserIDFilter64. - 50: optional bool collectFieldHitAttributions = 0 - - // Whether to collect all hits regardless of their score with RelevanceAllCollector. - 27: optional bool useRelevanceAllCollector = 0 - - // Override features of specific tweets before the tweets are scored. - 28: optional map perTweetFeaturesOverride - - // Override features of all tweets from specific users before the tweets are scored. - 29: optional map perUserFeaturesOverride - - // Override features of all tweets before the tweets are scored. - 30: optional features.ThriftSearchResultFeatures globalFeaturesOverride -}(persisted='true') - -// Facets types that may have different ranking parameters. -enum ThriftFacetType { - DEFAULT = 0, - MENTIONS_FACET = 1, - HASHTAGS_FACET = 2, - // Deprecated in SEARCH-13708 - DEPRECATED_NAMED_ENTITIES_FACET = 3, - STOCKS_FACET = 4, - VIDEOS_FACET = 5, - IMAGES_FACET = 6, - NEWS_FACET = 7, - LANGUAGES_FACET = 8, - SOURCES_FACET = 9, - TWIMG_FACET = 10, - FROM_USER_ID_FACET = 11, - DEPRECATED_TOPIC_IDS_FACET = 12, - RETWEETS_FACET = 13, - LINKS_FACET = 14, - - PLACE_HOLDER15 = 15, - PLACE_HOLDER16 = 16, -} - -struct ThriftSearchDebugOptions { - // Make earlybird only score and return tweets (specified by tweet id) here, regardless - // if they have a hit for the current query or not. - 1: optional set statusIds; - - // Assorted structures to pass in debug options. - 2: optional map stringMap; - 3: optional map valueMap; - 4: optional list valueList; -}(persisted='true') - -// These options control what metadata will be returned by earlybird for each search result -// in the ThriftSearchResultMetadata struct. These options are currently mostly supported by -// AbstractRelevanceCollector and partially in SearchResultsCollector. Most are true by default to -// preserve backwards compatibility, but can be disabled as necessary to optimize searches returning -// many results (such as discover). -struct ThriftSearchResultMetadataOptions { - // If true, fills in the tweetUrls field in ThriftSearchResultMetadata. - // Populated by AbstractRelevanceCollector. - 1: optional bool getTweetUrls = 1 - - // If true, fills in the resultLocation field in ThriftSearchResultMetadata. - // Populated by AbstractRelevanceCollector. - 2: optional bool getResultLocation = 1 - - // Deprecated in SEARCH-8616. 
- 3: optional bool deprecated_getTopicIDs = 1 - - // If true, fills in the luceneScore field in ThriftSearchResultMetadata. - // Populated by LinearScoringFunction. - 4: optional bool getLuceneScore = 0 - - // Deprecated but used to be for Offline feature values for static index - 5: optional bool deprecated_getExpFeatureValues = 0 - - // If true, will omit all features derivable from packedFeatures, and set packedFeatures - // instead. - 6: optional bool deprecated_usePackedFeatures = 0 - - // If true, fills sharedStatusId. For replies this is the in-reply-to status id and for - // retweets this is the retweet source status id. - // Also fills in the the isRetweet and isReply flags. - 7: optional bool getInReplyToStatusId = 0 - - // If true, fills referencedTweetAuthorId. Also fills in the the isRetweet and isReply flags. - 8: optional bool getReferencedTweetAuthorId = 0 - - // If true, fills media bits (video/vine/periscope/etc.) - 9: optional bool getMediaBits = 0 - - // If true, will return all defined features in the packed features. This flag does not cover - // the above defined features. - 10: optional bool getAllFeatures = 0 - - // If true, will return all features as ThriftSearchResultFeatures format. - 11: optional bool returnSearchResultFeatures = 0 - - // If the client caches some features schemas, client can indicate its cache schemas through - // this field based on (version, checksum). - 12: optional list featureSchemasAvailableInClient - - // Specific feature IDs to return for recency requests. Populated in SearchResultFeatures. - // Values must be IDs of CSF fields from EarlybirdFieldConstants. - 13: optional list requestedFeatureIDs - - // If true, fills in the namedEntities field in ThriftSearchResultExtraMetadata - 14: optional bool getNamedEntities = 0 - - // If true, fills in the entityAnnotations field in ThriftSearchResultExtraMetadata - 15: optional bool getEntityAnnotations = 0 - - // If true, fills in the fromUserId field in the ThriftSearchResultExtraMetadata - 16: optional bool getFromUserId = 0 - - // If true, fills in the spaces field in the ThriftSearchResultExtraMetadata - 17: optional bool getSpaces = 0 - - 18: optional bool getExclusiveConversationAuthorId = 0 -}(persisted='true') - - -// ThriftSearchQuery describes an earlybird search request, which typically consists -// of these parts: -// - a query to retrieve hits -// - relevance options to score hits -// - a collector to collect hits and process into search results -// Note that this struct is used in both ThriftBlenderRequest and EarlybirdRequest. -// Most fields are not set when this struct is embedded in ThriftBlenderRequest, and -// are filled in by the blender before sending to earlybird. -struct ThriftSearchQuery { - // Next available field ID: 42 - - // -------- SECTION ZERO: THINGS USED ONLY BY THE BLENDER -------- - // See SEARCHQUAL-2398 - // These fields are used by the blender and clients of the blender, but not by earlybird. - - // blender use only - // The raw un-parsed user search query. - 6: optional string rawQuery(personalDataType = 'SearchQuery') - - // blender use only - // Language of the rawQuery. - 18: optional string queryLang(personalDataType = 'InferredLanguage') - - // blender use only - // What page of results to return, indexed from 1. - 7: optional i32 page = 1 - - // blender use only - // Number of results to skip (for pagination). Indexed from 0. 
- 2: optional i32 deprecated_resultOffset = 0 - - - // -------- SECTION ONE: RETRIEVAL OPTIONS -------- - // These options control the query that will be used to retrieve documents / hits. - - // The parsed query tree, serialized to a string. Restricts the search results to - // tweets matching this query. - 1: optional string serializedQuery(personalDataType = 'SearchQuery') - - // Restricts the search results to tweets having this minimum tweep cred, out of 100. - 5: optional i32 minTweepCredFilter = -1 - - // Restricts the search results to tweets from these users. - 34: optional list fromUserIDFilter64(personalDataType = 'PrivateAccountsFollowing, PublicAccountsFollowing') - // Restricts the search results to tweets liked by these users. - 40: optional list likedByUserIDFilter64(personalDataType = 'PrivateAccountsFollowing, PublicAccountsFollowing') - - // If searchStatusIds are present, earlybird will ignore the serializedQuery completely - // and simply score each of searchStatusIds, also bypassing features like duplicate - // filtering and early termination. - // IMPORTANT: this means that it is possible to get scores equal to ScoringFunction.SKIP_HIT, - // for results skipped by the scoring function. - 31: optional set searchStatusIds - - 35: optional set deprecated_eventClusterIdsFilter - - 41: optional map> namedDisjunctionMap - - // -------- SECTION TWO: HIT COLLECTOR OPTIONS -------- - // These options control what hits will be collected by the hit collector. - // Whether we want to collect and return per-field hit attributions is set in RelevanceOptions. - // See SEARCH-2784 - // Number of results to return (after offset/page correction). - // This is ignored when searchStatusIds is set. - 3: required i32 numResults - - // Maximum number of hits to process by the collector. - // deprecated in favor of the maxHitsToProcess in CollectorParams - 4: optional i32 maxHitsToProcess = 1000 - - // Collect hit counts for these time periods (in milliseconds). - 30: optional list hitCountBuckets - - // If set, earlybird will also return the facet labels of the specified facet fields - // in result tweets. - 33: optional list facetFieldNames - - // Options controlling which search result metadata is returned. - 36: optional ThriftSearchResultMetadataOptions resultMetadataOptions - - // Collection related Params - 38: optional search.CollectorParams collectorParams - - // Whether to collect conversation IDs - 39: optional bool collectConversationId = 0 - - // -------- SECTION THREE: RELEVANCE OPTIONS -------- - // These options control relevance scoring and anti-gaming. - - // Ranking mode (RECENCY means time-ordered ranking with no relevance). - 8: optional ThriftSearchRankingMode rankingMode = ThriftSearchRankingMode.RECENCY - - // Relevance scoring options. - 9: optional ThriftSearchRelevanceOptions relevanceOptions - - // Limits the number of hits that can be contributed by the same user, for anti-gaming. - // Set to -1 to disable the anti-gaming filter. This is ignored when searchStatusIds - // is set. - 11: optional i32 maxHitsPerUser = 3 - - // Disables anti-gaming filter checks for any tweets that exceed this tweepcred. - 12: optional i32 maxTweepcredForAntiGaming = 65 - - // -------- PERSONALIZATION-RELATED RELEVANCE OPTIONS -------- - // Take special care with these options when reasoning about caching. All of these - // options, if set, will bypass the cache with the exception of uiLang which is the - // only form of personalization allowed for caching. 
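As a hedged illustration of how the retrieval, collector, and relevance sections of ThriftSearchQuery fit together, the sketch below builds a minimal time-ordered request. It assumes the standard Apache Thrift Java code generation for the definitions above (package com.twitter.search.earlybird.thrift); the concrete values and the choice to disable URL and location metadata are illustrative, not taken from the deleted sources.

```java
import com.twitter.search.earlybird.thrift.ThriftSearchQuery;
import com.twitter.search.earlybird.thrift.ThriftSearchRankingMode;
import com.twitter.search.earlybird.thrift.ThriftSearchResultMetadataOptions;

// Hypothetical usage sketch, assuming standard Apache Thrift Java codegen
// for the IDL above; not code from the deleted files.
final class RecencyQueryExample {
  static ThriftSearchQuery buildRecencyQuery(String serializedQuery) {
    // Trim the per-result metadata for a lightweight, time-ordered response.
    ThriftSearchResultMetadataOptions metadataOptions = new ThriftSearchResultMetadataOptions();
    metadataOptions.setGetTweetUrls(false);      // skip tweetUrls in ThriftSearchResultMetadata
    metadataOptions.setGetResultLocation(false); // skip resultLocation as well

    ThriftSearchQuery query = new ThriftSearchQuery();
    query.setSerializedQuery(serializedQuery);             // parsed query tree, serialized to a string
    query.setNumResults(20);                                // the struct's only required field
    query.setRankingMode(ThriftSearchRankingMode.RECENCY);  // time-ordered, no relevance scoring
    query.setMaxHitsPerUser(3);                             // keep the default anti-gaming limit explicit
    query.setResultMetadataOptions(metadataOptions);
    return query;
  }
}
```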
- - // User ID of searcher. This is used for relevance, and will be used for retrieval - // by the protected tweets index. If set, query will not be cached. - 20: optional i64 searcherId(personalDataType = 'UserId') - - // Bloom filter containing trusted user IDs. If set, query will not be cached. - 10: optional binary trustedFilter(personalDataType = 'UserId') - - // Bloom filter containing direct follow user IDs. If set, query will not be cached. - 16: optional binary directFollowFilter(personalDataType = 'UserId, PrivateAccountsFollowing, PublicAccountsFollowing') - - // UI language from the searcher's profile settings. - 14: optional string uiLang(personalDataType = 'GeneralSettings') - - // Confidence of the understandability of different languages for this user. - // uiLang field above is treated as a userlang with a confidence of 1.0. - 28: optional map userLangs(personalDataTypeKey = 'InferredLanguage') - - // An alternative to fromUserIDFilter64 that relies on the relevance bloom filters - // for user filtering. Not currently used in production. Only supported for realtime - // searches. - // If set, earlybird expects both trustedFilter and directFollowFilter to also be set. - 17: optional ThriftSocialFilterType socialFilterType - - // -------- SECTION FOUR: DEBUG OPTIONS, FORGOTTEN FEATURES -------- - - // Earlybird search debug options. - 19: optional ThriftSearchDebugOptions debugOptions - - // Overrides the query time for debugging. - 29: optional i64 timestampMsecs = 0 - - // Support for this feature has been removed and this field is left for backwards compatibility - // (and to detect improper usage by clients when it is set). - 25: optional list deprecated_iterativeQueries - - // Specifies a lucene query that will only be used if serializedQuery is not set, - // for debugging. Not currently used in production. - 27: optional string luceneQuery(personalDataType = 'SearchQuery') - - // This field is deprecated and is not used by earlybirds when processing the query. - 21: optional i32 deprecated_minDocsToProcess = 0 -}(persisted='true', hasPersonalData = 'true') - - -struct ThriftFacetLabel { - 1: required string fieldName - 2: required string label - // the number of times this facet has shown up in tweets with offensive words. - 3: optional i32 offensiveCount = 0 - - // only filled for TWIMG facets - 4: optional string nativePhotoUrl -}(persisted='true') - -struct ThriftSearchResultGeoLocation { - 1: optional double latitude(personalDataType = 'GpsCoordinates') - 2: optional double longitude(personalDataType = 'GpsCoordinates') - 3: optional double distanceKm -}(persisted='true', hasPersonalData = 'true') - -// Contains an expanded url and media type from the URL facet fields in earlybird. -// Note: thrift copied from status.thrift with unused fields renamed. -struct ThriftSearchResultUrl { - // Next available field ID: 6. Fields 2-4 removed. - - // Note: this is actually the expanded url. Rename after deprecated fields are removed. - 1: required string originalUrl - - // Media type of the url. 
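The personalization-related options in ThriftSearchQuery above (searcherId, trustedFilter, directFollowFilter, socialFilterType, userLangs) are documented as bypassing the cache, with uiLang called out as the only personalization field that remains cacheable. The hypothetical helper below simply encodes that documented rule for a caller deciding whether a cached response may be reused; the parameter types are assumptions standing in for the Thrift fields (for example, userLangs is treated here as a language-code-to-confidence map).

```java
import java.util.Map;

// Hypothetical helper, not part of the deleted earlybird.thrift sources.
// Encodes the rule documented above: any personalization field other than
// uiLang makes a ThriftSearchQuery ineligible for the shared cache.
final class QueryCacheabilityCheck {
  static boolean isCacheablePersonalization(
      Long searcherId,                 // ThriftSearchQuery.searcherId, null when unset
      byte[] trustedFilter,            // ThriftSearchQuery.trustedFilter bloom filter
      byte[] directFollowFilter,       // ThriftSearchQuery.directFollowFilter bloom filter
      boolean hasSocialFilterType,     // whether ThriftSearchQuery.socialFilterType is set
      Map<String, Double> userLangs) { // ThriftSearchQuery.userLangs confidence map (key type assumed)
    return searcherId == null
        && trustedFilter == null
        && directFollowFilter == null
        && !hasSocialFilterType
        && (userLangs == null || userLangs.isEmpty());
    // uiLang is intentionally not checked: per the comments above it is the
    // one personalization field that is still compatible with caching.
  }
}
```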
- 5: optional metadata_store.MediaTypes mediaType -}(persisted='true') - -struct ThriftSearchResultNamedEntity { - 1: required string canonicalName - 2: required string entityType - 3: required NamedEntitySource source -}(persisted='true') - -struct ThriftSearchResultAudioSpace { - 1: required string id - 2: required AudioSpaceState state -}(persisted='true') - -// Even more metadata -struct ThriftSearchResultExtraMetadata { - // Next available field ID: 49 - - 1: optional double userLangScore - 2: optional bool hasDifferentLang - 3: optional bool hasEnglishTweetAndDifferentUILang - 4: optional bool hasEnglishUIAndDifferentTweetLang - 5: optional i32 quotedCount - 6: optional double querySpecificScore - 7: optional bool hasQuote - 29: optional i64 quotedTweetId - 30: optional i64 quotedUserId - 31: optional search_language.ThriftLanguage cardLang - 8: optional i64 conversationId - 9: optional bool isSensitiveContent - 10: optional bool hasMultipleMediaFlag - 11: optional bool profileIsEggFlag - 12: optional bool isUserNewFlag - 26: optional double authorSpecificScore - 28: optional bool isComposerSourceCamera - - // temporary V2 engagement counters, original ones in ThriftSearchResultMetadata has log() - // applied on them and then converted to int in Thrift, which is effectively a premature - // discretization. It doesn't affect the scoring inside Earlybird but for scoring and ML training - // outside earlybird, they were bad. These newly added ones stores a proper value of these - // counts. This also provides an easier transition to v2 counter when Earlybird is eventually - // ready to consume them from DL - // See SEARCHQUAL-9536, SEARCH-11181 - 18: optional i32 retweetCountV2 - 19: optional i32 favCountV2 - 20: optional i32 replyCountV2 - // Tweepcred weighted version of various engagement counts - 22: optional i32 weightedRetweetCount - 23: optional i32 weightedReplyCount - 24: optional i32 weightedFavCount - 25: optional i32 weightedQuoteCount - - // 2 bits - 0, 1, 2, 3+ - 13: optional i32 numMentions - 14: optional i32 numHashtags - - // 1 byte - 256 possible languages - 15: optional i32 linkLanguage - // 6 bits - 64 possible values - 16: optional i32 prevUserTweetEngagement - - 17: optional features.ThriftSearchResultFeatures features - - // If the ThriftSearchQuery.likedByUserIdFilter64 and ThriftSearchRelevanceOptions.collectFieldHitAttributions - // fields are set, then this field will contain the list of all users in the query that liked this tweet. - // Otherwise, this field is not set. - 27: optional list likedByUserIds - - - // Deprecated. See SEARCHQUAL-10321 - 21: optional double dopamineNonPersonalizedScore - - 32: optional list namedEntities - 33: optional list entityAnnotations - - // Health model scores from HML - 34: optional double toxicityScore // (go/toxicity) - 35: optional double pBlockScore // (go/pblock) - 36: optional double experimentalHealthModelScore1 - 37: optional double experimentalHealthModelScore2 - 38: optional double experimentalHealthModelScore3 - 39: optional double experimentalHealthModelScore4 - - 40: optional i64 directedAtUserId - - // Health model scores from HML (cont.) 
- 41: optional double pSpammyTweetScore // (go/pspammytweet) - 42: optional double pReportedTweetScore // (go/preportedtweet) - 43: optional double spammyTweetContentScore // (go/spammy-tweet-content) - // it is populated by looking up user table and it is only available in archive earlybirds response - 44: optional bool isUserProtected - 45: optional list spaces - - 46: optional i64 exclusiveConversationAuthorId - 47: optional string cardUri - 48: optional bool fromBlueVerifiedAccount(personalDataType = 'UserVerifiedFlag') -}(persisted='true') - -// Some basic metadata about a search result. Useful for re-sorting, filtering, etc. -// -// NOTE: DO NOT ADD NEW FIELD!! -// Stop adding new fields to this struct, all new fields should go to -// ThriftSearchResultExtraMetadata (VM-1897), or there will be performance issues in production. -struct ThriftSearchResultMetadata { - // Next available field ID: 86 - - // -------- BASIC SCORING METADATA -------- - - // When resultType is RECENCY most scoring metadata will not be available. - 1: required ThriftSearchResultType resultType - - // Relevance score computed for this result. - 3: optional double score - - // True if the result was skipped by the scoring function. Only set when the collect-all - // results collector was used - in other cases skipped results are not returned. - // The score will be ScoringFunction.SKIP_HIT when skipped is true. - 43: optional bool skipped - - // optionally a Lucene-style explanation for this result - 5: optional string explanation - - - // -------- NETWORK-BASED SCORING METADATA -------- - - // Found the tweet in the trusted circle. - 6: optional bool isTrusted - - // Found the tweet in the direct follows. - 8: optional bool isFollow - - // True if the fromUserId of this tweet was whitelisted by the dup / antigaming filter. - // This typically indicates the result was from a tweet that matched a fromUserId query. - 9: optional bool dontFilterUser - - - // -------- COMMON DOCUMENT METADATA -------- - - // User ID of the author. When isRetweet is true, this is the user ID of the retweeter - // and NOT that of the original tweet. - 7: optional i64 fromUserId = 0 - - // When isRetweet (or packed features equivalent) is true, this is the status id of the - // original tweet. When isReply and getReplySource are true, this is the status id of the - // original tweet. In all other circumstances this is 0. - 40: optional i64 sharedStatusId = 0 - - // When hasCard (or packed features equivalent) is true, this is one of SearchCardType. - 49: optional i8 cardType = 0 - - // -------- EXTENDED DOCUMENT METADATA -------- - // This is additional metadata from facet fields and column stride fields. - // Return of these fields is controlled by ThriftSearchResultMetadataOptions to - // allow for fine-grained control over when these fields are returned, as an - // optimization for searches returning a large quantity of results. - - // Lucene component of the relevance score. Only returned when - // ThriftSearchResultMetadataOptions.getLuceneScore is true. - 31: optional double luceneScore = 0.0 - - // Urls found in the tweet. Only returned when - // ThriftSearchResultMetadataOptions.getTweetUrls is true. - 18: optional list tweetUrls - - // Deprecated in SEARCH-8616. - 36: optional list deprecated_topicIDs - - // Facets available in this tweet, this will only be filled if - // ThriftSearchQuery.facetFieldNames is set in the request. 
- 22: optional list facetLabels - - // The location of the result, and the distance to it from the center of the query - // location. Only returned when ThriftSearchResultMetadataOptions.getResultLocation is true. - 35: optional ThriftSearchResultGeoLocation resultLocation - - // Per field hit attribution. - 55: optional hit_attribution.FieldHitAttribution fieldHitAttribution - - // whether this has geolocation_type:geotag hit - 57: optional bool geotagHit = 0 - - // the user id of the author of the source/referenced tweet (the tweet one replied - // to, retweeted and possibly quoted, etc.) (SEARCH-8561) - // Only returned when ThriftSearchResultMetadataOptions.getReferencedTweetAuthorId is true. - 60: optional i64 referencedTweetAuthorId = 0 - - // Whether this tweet has certain types of media. - // Only returned when ThriftSearchResultMetadataOptions.getMediaBits is true. - // "Native video" is either consumer, pro, vine, or periscope. - // "Native image" is an image hosted on pic.twitter.com. - 62: optional bool hasConsumerVideo - 63: optional bool hasProVideo - 64: optional bool hasVine - 65: optional bool hasPeriscope - 66: optional bool hasNativeVideo - 67: optional bool hasNativeImage - - // Packed features for this result. This field is never populated. - 50: optional status.PackedFeatures deprecated_packedFeatures - - // The features stored in earlybird - - // From integer 0 from EarlybirdFeatureConfiguration: - 16: optional bool isRetweet - 71: optional bool isSelfTweet - 10: optional bool isOffensive - 11: optional bool hasLink - 12: optional bool hasTrend - 13: optional bool isReply - 14: optional bool hasMultipleHashtagsOrTrends - 23: optional bool fromVerifiedAccount - // Static text quality score. This is actually an int between 0 and 100. - 30: optional double textScore - 51: optional search_language.ThriftLanguage language - - // From integer 1 from EarlybirdFeatureConfiguration: - 52: optional bool hasImage - 53: optional bool hasVideo - 28: optional bool hasNews - 48: optional bool hasCard - 61: optional bool hasVisibleLink - // Tweep cred aka user rep. This is actually an int between 0 and 100. - 32: optional double userRep - 24: optional bool isUserSpam - 25: optional bool isUserNSFW - 26: optional bool isUserBot - 54: optional bool isUserAntiSocial - - // From integer 2 from EarlybirdFeatureConfiguration: - - // Retweet, fav, reply, embeds counts, and video view counts are APPROXIMATE ONLY. - // Note that retweetCount, favCount and replyCount are not original unnormalized values, - // but after a log2() function for historical reason, this loses us some granularity. - // For more accurate counts, use {retweet, fav, reply}CountV2 in extraMetadata. - 2: optional i32 retweetCount - 33: optional i32 favCount - 34: optional i32 replyCount - 58: optional i32 embedsImpressionCount - 59: optional i32 embedsUrlCount - 68: optional i32 videoViewCount - - // Parus score. This is actually an int between 0 and 100. - 29: optional double parusScore - - // Extra feature data, all new feature fields you want to return from Earlybird should go into - // this one, the outer one is always reaching its limit of the number of fields JVM can - // comfortably support!! - 86: optional ThriftSearchResultExtraMetadata extraMetadata - - // Integer 3 is omitted, see expFeatureValues above for more details. - - // From integer 4 from EarlybirdFeatureConfiguration: - // Signature, for duplicate detection and removal. 
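The comment above notes that the legacy retweetCount/favCount/replyCount fields are log2-compressed, while the V2 counters in extraMetadata carry the raw values. Below is a minimal Java sketch of how a consumer might prefer the V2 counter and fall back to a rough inverse of the compressed value; the class, method name, and the exact inverse formula are illustrative assumptions, not part of this schema or the generated Thrift code.

```java
/**
 * Illustrative only: models the optional Thrift fields as nullable Integers rather than
 * using generated accessors, and assumes the legacy value is roughly log2 of the raw count.
 */
public final class EngagementCounts {

  private EngagementCounts() {}

  /** Prefer the exact V2 counter from extraMetadata; otherwise approximate from the legacy field. */
  public static long bestEffortCount(Integer legacyLog2Count, Integer countV2) {
    if (countV2 != null) {
      return countV2;                          // exact raw count
    }
    if (legacyLog2Count == null || legacyLog2Count < 0) {
      return 0L;
    }
    // Inverting the assumed log2 compression only recovers an order-of-magnitude estimate.
    return (long) Math.pow(2.0, legacyLog2Count);
  }
}
```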
- 4: optional i32 signature - - // -------- THINGS USED ONLY BY THE BLENDER -------- - - // Social proof of the tweet, for network discovery. - // Do not use these fields outside of network discovery. - 41: optional list retweetedUserIDs64 - 42: optional list replyUserIDs64 - - // Social connection between the search user and this result. - 19: optional social.ThriftSocialContext socialContext - - // used by RelevanceTimelineSearchWorkflow, whether a tweet should be highlighted or not - 46: optional bool highlightResult - - // used by RelevanceTimelineSearchWorkflow, the highlight context of the highlighted tweet - 47: optional highlight.ThriftHighlightContext highlightContext - - // the penguin version used to tokenize the tweets by the serving earlybird index as defined - // in com.twitter.common.text.version.PenguinVersion - 56: optional i8 penguinVersion - - 69: optional bool isNullcast - - // This is the normalized ratio(0.00 to 1.00) of nth token(starting before 140) divided by - // numTokens and then normalized into 16 positions(4 bits) but on a scale of 0 to 100% as - // we unnormalize it for you - 70: optional double tokenAt140DividedByNumTokensBucket - -}(persisted='true') - -// Query level result stats. -// Next id: 20 -struct ThriftSearchResultsRelevanceStats { - 1: optional i32 numScored = 0 - // Skipped documents count, they were also scored but their scores got ignored (skipped), note that this is different - // from numResultsSkipped in the ThriftSearchResults. - 2: optional i32 numSkipped = 0 - 3: optional i32 numSkippedForAntiGaming = 0 - 4: optional i32 numSkippedForLowReputation = 0 - 5: optional i32 numSkippedForLowTextScore = 0 - 6: optional i32 numSkippedForSocialFilter = 0 - 7: optional i32 numSkippedForLowFinalScore = 0 - 8: optional i32 oldestScoredTweetAgeInSeconds = 0 - - // More counters for various features. - 9: optional i32 numFromDirectFollows = 0 - 10: optional i32 numFromTrustedCircle = 0 - 11: optional i32 numReplies = 0 - 12: optional i32 numRepliesTrusted = 0 - 13: optional i32 numRepliesOutOfNetwork = 0 - 14: optional i32 numSelfTweets = 0 - 15: optional i32 numWithMedia = 0 - 16: optional i32 numWithNews = 0 - 17: optional i32 numSpamUser = 0 - 18: optional i32 numOffensive = 0 - 19: optional i32 numBot = 0 -}(persisted='true') - -// Per result debug info. -struct ThriftSearchResultDebugInfo { - 1: optional string hostname - 2: optional string clusterName - 3: optional i32 partitionId - 4: optional string tiername -}(persisted='true') - -struct ThriftSearchResult { - // Next available field ID: 22 - - // Result status id. - 1: required i64 id - - // TweetyPie status of the search result - 7: optional deprecated.Status tweetypieStatus - 19: optional tweet.Tweet tweetypieTweet // v2 struct - - // If the search result is a retweet, this field contains the source TweetyPie status. - 10: optional deprecated.Status sourceTweetypieStatus - 20: optional tweet.Tweet sourceTweetypieTweet // v2 struct - - // If the search result is a quote tweet, this field contains the quoted TweetyPie status. - 17: optional deprecated.Status quotedTweetypieStatus - 21: optional tweet.Tweet quotedTweetypieTweet // v2 struct - - // Additional metadata about a search result. - 5: optional ThriftSearchResultMetadata metadata - - // Hit highlights for various parts of this tweet - // for tweet text - 6: optional list hitHighlights - // for the title and description in the card expando. 
- 12: optional list cardTitleHitHighlights - 13: optional list cardDescriptionHitHighlights - - // Expansion types, if expandResult == False, the expansions set should be ignored. - 8: optional bool expandResult = 0 - 9: optional set expansions - - // Only set if this is a promoted tweet - 11: optional adserver_common.AdImpression adImpression - - // where this tweet is from - // Since ThriftSearchResult used not only as an Earlybird response, but also an internal - // data transfer object of Blender, the value of this field is mutable in Blender, not - // necessarily reflecting Earlybird response. - 14: optional ThriftTweetSource tweetSource - - // the features of a tweet used for relevance timeline - // this field is populated by blender in RelevanceTimelineSearchWorkflow - 15: optional features.ThriftTweetFeatures tweetFeatures - - // the conversation context of a tweet - 16: optional conversation.ThriftConversationContext conversationContext - - // per-result debugging info that's persisted across merges. - 18: optional ThriftSearchResultDebugInfo debugInfo -}(persisted='true') - -enum ThriftFacetRankingMode { - COUNT = 0, - FILTER_WITH_TERM_STATISTICS = 1, -} - -struct ThriftFacetFieldRequest { - // next available field ID: 4 - 1: required string fieldName - 2: optional i32 numResults = 5 - - // use facetRankingOptions in ThriftFacetRequest instead - 3: optional ThriftFacetRankingMode rankingMode = ThriftFacetRankingMode.COUNT -}(persisted='true') - -struct ThriftFacetRequest { - // Next available field ID: 7 - 1: optional list facetFields - 5: optional ranking.ThriftFacetRankingOptions facetRankingOptions - 6: optional bool usingQueryCache = 0 -}(persisted='true') - -struct ThriftTermRequest { - 1: optional string fieldName = "text" - 2: required string term -}(persisted='true') - -enum ThriftHistogramGranularityType { - MINUTES = 0 - HOURS = 1, - DAYS = 2, - CUSTOM = 3, - - PLACE_HOLDER4 = 4, - PLACE_HOLDER5 = 5, -} - -struct ThriftHistogramSettings { - 1: required ThriftHistogramGranularityType granularity - 2: optional i32 numBins = 60 - 3: optional i32 samplingRate = 1 - 4: optional i32 binSizeInSeconds // the bin size, only used if granularity is set to CUSTOM. -}(persisted='true') - -// next id is 4 -struct ThriftTermStatisticsRequest { - 1: optional list termRequests - 2: optional ThriftHistogramSettings histogramSettings - // If this is set to true, even if there is no termRequests above, so long as the histogramSettings - // is set, Earlybird will return a null->ThriftTermResults entry in the termResults map, containing - // the global tweet count histogram for current query, which is the number of tweets matching this - // query in different minutes/hours/days. - 3: optional bool includeGlobalCounts = 0 - // When this is set, the background facets call does another search in order to find the best - // representative tweet for a given term request, the representative tweet is stored in the - // metadata of the termstats result - 4: optional bool scoreTweetsForRepresentatives = 0 -}(persisted='true') - -// Next id is 12 -struct ThriftFacetCountMetadata { - // this is the id of the first tweet in the index that contained this facet - 1: optional i64 statusId = -1 - - // whether the tweet with the above statusId is NSFW, from an antisocial user, - // marked as sensitive content, etc. 
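ThriftHistogramSettings above only spells out the bin width for CUSTOM granularity. The sketch below shows how a client might resolve the effective bin width, assuming MINUTES/HOURS/DAYS map to the obvious widths (60 s, 3600 s, 86400 s); the standalone enum and helper are illustrative, not generated code. With MINUTES granularity and the default numBins = 60, a histogram would then cover roughly the most recent hour.

```java
/** Illustrative only: a standalone mirror of ThriftHistogramGranularityType plus a resolver. */
enum HistogramGranularity { MINUTES, HOURS, DAYS, CUSTOM }

final class HistogramBinWidth {

  private HistogramBinWidth() {}

  /** Effective bin width in seconds; binSizeInSeconds is only consulted for CUSTOM. */
  static int binSizeSeconds(HistogramGranularity granularity, Integer binSizeInSeconds) {
    switch (granularity) {
      case MINUTES:
        return 60;            // assumed mapping for MINUTES
      case HOURS:
        return 60 * 60;       // assumed mapping for HOURS
      case DAYS:
        return 24 * 60 * 60;  // assumed mapping for DAYS
      case CUSTOM:
        if (binSizeInSeconds == null || binSizeInSeconds <= 0) {
          throw new IllegalArgumentException("CUSTOM granularity requires a positive binSizeInSeconds");
        }
        return binSizeInSeconds;
      default:
        throw new IllegalArgumentException("Unknown granularity: " + granularity);
    }
  }
}
```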
- 10: optional bool statusPossiblySensitive - - // the id of the user who sent the tweet above - only returned if - // statusId is returned too - // NOTE: for native photos we may not be able to determine the user, - // even though the statusId can be returned. This is because the statusId - // can be determined from the url, but the user can't and the tweet may - // not be in the index anymore. In this case statusId would be set but - // twitterUserId would not. - 2: optional i64 twitterUserId = -1 - - // the language of the tweet above. - 8: optional search_language.ThriftLanguage statusLanguage - - // optionally whitelist the fromUserId from dup/twitterUserId filtering - 3: optional bool dontFilterUser = 0; - - // if this facet is a native photo we return for convenience the - // twimg url - 4: optional string nativePhotoUrl - - // optionally returns some debug information about this facet - 5: optional string explanation - - // the created_at value for the tweet from statusId - only returned - // if statusId is returned too - 6: optional i64 created_at - - // the maximum tweepcred of the hits that contained this facet - 7: optional i32 maxTweepCred - - // Whether this facet result is force inserted, instead of organically returned from search. - // This field is only used in Blender to mark the force-inserted facet results - // (from recent tweets, etc). - 11: optional bool forceInserted = 0 -}(persisted='true') - -struct ThriftTermResults { - 1: required i32 totalCount - 2: optional list histogramBins - 3: optional ThriftFacetCountMetadata metadata -}(persisted='true') - -struct ThriftTermStatisticsResults { - 1: required map termResults - 2: optional ThriftHistogramSettings histogramSettings - // If histogramSettings are set, this will have a list of ThriftHistogramSettings.numBins binIds, - // that the corresponding histogramBins in ThriftTermResults will have counts for. - // The binIds will correspond to the times of the hits matching the driving search query for this - // term statistics request. - // If there were no hits matching the search query, numBins binIds will be returned, but the - // values of the binIds will not meaningfully correspond to anything related to the query, and - // should not be used. Such cases can be identified by ThriftSearchResults.numHitsProcessed being - // set to 0 in the response, and the response not being early terminated. - 3: optional list binIds - // If set, this id indicates the id of the minimum (oldest) bin that has been completely searched, - // even if the query was early terminated. If not set no bin was searched fully, or no histogram - // was requested. - // Note that if e.g. a query only matches a bin partially (due to e.g. a since operator) the bin - // is still considered fully searched if the query did not early terminate. - 4: optional i32 minCompleteBinId -}(persisted='true') - -struct ThriftFacetCount { - // the text of the facet - 1: required string facetLabel - - // deprecated; currently matches weightedCount for backwards-compatibility reasons - 2: optional i32 facetCount - - // the simple count of tweets that contained this facet, without any - // weighting applied - 7: optional i32 simpleCount - - // a weighted version of the count, using signals like tweepcred, parus, etc. 
- 8: optional i32 weightedCount - - // the number of times this facet occurred in tweets matching the background query - // using the term statistics API - only set if FILTER_WITH_TERM_STATISTICS was used - 3: optional i32 backgroundCount - - // the relevance score that was computed for this facet if FILTER_WITH_TERM_STATISTICS - // was used - 4: optional double score - - // a counter for how often this facet was penalized - 5: optional i32 penaltyCount - - 6: optional ThriftFacetCountMetadata metadata -}(persisted='true') - -// List of facet labels and counts for a given facet field, the -// total count for this field, and a quality score for this field -struct ThriftFacetFieldResults { - 1: required list topFacets - 2: required i32 totalCount - 3: optional double scoreQuality - 4: optional i32 totalScore - 5: optional i32 totalPenalty - - // The ratio of the tweet language in the tweets with this facet field, a map from the language - // name to a number between (0.0, 1.0]. Only languages with ratio higher than 0.1 will be included. - 6: optional map languageHistogram -} - -struct ThriftFacetResults { - 1: required map facetFields - 2: optional i32 backgroundNumHits - // returns optionally a list of user ids that should not get filtered - // out by things like antigaming filters, because these users were explicitly - // queried for - // Note that ThriftFacetCountMetadata returns already dontFilterUser - // for facet requests in which case this list is not needed. However, it - // is needed for subsequent term statistics queries, were user id lookups - // are performed, but a different background query is used. - 3: optional set userIDWhitelist -} - -struct ThriftSearchResults { - // Next available field ID: 23 - 1: required list results = [] - - // (SEARCH-11950): Now resultOffset is deprecated, so there is no use in numResultsSkipped too. - 9: optional i32 deprecated_numResultsSkipped - - // Number of docs that matched the query and were processed. - 7: optional i32 numHitsProcessed - - // Range of status IDs searched, from max ID to min ID (both inclusive). - // These may be unset in case that the search query contained ID or time - // operators that were completely out of range for the given index. - 10: optional i64 maxSearchedStatusID - 11: optional i64 minSearchedStatusID - - // Time range that was searched (both inclusive). - 19: optional i32 maxSearchedTimeSinceEpoch - 20: optional i32 minSearchedTimeSinceEpoch - - 12: optional ThriftSearchResultsRelevanceStats relevanceStats - - // Overall quality of this search result set - 13: optional double score = -1.0 - 18: optional double nsfwRatio = 0.0 - - // The count of hit documents in each language. - 14: optional map languageHistogram - - // Hit counts per time period: - // The key is a time cutoff in milliseconds (e.g. 60000 msecs ago). - // The value is the number of hits that are more recent than the cutoff. - 15: optional map hitCounts - - // the total cost for this query - 16: optional double queryCost - - // Set to non-0 if this query was terminated early (either due to a timeout, or exceeded query cost) - // When getting this response from a single earlybird, this will be set to 1, if the query - // terminated early. - // When getting this response from a search root, this should be set to the number of individual - // earlybird requests that were terminated early. 
- 17: optional i32 numPartitionsEarlyTerminated - - // If ThriftSearchResults returns features in features.ThriftSearchResultFeature format, this - // field would define the schema of the features. - // If the earlybird schema is already in the client cached schemas indicated in the request, then - // searchFeatureSchema would only have (version, checksum) information. - // - // Notice that earlybird root only sends one schema back to the superroot even though earlybird - // root might receive multiple version of schemas. - // - // Earlybird roots' schema merge/choose logic when returning results to superroot: - // . pick the most occurred versioned schema and return the schema to the superroot - // . if the superroot already caches the schema, only send the version information back - // - // Superroots' schema merge/choose logic when returning results to clients: - // . pick the schema based on the order of: realtime > protected > archive - // . because of the above ordering, it is possible that archive earlybird schema with a new flush - // version (with new bit features) might be lost to older realtime earlybird schema; this is - // considered to to be rare and acceptable because one realtime earlybird deploy would fix it - 21: optional features.ThriftSearchFeatureSchema featureSchema - - // How long it took to score the results in earlybird (in nanoseconds). The number of results - // that were scored should be set in numHitsProcessed. - // Expected to only be set for requests that actually do scoring (i.e. Relevance and TopTweets). - 22: optional i64 scoringTimeNanos - - 8: optional i32 deprecated_numDocsProcessed -} - -// Note: Earlybird no longer respects this field, as it does not contain statuses. -// Blender should respect it. -enum EarlybirdReturnStatusType { - NO_STATUS = 0 - // deprecated - DEPRECATED_BASIC_STATUS = 1, - // deprecated - DEPRECATED_SEARCH_STATUS = 2, - TWEETYPIE_STATUS = 3, - - PLACE_HOLDER4 = 4, - PLACE_HOLDER5 = 5, -} - -struct AdjustedRequestParams { - // Next available field ID: 4 - - // Adjusted value for EarlybirdRequest.searchQuery.numResults. - 1: optional i32 numResults - - // Adjusted value for EarlybirdRequest.searchQuery.maxHitsToProcess and - // EarlybirdRequest.searchQuery.relevanceOptions.maxHitsToProcess. - 2: optional i32 maxHitsToProcess - - // Adjusted value for EarlybirdRequest.searchQuery.relevanceOptions.returnAllResults - 3: optional bool returnAllResults -} - -struct EarlybirdRequest { - // Next available field ID: 36 - - // -------- COMMON REQUEST OPTIONS -------- - // These fields contain options respected by all kinds of earlybird requests. - - // Search query containing general earlybird retrieval and hit collection options. - // Also contains the options specific to search requests. - 1: required ThriftSearchQuery searchQuery - - // Common RPC information - client hostname and request ID. - 12: optional string clientHost - 13: optional string clientRequestID - - // A string identifying the client that initiated the request. - // Ex: macaw-search.prod, webforall.prod, webforall.staging. - // The intention is to track the load we get from each client, and eventually enforce - // per-client QPS quotas, but this field could also be used to allow access to certain features - // only to certain clients, etc. - 21: optional string clientId - - // The time (in millis since epoch) when the earlybird client issued this request. - // Can be used to estimate request timeout time, capturing in-transit time for the request. 
- 23: optional i64 clientRequestTimeMs - - // Caching parameters used by earlybird roots. - 24: optional caching.CachingParams cachingParams - - // Deprecated. See SEARCH-2784 - // Earlybird requests will be early terminated in a best-effort way to prevent them from - // exceeding the given timeout. If timeout is <= 0 this early termination criteria is - // disabled. - 17: optional i32 timeoutMs = -1 - - // Deprecated. See SEARCH-2784 - // Earlybird requests will be early terminated in a best-effort way to prevent them from - // exceeding the given query cost. If maxQueryCost <= 0 this early termination criteria - // is disabled. - 20: optional double maxQueryCost = -1 - - - // -------- REQUEST-TYPE SPECIFIC OPTIONS -------- - // These fields contain options for one specific kind of request. If one of these options - // is set the request will be considered to be the appropriate type of request. - - // Options for facet counting requests. - 11: optional ThriftFacetRequest facetRequest - - // Options for term statistics requests. - 14: optional ThriftTermStatisticsRequest termStatisticsRequest - - - // -------- DEBUG OPTIONS -------- - // Used for debugging only. - - // Debug mode, 0 for no debug information. - 15: optional i8 debugMode = 0 - - // Can be used to pass extra debug arguments to earlybird. - 34: optional EarlybirdDebugOptions debugOptions - - // Searches a specific segment by time slice id if set and segment id is > 0. - 22: optional i64 searchSegmentId - - // -------- THINGS USED ONLY BY THE BLENDER -------- - // These fields are used by the blender and clients of the blender, but not by earlybird. - - // Specifies what kind of status object to return, if any. - 7: optional EarlybirdReturnStatusType returnStatusType - - - // -------- THINGS USED BY THE ROOTS -------- - // These fields are not in use by earlybirds themselves, but are in use by earlybird roots - // (and their clients). - // These fields live here since we currently reuse the same thrift request and response structs - // for both earlybirds and earlybird roots, and could potentially be moved out if we were to - // introduce separate request / response structs specifically for the roots. - - // We have a threshold for how many hash partition requests need to succeed at the root level - // in order for the earlybird root request to be considered successful. - // Each type of earlybird query (e.g. relevance, or term statistics) has a predefined default - // threshold value (e.g. 90% of hash partitions need to succeed for a recency query). - // The client can optionally set the threshold value to be something other than the default, - // by setting this field to a value in the range of 0 (exclusive) to 1 (inclusive). - // If this value is set outside of the (0, 1] range, a CLIENT_ERROR EarlybirdResponseCode will - // be returned. - 25: optional double successfulResponseThreshold - - // Where does the query come from? - 26: optional query.ThriftQuerySource querySource - - // Whether to get archive results. This flag is advisory. A request may still be restricted from - // getting results from the archive based on the requesting client, query source, requested - // time/id range, etc. - 27: optional bool getOlderResults - - // The list of users followed by the current user. - // Used to restrict the values in the fromUserIDFilter64 field when sending a request - // to the protected cluster. - 28: optional list followedUserIds - - // The adjusted parameters for the protected request.
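successfulResponseThreshold above is only described in prose. Below is a hedged sketch of the two root-side checks it implies: validating the (0, 1] range, and comparing the fraction of successful hash partitions against the effective threshold. The method names, and passing the per-request-type default as a plain parameter, are assumptions for illustration; the numPartitions / numSuccessfulPartitions pair used here is the one returned on the merged EarlybirdResponse further down.

```java
/** Sketch of the root-side checks described above; not the production implementation. */
final class PartitionSuccessExample {

  private PartitionSuccessExample() {}

  /** Requests with a threshold outside (0, 1] are rejected with CLIENT_ERROR. */
  static boolean isValidThreshold(Double successfulResponseThreshold) {
    return successfulResponseThreshold == null
        || (successfulResponseThreshold > 0.0 && successfulResponseThreshold <= 1.0);
  }

  /**
   * Returns true when enough hash partitions answered successfully. The per-request-type
   * default threshold (e.g. 0.9 for recency) is assumed to be resolved by the caller.
   */
  static boolean enoughPartitionsSucceeded(
      int numPartitions, int numSuccessfulPartitions, double effectiveThreshold) {
    if (numPartitions <= 0) {
      return false;
    }
    return (double) numSuccessfulPartitions / numPartitions >= effectiveThreshold;
  }
}
```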
- 29: optional AdjustedRequestParams adjustedProtectedRequestParams - - // The adjusted parameters for the full archive request. - 30: optional AdjustedRequestParams adjustedFullArchiveRequestParams - - // Return only the protected tweets. This flag is used by the SuperRoot to return relevance - // results that contain only protected tweets. - 31: optional bool getProtectedTweetsOnly - - // Tokenize serialized queries with the appropriate Penguin version(s). - // Only has an effect on superroot. - 32: optional bool retokenizeSerializedQuery - - // Flag to ignore tweets that are very recent and could be incompletely indexed. - // If false, will allow queries to see results that may violate implicit streaming - // guarantees and will search Tweets that have been partially indexed. - // See go/indexing-latency for more details. When enabled, prevents seeing tweets - // that are less than 15 seconds old (or a similarly configured threshold). - // May be set to false unless explicitly set to true. - 33: optional bool skipVeryRecentTweets = 1 - - // Setting an experimental cluster will reroute traffic at the realtime root layer to an experimental - // Earlybird cluster. This will have no impact if set on requests to anywhere other than realtime root. - 35: optional ExperimentCluster experimentClusterToUse - - // Caps number of results returned by roots after merging results from different earlybird partitions/clusters. - // If not set, ThriftSearchQuery.numResults or CollectorParams.numResultsToReturn will be used to cap results. - // This parameter will be ignored if ThriftRelevanceOptions.returnAllResults is set to true. - 36: optional i32 numResultsToReturnAtRoot -} - -enum EarlybirdResponseCode { - SUCCESS = 0, - PARTITION_NOT_FOUND = 1, - PARTITION_DISABLED = 2, - TRANSIENT_ERROR = 3, - PERSISTENT_ERROR = 4, - CLIENT_ERROR = 5, - PARTITION_SKIPPED = 6, - // Request was queued up on the server for so long that it timed out, and was not - // executed at all. - SERVER_TIMEOUT_ERROR = 7, - TIER_SKIPPED = 8, - // Not enough partitions returned a successful response. The merged response will have partition - // counts and early termination info set, but will not have search results. - TOO_MANY_PARTITIONS_FAILED_ERROR = 9, - // Client went over its quota, and the request was throttled. - QUOTA_EXCEEDED_ERROR = 10, - // Client's request is blocked based on Search Infra's policy. Search Infra can block client's - // requests based on the query source of the request. - REQUEST_BLOCKED_ERROR = 11, - - CLIENT_CANCEL_ERROR = 12, - - CLIENT_BLOCKED_BY_TIER_ERROR = 13, - - PLACE_HOLDER_2015_09_21 = 14, -} - -// A recorded request and response. -struct EarlybirdRequestResponse { - // Where did we send this request to. - 1: optional string sentTo; - 2: optional EarlybirdRequest request; - // This can't be an EarlybirdResponse, because the thrift compiler for Python - // doesn't allow cyclic references and we have some Python utilities that will fail. - 3: optional string response; -} - -struct EarlybirdDebugInfo { - 1: optional string host - 2: optional string parsedQuery - 3: optional string luceneQuery - // Requests sent to dependent services. For example, superroot sends to realtime root, - // archive root, etc. - 4: optional list sentRequests; - // segment level debug info (eg. hitsPerSegment, max/minSearchedTime etc.)
- 5: optional list collectorDebugInfo - 6: optional list termStatisticsDebugInfo -} - -struct EarlybirdDebugOptions { - 1: optional bool includeCollectorDebugInfo -} - -struct TierResponse { - 1: optional EarlybirdResponseCode tierResponseCode - 2: optional i32 numPartitions - 3: optional i32 numSuccessfulPartitions -} - -struct EarlybirdServerStats { - // The hostname of the Earlybird that processed this request. - 1: optional string hostname - - // The partition to which this earlybird belongs. - 2: optional i32 partition - - // Current Earlybird QPS. - // Earlybirds should set this field at the end of a request (not at the start). This would give - // roots a more up-to-date view of the load on the earlybirds. - 3: optional i64 currentQps - - // The time the request waited in the queue before Earlybird started processing it. - // This does not include the time spent in the finagle queue: it's the time between the moment - // earlybird received the request, and the moment it started processing the request. - 4: optional i64 queueTimeMillis - - // The average request time in the queue before Earlybird started processing it. - // This does not include the time that requests spent in the finagle queue: it's the average time - // between the moment earlybird received its requests, and the moment it started processing them. - 5: optional i64 averageQueueTimeMillis - - // Current average per-request latency as perceived by Earlybird. - 6: optional i64 averageLatencyMicros - - // The tier to which this earlybird belongs. - 7: optional string tierName -} - -struct EarlybirdResponse { - // Next available field ID: 17 - 1: optional ThriftSearchResults searchResults - 5: optional ThriftFacetResults facetResults - 6: optional ThriftTermStatisticsResults termStatisticsResults - 2: required EarlybirdResponseCode responseCode - 3: required i64 responseTime - 7: optional i64 responseTimeMicros - // fields below will only be returned if debug > 1 in the request. - 4: optional string debugString - 8: optional EarlybirdDebugInfo debugInfo - - // Only exists for merged earlybird response. - 10: optional i32 numPartitions - 11: optional i32 numSuccessfulPartitions - // Only exists for merged earlybird response from multiple tiers. - 13: optional list perTierResponse - - // Total number of segments that were searched. Partially searched segments are fully counted. - // e.g. if we searched 1 segment fully, and early terminated half way through the second - // segment, this field should be set to 2. - 15: optional i32 numSearchedSegments - - // Whether the request early terminated, if so, the termination reason. - 12: optional search.EarlyTerminationInfo earlyTerminationInfo - - // Whether this response is from cache. - 14: optional bool cacheHit - - // Stats used by roots to determine if we should go into degraded mode. 
- 16: optional EarlybirdServerStats earlybirdServerStats -} - -enum EarlybirdStatusCode { - STARTING = 0, - CURRENT = 1, - STOPPING = 2, - UNHEALTHY = 3, - BLACKLISTED = 4, - - PLACE_HOLDER5 = 5, - PLACE_HOLDER6 = 6, -} - -struct EarlybirdStatusResponse { - 1: required EarlybirdStatusCode code - 2: required i64 aliveSince - 3: optional string message -} - -service EarlybirdService { - string getName(), - EarlybirdStatusResponse getStatus(), - EarlybirdResponse search( 1: EarlybirdRequest request ) -} diff --git a/src/thrift/com/twitter/simclusters_v2/BUILD b/src/thrift/com/twitter/simclusters_v2/BUILD deleted file mode 100644 index 221cc9184..000000000 --- a/src/thrift/com/twitter/simclusters_v2/BUILD +++ /dev/null @@ -1,23 +0,0 @@ -create_thrift_libraries( - base_name = "simclusters_v2-thrift", - sources = ["*.thrift"], - platform = "java8", - tags = ["bazel-compatible"], - dependency_roots = [ - "src/thrift/com/twitter/algebird_internal", - ], - export_roots = [ - "src/thrift/com/twitter/algebird_internal:algebird_internal", - ], - generate_languages = [ - "go", - "java", - "lua", - "python", - "ruby", - "scala", - "strato", - ], - provides_java_name = "simclusters_v2-thrift-java", - provides_scala_name = "simclusters_v2-thrift-scala", -) diff --git a/src/thrift/com/twitter/simclusters_v2/abuse.thrift b/src/thrift/com/twitter/simclusters_v2/abuse.thrift deleted file mode 100644 index 60043244b..000000000 --- a/src/thrift/com/twitter/simclusters_v2/abuse.thrift +++ /dev/null @@ -1,53 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2 -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "embedding.thrift" -include "simclusters_presto.thrift" - -/** - * Struct that associates a user with simcluster scores for different - * interaction types. This is meant to be used as a feature to predict abuse. - * - * This thrift struct is meant for exploration purposes. It does not have any - * assumptions about what type of interactions we use or what types of scores - * we are keeping track of. - **/ -struct AdhocSingleSideClusterScores { - 1: required i64 userId(personalDataType = 'UserId') - // We can make the interaction types have arbitrary names. In the production - // version of this dataset, we should have a different field per interaction - // type so that the API of what is included is clearer. - 2: required map interactionScores -}(persisted="true", hasPersonalData = 'true') - -/** -* This is a prod version of the single side features. It is meant to be used as a value in a key -* value store. The pair of healthy and unhealthy scores will be different depending on the use case. -* We will use different stores for different use cases. For instance, the first store that -* we implement will use search abuse reports and impressions. We can build stores for new values -* in the future. -* -* The consumer creates the interactions which the author receives. For instance, the consumer -* creates an abuse report for an author. The consumer scores are related to the interaction creation -* behavior of the consumer. The author scores are related to whether the author receives these -* interactions.
-* -**/ -struct SingleSideUserScores { - 1: required i64 userId(personalDataType = 'UserId') - 2: required double consumerUnhealthyScore(personalDataType = 'EngagementScore') - 3: required double consumerHealthyScore(personalDataType = 'EngagementScore') - 4: required double authorUnhealthyScore(personalDataType = 'EngagementScore') - 5: required double authorHealthyScore(personalDataType = 'EngagementScore') -}(persisted="true", hasPersonalData = 'true') - -/** -* Struct that associates a cluster-cluster interaction scores for different -* interaction types. -**/ -struct AdhocCrossSimClusterInteractionScores { - 1: required i64 clusterId - 2: required list clusterScores -}(persisted="true") diff --git a/src/thrift/com/twitter/simclusters_v2/clustering.thrift b/src/thrift/com/twitter/simclusters_v2/clustering.thrift deleted file mode 100644 index 81b8567cb..000000000 --- a/src/thrift/com/twitter/simclusters_v2/clustering.thrift +++ /dev/null @@ -1,18 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.clustering -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -/** - * Struct that represents an ordered list of producer clusters. - * The list is meant to be ordered by decreasing cluster size. - **/ -struct OrderedClustersAndMembers { - 1: required list> orderedClustersAndMembers (personalDataType = 'UserId') - // work around BQ not supporting nested struct such as list - 2: optional list orderedClustersAndMembersStruct (personalDataType = 'UserId') -}(persisted = 'true', hasPersonalData = 'true') - -struct ClusterMembers { - 1: required set clusterMembers (personalDataType = 'UserId') -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/embedding.thrift b/src/thrift/com/twitter/simclusters_v2/embedding.thrift deleted file mode 100644 index 110da0c65..000000000 --- a/src/thrift/com/twitter/simclusters_v2/embedding.thrift +++ /dev/null @@ -1,137 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.embedding -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "com/twitter/simclusters_v2/identifier.thrift" -include "com/twitter/simclusters_v2/online_store.thrift" - -struct SimClusterWithScore { - 1: required i32 clusterId(personalDataType = 'InferredInterests') - 2: required double score(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct TopSimClustersWithScore { - 1: required list topClusters - 2: required online_store.ModelVersion modelVersion -}(persisted = 'true', hasPersonalData = 'true') - -struct InternalIdWithScore { - 1: required identifier.InternalId internalId - 2: required double score(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct InternalIdEmbedding { - 1: required list embedding -}(persisted = 'true', hasPersonalData = 'true') - -struct SemanticCoreEntityWithScore { - 1: required i64 entityId(personalDataType = 'SemanticcoreClassification') - 2: required double score(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct TopSemanticCoreEntitiesWithScore { - 1: required list topEntities -}(persisted = 'true', hasPersonalData = 'true') - -struct PersistedFullClusterId { - 1: required online_store.ModelVersion modelVersion - 2: required i32 clusterId(personalDataType = 
'InferredInterests') -}(persisted = 'true', hasPersonalData = 'true') - -struct DayPartitionedClusterId { - 1: required i32 clusterId(personalDataType = 'InferredInterests') - 2: required string dayPartition // format: yyyy-MM-dd -} - -struct TopProducerWithScore { - 1: required i64 userId(personalDataType = 'UserId') - 2: required double score(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct TopProducersWithScore { - 1: required list topProducers -}(persisted = 'true', hasPersonalData = 'true') - -struct TweetWithScore { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: required double score(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct TweetsWithScore { - 1: required list tweets -}(persisted = 'true', hasPersonalData = 'true') - -struct TweetTopKTweetsWithScore { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: required TweetsWithScore topkTweetsWithScore -}(persisted = 'true', hasPersonalData = 'true') - -/** - * The generic SimClustersEmbedding for online long-term storage and real-time calculation. - * Use SimClustersEmbeddingId as the only identifier. - * Warning: Doesn't include model version and embedding type in the value struct. - **/ -struct SimClustersEmbedding { - 1: required list embedding -}(persisted = 'true', hasPersonalData = 'true') - -struct SimClustersEmbeddingWithScore { - 1: required SimClustersEmbedding embedding - 2: required double score -}(persisted = 'true', hasPersonalData = 'false') - -/** - * This is the recommended structure for aggregating embeddings with time decay - the metadata - * stores the information needed for decayed aggregation. - **/ -struct SimClustersEmbeddingWithMetadata { - 1: required SimClustersEmbedding embedding - 2: required SimClustersEmbeddingMetadata metadata -}(hasPersonalData = 'true') - -struct SimClustersEmbeddingIdWithScore { - 1: required identifier.SimClustersEmbeddingId id - 2: required double score -}(persisted = 'true', hasPersonalData = 'false') - -struct SimClustersMultiEmbeddingByValues { - 1: required list embeddings -}(persisted = 'true', hasPersonalData = 'false') - -struct SimClustersMultiEmbeddingByIds { - 1: required list ids -}(persisted = 'true', hasPersonalData = 'false') - -/** - * Generic SimClusters Multiple Embeddings. The identifier.SimClustersMultiEmbeddingId is the key of - * the multiple embedding. - **/ -union SimClustersMultiEmbedding { - 1: SimClustersMultiEmbeddingByValues values - 2: SimClustersMultiEmbeddingByIds ids -}(persisted = 'true', hasPersonalData = 'false') - -/** - * The metadata of a SimClustersEmbedding. The updatedCount represent the version of the Embedding. - * For tweet embedding, the updatedCount is same/close to the favorite count. 
- **/ -struct SimClustersEmbeddingMetadata { - 1: optional i64 updatedAtMs - 2: optional i64 updatedCount -}(persisted = 'true', hasPersonalData = 'true') - -/** - * The data structure for PersistentSimClustersEmbedding Store - **/ -struct PersistentSimClustersEmbedding { - 1: required SimClustersEmbedding embedding - 2: required SimClustersEmbeddingMetadata metadata -}(persisted = 'true', hasPersonalData = 'true') - -/** - * The data structure for the Multi Model PersistentSimClustersEmbedding Store - **/ -struct MultiModelPersistentSimClustersEmbedding { - 1: required map multiModelPersistentSimClustersEmbedding -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/entity.thrift b/src/thrift/com/twitter/simclusters_v2/entity.thrift deleted file mode 100644 index 1d0ee6946..000000000 --- a/src/thrift/com/twitter/simclusters_v2/entity.thrift +++ /dev/null @@ -1,51 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.entity -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "com/twitter/algebird_internal/algebird.thrift" - -/** - * Penguin text entity. All fields are required as this is used as a part of a memcache key. - **/ -struct PenguinKey { - 1: required string textEntity -}(hasPersonalData = 'false') - -/** - * NER text entity. All fields are required as this is used as a part of a memcache key. - **/ -struct NerKey { - 1: required string textEntity - 2: required i32 wholeEntityType -}(hasPersonalData = 'false') - -/** - * Semantic Core text entity. All fields are required as this is used as a part of a memcache key. - **/ -struct SemanticCoreKey { - 1: required i64 entityId(personalDataType = 'SemanticcoreClassification') -}(hasPersonalData = 'true') - -/** - * Represents an entity extracted from a tweet. - **/ -union TweetTextEntity { - 1: string hashtag - 2: PenguinKey penguin - 3: NerKey ner - 4: SemanticCoreKey semanticCore -}(hasPersonalData = 'true') - -struct SpaceId { - 1: string id -}(hasPersonalData = 'true') - -/** - * All possible entities that simclusters are associated with. 
- **/ -union SimClusterEntity { - 1: i64 tweetId(personalDataType = 'TweetId') - 2: TweetTextEntity tweetEntity - 3: SpaceId spaceId -}(hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/evaluation.thrift b/src/thrift/com/twitter/simclusters_v2/evaluation.thrift deleted file mode 100644 index 85414baf9..000000000 --- a/src/thrift/com/twitter/simclusters_v2/evaluation.thrift +++ /dev/null @@ -1,65 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.evaluation -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -/** - * Surface area at which the reference tweet was displayed to the user - **/ -enum DisplayLocation { - TimelinesRecap = 1, - TimelinesRectweet = 2 -}(hasPersonalData = 'false') - -struct TweetLabels { - 1: required bool isClicked = false(personalDataType = 'EngagementsPrivate') - 2: required bool isLiked = false(personalDataType = 'EngagementsPublic') - 3: required bool isRetweeted = false(personalDataType = 'EngagementsPublic') - 4: required bool isQuoted = false(personalDataType = 'EngagementsPublic') - 5: required bool isReplied = false(personalDataType = 'EngagementsPublic') -}(persisted = 'true', hasPersonalData = 'true') - -/** - * Data container of a reference tweet with scribed user engagement labels - */ -struct ReferenceTweet { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: required i64 authorId(personalDataType = 'UserId') - 3: required i64 timestamp(personalDataType = 'PublicTimestamp') - 4: required DisplayLocation displayLocation - 5: required TweetLabels labels -}(persisted="true", hasPersonalData = 'true') - -/** - * Data container of a candidate tweet generated by the candidate algorithm - */ -struct CandidateTweet { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: optional double score(personalDataType = 'EngagementScore') - // The timestamp here is a synthetically generated timestamp. - // for evaluation purpose. Hence left unannotated - 3: optional i64 timestamp -}(hasPersonalData = 'true') - -/** - * An encapsulated collection of candidate tweets - **/ -struct CandidateTweets { - 1: required i64 targetUserId(personalDataType = 'UserId') - 2: required list recommendedTweets -}(hasPersonalData = 'true') - -/** - * An encapsulated collection of reference tweets - **/ -struct ReferenceTweets { - 1: required i64 targetUserId(personalDataType = 'UserId') - 2: required list impressedTweets -}(persisted="true", hasPersonalData = 'true') - -/** - * A list of candidate tweets - **/ -struct CandidateTweetsList { - 1: required list recommendedTweets -}(hasPersonalData = 'true') \ No newline at end of file diff --git a/src/thrift/com/twitter/simclusters_v2/graph.thrift b/src/thrift/com/twitter/simclusters_v2/graph.thrift deleted file mode 100644 index e67c860d2..000000000 --- a/src/thrift/com/twitter/simclusters_v2/graph.thrift +++ /dev/null @@ -1,61 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.graph -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -struct DecayedSums { - // last time the decayed sum was updated, in millis. 
- 1: required i64 lastUpdatedTimestamp - - // a map from half life (specified in days) to the decayed sum - 2: required map halfLifeInDaysToDecayedSums -}(persisted = 'true', hasPersonalData = 'false') - -struct EdgeWithDecayedWeights { - 1: required i64 sourceId(personalDataType = 'UserId') - 2: required i64 destinationId(personalDataType = 'UserId') - 3: required DecayedSums weights -}(persisted="true", hasPersonalData = "true") - -struct NeighborWithWeights { - 1: required i64 neighborId(personalDataType = 'UserId') - 2: optional bool isFollowed(personalDataType = 'Follow') - 3: optional double followScoreNormalizedByNeighborFollowersL2(personalDataType = 'EngagementsPublic') - 4: optional double favScoreHalfLife100Days(personalDataType = 'EngagementsPublic') - 5: optional double favScoreHalfLife100DaysNormalizedByNeighborFaversL2(personalDataType = 'EngagementsPublic') - - // log(favScoreHalfLife100Days + 1) - 6: optional double logFavScore(personalDataType = 'EngagementsPublic') - - // log(favScoreHalfLife100Days + 1) normalized so that a user's incoming weights have unit l2 norm - 7: optional double logFavScoreL2Normalized(personalDataType = 'EngagementsPublic') - -}(persisted = 'true', hasPersonalData = 'true') - -struct UserAndNeighbors { - 1: required i64 userId(personalDataType = 'UserId') - 2: required list neighbors -}(persisted="true", hasPersonalData = 'true') - -struct NormsAndCounts { - 1: required i64 userId(personalDataType = 'UserId') - 2: optional double followerL2Norm(personalDataType = 'CountOfFollowersAndFollowees') - 3: optional double faverL2Norm(personalDataType = 'EngagementsPublic') - 4: optional i64 followerCount(personalDataType = 'CountOfFollowersAndFollowees') - 5: optional i64 faverCount(personalDataType = 'EngagementsPublic') - - // sum of the weights on the incoming edges where someone fav'ed this producer - 6: optional double favWeightsOnFavEdgesSum(personalDataType = 'EngagementsPublic') - - // sum of the fav weights on all the followers of this producer - 7: optional double favWeightsOnFollowEdgesSum(personalDataType = 'EngagementsPublic') - // log(favScore + 1) - 8: optional double logFavL2Norm(personalDataType = 'EngagementsPublic') - - // sum of log(favScore + 1) on the incoming edges where someone fav'ed this producer - 9: optional double logFavWeightsOnFavEdgesSum(personalDataType = 'EngagementsPublic') - - // sum of log(favScore + 1) on all the followers of this producer - 10: optional double logFavWeightsOnFollowEdgesSum(personalDataType = 'EngagementsPublic') - -}(persisted="true", hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/identifier.thrift b/src/thrift/com/twitter/simclusters_v2/identifier.thrift deleted file mode 100644 index b4285e699..000000000 --- a/src/thrift/com/twitter/simclusters_v2/identifier.thrift +++ /dev/null @@ -1,205 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.identifier -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "com/twitter/simclusters_v2/online_store.thrift" - -/** - * The uniform type for a SimClusters Embeddings. - * Each embeddings have the uniform underlying storage. - * Warning: Every EmbeddingType should map to one and only one InternalId. 
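DecayedSums and the logFav* fields above describe half-life-decayed engagement sums and a log(score + 1) transform, but the update arithmetic itself is not part of this file. The sketch below assumes standard exponential half-life decay; the production jobs may compute this differently, and the class and method names are illustrative.

```java
import java.util.concurrent.TimeUnit;

/** Sketch of the arithmetic implied by DecayedSums and the logFav* fields; assumptions noted inline. */
final class DecayExample {

  private DecayExample() {}

  /** Decay an existing sum from lastUpdatedTimestamp (millis) to nowMs, then add the new weight. */
  static double updateDecayedSum(
      double previousSum,
      long lastUpdatedTimestampMs,
      long nowMs,
      double halfLifeInDays,
      double increment) {
    // Assumed: standard exponential decay with the given half-life.
    double halfLifeMs = TimeUnit.DAYS.toMillis(1) * halfLifeInDays;
    double decayFactor = Math.pow(0.5, (nowMs - lastUpdatedTimestampMs) / halfLifeMs);
    return previousSum * decayFactor + increment;
  }

  /** logFavScore as described above: log(favScoreHalfLife100Days + 1). */
  static double logFavScore(double favScoreHalfLife100Days) {
    return Math.log(favScoreHalfLife100Days + 1.0);
  }
}
```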
- **/ -enum EmbeddingType { - // Reserve 001 - 99 for Tweet embeddings - FavBasedTweet = 1, // Deprecated - FollowBasedTweet = 2, // Deprecated - LogFavBasedTweet = 3, // Production Version - FavBasedTwistlyTweet = 10, // Deprecated - LogFavBasedTwistlyTweet = 11, // Deprecated - LogFavLongestL2EmbeddingTweet = 12, // Production Version - - // Tweet embeddings generated from non-fav events - // Naming convention: {Event}{Score}BasedTweet - // {Event}: The interaction event we use to build the tweet embeddings - // {Score}: The score from user InterestedIn embeddings - VideoPlayBack50LogFavBasedTweet = 21, - RetweetLogFavBasedTweet = 22, - ReplyLogFavBasedTweet = 23, - PushOpenLogFavBasedTweet = 24, - - // [Experimental] Offline generated FavThroughRate-based Tweet Embedding - Pop1000RankDecay11Tweet = 30, - Pop10000RankDecay11Tweet = 31, - OonPop1000RankDecayTweet = 32, - - // [Experimental] Offline generated production-like LogFavScore-based Tweet Embedding - OfflineGeneratedLogFavBasedTweet = 40, - - // Reserve 51-59 for Ads Embedding - LogFavBasedAdsTweet = 51, // Experimental embedding for ads tweet candidate - LogFavClickBasedAdsTweet = 52, // Experimental embedding for ads tweet candidate - - // Reserve 60-69 for Evergreen content - LogFavBasedEvergreenTweet = 60, - LogFavBasedRealTimeTweet = 65, - - // Reserve 101 to 149 for Semantic Core Entity embeddings - FavBasedSematicCoreEntity = 101, // Deprecated - FollowBasedSematicCoreEntity = 102, // Deprecated - FavBasedHashtagEntity = 103, // Deprecated - FollowBasedHashtagEntity = 104, // Deprecated - ProducerFavBasedSemanticCoreEntity = 105, // Deprecated - ProducerFollowBasedSemanticCoreEntity = 106,// Deprecated - FavBasedLocaleSemanticCoreEntity = 107, // Deprecated - FollowBasedLocaleSemanticCoreEntity = 108, // Deprecated - LogFavBasedLocaleSemanticCoreEntity = 109, // Deprecated - LanguageFilteredProducerFavBasedSemanticCoreEntity = 110, // Deprecated - LanguageFilteredFavBasedLocaleSemanticCoreEntity = 111, // Deprecated - FavTfgTopic = 112, // TFG topic embedding built from fav-based user interestedIn - LogFavTfgTopic = 113, // TFG topic embedding built from logfav-based user interestedIn - FavInferredLanguageTfgTopic = 114, // TFG topic embedding built using inferred consumed languages - FavBasedKgoApeTopic = 115, // topic embedding using fav-based aggregatable producer embedding of KGO seed accounts. - LogFavBasedKgoApeTopic = 116, // topic embedding using log fav-based aggregatable producer embedding of KGO seed accounts. - FavBasedOnboardingApeTopic = 117, // topic embedding using fav-based aggregatable producer embedding of onboarding seed accounts. - LogFavBasedOnboardingApeTopic = 118, // topic embedding using log fav-based aggregatable producer embedding of onboarding seed accounts. - LogFavApeBasedMuseTopic = 119, // Deprecated - LogFavApeBasedMuseTopicExperiment = 120 // Deprecated - - // Reserved 201 - 299 for Producer embeddings (KnownFor) - FavBasedProducer = 201 - FollowBasedProducer = 202 - AggregatableFavBasedProducer = 203 // fav-based aggregatable producer embedding. - AggregatableLogFavBasedProducer = 204 // logfav-based aggregatable producer embedding. - RelaxedAggregatableLogFavBasedProducer = 205 // logfav-based aggregatable producer embedding. - AggregatableFollowBasedProducer = 206 // follow-based aggregatable producer embedding. 
- KnownFor = 300 - - // Reserved 301 - 399 for User InterestedIn embeddings - FavBasedUserInterestedIn = 301 - FollowBasedUserInterestedIn = 302 - LogFavBasedUserInterestedIn = 303 - RecentFollowBasedUserInterestedIn = 304 // interested-in embedding based on aggregating producer embeddings of recent follows - FilteredUserInterestedIn = 305 // interested-in embedding used by twistly read path - LogFavBasedUserInterestedInFromAPE = 306 - FollowBasedUserInterestedInFromAPE = 307 - TwiceUserInterestedIn = 308 // interested-in multi-embedding based on clustering producer embeddings of neighbors - UnfilteredUserInterestedIn = 309 - UserNextInterestedIn = 310 // next interested-in embedding generated from BeT - - // Denser User InterestedIn, generated by Producer embeddings. - FavBasedUserInterestedInFromPE = 311 - FollowBasedUserInterestedInFromPE = 312 - LogFavBasedUserInterestedInFromPE = 313 - FilteredUserInterestedInFromPE = 314 // interested-in embedding used by twistly read path - - // [Experimental] Denser User InterestedIn, generated by aggregating IIAPE embedding from AddressBook - LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE = 320 - LogFavBasedUserInterestedAverageAddressBookFromIIAPE = 321 - LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE = 322 - LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE = 323 - LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE = 324 - LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE = 325 - - //Reserved 401 - 500 for Space embedding - FavBasedApeSpace = 401 // DEPRECATED - LogFavBasedListenerSpace = 402 // DEPRECATED - LogFavBasedAPESpeakerSpace = 403 // DEPRECATED - LogFavBasedUserInterestedInListenerSpace = 404 // DEPRECATED - - // Experimental, internal-only IDs - ExperimentalThirtyDayRecentFollowBasedUserInterestedIn = 10000 // Like RecentFollowBasedUserInterestedIn, except limited to last 30 days - ExperimentalLogFavLongestL2EmbeddingTweet = 10001 // DEPRECATED -}(persisted = 'true', hasPersonalData = 'false') - -/** - * The uniform type for a SimClusters MultiEmbeddings. - * Warning: Every MultiEmbeddingType should map to one and only one InternalId. - **/ -enum MultiEmbeddingType { - // Reserved 0-99 for Tweet based MultiEmbedding - - // Reserved 100 - 199 for Topic based MultiEmbedding - LogFavApeBasedMuseTopic = 100 // Deprecated - LogFavApeBasedMuseTopicExperiment = 101 // Deprecated - - // Reserved 301 - 399 for User InterestedIn embeddings - TwiceUserInterestedIn = 301 // interested-in multi-embedding based on clustering producer embeddings of neighbors -}(persisted = 'true', hasPersonalData = 'true') - -// Deprecated. Please use TopicId for future cases. 
-struct LocaleEntityId { - 1: i64 entityId - 2: string language -}(persisted = 'true', hasPersonalData = 'false') - -enum EngagementType { - Favorite = 1, - Retweet = 2, -} - -struct UserEngagedTweetId { - 1: i64 tweetId(personalDataType = 'TweetId') - 2: i64 userId(personalDataType = 'UserId') - 3: EngagementType engagementType(personalDataType = 'EventType') -}(persisted = 'true', hasPersonalData = 'true') - -struct TopicId { - 1: i64 entityId (personalDataType = 'SemanticcoreClassification') - // 2-letter ISO 639-1 language code - 2: optional string language - // 2-letter ISO 3166-1 alpha-2 country code - 3: optional string country -}(persisted = 'true', hasPersonalData = 'false') - -struct TopicSubId { - 1: i64 entityId (personalDataType = 'SemanticcoreClassification') - // 2-letter ISO 639-1 language code - 2: optional string language - // 2-letter ISO 3166-1 alpha-2 country code - 3: optional string country - 4: i32 subId -}(persisted = 'true', hasPersonalData = 'true') - -// Will be used for testing purposes in DDG 15536, 15534 -struct UserWithLanguageId { - 1: required i64 userId(personalDataType = 'UserId') - 2: optional string langCode(personalDataType = 'InferredLanguage') -}(persisted = 'true', hasPersonalData = 'true') - -/** - * The internal identifier type. - * Need to add ordering in [[com.twitter.simclusters_v2.common.SimClustersEmbeddingId]] - * when adding a new type. - **/ -union InternalId { - 1: i64 tweetId(personalDataType = 'TweetId') - 2: i64 userId(personalDataType = 'UserId') - 3: i64 entityId(personalDataType = 'SemanticcoreClassification') - 4: string hashtag(personalDataType = 'PublicTweetEntitiesAndMetadata') - 5: i32 clusterId - 6: LocaleEntityId localeEntityId(personalDataType = 'SemanticcoreClassification') - 7: UserEngagedTweetId userEngagedTweetId - 8: TopicId topicId - 9: TopicSubId topicSubId - 10: string spaceId - 11: UserWithLanguageId userWithLanguageId -}(persisted = 'true', hasPersonalData = 'true') - -/** - * A uniform identifier type for all kinds of SimClusters based embeddings. 
- **/ -struct SimClustersEmbeddingId { - 1: required EmbeddingType embeddingType - 2: required online_store.ModelVersion modelVersion - 3: required InternalId internalId -}(persisted = 'true', hasPersonalData = 'true') - -/** - * A uniform identifier type for multiple SimClusters embeddings - **/ -struct SimClustersMultiEmbeddingId { - 1: required MultiEmbeddingType embeddingType - 2: required online_store.ModelVersion modelVersion - 3: required InternalId internalId -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/inferred_entities.thrift b/src/thrift/com/twitter/simclusters_v2/inferred_entities.thrift deleted file mode 100644 index db667fb68..000000000 --- a/src/thrift/com/twitter/simclusters_v2/inferred_entities.thrift +++ /dev/null @@ -1,38 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.inferred_entities -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -// The SimClusters type we use to infer entity interests about a user -// Currently used for SimClusters Compliance to store a user's inferred interests - -include "online_store.thrift" - -enum ClusterType { - KnownFor = 1, - InterestedIn = 2 -}(persisted = 'true', hasPersonalData = 'false') - -struct SimClustersSource { - 1: required ClusterType clusterType - 2: required online_store.ModelVersion modelVersion -}(persisted = 'true', hasPersonalData = 'false') - -// The source of entities we use to infer entity interests about a user -enum EntitySource { - SimClusters20M145KDec11EntityEmbeddingsByFavScore = 1, // deprecated - SimClusters20M145KUpdatedEntityEmbeddingsByFavScore = 2, // deprecated - UTTAccountRecommendations = 3 # dataset built by Onboarding team - SimClusters20M145K2020EntityEmbeddingsByFavScore = 4 -}(persisted = 'true', hasPersonalData = 'false') - -struct InferredEntity { - 1: required i64 entityId(personalDataType = 'SemanticcoreClassification') - 2: required double score(personalDataType = 'EngagementScore') - 3: optional SimClustersSource simclusterSource - 4: optional EntitySource entitySource -}(persisted = 'true', hasPersonalData = 'true') - -struct SimClustersInferredEntities { - 1: required list entities -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/interests.thrift b/src/thrift/com/twitter/simclusters_v2/interests.thrift deleted file mode 100644 index 5c1a04970..000000000 --- a/src/thrift/com/twitter/simclusters_v2/interests.thrift +++ /dev/null @@ -1,259 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.interests -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -/** - * All of the scores below assume that the knownFor vector for each cluster is already - * of unit L2 norm i.e. sum of squares is 1. - **/ -struct UserToInterestedInClusterScores { - // dot product of user's binary follow vector with knownFor vector for this cluster - // TIP: By default, use this score or favScore. - 1: optional double followScore(personalDataType = 'CountOfFollowersAndFollowees') - - // first compute followScore as defined above - // then compute L2 norm of the vector of these scores for this cluster - // divide by that. 
- // essentially the more people are interested in this cluster, the lower this score gets - // TIP: Use this score if your use case needs to penalize clusters that a lot of other - // users are also interested in - 2: optional double followScoreClusterNormalizedOnly(personalDataType = 'CountOfFollowersAndFollowees') - - // dot product of user's producer normalized follow vector and knownFor vector for this cluster - // i.e. i^th entry in the normalized follow vector = 1.0/sqrt(number of followers of user i) - // TIP: Use this score if your use case needs to penalize clusters where the users known for - // that cluster are popular. - 3: optional double followScoreProducerNormalizedOnly(personalDataType = 'CountOfFollowersAndFollowees') - - // first compute followScoreProducerNormalizedOnly - // then compute L2 norm of the vector of these scores for this cluster - // divide by that. - // essentially the more people are interested in this cluster, the lower this score gets - // TIP: Use this score if your use case needs to penalize both clusters that a lot of other - // users are interested in, as well as clusters where the users known for that cluster are - // popular. - 4: optional double followScoreClusterAndProducerNormalized(personalDataType = 'CountOfFollowersAndFollowees') - - // dot product of user's favScoreHalfLife100Days vector with knownFor vector for this cluster - // TIP: By default, use this score or followScore. - 5: optional double favScore(personalDataType = 'EngagementsPublic') - - // first compute favScore as defined above - // then compute L2 norm of the vector of these scores for this cluster - // divide by that. - // essentially the more people are interested in this cluster, the lower this score gets - // TIP: Use this score if your use case needs to penalize clusters that a lot of other - // users are also interested in - 6: optional double favScoreClusterNormalizedOnly(personalDataType = 'EngagementsPublic') - - // dot product of user's favScoreHalfLife100DaysNormalizedByNeighborFaversL2 vector with - // knownFor vector for this cluster - // TIP: Use this score if your use case needs to penalize clusters where the users known for - // that cluster are popular. - 7: optional double favScoreProducerNormalizedOnly(personalDataType = 'EngagementsPublic') - - // first compute favScoreProducerNormalizedOnly as defined above - // then compute L2 norm of the vector of these scores for this cluster - // divide by that. - // essentially the more people are interested in this cluster, the lower this score gets - // TIP: Use this score if your use case needs to penalize both clusters that a lot of other - // users are interested in, as well as clusters where the users known for that cluster are - // popular. - 8: optional double favScoreClusterAndProducerNormalized(personalDataType = 'EngagementsPublic') - - // list of users who're known for this cluster as well as are being followed by the user. - 9: optional list usersBeingFollowed(personalDataType = 'UserId') - - // list of users who're known for this cluster as well as were faved at some point by the user. - 10: optional list usersThatWereFaved(personalDataType = 'UserId') - - // A pretty close upper bound on the number of users who are interested in this cluster. - // Useful to know if this is a niche community or a popular topic. 
- 11: optional i32 numUsersInterestedInThisClusterUpperBound
-
- // dot product of user's logFavScore vector with knownFor vector for this cluster
- // TIP: this score is under experimentation
- 12: optional double logFavScore(personalDataType = 'EngagementsPublic')
-
- // first compute logFavScore as defined above
- // then compute L2 norm of the vector of these scores for this cluster
- // divide by that.
- // essentially the more people are interested in this cluster, the lower this score gets
- // TIP: this score is under experimentation
- 13: optional double logFavScoreClusterNormalizedOnly(personalDataType = 'EngagementsPublic')
-
- // actual count of users who are known for this cluster and are also followed by the user.
- 14: optional i32 numUsersBeingFollowed
-
- // actual count of users who are known for this cluster and were faved at some point by the user.
- 15: optional i32 numUsersThatWereFaved
-}(persisted = 'true', hasPersonalData = 'true')
-
-struct UserToInterestedInClusters {
- 1: required i64 userId(personalDataType = 'UserId')
- 2: required string knownForModelVersion
- 3: required map clusterIdToScores(personalDataTypeKey = 'InferredInterests')
-}(persisted="true", hasPersonalData = 'true')
-
-struct LanguageToClusters {
- 1: required string language
- 2: required string knownForModelVersion
- 3: required map clusterIdToScores(personalDataTypeKey = 'InferredInterests')
-}(persisted="true", hasPersonalData = 'true')
-
-struct ClustersUserIsInterestedIn {
- 1: required string knownForModelVersion
- 2: required map clusterIdToScores(personalDataTypeKey = 'InferredInterests')
-}(persisted = 'true', hasPersonalData = 'true')
-
-struct UserToKnownForClusters {
- 1: required i64 userId(personalDataType = 'UserId')
- 2: required string knownForModelVersion
- 3: required map clusterIdToScores(personalDataTypeKey = 'InferredInterests')
-}(persisted="true", hasPersonalData = 'true')
-
-struct UserToKnownForClusterScores {
- 1: optional double knownForScore
-}(persisted = 'true', hasPersonalData = 'false')
-
-struct ClustersUserIsKnownFor {
- 1: required string knownForModelVersion
- 2: required map clusterIdToScores(personalDataTypeKey = 'InferredInterests')
-}(persisted = 'true', hasPersonalData = 'true')
-
-/** Thrift struct for storing quantile bounds output by QTreeMonoid in Algebird */
-struct QuantileBounds {
- 1: required double lowerBound
- 2: required double upperBound
-}(persisted = 'true', hasPersonalData = 'false')
-
-/** Thrift struct giving the details of the distribution of a set of doubles */
-struct DistributionDetails {
- 1: required double mean
- 2: optional double standardDeviation
- 3: optional double min
- 4: optional QuantileBounds p25
- 5: optional QuantileBounds p50
- 6: optional QuantileBounds p75
- 7: optional QuantileBounds p95
- 8: optional double max
-}(persisted = 'true', hasPersonalData = 'false')
-
-/** Note that the modelVersion here is specified elsewhere, specifically as part of the key */
-struct ClusterNeighbor {
- 1: required i32 clusterId
- /** Note that followCosineSimilarity is the same as the dot product over followScoreClusterNormalizedOnly
- * since those scores form a unit vector **/
- 2: optional double followCosineSimilarity
- /** Note that favCosineSimilarity is the same as the dot product over favScoreClusterNormalizedOnly
- * since those scores form a unit vector **/
- 3: optional double favCosineSimilarity
- /** Note that logFavCosineSimilarity is the same as the dot product over
logFavScoreClusterNormalizedOnly - * since those scores form a unit vector **/ - 4: optional double logFavCosineSimilarity -}(persisted = 'true', hasPersonalData = 'false') - -/** Useful for storing the list of users known for a cluster */ -struct UserWithScore { - 1: required i64 userId(personalDataType = 'UserId') - 2: required double score -}(persisted="true", hasPersonalData = 'true') - -// deprecated -struct EdgeCut { - 1: required double cutEdges - 2: required double totalVolume -}(persisted = 'true', hasPersonalData = 'false') - -struct ClusterQuality { - // deprecated - 1: optional EdgeCut deprecated_unweightedEdgeCut - // deprecated - 2: optional EdgeCut deprecated_edgeWeightedCut - // deprecated - 3: optional EdgeCut deprecated_nodeAndEdgeWeightedCut - - // correlation of actual weight of (u, v) with I(u & v in same cluster) * score(u) * score(v) - 4: optional double weightAndProductOfNodeScoresCorrelation - - // fraction of edges staying inside cluster divided by total edges from nodes in the cluster - 5: optional double unweightedRecall - - // fraction of edge weights staying inside cluster divided by total edge weights from nodes in the cluster - 6: optional double weightedRecall - - // total edges from nodes in the cluster - 7: optional double unweightedRecallDenominator - - // total edge weights from nodes in the cluster - 8: optional double weightedRecallDenominator - - // sum of edge weights inside cluster / { #nodes * (#nodes - 1) } - 9: optional double relativePrecisionNumerator - - // above divided by the sum of edge weights in the total graph / { n * (n - 1) } - 10: optional double relativePrecision -}(persisted = 'true', hasPersonalData = 'false') - -/** -* This struct is the value of the ClusterDetails key-value dataset. -* The key is (modelVersion, clusterId) -**/ -struct ClusterDetails { - 1: required i32 numUsersWithAnyNonZeroScore - 2: required i32 numUsersWithNonZeroFollowScore - 3: required i32 numUsersWithNonZeroFavScore - 4: optional DistributionDetails followScoreDistributionDetails - 5: optional DistributionDetails favScoreDistributionDetails - 6: optional list knownForUsersAndScores - 7: optional list neighborClusters - // fraction of users who're known for this cluster who're marked NSFW_User in UserSource - 8: optional double fractionKnownForMarkedNSFWUser - // the major languages that this cluster's known_fors have as their "language" field in - // UserSource, and the fractions - 9: optional map languageToFractionDeviceLanguage - // the major country codes that this cluster's known_fors have as their "account_country_code" - // field in UserSource, and the fractions - 10: optional map countryCodeToFractionKnownForWithCountryCode - 11: optional ClusterQuality qualityMeasuredOnSimsGraph - 12: optional DistributionDetails logFavScoreDistributionDetails - // fraction of languages this cluster's known_fors produce based on what penguin_user_languages dataset infers - 13: optional map languageToFractionInferredLanguage -}(persisted="true", hasPersonalData = 'true') - -struct SampledEdge { - 1: required i64 followerId(personalDataType = 'UserId') - 2: required i64 followeeId(personalDataType = 'UserId') - 3: optional double favWtIfFollowEdge - 4: optional double favWtIfFavEdge - 5: optional double followScoreToCluster - 6: optional double favScoreToCluster - 7: optional double predictedFollowScore - 8: optional double predictedFavScore -}(persisted="true", hasPersonalData = 'true') - -/** -* The key here is (modelVersion, clusterId) -**/ -struct 
BipartiteClusterQuality { - 1: optional double inClusterFollowEdges - 2: optional double inClusterFavEdges - 3: optional double favWtSumOfInClusterFollowEdges - 4: optional double favWtSumOfInClusterFavEdges - 5: optional double outgoingFollowEdges - 6: optional double outgoingFavEdges - 7: optional double favWtSumOfOutgoingFollowEdges - 8: optional double favWtSumOfOutgoingFavEdges - 9: optional double incomingFollowEdges - 10: optional double incomingFavEdges - 11: optional double favWtSumOfIncomingFollowEdges - 12: optional double favWtSumOfIncomingFavEdges - 13: optional i32 interestedInSize - 14: optional list sampledEdges - 15: optional i32 knownForSize - 16: optional double correlationOfFavWtIfFollowWithPredictedFollow - 17: optional double correlationOfFavWtIfFavWithPredictedFav - 18: optional double relativePrecisionUsingFavWtIfFav - 19: optional double averagePrecisionOfWholeGraphUsingFavWtIfFav -}(persisted="true", hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/multi_type_graph.thrift b/src/thrift/com/twitter/simclusters_v2/multi_type_graph.thrift deleted file mode 100644 index f7dee7381..000000000 --- a/src/thrift/com/twitter/simclusters_v2/multi_type_graph.thrift +++ /dev/null @@ -1,110 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.multi_type_graph -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "entity.thrift" - -union LeftNode { - 1: i64 userId(personalDataType = 'UserId') -}(persisted = 'true', hasPersonalData = 'true') - -struct RightNode { - 1: required RightNodeType rightNodeType(personalDataType = 'EngagementsPublic') - 2: required Noun noun -}(persisted = 'true', hasPersonalData = 'true') - -struct RightNodeWithEdgeWeight { - 1: required RightNode rightNode - 2: required double weight(personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -enum RightNodeType { - FollowUser = 1, - FavUser = 2, - BlockUser = 3, - AbuseReportUser = 4, - SpamReportUser = 5, - FollowTopic = 6, - SignUpCountry = 7, - ConsumedLanguage = 8, - FavTweet = 9, - ReplyTweet = 10, - RetweetTweet = 11, - NotifOpenOrClickTweet = 12, - SearchQuery = 13 -}(persisted = 'true') - -union Noun { -// Note: Each of the following needs to have an ordering defined in Ordering[Noun] -// in file: multi_type_graph/assemble_multi_type_graph/AssembleMultiTypeGraph.scala -// Please take note to make changes to Ordering[Noun] when modifying/adding new noun type here - 1: i64 userId(personalDataType = 'UserId') - 2: string country(personalDataType = 'InferredCountry') - 3: string language(personalDataType = 'InferredLanguage') - 4: i64 topicId(personalDataType = 'TopicFollow') - 5: i64 tweetId(personalDataType = 'TweetId') - 6: string query(personalDataType = 'SearchQuery') -}(persisted = 'true', hasPersonalData = 'true') - -struct RightNodeWithEdgeWeightList { - 1: required list rightNodeWithEdgeWeightList -}(persisted = 'true', hasPersonalData = 'true') - -struct NounWithFrequency { - 1: required Noun noun - 2: required double frequency (personalDataType = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') - -struct NounWithFrequencyList { - 1: required list nounWithFrequencyList -}(persisted = 'true', hasPersonalData = 'true') - -struct RightNodeTypeStruct { - 1: required RightNodeType rightNodeType -}(persisted = 'true', hasPersonalData = 'false') - -struct MultiTypeGraphEdge{ - 1: required LeftNode leftNode - 2: 
required RightNodeWithEdgeWeight rightNodeWithEdgeWeight -}(persisted = 'true', hasPersonalData = 'true') - -struct LeftNodeToRightNodeWithEdgeWeightList{ - 1: required LeftNode leftNode - 2: required RightNodeWithEdgeWeightList rightNodeWithEdgeWeightList -}(persisted = 'true', hasPersonalData = 'true') - -struct RightNodeSimHashSketch { - 1: required RightNode rightNode - 2: required list simHashOfEngagers - 3: optional double normalizer -}(persisted='true', hasPersonalData = 'false') - -struct SimilarRightNode { - 1: required RightNode rightNode - 2: required double score (personalDataType = 'EngagementScore') -}(persisted='true', hasPersonalData = 'true') - -struct SimilarRightNodes { - 1: required list rightNodesWithScores -}(persisted='true', hasPersonalData = 'true') - -struct RightNodeWithScore { - 1: required RightNode rightNode - 2: required double clusterScore (personalDataType = 'EngagementScore') -}(persisted='true', hasPersonalData = 'true') - -struct RightNodeWithScoreList { - 1: required list rightNodeWithScoreList -}(persisted='true', hasPersonalData = 'true') - -struct RightNodeWithClusters { - 1: required RightNode rightNode - 2: required string modelVersion (personalDataType = 'EngagementId') - 3: required map clusterIdToScores (personalDataTypeKey = 'EngagementId', personalDataTypeValue = 'EngagementScore') -}(persisted="true", hasPersonalData = 'true') - -struct ModelVersionWithClusterScores { - 1: required string modelVersion (personalDataType = 'EngagementId') - 2: required map clusterIdToScores (personalDataTypeKey = 'EngagementId', personalDataTypeValue = 'EngagementScore') -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/offline_job_internal.thrift b/src/thrift/com/twitter/simclusters_v2/offline_job_internal.thrift deleted file mode 100644 index 257ef1f99..000000000 --- a/src/thrift/com/twitter/simclusters_v2/offline_job_internal.thrift +++ /dev/null @@ -1,63 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.offline_job_internal -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "com/twitter/algebird_internal/algebird.thrift" - -// For internal usage only. Mainly for offline_evaluation. -// Deprecated. 
Please use 'online_store/ModelVersion' -enum PersistedModelVersion { - MODEL_20M_145K_dec11 = 1, - MODEL_20M_145K_updated = 2, - MODEL_20M_145K_2020 = 3, - RESERVED_4 = 4, - RESERVED_5 = 5 -}(persisted = 'true', hasPersonalData = 'false') - -enum PersistedScoreType { - NORMALIZED_FAV_8_HR_HALF_LIFE = 1, - NORMALIZED_FOLLOW_8_HR_HALF_LIFE = 2, - NORMALIZED_LOG_FAV_8_HR_HALF_LIFE = 3, - RESERVED_4 = 4, - RESERVED_5 = 5 -}(persisted = 'true', hasPersonalData = 'false') - -struct PersistedScores { - 1: optional algebird.DecayedValue score -}(persisted = 'true', hasPersonalData = 'false') - -struct TweetAndClusterScores { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: required i32 clusterId(personalDataType = 'InferredInterests') - 3: required PersistedModelVersion modelVersion - 4: required PersistedScores scores(personalDataType = 'EngagementScore') - 5: optional PersistedScoreType scoreType -}(persisted="true", hasPersonalData = 'true') - -struct TweetTopKClustersWithScores { - 1: required i64 tweetId(personalDataType = 'TweetId') - 2: required PersistedModelVersion modelVersion - 3: required map topKClusters(personalDataTypeKey = 'InferredInterests') - 4: optional PersistedScoreType scoreType -}(persisted="true", hasPersonalData = 'true') - -struct ClusterTopKTweetsWithScores { - 1: required i32 clusterId(personalDataType = 'InferredInterests') - 2: required PersistedModelVersion modelVersion - 3: required map topKTweets(personalDataTypeKey = 'TweetId') - 4: optional PersistedScoreType scoreType -}(persisted = 'true', hasPersonalData = 'true') - -struct QueryAndClusterScores { - 1: required string query(personalDataType = 'SearchQuery') - 2: required i32 clusterId - 3: required PersistedModelVersion modelVersion - 4: required PersistedScores scores -}(persisted = 'true', hasPersonalData = 'true') - -struct QueryTopKClustersWithScores { - 1: required string query(personalDataType = 'SearchQuery') - 2: required PersistedModelVersion modelVersion - 3: required map topKClusters -}(persisted = 'true', hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/online_store.thrift b/src/thrift/com/twitter/simclusters_v2/online_store.thrift deleted file mode 100644 index fb5aff6ad..000000000 --- a/src/thrift/com/twitter/simclusters_v2/online_store.thrift +++ /dev/null @@ -1,92 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.online_store -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "entity.thrift" -include "com/twitter/algebird_internal/algebird.thrift" - -/** - * A SimClusters model version. - **/ -enum ModelVersion { - MODEL_20M_145K_dec11 = 1, // DEPRECATED - MODEL_20M_145K_updated = 2, // DEPRECATED - MODEL_20M_145K_2020 = 3, - RESERVED_4 = 4, - RESERVED_5 = 5, - RESERVED_6 = 6 -}(persisted = 'true', hasPersonalData = 'false') - -/** - * Uniquely identifies a SimCluster. All fields are required as this is used as a memcache key. - **/ -struct FullClusterId { - 1: required ModelVersion modelVersion - 2: required i32 clusterId -}(persisted='true', hasPersonalData = 'false') - -/** - * Contains a set of scores per cluster. - **/ -struct Scores { - 1: optional algebird.DecayedValue favClusterNormalized8HrHalfLifeScore - 2: optional algebird.DecayedValue followClusterNormalized8HrHalfLifeScore -}(hasPersonalData = 'false') - -/** - * A combination of entity and model. All fields are required as this is used as a memcache key. 
- **/
-struct EntityWithVersion {
- 1: required entity.SimClusterEntity entity
- 2: required ModelVersion version
-}(hasPersonalData = 'true')
-
-/**
- * Contains top K clusters with corresponding scores. We're representing clusters purely using ints, and
- * omitting the modelVersion, since that is included in the memcache key.
- **/
-struct TopKClustersWithScores {
- 1: optional map topClustersByFavClusterNormalizedScore(personalDataTypeKey = 'InferredInterests')
- 2: optional map topClustersByFollowClusterNormalizedScore(personalDataTypeKey = 'InferredInterests')
-}(hasPersonalData = 'true')
-
-/**
- * Contains top K text entities with corresponding scores. We're omitting the modelVersion,
- * since that is included in the memcache key.
- **/
-struct TopKEntitiesWithScores {
- 1: optional map topEntitiesByFavClusterNormalizedScore
- 2: optional map topEntitiesByFollowClusterNormalizedScore
-}(hasPersonalData = 'true')
-
-/**
- * Contains top K tweets with corresponding scores. We're omitting the modelVersion,
- * since that is included in the memcache key.
- **/
-struct TopKTweetsWithScores {
- 1: optional map topTweetsByFavClusterNormalizedScore(personalDataTypeKey='TweetId')
- 2: optional map topTweetsByFollowClusterNormalizedScore(personalDataTypeKey='TweetId')
-}(hasPersonalData = 'true')
-
-/**
- * Contains FullClusterId and the corresponding top K tweets and scores.
- **/
-struct ClusterIdToTopKTweetsWithScores {
- 1: required FullClusterId clusterId
- 2: required TopKTweetsWithScores topKTweetsWithScores
-}(hasPersonalData = 'true')
-
-/**
- * Contains a map of Model Version to top K clusters with corresponding scores.
- **/
-struct MultiModelTopKClustersWithScores {
- 1: optional map multiModelTopKClustersWithScores
-}(hasPersonalData = 'true')
-
-/**
- * Contains a map of Model Version to top K tweets with corresponding scores.
- **/
-struct MultiModelTopKTweetsWithScores {
- 1: optional map multiModelTopKTweetsWithScores
-}(hasPersonalData = 'true')
diff --git a/src/thrift/com/twitter/simclusters_v2/online_store_internal.thrift b/src/thrift/com/twitter/simclusters_v2/online_store_internal.thrift
deleted file mode 100644
index b5fd6afb9..000000000
--- a/src/thrift/com/twitter/simclusters_v2/online_store_internal.thrift
+++ /dev/null
@@ -1,30 +0,0 @@
-namespace java com.twitter.simclusters_v2.thriftjava
-namespace py gen.twitter.simclusters_v2.online_store_internal
-#@namespace scala com.twitter.simclusters_v2.thriftscala
-#@namespace strato com.twitter.simclusters_v2
-
-include "online_store.thrift"
-
-/**
- * Contains a hash bucket of the clusterId along with the Model Version.
- * All fields are required as this is used as a memcache key.
- **/
-struct FullClusterIdBucket {
- 1: required online_store.ModelVersion modelVersion
- // (hash(clusterId) mod NUM_BUCKETS_XXXXXX)
- 2: required i32 bucket
-}(hasPersonalData = 'false')
-
-/**
- * Contains scores per cluster. The model is not stored here as it's encoded into the memcache key.
- **/
-struct ClustersWithScores {
- 1: optional map clustersToScore(personalDataTypeKey = 'InferredInterests')
-}(hasPersonalData = 'true')
-
-/**
- * Contains a map of model version to scores per cluster.
- **/
-struct MultiModelClustersWithScores {
- 1: optional map multiModelClustersWithScores
-}(hasPersonalData = 'true')
diff --git a/src/thrift/com/twitter/simclusters_v2/score.thrift b/src/thrift/com/twitter/simclusters_v2/score.thrift
deleted file mode 100644
index 8ee20e72c..000000000
--- a/src/thrift/com/twitter/simclusters_v2/score.thrift
+++ /dev/null
@@ -1,71 +0,0 @@
-namespace java com.twitter.simclusters_v2.thriftjava
-namespace py gen.twitter.simclusters_v2.score
-#@namespace scala com.twitter.simclusters_v2.thriftscala
-#@namespace strato com.twitter.simclusters_v2
-
-include "com/twitter/simclusters_v2/embedding.thrift"
-include "com/twitter/simclusters_v2/identifier.thrift"
-
-/**
- * The algorithm type that identifies the scoring algorithm.
- * Assume that an algorithm supports one and only one kind
- * of [[ScoreInternalId]]
- **/
-enum ScoringAlgorithm {
- // Reserve 0001 - 999 for Basic Pairwise Scoring Calculation
- PairEmbeddingDotProduct = 1,
- PairEmbeddingCosineSimilarity = 2,
- PairEmbeddingJaccardSimilarity = 3,
- PairEmbeddingEuclideanDistance = 4,
- PairEmbeddingManhattanDistance = 5,
- PairEmbeddingLogCosineSimilarity = 6,
- PairEmbeddingExpScaledCosineSimilarity = 7,
-
- // Reserve 1000 - 1999 for Tweet Similarity Model
- TagSpaceCosineSimilarity = 1000,
- WeightedSumTagSpaceRankingExperiment1 = 1001, //deprecated
- WeightedSumTagSpaceRankingExperiment2 = 1002, //deprecated
- WeightedSumTagSpaceANNExperiment = 1003, //deprecated
-
- // Reserved 10001 - 20000 for Aggregate scoring
- WeightedSumTopicTweetRanking = 10001,
- CortexTopicTweetLabel = 10002,
- // Reserved 20001 - 30000 for Topic Tweet scores
- CertoNormalizedDotProductScore = 20001,
- CertoNormalizedCosineScore = 20002
-}(hasPersonalData = 'false')
-
-/**
- * The identifier type for the score between a pair of SimClusters Embeddings.
- * Used as the persistent key of a SimClustersEmbedding score.
- * Supports scores between different [[EmbeddingType]] / [[ModelVersion]]
- **/
-struct SimClustersEmbeddingPairScoreId {
- 1: required identifier.SimClustersEmbeddingId id1
- 2: required identifier.SimClustersEmbeddingId id2
-}(hasPersonalData = 'true')
-
-/**
- * The identifier type for the score between a pair of InternalId.
- **/ -struct GenericPairScoreId { - 1: required identifier.InternalId id1 - 2: required identifier.InternalId id2 -}(hasPersonalData = 'true') - -union ScoreInternalId { - 1: GenericPairScoreId genericPairScoreId - 2: SimClustersEmbeddingPairScoreId simClustersEmbeddingPairScoreId -} - -/** - * A uniform Identifier type for all kinds of Calculation Score - **/ -struct ScoreId { - 1: required ScoringAlgorithm algorithm - 2: required ScoreInternalId internalId -}(hasPersonalData = 'true') - -struct Score { - 1: required double score -}(hasPersonalData = 'false') diff --git a/src/thrift/com/twitter/simclusters_v2/simclusters_presto.thrift b/src/thrift/com/twitter/simclusters_v2/simclusters_presto.thrift deleted file mode 100644 index 93eae6c62..000000000 --- a/src/thrift/com/twitter/simclusters_v2/simclusters_presto.thrift +++ /dev/null @@ -1,59 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.simclusters_presto -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "embedding.thrift" -include "identifier.thrift" -include "interests.thrift" -include "online_store.thrift" - -/** - * This struct is the presto-compatible "lite" version of the ClusterDetails thrift - */ -struct ClusterDetailsLite { - 1: required online_store.FullClusterId fullClusterId - 2: required i32 numUsersWithAnyNonZeroScore - 3: required i32 numUsersWithNonZeroFollowScore - 4: required i32 numUsersWithNonZeroFavScore - 5: required list knownForUsersAndScores -}(persisted="true", hasPersonalData = 'true') - -struct EmbeddingsLite { - 1: required i64 entityId - 2: required i32 clusterId - 3: required double score -}(persisted="true", hasPersonalData = 'true') - -struct SimClustersEmbeddingWithId { - 1: required identifier.SimClustersEmbeddingId embeddingId - 2: required embedding.SimClustersEmbedding embedding -}(persisted="true", hasPersonalData = 'true') - -struct InternalIdEmbeddingWithId { - 1: required identifier.SimClustersEmbeddingId embeddingId - 2: required embedding.InternalIdEmbedding embedding -}(persisted="true", hasPersonalData = 'true') - -/** -* This struct is the presto-compatible version of the fav_tfg_topic_embeddings -*/ -struct ClustersScore { - 1: required i64 clusterId(personalDataType = 'SemanticcoreClassification') - 2: required double score(personalDataType = 'EngagementScore') -}(persisted="true", hasPersonalData = 'true') - -struct FavTfgTopicEmbeddings { - 1: required identifier.TopicId topicId - 2: required list clusterScore -}(persisted="true", hasPersonalData = 'true') - -struct TfgTopicEmbeddings { - 1: required identifier.TopicId topicId - 2: required list clusterScore -}(persisted="true", hasPersonalData = 'true') - -struct UserTopicWeightedEmbedding { - 1: required i64 userId(personalDataType = 'UserId') - 2: required list clusterScore -}(persisted="true", hasPersonalData = 'true') diff --git a/src/thrift/com/twitter/simclusters_v2/top_k_map.thrift b/src/thrift/com/twitter/simclusters_v2/top_k_map.thrift deleted file mode 100644 index 013215cec..000000000 --- a/src/thrift/com/twitter/simclusters_v2/top_k_map.thrift +++ /dev/null @@ -1,14 +0,0 @@ -namespace java com.twitter.simclusters_v2.thriftjava -namespace py gen.twitter.simclusters_v2.top_k_map -#@namespace scala com.twitter.simclusters_v2.thriftscala -#@namespace strato com.twitter.simclusters_v2 - -include "com/twitter/algebird_internal/algebird.thrift" - -struct TopKClusters { - 1: required map topK(personalDataTypeKey 
= 'InferredInterests')
-}(hasPersonalData = 'true')
-
-struct TopKTweets {
- 1: required map topK(personalDataTypeKey='TweetId')
-}(hasPersonalData = 'true')
diff --git a/src/thrift/com/twitter/simclusters_v2/tweet_similarity.thrift b/src/thrift/com/twitter/simclusters_v2/tweet_similarity.thrift
deleted file mode 100644
index 3debd40f0..000000000
--- a/src/thrift/com/twitter/simclusters_v2/tweet_similarity.thrift
+++ /dev/null
@@ -1,16 +0,0 @@
-namespace java com.twitter.simclusters_v2.thriftjava
-namespace py gen.twitter.simclusters_v2.tweet_similarity
-#@namespace scala com.twitter.simclusters_v2.thriftscala
-#@namespace strato com.twitter.simclusters_v2
-
-struct FeaturedTweet {
- 1: required i64 tweetId(personalDataType = 'TweetId')
- # timestamp when the user engaged with or was impressed by the tweet
- 2: required i64 timestamp(personalDataType = 'PrivateTimestamp')
-}(persisted = 'true', hasPersonalData = 'true')
-
-struct LabelledTweetPairs {
- 1: required FeaturedTweet queryFeaturedTweet
- 2: required FeaturedTweet candidateFeaturedTweet
- 3: required bool label
-}(persisted = 'true', hasPersonalData = 'true')
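The pairwise scoring semantics described in the deleted `score.thrift` and `interests.thrift` comments above (dot products over cluster-score vectors, with L2 normalization so that cosine similarity reduces to a plain dot product over normalized scores) can be made concrete with a small standalone sketch. The Scala below is not the removed production code: the object name `PairEmbeddingScores`, the sparse `Map[Int, Double]` representation, and the sample values are illustrative assumptions only.

```scala
// Minimal sketch of PairEmbeddingDotProduct / PairEmbeddingCosineSimilarity over
// sparse SimClusters-style embeddings, assuming a clusterId -> score map representation.
object PairEmbeddingScores {

  type SparseEmbedding = Map[Int, Double] // clusterId -> score

  def dotProduct(a: SparseEmbedding, b: SparseEmbedding): Double = {
    // Iterate over the smaller map; clusters absent from the other map contribute 0.
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (clusterId, score) => score * large.getOrElse(clusterId, 0.0) }.sum
  }

  def l2Norm(a: SparseEmbedding): Double =
    math.sqrt(a.valuesIterator.map(s => s * s).sum)

  def cosineSimilarity(a: SparseEmbedding, b: SparseEmbedding): Double = {
    val denom = l2Norm(a) * l2Norm(b)
    if (denom == 0.0) 0.0 else dotProduct(a, b) / denom
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical cluster ids and scores, purely for illustration.
    val tweetEmbedding: SparseEmbedding = Map(101 -> 0.7, 205 -> 0.2, 310 -> 0.1)
    val userInterestedIn: SparseEmbedding = Map(101 -> 0.5, 310 -> 0.4, 999 -> 0.3)
    println(f"dot product:       ${dotProduct(tweetEmbedding, userInterestedIn)}%.4f")
    println(f"cosine similarity: ${cosineSimilarity(tweetEmbedding, userInterestedIn)}%.4f")
  }
}
```

A sparse map mirrors the thrift maps above: only a handful of clusters have non-zero scores for any given user or tweet, so iterating over the smaller map keeps the pairwise computation cheap. It also makes the `ClusterNeighbor` comment concrete: once both score vectors are L2-normalized, `cosineSimilarity` and `dotProduct` return the same value.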