Wikimedia Technical Conference/2018/Session notes/Identifying our storage and search use cases

Session Setup

Theme: Increasing our technical capabilities to achieve our strategy

Type: Technical Challenges

Facilitation Exercise(s):

Leader(s): Marko, Brandon

Facilitator: Birgit

Scribe: Michael, Nick

Description: Storage has increasingly been an issue for implementing various features of our platform. In this session, the focus is to identify what use cases have been problematic to implement storage for and what issues we have encountered when trying to scale or expand our current storage solutions.

Questions to answer during this session

Question Significance:

Why is this question important? What is blocked by it remaining unanswered?

What use cases that have required either new or increased storage have been problematic to implement over the past year, and why? What new use cases are we trying to serve now? In order to solve storage issues we need to understand what problems we are running into currently and have run into in the past.
Which use cases require curatable storage (history / revisions) and which only require “regular” storage? Storing curatable content has larger implications for the storage infrastructure.
Which use cases benefit from central storage or a central search index versus per-project storage (like authentication)? We have some use cases where we need to look up data across projects (per user, shared templates, etc.). Identifying these will help design storage and search to suit them.
Given the answers to the above, what “gaps” do we have in our existing storage solutions? Solving specific issues first is important, but we should also know where other gaps lie so that we can solve storage cohesively.
How do we decide if we need to better model our data structures for queries within a storage solution or use Elastic to solve querying? Do graph databases make sense as a way to model our data sets more closely? Many scaling issues relate to the revision table. Adding new types of content, especially anything that relates to ML use cases or results in more pages, can cause scaling concerns.

Wikidata has a data model which does not fit neatly into our MySQL architecture.

Break-outs:

revisioned storage (storage with history needs)

key/value and blob storage

search-ability and indexing

Questions:

Are there cases where we used X but should use something different?

new/planned use cases

growth of storage size

guidelines for each type

Attendees list

  • Alice, Bob, ...

Structured notes

There are five sections to the notes:

  1. Questions and answers: Answers the questions of the session
  2. Features and goals: What we should do based on the answers to the questions of this session
  3. Important decisions to make: Decisions which block progress in this area
  4. Action items: Next actions to take from this session
  5. New questions: New questions revealed during this session

Questions and answers

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer, instead add them to the new questions section)

Q:
A (If you were unable to answer the question, why not?):
Q:
A (If you were unable to answer the question, why not?):
Q:
A (If you were unable to answer the question, why not?):

Features and goals

For Use Case and Product Sessions:

Given your discussion of the topic and answering questions for this session, list one or more user stories or user facing features that we should strive to deliver

1.
Why should we do this? What is blocking it? Who is responsible?
2.
Why should we do this? What is blocking it? Who is responsible?
3.
Why should we do this? What is blocking it? Who is responsible?

Important decisions to make

What are the most important decisions that need to be made regarding this topic?
1. identify all needed types of storage
Why is this important?

guide product decisions

guide tech architecture to suit said decisions

What is it blocking?

new products/services/features with heavy data requirements

Who is responsible?

product owners

2. set boundaries/expectations for each type of storage
Why is this important?

meet product requirements

allow for sustainable storage growth

What is it blocking?

guidelines documents

Who is responsible?

tech

3. compile the guidelines for product and tech
Why is this important? What is it blocking? Who is responsible?

Action items

What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
  1. Document expectations on a page that product owners can be directed to, so that they can create and document their storage requirements.
Why is this important?

Product needs to know the limits / expectations

Technology needs to account for meeting these expectations when architecting solutions

What is it blocking?

Successful deployment of new products features and services

Who is responsible?

Marko

2. Research alternative types of storage (e.g., graph)
Why is this important?

Because we are hitting scalability issues with current solutions.

We need to see if there are better solutions for our use cases.

For new products that require different types of storage.

What is it blocking?

New products (e.g., dependency graph)

Who is responsible?

Core Platform / SRE

3. Use the aforementioned document when designing new products that use storage solutions
Why is this important?

To keep our environment sustainable

What is it blocking?

New products

Who is responsible?

product owners

New questions

What new questions did you uncover while discussing this topic?
1.
Why is this important? What is it blocking? Who is responsible?
2.
Why is this important? What is it blocking? Who is responsible?
3.
Why is this important? What is it blocking? Who is responsible?

Detailed notes

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.

  • MO: Two parts to today’s session, first hour focusing on use cases and constraints, with breakout groups discussing:
    • storage that needs history;
    • k-v blob storage;
    • searching and indexing needs.  
  • We want to identify the semantics that products need rather than specifying the particular technologies that they should use.
    • are there cases where we used X but should use something different?
    • new/planned use cases?
    • growth of storage size?
    • guidelines for each type?
  • E.g., if I have a piece of data, does it have to exist forever? Does it have to be retrieved fast?
  • In the second part of the hour, we’ll try to come up with a set of guidelines about which storage products need for their use cases.
  • Q (TheDJ): When we’re talking about products and not specific techs, what are we really talking about?
  • BB: We are talking about real use cases; what we want to talk about is the needs of storage platforms to support real specific use cases
  • Not Cassandra specifics or Postgres specifics, but the known constraints for specific use cases.
  • TheDJ: Is JADE a use case, or is review in general a use case?
  • Indexing requirements and
  • Daniel: JADE is (is not?) probably a good use case to discuss
  • Tim: We probably want to talk about new use cases.  Are there PMs here who need guidance on how to design storage for a new feature, or are we reevaluating what we have now?
  • BB: This would be a good time for them to be here to outline their needs!
  • AdamB: Take a look at the 6 themes for audiences in 3-5 year planning, and think “oh, that use case”.
  • Josh: Customization of the user experience involves a lot of storage of user data that we don’t currently have good systems for storing.  Example: personalized reading lists.
  • BB: Let’s start enumerating rough use cases/requirements.

GROUP 1 - Revision storage

  • Defining
  • Lydia: What do we mean by revision storage exactly?
  • AH: Curatable is one definition of that; you can revert to history, suppress things
  • AH: Anything we want to curate that doesn’t need versions?
    • Code
    • Logs (MO: These are timestamped not versioned)
  • Various types of derivative data that follow revisions that are not really revertible
    • E.g., page props ()
  • MB: Isn’t this just storage with an index on a revision ID?
  • Eran: ?
  • Marko: It’s more about history of a singular object through time rather than a collection of IDs
  • Subbu: Does parsoid HTML fall into this?  (Derived, not revertiable) -- yes.
  • Defining use cases
  • AH: Any type of content curation is going to need revision support (see essay by Risker)
  • Marko: We have both developer-driven and user-driven use cases
  • AH: A new type of curated information would create big storage needs
  • Lydia: e.g., 3D objects on commons now
  • MB:
  • MM: Would pageview data be revisioned?   Don’t know about that…
  • Lydia: When you talk about main namespace might need to change, Wikidata comes to mind.  We push blobs into Wikitext fields.
  • AH: I suppose we have use cases for versioning structured content as well as text.
  • MB: We may have to sometimes hide or delete data, mostly for legal reasons.  We’ve talked about Git as a storage system, which makes it impossible to do that.
  • MO: This type of data has a tendency to grow unbounded.  How do we deal with that?
  • If we say it’s not feasible to store all of this indefinitely, how does that change things?
  • Lydia: Growing can be both i
  • Size of the blob doesn't matter to us
  • Ops keeps saying don't worry about blob size, worry about revisions
  • MB: This is not generally true!  
  • Lydia: There is a size limit
  • MB: For Commons media this matters a lot
  • AH: This seems out of scope for rev storage, can we leave it for K-V storage?
  • MO: That's a question when determining whether something should be one or the other. The amount per entity, from a product perspective...
  • MB: Really this is in scope for both. What a revision is and the metadata (creator, etc.) go in revision storage, the actual content in blob storage, then a layer on top to make it curatable
  • AH: Concrete use cases:
  • - JADE (many pages, few revs per page)
  • - structured discussions (same)
  • - Anything that's more revisions per thing in the next 3-5 years
  • - Lydia: Average item on Wikidata has 10 edits. That's more than Wikipedia.
  • - Structured Commons, same.
  • - Subbu - if you want to move to Parsoid HTML being canonical storage
  • - AH: does that affect revision storage?
  • - Subbu: it's just the blob
  • - MB: You can do a lot with compression here.  We do some, but...
  • - Marko: Compression was what was causing most of the wounds in Cassandra, we were so efficient that when loading from disk, it would explode
  • - Lydia: In terms of amount of storage, the migration from inline descriptions to statements in SDC, and
  • - Things mobile is looking at on microcontributions, that's going to need an intermediary storage space
  • - Lydia: Need to have 5 people answer the question before it becomes an edit
  • - MO: We did have at some point the idea of being able to retrieve HTML of any given version of any given article. I personally liked this a lot because of the citability aspect. But growth of storage....
  • - MB: There was a request to double storage because 5-6 projects wanted RESTBase/Cassandra storage
  • - Maps stack: Not versioned, we pull their data once daily and regenerate. Probably won't need versioning for maps in 3-5 years.
  • MO: What do we do with all of this?  OK to have different models for historical content as opposed to current content? (1) wait longer, or (2) get at it in a different way
  • Lydia: No to different way
  • In WD we often get last five revisions, the ones earlier can be in a different format/way of getting it and this makes it harder to write tools.
  • MB: We can abstract around that
  • MO: There is a deeper question about how we backfill historical data
  • AH: This notion of an abstraction is an important one. You could shard based on content type... a completely different MySQL instance could store JADE stuff, since MW doesn't care what DB it's using (see the sharding sketch after these notes)
  • RECOMMENDATION: ABSTRACT PAGE/REVISION - Hot / Cold storage and/or sharding and compound keys
  • MB: Horizontal scalability is an important concept that we need
  • GOAL: HORIZONTAL SCALABILITY
  • MB: In many cases we need to compare relationships between two things
  • AH: How do we get to horizontally scalable but with the ability to compare things where needed?
  • MB: JADE example could easily be in a different DB, but MediaWiki doesn't currently support that.
  • QUESTION: How do we have Horizontal scalability while maintaining relationships?
  • Example: WMCS: Querying many databases, many are querying more than one of them, this causes trouble
  • AH: Need to support Risker’s checklist
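
A minimal sketch of the compound-key idea recommended above, assuming invented names throughout (the shard map, RevisionKey, and the 30-day hot window are illustrative, not actual MediaWiki structures): the content type selects the shard, and the timestamp in the key lets recent revisions sit in low-latency "hot" storage while older ones age out to cheaper "cold" storage.

  # Hypothetical sketch: compound-key routing for revision storage.
  # None of these names reflect the actual MediaWiki schema.
  import time
  from dataclasses import dataclass

  SHARDS_BY_CONTENT_TYPE = {
      "wikitext": "s1",        # classic page content
      "jade": "s-jade",        # JADE judgments on their own shard
      "structured": "s-sdc",   # structured data on Commons
  }

  @dataclass(frozen=True)
  class RevisionKey:
      content_type: str
      page_id: int
      rev_timestamp: int  # seconds since epoch

      def shard(self) -> str:
          # The content type picks the shard; calling code need not care
          # which backing database actually serves it.
          return SHARDS_BY_CONTENT_TYPE[self.content_type]

      def tier(self, hot_window: int = 30 * 86400) -> str:
          # Hot/cold split: recent revisions in low-latency storage,
          # older ones in cheaper archival storage.
          age = time.time() - self.rev_timestamp
          return "hot" if age < hot_window else "cold"

  key = RevisionKey("jade", page_id=123, rev_timestamp=int(time.time()))
  print(key.shard(), key.tier())  # -> s-jade hot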

Summary by Aaron:

  • Use cases split broadly into:
    • Curatable content with revisions
      • Microcontributions
      • Wikidata items

Scale is the biggest problem here. The number of revisions per page and the size of content per revision both play into it. Whatever we do, we need to support Risker’s checklist for curation of content.

    • Things that just need revisions for historical access

GROUP 2 - Key-value storage

  • People: Ian, akosiaris, Tim, DJ, Addshore, Josh Minor, Adam Baso, Alexia
  • Things that get stored:
    • Watchlists
    • Revisions
    • user prefs
    • Echo
    • files/media/pdf/etc
    • external store
    • Jobs
    • sessions
    • Caching
    • parser cache
    • reading lists
    • trending edits
    • OTRS (support ticket management system)
    • Derived media like thumbor, screen shots, etc?
  • Coming
    • Partials
    • Citations (?)
    • Annotations (?)
    • Stuff related to group participation/similar types of user state information (Wikiprojects, etc)
    • Backlogs (personal + group)
  • Dumps - “Not a key/value, because they’re all value and no key” - Tim S
  • Constraints
    • Multi-directional indexing
      • Watchlists
      • Revisions
      • Echo notifications
      • Jobs
      • Partials
      • Group participations
      • backlogs
    • Low latency
      • External store
      • Generic blob cache
      • Session
    • Arbitrary size
      • files/media
      • External store
      • dumps/generic files
    • Durability
      • Revisions
      • files/media
      • External store
      • Reading lists
      • dumps/generic files
    • Immutable history
    • Mostly read
    • Mostly write
    • Blob store: Needs to handle media
    • Blob store: Needs to handle JSON
    • Blob store: Needs to handle JSON including sub-key extraction
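
A minimal sketch of the last constraint above, sub-key extraction from a JSON blob store. The JsonBlobStore class and its get_subkey method are invented for illustration; the point is only that a caller can ask for one field of a stored JSON document instead of fetching and parsing the whole blob, and a real backend could push that extraction server-side.

  # Toy in-memory JSON blob store with sub-key extraction (invented API).
  import json

  class JsonBlobStore:
      def __init__(self):
          self._blobs = {}

      def put(self, key: str, doc: dict) -> None:
          self._blobs[key] = json.dumps(doc).encode()

      def get_subkey(self, key: str, path: str):
          # Return one sub-field ("a.b.c") instead of the whole blob;
          # a real backend could evaluate this server-side.
          doc = json.loads(self._blobs[key])
          for part in path.split("."):
              doc = doc[part]
          return doc

  store = JsonBlobStore()
  store.put("prefs:12345", {"skin": "vector", "echo": {"web": True}})
  assert store.get_subkey("prefs:12345", "echo.web") is True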

Summary by Ian:

  • Watchlists, revisions, things like user prefs, sessions, and things like file stores are our use cases. And, of course, lots of caches of various types (and at least 5-6 more use cases coming).
  • We talked about different kinds of constraints:
    • Need multi-directional indexing
    • Low latency
    • Arbitrary size
    • Stick around forever
    • Immutable
    • Mostly read
    • Mostly write
    • Only media or need to handle text
    • Need to handle JSON (and if so, need subfield selection)?
    • Go from user to notification, and also go from notification back to user
    • User to group, group to set of users

GROUP 3 - Indexing - Brandon, Daniel, Erik, Leszek, Corey, Deb

  • Talking about non-primary storage and dependency tracking.
  • Looking at all WMF's storage needs in terms of architecture. We know the three-way split: k-v blobs; revisions of stored objects (persistent logs and changes); and this third part, a large scalable storage solution, but really another part that speeds up the rest
  • C: the Elastic use case…
  • D: In my mind it's broader than that

Put it into an index: we derive things like external links from wikitext and place them in a table, for spam fighting. A specific query against that index tells us which domains exist.

  • The thing we discussed in the last session, dependency tracking, so we know what to purge
  • BB: There's a mental dividing line from Elastic to k-v blobs; we also have MySQL indices, and DBAs have to get [?]. But here we're talking about a separate storage platform for storing other people's data. An indexing solution is a whole separate dataset.
  • D: The k-v example, the index is [..?]
  • BB: You can take one table in MySQL and define 16 different indexes. E.g., you have a DB for Wikidata and just need to put indexes on the side to [?]. If you're saying that the index problem is not solvable within that space with the underlying data structure of WD today, and you need a separate storage system for efficient interaction, then…
  • D: Within these groups, the thing you just excluded isn't covered: just adding another table to MySQL to have access to them.
  • BB: We don't think about that here; we're looking for input to a design process for putting together whole storage platforms - not normal DBA work
  • D: We're hitting the limits of scaling
  • WD has the terms table, pulled out for direct access - it just exploded and isn't scaling
  • BB: Right, if we're at the design limits of the current solution, we need to look for another.
  • D: Pushing to elastic helps a bit, but not sure how much
  • E: the table will exist on wikibase but never used on wikidata.org -- yes
  • L: It used to be used as a search index, but was inefficient; lookup of data elements was moved to Elastic.
  • BB: good label for this?
  • E: wikidata autocomplete
  • BB: is there a solution that works?
  • E: Currently using Elastic, which we believe will work.
  • BB: trying to remain abstract and not be technology specific
  • D: The key is the autocomplete use case - case folding etc.; Elastic is the clear answer
  • B: Other use cases queued up?
  • D: query service?
  • E: too obvious?
  • E: Full text search
  • BB: yes, and not covered by elastic.
  • Bryan: also a vendor problem with blazegraph
  • D: use cassandra for k-v storage
  • B: nonprimary like persistent prefilled cache

Restbase can be repopulated quickly and easily

D: I was told that's not the case for RESTBase?

  • D: Does cassandra fit in for current usecase examples?
  • E: How about for future usecases?
  • B: If you have semi-firm plans, write it down.
  • D: What criteria would you want to look up what stuff.
  • B: Basic parameters? Open questions?  
  • Questions: What is the basic scale (gig, tera, peta).
  • Is it read heavy?
  • How fast does it have to be indexed?
  • E: Wikidata autocomplete needs ~instant.
  • BB: Question for PMs: what is the acceptable delay before you can reuse the same object you just created?
  • D: indexing latency is critical. (don't need concrete answers for that here, but do elsewhere)
  • B: WDQS is the one that needs more speed and reliability. Scaling. Horizontal scaling.
  • D: dependency graph
  • D: What kind of things would you ask to decide where it goes? (into the 3 existing bins, or a new bin)
  • B: From the POV of WDQS owners, do we have an acceptable answer already? AFAWK they're acceptable, but realistically they're causing problems for others, e.g., Stas was woken last night.
  • B: not talking isolated queries with well constructed…
  • D: need from last session is dependency graph storage.
  • B: Coming up with better interface design?
  • B: Where database engine is exposed straight through front of service.
  • D: What does that have to do with dependency graph?
  • It's a graph but not a dependency graph.
  • Semantic relationships between items themselves.
  • Basic params: horizon
  • B: What was initial design thought?
  • D: Start with 100 billion edges
  • D: Minus Wikidata, take all the tables.
  • B: Question: What is the expected read/write balance? 90% on write? Or mostly pulled from elsewhere and so read-heavy?
  • E: What's percentage of graph that changes on a daily basis
  • D: I think roughly same ballpark
  • D: Read case for graph: I changed X, what do I need to update now?
  • B: So walk the graph to find out what else to invalidate (see the traversal sketch after these notes). And every time something changes, write to the database to change the structure. Intuitively I'd say content changes more than structure.
  • ACTION: Ask wikidata people a lot of things.
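
A minimal sketch of the read case described above ("I changed X, what do I need to update now?"), with invented example data: a breadth-first walk over dependency edges collects everything that needs to be re-rendered or invalidated.

  # Invented dependency data: dependents[x] = things to update when x changes.
  from collections import deque

  dependents = {
      "Template:Infobox": {"Article:Berlin", "Article:Paris"},
      "Article:Berlin": {"Search:Berlin"},
  }

  def invalidation_set(changed: str) -> set:
      """Breadth-first walk of the dependency graph from a changed node."""
      seen, queue = set(), deque([changed])
      while queue:
          node = queue.popleft()
          for dep in dependents.get(node, ()):
              if dep not in seen:
                  seen.add(dep)
                  queue.append(dep)
      return seen

  # Editing the template invalidates both articles plus a derived index entry.
  assert invalidation_set("Template:Infobox") == {
      "Article:Berlin", "Article:Paris", "Search:Berlin"}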

Main Takeaways

  • B: Looking at search-indexy stuff
  • Full-text search is used for a few things, mostly working
  • Some existing graph indexing type stuff with WDQS which semi-works
  • New need of graph indexing, no design yet, but coming up: Dependency graph storage, which we have many questions about.
  • Every article on every wiki has a node, and interlinks. Massive. Someone will have to come up with a platform to support that.

Summary by Brandon:

  • Full-text search (works!)
  • Graph type stuff with WDQS (“works”)
  • New graph stuff (?)
    • Every article on every wiki has a node on this graph, it’s a massive graph of everything we have everywhere

Break


Detailed notes from the second part edit

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.

  • Building on previous session in https://phabricator.wikimedia.org/T206076
  • L: Before the break we were looking at use cases for classes of storage solution
  • In the next 35 or so minutes, we’ll look at those classes, and based on the previous hour’s results, think about what types of constraints are considered, and see if you can come up with any kind of guidelines or recommendations
  • E.g., when dealing with data around a volume of 10 MB and needing low latency, you want to use key-value storage
  • You might see overlaps or edge cases from previous session. Don't be dogmatic about those groups or classes.
  • Action: Come up with a list of [?] criteria people should consider when deciding which storage solution to use.
  • Also: Identify upper/lower extremes of what is required.
  • Idea is to help out in future decisions when assessing product requirements to derive storage needs
  • Marko: Way to think about guidelines is that we want to identify the rules that will make it clear to product and technology that we’re talking about the same thing.  Select a type of tech based on objective criteria derived from real product needs.
  • Daniel: Is the purpose only to identify the questions to ask in order to find the correct tech/service? Or are we identifying different types of classes and subclasses? In the previous session we found that one thing that is needed is key-value blob storage. We may have need for both transient and persistent versions. Same interface, different guarantees.
  • L: Not primary goal, but doesn't hurt. Probably not enough time now, but that's an eventual goal.

SMALL GROUPS

GROUP 1 - TEAM REVISION

  • AH: I think a lot of the things we put up as questions and decisions feed into this.  
  • MO: Maybe start with the question of size.
  • Criteria: size, rate of increase, latency, relationships/integration, complexity of relationships, indexing, data structure, multiple fields, indexability, persistency (controversial: are revisions permanent by definition?), availability
  • Size
    • AH: I wanted to suggest that maybe a lower bound is 1b revisions available in a wiki
    • MB: Size is a function of the three dimensions (see last session)
    • Daniel: Earlier you said that there are two classes of storage, one just history and one ???
    • One is smaller and logged, just see that something changed.  Shouldn’t the questions we ask lead to one or the other?
    • Decision is: do you need to curate this?
    • If not using one of three dimensions, how do you express size?
      • # of items, item size
    • MB: What’s the point of talking about real numbers here?  Still just defining concerns.
    • AH: How do we talk about upper/lower bound then?
    • MO: Take JADE.  What is acceptable to you in terms of size, if you know that you cannot have a limited number of judgments?  How does that change how you think about JADE?
    • AH: A lower bound of core use case is 1/10 of revisions on a wiki have a JADE page.  1(?).001 percent of JADE pages will have a revision.
      • Similar to talk pages, most will have no edits but some have lots
    • [Are we making a decision tree here?]
    • MO: write vs read rate, curatable vs. restorable, does it have any implications
    • MB: You can optimize rev storage with compression.  But then you have metadata you have to be able to get at to curate.  That is separate info from the content itself. So it needs to link to something like a user DB.  That complicates this because it works very differently.
    • AH: But we’re running out of space for metadata in the relational db
    • DK: Are we discussing building another tech for curatable content?
    • AH: I think one of our recommendations is that we should abstract away page revision as “curatable thing,” and that indexing based on ???
    • MB: This is very much a MediaWiki problem.  I’m not sure how it relates to other stuff, how it solves other problems.
    • MB: A few years ago we introduced the Cassandra stack, which also stores revisions of page content in a different way.
    • Marko: Cassandra is good for historical data but not curatable data.  It could do the latter, but it complicates things for modeling, etc. Can be made to fit, but do we really need to.
    • MB: What makes it bad for curation?
    • Marko: C is great for longer-pend (?) type stuff. Good for storing diffs, but it gets expensive, e.g., to calculate the version to present. Or, if you store the content of everything, it’s easy to compress, but then you run into efficiency problems when it comes to suppression.
      • AH: Modifying stuff that wasn’t recently added is slow.
    • AH: A lot of our metadata, the only change ever needed is suppression (flip the bit for rev delete).
    • MB: But GDPR could change this -- legal requirement that we actually take something off our servers, not just hide it.
    • DK: Ability to “revise history” may be a factor to discuss.
    • DK: Point isn’t to put numbers on any of these, the point is to ID them as parameters, and having a catalog is useful as an outcome
    • MB: Do we really have a problem with rev storage?  Some of the problems that we’ve discussed are
    • Everything on the curation side was revision metadata storage problems.  Essentially, we’ve released a bunch of products that were catastrophic failures because they didn’t have this and didn’t fit into wiki processes.
    • DK: We extract some things from content (e.g., link tables, page properties) and only do this for the current revision (no past revisions), b/c there is generally no need
    • Anything for query gets access to these, otherwise it’s just a content blob. Maybe there’s a use case where this doesn’t hold, but in present use cases it does.  IOW, no indexing into old revisions.
    • MO: Are you saying that we should care about metadata and relationships only for the current revision?
    • That is current behavior and we need not to break it.
    • DK: People have been asking why we store wiki content as blobs, saying that’s pointless. The reason is that it’s essentially an archival format; we don’t need to index into it.
    • AH: Maybe we should talk about some limited indexing.  E.g., user IDs because people use them to track histories.
    • DK: But that is into the page content.  We do need it for revision metadata, across all revs, no question.  Content, only for current.
    • AH: Actually, Analytics does reconstruct some of this stuff, but…
    • AH: Content is sort of besides the problem here.  This is all about metadata about the revisions and how many we have of them.
    • Lydia: # of revision per page, not saying that’s not a problem, but when we were creating SDC.  
    • # of revisions ties in with write rate, size.  Having 100000 revs on one page is not a problem (until you try to delete, then it is ;)).  For all of them, it is.
  • What can we get away with in terms of revisions in the next 5 years, on the big wikis?
  • MB: My understanding is that current growth is steady and manageable. No substantial increase needed.
  • DK: The data revision on Commons will be at least one edit per page.
  • Lydia: Wikidata’s natural growth will accelerate.
  • And the edit rate on Commons will grow.
  • Wikidata will hit 2b revs.
  • Enwiki will probably hit 2b.
  • Commons, probably not hit 2b but maybe.
  • MO: But commons has a different problem in that the size of each blob...
  • Upper bound: 20b?
  • MB: Current arch, we’re looking at the size of an individual machine. With sharding we can hold much more.
  • Subbu: What do other orgs with lots of data do?  Do they build custom solutions?
  • MB: Some use MySQL.
  • Subbu: Is any open source?
  • MB: they shard stuff, we do some of that
  • DK: Is this when we start sharding in MediaWiki?
  • Recommendation: It’s time to start sharding for MediaWiki.
  • Write rates: lower bound 600 edits/min (10/s, what we have right now), upper bound 6,000 edits per minute (for reasons far beyond storing metadata about revisions)
  • Read rates: lower bound
  • AH: Basic needs: find revisions for a page/time, find edits by a user/time
  • Are there other scary indices? Let’s say we go to Cassandra, can I get all revs sorted by time for a page or a user?
  • MO: Technically possible. In this instance we have two indices: page and user. All others are rebuildable; maybe we could switch to Elastic for them and keep only these two intrinsic to the DB (see the sketch after these notes)
  • AH: What about the recent changes table for recent (<=30 days) stuff?
  • DK: Can we get away with that?  Can’t use just rev ID as a key with sharding
  • AH: Mutability: How often do we have to outright delete something from history?
    • Lydia: Probably very few, but the capability is nonnegotiable.
    • AH: If only once/month, can get away with rewriting the rest of history for a page once in a while.  Twice a day, no way.
  • AS: What about latency?  Could put metadata about stuff into Elastic,
  • DK: Talking about latency for metadata or content?
  • For hot retrieval, expect ms, for cold retrieval, expect what?  Tens of seconds?
    • DK: Tens of seconds not acceptable for metadata ever
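
A minimal sketch of keeping only the two intrinsic indices discussed above (revisions by page over time, edits by user over time) in the primary store, with every other index treated as rebuildable elsewhere. All names here are illustrative, not an actual schema.

  # Illustrative revision-metadata store with exactly two intrinsic indices.
  import bisect
  from collections import defaultdict

  class RevisionMetadataStore:
      def __init__(self):
          self._by_page = defaultdict(list)  # page_id -> [(ts, rev_id)]
          self._by_user = defaultdict(list)  # user_id -> [(ts, rev_id)]

      def add(self, rev_id: int, page_id: int, user_id: int, ts: int) -> None:
          # Both indices are kept sorted by timestamp.
          bisect.insort(self._by_page[page_id], (ts, rev_id))
          bisect.insort(self._by_user[user_id], (ts, rev_id))

      def page_history(self, page_id: int, since: int = 0) -> list:
          rows = self._by_page[page_id]
          return rows[bisect.bisect_left(rows, (since, 0)):]

      def user_contribs(self, user_id: int, since: int = 0) -> list:
          rows = self._by_user[user_id]
          return rows[bisect.bisect_left(rows, (since, 0)):]

  store = RevisionMetadataStore()
  store.add(rev_id=1, page_id=42, user_id=7, ts=1000)
  store.add(rev_id=2, page_id=42, user_id=7, ts=2000)
  assert store.page_history(42, since=1500) == [(2000, 2)]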

Summary:

Concern is how big this can possibly get, WD will probably have 2b revs in 3 years, that’s a lower bound, upper bound is an order of magnitude larger given products in the works.  Write rate, 600-6000 revs/min on the big wikis. Need mutability (ability to delete) due to GDPR, hopefully not too frequently.

Group 2 - Key-Value and Blob Stores

  • Things that get stored:
    • Now
      • Watchlists
      • Revisions
      • user prefs
      • Echo
      • files/media/pdf/etc
      • external store
      • Jobs
      • sessions
      • Caching
      • parser cache
      • reading lists
      • trending edits
      • OTRS (support ticket management system)
      • Derived media like thumbor, screen shots, etc?
      • Map tiles
      • Page properties
    • Coming
      • Partials
      • Citations (?)
      • Annotations (?)
      • Stuff related to group participation/similar types of user state information (Wikiprojects, etc)
      • Backlogs (personal + group)
  • Constraints
    • Multi-directional indexing
      • Watchlists
      • Revisions
      • Echo notifications
      • Jobs
      • Partials
      • Group participations
      • backlogs
    • Low latency
      • External store
      • Generic blob cache
      • Session
    • Arbitrary size
      • files/media
      • External store
      • dumps/generic files
    • Durability
      • Revisions
      • files/media
      • External store
      • Reading lists
      • dumps/generic files
    • Immutable history
    • Mostly read
    • Mostly write
    • Blob store: Needs to handle media
    • Blob store: Needs to handle JSON
    • Blob store: Needs to handle JSON including sub-key extraction

Summary: The issue is that while trying to figure out future needs we kept going back to current needs: derived content (partials, probably citations, annotations). Because it’s derived, we can regenerate it, though that’s painful, so it has lower durability needs. We have cache-invalidation needs. Then there’s stuff like map tiles and file attachments, which we solve in weird ways. Something like Swift (replicated, arbitrary blobs) would be a way to achieve this, since it meets the high durability and replication needs. Also, Perl is awesome

Fun fact: A dump of the OTRS DB is larger than a dump of enwiki

GROUP 3 - Erik, Brandon, Cheol, Karsten, Leszek, Bryan Davis, Deb

  • B: What parameters will inform the choices?
  • E: Upcoming vector search, coming out of Research. Need to find the closest vectors to a given one. E.g., a string of 300 numbers that conceptually represents “a king is”
  • B: Arbitrary space, and how close they are.
  • Existing tech?
  • E: Most use approximation algorithms: divide the space in half, in half, in half. That's usually calculated in an offline batch job and shipped (see the vector-search sketch after these notes).
  • B: such thing as vector database?
  • E: There are libraries we can tie together?
  • B: Those libraries generate models that are then stored in memory
  • E: For graph problem - how big is your graph. How big is your data.
  • B: Daniel's use case, from what I heard: we know how big the data is (the node count). We could probably get the average interconnectedness, e.g., each node links to 6.7 objects, and thereby find the edge count. Then get an idea of what that scale needs in graph-DB terms, for searching and executing queries. This is a problem we can attack.
  • ACTION: Research the numbers needed - data size, updates, etc.
  • I.e. what scale will be needed.
  • ACTION: Research graph db technologies available.
  • L: Any other parameters?  Or just a matter of data.
  • E: Hope you have enough Memory to keep it all hot
  • B: Daniel mentioned he knows how it will be used by code, he knows current read estimates and current write estimates.
  • B: Not content but relationship/structure
  • B: Every time someone makes a content update somewhere, it will go in this database and walk the nodes to see what else needs to be updated. That tells us something about the tech we need. It will need these interfaces, to walk the nodes and give us a list. Do any graph databases not do this?
  • E: They should all do it.
  • E: Not sure how far, traversal, will be happening here.
  • B: Think it's similar to updating the giant connected template problem
  • N: IMDB authority control was example
  • K: SMW has the same problem, even at a small scale. 2 million objects or so. Gives dependency update jobs to the jobqueue.
  • L: Graph database vs sql database?
  • B: traversal, needs graph database,
  • K: index is larger than whole database.
  • B: can imagine putting in simple mysql table. Nodes and connections. But when you tried to operate on it, it wouldn't be performant. (i.e. where SMW is)
  • E: Any other ways that we know when graph db's are solution?
  • BD: SQL sucks at recursive relationships. If you need self-joins, don't use SQL (a recursive-CTE sketch follows these notes).
  • B: Distinct from MySQL, which is a specific tech. We do have calls [?] between SQL-like, blob-like, graph-like
  • L: Search problem?
  • E: [?]. And filtering in sparse dataset.
  • B: if performing textual search, go in this direction? As guideline?
  • B: This is meant to be info that helps PMs know what to ask engineers.
  • E: Specialty data preprocessing? No one wants to write that.
  • BD: A real full-text search engine is different; it works in vector terms, e.g., Lucene. Fuzzy matching, not exact-match projections.
  • E: Actually is if data structure under the hood…
  • L: upper lower case,
  • E: that's the ingestion
  • BD: input normalization, stemming.
  • B: Fuzz factor on search is how far out in space to grab similar results.
  • E: Close by Levenshtein distance.
  • N: similar to the SDC ontology search issue, poodle discussion on mailing list.
  • E: High-level, we ended up determining that Daniel needs a secondary index, and search needs a secondary index. What basic index form do you need?
  • QUESTION: How do you know when you should consider secondary index storage? When usage is different enough.
  • Recursion seems key. How do you make that judgement?
  • BD: MariaDB supports recursion now, but only to a certain level.
  • BD: Offload to the server side. Syntactic sugar in MariaDB; self-joins on the low-level table
  • B: implement hash table or RedBlack tree.
  • L: Move from naive storage to more indexed solution?
  • B: Another smell test: estimate the volume of just the index you need, and if it looks larger than the data you have, then you need a specialized index solution. [Normal SQL indexing is smaller than the content itself?]
  • BD: related to cardinality
  • B: So many mentions of data, exploding out
  • BD: Log search problem, interesting things in logs have high cardinality.
  • B: is this a way to choose full text vs graphing.
  • E: specialized data ingestion?
  • L: text search
  • B: If this is a text bar that humans will type into, it needs a special type of index.
  • E: Textual search: Special needs get special services.
  • E: view the world as one thing. If you need more than separate unjoined. If you need an overview of many things, you need special services.
  • B: Can imagine [?] take all the SQL and turn it into non-indexed, just denormalized. Just interwiki isn't enough to say this is indexy.
  • BD: Storage size, cardinality, search space: there are limits to vertical scaling.
  • B: hence shard MW in the first place.
  • BD: There are many other ways to shard MW that we haven't done, e.g., by product deployment. If your revision is older than X, you have to go to this other place.
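
A minimal sketch of the vector-search idea raised above: represent items as fixed-length vectors and return the nearest ones by cosine similarity. Production systems use the offline "divide in half" approximations mentioned; this brute-force version, using NumPy and toy data, only shows the query semantics.

  # Brute-force nearest-neighbour query over toy embeddings (NumPy).
  import numpy as np

  rng = np.random.default_rng(0)
  vectors = rng.normal(size=(1000, 300))           # 1000 items, 300 dims
  vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

  def nearest(query: np.ndarray, k: int = 5) -> np.ndarray:
      """Indices of the k most similar items by cosine similarity."""
      q = query / np.linalg.norm(query)
      scores = vectors @ q      # cosine similarity, since rows are unit-norm
      return np.argsort(scores)[::-1][:k]

  print(nearest(rng.normal(size=300)))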
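
And a minimal sketch of the recursive-relationship point: a recursive CTE walks a self-referencing table in one query. Shown with SQLite (Python's stdlib sqlite3) only because it is self-contained; MariaDB's WITH RECURSIVE has the same shape. The table and data are invented.

  # Recursive CTE over an invented self-referencing dependency table.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE depends(child TEXT, parent TEXT);
      INSERT INTO depends VALUES
          ('Article:Berlin', 'Template:Infobox'),
          ('Article:Paris', 'Template:Infobox'),
          ('Template:Infobox', 'Module:Common');
  """)
  # Everything Article:Berlin transitively depends on, in one query.
  rows = conn.execute("""
      WITH RECURSIVE deps(name) AS (
          SELECT parent FROM depends WHERE child = 'Article:Berlin'
          UNION
          SELECT d.parent FROM depends d JOIN deps ON d.child = deps.name
      )
      SELECT name FROM deps
  """).fetchall()
  assert {r[0] for r in rows} == {'Template:Infobox', 'Module:Common'}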

Summary: How do you know when you need a different kind of data index?  We identified a few scenarios. Text search, limits of vertical scaling.  Sometimes you might think about whether you need a graph solution or something else.  WMF has little knowledge of graph technologies, mostly Stas. Wikidata problem of 100b edges is something to consider.  Also we need a blockchain. ;)

Key takeaways:

  • When do you consider secondary indexing?
    • Recursive relationships - Could be a sign to consider secondary data indexing.
    • Indexes larger than the source
    • Textual search - special needs get special services - if it has a search box UI, it needs text index storage
    • Limits of vertical scaling - [some types of] bulk data cannot easily shard into separate domains

Actions:

  • Research the numbers needed - data size, updates, etc.
  • Research graph db technologies available.