User:Dantman/Code Ideas/Text dump algorithm
< User:Dantman | Code Ideas
- For {page} in {pages} where {page_id} % {totalThreads} = {threadIndex}
- Let {Q} be a SQL query for {rev_page} = {page->id} ordered in reverse by {rev_timestamp}
- Let {RBUF} be an empty Queue
- Let {SHEAP} be a MaxHeap that sorts by {I->Size}
- Begin reading from {Q}. For each row:
- Let {Text} be the contents found by fetching the row fetched from {Q}
- Let {R} be an object
- Let {R->Text} be {Text}
- Let {R->Size} be the strlen() of {Text}
- Let {R->Inserted} be false
- Push {R} onto the end of {RBUF}
- Insert {R} into {SHEAP}
- If {RBUF} has {queueLength} items or {Q} has no more items break from this loop
- Let {R} be the first item in {SHEAP}
- Let {} be a MinHeap that sorts by {->Size}
- Iterate over the next {comparisonLookahead} items in {SHEAP} as {}:
- Compute a delta between {R->Text} and {->Text} as {Delta}
- Calculate the storage space of {Delta} as {DeltaSize} # Note: Removals will be optimized and will only contain ranges needed for removal, not the text itself
- Let {->Rev} be {}
- Let {->Delta} be {Delta}
- Let {->Size} be {DeltaSize}
- Insert {} into {}
- Dumping is run in parallel where each thread is responsible for pages with a MOD {totalThreads} that is equal to the thread's own index. (If three threads are being run then thread i=2 is responsible for the page id's (2, 5, 8, etc...)).
- Because revisions within a page are most likely to be similar we break up tasks by pages and make all revisions in one page handled by a single thread.
- Revision text may be stored in Raw, Delta, or Equal formats. Raw being a blob of text. Delta being a space optimized diff and a reference to another revision. And Equal being a reference to another revision indicating they are the same.
- Deltas are stored with preference going towards storage of removals
- We iterate in reverse as newer revisions are more likely to have more text.
- We handle the largest revisions in a chunk of revisions first to make it most likely that we will store a removal.
- Storage:
- Use binary storage patterns.
- Use fseek to return to the location previously unknown index pointers and write to them
- Use tempnam to create a temporary file, write to it, and then move it to the final location.