

  • brew install djvulibre
  • pip install cython
  • pip install python-djvulibre
  • brew install homebrew/python/numpy
  • needs ImageMagick built with libtiff support, I think.


  • storing images in a MySQL blob seems like a bad idea.
  • bawolff says to use $image->getHandler()->getPageText( $image, $pageNumber ); where $image comes from wfFindFile()


  • convert to PHP: the Python script is a good POC, but not extensible for MW's purposes
  • write PHP to shell out to a modified Python script
    • ideally we could do this without Python, but I don't know the DjVu-PHP stuff.
  • investigate filtering out useless entries, such as those that are just a single number
    • figure out Wikisource + i18n? The POC resulted in a lot of Greek text
  • how does attribution work?
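
A rough sketch of the filtering idea above, in Python to match the POC. The function name `is_useful_word`, the `min_letters` parameter, and the Latin-only rule are all assumptions made for illustration; a real filter for Wikisource would need a per-language policy, so the i18n question stays open.

```python
import unicodedata


def is_useful_word(word, min_letters=2):
    """Return True if `word` looks like a usable captcha candidate.

    Hypothetical policy: require at least `min_letters` letters and
    accept Latin-script letters only.
    """
    stripped = word.strip()
    if not stripped:
        return False

    letters = [ch for ch in stripped if ch.isalpha()]

    # Rejects bare numbers, punctuation, and single stray letters.
    if len(letters) < min_letters:
        return False

    # Rejects tokens containing non-Latin letters (e.g. Greek).
    for ch in letters:
        if not unicodedata.name(ch, "").startswith("LATIN"):
            return False

    return True


if __name__ == "__main__":
    for candidate in ["42", "house", "λόγος", "café"]:
        print(candidate, is_useful_word(candidate))
```

Checking the Unicode character name is a crude stand-in for real script detection, but it needs nothing outside the standard library.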


  • Needs to integrate with ConfirmEdit?
  • Review interface for Wikisourcians
  • Something to support non-WMF wikis
  • ???
  • Profit.


  • Wikisourcian marks a DjVu file for captcha-ification, as either a proofread file or an empty file
  • JobQueue starts generating new captcha files
  • user hits a captcha; store their answer for the unknown word
  • after a new word has reached a certain number of results, move it into the review queue
  • review queue shows the image + submitted answers; the reviewer picks the best one and the API makes the edit automatically.
  • magic.
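
The middle of that workflow (collect answers, then promote a word to review) could be sketched as below. This is an in-memory illustration only: the class `CaptchaAnswerStore`, its method names, and the threshold of 3 are hypothetical, and a real version would live in PHP on top of MediaWiki's database and JobQueue.

```python
from collections import Counter, defaultdict

REVIEW_THRESHOLD = 3  # hypothetical; the notes only say "a certain number"


class CaptchaAnswerStore:
    """Collects user answers for unknown words and queues them for review."""

    def __init__(self, threshold=REVIEW_THRESHOLD):
        self.threshold = threshold
        self.answers = defaultdict(list)  # word id -> submitted answers
        self.review_queue = []            # word ids awaiting a reviewer

    def record_answer(self, word_id, answer):
        """Store one answer; return True if the word just entered review."""
        if word_id in self.review_queue:
            # Already queued: further answers are ignored in this sketch.
            return False
        self.answers[word_id].append(answer.strip())
        if len(self.answers[word_id]) >= self.threshold:
            self.review_queue.append(word_id)
            return True
        return False

    def review_entry(self, word_id):
        """Return (answer, count) pairs, most frequent first."""
        return Counter(self.answers[word_id]).most_common()


if __name__ == "__main__":
    store = CaptchaAnswerStore()
    for guess in ["cat", "cat", "cot"]:
        store.record_answer("word-1", guess)
    print(store.review_queue)
    print(store.review_entry("word-1"))
```

Ranking answers by frequency gives the reviewer a sensible default while still leaving the final pick to a human, as the notes describe.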