(Automation Tool) Google Books > Internet Archive > Commons upload cycle

BUB: Book Uploader Bot

Public URL: //www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle

Bugzilla report: Bug - 57813
Hosted on tools-lab: http://tools.wmflabs.org/bub/
- Testing does not require a login.
- To follow the progress of all uploads, check archive.org (requires login) or search by subject: https://archive.org/search.php?query=subject%3A%22bub_upload%22&sort=-publicdate
Maintained on github: https://github.com/rohit-dua/bub
Progress: [1]

Name and contact information

Name: Rohit Dua
Email: 8ohit.dua@gmail.com
IRC or IM networks/handle(s): rohit-dua
Location: New Delhi, India
Time-zone: UTC+5:30
Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August.

Synopsis

Wikisource projects around the world rely heavily on Google Books digitizations for transcription and proofreading, but books often disappear from the GB database. Currently, users have to download a book from GB manually and then either upload it to IA (if they want to preserve it) or upload it directly to Wikimedia Commons (again a manual task) with the appropriate meta-data.

This project automates all three steps. The user only has to give the URL (or identifier) of the book(s) they wish to upload; everything else is automated, and the user is notified only when their intervention is needed.

Flowchart for the project
Direct Link

Core Libraries/tools used:

Deliverables

Goals of this project :

Required Goals:

Tool hosted on Tool-Labs with a JavaScript front-end and a Python core.

    This will take as input:
    LIBRARY_TO_CHOOSE             // The source library, such as Google-Books. More libraries can be added in the future.

    GOOGLE_BOOK_URL OR ID         // The ID/URL of the book that will be uploaded to IA and Commons

    FILE_NAME_FOR_COMMONS         // The user-defined name for the DjVu file (will be passed to IA-Upload)
    EMAIL_ID                      // The email address used to notify the user
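
Since the second input may be either a URL or a bare ID, the tool needs to normalize it first. The following is a minimal sketch, assuming Google Books URLs carry the volume ID in an `id` query parameter; the function name `extract_volume_id` is hypothetical and not taken from the BUB code.

```python
from urllib.parse import parse_qs, urlparse


def extract_volume_id(url_or_id: str) -> str:
    """Return the volume ID from a Google Books URL, or the input if it is a bare ID."""
    text = url_or_id.strip()
    if not text:
        raise ValueError("empty book URL/ID")
    # Assumption: a bare ID contains neither a path separator nor a query string.
    if "/" not in text and "?" not in text:
        return text
    # Allow URLs pasted without a scheme, e.g. "books.google.com/books?id=...".
    if "://" not in text:
        text = "http://" + text
    ids = parse_qs(urlparse(text).query).get("id")
    if not ids:
        raise ValueError(f"no 'id' parameter found in {url_or_id!r}")
    return ids[0]
```

With this, both a pasted URL and a bare identifier end up as the same string before any further processing.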

Extract meta-data from GB and check if it is Public Domain

    Google provides the Google Books API.
    It will be used to extract all the details about the book (meta-data) and to check whether it is in the public domain.
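
As an illustration of this step, the sketch below reads the fields of interest from a volume record. It assumes the v1 `volumes` endpoint and the `volumeInfo` / `accessInfo.publicDomain` fields of its JSON response; the helper names are hypothetical, and the summary dictionary is only one possible shape.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://www.googleapis.com/books/v1/volumes/"


def volume_url(volume_id: str) -> str:
    """Build the API URL for a single volume."""
    return API_ROOT + quote(volume_id, safe="")


def summarize_volume(volume: dict) -> dict:
    """Pick the meta-data fields of interest out of a parsed volume record."""
    info = volume.get("volumeInfo", {})
    access = volume.get("accessInfo", {})
    return {
        "title": info.get("title", ""),
        "authors": info.get("authors", []),
        "publisher": info.get("publisher", ""),
        "published_date": info.get("publishedDate", ""),
        "language": info.get("language", ""),
        # A missing flag is treated as "not public domain" to stay on the safe side.
        "public_domain": bool(access.get("publicDomain", False)),
    }


def fetch_volume(volume_id: str) -> dict:
    """Download and summarize one volume (performs a network request)."""
    with urlopen(volume_url(volume_id)) as response:
        return summarize_volume(json.load(response))
```

Keeping the parsing in `summarize_volume` separate from the network call makes the public-domain check easy to test against saved responses.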

Check if a book is available on IA

    Internet Archive provides a JSON API for advanced searching.
    It will be used to check whether the book is already available on IA.
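
A possible shape for this check is sketched below. It assumes the `advancedsearch.php` endpoint with `output=json` and a response of the form `{"response": {"docs": [...]}}`; matching on the title alone is a simplification chosen for the example, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(title: str, rows: int = 5) -> str:
    """Build an advanced-search URL that looks for texts with the given title."""
    safe_title = title.replace('"', " ")  # keep the quoted phrase well-formed
    params = [
        ("q", f'title:"{safe_title}" AND mediatype:texts'),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def matching_identifiers(result: dict) -> list:
    """Extract item identifiers from a parsed search response."""
    docs = result.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


def find_on_archive(title: str) -> list:
    """Return identifiers of candidate duplicates (performs a network request)."""
    with urlopen(build_search_url(title)) as response:
        return matching_identifiers(json.load(response))
```

An empty list would mean no candidate was found; a non-empty list still needs a closer comparison (author, year) before the book is treated as a duplicate.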

Download all its pages from GB and convert to PDF/ZIP

    Each page of the required book will first be downloaded from Google Books as a PNG/JPG image,
    and the images will then be converted to PDF for easy upload to IA.
    A link to proof-of-concept code for the book download is given at the bottom.
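
The ZIP variant of this step can be sketched with the standard library alone. The example below assumes the page images are already on disk in reading order; the `page_NNNN` naming scheme and the function name `bundle_pages` are illustrative choices, not part of the proof-of-concept code.

```python
import zipfile
from pathlib import Path


def bundle_pages(page_files, zip_path):
    """Pack page images into a ZIP, renaming them so they sort in reading order."""
    page_files = [Path(p) for p in page_files]
    # Zero-pad page numbers so that alphabetical order equals page order.
    width = max(4, len(str(len(page_files))))
    # Images are already compressed, so they are stored without recompression.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for number, page in enumerate(page_files, start=1):
            arcname = f"page_{number:0{width}d}{page.suffix.lower()}"
            archive.write(page, arcname)
    return Path(zip_path)
```

Converting to PDF instead would need an imaging library; the ordering and naming concerns stay the same.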

Upload to IA with appropriate meta-data

    The Python library internetarchive will be used for this step.
    Each book uploaded to IA will carry its meta-data (taken from GB).
    In the long run, this will help avoid duplicate uploads.
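
A minimal sketch of the upload step follows. It assumes the GB meta-data arrives as a `volumeInfo`-style dictionary and that IA credentials have already been configured for the internetarchive library; the field mapping is one plausible choice, and the `bub_upload` subject is the tag used in the archive.org search link near the top of this page.

```python
def to_ia_metadata(volume_info: dict) -> dict:
    """Map Google Books volume fields to Internet Archive meta-data fields."""
    metadata = {
        "mediatype": "texts",
        "subject": "bub_upload",  # tag that makes BUB uploads searchable
        "title": volume_info.get("title", ""),
    }
    optional = {
        "creator": volume_info.get("authors"),
        "publisher": volume_info.get("publisher"),
        "date": volume_info.get("publishedDate"),
        "language": volume_info.get("language"),
    }
    # Only send fields that Google Books actually provided.
    metadata.update({key: value for key, value in optional.items() if value})
    return metadata


def upload_book(identifier: str, pdf_path: str, volume_info: dict):
    """Upload one file with its meta-data (performs network requests)."""
    # Imported lazily so the mapping above can be used without the third-party library.
    from internetarchive import upload

    return upload(identifier, files=[pdf_path], metadata=to_ia_metadata(volume_info))
```

How the IA identifier is derived from the book is left open here; it only needs to be unique and stable so that later steps can refer to it.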


    Files uploaded to IA are OCR'ed so that their text is searchable.
    This takes time, so users will be notified via email as soon as the OCR is complete.
    The user's email, the corresponding URL identifiers, and the entered FILE_NAME_FOR_COMMONS will be stored (SQLite).
    A web crawler will periodically visit the URLs of the stored identifiers to check for OCR completion.
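
The storage described above could look roughly like the sketch below. The table and column names are hypothetical; the only assumption taken from the text is that the email, the IA identifier and the Commons file name are stored together until the user has been notified.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    ia_identifier TEXT PRIMARY KEY,
    email         TEXT NOT NULL,
    commons_name  TEXT NOT NULL,
    notified      INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path=":memory:"):
    """Open (or create) the request database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_request(conn, ia_identifier, email, commons_name):
    """Remember who asked for which upload."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO requests (ia_identifier, email, commons_name) "
            "VALUES (?, ?, ?)",
            (ia_identifier, email, commons_name),
        )


def pending_requests(conn):
    """Return the requests whose OCR status still has to be checked."""
    return conn.execute(
        "SELECT ia_identifier, email, commons_name FROM requests "
        "WHERE notified = 0 ORDER BY ia_identifier"
    ).fetchall()


def mark_notified(conn, ia_identifier):
    """Flag a request once its notification email has been sent."""
    with conn:
        conn.execute(
            "UPDATE requests SET notified = 1 WHERE ia_identifier = ?",
            (ia_identifier,),
        )
```

The periodic checker would loop over `pending_requests`, test each identifier for OCR completion, and call `mark_notified` after sending the email.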

Wait for its OCR, when completed notify user via email

    Once the OCR process is complete, the user will be notified via email. The Python library smtplib will be used to send the emails.
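
A small sketch of the notification step, using only the standard library. The sender address, the message wording and the SMTP host are placeholders chosen for the example; a real deployment would use whatever mail relay the hosting environment provides.

```python
import smtplib
from email.message import EmailMessage


def build_notification(to_addr, identifier, upload_link, from_addr="bub@example.org"):
    """Compose the 'OCR finished' email for one uploaded book."""
    msg = EmailMessage()
    msg["Subject"] = f"OCR finished for {identifier}"
    msg["From"] = from_addr  # placeholder sender address
    msg["To"] = to_addr
    msg.set_content(
        f"The OCR of your upload '{identifier}' is complete.\n"
        f"Continue the upload to Commons here:\n{upload_link}\n"
    )
    return msg


def send_notification(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (performs a network connection)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Separating message construction from sending keeps the wording testable without a mail server.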

Upload to Commons using IA-Upload tool.

    The emails will contain a link of the form http://tools.wmflabs.org/ia-upload/commons/fill?iaId=ID&commonsName=FILENAME,
    where ID is the identifier stored previously and FILENAME is the FILE_NAME_FOR_COMMONS taken as input at the beginning.
    This skips the front page of IA-Upload,
    since users will not have to enter the identifier of the uploaded file manually.
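
Building that link is a matter of URL-encoding the two values. The sketch below uses the base URL and parameter names given above; the function name is hypothetical, and percent-encoding spaces (rather than using `+`) is a choice made for the example.

```python
from urllib.parse import quote, urlencode

IA_UPLOAD_FILL = "http://tools.wmflabs.org/ia-upload/commons/fill"


def ia_upload_link(identifier: str, commons_name: str) -> str:
    """Return the pre-filled IA-Upload link for one book."""
    query = urlencode(
        {"iaId": identifier, "commonsName": commons_name},
        quote_via=quote,  # encode spaces as %20 instead of '+'
    )
    return f"{IA_UPLOAD_FILL}?{query}"
```

Encoding matters because Commons file names routinely contain spaces and non-ASCII characters.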

Optional Goals:

Direct upload to Commons.

    A user who wants to use the Commons file immediately might want to skip the step of
    uploading to IA (as it takes time).
    The wikitools library and the MediaWiki API will be used to connect and upload to Commons.

Add support for other popular Public Library Networks

    Support for public libraries like the Digital Library of India (Archived 2013-08-06 at the Wayback Machine) and the West Bengal Public Library Network
    will be added; they will work in a similar fashion to Google-Books.

* The code will be designed so that support for more libraries (like the Digital Library of India) can be added easily.
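
One way to keep the code open to new libraries is a common interface plus a registry keyed by the LIBRARY_TO_CHOOSE value. The sketch below is only an assumption about how this could be structured; the class, method and registry names are hypothetical.

```python
from abc import ABC, abstractmethod

REGISTRY = {}


class Library(ABC):
    """Interface every supported digital library has to implement."""

    name = ""  # value the user selects as LIBRARY_TO_CHOOSE

    @abstractmethod
    def metadata(self, book_id: str) -> dict:
        """Return the meta-data of a book."""

    @abstractmethod
    def is_public_domain(self, book_id: str) -> bool:
        """Tell whether the book may be uploaded."""

    @abstractmethod
    def page_urls(self, book_id: str) -> list:
        """Return the URLs of the page images, in reading order."""


def register(cls):
    """Class decorator that makes a library selectable by name."""
    REGISTRY[cls.name] = cls
    return cls


def get_library(name: str) -> Library:
    """Instantiate the library the user selected."""
    try:
        return REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unsupported library: {name!r}") from None
```

Adding a new source would then mean writing one subclass decorated with `@register`, without touching the download, upload or notification code.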

Project schedule

Timeline	Task
Apr 21 - May 19	Get familiar with code base, move local environment to Labs, bond with community
May 19 - May 26	University Examinations
May 26 - May 30	Add feature to extract meta-data from GB and check if it is public domain (proof of code)
May 30 - Jun 05	Download from GB and convert to PDF
May 30 - Jun 05	Code to upload to IA using the internetarchive library
Jun 05 - Jun 10	Code to check if the book is available on IA
Jun 10 - Jun 22	Database and its Python connector for email/identifier storage
Jun 23	Mid Term Evaluation
Jun 24 - Jul 05	Spider bot to check for updates
Jun 05 - Jul 15	Automatic notification email using smtplib, linked with the IA-Upload tool
Jul 15 - Jul 25	UI polishing, bug fixing
Jul 25 - Aug 18	Code clean-up, documentation + buffer time for unforeseen delays

* The plan above may go as expected, or time may be redistributed among the tasks.

Participation

During my working hours I will always be logged in to IRC (channels: #mediawiki, #wikimedia-dev, #mediawiki-labs) and can always be reached by email. I'm a computer addict and have a hard time staying away from it. All source code I write will be published to my GitHub repo, although the tool will be hosted on Tool-Labs.
At each stage of development I would like to discuss implementation details with the mentors so that there are no delays or issues later on. If I have other doubts or need feedback, I will head over to the talk page at Talk:Google Books, Internet Archive, Commons upload cycle or the mailing list (Wikitech-l).

About you

My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My home town is New Delhi, India.
I code in Python/JavaScript/C/C++.
I'm passionate about computer security and automation, and coding gets me high! I am new to the world of open source and its community bonding.
When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there was no such thing as free bread, but there always was free knowledge. Before this I never went to anyone with my programming issues or bugs (online or offline). Now I feel I can grow and learn much faster through community bonding in the open source universe.
This project is my first opportunity to bond with an open source organization. GSoC will be my bridge to the open source community. Google Summer of Code will also be my top priority, and I will happily treat it as a full-time job.

Past open source experience

GitHub profile: rohit-dua

Proof of concept code

For the sake of demonstration, I have a script to download any public domain book from GB: https://github.com/rohit-dua/gb-download (Python)

* UI and some verification code (project named BUB: Book Uploader Bot): https://github.com/rohit-dua/BUB
