Extension:Html2Wiki
This extension is not compatible with MediaWiki 1.34 or any later version! Using this extension on a live site is not recommended. Volunteer developers are invited to pledge to update this extension for compatibility with MediaWiki 1.39 by replacing the {{Incompatible }} template with {{Incompatible |version=1.34|pledge=~~~~}}.
This extension is not being actively maintained! Although it may still work, any bug reports or feature requests will most likely be ignored. If you want to take over the task of developing and maintaining this extension, you can request repository ownership. As a courtesy, you may want to contact the author. You should also remove this template and list yourself as maintaining the extension in the page's {{extension}} infobox.
Release status | unmaintained
Implementation | User interface, Special page
Description | Allows the bulk importing of HTML and related images
Author(s) | Greg Rundlett (talk)
Latest version | 2017.07 (2017-07-13)
MediaWiki | 1.25+ – 1.32?
PHP | 5.3+
Database changes | No
License | GNU General Public License 2.0 or later
Download |
Quarterly downloads | 31 (Ranked 138th)
Translate the Html2Wiki extension if it is available at translatewiki.net
Issues | Open tasks · Report a bug
The Html2Wiki extension is used to import HTML content (including images) into your wiki.
Imagine you have dozens, hundreds, maybe thousands of HTML pages, and you want to get them into your wiki. Perhaps you have a website, or perhaps a documentation system in HTML format. You would love to use your wiki platform to edit, annotate, organize and publish this content. This is where the Html2Wiki extension comes in. You simply install the extension in your wiki, and you can then import entire zip files containing all the HTML + image content. Instead of months of work, it can be done in minutes.
Features
- Handles most HTML (anything that Tidy can understand)
- Converts most MediaWiki markup (anything that Pandoc can understand)
- Handles both zip archives and tar/gz archives in addition to single HTML files.
- The added advantage of zip archives is that all images will also be imported. In fact, although there are other ways to import images into your wiki [1][2], you could use this extension to bulk import images.
- Special handler for Google Drive documents. In Google Drive, select "File -> Download as -> Web page (.html, zipped)", then import the zip file. Links in the original document will be stripped of their "Google Tracking Virus" and urldecoded back to their human-readable value.
- Adds an "Import HTML" link to the Toolbox panel. You have to edit MediaWiki:Common.js; see the modules/MediaWiki:Common.js file included with Html2Wiki.
- Content is automatically categorized according to the "Collection Name" provided.
- "Dry-run" option available to preview what the import would look like for sample content.
Installation
- Download the file(s) and place them in a directory called Html2Wiki in your extensions/ folder.
- Add the following code at the bottom of your LocalSettings.php file:
$wgNamespacesWithSubpages[NS_MAIN] = true; // has to be set BEFORE the wfLoadExtension()!
wfLoadExtension( 'Html2Wiki' );
- Composer is used to manage the dependencies on 3rd-party code. Once you have the Html2Wiki code in your 'extensions' directory, follow these steps to pull in those dependencies:
cd Html2Wiki
composer install
Note: If you do not have composer on your system, you can either install it [3] (it is good to have) or use the phar file:
cd Html2Wiki
wget https://getcomposer.org/composer.phar
php composer.phar install
- This extension depends on SyntaxHighlight_GeSHi so be sure to install that.
- (Optional) To add the "Import HTML" link to your Tools sidebar, see the modules/MediaWikiCommon.js file distributed with the extension.
- Check $wgMaxUploadSize (defaults to 100MB) in LocalSettings.php. Ensure that it is compatible with the upload_max_filesize (defaults to 2MB) and post_max_size (defaults to 8MB) settings in your php.ini file. It is suggested to set the php.ini settings to 100MB and restart Apache [4]. While you are tweaking upload sizes, if you anticipate large uploads, you will almost certainly need to increase max_execution_time as well; change it from its default of 30 seconds to something like 300. (A configuration sketch follows this list.)
- Your Apache server will need the AllowEncodedSlashes directive set to On for image URLs to work properly. Otherwise, you will see 404 errors (blank images) in the upload results for any image that has a slash in the filename.
# Need this for Html2Wiki image URLs
AllowEncodedSlashes On
- Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.
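The following is a minimal configuration sketch for the upload-size step above, assuming you want the MediaWiki and PHP limits aligned at 100 MB; the values are illustrative, not requirements.
// LocalSettings.php -- illustrative values only
$wgMaxUploadSize = 100 * 1024 * 1024;  // 100 MB (the default), in bytes

// php.ini equivalents, shown as comments -- edit php.ini itself and restart Apache:
//   upload_max_filesize = 100M
//   post_max_size       = 100M
//   max_execution_time  = 300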
To users running MediaWiki 1.29 or earlier:
The instructions above describe the new way of installing this extension using wfLoadExtension().
If you need to install this extension on those earlier versions (MediaWiki 1.29 and earlier), instead of wfLoadExtension( 'Html2Wiki' ); you need to use:
require_once "$IP/extensions/Html2Wiki/Html2Wiki.php";
Using Git
Since this extension is actively developed, you might want to use a git clone instead of a static download. cd to your 'extensions' directory, and clone the project
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Html2Wiki.git
This way you can git pull to get the latest enhancements and bug fixes.
Configuration parameters
Html2Wiki can import entire document sets and maintain a hierarchy of those documents. The $wgNamespacesWithSubpages variable will allow you to create a hierarchy in your wiki's 'main' namespace; and even automatically create navigation links to parent article content.
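A minimal LocalSettings.php sketch, assuming you only need subpages in the main namespace (which is what the installation step above already sets); the extra namespace line is purely illustrative:
$wgNamespacesWithSubpages[NS_MAIN] = true;   // required: set this before wfLoadExtension()
$wgNamespacesWithSubpages[NS_HELP] = true;   // optional example: subpages in another namespace
wfLoadExtension( 'Html2Wiki' );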
User rights
Right now the extension is restricted to Admins.
Dependencies
There are three software dependencies that must exist in your environment for Html2Wiki to work. Missing dependencies are detected when you access Special:Html2Wiki, to alert you that the installation is not complete (a sketch of such a check appears at the end of this section). Tidy and Pandoc must be installed independently; QueryPath can be installed via Composer.
- Tidy is responsible for repairing and normalizing the HTML source prior to conversion. The Tidy module that is bundled with PHP v5.0+ is preferred; Html2Wiki will fall back to a tidy binary if that is found instead. For Ubuntu:
sudo apt-get install php5-tidy
- ZIP is responsible for extracting the imported HTML from archives. You should install the php-zip package. For Ubuntu/Debian:
sudo apt-get install php-zip
- QueryPath will be installed for you once you invoke the composer install command from your Html2Wiki directory.
- Pandoc is the conversion library. You should install pandoc prior to installing Html2Wiki. For Ubuntu:
sudo apt-get install pandoc
Since pandoc is actively developed, you may wish to install from source to get a more recent version than what is packaged with your distribution.
Known to work with MediaWiki versions (you can check your version at https://freephile.org/wikireport/):
- 1.25alpha (and newer)
- 1.24.x
- 1.23.1
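As an illustration of the kind of dependency check Special:Html2Wiki performs, the hedged sketch below (not the extension's actual code) probes for all three dependencies from PHP:
<?php
// Illustrative dependency probe -- not Html2Wiki's actual implementation.
$problems = [];

// Tidy: either the PHP extension or a tidy binary on the PATH will do.
if ( !extension_loaded( 'tidy' ) && trim( (string)shell_exec( 'which tidy' ) ) === '' ) {
    $problems[] = 'Tidy is not available (PHP extension or binary).';
}

// Pandoc: must be installed as a system binary.
if ( trim( (string)shell_exec( 'which pandoc' ) ) === '' ) {
    $problems[] = 'Pandoc is not installed.';
}

// QueryPath: pulled in by "composer install" in the extension directory.
if ( !function_exists( 'qp' ) && !class_exists( '\\QueryPath\\QueryPath' ) ) {
    $problems[] = 'QueryPath is missing; run "composer install" in extensions/Html2Wiki.';
}

foreach ( $problems as $problem ) {
    echo "Dependency problem: $problem\n";
}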
Related
See the SubPageList extension if you want to create navigation blocks for subpages.
If you are importing generated source code documentation which creates DHTML "mouseovers" for glossary terms, you may wish to create similar functionality in the wiki environment. See the Lingo extension to create a glossary of terms.
Usage
System elements
Once installed, the Html2Wiki extension makes a new form available to Administrators of your wiki. You access the import HTML form at the Special:Html2Wiki page. Instructions are also provided to add a convenient Import HTML link to the Tools panel of your wiki for quick, easy access to the importer. You can view the actions of the importer in Special:Log (see the #Logging section).
Single file
Enter a comment in the Comment field, which is logged in the 'Recent Changes' content as well as the Special:Log area.
You can optionally specify a "Collection Name" for your content. The Collection Name represents where this content is coming from (e.g. the book or website). Any unique identifier will do. The "Collection Name" is used to tag (categorize) all content that is part of that collection, and all content that is part of a Collection will be organized "under" that Collection Name in a hierarchy. This lets you have two or more articles in your wiki named "Introduction" if they belong to separate Collections. Specifying an existing Collection Name + article title will update the existing content. In fact, to re-import a single file and maintain its 'position' in a collection, you would specify the full path to the file.
Zip files and GNUzip archives
Html2Wiki handles both regular "zip" archives, and also any type of GNUZip or Tar archive (ending with .gz, .gzip, .tar, .tar.gz, .tgz). Choose an archive file to import. The archive file can contain any type of file (CSS, JavaScript), but only html and image files will be processed.
Examples
Here are just a couple examples of using Html2Wiki with zip archives.
Import a file from Google Drive
When you create a document on Google Drive (aka Google Docs), the hyperlinks in those files become polluted to make them pass through google.com. That is annoying. But more importantly, it makes it that much harder to re-use or move this content into your wiki. Html2Wiki to the rescue! You can "save" your Google Drive document in the form of a complete webpage (including images) by selecting "File -> Download as -> Web page (.html; zipped)". Then, import that zip. Html2Wiki will import the images and the content, and will also "decode" the link references in the document.
Import a blog post / webpage complete
You can easily create a local zip file of a blog post or web page using wget and zip, and then import it into your wiki.
Let's use the example of this article on "European Commission endorses CC licenses as best practice for public sector content and data" found at http://creativecommons.org/weblog/entry/43316
wget --page-requisites --convert-links --no-host-directories --no-directories --directory-prefix=cc.org --adjust-extension http://creativecommons.org/weblog/entry/43316
# equivalent to wget -p -k -nH -nd -Pcc.org -E http://creativecommons.org/weblog/entry/43316
zip --junk-paths cc.zip ./cc.org/*
/bin/rm -rf ./cc.org
- Use wget to create a directory named 'cc.org' with all the contents needed to re-create the webpage.
- Use zip to create an archive named cc.zip. You can now import the zip file with Html2Wiki.
- Optionally discard the download directory and/or the zip file when you are done.
Cleanup
Since importing a large number of articles can sometimes go wrong, you might be wondering how to "undo" an import. Remember, a single file combined with the "dry run" feature can let you quickly test the conversion outcome. Also, any re-import will update the existing article making it easy to improve your conversion results. Still, you may be looking for a solution to permanently remove a large number of articles from your wiki. Perhaps you used the wrong Collection Name, creating a large collection of articles in the wrong hierarchy. In that case, please see the Nuke extension that comes bundled with MediaWiki since v.1.18. Note that Nuke (and other Administrative "delete" actions) simply move the article in question from the 'revision' table to the 'archive' table. They will not "appear" in your wiki any longer, but they are still there (and can be restored). If you REALLY want to delete wiki articles permanently (to reduce the size of your database etc.), then you will need the help of a script like DeleteArchivedRevisions.php after using the Nuke extension.
TLDR
- Visit Special:Nuke -> (enter SQL wildcards) -> click 'Go'
- Review the list and select the articles to be deleted -> click 'Delete Selected'
- (At the console) cd $IP/maintenance, then:
php deleteArchivedRevisions.php
(review the output)
php deleteArchivedRevisions.php --delete
Mechanics
Importing a file works like this:
Select
  |
  v
Upload
  |
  v
Tidy -------> Normalize
  |
  v
QueryPath --> Clean
  |
  v
Pandoc -----> Convert
  |
  v
Save
Dave Raggett's HTML Tidy was, and still is, a venerable tool for validating and ensuring well-formed HTML documents. We use Tidy to try to get source HTML into good enough shape that it can be further processed. As of early 2015, an effort has begun to bring HTML5 support to HTML Tidy; see https://github.com/htacg/tidy-html5 The Tidy documentation is still at http://tidy.sourceforge.net/ and, since the new project has not yet made any releases, we are obviously using the Tidy that is built into PHP5. See the PHP manual / ref.tidy.php. If for some reason that is not installed on your platform, a local tidy binary should work. There was a problem where the PHP extension was not available in MediaWiki-Vagrant due to its use of HHVM instead of the Zend PHP interpreter; however, that is no longer the case. You can use PHP Tidy in MediaWiki-Vagrant. If you want to do validation/tidy tests, try the W3C's validator and step-by-step guide, although I confess to preferring https://validator.nu/
The "Clean" part is where we do the dirty work. This is the hardest part, and perhaps where you will need to spend time coding, to get the perfect functionality out of Html2Wiki. Although we have successfully imported thousands of documents with Html2Wiki, your source content may need to be manipulated in ways we have not seen, and that means DOM parsing. There are a number of ways to parse the DOM including PHP DOM. We decided to move up a level to use Matt Butcher's QueryPath project QueryPath implements a CSS parser (in PHP) so that you can manipulate the DOM using CSS selectors just like you would in jQuery. This is also where we are focusing development so that we might be able to simply set directives in configuration variables so that there is no coding required -- even for new situations. Parsing the DOM is problematic when using PHP's native DOM manipulation (which is itself based on libxml). QueryPath provides a more flexible parsing platform. You can learn more about QueryPath at IBM DeveloperWorks article The most recent list of documentation for QueryPath is at this bug: https://github.com/technosophos/querypath/issues/151 The API docs contain a CSS selector reference
John MacFarlane's Pandoc is a fantastic document converter that is able to read and write to MediaWiki syntax (among other formats). See the README file to understand what it can do. Pandoc gives us the final step before we can save content into the wiki... namely converting HTML to wiki markup. Because Pandoc does HTML to Wiki conversion right out of the box, you may want to give that a try in addition to this extension. e.g. pandoc --from html --to mediawiki foo.html --output foo.wiki.txt
Originally, it was envisioned that we would make API calls to the Parsoid service, which is used by the VisualEditor extension. However, Parsoid is not very flexible in the HTML that it will handle. To get a more flexible converter, we use the Pandoc project, which is able to (read and) write the MediaWiki text format.
Potential gotchas
Each HTML file has to have at least one anchor (<a href=""></a>) for the extension to be able to process it.
In order to handle the zip upload, we have to traverse all files and index hrefs as they exist. We map those to safe titles and rewrite the source to use those safe URLs. This has to be done for both anchors and images. Practically speaking MediaWiki is probably more flexible than you need, but these are the valid characters for titles:
[legaltitlechars] => %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+
Since MediaWiki (by default) capitalizes the first letter of each page title, you would normally need to account for that in rewriting all hrefs within the source. However, in practice, we use a Collection Name as the first path element, and MediaWiki will seamlessly redirect foo to Foo.
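A hypothetical helper (not part of Html2Wiki) illustrating the kind of href-to-title rewriting described above; the exact rules the extension applies may differ:
<?php
// Hypothetical sketch of mapping a source href onto a safe wiki title.
function makeSafeTitle( string $collection, string $href ): string {
    // Drop the .htm/.html extension plus any query string or fragment.
    $path = preg_replace( '/\.html?($|[?#].*)/i', '', $href );
    // Replace characters that are not legal in MediaWiki titles.
    $path = preg_replace( '/[#<>\[\]|{}]/', '-', $path );
    // Prefix with the Collection Name so the page lands in its hierarchy.
    return $collection . '/' . ltrim( $path, '/' );
}

echo makeSafeTitle( 'UserGuide', 'chapter1/intro.html#setup' ), "\n";
// UserGuide/chapter1/intro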
Styles and scripts
Cascading Style Sheets (CSS) as well as JavaScript (JS) are NOT kept as part of the transformation. We are, however, interested in adding CSS support.
Processing discussion
The fundamental requirement for this extension is to transform input (HTML) into wiki text (see Help:Formatting) because that is the format stored by the MediaWiki system.
For each source type, we will need to survey the content to identify the essential content, and remove navigation, JavaScript, presentational graphics, etc. We should have a "fingerprint" that we can use to sniff out the type of document set that the user is uploading to the wiki. Actually, work is underway to allow the user to create special "recipe" articles in the wiki that would instruct Html2Wiki on how to transform content. The user will be able to iteratively run a recipe in test "dry-run" mode to see the results on a sampling of content in order to perfect the recipe and then use it on a larger Collection.
As a result of sniffing the source type, we can properly index and import content only, while discarding the dross. We can likewise apply the correct transformation to the source.
The uploaded file content is saved to the server (tmp), and that triggers a conversion attempt. A title is proposed from the text (and checked in the database), and the user can override the naming. The HTML is converted to wiki text for the content of the article.
Image references are either assumed to be relative, e.g. src="../images/foo.jpg", and contained in the zip file, or absolute, e.g. src="http://example.com/images/foo.jpg", in which case they are not local to the wiki.
Want to check your source for a list of image files?
grep -P -r --only-matching '(?<=<img src=["'\''])([^"'\'']*)' ./my/html/files/
For each of the image files (png, jpg, gif) contained in the zip archive, the image asset is saved into the wiki with automatic file naming based on the "Collection Name" + path in the zip file.
Also, each image is tagged with the collection name for easier identification.
Image references in the HTML source are automatically updated to reference the in-wiki images.
@todo document the $wgEliminateDuplicateImages option
Database
The extension currently does not make any schema changes to the MediaWiki system.
Logging
Logging is provided at Special:Log/html2wiki. The facility for logging taps into LogEntry as outlined at Manual:Logging to Special:Log. Interestingly, SpecialUpload must call LogEntry from its hooks, while SpecialImport calls LogPage, which itself invokes LogEntry (see includes/logging).
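For reference, the ManualLogEntry pattern from Manual:Logging to Special:Log looks roughly like the sketch below; the 'import' subtype shown here is illustrative, not necessarily the exact subtype Html2Wiki records. It assumes it runs inside MediaWiki code where $user, $title and $comment are already available.
// Sketch of writing to Special:Log/html2wiki via ManualLogEntry.
$logEntry = new ManualLogEntry( 'html2wiki', 'import' ); // log type, subtype
$logEntry->setPerformer( $user );   // the user doing the import
$logEntry->setTarget( $title );     // the imported article's Title
$logEntry->setComment( $comment );  // comment entered on the import form
$logid = $logEntry->insert();       // write to the logging table
$logEntry->publish( $logid );       // surface it in Special:Log / RecentChanges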
Variables we care about
- We probably want a variable that can interact with the max upload size
- $wgMaxUploadSize [*] = 104857600 bytes (100 MB)
- $wgFileBlacklist: we do not care about this because we use our own file upload and mime detection
- $wgLegalTitleChars: we use this to check for valid file naming
- $wgMaxArticleSize: the default is 2048 KB, which may be too small?
- $wgMimeInfoFile: we do not yet use this
- Also, how do imagelimits come into play? http://localhost:8080/w/api.php?action=query&meta=siteinfo&format=txt
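A LocalSettings.php sketch pulling the variables listed above together; the values are the defaults mentioned above and only illustrative:
$wgMaxUploadSize = 100 * 1024 * 1024; // 100 MB; imports larger than this are rejected
$wgMaxArticleSize = 2048;             // KB; raise this if converted pages get truncated
// $wgLegalTitleChars is only read (to validate generated titles); leave it at the default.
// $wgFileBlacklist and $wgMimeInfoFile are not consulted by Html2Wiki's own upload handling.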
Internationalization
Special:Html2Wiki?uselang=qqx shows the interface messages.
You can see most of the messages in Special:AllMessages if you filter by the prefix 'Html2Wiki'
Error handling
- Submitting the form with no file: "There was an error handling the file upload: No file sent."
- Choosing a file that is too big: the limit is set to 100 MB.
- Choosing a file of the wrong type: "There was an error handling the file upload: Invalid file format."
- Choosing a file with completely broken HTML: you could end up with no wiki markup, but it tries hard to be generous.
Developing
This extension was originally written by and is currently maintained by Greg Rundlett of eQuality Technology.
Additional developers, testers, documentation helpers, and translators welcome! Please join and/or subscribe to the project in Phabricator in order to receive updates!
The project code is hosted on both GitHub and Wikimedia Foundation servers on the Extension:Html2Wiki page. You should use git to clone the project and submit pull requests. The code is simultaneously updated on MediaWiki servers and GitHub, so feel free to fork, or pull it from either location.
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Html2Wiki.git
or (with gerrit auth)
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/extensions/Html2Wiki.git
The best way to setup a full development environment is to use MediaWiki-Vagrant . This handy bit of wizardry will create a full LAMP stack for you and package it into a VirtualBox container (among others).
If you are interested in some of the background research, underlying technologies and the bevy of links to MediaWiki development resources that went into making this extension, check the Extension:Html2Wiki/References subpage.
See also
References
- ↑ Manual:ImportImages.php
- ↑ Category:Bulk_upload
- ↑ To install it system-wide, you would do something like:
curl -sS https://getcomposer.org/installer | sudo php -- --install-dir=/usr/bin --filename=composer
- ↑ e.g. sudo service apache2 restart