Extension talk:DumpHTML/Archive 1

Current issues:

  1. Notice: Use of OutputPage::setParserOptions is deprecated in ...\GlobalFunctions.php on line 2480
  2. Page names with non-standard characters (äöüß, etc.) crash the script with a "can't open file" error

--92.195.50.177 14:17, 20 April 2008 (UTC)

I have found that the special characters crash the script because the script is trying to write to a directory that does not exist. Add $this->mkdir("{$this->dest}\\temp\\"); to the function writeArticle() in the dumpHTML.inc file and non-US characters seem to work just fine. -- Seán Prunka
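
For reference, a sketch of where that added line sits. This is only an illustration: the body and parameter list of writeArticle() vary between revisions of dumpHTML.inc, and the backslashes assume a Windows host (use forward slashes on Linux).

// Sketch only -- near the top of writeArticle() inside the DumpHTML class
// in dumpHTML.inc; the original parameter list and the rest of the body
// stay unchanged.
function writeArticle( /* original parameters */ ) {
	// Make sure the temp directory exists before anything is written there,
	// otherwise pages whose names contain non-ASCII characters fail with
	// "can't open file". Backslashes assume a Windows host.
	$this->mkdir( "{$this->dest}\\temp\\" );

	// ... rest of the original writeArticle() body ...
}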

Installation instructions

Some installation instructions would be helpful. Here's what I did. Hopefully someone more knowledgeable than me can edit this and move it to the article page.

If you have web access from your MediaWiki server, this should suffice:

cd /whatever/mediawiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML

I don't, so I had to do this on a separate machine:

cd /tmp
svn export http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML

Subversion retrieves files, and reports their names and the revision number.

tar cjvf ~/DumpHTML-version.tar.bz2 DumpHTML
rm -rf DumpHTML

Then on my MediaWiki machine:

cd /whatever/mediawiki/extensions
tar xjvf ~/DumpHTML-version.tar.bz2

Invocation:

php /whatever/mediawiki/extensions/DumpHTML/dumpHTML.php options

LocalSettings.php

Is it possible to call this extension with a line in LocalSettings.php? --Rovo 01:43, 13 June 2008 (UTC)

Rovo, sorry no. It's built to be run from a shell. --Gadlen 09:12, 18 August 2008 (UTC)


Usage Instructions

dumpHTML.php expects to be run from the maintenance directory. The "skins" directory won’t get included in the HTML package if you run it from another directory. So if you are running it on a cron job and putting together a .tar.gz of your wiki for downloading, your shell script might look something like this:

#!/bin/sh

cd /YourWikiDirectory/extensions/DumpHTML

/bin/php dumpHTML.php -d /YourTargetDirectoryForHTML -k monobook --image-snapshot --force-copy

/bin/tar -czf /TemporaryDirectoryForTarball/YourWikiAllWrappedUp.tar.gz /YourTargetDirectoryForHTML

mv /TemporaryDirectoryForTarball/YourWikiAllWrappedUp.tar.gz /YourWebAccessibleDirectory/YourWikiAllWrappedUp.tar.gz

--Gadlen 09:12, 18 August 2008 (UTC)

I'm developing a resource that will most likely include links to other sites or at least the reports hosted on those other sites. Is it possible for this extension to perhaps download the first level/depth of external links too? --Charlener 02:53, 15 September 2008 (UTC)

You could perhaps modify my script (see Bugzilla 8147) accordingly. --Wikinaut 06:43, 15 September 2008 (UTC)

Dumping pages for commons images

I'm able to generate image pages for local images. What data dumps must be loaded to generate the shared (commons) image pages? Best regards. Naudefj 15:06, 7 October 2008 (UTC)

I don't know, I haven't used that feature yet. --Wikinaut 16:25, 7 October 2008 (UTC)

I've noticed that the provided HTML dumps include commons image pages. Are they generated with this script? If not, what dumper program is used to generate them? Best regards. Naudefj 21:05, 8 October 2008 (UTC)

Have a look at the source. As far as I understand, they _are_ indeed copied. --Wikinaut 06:08, 9 October 2008 (UTC)

Usage with symlinked MW core?

I inherited a setup that has a single MW install and a number of wikis. Each wiki is set up as symlinks for all of MW except ./images and LocalSettings.php. My problem is that this script wants to refer to the installation directory of MW instead of the "home" of each wiki. I tried export MW_INSTALL_PATH=".../home_of_wiki", which almost works, except the image code still tries to go to the actual install path. I've noticed this behavior in other MW scripts. The core maintenance scripts are supposed to have an option to solve this problem. --Cymen 18:13, 21 October 2008 (UTC)

getFriendlyName

What's the delay in moving to getFriendlyName? Why is getHashedFilename used?

The article page section "Filename problems solved by a modified version of DumpHTML" explains it. --Wikinaut 07:32, 7 January 2009 (UTC)


What is the purpose of this? I commented it out and it seems to work fine, but did I create the potential for disaster?

# Make it mostly unique
if ( $lowerCase != $friendlyName  ) {
	$friendlyName .= '_' . substr(md5( $name ), 0, 4);
}
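
My reading of that check (not verified against the article page section): the hash fragment is only appended when the friendly name contains uppercase letters, so that two titles differing only in letter case cannot end up as the same file on a case-insensitive filesystem. If your wiki has no such near-duplicate titles, commenting it out will probably go unnoticed, but the risk is there. A rough illustration, with hypothetical names rather than the extension's own code:

// "Foo" and "foo" are distinct wiki pages, but "Foo.html" and "foo.html"
// collide on a case-insensitive filesystem (e.g. on Windows). Appending a
// short hash of the original, case-sensitive name keeps them apart.
$fileA = 'Foo' . '_' . substr( md5( 'Foo' ), 0, 4 );  // "Foo_" plus 4 hex digits
$fileB = 'foo';                                       // all lowercase, left untouched
// $fileA and $fileB now differ even after case-folding.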


Use my version from bugzilla:8147#c4 as it fixes several problems with non-ASCII article and image filenames. --Wikinaut 19:01, 7 January 2009 (UTC)
When you register an account here on MediaWiki, you can make use of e-mail notification: you will receive an e-mail when a page you are watching is changed, for example when your questions are answered. Make sure to have your e-mail address confirmed and the correct settings enabled in your preferences. --Wikinaut 19:03, 7 January 2009 (UTC)

Inline CSS

Why is the CSS from the page not transferred? --DaSch 21:32, 12 February 2009 (UTC)

What do you mean exactly? --Wikinaut 01:03, 13 February 2009 (UTC)
Compare http://www.wecowi.de and http://static.wecowi.de --DaSch 10:59, 13 February 2009 (UTC)
I visited your site. One remark: the loading times of your site are very long.
Question: what version of DumpHTML have you used? My version (see the section on the article page; it can be downloaded via the URL mentioned in bugzilla:8147#c4) has fixed some problems, especially with filenames. Can you please try it? Regarding the "standard" version, I have no idea. Perhaps it would be a good idea to ask Tim Starling during the developers' meeting in Berlin. --Wikinaut 11:59, 13 February 2009 (UTC)

I just can't dump

Well, I just tried tonight to dump my wiki to HTML. My wiki's language is Spanish. There was an error on a page called "Flashback (animación de Flash)". My operating system is Windows 7. What do I have to do? --MisterWiki 16:15, 24 July 2009 (UTC)

Please help me --MisterWiki 21:17, 28 July 2009 (UTC)
Special characters crash the script. My guess is it's the "ó" in the title. If there aren't many, rename those pages to leave out the ó or any other accents.
I have found that the special characters crash the script because the script is trying to write to a directory that does not exist. Add $this->mkdir("{$this->dest}\\temp\\"); to the function writeArticle() in the dumpHTML.inc file and non-US characters seem to work just fine. -- Seán Prunka

Error: Cannot modify header information

I can't dump either, using Windows XP Pro, MediaWiki 1.13.4, PHP 5.2.8 (apache2handler), MySQL 5.1.30-community. Still the same error as in the old report from 2005 (bugzilla:4132).

Error message: Warning: Cannot modify header information - headers already sent by (output started at \extensions\DumpHTML\dumpHTML.inc:619) in \includes\WebResponse.php on line 10

--Wissenslogistiker 10:36, 14 August 2009 (UTC)

Remove ?> from the end of dumpHTML.inc. —Emufarmers(T|C) 02:01, 15 August 2009 (UTC)
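
For context, this is a general PHP issue rather than something specific to DumpHTML: any characters after a closing ?> tag, even a single trailing newline, are sent to the browser as output before MediaWiki has set its headers, which is exactly the "headers already sent" warning above. A schematic end-of-file (illustration only, not the real contents of dumpHTML.inc):

<?php
// Illustration only: a pure-PHP file should simply end after its last
// statement or closing brace.
class DumpHTML {
	// ... methods ...
}
// No closing "?>" here. PHP does not require the closing tag, and omitting
// it guarantees no stray whitespace is emitted before headers are sent.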

Blows up on PostgreSQL

...on MediaWiki 1.15.1, though this may not be the extension's fault:

WARNING: destination directory already exists, skipping initialisation
Creating static HTML dump in directory /my/target/directory.
Using database localhost
Starting from page_id 1 of 727
Processing ID: 1
Warning: pg_query(): Query failed: ERROR:  column "mwuser.user_id" must appear in the
 GROUP BY clause or be used in an aggregate function in
 /usr/share/mediawiki/includes/db/DatabasePostgres.php on line 580
Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging
 information.
zsh: segmentation fault  MW_INSTALL_PATH=/my/symlinkfarm/path php5 dumpHTML.php  -d

I don't know where to report issues, so here it goes. --Wwwwolf 11:14, 12 October 2009 (UTC)
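
If anyone wants to dig further before filing a bug, the warning itself suggests the first step; in LocalSettings.php (the file already opens with <?php), add:

// Show full exception details (message and backtrace) instead of the
// generic hint, which should make the PostgreSQL GROUP BY failure above
// easier to pin down and report.
$wgShowExceptionDetails = true;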

Return to "DumpHTML/Archive 1" page.