WikiTaxi

There are two ways to use WikiTaxi with Wikia:

1. Use a database dump.
2. Create a database by exporting each page.

Pros and Cons

The advantage of a database dump is that Wikia has already created it. You just download the archived dump and you're ready to go. The disadvantage is that Wikia can take 2-4 weeks (and sometimes longer) to create the dumps, so it's not a great way to take a wiki offline if you want the latest content.

The advantage of page exports is that you get the latest data, but the process takes longer and is more complex.

Using a database dump

Go to your site's Special:Statistics page, for example: http://zumalifeguard.wikia.com/wiki/Special:Statistics

Download pages_current.xml.gz. It is listed under Database dumps: click the version labelled "This version is usually best for bot use."

WikiTaxi_Importer requires a .bz2 file, not .gz. To convert pages_current.xml.gz to pages_current.xml.bz2, use one of the following options:

Option 1:
- With 7-Zip installed, right click on the .gz file and select 7-Zip / Extract here
- Right click on the extracted .xml file and select 7-Zip / Add to archive...
  When the UI shows, change the archive format to BZip2
Option 2:
- Copy pages_current.xml.gz to the same folder as wikitaxi_PostImport.bat
  which is a batch file that looks like this:
  gunzip --to-stdout pages_current.xml.gz | bzip2 >pages_current.xml.bz2
- Run wikitaxi_PostImport.bat
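
If you have Python but neither 7-Zip nor gunzip/bzip2, the same conversion can be sketched with the standard library. This is only a sketch, not something the original setup used. The function name gz_to_bz2 is made up for this example, and the file names are the ones from the options above.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(gz_path, bz2_path):
    """Recompress a .gz file as .bz2, streaming so the XML never sits fully in memory."""
    with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (hypothetical location: the current directory):
# gz_to_bz2("pages_current.xml.gz", "pages_current.xml.bz2")
```

The result should be equivalent to what the gunzip/bzip2 pipeline in Option 2 produces, although the compressed bytes may differ.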

Run WikiTaxi_Importer.exe to import the file. Sometimes this program works better if you edit the WikiTaxi_Importer.ini file first:

[Options]
XmlFile=C:\Utils\pages_current.xml.bz2
DataBaseFile=C:\Utils\Zuma.taxi

Using page exports

1. Get the list of pages

First, download the AllPages HTML page, then go through each link to get the names of the wiki pages. Then download each page using Wikia's export feature (note that you're not screen-scraping each page; that would be bad). Here are instructions for getting the list of pages to export and setting them up to be exported:

Batch File: 1-ExtractTheAllPages.bat

Make sure you change the 'YOURWIKIAPAGE' below:

@echo off
::
::Get all links from the AllPages page
::
set WIKISITE=http://YOURWIKIAPAGE.wikia.com
if exist in.html del in.html
wget -O in.html %WIKISITE%/wiki/Special:AllPages
if not exist in.html echo *** File does not exist in.html & goto :eof
if not exist out.dir mkdir out.dir
cscript /nologo ss_AllPages.vbs in.html out.dir\ %WIKISITE% >_tempfillPages.bat
echo.
echo _tempfillPages.bat created.   Run 2*.bat
echo.

Script File: ss_AllPages.vbs

The above batch file requires the following vbs script file: ss_AllPages.vbs
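
The script itself isn't listed here, so as a rough illustration of what this step has to do, here is a Python sketch that reads the AllPages HTML and produces the text of a download batch file. It is not the contents of ss_AllPages.vbs. The names WikiLinkCollector and make_download_batch, the numbered output file names, and the rule for picking out article links are all assumptions. It also assumes the generated file should set alistelementcount, because 2-GetEachPage.bat reads that variable. The Special:Export/PageName URL form is the standard MediaWiki export address.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect href values of <a> tags that point under /wiki/."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: article links look like /wiki/Title, and a colon
        # marks a non-article namespace such as Special: or Category:.
        if href.startswith("/wiki/") and ":" not in href:
            if href not in self.hrefs:
                self.hrefs.append(href)


def make_download_batch(html_text, out_dir, wiki_site):
    """Return the text of a batch file that exports each linked page."""
    collector = WikiLinkCollector()
    collector.feed(html_text)
    lines = ["@echo off"]
    for number, href in enumerate(collector.hrefs, start=1):
        title = href[len("/wiki/"):]
        url = f"{wiki_site}/wiki/Special:Export/{title}"
        # A literal % must be doubled inside a batch file.
        url = url.replace("%", "%%")
        lines.append(f'wget -O "{out_dir}{number:05d}.xml" "{url}"')
    # Assumption: 2-GetEachPage.bat expects this variable to be set here.
    lines.append(f"set alistelementcount={len(collector.hrefs)}")
    return "\r\n".join(lines) + "\r\n"
```

Because the href is taken straight from the HTML text rather than from a browser DOM, a title that ends in a dot comes through unchanged. The link filter will probably also pick up navigation links such as the main page, so check the generated file before running it.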

2. Download each page

In this step, we run the _tempfillPages.bat batch file that was created in the previous step. _tempfillPages.bat downloads the export XML file for each page and puts it in the out.dir\ subfolder. It does this using the wget.exe utility, so make sure you have that in your path. wget.exe can be downloaded separately, and it is also part of Cygwin and GnuWin32.

Batch File: 2-GetEachPage.bat

@echo off
::
::Download the export XML for each page
::
if exist out.dir rmdir /s /q out.dir
mkdir out.dir
call _tempfillPages.bat
set /A outdircount=0 & for %%i in (out.dir\*) do set /A outdircount=outdircount+1

:: alistelementcount is not set in this file; it is assumed to come from _tempfillPages.bat.
echo link count in html: %alistelementcount%
echo pages in out directory: %outdircount%
if not "%alistelementcount%"=="%outdircount%" echo *** out.dir does not contain all the files & goto :eof

echo.
echo files in out.dir.  Run 3*.bat
echo.

3. Make the xml database

In this section, we will concatenate all of the export XML files in out.dir into a single XML database (pages_current.xml).

This batch file requires a few things before it can run:

- xml.exe, which comes from the "XMLStarlet Toolkit". You can Google that name to get the tool.
- tail.exe, which you can get from Cygwin or GnuWin32.
- Some files that must already exist in the current directory, namely pages_current_HEADER.xml, pages_current_FOOTER.xml and mediawikihead.txt. These files are listed after the batch file.

Batch File: 3-MakePages.bat

@echo off
::
::Make Pages
::
echo. 2>allpages.xml
for %%i in (out.dir\*) do call :doForXML "%%i"

echo. 2>pages_current.xml
copy /b pages_current.xml + pages_current_HEADER.xml
copy /b pages_current.xml + allpages.xml
copy /b pages_current.xml + pages_current_FOOTER.xml

::Make sure we got the right number of pages.
tail --lines=+2 pages_current.xml >_temp.xml
copy /b /y mediawikihead.txt + _temp.xml _temp2.xml
xml sel -t -v "count(/mediawiki/page)" _temp2.xml > _temp.txt
set /p pagesinxml= < _temp.txt
echo.
echo pages created in xml: %pagesinxml%
echo pages in out directory: %outdircount%
if not %pagesinxml%==%outdircount% echo *** pages in xml does not contain all the files & goto :eof


echo.
echo pages_current.xml got created.   Run 4*.bat
echo.
goto :eof


:doForXML
echo Processing: %1

::replace the first line.  For some reason xml.exe doesn't handle the namespace.
tail --lines=+2 %1 >_temp.xml
copy /b /y mediawikihead.txt + _temp.xml _temp2.xml

:: pull out the <page> section and append it to allpages.xml
:: (no trailing slash in the XPath, matching the count() expression used above)
xml sel -t -c "/mediawiki/page" _temp2.xml >_temp3.xml

call :complainZeroSizeFile _temp3.xml
copy /b allpages.xml + _temp3.xml
goto :eof

:complainZeroSizeFile
if %~z1 LEQ 1 echo *** & echo *** file %1 is too small & pause
goto :eof
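
For comparison, the same merge can be sketched in Python with the standard xml.etree module, which understands the namespace and so doesn't need the first-line replacement trick. This is an untested alternative, not part of the original process, and I haven't checked its output against WikiTaxi_Importer. The function names local_name, merge_exports and build_database are made up for this sketch. It assumes the header and footer text together form a well-formed document, as the two files listed on this page do.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def merge_exports(header_root, export_paths):
    """Append every <page> from the export files; return the page count."""
    for path in export_paths:
        root = ET.parse(path).getroot()
        pages = [child for child in root if local_name(child.tag) == "page"]
        if not pages:
            print(f"*** no <page> element in {path}")
        header_root.extend(pages)
    return sum(1 for child in header_root if local_name(child.tag) == "page")


def build_database(header_text, footer_text, export_paths, out_path):
    """Write header + all pages + footer to out_path; return the page count."""
    root = ET.fromstring(header_text + footer_text)
    # Reuse the header's namespace as the default one when writing,
    # so elements are written as <page> rather than <ns0:page>.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    count = merge_exports(root, export_paths)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    return count
```

You would call build_database with the contents of pages_current_HEADER.xml and pages_current_FOOTER.xml and the list of files in out.dir, then compare the returned count with the number of files, which is the same check the batch file makes.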

File: pages_current_HEADER.xml

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>WIKINAME Wiki</sitename>
    <base>http://WIKINAME.wikia.com/wiki/WIKINAME</base>
    <generator>MediaWiki 1.15.2</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
      <namespace key="3">User talk</namespace>
      <namespace key="4">WIKINAME Wiki</namespace>
      <namespace key="5">WIKINAME Wiki talk</namespace>
      <namespace key="6">File</namespace>
      <namespace key="7">File talk</namespace>
      <namespace key="8">MediaWiki</namespace>
      <namespace key="9">MediaWiki talk</namespace>
      <namespace key="10">Template</namespace>
      <namespace key="11">Template talk</namespace>
      <namespace key="12">Help</namespace>
      <namespace key="13">Help talk</namespace>
      <namespace key="14">Category</namespace>
      <namespace key="15">Category talk</namespace>
      <namespace key="110">Forum</namespace>
      <namespace key="111">Forum talk</namespace>
      <namespace key="400">Video</namespace>
      <namespace key="401">Video talk</namespace>
      <namespace key="500">User blog</namespace>
      <namespace key="501">User blog comment</namespace>
      <namespace key="502">Blog</namespace>
      <namespace key="503">Blog talk</namespace>
    </namespaces>
  </siteinfo>

File: pages_current_FOOTER.xml

</mediawiki>

File: mediawikihead.txt

<mediawiki>

4. Compress and import into WikiTaxi

In this step, we take the pages_current.xml file created in the previous step and compress it into pages_current.xml.bz2, which is what WikiTaxi wants.

After that, we use WikiTaxi_Importer to import pages_current.xml.bz2 into WIKINAME.taxi. Change WIKINAME below to your wiki site name (like "Zuma"), just like you did above for WIKISITE.

The batch file then launches WikiTaxi so you can test it out.

It's best to set up WikiTaxi_Importer.ini correctly first. See the example near the top of this page.

The batch file requires bzip2.exe, which you can get from cygwin or GnuWin32.

Batch File: 4-WikiTaxiImport.bat

@echo off
::
::Compress and import into WikiTaxi
::
if exist pages_current.xml.bz2 del pages_current.xml.bz2
if exist pages_current.xml.bz2 echo could not delete pages_current.xml.bz2 & goto :eof
bzip2 --keep --force pages_current.xml
if not exist pages_current.xml.bz2 echo could not create pages_current.xml.bz2 & goto :eof

if exist WIKINAME.taxi.bak del WIKINAME.taxi.bak
if exist WIKINAME.taxi.bak echo could not delete WIKINAME.taxi.bak & goto :eof
if exist WIKINAME.taxi ren WIKINAME.taxi WIKINAME.taxi.bak
if exist WIKINAME.taxi echo could not rename WIKINAME.taxi to WIKINAME.taxi.bak & goto :eof
if exist pages_current.xml.bz2 WikiTaxi_Importer.exe
if not exist WIKINAME.taxi echo WIKINAME.taxi was not created & goto :eof
start WikiTaxi.exe WIKINAME.taxi
goto :eof

Putting it all together

Sometimes you want to run each of the 4 steps above separately, so you can check that everything is working along the way. That way, if you have a flaky internet connection or things aren't working for some other reason, you'll know early.

However, if you want, you can put them all in a single batch file:

Batch file: WikiTaxi_ExportEachPageAndCreateDB.bat

:: Files required by this process
::   1-ExtractTheAllPages.bat
::   2-GetEachPage.bat
::   3-MakePages.bat
::   4-WikiTaxiImport.bat
::   mediawikihead.txt
::   pages_current_FOOTER.xml
::   pages_current_HEADER.xml
::   ss_AllPages.vbs
::   WikiTaxi_ExportEachPageAndCreateDB.bat

call 1-ExtractTheAllPages.bat
pause
call 2-GetEachPage.bat
pause
call 3-MakePages.bat
pause
call 4-WikiTaxiImport.bat
pause

Problems

One problem I ran into has to do with the vbs script file. It uses Internet Explorer to locate the links in the AllPages page (the page isn't XHTML, otherwise we could have used an XML parser). The problem is that Internet Explorer doesn't like links that end in a dot, so it strips the dot. The script extracts the link from the "href" property of the element, and by the time the page is loaded into the DOM, the dot is gone. I haven't found a great workaround for this, other than to simply rename any pages that end with a dot.
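
To find out which pages are affected before renaming anything, one option is to read the raw AllPages HTML without a browser. The following is a small sketch using Python's standard html.parser, which hands back each href exactly as written in the file. The names HrefCollector and links_ending_in_dot are made up for this example, and it assumes the dot is still present in the downloaded in.html and only lost inside Internet Explorer, as described above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, exactly as written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_ending_in_dot(html_text):
    """Return the hrefs whose last character is a dot."""
    collector = HrefCollector()
    collector.feed(html_text)
    return [href for href in collector.hrefs if href.endswith(".")]
```

Running links_ending_in_dot on the text of in.html gives the list of links that would lose their trailing dot.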