There are two ways to use WikiTaxi with Wikia:
1. Use a database dump
2. Create a database by exporting each page
Pros and Cons
Using a database dump has the advantage that Wikia has already created the dump for you. You just download the archived dump and you're ready to go. The disadvantage is that Wikia can take 2-4 weeks (and sometimes longer) to create the dumps, so it's not a great way to take things offline if you want the latest content.
Using page exports has the advantage that you get the latest data, but it takes longer and the process is more complex.
Using a database dump[]
Go to your site's Special:Statistics page. For example: Goto: http://zumalifeguard.wikia.com/wiki/Special:Statistics
Download pages_current.xml.gz. This is under Database dumps,
- click on the version next to: This version is usually best for bot use.
WikiTaxi_Importer requires a .bz2 file, not .gz. To convert pages_current.xml.gz to pages_current.xml.bz2, use one of the following options:
- Option 1:
  - With 7-Zip installed, right click on the .gz file and select 7-Zip / Extract here
  - Right click on the extracted .xml file and select 7-Zip / Add to archive...
  - When the UI shows, change the archive format to BZip2
- Option 2:
  - Copy pages_current.xml.gz to the same folder as wikitaxi_PostImport.bat, which is a batch file that looks like this:
    gunzip --to-stdout pages_current.xml.gz | bzip2 >pages_current.xml.bz2
  - Run wikitaxi_PostImport.bat
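If neither 7-Zip nor gunzip/bzip2 is available, the same conversion can be sketched with Python's standard library. This is only an illustration of the recompression step, not part of the original toolchain; the function name gz_to_bz2 is made up for this example.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(gz_path, bz2_path, chunk_size=1024 * 1024):
    """Decompress a .gz file and recompress its contents as .bz2."""
    with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
        # copyfileobj streams in chunks, so the whole dump never has to fit in memory.
        shutil.copyfileobj(src, dst, chunk_size)
```

Calling gz_to_bz2("pages_current.xml.gz", "pages_current.xml.bz2") in the folder that holds the download should leave a .bz2 file next to the original.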
Run WikiTaxi_Importer.exe to import. Sometimes this program works better if you edit the WikiTaxi_Importer.ini file first:
[Options]
XmlFile=C:\Utils\pages_current.xml.bz2
DataBaseFile=C:\Utils\Zuma.taxi
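If you rebuild the database often, the .ini file can be generated rather than edited by hand. The sketch below writes only the two keys shown in the example above; whether WikiTaxi_Importer reads any other keys is not known here, and build_importer_ini is a hypothetical helper name.

```python
import configparser
import io


def build_importer_ini(xml_file, database_file):
    """Return WikiTaxi_Importer.ini text containing a single [Options] section."""
    # interpolation=None keeps backslashes and percent signs in paths untouched.
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep key case (XmlFile, DataBaseFile) as written
    config["Options"] = {"XmlFile": xml_file, "DataBaseFile": database_file}
    buffer = io.StringIO()
    # No spaces around "=" so the output matches the hand-written example.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

Writing the returned text to WikiTaxi_Importer.ini, next to WikiTaxi_Importer.exe, should give the same file as the hand-edited version.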
Using page exports
1. Get the list of pages
First, download the AllPages HTML page and go through each link to get the names of the wiki pages. Then download each page individually using Wikia's export feature. (Note that this is not screen-scraping each page, which would be bad.) Here are instructions for getting the list of pages to export and setting it up to be exported:
Batch File: 1-ExtractTheAllPages.bat
Make sure you change 'YOURWIKIAPAGE' below:
@echo off
::
::Get all links from the AllPages page
::
set WIKISITE=http://YOURWIKIAPAGE.wikia.com
if exist in.html del in.html
wget -O in.html %WIKISITE%/wiki/Special:AllPages
if not exist in.html echo *** File does not exist in.html & goto :eof
if not exist out.dir mkdir out.dir
cscript /nologo ss_AllPages.vbs in.html out.dir\ %WIKISITE% >_tempfillPages.bat
echo.
echo _tempfillPages.bat created. Run 2*.bat
echo.
Script File: ss_AllPages.vbs
The above batch file requires the following vbs script file: ss_AllPages.vbs
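For readers who want to see the link-extraction idea without Internet Explorer, here is a rough Python sketch. It is not the ss_AllPages.vbs script. It assumes that page links look like href="/wiki/Title", that the listing fits on a single AllPages page, and that any title containing a colon is a namespaced link to skip (which will also drop real pages with a colon in the name). WikiLinkCollector and extract_titles are made-up names.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class WikiLinkCollector(HTMLParser):
    """Collect page titles from <a href="/wiki/..."> links."""

    def __init__(self, prefix="/wiki/"):
        super().__init__()
        self.prefix = prefix
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if not href.startswith(self.prefix):
            return
        title = unquote(href[len(self.prefix):])
        # Assumption: a colon means a namespaced link (Special:, File:, ...).
        if title and ":" not in title and title not in self.titles:
            self.titles.append(title)


def extract_titles(html_text):
    """Return the unique main-namespace titles linked from the given HTML."""
    parser = WikiLinkCollector()
    parser.feed(html_text)
    return parser.titles
```

Because this reads the href text directly instead of going through a browser DOM, a title that ends in a dot is returned unchanged. Navigation links to ordinary pages (such as the main page) will also be picked up, so the result may need a manual check.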
2. Get each page
In this step, we will execute the _tempfillPages.bat batch file that was created in the previous step. _tempfillPages.bat downloads the export XML file for each page and puts it in the out.dir\ subfolder. It does this using the wget.exe utility, so make sure you have that in your path. wget.exe can be downloaded separately, and it's also part of Cygwin and GnuWin32.
Batch File: 2-GetEachPage.bat
@echo off
::
::Download the export XML for each page
::
if exist out.dir rmdir /s /q out.dir
mkdir out.dir
call _tempfillPages.bat
set /A outdircount=0 & for %%i in (out.dir\*) do set /A outdircount=outdircount+1
echo link count in html: %alistelementcount%
echo pages in out directory: %outdircount%
if not %alistelementcount%==%outdircount% echo *** out.dir does not contain all the files & goto :eof
echo.
echo files in out.dir. Run 3*.bat
echo.
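The download step can also be sketched in Python. The contents of the generated _tempfillPages.bat are not shown on this page, so the URL form below (MediaWiki's Special:Export/Title) is an assumption about what it requests, and export_url, safe_filename and download_pages are hypothetical names.

```python
import os
from urllib.parse import quote
from urllib.request import urlretrieve


def export_url(wiki_site, title):
    """Build the Special:Export URL for one page title (assumed URL form)."""
    return f"{wiki_site}/wiki/Special:Export/{quote(title)}"


def safe_filename(title):
    """Map a page title to a file name that Windows accepts."""
    return "".join("_" if ch in '\\/:*?"<>|' else ch for ch in title) + ".xml"


def download_pages(wiki_site, titles, out_dir="out.dir"):
    """Download one export XML file per title; return True if the counts match."""
    os.makedirs(out_dir, exist_ok=True)
    for title in titles:
        target = os.path.join(out_dir, safe_filename(title))
        urlretrieve(export_url(wiki_site, title), target)
    # Same sanity check as the batch file: one file per link.
    return len(os.listdir(out_dir)) == len(titles)
```

As in the batch version, out_dir should start out empty, otherwise the count check fails. Two titles that differ only in replaced characters would overwrite each other, which the check would also catch.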
3. Make the xml database
In this section, we will concatenate all of the export XML files in out.dir into a single XML database (pages_current.xml).
This batch file requires a few things before it can run.
- It requires xml.exe, which comes from the "XMLStarlet Toolkit". You can Google that name to get the tool.
- It requires tail.exe, which you can get from Cygwin or GnuWin32.
- It also requires some files to already exist in the current directory: pages_current_HEADER.xml, pages_current_FOOTER.xml and mediawikihead.txt. These files are listed after the batch file.
Batch File: 3-MakePages.bat
@echo off
::
::Make Pages
::
echo. 2>allpages.xml
for %%i in (out.dir\*) do call :doForXML "%%i"
echo. 2>pages_current.xml
copy /b pages_current.xml + pages_current_HEADER.xml
copy /b pages_current.xml + allpages.xml
copy /b pages_current.xml + pages_current_FOOTER.xml

::Make sure we got the right number of pages.
tail --lines=+2 pages_current.xml >_temp.xml
copy /b /y mediawikihead.txt + _temp.xml _temp2.xml
xml sel -t -v "count(/mediawiki/page)" _temp2.xml > _temp.txt
set /p pagesinxml= < _temp.txt
echo.
echo pages created in xml: %pagesinxml%
echo pages in out directory: %outdircount%
if not %pagesinxml%==%outdircount% echo *** pages in xml does not contain all the files & goto :eof
echo.
echo pages_current.xml got created. Run 4*.bat
echo.
goto :eof

:doForXML
echo Processing: %1
::replace the first line. For some reason xml.exe doesn't handle the namespace.
tail --lines=+2 %1 >_temp.xml
copy /b /y mediawikihead.txt + _temp.xml _temp2.xml
:: pull out the <page> section and append it to allpages.xml
xml sel -t -c "/mediawiki/page" _temp2.xml >_temp3.xml
call :complainZeroSizeFile _temp3.xml
copy /b allpages.xml + _temp3.xml
goto :eof

:complainZeroSizeFile
if %~z1 LEQ 1 echo *** & echo *** file %1 is too small & pause
goto :eof
File: pages_current_HEADER.xml
Replace WIKINAME with your wiki's name, and keep the opening mediawiki tag on a single line, because 3-MakePages.bat replaces the first line of the file when it counts the pages.
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>WIKINAME Wiki</sitename>
    <base>http://WIKINAME.wikia.com/wiki/WIKINAME</base>
    <generator>MediaWiki 1.15.2</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
      <namespace key="3">User talk</namespace>
      <namespace key="4">WIKINAME Wiki</namespace>
      <namespace key="5">WIKINAME Wiki talk</namespace>
      <namespace key="6">File</namespace>
      <namespace key="7">File talk</namespace>
      <namespace key="8">MediaWiki</namespace>
      <namespace key="9">MediaWiki talk</namespace>
      <namespace key="10">Template</namespace>
      <namespace key="11">Template talk</namespace>
      <namespace key="12">Help</namespace>
      <namespace key="13">Help talk</namespace>
      <namespace key="14">Category</namespace>
      <namespace key="15">Category talk</namespace>
      <namespace key="110">Forum</namespace>
      <namespace key="111">Forum talk</namespace>
      <namespace key="400">Video</namespace>
      <namespace key="401">Video talk</namespace>
      <namespace key="500">User blog</namespace>
      <namespace key="501">User blog comment</namespace>
      <namespace key="502">Blog</namespace>
      <namespace key="503">Blog talk</namespace>
    </namespaces>
  </siteinfo>
File: pages_current_FOOTER.xml
</mediawiki>
File: mediawikihead.txt
<mediawiki>
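The merge in this step can also be sketched with an XML parser that copes with the export namespace, which avoids the first-line replacement trick. This is an alternative illustration, not what 3-MakePages.bat does internally; local_name, collect_pages and merge_exports are hypothetical names, and the header and footer files are assumed to be the ones listed above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def collect_pages(export_files):
    """Return every <page> element from the export files, with namespaces removed."""
    pages = []
    for path in export_files:
        root = ET.parse(path).getroot()
        for elem in root.iter():
            elem.tag = local_name(elem.tag)
        pages.extend(root.findall("page"))
    return pages


def merge_exports(export_files, header_path, footer_path, out_path):
    """Write header + all <page> elements + footer; return the number of pages."""
    pages = collect_pages(export_files)
    with open(out_path, "w", encoding="utf-8") as out:
        with open(header_path, encoding="utf-8") as header:
            out.write(header.read())
        for page in pages:
            out.write(ET.tostring(page, encoding="unicode"))
            out.write("\n")
        with open(footer_path, encoding="utf-8") as footer:
            out.write(footer.read())
    return len(pages)
```

The returned count plays the same role as the pagesinxml check in the batch file: it should equal the number of files in out.dir.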
4. Compress and import into WikiTaxi
In this step, we take the pages_current.xml file created in the previous step and compress it into pages_current.xml.bz2, which is the format WikiTaxi wants.
After that, we use WikiTaxi_Importer to import pages_current.xml.bz2 into WIKINAME.taxi. Change WIKINAME below to your wiki site name (like "Zuma"), just as you did above for WIKISITE.
The batch file then launches WikiTaxi so you can test it out.
It's best to set up WikiTaxi_Importer.ini correctly first. See the example near the top of this page.
The batch file requires bzip2.exe, which you can get from Cygwin or GnuWin32.
Batch File: 4-WikiTaxiImport.bat
@echo off
::
::Compress and import into WikiTaxi
::
if exist pages_current.xml.bz2 del pages_current.xml.bz2
if exist pages_current.xml.bz2 echo could not delete pages_current.xml.bz2 & goto :eof
bzip2 --keep --force pages_current.xml
if not exist pages_current.xml.bz2 echo could not create pages_current.xml.bz2 & goto :eof
if exist WIKINAME.taxi.bak del WIKINAME.taxi.bak
if exist WIKINAME.taxi.bak echo could not delete WIKINAME.taxi.bak & goto :eof
if exist WIKINAME.taxi ren WIKINAME.taxi WIKINAME.taxi.bak
if exist WIKINAME.taxi echo could not rename WIKINAME.taxi to WIKINAME.taxi.bak & goto :eof
if exist pages_current.xml.bz2 WikiTaxi_Importer.exe
if not exist WIKINAME.taxi echo WIKINAME.taxi was not created & goto :eof
start WikiTaxi.exe WIKINAME.taxi
goto :eof
Putting it all together
Sometimes you want to run each of the 4 steps above separately, so you can validate that everything is working along the way. That way, if you have a flaky internet connection or things aren't working for some other reason, you'll know early.
However, if you want, you can put them all in a single batch file:
Batch file: WikiTaxi_ExportEachPageAndCreateDB.bat
:: Files required by this process
:: 1-ExtractTheAllPages.bat
:: 2-GetEachPage.bat
:: 3-MakePages.bat
:: 4-WikiTaxiImport.bat
:: mediawikihead.txt
:: pages_current_FOOTER.xml
:: pages_current_HEADER.xml
:: ss_AllPages.vbs
:: WikiTaxi_ExportEachPageAndCreateDB.bat
call 1-ExtractTheAllPages.bat
pause
call 2-GetEachPage.bat
pause
call 3-MakePages.bat
pause
call 4-WikiTaxiImport.bat
pause
Problems
One problem I ran into has to do with the vbs script file. It uses Internet Explorer to locate the links in the AllPages page (the page isn't XHTML, otherwise we could have used an XML parser). The problem is that Internet Explorer doesn't like links that end in a dot, so it strips the dot. The script extracts the link from the "href" property of the element, and by the time the page is loaded into the DOM, the dot is already gone. I haven't found a great workaround for this, other than to simply rename any pages that end with a dot.