Wiki2TeiHelp

This page documents version 1.0 of the Wiki2Tei processor.

Introduction
Can Wikipedia be used as a corpus?
- Standard corpus markup
- Conversion strategy
Installation
Configuration
Scripts usage
- Command line options
- Synopsis
- Debugging
Output files
Script logs
Output database
Links to related topics
The tests suite

Wiki2Tei is a converter designed to transform text formatted according to the rules used on some wiki sites such as Wikipedia, WikiBooks, etc. and to produce well formed XML documents based on the TEI (Text Encoding Initiative) model. To learn more about the TEI format, see the TEI home page.

This page documents how to install and how to use this tool.

Introduction

The text of the pages found on a wiki site is usually written using formatting rules which let the author specify the structure and the style of the page: this is called the wiki text. Before a page can be displayed in a browser, the wiki engine has to translate the wiki text into the HTML language which is the language used by all the pages of the World Wide Web.

Wiki2Tei uses the same approach in order to convert the original wiki text into XML format. The result of this transformation is a (hopefully) well formed document which can then be used for any kind of purpose: indexing, litteral or grammatical analysis, morpho-syntactic treatment, post-processing, transformation into other formats with an XSLT style sheet, visualization as a tree, etc.

The Wiki2Tei converter was developed at the LIMSI (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur) as part of the Autograph project. It was originally intended to provide a tool to convert all the pages of the Wikipedia considered as vast textual corpus and to study both its linguistic and sociological characteristics. Thinking that such a tool might be of interest to other people, we decided to make its code freely available via SourceForge. This concerns people studying the Wikipedia phenomenon as well as people working in the area of text encoding and interested by the TEI.

The origin of this project explains a certain number of characteristics and restrictions:

Wiki2Tei only applies to wiki text found on sites using the Mediawiki engine. Mediawiki is the software running the Wikipedia and many other wiki sites: it has its own set of formatting rules. Wiki2Tei will not understand other formatting conventions found with different wiki engines.
The Autograph project was started at the beginning of year 2006 and the development of the Wiki2Tei converter was based on the version of the Mediawiki software available at that date, that is to say version 1.5. We have no intention of keeping this tool in synch with more recent versions of the wiki engine: this should not be a problem as long as the Wiki syntax remains unchanged and does not introduce new features.
Wiki2Tei is written as an extension of the Mediawiki engine: its code is contained in a separate directory which one just has to insert in a Mediawiki installation in order to make it operable. In order to make installation easier, the distribution contains the complete code necessary to run Wiki2Tei (not only the extension).
Wiki2Tei provides a set of scripts written in PHP (as the rest of the engine) which let you process wiki text stored in text files or in a MySQL database. On output, the result of the transformation can also be written either in files on disk or in a MySQL database.

Can Wikipedia be used as a corpus?

Articles in the online encyclopedia Wikipedia use a special syntax, called the Mediawiki syntax, for their formatting and typesetting. For online rendition on the Wikipedia web site, articles are transformed into HTML on the fly by a Mediawiki parser. The simplicity of the Mediawiki syntax is a key feature of Wikipedia, and probably a condition of its success, since it allows everyone to edit Wikipedia articles with very few technical skills.

Besides its intended usage as an encyclopedia, Wikipedia is more and more often used as a linguistic resource. Wikipedia provides a clean base of texts (much more carefully written than average texts found on the internet), free of charge, indexed into thematic categories, with various contextual pieces of information, and it provides aligned texts in various languages. It is used for instance in several evaluation campaigns, or for description of web genres.

However, the Mediawiki syntax is very unsuited for any other task than online rendition and edition of articles. Neither the Mediawiki syntax, nor its HTML equivalent, allow us to identify the different components of the text (title, array, lists, templates, links, images, etc.). Without any handling of these components, it is not possible to take ful advantage of the Wikipedia database: texts are corrupted by various components which are irrelevant for the various linguistics tasks. Moreover, any fine study of the textual properties of the Wikipedia articles is limited by the difficulty to address the different sections and components of the text.

Standard corpus markup

The Wiki2Tei parser addresses this problem by parsing the Mediawiki syntax and converting it into a standard data format. The format used is the XML-based Text Encoding Initiative vocabulary, which is the most elaborated standard for corpus encoding. The conversion is intended to preserve as far as possible all information available in the wiki syntax. The Wiki2Tei converter tries to express the logical content of the wiki markup rather than its rendition on the Wikipedia web site.

Conversion strategy

Rather than parsing the wiki text with a new parser, our choice was to overload the Mediawiki software:

Firstly, because the parser already tries to produce an XML syntax (in the HTML vocabulary), so the major part of the existing software can be reused. But much work was needed in order to make the conversion more strict and to ensure that every output document is a well-formed XML document.
Secondly because the Mediawiki software is the de facto normative definition of the Mediawiki format: all human readable documentation of the parser is trying to guess and describe the behaviour of the Mediawiki parser and is neither complete nor reliable.
Thirdly, because the Mediawiki parser makes much useful information available, such as resolving templates and variables. The parser is deeply coupled with the text and the pages themselves and contains a major part of their content: the texts in this corpus are not independent from the software which helps to generate them.

Installation

The Wiki2Tei distribution provides a complete archive containing both the Mediawiki engine and the Wiki2Tei extension.

Once the distributed archive is decompressed, you get a directory which you can rename as you like. In the rest of this section we will refer to this directory as the main directory. The only difference with an ordinary Mediawiki distribution is that there is an additional subdirectory named wiki2tei which contains the code of the Wiki2Tei converter and a set of scripts to run this converter in various situations.

As a consequence, the instructions to install Wiki2Tei do not differ from the instructions to install Mediawiki itself: they are explained in the INSTALL file found in the main directory. Here is a summary of the basic steps:

copy the archive in your Web area of your machine (the www directory on many Unix systems, the Sites subfolder of your home directory under OS X, etc.)
unpack the distributed archive. For instance:
```
    tar -xvzf Wiki2Tei1.0.tgz
```
The result, in that case, is a directory named Wiki2Tei1.0.
rename the directory to something nice since it will be in your URL. For instance myWiki, like this
```
    mv Wiki2Tei1.0 myWiki
```
change directory to this myWiki folder:
```
    cd myWiki
```
temporarily make the config subdirectory writable by the web server:
```
    chmod a+w config
```
open a web browser and let it load the index.php page. This means that, if the address of your web server is http://url_of_the_server, you should load the following URL
```
    http://url_of_the_server/myWiki/index.php
```
where myWiki is the name of the directory chosen above. This will , again, direct you into the configuration script: fill the form displayed by this script and submit it.
if everything has been filled correctly in the configuration form, the wiki database with all the necessary tables will have been created and a configuration file named LocalSettings.php will have been written in the config subdirectory.
the LocalSettings.php file must be moved one level up to finally be located in the main directory
reload the same URL as above: the wiki should now be working.
you can now remove the config directory, or at least make it not world-writable. For instance:
```
    chmod go-w config
```

The Mediawiki engine is now ready to work but there is one more step to take to parametrize the Wiki2Tei converter. It is explained in the next section.

Configuration

The Wiki2Tei parser is invoked by various scripts provided in the distribution. In order for these scripts to work correctly, some settings must be done in the configuration file.

The Wiki2Tei distribution contains a template configuration file named Wiki2TeiConfig_sample.php. You should make a copy of this file, rename it as Wiki2TeiConfig.php, edit it with your favorite text editor and set the global variables as appropriate for your system. All the variables are explained in comments.

Some of these variables are most important:

$w2tUser contains the name of the MySQL user administrating the Mediawiki site or having permissions to access the Mediawiki database. By default, it will be set to the same value as the $wgDBuser variable defined in the file LocalSettings.php which contains the settings corresponding to your Mediawiki site.
$w2tPswd contains the password of the MySQL user declared in the $w2tUser variable above. By default, it will be set to the same value as the $wgDBpassword variable defined in the file LocalSettings.php.
$w2tInputBase is the name of the Mediawiki database which contains all the pages and all the data necessary to run the wiki. By default, it will be set to the same value as the $wgDBname variable defined in the file LocalSettings.php.

Another option of interest is the $w2tResolveTemplates variable. It concerns wiki templates which are sorts of predefined forms which only have to be filled with some values: templates can be nested. If this option is set to 0 (the default), templates will not be resolved, in other words the values will not be recursively substituted. If the option is set to 1, templates will be resolved and the result of the substitution will be processed by the parser.

Scripts usage

The Wiki2Tei parser can be invoked via some scripts which must be executed from the command line. These scripts are written in PHP, so a PHP interpreter must be installed on your machine. This is certainly already the case if you run a Mediawiki since this wiki engine is also written in PHP. Otherwise see the PHP official web page. PHP 5.0 or greater is required.

This section documents the syntax of all these scripts. The different scripts let you perform various tasks or apply the parser in various situations.

Some scripts can be used simply to extract pages from the database. They store the wiki text in individual files (one file per page).

w2t_ExtractPagesWithCategory.php
w2t_ExtractPagesWithID.php
w2t_ExtractPagesWithTitle.php
w2t_ExtractRandomPages.php

Other scripts execute the parser on wiki text already stored in files or taken out of the database. The converted text (in TEI format) is also stored in files:

w2t_ParseDir.php
w2t_ParseFile.php
w2t_ParsePagesWithCategory.php
w2t_ParsePagesWithID.php
w2t_ParsePagesWithTitle.php
w2t_ParseRandomPages.php
w2t_RunTest.php

Some scripts operate on pages extracted from a MySQL database created and populated by the Mediawiki engine and store the result in another MySQL database. This is the case for instance of:

w2t_ProcessDatabase.php
w2t_ProcessOpenSX.php
w2t_ProcessRevisions.php
w2t_ProcessRevisionsWhere.php
w2t_ProcessRevisionsWithQuery.php

Typically the input database would be the database where the Mediawiki engine stores the information it needs, the pages, the revisions and all the associated metadata.

By convention, all the scripts starting with w2t_Parse write their output in files on disk, and all the scripts starting with w2t_Process write their output in a MySQL database.

Finally there are a few utility scripts such as w2t_CreateTeiDatabase.php which lets you create output databases to store the pages converted to TEI or w2t_RemoveFromDatabase.php which lets you delete pages from the output database.

Command line options

Many Wiki2Tei scripts share common options. Here is a table summarizing which options are available for each of them:

Scripts/Options database debug debug_dir dir echo filter format from_base help id list log max min num out query quiet revision store test-id test-suite to_base trace trace_dir where

w2t_ExtractPagesWithCategory * * * * * *

w2t_ExtractPagesWithID * * * * * * *

w2t_ExtractPagesWithTitle * * * * * *

w2t_ExtractRandomPages * * * * * * * *

w2t_ParseDir * * * * * * * * * * *

w2t_ParseFile * * * * * * * * * *

w2t_ParsePagesWithCategory * * * * * * * * * * * *

w2t_ParsePagesWithID * * * * * * * * * * * *

w2t_ParsePagesWithTitle * * * * * * * * * * * *

w2t_ParseRandomPages * * * * * * * * * * * * * *

w2t_ProcessDatabase * * * * * * * * * * *

w2t_ProcessOpenSX * * * * *

w2t_ProcessRevisions * * * * * * * * * * * *

w2t_ProcessRevisionsWhere * * * * * * * * * * * *

w2t_ProcessRevisionsWithQuery * * * * * * * * * * * *

w2t_RemoveFromDatabase * * * * * *

w2t_RunTest * * * * * * * * * * *

w2t_TestSuite * * * * *

w2t_Validate * * * * *

Syntax of the options

Most of the options expect a value. The syntax on the command line is

    option=value

There must be no space around the equal sign. For instance, the --format option can be used in order to invoke the built-in Mediawiki parser rather than the Wiki2Tei parser and produce HTML pages. To do so, just set its value to HTML like this

	--format=HTML

Some of the options though do not expect a value: this is the case of the options with a boolean semantics (i-e whose value is 0 or 1, corresponding to false or true), like, for instance, --debug, or --trace. It is also possible to explicitely specify the value, like in

    --trace=1

Description of the options

Many options are common to different scripts. This section gives a description for most of them. The others will be described below with the scripts they belong to.

--database Name of the database (by default $w2tInputBase)

--debug Activate debugging

--debug_dir Directory for debug outputs (default $w2tDebugLogsDir)

--dir Directory containing the wiki files (by default $w2tInputsDir)

--echo Write the output also to the console

--filter Retrieve pages whose category or title matches the given regexp

--format Output format: TEI or HTML (default TEI)

--from_base Name of the input database (by default $w2tInputBase)

--help Show the basic usage string

--id ID range of the pages to parse (n, n-, -m, or n-m)

--in Directory containing the wiki files (by default $w2tInputsDir)

--list Return a list of the available tests

--log Write info in the log file. Value 0 or 1 (by default 1).

--max Maximal length for an extracted page

--min Minimal length for an extracted page

--num Required number of randomly extracted pages

--out Output files directory (default $w2tWorkshop/WIKI)

--query MySQL query selecting pages to process

--quiet Turn off messages sent to the console

--revision Which revisions ("all", "latest", or "n" for n latest)

--store Store the wiki text in a file on disk

--test-id Execute only the given test (according to xml:id in the test-suite document)

--test-suite Document containing the test suite (default indir/tei4mediawiki.odd)

--to_base Name of the output database (by default $w2tOutputBase)

--trace Activate tracing

--trace_dir Directory for trace outputs (default $w2tTraceLogsDir)

--where MySQL WHERE clause describing pages to process

Default values

Most of the options have default values which correspond to preferences set in the Wiki2TeiConfig.php file. Make sure these preferences are correctly set for your installation. See the instructions in the Wiki2Tei configuration help file.

Relative paths

There is a slightly confusing issue concerning relative paths. It is not possible to specify the value of options like --out or --dir using the usual . and .. symbolic notation on the command line. This is due to the fact that the Mediawiki engine changes the current directory and positions itself inside the top Mediawiki directory. This is the normal location when the Wiki is run via a Web browser but it has the unfortunate consequence that all relative paths are calculated with respect to this location.

In order to avoid this problem, it is strongly recommended that you use absolute paths to specify the --out or --dir options. This also applies to the --trace_dir and the --debug_dir options.

Synopsis

The following sections give the exact syntax of all the scripts and provide additional explanations.

w2t_CreateTeiDatabase

Usage: w2t_CreateTeiDatabase.php options Options: -u name of user with admin on mysql server -p password -h hostname (by default $w2tHost) -n name of the database (by default $w2tOutputBase) -f force to delete an already existing database Example: $w2tScriptName -f -u "root" -p "foo"

The w2t_CreateTeiDatabase.php script lets you create a new database for the output of the w2t_ProcessRevisions.php or w2t_ProcessDatabase.php scripts.

To run this script you must have root privileges on the MySQL server in order to grant sufficient permissions to the user of the newly created database. The user of the database is specified in the Wiki2TeiConfig.php file by the $w2tUser and $w2tPswd variables. See the instructions in the Wiki2Tei configuration help file.

The name of the output database can be specified with the $w2tOutputBase variable in the Wiki2TeiConfig.php file or directly on the command line with the -n option.

For instance, in order to create an output database named teiout you should execute the following instruction

    w2t_CreateTeiDatabase.php -u "root" -p "topsecret" -n "teiout"

assuming that the admin of the MySQL database is root and her password is topsecret.

If security is a concern, you should edit and modify this script in order to adjust the premissions granted to the user. Currently the script gives all privileges on the newly created database (see the GRANT ALL PRIVILEGES instruction in the script).

Once this database is created, it is possible to execute the w2t_ProcessRevisions.php, w2t_ProcessDatabase.php, or w2t_ProcessOpenSX.php scripts.

For more details about this database, see the Output Database section in this document.

Note that the options supported by this script start with a single dash while the options for all the other scripts start with a double dash: this is because they are short options corresponding to equivalent MySQL options (-u, -p, -h). So, it will certainly feels more natural for a user used to the MySQL syntax.

w2t_ExtractPagesWithCategory

Usage: w2t_ExtractPagesWithCategory.php --filter=regexp [options] Options: --filter Retrieve pages whose category matches given regexp --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop/WIKI) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ExtractPagesWithCategory.php script lets you just extract wiki pages from the Mediawiki database. It is useful to prepare wiki text files which will be later processed by the parser.

The meaning of the --regexp option is the same as with the w2t_ParsePagesWithCategory.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ExtractPagesWithID

Usage: w2t_ExtractPagesWithID.php --id=range [options] Options: --id ID range of the pages to parse (n, n-, -m, or n-m) --database Name of the database (by default $w2tInputBase) --echo Write the output also to the console --out Output files directory (default $w2tWorkshop) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ExtractPagesWithID.php script lets you just extract wiki pages from the Mediawiki database. It is useful to prepare wiki text files which will be later processed by the parser.

Pages are specified by ID. The value of the --id option is the same as with the w2t_ParsePagesWithID.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ExtractPagesWithTitle

Usage: w2t_ExtractPagesWithTitle.php --filter=regexp [options] Options: --filter Retrieve pages whose titles match given regexp --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop/WIKI) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ExtractPagesWithTitle.php script lets you just extract wiki pages from the Mediawiki database. It is useful to prepare wiki text files which will be later processed by the parser.

The meaning of the --regexp option is the same as with the w2t_ParsePagesWithTitle.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ExtractRandomPages

Usage: w2t_ExtractRandomPages.php --num=val [options] Options: --num Required number of randomly extracted pages --min Minimal length for an extracted page --max Maximal length for an extracted page --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop/WIKI) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ExtractRandomPages.php script lets you just extract wiki pages from the Mediawiki database. It is useful to prepare wiki text files which will be later processed by the parser.

The meaning of the --num, --min and --max options is the same as with the w2t_ParseRandomPages.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ParseDir

Usage:

w2t_ParseDir.php [options] directory Options: --dir Directory prefix for relative paths (default $w2tInputsDir) --out Output files directory (default $w2tWorkshop/format) --filter Only run test files whose name matches given regexp --format Output format: TEI or HTML (default TEI) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParseDir.php script lets you process multiple wiki files located in a given folder. Most of its options have the same meaning as with the w2t_ParseFile.php script.

The files to process in the specified directory can be filtered using the --filter option. The value of the --filter option is a regular expression: only the files whose name matches the regular expression will be processed.

Important: the input files are expected to be encoded in UTF-8.

The directory argument can be an absolute or a relative path. In the case of a relative path, one can specify, with the --dir option, a folder path to prepend to the directory argument. If the --dir option is not specified, the script will prepend the path directory defined by the $w2tInputsDir variable in the configuration file. If this still fails, it will then look inside the current working directory.

w2t_ParseFile

Usage: w2t_ParseFile.php [options] filename Options: --dir Directory prefix for relative paths (default $w2tInputsDir) --out Output files directory (default $w2tWorkshop/format) --format Output format: TEI or HTML (default TEI) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParseFile.php script lets you process wiki text directly from a text file rather than from a Mediawiki database. This can be useful if you do not have permissions to access the database or do not have a database at all. For instance, you could have written the wiki text file yourself, independently from a wiki site, or have obtained the text by copy and paste operation from a wiki site on the Internet.

The filename argument can be an absolute or a relative path. In the case of a relative path, one can specify, with the --dir option, a folder path to prepend to the file name. If the --dir option is not specified, the script will try to find the file in the folder defined by the $w2tInputsDir variable in the configuration file. If this fails too, it will then look, as a last resort, in the current directory.

Important: the input file is expected to be encoded in UTF-8.

The --format option lets you specify the target format: either TEI (by default) or HTML. The TEI format corresponds to the Wiki2Tei parser and the HTML format corresponds to the built-in Mediawiki parser.

By default, the output file will be written in a subfolder of the output directory defined by the $w2tWorkshop variable in the configuration file. The subfolder is named after the target format: for instance, $w2tWorkshop/TEI or $w2tWorkshop/HTML. This destination folder can be overridden using the --out option: in that case, all the output files will be written in the directory specified by the --out option, no matter what the target format is.

The name of the converted file is the base name of the input file followed by a .tei or .html extension (depending on the target format). Note that the input file name can be specified with or without the .wiki extension.

w2t_ParsePagesWithCategory

Usage: w2t_ParsePagesWithCategory.php --filter=regexp [options] Options: --filter Retrieve pages whose category matches given regexp --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop) --format Output format: TEI or HTML (default TEI) --store Store the wiki text in a file on disk --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParsePagesWithCategory.php script lets you process wiki pages extracted from the Mediawiki database.

Pages are extracted according to their category. The category is specified using the --regexp option: the value of this option is a regular expression. This allows for a great flexibility in the specification of the category. For instance, you could extract pages related to Biology with an option such as:

	--regexp="[Bb]iology.*"

The --store option has the same meaning as with the w2t_ParsePagesWithID.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ParsePagesWithID

Usage: w2t_ParsePagesWithID.php --id=range [options] Options: --id ID range of the pages to parse (n, n-, -m, or n-m) --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop) --format Output format: TEI or HTML (default TEI) --store Store the wiki text in a file on disk --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParsePagesWithID.php script lets you execute the parser on a sequence of pages extracted from a MySQL database associated with a Mediawiki site. The page range can be specified with the --id option. The value of this option can take one of the following forms

--id=m-n operate on pages whose ID is between m and n
--id=-n operate on pages whose ID is less than or equal to n
--id=m- operate on pages whose ID is greater than or equal to m
--id=m operate on page whose ID is m

If the --id option is not specified then all the pages of the database will be processed.

Normally the wiki text extracted from the database is passed directly to the parser for processing. If you specify the --store option, this text will also be stored in a file on disk for later use.

See the naming conventions for the generated files in the Output Files section.

w2t_ParsePagesWithTitle

Usage: w2t_ParsePagesWithTitle.php --filter=regexp [options] Options: --filter Retrieve pages whose titles match given regexp --database Name of the database (by default $w2tInputBase) --out Output files directory (default $w2tWorkshop) --format Output format: TEI or HTML (default TEI) --store Store the wiki text in a file on disk --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParsePagesWithTitle.php script lets you process wiki pages extracted from the Mediawiki database.

Pages are extracted according to their title. The title is specified using the --regexp option: the value of this option is a regular expression. This allows for a great flexibility in the specification of the title. For instance, you could extract pages whose title is related to mathematics with an option such as:

	--regexp="[Mm]ath(s|ematics)?"

The --store option has the same meaning as with the w2t_ParsePagesWithID.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ParseRandomPages

Usage: w2t_ParseRandomPages.php --num=val [options] Options: --num Required number of randomly extracted pages --min Minimal length for an extracted page --max Maximal length for an extracted page --out Output files directory (default $w2tWorkshop) --database Name of the database (by default $w2tInputBase) --format Output format: TEI or HTML (default TEI) --store Store the wiki text in a file on disk --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ParseRandomPages.php script lets you process wiki pages extracted from the Mediawiki database.

Pages are extracted randomly. You must specify, using the --num option, how many pages to extract. Optionnally, you can indicate a minimal and a maximal size for the pages to extract via the --min and --max options respectively.

The --database option indicates the name of the Mediawiki database out of which the pages should be extracted. This is not necessarily the database corresponding to the wiki where the wiki2tei parser is installed.

The --store option has the same meaning as with the w2t_ParsePagesWithID.php script.

See the naming conventions for the generated files in the Output Files section.

w2t_ProcessDatabase

Usage: php w2t_ProcessDatabase.php --id=range [options] Options: --id ID range of the pages to parse (n, n-, -m, or n-m) --from_base Name of the input database (by default $w2tInputBase) --to_base Name of the output database (by default $w2tOutputBase) --out Log file directory (default $w2tWorkshop) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_ProcessDatabase.php script lets you process wiki text extracted from the Mediawiki database and store the results in another MySQL database. The output database must have been previously created: this is done using the w2t_CreateTeiDatabase.php script. If the output database does not exist, the present script will send a message to warn you and ask you to create it with the abovementioned script.

The pages to process are specified by ID using the --id option. This option has the same syntax as with the w2t_ParsePagesWithID.php script. Only the latest revision of the page is processed. If you want to process older revisions, see the w2t_ProcessRevisions.php script.

No output files are written on disk by this script (except for the usual log file). All the output is sent to the output database. For more details on this database, see the Output Database section.

The names of the input and output databases is specified in the configuration file (Wiki2TeiConfig.php) by the $w2tInputBase and $w2tOutputBase variables. These settings can be overridden on the command line by the --from_base and --to_base options respectively.

w2t_ProcessOpenSX

Usage:
 
  php w2t_ProcessOpenSX.php --id=range [options]
Options:
  --id         ID range of the pages to parse (n, n-, -m, or n-m)
  --database   Name of the output database (by default $w2tOutputBase)
  --log        Write info in the log file (by default 1)
  --quiet      Turn off information messages
  --help       Show this very helpful message

The w2t_ProcessOpenSX.php script tries to clean up the database created and populated by the Wiki2Tei converter: it looks for not well formed documents and tries to turn them into well formed documents. It can convert any document which is a well formed SGML document (where end tags are optionnal for instance).

The script uses the Open SX program from the OpenJade tools suite, based on James Clark's SP SGML processor. After the OpenSX conversion, the tei2tei.xsl XSLT stylesheet by Sebastian Rahtz is used for normalizing element names. The old document, the log from OpenSP together with other information are recorded into the opensx table in the database.

Warning: this script relies upon the external program osx (Open SX), from the Open Jade project, and xsltproc, from the libxml2 gnome library project. Both programs must be installed and acessible in the $PATH.

w2t_ProcessRevisions

Usage: w2t_ProcessRevisions.php --id=range [options] Options: --id ID range of the pages to parse (n, n-, -m, or n-m) --revision Which revisions ("all", "latest", or n for n latest) --from_base Name of the input database (by default $w2tInputBase) --to_base Name of the output database (by default $w2tOutputBase) --out Log file directory (default $w2tWorkshop) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message If the --revision option is omitted, the latest is processed

The w2t_ProcessRevisions.php script lets you process wiki text extracted from the Mediawiki database and store the results in another MySQL database. The difference with the w2t_ProcessDatabase.php script is that you can process several revisions of a page and not only the latest one. This is possible if you have all the revisions of the pages in your Mediawiki database. Note that this is not always the case: for instance, if you obtained a monthly archive of the Wikipedia database, make sure that you retrieved and installed the (big) archive with all the revisions. If you operate on a working Mediawiki installation, there should be no problem: your database will necessarily contain all the revisions.

The --revision option lets you specify how many revisions of a page you want to process. Its value can be:

latest to designate onloy the latest revision
all to designate all the revisions
an integer n to limit to the n latest revisions

Here are a few sample instructions:

    w2t_ProcessRevisions.php --id=100-150 --revision=all 
    w2t_ProcessRevisions.php --id=-50 --revision=3

If the --revision option is not specified, only the latest revision is processed. The other options have the same meaning as with the w2t_ProcessDatabase.php script.

w2t_ProcessRevisionsWhere

Usage: w2t_ProcessRevisionsWhere.php --id=range [options] Options: --where MySQL WHERE clause describing pages to be processed --revision Which revisions ("all", "latest", or n for n latest) --from_base Name of the input database (by default $w2tInputBase) --to_base Name of the output database (by default $w2tOutputBase) --out Log file directory (default $w2tWorkshop) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message If the --revision option is omitted, the latest is processed

The w2t_ProcessRevisionsWhere.php script lets you process wiki text extracted from the Mediawiki database and store the results in another MySQL database. It is quite similar to the w2t_ProcessRevisions.php script: it also lets you process the different revisions of a particular page. The difference is in the way pages are selected: instead of specifying pages by a range of ID numbers, you specify the WHERE clause of the MySQL SELECT query via the --where option. This supposes that you are comfortable with the MySQL syntax.

The script can be useful if you want to select pages with another criterion than the ID. The script lets you pass the WHERE clause of the query. For instance, if you want to process the pages whose title contains the word biology, a possible instruction would be:

    w2t_ProcessRevisionsWhere.php --where="page_title like '%biology%';"

Note that this script limits you to conditions bearing on the fields of the page table of the input database (page_counter, page_id, page_is_new, page_is_redirect, page_latest, page_len, page_namespace, page_random, page_restrictions, page_title, page_touched). If you want to pass a more complex MySQL query involving joint tables, use the script w2t_ProcessRevisionsWithQuery.php which lets you pass a complete SELECT instruction.

w2t_ProcessRevisionsWithQuery

Usage: w2t_ProcessRevisionsWithQuery.php --id=range [options] Options: --query Query (selecting page_id, page_title and page_namespace) of pages to be processed --revision Which revisions ("all", "latest", or n for n latest) --from_base Name of the input database (by default $w2tInputBase) --to_base Name of the output database (by default $w2tOutputBase) --out Log file directory (default $w2tWorkshop) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message If the --revision option is omitted, the latest is processed

The w2t_ProcessRevisionsWithQuery.php script lets you process wiki text extracted from the Mediawiki database and store the results in another MySQL database. It is quite similar to the w2t_ProcessRevisions.php script: it also lets you process the different revisions of a particular page. The difference is in the way pages are selected: instead of specifying pages by a range of ID numbers, you specify a MySQL query (a SELECT instruction) via the --query option. This supposes that you are comfortable with the MySQL syntax.

The script can be useful if you want to select pages with another criterion that the ID. It is important to note that the query must include a request for the page_id, page_title, page_namespace fields from the page table because these fields are needed by the script in order to identify the revisions of the pages. So a minimal query is:

    "select page_id, page_title, page_namespace from page;"

The interest of the script is that it lets you pass the WHERE clause of the query. For instance, if you want to process the pages whose category starts with the prefix Bio, a possible instruction would be:

    w2t_ProcessRevisionsWithQuery.php --query="SELECT page_id, page_title, \
    page_namespace, cl_from FROM page, categorylinks WHERE cl_to like 'Bio%' and cl_from = page_id;"

If your query concerns only the page table, you should rather use the w2t_ProcessRevisionsWhere.php script: its syntax is much easier.

w2t_RemoveFromDatabase

Usage:
 
  w2t_RemoveFromDatabase.php --id=range [options]
Options:
  --id         ID range of the pages to parse (n, n-, -m, or n-m)
  --database   Name of the output database (by default $w2tOutputBase)
  --out        Log file directory (default $w2tWorkshop)
  --log        Write info in the log file (by default 1)
  --quiet      Turn off information messages
  --help       Show this very helpful message

The w2t_RemoveFromDatabase.php script lets you remove pages from the output database. It does not affect the input database. It is useful if you want to regenerate a converted page: you must first delete an already existing page with the same ID otherwise the MySQL database will raise an error.

Pages to delete can be designated by ID using the --id option: its value is either a single value or a range (n, n-, -m, or n-m).

w2t_RunTest

Usage: w2t_RunTest.php [options] testname Options: --dir Directory containing the test file (default $w2tTestsInputDir) --out Output files directory (default indir/format) --format Output format: TEI or HTML (default TEI) --trace Activate tracing --debug Activate debugging --trace_dir Directory for trace outputs (default $w2tTraceLogsDir) --debug_dir Directory for debug outputs (default $w2tDebugLogsDir) --list Return a list of the available test files --log Write info in the log file (by default 1) --quiet Turn off information messages --help Show this very helpful message

The w2t_RunTest.php script lets you process test files. This script is useful only to Wiki2Tei developers who want to test particular aspects of the parser. You would normally not have to use it and should use the w2t_ParseFile.php and w2t_ParseDir.php scripts instead in order to convert a wiki text file.

There is nothing special about this script: the only difference is that it expects the input files to be located in the directory specified by the $w2tTestsInputDir variable in the configuration file (unless it is overridden by a --dir option). By default, it writes its output file in a subfolder of the $w2tTestsOutputDir folder: this subfolder is named TEI or HTML, depending on the target format. The destination folder can be overridden using the --out option.

w2t_TestSuite

Usage:
 
  w2t_TestSuite.php [--out=dir] [--test-id=id] [--help] [--test-suite=file.xml]
Options:
  --out        Output files directory (default $w2tWorkshop)
  --test-suite Document containing the test suite (default indir/tei4mediawiki.odd)
  --test-id    Execute only the given test (according to xml:id in the test-suite document)
               When executing a particular test, no html report is produced.
  --quiet      Turn off information messages
  --help       Show this very helpful message

The w2t_TestSuite.php lets you process an entire test suite. The test suite tries to cover all the aspects of the Mediawiki syntax. For each test, the test suite gives an expected TEI equivalent. This script runs the Wiki2Tei converter on every test and compares the actual result with the expected TEI output. For each test, a line is recorded in an HTML document, test-suite-report-yyyy-mm-dd.html.

Warning: this script relies upon the external program xmlstarlet, based upon the libxml2 gnome library project. This program must be installed and acessible in the $PATH.

w2t_Validate

Usage:
 
  php w2t_Validate.php --id=range [options]
Options:
  --id         ID range of the pages to parse (n, n-, -m, or n-m)
  --database   Name of the output database (by default $w2tOutputBase)
  --log        Write info in the log file (by default 1)
  --quiet      Turn off information messages
  --help       Show this very helpful message

The w2t_Validate.php script lets you validate the pages of the output database against the tei4mediawiki.rng Relax NG schema defining the TEI vocabulary used by the Wiki2Tei parser. This script invokes a Java interpreter and the com.thaiopensource.relaxng software.

The --id option lets you select a page or a range of pages in the output database.

This validation test is applied only to well-formed converted pages. The result is written in the validity field of the page table: 1 if validation was successful, 0 otherwise. In case of error, the error message is stored in the validity_error field of the page table.

Debugging

The parser offers two possibilities in order to collect debugging info during the execution of any of the Wiki2Tei scripts:

if the --trace option is specified on the command line, the state of the text at each step of the parsing process is logged in separate files. These files are found in a subfolder of the Trace Logs directory which is defined by the $w2tTraceLogsDir variable in the Wiki2TeiConfig.php file or can alternatively be specified on the command line with the --trace_dir option.
if the --debug option is specified on the command line, diagnosis info is logged in a file located in the Debug Logs directory which is defined by the $w2tDebugLogsDir variable in the Wiki2TeiConfig.php file or can alternatively be specified on the command line with the --debug_dir option. The name of the debug info log file contains the date and time of execution by the parser: for instance w2t_debug_2007-03-28_12-59-14.

Output files

All the pages, either extracted by one of the w2t_Extract scripts or processed by one of the w2t_Parse scripts, are stored in individual files whose name is built after the page ID, the revision ID and the namespace ID. For instance,

   22_3498621_0.wiki

contains the wiki text of the page with ID 22: its latest revision ID is 3498621 and this page belongs to the main namespace (namespace 0). The output files are encoded in UTF-8.

The extension of the output files can be .tei, .html or .wiki to indicate which kind of data they contain. The .html files correspond to the case when the Mediawiki built-in parser is used, i-e the --format option is specified as HTML like this

    --format=HTML

The code for the various namespaces is given in the following table:

-2 Media

-1 Special

0 Main

1 Talk

2 User

3 User talk

4 Project

5 Project talk

6 Image

7 Image talk

8 MediaWiki

9 MediaWiki talk

10 Template

11 Template talk

12 Help

13 Help talk

14 Category

15 Category talk

The output folder is specified by the $w2tWorkshop variable in the Wiki2TeiConfig.php file. This setting can also be overridden on the command line by the --out option which is supported by most of the Wiki2Tei scripts.

Script logs

All the scripts write information in log files. The log files are usually located in the output directory and named on the w2t_***_log model where the three asterisks are replaced by the name of the input or the output database (depending on which script is executed). These logs are UTF-8 encoded.

These files are cumulative. Each invocation of a script will write additional information in the logs. You should empty or delete these files peridically if the information they contain is not needed anymore.

The information written in these files depends on the script executed: date and time of execution, exact MySQL query used to retrieve the data from the database, identity of each page processed (file name or page ID, revision ID, title, well-formedness of the result, etc.).

Here is a sample:

    # 02/06/2007 12-10-24
    # File extracted with mysql request:
    #   select page_id, page_latest, page_title, page_namespace from page where page_id <= 14
    # 
    # file_name    ok    title
    1_6422860_0    0    Avignon
    2_2334467_0    1    Algorithmie
    ...
    14_6380587_0    1    Allemagne

Output database

The output database contains two tables: a page table used by w2t_ProcessRevisions.php and w2t_ProcessDatabase.php, and an opensx table used by w2t_ProcessOpenSX.php.

For the technically inclined, here is the desciption of these tables:

page table

    +----------------+---------------------+
    | Field          | Type                |
    +----------------+---------------------+
    | page_id        | int(8) unsigned     |
    | revision_id    | int(8) unsigned     |
    | namespace      | int(11)             |
    | text           | mediumblob          |
    | wellformed     | tinyint(1) unsigned |
    | xml_error      | mediumblob          |
    | compressed     | tinyint(1) unsigned |
    | validity       | tinyint(1) unsigned |
    | validity_error | mediumblob          |
    +----------------+---------------------+

opensx table

    +----------------+---------------------+
    | Field          | Type                |
    +----------------+---------------------+
    | revision_id    | int(8) unsigned     |
    | malformed_text | mediumblob          |
    | osx_msg        | mediumblob          |
    | compressed     | tinyint(1) unsigned |
    +----------------+---------------------+

The format of these tables is described in the w2t_DefinePageTable.sql and w2t_DefineOpenSXTable.sql MySQL scipts which are executed by the w2t_CreateTeiDatabase.php script.

Links to related topics

This section contains links to related sites:

the official site of the Mediawiki engine.
the official site of the TEI model. It has its own wiki.
the Autograph project
the english Wikipedia and related sites
the french Wikipedia and related sites
Wikimedia Commons
Wikimedia Meta
Wikimedia Species

The tests suite

There is a test suite which can be executed with the w2t_TestSuite.php script. The tests are defined in the file odd4tei4mediawiki.odd. The execution of the tests suite produces a report which can also be found in the distribution: see test-suite-report.html

This report contains many examples of all the wikitext syntax features with their TEI equivalent.

Last updated 2007-10-08 11:05:05

--database	Name of the database (by default $w2tInputBase)
--debug	Activate debugging
--debug_dir	Directory for debug outputs (default $w2tDebugLogsDir)
--dir	Directory containing the wiki files (by default $w2tInputsDir)
--echo	Write the output also to the console
--filter	Retrieve pages whose category or title matches the given regexp
--format	Output format: TEI or HTML (default TEI)
--from_base	Name of the input database (by default $w2tInputBase)
--help	Show the basic usage string
--id	ID range of the pages to parse (n, n-, -m, or n-m)
--in	Directory containing the wiki files (by default $w2tInputsDir)
--list	Return a list of the available tests
--log	Write info in the log file. Value 0 or 1 (by default 1).
--max	Maximal length for an extracted page
--min	Minimal length for an extracted page
--num	Required number of randomly extracted pages
--out	Output files directory (default $w2tWorkshop/WIKI)
--query	MySQL query selecting pages to process
--quiet	Turn off messages sent to the console
--revision	Which revisions ("all", "latest", or "n" for n latest)
--store	Store the wiki text in a file on disk
--test-id	Execute only the given test (according to xml:id in the test-suite document)
--test-suite	Document containing the test suite (default indir/tei4mediawiki.odd)
--to_base	Name of the output database (by default $w2tOutputBase)
--trace	Activate tracing
--trace_dir	Directory for trace outputs (default $w2tTraceLogsDir)
--where	MySQL WHERE clause describing pages to process

-2	Media
-1	Special
0	Main
1	Talk
2	User
3	User talk
4	Project
5	Project talk
6	Image
7	Image talk
8	MediaWiki
9	MediaWiki talk
10	Template
11	Template talk
12	Help
13	Help talk
14	Category
15	Category talk