Bitextor

Bitextor is a tool to automatically harvest bitexts from multilingual websites. To run it, it is necessary to provide:

  1. The source where the parallel data will be searched: one or more websites (namely, Bitextor needs website hostnames)
  2. The two languages the user is interested in: language IDs must be provided following the ISO 639-1 standard
  3. A source of bilingual information between these two languages: either a bilingual lexicon (such as those available at the bitextor-data repository), a machine translation (MT) system, or a parallel corpus to be used to produce either a lexicon or an MT system (depending on the alignment strategy chosen, see below)
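
These three inputs map onto configuration options described later in this document. As a rough illustration for the bilingual-lexicon case (not a complete configuration: further mandatory options such as bitextor, permanentDir, dataDir and transientDir are covered in the configuration section below, and the hostname and dictionary path here are placeholders):

hosts: ["www.example.com"]
lang1: en
lang2: fr
dic: /home/user/en-fr.dic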

Docker installation

If you want to easily install Bitextor, just use Docker commands:

docker pull paracrawl/bitextor # download bitextor docker image

docker run -it --name bitextor paracrawl/bitextor # create a new container 'bitextor' and open an interactive terminal

docker start bitextor && docker exec -it bitextor bash # run an interactive terminal on an existing 'bitextor' container

If you have the snap package manager on your system, you can install Docker with:

sudo snap install docker

The Bitextor folder is located at /opt/bitextor, with all dependencies installed and all submodules compiled.
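
For instance, assuming you keep your configuration file and crawling data in a host folder, one possible way to make them visible inside the container and launch the pipeline (the main script bitextor.sh, described in the "Run" section below, sits in the /opt/bitextor folder mentioned above; host paths here are placeholders):

# mount a host folder with your config and data into the container
docker run -it -v /home/user/bitextor-data:/data --name bitextor paracrawl/bitextor
# inside the container, run the pipeline from the pre-installed Bitextor folder
cd /opt/bitextor && ./bitextor.sh -s /data/myconfig.yaml -j 4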

Manual installation

Dependencies

Apart from downloading all submodules of this repository (which you can do with git clone --recurse-submodules https://github.com/bitextor/bitextor.git if you are cloning this repo from scratch or, in case you are downloading a tarball, with git submodule update --init --recursive), there are some external tools that need to be in the path before installing the project. autotools and pkg-config are necessary for building and installing the project. Tools from the JDK are needed to run the Java dependencies (Boilerpipe); version 8 or later is required. In addition, a C++ compiler is required for compiling dependencies. The libboost-all-dev package is needed to compile the clustercat and mgiza projects. Optionally, httrack and wget can be used for crawling if they are installed. Additionally, giawarc can optionally be used for WARC file preprocessing.

If you are using an apt-like package manager you can run the following command line to install all these dependencies:

sudo apt install cmake automake pkg-config python3 python3-venv python3-pip libboost-all-dev openjdk-8-jdk liblzma-dev time poppler-utils curl

Furthermore, most of the scripts in Bitextor are written in Python 3, so Python >= 3 must be installed. All the tools mentioned above are available from the repositories of most Unix-like operating systems.

Some additional Python libraries are required. They can be installed automatically with the tool pip by running (use without sudo if you are running in a virtualenv):

pip3 install --upgrade pip
pip3 install -r requirements.txt
pip3 install -r bicleaner/requirements.txt
pip3 install https://github.com/kpu/kenlm/archive/master.zip --install-option="--max_order 7"
pip3 install -r bifixer/requirements.txt

(if you have issues with datrie in Conda, use conda install datrie and try again)
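
If you prefer to keep these packages isolated from your system Python, you can create a virtual environment first (using the python3-venv package installed above) and run the pip3 commands inside it, for example:

python3 -m venv /home/user/bitextor-venv       # create the environment (path is just an example)
source /home/user/bitextor-venv/bin/activate   # activate it; pip3 now installs into the environment without sudo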

Optional dependencies

# install go
sudo snap install go
# build and place the necessary programs in $HOME/go/bin
go get github.com/paracrawl/giawarc/...
# Install protobuf from official repository: https://github.com/protocolbuffers/protobuf/blob/master/src/README.md
# Maybe you need to uninstall any other protobuf installation in your system (from apt or snap) to avoid compilation issues
sudo apt-get install autoconf automake libtool curl make g++ unzip
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.10.1/protobuf-all-3.10.1.tar.gz
tar -zxvf protobuf-all-3.10.1.tar.gz
cd protobuf-3.10.1
./configure
make
make check
sudo make install
sudo ldconfig

pip3 install Cython # Install Cython dependency for cld3
pip3 install pycld3 # Install cld3 Python fork from https://github.com/bsolomon1124/pycld3

Submodules compilation

To compile all Bitextor submodules you will first need to run the script configure (if you are downloading the code directly from the GitHub repository you will need to run the script autogen.sh instead, which will identify the location of the external tools used). Then the code will be compiled using make:

./autogen.sh && make

Some known installation issues

On some machines equipped with an AMD CPU you may experience trouble with tensorflow 1.8.0 (the version specified in requirements.txt). If you have installed all the requirements successfully but, when running ./autogen.sh or ./configure, you get an error saying that tensorflow is not installed, please replace the current version with version 1.5:

sudo pip3 uninstall tensorflow
sudo pip3 install tensorflow==1.5.0

In addition, some users have reported problems when trying to install tensorflow using pip3 with Python versions >= 3.7. If this is the case, you can try to install it manually or with another package management tool, or use a lower version of Python.

Depending on the version of libboost that you are using, you may experience some problems when compiling some of the sub-modules included in Bitextor. If this is the case you can install it manually by running the following commands:

sudo apt-get remove libboost-all-dev
sudo apt-get autoremove
wget https://dl.bintray.com/boostorg/release/1.72.0/source/boost_1_72_0.tar.gz
tar xvf boost_1_72_0.tar.gz
cd boost_1_72_0/
./bootstrap.sh
./b2 -j4 --layout=system install || echo FAILURE
cd ..
rm -rf boost_1_72_0*

Run

To run Bitextor use the main script bitextor.sh. In general, this script takes two parameters:

bitextor.sh -s <CONFIGFILE> [-j <NUMJOBS>]

where -s specifies the configuration file (see the section on the Bitextor configuration file below) and -j the maximum number of jobs that can run in parallel.

For example, on a machine with 4 cores, one could run Bitextor as follows:

bitextor.sh -s myconfig.yaml -j 4

If Bitextor is run on a cluster with software that manages job queues, some additional options can be used:

bitextor.sh -s <CONFIGFILE> [-j <NUMJOBS>] [-c <CLUSTERCOMMAND>] [-g <CLUSTERCONFIG>] [-k] [-n]

where -c specifies the command used to submit jobs to the cluster queue (for example, sbatch on SLURM) and -g an optional cluster configuration file (see the example below).

Running Bitextor on a cluster

When running on a cluster with, for example, the SLURM workload manager installed, one could run Bitextor as:

bitextor.sh -s myconfig.yaml -j 20 -c "sbatch"

This command would run Bitextor allowing up to 20 jobs to be submitted to the cluster queue at the same time, assuming that every job can run on any node of the cluster.
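
Since the string passed with -c is used as the job submission command, additional scheduler options can be included in it if needed; for example, to impose a wall-clock limit on every submitted job one might run something like:

bitextor.sh -s myconfig.yaml -j 20 -c "sbatch --time=24:00:00"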

Now assume that we plan to train a neural MT (NMT) system with Bitextor for document alignment (see the document alignment section below). In this case, we would need to configure the call to the cluster so that the rules that require GPUs for training or running NMT are run on nodes with GPUs. We could create a cluster configuration file such as the following (extracted from snakemake/examples/cluster.json):

{
    "__default__" :
    {
        "gres": ""
    },

    "docaling_translate_nmt" :
    {
        "gres": "--gres gpu:tesla:1"
    },

    "train_nmt_all":
    {
        "gres": "--gres gpu:tesla:1"
    }

}

This configuration file sets the option gres to empty for all jobs except docalign_translate_nmt and train_nmt_all, for which it takes the value --gres gpu:tesla:1. In SLURM, --gres is the option that allows a generic resource to be requested when queuing a job; in the example we are specifying that a Tesla GPU is required by these two jobs. Once we have our configuration file, we can call Bitextor in the following way:

bitextor.sh -s myconfig.yaml -j 20 -c "sbatch {cluster.gres}" -g cluster.json

Note that, in this case, a placeholder is added to the sbatch command so that each job is submitted with the specific gres option indicated for it in the config file cluster.json described above: it will be empty for most jobs, except for docalign_translate_nmt and train_nmt_all.
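
In other words, for each job the {cluster.gres} placeholder is filled in with the gres value defined for the corresponding rule in cluster.json, so the resulting submission commands look roughly like this:

# jobs covered by the __default__ entry are submitted with an empty gres option
sbatch ...
# docalign_translate_nmt and train_nmt_all jobs pick up their own entry
sbatch --gres gpu:tesla:1 ...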

Bitextor configuration file

Bitextor uses a configuration file to define the variables required by the pipeline. Depending on the options defined in this configuration file, the pipeline can behave differently, running alternative tools and functionalities. The following is an exhaustive overview of all the options that can be set in the configuration file and how they affect the pipeline.

Suggestion: a minimalist configuration file sample (config.yaml) can be found in this repository (snakemake/example/tests/config.yaml). You can take it as a starting point and change all the paths to match your environment.

Basic variables

There are a few variables that are mandatory for running Bitextor, independently of the task to be carried out:

bitextor: /home/user/bitextor

permanentDir: /home/user/permanent/bitextor-output
dataDir: /home/user/permanent/data
transientDir: /home/user/transient

lang1: en
lang2: fr

wordTokenizers: {
  'fr': '/home/user/bitextor/preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l fr',
  'default': '/home/user/bitextor/preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l en'
}

sentenceSplitters: {
  'fr': '/home/user/bitextor/preprocess/moses/ems/support/split-sentences.perl -q -b -l fr',
  'default': '/home/user/bitextor/snakemake/example/nltk-sent-tokeniser.py english'
}

There are some additional options that are rather basic but not mandatory, as they take default values if they are not defined:

temp: /home/user/transient

morphologicalAnalysers: {
  'lang1': 'path/to/morph-analyser1',
  'lang2': 'path/to/morph-analyser2'
}

reverseOutputPair: true

profiling: true

Data Sources

The next set of options refers to the sources from which data will be crawled. Three options can be used to specify them: listing the websites to be crawled directly in the config file, providing a list of websites in a separate gzipped file, or providing a langstat file (see below) containing language statistics about the documents in one or more websites, so that promising websites can be identified.

hosts: ["www.elisabethtea.com","vade-antea.fr"]

hostsFile: /home/user/hosts.gz

langstat: /home/user/langstat/langstats.all.gz
langstatThreshold: 50

Each line of the langstat file contains a hostname, a language code and a count for that language on that host, as in the following excerpt:

0-0hamster.livejournal.com      el      17
0-0hamster.livejournal.com      en      1102
0-0hamster.livejournal.com      hi      19
0-0hamster.livejournal.com      ms      33
0-0hamster.livejournal.com      nn      29

In addition, it is possible to specify one or more WARC files to use, with the option WARCFiles. It allows defining a list of gzip-compressed WARC files (each record compressed individually), which will be used to extract parallel data. This option and the previous ones are not mutually exclusive: WARCFiles can be used along with hosts, hostsFile and/or langstat.

hosts: ["www.elisabethtea.com", "vade-antea.fr"]
WARCFiles: ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]

Crawling

Four crawlers are supported by Bitextor: one based on the Creepy library, Heritrix, the wget tool, and HTTrack. The following variables allow choosing one of them and configuring some aspects of the crawling.

crawler: wget

crawlTimeLimit: 30s

crawlSizeLimit: 1G
crawlTld: false
crawlerNumThreads: 1
crawlerConnectionTimeout: 10

onlyConcat: false

If you also want to crawl PDFs (only supported by wget for now), use these settings:

crawler: wget
crawlFileTypes: "html,pdf"

If you want to use the Heritrix crawler, you should provide the Heritrix installation folder and, optionally, the URL (default is 'localhost:8443') and the user:password pair (default is 'admin:admin'):

crawler: heritrix
heritrixPath: /home/user/heritrix-3.4.0-20190418
heritrixUrl: "https://localhost:8443"
heritrixUser: "admin:admin"

The Heritrix crawler will check whether there is a checkpoint in its 'jobs' folder and resume from the latest one. If the crawl takes longer than the crawl time limit, it will automatically create a checkpoint for a future incremental crawl.

Preprocessing

After crawling, the downloaded web pages are processed to extract clean text, detect their language, and so on. The following set of options defines how that process is carried out.

giawarc: false

boilerpipeCleaning: true
parser: "modest"

onlyPreprocessing: false

preprocessLangs: "en,es,fr"
targetLangs: "en,fr"

langId: cld2

ftfy: false
cleanHTML: false

plainTextHashes: path/to/previous/permanent/bitextor-output/plain_text_hashes.xz

Document alignment

Two strategies are implemented in Bitextor for document alignment. The first one uses bilingual lexica to compute word-overlapping-based similarity metrics; these metrics are combined with other features extracted from the HTML files and fed to a linear regressor that produces a similarity score. The second one uses machine translation (MT) and a TF/IDF similarity metric computed between the original documents in lang1 and the translations of the documents in lang2. Bitextor can build (if necessary) both the bilingual lexica and the MT system from parallel data.

documentAligner: DIC

The variable documentAligner selects the document-alignment strategy and can take three different values, corresponding to the approaches described in the following subsections: bilingual lexica (the DIC value shown above), an external MT system, or a home-brew neural MT system.

Using bilingual lexica

dic: /home/user/en-fr.dic

The option dic specifies the path to the bilingual lexicon to be used for document alignment. If the specified lexicon does not exist, the pipeline will try to build it with the mgiza tools from a parallel corpus provided through the variable initCorpusTrainPrefix:

initCorpusTrainPrefix: ['/home/user/Europarl.en-fr.train']

This variable must contain one or more corpus prefixes. For a given prefix (/home/user/Europarl.en-fr.train in the example) the pipeline expects to find one file prefix.lang1 and another prefix.lang2 (in the example, /home/user/Europarl.en-fr.train.en and /home/user/Europarl.en-fr.train.fr). If several training prefixes are provided, the corresponding files will be concatenated before building the bilingual lexicon.

Suggestion: a number of pre-built bilingual lexica are available in the bitextor-data repository. It is also possible to use other existing lexica, such as those in OPUS, as long as their format is the same as that used in the repository.

If you are running out of memory in the mkcls rule, you may want to use the original mkcls binary instead of the clustercat interface by setting:

mkcls: true

Using external MT

alignerCmd: "example/dummy-translate.sh"
docAlignThreshold: 0.1
docAlignWorkers: 2
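
The alignerCmd value is the command used to translate documents for MT-based document alignment; example/dummy-translate.sh above is just an example. Assuming the command is expected to read text from standard input and write translations to standard output, one segment per line (this interface is an assumption, so check the example script shipped with Bitextor), a wrapper around your own MT system could look roughly like this:

#!/bin/bash
# hypothetical wrapper for an external MT system: reads segments on stdin
# and must write one translation per line on stdout; here we simply copy
# the input, mimicking a dummy "translator" for illustration only
cat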

Using a home-brew neural MT system

If this option is chosen, a Marian NMT model will be trained and evaluated before using it for document alignment. Note that, given the computational cost of training an NMT system, this option requires having a GPU available. The following are mandatory variables in order to build the NMT system:

initCorpusTrainPrefix: ['/home/user/Europarl.en-fr.train']
initCorpusDevPrefix: ['/home/user/Europarl.en-fr.dev']
initCorpusTestPrefix: ['/home/user/Europarl.en-fr.test']

marianDir: /home/user/marian-dev
mosesDir: /home/user/mosesdecoder
subwordNmtDir: /home/user/subword-nmt

nmtVocabSize: 50000

LANG2Detokenizer: "/home/user/mosesdecoder/scripts/tokenizer/detokenizer.perl -l fr"

gpuId: 0

marianArgs: [" --optimizer-delay 1", "--mini-batch-fit", "--mini-batch 1000", "--maxi-batch 1000", "--overwrite", "--keep-best", "--valid-metrics perplexity", "--valid-log valid.log", "--log train.log", "--dropout-rnn 0.2", "--dropout-src 0.2", "--dropout-trg 0.2 ", "--cost-type ce-mean-words", "--layer-normalization", "--exponential-smoothing", "--tied-embeddings", "--valid-metrics bleu"]

Segment alignment

After document alignment, the next step in the pipeline is segment alignment. This can be carried out by using the tool hunalign or the tool bleualign. The first one uses a bilingual lexicon and is best suited for the DIC option of documentAligner; the second one uses MT and is only available if one of the options based on MT has been specified in documentAligner.

bleualign: true
bleuAlignThreshold: 0.1
hunalignThreshold: 0.0

Parallel data filtering

Parallel data filtering is carried out with the tool Bicleaner; this tool uses a pre-trained regression model to filter out pairs of segments with a low confidence score (learn more about Bicleaner here). The options required to make it work are:

bicleaner: /home/user/bicleaner-model/en-fr/training.en-fr.yaml
bicleanerThreshold: 0.6

If the Bicleaner model is not available, the pipeline will try to train one automatically from the data provided through the config file options initCorpusTrainPrefix and bicleanerCorpusTrainingPrefix:

initCorpusTrainPrefix: ['/home/user/Europarl.en-fr.train']
bicleanerCorpusTrainingPrefix: ['/home/user/RF.en-fr']

It is important to provide different parallel corpora for these two options as this helps Bicleaner when dealing with unknown words (that do not appear in the statistical dictionaries) during scoring.

Post-processing

Some other options can be configured to specify the output format of our corpus:

bifixer: true

elrc: true

tmx: true

deduped: false

deferredCrawling: true

NOTE: In case you need to convert a TMX into a tab-separated plain-text file (Moses format), you can use the TMXT tool.
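
For instance, such a conversion might look like the line below; the script path, the --codelist option and the argument order are assumptions about the TMXT command line and should be checked against the TMXT documentation:

python3 tmxt/tmxt.py --codelist en,fr corpus.en-fr.tmx corpus.en-fr.txt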

Pipeline description

Bitextor is a pipeline that runs a collection of scripts to produce a parallel corpus from a collection of multilingual websites. The pipeline is divided into five stages:

  1. Crawling: documents are downloaded from the specified websites
  2. Pre-processing: downloaded documents are normalized, boilerplates are removed, plain text is extracted, and language is identified
  3. Document alignment: parallel documents are identified. Two strategies are implemented for this stage:
    • one using bilingual lexica and a collection of features extracted from HTML; a linear regressor combines these resources to produce a score in [0,1], and
    • another using machine translation and a TF/IDF strategy to score document pairs
  4. Segment alignment: each pair of documents is processed to identify parallel segments. Again, two strategies are implemented:
    • one using the tool Hunalign, and
    • another using Bleualign, which can only be used if the MT-based document-alignment strategy was chosen (the machine translations are used for both steps)
  5. Post-processing: final steps that clean the parallel corpus using the tool Bicleaner, deduplicate translation units, and compute additional quality metrics

The following diagram shows the structure of the pipeline and the different scripts that are used in each stage:

[Bitextor pipeline diagram]

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.