Getting Started
The takco package is designed to be run in several different environments. If you are just exploring the options, you can install the package and make use of external APIs. This gives you access to Wikipedia or other web table sources, external knowledge bases (KBs), and entity query APIs. If you want to run a larger pipeline, you should set up a local KB and mirror the web table sources yourself. Finally, if you want to reproduce the research on which takco was built, you can run it on a compute cluster.
Default Setup
By default, takco is configured to make use of external APIs for web table harvesting and entity linking. This way, you can explore its features without having to set up a large KB yourself.
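For example, entity information can be fetched from the public Wikidata API. The snippet below is a minimal sketch of such an external query; it uses the standard wbsearchentities endpoint and is not a takco-specific call:

import requests

# Query the public Wikidata entity search API, one example of the kind
# of external API the default setup relies on (not takco-specific code).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": "Amsterdam",
        "language": "en",
        "format": "json",
    },
)
for hit in resp.json()["search"]:
    print(hit["id"], hit.get("label"), hit.get("description"))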
…
Large Machine Setup
For larger workloads, set up a locally hosted mirror of Wikipedia.
Warning
This will require significant storage space: a typical Wikipedia ZIM dump is about 40 GB.
Install the Kiwix tools.
Download a Wikipedia ZIM dump.
Host it with
./kiwix-serve --port=8989 your_wiki_dump.zim
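Once the server is running, you can verify that it responds. The check below is a minimal sketch; the URL layout of the served pages depends on the contents of your ZIM file:

import requests

# Verify that the local kiwix-serve instance answers on the port above.
resp = requests.get("http://localhost:8989/")
resp.raise_for_status()
print("kiwix-serve is up:", resp.status_code)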
To download many pages to WARC format, you can use wget in parallel:
parallel -j4 --pipe-part -a urls.txt \
'wget -i - --warc-file=warcs/{#} --warc-max-size=1G --warc-cdx=on -O/dev/null -q'
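The resulting WARC files can be inspected in Python with the warcio package. This is a sketch; the file name warcs/1.warc.gz is illustrative, as wget numbers and rotates the archives it writes:

from warcio.archiveiterator import ArchiveIterator

# Iterate over the response records in one of the WARC files written by
# wget above (the file name is illustrative).
with open("warcs/1.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))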
On a machine with many cores, it is often useful to use the Dask execution engine, which provides a dashboard for monitoring running tasks.
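Starting a local Dask cluster is standard dask.distributed usage, independent of takco; a minimal sketch:

from dask.distributed import Client

# Start a local cluster using the available cores; the dashboard is
# served at http://localhost:8787 by default.
client = Client()
print(client.dashboard_link)

Any tasks submitted through this client will then show up on the dashboard.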
Installing Trident
…
Cluster Setup
To cluster and integrate a large corpus of web tables, it is recommended to run
takco
on a large cluster of machines. For this purpose, the
Dask execution engine has several backends.
The current version of takco
will be tested on SLURM.
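With the dask-jobqueue package, such a Dask cluster can be launched on SLURM. The sketch below uses placeholder resource values that you would adapt to your own cluster:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Resource requests for each SLURM job (placeholder values).
cluster = SLURMCluster(cores=8, memory="32GB", walltime="02:00:00")
cluster.scale(jobs=10)  # submit 10 SLURM jobs to act as Dask workers
client = Client(cluster)

Each job then starts a Dask worker, and work submitted through the client is scheduled across them.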