Clustering TabEL

[1]:
%%capture --no-display
import logging as log
log.getLogger().setLevel(log.DEBUG)

import takco
tables = takco.DaskHashBag.load(
    'hdfs://bricks07:9000/user/kruit/tabel/1-aa',
    address='tcp://192.168.62.207:8786',
)
# tables = takco.HashBag.load('../../data/TabEL/75k-part/75k-part-aa')
tables.client if hasattr(tables, 'client') else None
[1]:

Client

Cluster

  • Workers: 10
  • Cores: 10
  • Memory: 673.47 GB
[2]:
%%time
assets = takco.config.parse('../../resources/config-wikidata.toml')
pipeline = takco.config.parse('../../resources/pipelines/TabEL.toml')
steps = takco.config.build('step', {**assets, **pipeline})
workdir = 'output/tabel-notebook'

tables = takco.TableSet.run(steps[:4], input_tables=tables, workdir=workdir).persist()

if hasattr(tables, 'bag'):
    print(tables.bag.count().compute())
INFO:root:Running pipeline in output/tabel-notebook using <takco.util.HashBag object at 0x7f054b3375d0>
INFO:root:Chaining pipeline step 0-reshape
INFO:root:Restructuring with rules: [{'find': 'Precededby ', 'header': 'Preceded by'}, {'find': 'Succeededby ', 'header': 'Succeeded by'}]
INFO:root:Unpivoting with heuristics: NumSuffix, SeqPrefix, SpannedRepeat, AgentLikeHyperlink, AttributePrefixFinder
INFO:root:Chaining pipeline step 1-cluster
INFO:root:Chaining pipeline step 2-link
DEBUG:root:Lookup with <takco.link.sqlite.SQLiteLookup object at 0x7f04ac0a7290>
INFO:root:Chaining pipeline step 3-coltypes
371
CPU times: user 64.8 ms, sys: 11.9 ms, total: 76.7 ms
Wall time: 6.13 s
[3]:
%%time
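# Copy the step config so that popping 'step' does not mutate `steps`
# (the original in-place pop raises KeyError when the cell is re-run)
step_config = dict(steps[4])
step_config.pop('step', None)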

clusters = takco.TableSet.cluster(
    tables,
    workdir=workdir,
    **step_config,
).tables.persist()
INFO:root:Numbering tables...
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:root:Dask offset tableIndex tableIndex 1
INFO:root:Dask offset numCols columnIndexOffset 0
INFO:root:Building matchers: headjacc, headvec, bodylsh, bodyvec, bodytype
INFO:root:Indexing headjacc
DEBUG:root:Serializing <takco.cluster.matchers.celljacc.CellJaccMatcher object at 0x7f838c3a35d0> to output/tabel-notebook/headjacc
INFO:root:Indexing headvec
DEBUG:root:faiss info: analyzing 1555 vectors of size 50
no NaN or Infs in data
761 vectors are distinct (48.94%)
vector 28 has 59 copies
range of L2 norms=[1, 1] (0 null vectors)
vectors are normalized, inner product and L2  search are equivalent
matrix contains no 0s
no constant dimensions
no dimension has a too large mean
stddevs per dimension are in [0.0822988 0.199019]

DEBUG:root:Serializing <takco.cluster.matchers.embedding.EmbeddingMatcher object at 0x7f838c1f24d0> to output/tabel-notebook/headvec
INFO:root:Indexing bodylsh
Indexing bodylsh: 100%|██████████| 1910/1910 [00:00<00:00, 5115.95it/s]
DEBUG:root:Serializing <takco.cluster.matchers.lsh.LSHMatcher object at 0x7f8302e14e50> to output/tabel-notebook/bodylsh
INFO:root:Indexing bodyvec
DEBUG:root:faiss info: analyzing 1751 vectors of size 50
no NaN or Infs in data
1677 vectors are distinct (95.77%)
vector 895 has 7 copies
range of L2 norms=[1, 1] (0 null vectors)
vectors are normalized, inner product and L2  search are equivalent
matrix contains no 0s
no constant dimensions
no dimension has a too large mean
stddevs per dimension are in [0.0778333 0.257962]

DEBUG:root:Serializing <takco.cluster.matchers.embedding.EmbeddingMatcher object at 0x7f8302d0ea90> to output/tabel-notebook/bodyvec
INFO:root:Indexing bodytype
DEBUG:root:TypeCos index is len 383
DEBUG:root:Serializing <takco.cluster.matchers.typecos.TypeCosMatcher object at 0x7f82ef9e01d0> to output/tabel-notebook/bodytype
INFO:root:Indexing section
DEBUG:root:Serializing <takco.cluster.matchers.celljacc.CellJaccMatcher object at 0x7f8317558ed0> to output/tabel-notebook/section
INFO:root:Blocking tables; computing and aggregating column sims...
INFO:root:Got 391 table similarities; 375x reduction
INFO:root:Created graph IGRAPH U-W- 383 391 --  + attr: weight (e)
INFO:root:Found 3/375 >1 partitions
INFO:root:Clustering columns...
INFO:root:Merging clustered tables...
CPU times: user 1.99 s, sys: 926 ms, total: 2.92 s
Wall time: 16.2 s
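
The faiss diagnostics in the output above report that the vectors are normalized, so inner-product search and L2 search are equivalent. A minimal sketch of why, in plain Python rather than faiss: for unit vectors, the squared L2 distance equals 2 - 2 * (a . b), so ranking by smallest distance gives the same order as ranking by largest inner product. The vectors and helper names below are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 3-dimensional "embeddings" (the notebook uses 50 dimensions)
query = normalize([1.0, 2.0, 0.5])
candidates = [
    normalize(v)
    for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])
]

# Rank candidates by descending inner product and by ascending L2 distance
by_inner_product = sorted(
    range(len(candidates)), key=lambda i: -dot(query, candidates[i])
)
by_l2_distance = sorted(
    range(len(candidates)), key=lambda i: sq_l2(query, candidates[i])
)
print(by_inner_product == by_l2_distance)
```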
[4]:
# Keep only clusters that were merged from several tables, largest first
nontrivial_clusters = sorted(
    (t for t in clusters if t.get("partColAligns")),
    key=lambda t: len(t["partColAligns"]),
    reverse=True,
)
[5]:
t = nontrivial_clusters[0]

print(f'Table {t["_id"]} was created from {len(t["partColAligns"])} tables')

print('Result:')
display(takco.preview(t))

print('Original:')
display(takco.preview(t["partColAligns"], ntables=None))
Table part-0 was created from 6 tables
Result:
? 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Preceded by Preceded by 3P% FT% Succeeded by FG% Year Team Role Notes GP GS MPG RPG APG SPG BPG PPG
Manius Acilius Glabrio Gnaeus Cornelius Severus , Marcus Valerius Homullus Gaius Bruttius Praesens Political offices Consul of the Roman Empire 153 with Aulus Junius Rufinus Lucius Verus Titus Sextius Lateranus
Commodus , Publius Martius Veru Gaius Bruttius Praesens Political offices Consul of the Roman Empire 180 with Sextus Quintilius Condianus Commodus Lucius Antistius Burrus
Samuel Rodgers Tom Boyd (politician) Parliament of Northern Ireland Member of Parliament for Belfast Pottinger 1958 - 1969 Joshua Cardwell
None Tom Boyd (politician) Political offices Leader of the Northern Ireland Labour Party at Stormont 1958 - 1969 Vivian Simpson
Anna Brolly Pat O'Rawe Political offices Mayor of Armagh 2003 - 04 Eric Speers

(342 more rows)

Original:
? 0 1 2 3 4 5
Q5 Q154954 Q164509 Q215627 string Q5 Q164509 Q154954 Q215627 Q11514315 Q3024240 Q1250464 Q186081 Q96196009 Q17442446 Q15633587 Q12139612 Q13406463 Q19953632 Q6256 Q3624078 Q48349 Q5 Q154954 Q164509 Q215627
_pgTitle Preceded by Succeeded by Succeeded by ,
Gaius Bruttius Praesens Political offices Manius Acilius Glabrio Gnaeus Cornelius Severus , Marcus Valerius Homullus Consul of the Roman Empire 153 with Aulus Junius Rufinus Lucius Verus Titus Sextius Lateranus
Gaius Bruttius Praesens Political offices Commodus , Publius Martius Veru Consul of the Roman Empire 180 with Sextus Quintilius Condianus Commodus Lucius Antistius Burrus
? 0 1 2 3 4
Q5 Q154954 Q164509 Q215627 Thing Q5 Q154954 Q164509 Q215627 Q4164871 Q214339 Q5 Q154954 Q164509 Q215627
_pgTitle Preceded by Succeeded by
Tom Boyd (politician) Parliament of Northern Ireland Samuel Rodgers Member of Parliament for Belfast Pottinger 1958 - 1969 Joshua Cardwell
Tom Boyd (politician) Political offices None Leader of the Northern Ireland Labour Party at Stormont 1958 - 1969 Vivian Simpson
Pat O'Rawe Political offices Anna Brolly Mayor of Armagh 2003 - 04 Eric Speers
Pat O'Rawe Northern Ireland Assembly John Fee MLA for Newry and Armagh 2003 - 2007 Cathal Boylan
Walther Dahl Military offices Major Gerhard Michalski Commander of Jagdgeschwader z.b.V. 20 May 1944 – 6 June 1944 Major Gerhard Schöpfel

(109 more rows)

? 0 1
Q5 Q154954 Q164509 Q215627 Q5 Q154954 Q164509 Q215627
_pgTitle Preceded by
List of minor planets: 87001–88000 86001–87000
Billy Jack Haskins Scott Russell
Joanna of Gallura Nino
Amado Guevara Preki
Ubertino I da Carrara Marsilio

(26 more rows)

? 0 1 2 3 4 5
Q5 Q154954 Q164509 Q215627 decimal dateTime Q15061650 Q500834 Q5 Q154954 Q164509 Q215627 string
_pgTitle No. Year Tournament Opponent Result
Meaghan Francella 1 2007 MasterCard Classic Annika Sörenstam Won with birdie on fourth extra hole
? 0 1 2 3 4
Q5 Q154954 Q164509 Q215627 dateTime Q11424 Q15416 Q20937557 Q10301427 Q4502142 Q2431196 Q5398426 Q7725310 string string
_pgTitle Year Title Role Notes
Frances Bavier 1952 Racket Squad Martha Carver 1 episode
Frances Bavier 1952– 1953 Gruen Guild Playhouse Sarah Cummings 2 episodes
Frances Bavier 1953 Hallmark Hall of Fame Lou Bloor 1 episode
Frances Bavier 1953– 1954 City Detective Various roles 3 episodes
Frances Bavier 1953– 1954 Letter to Loretta Various roles 3 episodes

(171 more rows)

? 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Q5 Q154954 Q164509 Q215627 dateTime Q852446 Q1093829 Q12973014 Q13393265 Q847017 Q3327870 Q21518270 Q11271835 Q1549591 Q486972 Q15253706 Q7930989 Q515 decimal decimal decimal decimal decimal decimal decimal decimal decimal decimal decimal
_pgTitle Year Team GP GS MPG FG% 3P% FT% RPG APG SPG BPG PPG
Eric Maynor 2010 Oklahoma City 6 0 12.7 .300 .167 .818 1.5 1.5 .2 .2 3.7
Eric Maynor 2011 Oklahoma City 17 0 12.9 .377 .360 .789 1.3 2.2 .5 .0 4.8
Eric Maynor Career 23 0 12.9 .361 .323 .800 1.3 2.0 .4 .0 4.5
Mike Conley, Jr. 2007–08 Memphis 53 46 26.1 .428 .330 .732 2.6 4.2 .8 .0 9.4
Mike Conley, Jr. 2008–09 Memphis 82 61 30.6 .442 .406 .817 3.4 4.3 1.1 .1 10.9

(18 more rows)

All steps separately

[16]:
%%time
from takco.cluster.matchers import LSHMatcher, EmbeddingMatcher, CellJaccMatcher
from takco import cluster
fdir = '.'
matchers = [
    CellJaccMatcher(fdir, name='headjacc', source='head'),
    LSHMatcher(fdir, name='lsh', num_perm=64),
    EmbeddingMatcher(fdir, name='emb', wordvec_fname='/export/scratch1/home/kruit/glove.6B.50d.pickle'),
    CellJaccMatcher(fdir, name='sec', source='sectionTitle'),
]

tables = takco.TableSet.number_table_columns(tables).persist()
matchers = tables.pipe(cluster.matcher_add_tables, matchers)
matchers = list(matchers.fold(lambda x: x.name, lambda a, b: a.merge(b)))

for m in matchers:
    print(m.name)
    m.index()
INFO:root:Numbering tables...
DEBUG:root:Opening ../../data/TabEL/10k-part/10k-part-aa
DEBUG:root:Serial cumsum tableIndex tableIndex 1 -> 198
DEBUG:root:Serial cumsum numCols columnIndexOffset 0 -> 910
DEBUG:root:Piping matcher_add_tables ...
INFO:root:Loading word vectors /export/scratch1/home/kruit/glove.6B.50d.pickle
Loading tables into matchers: 198it [00:00, 242.43it/s]
Indexing lsh: 100%|██████████| 647/647 [00:00<00:00, 8045.81it/s]
DEBUG:root:faiss info: analyzing 605 vectors of size 50
no NaN or Infs in data
567 vectors are distinct (93.72%)
vector 2 has 12 copies
range of L2 norms=[1, 1] (0 null vectors)
vectors are normalized, inner product and L2  search are equivalent
matrix contains no 0s
no constant dimensions
no dimension has a too large mean
stddevs per dimension are in [0.0807233 0.25964]

headjacc
lsh
emb
sec
CPU times: user 2.53 s, sys: 321 ms, total: 2.85 s
Wall time: 2.85 s
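
The `LSHMatcher` above is created with `num_perm=64`. In MinHash-based LSH that parameter name usually denotes the number of hash permutations per signature; whether takco uses it in exactly this way is an assumption. The sketch below uses only the standard library and is not takco's code. It illustrates the underlying idea: the fraction of matching signature positions estimates the Jaccard similarity of two sets of cell values. The cell values are hypothetical.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def minhash_signature(tokens, num_perm=64):
    """One minimum per salted hash function; `tokens` must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(
                    tok.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "little",
            )
            for tok in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which both sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical cell values of two columns
col_a = {"Allmusic", "Entertainment Weekly", "Rolling Stone"}
col_b = {"Allmusic", "Entertainment Weekly", "Pitchfork"}

exact = jaccard(col_a, col_b)  # 2 shared values out of 4 distinct values
approx = estimate_jaccard(minhash_signature(col_a), minhash_signature(col_b))
print(exact, approx)
```

More permutations give a tighter estimate at the cost of longer signatures.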
[ ]:
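# Look at a block
# NOTE: run the next cell first; it defines `tableid_colids`.
# `i_table` (table index -> table) is not defined anywhere in this notebook.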
ti = 0
block = set()
matcher = matchers[1] # LSH matcher
print(matcher.name)
with matcher:
    block |= set(matcher.block(ti, tableid_colids[ti]))
print(f'Got block of size {len(block)}:', block)

# First table is query, rest is block
takco.preview([i_table[ti]] + [i_table[b] for b in block if b in i_table])
[17]:
%%time
tableid_colids = dict(tables.pipe(cluster.get_table_ids))
print(len(tableid_colids))
DEBUG:root:Piping get_table_ids ...
198
CPU times: user 325 ms, sys: 702 µs, total: 325 ms
Wall time: 325 ms
[18]:
%%time
import pandas as pd

tablesim = pd.concat(tables.pipe(
    cluster.get_tablesims,
    matchers=matchers,
    filter_matcher_names=['sec'],
    agg_func='max',
    agg_threshold=0.9,
    align_columns='max1',
    tableid_colids=tableid_colids,
))
tablesim
DEBUG:root:Piping get_tablesims ...
DEBUG:root:Loading <takco.cluster.matchers.lsh.LSHMatcher object at 0x7f2794631250> from disk...
INFO:root:Loading word vectors /export/scratch1/home/kruit/glove.6B.50d.pickle
DEBUG:root:Loading <takco.cluster.matchers.embedding.EmbeddingMatcher object at 0x7f2795fe5bd0> from disk...
DEBUG:root:Preparing block for matcher headjacc
DEBUG:root:Preparing block for matcher lsh
DEBUG:root:Preparing block for matcher emb
DEBUG:root:Querying emb faiss index with query matrix of shape (605, 50)
Blocking: 100%|██████████| 198/198 [00:00<00:00, 1822.86it/s]
DEBUG:root:Found 4847 pairs; 24 ± 23 per table
Looking up sec: 100%|██████████| 4847/4847 [00:00<00:00, 139013.77it/s]
DEBUG:root:Filtered down to 410 pairs
Looking up headjacc: 100%|██████████| 410/410 [00:00<00:00, 8841.24it/s]
Looking up lsh: 100%|██████████| 410/410 [00:00<00:00, 58388.72it/s]
DEBUG:root:Calculating 5262 lsh scores
Yielding lsh: 100%|██████████| 5262/5262 [00:00<00:00, 751308.13it/s]
Looking up emb: 100%|██████████| 410/410 [00:00<00:00, 57287.78it/s]
DEBUG:root:Calculating 4880 emb scores
Yielding emb: 100%|██████████| 4880/4880 [00:00<00:00, 842729.06it/s]
DEBUG:root:times: Timer(prepare_headjacc=4.3e-06, prepare_lsh=2.6e-06, prepare_emb=2.0e-01, block_headjacc=3.0e-03, block_lsh=9.6e-02, block_emb=1.2e-03, filter_sec=3.8e-02, match_headjacc=4.9e-02, match_lsh=2.5e-02, match_emb=2.2e-02)
DEBUG:root:Creating dataframe of column match scores
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 4.31 s, sys: 663 ms, total: 4.97 s
Wall time: 1.7 s
[18]:
ti1  ti2
1    55     1.000000
     67     1.000000
     69     1.000000
     78     1.000000
     106    1.000000
              ...
189  22     0.919705
     87     0.944925
     168    1.000000
195  86     0.989103
     133    0.941177
Length: 182, dtype: float64
[19]:
%%time
itups = ((ti,ti) for ti in tableid_colids)
ii = pd.MultiIndex.from_tuples(itups, names=['ti1', 'ti2'])
tablesim = pd.concat([tablesim, pd.Series(1, index=ii)])
CPU times: user 5.48 ms, sys: 91 µs, total: 5.57 ms
Wall time: 4.91 ms
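
Cell [19] appends a similarity of 1 for every pair (ti, ti). A likely reason, which is an inference and not stated in the notebook, is that tables without any surviving pair would otherwise be absent from the similarity graph. The counts are consistent with this reading: 182 pairs plus 198 self-pairs gives the 380 edges reported for the 198-node graph in cell [20]. A minimal plain-Python sketch with made-up table ids and scores:

```python
# Hypothetical similarities that survived thresholding: (ti1, ti2) -> score
pair_sims = {(1, 55): 1.0, (1, 67): 1.0, (189, 22): 0.92}
all_table_ids = [1, 22, 55, 67, 189, 195]  # table 195 has no surviving pair

def nodes_of(sims):
    """Collect every table id that appears in at least one pair."""
    return {ti for pair in sims for ti in pair}

missing_before = set(all_table_ids) - nodes_of(pair_sims)

# Add a self-pair with similarity 1 for every table, as cell [19] does
with_self = dict(pair_sims)
with_self.update({(ti, ti): 1.0 for ti in all_table_ids})

missing_after = set(all_table_ids) - nodes_of(with_self)
print(missing_before, missing_after)
```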
[20]:
%%time
edge_exp = 5
louvain_partition = cluster.louvain(tablesim, edge_exp=edge_exp)
print(len(louvain_partition))
INFO:root:Created graph IGRAPH U-W- 198 380 --  + attr: weight (e)
163
CPU times: user 25.4 ms, sys: 4.88 ms, total: 30.3 ms
Wall time: 30 ms
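
`cluster.louvain` is called with `edge_exp=5`. The notebook does not say what this parameter does; the sketch below assumes it is an exponent applied to each edge weight before community detection, which is only a guess from the name. Under that assumption, weights of exactly 1 stay unchanged while weaker similarities shrink quickly, so the strongest edges dominate the partitioning. The three scores are taken from the `tablesim` output above; the `sharpen` helper is hypothetical.

```python
def sharpen(weights, edge_exp):
    """Raise each similarity in [0, 1] to the power edge_exp (assumed behavior)."""
    return {pair: w ** edge_exp for pair, w in weights.items()}

# (ti1, ti2) -> similarity, copied from the tablesim output
sims = {(189, 22): 0.919705, (189, 87): 0.944925, (189, 168): 1.0}
sharpened = sharpen(sims, edge_exp=5)

for pair in sims:
    print(pair, round(sims[pair], 3), "->", round(sharpened[pair], 3))
```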
[10]:
nonsingle = [p for p in louvain_partition if len(p) > 1]
len(nonsingle)
[10]:
15
[11]:
import logging as log
log.getLogger().setLevel(log.WARN)

# Cluster the columns within each multi-table partition
chunks = tables.new(enumerate(nonsingle)).pipe(
    cluster.cluster_partition_columns,
    tableid_colids=tableid_colids,
    matchers=matchers,
    agg_func='max',
    agg_threshold_col=0.5,
)

# Each chunk is a tuple of four dicts; merge the dicts position by position
ti_pi, pi_ncols, ci_pci, ti_colsim = (
    {k: v for d in ds for k, v in d.items()} for ds in zip(*chunks)
)
len(ti_pi), len(pi_ncols), len(ci_pci)
DEBUG:root:Piping cluster_partition_columns ...
DEBUG:root:Loading <takco.cluster.matchers.lsh.LSHMatcher object at 0x7f988fc97810> from disk...
INFO:root:Loading word vectors /export/scratch1/home/kruit/glove.6B.50d.pickle
DEBUG:root:Loading <takco.cluster.matchers.embedding.EmbeddingMatcher object at 0x7f988fc97650> from disk...
DEBUG:root:Calculating 489 lsh scores
DEBUG:root:Calculating 489 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (36, 36) column similarities
DEBUG:root:Partition 0 has 12 tables and 3 column clusters
DEBUG:root:Calculating 24 lsh scores
DEBUG:root:Calculating 6 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (33, 33) column similarities
DEBUG:root:Partition 1 has 3 tables and 11 column clusters
DEBUG:root:Calculating 10 lsh scores
DEBUG:root:Calculating 10 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (4, 4) column similarities
DEBUG:root:Partition 2 has 4 tables and 1 column clusters
DEBUG:root:Calculating 10 lsh scores
DEBUG:root:Calculating 10 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (8, 8) column similarities
DEBUG:root:Partition 3 has 4 tables and 2 column clusters
DEBUG:root:Calculating 3 lsh scores
DEBUG:root:Calculating 3 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (4, 4) column similarities
DEBUG:root:Partition 4 has 2 tables and 2 column clusters
DEBUG:root:Calculating 49 lsh scores
DEBUG:root:Calculating 49 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (12, 12) column similarities
DEBUG:root:Partition 5 has 2 tables and 8 column clusters
DEBUG:root:Calculating 37 lsh scores
DEBUG:root:Calculating 37 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (9, 9) column similarities
DEBUG:root:Partition 6 has 2 tables and 5 column clusters
DEBUG:root:Calculating 75 lsh scores
DEBUG:root:Calculating 75 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (10, 10) column similarities
DEBUG:root:Partition 7 has 2 tables and 5 column clusters
DEBUG:root:Calculating 313 lsh scores
DEBUG:root:Calculating 313 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (22, 22) column similarities
DEBUG:root:Partition 8 has 2 tables and 19 column clusters
DEBUG:root:Calculating 93 lsh scores
DEBUG:root:Calculating 93 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (15, 15) column similarities
DEBUG:root:Partition 9 has 2 tables and 8 column clusters
DEBUG:root:Calculating 271 lsh scores
DEBUG:root:Calculating 271 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (27, 27) column similarities
DEBUG:root:Partition 10 has 2 tables and 16 column clusters
DEBUG:root:Calculating 27 lsh scores
DEBUG:root:Calculating 27 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (8, 8) column similarities
DEBUG:root:Partition 11 has 2 tables and 4 column clusters
DEBUG:root:Calculating 73 lsh scores
DEBUG:root:Calculating 73 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (20, 20) column similarities
DEBUG:root:Partition 12 has 5 tables and 7 column clusters
DEBUG:root:Calculating 234 lsh scores
DEBUG:root:Calculating 234 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (22, 22) column similarities
DEBUG:root:Partition 13 has 4 tables and 10 column clusters
DEBUG:root:Calculating 127 lsh scores
DEBUG:root:Calculating 93 emb scores
DEBUG:root:Creating colsim dataframe
DEBUG:root:Clustering (14, 14) column similarities
DEBUG:root:Partition 14 has 2 tables and 8 column clusters
[11]:
(50, 15, 244)
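
The generator expression in cell [11] transposes the per-partition results with `zip(*chunks)` and then merges the dicts found at each tuple position. A small self-contained sketch of that idiom follows; the chunk contents and the `merge_positionwise` helper are made up, since the real values come from `cluster.cluster_partition_columns`.

```python
# Each chunk is a 4-tuple of dicts; the keys and values here are hypothetical
chunks = [
    ({0: 0, 1: 0}, {0: 3}, {10: 0, 11: 1}, {0: "colsim-a"}),
    ({2: 1, 3: 1}, {1: 2}, {20: 0, 21: 1}, {2: "colsim-b"}),
]

def merge_positionwise(chunks):
    """Merge the dicts at each tuple position across all chunks."""
    return tuple(
        {k: v for d in ds for k, v in d.items()}
        for ds in zip(*chunks)
    )

ti_pi, pi_ncols, ci_pci, ti_colsim = merge_positionwise(chunks)
print(len(ti_pi), len(pi_ncols), len(ci_pci))
```

If `chunks` is empty, `zip(*chunks)` yields nothing and the four-way unpacking raises a `ValueError`, so the idiom relies on at least one multi-table partition. In the output above, the first length (50) equals the sum of the table counts logged for the 15 partitions.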
[61]:
clusters = tables.pipe(
    cluster.set_partition_columns, ti_pi, pi_ncols, ci_pci
).fold(
    lambda t: t["_id"],
    lambda a, b: cluster.merge_partition_tables(
        a,
        b,
        keep_partition_meta=["tableHeaders", lambda x: {'tableData': x["tableData"][:10]}],
    ),
).persist()
DEBUG:root:Piping set_partition_columns ...
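
`fold(key, binop)` is used twice in this notebook: to merge matchers by name and to merge tables by `_id`. Judging from how the results are iterated, it appears to group items by key and reduce each group with the binary function, yielding the reduced values. That is an inference, and the plain-Python sketch below is an analogy rather than takco's implementation. The `merge` function and the example tables are hypothetical.

```python
def fold(items, key, binop):
    """Group items by key(item) and reduce each group pairwise with binop."""
    reduced = {}
    for item in items:
        k = key(item)
        # First item of a group is kept as-is; later ones are merged into it
        reduced[k] = binop(reduced[k], item) if k in reduced else item
    return list(reduced.values())

def merge(a, b):
    """Hypothetical merge: concatenate the rows of two tables with one _id."""
    return {"_id": a["_id"], "tableData": a["tableData"] + b["tableData"]}

example_tables = [
    {"_id": "part-0", "tableData": [["Allmusic"]]},
    {"_id": "part-1", "tableData": [["1952"]]},
    {"_id": "part-0", "tableData": [["Entertainment Weekly"]]},
]

merged = fold(example_tables, key=lambda t: t["_id"], binop=merge)
print([(t["_id"], len(t["tableData"])) for t in merged])
```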
[63]:
t = [t for t in clusters if t.get("partColAligns")][0]
takco.preview(t["partColAligns"])
[63]:
? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic
? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)

? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)

? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)

? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)

? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)

? 0 1 2
Source Rating
Review scores Allmusic link
Review scores Entertainment Weekly (B) link
Review scores Allmusic link
Review scores Allmusic
Review scores Allmusic

(5 more rows)