Novelty analysis

In this notebook, we'll look at the trade-offs between novelty and accuracy in semantic matching.

[1]:
import logging
logging.getLogger().setLevel(logging.INFO)

import takco
import pandas as pd

conf = takco.config.parse('resources/config-dbpedia.toml')
tables = takco.TableSet.load('output/t2d-v2-baseline-2/1-link/*')

t2dv2 = takco.config.build('t2d-v2', conf)
scored_tables = tables.score(t2dv2, keycol_only=True)
scored_tables.tables.persist()

takco.preview(t for t in scored_tables if any(t.get('gold', {}).values()))
INFO:root:Loading data from resources/t2d_fix.csv
INFO:root:Read 512 tables from data/t2d-v2/tables
INFO:root:Read 512 entity tables from data/t2d-v2/instance
INFO:root:Loaded 514 annotated tables
[1]:
? 0 1 2
Book
0 author
Title Author Source
Adventures of Huckleberry Finn ❌ 💡 Mark Twain ALA [11]
The Adventures of Super Diaper Baby ❌ 💡 Dav Pilkey ALA [47]
The Adventures of Tom Sawyer ❌ 💡 Mark Twain ALA
Alice series ❌ 💡 Phyllis Reynolds Naylor ALA [2]
All the King's Men ❌ 💡 Robert Penn Warren Rad

(146 more rows)

? 0 1 2
Newspaper
# Media MIX
1 Dainik Jagran ❌ 💡 27.500
2 Dainik Bhaskar ❌ 💡 14.000
3 Aajtak TV 7.000
4 CNN Editions (International) ❓ 6.000
5 Dinakaran ❌ 💡 5.000

(16 more rows)

? 0 1 2 3 4 5 6
Building
2 location floorCount💡 openingDate💡
# Geb?ude Geb?ude Stadt Etagen H?he Jahr
1 Burj Khalifa ❌ 💡 Dubai 163 2.717 ft 2010
2 Makkah Clock Royal Tower [Abraj Al Bait] ❌ 💡 Mekka 95 1.972 ft 2012
3 Taipei 101 ❌ 💡 Taipei 101 1.671 ft 2004
4 Shanghai World Financial Center ❌ 💡 Shanghai 101 1.614 ft 2008
5 International Commerce Centre [Union Square] ❌ 💡 Hong Kong 118 1.588 ft 2010

(195 more rows)

? 0 1 2 3 4
Company
1 industry
Rank Company Industry Temkin Experience Rating (TER) Company TER vs Industry TER
1 Sam's Club ❌ 💡 Retailer 85% 13.0
2 Publix ❌ 💡 Grocery Chain 81% 4.9
3 A credit union Bank 80% 14.5
3 Chick-fil-A ❌ 💡 Fast Food Chain 80% 6.2
3 Subway ❌ 💡 Fast Food Chain 80% 6.4

(201 more rows)

? 0 1 2 3 4 5 6
Mountain
PEAK RANKING MAP GUIDE GRID REF ALT (ft) ALT (m)
Allen Crags ❌ 💡 43 SW E NY 236 085 2,572 784
Angletarn Pikes ❌ 💡 143 NE FE NY 414 148 1,857 566
Ard Crags ❌ 💡 142 NW NW NY 207 197 1,860 567
Armboth Fell ❌ 💡 182 NW C NY 297 159 1,570 479
Arnison Crag ❌ 💡 194 NE E NY 394 150 1,424 434

(210 more rows)

? 0 1 2 3 4
Country
1 frenchName💡 region💡 capital timeZone💡
A Nom en anglais Endroit Capitale Heure
Afghanistan Afghanistan ❌ 💡 Asie Kabul +4.5
Afrique du Sud South Afrique 💡 Afrique Pretoria +2
Albanie Albania ❌ 💡 Europe Tirane +1
Alderney (UK) voir les Anglo-Normandes Alderney ❌ 💡 Europe 0
Algrie Algeria ❌ 💡 Afrique Algiers +1

(228 more rows)

? 0 1 2 3 4 5 6
Country
Country Name: Population Area (Sq. Km.) Population Density (Sq. Km.) Area (Sq. Mi.) Population Density (Sq. Mi.)
36 China ❌ 💡 1,339,190,000 9,596,960.00 139.54 3,705,405.45 361.42
77 India ❌ 💡 1,184,639,000 3,287,590.00 360.34 1,269,345.07 933.27
183 United States of America ❌ 💡 309,975,000 9,629,091.00 32.19 3,717,811.29 83.38
78 Indonesia ❌ 💡 234,181,400 1,919,440.00 122.01 741,099.62 315.99
24 Brazil ❌ 💡 193,364,000 8,511,965.00 22.72 3,286,486.71 58.84

(188 more rows)

? 0 1 2 3 4 5
VideoGame
0 publisher releaseDate💡 releaseDate💡
Title Publisher EU Release Date AU Release Date PEGI ACB
Donkey Kong Country ❌ 💡 Nintendo 2006-12-08 2006-12-07 7 G
F-Zero ❌ 💡 Nintendo 2006-12-08 2006-12-07 3 G
SimCity ❌ 💡 Nintendo 2006-12-29 2006-12-29 3 G
Super Castlevania IV ❌ 💡 Konami 2006-12-29 2006-12-29 3 PG
Street Fighter II: The World Warrior ❌ 💡 Capcom 2007-01-19 2007-01-19 12 PG

(60 more rows)

? 0 1 2 3 4
RadioStation
1 programmeFormat💡 city ❓
Dial Location Call Letters Format Address Telephone
AM 790 KABC (ABC Radio Networks) ❌ 💡 News/Talk 3321 S La Cienega Blvd, Los Angeles 90016 (310) 840-4900
AM 900 KALI AM ❌ 💡 Spanish News/Talk 747 E Green St, Pasadena 91101 (626) 844-8882
AM 1300 KAZN (Asian Radio) ❌ 💡 Chinese Variety 747 E Green St, Pasadena 91101 (626) 568-1300
AM 1580 KBLA ❌ 💡 Spanish News/Talk 123 Figueroa St, #101A, Los Angeles 90012 (213) 628-8700
AM 740 KBRT (K-Bright) ❌ 💡 Religious Talk 3183-D Airway Ave, Costa Mesa 92626 (714) 754-4450

(25 more rows)

? 0 1 2
Hospital
Local Health Boards Hospital name Link Surgeons
Abertawe Bro Morgannwg University LHB Morriston Hospital (Swansea) ❌ 💡 Roger Morgan
Singleton Hospital (Swansea) ❌ 💡 Roger Morgan
Princess of Wales Hospital (Bridgend) ❌ 💡 Roger Morgan
Aneurin Bevan LHB Neville Hall Hospital (Abergavenny) ❌ 💡 Richard Blackett
Royal Gwent Hospital (Newport) ❌ 💡 Ahmed Shandall

(11 more rows)

[21]:
db = takco.config.build('dbpedia_t2ksubset', conf)

novelty_tables = scored_tables.triples().novelty(db)
novelty_tables.tables.persist()

len(list(novelty_tables))
WARNING:rdflib.term:http://dbpedia.org/resource/Toys_"R"_Us does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://dbpedia.org/resource/U_ _Ur_Hand does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://dbpedia.org/resource/"Weird_Al"_Yankovic:_The_Videos does not look like a valid URI, trying to serialize this will break.
[21]:
235
[22]:
report = novelty_tables.report(keycol_only=True)

display(pd.DataFrame.from_dict(report['scores'], orient='index').style.set_caption('Predictions:'))

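def reform_dict(dictionary, t=(), reform=None):
    """Flatten nested dicts into {tuple_of_keys: leaf_dict}, suitable for a hierarchical index."""
    if reform is None:
        # Create a fresh dict per top-level call; a mutable default ({}) would
        # be shared across calls and keep stale entries when the cell is re-run.
        reform = {}
    for key, val in dictionary.items():
        path = t + (key,)
        if isinstance(val, dict) and all(isinstance(v, dict) for v in val.values()):
            # Still a dict of dicts: descend one level, extending the key path.
            reform_dict(val, path, reform)
        else:
            # Leaf level (a dict of metric values): store it under the full path.
            reform[path] = val
    return reform

pd.DataFrame.from_dict(reform_dict(report['novelty']), orient='index').style.set_caption('Extractions:')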
INFO:root:Collected 26104 gold and 24061 pred for task entities
INFO:root:Collected 434 gold and 157 pred for task properties
INFO:root:Collected 235 gold and 235 pred for task classes
Predictions:
            precision    recall  f1-score  support  predictions
entities     0.867670  0.799762  0.832333    26104        24061
properties   0.777070  0.281106  0.412860      434          157
classes      0.740426  0.740426  0.740426      235          235
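
As a quick consistency check on the predictions table, the sketch below recovers the number of correct predictions per task. It assumes the standard definitions precision = correct / predictions and recall = correct / support; the notebook does not state how takco computes these columns, so treat that as an assumption. The helper name `implied_true_positives` is made up for this illustration and is not part of takco.

```python
# Rows copied from the "Predictions" table above:
# task -> (precision, recall, support, predictions)
rows = {
    "entities": (0.867670, 0.799762, 26104, 24061),
    "properties": (0.777070, 0.281106, 434, 157),
    "classes": (0.740426, 0.740426, 235, 235),
}


def implied_true_positives(precision, recall, support, predictions):
    """Return the correct-prediction count implied by each metric.

    Assumes precision = tp / predictions and recall = tp / support
    (standard definitions, not confirmed by the notebook).
    """
    tp_from_precision = round(precision * predictions)
    tp_from_recall = round(recall * support)
    return tp_from_precision, tp_from_recall


implied = {task: implied_true_positives(*vals) for task, vals in rows.items()}
for task, (tp_p, tp_r) in implied.items():
    print(f"{task}: tp from precision = {tp_p}, tp from recall = {tp_r}")
```

Under that assumption, both columns imply the same count for each task. For properties, only 157 predictions are made against 434 gold annotations, which is why recall stays low even though precision is reasonable.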
[22]:
Extractions:
                                         tp    fn    fp  precision    recall        f1
dbpedia_t2ksubset  label     existing  9471  1027   728   0.928620  0.902172  0.915205
                             attnovel  2549  2424  1030   0.712210  0.512568  0.596118
                             valnovel   351   468   349   0.501429  0.428571  0.462146
                   class     existing  5676  2190  3530   0.616554  0.721587  0.664948
                             attnovel  1886  1464  1332   0.586078  0.562985  0.574300
                             valnovel    62  1546   432   0.125506  0.038557  0.058991
                   property  existing  3100   580   994   0.757206  0.842391  0.797530
                             attnovel  2858  2450  2204   0.564599  0.538433  0.551205
                             valnovel  3422   412  2257   0.602571  0.892540  0.719437
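
The precision, recall and f1 columns of the extractions table appear to follow the usual definitions in terms of the tp, fn and fp counts. A minimal sketch, assuming those standard formulas (the function name `prf` is hypothetical and not part of takco):

```python
def prf(tp, fn, fp):
    """Compute (precision, recall, f1) from raw counts.

    Standard definitions, assumed here rather than taken from takco:
      precision = tp / (tp + fp)
      recall    = tp / (tp + fn)
      f1        = harmonic mean of precision and recall
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from two rows of the "Extractions" table.
label_existing = prf(tp=9471, fn=1027, fp=728)
class_valnovel = prf(tp=62, fn=1546, fp=432)

for name, scores in [("label/existing", label_existing),
                     ("class/valnovel", class_valnovel)]:
    print(name, " ".join(f"{s:.6f}" for s in scores))
```

Recomputing the scores this way makes the contrast between the two rows easy to see: existing labels have few false positives and false negatives relative to true positives, whereas novel class values have far more misses (1546) than hits (62).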
[ ]: