Cluster analysis

In this notebook, we will look at how to debug blocked table similarities and the resulting clusters.

[1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
[2]:
# Assume that we have the "chartedIn" results
import takco

tables = takco.TableSet.load('../../output/chartedIn/6-link/*.jsonl')
# Keep only tables that carry part-column alignments
tables = (t for t in tables if 'partColAligns' in t)

# Largest tables first; a missing row count sorts as 0
tables = sorted(tables, key=lambda x: -x.get('numDataRows', 0))
print(f"Got {len(tables)} tables")

takco.preview(tables)
Got 8 tables
[2]:
? 0 1 2 3 4 5 6 7 8 9
Rank Region _pgTitle Accolade Publication Year Certification Ref.
United Kingdom ( BPI ) Imagine (John Lennon album) Gold
United States ( RIAA ) Imagine (John Lennon album) 2× Platinum
United Kingdom ( BPI ) Caribou (album) Gold
United States ( RIAA ) Caribou (album) 2× Platinum
United Kingdom ( BPI ) No Secrets (Carly Simon album) Gold

(44 more rows)

? 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Year Peak chart positions _pgTitle Song Peak chart positions Peak chart positions Peak chart positions Peak chart positions Peak chart positions Peak chart positions Peak chart positions Peak chart positions Album details _pgTitle Certifications Peak chart positions Peak chart positions
1970 25 Help Me Make It Through the Night (Sammi Smith album) "He's Everywhere"
1970 1 Help Me Make It Through the Night (Sammi Smith album) "Help Me Make It Through the Night" 3 4 1 8
1972 1 The Happiest Girl in the Whole U.S.A. "The Happiest Girl in the Whole U.S.A." 7 16 11
1972 1 The Happiest Girl in the Whole U.S.A. "Funny Face" 5 17 1 5
2015 "An Evening I Will Not Forget" Dermot Kennedy Dermot Kennedy

(18 more rows)

? 0 1 2 3 4 5 6 7 8 9 10 11 12
_pgTitle Format Song
The Best Damn Thing Europe ( IFPI ) Summaries 1,000,000 * Platinum
Denmark ( IFPI Denmark ) 2× Platinum Streaming 3,600,000 ^ Locked Out of Heaven
Spain ( PROMUSICAE ) 2× Platinum Streaming 8,000,000 ^ Locked Out of Heaven
Thriller (album) Europe ( IFPI ) For sales in 2009 1× Platinum Summaries 1,000,000 *
Thriller (album) Worldwide Summaries 66,000,000

(6 more rows)

? 0 1 2
So Fresh: The Hits of Spring 2014 Chart (2014) Australian ARIA Compilations Chart
So Fresh: The Hits of Spring 2014 Peak position 1
So Fresh: The Hits of Spring 2014 Chart (2014) Australian ARIA Compilations Chart
So Fresh: The Hits of Spring 2014 Peak position 2
So Fresh: The Hits of Winter 2014 Chart (2014) Australian ARIA Compilations Chart

(3 more rows)

? 0 1 2
The Glory of Gershwin Chart (1994) UK Albums Chart
The Glory of Gershwin Peak position 2
Paul Hardcastle (album) Chart (1985) UK Albums ( OCC )
Paul Hardcastle (album) Peak position 53
Led Zeppelin IV Chart (2014) Polish Albums ( ZPAV )

(1 more rows)

? 0 1 2
20 Golden Greats (Nat King Cole album) Chart (1978) UK Albums Chart
20 Golden Greats (Nat King Cole album) Peak position 1
Gold Watch: 20 Golden Greats Chart (2012) Australian ARIA Albums Chart
Gold Watch: 20 Golden Greats Peak position 15
Vault: Def Leppard Greatest Hits (1980–1995) Chart (1996) Norwegian Top 40 Albums

(1 more rows)

? 0 1 2
So Fresh: The Hits of Autumn 2016 Chart (2016) Australia ( ARIA ) Top 20 Compilations
So Fresh: The Hits of Autumn 2016 Position 1
Up from Down Under Chart (1988) Australian Albums ( ARIA )
Up from Down Under Peak position 48

? 0 1 2
Don't Shoot Me I'm Only the Piano Player Chart (1974) U.S. Billboard Pop Albums
Don't Shoot Me I'm Only the Piano Player Position 67
Don't Shoot Me I'm Only the Piano Player Chart (1975) Danish Album Charts
Don't Shoot Me I'm Only the Piano Player Position 18
[8]:
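from takco.util import tableobj_to_dataframe

# Inspect the largest table: per column, show header and value frequencies
table = tables[0]
df = tableobj_to_dataframe(table)
for ci, col in enumerate(df.columns):
    print("Column", ci)
    # 'freq' holds header-text counts and may be absent for unnamed columns
    tophead = table['tableHeaders'][0][ci].get('freq')
    if tophead:
        print("Top headers:", dict(sorted(tophead.items(), key=lambda x: -x[1])))
    else:
        print("Header:", col)
    # Three most frequent cell values in this column
    print("Top values:", dict(df.iloc[:, ci].value_counts().head(3)))
    print()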
Column 0
Top headers: {'Rank': 2}
Top values: {'': 10, '10': 3, '4': 3}

Column 1
Top headers: {'Region': 1}
Top values: {'': 39, 'United States ( RIAA )': 3, 'United Kingdom ( BPI )': 3}

Column 2
Top headers: {'_pgTitle': 3}
Top values: {'Kids See Ghosts (album)': 21, 'Everything That Happens Will Happen Today': 15, 'Science Fiction (Brand New album)': 3}

Column 3
Top headers: {'Accolade': 2}
Top values: {'': 10, 'The 50 Best Albums of 2008': 2, 'The 25 Best Albums of 2018': 2}

Column 4
Header: ('',)
Top values: {'': 47, 'Chart (2012)': 1, 'Peak position': 1}

Column 5
Top headers: {'Publication': 1, 'Publisher': 1}
Top values: {'': 10, 'Pitchfork': 3, 'Rolling Stone': 2}

Column 6
Top headers: {'Year': 1}
Top values: {'': 34, '2008': 10, '2009': 4}

Column 7
Header: ('',)
Top values: {'': 47, 'Sokka irti (song)': 2}

Column 8
Top headers: {'Certification': 1}
Top values: {'': 41, 'Gold': 5, '2× Platinum': 2}

Column 9
Top headers: {'Ref.': 1}
Top values: {'': 49}

[9]:
[t['tableIndex'] for t in table['partColAligns']]
[9]:
[230, 235, 269, 307]
[10]:
[t['partcol_global'] for t in table['partColAligns']]
[10]:
[{'1': 955, '2': 954, '8': 956},
 {'0': 988, '2': 984, '3': 986, '5': 985, '6': 987},
 {'0': 1126, '2': 1123, '3': 1125, '5': 1124, '9': 1127},
 {'1': 1276, '4': 1275, '7': 1274}]
[11]:
takco.preview(table['partColAligns'], ntables=None)
[11]:
? 0 1 2
Q482994 Q208569 Q2068728 Q3302947 Q43229 Q24229398 Q16334295 Q1328899 Q32178211 Q15633587 Q12139612 Q48522 Q17442446 Q4167410 Q15633587 Q12139612 Q48522 Q17442446 Q4167410 Q11344 Q5127848
_pgTitle Region Certification
Imagine (John Lennon album) United Kingdom ( BPI ) Gold
Imagine (John Lennon album) United States ( RIAA ) 2× Platinum
Caribou (album) United Kingdom ( BPI ) Gold
Caribou (album) United States ( RIAA ) 2× Platinum
No Secrets (Carly Simon album) United Kingdom ( BPI ) Gold

(3 more rows)

? 0 1 2 3 4
Q482994 Q208569 Q43229 Q1002697 Q4830453 Q28877 Q17172633 Q11032 Q11033 Q340169 Q41298 string dateTime decimal
_pgTitle Publisher Accolade Year Rank
Everything That Happens Will Happen Today ABC News The 50 Best Albums of 2008 2009 46
Everything That Happens Will Happen Today AllMusic AllMusic's Favorite Rock Albums of 2008 2008 Unranked, out of 25
Everything That Happens Will Happen Today Amazon.com editors' picks Amazon Music: Best of 2008 2009 62
Everything That Happens Will Happen Today The Buffalo News Best Albums (2000–2010) 2010 Honorable mention
Everything That Happens Will Happen Today Chicago Sun-Times ( Jim DeRogatis ) The Best Albums of 2008 2008 2

(10 more rows)

? 0 1 2 3 4
Q482994 Q2565300 Q21997607 Q208569 Q4502142 Q2031291 Q43229 Q1002697 Q340169 Q41298 string decimal string
_pgTitle Publication Accolade Rank Ref.
Science Fiction (Brand New album) Pitchfork Reader's Poll: Top 50 Albums 17
Science Fiction (Brand New album) Pitchfork Most Underrated Album 5
Science Fiction (Brand New album) Sputnikmusic Staff's Top 50 Albums of 2017 11
Kids See Ghosts (album) 411Mania The Top 100 Albums of 2018 4
Kids See Ghosts (album) AllHipHop AllHipHop's 15 Best Hip-Hop Albums of 2018 10

(19 more rows)

? 0 1 2
Q134556 Q2031291 string
Sokka irti (song) Chart (2012) Finland ( The Official Finnish Singles Chart )
Sokka irti (song) Peak position 3
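
The alignment output can be cross-checked against the empty-cell counts in the column profile. The sketch below is not part of takco; it only re-uses the values printed above. It assumes that the keys of each `partcol_global` dict are column indexes of the merged table, and that every part-table row appears exactly once in the merged table. Both assumptions are inferred from the outputs, not from takco documentation. The part-table row counts are read off the previews (rows shown plus "N more rows").

```python
from collections import defaultdict

# Values copied from the outputs of cells [9] and [10]
table_index = [230, 235, 269, 307]
partcol_global = [
    {'1': 955, '2': 954, '8': 956},
    {'0': 988, '2': 984, '3': 986, '5': 985, '6': 987},
    {'0': 1126, '2': 1123, '3': 1125, '5': 1124, '9': 1127},
    {'1': 1276, '4': 1275, '7': 1274},
]
# Row counts read off the part-table previews in cell [11]
part_rows = [8, 15, 24, 2]

# Assumption: dict keys are merged-table column indexes.
# Invert the mapping: merged column -> part tables that fill it.
contributors = defaultdict(list)
for tid, mapping in zip(table_index, partcol_global):
    for col in mapping:
        contributors[int(col)].append(tid)

# Assumption: each part-table row occurs once in the merged table.
total_rows = sum(part_rows)
rows_by_table = dict(zip(table_index, part_rows))

# A merged column should be empty in every row that comes from
# a part table with no column aligned to it.
expected_empty = {
    col: total_rows - sum(rows_by_table[t] for t in contributors.get(col, []))
    for col in range(10)
}

for col in range(10):
    print(col, contributors.get(col, []), expected_empty[col])
```

Under these assumptions, the predicted empty counts for columns 0, 1, 3, 4, 5, 6, 7 and 8 equal the `''` counts in the profile of cell [8]. Column 9 (`Ref.`) is the exception: only table 269 aligns to it, yet the profile shows all 49 cells empty, so the aligned `Ref.` cells themselves carry no text. Columns 4 and 7 are filled only by the two rows of table 307, whose chart-style layout differs from the other three part tables, which makes it a natural candidate to inspect when debugging this cluster.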