Cluster analysis
In this notebook, we will look at how to debug blocked table similarities and the resulting clusters.
[1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
[2]:
# Assume that we have the "chartedIn" results
import takco

tables = takco.TableSet.load('../../output/chartedIn/6-link/*.jsonl')
# Keep only merged tables that record how their partial columns were aligned
tables = (t for t in tables if 'partColAligns' in t)
# Largest tables first; sorted() returns a list, so len() works below
tables = sorted(tables, key=lambda t: t.get('numDataRows', 0), reverse=True)
print(f"Got {len(tables)} tables")
takco.preview(tables)
Got 8 tables
[2]:
? | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
Rank | Region | _pgTitle | Accolade | Publication | Year | Certification | Ref. | |||
United Kingdom ( BPI ) | Imagine (John Lennon album) | Gold | ||||||||
United States ( RIAA ) | Imagine (John Lennon album) | 2× Platinum | ||||||||
United Kingdom ( BPI ) | Caribou (album) | Gold | ||||||||
United States ( RIAA ) | Caribou (album) | 2× Platinum | ||||||||
United Kingdom ( BPI ) | No Secrets (Carly Simon album) | Gold |
(44 more rows)
? | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | Peak chart positions | _pgTitle | Song | Peak chart positions | Peak chart positions | Peak chart positions | Peak chart positions | Peak chart positions | Peak chart positions | Peak chart positions | Peak chart positions | Album details | _pgTitle | Certifications | Peak chart positions | Peak chart positions | |
1970 | 25 | Help Me Make It Through the Night (Sammi Smith album) | "He's Everywhere" | — | — | — | — | ||||||||||
1970 | 1 | Help Me Make It Through the Night (Sammi Smith album) | "Help Me Make It Through the Night" | 3 | 4 | 1 | 8 | ||||||||||
1972 | 1 | The Happiest Girl in the Whole U.S.A. | "The Happiest Girl in the Whole U.S.A." | 7 | — | 16 | 11 | ||||||||||
1972 | 1 | The Happiest Girl in the Whole U.S.A. | "Funny Face" | 5 | 17 | 1 | 5 | ||||||||||
2015 | — | "An Evening I Will Not Forget" | — | — | — | — | — | — | Dermot Kennedy | Dermot Kennedy |
(18 more rows)
? | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_pgTitle | Format | Song | |||||||||||
The Best Damn Thing | Europe ( IFPI ) | Summaries | 1,000,000 * | Platinum | |||||||||
Denmark ( IFPI Denmark ) | 2× Platinum | Streaming | 3,600,000 ^ | Locked Out of Heaven | |||||||||
Spain ( PROMUSICAE ) | 2× Platinum | Streaming | 8,000,000 ^ | Locked Out of Heaven | |||||||||
Thriller (album) | Europe ( IFPI ) For sales in 2009 | 1× Platinum | Summaries | 1,000,000 * | |||||||||
Thriller (album) | Worldwide | — | Summaries | 66,000,000 |
(6 more rows)
? | 0 | 1 | 2 |
---|---|---|---|
So Fresh: The Hits of Spring 2014 | Chart (2014) | Australian ARIA Compilations Chart | |
So Fresh: The Hits of Spring 2014 | Peak position | 1 | |
So Fresh: The Hits of Spring 2014 | Chart (2014) | Australian ARIA Compilations Chart | |
So Fresh: The Hits of Spring 2014 | Peak position | 2 | |
So Fresh: The Hits of Winter 2014 | Chart (2014) | Australian ARIA Compilations Chart |
(3 more rows)
? | 0 | 1 | 2 |
---|---|---|---|
The Glory of Gershwin | Chart (1994) | UK Albums Chart | |
The Glory of Gershwin | Peak position | 2 | |
Paul Hardcastle (album) | Chart (1985) | UK Albums ( OCC ) | |
Paul Hardcastle (album) | Peak position | 53 | |
Led Zeppelin IV | Chart (2014) | Polish Albums ( ZPAV ) |
(1 more rows)
? | 0 | 1 | 2 |
---|---|---|---|
20 Golden Greats (Nat King Cole album) | Chart (1978) | UK Albums Chart | |
20 Golden Greats (Nat King Cole album) | Peak position | 1 | |
Gold Watch: 20 Golden Greats | Chart (2012) | Australian ARIA Albums Chart | |
Gold Watch: 20 Golden Greats | Peak position | 15 | |
Vault: Def Leppard Greatest Hits (1980–1995) | Chart (1996) | Norwegian Top 40 Albums |
(1 more rows)
? | 0 | 1 | 2 |
---|---|---|---|
So Fresh: The Hits of Autumn 2016 | Chart (2016) | Australia ( ARIA ) Top 20 Compilations | |
So Fresh: The Hits of Autumn 2016 | Position | 1 | |
Up from Down Under | Chart (1988) | Australian Albums ( ARIA ) | |
Up from Down Under | Peak position | 48 |
? | 0 | 1 | 2 |
---|---|---|---|
Don't Shoot Me I'm Only the Piano Player | Chart (1974) | U.S. Billboard Pop Albums | |
Don't Shoot Me I'm Only the Piano Player | Position | 67 | |
Don't Shoot Me I'm Only the Piano Player | Chart (1975) | Danish Album Charts | |
Don't Shoot Me I'm Only the Piano Player | Position | 18 |
[8]:
from takco.util import tableobj_to_dataframe

table = tables[0]
df = tableobj_to_dataframe(table)
for ci, col in enumerate(df.columns):
    print("Column", ci)
    # Header frequencies collected from the source tables, if any
    tophead = table['tableHeaders'][0][ci].get('freq')
    if tophead:
        print("Top headers:", dict(sorted(tophead.items(), key=lambda x: -x[1])))
    else:
        print("Header:", col)
    # Three most frequent cell values in this column
    print("Top values:", dict(df.iloc[:, ci].value_counts().head(3)))
    print()
Column 0
Top headers: {'Rank': 2}
Top values: {'': 10, '10': 3, '4': 3}
Column 1
Top headers: {'Region': 1}
Top values: {'': 39, 'United States ( RIAA )': 3, 'United Kingdom ( BPI )': 3}
Column 2
Top headers: {'_pgTitle': 3}
Top values: {'Kids See Ghosts (album)': 21, 'Everything That Happens Will Happen Today': 15, 'Science Fiction (Brand New album)': 3}
Column 3
Top headers: {'Accolade': 2}
Top values: {'': 10, 'The 50 Best Albums of 2008': 2, 'The 25 Best Albums of 2018': 2}
Column 4
Header: ('',)
Top values: {'': 47, 'Chart (2012)': 1, 'Peak position': 1}
Column 5
Top headers: {'Publication': 1, 'Publisher': 1}
Top values: {'': 10, 'Pitchfork': 3, 'Rolling Stone': 2}
Column 6
Top headers: {'Year': 1}
Top values: {'': 34, '2008': 10, '2009': 4}
Column 7
Header: ('',)
Top values: {'': 47, 'Sokka irti (song)': 2}
Column 8
Top headers: {'Certification': 1}
Top values: {'': 41, 'Gold': 5, '2× Platinum': 2}
Column 9
Top headers: {'Ref.': 1}
Top values: {'': 49}
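Many columns in this merged table are mostly empty, which is easier to see as a fraction than as a raw count. The sketch below is not part of takco: it copies the empty-cell counts from the output above and flags columns above a cut-off. The `threshold` value is a hypothetical choice for illustration, and column 2 is omitted because the empty string is not among its three most frequent values.

```python
# Empty-cell counts copied from the "Top values" output above.
# Column 2 is left out: '' is not among its three most frequent values.
n_rows = 49  # column 9 reports 49 empty cells; the preview shows 5 + 44 rows
empty_counts = {0: 10, 1: 39, 3: 10, 4: 47, 5: 10, 6: 34, 7: 47, 8: 41, 9: 49}

threshold = 0.9  # hypothetical cut-off, chosen only for this illustration

# Fraction of empty cells per column
empty_fraction = {col: count / n_rows for col, count in empty_counts.items()}
sparse_columns = [col for col, frac in empty_fraction.items() if frac >= threshold]

for col, frac in empty_fraction.items():
    flag = "  <- sparse" if col in sparse_columns else ""
    print(f"Column {col}: {frac:.0%} empty{flag}")
```

With these numbers, the flagged columns are 4, 7 and 9. Columns 4 and 7 are also the two columns without any header, so they are a reasonable place to start when checking whether a table was clustered with the wrong neighbours.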
[9]:
[t['tableIndex'] for t in table['partColAligns']]
[9]:
[230, 235, 269, 307]
[10]:
[t['partcol_global'] for t in table['partColAligns']]
[10]:
[{'1': 955, '2': 954, '8': 956},
{'0': 988, '2': 984, '3': 986, '5': 985, '6': 987},
{'0': 1126, '2': 1123, '3': 1125, '5': 1124, '9': 1127},
{'1': 1276, '4': 1275, '7': 1274}]
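The two outputs above can be combined to see which source tables contributed to each column of the merged table. The sketch below uses only the values printed above and plain Python. It assumes that each key of `partcol_global` is a column index in the merged table and each value is a global id of the source column aligned to it. That reading is an assumption, not something confirmed by the takco documentation, although it matches the header frequencies printed earlier (for example, column 2 has three `_pgTitle` headers).

```python
from collections import defaultdict

# Values copied from the outputs of cells [9] and [10] above.
table_indexes = [230, 235, 269, 307]
partcol_global = [
    {'1': 955, '2': 954, '8': 956},
    {'0': 988, '2': 984, '3': 986, '5': 985, '6': 987},
    {'0': 1126, '2': 1123, '3': 1125, '5': 1124, '9': 1127},
    {'1': 1276, '4': 1275, '7': 1274},
]

# Assumption: key = column index in the merged table,
# value = global id of the source column aligned to it.
sources_by_column = defaultdict(list)
for table_index, mapping in zip(table_indexes, partcol_global):
    for merged_col, global_id in mapping.items():
        sources_by_column[int(merged_col)].append((table_index, global_id))

for merged_col in sorted(sources_by_column):
    print(merged_col, sources_by_column[merged_col])
```

Under this reading, columns 4 and 7 receive values only from table 307, which would explain why they have no header and are almost entirely empty.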
[11]:
takco.preview( table['partColAligns'], ntables=None )
[11]:
(3 more rows)
? | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
∈ |
|
|
|
|
|
_pgTitle | Publisher | Accolade | Year | Rank | |
Everything That Happens Will Happen Today | ABC News | The 50 Best Albums of 2008 | 2009 | 46 | |
Everything That Happens Will Happen Today | AllMusic | AllMusic's Favorite Rock Albums of 2008 | 2008 | Unranked, out of 25 | |
Everything That Happens Will Happen Today | Amazon.com editors' picks | Amazon Music: Best of 2008 | 2009 | 62 | |
Everything That Happens Will Happen Today | The Buffalo News | Best Albums (2000–2010) | 2010 | Honorable mention | |
Everything That Happens Will Happen Today | Chicago Sun-Times ( Jim DeRogatis ) | The Best Albums of 2008 | 2008 | 2 |
(10 more rows)
? | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
∈ |
|
|
|
|
|
_pgTitle | Publication | Accolade | Rank | Ref. | |
Science Fiction (Brand New album) | Pitchfork | Reader's Poll: Top 50 Albums | 17 | ||
Science Fiction (Brand New album) | Pitchfork | Most Underrated Album | 5 | ||
Science Fiction (Brand New album) | Sputnikmusic | Staff's Top 50 Albums of 2017 | 11 | ||
Kids See Ghosts (album) | 411Mania | The Top 100 Albums of 2018 | 4 | ||
Kids See Ghosts (album) | AllHipHop | AllHipHop's 15 Best Hip-Hop Albums of 2018 | 10 |
(19 more rows)
? | 0 | 1 | 2 |
---|---|---|---|
∈ |
|
|
|
Sokka irti (song) | Chart (2012) | Finland ( The Official Finnish Singles Chart ) | |
Sokka irti (song) | Peak position | 3 |
[ ]: