Mapping Movements: Proximity¶
This notebook is part of the pressure cooker for Mapping movements: exploring the structure, vision and transformative potential of collections of movements for justice and sustainability, a project of the Utrecht University Copernicus Institute and Creative Coding Utrecht.
This notebook documents an exploration to rethink "distance" for maps of social movements, or initiatives. In the maps researched by the project, distance is usually presented as geographical distance, rendered by the placement of entries on a skewed world map (often using the Mercator projection, which preserves angles, not distances). The experiment here, however, is interested in finding new links between entries by deploying an off-the-shelf text embedding model known as Doc2Vec, possibly revealing new categorisations and clusters.
Many thanks to the exploration of Doc2Vec done by Marton Trencseni, on which many of these approaches are based.
1. Embedding the text ¶
In this first step, the descriptions of the initiatives are embedded using Doc2Vec. That is to say, in a series of steps the text of the descriptions is transformed into a vector, a point in a (latent) multidimensional space: an embedding.
# some requirements to load for this project
import csv
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq
from collections import defaultdict
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Then the entries are read from a CSV file, an export of the spreadsheet prepared by other participants in the pressure cooker.
with open('initiatives.csv') as fp:
initiatives = csv.DictReader(fp)
print("Fields:", initiatives.fieldnames)
initiatives = [i for i in initiatives]
print(f"{len(initiatives)} initiatives")
Then nltk's "punkt" (nltk is a toolkit for natural language processing) is used to parse the initiatives into a dict of tokenized documents. Tokenization basically cuts the text into separate words and punctuation marks, a preparatory step before embedding.
# Punkt is required:
nltk.download('punkt')
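As a quick illustration (on a made-up sentence, not an entry from the dataset), tokenization splits a string into words and punctuation:
# Illustration only: tokenizing a made-up sentence, not an entry from the dataset
word_tokenize("Community gardens reclaim unused urban land.")
# -> ['Community', 'gardens', 'reclaim', 'unused', 'urban', 'land', '.']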
Make tagged_initiatives: a dict indexed by the document's ID (its row in the CSV), where each value is a TaggedDocument of the initiative's title and description.
tagged_initiatives = {idx : TaggedDocument(word_tokenize(f"{initiative['Title']} {initiative['Text']}"), [idx]) for idx, initiative in enumerate(initiatives)}
This is then used to create the Doc2Vec model, using a vector_size of 20 dimensions (we only have a small set of texts, so a small dimensionality should still give plenty of variety).
model = Doc2Vec(tagged_initiatives.values(), vector_size=20, alpha=0.025, min_count=1, workers=16, epochs=100)
print(f"Created a Doc2Vec model with {model.corpus_count} entries, and a total of {model.corpus_total_words} words.")
2. Similar initiatives ¶
Now that we have a Doc2Vec model, we can use it to calculate a similarity score between the initiatives in our little dataset, based on a mathematical similarity of the words occurring in their descriptions.
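Under the hood, gensim's most_similar ranks entries by the cosine similarity between document vectors, $\mathrm{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert}$, which ranges from $-1$ (opposite) to $1$ (pointing in the same direction).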
def similar_initiatives(which: int, n: int | str = 3) -> list[tuple[int, str, float]]:
    """Based on the given index, return the `n` most similar initiatives.
    With n='all', return the raw similarity scores for every entry instead.
    Returns a list of tuples: (idx, title, similarity)
    """
    if n == 'all':
        return model.dv.most_similar(positive=[model.infer_vector(tagged_initiatives[which][0])], topn=None)
    results = model.dv.most_similar(positive=[model.infer_vector(tagged_initiatives[which][0])], topn=n+1)
    # drop the queried initiative itself, which is (almost always) its own best match
    results = [(idx, initiatives[idx]['Title'], score) for idx, score in results if idx != tagged_initiatives[which][1][0]]
    return results[:n]
For example, we can now find similar initiatives to the first entry of our spreadsheet:
idx = 0
print(f"Initiatives similar to '{initiatives[idx]['Title']}'.")
for sim_idx, title, score in similar_initiatives(idx, 5):
    print(f"- '{title}' (#{sim_idx}), has a similarity score of {score}")
These seem to make sense, as they're all about water. But digesting this entry by entry for all 41 initiatives would be hard.
Similarity matrix¶
A similarity matrix visualises all similarities in a 2D matrix: a dark red square means a score of 1, i.e. completely similar, while a white square means total dissimilarity. As the matrix plots the entries against themselves, the diagonal is 1.
import matplotlib.pyplot as plt
def plot_matrix(m):
plt.imshow(m, cmap='YlOrRd', interpolation='nearest')
plt.colorbar()
plt.show()
similarity_matrix = [similar_initiatives(idx, n='all') for idx in tagged_initiatives]
plot_matrix(similarity_matrix)
While this plot renders all the data, it is a bit hard to process. So, how can we understand these similarities, and how can we use them to find new alliances?
3. Find clusters ¶
k-means¶
A common approach to clustering entries is the k-means algorithm. It starts by picking $k$ random points (centroids) and iteratively moves them to find $k$ distinct clusters.
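Formally, k-means searches for centroids $\mu_1, \dots, \mu_k$ that minimise the total squared distance of each embedded initiative $x_i$ to its nearest centroid: $\sum_i \min_j \lVert x_i - \mu_j \rVert^2$.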
number_of_clusters = 7
vectors = [model.dv[i] for i in range(len(initiatives))]
centroids, _ = kmeans(vectors, number_of_clusters)
# assign each document vector to its nearest centroid, yielding a cluster id per document
doc_ids, _ = vq(vectors, centroids)
clusters = defaultdict(list)
for initiative, cluster_id in zip(initiatives, doc_ids):
clusters[cluster_id].append(initiative)
cluster_ids = sorted(list(clusters.keys()))
for cluster_id in cluster_ids:
print('\nCluster:', cluster_id)
for i in clusters[cluster_id]:
print(f"- {i['Title']}")
While this is certainly a good first stab at clustering (e.g. the first cluster contains some water projects but also other entries, and cluster 3 has projects on Palestine & Gaza), its major disadvantage is that one needs to specify a specific number of clusters $k$ up front. Thus, it doesn't start from the data.
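One rough way to get a feel for a reasonable $k$ (an exploratory aside, not part of the pressure-cooker analysis) is to plot the distortion that scipy's kmeans reports for a range of candidate values and look for an 'elbow' where it stops improving much:
# Exploratory aside: distortion (mean distance to the nearest centroid) for a range of k values
ks = list(range(2, 12))
distortions = [kmeans(vectors, k)[1] for k in ks]
plt.plot(ks, distortions, marker='o')
plt.xlabel('k')
plt.ylabel('distortion')
plt.show()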
Truncated similarity¶
Another way to find alliances would be not to use the embeddings directly to calculate clusters, but to start from the calculated similarities.
A first step could be to threshold the similarity matrix and link up those entries whose similarity is above the given threshold. This limits the number of matches between entries. A disadvantage of this approach becomes apparent in the network graph: the clusters that emerge are very uneven. In our case most items end up in a single cluster, while some entries have no matches at all. Increasing the threshold creates even more singletons, while decreasing it merges everything into one web, which pretty much defeats the purpose.
threshold = .55
similarity_matrix_truncated = [[y if y > threshold else 0 for y in x] for x in similarity_matrix]
for i in range(len(similarity_matrix_truncated)):
similarity_matrix_truncated[i][i] = 0
plot_matrix(similarity_matrix_truncated)
def plot_graph(m):
labels = list(range(len(m)))
df = pd.DataFrame(m, index=labels, columns=labels)
G = nx.from_pandas_adjacency(df)
plt.figure(figsize=(15, 15))
nx.draw_networkx(G, nx.spring_layout(G), node_size=0, arrows=False, edge_color='lightgray', font_size=8)
plt.show()
plot_graph(similarity_matrix_truncated)
Similarity top-n¶
Another approach, suggested by Trencseni in his blog post, is to not simply truncate the similarity matrix, but to build a web by finding the $n$ most similar entries for each initiative and using these as links. The linked-up entries then act as clusters.
# another approach, to avoid singletons:
# for each initiative, get the top n=3 most similar
def similarity_matrix_top_n(n=3):
m = np.zeros((len(initiatives), len(initiatives)))
for idx, initiative in enumerate(initiatives):
sp = similar_initiatives(idx, n)
idxs = [p[0] for p in sp]
# idxs = [tagged_initiatives[p[0]][1][0] for p in sp]
for k, j in enumerate(idxs):
m[idx][j] = 1 #sp[k][2]
return m
The question then is which $n$ should be used. In our case, starting with $n=3$ all initiatives still merge into one giant cluster, and the same happens with $n=2$. With $n=1$, however, clusters emerge: some of only 2 items (each having the other as its most similar entry), while others link more items together.
similarity_matrix_truncated = similarity_matrix_top_n(n=3)
plot_matrix(similarity_matrix_truncated)
plot_graph(similarity_matrix_truncated)
similarity_matrix_truncated = similarity_matrix_top_n(n=2)
plot_matrix(similarity_matrix_truncated)
plot_graph(similarity_matrix_truncated)
similarity_matrix_truncated = similarity_matrix_top_n(n=1)
plot_matrix(similarity_matrix_truncated)
plot_graph(similarity_matrix_truncated)
Instead of a linked plot, this can be rendered as a list of clusters:
def matrix2clusters(m):
connected_subgraphs = set([frozenset([i]) for i in list(range(len(m)))])
def update_subgraphs(i, j):
join_targets = [g for g in connected_subgraphs if i in g or j in g]
new_subgraph = frozenset({i for g in join_targets for i in g})
for g in join_targets:
connected_subgraphs.remove(g)
connected_subgraphs.add(new_subgraph)
for i in range(len(m)):
for j in range(len(m)):
if m[i][j] > 0:
update_subgraphs(i, j)
return connected_subgraphs
def print_clusters(connected_subgraphs):
for g in connected_subgraphs:
if len(g) == 1:
for i in g:
print(f"Singleton: {initiatives[i]['Title']} (#{i})")
for idx, g in enumerate(connected_subgraphs):
if len(g) > 1:
print(f'\nCluster {idx}:')
for i in g:
print(f"- {initiatives[i]['Title']} (#{i})")
similarity_clusters = matrix2clusters(similarity_matrix_truncated)
print_clusters(similarity_clusters)
What emerges is a variety of clusters, capturing a heterogeneity of characteristics. For example, cluster 8 contains two Barcelona-related initiatives, whereas the last cluster, number 12, seems to capture housing/urban related projects.
"""Prepare a list of colors for each initiative based on their cluster. This can be used in the next step."""
colors = np.zeros(len(initiatives))
for cluster, entries in enumerate(similarity_clusters):
for entry in entries:
colors[entry] = cluster
4. Rendering embedding space: alternative maps of movement ¶
It can be hard to make sense of the mathematical projections of the initiatives. The matrix rendering above is basically illegible, or at least hard to navigate. The network renderings are much more intuitive, but really only start from the calculated similarities: they are 'force' graphs that discard any sense of the configuration of the embedding space. This becomes even more apparent when rendering the embedding space in 2D.
One way to do this is to reduce the 20 dimensions to just 2 using PCA. If we run a PCA on the 41 embedded initiatives and plot the first 2 components, it is clear that the clusters that emerged in the previous step (indicated by the color of each point) are really different from their placement in this kind of visualisation.
pca = PCA(n_components=2)
pca_projected = pca.fit_transform(vectors)
f = plt.figure(figsize=(14, 14))
ax = plt.subplot(aspect='equal')
ax.scatter(pca_projected[:,0],pca_projected[:,1], c=colors)
for idx, ini in enumerate(initiatives):
x, y = pca_projected[idx]
# add label to a point
ax.text(x, y, ini['Title'][:18]+'...', fontsize=8, alpha=.5)
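A quick check, not part of the original analysis, of how much of the 20-dimensional structure the first two principal components actually retain:
# explained variance ratio of the first two principal components (and their sum)
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())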
Another algorithm that can be used to transform the space is t-SNE, which unlike PCA is not deterministic, i.e. it tends to produce different output on each run. This algorithm is optimised to render similar items close together and to increase the distance between less similar items, thus suggesting similar clusters. Looking at the output here, however, it seems not very successful at that with the small dataset used. Nevertheless, some clusters we saw earlier reappear in the image below: for example, the top of the image contains water-related projects, and some citizen-participation initiatives appear together in the center of the mapping.
# Random state.
RS = 20150101
tsne_projected = TSNE(random_state=RS, perplexity=30).fit_transform(np.vstack(vectors))
f = plt.figure(figsize=(14,14))
ax = plt.subplot(aspect='equal')
ax.scatter(tsne_projected[:,0], tsne_projected[:,1], c=colors)
ax.axis('off')
ax.axis('tight')
for idx, ini in enumerate(initiatives):
x, y = tsne_projected[idx]
# add label to a point
ax.text(x, y, ini['Title'][:15]+'...', fontsize=8, alpha=.5)
5. A situated map ¶
What if we rethink a single rendition of distance and reframe it instead as a situated similarity for each project? What if we draw a 'map' of the projects based on that? Such a map is open to any possible coalition for each initiative.
sim = np.clip(similarity_matrix, 0, 1)
import json
# to give a distribution on the radius, let's try to reuse the pca's first component:
pca1 = pca_projected[:, 0]
pca1_normalised = (pca1 - np.min(pca1)) / (np.max(pca1) - np.min(pca1))
data = {
"initiatives": initiatives,
'similarities': sim.tolist(),
'pca_1': pca1_normalised.tolist()
}
with open("./similarities.json", "w") as fp:
json.dump(data, fp)
print(f"Written {len(initiatives)} initiatives to similarities.json.")
Now explore the Similarities map.