Sentence Similarity with BERT and Flask

Measuring how close two sentences are in meaning underpins many Natural Language Processing (NLP) applications, from search engines and chatbots to sentiment analysis and text summarization. This article walks through a practical implementation: a Flask web application that uses the BERT model to generate sentence embeddings and calculate cosine similarity scores between them.

Why BERT?

BERT (Bidirectional Encoder Representations from Transformers) captures contextual relationships within text. Unlike static word embeddings such as word2vec or GloVe, which assign a single vector to each word, BERT produces embeddings that depend on the surrounding words. For example, the word “bank” receives a different vector in “river bank” than in “bank account”. This context sensitivity gives a better representation of sentence meaning.

Building the Flask Application

We’ll create a simple Flask application that allows users to input two sentences and receive their similarity score and embeddings.

1. Setting up the Environment:

First, ensure you have Python installed. Then, install the necessary libraries:

pip install flask transformers torch scikit-learn

2. The Python Code (app.py):

Here's the backend Flask application code:

from flask import Flask, request, jsonify, render_template_string
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

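def get_embedding(sentence):
    # Tokenize a single sentence, truncating it to at most 128 tokens
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True, max_length=128)
    # Gradients are not needed for inference, so skip tracking them
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into a single (1, hidden_size) sentence vector
    return outputs.last_hidden_state.mean(dim=1).numpy()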

@app.route('/')
def index():
    return render_template_string('''
        <!DOCTYPE html>
        <html lang="en">
        <head>
            <meta charset="UTF-8">
            <meta name="viewport" content="width=device-width, initial-scale=1.0">
            <title>BERT Sentence Similarity</title>
            <style>
                /* CSS styles as defined in this HTML file */
            </style>
        </head>
        <body>
            <div class="container">
                <h1>BERT Sentence Similarity Calculator</h1>
                <form id="similarity-form">
                    <label for="sentence1">Sentence 1:</label>
                    <textarea id="sentence1" name="sentence1" rows="4" required></textarea>

                    <label for="sentence2">Sentence 2:</label>
                    <textarea id="sentence2" name="sentence2" rows="4" required></textarea>

                    <button type="submit">Calculate Similarity</button>
                    <button type="button" id="clear-button">Clear</button>
                </form>

                <div id="results-section" style="display: none;">
                    <h2>Results</h2>
                    <p id="similarity-score"></p>

                    <h3>Sentence 1 Embedding:</h3>
                    <div class="embedding-table-container">
                        <table id="embedding1-table">
                            <thead>
                                <tr><th>Index</th><th>Value</th></tr>
                            </thead>
                            <tbody></tbody>
                        </table>
                    </div>

                    <h3>Sentence 2 Embedding:</h3>
                    <div class="embedding-table-container">
                        <table id="embedding2-table">
                            <thead>
                                <tr><th>Index</th><th>Value</th></tr>
                            </thead>
                            <tbody></tbody>
                        </table>
                    </div>
                </div>

                <div class="collapsible-header" id="info-header">How it Works</div>
                <div class="collapsible-content" id="info-content">
                    <h3>Understanding the Application</h3>
                    <p>This application uses the BERT (Bidirectional Encoder Representations from Transformers) model to convert sentences into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the sentences.</p>
                    <p>The cosine similarity is then calculated between these two sentence embeddings. Cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1. A score close to 1 indicates high similarity, while a score near 0 or below indicates low similarity.</p>
                    <h3>Key Components:</h3>
                    <ul>
                        <li><strong>BERT Model &amp; Tokenizer:</strong> Loaded from the Hugging Face Transformers library to process text.</li>
                        <li><strong>get_embedding(sentence) function:</strong> Converts a given sentence into its BERT embedding.</li>
                        <li><strong>Flask Routes:</strong> Handles web requests for the main page (/), embedding calculation (/embed), and similarity calculation (/similarity).</li>
                        <li><strong>Cosine Similarity:</strong> Used from sklearn.metrics.pairwise to quantify the semantic similarity between two embeddings.</li>
                    </ul>
                </div>
            </div>

            <script>
                document.getElementById('similarity-form').addEventListener('submit', async function(event) {
                    event.preventDefault();
                    const sentence1 = document.getElementById('sentence1').value;
                    const sentence2 = document.getElementById('sentence2').value;

                    const response = await fetch('/similarity', {
                        method: 'POST',
                        headers: {
                            'Content-Type': 'application/json'
                        },
                        body: JSON.stringify({ sentence1, sentence2 })
                    });

                    const data = await response.json();

                    document.getElementById('similarity-score').textContent = `Similarity Score: ${data.similarity.toFixed(4)}`;

                    const embedding1TableBody = document.querySelector('#embedding1-table tbody');
                    embedding1TableBody.innerHTML = '';
                    data.embedding1[0].forEach((value, index) => {
                        const row = embedding1TableBody.insertRow();
                        const indexCell = row.insertCell();
                        const valueCell = row.insertCell();
                        indexCell.textContent = index;
                        valueCell.textContent = value.toFixed(6);
                    });

                    const embedding2TableBody = document.querySelector('#embedding2-table tbody');
                    embedding2TableBody.innerHTML = '';
                    data.embedding2[0].forEach((value, index) => {
                        const row = embedding2TableBody.insertRow();
                        const indexCell = row.insertCell();
                        const valueCell = row.insertCell();
                        indexCell.textContent = index;
                        valueCell.textContent = value.toFixed(6);
                    });

                    document.getElementById('results-section').style.display = 'block';
                });

                document.getElementById('clear-button').addEventListener('click', function() {
                    document.getElementById('sentence1').value = '';
                    document.getElementById('sentence2').value = '';
                    document.getElementById('results-section').style.display = 'none';
                    document.getElementById('similarity-score').textContent = '';
                    document.querySelector('#embedding1-table tbody').innerHTML = '';
                    document.querySelector('#embedding2-table tbody').innerHTML = '';
                });

                document.getElementById('info-header').addEventListener('click', function() {
                    const content = document.getElementById('info-content');
                    if (content.style.display === 'block') {
                        content.style.display = 'none';
                    } else {
                        content.style.display = 'block';
                    }
                });
            </script>
        </body>
        </html>
    ''')

@app.route('/embed', methods=['POST'])
def embed():
    data = request.json
    sentence = data.get('sentence', '')
    embedding = get_embedding(sentence)
    return jsonify({'embedding': embedding.tolist()})

@app.route('/similarity', methods=['POST'])
def similarity():
    data = request.json
    sentence1 = data.get('sentence1', '')
    sentence2 = data.get('sentence2', '')
    embedding1 = get_embedding(sentence1)
    embedding2 = get_embedding(sentence2)
    similarity_score = cosine_similarity(embedding1, embedding2)[0][0]
    return jsonify({'similarity': float(similarity_score), 'embedding1': embedding1.tolist(), 'embedding2': embedding2.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

3. Understanding the Code:

BERT Model and Tokenizer: We load the pre-trained bert-base-uncased model and its tokenizer once at startup, so every request reuses them.
get_embedding(sentence): This function tokenizes the input sentence (truncating it to at most 128 tokens), passes it through the BERT model, and returns the mean of the last hidden state across all tokens as the sentence embedding. For bert-base-uncased this is a 768-dimensional vector.
Flask Routes:
- /: Renders the HTML form for inputting sentences.
- /embed: Receives a sentence via POST request and returns its embedding.
- /similarity: Receives two sentences, calculates their embeddings, computes the cosine similarity, and returns the score along with both embeddings.
Cosine Similarity: We use sklearn.metrics.pairwise.cosine_similarity to calculate the similarity score between the two sentence embeddings.
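
To make the pooling and similarity steps concrete, here is a minimal sketch in plain Python of the arithmetic that get_embedding and cosine_similarity perform. The toy token vectors and the helper names mean_pool and cosine are assumptions made for illustration; the real application works on 768-dimensional tensors produced by BERT.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "token vectors" (hypothetical values, not real BERT output)
sentence_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 2.0, 1.0]]

emb_a = mean_pool(sentence_a)  # [2.0, 0.0, 1.0]
emb_b = mean_pool(sentence_b)  # [2.0, 1.0, 1.0]
print(round(cosine(emb_a, emb_b), 4))  # 0.9129
```

Because the score depends only on the angle between the vectors, scaling either embedding by a positive constant leaves it unchanged.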

4. Running the Application:

Save the code as app.py and run it from your terminal:

python app.py

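Open your browser and navigate to http://localhost:5000. (The 0.0.0.0 in app.run tells the server to accept connections on all network interfaces; it is not an address to browse to.) You'll see the form where you can input two sentences and get their similarity score and embeddings.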
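
Because the /similarity route accepts and returns JSON, it can also be called without the browser. The sketch below uses only the standard library. The helper name build_similarity_request and the default base URL are assumptions for illustration, and the actual network call is left commented out because it needs the server to be running.

```python
import json
import urllib.request

def build_similarity_request(sentence1, sentence2, base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for the /similarity route."""
    payload = json.dumps({"sentence1": sentence1, "sentence2": sentence2}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/similarity",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_similarity_request("A man is playing a guitar.", "Someone plays an instrument.")

# With the Flask app running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
#     print(result["similarity"])
```

The response has the same three keys the page's JavaScript reads: similarity, embedding1, and embedding2.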

5. HTML Structure and Functionality

The HTML portion of the code provides a user-friendly interface. It includes:

Two text areas for the sentences.
A “Calculate Similarity” button to trigger the similarity calculation.
A “Clear” button to reset the form and hide the results.
A results section that shows the similarity score and, in two tables, the values of each embedding.
A collapsible section that explains how the application works.
JavaScript to handle the form submission, clear the form, and toggle the collapsible section.

6. Practical Applications:

This application can be extended for various NLP tasks:

Semantic Search: Improve search results by ranking documents based on semantic similarity to the query.
Question Answering: Find answers to questions by comparing the question’s embedding with the embeddings of candidate answers.
Text Clustering: Group similar documents or sentences together.
Paraphrase Detection: Identify sentences with similar meanings.
Chatbots: Enhance chatbot responses by understanding the semantic intent of user input.
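
As one illustration, semantic search reduces to scoring every document embedding against the query embedding and sorting. The sketch below assumes the embeddings have already been computed (for example with get_embedding) and uses hypothetical 2-dimensional vectors; the function name rank_by_similarity is likewise an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-d embeddings; real ones would be 768-dimensional BERT vectors
docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [0.0, 1.0],
}
query = [2.0, 1.0]

ranking = rank_by_similarity(query, docs)
print([doc_id for doc_id, _ in ranking])  # ['doc_b', 'doc_a', 'doc_c']
```

For more than a few thousand documents, it is usually better to compute the document embeddings once and store them than to re-run BERT on every query.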

7. Further Improvements:

Error Handling: Add error handling for invalid input or model loading issues.
Larger Datasets: Integrate with larger datasets for more complex similarity calculations.
Fine-tuning: Fine-tune the BERT model on specific tasks for improved accuracy.
Deployment: Deploy the Flask application to a cloud platform for wider accessibility.
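
For the error-handling item, one possible starting point is to validate the JSON payload before it reaches the model. This is a sketch under stated assumptions: the function name validate_similarity_payload, the error messages, and the 1000-character limit are all hypothetical choices, not part of the original application.

```python
def validate_similarity_payload(data, max_chars=1000):
    """Check a /similarity payload.

    Returns (sentences, error): sentences is a (sentence1, sentence2) tuple
    when the payload is valid, otherwise None with an error message.
    """
    if not isinstance(data, dict):
        return None, "Request body must be a JSON object."
    sentences = []
    for key in ("sentence1", "sentence2"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None, f"'{key}' must be a non-empty string."
        if len(value) > max_chars:  # assumed limit, adjust as needed
            return None, f"'{key}' exceeds {max_chars} characters."
        sentences.append(value.strip())
    return tuple(sentences), None

print(validate_similarity_payload({"sentence1": "Hello", "sentence2": ""}))
```

Inside the similarity route, the payload could be read with request.get_json(silent=True), passed to this function, and answered with jsonify({'error': error}) and a 400 status code whenever error is not None.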

By combining the power of BERT with the simplicity of Flask, you can easily build practical NLP applications that leverage sentence similarity. This project serves as a foundation for exploring more advanced NLP techniques and building innovative solutions.