OCR to Text Summary System
Prince Singh – Capstone Fall 2023
Overview
This project explores the integration of EasyOCR and Pegasus models for converting visual text into concise summaries. It investigates two distinct approaches: a sequential pipeline and an enclosed (integrated) model system. The objective is to evaluate how effectively each methodology extracts and summarizes text from images.
Integrated vs. Sequential Processing
Two processing methodologies are examined:
- Sequential Processing: A two-step process where text extraction and summarization are conducted sequentially.
- Integrated Processing: A custom neural network that combines the OCR and summarization tasks in a single workflow (referred to below as the enclosed model).
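The sequential approach can be illustrated with a small sketch. The pipeline simply chains an OCR step into a summarization step; here `ocr_fn` and `summarize_fn` are placeholder callables standing in for EasyOCR and Pegasus (hypothetical wiring for illustration, not the project's actual code):

```python
from typing import Callable, List

def run_sequential(image_path: str,
                   ocr_fn: Callable[[str], List[str]],
                   summarize_fn: Callable[[str], str]) -> str:
    """Sequential processing: extract the text first, then summarize it."""
    lines = ocr_fn(image_path)     # step 1: OCR returns recognized text lines
    document = " ".join(lines)     # join the extracted lines into one document
    return summarize_fn(document)  # step 2: abstractive summarization

# In the real pipeline, ocr_fn would wrap easyocr.Reader.readtext(..., detail=0)
# and summarize_fn would wrap a Pegasus generate() + decode call.
```

The enclosed model, by contrast, has no such clean seam: OCR features feed the summarizer directly inside one network, which is what made its integration harder (see Challenges below).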
Experimental Setup
Data Preparation
The multi-news dataset was used: text documents were rendered into images, each paired with its pre-existing reference summary, to mimic real-world text summarization scenarios. This method was used to build the training and validation datasets.
The following code was used to convert the text into images:
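The original code block is not reproduced here; the sketch below shows one way to render a text document onto an image with Pillow. The image size, line wrapping, and output file name are illustrative assumptions, not the project's actual settings:

```python
import textwrap
from PIL import Image, ImageDraw

def text_to_image(text: str, out_path: str,
                  width: int = 800, line_chars: int = 80) -> None:
    """Render a text document onto a white image, wrapping long lines."""
    lines = textwrap.wrap(text, width=line_chars)
    line_height = 15
    height = line_height * (len(lines) + 2)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        # Default PIL bitmap font; a TrueType font could be loaded instead.
        draw.text((10, 10 + i * line_height), line, fill="black")
    img.save(out_path)

text_to_image("An example document to be rendered for OCR.", "sample_000.png")
```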
The following code was used to pull the corresponding summaries:
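Again as a sketch, since the original block is not reproduced: the reference summaries can be pulled from the Hugging Face `multi_news` dataset and written out alongside the rendered images. The helper names and file-naming scheme are assumptions for illustration:

```python
from pathlib import Path

def save_summaries(summaries, out_dir: str) -> list:
    """Write each reference summary to its own numbered text file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, summary in enumerate(summaries):
        p = out / f"summary_{i:03d}.txt"
        p.write_text(summary, encoding="utf-8")
        paths.append(p)
    return paths

def fetch_multi_news_summaries(n: int = 200):
    """Pull the first n reference summaries from multi_news.
    Requires the `datasets` package; downloads the corpus on first use."""
    from datasets import load_dataset
    split = load_dataset("multi_news", split=f"train[:{n}]")
    return split["summary"]
```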
Model Selection
OCR Selection: EasyOCR, with its CRNN architecture, was chosen for its ease of use and extensive documentation.
Summarization Selection: The Pegasus model was selected for summarization due to its effectiveness in generating concise summaries.
Training and Evaluation
Training Setup: The training and validation sets were split in a 4:3 ratio, with 200 samples for training and 150 for validation. Training ran for 25 epochs with a batch size of 15.
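The 4:3 split above (200 train / 150 validation) can be reproduced with a simple shuffle-and-slice; this is a sketch, as the actual split code is not shown in this write-up:

```python
import random

def train_val_split(samples, n_train: int = 200, n_val: int = 150, seed: int = 42):
    """Shuffle the samples and split them 4:3 into train and validation sets."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]

train_set, val_set = train_val_split(list(range(350)))
```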
Evaluation Metrics: The models’ performance was evaluated on a sample of 50 images using BERTScore, focusing on the accuracy of text extraction and the quality of the summaries (the F1 score). A script was used to calculate the average F1 score for each model.
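The averaging step can be sketched as follows. The `bert-score` package provides per-example precision, recall, and F1 between generated and reference summaries; the averaging helper below matches the "average F1" described above (function names and structure are assumptions, not the project's actual script):

```python
def average_f1(f1_scores) -> float:
    """Average the per-example BERTScore F1 values for one model."""
    scores = list(f1_scores)
    return sum(scores) / len(scores)

def bertscore_f1(candidates, references, lang: str = "en"):
    """Compute per-example BERTScore F1 between generated and reference
    summaries. Requires the `bert-score` package (downloads a model)."""
    from bert_score import score
    _, _, f1 = score(candidates, references, lang=lang)
    return f1.tolist()
```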
Results and Discussion
Results
Both the training and validation loss values decrease over time, indicating that the model is learning and improving its predictions as it processes more data over successive epochs.

The training loss starts at 5.5284 in the first epoch and decreases consistently to 0.8207 by the 25th epoch. This consistent decrease is a good sign, showing that the model is effectively learning from the training data.

The validation loss begins at 5.5352 and also decreases over time, reaching 2.3586 by the 25th epoch. The validation loss is higher than the training loss, which is common, as a model is typically better at predicting data it has seen (training data) than new data (validation data). There is a noticeable gap between the training and validation losses, which can indicate overfitting: the model performs well on the training data but less so on unseen data. However, since the validation loss is also decreasing, the model appears to still be generalizing reasonably well.
Sequential Model (F1 Score: 0.87192): This model has a high F1 score, close to 1. This indicates that it has a strong balance of precision and recall. In other words, it is effectively identifying relevant information (high recall) and not including much irrelevant information (high precision) in its summaries.
Enclosed Model (F1 Score: 0.67618): The enclosed model has a lower F1 score compared to the sequential model. This suggests that it is less effective at summarization, either missing relevant information (lower recall), including more irrelevant information (lower precision), or both.
Challenges Encountered
- Dataset Limitations: The absence of a comprehensive dataset for training and evaluation posed a significant challenge.
- Integration Difficulties: Integrating the post-processing steps of EasyOCR with Pegasus’ summarization process proved complex, especially within the integrated model approach.
- Summary Quality: The limited dataset adversely affected the quality of the summaries.
Conclusion
While the sequential model demonstrated superior performance in summarization tasks, the enclosed model, despite its potential, faces significant challenges that affect its effectiveness. These challenges include dataset limitations, integration complexities, and resultant impacts on summary quality. To enhance the performance of the enclosed model, addressing these challenges is crucial. This might involve expanding and diversifying the dataset, refining the integration process, and implementing additional optimizations to improve its precision and recall. Overall, the sequential model stands out as the more reliable choice for current summarization needs, but with targeted improvements, the enclosed model could also become a viable alternative.