The STAS System is a modular framework designed to facilitate the iterative annotation process for machine learning tasks such as text classification and sequence labeling. This system automates the annotation process by allowing machine learning models to label data, which can then be reviewed and validated by human annotators. The system provides a flexible architecture for managing sample selection, annotation, stopping conditions, and iterative fine-tuning of models.
With the inclusion of a Streamlit-based graphical user interface (GUI), the system allows users to interact with the application, manage the annotation process, validate annotations, and upload datasets through an intuitive web interface.
- Iterative Annotation: Automates the process of fine-tuning models and annotating data iteratively, ensuring continual model improvement.
- Configurable Stopping Conditions: Allows you to define custom conditions to stop the annotation process based on metrics such as the acceptance rate or other criteria.
- Flexible Annotation Model: Supports both text classification and sequence labeling annotation tasks.
- Sample Selection: Automatically selects samples for annotation using different selection strategies (e.g., random selection).
- Streamlit UI: Provides a simple and interactive web interface for annotators to log in, manage annotations, validate sample labels, and upload datasets.
- Installation
- Usage
- System Architecture
- Components
- Extending the Annotation System
- Contributing
- License
To install and use the annotation system, follow these steps:
- Clone the repository:

      git clone https://github.com/glasscar46/stas.git
      cd stas

- Install dependencies. You can install the required dependencies using pip:

      pip install -r requirements.txt

- Install Streamlit. Since the system uses Streamlit for the GUI, make sure you have Streamlit installed:

      pip install streamlit

- Configure the system. The system is configured through a config.yaml file. You need to define various parameters, such as the sample class to use, the stopping conditions, and other model configurations.
To run the Streamlit web interface, use the following command:

    streamlit run api/ui.py

This will start the Streamlit app and open it in your browser, where you can interact with the system. The GUI allows you to log in, view annotations, validate them, and upload datasets directly.
The system can be configured through a config.yaml file. The configuration includes various options for managing the annotation process, sample selection, and stopping conditions.
Here’s an example of the configuration:
    database-name: 'stas'
    connection-string: 'mongodb://localhost:27017'
    max-iterations: 10
    sample-size: 100
    sampleClass: 'SequenceToSequenceSample'
    modelName: 'NERModel'
    Selector: 'RandomSampleSelector'
    master-email: 'admin@email.com'
    master-password: 'admin@email.com'
    secret-key: 'example-secretkey'
    metrics:
      - None

The system supports user authentication through the Streamlit interface. Users must log in to manage annotations or upload datasets. The login process is based on the user's credentials (username and password), which are stored in the database.
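Before the application starts, a loaded configuration could be checked for obvious mistakes. The following is a minimal sketch that assumes the YAML example above has already been parsed into a plain dictionary; the function name `validate_config` and the choice of which keys count as required are assumptions for illustration, not part of the project.

```python
# Assumption: these four keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = ("database-name", "connection-string", "max-iterations", "sample-size")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a loaded configuration dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    for key in ("max-iterations", "sample-size"):
        if key not in config:
            continue
        value = config[key]
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    return problems
```

An empty list means the configuration passed these basic checks.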
You can upload datasets in CSV or JSON format directly through the Streamlit interface. This can be done on the "Upload Dataset" page in the GUI. The dataset can be marked as annotated (part of a golden set) or not annotated.
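As an illustration of what handling an uploaded file might involve, the sketch below parses CSV or JSON text into a list of records and tags each one as annotated or not. It uses only the Python standard library; the function name `parse_dataset` and the `annotated` field are hypothetical and do not describe the project's actual loader.

```python
import csv
import io
import json


def parse_dataset(content: str, file_format: str, annotated: bool = False) -> list[dict]:
    """Parse uploaded CSV or JSON text into a list of record dicts."""
    fmt = file_format.lower()
    if fmt == "csv":
        # DictReader uses the first row as column names.
        records = [dict(row) for row in csv.DictReader(io.StringIO(content))]
    elif fmt == "json":
        data = json.loads(content)
        # Accept either a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        records = [dict(item) for item in items]
    else:
        raise ValueError(f"Unsupported format: {file_format}")
    # Tag each record so golden-set data can be told apart later.
    for record in records:
        record["annotated"] = annotated
    return records
```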
The architecture of the Annotation System follows a modular design, with key components:
- Annotation controller:
  - Manages the overall annotation process.
  - Coordinates the iterative fine-tuning of models and annotation of data.
  - Evaluates stopping conditions to determine when the annotation process should stop.
- Sample selectors:
  - Responsible for selecting samples for annotation based on a specific strategy.
  - Examples include random selection (RandomSampleSelector).
- Samples:
  - Define the structure of the samples to be annotated.
  - Examples include TextClassificationSample (for text classification) and SequenceToSequenceSample (for sequence labeling).
- Annotations:
  - Define the annotation structure for different tasks.
  - Examples include ClassificationAnnotation (for text classification) and SequenceLabelAnnotation (for sequence labeling).
- Models:
  - Models used for annotation are required to implement the IModel interface.
  - Models must have the ability to fine-tune on annotated data and generate annotations for new samples.
- Stopping conditions:
  - Define conditions for stopping the iterative annotation process.
  - Examples include AcceptanceRateCondition and other custom conditions.
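How these components might fit together in one run can be sketched as a simple loop. This is an illustration under stated assumptions, not the project's controller: the function `run_annotation_loop`, its parameters, and the use of plain callables in place of the model, the human validator, and the stopping condition are all hypothetical.

```python
import random
from typing import Callable


def run_annotation_loop(
    pool: list[str],
    annotate: Callable[[str], str],
    validate: Callable[[str, str], bool],
    should_stop: Callable[[float], bool],
    sample_size: int = 2,
    max_iterations: int = 10,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Select, annotate, validate, and repeat until a stop condition holds."""
    rng = random.Random(seed)
    pending = list(pool)
    accepted: list[tuple[str, str]] = []
    for _ in range(max_iterations):
        if not pending:
            break
        # Sample selection: a random strategy, in the spirit of RandomSampleSelector.
        batch = rng.sample(pending, min(sample_size, len(pending)))
        for sample in batch:
            pending.remove(sample)  # In this sketch each sample is reviewed once.
        # The model proposes labels, then a reviewer accepts or rejects each one.
        proposals = [(sample, annotate(sample)) for sample in batch]
        approved = [(s, label) for s, label in proposals if validate(s, label)]
        accepted.extend(approved)
        # The stopping condition sees this iteration's acceptance rate.
        if should_stop(len(approved) / len(batch)):
            break
        # A real system would fine-tune the model on the accepted data here.
    return accepted
```

Passing the stopping rule as a callable keeps the loop independent of any particular condition, which mirrors the modular design described above.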
- TextClassificationSample: A sample for text classification tasks, containing a text and a classification annotation (ClassificationAnnotation).
- SequenceToSequenceSample: A sample for sequence labeling tasks, containing a text and a sequence of span-based labels (SequenceLabelAnnotation).
- ClassificationAnnotation: Represents annotations for text classification tasks. The label is a single value (e.g., "positive", "negative").
- SequenceLabelAnnotation: Represents annotations for sequence labeling tasks. The label is a list of tuples, where each tuple contains a start index, an end index, and a label.
- AcceptanceRateCondition: A stopping condition based on the acceptance rate of annotations. The process stops when the acceptance rate meets or exceeds a given threshold.
- RandomSampleSelector: A sample selector that selects samples randomly from the pool of unannotated samples.
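The acceptance-rate rule described above for AcceptanceRateCondition can be illustrated with a few lines of arithmetic. The sketch assumes each review decision is recorded as a boolean (accepted or rejected); the function names and the default threshold of 0.9 are hypothetical.

```python
def acceptance_rate(decisions: list[bool]) -> float:
    """Fraction of reviewed annotations that annotators accepted."""
    if not decisions:
        return 0.0  # Nothing reviewed yet, so there is no evidence to stop.
    return sum(decisions) / len(decisions)


def should_stop(decisions: list[bool], threshold: float = 0.9) -> bool:
    """Stop once the acceptance rate meets or exceeds the threshold."""
    return acceptance_rate(decisions) >= threshold
```

For example, three accepted annotations out of four give a rate of 0.75, which would not satisfy a threshold of 0.9.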
The models that annotate samples must implement the IModel interface, which defines two methods:
- finetune: Fine-tune the model using annotated data.
- generateAnnotation: Generate annotations for a given set of samples.
The Streamlit GUI serves as the front-end interface for users (annotators, administrators) to interact with the annotation system. It provides an intuitive and user-friendly experience for managing the annotation process, validating annotations, and uploading datasets.
- Login Page: Allows annotators to log in using their credentials.
- Manage Process: After logging in, users are presented with a dashboard where they can:
- Start the annotation process.
- View the current iteration details, such as the number of pending samples and the status of the process.
- Annotation Validation: Users can view and validate annotations (accept or reject them). Annotations are displayed based on their type (e.g., sequence labeling or classification).
- Dataset Upload: Annotators can upload new datasets in CSV or JSON format and save them to the database. The dataset can be marked as annotated or unannotated.
- Logout: Users can log out, which clears their session state and redirects them back to the login page.
- Manage Process:
  - Start or restart the annotation process.
  - View iteration statistics, such as the number of samples, the number of pending validations, and overall progress.
- Annotation Validation:
  - View pending annotation samples.
  - Accept or reject annotations based on their accuracy.
- Upload Dataset:
  - Upload datasets in CSV or JSON format.
  - Choose whether the dataset is annotated or unannotated.
- Login/Logout:
  - Annotators must log in to access annotation tasks.
  - The logout functionality clears the session state and returns the user to the login screen.
- Login: Annotators log in with their username and password.
- Manage Process: Users can initiate or restart the annotation process and view iteration details.
- Annotation Validation: Users can validate and manage pending annotations.
- Upload Dataset: Annotators can upload new datasets to the system.
- Logout: Annotators can log out, clearing their session.
The Annotation System is designed with flexibility and extensibility in mind, allowing you to easily add new features, sample types, annotation methods, stopping conditions, or even custom user interfaces. Below are some of the key ways you can extend and customize the system to fit your specific needs.
The system supports multiple types of samples (e.g., TextClassificationSample, SequenceToSequenceSample). If you need to add support for additional sample types (such as for new machine learning tasks), you can extend the ISample interface and implement your own sample class.
Steps to add a new sample type:
- Define a new class that implements the ISample interface.
- Implement the necessary methods such as deserialize() to convert raw data into your sample type and get_sample_type() to define its characteristics.
- Update the SampleFactory to include your new sample type, allowing it to be selected dynamically based on the configuration.
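The last step, making a sample type selectable by name, is often done with a small registry. The sketch below shows one way this could work; it does not reproduce the project's SampleFactory, and the names `register_sample`, `create_sample`, and `_SAMPLE_TYPES` are hypothetical. The sample class here is a standalone stand-in that does not inherit from ISample, so the snippet runs on its own.

```python
from typing import Callable

# Hypothetical registry mapping configuration names to sample classes.
_SAMPLE_TYPES: dict[str, type] = {}


def register_sample(name: str) -> Callable[[type], type]:
    """Class decorator that records a sample class under `name`."""
    def decorator(cls: type) -> type:
        _SAMPLE_TYPES[name] = cls
        return cls
    return decorator


def create_sample(name: str, data: dict):
    """Build a sample by looking up the class named in the configuration."""
    try:
        sample_cls = _SAMPLE_TYPES[name]
    except KeyError:
        raise ValueError(f"Unknown sample class: {name}") from None
    return sample_cls.deserialize(data)


@register_sample("MyNewSample")
class MyNewSample:
    def __init__(self, text: str):
        self.text = text

    @classmethod
    def deserialize(cls, data: dict) -> "MyNewSample":
        return cls(data["text"])
```

With this arrangement, a value such as the sampleClass entry in the configuration can be passed straight to `create_sample`.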
Example:
    from i_entities import ISample
    from annotation import SequenceLabelAnnotation

    class MyNewSample(ISample):
        def __init__(self, text):
            self.text = text
            self.labels = SequenceLabelAnnotation()  # Example annotation type

        @classmethod
        def deserialize(cls, data):
            return cls(data['text'])

        def get_sample_type(self):
            return 'MyNewSampleType'

If your project involves a different kind of annotation (e.g., multiple-choice labeling, sentiment analysis), you can add custom annotation types by creating new classes that implement the IAnnotation interface.
Steps to add a new annotation type:
- Define a new annotation class that extends the IAnnotation interface.
- Implement methods like get_value(), get_annotation_name(), and any additional methods specific to your task.
- Update the relevant sample type class to use this new annotation class.
Example:
    from i_entities import IAnnotation

    class SentimentAnnotation(IAnnotation):
        def __init__(self, sample_id, label, annotator_id=None, iteration_id=None, is_valid=False):
            super().__init__(sample_id, label, annotator_id, iteration_id, is_valid)

        @classmethod
        def get_annotation_name(cls) -> str:
            return "Sentiment"

        def get_value(self):
            return self.label

The system provides an ISampleSelector interface that allows you to define custom strategies for selecting samples for annotation (e.g., random selection, uncertainty sampling). You can extend the sample selection mechanism by adding new selectors.
Steps to add a new sample selector:
- Define a new class that implements the ISampleSelector interface.
- Implement the select() method to define your custom selection logic (e.g., select the least confident samples, or samples that are most uncertain).
- Update the SelectorFactory to include your new selector, ensuring that it can be dynamically chosen based on configuration.
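The "least confident samples" idea mentioned in the steps can be sketched independently of the selector interface. The snippet assumes the model exposes a confidence score between 0 and 1 for each sample id, which the source does not specify; `select_least_confident` is a hypothetical helper, not an existing function in the system.

```python
def select_least_confident(
    confidences: dict[str, float], sample_size: int = 100
) -> list[str]:
    """Return up to `sample_size` sample ids, lowest model confidence first."""
    # Sort by confidence, breaking ties by id so the result is deterministic.
    ranked = sorted(confidences.items(), key=lambda item: (item[1], item[0]))
    return [sample_id for sample_id, _ in ranked[:sample_size]]
```

A custom select() method could call a helper like this after collecting scores for the pending samples.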
Example:
    from i_entities import ISampleSelector

    class UncertaintySampleSelector(ISampleSelector):
        def select(self, sample_size=100):
            samples = self.dao.getPendingSamples()
            # Custom logic to select uncertain samples
            uncertain_samples = [s for s in samples if s.is_uncertain()]
            return uncertain_samples[:sample_size]

The system supports different stopping conditions, such as stopping based on the acceptance rate of annotations. You can add custom stopping conditions that determine when the annotation process should be halted.
Steps to add a new stopping condition:
- Define a new class that implements the IStopCondition interface.
- Implement the evaluate() method to determine whether the stopping condition has been met.
- Update the configuration (config.yaml) to include your new stopping condition type.
Example:
    from typing import Any

    from i_entities import IStopCondition

    class IterationCountStopCondition(IStopCondition):
        def evaluate(self, iteration_id: Any) -> bool:
            max_iterations = self.params.get('max_iterations', 10)
            current_iteration = self.dao.getIteration(iteration_id)
            return current_iteration.count >= max_iterations

The system is designed to integrate with external machine learning models that can generate annotations (e.g., NLP models for text classification). You can extend the system by implementing your own IModel interface for a specific model.
Steps to integrate a custom model:
- Define a model class that implements the IModel interface.
- Implement the generateAnnotation() method to use the model to generate annotations for new samples.
- Optionally, implement the finetune() method to fine-tune the model on newly annotated data.
Example:
    from i_entities import IModel

    class MyCustomModel(IModel):
        def finetune(self, samples):
            # Fine-tune the model on the given samples
            pass

        def generateAnnotation(self, sample):
            # Generate annotation for a given sample
            return "Positive"  # Example output

You can customize the Streamlit UI to add additional pages, visualizations, or functionality to suit your project's needs.
Steps to extend the UI:
- Modify the Application class to add new pages or views.
- Use Streamlit's built-in functions (st.write(), st.button(), st.selectbox(), etc.) to create custom widgets for the new features.
- Ensure that the new features interact with the backend, such as interacting with the database or calling existing methods to process annotations.
Example:
    import streamlit as st

    # ConfigLoader, MongoDAO, and AnnotationController come from the project's own modules.

    class Application:
        def __init__(self):
            self.config = ConfigLoader('config.yaml')
            self.dao = MongoDAO(self.config.get('connection-string'), self.config.get('database-name'))
            self.controller = AnnotationController(self.dao, self.config)

        def display_summary(self):
            st.header("Process Summary")
            # Display overall statistics, such as number of samples annotated, validation rate, etc.
            stats = self.dao.getAnnotationStats()
            st.write(stats)

        def main(self):
            page = st.sidebar.selectbox("Select Page", ["Annotation", "Summary"])
            if page == "Annotation":
                self.display_annotations()
            elif page == "Summary":
                self.display_summary()

The system currently supports basic user authentication, but you can extend it by adding different user roles and permissions. For example, you could differentiate between admin users who can modify the annotation process and annotators who can only validate annotations.
Steps to implement roles and permissions:
- Extend the Annotator class to include roles (e.g., admin, annotator).
- Modify the login logic to check the user role.
- Implement role-based access control in the Streamlit UI to show different options for different user types.
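The access-control step could be as simple as a lookup from role to permitted pages. The sketch below is one possible shape, with page names borrowed from the GUI description above; the mapping `ROLE_PAGES` and both helper functions are hypothetical, and the exact split of pages between roles is an assumption.

```python
# Hypothetical mapping from role to the UI pages that role may open.
ROLE_PAGES = {
    "admin": {"Manage Process", "Annotation Validation", "Upload Dataset"},
    "annotator": {"Annotation Validation"},
}


def pages_for(role: str) -> list[str]:
    """Pages to offer in the sidebar for a role; unknown roles get none."""
    return sorted(ROLE_PAGES.get(role, set()))


def can_access(role: str, page: str) -> bool:
    """Check a page request against the role's allowed pages."""
    return page in ROLE_PAGES.get(role, set())
```

The list returned by `pages_for` could feed the page selector in the sidebar, while `can_access` guards each page before it renders.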
Example:
    class Annotator:
        def __init__(self, email, password, role="annotator"):
            self.email = email
            self.password = password
            self.role = role  # New role attribute

        def is_admin(self):
            return self.role == "admin"

Currently, the system supports dataset uploads in CSV and JSON formats. If you need to support other formats (e.g., Excel, XML), you can extend the DatasetLoader class to handle these formats.
Steps to add support for new formats:
- Update the DatasetLoader class to include logic for parsing new file types (e.g., Excel, XML).
- Implement custom parsing functions for the new formats (e.g., pandas.read_excel() for Excel files).
- Modify the dataset upload interface in the Streamlit UI to allow users to select the new file type.
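As an example of a custom parsing function, XML input can be handled with the standard library alone. The sketch assumes a layout in which each record is a `<sample>` element whose child tags hold the fields; that layout and the function name `parse_xml_dataset` are assumptions for illustration, not a format the system defines.

```python
import xml.etree.ElementTree as ET


def parse_xml_dataset(content: str) -> list[dict]:
    """Turn each <sample> element into a dict of its child tags and text."""
    root = ET.fromstring(content)
    records = []
    for sample in root.iter("sample"):
        # Child tag names become keys; missing text becomes an empty string.
        records.append({child.tag: (child.text or "").strip() for child in sample})
    return records
```

A function like this returns the same list-of-dicts shape that a CSV or JSON parser would produce, so the rest of the loader would not need to know which format was uploaded.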
The Annotation System is highly modular and can be easily extended to accommodate various needs and tasks. Whether you need to add new sample types, annotation strategies, stopping conditions, or even integrate with external models, the system provides a flexible architecture for doing so. Additionally, the Streamlit GUI offers a straightforward way to interact with the system, and you can customize the interface to fit the specific workflow of your annotation process.
If you have a specific use case that requires additional functionality, feel free to extend any of the core components or add new ones to make the system fit your needs.
We welcome contributions to the Annotation System. If you have an idea for a feature or have found a bug, please feel free to submit an issue or create a pull request.
- Fork the repository.
- Create a new branch (git checkout -b feature/feature-name).
- Make your changes.
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/feature-name).
- Create a new pull request.
This project is licensed under the MIT License - see the LICENSE file for details.