# BERTimbau-large-metadata-council-pt: Metadata Extraction from Municipal Meeting Minutes
## Model Description
This model performs Named Entity Recognition (NER) to automatically extract administrative metadata from Portuguese municipal meeting minutes.
Given the text of a meeting minute, the model identifies and classifies domain-specific entities such as meeting number, date, location, participants, and session times.
It operates at the token level using a sequence labeling approach, assigning entity tags to each token in the input text.
The model is designed to process administrative and institutional documents and is typically applied to specific segments of the minutes (e.g., Opening or Closing).
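For instance (a hypothetical illustration with invented text, not actual model output), an opening sentence and the kinds of spans the model is meant to surface:

```python
# Invented opening fragment and the kinds of spans the model targets;
# exact boundaries depend on the annotation guidelines.
text = ("Aos vinte dias do mês de maio de dois mil e vinte e quatro, "
        "pelas catorze horas, reuniu a Câmara Municipal no Salão Nobre.")
expected_spans = {
    "date": "vinte dias do mês de maio de dois mil e vinte e quatro",
    "begin_time": "catorze horas",
    "location": "Salão Nobre",
}
```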
## Key Features
- 🏛️ **Specialized for Municipal Minutes:** Fine-tuned on Portuguese municipal meeting minutes annotated with domain-specific metadata entities.
- 🏷️ **Token-level Metadata Extraction (NER):** Identifies and classifies multiple metadata entities within the text using a BIO-style tagging scheme.
- ⚙️ **Transformer-based Architecture:** A BERTimbau backbone fine-tuned for token classification.
- 🔗 **Pipeline-oriented Design:** Operates downstream of a segment detection module, enabling structured metadata extraction from relevant document sections.
## Model Details

- **Base Model:** neuralmind/bert-large-portuguese-cased
- **Architecture:** BERT for token classification (NER)
- **Parameters:** 333M
- **Max Sequence Length:** 512 tokens
- **Fine-tuning Dataset:** Portuguese municipal meeting minutes (20 minutes from each of 6 municipalities, 120 documents in total)
- **Entity Types:** `minute_id`, `date`, `meeting_type`, `location`, `begin_time`, `end_time`, `participant`
- **Training Framework:** PyTorch + Hugging Face Transformers
- **Evaluation Metric:** F1-score
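Under the BIO scheme described in the next section, these seven entity types imply a label inventory of fifteen tags. A hypothetical reconstruction is shown below; the authoritative mapping lives in `model.config.id2label`:

```python
# Hypothetical label set assuming a standard BIO scheme over the seven
# entity types; check model.config.id2label for the actual mapping.
ENTITY_TYPES = ["minute_id", "date", "meeting_type", "location",
                "begin_time", "end_time", "participant"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(LABELS), LABELS)  # 15 tags: "O" plus a B-/I- pair per type
```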
## How It Works
The model follows a standard Named Entity Recognition (NER) pipeline based on sequence labeling.
The input text is tokenized using the model’s subword tokenizer, and each token is assigned a label according to the BIO tagging scheme. The transformer encoder produces contextualized representations for each token, which are then passed to a token classification head to predict entity labels.
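As a minimal sketch of this token-level labeling step (the input fragment is invented; the model ID is this repository's), one can print each subword token next to its predicted tag:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "liaad/Citilink_BERTimbau-large_metadata"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical opening fragment; inspect per-token predictions.
enc = tokenizer("Aos vinte dias do mês de maio, pelas catorze horas.",
                return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0]
for tok, i in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
                  pred.tolist()):
    print(f"{tok:>12}  {model.config.id2label[i]}")
```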
During inference, subword-level predictions are aggregated to reconstruct entity spans at the word level. Consecutive tokens labeled as part of the same entity are merged to form complete metadata fields.
The following example illustrates how to perform metadata extraction using this model:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
MODEL_NAME = "liaad/Citilink_BERTimbau-large_metadata"  # or a local path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()


def extract_entities(text):
    """Run token classification on `text` and return merged entity spans."""
    if not text or text.strip() == "":
        return []

    encoding = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        truncation=True,
        max_length=512,
    )
    offsets = encoding["offset_mapping"][0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # Offset mappings are only needed for span reconstruction, not by the model.
    inputs = {
        "input_ids": encoding["input_ids"],
        "attention_mask": encoding["attention_mask"],
    }
    with torch.no_grad():
        outputs = model(**inputs)

    pred_ids = torch.argmax(outputs.logits, dim=2)[0].tolist()
    pred_labels = [model.config.id2label[i] for i in pred_ids]

    entities = []
    current = None
    prev_word_idx = None
    for i, label in enumerate(pred_labels):
        word_idx = word_ids[i]
        start, end = offsets[i]

        # Skip special tokens ([CLS], [SEP]), which map to no word.
        if word_idx is None:
            continue

        # Subword continuation: extend the current entity to cover it.
        if word_idx == prev_word_idx:
            if current:
                current["end"] = end
            continue
        prev_word_idx = word_idx

        if label.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append(current)
            current = {"label": label[2:], "start": start, "end": end}
        elif label.startswith("I-"):
            # Continue the open entity only if the types match.
            if current and current["label"] == label[2:]:
                current["end"] = end
            else:
                continue
        else:  # label == "O"
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)

    # Recover each entity's surface text from its character offsets.
    for ent in entities:
        ent["text"] = text[ent["start"]:ent["end"]]

    return entities


if __name__ == "__main__":
    input_path = "segments.json"
    output_path = "results.json"

    with open(input_path, "r", encoding="utf-8") as f:
        minute = json.load(f)

    # Metadata is concentrated in the Opening and Closing segments.
    introduction_entities = extract_entities(minute.get("introduction", ""))
    conclusion_entities = extract_entities(minute.get("conclusion", ""))
    all_entities = introduction_entities + conclusion_entities

    # Group extracted entities by label for structured output.
    grouped_entities = {}
    for ent in all_entities:
        grouped_entities.setdefault(ent["label"], []).append({
            "text": ent["text"],
            "start": ent["start"],
            "end": ent["end"],
        })

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(grouped_entities, f, ensure_ascii=False, indent=2)

    print(json.dumps(grouped_entities, ensure_ascii=False, indent=2))
```
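The script reads `segments.json`, assumed to hold the Opening and Closing text produced by the upstream segment detector. A plausible shape (field values invented):

```python
# Hypothetical contents of segments.json; the "introduction" and
# "conclusion" keys are the ones the script reads.
example_segments = {
    "introduction": "Ata número quinze. Aos vinte dias do mês de maio ...",
    "conclusion": "Nada mais havendo a tratar, a sessão foi encerrada "
                  "pelas dezassete horas.",
}
```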
## Evaluation Results

### Municipal Meeting Minutes Test Set
| Metric | Score |
|---|---|
| F1-score | 0.96 |
## Limitations

### Domain Specificity
The model is fine-tuned on Portuguese municipal meeting minutes and performs best on administrative and governmental texts. Performance may degrade on documents with substantially different structure, vocabulary, or writing style.

### Language Dependency
The model is built on a Portuguese pre-trained transformer (BERTimbau) and fine-tuned exclusively on Portuguese data. Performance on other languages has not been validated and is not guaranteed.

### Context Window Length
The model has a maximum input length of 512 tokens. Longer documents require chunk-based processing, which may lead to incomplete or fragmented entity predictions across chunk boundaries; a minimal sliding-window sketch follows this section.

### Annotation and Formatting Variability
Municipal minutes may vary significantly across municipalities and time periods in terms of formatting, terminology, and metadata conventions. Unseen patterns or inconsistent annotations can negatively impact entity recognition accuracy.
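A minimal sliding-window sketch for documents beyond the 512-token limit, reusing `extract_entities` from the example above. The window and stride sizes are illustrative assumptions, and entities cut at a window boundary may still be fragmented:

```python
def extract_entities_chunked(text, window_chars=2000, stride_chars=1500):
    """Run extract_entities over overlapping character windows of a long text.

    Window size is a rough assumption (~2,000 characters usually stays
    under 512 subword tokens for Portuguese); tune it for your documents.
    """
    seen = set()
    results = []
    for offset in range(0, len(text) or 1, stride_chars):
        chunk = text[offset:offset + window_chars]
        for ent in extract_entities(chunk):
            # Map chunk-local character offsets back to the full document.
            key = (ent["start"] + offset, ent["end"] + offset, ent["label"])
            if key in seen:  # drop duplicates from overlapping windows
                continue
            seen.add(key)
            results.append({"label": ent["label"], "start": key[0],
                            "end": key[1], "text": text[key[0]:key[1]]})
        if offset + window_chars >= len(text):
            break
    return results
```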
## License

This model is released under the CC BY-NC-ND 4.0 license.