# BERTimbau-large-metadata-council-pt: Metadata Extraction from Municipal Meeting Minutes
## Model Description
This model performs Named Entity Recognition (NER) to automatically extract administrative metadata from Portuguese municipal meeting minutes.
Given the text of a meeting minute, the model identifies and classifies domain-specific entities such as meeting number, date, location, participants, and session times.
It operates at the token level using a sequence labeling approach, assigning entity tags to each token in the input text.
The model is designed to process administrative and institutional documents and is typically applied to specific segments of the minutes (e.g., Opening or Closing).
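For instance (a hypothetical illustration with invented text, not actual model output), an opening sentence and the kinds of spans the model is meant to surface:

```python
# Invented opening fragment and the kinds of spans the model targets;
# exact boundaries depend on the annotation guidelines.
text = ("Aos vinte dias do mês de maio de dois mil e vinte e quatro, "
        "pelas catorze horas, reuniu a Câmara Municipal no Salão Nobre.")
expected_spans = {
    "date": "vinte dias do mês de maio de dois mil e vinte e quatro",
    "begin_time": "catorze horas",
    "location": "Salão Nobre",
}
```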
## Key Features
- 🏛️ **Specialized for Municipal Minutes:** Fine-tuned on Portuguese municipal meeting minutes annotated with domain-specific metadata entities.
- 🏷️ **Token-level Metadata Extraction (NER):** Identifies and classifies multiple metadata entities within the text using a BIO-style tagging scheme.
- ⚙️ **Transformer-based Architecture:** A BERTimbau backbone fine-tuned for token classification.
- 🔗 **Pipeline-oriented Design:** Operates downstream of a segment detection module, enabling structured metadata extraction from relevant document sections.
## Model Details

- **Base Model:** neuralmind/bert-large-portuguese-cased
- **Architecture:** BERT for token classification (NER)
- **Parameters:** 333M
- **Max Sequence Length:** 512 tokens
- **Fine-tuning Dataset:** Portuguese municipal meeting minutes (20 minutes from each of 6 municipalities, 120 documents in total)
- **Entity Types:** `minute_id`, `date`, `meeting_type`, `location`, `begin_time`, `end_time`, `participant`
- **Training Framework:** PyTorch + Hugging Face Transformers
- **Evaluation Metric:** F1-score
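Under the BIO scheme described in the next section, these seven entity types imply a label inventory of fifteen tags. A hypothetical reconstruction is shown below; the authoritative mapping lives in `model.config.id2label`:

```python
# Hypothetical label set assuming a standard BIO scheme over the seven
# entity types; check model.config.id2label for the actual mapping.
ENTITY_TYPES = ["minute_id", "date", "meeting_type", "location",
                "begin_time", "end_time", "participant"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(LABELS), LABELS)  # 15 tags: "O" plus a B-/I- pair per type
```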
## How It Works
The model follows a standard Named Entity Recognition (NER) pipeline based on sequence labeling.
The input text is tokenized using the model’s subword tokenizer, and each token is assigned a label according to the BIO tagging scheme. The transformer encoder produces contextualized representations for each token, which are then passed to a token classification head to predict entity labels.
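As a minimal sketch of this token-level labeling step (the input fragment is invented; the model ID is this repository's), one can print each subword token next to its predicted tag:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "liaad/Citilink_BERTimbau-large_metadata"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical opening fragment; inspect per-token predictions.
enc = tokenizer("Aos vinte dias do mês de maio, pelas catorze horas.",
                return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0]
for tok, i in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
                  pred.tolist()):
    print(f"{tok:>12}  {model.config.id2label[i]}")
```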
During inference, subword-level predictions are aggregated to reconstruct entity spans at the word level. Consecutive tokens labeled as part of the same entity are merged to form complete metadata fields.
The following example illustrates how to perform metadata extraction using this model:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
MODEL_NAME = "liaad/Citilink_BERTimbau-large_metadata"  # or a local path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()


def extract_entities(text):
    """Run token classification on `text` and return merged entity spans."""
    if not text or text.strip() == "":
        return []

    encoding = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        truncation=True,
        max_length=512,
    )
    offsets = encoding["offset_mapping"][0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # Offset mappings are only needed for span reconstruction, not by the model.
    inputs = {
        "input_ids": encoding["input_ids"],
        "attention_mask": encoding["attention_mask"],
    }
    with torch.no_grad():
        outputs = model(**inputs)

    pred_ids = torch.argmax(outputs.logits, dim=2)[0].tolist()
    pred_labels = [model.config.id2label[i] for i in pred_ids]

    entities = []
    current = None
    prev_word_idx = None
    for i, label in enumerate(pred_labels):
        word_idx = word_ids[i]
        start, end = offsets[i]

        # Skip special tokens ([CLS], [SEP]), which map to no word.
        if word_idx is None:
            continue

        # Subword continuation: extend the current entity to cover it.
        if word_idx == prev_word_idx:
            if current:
                current["end"] = end
            continue
        prev_word_idx = word_idx

        if label.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append(current)
            current = {"label": label[2:], "start": start, "end": end}
        elif label.startswith("I-"):
            # Continue the open entity only if the types match.
            if current and current["label"] == label[2:]:
                current["end"] = end
            else:
                continue
        else:  # label == "O"
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)

    # Recover each entity's surface text from its character offsets.
    for ent in entities:
        ent["text"] = text[ent["start"]:ent["end"]]

    return entities


if __name__ == "__main__":
    input_path = "segments.json"
    output_path = "results.json"

    with open(input_path, "r", encoding="utf-8") as f:
        minute = json.load(f)

    # Metadata is concentrated in the Opening and Closing segments.
    introduction_entities = extract_entities(minute.get("introduction", ""))
    conclusion_entities = extract_entities(minute.get("conclusion", ""))
    all_entities = introduction_entities + conclusion_entities

    # Group extracted entities by label for structured output.
    grouped_entities = {}
    for ent in all_entities:
        grouped_entities.setdefault(ent["label"], []).append({
            "text": ent["text"],
            "start": ent["start"],
            "end": ent["end"],
        })

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(grouped_entities, f, ensure_ascii=False, indent=2)

    print(json.dumps(grouped_entities, ensure_ascii=False, indent=2))
```
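The script reads `segments.json`, assumed to hold the Opening and Closing text produced by the upstream segment detector. A plausible shape (field values invented):

```python
# Hypothetical contents of segments.json; the "introduction" and
# "conclusion" keys are the ones the script reads.
example_segments = {
    "introduction": "Ata número quinze. Aos vinte dias do mês de maio ...",
    "conclusion": "Nada mais havendo a tratar, a sessão foi encerrada "
                  "pelas dezassete horas.",
}
```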
## Evaluation Results

### Municipal Meeting Minutes Test Set
| Metric | Score |
|---|---|
| F1-score | 0.96 |
## Limitations

### Domain Specificity
The model is fine-tuned on Portuguese municipal meeting minutes and performs best on administrative and governmental texts. Performance may degrade on documents with substantially different structure, vocabulary, or writing style.

### Language Dependency
The model is built on a Portuguese pre-trained transformer (BERTimbau) and fine-tuned exclusively on Portuguese data. Performance on other languages has not been validated and is not guaranteed.

### Context Window Length
The model has a maximum input length of 512 tokens. Longer documents require chunk-based processing, which may lead to incomplete or fragmented entity predictions across chunk boundaries; a minimal sliding-window sketch follows this section.

### Annotation and Formatting Variability
Municipal minutes may vary significantly across municipalities and time periods in terms of formatting, terminology, and metadata conventions. Unseen patterns or inconsistent annotations can negatively impact entity recognition accuracy.
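A minimal sliding-window sketch for documents beyond the 512-token limit, reusing `extract_entities` from the example above. The window and stride sizes are illustrative assumptions, and entities cut at a window boundary may still be fragmented:

```python
def extract_entities_chunked(text, window_chars=2000, stride_chars=1500):
    """Run extract_entities over overlapping character windows of a long text.

    Window size is a rough assumption (~2,000 characters usually stays
    under 512 subword tokens for Portuguese); tune it for your documents.
    """
    seen = set()
    results = []
    for offset in range(0, len(text) or 1, stride_chars):
        chunk = text[offset:offset + window_chars]
        for ent in extract_entities(chunk):
            # Map chunk-local character offsets back to the full document.
            key = (ent["start"] + offset, ent["end"] + offset, ent["label"])
            if key in seen:  # drop duplicates from overlapping windows
                continue
            seen.add(key)
            results.append({"label": ent["label"], "start": key[0],
                            "end": key[1], "text": text[key[0]:key[1]]})
        if offset + window_chars >= len(text):
            break
    return results
```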
## License

This model is released under the CC BY-NC-ND 4.0 license.