first push
- FIXES_APPLIED.md +117 -0
- app/main.py +27 -7
- test_api.py +108 -0
- translator.py +131 -38
- اخطاء.txt +0 -0
FIXES_APPLIED.md
ADDED
|
@@ -0,0 +1,117 @@
+# Translation Issues Fixed
+
+## Problems Addressed
+
+### 1. Translation Not Working (Files Remained Untranslated)
+**Problem**: Files were being processed but returned in the original language with 0 paragraphs translated.
+
+**Root Causes**:
+- Silent fallback behavior in `translate_text()` method
+- No validation of translation results
+- Missing error handling for API failures
+
+**Fixes Applied**:
+- **Enhanced `translate_text()` method**:
+  - Added API key validation before making requests
+  - Improved translation prompts for better results with Google Gemini 2.5 Pro
+  - Removed silent fallback to original text - now raises exceptions on failure
+  - Added validation to ensure translation actually occurred
+  - Increased token limits for better translation quality
+
+- **Improved error handling**:
+  - Added comprehensive exception handling in translation workflows
+  - Better validation of translated content
+  - Detailed logging to track translation progress
+
+- **Enhanced validation**:
+  - Check for empty or unchanged translation results
+  - Verify API responses before processing
+  - Ensure at least some content gets translated
+
+### 2. Format Preservation Issue
+**Problem**: User wanted files to maintain original filename and format (PDF→Word→translate→PDF workflow)
+
+**Current Behavior**: Created separate "translated_" prefixed files
+**Desired Behavior**: Receive PDF, convert to Word, translate, convert back to PDF with same filename
+
+**Fixes Applied**:
+- **Modified `translate_document()` method**:
+  - Output file now uses original filename (no "translated_" prefix)
+  - For PDF input: PDF→DOCX→translate→PDF with original filename
+  - For DOCX input: DOCX→translate→DOCX with original filename
+
+- **Updated file handling in `main.py`**:
+  - Both original and translated files now use same filename
+  - Better file copying and naming logic
+  - Improved response structure
+
+## Technical Improvements
+
+### 1. Robust Translation Logic
+```python
+# Before: Silent fallback
+if translation_failed:
+    return original_text  # Silent failure
+
+# After: Proper error handling
+if not translated or translated == text:
+    raise Exception("Translation failed: received empty or unchanged text")
+```
+
+### 2. Enhanced Error Reporting
+- Added detailed logging throughout the translation pipeline
+- Better API error messages
+- Validation at each step of the process
+
+### 3. Format Preservation Workflow
+```
+PDF Input → LibreOffice Convert to DOCX → Translate DOCX → Convert back to PDF (same filename)
+DOCX Input → Translate DOCX → Save as same filename
+```
+
+## Testing
+
+### API Key Testing
+Created `test_api.py` script to verify:
+- OPENROUTER_API_KEY is set correctly
+- API connection is working
+- Basic translation functionality
+
+### Usage
+Run the test script to verify setup:
+```bash
+python test_api.py
+```
+
+## Expected Results
+
+After these fixes:
+1. **Translation will work**: Files will be actually translated, not returned unchanged
+2. **Format preserved**: PDF files will be returned as PDF with same filename
+3. **Better error messages**: Clear feedback when translation fails
+4. **Robust operation**: Proper error handling instead of silent failures
+
+## Key Files Modified
+
+1. **`translator.py`**:
+   - Enhanced `translate_text()` method with validation
+   - Improved `translate_document()` for format preservation
+   - Better error handling in `translate_docx()` and `translate_pdf_direct()`
+
+2. **`app/main.py`**:
+   - Updated translation endpoint with better validation
+   - Fixed file naming to preserve original names
+   - Enhanced error reporting
+
+3. **`test_api.py`** (new):
+   - API key and connection testing
+   - Basic translation functionality verification
+
+## Usage Instructions
+
+1. **Set API Key**: Ensure `OPENROUTER_API_KEY` environment variable is set
+2. **Test Setup**: Run `python test_api.py` to verify configuration
+3. **Upload Files**: PDF or DOCX files will now be properly translated
+4. **Download Results**: Translated files maintain original format and filename
+
+The system now provides reliable translation with proper format preservation as requested.
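Note on the "Robust Translation Logic" change in FIXES_APPLIED.md: the fail-loud check can be sketched in isolation. This is an illustration, not the code in `translator.py`. `TranslationError` and `validate_translation` are hypothetical names, and a dedicated exception class is swapped in for the bare `Exception` the commit raises, so that callers could catch translation failures specifically. Unlike the commit's check, the sketch also strips surrounding whitespace before comparing.

```python
class TranslationError(Exception):
    """Hypothetical exception type; the commit itself raises plain Exception."""


def validate_translation(original: str, translated: str) -> str:
    """Return the cleaned translation, or raise if nothing was translated."""
    # Treat None like an empty response and ignore surrounding whitespace.
    cleaned = (translated or "").strip()
    if not cleaned or cleaned == original.strip():
        raise TranslationError("Translation failed: received empty or unchanged text")
    return cleaned
```

A strict equality check also rejects text that legitimately stays the same after translation, such as a product name or a number, so a caller may want to skip validation for such segments.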
app/main.py
CHANGED
|
@@ -74,6 +74,7 @@ async def translate_document(
|
|
| 74 |
):
|
| 75 |
"""
|
| 76 |
Translate a document (PDF or DOCX) using the specified model
|
|
|
|
| 77 |
"""
|
| 78 |
if not file.filename:
|
| 79 |
raise HTTPException(status_code=400, detail="No file provided")
|
|
@@ -87,6 +88,13 @@ async def translate_document(
|
|
| 87 |
detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}"
|
| 88 |
)
|
| 89 |
|
| 90 |
# Create temporary directory for this translation
|
| 91 |
with tempfile.TemporaryDirectory() as temp_dir:
|
| 92 |
temp_path = Path(temp_dir)
|
|
@@ -99,6 +107,8 @@ async def translate_document(
|
|
| 99 |
try:
|
| 100 |
# Perform translation
|
| 101 |
logger.info(f"Starting translation of {input_file} using model {model}")
|
|
|
|
|
|
|
| 102 |
result = await translator.translate_document(
|
| 103 |
input_file=input_file,
|
| 104 |
model=model,
|
|
@@ -114,23 +124,28 @@ async def translate_document(
|
|
| 114 |
raise HTTPException(status_code=500, detail=error_details)
|
| 115 |
|
| 116 |
if result.paragraphs_count == 0:
|
| 117 |
-
logger.
|
| 118 |
-
|
|
|
|
|
|
|
|
|
|
| 119 |
|
| 120 |
# Move files to uploads directory for serving
|
| 121 |
timestamp = int(asyncio.get_event_loop().time())
|
| 122 |
result_dir = UPLOAD_DIR / f"translation_{timestamp}"
|
| 123 |
result_dir.mkdir(exist_ok=True)
|
| 124 |
|
| 125 |
-
# Copy result files
|
| 126 |
final_files = {}
|
| 127 |
if result.original_file.exists():
|
| 128 |
-
|
|
|
|
| 129 |
shutil.copy2(result.original_file, original_dest)
|
| 130 |
final_files["original"] = str(original_dest.relative_to(UPLOAD_DIR))
|
| 131 |
|
| 132 |
if result.translated_file.exists():
|
| 133 |
-
|
|
|
|
| 134 |
shutil.copy2(result.translated_file, translated_dest)
|
| 135 |
final_files["translated"] = str(translated_dest.relative_to(UPLOAD_DIR))
|
| 136 |
|
|
@@ -138,17 +153,22 @@ async def translate_document(
|
|
| 138 |
report = {
|
| 139 |
"status": "success",
|
| 140 |
"original_filename": file.filename,
|
| 141 |
-
"translated_filename":
|
| 142 |
"pages_translated": result.pages_count,
|
| 143 |
"paragraphs_translated": result.paragraphs_count,
|
| 144 |
"model_used": model,
|
| 145 |
"source_language": source_language,
|
| 146 |
"target_language": target_language,
|
| 147 |
-
"files": final_files
|
|
|
|
| 148 |
}
|
| 149 |
|
|
|
|
| 150 |
return JSONResponse(content=report)
|
| 151 |
|
|
|
|
|
|
|
|
|
|
| 152 |
except Exception as e:
|
| 153 |
logger.error(f"Translation error: {e}")
|
| 154 |
raise HTTPException(status_code=500, detail=f"Translation failed: {str(e)}")
|
|
|
|
| 74 |
):
|
| 75 |
"""
|
| 76 |
Translate a document (PDF or DOCX) using the specified model
|
| 77 |
+
Returns translated file with same name and format as original
|
| 78 |
"""
|
| 79 |
if not file.filename:
|
| 80 |
raise HTTPException(status_code=400, detail="No file provided")
|
|
|
|
| 88 |
detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}"
|
| 89 |
)
|
| 90 |
|
| 91 |
+
# Validate API key
|
| 92 |
+
if not translator.is_ready():
|
| 93 |
+
raise HTTPException(
|
| 94 |
+
status_code=500,
|
| 95 |
+
detail="Translation service not configured. Please check OPENROUTER_API_KEY."
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
# Create temporary directory for this translation
|
| 99 |
with tempfile.TemporaryDirectory() as temp_dir:
|
| 100 |
temp_path = Path(temp_dir)
|
|
|
|
| 107 |
try:
|
| 108 |
# Perform translation
|
| 109 |
logger.info(f"Starting translation of {input_file} using model {model}")
|
| 110 |
+
logger.info(f"Translation: {source_language} -> {target_language}")
|
| 111 |
+
|
| 112 |
result = await translator.translate_document(
|
| 113 |
input_file=input_file,
|
| 114 |
model=model,
|
|
|
|
| 124 |
raise HTTPException(status_code=500, detail=error_details)
|
| 125 |
|
| 126 |
if result.paragraphs_count == 0:
|
| 127 |
+
logger.error("Translation completed but no paragraphs were translated")
|
| 128 |
+
raise HTTPException(
|
| 129 |
+
status_code=500,
|
| 130 |
+
detail="Translation failed: No content was translated. Please check if the file contains readable text."
|
| 131 |
+
)
|
| 132 |
|
| 133 |
# Move files to uploads directory for serving
|
| 134 |
timestamp = int(asyncio.get_event_loop().time())
|
| 135 |
result_dir = UPLOAD_DIR / f"translation_{timestamp}"
|
| 136 |
result_dir.mkdir(exist_ok=True)
|
| 137 |
|
| 138 |
+
# Copy result files with original names (no prefix)
|
| 139 |
final_files = {}
|
| 140 |
if result.original_file.exists():
|
| 141 |
+
# Keep original filename
|
| 142 |
+
original_dest = result_dir / file.filename
|
| 143 |
shutil.copy2(result.original_file, original_dest)
|
| 144 |
final_files["original"] = str(original_dest.relative_to(UPLOAD_DIR))
|
| 145 |
|
| 146 |
if result.translated_file.exists():
|
| 147 |
+
# Use original filename for translated file too
|
| 148 |
+
translated_dest = result_dir / file.filename
|
| 149 |
shutil.copy2(result.translated_file, translated_dest)
|
| 150 |
final_files["translated"] = str(translated_dest.relative_to(UPLOAD_DIR))
|
| 151 |
|
|
|
|
| 153 |
report = {
|
| 154 |
"status": "success",
|
| 155 |
"original_filename": file.filename,
|
| 156 |
+
"translated_filename": file.filename, # Same filename
|
| 157 |
"pages_translated": result.pages_count,
|
| 158 |
"paragraphs_translated": result.paragraphs_count,
|
| 159 |
"model_used": model,
|
| 160 |
"source_language": source_language,
|
| 161 |
"target_language": target_language,
|
| 162 |
+
"files": final_files,
|
| 163 |
+
"message": f"Successfully translated {result.paragraphs_count} paragraphs from {source_language} to {target_language}"
|
| 164 |
}
|
| 165 |
|
| 166 |
+
logger.info(f"Translation completed successfully: {result.paragraphs_count} paragraphs translated")
|
| 167 |
return JSONResponse(content=report)
|
| 168 |
|
| 169 |
+
except HTTPException:
|
| 170 |
+
# Re-raise HTTP exceptions
|
| 171 |
+
raise
|
| 172 |
except Exception as e:
|
| 173 |
logger.error(f"Translation error: {e}")
|
| 174 |
raise HTTPException(status_code=500, detail=f"Translation failed: {str(e)}")
|
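Note on the file-copy hunk above: `original_dest` and `translated_dest` are both `result_dir / file.filename`, so the second `shutil.copy2` replaces the first file and both entries in `final_files` point at the same path. One way to keep the uploaded filename for both copies is to place them in separate subdirectories. The following is a sketch under that assumption. `plan_destinations` and the `original/` and `translated/` folder names are hypothetical and not part of the commit.

```python
from pathlib import Path
from typing import Dict


def plan_destinations(result_dir: Path, filename: str) -> Dict[str, Path]:
    """Return two distinct destination paths that both end in the uploaded filename."""
    # Keep only the final path component, as the host OS understands paths.
    safe_name = Path(filename).name
    destinations = {
        "original": result_dir / "original" / safe_name,
        "translated": result_dir / "translated" / safe_name,
    }
    # Create both subdirectories so a later shutil.copy2 has somewhere to write.
    for dest in destinations.values():
        dest.parent.mkdir(parents=True, exist_ok=True)
    return destinations
```

With paths planned this way, the two `shutil.copy2` calls would write to different files while the download still carries the original name.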
test_api.py
ADDED
|
@@ -0,0 +1,108 @@
+#!/usr/bin/env python3
+"""
+Test script to verify OpenRouter API key and translation functionality
+"""
+
+import os
+import asyncio
+import aiohttp
+from translator import DocumentTranslator
+
+async def test_api_key():
+    """Test if the API key is working"""
+    print("🔑 Testing OpenRouter API key...")
+
+    api_key = os.getenv("OPENROUTER_API_KEY")
+    if not api_key:
+        print("❌ OPENROUTER_API_KEY environment variable not set!")
+        print("Please set it with: set OPENROUTER_API_KEY=your_key_here")
+        return False
+
+    print(f"✅ API key found: {api_key[:10]}...")
+
+    # Test API connection
+    try:
+        headers = {
+            "Authorization": f"Bearer {api_key}",
+            "Content-Type": "application/json",
+            "HTTP-Referer": "https://huggingface.co",
+            "X-Title": "Document Translator"
+        }
+
+        async with aiohttp.ClientSession() as session:
+            async with session.get(
+                "https://openrouter.ai/api/v1/models",
+                headers=headers
+            ) as response:
+                if response.status == 200:
+                    print("✅ API connection successful!")
+                    return True
+                else:
+                    print(f"❌ API connection failed: {response.status}")
+                    error_text = await response.text()
+                    print(f"Error: {error_text}")
+                    return False
+    except Exception as e:
+        print(f"❌ API test failed: {e}")
+        return False
+
+async def test_translation():
+    """Test basic translation functionality"""
+    print("\n📝 Testing translation functionality...")
+
+    translator = DocumentTranslator()
+
+    if not translator.is_ready():
+        print("❌ Translator not ready - API key issue")
+        return False
+
+    try:
+        # Test simple translation
+        test_text = "Hello, this is a test document."
+        print(f"Original text: {test_text}")
+
+        translated = await translator.translate_text(
+            text=test_text,
+            model="google/gemini-2.5-pro-exp-03-25",
+            source_lang="en",
+            target_lang="ar"
+        )
+
+        print(f"Translated text: {translated}")
+
+        if translated != test_text:
+            print("✅ Translation working correctly!")
+            return True
+        else:
+            print("❌ Translation returned original text - may indicate an issue")
+            return False
+
+    except Exception as e:
+        print(f"❌ Translation test failed: {e}")
+        return False
+
+async def main():
+    """Run all tests"""
+    print("🧪 Testing Document Translator Setup\n")
+
+    # Test API key
+    api_ok = await test_api_key()
+
+    if api_ok:
+        # Test translation
+        translation_ok = await test_translation()
+
+        if translation_ok:
+            print("\n🎉 All tests passed! The translator should work correctly.")
+        else:
+            print("\n⚠️ Translation test failed. Check the logs for details.")
+    else:
+        print("\n❌ API key test failed. Please check your OPENROUTER_API_KEY.")
+
+    print("\n📋 Next steps:")
+    print("1. Make sure OPENROUTER_API_KEY is set correctly")
+    print("2. Upload a PDF or DOCX file to test the full workflow")
+    print("3. Check the translation.log file for detailed logs")
+
+if __name__ == "__main__":
+    asyncio.run(main())
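Note on `test_api_key()` above: it prints the first ten characters of the key. If the script's output may end up in shared logs, a masked form might be preferable. A minimal sketch, assuming a hypothetical helper `mask_key` that is not part of the commit:

```python
def mask_key(api_key: str, visible: int = 4) -> str:
    """Show only the last `visible` characters of a secret."""
    if len(api_key) <= visible:
        # Too short to reveal anything safely; hide it entirely.
        return "*" * len(api_key)
    hidden = len(api_key) - visible
    # Slice from an explicit index so visible=0 reveals nothing.
    return "*" * hidden + api_key[hidden:]
```

The print in the script could then read `print(f"✅ API key found: {mask_key(api_key)}")`.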
translator.py
CHANGED
|
@@ -61,10 +61,14 @@ class DocumentTranslator:
|
|
| 61 |
]
|
| 62 |
|
| 63 |
async def translate_text(self, text: str, model: str, source_lang: str = "auto", target_lang: str = "en") -> str:
|
| 64 |
-
"""Translate text using OpenRouter API with improved prompt"""
|
| 65 |
if not text.strip():
|
| 66 |
return text
|
| 67 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
# Create a more specific translation prompt
|
| 69 |
if source_lang == "auto":
|
| 70 |
prompt = f"""You are a professional document translator. Translate the following text to {target_lang} (Arabic if 'ar', English if 'en', etc.).
|
|
@@ -74,6 +78,7 @@ IMPORTANT INSTRUCTIONS:
|
|
| 74 |
2. Maintain the original formatting and structure
|
| 75 |
3. Preserve technical terms appropriately
|
| 76 |
4. Return ONLY the translated text
|
|
|
|
| 77 |
|
| 78 |
Text to translate:
|
| 79 |
{text}
|
|
@@ -87,6 +92,7 @@ IMPORTANT INSTRUCTIONS:
|
|
| 87 |
2. Maintain the original formatting and structure
|
| 88 |
3. Preserve technical terms appropriately
|
| 89 |
4. Return ONLY the translated text
|
|
|
|
| 90 |
|
| 91 |
Text to translate:
|
| 92 |
{text}
|
|
@@ -98,11 +104,11 @@ Translated text:"""
|
|
| 98 |
payload = {
|
| 99 |
"model": model,
|
| 100 |
"messages": [
|
| 101 |
-
{"role": "system", "content": "You are a professional document translator.
|
| 102 |
{"role": "user", "content": prompt}
|
| 103 |
],
|
| 104 |
"temperature": 0.1,
|
| 105 |
-
"max_tokens": len(text) *
|
| 106 |
}
|
| 107 |
|
| 108 |
logger.info(f"Translating text: '{text[:50]}...' from {source_lang} to {target_lang}")
|
|
@@ -120,15 +126,26 @@ Translated text:"""
|
|
| 120 |
if "Translated text:" in translated:
|
| 121 |
translated = translated.split("Translated text:")[-1].strip()
|
| 122 |
|
| 123 |
logger.info(f"Translation successful: '{translated[:50]}...'")
|
| 124 |
return translated
|
| 125 |
else:
|
| 126 |
error_text = await response.text()
|
| 127 |
logger.error(f"Translation API error: {response.status} - {error_text}")
|
| 128 |
-
|
| 129 |
except Exception as e:
|
| 130 |
logger.error(f"Translation error: {e}")
|
| 131 |
-
|
| 132 |
|
| 133 |
def extract_text_from_pdf(self, pdf_path: Path) -> str:
|
| 134 |
"""Extract text directly from PDF as fallback method"""
|
|
@@ -175,20 +192,28 @@ Translated text:"""
|
|
| 175 |
if len(paragraph.strip()) > 10: # Only translate substantial paragraphs
|
| 176 |
logger.info(f"Translating paragraph {i+1}/{len(paragraphs)}: '{paragraph[:50]}...'")
|
| 177 |
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 185 |
|
| 186 |
# Add delay to avoid rate limiting
|
| 187 |
-
await asyncio.sleep(0.
|
| 188 |
else:
|
| 189 |
# Add short text as-is
|
| 190 |
doc.add_paragraph(paragraph)
|
| 191 |
|
|
|
|
|
|
|
|
|
|
| 192 |
# Save translated document
|
| 193 |
translated_path = output_dir / f"translated_{pdf_path.stem}.docx"
|
| 194 |
doc.save(translated_path)
|
|
@@ -284,7 +309,7 @@ Translated text:"""
|
|
| 284 |
raise
|
| 285 |
|
| 286 |
async def translate_docx(self, docx_path: Path, model: str, source_lang: str, target_lang: str, output_dir: Path) -> Tuple[Path, int]:
|
| 287 |
-
"""Translate DOCX document paragraph by paragraph with enhanced
|
| 288 |
try:
|
| 289 |
# Load the document
|
| 290 |
logger.info(f"Loading DOCX document: {docx_path}")
|
|
@@ -298,6 +323,9 @@ Translated text:"""
|
|
| 298 |
text_paragraphs = [p for p in doc.paragraphs if p.text.strip()]
|
| 299 |
logger.info(f"Found {len(text_paragraphs)} paragraphs with text content")
|
| 300 |
|
|
|
|
|
|
|
|
|
|
| 301 |
# Log first few paragraphs for debugging
|
| 302 |
for i, paragraph in enumerate(text_paragraphs[:3]):
|
| 303 |
logger.info(f"Sample paragraph {i+1}: '{paragraph.text[:100]}...'")
|
|
@@ -308,21 +336,27 @@ Translated text:"""
|
|
| 308 |
original_text = paragraph.text.strip()
|
| 309 |
logger.info(f"Translating paragraph {paragraphs_count + 1}/{len(text_paragraphs)}: '{original_text[:50]}...'")
|
| 310 |
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
| 321 |
-
|
| 322 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 323 |
|
| 324 |
# Add small delay to avoid rate limiting
|
| 325 |
-
await asyncio.sleep(0.
|
| 326 |
|
| 327 |
# Translate tables if any
|
| 328 |
table_cells_translated = 0
|
|
@@ -332,16 +366,23 @@ Translated text:"""
|
|
| 332 |
for cell_idx, cell in enumerate(row.cells):
|
| 333 |
if cell.text.strip():
|
| 334 |
original_text = cell.text.strip()
|
| 335 |
-
|
| 336 |
-
|
| 337 |
-
|
| 338 |
-
|
| 339 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 340 |
await asyncio.sleep(0.1)
|
| 341 |
|
| 342 |
logger.info(f"Translated {table_cells_translated} table cells")
|
| 343 |
total_translated = paragraphs_count + table_cells_translated
|
| 344 |
|
|
|
|
|
|
|
|
|
|
| 345 |
# Save translated document
|
| 346 |
translated_path = output_dir / f"translated_{docx_path.name}"
|
| 347 |
doc.save(translated_path)
|
|
@@ -352,6 +393,8 @@ Translated text:"""
|
|
| 352 |
if translated_path.exists():
|
| 353 |
file_size = translated_path.stat().st_size
|
| 354 |
logger.info(f"Translated document saved (size: {file_size} bytes)")
|
|
|
|
|
|
|
| 355 |
|
| 356 |
return translated_path, total_translated
|
| 357 |
|
|
@@ -369,12 +412,14 @@ Translated text:"""
|
|
| 369 |
) -> TranslationReport:
|
| 370 |
"""
|
| 371 |
Main translation function that handles both PDF and DOCX files
|
|
|
|
| 372 |
"""
|
| 373 |
if output_dir is None:
|
| 374 |
output_dir = input_file.parent
|
| 375 |
|
| 376 |
original_file = input_file
|
| 377 |
file_extension = input_file.suffix.lower()
|
|
|
|
| 378 |
|
| 379 |
try:
|
| 380 |
if file_extension == ".pdf":
|
|
@@ -396,9 +441,27 @@ Translated text:"""
|
|
| 396 |
logger.warning("LibreOffice conversion produced no translatable content, trying direct extraction")
|
| 397 |
raise Exception("No content found in LibreOffice conversion")
|
| 398 |
|
| 399 |
-
# Convert translated DOCX back to PDF
|
| 400 |
-
logger.info(f"Converting translated DOCX back to PDF")
|
| 401 |
-
|
| 402 |
|
| 403 |
except Exception as libreoffice_error:
|
| 404 |
logger.warning(f"LibreOffice method failed: {libreoffice_error}")
|
|
@@ -409,8 +472,25 @@ Translated text:"""
|
|
| 409 |
input_file, model, source_language, target_language, output_dir
|
| 410 |
)
|
| 411 |
|
| 412 |
-
# Convert the translated DOCX to PDF
|
| 413 |
-
|
| 414 |
|
| 415 |
# Estimate pages (rough estimate: 1 page = ~500 words)
|
| 416 |
doc = Document(translated_docx)
|
|
@@ -418,12 +498,21 @@ Translated text:"""
|
|
| 418 |
pages_count = max(1, total_words // 500)
|
| 419 |
|
| 420 |
elif file_extension == ".docx":
|
| 421 |
-
# Translate DOCX directly
|
| 422 |
logger.info(f"Translating DOCX {input_file}")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 423 |
translated_file, paragraphs_count = await self.translate_docx(
|
| 424 |
input_file, model, source_language, target_language, output_dir
|
| 425 |
)
|
| 426 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 427 |
# Estimate pages
|
| 428 |
doc = Document(translated_file)
|
| 429 |
total_words = sum(len(p.text.split()) for p in doc.paragraphs)
|
|
@@ -432,6 +521,10 @@ Translated text:"""
|
|
| 432 |
else:
|
| 433 |
raise Exception(f"Unsupported file format: {file_extension}")
|
| 434 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 435 |
return TranslationReport(
|
| 436 |
original_file=original_file,
|
| 437 |
translated_file=translated_file,
|
|
|
|
| 61 |
]
|
| 62 |
|
| 63 |
async def translate_text(self, text: str, model: str, source_lang: str = "auto", target_lang: str = "en") -> str:
|
| 64 |
+
"""Translate text using OpenRouter API with improved prompt and validation"""
|
| 65 |
if not text.strip():
|
| 66 |
return text
|
| 67 |
|
| 68 |
+
# Validate API key first
|
| 69 |
+
if not self.api_key:
|
| 70 |
+
raise Exception("OpenRouter API key not configured")
|
| 71 |
+
|
| 72 |
# Create a more specific translation prompt
|
| 73 |
if source_lang == "auto":
|
| 74 |
prompt = f"""You are a professional document translator. Translate the following text to {target_lang} (Arabic if 'ar', English if 'en', etc.).
|
|
|
|
| 78 |
2. Maintain the original formatting and structure
|
| 79 |
3. Preserve technical terms appropriately
|
| 80 |
4. Return ONLY the translated text
|
| 81 |
+
5. If the text is already in the target language, still provide a proper translation/rewrite
|
| 82 |
|
| 83 |
Text to translate:
|
| 84 |
{text}
|
|
|
|
| 92 |
2. Maintain the original formatting and structure
|
| 93 |
3. Preserve technical terms appropriately
|
| 94 |
4. Return ONLY the translated text
|
| 95 |
+
5. If the text is already in the target language, still provide a proper translation/rewrite
|
| 96 |
|
| 97 |
Text to translate:
|
| 98 |
{text}
|
|
|
|
| 104 |
payload = {
|
| 105 |
"model": model,
|
| 106 |
"messages": [
|
| 107 |
+
{"role": "system", "content": "You are a professional document translator. You MUST provide a translation. Never return the original text unchanged."},
|
| 108 |
{"role": "user", "content": prompt}
|
| 109 |
],
|
| 110 |
"temperature": 0.1,
|
| 111 |
+
"max_tokens": len(text) * 4 + 500 # More generous token limit
|
| 112 |
}
|
| 113 |
|
| 114 |
logger.info(f"Translating text: '{text[:50]}...' from {source_lang} to {target_lang}")
|
|
|
|
| 126 |
if "Translated text:" in translated:
|
| 127 |
translated = translated.split("Translated text:")[-1].strip()
|
| 128 |
|
| 129 |
+
# Remove any introductory phrases
|
| 130 |
+
for phrase in ["Here is the translation:", "Translation:", "The translation is:"]:
|
| 131 |
+
if translated.startswith(phrase):
|
| 132 |
+
translated = translated[len(phrase):].strip()
|
| 133 |
+
|
| 134 |
+
# Validate that we got a meaningful translation
|
| 135 |
+
if not translated or translated == text:
|
| 136 |
+
logger.warning(f"Translation returned empty or unchanged text")
|
| 137 |
+
# Don't fall back to original - raise error instead
|
| 138 |
+
raise Exception("Translation failed: received empty or unchanged text")
|
| 139 |
+
|
| 140 |
logger.info(f"Translation successful: '{translated[:50]}...'")
|
| 141 |
return translated
|
| 142 |
else:
|
| 143 |
error_text = await response.text()
|
| 144 |
logger.error(f"Translation API error: {response.status} - {error_text}")
|
| 145 |
+
raise Exception(f"Translation API error: {response.status} - {error_text}")
|
| 146 |
except Exception as e:
|
| 147 |
logger.error(f"Translation error: {e}")
|
| 148 |
+
raise Exception(f"Translation failed: {str(e)}")
|
| 149 |
|
| 150 |
def extract_text_from_pdf(self, pdf_path: Path) -> str:
|
| 151 |
"""Extract text directly from PDF as fallback method"""
|
|
|
|
| 192 |
if len(paragraph.strip()) > 10: # Only translate substantial paragraphs
|
| 193 |
logger.info(f"Translating paragraph {i+1}/{len(paragraphs)}: '{paragraph[:50]}...'")
|
| 194 |
|
| 195 |
+
try:
|
| 196 |
+
translated_text = await self.translate_text(
|
| 197 |
+
paragraph, model, source_lang, target_lang
|
| 198 |
+
)
|
| 199 |
+
|
| 200 |
+
# Add translated paragraph to document
|
| 201 |
+
doc.add_paragraph(translated_text)
|
| 202 |
+
paragraphs_translated += 1
|
| 203 |
+
|
| 204 |
+
except Exception as trans_error:
|
| 205 |
+
logger.error(f"Failed to translate paragraph: {trans_error}")
|
| 206 |
+
raise Exception(f"Translation failed for paragraph: {str(trans_error)}")
|
| 207 |
|
| 208 |
# Add delay to avoid rate limiting
|
| 209 |
+
await asyncio.sleep(0.3)
|
| 210 |
else:
|
| 211 |
# Add short text as-is
|
| 212 |
doc.add_paragraph(paragraph)
|
| 213 |
|
| 214 |
+
if paragraphs_translated == 0:
|
| 215 |
+
raise Exception("No paragraphs were successfully translated")
|
| 216 |
+
|
| 217 |
# Save translated document
|
| 218 |
translated_path = output_dir / f"translated_{pdf_path.stem}.docx"
|
| 219 |
doc.save(translated_path)
|
|
|
|
| 309 |
raise
|
| 310 |
|
| 311 |
async def translate_docx(self, docx_path: Path, model: str, source_lang: str, target_lang: str, output_dir: Path) -> Tuple[Path, int]:
|
| 312 |
+
"""Translate DOCX document paragraph by paragraph with enhanced validation"""
|
| 313 |
try:
|
| 314 |
# Load the document
|
| 315 |
logger.info(f"Loading DOCX document: {docx_path}")
|
|
|
|
| 323 |
text_paragraphs = [p for p in doc.paragraphs if p.text.strip()]
|
| 324 |
logger.info(f"Found {len(text_paragraphs)} paragraphs with text content")
|
| 325 |
|
| 326 |
+
if len(text_paragraphs) == 0:
|
| 327 |
+
raise Exception("No text content found in document")
|
| 328 |
+
|
| 329 |
# Log first few paragraphs for debugging
|
| 330 |
for i, paragraph in enumerate(text_paragraphs[:3]):
|
| 331 |
logger.info(f"Sample paragraph {i+1}: '{paragraph.text[:100]}...'")
|
|
|
|
| 336 |
original_text = paragraph.text.strip()
|
| 337 |
logger.info(f"Translating paragraph {paragraphs_count + 1}/{len(text_paragraphs)}: '{original_text[:50]}...'")
|
| 338 |
|
| 339 |
+
try:
|
| 340 |
+
translated_text = await self.translate_text(
|
| 341 |
+
original_text, model, source_lang, target_lang
|
| 342 |
+
)
|
| 343 |
+
|
| 344 |
+
# Verify translation actually happened
|
| 345 |
+
if translated_text == original_text:
|
| 346 |
+
logger.warning(f"Translation returned identical text for: '{original_text[:50]}...'")
|
| 347 |
+
# Continue anyway - maybe it was already in target language
|
| 348 |
+
else:
|
| 349 |
+
logger.info(f"Translation successful: '{translated_text[:50]}...'")
|
| 350 |
+
|
| 351 |
+
paragraph.text = translated_text
|
| 352 |
+
paragraphs_count += 1
|
| 353 |
+
|
| 354 |
+
except Exception as trans_error:
|
| 355 |
+
logger.error(f"Failed to translate paragraph: {trans_error}")
|
| 356 |
+
raise Exception(f"Translation failed for paragraph: {str(trans_error)}")
|
| 357 |
|
| 358 |
# Add small delay to avoid rate limiting
|
| 359 |
+
await asyncio.sleep(0.3)
|
| 360 |
|
| 361 |
# Translate tables if any
|
| 362 |
table_cells_translated = 0
|
|
|
|
| 366 |
for cell_idx, cell in enumerate(row.cells):
|
| 367 |
if cell.text.strip():
|
| 368 |
original_text = cell.text.strip()
|
| 369 |
+
try:
|
| 370 |
+
translated_text = await self.translate_text(
|
| 371 |
+
original_text, model, source_lang, target_lang
|
| 372 |
+
)
|
| 373 |
+
cell.text = translated_text
|
| 374 |
+
table_cells_translated += 1
|
| 375 |
+
except Exception as trans_error:
|
| 376 |
+
logger.warning(f"Failed to translate table cell: {trans_error}")
|
| 377 |
+
# Continue with other cells
|
| 378 |
await asyncio.sleep(0.1)
|
| 379 |
|
| 380 |
logger.info(f"Translated {table_cells_translated} table cells")
|
| 381 |
total_translated = paragraphs_count + table_cells_translated
|
| 382 |
|
| 383 |
+
if total_translated == 0:
|
| 384 |
+
raise Exception("No content was successfully translated")
|
| 385 |
+
|
| 386 |
# Save translated document
|
| 387 |
translated_path = output_dir / f"translated_{docx_path.name}"
|
| 388 |
doc.save(translated_path)
|
|
|
|
| 393 |
if translated_path.exists():
|
| 394 |
file_size = translated_path.stat().st_size
|
| 395 |
logger.info(f"Translated document saved (size: {file_size} bytes)")
|
| 396 |
+
else:
|
| 397 |
+
raise Exception("Failed to save translated document")
|
| 398 |
|
| 399 |
return translated_path, total_translated
|
| 400 |
|
|
|
|
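The paragraph and table loops above pace requests with fixed `asyncio.sleep` delays and treat any exception as final. One possible alternative, shown here only as a sketch, is to retry a failed call with exponential backoff before giving up. The helper name `translate_with_retry` and its parameters are hypothetical and do not appear in this commit; the `translate` argument stands in for a one-argument wrapper around `self.translate_text(...)`.

```python
import asyncio
from typing import Awaitable, Callable


async def translate_with_retry(
    translate: Callable[[str], Awaitable[str]],
    text: str,
    attempts: int = 3,
    base_delay: float = 0.3,
) -> str:
    """Call translate(text), retrying with exponential backoff on failure.

    Hypothetical helper: `translate` is any coroutine function taking the
    source text and returning the translated text.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await translate(text)
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With a wrapper like this, the caller could keep the existing behavior (raise for paragraphs, log and continue for table cells) while tolerating a transient rate-limit error on an individual request.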
| 412 |
) -> TranslationReport:
|
| 413 |
"""
|
| 414 |
Main translation function that handles both PDF and DOCX files
|
| 415 |
+
Maintains original filename and format (PDF input returns PDF output)
|
| 416 |
"""
|
| 417 |
if output_dir is None:
|
| 418 |
output_dir = input_file.parent
|
| 419 |
|
| 420 |
original_file = input_file
|
| 421 |
file_extension = input_file.suffix.lower()
|
| 422 |
+
original_filename = input_file.stem # filename without extension
|
| 423 |
|
| 424 |
try:
|
| 425 |
if file_extension == ".pdf":
|
|
|
|
| 441 |
logger.warning("LibreOffice conversion produced no translatable content, trying direct extraction")
|
| 442 |
raise Exception("No content found in LibreOffice conversion")
|
| 443 |
|
| 444 |
+
# Convert translated DOCX back to PDF with ORIGINAL filename
|
| 445 |
+
logger.info("Converting translated DOCX back to PDF with original filename")
|
| 446 |
+
final_translated_file = output_dir / f"{original_filename}.pdf"
|
| 447 |
+
|
| 448 |
+
# Use LibreOffice to convert with specific output name
|
| 449 |
+
cmd = [
|
| 450 |
+
"libreoffice",
|
| 451 |
+
"--headless",
|
| 452 |
+
"--convert-to", "pdf",
|
| 453 |
+
"--outdir", str(output_dir),
|
| 454 |
+
str(translated_docx)
|
| 455 |
+
]
|
| 456 |
+
|
| 457 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
|
| 458 |
+
|
| 459 |
+
# LibreOffice creates file with docx stem name, rename to original
|
| 460 |
+
temp_pdf = output_dir / f"{translated_docx.stem}.pdf"
|
| 461 |
+
if temp_pdf.exists() and temp_pdf != final_translated_file:
|
| 462 |
+
temp_pdf.replace(final_translated_file)
|
| 463 |
+
|
| 464 |
+
translated_file = final_translated_file
|
| 465 |
|
| 466 |
except Exception as libreoffice_error:
|
| 467 |
logger.warning(f"LibreOffice method failed: {libreoffice_error}")
|
|
|
|
| 472 |
input_file, model, source_language, target_language, output_dir
|
| 473 |
)
|
| 474 |
|
| 475 |
+
# Convert the translated DOCX to PDF with original filename
|
| 476 |
+
final_translated_file = output_dir / f"{original_filename}.pdf"
|
| 477 |
+
|
| 478 |
+
cmd = [
|
| 479 |
+
"libreoffice",
|
| 480 |
+
"--headless",
|
| 481 |
+
"--convert-to", "pdf",
|
| 482 |
+
"--outdir", str(output_dir),
|
| 483 |
+
str(translated_docx)
|
| 484 |
+
]
|
| 485 |
+
|
| 486 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
|
| 487 |
+
|
| 488 |
+
# LibreOffice creates file with docx stem name, rename to original
|
| 489 |
+
temp_pdf = output_dir / f"{translated_docx.stem}.pdf"
|
| 490 |
+
if temp_pdf.exists() and temp_pdf != final_translated_file:
|
| 491 |
+
temp_pdf.replace(final_translated_file)
|
| 492 |
+
|
| 493 |
+
translated_file = final_translated_file
|
| 494 |
|
| 495 |
# Estimate pages (rough estimate: 1 page = ~500 words)
|
| 496 |
doc = Document(translated_docx)
|
|
|
|
| 498 |
pages_count = max(1, total_words // 500)
|
| 499 |
|
| 500 |
elif file_extension == ".docx":
|
| 501 |
+
# Translate DOCX directly, keeping original filename
|
| 502 |
logger.info(f"Translating DOCX {input_file}")
|
| 503 |
+
|
| 504 |
+
# Create output file with original filename
|
| 505 |
+
final_translated_file = output_dir / f"{original_filename}.docx"
|
| 506 |
+
|
| 507 |
translated_file, paragraphs_count = await self.translate_docx(
|
| 508 |
input_file, model, source_language, target_language, output_dir
|
| 509 |
)
|
| 510 |
|
| 511 |
+
# Rename to original filename if different
|
| 512 |
+
if translated_file != final_translated_file:
|
| 513 |
+
translated_file.replace(final_translated_file)
|
| 514 |
+
translated_file = final_translated_file
|
| 515 |
+
|
| 516 |
# Estimate pages
|
| 517 |
doc = Document(translated_file)
|
| 518 |
total_words = sum(len(p.text.split()) for p in doc.paragraphs)
|
|
|
|
| 521 |
else:
|
| 522 |
raise Exception(f"Unsupported file format: {file_extension}")
|
| 523 |
|
| 524 |
+
# Verify translation was successful
|
| 525 |
+
if paragraphs_count == 0:
|
| 526 |
+
raise Exception("Translation failed: No paragraphs were translated")
|
| 527 |
+
|
| 528 |
return TranslationReport(
|
| 529 |
original_file=original_file,
|
| 530 |
translated_file=translated_file,
|
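The two DOCX-to-PDF blocks in this diff are identical, and neither checks the result of `subprocess.run`, so a failed LibreOffice run would go unnoticed until a later step. A minimal sketch of a shared helper follows. The function name `convert_docx_to_pdf` and the injectable `run` parameter are assumptions made for illustration (the latter only so the helper can be exercised without LibreOffice installed); the command line itself is copied from the diff.

```python
import subprocess
from pathlib import Path
from typing import Callable


def convert_docx_to_pdf(
    docx_path: Path,
    output_dir: Path,
    final_name: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout: int = 60,
) -> Path:
    """Convert docx_path to PDF with LibreOffice and store it as final_name.

    Hypothetical helper, not part of the commit shown above.
    """
    cmd = [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(docx_path),
    ]
    result = run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"LibreOffice conversion failed: {result.stderr.strip()}")

    # LibreOffice names the output after the input file's stem.
    produced = output_dir / f"{docx_path.stem}.pdf"
    if not produced.exists():
        raise FileNotFoundError(f"Expected converted file not found: {produced}")

    final_path = output_dir / final_name
    if produced != final_path:
        # Path.replace overwrites an existing target on every platform,
        # whereas Path.rename raises FileExistsError on Windows.
        produced.replace(final_path)
    return final_path
```

Both call sites could then reduce to something like `translated_file = convert_docx_to_pdf(translated_docx, output_dir, f"{original_filename}.pdf")`. Note that when `output_dir` defaults to `input_file.parent`, that target path is the original PDF itself, so the input would be overwritten; whether that is intended is not stated in the commit.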
اخطاء.txt
DELETED
|
The diff for this file is too large to render.