🖼️ Samples

FenScribe Graphical User Interface

Example of PDF Optimization
💻 Installation & Dependencies
1. First, download or clone the repository:
- Option A (Git): Clone the repository using Git:
git clone https://github.com/ordylan/FenScribe.git
cd FenScribe
- Option B (Download ZIP): Download the ZIP file from the GitHub page: https://github.com/ordylan/FenScribe. Then, unzip the file and navigate into the extracted directory in your terminal.
2. Install the required Python libraries using pip:
pip install PyMuPDF Pillow python-docx tkinterdnd2
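To confirm the install worked before launching the GUI, a small check like the one below may help. It is only a sketch: the import names (`fitz` for PyMuPDF, `PIL` for Pillow, `docx` for python-docx) are the names those packages usually install under, and the helper `missing_packages` is hypothetical, not part of FenScribe.

```python
import importlib.util

# Assumed mapping from pip package name to importable module name.
# The module names differ from the package names for three of the four.
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
    "python-docx": "docx",
    "tkinterdnd2": "tkinterdnd2",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose module cannot be located."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any package it reports as missing can be installed again with the pip command above.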
🪄 Usage
- Navigate to the directory where you downloaded/cloned FenScribe.
- Run `gui.pyw` for the graphical interface (a CLI version is currently unavailable). Double-click the file or run `python gui.pyw` in your terminal.
- After processing, the output Word document (`output_*.docx`) will be saved in the `__Output/` directory.
- Use the provided Office macro (`图片溢出缩小.bas`) to resize overflowing images in the generated Word document.
  - Open the Word document (`output_*.docx`).
  - Open the VBA editor (Alt + F11).
  - Import the `图片溢出缩小.bas` file (File > Import File...).
  - Run the macro `双栏图片溢出自动调()` (Run > Run Sub/UserForm or press F5).
  - The macro automatically detects the first column width in Word and scales oversized images to fit.
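The scaling step the macro performs can be illustrated with a short Python sketch. This is not the macro's VBA code. It only shows the proportional-resize idea, and the function name `fit_to_width` and the choice of units are assumptions made for illustration.

```python
def fit_to_width(width, height, max_width):
    """Scale an image size down so it fits within max_width.

    The aspect ratio is preserved. Images that already fit are
    returned unchanged. All values share one unit (for example points).
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, height * scale


# An image twice as wide as the column is halved in both dimensions.
print(fit_to_width(400, 200, 200))
```

Here `max_width` stands in for the column width that the macro reads from the Word document.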
⚙️ Configuration Parameters
These parameters can be adjusted within the GUI or potentially in a future configuration file:
Parameter | Description |
---|---|
threshold | Brightness threshold for blank line detection (0-255). RGB is converted to a grayscale average; rows whose value is ≥ threshold are considered blank. With a higher value, only lighter rows count as blank. |
dpi | Image resolution (Dots Per Inch) used when converting PDF pages to images for analysis. Higher DPI increases precision but also processing time and memory usage. |
min_height | Content validity filter (in pixels). Only preserves content blocks with height ≥ this value. Helps filter out small noise or artifacts. |
blank_height | Paragraph separation baseline (in pixels). Content is split into separate paragraphs when consecutive blank lines reach this height. Defines the minimum vertical gap considered a paragraph break. |
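To make the interaction between `threshold`, `min_height`, and `blank_height` concrete, here is a minimal sketch of row-based splitting that follows the descriptions in the table. It is an illustration under stated assumptions, not FenScribe's actual implementation: the function names and default values are hypothetical, and it works on plain Python lists instead of a rendered page image.

```python
def row_brightness(row):
    """Average grayscale brightness of one pixel row of (r, g, b) tuples."""
    return sum((r + g + b) / 3 for r, g, b in row) / len(row)


def split_blocks(brightness, threshold=250, min_height=5, blank_height=10):
    """Return (start, end) row ranges of content blocks, end exclusive.

    A row is blank when its brightness is >= threshold. A run of at
    least blank_height blank rows closes the current block. Shorter
    blank runs stay inside it. Blocks shorter than min_height rows
    are dropped as noise.
    """
    blocks = []
    start = None          # first row of the block being built
    last_content = None   # most recent non-blank row
    blank_run = 0         # length of the current run of blank rows
    for i, value in enumerate(brightness):
        if value >= threshold:
            blank_run += 1
            if start is not None and blank_run >= blank_height:
                blocks.append((start, last_content + 1))
                start = None
        else:
            if start is None:
                start = i
            last_content = i
            blank_run = 0
    if start is not None:
        blocks.append((start, last_content + 1))
    return [(s, e) for s, e in blocks if e - s >= min_height]
```

In this sketch, lowering `blank_height` splits content into more blocks, while raising `min_height` discards more of the small ones.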
⚠️ Important Notes
Potential Issues & Considerations:
- Small images or geometric shapes might be accidentally removed or cropped during blank space removal if they fall below the `min_height` threshold.
- Narrow vertical images or color blocks could be misidentified as separators. Adjusting configuration parameters like `min_height` or `threshold` might help.
- Scanned documents must have reasonably horizontal text alignment. Pre-process significantly tilted pages (deskew) before using FenScribe for best results.
- Manual margin trimming might still be required in the source PDF or output Word document to remove headers/footers if they interfere with content detection or are unwanted.
- Performance depends on PDF complexity, DPI setting, and system resources. Large or complex PDFs may take time to process.
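As a rough guide to how the DPI setting affects memory, the sketch below estimates the pixel size of a rendered page and the bytes needed for an uncompressed RGB image. It relies on the fact that PDF dimensions are measured in points (1/72 inch). The function names are hypothetical, and the rounding may differ slightly from what the rendering library actually does.

```python
def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page rendered at the given DPI."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)


def rgb_bytes(width_pt, height_pt, dpi):
    """Approximate bytes for an uncompressed 8-bit RGB rendering."""
    width_px, height_px = rendered_size(width_pt, height_pt, dpi)
    return width_px * height_px * 3


if __name__ == "__main__":
    # An A4 page is about 595 x 842 points.
    for dpi in (72, 150, 300):
        size = rendered_size(595, 842, dpi)
        print(dpi, size, rgb_bytes(595, 842, dpi))
```

Because both dimensions scale with DPI, doubling the DPI roughly quadruples the memory needed per page.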
📜 License
This project is licensed under the MIT License.
Development Notes
This third-generation version of FenScribe features:
- A Graphical User Interface (GUI) implementation for more user-friendly operation.
- Partial utilization of AI-assisted development tools during its creation.
- Continuous optimization and refinement through multiple development iterations.
- Focus on converting PDF content to Word format (`.docx`) for easier editing and manipulation post-processing.