🖼️ Samples
 
                    FenScribe Graphical User Interface
 
                     Example of PDF Optimization
💻 Installation & Dependencies
1. First, download or clone the repository:
- Option A (Git): Clone the repository using Git:
                    git clone https://github.com/ordylan/FenScribe.git cd FenScribe
- Option B (Download ZIP): Download the ZIP file from the GitHub page: https://github.com/ordylan/FenScribe. Then, unzip the file and navigate into the extracted directory in your terminal.
2. Install the required Python libraries using pip:
pip install PyMuPDF Pillow python-docx tkinterdnd2🪄 Usage
- Navigate to the directory where you downloaded/cloned FenScribe.
- Run gui.pywfor the graphical interface (CLI version currently unavailable). Double-click the file or runpython gui.pywin your terminal.
- After processing, the output Word document (output_*.docx) will be saved in the__Output/directory.
- Use the provided Office macro (图片溢出缩小.bas) to resize overflowing images in the generated Word document.- Open the Word document (output_*.docx).
- Open the VBA editor (Alt + F11).
- Import the 图片溢出缩小.basfile (File > Import File...).
- Run the macro双栏图片溢出自动调()(Run > Run Sub/UserForm or press F5).
- The macro automatically detects the first column width in Word and scales oversized images to fit.
 
- Open the Word document (
⚙️ Configuration Parameters
These parameters can be adjusted within the GUI or potentially in a future configuration file:
| Parameter | Description | 
|---|---|
| threshold | Brightness threshold for blank line detection (0-255). Converts RGB to grayscale average; rows ≥ threshold are considered blank. Higher values detect lighter grays as blank. | 
| dpi | Image resolution (Dots Per Inch) used when converting PDF pages to images for analysis. Higher DPI increases precision but also processing time and memory usage. | 
| min_height | Content validity filter (in pixels). Only preserves content blocks with height ≥ this value. Helps filter out small noise or artifacts. | 
| blank_height | Paragraph separation baseline (in pixels). Content is split into separate paragraphs when consecutive blank lines reach this height. Defines the minimum vertical gap considered a paragraph break. | 
⚠️ Important Notes
                Potential Issues & Considerations:
                
        - Small images or geometric shapes might be accidentally removed or cropped during blank space removal if they fall below the min_heightthreshold.
- Narrow vertical images or color blocks could be misidentified as separators. Adjusting configuration parameters like min_heightorthresholdmight help.
- Scanned documents must have reasonably horizontal text alignment. Pre-process significantly tilted pages (deskew) before using FenScribe for best results.
- Manual margin trimming might still be required in the source PDF or output Word document to remove headers/footers if they interfere with content detection or are unwanted.
- Performance depends on PDF complexity, DPI setting, and system resources. Large or complex PDFs may take time to process.
📜 License
This project is licensed under the MIT License.
Development Notes
This third-generation version of FenScribe features:
- A Graphical User Interface (GUI) implementation for more user-friendly operation.
- Partial utilization of AI-assisted development tools during its creation.
- Continuous optimization and refinement through multiple development iterations.
- Focus on converting PDF content to Word format (.docx) for easier editing and manipulation post-processing.