Converting MIT Shakespeare "full.html" to .fountain
Scene Headings Conversion:
- Acts are identified and prefixed with a single hash (#), which makes each act a top-level section.
- Scene headings are enclosed in <h3> tags in the HTML. Each is converted by uppercasing its text and prefixing it with ##, which nests the scene as a section under its act.
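
The heading step could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name convert_headings is hypothetical, and it assumes that act headings also sit in <h3> tags and begin with the word "ACT", which the notes above do not spell out.

```python
import re

def convert_headings(html: str) -> str:
    """Turn <h3> act and scene headings into Fountain section lines."""
    def repl(match: re.Match) -> str:
        # Drop any tags nested inside the heading and normalise the text.
        text = re.sub(r"<[^>]+>", "", match.group(1)).strip().upper()
        # Assumption: a heading that starts with "ACT" is an act, anything else is a scene.
        prefix = "#" if text.startswith("ACT") else "##"
        return f"\n\n{prefix} {text}\n\n"

    return re.sub(r"<h3>(.*?)</h3>", repl, html, flags=re.IGNORECASE | re.DOTALL)
```

For example, convert_headings("<h3>SCENE I. Elsinore.</h3>") yields a line reading "## SCENE I. ELSINORE." surrounded by blank lines.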
Character Names Conversion:
- Character names in the HTML are enclosed within <b> tags and associated with a speech identifier (<A NAME=speech\d+>).
- In the conversion, these names are extracted, converted to uppercase, and placed on separate lines above their dialogue.
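
One possible way to extract the names is sketched below. It assumes the <b> element is nested directly inside the speech anchor, as in <A NAME=speech1><b>BERNARDO</b></a>; the pattern SPEECH_RE and the function convert_character_names are hypothetical names used only for illustration.

```python
import re

# Assumption: the name sits in <b>...</b> directly inside the speech anchor.
SPEECH_RE = re.compile(
    r'<a\s+name="?speech\d+"?\s*>\s*<b>(.*?)</b>\s*</a>',
    re.IGNORECASE | re.DOTALL,
)

def convert_character_names(html: str) -> str:
    """Replace each speech anchor with an uppercase name on its own line."""
    return SPEECH_RE.sub(lambda m: f"\n\n{m.group(1).strip().upper()}\n", html)
```

The blank line emitted before each name keeps it separate from the previous speech, and the single newline after it leaves room for the dialogue to follow directly.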
Dialogue Conversion:
- Dialogue in HTML is typically found within <blockquote> tags.
- The conversion strips these tags, maintaining the plain text of the dialogue. Each line of dialogue is placed immediately after the character's name.
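
A minimal sketch of the dialogue step follows. It assumes each verse line inside the <blockquote> is wrapped in an <a> anchor and ends with <br>; the function name convert_dialogue is hypothetical. Only anchors and line breaks are removed here, so other inline tags are left for later steps to handle.

```python
import re

def convert_dialogue(html: str) -> str:
    """Unwrap <blockquote> speeches into plain lines of dialogue."""
    def repl(match: re.Match) -> str:
        body = match.group(1)
        # Assumption: <br> marks the end of each verse line.
        body = re.sub(r"<br\s*/?>", "\n", body, flags=re.IGNORECASE)
        # Remove the per-line anchors but keep their text.
        body = re.sub(r"</?a\b[^>]*>", "", body, flags=re.IGNORECASE)
        lines = [line.strip() for line in body.splitlines() if line.strip()]
        return "\n".join(lines) + "\n"

    return re.sub(
        r"<blockquote>(.*?)</blockquote>", repl, html,
        flags=re.IGNORECASE | re.DOTALL,
    )
```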
Stage Directions Conversion:
- Stage directions are often italicized in the HTML (<i> tags).
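- These are converted into the .fountain format by wrapping them in parentheses and placing each on its own line. Note that Fountain reads a parenthesized line as a parenthetical only when it sits inside a dialogue block; elsewhere it is treated as action.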
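
The stage-direction step might look like the sketch below. It assumes every <i> element is a stage direction, which may not hold for every play; the function name convert_stage_directions is hypothetical.

```python
import re

def convert_stage_directions(html: str) -> str:
    """Wrap the text of each <i> element in parentheses on its own line."""
    def repl(match: re.Match) -> str:
        # Collapse internal whitespace so a direction stays on one line.
        text = " ".join(match.group(1).split())
        return f"\n({text})\n" if text else ""

    return re.sub(r"<i>(.*?)</i>", repl, html, flags=re.IGNORECASE | re.DOTALL)
```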
Formatting and Cleanup:
- All other HTML tags are removed to leave only the plain text required for the .fountain format.
- Multiple consecutive newlines are collapsed into two newlines to maintain proper spacing between script elements.
- The script is stripped of leading and trailing whitespace for a clean start and end.
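
The cleanup pass can be sketched in a few lines. The function name clean_up is hypothetical, and two of its steps are assumptions added for illustration rather than part of the notes above: decoding HTML entities and trimming trailing spaces before newlines.

```python
import html as html_lib
import re

def clean_up(text: str) -> str:
    """Remove leftover tags and normalise blank lines."""
    text = re.sub(r"<[^>]+>", "", text)      # drop any remaining HTML tags
    text = html_lib.unescape(text)           # assumption: decode entities such as &amp;
    text = re.sub(r"[ \t]+\n", "\n", text)   # assumption: trim trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of newlines to two
    return text.strip()                      # clean start and end
```

This pass belongs last, after the structural conversions, because it discards every tag they rely on.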
Special Handling for Empty Lines in Dialogue:
- If empty lines in the HTML leave a gap between a character name and its dialogue, the gap is removed so that the dialogue starts on the line directly after the name, as Fountain requires.
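
One heuristic for closing such gaps is sketched below. It assumes that a line made up only of capital letters, spaces, periods, apostrophes, and hyphens is a character name, so an all-caps line of dialogue would be misread; the function name attach_dialogue is hypothetical.

```python
import re

# Assumption: an all-caps line is a character name. Section lines are
# skipped because they start with "#", which the pattern does not allow.
NAME_GAP_RE = re.compile(r"(?m)^([A-Z][A-Z .'\-]*)\n(?:[ \t]*\n)+(?=[^\s#])")

def attach_dialogue(text: str) -> str:
    """Remove blank lines between a character name and the text after it."""
    return NAME_GAP_RE.sub(r"\1\n", text)
```

Blank lines between separate speeches are left alone, since only the gap directly after a name line is matched.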
This algorithm gives a structured way to convert HTML-formatted plays into the .fountain format: it preserves the core elements of a screenplay (scene headings, character names, dialogue, and stage directions) and strips out the HTML-specific formatting.
For converting the entire canon of Shakespeare's plays, you can apply this algorithm to each play's HTML file. The process automates most of the formatting, but a manual review is recommended for each conversion to ensure accuracy and to make any necessary adjustments for unique elements in certain plays.
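
A batch run over many plays could be sketched as follows. The directory layout (one subdirectory per play, each holding a full.html), the UTF-8 encoding, and the names convert_all and convert are all assumptions made for illustration; convert stands for whatever single-play conversion function is in use.

```python
from pathlib import Path
from typing import Callable

def convert_all(
    src_dir: Path, dst_dir: Path, convert: Callable[[str], str]
) -> list[Path]:
    """Convert every <play>/full.html under src_dir to <play>.fountain."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src_dir.glob("*/full.html")):
        # errors="replace" keeps the run going if a file has stray bytes.
        text = html_path.read_text(encoding="utf-8", errors="replace")
        out_path = dst_dir / f"{html_path.parent.name}.fountain"
        out_path.write_text(convert(text), encoding="utf-8")
        written.append(out_path)
    return written
```

The returned list of paths doubles as a checklist for the manual review pass.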