table_page_break_str mapped CellBreak→"CELL" and RowBreak→"TABLE", which is inconsistent with the HWPX parser (parser/hwpx/section.rs), where pageBreak "TABLE"→CellBreak and "CELL"/"ROW"→RowBreak. The serializer must be the inverse of the parser, so a HWPX→IR→HWPX roundtrip swapped the CELL↔TABLE semantics of a table's page-break attribute. Flip the mapping to CellBreak→"TABLE", RowBreak→"CELL". The HWP5 bit emission (serializer/control.rs) keys off the enum directly (CellBreak→bit0, RowBreak→bit1) and is unaffected. Add a regression test pinning the serializer string mapping as the exact parser inverse.
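A minimal sketch of the inverse-mapping property this commit pins. The enum name `TablePageBreak`, its `None` variant, the `"NONE"` string, and the function `parse_table_page_break` are hypothetical stand-ins; only `table_page_break_str` and the CellBreak/RowBreak string mappings come from the commit text.

```rust
/// Hypothetical stand-in for the IR enum (the real type lives in rhwp's model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TablePageBreak {
    None,
    CellBreak,
    RowBreak,
}

/// Parser direction as described above: "TABLE" -> CellBreak, "CELL"/"ROW" -> RowBreak.
/// Treating every other string as `None` is an assumption of this sketch.
pub fn parse_table_page_break(s: &str) -> TablePageBreak {
    match s {
        "TABLE" => TablePageBreak::CellBreak,
        "CELL" | "ROW" => TablePageBreak::RowBreak,
        _ => TablePageBreak::None,
    }
}

/// Serializer direction: the inverse of the parser on its canonical strings.
pub fn table_page_break_str(pb: TablePageBreak) -> &'static str {
    match pb {
        TablePageBreak::CellBreak => "TABLE",
        TablePageBreak::RowBreak => "CELL",
        TablePageBreak::None => "NONE", // assumed spelling
    }
}
```

A regression test in this style checks that parsing the serialized string returns the starting variant for every variant, which is the property the swapped mapping violated.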
HWPX serialization regenerated version.xml, settings.xml, Preview/PrvText.txt and Preview/PrvImage.png from hardcoded constants, which dropped real data on round-trip of Hancom-converted .hwpx:
- version.xml: a fixed Windows/build value overwrote the document's actual platform version
- settings.xml: PrintInfo (zoom / print settings) was lost
- Preview/PrvImage.png: real thumbnail (65737 B) replaced by a 68 B placeholder
- Preview/PrvText.txt: real preview text (1437 B) replaced by 2 B
These entries are not modeled in the IR and have no ID coupling to the edited body, so regeneration is pure loss. Capture the original bytes at parse time (Document.hwpx_aux_entries) and emit them verbatim at serialize time, falling back to the constants only when absent (HWP5 input, synthetic documents).
Verified on 보고서 발간 요청서(PUB).hwpx: all four entries now byte-identical between original and round-trip.
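The capture-then-fall-back strategy can be sketched as below. The map type and the helper name `aux_entry_bytes` are assumptions made for illustration; the commit only states that the original bytes are stored in `Document.hwpx_aux_entries` and that the constants are used when an entry is absent.

```rust
use std::collections::HashMap;

/// Returns the bytes to write for one auxiliary ZIP entry.
/// `captured` is assumed to map entry paths to the bytes seen at parse time;
/// `fallback` is the hardcoded constant used for HWP5 or synthetic documents.
pub fn aux_entry_bytes<'a>(
    captured: &'a HashMap<String, Vec<u8>>,
    path: &str,
    fallback: &'a [u8],
) -> &'a [u8] {
    captured.get(path).map(Vec::as_slice).unwrap_or(fallback)
}
```

Because the captured bytes are never re-encoded, an entry that was present at parse time is written back unchanged.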
...char ns
Round-tripping a Hancom-converted .hwpx corrupted and dropped Contents/content.hpf data:
- <opf:metadata> creator was overwritten with "rhwp", erasing the real author; lastsaveby, the CreatedDate/ModifiedDate values, date, subject, description and keyword were dropped entirely
- spine <opf:itemref> lost its linear="yes" attribute
- the package element dropped the xmlns:hwpunitchar declaration
Manifest/spine item lists genuinely depend on the body (sections, BinData) and must stay regenerated from the IR, but the metadata block has no body coupling. Capture the original content.hpf at parse time and splice its <opf:metadata>...</opf:metadata> block verbatim when present (fall back to the hardcoded block for HWP5 / synthetic docs); always emit linear="yes" and the hwpunitchar namespace to match Hancom.
Verified on 보고서 발간 요청서(PUB).hwpx: content.hpf metadata block is now byte-identical to the original; the only remaining delta is one optional space in the XML declaration (`"yes" ?>` vs `"yes"?>`), a cross-cutting serializer style with no semantic effect.
...rapper write_border_fill emitted an empty <hc:fillBrush></hc:fillBrush> wrapper (a Stage 1 placeholder), dropping every fill: winBrush background color, gradation and imgBrush. On the Hancom request form this erased the gray cell shading (faceColor="#D6D6D6") of the table header rows. The shape serializer already has a complete, tested inverse of the parser (write_fill_brush, covering solid/gradient/image). Share it (pub(crate)) and call it from write_border_fill so border fills and shape fills use one fill-brush writer. FillType::None still emits no fillBrush, matching the parser's "absent fillBrush" reading. Verified on 보고서 발간 요청서(PUB).hwpx: borderFill 6/8/9 now emit <hc:winBrush faceColor="#D6D6D6" .../> identical to the original. Known residue (parser-side, not this fix): borderFills with faceColor="none" collapse to FillType::None at parse (Issue edwardkim#1172), so their empty winBrush wrapper is not reconstructed — invisible, no visual effect.
...tructure
paraPr serialized margin + lineSpacing as a single flat block with the wrong namespace (<hh:intent>/<hh:left>... instead of <hc:...>), wrong attribute order (unit before value), and no <hp:switch> wrapper.
Hancom writes margin+lineSpacing inside <hp:switch><hp:case required-namespace=HwpUnitChar>...</hp:case> <hp:default>...</hp:default></hp:switch>, where the HwpUnitChar case holds half the stored value (the parser loads case×2 into the IR) and the default holds the full value. The flat form dropped the switch entirely and mis-namespaced every margin child.
Reconstruct the switch as the exact inverse of parse_para_shape_switch: default = stored value, case = stored/2 for margins and for Fixed/SpaceOnly/Minimum lineSpacing (PERCENT is identical in both). The parser reads the case first, so case×2 restores the IR exactly (Hancom stored values are always even). Also fix the child order to match the original (align, heading, breakSetting, autoSpacing, switch, border) and emit margin children as <hc:...> with value before unit.
Verified on 보고서 발간 요청서(PUB).hwpx: paraPr margin/lineSpacing block is now byte-identical to the original (case intent=-1310, default intent=-2620, PERCENT 130, hc: namespace). baseline_all_samples roundtrip still green, confirming case/default reconstruction preserves the IR across the whole corpus.
Residual (distinct, not this finding): snapToGrid is still hardcoded "1" while the original has "0".
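The case/default arithmetic can be sketched as follows. The type and function names here are hypothetical; only the rule itself (case = stored/2 except for PERCENT, default = stored, parser restores case×2) is taken from the commit text.

```rust
/// Hypothetical mirror of the line-spacing kinds named in the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LineSpacingType {
    Percent,
    Fixed,
    SpaceOnly,
    Minimum,
}

/// (case value, default value) for a margin child such as intent/left/right.
pub fn margin_case_default(stored: i32) -> (i32, i32) {
    (stored / 2, stored)
}

/// (case value, default value) for lineSpacing; PERCENT is identical in both.
pub fn line_spacing_case_default(kind: LineSpacingType, stored: i32) -> (i32, i32) {
    match kind {
        LineSpacingType::Percent => (stored, stored),
        _ => (stored / 2, stored),
    }
}

/// Parser side: the HwpUnitChar case value is doubled on load.
pub fn margin_from_case(case_value: i32) -> i32 {
    case_value * 2
}
```

With the values quoted in the verification, a stored intent of -2620 yields case -1310 and default -2620, and doubling the case value restores -2620. The halving is exact only for even stored values, which the commit states always holds for Hancom output.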
Hancom emits <hh:underline>, <hh:strikeout>, <hh:outline> and <hh:shadow> on every charPr even when the effect is NONE; the serializer emitted them only when active, dropping all four from inactive charPrs (e.g. the request form's default text, which loses the NONE forms including the shadow's #B2B2B2 / 10,10 defaults the parser preserves). Emit all four unconditionally from the model. The parser already loads shape/color/offset even when type=NONE, so this round-trips exactly and reproduces the original. Critical detail: the parser derives `strikethrough` from the shape (is_real_strike_shape), so an inactive strikeout must be written as shape="NONE" — writing a real shape would flip strikethrough on at re-parse. Verified on 보고서 발간 요청서(PUB).hwpx: charPr id=0 now byte-identical to the original. Full suite + baseline roundtrip green.
...attr1
These three paraPr attributes were hardcoded to "0"/"0"/"1" in the HWPX
header serializer, dropping the values the parser had preserved in
ParaShape.attr1 (snapToGrid=bit8, condense=bits9..15, fontLineHeight=bit22).
Hancom forms use condense=20 and snapToGrid=0 on several paraShapes, so the
roundtrip silently rewrote them.
Derive all three from attr1 so regenerated header.xml matches the original.
Verified on 보고서 발간 요청서(PUB).hwpx: condense {0:28,20:7} and
snapToGrid {1:5,0:30} now match the original exactly.
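A minimal sketch of deriving the three attributes from attr1, using the bit positions quoted above (snapToGrid=bit8, condense=bits9..15, fontLineHeight=bit22). The function names are hypothetical, and reading condense as an unsigned 7-bit field is an assumption of this sketch.

```rust
/// snapToGrid lives in bit 8 of ParaShape.attr1.
pub fn snap_to_grid(attr1: u32) -> u32 {
    (attr1 >> 8) & 0x1
}

/// condense occupies bits 9..=15 (7 bits); treated as unsigned here (assumption).
pub fn condense(attr1: u32) -> u32 {
    (attr1 >> 9) & 0x7F
}

/// fontLineHeight lives in bit 22.
pub fn font_line_height(attr1: u32) -> u32 {
    (attr1 >> 22) & 0x1
}
```

For example, an attr1 with condense=20 and bit 8 clear serializes as condense="20" snapToGrid="0", the combination the commit reports on several paraShapes.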
The HWPX header serializer emitted every <hh:font> as a self-closing tag, dropping the <hh:typeInfo> panose-class child even though the parser already captured it into Font.type_info ([u8;10]). Hancom forms carry typeInfo on 13 fonts (FCAT_GOTHIC/FCAT_MYUNGJO with weight/strokeVariation/...); the roundtrip silently lost all of them. Add write_font_type_info (the exact inverse of parse_font_type_info's byte layout) and font_family_type_str (inverse of font_family_type_to_u8), and emit the typeInfo child whenever Font.type_info is Some. Byte [1] (serif type) stays unemitted because the parser re-synthesizes it from the font name/type, so the roundtrip is exact. Fonts without typeInfo keep the self-closing form. Verified on 보고서 발간 요청서(PUB).hwpx: 13 typeInfo elements, all identical to the original (diff=0).
The HWPX header serializer dropped <hh:substFont> entirely (font emitted as a self-closing tag), and the parser never captured it. Hancom forms carry a substFont on 13 fonts (e.g. 한컴바탕), specifying the fallback face plus its own type/embed info — independent of the parent <hh:font> (an HFT font can carry a TTF substitute). Add a SubstFont model struct (face/font_type/is_embedded/bin_item_id_ref, each attribute preserved so embedded substitutes round-trip too), capture it in the HWPX parser (parse_subst_font), and emit it in write_fontfaces before typeInfo (original child order). binaryItemIDRef is always emitted, empty when not embedded, matching Hancom. HWP5 FACE_NAME has no substFont concept, so its Font literals get subst_font: None. Verified on 보고서 발간 요청서(PUB).hwpx: 13 substFont elements, all font elements byte-identical to the original (diff=0).
write_header hardcoded the three settings blocks after </hh:refList>
(compatibleDocument/layoutCompatibility, docOption/linkinfo, trackchageConfig),
so the roundtrip silently rewrote real values: layoutCompatibility gained 5
hardcoded children (orig empty), linkinfo pageInherit flipped 1→0, and
trackchageConfig flags went 56→0.
These are document-global settings unrelated to the body being edited, so the
same strategy already used for content.hpf metadata applies: capture the
verbatim </hh:refList>...</hh:head> span in the parser (extract_head_tail) and
splice it back, falling through to the hardcoded blocks only when absent (HWP5
path). Empty span is preserved as Some("") so a settings-less original stays
settings-less.
Verified on 보고서 발간 요청서(PUB).hwpx: header tail now byte-identical to the
original (diff=0).
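The verbatim-span capture can be sketched with plain string searches. The real extract_head_tail may work on the XML reader's byte positions instead; the signature below and the helper `write_head_tail` are assumptions made for illustration.

```rust
/// Returns the text between `</hh:refList>` and the closing `</hh:head>`,
/// or None when the header has no refList end tag or no head end tag.
/// An empty span is returned as Some("") so a settings-less original
/// stays settings-less.
pub fn extract_head_tail(header_xml: &str) -> Option<&str> {
    const START: &str = "</hh:refList>";
    const END: &str = "</hh:head>";
    let from = header_xml.find(START)? + START.len();
    let to = from + header_xml[from..].rfind(END)?;
    Some(&header_xml[from..to])
}

/// Splices the captured span back, using the hardcoded blocks only when
/// nothing was captured (for example on the HWP5 path).
pub fn write_head_tail(out: &mut String, captured: Option<&str>, hardcoded: &str) {
    out.push_str(captured.unwrap_or(hardcoded));
}
```

Keeping `Some("")` distinct from `None` is what stops an empty original from gaining the hardcoded settings blocks.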
...alizer HWPX border widths are stored as a HWP enum index (u8). The parser (parse_border_width, mm->index) used 6 coarse buckets while the serializer (border_width_mm, index->mm) used the real 16-value Hancom table, so the two disagreed: 0.4mm->index2->"0.15", 0.6mm->index3->"0.2", and 0.12mm/0.5mm collapsed onto 0.1mm/0.15mm. Roundtrip diff=0 hid this (the IR index is stable), but every border was silently re-rendered at the wrong thickness. The renderer reads the same index, so the bug also under-drew borders: 0.12mm rendered as 0.1mm and 0.5mm as 0.15mm (3x too thin). Fix is one shared BORDER_WIDTHS:[(f64,&str);16] table in model/style.rs with border_width_index (nearest-match, total_cmp) for the parser and border_width_mm_str for the serializer. Golden SVGs for form-002/issue-157/table-text encoded the bug and were regenerated; verified each diff is exclusively the stroke-width correction (0.4->0.5 for 0.12mm, 0.6->1.9 for 0.5mm) plus the layout shift from now-correct thicker borders. css_border_width_to_hwp (CSS-import, pt domain) left untouched. Verified on 보고서 발간 요청서(PUB).hwpx: all four border widths now match the original exactly (diff=0).
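A sketch of the shared-table approach. The names BORDER_WIDTHS, border_width_index and border_width_mm_str come from the commit; the 16 width values are written from my recollection of the HWP border-width list and should be read as an assumption, as should the unit-less string form and the out-of-range fallback.

```rust
/// One table shared by parser and serializer: (width in mm, serialized text).
/// Values are assumed, not copied from the rhwp source.
pub const BORDER_WIDTHS: [(f64, &str); 16] = [
    (0.1, "0.1"), (0.12, "0.12"), (0.15, "0.15"), (0.2, "0.2"),
    (0.25, "0.25"), (0.3, "0.3"), (0.4, "0.4"), (0.5, "0.5"),
    (0.6, "0.6"), (0.7, "0.7"), (1.0, "1.0"), (1.5, "1.5"),
    (2.0, "2.0"), (3.0, "3.0"), (4.0, "4.0"), (5.0, "5.0"),
];

/// Parser side: nearest-match lookup, mm -> index.
pub fn border_width_index(mm: f64) -> u8 {
    BORDER_WIDTHS
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| (a.0 - mm).abs().total_cmp(&(b.0 - mm).abs()))
        .map(|(i, _)| i as u8)
        .unwrap_or(0)
}

/// Serializer side: index -> mm text; out-of-range indices fall back to entry 0.
pub fn border_width_mm_str(index: u8) -> &'static str {
    BORDER_WIDTHS
        .get(index as usize)
        .map(|entry| entry.1)
        .unwrap_or(BORDER_WIDTHS[0].1)
}
```

Because both directions read the same table, every listed width maps to its own index and back, so 0.12 mm and 0.5 mm no longer collapse onto neighbouring buckets.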
...n absent The borderFill serializer wrote <hh:diagonal> unconditionally and inferred its line type from width==0, so every borderFill with no diagonal gained a spurious <hh:diagonal type="NONE"/> (orig 7 diagonals became 10 on roundtrip). Because width-index 0 doubled as both "0.1mm" and the "no diagonal" sentinel, the parser carried a parse_diagonal_width().max(1) hack that bumped real 0.1mm diagonals to 0.12mm. Fix the dual meaning structurally: the serializer omits the element when diagonal_type==0 (matching Hancom originals and the renderer, which already treats diagonal_type==0 as no-diagonal) and restores the line type from the diagonal_type code (border_line_type_from_code, the inverse of the parser's parse_border_line_type_code). With type no longer derived from width, the .max(1) hack is removed and the diagonal width round-trips exactly. Verified on 보고서 발간 요청서(PUB).hwpx: 7 diagonals, all identical to the original (diff=0).
...undtrip
[Finding 12] borderFill `<hc:winBrush faceColor="none" .../>` was dropped on serialize. The parser collapsed faceColor=none + no-hatch to FillType::None AND discarded the parsed `solid`, so the serializer's FillType::None arm emitted nothing and the original fillBrush element vanished on roundtrip.
Structural cause: FillType::None carried two meanings, "no fillBrush in the original" and "fillBrush present but renders empty". Separate the concerns:
- fill.fill_type = how it RENDERS (None = no visible fill)
- fill.solid.is_some() = whether a winBrush ELEMENT existed in the source
Parser now always stores `bf.fill.solid = Some(solid)` while still setting fill_type=None for the faceColor=none + no-hatch case. Serializer's FillType::None arm restores the winBrush from `solid` when present, omits it when absent. winBrush emission is centralized in one write_win_brush owner so the Solid and None-with-solid paths serialize identically.
Render is unchanged: every fill consumer gates on fill_type == Solid (build_para_properties_json, style_resolver), so None+solid still reports fillType="none" and draws no fill. issue_1172 para_001 test and all svg_snapshot goldens stay green.
Verified on PAL form 보고서 발간 요청서(PUB).hwpx: header.xml element-tag parity now identical (60 tags); all 5 winBrush incl. both faceColor="none" round-trip byte-identical. Last header.xml drop in the lossless series.
Tests:
- parser: parse_empty_winbrush_preserves_solid_for_lossless_roundtrip
- serializer: write_border_fill_restores_empty_winbrush_when_none_with_solid
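The separation of "how it renders" from "whether the element existed" can be sketched as below. The struct layout is a simplified assumption (the real winBrush carries more attributes than faceColor), and `draws_fill` is a hypothetical name for the render-side gate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FillType {
    None,
    Solid,
}

/// Simplified stand-in: only faceColor is modeled here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SolidFill {
    pub face_color: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fill {
    pub fill_type: FillType,      // how it renders
    pub solid: Option<SolidFill>, // whether a winBrush element existed in the source
}

/// Single owner of winBrush emission, shared by the Solid and None-with-solid paths.
fn write_win_brush(out: &mut String, solid: &SolidFill) {
    out.push_str(&format!(r#"<hc:winBrush faceColor="{}"/>"#, solid.face_color));
}

/// Serializer: element presence follows `solid`, not `fill_type`.
pub fn write_fill_brush(out: &mut String, fill: &Fill) {
    if let Some(solid) = &fill.solid {
        out.push_str("<hc:fillBrush>");
        write_win_brush(out, solid);
        out.push_str("</hc:fillBrush>");
    }
}

/// Render side: only fill_type decides whether anything is drawn.
pub fn draws_fill(fill: &Fill) -> bool {
    fill.fill_type == FillType::Solid
}
```

A fill with `fill_type = None` and `solid = Some(faceColor "none")` therefore serializes its winBrush but still draws nothing.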
[Finding 14] section serializer's secPr template omitted tabStopVal="4000" tabStopUnit="HWPUNIT", so both attributes were dropped on roundtrip (the parser never read them into IR either). These are a Hancom format constant — the default tab width — verified invariant across the sample corpus (42/42 files with the attribute use exactly 4000 / HWPUNIT), so they belong in the template alongside the other hardcoded secPr attributes, not as per-document IR. Parser ignores them, so diff=0 is preserved. Order matches Hancom output: tabStop → tabStopVal → tabStopUnit → outlineShapeIDRef. Test: secpr_emits_tab_stop_val_and_unit.
[Finding 15] The hwpunitchar namespace (http://www.hancom.co.kr/hwpml/2016/HwpUnitChar) is declared on both root elements in Hancom output but was absent from rhwp's header serializer and section template, so it was dropped on roundtrip. It is declared-but-unused (no hwpunitchar:-prefixed nodes), a vestigial Hancom namespace, but must be preserved for lossless roundtrip. Like the other format-constant namespace declarations it is hardcoded, not IR-derived.
Tests:
- header: write_header_runs_on_empty_document (asserts the declaration)
- section: section_root_declares_hwpunitchar_namespace
[Finding 17] The header serializer hardcoded version="1.2" on <hh:head>, but the HWPML schema version is document-specific (observed 1.2/1.3/1.31/1.4/1.5 across the sample corpus). The parser already extracts it (parse_hwpx_hwpml_version) but used it only for an is_hwp3_origin heuristic and discarded it, so every document was rewritten to 1.2 on roundtrip.
Fix: store the parsed version in DocInfo.hwpml_version and emit it from IR (fallback "1.2" when absent, e.g. HWP5 path). diff=0 did not compare the head version, so this was masked.
Tests:
- write_header_emits_preserved_hwpml_version
- write_header_runs_on_empty_document (asserts 1.2 fallback)
[Finding 18] write_para_pr hardcoded align@vertical="BASELINE" and
breakSetting@{breakNonLatinWord, widowOrphan, keepWithNext, keepLines,
pageBreakBefore} as constants, discarding the values the parser preserves in
ParaShape.attr1/attr2. On the PAL form this rewrote 2 paraPr from
vertical="CENTER" to BASELINE and 19 from breakNonLatinWord="BREAK_WORD" to
KEEP_WORD. diff=0 did not compare these, so it was masked.
Fix: reverse-map each from its preserved bit:
- vertical ← attr1[20:21] (BASELINE/TOP/CENTER/BOTTOM)
- breakNonLatinWord ← attr1[7] (1=KEEP_WORD, 0=BREAK_WORD)
- widowOrphan/keepWithNext/keepLines/pageBreakBefore ← attr2[5..8]
Added vertical_alignment_str, the inverse of parse_vertical_alignment_bits.
breakLatinWord and lineWrap are left constant: the parser does not yet
capture them (a distinct parser-side gap, not this serializer-hardcode
family; both happen to be at their defaults on the PAL form). Parsed shapes
always set bit7 explicitly, so roundtrip is exact; only a freshly
constructed ParaShape (attr1=0, never serialized in the edit-existing-doc
path) now reports breakNonLatinWord=BREAK_WORD.
Tests:
- write_para_pr_emits_align_and_break_from_preserved_bits
- write_para_pr_default_vertical_is_baseline
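A sketch of the reverse mapping for two of the attributes. `vertical_alignment_str` is named in the commit, but its signature here is assumed, `break_non_latin_word_str` is a hypothetical helper, and the code order 0..3 = BASELINE/TOP/CENTER/BOTTOM is inferred from the listing order above.

```rust
/// vertical <- attr1 bits 20..=21; code order is assumed from the listing above.
pub fn vertical_alignment_str(attr1: u32) -> &'static str {
    match (attr1 >> 20) & 0x3 {
        0 => "BASELINE",
        1 => "TOP",
        2 => "CENTER",
        _ => "BOTTOM",
    }
}

/// breakNonLatinWord <- attr1 bit 7 (1 = KEEP_WORD, 0 = BREAK_WORD).
pub fn break_non_latin_word_str(attr1: u32) -> &'static str {
    if ((attr1 >> 7) & 0x1) == 1 {
        "KEEP_WORD"
    } else {
        "BREAK_WORD"
    }
}
```

With attr1 = 0 this reports BASELINE and BREAK_WORD, which matches the note that only a freshly constructed ParaShape changes its breakNonLatinWord output.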
[Finding 20] parse_char_shape explicitly ignored the useKerning attribute
(b"useKerning" | b"symMark" => {}), so cs.kerning stayed false and the
serializer (which reads cs.kerning) rewrote every useKerning="1" charPr to
"0". On the PAL form one charPr (id=12) lost useKerning="1". diff=0 did not
compare kerning, so it was masked.
Fix: parse useKerning into cs.kerning. symMark stays ignored — it is "NONE"
across the entire sample corpus (3888/3888), matching the emphasis_dot
default, so there is no loss; a non-NONE value would need separate capture.
Test: test_parse_char_pr_captures_use_kerning.
The parser captures <hp:cellzone> ranges into Table.zones, but the HWPX serializer never emitted them — cell-zone border/background overlays were dropped on roundtrip. Emit <hp:cellzoneList> between <hp:inMargin> and the first <hp:tr> (OWPML child order) with attribute order startRowAddr/startColAddr/endRowAddr/endColAddr/borderFillIDRef, omitting the element entirely when Table.zones is empty so documents without zones are unchanged.
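The emission rule can be sketched as follows. The struct `CellZone`, its field names and types, and the function name `write_cell_zone_list` are assumptions; the element names, attribute order, and the omit-when-empty behavior are taken from the commit text.

```rust
/// Hypothetical stand-in for one entry of Table.zones.
pub struct CellZone {
    pub start_row: u16,
    pub start_col: u16,
    pub end_row: u16,
    pub end_col: u16,
    pub border_fill_id: u16,
}

/// Writes <hp:cellzoneList>, or nothing at all when there are no zones,
/// so documents without zones are unchanged.
pub fn write_cell_zone_list(out: &mut String, zones: &[CellZone]) {
    if zones.is_empty() {
        return;
    }
    out.push_str("<hp:cellzoneList>");
    for z in zones {
        out.push_str(&format!(
            r#"<hp:cellzone startRowAddr="{}" startColAddr="{}" endRowAddr="{}" endColAddr="{}" borderFillIDRef="{}"/>"#,
            z.start_row, z.start_col, z.end_row, z.end_col, z.border_fill_id
        ));
    }
    out.push_str("</hp:cellzoneList>");
}
```

In the serializer this call would sit between the <hp:inMargin> writer and the first <hp:tr>, per the child order stated above.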
...oundtrip The Numbering model holds only 7 levels and does not capture align, useInstWidth, autoIndent, checkable, textOffsetType, or the per-level format text. HWPX <hh:numbering> has 10 levels carrying all of those, so the serializer's hardcoded 10-level skeleton flattened every level to a uniform numFormat="DIGIT" paraHead with no format text — losing per-level numFormat (DIGIT/HANGUL_SYLLABLE/CIRCLED_DIGIT), the format strings (^1./^2./^3)/(^5)...), level-7 checkable="1", and levels 8-10. Because the roundtrip diff metric compares re-parsed IR, this loss passed diff=0 while corrupting the output. Preserve the inner paraHead region verbatim (mirroring hwpx_head_tail): the parser captures the byte span between <hh:numbering ...> and </hh:numbering> via Reader::buffer_position() over the from_str source into Numbering.raw_para_heads, and write_numbering splices it back, falling back to the skeleton when absent (HWP5 binary path). The PUB form's numbering element is now byte-identical across roundtrip; header.xml is clean across tag, attr-name, and attr-value censuses.
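This PR restores items that were dropped or altered when an HWPX file saved by Hancom Hangul is run through rhwp's parse→serialize path, checking each against the original. Verification used real report forms (a publication request form and a cover page).
Local commits that overlap with #1380 (omit empty linesegarray) and #1388 (restore page margins in the IR), both already merged into devel, were excluded.
Summary of changes (20 commits)
- faceColor="none") preserved; borderFill fill contents, hh:diagonal, hh:typeInfo (panose) and hh:substFont restored; border width index↔mm shared through a single table; charPr underline/strike/outline/shadow and useKerning; paraPr condense/fontLineHeight/snapToGrid, margin/lineSpacing (hp:switch) and align/breakSetting; numbering paraHead preserved from the original; hwpml head version; document-settings tail verbatim
- tabStopVal/tabStopUnit, root xmlns:hwpunitchar declaration
- cellzoneList (cell-zone border/background), pageBreak as the parser inverse
- content.hpf metadata/spine verbatim
Verification
cargo fmt / clippy --workspace -D warnings / nextest --workspace (2281 passed)
I would appreciate your review.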