Releases: dathere/qsv-dateparser
0.15.1
Performance: parse-dispatch optimization
Speeds up Parse::parse on its hot path — the failed parse attempts that dominate qsv stats --infer-dates on non-date columns, where every value previously ran the full regex is_match chain before failing. Four behavior-preserving changes:
- Structural byte pre-filter (`cannot_be_date`, backed by a 256-entry `DATE_BYTE` lookup table). Any input containing a byte that cannot appear in an accepted format (`_`, `#`, non-ASCII, etc.) returns `Err` immediately, skipping the regex chain.
- Dispatch reorder: the cheap regex-gated families run first; the two parsers without a family regex gate (`unix_timestamp`, `rfc2822`) move last. Result-preserving: floats match no family gate (so still reach `unix_timestamp`), and rfc2822 inputs always carry a timezone that the `$`-anchored `month_dmy_*` regexes reject, while `month_dmy_*` only succeeds without a timezone, which makes rfc2822 fail. The two are mutually exclusive.
- `unix_timestamp` first-byte gate: bail before `fast_float2::parse` unless the first byte is one of `digit + - . i I n N` (the only leads `fast_float2` accepts; it rejects leading whitespace).
- `rfc2822` colon gate: every RFC 2822 datetime has a time-of-day, so colon-free inputs can't be rfc2822; skip `parse_from_rfc2822` for them.
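The three byte-level gates above can be sketched as follows. This is a minimal illustration, not the crate's code: the accepted-byte set, the table element type, and the helper names `build_date_byte_table`, `may_be_unix_timestamp`, and `may_be_rfc2822` are assumptions; only `cannot_be_date` and `DATE_BYTE` are names taken from the notes.

```rust
/// Builds a 256-entry lookup table at compile time.
/// ASSUMPTION: the accepted set here (ASCII alphanumerics, space, and a few
/// separators) is illustrative; the crate's real `DATE_BYTE` table may differ.
const fn build_date_byte_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut b = 0usize;
    while b < 256 {
        let c = b as u8;
        table[b] = c.is_ascii_alphanumeric()
            || matches!(c, b' ' | b'-' | b'/' | b':' | b'.' | b',' | b'+');
        b += 1;
    }
    table
}

static DATE_BYTE: [bool; 256] = build_date_byte_table();

/// Returns true if any byte falls outside the accepted set, so the caller
/// can return `Err` without running a single regex.
fn cannot_be_date(input: &str) -> bool {
    input.bytes().any(|b| !DATE_BYTE[b as usize])
}

/// First-byte gate (hypothetical helper): only these leading bytes are
/// listed in the notes as possible starts of a float.
fn may_be_unix_timestamp(input: &str) -> bool {
    matches!(
        input.as_bytes().first(),
        Some(b'0'..=b'9' | b'+' | b'-' | b'.' | b'i' | b'I' | b'n' | b'N')
    )
}

/// Colon gate (hypothetical helper): an RFC 2822 datetime always has a
/// time-of-day, so a colon-free input cannot be one.
fn may_be_rfc2822(input: &str) -> bool {
    input.as_bytes().contains(&b':')
}
```

With this table, `cannot_be_date("category_value_1")` is true because of the underscore, while `cannot_be_date("2020-01-15T08:00:00")` is false and the input proceeds to the regex-gated families. Each gate is a single pass over bytes with no allocation, which is why it is cheap enough to run before the full chain.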
Benchmarks (M4 Max, release, 1000 values/iter)
| path | 0.15.0 | 0.15.1 | change |
|---|---|---|---|
| non-date strings, `parse_failures` (`category_value_N`) | 125 μs | 60 μs | −52% |
| genuine ISO datetimes, `parse_throughput` | 398 μs | 370 μs | −7% |
| non-date words, `parse_word_failures` | 133 μs | 127 μs | −5% |
| accepted-formats mix, `parse_all` | 8.2 μs | 7.9 μs | −4% |
Correctness
No public API change and byte-identical parse results — verified three ways:
- All 26 unit tests + 6 doctests pass (doctests cover the full accepted-formats and DMY lists).
- New `prefilter_rejects_non_date_strings` regression test pins the pre-filter (rejects junk; still accepts every separator a real date uses, plus the bare `inf`/`nan` the unix-timestamp path accepts).
- Integration: qsv `stats` output is byte-identical before/after on a mixed-format dataset, and qsv's full `cargo test stats -F all_features` suite passes (752 passed, 0 failed) against this release.
Why it matters for qsv
qsv stats --infer-dates (and, via the stats cache, frequency, schema, tojsonl, sqlp, joinp, pivotp, describegpt) parse dates through this crate. Genuine date-typed columns infer ~3% faster at the qsv level; non-date columns are unaffected. (Integration profiling showed the larger remaining --infer-dates overhead on non-date-heavy data lives in qsv's own sniff step, not this crate — tracked for follow-up on the qsv side.)
Dev-only
- Added `parse_failures` and `parse_word_failures` benches for the failure hot paths.
- Updated parse-order docs in `CLAUDE.md`.
- A `RegexSet` single-pass classification idea was prototyped and rejected: it measured slower on the common early-gate date path (no early exit; per-`is_match` setup overhead, not scanning, dominates).
Full diff
0.15.0...0.15.1 — PR #9.
0.15.0
Fixed: ISO 8601 T-separator parsing for naive datetimes
parse_with_preference() (and the rest of the public API) now accepts ISO 8601 datetimes with the T separator and no timezone suffix — the form Python's datetime.isoformat() emits without .astimezone(). These were previously rejected because the naive-datetime parser required a literal space between date and time, and the RFC 3339 parser required an offset suffix, so the T + no-tz intersection fell into the gap.
Now accepted (all four are equivalent wall-clock instants):
    parse_with_preference("2020-01-15T08:00", false)?;
    parse_with_preference("2020-01-15T08:00:00", false)?;
    parse_with_preference("2020-01-15T08:00:00.123", false)?;
    parse_with_preference("2020-01-15T08:00:00.123456", false)?;
Existing tz-bearing forms (...Z, ...+00:00, ...PST, etc.) and the space-separated naive form are unchanged.
Why it matters for qsv: qsv stats --infer-dates was previously misclassifying these columns as String, cascading into synthesize, schema, and describegpt. Bumping qsv-dateparser to 0.15.0 fixes inference for any column emitted by Python's default datetime.isoformat().
Performance
The fix preserves the existing trial-parse chain length for the common space-separated path (regex tightened to a character class [T\s]+, format-string family selected from a single byte read at offset 10). Bench-compare against the 0.14.0 baseline showed every codepath touching ymd_hms stable to slightly improved (−0.5% to −3.1%).
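Selecting the format-string family from the byte at offset 10 might look like the sketch below. The function name `naive_datetime_format` and the two chrono-style format strings are assumptions for illustration; the crate's actual family also covers the minute-only and fractional-second variants listed above.

```rust
/// Picks a format string from the separator byte that follows `YYYY-MM-DD`.
/// ASSUMPTION: hypothetical helper and format strings, not the crate's code.
fn naive_datetime_format(input: &str) -> Option<&'static str> {
    // Offset 10 is the first byte after the 10-byte date `YYYY-MM-DD`.
    // `get` returns `None` for inputs too short to carry a time part.
    match *input.as_bytes().get(10)? {
        b'T' => Some("%Y-%m-%dT%H:%M:%S"),
        b if b.is_ascii_whitespace() => Some("%Y-%m-%d %H:%M:%S"),
        _ => None,
    }
}
```

Because the choice is a single byte comparison, the space-separated path does no extra trial parses compared with the pre-fix behavior, which is consistent with the stable benchmark numbers reported above.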
Toolchain
- MSRV bumped 1.93 → 1.95.
Dev-only
- `criterion` dev-dep bumped (#5).
- `actions/checkout` CI action bumped 4 → 6 (#7).
- Internal test fix for a Local-vs-UTC date-rollover flake on `parse_unambiguous_dmy` (no library-behavior change).
Out of scope (potential follow-up)
T-separator combined with a named timezone (e.g. 2020-01-15T08:00:00 UTC) is still rejected; it would need a similar tweak to ymd_hms_z. Open an issue if this is blocking.
Full diff
0.14.0...0.15.0 — primarily PR #8.
0.14.0
Performance
This release is focused on parsing speed via a new two-layer pre-filter pattern: a family-level regex gate followed by cheap byte-level checks within individual parsers, short-circuiting before the heavier regex runs.
New byte pre-filters added:
| Parser | Pre-filter | Effect |
|---|---|---|
| `ymd_hms_z` | `len < 17 \|\| byte[10] != whitespace` | Eliminates bare `YYYY-MM-DD` dates instantly |
| `ymd_z` | `len <= 10` | Skips inputs that can't have a timezone |
| `month_mdy_hms_z` | Scan for isolated 4-digit year | Eliminates inputs like `May 27 02:45:27` with no year |
| `month_dmy_hms` | `!contains(':')` | Skips regex entirely for date-only inputs |
| `month_dmy` | Trailing 4-digit byte check | Skips the failing `%d %B %y` parse attempt for 4-digit years |
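A rough sketch of four of the checks in the table, written as standalone predicates. The function names are hypothetical and the exact conditions in the crate may differ; the `month_mdy_hms_z` year scan is not shown.

```rust
/// `ymd_hms_z`: too short for a date, time, and zone, or no whitespace
/// after the 10-byte date part. Returns true when the regex can be skipped.
fn skip_ymd_hms_z(input: &str) -> bool {
    let bytes = input.as_bytes();
    bytes.len() < 17 || !bytes[10].is_ascii_whitespace()
}

/// `ymd_z`: a bare `YYYY-MM-DD` (10 bytes or fewer) has no room for a timezone.
fn skip_ymd_z(input: &str) -> bool {
    input.len() <= 10
}

/// `month_dmy_hms`: no colon means no time-of-day, so skip the regex.
fn skip_month_dmy_hms(input: &str) -> bool {
    !input.as_bytes().contains(&b':')
}

/// `month_dmy`: true when the input ends in exactly four digits, in which
/// case a two-digit-year (`%y`) parse attempt can be skipped.
/// ASSUMPTION: "exactly four" is this sketch's reading of the table entry.
fn ends_with_four_digit_year(input: &str) -> bool {
    let bytes = input.as_bytes();
    let n = bytes.len();
    n >= 4
        && bytes[n - 4..].iter().all(u8::is_ascii_digit)
        && (n == 4 || !bytes[n - 5].is_ascii_digit())
}
```

For example, `skip_ymd_hms_z("2021-02-21")` is true on the length check alone, while `"2019-11-29 08:08:05-08"` passes both conditions and goes on to the regex.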
Benchmark improvements vs. 0.13.0 baseline:
| Benchmark | Change |
|---|---|
| `parse_all/accepted_formats` | −8% |
| `parse_throughput/1000_dates` | −6.6% |
| `rfc3339` (`2017-11-25T22:34:50Z`) | −21% |
| `ymd_hms_z` (`2019-11-29 08:08:05-08`) | −10.5% |
| `month_dmy` (`1 July 2013`) | −11.6% |
| `ymd_z` (`2021-02-21 PST`) | −8.3% |
| `month_mdy_hms` (`May 8, 2009 5:57:51 PM`) | −8% |
| `month_mdy_hms_z` (`May 02, 2021 15:51 UTC`) | −7.9% |
No regressions across any format.
Dependencies
- Minimum supported Rust version bumped to 1.93 (was 1.85)
- `criterion` dev-dependency bumped from 0.5 to 0.8
Internal
- Refactored conditional branches in `datetime.rs` to use Rust let-chains and `is_none_or` for cleaner control flow
0.13.0
What's Changed
- Bump actions/checkout from 3 to 4 by @dependabot in #1
- Update chrono-tz requirement from 0.9 to 0.10 by @dependabot in #3
- use fast-float2 for float parsing
- optimized regexes
- 2024 edition
- set MSRV to 1.85
Full Changelog: https://github.com/dathere/qsv-dateparser/commits/0.13.0