Releases: dathere/qsv-dateparser

0.15.1

30 May 14:45
@jqnatividad jqnatividad


Performance: parse-dispatch optimization

Speeds up Parse::parse on its hot path — the failed parse attempts that dominate qsv stats --infer-dates on non-date columns, where every value previously ran the full regex is_match chain before failing. Four behavior-preserving changes:

  1. Structural byte pre-filter (cannot_be_date, backed by a 256-entry DATE_BYTE lookup table). Any input containing a byte that cannot appear in an accepted format (_, #, non-ASCII, etc.) returns Err immediately, skipping the regex chain.
  2. Dispatch reorder — the cheap regex-gated families run first; the two parsers without a family regex gate (unix_timestamp, rfc2822) move last. Result-preserving: floats match no family gate (so still reach unix_timestamp), and rfc2822 inputs always carry a timezone that the $-anchored month_dmy_* regexes reject — while month_dmy_* only succeeds without a timezone, which makes rfc2822 fail. The two are mutually exclusive.
  3. unix_timestamp first-byte gate — bail before fast_float2::parse unless the first byte is one of digit + - . i I n N (the only leads fast_float2 accepts; it rejects leading whitespace).
  4. rfc2822 colon gate — every RFC 2822 datetime has a time-of-day, so colon-free inputs can't be rfc2822; skip parse_from_rfc2822 for them.
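A minimal sketch of changes 1, 3, and 4, using only the standard library. The set of bytes marked as allowed here is an assumption for illustration (the release notes only say that `_`, `#`, and non-ASCII bytes are rejected), and the helper names `may_be_unix_timestamp` and `may_be_rfc2822` are hypothetical, not the crate's own.

```rust
// Build a 256-entry lookup table at compile time.
// ASSUMPTION: the allowed set (ASCII alphanumerics plus a few separators)
// is illustrative; the crate's real DATE_BYTE table may differ.
const fn build_date_byte_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0usize;
    while i < 256 {
        let b = i as u8;
        table[i] = b.is_ascii_alphanumeric()
            || matches!(b, b' ' | b'-' | b'/' | b':' | b'.' | b',' | b'+');
        i += 1;
    }
    table
}

static DATE_BYTE: [bool; 256] = build_date_byte_table();

/// Structural pre-filter: true if any byte is outside the allowed set,
/// so the caller can return Err without running the regex chain.
fn cannot_be_date(input: &str) -> bool {
    input.bytes().any(|b| !DATE_BYTE[b as usize])
}

/// First-byte gate (hypothetical name): only these leading bytes can start
/// a float, so anything else skips the float parse.
fn may_be_unix_timestamp(input: &str) -> bool {
    matches!(
        input.as_bytes().first().copied(),
        Some(b'0'..=b'9' | b'+' | b'-' | b'.' | b'i' | b'I' | b'n' | b'N')
    )
}

/// Colon gate (hypothetical name): an RFC 2822 datetime always has a
/// time-of-day, so a colon-free input cannot be one.
fn may_be_rfc2822(input: &str) -> bool {
    input.as_bytes().contains(&b':')
}
```

Each gate is a cheap necessary condition: a `false` result proves that the parser would fail, while a `true` result still hands the input to the real parser.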

Benchmarks (M4 Max, release, 1000 values/iter)

Path                                                  0.15.0   0.15.1   Change
non-date strings, parse_failures (category_value_N)   125 μs   60 μs    −52%
genuine ISO datetimes, parse_throughput               398 μs   370 μs   −7%
non-date words, parse_word_failures                   133 μs   127 μs   −5%
accepted-formats mix, parse_all                       8.2 μs   7.9 μs   −4%
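
The change column is consistent with a plain relative difference between the two timings, rounded to the nearest percent. A small sketch of that arithmetic, with `percent_change` as a hypothetical helper that is not part of the crate or its benches:

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the newer release is faster.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For example, `percent_change(125.0, 60.0)` rounds to −52, matching the first row.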

Correctness

No public API change and byte-identical parse results, verified as follows:

  • All 26 unit tests + 6 doctests pass (doctests cover the full accepted-formats and DMY lists). New prefilter_rejects_non_date_strings regression test pins the pre-filter (rejects junk; still accepts every separator a real date uses, plus the bare inf/nan the unix-timestamp path accepts).
  • Integration: qsv stats output is byte-identical before/after on a mixed-format dataset, and qsv's full cargo test stats -F all_features suite passes (752 passed, 0 failed) against this release.

Why it matters for qsv

qsv stats --infer-dates (and, via the stats cache, frequency, schema, tojsonl, sqlp, joinp, pivotp, describegpt) parse dates through this crate. Genuine date-typed columns infer ~3% faster at the qsv level; non-date columns are unaffected. (Integration profiling showed the larger remaining --infer-dates overhead on non-date-heavy data lives in qsv's own sniff step, not this crate — tracked for follow-up on the qsv side.)

Dev-only

  • Added parse_failures and parse_word_failures benches for the failure hot paths.
  • Updated parse-order docs in CLAUDE.md.
  • A RegexSet single-pass classification idea was prototyped and rejected: it measured slower on the common early-gate date path (no early exit; per-is_match setup overhead, not scanning, dominates).

Full diff

0.15.0...0.15.1 — PR #9.


0.15.0

18 May 03:09
@jqnatividad jqnatividad


Fixed: ISO 8601 T-separator parsing for naive datetimes

parse_with_preference() (and the rest of the public API) now accepts ISO 8601 datetimes with the T separator and no timezone suffix — the form Python's datetime.isoformat() emits without .astimezone(). These were previously rejected because the naive-datetime parser required a literal space between date and time, and the RFC 3339 parser required an offset suffix, so the T + no-tz intersection fell into the gap.

Now accepted (the same wall-clock time at increasing precision; the last two carry fractional seconds):

parse_with_preference("2020-01-15T08:00", false)?;
parse_with_preference("2020-01-15T08:00:00", false)?;
parse_with_preference("2020-01-15T08:00:00.123", false)?;
parse_with_preference("2020-01-15T08:00:00.123456", false)?;

Existing tz-bearing forms (...Z, ...+00:00, ...PST, etc.) and the space-separated naive form are unchanged.

Why it matters for qsv: qsv stats --infer-dates was previously misclassifying these columns as String, cascading into synthesize, schema, and describegpt. Bumping qsv-dateparser to 0.15.0 fixes inference for any column emitted by Python's default datetime.isoformat().

Performance

The fix preserves the existing trial-parse chain length for the common space-separated path (regex tightened to a character class [T\s]+, format-string family selected from a single byte read at offset 10). Bench-compare against the 0.14.0 baseline showed every codepath touching ymd_hms stable to slightly improved (−0.5% to −3.1%).
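
The "single byte read at offset 10" idea can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `naive_datetime_format` is a hypothetical helper, and the chrono-style format strings are assumed rather than taken from the source.

```rust
/// Pick a format-string family from the separator byte that follows the
/// 10-byte `YYYY-MM-DD` prefix. Returns None when there is no time part
/// or the separator is neither `T` nor ASCII whitespace.
/// ASSUMPTION: the format strings are illustrative chrono-style patterns.
fn naive_datetime_format(input: &str) -> Option<&'static str> {
    match input.as_bytes().get(10).copied() {
        Some(b'T') => Some("%Y-%m-%dT%H:%M:%S"),
        Some(b) if b.is_ascii_whitespace() => Some("%Y-%m-%d %H:%M:%S"),
        _ => None,
    }
}
```

Because the choice is made from one byte, the space-separated path still makes the same number of trial parses as before, and the `T` form gets its own family instead of an extra attempt.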

Toolchain

  • MSRV bumped 1.93 → 1.95.

Dev-only

  • criterion dev-dep bumped (#5).
  • actions/checkout CI action bumped 4 → 6 (#7).
  • Internal test fix for a Local-vs-UTC date-rollover flake on parse_unambiguous_dmy (no library-behavior change).

Out of scope (potential follow-up)

T-separator combined with a named timezone (e.g. 2020-01-15T08:00:00 UTC) is still rejected; it would need a similar tweak to ymd_hms_z. Open an issue if this is blocking.

Full diff

0.14.0...0.15.0 — primarily PR #8.


0.14.0

21 Feb 18:31
@jqnatividad jqnatividad


Performance

This release is focused on parsing speed via a new two-layer pre-filter pattern: a family-level regex gate followed by cheap byte-level checks within individual parsers, short-circuiting before the heavier regex runs.

New byte pre-filters added:

Parser            Pre-filter                           Effect
ymd_hms_z         len < 17 || byte[10] != whitespace   Eliminates bare YYYY-MM-DD dates instantly
ymd_z             len <= 10                            Skips inputs that can't have a timezone
month_mdy_hms_z   Scan for isolated 4-digit year       Eliminates inputs like May 27 02:45:27 with no year
month_dmy_hms     !contains(':')                       Skips regex entirely for date-only inputs
month_dmy         Trailing 4-digit byte check          Skips the failing %d %B %y parse attempt for 4-digit years
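
The byte-level checks in this table can be sketched with the standard library alone. The function names below are hypothetical, and the logic is reconstructed from the table's one-line descriptions rather than from the crate's source, so treat it as an approximation.

```rust
/// ymd_hms_z: needs at least 17 bytes and whitespace right after the date.
fn passes_ymd_hms_z(input: &str) -> bool {
    let b = input.as_bytes();
    b.len() >= 17 && b[10].is_ascii_whitespace()
}

/// ymd_z: a bare 10-byte date has no room for a timezone.
fn passes_ymd_z(input: &str) -> bool {
    input.len() > 10
}

/// month_mdy_hms_z: look for a run of exactly four ASCII digits.
fn has_isolated_four_digit_year(input: &str) -> bool {
    let b = input.as_bytes();
    let mut i = 0;
    while i < b.len() {
        if b[i].is_ascii_digit() {
            let start = i;
            while i < b.len() && b[i].is_ascii_digit() {
                i += 1;
            }
            if i - start == 4 {
                return true;
            }
        } else {
            i += 1;
        }
    }
    false
}

/// month_dmy_hms: a time-of-day always contains a colon.
fn passes_month_dmy_hms(input: &str) -> bool {
    input.as_bytes().contains(&b':')
}

/// month_dmy: when the input ends in four digits, the two-digit-year
/// (%y) attempt is bound to fail and can be skipped.
fn ends_with_four_digits(input: &str) -> bool {
    let b = input.as_bytes();
    b.len() >= 4 && b[b.len() - 4..].iter().all(|c| c.is_ascii_digit())
}
```

Each check inspects a few bytes at most, or makes one linear pass with no allocation, so it costs far less than the regex match or trial parse it avoids.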

Benchmark improvements vs. 0.13.0 baseline:

Benchmark                                  Change
parse_all/accepted_formats                 −8%
parse_throughput/1000_dates                −6.6%
rfc3339 (2017-11-25T22:34:50Z)             −21%
ymd_hms_z (2019-11-29 08:08:05-08)         −10.5%
month_dmy (1 July 2013)                    −11.6%
ymd_z (2021-02-21 PST)                     −8.3%
month_mdy_hms (May 8, 2009 5:57:51 PM)     −8%
month_mdy_hms_z (May 02, 2021 15:51 UTC)   −7.9%

No regressions across any format.

Dependencies

  • Minimum supported Rust version bumped to 1.93 (was 1.85)
  • criterion dev-dependency bumped from 0.5 → 0.8

Internal

  • Refactored conditional branches in datetime.rs to use Rust let-chains and is_none_or for cleaner control flow

0.13.0

29 Mar 23:49
@jqnatividad jqnatividad


What's Changed

  • Bump actions/checkout from 3 to 4 by @dependabot in #1
  • Update chrono-tz requirement from 0.9 to 0.10 by @dependabot in #3
  • use fast-float2 for float parsing
  • optimized regexes
  • 2024 edition
  • set MSRV to 1.85

Full Changelog: https://github.com/dathere/qsv-dateparser/commits/0.13.0

Contributors

dependabot