I want to show an interactive audio waveform like this.
I've extracted the sample data using AVAssetReader. Using this data, I'm drawing a UIBezierPath in a scroll view's content view. Currently, when I pinch to zoom in or out on the scroll view, I downsample the sample data to determine how many samples should be shown.
import UIKit

class WaveformView: UIView {
    var amplitudes: [CGFloat] = [] {
        didSet {
            setNeedsDisplay()
        }
    }

    override func draw(_ rect: CGRect) {
        guard let context = UIGraphicsGetCurrentContext(), !amplitudes.isEmpty else { return }

        // Set up drawing parameters
        context.setStrokeColor(UIColor.black.cgColor)
        context.setLineWidth(1.0)
        context.setLineCap(.round)

        let midY = rect.height / 2
        let widthPerSample = rect.width / CGFloat(amplitudes.count)

        // Draw waveform
        let path = UIBezierPath()
        for (index, amplitude) in amplitudes.enumerated() {
            let x = CGFloat(index) * widthPerSample
            let height = amplitude * rect.height * 0.8

            // Draw vertical line for each sample
            path.move(to: CGPoint(x: x, y: midY - height))
            path.addLine(to: CGPoint(x: x, y: midY + height))
        }
        path.stroke()
    }
}
I added a pinch gesture handler:
@objc private func handlePinch(_ gesture: UIPinchGestureRecognizer) {
    switch gesture.state {
    case .began:
        initialPinchDistance = gesture.scale
    case .changed:
        let scaleFactor = gesture.scale / initialPinchDistance
        var newScale = currentScale * scaleFactor
        newScale = min(max(newScale, minScale), maxScale)

        // Update displayed samples with new scale
        updateDisplayedSamples(scale: newScale)
        print(newScale)

        // Maintain zoom center point
        let pinchCenter = gesture.location(in: scrollView)
        let offsetX = (pinchCenter.x - scrollView.bounds.origin.x) / scrollView.bounds.width
        let newOffsetX = (totalWidth * offsetX) - (pinchCenter.x - scrollView.bounds.origin.x)
        scrollView.contentOffset.x = max(0, min(newOffsetX, totalWidth - scrollView.bounds.width))
        view.layoutIfNeeded()
    case .ended, .cancelled:
        currentScale = scrollView.contentSize.width / (baseWidth * widthPerSample)
    default:
        break
    }
}
private func updateDisplayedSamples(scale: CGFloat) {
    let targetSampleCount = Int(baseWidth * scale)
    displayedSamples = downsampleWaveform(samples: rawSamples, targetCount: targetSampleCount)
    waveformView.amplitudes = displayedSamples
    totalWidth = CGFloat(displayedSamples.count) * widthPerSample
    contentWidthConstraint?.constant = totalWidth
    scrollView.contentSize = CGSize(width: totalWidth, height: 300)
}

private func downsampleWaveform(samples: [CGFloat], targetCount: Int) -> [CGFloat] {
    guard samples.count > 0, targetCount > 0 else { return [] }
    if samples.count <= targetCount {
        return samples
    }

    var downsampled: [CGFloat] = []
    let sampleSize = samples.count / targetCount

    for i in 0..<targetCount {
        let startIndex = i * sampleSize
        let endIndex = min(startIndex + sampleSize, samples.count)
        let slice = samples[startIndex..<endIndex]

        // For each window, take the maximum value to preserve peaks
        if let maxValue = slice.max() {
            downsampled.append(maxValue)
        }
    }
    return downsampled
}
This approach is very inefficient: every time gesture.state changes, I recalculate the downsampled data and perform UI updates based on it. How can I implement this functionality more efficiently for smooth interaction?
Comment: Use a sparse table. – ielyamani (Feb 6 at 2:23)
1 Answer
How can I implement this functionality more efficiently for smooth interaction?
Pre-compute at different resolutions.
maxValue = slice.max()
Side note: it's not clear that .max() is ideal for this. Maybe use the median of each window? Or the 80th or 90th percentile value of a window?
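As a rough sketch of what that reducer could look like (the name windowValue and the 0.9 default are just illustrative assumptions, not anything from the OP's code):

import CoreGraphics

/// Reduce one window of samples to a single display value.
/// A high percentile (e.g. 0.9) keeps genuine peaks visible while
/// ignoring single-sample spikes; percentile 1.0 is equivalent to max().
func windowValue(_ window: ArraySlice<CGFloat>, percentile: Double = 0.9) -> CGFloat {
    guard !window.isEmpty else { return 0 }
    let sorted = window.sorted()
    let rank = Int(Double(sorted.count - 1) * percentile)
    return sorted[rank]
}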
Upon initial loading of the waveform we're going to be displaying everything, so slice it into windows, compute each window value as max or median or whatever, hang onto those values, and display them.
Now pretend the user asked to see half of the timespan. There's been no gesture, no user interaction, so we do not yet know the starting point, but that's OK. We'll just compute window values for everything at that resolution, and hang onto the values.
Repeat for quarter, eighth, and so on. At some point we bottom out -- RAM to store the values becomes annoyingly large, and time to recompute exact values on the fly for a "small" timeslice is conveniently small.
Now we start accepting gestures. As the user pinches and pinches, we will dive down into using the "half timespan" or the "quarter timespan" data. Of course the user's requested {start, stop} timestamps won't match the precomputed data exactly. But we can go to the slightly higher resolution data, generate appropriate indexes, and display a subset of the stored data, skipping values occasionally.
Why is this effective? Because the number of pre-computed values approximately matches the display size, exceeding it at most by a factor of two.
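As a concrete illustration, here is a hedged sketch of such a multi-resolution cache in Swift. WaveformPyramid and its method names are assumptions layered on top of the OP's code, not a drop-in API: level 0 holds roughly one value per on-screen point at full zoom-out, and each deeper level doubles the number of windows until further levels stop paying off.

import CoreGraphics

/// Hypothetical multi-resolution cache built once per loaded file.
struct WaveformPyramid {
    private(set) var levels: [[CGFloat]] = []

    init(rawSamples: [CGFloat], baseCount: Int, maxLevels: Int = 8) {
        var targetCount = max(1, baseCount)
        for _ in 0..<maxLevels {
            guard targetCount < rawSamples.count else { break }
            levels.append(Self.reduce(rawSamples, to: targetCount))
            targetCount *= 2
        }
        if levels.isEmpty { levels = [rawSamples] }   // short file: raw samples already fit
    }

    /// Pick the coarsest level that still has at least `neededCount` values,
    /// then stride through it so the result roughly matches the display width.
    func amplitudes(for neededCount: Int) -> [CGFloat] {
        guard neededCount > 0,
              let level = levels.first(where: { $0.count >= neededCount }) ?? levels.last
        else { return [] }
        let step = max(1, level.count / neededCount)
        return stride(from: 0, to: level.count, by: step).map { level[$0] }
    }

    /// Same peak-preserving reduction as the OP's downsampleWaveform.
    private static func reduce(_ samples: [CGFloat], to targetCount: Int) -> [CGFloat] {
        let windowSize = max(1, samples.count / targetCount)
        return (0..<targetCount).map { i in
            let start = i * windowSize
            let end = min(start + windowSize, samples.count)
            return samples[start..<end].max() ?? 0
        }
    }
}

With something along these lines, the pinch handler only calls pyramid.amplitudes(for: targetSampleCount) and assigns the result to waveformView.amplitudes; the raw samples are never re-scanned during the gesture.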
If you're a stickler for accuracy, have a background thread do the unchanged OP calculation, and use double buffering to replace the "approximate view" with the "exact view" if it turns out the user went idle for a moment. OTOH if gesture events keep arriving, the background computational effort is wasted and is discarded, while the foreground thread keeps quickly displaying pre-computed values.
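A minimal sketch of that idea, assuming it lives in the same view controller as the code above (scheduleExactRender, renderGeneration, and renderQueue are made-up names): each gesture event bumps a generation counter, and any background result that arrives late is simply dropped.

// Sketch only: these would be properties of the OP's view controller.
private var renderGeneration = 0
private let renderQueue = DispatchQueue(label: "waveform.exact-render", qos: .userInitiated)

private func scheduleExactRender(targetCount: Int) {
    renderGeneration += 1                 // called on the main thread from the gesture handler
    let generation = renderGeneration
    let samples = rawSamples              // value-type copy; safe to read off the main thread

    renderQueue.async { [weak self] in
        guard let self = self else { return }
        let exact = self.downsampleWaveform(samples: samples, targetCount: targetCount)
        DispatchQueue.main.async {
            // If the user kept pinching while we worked, this result is stale: discard it.
            guard generation == self.renderGeneration else { return }
            self.waveformView.amplitudes = exact
        }
    }
}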
A background thread can also help with the "time to become interactive" startup latency upon loading a new waveform.