We are given historical trade data from a cryptocurrency exchange — in our case, Kraken — which, for each trade, includes the following information:
- Time of trade
- Trade Price
- Trade ID (Integers that increment upwards by 1 each trade)
- Order Type (if the trade was a result of a market order or limit order being placed)
- Taker Side of Trade (Buy or Sell)
We restrict ourselves to trade data because many exchanges have fairly recently made large amounts of historical trade data freely and easily accessible, whereas historical quote data, or matched trade-and-quote data, is not.
After analyzing the trade-to-trade log returns of the trade price, I noticed significant negative autocorrelation at lag 1. I assume this is due to bid-ask bounce, since orders on this market are matched in much the same way as on other markets where the effect is observed.
We do not wish to analyze or trade at high or ultra-high frequencies (sub-1-minute) using this data alone, both for various structural reasons and because the pronounced bid-ask bounce would distort our analysis. We therefore down-sample the returns to 1-minute bars.
We then re-evaluate the autocorrelation of the 1-minute log returns. The lag-1 autocorrelation is still negative: it is small in magnitude, but larger in magnitude than at higher lags, which suggests that the bid-ask bounce has weakened but still persists.
Note that we did not forward-fill the down-sampled bars, so there are periods with no bars.
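For concreteness, a minimal sketch of the down-sampling and autocorrelation check described above. It assumes trades are available as time-sorted `(epoch_seconds, price)` pairs; the function names are illustrative only, and a real pipeline would more likely use a dataframe library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def last_price_bars(
    trades: Sequence[Tuple[float, float]], bar_seconds: int = 60
) -> Dict[int, float]:
    """Map bar start time (epoch seconds) to the last trade price in that bar.

    Bars with no trades are simply absent, i.e. nothing is forward-filled.
    Assumes `trades` is sorted by time.
    """
    bars: Dict[int, float] = {}
    for ts, price in trades:
        bar_start = int(ts // bar_seconds) * bar_seconds
        bars[bar_start] = price  # later trades in the same bar overwrite earlier ones
    return bars


def log_returns(prices: Sequence[float]) -> List[float]:
    """Log returns between consecutive observations."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def autocorr(x: Sequence[float], lag: int = 1) -> float:
    """Sample autocorrelation at the given lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Toy data: two trades in the first minute, none in the third minute.
trades = [(0, 100.0), (30, 101.0), (61, 100.5), (185, 102.0)]
bars = last_price_bars(trades)
bar_returns = log_returns(list(bars.values()))
```

Note that `bar_returns` here spans the gap between non-adjacent bars, so a "1-minute" return may in fact cover several minutes when bars are missing.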
My questions are then the following:
Is there a way to use this trade data fruitfully for back-testing/analysis in the presence of bid-ask bounce? Down-sampling the trade data "as-is" could introduce spurious mean-reverting behavior in the price process. I saw that this post addresses ways of recovering an efficient price from the trade price series, but I wanted to know whether there are more modern, or less involved, approaches, since we intend to trade at non-HFT frequencies.
Is there a way to "de-bounce" this trade data so that it is more suitable for our purposes? For example, we could log-difference the trade prices, fit a linear AR(1) model (or include other lags that indicate persistence of the bid-ask bounce), and treat the back-transformed residuals as our "de-bounced" trade price series. Or does this do more harm than good if the bid-ask bounce effect is not too pronounced?
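To make the AR(1) idea concrete, here is a rough sketch of what I have in mind. It assumes a plain OLS fit of r_t = c + phi * r_{t-1} + e_t on the log returns, then cumulates the residuals (plus the sample mean return, so that average drift is kept) back into a price series. The function names and the choice to start the rebuilt series at the second price are my own assumptions, not an established method.

```python
import math
from typing import List, Sequence, Tuple


def fit_ar1(returns: Sequence[float]) -> Tuple[float, float]:
    """OLS fit of r_t = c + phi * r_{t-1} + e_t; returns (c, phi)."""
    x = returns[:-1]
    y = returns[1:]
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi


def debounce_ar1(prices: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (phi, filtered prices) built from AR(1) residuals.

    The first return has no lagged value, so the filtered series
    starts at the second observed price (an arbitrary choice).
    """
    r = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    c, phi = fit_ar1(r)
    resid = [r[t] - c - phi * r[t - 1] for t in range(1, len(r))]
    drift = sum(r) / len(r)  # keep the average drift of the raw series
    log_p = math.log(prices[1])
    filtered = [prices[1]]
    for e in resid:
        log_p += drift + e
        filtered.append(math.exp(log_p))
    return phi, filtered


# Toy example: a price that does nothing but bounce between 100 and 101.
phi, filtered = debounce_ar1([100.0, 101.0, 100.0, 101.0, 100.0, 101.0, 100.0])
```

On this pure-bounce toy series the fitted `phi` is approximately -1 and the filtered series is approximately flat. On real data the AR(1) fit would also strip out any genuine lag-1 dependence, which is part of what I am unsure about.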
Bonus, since it is potentially too broad: is any of this analysis/back-testing affected by not forward-filling empty bars (1-minute periods in which no trades occurred)? I am trying to avoid forward-filling because, when there are many empty bars, it biases our autocorrelation estimates (upward) and our volatility estimates (downward), among other potential biases.
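A small toy illustration of the forward-fill concern, assuming bars are stored as a mapping from bar start time (epoch seconds) to last price. The helper names are hypothetical and the numbers are made up purely to show the direction of the effect.

```python
import math
import statistics
from typing import Dict, List, Sequence


def forward_fill(bars: Dict[int, float], bar_seconds: int = 60) -> List[float]:
    """Fill every missing bar between the first and last bar with the prior price."""
    starts = sorted(bars)
    last = bars[starts[0]]
    filled = []
    for t in range(starts[0], starts[-1] + bar_seconds, bar_seconds):
        last = bars.get(t, last)
        filled.append(last)
    return filled


def log_returns(prices: Sequence[float]) -> List[float]:
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def autocorr(x: Sequence[float], lag: int = 1) -> float:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(len(x) - lag))
    return cov / var


# Bars at minutes 0, 1 and 4; minutes 2 and 3 had no trades.
bars = {0: 100.0, 60: 101.0, 240: 100.0}

raw_returns = log_returns([bars[t] for t in sorted(bars)])
filled_returns = log_returns(forward_fill(bars))  # contains two zero returns

vol_raw = statistics.pstdev(raw_returns)
vol_filled = statistics.pstdev(filled_returns)
ac_raw = autocorr(raw_returns)
ac_filled = autocorr(filled_returns)
```

In this example the forward-filled series has a lower per-bar volatility and a lag-1 autocorrelation closer to zero (less negative) than the unfilled series, which is the direction of bias described above.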