I have a large time series (about 30 million records) of network paths in the following format:

timestamp, path, latency

The path is a sequence of IP addresses, so it can be represented either as a string or as an array of integers. Currently the data are stored in text files, which makes analyzing and querying the paths very slow. It was suggested that I use a time-series database (TSDB), such as InfluxDB or OpenTSDB, to store them efficiently, but the background reading I did suggests that TSDBs are designed for numerical values. For instance, the OpenTSDB documentation says:

OpenTSDB is a time series database. A time series is a series of numeric data points of some particular metric over time.

Is there any optimization I'll gain from using a TSDB instead of a relational DB in my case, and generally for timeseries that include non-numerical values?

The main queries I plan to run are: get all the paths between two timestamps, check whether the path changed, and see how those changes affect the latency. Additionally, I may need to search for paths with specific hops (e.g. select all records where the path includes the IP hop 1.2.3.4), or for all paths with latency over a certain threshold.

asked Mar 25, 2017 at 20:42
  • Writing a good answer to this would require understanding what kinds of questions you intend to ask about your data. Commented Mar 26, 2017 at 12:53
  • @Blrfl thanks, I edited the question to explain what type of queries I plan to run Commented Mar 27, 2017 at 0:00
  • Why can't these data be numerical? If an ip address can be an array of integers, then so can just one. A timestamp ends up being numerical in many RDBMS. Latency probably is. You may find your data could be stored hierarchically to help in querying IP hop. stackoverflow.com/questions/38801/… Commented Mar 27, 2017 at 14:19

1 Answer

Yes. Time-series databases support filtering and grouping over categorical (non-numeric) fields.

For example, suppose a time-series database stores temperature readings from multiple IoT sensors. The sensor name would be a string (hence non-numeric), and because it is stored alongside each reading, you can filter or group by that particular sensor.

However, in your specific example, you use IP addresses, and IP addresses are numeric.
An IPv4 address spans 32 bits, so you can store it as a 32-bit integer if you wish. And because searches and subnetting can be reduced to integer arithmetic, any operation you would perform on an IP address can be performed on its integer form.
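As a rough illustration of that point, here is a minimal Python sketch using only the standard-library `ipaddress` module. The helper names (`ip_to_int`, `int_to_ip`, `in_subnet`) are made up for this example and are not part of any database API; the subnet test is just a bitwise AND against the netmask.

```python
import ipaddress


def ip_to_int(addr: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))


def int_to_ip(value: int) -> str:
    """Convert a 32-bit integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(value))


def in_subnet(addr_int: int, network: str) -> bool:
    """Check subnet membership with plain integer arithmetic."""
    net = ipaddress.IPv4Network(network)
    mask = int(net.netmask)
    # An address belongs to the subnet when its masked bits equal the network address.
    return (addr_int & mask) == int(net.network_address)


hop = ip_to_int("1.2.3.4")
print(hop)                            # 16909060
print(int_to_ip(hop))                 # 1.2.3.4
print(in_subnet(hop, "1.2.3.0/24"))   # True
```

Storing hops this way keeps each one at a fixed four bytes, whatever the textual form of the address looked like.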

If you want to search for paths with specific hops, store each path as a list of integers and search for the integer value of the hop. You can even extend this to search for integers within a range, which finds hops within a specific subnet.
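To make the hop search concrete, here is a small in-memory Python sketch. It assumes records shaped like the question's `timestamp, path, latency` rows; the `Record` type, the helper names, and the sample data are all hypothetical, and a real database would run the equivalent filter on its own storage rather than on a Python list.

```python
import ipaddress
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    timestamp: int            # e.g. Unix epoch seconds (assumed)
    path: Tuple[int, ...]     # each hop stored as a 32-bit integer
    latency_ms: float


def make_record(timestamp: int, hops: List[str], latency_ms: float) -> Record:
    """Build a record, converting dotted-quad hops to integers."""
    path = tuple(int(ipaddress.IPv4Address(h)) for h in hops)
    return Record(timestamp, path, latency_ms)


def with_hop(records: List[Record], hop: str) -> List[Record]:
    """Records whose path contains the exact hop."""
    target = int(ipaddress.IPv4Address(hop))
    return [r for r in records if target in r.path]


def with_hop_in_subnet(records: List[Record], network: str) -> List[Record]:
    """Records with at least one hop inside the subnet (an integer range check)."""
    net = ipaddress.IPv4Network(network)
    low, high = int(net.network_address), int(net.broadcast_address)
    return [r for r in records if any(low <= h <= high for h in r.path)]


# Made-up sample data for illustration only.
records = [
    make_record(1, ["10.0.0.1", "1.2.3.4"], 12.5),
    make_record(2, ["10.0.0.1", "1.2.4.9"], 30.0),
]
print([r.timestamp for r in with_hop(records, "1.2.3.4")])               # [1]
print([r.timestamp for r in with_hop_in_subnet(records, "1.2.0.0/16")])  # [1, 2]
```

Both filters are linear scans here; whether a given TSDB or relational database can index array elements to avoid that scan depends on the product and is not something this sketch shows.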

answered Mar 27, 2017 at 14:13
  • "IPv4 address span 32 bits, hence, you can store them as 32 bit integers if you so wished to". However, I'm storing sequences of IP addresses (traceroute paths), which means that I can store them either as strings or as arrays of integers. Are the same search operations still more efficient in a TSDB compared to a relational DB? Commented Mar 27, 2017 at 17:10
  • Yes. It's easier to search through a list of integers than through a list of strings. Comparing two integers is a single comparison operation, whereas comparing two strings means comparing them character by character, which in the worst case takes as many comparisons as the length of the shorter string. Commented Mar 28, 2017 at 14:57
