Specifically: how many nodes are you using (and are they small or large), how big is your data, how many simultaneous users does your setup support, and what kind of performance do you see?
Thanks in advance.
Our data is three denormalized tables sharing the same ~1,000-column schema, the largest of which contains 7B rows. Most of our queries are simple aggregations over a few columns, with simple constraints like date range + customer ID.
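For context, a typical query is shaped roughly like the sketch below (table and column names are made up for illustration):

    -- Hypothetical names; the general shape of our queries:
    -- a simple aggregation over a few columns, constrained
    -- by a date range and a customer ID.
    SELECT event_date,
           COUNT(*)     AS events,
           SUM(revenue) AS revenue
    FROM   events
    WHERE  event_date BETWEEN '2013-01-01' AND '2013-01-31'
    AND    customer_id = 12345
    GROUP BY event_date;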
So far we're really impressed with Redshift's load performance and how easy it was to get up and running, but we're still an order of magnitude away from Infobright's query performance. Next steps are playing with distribution and sort keys and trying different cluster configurations; to be fair to Redshift, we haven't run it on more than a six-node XL cluster yet.
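If it helps anyone picture the tuning step, the idea is roughly the sketch below (names are made up, and whether these are the right keys for our workload is exactly what we still need to test):

    -- Rough sketch with hypothetical names. DISTKEY co-locates a
    -- customer's rows on one slice, so per-customer work avoids a
    -- network shuffle; a leading SORTKEY on the date lets Redshift
    -- skip blocks outside the queried date range.
    CREATE TABLE events (
        event_date   DATE,
        customer_id  BIGINT,
        revenue      DECIMAL(18,2)
        -- ... ~1,000 columns in our actual schema
    )
    DISTKEY (customer_id)
    SORTKEY (event_date, customer_id);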
Performance is great, except when it's not. Simple aggregations on tables under 1B rows, and two-table joins where both tables are under 100M rows, are blazingly fast (maybe 1-30 seconds, depending on the query). On tables larger than that, queries can start to crawl.
I'd be interested to hear from others with bigger workloads than ours; for example, someone from Netflix (they switched their data warehouse from Vertica to Redshift this year).
Even with manual zooming it's still a bit harsh because of the colours (maybe it's just me, though).
I wonder what hash functions they're using.
[0] http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...
[1] https://news.ycombinator.com/item?id=4488946
It would be really great if Redshift supported HLL as a proper data type, and not just as the approximation behind COUNT(). A proper data type would allow us to store pre-aggregated sketches instead of billions of unnecessarily granular rows.
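To illustrate the difference (the first query uses real Redshift syntax; the HLL column type below it is purely hypothetical, with made-up table names):

    -- Real: Redshift's HLL-backed approximation, usable today,
    -- but it has to scan the raw, granular rows every time.
    SELECT APPROXIMATE COUNT(DISTINCT user_id)
    FROM   events
    WHERE  event_date BETWEEN '2013-01-01' AND '2013-01-31';

    -- Hypothetical: if HLL were a first-class type, one small
    -- mergeable sketch per day/customer could replace the raw rows,
    -- and sketches could be merged to answer distinct-count queries
    -- over any date range.
    CREATE TABLE daily_uniques (
        event_date   DATE,
        customer_id  BIGINT,
        user_sketch  HLL    -- not a real Redshift type
    );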