Cassandra Data Modeling

The document summarizes a workshop on Cassandra data modeling. It discusses four use cases: (1) modeling clickstream data by storing sessions and clicks in separate column families, (2) modeling a rolling time window of data points by storing each point in a column with a TTL, (3) modeling rolling counters by storing counts in columns indexed by time bucket, and (4) using transaction logs to achieve eventual consistency when modeling many-to-many relationships by serializing transactions and deleting logs after commit. The document provides recommendations and alternatives for each use case.

Cassandra Data Modeling Workshop
Matthew F. Dennis // @mdennis

Overview
  • Hopefully interactive
  • Use cases submitted via Google Moderator, email, IRC, etc.
  • Interesting and/or common requests in the slides to get us started
  • Bring up others if you have them!

Data Modeling Goals
  • Keep data queried together on disk together
  • In a more general sense, think about the efficiency of querying your data and work backward from there to a model in Cassandra
  • Don't try to normalize your data (contrary to many use cases in relational databases)
  • Usually better to keep a record that something happened as opposed to changing a value (not always advisable or possible)

ClickStream Data (use case #1)
  • A ClickStream (in this context) is the sequence of actions a user of an application performs
  • Usually this refers to clicking links in a WebApp
  • Useful for ad selection, error recording, UI/UX improvement, A/B testing, debugging, et cetera
  • The Google Moderator request gave little detail on the purpose of collecting the ClickStream data, so I made some up

ClickStream Data Defined
  • Record actions of a user within a session for debugging purposes if app/browser/page/server crashes

Recording Sessions
  • CF for sessions a user has had
      • Row Key is user name/id
      • Column Name is session id (TimeUUID)
      • Column Value is empty (or length of session, or some aggregated details about the session after it ended)
  • CF for actual sessions
      • Row Key is TimeUUID session id
      • Column Name is timestamp/TimeUUID of each click
      • Column Value is details about that click (serialized)

UserSessions Column Family

  Row Key | Session_01 (TimeUUID) | Session_02 (TimeUUID) | Session_03 (TimeUUID)
  userId  | (empty/agg)           | (empty/agg)           | (empty/agg)

  • Most recent session
  • All sessions for a given time period

Sessions Column Family

  Row Key              | timestamp_01             | timestamp_02             | timestamp_03
  SessionId (TimeUUID) | ClickData (json/xml/etc) | ClickData (json/xml/etc) | ClickData (json/xml/etc)

  • Retrieve entire session's ClickStream (row)
  • Order of clicks/events preserved
  • Retrieve ClickStream for a slice of time within the session
  • First action taken in a session
  • Most recent action taken in a session
  • Why JSON/XML/etc?

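The two column families and the queries listed for them can be tried out with a toy in-memory stand-in. This is only a sketch under stated assumptions: the `ColumnFamily` class and its `slice_columns` method are hypothetical helpers (not a Cassandra client API), plain integers stand in for TimeUUIDs, and the JSON click payloads are made up.

```python
import bisect
from collections import defaultdict


class ColumnFamily:
    """Toy stand-in for a column family: row key -> columns kept sorted by name."""

    def __init__(self):
        self._rows = defaultdict(list)  # row key -> sorted list of (name, value)

    def insert(self, row_key, name, value):
        cols = self._rows[row_key]
        names = [n for n, _ in cols]
        i = bisect.bisect_left(names, name)
        if i < len(cols) and cols[i][0] == name:
            cols[i] = (name, value)  # same column name: overwrite
        else:
            cols.insert(i, (name, value))

    def slice_columns(self, row_key, start=None, finish=None, reverse=False, count=None):
        """Columns with start <= name <= finish, in name order (or reversed)."""
        cols = self._rows.get(row_key, [])
        names = [n for n, _ in cols]
        lo = 0 if start is None else bisect.bisect_left(names, start)
        hi = len(cols) if finish is None else bisect.bisect_right(names, finish)
        result = cols[lo:hi]
        if reverse:
            result = result[::-1]
        return result if count is None else result[:count]


user_sessions = ColumnFamily()  # row key: user id, column name: session id
sessions = ColumnFamily()       # row key: session id, column name: click time

# Integers stand in for TimeUUIDs; a larger value means a later time.
user_sessions.insert("user42", 1000, "")
user_sessions.insert("user42", 2000, "")
sessions.insert(1000, 1001, '{"action": "open /home"}')
sessions.insert(1000, 1005, '{"action": "click /cart"}')
sessions.insert(1000, 1009, '{"action": "click /checkout"}')

most_recent_session = user_sessions.slice_columns("user42", reverse=True, count=1)
first_action = sessions.slice_columns(1000, count=1)
last_action = sessions.slice_columns(1000, reverse=True, count=1)
part_of_session = sessions.slice_columns(1000, start=1002, finish=1009)
```

Because columns are kept sorted by name, "most recent" and "first" are just a one-column slice from either end of the row, which is the property the slide relies on.
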
Alternatives?
Of Course (depends on what you want to do)
  • Secondary Indexes
  • All Sessions in one row
  • Track by time of activity instead of session

Secondary Indexes Applied
  • Drop UserSessions CF and use secondary indexes
  • Uses a "well known" column to record the user in the row; secondary index is created on that column
  • Doesn't work so well when storing aggregates about sessions in the UserSessions CF
  • Better when you want to retrieve all sessions a user has had

All Sessions In One Row Applied
  • Row Key is userId
  • Column Name is composite of timestamp and sessionId
  • Can efficiently request activity of a user across all sessions within a specific time range
  • Rows could potentially grow quite large, be careful
  • Reads will almost always require at least two seeks on disk

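A minimal sketch of why a composite column name of (timestamp, sessionId) makes time-range requests cheap: the columns sort by timestamp first, so a range over time is one contiguous slice. The helper names, the integer timestamps, and the sample clicks are all hypothetical; a Python list stands in for one user's row.

```python
import bisect


def record_click(row, timestamp, session_id, click):
    """Insert a column named (timestamp, session_id), keeping the row sorted."""
    bisect.insort(row, ((timestamp, session_id), click))


def activity_between(row, start, end):
    """Columns with start <= timestamp < end, across all sessions."""
    # A 1-tuple sorts before every composite name with the same timestamp.
    lo = bisect.bisect_left(row, ((start,),))
    hi = bisect.bisect_left(row, ((end,),))
    return row[lo:hi]


row = []  # all columns for one user
record_click(row, 105, "s2", "open /home")
record_click(row, 100, "s1", "open /login")
record_click(row, 110, "s1", "click /cart")
record_click(row, 120, "s2", "click /help")

recent = activity_between(row, 105, 120)
```
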
Time Period Partitioning Applied
  • Row Key is composite of userId and time "bucket"
      • e.g. jan_2011 or jan_01_2011 for month or day buckets respectively
  • Column Name is TimeUUID of click
  • Column Value is serialized click data
  • Avoids always requiring multiple seeks when the user has old data but only recent data is requested
  • Easy to lazily aggregate old activity
  • Can still efficiently request activity of a user across all sessions within a specific time range

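With bucketed row keys, a time-range request first has to work out which rows to read. The sketch below assumes month buckets named as on the slide (e.g. jan_2011); the `:` separator between userId and bucket and the function names are assumptions made for illustration.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_bucket(d):
    """Bucket name such as 'jan_2011' for a date."""
    return f"{MONTHS[d.month - 1]}_{d.year}"


def row_key(user_id, d):
    """Composite row key: user id plus month bucket (separator is an assumption)."""
    return f"{user_id}:{month_bucket(d)}"


def row_keys_for_range(user_id, start, end):
    """Row keys of every month bucket overlapping [start, end], oldest first."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{user_id}:{MONTHS[month - 1]}_{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


keys = row_keys_for_range("user42", date(2010, 12, 15), date(2011, 2, 3))
```

A request for recent activity touches only the newest bucket, which is the point of the partitioning; a request spanning buckets reads one row per bucket and concatenates the results.
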
Rolling Time Window Of Data Points (use case #2)
  • The example given was something similar to RRDTool
  • Essentially store a series of data points within a rolling window
  • A common request from Cassandra users, for this and/or something similar

Data Points Defined
  • Each data point has a value (or multiple values)
  • Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5th minute of the 17th hour on some date)

Time Window Model

  System7:RenderTime
  Row Key | TimeUUID0 | TimeUUID1 | TimeUUID2
  s7:rt   | 0.051     | 0.014     | 0.173

  (TimeUUID1 column: some request took 0.014 seconds to render)

  • Row Key is the id of the time window data you are tracking (e.g. server7:render_time)
  • Column Name is timestamp (or TimeUUID) the event occurred at
  • Column Value is the value of the event (e.g. 0.051)

The Details
  • Cassandra TTL values are key here
  • When you insert each data point, set the TTL to the max time range you will ever request; there is very little overhead to expiring columns
  • When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call
  • Consider partitioning the rows by a known time period (e.g. "year") if you plan on keeping a long history of data (NB: requires slightly more complex logic in the app if a time range spans such a period)
  • Very efficient queries for any window of time

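Constructing the min/max TimeUUIDs for a time range can be sketched with Python's standard `uuid` module. A version 1 UUID stores a count of 100 ns intervals since 1582-10-15, so the bound is built by packing that count into the time fields. The function name is hypothetical, and filling the clock sequence and node with their smallest or largest values is an assumption: how UUIDs that share a timestamp are ordered depends on the Cassandra comparator in use.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_uuid_bound(dt, lowest=True):
    """Version 1 UUID carrying dt's timestamp, for use as a slice start or end.

    dt must be timezone-aware. lowest=True fills the non-time fields with
    their smallest values, lowest=False with their largest (an assumption
    about how equal-timestamp UUIDs are ordered).
    """
    micros = (dt - UNIX_EPOCH) // timedelta(microseconds=1)
    ts = micros * 10 + UUID_EPOCH_OFFSET
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # version 1
    if lowest:
        clock_seq_hi_variant, clock_seq_low, node = 0x80, 0x00, 0
    else:
        clock_seq_hi_variant, clock_seq_low, node = 0xBF, 0xFF, 0xFFFFFFFFFFFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


window_end = datetime(2011, 1, 7, tzinfo=timezone.utc)
window_start = window_end - timedelta(hours=1)
slice_start = time_uuid_bound(window_start, lowest=True)
slice_end = time_uuid_bound(window_end, lowest=False)
```

`slice_start` and `slice_end` would then be passed as the start and end of the column slice for the row being queried.
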
Rolling Window Of Counters (use case #3)
  • "How to model rolling time window that contains counters with time buckets of monthly (12 months), weekly (4 weeks), daily (7 days), hourly (24 hours)? Example would be; how many times user logged into a system in last 24 hours, last 7 days ..."
  • Timezones and the "rolling window" are what make this interesting

Rolling Time Window Details
  • One row for every granularity you want to track (e.g. day, hour)
  • Row Key consists of the granularity, metric, user and system
  • Column Name is a "fixed" time bucket on UTC time
  • Column Values are counts of the logins in that bucket
  • get_slice calls return multiple counters, which are then summed up

Rolling Time Window Counter Model

  user3:system5:logins:by_day
  Row Key   | 20110107 | ... | 20110523
  U3:S5:L:D | 2        | ... | 7

  (20110107: 2 logins on Jan 7th 2011 for user 3 on system 5)
  (20110523: 7 logins on May 23rd 2011 for user 3 on system 5)

  user3:system5:logins:by_hour
  Row Key   | 2011010710 | ... | 2011052316
  U3:S5:L:H | 1          | ... | 7

  (2011010710: one login for user 3 on system 5 on Jan 7th 2011 for the 10th hour)
  (2011052316: 2 logins for user 3 on system 5 on May 23rd 2011 for the 16th hour)

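As a rough illustration of the model above, each login maps to one counter column per granularity row. The function name is hypothetical; the row key layout and the bucket formats follow the slide, and the actual counter increment is left out since it depends on the client in use.

```python
from datetime import datetime, timedelta, timezone


def counter_columns(user, system, metric, when):
    """(row key, column name) pairs to increment for one event.

    `when` must be timezone-aware; buckets are always computed in UTC.
    """
    utc = when.astimezone(timezone.utc)
    prefix = f"{user}:{system}:{metric}"
    return [
        (f"{prefix}:by_day", utc.strftime("%Y%m%d")),
        (f"{prefix}:by_hour", utc.strftime("%Y%m%d%H")),
    ]


login_utc = datetime(2011, 1, 7, 10, 30, tzinfo=timezone.utc)
targets = counter_columns("user3", "system5", "logins", login_utc)

# A login late in the evening at UTC-5 lands in the next UTC day's bucket.
login_local = datetime(2011, 1, 7, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
targets_local = counter_columns("user3", "system5", "logins", login_local)
```
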
Rolling Time Window Queries
  • Time window is rolling and there are other timezones besides UTC
  • one get_slice for the "middle" counts
  • one get_slice for the "left end"
  • one get_slice for the "right end"

Example: logins for the past 7 days
  • Determine date/time boundaries
  • Determine UTC days that are wholly contained within your boundaries to select and sum
  • Select and sum counters for the remaining hours on either side of the UTC days
  • O(1) queries (3 in this case), can be requested from C* in parallel
  • NB: some timezones are annoying (e.g. 15 minute or 30 minute offsets); I try to ignore them

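The boundary arithmetic described above can be sketched as follows. The function names are hypothetical, and both boundaries are floored to whole UTC hours, which is an assumption that follows from hour being the finest granularity stored; it is also why 15 and 30 minute timezone offsets cannot be represented exactly.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)


def _buckets(start, end, step, fmt):
    """Bucket names for start <= t < end, stepping by `step`."""
    names = []
    t = start
    while t < end:
        names.append(t.strftime(fmt))
        t += step
    return names


def plan_slices(start, end):
    """Split [start, end) into (left hour buckets, whole UTC days, right hour buckets).

    start and end must be timezone-aware; both are floored to a whole UTC hour.
    """
    start = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    first_day = start.replace(hour=0)
    if first_day < start:
        first_day += DAY          # first UTC midnight at or after start
    last_day = end.replace(hour=0)  # last UTC midnight at or before end
    if last_day < first_day:
        # The window never crosses a UTC midnight: hour buckets only.
        return _buckets(start, end, HOUR, "%Y%m%d%H"), [], []
    left = _buckets(start, first_day, HOUR, "%Y%m%d%H")
    days = _buckets(first_day, last_day, DAY, "%Y%m%d")
    right = _buckets(last_day, end, HOUR, "%Y%m%d%H")
    return left, days, right


now = datetime(2011, 5, 23, 16, 0, tzinfo=timezone.utc)
left, days, right = plan_slices(now - timedelta(days=7), now)
```

Each of the three lists is contiguous in its row, so its first and last names can serve as the start and end of one slice: `left` and `right` against the by_hour row, `days` against the by_day row.
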
Alternatives? (of course)
  • If you're counting logins and each user doesn't log in hundreds of times a day, just have one row per user with a TimeUUID column name for the time the login occurred
  • Supports any timezone/range/granularity easily
  • More expensive for large ranges (e.g. year) regardless of granularity, so cache results (in C*) lazily
  • NB: caching results for rolling windows is not usually helpful (because, well, it's rolling and always changes)

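A small sketch of why one column per login supports any timezone, range, or granularity: counting is just measuring the width of a slice over sorted login times. A sorted Python list of timezone-aware datetimes stands in for the user's row here, and the function name and sample data are hypothetical.

```python
import bisect
from datetime import datetime, timedelta, timezone


def count_logins(login_times, start, end):
    """Number of logins with start <= t < end; login_times must be sorted."""
    return bisect.bisect_left(login_times, end) - bisect.bisect_left(login_times, start)


utc = timezone.utc
login_times = [
    datetime(2011, 5, 16, 9, 15, tzinfo=utc),
    datetime(2011, 5, 20, 22, 40, tzinfo=utc),
    datetime(2011, 5, 23, 3, 5, tzinfo=utc),
    datetime(2011, 5, 23, 15, 45, tzinfo=utc),
]

# The window boundaries can be expressed in any timezone, at any precision.
local = timezone(timedelta(hours=5, minutes=30))
window_end = datetime(2011, 5, 23, 21, 30, tzinfo=local)  # 16:00 UTC
last_24_hours = count_logins(login_times, window_end - timedelta(hours=24), window_end)
last_7_days = count_logins(login_times, window_end - timedelta(days=7), window_end)
```
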
Eventually Atomic (use case #4)
  • "When there are many to many or one to many relations involved how to model that and also keep it atomic? for eg: one user can upload many pictures and those pictures can somehow be related to other users as well."
  • Attempting full ACID compliance in distributed systems is a bad idea (and impossible in the general sense)
  • However, consistency is important and can certainly be achieved in C*
  • Many approaches / alternatives
  • I like the transaction log approach, especially in the context of C*

Transaction Logs (in this context)
  • Records what is going to be performed before it is actually performed
  • Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense)
  • Marks that the actions were performed

In Cassandra
  • Serialize all actions that need to be performed in a single column: JSON, XML, YAML (yuck!), cPickle, JSO, et cetera
      • Row Key = randomly chosen C* node token
      • Column Name = TimeUUID
  • Perform actions
  • Delete Column

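The three steps above (log, perform, delete) can be sketched with an in-memory dict standing in for the XACT_LOG column family. Everything named here is hypothetical: the `run_transaction` helper, the token list, and the shape of an action; a real implementation would issue the corresponding writes and deletes through a Cassandra client.

```python
import json
import random
import uuid

XACT_LOG = {}  # row key (node token) -> {column name (TimeUUID): serialized actions}
NODE_TOKENS = ["token_a", "token_b", "token_c"]  # hypothetical node tokens


def run_transaction(actions, apply_action, log=XACT_LOG, tokens=NODE_TOKENS):
    """Log the actions, perform them, then delete the log entry."""
    row_key = random.choice(tokens)
    column = uuid.uuid1()  # time-based UUID as the column name
    # 1. Record what is going to be performed before performing it.
    log.setdefault(row_key, {})[column] = json.dumps(actions)
    # 2. Perform the actions; if this raises, the log entry stays for recovery.
    for action in actions:
        apply_action(action)
    # 3. Mark the actions as performed by deleting the log column.
    del log[row_key][column]
    return row_key, column


store = {}  # toy target data: (cf, row) -> {column: value}


def apply_to_store(action):
    """Idempotent write: applying the same action twice leaves the same state."""
    store.setdefault((action["cf"], action["row"]), {})[action["col"]] = action["val"]


run_transaction(
    [
        {"cf": "UserPictures", "row": "user1", "col": "pic9", "val": ""},
        {"cf": "PictureUsers", "row": "pic9", "col": "user1", "val": ""},
    ],
    apply_to_store,
)
```

If the process dies anywhere between step 1 and step 3, the serialized actions are still in the log and can be replayed later, which is what makes the set of writes eventually atomic.
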
Configuration Details
  • Short GC_Grace on the XACT_LOG Column Family (e.g. 1 hour)
  • Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability (if it fails with an unavailable exception, pick a different node token and/or node and try again; same semantics as a traditional relational DB)
  • 1M memtable ops, 1 hour memtable flush time

Failures
  • Before insert into the XACT_LOG
  • After insert, before actions
  • After insert, in middle of actions
  • After insert, after actions, before delete
  • After insert, after actions, after delete

Recovery
  • Each C* node has a cron job, offset from every other node's job by some time period
  • Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period
  • Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working normally)

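A minimal sketch of the replay job, again with a dict standing in for XACT_LOG. As a simplification, each entry stores its write time explicitly as `written_at`; in the model described in the slides the age would come from the TimeUUID column name. The function name, the age threshold, and the sample data are hypothetical.

```python
import json


def replay_stale(log, apply_action, now, max_age_seconds):
    """Replay and delete every log entry older than max_age_seconds.

    log maps row key -> {column: (written_at, serialized actions)}.
    Returns the number of entries replayed.
    """
    replayed = 0
    for columns in log.values():
        for column, (written_at, payload) in list(columns.items()):
            if now - written_at < max_age_seconds:
                continue  # recent entry: its writer may still be working on it
            for action in json.loads(payload):
                apply_action(action)  # must be idempotent; it may have run before
            del columns[column]
            replayed += 1
    return replayed


store = {}


def apply_to_store(action):
    store.setdefault(action["row"], {})[action["col"]] = action["val"]


actions = json.dumps([{"row": "user1", "col": "pic9", "val": ""}])
log = {
    "token_a": {"old-entry": (1000, actions)},
    "token_b": {"fresh-entry": (1950, actions)},
}

# The first action of "old-entry" was already applied before the crash.
store["user1"] = {"pic9": ""}
count = replay_stale(log, apply_to_store, now=2000, max_age_seconds=300)
```

Replaying an entry whose actions were partly or fully applied is harmless only because the writes are idempotent, which is why the same approach does not carry over to counters.
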
XACT_LOG Comments
  • Idempotent writes are awesome (that's why this works so well)
  • Doesn't work so well for counters (they're not idempotent)
  • Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)
  • Could use a reliable queuing service (e.g. SQS) instead of polling: push to SQS first, then XACT log

Q?

Cassandra Data Modeling

  • 1.
    Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis
  • 2.
    Overview くろまる Hopefully interactive くろまる Use cases submitted via Google Moderator, email, IRC, etc くろまる Interesting and/or common requests in the slides to get us started くろまる Bring up others if you have them !
  • 3.
    Data Modeling Goals くろまる Keep data queried together on disk together くろまる In a more general sense think about the efficiency of querying your data and work backward from there to a model in Cassandra くろまる Don't try to normalize your data (contrary to many use cases in relational databases) くろまる Usually better to keep a record that something happened as opposed to changing a value (not always advisable or possible)
  • 4.
    ClickStream Data (use case #1) くろまる A ClickStream (in this context) is the sequence of actions a user of an application performs くろまる Usually this refers to clicking links in a WebApp くろまる Useful for ad selection, error recording, UI/UX improvement, A/B testing, debugging, et cetera くろまる Not a lot of detail in the Google Moderator request on what the purpose of collecting the ClickStream data was – so I made some up
  • 5.
    ClickStream Data Defined くろまる Record actions of a user within a session for debugging purposes if app/browser/page/server crashes
  • 6.
    Recording Sessions くろまる CF for sessions a user has had くろまる Row Key is user name/id くろまる Column Name is session id (TimeUUID) くろまる Column Value is empty (or length of session, or some aggregated details about the session after it ended) くろまる CF for actual sessions くろまる Row Key is TimeUUID session id くろまる Column Name is timestamp/TimeUUID of each click くろまる Column Value is details about that click (serialized)
  • 7.
    UserSessions Column Family Session_01 Session_02 Session_03 (TimeUUID) (TimeUUID) userId (TimeUUID) (empty/agg) (empty/agg) (empty/agg) くろまる Most recent session くろまる All sessions for a given time period
  • 8.
    Sessions Column Family timestamp_01 timestamp_02 timestamp_03 SessionId (TimeUUID) ClickData ClickData ClickData (json/xml/etc) (json/xml/etc) (json/xml/etc) くろまる Retrieve entire session's ClickStream (row) くろまる Order of clicks/events preserved くろまる Retrieve ClickStream for a slice of time within the session くろまる First action taken in a session くろまる Most recent action taken in a session くろまる Why JSON/XML/etc?
  • 9.
  • 10.
    Of Course (depends on what you want to do) くろまる Secondary Indexes くろまる All Sessions in one row くろまる Track by time of activity instead of session
  • 11.
    Secondary Indexes Applied くろまる Drop UserSessions CF and use secondary indexes くろまる Uses a "well known" column to record the user in the row; secondary index is created on that column くろまる Doesn't work so well when storing aggregates about sessions in the UserSessions CF くろまる Better when you want to retrieve all sessions a user has had
  • 12.
    All Sessions In One Row Applied くろまる Row Key is userId くろまる Column Name is composite of timestamp and sessionId くろまる Can efficiently request activity of a user across all sessions within a specific time range くろまる Rows could potentially grow quite large, be careful くろまる Reads will almost always require at least two seeks on disk
  • 13.
    Time Period Partitioning Applied くろまる Row Key is composite of userId and time "bucket" くろまる e.g. jan_2011 or jan_01_2011 for month or day buckets respectively くろまる Column Name is TimeUUID of click くろまる Column Value is serialized click data くろまる Avoids always requiring multiple seeks when the user has old data but only recent data is requested くろまる Easy to lazily aggregate old activity くろまる Can still efficiently request activity of a user across all sessions within a specific time range
  • 14.
    Rolling Time Window Of Data Points (use case #2) くろまる Similar to RRDTool was the example given くろまる Essentially store a series of data points within a rolling window くろまる common request from Cassandra users for this and/or similar
  • 15.
    Data Points Defined くろまる Each data point has a value (or multiple values) くろまる Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5 th minute of th 17 hour on some date)
  • 16.
    Time Window Model System7:RenderTime TimeUUID0 TimeUUID1 TimeUUID2 s7:rt 0.051 0.014 0.173 Some request took 0.014 seconds to render くろまる Row Key is the id of the time window data you are tracking (e.g. server7:render_time) くろまる Column Name is timestamp (or TimeUUID) the event occurred at くろまる Column Value is the value of the event (e.g. 0.051)
  • 17.
    The Details ● Cassandra TTL values are key here ● When you insert each data point set the TTL to the max time range you will ever request; there is very little overhead to expiring columns ● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call ● Consider partitioning the rows by a known time period (e.g. "year") if you plan on keeping a long history of data (NB: requires slightly more complex logic in the app if a time range spans such a period) ● Very efficient queries for any window of time
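One possible way to construct the min/max TimeUUIDs for a slice range, using only the Python standard library. The helper name `time_uuid` is hypothetical, and filling the clock sequence and node with all zeros or all ones for the low and high bound is an assumption: how a given comparator orders UUIDs that share a timestamp may differ, so treat this as a sketch rather than a client library's behavior.

```python
import uuid
from datetime import datetime, timezone

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_uuid(dt, highest=False):
    """Build a version 1 UUID for a timezone-aware datetime.

    highest=False gives the smallest clock_seq/node (slice start);
    highest=True gives the largest (slice end). Assumption: the column
    comparator orders by timestamp first.
    """
    delta = dt - UNIX_EPOCH
    # Integer arithmetic avoids floating point rounding of the timestamp.
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    ts = micros * 10 + UUID_EPOCH_OFFSET
    if highest:
        ts += 9  # cover every 100-ns tick inside this microsecond
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq = 0x3FFF if highest else 0
    node = 0xFFFFFFFFFFFF if highest else 0
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             (clock_seq >> 8) | 0x80,  # RFC 4122 variant bits
                             clock_seq & 0xFF, node))


start = time_uuid(datetime(2011, 1, 1, tzinfo=timezone.utc))
end = time_uuid(datetime(2011, 1, 2, tzinfo=timezone.utc), highest=True)
print(start, end)  # candidates for the start/end of the get_slice call
```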
  • 18.
    Rolling Window Of Counters (use case #3) ● "How to model rolling time window that contains counters with time buckets of monthly (12 months), weekly (4 weeks), daily (7 days), hourly (24 hours)? Example would be; how many times user logged into a system in last 24 hours, last 7 days ..." ● Timezones and the "rolling window" are what make this interesting
  • 19.
    Rolling Time Window Details ● One row for every granularity you want to track (e.g. day, hour) ● Row Key consists of the granularity, metric, user and system ● Column Name is a "fixed" time bucket on UTC time ● Column Values are counts of the logins in that bucket ● get_slice calls return multiple counters, which are then summed up
  • 20.
    Rolling Time Window Counter Model [Diagram: row user3:system5:logins:by_day (U3:S5:L:D) with columns 20110107 = 2, ..., 20110523 = 7; annotations: 2 logins on Jan 7th 2011 for user 3 on system 5, 7 logins on May 23rd 2011 for user 3 on system 5] [Diagram: row user3:system5:logins:by_hour (U3:S5:L:H) with columns 2011010710 = 1, ..., 2011052316 = 7; annotations: one login for user 3 on system 5 on Jan 7th 2011 for the 10th hour, 2 logins for user 3 on system 5 on May 23rd 2011 for the 16th hour]
  • 21.
    Rolling Time Window Queries ● Time window is rolling and there are other timezones besides UTC ● one get_slice for the "middle" counts ● one get_slice for the "left end" ● one get_slice for the "right end"
  • 22.
    Example: logins for the past 7 days ● Determine date/time boundaries ● Determine UTC days that are wholly contained within your boundaries to select and sum ● Select and sum counters for the remaining hours on either side of the UTC days ● O(1) queries (3 in this case), can be requested from C* in parallel ● NB: some timezones are annoying (e.g. 15 or 30 minute offsets); I try to ignore them
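The boundary computation above can be sketched as follows. The function name `rolling_window_buckets` is hypothetical; the column name formats (20110107 for days, 2011010710 for hours) follow the counter model slide. As an assumption, boundaries are truncated to whole hours, which is one way of ignoring the 15 and 30 minute offsets.

```python
from datetime import datetime, timedelta, timezone


def _names(start, end, step, fmt):
    """Column names for buckets of size `step` covering [start, end)."""
    names = []
    while start < end:
        names.append(start.strftime(fmt))
        start += step
    return names


def rolling_window_buckets(start, end):
    """Split [start, end) into (left hours, whole UTC days, right hours).

    Both arguments must be timezone-aware datetimes.
    """
    hour, day = timedelta(hours=1), timedelta(days=1)
    start = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    first_midnight = start.replace(hour=0)
    if first_midnight < start:
        first_midnight += day              # first UTC midnight at or after start
    last_midnight = end.replace(hour=0)    # last UTC midnight at or before end
    if first_midnight >= last_midnight:
        # No whole UTC day fits in the window: hour buckets only.
        return _names(start, end, hour, "%Y%m%d%H"), [], []
    return (_names(start, first_midnight, hour, "%Y%m%d%H"),
            _names(first_midnight, last_midnight, day, "%Y%m%d"),
            _names(last_midnight, end, hour, "%Y%m%d%H"))


# A user at UTC-5 asks for the past 7 days at 16:00 local time.
local = timezone(timedelta(hours=-5))
end = datetime(2011, 5, 23, 16, 0, tzinfo=local)
left, days, right = rolling_window_buckets(end - timedelta(days=7), end)
print(left, days, right)
```

The first and last name in each of the three lists would serve as the start/end of one slice against the by_hour or by_day row, and the returned counters are summed by the client.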
  • 23.
    Alternatives? (of course) ● If you're counting logins and each user doesn't log in hundreds of times a day, just have one row per user with a TimeUUID column name for the time the login occurred ● Supports any timezone/range/granularity easily ● More expensive for large ranges (e.g. year) regardless of granularity, so cache results (in C*) lazily ● NB: caching results for rolling windows is not usually helpful (because, well, it's rolling and always changes)
  • 24.
    Eventually Atomic (use case #4) ● "When there are many to many or one to many relations involved how to model that and also keep it atomic? for eg: one user can upload many pictures and those pictures can somehow be related to other users as well." ● Attempting full ACID compliance in distributed systems is a bad idea (and impossible in the general sense) ● However, consistency is important and can certainly be achieved in C* ● Many approaches / alternatives ● I like the transaction log approach, especially in the context of C*
  • 25.
    Transaction Logs (in this context) ● Records what is going to be performed before it is actually performed ● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense) ● Marks that the actions were performed
  • 26.
    In Cassandra ● Serialize all actions that need to be performed in a single column: JSON, XML, YAML (yuck!), cpickle, JSO, et cetera ● Row Key = randomly chosen C* node token ● Column Name = TimeUUID ● Perform actions ● Delete Column
  • 27.
    Configuration Details ● Short GC_Grace on the XACT_LOG Column Family (e.g. 1 hour) ● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability (if it fails with an unavailable exception, pick a different node token and/or node and try again; same semantics as a traditional relational DB) ● 1M memtable ops, 1 hour memtable flush time
  • 28.
    Failures ● Before insert into the XACT_LOG ● After insert, before actions ● After insert, in middle of actions ● After insert, after actions, before delete ● After insert, after actions, after delete
  • 29.
    Recovery ● Each C* node has a cron job offset from every other node's by some time period ● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period ● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working normally)
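The log, perform, delete, and replay cycle described in the last few slides can be sketched with plain dictionaries standing in for the XACT_LOG column family and the data column families. Everything here is a hypothetical stand-in: no Cassandra client is involved, a `(timestamp, random id)` tuple takes the place of the TimeUUID column name, and the `fail_after` parameter only exists to simulate a crash in the middle of the actions.

```python
import json
import uuid


def apply_actions(store, actions):
    """Apply idempotent 'set column' actions; replaying them is harmless."""
    for action in actions:
        store.setdefault(action["row"], {})[action["column"]] = action["value"]


def run_transaction(xact_log, store, token, actions, now, fail_after=None):
    """Record the intent, perform the actions, then delete the log column."""
    column = (now, uuid.uuid4().hex)  # stand-in for a TimeUUID column name
    xact_log.setdefault(token, {})[column] = json.dumps(actions)
    done = actions if fail_after is None else actions[:fail_after]
    apply_actions(store, done)
    if fail_after is not None:
        return  # simulated crash: the log column is left behind
    del xact_log[token][column]


def recover(xact_log, store, now, max_age):
    """Replay, in full, every log column older than max_age, then delete it."""
    replayed = 0
    for columns in xact_log.values():
        for column in [c for c in columns if now - c[0] > max_age]:
            apply_actions(store, json.loads(columns[column]))
            del columns[column]
            replayed += 1
    return replayed


log, data = {}, {}
actions = [
    {"row": "user1:pictures", "column": "pic9", "value": ""},
    {"row": "pic9:users", "column": "user1", "value": ""},
]
run_transaction(log, data, "token_a", actions, now=100, fail_after=1)
print(recover(log, data, now=200, max_age=60), data)
```

Replaying the whole entry is safe here only because each action is an idempotent write, which is the property the next slide calls out.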
  • 30.
    XACT_LOG Comments ● Idempotent writes are awesome (that's why this works so well) ● Doesn't work so well for counters (they're not idempotent) ● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway) ● Could use a reliable queuing service (e.g. SQS) instead of polling: push to SQS first, then XACT log.
  • 31.
    Q? Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis
