A data transformation framework. A presentation inviting commentary.

Wed Aug 21 12:29:59 EDT 2013

Hi all,
In an effort to do some serious cleaning up of a hopelessly cluttered 
working environment, I developed a modular data transformation system 
that pretty much stands. I am very pleased with it. I expect huge time 
savings. I would share it, if had a sense that there is an interest out 
there and would appreciate comments. Here's a description. I named the 
module TX:
The nucleus of the TX system is a Transformer class, a wrapper for any 
kind of transformation functionality. The Transformer takes input as 
calling argument and returns it transformed. This design allows the 
assembly of transformation chains, either nesting calls or better, using 
the class Chain, derived from 'Transformer' and 'list'. A Chain consists 
of a sequence of Transformers and is functionally equivalent to an 
individual Transformer. A high degree of modularity results: Chains 
nest. Another consequence is that many transformation tasks can be 
handled with a relatively modest library of a few basic prefabricated 
Transformers from which many different Chains can be assembled on the 
fly. A custom Transformer to bridge an eventual gap is quickly written 
and tested, because the task likely is trivial.
 A good analogy of the TX methodology is a road map with towns 
scattered all over it and highways connecting them. To get from any town 
to any other one is a simple matter of hopping the towns in between. The 
TX equivalent of the towns are data formats, the equivalent of the 
highways are TX Transformers. They are not so much thought of in terms 
of what they do than in terms of the formats they take and give. 
Designing a library of Transformers is essentially a matter of 
establishing a collection of standard data formats. First the towns, 
then the highways.
 A feature of the TX Transformer is that it retains both its input 
and output. This makes a Chain a breeze to build progressively, link by 
link, and also makes debugging easy: If a Chain doesn't work, Chain.show 
() reveals the failing link as the first one that has no output. It can 
be replaced with a corrected instance, as one would replace a blown 
fuse. Running the Chain again without input makes it have another try.
 Parameter passing runs on a track that is completely separate from 
the payload track. Parameters can be set in order to configure a Chain 
prior to running it, or can be sent at runtime by individual 
Transformers to its siblings and their progeny. Parameters are keyed and 
get picked up by those Chain links whose set of pre-defined keys 
includes the parameter's key. Unintended pick-ups with coincidentally 
shared keys for unrelated parameters can be prevented by addressing 
parameters to individual Translators.
Below an application example. Five custom classes at the end exemplify 
the pattern. I join the post also as attachment, in case some 
auto-line-wrap messes up this text.
Commentary welcome
Frederic
----------------------------------------------------------------------------------------------------
An example of use: Download historic stock quotes from Yahoo Finance for 
a given range of dates and a list of symbols, delete a column and add 
three, insert the data in a MySQL table. Also write them to temporary 
files in tabular form for verification.
 "make_quotes_writer ()" returns a custom transformation tree. 
"run_quotes ()" makes such a tree, sets it on a given time range and 
runs it on a list of symbols.
 (Since Yahoo publishes the data for downloading, I presume it's 
okay to do it this way. This is a demo of TX, however, and should not be 
misconstrued as an encouragement to violate any publisher's terms of 
service.)
import TX, yahoo_historic_quotes as yhq
def make_quotes_writer ():
 Visualizer = TX.Chain (
 yhq.percent (),
 TX.Table_Maker (has_header = True),
 TX.Table_Writer (),
 name = 'Visualizer'
 )
 To_DB = TX.Chain (yhq.header_stripper(), TX.DB_Writer(table_name = 
'quotes'), name = 'To DB')
 To_File = TX.Chain (Visualizer, TX.File_Writer (), name = 'To File')
 Splitter = TX.Splitter (To_DB, To_File, name = 'Splitter')
 Quotes = TX.Chain (
 yhq.yahoo_quotes (),
 TX.CSV_To_List (delimiter = ','),
 TX.Numerizer (),
 yhq.wiggle_and_trend (),
 yhq.symbol (),
 Splitter,
 name = 'Quotes'
 )
 return Quotes
 >>> Quotes = make_quotes_writer ()
 >>> Quotes.show_tree()
Quotes
Quotes[0] - Yahoo Quotes
Quotes[1] - CSV To List
Quotes[2] - Numerizer
Quotes[3] - Wiggle and Trend
Quotes[4] - Symbol
Quotes[5] - Splitter
Quotes[5][0] - To DB
Quotes[5][0][0] - Header Stripper
Quotes[5][0][1] - DB Writer
Quotes[5][1] - To File
Quotes[5][1][0] - Visualizer
Quotes[5][1][0][0] - Percent
Quotes[5][1][0][1] - Table Maker
Quotes[5][1][0][2] - Table Writer
Quotes[5][1][1] - File Writer
def run_quotes (symbols, from_date = '1970-01-01', to_date = '2099-12-31'):
 '''Downloads historic stock quotes from Yahoo Finance.
 Inserts data into MySQL table "quotes" relying on index to reject 
repeat insertions.
 Also writes the data in a table format to a temporary file, one for 
each symbol.
 '''
 Quotes = make_quotes_writer ()
 Quotes.set (from_date = from_date, to_date = to_date)
 FW = Quotes [5][1][1] # File Writer
 for symbol in symbols:
 outfilename = '/tmp/%s' % symbol
 FW.set (terminal = outfilename)
 Quotes.set (symbol = symbol)
 Quotes ()
 >>> run_quotes (('MO', 'SU'), '2013-08-10')
Checking:
 >>> DBRC = TX.DB_Run_Command ()
 >>> for record in DBRC ("select * from quotes where symbol in 
('MO','SU') and date >= 20130810;"):
 print record
(datetime.date(2013, 8, 16), 'MO', 34.51, 34.63, 34.25, 34.29, 8364900L, 
0.0110337, -0.00638792)
(datetime.date(2013, 8, 15), 'MO', 34.78, 34.93, 34.5, 34.57, 15144400L, 
0.0123866, -0.00604926)
(datetime.date(2013, 8, 14), 'MO', 35.23, 35.25, 34.81, 35.06, 5871900L, 
0.0125607, -0.00485298)
(datetime.date(2013, 8, 13), 'MO', 35.17, 35.27, 35.01, 35.22, 5919700L, 
0.00739898, 0.00142288)
(datetime.date(2013, 8, 12), 'MO', 35.25, 35.28, 35.05, 35.16, 4884300L, 
0.00654059, -0.00255936)
(datetime.date(2013, 8, 16), 'SU', 33.84, 34.63, 33.81, 34.18, 7396600L, 
0.0239626, 0.00993571)
(datetime.date(2013, 8, 15), 'SU', 32.84, 33.99, 32.7, 33.94, 7747100L, 
0.0386865, 0.0329885)
(datetime.date(2013, 8, 14), 'SU', 32.06, 32.91, 32.02, 32.72, 4690200L, 
0.0274141, 0.0203296)
(datetime.date(2013, 8, 13), 'SU', 32.04, 32.18, 31.94, 32.01, 2051900L, 
0.00748596, -0.000935745)
(datetime.date(2013, 8, 12), 'SU', 32.15, 32.35, 31.99, 32.15, 2870500L, 
0.0111906, 0.0)
 >>> FR = TX.File_Reader ()
 >>> print FR ('/tmp/SU')
 Date | Symbol | Open | High | Low | Close | Volume | Wiggle 
| Trend |

 2013年08月16日 | SU | 33.84 | 34.63 | 33.81 | 34.18 | 7396600.00 | 
2.40% | +0.99% |
 2013年08月15日 | SU | 32.84 | 33.99 | 32.70 | 33.94 | 7747100.00 | 
3.87% | +3.30% |
 2013年08月14日 | SU | 32.06 | 32.91 | 32.02 | 32.72 | 4690200.00 | 
2.74% | +2.03% |
 2013年08月13日 | SU | 32.04 | 32.18 | 31.94 | 32.01 | 2051900.00 | 
0.75% | -0.09% |
 2013年08月12日 | SU | 32.15 | 32.35 | 31.99 | 32.15 | 2870500.00 | 
1.12% | +0.00% |
----------------------------------------------------------------------------------------------------
Each Transformer retains input and output, freezing the state of a 
composite after each run. This makes developing step by step and 
debugging a breeze. If a Chain fails, the failing link is the first one 
to have no output. Chain.show () displays them all in sequence. The 
deficient Transformer can be replaced with a fixed instance and the 
Chain run again without input for another try.
 >>> Quotes.replace (Fixed_Numerizer, 2)
 >>> catch = Quotes () # catch removes the risk of flooding the display
----------------------------------------------------------------------------------------------------
These are the five custom classes. The decorator "setup" updates 
parameters and prevents needless reruns comparing time stamps.
class wiggle_and_trend (TX.Transformer):
 '''Deletes column "Adj Close" and adds columns "Wiggle" and 
"Trend". The values are
 ratios: (day's high - day's low) / mean, and (day's close - day's 
open) / mean.
 '''
 name = 'Wiggle and Trend'
 @TX.setup
 def transform (self):
 input = self.Input.data
 output = []
 output.append (tuple (input[0][:-1]) + ('Wiggle', 'Trend'))
 for i in range (1, len (input)):
 date, open, high, low, close, vol, adj = input [i]
 wiggle = high - low
 mean = (high + low) / 2.0
 wiggle_ratio = wiggle / mean
 trend = close - open
 trend_ratio = trend / mean
 output.append ((date, open, high, low, close, vol, 
wiggle_ratio, trend_ratio))
 self.Output.take (output)
class symbol (TX.Transformer):
 'Adds a column Symbol'
 name = 'Symbol'
 def __init__ (self):
 TX.Transformer.__init__ (self, symbol = None)
 @TX.setup
 def transform (self):
 symbol = self.get ('symbol')
 if not symbol:
 self.log ('No symbol!')
 else:
 input = self.Input.data
 output = []
 output.append ((input[0][0], 'Symbol') + input [0][1:])
 for i in range (1, len (input)):
 output.append ((input[i][0], symbol) + input [i][1:])
 self.Output.take (output)
class percent (TX.Transformer):
 'Converts float ratio to percent for better legibility'
 name = 'Percent'
 @TX.setup
 def transform (self):
 input = self.Input.data
 output = [input [0]]
 for i in range (1, len (input)):
 wiggle = '%5.2f%%' % (input [i][7] * 100.0)
 trend = '%+5.2f%%' % (input [i][8] * 100.0)
 output.append (input [i][:7] + (wiggle, trend))
 self.Output.take (output)
class header_stripper (TX.Transformer):
 'Header names are not meant for insertion into data base table'
 name = 'Header Stripper'
 @TX.setup
 def transform (self):
 self.Output.take (self.Input.data [1:])
class yahoo_quotes (TX.WWW_Reader):
 'Gets historic stock quotes from Yahoo Finance'
 import urllib
 name = 'Yahoo Quotes'
 URL_TRUNK = 'http://ichart.finance.yahoo.com/table.csv'
 def __init__ (self, from_date = '1970-01-01', to_date = 
'2099-12-31', **keywords):
 private_keys = {
 'symbol' : None,
 'from_date' : from_date,
 'to_date' : to_date,
 'url_trunk' : self.URL_TRUNK,
 }
 private_keys.update (keywords)
 TX.WWW_Reader.__init__ (self, **private_keys)
 def make_url (self):
 fy, fm, fd = self.get ('from_date').split ('-')
 ty, tm, td = self.get ('to_date').split ('-')
 fm = '%0d' % (int (fm) - 1)
 tm = '%0d' % (int (tm) - 1)
 symbol = self.get ('symbol')
 qs = (
 ('s',symbol),
 ('a',fm),('b',fd),('c',fy),
 ('d',tm),('e',td),('f',ty),
 ('g','d'),
 ('ignore', '.csv'),
 )
 self.set (url_parameters = qs)
 TX.WWW_Reader.make_url (self)
----------------------------------------------------------------------------------------------------
Let me know what you think.
-------------- next part --------------
Hi all,
In an effort to do some serious cleaning up of a hopelessly cluttered working environment, I developed a modular data transformation system that pretty much stands. I am very pleased with it. I expect huge time savings. I would share it, if had a sense that there is an interest out there and would appreciate comments. Here's a description. I named the module TX:
The nucleus of the TX system is a Transformer class, a wrapper for any kind of transformation functionality. The Transformer takes input as calling argument and returns it transformed. This design allows the assembly of transformation chains, either nesting calls or better, using the class Chain, derived from 'Transformer' and 'list'. A Chain consists of a sequence of Transformers and is functionally equivalent to an individual Transformer. A high degree of modularity results: Chains nest. Another consequence is that many transformation tasks can be handled with a relatively modest library of a few basic prefabricated Transformers from which many different Chains can be assembled on the fly. A custom Transformer to bridge an eventual gap is quickly written and tested, because the task likely is trivial.
 A good analogy of the TX methodology is a road map with towns scattered all over it and highways connecting them. To get from any town to any other one is a simple matter of hopping the towns in between. The TX equivalent of the towns are data formats, the equivalent of the highways are TX Transformers. They are not so much thought of in terms of what they do than in terms of the formats they take and give. Designing a library of Transformers is essentially a matter of establishing a collection of standard data formats. First the towns, then the highways.
 A feature of the TX Transformer is that it retains both its input and output. This makes a Chain a breeze to build progressively, link by link, and also makes debugging easy: If a Chain doesn't work, Chain.show () reveals the failing link as the first one that has no output. It can be replaced with a corrected instance, as one would replace a blown fuse. Running the Chain again without input makes it have another try.
 Parameter passing runs on a track that is completely separate from the payload track. Parameters can be set in order to configure a Chain prior to running it, or can be sent at runtime by individual Transformers to its siblings and their progeny. Parameters are keyed and get picked up by those Chain links whose set of pre-defined keys includes the parameter's key. Unintended pick-ups with coincidentally shared keys for unrelated parameters can be prevented by addressing parameters to individual Translators.
Below an application example. Five custom classes at the end exemplify the pattern. I join the post also as attachment, in case some auto-line-wrap messes up this text. 
Commentary welcome
Frederic
----------------------------------------------------------------------------------------------------
An example of use: Download historic stock quotes from Yahoo Finance for a given range of dates and a list of symbols, delete a column and add three, insert the data in a MySQL table. Also write them to temporary files in tabular form for verification.
 "make_quotes_writer ()" returns a custom transformation tree. "run_quotes ()" makes such a tree, sets it on a given time range and runs it on a list of symbols.
 (Since Yahoo publishes the data for downloading, I presume it's okay to do it this way. This is a demo of TX, however, and should not be misconstrued as an encouragement to violate any publisher's terms of service.)
import TX, yahoo_historic_quotes as yhq
def make_quotes_writer (): 
 Visualizer = TX.Chain (
 yhq.percent (),
 TX.Table_Maker (has_header = True),
 TX.Table_Writer (), 
 name = 'Visualizer'
 )
 
 To_DB = TX.Chain (yhq.header_stripper(), TX.DB_Writer(table_name = 'quotes'), name = 'To DB')
 
 To_File = TX.Chain (Visualizer, TX.File_Writer (), name = 'To File')
 
 Splitter = TX.Splitter (To_DB, To_File, name = 'Splitter')
 
 Quotes = TX.Chain (
 yhq.yahoo_quotes (),
 TX.CSV_To_List (delimiter = ','),
 TX.Numerizer (),
 yhq.wiggle_and_trend (),
 yhq.symbol (),
 Splitter,
 name = 'Quotes'
 ) 
 
 return Quotes
>>> Quotes = make_quotes_writer ()
>>> Quotes.show_tree()

Quotes
Quotes[0] - Yahoo Quotes
Quotes[1] - CSV To List
Quotes[2] - Numerizer
Quotes[3] - Wiggle and Trend
Quotes[4] - Symbol
Quotes[5] - Splitter
Quotes[5][0] - To DB
Quotes[5][0][0] - Header Stripper
Quotes[5][0][1] - DB Writer
Quotes[5][1] - To File
Quotes[5][1][0] - Visualizer
Quotes[5][1][0][0] - Percent
Quotes[5][1][0][1] - Table Maker
Quotes[5][1][0][2] - Table Writer
Quotes[5][1][1] - File Writer
def run_quotes (symbols, from_date = '1970-01-01', to_date = '2099-12-31'):
 '''Downloads historic stock quotes from Yahoo Finance.
 Inserts data into MySQL table "quotes" relying on index to reject repeat insertions.
 Also writes the data in a table format to a temporary file, one for each symbol.
 '''
 Quotes = make_quotes_writer ()
 Quotes.set (from_date = from_date, to_date = to_date)
 FW = Quotes [5][1][1] # File Writer
 for symbol in symbols:
 outfilename = '/tmp/%s' % symbol
 FW.set (terminal = outfilename)
 Quotes.set (symbol = symbol)
 Quotes ()
>>> run_quotes (('MO', 'SU'), '2013-08-10')

Checking:
>>> DBRC = TX.DB_Run_Command ()
>>> for record in DBRC ("select * from quotes where symbol in ('MO','SU') and date >= 20130810;"):
 print record
(datetime.date(2013, 8, 16), 'MO', 34.51, 34.63, 34.25, 34.29, 8364900L, 0.0110337, -0.00638792)
(datetime.date(2013, 8, 15), 'MO', 34.78, 34.93, 34.5, 34.57, 15144400L, 0.0123866, -0.00604926)
(datetime.date(2013, 8, 14), 'MO', 35.23, 35.25, 34.81, 35.06, 5871900L, 0.0125607, -0.00485298)
(datetime.date(2013, 8, 13), 'MO', 35.17, 35.27, 35.01, 35.22, 5919700L, 0.00739898, 0.00142288)
(datetime.date(2013, 8, 12), 'MO', 35.25, 35.28, 35.05, 35.16, 4884300L, 0.00654059, -0.00255936)
(datetime.date(2013, 8, 16), 'SU', 33.84, 34.63, 33.81, 34.18, 7396600L, 0.0239626, 0.00993571)
(datetime.date(2013, 8, 15), 'SU', 32.84, 33.99, 32.7, 33.94, 7747100L, 0.0386865, 0.0329885)
(datetime.date(2013, 8, 14), 'SU', 32.06, 32.91, 32.02, 32.72, 4690200L, 0.0274141, 0.0203296)
(datetime.date(2013, 8, 13), 'SU', 32.04, 32.18, 31.94, 32.01, 2051900L, 0.00748596, -0.000935745)
(datetime.date(2013, 8, 12), 'SU', 32.15, 32.35, 31.99, 32.15, 2870500L, 0.0111906, 0.0)
>>> FR = TX.File_Reader ()
>>> print FR ('/tmp/SU')

 Date | Symbol | Open | High | Low | Close | Volume | Wiggle | Trend |
 2013年08月16日 | SU | 33.84 | 34.63 | 33.81 | 34.18 | 7396600.00 | 2.40% | +0.99% |
 2013年08月15日 | SU | 32.84 | 33.99 | 32.70 | 33.94 | 7747100.00 | 3.87% | +3.30% |
 2013年08月14日 | SU | 32.06 | 32.91 | 32.02 | 32.72 | 4690200.00 | 2.74% | +2.03% |
 2013年08月13日 | SU | 32.04 | 32.18 | 31.94 | 32.01 | 2051900.00 | 0.75% | -0.09% |
 2013年08月12日 | SU | 32.15 | 32.35 | 31.99 | 32.15 | 2870500.00 | 1.12% | +0.00% |
----------------------------------------------------------------------------------------------------
Each Transformer retains input and output, freezing the state of a composite after each run. This makes developing step by step and debugging a breeze. If a Chain fails, the failing link is the first one to have no output. Chain.show () displays them all in sequence. The deficient Transformer can be replaced with a fixed instance and the Chain run again without input for another try. 
>>> Quotes.replace (Fixed_Numerizer, 2)
>>> catch = Quotes () # catch removes the risk of flooding the display
 
----------------------------------------------------------------------------------------------------
These are the five custom classes. The decorator "setup" updates parameters and prevents needless reruns comparing time stamps.
class wiggle_and_trend (TX.Transformer):
 '''Deletes column "Adj Close" and adds columns "Wiggle" and "Trend". The values are
 ratios: (day's high - day's low) / mean, and (day's close - day's open) / mean.
 '''
 name = 'Wiggle and Trend'
 @TX.setup
 def transform (self):
 input = self.Input.data
 output = []
 output.append (tuple (input[0][:-1]) + ('Wiggle', 'Trend'))
 for i in range (1, len (input)):
 date, open, high, low, close, vol, adj = input [i]
 wiggle = high - low
 mean = (high + low) / 2.0
 wiggle_ratio = wiggle / mean
 trend = close - open
 trend_ratio = trend / mean
 output.append ((date, open, high, low, close, vol, wiggle_ratio, trend_ratio))
 self.Output.take (output)
class symbol (TX.Transformer):
 'Adds a column Symbol'
 name = 'Symbol'
 def __init__ (self):
 TX.Transformer.__init__ (self, symbol = None)
 @TX.setup
 def transform (self):
 symbol = self.get ('symbol')
 if not symbol:
 self.log ('No symbol!')
 else:
 input = self.Input.data
 output = []
 output.append ((input[0][0], 'Symbol') + input [0][1:])
 for i in range (1, len (input)):
 output.append ((input[i][0], symbol) + input [i][1:])
 self.Output.take (output)
class percent (TX.Transformer):
 'Converts float ratio to percent for better legibility'
 name = 'Percent'
 @TX.setup
 def transform (self): 
 input = self.Input.data
 output = [input [0]]
 for i in range (1, len (input)):
 wiggle = '%5.2f%%' % (input [i][7] * 100.0)
 trend = '%+5.2f%%' % (input [i][8] * 100.0)
 output.append (input [i][:7] + (wiggle, trend))
 self.Output.take (output)
class header_stripper (TX.Transformer):
 'Header names are not meant for insertion into data base table'
 name = 'Header Stripper'
 @TX.setup
 def transform (self):
 self.Output.take (self.Input.data [1:])
class yahoo_quotes (TX.WWW_Reader):
 'Gets historic stock quotes from Yahoo Finance'
 import urllib
 name = 'Yahoo Quotes'
 URL_TRUNK = 'http://ichart.finance.yahoo.com/table.csv'
 def __init__ (self, from_date = '1970-01-01', to_date = '2099-12-31', **keywords):
 private_keys = {
 'symbol' : None,
 'from_date' : from_date,
 'to_date' : to_date,
 'url_trunk' : self.URL_TRUNK,
 }
 private_keys.update (keywords)
 TX.WWW_Reader.__init__ (self, **private_keys)
 def make_url (self):
 fy, fm, fd = self.get ('from_date').split ('-')
 ty, tm, td = self.get ('to_date').split ('-')
 fm = '%0d' % (int (fm) - 1)
 tm = '%0d' % (int (tm) - 1)
 symbol = self.get ('symbol')
 qs = (
 ('s',symbol),
 ('a',fm),('b',fd),('c',fy),
 ('d',tm),('e',td),('f',ty),
 ('g','d'),
 ('ignore', '.csv'),
 )
 self.set (url_parameters = qs) 
 TX.WWW_Reader.make_url (self)
 
----------------------------------------------------------------------------------------------------
Let me know what you think.