Background
Some years back, with a surfeit of free time, I would scan a webpage of "today's" free-to-air television listings hoping to find an interesting program or movie ("looking for hen's teeth"). The website was mature, offering quite a number of options and "GollyGosh!" features that I found imposing.
Over time, I eventually was able to download only the single HTML file that contained program (and movie) titles & their start times for the ~37 channels on offer here. Daily, I could retrieve 4-18 hours of coming attractions (as a file up to 400Kb in size) decorated with HTML tags. I set about writing some code to filter that deluge down to what interested me.
Through several versions, the code (a single source file) and reference information evolved to what is presented below. My goal was to be able to lazily use a few mouse clicks to view today's new program and/or movie titles. A Windows shortcut invoking curl would fetch today's webpage, and one executable would perform text processing leading to a tiny, self-contained HTML file to view (and edit?) with the EditPlus editor.
Creation, storage, update and distribution of the "program schedule" for any single broadcaster (stored off in the cloud, somewhere) is beyond the scope of this project. Available to any visitor are "TV Listings" in HTML, accompanied by supporting CSS, Javascript and JPG files. User criteria, held in cookies, can be used to suppress certain channels (table columns) and store timeframes of interest to the user. Other than that control, ALL program titles (with hot links to episode synopsis details) are presented in the user's browser. The user is then tasked with reading/scanning every program title in the hope of finding something of interest.
HTML scraped from the web
The webpage file contains two "blocks" of interesting data:
- About 8000 bytes into the file is a tagged field containing the relevant date of the listings
- Beginning about 110Kb into the file is a <table> whose columns each represent one broadcast "channel", and whose rows represent up to 24 "hour blocks" of program start times & titles.
Below is an extremely abridged representation of a few listings showing the structure of the interesting portion of the HTML: (the ~256 character "<a..." anchor strings have been replaced for brevity)
<tbody>
<tr>
<td valign="top" id="20hour" class="time">8 pm</td>
<td valign="top" class="noBorderBot 7">
<div align="center">
<div class="movieTag">MOVIE</div>
</div>
<div class="showtime">
<div class="time">8:35 pm</div>
<div class="show">
<a yadda-yadda>Wild Hogs</a>
</div>
</div>
</td>
<td valign="top" class="noBorderBot 11">
<div class="showtime">
<div class="time">8:00 pm</div>
<div class="show">
<a yadda-yadda>The Big Bang Theory</a>
</div>
</div>
<div class="showtime">
<div class="time">8:30 pm</div>
<div class="show">
<a yadda-yadda>The Big Bang Theory</a>
</div>
</div>
</td>
...
</tr>
...
</tbody>
After processing, the above block becomes:
<tbody>
<tr><td>20:00
<td><div class=mv>20:35<br>Wild Hogs
<td><b>20:00</b><br># Big Bang Theory [2]
...
</tbody>
Noteworthy is the elimination of needless (to me) <div> blocks, the use of deprecated <b> and </b> bolding, replacement of "The Big..." with "# Big...", and the [2] that indicates two sequential episodes broadcast. Note: 24hr time is used.
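To illustrate the "[2]" idea in isolation, here is a minimal sketch of collapsing adjacent identical titles on one channel into a single entry with a repeat count. The `struct listing` type and the `collapse_runs` name are hypothetical and are not taken from the program presented later, which keeps its own per-channel bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one listing on a single channel. */
struct listing {
    const char *title;
    int start;    /* start time as hhmm, e.g. 2030 */
    int repeats;  /* filled in by collapse_runs() */
};

/* Collapse runs of adjacent identical titles in place.
   The first entry of each run is kept (so its start time survives)
   and its repeat count records the length of the run.
   Returns the new number of entries. */
static size_t collapse_runs(struct listing *v, size_t n) {
    assert(v != NULL || n == 0);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && strcmp(v[out - 1].title, v[i].title) == 0) {
            v[out - 1].repeats++;
        } else {
            v[out] = v[i];
            v[out].repeats = 1;
            out++;
        }
    }
    return out;
}
```

An entry whose `repeats` is greater than 1 would then be printed with the bracketed count after its title.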
Replacing the prefix "The_" (_ == SP) with "# " better distinguishes program titles when they are compared by a hash integer before falling back to invoking strcmp(). There are many comparison lookups to be performed by this code.
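A minimal sketch of the hash-then-strcmp() comparison, assuming the "hash" is simply the first four bytes of the title packed into an integer. `title_key` and `title_cmp` are hypothetical names, and unlike the macro used in the program below this version uses an unsigned 32-bit type and stops at the terminator for titles shorter than four characters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack up to the first four bytes of a title into a 32-bit key.
   Titles shorter than four bytes are padded with zero bytes. */
static uint32_t title_key(const char *s) {
    assert(s != NULL);
    uint32_t k = 0;
    for (int i = 0; i < 4; i++) {
        k <<= 8;
        if (*s != '\0')
            k |= (unsigned char)*s++;
    }
    return k;
}

/* Compare by key first; only titles with equal keys need strcmp(). */
static int title_cmp(const char *a, const char *b) {
    uint32_t ka = title_key(a), kb = title_key(b);
    if (ka != kb)
        return ka < kb ? -1 : 1;
    return strcmp(a, b);
}
```

Every title beginning with "The " produces the same key, so the integer compare cannot separate them; after the replacement with "# ", two characters of the actual title take part in the key.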
Primary purpose
Programs are (usually) broadcast Mon-Fri, Sat, Sun or Weekly. This means that, for instance, if "My Favorite Martian" (a program I have no interest in viewing) appears in today's listings, another episode is likely to appear again in the next few days. Once a "new" title has been presented, I want my repository to "remember" it and exclude the same title from being shown to me again. In this way, I can collect a history of titles that I've considered. These titles can be excluded when processing future listings. Such a repository would grow and become unwieldy, so the code "ages" each entry, day-by-day, eventually "forgetting" titles that haven't been seen in recent days.
The code searches for each title found in the day's listings in the collection of "exclusion titles". If found in that collection, the title is suppressed from the displayable HTML. This leaves only new titles for "today" to be considered. (Note: Although they do repeat, movie titles will always appear in the filtered output and are never added to the exclusion list.)
In use, the "exclusion list" usually contains ~250 titles on any given day; some old titles are removed and some new titles added.
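The ageing rule can be sketched on a plain array as below. This is an illustration only: `struct entry`, `age_entries` and `MAX_DAYS` are hypothetical names, and the actual program keeps its titles in a binary tree and applies the countdown while loading and writing the list.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DAYS = 7 };  /* assumed lifespan, matching the 7 seen in the sample list */

/* Hypothetical exclusion entry. */
struct entry {
    int days;           /* days left before the title is forgotten */
    const char *title;
    int seen_today;     /* non-zero if the title appeared in today's listings */
};

/* Age every entry by one day. A title seen today gets its full
   lifespan back; a title whose counter reaches zero is dropped.
   Survivors are compacted in place and their count is returned. */
static size_t age_entries(struct entry *e, size_t n) {
    assert(e != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        e[i].days = e[i].seen_today ? MAX_DAYS : e[i].days - 1;
        e[i].seen_today = 0;
        if (e[i].days > 0)
            e[kept++] = e[i];
    }
    return kept;
}
```

With this rule a weekly program keeps refreshing its own entry, while a title that stops being broadcast disappears from the list after `MAX_DAYS` days.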
"My" boilerplate HTML file
After curl has fetched today's webpage's HTML file, this code is executed, loading those contents into memory. From the date found in today's file, yesterday's file can be loaded over top(!) of the first 8-12Kb of the same block of heap. Instead of going to heap for many allocations, this code re-cycles bytes used by the useless preamble (~100Kb) of the webpage's HTML file. (Think "arena".) The code is somewhat simpler as it has full control over the arena. This saves tracking/freeing myriad small blocks of heap as working structs can be simply forgotten about.
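For readers unfamiliar with the "arena" idea, here is a minimal bump-allocator sketch over a region of memory that is already owned. The `struct arena` type and `arena_alloc` are hypothetical names; unlike the `alloc()` in the program below, this version tracks the end of the region and refuses requests that do not fit. It assumes the region itself starts on a pointer-aligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical arena: a cursor into a region that is already owned. */
struct arena {
    unsigned char *next;  /* first unused byte */
    unsigned char *end;   /* one past the last usable byte */
};

/* Hand out sz zeroed bytes, rounded up to a multiple of the pointer
   size, or NULL when the region is exhausted. Nothing is ever freed
   individually; the whole region is abandoned at once. */
static void *arena_alloc(struct arena *a, size_t sz) {
    assert(a != NULL && a->next <= a->end);
    const size_t align = sizeof(void *);
    size_t room = (size_t)(a->end - a->next);
    if (sz > room)
        return NULL;
    sz = (sz + align - 1) / align * align;
    if (sz > room)
        return NULL;
    void *p = a->next;
    memset(p, 0, sz);
    a->next += sz;
    return p;
}
```

Pointing `next` at the start of the discarded preamble and `end` at the first byte that is still needed would give the re-use described above.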
<!DOCTYPE html><html>
<head>
<style>
th, td {font-family:Arial;font-size:11pt;}
#tv {border:2px solid black;background:#F0F0F0;}
#tv thead tr th {font-weight:bold;text-align:center;}
#tv thead tr th:first-child {line-height:40px;}
#tv tbody tr td {text-align:left;vertical-align:top;border:1px solid #999999;padding:5px }
#tv tbody tr td:first-child {font-weight:bold;vertical-align:middle;background-color:#ffffff;}
.mv {font-weight:bold;color:red;}
p {font-family:Arial;font-size:11pt;}
</style>
<!--keepChan
+11:21, -HD_ABC,
+02_ABC,
+22_Comedy, -ABCME, -24_News,
+06_Ch7, -PrimeSth, -PrimeNth,
+62_7Two,
+63_7MATE, -7bravo, -7Flix, -RaceTV,
+05_WIN, -HD_WIN, -Sky, -NBN, -SthCross, -NineLife,
+83_Go!,
+82_GEM, -51_Bold, -10Capitals, -52_Peach, -TVSN, -7GTS,
+03_SBS, -HD_SBS, -WorldMovies, -HD_VICELAND, -Food, -SBSWW, -Imparja,
+34_NITV
keepChan-->
</head>
<body>
<table id=tv cellspacing=0>
<thead>
<th>0910
<th>02_ABC
<th>05_WIN
<th>83_Go!
<th>82_GEM
<th>03_SBS
<th>34_NITV
</thead>
<tbody>
<tr><td>13:00
<td>
<td><b>13:15</b><br>Saltimbanco To Luzia - 25 Years<br>Of Cirque Du Soleil In Australia<br>
<b>13:45</b><br>My Way
<td colspan=4>
...
</tbody>
</table>
<!-- Excl
7 # Addams Family
7 # Art Of Ageing
7 # Chase~
...
7 Border Security~
5 Call The Midwife
...
4 Wonders Of Scotland
2 World's Greatest Hotels
6 Woven Threads Stories From Within
7 Young Sheldon
Excl -->
<p>old:247 exp:10 new:10 out:247
</body></html>
- The single (8-10Kb) file contains an HTML document.
- It begins with a stylesheet for formatting.
- An HTML comment block identifies the 37 channels indicating with '+/-' which channels (table columns) are to be retained and which are to be simply skipped over.
- Everything to the </head> is replicated from one day to the next.
- The table's body contents should be apparent.
- Following the table is the "exclude title" list, each entry prefixed by the remaining days until the entry is to be "forgotten" and removed from the list.
- Finally are some stats about how many exclusions were loaded, how many expired, how many added and the new tally.
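The '+/-' channel list can be split as sketched below. `parse_channels` is a hypothetical helper written for illustration; it follows the same convention as the program below, in which a NULL slot marks a column to skip, and it modifies the text in place because it relies on strtok(). Judging by the code, the first entry (for example "+11:21") appears to carry the range of hours of interest rather than a channel name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split a writable, comma/whitespace separated channel list.
   keep[i] receives the name of column i (without its '+') when the
   column is to be retained, or NULL when it is marked '-'.
   Returns the number of columns found (at most max). */
static size_t parse_channels(char *list, const char *keep[], size_t max) {
    assert(list != NULL && keep != NULL);
    const char *delims = ", \t\n";
    size_t n = 0;
    for (char *tok = strtok(list, delims); tok != NULL && n < max;
         tok = strtok(NULL, delims), n++)
        keep[n] = (tok[0] == '+') ? tok + 1 : NULL;
    return n;
}
```

Keeping the NULL entries preserves the one-to-one correspondence between list positions and the columns of the downloaded table.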
The code
Like previous postings here, this is presented for anyone who might be interested in the concept of filtering & reformatting (changing) published data available in an online HTML table. The "structure" of the HTML data has been extremely stable, and the author is always on-hand if the program hangs or terminates. (The label of the "date" retrieved was once changed, and once a "channel" was added to the online collection. Very stable over the years.)
I prefer short variable names and reasonably dense code (to the extent of often using the comma operator instead of braces!). These are stylistic choices. While qualifiers (const, for instance) might make the code more robust, I've been content to dabble and play and test its operation. Also, I use an old C++ compiler for convenience, so there are a few casts that are not necessary for a C compiler. Finally, I tried to add some comments to explain aspects of the code and the data it expects and that it writes to the output file.
Feel free to make any comments or suggestions. Thank you.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <io.h>
typedef struct sExcl { // Exclude list/tree data
int days; // days until expire from list
unsigned int hash; // first 4 chars as integer for fast compare
char *ttl; // title as string
sExcl *lft; // list transforms to binary tree
sExcl *rgt;
} excl_t;
typedef struct sPR { // One program in the listings
sPR *nxt; // ptr to next program (in same hour timeslot)
excl_t cr; // filled when program added to exclude tree
int mov; // is a movie?
int start; // start time (hhmm as integer)
int cnt; // count (to combine sequential episodes into single listing)
} PR_t;
typedef struct sTD { // One table cell loaded/stored
sTD *nxt; // ptr to next in row
PR_t *sh1; // ptr to 1st program in this cell (timeslot)
} TD_t;
typedef struct sTR { // One table row loaded/stored
sTR *nxt; // ptr to next row
int hr; // 0-23 hour of this time block
TD_t *TD0; // ptr to first cell in this row
} TR_t;
char *allocBuf; // for re-use of preamble bytes in single heap allocation
char *ch[40]; // ptrs to channel identifiers (strings)
struct { // Exclude list/tree information
char *Bgn; // HTML comment tag marking begin of exclusion list
char *End; // tag marking end of exclusion list
int Dur; // max duration (days) until expire from list
int Old, Exp, New, Out; // stats (counts loaded, expired, added, saved)
excl_t *Tree; // ptr to titles after transformation to tree
} excl = { "<!-- Excl", "Excl -->", 7, };
// simple calculation of yesterday (MMDD) from today (MM DD) w/o effort for leap year
int yesterday( int mo, int dy ) {
return (!--dy) ? (("xLABCDEFGHIJK"[mo]-'@')*100) + (28+"x3303232332323"[mo]-'0') : (mo*100)+dy;
}
#define GrabFour( s ) ( ((s[0] << 8 | s[1]) << 8 | s[2]) << 8 | s[3] ) // simple hash 4 chars to int
#define twoDgt( t, u ) ((t - '0')*10 + (u - '0')) // simple two digits to int
// strcmp() but allow '~' on 1st param to act as wild card matching any extra on 2nd param
// can't use '*' because "M*A*S*H" is a valid (classic) program title often in re-runs
int cmpWild( char *kwn, char *nu ) {
while( *kwn && *kwn == *nu )
kwn++, nu++;
return *kwn == '~' ? 0 : *nu - *kwn; // equal (0) OR calc >/=/<
}
// traverse excluded title tree looking for this program title
// if found, restore full expire lifespan (7 days), return true.
// if not found, but contains a permanent "stop word", return true.
// else return false
bool exclfnc( PR_t *p ) {
excl_t *kwn = excl.Tree;
while( kwn ) {
int res = p->cr.hash - kwn->hash;
if( !res ) res = cmpWild( kwn->ttl, p->cr.ttl );
if( !res ) {
if( kwn->days < 0 ) return false;
kwn->days = excl.Dur;
return true;
}
kwn = ( res < 0 ) ? kwn->lft : kwn->rgt;
}
static char *no[] = { "Fishing", "Pickers", "Nazi", "Hitler", NULL };
for( int i = 0; no[i]; i++ ) if( strstr( p->cr.ttl, no[i] ) ) return true;
return false;
}
// local allocation of bytes of 100K+ "preamble" already loaded into heap
// (entire program uses only one true malloc at beginning)
void *alloc( size_t sz ) {
sz = (sz + (sizeof(void*) - 1)) & ~0x03; // Round to nxt pointer size
memset( allocBuf, '\0', sz );
allocBuf += sz;
return allocBuf - sz;
}
// transform array of pointers (exclusion titles as list) to binary tree
excl_t *buildTree( size_t l, size_t r, excl_t lst[] ) {
size_t m = (l + r)/2;
excl_t *nu = &lst[ m ];
if( l < m ) nu->lft = buildTree( l, m-1, lst );
if( m < r ) nu->rgt = buildTree( m+1, r, lst );
return nu;
}
// brand new program titles added to existing tree (self sorting for output)
void insertTree( excl_t *nu ) {
for( excl_t *p = excl.Tree, **pp; ; p = *pp ) {
int res = nu->hash - p->hash;
if( !res ) res = strcmp( nu->ttl, p->ttl );
if( !res ) return;
pp = ( res < 0 ) ? &p->lft : &p->rgt;
if( !*pp ) { (*pp = nu)->days = excl.Dur + 1; return; } // extra day of life for now
}
}
// set up array (list) of ptrs to "exclude titles" section of my HTML file
// each lifespan & title, and hash value stored individually
// notice local allocation (re-use) of preamble section of loaded buffer
// finally, transform that list into tree for searching and additions
excl_t *procExcl( char *p ) {
p = strchr( strstr( p, excl.Bgn ), '\n' ) + 1;
char *cp = strstr( p, excl.End );
cp[0] = '\0';
while( --cp > p ) excl.Old += *cp == '\n'; // count titles
// titles to pointer array
excl_t *List = (excl_t*)alloc( excl.Old * sizeof *List ), *pl = List;
for( cp = p; ( cp = strtok( cp, "\n") ) != NULL; cp = NULL, pl++ )
pl->days = cp[0] - '1', pl->ttl = cp + 2, pl->hash = GrabFour( pl->ttl );
return buildTree( 0, excl.Old - 1, List );
}
// my HTML file contains region of comma separated channel names
// each name corresponds to one column of today's program listings (each <TD>)
// names (ie: "channels") beginning with '+' are retained
// names beginning with '-' are simply skipped over (ignored)
// raw data has 37(?) columns
// all 37 pointers are significant here, including NULL pointers; compacting happens later.
// mark the end of my "permanent" header HTML for later duplication to output file.
char *procChan( char *bp ) {
bp = strstr( bp, "<!--keepChan" ) + 12; // beyond token
char *ep = strstr( bp, "keepChan-->" );
char *cp = (char*)alloc( (ep-bp) + 1 ); // copy to chop up and manipulate
memcpy( cp, bp, (ep-bp) );
// chop up copy selectively filling ch[]
for( int i = 0; ( cp = strtok( cp, ", \t\n") ) != NULL; i++, cp = NULL )
if( *cp == '+' ) ch[i] = cp; // ch[] null ptrs used to skip don't care channels
ep = strstr( ep, "</head>" ) + 7; // Having loaded yesterday, isolate its HTML header for re-use
*ep++ = '\0';
return ep; // exclusions follow
}
// digest today's program listings from web's HTML assembling internal tree (rows & cols) of listings
// many things happen here!!
// - my HTML (for col 0) is the start & stop hours of interest. web page hours earlier/later are ignored.
// - advance the buffer pointer to <tbody> marker, then...
// - get the row's hour from web provided '<TR id=13>' (indicating 1PM hour block)
// - keep scanning to ignore downloaded listings earlier than interest.
// - note: as single pass proceeds, terminate on </tbody> OR listing too late to be of interest.
// looping sensitive to "<tr>", "<td>" and "<div...>" HTML tokens
// - simplest ("<tr>") simply starts (allocates) a new row.
// - "<td>" is selective. Only retained columns (non NULL channels) are preserved; others are simply skipped over.
// also, "daytime children's programs" on one particular channel are skipped over.
// these "cells" are attached to the current row's LL of columns
// - "<div...>" delimiters are used for several layout and data purposes:
// "align" is ignored
// "class=.." herald either "SPORT" or "MOVIE". I choose to drop sport programs, but retain movie titles
// otherwise, next field is program start time as "HH:MM", so ':' is a search target to extract minutes.
// web format is then a lengthy anchor ("<a...>") that is skipped, followed by the program title.
//
// - Program title text:
// Lots of titles of "News" are ignored (I know when the 6 o'clock news is broadcast)
// All movie titles are retained, but only "show" titles NOT found in exclusion list.
// Too often, 2+ sequential episodes of same series are broadcast. Sequential same titles are compressed.
// If program title retained, its info is attached to cell's LL of programs (possibly multiple titles in cell).
TR_t *digest( char *sp ) {
TR_t *TR0 = NULL, *TRn, *TRprv = NULL;
TD_t *TDn;
PR_t *pTail;
PR_t *prev[40] = { 0 };
int cCnt = 0, hr, earliest = atoi( ch[0] ), latest = atoi( ch[0] + 4 ); // Earliest to retain
sp = strstr( sp, "<tbody" );
//printf( "%.190s\n", sp ); getchar();
do { sp = strstr( ++sp, "<tr>" ); } while( ( hr = atoi( strstr( sp, "id=\"" ) + 4 ) ) < earliest );
for( ; strncmp( sp, "</tbody", 7 ); sp = strchr( ++sp, '<' ) ) {
if( sp[1] == 'd' ) { // "<div"
//printf( "DIV ... %.90s\n", sp ); getchar();
if( cmpWild( "align~", sp + 5 ) == 0 ) {}
else if( cmpWild( "class=\"sport~", sp + 5 ) == 0 ) sp = strstr( sp + 290, "</a>" );
else {
//printf( "Now %.90s\n", sp ); getchar();
PR_t buf = { 0 };
if( sp[12] == 'm' ) buf.mov = 1;
sp = strchr( sp, ':' );
buf.start = TRn->hr * 100 + twoDgt( sp[1], sp[2] );
buf.cr.ttl = sp = strchr( strstr( sp, "<a " ) + 50, '>' ) + 1; // end of anchor
sp = strchr( sp, '<' );
*sp = '\0';
//printf( "@%4d Showtitle: %s\n", buf.start, buf.cr.ttl ); getchar();
if( ( buf.cr.hash = GrabFour( buf.cr.ttl ) ) == (unsigned int)GrabFour( "The " ) )
buf.cr.ttl += 2, buf.cr.ttl[0] = '#', buf.cr.hash = GrabFour( buf.cr.ttl );
if( strstr( buf.cr.ttl, " News" ) == 0 && ( buf.mov || !exclfnc( &buf ) ) ) {
if( prev[cCnt] && strcmp( buf.cr.ttl, prev[cCnt]->cr.ttl ) == 0 )
prev[cCnt]->cnt++;
else {
PR_t *p = (PR_t*)alloc( sizeof *p );
if( !TDn->sh1 ) pTail = TDn->sh1 = p; else pTail = pTail->nxt = p;
memcpy( p, &buf, sizeof *p );
prev[cCnt]= p;
}
}
}
} else if( sp[1] == 't' ) { // "<td" or "<tr"
if( sp[2] == 'd' ) { // "<td"
//printf( "TD... %.90s\n", sp ); getchar();
if( ch[cCnt] ) {
TD_t *p = (TD_t *)alloc( sizeof *p );
if( !TRn->TD0 ) TDn = TRn->TD0 = p; else TDn = TDn->nxt = p;
}
if( cCnt == 0 || ch[cCnt] == NULL || ( cCnt == 3 && TRn->hr < 19 ) )
sp = strstr( sp, "</td" ); // speed to end of cell.
cCnt++;
} else { // "<tr"
//printf( "TR ... %.90s\n", sp ); getchar();
if( hr > latest ) break;
TR_t *p = (TR_t *)alloc( sizeof *p );
if( !TR0 ) TRn = TR0 = p; else TRn = TRn->nxt = p;
p->hr = hr++;
cCnt = 0;
}
}
}
// Trim the fat - Condense to channels used
// Now that NULL ptrs have been used to skip web columns, condense array of not NULL ptrs
for( int i = 1, j = 1; i < sizeof ch/sizeof ch[0]; i++ )
if( ch[i] ) // Change prefix '+' to '-' for later use
ch[i][0] = '-', ch[j++] = ch[i], ch[i] = NULL;
// Go through all cols of all rows eliminating rows that have no programs retained during that hour
for( TRn = TR0; TRn; TRn = TRn->nxt ) {
int hrUsed = 0;
for( i = 1, TDn = TRn->TD0->nxt; TDn; TDn = TDn->nxt, i++ )
if( TDn->sh1 )
hrUsed++, ch[i][0] = '+'; // back to '+' cuz column has 1+ titles
if( hrUsed ) // titles retained during this hour?
TRprv = TRn; // remember this row, and go on to nxt
else if( TRn == TR0 ) // is this the 1st row??
TR0 = TR0->nxt; // discard it
else
TRprv->nxt = TRn->nxt; // abandon this row
}
// now have active rows with active cells (columns) in a tree
return TR0; // pointer to 1st active row
}
/* HTML OUTPUT functions that write "my" file */
// wrap one long program title by inserting <br>
// uses single static buffer just before it is output
char *wrap( char *p ) {
int i, m = strlen( p ) / 2;
static char rVal[100];
if( m > 12 )
for( int l = m-1, r = m; l > 0; l--, r++ )
if( p[i = r] == ' ' || p[i = l] == ' ' ) {
sprintf( rVal, "%.*s<br>%s", i, p, p + i + 1 );
return rVal;
}
return p;
}
// "in order" recursive traversal of excluded title tree to output lines as sorted list
void pubExcl( excl_t *p ) {
if( p->lft ) pubExcl( p->lft );
if( p->days ) {
char sep = '\t';
if( p->days > excl.Dur ) p->days = excl.Dur, sep = ' ', excl.New++;
if( p->days < 0 ) p->days = 0;
printf( "%d%c%s\n", p->days, sep, p->ttl );
excl.Out++;
} else excl.Exp++;
if( p->rgt ) pubExcl( p->rgt );
}
// fancy-pants use of HTML colspan to unclutter output table
void pubSpans( int n, char *sufx ) {
if( n ) printf( n == 1 ? "\n\t<td> " : "\n\t<td colspan=%d> ", n );
printf( "%s", sufx );
}
// publish one cell's program titles (a movie?) with start time and title
// if not a movie, publishing means add to list of exclusions for tomorrow. Publish ONCE!
void pubCell( PR_t *p ) {
while( p ) {
char *fmt = p->mov ? "<div class=mv>%02d:%02d<br>%s" : "<b>%02d:%02d</b><br>%s";
printf( fmt, p->start/100, p->start%100, wrap( p->cr.ttl ) );
if( p->cnt ) printf( " [%d]", p->cnt + 1 );
if( !p->mov && p->cr.ttl[2] ) insertTree( &p->cr );
if( ( p = p->nxt ) != NULL ) printf( "<br>\n\t\t" );
}
}
// iteratively publish all the cells (channels) of one row
// if next cell has no titles, then use fancy-pants horizontal spanning
void pubTrow( TD_t *cb ) {
int i = 1, nSpan = 0;
for( ; cb; cb = cb->nxt, i++ )
if( cb->sh1 )
pubSpans( nSpan, "\n\t<td>" ), nSpan = 0, pubCell( cb->sh1 );
else nSpan += ( ch[i][0] == '+' ); // Only if column 'active'
pubSpans( nSpan, "\n" );
}
// iteratively publish all the rows (active hours) of the tree
void pubTbody( TR_t *hb ) {
for( ; hb; hb = hb->nxt ) {
printf( "\n<tr><td>%02d:00", hb->hr );
pubTrow( hb->TD0->nxt );
}
}
// publish the table headers (only channels that have a new program or a movie today)
void pubThead( void ) {
for( int x = 0; ch[x]; x++ )
if( ch[x][0] == '+' )
printf( "\t<th>%s\n", ch[x] + 1 ); // without "+"
}
// publish today's entire distilled listings
// start a new file
// output my HTML preamble that was loaded from yesterday's file
// output table preamble, today's listings, and exclusion list to use tomorrow
void publish( TR_t *hbs, char *yday, int mmdd, char *fName ) {
int saved = _dup( fileno( stdout ) );
sprintf( fName, "tvg %04d.html", mmdd );
freopen( fName, "wt", stdout );
sprintf( ch[0], "+%04d", mmdd );
printf( "%s\n<body>\n", yday ); // yesterday's "<head>" into today's version
puts( "<table id=tv cellspacing=0>" );
puts( "<thead>" ); pubThead( ); puts( "</thead>" );
puts( "<tbody>" ); pubTbody(hbs); puts( "</tbody>" );
puts( "</table>" );
puts( excl.Bgn ); pubExcl( excl.Tree ); puts( excl.End );
printf( "<p>old:%d exp:%d new:%d out:%d\n", excl.Old, excl.Exp, excl.New, excl.Out );
puts( "</body></html>" );
fclose( stdout );
_dup2( saved, 1 );
}
// measure, allocate heap and load web HTML file
// OR
// measure, allocate chunk of web loaded heap for local use to load MY HTML from yesterday
char *load( char *name, void*(fnc)(size_t) ) {
FILE *fp = fopen( name, "rt" );
if( fp == NULL ) { puts("open bad"); getchar();}
struct stat inf;
fstat( fileno(fp), &inf );
char *p = (char*)fnc( inf.st_size );
fread( p, sizeof *p, inf.st_size, fp );
fclose( fp );
return p;
}
// - access and load web HTML file (300-400Kb)
// - find marker and get the file's listing's date (about 8000 bytes in)
// - calculate and load yesterday's version of my HTML
// - process the exclusion list of program titles
// - digest today's web listings into my tree, then publish those severely reduced listings
// - launch my editor on that 8Kb version (that has a one click "browser" facility)
void main( void ) {
char fName[30] = "./tvg.html";
char *Buffer, *p = allocBuf = Buffer = load( fName, malloc );
if( ( p = strstr( p, "var guideDate = \"" ) ) == NULL ) exit(1);
int mm = twoDgt( p[19], p[20] ), dd = twoDgt( p[17], p[18] );
sprintf( fName, "tvg %04d.html", yesterday( mm, dd ) );
char *yDayBuf = load( fName, alloc );
excl.Tree = procExcl( procChan( yDayBuf ) ); // used channels and excluded titles
publish( digest( Buffer + 89000 ), yDayBuf, mm*100+dd, fName ); // TABLESTART offset
sprintf( Buffer, "start \"C:/Program Files (x86)/EditPlus 2/Editplus.exe\" \"./%s\"", fName );
system( Buffer );
}
2 Answers
Some minor things:
Wrong value and width mask for sz
~0x03 is an int with a bit pattern of FFFF ... FFFC. Code likely needs a size_t which can be wider than int.
void *alloc( size_t sz ) {
// sz = (sz + (sizeof(void*) - 1)) & ~0x03;
sz = (sz + (sizeof(void*) - 1)) & ~((size_t)0x03);
Rather than magic number 3 use sizeof(void*) - 1 or alignof(void*) - 1 as suggested elsewhere. Side benefit: sizeof and alignof results are type size_t, so no cast needed.
sz = (sz + (sizeof(void*) - 1)) & ~(sizeof(void*) - 1);
Yet to act like malloc(), allocations should be on alignof(max_align_t) boundaries, not void *:
sz = (sz + (alignof(max_align_t) - 1)) & ~(alignof(max_align_t) - 1);
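As a self-contained sketch of that rounding (C11 or later), the helper below checks the power-of-two assumption that the mask trick relies on. `round_up` and `round_up_max` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Round sz up to the next multiple of align.
   The mask trick is only valid when align is a power of two. */
static size_t round_up(size_t sz, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (sz + (align - 1)) & ~(align - 1);
}

/* Round up to the strictest fundamental alignment, as malloc() must. */
static size_t round_up_max(size_t sz) {
    return round_up(sz, alignof(max_align_t));
}
```

Because both operands of the mask are size_t, no cast is required and the mask is as wide as the size being rounded.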
Avoid overflow
With large l, r, l + r may overflow.
Alternative code:
excl_t *buildTree( size_t l, size_t r, excl_t lst[] ) {
// Overflows when `l > SIZE_MAX - r`.
// size_t m = (l + r)/2;
// Good for all `l` and `r` with `l <= r`.
assert(l <= r);
size_t m = l + (r - l)/2;
// Good for all `l` and `r`.
size_t m = l/2 + r/2 + (l%2 + r%2)/2;
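The two alternatives can be wrapped as small functions and exercised near SIZE_MAX, where the original `(l + r)/2` wraps around. `midpoint` and `midpoint_any` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Midpoint that cannot wrap, valid when l <= r. */
static size_t midpoint(size_t l, size_t r) {
    assert(l <= r);
    return l + (r - l) / 2;
}

/* Midpoint that cannot wrap for any l and r. */
static size_t midpoint_any(size_t l, size_t r) {
    return l / 2 + r / 2 + (l % 2 + r % 2) / 2;
}
```

For an exclusion list of a few hundred entries the wrap-around cannot occur in practice, so this is about habit rather than an observed failure.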
Hashing
( ((s[0] << 8 | s[1]) << 8 | s[2]) << 8 | s[3] ) technically is undefined behavior (shifting into the sign bit). Also int may only be 16-bit.
Alternative (use a wide enough unsigned type)
( ((uint32_t)(s[0] << 8 | s[1]) << 8 | s[2]) << 8 | s[3] )
Also code plays loose with int/unsigned conversions and math. Avoid int overflow (which is UB) and all those conversions. Example:
//int res = nu->hash - p->hash;
//if( !res ) res = strcmp( nu->ttl, p->ttl );
//if( !res ) return;
//pp = ( res < 0 ) ? &p->lft : &p->rgt;
if (nu->hash == p->hash) {
int res = strcmp( nu->ttl, p->ttl );
if( !res ) return;
}
pp = (nu->hash < p->hash) ? &p->lft : &p->rgt;
(This also fixes a hidden bug that would come up when the MSbits of nu->ttl[0], p->ttl[0] differ.)
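A small sketch of that hidden bug, assuming a typical platform where unsigned int and int are 32 bits wide. `cmp_keys`, `cmp_by_subtraction` and `demo_sign_flip` are hypothetical names; the second function mimics the subtract-then-test-sign pattern from the reviewed code.

```c
#include <assert.h>
#include <stdint.h>

/* Order two 32-bit keys by comparing them directly. */
static int cmp_keys(uint32_t a, uint32_t b) {
    return (a > b) - (a < b);
}

/* The pattern under review: subtract, then look at the sign.
   The subtraction wraps modulo 2^32, so the sign is only meaningful
   while the keys are less than 2^31 apart. (Converting a wrapped
   value above INT32_MAX is also implementation-defined.) */
static int cmp_by_subtraction(uint32_t a, uint32_t b) {
    return (int32_t)(a - b);
}

/* One key with its top bit clear, one with it set. */
static void demo_sign_flip(void) {
    uint32_t a = 0x10000000u, b = 0xF0000000u;  /* a < b */
    assert(cmp_keys(a, b) < 0);                 /* correct order */
    assert(cmp_by_subtraction(a, b) > 0);       /* reports a > b */
}
```

With titles restricted to 7-bit ASCII the top bit of every key is clear, which is why the problem stays hidden; a single title starting with a byte of 0x80 or above could leave the tree inconsistently ordered.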
- Thank you. Points well taken! "...code plays loose..." It certainly does! Sheltered by knowing data is 7bit ASCII and values are small (making o'flow unlikely (excuses!).) Again, thanks for feedback. :-) – Fe2O3, Sep 10, 2024 at 20:31
This doesn't look like valid C:
typedef struct sExcl {
    ⋮
    sExcl *lft;
    sExcl *rgt;
} excl_t;
I think those pointers were intended to be struct sExcl *.
Some of the terminology is a bit suspect (e.g. "HTML comment tag"). Tags delimit HTML elements; comments are not elements.
This is weird:
if( fp == NULL ) { puts("open bad"); getchar();}
Shouldn't that message go to standard error stream? And it seems wrong to continue with the rest of the function that uses this null file pointer - isn't that UB?
And here's the rest of the function:
struct stat inf;
fstat( fileno(fp), &inf );
char *p = (char*)fnc( inf.st_size );
fread( p, sizeof *p, inf.st_size, fp );
fclose( fp );
Most of those functions can fail, but the return values are being ignored, so we have no idea whether they succeeded or not.
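A minimal sketch of a loader that checks each step, using only standard C. `load_file` is a hypothetical replacement and not a drop-in one: it always uses malloc() rather than taking an allocator callback, it opens the file in binary mode, and it sizes the buffer with fseek()/ftell() instead of fstat(), which works on common platforms although the C standard does not strictly guarantee SEEK_END on binary streams.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a NUL-terminated heap buffer.
   On success returns the buffer and stores the byte count in *len.
   On any failure reports to stderr and returns NULL. */
static char *load_file(const char *name, size_t *len) {
    assert(name != NULL && len != NULL);
    FILE *fp = fopen(name, "rb");
    if (fp == NULL) {
        perror(name);
        return NULL;
    }
    char *buf = NULL;
    long sz = -1;
    if (fseek(fp, 0L, SEEK_END) == 0 && (sz = ftell(fp)) >= 0 &&
        fseek(fp, 0L, SEEK_SET) == 0)
        buf = malloc((size_t)sz + 1);
    if (buf != NULL) {
        size_t got = fread(buf, 1, (size_t)sz, fp);
        if (ferror(fp)) {
            free(buf);
            buf = NULL;
        } else {
            buf[got] = '\0';
            *len = got;
        }
    }
    if (buf == NULL)
        fprintf(stderr, "%s: could not load file\n", name);
    fclose(fp);
    return buf;
}
```

The caller then has a single NULL check to make, and the terminator makes it safe to run strstr() over the buffer.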
void main( void ) is presumably an extension supported by your compiler, but standard C specifies int main(void), so you have a portability problem there.
Another portability problem is the use of system() to invoke external programs. Since that's the last action of a successful main(), it might be better moved to a script that invokes this program && your "start" program.
This calculation looks suspect:
sz = (sz + (sizeof(void*) - 1)) & ~0x03; // Round to nxt pointer size
That magic number 3 seems to be tied to an assumption that alignof(void*) will be 4, but we don't have a static_assert() to justify that assumption. Really, what we're trying to do is round up to the next alignof(void*), so perhaps we should be using alignof(void*) - 1 as the addend and inverted as the mask. We'd still want a static_assert that it's a power of two if we're masking like that - simple % would be more portable and still optimisable to a mask operation.
I've not looked in detail at the other functions, mainly because the low-level list and string manipulation is tedious, especially in the absence of unit tests. It serves as a good example of why the data-structure abstractions provided by higher-level languages are so useful, because they free the engineer's brain from such mundanity. If I were writing this from scratch today, I'd probably use Python rather than C.
- Thank you for feedback. I'm too comfortable with my ancient IDE/compiler, meaning not keeping up with "newer" language features. re "tedious" and fragile code: Posted as example of "parsing" available table of data (different every day) to extract & render (and store) only elements of interest to the user. The long-ago "from scratch" version sought to capitalise on website's CSS & javascript... That optimism delayed "biting the bullet" by tossing-out all that excess and go for the gold with "tedious", very specific text processing code. Again, thanks! :-) – Fe2O3, Sep 10, 2024 at 20:41
- re: "weird": One Windows desktop icon (shortcut) runs curl to fetch today's HTML (in 1/2 second). Another shortcut runs the EXE (that chains to launch editplus.) Both run as minimized windows. Both very quick execution. Should that fopen fail, the error report goes to stdout (doesn't matter) and the getchar() causes EXE to pause waiting for me to investigate; maximizing the window to read the error message. Of course I manually terminate the program and fix things up. Yes, it's not "self evident" and adding exit(1) would be an improvement. Thanks :-) – Fe2O3, Sep 11, 2024 at 0:55
- So that's the Windows way to wait for a debugger to attach? Makes sense, though I'm very ignorant here, not being a Windows user. – Toby Speight, Sep 11, 2024 at 6:58
- One works-with whatever works... My old IDE doesn't have a separate terminal/console window for stderr, so even logged error messages there would go to the same console that closes when execution terminates... Workaround: use getchar(); ... One does what's necessary, whistles a merry tune, and refrains from cursing the tools on-hand... Cheers! :-) – Fe2O3, Sep 11, 2024 at 9:27