Common Bugs

(1 / 2)
Date: February 21, 1991 15:13
From: KIM::ALBAUGH
To: @SYS$MAIL:ENGINEER
(This file lives in ee$userdisk:[68k_us.memo]common_bugs.txt)

WHO NEEDS TO READ THIS: every game programmer and every project leader
    whose game programmer says he/she has none of these bugs (they're lying).
    Hardware engineers might also find it a useful refresher.

	I have been doing a lot of poking around in dusty corners lately,
    trying to move people from older coin and eeprom routines to newer ones.
    In the process I have discovered a few bugs and "accidents waiting to
    happen". Each is described below, although I'm not claiming that they are
    new or that I didn't create some of them myself. First some background.
    Most of our games have (at least) two levels of interrupt. Some games add
    an extra level of "software interrupt" or "foreground". Others have
    multiple processors (you don't? how about that 6502 on your sound board?)
    All these things mean that the statements of a program are often executed
    in a different order than one might expect from the source listing. Keeping
    things on track requires a certain degree of (dare I say it) discipline.
    NOTE: These are not the only bugs in our code. They are just ones I have
    seen often, or lately. Each topic below is presented as: subject, 
    background, the bug, and suggested changes.

    					Mike

SUBJECT: unrelated functions sharing a latch

BACKGROUND:

	Our hardware guys are very fond of grouping miscellaneous bits
    together in four-, six-, or eight- bit latches. They are unlikely to
    change because a) it's cheaper b) it's not _that_ much of a hassle,
    usually and c) it's now the "standard". The problem arises because these
    latches are write-only, so they need to be shadowed. Unfortunately, when
    a game has two programmers (or one programmer in two different moods or
    seasons), there tend to be two (or more) shadows for the same latch. Also,
    we don't always take care about order of update or possible interrupts.

THE BUG:

	Failure to use a single shadow, or to "guard" references to that
    shadow, leads to random setting/clearing of bits. This is especially
    prevalent in the _added_at_the_last_minute vcr code, which may, for
    example, change the resolution of your LETA...

THE FIX:

	A single routine that has its own shadow and disables interrupts
    while mucking with it. the one below is called to set a some bits by:

	mod_latch( FIRST_BIT | OTHER_BIT );

    and called to clear the same bits by:

	mod_latch( ~( FIRST_BIT | OTHER_BIT ) );

    and it assumes that there is only one such latch, of up to 31 bits.
    The actual code snippet below assumes a 16-bit latch (and shadow), but
    all the '.W's could be '.B' or '.L' for 8 or 31 bits.

	It is _probably_ a real good idea if the latch address and the shadow
    address were _not_ available to anybody else, to "put some teeth" in the
    "convention" of using this routine.

-------- sample code follows --------

	xdef	mod_latch
*	Modifies LATCH according to its param. If MSB set, clears any bits
*	that are clear in the param, else sets any that are set. NOTE: this
*	is the reverse of the original (ROMALOT) mod_latch, but nobody was
*	using that, and this is a more intuitive interface.
mod_latch:
	MOVE	SR,D1
	OR	#$700,SR	* Lock INTS
	MOVE.L	4(SP),D0	* Read parameter LONG
	BMI.S	ML10		* D31 signals 0: OR in bits  1:AND out bits
	OR.W	bitshd,D0
	BRA.S	ML20
ML10:	AND.W	bitshd,D0
ML20:	MOVE.W	D0,LATCH	Set LATCH
	MOVE.W	D0,bitshd	update shadow
	MOVE	D1,SR
	RTS

SUBJECT: Spinning on eer_busy() in Vblank

BACKGROUND:

	Many of our games use EEPROM to hold statistics, High Scores, etc.
    One problem with EEPROM is that each byte (typically) takes time to
    write. This is normally handled by having a routine ( eer_hwt() ) that
    is called every VBlank actually do the write. If the test switch is
    changed while the game is writing, corrupt data could result. So, a
    routine ( eer_busy() ) is supplied to tell when it is safe to "turn over"
    the game from test to game or vice-versa.

THE BUG:

	Somewhere, sometime, somone decided that the code to do this
    should look roughly like:

VBLANK_IRQ:
	
	TST.W	INSELF
	BEQ		
* We think we are in selftest, does the switch agree?
	IFTST	
* We need to transit from test to game.
VB0:    JSR		eer_busy
	TST.L	D0
	BNE	VB0
< code to fake a reset, which will find the test switch off and start game>

    A similar piece of code is used when the game vblank finds the test switch
    on. Since an EEPROM write can make no progress with Vblank IRQs essentially
    disabled, the eer_busy() will, if it ever returns TRUE, continue to do so
    forever (or until WATCHDOG kicks in).

THE FIX:

	Don't just hang. Work towards finishing the write by calling eer_hwt()
    every Vblank until eer_busy() returns 0. This will let the game program
    continue too, until it is safe to "turn over".

VBLANK_IRQ:
	
	TST.W	INSELF
	BEQ		
* We think we are in selftest, does the switch agree?
	IFTST	
* We need to transit from test to game.
	JSR		eer_busy
	TST.L	D0
	BEQ	DIE
	JSR	eer_hwt
	
	RTI

DIE:
< code to fake a reset, which will find the test switch off and start game>

    Incidentally, this is essentially what SYSTEM I originally did. I have no
    idea how the buggy code arose, but I have seen it in several games now.

SUBJECT: VBACK, who hits it and when

BACKGROUND:

	Most of our hardware designs generate an interrupt at the beginning
    of Vertical Blank. Most also have an address to "strobe" to acknowledge
    (and clear) this interrupt. Three warnings:

1)  VAD-based systems cannot tolerate random gunk being stored in VBACK, always
    store a zero. You may want to use a CLR instruction but...

2)  If you are using a 68010 in development and a 68000 in production, make
    sure your VBACK can tolerate being read (modern VADS can) because the
    68000 (but not the 68010) does a read-modify-write cycle for CLR.W

3)  Some games will generate another IRQ at line 256 if you strobe VBACK
    "too soon" (i.e. between lines 240 and 256).

THE BUG:

    Several annoyances, mainly related to lunching VADS and getting spurious
    double Vblanks, which may have the side effect of inducing EEPROM errors.

THE FIX:

    Make sure you are writing the proper value, and preferably late, rather
    than early in your Vblank IRQ routine.

SUBJECT: How to set your priority level from C.

BACKGROUND:

	Sometimes a routine written in C needs to disable interrupts during
    a "critical section". Since the MOVE xxx,SR instruction is not available
    in C, various snippets of assembly have been making the rounds.

THE BUG:

	One of these snippets goes something like:

OLD_IPL: DS.W	1
	...
	xdef	ints_off,ints_on
ints_off:
	MOVE.W	SR,OLD_IPL	; Save old value of Interrupt priority
	MOVE.W	6(SP),SR	; Set new value from parameter
	RTS
ints_on:
	MOVE.W	OLD_IPL,SR
	RTS

    This works fine until the day that a mainline routine does an
    ints_off(0x2200) (saving 20xx in OLD_IPL), then is itself interrupted
    (say at level 4) by a routine that does an ints_off(0x2600), saving
    22xx in OLD_IPL. BOTH calls to ints_on will "restore" the 22xx, leaving
    the "mainline" at level 2 forever. I've seen a variation on this theme
    which maintains a complicated stack in "private" memory, but does not
    check for consistency or stack overflow.

THE FIX:

	Work with, rather than against, the language. The following:

	xdef	set_ipl

set_ipl:
	MOVEQ.L	#0,D0			; Not _strictly_ needed, but tidy
	MOVE.W	SR,D0
	MOVE.W	6(SP),SR
	RTS

    can be used in conjunction with a local variable in the procedure calling
    it:

void
foo()
{
    int old_ipl;
    ...
    old_ipl = set_ipl(#COMM_LEVEL)
    ...
    (void) set_ipl(old_ipl);
}

SUBJECT: is "foreground" really vblank?

BACKGROUND:

	Many of our games use an "interrupt level" between the lowest priority
    real interrupt and "background". This typically handles things that really
    should happen once a frame (like animation drivers), but are not critically
    dependant on the beam position. The Vblank IRQ routine checks to see if the
    "foreground" is active. If not, it is "activated" by marking it active,
    lowering the interrupt priority level, and "calling" the foreground. When
    the foreground returns, the IPL is once again raised, the "active" flag
    is cleared, and Vblank returns. If Vblank finds the foreground already
    active, it simply returns.

THE BUG:

	Sometimes the VBlank routine does not lower its IPL, making the entire
    foreground routine essentially part of the Vblank IRQ. Another problem
    is covered below.

THE FIX:

	Make sure the sequence of operations detailed above is followed, and
    that Vblank can be re-entered safely.

SUBJECT: BIG "critical sections"

BACKGROUND:

	In just about any game there are "critical sections", where we need
    to ensure that only one routine is mucking about with a given data
    structure. For quick changes to small pieces of data, simply raising the
    IPL will work just fine, but some games leave interrupts masked for
    multiple frames, to keep "foreground" from running while a large routine
    runs in "background".

THE BUG:

	When VBlank is left masked for a long time, operations that actually
    need to be done during the blanking interval will actually get done at an
    essentially random time. The hardware may or may not allow this. Also if
    Vblank is left masked for a long time, the result may be two "back to back"
    Vblank interrupts, with subsequent errors in writes to the EEPROM.

THE FIX:

	If you need to prevent the foreground from running, "mark" it in the
    same way Vblank does. Since presumably this is done in "background", there
    should be no hazards. If you really need to mask interrupts for a long
    time, perhaps you need to re-think the data structure and associated
    critical sections.

SUBJECT: 6502 reading test-switch considered harmful

BACKGROUND:

	For reasons lost in the mists of time, the sound processor (6502)
    has its own input for reading the test switch. Among other changes, the
    "coin read exception" will return current values of the coin inputs,
    rather than the "modulo four counters" if the test switch is found on.
    The main processor also reads the test switch, and can, therefore, have
    a different opinion about whether the pair are in "game" or "test" mode.

THE BUG:

	When the test switch is switched on after some coins have been
    deposited, there is a 1 in 4 chance of "seeing" an extraneous coin on
    each mech. This will occur when the number of coins seen by a mech,
    modulo four, is three, because the 6502 will change to reporting '0',
    which is in fact the next valid value for the mech. The "credit" thus
    earned cannot usually be spent, since the main processor should see
    the test switch soon thereafter, but the EEPROM count of coins will
    be off from both the Electro-mechanical counters and the number of
    coins in the cashbox. Also, there is the possibility of a more permanent
    fault leading to bizarre behavior.

THE FIX:

	There are a variety of options, ranging from converting to a scheme
    of "modulo three" counters, with zero a "forbidden value" to having a
    different exception for "raw coin switch" to having a command to set
    the 6502 in "test mode" rather than giving it access to the test switch.
    I welcome and solicit comment, but would frankly prefer all three...

SUBJECT: calling eer_incs() et al. from interrupt level

BACKGROUND:

	There are a number of routines in both the coin and eeprom code
    that cannot be safely re-entered. I also cannot make all of them
    "critical sections" because of deadly-embrace possibilties when called
    an arbitrary number of times from who-knows-where in a game.

THE BUG:

	Calling certain routines with interrupts masked (like, from your
    VBlank routine) can lead to a deadly embrace. The most likely candidates
    are eer_incs() and cn_credits(). They are only most likely because you
    are most likely to want to call them then, although it never seems to
    be really needed.

THE FIX:

	Don't!

SUBJECT: calling com_[f]wrt() with an exception, or com_exc() with a command

BACKGROUND:

	RPM accepts two kinds of "commands". The vast majority of them are
    inserted in a queue when received, to be processed as soon as the main
    loop has time. A special form, called an "exception" is processed as
    soon as it is received, and requires special handling. The code in
    RPMSLV.MAC (6502) and SND_COM.ASM (68000) take special pains to make
    everything work. This work is subverted when a game programmer uses
    com_exc() to send a "normal" command, or (more usually) uses com_[f]wrt()
    to send an exception.

THE BUG:

	Mis-using these routines (as the handed-down-for-generations sound
    test code does, among others) can lead to spurious and/or lost coins.
    It is the number one reason for such spurious coins, despite 6+ years
    of my lectures.

THE FIX:

	Use com_wrt() or com_fwrt() for normal commands, com_exc() for
    exceptions. If your old code from games gone by does not do this, fix it.
    While you are at it, make sure you check the return value from com_wrt()
    if you care whether or not it succeeded. Hardware guys might consider
    adding a (low priority) interrupt for comm_buffer_empty (maskable by
    writing a latch bit, of course) to speed up "queued" output with the
    sort of horrifying kluges I have glimpsed.

SUBJECT: Who hits watchdog and when

BACKGROUND:

	I may not have been the one who brought the "Watchdog" to Atari,
    but Rich Patak and I were the first I know of to have used one in a game
    here. The basic idea is simple. If you can determine an upper bound on
    the amount of "processing time" in your "main loop", you can have a timer
    that will reset the CPU if it is not reset at least that frequently. As
    a first approximation, this works fine, but there are a few problems.
    Concessions to cost lead to watchdog circuits with a period of 128
    milliseconds or so, while our games sometimes have sections (particularly
    in initialization) longer than this. Also, our emulators often have
    problems with watchdog, so it is disabled during most of the development
    cycle, being tested as the game is going out the door.

THE BUG (1):

	In an attempt to get around the tight timing of watchdog, some folks
    hit it in their Vblank IRQ. While this solves one problem, it creates
    another. In the absence of software "sanity checks", the mainline of a
    game can be "insane" and effectively hung up, while the Vblank routine
    is just fine and will never be reset.

THE FIX (1):

	Add some sort of "sanity check", if only on such things as "stack
    pointer in bounds" or a longer interval software managed timer. A really
    spiffy scheme might include a software timer set at every "state
    transition" to the maximum length of the state, and a "scanner" to check
    critical data structures "in the background".

THE BUG (2):

	As a quasi-religious reaction to bug (1), some folks shun hitting
    watchdog in Vblank at all, prefering to sprinkle watchdog hits semi-
    randomly through their code. Unfortunately, the choice of when and where
    is made without much time (remember, watchdog is usually disabled until
    management is breathing down your neck to release), and even choices that
    worked once may be invalidated by late changes in "tuning". In the face
    of these difficulties, a programmer often sprinkles enough "hits" to
    ensure that watchdog will never "bite", no matter how tangled the game
    gets, thus again effectively disabling it.

	The near-random nature of the "sprinkling" can cause other problems
    too. A few weeks ago I fixed a bug in the RAM test that had to do with
    WATCHDOG statements being placed between a CMPA.L and the branch that
    tested the result of the compare. These clobbered the condition codes.
    What is particularly noteworthy is that in the process of writing this
    memo, I came across a note from Jed Margolin (dated 16-JUN-1988) addressing
    this issue. I assume I changed _some_ copy of the RAM test at that time,
    but obviously not the "master". Old code rises from the grave...

THE FIX (2):

	(OPINION ONLY) Abandon approach 2 and apply fix 1. I stress that
    this is my opinion, but I feel fairly strongly about this and am willing
    to debate at length issues of "correctness by construction" and the
    "partitioning" of software projects...
atari email archive

Common Bugs

Need Victim, uh... Volunteer