(1 / 2)
Date: February 21, 1991 15:13
From: KIM::ALBAUGH
To: @SYS$MAIL:ENGINEER
(This file lives in ee$userdisk:[68k_us.memo]common_bugs.txt)
WHO NEEDS TO READ THIS: every game programmer and every project leader
whose game programmer says he/she has none of these bugs (they're lying).
Hardware engineers might also find it a useful refresher.
I have been doing a lot of poking around in dusty corners lately,
trying to move people from older coin and eeprom routines to newer ones.
In the process I have discovered a few bugs and "accidents waiting to
happen". Each is described below, although I'm not claiming that they are
new or that I didn't create some of them myself. First some background.
Most of our games have (at least) two levels of interrupt. Some games add
an extra level of "software interrupt" or "foreground". Others have
multiple processors (you don't? how about that 6502 on your sound board?)
All these things mean that the statements of a program are often executed
in a different order than one might expect from the source listing. Keeping
things on track requires a certain degree of (dare I say it) discipline.
NOTE: These are not the only bugs in our code. They are just ones I have
seen often, or lately. Each topic below is presented as: subject,
background, the bug, and suggested changes.
Mike
SUBJECT: unrelated functions sharing a latch
BACKGROUND:
Our hardware guys are very fond of grouping miscellaneous bits
together in four-, six-, or eight- bit latches. They are unlikely to
change because a) it's cheaper b) it's not _that_ much of a hassle,
usually and c) it's now the "standard". The problem arises because these
latches are write-only, so they need to be shadowed. Unfortunately, when
a game has two programmers (or one programmer in two different moods or
seasons), there tend to be two (or more) shadows for the same latch. Also,
we don't always take care about order of update or possible interrupts.
THE BUG:
Failure to use a single shadow, or to "guard" references to that
shadow, leads to random setting/clearing of bits. This is especially
prevalent in the _added_at_the_last_minute vcr code, which may, for
example, change the resolution of your LETA...
THE FIX:
A single routine that has its own shadow and disables interrupts
while mucking with it. the one below is called to set a some bits by:
mod_latch( FIRST_BIT | OTHER_BIT );
and called to clear the same bits by:
mod_latch( ~( FIRST_BIT | OTHER_BIT ) );
and it assumes that there is only one such latch, of up to 31 bits.
The actual code snippet below assumes a 16-bit latch (and shadow), but
all the '.W's could be '.B' or '.L' for 8 or 31 bits.
It is _probably_ a real good idea if the latch address and the shadow
address were _not_ available to anybody else, to "put some teeth" in the
"convention" of using this routine.
-------- sample code follows --------
xdef mod_latch
* Modifies LATCH according to its param. If MSB set, clears any bits
* that are clear in the param, else sets any that are set. NOTE: this
* is the reverse of the original (ROMALOT) mod_latch, but nobody was
* using that, and this is a more intuitive interface.
mod_latch:
MOVE SR,D1
OR #$700,SR * Lock INTS
MOVE.L 4(SP),D0 * Read parameter LONG
BMI.S ML10 * D31 signals 0: OR in bits 1:AND out bits
OR.W bitshd,D0
BRA.S ML20
ML10: AND.W bitshd,D0
ML20: MOVE.W D0,LATCH Set LATCH
MOVE.W D0,bitshd update shadow
MOVE D1,SR
RTS
SUBJECT: Spinning on eer_busy() in Vblank
BACKGROUND:
Many of our games use EEPROM to hold statistics, High Scores, etc.
One problem with EEPROM is that each byte (typically) takes time to
write. This is normally handled by having a routine ( eer_hwt() ) that
is called every VBlank actually do the write. If the test switch is
changed while the game is writing, corrupt data could result. So, a
routine ( eer_busy() ) is supplied to tell when it is safe to "turn over"
the game from test to game or vice-versa.
THE BUG:
Somewhere, sometime, somone decided that the code to do this
should look roughly like:
VBLANK_IRQ:
TST.W INSELF
BEQ
* We think we are in selftest, does the switch agree?
IFTST
* We need to transit from test to game.
VB0: JSR eer_busy
TST.L D0
BNE VB0
< code to fake a reset, which will find the test switch off and start game>
A similar piece of code is used when the game vblank finds the test switch
on. Since an EEPROM write can make no progress with Vblank IRQs essentially
disabled, the eer_busy() will, if it ever returns TRUE, continue to do so
forever (or until WATCHDOG kicks in).
THE FIX:
Don't just hang. Work towards finishing the write by calling eer_hwt()
every Vblank until eer_busy() returns 0. This will let the game program
continue too, until it is safe to "turn over".
VBLANK_IRQ:
TST.W INSELF
BEQ
* We think we are in selftest, does the switch agree?
IFTST
* We need to transit from test to game.
JSR eer_busy
TST.L D0
BEQ DIE
JSR eer_hwt
RTI
DIE:
< code to fake a reset, which will find the test switch off and start game>
Incidentally, this is essentially what SYSTEM I originally did. I have no
idea how the buggy code arose, but I have seen it in several games now.
SUBJECT: VBACK, who hits it and when
BACKGROUND:
Most of our hardware designs generate an interrupt at the beginning
of Vertical Blank. Most also have an address to "strobe" to acknowledge
(and clear) this interrupt. Three warnings:
1) VAD-based systems cannot tolerate random gunk being stored in VBACK, always
store a zero. You may want to use a CLR instruction but...
2) If you are using a 68010 in development and a 68000 in production, make
sure your VBACK can tolerate being read (modern VADS can) because the
68000 (but not the 68010) does a read-modify-write cycle for CLR.W
3) Some games will generate another IRQ at line 256 if you strobe VBACK
"too soon" (i.e. between lines 240 and 256).
THE BUG:
Several annoyances, mainly related to lunching VADS and getting spurious
double Vblanks, which may have the side effect of inducing EEPROM errors.
THE FIX:
Make sure you are writing the proper value, and preferably late, rather
than early in your Vblank IRQ routine.
SUBJECT: How to set your priority level from C.
BACKGROUND:
Sometimes a routine written in C needs to disable interrupts during
a "critical section". Since the MOVE xxx,SR instruction is not available
in C, various snippets of assembly have been making the rounds.
THE BUG:
One of these snippets goes something like:
OLD_IPL: DS.W 1
...
xdef ints_off,ints_on
ints_off:
MOVE.W SR,OLD_IPL ; Save old value of Interrupt priority
MOVE.W 6(SP),SR ; Set new value from parameter
RTS
ints_on:
MOVE.W OLD_IPL,SR
RTS
This works fine until the day that a mainline routine does an
ints_off(0x2200) (saving 20xx in OLD_IPL), then is itself interrupted
(say at level 4) by a routine that does an ints_off(0x2600), saving
22xx in OLD_IPL. BOTH calls to ints_on will "restore" the 22xx, leaving
the "mainline" at level 2 forever. I've seen a variation on this theme
which maintains a complicated stack in "private" memory, but does not
check for consistency or stack overflow.
THE FIX:
Work with, rather than against, the language. The following:
xdef set_ipl
set_ipl:
MOVEQ.L #0,D0 ; Not _strictly_ needed, but tidy
MOVE.W SR,D0
MOVE.W 6(SP),SR
RTS
can be used in conjunction with a local variable in the procedure calling
it:
void
foo()
{
int old_ipl;
...
old_ipl = set_ipl(#COMM_LEVEL)
...
(void) set_ipl(old_ipl);
}
SUBJECT: is "foreground" really vblank?
BACKGROUND:
Many of our games use an "interrupt level" between the lowest priority
real interrupt and "background". This typically handles things that really
should happen once a frame (like animation drivers), but are not critically
dependant on the beam position. The Vblank IRQ routine checks to see if the
"foreground" is active. If not, it is "activated" by marking it active,
lowering the interrupt priority level, and "calling" the foreground. When
the foreground returns, the IPL is once again raised, the "active" flag
is cleared, and Vblank returns. If Vblank finds the foreground already
active, it simply returns.
THE BUG:
Sometimes the VBlank routine does not lower its IPL, making the entire
foreground routine essentially part of the Vblank IRQ. Another problem
is covered below.
THE FIX:
Make sure the sequence of operations detailed above is followed, and
that Vblank can be re-entered safely.
SUBJECT: BIG "critical sections"
BACKGROUND:
In just about any game there are "critical sections", where we need
to ensure that only one routine is mucking about with a given data
structure. For quick changes to small pieces of data, simply raising the
IPL will work just fine, but some games leave interrupts masked for
multiple frames, to keep "foreground" from running while a large routine
runs in "background".
THE BUG:
When VBlank is left masked for a long time, operations that actually
need to be done during the blanking interval will actually get done at an
essentially random time. The hardware may or may not allow this. Also if
Vblank is left masked for a long time, the result may be two "back to back"
Vblank interrupts, with subsequent errors in writes to the EEPROM.
THE FIX:
If you need to prevent the foreground from running, "mark" it in the
same way Vblank does. Since presumably this is done in "background", there
should be no hazards. If you really need to mask interrupts for a long
time, perhaps you need to re-think the data structure and associated
critical sections.
SUBJECT: 6502 reading test-switch considered harmful
BACKGROUND:
For reasons lost in the mists of time, the sound processor (6502)
has its own input for reading the test switch. Among other changes, the
"coin read exception" will return current values of the coin inputs,
rather than the "modulo four counters" if the test switch is found on.
The main processor also reads the test switch, and can, therefore, have
a different opinion about whether the pair are in "game" or "test" mode.
THE BUG:
When the test switch is switched on after some coins have been
deposited, there is a 1 in 4 chance of "seeing" an extraneous coin on
each mech. This will occur when the number of coins seen by a mech,
modulo four, is three, because the 6502 will change to reporting '0',
which is in fact the next valid value for the mech. The "credit" thus
earned cannot usually be spent, since the main processor should see
the test switch soon thereafter, but the EEPROM count of coins will
be off from both the Electro-mechanical counters and the number of
coins in the cashbox. Also, there is the possibility of a more permanent
fault leading to bizarre behavior.
THE FIX:
There are a variety of options, ranging from converting to a scheme
of "modulo three" counters, with zero a "forbidden value" to having a
different exception for "raw coin switch" to having a command to set
the 6502 in "test mode" rather than giving it access to the test switch.
I welcome and solicit comment, but would frankly prefer all three...
SUBJECT: calling eer_incs() et al. from interrupt level
BACKGROUND:
There are a number of routines in both the coin and eeprom code
that cannot be safely re-entered. I also cannot make all of them
"critical sections" because of deadly-embrace possibilties when called
an arbitrary number of times from who-knows-where in a game.
THE BUG:
Calling certain routines with interrupts masked (like, from your
VBlank routine) can lead to a deadly embrace. The most likely candidates
are eer_incs() and cn_credits(). They are only most likely because you
are most likely to want to call them then, although it never seems to
be really needed.
THE FIX:
Don't!
SUBJECT: calling com_[f]wrt() with an exception, or com_exc() with a command
BACKGROUND:
RPM accepts two kinds of "commands". The vast majority of them are
inserted in a queue when received, to be processed as soon as the main
loop has time. A special form, called an "exception" is processed as
soon as it is received, and requires special handling. The code in
RPMSLV.MAC (6502) and SND_COM.ASM (68000) take special pains to make
everything work. This work is subverted when a game programmer uses
com_exc() to send a "normal" command, or (more usually) uses com_[f]wrt()
to send an exception.
THE BUG:
Mis-using these routines (as the handed-down-for-generations sound
test code does, among others) can lead to spurious and/or lost coins.
It is the number one reason for such spurious coins, despite 6+ years
of my lectures.
THE FIX:
Use com_wrt() or com_fwrt() for normal commands, com_exc() for
exceptions. If your old code from games gone by does not do this, fix it.
While you are at it, make sure you check the return value from com_wrt()
if you care whether or not it succeeded. Hardware guys might consider
adding a (low priority) interrupt for comm_buffer_empty (maskable by
writing a latch bit, of course) to speed up "queued" output with the
sort of horrifying kluges I have glimpsed.
SUBJECT: Who hits watchdog and when
BACKGROUND:
I may not have been the one who brought the "Watchdog" to Atari,
but Rich Patak and I were the first I know of to have used one in a game
here. The basic idea is simple. If you can determine an upper bound on
the amount of "processing time" in your "main loop", you can have a timer
that will reset the CPU if it is not reset at least that frequently. As
a first approximation, this works fine, but there are a few problems.
Concessions to cost lead to watchdog circuits with a period of 128
milliseconds or so, while our games sometimes have sections (particularly
in initialization) longer than this. Also, our emulators often have
problems with watchdog, so it is disabled during most of the development
cycle, being tested as the game is going out the door.
THE BUG (1):
In an attempt to get around the tight timing of watchdog, some folks
hit it in their Vblank IRQ. While this solves one problem, it creates
another. In the absence of software "sanity checks", the mainline of a
game can be "insane" and effectively hung up, while the Vblank routine
is just fine and will never be reset.
THE FIX (1):
Add some sort of "sanity check", if only on such things as "stack
pointer in bounds" or a longer interval software managed timer. A really
spiffy scheme might include a software timer set at every "state
transition" to the maximum length of the state, and a "scanner" to check
critical data structures "in the background".
THE BUG (2):
As a quasi-religious reaction to bug (1), some folks shun hitting
watchdog in Vblank at all, prefering to sprinkle watchdog hits semi-
randomly through their code. Unfortunately, the choice of when and where
is made without much time (remember, watchdog is usually disabled until
management is breathing down your neck to release), and even choices that
worked once may be invalidated by late changes in "tuning". In the face
of these difficulties, a programmer often sprinkles enough "hits" to
ensure that watchdog will never "bite", no matter how tangled the game
gets, thus again effectively disabling it.
The near-random nature of the "sprinkling" can cause other problems
too. A few weeks ago I fixed a bug in the RAM test that had to do with
WATCHDOG statements being placed between a CMPA.L and the branch that
tested the result of the compare. These clobbered the condition codes.
What is particularly noteworthy is that in the process of writing this
memo, I came across a note from Jed Margolin (dated 16-JUN-1988) addressing
this issue. I assume I changed _some_ copy of the RAM test at that time,
but obviously not the "master". Old code rises from the grave...
THE FIX (2):
(OPINION ONLY) Abandon approach 2 and apply fix 1. I stress that
this is my opinion, but I feel fairly strongly about this and am willing
to debate at length issues of "correctness by construction" and the
"partitioning" of software projects...
(2 / 2)
Date: March 08, 1991 14:56
From: KIM::ALBAUGH
To: @SYS$MAIL:EE
(I apologize to all you hardware guys who dropped hot soldering irons in your laps in your haste to read this, but there is no "Programmers" list) I short while ago, I sent out a memo about common bugs. One of them had to do with the 6502 and 68000 independantly deciding whether or not they were in self-test. I have modified SACOIN.MAC to deal with this problem by having a separate exception for self-test read of the raw switches. I also added provision for table-driven assignment of coin-counters to coin-mechs. I'd really like some brave volunteer to use this version, so I can know if it _really_ works. It should "drop right in" with no changes, if you can tolerate getting the incrementing counters on your self-test screen. Then, by adding an exception that uses E$TCSW (Test Coin SWitches) and using that, you will get the raw switches for your self test and avoid bogus coins in your EEPROM when you flip the test switch. The new code is five bytes shorter than the old code, but the added table space for the exception cancels some of that. In any case, it should pose no problem. Takers? Mike
Feb 21, 1991