Tardis:What SpellBot actually corrects

Because even the most conscientious of editors will occasionally make spelling errors, there is a need to have bot enforcement of the spelling policy. A comprehensive list of the differences between British and American spellings has been compiled, and is being coded for bot use as of the second week of June, 2011. This page will see heavy updating throughout that week as the list is fully coded.

Following is the raw code of that bot routine, so that all users may see exactly what the bot is checking for.

Words the bot will not check for
Some words are beyond the capability of the bot, because they are valid spellings (even if of different words) in British English. This list includes:
 * Check. Americans use this word not only as a verb meaning to investigate, but also as a noun for the financial instrument that the British spell cheque.  Because the first meaning is valid in BrEng, the bot can't be programmed to correct the other usage.  We'd end up with sentences like:
 * The Doctor chequed on Sarah Jane in her hospital room before going to the pathology lab.


 * Tire. Both sides of the Atlantic use tire as a verb.  It's again the noun that's problematic.  Americans view tire as the correct spelling for what the British would call a tyre.  The bot can't figure this one out, so it doesn't even try.
 * Draft. Americans use this spelling for all senses, while the British use both draft and draught for different senses.  All words beginning with drafts- will be converted to draughts-, and the word drafty will be converted to draughty, but the word draft itself won't be touched by the bot, as that is a valid British spelling of the word.  Clear as mud?  Cool.  Onwards, then ...
 * Judgment. There's no agreement on either side of the Atlantic whether this word should be judgment or judgement.  Oddly, most British spell-checkers will red-flag judgment, even though that's the official spelling in Commonwealth courts.  We have at least one story title preferencing the version with two es — Judgement of the Judoon.  But still, this word will require a special forum discussion to decide which way we want to spell it.
 * Disc. Way, way, way too screwed up a word for a simple bot to handle.  Disc jockey is fine on both sides of the divide, but so are floppy disk and hard disk. This one simply depends on context.
 * Connexion. This British spelling of connection is not universally used in Britain. Connection is correct in Britain, too, so the bot won't try to force connection into a connexion-shaped hole.
 * Smidgen, smidgeon, smidgin. They're all valid spellings for the same word, on both sides of the Atlantic.  The only way the bot could be useful is if we had a forum discussion to settle on one of the three spellings.
 * Practise. The -ise version of this word is the correct spelling for the verb in BrEng; the British noun ends in -ice.  Americans use -ice for everything.  Thus, the bot can't be of much use, except for participles derived from the verb.  So, the bot will make no attempt to change the spelling of practice, but it will change practicing and practiced to practising and practised.
 * Jail. Yes, gaol is still correct, especially historically, but jail has largely supplanted it.  So we won't look to correct jail to gaol, but neither will we try to correct gaol to jail.
 * Yogurt passes most British spell-checkers today, but so does yoghurt. We'll leave both well alone until a forum discussion decides the matter.
 * Almanac is universally the way it's spelt in American English, and increasingly the way Britons spell it, too. Still, some old-timers will go for almanack. Until a forum discussion decides otherwise, the bot won't enforce either spelling.

How to read the code
The code works by telling the bot to look for the word described before the comma. Then it replaces it with the word after the comma. A most basic expression would be:
 * (u'color', u'colour')

This looks for the American "color", then replaces it with the British "colour".
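As a rough illustration of what one such pair does, here is a minimal sketch using Python's standard re module. The helper name apply_pair is hypothetical and is not part of the bot; it only mimics the "find the first item, put in the second" behaviour described above.

```python
import re

def apply_pair(text, pair):
    """Apply one (search, replace) pair to text as a regex substitution."""
    search, replace = pair
    return re.sub(search, replace, text)

# The basic pair from the text above.
COLOUR_PAIR = (u'color', u'colour')

print(apply_pair(u"The color of the TARDIS is blue.", COLOUR_PAIR))
```

Note that this plain pair is case-sensitive, so a capitalised "Color" would be left alone.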

Because typing every permutation of a word, including all words that share the same root and capitalised variants, would be very time-consuming, most of the code won't work in such a simplistic way. Most of it uses a "regular expression" — or regex — to find a lot of hits with just one line. Here's an explanation of the regex used in this code:
 * The expression ([Cc]) means "look for either capitalised or lowercase versions of the letter C"
 * (.?) means, "You, Mr. Fancy Computer bot thing, might find one more character to the right of this point. Grab it if it's there." It captures at most one character; anything further along in the word isn't part of the match, so it stays exactly where it is.
 * \1 means, "take whatever is in the first parentheses and put it here"
 * \2 means, "take whatever is in the second parentheses and put it here"

Thus, if we have the expression,
 * (r'([Cc])apitaliz(.?)', r'\1apitalis\2')

It means, roughly,
 * Look for all words, beginning with either a capital or lowercase C, which are followed by the letters "apitaliz" plus, at most, one more character. Then, keep the form of the letter c that you find, stick on "apitalis", and add back in the character you originally found after the "z". Any letters beyond that character were never part of the match, so they stay put.

In other words, find Capitaliz-, keep the C capitalised, switch the z to an s, and leave the "-e", "-ing", "-ed", or "-ation" ending in place.
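To see this in action, the sketch below runs the capitalise pair, in the form it takes in the bot's list, over a few words with Python's re module. The function name britishise is made up for this example. In Python, (.?) captures at most one character, and the printout shows which one.

```python
import re

# The pair as it appears in the bot's replacement list.
PATTERN = r'([Cc])apitaliz(.?)'
REPLACEMENT = r'\1apitalis\2'

def britishise(word):
    """Rewrite only the matched part of the word; the rest is left as it was."""
    return re.sub(PATTERN, REPLACEMENT, word)

for word in ["capitalize", "Capitalizing", "capitalized", "capitalization"]:
    match = re.search(PATTERN, word)
    # group(2) holds the single character captured by (.?)
    print(word, "->", britishise(word), "| group 2 =", repr(match.group(2)))
```

The ending survives because only the matched stretch of the word is rewritten, not because the pattern swallows the whole suffix.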

The most complicated case
Now let's take a look at arguably the most complicated coding here. What if I wanted to change every word that had favor as a root? How could I take care of words that had both a prefix and a suffix, like disfavorable? Putting together everything we've learned so far, it would be:
 * (r'(.?)([Ff])avor(.?)',r'\1\2avour\3')

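The leading (.?) picks up the character just before the root, if there is one, so that a prefix can sit in front of it. The ([Ff]) switch checks for capitalisation of the root letter f. The (.?) at the end does the same job for the character just after the root, where a suffix would start. Now we have three sets of parentheses instead of just two. So \1 is the character before the root, \2 puts the letter f in with proper capitalisation, and \3 is the character after the root. The rest of any prefix or suffix isn't matched, so it's left untouched.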

This one statement will therefore switch over: favor, favors, favored, disfavor, disfavored, unfavorable, favoring, disfavoring, favorable, and almost certainly a few more.
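A quick way to check that claim is to run the favor pair over a handful of those words. This is only a sketch with Python's re module; the name convert is invented for the example and the word list is just a sample.

```python
import re

# The favor pair from the text above.
FAVOR = (r'(.?)([Ff])avor(.?)', r'\1\2avour\3')

def convert(text, pair=FAVOR):
    """Apply the (search, replace) pair everywhere it matches in text."""
    search, replace = pair
    return re.sub(search, replace, text)

words = ["favor", "Favors", "favored", "disfavor", "unfavorable", "favoring"]
for w in words:
    print(w, "->", convert(w))
```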

Cases where regex fails
Not every word on our list has been switched using regex. Sometimes it's easier just to type up a switch of literal characters, as when a word serves as the root of no other words.
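One thing worth keeping in mind with literal pairs: assuming a literal is matched anywhere in the text, a short word can also turn up inside a longer, unrelated one. The sketch below shows the effect and one possible guard, using \b word boundaries from Python's re module. The helper literal_pair is hypothetical and is not something the bot's code contains.

```python
import re

def literal_pair(american, british):
    """Build a (search, replace) pair that matches `american` as a whole word only."""
    # re.escape guards against regex metacharacters; \b marks a word boundary.
    return (r'\b' + re.escape(american) + r'\b', british)

naive = (u'ax', u'axe')
bounded = literal_pair(u'ax', u'axe')

text = u"He hailed a taxi, ax in hand."
print(re.sub(naive[0], naive[1], text))      # also changes the "ax" inside "taxi"
print(re.sub(bounded[0], bounded[1], text))  # changes only the standalone word
```

A bounded pair like this is still case-sensitive, so capitalised forms would need their own entry or a ([Aa]) style group.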

The code
The following code will change over time, as more words are added. The final word in the English language that has a British/American difference is yodelling. Once you see that word on this list, you'll know the bot is fully programmed.

fixes['spelling'] = { 'regex': True, 'recursive': True, 'msg': { 'en':u'Enforcing spelling policy.' },   'replacements': [ (u'accessorize', u'accessorise'), (u'accessorized', u'accessorised'), (u'accessorizes', u'accessorises'), (u'accessorizing', u'accessorising'), (u'acclimatization',u'acclimatisation'), (u'acclimatize',u'acclimatise'), (u'acclimatized',u'acclimatised'), (u'acclimatizes',u'acclimatises'), (u'acclimatizing',u'acclimatising'), (u'accouterments',u'accoutrements'), (u'eon',u'aeon'), (u'eons',u'aeons'), (u'aerogram',u'aerogramme'), (u'aerograms',u'aerogrammes'), (u'esthete',u'aesthete'), (u'esthetes',u'aesthetes'), (u'esthetic',u'aesthetic'), (u'esthetically', u'aesthetically'), (u'esthetics', u'aesthetics'), (u'etiology',u'aetiology'), (u'aging',u'ageing'), (u'aggrandizement',u'aggrandisement'), (u'agonize', u'agonise'), (u'agonized',u'agonised'), (u'agonizes',u'agonises'), (u'agonizing',u'agonising'), (u'agonizingly',u'agonisingly'), (u'almanac',u'almanack'), (u'almanacs',u'almanacks'), (u'aluminum', u'aluminium'), (u'amortizable',u'amortisable'), (u'amortization',u'amortisation'), (u'amortizations',u'amortisations'), (u'amortize',u'amortise'), (u'amortized',u'amortised'), (u'amortizes',u'amortises'), (u'amortizing',u'amortising'), (u'amphitheater',u'amphitheatre'), (u'amphitheaters',u'amphitheatres'), (u'anemia',u'anaemia'), (u'anemic',u'anaemic'), (u'anesthesia',u'anaesthesia'), (u'anesthetic',u'anaesthetic'), (u'anesthetics',u'anaesthetics'), (u'anesthetize',u'anaesthetise'), (u'anesthetized',u'anaesthetised'), (u'anesthetizes',u'anaesthetises'), (u'anesthetizing',u'anaesthetising'), (u'anesthetist',u'anaesthetist'), (u'anesthetists',u'anaesthetists'), (u'analog',u'analogue'), (u'analogs',u'analogues'), (u'analyze',u'analyse'), (u'analyzed',u'analysed'), (u'analyzes',u'analyses'), (u'analyzing',u'analysing'), (u'anglicize',u'anglicise'), (u'anglicized',u'anglicised'), (u'anglicizes',u'anglicises'), (u'anglicizing',u'anglicising'), 
(u'annualized',u'annualised'), (u'antagonize',u'antagonise'), (u'antagonized',u'antagonised'), (u'antagonizes',u'antagonises'), (u'antagonizing',u'antagonising'), (u'apologize',u'apologise'), (u'apologized',u'apologised'), (u'apologizes',u'apologises'), (u'apologizing',u'apologising'), (u'appall',u'appal'), (u'appalls',u'appals'), (u'appetizer',u'appetiser'), (u'appetizers',u'appetisers'), (u'appetizing',u'appetising'), (u'appetizingly',u'appetisingly'), (u'arbor',u'arbour'), (u'arbors',u'arbours'), (u'archeological',u'archaeological'), (u'archeologically',u'archaeologically'), (u'archeologist',u'archaeologist'), (u'archeologists',u'archaeologists'), (u'archeology',u'archaeology'), (u'ardor',u'ardour'), (u'armor',u'armour'), (u'armored',u'armoured'), (u'armorer',u'armourer'), (u'armorers',u'armourers'), (u'armories',u'armouries'), (u'armory',u'armoury'), (u'artifact',u'artefact'), (u'artifacts',u'artefacts'), (u'authorize',u'authorise'), (u'authorized',u'authorised'), (u'authorizes',u'authorises'), (u'authorizing',u'authorising'), (u'ax',u'axe'), (u'backpedaled', u'backpedalled'), (u'backpedaling', u'backpedalling'), (u'banister', u'bannister'), (u'banisters',u'bannisters'), (u'baptize',u'baptise'), (u'baptized',u'baptised'), (u'baptizes',u'baptises'), (u'baptizing',u'baptising'), (u'bastardize',u'bastardise'), (u'bastardized',u'bastardised'), (u'bastardizes',u'bastardises'), (u'bastardizing',u'bastardising'), (u'battleax',u'battleaxe'), (u'balk',u'baulk'), (u'balked',u'baulked'), (u'balking',u'baulking'), (u'balks',u'baulks'), (u'bedeviled',u'bedevilled'), (u'bedeviling',u'bedevilling'), (u'behavior',u'behaviour'), (u'behavioral',u'behavioural'), (u'behaviorism',u'behaviourism'), (u'behaviorist',u'behaviourist'), (u'behaviorists',u'behaviourists'), (u'behaviors',u'behaviours'), (u'behoove',u'behove'), (u'behooved',u'behoved'), (u'behooves',u'behoves'), (u'bejeweled',u'bejewelled'), (u'belabor',u'belabour'), (u'belabored',u'belaboured'), (u'belaboring',u'belabouring'), 
(u'belabors',u'belabours'), (u'beveled',u'bevelled'), (u'bevies',u'bevvies'), (u'bevy',u'bevvy'), (u'biased',u'biassed'), (u'biasing',u'biassing'), (u'binging',u'bingeing'), (u'bougainvillea',u'bougainvillaea'), (u'bougainvilleas',u'bougainvillaeas'), (u'bowdlerize',u'bowdlerise'), (u'bowdlerized',u'bowdlerised'), (u'bowdlerizes',u'bowdlerises'), (u'bowdlerizing',u'bowdlerising'), (u'breathalyze',u'breathalyse'), (u'breathalyzed',u'breathalysed'), (u'breathalyzer',u'breathalyser'), (u'breathalyzers',u'breathalysers'), (u'breathalyzes',u'breathalyses'), (u'breathalyzing',u'breathalysing'), (u'brutalize',u'brutalise'), (u'brutalized',u'brutalised'), (u'brutalizes',u'brutalises'), (u'brutalizing',u'brutalising'), (u'busses',u'buses'), (u'bussing',u'busing'), (u'cesarean',u'caesarean'), (u'cesareans',u'caesareans'), (u'caliber',u'calibre'), (u'calibers',u'calibres'), (r'([Cc])aliper(.?)',r'\1alliper\2'), (r'([Cc])alisthenics',r'\1allisthenics'), (u'canalize',u'canalise'), (u'canalized',u'canalised'), (u'canalizes',u'canalises'), (u'canalizing',u'canalising'), (r'([Cc])ancelation',r'\1ancellation'), (r'([Cc])ancelations',r'\1ancellations'), (r'([Cc])anceled',r'\1ancelled'), (r'([Cc])anceling',r'\1ancelling'), (r'([Cc])andor',r'\1andour'), (r'([Cc])annibaliz(.?)',r'\1annibalis\2'), (r'([Cc])anibaliz(.?)',r'\1annibalis\2'), (r'([Cc])anibalis(.?)',r'\1annibalis\2'), (r'([Cc])anoniz(.?)',r'\1anonis\2'), (r'([Cc])apitaliz(.?)',r'\1apitalis\2'), (r'([Cc])arameliz(.?)',r'\1aramelis\2'), (r'([Cc])arboniz(.?)',r'\1arbonis\2'), (r'([Cc])aroled',r'\1arolled'), (r'([Cc])aroling',r'\1arolling'), (r'([Cc])atalog',r'\1atalogue'), (r'([Cc])atalogs',r'\1atalogues'), (r'([Cc])ataloged',r'\1atalogued'), (r'([Cc])ataloging',r'\1ataloguing'), (r'([Cc])atalyz(.?)',r'\1atalys\2'), (r'([Cc])ategoriz(.?)',r'\1ategoris\2'), (r'([Cc])auteriz(.?)',r'\1auteris\2'), (r'([Cc])aviled',r'\1avilled'), (r'([Cc])aviling',r'\1avilling'), (r'([Cc])entigram(.?)',r'\1entigramme\2'), 
(r'([Cc])entiliter(.?)',r'\1entilitre\2'), (r'([Cc])entimeter(.?)',r'\1entimetre\2'), (r'([Cc])entraliz(.?)',r'\1entralis\2'), (r'([Cc])enter(.?)',r'\1entre\2'), (r'([Cc])hanneled',r'\1hannelled'), (r'([Cc])hanneling',r'\1hannelling'), (r'([Cc])haracteriz(.?)',r'\1haracteris\2'), (r'([Cc])heckbook(.?)',r'\1hequebook\2'), (r'([Cc])hili',r'\1hilli'), (r'([Cc])himera(.?)',r'\1himaera\2'), (r'([Cc])hiseled',r'\1hiselled'), (r'([Cc])hiseling',r'\1hiselling'), (r'([Cc])irculariz(.?)',r'\1ircularis\2'), (r'([Cc])iviliz(.?)',r'\1ivilis\2'), (r'([Cc])lamor(.?)',r'\1lamour\2'), (r'([Cc])langor',r'\1langour'), (r'([Cc])larinetist',r'\1larinettist'), (r'([Cc])ollectiviz(.?)',r'\1ollectivis\2'), (r'([Cc])oloniz(.?)',r'\1olonis\2'), (r'(.?)([Cc])olor(.?)',r'\1\2olour\3'), (r'([Cc])ommercializ(.?)',r'\1ommercialis\2'), (r'([Cc])ompartmentaliz(.?)',r'\1ompartmentalis\2'), (r'([Cc])omputeriz(.?)',r'\1omputeris\2'), (r'([Cc])onceptualiz(.?)',r'\1onceptualis\2'), (r'([Cc])ontextualiz(.?)',r'\1ontextualis\2'), (r'([Cc])oz(.?)',r'\1os\2'), (r'([Cc])ouncilor(.?)',r'\1ouncillor\2'), (r'([Cc])ounselor(.?)',r'\1ounsellor\2'), (r'([Cc])ounseling',r'\1ounselling'), (r'([Cc])ounseled',r'\1ounselled'), (r'([Cc])renelated',r'\1renellated'), (r'([Cc])riminaliz(.?)',r'\1riminalis\2'), (r'([Cc])riticiz(.?)',r'\1riticis\2'), (r'([Cc])rueler',r'\1rueller'), (r'([Cc])ruelest',r'\1ruellest'), (r'([Cc])rystalliz(.?)',r'\1rystallis\2'), (r'([Cc])udgeled', r'\1udgelled'), (r'([Cc])udgeling',r'\1udgelling'), (r'([Cc])ustomiz(.?)',r'\1ustomis\2'), (r'([Cc])ipher(.?)',r'\1ypher\2'),

],   'exceptions': { 'inside-tags': [ 'pre', 'code', 'nowiki', 'hyperlink', 'link', 'comment', ]       }    }
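With a list this long, slips are easy to make, so a sanity check before a run may be worthwhile. The following is a minimal sketch only: check_pairs is a hypothetical helper, not part of the bot, and the three sample pairs are made up to illustrate the kinds of slip it looks for. It checks that each pattern compiles, that every \1, \2 or \3 in a replacement has a matching set of parentheses, and that no replacement was written without the r prefix (in which case '\1' becomes a control character).

```python
import re

def check_pairs(pairs):
    """Return a list of (index, problem) for pairs that would fail or misbehave."""
    problems = []
    for i, (search, replace) in enumerate(pairs):
        try:
            compiled = re.compile(search)
        except re.error as exc:
            problems.append((i, "bad pattern: %s" % exc))
            continue
        # Backreferences such as \1 must point at a group the pattern defines.
        for ref in re.findall(r'\\(\d)', replace):
            if int(ref) > compiled.groups:
                problems.append((i, "\\%s has no matching group" % ref))
        # A non-raw '\1' is the control character \x01, not a backreference.
        if any(ord(ch) < 32 for ch in replace):
            problems.append((i, "control character in replacement"))
    return problems

# Illustrative sample pairs (assumptions for this sketch).
sample = [
    (r'([Cc])ustomiz(.?)', r'\1ustomis\2'),     # fine
    (r'([Cc])rystalliz', r'\1rystallis\2'),     # \2 without a second group
    (u'([Cc])alisthenics', u'\1allisthenics'),  # replacement lacks the r prefix
]
for index, problem in check_pairs(sample):
    print(index, problem)
```

Passing fixes['spelling']['replacements'] to such a function would report the index of each suspect pair without touching any wiki page.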