User:CzechOut/Bot tricks

The following is a list of tricks I've learned while using pywikipedia.

add_text.py
One of the harder things to do with bots is to work on pages that have no categories. This is because bots depend upon categories for many of their functions. However, bots can be used on pages without categories, as long as you go about things creatively.

If you have a user who is constantly uploading pictures without licenses, it may be easiest just to look for their work, to the exclusion of other people. Here's a run that'll look for only their additions to the file namespace:
python add_text.py -text:"" -namespace:6 -usercontribs:"Doctor Who 63" -except:"\{\{[Bb]bcvidcover"

Note that this goes through all their work in namespace 6, not only their unlicensed work in that namespace. Note, too, that the parameter -uncatfiles doesn't actually help here. It doesn't hurt, but it doesn't confine the search to just those things in namespace 6 modified by Doctor Who 63 which are also uncategorised.
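To see what the -except filter in that run is doing, here is a minimal sketch in plain Python. It does not call pywikipedia at all: the page titles and page texts are made-up examples, and the helper name needs_template is hypothetical. It only shows how the regex decides which of a user's files get skipped.

```python
import re

# Same pattern as the -except argument: matches "{{bbcvidcover" or "{{Bbcvidcover".
EXCEPT_PATTERN = re.compile(r"\{\{[Bb]bcvidcover")


def needs_template(page_text):
    """Return True if the page text does not already contain the template."""
    return EXCEPT_PATTERN.search(page_text) is None


# Hypothetical file pages (title -> wikitext), standing in for one user's uploads.
pages = {
    "File:Example one.jpg": "{{Bbcvidcover}}",
    "File:Example two.jpg": "A description with no license template.",
    "File:Example three.jpg": "== Licensing ==\n{{bbcvidcover}}",
}

# Only pages that do not match the pattern are left for tagging.
to_tag = sorted(title for title, text in pages.items() if needs_template(text))
print(to_tag)
```

The filter looks at the page text, not at categories, which is why it still works on uncategorised uploads.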

However, -uncatfiles is helpful if you don't have that many files to look after. This is what you use if you just want to add bbcvidcover to pages that aren't categorised:
python add_text.py -text:"" -uncatfiles

Of course, this is a slow way to go about things, because you probably won't want to add a single template to all the uncategorised files. If you want to filter things a bit, you can instead try to find patterns in the titles of the uncategorised files. -titleregex: allows you to make up your own matching rules. But if you can see a quick and dirty pattern at the beginning of a filename, you might try this instead:
-prefixindex:"File: "

This method is perfect for quickly licensing achievement badges, because they'll start with the term "File:badge".

What if you want to replace something about a title that has both exclamation points and single quotes for italics? This is pretty dicey, because the exclamation point has to be escaped, and you've got to figure out a way to get around the single quotes. Here's a useful expression:
python replace.py -summary:"see forum:Prefix war: Doctor Who Adventures vs. Doctor Who Annuals: DWAN --> DWS" -regex "\[\[DWAN\]\]\: \[\[Grand Theft Planet(\!\]\])" "DWS: ''[[Grand Theft Planet\1" -ref:'Grand Theft Planet!'

Note what's going on here. The -ref line must be in single quotes. The regex for the original term must have parentheses around the part of the page name that's causing the most difficulty, so that it can just be dumped into the replacement term as a \1. After trying for a bit, I couldn't find anything else that worked in command line operation of the bot. Of course, my guess is that you might well need something like this, even if using a user-fix.
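The capture-group trick can be tried outside the bot with Python's re module, which is a reasonable stand-in since replace.py is a Python script. This is only a sketch: the sample line of wikitext is made up, and shell quoting (the reason for the single quotes around -ref) is not modelled here.

```python
import re

# Pattern and replacement copied from the replace.py command line.
# The troublesome "!]]" is captured as group 1 so it never has to be
# typed again in the replacement string.
pattern = r"\[\[DWAN\]\]\: \[\[Grand Theft Planet(\!\]\])"
replacement = r"DWS: ''[[Grand Theft Planet\1"

# Hypothetical line of wikitext containing the old prefix.
before = "* [[DWAN]]: [[Grand Theft Planet!]]"
after = re.sub(pattern, replacement, before)
print(after)
```

Whatever group 1 matched is copied through untouched, so the exclamation point only needs escaping once, in the pattern.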

Cleanup after pagefromfile.py
python replace.py "\<(.*)(\[\[.*\]\])(.*)\>" "\2" -regex -subcatsr:"Articles containing potentially dated statements from 2015"
This would keep any categories you have in the sea of code that's unfortunately generated by pagefromfile.py.
 * Update: Actually, it turns out that the code is only generated when using an .xml file. If you instead just use an .rtf or, better, a regular .txt file (encoded as UTF-8), things work out nicely.

Regex snippets
 * This gets rid of empty sections, in this case External link/links: -regex "\=\= External .*\n.*"
 * This replaces a multi-line series variable with a singular one, while at the same time preserving spaces between "series" and the = sign: python replace.py -regex 'series( *)=.*' "series\1=DWM comic stories|" -summary:"Only one linked item per series variable. Otherwise, it's VERY unclear what the previous/next line refers to" -cat:'Fourth Doctor DWM comic stories'
 * To automatically add brackets around things, generally use .  However, for dab pages, where you have a list of things all starting with the same letters, use this: python replace.py -regex -page:"Pagename" "StartingString(.*)\r" "* StartingString\1"
 * To add on the "right side" of coding necessary to use pagefromfile.py: python replace.py -regex "\r" " comic story images'''\n\nyyyy\nxxxx" -page:"User:CzechOut/Sandbox10"
 * To add to the "left side": python replace.py -regex "\n(.*)" "Category:\1'''" -page:"User:CzechOut/Sandbox10"


 * If starting with a list of names of stories, the results will go from:
 * Story name
 * to
 * Category:Story name comic story images
 * yyyy
 * xxxx


 * To create a duplicate of each entry on a list: python replace.py -regex "(.*)\n" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10
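The empty-section snippet at the top of this list is worth testing before a live run. Here is a small sketch with Python's re module and two made-up page texts. It assumes the bot applies the pattern the same way re.sub does, with "." not matching a newline.

```python
import re

# Pattern from the empty-section snippet: the heading line plus the line after it.
pattern = r"\=\= External .*\n.*"

# Hypothetical page with an empty "External links" section.
empty_section = "Intro text.\n== External links ==\n\n[[Category:Example]]\n"
cleaned_empty = re.sub(pattern, "", empty_section)
print(repr(cleaned_empty))

# Hypothetical page where the section actually has a link in it.
filled_section = "Intro text.\n== External links ==\n* [http://example.com Example]\n"
cleaned_filled = re.sub(pattern, "", filled_section)
print(repr(cleaned_filled))
```

In the second case the link line is removed along with the heading, because the trailing ".*" swallows whatever is on the next line. So the snippet is only safe on pages where the section really is empty.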

Log file --> something usable by movepages.py

 * 1) Paste logfile onto a page, like user:CzechOut/Sandbox10
 * 2) Get rid of the "Getting" statements with python replace.py -regex "Getting.*\n" "" -page:User:CzechOut/Sandbox10
 * 3) Get rid of everything that's already disambigged with python replace.py -regex "\n(.*)\)\r" "" -page:User:CzechOut/Sandbox10
 * 4) Create duplicates of each name, then add (disambiguation term) with python replace.py -regex "(.*)\r" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10
 * 5) Put brackets on the right side with python replace.py -regex "(.*)\r" "\1]]" -page:User:CzechOut/Sandbox10
 * 6) Put brackets on the left side with python replace.py -regex "(.*)\]\]" "\1" -page:User:CzechOut/Sandbox10


 * Depending on the number of items on your list, the last two steps can take a long time. It'll look like the bot is frozen, but it's not.
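For long lists, the same clean-up can be done offline in one pass. This is a rough sketch, not part of pywikipedia: the function name log_to_pairs and the sample log are hypothetical. It also assumes that movepages.py reads its pairs as alternating [[from]] and [[to]] titles, so check that against your version before using the output.

```python
def log_to_pairs(log_text, dab="comic story"):
    """Turn a pasted bot log into alternating [[old]] / [[new]] title lines."""
    lines = []
    for line in log_text.splitlines():
        title = line.strip()
        # Step 2: drop blank lines and the "Getting ..." status lines.
        if not title or title.startswith("Getting"):
            continue
        # Step 3: skip anything that already ends in a disambiguation term.
        if title.endswith(")"):
            continue
        # Steps 4-6: duplicate the name, add the dab term, bracket both.
        lines.append("[[%s]]" % title)
        lines.append("[[%s (%s)]]" % (title, dab))
    return "\n".join(lines)


# Hypothetical log excerpt.
sample_log = (
    "Getting 60 pages from the wiki...\n"
    "The Iron Legion\n"
    "City of the Damned (comic story)\n"
    "The Star Beast\n"
)
pairs = log_to_pairs(sample_log)
print(pairs)
```

Because nothing is saved to a wiki page between steps, this avoids the long waits in the last two replace.py runs.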

HTML bullet stripper
To strip HTML tags do this:
 * 1) python replace.py -regex "|<\/ul>||<\/li>" "" -cat:"Doctor Who (2005) television stories" -summary:"getting rid of bulleting in infobox"  This will then leave you with a series of links directly abutting each other.
 * 2) python replace.py -regex -summary:"putting commas between links" "\]\[" "], [" -subcat:"Doctor Who (2005) television stories" This will put a comma and a space between two abutting links.
 * 3) python replace.py -regex -summary:"putting commas between links" "\)\[" "), [" -subcat:"Doctor Who (2005) television stories" This will take care of those few instances of a parenthesis abutting a link.
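The three passes can be checked on a sample string before running the bot. In this sketch the two comma patterns are copied from steps 2 and 3. The tag-stripping pattern is my own assumption for illustration (opening and closing ul and li tags), not the regex from step 1. The function name and the sample infobox values are hypothetical.

```python
import re


def strip_bullets(text):
    """Remove list tags, then separate abutting links with ', '."""
    # Assumed tag pattern, not the original step 1 regex.
    text = re.sub(r"</?(?:ul|li)>", "", text)
    # Step 2: "]][[" becomes "]], [[".
    text = re.sub(r"\]\[", "], [", text)
    # Step 3: ")[[" becomes "), [[".
    text = re.sub(r"\)\[", "), [", text)
    return text


# Hypothetical infobox values.
bulleted = "<ul><li>[[An Unearthly Child]]</li><li>[[The Daleks]]</li></ul>"
with_paren = "Rose (2005)[[The End of the World]]"

print(strip_bullets(bulleted))
print(strip_bullets(with_paren))
```

The order matters: the commas can only go in once the tags between the links are gone.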

Stripping a variable of its link
Many times it's better to have an unlinked variable than a linked one. To strip an existing variable of its linkage, do the following:
python replace.py -regex -summary:"stripping prev/next story, adding dab for better link" 'previous story( *)=(.*)\[\[(.*)\]\]' "previous story\1=\2\3 (TV story)" -cat:"Doctor Who (1963) television stories"

That works fine, as long as people have actually built the infobox in the "correct" way, i.e. one variable per line. But if they squash it all down so that the infobox and entire text of the article is on one line, the regex is far too greedy: each .* grabs as much of the line as it can, so the replacement lands on the last link in the line rather than the one right after the variable. The non-greedy version is much better:
python replace.py -regex -summary:"stripping prev/next story, adding dab for better link" 'next story( *?)=(.*?)\[\[(.*?)\]\]' "next story\1=\2\3 (TV story)" -subcat:"television stories"
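The difference between the greedy and non-greedy forms is easy to see on a squashed, one-line infobox. This is a sketch using Python's re module. The infobox text is made up, and both patterns target "previous story" so they can be compared on the same input.

```python
import re

# Hypothetical infobox squashed onto a single line.
text = "{{Infobox|previous story = [[The Daleks]]|next story = [[Marco Polo]]}}"

greedy = r"previous story( *)=(.*)\[\[(.*)\]\]"
lazy = r"previous story( *?)=(.*?)\[\[(.*?)\]\]"
replacement = r"previous story\1=\2\3 (TV story)"

# Greedy: (.*) runs to the LAST "[[" on the line, so the wrong link is stripped.
greedy_result = re.sub(greedy, replacement, text)
print(greedy_result)

# Lazy: (.*?) stops at the FIRST "[[", which is the link we meant.
lazy_result = re.sub(lazy, replacement, text)
print(lazy_result)
```

With one variable per line, both patterns give the same result, which is why the greedy version looks fine until it meets a squashed infobox.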

The quick and nasty way to build huge lists of stories
Let's say you have a list of stories with improper disambiguation terms. Or maybe a list without disambiguation terms at all. Instead of typing everything out by hand, like ya did with the British spell checker, use regex to instantly deliver a list that you can immediately plug into a user-fix:
python replace.py -page:user:CzechOut/Sandbox13 -regex "(.*?)\(comic story\)" "u'\1(short story)', u'\1(comic story)',\n"

What this does is take a raw dump of un-linked text, in this case things ending in (comic story). It then strips (comic story) and adds the basic structure for user-fix.py replacements. The resulting fix will correct every instance where a story has been misidentified as a (short story) and convert it to a proper (comic story). Obviously here, we're using u instead of r because there's no regex to this replacement. It's totally literal, allowing us to use u.
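Here is the same substitution run through Python's re module, as a quick way to preview the generated lines. The story titles are just sample input, and the variable names are hypothetical.

```python
import re

# Pattern and replacement copied from the replace.py command line.
pattern = r"(.*?)\(comic story\)"
replacement = r"u'\1(short story)', u'\1(comic story)',\n"

# One title from the raw dump.
single = re.sub(pattern, replacement, "The Star Beast (comic story)")
print(single)

# A short dump with one title per line.
dump = "The Star Beast (comic story)\nThe Iron Legion (comic story)"
generated = re.sub(pattern, replacement, dump)
print(generated)
```

One thing to watch: a title containing an apostrophe will break the u'...' quoting in the generated line, so those entries need fixing by hand.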

Creating mass categories
python replace.py -regex -page:User:CzechOut/Sandbox14 "\n(.*?)\r" "Category:\1\n\n\n\nyyyy\nxxxx\n"