Webmaster recipes
Dealing with hotlinking
Hot-linking is the use of links to content at one site, usually images, by pages on another site. At first, page designers may have linked to images on other sites because they couldn't create useful images themselves. These days, robots build pages of random content, pulling both text and images from around the world, in the hope of fooling search engines into thinking the site contains useful information on specific topics.
Searching logs for hotlinkers
awk '{if ($7 !~ /\.(jpe?g|gif|png)$/) next;
if ($11 ~ /^"https?:\/\/[^\/]*energyteachers.org/) next;
if ($11 ~ /"-"/) next;
print $11;}' access_log* |sort|uniq -c|sort
- Explanation
- $7 refers to the seventh segment of the log, split by spaces, which contains the request.
- !~ means "does not match the following expression"
- The expression between the slashes finds text that includes a dot followed by one of the image extensions. The dollar sign signifies the end of the request, so the extension must come last.
- So, the first line of if...next tells awk to skip lines that aren't requests for image files.
- The second line skips references from energyteachers.org
- The third line skips empty references, so that people who bookmark images or type their addresses into the browser aren't counted as hot-linkers.
- The fourth line prints the reference; after the closing brace of the awk command come the wildcard-appended log-file names to be processed, followed by a sort, then a count of how many times each reference was used (uniq -c), then a final sort by that count.
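If you want the worst offenders at the top, a small tweak of my own is to sort the final count numerically in reverse and keep only the first twenty lines:
awk '{if ($7 !~ /\.(jpe?g|gif|png)$/) next;
if ($11 ~ /^"https?:\/\/[^\/]*energyteachers.org/) next;
if ($11 ~ /"-"/) next;
print $11;}' access_log* |sort|uniq -c|sort -rn|head -20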
Preventing hot-linking of images
There are hundreds of pages telling you how to turn away hot-linkers using Apache's Rewrite module, but Apache suggests using its simpler built-in directives when possible, and this is one of those cases. So I found a different way, thanks to the wiki at apache.org.
SetEnvIfNoCase Referer "^https?://([^/]*)?shawnreeves\.net/" local_ref=1
SetEnvIf Referer ^$ local_ref=1
<FilesMatch "\.(jpe?g|gif|png)$">
Order Allow,Deny
Allow from env=local_ref
</FilesMatch>
- The first line uses a regular expression to match referrers that begin with http, with an optional s, followed by ://, then any number of non-slash characters, then the allowed domain (note the backslash to escape the period, which is normally a wildcard). If the match succeeds, an environment variable local_ref is set to 1.
- The second line sets the local_ref variable to 1 if there is no referrer at all, such as when someone browses to an image from a bookmark or uses curl or some special tool for the blind.
- The third through sixth lines apply only if the files requested have image-type extensions.
- The fifth line allows such requests from anyone with the proper reference, leaving the rest to be denied by the order of the fourth line.
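To check that the directives are working, you can ask for an image from the command line, once with a foreign referrer and once with none (the paths here are just placeholders; substitute a real image of your own). The first request should be denied (403) and the second allowed (200):
curl -I -e "http://some-other-site.example/page.html" https://shawnreeves.net/images/example.jpg
curl -I https://shawnreeves.net/images/example.jpg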
Multiple whois lookups
I recently launched a new project at EnergyTeachers.org, the Green Dollhouse Challenge, and I wanted to see who responded to a group email I posted about it. I downloaded the logs and retrieved the list of IP addresses that accessed the page with this command, which searches for the word dollhouse in the server's log, takes just the first section of each line (the IP address), sorts the result (required for uniq), and then lists only the unique addresses with a count of how many visits came from each:
grep dollhouse access_log.2011-03-29|cut -d " " -f1| sort|uniq -c
I was copying each resulting IP and pasting it after typing whois in another shell to find clues to whether the visitor was a search spider or a real person. I learned (from http://www.tek-tips.com/viewthread.cfm?qid=1566237&page=7 ) that I could use awk as an inline text editor to prepend "whois " to each result from the above command, without the count, and then pass that to a shell within this shell to process each line as a command:
grep dollhouse access_log.2011-03-29 | cut -d " " -f1 | sort | uniq | awk '{print "whois " $1}' | sh
awk takes each line, prepends "whois ", and then sends it to the shell "sh" to process.
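If you prefer to skip the extra shell, xargs can hand each address to whois one at a time; it is the same idea with different plumbing:
grep dollhouse access_log.2011-03-29 | cut -d " " -f1 | sort | uniq | xargs -n1 whois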
Search queries
Open a terminal and go to a directory full of Apache log files. Enter the following command (all on one line):
egrep -r -h -o ".*q=[^&]*" ./* |awk '{print $1"\t"substr($11,match($11,"q=")+2)}' |php -R '$p=stripos($argn,"&"); echo urldecode($p===false?$argn:substr($argn,0,$p))."\n";' > ../SearchQueriesIP.txt
Egrep will go through all the files in the folder (-r and ./*); find strings that have q= up to the next ampersand, which is usually how a search engine reports the query string that someone entered before they clicked on a result to get to our site; only output the matching part (-o); and skip listing the filename from the match (-h).
Next, awk picks the IP address of the visitor ($1), a tab (\t), and then the query string ($11), leaving out the first two characters (q=). PHP then takes each line ($argn) and decodes the text, changing plus signs to spaces and so on. It also removes any unexplained extra bits following ampersands; this will become unnecessary when I figure out how some ampersands are slipping through.
Finally, the results are saved to a file using the redirect symbol (>), in the next directory up (../) so egrep doesn't search its own output.
Issues with this analysis
- q= might be in the request
- If the request string includes the string q=, then this would return that request instead of the referrer's query. A solution may be to use awk instead of grep, only checking the 11th field.
- Analysis of requests
- This doesn't output or process the request field. Easy enough to fix: we could just add field $7 to the print command in awk, or some significant substring of $9.
It's a little sad to see so many people type energyteachers.org into Google instead of directly into the address bar of the browser. I guess Google has no problem being seen as the gateway to the internet, even with the wasted bandwidth.
Better performance with awk
Here's a more awkward process, but it only has one pipe.
awk '{if ($11 !~ /q=/) next;
split($11,queries,"=");
for (var in queries) if (match(queries[var],/q$/)) searched=queries[var+1];
print $1"\t"$7"\t"substr(searched,1,match(searched,"&")-1)}' access_log* |
php -R 'echo urldecode($argn)."\n";'
The first line skips input lines that don't have "q=".
The second line splits the referrer field at equal signs into an array, "queries", essentially separating the parameters. The third line looks for the item in queries that ends with q, and sets our target to the next item in the array, since it must be what follows "q=".
The fourth line prints the IP address of the requester, the page requested, and search query. The fifth line takes each line ($argn) and decodes the text, changing plus signs to spaces and so on.
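One caveat of my own: if q= happens to be the last parameter in the referrer, there is no ampersand after the query, match() returns 0, and the substr() above prints an empty query. This variation (a sketch, not tested against every log format) falls back to the rest of the field in that case:
awk '{if ($11 !~ /q=/) next;
split($11,queries,"=");
for (var in queries) if (match(queries[var],/q$/)) searched=queries[var+1];
amp=match(searched,"&"); if (amp==0) amp=length(searched);
print $1"\t"$7"\t"substr(searched,1,amp-1)}' access_log* |
php -R 'echo urldecode($argn)."\n";'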
Popularity of a page over time
awk '$1 !~"MY_WORK_IP" && $7 ~ /PAGE_NAME/ \
{print substr($4,5,3)" "substr($4,2,2)}' access* |uniq -c
This script skips all requests from my own IP, so I don't count my own hits; includes all requests for a certain PAGE_NAME, or for a subset of pages with certain text in their names; prints the month and day; and then counts how many hits there were on each date. Note that the backslash at the end of the first line is not part of the awk script but just a way to split this command into two lines; if you use it on one line, remove the backslash.
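For example, with made-up values, 203.0.113.7 as my work IP and dollhouse as the page text, the command becomes:
awk '$1 !~"203.0.113.7" && $7 ~ /dollhouse/ \
{print substr($4,5,3)" "substr($4,2,2)}' access* |uniq -c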
Track logs according to hit
Sometimes I want to see what people do after or before they hit a certain page. With the following line of shell commands, broken into readable chunks by backslashes, I can search Apache logs to see the activity of anyone who hits a given page.
grep -rh "GET /ENTRY_PAGE_EXAMPLE" . | \
cut -d " " -f 1| \
sort|uniq| \
awk '{print "echo "$1";grep -rh " $1" \."}' | \
sh>../ENTRY_PAGE_EXAMPLE-viewers.txt
- Grep searches all files in the current directory for a request for whatever page.
- Cut selects just the first item, which should be the IP address of the browser.
- Sort and uniq pass just one copy of each IP address that comes through.
- Awk creates a command to print a list of accesses from each IP address.
- Echo will print the IP address on a line, as a header to help readability.
- Grep will find all occurrences of each IP address.
- Sh runs the echo and grep commands that awk prints.
- > redirects the output to a file, one directory up so grep doesn't search its own output.
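One refinement of my own: the dots in each IP address are treated as wildcards by the second grep, so adding -F to the command that awk builds makes it match the address as a fixed string instead:
grep -rh "GET /ENTRY_PAGE_EXAMPLE" . | \
cut -d " " -f 1| \
sort|uniq| \
awk '{print "echo "$1";grep -Frh " $1 " ."}' | \
sh>../ENTRY_PAGE_EXAMPLE-viewers.txt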
Making IP abbreviations normal CIDR
In Apache directives you might block a range of IP addresses with an abbreviated address like 198.183 , meaning all the addresses between 198.183.0.0 and 198.183.255.255 . My block lists used to contain such abbreviations, but also CIDR addresses, such as 198.182.0.0/16, which blocks the range 198.182.0.0 to 198.182.255.255 . CIDR is useful because we can block not just 256 or 65536 addresses, but any power of two. For example, we can block 198.184.0.0 to 198.184.15.255 with this CIDR-formatted address: 198.184.0.0/20. Internet addresses are allotted in large swaths to regional authorities, who then allot smaller parts to internet providers, who may break them down into even smaller lots. Most of these lots can be addressed with a single CIDR address.
To convert an abbreviated two or three byte address to the full CIDR notation, use these handy regular expressions and replacements. Please note that there are spaces around the regular expressions as well as the replacements, to prevent converting partial addresses:
- Convert two-byte addresses
- The parentheses mark the group to be captured, in between spaces: two sets of any number (+) of numerical digits ([0-9]) separated by a period (the backslash lets the regex engine know we want a literal period, not the special meaning of period in regex). Whatever was in parentheses will be returned by $1, followed by .0.0/16, which indicates the full range of possibilities in those two bytes. We put 16 because the first 16 bits of the 32-bit address are constant.
([0-9]+\.[0-9]+) $1.0.0/16
- Convert three-byte addresses
- Here, the first 24 bits are constant.
([0-9]+\.[0-9]+\.[0-9]+) $1.0/24
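If the block list lives in a plain text file, sed can apply both conversions in one pass. This is just a sketch, assuming the addresses really are surrounded by spaces and that the file is called blocklist.txt (adjust both to suit):
sed -E -e 's/ ([0-9]+\.[0-9]+\.[0-9]+) / \1.0\/24 /g' -e 's/ ([0-9]+\.[0-9]+) / \1.0.0\/16 /g' blocklist.txt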