Webmaster recipes

From ShawnReevesWiki

Revision as of 15:13, 31 March 2011

Multiple whois lookups

I recently launched a new project at EnergyTeachers.org, the Green Dollhouse Challenge, and I wanted to see who responded to a group email I posted about it. I downloaded the logs and retrieved the list of IP addresses that accessed the page with the command below. It searches the server's log for the word dollhouse, keeps just the first field of each matching line (the IP address), sorts the addresses (which uniq requires), and then lists each unique address with a count of how many visits came from it:

grep dollhouse access_log.2011-03-29 | cut -d " " -f1 | sort | uniq -c
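To see what each stage contributes, the same pipeline can be tried on a few made-up log lines. This is only a sketch: it assumes Apache's common or combined log format, where the client address is the first space-separated field, and the function names count_visitors and sample_log and the sample addresses are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count requests per client address for lines matching a word.
# Assumes the address is the first space-separated field of each log line.
count_visitors() {
    # $1 = word to search for; log lines are read from standard input
    grep -- "$1" | cut -d " " -f1 | sort | uniq -c
}

# Made-up sample lines standing in for access_log.2011-03-29
sample_log() {
    cat <<'EOF'
192.0.2.10 - - [29/Mar/2011:10:00:00 -0400] "GET /dollhouse/ HTTP/1.1" 200 512
198.51.100.7 - - [29/Mar/2011:10:01:00 -0400] "GET /dollhouse/ HTTP/1.1" 200 512
192.0.2.10 - - [29/Mar/2011:10:02:00 -0400] "GET /dollhouse/rules HTTP/1.1" 200 734
203.0.113.5 - - [29/Mar/2011:10:03:00 -0400] "GET /index.html HTTP/1.1" 200 120
EOF
}

# Prints one line per address that requested a dollhouse page, with its count
sample_log | count_visitors dollhouse
```

The last sample line does not mention dollhouse, so its address is left out of the counts.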

I was taking each resulting IP and copying and pasting it after typing whois in another shell, looking for clues as to whether the visitor was a search spider or a real person. I learned (from http://www.tek-tips.com/viewthread.cfm?qid=1566237&page=7 ) that I could use awk, a stream text processor, to put "whois " in front of each result from the above command, without the count, and then pass that to a shell within this shell to process each line as a command:

grep dollhouse access_log.2011-03-29 | cut -d " " -f1 | sort | uniq | awk '{print "whois " $1}' | sh

awk takes each line and prepends "whois "; the pipe then sends the result to the shell "sh", which runs each line as a command.
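Because sh runs whatever text it is handed, an alternative is to pass each address to whois as an argument instead of building command lines. The following is a minimal sketch, not the method from the linked thread: the function name lookup_each is hypothetical, and the check for what counts as an address is a deliberately loose assumption (only digits, hex letters, dots, and colons).

```shell
#!/bin/sh
# Hypothetical helper: run a lookup command once per unique address on standard input.
# $1 is the command to run (whois by default); each address is passed as a single
# argument, so it is never interpreted as shell code.
lookup_each() {
    cmd=${1:-whois}
    sort -u | while IFS= read -r addr; do
        # Skip anything that does not look like an IPv4 or IPv6 address
        case $addr in
            ''|*[!0-9A-Fa-f.:]*) continue ;;
        esac
        # Keep the lookup command from consuming the list of addresses
        "$cmd" "$addr" </dev/null
    done
}

# Intended use (needs the whois program and network access):
#   grep dollhouse access_log.2011-03-29 | cut -d " " -f1 | lookup_each
```

Passing echo as the argument (lookup_each echo) gives a dry run that only lists the addresses that would be looked up.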

Search queries

Open a terminal and go to a directory full of Apache log files. Enter the following command (all on one line):

egrep -r -h -o ".*q=[^&]*" ./* | awk '{print $1"\t"substr($11,match($11,"q=")+2)}' | php -R '$p=strpos($argn,"&"); echo urldecode($p===false ? $argn : substr($argn,0,$p))."\n";' > ../SearchQueriesIP.txt

Egrep will go through all the files in the folder (-r and ./*); find lines that contain q=, matching from the start of the line through the query and stopping at the next ampersand, which is usually how a search engine reports the query string that someone entered before they clicked on a result to get to our site; only output the matching part (-o); and skip listing the filename from the match (-h).

Next, awk prints the IP address of the visitor ($1), a tab (\t), and then the query string from the referrer field ($11), leaving out everything up to and including q=. PHP then takes each line ($argn), removes any unexplained extra bits following ampersands, and decodes the text, changing plus signs to spaces and so on. The ampersand step will become unnecessary when I figure out how some ampersands are slipping through.

Finally, the results are saved to a file using the redirect symbol (>), in the next directory up (../) so egrep doesn't search its own output.
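One possible source of the stray ampersands is the greedy .* in the egrep pattern, which can run to the last q= on a line, including one inside a later parameter such as aq= or oq=. The sketch below sidesteps that by splitting the referrer's query string on ampersands and keeping only the parameter named exactly q. It assumes the combined log format, where the quoted referrer is field 11; the function name extract_queries is hypothetical, and the output is still URL-encoded, so a decoding step such as the PHP one is still needed afterwards.

```shell
#!/bin/sh
# Hypothetical helper: print "address<TAB>raw query" for each log line whose
# referrer (field 11 in the combined log format) carries a q= parameter.
extract_queries() {
    awk '{
        ref = $11
        gsub(/"/, "", ref)                # drop the quotes around the referrer
        qpos = index(ref, "?")
        if (qpos == 0) next               # no query string at all
        n = split(substr(ref, qpos + 1), params, "&")
        for (i = 1; i <= n; i++) {
            if (substr(params[i], 1, 2) == "q=") {   # exactly q=, not aq= or oq=
                print $1 "\t" substr(params[i], 3)
                break
            }
        }
    }'
}

# Intended use, from a directory of Apache logs:
#   cat ./* | extract_queries > ../SearchQueriesIP.txt
```

Writing the output to the parent directory keeps it out of the files being read, for the same reason as in the original command.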

It's a little sad to see so many people type energyteachers.org into Google instead of directly into the address bar of the browser. I guess Google has no problem being seen as the gateway to the internet, even with the futile bandwidth usage.