Kuro5hin.org: technology and culture, from the trenches
Perl, sed, grep, gawk, uniq, host and dig: another look

By mybostinks in Internet
Wed Dec 09, 2009 at 07:12:49 AM EST
Tags: Perl, sed, grep, gawk, uniq, host, dig (all tags)

None of the above programs are sexy or buzzword-compliant. With the rise of modern scripting languages they are largely ignored, or at best underutilized. A number of years ago I was mentored by a Unix guru who knew his way around the Unix command line so well that, years later, I still marvel at how productive he could be in a shell.

I work at a large ISP with a mix of every operating system imaginable, so not everyone will find this directly useful. However, the task below gave me an opportunity to revisit the standard Unix/Linux command-line tools and put them to good use, and in doing so I was reminded how important these tools are and how robust Unix/Linux is. In that spirit you may find this useful as well: when nothing else will do, the standard Unix tools are hard to beat.


Most of the Internet-facing servers I work with have to be installed stripped down to the exact function they were designed to perform, such as email gateways, DNS, DHCP and packet filtering. That means no compilers and no scripting languages beyond the standard ones that ship with Unix/Linux: typically a bare version of Perl (or, lately, Python) and (g)awk, plus the shells such as bash or csh. Downloading and installing Perl or Python libraries beyond the base install is prohibited and against our security policies.

A large part of my job is to get rid of spam/UCE/UBE at the perimeter, before it enters our networks. In addition to traditional spam-fighting tools such as SpamAssassin, Bayesian filtering and low-impact, accurate DNS blacklists (e.g. zen.spamhaus.org), I use greylisting and, in particular, 'nolisting'. I also use chrooted BIND to block countries. Nolisting is another effective technique for thwarting spammers: I make the first MX record in my DNS (lowest number, highest priority) a server that has SMTP port 25 completely open through the packet filter but no MTA listening on the port 25 socket. The packet filter then logs the hits on port 25. My nolisting server runs FreeBSD, so /var/log/messages records the IP address of every server that tries to reach port 25. From there it is simply a matter of processing the file. The real email gateway sits at the next MX preference and accepts or blocks email according to predetermined email policies.
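
As a sketch of that MX arrangement, a hypothetical zone fragment (example.com and the host names are stand-ins, not my real zone) looks like this:

; the lowest preference number is tried first; nothing listens on nolist's port 25
example.com.    IN  MX  10 nolist.example.com.   ; decoy: port 25 open in the packet filter, no MTA
example.com.    IN  MX  20 mail.example.com.     ; the real gateway, enforcing the email policies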

What I wanted to do was take the IP addresses from the nolisting server, determine what country each one originated from, and decide whether to block that country entirely from attempting to send its spam to my gateway mail server. Where I work, for 95% of the countries in the world we couldn't care less about receiving their email, all of which is spam. For example, we have no interest in receiving email from institutions in China, S. Korea, Brazil, most of Africa or the Near East; the one or two legitimate emails a year we'd receive from those countries would have to be sent through somewhere like Gmail, Yahoo or some other email service provider.

In order to accurately determine the originating country, and to analyze what spammers are doing using my other tools, I decided to use the ASNs (Autonomous System Numbers) from the regional registries around the world as a starting point. Better yet, I can get these from the Team Cymru website. And if all I needed were country codes to populate my recursive DNS, there is a handy tool called rir2dns, which I also use.

Each week the messages logfile is zipped up and stored away, and I rotate a month's worth. It can be quite large, so I copy the file to a second disk on the server for processing. Each line I need to process looks like the following and is very well suited to munging with (g)awk...

Dec 4 00:00:00 nolist kernel: Connection attempt to TCP 10.246.8.196:25 from 188.186.160.229:1821 flags:0x02

I unzip it and run the following...

Step 1.

perl -ne 'print if /:25/' messages.0 |
gawk '{print $12}' |
gawk -F: '{print $1}' |
sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 |
uniq > spamip.txt

I used Perl as an example; you could use grep :25 messages.0 instead to retrieve all the records that show hits on port 25. Because gawk reads whitespace-delimited fields by default, it churns through hundreds of megabytes like butter. So I pipe to gawk and print only the IP addresses, which are located in the 12th field; pipe to gawk again with ':' as the delimiter (-F:) to parse off the port number, which I don't need; sort the result numerically on each octet of the IP address, using '.' as the field separator; drop the dupes by piping the stream through uniq; and redirect it all to the file spamip.txt. The final result looks something like the following...

4.38.112.66
4.38.112.67
4.38.112.68
4.38.112.69
4.38.112.70
4.38.112.71
4.38.112.72
4.38.112.73
4.38.112.74

This gives all the unique IP addresses that have tried to access port 25 on the faux MX host of the nolisting server. It's interesting to note that there are networks of mail servers constantly running, day and night, churning out email.
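
As an aside, the filtering, field extraction and deduping in Step 1 can be collapsed into a single gawk pass. This is only a sketch of an equivalent, with the ':25' match and field 12 taken from the log line shown earlier:

gawk '/:25/ { split($12, a, ":"); if (!seen[a[1]]++) print a[1] }' messages.0 |
sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > spamip.txt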

Next I run the following to get a list of hostnames/domains.

Step 2.

while read HOST; do host $HOST; done < spamip.txt > reverseLU.txt

The above bash loop uses the Unix host command to do a reverse lookup on each of the unique IP addresses I have recorded. The output looks something like the following...

52.108.101.38.in-addr.arpa domain name pointer mta714.e.homedecorators.com.
44.184.101.38.in-addr.arpa domain name pointer lists.universalservice.org.
78.188.101.38.in-addr.arpa is an alias for 38.101.188.178.in-addr.trulia.com.
38.101.188.178.in-addr.trulia.com domain name pointer sm01.trulia.com.
79.188.101.38.in-addr.arpa is an alias for 38.101.188.179.in-addr.trulia.com.
38.101.188.179.in-addr.trulia.com domain name pointer sm02.trulia.com.
81.188.101.38.in-addr.arpa is an alias for 38.101.188.181.in-addr.trulia.com.
38.101.188.181.in-addr.trulia.com domain name pointer sm04.trulia.com.
82.188.101.38.in-addr.arpa is an alias for 38.101.188.182.in-addr.trulia.com.
38.101.188.182.in-addr.trulia.com domain name pointer sm05.trulia.com.
Host 180.96.25.59.in-addr.arpa. not found: 3(NXDOMAIN)
Host 199.62.26.59.in-addr.arpa. not found: 3(NXDOMAIN)
Host 157.137.27.59.in-addr.arpa. not found: 3(NXDOMAIN)

There is interesting information here, but for now it's not what I am looking for. You will see the valid PTR records with their associated fully qualified host names, along with any NXDOMAIN or SERVFAIL responses. I usually analyze the NXDOMAIN and SERVFAIL responses later; they could be any number of things, including bogons. NXDOMAIN normally means the domain does not exist; because PTR records live in the IN-ADDR.ARPA domain, here it really means there is no pointer record. SERVFAIL means the authoritative name server is not answering queries for that domain. The above is an example of a command-line loop with an input file and an output file using the Unix host command.
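
Before digging further I sometimes want a quick tally of how the lookups went; a simple sketch against the reverseLU.txt file produced above:

grep -c 'domain name pointer' reverseLU.txt    # valid PTR records
grep -c 'NXDOMAIN' reverseLU.txt               # no pointer record exists
grep -c 'SERVFAIL' reverseLU.txt               # authoritative server not answering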

Next we take the spamip.txt file, replace the last octet of each address with 0, and collapse the dupes (the list is already sorted from Step 1, so uniq alone is enough). Finally, I use netcat to query the Team Cymru whois server with the list, as explained on their site...

Step 3.

echo "begin" >mywhois.txt
| gawk -F. '{print $1 "." $2 "." $3 "." "0"}' spamip.txt
| uniq > mywhois.txt
| echo "end" >>mywhois.txt
| netcat whois.cymru.com 43 <mywhois.txt> finalwhois.txt

The output from running the above set of commands looks like the following...

AS    | IP           | AS Name
3356  | 4.38.112.66  | LEVEL3 Level 3 Communications


It returns three fields: the AS number, the IP you sent, and the AS name. The one problem I have found is that in some Linux distributions netcat is not part of the base install.
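
On systems where the netcat binary is missing, the stock nc client, where present, takes the same arguments for this job; a sketch:

nc whois.cymru.com 43 < mywhois.txt > finalwhois.txt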

Team Cymru does not like its whois servers being hammered; if you have a bulk list they recommend feeding it through netcat, as above, to ease the load, and they will null-route your IP address if they determine you are being abusive. For truly bulk work they recommend DNS queries instead, and as it turns out the DNS interface returns more of what I need than the above. The netcat list I hand to the networking group, in case they decide to do something similar with BGP routing. The following, then, is how you do the DNS queries, which return more useful information...

gawk -F. '{print $4 "." $3 "." $2 "." $1}' spamip.txt |
while read HOST; do dig +short $HOST.origin.asn.cymru.com TXT; done |
uniq | sed -e 's/"//g' > MyCC.txt

The above uses dig to do a TXT resource-record lookup on each IP address fed in from the spamip.txt file, with the octets reversed as the Cymru DNS zone expects. It queries the cymru.com DNS servers and gives me the information that I want; sed strips the double quotes off the front and back of each returned line, and the stream is redirected to the output file.
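
For a single address the lookup looks like this (the first IP from the Step 1 output, octets reversed); the answer comes back as one quoted TXT string in the same 'ASN | prefix | CC | registry | date' format shown below:

dig +short 66.112.38.4.origin.asn.cymru.com TXT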

The output is exactly what I want, and it should look something like the following...

17858 | 125.188.0.0/14   | KR | apnic | 2005-09-12
9781  | 125.208.192.0/19 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/21 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/19 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/21 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/19 | KR | apnic | 2007-01-03
9781  | 125.208.200.0/21 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/19 | KR | apnic | 2007-01-03
9781  | 125.208.200.0/21 | KR | apnic | 2007-01-03
9781  | 125.208.192.0/19 | KR | apnic | 2007-01-03
9781  | 125.208.216.0/21 | KR | apnic | 2007-01-03
9781  | 125.208.232.0/21 | KR | apnic | 2007-12-03
17597 | 125.209.0.0/18   | KR | apnic | 2005-10-18
17597 | 125.209.16.0/20  | KR | apnic | 2005-10-18
17597 | 125.209.0.0/18   | KR | apnic | 2005-10-18
17597 | 125.209.16.0/20  | KR | apnic | 2005-10-18
17597 | 125.209.0.0/18   | KR | apnic | 2005-10-18
17597 | 125.209.32.0/20  | KR | apnic | 2005-10-18
9260  | 125.209.64.0/18  | PK | apnic | 2005-12-02
9260  | 125.209.89.0/24  | PK | apnic | 2005-12-02
9260  | 125.209.64.0/18  | PK | apnic | 2005-12-02
9260  | 125.209.108.0/22 | PK | apnic | 2005-12-02
9260  | 125.209.64.0/18  | PK | apnic | 2005-12-02
9260  | 125.209.120.0/24 | PK | apnic | 2005-12-02
4837  | 125.211.0.0/16   | CN | apnic | 2007-01-10

There are five fields returned from the DNS lookups: the ASN, the CIDR block, the country code, the regional registry and the date originally registered. I am particularly interested in fields 2 and 3, the CIDR block and the country code. I use those in my BIND name server to block the country or CIDR block, depending on the frequency of hits and what country they originated from.
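
To pull out the unique CIDR blocks for one offending country, something along these lines works against the pipe-delimited MyCC.txt (the 'KR' country code and the block-KR.txt name are just examples):

gawk -F'|' '$3 ~ /KR/ { gsub(/ /, "", $2); print $2 }' MyCC.txt | sort -u > block-KR.txt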

So now I have a nicely formatted file that I can use or process further if I want. I could have pulled the log file off the server and used whatever programming language I wanted on another computer that has all the tools I could ever ask for. Sometimes it's not worth the bother: command lines like these are often thrown away once they've accomplished their task, and there is no need to save them.

There are several websites dedicated to command-line tools and what they can do. Here are two:

Unix Guru Universe

CommandLine Fu

There are others but it's a start.

A side benefit is that, with Cygwin, a Microsoft admin can learn these tools and use them on Windows servers as well. Sure, there are batch files and PowerShell, but if you admin both Unix and Windows, why learn PowerShell or VB separately? Many Windows commands bulk-load user and group objects into AD, and the Unix tools come in quite handy when you need to generate a command file for that kind of repetitive Windows task.
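
As a sketch of that last point (the file names and CSV layout are hypothetical): under Cygwin, gawk can turn a simple 'username,Full Name' CSV into a batch file of net user commands for bulk account creation:

gawk -F, '{ printf "net user %s /add /fullname:\"%s\"\n", $1, $2 }' users.csv > addusers.bat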

Perl, sed, grep, gawk, uniq, host and dig: another look | 30 comments (28 topical, 2 editorial, 0 hidden)
wow. I guess it depends on what business you are (none / 0) (#2)
by Morally Inflexible on Mon Dec 07, 2009 at 07:27:27 PM EST

in, but it seems to me that confining yourself to only communicating with Americans is a worse idea than blocking all free email providers. (I do know many people who block all of hotmail, yahoo, gmail, etc...)

It depends. (none / 0) (#10)
by xC0000005 on Wed Dec 09, 2009 at 01:54:47 AM EST

All I do is email, and the number of times I've had massive companies say "If I could block all email from anyone in [essentially everyone but america] that doesn't come from one of the majors, I'd be fine with it.

Big corps tend to leave hotmail, yahoo, gmail etc on the "ok to send us crap we'll dump" list and blockade everything else. Of course all the crap from the big guys then goes through the spam filter but if you aren't sending from one of them you do not even get to enter the competition. And there are a few that shit bin messages for having russian, chinese or spanish characters in them. You know, because everyone loves ascii.

Draconian? Hell yeah. Then again there are several massive companies who block everything from china. Period. Even their own sub in china can't mail to them or connect to the home network. "If it comes from china it's probably someone trying to hack the R&D servers. Again."
"What if it's coming from your internal network?"
"Then it's absolutely someone trying to hack the R&D servers. They know where the good stuff is."

Go figure.

Voice of the Hive - Beekeeping and Bees for those who don't
[ Parent ]

eh, it just sounds like a bad idea to me. (none / 0) (#11)
by Morally Inflexible on Wed Dec 09, 2009 at 02:50:53 AM EST

I understand that many management types like the idea; but management types tend to make, ah, consistent decisions, meaning they tend to focus on doing what everyone else is doing rather than doing the best thing.

[ Parent ]
That's a good way of putting it. (none / 0) (#19)
by k31 on Wed Dec 09, 2009 at 12:03:58 PM EST

they tend to focus on doing what everyone else is doing rather than doing the best thing.

The thing is, it works in the short term. Good enough gets to be a lower and lower bar to reach, given. In the long term, they just buy the people who figured out the real problems.


Your dollar is you only Word, the wrath of it your only fear. He who has an EAR to hear....
[ Parent ]

Hmm (none / 1) (#3)
by boxed on Tue Dec 08, 2009 at 07:59:00 AM EST

One wonders why you didn't just use brainf*ck.

(0) self-censorship is for fags $ (none / 0) (#4)
by LilDebbie on Tue Dec 08, 2009 at 10:24:24 AM EST



My name is LilDebbie and I have a garden.
- hugin -

[ Parent ]
funnny (none / 0) (#12)
by Fake Can Be Just As Good on Wed Dec 09, 2009 at 07:31:31 AM EST

coming from someone who calls god g-d

[ Parent ]
haven't for a while (none / 0) (#14)
by LilDebbie on Wed Dec 09, 2009 at 10:02:29 AM EST

iirc, it was 0xC0000005 who pointed out that doing so is effectively idolatry

My name is LilDebbie and I have a garden.
- hugin -

[ Parent ]
idolatry <3 (none / 0) (#23)
by boxed on Thu Dec 10, 2009 at 10:02:57 AM EST



[ Parent ]
For some reason when this went to voting (none / 0) (#5)
by mybostinks on Tue Dec 08, 2009 at 10:39:34 AM EST

the formatting got totally messed up. It would be nice if someone could fix it if it posts.

How's that? (none / 1) (#6)
by rusty on Tue Dec 08, 2009 at 11:54:39 AM EST



____
Not the real rusty
[ Parent ]
Thanks! /nt (none / 0) (#7)
by mybostinks on Tue Dec 08, 2009 at 12:17:39 PM EST



[ Parent ]
HOLY SHIT THIS IS WILD STUFF MAN (3.00 / 4) (#8)
by I Did It All For The Horse Cock on Tue Dec 08, 2009 at 12:36:33 PM EST




\\\
  \ \        ^.^._______  This comment brought to you by the penis-nosed fox!
    \\______/_________|_)
    / /    \ \
    \_\_    \ \

i guarantee crawford has something stupid to add $ (3.00 / 3) (#13)
by th0m on Wed Dec 09, 2009 at 08:09:20 AM EST



You beat him to it. $ (none / 0) (#20)
by Nimey on Wed Dec 09, 2009 at 11:09:35 PM EST


--
Never mind, it was just the dog cumming -- jandev
You Sir, are an Ignorant Motherfucker. -- Crawford
I am arguably too manic to do that. -- Crawford
I already fuck my mother -- trane
Nimey is right -- Blastard
i am in complete agreement with Nimey -- i am a pretty big deal

[ Parent ]
xargs (none / 1) (#15)
by Canar on Wed Dec 09, 2009 at 10:31:01 AM EST

You forgot my favourite commandline utility: xargs! I imagine you can probably roll your own with perl, but this little command is so incredibly useful.

Agreed, I use it a lot (none / 0) (#17)
by mybostinks on Wed Dec 09, 2009 at 11:28:58 AM EST

but not in the examples I have above.

[ Parent ]
Fabrice Bellard TCC (3.00 / 2) (#16)
by sye on Wed Dec 09, 2009 at 11:28:16 AM EST

TCC is a tiny but complete ISOC99 C compiler which enables you to use C as a scripting language. TCC has its roots in the OTCC project. The TCCBOOT boot loader demonstrates the speed of TCC by compiling and launching a Linux kernel in less than 15 seconds.

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightess in CNY/BTC/LTC/DRK
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
enter K5 via Blastar.in

Excellent I will have to (none / 0) (#18)
by mybostinks on Wed Dec 09, 2009 at 11:30:41 AM EST

check it out. I have not heard of it before.

Thanks

[ Parent ]

Fabrice does not get enough credit (3.00 / 4) (#21)
by Morally Inflexible on Wed Dec 09, 2009 at 11:33:41 PM EST

the man built the foundations for most of the current open-source 'full virtualization' solutions out there.

[ Parent ]
I didn't know he launched ffmpeg, bellard.org/dvbt (none / 0) (#22)
by sye on Thu Dec 10, 2009 at 09:50:52 AM EST

His coding really builds up...bellard.org/dvbt is neat.

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightess in CNY/BTC/LTC/DRK
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
enter K5 via Blastar.in
[ Parent ]

speaking of virtualization (none / 0) (#24)
by sye on Thu Dec 10, 2009 at 10:10:34 AM EST

what's your take on Xen vs. KVM debate? I read a couple of papers on the subject. 'Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors'... For your shop Xen is good but for a corporate UNIX environment, KVM may be more fitting.

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightess in CNY/BTC/LTC/DRK
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
enter K5 via Blastar.in
[ Parent ]

yeah, pretty much. Also, a year ago there was (none / 0) (#26)
by Morally Inflexible on Thu Dec 10, 2009 at 06:42:36 PM EST

no contest.  KVM wasn't ready.  (and xen has been ready for a lot more than a year)  -  KVM is getting better all the time, so it might, assuming hardware virtualization continues to get better, someday be better for me than Xen is.  

On the other hand, if you are doing desktop virtualization;  if you are virtualizing computers you expect to use for things other than hosting virtualization containers, KVM is the superior choice;  so if you just want to spin up a thing to test the next kernel, kvm is the way to go.

so, uh, I guess my answer is "Last year, Xen was the clear choice... I have not yet thoroughly evaluated this year."

I do plan on spinning up some KVM stuff in the near future, just so I know what is going on if  KVM becomes superior in the near future.  

[ Parent ]

your 'containers' papers is not about KVM, btw (none / 1) (#27)
by Morally Inflexible on Thu Dec 10, 2009 at 06:52:26 PM EST

'kvm' is more like a 'hypervisor' only linux is the 'hypervisor'  -  everyone still gets their own kernel.

I got bit so hard by containers many years ago that I still haven't (and probably won't, they are an inferior architecture for what I am doing.)  given them another chance.    I know some other providers who are good who use containers, but the problem is that they have to be very active;  they have to watch for and swat heavy users, whereas with xen, I don't have to care.  

Container-based virtualization is like FreeBSD jails, openvz or linux vserver.  while they are technically more efficient, the seperation between domains is abysmal.  everyone shares a kernel (and usually everyone shares swap and disk-cache.)  my experience was that one heavy user made everything suck for everyone, whereas on xen, heavy users make things suck for themselves.  

Now, the container guys look at those problems and try to solve them one at a time.  (I don't think they are trying to solve it for pagecache, as those sorts think pagecache is free ram, which is so terrifyingly wrong I usually end the conversation whenever anyone suggests such a thing.)   And the problem with the one at a time approach is that you always miss something.   If everyone has their own kernel, it is much easier to give them good isolation.  

[ Parent ]

Also he wrote lzexe back in the old days (none / 1) (#25)
by Nimey on Thu Dec 10, 2009 at 02:33:04 PM EST

the first MS-DOS program to compress .EXE files.
--
Never mind, it was just the dog cumming -- jandev
You Sir, are an Ignorant Motherfucker. -- Crawford
I am arguably too manic to do that. -- Crawford
I already fuck my mother -- trane
Nimey is right -- Blastard
i am in complete agreement with Nimey -- i am a pretty big deal

[ Parent ]
OB Optimization (none / 1) (#28)
by lewiscr on Thu Dec 10, 2009 at 10:02:46 PM EST

(g)awk can do pattern matching too. I love Perl, but it's pretty heavy. If I can write a script in grep, sed, awk, or cut instead of perl, I'll do it. Unless the perl script would be significantly shorter.

cut tends to be lighter weight than gawk. `cut -d":" -f1` will give you an IP from IP:port.

sort -u will do the uniqueness too. Since it's already gone through the effort of sorting, it's usually faster than `sort | uniq`. The caveat being that if you need more info about the uniqueness (like the counts), you have to do the `sort | uniq -c`. Personally, I think you would want the counts here, but that would make Step 2 harder. Then you'd have to get `cut` and `paste` involved, so I'll ignore that for now.

I noticed step1 and step2 could be combined with the use of xargs. But seeing as how you use spamip.txt in step3, I kept the file using | tee.

So I'd rewrite step1 and 2
perl -ne 'print if /:25/' messages.0 |
gawk '{print $12}' |
gawk -F: '{print $1}' |
sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 |
uniq > spamip.txt

while read HOST; do host $HOST; done < spamip.txt > reverseLU.txt

to be

gawk '/:25/ {print $12}' messages.0 |
cut -d":" -f1 |
sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 -u |
tee spamip.txt |
xargs -n1 host > reverseLU.txt

Since you said you're processing 100MB text files, that'll probably save you some time. As always, use `time` to verify.

Finally, Step 3a is an abuse of the pipe operators. Most of those pipes should really be semi-colons. The only one that is legitimately a pipe is the `gawk | uniq > file`.


Excellent, thanks /nt (none / 0) (#29)
by mybostinks on Fri Dec 11, 2009 at 02:32:05 AM EST



[ Parent ]
JESUS (none / 0) (#30)
by The Hanged Man on Thu Jan 28, 2010 at 05:21:46 PM EST


-------------

Dificile est saturam non scribere - Juvenal