Evadland - Blog

Passwords, Hashes and Dictionary Attacks

Thu, 03 Mar 2011 21:35:35 GMT

Cryptology has been a favorite topic of research for me over the past 10 years. Ever since geometry class, I have been obsessed with mathematics, patterns, and number theory. I enjoy turning one value into a totally different value and back again. That may sound like nonsense, but that is exactly how cryptology (and thus password systems) works.

I am going to go into detail about one of the most popular password storage and verification processes in use on the internet and in offline systems. This method is called a hash, specifically an MD5 hash. I will also outline 2 of the methods that hackers can use to compromise passwords.

A big part of password security is never actually storing a password. A password is not stored by itself, meaning in plain text. If your password is ‘evadman’, ‘evadman’ is not stored in a database. Some other representation of the password is stored, and verified against when you log into an application or site.

MD5 Hashes

For example, the de facto standard for passwords is called an MD5 hash. This is a 32 character representation of a string of numbers, letters or other characters. The length is always 32 hexadecimal (0-9 and a-f) characters no matter how long the input is, and the output hash is always the same for the same input. For example, the MD5 hash of an empty string (“”) is d41d8cd98f00b204e9800998ecf8427e. The MD5 hash of “The quick brown fox jumps over the lazy dog” is 9e107d9d372bb6826bd81d3542a419d6. If any character is changed, such as an uppercase letter, extra space, or even a non-printable character such as a tab, the MD5 hash will change. The only thing that won’t change is the length of the hash, which will always be 32 hexadecimal characters.
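As a minimal sketch of the properties described above, Node.js can compute MD5 hashes with its built-in crypto module. The helper name md5Hex is made up for this example; the two expected values in the comments are the hashes quoted in the paragraph.

```javascript
// Node.js built-in module; no third-party packages are needed.
const crypto = require("crypto");

// md5Hex is a helper name invented for this example.
function md5Hex(input) {
  return crypto.createHash("md5").update(input, "utf8").digest("hex");
}

const emptyHash = md5Hex("");
const foxHash = md5Hex("The quick brown fox jumps over the lazy dog");
// Same sentence with one extra character (a trailing period).
const foxDotHash = md5Hex("The quick brown fox jumps over the lazy dog.");

console.log(emptyHash); // d41d8cd98f00b204e9800998ecf8427e
console.log(foxHash);   // 9e107d9d372bb6826bd81d3542a419d6
console.log(foxDotHash.length);      // 32 (the length never changes)
console.log(foxDotHash === foxHash); // false (one character changes the hash)
```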

The input into an MD5 hashing program can be a password (a small string of characters) or something as big as an entire hard drive. That is why sometimes when you go to a download site, they will tell you the hash for the file so you can verify that the file was transferred correctly. If you create a hash on your computer of the file and get a different answer, then the file was corrupted when it was downloaded.

Since an MD5 hash is always the same length, it can be stored conveniently in a database. The database administrator or application owner doesn’t have to set a maximum password length, or store a field of random length, as the output is always exactly 32 characters. For all the program cares, the password can be a paragraph. The output will always be 32 characters long. That is a big ‘selling point’ of an MD5 hash.

These password hashes used to be considered moderately safe for public view because they could not be reversed. For example, walking backwards from 9e107d9d372bb6826bd81d3542a419d6 to “The quick brown fox jumps over the lazy dog” was considered pretty much impossible, because it would take every computer on the planet working for decades to go backwards and turn the hash into the original password. That was a false assumption; methods were found to do that conversion quickly if the password is short enough.

Rainbow Tables

This method is called a “pre-computed dictionary attack” and the data is stored in a structure called a “rainbow table”. To GREATLY simplify, if you have a hash, you can look up the original password in the rainbow table. The limitations of this process are important to understand, because they tell you what the minimum password length and character set should be to best mitigate the risk of your password being broken.
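To make the "look it up" idea concrete, here is a rough sketch of a plain pre-computed lookup table. It is not a real rainbow table: real ones store hash chains built with reduction functions to save space, while this one simply stores every hash. The names md5Hex and buildLookupTable, and the tiny "abc" alphabet, are assumptions made for the example.

```javascript
const crypto = require("crypto");

// Helper invented for this example.
function md5Hex(input) {
  return crypto.createHash("md5").update(input, "utf8").digest("hex");
}

// Pre-compute hash -> password for every string of length 1..maxLength
// that can be built from the given alphabet.
function buildLookupTable(alphabet, maxLength) {
  const table = new Map();
  let current = [""];
  for (let len = 1; len <= maxLength; len++) {
    const next = [];
    for (const prefix of current) {
      for (const ch of alphabet) {
        const candidate = prefix + ch;
        table.set(md5Hex(candidate), candidate);
        next.push(candidate);
      }
    }
    current = next;
  }
  return table;
}

// 3 + 9 + 27 = 39 candidate passwords.
const table = buildLookupTable("abc", 3);

console.log(table.size);                   // 39
console.log(table.get(md5Hex("cab")));     // cab
console.log(table.get(md5Hex("evadman"))); // undefined (too long, wrong alphabet)
```

The last lookup shows the limitation the paragraph mentions: a password outside the table's length and character set simply is not found.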

First, a rainbow table is absolutely ginormous. A rainbow table that would work on passwords that are up to 7 numbers or lowercase characters is about 130 GB. Up to 9 numbers and lowercase letters is about 500 GB. The size grows quickly as additional characters are added. Adding a non-alphanumeric character like "#" will expand that to more than 10 TB. That is one of the reasons that it is recommended that you use a non-alphanumeric character in your passwords. The method used for generating the chains that make up a rainbow table takes a boatload of horsepower. This can be decades of computer time, but once the tables are made, they never need to be made again. Distributed computing can be used to reduce this to a few hours, if not less, for smaller tables.
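The growth is easier to see by counting candidate passwords. The sketch below counts every password of length 1 up to a maximum for a given alphabet size. The function name keyspace and the choice of 32 extra punctuation characters are assumptions for illustration; the GB and TB figures above depend on how a table is built and cannot be derived from these counts alone.

```javascript
// Count all passwords of length 1..maxLength over an alphabet of the given size.
// Plain multiplication keeps the integers exact (they stay well under 2^53 here).
function keyspace(alphabetSize, maxLength) {
  let total = 0;
  let power = 1;
  for (let len = 1; len <= maxLength; len++) {
    power *= alphabetSize;
    total += power;
  }
  return total;
}

const lowerAndDigits = 26 + 10;          // 36 characters
const withSymbols = lowerAndDigits + 32; // assumption: 32 ASCII punctuation characters

console.log(keyspace(lowerAndDigits, 7)); // 80603140212
console.log(keyspace(lowerAndDigits, 9));
console.log(keyspace(withSymbols, 9));
// How many times bigger the 9-character space gets once symbols are allowed.
console.log(keyspace(withSymbols, 9) / keyspace(lowerAndDigits, 9));
```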

I will explain a real life example of this method. I was tasked with migrating users from one software platform to another many years ago. The original software used an MD5 hash for a password, and the new software used a different method. This meant that every user would have to reset their password, as the password hash that was stored could not be migrated directly to the new system. The application had more than a quarter million users, so this would stink for them. To mitigate this, I used a rainbow table to reverse the stored passwords and fed them into the new software, which generated its own hashes, so the users would not notice a difference. The amazing thing was that I used a very small rainbow table, and was able to reverse more than 98% of the quarter million hashes back into the original passwords in less than 10 minutes. That points to a gigantic hole in how most users choose passwords, and in the MD5 process as a whole.

On top of that, the passwords that could not be undone also had some interesting patterns. The hash of 32 characters itself means nothing, but when a bunch of the hashes are exactly the same, it means that folks are using the same password. This is surprising considering users are separated by thousands of miles, and have likely never met. I also noticed that the hash from my personal password, which couldn't be undone by the rainbow table, was in use by a few hundred different users. This confused me, because I thought my password would have been secure and random enough. I'm a DBA, so I always pick strong passwords that are very long compared to what I would term a ‘normal’ user.
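Spotting shared passwords does not require reversing anything; counting identical hashes is enough. A small sketch, with made-up sample data and a function name (findSharedHashes) invented for the example:

```javascript
// Return [hash, count] pairs for hashes that appear more than once,
// most frequent first.
function findSharedHashes(hashes) {
  const counts = new Map();
  for (const hash of hashes) {
    counts.set(hash, (counts.get(hash) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count > 1)
    .sort((a, b) => b[1] - a[1]);
}

// Sample data only: two real MD5 values from earlier in this post
// plus an all-zero placeholder standing in for a unique password.
const storedHashes = [
  "9e107d9d372bb6826bd81d3542a419d6",
  "d41d8cd98f00b204e9800998ecf8427e",
  "9e107d9d372bb6826bd81d3542a419d6",
  "9e107d9d372bb6826bd81d3542a419d6",
  "d41d8cd98f00b204e9800998ecf8427e",
  "00000000000000000000000000000000",
];

const shared = findSharedHashes(storedHashes);
console.log(shared);
// [ [ '9e107d9d372bb6826bd81d3542a419d6', 3 ],
//   [ 'd41d8cd98f00b204e9800998ecf8427e', 2 ] ]
```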

It turns out that the cause was simpler than I expected. I was migrating an application that had a higher percentage of technical users than a general group. These users were also using strong passwords in greater numbers than a general population. However, a bunch of users were using a password that is fast to type on a keyboard, while also being long enough to deter a rainbow table attack. It appeared a bunch of us had the same cycling of passwords (use one, then change to another, then another, and so on) as I found that some of the other hashes were previous passwords I had used.

That leads to the 2nd type of vulnerability that exists in any password system, not just an MD5 or other hash process. This is called a dictionary attack (sometimes loosely lumped in with brute force, although a true brute force attack tries every possible combination rather than working from a list). A dictionary attack means using words in a dictionary and trying them as passwords. Basically, each word is tried (thousands a second if the system allows it) until one works. This wouldn't be a problem for most strong passwords, as they are not actual words in the dictionary. But keyboard patterns and sequences of numbers can still be tried, and exist in ‘hacker’ password dictionaries. For example, "q2w3e4r" is the upper left of the keyboard. "753159" is a sequence on the numeric keypad. Neither of these is an actual word, but a quick look showed both exist in dictionary attack password lists that hackers can use. I did this same rough password frequency analysis on 3 other systems that have many thousands of users, and came up with roughly the same surprising answers. Different users were using the same passwords across systems that had absolutely nothing to do with each other. It's freaky, but it points to the biggest weak link in a password system: the user and their habits.
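Against a stored hash, a dictionary attack boils down to a loop: hash each candidate and compare. The sketch below is illustrative only; the function names and the tiny word list are assumptions, and real lists hold millions of entries.

```javascript
const crypto = require("crypto");

// Helper invented for this example.
function md5Hex(input) {
  return crypto.createHash("md5").update(input, "utf8").digest("hex");
}

// Try each word in the list; return the match, or null if none works.
function dictionaryAttack(targetHash, wordlist) {
  for (const word of wordlist) {
    if (md5Hex(word) === targetHash) {
      return word;
    }
  }
  return null;
}

// A tiny stand-in for an attacker's list, using patterns mentioned above.
const wordlist = ["password", "qwerty", "q2w3e4r", "753159", "letmein"];

const weakHash = md5Hex("753159");
const strongHash = md5Hex("a long & unusual pass phrase");

console.log(dictionaryAttack(weakHash, wordlist));   // 753159
console.log(dictionaryAttack(strongHash, wordlist)); // null
```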

Here is something that should make some folks change their passwords. There is almost a 30% chance that your password (yes, the person reading this) is in this short list: 123, 1234, 12345, 123456, 666666, 7777777, 12345678, 123456789, password, blogger, qwerty, letmein, test, trustno1, dragon, abc123, 111111, hello, monkey, master, killer, 123123, ncc1701, thx1138, qazwsx, ou812, 8675309. Do you have one of those as a password to your banking site? What about your email? Welcome to an emerging field called social engineering. Social engineering tactics were used very recently by hackers to destroy the credibility of a firm called HBGary. In fact, HBGary’s CEO stepped down on Tuesday as a result of that attack.

Best Practices

If you use the same password across systems, if one is broken, every system that has that password can also be broken into as well. So here are the rules I suggest you follow for passwords:

Use different passwords for each system. If you have trouble remembering passwords, at minimum use a different one for your email than everything else. If your email account is broken, that means a hacker can reset your password on almost any site that exists. The new password will be sent to your email, so the hacker will have it.
Don't bother using an uppercase letter, as almost all rainbow tables include upper case in the table. Use a non-alphanumeric character instead. This means something like "&", "(" or "$". Bonus points for ALT-0160 or something like it.
Use passwords that are 7 characters long or longer if at all possible. 10 is better, even if the last 3 are spaces or the same character repeated. If remembering long passwords is a problem, repeat the password 2 or 3 times to extend the length. That will defeat a rainbow table, and some dictionary attacks.
Don't use a password that can be found online. So don't use 'correct horse battery staple'.
When installing any software application or hardware, always change the default password to something else. Change the password on your router or Access Point; same with database or other programs. It is amazing how many are still the default password, which can be pulled off a support website.

This was just the tip of the iceberg, but hopefully it will assist folks with choosing secure passwords.

Brother LB6800PRW & SE400 Sewing Machine

Sun, 23 Jan 2011 06:03:15 GMT

One of the things I needed to pick up for my new place was a sewing machine. Yes, I do some sewing every once in a while, such as some Adventure Time Finn hats, or an Ice King costume. Previously, I always borrowed someone else's machine. Now that I have some room, I wanted to get my own. Which did I decide on? The Brother LB6800PRW.

I looked around for a while, and decided to get a Brother machine. I was going to get a regular sewing machine, but it was only about $50 extra for one that could also do embroidery. The machine I decided to get was the Brother LB6800PRW. This is exactly the same as the SE-400, except it comes with a carry-cart for portability, and has some 'project runway' branding on it. Both have a 100mm by 100mm embroidery field (4 inches by 4 inches). The LB6800PRW was only a few dollars more than the SE-400, and the bag will at least keep the dust off the machine once I get bored with it. Both the SE-400 and the machine I got have 512k of memory that can be accessed through USB, so you can make and load embroidery patterns from a computer. I don't understand why that is even optional; all embroidery machines should allow USB connectivity, as there is more than just letters to embroider. Software to make your own embroidery is another matter, as nothing comes with the machine.

The machine has a touch screen for choosing stitches and built in patterns, which I like. I don't like that the default stitch is one that I will never use (left stitch instead of centered in the foot), but I got used to always changing it pretty quickly. The auto-threader is awesome as well. Threading the needle was usually a pain, especially with thicker thread. No issues on this one.

The first thing I did was make some curtains for a small half-moon window in my bedroom. There is an annoying streetlight right on the other side, so it definitely needed a curtain. Trimming one from the store to 24" long solved the problem nicely. The machine did great, though it was only about 40 inches of sewing. No issues to speak of.

Once that was completed (and I mounted the curtains), I wanted to try out the embroidery functions. There were 2 main projects I wanted to use the machine for (and what drove me to pay $50 extra for an embroidery machine). Those were a custom shirt for someone at work, and a baby bib for a friend who is having a baby. The first thing I tried was putting my name on a towel. That failed miserably because I did not use a stabilizer. The instructions said it was 'recommended' to use one, but it is actually pretty much required.

I also broke a needle by following the instructions. They state to gently hold the thread for the first few stitches, so the thread doesn't pull out of the needle. Well, gentle is actually 'not at all' as the needle can take only a very small amount of force before bending enough (about 1 mm) to hit the foot and break. So I swapped out the needle, and made my first accessory purchase, a pack of 100 needles (about $20, not bad).

I did try some graphical embroidery as well. I looked around for some patterns (besides the ones on the machine) and found that they almost all cost a decent amount of money. I saw a Disney set of 10 designs for $20, which is crazy high for non-commercial use. The designs must take a long time to create, or something is driving prices way up (low supply?). I finally found a free dragon, and put it on a wash towel using some thread I had lying around. It actually looked pretty good, and the design had the lock stitches built into it correctly. It was actually a surprising amount of stitches, more than 10,000 if I remember correctly. It took about 20 minutes for the machine to sew it, so the machine averaged a little over 500 stitches per minute.

Installing Linksys EG1032 V3 Gigabit Network card on Windows 7 x64

Sun, 15 Aug 2010 20:09:09 GMT

It figures that Linksys doesn't make an x64 driver for their EG1032 V3 gigabit card that I got many years ago. I kinda figured that Windows 7 would be able to find a driver for it anyway, but no such luck.

Well, after some research (and trial and error) I figured out a solution. The EG1032 is built upon a Realtek chipset, so why not just use their driver? It turns out that works perfectly. Here are the steps to install the driver:

1. In Windows Device Manager, open the network adapter that doesn't have the driver installed and click on Update Driver Software.

2. Choose Browse My Computer

3. Choose Let me Pick

4. Choose Network Adapters

5. Wait for the list to populate, then choose 'Realtek' in the left column and '8169/8110 Family' in the right column

6. A popup will come up warning you that this is not recommended because Windows can't verify that the hardware and driver are compatible. Click OK.

7. Driver will install, and will now work. Hooray.

August 14th, 2010

Sat, 14 Aug 2010 20:29:58 GMT

I wrote up something on a forum to give an idea of failure rates for different RAID array types, and it ended up being a pretty good summary.

Question: The storage guys seem to say to avoid RAID5 due to the huge storage sizes now and the high chance of errors during rebuilds

The reason there would be an error when rebuilding a RAID5 array (leading to data loss) is that a 2nd disk fails (or a sector fails, etc.), and that would be catastrophic to the data. The most harrowing time (when you don't have backups) is during a rebuild of any single redundancy array (RAID 1, 2, 3, 4 or 5) because a 2nd disk failure will lead to data loss. On bigger disks, the rebuild time is longer, so the window for complete data loss is larger. On a hardware RAID controller, this window is (best case) roughly equal to the capacity of a single disk in the array divided by the sequential read speed (or write speed, whichever is lower). Assuming 2 TB disks and 50 MB/sec, a RAID5 array will take 40,000 seconds or about 11.1 hours. The NV+ takes roughly 20 hours (I know from experience).
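The best-case rebuild window is just capacity divided by throughput. A quick sketch with the 2 TB and 50 MB/sec figures used in this post (the function name is made up for the example, and 1 TB is taken as 1,000,000 MB):

```javascript
// Best-case rebuild time: every byte of one disk is read (or written) once.
function rebuildSeconds(diskCapacityMB, throughputMBps) {
  return diskCapacityMB / throughputMBps;
}

const seconds = rebuildSeconds(2000000, 50); // 2 TB disk at 50 MB/sec
const hours = seconds / 3600;

console.log(seconds);          // 40000
console.log(hours.toFixed(1)); // 11.1
```

Real rebuilds run slower whenever the array is also serving normal traffic, which fits the roughly 20 hours seen on the NV+.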

There are ways to mitigate this risk (besides backups, which are always required) by using different RAID levels. For example, RAID6 is becoming the 'new' RAID5. RAID6 uses 2 different parity calculations, and stores both. This allows for 2 disks to fail without data loss. The trade-off is that you need another disk to make up for the space used by the second set of parity. The NV+ doesn't support RAID6 natively, but other products in the line do (they are the pro series from Netgear that have 6 disks instead of 4, cost about $1200, and are x86 based instead of SPARC). It should be technically feasible to write an add-in for the NV+ that will support RAID6 (since the minimum disk count for RAID6 is 4, and the NV+ has 4), but RAID6 is usually for larger disk count arrays. Also, with a 4 disk count, RAID 10 will give the same amount of usable space, but be much faster in a software RAID environment with only slightly less fault tolerance. (RAID 10 can support 2 disk failures if they are the 'right' disks, but if the wrong 2 fail, you can still lose data.)

RAID5 is going to be slightly faster than RAID6 for the same size disk array because there are 2 parity calculations for RAID6 while RAID5 only has 1.

RAID5 is way better than RAID0 from a safety standpoint, and gives more space than RAID1. Let's use a hypothetical situation (since I don't want to look up the actuals, but I will be pretty close; I will also simplify some of the math since we don't have to be perfect, just decently close). The situation is that you have four 2 TB disks with a MTBF of 1M hours and a read/write speed of 50 MB/sec. (Enterprise disks are usually 1.2M hours, and home user disks are usually around 800k hours, so 1M is a nice average, and allows for quicker math I can do in my head.) Using those 4 disks, let's calculate some failure rates, data storage and speeds. We must also assume a 'life of disks' to determine the possibility of data loss over the life of the array. Let's assume 3 years since that is most manufacturers' warranty period. Please also note that we are only calculating disk failures and not including things like hardware or software failure which can also destroy an array, even without a disk failure. For example, bad RAM can cause garbage to be written to an array. We will also only calculate this for 4 of the major RAID levels: 0, 5, 6, and 10.

RAID0 - Total space is the sum of the disk sizes (2TB*4 or 8 TB). Speed will also be the sum of the disks (50 MB/sec * 4 or 200 MB/sec). The possibility of data loss during any given hour is 4 times that of a single disk (in other words, the MTBF of the disks divided by 4), since in RAID0 if any disk fails all data is lost. 3 years works out to be 365 * 24 * 4 * 3 hours of disk up time, which is 105120 hours total. With a MTBF of 1M hours, you end up with the possibility of data loss during the life of the array of 105120/1M or 10.5%. That's very high at 1 in 10.

RAID10 - Total space is the sum of the disk sizes divided by 2 (2TB*4/2 or 4 TB). Speed will be the speed of 2 disks (50 MB/sec * 2 or 100 MB/sec). Data will only be lost if the 'wrong' disk fails during a rebuild. That means we have to calculate the rebuild time (which we did above at 11 hours, but let's call it 10 for 'Evad is lazy' sake). So for 10 hours, there is a 1 in 1M chance per hour of a disk failure, and a further 50% probability the wrong disk fails. That works out to a 5 in 1M chance during the rebuild time. But you need to know how many times any disk will fail and cause a rebuild. That was calculated above at 10.5% over the life of 4 disks. So you end up with 10.5% chance of rebuild, and a 5 in 1M chance (0.0005%) of failure during that rebuild, or 0.105 * 0.000005 = 0.000000525 or 0.0000525%. Many orders of magnitude better than RAID0. The trade off is that you lose 1/2 of the space.

RAID5 - Total space is the sum of the disk sizes - 1 for parity or 6 TB. Speed will be the speed of 3 disks (50 MB/sec * 3 or 150 MB/sec). Data will only be lost if a 2nd disk fails during the rebuild. That means we have to calculate the rebuild time. So for 10 hours, there is a 1 in 1M chance per hour of a disk failure. That works out to a 10 in 1M chance during the rebuild time. But you need to know how many times any disk will fail and cause a rebuild. That was calculated above at 10.5% over the life of 4 disks. So you end up with 10.5% chance of rebuild, and a 10 in 1M chance (0.001%) of failure during that rebuild, or 0.105 * 0.000010 = 0.00000105 or 0.000105%. Many orders of magnitude better than RAID0 but worse than RAID10. The trade off is that you have 25% less space than RAID0 and 50% more space than RAID10.

RAID6 - Total space is the sum of the disk sizes - 2 for parity or 4 TB (same as RAID10). Speed will be the speed of 2 disks (50 MB/sec * 2 or 100 MB/sec). Data will only be lost if a 3rd disk fails during the rebuild. That means we have to calculate the rebuild time, which is harder to do than RAID5. So for 10 hours, there is a 1 in 1M chance per hour of a single disk failure, but 2 more need to fail to lose data. That works out to 1/1M * 1/1M during the rebuild time (0.000001 * 0.000001 = 0.000000000001). But you need to know how many times any disk will fail and cause a rebuild. That was calculated above at 10.5% over the life of 4 disks. So you end up with 10.5% chance of rebuild, and a 0.0000000001% chance of another 2 disks failing during that rebuild, or 0.105 * 0.000000000001 = 0.000000000000105 or 0.0000000000105%. Many orders of magnitude better than RAID0, RAID10 and RAID5. The trade off is that you spend 2 disks to store parity information, so you have 50% of the actual space usable. RAID6 is also not supported on all devices.
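The four estimates above can be reproduced with a few lines of arithmetic. This is only a sketch of the simplified model used in this post, not a proper reliability calculation; the function and field names are invented for the example. It uses the unrounded 105120/1M (0.10512) rather than 0.105, so the last digits differ slightly from the hand math.

```javascript
// Back-of-envelope data loss odds over the life of the array,
// following the simplified model in this post.
function raidLossEstimates({ disks, mtbfHours, years, rebuildHours }) {
  const diskHours = 365 * 24 * years * disks;       // total disk up time
  const pFirstFailure = diskHours / mtbfHours;      // chance any disk fails
  const pFailurePerHour = 1 / mtbfHours;            // single disk, single hour
  const pFailureDuringRebuild = rebuildHours * pFailurePerHour;

  return {
    raid0: pFirstFailure,
    // Only the 'wrong' disk (50%) failing during the rebuild loses data.
    raid10: pFirstFailure * pFailureDuringRebuild * 0.5,
    raid5: pFirstFailure * pFailureDuringRebuild,
    // As in the post: two more failures, each taken as a 1 in MTBF chance.
    // Note this leaves the rebuild hours out of the RAID6 figure.
    raid6: pFirstFailure * pFailurePerHour * pFailurePerHour,
  };
}

const estimates = raidLossEstimates({
  disks: 4,
  mtbfHours: 1000000,
  years: 3,
  rebuildHours: 10,
});

for (const [level, probability] of Object.entries(estimates)) {
  // Print as a percentage in exponent form to keep the tiny values readable.
  console.log(level, (probability * 100).toExponential(3) + "%");
}
```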

In terms of space, here is a summary:
RAID0 - 8 TB
RAID5 - 6 TB
RAID6 - 4 TB
RAID10 - 4 TB

In terms of probability of failure over the life of the array (3 years):
RAID0 - 10.5%
RAID5 - 0.000105%
RAID10 - 0.0000525%
RAID6 - 0.0000000000105%

In terms of speed:
RAID0 - 200 MB/Sec
RAID5 - 150 MB/sec
RAID10 - 100MB/sec
RAID6 - 100MB/sec

In terms of dollars per usable TB (assume $100 per 2 TB hard drive)
RAID0 - $400 / 8 TB = $50 per TB
RAID5 - $400 / 6 TB = $67 per TB ($17 more than RAID0 to drop the rate from 10.5% to 0.000105%)
RAID6 - $400 / 4 TB = $100 per TB ($33 more than RAID5 to drop the rate from 0.000105% to 0.0000000000105%)
RAID10 - $400 / 4 TB = $100 per TB (usually only chosen if small random writes are an issue, such as with a database, or if the controller doesn't support RAID6 and RAID5 is too high risk.)
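The cost figures follow from total price divided by usable space. A small sketch (function name and argument order are assumptions made for the example) that works for any disk count:

```javascript
// Dollars per usable TB for the four RAID levels discussed here.
function costPerUsableTB(level, diskCount, diskTB, diskPrice) {
  const usableDisksByLevel = {
    raid0: diskCount,      // no redundancy
    raid5: diskCount - 1,  // one disk of parity
    raid6: diskCount - 2,  // two disks of parity
    raid10: diskCount / 2, // everything mirrored
  };
  const usableDisks = usableDisksByLevel[level];
  if (usableDisks === undefined) {
    throw new Error("Unknown RAID level: " + level);
  }
  return (diskCount * diskPrice) / (usableDisks * diskTB);
}

for (const level of ["raid0", "raid5", "raid6", "raid10"]) {
  // Four 2 TB disks at $100 each.
  console.log(level, "$" + costPerUsableTB(level, 4, 2, 100).toFixed(2));
}
// raid0 $50.00
// raid5 $66.67
// raid6 $100.00
// raid10 $100.00
```

Raising diskCount shows how the gap between RAID5 and RAID6 shrinks as arrays get bigger.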

Very long story short is that there are trade offs with each RAID level. You need to choose the one that is right for you based on your specific needs, budget and risk tolerance. For the majority of home users who use a NAS, RAID5 provides a good balance of fault tolerance and space per dollar spent in a 4 disk NAS. If we were to redo this with 6 or more bays, then RAID6 becomes a more attractive choice because the probability of a disk failure is higher the more disks exist in the array, and the cost starts approaching RAID5 levels. For example, the cost per TB for a 20 disk RAID5 array versus a 20 disk RAID6 array using 2 TB disks is $52.6 vs $55.6.

There are way more things to consider in different environments, such as stripe size (which can waste space and affect speed), load on the array (high disk count parity arrays are bad for random writes, such as databases, because of the write-hole, while for media streaming parity arrays are fine), and options on the controller (such as OCE or ORM and having a dedicated XOR processor). We also can't forget the fabled 'sympathy failure' of disks, which may or may not increase the odds of concurrent disk failures if you 'believe' in them. The write-hole in RAID5 can also be solved by using something like RAID-Z.

Cliffs:
RAID5 is fine for home users and way better than a bunch of single disks.

Garage storage

Sun, 11 Apr 2010 20:28:04 GMT

For those of you that don't know, I bought a townhouse recently. Now, I can work on improving the house instead of responding to email.

I enjoy doing work with my hands when I have time; meaning woodworking, automotive, metalworking, building computers, anything really. Pretty much all of those require a good workspace. For me, that means the garage. So naturally, the first area I picked to upgrade was the garage.

I knew I wanted a workbench and lots of storage space. In addition, I wanted to be able to disassemble whatever I used so I could move it to a new house if required. Tack onto that a requirement that the company and design be around for a little while so I could add more space if I needed to without having to worry about matching designs.

That left me a few options. I decided to go with Gorilla Rack, which has been around for a while. Their base design also hasn't changed, so it should be around if I want to add more matching space. I chose to go with 17" wide rack that was 72" tall. It sounds small, but I wanted to be able to fit most of the rack in the garage without encroaching on the garage door clear space. I could have gone with the 96" tall, but one of the posts on each side would have had to be cut to clear the beam over the garage. I decided to stick with 72" tall for now.

Gorilla Rack is purchased by the part, so you can design whatever you want. I decided to go with two 6' sections and one 4' section on each side. (they make 4', 6' and 8') Three 6' sections would have been about 1" too long when you factor in the space required for the garage door safety eyes and the width of the uprights. I decided to get three 4' shelves and nine 6' shelves. The workbench area would require an additional 6' and 4' shelf for the work space.

Then, I had to choose what I wanted for shelves. Gorilla rack sells precut pieces, but they were $3 for a 2' section, and you would need 3 of them for a 6' shelf. This is way too much when you can get the exact same thing in a 4x8 sheet for about 1/3rd the price. I decided to go with a water resistant OSB that was 3/4" thick and cut it up.

Total Shelf Materials: $544.60
17" x 72" uprights: 6 @ 24.99 Each
17" x 36" uprights: 2 @ 15.49 Each
72" rack beam (one side of a shelf): 20 @ 12.49 Each
48" rack beam (one side of a shelf): 8 @ 8.79 Each
48" x 96" 3/4" OSB: 4 @10.89 Each
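For anyone pricing out a different layout, the total is just quantity times unit price, summed. A sketch that works in cents to avoid floating point rounding (the item labels and function name are made up for the example):

```javascript
// Sum quantity * unit price, working in whole cents.
function totalCents(items) {
  return items.reduce((sum, item) => sum + item.quantity * item.unitCents, 0);
}

const shelfMaterials = [
  { name: '17" x 72" upright', quantity: 6, unitCents: 2499 },
  { name: '17" x 36" upright', quantity: 2, unitCents: 1549 },
  { name: '72" rack beam', quantity: 20, unitCents: 1249 },
  { name: '48" rack beam', quantity: 8, unitCents: 879 },
  { name: '48" x 96" 3/4" OSB', quantity: 4, unitCents: 1089 },
];

const cents = totalCents(shelfMaterials);
console.log("$" + (cents / 100).toFixed(2)); // $544.60
```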

Assembling the Rack
Assembly was a snap, just slide the beams into the uprights at the correct slot. The 6' beams required a tie in the middle that was assembled with a screwdriver and 10mm socket. The only 'gotcha' is that if your floor isn't level, it will be difficult to slide the beams together. I used scrap OSB under the legs to make up the height difference. Later, I will weld up some spacers out of some mild steel.

I planned out the shelving material when I was working on the layout of the garage. A 4'x8' piece was the cheapest, and gave me the least amount of waste with three 6' shelves and one 4' shelf per 4'x8' OSB sheet. I cut 15 3/16" off of one of the 4' ends of the sheet for a 4' shelf, then cut the remainder into three 6' shelves at 71 3/4" long by 15 3/16" wide. That left a 3"x96" and a 48"x9" piece of scrap per sheet of OSB. Using 4 complete sheets would give me four 4' shelves and twelve 6' shelves, which was 2 more than I needed. I decided to cut them anyway, and save them in case I decided to add more shelves later. In order to cut the sheets, I used an old Craftsman circular saw, a drywall t-square to draw the lines and a 2x4x8 to space the OSB off the floor for when I was cutting it. I also used a 30' retractable extension cord that I bought specifically for my new garage about a year ago. I also picked up some safety glasses, as most of mine were getting pretty scratched up. Couldn't forget a broom to clean up the sawdust either.

Extra Planning Materials: $53.47
2x4x8: 1 @ 2.18 Each
7 1/4" 24 tooth sawblade 1 @ 14.38 Each Pack (3 pack)
x-lens safety glasses: 1 @ 8.97 Each
18" smooth surface broom: 1@ 9.96 - $4 rebate
Dustpan and dustpan broom: 1 @ 17.99
Folding Chair: 1 @ 5.99 - $2 rebate

Other Stuff:
Pen
Drywall T-Square
Circular saw
Measuring tape
Extension cord
#2 Philips screwdriver
10mm socket and wrench

Before and after shots:

a ramble on email

Sat, 29 Aug 2009 19:44:49 GMT

Just how much of our current communication goes through a communication channel that is not face to face? This is a question I was thinking about today when I was asked about my email inbox status. How many of us take for granted our email? We can communicate with others on the other side of the earth in minutes instead of weeks for a paper letter.

Having said that, how many of us use email incorrectly? I know I do.

I usually tell people to email me instead of call me. The reason is that if I have the information in email, I can't lose it between cracks in the desk or delete a voice mail by accident. But that opens up a new issue of having to track all that email. It can also make the conversation take longer than a phone call, or give it a detached feel. However, on the upside, it allows me to have the conversation at my pace and at my time, instead of the caller's pace and time.

So far, I have gotten 306 emails today that were not moved to another folder by the rules I have created. Of these 306, I only needed to reply to 39 of them, and a total of 97 were 'useful' to me in some fashion. The rest were notifications of unrelated project status, something I was copied on and didn't need to action, or something I didn't even need to see. That puts the 'usefulness' percentage of email at about 32% for me today. The other 68% wasn't spam, but it wasn't required from my point of view.

Like most, I use folders to organize the emails I get into projects or people or something like that. I also have 37 rules that sort my inbound email into different folders so I can review the most important emails first. Without those rules, I would be drowning in a sea of electronic paper. As a side note, the spam filter used here at work also has to be one of the best on the planet; I have gotten exactly 2 spam emails in the last 7 years or so. That has to be a record of some sort.

Remember the 306 emails I mentioned earlier? Let's put some time behind that number. Assuming it takes me 20 seconds to read an email and 30 seconds to craft a response, those 306 emails took up 122 minutes of my day today. A full 2 hours doing nothing but responding to email. What is the likelihood that those 2 hours were spent doing the most productive thing I could be doing?
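The 122 minutes can be checked with a couple of lines. This sketch assumes every one of the 306 emails was read and only the 39 mentioned earlier got a reply; that split is an inference from the numbers above, and the function name is invented for the example.

```javascript
// Minutes spent on email: reading everything, replying to some.
function emailMinutes(received, replied, readSeconds, replySeconds) {
  return (received * readSeconds + replied * replySeconds) / 60;
}

const minutes = emailMinutes(306, 39, 20, 30);
console.log(minutes); // 121.5, which rounds to the 122 minutes quoted above
```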

What are the total stats on my email inbox? If you really must know, here you go.

My full inbox accounts for 16.9 gigabytes, and most of those emails don't have attachments since I have a process that moves attachments from email and stores them in the file system once the email hits a certain age. Picking one folder (inbox) at 26 MB which contains 1149 emails, the average size of an email that I don't delete is about 22 KB. That means my total email count is around 750,000. A full three quarters of a million. I have had this account for about 7 years, so on average I have kept about 290 emails per day. Assuming a 10 hour workday, that is one email about every 2 minutes. That also means I have spent 4166 hours in the last 7 years just reading email at 20 seconds per email. There are only 2080 hours in a standard year of work.
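Here is the same arithmetic as a sketch, starting from the rounded 750,000 total. The function and field names are made up for the example, and every calendar day is counted, which is what the 290 per day figure implies.

```javascript
// Rough inbox statistics from a total email count.
function inboxStats(totalEmails, years, workdayHours, readSeconds) {
  const keptPerDay = totalEmails / (years * 365);
  return {
    keptPerDay,
    secondsBetweenEmails: (workdayHours * 3600) / keptPerDay,
    readingHours: (totalEmails * readSeconds) / 3600,
  };
}

const stats = inboxStats(750000, 7, 10, 20);

console.log(Math.round(stats.keptPerDay));           // about 290 per day
console.log(Math.round(stats.secondsBetweenEmails)); // about 2 minutes apart
console.log(Math.floor(stats.readingHours));         // 4166
```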

So here I sit, writing an article about reading and writing too many emails. What has the world come to?

javascript and Numbers

Thu, 10 Jul 2008 19:28:15 GMT

Today I was working on improving a javascript function that sorts an HTML table that I originally wrote about 5 years ago.

The problem that I have is that I am a novice in javascript. Well, novice is probably overestimating my capabilities in actuality. But anyway, I still needed to improve this function, and learn along the way.

The function checks the first cell in the first row of a table in an attempt to determine the data type of the column. This is important when looking at numbers and dates. For example, you would expect the numbers (1, 3, 2, 4, 30, 10) to be sorted numerically as (1, 2, 3, 4, 10, 30). However, if sorted as text, you end up with (1, 10, 2, 3, 30, 4). That is not what I was trying to do, but that is what my code kept throwing out. Same with any decimal numbers, especially when the decimal point was in different spots (different precision).
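The difference is easy to reproduce. Table cells arrive as strings, and the default Array sort compares them as text; passing a comparison function that converts to numbers gives the expected order. A minimal sketch:

```javascript
// Cell contents as they would come out of an HTML table: strings.
const cells = ["1", "3", "2", "4", "30", "10"];

// Default sort compares as text, character by character.
const sortedAsText = [...cells].sort();

// A comparison function that converts to numbers first.
const sortedAsNumbers = [...cells].sort((a, b) => Number(a) - Number(b));

console.log(sortedAsText.join(", "));    // 1, 10, 2, 3, 30, 4
console.log(sortedAsNumbers.join(", ")); // 1, 2, 3, 4, 10, 30
```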

Previously, I was using a regular expression to match numeric values. The expression was /^[\d\s]+$/, which leaves a lot to be desired: it only accepts digits and whitespace, so signed and decimal numbers are not recognized. And that is besides the overhead of using a regular expression to begin with.
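A quick way to see where that expression falls short is to run it against a few sample strings. The inputs below are illustrative assumptions, not values from the original table.

```javascript
// The old pattern: one or more digits or whitespace characters, nothing else.
const oldPattern = /^[\d\s]+$/;

console.log(oldPattern.test("1024"));  // true
console.log(oldPattern.test("10.5"));  // false: the decimal point is not allowed
console.log(oldPattern.test("-3"));    // false: the sign is not allowed
console.log(oldPattern.test("   "));   // true: whitespace alone passes
console.log(oldPattern.test("1 2 3")); // true: digits separated by spaces pass
```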

My first thought was to use a string parser and just look for the numbers 0-9 and the +- and decimal characters. However that would attempt to sort a value of '++0124.543.56.4-8;' as a number. I could have written in a check to see if there was exactly one + or - and exactly 1 decimal point, but that would have been a messy bit of code. It may have worked but it would look bad and be low on performance.

Then I thought about some of my VB experience and decided to apply it to the problem. Why try to work out whether the string is a number? Why not just try to convert it and let the JavaScript engine handle it? Odds are the JavaScript engine programmers are way smarter than I will ever be.

With that in mind, I wrote a quick function. And by quick I mean one line. Here it is.

function IsNumeric(checkStr){return checkStr.length > 0 && (checkStr - 0) == checkStr;}

The function IsNumeric takes one input string. The function performs 2 checks on that string, and both must return true in order for the function to return true. First, the input string must have a non-zero length. This is done because a string of length zero should not be considered a number. The second check is a little more complicated.

JavaScript is loosely typed, so I can't declare a variable as a double the way I would in VB. But we can force the JavaScript engine to treat the string as a number by subtracting zero from it. If the conversion to a number fails, the result is NaN, meaning Not a Number. Finally, the now numeric value is compared to the input string to see if they are equal. NaN is never equal to anything, so the comparison fails for input that could not be converted. If the two are equal, true is returned.

I was unable to find a number that returns false or a string that returns true, so this seems pretty solid. If you have a better solution, I would love to hear about it. Please post it in the comments section.
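For anyone who wants to test the function further, here is a small sketch that runs it against a handful of inputs. The sample inputs are illustrative assumptions. Whether the last group should count as numeric depends on the data in the table, since JavaScript's conversion rules are more lenient than most people expect.

```javascript
// The one-line function from above, reformatted for readability.
function IsNumeric(checkStr) {
  return checkStr.length > 0 && (checkStr - 0) == checkStr;
}

// Subtracting zero forces a conversion to a number.
console.log("12.5" - 0);         // 12.5
console.log("abc" - 0);          // NaN

console.log(IsNumeric("12.5"));  // true
console.log(IsNumeric("-0.75")); // true
console.log(IsNumeric("abc"));   // false
console.log(IsNumeric(""));      // false

// Inputs where the lenient conversion rules may surprise:
console.log(IsNumeric(" "));     // true: whitespace converts to 0
console.log(IsNumeric("0x1F"));  // true: hexadecimal notation converts
console.log(IsNumeric("1e3"));   // true: exponent notation converts
console.log(IsNumeric(42));      // false: a number value has no length property
```

Table cell contents normally arrive as strings, so the last case only matters if the function is reused elsewhere.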

Duplicate Images and MD5/SHA Hashes

Mon, 18 Jun 2007 18:27:27 GMT

I was in the middle of backing up some family photos when I discovered that I had multiple copies of the same image in different folders. Not only is this a waste of space, but it can also lead to images getting deleted by accident when you believe you have a copy in another folder, but actually do not.

That raises the question: how in the world do you sort through thousands of photos, making sure that you keep one, and only one, copy of each? If you haven't renamed the files, you may be able to use the file name to weed through the duplicates, but what if you renamed some of the files, or have duplicate names on files? Now it starts getting a lot more complicated.

There are some commercial products you can buy that will do some of this image verification, but I had much bigger plans. In addition to just removing duplicates, I wanted to be able to add tags to images, verify that images that were backed up did not become corrupted, and a host of other things that I will write about later. To start off with, I just wanted to remove duplicates. I decided to do that by creating a checksum for each file.

Think of a checksum as verification that the file is exactly as it is supposed to be. You will often find checksums on files that are being downloaded so that you can verify that the file received was not corrupted during the transfer, or by a cracker who decided to implant a virus in your download.
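As a rough illustration of checking data against a published checksum, here is a minimal sketch in JavaScript for Node.js, using its built-in crypto module. The function name matchesChecksum and the sample data are assumptions made for the example. A real check would read the downloaded file's bytes instead of a literal string.

```javascript
const crypto = require("crypto");

// Return true when the MD5 hash of the data equals the published checksum.
// The comparison ignores upper/lower case in the published hex string.
function matchesChecksum(data, publishedMd5Hex) {
  const actual = crypto.createHash("md5").update(data).digest("hex");
  return actual === publishedMd5Hex.toLowerCase();
}

// Stand-in for a downloaded file and the checksum listed on the download page.
const downloaded = Buffer.from("The quick brown fox jumps over the lazy dog");
const published = "9e107d9d372bb6826bd81d3542a419d6";

console.log(matchesChecksum(downloaded, published)); // true

// Changing a single character produces a different hash, so the check fails.
const corrupted = Buffer.from("The quick brown fox jumps over the lazy cog");
console.log(matchesChecksum(corrupted, published));  // false
```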

There are lots of different types of checksums. 3 of the most popular are CRC32, MD5, and the different SHA functions. The issue with CRC32 is that collisions are very frequent compared to MD5 or the SHA hashes. In the way I am using hashes, a collision is when 2 different files produce the same hash. Since I will be removing any files with the same hash, a collision would result in the deletion of a file that isn't actually a duplicate.

That leaves out CRC32, but what about MD5 and the SHA functions? MD5 hashes have been used for years (and are still used) to store data like passwords for web sites. That is a bad idea, because MD5 hashes of passwords can be broken using several techniques, like plain old brute forcing or rainbow tables. But how will an MD5 or SHA1 hash do at finding duplicate files? Very well, in fact. The probability of a collision for a given MD5 or SHA1 hash is extremely low, though it is not zero. Here is a great article on the subject. Generating MD5 or SHA1 hashes of files is also a pretty quick operation, so both are viable options.

When you get down to it though, I don't like living with an error rate that I can describe without using decimal notation. So since both MD5 and SHA1 checksums can be generated in a second or two, even on a slow machine, why not generate both for each file? The odds of both an MD5 and a SHA1 hash returning identical values for different files are roughly the same as every atom in your body deciding to rearrange itself on Mars. That is an error rate I can live with.

Generating all these hashes manually is a crazy proposition. Thankfully, the .NET Framework that ships with Visual Studio has methods for creating hashes of the most popular types through the System.Security.Cryptography namespace. Using this, I was able to create a quick function that generates an MD5 hash or any of the SHA hashes for a given file. That way, I could create a quick script to loop through all the image files and save their critical data to a database (file location, name, and hashes). Then, it is a simple matter to write a query that returns only unique files, and save those off somewhere while removing the duplicates.
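The overall idea can be sketched in a few lines. This is a Node.js illustration rather than the VB.NET code used for the real job, and it makes several assumptions: the function names are hypothetical, an in-memory Map stands in for the database query, and short strings stand in for image files.

```javascript
const crypto = require("crypto");

// Hash the data with the named algorithm and return lowercase hex.
function hashHex(algorithm, data) {
  return crypto.createHash(algorithm).update(data).digest("hex");
}

// Build one record per file: location plus both hashes.
function describeFile(path, data) {
  return { path, md5: hashHex("md5", data), sha1: hashHex("sha1", data) };
}

// Treat files as duplicates only when BOTH hashes match.
// The first file seen with a given pair of hashes is the one that is kept.
function splitDuplicates(records) {
  const seen = new Map();
  const unique = [];
  const duplicates = [];
  for (const record of records) {
    const key = record.md5 + ":" + record.sha1;
    if (seen.has(key)) {
      duplicates.push({ path: record.path, sameAs: seen.get(key) });
    } else {
      seen.set(key, record.path);
      unique.push(record.path);
    }
  }
  return { unique, duplicates };
}

// Hypothetical contents; a real script would read each image from disk.
const records = [
  describeFile("photos/2006/beach.jpg", Buffer.from("image bytes A")),
  describeFile("backup/IMG_0012.jpg", Buffer.from("image bytes A")),
  describeFile("photos/2006/dog.jpg", Buffer.from("image bytes B")),
];
const result = splitDuplicates(records);
console.log(result.unique);
console.log(result.duplicates);
```

Here the backup copy is reported as a duplicate of the beach photo even though the file names differ, because only the contents are hashed.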

The below VB.NET function called GetHashForFile takes a file location and an enum that represents the hash type to generate. In order to generate both MD5 and SHA1 hashes, just call it twice with different enums. I added the different SHA2 hashes to the function in case someone wishes to generate SHA2 hashes.

The function needs 3 namespaces imported in order to work. Add these to the top of your module or code page if you have not imported them already.

    Imports System.IO
    Imports System.Text
    Imports System.Security.Cryptography

In order to make the function as reusable as possible, an enum was created so that you can't forget which hash types can be created. Add this enum to your code page or module.

    Enum HashType
        MD5 = 1
        SHA1 = 2
        SHA256 = 3
        SHA384 = 4
        SHA512 = 5
    End Enum

Then, add the function itself. The way this is currently written, if there is an error generating the hash for a file (like trying to hash a file that doesn't exist), an empty string will be returned instead of an error being raised. If you want to raise an error that can be handled in the calling code, uncomment the lines that start with "Err.Raise".

    Public Function GetHashForFile(ByVal Filepath As String, ByVal DataHashStandard As HashType) As String
        'Function that can create the most popular hashes.
        'Check that the file exists.
        If Not My.Computer.FileSystem.FileExists(Filepath) Then
            ' Err.Raise(vbObjectError + 13131, , "File " & Filepath & " does not exist") 'uncomment this line if you want to raise an error instead of returning an empty string
            Return ""
        End If

        'Declarations
        Dim sb As StringBuilder = New StringBuilder   'stringbuilder to build the result
        Dim fs As FileStream = Nothing                'file to hash, opened inside the Try block
        Dim HashProvider As HashAlgorithm             'base class shared by all the providers below

        Try
            'Set the hash type
            If DataHashStandard = HashType.MD5 Then               'md5 hash
                HashProvider = New MD5CryptoServiceProvider
            ElseIf DataHashStandard = HashType.SHA1 Then          'sha1 (160 bit) hash
                HashProvider = New SHA1CryptoServiceProvider
            ElseIf DataHashStandard = HashType.SHA256 Then        'sha256 (sha2 256 bit) hash
                HashProvider = New SHA256CryptoServiceProvider
            ElseIf DataHashStandard = HashType.SHA384 Then        'sha384 (sha2 384 bit) hash
                HashProvider = New SHA384CryptoServiceProvider
            ElseIf DataHashStandard = HashType.SHA512 Then        'sha512 (sha2 512 bit) hash
                HashProvider = New SHA512CryptoServiceProvider
            Else
                ' Err.Raise(vbObjectError + 13132, , "Data Hash Standard " & DataHashStandard.ToString & " is not valid") 'uncomment this line if you want to raise an error instead of returning an empty string
                Return ""
            End If

            'Open the file read-only and compute the hash
            fs = New FileStream(Filepath, FileMode.Open, FileAccess.Read)
            Dim hash() As Byte = HashProvider.ComputeHash(fs)

            'Turn the byte array into a lowercase hex string
            For Each b As Byte In hash
                sb.Append(b.ToString("x2"))
            Next

            'Return the result
            Return sb.ToString
        Catch ex As Exception
            'Err.Raise(vbObjectError + 13133, , "Hashing failed with error " & ex.Message) 'uncomment this line if you want to raise an error instead of returning an empty string
            Return ""       'return empty string on error instead of raising an error
        Finally
            'Done with the file, close it. This runs on every exit path.
            If fs IsNot Nothing Then fs.Close()
        End Try
    End Function

Once all 3 sections are in a code page or module, you can use the function like this. Replace 'c:\config.sys' with the file name you want to hash.

    MsgBox(GetHashForFile("c:\config.sys", HashType.MD5))  'pop messagebox with MD5 hash
    MsgBox(GetHashForFile("c:\config.sys", HashType.SHA1)) 'pop messagebox with SHA1 hash

Microsoft Access Corruption

Fri, 07 Feb 2003 19:21:14 GMT

I had a great question come up today on Microsoft Access databases and corruption. Why does an Access database become corrupted, how do you fix it, and how do you prevent the issue from happening in the future?

In order to answer these questions, first there needs to be an explanation of how MS Access works. MS Access at its core is a single user database with lots of features for reporting, form design, and automation. However, the database is still just a file, and not an application. That means there are very few checks in the JET Database Engine that keep users from running over each other when more than one user is using the same database. This makes Access cheaper to purchase (<$200) than something like MS SQL Server for the backend and Visual Studio for a front end (>$2000). MS Access is also much more user friendly, and has a great IDE for designing forms, and linking the form to underlying data. These all make the cost of entry into making an application in MS Access very low, which is why MS Access is so prolific in some environments.

Long story short, MS Access is a great tool for doing ad hoc reporting, short term data recording, or in some instances huge applications where only 1 user is using the application at a time. The ugly corruption issue appears when lots of folks are using MS Access in ways that work, but where a multi-user database application is the better solution. In other words, if more than 1 person is using the database at a given time, MS Access is not the correct solution.

If you already have an MS Access application that keeps getting corrupted, you can do some things to remediate the database itself in the short term. Access databases corrupt in 3 major ways, though some happen more often than others. All the below examples assume that it is imperative that the database stop corrupting, and the correct solution (a multi-user application) is not yet available.

Most Often: Index Corruption
The most common cause of corruption is 2 machines trying to rebuild the same index (clustered or not) at the same time. This breaks the index for that table in Access, and the table becomes unreadable. You can tell this is the root cause when the database becomes corrupt and you can recover all data, forms, and modules except for 1 table. This can be solved by removing every index on the table except the one for the primary key. This will obviously slow the database down, but it will be able to support many users over a network on the same table. Composite keys are the 'best' candidates for becoming corrupted and destroying a table.

2nd Most Often: System Table Corruption
The next most common cause is corruption of the system tables. Objects (tables, forms, modules, reports) are stored in system tables in the Access database. Sometimes the indexes or data in those tables can become corrupted and unreadable. You can tell this has happened because all objects after a certain alphanumeric point are unrecoverable. For example, all tables that start with 'H' or a later letter are unrecoverable. To remediate this, split the database between the data (tables) and 'non-data' (queries, modules, forms, reports, etc). Use links from the 'non-data' to the 'data' database. If corruption of this type still occurs, split the 'data' database further into pieces. If you have to, the 'data' database can be split so that there is only 1 table in each 'data' database, where the 'non-data' database links to multiple 'data' databases.

As part of this solution, each user should have a separate Access database on their machine that contains the queries, forms, reports and modules that links to the 'data' database(s). That way multiple people are not trying to open and modify the data in the system tables.

3rd Most Often: Row Corruption
The last type of corruption is when a single data page gets corrupted. This shows up as a table with a few rows that cannot be accessed, 'phantom' rows that have blank primary keys, or a table that shows '#error' in every cell. It is caused by 2 users trying to update the same row at the same time, and the JET engine not keeping the 2 users separate. This one takes a lot more work to solve, and it is much better to just move to some other solution besides Access. Solving it requires large changes to how data is written to the tables, because the JET engine can't handle the contention itself. I ran into the problem several years ago with an application that was being used by about 70 people simultaneously. I solved it by writing an 'intent' log, similar to how many multi-user databases work.

First, the forms were divorced from the underlying data. The form the user is working with is populated entirely through DAO calls. I used DAO because DAO is actually faster when utilizing the JET Database Engine, which is what MS Access runs on. It is about the only thing DAO does better than ADO.

Then, the user tells an Access form that a row needs an update or addition through a save button or the like. That command is processed through a module, and a row is added to a flat transaction file that is kept on the network somewhere. That row in the transaction log states that a user wants to update or add a row in the database table. The user's system waits 250ms and checks the flat file again to make sure no one else is trying to update the same data. If all is ok, the row is written. If there is contention, the client that asked last removes its intent and retries 500ms later. 250ms was chosen as the wait time because the network latency was about 25ms, and I wanted to be darn sure that network latency wasn't an issue.
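The flow above can be sketched roughly as follows. This is a JavaScript illustration of the idea only, not the VBA/DAO code from the real application. An in-memory array stands in for the flat file on the network, every function and field name is hypothetical, and removing the intent after a successful write is an assumption the description above does not spell out.

```javascript
// In-memory stand-in for the flat transaction file kept on the network.
const intentLog = [];
let nextSequence = 1;

// Record that a client wants to add or update a row.
function addIntent(clientId, table, rowKey) {
  const intent = { seq: nextSequence++, clientId, table, rowKey };
  intentLog.push(intent);
  return intent;
}

// After the wait: did another client ask for the same row before this one?
function hasEarlierContender(intent) {
  return intentLog.some(other =>
    other.clientId !== intent.clientId &&
    other.table === intent.table &&
    other.rowKey === intent.rowKey &&
    other.seq < intent.seq);
}

// Remove an intent, either after the write (assumed) or when backing off.
function removeIntent(intent) {
  const index = intentLog.indexOf(intent);
  if (index !== -1) intentLog.splice(index, 1);
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Full cycle: declare intent, wait 250ms, then write or back off for 500ms.
async function saveRow(clientId, table, rowKey, writeRow) {
  for (;;) {
    const intent = addIntent(clientId, table, rowKey);
    await sleep(250);
    if (!hasEarlierContender(intent)) {
      writeRow();           // the actual write to the table
      removeIntent(intent);
      return;
    }
    removeIntent(intent);   // the client that asked last backs off
    await sleep(500);
  }
}
```

A caller would use something like saveRow("client-7", "Orders", 42, writeFn), where writeFn performs the real update. With two clients asking for the same row, the one that logged its intent first writes, and the other retries after its 500ms pause.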

Recap
Obviously a better solution is to use a multi-user database if more than 1 user will be in a database at a time, and the system is being created or there is time for a rewrite. But using the above methods, I was able to almost entirely eliminate MS Access database corruption on a 70 concurrent user application.