Articles with tag ‘code‘

 
 

SotD - Remove the First N Characters From a Line

I was asked today to remove the first 6 characters from every line in a document. It was known that each line contained at least 6 characters. For me, that means a find and replace. Here is how I did it (in a number of different ways). Can you think of any other ways to do it?

cut -c7-
rr ^.{6}
colrm 1 6
rr s/^.{6}//
sed 's/^.\{6\}//'
perl -pe 's/^.{6}//'
ruby -pe 'sub /^.{6}/,""'
gawk 'BEGIN{FIELDWIDTHS="6 999"}{print$2}'

I was pretty happy with rr taking the prize for fewest characters. I do want to point out that since rr defaults to multi-line and global replacements, the “^” is required. If you wanted to remove the “^” (like you could with sed, perl, and ruby) then you would have to use the “--line” or “-l” and “--notglobal” or “-ng” options. So a single character replaces 7! If you want to grab rr just do:

$ sudo gem install regex_replace

Also, here is a link to gawk - GNU’s awk. This has the very nice FIELDWIDTHS variable, which is extremely useful!

Also keep in mind that the regular expression above, ^.{6}, only works because the lines were known to have at least 6 characters. If we didn’t know that, we would have had to use something like ^.{0,6} to allow up to 6 characters (and even that could be ^.{1,6}, ignoring blank lines, which can’t change). So again the requirements for this challenge were important.
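To make the difference between those patterns concrete, here is a rough JavaScript sketch. The function name stripPrefix and its parameters are made up for illustration; the "g" and "m" flags play the role of rr's global and multi-line defaults.

```javascript
// Sketch: strip the first n characters from every line of a string.
// stripPrefix, n and allowShort are hypothetical names for illustration.
function stripPrefix(text, n, allowShort) {
  // {n} demands exactly n characters; {0,n} also accepts shorter lines.
  var quantifier = allowShort ? "{0," + n + "}" : "{" + n + "}";
  // "g" replaces every match, "m" lets ^ match at the start of each line.
  var pattern = new RegExp("^." + quantifier, "gm");
  return text.replace(pattern, "");
}

var sample = "123456abcdefgh\n123456ijklmnop";

console.log(stripPrefix(sample, 6, false));
// abcdefgh
// ijklmnop

// Without the ^ anchor, a global replace eats every 6-character chunk.
console.log(sample.replace(/.{6}/g, ""));
// gh
// op

// A short line is left alone by {6} but emptied by {0,6}.
console.log(JSON.stringify(stripPrefix("abc\n123456xyz", 6, false))); // "abc\nxyz"
console.log(JSON.stringify(stripPrefix("abc\n123456xyz", 6, true)));  // "\nxyz"
```

The unanchored case is the reason the “^” matters once replacements are global.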

Update!

I came across a neat little unix utility I didn’t know about called “cut”. As you can see, the new command tops the list quite handily too. “cut -c 7-” selects the 7th character onwards from each line (counting starts from 1) and writes that selection to stdout. The first 6 characters are left behind, which accomplishes our goal. In the above list I removed the optional space after “-c” to make it just a tad shorter. Pretty neat.
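For comparison, the same "keep everything from the 7th character on" idea can be sketched in JavaScript. The helper name cutFrom is hypothetical; its start argument is 1-based to mirror cut, while slice() counts from 0.

```javascript
// Sketch: keep each line from the given 1-based character position onward.
// cutFrom is a hypothetical helper, loosely mirroring `cut -c7-`.
function cutFrom(text, start) {
  return text.split("\n").map(function (line) {
    // slice() is 0-based, so position 7 corresponds to index 6.
    return line.slice(start - 1);
  }).join("\n");
}

console.log(cutFrom("123456hello\n123456world", 7));
// hello
// world
```

Lines shorter than the start position simply come out empty.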

Then I found “colrm”, which is the most straightforward. This one wins in simplicity. No hacks; it does exactly what you think it does. Very cool.

Double Update!

Since quite a bit of my rr usage has an empty string for the second argument, I figured it would be okay to throw that in as the default case. So now there is a third usage for rr, which is great for pipes, that just strips something out. That brings rr back up to a tie for first place. Very cool.

Why CDATA Matters in XML

You’ve seen it before, but you may not know what it means. Wikipedia describes CDATA as meaning “Character Data”, which makes sense. W3Schools goes one step further and points out that this text should not be parsed by the XML Parser. The general idea is that when you want to display straight textual data, without needing to encode characters or wanting them interpreted by the parser, you can just wrap that data inside of a CDATA tag.

Needless to say this is clearly the ugliest tag currently in existence (let’s leave room for the future though):

<![CDATA[ ... ]]>

I promised to tell you why it mattered

Yes, words in the title become a promise. You can hold me to that in the future. Why does CDATA matter? Well, I’ll actually side-step the question for a minute and show you what looks like a perfectly fine XHTML document (keep in mind that XHTML is an application of XML):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
  <title>Simple Example</title>
</head>
<body>
  <h1>Welcome</h1>
  <script type="text/javascript">
    document.write('<p>Hello, World</p>');
  </script>
</body>
</html>

Looks simple enough. We can ignore the fact that it’s not really the best way of doing things, but who cares… it’s a Hello World example, right? Well, technically this is not valid XHTML. The validator shows we’ve got a single error:

Line 11, Column 20: document type does not allow element "p" here.
document.write('<p>Hello, World</p>');

The element named above was found in a context where it is not allowed. This could mean that you have incorrectly nested elements — such as a “style” element in the “body” section instead of inside “head” — or two elements that overlap (which is not allowed).

Well, the validator hints at the cause of the error, but it is hard to understand unless you really know your XML! We are inside a “<script>” tag, we’ve got some text, and all of a sudden a “<p>” pops up! An XML Parser doesn’t care that it’s in the middle of a string; it sees another tag, and that tag doesn’t make sense there.

So, here is why CDATA matters. You want your Javascript to be left alone by the XML Parser. In HTML, Javascript is interpreted as text, so it can just be left alone as plain text by wrapping it in a CDATA tag:

<script type="text/javascript">
  <![CDATA[
    document.write('<p>Hello, World</p>');
  ]]>
</script>

Don’t get excited yet. Yes, that passes the W3C validator but the JavaScript fails to run. Why? Well, I actually haven’t got a clue. My guess would be that the CDATA markers are not stripped out before the script reaches the JavaScript engine, so they show up as invalid syntax when it tries to run. In any event, let’s check our steps… Is it valid XML? Check. Is it being served as XML? Well, actually I don’t think so.

Here is a nice resource that talks about Understanding HTML, XML, and XHTML. If you haven’t read that article, either read it now or once you’re done here; it’s important, no matter how old it is. To pull a quote:

to really send xhtml, an xhtml page must be served as xml and therefore have one of the following Content-Type’s (text/xml, application/xml, application/xhtml+xml) to a browser.

This is a simple one-liner in PHP. I added the following code to the top of the page, and re-sent it to my browser:

<?php header('Content-type: application/xhtml+xml'); ?>

Doh. Well, we’ve covered all the bases and it still doesn’t work. It’s valid, it’s sent as XHTML, now everything is left up to the browser and it doesn’t seem to work. If you know why, drop a comment. Again my suspicion is that the browsers don’t properly handle XHTML and CDATA completely. However, there is a pretty nice trick that we can make use of to get this to work and validate (even sent as text/html):

<script type="text/javascript">
  // <![CDATA[
    document.write('<p>Hello, World</p>');
  // ]]>
</script>

Well, there you go. A 100% valid page that runs in all browsers, properly tells the XML Parser “hey, leave these characters alone”, and works. The problem is identifying when this is necessary. For most people, having the original page, which rendered correctly but didn’t validate, would be enough. Browser developers are watching out for you and working around mistakes in HTML and XHTML. However, that isn’t always the case.

Real World Example

Here I’ll pull a real world example. Some XML Specifications allow you to send XML under a different namespace as content inside of an existing XML tag. Some do so in a “pseudo” way. Take a look at the Atom Publishing Protocol (commonly referred to as Atompub or APP for short). Here is a snippet from the RFC describing the Atom Syndication Format, specifically the structure for an atom:title tag with type="html" within an atom:entry:

...
<title type="html">
  Less: &lt;em> &amp;lt; &lt;/em>
</title>
...

If the value of "type" is "html", the content of the Text construct MUST NOT contain child elements and SHOULD be suitable for handling as HTML [HTML]. Any markup within MUST be escaped; for example, "<br>" as "&lt;br>". HTML markup within SHOULD be such that it could validly appear directly within an HTML <DIV> element, after unescaping. Atom Processors that display such content MAY use that markup to aid in its display.

Okay, sorry for the long setup, but we have finally arrived at the point of this post. That type="html" element cannot have child elements. The XML parser will identify child elements based on a “<” character. Assuming whatever project you are working on takes that input from the user, you would have to pass it through a filter, encoding HTML characters like ampersands, less than and greater than signs, the list goes on. That operation is expensive and may even cause problems in itself. I ran into a situation just the other day where an ampersand for an encoded character (like the & inside of an &amp;) was causing errors by itself.

The solution is to make the XML Parsers ignore the data by wrapping it in a CDATA tag. Let’s take the above example and show how it could be done much more easily:
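As a rough idea of what such a filter looks like, here is a minimal JavaScript sketch. The name escapeXml is hypothetical, and it only handles the three characters discussed here. Note how an ampersand that already belongs to an entity gets escaped again, which is exactly the kind of surprise described above.

```javascript
// Sketch of an escaping filter; escapeXml is a hypothetical helper name.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;") // first, so entities created below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeXml("Less: <em> &lt; </em>"));
// Less: &lt;em&gt; &amp;lt; &lt;/em&gt;

// Running the filter twice double-escapes the ampersand.
console.log(escapeXml(escapeXml("&")));
// &amp;amp;
```

The RFC snippet leaves “>” unescaped, which XML allows in text content; the sketch escapes it anyway for symmetry.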

...
<title type="html">
  <![CDATA[ Less: <em> &lt; </em> ]]>
</title>
...

Easier to understand? You betcha. Less costly for developers? Of course. So CDATA is there to help, not hurt. Don’t look at its ugly face and think of it as a hack, look deeper and you will see its purpose and power. Okay, I admit that sounds a little corny, but it could have been worse.
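One caveat if you build these sections from arbitrary input: the sequence “]]>” cannot appear inside a CDATA section, because it ends the section. A common workaround is to split it across two sections. Here is a minimal JavaScript sketch, with wrapInCdata as a hypothetical helper name:

```javascript
// Sketch: wrap arbitrary text in a CDATA section.
// wrapInCdata is a hypothetical helper name.
function wrapInCdata(text) {
  // "]]>" would close the section early, so end the section after "]]"
  // and open a new one that starts with ">".
  var safe = text.split("]]>").join("]]]]><![CDATA[>");
  return "<![CDATA[" + safe + "]]>";
}

console.log(wrapInCdata("Less: <em> &lt; </em>"));
// <![CDATA[Less: <em> &lt; </em>]]>

console.log(wrapInCdata("a]]>b"));
// <![CDATA[a]]]]><![CDATA[>b]]>
```

Everything else, including “<” and “&”, passes through untouched, which is the whole appeal.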

Side note, Javascript

As managers everywhere throw out buzz words like AJAX and encourage you to participate in new web 2.0 project ideas, you’re going to end up sending and receiving XML requests with a server using the good old XMLHttpRequest object. Well, if encoding isn’t enough of a problem (and I’m still wrapping my head around it), you might get struck with a problem like the above case and want to make use of your knowledge of CDATA.

Well, you’re in luck. xmlDocument.createCDATASection(…) is part of the Level 2 DOM ECMAScript Spec. Use it just like a createTextNode():

//
// Create an atom:title element with html content
// assume xmlDocument is already an XML Document object
// and entry is an atom:entry element in that document
//
// <entry>
//   <title type="html"><![CDATA[<em> &lt; </em>]]></title>
// </entry>
//
var ATOM_NS = "http://www.w3.org/2005/Atom";
var node = xmlDocument.createElementNS(ATOM_NS, "title");
node.setAttribute("type", "html");
var cdata = xmlDocument.createCDATASection("<em> &lt; </em>");
node.appendChild(cdata);
entry.appendChild(node);

Now all I have to learn is encoding, and how each browser deals with it differently. That is an entirely new realm that I don’t expect to cover in a single week, but I’ll report back with my findings. Until then, don’t get caught up on the little things!

JavaScript Sort an Array of Objects

I ran into an interesting problem today. I had an array of objects that I wanted sorted on a certain property. My obvious thought didn’t work! (Update: I got a comment below from Peter Michaux, who points out a nicer solution; it is included here.)

// Array of Objects
var obj_arr = [ { age: 21, name: "Larry" },
                { age: 34, name: "Curly" },
                { age: 10, name: "Moe" } ];

// This doesn't work! The compare function returns a boolean,
// but sort() expects a negative, zero, or positive number.
obj_arr.sort( function(a,b) { return a.name < b.name; });

// This does work! (Peter's update, very fast)
obj_arr.sort(function(a,b) { return a.name < b.name ? -1 :
                                    a.name > b.name ?  1 : 0; });

That kind of frustrated me. Sorting is one of those things I expect to be available in all languages. I don’t want to have to write a sorting algorithm every time I need to sort. So I looked into things, pulled up a Javascript Quicksort Algorithm and manipulated it to support any compare function.

Now I have the freedom to truly write a compare function that works for objects! I also changed around certain parts of the code I found online to actually extend the Array class and make the extra functions hidden. Take a look at the sample usage:

// Defaults to (a<=b) sorting.  Great for numbers.
var arr = [1234, 2346, 21234, 3456, 32134, 3456, 1234, 2345, 23, 42523, 1234, 345];

// Object Array
var obj_arr = [ { age: 21, name: "Larry" },
                { age: 34, name: "Curly" },
                { age: 10, name: "Moe" } ];

arr.quick_sort();
// => [23, 345, 1234, 1234, 1234, 2345, 2346, 3456, 3456, 21234, 32134, 42523]

obj_arr.quick_sort(function(a,b) { return a.name < b.name });
// => Curly, Larry, Moe

obj_arr.quick_sort(function(a,b) { return a.age < b.age });
// => Moe (10), Larry (21), Curly (34)

For those who want to see the code, be glad: it’s free. I carried the copyright with it but it’s rather loose. Grab the JavaScript Source Here! Enjoy:

Array.prototype.swap=function(a, b) {
  var tmp=this[a];
  this[a]=this[b];
  this[b]=tmp;
}

Array.prototype.quick_sort = function(compareFunction) {

  function partition(array, compareFunction, begin, end, pivot) {
    var piv = array[pivot];
    array.swap(pivot, end-1);
    var store = begin;
    for (var ix = begin; ix < end-1; ++ix) {
      if ( compareFunction(array[ix], piv) ) {
        array.swap(store, ix);
        ++store;
      }
    }
    array.swap(end-1, store);
    return store;
  }

  function qsort(array, compareFunction, begin, end) {
    if ( end-1 > begin ) {
      var pivot = begin + Math.floor(Math.random() * (end-begin));
      pivot = partition(array, compareFunction, begin, end, pivot);
      qsort(array, compareFunction, begin, pivot);
      qsort(array, compareFunction, pivot+1, end);
    }
  }

  if ( compareFunction == null ) {
    compareFunction = function(a,b) { return a<=b; };
  }
  qsort(this, compareFunction, 0, this.length);

}

Update

Peter Michaux pointed out something very important. The sort() function can be made to work if it returns numeric output (-1,0,1). His approach is far superior. Here was a benchmark I took:

var obj_arr1 = [];
var obj_arr2 = [];
var filler = [ { age: 21, name: "Larry" },
               { age: 34, name: "Curly" },
               { age: 10, name: "Moe" } ];
for (var i=0; i<5000; i++) {
  var rand = Math.floor( Math.random() * 3 );
  obj_arr1.push( filler[rand] );
  obj_arr2.push( filler[rand] );
}

var s = new Date();
obj_arr1.sort(function(a,b) { return a.name < b.name ? -1 : a.name > b.name ? 1 : 0; });
var e = new Date();
console.log(e.getTime()-s.getTime()); // => 75 ms

s = new Date();
obj_arr2.quick_sort(function(a,b) { return a.name < b.name });
e = new Date();
console.log(e.getTime()-s.getTime()); //  => 4444 ms

That shows drastic differences for arrays as large as 5000 elements (with not too random data). 75 ms versus 4444 ms (over 4 seconds). Doing the math: (4444/75) => 59.253 times better! Moral of the story, don’t rush into thinking something doesn’t exist!
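The reason the built-in sort() needs numeric output comes down to how the return value is read: negative means "a first", positive means "b first", zero means "equal". A boolean is coerced to 1 or 0, so it can never say "a first". Here is a small sketch of that, using the stooges from above; the function names booleanCompare and byName are made up for illustration.

```javascript
var stooges = [ { age: 21, name: "Larry" },
                { age: 34, name: "Curly" },
                { age: 10, name: "Moe" } ];

// Boolean comparator: true coerces to 1, false to 0, never negative.
function booleanCompare(a, b) { return a.name < b.name; }

// Three-way comparator: negative, zero, or positive.
function byName(a, b) {
  return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
}

var curly = stooges[1], larry = stooges[0];

// "Curly" < "Larry" is true, which sort() reads as 1: "Curly goes after Larry".
console.log(Number(booleanCompare(curly, larry))); // 1
// The reverse call returns false, which sort() reads as 0: "they are equal".
console.log(Number(booleanCompare(larry, curly))); // 0

console.log(byName(curly, larry)); // -1

var names = stooges.slice().sort(byName).map(function (s) { return s.name; });
console.log(names.join(", ")); // Curly, Larry, Moe
```

So the boolean version never signals "a first", and when it does signal an order it is the wrong one, which is why the result of the first attempt looked unsorted.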

So if that’s the way to do it, then I want to make it easier on me. My arrays are generally going to be under 100 in size, and at such a size building a function dynamically instead of writing a custom function works just about as well (although if you were using objects, polymorphism and a compare function would be the best way to go). Here is a simple function I can use to more quickly build compare functions that sort an array in ascending order on multiple properties!

function buildCompareFunction(arr) {
  if (arr && arr.length > 0) {
    return function(a,b) {
      var asub, bsub, prop;
      for (var i=0; i<arr.length; i++) {
        prop = arr[i];
        asub = a[prop];
        bsub = b[prop];
        if ( asub < bsub )
          return -1;
        if ( asub > bsub )
          return 1;
      }
      return 0;
    }
  } else {
    return function(a,b) { return a < b ? -1 : a > b ? 1 : 0; };
  }
}

Sample usage would be:

var obj_arr = [
  { name: 'Joe',   age: 20 },
  { name: 'Joe',   age: 10 },
  { name: 'Joe',   age: 30 },
  { name: 'Joe',   age: 40 },
  { name: 'Joe',   age: 20 },
  { name: 'Joe',   age: 15 },
  { name: 'Joe',   age: 35 },
  { name: 'Joe',   age: 25 },
  { name: 'Bill',  age: 5 },
  { name: 'Barry', age: 20 },
  { name: 'Paul',  age: 20 },
  { name: 'Peter', age: 1 },
  { name: 'Smith', age: 25 },
  { name: 'Kary',  age: 30 }
];

obj_arr.sort( buildCompareFunction(['name','age']) );
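To see the resulting order, here is a self-contained sketch on a smaller array. It uses a compact restatement of the same idea under the hypothetical name compareBy, so that it runs on its own.

```javascript
// Sketch: compact multi-property comparator, ascending on each property.
// compareBy is a hypothetical stand-in so this example runs standalone.
function compareBy(props) {
  return function (a, b) {
    for (var i = 0; i < props.length; i++) {
      var x = a[props[i]], y = b[props[i]];
      if (x < y) return -1;
      if (x > y) return 1;
      // equal on this property: fall through to the next one
    }
    return 0; // equal on every listed property
  };
}

var people = [
  { name: "Joe",   age: 20 },
  { name: "Bill",  age: 5 },
  { name: "Joe",   age: 10 },
  { name: "Barry", age: 20 }
];

var labels = people.sort(compareBy(["name", "age"])).map(function (p) {
  return p.name + " " + p.age;
});
console.log(labels.join(", "));
// Barry 20, Bill 5, Joe 10, Joe 20
```

The two Joes tie on name, so the age property decides their order.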

Dot Files For Your Shell and Even Ruby

I have come across a number of programmers who don’t know what dotfiles are. That’s a shame. Every programmer should know common dotfiles and actively seek to add aliases, functions, and other tidbits to increase their productivity and make the shell more usable. I recently went on a binge and updated a few of my dotfiles. I’ll share some useful tricks that I found for not only my ~/.bashrc file but also ~/.irbrc for my Ruby IRB prompt!

I won’t show off my entire ~/.bashrc file, only because it is rather long. I just want to get across some of the usefulness of such a file:

# -----------
#   General
# -----------
alias ..='cd ..'
alias ll='ls -lh'
alias la='ls -la'
alias ps='ps -ax'
alias du='du -hc'
alias cd..='cd ..'
alias more='less