Wednesday, May 17, 2017

Unix sed, Python CsvKit under Windows Scripting Host (WSH); Problems and solutions

This is just a short article describing a problem I encountered in one of my projects and what I found, in the hope that it can be useful to someone following a similar path. A short side track: nerds, or people like me who work and share knowledge, are often overlooked as being anti-social and as contributing less to society through charity or volunteering. I, for one, would like to pass on the message that we, the nerds and genuine hard workers, should be proud of our contribution to society and to making the world a better place.

Back to the topic.

I was involved in a Windows automation project and was writing most of my code in VBA. However, given my knowledge of the Unix way of doing things, it made total sense for me to want to run some tasks through Unix utility programs. As we all know, the Unix way basically means the command line. Fortunately, Windows did not forgo access to the command line. For VBA, there is the general Shell() function. But that is not the only option. With Microsoft's COM infrastructure, a VBA programmer has access to a wide variety of objects. One of them is the Windows Scripting Host. Through the Windows Scripting Host, a programmer can have better control of the DOS-shell/command-line environment.

While I was happily using the Windows Scripting Host to carry out my command-line tasks, I noticed that most Windows/DOS-based programs ran fine, but I ran into trouble as soon as I tried to run some Unix utilities that had been ported to the Windows/DOS world.

Before I go on, I would like to point out that I did not spend a lot of time trying to figure out every single issue, so please bear with me for not being able to provide all possible solutions and explanations.

One Unix utility I used is the sed command from MinGW/MSYS. I was able to verify that it works by running some tests directly at the Windows/DOS command line, after removing some conflicting search paths from the PATH environment variable; on my system, for example, I also have Qt and GNAT installed.

After verifying sed at the Windows/DOS command line, I invoked it through the Windows Scripting Host via VBA. The command failed with a return code of 2, which, for sed, could just mean errors during execution or, for DOS, could mean file not found. By testing with a non-existent command, I confirmed that the Windows Scripting Host does recognize the sed command. By testing with a simple 'sed -help 2>file', I realized it may have something to do with the shell's interpretation/parsing of the command line. This led me to the idea of using Cmd.exe /C to run sed. By running 'Cmd.exe /C sed ...  ' under the Windows Scripting Host, everything worked. The other commands where I ran into the same problem are the Python CsvKit commands. Again, running them through Cmd.exe solved the problem. At this point, I can't say I fully understand the problem. My hunch is that it has something to do with the redirection of standard output, since all my Unix and Python commands used redirection. That would be consistent with redirection being a feature of the command interpreter (Cmd.exe) rather than of the program being launched.

Monday, March 20, 2017

A better replication function: rep() for R

This article proposes a new approach to the R replication function rep(). At this point, I am not fluent in creating R packages and will not create one for a while, but I will try to provide the code I have in mind.

First of all, let's review the current implementation of the R replication function rep() in my own words:
Writing the call as rep(Vctr, times = Vctr1, each = n, length.out): each element in Vctr is repeated n times if Vctr1 is not given. Otherwise, the number of times each element is repeated is specified by Vctr1. However, if 'times' is a single number, the whole of Vctr is repeated that number of times. When length.out is given, 'times' is disregarded: each element is repeated n times, and the result is repeated again until length.out is reached.
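For reference, the behaviour of the existing rep() described above can be checked directly in base R. This is only a small illustration; the vector 1:3 and the repeat counts are arbitrary choices made for the example.

```r
x <- 1:3

# 'times' as a single number: the whole vector is repeated
a <- rep(x, times = 2)                  # 1 2 3 1 2 3

# 'times' as a vector: one repeat count per element
b <- rep(x, times = c(1, 2, 3))         # 1 2 2 3 3 3

# 'each' as a number: every element is repeated that many times
c1 <- rep(x, each = 2)                  # 1 1 2 2 3 3

# 'each' with 'length.out': the expanded sequence is recycled to that length
d <- rep(x, each = 2, length.out = 8)   # 1 1 2 2 3 3 1 1

# 'times' with 'length.out': 'times' is ignored
e <- rep(x, times = 5, length.out = 4)  # 1 2 3 1

print(a); print(b); print(c1); print(d); print(e)
```

Note how a single number means "repeat the whole vector" for 'times' but "repeat every element" for 'each', which is the asymmetry discussed next.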
From my description above, I see that when 'times' is a vector, its elements control how the corresponding elements in Vctr are repeated. But this is not so when 'times' is reduced to a single number: it then controls the number of times the whole 'Vctr' is repeated. On the other hand, 'each', as a number, also controls the number of times each element in Vctr is repeated. With this information, it is only logical for me to want to reconsider the situation when 'times' is just a number. It seems more logical to me to treat this as a special case where all elements in Vctr are to be repeated the same number of 'times', i.e.
    Vctr1= 5 := c(5, 5, 5, ...).
With this equivalence, I would also like to propose switching the meanings of 'times' and 'each', so that 'each' now describes how many times each of the elements in Vctr should be repeated. With the meaning of the new 'each' settled, we should now reconsider the meaning of the new 'times'.

The newly proposed meaning for 'times' is the number of times to repeat the sequence generated by 'each'. The new meaning for length.out is the maximum length of the eventual output.

With the proposed changes outlined above, here are a few examples to demonstrate the output of the proposed new rep() function.

[1] 1 1 2 2 3 3 4 4 5 5

[1] 1 1 2 2 3 3 4 4 5 5 1 1 2

[1] 1 2 2 3 3 3 4 4 5

[1] 1 2 2 3 3 3 4 4 5 1 2 2 3
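The calls that produced the four outputs above are not shown. One reading consistent with the proposal, and it is only an assumption, is an input of 1:5 with 'each' set to 2 or to c(1, 2, 3, 2, 1), plus 'times' of 2 and a length.out of 13 for the two longer outputs. Under that assumption, the same vectors can be built step by step with the existing rep():

```r
x <- 1:5   # assumed input vector

# new 'each' = 2, treated as c(2, 2, 2, 2, 2)
e1 <- rep(x, times = rep(2, length(x)))

# same, then new 'times' = 2, capped at an assumed length.out of 13
e2 <- rep(e1, times = 2)[seq_len(13)]

# new 'each' as a vector, one count per element (assumed values)
e3 <- rep(x, times = c(1, 2, 3, 2, 1))

# same, then new 'times' = 2, capped at an assumed length.out of 13
e4 <- rep(e3, times = 2)[seq_len(13)]

print(e1); print(e2); print(e3); print(e4)
```

Each printed vector matches the corresponding output listed above.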

Can we create everything that can be done with the old rep() function? Yes!
Can we create something that is not possible with the old rep() function? Yes.

Possible algorithm:
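new_rep <- function(Vctr, each = 1, times = 1, length.out = NA) {
    # named new_rep because new-rep is not a valid R name
    if (length(each) == 1) {                     # 'each' is just a number
        each <- rep(each, times = length(Vctr))  # expand 'each'
    }
    Rslt <- rep(Vctr, times = each)    # repeat each element by its own count
    Rslt <- rep(Rslt, times = times)   # repeat the resulting sequence
    Lngth <- length(Rslt)
    if (is.numeric(length.out)) {
        if (length.out < Lngth) Lngth <- length.out  # length.out caps the output
    }
    Rslt[seq_len(Lngth)]               # return at most Lngth elements
}

I would appreciate your thoughts, and perhaps someone's help in creating a package.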