IT tidbit: R

Monday, October 30, 2017

Sweet and short R set-diff code

I was working on a project in R that made extensive use of the 'names' property of a vector. Basically, I use the vector to build a menu from which the user picks the desired choice. To give the user enough information, several lines of description are embedded in the names property.

All of this worked well until I let the user repeatedly pick items to be removed from the list/vector.

The reasonable approach is to use R's set operation setdiff() to remove the picked items from the original vector/list. Unfortunately, setdiff() removes all names from the result.

Here's my solution:
v1 <- v1[!is.element(v1, Picked_items)]

In this way, all names are retained.
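
A minimal sketch of the difference, using a made-up menu vector (the names Menu, Picked, Plain and Kept are illustrative only, not from my project):

```r
# Hypothetical menu: the values are the choices, the names carry the descriptions.
Menu <- c(a = "apple", b = "banana", c = "cherry")
Picked <- c("banana")

# setdiff() returns the remaining values but drops their names.
Plain <- setdiff(Menu, Picked)
names(Plain)   # NULL

# Logical indexing keeps the names of whatever survives.
Kept <- Menu[!is.element(Menu, Picked)]
names(Kept)    # "a" "c"
```

Menu[!(Menu %in% Picked)] is an equivalent spelling. One more difference worth knowing: setdiff() also removes duplicate values, while indexing leaves them in place.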

Don't you just love R - no loop is needed.

Monday, March 20, 2017

A better replication function: rep() for R

This article proposes a new approach to the R replication function rep(). At this point I am not fluent in creating R packages and will not create one for a while, but I will try to provide the code I have in mind.

First of all, let's review the current implementation of the R replication function rep() in my own words:
rep(Vctr,times=Vctr1,each=n,length.out=N)

Each element in Vctr is repeated n times when 'each' is given. When 'times' is a vector (Vctr1), it specifies how many times each corresponding element is repeated. However, if 'times' is a single number, the whole Vctr is repeated that number of times. When length.out is given, 'times' is disregarded: each element is repeated n times and the sequence is repeated again until length.out is reached.
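
A few calls to the existing rep() illustrate the behavior described above (the variable names are only for this example):

```r
ByEach  <- rep(1:3, each = 2)                  # 1 1 2 2 3 3
ByVctr  <- rep(1:3, times = c(1, 2, 3))        # 1 2 2 3 3 3
ByTimes <- rep(1:3, times = 2)                 # 1 2 3 1 2 3
ByLen   <- rep(1:3, each = 2, length.out = 8)  # 1 1 2 2 3 3 1 1
```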

From my description above, I see that when 'times' is a vector, its elements control how many times the corresponding elements in Vctr are repeated. But not when 'times' is reduced to a single number: it then controls the number of times the whole Vctr is repeated. On the other hand, 'each', as a number, also controls the number of times each element in Vctr is repeated. With this in mind, it is only logical for me to reconsider the situation when 'times' is just a number. It seems more logical to treat this as a special case where all elements in Vctr are repeated the same number of times, i.e.
Vctr1 = 5 := c(5, 5, 5, ...).
With this equivalence, I would also like to propose switching the meanings of 'times' and 'each', so that 'each' now describes how many times each of the elements in Vctr should be repeated, whether it is given as a number or as a vector. With the meaning of the new 'each' settled, we should now reconsider the meaning of the new 'times'.

The newly proposed meaning of 'times' is the number of times to repeat the sequence generated by 'each'. The new meaning of length.out is the maximum length of the final output.

With the proposed changes outlined above, here are a few examples to demonstrate the new function. It is written new_rep() here because new-rep is not a valid R name (R reads it as new minus rep).

> new_rep(1:5, each=2, times=1, length.out=13)
[1] 1 1 2 2 3 3 4 4 5 5

> new_rep(1:5, each=2, times=2, length.out=13)
[1] 1 1 2 2 3 3 4 4 5 5 1 1 2

> new_rep(1:5, each=c(1,2,3,2,1), times=1, length.out=13)
[1] 1 2 2 3 3 3 4 4 5

> new_rep(1:5, each=c(1,2,3,2,1), times=2, length.out=13)
[1] 1 2 2 3 3 3 4 4 5 1 2 2 3

Can we create everything that can be done with the old rep() function? - yes!
Can we create something that is not possible to create with the old rep() function? - yes

Possible algorithm:

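# Named new_rep because new-rep is not a valid R identifier (R parses it as new - rep).
new_rep <- function(Vctr, each = 1, times = 1, length.out = NULL) {
    if (length(each) == 1) {                     # 'each' is just a number
        each <- rep(each, times = length(Vctr))  # expand it to one count per element
    }
    Rslt <- rep(Vctr, times = each)    # repeat every element by its own count
    Rslt <- rep(Rslt, times = times)   # then repeat the whole sequence
    Lngth <- length(Rslt)
    if (is.numeric(length.out)) {
        if (length.out < Lngth) Lngth <- length.out  # length.out caps the output length
    }
    Rslt[seq_len(Lngth)]
}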
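
As a cross-check that does not depend on the function above, the same steps can be composed from the existing rep() plus head(), which truncates without padding. This is only a sketch to verify the example outputs; the variable names are made up.

```r
# Step 1: repeat every element by its own count.
Inner <- rep(1:5, times = c(1, 2, 3, 2, 1))   # 1 2 2 3 3 3 4 4 5

# Step 2: repeat the whole sequence, then cap the length at 13.
Twice <- head(rep(Inner, times = 2), 13)      # 1 2 2 3 3 3 4 4 5 1 2 2 3

# When the result is shorter than the cap, head() returns all of it.
Once <- head(rep(Inner, times = 1), 13)       # 1 2 2 3 3 3 4 4 5
```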

I would appreciate your thoughts, and perhaps someone turning this into a package.

Monday, February 6, 2012

RODBC sqlSave() autonumber auto-increase problem

This is a quick note on R.

The RODBC package for R is a popular add-on for accessing ODBC databases. Its sqlSave() function is a common way to save a data.frame to an ODBC table. For a table with an autonumber (auto-increment) field, this function reports errors. With the verbose option of that function turned on, I observed that sqlSave() constructs a SQL statement that ALWAYS inserts values into all table fields. With this SQL statement, the user is forced to assign a value to the auto-increment field, which, by definition, should not be assigned a value from an external source.

With some database implementations, you might be able to assign a null value to the auto-increment field and get it to work. But, as far as I know, MS Access will not accept that.

I tried to find the source code for the RODBC package but was unsuccessful. I did find that the problem can be overcome by writing a function that writes the data.frame to the database with sqlQuery(). However, this may take longer to process the data. The ideal solution is to modify sqlSave() so that it is more flexible.

I also ran into problems with sqlSave() when dealing with two MS Access memo fields. Again, this can be solved with the sqlQuery() function.
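
A rough sketch of that workaround follows. The helper build_insert_sql(), the table name Customers and the DSN "MyDsn" are all hypothetical, and the quoting is deliberately simplistic (numbers as-is, NA as NULL, everything else as a single-quoted string). The idea is to build one INSERT per row that names only the columns in the data.frame, so the autonumber column is left to the database.

```r
# Hypothetical helper: one INSERT statement per row of Df, listing only
# the columns that exist in Df (the autonumber column is simply omitted).
build_insert_sql <- function(Tbl, Df) {
    quote_value <- function(V) {
        if (is.na(V)) return("NULL")
        if (is.numeric(V)) return(as.character(V))
        # Escape embedded single quotes by doubling them.
        paste0("'", gsub("'", "''", as.character(V), fixed = TRUE), "'")
    }
    Cols <- paste(names(Df), collapse = ", ")
    vapply(seq_len(nrow(Df)), function(I) {
        Vals <- vapply(Df[I, , drop = FALSE], quote_value, character(1))
        sprintf("INSERT INTO %s (%s) VALUES (%s)",
                Tbl, Cols, paste(Vals, collapse = ", "))
    }, character(1))
}

Df <- data.frame(Name = c("Ann", "O'Brien"), Age = c(31, NA),
                 stringsAsFactors = FALSE)
Sql <- build_insert_sql("Customers", Df)

# Sending the statements through RODBC (not run here; needs a real DSN):
# library(RODBC)
# Chnl <- odbcConnect("MyDsn")            # "MyDsn" is a placeholder
# for (Qry in Sql) sqlQuery(Chnl, Qry)
# odbcClose(Chnl)
```

Issuing one statement per row is the reason this route may be slower than sqlSave().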

Tuesday, March 30, 2010

An introduction to R statistics

Recently, I spent some time learning the R environment. It took me a little while to get it, so I would like to describe the system in my own way and hope it helps those whose brains are wired like mine. The intention is not to cover the details of R, only the basics.

Data in the R environment are objects that can have properties, but these objects do not have methods. So we can say they are more like C structures than full-featured objects. Also, R does not support the Object.Property or Object.method() syntax; instead, the dot (.) is an allowable character in identifiers. Properties are read and set, and the work of methods is done, through functions. So, the bottom line is that R objects are like C structures. With this view in mind, we can better understand the limitations of R and how it is constructed.

With this approach to objects, a function can be made to operate on multiple types of objects by checking the type of the object it receives. In R, the properties of the basic types are documented. The intrinsic properties are mode, length and class, and they are accessed through dedicated functions: mode(), length() and class(). Other properties/attributes are accessed through attr(). The list of attributes can be viewed with attributes().

R supports vector operations in its syntax. For example, A*B multiplies two vectors element by element. This makes R an ideal tool for expressing matrix operations and for carrying out operations on tabulated data, such as those in linear algebra and statistical surveys.
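
A small sketch of these accessor functions on a made-up vector (the attribute name "unit" is arbitrary, chosen only for this example):

```r
V <- c(a = 1.5, b = 2.5)

Md <- mode(V)      # "numeric"
Ln <- length(V)    # 2
Cl <- class(V)     # "numeric"

# Other attributes go through attr(); names is one of them.
Nm <- attr(V, "names")     # "a" "b"
attr(V, "unit") <- "cm"    # attach a new, user-defined attribute

# attributes() returns all of them as a list (here: names and unit).
All <- attributes(V)
```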
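
For instance, with two small made-up vectors (note that * works element by element, while the matrix product uses the separate %*% operator):

```r
A <- c(1, 2, 3)
B <- c(4, 5, 6)

Prod  <- A * B        # 4 10 18, element by element
Shift <- A + 10       # 11 12 13, the single number is recycled
Dot   <- sum(A * B)   # 32, the dot product

M  <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
MM <- M %*% M                 # matrix product: rows (7, 15) and (10, 22)
```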

Basic Data Type: Vector
The simplest R object type is the vector, which is an ordered collection of components of the same kind. Components can be numeric, complex, character, logical and others, and any component may be NA (missing). Vectors have the mode property, where mode can be numeric, complex, character, logical and others. You use the mode() function to access the mode property. The other property of a vector is its length, or number of components, which is accessed through the length() function. The names property is also supported. The names property is itself a vector and gives each component a name. Components can be referred to by either integer indexes or their names.
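
A short illustration with a made-up named vector:

```r
Scores <- c(math = 90, art = 75, music = 82)

Md <- mode(Scores)      # "numeric"
Ln <- length(Scores)    # 3
Nm <- names(Scores)     # "math" "art" "music"

ByIndex <- Scores[2]       # art 75
ByName  <- Scores["art"]   # the same component, picked by name

# Components must be of the same kind, so mixed input is coerced.
Mixed <- c(1, "two", TRUE)
MixedMode <- mode(Mixed)   # "character"
```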

Basic Data Type: Factor
A factor is a vector object with the levels property. The levels property is a vector of the unique values of the original vector. This gives those values an order.

R objects also have a property called class. The class of a vector is usually its mode (an integer vector is an exception: its mode is 'numeric' but its class is 'integer'). The class of a factor is 'factor'. The class property can be accessed via the class() function.
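
A brief sketch with a made-up factor (the level order is given explicitly here; by default R sorts the unique values):

```r
Size <- factor(c("low", "high", "low", "mid"),
               levels = c("low", "mid", "high"))

Lv    <- levels(Size)       # "low" "mid" "high"
Codes <- as.integer(Size)   # 1 3 1 2, each value's position in levels
ClF   <- class(Size)        # "factor"

ClNum <- class(c(1.5, 2))   # "numeric"
ClChr <- class("a")         # "character"
```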

Basic Data Type: Array
An array is a vector object with the dim property. dim is a vector of positive integers whose components give the size of each dimension. A matrix is an array of two dimensions. The mode of an array is the same as the mode of its components. The class of an array is 'array' (a two-dimensional one reports 'matrix', together with 'array' in R 4.0 and later). An array can be created by combining vectors or by setting a vector's dim property.
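
Both ways of creating an array, sketched with made-up data:

```r
# Setting dim on a plain vector turns it into a 2 x 3 matrix,
# filled column by column.
V <- 1:6
dim(V) <- c(2, 3)
Corner <- V[2, 3]        # 6

# A three-dimensional array built directly.
A <- array(1:24, dim = c(2, 3, 4))
DimA <- dim(A)           # 2 3 4
ClA  <- class(A)         # "array"
MdA  <- mode(A)          # "numeric", same as its components

# Combining vectors as columns.
M <- cbind(1:2, 3:4)
DimM <- dim(M)           # 2 2
```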

Basic Data Type: List
By combining objects of different types in an ordered collection, we create a list object. A list can have a names property, a vector of mode character that gives each list component a name. List objects have the class 'list'.
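
For example, with a made-up record:

```r
Rec <- list(name = "Ann", scores = c(90, 75), passed = TRUE)

Cl <- class(Rec)            # "list"
Nm <- names(Rec)            # "name" "scores" "passed"

ByDollar <- Rec$scores      # 90 75
ByDouble <- Rec[["scores"]] # the same component
Sub      <- Rec["scores"]   # single brackets give back a (shorter) list
```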

Basic Data Type: data.frame
A data.frame object is an extension of the list object, with a restriction placed on the size of the list components so that the data.frame resembles a table in which every column has the same number of values. The list components can be vectors, factors, matrices or lists. A data.frame has the class 'data.frame'.
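
A small made-up table shows the list-like nature:

```r
Df <- data.frame(id  = 1:3,
                 grp = factor(c("a", "b", "a")),
                 val = c(1.5, 2.5, 3.5))

Cl     <- class(Df)          # "data.frame"
IsLst  <- is.list(Df)        # TRUE, it is still a list underneath
ColLen <- sapply(Df, length) # every column has 3 values
Shape  <- dim(Df)            # 3 rows, 3 columns

Col  <- Df$val               # 1.5 2.5 3.5, a column picked like a list component
Cell <- Df[2, "val"]         # 2.5, a cell picked like a matrix element
```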

Useful Function/Operators
Environmental
getwd(), setwd(), objects(), ls(), library()

Constructors
Bgn:End (colon), c(), vector(), factor(), list(), data.frame(), matrix(), cbind(), rbind()

Casting
as.vector(), as.factor() ...

Indexing
[], [[]], $

[] returns the same type
Vctr[ NdxVctr ], Vctr[ NmsVctr ], Vctr[ LgcVctr ] ...
Mtrx[ NdxVctr1, NdxVctr2 ] ...
[[ Ndx ]] and $Name return the component itself.
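
The indexing forms can be tried on a few made-up objects:

```r
Vctr <- c(a = 10, b = 20, c = 30)
ByNdx <- Vctr[c(1, 3)]        # a 10, c 30  (integer indexes)
ByNms <- Vctr[c("a", "c")]    # the same, by names
ByLgc <- Vctr[Vctr > 15]      # b 20, c 30  (logical vector)

Mtrx <- matrix(1:6, nrow = 2) # filled column by column
Row2 <- Mtrx[2, c(1, 3)]      # 2 6: row 2, columns 1 and 3

Lst <- list(x = 1:3, y = "text")
Sub  <- Lst["x"]              # still a list
Cmp1 <- Lst[["x"]]            # 1 2 3, the component itself
Cmp2 <- Lst$x                 # same as [["x"]]
```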

Other
assign(), <-, ->, grep(), function(), tapply(), is.na(), is.vector(), summary(), names()
& (element-wise), | (element-wise), &&, ||
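
A few of these in action on made-up data. The element-wise operators work on whole vectors, while && and || are meant for single TRUE/FALSE values such as an if condition.

```r
X <- c(5, NA, 12, 8)

Miss <- is.na(X)               # FALSE TRUE FALSE FALSE
Big  <- (X > 6) & !is.na(X)    # FALSE FALSE TRUE TRUE, element-wise

# && combines two single logical values.
Ok <- length(X) > 0 && !is.na(X[1])   # TRUE

# grep() returns the indexes of the matching components.
Hit <- grep("an", c("banana", "cherry", "mango"))   # 1 3

# tapply() applies a function to each group of values.
Sums <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), sum)   # a 40, b 60
```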

Control Structure

if (Exp1) Exp2 else Exp3
ifelse( LgcVctr, TrVctr, FlsVctr) returns a vector with components from TrVctr and FlsVctr based on LgcVctr
for ( Ndx in Vctr) { Expr... }
while (Cndtn) { Expr ... }
break
Vrbl <- function (Arg1, Arg2, ...) { Expr ... }
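
A compact sketch that exercises each of these structures (all names are made up for the example):

```r
Vctr <- c(3, -1, 4, -5)

# ifelse() chooses component by component.
Sgn <- ifelse(Vctr > 0, "pos", "neg")   # "pos" "neg" "pos" "neg"

# A user-defined function.
Sqr <- function(Arg1) { Arg1 * Arg1 }

# for with if: add up the squares of the positive components.
Ttl <- 0
for (Ndx in Vctr) {
    if (Ndx > 0) Ttl <- Ttl + Sqr(Ndx)
}
# Ttl is now 9 + 16 = 25

# while with break: stop once the counter reaches 3.
Cnt <- 0
while (TRUE) {
    Cnt <- Cnt + 1
    if (Cnt >= 3) break
}
```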

The above provides enough information for a basic understanding of R. For details, please see the R-Intro and the R-Reference.pdf.
