Wednesday, June 24, 2015

XML Streaming tip while using JAXB

Streaming to XML files is generally preferable to building and writing a full DOM, especially for large documents.

If you are writing a large XML file and the child elements are not all written in one stretch or by a single method call, writing individual fragments is both efficient and practical.

Sample xml may look like:

<MyRoot>
   <Child id="1" type="a"/>
   <Child id="2" type="a"/>
   <Child id="3" type="a"/>
   .
   .
   .
   <Child id="1001" type="b"/>
   <Child id="1002" type="b"/>
   <Child id="1003" type="b"/>
   .
   .
   .
</MyRoot>

Let us say <Child> elements of type "a" are added by method1() and those of type "b" by method2(). The following snippet shows how a method can write its elements to the XML file as fragments, i.e. without a root element.

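import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;

public void method1() {
    Employee emp1 = new Employee();
    emp1.setId(100);
    emp1.setFirstName("John");
    emp1.setLastName("Macy");
    emp1.setAge(29);

    Employee emp2 = new Employee();
    emp2.setId(88);
    emp2.setFirstName("Linda");
    emp2.setLastName("Stuard");
    emp2.setAge(25);

    File file = new File("C:\\myFile.xml");
    // Open the file once and marshal both objects to the same stream;
    // marshal(Object, File) reopens the file on each call, so the second call would overwrite the first.
    try (OutputStream out = new FileOutputStream(file)) {
        JAXBContext jaxbContext = JAXBContext.newInstance(Employee.class);
        Marshaller jaxbMarshaller = jaxbContext.createMarshaller();

        jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        // JAXB_FRAGMENT suppresses the XML declaration, so several elements
        // can be written one after another without an enclosing root element
        jaxbMarshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);

        jaxbMarshaller.marshal(emp1, out);
        jaxbMarshaller.marshal(emp2, out);
    } catch (JAXBException | IOException e) {
        e.printStackTrace();
    }
}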

After this code runs, myFile.xml contains the following (without a root element):

<Employee>
    <Id>100</Id>
    <FirstName>John</FirstName>
    <LastName>Macy</LastName>
    <Age>29</Age>
</Employee>
<Employee>
    <Id>88</Id>
    <FirstName>Linda</FirstName>
    <LastName>Stuard</LastName>
    <Age>25</Age>
</Employee>
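
The fragments above have no enclosing root element, so something still has to write <MyRoot> around them to get the sample layout shown earlier. The following is a minimal sketch of one way to do that. It uses the JDK's StAX XMLStreamWriter instead of JAXB so that it runs without extra libraries; the class name FragmentStreamDemo, the helper writeChild() and the id ranges are made up for illustration. Each method streams its own <Child> elements through a shared writer, and only build() knows about the root element.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class FragmentStreamDemo {

    // Streams the <Child> elements of type "a" (ids are illustrative).
    static void method1(XMLStreamWriter writer) throws XMLStreamException {
        for (int id = 1; id <= 3; id++) {
            writeChild(writer, id, "a");
        }
    }

    // Streams the <Child> elements of type "b" (ids are illustrative).
    static void method2(XMLStreamWriter writer) throws XMLStreamException {
        for (int id = 1001; id <= 1003; id++) {
            writeChild(writer, id, "b");
        }
    }

    // Hypothetical helper: writes one empty <Child> element with its attributes.
    static void writeChild(XMLStreamWriter writer, int id, String type) throws XMLStreamException {
        writer.writeEmptyElement("Child");
        writer.writeAttribute("id", String.valueOf(id));
        writer.writeAttribute("type", type);
    }

    // Writes the root element once and lets each method add its own children.
    static String build() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("MyRoot");
            method1(writer);
            method2(writer);
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

For a real file, the StringWriter would be replaced by a FileWriter or FileOutputStream. A JAXB marshaller with JAXB_FRAGMENT set can also marshal into an XMLStreamWriter, so the same pattern should carry over to JAXB-mapped objects.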

Monday, June 22, 2015

Performing SVN Merges using Eclipse plugin

In most projects automated merges are configured, which makes sure that your feature branch changes are forwarded to the required future branch or trunk. Sometimes this does not happen: a merge conflict arises, so the automated merge cannot be performed, or earlier revision merges on the same file are still pending, so your merge is waiting in the queue. Automated forward merges may also not be configured at all. In such cases you have to merge the code base manually: resolve any conflicts, merge your changes into the future branch or trunk, and commit the changes.

In the Eclipse IDE, right-click the project or folder on which you want to perform the merge and select Team->Merge from the context menu, as shown in the screenshot below.


This brings up the merge wizard shown below, from which you can choose the type of merge you want to perform.



Each merge type is explained below:

  1. Merge a range of revisions
     Use this method for forward merges into a branch from another branch or trunk. Typically, changes from an older branch are merged into a newer branch.

    a. Uncheck the 'Perform pre-merge best practices checks' checkbox; this ensures that no revision from the older branch is missed in the subsequent wizard pages. Now click the 'Next' button.

    b. In the next screen, shown below, enter (or pick using the 'Select' button) the branch name and corresponding path from which you want to merge changes into the target branch/trunk. Select the option 'Select revisions on the next page' and click the Next button.

    c. A window opens showing the revisions eligible for merge (shown below). Select the required revisions and click the Next button.

    d. The conflict handling options screen appears (shown below). This screen lets you decide how conflicts are handled: prompt for each conflict, mark conflicts and let me resolve them later, resolve conflicts using your version, or resolve conflicts using the incoming version. Once you have chosen the option you need, click the Finish button.

    e. Once any conflicts have been resolved, go ahead and commit the changes. This ensures that your feature branch changes are integrated into trunk or the other future branch, and you won't receive any more auto-merge failure e-mails.

  2. Manually record merge information (block one or more revisions)
     This option is helpful if you want to block a revision from making it into an SVN branch or trunk. This typically arises when you have to stop an otherwise successful automatic forward merge to keep the latest code from being overwritten by older code.
     To do this, choose the 5th option in the merge wizard, as shown in the screenshot below. The subsequent steps are almost the same as in the flow explained above: you are asked to choose the branch and path to merge from, then you select the revision you want to block and click Finish. This makes the necessary changes to the SVN merge information of the file, which you can now commit. The specific revision is then blocked from auto-merging into the SVN branch, while subsequent pending merges are not blocked.


Please feel free to ask questions in the comments section, and to request coverage of any specific area that was missed.

Saturday, June 13, 2015

Hadoop Misc

Regression using Apache Math API

Simple Linear Regression

In simple linear regression, a dependent variable y is predicted from one predictor variable x.

y = intercept + slope * x 

also written as y = b * x + A, where b is the slope and A is the intercept

x - the independent variables which form the design matrix
y - the dependent or response variable

Multiple Linear Regression

In multiple regression, the dependent variable is predicted from two or more predictor variables.

With two predictor variables, the equation is y = b1 * x1 + b2 * x2 + A

The values of b (b1 and b2) are sometimes called “regression coefficients” and sometimes “regression weights.”


In matrix form:
Y = X * b + u
where Y is an n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters and u is an n-vector of error terms or residuals.

Here X[n,k] holds k independent variables (columns) and n observations (rows).
Y is an [n,1] matrix/array.

A large coefficient of determination (R²) supports treating the problem as linear: R² = 1 indicates that the fitted model explains all variability in y, while R² = 0 indicates no 'linear' relationship.
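
To make the simple regression formulas concrete, here is a minimal sketch that fits y = intercept + slope * x by ordinary least squares and reports the coefficient of determination. It is written in plain Java rather than against the Apache Math API so that it runs without dependencies; the class name SimpleLeastSquares, the method fit() and the sample data are made up for illustration.

```java
class SimpleLeastSquares {

    /** Returns {intercept, slope, rSquared} for the least-squares line y = intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;

        // Means of x and y
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of squares and cross products of the deviations from the means
        double sxx = 0.0;
        double syy = 0.0;
        double sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        // For a single predictor, R^2 equals the squared correlation between x and y
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] {intercept, slope, rSquared};
    }

    public static void main(String[] args) {
        // Illustrative data lying exactly on y = 1 + 2x
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] result = fit(x, y);
        System.out.printf("intercept=%.2f slope=%.2f R2=%.2f%n", result[0], result[1], result[2]);
    }
}
```

Because the sample points lie exactly on a line, the fit recovers intercept 1 and slope 2 with R² = 1. Noisy data would give an R² below 1.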

Statistics Terminology related to Regression

Dependence is any statistical relationship between two random variables or two sets of data.

Correlation refers to any of a broad class of statistical relationships involving dependence.

Variance measures how far a set of numbers is spread out. (A variance of zero indicates that all the values are identical.) 
A non-zero variance is always positive: A small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out from the mean and from each other. 

The square root of variance is called the standard deviation. The variance is one of several descriptors of a probability distribution.

Covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.

In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. 

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
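
The definitions above can be checked with a few lines of code. This is a minimal sketch in plain Java, assuming the population forms of variance and covariance (dividing by n; the sample forms divide by n - 1). The class name SpreadStats and the data are made up for illustration.

```java
class SpreadStats {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population variance: average squared distance from the mean.
    static double variance(double[] values) {
        return covariance(values, values);
    }

    // Population covariance: average product of the paired deviations from the means.
    static double covariance(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double meanA = mean(a);
        double meanB = mean(b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - meanA) * (b[i] - meanB);
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        double[] a = {2, 4, 6, 8};
        double[] rising = {1, 3, 5, 7};   // moves with a
        double[] falling = {7, 5, 3, 1};  // moves against a

        System.out.println("variance(a)            = " + variance(a));
        System.out.println("standard deviation(a)  = " + Math.sqrt(variance(a)));
        System.out.println("covariance(a, rising)  = " + covariance(a, rising));
        System.out.println("covariance(a, falling) = " + covariance(a, falling));
    }
}
```

With this data the covariance is positive when the second series rises with the first and negative when it falls. An array of identical values has a variance of zero.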






Reference:
www.wikipedia.org

HBase for beginners

Hadoop Cluster Maintenance Tips



  1. If the disk is full and you are getting resource availability errors, try deleting old distributed cache files and old job logs at the paths given below to free up space.

    /var/log/hadoop-0.20-mapreduce/userlogs 
    ./mapred/local/taskTracker/cloudera/distcache