Everyday getting better

Wednesday, June 24, 2015

XML Streaming tip while using JAXB

When writing XML files, streaming is generally recommended over building and writing a full DOM.

If you are writing a big XML file and the child elements are not all written in one stretch or in a single method call, writing individual fragments is efficient and practical.

Sample xml may look like:

<MyRoot>
<Child id="1" type="a"/>
<Child id="2" type="a"/>
<Child id="3" type="a"/>
.
.
.
<Child id="1001" type="b"/>
<Child id="1002" type="b"/>
<Child id="1003" type="b"/>
.
.
.
</MyRoot>

Let us say <Child> elements of type "a" are added by method1() and those of type "b" by method2(). The following snippet shows how a method can write its elements to the XML file as fragments, i.e. without a root element, so that each method can write to the same file independently.

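// Needs java.io.FileOutputStream, java.io.IOException, java.io.OutputStream
// and javax.xml.bind.JAXBContext, JAXBException, Marshaller.
public void method1() {

    // Marshal every fragment to one shared stream. marshal(object, File)
    // reopens the file on each call, so the second employee would
    // overwrite the first.
    try (OutputStream out = new FileOutputStream("C:\\myFile.xml")) {
        Employee emp1 = new Employee();
        emp1.setId(100);
        emp1.setFirstName("John");
        emp1.setLastName("Macy");
        emp1.setAge(29);

        Employee emp2 = new Employee();
        emp2.setId(88);
        emp2.setFirstName("Linda");
        emp2.setLastName("Stuard");
        emp2.setAge(25);

        JAXBContext jaxbContext = JAXBContext.newInstance(Employee.class);
        Marshaller jaxbMarshaller = jaxbContext.createMarshaller();

        jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        // JAXB_FRAGMENT suppresses the XML declaration, so several elements
        // can be written one after another without a root element.
        jaxbMarshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);

        jaxbMarshaller.marshal(emp1, out);
        jaxbMarshaller.marshal(emp2, out);
    } catch (JAXBException | IOException e) {
        e.printStackTrace();
    }

}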

After this code runs, myFile.xml looks as follows (without a root element):

<Employee>
    <Id>100</Id>
    <FirstName>John</FirstName>
    <LastName>Macy</LastName>
    <Age>29</Age>
</Employee>
<Employee>
    <Id>88</Id>
    <FirstName>Linda</FirstName>
    <LastName>Stuard</LastName>
    <Age>25</Age>
</Employee>
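A file that holds only fragments is not yet a well-formed XML document, so something still has to write the <MyRoot> wrapper from the sample at the top. One possible way, sketched here under assumptions, is to let a single StAX XMLStreamWriter (part of the JDK) own the root element and hand it to each method in turn. The class and method names (FragmentStreamingSketch, writeTypeA, writeTypeB, writeChild) are hypothetical and the elements are written by hand to keep the sketch free of dependencies; with JAXB, each method could instead call marshal(object, xmlStreamWriter) with JAXB_FRAGMENT set.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class FragmentStreamingSketch {

    // Hypothetical stand-in for method1(): writes <Child> elements of type "a".
    static void writeTypeA(XMLStreamWriter xml) throws XMLStreamException {
        for (int id = 1; id <= 3; id++) {
            writeChild(xml, id, "a");
        }
    }

    // Hypothetical stand-in for method2(): writes <Child> elements of type "b".
    static void writeTypeB(XMLStreamWriter xml) throws XMLStreamException {
        for (int id = 1001; id <= 1003; id++) {
            writeChild(xml, id, "b");
        }
    }

    // Writes one empty <Child> element with its id and type attributes.
    static void writeChild(XMLStreamWriter xml, int id, String type)
            throws XMLStreamException {
        xml.writeEmptyElement("Child");
        xml.writeAttribute("id", Integer.toString(id));
        xml.writeAttribute("type", type);
    }

    // Streams the whole document: root start, both groups of children, root end.
    static String buildDocument() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("MyRoot");
            writeTypeA(xml);
            writeTypeB(xml);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildDocument());
    }
}
```

Because every method receives the same writer, neither needs to know about the root element or about the other method; swapping the StringWriter for a file stream would stream the same content to disk.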

Monday, June 22, 2015

Performing SVN Merges using Eclipse plugin

Most projects configure automated merges, which make sure that the changes on your feature branch are forwarded to the required future branch or trunk. Sometimes this does not happen: a merge conflict prevents the automated merge, or earlier revisions of the same file are still pending, so your merge waits in the queue. Automated forward merges may also not be configured at all. In such cases you have to merge manually: resolve any conflicts, merge your changes into the future branch or trunk, and commit the changes.

In the Eclipse IDE, right-click the project or folder you want to merge and select Team->Merge from the context menu, as shown in the screenshot below.

This brings up the merge wizard shown below, where you can choose the type of merge you want to perform. Each merge type is explained below:

Merge a range of revisions

Manually record merge information (block one or more revisions)

This is done by choosing the fifth option in the merge wizard, as shown in the screenshot below. The subsequent steps are almost the same as in the flow explained above: you are asked to choose the branch and path to merge from, then you select the revision you want to block and choose Finish. This makes the necessary changes to the file's SVN properties (its merge information), and you can then commit the file. The chosen revision is blocked from being auto-merged into the SVN branch, and it no longer holds up subsequent pending merges.

Please feel free to ask questions in the comments section, and to ask me to cover any specific area that I have missed.

Saturday, June 13, 2015

Hadoop Misc

Here is a sample using the new MapReduce API

Regression using Apache Math API

`Simple Linear Regression`

In simple linear regression, a dependent variable y is predicted from one predictor variable x.

y = intercept + slope * x

also written as y = b * x + A, where b is the slope and A is the intercept

x - the independent variables which form the design matrix

y - the dependent or response variable
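To make the equation concrete, here is a minimal least-squares sketch in plain Java that estimates the intercept and slope from paired observations and also reports R². The class name SimpleRegressionSketch and the sample data are made up for illustration. Apache Commons Math offers a SimpleRegression class for the same task; this sketch avoids the dependency and is not that API.

```java
class SimpleRegressionSketch {

    /** Returns {intercept, slope, rSquared} for the line y = intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two (x, y) pairs");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of squared and cross deviations from the means.
        double sxx = 0.0;
        double sxy = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            sxy += dx * dy;
            syy += dy * dy;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be identical");
        }

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        // R^2 is NaN when every y value is identical (nothing to explain).
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] {intercept, slope, rSquared};
    }

    public static void main(String[] args) {
        // Made-up data that lies exactly on y = 1 + 2 * x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] result = fit(x, y);
        System.out.println("intercept = " + result[0]);
        System.out.println("slope     = " + result[1]);
        System.out.println("R^2       = " + result[2]);
    }
}
```

For the sample data the fitted intercept is 1 and the slope is 2, with R² equal to 1 because the points lie exactly on a line.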

`Multiple Linear Regression`

In multiple regression, the dependent variable is predicted by two or more variables.

The equation with two predictor variables is y = b1 * x1 + b2 * x2 + A

The values of b (b1 and b2) are sometimes called “regression coefficients” and sometimes “regression weights.”

In matrix form: Y = X*b + u
where Y is an n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters, and u is an n-vector of error terms or residuals.

Here X[n,k] holds n observations (rows) of k independent variables (columns).
Y is an [n,1] matrix (a column vector).
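The matrix form can be fitted by ordinary least squares. The sketch below is one simple way to do it in plain Java: it prepends a column of ones to X so that the intercept A becomes part of b, builds the normal equations (X'X) b = X'Y, and solves them with Gaussian elimination. The class name MultipleRegressionSketch and the data are made up for illustration, and this approach is chosen for brevity rather than numerical robustness. Apache Commons Math provides OLSMultipleLinearRegression for real use; this is not that API.

```java
class MultipleRegressionSketch {

    /** Returns {A, b1, ..., bk} for y = A + b1*x1 + ... + bk*xk. */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length;
        if (n == 0 || n != y.length) {
            throw new IllegalArgumentException("x and y need the same number of rows");
        }
        int k = x[0].length;
        int p = k + 1; // one extra parameter for the intercept

        // Augmented normal-equation system [X'X | X'Y], with a leading 1 per row.
        double[][] a = new double[p][p + 1];
        for (int row = 0; row < n; row++) {
            double[] xi = new double[p];
            xi[0] = 1.0;
            System.arraycopy(x[row], 0, xi, 1, k);
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    a[i][j] += xi[i] * xi[j];
                }
                a[i][p] += xi[i] * y[row];
            }
        }

        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException("Regressors are linearly dependent");
            }
            for (int r = col + 1; r < p; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }

        // Back substitution on the upper-triangular system.
        double[] b = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double sum = a[i][p];
            for (int j = i + 1; j < p; j++) {
                sum -= a[i][j] * b[j];
            }
            b[i] = sum / a[i][i];
        }
        return b;
    }

    public static void main(String[] args) {
        // Made-up data generated from y = 1 + 2*x1 + 3*x2 with no noise.
        double[][] x = {{1, 0}, {0, 1}, {1, 1}, {2, 1}, {1, 3}};
        double[] y = {3, 4, 6, 8, 12};
        double[] b = fit(x, y);
        System.out.println("A = " + b[0] + ", b1 = " + b[1] + ", b2 = " + b[2]);
    }
}
```

With the noise-free sample data the estimates come out at approximately A = 1, b1 = 2 and b2 = 3, up to floating-point rounding.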

A large coefficient of determination (R²) suggests that a linear model fits the problem well: R² = 1 indicates that the fitted model explains all the variability in y, while R² = 0 indicates no linear relationship.

Statistics Terminology related to Regression

Dependence is any statistical relationship between two random variables or two sets of data.

Correlation refers to any of a broad class of statistical relationships involving dependence.

Variance measures how far a set of numbers is spread out. (A variance of zero indicates that all the values are identical.)
A non-zero variance is always positive: A small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out from the mean and from each other.

The square root of variance is called the standard deviation. The variance is one of several descriptors of a probability distribution.

Covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.
In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative.

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
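The terms above can be made concrete with a small plain-Java sketch. The class name SpreadStatsSketch and the sample numbers are made up, and the sketch assumes the population forms (dividing by n); the sample forms divide by n - 1 instead.

```java
class SpreadStatsSketch {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population variance: average squared distance from the mean.
    static double variance(double[] values) {
        double m = mean(values);
        double sum = 0.0;
        for (double v : values) {
            sum += (v - m) * (v - m);
        }
        return sum / values.length;
    }

    // Standard deviation: square root of the variance.
    static double standardDeviation(double[] values) {
        return Math.sqrt(variance(values));
    }

    // Population covariance: average product of paired deviations from the means.
    static double covariance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Inputs must have the same length");
        }
        double meanA = mean(a);
        double meanB = mean(b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - meanA) * (b[i] - meanB);
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("variance = " + variance(data));
        System.out.println("std dev  = " + standardDeviation(data));

        double[] x = {1, 2, 3};
        double[] rising = {2, 4, 6};   // moves with x
        double[] falling = {6, 4, 2};  // moves against x
        System.out.println("cov(x, rising)  = " + covariance(x, rising));
        System.out.println("cov(x, falling) = " + covariance(x, falling));
    }
}
```

The first data set has a mean of 5, a variance of 4 and a standard deviation of 2. The covariance is positive for the pair that rises together and negative for the pair that moves in opposite directions, matching the sign rule described above.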

Reference:
www.wikipedia.org

HBase for beginners

A good starting point for understanding the basic concepts of HBase: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Hadoop Cluster Maintenance Tips

If the disk is full and you are getting resource availability errors, try deleting old distributed cache files and old job logs at the paths given below; this can free up enough space.

/var/log/hadoop-0.20-mapreduce/userlogs

./mapred/local/taskTracker/cloudera/distcache
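A cleanup like this can be scripted. The following is a rough sketch, not a Hadoop tool: it lists regular files under a directory that were last modified before a cutoff, and can optionally delete them. The class name OldFileCleanerSketch, its methods and the age threshold are all hypothetical, and it assumes that "old" can be judged by last-modified time. Run the listing first and check the result before deleting anything on a live cluster.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OldFileCleanerSketch {

    /** Lists regular files under root that were last modified before cutoff. */
    static List<Path> findOlderThan(Path root, Instant cutoff) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> lastModified(p).isBefore(cutoff))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Instant lastModified(Path file) {
        try {
            return Files.getLastModifiedTime(file).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the files found by findOlderThan and returns how many were removed. */
    static int deleteOlderThan(Path root, Instant cutoff) {
        int deleted = 0;
        for (Path file : findOlderThan(root, cutoff)) {
            try {
                if (Files.deleteIfExists(file)) {
                    deleted++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: OldFileCleanerSketch <directory> <days>");
            return;
        }
        Path root = Paths.get(args[0]);
        Instant cutoff = Instant.now().minus(Duration.ofDays(Long.parseLong(args[1])));
        // Dry run: only print the candidates, do not delete them.
        for (Path file : findOlderThan(root, cutoff)) {
            System.out.println(file);
        }
    }
}
```

The main method is deliberately a dry run; calling deleteOlderThan with the same arguments would remove the listed files.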