Wednesday, June 24, 2015

XML Streaming tip while using JAXB

When writing large XML files, streaming the output is generally preferable to building a full DOM in memory and writing it out in one go.

If you are writing a big XML file and the child elements are not all written in one stretch or in a single method call, writing individual fragments is efficient and practical.

A sample XML file may look like this:

<MyRoot>
   <Child id="1" type="a"/>
   <Child id="2" type="a"/>
   <Child id="3" type="a"/>
   .
   .
   .
   <Child id="1001" type="b"/>
   <Child id="1002" type="b"/>
   <Child id="1003" type="b"/>
   .
   .
   .
</MyRoot>

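Let us say the <Child> elements of type "a" are added by method1() and those of type "b" by method2(). The following snippet lets each method write independently to the same XML file in fragments, i.e. without the root element. (The example marshals Employee objects rather than <Child> elements; Employee is assumed to be a JAXB class annotated with @XmlRootElement.)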

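import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;

public void method1() {

    Employee emp1 = new Employee();
    emp1.setId(100);
    emp1.setFirstName("John");
    emp1.setLastName("Macy");
    emp1.setAge(29);

    Employee emp2 = new Employee();
    emp2.setId(88);
    emp2.setFirstName("Linda");
    emp2.setLastName("Stuard");
    emp2.setAge(25);

    // Marshalling to a File overwrites it on every call, so write to a single
    // stream instead. Append mode (true) lets other methods add fragments later.
    try (OutputStream out = new FileOutputStream("C:\\myFile.xml", true)) {
        JAXBContext jaxbContext = JAXBContext.newInstance(Employee.class);
        Marshaller jaxbMarshaller = jaxbContext.createMarshaller();

        jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        // JAXB_FRAGMENT suppresses the XML declaration, so each object is written
        // as a bare element that can sit next to other fragments.
        jaxbMarshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);

        jaxbMarshaller.marshal(emp1, out);
        jaxbMarshaller.marshal(emp2, out);
    } catch (JAXBException | IOException e) {
        e.printStackTrace();
    }
}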

The contents of myFile.xml after this code runs look as follows (without a root element):

<Employee>
    <Id>100</Id>
    <FirstName>John</FirstName>
    <LastName>Macy</LastName>
    <Age>29</Age>
</Employee>
<Employee>
    <Id>88</Id>
    <FirstName>Linda</FirstName>
    <LastName>Stuard</LastName>
    <Age>25</Age>
</Employee>
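
The fragments above are not a well-formed document on their own, because something still has to write the root element around them. Here is a minimal sketch of that idea using the JDK's StAX XMLStreamWriter instead of JAXB, producing the <MyRoot>/<Child> layout from the sample at the top. The class name FragmentStreamingSketch and the helpers writeChild and buildDocument are made up for illustration, and it writes to a StringWriter rather than a file to keep the example short.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class FragmentStreamingSketch {

    // Writes the <Child> elements of type "a"; it knows nothing about the root element.
    static void method1(XMLStreamWriter xml) throws XMLStreamException {
        for (int id = 1; id <= 3; id++) {
            writeChild(xml, id, "a");
        }
    }

    // Writes the <Child> elements of type "b", independently of method1().
    static void method2(XMLStreamWriter xml) throws XMLStreamException {
        for (int id = 1001; id <= 1003; id++) {
            writeChild(xml, id, "b");
        }
    }

    // Hypothetical helper: streams one empty <Child> element with its attributes.
    private static void writeChild(XMLStreamWriter xml, int id, String type)
            throws XMLStreamException {
        xml.writeEmptyElement("Child");
        xml.writeAttribute("id", Integer.toString(id));
        xml.writeAttribute("type", type);
    }

    // The caller owns the root element; each method only streams its own fragments.
    static String buildDocument() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("MyRoot");
            method1(xml);
            method2(xml);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildDocument());
    }
}
```

Nothing is held in memory beyond the element currently being written, which is the point of streaming. In a JAXB setup the same split should work by letting the caller write the root tags and having each method marshal its fragments to the shared stream in between.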

Monday, June 22, 2015

Performing SVN Merges using Eclipse plugin

In most projects automated merges are configured, which makes sure that your feature branch changes are forwarded to the required future branch or trunk. Sometimes this does not happen: a merge conflict arises so the automated merge cannot be performed, or earlier revisions of the same file are still pending so your merge is waiting in the queue. Automated forward merges may also not be configured at all. In such cases you have to merge manually: resolve any conflicts, merge your changes into the future branch or trunk, and commit the changes.

In the Eclipse IDE, right-click the project or folder on which you want to perform the merge and select Team->Merge from the context menu, as shown in the screenshot below.


This brings up the merge wizard shown below, from which you can choose the type of merge you want to perform.



Each merge type is explained below:

  1. Merge a range of revisions
    Use this method for forward merges to a branch from another branch or trunk. Typically, changes from an older branch are merged into a newer branch.

    a. Uncheck the 'Perform pre-merge best practices checks' checkbox, so that no revision from the older branch is missed on the subsequent wizard pages. Then click the 'Next' button.

    b. On the next screen, shown below, enter (or pick using the 'Select' button) the branch name and corresponding path from which you want to merge changes into the target branch/trunk. Choose the option 'Select revisions on the next page' and click the Next button.



    c. A window opens showing the revisions eligible for merge (shown below). Select the required revisions and click the Next button.



    d. The conflict handling options screen appears (shown below). This screen lets you decide how conflicts are handled: prompt for each conflict, mark conflicts and let me resolve them later, resolve conflicts using my version, or resolve conflicts using the incoming version. Once you have chosen the option you need, click the Finish button.



    e. Once any conflicts have been resolved, go ahead and commit the changes. This ensures that your feature branch changes are integrated into trunk or the other future branch, and you won't receive any more auto-merge failure e-mails.

  2. Manually record merge information (block one or more revisions)
    This option is helpful if you want to block a revision from making it into an SVN branch or trunk. This typically arises when you have to stop an otherwise successful automatic forward merge, to keep the latest code from being overwritten by older code.
    This can be achieved by choosing the 5th option in the merge wizard, as shown in the screenshot below. The subsequent steps are almost the same as in the flow explained above: you are asked to choose the branch and path to merge from, then you select the revision you want to block and click Finish. This records the change in the SVN merge information (the svn:mergeinfo property), which you can now commit. The specific revision is then blocked from auto merging into the SVN branch, without blocking subsequent pending merges.


Please feel free to ask questions in the comments section, and to request coverage of any specific area that was missed.

Saturday, June 13, 2015

Hadoop Misc

Regression using Apache Math API

Simple Linear Regression

In simple linear regression, a dependent variable y is predicted from one predictor variable x.

y = intercept + slope * x

also written as  y = b * x + A, where b is the slope and A is the intercept

x - the independent (predictor) variable
y - the dependent or response variable

Multiple Linear Regression

In multiple regression, the dependent variable is predicted from two or more predictor variables.

With two predictor variables the equation is y = b1 * x1 + b2 * x2 + A

The values of b (b1 and b2) are sometimes called “regression coefficients” and sometimes “regression weights.”


In matrix form the model is written as Y = X * b + u
where Y is an n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters and u is an n-vector of error terms.

Here X[n,k] holds n observations (rows) of k independent variables (columns), and Y is an [n,1] matrix/array. The intercept A can be included in this form by adding a column of ones to X.

The coefficient of determination (R^2) indicates how well a linear model fits the data: R^2 = 1 means the fitted model explains all the variability in y, while R^2 = 0 indicates no 'linear' relationship.

Statistics Terminology related to Regression

Dependence is any statistical relationship between two random variables or two sets of data.

Correlation refers to any of a broad class of statistical relationships involving dependence.

Variance measures how far a set of numbers is spread out. (A variance of zero indicates that all the values are identical.) 
A non-zero variance is always positive: A small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out from the mean and from each other. 

The square root of variance is called the standard deviation. The variance is one of several descriptors of a probability distribution.

Covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.

In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. 

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
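
To tie these terms back to simple linear regression, here is a small sketch in plain Java (not the Apache Commons Math API) that computes the slope as covariance(x, y) / variance(x), the intercept from the means, and R^2 as the squared correlation. The class and method names are made up for illustration and the data is invented so that y = 1 + 2 * x exactly. Commons Math has a SimpleRegression class for the same job, which is what you would normally use.

```java
class SimpleRegressionSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Sample covariance (divides by n - 1); covariance(a, a) is the variance of a.
    static double covariance(double[] a, double[] b) {
        double meanA = mean(a);
        double meanB = mean(b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - meanA) * (b[i] - meanB);
        }
        return sum / (a.length - 1);
    }

    // Least-squares slope: covariance of x and y divided by the variance of x.
    static double slope(double[] x, double[] y) {
        return covariance(x, y) / covariance(x, x);
    }

    // The fitted line passes through the point (mean of x, mean of y).
    static double intercept(double[] x, double[] y) {
        return mean(y) - slope(x, y) * mean(x);
    }

    // For simple regression, R^2 equals the squared correlation of x and y.
    static double rSquared(double[] x, double[] y) {
        double cov = covariance(x, y);
        return (cov * cov) / (covariance(x, x) * covariance(y, y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 5, 7, 9, 11}; // invented data lying on y = 1 + 2 * x
        System.out.println("slope     = " + slope(x, y));
        System.out.println("intercept = " + intercept(x, y));
        System.out.println("R^2       = " + rSquared(x, y));
    }
}
```

A positive covariance gives a positive slope and a negative covariance a negative one, which is the "sign shows the tendency" statement above in code form.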






Reference:
www.wikipedia.org

HBase for beginners

Hadoop Cluster Maintenance Tips


  1. If disk space is full and you are getting resource availability errors, try deleting old distributed cache files and old job logs from the paths given below. This usually frees up enough space.

    /var/log/hadoop-0.20-mapreduce/userlogs 
    ./mapred/local/taskTracker/cloudera/distcache
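
Before deleting anything it helps to see what is actually old. Here is a minimal dry-run sketch in Java that only lists files older than a given age under a directory. The class name, the method findOlderThan and the 7-day threshold are my own assumptions, not something Hadoop or Cloudera provides, so adjust them to your cluster.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OldFileFinderSketch {

    // Returns the regular files under root that were last modified more than maxAge ago.
    static List<Path> findOlderThan(Path root, Duration maxAge) {
        long cutoffMillis = System.currentTimeMillis() - maxAge.toMillis();
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.toFile().lastModified() < cutoffMillis)
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the job log path from the tip above; pass another directory as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "/var/log/hadoop-0.20-mapreduce/userlogs");
        // Dry run: print the candidates only, nothing is deleted.
        for (Path p : findOlderThan(root, Duration.ofDays(7))) {
            System.out.println(p);
        }
    }
}
```

Review the list first and delete by hand or with a separate step, since logs of jobs that are still running also live under these paths.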

MPP (Massively Parallel Processing)/Green plum DB

MapReduce and MPP

MongoDB Schema Design Rules

  1. Prefer pre-joining (embedding) entities over joining them at query time, which is the common approach in an RDBMS. In Mongo the only way to join at run time is via application logic, which is a costly and clumsy operation.
  2. This strategy also serves the purpose of the constraints that are available in an RDBMS. The mantra is: pre-join (embed) at the schema level.
  3. Even though MongoDB does not have transactions, writes to a single document are atomic, so pre-joining (embedding) the data still gives us atomic operations and a consistent view of the data.
  4. In a one-to-many relationship where the "many" side is very large, linking collections is recommended. The same is recommended for many-to-many relationships.
  5. Influencers for when to embed and when to link
    •  Frequency of access
      • To reduce the working set size of your application.
    •  Size of items
      • If the combined size of the documents is larger than 16MB (the maximum BSON document size), link them.
    •  Atomicity of the data
Pre-join sample: A Product Catalog record

{
  sku: "00e8da9b",
  type: "Audio Album",
  title: "A Love Supreme",
  description: "by John Coltrane",
  asin: "B0000A118M",

  shipping: {
    weight: 6,
    dimensions: {
      width: 10,
      height: 10,
      depth: 1
    },
  },

  pricing: {
    list: 1200,
    retail: 1100,
    savings: 100,
    pct_savings: 8
  },

  details: {
    title: "A Love Supreme [Original Recording Reissued]",
    artist: "John Coltrane",
    genre: [ "Jazz", "General" ],
        ...
    tracks: [
      "A Love Supreme Part I: Acknowledgement",
      "A Love Supreme Part II - Resolution",
      "A Love Supreme, Part III: Pursuance",
      "A Love Supreme, Part IV-Psalm"
    ],
  },
}

Hive Partitions - How they look

Consider a Hive internal (managed) table named xx with two partitioning columns, c and d. The underlying Hive data file for the partition c=1, d=1 is stored at the path given below and looks as follows.

File Path: /user/hive/warehouse/bt.db/xx/c=1/d=1/partitiontest

File Contents:
1,a
2,b
3,c
4,d
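
Notice that the values of c and d do not appear in the file contents; they are encoded in the directory names. Here is a small Java sketch of how such a partition directory is put together from the partition columns. The class and method names are made up for illustration, and it ignores the escaping Hive applies to special characters in partition values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HivePartitionPathSketch {

    // Appends one "column=value" directory per partition column, in declaration order.
    static String partitionDir(String tableDir, Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the column order: c first, then d.
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("c", "1");
        partition.put("d", "1");
        System.out.println(partitionDir("/user/hive/warehouse/bt.db/xx", partition));
    }
}
```

Each distinct combination of c and d gets its own directory, so a query that filters on these columns only has to read the matching directories.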

Hadoop Cluster Troubleshooting


1. Node health is bad, with reason: Heart beat failure

  The CM (Cloudera Manager) agent is likely not running on that machine, perhaps because it was not configured to start at boot time. Check its status and start it if it is not running.

  service cloudera-scm-agent status
  service cloudera-scm-agent start
2. How to take the NameNode out of safe mode

  hadoop dfsadmin -safemode leave
  (on newer releases: hdfs dfsadmin -safemode leave)

Code Merge Mantra - Handy tip for Developers

Whether you are writing code for a product or an application, there will be multiple versions and releases involved, and your organization will be using some version control system such as SVN. If you are working on a branch slated for release R1, which precedes release R2 in the timeline, then your changes have to make their way into the R2 branch as well.
In most cases an auto merge is configured from the R1 branch to the R2 branch, so that every change made in R1 is automatically forwarded to R2. This can have one of three outcomes:
1. The forward merge succeeds, so everything is all right.
2. The forward merge fails because of merge conflicts, and you are flooded with mails asking you to resolve the conflict by performing a manual merge in the R2 branch.
3. Your merge waits for other forward merges from R1 to R2 to complete. These are commits made in R1 before yours.
Cases 2 and 3 are painful and should be avoided whenever possible. In the worst case your R1 change never makes its way into R2, which is a time bomb that can go off during the R2 release.
Here I am going to explain how to avoid and overcome forward merge failures. Even though we want everything to be automated, in real life we sometimes have to take the manual route, and the same applies to my suggestion. After you have committed your changes to R1, wait for 5 minutes and check whether they have been merged automatically into the R2 branch. If not, don't wait; merge your changes into R2 manually.
How to make Manual Forward Merge
Fire up your source control merge tool and start a manual merge. It asks which branch to merge from, in this case R1, and then lists the revisions eligible for merging. Select your latest revision, perform the merge and commit your changes. This ensures that you won't get any more merge failure mails and that your changes have made their way into the latest branch, R2.
Tips when auto merges are configured
Committing regularly is generally recommended once your code compiles and does not break any other code. When auto merges are configured, however, your source control system becomes a double-edged sword: every commit may lead to the scenarios explained above and cause extra disturbance and work for you and several others (such as the configuration management team and other developers). So the best practice is to commit your changes only when you have finished testing and are sure that no more changes are required to the same piece of code. This reduces the number of additional manual forward merges.