Extracting parts of a Git repository while preserving history

If you have multiple modules within a git repository, it’s only a matter of time before it gets bulky and cluttered enough to make you want to refactor it and move some modules into their own separate repositories (for better maintainability). If that bulky repository has several contributors, it’s also critical to preserve the contribution history (i.e. commits, changes), even when you split it into several new repositories.

Recently I came across such a situation, where I had to move several extensions from a parent repo into their own separate repos, as shown in the figure. Here’s how I did it:

  1. Fork the original repository into your account. Here, I have forked wso2/siddhi repo into grainier/siddhi
  2. Clone the forked repo into your local machine. Here, I cloned it into a new directory named “siddhi-execution-time” (I’m going to extract time extension first).
  3. Go inside cloned directory, and remove all remote tracking from the cloned git repository.
  4. Now it’s time to filter and prune. Doing this will clean all the other directories as well as their git history, while only keeping files and history of filtered subdirectory.

    This will result in a repository containing only the filtered subdirectory and its history.
  5. Now, fix the project structure (fixing pom file, add a .gitignore, add other required modules etc..) as you want.
  6. Add new files to git and commit those changes (remember still you cannot push it since we haven’t set the remote tracking location yet).
  7. Now create a new git repo on GitHub (or any git provider), and copy its remote URL (mine is https://github.com/grainier/siddhi-execution-time.git). It doesn’t matter whether this is an empty repo or an existing repo (with files).
  8. Now add that link as the remote tracking location to our local git repo.
  9. [optional] If it’s not an empty repo, do a git pull and merge its content with our repository.
  10. Finally, push the restructured local repo to the remote branch.
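The commands that originally accompanied the steps above didn’t survive; here is a sketch of what they likely were, using the repo names from the post. The subdirectory path (`path/to/time-extension`) and branch name are placeholders, not the actual layout of the siddhi repo, and newer Git versions recommend `git filter-repo` over `git filter-branch` for this job:

```shell
# Steps 1-3: clone the fork and remove the remote tracking
git clone https://github.com/grainier/siddhi.git siddhi-execution-time
cd siddhi-execution-time
git remote rm origin

# Step 4: keep only the subdirectory's files and their history
# (the path below is a placeholder -- use the module's real path)
git filter-branch --prune-empty --subdirectory-filter path/to/time-extension master

# Step 6: commit the project-structure fixes (pom, .gitignore, etc.)
git add .
git commit -m "Restructure as a standalone repository"

# Steps 8-10: add the new remote, optionally merge, then push
git remote add origin https://github.com/grainier/siddhi-execution-time.git
git pull origin master --allow-unrelated-histories   # only if the new repo isn't empty
git push -u origin master
```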

    That’s it, you’re all done 🙂

Create tar archive per subdirectory

Following bash script will generate a tar archive per subdirectory found within a given directory.

Write this code in a .sh file (i.e archive.sh ). Put it in your desired root directory. Execute the .sh using ./archive.sh command.
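The script itself didn’t survive the extraction; a minimal sketch of what it likely looked like (archive name and compression are assumptions):

```shell
#!/bin/bash
# archive.sh - create one tar archive per subdirectory of the current directory
for dir in */ ; do
    name="${dir%/}"                     # strip the trailing slash
    tar -czf "${name}.tar.gz" "$name"   # e.g. module-a/ -> module-a.tar.gz
    echo "Created ${name}.tar.gz"
done
```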

Recover lost web content using Google Cache

Last week I received this mail from my hosting provider (Scaleway) mentioning they had a critical disk failure and couldn’t recover any of the files. So, sadly, I lost most of the files I kept on that volume, including my blog content :(.


Your instance ‘xxxxxx’ is running on a blade that had a critical disk failure. We were not able to recover any of the files located on that disk.
Your node has been stopped, the volume located on the disk that had an issue has been detached from your server and is now available on the volume view, if you made snapshot of this disk or if you halted your node recently, you will recover your disk to the latest stop state. If you had another volume located on another physical disk, it will stay attached to your server.

We are sorry for the inconvenience.

Scaleway Team

The files were gone, but I couldn’t let the years of effort I’d put into my blog go in vain. So I tried different approaches, and this is how I finally recovered all my blog content.

1. Log into Google Webmaster Tools, and list the URL errors under Crawl > Crawl Errors.
2. Under the Not found tab, you’ll see the list of webpages that went missing after the incident.

3. Take one item and search Google for that URL. (Make sure to surround the URL with “” quotes.)
4. If it returns a search result, there’s a good chance that Google has already indexed the page and has it cached on their servers.

5. Go to cachedview.com, type in the URL of your missing webpage (exactly as it appeared in Google Webmaster Tools), and click on the Google Web Cache button.

6. It will then load the cached content of your missing web page.

7. Now create a new post on your blog and copy the content from the cached web page. (Make sure to keep the same URL, keywords, category, etc.)
8. Repeat the same steps for each Not found item in Google Webmaster Tools.
9. That’s it.
10. Now that we’ve all learned the lesson: it’s always good to have a periodic backup strategy, to be on the safe side in such incidents. 🙂
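The steps above go through the cachedview.com UI; the same cached copy could also be fetched directly from Google’s web cache endpoint at the time. A sketch (the target URL is a placeholder, and Google may rate-limit or block automated requests, so a browser User-Agent helps):

```shell
# Fetch Google's cached copy of a lost page directly, without the UI.
# URL is a placeholder -- use the missing page's exact address.
URL="https://example.com/my-lost-post/"
curl -sL -A "Mozilla/5.0" \
  "https://webcache.googleusercontent.com/search?q=cache:${URL}" \
  -o recovered.html
```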

Configuring Shibboleth as a SAML2 Identity Provider

In this post I’m going to share the steps to configure Shibboleth as a SAML2 IdP. Hope it will be useful for you as well. I’m using Ubuntu 14.04 LTS as my operating system; however, the steps should work on other systems too.

  1. Download Shibboleth IDP : Link
  2. Once you have downloaded the file, extract it into your local file system.
  3. Go to the <SHIBBOLETH_HOME>/bin directory and run the install.sh script (run install.bat if you are on Windows). This will install Shibboleth into the given location in your file system. You will be prompted with a few questions, as follows. Note: If you do not provide a fully qualified host name during installation, an error may occur. Basically, it should exactly match the format suggested by Shibboleth, i.e., idp.example.org (there is a regex pattern in the build.xml file; you can modify it as per your requirements).
  4. We will refer to the installation path you provided as <SHIBBOLETH_HOME>. The installation also creates a keystore and an idp.war file, which can be found in the <SHIBBOLETH_HOME>/credentials and <SHIBBOLETH_HOME>/war directories respectively.
  5. Configure a user store with Shibboleth. We can use an existing LDAP-based user store for this.
  6. Open the login.config file that is found in the <SHIBBOLETH_HOME>/conf/ directory and configure your LDAP user store details. The following is a sample configuration for an LDAP user store (LDAP used in WSO2 IS).
  7. Enable the username/password login handler by un-commenting the UsernamePassword login handler section of the <SHIBBOLETH_HOME>/conf/handler.xml file.
  8. Configure the logging level in the <SHIBBOLETH_HOME>/conf/logging.xml file. All the log files will be saved in the <SHIBBOLETH_HOME>/logs directory, which may be helpful when troubleshooting issues.
  9. Deploy the idp.war found in <SHIBBOLETH_HOME>/war/ in a web application server (i.e. copy idp.war to <TOMCAT_HOME>/webapps).
  10. Enable HTTPS in Apache Tomcat. To do this, edit the <TOMCAT_HOME>/conf/server.xml file and configure the HTTPS connector as below.
  11. Copy the /endorsed directory and its contents from the previously extracted Shibboleth setup to CATALINA_HOME/endorsed (i.e. /usr/share/tomcat7/endorsed).
  12. Restart the Apache Tomcat server.
  13. Check the status of the server using: https://localhost:8443/idp/status
  14. Now Shibboleth is configured. However, there are some additional steps that might come in handy. Please note that, by default, Shibboleth adds a Transient ID as the NameID in the Subject element of the SAML Assertion. (The Transient ID attribute definition exposes a randomly generated, short-lived, opaque identifier that can later be mapped back to the user by a Transient principal connector.) However, if you want to add the login name into the SAML Assertion, you need to do the following configuration.
  15. To configure the principal ID as the NameID in the SAML Assertion, in <SHIBBOLETH_HOME>/conf/attribute-resolver.xml, comment out <resolver:AttributeDefinition id="transientId"> and add the following instead:
  16. To configure a new policy for the principal ID, in <SHIBBOLETH_HOME>/conf/attribute-filter.xml, comment out <afp:AttributeFilterPolicy id="releaseTransientIdToAnyone"> and add the following instead:
  17. That’s it, Shibboleth is now configured as a SAML2 Identity Provider.
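The deployment part of the steps above (9 to 13) can be sketched as shell commands. The paths and service name here are assumptions for a typical Ubuntu 14.04 + Tomcat 7 setup, not taken from the original post; adjust them to your installation:

```shell
# Paths are assumptions -- point them at your actual install locations
SHIB_HOME=/opt/shibboleth-idp
TOMCAT_HOME=/usr/share/tomcat7

# Step 9: deploy the IdP web application
cp "$SHIB_HOME/war/idp.war" "$TOMCAT_HOME/webapps/"

# Step 11: copy the endorsed libraries from the extracted Shibboleth setup
mkdir -p "$TOMCAT_HOME/endorsed"
cp -r /path/to/extracted-shibboleth/endorsed/* "$TOMCAT_HOME/endorsed/"

# Step 12: restart Tomcat (service name varies by distro/version)
sudo service tomcat7 restart

# Step 13: verify the IdP is up (-k skips cert validation for the
# self-signed certificate used in this setup)
curl -k https://localhost:8443/idp/status
```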

DDoS Detection Using HTTP Communication Flow Analysis


Over the past few years, Application Layer DDoS attacks have become increasingly popular due to the minimalistic nature of application layer security. This type of attack tries to exhaust a web server’s vitality by overloading it with a massive number of HTTP requests. As long as the content of the requests is in a legitimate form and the request rate adheres to the protocol limits, an intrusion detection system (IDS) can hardly detect such attacks. Despite that, the one factor that can distinguish attackers from legitimate users is their browsing behaviour, since the attackers’ browsing behaviour will differ significantly from that of legitimate users. Exploiting that factor, this research introduces a novel approach to accurately distinguish between attackers and legitimate users. In this approach, we observe the HTTP communication flow and extract characteristics that describe the browsing behaviour of a user (i.e. page request sequence, request rates, request and content distribution), then model them into a form that can be analysed by a machine-learning algorithm. The probabilistic model generated by that machine-learning algorithm is then used to distinguish between attackers and legitimate users. Evaluation results based on a collected data set demonstrate that this approach is accurate and effective in detecting Application Layer Distributed Denial of Service attacks.

Subject Descriptors:

  • 1998 ACM Computing Classification System
    1. C.2.0 Computer-Communication Networks (Security and protection)
  • 2012 ACM Computing Classification System
    1. Intrusion detection systems
    2. Denial-of-service attacks

Key Words:

  • Intrusion Detection
  • Distributed Denial of Service
  • Machine Learning
  • Random Forests
  • Complete report : (contact me)
  • Source code : github