Skip to main content

Simple Java Regular Expression Implementation

Introduction

Regular Expressions is an amazing tool that makes our lives easier. The power it gives is well perceived thus it is supported by many programming and scripting languages. Power of regular expressions comes from another great mathematical tool: deterministic or non-deterministic finite automata. NFA/DFA is an interesting topic covered in Formal Languages and Automata Theory lectures which is supposed to be taken by all Computer Science graduates.

This article covers a simple NFA implementation of regular expressions. Thompson NFA algorithm, which is referenced in Regular Expression Matching Can Be Simple and Fast article, is used in implementation.

Implementation


Grammar

Implementation uses the following Backus-Naur Form grammar which is a subset of the grammar presented in A Grammar for Regular Expressions.

<regex>         ::= <simple-regex> | <simple-regex>|<regex>
<simple-regex>  ::= <term> | <term><simple-regex>
<term>          ::= <factor> | <factor>* | <factor>?
<factor>        ::= <character> | . | (<regex>)
<character>     ::= <any char except '(', ')', '?', '*', '+', '|'>


Design 

Interpreter pattern is exploited to construct NFA structure. Fragments within angle brackets are non-terminal symbols while fragments without angle brackets represents terminal symbols. Each symbol has a corresponding class either rooted by TerminalState or NonTerminalState interfaces. NonTerminalState objects are containers for Terminal and Non-Terminal state objects. NFA is constructed as a tree of Terminal and Non-Terminal State objects in which Terminal states are essentially leaves of the tree. Each State object has a (compulsorily Terminal) Start State and a list of end states which connects the state object to consequent state object.

Given a regular expression complying with the grammar specified above, firstly a NFA is constructed. Then a desired input sequence is fed to NFA. After all characters are processed by NFA if NFA is at an “accept state” then it is said that input sequence belongs to this NFA and test method returns true. If NFA is not at an “accept state” input sequence does not belong to this NFA and test method returns false.

Unlike the reference implementation which uses postfix notation and explicit concatenation symbols, this implementation uses infix notation and does not uses explicit concatenation symbols.

NFA Construction

-          Construction process uses a State stack to parse expression. Once construction is over stack is no longer used.
-          If a terminal symbol is read it is pushed to syntax as LabeledTerminalState.
-          If a left parentheses is read it is also pushed to stack as LabeledTerminalState to mark the start of parentheses expression.
-          If a right parentheses is read all items in the stack is popped and concatenated until left parentheses is popped from stack.
-          Concatenation is done using ConcatenationState objects which takes two State objects as input. Basically connects all next pointers at the end lists of first state to start state of second state.
-          Alteration, One or More, Zero or More and Zero or One symbols are handled in the same way it is explained in implementation section of  Regular Expression Matching Can Be Simple and Fast article. Fragments in the reference article roughly corresponds to Non-Terminal States in this writing. Split states in reference article roughly corresponds to UnlabeledTerminalState in this implementation. It is advised to read the reference article for a better explanation whereas it is advised to have look at the Java implementation that can be downloaded from below which  is easier to read and hopefully to understand.

Input Test

Once NFA is constructed it can be used to test the conformity of any input sequence. Input sequence is read char by char and NFA is executed. Each execution starts with a current state list which contains the terminal state objects that NFA is currently in. Each execution populates next state list which contains the states in which NFA will be, in next execution. Basically execution abstracts the process of checking labeled terminal states against current char. If char matches current state next labeled terminal states are put to next state list, which may require following unlabeled terminal states until labeled states are reached.

Each execution step has an execution step ID to differentiate execution steps from each other and more importantly to avoid multiple addition of terminal states to next state list. Otherwise some expressions may be added repeatedly until thread stack is overflowed.

When all chars from input sequence is read and executed, execution stops. If next list contains accept state test method returns true.

Summary


Regular expression implementation can be quiet simple. It is possible to download source files from here.
Feel free to share your comments to http://el-harezmi.blogspot.com on design and any typos, incorrect or inaccurate expressions you see.

Comments

Popular posts from this blog

Obfuscating Spring Boot Projects Using Maven Proguard Plugin

Introduction Obfuscation is the act of reorganizing bytecode such that it becomes hard to decompile. Many developers rely on obfuscation to save their sensitive code from undesired eyes. Publishing jars without obfuscation may hinder competitiveness because rivals may take advantage of easily decompilable nature of java binaries. Objective Spring Boot applications make use of public interfaces, annotations which makes applications harder to obfuscate. Additionally, maven Spring Boot plugin creates a fat jar which contains all dependent jars. It is not viable to obfuscate the whole fat jar. Thus obfuscating Spring Boot applications is different than obfuscating regular java applications and requires a suitable strategy. Audience Those who use Spring Boot and Maven and wish to obfuscate their application using Proguard are the target audience for this article. Sample Application As the sample application, I will use elastic search synch application from my GitHub repository.

Hadoop Installation Document - Standalone Mode

This document shows my experience on following apache document titled “Hadoop:Setting up a Single Node Cluster”[1] which is for Hadoop version 3.0.0-Alpha2 [2]. A. Prepare the guest environment Install VirtualBox. Create a virtual 64 bit Linux machine. Name it “ubuntul_hadoop_master”. Give it 500MB memory. Create a VMDK disc which is dynamically allocated up to 30GB. In network settings in first tab you should see Adapter 1 enabled and attached to “NAT”. In second table enable adapter 2 and attach to “Host Only Adaptor”. First adapter is required for internet connection. Second one is required for letting outside connect to a guest service. In storage settings, attach a Linux iso file to IDE channel. Use any distribution you like. Because of small installation size, I choose minimal Ubuntu iso [1]. In package selection menu, I only left standard packages selected.  Login to system.  Setup JDK. $ sudo apt-get install openjdk-8-jdk Install ssh and pdsh, if not already i

Java: Cost of Volatile Variables

Introduction Use of volatile variables is common among Java developers as a way of implicit synchronization. JIT compilers may reorder program execution to increase performance. Java memory model[1] constraints reordering of volatile variables. Thus volatile variable access should has a cost which is different than a non-volatile variable access. This article will not discuss technical details on use of volatile variables. Performance impact of volatile variables is explored by using a test application. Objective Exploring volatile variable costs and comparing with alternative approaches. Audience This article is written for developers who seek to have a view about cost of volatile variables. Test Configuration Test application runs read and write actions on java variables. A non volatile primitive integer, a volatile primitive integer and an AtomicInteger is tested. Non-volatile primitive integer access is controlled with ReentrantLock and ReentrantReadWriteLock  to compa