HTML Sample Parser

Why Text Based Parser Logic

(Aug 31, 2021)

Jane's purpose is to have knowledge of all existing technologies, so knowledge of text and binary formatted information structures is one of the foundation blocks of Jane. Every application is given instructions and information in one or both of these forms. Binary structures are already handled by Jane. Text-based information handling starts with tokenizing the text into known units of information. These examples cover some of the popular computer languages and formatted text-based information structures, and each produces an array of tokens to be processed by an application. I will not explain how these are used here, only say that they produce all of our programs, display instructions, mathematics, and our databases. In essence, binary and text represent the format of the working information and actions of our computers.
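
As a rough illustration of what "tokenizing the text into known units of information" can look like, here is a minimal sketch in Python. It is not Jane's tokenizer. The function name tokenize and the four token kinds (word, number, string, symbol) are assumptions made only for this example, and real languages need many more cases.

```python
def tokenize(text):
    """Split text into a list of (kind, value) tokens.

    Hypothetical token kinds for this sketch: word, number, string, symbol.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            tokens.append(("word", text[start:i]))
        elif ch.isdigit():
            start = i
            while i < n and (text[i].isdigit() or text[i] == "."):
                i += 1
            tokens.append(("number", text[start:i]))
        elif ch == '"':
            end = text.find('"', i + 1)
            if end == -1:
                # Unterminated string: take the rest of the text.
                tokens.append(("string", text[i + 1:]))
                i = n
            else:
                tokens.append(("string", text[i + 1:end]))
                i = end + 1
        else:
            tokens.append(("symbol", ch))  # any other single character
            i += 1
    return tokens
```

Calling tokenize on a line of source returns the array of tokens that an application would then process.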

Things I Learned About Computer Languages
(so far)

I learned that all computer languages are the same: almost no built-in functionality (they rely almost entirely on "functions")
Bad technology to start with
1. Keyboard, limited characters, so a lot of kludgy syntax (everywhere) (only two characters are almost consistent: plus (+) and minus (-))
2. Dragon book, languages take advantage, going down the wrong road
  1. Using reserved keywords limits vocabulary, and therefore limits readability
  2. Keywords remove the requirement for an end-of-statement terminator. This limits syntax and readability (what a joke)
  3. None of the languages believe in "terms" (multiple words) (again limits readability and understanding)
Every language is purposely made unique (one programmer wants to make his computer language different)
Comments are the worst offenders of syntax differences
There are very few opcodes
1. Matlab specializes in matrix / vector operations; it has maybe 6 unique opcodes (maybe a total of 600 lines of code to implement)
2. C# has a few data and logic organization operations
3. Assembler is just really irritating
COBOL externalizes its functions
Ada has a unique number representation
1. 8#177# (about 100 lines of code to implement)
2. Allows underscores in numbers, 1_000_000_000 (the comma is used as a value separator)
All are impossible to read without documentation
They only perform Math operations
They all have a very small unique set of supplied functions
Case sensitive languages suck
Keywords and Reserved words are all different
Compiler writers love to abbreviate
Keywords and Reserved words are only required due to the lack of programming experience by the developers
Number representations are fairly stable
String representations are really different from language to language
Line continuation is almost non-existent in all languages
Data is represented only as "numbers" and "strings"
No real boolean values
Limited character sets
The Jane Compiler will be able to handle all existing computer languages in one source code file
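
The two Ada number forms noted in the list above (8#177# and 1_000_000_000) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Jane implementation: the function name parse_ada_integer is hypothetical, and the sketch handles integer literals only, ignoring the fractional parts and exponents that Ada also permits.

```python
def parse_ada_integer(literal):
    """Return the integer value of an Ada-style integer literal.

    Handles plain decimal literals with underscores (1_000_000) and
    based literals of the form base#digits# (8#177#).
    Assumption for this sketch: no fraction and no exponent part.
    """
    if "#" in literal:
        parts = literal.split("#")
        # "8#177#" splits into ["8", "177", ""]
        if len(parts) != 3 or parts[2] != "":
            raise ValueError("malformed based literal: " + literal)
        base = int(parts[0].replace("_", ""))
        if not 2 <= base <= 16:
            raise ValueError("base must be between 2 and 16")
        # int() with an explicit base rejects digits that do not fit it
        return int(parts[1].replace("_", ""), base)
    return int(literal.replace("_", ""))
```

For example, 8#177# is octal 177, which is 1*64 + 7*8 + 7 = 127 in decimal.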

Conclusion

The current compiler technology path is a dead end. The existing compilers are fixed. "This is software, people, we can do anything we want". I will diverge from this path by writing my own editor. Write in words, phrases, clauses, sentences and paragraphs. Change from a "character" based technology to a "word" and "term" based technology. All of this starts from the parser. Get rid of reserved words. The parser technology should be a basic capability of the compiler, permitting user-defined syntax parsing and language extensions. I will move away from the current "function" based programming paradigm and move to table-driven logic and natural language instructions. In these examples I used a table-driven approach to call specialized code. I have a program that generates the code required to call the functions. The output code is one large "case" statement. The parsers can be changed at compile time. I will define some syntax that will perform this operation, probably something like "Compiler, FORTRAN follows."
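
A table-driven dispatch with a parser switch of this kind might be sketched as follows in Python. Everything here is an assumption for illustration: the table PARSERS, the two toy parser functions, and compile_lines are hypothetical names, the directive wording is only the tentative form quoted above, and the toy parsers do nothing more than strip a trailing line comment (they do not notice comment markers inside string literals).

```python
def parse_c(line):
    # Toy "parser": drop a // line comment. Ignores markers inside strings.
    return ("C", line.split("//", 1)[0].strip())


def parse_fortran(line):
    # Toy "parser": drop a free-form ! comment. Ignores markers inside strings.
    return ("FORTRAN", line.split("!", 1)[0].strip())


# The table: language name -> specialized code to call.
PARSERS = {
    "C": parse_c,
    "FORTRAN": parse_fortran,
}

DIRECTIVE_PREFIX = "Compiler, "
DIRECTIVE_SUFFIX = " follows."


def compile_lines(lines, default="C"):
    """Parse each line with the current parser, switching on directives."""
    current = PARSERS[default]
    results = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(DIRECTIVE_PREFIX) and stripped.endswith(DIRECTIVE_SUFFIX):
            name = stripped[len(DIRECTIVE_PREFIX):-len(DIRECTIVE_SUFFIX)].strip().upper()
            if name not in PARSERS:
                raise ValueError("no parser registered for " + repr(name))
            current = PARSERS[name]  # switch parsers mid-file
            continue
        results.append(current(line))
    return results
```

Adding a language in this sketch means adding one function and one table entry; the dispatch loop itself does not change.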

There are infinitely many approaches to parsing text. But I found that for these twenty-some popular languages it is easier to hard-code the logic than to use a scripting language approach (e.g. Backus-Naur Form or a RegExp syntax analyzer). There were only a few thousand lines of code, which made it easy to debug and fix. There are so many special cases that a fallback to special-case handling would be required anyway. It really does not matter which approach is used; all of this is independent. The system may wind up using any number of approaches in the future.
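
To give a feel for the kind of special case that hard-coded logic handles directly, here is a small Python sketch of scanning a string literal under two escape conventions: a backslash escape, as in C, and a doubled quote, as in Pascal or SQL. The function name scan_string and the two style names are assumptions for this example, and the backslash branch keeps the escaped character literally rather than translating sequences such as \n.

```python
def scan_string(text, start, escape_style):
    """Scan a quoted string beginning at text[start].

    escape_style is "backslash" (C-like) or "doubled" (Pascal/SQL-like).
    Returns (value, index_after_closing_quote).
    """
    quote = text[start]
    i = start + 1
    chars = []
    while i < len(text):
        ch = text[i]
        if escape_style == "backslash" and ch == "\\" and i + 1 < len(text):
            chars.append(text[i + 1])  # keep the escaped character as-is
            i += 2
        elif ch == quote:
            if escape_style == "doubled" and text[i + 1:i + 2] == quote:
                chars.append(quote)  # '' inside the literal means one quote
                i += 2
            else:
                return "".join(chars), i + 1  # closing quote found
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string literal")
```

Each convention is one short branch here, which is easy to step through in a debugger; whether that stays simpler than a grammar-driven approach depends on how many languages and cases are covered.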