I will also start putting down some ideas on paper about architecture. I have been preparing a document that outlines a distributed architecture with the features discussed below. My hope is to get some people working together to build it from scratch, similar to the way projects like KDE, GNOME and others are making these distributed efforts very possible.

Key areas that I will expand upon are (in no particular order) -

1) Security on a user-id level. So some users can add objects, others can browse them, others can edit maps, etc. For any large site this is critical.

2) Additional object fields such as -
   a) Location: where the object resides - different from the MIB II object, as these can be lost during power outages, etc.
   b) Type: What is this equipment? Ex. Unix server, NT server, VAX.
   c) Support Group: Who supports or is mainly responsible for this device? Ex. Network group, Mail group, LAN group, etc.
   d) End Users: List of end users that use the device. (Can be many.)
   These fields can enhance apps like events, which can then look at just the devices that impact a certain user type. Apps like events can use them to show only the events affecting a given location, or the UNIX folks can see only the events they care about, etc.

3) Separate the polling engine from the display portion. Have the display portion (graphics and the events apps) run and connect through sockets to a poller. The poller checks the status of all systems and, when an exception occurs, notifies the display unit. So communication between the two systems would be minimal. This will allow us to -
   a) have a large number of users looking at the same data.
   b) get very good speed for remote and/or dial-up users.
   c) allow for very easy extension of the poller. So perhaps down the line you can have distributed pollers that all feed one main poller, to which applications connect.
   d) make the system more modular and easier to extend. (Others disagree with me here, but as long as multiple people work on this I still say it is true.)
   e) have future display systems written just in Java, or as CGI, or on a PC.

4) Have the polling system be SNMP independent. So a system can be checked for availability with ICMP or SNMP or (IN THE FUTURE) Novell's IPX, etc. The availability check can also be a script that checks some legacy app (for example, I have scripts to check non-SNMP X.25 devices).

5) Ensure the concept of services running on a system is there. For example FTP and Telnet. So services can be checked for availability, yet when the host is not reachable they are not checked, since they must be down as well.

6) Have the DB understand system dependencies. If router X is down, then don't expect to see devices X, Y, and Z up (if you do, send a misconfiguration alarm). The dependencies can be entered by hand rather than discovered automatically.

7) Each alarm should have a category and a detail subfield. For example, the category can be AVAILABILITY and the subfield could be DEVICE DOWN or DEVICE COULDN'T BE VERIFIED. Or the category can be UTILIZATION and the subfield can be TOO MANY USERS ON, or CPU TOO HIGH, or NETWORK UTILIZATION TOO HIGH. These help with presentation and reporting.

8) Have the system understand the concept of an object called LINK or LINE. This object is made up of an interface at one end and an interface at the other end. Any event affecting either end affects this object. The key is that it makes reporting easier to do and easier for people to understand. Folks don't say that interface 4 on router X is seeing errors. They say the line from NY is having errors at the NY end.

9) Have a higher level app than events, say call it alarms. This app will be more finicky about what it shows. Alarms should show elements that are active or problems that happened and cleared themselves. (This will take some explaining - but here I will try quickly.)
   So if a router is down, that shows on the alarms view because it is a current error condition. A traffic-too-high condition that exists right now also shows. Once the router comes back up, it no longer shows on the main app pane, as that condition doesn't exist anymore. Instead it gets logged to history with the start and end time of the condition. (I hate that on all management apps you get two separate entries for the start and end of things, which makes it impossible to really figure out anything.)

10) Have a process that is able to perform analysis on the events that come in. So for example a rule can be: if you see 5 authentication failures on any host (or any host from Location XXXX) within 30 minutes, create a new critical event that says "too many authentication failures on net (or Location XXX)."

11) Alarms can be assigned to/owned by/acknowledged by a technician (again, if we have security, each technician is unique).

12) Alarms can have additional end user information added to them. So as I notice things I can write add-on notes (ex. I reset the box just now to see if these alarms stop. OR I called Joe Blow to let them know of this).

13) The poller NEEDS to understand the following -
   a) Wait X polls or Y minutes before reporting something as a problem. So if a box goes down, check three times before letting me know. In the meantime categorize it as a possible problem (warning alarm perhaps - yellow color on my map).
   b) Do the same before saying something is up.
   c) Have blackout periods. For example, some devices are always down between midnight and 3 AM. Or basically, I don't even care about receiving alarms during those time frames.
   I have thought in the past of the concept of a polling entry. The polling entry says how many SNMP or ICMP retries to send, how often to poll, etc. Then for each system you can say: from 8:00 AM to 7:00 PM apply polling entry X. From 7:00 PM to 12:00 AM apply polling entry Y. From 12:00 AM to 7:00 AM apply NO polling entry.
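The waiting and blackout behaviour in item 13 could be sketched roughly as follows. This is only an illustration in Python, not existing code: the names PollingEntry, entry_for and StatusTracker are made up for this example, and the three-poll threshold is simply the "check three times" figure from above. The same count is assumed for confirming a recovery.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class PollingEntry:
    """Hypothetical polling entry: how often to poll and how many polls must agree."""
    interval_s: int
    confirm_polls: int


def entry_for(schedule, now):
    """Return the polling entry active at `now`.

    `schedule` is a list of (start, end, entry) tuples; entry=None marks a
    no-polling (blackout) window. Times not covered by any window also
    return None in this sketch.
    """
    for start, end, entry in schedule:
        if start <= end:
            active = start <= now < end
        else:  # window runs up to or past midnight, e.g. 19:00 -> 00:00
            active = now >= start or now < end
        if active:
            return entry
    return None


class StatusTracker:
    """Confirms a state change only after several consecutive agreeing polls."""

    def __init__(self, confirm_polls=3):
        self.confirm_polls = confirm_polls
        self.state = "UP"   # last confirmed state: "UP" or "DOWN"
        self.streak = 0     # consecutive polls that contradict self.state

    def record(self, reachable):
        """Feed one poll result; return "UP", "DOWN", or "WARNING" (unconfirmed change)."""
        observed = "UP" if reachable else "DOWN"
        if observed == self.state:
            self.streak = 0
            return self.state
        self.streak += 1
        if self.streak >= self.confirm_polls:
            self.state = observed
            self.streak = 0
            return self.state
        return "WARNING"
```

With confirm_polls=3, two failed polls in a row give WARNING (the yellow state) and the third gives DOWN; coming back up needs three good polls in the same way. A poller would call entry_for first and skip the poll entirely whenever it returns None.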
14) Ability to run scripts when an event is received, or if an event stays in place for X minutes. So if a system remains unavailable for X minutes, execute a script that will page me (there are plenty of tools out there to do this, no need to include one in this package).

15) Have a CGI gateway to the poller to see status information.

16) Alarms and events should have a menu choice called impact. This notes which systems or services (ex. FTP or WWW) are affected if this system is unavailable.

17) Ability to define add-on apps for the graphical map. These apps would show on the menu depending on the type of system it is (perhaps a "MIBs supported" field would be of use).

Anyway, these are some off-the-bat thoughts. As I mentioned, I have been working with network management systems for over 5 years, not so much programming (even though you can see my pinger program at http://www.digitaldaze.com/estrella/) but designing, implementing, testing, selling, etc. So I have developed in my head my dream product and the areas in which the big boys right now are just not cutting the mustard.

I will expand on these and start categorizing them for easier reference. Also, I have plenty of WWW space where at a later point I can put a lot of this information. Perhaps once an architecture and plan are laid out, people would love to start looking at helping with some of the apps.

Last, have you looked at SNMP++? Last I checked, a Linux port was either in beta or pretty much complete. It looks like it makes it extremely easy to develop the SNMP code.

Enough from a very excited,
Gus Estrella